Three years ago I mentioned anonymous proxies as a way to “anonymize” your Internet access. At the time I neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT department. For example, I know of at least one company that blocks access to slashdot.org.
You can often get around such blocks (not that I’m advocating such behavior) by using services such as SureProxy.com. When you go to SureProxy and enter the URL for slashdot, SureProxy fetches the page from slashdot and sends it to you. The URL you see will look something like this: http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/. If SureProxy isn’t blocked by your IT department, you end up seeing the slashdot page, along with whatever advertisements SureProxy adds to it.
I’m sure this kind of thing gives corporate IT departments headaches. But their headaches are nothing compared to the problems proxies pose for Web crawlers.
The primary problem is that the proxy changes the URLs in the returned HTML page. Every link on the page is modified so that it, too, goes through the proxy. If the crawler starts crawling those URLs, it just accumulates more and more proxy links, all of which go through the proxy. And since the proxy URL doesn’t look anything like the real URL (at least, not to the crawler), the crawler ends up viewing the same page many times: once through the real link, and again through every proxy that the link appears in.
Fortunately, it’s pretty easy to write code that will identify and eliminate the vast majority of proxy URLs. Most of the proxies I’ve encountered use CGIProxy, a free proxy script. The script itself is usually called nph-proxy.cgi or nph-proxy.pl, although I’ve also seen nph-go and nph-proy, among others. It’s easy enough to write a regular expression that looks for those file names, extracts the real URL, and discards the proxy URL. That takes care of the simple cases. The rest I’ll have to find and block manually.
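To give an idea of what I mean, here’s a rough sketch of that kind of filter in Python. It’s only a sketch: the script names and the URL layout (proxy host, script name, a flags segment, then the real scheme, host, and path) come from the examples above, and a real crawler’s list would be longer.

    import re

    # Script names seen from CGIProxy installations (illustrative, not
    # exhaustive): nph-proxy.cgi, nph-proxy.pl, nph-go, nph-proy, and
    # SureProxy's nph-index.cgi.
    # URL shape, per the SureProxy example:
    #   http://proxy.host/nph-something/FLAGS/http/real.host/path
    PROXY_RE = re.compile(
        r"^https?://[^/]+"                                   # proxy host
        r"/(?:nph-proxy\.cgi|nph-proxy\.pl|nph-go|nph-proy|nph-index\.cgi)"
        r"/[^/]+"                                            # flags, e.g. 011110A
        r"/(https?)"                                         # real scheme
        r"/(.+)$"                                            # real host and path
    )

    def unwrap_proxy_url(url):
        """Return the real URL hidden in a CGIProxy-style URL, or None."""
        m = PROXY_RE.match(url)
        if m is None:
            return None
        scheme, rest = m.groups()
        return "%s://%s" % (scheme, rest)

    # The SureProxy example from above:
    print(unwrap_proxy_url(
        "http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/"))
    # prints: http://slashdot.org/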
I’ve also seen proxies (Invisible Surfing is one) that use a completely different type of proxy script. They supply the target URL as an encoded query string parameter that looks something like this: http://www.invisiblesurfing.com/surf.php?q=aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0=. I’m sure that with some effort I could decode the URLs hidden in the query string; the trick is determining that the URL is a proxy URL in the first place, and that turns out to be a rather difficult problem. Until I come up with a reliable way for the crawler to identify these types of proxy URLs, I spot-check the URLs myself and block the offending domains by hand. It’s like playing Whac-A-Mole, though, because new proxies appear all the time.
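As it happens, the query parameter in the example above is just a Base64-encoded URL, so the decoding itself is the easy half. A heuristic along these lines (a sketch, not what my crawler actually does) could flag query-string values that decode to something URL-shaped:

    import base64
    from urllib.parse import urlsplit, parse_qsl

    def find_encoded_target(url):
        """Heuristic: return a query-string value that Base64-decodes to
        something URL-shaped, or None if there isn't one."""
        for _, value in parse_qsl(urlsplit(url).query):
            padded = value + "=" * (-len(value) % 4)   # repair missing padding
            try:
                decoded = base64.b64decode(padded).decode("ascii")
            except Exception:
                continue                               # not Base64; keep looking
            if decoded.startswith(("http://", "https://")):
                return decoded
        return None

    # The Invisible Surfing example from above:
    print(find_encoded_target(
        "http://www.invisiblesurfing.com/surf.php"
        "?q=aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0="))
    # prints: http://www.mischel.com/index.htm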
The other problem with crawling through proxies is that it makes the crawler ignore the robots.txt file on the target Web site. Since the crawler thinks it’s accessing the proxy site, it checks the proxy’s robots.txt. As a result, the crawler undoubtedly ends up accessing (and the indexer indexing) files that it never should have crawled.
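To make the mechanics concrete, here’s a small illustration using Python’s standard robotparser module. It’s only a sketch (the user-agent name is a placeholder), but any crawler that derives the robots.txt location from the host in the URL behaves the same way:

    from urllib import robotparser
    from urllib.parse import urlsplit, urlunsplit

    def robots_url_for(page_url):
        """The robots.txt location a crawler derives from a page URL."""
        parts = urlsplit(page_url)
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    proxied = "http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/"

    # The only host the crawler sees is the proxy, so that's whose rules it reads:
    print(robots_url_for(proxied))        # http://sureproxy.com/robots.txt

    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url_for(proxied))
    rp.read()                             # fetches the *proxy's* robots.txt
    print(rp.can_fetch("ExampleCrawler", proxied))
    # slashdot.org/robots.txt is never consulted at all.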
Perhaps most surprising is that proxy sites don’t have robots.txt files that disallow all crawlers. I can see no benefit for a proxy site in allowing crawling. The crawlers aren’t viewing the Web pages, so the proxy site doesn’t get the benefit of people clicking on its ads. All the crawler does is waste the proxy site’s bandwidth. If somebody out there understands the business of proxy sites and can explain why they don’t take the simple step of writing a restrictive robots.txt (it’s all of two lines, shown below), please explain it to me in the comments or by email. I’m very curious.
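For reference, this is all it would take to tell every well-behaved crawler to stay away:

    User-agent: *
    Disallow: /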