Jim’s Random Notes

Musings on technology and life

August 8th, 2008

Hey, you deleted my files!

We got a rather strongly worded message the other day from a Webmaster who was threatening legal action because our crawler deleted a bunch of files from his site.  The news that our crawler is capable of deleting files was quite a surprise to us.  Like other crawlers, ours just downloads HTML files, extracts links, and then visits those links.  There is no “delete a file” logic in there.  But if the crawler stumbles upon a link whose action is to delete a file, then visiting that link will indeed delete the file.

Further investigation in this particular case revealed a file management page that includes, among other things, links that have the form:  www.example.com/files/?delete=filename.txt.  Surprisingly enough, clicking on that link deletes the file.  The file management page is not protected by a password, nor is there any kind of confirmation displayed before the file is permanently deleted.

Examining the logs, we saw accesses from other search engine crawlers.  We also learned from the Webmaster that some time back, a kid had “hacked in” to the site and deleted a bunch of files.

I’m a little surprised that anybody would create such a page and not provide any protection.  I’m very surprised to find out that a supposedly professional Web developer would do such a thing and not learn the lesson when a random surfer came in and deleted files.  And I’m shocked that, even after we explained this to the Webmaster, he insists that we can take this as an opportunity to learn from our “mistake” and “fix” the crawler so that it doesn’t happen again.

It’s unfortunate that our crawler visited those links, causing the files to be deleted.  But the mistake was on the part of the person who posted those destructive links.  The crawler was operating exactly as it should.  Exactly, in fact, as every major search engine crawler acts.  It’d be nice if we could imbue the crawler with enough intelligence to “understand” Web pages and know in advance what the effects of clicking a link will be.  But that kind of machine intelligence is far, far in the future.

If you post something on the Web, it will be found, unless you take active measures to protect it.  Posting a destructive link on an unprotected page and then blaming somebody else when the link is clicked by an “unauthorized” person is akin to running out into a busy street and then blaming your injuries on the driver of the bus that hits you.

July 19th, 2008

More URL Filtering

Last week I mentioned proxies and other URL filtering issues that we’ve encountered when crawling the Web.  A problem that continually plagues us is repeated path components–URLs like these:

http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3
http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3

I don’t know why some sites do that, but a crawler can easily get caught in a trap and will generate such URLs indefinitely.  Or until our self-imposed URL length limit kicks in.  Most of the time when that happens, we discover that all the URLs resolve to the same file, and removing the repeated path component (i.e. creating http://www.example.com/mp3/song.mp3) is the right thing to do.

A single repeated component is by far the most common, but we frequently see two or three repeated components:

http://www.example.com/mp3/download/mp3/download/mp3/download/song.mp3
http://www.example.com/mp3/Rush/download/mp3/Rush/download/song.mp3

It’s easy enough to write regular expressions that identify the repeated path components, and replacing the repeats with a single copy is trivial.  But it’s not a good general solution.  For example this blog (and many others) uses URLs of the form blog.mischel.com/yyyy/mm/dd/post-name/, so the entry for July 7 is blog.mischel.com/2008/07/07/post-name/.  Globally applying the repeated component removal rules would break a very large number of URLs.

This is one of the many URL filtering problems for which there is no good global solution.  Sometimes, repeated path components are legitimate.  We can use some heuristics based on the crawl history (i.e. if /mp3/song.mp3 generates /mp3/mp3/song.mp3) to identify problem sites, but in the end we end up having to write domain-specific filtering rules.  Manually identifying and coding around the dozen or so worst offenders makes a big dent in the problem.

Another per-domain problem is that of session IDs encoded within the path, or with uncommon parameter names.  For example, we can easily identify and remove common ids like PHPSESSID= and sessionid=, but these URLs will escape the filter unscathed:

http://www.example.com/file.html?exSession=123456xyzzy
http://www.example.com/file.html?exSession=845038plugh
http://www.example.com/coolstuff/123456xyzzy/index.html

http://www.example.com/coolstuff/845038plugh/index.html

It’s easy for humans to look at the first two URLs and determine that they likely go to the same place.  Same for the second pair.  The computer isn’t quite that smart, though, and making it that smart is very difficult.

Developing a system that automatically identifies problem URLs and generates filtering rules is a “big-R” research project–something that we don’t have time to work on at the moment.  Even if we were to develop such a thing, it’d be pretty fragile and would require constant monitoring and tweaking.  If a site’s URL format changes (something that happens with distressing frequency), the filtering rules become invalid.  Usually the effect will be letting through some stuff that should have been filtered, but in rare cases a change in the input data can lead to the filter rejecting a large number of URLs that it should have passed.

When I started this project, I knew that crawling the Web was non-trivial.  But it turns out that the URL filtering problem is much more complex than I expected the entire Web crawler to be.

July 10th, 2008

Proxy fits

Three years ago I mentioned anonymous proxies as a way to “anonymize” your Internet access. At the time neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT department. For example, I know of at least one company that blocks access to slashdot.org.

You can often go around such blocks (not that I’m advocating such behavior) by using services such as SureProxy.com. When you go to SureProxy and enter the URL for slashdot, SureProxy fetches the page from slashdot and sends it to you. The URL you see will look something like this: http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/. If SureProxy isn’t blocked by your IT department, then you end up seeing the slashdot page. (Along with whatever advertisements SureProxy adds to the page.)

I’m sure this kind of thing gives corporate IT departments headaches. Their headaches are nothing compared to the problems proxies pose for Web crawlers.

The primary problem is that the proxy changes the URLs in the returned HTML page. Every link on the page is modified so that it, too, goes through the proxy. If the crawler starts crawling those URLs, it will just build more and more, all of which go through the proxy. And since the proxy URL doesn’t look anything like the real URL (at least, not to the crawler), the crawler will end up viewing the same page many times: once through the real link, and once through every proxy that the link appears in.

Fortunately, it’s pretty easy to write code that will identify and eliminate the vast majority of proxy URLs. Most of the proxies I’ve encountered use CGIProxy–a free proxy script. The script itself is usually called nph-proxy.cgi or nph-proxy.pl, although I’ve also seen nph-go and nph-proy, among others. It’s easy enough to write a regular expression that looks for those file names, extracts the real URL, and discards the proxy URL. That takes care of the simple cases. The rest I’ll have to find and block manually.

I’ve also seen proxies (Invisible Surfing is one) that use a completely different type of proxy script. They supply the target URL as an encoded query string parameter that looks something like this: http://www.invisiblesurfing.com/surf.php?q=aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0=. I’m sure that with some effort I could decode the URLs hidden in the query string, once I determined that the URL was a proxy URL. That turns out to be a rather difficult problem. Until I come up with a reliable way for the crawler to identify these types of proxy URLs, I do some manual spot-checking of the URLs myself and manually block the domains. It’s like playing Whac-A-Mole, though, because new proxies appear all the time.

The other problem with crawling through proxies is that it makes the crawler ignore the robots.txt file on the target Web site. Since the crawler thinks it’s accessing the proxy site, it checks the proxy’s robots.txt. As a result, the crawler undoubtedly ends up accessing (and the indexer indexing) files that it never should have crawled.

Perhaps most surprising is that proxy sites don’t have robots.txt files that disallow all crawlers. I can see no benefit for the proxy site to allow crawling. The crawlers aren’t viewing the Web pages, so the proxy site doesn’t get the benefit of people clicking on their ads. All the crawler does is waste the proxy site’s bandwidth. If somebody out there understands the business of proxy sites and can explain why they don’t take the simple step of writing a simple robots.txt, please explain that to me in the comments, or by email. I’m very curious.

July 9th, 2008

Crawler versus the URLs

When you start crawling the Web on even a small scale, you quickly learn that things aren’t nearly as neat and tidy as the RFCs would have you believe.  After just a few weeks of writing code to handle all the special cases and ambiguities that crop up, you’ll start to wonder how the Web manages to work at all.  Nowhere is this more evident than when working with URLs.

It’s a pleasant fantasy to believe that a document on the Web can be reached through one and only one URL.  That is, our training as programmers pushes us into the belief that the URL http://www.example.com/docs/resume.html is the way to reference that particular document.  It might be the preferred way, but it’s certainly not the only way.  On most servers, for example, you can drop the “www”, so that http://example.com/docs/resume.html will get you to the same place. We call this “the www problem.”

That’s just the simplest example.  Did you know that multiple slashes are irrelevant?  That is, http://www.example.com/////docs////resume.html will go to the same place as the two URLs above. You can also do some path navigation within the URL so that http://www.example.com/docs/../docs/resume.html goes to the same place as all the other examples I’ve shown.

You can also “escape” any character within a URL. For example, you can replace a slash (/) with the character string %2F, turning the original URL above into this: http://www.example.com%2Fdocs%2Fresume.html. Most often, escaping is used to remove embedded spaces and special characters that have particular meanings in URLs. Sometimes escaping is done automatically when a user copies a link from a browser and pastes it into an HTML authoring program.

Above are just some of the simplest examples. I haven’t even started on query strings–parameters that you can pass after the path part of a URL. But even without query strings, the number of different ways you can address a particular document on the Web is essentially infinite. And yet a crawler is expected to, as much as possible, determine the “canonical” form of a URL and crawl only that. Crawling the same document multiple times wastes bandwidth (for both the crawler and the crawlee), and results in duplicate data that can only cause more problems for the processes that come along after the crawler has stored the page.

If you haven’t written a crawler, you might think I’m just contriving examples. I’m not. The www problem in particular is a very real issue that if not addressed can cause a crawler to read a very large number of pages twice: once with the www and once without the www. The other issues are not nearly as prevalent, but they are significant–so significant that every crawler author spends a huge amount of time trying to develop heuristics for URL canonicalization. Simply following the specification in RFC 3986 will get you most of the way there, but there are ambiguities that simply cannot be resolved.  So we do the best we can.

You might also wonder where these weird URLs come from.  The answer is, “everywhere.”  Scripts are high on the lists of culprits.  They can mangle URLs beyond belief.  For example, one script I encountered had the annoying feature of re-escaping a parameter in the query string.  The percent sign (%) is one of those characters that gets escaped because it has special meaning in URLs.

So imagine a script  reached from the URL http://www.example.com/script.php?page=1&username=Jim%20Mischel. The script appends the username variable to the query string for all links when it generates the page, but it escapes the string. So links harvested from the page have this form: http://www.example.com/script.php?page=2&username=Jim%2520Mischel. “%25″ is the escape code for the percent sign. Now imagine following a chain of 10 links all generated by that script. You end up with http://www.example.com/script.php?page=10&username=Jim%2525252525252525252520Mischel.

What’s a poor crawler to do?

We do the best we can, and we have measures in place to identify such situations so that we can improve our canonicalization code. But it’s a never-ending battle. Whenever we think we’ve seen it all, we run into another surprise.

June 19th, 2008

Major search engines support robots.txt standard

Google, Yahoo, and Microsoft’s Live Search recently announced standard support for the major robots.txt directives.  This means that you can use the same syntax for robots.txt to control the activities of those three major search engine crawlers.  The common directives are: Disallow, Allow, and Sitemaps.  In addition, all three support the use of wildcards (* and $) in specifying paths for Allow and Disallow.  It’s interesting to note that Yahoo says they support “$ Wildcards,” whereas Google and Microsoft say that they support “* Wildcards” as well as “$ Wildcards.”  From reading Yahoo’s documentation, though, I’d say that they also support “* Wildcards.”

All three also support several HTML META tags, such as NOINDEX and NOFOLLOW, that give content authors much tighter control over crawlers than can be accomplished with robots.txt. 

This isn’t exactly a new step.  The three major search engines have been collaborating for the last few years, trying to make Webmasters’ jobs easier with respect to the major search engines.  For example, back in February they announced common support for cross-submission of Sitemaps.

Unfortunately, all three also support individual extensions to the Robots Exclusion Protocol.  For example, Yahoo and Microsoft support the Crawl-Delay directive, which Google does not support. Both Google and Yahoo support some unique META tags that the others don’t support.

Even with the incompatibilities, this is a big step in the right direction. With unified support of the major robots.txt directives among the three major search engine crawlers, we can expect to see more support by smaller crawlers. I know that many authors of smaller-scale crawlers look to the majors to see what they should support. Having all three support the same directives in the same way, makes other developers’ jobs (including mine!) easier.

But ultimately it’s the Webmasters who benefit the most by giving them a standard way to control crawlers’ access to their sites.

June 16th, 2008

One more time: the Internet is public

[Note:  As Michael Covington pointed out, there's plenty of privacy on the Internet--just not on the World Wide Web.]

I know I’ve mentioned this before, but I keep running across people who don’t understand that there is no privacy on the Internet.  If you’ve uploaded something to your Web site, it’s highly likely that Google, MSN, Yahoo, or any (or all) of the many other search engines out there has found it.  Even our Web crawler–a small-scale operation–finds things in hidden nooks and crannies of the Web that most people with browsers would never stumble upon.

For example, the other day a coworker was spot-checking some of the crawler’s latest finds and stumbled upon a site where the owner had uploaded what looks like (from examining the file names) a bunch of very private stuff.  This all in an unprotected directory.  A person with a browser could go to that URL, get a listing of all files, and then browse to his heart’s content.  Although it’s unlikely that a person browsing would stumble upon the directory, a crawler almost certainly will.  Eventually.

When we run across something like that, we don’t actually browse, but rather find out how to contact the site owner and send him a very nice email suggesting that he either protect the directory or not upload that information.

The day after discovering the site I mentioned above, we ran across the story of Alex Kozinski, a judge in the 9th Circuit whose personal porn stash was found publicly accessible online:

Kozinski, 57, said that he thought the site was for his private storage and that he was not aware the images could be seen by the public, although he also said he had shared some material on the site with friends. After the interview Tuesday evening, he blocked public access to the site.

Of particular interest in this case is that the judge was presiding over an obscenity trial (now postponed) that involves material that’s apparently similar to some of the material on the judge’s site.  The judge also had some copyrighted music on the site, opening up the possibility of copyright violation.

No matter how far out in the country you live, if you stand naked in front of an uncovered window, somebody will eventually see you.  Similarly, if you upload something to your Web site and don’t take active measures to prevent access, it will be found.  Do not assume that it can’t be found because you never told anybody about it.  That’s like putting a key under the doormat and figuring it’s safe because only you know it’s there.

June 5th, 2008

Webbots, Spiders, and Screen Scrapers

Considering what I’m doing for work, you can imagine that when I ran across Michael Schrenk’s Webbots Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection of data from the Web.

Most of the book focuses on screen scrapers that download data from previously identified Web sites, parse the pages, and then store and present the data. There’s a little information on “spidering”–automatically following links from one page to another–but that’s not the primary purpose of the book. Which is probably a good thing. A Web-scale spider (or crawler) is fundamentally different than a screen scraper or a special-purpose spider that’s written to gather information from a small set of domains or very narrowly-defined pages.

The first six chapters explain why Web bots are useful, and walk you through the basics: downloading Web pages, parsing the contents, automating log in and form submission, and many other tasks that are involved in automated data collection. With plenty of PHP code examples, these chapters provide a good foundation for the next 12 chapters: Projects. In this section, we see examples of real Web bots that monitor prices, capture images, verify links, aggregate data, read email, and more. Again, with many code examples.

The first two sections cover about three-fifths of the book. If you read and follow along by trying the code examples, you’ll have a very good understanding of how to build many different types of Web bots.

The remainder of the book is divided into two sections. Part 3, Advanced Technical Considerations, briefly explains spiders, and then discusses some of the technical issues such as authentication and cookie management, cryptography, and scheduling your bots. This section has some code examples, but they aren’t the primary focus.

The fourth section, Larger Considerations, focuses on things like how to keep your bots out of trouble, legal issues, designing Web sites that are friendly to bots, and how to prevent bots from scraping your site. Again, these chapters have a few code samples, but the emphasis is on the larger issues–things to think about when you’re writing and running your bots.

Overall, I like the book. The writing is conversational, and the author obviously has a lot of experience building useful bots. The many code samples do a good job illustrating the concepts, and the projects cover the major types of bots most people would be interested in writing. Reading about the projects and some of the other ideas he presents opens up all kinds of possibilities.

The book succeeds very well in its stated mission: explaining how to build simple web bots and operate them in accordance with community standards. It’s not everything you need to know, but it’s the best introduction I’ve seen. The focus is on simple, single-threaded, bots. There’s some small mention of using multiple bots that store data in a central repository, but there’s no discussion of the issues involved in writing multi-threaded or distributed bots that can process hundreds of pages per second.

I recommend that you read this book if you’re at all interested in writing Web bots, even if you’re not familiar with or intending to use PHP. But be sure not to expect more than the book offers.

May 23rd, 2008

Reducing bandwidth used by crawlers

Some site operators block web crawlers because they’re concerned that the crawlers will use too much of the site’s allocated bandwidth. What they don’t realize is that most companies that operate large-scale crawlers are much more concerned with bandwidth usage than the people running the sites that the crawlers visit. There are several reasons for this concern:

  • The visible Web is so large that no crawler can examine the entire thing in any reasonable amount of time. At best estimates, even Google covers only about 25% of the visible Web.
  • The Web grows faster than the ability to crawl it grows.
  • It takes time (on average between one and two seconds) to find, download, and store a Web page. Granted, a large crawler can download thousands of pages per second, but it still takes time.
  • It requires more time, storage, and CPU power to store, parse, and index a downloaded page.

I suspect that the large search engines can give you a per-page dollar cost for locating, downloading, storing, and processing. That per-page cost would be very small, but when you multiply it by 25 billion (or more!) pages it’s a staggering amount of money–a cost that’s incurred every time they crawl the Web. As you can imagine, they have ample incentive to reduce unnecessary crawling as much as possible. In addition, time and bandwidth spent downloading unnecessary pages means that some previously undiscovered pages are not visited.

The HTTP specification includes someting called a conditional GET. It’s a way for a client to request that the server send the page only if it meets some criteria. The specification identifies several different criteria, one of which is called If-Modified-Since. If the client has seen the page before and has saved the page and the date it received the page, then the client can send a request to the server that says, in effect, “If the page has changed since this date, then send me the page. Otherwise just tell me that the page hasn’t changed.” The this date would be replaced with the actual date that the client last saw the page.

If the server supports If-Modified-Since (which almost all do), there is a big difference in how much bandwidth is used. If the Web page has not been modified, the server responds with a standard header and a 304 NotModified status code: total payload maybe a few hundred bytes. That’s a far cry from the average 30 kilobytes for an HTML page, or the hundreds of kilobytes for a page that has complicated scripts and lots of content.

The only catch is that server software (Apache, IIS, etc.) only support If-Modified-Since for static content: pages that you create and store as HTML on your site. If your site is dynamically generated with PHP, ASP, Java, etc., then the script itself has to determine if the content has changed since the requested date, and act accordingly by sending the proper response. If your site is dynamically generated, it’s a good idea to ask your developers if it supports If-Modified-Since.

Crawlers aren’t the only clients that use If-Modified-Since to save bandwidth. All the major browsers cache content, and can be configured to do conditional GETs.

The direct savings of using If-Modified-Since can be small when compared to the indirect savings. Imagine that your site’s front page contains links to all the other pages on your site. If a crawler downloads the main page, it’s going to extract the links to all the other pages and attempt to visit them, too. If you don’t support If-Modified-Since, the crawler will end up downloading every page on your site. If, on the other hand, you support If-Modified-Since and your front page doesn’t change, the crawler won’t download the page and thus won’t see links to the other pages on the site.

Don’t take the above to mean that your site won’t be indexed if you don’t change the main page. Large-scale crawlers keep track of the things they index, and will periodically check to see that those things still exist. The larger systems even keep track of how often individual sites or pages change, and will check for changes on a fairly regular schedule. If their crawling history shows that a particular page changes every few days, then you can expect that page to be visited every few days. If history shows that the page changes very rarely, it’s likely that the page won’t be visited very often.

Smaller-scale crawlers that don’t have the resources to keep track of the change frequency for billions of Web sites will typically institute a blanket policy that controls the frequency that they revisit pages–once per day, once per week, etc.

Supporting If-Modified-Since is a very easy and inexpensive way to reduce the load that search engine crawlers put on your servers. If you’re publishing static content, then most likely you’re already benefitting from this. If your Web site is dynamically generated, be sure that your scripts recognize the If-Modified-Since header and respond accordingly.

May 15th, 2008

A variation on the homegrown DOS attack

Tuesday, in How to DOS yourself, I described how to erroneously configure an Apache server and cause what appears to be a denial of service attack. There’s another way to do it that is even more insidious.

In Tuesday’s post I showed how to configure error documents. There’s apparently another way to configure things so that, rather than returning an error status code (403 Forbidden, 404 Not Found, etc.), the server returns a 302 Redirect status code. The redirect tells the client (i.e. the browser or crawler) that the page requested can be found at a new location. That new location is returned along with the 302 Redirect status code.

When a browser sees the 302 status code, it issues a request for the new page.

Now, imagine what happens if you block an IP address from accessing your site (see Tuesday’s article) and you configure the server to return a redirect status code when somebody tries to access from that blocked IP address:

  1. Client tries to access http://yoursite.com/index.html
  2. Server notices the blocked IP address and says, “return 403 Forbidden.”
  3. Custom error handling returns a 302 Redirect pointing to http://yoursite.com/forbidden.html.
  4. Browser receives redirect status code and issues a request for http://yoursite.com/forbidden.html
  5. Go to step 2.

The browser and server now enter a cooperative infinite loop, with the browser saying “Show me the forbidden.html page,” and the server saying, “View forbidden.html instead.”

This is more insidious because from the server’s point of view it looks like the client is perpetrating a denial of service attack by continually attempting to access the same document. But the client is simply following the server’s directions.

Web crawlers won’t fall into this trap because they keep track of the pages they’ve visited or tried to visit. A Web crawler will see the first redirect and attempt to access the forbidden.html page, but on the next redirect the crawler will see that it’s already attempted that page, and give up.

Not all browsers are that smart. Firefox tries a few times and then stops, showing an error message that says:

Firefox has detected that the server is redirecting the request for this address in a way that will never complete.

Internet Explorer, on the other hand, appears to continue trying indefinitely.

I don’t know enough about Apache server configuration to give an example of redirecting on error. I do know it’s possible, though, because I discovered such a redirect loop recently while investigating a problem report. Unfortunately, the Webmaster in question was not willing to share with me the pertinent sections of his .htaccess file.

May 13th, 2008

How to DOS yourself

It’s surprising the things you’ll learn when you write a Web crawler. Today’s lesson: how to be both perpetrator and victim of your own denial of service attack.

Not everybody likes crawlers accessing their sites. Most will modify their robots.txt files first, which will prevent polite bots from crawling. But blocking impolite bots requires that you configure your server to deny access based on IP address or user-agent string. Some Web site operators, either because they don’t know any better or because they want to prevent bots from even accessing robots.txt, prefer to use the server configuration file for all bot-blocking. Doing so is easy enough, but you have to be careful or you can create a home-grown denial of service attack.

The discussion below covers Web sites running the Apache server. I don’t know how to effect IP blocks or custom error pages using IIS or any other Web server.

There are two ways (at least) to prevent access from a particular IP address to your Web site. The two ways I know of involve editing the .htaccess file, which usually is stored in the root directory of your Web site. [Note: The filename really does start with a period. For some reason, WordPress doesn't like me putting that filename in a post without putting some HTML noise around it. So for the rest of this entry, I'll refer to the file as htaccess, without the leading period.] As this isn’t a tutorial on htaccess I suggest that you do a Web search for “htaccess tutorial”, or consult your hosting provider’s help section for full information on how to use this file.

The simple method of blocking a particular IP address, available on all versions of Apache that I know of, is to use the <Files> directive. This htaccess fragment will block an IP address:

<Files *>
order deny,allow
deny from abc.def.ghi.jkl
</Files>

Of course, you would replace abc.def.ghi.jkl in that example with the actual IP address you want to block. If you want to block multiple addresses, you can specify them in separate deny directives, one per line. Some sites say that you can put multiple IP addresses on a single line. I don’t know if that works. There also is a way to block ranges of IP addresses.

If you do this, then any attempted access from the specified IP address will result in a “403 Forbidden” error code being returned to the client. The Web page returned with the error code is the default error page, which is very plain (some would say ugly), and not very helpful. Many sites, in order to make the error pages more helpful or to make them have the same look and feel as the rest of the site, configure the server to return a custom error page. Again, there are htaccess directives that control the use of custom error pages.

If you want a custom page to display when a 403 Forbidden is returned, you create the error page and add a line telling Apache where the page is and when it should be returned. If your error page is stored on your site at /forbidden.html, then adding this directive to htaccess tells Apache to return that page along with the 403 error:

ErrorDocument 403 /forbidden.html

Now, if somebody visits your site from the denied IP address, the server will return the custom error page along with a 403 Forbidden status code. It really does work. As far as I’ve been able to determine, nothing can go wrong with this configuration.

I said before that there are at least two ways prevent access from a particular IP address. The other way that I know of involves using an Apache add-on called mod_rewrite, a very useful but also very complicated and difficult to master module with which you can do all manner of wondrous things. I don’t claim to be an expert in mod_rewrite. But it appears that you can block an IP address by adding this command:

RewriteCond %{REMOTE_ADDR} ^abc\.def\.ghi\.jkl$
RewriteRule .* [F]

Again, you would replace the abc, def, etc. with the actuall IP address numbers. As I understand it, this rule (assuming that mod_rewrite is installed and working) will prevent all accesses to your site from the given IP address. But there’s a potential problem.

If you have a custom 403 error document, the above can put your server into an infinite loop. According to this forum post at Webmaster World:

A blocked request is redirected to /forbidden.html, and the server tries to serve that instead, but since the user-agent or ip address is still blocked, it again redirects to the custom error page… it gets stuck in this loop.

There you have it: you are the perpetrator and victim of your own denial of service attack.

The forum post linked above shows how to avoid that problem.

I’ve seen some posts indicating that the infinite loop also is possible if you use the simple way of doing the blocking and error redirects. I haven’t been able to verify that. If you’re interested, check out this post, which also offers a solution if the problem occurs.

How I came to learn about this is another story. Perhaps I can relate it one day.

May 6th, 2008

Opt in or opt out?

I mentioned before that there is a small but very vocal group of webmasters who say that crawlers should stay off their sites unless specifically invited. It is their opinion that they shouldn’t have to include a robots.txt file in order to prevent bots from crawling their sites. Their reasons for holding this opinion vary, but generally fall into one of two categories: they have private content they don’t want indexed, or they don’t want their bandwidth “wasted” by bots. I understand their point of view, but in my opinion their model won’t work.

The nature of the Web is that everything is available to whoever wants to access it. If you don’t want your content publicly accesible, you have to take active measures to block access, either by blocking particular users or by adding some kind of authentication and authorization to allow only those users to whom you’ve granted access. The Web has always operated on this “opt-out” principle. One could argue that this open access to information is the whole reason for the Web. It’s certainly one of the primary reasons (if not the primary reason) that the Web continues to grow.

Search engines are essential to the operation of the Web. Most people who post information on the Web depend on the various search engines to index their sites so that when somebody is looking for a product or information, those people are directed to the right place. And users who want something depend on search engines to present them with relevant search results. Search engines can’t do that unless they can crawl Web sites.

The argument is that those people who want to be crawled should submit their sites to search engines, and that search engine crawlers should crawl only those sites that are submitted. It’s unlikely that such a thing could work, and even if it could the result would be a much less interesting Web because the difficulties involved in getting indexed would be too much for most site owners.

The first hurdle to overcome would be to determine who gets to submit a site’s URL to a search engine. It would be nearly impossible to police this and ensure that the person who submitted the site actually had authority to do so. Search engines could require written requests from the site’s published owner or technical contact (as reported by a whois search, for example), but the amount of paperwork involved would be astronomical. You could also build some kind of Web submission process that requires the submitter to supply some identifying information, but even that would be unreasonably difficult to build and manage.

There are approximately 100 million registered domain names at any given time. Names come and go and site owners change. It’s unreasonable to ask a search engine to keep track of that information. Imagine if the owner of the domain example.com submitted his site for crawling, but after a year let his domain name expire. A short time later somebody else registers example.com, but doesn’t notify the search engine of the ownership change. The new owner has no idea that the name was previously registered with the search engine and gets upset when his site is crawled. Is the search engine at fault?

There are many, many search engines, with more coming online all the time. To expect a webmaster to submit his site to every search engine is unreasonable. Granted, there are sites that will submit to multiple search engines, but going this route makes it even more difficult to keep track of things. Every submission site has to keep track of who got submitted where, and there has to be some kind of infrastructure so that webmasters can query the submission site’s database to determine if a particular bot is authorized.

Even if we somehow got the major search engines and the majority of site owners to agree that the opt-in policy is a good thing, you still run into the problem of ill-behaved bots: those that crawl regardless of permission. Again, the fundamental structure of the Web is openness. Absent legislation that makes uninvited crawling a crime (and the first hurdle there would be to define “crawling”–a problem that would be even more difficult than the policing problems I mentioned above), those ill-behaved bots will continue to crawl.

When you add it all up, it seems like a huge imposition on the majority of site owners who want visibility, just to satisfy a small number of owners who don’t want their sites crawled. It also places an unreasonable burden on those people who operate polite crawlers, while doing nothing to prevent the impolite crawlers from making a mess of things. This is especially true in light of the Robots Exclusion Standard, which is a very simple and effective way to ask those polite crawlers not to crawl. To prevent bots from crawling your site, just create a robots.txt file that looks like this:

User-agent: *
Disallow: /

That won’t prevent all crawling, as most crawlers will still hit robots.txt periodically, but almost all crawlers respect that simple robots.txt file and will crawl no further. Those that don’t respect it likely wouldn’t respect an opt-in policy, either. Creating and posting that file takes approximately two minutes (maybe five if you’re not familiar with the process), and it’s a whole lot more effective than trying to change a fundamental operating principle of the Web.

May 6th, 2008

Why every site should have a robots.txt file

People often ask if they need a robots.txt file on their sites. I’ve seen some Web site tutorials that say, in effect, “don’t post a robots.txt file unless you really need it.” I think that is bad advice. In my opinion, every site needs a robots.txt file.

First a disclaimer. I’ve had my own Web site for 10 years, and my experience operating this site led me to the conclusion that all sites need a robots.txt file. In addition, as I’ve mentioned a time or three over the past year, I’m building a Web crawler. That work has strengthened my opinion that even the most modest Web site should have a robots.txt file.

Now let me explain why.

Search engine indexing is a fact of life on the Internet. Google, Yahoo, MSN, Internet Archive, and dozens of other search engines continually crawl the Web, downloading pages and storing them for later indexing. For most people, this is a Good Thing: search engines let other people find the content that you post. Without Web-scale search engines, sites would have to depend on linking from others, and most would simply die in obscurity. Most people who post things on the Internet want to be visible, and Web search engines provide that visibility. It is in your best interest, if you want to be visible, to make it as easy as possible for search engines to find and index your site. Part of making your site easy to crawl is including a robots.txt file.

Because we’re talking about an exclusion standard, you might ask why you need a robots.txt file if you don’t want to block anybody’s access. The answer has to do with what happens when a program tries to get a file from your site.

When a well-behaved Web crawler (that is, one that recognizes and adheres to the robots.txt convention) first visits your site, it tries to read a file called robots.txt. That file is expected to be in the root directory, and be in plain text format. (For example, the robots.txt file for this blog site is at http://blog.mischel.com/robots.txt.) If the file exists, the crawler downloads and parses it, and then adheres to the Allow and Disallow rules that are there.

Crawlers don’t actually download robots.txt before visiting every URL. Typically, a crawler will read robots.txt when it first comes to your site, and then cache it for a day or two. That saves you bandwidth and saves the crawler a whole lot of time. It also means that if you make changes to robots.txt, it might be a few days before the crawler in question sees them. For example, if you see Googlebot accessing your site and you change robots.txt to block it, don’t be surprised if Googlebot keeps accessing your site for the next couple of days.

If you don’t have a robots.txt file, then two things happen: a “document not found” (code 404) error message is written to your server’s error log file, and the server returns something to the the Web crawler.

The entry in your server error log can be annoying if you periodically scan the error log. Since 404 is the error code returned when any document isn’t found, scanning the error log from time to time is a good way to find bad links in (or to) your site. Having to wade through potentially hundreds of 404 errors for robots.txt is pretty annoying.

I said that your server returns “something” to the crawler that requested the document. On simple sites, “something” turns out to be a very short message with a 404 Not Found status code. Crawlers handle that without trouble. But many Web sites have custom error handling, and “Not Found” errors will redirect to an HTML page saying that the file was not found. That page often has lots of other stuff on it, making it fairly large. In this case, not having a robots.txt file ends up costing you bandwidth.

Your not having a robots.txt doesn’t particularly inconvenience the crawler. If your server returns a 404 or a custom error page, the crawler just notes that you don’t have a robots.txt, and continues on its way. If it visits your site in a day or two, it’ll try to read robots.txt again.

So that’s why you should have a robots.txt: it prevents messages in your error log, potentially saves you bandwidth, and it lets you inform crawlers which parts of your site you want indexed.

Every Web site should have, at minimum, this robots.txt file:

User-agent: *
Disallow:

All this says is that you’re not disallowing anything: all crawlers have full access to read the entire site. That’s exactly what having no robots.txt file means, but it’s always a good idea to be specific when possible. Plus, if you create the file now while you’re thinking about it, you’ll find it much easier to modify in a hurry when you want to block a particular crawler.

How you create and post a robots.txt on your own site will depend on what hosting service you use. If you use FTP to put files up on the site, then you can create a file with a plain text editor (like Windows Notepad), add the two lines shown above, save it as robots.txt on your local drive, and then FTP it to the root of your Web site. If you use some kind of Web-based file manager, create a plain text file (NOT a Web page) at the top level of your site, add those two lines, and save the file as robots.txt. You can test it by going to http://yoursitename/robots.txt in your favorite Web browser.

May 6th, 2008

More On Robots Exclusion

As I mentioned yesterday, the Robots Exclusion Standard is a very simple protocol that lets webmasters tell well-behaved crawlers how to access their sites. But the “standard” isn’t as well defined as some would have you think, and there’s plenty of room for interpretation.

Consider this simple file:

User-agent: *
Disallow: /xfiles/


User-agent: YourBot
Disallow: /myfiles/

This says, “YourBot can access everything but /myfiles/. All other bots can access everything except /xfiles/.” Note that it does not prevent YourBot from accessing /xfiles/, as some robots.txt tutorials would have you believe.

Crawlers use the following rules, in order, to determine what they’re allowed to crawl:

  1. If there is no robots.txt, I can access anything on the site.
  2. If there is an entry with my bot’s name in robots.txt, then I follow those instructions.
  3. If there is a “*” entry in robots.txt, then I follow those instructions.
  4. Otherwise, I can access everything.

It’s important to note that the crawler stops checking rules once it finds one that fits. So if there is an entry for YourBot in robots.txt, then YourBot will follow those rules and ignore the entry for all bots (*).

If Disallow is the only directive, then there is no further room for interpretation. But the addition of the Allow directive threw a new wrench into the works: in what order do you process Allow and Disallow directives?

The revised Internet-Draft specification (note that I linked to archive.org here because the primary site has been down recently) for robots.txt says:

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

According to that description, if you wanted to allow access to /xfiles/mulder/, but disallow access to all other files in the /xfiles/ directory, you would write:

User-agent: *
Allow: /xfiles/mulder/
Disallow: /xfiles/

Several publicly available robots.txt modules work in this way, but that’s not the way that Google interprets robots.txt. In How do I block or allow Googlebot?, there is this example:

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Obviously, Googlebot reads all of the entries, checks first to see if the URL in question is specifically allowed, and if not, then checks to see if it is disallowed.

It’s a very big difference. If a bot were to implement the proposed standard, then it would never crawl /folder1/myfile.html, because the previous Disallow line would prevent it from getting beyond /folder1/.

Yahoo says that they work the same way as Google, with respect to handling Allow and Disallow. It’s unclear what MSNBot does, or how other crawlers handle this. But, hey, if it’s good enough for Google…

I never would have thought that a simple protocol like robots.txt could raise so many questions. And I’ve only touched on the Allow and Disallow directives. There are plenty of other proposals and extensions out there that are even more confusing, if you can imagine. Add to that the META tags that you can add to individual HTML documents to prevent crawling or indexing, and things get really confusing.

But I’ll leave that alone for now. Next time I’ll explain why every Web site should have a robots.txt file, even if it doesn’t restrict access to anything.

May 5th, 2008

Struggling with the Robots Exclusion Standard

The Internet community loves standards. We must. We have so many of them. Many of those “standards” are poorly defined or, even worse, ambiguous. Or, in the case of robots.txt, subject to a large number of extensions that have become something of a de facto standard because they’re supported by Google, Yahoo, and MSN Search. Unfortunately, those extensions can be ambiguous and difficult for a crawler to interpret correctly.

A little history is in order. The robots.txt “standard” is not an official standard in the same way that HTTP, SMTP, and other common Internet protocols are. There is no RFC that defines the standard, nor is there an associated standards body. The Robots Exclusion Standard was created by consensus in June 1994 by members of the robots mailing list. At the time it was created, the standard described how to tell Web robots (”bots,” “spiders,” “crawlers,” etc.) which parts of a site they should not visit. For example, to prevent all bots from visiting the path /dumbFiles/, and to stop WeirdBot from visiting anything on the site, you could create this robots.txt file:

# Prevent all bots from visiting /dumbFiles/
User-agent: *
Disallow: /dumbFiles/


# keep WeirdBot away!
User-agent: WeirdBot
Disallow: /

Understand, robots.txt doesn’t actually prevent a bot from visiting the site. It’s an advisory standard. It still requires the cooperation of the bot. The idea is that a well-behaved bot will read and parse the robots.txt file, and politely not crawl things it’s not supposed to crawl. In the absence of a robots.txt file, or if the robots.txt does not block the bot, the bot has free access to read any file.

There is a small but rather vocal group of webmasters who insist that having to include a robots.txt file is an unnecessary burden. Their view is that bots should stay off the site unless they’re invited. That is, the robots.txt should be an opt-in rather than an opt-out. In this model, the lack of a robots.txt file or a line within robots.txt specifically allowing the bot, the bot should stay away. In my opinion, this is an unreasonable position, but it’s a topic for another discussion.

In its initial form, robots.txt was a simple and reasonably effective way to control bots’ access to Web sites. But it’s a rather blunt instrument. For example, imagine that your site has five directories, but you only want one of them accessible by bots. With the original standard, you’d have to write this:

User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/
Disallow: /dir4/

Not so bad with only five directories, but it can quickly become unwieldy with a much larger site. In addition, if you add another directory to the site, you’d have to add that directory to robots.txt if you don’t want it crawled.

One of the first modifications to the standard was the inclusion of an “Allow” directive, which overrides the Disallow. With Allow, you can block access to everything except that which you want crawled. The example above becomes:

User-agent: *
Disallow: /
Allow: /dir5/

But not all bots understand the Allow directive. A well-behaved bot that does not support Allow will see the Disallow directive and not crawl the site at all.

Another problem is that of case sensitivity, and there’s no perfect solution. In its default operating mode, the Apache Web server treats case as significant in URLs. That is, the URL http://example.com/myfile.html is not the same as http://example.com/MYFILE.html. But the default mode of Microsoft’s IIS is to ignore case. So on IIS, those two URLs would go to the same file. Imagine, then, what happens if you have a site that contains a directory called /files/ that you don’t want indexed. This simple robots.txt should suffice:

User-agent: *
Disallow: /files/

If the site is in case-sensitive mode (Apache’s default configuration), then bots have no problem. A bot will check the URL they want to crawl to see if it starts with “/files/”, and if it does the bot will move on without requesting the document. But if the URL starts with “/Files/”, the bot will request the document.

But what happens if the site is running in case-insensitive mode (IIS default configuration), and the bot wants to crawl the file /Files/index.html? If it does a naive case-sensitive comparison, it will end up crawling the file, because as far as the Web server is concerned, /Files/ and /files/ are the same thing.

Since both Web servers can operate in either mode (case significant or not), it’s exceedingly difficult (impossible, in some cases) for a bot to determine whether case is significant in URLs. So those of us who are trying to create polite Web crawlers end up writing our robots.txt parsers with the assumption that all servers are case-insensitive (i.e. they operate like IIS). Given the above robots.txt file, we won’t crawl a URL that begins with “/files/”, “/Files/”, “/fILEs/”, or any other variation that differs only in case. To do otherwise would risk violating what the webmaster intended when he wrote the robots.txt file, but we end up potentially not crawling files that we’re allowed to crawl.

In a perfect world, this wouldn’t be required. But in the wide world of the Internet, it’s quite common for people to change case in links. I did it myself when I was running on IIS. My blog used to be at /Diary/index.html, but I and others often linked to /diary/index.html. That caused no end of confusion when I moved to a server running Apache. I had to make redirects that converted references to /Diary/ to /diary/.

Somewhere along the line, somebody decided that the Disallow and Allow directives should support wildcards and pattern matching. Google and Yahoo support this, but I’m not sure yet that their syntax and semantics are identical. I see no information that MSNBot supports these features. Some other crawlers support wildcards and pattern matching to different degrees and with varying syntax.

As useful as robots.txt has been and continues to be, it definitely needs an update. I fear, though, that any proposed “official” standard will never see the light of day, and if it does it will be overly complex and impossible to implement. The alternative isn’t particularly attractive, either: webmasters have to know the peculiarities of dozens of different bots, and bot writers have to decide which extended directives to support. It’s a difficult situation for both, but I don’t see how it can be reconciled. Likely the best a bot writer can do is attempt to implement those extensions that are supported by the major search engines, and document that on their sites.

July 27th, 2007

Web Search Ramblings

Most people reading this blog understand conceptually how Google and other search engines work. In brief, they have a program called a Web crawler that goes from one Web site to the next, downloading and storing pages, and extracting links to other pages. A separate process reads the stored documents and creates an inverted index that is (conceptually) similar to the index at the back of a book except that it indexes every single word in the document. When a user does a “search,” the search engine need only look up the terms in the index and return a list of documents that contain those terms.

I’ve waved my hand over significant technical detail, but the details of the implementation are not the point. The point is that many people–perhaps a majority of Internet users–do not have even this level of understanding. They think that when they pose a query to a search engine, the search engine searches the Web in real time. Those of us who understand a little bit about the Internet and the inner workings of search engines might find that idea absurd, but the term “search engine” does imply that some kind of searching is going on. What we call a search engine is more correctly a Web index.

Except that it’s not an index of the entire Web. In fact, not even Google indexes a majority of the visible Web. The best estimates I’ve seen put the number of publicly visible Web pages at somewhere between 50 and 100 billion. Researchers estimate that nobody indexes even 20% of them. If you give it a little thought, you can understand why.

The data I’ve gathered in crawling more than 100 million Web pages over the last few months indicates that the average Web page size is about 30 kilobytes. A one-megabit Internet connection can pull down 100 kilobytes per second, or about 3.3 Web pages per second. Large search engines, of course, have much faster Internet connections–on the order of gigabits. But even a gigabit connection can only pull down 3,300 pages per second. At that rate, it would take about 35 days to download 10 billion documents.

Granted, some pages are updated less frequently than others, and search engines have been optimized to take that into account. Still, it’s not possible right now for any search engine to have a current index of the entire visible Web. At best, a general search engine can have a reasonably current–say 24 hours–index of the most popular one percent of the Web. Everything else has to wait until the crawler gets around to it.

Note that I said “general search engine.” Targeted search engines that index specific topics or particular subsets of the Web are becoming more popular because they can keep their indexes more up to date. Some of them can update their indexes several times per day. Not only is their information more current, but it’s also more focused and more likely to give you higher quality results in whatever narrow field it targets. The drawback, of course, is that you won’t find information about Twinkies on the Steam Train search site. (I made that up. I have no idea if there really is a Steam Train search site.)

The large search engine companies understand this, of course. Google, for example, has introduced Google Custom Search, which allows you to create what is, in effect, your own custom search engine. This is the first step in what I think will be a very large emerging market.

April 28th, 2007

You want it when?

The web crawler I’m working on, as I’ve mentioned before, is a distributed application. Currently it consists of a URL Server and multiple Crawlers. The basic idea is that the URL Server is a traffic director that tells each Crawler machine which Web sites to visit. Each Crawler machine hosts multiple Worker threads, each of which reads a URL from a queue, downloads the page, harvests the links, and reports the results.

Coordinating the the Worker threads and the communication with the URL Server (which itself contains several threads) isn’t exactly easy, but it’s manageable. This past week I’ve been working on what I call the Crawler Control Panel: a GUI application that lets me monitor and control the URL Server and the individual Crawlers. Today I finally got around to adding the shutdown commands, and making the Crawlers and URL Server shut down gracefully. That turned out to be an interesting experience.

Read the rest of this entry »

April 12th, 2007

Bloom Filters in C#

As I’ve pointed out before, writing a Web crawler is conceptually simple: read a page, extract the links, and then go visit those links. Lather, rinse, repeat. But it gets complicated in a hurry.

The first thing that comes to mind is the problem of duplicate URLs. If you don’t have a way to discard URLs that you’ve already seen, you’ll spend most of your time revisiting pages. The naive approach of keeping an in-memory list of visited URLs breaks down pretty quickly when the number of visited sites moves into the tens of millions. Overflowing that list to disk also has problems: maintaining a sorted index and searching that index quickly are very difficult problems. Databases can handle many billions of records fairly quickly, but I don’t know of a database that can keep up with a distributed crawler that’s processing almost 1,000 documents per second, each of which has an average of 100. [As an aside, I was very surprised to find that the average number of links per HTML page--this is based on examining several million documents--is about 130. I would have guessed 15 or 20.]

I ran across a mention of Bloom filters while reading about a different Web crawler. I’ll let you get the detail from the linked Wikipedia article, and just say that a Bloom filter is a space-efficient way of allowing you to test for inclusion in a set. It uses hashing, but has a much lower false positive rate than a simple hash table. The cost is slightly more processing time, but the memory savings and decreased false positive rate are well worth the cost. Besides, my crawler is not at all processor bound.

I coded up a C# class to experiment with Bloom filters, and compared it with a simple hash table. The code along with more detailed explanation is available in my new article, Bloom Filters in C#.

March 30th, 2007

Multi-threaded programming

I’ve been head-down here working on the Web crawler and haven’t had much occasion to sit down and write blog entries. It’s been a very busy but interesting and rewarding time. A high performance distributed Web crawler is a rather complex piece of work. I knew that it would be involved, but there’s a lot more to it than I originally thought. Managing hundreds of concurrent socket connections on a single machine and coordinating multiple machines gets complicated in a hurry.

It’s been a long time since I did any serious multi-threaded programming, and I’m re-learning some lessons:

  • When there are multiple threads, everything is subject to concurrency issues.
  • Unless you can guarantee that a data structure cannot possibly be accessed by more than one thread of execution at a time, you will eventually experience a problem if you don’t protect it with some kind of synchronization lock.
  • You’ll find that guarantee very difficult to make.
  • It can take hours or days before your program crashes inexplicably.
  • It can take just as long to find that one place where you forgot to protect a global data structure with a mutual exclusion lock.
  • If you’re developing a multi-threaded application on a single-processor machine, you’ll invariably end up finding more bugs when you move to a dual-processor machine where the computer really can do more than one thing at a time.
  • Learn how to use trace logging and prepare to spend a lot of time examining the log files for problems.
  • Race conditions and deadlocks are facts of life, difficult to predict, and painfully obvious in retrospect.

It’s really good to be programming again.

March 5th, 2007

Crawling Along

After you get your basic web crawler downloading pages and extracting links, you find yourself having to make a decision: how do you feed the harvested URLs back into the crawler? For instance, if I visit www.mischel.com and extract a link to blog.mischel.com, how do I feed that new link back to the crawler so that it will download and examine that page? At first glance, it doesn’t seem like a problem at all: simply add the new link to the queue. If only it were so easy.

There are several problems one must deal with. First, you have to make sure that the crawler hasn’t already visited that page, and that the new url isn’t already in the queue. If you just add every harvested link to the queue, you’ll soon find yourself in an infinite loop. Not a problem, right? Just keep a list of visited links. Except there are billions of URLs. If you’re going to crawl any significant portion of the web, you won’t be able to keep the queue or the visited URLs list in memory. Forget storing the things in a traditional database. SQL Server, mySQL, and the like won’t perform well enough to keep your bandwidth saturated.

You could devise a hashing scheme and keep a big array (on the order of four to eight gigabytes) of bits in memory. That would solve the problem at the risk of some small number of false positives–URLs that hash to the same key. If you have a distributed crawler, you can soup up one machine to do all the URL management tasks and have the individual crawler machines send their harvested links to it. This “URL server” makes sure that no link is visited twice, and also controls which URLs the individual crawlers visit. You can draw a pretty picture that shows how it all works together. It’s feasible, but seriously complicated.

There is a simpler method. The URL server tells each crawling process which URLs to visit. They report their harvested links back to the server, which simply writes them to disk. When all of the crawlers have exhausted the links that they’ve been given, they exit. The URL server then sorts the harvested links file, eliminates duplicates by merging with the existing visited links file, and then starts another crawler pass with the newly discovered links. Granted, that sort and merge is going to take a lot of time, but the job can be split up among many machines.

This approach also has the benefit of providing a stopping point where some other process can take the collected data up to this point and start processing. After all, the point of crawling the web is to collect information, not just have the crawlers out there harvesting links.

There is a major drawback to this approach. When the crawlers are idle, you’re not using any of that expensive bandwidth you’re paying for. It is possible to use a hybrid approach, where the URL server fires off a sort/merge (on a separate machine) after collecting some large number of links, and continues to keep the crawlers busy while the other machine is doing its thing. When it’s done, that other machine gives the URL server another batch of links to process. This keeps the bandwidth saturated while still simplifying the process of identifying duplicate URLs.

Of course, if you have something else to do with the bandwidth while the sort/merge is going on, then the crawl-process-crawl scheme comes in very handy indeed.

March 1st, 2007

Crawling the Web

I’m writing a Web crawler. Yeah, I know. It’s already been done. It seems like everybody’s done some Web crawling. But there’s a huge difference between dabbling at it and writing a scalable, high-performance Web crawler that can pull down hundreds or thousands of documents per second. Granted, processing more than a few dozen documents per second will require a heck of a lot more bandwidth than I currently have, and if I want to get to the thousands I’ll need multiple machines working together.

But first things first. Before you can scale up, you need something that works no matter how slow it is. This turns out to be more difficult than it seems at first glance. The idea is very simple: read a Web page, scan it for links, put those links in a queue, grab a link from the queue, lather, rinse, repeat. Ten billion times.

Ten billion is a big number. If you process 20 documents per second, it’ll only take you 500 million seconds to finish the job. That’s almost 16 years. Obviously, you need to process things much faster than that. It’s a bandwidth issue and a machine optimization issue. Let’s look at both.

The easy part is bandwidth. It’s hard to get an authoritative figure, but a good round number for average Web page size (HTML only) is about 25 kilobytes. The best throughput I’ve seen on my cable modem is about 5 megabits (approx 500 kilobytes) per second. So figure a sustained upper limit of 20 pages per second. That’s 20 HTML pages downloaded and links extracted. Doesn’t seem so tough, right? I mean, a 2 GHz computer with a 100 megabit network card can easily keep up with that.

It’s harder than it sounds. Not because the processor can’t keep up, but because it’s incredibly difficult to saturate the bandwidth. There’s a whole lot of latency involved in reading Web pages. Most of the time, the computer is just sitting there waiting for servers to return data. The solution is to keep multiple connections going so that while some are waiting, others are pulling down data. That turns out to be a non-trivial task. First, because managing dozens or hundreds of concurrent connections is hard, and second because domain name resolution takes an absurdly long time. And it’s often implemented very inefficiently.

There are many other issues you have to think about if you want to crawl the Web. You’d think that keeping track of the URLs you’ve visited and those that you haven’t visited yet would be easy. And it would be if the numbers were manageable. 10 billion documents means 10 billion URLs. Just the URLs themselves, stored naively, would take close to a terabyte. So you have to come up with a memory-efficient way to store them and a time-efficient way to search them.

There are other issues as well. For example, did you know that www.mischel.com and mischel.com go to exactly the same place? That’s a trivial example, but you’ll find out that some Web pages have many different URLs. If you don’t have some way of resolving those and other traps, URLs with odd query strings, unique session IDs, and any number of other oddities will quickly have you chasing your own tail.

Today I finally got my first multiple-connection web crawler up and limping. It took a whle lot longer than I thought it would because the asynchronous HTTP library I downloaded was buggy and it took me forever to figure out how to make Pytho break out of an infinite loop. For all that’s good about Python, the development and debugging environment is distressingly primitive.

All of which is to say that I’ve latched onto a project that is very interesting and challenging, and doesn’t leave me time for much else. When I’m not banging on code, I’m reading research papers about crawling, storing, indexing, and searching large amounts of data, or thinking about better ways to implement parts of my crawler. It’s exhausting, but pleasantly so.

|