As I’ve mentioned before, writing a Web crawler is conceptually simple: read a page, extract the links, and then go visit those links. Lather, rinse, repeat. But it gets complicated in a hurry.
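The skeleton really is that short. Something like the following sketch, where FetchPage and ExtractLinks are hypothetical placeholders for the HTTP fetch and the HTML link extraction, not code from my crawler:

```csharp
using System;
using System.Collections.Generic;

class CrawlerSketch
{
    // Hypothetical placeholders for the real fetch and parse steps.
    static string FetchPage(string url) => throw new NotImplementedException();
    static IEnumerable<string> ExtractLinks(string html) => throw new NotImplementedException();

    static void Crawl(string seedUrl)
    {
        var frontier = new Queue<string>();
        frontier.Enqueue(seedUrl);

        while (frontier.Count > 0)
        {
            string url = frontier.Dequeue();
            string html = FetchPage(url);           // read a page
            foreach (string link in ExtractLinks(html))
                frontier.Enqueue(link);             // ...and queue its links for a visit
        }
    }
}
```

Notice that this loop has no memory of where it has been, so it will happily crawl the same pages over and over. That leads directly to the first complication.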
The first thing that comes to mind is the problem of duplicate URLs. If you don’t have a way to discard URLs that you’ve already seen, you’ll spend most of your time revisiting pages. The naive approach of keeping an in-memory list of visited URLs breaks down pretty quickly once the number of visited pages moves into the tens of millions. Overflowing that list to disk has its own problems: maintaining a sorted index and searching it quickly are hard problems in their own right. Databases can handle many billions of records fairly quickly, but I don’t know of a database that can keep up with a distributed crawler that’s processing almost 1,000 documents per second, each of which contains more than 100 links on average. [As an aside, I was very surprised to find that the average number of links per HTML page (based on examining several million documents) is about 130. I would have guessed 15 or 20.]
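For reference, the naive version is nothing more than a HashSet; the trouble is purely one of scale. If URLs average somewhere around 80 characters (an assumption, not a measured number), tens of millions of them stored as UTF-16 .NET strings works out to gigabytes of memory before you count the per-entry overhead of the set itself.

```csharp
using System.Collections.Generic;

class VisitedUrls
{
    // The naive approach: every URL the crawler has ever seen, held in memory.
    private readonly HashSet<string> _seen = new HashSet<string>();

    // HashSet<T>.Add returns false when the item is already present,
    // so this doubles as the "have we been here before?" test.
    public bool TryAdd(string url) => _seen.Add(url);
}
```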
I ran across a mention of Bloom filters while reading about a different Web crawler. I’ll let you get the details from the linked Wikipedia article, and just say that a Bloom filter is a space-efficient way to test whether an item is a member of a set. It uses hashing, but for the same amount of memory it has a much lower false positive rate than a simple hash table. The cost is slightly more processing time, but the memory savings and the lower false positive rate are well worth it. Besides, my crawler is not at all processor bound.
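To make the idea concrete, here’s a bare-bones sketch of the data structure. This is not the class from my article; it’s just an illustration that uses the standard sizing formulas and double hashing to derive the k bit positions:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

public class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    // capacity: expected number of items; errorRate: target false positive rate.
    public BloomFilter(int capacity, double errorRate)
    {
        // Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
        int m = (int)Math.Ceiling(-capacity * Math.Log(errorRate) / (Math.Log(2) * Math.Log(2)));
        _hashCount = Math.Max(1, (int)Math.Round((double)m / capacity * Math.Log(2)));
        _bits = new BitArray(m);
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            _bits[index] = true;
    }

    // False means the item was definitely never added; true means "probably added".
    public bool Contains(string item)
    {
        foreach (int index in Indexes(item))
            if (!_bits[index])
                return false;
        return true;
    }

    // Double hashing: derive k indexes from two base hashes (FNV-1a with two seeds).
    private IEnumerable<int> Indexes(string item)
    {
        byte[] data = Encoding.UTF8.GetBytes(item);
        uint h1 = Fnv1a(data, 2166136261u);
        uint h2 = Fnv1a(data, 0x9747b28cu) | 1;   // | 1 guarantees a non-zero step
        for (int i = 0; i < _hashCount; i++)
            yield return (int)((h1 + (uint)i * h2) % (uint)_bits.Length);
    }

    private static uint Fnv1a(byte[] data, uint seed)
    {
        uint hash = seed;
        foreach (byte b in data)
        {
            hash ^= b;
            hash *= 16777619u;   // FNV prime
        }
        return hash;
    }
}
```

In the crawler, bloom.Contains(url) takes the place of the hash table lookup before a link goes into the frontier. The one wrinkle is that a false positive means a page the crawler has never seen gets skipped rather than visited, which for this purpose is a perfectly acceptable trade.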
I coded up a C# class to experiment with Bloom filters and compared it with a simple hash table. The code, along with a more detailed explanation, is available in my new article, Bloom Filters in C#.
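If you want to run a quick comparison of your own before reading the article, a rough harness along these lines will do (it reuses the BloomFilter sketch above; the item count and error rate are arbitrary choices):

```csharp
using System;
using System.Collections.Generic;

class BloomVersusHashSet
{
    static void Main()
    {
        const int n = 1_000_000;
        var bloom = new BloomFilter(n, 0.001);   // target ~0.1% false positive rate
        var table = new HashSet<string>();       // the "simple hash table" baseline

        // Load both structures with the same synthetic URLs.
        for (int i = 0; i < n; i++)
        {
            string url = "http://example.com/page/" + i;
            bloom.Add(url);
            table.Add(url);
        }

        // Probe with URLs that were never added; the HashSet never reports a
        // false positive, so any hits here show the Bloom filter's error rate.
        int falsePositives = 0;
        for (int i = n; i < 2 * n; i++)
            if (bloom.Contains("http://example.com/page/" + i))
                falsePositives++;

        Console.WriteLine($"Bloom filter false positives: {falsePositives} out of {n:N0} probes");
    }
}
```

The interesting number isn’t really in the output, though; it’s the difference in memory between the two structures, which is the whole reason to bother with a Bloom filter in the first place.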