Crawling along

After you get your basic web crawler downloading pages and extracting links, you face a decision: how do you feed the harvested URLs back into the crawler? For instance, if I visit www.mischel.com and extract a link to blog.mischel.com, how do I feed that new link back to the crawler so that it will download and examine that page? At first glance it doesn’t seem like a problem at all: simply add the new link to the queue. If only it were that easy.

There are several problems one must deal with. First, you have to make sure that the crawler hasn’t already visited the page, and that the new URL isn’t already in the queue. If you just add every harvested link to the queue, you’ll soon find yourself in an infinite loop. Not a problem, right? Just keep a list of visited links. Except there are billions of URLs. If you’re going to crawl any significant portion of the web, you won’t be able to keep the queue or the visited URLs list in memory. Forget storing them in a traditional database, too. SQL Server, MySQL, and the like won’t perform well enough to keep your bandwidth saturated.
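To make the naive scheme concrete, here’s a rough sketch in Python. The `fetch_and_extract` callback is a stand-in for the actual download-and-parse step; the point is that both `seen` and `queue` live in memory, which is exactly what stops working at web scale.

```python
from collections import deque

def crawl(seed_urls, fetch_and_extract):
    """Naive frontier: dedup with an in-memory set.
    Fine for a small site crawl; hopeless for billions of URLs,
    since 'seen' and 'queue' must both fit in RAM."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        for link in fetch_and_extract(url):
            if link not in seen:      # skip visited and already-queued URLs
                seen.add(link)
                queue.append(link)
    return seen
```

Note that a link goes into `seen` the moment it’s queued, which covers both halves of the problem: already-visited pages and pages already waiting in the queue.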

You could devise a hashing scheme and keep a big array (on the order of four to eight gigabytes) of bits in memory. That would solve the problem at the risk of some small number of false positives: URLs that hash to the same key. If you have a distributed crawler, you can soup up one machine to do all the URL management tasks and have the individual crawler machines send their harvested links to it. This “URL server” makes sure that no link is visited twice, and also controls which URLs the individual crawlers visit. You can draw a pretty picture that shows how it all works together. It’s feasible, but seriously complicated.
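The bit-array idea looks something like this sketch. The hash function (SHA-1 here) and the array size are my choices for illustration; the essential behavior is that a clear bit is a guaranteed “never seen,” while a set bit might be a false positive from two URLs landing in the same slot.

```python
import hashlib

class SeenFilter:
    """One big in-memory bit array indexed by a hash of the URL.
    False positives are possible (two URLs can hash to the same
    slot); false negatives are not."""

    def __init__(self, size_bits=2**35):  # 2^35 bits = 4 GB, the scale mentioned above
        self.size = size_bits
        self.bits = bytearray(size_bits // 8)

    def _slot(self, url):
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def check_and_add(self, url):
        """Return True if the URL was (probably) seen before,
        and mark it as seen either way."""
        slot = self._slot(url)
        byte, bit = divmod(slot, 8)
        seen = bool(self.bits[byte] & (1 << bit))
        self.bits[byte] |= (1 << bit)
        return seen
```

Using several independent hashes per URL instead of one turns this into a Bloom filter, which trades a few more memory touches for a much lower false-positive rate at the same size.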

There is a simpler method. The URL server tells each crawling process which URLs to visit. They report their harvested links back to the server, which simply writes them to disk. When all of the crawlers have exhausted the links that they’ve been given, they exit. The URL server then sorts the harvested links file, eliminates duplicates by merging with the existing visited links file, and then starts another crawler pass with the newly discovered links. Granted, that sort and merge is going to take a lot of time, but the job can be split up among many machines.
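The sort/merge step can be sketched with in-memory lists standing in for the on-disk files. At real scale each list is a file and the sort is an external, possibly distributed sort, but the logic is the same: sort and dedupe the harvested links, then stream them against the sorted visited file, keeping only the URLs that have never been seen.

```python
def merge_pass(visited, harvested):
    """One sort/merge pass, with lists standing in for files.
    'visited' is sorted and duplicate-free; 'harvested' is the raw
    pile of links the crawlers reported. Returns the updated visited
    list and the newly discovered links for the next crawl pass."""
    harvested = sorted(set(harvested))   # the big sort (external, at scale)
    newly_discovered = []
    i = 0
    for url in harvested:
        # Both lists are sorted, so one forward scan suffices.
        while i < len(visited) and visited[i] < url:
            i += 1
        if i == len(visited) or visited[i] != url:
            newly_discovered.append(url)
    new_visited = sorted(set(visited) | set(newly_discovered))
    return new_visited, newly_discovered
```

Because everything is sorted, the merge itself is a single sequential pass over both files, which is exactly the access pattern disks are good at.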

This approach also has the benefit of providing a stopping point where some other process can take the data collected so far and start processing it. After all, the point of crawling the web is to collect information, not just have the crawlers out there harvesting links.

There is a major drawback to this approach: while the crawlers are idle, you’re not using any of that expensive bandwidth you’re paying for. It is possible to use a hybrid approach, where the URL server fires off a sort/merge (on a separate machine) after collecting some large number of links, and keeps the crawlers busy while the other machine does its thing. When it’s done, that other machine hands the URL server another batch of links to process. This keeps the bandwidth saturated while still keeping duplicate detection simple.
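The hybrid scheme amounts to overlapping two kinds of work. Here’s a minimal sketch, assuming `crawl_batch` and `sort_merge` are placeholders for the real crawl pass and the sort/merge on the other machine: while the crawlers chew through batch N, a single worker (standing in for that separate machine) merges the links harvested from batch N−1.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_pass(batches, crawl_batch, sort_merge):
    """Overlap crawling with the sort/merge. 'crawl_batch' downloads
    a batch of URLs and returns the harvested links; 'sort_merge'
    dedupes a pile of harvested links. Both are assumed, not real
    APIs. Returns the merged output of every batch, in order."""
    merged = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for batch in batches:
            harvested = crawl_batch(batch)       # crawlers stay busy here
            if pending is not None:
                merged.append(pending.result())  # collect the finished merge
            pending = pool.submit(sort_merge, harvested)
        if pending is not None:
            merged.append(pending.result())      # drain the last merge
    return merged
```

The only requirement for full overlap is that a merge finishes in roughly the time it takes to crawl one batch; if merges are slower, batches queue up and the crawlers eventually stall anyway.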

Of course, if you have something else to do with the bandwidth while the sort/merge is going on, then the crawl-process-crawl scheme comes in very handy indeed.