Crawling the Web

I’m writing a Web crawler. Yeah, I know. It’s already been done. It seems like everybody’s done some Web crawling. But there’s a huge difference between dabbling at it and writing a scalable, high-performance Web crawler that can pull down hundreds or thousands of documents per second. Granted, processing more than a few dozen documents per second will require a heck of a lot more bandwidth than I currently have, and if I want to get to the thousands I’ll need multiple machines working together.

But first things first. Before you can scale up, you need something that works no matter how slow it is. This turns out to be more difficult than it seems at first glance. The idea is very simple: read a Web page, scan it for links, put those links in a queue, grab a link from the queue, lather, rinse, repeat. Ten billion times.
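
Sketched in Python with nothing but the standard library, the naive single-connection version of that loop looks something like this. It's only here to pin down the shape of the thing; a real crawler needs a proper HTML parser, robots.txt handling, politeness delays, and a lot more:

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, max_pages=100):
    # Naive single-threaded crawl loop: fetch a page, pull out its links,
    # queue anything we haven't seen, repeat.
    queue = deque(seed_urls)
    seen = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # dead link, timeout, whatever -- skip it and move on
        fetched += 1
        # Crude regex link extraction; good enough to show the idea.
        for href in re.findall(r'href=["\'](.*?)["\']', html, re.IGNORECASE):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

# crawl(["http://example.com/"], max_pages=25)
```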

Ten billion is a big number. If you process 20 documents per second, it’ll only take you 500 million seconds to finish the job. That’s almost 16 years. Obviously, you need to process things much faster than that. It’s a bandwidth issue and a machine optimization issue. Let’s look at both.

The easy part is bandwidth. It’s hard to get an authoritative figure, but a good round number for average Web page size (HTML only) is about 25 kilobytes. The best throughput I’ve seen on my cable modem is about 5 megabits per second, which is 625 kilobytes per second on paper; call it 500 kilobytes per second of usable throughput after overhead. So figure a sustained upper limit of 20 pages per second. That’s 20 HTML pages downloaded and links extracted every second. Doesn’t seem so tough, right? I mean, a 2 GHz computer with a 100 megabit network card can easily keep up with that.
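
Here's the back-of-the-envelope math in one place, for anyone who wants to plug in their own numbers:

```python
# Rough numbers from above; tweak them and see how grim it gets.
PAGE_SIZE_KB = 25            # average HTML-only page size
THROUGHPUT_KB_PER_SEC = 500  # usable cable-modem throughput
TOTAL_DOCS = 10_000_000_000

pages_per_sec = THROUGHPUT_KB_PER_SEC / PAGE_SIZE_KB   # 20
seconds_total = TOTAL_DOCS / pages_per_sec             # 500,000,000
years = seconds_total / (60 * 60 * 24 * 365)           # ~15.9

print(f"{pages_per_sec:.0f} pages/sec -> {years:.1f} years "
      f"for {TOTAL_DOCS:,} documents")
```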

It’s harder than it sounds. Not because the processor can’t keep up, but because it’s incredibly difficult to saturate the bandwidth. There’s a whole lot of latency involved in reading Web pages. Most of the time, the computer is just sitting there waiting for servers to return data. The solution is to keep multiple connections going so that while some are waiting, others are pulling down data. That turns out to be a non-trivial task: first, because managing dozens or hundreds of concurrent connections is hard, and second, because domain name resolution takes an absurdly long time and is often implemented very inefficiently.
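
To show the shape of the multiple-connection idea, here's a thread-pool sketch in Python. It's only an illustration, not the asynchronous library my crawler actually uses, and it leaves DNS untouched; a serious crawler caches lookups rather than resolving the same hosts over and over:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Each worker blocks on its own connection. With enough of them in
    # flight, the waiting overlaps and the link stays much closer to full.
    with urllib.request.urlopen(url, timeout=10) as response:
        return url, response.read()

def fetch_many(urls, workers=50):
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, url) for url in urls]
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                pass  # a slow or dead server shouldn't stall the rest
    return results
```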

There are many other issues you have to think about if you want to crawl the Web. You’d think that keeping track of the URLs you’ve visited and those that you haven’t visited yet would be easy. And it would be if the numbers were manageable. 10 billion documents means 10 billion URLs. Just the URLs themselves, stored naively as strings at an average of 80 to 100 bytes apiece, would take close to a terabyte. So you have to come up with a memory-efficient way to store them and a time-efficient way to search them.
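
One common trick (and I'm only sketching the idea here, not describing my actual data structure) is to store a fixed-size fingerprint of each URL instead of the string itself:

```python
import hashlib

class SeenURLs:
    """Track visited URLs by an 8-byte fingerprint instead of the full string.

    10 billion URLs at ~100 bytes each is on the order of a terabyte;
    8-byte fingerprints cut that to roughly 80 GB, which still wants an
    on-disk structure (or a Bloom filter in front of one), but it shows
    the idea.
    """
    def __init__(self):
        self._seen = set()

    @staticmethod
    def _fingerprint(url):
        # The first 8 bytes of an MD5 are plenty for a duplicate check,
        # though at 10 billion entries the collision odds become a real
        # design consideration.
        return hashlib.md5(url.encode("utf-8")).digest()[:8]

    def add(self, url):
        self._seen.add(self._fingerprint(url))

    def __contains__(self, url):
        return self._fingerprint(url) in self._seen
```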

There are other issues as well. For example, did you know that www.mischel.com and mischel.com go to exactly the same place? That’s a trivial example, but you’ll find that some Web pages have many different URLs. If you don’t have some way of resolving those duplicates and steering around other traps, then URLs with odd query strings, unique session IDs, and any number of other oddities will quickly have you chasing your own tail.
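
A canonicalization pass catches the easy cases. Something along these lines is the general idea; the list of session-ID parameters is just a guess at the usual offenders, and folding www into the bare domain is only safe once you've checked that they really are the same site:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that usually carry session state rather than content.
# Illustrative, not exhaustive.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url):
    """Reduce obvious URL variations to a single form before queueing."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Drop the default port, and fold a leading "www." into the bare domain
    # (fine for hosts like www.mischel.com / mischel.com that really are the
    # same place; a cautious crawler verifies that before assuming it).
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Throw away session-ID parameters and sort the rest so parameter order
    # doesn't create "new" URLs. The fragment never goes to the server.
    params = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
              if k.lower() not in SESSION_PARAMS]
    query = urlencode(sorted(params))
    return urlunsplit((scheme, netloc, path or "/", query, ""))
```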

Today I finally got my first multiple-connection web crawler up and limping. It took a whole lot longer than I thought it would: the asynchronous HTTP library I downloaded was buggy, and it took me forever to figure out how to make Python break out of an infinite loop. For all that’s good about Python, the development and debugging environment is distressingly primitive.

All of which is to say that I’ve latched onto a project that is very interesting and challenging, and doesn’t leave me time for much else. When I’m not banging on code, I’m reading research papers about crawling, storing, indexing, and searching large amounts of data, or thinking about better ways to implement parts of my crawler. It’s exhausting, but pleasantly so.