Writing a Web Crawler: Crawling Models

This is the second post in a series of posts about writing a Web crawler. Read the Introduction to get the background information.

Expectations

I failed to set expectations in the Introduction, which might have led some readers to believe that I will be presenting a fully coded, working Web crawler. That is not the case. I will discuss the issues in detail and will certainly include pseudocode for some of the tasks and algorithms, but it is decidedly not my intention to design, code, test, and debug a complete crawler. The focus of this series is on the issues involved and possible solutions; it stops short of actual implementations.

The reason for that is simple: the issues are more important than the code. I wrote my large-scale Web crawler the hard way. I started with a naive approach to crawling, encountered problems, solved them, found more problems, solved them, and so on. It was a fun but frustrating and terribly inefficient way to create a Web crawler. It involved rewriting large portions of the crawler many times, and it resulted in some messy code and, worse, some high-level design decisions that can’t be changed without major refactoring. Had I known the issues ahead of time, I would have designed the crawler differently and avoided a lot of pain.

I would like to save you that pain.

On to crawling

The high-level view of a Web crawler’s operation is very simple. In pseudocode, it looks like this:

queue = LoadSeed();
while (queue is not empty)
{
    dequeue url
    request document
    parse document for links and add to queue
}

Admittedly, that’s a very simplistic view of how a crawler works, but it’s essentially correct. There are, however, a number of details that aren’t included. The Web, although finite, is rather large: on the order of 100 billion documents. The number of URIs pointing to those documents is much larger, far too many to count. It’s difficult to say how many pages are actually indexed by the major search engines; I’ve seen estimates ranging from 20 percent to 60 percent of the visible Web for Google, Bing, and the rest. The “visible” Web includes only those pages that are reachable through a simple URL, without logging in, executing JavaScript, and so on. It’s also unclear whether the 100 billion figure includes images and other downloadable files, or just HTML pages.

Whatever the case, the size of the Web is staggering: large enough that the queue would grow continually and your crawler would never stop, especially if you’re running on a small scale (one machine crawling from a cable modem). It takes a huge amount of space just to store the URLs that you’re going to crawl.

Another detail missing from the pseudo code above is what to do with the documents. The algorithm just extracts and queues links. A real crawler is crawling for a purpose: typically to do something with the documents that it downloads. A search engine crawler, for example, processes the documents and builds an inverted index that can be searched very quickly to identify documents that contain requested keywords. What really happens is that the document is stored in some repository, and a separate process comes along later to read the document repository and create the index.

Note that this series will pointedly stay out of the business of creating a search engine. Storing, indexing, and retrieving documents is an entire field of technology in itself. My focus is on the crawler, which finds the documents.

The final thing missing from the simplified algorithm shown above is some means of preventing the crawler from requesting the same document multiple times. This turns out to be a very hard problem, and one I’ll cover in much detail in a later post. For now, we’ll limit our definition of “same document” to “same URI”. That is, once the crawler has downloaded the page from http://www.mischel.com/index.html, it shouldn’t go to that document again within the crawling period. I’ll define “crawling period” below.

A more realistic (although not quite real) high level view of the crawling process, then, looks like this:

queue = LoadSeed();
while (queue is not empty)
{
    dequeue url
    request document
    store document for later processing
    parse document for links
    add unseen links to queue
}
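
For concreteness, here is a minimal Python sketch of that loop. It is only a sketch: the helper names and the simple file-per-document repository are my own stand-ins, the “seen” set lives in memory (which, as discussed below, won’t survive a real crawl), and politeness is ignored entirely (that’s the subject of a later post).

import hashlib
import os
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def store_document(url, html, repo="repository"):
    # Stand-in repository: one file per document, URL on the first line, HTML after it.
    os.makedirs(repo, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    with open(os.path.join(repo, name), "w", encoding="utf-8") as f:
        f.write(url + "\n" + html)

def crawl(seed_urls):
    queue = deque(seed_urls)          # the frontier
    seen = set(seed_urls)             # "same document" == "same URI" for now
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                  # a real crawler would log and classify failures
        store_document(url, html)     # store for later processing
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)    # add unseen links to the queue
                queue.append(absolute)

if __name__ == "__main__":
    crawl(["http://www.mischel.com/index.html"])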

As I mentioned above, the queue size can get unmanageably large. If you start your crawler with a very small seed that contains portal sites like Yahoo and MSN, and perhaps a few popular blogging sites like WordPress and LiveJournal, your queue is going to grow very quickly. Our experience has been that, on average, an HTML page contains about 100 links. That, to me, was an astonishingly high number. It turns out that, again on average, only 10 percent of those links will be new. That is, of the 100 links you scrape from a page, only 10 will be links you haven’t seen yet. That’s a big reduction, but it’s still a lot: for every URL you take out of the queue, you put 10 back in. If you crawl one page per second, that’s 86,400 pages per day, which adds roughly 864,000 URLs to your queue by the end of the first day.

One page per second, by the way, is about what you can expect from a very simple single-threaded crawler over the long term. Crawling speed isn’t particularly relevant for the current discussion, so I’ll defer detailed coverage of that topic to a later posting.

It should be obvious that the queue will quickly grow beyond memory capacity, so you need some kind of backing store for it. Not only do you need to maintain the URLs that will be crawled, you also have to maintain a list of the URLs you’ve already crawled, so you don’t crawl them again. You might think of using a database for that, but you’ll quickly find that you need a very fast database cluster to handle all the lookups you’d be making. At one page per second, you’d be querying the database 100 times per second just to see whether the extracted URLs are new. When you get your crawler doing 30 pages per second (easily achievable on a single machine and a cable modem), you’ll be doing 3,000 lookups, 300 insertions, and 30 updates per second. That’s beyond the capacity of a simple database server.

Another problem with a “live” queue is that URLs are inserted in essentially random order. There will be blocks of URLs for the same site, but links to the mischel.com domain, for example, might appear in multiple places within the queue. There are certain benefits to sorting links by domain and crawling all of the links for a single domain, one right after the other. You can avoid the cost of repeated DNS lookups, and also get a “snapshot” of a single domain over a short period of time rather than getting some now and some later.

For those reasons, most traditional crawlers use what we’ve come to call an “offline” model. The crawler starts crawling with a queue (sometimes called a segment) that’s ordered in some fashion (typically by domain name), and as it crawls pages it just stores them. A separate process reads the stored pages, extracts links, decides which links need to be crawled, sorts them by domain, and creates a new segment. When the crawler finishes with its initial segment, it grabs a new segment and starts crawling again.

The crawler’s code, then, becomes much simpler:

while (segment is available)
{
    load segment into queue
    while (queue is not empty)
    {
        dequeue url
        request document
        store document
    }
}
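
The work that disappeared from the crawler’s loop lands in the offline link-extraction process. A rough Python sketch of that side is below; the flat-file “already crawled” list, the crude regex link extraction, and the repository format (URL on the first line, as in the earlier sketch) are all simplifying assumptions, not a description of how any particular crawler does it.

import os
import re
from collections import defaultdict
from urllib.parse import urljoin, urlparse

# Crude link extraction; a real extractor would parse the HTML properly.
HREF_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def build_next_segment(repo_dir, crawled_file, segment_file):
    """Read stored pages, keep only uncrawled links, group them by domain, write the next segment."""
    with open(crawled_file, encoding="utf-8") as f:
        crawled = set(line.strip() for line in f)   # flat file; a real system needs something faster

    by_domain = defaultdict(set)
    for name in os.listdir(repo_dir):
        with open(os.path.join(repo_dir, name), encoding="utf-8") as f:
            base_url = f.readline().strip()         # the crawler stored the URL on line 1
            html = f.read()
        for href in HREF_RE.findall(html):
            url = urljoin(base_url, href)
            if url.startswith("http") and url not in crawled:
                by_domain[urlparse(url).netloc].add(url)

    # Keep each domain's URLs together so the crawler can reuse DNS lookups
    # and take a tight snapshot of each site.
    with open(segment_file, "w", encoding="utf-8") as out:
        for domain in sorted(by_domain):
            for url in sorted(by_domain[domain]):
                out.write(url + "\n")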

One benefit of this model is that the crawler requires almost no CPU resources. All it’s doing is downloading pages and storing them in some kind of repository. It might be possible to have both the crawler and the URL extractor running at the same time on a single machine. If that’s possible, you should do it. Otherwise you’ll have to use the “crawl-process-crawl” model: you create the initial segment and let the crawler complete it; when the crawler is done, you harvest the links to create another segment, start the crawler again, and so on. It seems clunky, but it’s quite effective. The major drawback to the crawl-process-crawl model is that you spend a lot of time not crawling, which means that your Internet connection is idle. If you’re paying big dollars for a fat pipe, that idle time is money down the drain.

The offline model is very effective for many types of crawling. If you want to exhaustively crawl a select number of sites (or the entire Internet), it works quite well. In general, if your crawler is essentially a very fast downloader that just stores whatever it finds, then this model is the way to go.

Adaptive crawling

The “secret” to doing Web-scale crawling at cable modem speeds is to limit what you crawl to just the important stuff: things that relate to whatever you’ve determined is relevant. That requires a crawler that can react in real time to the information it’s seeing; it has to be selective about what it downloads. This implies a live queue, a way to prioritize the URLs it encounters, and the ability to prune the queue when it becomes too large. Such a crawler is somewhat more complicated to write, but it’s amazingly effective. If you’re looking for specific types of documents, or documents about particular topics, an adaptive crawler can increase the number of relevant documents you find in a given period of time.

Before I get to the numbers, let’s talk about how an adaptive crawler works. The general algorithm is:

queue = LoadSeed()
while (queue is not empty)
{
    dequeue url
    request document
    store document

    // process document
    analyze document to determine if it's relevant
    update prioritizer with information about this document

    // extract and queue links
    parse document for links
    eliminate duplicate links
    prioritize links
    add prioritized links to queue
}
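
Here is what that loop might look like in Python, with the FIFO queue replaced by a priority queue. The nugget function, the prioritizer, and the fetch/extract/store helpers are all black boxes passed in by the caller; their names and interfaces are my own invention for this sketch, not a description of our crawler.

import heapq
import itertools

class Frontier:
    """Priority queue of (url, source-page) pairs; the highest priority pops first."""
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()   # tie-breaker so heapq never has to compare URLs
    def push(self, url, source, priority):
        heapq.heappush(self._heap, (-priority, next(self._tie), url, source))
    def pop(self):
        _, _, url, source = heapq.heappop(self._heap)
        return url, source
    def __len__(self):
        return len(self._heap)

def crawl_adaptive(seed_urls, fetch, extract_links, nugget, prioritizer, store):
    # nugget(doc) -> True if the document is relevant.
    # prioritizer.update(source_url, was_nugget) records an outcome;
    # prioritizer.score(source_url) rates links found on source_url.
    frontier = Frontier()
    seen = set(seed_urls)
    for url in seed_urls:
        frontier.push(url, source=None, priority=1.0)
    while frontier:
        url, source = frontier.pop()
        doc = fetch(url)
        if doc is None:
            continue
        store(url, doc)
        # Analyze the document and feed the outcome back to the prioritizer.
        relevant = nugget(doc)
        if source is not None:
            prioritizer.update(source, relevant)
        # Extract, de-duplicate, prioritize, and queue links.
        for link in extract_links(url, doc):
            if link not in seen:
                seen.add(link)
                frontier.push(link, source=url, priority=prioritizer.score(url))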

Note that I show storing the document, even if it isn’t relevant. Depending on your requirements, that might not be necessary. Our crawler, for example, doesn’t store HTML pages or other documents that aren’t videos. Even when it encounters a video, the crawler stores only metadata: information about the video such as the title, author, and length.

In addition to the above, an adaptive crawler typically has a background thread that continually goes through the queue, re-prioritizing links based on the ever-changing information that the crawler is collecting from analyzing the documents that it finds. When the queue fills up, the lowest-priority URLs are simply removed. If they’re important (i.e. contain relevant information), the crawler will see them again.
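
A sketch of that housekeeping pass, using the Frontier and prioritizer interfaces assumed in the previous sketch: re-score everything with the prioritizer’s current knowledge, and let the tail fall off when the queue exceeds its limit. In a real crawler this runs on its own thread and needs locking; here it pokes at the frontier’s internals for brevity.

def reprioritize_and_prune(frontier, prioritizer, max_size):
    """Re-score every queued URL and keep only the best max_size of them."""
    entries = [(url, source) for _, _, url, source in frontier._heap]
    scored = [(prioritizer.score(source) if source else 1.0, url, source)
              for url, source in entries]
    scored.sort(reverse=True)                      # best first
    frontier._heap = []
    for priority, url, source in scored[:max_size]:
        frontier.push(url, source, priority)       # low-priority URLs simply fall off the end;
                                                   # if they matter, the crawler will see them again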

We’ve found that re-prioritizing links in this way reduces overtraining. It also allows the crawler to adapt when a previously unproductive path becomes productive and, conversely, to stop crawling a path that has become unproductive.

As you can see, an adaptive crawler is quite a bit more complicated than a general-coverage crawler. It needs some way to analyze a document to derive a relevance score (what we call a “nugget function”), and some type of machine learning (ML) system that can continually learn and then prioritize URLs based on what it’s learned. Not only is this more complicated to build, it puts a much heavier load on the crawler’s memory and CPU resources.

The nature of the nugget function and the ML system will remain unspecified at this point. In later posts, I’ll discuss ways to determine a document’s relevance, and how to build a simple ML system. For now, we can treat those as black boxes.

With this model, the crawler can zero in on relevant information very quickly. If your nugget function returns true when the document is relevant and false when the document is not relevant, the ML system can reward or punish the path that led to the document. The ML system quickly learns that links from site A lead to nuggets 50% of the time, links from site B lead to nuggets only 2% of the time, and links from site C never lead to nuggets. The prioritizer can then prioritize links so that those from site A are crawled before links from site B, and links from site C aren’t crawled unless there’s nothing else to do.
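
The simplest stand-in for that learning is per-site bookkeeping: count how often links found on each site have led to nuggets, and score new links by the source site’s hit rate. The class below is a toy with a smoothing prior, not our ML system, but it fits the prioritizer interface assumed in the earlier sketch.

from collections import defaultdict
from urllib.parse import urlparse

class SourceSitePrioritizer:
    """Scores a link by how often links found on its source site have led to nuggets."""
    def __init__(self, prior_hits=1, prior_tries=10):
        self.hits = defaultdict(int)    # nuggets reached via links from each site
        self.tries = defaultdict(int)   # pages crawled via links from each site
        self.prior_hits = prior_hits    # mild optimism so new sites still get explored
        self.prior_tries = prior_tries
    def update(self, source_url, was_nugget):
        site = urlparse(source_url).netloc
        self.tries[site] += 1
        if was_nugget:
            self.hits[site] += 1
    def score(self, source_url):
        site = urlparse(source_url).netloc
        return (self.hits[site] + self.prior_hits) / (self.tries[site] + self.prior_tries)

With counts like the ones in the example above (site A at 50 percent, site B at 2 percent, site C at zero), score() orders the queue exactly the way that paragraph describes.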

This works amazingly well.

We modified the ML system of our crawler so that it would return a random priority value for every URL given to it. That simulates randomly crawling the Web. We fed the crawler a list of 100 starting URLs, and told it to find videos. After crawling for several hours, we turned it off and measured its effectiveness. Simulating a random crawl, we found approximately one video in every 500 pages crawled. We then re-enabled the ML system and ran the test again, using the same starting point. Within two hours, the crawler was finding, on average, one new video in every 12 pages it examined. That early ML system made the crawler 40 times more productive.

Our crawler is geared towards finding video, but it would be a relatively small matter to change the nugget function and tell it to look for anything we like. An example I like to use is steam trains. I recently did a Google search for [steam trains], and got approximately 794,000 results. If we were to crawl the Web at random–just grab any old URL and download the document–then, on average, we would expect to find one document about steam trains in every 25,000 documents we examine. That’s assuming, of course, that we prevent the crawler from looking at things like videos, images, and other types of files that aren’t “documents.” We would have to examine 25 million documents in order to get 1,000 references to steam trains, and there’s no telling how relevant those documents might be. For example, we might find this page, which mentions steam trains but isn’t really relevant to them.

Even at 30 pages per second, you’re talking 833 seconds (almost 14 minutes) between hits. So if you crawled all day you’d find about 100 pages that mention steam trains, and of those a large number would be like the one I linked above–it mentions steam trains but isn’t relevant.

I would expect the behavior to be much different with the adaptive crawler. When it first started up, it would act like a naive random crawler, and its performance would be very similar. But as it began to find documents about steam trains, the URL prioritization would cause it to begin finding relevant documents much faster. Within a few hours, it would be finding a new document about steam trains every 20 seconds or so, and it would likely do even better than that in spurts. The crawler would focus on clusters of relevant documents, find links to other clusters, eventually exhaust those clusters, and wander about aimlessly until it found new clusters.

I haven’t actually run that test with steam trains, but my experience with the crawler’s behavior when searching for video gives me some strong evidence that the crawler would likely find more than 10,000 relevant documents in a day, starting from scratch, using cable modem speeds. In addition, the average relevance of the documents would be much higher because the ML system would favor links from highly relevant sources over random links. If we gave the crawler a good starting point (say, the Wikipedia page on steam trains, or a steam train enthusiast’s site), its performance would be even better because it would start with a source of high-quality links (i.e. links to pages that are very likely to be relevant).

Crawling period

Throughout this discussion, I have not mentioned the dynamic nature of the Web. It’s constantly changing. It changes so fast that no search engine can be completely up to date. Early search engines worked by taking a “snapshot” of the Web over some time period. That is, they’d set their crawlers to work and crawl, say, 20 billion pages in a few days. Then the indexer would come along and start building an index from the stored documents while the crawlers started over, building a new repository that the next index run would use. The larger search engines don’t typically work this way anymore, but many smaller scale crawlers do.

A “crawl,” when using this model, is a unit of work. It might be measured by time (“today’s crawl”), or by the number of documents or sites indexed. How you measure it doesn’t really matter for this discussion. The point is that you don’t want to visit the same page more than once per crawl, and the crawl defines the crawling period.

That’s fine when you’re using an offline model. An adaptive crawler, though, is often run continuously, and is expected to cover old ground from time to time in order to find new material. That steam train aficionado, for example, might have posted a new article, and we want our crawler to find that new material as soon as possible.

In that model, there isn’t a “crawl” as a unit of work. As a result, you need some mechanism that allows the crawler to re-crawl a URL from time to time. I’ll cover how that’s done in a later post.

Which model to use?

If you’re building a general coverage crawler, or a crawler that’s expected to index pre-identified sites or pages, then the offline crawler is the way to go. It’s easier to build and operate, and you can take advantage of batching URLs by domain.

Adaptive crawlers are mostly used for exploration: “find documents that meet these criteria.” They are much harder to build than typical crawlers, but they can find more relevant information faster, and with less bandwidth.

One thing that discourages people from writing crawlers is the idea that they need a huge data center, a fat pipe, a distributed file system, and a parallel processing cluster in order to make it work. That’s just not so! If you’re looking for a relative handful of documents, you can operate an adaptive crawler on a single machine hooked to a cable modem. Even at five megabits per second, you can exhaustively search the Web for information on any topic in just a few days. The trick is limiting where you look and what you look at.

I know that’s a lot to digest in one blog entry, but the choice of crawling model affects what you can do with your crawler. The rest of this series is going to focus on adaptive crawlers, since that’s what I’m most familiar with, and information about offline crawlers is readily available. In addition, I think most people who are interested in finding business intelligence on the Web are better served by a crawler that seeks out new sources of information rather than depending on a human to identify the sources.

In the next post, I’ll be talking about politeness, which is the most important aspect of operating a Web crawler. I’ll probably follow that up with a discussion of queue management, which turns out to be closely related to politeness in many respects.