This is the fourth in a series of posts about writing a Web crawler. Read the Introduction for background and a table of contents. The previous entry is Politeness.
In the discussion of Crawling Models, I distinguished between two types of crawlers: “offline” crawlers that download documents and have a separate process create queue segments that are submitted as seeds to the crawler, and “online” crawlers that maintain a live queue. Queue management for these two types of crawlers is quite different, with the offline crawler presenting a somewhat easier problem to solve. Queue management for the online crawler gets rather complicated.
To review, the general processing model of the offline crawler is:
queue = LoadSeed();
while (queue not empty)
{
    dequeue URL
    request document
    store document
}
A separate program scans the downloaded documents and creates seeds (or segments) that are fed to the crawler. This queue-building program maintains a list of URLs that have already been crawled. As it scans the documents the crawler downloads, it discards URLs that have already been crawled or that have already been submitted in a seed. The remainder are added to the new seed. When the seed reaches some threshold size, it is typically sorted by domain and then fed to the crawler.
If you think about that for a moment, you’ll recognize that the crawler can’t simply read the queue sequentially. If it did, it would crawl all of the URLs for domain A, then domain B, and so on. That would violate politeness unless the crawler delayed after each request, and if the crawler did delay after each request, it would spend most of its time waiting and doing nothing.
What happens is that when the crawler loads the seed, it creates a queue of domains, and each domain has a queue of URLs. When the crawler goes to dequeue a URL, it dequeues a domain, dequeues a URL from that domain’s queue, crawls it, and then adds the domain back to the end of the queue. If there are 100 domains in the queue, the first domain is visited once, and the other 99 domains are visited before the first domain comes up again.
That’s a somewhat simplified explanation of one way to manage the queue, but it’s very effective. Using that technique almost guarantees that the crawler won’t hammer any domain with too many requests in too short a time. The nice thing is that it works well for a single-threaded crawler and for a multi-threaded crawler. There’s some synchronization involved in managing the queue when you have multiple threads, but that’s easily handled with a simple mutual exclusion lock. As long as a thread obtains the lock on the domain queue before dequeuing or enqueuing an item, things work well. You could use a concurrent lock-free queue rather than a lock, but the truth is that the lock just won’t be contended very often. If you’re crawling on a cable modem, you won’t be able to do more than about 30 requests per second, meaning that you won’t have to take that lock more than about 60 times per second (once to dequeue and once to enqueue for each request). It’s just not an issue.
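To make the round-robin idea concrete, here’s a minimal sketch in Python of that kind of domain queue, protected by a single lock. The class and method names are mine, not from any particular crawler, and a production version would hold a richer domain record than just a name.

import collections
import threading

class DomainQueue:
    """Round-robin queue of domains, each with its own queue of URLs."""

    def __init__(self):
        self.lock = threading.Lock()
        self.ready = collections.deque()      # domains with URLs left to crawl
        self.urls = {}                        # domain name -> deque of URLs

    def add_url(self, domain, url):
        with self.lock:
            if domain not in self.urls:
                self.urls[domain] = collections.deque()
                self.ready.append(domain)
            self.urls[domain].append(url)

    def next_url(self):
        """Take one URL from the domain at the front, then send that domain to the back."""
        with self.lock:
            if not self.ready:
                return None
            domain = self.ready.popleft()
            url = self.urls[domain].popleft()
            if self.urls[domain]:
                self.ready.append(domain)     # other domains get a turn first
            else:
                del self.urls[domain]         # that domain's queue is exhausted
            return domain, url

A crawler thread just calls next_url(), crawls the result, and repeats; the lock is held only for the instant it takes to move items between queues.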
That simple domain queue handles the general case very well because there are usually many thousands of domains in the queue. But you’ll find that over time the number of domains decreases because most domains will have only a few URLs to crawl, and some domains will have hundreds or thousands. As the queue size decreases, the time between requests to the same domain decreases. At the end of a crawl, you could have very few domains in the queue. A naive implementation would end up making requests to those domains too quickly.
In general, there are two ways to handle that situation. The easiest is to have the crawler stop and save the remaining seed when the number of domains drops below some threshold. The process that’s controlling the crawler can then combine the remaining seed with the next seed to be crawled. If you have the system set up so that the crawler can request a new seed, then you can have the crawler request a new seed when the queue becomes too small. The crawler would then load that seed and add it to the existing queue.
Both of those solutions are simple to implement, but they will still violate politeness if there is no new seed to be had. In addition, the general model might not be sufficient to guarantee politeness. Remember, different domains have different politeness rules: specifically, the Crawl-Delay. One domain might allow requests every 15 seconds, and another might have a delay value of 60 seconds or more. The simple queues that I outlined above do not take that into account.
A better way to handle things is to create another queue–a priority queue–that is kind of like a penalty box. After you crawl a URL from a particular domain, you add that domain to the penalty box, with an expiration time that’s equal to the current time plus the delay value. When that domain’s delay time has expired, you remove it from the penalty box and add it to the back of the domain queue. It looks something like this:
queue = LoadSeed();
while (queue not empty)
{
    dequeue domain
    dequeue URL from domain
    request document
    store document
    if domain queue not empty
    {
        add domain to penalty box
    }
}
You then need a way to remove things from the penalty box. One way is to have the downloader thread do it after crawling. That is, add the following to the loop:
while (penalty box not empty AND first item release time < now)
{
    remove item and add to domain queue
}
That’s simple and effective, but somewhat wasteful. Another way is to set up a timer that checks the penalty box every second. That requires less processor time, and it just doesn’t matter if a domain remains in the penalty box for an extra second. Either way you do it, you’re guaranteeing politeness. Using the penalty box will prevent you from hitting any domain too often.
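One way to picture the penalty box is as a small min-heap keyed on release time. The sketch below uses my own naming and is not the implementation described above; the downloader loop (or a once-a-second timer) calls pop_expired() and appends whatever comes back onto the domain queue.

import heapq
import time

class PenaltyBox:
    """Holds domains until their per-domain crawl delay has expired."""

    def __init__(self):
        self.heap = []    # (release_time, domain), smallest release time first

    def add(self, domain, delay_seconds):
        heapq.heappush(self.heap, (time.time() + delay_seconds, domain))

    def pop_expired(self):
        """Return every domain whose release time has passed."""
        released = []
        now = time.time()
        while self.heap and self.heap[0][0] <= now:
            released.append(heapq.heappop(self.heap)[1])
        return released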
I’m implying here that the domain record contains more than the domain name and a queue of URLs. You’ll also need to maintain the robots.txt information for each domain, including the Crawl-Delay value, if any. If there is no Crawl-Delay, you should start with a default value (probably one minute), and then keep a running average of the amount of time it takes to request, download, and process a response for that domain. Save the last five times, average them, and use that for the delay time, perhaps multiplying by some constant first. Again, I would recommend a delay of at least 15 seconds between requests to the same site. Larger sites might not mind you making requests more frequently, but smaller sites will be annoyed.
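Here’s a rough sketch of that heuristic. The 60-second default and 15-second floor come from the paragraph above; the multiplier is an assumed constant that you would tune for your own crawler.

import collections

DEFAULT_DELAY = 60.0     # seconds, used until we have timing data for a domain
MINIMUM_DELAY = 15.0     # never hit the same site more often than this
MULTIPLIER = 10.0        # assumed scaling constant; tune to taste

class DomainTiming:
    def __init__(self):
        self.recent = collections.deque(maxlen=5)   # last five request times

    def record(self, elapsed_seconds):
        self.recent.append(elapsed_seconds)

    def crawl_delay(self, robots_delay=None):
        """Prefer the robots.txt Crawl-Delay; otherwise derive a delay from recent timings."""
        if robots_delay is not None:
            return max(robots_delay, MINIMUM_DELAY)
        if not self.recent:
            return DEFAULT_DELAY
        average = sum(self.recent) / len(self.recent)
        return max(average * MULTIPLIER, MINIMUM_DELAY)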
You’ll also notice that the pseudocode above doesn’t add a domain back to the domain queue if there are no URLs remaining in its queue. The implication is that the domain is discarded. That’s perhaps a reasonable thing to do if you’re running an offline crawler, although doing so also discards the computed delay value, the robots.txt file, and any other information you’ve collected for that domain. That’s okay, although not optimal, for an offline crawler, but it is not something you want to do if you have a live queue that’s continually updated. If you did, you’d be downloading robots.txt far more often than you need to.
A better solution for the offline crawler is to have some persistent way to store information about a domain. So if the first seed you crawl has a dozen URLs for domain A and the second seed has a dozen more for the same domain, you don’t have to download robots.txt again. Instead, you can just load the persisted information. Of course, you’ll want to download robots.txt again if it’s been more than 24 hours (or whatever your cache time is) since the last time you downloaded it, but if it’s only been a few hours there’s no reason to request the document again.
With the offline crawler, you can probably get away with checking the robots.txt file when you load a new seed, provided you don’t expect the crawler to take more than 24 hours to exhaust a seed. If your crawler will be long-running (more than a day), then you’ll want to check every time before you crawl a URL. For example:
dequeue domain
if (domain.robotstxt.downloadtime < now - cachetime)
{
    request robots.txt from domain
    domain.robotstxt.downloadtime = now
}
dequeue URL from domain
if (can crawl URL)    // check domain.robotstxt
{
    request document
    store document
}
// etc.
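If you happen to be working in Python, the standard library’s urllib.robotparser will do the parsing for you. The sketch below assumes a 24-hour cache and a domain record shaped roughly the way I’ve been describing; the field names are mine.

import time
import urllib.robotparser

ROBOTS_CACHE_SECONDS = 24 * 60 * 60    # re-fetch robots.txt after a day

class DomainRecord:
    def __init__(self, domain):
        self.domain = domain
        self.robots = urllib.robotparser.RobotFileParser()
        self.robots_fetched_at = 0.0      # never fetched

    def refresh_robots_if_stale(self):
        if time.time() - self.robots_fetched_at > ROBOTS_CACHE_SECONDS:
            self.robots.set_url("http://%s/robots.txt" % self.domain)
            self.robots.read()            # downloads and parses robots.txt
            self.robots_fetched_at = time.time()

    def can_crawl(self, url, user_agent="mycrawler"):
        self.refresh_robots_if_stale()
        return self.robots.can_fetch(user_agent, url)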
Things get complicated in a hurry, don’t they? It gets worse, but for now I’ll assume that you have some sort of domain persistence mechanism from which the crawler can get information. In a later post I’ll talk about how to build that.
If you have an offline crawler, queue management really can be as simple as what I showed above. In the simplest model, the crawler is just a consumer; it never adds to a domain’s URL queue. Domains come out of the queue, go into the penalty box, and then come out of the penalty box and go back into the domain queue. When a domain’s URL queue is exhausted, that domain is discarded. Even a multi-threaded crawler of this type requires very little “smarts” in the queue management.
The offline queue builder, too, is simple in concept. Implementation can be difficult due to the huge amount of data it has to deal with, but for a small-scale (single machine) crawler, it’s quite manageable. At the base level, it’s just a huge sort/merge operation. The general algorithm is:
for each downloaded document
{
    scan document to extract URLs
    for each extracted URL
    {
        if (URL is not in index)
        {
            place URL in queue
            store URL in index
        }
    }
}
Sort URLs in queue by domain
Save sorted queue
The harder part is maintaining the index. A relational database is effective, but not terribly efficient for this. If the number of URLs you plan to save in the index is less than 100 million, you can probably load the index from disk at the start of a run and then write it back to disk when you’re done. This will eat up lots of memory, though, since the average URL length will be in the neighborhood of 80 characters. If your index is larger than available memory, you have to come up with some hybrid solution. If you’re willing to accept some small percentage of false positives (URLs that your index says were crawled, but in actuality weren’t), then a Bloom filter is a very effective and memory efficient way to manage your index. In my experience, the benefits of a Bloom filter far outweigh the drawbacks of a few false positives.
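You don’t need a third-party library to get a workable Bloom filter; a few lines of Python will do. This is only a sketch (the bit-array size and hash count are rule-of-thumb values, not numbers from any real crawler), but it shows the idea: false positives are possible, false negatives are not.

import hashlib

class BloomFilter:
    """Space-efficient 'probably already seen' set for URLs."""

    def __init__(self, size_bits=1 << 27, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)    # 1 << 27 bits is a 16 MB array

    def _positions(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            # carve several independent-ish hash values out of one SHA-256 digest
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

The queue builder’s check then becomes: if the URL is not in the filter, add it to the filter and place it in the queue.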
You’ll also want some way to filter URLs and domains that are unlikely to give you useful information. Remember, there are many more URLs than you’ll ever be able to crawl. In fact, the existence of mirrors, proxies, buggy automatic URL generation, and purposely-created crawler traps makes the number of URLs essentially infinite. Even without those problems (which I’ll discuss in detail in a later post), your crawler will uncover online forums, bulletin boards, social networks, and other sites that have no useful information. You don’t want to spend your precious resources looking at unproductive sites.
If all you want is to crawl sites that you’ve identified, then the problem becomes pretty easy. You just create a table that contains the sites you want to crawl (a white list), and any URL that doesn’t match an item in the table is discarded. Your table can be a list of domain names, or it can use regular expressions that are matched against each URL. The drawback to using regular expressions, of course, is that matching each URL against dozens or hundreds of regular expressions is going to be very expensive unless you use a more advanced algorithm similar to what the Unix grep program uses. Implementation is left as an exercise for the reader.
Handling redirects
There are two types of redirects. The permanent redirect, HTTP status code 301 (Moved Permanently), is supposed to be used for things that have been moved from their original location. When a Web browser encounters a 301, it caches the redirect location (the Location header that is returned with the 301 response) so that the next time you request the original location, the browser automatically requests the new location. The 302 status code (Found) is a temporary redirect. Browsers aren’t supposed to cache any information about temporary redirects.
Redirects work well for Web browsers, but they make crawling more complicated. One way to handle them is to ignore them. That is, rather than handle them specially, just make your HTTP component automatically follow redirects. If your crawler requests http://example.com/file1.html, and it redirects to http://example.com/content/coolstuff.html, then the crawler gets whatever document is stored at the redirect location. You can examine the Content-Location response header to get the URL of the actual document–the redirect destination.
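For example, if your HTTP component happens to be the Python requests library (just one possibility), it follows redirects automatically and reports where the content actually came from:

import requests

response = requests.get("http://example.com/file1.html", timeout=30)
final_url = response.url                               # the redirect destination
redirect_chain = [r.url for r in response.history]     # any intermediate 3xx responses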
While that works, it has some drawbacks. First, it violates politeness because the HTTP component will make multiple requests in quick succession. That’s not much of a problem in a single-threaded crawler, because a browser would do the same thing. But in a multi-threaded crawler it could (and often does) result in many concurrent requests to the same domain. Imagine that you have two threads. The first requests http://example1.com/file1.html and the second requests http://example2.com/file2.html. The “example” domains, though, are owned by the same company, and they’ve set it up so that any request for http://exampleN.com/* redirects to http://allexamples.com/N/*. So the first thread issues a request for http://allexamples.com/1/file1.html, and the second issues a request for http://allexamples.com/2/file2.html. You now have two threads issuing concurrent requests to the same domain, which violates politeness. I can tell you from experience that this can and does happen.
Fortunately, it doesn’t happen very often. When it does happen, it’s usually on a very large site that won’t usually notice the little bit of bandwidth that your crawler consumes. In almost all cases, politeness violations caused by redirects won’t get you into trouble. But there are other reasons you don’t want to automatically follow redirects.
If you automatically follow redirects, you bypass your URL filtering rules. Going to http://example.com/badthing/ could redirect you to http://example.com/badthing/bigfile.pdf, meaning that your crawler will end up downloading a PDF document that it can’t use. Or perhaps an image, a video, a Word document, etc. Granted, your crawler has to have logic that discards unusable documents, since it’s possible that requesting an HTML file will return an image, so this isn’t actually a new problem. But all of your URL filtering is bypassed. If the redirect location is in a domain that’s on your exclusion list, you’re going to violate politeness. If the redirect location is a document that you’ve already seen, you’re going to get duplicate documents.
And if there are multiple levels of redirection, your problems are multiplied. Double redirection is common, and triple redirection isn’t exactly rare. I’ve found that anything beyond three is usually some kind of spam site or crawler trap, so I cap my redirect following to just three levels. HTTP components that allow you to automatically follow redirects typically allow you to set the maximum redirection count in order to prevent infinite redirection (also not uncommon) from putting your crawler thread into an infinite loop.
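With the requests library, for instance, that cap is a single session setting; exceeding it raises an exception rather than looping forever:

import requests

session = requests.Session()
session.max_redirects = 3    # raises requests.TooManyRedirects beyond three hops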
Automatically following redirects is a really bad idea, and manually following redirects is hard. It complicates the crawler. But you have to do it if you want to crawl the Web on a large scale. You have several options when it comes to implementing support for redirected URLs; I’ll present some of them below. All but the simplest will greatly complicate the crawler code.
The first option is to treat all redirects the same: ignore the difference between temporary and permanent redirects. As far as the crawler is concerned, all redirects are permanent for that crawling session.
When the crawler encounters a redirect, it should mark the original URL as crawled and also store a document that contains the redirect location (the target URL) as a link. The offline queue builder program can then process that document in the normal way and the next queue segment will contain the redirect location. This approach has the benefit of being dead simple. The only problem is one of latency; you have to wait until the next queue segment is processed before the crawler downloads the target document.
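Here is a sketch of that approach, again using requests (with automatic redirect following turned off) and two hypothetical storage helpers that stand in for whatever your document store looks like.

import requests
from urllib.parse import urljoin

def crawl(url, storage):
    response = requests.get(url, allow_redirects=False, timeout=30)
    if response.is_redirect or response.is_permanent_redirect:
        target = urljoin(url, response.headers.get("Location", ""))
        # Store a tiny synthetic document whose only link is the redirect target;
        # the offline queue builder will pick it up like any other extracted URL.
        storage.mark_crawled(url)
        storage.store_document(url, '<a href="%s"></a>' % target)
    else:
        storage.mark_crawled(url)
        storage.store_document(url, response.text)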
If you want to handle the redirect within the crawler, then things get a bit more complicated. When the crawler encounters a redirect, it has to get the target URL from the Location response header. Then it needs to filter that URL. How much filtering you do depends on how strictly you want to adhere to the politeness policy. At minimum, the crawler should check that URL against the exclusion list (see the article on Politeness) so that your crawler doesn’t hit a domain that you’ve been asked not to crawl. The crawler also needs to check the URL against the robots.txt information for the target domain. And if you don’t have a record for that domain, you need to create one, download the robots.txt information, etc. Things get complicated in a hurry. You will find that you cannot make the following work.
// BUGGY CODE - DO NOT USE
dequeue URL
request document
while (response is a redirect)
{
    filter URL
    request target URL
}
There are just too many things going on in the “filter URL” phase, and too much can go wrong. You’re better off having the crawling thread add the target URL to the target domain’s URL queue. That complicates queue management, because you then need some way to look up a domain’s information by name, and a way to synchronize access to a domain’s URL queue. Although it seems like this complicates the crawler, the truth is that you probably would have added this more advanced queue management anyway. A crawler with a live queue requires more advanced queue management, so I’ll defer detailed discussion to the next post. You don’t need all of that complexity in order to handle redirects, but you do need the basics. This is also related to the domain persistence discussion that I put off until a later post.
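The heart of the in-crawler approach is a thread-safe lookup of the target domain. Here is a rough sketch, reusing the DomainQueue idea from earlier; the exclusion list is a plain set, and the robots.txt check is only noted, since it may itself require a download.

from urllib.parse import urljoin, urlparse

def handle_redirect(original_url, location_header, domain_queue, excluded_domains):
    target = urljoin(original_url, location_header)
    domain = (urlparse(target).hostname or "").lower()
    if not domain or domain in excluded_domains:
        return                                   # honor the exclusion list
    # robots.txt check for the target domain would go here
    domain_queue.add_url(domain, target)         # crawled on a later round-robin pass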
It’s harder than it looks
I think by now you’re getting the idea that writing an effective and efficient Web crawler is a non-trivial bit of work. What looks like a simple queue-based graph traversal problem turns out to be very complicated. Even without redirects, the simple offline crawler is difficult because of politeness concerns, and redirects make things even harder. The one consolation is that maintaining the queue is simplified because it’s handled by an offline process that has plenty of time, memory, and disk resources.
In my next post, I’ll talk about queue management for the adaptive crawler. That’s much more complicated because the queue becomes a much more dynamic structure that has to fit into memory, and the crawler needs to update it efficiently.