Fun with text encodings

Text encoding is a programmer’s nightmare. Life was much simpler when I didn’t have to know anything other than the ASCII character set. Even having to deal with extended ASCII–characters with values between 128 and 255–wasn’t too bad. But when we started having to work with international character sets, multibyte character sets, and the Unicode variants, things got messy quick. And they’re going to remain messy for a long time.

When I wrote a Web crawler to gather information about media files on the Web, I spent a lot of time making it read the metadata (ID3 information) from MP3 music files. Text fields in ID3 Version 2 are marked with an encoding byte that says which character encoding is used. The recognized encodings are ISO-8859-1 (an 8-bit character set for Western European languages, including English), two 16-bit Unicode encodings, and UTF-8. Unfortunately, many tag editors would write the data in the computer’s default 8-bit character set (Cyrillic, for example) and mark the fields as ISO-8859-1.

That’s not a problem if the resulting MP3 file is always read on that one computer. But then people started sharing files and uploading them to the Web, and the world fell apart. Were I to download a file that somebody in Russia had added tags to, I would find the tags unreadable because I’d be trying to interpret his Cyrillic character set as ISO-8859-1. The result is commonly referred to as mojibake.

The Cyrillic-to-English problem isn’t so bad, by the way, but when the file was originally tagged in, say, Shift JIS, the result is disastrous.

My Web crawler would grab the metadata, interpret it as ISO-8859-1, and save it to our database in UTF-8. We noticed early on that some of the data was garbled, but we didn’t know quite what the problem was or how widespread it was. Because we were a startup with too many things to do and not enough people to do them, we just let it go at the time, figuring we’d get back to it.

When we did get back to it, we discovered that we had two problems. First, we had to figure out how to stop getting mangled data. Second, we had to figure out a way to un-mangle the millions of records we’d already collected.

To correct the first problem, we built trigram models for a large number of languages and text encodings. Whenever the crawler ran across a field that was marked as containing ISO-8859-1, it would run the raw, undecoded bytes through the models to determine the likely encoding. The crawler then used that encoding to interpret the text. That turned out to be incredibly effective, and it almost eliminated the problem of adding new mangled records.
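
To give you the flavor of the detection step, here’s a minimal sketch in Python. It assumes you’ve already built per-encoding trigram tables (dictionaries mapping trigrams to log probabilities); the candidate encodings listed are just examples, not the set we actually used.

import math

CANDIDATES = ["iso-8859-1", "windows-1251", "shift_jis", "utf-8"]  # illustrative set

def score(text, model):
    """Sum log probabilities of all trigrams in the decoded text."""
    floor = math.log(1e-8)                     # penalty for unseen trigrams
    return sum(model.get(text[i:i+3], floor) for i in range(len(text) - 2))

def guess_encoding(raw_bytes, models):
    """models: dict mapping encoding name -> trigram log-probability table."""
    best, best_score = None, float("-inf")
    for enc in CANDIDATES:
        try:
            text = raw_bytes.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue                           # bytes aren't valid in this encoding
        s = score(text.lower(), models[enc])
        if s > best_score:
            best, best_score = enc, s
    return best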

Fixing the existing mangled records turned out to be a more difficult proposition. The conversion from mangled ISO-8859-1 to UTF-8 resulted in lost data in some circumstances, and we couldn’t do anything about that. In other cases the conversion resulted in weird accented characters intermixed with what looked like normal text. It was hard to tell for sure sometimes because none of us knew Korean or Chinese or Greek or whatever other language the original text was written in. Un-mangling the text turned out to be a difficult problem that we never fully solved in the general case. We played with two potential solutions.

The first step was the same for both solutions: we ran text through the trigram model to determine if it was likely mangled, and what the original language probably was.

For the first solution attempt, we’d step through the text character by character, using the language model to tell us which trigrams were likely to appear at each position and comparing that against the trigram that did appear. If we ran across a trigram that was very unlikely or totally unknown (for example, the trigram “zqp” is not likely to occur in English), we’d replace the offending trigram with one of the highly likely trigrams. This required a bit of backtracking, and it would often generate some rather strange results. The solution worked, after a fashion, but not well enough.

For the second attempt we selected several mangled records for every language we could identify and then re-downloaded the metadata. By comparing the mangled text with the newly downloaded and presumably unmangled text, we created a table of substitutions. So, for example, we might have determined that “zqp” in mangled English text should be replaced by the word “foo.” (Not that we had very much mangled English text.) We would then go through the database, identify all the mangled records, and do the substitutions as required.

That approach was much more promising, but it didn’t catch everything and we couldn’t recover data that was lost in the original translation. Ultimately, we decided that it wasn’t a good enough solution to go through the effort of applying it to the millions of mangled records.

Our final solution was extremely low-tech. We generated a list of the likely mangled records and created a custom downloader that went out and grabbed those records again. It took a lot longer, but the result was about as good as we were likely to get.

Writing a Web Crawler: Queue Management, Part 1

This is the fourth in a series of posts about writing a Web crawler. Read the Introduction for background and a table of contents. The previous entry is Politeness.

In the discussion of Crawling Models, I distinguished between two types of crawlers: “offline” crawlers that download documents and have a separate process create queue segments that are submitted as seeds to the crawler, and “online” crawlers that maintain a live queue. Queue management for these two types of crawlers is quite different, with the offline crawler presenting a somewhat easier problem to solve. Queue management for the online crawler gets rather complicated.

To review, the general processing model of the offline crawler is:

queue = LoadSeed();
while (queue not empty)
{
    dequeue URL
    request document
    store document
}

A separate program scans the downloaded documents and creates seeds (or segments) that are fed to the crawler. This queue building program maintains a list of URLs that have already been crawled. As it reads documents that the crawlers download, it discards those that have already been crawled or that have already been submitted in a seed. The remainder get added to the new seed. When the seed reaches some threshold, it is typically sorted by domain and then fed to the crawler.

If you think about that a moment, you’ll recognize that the crawler can’t then read the queue sequentially. If it did, then it would crawl all of the URLs for domain A, then domain B, etc. That would violate politeness if the crawler didn’t delay after each request. And if the crawler did delay after each request, it would spend most of its time waiting doing nothing.

What happens is that when the crawler loads the seed, it creates a queue of domains, and each domain has a queue of URLs. When the crawler goes to dequeue a URL, it dequeues a domain, dequeues a URL from that domain’s queue, crawls it, and then adds that domain back to the queue. If there are 100 domains in the queue, then, the first domain is visited once, followed by the other 99 domains before the first domain is visited again.

That’s a somewhat simplified explanation of one way to manage the queue, but it’s very effective. Using that technique almost guarantees that the crawler won’t hammer any domain with too many requests in too short a time. The nice thing is that it works well for a single-threaded crawler and for a multi-threaded crawler. There’s some synchronization involved in managing the queue when you have multiple threads, but that’s easily managed with a simple mutual exclusion lock. As long as a thread obtains a lock on the domain queue before dequeuing or enqueuing an item, things work well. You could use a concurrent lock-free queue rather than a lock, but the truth is that the lock just won’t be contended very often. If you’re crawling on a cable modem, you won’t be able to do more than about 30 requests per second, meaning that you won’t have to take that lock more often than 60 times per second. It’s just not an issue.
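
To make that concrete, here’s a minimal sketch of the round-robin domain queue in Python. The names are mine, error handling is omitted, and it stands in for whatever structure your crawler actually uses.

import threading
from collections import deque

class DomainQueue:
    def __init__(self):
        self.domains = deque()          # queue of (domain, deque-of-URLs) pairs
        self.lock = threading.Lock()

    def load_seed(self, urls_by_domain):
        for domain, urls in urls_by_domain.items():
            if urls:
                self.domains.append((domain, deque(urls)))

    def next_url(self):
        """Dequeue a domain, take one URL, and put the domain at the back."""
        with self.lock:
            if not self.domains:
                return None
            domain, urls = self.domains.popleft()
            url = urls.popleft()
            if urls:                    # domain still has work; requeue it
                self.domains.append((domain, urls))
            return url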

That simple domain queue handles the general case very well because there are usually many thousands of domains in the queue. But you’ll find that over time the number of domains decreases because most domains will have only a few URLs to crawl, and some domains will have hundreds or thousands. As the queue size decreases, the time between requests to the same domain decreases. At the end of a crawl, you could have very few domains in the queue. A naive implementation would end up making requests to those domains too quickly.

In general, there are two ways to handle that situation. The easiest is to have the crawler stop and save the remaining seed when the number of domains drops below some threshold. The process that’s controlling the crawler can then combine the remaining seed with the next seed to be crawled. If you have the system set up so that the crawler can request a new seed, then you can have the crawler request a new seed when the queue becomes too small. The crawler would then load that seed and add it to the existing queue.

Both of those solutions are simple to implement, but they will still violate politeness if there is no new seed to be had. In addition, the general model might not be sufficient to guarantee politeness. Remember, different domains have different politeness rules: specifically the Crawl-Delay. One domain might allow requests every 15 seconds, and another might have a delay value of 60 seconds or more. The simple queues that I outlined above do not take that into account.

A better way to handle things is to create another queue–a priority queue–that is kind of like a penalty box. After you crawl a URL from a particular domain, you add that domain to the penalty box, with an expiration time that’s equal to the current time plus the delay value. When that domain’s delay time has expired, you remove it from the penalty box and add it to the back of the domain queue. It looks something like this:

queue = LoadSeed();
while (queue not empty)
{
    dequeue domain
    dequeue URL from domain
    request document
    store document
    if domain queue not empty
    {
        add domain to penalty box
    }
}

You then need a way to remove things from the penalty box. One way is to have the downloader thread do it after crawling. That is, add the following to the loop:

    while (penalty box not empty AND first item release time < now)
    {
        remove item and add to domain queue
    }

That’s simple and effective, but somewhat wasteful. Another way is to set up a timer that checks the penalty box every second. That’s going to require less processor time, and it’s just not going to matter if a domain remains in the penalty box for an extra second. Either way you do it, you’re guaranteeing politeness. Using the penalty box will prevent you from hitting any domain too often.
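
Here’s a rough sketch of a penalty box built on a min-heap keyed by release time. It’s illustrative only; the names are mine, and the requeue call at the bottom stands in for whatever your domain queue provides.

import heapq
import time

class PenaltyBox:
    def __init__(self):
        self.heap = []                              # (release_time, domain) pairs

    def add(self, domain, delay_seconds):
        heapq.heappush(self.heap, (time.time() + delay_seconds, domain))

    def release_expired(self):
        """Return all domains whose delay has expired."""
        released = []
        now = time.time()
        while self.heap and self.heap[0][0] <= now:
            released.append(heapq.heappop(self.heap)[1])
        return released

# In the crawl loop, or on a once-a-second timer:
#   for domain in penalty_box.release_expired():
#       domain_queue.requeue(domain)      # hypothetical method on your domain queue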

I’m implying here that the domain record contains more than the domain name and a queue of URLs. You’ll also need to maintain the robots.txt information for each domain, including the Crawl-Delay value if any. If there is no Crawl-Delay, you should start with a default value (probably one minute), and then keep a running average of the amount of time it takes to request, download, and process a response for that domain. Save the last five times, average them, and use that for the delay time, perhaps multiplying by some constant first. Again, I would recommend a delay time of at least 15 seconds between requests to the same site. Larger sites might not mind you making requests more frequently, but smaller sites will be annoyed.

You’ll also notice in the above code that it doesn’t add a domain back to the domain queue if there are no URLs remaining in its queue. The implication is that the domain is discarded. That’s perhaps a reasonable thing to do if you’re running an offline crawler, although doing so will also discard the computed delay value, the robots.txt file, and any other information you’ve collected for that domain. That’s okay, although not optimal, for an offline crawler, but it is not something you want to do if you have a live queue that’s continually updated. If you did, you’d be downloading robots.txt far more often than you need to.

A better solution for the offline crawler is to have some persistent way to store information about a domain. So if the first seed you crawl has a dozen URLs for domain A and the second seed has a dozen more for the same domain, you don’t have to download robots.txt again. Instead, you can just load the persisted information. Of course, you’ll want to download robots.txt again if it’s been more than 24 hours (or whatever your cache time is) since the last time you downloaded it, but if it’s only been a few hours there’s no reason to request the document again.

With the offline crawler, you can probably get away with checking the robots.txt file when you load a new seed, provided you don’t expect the crawler to take more than 24 hours to exhaust a seed. If your crawler will be long-running (more than a day), then you’ll want to check every time before you crawl a URL. For example:

dequeue domain
if (domain.robotstxt.downloadtime < now - cachetime)
{
    request robots.txt from domain
    domain.robotstxt.downloadtime = now
}
dequeue URL from domain
if (can crawl URL) // check domain.robotstxt
{
    request document
    store document
}
// etc.

Things get complicated in a hurry, don’t they? It gets worse, but for now I’ll assume that you have some sort of domain persistence mechanism from which the crawler can get information. In a later post I’ll talk about how to build that.

If you have an offline crawler, queue management really can be as simple as what I showed above. In the simplest model, the crawler is just a consumer; it never adds to a domain’s URL queue. Domains come out of the queue, go into the penalty box, and then come out of the penalty box and go back into the domain queue. When a domain’s URL queue is exhausted, that domain is discarded. Even a multi-threaded crawler of this type requires very little “smarts” in the queue management.

The offline queue builder, too, is simple in concept. Implementation can be difficult due to the huge amount of data it has to deal with, but for a small-scale (single machine) crawler, it’s quite manageable. At the base level, it’s just a huge sort/merge operation. The general algorithm is:

for each downloaded document
{
    scan document to extract URLs
    for each extracted URL
    {
        if (URL is not in index)
        {
            place URL in queue
            store URL in index
        }
    }
}
Sort URLs in queue by domain
Save sorted queue

The harder part is maintaining the index. A relational database is effective, but not terribly efficient for this. If the number of URLs you plan to save in the index is less than 100 million, you can probably load the index from disk at the start of a run and then write it back to disk when you’re done. This will eat up lots of memory, though, since the average URL length will be in the neighborhood of 80 characters. If your index is larger than available memory, you have to come up with some hybrid solution. If you’re willing to accept some small percentage of false positives (URLs that your index says were crawled, but in actuality weren’t), then a Bloom filter is a very effective and memory efficient way to manage your index. In my experience, the benefits of a Bloom filter far outweigh the drawbacks of a few false positives.
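
As an illustration, here’s a bare-bones Bloom filter sketch in Python. The bit count and hash count below are placeholder values; tune them to your URL volume and the false-positive rate you’re willing to accept.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 30, num_hashes=7):
        # 2^30 bits is about 128 MB, roughly right for 100 million URLs
        # at around a 1% false-positive rate with 7 hashes.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, url):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))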

You’ll also want some way to filter URLs and domains that are unlikely to give you useful information. Remember, there are many more URLs than you’ll ever be able to crawl. In fact, the existence of mirrors, proxies, buggy automatic URL generation, and purposely-created crawler traps makes the number of URLs essentially infinite. Even without those problems (which I’ll discuss in detail in a later post), your crawler will uncover online forums, bulletin boards, social networks, and other sites that have no useful information. You don’t want to spend your precious resources looking at unproductive sites.

If all you want is to crawl sites that you’ve identified, then the problem becomes pretty easy. You just create a table that contains the sites you want to crawl (a white list), and any URL that doesn’t match an item in the table is discarded. Your table can be a list of domain names, or it can use regular expressions that are matched against each URL. The drawback to using regular expressions, of course, is that matching each URL against dozens or hundreds of regular expressions is going to be very expensive unless you use a more advanced algorithm similar to what the Unix grep program uses. Implementation is left as an exercise to the reader.
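For the simple case, a sketch like the following is enough: a set for exact domain matches plus a few regular expressions for anything fancier. The domains and patterns are placeholders, and this does nothing clever about the cost of matching many patterns.

import re
from urllib.parse import urlsplit

ALLOWED_DOMAINS = {"example.com", "www.example.com", "music.example.org"}
ALLOWED_PATTERNS = [re.compile(r"^https?://[^/]*\.example\.net/audio/")]

def should_crawl(url):
    host = (urlsplit(url).hostname or "").lower()
    if host in ALLOWED_DOMAINS:
        return True
    return any(p.search(url) for p in ALLOWED_PATTERNS)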

Handling redirects

There are two types of redirects. The permanent redirect, HTTP status code 301 (Moved Permanently), is supposed to be used for things that have been moved from their original location. When a Web browser encounters a 301, it caches the redirect location (the Location header that is returned with the 301 response) so that the next time you request the original location, the browser automatically requests the new location. The 302 status code (Found) is a temporary redirect. Browsers aren’t supposed to cache any information about temporary redirects.

Redirects work well for Web browsers, but they make crawling more complicated. One way to handle them is to ignore them. That is, rather than handle them specially, just make your HTTP component automatically follow redirects. If your crawler requests http://example.com/file1.html, and it redirects to http://example.com/content/coolstuff.html, then the crawler gets whatever document is stored at the redirect location. You can examine the Content-Location response header to get the URL of the actual document–the redirect destination.

While that works, it has some drawbacks. First, it violates politeness because the HTTP component will make multiple requests in quick succession. That’s not much of a problem in a single-threaded crawler, because a browser would do the same thing. But in a multi-threaded crawler it could (and often does) result in many concurrent requests to the same domain. Imagine that you have two threads. The first requests http://example1.com/file1.html and the second requests http://example2.com/file2.html. The “example” domains, though, are owned by the same company, and they’ve set things up so that any request for http://exampleN.com/* redirects to http://allexamples.com/N/*. So the first thread issues a request for http://allexamples.com/1/file1.html, and the second issues a request for http://allexamples.com/2/file2.html. You now have two threads issuing concurrent requests to the same domain, which violates politeness. I can tell you from experience that this can and does happen.

Fortunately, it doesn’t happen very often. When it does happen, it’s usually on a very large site that won’t usually notice the little bit of bandwidth that your crawler consumes. In almost all cases, politeness violations caused by redirects won’t get you into trouble. But there are other reasons you don’t want to automatically follow redirects.

If you automatically follow redirects, you bypass your URL filtering rules. Going to http://example.com/badthing/ could redirect you to http://example.com/badthing/bigfile.pdf, meaning that your crawler will end up downloading a PDF document that it can’t use. Or perhaps an image, a video, a Word document, etc. Granted, your crawler has to have logic that discards unusable documents, since it’s possible that requesting an HTML file will return an image, so this isn’t actually a new problem. But all of your URL filtering is bypassed. If the redirect location is to a domain that’s in your exclusion list, you’re going to violate politeness. If the redirect location is a document that you’ve already seen, you’re going to get duplicate documents.

And if there are multiple levels of redirection, your problems are multiplied. Double redirection is common, and triple redirection isn’t exactly rare. I’ve found that anything beyond three is usually some kind of spam site or crawler trap, so I cap my redirect following to just three levels. HTTP components that allow you to automatically follow redirects typically allow you to set the maximum redirection count in order to prevent infinite redirection (also not uncommon) from putting your crawler thread into an infinite loop.

Automatically following redirects is a really bad idea, and manually following them is hard. It complicates the crawler. But you have to do it if you want to crawl the Web on a large scale. You have several options when it comes to implementing support for redirected URLs; I’ll present some of them below. All but the simplest will greatly complicate the crawler code.

The first thing is to treat all redirects the same. Ignore the difference between temporary and permanent redirects. As far as the crawler is concerned, all redirects are permanent for that crawling session.

When the crawler encounters a redirect, it should mark the original URL as crawled and also store a document that contains the redirect location (the target URL) as a link. The offline queue builder program can then process that document in the normal way and the next queue segment will contain the redirect location. This approach has the benefit of being dead simple. The only problem is one of latency; you have to wait until the next queue segment is processed before the crawler downloads the target document.
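
A rough sketch of that approach is below. It assumes your HTTP component hands you the redirect response without following it (for example, a requests-style response object with a headers dictionary); mark_crawled and store_document are hypothetical stand-ins for whatever your repository layer provides.

from urllib.parse import urljoin

def handle_redirect(original_url, response):
    target = response.headers.get("Location")
    if not target:
        return
    target = urljoin(original_url, target)      # Location may be relative
    mark_crawled(original_url)                  # hypothetical helper
    # Store a stub document whose only link is the redirect target;
    # the offline queue builder will extract it like any other link.
    stub = '<html><body><a href="%s"></a></body></html>' % target
    store_document(original_url, stub)          # hypothetical helper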

If you want to handle the redirect within the crawler, then things get a bit more complicated. When the crawler encounters a redirect, it has to get the target URL from the Location response header. Then, it needs to filter that URL. How much filtering you do depends on how strictly you want to adhere to the politeness policy. At minimum, the crawler should check that URL against the exclusion list (see the article on Politeness) so that your crawler doesn’t hit a domain that you’ve been asked not to crawl. The crawler also needs to check the URL against the robots.txt information for the target domain. And if you don’t have a record for that domain, you need to create one, download the robots.txt information, etc. Things get complicated in a hurry. You will find that you cannot make the following work.

// BUGGY CODE - DO NOT USE
dequeue URL
request document
while (response is a redirect)
{
    filter URL
    request target URL
}

There are just too many things going on in the “filter URL” phase, and too much can go wrong. You’re better off having the crawling thread add the target URL to the target domain’s URL queue. That complicates queue management, because you then need some way to look up a domain’s information by name, and a way to synchronize access to a domain’s URL queue. Although it seems like this complicates the crawler, the truth is that you probably would have added this more advanced queue management anyway. A crawler with a live queue requires more advanced queue management, so I’ll defer detailed discussion to the next post. You don’t need all of that complexity in order to handle redirects, but you do need the basics. This is also related to the domain persistence discussion that I put off until a later post.

It’s harder than it looks

I think by now you’re getting the idea that writing an effective and efficient Web crawler is a non-trivial bit of work. What looks like a simple queue-based graph traversal problem turns out to be very complicated. Even without redirects, the simple offline crawler is difficult because of politeness concerns. And redirects make things even harder. Maintaining the queue is simplified because it’s handled by an offline process that has lots of time, memory, and disk resources.

In my next post, I’ll talk about queue management for the adaptive crawler. That’s much more complicated because the queue becomes a much more dynamic structure that has to fit into memory, and the crawler needs to update it efficiently.

Writing a Web crawler: Politeness

This is the third in a series of posts about writing a Web crawler. Read the Introduction for background and a table of contents. The previous entry is Crawling models.

When you’re writing a Web crawler, it’s important for you to understand that your crawler is using others’ resources. Whenever you download a file from somebody else’s server, you’re consuming their bandwidth and server resources. Most site operators welcome well-behaved crawlers, because those crawlers provide exposure, which means potentially more visitors. Ill-behaved crawlers are not welcomed, and often are banned. If you operate an ill-behaved crawler, not only will your crawler be banned by the sites on which it misbehaves, but also by many other sites–even sites your crawler has not yet visited.

Site operators do talk amongst themselves. There are active forums on which operators can ask questions about crawlers, and on which they post reports about ill-behaved crawlers. Some operators will ban a crawler that gets one bad report.

Site operators have the right to ban any client for any reason, and the responsibility to their legitimate users to ban crawlers and other bots that interfere with normal site operation. As the author of a Web crawler, it is your responsibility to operate it within accepted community standards and to ensure that it doesn’t interfere with operation of the sites that it crawls.

I can tell you from experience that an ill-behaved crawler will be banned, the IP addresses it crawls from will be reported widely on Webmaster forums, and complaints about the bot will be lodged with your ISP. Repeated reports can result in your ISP terminating your service, and exceptional cases could result in civil litigation or criminal charges. It’s very difficult to repair the reputation of an ill-behaved crawler, so it’s important that your crawler adhere to community standards from the start.

Unfortunately, those community standards are at best vaguely defined. When it comes to accepted behavior there are grey areas that site operators will interpret differently. In addition, some operators aren’t familiar with the well-known community standards, or have wildly different interpretations of their meanings. Many of the rules I outline below aren’t written anywhere else that I know of, but instead have been developed by us in response to questions, comments, and complaints from site operators who have contacted us about our crawler.

If I had to sum it up in a single rule, it would be:

You are crawling at somebody else’s expense. Act accordingly.

The rules below apply to all types of bots, not just the adaptive crawlers that this series focuses on.

Identification

Web servers keep a log of every request that comes in. That log typically includes, among other things, the date and time, the IP address from which the request was made, the URL that was requested, and a user agent string that identifies the client making the request. There is nothing you can do to prevent that information from being stored in the server log. Often, a site operator will be interested to see who is making requests, and he will use the IP address and user agent strings to determine that. Browser user agent strings are pretty easy to identify. For example, here’s a list of Internet Explorer user agent strings. If an operator sees blank or cryptic user agent strings in his log, he might ban that IP address on principle, figuring it’s an ill-behaved bot or at minimum somebody trying to “hide.”

The first thing you should do is come up with a name for your crawler and make sure that that name is included in the User-Agent HTTP header with every request the crawler makes. That string should include your crawler’s name and a link to a page that contains information about the crawler. For example, our User-Agent string is “MLBot (www.metadatalabs.com/mlbot)”.
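
As a concrete illustration, here’s how that header might be set using the third-party requests library in Python. The bot name and info URL are placeholders, not our actual values.

import requests

USER_AGENT = "ExampleBot (http://example.com/crawlerinfo.html)"   # placeholder

response = requests.get("http://example.com/somepage.html",
                        headers={"User-Agent": USER_AGENT},
                        timeout=30)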

The page that contains information about your crawler should contain, at minimum:

  • General information about your crawler. This statement should describe you or your company, the kind of information the crawler is looking for, and what you will be doing with the information. It’s a very good idea to mention if your crawler is part of a school project, because many operators will give more leeway and can offer helpful suggestions.
  • If you think that your product will benefit the people whose sites you’re crawling, you should mention those benefits. For example, “I am building a comprehensive index of information about steam trains. When published, this index will drive many users who are interested in steam trains to your site.”
  • List the IP addresses that your crawler operates from. If you have static IP addresses, this is very easy to do. If you’re crawling from home or from some other location that has a dynamic IP, then you should update this list as often as you can to show the current IP addresses. It is also a good idea to sign up with a dynamic DNS service and list the hostname that will resolve to your server. Interested parties can then use NSLOOKUP to resolve that name to your current IP address.
  • You should mention that your crawler respects the robots.txt standard, including any extensions. See robots.txt, below.
  • Include information about how to block your crawler using robots.txt. For example, our FAQ page shows this:

        User-agent: MLBot
        Disallow: *

    You might also show examples of blocking specific directories, and you should certainly include a link to the robots.txt page.
  • Include an email address or form that people can use to contact you about your crawler.

It’s surprising how effective that page can be. Site operators are rightly suspicious of random crawlers that have no supporting information. By posting a page that describes your crawler, you rise above the majority of small-scale bots that continually plague their servers. I won’t say that operators won’t still look at you with suspicion, but they won’t typically block you out of hand. You might still be banned, especially if your crawler is ill-behaved, but the information page gives your crawler an air of legitimacy that it wouldn’t otherwise have.

I know how hard it can be to tear yourself away from working on your crawler, especially if you find writing difficult. But you should post about your crawler from time to time on your blog or FAQ site. Post about new additions, about the kinds of things you’re finding, and the interesting things you’re doing with the data you collect. If a visitor sees recent posts, he knows that you’re still actively working on your project. This isn’t directly related to politeness, but it does help improve your image.

It also helps to keep up on the latest developments in the world of search engine optimization, public sentiment about crawlers, and what other crawlers are doing. Accepted standards change over time. What was acceptable from a crawler a few years ago might now be considered impolite. It’s important that your crawler’s behavior reflect current standards.

robots.txt

The Robots Exclusion Standard is a consensus agreement that has achieved the status of an informal standard. Unfortunately, it’s never been approved as a “real” standard in the way that HTTP, SMTP, and other common Internet protocols are. As a result, there are many extensions to the standard, some contradictory. In 2008, Google, Microsoft, and Yahoo agreed on a set of common extensions to robots.txt, and many less visible crawlers followed suit. If you follow the lead of the major search engine crawlers, nobody can fault you for your handling of robots.txt. See Major search engines support robots.txt standard for more information.

At minimum, your crawler must support proper handling of the Disallow directive. Anything beyond that is optional, but helpful. You should support Allow, the Sitemap directive (if it’s relevant to you), and wildcards. Some of the major crawlers support the Crawl-delay directive as well, and if you can do it, you should.

Related to robots.txt are the HTML Meta tags that some authors use to prevent crawling or indexing of particular pages. Some authors do not have access to their robots.txt file, so their only option is to use Meta tags in the HTML. If you are crawling or indexing HTML documents, you should support these tags.

Proper handling of robots.txt can be kind of confusing. I’ve written about this in the past. See Struggling with the Robots Exclusion Standard and More On Robots Exclusion for more information. In addition, Google’s robots.txt page has some good description of how robots.txt should work.

There are many robots.txt parser implementations available in different languages. A search for robots.txt parser should give you plenty of hits for whatever language you’re working in. You’re much better off using one that somebody else wrote rather than trying to implement it yourself.

You’ll want to cache the robots.txt files that you download, so that you don’t have to request robots.txt every time you request a URL from a domain. In general, a 24 hour cache time is acceptable. Some large crawlers say that they cache robots.txt for a week! I’ve found that I get very few complaints with a cache time of 24 hours. A simple MRU caching scheme with a buffer that holds one million robots.txt files works well for us. We crawl perhaps 40 million URLs per day. Your cache size will depend on how many different domains you typically crawl per day.
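
If you’re working in Python, the standard library’s urllib.robotparser will do the parsing for you. Here’s a rough sketch that adds a 24-hour cache keyed by scheme and host; the cache here is unbounded, so a real crawler would want to cap it (the million-entry MRU scheme mentioned above, for instance).

import time
from urllib import robotparser
from urllib.parse import urlsplit

CACHE_SECONDS = 24 * 60 * 60
_robots_cache = {}          # (scheme, host) -> (fetch_time, RobotFileParser)

def robots_for(url):
    parts = urlsplit(url)
    key = (parts.scheme, parts.netloc)
    cached = _robots_cache.get(key)
    if cached and time.time() - cached[0] < CACHE_SECONDS:
        return cached[1]
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                   # downloads and parses the file
    _robots_cache[key] = (time.time(), rp)
    return rp

def allowed(url, user_agent="ExampleBot"):      # placeholder bot name
    return robots_for(url).can_fetch(user_agent, url)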

A common mistake, by the way, is for a site to serve robots.txt with a Content-Type of “text/html”. A strict interpretation of the rules says that you don’t have to recognize such a file. I’ve found that those files typically are plain text (i.e. they’re formatted correctly), but have the wrong MIME type attached to them. Our crawler will try to parse anything that calls itself robots.txt, regardless of the MIME type.

Crawling frequency

A single computer, even operating on a cable modem, can cause significant disruption to a Web server if it makes requests too often. If you have a multi-threaded crawler and a whole bunch of URLs from a single site, you might be tempted to request all of those URLs at once. Your cable modem can probably handle you making 30 concurrent requests to a server. And the server can probably handle you doing that … once. If your crawler is making many concurrent requests over a long period of time, it will impact the server. And when the owner figures out why his server stopped responding to customer requests, your bot will be blocked and possibly reported to your ISP as taking part in a denial of service attack. You must limit the rate at which your crawler makes requests to individual sites.

I noted above that some crawlers support the Crawl-delay directive in robots.txt. That directive lists the number of seconds your bot should delay between requests to the site. Not all crawlers support that directive, and relatively few robots.txt files that I’ve seen actually contain the directive. That said, you should support Crawl-delay if you can. But it can’t be your only means of limiting your crawler’s request rate. A technique that works well for us is to keep an average of the server’s response time over the last few requests, and then use a multiplier to come up with a delay time.

For example, say that the last five requests to the example.com domain averaged 1,500 milliseconds each, and you use a multiplier of 30. In that case, you would hit example.com once every 45 seconds. You can use a smaller multiplier, but I would caution against hitting any site more frequently than once every 15 seconds. Some larger sites will allow that, but many smaller sites would not be happy about it. It also depends on how many total requests you’re making. Hitting a site four times a minute for two minutes probably won’t raise any red flags. If you do it all day long, they’re going to wonder why you’re crawling them so often.
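
To make the arithmetic concrete, here’s a minimal sketch of that delay calculation. The multiplier, history length, and 15-second floor are the numbers from the text; the class name is mine, and you should treat the values as starting points rather than gospel.

from collections import deque

MULTIPLIER = 30
MIN_DELAY = 15.0            # seconds; don't hit a site more often than this

class DomainTimer:
    def __init__(self, history=5):
        self.times = deque(maxlen=history)      # recent response times, in seconds

    def record(self, elapsed_seconds):
        self.times.append(elapsed_seconds)

    def delay(self, crawl_delay=None):
        """Seconds to wait before the next request to this domain."""
        floor = max(MIN_DELAY, crawl_delay) if crawl_delay else MIN_DELAY
        if not self.times:
            return 60.0                         # default before any samples exist
        return max(floor, MULTIPLIER * sum(self.times) / len(self.times))

# 1.5 s average response time * 30 => one request every 45 seconds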

In our crawler, the politeness delay is integrated with the queue management. I’ll discuss it in more detail when I cover queue management.

Take only what you need

To a crawler, the Web is an infinite free buffet. But as I said previously, it’s not really free. Every document you download consumes a server’s bandwidth and resources. If you go to an all-you-can-eat place, load your plate with food, eat just the good stuff, and then go back for seconds, the manager will probably kick you out. The rule at the buffet is “Take only what you need, and eat all you take.” The rule is the same when crawling the Web. If you download a bunch of stuff you don’t need, consuming somebody else’s server resources only to discard the documents, they’re likely to block you. Take only what you can use.

This is one rule that not only makes your crawler more polite, but also makes it more efficient.

An early version of my crawler was looking for links to MP3 files in HTML documents. The crawler’s general algorithm was this:

Download file
If it's an HTML file
    parse the HTML for links
    queue new links
else if it's an MP3 file
    extract metadata
    store link and metadata
else
    discard the document

That works well, but the program downloaded a huge number of documents that it would never be able to use! For example, about 20% of the documents it was downloading were image files! The crawler couldn’t do anything with a .gif or a .jpeg, but it was downloading them nonetheless. It was an incredible waste of others’ server resources and our bandwidth. By modifying the crawler so that it checked for common file extensions before queuing a URL, we increased our productivity by over 25%, and also made the crawler more polite. We were no longer taking more than we could use.

There’s no guarantee, of course, that a URL ending in “.jpg” is an image. There’s a high likelihood, though. Our experiments at the time showed that if a URL ended in “.jpg”, “.jpeg”, “.bmp”, or “.gif” (among many others), the probability of it being an image file was above 98%. Of the remaining two percent, only a very small fraction were HTML files, and none were audio files. Blocking those extensions cost us nothing and gave a huge benefit.

Images aren’t the only files that most crawlers will find useless. If you’re looking for HTML documents, then you can safely ignore URLs with extensions that are common to images, audio, video, compressed archives, PDFs, executables, and many more. I don’t have an exhaustive list of extensions that you should ignore, but the following is a good start:

".asx",     // Windows video
".bmp",     // bitmap image
".css",     // Cascading Style Sheet
".doc",     // Microsoft Word (mostly)
".docx",    // Microsoft Word
".flv",     // Old Flash video format
".gif",     // GIF image
".jpeg",    // JPEG image
".jpg",     // JPEG image
".mid",     // MIDI file
".mov",     // Quicktime movie
".mp3",     // MP3 audio
".ogg",     // .ogg format media
".pdf",     // PDF files
".png",     // image
".ppt",     // powerpoint
".ra",      // real media
".ram",     // real media
".rm",      // real media
".swf",     // Flash files
".txt",     // plain text
".wav",     // WAV format sound
".wma",     // Windows media audio
".wmv",     // Windows media video
".xml",     // XML files
".zip",     // ZIP files
".m4a",     // MP4 audio
".m4v",     // MP4 video
".mov",     // Quicktime movie
".mp4",     // MP4 video or audio
".m4b",	    // MP4 video or audio

If your crawler derives information from some of those file types, then of course you shouldn’t block them.

I don’t know what the distribution of files is on the Web these days. When I created that list four or five years ago, those extensions made up well over 25% of the files my crawler was encountering. Today, the crawler’s mix is something like 91% HTML files, 5% files with no extension (i.e. http://example.com/), 3% video, and every other file type is less than 0.01%. I’ve blocked lots of extensions other than those I identified above, but together they made up less than 1% of the total number of URLs I encountered.

I determined the extensions to block by logging (see below) every request the crawler makes, along with the HTTP headers that it gets back in the response. By correlating the file extension with the Content-Type header, I was able to get a list of extensions that rarely (if ever) lead to a file that I’m interested in.

Preventing your crawler from downloading stuff it can’t use makes you look more professional in the eyes of the people who own the servers you’re visiting, and also makes your crawler more efficient. You have to parse URLs anyway (another subject I’ll get to at some point), so checking for common useless extensions isn’t going to cost you significant processor time. Eliminating those useless files early makes queue management easier (less churn in the queue), and prevents you from wasting your precious bandwidth on things you can’t use.
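
For illustration, here’s a minimal sketch of that extension check as it might run before a URL is queued. The set is drawn from the list above, and the function name is arbitrary.

from urllib.parse import urlsplit

SKIP_EXTENSIONS = {
    ".asx", ".bmp", ".css", ".doc", ".docx", ".flv", ".gif", ".jpeg",
    ".jpg", ".mid", ".mov", ".mp3", ".ogg", ".pdf", ".png", ".ppt",
    ".ra", ".ram", ".rm", ".swf", ".txt", ".wav", ".wma", ".wmv",
    ".xml", ".zip", ".m4a", ".m4v", ".mp4", ".m4b",
}

def looks_useful(url):
    name = urlsplit(url).path.lower().rsplit("/", 1)[-1]   # last path segment
    dot = name.rfind(".")
    if dot == -1:
        return True                 # no extension; it might well be HTML
    return name[dot:] not in SKIP_EXTENSIONS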

Maintain an excluded domains file

There are some domains that, for various reasons, you don’t want your crawler to access. When we were crawling for audio files, for example, we encountered a lot of sites that had pirated music or that contained nothing but 30-second music samples. We didn’t want those files, but as far as the crawler and the ML system were concerned, those sites were gold. They had nuggets galore. Our only way to block them was to add them as exclusions.

Another reason you might want to block a site is in response to a complaint. Some site operators do not want you to crawl, period, and do not have access to robots.txt or the .htaccess file in order to block your crawler. If somebody doesn’t want you on their site, you should honor their wishes.

You can go a long way with a simple text file that lists one domain per line, and perhaps a comment that says why you blocked it and when. Here’s a sample:

# sites that have requested that we not crawl
site1.com
site93.com
blog.somebody.net

# link farms that return nothing good
spamsiteA.com
spamsiteB.info

We created a file called NeverCrawl.txt in a common location. The crawler checks the last modified date on that file every 15 minutes or so. If the last modified date has changed, the crawler reads and parses the file, adding the entries to a hash table that’s checked whenever a URL is extracted from a Web page. The beauty of this solution is that it’s simple to add an exclusion (just open the text file, add the domain, and save the file), and the file format is easy to parse. To my knowledge this exclusion list has never failed. It’s also very nice being able to say to somebody, “I have instructed the crawler never to visit your site again. That exclusion should take effect within the next half hour.”
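
For what it’s worth, here’s a rough Python sketch of that mechanism. The file name matches the one described above; the class and method names are mine, and error handling is minimal.

import os

EXCLUSION_FILE = "NeverCrawl.txt"

class ExclusionList:
    def __init__(self, path=EXCLUSION_FILE):
        self.path = path
        self.mtime = 0.0
        self.domains = set()

    def refresh(self):
        """Call this every 15 minutes or so from the crawler."""
        if not os.path.exists(self.path):
            return
        mtime = os.path.getmtime(self.path)
        if mtime == self.mtime:
            return                              # file hasn't changed
        domains = set()
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip().lower()
                if line and not line.startswith("#"):   # skip comments and blanks
                    domains.add(line)
        self.domains, self.mtime = domains, mtime

    def is_excluded(self, domain):
        return domain.lower() in self.domains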

It would be nice, at times, to have a richer syntax for specifying which sites to block, but the cost of doing so is pretty high in comparison to its benefit. We could add regular expression support so that we could block any URL that matches a particular query string syntax, regardless of domain. That would catch Web proxies, for example, and many link farm sites. But it complicates the parsing code and could affect reliability. It’s certainly something we’d like to do at some point, but for now the cost outweighs the benefit.

Don’t go overboard on trying to identify sites to block. I know from experience that you could spend the rest of your days adding things to your exclusion file. There are approximately 100 million registered domains that have at least one viable Web page. Add to that the subdomains and the number probably approaches a billion. The vast majority of those sites don’t have anything of interest to you on them. But trying to identify and list them in a file would be impossible. Besides, if you did list 90 million sites, you wouldn’t be able to keep the exclusion list in memory.

You should depend on your ML system to keep you away from unproductive sites. Use the exclusion list to keep you away from things that the ML system thinks are productive, and from sites that you have been asked not to crawl.

Log everything

Your crawler is going to make a lot of requests. Even if you’re running a single-threaded bot that’s making an average of one request per second, you’re going to make more than 85,000 requests per day. You want to have a log of every request for two reasons:

  1. You will receive a complaint about your crawler’s behavior. The person complaining often will not supply dates and times, and if he does, you’ll want to verify that it really was your crawler. It might be some other crawler that just happens to have the same name as yours, or it might be somebody spoofing your bot.
  2. You’ll want to analyze the logs to find link farms, crawler traps, and other unproductive sites; to find errors; and to identify file extensions that are (almost) always unproductive.

You should log as much as you can with every request. At minimum you should save the date and time of the request, the request URL, the server’s HTTP status code (200, 404, 302, etc.), the Content-Type, the Content-Length, the Location header (for redirects), the response URI (the address of the page that responded), and the total amount of time required for the request and response. Optionally, you can save the actual response (the HTML file received, for example), although depending on your process you might want to save that elsewhere.

I’ve found it very helpful to log the entire contents of any robots.txt file that I download. I’ve been able to solve some curious crawler behavior by looking at the logs to see when the crawler last downloaded robots.txt and what the file said at the time. I’ve also been able to help some operators who had bad syntax in their robots.txt, or had saved it as a .doc or .html file.

A log is essential for debugging, performance analysis, and responding to customer complaints. Keep your logs for a month. I’ve rarely had to go back more than a week or two in order to resolve a complaint, but a month gives a good buffer. And analysis done on a full month of logs can provide better data than just a day’s or a week’s worth of info. If space is a problem, compress the logs. They typically compress at a ratio of five or ten to one.

Responding to complaints

Regardless of how polite your crawler is, you will receive complaints. Most of the complaints will be from reasonable people who are more curious than angry, but from time to time you will receive a scathing letter from somebody who seems completely unreasonable. Do not take an aggressive or confrontational tone with anybody who complains about your crawler’s behavior. Every complaint, regardless of its tone, is best handled by a friendly reply that thanks the person for the problem report, apologizes for problems, expresses concern at the perceived misbehavior, and asks for more information if you need it in order to research the problem. Here’s an example:

Dear Mr. Webmaster:

Thank you for contacting me about my Web crawler. I make a concerted effort to ensure that the crawler operates within accepted community standards, and I apologize if it caused problems on your site.

I would like to research this matter to discover the source of the problem, which could very well be a bug in the crawler. Can you please send me the domain name for your site, and the approximate date and time from your Web server’s logs? It would be best if I could see the exact entry from your server log. That way I can match up the crawler’s logs with your information and more reliably discover the source of the problem.

Thank you again for notifying me of this problem. I will contact you again when I have discovered the source of the problem.

For more information about my crawler and what I’m doing with the information the crawler collects, please visit my information page at http://example.com/crawlerinfo.html

You would be surprised at how effective such a simple response can be. We have received some strongly worded complaints in the past that evoked pictures of slavering monsters out for blood. A response similar to the one above resulted in a letter of apology, some help diagnosing a problem (which turned out to be a misunderstanding on the part of the site operator), and a recommendation from that person on a Webmaster forum, saying that people should allow our bot to crawl. If we had responded to his initial complaint in kind, we would have made an enemy who very likely would have posted negative information about us. Instead, this person became an advocate and publicly praised us for the way we operate. It’s difficult to imagine a better outcome.

After the initial response, research the matter thoroughly and report your findings to the person who complained. If it was a bug in your crawler, admit it. Furthermore, post a blog entry or note on your crawler’s site that explains the error, how it manifests, how long it was going on, and when it was fixed. If you can’t fix the problem immediately, post information about when it will be fixed and how to avoid it if at all possible. It pays to be up front about your crawler’s bugs. If people think you’re being open and honest, they will be much more likely to forgive the occasional error.

But know that you can’t please everybody. There will be people who, for whatever reason, will not accept your explanation. That’s what the exclusion list is for. If you’re unable to have a productive conversation with somebody who complains, apologize for the problem, add their domain to your exclusion list, and let them know that your crawler will not bother them again. There’s just nothing to be gained by arguing with an unreasonable person. Cut your losses and move on.

It’s also a good idea to monitor Webmaster forums for complaints about your crawler. You can set up a Google alert that will notify you whenever Google’s crawler encounters a new posting about your crawler. It’s a good way to gauge your crawler’s reputation, and a way that you can respond in public quickly to any complaints. The rules above still apply, perhaps even more strongly. Be courteous and helpful. Flaming somebody in a public forum will not improve your crawler’s reputation.

I covered a lot of ground in this posting. Politeness, in my opinion, is the most important issue you face when operating a Web crawler. Everything your crawler does will be controlled by the politeness policy. There are plenty of technical hurdles, but if your crawler gets a bad reputation you will find it blocked by a large number of sites, including many that have valuable information that you’re interested in. No amount of technical wizardry will allow you to get past those blocks, and you will find it very difficult to repair a bad reputation.

Design in politeness from the start. You’ll regret it if you don’t.

This isn’t the last word on politeness. Many of the issues I’ll cover later in this series have politeness ramifications. One example is the DUST (different URL, similar text) problem: preventing your crawler from visiting the same page repeatedly, using a different URL each time.

In my next installment, I’ll begin talking about queue management, which is the most technically challenging part of an adaptive crawler. It’s also related to politeness in many ways.

Writing a Web Crawler: Crawling Models

This is the second post in a series of posts about writing a Web crawler. Read the Introduction to get the background information.

Expectations

I failed to set expectations in the Introduction, which might have misled some readers to believe that I will be presenting a fully-coded, working Web crawler. That is not the case. I will be discussing the issues in detail, and will certainly include pseudo code for some of the tasks or algorithms. It is decidedly not my intention to design, code, test, and debug a fully working Web crawler. The focus in this series is on the issues involved and possible solutions, but it will stop short of actual implementations.

The reason for that is simple: the issues are more important than the code. I wrote my large-scale Web crawler the hard way. I started with a naive approach to crawling, encountered problems, solved them, found more problems, solved them, etc. It was a fun, but frustrating and terribly inefficient way to create a Web crawler. It involved rewriting large portions of the crawler many times. It resulted in some messy code and, worse, some high level design decisions that can’t be changed without a major refactoring. Had I known the issues ahead of time, I would have designed the crawler differently and avoided a lot of pain.

I would like to save you that pain.

On to crawling

The high-level view of a Web crawler’s operation is very simple. In pseudo code, it looks like this.

queue = LoadSeed();
while (queue is not empty)
{
    dequeue url
    request document
    parse document for links and add to queue
}

Admittedly, that’s a very simplistic view of how a crawler works, but it’s essentially correct. There are, however, a number of details that aren’t included. The Web, although finite, is rather large: on the order of 100 billion documents. The number of URIs pointing to those documents, though, is much larger–way too many to count. It’s difficult to say how many pages are actually indexed by major search engines. I’ve seen estimates saying that anywhere from 20 percent to 60 percent of the visible Web is indexed by Google, Bing, etc. The “visible” Web includes only those pages that are reachable through a simple URL without having to log in and without having to execute JavaScript, etc. It’s unclear whether the 100 billion number includes images and other downloadable files, or if it’s just HTML pages.

Whatever the case, the size of the Web is staggering–large enough that the queue would continually grow and your crawler would never stop. This is especially true if you’re running a crawler on a small scale (one machine crawling from a cable modem). It takes a huge amount of space just to store the URLs that you’re going to crawl.

Another detail missing from the pseudo code above is what to do with the documents. The algorithm just extracts and queues links. A real crawler is crawling for a purpose: typically to do something with the documents that it downloads. A search engine crawler, for example, processes the documents and builds an inverted index that can be searched very quickly to identify documents that contain requested keywords. What really happens is that the document is stored in some repository, and a separate process comes along later to read the document repository and create the index.

Note that this series will pointedly not foray into the world of creating a search engine. There is another entire field of technology involved in storing, indexing, and retrieving documents. My focus is on the crawler, which finds the documents.

The final thing missing from the simplified algorithm shown above is some means of preventing the crawler from requesting the same document multiple times. This turns out to be a very hard problem, and one I’ll cover in much detail in a later post. For now, we’ll limit our definition of “same document” to “same URI”. That is, once the crawler has downloaded the page from http://www.mischel.com/index.html, it shouldn’t go to that document again within the crawling period. I’ll define “crawling period” below.

A more realistic (although not quite real) high level view of the crawling process, then, looks like this:

queue = LoadSeed();
while (queue is not empty)
{
    dequeue url
    request document
    store document for later processing
    parse document for links
    add unseen links to queue
}

As I mentioned above, the queue size can get unmanageably large. If you start your crawler with a very small seed that contains portal sites like Yahoo and MSN, and perhaps a few popular blogging sites like WordPress and LiveJournal, your queue is going to grow very quickly. Our experience has been that, on average, an HTML page will contain about 100 links. That, to me, was an astonishingly high number. It turns out that, again on average, only 10 percent of those links will be new. That is, of the 100 links you’ll scrape from a page, only 10 of them will be new links that you haven’t seen yet. That’s a big reduction, but it’s still a lot. For every URL you take out of the queue, you’ll put 10 back in. If you crawl one page per second, you’ll have almost 900,000 URLs in your queue at the end of one day.

One page per second, by the way, is about what you can expect from a very simple single-threaded crawler over the long term. Crawling speed isn’t particularly relevant for the current discussion, so I’ll defer detailed coverage of that topic to a later posting.

It should be obvious that the queue will quickly grow beyond memory capacity, so you need some kind of backing store for it. Not only do you need to maintain the URLs that will be crawled, you also have to maintain a list of the URLs you’ve already crawled, so you don’t crawl them again. You might think of using a database for that, but you’ll quickly find that you need a very fast database cluster to handle all the lookups you’d be making. At one page per second, you’d be querying the database 100 times per second just to see if the extracted URLs are new. When you get your crawler doing 30 pages per second (easily achievable on a single machine and a cable modem), you’ll be doing 3,000 lookups, 300 insertions, and 30 updates per second. That’s going to be beyond the capacity of a simple database server.

Another problem with a “live” queue is that URLs are inserted in essentially random order. There will be blocks of URLs for the same site, but links to the mischel.com domain, for example, might appear in multiple places within the queue. There are certain benefits to sorting links by domain and crawling all of the links for a single domain, one right after the other. You can avoid the cost of repeated DNS lookups, and also get a “snapshot” of a single domain over a short period of time rather than getting some now and some later.
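
As a rough sketch of what that grouping looks like, the following Python snippet buckets a batch of extracted URLs by host. The function name and the lowercasing rule are my own choices for illustration; a real implementation would also normalize hosts and decide how to treat subdomains.

# Group extracted URLs by host so a crawler can fetch each domain's
# links together and reuse DNS lookups (illustrative sketch).
from collections import defaultdict
from urllib.parse import urlparse

def group_by_domain(urls):
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc.lower()].append(url)
    return groups

# group_by_domain(["http://mischel.com/a", "http://example.com/b",
#                  "http://mischel.com/c"])
# -> {"mischel.com": [...both mischel.com URLs...], "example.com": [...]}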

For those reasons, most traditional crawlers use what we’ve come to call an “offline” model. The crawler starts crawling with a queue (sometimes called a segment) that’s ordered in some fashion (typically by domain name), and as it crawls pages it just stores them. A separate process reads the stored pages, extracts links, decides which links need to be crawled, sorts them by domain, and creates a new segment. When the crawler finishes with its initial segment, it grabs a new segment and starts crawling again.

The crawler’s code, then, becomes much simpler:

while (segment is available)
{
    load segment into queue
    while (queue is not empty)
    {
        dequeue url
        request document
        store document
    }
}
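
In runnable form, that offline loop might look something like the sketch below. The segment file layout, the repository directory, and the hashed file names are assumptions for illustration; the point is that the crawler does nothing but fetch and store.

# Offline-model crawler sketch: fetch every URL in a segment and store the
# raw page. Link extraction and segment building happen in separate processes.
import glob
import hashlib
import os
import urllib.request

def crawl_segments(segment_glob="segments/*.txt", repo_dir="repo"):
    os.makedirs(repo_dir, exist_ok=True)
    for segment_path in sorted(glob.glob(segment_glob)):
        with open(segment_path) as f:
            queue = [line.strip() for line in f if line.strip()]
        for url in queue:
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    body = resp.read()
            except Exception:
                continue    # skip failures; a real crawler logs them
            # Store under a hash of the URL; a real repository would also
            # keep headers, fetch time, and the URL itself.
            name = hashlib.sha1(url.encode("utf-8")).hexdigest()
            with open(os.path.join(repo_dir, name + ".html"), "wb") as out:
                out.write(body)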

One benefit of this model is that the crawler requires almost no CPU resources. All it’s doing is downloading pages and storing them in some kind of repository. It might be possible to have both the crawler and the URL extractor running at the same time on a single machine. If that’s possible, you should do it. Otherwise you’ll have to use the “crawl-process-crawl” model. That is, you create the initial segment and let the crawler complete that one segment. When the crawler is done, you harvest the links to create another segment, start the crawler again, etc. It seems clunky, but it’s quite effective. The major drawback to the crawl-process-crawl model is that you spend a lot of time not crawling, which means that your Internet connection is idle. If you’re paying big dollars for a fat pipe, that idle time is money down the drain.

The offline model is very effective for many types of crawling. If you want to exhaustively crawl a select number of sites (or the entire Internet), it works quite well. In general, if your crawler is essentially a very fast downloader that just stores whatever it finds, then this model is the way to go.

Adaptive crawling

The “secret” to doing Web-scale crawling at cable modem speeds is to limit what you crawl to just the important stuff–things that are related to whatever you’ve determined is relevant. And that requires a crawler that can react in real time to the information that it’s seeing. It has to be selective about what it downloads. This implies a live queue, a way to prioritize the URLs it encounters, and the ability to prune the queue when it becomes too large. Such a crawler is a somewhat more complicated thing to write, but it’s amazingly effective. If you’re looking for specific types of documents or documents about particular topics, then an adaptive crawler can increase the number of relevant documents you find in a given period of time.

Before I get to the numbers, let’s talk about how an adaptive crawler works. The general algorithm is:

queue = LoadSeed()
while (queue is not empty)
{
    dequeue url
    request document
    store document

    // process document
    analyze document to determine if it's relevant
    update prioritizer with information about this document

    // extract and queue links
    parse document for links
    eliminate duplicate links
    prioritize links
    add prioritized links to queue
}
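
Here’s one way to sketch that loop in Python, with a heap-based priority queue and deliberately dumb stand-ins for the nugget function and the prioritizer (both are the black boxes I discuss a little further on; the keyword test and the constant priority below are not how our system actually works).

# Adaptive crawler sketch: priority queue of URLs plus stub relevance and
# priority functions. The stubs are placeholders, not the real ML system.
import heapq
import itertools
import re
import urllib.request
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)
tiebreak = itertools.count()        # keeps heap entries comparable

def is_relevant(url, html):
    return "steam train" in html.lower()    # toy nugget function

def priority_for(url):
    return 0.5                              # toy prioritizer: everything equal

def adaptive_crawl(seed_urls, limit=1000):
    queue = [(-priority_for(u), next(tiebreak), u) for u in seed_urls]
    heapq.heapify(queue)
    seen = set(seed_urls)
    for _ in range(limit):
        if not queue:
            break
        _, _, url = heapq.heappop(queue)    # highest priority first
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        relevant = is_relevant(url, html)
        # A real crawler would store the document (or its metadata) here and
        # feed `relevant` back into the prioritizer.
        for link in LINK_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                heapq.heappush(queue, (-priority_for(absolute),
                                       next(tiebreak), absolute))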

Note that I show storing the document, even if it isn’t relevant. Depending on your requirements, that might not be necessary. Our crawler, for example, doesn’t store HTML pages or other documents that aren’t videos. Even when it encounters a video, the crawler only stores metadata–information about the video like the title, the author, length, etc.

In addition to the above, an adaptive crawler typically has a background thread that continually goes through the queue, re-prioritizing links based on the ever-changing information that the crawler is collecting from analyzing the documents that it finds. When the queue fills up, the lowest-priority URLs are simply removed. If they’re important (i.e. contain relevant information), the crawler will see them again.
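
The pruning step, at least, is easy to express. Assuming the queue is held as a simple list of (priority, URL) pairs, which is a simplification of any real queue, trimming it to a cap looks like this:

# Trim a priority queue to a fixed cap, dropping the lowest-priority URLs.
# Illustrative only; a real queue would live in a persistent store.
import heapq

def prune(queue, cap):
    """queue is a list of (priority, url) tuples; keep the `cap` best."""
    if len(queue) <= cap:
        return queue
    return heapq.nlargest(cap, queue)    # highest-priority entries survive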

We’ve found that re-prioritizing links in this way reduces overtraining, and also allows the crawler to adapt when it finds that a previously unproductive path has become productive. Conversely, it can stop crawling a particular path when that path becomes unproductive.

As you can see, an adaptive crawler is quite a bit more complicated than a general-coverage crawler. It needs some way to analyze a document to derive a relevance score (what we call a “nugget function”), and some type of machine learning (ML) system that can continually learn and then prioritize URLs based on what it’s learned. Not only is this more complicated to build, it puts a much heavier load on the crawler’s memory and CPU resources.

The nature of the nugget function and the ML system will remain unspecified at this point. In later posts, I’ll discuss ways to determine a document’s relevance, and how to build a simple ML system. For now, we can treat those as black boxes.

With this model, the crawler can zero in on relevant information very quickly. If your nugget function returns true when the document is relevant and false when the document is not relevant, the ML system can reward or punish the path that led to the document. The ML system quickly learns that links from site A lead to nuggets 50% of the time, links from site B lead to nuggets only 2% of the time, and links from site C never lead to nuggets. The prioritizer can then prioritize links so that those from site A are crawled before links from site B, and links from site C aren’t crawled unless there’s nothing else to do.
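
A toy version of that per-site scoring, tracking how many pages from each host were crawled and how many produced nuggets, might look like the class below. The smoothing constants and the use of the host name as the “path” are my simplifications; the real ML system is considerably more involved.

# Toy per-site prioritizer: reward hosts whose pages yield nuggets, punish
# hosts whose pages don't. A stand-in, not our production ML system.
from collections import defaultdict
from urllib.parse import urlparse

class SitePrioritizer:
    def __init__(self):
        self.pages = defaultdict(int)      # pages crawled, by host
        self.nuggets = defaultdict(int)    # relevant pages found, by host

    def update(self, url, was_relevant):
        host = urlparse(url).netloc.lower()
        self.pages[host] += 1
        if was_relevant:
            self.nuggets[host] += 1

    def priority(self, url):
        host = urlparse(url).netloc.lower()
        # Smoothed hit rate, so hosts we haven't seen get a modest default.
        return (self.nuggets[host] + 1) / (self.pages[host] + 10)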

This works amazingly well.

We modified the ML system of our crawler so that it would return a random priority value for every URL given to it. That simulates randomly crawling the Web. We fed the crawler a list of 100 starting URLs, and told it to find videos. After crawling for several hours, we turned it off and measured its effectiveness. Simulating a random crawl, we found approximately one video in every 500 pages crawled. We then re-enabled the ML system and ran the test again, using the same starting point. Within two hours, the crawler was finding, on average, one new video in every 12 pages it examined. That early ML system made the crawler 40 times more productive.

Our crawler is geared towards finding video, but it would be a relatively small matter to change the nugget function and tell it to look for anything we like. An example I like to use is steam trains. I recently did a Google search for [steam trains], and got approximately 794,000 results. If we were to crawl the Web at random–just grab any old URL and download the document–then, on average, we would expect to find one document about steam trains in every 25,000 documents we examine. That’s assuming, of course, that we prevent the crawler from looking at things like videos, images, and other types of files that aren’t “documents.” We would have to examine 25 million documents in order to get 1,000 references to steam trains, and there’s no telling how relevant those documents might be. For example, we might find this page, which mentions steam trains but isn’t really relevant to them.

Even at 30 pages per second, you’re talking 833 seconds (almost 14 minutes) between hits. So if you crawled all day you’d find about 100 pages that mention steam trains, and of those a large number would be like the one I linked above–it mentions steam trains but isn’t relevant.

I would expect the behavior to be much different with the adaptive crawler. When it first started up, it would act like a naive random crawler, and its performance would be very similar. But as it began to find documents about steam trains, the URL prioritization would cause it to begin finding relevant documents much faster. Within a few hours, it would be finding a new document about steam trains every 20 seconds or so, and it would likely do even better than that in spurts. The crawler would focus on clusters of relevant documents, find links to other clusters, eventually exhaust those clusters, and wander about aimlessly until it found new clusters.

I haven’t actually run that test with steam trains, but my experience with the crawler’s behavior when searching for video gives me some strong evidence that the crawler would likely find more than 10,000 relevant documents in a day, starting from scratch, using cable modem speeds. In addition, the average relevance of the documents would be much higher because the ML system would favor links from highly relevant sources over random links. If we gave the crawler a good starting point (say, the Wikipedia page on steam trains, or a steam train enthusiast’s site), its performance would be even better because it would start with a source of high-quality links (i.e. links to pages that are very likely to be relevant).

Crawling period

Throughout this discussion, I have not mentioned the dynamic nature of the Web. It’s constantly changing. It changes so fast that no search engine can be completely up to date. Early search engines worked by taking a “snapshot” of the Web over some time period. That is, they’d set their crawlers to work and crawl, say, 20 billion pages in a few days. Then the indexer would come along and start building an index from the stored documents while the crawlers started over, building a new repository that the next index run would use. The larger search engines don’t typically work this way anymore, but many smaller scale crawlers do.

A “crawl,” when using this model, is a unit of work. It might be measured by time (“today’s crawl”), or by the number of documents or sites indexed. How you measure it doesn’t really matter so much for this discussion. The point is that you don’t want to visit the same site more than once per crawl–the crawling period.

That’s fine when you’re using an offline model. An adaptive crawler, though, is often run continuously, and is expected to cover old ground from time to time in order to find new material. That steam train aficionado, for example, might have posted a new article. We want our crawler to find that new material as soon as possible.

In that model, there isn’t a “crawl” as a unit of work. As a result, you need some mechanism that allows the crawler to re-crawl a URL from time to time. I’ll cover how that’s done in a later post.

Which model to use?

If you’re building a general coverage crawler, or a crawler that’s expected to index pre-identified sites or pages, then the offline crawler is the way to go. It’s easier to build and operate, and you can take advantage of batching URLs by domain.

Adaptive crawlers are mostly used for exploration: “find documents that meet these criteria.” They are much harder to build than typical crawlers, but they can find more relevant information faster, and with less bandwidth.

One thing that discourages people from writing crawlers is the idea that they need a huge data center, a fat pipe, a distributed file system, and a parallel processing cluster in order to make it work. That’s just not so! If you’re looking for a relative handful of documents, you can operate an adaptive crawler on a single machine hooked to a cable modem. Even at five megabits per second, you can exhaustively search the Web for information on any topic in just a few days. The trick is limiting where you look and what you look at.

I know that’s a lot to digest in one blog entry, but the choice of crawling model affects what you can do with your crawler. The rest of this series is going to focus on the adaptive crawlers, since that’s what I’m most familiar with, and information about offline crawlers is readily available. In addition, I think most people who are interested in finding business intelligence on the Web are better served by a crawler that seeks out new sources of information rather than depending on a human to identify the sources.

In the next post, I’ll be talking about politeness, which is the most important aspect of operating a Web crawler. I’ll probably follow that up with a discussion of queue management, which turns out to be closely related to politeness in many respects.

Writing a web crawler: Introduction

This is the first of a series of posts about writing a custom Web crawler. It assumes some knowledge of what a Web crawler is, and perhaps what crawlers are typically used for. I don’t know how many posts this subject will require, but it could be a rather long series. It turns out that Web crawlers are much more complicated than they look at first.

Articles in this series:

Crawling Models
Politeness
Queue Management: Part 1

When you hear the term “Web crawler,” it’s likely that your first thought is Google. Certainly, Google’s crawler is better known than any other. It’s probably the largest, as well. There are many other crawlers, though: Microsoft’s Bing search runs one, as do Blekko and other search companies (Yandex, Baidu, Internet Archive, etc.). In addition, there are a few open source crawlers such as Nutch and Heritrix, and there are commercially licensed crawlers available as well.

The other thing that typically comes to mind when you think of Web crawlers is search engines. Again, Google’s crawler is used primarily to gather data for the Google search engine. When speaking of crawlers, the big search engines get all the press. After all, they’re solving big problems, and their solutions are impressive just on their sheer scale. But that’s not all that crawlers are good for.

The big search engines run what we call “general coverage” crawlers. Their intent is to index a very large part of the Web to support searching for “anything.” But there is a huge number of smaller crawlers: those that are designed to crawl a single site, for example, or those that crawl a relatively small number of sites. And there are crawlers that try to scour the entire Web to find specific information. All of these smaller crawlers are generally lumped together into a category called focused crawlers. The truth is that even the largest crawlers are focused in some way–some more tightly than others.

Smaller focused crawlers might support smaller, targeted, search engines, or they might be used to find particular information for a multitude of purposes. Perhaps a cancer researcher is using a crawler to keep abreast of advances in his field. Or a business is looking for information about competitors. Or a government agency is looking for information about terrorists. I know of projects that do all that and more.

Another class of programs that automatically read Web pages is the screen scraper. In general, the difference between a screen scraper and a crawler is that a screen scraper is typically looking for specific information on specific Web pages. For example, a program that reads the HTML page for your location from weather.com and extracts the current forecast would be a screen scraper. Typically, a scraper is a custom piece of software that is written to parse a specific page or small set of pages for very specific information. A crawler is typically more general. Crawlers are designed to traverse the Web graph (follow links from one HTML page to another). Many programs share some attributes of both crawlers and screen scrapers, so there is no clear dividing line between the two. But, in general, a crawler is an explorer that’s given a starting place and told to wander and find new things. A scraper is directed to specific pages from which it extracts very specific information.

My credentials

I’ve spent a good part of the last five years writing and maintaining a Web crawler (MLBot) that examines somewhere in the neighborhood of 40 million URLs every day. The crawler’s primary purpose is to locate and extract information from media (video and audio) files. The crawler’s basic design was pretty well fixed after three months of development, but it took about a year before we had “figured it out.” Even then, it took another year of tweaking, refactoring, and even some major architectural changes before we were happy with it. Since that time (the last three years), we’ve mostly made small incremental changes and added some features to recognize and specially process particular types of URLs that either didn’t exist when we started crawling, or that have become more important to us over time.

That said, I won’t claim to be an “expert” on crawling the Web. If there’s one thing my experiences have taught me, it’s that there’s a whole lot about the Web and how to crawl it that I just don’t know. As a small (four people) startup, we all wear many hats, and there are many things to be done. Although we know there are things our crawler could do better, we just don’t have the time to make those changes. We still discuss possible modifications to the crawler, and in many cases we have a very good idea of what changes to make in order to solve particular problems. But there are still hard problems that we know would take significant research work to solve. It’s unfortunate that we don’t have the resources to make those changes. For now, the crawler does a very good job of finding the information we want.

I have to point out here that, although I wrote almost all the code that makes up the crawler, I could not have done it without the help of my business partners. My understanding of the many challenges associated with crawling the Web, and the solutions that I implemented in code are the result of many hours spent sitting on the beanbags in front of the white board, tossing around ideas with my co-workers. In addition, major components of the crawler and some of the supporting infrastructure were contributed by David and Joe, again as a result of those long brainstorming sessions.

Based on the above, I can say with some confidence that I know a few things about crawling the Web. Again, I won’t claim to be an expert. I do, however, have a crawler that finds, on average, close to a million new videos every day from all over the Web, using a surprisingly small amount of bandwidth in the process.

Why write a Web crawler?

As I pointed out above, there are many open source and commercially licensed Web crawlers available. All can be configured to some degree through configuration files, add-ons or plug-ins, or by directly modifying the source code. But all of the crawlers you come across impose a particular crawling model, and most make assumptions about what you want to do with the data the crawler finds. Most of the crawlers I’ve seen assume some kind of search engine. Whereas it’s often possible to configure or modify those crawlers to do non-traditional things with the data, doing so is not necessarily easy. In addition, many of the crawlers assume a particular back-end data store and, again, whereas it’s possible to use a different data store, the assumptions are often deeply rooted in the code and in the crawler’s operation. It’s often more difficult to modify the existing crawler to do something non-traditional than it is to just write what you want from scratch.

For us, the decision to build our own was not a hard one at all, simply because five years ago the available crawlers were not up to the task we envisioned; modifying any of them to do what we needed was not possible. That might be possible today, although the research I’ve done leads me to doubt it. Certainly, if I could make one of the existing crawlers work as we need it to, the result would require more servers and a lot more disk space than what we currently use.

If you’re thinking that you need a Web crawler, it’s definitely a good idea to look at existing solutions to see if they will meet your requirements. But there are still legitimate reasons to build your own, especially if your needs fall far outside the lines of a traditional search engine.

My next post in this series will talk about different crawling models and why the traditional model that’s used in most crawler implementations, although it works well, is not necessarily the most effective way to crawl the Web.

Getting videos from YouTube

YouTube has a very rich API that developers can call to get information about videos, post new videos, etc. Lots of Web sites use this API to do all kinds of wonderful things with YouTube videos.

So suppose you wanted to create a little application that works like an “all news” channel. But instead of just showing you CNN or MSNBC or whatever channel, it gathers news from lots of different channels and aggregates them. You then have an “all news all the time” channel that shows lots of different stories and gives you different viewpoints on the same story. Fox News, for example, will have a much different take on a story than CNN or Al Jazeera.

The YouTube API gives you at least three ways to get new videos from a particular YouTube user. The simplest is to do a search and filter it by the YouTube author name. For example, this query:

http://gdata.youtube.com/feeds/api/videos?orderby=published&max-results=50&author=AssociatedPress&v=2

will give you the 50 most recent videos that were posted by the YouTube user “AssociatedPress”.

There’s a problem, though: the results in that feed will be delayed sometimes by more than 15 minutes. If I go to the AssociatedPress page on YouTube, I’ll often see videos there that do not show up in the feed returned by that query.

You can get up-to-date results by querying the individual user:

http://gdata.youtube.com/feeds/api/users/AssociatedPress/uploads?orderby=published&max-results=50&v=2

With that query, I’m able to get the most recent videos. Anything that shows up in the YouTube page for that user also shows up in the results I get back from the query.

But that’s just one source! If I want my news channel to get data from 50 different sources, I’d have to poll each one individually. That’s not terribly difficult, but it takes time. YouTube places limits on how often I can poll for videos. To keep from being throttled, you have to wait at least 15 seconds (a minute or longer if you don’t have a developer API key) between requests to the API. So getting the latest videos from all 50 sources will take somewhere between 12 and 50 minutes. Perhaps longer.

Not many sources post a video every 12 minutes, so there’s a lot of waste involved in polling all the time. And 50 minutes is way too long to wait for an update. 50 minutes? It’s not news anymore!

There is a third way. Go to your YouTube account and subscribe to the sources that you’re interested in. According to the documentation, you can subscribe to up to 2,000 different sources, and whenever you go to your subscriptions page you’ll see the most recent videos from those sources.

The really nice thing is that you can then do a single query to get the most recent videos from all your sources:

http://gdata.youtube.com/feeds/api/users/YourUserName/newsubscriptionvideos?orderby=published&max-results=50&v=2

Replace YourUserName in the URL above with the name of the user who subscribed to the sources you want to see.

With this, you can poll YouTube once every five minutes and get the most recent videos from all of your sources.
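
A bare-bones poller for that subscription feed, written against the Python standard library and assuming the feed comes back as ordinary Atom, might look like this. The user name, the five-minute sleep, and the printed fields are placeholders, and error handling is omitted.

# Poll the subscription-videos feed every five minutes and print new entries.
import time
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FEED = ("http://gdata.youtube.com/feeds/api/users/YourUserName/"
        "newsubscriptionvideos?orderby=published&max-results=50&v=2")

def poll_once(seen_ids):
    with urllib.request.urlopen(FEED, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    for entry in root.findall(ATOM + "entry"):
        video_id = entry.findtext(ATOM + "id")
        if video_id in seen_ids:
            continue
        seen_ids.add(video_id)
        print(entry.findtext(ATOM + "published"),
              entry.findtext(ATOM + "title"))

if __name__ == "__main__":
    seen = set()
    while True:
        poll_once(seen)
        time.sleep(300)    # five minutes between polls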

Another benefit is that you don’t have to change your program if you want to add or remove sources. All you have to do is log in to YouTube and edit your subscriptions.

Reduce your bandwidth usage and get your videos sooner. Sounds like a great deal to me.

Sniffing network traffic

My latest crawler modifications require me to scrape Web pages that host videos so that I can obtain metadata (title, description, date posted, etc.) that we place in our index.  Unfortunately, there’s no standard way for sites to present such information.  ESPN and Vimeo have HTML <meta> tags that provide some info, but I have to go parsing through the body of the document to find the date.  (And yes, I’m aware that Vimeo has an API that will make this a moot point.  I’ll be investigating that soon.)

Other sites are much worse in that they provide no metadata in the HTML at all.  For example, one site’s video page is very code-heavy, and reloading that page for every video you request would generate a lot of network traffic.  Their design instead uses JavaScript to request a particular video’s metadata from a server, so loading a new video involves downloading just a few kilobytes of data.

I spent some time this afternoon searching through a video page’s HTML and the associated JavaScript, looking for the magic incantation that would get me the data I’m after.  The amount of code involved is staggering, and I quickly went cross-eyed trying to decipher it before I hit on the idea of hooking up a sniffer to see if I could identify the HTTP request that gets the data.

It took me all of five minutes to download and install Free Http Sniffer, request a video from the site in question, and locate the magic line in the 230 or so requests that the page makes when it loads.  Problem solved.  Now all I have to do is write code that’ll transform a video page URL into a request for the metadata, and I’m set.

I have no idea why I didn’t think of the sniffer earlier.  I’d used one before for a similar purpose.  I suspect I’ll be making heavy use of it in the near future as I expand the number of sites that we crawl for media.

Multithreaded programming

I’ve been head-down here working on the Web crawler and haven’t had much occasion to sit down and write blog entries. It’s been a very busy but interesting and rewarding time. A high performance distributed Web crawler is a rather complex piece of work. I knew that it would be involved, but there’s a lot more to it than I originally thought. Managing hundreds of concurrent socket connections on a single machine and coordinating multiple machines gets complicated in a hurry.

It’s been a long time since I did any serious multithreaded programming, and I’m re-learning some lessons:

  • When there are multiple threads, everything is subject to concurrency issues.
  • Unless you can guarantee that a data structure cannot possibly be accessed by more than one thread of execution at a time, you will eventually experience a problem if you don’t protect it with some kind of synchronization lock (there’s a minimal example after this list).
  • You’ll find that guarantee very difficult to make.
  • It can take hours or days before your program crashes inexplicably.
  • It can take just as long to find that one place where you forgot to protect a global data structure with a mutual exclusion lock.
  • If you’re developing a multithreaded application on a single processor machine, you’ll invariably end up finding more bugs when you move to a dual processor machine where the computer really can do more than one thing at a time.
  • Learn how to use trace logging and prepare to spend a lot of time examining the log files for problems.
  • Race conditions and deadlocks are facts of life, difficult to predict, and painfully obvious in retrospect.
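
Here’s the minimal example promised above: two threads incrementing a shared counter. The iteration count is arbitrary; the point is that without the lock the increments can interleave and updates get lost, while with the lock the total always comes out right.

# Two threads updating a shared counter. Remove the lock and the final
# count becomes unpredictable, because "counter += 1" is not atomic.
import threading

counter = 0
lock = threading.Lock()

def worker(iterations=100000):
    global counter
    for _ in range(iterations):
        with lock:          # protects the read-modify-write sequence
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # 200000 every time, thanks to the lock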

It’s really good to be programming again.

Crawling the Web

I’m writing a Web crawler. Yeah, I know. It’s already been done. It seems like everybody’s done some Web crawling. But there’s a huge difference between dabbling at it and writing a scalable, high-performance Web crawler that can pull down hundreds or thousands of documents per second. Granted, processing more than a few dozen documents per second will require a heck of a lot more bandwidth than I currently have, and if I want to get to the thousands I’ll need multiple machines working together.

But first things first. Before you can scale up, you need something that works no matter how slow it is. This turns out to be more difficult than it seems at first glance. The idea is very simple: read a Web page, scan it for links, put those links in a queue, grab a link from the queue, lather, rinse, repeat. Ten billion times.

Ten billion is a big number. If you process 20 documents per second, it’ll only take you 500 million seconds to finish the job. That’s almost 16 years. Obviously, you need to process things much faster than that. It’s a bandwidth issue and a machine optimization issue. Let’s look at both.

The easy part is bandwidth. It’s hard to get an authoritative figure, but a good round number for average Web page size (HTML only) is about 25 kilobytes. The best throughput I’ve seen on my cable modem is about 5 megabits (approx 500 kilobytes) per second. So figure a sustained upper limit of 20 pages per second. That’s 20 HTML pages downloaded and links extracted. Doesn’t seem so tough, right? I mean, a 2 GHz computer with a 100 megabit network card can easily keep up with that.

It’s harder than it sounds. Not because the processor can’t keep up, but because it’s incredibly difficult to saturate the bandwidth. There’s a whole lot of latency involved in reading Web pages. Most of the time, the computer is just sitting there waiting for servers to return data. The solution is to keep multiple connections going so that while some are waiting, others are pulling down data. That turns out to be a non-trivial task. First, because managing dozens or hundreds of concurrent connections is hard, and second because domain name resolution takes an absurdly long time. And it’s often implemented very inefficiently.
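
As an illustration of keeping many requests in flight, here’s a sketch using a thread pool from Python’s standard library. It isn’t how my crawler does it (I’m working with an asynchronous HTTP library), and the worker count and timeout are arbitrary, but it shows why overlapping the waits matters.

# Fetch many URLs concurrently so the connection stays busy while slow
# servers are still responding. Illustrative sketch only.
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

def fetch_many(urls, workers=50):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for future in as_completed(futures):
            try:
                yield future.result()    # (url, body) pairs as they finish
            except Exception:
                pass                     # a real crawler logs failures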

There are many other issues you have to think about if you want to crawl the Web. You’d think that keeping track of the URLs you’ve visited and those that you haven’t visited yet would be easy. And it would be if the numbers were manageable. 10 billion documents means 10 billion URLs. Just the URLs themselves, stored naively, would take close to a terabyte (10 billion URLs at an average of 80 to 100 bytes apiece). So you have to come up with a memory-efficient way to store them and a time-efficient way to search them.

There are other issues as well. For example, did you know that www.mischel.com and mischel.com go to exactly the same place? That’s a trivial example, but you’ll find that some Web pages have many different URLs. If you don’t have some way of resolving those duplicates, then crawler traps, odd query strings, unique session IDs, and any number of other oddities will quickly have you chasing your own tail.
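
A first line of defense is simple URL canonicalization: lowercase the scheme and host, strip the fragment, drop parameters that are known to be session IDs, and so on. The specific rules below (folding “www.” into the bare domain, a short list of session parameter names) are examples, not a complete solution; real duplicate detection goes much further.

# Naive URL canonicalization. Enough to show the idea; nowhere near enough
# to catch every duplicate or crawler trap.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}    # example names only

def canonicalize(url):
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]                  # treat www.mischel.com as mischel.com
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    path = parts.path or "/"
    return urlunparse((parts.scheme.lower(), host, path,
                       parts.params, urlencode(query), ""))    # drop fragment

# canonicalize("http://WWW.Mischel.com/index.html?PHPSESSID=abc123#top")
# -> "http://mischel.com/index.html"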

Today I finally got my first multiple-connection web crawler up and limping. It took a whole lot longer than I thought it would because the asynchronous HTTP library I downloaded was buggy and it took me forever to figure out how to make Python break out of an infinite loop. For all that’s good about Python, the development and debugging environment is distressingly primitive.

All of which is to say that I’ve latched onto a project that is very interesting and challenging, and doesn’t leave me time for much else. When I’m not banging on code, I’m reading research papers about crawling, storing, indexing, and searching large amounts of data, or thinking about better ways to implement parts of my crawler. It’s exhausting, but pleasantly so.