This is the first of a series of posts about writing a custom Web crawler. It assumes some knowledge of what a Web crawler is, and perhaps what crawlers are typically used for. I don’t know how many posts this subject will require, but it could be a rather long series. It turns out that Web crawlers are much more complicated than they look at first.
Articles in this series:
Crawling Models
Politeness
Queue Management: Part 1
When you hear the term “Web crawler,” it’s likely that your first thought is Google. Certainly, Google’s crawler is better known than any other. It’s probably the largest, as well. There are many other crawlers, though: Microsoft’s Bing search runs one, as do Blekko and other search companies (Yandex, Baidu, the Internet Archive, etc.). In addition, there are a few open source crawlers such as Nutch, Heritrix, and others. There are also commercially licensed crawlers available.
The other thing that typically comes to mind when you think of Web crawlers is search engines. Again, Google’s crawler is used primarily to gather data for the Google search engine. When speaking of crawlers, the big search engines get all the press. After all, they’re solving big problems, and their solutions are impressive in their sheer scale alone. But that’s not all that crawlers are good for.
The big search engines run what we call “general coverage” crawlers. Their intent is to index a very large part of the Web to support searching for “anything.” But there is a huge number of smaller crawlers: those that are designed to crawl a single site, for example, or those that crawl a relatively small number of sites. And there are crawlers that try to scour the entire Web to find specific information. All of these smaller crawlers are generally lumped together into a category called focused crawlers. The truth is that even the largest crawlers are focused in some way–some more tightly than others.
Smaller focused crawlers might support smaller, targeted, search engines, or they might be used to find particular information for a multitude of purposes. Perhaps a cancer researcher is using a crawler to keep abreast of advances in his field. Or a business is looking for information about competitors. Or a government agency is looking for information about terrorists. I know of projects that do all that and more.
Another class of program that automatically reads Web pages is the screen scraper. In general, the difference between a screen scraper and a crawler is that a screen scraper is looking for specific information on specific Web pages. For example, a program that reads the HTML page for your location from weather.com and extracts the current forecast would be a screen scraper. Typically, a scraper is a custom piece of software written to parse a specific page, or a small set of pages, for very specific information. A crawler is more general: crawlers are designed to traverse the Web graph, following links from one HTML page to another. Many programs share attributes of both crawlers and screen scrapers, so there is no clear dividing line between the two. But, in general, a crawler is an explorer that’s given a starting place and told to wander and find new things, while a scraper is directed to specific pages from which it extracts very specific information.
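To make the distinction concrete, here’s a minimal sketch in Python (standard library only). The scrape and crawl functions, the LinkExtractor helper, and the extract callback are illustrative assumptions of mine, not code from MLBot or any real scraper; a production crawler would add politeness delays, robots.txt handling, retry logic, and much more.

    # Minimal sketch: scraper vs. crawler. Illustrative only -- the names and
    # structure here are assumptions, not a real production design.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags -- the crawler's view of a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def scrape(url, extract):
        """Scraper: fetch ONE known page and pull out ONE specific piece of data.
        'extract' is a function that knows that page's exact structure."""
        html = urlopen(url).read().decode("utf-8", errors="replace")
        return extract(html)


    def crawl(seed_urls, max_pages=100):
        """Crawler: start from seed URLs and wander the link graph,
        discovering pages it was never explicitly told about."""
        frontier = deque(seed_urls)
        seen = set(seed_urls)
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue  # a real crawler would log, retry, and back off politely
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                # follow only http(s) links we haven't already queued
                if absolute.startswith(("http://", "https://")) and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return seen

The scraper’s knowledge lives entirely in its extract function; the crawler’s knowledge is essentially “follow the links.” That difference is what leads to frontier queues, politeness policies, and the other topics this series covers.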
My credentials
I’ve spent a good part of the last five years writing and maintaining a Web crawler (MLBot) that examines somewhere in the neighborhood of 40 million URLs every day. The crawler’s primary purpose is to locate and extract information from media (video and audio) files. The crawler’s basic design was pretty well fixed after three months of development, but it took about a year before we had “figured it out.” Even then, it took another year of tweaking, refactoring, and even some major architectural changes before we were happy with it. Since that time (the last three years), we’ve mostly made small incremental changes and added some features to recognize and specially process particular types of URLs that either didn’t exist when we started crawling, or that have become more important to us over time.
That said, I won’t claim to be an “expert” on crawling the Web. If there’s one thing my experience has taught me, it’s that there’s a whole lot about the Web and how to crawl it that I just don’t know. As a small (four-person) startup, we all wear many hats, and there are many things to be done. Although we know there are things our crawler could do better, we just don’t have the time to make those improvements. We still discuss possible modifications to the crawler, and in many cases we have a very good idea of what changes would solve particular problems. But there are still hard problems that we know would take significant research to solve, and it’s unfortunate that we don’t have the resources to pursue them. For now, the crawler does a very good job of finding the information we want.
I have to point out here that, although I wrote almost all the code that makes up the crawler, I could not have done it without the help of my business partners. My understanding of the many challenges associated with crawling the Web, and the solutions that I implemented in code, are the result of many hours spent sitting on the beanbags in front of the whiteboard, tossing around ideas with my co-workers. In addition, major components of the crawler and some of the supporting infrastructure were contributed by David and Joe, again as a result of those long brainstorming sessions.
Based on the above, I can say with some confidence that I know a few things about crawling the Web. Again, I won’t claim to be an expert. I do, however, have a crawler that finds, on average, close to a million new videos every day from all over the Web, using a surprisingly small amount of bandwidth in the process.
Why write a Web crawler?
As I pointed out above, there are many open source and commercially licensed Web crawlers available. All can be configured to some degree through configuration files, add-ons or plug-ins, or by directly modifying the source code. But every crawler you come across imposes a particular crawling model, and most make assumptions about what you want to do with the data the crawler finds. Most of the crawlers I’ve seen assume some kind of search engine. While it’s often possible to configure or modify those crawlers to do non-traditional things with the data, doing so is not necessarily easy. In addition, many of the crawlers assume a particular back-end data store and, again, while it’s possible to use a different data store, the assumptions are often deeply rooted in the code and in the crawler’s operation. It’s often more difficult to modify an existing crawler to do something non-traditional than it is to just write what you want from scratch.
For us, the decision to build our own was not a hard one at all, simply because five years ago the available crawlers were not up to the task we envisioned; modifying any of them to do what we needed just wasn’t possible. That might be possible today, although the research I’ve done leads me to doubt it. And I’m fairly certain that even if I could make one of those crawlers work as we need it to, the result would require more servers and a lot more disk space than we currently use.
If you’re thinking that you need a Web crawler, it’s definitely a good idea to look at existing solutions to see if they will meet your requirements. But there are still legitimate reasons to build your own, especially if your needs fall far outside the lines of a traditional search engine.
My next post in this series will talk about different crawling models and why the traditional model that’s used in most crawler implementations, although it works well, is not necessarily the most effective way to crawl the Web.