When you’re writing a Web crawler, it’s important for you to understand that your crawler is using others’ resources. Whenever you download a file from somebody else’s server, you’re consuming their bandwidth and server resources. Most site operators welcome well-behaved crawlers, because those crawlers provide exposure, which means potentially more visitors. Ill-behaved crawlers are not welcomed, and often are banned. If you operate an ill-behaved crawler, it will be banned not only by the sites on which it misbehaves, but also by many other sites, even sites your crawler has not yet visited.
Site operators do talk amongst themselves. There are active forums on which operators can ask questions about crawlers, and on which they post reports about ill-behaved crawlers. Some operators will ban a crawler that gets one bad report.
Site operators have the right to ban any client for any reason, and the responsibility to their legitimate users to ban crawlers and other bots that interfere with normal site operation. As the author of a Web crawler, it is your responsibility to operate it within accepted community standards and to ensure that it doesn’t interfere with operation of the sites that it crawls.
I can tell you from experience that an ill-behaved crawler will be banned, the IP addresses it crawls from reported widely on Webmaster forums, and complaints about the bot will be lodged with the ISP. Repeated reports can result in your ISP terminating your service, and exceptional cases could result in civil litigation or criminal charges. It’s very difficult to repair the reputation of an ill-behaved crawler, so it’s important that your crawler adhere to community standards from the start.
Unfortunately, those community standards are at best vaguely defined. When it comes to accepted behavior there are grey areas that site operators will interpret differently. In addition, some operators aren’t familiar with the well-known community standards, or have wildly different interpretations of their meanings. Many of the rules I outline below aren’t written anywhere else that I know of, but instead have been developed by us in response to questions, comments, and complaints from site operators who have contacted us about our crawler.
If I had to sum it up in a single rule, it would be:
You are crawling at somebody else’s expense. Act accordingly.
The rules below apply to all types of bots, not just the adaptive crawlers that this series focuses on.
Web servers keep a log of every request that comes in. That log typically includes, among other things, the date and time, the IP address from which the request was made, the URL that was requested, and a user agent string that identifies the client making the request. There is nothing you can do to prevent that information from being stored in the server log. Often, a site operator will be interested to see who is making requests, and he will use the IP address and user agent strings to determine that. Browser user agent strings are pretty easy to identify. For example, here’s a list of Internet Explorer user agent strings. If an operator sees blank or cryptic user agent strings in his log, he might ban that IP address on principle, figuring it’s an ill-behaved bot or at minimum somebody trying to “hide.”
The first thing you should do is come up with a name for your crawler and make sure that that name is included in the User-Agent HTTP header with every request the crawler makes. That string should include your crawler’s name and a link to a page that contains information about the crawler. For example, our User-Agent string is “MLBot (www.metadatalabs.com/mlbot)”.
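Setting that header is a one-liner in most HTTP libraries. Here’s a minimal sketch in Python using the MLBot string from above (the `build_request` helper name is my own):

```python
import urllib.request

# The crawler's identifying string: a name plus a link to its information page.
CRAWLER_USER_AGENT = "MLBot (www.metadatalabs.com/mlbot)"

def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the crawler on every fetch."""
    return urllib.request.Request(url, headers={"User-Agent": CRAWLER_USER_AGENT})
```

Whatever library you use, the point is that the header is set once, centrally, so no request can go out anonymously.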
The page that contains information about your crawler should contain, at minimum:
- General information about your crawler. This statement should describe you or your company, the kind of information the crawler is looking for, and what you will be doing with the information. It’s a very good idea to mention if your crawler is part of a school project, because many operators will give more leeway and can offer helpful suggestions.
- If you think that your product will benefit the people whose sites you’re crawling, you should mention those benefits. For example, “I am building a comprehensive index of information about steam trains. When published, this index will drive many users who are interested in steam trains to your site.”
- List the IP addresses that your crawler operates from. If you have static IP addresses, this is very easy to do. If you’re crawling from home or from some other location that has dynamic IP, then you should update this list as often as you can to show the current IP addresses. It is also a good idea to sign up with a dynamic IP service and list the URL that will resolve to your server. Interested parties can then use NSLOOKUP to resolve that name with your current IP address.
- You should mention that your crawler respects the robots.txt standard, and say which extensions (if any) it supports. See robots.txt, below.
- Include information about how to block your crawler using robots.txt. For example, our FAQ page shows this:
```
User-agent: MLBot
Disallow: /
```
You might also show examples of blocking specific directories, and you should certainly include a link to the robots.txt page.
- Include an email address or form that people can use to contact you about your crawler.
It’s surprising how effective that page can be. Site operators are rightly suspicious of random crawlers that have no supporting information. By posting a page that describes your crawler, you rise above the majority of small-scale bots that continually plague their servers. I won’t say that operators won’t still look at you with suspicion, but they won’t typically block you out of hand. You might still be banned, especially if your crawler is ill-behaved, but the information page gives your crawler an air of legitimacy that it wouldn’t otherwise have.
I know how hard it can be to tear yourself away from working on your crawler, especially if you find writing difficult. But you should post about your crawler from time to time on your blog or FAQ site. Post about new additions, about the kinds of things you’re finding, and the interesting things you’re doing with the data you collect. If a visitor sees recent posts, he knows that you’re still actively working on your project. This isn’t directly related to politeness, but it does help improve your image.
It also helps to keep up on the latest developments in the world of search engine optimization, public sentiment about crawlers, and what other crawlers are doing. Accepted standards change over time. What was acceptable from a crawler a few years ago might now be considered impolite. It’s important that your crawler’s behavior reflect current standards.
The Robots Exclusion Standard is a consensus agreement that has achieved the status of an informal standard. Unfortunately, it’s never been approved as a “real” standard in the way that HTTP, SMTP, and other common Internet protocols are. As a result, there are many extensions to the standard, some contradictory. In 2008, Google, Microsoft, and Yahoo agreed on a set of common extensions to robots.txt, and many less visible crawlers followed suit. If you follow the lead of the major search engine crawlers, nobody can fault you for your handling of robots.txt. See Major search engines support robots.txt standard for more information.
At minimum, your crawler must support proper handling of the Disallow directive. Anything beyond that is optional, but helpful. You should support Sitemaps, if they’re relevant to you, and wildcards. Some of the major crawlers support the Crawl-delay directive as well, and if you can do it, you should.
Related to robots.txt are the HTML Meta tags that some authors use to prevent crawling or indexing of particular pages. Some authors do not have access to their robots.txt file, so their only option is to use Meta tags in the HTML. If you are crawling or indexing HTML documents, you should support these tags.
Proper handling of robots.txt can be kind of confusing. I’ve written about this in the past. See Struggling with the Robots Exclusion Standard and More On Robots Exclusion for more information. In addition, Google’s robots.txt page has some good description of how robots.txt should work.
There are many robots.txt parser implementations available in different languages. A search for “robots.txt parser” should give you plenty of hits for whatever language you’re working in. You’re much better off using one that somebody else wrote than trying to implement it yourself.
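As an example of leaning on an existing implementation, Python’s standard library ships a serviceable parser in urllib.robotparser. A quick sketch, using a hypothetical robots.txt:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse robots.txt contents directly (you'd normally fetch them first).
rp.parse([
    "User-agent: MLBot",
    "Disallow: /private/",
    "Crawl-delay: 10",
])

rp.can_fetch("MLBot", "http://example.com/private/page.html")  # disallowed
rp.can_fetch("MLBot", "http://example.com/index.html")         # allowed
rp.crawl_delay("MLBot")                                        # 10
```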
You’ll want to cache the robots.txt files that you download, so that you don’t have to request robots.txt every time you request a URL from a domain. In general, a 24 hour cache time is acceptable. Some large crawlers say that they cache robots.txt for a week! I’ve found that I get very few complaints with a cache time of 24 hours. A simple MRU caching scheme with a buffer that holds one million robots.txt files works well for us. We crawl perhaps 40 million URLs per day. Your cache size will depend on how many different domains you typically crawl per day.
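A bounded cache with a 24-hour expiry can be sketched in a few lines. This version evicts the least recently used entry when full; the class name is my own, and the capacity and clock are parameters so the behavior is easy to test:

```python
import time
from collections import OrderedDict

class RobotsCache:
    """Bounded robots.txt cache with time-based expiry."""

    def __init__(self, capacity=1_000_000, ttl_seconds=24 * 3600):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._entries = OrderedDict()  # domain -> (fetched_at, robots_text)

    def get(self, domain, now=None):
        now = time.time() if now is None else now
        item = self._entries.get(domain)
        if item is None:
            return None
        fetched_at, text = item
        if now - fetched_at > self.ttl:    # stale entry: force a re-fetch
            del self._entries[domain]
            return None
        self._entries.move_to_end(domain)  # mark as recently used
        return text

    def put(self, domain, text, now=None):
        now = time.time() if now is None else now
        self._entries[domain] = (now, text)
        self._entries.move_to_end(domain)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
```

A miss (None) means the crawler should download robots.txt again and put the fresh copy back in the cache.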
A common mistake, by the way, is for a site to report the file type of robots.txt as “text/html”. A strict interpretation of the rules says that you don’t have to recognize such a file. I’ve found that those files typically are plain text (i.e. they’re formatted correctly), but have the wrong MIME type attached to them. Our crawler will try to parse anything that calls itself robots.txt, regardless of the MIME type.
A single computer, even operating on a cable modem, can cause significant disruption to a Web server if it makes requests too often. If you have a multi-threaded crawler and a whole bunch of URLs from a single site, you might be tempted to request all of those URLs at once. Your cable modem can probably handle you making 30 concurrent requests to a server. And the server can probably handle you doing that … once. If your crawler is making many concurrent requests over a long period of time, it will impact the server. And when the owner figures out why his server stopped responding to customer requests, your bot will be blocked and possibly reported to your ISP as taking part in a denial of service attack. You must limit the rate at which your crawler makes requests to individual sites.
I noted above that some crawlers support the Crawl-delay directive in robots.txt. That directive lists the number of seconds your bot should delay between requests to the site. Not all crawlers support that directive, and relatively few robots.txt files that I’ve seen actually contain it. That said, you should support Crawl-delay if you can. But it can’t be your only means of limiting your crawler’s request rate. A technique that works well for us is to keep an average of the server’s response time over the last few requests, and then use a multiplier to come up with a delay time.
For example, say that the last five requests to the example.com domain averaged 1,500 milliseconds each, and you use a multiplier of 30. In that case, you would hit example.com once every 45 seconds. You can use a smaller multiplier, but I would caution against hitting any site more frequently than once every 15 seconds. Some larger sites will allow that, but many smaller sites would not be happy about it. It also depends on how many total requests you’re making. Hitting a site four times a minute for two minutes probably won’t raise any red flags. If you do it all day long, they’re going to wonder why you’re crawling them so often.
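That calculation is simple enough to show directly. A sketch, with the function name mine and the 15-second floor from the paragraph above made a parameter:

```python
def polite_delay(recent_response_times_ms, multiplier=30, floor_seconds=15):
    """Seconds to wait before the next request to one site: a multiple of
    the server's average response time, never less than a politeness floor."""
    avg_ms = sum(recent_response_times_ms) / len(recent_response_times_ms)
    return max(avg_ms * multiplier / 1000.0, floor_seconds)

# Five requests averaging 1,500 ms with a multiplier of 30 -> 45 seconds.
polite_delay([1500, 1500, 1500, 1500, 1500])  # 45.0
```

The floor keeps the crawler from hammering a very fast server: a 100 ms average times 30 is only 3 seconds, which many small sites would resent.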
In our crawler, the politeness delay is integrated with the queue management. I’ll discuss it in more detail when I cover queue management.
Take only what you need
To a crawler, the Web is an infinite free buffet. But as I said previously, it’s not really free. Every document you download consumes the server’s bandwidth and resources. If you go to an all-you-can-eat place, load your plate with food, eat just the good stuff, and then go back for seconds, the manager will probably kick you out. The rule at the buffet is “Take only what you need, and eat all you take.” The rule is the same when crawling the Web. If you download a bunch of stuff that you don’t need, consuming somebody else’s server resources only to discard the documents, they’re likely to block you. Take only what you can use.
This is one rule that not only makes your crawler more polite, but also makes it more efficient.
An early version of my crawler was looking for links to MP3 files in HTML documents. The crawler’s general algorithm was this:
```
Download file
If it's an HTML file
    parse the HTML for links
    queue new links
else if it's an MP3 file
    extract metadata
    store link and metadata
else
    discard the document
```
That works well, but the program downloaded a huge number of documents that it would never be able to use! For example, about 20% of the documents it was downloading were image files! The crawler couldn’t do anything with a .gif or a .jpeg, but it was downloading them nonetheless. It was an incredible waste of others’ server resources and our bandwidth. By modifying the crawler so that it checked for common file extensions before queuing a URL, we increased our productivity by over 25%, and also made the crawler more polite. We were no longer taking more than we could use.
There’s no guarantee, of course, that a URL ending in “.jpg” is an image. There’s a high likelihood, though. Our experiments at the time showed that if a URL ended in “.jpg”, “.jpeg”, “.bmp”, or “.gif” (among many others), the probability of it being an image file was above 98%. Of the remaining two percent, only a very small fraction were HTML files, and none were audio files. Blocking those extensions cost us nothing and gave a huge benefit.
Images aren’t the only files that most crawlers will find useless. If you’re looking for HTML documents, then you can safely ignore URLs with extensions that are common to images, audio, video, compressed archives, PDFs, executables, and many more. I don’t have an exhaustive list of extensions that you should ignore, but the following is a good start:
```
".asx",  // Windows video
".bmp",  // bitmap image
".css",  // Cascading Style Sheet
".doc",  // Microsoft Word (mostly)
".docx", // Microsoft Word
".flv",  // old Flash video format
".gif",  // GIF image
".jpeg", // JPEG image
".jpg",  // JPEG image
".mid",  // MIDI file
".mov",  // QuickTime movie
".mp3",  // MP3 audio
".ogg",  // Ogg format media
".pdf",  // PDF file
".png",  // PNG image
".ppt",  // PowerPoint
".ra",   // RealMedia
".ram",  // RealMedia
".rm",   // RealMedia
".swf",  // Flash file
".txt",  // plain text
".wav",  // WAV format sound
".wma",  // Windows Media audio
".wmv",  // Windows Media video
".xml",  // XML file
".zip",  // ZIP archive
".m4a",  // MP4 audio
".m4v",  // MP4 video
".mp4",  // MP4 video or audio
".m4b",  // MP4 video or audio
```
If your crawler derives information from some of those file types, then of course you shouldn’t block them.
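The check itself is cheap, because you’re already parsing URLs anyway. A sketch, with an abbreviated version of the list above (`should_queue` is my own name):

```python
from urllib.parse import urlparse

# Abbreviated from the extension list above; extend to taste.
SKIP_EXTENSIONS = {
    ".asx", ".bmp", ".css", ".gif", ".jpeg", ".jpg", ".mp3", ".mp4",
    ".pdf", ".png", ".swf", ".wav", ".wmv", ".zip",
}

def should_queue(url: str) -> bool:
    """Reject URLs whose extension says the content is almost certainly useless."""
    path = urlparse(url).path.lower()
    dot = path.rfind(".")
    if dot == -1:
        return True  # no extension: it may well be HTML
    return path[dot:] not in SKIP_EXTENSIONS
```

Note that the check works on the URL path only, so a query string like `?img=photo.jpg` doesn’t trigger a false rejection.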
I don’t know what the distribution of files is on the Web these days. When I created that list four or five years ago, those extensions made up well over 25% of the files my crawler was encountering. Today, the crawler’s mix is something like 91% HTML files, 5% files with no extension (i.e. http://example.com/), 3% video, and every other file type is less than 0.01%. I’ve blocked lots of extensions other than those I identified above, but together they made up less than 1% of the total number of URLs I encountered.
I determined the extensions to block by logging (see below) every request the crawler makes, along with the HTTP headers that it gets back in the response. By correlating the file extension with the Content-Type header, I was able to get a list of extensions that rarely (if ever) lead to a file that I’m interested in.
Preventing your crawler from downloading stuff it can’t use makes you look more professional in the eyes of the people who own the servers you’re visiting, and also makes your crawler more efficient. You have to parse URLs anyway (another subject I’ll get to at some point), so checking for common useless extensions isn’t going to cost you significant processor time. Eliminating those useless files early makes queue management easier (less churn in the queue), and prevents you from wasting your precious bandwidth on things you can’t use.
Maintain an excluded domains file
There are some domains that, for various reasons, you don’t want your crawler to access. When we were crawling for audio files, for example, we encountered a lot of sites that had pirated music or that contained nothing but 30-second music samples. We didn’t want those files, but as far as the crawler and the ML system were concerned, those sites were gold. They had nuggets galore. Our only way to block them was to add them as exclusions.
Another reason you might want to block a site is in response to a complaint. Some site operators do not want you to crawl, period, and do not have access to robots.txt or the htaccess file in order to block your crawler. If somebody doesn’t want you on their site, you should honor their wishes.
You can go a long way with a simple text file that lists one domain per line, and perhaps a comment that says why you blocked it and when. Here’s a sample:
```
# sites that have requested that we not crawl
site1.com
site93.com
blog.somebody.net
# link farms that return nothing good
spamsiteA.com
spamsiteB.info
```
We created a file called NeverCrawl.txt in a common location. The crawler checks the last modified date on that file every 15 minutes or so. If the last modified date has changed, the crawler reads and parses the file, adding the entries to a hash table that’s checked whenever a URL is extracted from a Web page. The beauty of this solution is that it’s simple to add an exclusion (just open the text file, add the domain, and save the file), and the file format is easy to parse. To my knowledge this exclusion list has never failed. It’s also very nice being able to say to somebody, “I have instructed the crawler never to visit your site again. That exclusion should take effect within the next half hour.”
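The whole mechanism fits in a small class. A sketch, with names mirroring the description above but implementation details my own:

```python
import os

def load_exclusions(path):
    """Parse a one-domain-per-line exclusion file; '#' starts a comment line."""
    domains = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                domains.add(line.lower())
    return domains

class ExclusionList:
    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._domains = set()

    def refresh(self):
        """Re-read the file only when its modified time changes
        (call this periodically, e.g. every 15 minutes)."""
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            self._mtime = mtime
            self._domains = load_exclusions(self.path)

    def is_excluded(self, domain):
        return domain.lower() in self._domains
```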
It would be nice, at times, to have a richer syntax for specifying which sites to block, but the cost of doing so is pretty high in comparison to its benefit. We could add regular expression support so that we could block any URL that matches a particular query string syntax, regardless of domain. That would catch Web proxies, for example, and many link farm sites. But it complicates the parsing code and could affect reliability. It’s certainly something we’d like to do at some point, but for now the cost outweighs the benefit.
Don’t go overboard on trying to identify sites to block. I know from experience that you could spend the rest of your days adding things to your exclusion file. There are approximately 100 million registered domains that have at least one viable Web page. Add to that the subdomains and the number probably approaches a billion. The vast majority of those sites don’t have anything of interest to you on them. But trying to identify and list them in a file would be impossible. Besides, if you did list 90 million sites, you wouldn’t be able to keep the exclusion list in memory.
You should depend on your ML system to keep you away from unproductive sites. Use the exclusion list to keep you away from things that the ML system thinks are productive, and from sites that you have been asked not to crawl.
Your crawler is going to make a lot of requests. Even if you’re running a single-threaded bot that’s making an average of one request per second, you’re going to make more than 85,000 requests per day. You want to have a log of every request for two reasons:
- You will receive a complaint about your crawler’s behavior. The person complaining often will not supply dates and times, and if he does you’ll want to verify that it really was your crawler. It might be some other crawler that just happens to have the same name as yours, or it might be somebody spoofing your bot.
- You’ll want to analyze the logs to find link farms, crawler traps, and other unproductive sites, errors, and to find file extensions that are (almost) always unproductive.
You should log as much as you can with every request. At minimum you should save the date and time of the request, the request URL, the server’s HTTP response code (200, 404, 302, etc.), the Location header (for redirects), the response URI (the address of the page that responded), and the total amount of time required for the request and response. Optionally, you can save the actual response (the HTML file received, for example), although depending on your process you might want to save that elsewhere.
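Any structured, append-only format works for this. A CSV sketch with the fields just listed (the field and function names are mine):

```python
import csv
import datetime
import io

LOG_FIELDS = ["timestamp", "url", "status", "location", "response_uri", "elapsed_ms"]

def make_logger(stream):
    """Attach a CSV request log to any writable stream (a daily file, say)."""
    writer = csv.DictWriter(stream, fieldnames=LOG_FIELDS)
    writer.writeheader()
    return writer

def log_request(writer, url, status, location, response_uri, elapsed_ms):
    writer.writerow({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "url": url,
        "status": status,
        "location": location or "",    # Location header, when redirected
        "response_uri": response_uri,  # address of the page that responded
        "elapsed_ms": elapsed_ms,
    })

# Example: log one request to an in-memory buffer.
buf = io.StringIO()
writer = make_logger(buf)
log_request(writer, "http://example.com/a.html", 200, None,
            "http://example.com/a.html", 312)
```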
I’ve found it very helpful to log the entire contents of any robots.txt file that I download. I’ve been able to solve some curious crawler behavior by looking at the logs to see when the crawler last downloaded robots.txt and what the file said at the time. I’ve also been able to help some operators who had bad syntax in their robots.txt, or had saved it as a .doc or .html file.
A log is essential for debugging, performance analysis, and responding to customer complaints. Keep your logs for a month. I’ve rarely had to go back more than a week or two in order to resolve a complaint, but a month gives a good buffer. And analysis done on a full month of logs can provide better data than just a day or a week worth of info. If space is a problem, compress the logs. They typically compress at the rate of five or ten to one.
Responding to complaints
Regardless of how polite your crawler is, you will receive complaints. Most of the complaints will be from reasonable people who are more curious than angry, but from time to time you will receive a scathing letter from somebody who seems completely unreasonable. Do not take an aggressive or confrontational tone with anybody who complains about your crawler’s behavior. Every complaint, regardless of its tone, is best handled by a friendly reply that thanks the person for the problem report, apologizes for problems, expresses concern at the perceived misbehavior, and asks for more information if you need it in order to research the problem. Here’s an example:
Dear Mr. Webmaster:
Thank you for contacting me about my Web crawler. I make a concerted effort to ensure that the crawler operates within accepted community standards, and I apologize if it caused problems on your site.
I would like to research this matter to discover the source of the problem, which could very well be a bug in the crawler. Can you please send me the domain name for your site, and the approximate date and time from your Web server’s logs? It would be best if I could see the exact entry from your server log. That way I can match up the crawler’s logs with your information and more reliably discover the source of the problem.
Thank you again for notifying me of this problem. I will contact you again when I have discovered the source of the problem.
For more information about my crawler and what I’m doing with the information the crawler collects, please visit my information page at http://example.com/crawlerinfo.html
You would be surprised at how effective such a simple response can be. We have received some strongly worded complaints in the past that evoked pictures of slathering monsters out for blood. A response similar to the one above resulted in a letter of apology, some help diagnosing a problem (which turned out to be a misunderstanding on the part of the site operator), and a recommendation from that person on a Webmaster forum, saying that people should allow our bot to crawl. If we had responded to his initial complaint in kind, we would have made an enemy who very likely would have posted negative information about us. Instead, this person became an advocate and publicly praised us for the way we operate. It’s difficult to imagine a better outcome.
After the initial response, research the matter thoroughly and report your findings to the person who complained. If it was a bug in your crawler, admit it. Furthermore, post a blog entry or note on your crawler’s site that explains the error, how it manifests, how long it was going on, and when it was fixed. If you can’t fix the problem immediately, post information about when it will be fixed and how to avoid it if at all possible. It pays to be up front about your crawler’s bugs. If people think you’re being open and honest, they will be much more likely to forgive the occasional error.
But know that you can’t please everybody. There will be people who, for whatever reason, will not accept your explanation. That’s what the exclusion list is for. If you’re unable to have a productive conversation with somebody who complains, apologize for the problem, add their domain to your exclusion list, and let them know that your crawler will not bother them again. There’s just nothing to be gained by arguing with an unreasonable person. Cut your losses and move on.
It’s also a good idea to monitor Webmaster forums for complaints about your crawler. You can set up a Google alert that will notify you whenever Google’s crawler encounters a new posting about your crawler. It’s a good way to gauge your crawler’s reputation, and a way that you can respond in public quickly to any complaints. The rules above still apply, perhaps even more strongly. Be courteous and helpful. Flaming somebody in a public forum will not improve your crawler’s reputation.
I covered a lot of ground in this posting. Politeness, in my opinion, is the most important issue you face when operating a Web crawler. Everything your crawler does will be controlled by the politeness policy. There are plenty of technical hurdles, but if your crawler gets a bad reputation you will find it blocked by a large number of sites, including many that have valuable information that you’re interested in. No amount of technical wizardry will allow you to get past those blocks, and you will find it very difficult to repair a bad reputation.
Design in politeness from the start. You’ll regret it if you don’t.
This isn’t the last word on politeness. Many of the issues I’ll cover later in this series have politeness ramifications. One example is the DUST (different URL, similar text) problem: preventing your crawler from visiting the same page repeatedly, using a different URL each time.
In my next installment, I’ll begin talking about queue management, which is the most technically challenging part of an adaptive crawler. It’s also related to politeness in many ways.