Reducing bandwidth used by crawlers

Some site operators block web crawlers because they’re concerned that the crawlers will use too much of the site’s allocated bandwidth. What they don’t realize is that most companies operating large-scale crawlers are far more concerned with bandwidth usage than the site operators themselves are. There are several reasons for this concern:

  • The visible Web is so large that no crawler can examine the entire thing in any reasonable amount of time. By the best estimates, even Google covers only about 25% of the visible Web.
  • The Web grows faster than the ability to crawl it grows.
  • It takes time (on average between one and two seconds) to find, download, and store a Web page. Granted, a large crawler can download thousands of pages per second by working in parallel, but each page still takes time.
  • It takes still more time, storage, and CPU power to store, parse, and index each downloaded page.

I suspect that the large search engines can give you a per-page dollar cost for locating, downloading, storing, and processing. That per-page cost would be very small, but when you multiply it by 25 billion (or more!) pages it’s a staggering amount of money–a cost that’s incurred every time they crawl the Web. As you can imagine, they have ample incentive to reduce unnecessary crawling as much as possible. In addition, time and bandwidth spent downloading unnecessary pages mean that some previously undiscovered pages go unvisited.

The HTTP specification includes something called a conditional GET. It’s a way for a client to request that the server send the page only if it meets some criteria. The specification identifies several different criteria, one of which is called If-Modified-Since. If the client has seen the page before and has saved the page and the date it received the page, then the client can send a request to the server that says, in effect, “If the page has changed since this date, then send me the page. Otherwise just tell me that the page hasn’t changed.” Here, this date would be replaced with the actual date on which the client last saw the page.
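
To make that concrete, here’s a minimal sketch of a conditional GET from the client’s side, written with Python’s standard urllib; the URL and the saved date are made-up placeholders:

    import urllib.request
    import urllib.error

    # Hypothetical page and last-fetch date, purely for illustration.
    url = "https://example.com/index.html"
    last_seen = "Sat, 01 Oct 2022 12:00:00 GMT"   # saved from an earlier visit

    request = urllib.request.Request(url, headers={"If-Modified-Since": last_seen})
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()   # 200 OK: the page changed, download it again
            print(response.status, "-", len(body), "bytes downloaded")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print("304 Not Modified - reuse the cached copy")   # headers only, no body
        else:
            raise

Note that urllib reports the 304 as an HTTPError even though it isn’t really an error; that except branch is where a crawler would decide to reuse its stored copy.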

If the server supports If-Modified-Since (which almost all do), there is a big difference in how much bandwidth is used. If the Web page has not been modified, the server responds with a standard header and a 304 Not Modified status code: total payload maybe a few hundred bytes. That’s a far cry from the average 30 kilobytes for an HTML page, or the hundreds of kilobytes for a page that has complicated scripts and lots of content.

The only catch is that server software (Apache, IIS, etc.) only supports If-Modified-Since for static content: pages that you create and store as HTML on your site. If your site is dynamically generated with PHP, ASP, Java, etc., then the script itself has to determine whether the content has changed since the requested date and act accordingly by sending the proper response. If that describes your site, it’s a good idea to ask your developers whether your scripts support If-Modified-Since.
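
What such a script has to do is straightforward. Below is a minimal sketch in Python using only the standard library; the fixed modification time and page body are stand-ins for whatever your application actually generates:

    from datetime import datetime, timezone
    from email.utils import format_datetime, parsedate_to_datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Stand-in for the real "last changed" time of the dynamic content;
    # a production script would derive this from its database, file
    # timestamps, or whatever else drives the page.
    CONTENT_MODIFIED = datetime(2023, 1, 1, tzinfo=timezone.utc)
    BODY = b"<html><body>Generated page</body></html>"

    class ConditionalHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ims = self.headers.get("If-Modified-Since")
            if ims:
                try:
                    # Unchanged since the client's date: send headers only.
                    if CONTENT_MODIFIED <= parsedate_to_datetime(ims):
                        self.send_response(304)
                        self.end_headers()
                        return
                except (TypeError, ValueError):
                    pass   # malformed date: ignore the header, send the page
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Last-Modified", format_datetime(CONTENT_MODIFIED, usegmt=True))
            self.send_header("Content-Length", str(len(BODY)))
            self.end_headers()
            self.wfile.write(BODY)

    HTTPServer(("", 8000), ConditionalHandler).serve_forever()

Sending Last-Modified with the full response is the other half of the bargain: it gives the client the date it will echo back in its next If-Modified-Since request.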

Crawlers aren’t the only clients that use If-Modified-Since to save bandwidth. All the major browsers cache content and can be configured to do conditional GETs.

The direct savings from using If-Modified-Since can be small when compared to the indirect savings. Imagine that your site’s front page contains links to all the other pages on your site. If a crawler downloads the front page, it’s going to extract the links to all the other pages and attempt to visit them, too. If you don’t support If-Modified-Since, the crawler will end up downloading every page on your site. If, on the other hand, you support If-Modified-Since and your front page hasn’t changed, the crawler won’t download it and thus won’t see the links to the other pages on the site.

Don’t take the above to mean that your site won’t be indexed if you never change the front page. Large-scale crawlers keep track of the things they index, and will periodically check to see that those things still exist. The larger systems even keep track of how often individual sites or pages change, and will check for changes on a fairly regular schedule. If their crawling history shows that a particular page changes every few days, then you can expect that page to be visited every few days. If history shows that the page changes very rarely, it’s likely that the page won’t be visited very often.

Smaller-scale crawlers that don’t have the resources to track change frequency for billions of Web sites will typically institute a blanket policy that controls how often they revisit pages–once per day, once per week, etc.
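
Such a blanket policy needs nothing more than a record of when each URL was last fetched. A minimal sketch, with an arbitrary one-week interval:

    import time

    REVISIT_INTERVAL = 7 * 24 * 3600   # revisit once per week, in seconds

    last_visited = {}                  # url -> Unix time of the last crawl

    def due_for_crawl(url):
        last = last_visited.get(url)
        return last is None or time.time() - last >= REVISIT_INTERVAL

    def record_visit(url):
        last_visited[url] = time.time()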

Supporting If-Modified-Since is an easy and inexpensive way to reduce the load that search engine crawlers put on your servers. If you’re publishing static content, then most likely you’re already benefiting from it. If your Web site is dynamically generated, be sure that your scripts recognize the If-Modified-Since header and respond accordingly.