Struggling with the Robots Exclusion Standard

The Internet community loves standards. We must. We have so many of them. Many of those “standards” are poorly defined or, even worse, ambiguous. Or, in the case of robots.txt, subject to a large number of extensions that have become something of a de facto standard because they’re supported by Google, Yahoo, and MSN Search. Unfortunately, those extensions can be ambiguous and difficult for a crawler to interpret correctly.

A little history is in order. The robots.txt “standard” is not an official standard in the same way that HTTP, SMTP, and other common Internet protocols are. There is no RFC that defines the standard, nor is there an associated standards body. The Robots Exclusion Standard was created by consensus in June 1994 by members of the robots mailing list. At the time it was created, the standard described how to tell Web robots (“bots,” “spiders,” “crawlers,” etc.) which parts of a site they should not visit. For example, to prevent all bots from visiting the path /dumbFiles/, and to stop WeirdBot from visiting anything on the site, you could create this robots.txt file:

# Prevent all bots from visiting /dumbFiles/
User-agent: *
Disallow: /dumbFiles/


# keep WeirdBot away!
User-agent: WeirdBot
Disallow: /

Understand, robots.txt doesn’t actually prevent a bot from visiting the site. It’s an advisory standard. It still requires the cooperation of the bot. The idea is that a well-behaved bot will read and parse the robots.txt file, and politely not crawl things it’s not supposed to crawl. In the absence of a robots.txt file, or if the robots.txt does not block the bot, the bot has free access to read any file.
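
To make that cooperation concrete, here is a minimal sketch in Python of the check a polite bot might run before each request. It uses the standard library's urllib.robotparser on the example file above. The bot name PoliteBot and the example.com URLs are made up for illustration, and a real crawler would fetch robots.txt over HTTP rather than embed it in the source.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, embedded so the sketch is self-contained.
ROBOTS_TXT = """\
# Prevent all bots from visiting /dumbFiles/
User-agent: *
Disallow: /dumbFiles/

# keep WeirdBot away!
User-agent: WeirdBot
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite bot asks before every request and skips the URL on False.
for agent, url in [
    ("PoliteBot", "http://example.com/dumbFiles/a.html"),
    ("PoliteBot", "http://example.com/index.html"),
    ("WeirdBot", "http://example.com/index.html"),
]:
    print(agent, url, parser.can_fetch(agent, url))
```

Nothing stops a bot from skipping this check entirely, which is exactly why the standard is only advisory.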

There is a small but rather vocal group of webmasters who insist that having to include a robots.txt file is an unnecessary burden. Their view is that bots should stay off the site unless they’re invited. That is, robots.txt should be opt-in rather than opt-out. In this model, if there is no robots.txt file, or no line within robots.txt specifically allowing the bot, the bot should stay away. In my opinion, this is an unreasonable position, but it’s a topic for another discussion.

In its initial form, robots.txt was a simple and reasonably effective way to control bots’ access to Web sites. But it’s a rather blunt instrument. For example, imagine that your site has five directories, but you only want one of them accessible by bots. With the original standard, you’d have to write this:

User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/
Disallow: /dir4/

Not so bad with only five directories, but it can quickly become unwieldy with a much larger site. In addition, if you add another directory to the site, you’d have to add that directory to robots.txt if you don’t want it crawled.

One of the first modifications to the standard was the inclusion of an “Allow” directive, which overrides the Disallow. With Allow, you can block access to everything except that which you want crawled. The example above becomes:

User-agent: *
Disallow: /
Allow: /dir5/

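But not all bots understand the Allow directive. A well-behaved bot that does not support Allow will see the Disallow directive and not crawl the site at all. Order can matter, too: some parsers apply the first rule that matches, so such a bot would stop at “Disallow: /” and never reach the Allow line. Listing the Allow line first is the safer arrangement.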
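
The difference between the two kinds of bot can be sketched in a few lines of Python. This is only an illustration, not any particular crawler's parser: the function name is_allowed and the rule representation are my own, and I'm assuming the "longest matching prefix wins" convention for bots that do understand Allow. Bots that apply the first matching rule instead would behave differently.

```python
def is_allowed(rules, path, supports_allow=True):
    """Decide whether path may be crawled.

    rules is a list of (directive, prefix) pairs in file order.
    Assumption: the longest matching prefix wins.
    """
    best_len = -1
    verdict = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if directive == "allow" and not supports_allow:
            continue  # an older bot ignores directives it does not know
        if path.startswith(prefix) and len(prefix) > best_len:
            best_len = len(prefix)
            verdict = directive == "allow"
    return verdict


# The Disallow/Allow example from above.
RULES = [("disallow", "/"), ("allow", "/dir5/")]

# A bot that understands Allow may crawl /dir5/ ...
print(is_allowed(RULES, "/dir5/page.html", supports_allow=True))
# ... while a bot that ignores Allow sees only "Disallow: /".
print(is_allowed(RULES, "/dir5/page.html", supports_allow=False))
```

Either way, the other four directories stay blocked. The only thing that changes is whether /dir5/ gets crawled at all.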

Another problem is that of case sensitivity, and there’s no perfect solution. In its default operating mode, the Apache Web server treats case as significant in URLs. That is, the URL http://example.com/myfile.html is not the same as http://example.com/MYFILE.html. But the default mode of Microsoft’s IIS is to ignore case. So on IIS, those two URLs would go to the same file. Imagine, then, what happens if you have a site that contains a directory called /files/ that you don’t want indexed. This simple robots.txt should suffice:

User-agent: *
Disallow: /files/

If the site is in case-sensitive mode (Apache’s default configuration), then bots have no problem. A bot will check the URL it wants to crawl to see if the path starts with “/files/”, and if it does, the bot will move on without requesting the document. But if the path starts with “/Files/”, the bot will request the document, which is correct because on such a server /Files/ is a different directory.

But what happens if the site is running in case-insensitive mode (IIS default configuration), and the bot wants to crawl the file /Files/index.html? If the bot does a naive case-sensitive comparison, the path won’t match the Disallow rule, and the bot will end up crawling a file the webmaster meant to block, because as far as the Web server is concerned, /Files/ and /files/ are the same thing.

Since both Web servers can operate in either mode (case significant or not), it’s exceedingly difficult (impossible, in some cases) for a bot to determine whether case is significant in URLs. So those of us who are trying to create polite Web crawlers end up writing our robots.txt parsers with the assumption that all servers are case-insensitive (i.e., they operate like IIS). Given the above robots.txt file, we won’t crawl a URL whose path begins with “/files/”, “/Files/”, “/fILEs/”, or any other variation that differs only in case. To do otherwise would risk violating what the webmaster intended when he wrote the robots.txt file. The cost is that we potentially skip files that we’re allowed to crawl.
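
Here is a rough Python sketch of the two comparisons. The function name is_disallowed and its case_sensitive flag are hypothetical, chosen just for this illustration. The default is the conservative, case-insensitive check described above.

```python
def is_disallowed(path, disallow_prefixes, case_sensitive=False):
    """Return True if path falls under any Disallow prefix.

    By default the comparison ignores case, on the assumption that the
    server might treat /Files/ and /files/ as the same directory.
    """
    if not case_sensitive:
        path = path.lower()
        disallow_prefixes = [p.lower() for p in disallow_prefixes]
    return any(path.startswith(p) for p in disallow_prefixes)


DISALLOW = ["/files/"]

# Naive comparison: /Files/ slips through, which is wrong on an IIS-style server.
print(is_disallowed("/Files/index.html", DISALLOW, case_sensitive=True))
# Conservative comparison: every case variation is treated as blocked.
print(is_disallowed("/Files/index.html", DISALLOW))
print(is_disallowed("/fILEs/index.html", DISALLOW))
```

The conservative version errs on the side of the webmaster: on a case-sensitive server it may skip a perfectly crawlable /Files/ directory, but it will never fetch something the robots.txt author meant to block.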

In a perfect world, this wouldn’t be required. But in the wide world of the Internet, it’s quite common for people to change case in links. I did it myself when I was running on IIS. My blog used to be at /Diary/index.html, but I and others often linked to /diary/index.html. That caused no end of confusion when I moved to a server running Apache. I had to make redirects that converted references to /Diary/ to /diary/.

Somewhere along the line, somebody decided that the Disallow and Allow directives should support wildcards and pattern matching. Google and Yahoo support this, but I’m not sure yet that their syntax and semantics are identical. I see no information that MSNBot supports these features. Some other crawlers support wildcards and pattern matching to different degrees and with varying syntax.
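
To show what a bot writer has to build, here is a small Python sketch of one possible wildcard syntax. The syntax is an assumption on my part: “*” matches any run of characters and a trailing “$” anchors the end of the URL, which is the style Google documents. Other crawlers may interpret these characters differently, and the function names are mine.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed syntax: '*' matches any sequence of characters and a
    trailing '$' anchors the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)


def matches(pattern, path):
    """True if path matches pattern, starting at the beginning of path."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/docs/report.pdf"))
print(matches("/*.pdf$", "/docs/report.pdf?page=2"))
print(matches("/private*/", "/private-stuff/a.html"))
```

A pattern without wildcards degrades to the plain prefix match of the original standard, so a parser built this way still handles old-style files.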

As useful as robots.txt has been and continues to be, it definitely needs an update. I fear, though, that any proposed “official” standard will never see the light of day, and if it does it will be overly complex and impossible to implement. The alternative isn’t particularly attractive, either: webmasters have to know the peculiarities of dozens of different bots, and bot writers have to decide which extended directives to support. It’s a difficult situation for both, but I don’t see how it can be reconciled. Likely the best a bot writer can do is attempt to implement those extensions that are supported by the major search engines, and document that on their sites.