More on Robots Exclusion | Jim's Random Notes

As I mentioned yesterday, the Robots Exclusion Standard is a very simple protocol that lets webmasters tell well-behaved crawlers how to access their sites. But the “standard” isn’t as well defined as some would have you think, and there’s plenty of room for interpretation.

Consider this simple file:

User-agent: *
Disallow: /xfiles/

User-agent: YourBot
Disallow: /myfiles/

This says, “YourBot can access everything but /myfiles/. All other bots can access everything except /xfiles/.” Note that it does not prevent YourBot from accessing /xfiles/, as some robots.txt tutorials would have you believe.

Crawlers use the following rules, in order, to determine what they’re allowed to crawl:

1. If there is no robots.txt, I can access anything on the site.
2. If there is an entry with my bot’s name in robots.txt, then I follow those instructions.
3. If there is a “*” entry in robots.txt, then I follow those instructions.
4. Otherwise, I can access everything.

It’s important to note that the crawler stops checking rules once it finds one that fits. So if there is an entry for YourBot in robots.txt, then YourBot will follow those rules and ignore the entry for all bots (*).
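Here is a minimal Python sketch of that record-selection logic. The `select_record` function and the dictionary layout for a parsed robots.txt are hypothetical names made up for illustration, and matching the bot name by case-insensitive equality is a simplifying assumption; real crawlers may match names more loosely.

```python
def select_record(records, bot_name):
    """Return the list of rules that applies to bot_name.

    `records` is a hypothetical in-memory form of a parsed robots.txt:
    a dict mapping each User-agent value to that record's rules.
    Pass None to mean "the site has no robots.txt".
    An empty list means nothing is restricted.
    """
    if records is None:
        return []  # no robots.txt: access anything
    for agent, rules in records.items():
        # An entry with the bot's own name wins; stop looking once found.
        if agent != "*" and agent.lower() == bot_name.lower():
            return rules
    # Fall back to the "*" entry, or to no restrictions at all.
    return records.get("*", [])


records = {
    "*": [("Disallow", "/xfiles/")],
    "YourBot": [("Disallow", "/myfiles/")],
}

print(select_record(records, "YourBot"))   # [('Disallow', '/myfiles/')]
print(select_record(records, "OtherBot"))  # [('Disallow', '/xfiles/')]
```

Note that YourBot gets only its own record back. The /xfiles/ restriction in the “*” record never applies to it.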

If Disallow is the only directive, then there is no further room for interpretation. But the addition of the Allow directive threw a new wrench into the works: in what order do you process Allow and Disallow directives?

The revised Internet-Draft specification (note that I linked to archive.org here because the primary site has been down recently) for robots.txt says:

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

According to that description, if you wanted to allow access to /xfiles/mulder/, but disallow access to all other files in the /xfiles/ directory, you would write:

User-agent: *
Allow: /xfiles/mulder/
Disallow: /xfiles/
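The first-match rule from the draft can be sketched in a few lines of Python. This is only an illustration under some assumptions: `allowed_first_match` is a hypothetical name, rules are plain (directive, path) pairs, and matching is a simple prefix comparison with no URL decoding.

```python
def allowed_first_match(rules, path):
    """Apply the rules in file order; the first matching rule decides."""
    for directive, prefix in rules:
        # An empty path (e.g. a bare "Disallow:") matches nothing here.
        if prefix and path.startswith(prefix):
            return directive == "Allow"
    return True  # no match: the URL is allowed by default


rules = [
    ("Allow", "/xfiles/mulder/"),
    ("Disallow", "/xfiles/"),
]

print(allowed_first_match(rules, "/xfiles/mulder/case1.html"))  # True
print(allowed_first_match(rules, "/xfiles/scully/case2.html"))  # False
print(allowed_first_match(rules, "/index.html"))                # True
```

Under this reading the order of the lines matters: swap the two rules and the Disallow line matches first, so nothing under /xfiles/ is reachable.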

Several publicly available robots.txt modules work in this way, but that’s not the way that Google interprets robots.txt. In How do I block or allow Googlebot?, there is this example:

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

For that example to work, Googlebot must read all of the entries, check first to see if the URL in question is specifically allowed, and if not, then check to see if it is disallowed.

It’s a very big difference. If a bot were to implement the proposed standard, then it would never crawl /folder1/myfile.html, because the preceding Disallow line would match first and prevent it from getting beyond /folder1/.
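To make the difference concrete, here is a small Python sketch that runs Google’s example through both interpretations. The function names are hypothetical, and `allowed_allow_first` only models the behavior inferred from that one example (Allow lines checked before Disallow lines); it is not a description of Googlebot’s actual algorithm.

```python
def allowed_first_match(rules, path):
    """Draft-spec reading: rules in file order, first match decides."""
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            return directive == "Allow"
    return True


def allowed_allow_first(rules, path):
    """Inferred reading: any matching Allow wins, then Disallow is checked."""
    matches = [d for d, prefix in rules if prefix and path.startswith(prefix)]
    if "Allow" in matches:
        return True
    return "Disallow" not in matches


rules = [
    ("Disallow", "/folder1/"),
    ("Allow", "/folder1/myfile.html"),
]

print(allowed_first_match(rules, "/folder1/myfile.html"))  # False
print(allowed_allow_first(rules, "/folder1/myfile.html"))  # True
print(allowed_first_match(rules, "/folder1/other.html"))   # False
print(allowed_allow_first(rules, "/folder1/other.html"))   # False
```

The two readings agree on everything except the URL that the Allow line was written for, which is exactly the case the webmaster cared about.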

Yahoo says that they work the same way as Google, with respect to handling Allow and Disallow. It’s unclear what MSNBot does, or how other crawlers handle this. But, hey, if it’s good enough for Google…

I never would have thought that a simple protocol like robots.txt could raise so many questions. And I’ve only touched on the Allow and Disallow directives. There are plenty of other proposals and extensions out there that are even more confusing, if you can imagine. Add to that the META tags that you can put in individual HTML documents to prevent crawling or indexing, and things get murkier still.

But I’ll leave that alone for now. Next time I’ll explain why every Web site should have a robots.txt file, even if it doesn’t restrict access to anything.