People often ask if they need a robots.txt file on their sites. I’ve seen some Web site tutorials that say, in effect, “don’t post a robots.txt file unless you really need it.” I think that is bad advice. In my opinion, every site needs a robots.txt file.
First, some background. I’ve had my own Web site for 10 years, and my experience operating it led me to the conclusion that all sites need a robots.txt file. In addition, as I’ve mentioned a time or three over the past year, I’m building a Web crawler. That work has strengthened my opinion that even the most modest Web site should have a robots.txt file.
Now let me explain why.
Search engine indexing is a fact of life on the Internet. Google, Yahoo, MSN, Internet Archive, and dozens of other search engines continually crawl the Web, downloading pages and storing them for later indexing. For most people, this is a Good Thing: search engines let other people find the content that you post. Without Web-scale search engines, sites would have to depend on links from other sites, and most would simply die in obscurity. Most people who post things on the Internet want to be visible, and Web search engines provide that visibility. If you want to be visible, it is in your best interest to make it as easy as possible for search engines to find and index your site. Part of making your site easy to crawl is including a robots.txt file.
Because robots.txt implements an exclusion standard (its formal name is the Robots Exclusion Standard), you might ask why you need one if you don’t want to block anybody’s access. The answer has to do with what happens when a program tries to get a file from your site.
When a well-behaved Web crawler (that is, one that recognizes and adheres to the robots.txt convention) first visits your site, it tries to read a file called robots.txt. That file is expected to be in the site’s root directory, in plain text format. (For example, the robots.txt file for this blog site is at http://blog.mischel.com/robots.txt.) If the file exists, the crawler downloads and parses it, and then adheres to the Allow and Disallow rules it finds there.
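If you’re curious what that check looks like in code, here’s a minimal sketch using Python’s standard urllib.robotparser module. The crawler name and the page URL being checked are just placeholders:

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt.
rp = RobotFileParser()
rp.set_url("http://blog.mischel.com/robots.txt")
rp.read()

# Ask whether a crawler calling itself "ExampleBot" (a hypothetical
# name) may fetch a particular page before actually fetching it.
if rp.can_fetch("ExampleBot", "http://blog.mischel.com/index.html"):
    print("OK to crawl")
else:
    print("Disallowed by robots.txt")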
Crawlers don’t actually download robots.txt before visiting every URL. Typically, a crawler will read robots.txt when it first comes to your site, and then cache it for a day or two. That saves you bandwidth and saves the crawler a whole lot of time. It also means that if you make changes to robots.txt, it might be a few days before the crawler in question sees them. For example, if you see Googlebot accessing your site and you change robots.txt to block it, don’t be surprised if Googlebot keeps accessing your site for the next couple of days.
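To give you an idea of how that caching might work, here’s a rough sketch. The one-day lifetime is just an assumption for illustration; real crawlers pick their own policies:

import time
from urllib.robotparser import RobotFileParser

CACHE_TTL = 24 * 60 * 60  # assume rules are re-read once a day
_cache = {}               # host -> (time fetched, parsed rules)

def robots_for(host):
    """Return parsed robots.txt rules for host, re-fetching once a day."""
    entry = _cache.get(host)
    if entry and time.time() - entry[0] < CACHE_TTL:
        return entry[1]   # cached copy is still fresh
    rp = RobotFileParser()
    rp.set_url("http://" + host + "/robots.txt")
    rp.read()
    _cache[host] = (time.time(), rp)
    return rp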
If you don’t have a robots.txt file, then two things happen: a “document not found” (code 404) error message is written to your server’s error log file, and the server returns something to the Web crawler.
Those entries in your server’s error log can be a real nuisance if you scan the log periodically. Since 404 is the code returned whenever a requested document isn’t found, scanning the error log from time to time is a good way to find bad links in (or to) your site. Having to wade through potentially hundreds of 404 errors for robots.txt makes that job much more tedious.
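As an illustration, here’s the kind of quick filter I mean, assuming an Apache-style access log where the status code follows the quoted request line (the file name and log format are assumptions; adjust for your server):

# List 404s, skipping the robots.txt noise.
with open("access.log") as log:
    for line in log:
        if '" 404 ' in line and "robots.txt" not in line:
            print(line.rstrip())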
I said that your server returns “something” to the crawler that requested the document. On simple sites, “something” turns out to be a very short message with a 404 Not Found status code. Crawlers handle that without trouble. But many Web sites have custom error handling that redirects “Not Found” errors to an HTML page saying that the file was not found. That page often has lots of other stuff on it, making it fairly large. In that case, not having a robots.txt file ends up costing you bandwidth.
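A crawler has to download whatever comes back before it can tell the difference. Here’s a rough sketch of that logic; the HTML check is just a heuristic for illustration, not how any particular crawler does it:

import urllib.request
import urllib.error

def fetch_robots(url):
    """Fetch robots.txt; return None if the site doesn't really have one."""
    try:
        with urllib.request.urlopen(url) as resp:
            body = resp.read()  # the whole response; this is your bandwidth
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None         # plain 404: cheap for everyone
        raise
    # A custom error page is usually HTML; a real robots.txt is plain text.
    if b"<html" in body[:512].lower():
        return None             # downloaded a big error page for nothing
    return body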
Not having a robots.txt file doesn’t particularly inconvenience the crawler, either. If your server returns a 404 or a custom error page, the crawler just notes that you don’t have a robots.txt and continues on its way. The next time it visits your site, a day or two later, it’ll try to read robots.txt again.
So that’s why you should have a robots.txt file: it keeps spurious 404 messages out of your error log, potentially saves you bandwidth, and lets you tell crawlers which parts of your site you want indexed.
Every Web site should have, at minimum, this robots.txt file:
User-agent: *
Disallow:
All this says is that you’re not disallowing anything: all crawlers have full access to read the entire site. That’s exactly what having no robots.txt file means, but it’s always a good idea to be explicit. Plus, if you create the file now, while you’re thinking about it, you’ll find it much easier to modify in a hurry when you want to block a particular crawler.
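When that day comes, the change is a short addition. For example, to turn away a (hypothetical) crawler that identifies itself as BadBot while leaving everyone else unrestricted:

User-agent: BadBot
Disallow: /

User-agent: *
Disallow:

A blank line separates the two records. A crawler obeys the record whose User-agent line best matches its name, so BadBot sees the blanket Disallow while everyone else falls through to the permissive default.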
How you create and post a robots.txt on your own site will depend on what hosting service you use. If you use FTP to put files up on the site, then you can create a file with a plain text editor (like Windows Notepad), add the two lines shown above, save it as robots.txt on your local drive, and then FTP it to the root of your Web site. If you use some kind of Web-based file manager, create a plain text file (NOT a Web page) at the top level of your site, add those two lines, and save the file as robots.txt. You can test it by going to http://yoursitename/robots.txt in your favorite Web browser.
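And if you’d rather script the FTP upload than do it by hand, Python’s standard ftplib makes short work of it. The host name and credentials below are obviously placeholders for your own hosting details:

from ftplib import FTP

ftp = FTP("ftp.example.com")        # placeholder host
ftp.login("username", "password")   # placeholder credentials
with open("robots.txt", "rb") as f:
    ftp.storbinary("STOR robots.txt", f)  # upload to the site root
ftp.quit()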