I mentioned before that there is a small but very vocal group of webmasters who say that crawlers should stay off their sites unless specifically invited. It is their opinion that they shouldn’t have to include a robots.txt file in order to prevent bots from crawling their sites. Their reasons for holding this opinion vary, but generally fall into one of two categories: they have private content they don’t want indexed, or they don’t want their bandwidth “wasted” by bots. I understand their point of view, but in my opinion their model won’t work.
The nature of the Web is that everything is available to whoever wants to access it. If you don’t want your content publicly accessible, you have to take active measures to block access, either by blocking particular users or by adding some kind of authentication and authorization to allow only those users to whom you’ve granted access. The Web has always operated on this “opt-out” principle. One could argue that this open access to information is the whole reason for the Web. It’s certainly one of the primary reasons (if not the primary reason) that the Web continues to grow.
Search engines are essential to the operation of the Web. Most people who post information on the Web depend on the various search engines to index their sites so that when somebody is looking for a product or information, that searcher is directed to the right place. And users who want something depend on search engines to present them with relevant search results. Search engines can’t do either of those things unless they can crawl Web sites.
The argument is that people who want to be crawled should submit their sites to search engines, and that search engine crawlers should crawl only those sites that are submitted. It’s unlikely that such a scheme could work, and even if it could, the result would be a much less interesting Web, because the difficulties involved in getting indexed would be too much for most site owners.
The first hurdle to overcome would be to determine who gets to submit a site’s URL to a search engine. It would be nearly impossible to police this and ensure that the person who submitted the site actually had authority to do so. Search engines could require written requests from the site’s published owner or technical contact (as reported by a whois search, for example), but the amount of paperwork involved would be astronomical. You could also build some kind of Web submission process that requires the submitter to supply some identifying information, but even that would be unreasonably difficult to build and manage.
There are approximately 100 million registered domain names at any given time. Names come and go and site owners change. It’s unreasonable to ask a search engine to keep track of that information. Imagine if the owner of the domain example.com submitted his site for crawling, but after a year let his domain name expire. A short time later somebody else registers example.com, but doesn’t notify the search engine of the ownership change. The new owner has no idea that the name was previously registered with the search engine and gets upset when his site is crawled. Is the search engine at fault?
There are many, many search engines, with more coming online all the time. To expect a webmaster to submit his site to every search engine is unreasonable. Granted, there are services that will submit a site to multiple search engines, but going that route makes it even more difficult to keep track of things. Every submission service has to keep track of who got submitted where, and there has to be some kind of infrastructure so that webmasters can query the submission service’s database to determine whether a particular bot is authorized.
Even if we somehow got the major search engines and the majority of site owners to agree that the opt-in policy is a good thing, we would still run into the problem of ill-behaved bots: those that crawl regardless of permission. Again, the fundamental structure of the Web is openness. Absent legislation that makes uninvited crawling a crime (and the first hurdle there would be to define “crawling,” a problem that would be even more difficult than the policing problems I mentioned above), those ill-behaved bots will continue to crawl.
When you add it all up, it seems like a huge imposition on the majority of site owners who want visibility, just to satisfy a small number of owners who don’t want their sites crawled. It also places an unreasonable burden on those people who operate polite crawlers, while doing nothing to prevent the impolite crawlers from making a mess of things. This is especially true in light of the Robots Exclusion Standard, which is a very simple and effective way to ask those polite crawlers not to crawl. To prevent bots from crawling your site, just create a robots.txt file that looks like this:
User-agent: *
Disallow: /
That won’t prevent all crawling, as most crawlers will still hit robots.txt periodically, but almost all crawlers respect that simple robots.txt file and will crawl no further. Those that don’t respect it likely wouldn’t respect an opt-in policy, either. Creating and posting that file takes approximately two minutes (maybe five if you’re not familiar with the process), and it’s a whole lot more effective than trying to change a fundamental operating principle of the Web.
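For crawler authors, honoring that request is just as simple. Here’s a minimal sketch of the polite-crawler side using Python’s standard urllib.robotparser module; the crawler name “MyBot” and the example.com URLs are placeholders, not part of any particular crawler.
import urllib.robotparser

# Fetch and parse the site's robots.txt before requesting anything else.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

# A polite crawler asks this question for every URL it wants to fetch.
# With "User-agent: *" and "Disallow: /", the answer is always False,
# so the crawler goes no further on this site.
if robots.can_fetch("MyBot", "https://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("robots.txt says stay out")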