It’s surprising the things you’ll learn when you write a Web crawler. Today’s lesson: how to be both perpetrator and victim of your own denial of service attack.
Not everybody likes crawlers accessing their sites. Most site operators will modify their robots.txt files first, which keeps polite bots from crawling. But blocking impolite bots requires configuring your server to deny access based on IP address or user-agent string. Some operators, either because they don’t know any better or because they want to prevent bots from even reading robots.txt, prefer to use the server configuration file for all bot-blocking. Doing so is easy enough, but you have to be careful or you can create a home-grown denial of service attack.
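For the polite bots, robots.txt really is all it takes. A minimal example (the bot name here is made up) that tells a single crawler to stay away from the entire site:
User-agent: SomeBot
Disallow: /
Impolite bots simply ignore this, which is why the rest of this entry deals with server-side blocking.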
The discussion below covers Web sites running the Apache server. I don’t know how to effect IP blocks or custom error pages using IIS or any other Web server.
There are two ways (at least) to prevent access from a particular IP address to your Web site. The two ways I know of involve editing the .htaccess file, which usually is stored in the root directory of your Web site. [Note: The filename really does start with a period. For some reason, WordPress doesn’t like me putting that filename in a post without putting some HTML noise around it. So for the rest of this entry, I’ll refer to the file as htaccess, without the leading period.] As this isn’t a tutorial on htaccess I suggest that you do a Web search for “htaccess tutorial”, or consult your hosting provider’s help section for full information on how to use this file.
The simple method of blocking a particular IP address, available on all versions of Apache that I know of, is to use the Order and Deny directives, wrapped here in a <Files> container. This htaccess fragment will block an IP address:
<Files *>
Order Deny,Allow
Deny from abc.def.ghi.jkl
</Files>
Of course, you would replace abc.def.ghi.jkl in that example with the actual IP address you want to block. If you want to block multiple addresses, you can specify them in separate Deny directives, one per line; according to the Apache documentation, you can also list several addresses on a single Deny line, separated by spaces. You can block a whole range of addresses by giving a partial address or CIDR notation, as in the sketch below.
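For example, here’s a sketch along those lines, with placeholder addresses (192.0.2 stands in for the first three octets of whatever range you want to shut out):
<Files *>
Order Deny,Allow
Deny from abc.def.ghi.jkl
Deny from mno.pqr.stu.vwx
Deny from 192.0.2
</Files>
The last Deny matches every address that begins with 192.0.2; CIDR notation such as 192.0.2.0/24 should also work.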
If you do this, then any attempted access from the specified IP address will result in a “403 Forbidden” error code being returned to the client. The Web page returned with the error code is the default error page, which is very plain (some would say ugly), and not very helpful. Many sites, in order to make the error pages more helpful or to make them have the same look and feel as the rest of the site, configure the server to return a custom error page. Again, there are htaccess directives that control the use of custom error pages.
If you want a custom page to display when a 403 Forbidden is returned, you create the error page and add a line telling Apache where the page is and when it should be returned. If your error page is stored on your site at /forbidden.html, then adding this directive to htaccess tells Apache to return that page along with the 403 error:
ErrorDocument 403 /forbidden.html
Now, if somebody visits your site from the denied IP address, the server will return the custom error page along with a 403 Forbidden status code. It really does work. As far as I’ve been able to determine, nothing can go wrong with this configuration.
I said before that there are at least two ways to prevent access from a particular IP address. The other way that I know of involves using an Apache add-on called mod_rewrite, a very useful but also very complicated and difficult-to-master module with which you can do all manner of wondrous things. I don’t claim to be an expert in mod_rewrite. But it appears that you can block an IP address by adding these lines:
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^abc\.def\.ghi\.jkl$
RewriteRule .* - [F]
Again, you would replace the abc, def, etc. with the actual IP address numbers. As I understand it, this rule (assuming that mod_rewrite is installed and working) will prevent all accesses to your site from the given IP address. But there’s a potential problem.
If you have a custom 403 error document, the above can put your server into an infinite loop. According to this forum post at Webmaster World:
A blocked request is redirected to /forbidden.html, and the server tries to serve that instead, but since the user-agent or ip address is still blocked, it again redirects to the custom error page… it gets stuck in this loop.
There you have it: you are the perpetrator and victim of your own denial of service attack.
The forum post linked above shows how to avoid that problem.
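In broad strokes, the fix is to exempt the error page itself from the block, so the redirect to the custom 403 page isn’t itself forbidden. Here’s a sketch using the /forbidden.html path from my earlier example; I won’t claim it’s word-for-word what the forum post recommends:
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/forbidden\.html$
RewriteCond %{REMOTE_ADDR} ^abc\.def\.ghi\.jkl$
RewriteRule .* - [F]
Because both conditions must match, a request for /forbidden.html is never rewritten, so the error page can be served and the loop is broken.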
I’ve seen some posts indicating that the infinite loop also is possible if you use the simple way of doing the blocking and error redirects. I haven’t been able to verify that. If you’re interested, check out this post, which also offers a solution if the problem occurs.
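The usual suggestion for the simple method follows the same idea: explicitly allow everybody to fetch the error page. A sketch, which I haven’t tested myself:
<Files forbidden.html>
Order Allow,Deny
Allow from all
</Files>
That carves out an exception for /forbidden.html, so the error page can be served even to a blocked address.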
How I came to learn about this is another story. Perhaps I can relate it one day.