What’s that house worth?

It’s property appraisal time again. We got our Notice of Appraised Value in the mail last month, and were shocked to learn that our property value increased by 16% last year. That’s on top of the 33% increase from the year before. At least, that’s what the local appraisal district would have you believe. Last year I missed the deadline for filing a protest. You can bet I won’t be missing this year’s.

I called a Realtor friend of mine to get comparable sales information, and then compared that with the proposed appraisal. The difference is quite remarkable. If I’m extremely generous, I can make the house’s market value almost equal to last year’s appraised value. When you take into account the comparable sales and subtract the cost of the many repairs we need to make, the house’s market value is about 1/3 less than the proposed appraisal.

One of the tools I tried to use for research is Zillow. This is a pretty cool mashup that shows satellite pictures with property lines and home prices, along with pertinent information about the houses. Zillow also gives a “Zestimate” of home values. I’m sure there’s some complicated formula for these estimates, but at least in my area I noticed that the estimates are much closer to the tax appraisals than to the sales of comparable homes.

In my experience, tax appraisals are trailing indicators: they continue to rise after home prices have leveled off following a boom, and they continue to fall (although not quite as much as they rise) after falling home prices have leveled off. The result is that sources like Zillow and others end up over- or under-reporting on market swings. For my area, Zillow is reporting values that are quite a bit higher than are justified by actual sales, indicating to me that it relies too heavily on tax appraisals.

As far as I’m concerned–especially in today’s market–the value Zillow reports is the “if everything goes exactly right and you find the perfect buyer” price. It’s a useful tool for comparison, but even then I’d look on it with a large dose of skepticism. As far as absolute values are concerned, though, Zillow’s numbers bear little resemblance to reality.

I was hacked

I discovered last week that somebody had hacked my blog and added a bunch of link spam at the end of the footer script. For some unknown period of time, all of my blog pages contained hundreds of spam links–mostly for prescription drugs. But nobody saw them.

I don’t understand why it was done that way, but the links were invisible in browsers. At least they were invisible in the browsers that I use, and none of my regular readers sent me a message notifying me of the spam. I found out about it when I upgraded my WordPress to the latest version. After the upgrade I was checking out the footer script and discovered all those lines.

I know that it was there on May 22–the last day Google crawled the site. Their stats for my site show that prescription drug terms are the most prevalent terms on my site. I guess I look like a link spammer now. I hope they crawl again soon.

The most important lesson I learned here is to pay attention to the Dashboard when I log in to WordPress–especially when it contains warnings about vulnerabilities and upgrades. I hadn’t upgraded in many months, and was several releases behind.

I don’t know what exploit the malefactors made use of in order to change my footer.php file, but I’m pretty happy that’s all they did. I suspect they could have modified any of my WordPress files and really made a mess of things. I don’t think they actually compromised my WordPress administrator account or my account with my ISP, but I changed the passwords anyway.

Clearing the book list

I’ve been meaning to review or at least mention the books I’ve been reading lately. I realized after I posted my negative review of Infinite Ascent that there are plenty of good books that I haven’t mentioned. So, here are capsule reviews of five books I’ve read recently–all picked up at either the remainder table at Half Price Books, or the bargain table at the big box retailer in the local mega shopping center.

Mario Livio’s The Golden Ratio: The Story of Phi, the World’s Most Astonishing Number is an engaging story. The book begins with a brief history of early arithmetic before diving into the discovery and usefulness of what has become known as The Golden Ratio. From its first use in constructing pentagrams and the Platonic solids, to its uncanny appearance in nature, Livio shows the significance of the number 1.6180339…–the number that satisfies the equation x² – x = 1. Perhaps just as importantly, he debunks many myths about the Golden Ratio and its supposed mystical properties. Altogether a delightful read, and one that I recommend highly.

I’ve always been curious about how words come into being, how they change meanings, and how they eventually fall out of favor. I’m not a huge word nerd (and I mean that in the best possible way) like some of my friends, but I do enjoy learning about them. In The Life of Language: The fascinating ways words are born, live & die, authors Sol Steinmetz and Barbara Ann Kipfer take us on a tour through the English language, describing the many different ways words come into the language and how their pronunciations and meanings change over time. They also explain how to read the etymological information found in dictionaries–something my high school and college English teachers never bothered to teach. If they even knew. The writing style is a little bit dry in places, and the book is probably a third larger than it really has to be, but I quite enjoyed the read.

Who would have thought that jigsaw puzzles had such a rich history? Did you know that there are manufacturers of custom jigsaw puzzles that cost $5.00 or more per piece? People will pay $2,500 for a high-quality wooden jigsaw puzzle of 500 pieces. I always thought that a jigsaw puzzle was little more than a trinket–something to pass the time. Anne D. Williams’ The Jigsaw Puzzle: Piecing Together a History opened my eyes to a whole new world of jigsaw puzzles, puzzle collectors and enthusiasts, and custom manufacturers. I’m not a huge jigsaw puzzle fan, but it was kind of interesting learning about this particular obsession that’s shared by a surprising number of people. Well written and mostly engaging, it was a good way to pass a few hours.

It’s hard to characterize Dava Sobel’s The Planets. It’s a tour of all the planets in our solar system, plus the Sun and Earth’s Moon, including the recently demoted Pluto. The “tour” is somewhat superficial in that it doesn’t go into a whole lot of detail about any of the planets, but it does give the basic facts: size, distance from the Sun, orbital period, etc. For the planets known to the ancients, we learn how they were viewed throughout history. She also describes how the moons of other planets were discovered, and gives us some history of the discovery of Uranus, Neptune, Pluto, and many other objects. It’s a light read, well written and enjoyable.

I have to include one stinker in the list. I don’t know what possessed me to buy Apocalypse 2012: An Investigation into Civilization’s End, by Lawrence E. Joseph, and I can’t give a really good reason why I actually read it. But I’m kind of glad I did. Not because I believe the “prophecies” of doom, but because it’s such a fascinating mix of superstition, science, faulty reasoning, and plain old scaremongering. Looked at critically, the arguments just don’t hold water: there’s nothing there. But the bullshit is so skillfully disguised and beautifully rendered that the book is hard to put down. I was just amazed at how well the author was able to weave the story together. He didn’t do a perfect job, though. In several places I got the distinct impression that he was laughing his ass off as he wrote. It’s impossible that the person who wrote this book actually believes what he’s peddling. I won’t recommend the book as anything but an interesting and somewhat amusing study in pandering to a deluded audience. At that, it succeeds brilliantly.

Infinite Annoyance

Browsing the remainder table in Half Price Books a few weeks ago, I ran across David Berlinski’s Infinite Ascent: A short history of mathematics. The cover copy looked good, and a quick flip through a few pages was enough to convince me that it was worth the three bucks. At 180 pages, you’d expect it to be a pretty short read, and it might be for some. I found it tough going.

The book focuses on what the author (and others, I gather) considers “the ten most important breakthroughs in mathematics,” giving some biographical information about the people most closely associated with those discoveries, the historical context, and also an explanation of why the breakthroughs are important. At least, that’s how the first five chapters (Number, Proof, Analytic Geometry, The Calculus, and Complex Numbers) went. The next five chapters (Groups, Non-Euclidean Geometry, Sets, Incompleteness, The Present) seemed much less approachable.

I freely admit that some of my difficulty could be that I’m fairly comfortable with the topics discussed in the first five chapters, but with the exception of Sets I have no experience with, or more than passing knowledge of, the topics discussed in the later chapters. Somehow, though, I get the feeling that the fault is not entirely mine. I didn’t expect to gain a detailed understanding of Gödel’s incompleteness theorems by reading a short chapter, but I had hoped to learn something. Instead, I was treated to prose like this:

The final cut–the director’s cut–now follows by means of the ventriloquism induced by Gödel numbering. This same formula just seen making an arithmetical statement in that subtle shade of fuchsia now acquires a palette of quite hysterical reds and sobbing violets, those serving to highlight the metamathematical scene presently unfolding, for while Bew(x) says something about the numbers, it also says that

x is a provable formula,

meaning that honey the number x is the number associated under the code with a provable formula, whereupon the director, lost in admiration for his own art, can mutter only that deep down it’s a movie about a movie.

That’s all pretty writing, but by the time I wade through the director’s psychedelic visions I’ve totally lost track of whatever mathematical subject we’re talking about. The first time I read that chapter, I put my lack of understanding down to having read it in bed, just before I fell asleep. The author’s point continues to elude me after a second reading. I learned more by skimming the Wikipedia article linked above than I did trying to puzzle out whatever Berlinski was trying to say.

Flipping through the book again after finishing it, I noticed that the style is pretentious throughout. The book suffers from too many inappropriate and incomprehensible metaphors, too much temporal hopping around in its short biographies, and too many paragraphs that jump off the page screaming, “Look, Ma, at how pretty I can write!” Like the director in the excerpt above, Berlinski seems lost in admiration of his own writing.

All in all, I’d say you’d be much better off reading Wikipedia articles about mathematics than trying to decipher the word splatter that Berlinski is trying to pass off as intelligent writing in Infinite Ascent. Not only is Wikipedia free, but you’ll learn a lot more and you won’t be tempted to track down the author and smack him upside the head for killing trees and wasting your time with his drivel.

Sometimes there’s a very good reason for a book to be on the remainder table.

Reducing bandwidth used by crawlers

Some site operators block web crawlers because they’re concerned that the crawlers will use too much of the site’s allocated bandwidth. What they don’t realize is that most companies that operate large-scale crawlers are much more concerned with bandwidth usage than the people running the sites that the crawlers visit. There are several reasons for this concern:

  • The visible Web is so large that no crawler can examine the entire thing in any reasonable amount of time. By the best estimates, even Google covers only about 25% of the visible Web.
  • The Web grows faster than the ability to crawl it.
  • It takes time (on average between one and two seconds) to find, download, and store a Web page. Granted, a large crawler can download thousands of pages per second, but it still takes time.
  • It requires more time, storage, and CPU power to store, parse, and index a downloaded page.

I suspect that the large search engines can give you a per-page dollar cost for locating, downloading, storing, and processing. That per-page cost would be very small, but when you multiply it by 25 billion (or more!) pages it’s a staggering amount of money–a cost that’s incurred every time they crawl the Web. As you can imagine, they have ample incentive to reduce unnecessary crawling as much as possible. In addition, time and bandwidth spent downloading unnecessary pages means that some previously undiscovered pages are not visited.

The HTTP specification includes something called a conditional GET. It’s a way for a client to request that the server send the page only if it meets some criteria. The specification identifies several different criteria, one of which is called If-Modified-Since. If the client has seen the page before and has saved the page and the date it received the page, then the client can send a request to the server that says, in effect, “If the page has changed since this date, then send me the page. Otherwise just tell me that the page hasn’t changed.” Here, “this date” stands for the actual date that the client last saw the page.

If the server supports If-Modified-Since (which almost all do), there is a big difference in how much bandwidth is used. If the Web page has not been modified, the server responds with a standard header and a 304 Not Modified status code: total payload of maybe a few hundred bytes. That’s a far cry from the average 30 kilobytes for an HTML page, or the hundreds of kilobytes for a page that has complicated scripts and lots of content.

The only catch is that Web server software (Apache, IIS, etc.) handles If-Modified-Since automatically only for static content: pages that you create and store as HTML on your site. If your site is dynamically generated with PHP, ASP, Java, etc., then the script itself has to determine whether the content has changed since the requested date and send the proper response. If your site is dynamically generated, it’s a good idea to ask your developers whether it supports If-Modified-Since.
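To make that concrete, here’s a minimal sketch of the check in an ASP.NET handler, in C#. The handler class, the fixed “last changed” date, and the placeholder content are all invented for illustration; the point is the comparison against the If-Modified-Since header and the 304 response.

using System;
using System.Globalization;
using System.Web;

public class ConditionalGetHandler : IHttpHandler
{
    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        // When this page's content last changed. A real dynamic site would get
        // this from its database, its cache, or whatever else drives the page.
        DateTime lastModifiedUtc = new DateTime(2008, 6, 1, 0, 0, 0, DateTimeKind.Utc);

        string header = context.Request.Headers["If-Modified-Since"];
        DateTime since;
        if (header != null &&
            DateTime.TryParse(header, CultureInfo.InvariantCulture,
                DateTimeStyles.AdjustToUniversal | DateTimeStyles.AssumeUniversal, out since) &&
            lastModifiedUtc <= since)
        {
            // Nothing has changed: send only the status line and headers.
            context.Response.StatusCode = 304;
            context.Response.SuppressContent = true;
            return;
        }

        // The page has changed (or the client sent no date): send the full page
        // along with a Last-Modified header the client can use next time.
        context.Response.AppendHeader("Last-Modified", lastModifiedUtc.ToString("R"));
        context.Response.ContentType = "text/html";
        context.Response.Write("<html><body>Generated content goes here.</body></html>");
    }
}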

Crawlers aren’t the only clients that use If-Modified-Since to save bandwidth. All the major browsers cache content, and can be configured to do conditional GETs.

The direct savings of using If-Modified-Since can be small when compared to the indirect savings. Imagine that your site’s front page contains links to all the other pages on your site. If a crawler downloads the main page, it’s going to extract the links to all the other pages and attempt to visit them, too. If you don’t support If-Modified-Since, the crawler will end up downloading every page on your site. If, on the other hand, you support If-Modified-Since and your front page doesn’t change, the crawler won’t download the page and thus won’t see links to the other pages on the site.

Don’t take the above to mean that your site won’t be indexed if you don’t change the main page. Large-scale crawlers keep track of the things they index, and will periodically check to see that those things still exist. The larger systems even keep track of how often individual sites or pages change, and will check for changes on a fairly regular schedule. If their crawling history shows that a particular page changes every few days, then you can expect that page to be visited every few days. If history shows that the page changes very rarely, it’s likely that the page won’t be visited very often.

Smaller-scale crawlers that don’t have the resources to keep track of the change frequency for billions of Web sites will typically institute a blanket policy that controls the frequency that they revisit pages–once per day, once per week, etc.
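For what it’s worth, here’s a toy sketch in C# of what an adaptive revisit schedule might look like: shrink the interval when a page turns out to have changed, grow it when it hasn’t, and clamp the result to blanket limits like the ones just mentioned. The factors and bounds are invented for illustration, not how any particular crawler does it.

using System;

static class RevisitPolicy
{
    static readonly TimeSpan MinInterval = TimeSpan.FromHours(6);   // invented bounds
    static readonly TimeSpan MaxInterval = TimeSpan.FromDays(30);

    // Halve the interval when the page changed since the last visit, double it
    // when it didn't, and keep the result within the blanket limits.
    public static TimeSpan NextInterval(TimeSpan current, bool pageChanged)
    {
        double factor = pageChanged ? 0.5 : 2.0;
        TimeSpan next = TimeSpan.FromTicks((long)(current.Ticks * factor));

        if (next < MinInterval) return MinInterval;
        if (next > MaxInterval) return MaxInterval;
        return next;
    }
}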

Supporting If-Modified-Since is a very easy and inexpensive way to reduce the load that search engine crawlers put on your servers. If you’re publishing static content, then most likely you’re already benefiting from this. If your Web site is dynamically generated, be sure that your scripts recognize the If-Modified-Since header and respond accordingly.

More on .NET Collection Sizes

Last month in HashSet Limitations, I noted what I thought was an absurd limitation on the maximum number of items that you can store in a .NET HashSet or Dictionary collection. I did more research on all the major collection types and wrote a series of articles on the topic for my .NET Reference Guide column. If you’re interested in how many items of a particular type can fit into one of the .NET collection types, you’ll want to read those four or five articles.

I mentioned last month that the HashSet and Dictionary collections appeared to have a limit of 47,995,854 items. While that really is the limit if you use the code that I showed, the actual limit is quite a bit larger: about 89.5 million with the 64-bit runtime (61.7 million on the 32-bit runtime) if you have String keys and your values are references. The difference has to do with the way that the data structure grows as you add items to it.

Like all of the .NET collection types, HashSet and Dictionary grow dynamically as you add items. They start out with a very small capacity–fewer than 16 items. As you add items and overflow the capacity, the collection is resized by doubling. But the capacity is not exactly doubled. It appears that the new capacity is set to the first prime number that is larger than twice the current capacity. (Some hashing algorithms perform better when the number of buckets is a prime number.) I don’t know the exact resize sequence, but I do know that there are noticeable pauses at around 5 million, 11 million, and 23 million, and then the program crashes at 47 million. The reason for the crash is that the collection is trying to resize its internal array to the next prime number that’s somewhere around 96 million. And that’s larger than the empirically determined maximum size of about 89.5 million.
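Here’s a rough sketch of that growth rule as I understand it: double the current capacity, then move up to the next prime. These two methods would sit in whatever class holds your test code; they approximate the observed behavior and are not the Framework’s actual implementation.

static int NextCapacity(int currentCapacity)
{
    // The first prime larger than twice the current capacity.
    int candidate = currentCapacity * 2 + 1;
    while (!IsPrime(candidate))
    {
        candidate += 2;   // even numbers greater than 2 can't be prime
    }
    return candidate;
}

static bool IsPrime(int n)
{
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (int i = 3; (long)i * i <= n; i += 2)
    {
        if (n % i == 0) return false;
    }
    return true;
}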

So how does one get a collection to hold more than 48 million items? By pre-allocating it. If you know you’re going to need 50 million items, you can specify that as the initial capacity when you create the object:

Dictionary<string, object> myItems = new Dictionary<string, object>(50000000);

You can then add up to 50 million items to the collection. But if you try to add more, you’ll get that OutOfMemory exception again.

That works well with Dictionary, but HashSet doesn’t have a constructor that will let you specify the initial capacity. When I was writing my .NET column on this topic, I discovered that you can create a larger HashSet indirectly, by building a List and then passing that to the HashSet constructor that takes an IEnumerable parameter. Here’s how:

static void Main(string[] args)
{
  int maxSize = 89478457;
  Console.WriteLine("Max size = {0:N0}", maxSize);

  // Initialize a List<long> to hold maxSize items
  var l = new List<long>(maxSize);

  // now add items to the List; the HashSet is built from it below
  for (long i = 0; i < maxSize; i++)
  {
    if ((i % 1000000) == 0)
    {
      Console.Write("\r{0:N0}", i);
    }
    l.Add(i);
  }
  Console.WriteLine();

  // Construct a HashSet from that list
  var h = new HashSet<long>(l);

  Console.WriteLine("{0:N0} items in the HashSet", h.Count);

  Console.Write("Press Enter:");
  Console.ReadLine();
}

That works fine if you know in advance what items you want to put into your HashSet, but it doesn’t do you much good if you want to add things one at a time. At the time I wrote the article, I didn’t have a solution to the problem.

I later discovered that after creating the collection, you can remove all the items (either by calling Remove 89 million times, or by calling Clear) and you’re left with an empty HashSet that has a capacity of 89 million items. It’s a roundabout way to get some functionality that I think should have been included with the HashSet class, but if you need it, that’s how you do it. Understand, though, that it’s undocumented behavior that might change with the next release of the .NET Framework.
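In code, the trick looks something like this. It’s a fragment in the same vein as the example above (same usings, inside a method), and again it leans on that undocumented behavior:

int maxSize = 89478457;

// Build a throwaway list and use it to force the HashSet's internal
// arrays out to their full size.
var seed = new List<long>(maxSize);
for (long i = 0; i < maxSize; i++)
{
    seed.Add(i);
}
var bigSet = new HashSet<long>(seed);

// Empty the set. The items are gone but the capacity remains, so items can
// now be added one at a time, up to roughly 89.5 million of them.
bigSet.Clear();
seed = null;   // let the seed list be collected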

In last month’s note, I also grumbled a bit about the perceived 5x memory requirement of Dictionary and HashSet. It turns out that I was off base there. Internally, these collections store a key/value pair for each item. Since typically both the key and the value are references, that adds up to 16 bytes per item, giving a best case of about 134 million* items fitting into the .NET maximum 2 gigabyte allocation. The real limit of 89,478,457 indicates that the per-item overhead is 24 bytes (2 GB / 24 ≈ 89,478,485), which is actually pretty reasonable.

*The number is 128 * 1024 * 1024. It seems to be understood that when we programmers say “128 K items,” we really mean 131,072 items (128 * 1,024). But there doesn’t seem to be a generally accepted terminology for millions or billions of items. I don’t hear “128 M items” or “128 G items” in conversation, nor do I recall seeing them in print. I’ve heard (and used) “128 mega items” or “128 giga items,” but those sound clunky to me. We don’t say “128 kilo items.” Is there generally accepted nomenclature for these larger magnitudes? If not, shouldn’t there be? Any suggestions?

Goodbye Windows Vista

I upgraded to Windows Vista (from Windows XP) back in November, when I moved from a dual-core to a quad-core machine. I was less than pleased with Vista, for a number of reasons, but primarily because I found the Aero user interface enhancements more annoying than useful. That’s all pretty eye candy, but the few benefits it brought were not worth the 2 gigabyte footprint or the continual distraction. I turned off what I could in a few minutes of tinkering, but didn’t spend a lot of time trying to turn everything off.

And that dang machine was flaky! The system would become unresponsive for no apparent reason. Windows Explorer would lock up and even Task Manager wouldn’t come up. It got progressively worse until I was hitting the reset button a couple of times per day. Yahoo Messenger, for some reason, often seemed to cause the lockup. If Messenger lost its connection, it would try to re-connect, and sometimes that would cause the entire user interface to lock up. I still don’t understand how an application can bring down the whole operating system, but there you have it.

At one point I was getting a number of blue screen crashes (a few per week), so I down-clocked the machine (it had been slightly over-clocked) thinking that was the problem. Then I thought memory was the problem, so I spent a couple of nights running the Windows Memory Diagnostic (available on the Administrative Tools menu). That didn’t reveal any errors, either. I had pretty much decided that the problem was with the video driver (GeForce 8500 GT), but never tested it because at that point yet another new machine arrived: a Dell Precision 490 (used) with 16 gigabytes of RAM and a quad-core Xeon running at 2 gigahertz.

We’ve been running Windows Server 2008 on the servers here, and have been very happy with its performance and stability. Given the choice between Vista and Server 2008 on the desktop, there was no contest. I ran Server 2003 on a laptop development machine for two years and was extremely pleased with it–much more so than with Windows XP–and I expect to be much happier with Server 2008 than with Vista. One really nice thing is that the user interface is clean and lacking all those annoying Aero enhancements.

There are a few things you probably want to change in the default configuration of Server 2008 if you’re going to use it as a desktop development system. I’ve found several sites that talk about this, the best being Vijayshinva Karnure’s Windows Server 2008 as a SUPER workstation OS. It’s only been a few days, but so far I’m really liking the switch.

I’ve also said goodbye to Firefox in favor of (gasp) Internet Explorer. I’m not especially fond of IE, but Firefox has been unreliable for over a year–ever since I installed it on Windows XP 64. The 32-bit version of Firefox tends to crash, hang, or do unexpected things when running on a 64-bit version of Windows. And since they aren’t planning a 64-bit Windows version any time soon, I’ll move on. I understand that there are third-party x64 builds, but those don’t have full plug-in support nor do they appear to have the same quality standards as the official Firefox builds.

If I get ambitious, I might give Opera a try. For the near term, IE will do. At the moment I’m more interested in getting my new Server 2008 machine fully configured with all the development tools and such. Changing machines takes so much longer than you think it will.

Where is everybody?

From Jeff Duntemann comes a link to an article on the Fermi Paradox, which puts forth the idea that it may be a good thing that we’ve been unable to find proof of extraterrestrial life.

Put simply, Fermi’s Paradox asks: if there is other life in the universe, where is everybody? Given the age of the universe and the large number of stars, doesn’t it stand to reason that life should be common? And yet we have no direct evidence that life exists anywhere other than on Earth. Why is that?

Since we have but one very small sample (the small part of this one solar system that we’ve studied) of evidence, we’re left with logical arguments and pseudo-scientific silliness like the Drake equation to explain why we’ve not made contact with other civilizations.

Yes, I realize that some people put a lot of faith in the Drake equation. But there’s no there there. It consists of seven variables whose values are incalculable. We have absolutely no idea what reasonable values are for any of them. The Drake equation is nothing but a formalized way to make wild guesses.

The logical arguments against extraterrestrial life go something like this: “If we’re not unique, then in a galaxy of 100 billion stars, many of which are older than our sun, you would expect that if even a tiny percentage of the planets developed a space-faring race, they would have spread throughout the galaxy.” The implication seems to be that if extraterrestrial life were possible, then it’s highly unlikely that we would have developed because the planet would have been colonized by somebody else.

Nick Bostrom, author of the article linked above, lays that out very nicely and concludes that there must be some Great Filter (a natural or societal calamity) that prevents development of civilizations that are capable of interstellar travel. His hope is that the Great Filter is something that happens early in the development of life or civilizations–something that would have happened to us eons ago–because to think otherwise would mean that the human race is doomed.

It’s interesting reading and one can hardly fault his logic, but it’s all just so much mental masturbation–exactly like the Drake equation. We simply don’t have enough evidence to say one way or the other. Lack of proof is not proof of lack. Drawing conclusions based on scant physical evidence and wild-assed guesses is mysticism, not science.

A variation on the homegrown DOS attack

Tuesday, in How to DOS yourself, I described how to erroneously configure an Apache server and cause what appears to be a denial of service attack. There’s another way to do it that is even more insidious.

In Tuesday’s post I showed how to configure error documents. There’s apparently another way to configure things so that, rather than returning an error status code (403 Forbidden, 404 Not Found, etc.), the server returns a 302 Redirect status code. The redirect tells the client (i.e. the browser or crawler) that the page requested can be found at a new location. That new location is returned along with the 302 Redirect status code.

When a browser sees the 302 status code, it issues a request for the new page.

Now, imagine what happens if you block an IP address from accessing your site (see Tuesday’s article) and you configure the server to return a redirect status code when somebody tries to access from that blocked IP address:

  1. Client tries to access http://yoursite.com/index.html
  2. Server notices the blocked IP address and says, “return 403 Forbidden.”
  3. Custom error handling returns a 302 Redirect pointing to http://yoursite.com/forbidden.html.
  4. Browser receives redirect status code and issues a request for http://yoursite.com/forbidden.html
  5. Go to step 2.

The browser and server now enter a cooperative infinite loop, with the browser saying “Show me the forbidden.html page,” and the server saying, “View forbidden.html instead.”

This is more insidious because from the server’s point of view it looks like the client is perpetrating a denial of service attack by continually attempting to access the same document. But the client is simply following the server’s directions.

Web crawlers won’t fall into this trap because they keep track of the pages they’ve visited or tried to visit. A Web crawler will see the first redirect and attempt to access the forbidden.html page, but on the next redirect the crawler will see that it’s already attempted that page, and give up.
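Here’s a sketch of that kind of bookkeeping in C# (to match the other code in these notes). It follows redirects by hand and refuses to request any URL it has already tried during the current fetch; error statuses, timeouts, and politeness are all ignored for brevity.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

static class LoopSafeFetcher
{
    // Returns the page body, or null if we hit a loop or too many redirects.
    public static string Fetch(string url, int maxRedirects)
    {
        var attempted = new HashSet<string>();

        for (int hop = 0; hop <= maxRedirects; hop++)
        {
            if (!attempted.Add(url))
            {
                return null;   // we've already tried this URL: redirect loop
            }

            var request = (HttpWebRequest)WebRequest.Create(url);
            request.AllowAutoRedirect = false;   // handle redirects ourselves

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                int status = (int)response.StatusCode;
                if (status >= 300 && status < 400)
                {
                    // Resolve the Location header against the current URL and try again.
                    url = new Uri(new Uri(url), response.Headers["Location"]).ToString();
                    continue;
                }

                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            }
        }

        return null;   // gave up: too many redirects
    }
}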

Not all browsers are that smart. Firefox tries a few times and then stops, showing an error message that says:

Firefox has detected that the server is redirecting the request for this address in a way that will never complete.

Internet Explorer, on the other hand, appears to continue trying indefinitely.

I don’t know enough about Apache server configuration to give an example of redirecting on error. I do know it’s possible, though, because I discovered such a redirect loop recently while investigating a problem report. Unfortunately, the Webmaster in question was not willing to share with me the pertinent sections of his .htaccess file.

How to DOS yourself

It’s surprising the things you’ll learn when you write a Web crawler. Today’s lesson: how to be both perpetrator and victim of your own denial of service attack.

Not everybody likes crawlers accessing their sites. Most will modify their robots.txt files first, which will prevent polite bots from crawling. But blocking impolite bots requires that you configure your server to deny access based on IP address or user-agent string. Some Web site operators, either because they don’t know any better or because they want to prevent bots from even accessing robots.txt, prefer to use the server configuration file for all bot-blocking. Doing so is easy enough, but you have to be careful or you can create a home-grown denial of service attack.

The discussion below covers Web sites running the Apache server. I don’t know how to effect IP blocks or custom error pages using IIS or any other Web server.

There are two ways (at least) to prevent access from a particular IP address to your Web site. The two ways I know of involve editing the .htaccess file, which usually is stored in the root directory of your Web site. [Note: The filename really does start with a period. For some reason, WordPress doesn’t like me putting that filename in a post without putting some HTML noise around it. So for the rest of this entry, I’ll refer to the file as htaccess, without the leading period.] As this isn’t a tutorial on htaccess I suggest that you do a Web search for “htaccess tutorial”, or consult your hosting provider’s help section for full information on how to use this file.

The simple method of blocking a particular IP address, available on all versions of Apache that I know of, is to use the <Files> directive. This htaccess fragment will block an IP address:

<Files *>
order deny,allow
deny from abc.def.ghi.jkl
</Files>

Of course, you would replace abc.def.ghi.jkl in that example with the actual IP address you want to block. If you want to block multiple addresses, you can specify them in separate deny directives, one per line. Some sites say that you can put multiple IP addresses on a single line. I don’t know if that works. There also is a way to block ranges of IP addresses.

If you do this, then any attempted access from the specified IP address will result in a “403 Forbidden” error code being returned to the client. The Web page returned with the error code is the default error page, which is very plain (some would say ugly), and not very helpful. Many sites, in order to make the error pages more helpful or to make them have the same look and feel as the rest of the site, configure the server to return a custom error page. Again, there are htaccess directives that control the use of custom error pages.

If you want a custom page to display when a 403 Forbidden is returned, you create the error page and add a line telling Apache where the page is and when it should be returned. If your error page is stored on your site at /forbidden.html, then adding this directive to htaccess tells Apache to return that page along with the 403 error:

ErrorDocument 403 /forbidden.html

Now, if somebody visits your site from the denied IP address, the server will return the custom error page along with a 403 Forbidden status code. It really does work. As far as I’ve been able to determine, nothing can go wrong with this configuration.

I said before that there are at least two ways to prevent access from a particular IP address. The other way that I know of involves using an Apache add-on called mod_rewrite, a very useful but also very complicated and difficult-to-master module with which you can do all manner of wondrous things. I don’t claim to be an expert in mod_rewrite. But it appears that you can block an IP address by adding these directives:

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^abc\.def\.ghi\.jkl$
RewriteRule .* - [F]

Again, you would replace the abc, def, etc. with the actual IP address numbers. As I understand it, this rule (assuming that mod_rewrite is installed and working) will prevent all accesses to your site from the given IP address. But there’s a potential problem.

If you have a custom 403 error document, the above can put your server into an infinite loop. According to this forum post at Webmaster World:

A blocked request is redirected to /forbidden.html, and the server tries to serve that instead, but since the user-agent or ip address is still blocked, it again redirects to the custom error page… it gets stuck in this loop.

There you have it: you are the perpetrator and victim of your own denial of service attack.

The forum post linked above shows how to avoid that problem.

I’ve seen some posts indicating that the infinite loop also is possible if you use the simple way of doing the blocking and error redirects. I haven’t been able to verify that. If you’re interested, check out this post, which also offers a solution if the problem occurs.

How I came to learn about this is another story. Perhaps I can relate it one day.

The New ReaderWriterLockSlim Class

Last year, in Improper Use of Exceptions, I mentioned that the ReaderWriterLock.AcquireReaderLock and ReaderWriterLock.AcquireWriterLock methods were improperly written because they throw exceptions when the lock is not available. I mentioned further that whoever designed the ReaderWriterLock should have studied the Monitor class for a more rational API.

Apparently I wasn’t the only one to think that, as .NET 3.5 introduced the ReaderWriterLockSlim class, which has a much more Monitor-like interface. ReaderWriterLockSlim reportedly has much better performance than ReaderWriterLock, as well as much simpler rules for lock recursion and upgrading/downgrading locks. The documentation says that ReaderWriterLockSlim avoids many cases of potential deadlock. All in all, it’s recommended over ReaderWriterLock for all new development.
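For reference, here’s a minimal sketch of the Monitor-like pattern with the new class. The SharedCounter type is just an example of mine; the point is that the Enter/Exit methods block until the lock is available, and the TryEnter variants take a timeout and return false instead of throwing.

using System;
using System.Threading;

class SharedCounter
{
    private readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();
    private int value;

    public int Read()
    {
        rwLock.EnterReadLock();            // waits as long as it takes
        try { return value; }
        finally { rwLock.ExitReadLock(); }
    }

    public bool TryIncrement(int millisecondsTimeout)
    {
        if (!rwLock.TryEnterWriteLock(millisecondsTimeout))
        {
            return false;                  // timed out; no exception thrown
        }
        try
        {
            value++;
            return true;
        }
        finally { rwLock.ExitWriteLock(); }
    }
}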

At the time I wrote my note last year, I was especially disappointed that there was no way to wait indefinitely to acquire a reader or writer lock. It turns out that I was wrong: you can wait indefinitely if you pass Timeout.Infinite as the timeout value to AcquireReaderLock or AcquireWriterLock. It’s documented, but not very well. Rather than stating the valid values in the description of the timeout parameter, the documentation for ReaderWriterLock has a link at the end of the Remarks section that says, “For valid time-out values, see ReaderWriterLock.”

I guess I need to follow the advice I read so long ago: “Periodically re-read the documentation for the functions you’re calling.”

Opt in or opt out?

I mentioned before that there is a small but very vocal group of webmasters who say that crawlers should stay off their sites unless specifically invited. It is their opinion that they shouldn’t have to include a robots.txt file in order to prevent bots from crawling their sites. Their reasons for holding this opinion vary, but generally fall into one of two categories: they have private content they don’t want indexed, or they don’t want their bandwidth “wasted” by bots. I understand their point of view, but in my opinion their model won’t work.

The nature of the Web is that everything is available to whoever wants to access it. If you don’t want your content publicly accessible, you have to take active measures to block access, either by blocking particular users or by adding some kind of authentication and authorization to allow only those users to whom you’ve granted access. The Web has always operated on this “opt-out” principle. One could argue that this open access to information is the whole reason for the Web. It’s certainly one of the primary reasons (if not the primary reason) that the Web continues to grow.

Search engines are essential to the operation of the Web. Most people who post information on the Web depend on the various search engines to index their sites so that when somebody is looking for a product or information, those people are directed to the right place. And users who want something depend on search engines to present them with relevant search results. Search engines can’t do that unless they can crawl Web sites.

The argument is that those people who want to be crawled should submit their sites to search engines, and that search engine crawlers should crawl only those sites that are submitted. It’s unlikely that such a thing could work, and even if it could the result would be a much less interesting Web because the difficulties involved in getting indexed would be too much for most site owners.

The first hurdle to overcome would be to determine who gets to submit a site’s URL to a search engine. It would be nearly impossible to police this and ensure that the person who submitted the site actually had authority to do so. Search engines could require written requests from the site’s published owner or technical contact (as reported by a whois search, for example), but the amount of paperwork involved would be astronomical. You could also build some kind of Web submission process that requires the submitter to supply some identifying information, but even that would be unreasonably difficult to build and manage.

There are approximately 100 million registered domain names at any given time. Names come and go and site owners change. It’s unreasonable to ask a search engine to keep track of that information. Imagine if the owner of the domain example.com submitted his site for crawling, but after a year let his domain name expire. A short time later somebody else registers example.com, but doesn’t notify the search engine of the ownership change. The new owner has no idea that the name was previously registered with the search engine and gets upset when his site is crawled. Is the search engine at fault?

There are many, many search engines, with more coming online all the time. To expect a webmaster to submit his site to every search engine is unreasonable. Granted, there are sites that will submit to multiple search engines, but going this route makes it even more difficult to keep track of things. Every submission site has to keep track of who got submitted where, and there has to be some kind of infrastructure so that webmasters can query the submission site’s database to determine if a particular bot is authorized.

Even if we somehow got the major search engines and the majority of site owners to agree that the opt-in policy is a good thing, you still run into the problem of ill-behaved bots: those that crawl regardless of permission. Again, the fundamental structure of the Web is openness. Absent legislation that makes uninvited crawling a crime (and the first hurdle there would be to define “crawling”–a problem that would be even more difficult than the policing problems I mentioned above), those ill-behaved bots will continue to crawl.

When you add it all up, it seems like a huge imposition on the majority of site owners who want visibility, just to satisfy a small number of owners who don’t want their sites crawled. It also places an unreasonable burden on those people who operate polite crawlers, while doing nothing to prevent the impolite crawlers from making a mess of things. This is especially true in light of the Robots Exclusion Standard, which is a very simple and effective way to ask those polite crawlers not to crawl. To prevent bots from crawling your site, just create a robots.txt file that looks like this:

User-agent: *
Disallow: /

That won’t prevent all crawling, as most crawlers will still hit robots.txt periodically, but almost all crawlers respect that simple robots.txt file and will crawl no further. Those that don’t respect it likely wouldn’t respect an opt-in policy, either. Creating and posting that file takes approximately two minutes (maybe five if you’re not familiar with the process), and it’s a whole lot more effective than trying to change a fundamental operating principle of the Web.

Why every site should have a robots.txt file

People often ask if they need a robots.txt file on their sites. I’ve seen some Web site tutorials that say, in effect, “don’t post a robots.txt file unless you really need it.” I think that is bad advice. In my opinion, every site needs a robots.txt file.

First a disclaimer. I’ve had my own Web site for 10 years, and my experience operating this site led me to the conclusion that all sites need a robots.txt file. In addition, as I’ve mentioned a time or three over the past year, I’m building a Web crawler. That work has strengthened my opinion that even the most modest Web site should have a robots.txt file.

Now let me explain why.

Search engine indexing is a fact of life on the Internet. Google, Yahoo, MSN, Internet Archive, and dozens of other search engines continually crawl the Web, downloading pages and storing them for later indexing. For most people, this is a Good Thing: search engines let other people find the content that you post. Without Web-scale search engines, sites would have to depend on linking from others, and most would simply die in obscurity. Most people who post things on the Internet want to be visible, and Web search engines provide that visibility. It is in your best interest, if you want to be visible, to make it as easy as possible for search engines to find and index your site. Part of making your site easy to crawl is including a robots.txt file.

Because we’re talking about an exclusion standard, you might ask why you need a robots.txt file if you don’t want to block anybody’s access. The answer has to do with what happens when a program tries to get a file from your site.

When a well-behaved Web crawler (that is, one that recognizes and adheres to the robots.txt convention) first visits your site, it tries to read a file called robots.txt. That file is expected to be in the root directory, and be in plain text format. (For example, the robots.txt file for this blog site is at http://blog.mischel.com/robots.txt.) If the file exists, the crawler downloads and parses it, and then adheres to the Allow and Disallow rules that are there.

Crawlers don’t actually download robots.txt before visiting every URL. Typically, a crawler will read robots.txt when it first comes to your site, and then cache it for a day or two. That saves you bandwidth and saves the crawler a whole lot of time. It also means that if you make changes to robots.txt, it might be a few days before the crawler in question sees them. For example, if you see Googlebot accessing your site and you change robots.txt to block it, don’t be surprised if Googlebot keeps accessing your site for the next couple of days.
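From the crawler’s side, that caching is simple bookkeeping. Here’s a toy sketch in C#; the two-day lifetime and the delegate that actually downloads the file are placeholders, not anybody’s real implementation.

using System;
using System.Collections.Generic;

class RobotsCache
{
    private readonly TimeSpan lifetime = TimeSpan.FromDays(2);
    private readonly Dictionary<string, CacheEntry> cache = new Dictionary<string, CacheEntry>();

    private class CacheEntry
    {
        public string RobotsText;    // null if the site has no robots.txt
        public DateTime FetchedUtc;
    }

    // fetchRobots is whatever actually downloads http://host/robots.txt.
    public string GetRobotsTxt(string host, Func<string, string> fetchRobots)
    {
        CacheEntry entry;
        if (!cache.TryGetValue(host, out entry) ||
            DateTime.UtcNow - entry.FetchedUtc > lifetime)
        {
            entry = new CacheEntry
            {
                RobotsText = fetchRobots(host),
                FetchedUtc = DateTime.UtcNow
            };
            cache[host] = entry;
        }
        return entry.RobotsText;
    }
}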

If you don’t have a robots.txt file, then two things happen: a “document not found” (code 404) error message is written to your server’s error log file, and the server returns something to the Web crawler.

The entry in your server error log can be annoying if you periodically scan the error log. Since 404 is the error code returned when any document isn’t found, scanning the error log from time to time is a good way to find bad links in (or to) your site. Having to wade through potentially hundreds of 404 errors for robots.txt is pretty annoying.

I said that your server returns “something” to the crawler that requested the document. On simple sites, “something” turns out to be a very short message with a 404 Not Found status code. Crawlers handle that without trouble. But many Web sites have custom error handling, and “Not Found” errors will redirect to an HTML page saying that the file was not found. That page often has lots of other stuff on it, making it fairly large. In this case, not having a robots.txt file ends up costing you bandwidth.

Your not having a robots.txt doesn’t particularly inconvenience the crawler. If your server returns a 404 or a custom error page, the crawler just notes that you don’t have a robots.txt, and continues on its way. When it visits your site again in a day or two, it’ll try to read robots.txt again.

So that’s why you should have a robots.txt: it prevents messages in your error log, potentially saves you bandwidth, and it lets you inform crawlers which parts of your site you want indexed.

Every Web site should have, at minimum, this robots.txt file:

User-agent: *
Disallow:

All this says is that you’re not disallowing anything: all crawlers have full access to read the entire site. That’s exactly what having no robots.txt file means, but it’s always a good idea to be specific when possible. Plus, if you create the file now while you’re thinking about it, you’ll find it much easier to modify in a hurry when you want to block a particular crawler.

How you create and post a robots.txt on your own site will depend on what hosting service you use. If you use FTP to put files up on the site, then you can create a file with a plain text editor (like Windows Notepad), add the two lines shown above, save it as robots.txt on your local drive, and then FTP it to the root of your Web site. If you use some kind of Web-based file manager, create a plain text file (NOT a Web page) at the top level of your site, add those two lines, and save the file as robots.txt. You can test it by going to http://yoursitename/robots.txt in your favorite Web browser.

More On Robots Exclusion

As I mentioned yesterday, the Robots Exclusion Standard is a very simple protocol that lets webmasters tell well-behaved crawlers how to access their sites. But the “standard” isn’t as well defined as some would have you think, and there’s plenty of room for interpretation.

Consider this simple file:

User-agent: *
Disallow: /xfiles/


User-agent: YourBot
Disallow: /myfiles/

This says, “YourBot can access everything but /myfiles/. All other bots can access everything except /xfiles/.” Note that it does not prevent YourBot from accessing /xfiles/, as some robots.txt tutorials would have you believe.

Crawlers use the following rules, in order, to determine what they’re allowed to crawl:

  1. If there is no robots.txt, I can access anything on the site.
  2. If there is an entry with my bot’s name in robots.txt, then I follow those instructions.
  3. If there is a “*” entry in robots.txt, then I follow those instructions.
  4. Otherwise, I can access everything.

It’s important to note that the crawler stops checking rules once it finds one that fits. So if there is an entry for YourBot in robots.txt, then YourBot will follow those rules and ignore the entry for all bots (*).
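In code, that record-selection order looks roughly like this (C#). RobotsRecord is a hypothetical type holding one record’s User-agent name and its rules; a real parser would populate a list of them from the file.

using System;
using System.Collections.Generic;

class RobotsRecord
{
    public string UserAgent;                           // e.g. "YourBot" or "*"
    public List<string> Disallow = new List<string>();
    public List<string> Allow = new List<string>();
}

static class RecordSelector
{
    // Returns the record the bot should obey, or null if it may crawl anything.
    public static RobotsRecord Select(IList<RobotsRecord> records, string botName)
    {
        // Rule 2: a record naming this bot wins outright...
        foreach (var record in records)
        {
            if (string.Equals(record.UserAgent, botName, StringComparison.OrdinalIgnoreCase))
            {
                return record;
            }
        }

        // Rule 3: ...otherwise fall back to the "*" record...
        foreach (var record in records)
        {
            if (record.UserAgent == "*")
            {
                return record;
            }
        }

        // Rules 1 and 4: no robots.txt, or no applicable record.
        return null;
    }
}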

If Disallow is the only directive, then there is no further room for interpretation. But the addition of the Allow directive threw a new wrench into the works: in what order do you process Allow and Disallow directives?

The revised Internet-Draft specification (note that I linked to archive.org here because the primary site has been down recently) for robots.txt says:

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

According to that description, if you wanted to allow access to /xfiles/mulder/, but disallow access to all other files in the /xfiles/ directory, you would write:

User-agent: *
Allow: /xfiles/mulder/
Disallow: /xfiles/

Several publicly available robots.txt modules work in this way, but that’s not the way that Google interprets robots.txt. In How do I block or allow Googlebot?, there is this example:

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Obviously, Googlebot reads all of the entries, checks first to see if the URL in question is specifically allowed, and if not, then checks to see if it is disallowed.

It’s a very big difference. If a bot were to implement the proposed standard, then it would never crawl /folder1/myfile.html, because the previous Disallow line would prevent it from getting beyond /folder1/.
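To make the difference concrete, here’s a sketch of both evaluation orders in C#. Each rule is a (isAllow, pathPrefix) pair in the order it appears in the record, and the prefix matching is deliberately simplistic. For the Googlebot example above, DraftAllows returns false for /folder1/myfile.html (the Disallow line matches first), while GoogleStyleAllows returns true.

using System;
using System.Collections.Generic;

static class RuleEvaluation
{
    // Internet-Draft behavior: the first matching line wins; default is allowed.
    public static bool DraftAllows(IList<KeyValuePair<bool, string>> rules, string path)
    {
        foreach (var rule in rules)
        {
            if (path.StartsWith(rule.Value, StringComparison.OrdinalIgnoreCase))
            {
                return rule.Key;   // Key is true for Allow, false for Disallow
            }
        }
        return true;
    }

    // Google-style behavior as described above: check the Allow lines first,
    // then the Disallow lines; default is allowed.
    public static bool GoogleStyleAllows(IList<KeyValuePair<bool, string>> rules, string path)
    {
        foreach (var rule in rules)
        {
            if (rule.Key && path.StartsWith(rule.Value, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        foreach (var rule in rules)
        {
            if (!rule.Key && path.StartsWith(rule.Value, StringComparison.OrdinalIgnoreCase))
            {
                return false;
            }
        }
        return true;
    }
}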

Yahoo says that they work the same way as Google, with respect to handling Allow and Disallow. It’s unclear what MSNBot does, or how other crawlers handle this. But, hey, if it’s good enough for Google…

I never would have thought that a simple protocol like robots.txt could raise so many questions. And I’ve only touched on the Allow and Disallow directives. There are plenty of other proposals and extensions out there that are even more confusing, if you can imagine. Add to that the META tags that you can add to individual HTML documents to prevent crawling or indexing, and things get really confusing.

But I’ll leave that alone for now. Next time I’ll explain why every Web site should have a robots.txt file, even if it doesn’t restrict access to anything.

Struggling with the Robots Exclusion Standard

The Internet community loves standards. We must. We have so many of them. Many of those “standards” are poorly defined or, even worse, ambiguous. Or, in the case of robots.txt, subject to a large number of extensions that have become something of a de facto standard because they’re supported by Google, Yahoo, and MSN Search. Unfortunately, those extensions can be ambiguous and difficult for a crawler to interpret correctly.

A little history is in order. The robots.txt “standard” is not an official standard in the same way that HTTP, SMTP, and other common Internet protocols are. There is no RFC that defines the standard, nor is there an associated standards body. The Robots Exclusion Standard was created by consensus in June 1994 by members of the robots mailing list. At the time it was created, the standard described how to tell Web robots (“bots,” “spiders,” “crawlers,” etc.) which parts of a site they should not visit. For example, to prevent all bots from visiting the path /dumbFiles/, and to stop WeirdBot from visiting anything on the site, you could create this robots.txt file:

# Prevent all bots from visiting /dumbFiles/
User-agent: *
Disallow: /dumbFiles/


# keep WeirdBot away!
User-agent: WeirdBot
Disallow: /

Understand, robots.txt doesn’t actually prevent a bot from visiting the site. It’s an advisory standard. It still requires the cooperation of the bot. The idea is that a well-behaved bot will read and parse the robots.txt file, and politely not crawl things it’s not supposed to crawl. In the absence of a robots.txt file, or if the robots.txt does not block the bot, the bot has free access to read any file.

There is a small but rather vocal group of webmasters who insist that having to include a robots.txt file is an unnecessary burden. Their view is that bots should stay off the site unless they’re invited. That is, robots.txt should be opt-in rather than opt-out. In this model, absent a robots.txt file or a line within robots.txt specifically allowing the bot, the bot should stay away. In my opinion, this is an unreasonable position, but it’s a topic for another discussion.

In its initial form, robots.txt was a simple and reasonably effective way to control bots’ access to Web sites. But it’s a rather blunt instrument. For example, imagine that your site has five directories, but you only want one of them accessible by bots. With the original standard, you’d have to write this:

User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/
Disallow: /dir4/

Not so bad with only five directories, but it can quickly become unwieldy with a much larger site. In addition, if you add another directory to the site, you’d have to add that directory to robots.txt if you don’t want it crawled.

One of the first modifications to the standard was the inclusion of an “Allow” directive, which overrides the Disallow. With Allow, you can block access to everything except that which you want crawled. The example above becomes:

User-agent: *
Disallow: /
Allow: /dir5/

But not all bots understand the Allow directive. A well-behaved bot that does not support Allow will see the Disallow directive and not crawl the site at all.

Another problem is that of case sensitivity, and there’s no perfect solution. In its default operating mode, the Apache Web server treats case as significant in URLs. That is, the URL http://example.com/myfile.html is not the same as http://example.com/MYFILE.html. But the default mode of Microsoft’s IIS is to ignore case. So on IIS, those two URLs would go to the same file. Imagine, then, what happens if you have a site that contains a directory called /files/ that you don’t want indexed. This simple robots.txt should suffice:

User-agent: *
Disallow: /files/

If the site is in case-sensitive mode (Apache’s default configuration), then bots have no problem. A bot will check the URL it wants to crawl to see if it starts with “/files/”, and if it does, the bot will move on without requesting the document. But if the URL starts with “/Files/”, the bot will request the document.

But what happens if the site is running in case-insensitive mode (IIS default configuration), and the bot wants to crawl the file /Files/index.html? If it does a naive case-sensitive comparison, it will end up crawling the file, because as far as the Web server is concerned, /Files/ and /files/ are the same thing.

Since both Web servers can operate in either mode (case significant or not), it’s exceedingly difficult (impossible, in some cases) for a bot to determine whether case is significant in URLs. So those of us who are trying to create polite Web crawlers end up writing our robots.txt parsers with the assumption that all servers are case-insensitive (i.e. they operate like IIS). Given the above robots.txt file, we won’t crawl a URL that begins with “/files/”, “/Files/”, “/fILEs/”, or any other variation that differs only in case. To do otherwise would risk violating what the webmaster intended when he wrote the robots.txt file, but we end up potentially not crawling files that we’re allowed to crawl.
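The conservative comparison itself amounts to one line of C# (a fragment, assuming the usual using System;):

// Treat every server as if it were case-insensitive: a URL path is considered
// blocked if it matches a Disallow prefix in any combination of upper and lower case.
static bool IsBlocked(string urlPath, string disallowPrefix)
{
    return urlPath.StartsWith(disallowPrefix, StringComparison.OrdinalIgnoreCase);
}

// IsBlocked("/Files/index.html", "/files/") and IsBlocked("/fILEs/x.html", "/files/")
// both return true, which errs on the side of not crawling.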

In a perfect world, this wouldn’t be required. But in the wide world of the Internet, it’s quite common for people to change case in links. I did it myself when I was running on IIS. My blog used to be at /Diary/index.html, but I and others often linked to /diary/index.html. That caused no end of confusion when I moved to a server running Apache. I had to make redirects that converted references to /Diary/ to /diary/.

Somewhere along the line, somebody decided that the Disallow and Allow directives should support wildcards and pattern matching. Google and Yahoo support this, but I’m not sure yet that their syntax and semantics are identical. I see no information that MSNBot supports these features. Some other crawlers support wildcards and pattern matching to different degrees and with varying syntax.

As useful as robots.txt has been and continues to be, it definitely needs an update. I fear, though, that any proposed “official” standard will never see the light of day, and if it does it will be overly complex and impossible to implement. The alternative isn’t particularly attractive, either: webmasters have to know the peculiarities of dozens of different bots, and bot writers have to decide which extended directives to support. It’s a difficult situation for both, but I don’t see how it can be reconciled. Likely the best a bot writer can do is attempt to implement those extensions that are supported by the major search engines, and document that on their sites.
