It’s surprising how often I hear (or see written) a programmer saying, “…and then I’ll compute a unique hash code that I can use for a key.”
The term “unique hash code” is a red flag indicating that the programmer who uttered it does not understand hash codes and is about to do something incredibly stupid. Let me provide a simple example.
In .NET, every object has a GetHashCode method that computes a 32-bit number that can serve as a key in a dictionary or hash table. Used properly, the hash code lets you create very efficient data structures for looking things up by name. But there’s no magic involved, really.
Used as intended–as the keys in a dictionary or hash table–hash codes work very well. If you try to use them for something else, it’s not going to work.
Let’s say you foolishly decide to use a hash code for a “unique key” when indexing some strings.
The number of strings is essentially infinite. The number of unique values that a 32-bit hash code can represent is a little more than four billion. So it’s a certainty that two or more strings will produce the same hash code. You might think, if you’re only storing 100,000 values, that the chance of collision (two strings hashing to the same value) is so small as not to matter. You’d be wrong.
If you’re familiar with the Birthday Paradox you know that, in any group of 25 people selected at random, the chance of two having the same birthday (month and day) are better than 50%. In a group of 60 people, it’s almost a certainty. If you don’t believe the math, you can do some empirical research of your own.
Hash codes are a lot like birthdays in this regard. A good rule of thumb is that the chance of collision (two items hashing to the same value) is 50% when the number of items hashed is equal to the square root of the number of possible values. So with a 32-bit hash code, the chance of two items hashing to the same value will be 50% when you’ve hashed 2^16, or 65,536 items. Again, if you don’t believe me just write a program that generates random strings and try it.
Another rule of thumb is that you’re almost certain to get a collision when the number of items is four times the square root. So with a 32-bit hash code, the chance of getting a collision when adding 256,000 items is almost 100%.
This is second year Computer Science stuff. Maybe even first year? And yet I hear the term “unique hash code” thrown around distressingly often by experienced programmers who should know better. It’s frightening.
If you’re interested in a bit more discussion and some sample programs that illustrate this point better, see Birthdays, Random Numbers, and Hash Keys.