So, yeah, I’m not the first person to point out the parallels between the recent Bitcoin frenzy and the Dutch tulip mania of the 1630s. Nor, I suspect, am I the first to mention that Bitcoin’s meteoric rise bears a shocking resemblance to:
I wasn’t around for the first of those, but I saw the others happen. I even lost a large part of my meager savings in the 1980 gold frenzy. Every one of these events saw people betting their futures on a “sure thing” that “can’t lose.” They were putting their entire life savings into it, borrowing heavily to gamble on a speculative market that seemed like it would just keep going up. And in every case, the bubble burst, wiping out savings in a very short period.
Those bubbles burst because investors flocking to the “can’t lose” scheme drove prices to unsustainable levels. Early investors get in, ride the rise for a while, and then sell to new investors who want the same kind of trip. It becomes a positive feedback loop, until the price climbs too high to attract new investors. One day, somebody wants to get off and discovers that nobody wants to pay what he’s asking for his position. He decides to sell at a loss, at which point everybody else panics and starts unloading as fast as they can, hoping to get out with something.
I don’t know enough about Bitcoin and other crypto currencies to say what, if anything, they’re actually worth, or if the idea of crypto currency has any long-term merit. But the meteoric increase in Bitcoin prices over the last year, from $750 to $10,000, brings to mind those parallels, and a little bit more research reveals all the signs of a speculative bubble. The number of companies specializing in crypto currency trading has grown tremendously over the past year. There are “network marketing” schemes that pay you for “helping” others get in on the deal. New crypto currencies are popping up. People are borrowing money to invest. People are posting cheerleader messages (“Rah, rah Bitcoin!”) on social media. I’m seeing more hockey stick charts every day. “Look what it’s done in just the last three months!”
There may indeed be some lasting value to Bitcoin and other crypto currencies, just as there was lasting value in Beanie Babies. I saw some at a yard sale last week, going for about 50 cents each.
This is a re-post of an article I wrote for my .NET Reference Guide column some years ago.
You’re probably familiar with the Birthday problem, either from your introductory statistics class or from hanging around computer programmers for too long. Briefly, the math shows that the odds of any two people in a group having the same month and day of birth are quite a bit higher than you would intuitively expect. How high? Given a group of just 23 people, the odds are better than 50% that two of those people will share the same birthday. With 50 people in a group, the odds of two sharing the same birthday are above 97%, and by the time you have 60 people, it’s nearly a sure thing.
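If you don’t want to take those numbers on faith, the exact probability is easy to compute: the chance that n people all have different birthdays is (1 − 1/365)(1 − 2/365)···(1 − (n−1)/365), and the chance of a shared birthday is one minus that. Here’s a quick check (my own illustration, not part of the original column):

// Exact probability that at least two of n people share a birthday,
// assuming 365 equally likely birthdays.
static double BirthdayCollisionProb(int n)
{
    double allDistinct = 1.0;
    for (int k = 1; k < n; ++k)
    {
        allDistinct *= 1.0 - (double)k / 365.0;
    }
    return 1.0 - allDistinct;
}

// BirthdayCollisionProb(23) returns about 0.507
// BirthdayCollisionProb(50) returns about 0.970
// BirthdayCollisionProb(60) returns about 0.994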
This falls into the realm of what I call, “Math that just don’t feel right.” Nevertheless, it’s true. We can demonstrate it with a simple program that selects random numbers in the range of 0 to 364 until it repeats itself.
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    const int NumKeys = 365;
    const int Thousand = 1000;
    const int Million = Thousand * Thousand;
    const int NumTries = 10 * Million;

    static void Main(string[] args)
    {
        // List contains number of items to collision for each iteration.
        List<int> Totals = new List<int>();

        // Do the test NumTries times, saving the result from each attempt.
        for (int i = 0; i < NumTries; ++i)
        {
            int n = DoTest(NumKeys);
            Totals.Add(n);
            if ((i % 100000) == 0)
            {
                Console.WriteLine("{0,5:N0} - {1:N0}", i, n);
            }
        }

        // Group the items by their result.
        var grouped = from cnt in Totals
                      group cnt by cnt into newGroup
                      orderby newGroup.Key
                      select newGroup;

        // Output a table of results and counts, with running percentage.
        int runningTotal = 0;
        foreach (var grp in grouped)
        {
            runningTotal += grp.Count();
            Console.WriteLine("{0,3:N0}: {1,7:N0} ({2,8:P2}) ({3,8:P6})", grp.Key, grp.Count(),
                (double)grp.Count() / NumTries, (double)runningTotal / NumTries);
        }
        Console.WriteLine("Min items to collision = {0:N0}", Totals.Min());
        Console.WriteLine("Max items to collision = {0:N0}", Totals.Max());
        Console.WriteLine("Average items to collision = {0:N0}", Totals.Average());
        Console.ReadLine();
    }

    // Random number generator used for all attempts.
    static Random rnd = new Random();

    static int DoTest(int nkey)
    {
        HashSet<int> keys = new HashSet<int>();

        // Pick random numbers in the range [0..nkey-1]
        // until a duplicate is selected.
        int key;
        do
        {
            key = rnd.Next(nkey);
        } while (keys.Add(key));
        return keys.Count;
    }
}
The program isn’t as complicated as it looks. I structured it so that you can run the test many times and then aggregate the results. In this particular case, I run 10 million iterations.
The result of each iteration (the count of random numbers selected before a duplicate was found) is saved in the Totals list. When all iterations are done, the code groups the results so that we know how many times each result occurred. Finally, the grouped results are output with a running percentage. Here’s the output from a sample run:
This result agrees with the math. It shows some other interesting things, too.
Look at how many times the second number selected was a duplicate of the first: 0.27%. And in more than one percent of the iterations, the fifth number selected was a duplicate! All told, the odds of finding a duplicate are 10% with just 14 numbers selected.
You can argue that birthdays aren’t uniformly distributed, and you’d be right: some days have more births than others. But those real-world differences change the birthday problem only slightly, and an uneven distribution actually makes a collision more likely, not less, so the break-even point certainly isn’t any higher than 23.
You can also argue that the numbers returned by System.Random aren’t truly random. Again, you’d be right. You could code up a better random number generator that uses the cryptographic services to generate numbers that are more nearly random, but you’ll find that it won’t have much of an effect on the results.
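If you want to try that experiment, here is a minimal sketch of a drop-in replacement for DoTest that draws from a cryptographically strong generator. This is my code, not from the original column, and it assumes a .NET version new enough to have RandomNumberGenerator.GetInt32:

// Drop-in replacement for DoTest, drawing from a cryptographically
// strong generator instead of System.Random.
// Requires: using System.Security.Cryptography;
static int DoTestCrypto(int nkey)
{
    HashSet<int> keys = new HashSet<int>();

    // Pick random numbers in the range [0..nkey-1] until a duplicate is selected.
    // RandomNumberGenerator.GetInt32(nkey) returns a value in [0..nkey-1].
    int key;
    do
    {
        key = RandomNumberGenerator.GetInt32(nkey);
    } while (keys.Add(key));
    return keys.Count;
}

Run enough iterations and the distribution of collision points comes out essentially the same: the birthday math doesn’t care where the numbers come from, as long as they’re roughly uniform.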
So what do birthdays have to do with hash keys?
Let’s say you’re writing a program that has to keep track of a few million strings and you want a quick way to index them and eliminate duplicates. A fast and easy solution is to use hash codes as keys. Right? After all, there are four billion different 32-bit hash codes, and you’re just going to use a few million.
You can probably see where this is going. What’s the likelihood that you’ll run into a duplicate? That’s right, the birthday problem raises its ugly head here. Let’s look at the math and see what it tells us.
What is the probability of getting a collision when selecting one million items out of a pool of four billion (actually, 2^32, or 4,294,967,296)?
According to the Wikipedia article a coarse approximation, which is good enough for our purposes, is:
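The DuplicateProb function evaluated below appears to implement that approximation, which is p(n; d) ≈ 1 − e^(−n(n−1)/(2d)), where d is the number of possible values and n is the number of items selected. Here’s a sketch of what such a function might look like (my reconstruction, not the original listing):

// Coarse approximation of the probability of at least one collision
// when selecting nItems values from a field of fieldSize possibilities:
//   p ≈ 1 - e^(-n(n-1) / (2d))
static double DuplicateProb(double fieldSize, double nItems)
{
    double exponent = -(nItems * (nItems - 1.0)) / (2.0 * fieldSize);
    return 1.0 - Math.Exp(exponent);
}

// DuplicateProb(365, 23) returns about 0.50, matching the birthday problem.
// DuplicateProb(4294967296, 1000000): the exponent is about -116, and e^-116
// is so small that subtracting it from 1 rounds to exactly 1.0 in double precision.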
The numbers it produces don’t agree exactly with the calculated probabilities in the Wikipedia article, but they’re close, well within the range of what you’d expect for a coarse approximation.
So, then, about our one million items selected from a field of four billion:
DuplicateProb(4294967296, 1000000) = 1
What?!?!
How can that be possible? The probability of getting a duplicate within the first million items is so close to certainty that the computer calls it 1.0? This has to be a mistake, right?
It happens to be true. Unfortunately, it’s a bit difficult to prove with the birthday test program because Random.Next() only has a range of 0..2^31. I could modify the program to extend its range, but that’s really not necessary. Instead, I’ve run the program for increasing powers of two and present the results here:
Field Size       Min        Max     Average
  2^16             4      1,081         316
  2^20            21      4,390       1,281
  2^30           342    142,952      41,498
  2^31         1,036    193,529      57,606
These are the results of 10,000 runs each. Doing 10 million runs of the 2^16 test took quite a while, and the numbers weren’t much different. The only number that changed significantly was the Max value: the maximum number of items selected before a duplicate was found.
The table is more interesting when you replace the Min, Max, and Average values with their corresponding powers of two:
Field Size      Min      Max    Average
  16           2.00    10.08       8.30
  20           4.39    12.10      10.32
  30           8.42    17.13      15.34
  31          10.02    17.56      15.81
For every field size, the average number of items selected before a collision is approximately one half the power of two. Or, just a little more than the square root. That holds true for the birthday problem, too. The square root of 365 is about 19.1, and the 50% probability point is 23.
A really quick way to estimate your 50% probability point is to take the square root of the field size. That’s a bit of an underestimate, but it’ll put you in the ballpark.
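For slightly tighter rules of thumb, the standard birthday-problem approximations give the 50% point as about sqrt(2 * d * ln 2), roughly 1.18 times the square root, and the average number of selections before the first collision as about sqrt(pi * d / 2), roughly 1.25 times the square root. A small sketch (mine, for illustration):

// Approximate number of items at which the collision probability reaches 50%:
//   n50 ≈ sqrt(2 * d * ln 2) ≈ 1.18 * sqrt(d)
static double FiftyPercentPoint(double fieldSize)
{
    return Math.Sqrt(2.0 * fieldSize * Math.Log(2.0));
}

// Approximate average number of items selected before the first collision:
//   E[n] ≈ sqrt(pi * d / 2) ≈ 1.25 * sqrt(d)
static double AverageToCollision(double fieldSize)
{
    return Math.Sqrt(Math.PI * fieldSize / 2.0);
}

// FiftyPercentPoint(365)         returns about 22.5 (the true 50% point is 23)
// AverageToCollision(365)        returns about 23.9
// FiftyPercentPoint(4294967296)  returns about 77,000
// AverageToCollision(4294967296) returns about 82,000 (compare the averages in the table above)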
More interesting is the Max value in the table. If you think in powers of two, then the Max value is approximately two bits more than the 50% value. That is, 2^30 is a billion items. The square root is 2^15, or 32,768. Two more bits is 2^17, or 128K (131,072). The Max value for 2^30, above, is 142,952: close enough to 2^17 for me.
Given the above, what do you think the numbers for 2^32 would be? The average will be around 83,000, and the Max will be close to 300,000.
That’s right, by selecting only 300,000 items from a field of four billion, you’re almost guaranteed to find a duplicate. You have a 1% probability of collision with only about 9,300 items, and a 10% probability of collision with about 30,000 items selected.
It’s true that hash codes are not purely random, but a good hash function will give you very close to a uniform distribution. The problem is worse if the hash codes aren’t evenly distributed.
In case you haven’t got the point yet, don’t try to use a hash code as a unique identifier. The likelihood of getting a duplicate key is just too high.
I know what you’re thinking: what about making a 64-bit hash code? Before you get too excited about the idea, take a look at the Probability Table from that Wikipedia article. With a 64-bit hash code, you have about one chance in a million of getting a duplicate when you’re indexing six million items. For one million items, it’d be roughly one chance in 37 million. That sounds pretty unlikely, but it really isn’t when you consider how much data some programs index.
Granted, if you were only indexing a million items, you’d probably use something like a hash table to index them. It’s when you get into very large numbers of items that you begin worrying about the overhead imposed by your indexing method. Using a hash code sounds like a good option, until you discover the likelihood of generating a duplicate. Even with a 64-bit hash code, you have a 1% chance of generating a duplicate with only 610 million items.
Hash codes are great for quickly determining if two items can’t possibly match. That is, if two strings hash to different values, then they positively are not the same. However, hash codes are not a good solution for uniquely identifying items, because the probability of two different items hashing to the same value is surprisingly high.
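To make that last point concrete, here’s the pattern that is safe (a sketch of mine, not from the original column): use the hash code only to rule out a match quickly, and always confirm a match with a full comparison.

// Returns true only if the two strings are actually equal.
// The hash comparison is just a fast early-out for the common case
// where the values differ; it is never trusted as proof of equality.
static bool DefinitelyEqual(string a, string b)
{
    if (a.GetHashCode() != b.GetHashCode())
    {
        return false;  // Different hash codes: the strings cannot be equal.
    }
    return string.Equals(a, b, StringComparison.Ordinal);  // Same hash code: verify.
}

That’s essentially what a hash table does on your behalf, which is why a hash table or dictionary is the right tool for indexing and de-duplicating items, and a bare hash code is not.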