Birthdays, random numbers, and hash keys

You’re probably familiar with the Birthday Paradox, either from your introductory statistics class or from hanging around computer programmers for too long. Briefly, the math shows that the odds of any two people in a group having the same month and day of birth are quite a bit higher than you would intuitively expect. How high? Given a group of just 23 people, the odds are better than 50% that two of those people will share the same birthday. With 50 people in a group, the odds of two sharing the same birthday are above 97%, and by the time you have 60 people, it’s nearly a sure thing.

Testing the birthday paradox

This falls into the realm of what I call “math that just don’t feel right.” Nevertheless, it’s true. We can demonstrate it with a simple program that selects random numbers in the range of 0 to 364 until it selects a duplicate.

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    const int NumKeys = 365;
    const int Thousand = 1000;
    const int Million = Thousand * Thousand;
    const int NumTries = 10 * Million;
    static void Main(string[] args)
    {
        // List contains number of items to collision for each iteration.
        List<int> Totals = new List<int>();
        // Do the test NumTries times, saving the result from each attempt.
        for (int i = 0; i < NumTries; ++i)
        {
            int n = DoTest(NumKeys);
            Totals.Add(n);
            if ((i % 100000) == 0)
            {
                Console.WriteLine("{0,5:N0} - {1:N0}", i, n);
            }
        }
        // Group the items by their result.
        var grouped = from cnt in Totals
                      group cnt by cnt into newGroup
                      orderby newGroup.Key
                      select newGroup;
        // Output a table of results and counts, with running percentage.
        int runningTotal = 0;
        foreach (var grp in grouped)
        {
            runningTotal += grp.Count();
            Console.WriteLine("{0,3:N0}: {1,7:N0} ({2,8:P2})  ({3,8:P6})", grp.Key, grp.Count(),
                (double)grp.Count() / NumTries, (double)runningTotal / NumTries);
        }
        Console.WriteLine("Min items to collision = {0:N0}", Totals.Min());
        Console.WriteLine("Max items to collision = {0:N0}", Totals.Max());
        Console.WriteLine("Average items to collision = {0:N0}", Totals.Average());
        Console.ReadLine();
    }
    // Random number generator used for all attempts.
    static Random rnd = new Random();
    static int DoTest(int nkey)
    {
        HashSet<int> keys = new HashSet<int>();
        // Pick random numbers in the range [0..nkey-1]
        // until a duplicate is selected.
        int key;
        do
        {
            key = rnd.Next(nkey);
        } while (keys.Add(key));
        return keys.Count;
    }
}

The program isn’t as complicated as it looks. I structured it so that you can run the test many times and then aggregate the results. In this particular case, I run 10 million iterations.

The result of each iteration (the count of random numbers selected before a duplicate was found) is saved in the Totals list. When all iterations are done, the code groups the results so that we know how many times each result occurred. Finally, the grouped results are output with a running percentage. Here’s the output from a sample run:

  1:  27,322 (  0.27 %)  (0.273220 %)
  2:  54,962 (  0.55 %)  (0.822840 %)
  3:  81,735 (  0.82 %)  (1.640190 %)
  4: 107,442 (  1.07 %)  (2.714610 %)
  5: 132,999 (  1.33 %)  (4.044600 %)
  6: 157,841 (  1.58 %)  (5.623010 %)
  7: 181,920 (  1.82 %)  (7.442210 %)
  8: 203,602 (  2.04 %)  (9.478230 %)
  9: 223,787 (  2.24 %)  (11.716100 %)
 10: 241,462 (  2.41 %)  (14.130720 %)
 11: 258,365 (  2.58 %)  (16.714370 %)
 12: 274,581 (  2.75 %)  (19.460180 %)
 13: 286,809 (  2.87 %)  (22.328270 %)
 14: 298,008 (  2.98 %)  (25.308350 %)
 15: 306,072 (  3.06 %)  (28.369070 %)
 16: 314,766 (  3.15 %)  (31.516730 %)
 17: 318,592 (  3.19 %)  (34.702650 %)
 18: 323,361 (  3.23 %)  (37.936260 %)
 19: 323,649 (  3.24 %)  (41.172750 %)
 20: 322,924 (  3.23 %)  (44.401990 %)
 21: 319,847 (  3.20 %)  (47.600460 %)
 22: 316,430 (  3.16 %)  (50.764760 %)
 23: 310,096 (  3.10 %)  (53.865720 %)
 24: 303,122 (  3.03 %)  (56.896940 %)
 25: 295,341 (  2.95 %)  (59.850350 %)
 26: 285,219 (  2.85 %)  (62.702540 %)
 27: 276,433 (  2.76 %)  (65.466870 %)
 28: 265,044 (  2.65 %)  (68.117310 %)
 29: 253,646 (  2.54 %)  (70.653770 %)
 30: 241,188 (  2.41 %)  (73.065650 %)
 31: 228,559 (  2.29 %)  (75.351240 %)
 32: 216,584 (  2.17 %)  (77.517080 %)
 33: 203,608 (  2.04 %)  (79.553160 %)
 34: 190,222 (  1.90 %)  (81.455380 %)
 35: 177,870 (  1.78 %)  (83.234080 %)
 36: 165,091 (  1.65 %)  (84.884990 %)
 37: 153,658 (  1.54 %)  (86.421570 %)
 38: 141,501 (  1.42 %)  (87.836580 %)
 39: 129,984 (  1.30 %)  (89.136420 %)
 40: 118,849 (  1.19 %)  (90.324910 %)
 41: 108,598 (  1.09 %)  (91.410890 %)
 42:  98,732 (  0.99 %)  (92.398210 %)
 43:  89,560 (  0.90 %)  (93.293810 %)
 44:  80,692 (  0.81 %)  (94.100730 %)
 45:  72,640 (  0.73 %)  (94.827130 %)
 46:  65,559 (  0.66 %)  (95.482720 %)
 47:  58,525 (  0.59 %)  (96.067970 %)
 48:  51,877 (  0.52 %)  (96.586740 %)
 49:  46,329 (  0.46 %)  (97.050030 %)
 50:  40,333 (  0.40 %)  (97.453360 %)
 51:  35,917 (  0.36 %)  (97.812530 %)
 52:  31,270 (  0.31 %)  (98.125230 %)
 53:  27,409 (  0.27 %)  (98.399320 %)
 54:  23,349 (  0.23 %)  (98.632810 %)
 55:  20,482 (  0.20 %)  (98.837630 %)
 56:  17,441 (  0.17 %)  (99.012040 %)
 57:  15,390 (  0.15 %)  (99.165940 %)
 58:  13,378 (  0.13 %)  (99.299720 %)
 59:  11,302 (  0.11 %)  (99.412740 %)
 60:   9,638 (  0.10 %)  (99.509120 %)
 61:   8,199 (  0.08 %)  (99.591110 %)
 62:   6,851 (  0.07 %)  (99.659620 %)
 63:   5,876 (  0.06 %)  (99.718380 %)
 64:   4,843 (  0.05 %)  (99.766810 %)
 65:   4,077 (  0.04 %)  (99.807580 %)
 66:   3,433 (  0.03 %)  (99.841910 %)
 67:   3,098 (  0.03 %)  (99.872890 %)
 68:   2,388 (  0.02 %)  (99.896770 %)
 69:   1,903 (  0.02 %)  (99.915800 %)
 70:   1,541 (  0.02 %)  (99.931210 %)
 71:   1,308 (  0.01 %)  (99.944290 %)
 72:   1,210 (  0.01 %)  (99.956390 %)
 73:     893 (  0.01 %)  (99.965320 %)
 74:     666 (  0.01 %)  (99.971980 %)
 75:     588 (  0.01 %)  (99.977860 %)
 76:     463 (  0.00 %)  (99.982490 %)
 77:     369 (  0.00 %)  (99.986180 %)
 78:     290 (  0.00 %)  (99.989080 %)
 79:     235 (  0.00 %)  (99.991430 %)
 80:     185 (  0.00 %)  (99.993280 %)
 81:     161 (  0.00 %)  (99.994890 %)
 82:     102 (  0.00 %)  (99.995910 %)
 83:     103 (  0.00 %)  (99.996940 %)
 84:      59 (  0.00 %)  (99.997530 %)
 85:      55 (  0.00 %)  (99.998080 %)
 86:      48 (  0.00 %)  (99.998560 %)
 87:      40 (  0.00 %)  (99.998960 %)
 88:      29 (  0.00 %)  (99.999250 %)
 89:      16 (  0.00 %)  (99.999410 %)
 90:      15 (  0.00 %)  (99.999560 %)
 91:      11 (  0.00 %)  (99.999670 %)
 92:       8 (  0.00 %)  (99.999750 %)
 93:       4 (  0.00 %)  (99.999790 %)
 94:       3 (  0.00 %)  (99.999820 %)
 95:       7 (  0.00 %)  (99.999890 %)
 96:       3 (  0.00 %)  (99.999920 %)
 97:       2 (  0.00 %)  (99.999940 %)
 98:       1 (  0.00 %)  (99.999950 %)
 99:       1 (  0.00 %)  (99.999960 %)
100:       1 (  0.00 %)  (99.999970 %)
101:       2 (  0.00 %)  (99.999990 %)
102:       1 (  0.00 %)  (100.000000 %)
Min items to collision = 1
Max items to collision = 102
Average items to collision = 24

This result agrees with the math. It shows some other interesting things, too.

Look at how many times the second number selected was a duplicate of the first: 0.27% of the time. And in more than one percent of the iterations, the fifth number selected was a duplicate! All told, the odds of finding a duplicate within the first 10 numbers selected are nearly 12%.

You can argue that birthdays aren’t exactly random, and you’d be right: some days have more births than others. But those real-world differences change the numbers only slightly, and a non-uniform distribution actually makes a collision a little more likely, so if anything the break-even point is slightly below 23, not above it.

You can also argue that the numbers returned by System.Random aren’t truly random. Again, you’d be right. You could code up a better random number generator that uses the cryptographic services to generate numbers that are more nearly random, but you’ll find that it won’t have much of an effect on the results.
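If you want to try it, here’s a minimal sketch of a replacement for DoTest that draws from the cryptographic random number generator instead of System.Random. It assumes .NET Core 3.0 or later, where RandomNumberGenerator.GetInt32 is available, and the method name DoTestCrypto is mine:

using System.Collections.Generic;
using System.Security.Cryptography;

static int DoTestCrypto(int nkey)
{
    HashSet<int> keys = new HashSet<int>();
    // Pick cryptographically random numbers in the range [0..nkey-1]
    // until a duplicate is selected.
    int key;
    do
    {
        key = RandomNumberGenerator.GetInt32(nkey);
    } while (keys.Add(key));
    return keys.Count;
}

Swapping it in for DoTest shouldn’t change the shape of the results much, which is the point.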

What does that have to do with hash keys?

Let’s say you’re writing a program that has to keep track of a few million strings and you want a quick way to index them and eliminate duplicates. A fast and easy solution is to use hash codes as keys, right? After all, there are four billion different 32-bit hash codes, and you’re just going to use a few million.

You can probably see where this is going. What’s the likelihood that you’ll run into a duplicate? That’s right, the Birthday Paradox raises its ugly head here. Let’s look at the math and see what it tells us.

What is the probability of getting a collision when selecting one million items out of a pool of four billion (actually, 2^32, or 4,294,967,296)?

According to the Wikipedia article, a coarse approximation that’s good enough for our purposes is:

p = 1 - e^(-n^2 / (2d))

where n is the number of items selected and d is the size of the field (2^32 in this case).

We can wrap that up in a neat little method:

// Approximate probability of at least one collision when selecting
// nItems values at random from a field of fieldSize possible values.
static double DuplicateProb(long fieldSize, long nItems)
{
    double minusNSquared = -(nItems * nItems);
    return 1 - Math.Exp(minusNSquared / (2 * fieldSize));
}

A few tests will tell us if it’s working:

DuplicateProb(365, 10) = 0.128017828988458
DuplicateProb(365, 20) = 0.421863457482716
DuplicateProb(365, 23) = 0.515509538061517
DuplicateProb(365, 30) = 0.708547055710068
DuplicateProb(365, 50) = 0.967439570087906
DuplicateProb(365, 57) = 0.988329429309313
DuplicateProb(365, 100) = 0.999998876014983

Those numbers don’t agree exactly with the calculated probabilities in the Wikipedia article or with my experimental results above, but they’re close, within the range of what you’d expect for a coarse approximation.

So, then, about our one million items selected from a field of four billion:

DuplicateProb(4294967296, 1000000) = 1

What?!?!

How can that be possible? The probability of getting a duplicate within the first million items is so close to certainty that the computer calls it 1.0? This has to be a mistake, right?
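Here’s a quick way to convince yourself the computer isn’t misbehaving (a check I added, not part of the test program): the exponent in the approximation is -n^2/(2 * 2^32), which is about -116, and e^-116 is around 10^-50, far smaller than the gap between 1.0 and the nearest representable double. You can drop these lines into Main to see it:

// Why DuplicateProb(2^32, 1,000,000) prints 1: the value being
// subtracted from 1 is unimaginably small.
double exponent = -(1e6 * 1e6) / (2 * Math.Pow(2, 32));  // about -116.4
Console.WriteLine(Math.Exp(exponent));                   // roughly 2.8E-51
Console.WriteLine(1 - Math.Exp(exponent) == 1.0);        // True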

It happens to be true. Unfortunately, it’s a bit difficult to demonstrate with the birthday test program, because Random.Next() tops out just below 2^31. I could modify the program to extend its range, but that’s really not necessary. Instead, I’ve run the program for increasing powers of two and present the results here:

Field Size        Min        Max    Average
2^16                4      1,081        316
2^20               21      4,390      1,281
2^30              342    142,952     41,498
2^31            1,036    193,529     57,606

These are the results of 10,000 runs each. Doing 10 million runs of the 2^16 test took quite a while, and the numbers weren’t much different. The only number that changed significantly was the Max value: the maximum number of items selected before a duplicate was found.
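For the record, here’s roughly the kind of driver I used to collect those numbers. This is a reconstruction rather than the exact code; it reuses DoTest from the program above, and the 2^31 row actually uses int.MaxValue (2^31 - 1), since that’s as high as Random.Next() will go:

// Run DoTest 10,000 times for each field size and report Min/Max/Average.
static void RunFieldSizeTests()
{
    int[] exponents = { 16, 20, 30, 31 };
    const int Runs = 10 * Thousand;
    foreach (int exp in exponents)
    {
        // 1 << 31 overflows a signed int, so use int.MaxValue for that row.
        int fieldSize = (exp == 31) ? int.MaxValue : (1 << exp);
        List<int> results = new List<int>();
        for (int i = 0; i < Runs; ++i)
        {
            results.Add(DoTest(fieldSize));
        }
        Console.WriteLine("2^{0}: Min = {1:N0}, Max = {2:N0}, Average = {3:N0}",
            exp, results.Min(), results.Max(), results.Average());
    }
}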

The table is more interesting when you replace the Min, Max, and Average values with their corresponding powers of two:

Field Size      Min      Max    Average
16             2.00    10.08       8.30
20             4.39    12.10      10.32
30             8.42    17.13      15.34
31            10.02    17.56      15.81

For every field size, the average number of items selected before a collision is approximately 2 raised to half the field’s exponent; in other words, a little more than the square root of the field size. That holds true for the birthday problem, too: the square root of 365 is about 19.1, the observed average was 24, and the 50% probability point is 23.

A really quick way to estimate your 50% probability point is to take the square root of the field size. That’s a bit of an underestimate (the actual 50% point works out to about 1.18 times the square root), but it’ll put you in the ballpark.
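For anyone who wants slightly tighter numbers, two standard birthday-problem approximations (my addition, not something from the test program) match the tables nicely: the average number of draws to the first collision is about sqrt(pi * d / 2), and the 50% point is about sqrt(2 * d * ln 2), where d is the field size.

// Standard approximations for a field of d equally likely values.
static double AverageToCollision(double d)
{
    return Math.Sqrt(Math.PI * d / 2);
}
static double FiftyPercentPoint(double d)
{
    return Math.Sqrt(2 * d * Math.Log(2));
}

// AverageToCollision(365)             = about 23.9    (observed average: 24)
// FiftyPercentPoint(365)              = about 22.5    (the familiar 23)
// AverageToCollision(Math.Pow(2, 31)) = about 58,100  (observed average: 57,606)
// AverageToCollision(Math.Pow(2, 32)) = about 82,100

That last figure lines up with the “around 83,000” estimate for 2^32 below.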

More interesting is the Max value in the table. If you think in powers of two, then the Max value is approximately two bits more than the 50% value. That is, 2^30 is a billion items. The square root is 2^15, or 32,768. Two more bits is 2^17, or 128K (131,072). The Max value for 2^30, above, is 142,952: close enough to 2^17 for me.

Given the above, what do you think the numbers for 2^32 would be? The average will be around 83,000, and the Max will be close to 300,000.

That’s right, by selecting only 300,000 items from a field of four billion, you’re almost guaranteed to find a duplicate. You have about a 1% probability of collision with roughly 9,300 items, and a 10% probability of collision with about 30,000 items selected.
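Those figures come from turning the approximation around: solving 1 - e^(-n^2/(2d)) = p for n gives n = sqrt(2d * ln(1/(1 - p))). Here’s that as a method (my addition), with a few values for a 32-bit field:

// Number of items that gives collision probability p in a field of size d,
// by inverting the approximation used in DuplicateProb.
static double ItemsForProbability(double d, double p)
{
    return Math.Sqrt(2 * d * Math.Log(1 / (1 - p)));
}

// ItemsForProbability(Math.Pow(2, 32), 0.01) = about 9,300
// ItemsForProbability(Math.Pow(2, 32), 0.10) = about 30,100
// ItemsForProbability(Math.Pow(2, 32), 0.50) = about 77,200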

It’s true that hash codes are not purely random, but a good hash function will give you very close to a uniform distribution. The problem is worse if the hash codes aren’t evenly distributed.

In case you haven’t got the point yet, don’t try to use a hash code as a unique identifier. The likelihood of getting a duplicate key is just too high.
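If you’d like to watch it happen with real hash codes instead of random numbers, here’s a small sketch (mine, not part of the original test program) that hashes obviously distinct strings until two of them produce the same 32-bit hash code:

// Hash distinct strings until two different strings share a hash code.
static int StringsUntilHashCollision()
{
    Dictionary<int, string> seen = new Dictionary<int, string>();
    for (int i = 0; ; ++i)
    {
        string s = "item-" + i;   // every string is distinct by construction
        int h = s.GetHashCode();
        if (seen.TryGetValue(h, out string other))
        {
            Console.WriteLine("\"{0}\" and \"{1}\" both hash to {2}", s, other, h);
            return seen.Count;    // distinct strings hashed before the collision
        }
        seen[h] = s;
    }
}

With a reasonably uniform 32-bit hash you should typically see a collision somewhere in the tens of thousands of strings, in line with the 2^32 numbers above. (Note that .NET Core randomizes string hash codes per process, so the particular strings that collide will differ from run to run.)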

I know what you’re thinking: what about making a 64-bit hash code? Before you get too excited about the idea, take a look at the Probability Table from that Wikipedia article. With a 64-bit hash code, you have about one chance in a million of getting a duplicate when you’re hashing six million items. For one million items, it’d be more like one chance in 37 million. That sounds pretty unlikely, but it really isn’t when you consider how often some programs run.

Granted, if you were only indexing a million items, you’d probably use something like a hash table that can resolve collisions. It’s when you get into very large numbers of items that you begin worrying about the overhead imposed by your indexing method. Using a hash code sounds like a good option, until you discover the likelihood of generating a duplicate. Even with a 64-bit hash code, you have a 1% chance of generating a duplicate with only 610 million items.
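One practical note if you want to check those 64-bit numbers yourself: the DuplicateProb method above takes a long for the field size, and 2^64 doesn’t fit in a long. A double-based variant (my addition) is good enough for back-of-the-envelope figures like these:

// Same approximation as DuplicateProb, but with double parameters so the
// field size can be 2^64 (which overflows a long).
static double DuplicateProbBig(double fieldSize, double nItems)
{
    return 1 - Math.Exp(-(nItems * nItems) / (2 * fieldSize));
}

// DuplicateProbBig(Math.Pow(2, 64), 6000000)   = about 1e-6   (one in a million)
// DuplicateProbBig(Math.Pow(2, 64), 1000000)   = about 2.7e-8 (one in 37 million)
// DuplicateProbBig(Math.Pow(2, 64), 610000000) = about 0.01   (1%)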

Hash codes are great for quickly determining if two items can’t possibly match. That is, if two strings hash to different values, then they positively are not the same. However, hash codes are not a good solution for uniquely identifying items, because the likelihood of two different items hashing to the same value is surprisingly high.