Birthdays, random numbers, and hash keys

This is a re-post of an article I wrote for my .NET Reference Guide column some years ago.

You’re probably familiar with the Birthday problem, either from your introductory statistics class or from hanging around computer programmers for too long. Briefly, the math shows that the odds of any two people in a group having the same month and day of birth are quite a bit higher than you would intuitively expect. How high? Given a group of just 23 people, the odds are better than 50% that two of those people will share the same birthday. With 50 people in a group, the odds of two sharing the same birthday is above 97%, and by the time you have 60 people, it’s nearly a sure thing.

This falls into the realm of what I call, “Math that just don’t feel right.” Nevertheless, it’s true. We can demonstrate it with a simple program that selects random numbers in the range of 0 to 364 until it repeats itself.

    class Program
    {
        const int NumKeys = 365;
        const int Thousand = 1000;
        const int Million = Thousand * Thousand;
        const int NumTries = 10 * Million;
        static void Main(string[] args)
        {
            // List contains number of items to collision for each iteration.
            List<int> Totals = new List<int>();
            // Do the test NumTries times, saving the result from each attempt.
            for (int i = 0; i < NumTries; ++i)
            {
                int n = DoTest(NumKeys);
                Totals.Add(n);
                if ((i % 100000) == 0)
                {
                    Console.WriteLine("{0,5:N0} - {1:N0}", i, n);
                }
            }
            // Group the items by their result.
            var grouped = from cnt in Totals
                        group cnt by cnt into newGroup
                        orderby newGroup.Key
                        select newGroup;
            // Output a table of results and counts, with running percentage.
            int runningTotal = 0;
            foreach (var grp in grouped)
            {
                runningTotal += grp.Count();
                Console.WriteLine("{0,3:N0}: {1,7:N0} ({2,8:P2})  ({3,8:P6})", grp.Key, grp.Count(),
                    (double)grp.Count() / NumTries, (double)runningTotal / NumTries);
            }
            Console.WriteLine("Min items to collision = {0:N0}", Totals.Min());
            Console.WriteLine("Max items to collision = {0:N0}", Totals.Max());
            Console.WriteLine("Average items to collision = {0:N0}", Totals.Average());
            Console.ReadLine();
        }
        // Random number generator used for all attempts.
        static Random rnd = new Random();
        static int DoTest(int nkey)
        {
            HashSet keys = new HashSet();
            // Pick random numbers in the range [0..nkey-1]
            // until a duplicate is selected.
            int key;
            do
            {
                key = rnd.Next(nkey);
            } while (keys.Add(key));
            return keys.Count;
        }
    }

The program isn’t as complicated as it looks. I structured it so that you can run the test many times and then aggregate the results. In this particular case, I run 10 million iterations.

The result of each iteration (the count of random numbers selected before a duplicate was found) is saved in the Totals list. When all iterations are done, the code groups the results so that we know how many times each result occurred. Finally, the grouped results are output with a running percentage. Here’s the output from a sample run:

    1:  27,322 (  0.27 %)  (0.273220 %)
    2:  54,962 (  0.55 %)  (0.822840 %)
    3:  81,735 (  0.82 %)  (1.640190 %)
    4: 107,442 (  1.07 %)  (2.714610 %)
    5: 132,999 (  1.33 %)  (4.044600 %)
    6: 157,841 (  1.58 %)  (5.623010 %)
    7: 181,920 (  1.82 %)  (7.442210 %)
    8: 203,602 (  2.04 %)  (9.478230 %)
    9: 223,787 (  2.24 %)  (11.716100 %)
    10: 241,462 (  2.41 %)  (14.130720 %)
    11: 258,365 (  2.58 %)  (16.714370 %)
    12: 274,581 (  2.75 %)  (19.460180 %)
    13: 286,809 (  2.87 %)  (22.328270 %)
    14: 298,008 (  2.98 %)  (25.308350 %)
    15: 306,072 (  3.06 %)  (28.369070 %)
    16: 314,766 (  3.15 %)  (31.516730 %)
    17: 318,592 (  3.19 %)  (34.702650 %)
    18: 323,361 (  3.23 %)  (37.936260 %)
    19: 323,649 (  3.24 %)  (41.172750 %)
    20: 322,924 (  3.23 %)  (44.401990 %)
    21: 319,847 (  3.20 %)  (47.600460 %)
    22: 316,430 (  3.16 %)  (50.764760 %)
    23: 310,096 (  3.10 %)  (53.865720 %)
    24: 303,122 (  3.03 %)  (56.896940 %)
    25: 295,341 (  2.95 %)  (59.850350 %)
    26: 285,219 (  2.85 %)  (62.702540 %)
    27: 276,433 (  2.76 %)  (65.466870 %)
    28: 265,044 (  2.65 %)  (68.117310 %)
    29: 253,646 (  2.54 %)  (70.653770 %)
    30: 241,188 (  2.41 %)  (73.065650 %)
    31: 228,559 (  2.29 %)  (75.351240 %)
    32: 216,584 (  2.17 %)  (77.517080 %)
    33: 203,608 (  2.04 %)  (79.553160 %)
    34: 190,222 (  1.90 %)  (81.455380 %)
    35: 177,870 (  1.78 %)  (83.234080 %)
    36: 165,091 (  1.65 %)  (84.884990 %)
    37: 153,658 (  1.54 %)  (86.421570 %)
    38: 141,501 (  1.42 %)  (87.836580 %)
    39: 129,984 (  1.30 %)  (89.136420 %)
    40: 118,849 (  1.19 %)  (90.324910 %)
    41: 108,598 (  1.09 %)  (91.410890 %)
    42:  98,732 (  0.99 %)  (92.398210 %)
    43:  89,560 (  0.90 %)  (93.293810 %)
    44:  80,692 (  0.81 %)  (94.100730 %)
    45:  72,640 (  0.73 %)  (94.827130 %)
    46:  65,559 (  0.66 %)  (95.482720 %)
    47:  58,525 (  0.59 %)  (96.067970 %)
    48:  51,877 (  0.52 %)  (96.586740 %)
    49:  46,329 (  0.46 %)  (97.050030 %)
    50:  40,333 (  0.40 %)  (97.453360 %)
    51:  35,917 (  0.36 %)  (97.812530 %)
    52:  31,270 (  0.31 %)  (98.125230 %)
    53:  27,409 (  0.27 %)  (98.399320 %)
    54:  23,349 (  0.23 %)  (98.632810 %)
    55:  20,482 (  0.20 %)  (98.837630 %)
    56:  17,441 (  0.17 %)  (99.012040 %)
    57:  15,390 (  0.15 %)  (99.165940 %)
    58:  13,378 (  0.13 %)  (99.299720 %)
    59:  11,302 (  0.11 %)  (99.412740 %)
    60:   9,638 (  0.10 %)  (99.509120 %)
    61:   8,199 (  0.08 %)  (99.591110 %)
    62:   6,851 (  0.07 %)  (99.659620 %)
    63:   5,876 (  0.06 %)  (99.718380 %)
    64:   4,843 (  0.05 %)  (99.766810 %)
    65:   4,077 (  0.04 %)  (99.807580 %)
    66:   3,433 (  0.03 %)  (99.841910 %)
    67:   3,098 (  0.03 %)  (99.872890 %)
    68:   2,388 (  0.02 %)  (99.896770 %)
    69:   1,903 (  0.02 %)  (99.915800 %)
    70:   1,541 (  0.02 %)  (99.931210 %)
    71:   1,308 (  0.01 %)  (99.944290 %)
    72:   1,210 (  0.01 %)  (99.956390 %)
    73:     893 (  0.01 %)  (99.965320 %)
    74:     666 (  0.01 %)  (99.971980 %)
    75:     588 (  0.01 %)  (99.977860 %)
    76:     463 (  0.00 %)  (99.982490 %)
    77:     369 (  0.00 %)  (99.986180 %)
    78:     290 (  0.00 %)  (99.989080 %)
    79:     235 (  0.00 %)  (99.991430 %)
    80:     185 (  0.00 %)  (99.993280 %)
    81:     161 (  0.00 %)  (99.994890 %)
    82:     102 (  0.00 %)  (99.995910 %)
    83:     103 (  0.00 %)  (99.996940 %)
    84:      59 (  0.00 %)  (99.997530 %)
    85:      55 (  0.00 %)  (99.998080 %)
    86:      48 (  0.00 %)  (99.998560 %)
    87:      40 (  0.00 %)  (99.998960 %)
    88:      29 (  0.00 %)  (99.999250 %)
    89:      16 (  0.00 %)  (99.999410 %)
    90:      15 (  0.00 %)  (99.999560 %)
    91:      11 (  0.00 %)  (99.999670 %)
    92:       8 (  0.00 %)  (99.999750 %)
    93:       4 (  0.00 %)  (99.999790 %)
    94:       3 (  0.00 %)  (99.999820 %)
    95:       7 (  0.00 %)  (99.999890 %)
    96:       3 (  0.00 %)  (99.999920 %)
    97:       2 (  0.00 %)  (99.999940 %)
    98:       1 (  0.00 %)  (99.999950 %)
    99:       1 (  0.00 %)  (99.999960 %)
    100:       1 (  0.00 %)  (99.999970 %)
    101:       2 (  0.00 %)  (99.999990 %)
    102:       1 (  0.00 %)  (100.000000 %)
    Min items to collision = 1
    Max items to collision = 102
    Average items to collision = 24

This result agrees with the math. It shows some other interesting things, too.

Look at how many times the second number selected was a duplicate of the first: 0.27%. And in more than one percent of the iterations, the fifth number selected was a duplicate! All told, the odds of finding a duplicate is 10% with just 14 numbers selected.

You can argue that birthdays aren’t exactly random, and you’d be right. Some days have more births than others. Although those real world differences are significant, they only affect the birthday problem by a few points. Maybe the break even point is really 25 rather than 23.

You can also argue that the numbers returned by System.Random aren’t truly random. Again, you’d be right. You could code up a better random number generator that uses the cryptographic services to generate numbers that are more nearly random, but you’ll find that it won’t have much of an effect on the results.

So what do birthdays have to do with hash keys?

Let’s say you’re writing a program that has to keep track of a few million strings and you want a quick way to index them and eliminate duplicates. A fast and easy solution is to use hash codes as keys, right? After all, there are four billion different 32-bit hash codes, and you’re just going to use a few million.

You can probably see where this is going. What’s the likelihood that you’ll run into a duplicate? That’s right, the birthday problem raises its ugly head here. Let’s look at the math and see what it tells us.

What is the probability of getting a collision when selecting one million items out of a pool of four billion (actually, 2^32, or 4,294,967,296)?

According to the Wikipedia article a coarse approximation, which is good enough for our purposes, is:

    p = 1 - e^(-n^2/(2 * 2^32))

We can wrap that up in a neat little method:

    static double DuplicateProb(long fieldSize, long nItems)
    {
        double minusNSquared = -(nItems * nItems);
        return 1 - Math.Exp(minusNSquared / (2 * fieldSize));
    }

A few tests will tell us if it’s working:

    DuplicateProb(365, 10) = 0.128017828988458
    DuplicateProb(365, 20) = 0.421863457482716
    DuplicateProb(365, 23) = 0.515509538061517
    DuplicateProb(365, 30) = 0.708547055710068
    DuplicateProb(365, 50) = 0.967439570087906
    DuplicateProb(365, 57) = 0.988329429309313
    DuplicateProb(365, 100) = 0.999998876014983

Those numbers don’t agree exactly with the calculated probabilities in the Wikipedia article, but they’re close–within the range of what you’d expect for a coarse approximation.

So, then, about our one million items selected from a field of four billion:

    DuplicateProb(4294967296, 1000000) = 1

What?!?!

How can that be possible? The probability of getting a duplicate within the first million items is so close to certainty that the computer calls it 1.0? This has to be a mistake, right?

It happens to be true. Unfortunately, it’s a bit difficult to prove with the birthday test program because Random.Next() only has a range of 0..2^31. I could modify the program to extend its range, but that’s really not necessary. Instead, I’ve run the program for increasing powers of two and present the results here:

Field Size Min Max Average
2^16 4 1,081 316
2^20 21 4,390 1,281
2^30 342 142,952 41,498
2^31 1,036 193,529 57,606

These are the results of 10,000 runs each. Doing 10 million runs of the 2^16 took quite a while, and the numbers weren’t much different. The only number that changed significantly was the Max value: the maximum number of items selected before a duplicate was found.

The table is more interesting when you replace the Min, Max, and Average values with their corresponding powers of two:

Field Size Min Max Average
16 4 10.08 8.30
20 4.39 12.10 10.32
30 8.42 17.13 15.34
31 10.02 17.56 15.81

For every field size, the average number of items selected before a collision is approximately one half the power of two. Or, just a little more than the square root. That holds true for the birthday problem, too. The square root of 365 is about 19.1, and the 50% probability point is 23.

A really quick way to compute your 50% probability point would be to take the square root. That’s a bit aggressive, but it’ll put you in the ballpark.

More interesting is the Max value in the table. If you think in powers of two, then the Max value is approximately two bits more than the 50% value. That is, 2^30 is a billion items. The square root is 2^15, or 32,768. Two more bits is 2^17, or 128K (131,072). The Max value for 2^30, above, is 142,952: close enough to 2^17 for me.

Given the above, what do you think the numbers for 2^32 would be? The average will be around 83,000, and the Max will be close to 300,000.

That’s right, by selecting only 300,000 items from a field of four billion, you’re almost guaranteed to find a duplicate. You have a 1% probability of collision with just 6,000 items, and a 10% probability of collision with 21,000 items selected.

It’s true that hash codes are not purely random, but a good hash function will give you very close to a uniform distribution. The problem is worse if the hash codes aren’t evenly distributed.

In case you haven’t got the point yet, don’t try to use a hash code as a unique identifier. The likelihood of getting a duplicate key is just too high.

I know what you’re thinking: what about making a 64 bit hash code? Before you get too excited about the idea, take a look at the Probability Table from that Wikipedia article. With a 64-bit hash code, you have one chance in a million of getting a duplicate when you’re indexing six million items. For one million items, it’d be about one chance in 10 million. That sounds pretty unlikely, but it really isn’t when you consider how often some programs run.

Granted, if you were only indexing a million items, you’d probably use something like a HashTable. It’s when you get into very large numbers of items that you begin worrying about the overhead imposed by your indexing method. Using a hash code sounds like a good option, until you discover the likelihood of generating a duplicate. Even with a 64-bit hash code, you have a 1% chance of generating a duplicate with only 610 million items.

Hash codes are great for quickly determining if two items can’t possibly match. That is, if two strings hash to different values, then they positively are not the same. However, hash codes are not a good solution for uniquely identifying items, because the probability of two different items hashing to the same value is surprisingly high.

Categories

A sample text widget

Etiam pulvinar consectetur dolor sed malesuada. Ut convallis euismod dolor nec pretium. Nunc ut tristique massa.

Nam sodales mi vitae dolor ullamcorper et vulputate enim accumsan. Morbi orci magna, tincidunt vitae molestie nec, molestie at mi. Nulla nulla lorem, suscipit in posuere in, interdum non magna.