C# note: Comparer versus Comparison

The Array.Sort method has a generic overload, Sort<T>(T[], Comparison<T>), that sorts the array using the passed comparison delegate to do item comparisons.

Array.Sort has another overload, Sort<T>(T[], IComparer<T>), that accepts a reference to an object that implements the IComparer<T> interface.

There are some non-generic array sorting functions, too. I won’t be covering those.

To call the first sorting function shown above, you have to create a Comparison<T> delegate. That’s easy enough:

// Create a descending item comparer
var descendingComparison =
    new Comparison<int>((i1,i2) => i2.CompareTo(i1));

For the second one, you need a reference to an object that implements the IComparer<T> interface. It’s possible to create an IComparer<T> from a Comparison<T>, like this:

var descendingComparer =
    Comparer<int>.Create(descendingComparison);

Given the above, the code below shows how to call those functions.

    var a = new[] { 5, 5, 6, 9, 5, 3, 7, 7, 1, 5 };

    // Sort passing an IComparer<T>
    Array.Sort(a, descendingComparer);

    // Sort, passing Comparison<T>
    Array.Sort(a, descendingComparison);

Those two function calls do the same thing.

Array.Sort also has generic overloads that let you sort a portion of an array. That’s a pretty common thing to do, so it’s a nice capability to have. But only one of those overloads takes a custom comparer: Sort<T>(T[], Int32, Int32, IComparer<T>).

So if I’m sorting the entire array, I can pass a Comparison<T> or an IComparer<T>. But if I want to sort only part of the array, I have to pass an IComparer<T>?

That’s just idiotic! The existence of the overload that accepts a Comparison<T> to sort the entire array creates the expectation of a method that accepts a Comparison<T> to sort part of the array. That such a function doesn’t exist is, in my mind, an interface design error.

It might be that the overload that takes a Comparison<T> exists solely to support something in LINQ. The concept of sorting part of an array doesn’t really make sense in LINQ, so LINQ wouldn’t require a range sorting function. But the expectation is there on the human side, and the IDE’s error message when you write code to call the non-existent function is not at all helpful. When presented with this line:

Array.Sort(a, 0, 12, descendingComparison);

Visual Studio non-helpfully tells me:

Error (active) CS1503 Argument 2: cannot convert from 'int' to 'System.Array?'

Error (active) CS1503 Argument 4: cannot convert from 'System.Comparison' to 'int'

I guess figuring out which Sort method I’m trying to call is difficult. Overload resolution is hard enough when all the parameters match up. When one of the parameters is incorrect and there are multiple overloads that accept four parameters, the compiler has to guess at which one I was trying to call. That sounds like a hard problem.

If you’re unfamiliar with the difference between Comparison<T> and IComparer<T>, this is going to frustrate you.

Fortunately, the fix is easy. I showed above how to create an IComparer<T> from a Comparison<T>:

var descendingComparer =
    Comparer<int>.Create(descendingComparison);

Or, if you don’t want to declare a new variable for it:

Array.Sort(a, 0, a.Length/2, Comparer<int>.Create(descendingComparison));
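If you make that call in more than one place, you could paper over the missing overload with a small extension method. This is only a sketch, and the class and method names are mine, not part of the Framework:

    using System;
    using System.Collections.Generic;

    public static class ArraySortExtensions
    {
        // Wrap the Comparison<T> in an IComparer<T> so the range
        // overload of Array.Sort can be used.
        public static void SortRange<T>(this T[] array, int index, int length,
            Comparison<T> comparison)
        {
            Array.Sort(array, index, length, Comparer<T>.Create(comparison));
        }
    }

With that in place, a.SortRange(0, a.Length/2, descendingComparison) reads the same way the full-array overload does.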

You can go the other way, too, if you want. That is, to obtain a Comparison<T> from an IComparer<T>:

Comparison<int> descendingComparison = descendingComparer.Compare;

It looks to me as though Comparison<T> is not widely used in the .NET Framework. The only place I’ve seen it is in overloads of the Array.Sort and List<T>.Sort methods. IComparer<T>, on the other hand, is used all over the place. In my code, I make an effort to keep an IComparer<T> around and create a Comparison<T> from it in the few cases where I need one.


Don’t aggravate your users

In my career I’ve worked with and mentored many less experienced programmers. One thing I stress when reviewing their UI designs with them is data validation. Anybody who’s worked with me for more than a week has probably heard me say, “The best way to handle bad data is to prevent it from entering the system.”

It seems obvious, but I actually had to work on a system in which date fields were accepted as input strings and then validated at each use. There was a whole mess of code involved in recovering from invalid date fields. But that’s a story for another time.

The typical way to prevent bad data from entering the system is to validate it on input and reject whatever isn’t valid. If the user enters “ham and cheese” in the Birthday field, don’t accept it. Tell the user to enter a valid date.
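In code, that first line of defense is usually a one-line parse-and-check. Here is a minimal C# sketch; the class and method names, and the MM/DD/YYYY format, are my own assumptions, not from any particular framework:

    using System;
    using System.Globalization;

    static class BirthdayValidator
    {
        // Returns true only if the input is a real calendar date in
        // MM/DD/YYYY form; "ham and cheese" and "99/99/9999" both fail.
        public static bool TryParseBirthday(string input, out DateTime birthday)
        {
            return DateTime.TryParseExact(input, "MM/dd/yyyy",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out birthday);
        }
    }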

Some programmers take this too far and try to prevent the user from entering potentially bad data. In a date input field, for example, they’ll display a prompt showing the format “MM/DD/YYYY”.

The input control is customized so that the user can enter only numbers. The user’s input overwrites the letters, and the cursor is never placed on a slash. The control actually helps the user enter a date in the correct format, but it’s still possible to enter an invalid date. That is, nothing in the control itself prevents the user from entering a date of 99/99/9999. The format of the data is correct, but the value is not a valid date. That’s okay: the value is checked when the user exits the field or submits the form.

Every programmer, either of his own volition or because he was told to by a misguided UI designer, will eventually try to prevent the user from entering a bad date. And that’s when things get ugly.

For example, the user enters the date “10/13/2024”. Then she realizes that she made a mistake. That was supposed to be “10/31/2024”. So she hits the back arrow key and positions the cursor under the ‘1’ in “13”. She’s going to type “31” to replace “13”. But the control won’t let her do that because if she enters ‘3’ in that position it will result in “10/33/2024”. Clearly an invalid date and not to be allowed.

So now we’re in a situation where the user wants to do something completely reasonable (type “31” over the “13”), but the control won’t allow it. She has to jump through hoops to fix her error.

Control customization that actually helps the user enter valid data is A Good Thing. But if the control makes it difficult to enter valid data, it’s gone too far. Users are happy to correct fields in which they’ve entered bad data. They are not at all amused by controls that punish them for making common mistakes. That kind of thing is what I call “actively user-hostile.” It’ll make people hate your application.

I’ve heard programmers argue that it’s easier to let the control constrain the input than to validate all the fields when the form is submitted. It’s certainly not easier for the user! And I don’t think it’s any easier for the programmer, either. Forms quite often have interdependent fields that can’t easily be validated individually. The only reasonable time to validate is when the form data is submitted. Custom data input controls can’t change that.

Standard UI controls work well and modern UI frameworks support validate-on-submission. Everything’s there to make the process easy for the programmer and painless for the user. There’s nothing to be gained by attempting to prevent the user from making a mistake. It’s much easier and more user-friendly to let the user submit the form, validate all the data at once, and then prompt the user to fix the fields that have errors. Don’t waste your time on custom controls that do nothing but aggravate your users.

Re-thinking the retry

A common pattern used when communicating with external services is to retry a call that fails. Stripped of bells and whistles, the typical retry loop looks like this:

    result = makeServiceCall(parameters)
    numRetries = 0
    while !result.success && numRetries < MAX_RETRIES
        // insert delay before retry
        ++numRetries
        result = makeServiceCall(parameters)
    end while

    if (!result.success)
        // Raise error. 

We can quibble about the specific implementation, but the general idea is the same: make multiple attempts to get a success response from the service, inserting a delay between calls. The delay can be a fixed amount of time, an exponential backoff with jitter, etc., and you can include all kinds of other logic to improve things, but it all boils down to essentially the same thing.
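For concreteness, a retry helper with exponential backoff and jitter might look something like the sketch below. The delegates, names, and constants are placeholders, not a recommendation:

    using System;
    using System.Threading.Tasks;

    static class RetryHelper
    {
        private static readonly Random Rng = new Random();

        // Call the operation, retrying up to maxRetries times on failure,
        // doubling the delay each time and adding up to a second of jitter.
        public static async Task<TResult> CallWithRetryAsync<TResult>(
            Func<Task<TResult>> operation,
            Func<TResult, bool> succeeded,
            int maxRetries,
            TimeSpan baseDelay)
        {
            var result = await operation();
            for (int retry = 0; !succeeded(result) && retry < maxRetries; ++retry)
            {
                var delay = TimeSpan.FromMilliseconds(
                    baseDelay.TotalMilliseconds * Math.Pow(2, retry) + Rng.Next(1000));
                await Task.Delay(delay);
                result = await operation();
            }
            return result;
        }
    }

The helper still boils down to the same loop: call, wait, call again.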

Under normal circumstances, of course, service calls don’t fail, and this logic works very well. But if the service call failure rate increases, some very bad things can happen.

Imagine a critical service for which you have a contract (a service level agreement, or SLA) that guarantees an average 10 ms response time for up to 1,000 transactions per second (TPS). Remember, though, that this is a shared service. There are many other clients who have similar contracts: 10 ms response time for X TPS. Your application calls that service perhaps 900 times per second, on average. There will be brief periods when your call rate will exceed 1,000 TPS, but that’s okay because the service is scaled to handle large amounts of traffic from many clients. Say the service can guarantee that 10 ms response time for a total of 1,000,000 TPS from all of its clients, combined. Short-term bursts of excess traffic from a few clients aren’t a problem.

Even if calls to the service exceed 1,000,000 TPS, the likely result at first will be increased response time: perhaps latency increases by 50% with a sustained traffic increase of 10%, and doubles when traffic is 20% above the configured maximum. The specific breaking point differs for every service, but in general latency increases non-linearly with the call rate.

Clients, of course, won’t wait forever for a response. They typically configure a timeout (often two or three times the SLA), and consider the call a failure if it takes longer than that. Not a problem with this retry logic: just delay a bit and try again.

As I said above, this kind of thing works fine under normal conditions. But in a large system, lots of things can go wrong.

Imagine what would happen if the service starts getting 1,500,000 requests per second: a sustained 50% increase in traffic. Or one of the service’s dependencies can’t meet its SLA. Or network congestion increases the error rate. Whatever the cause of the service’s distress, its failure rate climbs, or its latency grows beyond the timeout value set by clients, and your application blindly responds by sending another request. So if your MAX_RETRIES value is two, you’ve effectively tripled the number of calls you make to the service.

The last thing a service under distress needs is more requests. Even if your application is not experiencing increased traffic, your retries still have a negative effect on the service.

Some argue that services should protect themselves from such traffic storms. And many do. But that protection isn’t free. There comes a point when the service is spending so much time telling clients to go away that it can’t spend any time clearing its queue. Not that clearing the queue is much help. Even after the initial problem is fixed, the service is swamped with requests from all those clients who keep blindly retrying. It’s a positive feedback loop that won’t clear until the clients stop calling.

The retry loop above might improve your application’s reliability in normal operation. I say “might” because most applications I’ve encountered don’t actually track the number of retries, so they have no idea if the retry logic even works. I’ve seen the following in production code:

  1. A retry loop that always made the maximum number of retries, even if the initial call succeeded.
  2. Retry logic that never retried. That code was in production for two years before anybody realized there was a problem. Why? Because the service had never failed before.
  3. Retry logic that actually did the retries but then returned the result of the first call.
  4. Infinite retry. When a non-critical service went down one day, the entire site became inoperable.

As bad as it is that many programmers apparently don’t test their retry logic, fewer still monitor it. Of all the applications I’ve seen with retry logic, only a handful can tell me how effective it is. If you want to know whether your retry logic is working, you have to log:

  • The number of initial calls to the service.
  • The number of initial call failures.
  • The total number of calls to the service (including retries).
  • The number of call successes (including success after retry).
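A minimal sketch of those counters follows; the class is my own, and you’d still need some way to publish or log it periodically:

    using System.Threading;

    class RetryMetrics
    {
        private long initialCalls;
        private long initialFailures;
        private long totalCalls;
        private long successes;

        public void RecordInitialCall() => Interlocked.Increment(ref initialCalls);
        public void RecordInitialFailure() => Interlocked.Increment(ref initialFailures);
        public void RecordCall() => Interlocked.Increment(ref totalCalls);
        public void RecordSuccess() => Interlocked.Increment(ref successes);

        // Emit the counters so you can see whether retries ever turn an
        // initial failure into a success.
        public override string ToString() =>
            $"initialCalls={initialCalls}, initialFailures={initialFailures}, " +
            $"totalCalls={totalCalls}, successes={successes}";
    }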

From those numbers, you can determine the effectiveness of the retry logic. In my experience, the percentage of initial call failures to any service under normal operation is less than 1%, and retry succeeds in fewer than 50% of those cases. When a service is under distress and the initial failure percentage gets above about 10%, retry is almost never successful. The reason, I think, is that whatever condition caused the outage hasn’t cleared before the last retry: service outages last longer than clients are willing to wait.

For the majority of applications I’ve encountered, retry is rarely worth the effort it takes to design, implement, debug, test, and monitor. Under normal circumstances it’s almost irrelevant, maybe making the difference between 99% and 99.5% success rate. In unusual circumstances, it increases the load on the underlying service, and almost never results in a successful call. It’s a small win where it doesn’t matter, and a huge liability when it does matter.

If you have existing retry logic in your code, I encourage you to monitor its effectiveness. If, like me, you discover that it rarely provides benefit, I suggest you remove it.

If you’re considering adding retry logic to your code, be sure to consider the potential consequences. And add the monitoring up front.

Setting JVM memory limits

Java has an option, -XX:MaxRAMFraction=XXX, that lets developers specify the maximum amount of available RAM a Java application should occupy. This sounds like a great idea, until you realize how it’s implemented. The number that you specify for “XXX” is an integer that represents the fraction of total RAM that should be used. The interpretation is 1/XXX. So -XX:MaxRAMFraction=1 tells the application to use as much RAM as it wants. -XX:MaxRAMFraction=2 tells it to use half of the memory. What if you want to use 75% of the memory? Or perhaps 2/3 of the memory? Tough luck. You get all, one-half, one-third, one-fourth, one-fifth, etc. You want to use 40%? Sorry, buddy. You can get 50% or 33%. Take your pick.

When I first heard this, I thought the guy who was explaining it was misinformed or pulling my leg.

I’d sure like to understand the thinking that went into this design decision, because from where I’m standing it looks like lunacy. I can’t imagine how such an idiotic design could make it through review and implementation without somebody with a modicum of intelligence and authority raising a stink. You mean nobody realized how stupid this is?

Java 10 introduced the -XX:MaxRAMPercentage and associated flags. See https://bugs.openjdk.java.net/browse/JDK-8186315 for more information.
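With the newer flag you can finally say what you mean; for example, something like this (the jar name is just a placeholder):

    java -XX:MaxRAMPercentage=75.0 -jar myapp.jar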

Sanity prevails, but I still want to understand how the original design got approved. That something so obviously wrong managed to make it through the approval process and was released on an unsuspecting world doesn’t give me much confidence in the competence of Java platform developers.

Use your data types!

Imagine there’s a program that makes API calls through a proxy to another service. One of the parameters you pass to that proxy is a timeout value: the maximum time that the proxy should wait before giving up and returning with a timeout error. The function call would be something like:

    result = callProxy("SomeService", "ServiceFunction", {parameters}, timeout);

The timeout value is defined in code as:

    double timeout = 10;

And that code works for months or years until some programmer is in there editing things one day and says to himself, “A timeout of 10 milliseconds? That’s crazy. We should give the service at least a full second to respond.” He promptly changes the timeout value to 1000 and goes on his merry way. All the unit tests pass, the code is pushed, and continues to work for more months or years.

Until it doesn’t.

One fateful morning at about 02:00, you’re shocked out of bed by your pager screaming that there’s a problem. Bleary eyed, you stumble to the computer, log in, and find that your page has stopped serving requests. After a long search, you discover that “SomeServer” is experiencing high CPU usage, but you can’t figure out why that would make your program stop serving requests. After all, your program should be getting a timeout error after a second, and the error handling code looks like it works.

At some point you discover the problem: timeout is actually expressed in seconds, not milliseconds. So rather than waiting 10 seconds for a timeout, it’s waiting 16 minutes and 40 seconds.

Yes, something like this actually has happened. It wasn’t pretty.

Granted, quite a few things went wrong here, among them:

  • There was no comment describing the units that the timeout represents.
  • The timeout variable should have been named timeoutInSeconds.
  • The programmer should have looked more closely at the code before changing it.
  • The code reviewer should have caught the error.
  • The automated tests should have simulated a service timeout.

The error should have been caught. But it wasn’t.

A much more effective way to avoid an error like that is to use descriptive data types. If you’re expressing a time span, then don’t use a unit-less value like an integer or a floating point number. Use a data type intended for expressing time spans. In C#, for example, you’d use the TimeSpan type, like this:

    TimeSpan timeout = TimeSpan.FromSeconds(10);

It would be very difficult for a programmer to look at that and think, “10 milliseconds isn’t long enough.” And there’s absolutely no need for a comment here because the code says exactly what it does: it creates a timeout value of 10 seconds. It’s nearly impossible to screw this up. If on the off chance some drunken coder did screw that up, the person doing the code review almost certainly wouldn’t.
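You can push the same idea into the API itself. If the proxy call accepted a TimeSpan rather than a bare number, the units question could never come up. A sketch with a hypothetical signature (Proxy, CallProxy, and Result are stand-ins for the example above, not a real library):

    using System;
    using System.Collections.Generic;

    class Result { public bool Success; }

    static class Proxy
    {
        // Because the timeout is a TimeSpan, the caller states the units
        // explicitly and can't confuse seconds with milliseconds.
        public static Result CallProxy(string service, string function,
            IDictionary<string, object> parameters, TimeSpan timeout)
        {
            // ... make the call, giving up after 'timeout' ...
            throw new NotImplementedException();
        }
    }

The call site then reads: CallProxy("SomeService", "ServiceFunction", parameters, TimeSpan.FromSeconds(10)).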

These types of errors happen more often than you might like to think. Many of us don’t see these because we work in small companies on relatively new code, where the current employees are also the people who wrote the original code. Things are different when you work at a large company where people move around a lot, and the code you’re working on has been there for 10 years and the programmers who wrote it are off doing other things, or have left the company.

Do yourself and the people who will maintain your code in the future a favor. Use descriptive data types just like you use descriptive variable names. Unit-less values are errors waiting to happen. And I guarantee you don’t want to be the person who made this type of error when that error happens at 2 AM on a large retail web site, where every minute of down time means millions of dollars in lost sales. Nor do you want to be the person who has to track down such an error in those circumstances.

Birthdays, random numbers, and hash keys

This is a re-post of an article I wrote for my .NET Reference Guide column some years ago.

You’re probably familiar with the Birthday problem, either from your introductory statistics class or from hanging around computer programmers for too long. Briefly, the math shows that the odds of any two people in a group having the same month and day of birth are quite a bit higher than you would intuitively expect. How high? Given a group of just 23 people, the odds are better than 50% that two of those people will share the same birthday. With 50 people in a group, the odds of two sharing the same birthday are above 97%, and by the time you have 60 people, it’s nearly a sure thing.
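If you’d rather compute the probability directly than take my word for it, the standard formula is easy enough to code up; a quick sketch (the simulation below makes the same point empirically):

    static double BirthdayCollisionProbability(int days, int people)
    {
        // Probability that at least two of 'people' share a birthday:
        // 1 - (days/days) * ((days-1)/days) * ... * ((days-people+1)/days)
        double pNoMatch = 1.0;
        for (int i = 0; i < people; ++i)
        {
            pNoMatch *= (double)(days - i) / days;
        }
        return 1.0 - pNoMatch;
    }

    // BirthdayCollisionProbability(365, 23) ≈ 0.507
    // BirthdayCollisionProbability(365, 50) ≈ 0.970
    // BirthdayCollisionProbability(365, 60) ≈ 0.994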

This falls into the realm of what I call, “Math that just don’t feel right.” Nevertheless, it’s true. We can demonstrate it with a simple program that selects random numbers in the range 0 to 364 until it draws a duplicate.

    class Program
    {
        const int NumKeys = 365;
        const int Thousand = 1000;
        const int Million = Thousand * Thousand;
        const int NumTries = 10 * Million;
        static void Main(string[] args)
        {
            // List contains number of items to collision for each iteration.
            List<int> Totals = new List<int>();
            // Do the test NumTries times, saving the result from each attempt.
            for (int i = 0; i < NumTries; ++i)
            {
                int n = DoTest(NumKeys);
                Totals.Add(n);
                if ((i % 100000) == 0)
                {
                    Console.WriteLine("{0,5:N0} - {1:N0}", i, n);
                }
            }
            // Group the items by their result.
            var grouped = from cnt in Totals
                        group cnt by cnt into newGroup
                        orderby newGroup.Key
                        select newGroup;
            // Output a table of results and counts, with running percentage.
            int runningTotal = 0;
            foreach (var grp in grouped)
            {
                runningTotal += grp.Count();
                Console.WriteLine("{0,3:N0}: {1,7:N0} ({2,8:P2})  ({3,8:P6})", grp.Key, grp.Count(),
                    (double)grp.Count() / NumTries, (double)runningTotal / NumTries);
            }
            Console.WriteLine("Min items to collision = {0:N0}", Totals.Min());
            Console.WriteLine("Max items to collision = {0:N0}", Totals.Max());
            Console.WriteLine("Average items to collision = {0:N0}", Totals.Average());
            Console.ReadLine();
        }
        // Random number generator used for all attempts.
        static Random rnd = new Random();
        static int DoTest(int nkey)
        {
            HashSet<int> keys = new HashSet<int>();
            // Pick random numbers in the range [0..nkey-1]
            // until a duplicate is selected.
            int key;
            do
            {
                key = rnd.Next(nkey);
            } while (keys.Add(key));
            return keys.Count;
        }
    }

The program isn’t as complicated as it looks. I structured it so that you can run the test many times and then aggregate the results. In this particular case, I run 10 million iterations.

The result of each iteration (the count of random numbers selected before a duplicate was found) is saved in the Totals list. When all iterations are done, the code groups the results so that we know how many times each result occurred. Finally, the grouped results are output with a running percentage. Here’s the output from a sample run:

    1:  27,322 (  0.27 %)  (0.273220 %)
    2:  54,962 (  0.55 %)  (0.822840 %)
    3:  81,735 (  0.82 %)  (1.640190 %)
    4: 107,442 (  1.07 %)  (2.714610 %)
    5: 132,999 (  1.33 %)  (4.044600 %)
    6: 157,841 (  1.58 %)  (5.623010 %)
    7: 181,920 (  1.82 %)  (7.442210 %)
    8: 203,602 (  2.04 %)  (9.478230 %)
    9: 223,787 (  2.24 %)  (11.716100 %)
    10: 241,462 (  2.41 %)  (14.130720 %)
    11: 258,365 (  2.58 %)  (16.714370 %)
    12: 274,581 (  2.75 %)  (19.460180 %)
    13: 286,809 (  2.87 %)  (22.328270 %)
    14: 298,008 (  2.98 %)  (25.308350 %)
    15: 306,072 (  3.06 %)  (28.369070 %)
    16: 314,766 (  3.15 %)  (31.516730 %)
    17: 318,592 (  3.19 %)  (34.702650 %)
    18: 323,361 (  3.23 %)  (37.936260 %)
    19: 323,649 (  3.24 %)  (41.172750 %)
    20: 322,924 (  3.23 %)  (44.401990 %)
    21: 319,847 (  3.20 %)  (47.600460 %)
    22: 316,430 (  3.16 %)  (50.764760 %)
    23: 310,096 (  3.10 %)  (53.865720 %)
    24: 303,122 (  3.03 %)  (56.896940 %)
    25: 295,341 (  2.95 %)  (59.850350 %)
    26: 285,219 (  2.85 %)  (62.702540 %)
    27: 276,433 (  2.76 %)  (65.466870 %)
    28: 265,044 (  2.65 %)  (68.117310 %)
    29: 253,646 (  2.54 %)  (70.653770 %)
    30: 241,188 (  2.41 %)  (73.065650 %)
    31: 228,559 (  2.29 %)  (75.351240 %)
    32: 216,584 (  2.17 %)  (77.517080 %)
    33: 203,608 (  2.04 %)  (79.553160 %)
    34: 190,222 (  1.90 %)  (81.455380 %)
    35: 177,870 (  1.78 %)  (83.234080 %)
    36: 165,091 (  1.65 %)  (84.884990 %)
    37: 153,658 (  1.54 %)  (86.421570 %)
    38: 141,501 (  1.42 %)  (87.836580 %)
    39: 129,984 (  1.30 %)  (89.136420 %)
    40: 118,849 (  1.19 %)  (90.324910 %)
    41: 108,598 (  1.09 %)  (91.410890 %)
    42:  98,732 (  0.99 %)  (92.398210 %)
    43:  89,560 (  0.90 %)  (93.293810 %)
    44:  80,692 (  0.81 %)  (94.100730 %)
    45:  72,640 (  0.73 %)  (94.827130 %)
    46:  65,559 (  0.66 %)  (95.482720 %)
    47:  58,525 (  0.59 %)  (96.067970 %)
    48:  51,877 (  0.52 %)  (96.586740 %)
    49:  46,329 (  0.46 %)  (97.050030 %)
    50:  40,333 (  0.40 %)  (97.453360 %)
    51:  35,917 (  0.36 %)  (97.812530 %)
    52:  31,270 (  0.31 %)  (98.125230 %)
    53:  27,409 (  0.27 %)  (98.399320 %)
    54:  23,349 (  0.23 %)  (98.632810 %)
    55:  20,482 (  0.20 %)  (98.837630 %)
    56:  17,441 (  0.17 %)  (99.012040 %)
    57:  15,390 (  0.15 %)  (99.165940 %)
    58:  13,378 (  0.13 %)  (99.299720 %)
    59:  11,302 (  0.11 %)  (99.412740 %)
    60:   9,638 (  0.10 %)  (99.509120 %)
    61:   8,199 (  0.08 %)  (99.591110 %)
    62:   6,851 (  0.07 %)  (99.659620 %)
    63:   5,876 (  0.06 %)  (99.718380 %)
    64:   4,843 (  0.05 %)  (99.766810 %)
    65:   4,077 (  0.04 %)  (99.807580 %)
    66:   3,433 (  0.03 %)  (99.841910 %)
    67:   3,098 (  0.03 %)  (99.872890 %)
    68:   2,388 (  0.02 %)  (99.896770 %)
    69:   1,903 (  0.02 %)  (99.915800 %)
    70:   1,541 (  0.02 %)  (99.931210 %)
    71:   1,308 (  0.01 %)  (99.944290 %)
    72:   1,210 (  0.01 %)  (99.956390 %)
    73:     893 (  0.01 %)  (99.965320 %)
    74:     666 (  0.01 %)  (99.971980 %)
    75:     588 (  0.01 %)  (99.977860 %)
    76:     463 (  0.00 %)  (99.982490 %)
    77:     369 (  0.00 %)  (99.986180 %)
    78:     290 (  0.00 %)  (99.989080 %)
    79:     235 (  0.00 %)  (99.991430 %)
    80:     185 (  0.00 %)  (99.993280 %)
    81:     161 (  0.00 %)  (99.994890 %)
    82:     102 (  0.00 %)  (99.995910 %)
    83:     103 (  0.00 %)  (99.996940 %)
    84:      59 (  0.00 %)  (99.997530 %)
    85:      55 (  0.00 %)  (99.998080 %)
    86:      48 (  0.00 %)  (99.998560 %)
    87:      40 (  0.00 %)  (99.998960 %)
    88:      29 (  0.00 %)  (99.999250 %)
    89:      16 (  0.00 %)  (99.999410 %)
    90:      15 (  0.00 %)  (99.999560 %)
    91:      11 (  0.00 %)  (99.999670 %)
    92:       8 (  0.00 %)  (99.999750 %)
    93:       4 (  0.00 %)  (99.999790 %)
    94:       3 (  0.00 %)  (99.999820 %)
    95:       7 (  0.00 %)  (99.999890 %)
    96:       3 (  0.00 %)  (99.999920 %)
    97:       2 (  0.00 %)  (99.999940 %)
    98:       1 (  0.00 %)  (99.999950 %)
    99:       1 (  0.00 %)  (99.999960 %)
    100:       1 (  0.00 %)  (99.999970 %)
    101:       2 (  0.00 %)  (99.999990 %)
    102:       1 (  0.00 %)  (100.000000 %)
    Min items to collision = 1
    Max items to collision = 102
    Average items to collision = 24

This result agrees with the math. It shows some other interesting things, too.

Look at how many times the second number selected was a duplicate of the first: 0.27%. And in more than one percent of the iterations, the fifth number selected was a duplicate! All told, the odds of finding a duplicate pass 10% with just 10 numbers selected.

You can argue that birthdays aren’t exactly random, and you’d be right. Some days have more births than others. Although those real world differences are significant, they only affect the birthday problem by a few points. Maybe the break even point is really 25 rather than 23.

You can also argue that the numbers returned by System.Random aren’t truly random. Again, you’d be right. You could code up a better random number generator that uses the cryptographic services to generate numbers that are more nearly random, but you’ll find that it won’t have much of an effect on the results.

So what do birthdays have to do with hash keys?

Let’s say you’re writing a program that has to keep track of a few million strings and you want a quick way to index them and eliminate duplicates. A fast and easy solution is to use hash codes as keys. Right? After all, there are four billion different 32-bit hash codes, and you’re just going to use a few million.

You can probably see where this is going. What’s the likelihood that you’ll run into a duplicate? That’s right, the birthday problem raises its ugly head here. Let’s look at the math and see what it tells us.

What is the probability of getting a collision when selecting one million items out of a pool of four billion (actually, 2^32, or 4,294,967,296)?

According to the Wikipedia article, a coarse approximation, which is good enough for our purposes, is:

    p = 1 - e^(-n^2/(2 * 2^32))

We can wrap that up in a neat little method:

    static double DuplicateProb(long fieldSize, long nItems)
    {
        double minusNSquared = -(nItems * nItems);
        return 1 - Math.Exp(minusNSquared / (2 * fieldSize));
    }

A few tests will tell us if it’s working:

    DuplicateProb(365, 10) = 0.128017828988458
    DuplicateProb(365, 20) = 0.421863457482716
    DuplicateProb(365, 23) = 0.515509538061517
    DuplicateProb(365, 30) = 0.708547055710068
    DuplicateProb(365, 50) = 0.967439570087906
    DuplicateProb(365, 57) = 0.988329429309313
    DuplicateProb(365, 100) = 0.999998876014983

Those numbers don’t agree exactly with the calculated probabilities in the Wikipedia article, but they’re close, within the range of what you’d expect for a coarse approximation.

So, then, about our one million items selected from a field of four billion:

    DuplicateProb(4294967296, 1000000) = 1

What?!?!

How can that be possible? The probability of getting a duplicate within the first million items is so close to certainty that the computer calls it 1.0? This has to be a mistake, right?

It happens to be true. Unfortunately, it’s a bit difficult to prove with the birthday test program because Random.Next() only has a range of 0..2^31. I could modify the program to extend its range, but that’s really not necessary. Instead, I’ve run the program for increasing powers of two and present the results here:

    Field Size        Min        Max    Average
    2^16                4      1,081        316
    2^20               21      4,390      1,281
    2^30              342    142,952     41,498
    2^31            1,036    193,529     57,606

These are the results of 10,000 runs each. Doing 10 million runs of the 2^16 case took quite a while, and the numbers weren’t much different. The only number that changed significantly was the Max value: the maximum number of items selected before a duplicate was found.

The table is more interesting when you replace the Min, Max, and Average values with their corresponding powers of two:

    Field Size        Min        Max    Average
    16                  4      10.08       8.30
    20               4.39      12.10      10.32
    30               8.42      17.13      15.34
    31              10.02      17.56      15.81

For every field size, the average number of items selected before a collision is approximately one half the power of two. Or, just a little more than the square root. That holds true for the birthday problem, too. The square root of 365 is about 19.1, and the 50% probability point is 23.

A really quick way to compute your 50% probability point would be to take the square root. That’s a bit aggressive, but it’ll put you in the ballpark.

More interesting is the Max value in the table. If you think in powers of two, then the Max value is approximately two bits more than the 50% value. That is, 2^30 is a billion items. The square root is 2^15, or 32,768. Two more bits is 2^17, or 128K (131,072). The Max value for 2^30, above, is 142,952: close enough to 2^17 for me.

Given the above, what do you think the numbers for 2^32 would be? The average will be around 83,000, and the Max will be close to 300,000.

That’s right, by selecting only 300,000 items from a field of four billion, you’re almost guaranteed to find a duplicate. You pass a 1% probability of collision with roughly 9,300 items, and a 10% probability of collision with about 30,000 items selected.
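You can also turn the approximation around and ask how many items it takes to reach a given collision probability. Here’s a small companion to the DuplicateProb method above (my own helper, using the same coarse approximation):

    static double ItemsForProbability(long fieldSize, double probability)
    {
        // Invert p = 1 - e^(-n^2/(2 * fieldSize)) to solve for n.
        return Math.Sqrt(-2.0 * fieldSize * Math.Log(1.0 - probability));
    }

    // ItemsForProbability(4294967296, 0.01) ≈ 9,300
    // ItemsForProbability(4294967296, 0.10) ≈ 30,100
    // ItemsForProbability(4294967296, 0.50) ≈ 77,200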

It’s true that hash codes are not purely random, but a good hash function will give you very close to a uniform distribution. The problem is worse if the hash codes aren’t evenly distributed.

In case you haven’t got the point yet, don’t try to use a hash code as a unique identifier. The likelihood of getting a duplicate key is just too high.

I know what you’re thinking: what about making a 64-bit hash code? Before you get too excited about the idea, take a look at the Probability Table from that Wikipedia article. With a 64-bit hash code, you have about one chance in a million of getting a duplicate when you’re indexing six million items. For one million items, it’d be roughly one chance in 40 million. That sounds pretty unlikely, but it really isn’t when you consider how much data some programs index.

Granted, if you were only indexing a million items, you’d probably use something like a hash table to index them. It’s when you get into very large numbers of items that you begin worrying about the overhead imposed by your indexing method. Using a hash code sounds like a good option, until you discover the likelihood of generating a duplicate. Even with a 64-bit hash code, you have a 1% chance of generating a duplicate with only 610 million items.

Hash codes are great for quickly determining if two items can’t possibly match. That is, if two strings hash to different values, then they positively are not the same. However, hash codes are not a good solution for uniquely identifying items, because the probability of two different items hashing to the same value is surprisingly high.

How to confuse a programmer

Computer programming is complicated enough that we really don’t need to look for ways to further confuse programmers. On the contrary, we should actively look for ways to increase programmer understanding. Perhaps one of the easiest ways to confuse programmers is to choose bad names for things. For example, consider this little code snippet that’s supposed to demonstrate using a ManualResetEvent to control thread execution.

private ManualResetEvent suspended = new ManualResetEvent(true);

public void Main()
{
    //

    // suspend
    suspended.Reset();

    // resume
    suspended.Set();

    //
}

public void Work()
{
    // long task
    while (!stop)
    {
        suspended.WaitOne(/* optional timeout */);

        //worker task
    }
}

There’s nothing wrong with the way the snippet works, but naming the event suspended is one of the most confusing things I’ve seen this year. The suspended flag is set when the program is running. WTF? I’m pretty sure I’m not the only programmer who, if he were to encounter this in production code, would very likely misunderstand it.

“I have to wait for the suspended flag to be set before I can resume.”

Come on, people. Think about what you call things. Naming is important!
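A name that says what the signaled state means, for example canRun (my name, not from the original snippet), removes the ambiguity entirely:

    using System.Threading;

    class Worker
    {
        // Signaled means "the worker may run": Reset() suspends, Set() resumes.
        private readonly ManualResetEvent canRun = new ManualResetEvent(true);
        private volatile bool stop;

        public void Suspend() => canRun.Reset();
        public void Resume() => canRun.Set();
        public void Stop() { stop = true; canRun.Set(); }

        public void Work()
        {
            while (!stop)
            {
                canRun.WaitOne();   // blocks only while suspended

                // worker task
            }
        }
    }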

Take advantage of the early out

The “new thing” when I first started working with computers was Structured Programming. One of the tenets was that a function should have exactly one entry point and exactly one exit point. Some languages (Pascal, for example) enforced this by not having a return statement that would allow you to exit the function from somewhere in the middle.

In retrospect, that probably was a good thing, because people used to do all kinds of things like jump into the middle of a function, exit a function without returning a value, etc. Restricting functions to a single entry point and a single exit point was an effective way to prevent those stupid programmer tricks. But the restrictions are a little too restrictive.

I can’t think of any reason for a function to have multiple entry points, but restricting a function to a single exit point leads to things like this:

    public boolean isExternal(final URI url) {
        boolean target = false;
        if (url.isAbsolute()) {
            // this can be expensive and slow, run only for absolute URLs which should be an exception
            final String serverName = requestAttributesProvider.get().getServerName();
            final String host = url.getHost();
            if (!host.equals(serverName)) {
                target = true;
            }
        }
        return target;
    }

This code has an unnecessary intermediate variable, and an unnecessary level of nesting. All due to the restriction on the number of exit points. If we relax that restriction and allow an early out, the code becomes less cluttered:

    public boolean isExternal(final URI url) {
        // this can be expensive and slow, run only for absolute URLs which should be an exception.
        if (!url.isAbsolute()) {
            return false;
        }
        final String serverName = requestAttributesProvider.get().getServerName();
        final String host = url.getHost();
        return !host.equals(serverName);
    }

The idea here is to make a decision and then get out. There’s no reason to clutter the rest of the code with the “it’s not an absolute url” context. That allows us to eliminate the target variable, and reduces nesting in the function.

I find that code to be much more clear than the original.

A purist, by the way, would split that into two functions:

    public boolean isExternal(final URI url) {
        // this can be expensive and slow, run only for absolute URLs which should be an exception
        if (!url.isAbsolute()) {
            return false;
        }
        return urlHostIsExternal(url);
    }

    private boolean urlHostIsExternal(final URI url) {
        final String serverName = requestAttributesProvider.get().getServerName();
        final String host = url.getHost();
        return !host.equals(serverName);
    }

On a more general note, this is an anti-pattern:

    if (something) {
        return true;
    }
    else {
        // do other stuff
    }
    return whatever;

If you return from inside the if block, then the else is implied. There’s no need for it. You can reduce nesting and increase legibility by eliminating it:

    if (something) {
        return true;
    }
    // do other stuff
    return whatever;

I can’t think of any case in which such a transformation doesn’t make the code more readable, especially when working with cascading or nested if structures.

The same kind of thing applies to loops. Say you’re doing a sequential search for an item in a list. Structured Programming purists would have you write something like this:

    bool found = false;
    int ix = 0;
    while (ix < list.length && !found)
    {
        if (list[ix] == thingToFind)
        {
            found = true;
        }
        else
        {
            ++ix;
        }
    }
    if (found)
    {
        return ix;
    }
    else
    {
        return -1;
    }

We can eliminate a lot of clutter by returning directly from within the loop:

    int ix = 0;
    while (ix < list.length)
    {
        if (list[ix] == thingToFind)
        {
            return ix;
        }
        ++ix;
    }
    // here, we know that the item wasn't found, so just return sentinel value.
    return -1;

Again, I can think of no case in which that transformation doesn’t improve readability.

Long-time readers probably wondered why some of the code examples above are in Java. Most of the code I’ll be working on in my new job is Java, so I figured it’d be a good idea to start using it. I’ll still be working with C# in my home projects, and reporting on them here, but there also will be Java from time to time.

How many feet of rope?

Elmer got a new job working for the procurement department of a very large company. Boss called him in on his second day, and said that he needs a rope long enough to go all the way around the Earth at the equator. Elmer’s job was to figure out how long the rope has to be, and to get quotes from suppliers: X feet of rope at Y dollars per foot.

Assume for the sake of this example that the Earth is perfectly spherical and smooth.

Elmer went back to his desk, did some calculation to figure out how much rope he’d need, and started burning up the phone talking to suppliers. A few hours later, quotes in hand, he entered the boss’s office and placed the quotes on his desk. Boss glanced at them and then said, “Good work. Before we place the order, this is enough rope to stretch all the way around the Earth, leaving a foot of space between the surface and the rope, right?”

Elmer, surprised, said, “No, sir. You asked for a rope to go all the way around the Earth. You didn’t say anything about a foot of slack.”

Boss pushed the quotes across the desk towards Elmer: “Well, you’ll have to re-do these quotes. I need the rope a foot away from the surface.”

Elmer, being the smart guy he is, replied “That’s okay, we’ll just add XX feet of rope to the original quote.”

What is the value of XX? How did Elmer figure that out so quickly?

My first games programming job was a huge learning experience. It was the first time I’d ever worked with people who were a lot smarter than I was. At least, smarter in terms of computer science. The people I’d worked with before understood specific things about programming, but they didn’t love programming like I did. To most of the people I’d worked with before, programming was just a job. I lived programming! I studied problems at home, solved problems at work, always had a few side projects, etc. Programming was my job and my most engaging hobby. Debra says that she shares me with my first wife: the computer.

One of the great things about working there was that we were always posing little brain teasers to each other. These could vary from math questions such as the above, algorithms for things like finding if a linked list has a loop, or sometimes larger design questions that don’t really have a “right” answer, but rather were designed to make us think about how to approach a problem. In many ways it was like an ongoing job interview. The puzzle that I presented above was one of them. This is one that I got after just a few seconds’ thought, but then had to check my work on the white board.

I’ve posed this question in interviews over the years, when a candidate says he’s good with math. The reason is that it’s a relatively simple problem that somebody who’s “good at math” should be able to solve pretty quickly. Whereas I’m interested in the solution, I’m more interested in how the candidate approaches the problem. Does he know how to set up the equation? Does he ask me questions to clarify the problem? Does he just sit there looking like he has no idea where to start?

Some people think the problem is too specialized, but I think it’s more than reasonable to expect a programmer, especially one who says he’s good at math, to know the equation for the circumference of a circle. This goes double for games programmers!

It’s surprising how many candidates utterly fail when it comes to this question. Some just sit there, dumbfounded. Others get up at the white board and start doodling, but can’t formulate the problem. It’s embarrassing.

The circumference of a circle is two times Pi, times the radius. That is:

C = 2πr

The original quote was for C feet of rope: enough rope to go around the Earth with radius r. The new rope has to be a foot away from the surface of the Earth. That radius is (r+1). So you have the original rope with length 2πr, and the new rope with length 2π(r+1). We want to know the difference. Showing my work, as my instructors always insisted:

C1 = 2πr
C2 = 2π(r+1)
C2-C1 = 2π(r+1) – 2πr
= 2πr + 2π – 2πr
= 2π

Elmer’s answer was 2π, or about 6.28 feet. It only takes about 6.28 feet to move that rope from being against the surface of the Earth to being a foot away.

I’ve had candidates reach that solution and say, “but that can’t be right,” and go back to check their work. I consider that a good thing; being willing to admit that one might have made a mistake is something I look for in the people I work with.

I’ll admit that this fits squarely in the category of “math that just don’t feel right,” but it is right.

I’ve had other candidates argue that the answer is just flat wrong. Even after I show them the math. That’s bad. Not only is arguing with the interviewer bad form, but unsupported statements like, “I know that can’t be the right answer” show an unwillingness to learn and an inability to form reasoned, coherent arguments.

Note that I have nothing against polite, informed, disagreement from a candidate. That shows backbone and an ability to support one’s argument.

It’s a fun interview question, and can tell you a lot about the candidate’s approach to problems, and his reaction to things that are non-intuitive. Will he accept the math, or will he try to argue that the math is somehow wrong?

How to generate unique “random-looking” keys

A common question on Stack Overflow goes something like this:

I would like help designing an algorithm that takes a given number, and returns a random number that is unrelated to the first number. The stipulations being that a) the given output number will always be the same for a similar input number, and b) within a certain range (ex. 1-100), all output numbers are distinct. ie., no two different input numbers under 100 will give the same output number.

or

I’m encoding 64-bit numbers using an alphabet. But since the binary representations of the numbers are more or less sequential, a simple mapping of bits to chars results in very similar strings. I’m looking for ways to make the output more “random”, meaning more difficult to guess the input when looking at the output string.

There are variants, of course, but they’re all pretty much the same question. The programmer asking the question is looking for a way to obfuscate sequential keys. That is, there is code that generates sequential keys for some kind of record, but the programmer wants to hide the fact that they’re sequential. So if the nth key generated is G75AB, he wants the (n+1)th to be, perhaps, XQJ57 or XYZZY; certainly not G75AC, which makes it obvious that the keys are sequential.

A related Stack Overflow question is this:

What is the quickest way to check that the random string I’ve generated hasn’t been generated by me before? I’m curious as to how places like imgur (which identifies each image by a unique 5 digit pin) carry this out. Clearly, one would have at the least an O(n) solution (or at best O(n log (n))depending on the algorithm), but since I’m expecting n to be millions, this is going to be an infeasible solution.

This programmer wants to generate “random” keys. He’s decided that the way to generate unique keys is to just generate a random string, see if it’s been used, and if not go back and generate another. See How not to generate unique codes for an explanation of why random selection is an unsupportable algorithm, and How did this happen? for reasons why you shouldn’t be generating random strings.

The best way I know of to generate unique “random looking” keys is by obfuscating sequential numbers. You start with your first key, 1, obfuscate it, and encode the obfuscated number. Then increment the key and do it again.

Encoding is easy: it’s just a base conversion. One common encoding is base-64. In base-64, the integer 1795368205 is encoded as “DSUDaw”. That’s a nice encoding method, but does nothing to hide the sequential nature of keys. The number 1795368206 becomes “DiUDaw”. Anybody who understands base-64 and how numbers are stored in a computer can quickly determine that those two strings represent sequential keys.
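Encoding with a custom alphabet is just repeated division by the alphabet size. A sketch (Encode is my own helper; the alphabet is arbitrary, so swap in whatever characters you want to allow in your keys):

    static string Encode(ulong value, string alphabet)
    {
        // Emit one character per base-N digit, most significant digit first.
        var sb = new System.Text.StringBuilder();
        ulong radix = (ulong)alphabet.Length;
        do
        {
            sb.Insert(0, alphabet[(int)(value % radix)]);
            value /= radix;
        } while (value > 0);
        return sb.ToString();
    }

    // Example: Encode(1795368205, "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    // produces the base-36 representation of the key.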

Obfuscation can take many forms. One simple but not particularly effective obfuscation would be to invert the order. That is, imagine you’re generating keys from 1 to 100. You could subtract the sequential number from 101. So the first key value you pass to the encoding method is 100. The next is 99, etc. Somebody looking at your keys would assume that you started at 100 and went down to 1. As I said, not particularly effective.

An effective obfuscation technique would take your sequential number and somehow turn it into another number that is wildly different. So 0 becomes, perhaps, 147,486,932. And 1 becomes 46,231. 2 becomes 186,282. There’s no way a casual observer could understand that the encoded versions of those numbers represent sequential keys. But how do you do that?

The answer is a particularly cool bit of math called a modular multiplicative inverse. Despite the name, it’s not a magical incantation. Eric Lippert explains it well in his blog post, Math from scratch, part thirteen: multiplicative inverses. (In case you’re wondering, the acronym he uses in that second sentence, “WOLOG” means, “Without loss of generality“.)

One of the interesting things about the article is that it shows how to create a function that obfuscates sequential keys in a way that’s reversible. That is, I can take a number, obfuscate it by applying a simple bit of math, and then de-obfuscate it at a later time by applying a similarly simple bit of math. The code below, which is just a slight modification of what Lippert presents in his blog post, illustrates that.

    private void DoIt()
    {
        const long m = 101;         // Number of keys + 1
        const long x = 387420489;   // must be coprime to m

        // Compute the multiplicative inverse
        var multInv = MultiplicativeInverse(x, m);

        // HashSet is used to hold the obfuscated value so we can ensure that no duplicates occur.
        var nums = new HashSet<long>();

        // Obfuscate each number from 1 to 100.
        // Show that the process can be reversed.
        // Show that no duplicates are generated.
        for (long i = 1; i <= 100; ++i)
        {
            var obfuscated = i * x % m;
            var original = obfuscated * multInv % m;
            Console.WriteLine("{0} => {1} => {2}", i, obfuscated, original);
            if (!nums.Add(obfuscated))
            {
                Console.WriteLine("Duplicate");
            }
        }    
    }

    private long MultiplicativeInverse(long x, long modulus)
    {
        return ExtendedEuclideanDivision(x, modulus).Item1 % modulus;
    }

    private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
    {
        if (a < 0)
        {
            var result = ExtendedEuclideanDivision(-a, b);
            return Tuple.Create(-result.Item1, result.Item2);
        }
        if (b < 0)
        {
            var result = ExtendedEuclideanDivision(a, -b);
            return Tuple.Create(result.Item1, -result.Item2);
        }
        if (b == 0)
        {
            return Tuple.Create(1L, 0L);
        }
        var q = a / b;
        var r = a % b;
        var rslt = ExtendedEuclideanDivision(b, r);
        var s = rslt.Item1;
        var t = rslt.Item2;
        return Tuple.Create(t, s - q * t);
    }

The first few lines of output for that program are:

    1 => 43 => 1
    2 => 86 => 2
    3 => 28 => 3
    4 => 71 => 4
    5 => 13 => 5
    6 => 56 => 6
    7 => 99 => 7
    8 => 41 => 8
    9 => 84 => 9
    10 => 26 => 10

If you run it on your system, you’ll find that every input number from 1 to 100 generates a unique obfuscated number, also from 1 to 100. This technique is guaranteed not to generate a duplicate. See the links above for the mathematical proof.

The multiplicative inverse method of generating obfuscated sequential keys is elegantly simple, effective, and efficient. It works for any number of keys and obfuscates them well enough that somebody would have to expend considerable effort to determine what your value of x is, even if they knew that they had two values that were generated by sequential keys.

By the way, I would suggest that if you use this technique, you don’t use the number 387420489 for your x value. The person trying to determine your obfuscation scheme might run across this article.

You have a huge number of choices for x here. It can be any number that’s larger than m (the number of keys you want the ability to generate), and x and m must be relatively prime. The easiest way to ensure the second condition is to use a prime number for x. If you limit yourself to 64-bit prime numbers, you still have approximately 4.158E17 values to choose from. The likelihood of somebody guessing your number, or even figuring it out by brute force, is rather remote.

In my next entry, I’ll wrap this all up into a nice easy-to-use class.