How many feet of rope?

Elmer got a new job working for the procurement department of a very large company. boss called him in on his second day, and said that he needs a rope long enough to go all the way around the Earth at the equator. Elmer’s job was to figure out how long the rope has to be, and to get quotes from suppliers: X feet of rope at Y dollars per foot.

Assume for the sake of this example that the Earth is perfectly spherical and smooth.

Elmer went back to his desk, did some calculation to figure out how much rope he’d need, and started burning up the phone talking to suppliers. A few hours later, quotes in hand, he entered the boss’s office and placed the quotes on his desk. Boss glanced at them and then said, “Good work. Before we place the order, this is enough rope to stretch all the way around the Earth, leaving a foot of space between the surface and the rope, right?”

Elmer, surprised, said, “No, sir. You asked for a rope to go all the way around the Earth. You didn’t say anything about a foot of slack.”

Boss pushed the quotes across the desk towards Elmer: “Well, you’ll have to re-do these quotes. I need the rope a foot away from the surface.”

Elmer, being the smart guy he is, replied “That’s okay, we’ll just add XX feet of rope to the original quote.”

What is the value of XX? How did Elmer figure that out so quickly?

My first games programming job was a huge learning experience. It was the first time I’d ever worked with people who were a lot smarter than I was. At least, smarter in terms of computer science. The people I’d worked with before understood specific things about programming, but they didn’t love programming like I did. To most of the people I’d worked with before, programming was just a job. I lived programming! I studied problems at home, solved problems at work, always had a few side projects, etc. Programming was my job and my most engaging hobby. Debra says that she shares me with my first wife: the computer.

One of the great things about working there was that we were always posing little brain teasers to each other. These could vary from math questions such as the above, algorithms for things like finding if a linked list has a loop, or sometimes larger design questions that don’t really have a “right” answer, but rather were designed to make us think about how to approach a problem. In many ways it was like an ongoing job interview. The puzzle that I presented above was one of them. This is one that I got after just a few seconds’ thought, but then had to check my work on the white board.

I’ve posed this question in interviews over the years, when a candidate says he’s good with math. The reason is that it’s a relatively simple problem that somebody who’s “good at math” should be able to solve relatively quickly. Whereas I’m interested in the solution, I’m more interested in how the candidate approaches the problem. Does he know how to set up the equation? Does he ask me questions to clarify the problem? Does he just sit there looking like he has no idea where to start?

Some people think the problem is too specialized, but I think it’s more than reasonable to expect a programmer, especially one who says he’s good at math, to know the equation for the circumference of a circle. This goes double for games programmers!

It’s surprising how many candidates utterly fail when it comes to this question. Some just sit there, dumbfounded. Others get up at the white board and start doodling, but can’t formulate the problem. It’s embarrassing.

The circumference of a circle is two times Pi, times the radius. That is:

C = 2πr

The original quote was for C feet of rope–enough rope to go around the Earth with radius r. The new rope has to be a foot away from the surface of the Earth. That radius is (r+1). So you have the original rope with length 2πr, and the new rope with length 2π(r+1). We want to know the difference. Showing my work, as my instructors always insisted:

C1 = 2πr
C2 = 2π(r+1)
C2-C1 = 2π(r+1) – 2πr
= 2πr + 2π – 2πr
= 2π

Elmer’s answer was 2π, or about 6.28 feet. It only takes about 6.28 feet to move that rope from being against the surface of the Earth to being a foot away.

I’ve had candidates reach that solution and say, “but that can’t be right,” and go back to check their work. I consider that a good thing; being willing to admit that one might have made a mistake is something I look for in the people I work with.

I’ll admit that this fits squarely in the category of “math that just don’t feel right,” but it is right.

I’ve had other candidates argue that the answer is just flat wrong. Even after I show them the math. That’s bad. Not only is arguing with the interviewer bad form, but unsupported statements like, “I know that can’t be the right answer” show an unwillingness to learn and an inability to form reasoned, coherent arguments.

Note that I have nothing against polite, informed, disagreement from a candidate. That shows backbone and an ability to support one’s argument.

It’s a fun interview question, and can tell you a lot about the candidate’s approach to problems, and his reaction to things that are non-intuitive. Will he accept the math, or will he try to argue that the math is somehow wrong?

Wikipedia: Trust, but verify

A recent question on Stack Overflow asked why Quicksort is called Quicksort, even though it sometimes exhibits O(n2) behavior whereas Merge sort and Heap sort are both O(log(n)) in all cases. (For those unfamiliar with the terminology, he’s asking why it’s called “Quicksort,” when other algorithms are theoretically faster.) It’s a reasonable question for a beginner to ask.

One person commented that the answer to the question can be found in the Quicksort article on Wikipedia. But the person asking the question rejected that source because “My professor said Wikipedia is not an authentic site.”

I doubt that’s what the professor said, because a college professor telling students not to use Wikipedia would be, at best, irresponsible. I suspect that the professor said Wikipedia should not be cited as a primary source in formal writing, and the student took that to mean Wikipedia is not a reliable source of information. It seems a prevalent opinion among many. I know intelligent people who will not use Wikipedia. At all. For anything. I cannot understand that behavior.

In my experience, Wikipedia is a highly reliable source of information on fact-based topics (historical events,general science,, algorithms and data structures, mathematics, astronomy, etc.), and a good introduction when it comes to political, religious, or other emotionally-charged topics. If I want to know what something is or was, then Wikipedia probably has an article about it, and that article will be accurate enough for me to get the general idea. Quickly skimming an article gives me a shallow understanding: enough that I don’t feel completely ignorant when somebody starts talking to me about the topic.

More critical reading not only supplies details, but also gives me a feel for points of controversy. Because Wikipedia articles, especially articles on current events or highly charged topics, are often edited by multiple people, it’s difficult for any single person or group to present a completely one-sided viewpoint. It’s also fairly easy to determine when a topic has been edited to remove any hint of controversy.

The best thing about Wikipedia is that it contains mostly reliable articles on almost every topic of any import. The second best thing is that those articles cite their sources. Every article has a “references” section in which it lists the sources for factual statements made in the article. If a factual statement does not have a reference, there is an annotation saying so. If the article lacks references in general, there’s a big note at the top of the article saying so.

With the references listed, one can easily spend a few minutes reading the primary source material to get more details, or to see if the article presents an accurate summation of the material. A Wikipedia article, like a well-written research paper, is verifiable. Contrast that with a printed encyclopedia, which rarely cites sources.

There are pretty high standards for Wikipedia articles, and the community of editors does a very good job of ensuring that articles meet those standards. If an article does not, there are prominent annotations in the text. If an article appears to be opinion-based, presents only one side of a controversial topic, lacks references, appears to be purely anecdotal, or runs afoul of any number of standards, there is a clearly marked indication of that in the article itself. I’ve seen things like, “this article lacks references” at the top of an article. Or “reference needed” when an unsupported assertion is made. Countless articles say, “This article does not meet standards.”

In short, a Wikipedia article gives you all the tools you need to gauge the reliability of the information presented. Certainly much more than a newspaper, television news channel, printed encyclopedia, magazine article, random Web site, your favorite politician, etc. Of those, only Wikipedia makes it a point to give you the resources you need to verify the information that’s presented.

Other than scientific journals, I can’t think of a general information source that I trust more than Wikipedia. It’s the first stop I make if I want to learn about anything.

Not that I take everything on Wikipedia as gospel. Like any other source of information, there are inadvertent mistakes and deliberate falsehoods in Wikipedia articles. But my experience has been that the number of such incidents is much smaller than in any other source of general information that I’m familiar with. More importantly, such things are typically identified and corrected very quickly.

I trust information from Wikipedia articles, but I also verify it through other means. By contrast, I’ve come to distrust information from most news sources, and use other means to determine the veracity of their articles. The primary differences here are that Wikipedia articles tell me where I can verify the information, and I’m usually able to verify it. News outlets pointedly do not reveal their sources, and very often I find that their versions of events are, to be kind, not entirely accurate.

How to generate unique “random-looking” keys

A common question on Stack Overflow goes something like this:

I would like help designing an algorithm that takes a given number, and returns a random number that is unrelated to the first number. The stipulations being that a) the given output number will always be the same for a similar input number, and b) within a certain range (ex. 1-100), all output numbers are distinct. ie., no two different input numbers under 100 will give the same output number.

or

I’m encoding 64-bit numbers using an alphabet. But since the binary representations of the numbers are more or less sequential, a simple mapping of bits to chars results in very similar strings. I’m looking for ways to make the output more “random”, meaning more difficult to guess the input when looking at the output string.

There are variants, of course, but they’re all pretty much the same question. The programmer asking the question is looking for a way to obfuscate sequential keys. That is, there is code that generates sequential keys for some kind of record, but the programmer wants to hide the fact that they’re sequential. So if the nth key generated is G75AB, he wants the (n+1)th to be, perhaps, XQJ57 or XYZZY; certainly not G75AC, which makes it obvious that the keys are sequential.

A related Stack Overflow question is this:

What is the quickest way to check that the random string I’ve generated hasn’t been generated by me before? I’m curious as to how places like imgur (which identifies each image by a unique 5 digit pin) carry this out. Clearly, one would have at the least an O(n) solution (or at best O(n log (n))depending on the algorithm), but since I’m expecting n to be millions, this is going to be an infeasible solution.

This programmer wants to generate “random” keys. He’s decided that the way to generate unique keys is to just generate a random string, see if it’s been used, and if not go back and generate another. See How not to generate unique codes for an explanation of why random selection is an unsupportable algorithm, and How did this happen? for reasons why you shouldn’t be generating random strings.

The best way I know of to generate unique “random looking” keys is by obfuscating sequential numbers. You start with your first key, 1, obfuscate it, and encode the obfuscated number. Then increment the key and do it again.

Encoding is easy: it’s just a base conversion. One common encoding is base=64. In base-64, the integer 1795368205 is encoded as “DSUDaw”. That’s a nice encoding method, but does nothing to hide the sequential nature of keys. The number 1795368206 becomes “DiUDaw”. Anybody who understands base-64 and how numbers are stored in a computer can quickly determine that those two strings represent sequential keys.

Obfuscation can take many forms. One simple but not particularly effective obfuscation would be to invert the order. That is, imagine you’re generating keys from 1 to 100. You could subtract the sequential number from 101. So the first key value you pass to the encoding method is 100. The next is 99, etc. Somebody looking at your keys would assume that you started at 100 and went down to 1. As I said, not particularly effective.

An effective obfuscation technique would take your sequential number and somehow turn it into another number that is wildly different. So 0 becomes, perhaps, 147,486,932. And 1 becomes 46,231. 2 becomes 186,282. There’s no way a casual observer could understand that the encoded versions of those numbers represent sequential keys. But how do you do that?

The answer is a particularly cool bit of math called a modular multiplicative inverse. Despite the name, it’s not a magical incantation. Eric Lippert explains it well in his blog post, Math from scratch, part thirteen: multiplicative inverses. (In case you’re wondering, the acronym he uses in that second sentence, “WOLOG” means, “Without loss of generality“.)

One of the interesting things about the article is that it shows how to create a function that will obfuscate sequential keys that are reversible. That is, I can take a number, obfuscate it by applying a simple bit of math, and then de-obfuscate it at a later time by applying a similarly simple bit of math. The code below, which is just a slight modification of what Lippert presents in his blog post, illustrates that.

    private void DoIt()
    {
        const long m = 101;         // Number of keys + 1
        const long x = 387420489;   // must be coprime to m

        // Compute the multiplicative inverse
        var multInv = MultiplicativeInverse(x, m);

        // HashSet is used to hold the obfuscated value so we can ensure that no duplicates occur.
        var nums = new HashSet<long>();

        // Obfuscate each number from 1 to 100.
        // Show that the process can be reversed.
        // Show that no duplicates are generated.
        for (long i = 1; i <= 100; ++i)
        {
            var obfuscated = i * x % m;
            var original = obfuscated * multInv % m;
            Console.WriteLine("{0} => {1} => {2}", i, obfuscated, original);
            if (!nums.Add(obfuscated))
            {
                Console.WriteLine("Duplicate");
            }
        }    
    }

    private long MultiplicativeInverse(long x, long modulus)
    {
        return ExtendedEuclideanDivision(x, modulus).Item1 % modulus;
    }

    private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
    {
        if (a < 0)
        {
            var result = ExtendedEuclideanDivision(-a, b);
            return Tuple.Create(-result.Item1, result.Item2);
        }
        if (b < 0)
        {
            var result = ExtendedEuclideanDivision(a, -b);
            return Tuple.Create(result.Item1, -result.Item2);
        }
        if (b == 0)
        {
            return Tuple.Create(1L, 0L);
        }
        var q = a / b;
        var r = a % b;
        var rslt = ExtendedEuclideanDivision(b, r);
        var s = rslt.Item1;
        var t = rslt.Item2;
        return Tuple.Create(t, s - q * t);
    }

The first few lines of output for that program are:

    1 => 43 => 1
    2 => 86 => 2
    3 => 28 => 3
    4 => 71 => 4
    5 => 13 => 5
    6 => 56 => 6
    7 => 99 => 7
    8 => 41 => 8
    9 => 84 => 9
    10 => 26 => 10

If you run it on your system, you’ll find that every input number from 1 to 100 generates a unique obfuscated number, also from 1 to 100. This technique is guaranteed not to generate a duplicate. See the links above for the mathematical proof.

The multiplicative inverse method of generating obfuscated sequential keys is elegantly simple, effective, and efficient. It works for any number of keys and obfuscates them well enough that somebody would have to expend considerable effort to determine what your value of x is, even if they knew that they had two values that were generated by sequential keys.

By the way, I would suggest that if you use this technique, you don’t use the number 387420489 for your x value. The person trying to determine your obfuscation scheme might run across this article.

The choice of x here is huge. It can be any number that’s larger than m (the number of keys you want the ability to generate), and x and m must be relatively prime. The easiest way to ensure the second condition is to use a prime number for x. If you limit yourself to 64 bit prime numbers, you still have approximately 4,158E17 values to choose from. The likelihood of somebody guessing your number, or even figuring it out by brute force, is rather remote.

In my next entry, I’ll wrap this all up into a nice easy-to-use class.

Serialize enums as numbers rather than as strings

Imagine that your application works with different document types. You have a server that stores documents in a database, and several applications that query the database, and briefly cache query results in shared memory.

One of the fields that’s returned from a search is the document type, which you’ve encoded in a C# enumeration:

    public enum DocumentType
    {
        WordSmasher = 1,
        NumberCruncher = 2,
        MessageMangler = 3
    }

And in your document record:

    public class DocumentSummary
    {
        public string Name { get; set; }
        public DocumentType DocType { get; set; }
        // other stuff
    }

Now, say you serialize that to JSON, converting the enum values to strings. You end up with:

    {
        "Name":"MyDocument.msh",
        "DocType":"WordSmasher"
    }

All of the applications include a shared library that defines common data structures and such, like DocumentSummary and DocumentType.

This all works great until you add support for a new type of document: the company’s new PowerPoint competitor, called SleepyAudience. You update your enumeration:

    public enum DocumentType
    {
        WordSmasher = 1,
        NumberCruncher = 2,
        MessageMangler = 3,
        SleepyAudience = 4
    }

Then you compile and test your server code, build one of the other applications to test things with, and release it to the world. Only to find that the other applications, which haven’t yet been rebuilt, die a horrible death when they see the type “SleepyAudience”.

The application crashes because the deserialization code doesn’t know what to do when it finds the string “SleepyAudience” in the DocType. As far as that code is concerned, the DocType field can be an integer value, or it can have one of the first three strings. “SleepyAudience” is not the name of an enum value, as far as the deserializer is concerned.

The easiest solution to this problem is to encode the enumeration value as a number, rather than as a string. Had the DocType been encoded as a number, the deserializer would have read and populated a DocumentSummary instance. It’s then up to the application code to decide what to do with a document type that it is not familiar with. Perhaps the application would filter that document from the list presented to the user. Or it would display “Unknown document type.” Whatever the case, it certainly shouldn’t crash.

You lose a little bit of readability by encoding the enum value as a number, but only if you’re in the practice of eyeballing JSON data. And, yes, it’s often easier for people who are writing code that interacts with your API to understand what “WordSmasher” means. But we programmers are used to associating numbers with names. It’s no particular burden for a programmer to look up the value “1” in the API documentation to determine that it identifies the WordSmasher application.

One solution to this problem, of course, is to make sure that all applications that depend on the enumeration are re-built and released whenever the enumeration changes. That would solve the problem, but it’s not practical when you have dozens of different applications online all the time. You can’t just stop the world for a few hours (or even a few minutes) to apply updates.

You could also argue that the applications themselves are at fault, and I would agree with that assessment. After all, an application shouldn’t crash horribly because it got some bad data. The application should ignore that one data item (in this case, a single document record from a list of documents), log an error message, and continue. But standard serialization libraries aren’t that robust. They either understand the entirety of what’s sent, or they roll over and die.

The Robustness principle says “Be conservative in what you send, be liberal in what you accept.” When applied to deserialization it means that if you don’t understand something, deal with it gracefully; do the best you can. Fail gracefully. Interpret the value or ignore it and supply a default. If you can’t do that, then discard the record. If you can’t discard the record, discard the input. At worst return an empty data set. But under no circumstances should the program go down in flames because it didn’t understand some input data.

Regardless of who’s at fault here, the point is that encoding enumeration values as strings has very little, if any, real benefit, and makes it harder to write robust deserialization code. Especially when working with deserialization libraries (as opposed to rolling your own deserialization code), you should do whatever you reasonably can to ensure that the deserializer can interpret your data. Let the rest of your application decide how to proceed, after the data has been read.

In my opinion, the application I describe above has a more fundamental flaw in that the document types it supports are hard coded in an enumeration. But that’s a discussion for another day.

How did this happen?

Last time I showed two different implementations of the naive method for generating unique coupon codes. The traditional method does this:

    do
    {
        id = pick random number
    } until id not in used numbers
    code = generate code from id

The other is slightly different:

    do
    {
        id = pick random number
        code = generate code from id
    } until code not in used codes

Before we start, understand that I strongly discourage you from actually implementing such code. The naive selection algorithm performs very poorly as the number of items you choose increases. But the seemingly small difference in these two implementations allows me to illustrate why I think that the second version is broken in a fundamental way.

Think about what we’re doing here. We’re obtaining a number by which we are going to identify something. Then we’re generating a string to express that number. It just so happens that in the code I’ve been discussing these past few entries, the expression is a six-digit, base-31 value.

So why are we saving the display version of the number? If we were going to display the number in decimal, there’d be no question: we’d save the binary value. After all, the user interface already knows how to display a binary value in decimal. Even if we were going to display the number in hexadecimal, octal, or binary, we’d store the binary value. We certainly wouldn’t store the string “3735928559”, “11011110101011011011111011101111”, or “DEADBEEF”. So, again, why store the string “26BB62” instead of the the 32-bit value 33,554,432?

I can’t give you a good reason to do that. I can, however, give you reasons not to.

  1. Storing a six-character string in the database takes more space than storing a 32-bit integer.
  2. Everything you do with with a six-character code string takes longer than if you were working with an integer.
  3. Display logic should be part of the user interface rather than part of the database schema.
  4. Changing your coding scheme becomes much more difficult.

The only argument I can come up with for storing codes rather than integer identifiers is that with the former, there’s no conversion necessary once the code is generated. Whereas that’s true, it doesn’t hold up under scrutiny. Again, nobody would store a binary string just because the user interface wants to display the number in binary. It’s a simple number base conversion, for crying out loud!

If you’re generating unique integer values for database objects, then let the database treat them as integers. Let everybody treat them as integers. Except the users. “26BB62” is a whole lot easier to read and to type than is “33554432”.

I’ll be charitable and say that whoever came up with the idea of storing the display representation was inexperienced. I’m much less forgiving of the person who approved the design. The combination of the naive selection algorithm, storing the generated coupon code rather than the corresponding integer, and the broken base conversion function smacks of an inexperienced programmer working without adult supervision. Or perhaps an inexperienced programmer working with incompetent adult supervision, which amounts to the same thing.

The mere existence of the naive selection algorithm in this code is a sure sign that whoever wrote the code either never studied computer science, or slept through the part of his algorithms class where they studied random selection and the birthday paradox (also, birthday problem). The comment, “There is a tiny chance that a generated code may collide with an existing code,” tells us all we need to know of how much whoever implemented the code understood about the asymptotic behavior of this algorithm. Unless your definition of “tiny” includes a 10% chance after only one percent of the codes have been generated.

The decision to store the generated code string surprises me. You’d think that a DBA would want to conserve space, processor cycles, and data transfer size. Certainly every DBA I’ve ever worked with would prefer to store and index a 32-bit integer rather than a six-character string.

The way in which the base conversion function is broken is hard evidence that whoever modified it was definitely not an experienced programmer. I suspect strongly that the person who wrote the original function knew what he was doing, but whoever came along later and modified it was totally out of his depth. That’s the only way I can explain how the function ending up the way it did.

Finally, that all this was implemented on the database looks to me like a case of seeing the world through database-colored glasses. Sure, one can implement an obfuscated unique key generator in T-SQL. Whether one should is another matter entirely. Whoever did it in this case shouldn’t have tried, because he wasn’t up to the task.

However this design came to exist, it should not have been approved. If there was any oversight at all, it should have been rejected. That it was implemented in the first place is surprising. That it was approved and put into production borders on the criminal. Either there was no oversight, or the person reviewing the implementation was totally incompetent.

With the code buried three layers deep in a database stored procedure, it was pretty safe from prying eyes. So the bug remained hidden for a couple of years. Until a crisis occurred: a duplicate code was generated. Yes, I learned that the original implementation didn’t even take into account the chance of a duplicate. Apparently somebody figured that with 887 million possible codes, the chance of getting a duplicate in the first few million was so remote as to be ignored. Little did they know that the chance of getting a duplicate within the first 35,000 codes is 50%.

I also happen to know that at this point there was oversight by a senior programmer who did understand the asymptotic behavior of the naive algorithm, and yet approved “fixing” the problem by implementing the duplicate check. He selected that quick fix rather than replacing the naive algorithm with the more robust algorithm that was proposed: one that does not suffer from the same pathological behavior. The combination of arrogance and stupidity involved in that decision boggles the mind.

The history of this bug is rooted in company culture, where the database is considered sacrosanct: the domain of DBAs, no programmers allowed, thank you very much. And although programmers have access to all of the source code, they aren’t exactly encouraged to look at parts of the system outside of their own areas. Worse, they’re actively discouraged from making suggestions for improvement.

In such an environment, it’s little surprise that the horribly broken unique key generation algorithm survives.

So much for that. Next time we’ll start looking at good ways to generate unique, “random-looking” keys.

A broken unique key generator

Hanlon’s Razor: “Never attribute to malice that which is adequately explained by stupidity.”

Recap: in my previous post I showed the naive method of generating unique “random-looking” six-character keys, and explained why it wasn’t a good solution for generating hundreds of millions of keys. I also mentioned that I used to think that the naive method was the worst possible way somebody would try to use for generating keys, and promised that I would show a worse way in today’s post.

If you recall, the naive method does this:

    do
    {
        generate a random number
    } until you find one that hasn't been used
    mark the number as used
    create the coupon code for that number

The worse way that I mentioned attempts to do this:

    do
    {
        generate a random number
        create the coupon code for that number
    } until you find a coupon code that hasn't been used
    mark the code as used

The only difference in the overall algorithm is that the second method uses the coupon code rather than the random number to decide whether a code has been used. That might seem like a minor difference, but it’s really a major change in thinking. It’s also more expensive and yet another place for bugs to creep in.

Let’s defer the discussion of why the second method is worse than the first. In preparation, I want to provide a real-life example of such an implementation.

I ran across this code in a production database that’s used to generate millions of six-character coupon codes every week. When I first saw it I was surprised that somebody would use the naive method to generate unique codes. But then I studied it a little closer and was astonished at how incompetently it was written. If I were the suspicious type, I’d be tempted to think that the person who wrote (or, possibly, modified) it actually broke it on purpose. But then I remembered Hanlon’s Razor.

Here is part of a database stored procedure that implements the second algorithm I showed above.

    WHILE (@codesGenerated < @NumberToGenerate AND @Attempts < @MaxAttempts)
    BEGIN
        BEGIN TRY
            INSERT INTO [dbo].[CouponCodes]
                    ([CouponNamespaceId], [CouponId], [Code], [Used])
                VALUES
                    (@CouponNamespaceId,@CouponId,LTRIM(RTRIM([dbo].[GenerateRandomBase32](ABS(CHECKSUM(NEWID()))))), 0)

            -- If a collision happened in the Insert above then this @codesGenerated will (correctly) not happen
            SET @codesGenerated = @codesGenerated + 1  

        END TRY
        BEGIN CATCH 
            -- There is a tiny chance that a generated code may collide with an existing code.
            -- which will violate the unique constraint.
            -- Error handling code is here.
        END CATCH

        --Always increment @Attempts to prevent an infinite loop if there are too many collisions
        SET @Attempts = @Attempts + 1
    END

This is a T-SQL version of the modified naive method: pick a number, generate the coupon code, see if the coupon code has been used. In the interest of brevity, I’ve removed some irrelevant code and the error handling. It’s a reasonable implementation of that modified naive method, although I question the use of a GUID checksum as the source of random numbers. I’m also not convinced that the person who wrote the error handling understands that the ABS function will throw an exception if the parameter is -2147483648. It’s highly unlikely that CHECKSUM will generate that value, but I’ve been burned before by code that blindly calls ABS without either first checking the input parameter, or handling the exception.

Whereas I believe that those issues should be addressed, they’re pretty minor in light of what follows.

I’m also a bit amused at the comment: “There is a tiny chance that a generated code may collide with an existing code.” Again, it’s especially amusing in light of what follows.

Let’s also ignore for now the question of whether this code even belongs in the database layer. I think it’s a good discussion to have, but we’ll save it for another time.

The problem lies in GenerateRandomBase32, a SQL Server scalar function:

    ALTER FUNCTION [dbo].[GenerateRandomBase32]
    (
        @seed AS BIGINT
    )
    RETURNS CHAR(6)
    AS
    BEGIN 
        DECLARE @base BIGINT = 31
        SET @seed = @seed % 1040187391 + 33554432;
        DECLARE @answer AS CHAR(6) = '';
        DECLARE @allvalid AS VARCHAR(31);
        SET @allvalid = '123456789ABCDFGHJKLMNPQRSTVWXYZ';
        WHILE @seed > 0
            BEGIN 
                SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
                SET @seed = @seed / @base;
            END
        RETURN @answer;
    END

One problem is that the function is poorly named. The name “GenerateRandomBase32” implies that it’s going to generate a random value. At first I thought it was supposed to generate a random 32-bit value, but for reasons I’ll get to in a bit I’m now convinced that “32” was supposed to be the number base. The function doesn’t generate anything, there’s no randomness involved, and the number base is 31. A more appropriate name would be “ConvertToBase31,” or perhaps “ConvertToSixDigitBase31”. I realize that code changes over the years and this function might have done something quite different when it was originally written, but that’s no excuse for failing to change the function name to something meaningful.

On the face of it, the function looks like a T-SQL implementation of the base conversion method that I showed last time. That is, it takes a binary number and creates base-31 representation of it. That’s what it does, in general, but there are a few oddities that will surprise you. The first is this expression at the beginning:

    SET @seed = @seed % 1040187391 + 33554432;

With that expression, the effective range of keys that the function can create base-31 representations of is 33,554,432 to 1,073,741,822: a total of 1,040,187,390. Remember, though, that there are only 887,503,681 possible six-character coupon codes. Some of those values will get duplicates.

You might get the idea that this isn’t going to end well.

The peculiar thing about the expression is that, if we were generating a base-32 code, it would have the effect of preventing all codes that start with the first digit (presumably 0) from being generated. That is, if you sum the two magic numbers you end up with 1,073,741,823. It just so happens that 1,073,741,824 is the number of six-digit base-32 codes you can generate. I suspect that this code was originally written to generate base-32 codes, but for some reason later modified to eliminate ‘0’ from the valid digits, making it a base-31 code generator. But changing the number base without changing that expression has resulted in a horribly broken function.

I can’t prove that the function used to generate base-32 numbers, and was subsequently modified. I can only say that the evidence points strongly to that. I also can’t say for sure why one would want to prevent generation of codes that start with “0”. The only two reasons I can think of are:

  1. Somebody decided that numbers with leading 0’s were somehow difficult for users to handle.
  2. The programmer couldn’t figure out how to include leading 0’s when doing the base conversion.

At this point, I consider those two possibilities equally likely.

After mangling the input parameter, the code performs the standard base conversion logic, peeling out digits from lowest to highest, and creating a string. Note that each digit is prepended to the string. The code “7HNH51” would be constructed in the sequence “1”, “61”, “H61”, “NH61”, “HNH61”, “7HNH61”.

However, the range of values for which the function can generate codes exceeds the number of possible coupon codes. The function can generate codes for numbers greater than 887,503,681, which means that some of the codes generated will be seven characters long. They have to be truncated to six characters. How that truncation is done ends up making this code much worse than it appears at first.

If you pass the value 1040187360 to this function, it will generate seven characters. The first six characters generated are the code shown above, “7HNH61”. But then it generates a seventh character, “2”, which is prepended. Except the answer variable is declared as holding only six characters. So when this line of code is executed:

    SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;

The character “2” becomes the first character in the new string, and then the first five (from left to right) of the previous characters are copied. The result is “27HNH6”.

Now here’s the kicker. Guess what happens when you pass the value 1040187361. The function generates “27HNH62”, which is truncated to “27HNH6”. Duplicate! In fact, all 31 values from 1040187360 through 1040187390 generate that same code. That’s a huge oversight that seriously affects the duplicate rate.

There are 186,238,141 numbers for which this function will generate a seven-character code that’s truncated. If you pass a number in the range 853,949,249 to 1,040,187,390 to this function, it will generate one of 6,007,682 codes. Those 186 million values are 31 times more likely than average to be generated.

One other thing that affects the duplicate rate, albeit not very large, is that the function won’t ever generate a code for any number less than 33,554,432.

Combined, those two errors more than double the duplicate rate experienced by programs that call this function as part of the naive code generation algorithm.

Fixing that function really isn’t difficult, as I show below. The only real changes are to replace the seed mangling expression, and modify the loop so that it always generates exactly six characters.

    ALTER FUNCTION [dbo].[CreateSixDigitBase31]
    (
        @seed AS BIGINT
    )
    RETURNS CHAR(6)
    AS
    BEGIN 
        /*
         * This function generates a six-digit, base-31 representation of the passed seed parameter.
         * It is capable of generating 887,503,681 different codes (permutations of 31 items taken six
         * at a time, with repetition).
         * If the passed seed is larger than 887,593,680, the seed is taken modulo 887,593,680.
         * This function will not produce a useful value if seed is negative.
         */
        DECLARE @base BIGINT = 31
        SET @seed = @seed % 887503680;
        DECLARE @answer AS CHAR(6) = '';
        DECLARE @allvalid AS VARCHAR(31);
        SET @allvalid = '123456789ABCDFGHJKLMNPQRSTVWXYZ';
        DECLARE @counter as INT = 6;
        WHILE @counter > 0
            BEGIN 
                SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
                SET @seed = @seed / @base;
                SET @counter = @counter - 1;
            END
        RETURN @answer;
    END

I don’t particularly like that function because I suspect that the string manipulation is highly inefficient. We are in effect doing six string concatenation instructions with every code generated. I know that doing things this way in a C# program would bring the program to a crawl, which is why the implementation in my previous post used a character array. I don’t know enough about how T-SQL manipulates strings to say how efficient this is, and I don’t know if there’s a better alternative.

With the fixed base conversion function, this naive unique code generation algorithm is about as good as it can be. Which isn’t very good. Not only does it suffer from the duplicates problem that I pointed out yesterday, it has two other, related, strikes against it: it’s implemented at the database level, and it stores the generated code rather than the seed number.

Next time I’m going to talk a bit about why I think implementing the code generation in the database is a bad idea. I’ll also touch on how it’s possible for such a horribly broken implementation to go unnoticed for so long.

How not to generate unique codes

Imagine that part of the system you’re developing requires you to generate six-character unique alphanumeric coupon codes that you don’t want to appear sequential. For example, the first code might be XQJ97T and the next is 1B9QD4. You need to keep track of the codes, of course, and generating a duplicate would be a very bad thing.

For this discussion, let’s assume that you want codes to contain characters from the set 123456789ABCDFGHJKLMNPQRSTVWXYZ. I’ve eliminated the number 0 and the vowels E, I, O, and U. Why eliminate the vowels? Because some people think it reduces the likelihood of spelling a “bad” word. Whatever. You’d be surprised how common that set is.

So there are 31 characters, and we want to create a six-character code. If you allow repetition (codes where characters can be repeated, like XQJ97J), then there are 887,503,681 possible codes. (See permutations with repetition.)

I used to think that the worst possible way to generate a unique code would be to generate one, see if it’s been used, and if so go back and generate another one. Something like this:

    do
        code = generateRandomCode()
    until (code not in database)
    addGeneratedCode(code)

That actually works. However, you’ll find that as the number of codes you generate increases, it takes longer and longer to find one that hasn’t been used. The code below illustrates this effect using random numbers.

First, the DumbCodeGenerator class, which implements the code generation and keeps track of the number of duplicates.

    public class DumbCodeGenerator
    {
        private readonly Random _rnd = new Random();
        private readonly HashSet<int> _uniqueCodes = new HashSet<int>();

        public int TotalCodesGenerated { get; private set; }
        public int UniqueCodesGenerated { get { return _uniqueCodes.Count; } }
        public int DuplicatesGenerated { get { return TotalCodesGenerated - UniqueCodesGenerated; } }

        private const int TotalAvailableCodes = 887503681;

        public int NextCode()
        {
            while (true)
            {
                int code = _rnd.Next(TotalAvailableCodes);
                ++TotalCodesGenerated;
                if (_uniqueCodes.Add(code))
                {
                    return code;
                }
            }
        }
    }

And, the code that calls it:

    class Program
    {
        private const int Thousand = 1000;
        private const int Million = Thousand*Thousand;
        private const int NumCodesToGenerate = 25*Million;
        static void Main(string[] args)
        {
            var generator = new DumbCodeGenerator();
            int lastDup = 0;
            for (var i = 0; i < NumCodesToGenerate; ++i)
            {
                generator.NextCode();
                if (i%Million == 0)
                {
                    Console.WriteLine("{0:N0}, {1:N0}, {2:N0}", i, generator.DuplicatesGenerated-lastDup, generator.DuplicatesGenerated);
                    lastDup = generator.DuplicatesGenerated;
                }
            }
            Console.WriteLine("{0:N0} unique codes generated.", generator.UniqueCodesGenerated);
            Console.WriteLine("{0:N0} duplicates generated.", generator.DuplicatesGenerated);
            Console.WriteLine("{0:N0} total generated.", generator.TotalCodesGenerated);
            Console.WriteLine("Duplicate percent = {0:P2}", (double)generator.DuplicatesGenerated/NumCodesToGenerate);
            Console.Write("Press Enter:");
            Console.ReadLine();
        }
    }

When I run that code, it produces this output:

0  0  0
1,000,000  583  583
2,000,000  1,742  2,325
3,000,000  2,859  5,184
4,000,000  4,153  9,337
5,000,000  5,369  14,706
6,000,000  6,492  21,198
7,000,000  7,666  28,864
8,000,000  8,823  37,687
9,000,000  10,203  47,890
10,000,000  11,185  59,075
11,000,000  12,378  71,453
12,000,000  13,719  85,172
13,000,000  14,895  100,067
14,000,000  15,989  116,056
15,000,000  17,251  133,307
16,000,000  18,906  152,213
17,000,000  19,871  172,084
18,000,000  20,956  193,040
19,000,000  21,995  215,035
20,000,000  23,420  238,455
21,000,000  24,494  262,949
22,000,000  25,663  288,612
23,000,000  26,990  315,602
24,000,000  28,254  343,856
25,000,000 unique codes generated.
373,291 duplicates generated.
25,373,291 total generated.
Duplicate percent = 1.49 %

In short, the first million codes selected generated 583 duplicates. Between 1 million and 2 million, there were 1,742 duplicates generated. Between 23 million and 24 million, it generated 28,254 duplicates, or about 2.83%. The likelihood of a duplicate increases as we generate more codes. When I ran it for 45 million codes (the largest I could get on my laptop), it was generating 5.5% duplicates at the last interval, and the duplicate percentage up to that point was 2.74%.

If you were to generate 50 million unique codes this way, you would have generated approximately 3% more codes than you had to. That’s not really so bad, provided you don’t mind spending the memory (or database resources) saving and querying the generated numbers. Things get more troublesome when you want to generate hundreds of millions of codes, because your duplicate percentage increases so much that you spend a huge amount of time trying to generate a unique code.

So far, I’ve talked about generating random numbers. What about six-character codes? They’re just a base conversion away. That is, just like you can represent the decimal value 1793 as 0x0701, you can also represent any number in base-31 using whatever characters you want as digits. Here’s the method that turns a passed binary value into a base-31 code string.

    private const string ValidChars = "123456789ABCDFGHJKLMNPQRSTVWXYZ";
    private static readonly int NumberBase = ValidChars.Length;

    static string CreateEncodedString(int code)
    {
        var answer = new char[6];
        int codeVal = code;
        for (var i = 0; i < 6; ++i)
        {
            int digit = codeVal%NumberBase;
            answer[5 - i] = ValidChars[digit];
            codeVal = codeVal/NumberBase;
        }
        return new string(answer);
    }

In this case, I wanted to keep leading “zeros” (which are represented by the character “1”). Note that the code builds the string backwards. If this method is passed a number larger than 887,503,680, only the six low-order digits of the number are generated. It would be like generating the code for the number modulo 887,503,681.

As I said, I used to think that the above was the worst possible way anybody would consider generating codes. I was wrong. There’s a much worse way that I’ve actually seen used in production code. I’ll talk about that in my next entry.

ReSharper finds the bugs

I might have mentioned before that I really like ReSharper. In my opinion if you’re writing C# code and not using ReSharper, you’re handicapping yourself. Next to Visual Studio, it’s the most useful development tool I have.

Today’s discovery is a case in point.

I was reviewing some code today when I ran across a latent bug and a definite bug, both because ReSharper had flagged the lines involved.

    string paramName;
    var paramNameMatch = Regex.Match(message, @"Parameter name\:\s+(?.+)");
    if (paramNameMatch != null) { paramName = paramNameMatch.Groups["ParamName"].Value; }
    else { paramName = string.Empty; }

    string actualValue;
    var actualValueMatch = Regex.Match(message, @"Actual value was (?.+)\.");
    if (paramNameMatch != null) { actualValue = paramNameMatch.Groups["ActualValue"].Value; }
    else { actualValue = string.Empty; }

Don’t worry too much about what the regular expressions are doing or why.

The code looks reasonable, right? Try to match a regular expression and assign a value based on whether the match was successful. Except that’s not what the code does. Let’s take a look at the first part. I’ve reformatted it for readability.

    string paramName;
    var paramNameMatch = Regex.Match(message, @"Parameter name\:\s+(?.+)");
    if (paramNameMatch != null)
    { 
        paramName = paramNameMatch.Groups["ParamName"].Value;
    }
    else
    {
        paramName = string.Empty;
    }

The primary problem here is that Regex.Match never returns null!. Documentation says that if no match was found, the return value is a Match object that is equal to Match.Empty. In any case, the else clause will never be executed.

The code passed the programmer’s test only because trying to access a group that doesn’t exist returns an empty string. As documentation says:

If groupname is not the name of a capturing group in the collection, or if groupname is the name of a capturing group that has not been matched in the input string, the method returns a Group object whose Group.Success property is false and whose Group.Value property is String.Empty.

The code works by accident, which to me means that it’s a latent bug. Had we wanted to return something other than string.Empty in the else clause, this code would have been in error.

In this case, ReSharper told me that in this line:

    if (paramNameMatch != null)

The expression is always true. In other words, paramNameMatch will never be null.

The proper way to determine if a regular expression match was successful is to check the value of the Match.Success property. The above code is more correctly written as:

    string paramName;
    var paramNameMatch = Regex.Match(message, @"Parameter name\:\s+(?.+)");
    if (paramNameMatch.Success)    // will be false if no match was made.
    { 
        paramName = paramNameMatch.Groups["ParamName"].Value;
    }
    else
    {
        paramName = string.Empty;
    }

I find that overly verbose. Situations like this just beg for the ternary operator:

    var paramNameMatch = Regex.Match(message, @"Parameter name\:\s+(?.+)");
    string paramName = (paramNameMatch.Success) ? paramNameMatch.Groups["ParamName"].Value : string.Empty;

ReSharper also notified me of a real bug in this code:

    string actualValue;
    var actualValueMatch = Regex.Match(message, @"Actual value was (?.+)\.");
    if (paramNameMatch != null) { actualValue = paramNameMatch.Groups["ActualValue"].Value; }
    else { actualValue = string.Empty; }

ReSharper’s warning was that the variable actualValueMatch is assigned a value that is never used. Look closely: the second line assigns a value to actualValueMatch, but the next line checks the value of paramNameMatch. This code could not have worked. The programmer obviously didn’t test it.

Had the programmer who wrote this been using ReSharper and paying attention to what it was telling him, neither of these bugs would have been checked into source control. Most likely, the programmer would have seen ReSharper’s warnings immediately after he wrote the bugs, and would have fixed them before he even tried to compile the program.

Some people complain that ReSharper is expensive. I find that a ridiculous argument. A ReSharper subscription costs $239 per year. Or, if you don’t want upgrades you can buy a license for $300 and keep it for a few years before upgrading. If you’re writing code for pay, $300 is something like four or six hours of work. ReSharper saves me that much time every month! It helps me avoid mistakes when I’m writing code, and identifies trouble spots when I’m reviewing code. It also helps me with refactoring when I’m working on older code, and helps me navigate this very large solution I’ve inherited. I could do my work without it, but I’d rather not. It’d be like riding a horse to work: I’d get there eventually, but it’d be uncomfortable and would take too long. Buy the best tools you can afford. And if you’re making a living writing C# code, you can afford ReSharper.

Simplifying code with anonymous methods

One of the easiest ways to avoid writing and having to maintain duplicate code is to use an anonymous method or anonymous function. For example, I recently ran across some code that was validating dates in a UI application, and reporting errors if the dates were out of range. The code looked something like this:

    void DoStuff(MyModel model)
    {
        if (model.StartDate < DateTime.UtcNow)
        {
            var dt = ConvertUtcToLocal(model.StartDate, model.TimeZone);
            AddErrorMessage("Start Date " + dt.ToString("M/d/yy h:mm tt") + " is out of range.");
        }
        if (model.EndDate < DateTime.UtcNow)
        {
            var dt = ConvertUtcToLocal(model.EndDate, model.TimeZone);
            AddErrorMessage("End Date " + dt.ToString("M/d/yy h:mm tt") + " is out of range.");
        }
        // six more similar date/time validations
    }

Our client asked that the format be changed. Which meant going through and changing the format in eight different places. In the process I found two bugs in the code, both obvious copy-paste errors.

With a little bit of thought, the programmer who wrote this (or the last programmer to modify it) could have simplified things quite a bit and in the process made errors much less likely to occur. How? By creating an anonymous method that does the validation and error message formatting:

    void DoStuff(MyModel model)
    {
        var validateAndReport = new Action<string, DateTime>((fieldName, dt) =>
        {
            if (dt < DateTime.UtcNow)
            {
                var localDate = ConvertUtcToLocak(dt, model.TimeZone);
                AddErrorMessage(fieldName + " " + localDate.ToString("M/d/yy h:mm tt") + " is out of range.");
            }
        });
        
        validateAndReport("Start Date", model.StartDate);
        validateAndReport("End Date", model.EndDate);
        // etc.
    } 

Programmers are often hesitant to add simple methods like this outside of their functions because they think it pollutes the namespace, and they often require passing a bunch of parameters for context. Anonymous methods solve both problems: they’re invisible outside of the containing method, and they have access to all of the context that the containing method has access to. With this technique I ensure that all of the dates are formatted correctly, and I can modify the formatting by changing just one line of code. In addition, there are many fewer lines of code and they’re much easier to understand.

The only drawback I see is that programmers have to learn something new: “We can’t use this Action thing! Other programmers won’t understand it!” (Yes, somebody did say that to me.) Tough. Things change. To me and most other programmers I know, the modified code is clearly superior to the old code in every way that matters. I realize that there are some performance implications and hidden memory allocations involved. A few small allocations and a few microseconds aren’t going to make a difference in this code, or any other code in the project.

C# 7.0 introduced local functions: the ability to create a function within a function. This is more than just syntactical sugar around the anonymous function shown above. The local function still can capture local variables, but it avoids the allocation and slight performance hit of the anonymous method. The code above, if written to use a local function, would look something like this:

    void DoStuff(MyModel model)
    {
        void validateAndReport(string fieldName, DateTime dt)
        {
            if (dt < DateTime.UtcNow)
            {
                var localDate = ConvertUtcToLocak(dt, model.TimeZone);
                AddErrorMessage(fieldName + " " + localDate.ToString("M/d/yy h:mm tt") + " is out of range.");
            }
        };
        
        validateAndReport("Start Date", model.StartDate);
        validateAndReport("End Date", model.EndDate);
        // etc.
    }

Unfortunately, I’m not yet using C# 7.0, so I can’t take advantage of this new feature. I’m also unable to find the official language specification for version 7. However, the article Thoughts on C# 7 Local Functions gives some good information about how it works.

Pickle crack

When I was young–10 years old or even younger–Mom used to buy this dill pickle dip. It was creamy, tasty stuff with chunks of dill pickle in it. It was my favorite dip for potato chips. It was a real treat whenever Mom would get some or, when it became unavailable, she would make some from a recipe she obtained somewhere.

When Debra and I were planning a pool party here shortly after we bought the house, I got to thinking about that dip and called Mom to see if she still had the recipe. She did, and I wrote it down. I think I’ve since lost that piece of paper, but the recipe is real simple. Debra and I make it from time to time. Some friends say it’s addictive and they’ve named it “Pickle crack.”

Here’s what you need:

  • 16 ounces of cream cheese
  • 16 ounces of sour cream
  • A 24 ounce jar of kosher dill pickles
  • Garlic powder (not garlic salt!)
  • Dill weed (fresh if you like, but the dried stuff works great)
  • A mixing bowl that will hold a bit more than a quart
  • A fork (for mixing)
  • A knife (for cutting pickles)
  • A bag of potato chips (for taste testing)

It’s easier if you leave the cream cheese out of the refrigerator for an hour or two before you start. That’ll soften it up. But it’s not necessary.

Dump the cream cheese into the mixing bowl and start smashing it with the fork. I suspect there’s some technical term other than “smashing.” What I’m doing is making the stuff more spreadable. Otherwise it tends to clump. When the cream cheese is nice and creamy, add about half of the sour cream and mix it in. Once that’s thoroughly mixed, add the other half of the sour cream and mix it in until you have a nice uniform mixture.

Put that aside, open the pickle jar, and grab your knife. Start dicing pickles. 1/4 inch (6 mm) chunks are good. You can go smaller if you like. Or larger, if that’s your thing. I like to stay around 1/4 inch, but this is definitely a matter of taste. Debra likes to use the onion chopper to dice the pickles. That works, but I kind of like cutting them with a knife.

In any event, chop up a bunch of pickles and mix them in with the sour cream and cream cheese. I like my dip really chunky, so I typically chop all the pickles in the jar and throw them in. Others might like a little less pickle.

If the mixture is too thick (again, it’s a matter of taste), you can pour some pickle juice from the jar into the mixture to thin it a bit. Be careful, though. This is dip for potato chips, not for a veggie tray. Although I have used it for that once or twice. Carrots dipped in pickle dip is pretty good.

If you like garlic, add some garlic powder. Start with just a bit. Mix it in. Taste it. Maybe add some more. Same with the dill. If you want to accentuate the dill taste, throw in a bunch of dill weed. I suggest you exercise restraint, though. Mix the stuff up real good and then set it in the refrigerator overnight for the flavors to permeate. Pull it out in the morning and mix it up. Taste it. Add more seasoning as you see fit.

This stuff is quick to make and people really do seem to like it. Give it a try if you’re looking for a new taste treat for your potato or other chips.

 

Inconsistent error reporting

The systems I maintain call several different partner APIs to get player account information. We noticed recently that one of the APIs returns seemingly inconsistent error messages when presented with invalid player account numbers. For example, if we ask for information on account number 123456789, the system returns a 404, meaning that the account was not found in the system. But if ask for account number 3420859494, the server returns a 500: internal server error. Very strange.

At first I thought it was that the account number was too long. Typically, the account numbers are eight digits, and the ones giving us errors were 10 digits in length. But when I requested account number 1420859494 (10 digits), I got a 404.

On a hunch, I tried the magic number 2147483647. Not surprisingly, that returned a 404. But 2147483648 returned a 500 error. What the heck?

The magic of 2147483647 is that it’s the largest signed integer that can be expressed in 32 bits. It appears that the API tries to convert the input to a 32 bit integer. If that conversion fails, the server throws an exception and returns a 500 internal server error.

This is the kind of error that unit testing is designed to ferret out. Had they tested the edge cases, they would have seen this inconsistent behavior and, one would hope, modified the code to handle it more reasonably.

I don’t know what to do about this. Their documentation doesn’t say specifically that the input must be a positive 32-bit integer, only that the value must be numeric. Based on the empirical evidence, I could put some validation on my end to ensure that users don’t enter account numbers larger than 2147483647, but there’s nothing preventing the API from changing underneath me and allowing larger numbers in the future. It’s unlikely that those in charge of the partner API would notify me of such a change.

North Dakota Mexican food

Debra and I tried a new place for dinner the other night. When she asked me how my meal was I said, “North Dakota Mexican food.” She laughed and nodded. Considering that we were at a Vietnamese restaurant, that exchange probably deserves some explanation.

We flew to North Dakota back in 1992 to attend my grandparents’ 65th wedding anniversary celebration. When we arrived in Bismark, we learned that our bags didn’t make the transfer in Denver. The airline assured us that the bags would be on the next flight. Bismark wasn’t a thriving aviation hub at the time (not sure what it’s like now), so we had a few hours to kill.

In retrospect we should have known better, but we were hungry and frustrated. We saw a restaurant advertising Mexican food, and the parking lot was reasonably full. “How bad can it be?”

As it turns out, pretty bad. Worse than Taco Bell. It’s not that it tasted bad, but that it hardly tasted at all! It looked like Mexican food, but it tasted like … pretty much nothing. The ground beef was bland. The lettuce and tomatoes were standard tasteless restaurant fare. The cheese … well, there was lots of cheese. In fact, everything was covered in cheese. Like their version of Mexican food was a bunch of bland stuff covered in cheese. And the salsa? Pretty much just ground up stewed tomatoes with a little green onion thrown in. Corn chips were the cheapest, thinnest … I think you get the picture.

We had a similar experience about 10 years ago. We went to a little Mexican food place that several friends had recommended. The place was clean and bright and tastefully decorated, the service was good, and their food was indistinguishable from what we encountered in Bismark, North Dakota back in 1992. Except for the salsa. The salsa wasn’t good, but at least it had a little bite to it. With all the good Mexican food available in the Austin area, I honestly cannot understand how that place stays in business. Maybe there’s a bunch of North Dakota natives living nearby?

“North Dakota Mexican food” is our way of describing something that looks like what it’s trying to imitate, but has no taste or tastes completely wrong. I suppose the term could be used to describe things other than food, but I haven’t used it in that context.

Hacking the slot machine

Wired reports that a Russian group designed a brilliant slot machine cheat that they’ve used to bilk casinos out of millions of dollars. The article is sketchy on technical details, but there’s enough information there for me to speculate how it was done.

Understand, I don’t know anything about the internals of the software running on these machines, but I know enough about pseudorandom number generators, their use and misuse, to offer a plausible explanation of the vulnerability and how it’s exploited. I also know a few people in the industry. What I describe below is possible, and from my experience quite likely to have happened. Whether it’s exactly what happened, or if it’s even close to the mark, I have no way of knowing.

First, a little background.

In modern computer controlled slot machines (say, anything built in the last 30 years), the machine uses random numbers to determine the results of a spin. In concept, this is like rolling dice, but the number of possibilities is huge: on the order of about four billion. In theory, every one of those four billion outcomes is equally likely every time you roll the dice. That would be true in practice if the computer were using truely random numbers.

But computers are deterministic; they don’t do random. Instead, they use algorithms that simulate randomness. As a group, these algorithms are called pseudorandom number generators, or PRNGs. You can probably guess that PRNGs differ in how well they simulate true randomness. They also differ in ease of implementation, speed, and something called “period.” You see, a PRNG is just a mathematical way to define a deterministic, finite sequence. Given a starting state (called the seed), the PRNG will generate a finite set of values before it “wraps around” to beginning and starts generating the same sequence all over again. Period is the number of values generated before wrapping. If you know what PRNG is being used and you know the initial state (seed), then you know the sequence of numbers that will be generated.

The machines in question were purchased on the secondary market sometime before 2009. It’s probably safe to say that the machines were manufactured sometime between 1995 and 2005. During that era, the machines were almost certainly running 32 bit processors, and likely were generating 32 bit random numbers using PRNGs that maintained 32 bits of state. That means that there are 2^32 (four billion and change) possible starting states, each of which has a maximum period of 2^32. Overall, there are 2^64 possible states for the machine to be in. That’s a huge number, but it’s possible to compute and store every possible sequence so that, if somebody gave you a few dozen random numbers in sequence, you could find out where they came from and predict the next number. It’d take a few days and a few terabytes of disk storage to pre-compute all that, but you’d only have to do it once.

It’s likely that the PRNG used in these machines is a linear congruential generator which, although potentially good if implemented well, is easy to reverse-engineer. That is, given a relatively small sequence of generated numbers, it’s possible to compute the seed value and predict the next values in the sequence. All this can be done without knowing exactly which LCG algorithm is being used.

The hackers did have the source code of the program (or they disassembled the ROM), but they didn’t have access to the raw numbers as they were generated. Instead, they had to deduce the random number based on the outcome of the spin. But again, that just takes a little computation (okay, more than just a little, but not too much) time to create a table that maps reel positions to the random sequence.

My understanding is that slot machines continually generate random numbers on a schedule, even when the machine is idle. Every few milliseconds, a new number is generated. If it’s not used, then it’s discarded and the next number is generated. So if you know where you are in the sequence at a given time, then you can predict the number that will be generated at any time in the future. Assuming, of course, that your clock is synchronized with the slot machine’s clock.

If you refer back to the article, you’ll see that the agent who was working the machine would record the results of several spins, then walk away and consult his phone for a while before coming back to play. That phone consultation was almost certainly uploading the recorded information to the central site, which would crunch the numbers to determine where the machine was in the random sequence. The system knows which numbers in the sequence correspond to high payouts, so it can tell the phone app when to expect them. The agent then goes back to the machine and watches his phone while hovering his finger over the spin button. When the phone says spin, he hits the button.

The system isn’t perfect. With perhaps up to 200 random numbers being generated every second, and human reaction time being somewhat variable, no player will hit the big payout every time. But he’s increased his odds tremendously. Imagine somebody throwing one gold coin into a bucket of a million other coins, and another gold coin into a bucket of 200 other coins. You’re given the choice to blindly choose from one of the two buckets. Which would you choose from?

That might all sound complicated, but it’s really pretty simple in concept. All they did was create a map of the possibilities and devise a way to locate themselves on the map. Once you know where you are on the map, then the rest is a simple matter of counting your steps. Creating the map and the location algorithm likely took some doing, but it’s very simple in concept.

The above explanation is overly broad, I’ll admit, and I wave my hand over a number of technical details, but people at work with whom I’ve discussed this generally agree that this is, at least in broad strokes, how the hackers did it. Understand, I work at a company that develops slot machine games for mobile devices, and several of the programmers here used to work for companies that make real slot machines. They know how these machines work.

When I originally saw the article, I assumed that some programmer had made a mistake in coding or using the PRNG. But after thinking about it more critically, I believe that these machines are representative of the state of the art in that era (1995-2005). I don’t think there was a design or implementation failure here. The only failure would be that of not imagining that in a few years it would be possible for somebody who didn’t have a supercomputer to compute the current machine state in a few minutes and exploit that knowledge. This isn’t a story about incompetence on the part of the game programmers, but rather a story about the cleverness of the crooks who pulled it off. I can admire the technical prowess it took to achieve the hack while still condemning the act itself and the people who perpetrated it.

Welcome to the cesspool

Today President Trump signed an executive order titled REDUCING REGULATION AND CONTROLLING REGULATORY COSTS. This fulfills a campaign pledge to reduce burdensome regulation. On the face of it, I applaud the measure, especially the provision that says, “whenever an executive department or agency publicly proposes for notice and comment or otherwise promulgates a new regulation, it shall identify at least two existing regulations to be repealed.”

I suspect that the net effect of this order will be approximately zero, as far as business is concerned. In the first place, there are so many exceptions listed, it’s likely that any department head with a modicum of intelligence will get his proposals exempted from the new rules. And in the unlikely event that they do have to identify regulations to be repealed, they have a plethora of idiotic rules that are on the books and no longer enforced. It’ll take them years to clear that cruft from the books. So at best what we’ll see is large numbers of unenforced or unenforceable regulations being repealed. A Good Thing, no doubt, but not something that businesses will notice.

The Director of the Office of Management and Budget is tasked with specifying

“… processes for standardizing the measurement and estimation of regulatory costs; standards for determining what qualifies as new and offsetting regulations; standards for determining the costs of existing regulations that are considered for elimination; processes for accounting for costs in different fiscal years; methods to oversee the issuance of rules with costs offset by savings at different times or different agencies; and emergencies and other circumstances that might justify individual waivers of the requirements …”

It looks to me like the president’s order creates more regulations and more work, which probably will require increased expenses. I wonder if his order is subject to the new 1-for-2 rule.

I mentioned exceptions above. The order exempts:

  • regulations issued with respect to a military, national security, or foreign affairs function of the United States
  • regulations related to agency organization, management, or personnel
  • any other category of regulations exempted by the Director (of the OMB)

A savvy department head can probably make a good argument that any new regulation fits one of the first two. Failing that, being “in” with the Director of OMB will likely get you a pass.

Also, Section 5 states that the order will not “impair or otherwise affect”

  • the authority granted by law to an executive department or agency, or the head thereof
  • the functions of the Director relating to budgetary, administrative, or legislative proposals

Oh, and the last part, Section 5(c) says:

“This order is not intended to, and does not, create any right or benefit, substantive or procedural, enforceable at law or in equity by any party against the United States, its departments, agencies, or entities, its officers, employees, or agents, or any other person.”

In other words, this isn’t Law, but rather the president’s instructions to his subordinates.

The dog can’t bite; it can hardly growl. But the president can say that he “did something about the problem,” and thus get marks for keeping a campaign pledge.

So much for draining the swamp. This is the way things have been done in Washington for decades. Make a big deal signing a regulation with a feel-good title, but that does nothing (or, worse, does exactly the opposite of what you would expect), bask in the praise of your supporters, and then go about business as usual.

Welcome to the cesspool, Mr. President.

Tales from the trenches: It’s just a warning

Back in the bad ol’ days when we were writing Windows programs in C and directly calling the Windows API, the compiler would produce many warnings about one’s code. Conventional wisdom was to ignore the warnings. “If it was a real problem, the compiler would have issued an error.”

The idea behind compiler warnings was that they pointed out potential programming errors. For example, due to integer overflow, a conversion from unsigned integer to signed integer could be a problem if the unsigned int were large enough. A single compile might produce thousands of warnings, most of which weren’t actual problems. As a result, compiler warnings were treated like The Boy Who Cried Wolf.

But not all warnings are alike, and even some of those unsigned-to-signed warnings really were important. One day I realized that the compiler had produced warnings for some of the bugs that I’d been encountering, and I began writing my code to avoid the warnings. Surprisingly enough, I began to see fewer bugs in my programs.

Some time after I’d had my epiphany, I got a contract to rescue a failing project. The program was “feature complete,” but incredibly unstable. There was a huge number of bugs, and the project was falling further behind schedule. The team had experienced a lot of turnover, including the original lead and several of the senior developers, and they couldn’t give a reliable estimate of when they’d have the program working.

The first day I was there, I got my development system set up and compiled the program. As expected, there were over 1,000 compiler warnings. I spent a few hours looking through the code in question and found a half dozen warnings that were pointing out real problems with the code, and proved that modifying the code to eliminate the warnings resolved some reported bugs. The next day I met with the development team.

If you ever wrote Windows programs back then, you can imagine the response to my proclamation: “The first thing we’re going to do is eliminate all of the compiler warnings.”

“But there’s more than 1,000 warnings!”

“That will take us forever!”

And, the one I was waiting for: “They’re just warnings. If it was important the compiler would have issued an error.”

At that point, I trotted out my half dozen examples of warnings that pointed to actual program bugs. That got everybody’s attention.

It took us a week to eliminate those thousand-plus compiler warnings. In the process, we resolved dozens of reported bugs that were directly attributable to the warnings. By the end of the week the development team was in good spirits and the project manager thought I was a miracle worker.

The following Monday I met with the development team and said, “Now we’re going to Warning Level four.”

And I thought the previous week’s protests were bad! I had compiled the program on Level 4 and got over 2,000 warnings. Many of those truly are spurious: unreachable code, unused parameters and local variables, etc. But some of those warnings are important. For example, the compiler would issue a warning if the code used a variable before it was initialized. I had found a bug in the program that was caused by referencing an uninitialized pointer, but that warning only occurs on Level 4.

It took almost two weeks, and the development manager had to “counsel” one programmer who, rather than fixing the offending code, was instructing the compiler not to issue the warnings.

Not surprisingly, we resolved another few dozen bugs in the process and the code was a lot cleaner, too. A month after I started, we had a solid schedule to fix the remaining issues, and the product shipped six weeks later. The development team was happy and the manager thought I was a miracle worker. My contract had been for three months (13 weeks) but they let me go after 10 weeks, with an extra week’s pay as a bonus. To top things off, the manager set me up with a friend of his who also had a troubled project.

It was a good gig for a while, getting paid for making people pay attention to what their compilers were telling them. Good pay, easy work, and people thinking I was some kind of genius. It’s a weird world we live in.

The moral of the story, of course, is that you shouldn’t ignore compiler warnings. A more important lesson, which the C# team took to heart, is that compiler warnings are ambiguous. The C# language specification is much more rigorous than is the C language specification, and most things that were “just warnings” in C are definite errors in C#. For example, referencing an uninitialized variable is an error. There are fewer than 30 compiler warnings in the latest version (Visual Studio 2015) of the C# compiler. Contrast that to the hundreds of different warnings in a C or C++ compiler.

In my work, I compile with “Warnings as errors.” If the compiler thinks something is important enough to warn me about it, then I’m going to fix the code so that the compiler is happy with it. It’s usually an easy fix. I almost never have to disable the warning altogether.

Subtraction is not the same as comparison

I tracked down a bug the other day that really surprised me. Again, not because of the code itself, but rather that the person who wrote the code should have known better.

If you want to sort a list of a user-defined type in C# (and other languages), you need a comparison method. There are several different ways to do that, perhaps the simplest being to create a comparison delegate. For example, if you have this class definition:

    public class Foo
    {
        public int Id {get; private set;}
        public string Name {get; set;}
        // constructor and other stuff here . . .
    }

And a list of those:

    List<Foo> MyList = GetFooThings(...);

Now, imagine you want to sort the list in place by Id. You can write:
    MyList.Sort((x,y) => { return x.Id.CompareTo(y.Id); });

CompareTo will return a value that is:

  • Less than 0 if x.Id < y.Id
  • Equal to 0 if x.Id = y.Id
  • Greater than 0 if x.Id > y.Id

The code I ran into was similar, except the programmer thought he’d “optimize” the function. Rather than incurring the overhead of calling Int32.CompareTo, he took a shortcut:

    MyList.Sort((x,y)) => { return x.Id - y.Id; });

His thinking here is that the result of subtracting the second argument from the first will contain the correct value. And that’s true most of the time. It’s not at all surprising that his testing passed. After all, 12 - 3 returns a positive number, 300 - 123456 returns a negative number, and 99 - 99 returns 0.

But he forgot about negative numbers and integer overflow. Subtracting a negative number is like adding a positive number. That is, 4 - (-1) = 5. And in the artificial world of computers, there are limits to what we can express. The largest 32-bit signed integer we can express is Int32.MaxValue, or 2,147,483,647. If you add 1 to that number, the result is Int.MinValue, or -2,147,483,648. So the result returned by the comparison delegate above, when x is a very large number and y is a sufficiently small negative number is incorrect. It will report, for example, that 2,147,483,647 is smaller than -1!

There’s a reason that Int32.CompareTo is written the way it is. Here it is, directly from Microsoft’s Reference Source:

    public int CompareTo(int value) {
        // Need to use compare because subtraction will wrap
        // to positive for very large neg numbers, etc.
        if (m_value < value) return -1;
        if (m_value > value) return 1;
        return 0;
    }

It seems like I run into these fundamental errors more often now than I did in the past. I’m yet sure of the reason. I’ve been active in the online programming community for 30 years, and have seen a lot of code written by programmers of all skill levels. Early on, the average skill level of the programmers I met online was much higher than it is today. These days, every Java programming student has access to Stack Overflow, and I see a lot of these rookie mistakes. But I also see many such mistakes made by more experienced programmers.

I’m wondering if the difference is one of education. When I started programming, we learned assembly language early on and actually used it to write real programs. Two’s complement integer representation was a lesson we learned early, and integer overflow was something we dealt with on a daily basis. We would not have made this mistake. No, we made much more interesting mistakes.

This error can be caught, by the way. If you enable runtime overflow checking in C#, the runtime will throw an exception if the result of an arithmetic operation results in overflow or underflow. But that check is almost always turned off because it can have a large negative impact on performance, and in many places we depend on integer overflow for our algorithms to work correctly. It’s possible to disable overflow checking for a specific expression or block of code, but in my experience that functionality is rarely used. You can also disable the check globally but enable it for specific expressions or blocks of code, but doing so requires that you understand the issue. And if a programmer understood the issue, he wouldn’t have written the erroneous code in the first place.

Errors like this can exist in a working program for years, and then crop up at the most inopportune time when somebody uses the code in a new way or presents data that the original programmer didn’t envision. As much as our languages and tools have evolved over the past 30 years, in some ways we’re still distressingly close to the metal.

Pairing heap representation

Up to this point, I’ve been showing paring heap nodes as having a child list. Conceptually, the node structure looks like this:

    class HeapNode
        public int Data;
        public HeapNode[] Children;
    }

That’s a good conceptual model, but implementing that model can be unwieldy, and can consume a huge amount of memory. When working with small objects in managed languages like .NET, memory allocation overhead can easily exceed the memory used for data. That’s especially true of arrays, which have a 56-byte allocation overhead.

Granted, not all nodes in a pairing heap have children. In fact, at most half of them will. So we can save memory by not allocating an array for the children if there aren’t any. But that adds some complexity to the code, and at best saves only half of the allocation overhead.

Using the .NET List<T> collection doesn’t help, because List<T> uses an array as the backing store. LinkedList<T> will work, but involves its own overhead what with having to manage LinkedListNode instances.

In short, managing a per-node list of children can be difficult.

In my introduction to the Pairing heap, I showed this figure:

    2
    |
    8, 7, 3
    |     |
    9     4, 5
          |
          6

2 is the root of the tree. It has child nodes 8, 7, and 3. 9 is a child of 8. 4 and 5 are children of node 3, and 6 is a child of 4.

More traditionally, that tree would be drawn as:

                   2
                 / | \
                8  7  3
               /     / \
              9     4   5
                   /
                  6

That’s the traditional view of the tree. But we can also say that 8 is the first child of 2, 7 is the sibling of 8, and 3 is the sibling of 7. That is, every node has a reference to its first child, and to its next sibling. The node structure, then, is:

    class HeapNode
        public int Data;
        public HeapNode FirstChild;
        public HeapNode Sibling;
    }

As it turns out, there’s a well known binary tree structure called Left-child-right-sibling. Any traditional tree structure can be represented as such a binary tree. Our tree above, when represented as a left-child-right-sibling binary tree, becomes:

                 2
                /
               8
              / \
             9   7
                  \
                   3
                  /
                 4
                / \
               6   5

You might notice that this structure bears striking similarity to the original figure from my introduction. It’s the same thing, only rotated 45 degrees clockwise.

As you can see, this builds a very unbalanced binary tree. That’s okay, since we’re not searching it. With pairing heap, all of the action is at the first few tree levels of the tree. A deep tree is good, because it means that we rarely have many siblings to examine when deleting the smallest node.

Implementing a Pairing heap in a left-child-right-sibling binary tree is incredibly easy. Next time I’ll show a simple implementation in C#.

But that’s the way we’ve always done it!

It’s surprising how many Computer Science articles and textbooks, when showing how to implement a binary heap in an array, present code that has the root node at index 1 in C-like languages where the first element of an array is at index 0. In those implementations, element 0 in the array is unused.

This bothers me, not because the code wastes an insignificant amount of memory, but because it leads to a lot of confusion among students and junior programmers who are trying to implement a binary heap. Off by one errors abound, the first being in allocating the heap array. Here’s a common error that occurs when allocating a heap to hold 100 integers.

    int[] heap = new int[100];

We’re conditioned from the very start of our programming education to begin counting at 0. Every time we have stuff in an array or list, the first thing is at index 0. Every array-based algorithm we work with reinforces this lesson. We’re taught that some languages used to start at 1, but those heretical languages have nearly been eliminated from the world. And then they foist this folly on us: a heap’s root is at index 1.

We’re taught that 100 items means indexes 0 through 99. It’s beat into our heads on a daily basis when we’re learning programming. Then they trot out this one exception, where we have to allocate an extra item and count from 1 to 100 rather than 0 to 99 like normal programmers do.

Some people say, “but if you start at 0, then the calculations to find the children won’t work.” Well, they’re partially right. If the root is at index 1, then the left child of the node at index x is at index (x * 2), and the right child is at index (x * 2) + 1. The parent of the node at index x is at index x/2. They’re right in that those calculations don’t work if you move the root to index 0. But the changes required to make the calculations work are trivial.

If the root is at index 0, then the left child is at (2 * x) + 1, right child at (2 * x) + 2, and parent at (x-1)/2.

The hard core optimizer will tell you that the extra addition in computing the left child, and the extra subtraction when computing the parent will incur a big performance hit. In truth, in the context of a real, working, computer program, the performance difference is down in the noise. It’s highly unlikely that your program’s overall performance will suffer from the addition of a couple of ADD or SUB instructions in a binary heap implementation. If it does, then you’re doing something horribly wrong. Doing something stupid in order to save a few nanoseconds here and there is … well … stupid.

No, I think the real reason we continue this madness is historical. The first known reference to a binary heap is in J.W.J. Williams’ implementation of Heapsort (Williams, J.W.J. (1964), ‘Algorithm 232: Heapsort’, Communications of the ACM 7 (6), 347-348). Yes, that’s right: 52 years ago. His code sample, like all ACM code samples at the time, was written in Algol, a language in which array indexes start at 1.

Textbooks with code samples in Algol and, later, Pascal, perpetuated the idea that the root of a binary heap must be at index 1. It made sense, what with 1 being the first index in the array. When those books were revised in the 1990s and the examples presented in C, Java, C++, etc., a literal conversion of the code maintained the 1-based root even though arrays in those languages are 0-based. Nobody stopped to consider how confusing that empty first element can be to the uninitiated.

What I find disturbing is that whoever did the code conversions almost certainly ran into an off by one problem when testing the examples. But rather than spend time to rewrite the code, they “fixed” the problem by allocating an extra item, and maybe explained it away in the text as something that just has to be done to keep the root at index 1. Those few who actually understood the issue seem to have been too lazy to make the correction, opting instead to explain it away as a performance issue.

In so doing, they’ve done their audiences a grave disservice. I find that inexcusable.

An interesting interview question

I ran across a problem today that I think would make an excellent interview question.

You work in a warehouse that has N storage compartments, labeled D-1 through D-N. Each storage compartment can hold one shipping crate. There’s also a temporary holding area, called D-T, that can hold a single crate. There also are N crates, labeled C-1 through C-N, each of which is in a storage compartment.

There is a forklift that can pick up a single crate, move it to an empty position, and put it down.

The owner, after touring the facility, decreed that the crates must be rearranged so that crate C-x is in space D-x (C-1 in D-1, C-2 in D-2, … C-N in D-N).

What algorithm would you employ to rearrange the crates in the minimum number of moves?

This is a fairly simple problem. If you pose it as a “balls and bins” problem, with a physical model to play with, most people will develop the optimum algorithm in a matter of minutes. But pose it as a programming problem and a lot of programmers have trouble coming up with the same algorithm.

If you’re interested in trying the problem yourself, cut five squares (crates) from a piece of paper, and label them C1 through C5. Draw six boxes on another sheet of paper, and label them D1 through D6. Now, arrange the crates on the spaces in this order: C4, C1, C5, C3, C2. Leave D6 empty. Now, rearrange them. Remember, if you pick up a card you have to place it in an empty spot before you can pick up another card.

So if this problem is so simple, why is it such an interesting interview question and why do programmers have trouble with it?

One reason is that it’s not really a sorting problem. At least not in the traditional sense. You see, I’m not asking the candidate to write a sort. In fact, I’m not asking him to write a program at all. I’m asking him to develop an algorithm that will rearrange the crates in a specific order, given the constraints. The candidate’s first reaction tells me a lot about how he approaches a problem. Granted, I have to take into account that the candidate will expect a whiteboard programming problem, but if his immediate reaction is to stand up and start writing some kind of sort method, I can pretty much guarantee that he’s going down the wrong track.

The algorithm that uses the fewest moves is not one that you would typically write. It’s either computationally expensive, or it uses additional extra memory. And there’s a well known efficient solution to sorting sequential items. That algorithm works well in the computer, but is less than optimum when translated to the physical world where it takes time for a forklift to pick up an item and move it.

The well known algorithm starts at the first storage compartment, D-1. If the item in that position is not crate C-1, it swaps the crate with the crate in the position that it needs to be in. It continues making swaps until C-1 is in D-1, and then moves on to compartment D-2 and does the same thing. So, for example, if you have five crates that are arranged in the order [4,1,5,3,2,x] (the x is for the empty spot, initially D-T) the sequence of swaps is:

    Swap 4 with 3 (the item that's in position D-4), giving [3,1,5,4,2,x]
    Swap 3 with 5, giving [5,1,3,4,2,x]
    Swap 5 with 2, giving [2,1,3,4,5,x]
    Swap 2 with 1, giving [1,2,3,4,5,x]

At which point the programmer says, “Done!”

The programmer who develops this solution and calls it done might think again if he re-reads the problem statement. Note that nowhere did the algorithm use the temporary location, D-T. Either the programmer will smell a rat, or he’ll dismiss that temporary location as a red herring.

As it turns out, it’s not a red herring. A swap is actually a sequence of three separate operations. Swapping the contents of D-1 and D-4, for example, results in:

  1. Move the contents of D-1 to a temporary location.
  2. Move the contents of D-4 to D-1.
  3. Move the contents of the temporary location to D-1.

The well known solution to the problem–the solution that is in many ways optimum on the computer–results in 12 individual moves:

  1. Move crate C-4 from D-1 to D-T.
  2. Move crate C-3 from D-4 to D-1.
  3. Move crate C-4 from D-T to D-4. (Crate C-4 is in position.)
  4. Move crate C-3 from D-1 to D-T.
  5. Move crate C-5 from D-3 to D-1.
  6. Move crate C-3 from D-T to D-3. (Crate C-3 is in position.)
  7. Move crate C-5 from D-1 to D-T.
  8. Move crate C-2 from D-5 to D-1.
  9. Move crate C-5 from D-T to D-5. (Crate C-5 is in position.)
  10. Move crate C-2 from D-1 to D-T.
  11. Move crate C-1 from D-2 to D-1. (Crate C-1 is in position.)
  12. Move crate C-2 from D-T to D-2. (Crate C-2 is in position.)

In the worst case, of which this is an example, the algorithm makes N-1 swaps, or 3N-3 moves.

At this point, it might be useful to have a discussion about what problem we’re really trying to solve. The problem asked for the minimum number of moves. A reminder that we’re talking about a physical process might be in order. To programmers who are used to things happening almost instantly, a few extra moves here and there don’t really make a difference. The solution presented above asymptotically optimum in that the number of moves required is linearly proportional to the number of crates. That’s typically a good enough answer for a programming question. But, again, this isn’t really a programming question. One can assume that the client wants to make the minimum number of moves because he has to pay a forklift operator $15 per hour and it takes an average of five minutes for every move. So if he can reduce the number of moves from 12 to 6, he saves $30.

The solution that most people presented with the balls and bins problem develop does things a bit differently. Rather than starting by moving crate C-4 to its rightful place, the algorithm starts by emptying the first bin (i.e. moving C-4 to D-T), and then filling the empty location with the item that belongs there. Then, it fills the new empty slot with the element that belongs there, etc. Let’s see how it works, again starting with [4,1,5,3,2,x].

  1. Move crate C-4 from D-1 to D-T.
  2. Move crate C-1 from D-2 to D-1. (Crate C-1 is in position.)
  3. Move crate C-2 from D-5 to D-2. (Crate C-2 is in position.)
  4. Move crate C-5 from D-3 to D-5. (Crate C-5 is in position.)
  5. Move crate C-3 from D-4 to D-3. (Crate C-3 is in position.)
  6. Move crate C-4 from D-T to D-4. (crate C-4 is in position.)

That took six moves, or half of what the “optimum” solution took. That’s not quite a fair comparison, though, because that’s not the worst case for this algorithm. The worst case occurs when moves result in a cycle that leaves D-T empty before all of the items are in order. Consider the sequence [2,1,5,3,4,x]. Using our new algorithm, we start with:

  1. Move crate C-2 from D-1 to D-T.
  2. Move crate C-1 from D-2 to D-1. (Crate C-1 is in position.)
  3. Move crate C-2 from D-T to D-2. (Crate C-2 is in position.)
    At this point, our temporary spot is empty, so we need to make a new empty spot by moving the first out of place crate to D-T.
  4. Move crate C-5 from D-3 to D-T.
  5. Move crate C-3 from D-4 to D-3. (Crate C-3 is in position.)
  6. Move crate C-4 from D-5 to D-4. (Crate C-4 is in position.)
  7. Move crate C-5 from D-T to D-5. (Crate C-5 is in position.)

It’s interesting to note that the swapping algorithm requires three swaps (nine moves) to rearrange items given this initial arrangement.

As far as I’ve been able to determine, the worst case for this algorithm requires 3N/2 moves. It will never make more moves than the swapping algorithm, and often makes fewer.

The point of posing the problem to the candidate is to get him to develop an approach to the problem before he starts coding.

Another interesting thing about this question is that it allows for a follow-on question. Assuming that the candidate develops the optimum algorithm, ask him to write code that, given the starting position, outputs the moves required to rearrange the crates. In other words, implement the algorithm in code.

At this point, he’ll have to make a decision: use sequential search to locate specific crates, or create an index of some kind to hold the original layout so he can quickly look up an individual crate’s location. Again, he should be asking questions because the requirements are deliberately ambiguous. In particular, he might think that he can’t use the index because we only allowed him one extra swapping space for the crates. But again, the original problem statement doesn’t mention a computer program, and the follow-on question doesn’t place any restrictions on the memory consumption.

I’ll leave implementation of the algorithm as an exercise for those who are interested. It’s not difficult, just a bit odd for those who are used to writing traditional sorting algorithms.

Ultimately, the reason I find this problem a potentially useful interview question is because it forces the candidate to differentiate between the customer’s requirements (the optimum sequence of steps required to arrange his crates), and then internal implementation details. The extent to which the candidate can make that differentiation tells me a lot about how she will approach the real-world problems that we’ll present to her on a daily basis. In addition, the simple problem is rife with opportunities to have other discussions about approaches to problems and common coding tradeoffs.

It occurred to me as I was writing this that I should construct a balls and bins prop that I can trot out if the candidate doesn’t trip to the algorithm relatively quickly. It would be interesting to see how quickly the candidate, if given the prop, would discover the optimum algorithm.

I’m kind of excited about getting an opportunity to try this out.

Null is evil, Chapter 9346

If we’ve learned anything from 45 years of C coding, it’s that NULL is evil. At least relying on NULL return values is usually evil, especially when it forces you to write contorted special case handling code. I ran across a case in point today that surprised me because the programmer who wrote it should have known better. And for sure the senior programmer who reviewed the code should have flagged it for rewriting.

There is a method that takes an existing object and updates it based on fields in another object. The code writes a change log entry for each field updated. The code is similar to this:


    private bool UpdatePerson(Person person, Models.Person personUpdate)
    {
        var changes = new List<ChangeLogEntry>();
        
        changes.Add(CompareValues(person.FirstName, personUpdate.FirstName, "First Name"));
        changes.Add(CompareValues(person.LastName, personUpdate.LastName, "Last Name"));
        // ... other fields here
        
        changes = changes.Where(x => x != null).ToList();  // 

This is Just Wrong. It works, but it’s wonky. The code is adding null values to a list, knowing that later they’ll have to be removed. Wouldn’t it make more sense not to insert the null values in the first place?

One way to fix the code would be to write a bunch of conditional statements:


    ChangeLogEntry temp;
    temp = CompareValues(person.FirstName, personUpdate.FirstName, "First Name");
    if (temp != null) changes.Add(temp);
    
    temp = CompareValues(person.LastName, personUpdate.LastName, "Last Name");
    if (temp != null) changes.Add(temp);
    
    // etc., etc.

That’s kind of ugly, but at least it doesn’t write null values to the list.

The real problem is the CompareValues function:


    private ChangeLogEntry CompareValues(string oldValue, string newValue, string columnName)
    {
        if (oldValue != newValue)
        {
            var change = new ChangeLogEntry(columnName, oldValue, newValue);

            return change;
        }

        return null;
    }

Yes, it’s poorly named. The method compares the values and, if the values are different, returns a ChangeLogEntry instance. If the values are identical, it returns null. Perhaps that’s the worst thing of all: returning null to indicate that the items are the same. This model of returning null to indicate status is a really bad idea. It forces clients to do wonky things. Of all the horrible practices enshrined in 40+ years of C-style coding, this one is probably the worst. I’ve seen and had to fix (and, yes, written) too much terrible old-style C code to ever advocate this design pattern. null is evil. Avoid it when you can.

And you usually can. In this case it’s easy enough. First, let’s eliminate that CompareValues function:


    if (person.FirstName != personUpdate.FirstName) changes.Add(new ChangeLogEntry(person.FirstName, personUpdate.FirstName, "First Name"));
    if (person.LastName != personUpdate.LastName) changes.Add(new ChangeLogEntry(person.LastName, personUpdate.LastName, "Last Name"));

That’s still kind of ugly, but it eliminates the null values altogether. And we can create an anonymous method that does the comparison and update, thereby removing duplicated code. The result is:


    private void UpdatePerson(Person person, Models.Person personUpdate)
    {
        var changes = new List<ChangeLogEntry>();
        
        // Anonymous method logs changes.
        var logIfChanged = new Action<string, string, string>((oldValue, newValue, columnName) =>
        {
            if (oldValue != newValue)
            {
                changes.Add(new ChangeLogEntry(columnName, oldValue, newValue));
            }
        }
        
        logIfChanged(person.FirstName, personUpdate.FirstName, "First Name");
        logIfChanged(person.LastName, personUpdate.LastName, "Last Name");
        // ... other fields here

        if (changes.Any())
        {
            // update record and write change log
        }
    }

That looks a lot cleaner to me because there are no null values anywhere, and no need to clean trash out of the list.

Here’s the kicker: I didn’t rewrite the code. If the code were as simple as what I presented here, I probably would have changed it. But the code is a bit more complex and it’s currently running in a production environment. I hesitate to change working code, even when its ugliness causes me great mental pain. I’m almost certain that I can make the required changes without negatively affecting the system, but I’ve been burned by that “almost” too many times with this code base. If I had a unit test for this code, I’d change it and depend on the test to tell me if I screwed up. But there’s no unit test for this code because … well, it doesn’t matter why. Let’s just say that this project I inherited has a number of “issues.” The one I’ll relate next time will boggle your mind.

Categories

A sample text widget

Etiam pulvinar consectetur dolor sed malesuada. Ut convallis euismod dolor nec pretium. Nunc ut tristique massa.

Nam sodales mi vitae dolor ullamcorper et vulputate enim accumsan. Morbi orci magna, tincidunt vitae molestie nec, molestie at mi. Nulla nulla lorem, suscipit in posuere in, interdum non magna.