Wikipedia: Trust, but verify

A recent question on Stack Overflow asked why Quicksort is called Quicksort, even though it sometimes exhibits O(n2) behavior whereas Merge sort and Heap sort are both O(n * log(n)) in all cases. (For those unfamiliar with the terminology, he’s asking why it’s called “Quicksort,” when other algorithms are theoretically faster.) It’s a reasonable question for a beginner to ask.

One person commented that the answer to the question can be found in the Quicksort article on Wikipedia. But the person asking the question rejected that source because “My professor said Wikipedia is not an authentic site.”

I doubt that’s what the professor said, because a college professor telling students not to use Wikipedia would be, at best, irresponsible. I suspect that the professor said Wikipedia should not be cited as a primary source in formal writing, and the student took that to mean Wikipedia is not a reliable source of information. It seems a prevalent opinion among many. I know intelligent people who will not use Wikipedia. At all. For anything. I cannot understand that behavior.

In my experience, Wikipedia is a highly reliable source of information on fact-based topics (historical events, general science, algorithms and data structures, mathematics, astronomy, etc.), and a good introduction when it comes to political, religious, or other emotionally-charged topics. If I want to know what something is or was, then Wikipedia probably has an article about it, and that article will be accurate enough for me to get the general idea. Quickly skimming an article gives me a shallow understanding: enough that I don’t feel completely ignorant when somebody starts talking to me about the topic.

More critical reading not only supplies details, but also gives me a feel for points of controversy. Because Wikipedia articles, especially articles on current events or highly charged topics, are often edited by multiple people, it’s difficult for any single person or group to present a completely one-sided viewpoint. It’s also fairly easy to determine when a topic has been edited to remove any hint of controversy.

The best thing about Wikipedia is that it contains mostly reliable articles on almost every topic of any import. The second best thing is that those articles cite their sources. Every article has a “references” section in which it lists the sources for factual statements made in the article. If a factual statement does not have a reference, there is an annotation saying so. If the article lacks references in general, there’s a big note at the top of the article saying so.

With the references listed, one can easily spend a few minutes reading the primary source material to get more details, or to see if the article presents an accurate summation of the material. A Wikipedia article, like a well-written research paper, is verifiable. Contrast that with a printed encyclopedia, which rarely cites sources.

There are pretty high standards for Wikipedia articles, and the community of editors does a very good job of ensuring that articles meet those standards. If an article does not, there are prominent annotations in the text. If an article appears to be opinion-based, presents only one side of a controversial topic, lacks references, appears to be purely anecdotal, or runs afoul of any number of standards, there is a clearly marked indication of that in the article itself. I’ve seen things like, “this article lacks references” at the top of an article. Or “reference needed” when an unsupported assertion is made. Countless articles say, “This article does not meet standards.”

In short, a Wikipedia article gives you all the tools you need to gauge the reliability of the information presented. Certainly much more than a newspaper, television news channel, printed encyclopedia, magazine article, random Web site, your favorite politician, etc. Of those, only Wikipedia makes it a point to give you the resources you need to verify the information that’s presented.

Other than scientific journals, I can’t think of a general information source that I trust more than Wikipedia. It’s the first stop I make if I want to learn about anything.

Not that I take everything on Wikipedia as gospel. Like any other source of information, there are inadvertent mistakes and deliberate falsehoods in Wikipedia articles. But my experience has been that the number of such incidents is much smaller than in any other source of general information that I’m familiar with. More importantly, such things are typically identified and corrected very quickly.

trust information from Wikipedia articles, but I also verify it through other means. By contrast, I’ve come to distrust information from most news sources, and use other means to determine the veracity of their articles. The primary differences here are that Wikipedia articles tell me where I can verify the information, and I’m usually able to verify it. News outlets pointedly do not reveal their sources, and very often I find that their versions of events are, to be kind, not entirely accurate.

How to generate unique “random-looking” keys

A common question on Stack Overflow goes something like this:

I would like help designing an algorithm that takes a given number, and returns a random number that is unrelated to the first number. The stipulations being that a) the given output number will always be the same for a similar input number, and b) within a certain range (ex. 1-100), all output numbers are distinct. ie., no two different input numbers under 100 will give the same output number.

or

I’m encoding 64-bit numbers using an alphabet. But since the binary representations of the numbers are more or less sequential, a simple mapping of bits to chars results in very similar strings. I’m looking for ways to make the output more “random”, meaning more difficult to guess the input when looking at the output string.

There are variants, of course, but they’re all pretty much the same question. The programmer asking the question is looking for a way to obfuscate sequential keys. That is, there is code that generates sequential keys for some kind of record, but the programmer wants to hide the fact that they’re sequential. So if the nth key generated is G75AB, he wants the (n+1)th to be, perhaps, XQJ57 or XYZZY; certainly not G75AC, which makes it obvious that the keys are sequential.

A related Stack Overflow question is this:

What is the quickest way to check that the random string I’ve generated hasn’t been generated by me before? I’m curious as to how places like imgur (which identifies each image by a unique 5 digit pin) carry this out. Clearly, one would have at the least an O(n) solution (or at best O(n log (n))depending on the algorithm), but since I’m expecting n to be millions, this is going to be an infeasible solution.

This programmer wants to generate “random” keys. He’s decided that the way to generate unique keys is to just generate a random string, see if it’s been used, and if not go back and generate another. See How not to generate unique codes for an explanation of why random selection is an unsupportable algorithm, and How did this happen? for reasons why you shouldn’t be generating random strings.

The best way I know of to generate unique “random looking” keys is by obfuscating sequential numbers. You start with your first key, 1, obfuscate it, and encode the obfuscated number. Then increment the key and do it again.

Encoding is easy: it’s just a base conversion. One common encoding is base-64. In base-64, the integer 1795368205 is encoded as “DSUDaw”. That’s a nice encoding method, but does nothing to hide the sequential nature of keys. The number 1795368206 becomes “DiUDaw”. Anybody who understands base-64 and how numbers are stored in a computer can quickly determine that those two strings represent sequential keys.

Obfuscation can take many forms. One simple but not particularly effective obfuscation would be to invert the order. That is, imagine you’re generating keys from 1 to 100. You could subtract the sequential number from 101. So the first key value you pass to the encoding method is 100. The next is 99, etc. Somebody looking at your keys would assume that you started at 100 and went down to 1. As I said, not particularly effective.

An effective obfuscation technique would take your sequential number and somehow turn it into another number that is wildly different. So 0 becomes, perhaps, 147,486,932. And 1 becomes 46,231. 2 becomes 186,282. There’s no way a casual observer could understand that the encoded versions of those numbers represent sequential keys. But how do you do that?

The answer is a particularly cool bit of math called a modular multiplicative inverse. Despite the name, it’s not a magical incantation. Eric Lippert explains it well in his blog post, Math from scratch, part thirteen: multiplicative inverses. (In case you’re wondering, the acronym he uses in that second sentence, “WOLOG” means, “Without loss of generality“.)

One of the interesting things about the article is that it shows how to create a function that will obfuscate sequential keys that are reversible. That is, I can take a number, obfuscate it by applying a simple bit of math, and then de-obfuscate it at a later time by applying a similarly simple bit of math. The code below, which is just a slight modification of what Lippert presents in his blog post, illustrates that.

    private void DoIt()
    {
        const long m = 101;         // Number of keys + 1
        const long x = 387420489;   // must be coprime to m

        // Compute the multiplicative inverse
        var multInv = MultiplicativeInverse(x, m);

        // HashSet is used to hold the obfuscated value so we can ensure that no duplicates occur.
        var nums = new HashSet<long>();

        // Obfuscate each number from 1 to 100.
        // Show that the process can be reversed.
        // Show that no duplicates are generated.
        for (long i = 1; i <= 100; ++i)
        {
            var obfuscated = i * x % m;
            var original = obfuscated * multInv % m;
            Console.WriteLine("{0} => {1} => {2}", i, obfuscated, original);
            if (!nums.Add(obfuscated))
            {
                Console.WriteLine("Duplicate");
            }
        }    
    }

    private long MultiplicativeInverse(long x, long modulus)
    {
        return ExtendedEuclideanDivision(x, modulus).Item1 % modulus;
    }

    private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
    {
        if (a < 0)
        {
            var result = ExtendedEuclideanDivision(-a, b);
            return Tuple.Create(-result.Item1, result.Item2);
        }
        if (b < 0)
        {
            var result = ExtendedEuclideanDivision(a, -b);
            return Tuple.Create(result.Item1, -result.Item2);
        }
        if (b == 0)
        {
            return Tuple.Create(1L, 0L);
        }
        var q = a / b;
        var r = a % b;
        var rslt = ExtendedEuclideanDivision(b, r);
        var s = rslt.Item1;
        var t = rslt.Item2;
        return Tuple.Create(t, s - q * t);
    }

The first few lines of output for that program are:

    1 => 43 => 1
    2 => 86 => 2
    3 => 28 => 3
    4 => 71 => 4
    5 => 13 => 5
    6 => 56 => 6
    7 => 99 => 7
    8 => 41 => 8
    9 => 84 => 9
    10 => 26 => 10

If you run it on your system, you’ll find that every input number from 1 to 100 generates a unique obfuscated number, also from 1 to 100. This technique is guaranteed not to generate a duplicate. See the links above for the mathematical proof.

The multiplicative inverse method of generating obfuscated sequential keys is elegantly simple, effective, and efficient. It works for any number of keys and obfuscates them well enough that somebody would have to expend considerable effort to determine what your value of x is, even if they knew that they had two values that were generated by sequential keys.

By the way, I would suggest that if you use this technique, you don’t use the number 387420489 for your x value. The person trying to determine your obfuscation scheme might run across this article.

The choice of x here is huge. It can be any number that’s larger than m (the number of keys you want the ability to generate), and x and m must be relatively prime. The easiest way to ensure the second condition is to use a prime number for x. If you limit yourself to 64 bit prime numbers, you still have approximately 4,158E17 values to choose from. The likelihood of somebody guessing your number, or even figuring it out by brute force, is rather remote.

In my next entry, I’ll wrap this all up into a nice easy-to-use class.

Serialize enums as numbers rather than as strings

Imagine that your application works with different document types. You have a server that stores documents in a database, and several applications that query the database, and briefly cache query results in shared memory.

One of the fields that’s returned from a search is the document type, which you’ve encoded in a C# enumeration:

    public enum DocumentType
    {
        WordSmasher = 1,
        NumberCruncher = 2,
        MessageMangler = 3
    }

And in your document record:

    public class DocumentSummary
    {
        public string Name { get; set; }
        public DocumentType DocType { get; set; }
        // other stuff
    }

Now, say you serialize that to JSON, converting the enum values to strings. You end up with:

    {
        "Name":"MyDocument.msh",
        "DocType":"WordSmasher"
    }

All of the applications include a shared library that defines common data structures and such, like DocumentSummary and DocumentType.

This all works great until you add support for a new type of document: the company’s new PowerPoint competitor, called SleepyAudience. You update your enumeration:

    public enum DocumentType
    {
        WordSmasher = 1,
        NumberCruncher = 2,
        MessageMangler = 3,
        SleepyAudience = 4
    }

Then you compile and test your server code, build one of the other applications to test things with, and release it to the world. Only to find that the other applications, which haven’t yet been rebuilt, die a horrible death when they see the type “SleepyAudience”.

The application crashes because the deserialization code doesn’t know what to do when it finds the string “SleepyAudience” in the DocType. As far as that code is concerned, the DocType field can be an integer value, or it can have one of the first three strings. “SleepyAudience” is not the name of an enum value, as far as the deserializer is concerned.

The easiest solution to this problem is to encode the enumeration value as a number, rather than as a string. Had the DocType been encoded as a number, the deserializer would have read and populated a DocumentSummary instance. It’s then up to the application code to decide what to do with a document type that it is not familiar with. Perhaps the application would filter that document from the list presented to the user. Or it would display “Unknown document type.” Whatever the case, it certainly shouldn’t crash.

You lose a little bit of readability by encoding the enum value as a number, but only if you’re in the practice of eyeballing JSON data. And, yes, it’s often easier for people who are writing code that interacts with your API to understand what “WordSmasher” means. But we programmers are used to associating numbers with names. It’s no particular burden for a programmer to look up the value “1” in the API documentation to determine that it identifies the WordSmasher application.

One solution to this problem, of course, is to make sure that all applications that depend on the enumeration are rebuilt and released whenever the enumeration changes. That would solve the problem, but it’s not practical when you have dozens of different applications online all the time. You can’t just stop the world for a few hours (or even a few minutes) to apply updates.

You could also argue that the applications themselves are at fault, and I would agree with that assessment. After all, an application shouldn’t crash horribly because it got some bad data. The application should ignore that one data item (in this case, a single document record from a list of documents), log an error message, and continue. But standard serialization libraries aren’t that robust. They either understand the entirety of what’s sent, or they roll over and die.

The Robustness principle says “Be conservative in what you send, be liberal in what you accept.” When applied to deserialization it means that if you don’t understand something, deal with it gracefully; do the best you can. Fail gracefully. Interpret the value or ignore it and supply a default. If you can’t do that, then discard the record. If you can’t discard the record, discard the input. At worst return an empty data set. But under no circumstances should the program go down in flames because it didn’t understand some input data.

Regardless of who’s at fault here, the point is that encoding enumeration values as strings has very little, if any, real benefit, and makes it harder to write robust deserialization code. Especially when working with deserialization libraries (as opposed to rolling your own deserialization code), you should do whatever you reasonably can to ensure that the deserializer can interpret your data. Let the rest of your application decide how to proceed, after the data has been read.

In my opinion, the application I describe above has a more fundamental flaw in that the document types it supports are hard coded in an enumeration. But that’s a discussion for another day.

How did this happen?

Last time I showed two different implementations of the naive method for generating unique coupon codes. The traditional method does this:

    do
    {
        id = pick random number
    } until id not in used numbers
    code = generate code from id

The other is slightly different:

    do
    {
        id = pick random number
        code = generate code from id
    } until code not in used codes

Before we start, understand that I strongly discourage you from actually implementing such code. The naive selection algorithm performs very poorly as the number of items you choose increases. But the seemingly small difference in these two implementations allows me to illustrate why I think that the second version is broken in a fundamental way.

Think about what we’re doing here. We’re obtaining a number by which we are going to identify something. Then we’re generating a string to express that number. It just so happens that in the code I’ve been discussing these past few entries, the expression is a six-digit, base-31 value.

So why are we saving the display version of the number? If we were going to display the number in decimal, there’d be no question: we’d save the binary value. After all, the user interface already knows how to display a binary value in decimal. Even if we were going to display the number in hexadecimal, octal, or binary, we’d store the binary value. We certainly wouldn’t store the string “3735928559”, “11011110101011011011111011101111”, or “DEADBEEF”. So, again, why store the string “26BB62” instead of the the 32-bit value 33,554,432?

I can’t give you a good reason to do that. I can, however, give you reasons not to.

  1. Storing a six-character string in the database takes more space than storing a 32-bit integer.
  2. Everything you do with with a six-character code string takes longer than if you were working with an integer.
  3. Display logic should be part of the user interface rather than part of the database schema.
  4. Changing your coding scheme becomes much more difficult.

The only argument I can come up with for storing codes rather than integer identifiers is that with the former, there’s no conversion necessary once the code is generated. Whereas that’s true, it doesn’t hold up under scrutiny. Again, nobody would store a binary string just because the user interface wants to display the number in binary. It’s a simple number base conversion, for crying out loud!

If you’re generating unique integer values for database objects, then let the database treat them as integers. Let everybody treat them as integers. Except the users. “26BB62” is a whole lot easier to read and to type than is “33554432”.

I’ll be charitable and say that whoever came up with the idea of storing the display representation was inexperienced. I’m much less forgiving of the person who approved the design. The combination of the naive selection algorithm, storing the generated coupon code rather than the corresponding integer, and the broken base conversion function smacks of an inexperienced programmer working without adult supervision. Or perhaps an inexperienced programmer working with incompetent adult supervision, which amounts to the same thing.

The mere existence of the naive selection algorithm in this code is a sure sign that whoever wrote the code either never studied computer science, or slept through the part of his algorithms class where they studied random selection and the birthday paradox (also, birthday problem). The comment, “There is a tiny chance that a generated code may collide with an existing code,” tells us all we need to know of how much whoever implemented the code understood about the asymptotic behavior of this algorithm. Unless your definition of “tiny” includes a 10% chance after only one percent of the codes have been generated.

The decision to store the generated code string surprises me. You’d think that a DBA would want to conserve space, processor cycles, and data transfer size. Certainly every DBA I’ve ever worked with would prefer to store and index a 32-bit integer rather than a six-character string.

The way in which the base conversion function is broken is hard evidence that whoever modified it was definitely not an experienced programmer. I suspect strongly that the person who wrote the original function knew what he was doing, but whoever came along later and modified it was totally out of his depth. That’s the only way I can explain how the function ended up the way it did.

Finally, that all this was implemented on the database looks to me like a case of seeing the world through database-colored glasses. Sure, one can implement an obfuscated unique key generator in T-SQL. Whether one should is another matter entirely. Whoever did it in this case shouldn’t have tried, because he wasn’t up to the task.

However this design came to exist, it should not have been approved. If there was any oversight at all, it should have been rejected. That it was implemented in the first place is surprising. That it was approved and put into production borders on the criminal. Either there was no oversight, or the person reviewing the implementation was totally incompetent.

With the code buried three layers deep in a database stored procedure, it was pretty safe from prying eyes. So the bug remained hidden for a couple of years. Until a crisis occurred: a duplicate code was generated. Yes, I learned that the original implementation didn’t even take into account the chance of a duplicate. Apparently somebody figured that with 887 million possible codes, the chance of getting a duplicate in the first few million was so remote as to be ignored. Little did they know that the chance of getting a duplicate within the first 35,000 codes is 50%.

I also happen to know that at this point there was oversight by a senior programmer who did understand the asymptotic behavior of the naive algorithm, and yet approved “fixing” the problem by implementing the duplicate check. He selected that quick fix rather than replacing the naive algorithm with the more robust algorithm that was proposed: one that does not suffer from the same pathological behavior. The combination of arrogance and stupidity involved in that decision boggles the mind.

The history of this bug is rooted in company culture, where the database is considered sacrosanct: the domain of DBAs, no programmers allowed, thank you very much. And although programmers have access to all of the source code, they aren’t exactly encouraged to look at parts of the system outside of their own areas. Worse, they’re actively discouraged from making suggestions for improvement.

In such an environment, it’s little surprise that the horribly broken unique key generation algorithm survives.

So much for that. Next time we’ll start looking at good ways to generate unique, “random-looking” keys.

A broken unique key generator

Hanlon’s Razor: “Never attribute to malice that which is adequately explained by stupidity.”

Recap: in my previous post I showed the naive method of generating unique “random-looking” six-character keys, and explained why it wasn’t a good solution for generating hundreds of millions of keys. I also mentioned that I used to think that the naive method was the worst possible way somebody would try to use for generating keys, and promised that I would show a worse way in today’s post.

If you recall, the naive method does this:

    do
    {
        generate a random number
    } until you find one that hasn't been used
    mark the number as used
    create the coupon code for that number

The worse way that I mentioned attempts to do this:

    do
    {
        generate a random number
        create the coupon code for that number
    } until you find a coupon code that hasn't been used
    mark the code as used

The only difference in the overall algorithm is that the second method uses the coupon code rather than the random number to decide whether a code has been used. That might seem like a minor difference, but it’s really a major change in thinking. It’s also more expensive and yet another place for bugs to creep in.

Let’s defer the discussion of why the second method is worse than the first. In preparation, I want to provide a real-life example of such an implementation.

I ran across this code in a production database that’s used to generate millions of six-character coupon codes every week. When I first saw it I was surprised that somebody would use the naive method to generate unique codes. But then I studied it a little closer and was astonished at how incompetently it was written. If I were the suspicious type, I’d be tempted to think that the person who wrote (or, possibly, modified) it actually broke it on purpose. But then I remembered Hanlon’s Razor.

Here is part of a database stored procedure that implements the second algorithm I showed above.

    WHILE (@codesGenerated < @NumberToGenerate AND @Attempts < @MaxAttempts)
    BEGIN
        BEGIN TRY
            INSERT INTO [dbo].[CouponCodes]
                    ([CouponNamespaceId], [CouponId], [Code], [Used])
                VALUES
                    (@CouponNamespaceId,@CouponId,LTRIM(RTRIM([dbo].[GenerateRandomBase32](ABS(CHECKSUM(NEWID()))))), 0)

            -- If a collision happened in the Insert above then this @codesGenerated will (correctly) not happen
            SET @codesGenerated = @codesGenerated + 1  

        END TRY
        BEGIN CATCH 
            -- There is a tiny chance that a generated code may collide with an existing code.
            -- which will violate the unique constraint.
            -- Error handling code is here.
        END CATCH

        --Always increment @Attempts to prevent an infinite loop if there are too many collisions
        SET @Attempts = @Attempts + 1
    END

This is a T-SQL version of the modified naive method: pick a number, generate the coupon code, see if the coupon code has been used. In the interest of brevity, I’ve removed some irrelevant code and the error handling. It’s a reasonable implementation of that modified naive method, although I question the use of a GUID checksum as the source of random numbers. I’m also not convinced that the person who wrote the error handling understands that the ABS function will throw an exception if the parameter is -2147483648. It’s highly unlikely that CHECKSUM will generate that value, but I’ve been burned before by code that blindly calls ABS without either first checking the input parameter, or handling the exception.

Whereas I believe that those issues should be addressed, they’re pretty minor in light of what follows.

I’m also a bit amused at the comment: “There is a tiny chance that a generated code may collide with an existing code.” Again, it’s especially amusing in light of what follows.

Let’s also ignore for now the question of whether this code even belongs in the database layer. I think it’s a good discussion to have, but we’ll save it for another time.

The problem lies in GenerateRandomBase32, a SQL Server scalar function:

    ALTER FUNCTION [dbo].[GenerateRandomBase32]
    (
        @seed AS BIGINT
    )
    RETURNS CHAR(6)
    AS
    BEGIN 
        DECLARE @base BIGINT = 31
        SET @seed = @seed % 1040187391 + 33554432;
        DECLARE @answer AS CHAR(6) = '';
        DECLARE @allvalid AS VARCHAR(31);
        SET @allvalid = '123456789ABCDFGHJKLMNPQRSTVWXYZ';
        WHILE @seed > 0
            BEGIN 
                SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
                SET @seed = @seed / @base;
            END
        RETURN @answer;
    END

One problem is that the function is poorly named. The name “GenerateRandomBase32” implies that it’s going to generate a random value. At first I thought it was supposed to generate a random 32-bit value, but for reasons I’ll get to in a bit I’m now convinced that “32” was supposed to be the number base. The function doesn’t generate anything, there’s no randomness involved, and the number base is 31. A more appropriate name would be “ConvertToBase31,” or perhaps “ConvertToSixDigitBase31”. I realize that code changes over the years and this function might have done something quite different when it was originally written, but that’s no excuse for failing to change the function name to something meaningful.

On the face of it, the function looks like a T-SQL implementation of the base conversion method that I showed last time. That is, it takes a binary number and creates base-31 representation of it. That’s what it does, in general, but there are a few oddities that will surprise you. The first is this expression at the beginning:

    SET @seed = @seed % 1040187391 + 33554432;

With that expression, the effective range of keys that the function can create base-31 representations of is 33,554,432 to 1,073,741,822: a total of 1,040,187,390. Remember, though, that there are only 887,503,681 possible six-character coupon codes. Some of those values will get duplicates.

You might get the idea that this isn’t going to end well.

The peculiar thing about the expression is that, if we were generating a base-32 code, it would have the effect of preventing all codes that start with the first digit (presumably 0) from being generated. That is, if you sum the two magic numbers you end up with 1,073,741,823. It just so happens that 1,073,741,824 is the number of six-digit base-32 codes you can generate. I suspect that this code was originally written to generate base-32 codes, but for some reason later modified to eliminate ‘0’ from the valid digits, making it a base-31 code generator. But changing the number base without changing that expression has resulted in a horribly broken function.

I can’t prove that the function used to generate base-32 numbers, and was subsequently modified. I can only say that the evidence points strongly to that. I also can’t say for sure why one would want to prevent generation of codes that start with “0”. The only two reasons I can think of are:

  1. Somebody decided that numbers with leading 0’s were somehow difficult for users to handle.
  2. The programmer couldn’t figure out how to include leading 0’s when doing the base conversion.

At this point, I consider those two possibilities equally likely.

After mangling the input parameter, the code performs the standard base conversion logic, peeling out digits from lowest to highest, and creating a string. Note that each digit is prepended to the string. The code “7HNH51” would be constructed in the sequence “1”, “61”, “H61”, “NH61”, “HNH61”, “7HNH61”.

However, the range of values for which the function can generate codes exceeds the number of possible coupon codes. The function can generate codes for numbers greater than 887,503,681, which means that some of the codes generated will be seven characters long. They have to be truncated to six characters. How that truncation is done ends up making this code much worse than it appears at first.

If you pass the value 1040187360 to this function, it will generate seven characters. The first six characters generated are the code shown above, “7HNH61”. But then it generates a seventh character, “2”, which is prepended. Except the answer variable is declared as holding only six characters. So when this line of code is executed:

    SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;

The character “2” becomes the first character in the new string, and then the first five (from left to right) of the previous characters are copied. The result is “27HNH6”.

Now here’s the kicker. Guess what happens when you pass the value 1040187361. The function generates “27HNH62”, which is truncated to “27HNH6”. Duplicate! In fact, all 31 values from 1040187360 through 1040187390 generate that same code. That’s a huge oversight that seriously affects the duplicate rate.

There are 186,238,141 numbers for which this function will generate a seven-character code that’s truncated. If you pass a number in the range 853,949,249 to 1,040,187,390 to this function, it will generate one of 6,007,682 codes. Those 186 million values are 31 times more likely than average to be generated.

One other thing that affects the duplicate rate, albeit not very large, is that the function won’t ever generate a code for any number less than 33,554,432.

Combined, those two errors more than double the duplicate rate experienced by programs that call this function as part of the naive code generation algorithm.

Fixing that function really isn’t difficult, as I show below. The only real changes are to replace the seed mangling expression, and modify the loop so that it always generates exactly six characters.

    ALTER FUNCTION [dbo].[CreateSixDigitBase31]
    (
        @seed AS BIGINT
    )
    RETURNS CHAR(6)
    AS
    BEGIN 
        /*
         * This function generates a six-digit, base-31 representation of the passed seed parameter.
         * It is capable of generating 887,503,681 different codes (permutations of 31 items taken six
         * at a time, with repetition).
         * If the passed seed is larger than 887,593,680, the seed is taken modulo 887,593,680.
         * This function will not produce a useful value if seed is negative.
         */
        DECLARE @base BIGINT = 31
        SET @seed = @seed % 887503680;
        DECLARE @answer AS CHAR(6) = '';
        DECLARE @allvalid AS VARCHAR(31);
        SET @allvalid = '123456789ABCDFGHJKLMNPQRSTVWXYZ';
        DECLARE @counter as INT = 6;
        WHILE @counter > 0
            BEGIN 
                SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
                SET @seed = @seed / @base;
                SET @counter = @counter - 1;
            END
        RETURN @answer;
    END

I don’t particularly like that function because I suspect that the string manipulation is highly inefficient. We are in effect doing six string concatenation instructions with every code generated. I know that doing things this way in a C# program would bring the program to a crawl, which is why the implementation in my previous post used a character array. I don’t know enough about how T-SQL manipulates strings to say how efficient this is, and I don’t know if there’s a better alternative.

With the fixed base conversion function, this naive unique code generation algorithm is about as good as it can be. Which isn’t very good. Not only does it suffer from the duplicates problem that I pointed out yesterday, it has two other, related, strikes against it: it’s implemented at the database level, and it stores the generated code rather than the seed number.

Next time I’m going to talk a bit about why I think implementing the code generation in the database is a bad idea. I’ll also touch on how it’s possible for such a horribly broken implementation to go unnoticed for so long.