How not to generate unique codes

Imagine that part of the system you’re developing requires you to generate six-character unique alphanumeric coupon codes that you don’t want to appear sequential. For example, the first code might be XQJ97T and the next is 1B9QD4. You need to keep track of the codes, of course, and generating a duplicate would be a very bad thing.

For this discussion, let’s assume that you want codes to contain characters from the set 123456789ABCDFGHJKLMNPQRSTVWXYZ. I’ve eliminated the number 0 and the vowels E, I, O, and U. Why eliminate the vowels? Because some people think it reduces the likelihood of spelling a “bad” word. Whatever. You’d be surprised how common that set is.

So there are 31 characters, and we want to create a six-character code. If you allow repetition (codes where characters can be repeated, like XQJ97J), then there are 887,503,681 possible codes. (See permutations with repetition.)

I used to think that the worst possible way to generate a unique code would be to generate one, see if it’s been used, and if so go back and generate another one. Something like this:

    do
        code = generateRandomCode()
    until (code not in database)
    addGeneratedCode(code)

That actually works. However, you’ll find that as the number of codes you generate increases, it takes longer and longer to find one that hasn’t been used. The code below illustrates this effect using random numbers.

First, the DumbCodeGenerator class, which implements the code generation and keeps track of the number of duplicates.

    public class DumbCodeGenerator
    {
        private readonly Random _rnd = new Random();
        private readonly HashSet<int> _uniqueCodes = new HashSet<int>();

        public int TotalCodesGenerated { get; private set; }
        public int UniqueCodesGenerated { get { return _uniqueCodes.Count; } }
        public int DuplicatesGenerated { get { return TotalCodesGenerated - UniqueCodesGenerated; } }

        private const int TotalAvailableCodes = 887503681;

        public int NextCode()
        {
            while (true)
            {
                int code = _rnd.Next(TotalAvailableCodes);
                ++TotalCodesGenerated;
                if (_uniqueCodes.Add(code))
                {
                    return code;
                }
            }
        }
    }

And, the code that calls it:

    class Program
    {
        private const int Thousand = 1000;
        private const int Million = Thousand*Thousand;
        private const int NumCodesToGenerate = 25*Million;
        static void Main(string[] args)
        {
            var generator = new DumbCodeGenerator();
            int lastDup = 0;
            for (var i = 0; i < NumCodesToGenerate; ++i)
            {
                generator.NextCode();
                if (i%Million == 0)
                {
                    Console.WriteLine("{0:N0}, {1:N0}, {2:N0}", i, generator.DuplicatesGenerated-lastDup, generator.DuplicatesGenerated);
                    lastDup = generator.DuplicatesGenerated;
                }
            }
            Console.WriteLine("{0:N0} unique codes generated.", generator.UniqueCodesGenerated);
            Console.WriteLine("{0:N0} duplicates generated.", generator.DuplicatesGenerated);
            Console.WriteLine("{0:N0} total generated.", generator.TotalCodesGenerated);
            Console.WriteLine("Duplicate percent = {0:P2}", (double)generator.DuplicatesGenerated/NumCodesToGenerate);
            Console.Write("Press Enter:");
            Console.ReadLine();
        }
    }

When I run that code, it produces this output:

0  0  0
1,000,000  583  583
2,000,000  1,742  2,325
3,000,000  2,859  5,184
4,000,000  4,153  9,337
5,000,000  5,369  14,706
6,000,000  6,492  21,198
7,000,000  7,666  28,864
8,000,000  8,823  37,687
9,000,000  10,203  47,890
10,000,000  11,185  59,075
11,000,000  12,378  71,453
12,000,000  13,719  85,172
13,000,000  14,895  100,067
14,000,000  15,989  116,056
15,000,000  17,251  133,307
16,000,000  18,906  152,213
17,000,000  19,871  172,084
18,000,000  20,956  193,040
19,000,000  21,995  215,035
20,000,000  23,420  238,455
21,000,000  24,494  262,949
22,000,000  25,663  288,612
23,000,000  26,990  315,602
24,000,000  28,254  343,856
25,000,000 unique codes generated.
373,291 duplicates generated.
25,373,291 total generated.
Duplicate percent = 1.49 %

In short, the first million codes selected generated 583 duplicates. Between 1 million and 2 million, there were 1,742 duplicates generated. Between 23 million and 24 million, it generated 28,254 duplicates, or about 2.83%. The likelihood of a duplicate increases as we generate more codes. When I ran it for 45 million codes (the largest I could get on my laptop), it was generating 5.5% duplicates at the last interval, and the duplicate percentage up to that point was 2.74%.

If you were to generate 50 million unique codes this way, you would have generated approximately 3% more codes than you had to. That’s not really so bad, provided you don’t mind spending the memory (or database resources) saving and querying the generated numbers. Things get more troublesome when you want to generate hundreds of millions of codes, because your duplicate percentage increases so much that you spend a huge amount of time trying to generate a unique code.

So far, I’ve talked about generating random numbers. What about six-character codes? They’re just a base conversion away. That is, just like you can represent the decimal value 1793 as 0x0701, you can also represent any number in base-31 using whatever characters you want as digits. Here’s the method that turns a passed binary value into a base-31 code string.

    private const string ValidChars = "123456789ABCDFGHJKLMNPQRSTVWXYZ";
    private static readonly int NumberBase = ValidChars.Length;

    static string CreateEncodedString(int code)
    {
        var answer = new char[6];
        int codeVal = code;
        for (var i = 0; i < 6; ++i)
        {
            int digit = codeVal%NumberBase;
            answer[5 - i] = ValidChars[digit];
            codeVal = codeVal/NumberBase;
        }
        return new string(answer);
    }

In this case, I wanted to keep leading “zeros” (which are represented by the character “1”). Note that the code builds the string backwards. If this method is passed a number larger than 887,503,680, only the six low-order digits of the number are generated. It would be like generating the code for the number modulo 887,503,681.

As I said, I used to think that the above was the worst possible way anybody would consider generating codes. I was wrong. There’s a much worse way that I’ve actually seen used in production code. I’ll talk about that in my next entry.

ReSharper finds the bugs

I might have mentioned before that I really like ReSharper. In my opinion if you’re writing C# code and not using ReSharper, you’re handicapping yourself. And probably committing professional malpractice. Next to Visual Studio, it’s the most useful development tool I have.

Today’s discovery is a case in point.

I was reviewing some code today when I ran across a latent bug and a definite bug, both because ReSharper had flagged the lines involved.

    string paramName;
    var paramNameMatch = Regex.Match(message, @"Parameter name\:\s+(?.+)");
    if (paramNameMatch != null) { paramName = paramNameMatch.Groups["ParamName"].Value; }
    else { paramName = string.Empty; }

    string actualValue;
    var actualValueMatch = Regex.Match(message, @"Actual value was (?.+)\.");
    if (paramNameMatch != null) { actualValue = paramNameMatch.Groups["ActualValue"].Value; }
    else { actualValue = string.Empty; }

Don’t worry too much about what the regular expressions are doing or why.

The code looks reasonable, right? Try to match a regular expression and assign a value based on whether the match was successful. Except that’s not what the code does. Let’s take a look at the first part. I’ve reformatted it for readability.

    string paramName;
    var paramNameMatch = Regex.Match(message, @"Parameter name\:\s+(?.+)");
    if (paramNameMatch != null)
    { 
        paramName = paramNameMatch.Groups["ParamName"].Value;
    }
    else
    {
        paramName = string.Empty;
    }

The primary problem here is that Regex.Match never returns null!. Documentation says that if no match was found, the return value is a Match object that is equal to Match.Empty. In any case, the else clause will never be executed.

The code passed the programmer’s test only because trying to access a group that doesn’t exist returns an empty string. As documentation says:

If groupname is not the name of a capturing group in the collection, or if groupname is the name of a capturing group that has not been matched in the input string, the method returns a Group object whose Group.Success property is false and whose Group.Value property is String.Empty.

The code works by accident, which to me means that it’s a latent bug. Had we wanted to return something other than string.Empty in the else clause, this code would have been in error.

In this case, ReSharper told me that in this line:

    if (paramNameMatch != null)

The expression is always true. In other words, paramNameMatch will never be null.

The proper way to determine if a regular expression match was successful is to check the value of the Match.Success property. The above code is more correctly written as:

    string paramName;
    var paramNameMatch = Regex.Match(message, @"Parameter name\:\s+(?.+)");
    if (paramNameMatch.Success)    // will be false if no match was made.
    { 
        paramName = paramNameMatch.Groups["ParamName"].Value;
    }
    else
    {
        paramName = string.Empty;
    }

I find that overly verbose. Situations like this just beg for the ternary operator:

    var paramNameMatch = Regex.Match(message, @"Parameter name\:\s+(?.+)");
    string paramName = (paramNameMatch.Success) ? paramNameMatch.Groups["ParamName"].Value : string.Empty;

ReSharper also notified me of a real bug in this code:

    string actualValue;
    var actualValueMatch = Regex.Match(message, @"Actual value was (?.+)\.");
    if (paramNameMatch != null) { actualValue = paramNameMatch.Groups["ActualValue"].Value; }
    else { actualValue = string.Empty; }

ReSharper’s warning was that the variable actualValueMatch is assigned a value that is never used. Look closely: the second line assigns a value to actualValueMatch, but the next line checks the value of paramNameMatch. This code could not have worked. The programmer obviously didn’t test it.

Had the programmer who wrote this been using ReSharper and paying attention to what it was telling him, neither of these bugs would have been checked into source control. Most likely, the programmer would have seen ReSharper’s warnings immediately after he wrote the bugs, and would have fixed them before he even tried to compile the program.

Some people complain that ReSharper is expensive. I find that a ridiculous argument. A ReSharper subscription costs $239 per year. Or, if you don’t want upgrades you can buy a license for $300 and keep it for a few years before upgrading. If you’re writing code for pay, $300 is something like four or six hours of work. ReSharper saves me that much time every month! It helps me avoid mistakes when I’m writing code, and identifies trouble spots when I’m reviewing code. It also helps me with refactoring when I’m working on older code, and helps me navigate this very large solution I’ve inherited. I could do my work without it, but I’d rather not. It’d be like riding a horse to work: I’d get there eventually, but it’d be uncomfortable and would take too long. Buy the best tools you can afford. And if you’re making a living writing C# code, you can afford ReSharper.