How not to generate unique codes

Imagine that part of the system you’re developing requires you to generate six-character unique alphanumeric coupon codes that you don’t want to appear sequential. For example, the first code might be XQJ97T and the next is 1B9QD4. You need to keep track of the codes, of course, and generating a duplicate would be a very bad thing.

For this discussion, let’s assume that you want codes to contain characters from the set 123456789ABCDFGHJKLMNPQRSTVWXYZ. I’ve eliminated the number 0 and the vowels E, I, O, and U. Why eliminate the vowels? Because some people think it reduces the likelihood of spelling a “bad” word. Whatever. You’d be surprised how common that set is.

So there are 31 characters, and we want to create a six-character code. If you allow repetition (codes where characters can be repeated, like XQJ97J), then there are 887,503,681 possible codes. (See permutations with repetition.)

I used to think that the worst possible way to generate a unique code would be to generate one, see if it’s been used, and if so go back and generate another one. Something like this:

    do
        code = generateRandomCode()
    until (code not in database)
    addGeneratedCode(code)

That actually works. However, you’ll find that as the number of codes you generate increases, it takes longer and longer to find one that hasn’t been used. The code below illustrates this effect using random numbers.

First, the DumbCodeGenerator class, which implements the code generation and keeps track of the number of duplicates.

    public class DumbCodeGenerator
    {
        private readonly Random _rnd = new Random();
        private readonly HashSet<int> _uniqueCodes = new HashSet<int>();

        public int TotalCodesGenerated { get; private set; }
        public int UniqueCodesGenerated { get { return _uniqueCodes.Count; } }
        public int DuplicatesGenerated { get { return TotalCodesGenerated - UniqueCodesGenerated; } }

        private const int TotalAvailableCodes = 887503681;

        public int NextCode()
        {
            while (true)
            {
                int code = _rnd.Next(TotalAvailableCodes);
                ++TotalCodesGenerated;
                if (_uniqueCodes.Add(code))
                {
                    return code;
                }
            }
        }
    }

And, the code that calls it:

    class Program
    {
        private const int Thousand = 1000;
        private const int Million = Thousand*Thousand;
        private const int NumCodesToGenerate = 25*Million;
        static void Main(string[] args)
        {
            var generator = new DumbCodeGenerator();
            int lastDup = 0;
            for (var i = 0; i < NumCodesToGenerate; ++i)
            {
                generator.NextCode();
                if (i%Million == 0)
                {
                    Console.WriteLine("{0:N0}, {1:N0}, {2:N0}", i, generator.DuplicatesGenerated-lastDup, generator.DuplicatesGenerated);
                    lastDup = generator.DuplicatesGenerated;
                }
            }
            Console.WriteLine("{0:N0} unique codes generated.", generator.UniqueCodesGenerated);
            Console.WriteLine("{0:N0} duplicates generated.", generator.DuplicatesGenerated);
            Console.WriteLine("{0:N0} total generated.", generator.TotalCodesGenerated);
            Console.WriteLine("Duplicate percent = {0:P2}", (double)generator.DuplicatesGenerated/NumCodesToGenerate);
            Console.Write("Press Enter:");
            Console.ReadLine();
        }
    }

When I run that code, it produces this output:

0  0  0
1,000,000  583  583
2,000,000  1,742  2,325
3,000,000  2,859  5,184
4,000,000  4,153  9,337
5,000,000  5,369  14,706
6,000,000  6,492  21,198
7,000,000  7,666  28,864
8,000,000  8,823  37,687
9,000,000  10,203  47,890
10,000,000  11,185  59,075
11,000,000  12,378  71,453
12,000,000  13,719  85,172
13,000,000  14,895  100,067
14,000,000  15,989  116,056
15,000,000  17,251  133,307
16,000,000  18,906  152,213
17,000,000  19,871  172,084
18,000,000  20,956  193,040
19,000,000  21,995  215,035
20,000,000  23,420  238,455
21,000,000  24,494  262,949
22,000,000  25,663  288,612
23,000,000  26,990  315,602
24,000,000  28,254  343,856
25,000,000 unique codes generated.
373,291 duplicates generated.
25,373,291 total generated.
Duplicate percent = 1.49 %

In short, the first million codes selected generated 583 duplicates. Between 1 million and 2 million, there were 1,742 duplicates generated. Between 23 million and 24 million, it generated 28,254 duplicates, or about 2.83%. The likelihood of a duplicate increases as we generate more codes. When I ran it for 45 million codes (the largest I could get on my laptop), it was generating 5.5% duplicates at the last interval, and the duplicate percentage up to that point was 2.74%.

If you were to generate 50 million unique codes this way, you would have generated approximately 3% more codes than you had to. That’s not really so bad, provided you don’t mind spending the memory (or database resources) saving and querying the generated numbers. Things get more troublesome when you want to generate hundreds of millions of codes, because your duplicate percentage increases so much that you spend a huge amount of time trying to generate a unique code.

So far, I’ve talked about generating random numbers. What about six-character codes? They’re just a base conversion away. That is, just like you can represent the decimal value 1793 as 0x0701, you can also represent any number in base-31 using whatever characters you want as digits. Here’s the method that turns a passed binary value into a base-31 code string.

    private const string ValidChars = "123456789ABCDFGHJKLMNPQRSTVWXYZ";
    private static readonly int NumberBase = ValidChars.Length;

    static string CreateEncodedString(int code)
    {
        var answer = new char[6];
        int codeVal = code;
        for (var i = 0; i < 6; ++i)
        {
            int digit = codeVal%NumberBase;
            answer[5 - i] = ValidChars[digit];
            codeVal = codeVal/NumberBase;
        }
        return new string(answer);
    }

In this case, I wanted to keep leading “zeros” (which are represented by the character “1”). Note that the code builds the string backwards. If this method is passed a number larger than 887,503,680, only the six low-order digits of the number are generated. It would be like generating the code for the number modulo 887,503,681.

As I said, I used to think that the above was the worst possible way anybody would consider generating codes. I was wrong. There’s a much worse way that I’ve actually seen used in production code. I’ll talk about that in my next entry.