Imagine that part of the system you’re developing requires you to generate six-character unique alphanumeric coupon codes that you don’t want to appear sequential. For example, the first code might be XQJ97T and the next is 1B9QD4. You need to keep track of the codes, of course, and generating a duplicate would be a very bad thing.
For this discussion, let’s assume that you want codes to contain characters from the set 123456789ABCDFGHJKLMNPQRSTVWXYZ
. I’ve eliminated the number 0 and the vowels E, I, O, and U. Why eliminate the vowels? Because some people think it reduces the likelihood of spelling a “bad” word. Whatever. You’d be surprised how common that set is.
So there are 31 characters, and we want to create a six-character code. If you allow repetition (codes where characters can be repeated, like XQJ97J), then there are 887,503,681 possible codes. (See permutations with repetition.)
I used to think that the worst possible way to generate a unique code would be to generate one, see if it’s been used, and if so go back and generate another one. Something like this:
do
code = generateRandomCode()
until (code not in database)
addGeneratedCode(code)
That actually works. However, you’ll find that as the number of codes you generate increases, it takes longer and longer to find one that hasn’t been used. The code below illustrates this effect using random numbers.
First, the DumbCodeGenerator
class, which implements the code generation and keeps track of the number of duplicates.
public class DumbCodeGenerator
{
private readonly Random _rnd = new Random();
private readonly HashSet<int> _uniqueCodes = new HashSet<int>();
public int TotalCodesGenerated { get; private set; }
public int UniqueCodesGenerated { get { return _uniqueCodes.Count; } }
public int DuplicatesGenerated { get { return TotalCodesGenerated - UniqueCodesGenerated; } }
private const int TotalAvailableCodes = 887503681;
public int NextCode()
{
while (true)
{
int code = _rnd.Next(TotalAvailableCodes);
++TotalCodesGenerated;
if (_uniqueCodes.Add(code))
{
return code;
}
}
}
}
And, the code that calls it:
class Program
{
private const int Thousand = 1000;
private const int Million = Thousand*Thousand;
private const int NumCodesToGenerate = 25*Million;
static void Main(string[] args)
{
var generator = new DumbCodeGenerator();
int lastDup = 0;
for (var i = 0; i < NumCodesToGenerate; ++i)
{
generator.NextCode();
if (i%Million == 0)
{
Console.WriteLine("{0:N0}, {1:N0}, {2:N0}", i, generator.DuplicatesGenerated-lastDup, generator.DuplicatesGenerated);
lastDup = generator.DuplicatesGenerated;
}
}
Console.WriteLine("{0:N0} unique codes generated.", generator.UniqueCodesGenerated);
Console.WriteLine("{0:N0} duplicates generated.", generator.DuplicatesGenerated);
Console.WriteLine("{0:N0} total generated.", generator.TotalCodesGenerated);
Console.WriteLine("Duplicate percent = {0:P2}", (double)generator.DuplicatesGenerated/NumCodesToGenerate);
Console.Write("Press Enter:");
Console.ReadLine();
}
}
When I run that code, it produces this output:
0 0 0
1,000,000 583 583
2,000,000 1,742 2,325
3,000,000 2,859 5,184
4,000,000 4,153 9,337
5,000,000 5,369 14,706
6,000,000 6,492 21,198
7,000,000 7,666 28,864
8,000,000 8,823 37,687
9,000,000 10,203 47,890
10,000,000 11,185 59,075
11,000,000 12,378 71,453
12,000,000 13,719 85,172
13,000,000 14,895 100,067
14,000,000 15,989 116,056
15,000,000 17,251 133,307
16,000,000 18,906 152,213
17,000,000 19,871 172,084
18,000,000 20,956 193,040
19,000,000 21,995 215,035
20,000,000 23,420 238,455
21,000,000 24,494 262,949
22,000,000 25,663 288,612
23,000,000 26,990 315,602
24,000,000 28,254 343,856
25,000,000 unique codes generated.
373,291 duplicates generated.
25,373,291 total generated.
Duplicate percent = 1.49 %
In short, the first million codes selected generated 583 duplicates. Between 1 million and 2 million, there were 1,742 duplicates generated. Between 23 million and 24 million, it generated 28,254 duplicates, or about 2.83%. The likelihood of a duplicate increases as we generate more codes. When I ran it for 45 million codes (the largest I could get on my laptop), it was generating 5.5% duplicates at the last interval, and the duplicate percentage up to that point was 2.74%.
If you were to generate 50 million unique codes this way, you would have generated approximately 3% more codes than you had to. That’s not really so bad, provided you don’t mind spending the memory (or database resources) saving and querying the generated numbers. Things get more troublesome when you want to generate hundreds of millions of codes, because your duplicate percentage increases so much that you spend a huge amount of time trying to generate a unique code.
So far, I’ve talked about generating random numbers. What about six-character codes? They’re just a base conversion away. That is, just like you can represent the decimal value 1793 as 0x0701, you can also represent any number in base-31 using whatever characters you want as digits. Here’s the method that turns a passed binary value into a base-31 code string.
private const string ValidChars = "123456789ABCDFGHJKLMNPQRSTVWXYZ";
private static readonly int NumberBase = ValidChars.Length;
static string CreateEncodedString(int code)
{
var answer = new char[6];
int codeVal = code;
for (var i = 0; i < 6; ++i)
{
int digit = codeVal%NumberBase;
answer[5 - i] = ValidChars[digit];
codeVal = codeVal/NumberBase;
}
return new string(answer);
}
In this case, I wanted to keep leading “zeros” (which are represented by the character “1”). Note that the code builds the string backwards. If this method is passed a number larger than 887,503,680, only the six low-order digits of the number are generated. It would be like generating the code for the number modulo 887,503,681.
As I said, I used to think that the above was the worst possible way anybody would consider generating codes. I was wrong. There’s a much worse way that I’ve actually seen used in production code. I’ll talk about that in my next entry.