Hanlon’s Razor: “Never attribute to malice that which is adequately explained by stupidity.”
Recap: in my previous post I showed the naive method of generating unique “random-looking” six-character keys, and explained why it wasn’t a good solution for generating hundreds of millions of keys. I also mentioned that I used to think that the naive method was the worst possible way somebody would try to use for generating keys, and promised that I would show a worse way in today’s post.
If you recall, the naive method does this:
do
{
generate a random number
} until you find one that hasn't been used
mark the number as used
create the coupon code for that number
The worse way that I mentioned attempts to do this:
do
{
generate a random number
create the coupon code for that number
} until you find a coupon code that hasn't been used
mark the code as used
The only difference in the overall algorithm is that the second method uses the coupon code rather than the random number to decide whether a code has been used. That might seem like a minor difference, but it’s really a major change in thinking. It’s also more expensive and yet another place for bugs to creep in.
Let’s defer the discussion of why the second method is worse than the first. In preparation, I want to provide a real-life example of such an implementation.
I ran across this code in a production database that’s used to generate millions of six-character coupon codes every week. When I first saw it I was surprised that somebody would use the naive method to generate unique codes. But then I studied it a little closer and was astonished at how incompetently it was written. If I were the suspicious type, I’d be tempted to think that the person who wrote (or, possibly, modified) it actually broke it on purpose. But then I remembered Hanlon’s Razor.
Here is part of a database stored procedure that implements the second algorithm I showed above.
WHILE (@codesGenerated < @NumberToGenerate AND @Attempts < @MaxAttempts)
BEGIN
BEGIN TRY
INSERT INTO [dbo].[CouponCodes]
([CouponNamespaceId], [CouponId], [Code], [Used])
VALUES
(@CouponNamespaceId,@CouponId,LTRIM(RTRIM([dbo].[GenerateRandomBase32](ABS(CHECKSUM(NEWID()))))), 0)
-- If a collision happened in the Insert above then this @codesGenerated will (correctly) not happen
SET @codesGenerated = @codesGenerated + 1
END TRY
BEGIN CATCH
-- There is a tiny chance that a generated code may collide with an existing code.
-- which will violate the unique constraint.
-- Error handling code is here.
END CATCH
--Always increment @Attempts to prevent an infinite loop if there are too many collisions
SET @Attempts = @Attempts + 1
END
This is a T-SQL version of the modified naive method: pick a number, generate the coupon code, see if the coupon code has been used. In the interest of brevity, I’ve removed some irrelevant code and the error handling. It’s a reasonable implementation of that modified naive method, although I question the use of a GUID checksum as the source of random numbers. I’m also not convinced that the person who wrote the error handling understands that the ABS function will throw an exception if the parameter is -2147483648. It’s highly unlikely that CHECKSUM
will generate that value, but I’ve been burned before by code that blindly calls ABS
without either first checking the input parameter, or handling the exception.
Whereas I believe that those issues should be addressed, they’re pretty minor in light of what follows.
I’m also a bit amused at the comment: “There is a tiny chance that a generated code may collide with an existing code.” Again, it’s especially amusing in light of what follows.
Let’s also ignore for now the question of whether this code even belongs in the database layer. I think it’s a good discussion to have, but we’ll save it for another time.
The problem lies in GenerateRandomBase32
, a SQL Server scalar function:
ALTER FUNCTION [dbo].[GenerateRandomBase32]
(
@seed AS BIGINT
)
RETURNS CHAR(6)
AS
BEGIN
DECLARE @base BIGINT = 31
SET @seed = @seed % 1040187391 + 33554432;
DECLARE @answer AS CHAR(6) = '';
DECLARE @allvalid AS VARCHAR(31);
SET @allvalid = '123456789ABCDFGHJKLMNPQRSTVWXYZ';
WHILE @seed > 0
BEGIN
SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
SET @seed = @seed / @base;
END
RETURN @answer;
END
One problem is that the function is poorly named. The name “GenerateRandomBase32” implies that it’s going to generate a random value. At first I thought it was supposed to generate a random 32-bit value, but for reasons I’ll get to in a bit I’m now convinced that “32” was supposed to be the number base. The function doesn’t generate anything, there’s no randomness involved, and the number base is 31. A more appropriate name would be “ConvertToBase31,” or perhaps “ConvertToSixDigitBase31”. I realize that code changes over the years and this function might have done something quite different when it was originally written, but that’s no excuse for failing to change the function name to something meaningful.
On the face of it, the function looks like a T-SQL implementation of the base conversion method that I showed last time. That is, it takes a binary number and creates base-31 representation of it. That’s what it does, in general, but there are a few oddities that will surprise you. The first is this expression at the beginning:
SET @seed = @seed % 1040187391 + 33554432;
With that expression, the effective range of keys that the function can create base-31 representations of is 33,554,432 to 1,073,741,822: a total of 1,040,187,390. Remember, though, that there are only 887,503,681 possible six-character coupon codes. Some of those values will get duplicates.
You might get the idea that this isn’t going to end well.
The peculiar thing about the expression is that, if we were generating a base-32 code, it would have the effect of preventing all codes that start with the first digit (presumably 0) from being generated. That is, if you sum the two magic numbers you end up with 1,073,741,823. It just so happens that 1,073,741,824 is the number of six-digit base-32 codes you can generate. I suspect that this code was originally written to generate base-32 codes, but for some reason later modified to eliminate ‘0’ from the valid digits, making it a base-31 code generator. But changing the number base without changing that expression has resulted in a horribly broken function.
I can’t prove that the function used to generate base-32 numbers, and was subsequently modified. I can only say that the evidence points strongly to that. I also can’t say for sure why one would want to prevent generation of codes that start with “0”. The only two reasons I can think of are:
- Somebody decided that numbers with leading 0’s were somehow difficult for users to handle.
- The programmer couldn’t figure out how to include leading 0’s when doing the base conversion.
At this point, I consider those two possibilities equally likely.
After mangling the input parameter, the code performs the standard base conversion logic, peeling out digits from lowest to highest, and creating a string. Note that each digit is prepended to the string. The code “7HNH51” would be constructed in the sequence “1”, “61”, “H61”, “NH61”, “HNH61”, “7HNH61”.
However, the range of values for which the function can generate codes exceeds the number of possible coupon codes. The function can generate codes for numbers greater than 887,503,681, which means that some of the codes generated will be seven characters long. They have to be truncated to six characters. How that truncation is done ends up making this code much worse than it appears at first.
If you pass the value 1040187360 to this function, it will generate seven characters. The first six characters generated are the code shown above, “7HNH61”. But then it generates a seventh character, “2”, which is prepended. Except the answer
variable is declared as holding only six characters. So when this line of code is executed:
SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
The character “2” becomes the first character in the new string, and then the first five (from left to right) of the previous characters are copied. The result is “27HNH6”.
Now here’s the kicker. Guess what happens when you pass the value 1040187361. The function generates “27HNH62”, which is truncated to “27HNH6”. Duplicate! In fact, all 31 values from 1040187360 through 1040187390 generate that same code. That’s a huge oversight that seriously affects the duplicate rate.
There are 186,238,141 numbers for which this function will generate a seven-character code that’s truncated. If you pass a number in the range 853,949,249 to 1,040,187,390 to this function, it will generate one of 6,007,682 codes. Those 186 million values are 31 times more likely than average to be generated.
One other thing that affects the duplicate rate, albeit not very large, is that the function won’t ever generate a code for any number less than 33,554,432.
Combined, those two errors more than double the duplicate rate experienced by programs that call this function as part of the naive code generation algorithm.
Fixing that function really isn’t difficult, as I show below. The only real changes are to replace the seed mangling expression, and modify the loop so that it always generates exactly six characters.
ALTER FUNCTION [dbo].[CreateSixDigitBase31]
(
@seed AS BIGINT
)
RETURNS CHAR(6)
AS
BEGIN
/*
* This function generates a six-digit, base-31 representation of the passed seed parameter.
* It is capable of generating 887,503,681 different codes (permutations of 31 items taken six
* at a time, with repetition).
* If the passed seed is larger than 887,593,680, the seed is taken modulo 887,593,680.
* This function will not produce a useful value if seed is negative.
*/
DECLARE @base BIGINT = 31
SET @seed = @seed % 887503680;
DECLARE @answer AS CHAR(6) = '';
DECLARE @allvalid AS VARCHAR(31);
SET @allvalid = '123456789ABCDFGHJKLMNPQRSTVWXYZ';
DECLARE @counter as INT = 6;
WHILE @counter > 0
BEGIN
SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
SET @seed = @seed / @base;
SET @counter = @counter - 1;
END
RETURN @answer;
END
I don’t particularly like that function because I suspect that the string manipulation is highly inefficient. We are in effect doing six string concatenation instructions with every code generated. I know that doing things this way in a C# program would bring the program to a crawl, which is why the implementation in my previous post used a character array. I don’t know enough about how T-SQL manipulates strings to say how efficient this is, and I don’t know if there’s a better alternative.
With the fixed base conversion function, this naive unique code generation algorithm is about as good as it can be. Which isn’t very good. Not only does it suffer from the duplicates problem that I pointed out yesterday, it has two other, related, strikes against it: it’s implemented at the database level, and it stores the generated code rather than the seed number.
Next time I’m going to talk a bit about why I think implementing the code generation in the database is a bad idea. I’ll also touch on how it’s possible for such a horribly broken implementation to go unnoticed for so long.