A common pattern used when communicating with external services is to retry a call that fails. Stripped of bells and whistles, the typical retry loop looks like this:
result = makeServiceCall(parameters)
numRetries = 0
while !result.success && numRetries < MAX_RETRIES
// insert delay before retry
++numRetries
result = makeServiceCall(parameters)
end while
if (!result.success)
// Raise error.
We can quibble about the specific implementation, but the general idea is the same: make multiple attempts to get a success response from the service, inserting a delay between calls. The delay can be a fixed amount of time, an exponential backoff with jitter, etc., and you can include all kinds of other logic to improve things, but it all boils down to essentially the same thing.
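For concreteness, here is a minimal C# sketch of that loop with exponential backoff and jitter. The names (ServiceResult, makeServiceCall) and the specific delay numbers are illustrative, not from any particular library; it needs System and System.Threading.

// Illustrative result type; a real client library would define its own.
class ServiceResult { public bool Success; }

static class RetryExample
{
    static readonly Random Rng = new Random();

    public static ServiceResult CallWithRetry(Func<ServiceResult> makeServiceCall, int maxRetries, TimeSpan baseDelay)
    {
        var result = makeServiceCall();
        for (int attempt = 0; !result.Success && attempt < maxRetries; ++attempt)
        {
            // Exponential backoff with jitter: baseDelay * 2^attempt, plus up to 100 ms of noise.
            var delay = TimeSpan.FromMilliseconds(
                baseDelay.TotalMilliseconds * Math.Pow(2, attempt) + Rng.Next(0, 100));
            Thread.Sleep(delay);
            result = makeServiceCall();
        }
        return result;
    }
}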
Under normal circumstances, of course, service calls don’t fail, and this logic works very well. But if the service call failure rate increases, some very bad things can happen.
Imagine a critical service for which you have a contract (a service level agreement, or SLA) that guarantees an average 10 ms response time for up to 1,000 transactions per second (TPS). Remember, though, that this is a shared service. There are many other clients who have similar contracts: 10 ms response time for X TPS. Your application calls that service perhaps 900 times per second, on average. There will be brief periods when your call rate will exceed 1,000 TPS, but that’s okay because the service is scaled to handle large amounts of traffic from many clients. Say the service can guarantee that 10 ms response time for a total of 1,000,000 TPS from all of its clients, combined. Short-term bursts of excess traffic from a few clients aren’t a problem.
Even if calls to the service exceed 1,000,000 TPS, the likely result at first will be increased response time: perhaps latency increases by 50% with a sustained traffic increase of 10%, and doubles when traffic is 20% above the configured maximum. The specific breaking point differs for every service, but in general latency increases non-linearly with the call rate.
Clients, of course, won’t wait forever for a response. They typically configure a timeout (often two or three times the SLA), and consider the call a failure if it takes longer than that. Not a problem with this retry logic: just delay a bit and try again.
As I said above, this kind of thing works fine under normal conditions. But in a large system, lots of things can go wrong.
Imagine what would happen if the service starts getting 1,500,000 requests per second: a sustained 50% increase in traffic. Or one of the service’s dependencies can’t meet its SLA. Or network congestion increases the error rate. Whatever the cause of the service’s distress, its failure rate increases, or its latency climbs past the timeout value set by clients, and your application blindly responds by sending another request. So if your MAX_RETRIES value is two, you’ve effectively tripled the number of calls you make to the service.
The last thing a service under distress needs is more requests. Even if your application is not experiencing increased traffic, your retries still have a negative effect on the service.
Some argue that services should protect themselves from such traffic storms. And many do. But that protection isn’t free. There comes a point when the service is spending so much time telling clients to go away that it can’t spend any time clearing its queue. Not that clearing the queue is much help. Even after the initial problem is fixed, the service is swamped with requests from all those clients who keep blindly retrying. It’s a positive feedback loop that won’t clear until the clients stop calling.
The retry loop above might improve your application’s reliability in normal operation. I say “might” because most applications I’ve encountered don’t actually track the number of retries, so they have no idea if the retry logic even works. I’ve seen the following in production code:
A retry loop that always made the maximum number of retries, even if the initial call succeeded.
Retry logic that never retried. That code was in production for two years before anybody realized there was a problem. Why? Because the service had never failed before.
Retry logic that actually did the retries but then returned the result of the first call.
Infinite retry. When a non-critical service went down one day, the entire site became inoperable.
As bad as it is that many programmers apparently don’t test their retry logic, even fewer monitor it. Of all the applications I’ve seen with retry logic, only a handful of teams could tell me how effective it was. If you want to know whether your retry logic is working, you have to log:
The number of initial calls to the service.
The number of initial call failures.
The total number of calls to the service (including retries).
The number of call successes (including success after retry).
From those numbers, you can determine the effectiveness of the retry logic. In my experience, the percentage of initial call failures to any service under normal operation is less than 1%, and retry succeeds in fewer than 50% of those cases. When a service is under distress and the initial failure percentage gets above about 10%, retry is almost never successful. The reason, I think, is that whatever condition caused the outage hasn’t cleared before the last retry: service outages last longer than clients are willing to wait.
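Here is a minimal sketch of the bookkeeping that makes those four numbers available, wrapped around a retry loop like the one sketched earlier (it reuses that ServiceResult type and System.Threading’s Interlocked; the counter names are mine, and in a real system they would feed your metrics pipeline rather than static fields):

static long initialCalls;      // initial calls to the service
static long initialFailures;   // initial call failures
static long totalCalls;        // total calls, including retries
static long totalSuccesses;    // successes, including success after retry

static ServiceResult CallWithRetryAndMetrics(Func<ServiceResult> makeServiceCall, int maxRetries)
{
    Interlocked.Increment(ref initialCalls);
    Interlocked.Increment(ref totalCalls);
    var result = makeServiceCall();
    if (!result.Success)
        Interlocked.Increment(ref initialFailures);

    for (int attempt = 0; !result.Success && attempt < maxRetries; ++attempt)
    {
        // (backoff delay omitted for brevity)
        Interlocked.Increment(ref totalCalls);
        result = makeServiceCall();
    }

    if (result.Success)
        Interlocked.Increment(ref totalSuccesses);
    return result;
}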
For the majority of applications I’ve encountered, retry is rarely worth the effort it takes to design, implement, debug, test, and monitor. Under normal circumstances it’s almost irrelevant, maybe making the difference between 99% and 99.5% success rate. In unusual circumstances, it increases the load on the underlying service, and almost never results in a successful call. It’s a small win where it doesn’t matter, and a huge liability when it does matter.
If you have existing retry logic in your code, I encourage you to monitor its effectiveness. If, like me, you discover that it rarely provides benefit, I suggest you remove it.
If you’re considering adding retry logic to your code, be sure to consider the potential consequences. And add the monitoring up front.
Java has an option, -XX:MaxRAMFraction=XXX, that lets developers specify the maximum amount of available RAM a Java application should occupy. This sounds like a great idea, until you realize how it’s implemented. The number that you specify for “XXX” is an integer that represents the fraction of total RAM that should be used. The interpretation is 1/XXX. So -XX:MaxRAMFraction=1 tells the application to use as much RAM as it wants. -XX:MaxRAMFraction=2 tells it to use half of the memory. What if you want to use 75% of the memory? Or, perhaps, 2/3 of the memory? Tough luck. You get all, one-half, one-third, one-fourth, one-fifth, etc. You want to use 40%? Sorry, buddy. You can get 50% or 33%. Take your pick.
When I first heard this, I thought the guy who was explaining it was misinformed or pulling my leg.
I’d sure like to understand the thinking that went into this design decision, because from where I’m standing it looks like lunacy. I can’t imagine how such an idiotic design could make it through review and implementation without somebody with a modicum of intelligence and authority raising a stink. You mean nobody realized how stupid this is?
Sanity eventually prevailed (later JDK releases added a percentage-based option, -XX:MaxRAMPercentage), but I still want to understand how the original design got approved. That something so obviously wrong managed to make it through the approval process and was released on an unsuspecting world doesn’t give me much confidence in the competence of Java platform developers.
Imagine there’s a program that makes API calls through a proxy to another service. One of the parameters you pass to that proxy is a timeout value: the maximum time that the proxy should wait before giving up and returning with a timeout error. The function call would be something like:
result = callProxy("SomeService", "ServiceFunction", {parameters}, timeout);
The timeout value is defined in code as:
double timeout = 10;
And that code works for months or years until some programmer is in there editing things one day and says to himself, “A timeout of 10 milliseconds? That’s crazy. We should give the service at least a full second to respond.” He promptly changes the timeout value to 1000 and goes on his merry way. All the unit tests pass, the code is pushed, and continues to work for more months or years.
Until it doesn’t.
One fateful morning at about 02:00, you’re shocked out of bed by your pager screaming that there’s a problem. Bleary eyed, you stumble to the computer, log in, and find that your application has stopped serving requests. After a long search, you discover that “SomeService” is experiencing high CPU usage, but you can’t figure out why that would make your program stop serving requests. After all, your program should be getting a timeout error after a second, and the error handling code looks like it works.
At some point you discover the problem: timeout is actually expressed in seconds, not milliseconds. Rather than the one-second timeout the programmer thought he was setting, the code now waits 1,000 seconds: 16 minutes and 40 seconds.
Yes, something like this actually has happened. It wasn’t pretty.
Granted, quite a few things went wrong here, among them:
There was no comment describing the units that the timeout represents.
The timeout variable should have been named timeoutInSeconds.
The programmer should have looked more closely at the code before changing it.
The code reviewer should have caught the error.
The automated tests should have simulated a service timeout.
The error should have been caught. But it wasn’t.
A much more effective way to avoid an error like that is to use descriptive data types. If you’re expressing a time span, then don’t use a unit-less value like an integer or a floating point value. Use a data type intended for expressing time spans. In C#, for example, you’d use the TimeSpan, like this:
TimeSpan timeout = TimeSpan.FromSeconds(10);
It would be very difficult for a programmer to look at that and think, “10 milliseconds isn’t long enough.” And there’s absolutely no need for a comment here because the code says exactly what it does: it creates a timeout value of 10 seconds. It’s nearly impossible to screw this up. If, on the off chance, some drunken coder did screw it up, the person doing the code review would almost certainly catch it.
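For instance, if the callProxy function from the example above took a TimeSpan rather than a bare number, even the call site would be unambiguous (a sketch, continuing that hypothetical example):

// The units travel with the value; there's nothing for a maintainer to misread.
result = callProxy("SomeService", "ServiceFunction", {parameters}, TimeSpan.FromSeconds(10));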
These types of errors happen more often than you might like to think. Many of us don’t see these because we work in small companies on relatively new code, where the current employees are also the people who wrote the original code. Things are different when you work at a large company where people move around a lot, and the code you’re working on has been there for 10 years and the programmers who wrote it are off doing other things, or have left the company.
Do yourself and the people who will maintain your code in the future a favor. Use descriptive data types just like you use descriptive variable names. Unit-less values are errors waiting to happen. And I guarantee you don’t want to be the person who made this type of error when that error happens at 2 AM on a large retail web site, where every minute of down time means millions of dollars in lost sales. Nor do you want to be the person who has to track down such an error in those circumstances.
This is a re-post of an article I wrote for my .NET Reference Guide column some years ago.
You’re probably familiar with the Birthday problem, either from your introductory statistics class or from hanging around computer programmers for too long. Briefly, the math shows that the odds of any two people in a group having the same month and day of birth are quite a bit higher than you would intuitively expect. How high? Given a group of just 23 people, the odds are better than 50% that two of those people will share the same birthday. With 50 people in a group, the odds of two sharing the same birthday are above 97%, and by the time you have 60 people, it’s nearly a sure thing.
This falls into the realm of what I call, “Math that just don’t feel right.” Nevertheless, it’s true. We can demonstrate it with a simple program that selects random numbers in the range of 0 to 364 until it repeats itself.
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
const int NumKeys = 365;
const int Thousand = 1000;
const int Million = Thousand * Thousand;
const int NumTries = 10 * Million;
static void Main(string[] args)
{
// List contains number of items to collision for each iteration.
List<int> Totals = new List<int>();
// Do the test NumTries times, saving the result from each attempt.
for (int i = 0; i < NumTries; ++i)
{
int n = DoTest(NumKeys);
Totals.Add(n);
if ((i % 100000) == 0)
{
Console.WriteLine("{0,5:N0} - {1:N0}", i, n);
}
}
// Group the items by their result.
var grouped = from cnt in Totals
group cnt by cnt into newGroup
orderby newGroup.Key
select newGroup;
// Output a table of results and counts, with running percentage.
int runningTotal = 0;
foreach (var grp in grouped)
{
runningTotal += grp.Count();
Console.WriteLine("{0,3:N0}: {1,7:N0} ({2,8:P2}) ({3,8:P6})", grp.Key, grp.Count(),
(double)grp.Count() / NumTries, (double)runningTotal / NumTries);
}
Console.WriteLine("Min items to collision = {0:N0}", Totals.Min());
Console.WriteLine("Max items to collision = {0:N0}", Totals.Max());
Console.WriteLine("Average items to collision = {0:N0}", Totals.Average());
Console.ReadLine();
}
// Random number generator used for all attempts.
static Random rnd = new Random();
static int DoTest(int nkey)
{
HashSet<int> keys = new HashSet<int>();
// Pick random numbers in the range [0..nkey-1]
// until a duplicate is selected.
int key;
do
{
key = rnd.Next(nkey);
} while (keys.Add(key));
return keys.Count;
}
}
The program isn’t as complicated as it looks. I structured it so that you can run the test many times and then aggregate the results. In this particular case, I run 10 million iterations.
The result of each iteration (the count of random numbers selected before a duplicate was found) is saved in the Totals list. When all iterations are done, the code groups the results so that we know how many times each result occurred. Finally, the grouped results are output with a running percentage. Here’s the output from a sample run:
This result agrees with the math. It shows some other interesting things, too.
Look at how many times the second number selected was a duplicate of the first: 0.27%. And in more than one percent of the iterations, the fifth number selected was a duplicate! All told, the odds of finding a duplicate are 10% with just 14 numbers selected.
You can argue that birthdays aren’t exactly random, and you’d be right. Some days have more births than others. Although those real world differences are significant, they only affect the birthday problem by a few points. Maybe the break even point is really 25 rather than 23.
You can also argue that the numbers returned by System.Random aren’t truly random. Again, you’d be right. You could code up a better random number generator that uses the cryptographic services to generate numbers that are more nearly random, but you’ll find that it won’t have much of an effect on the results.
So what do birthdays have to do with hash keys?
Let’s say you’re writing a program that has to keep track of a few million strings and you want a quick way to index them and eliminate duplicates. A fast and easy solution is to use hash codes as keys. Right? After all, there are four billion different 32-bit hash codes, and you’re just going to use a few million.
You can probably see where this is going. What’s the likelihood that you’ll run into a duplicate? That’s right, the birthday problem raises its ugly head here. Let’s look at the math and see what it tells us.
What is the probability of getting a collision when selecting one million items out of a pool of four billion (actually, 2^32, or 4,294,967,296)?
According to the Wikipedia article, a coarse approximation, which is good enough for our purposes, is:
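p(n; d) ≈ 1 − e^(−n²/(2d)), where d is the size of the field and n is the number of items selected. A one-line C# version of the DuplicateProb function used below, based on that approximation (my own sketch of it):

// Coarse birthday-collision approximation: p(n; d) ≈ 1 - e^(-n² / (2d)).
static double DuplicateProb(double fieldSize, double itemsSelected)
{
    return 1.0 - Math.Exp(-(itemsSelected * itemsSelected) / (2.0 * fieldSize));
}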
The numbers this approximation produces don’t agree exactly with the calculated probabilities in the Wikipedia article, but they’re close: within the range of what you’d expect from a coarse approximation.
So, then, about our one million items selected from a field of four billion:
DuplicateProb(4294967296, 1000000) = 1
What?!?!
How can that be possible? The probability of getting a duplicate within the first million items is so close to certainty that the computer calls it 1.0? This has to be a mistake, right?
It happens to be true. Unfortunately, it’s a bit difficult to prove with the birthday test program because Random.Next() only has a range of 0..2^31. I could modify the program to extend its range, but that’s really not necessary. Instead, I’ve run the program for increasing powers of two and present the results here:
Field Size      Min        Max        Average
2^16            4          1,081      316
2^20            21         4,390      1,281
2^30            342        142,952    41,498
2^31            1,036      193,529    57,606
These are the results of 10,000 runs each. Doing 10 million runs of the 2^16 took quite a while, and the numbers weren’t much different. The only number that changed significantly was the Max value: the maximum number of items selected before a duplicate was found.
The table is more interesting when you replace the Min, Max, and Average values with their corresponding powers of two:
Field Size      Min        Max        Average
16              2.00       10.08      8.30
20              4.39       12.10      10.32
30              8.42       17.13      15.34
31              10.02      17.56      15.81
For every field size, the average number of items selected before a collision is approximately 2 raised to half the field-size exponent; that is, a little more than the square root of the field size. That holds true for the birthday problem, too. The square root of 365 is about 19.1, and the 50% probability point is 23.
A really quick way to estimate your 50% probability point, then, is to take the square root of the field size. That’s a bit aggressive (the more precise figure is √(2·d·ln 2), or about 1.18 times the square root), but it’ll put you in the ballpark.
More interesting is the Max value in the table. If you think in powers of two, then the Max value is approximately two bits more than the 50% value. That is, 2^30 is a billion items. The square root is 2^15, or 32,768. Two more bits is 2^17, or 128K (131,072). The Max value for 2^30, above, is 142,952: close enough to 2^17 for me.
Given the above, what do you think the numbers for 2^32 would be? The average will be around 83,000, and the Max will be close to 300,000.
That’s right, by selecting only 300,000 items from a field of four billion, you’re almost guaranteed to find a duplicate. You have a 1% probability of collision with just 6,000 items, and a 10% probability of collision with 21,000 items selected.
It’s true that hash codes are not purely random, but a good hash function will give you very close to a uniform distribution. The problem is worse if the hash codes aren’t evenly distributed.
In case you haven’t got the point yet, don’t try to use a hash code as a unique identifier. The likelihood of getting a duplicate key is just too high.
I know what you’re thinking: what about making a 64 bit hash code? Before you get too excited about the idea, take a look at the Probability Table from that Wikipedia article. With a 64-bit hash code, you have one chance in a million of getting a duplicate when you’re indexing six million items. For one million items, it’d be about one chance in 10 million. That sounds pretty unlikely, but it really isn’t when you consider how much data some programs index.
Granted, if you were only indexing a million items, you’d probably use something like a hash table to index them. It’s when you get into very large numbers of items that you begin worrying about the overhead imposed by your indexing method. Using a hash code sounds like a good option, until you discover the likelihood of generating a duplicate. Even with a 64-bit hash code, you have a 1% chance of generating a duplicate with only 610 million items.
Hash codes are great for quickly determining if two items can’t possibly match. That is, if two strings hash to different values, then they positively are not the same. However, hash codes are not a good solution for uniquely identifying items, because the probability of two different items hashing to the same value is surprisingly high.
Computer programming is complicated enough that we really don’t need to look for ways to further confuse programmers. On the contrary, we should actively look for ways to increase programmer understanding. Perhaps one of the easiest ways to confuse programmers is to choose bad names for things. For example, consider this little code snippet that’s supposed to demonstrate using a ManualResetEvent to control thread execution.
private ManualResetEvent suspended = new ManualResetEvent(true);
public void Main()
{
//
// suspend
suspended.Reset();
// resume
suspended.Set();
//
}
public void Work()
{
// long task
while (!stop)
{
suspended.WaitOne(/* optional timeout */);
//worker task
}
}
There’s nothing wrong with the way the snippet works, but naming the event suspended is one of the most confusing things I’ve seen this year. The suspended flag is set when the program is running. WTF? I’m pretty sure I’m not the only programmer who, if he were to encounter this in production code, would very likely misunderstand it.
“I have to wait for the suspended flag to be set before I can resume.”
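For what it’s worth, simply inverting the sense of the name removes the confusion. Here’s a sketch of the same snippet with a clearer name (canRun is my choice, not from the original):

// Signaled while the worker is allowed to run; reset it to suspend the worker.
private ManualResetEvent canRun = new ManualResetEvent(true);

// suspend
canRun.Reset();
// resume
canRun.Set();

// in the worker loop
canRun.WaitOne(/* optional timeout */);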
Come on, people. Think about what you call things. Naming is important!
The “new thing” when I first started working with computers was Structured Programming. One of the tenets was that a function should have exactly one entry point and exactly one exit point. Some languages (Pascal, for example) enforced this by not having a return statement that would allow you to exit the function from somewhere in the middle.
In retrospect, that probably was a good thing, because people used to do all kinds of things like jump into the middle of a function, exit a function without returning a value, etc. Restricting functions to a single entry point and a single exit point was an effective way to prevent those stupid programmer tricks. But the restrictions are a little too restrictive.
I can’t think of any reason for a function to have multiple entry points, but restricting a function to a single exit point leads to things like this:
public boolean isExternal(final URI url) {
boolean target = false;
if (url.isAbsolute()) {
// this can be expensive and slow, run only for absolute URLs which should be an exception
final String serverName = requestAttributesProvider.get().getServerName();
final String host = url.getHost();
if (!host.equals(serverName)) {
target = true;
}
}
return target;
}
This code has an unnecessary intermediate variable, and an unnecessary level of nesting. All due to the restriction on the number of exit points. If we relax that restriction and allow an early out, the code becomes less cluttered:
public boolean isExternal(final URI url) {
// this can be expensive and slow, run only for absolute URLs which should be an exception.
if (!url.isAbsolute()) {
return false;
}
final String serverName = requestAttributesProvider.get().getServerName();
final String host = url.getHost();
return !host.equals(serverName);
}
The idea here is to make a decision and then get out. There’s no reason to clutter the rest of the code with the “it’s not an absolute url” context. That allows us to eliminate the target variable, and reduces nesting in the function.
I find that code to be much more clear than the original.
A purist, by the way, would split that into two functions:
public boolean isExternal(final URI url) {
// this can be expensive and slow, run only for absolute URLs which should be an exception
if (!url.isAbsolute()) {
return false;
}
return urlHostIsExternal(url);
}
private boolean urlHostIsExternal(final URI url) {
final String serverName = requestAttributesProvider.get().getServerName();
final String host = url.getHost();
return !host.equals(serverName);
}
On a more general note, this is an anti-pattern:
if (something) {
return true;
}
else {
// do other stuff
}
return whatever;
If you return from inside the if block, then the else is implied. There’s no need for it. You can reduce nesting and increase legibility by eliminating it:
if (something) {
return true;
}
// do other stuff
return whatever;
I can’t think of any case in which such a transformation doesn’t make the code more readable, especially when working with cascading or nested if structures.
The same kind of thing applies to loops. Say you’re doing a sequential search for an item in a list. Structured Programming purists would have you write something like this:
bool found = false;
int ix = 0;
while (ix < list.length && !found)
{
    if (list[ix] == thingToFind)
    {
        found = true;
    }
    else
    {
        ++ix;
    }
}
if (found)
{
return ix;
}
else
{
return -1;
}
We can eliminate a lot of clutter by returning directly from within the loop:
int ix = 0;
while (ix < list.length)
{
if (list[ix] == thingToFind)
{
return ix;
}
++ix;
}
// here, we know that the item wasn't found, so just return sentinel value.
return -1;
Again, I can think of no case in which that transformation doesn’t improve readability.
Long-time readers probably wondered why some of the code examples above are in Java. Most of the code I’ll be working on in my new job is Java, so I figured it’d be a good idea to start using it. I’ll still be working with C# in my home projects, and reporting on them here, but there also will be Java from time to time.
Elmer got a new job working for the procurement department of a very large company. Boss called him in on his second day and said that he needed a rope long enough to go all the way around the Earth at the equator. Elmer’s job was to figure out how long the rope had to be, and to get quotes from suppliers: X feet of rope at Y dollars per foot.
Assume for the sake of this example that the Earth is perfectly spherical and smooth.
Elmer went back to his desk, did some calculation to figure out how much rope he’d need, and started burning up the phone talking to suppliers. A few hours later, quotes in hand, he entered the boss’s office and placed the quotes on his desk. Boss glanced at them and then said, “Good work. Before we place the order, this is enough rope to stretch all the way around the Earth, leaving a foot of space between the surface and the rope, right?”
Elmer, surprised, said, “No, sir. You asked for a rope to go all the way around the Earth. You didn’t say anything about a foot of slack.”
Boss pushed the quotes across the desk towards Elmer: “Well, you’ll have to re-do these quotes. I need the rope a foot away from the surface.”
Elmer, being the smart guy he is, replied “That’s okay, we’ll just add XX feet of rope to the original quote.”
What is the value of XX? How did Elmer figure that out so quickly?
My first games programming job was a huge learning experience. It was the first time I’d ever worked with people who were a lot smarter than I was. At least, smarter in terms of computer science. The people I’d worked with before understood specific things about programming, but they didn’t love programming like I did. To most of the people I’d worked with before, programming was just a job. I lived programming! I studied problems at home, solved problems at work, always had a few side projects, etc. Programming was my job and my most engaging hobby. Debra says that she shares me with my first wife: the computer.
One of the great things about working there was that we were always posing little brain teasers to each other. These ranged from math questions like the one above, to algorithms for things like detecting whether a linked list has a loop, to larger design questions that don’t really have a “right” answer but are designed to make us think about how to approach a problem. In many ways it was like an ongoing job interview. The puzzle I presented above was one of them. It’s one I solved after just a few seconds’ thought, but then had to check my work on the white board.
I’ve posed this question in interviews over the years, when a candidate says he’s good with math. The reason is that it’s a relatively simple problem that somebody who’s “good at math” should be able to solve pretty quickly. While I’m interested in the solution, I’m more interested in how the candidate approaches the problem. Does he know how to set up the equation? Does he ask me questions to clarify the problem? Does he just sit there looking like he has no idea where to start?
Some people think the problem is too specialized, but I think it’s more than reasonable to expect a programmer, especially one who says he’s good at math, to know the equation for the circumference of a circle. This goes double for games programmers!
It’s surprising how many candidates utterly fail when it comes to this question. Some just sit there, dumbfounded. Others get up at the white board and start doodling, but can’t formulate the problem. It’s embarrassing.
The circumference of a circle is two times Pi, times the radius. That is:
C = 2πr
The original quote was for C feet of rope–enough rope to go around the Earth with radius r. The new rope has to be a foot away from the surface of the Earth. That radius is (r+1). So you have the original rope with length 2πr, and the new rope with length 2π(r+1). We want to know the difference. Showing my work, as my instructors always insisted:
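new rope − original rope = 2π(r + 1) − 2πr
                         = 2πr + 2π − 2πr
                         = 2π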
Elmer’s answer was 2π, or about 6.28 feet. It only takes about 6.28 feet to move that rope from being against the surface of the Earth to being a foot away.
I’ve had candidates reach that solution and say, “but that can’t be right,” and go back to check their work. I consider that a good thing; being willing to admit that one might have made a mistake is something I look for in the people I work with.
I’ll admit that this fits squarely in the category of “math that just don’t feel right,” but it is right.
I’ve had other candidates argue that the answer is just flat wrong. Even after I show them the math. That’s bad. Not only is arguing with the interviewer bad form, but unsupported statements like, “I know that can’t be the right answer” show an unwillingness to learn and an inability to form reasoned, coherent arguments.
Note that I have nothing against polite, informed, disagreement from a candidate. That shows backbone and an ability to support one’s argument.
It’s a fun interview question, and can tell you a lot about the candidate’s approach to problems, and his reaction to things that are non-intuitive. Will he accept the math, or will he try to argue that the math is somehow wrong?
A common question on Stack Overflow goes something like this:
I would like help designing an algorithm that takes a given number, and returns a random number that is unrelated to the first number. The stipulations being that a) the given output number will always be the same for a similar input number, and b) within a certain range (ex. 1-100), all output numbers are distinct. ie., no two different input numbers under 100 will give the same output number.
or
I’m encoding 64-bit numbers using an alphabet. But since the binary representations of the numbers are more or less sequential, a simple mapping of bits to chars results in very similar strings. I’m looking for ways to make the output more “random”, meaning more difficult to guess the input when looking at the output string.
There are variants, of course, but they’re all pretty much the same question. The programmer asking the question is looking for a way to obfuscate sequential keys. That is, there is code that generates sequential keys for some kind of record, but the programmer wants to hide the fact that they’re sequential. So if the nth key generated is G75AB, he wants the (n+1)th to be, perhaps, XQJ57 or XYZZY; certainly not G75AC, which makes it obvious that the keys are sequential.
A related Stack Overflow question is this:
What is the quickest way to check that the random string I’ve generated hasn’t been generated by me before? I’m curious as to how places like imgur (which identifies each image by a unique 5 digit pin) carry this out. Clearly, one would have at the least an O(n) solution (or at best O(n log(n)), depending on the algorithm), but since I’m expecting n to be millions, this is going to be an infeasible solution.
This programmer wants to generate “random” keys. He’s decided that the way to generate unique keys is to just generate a random string, see if it’s been used, and if not go back and generate another. See How not to generate unique codes for an explanation of why random selection is an unsupportable algorithm, and How did this happen? for reasons why you shouldn’t be generating random strings.
The best way I know of to generate unique “random looking” keys is by obfuscating sequential numbers. You start with your first key, 1, obfuscate it, and encode the obfuscated number. Then increment the key and do it again.
Encoding is easy: it’s just a base conversion. One common encoding is base-64. In base-64, the integer 1795368205 is encoded as “DSUDaw”. That’s a nice encoding method, but it does nothing to hide the sequential nature of keys. The number 1795368206 becomes “DiUDaw”. Anybody who understands base-64 and how numbers are stored in a computer can quickly determine that those two strings represent sequential keys.
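For the curious, here’s one way to reproduce that encoding in C# (a sketch; it assumes the integer’s little-endian bytes are base-64 encoded and the padding trimmed):

// Base-64 encode the four little-endian bytes of the integer (on a little-endian
// machine) and drop the "==" padding.
static string EncodeBase64(int value)
{
    return Convert.ToBase64String(BitConverter.GetBytes(value)).TrimEnd('=');
}
// EncodeBase64(1795368205) => "DSUDaw"
// EncodeBase64(1795368206) => "DiUDaw"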
Obfuscation can take many forms. One simple but not particularly effective obfuscation would be to invert the order. That is, imagine you’re generating keys from 1 to 100. You could subtract the sequential number from 101. So the first key value you pass to the encoding method is 100. The next is 99, etc. Somebody looking at your keys would assume that you started at 100 and went down to 1. As I said, not particularly effective.
An effective obfuscation technique would take your sequential number and somehow turn it into another number that is wildly different. So 0 becomes, perhaps, 147,486,932. And 1 becomes 46,231. 2 becomes 186,282. There’s no way a casual observer could understand that the encoded versions of those numbers represent sequential keys. But how do you do that?
One of the interesting things about Eric Lippert’s article on this subject is that it shows how to create an obfuscation of sequential keys that is reversible. That is, I can take a number, obfuscate it by applying a simple bit of math, and then de-obfuscate it at a later time by applying a similarly simple bit of math. The code below, which is just a slight modification of what Lippert presents in his blog post, illustrates that.
private void DoIt()
{
const long m = 101; // Number of keys + 1
const long x = 387420489; // must be coprime to m
// Compute the multiplicative inverse
var multInv = MultiplicativeInverse(x, m);
// HashSet is used to hold the obfuscated value so we can ensure that no duplicates occur.
var nums = new HashSet<long>();
// Obfuscate each number from 1 to 100.
// Show that the process can be reversed.
// Show that no duplicates are generated.
for (long i = 1; i <= 100; ++i)
{
var obfuscated = i * x % m;
var original = obfuscated * multInv % m;
Console.WriteLine("{0} => {1} => {2}", i, obfuscated, original);
if (!nums.Add(obfuscated))
{
Console.WriteLine("Duplicate");
}
}
}
private long MultiplicativeInverse(long x, long modulus)
{
return ExtendedEuclideanDivision(x, modulus).Item1 % modulus;
}
private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
{
if (a < 0)
{
var result = ExtendedEuclideanDivision(-a, b);
return Tuple.Create(-result.Item1, result.Item2);
}
if (b < 0)
{
var result = ExtendedEuclideanDivision(a, -b);
return Tuple.Create(result.Item1, -result.Item2);
}
if (b == 0)
{
return Tuple.Create(1L, 0L);
}
var q = a / b;
var r = a % b;
var rslt = ExtendedEuclideanDivision(b, r);
var s = rslt.Item1;
var t = rslt.Item2;
return Tuple.Create(t, s - q * t);
}
The first few lines of output for that program are:
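1 => 43 => 1
2 => 86 => 2
3 => 28 => 3
4 => 71 => 4
5 => 13 => 5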
If you run it on your system, you’ll find that every input number from 1 to 100 generates a unique obfuscated number, also from 1 to 100. This technique is guaranteed not to generate a duplicate. See the links above for the mathematical proof.
The multiplicative inverse method of generating obfuscated sequential keys is elegantly simple, effective, and efficient. It works for any number of keys and obfuscates them well enough that somebody would have to expend considerable effort to determine what your value of x is, even if they knew that they had two values that were generated by sequential keys.
By the way, I would suggest that if you use this technique, you don’t use the number 387420489 for your x value. The person trying to determine your obfuscation scheme might run across this article.
The range of choices for x here is huge. It can be any number that’s larger than m (the number of keys you want the ability to generate), as long as x and m are relatively prime. The easiest way to ensure the second condition is to use a prime number for x. If you limit yourself to 64-bit prime numbers, you still have approximately 4.158E17 values to choose from. The likelihood of somebody guessing your number, or even figuring it out by brute force, is rather remote.
In my next entry, I’ll wrap this all up into a nice easy-to-use class.
Imagine that your application works with different document types. You have a server that stores documents in a database, and several applications that query the database, and briefly cache query results in shared memory.
One of the fields that’s returned from a search is the document type, which you’ve encoded in a C# enumeration:
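Something like this; only WordSmasher is named in the text, so the other members here are placeholders:

public enum DocumentType
{
    WordSmasher = 1,      // presumably the word processor
    NumberCruncher = 2,   // placeholder name
    SlideShower = 3       // placeholder name
}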
All of the applications include a shared library that defines common data structures and such, like DocumentSummary and DocumentType.
This all works great until you add support for a new type of document: the company’s new PowerPoint competitor, called SleepyAudience. You update your enumeration:
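Continuing the sketch above, with only the new SleepyAudience member taken from the text:

public enum DocumentType
{
    WordSmasher = 1,
    NumberCruncher = 2,   // placeholder name
    SlideShower = 3,      // placeholder name
    SleepyAudience = 4    // the new document type
}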
Then you compile and test your server code, build one of the other applications to test things with, and release it to the world. Only to find that the other applications, which haven’t yet been rebuilt, die a horrible death when they see the type “SleepyAudience”.
The application crashes because the deserialization code doesn’t know what to do when it finds the string “SleepyAudience” in the DocType. As far as that code is concerned, the DocType field can be an integer value, or it can have one of the first three strings. “SleepyAudience” is not the name of an enum value, as far as the deserializer is concerned.
The easiest solution to this problem is to encode the enumeration value as a number, rather than as a string. Had the DocType been encoded as a number, the deserializer would have read and populated a DocumentSummary instance. It’s then up to the application code to decide what to do with a document type that it is not familiar with. Perhaps the application would filter that document from the list presented to the user. Or it would display “Unknown document type.” Whatever the case, it certainly shouldn’t crash.
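To make that concrete, here’s a sketch assuming Newtonsoft.Json (the post doesn’t name a serializer; others have equivalent switches). By default Json.NET writes enums as numbers, and it’s the optional string converter that produces the fragile form:

// Requires: using Newtonsoft.Json; using Newtonsoft.Json.Converters;
// Serializes as {"DocType":4} by default (given the enum values sketched earlier):
// an older client can still read the number and decide what to do with an unknown value.
var summary = new DocumentSummary { DocType = DocumentType.SleepyAudience };
string asNumber = JsonConvert.SerializeObject(summary);

// With a string converter the same object serializes as {"DocType":"SleepyAudience"},
// which an older client's deserializer can't map back to its DocumentType at all.
string asString = JsonConvert.SerializeObject(summary, new StringEnumConverter());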
You lose a little bit of readability by encoding the enum value as a number, but only if you’re in the practice of eyeballing JSON data. And, yes, it’s often easier for people who are writing code that interacts with your API to understand what “WordSmasher” means. But we programmers are used to associating numbers with names. It’s no particular burden for a programmer to look up the value “1” in the API documentation to determine that it identifies the WordSmasher application.
One solution to this problem, of course, is to make sure that all applications that depend on the enumeration are rebuilt and released whenever the enumeration changes. That would solve the problem, but it’s not practical when you have dozens of different applications online all the time. You can’t just stop the world for a few hours (or even a few minutes) to apply updates.
You could also argue that the applications themselves are at fault, and I would agree with that assessment. After all, an application shouldn’t crash horribly because it got some bad data. The application should ignore that one data item (in this case, a single document record from a list of documents), log an error message, and continue. But standard serialization libraries aren’t that robust. They either understand the entirety of what’s sent, or they roll over and die.
The Robustness principle says “Be conservative in what you send, be liberal in what you accept.” When applied to deserialization it means that if you don’t understand something, deal with it gracefully; do the best you can. Fail gracefully. Interpret the value or ignore it and supply a default. If you can’t do that, then discard the record. If you can’t discard the record, discard the input. At worst return an empty data set. But under no circumstances should the program go down in flames because it didn’t understand some input data.
Regardless of who’s at fault here, the point is that encoding enumeration values as strings has very little, if any, real benefit, and makes it harder to write robust deserialization code. Especially when working with deserialization libraries (as opposed to rolling your own deserialization code), you should do whatever you reasonably can to ensure that the deserializer can interpret your data. Let the rest of your application decide how to proceed, after the data has been read.
In my opinion, the application I describe above has a more fundamental flaw in that the document types it supports are hard coded in an enumeration. But that’s a discussion for another day.
Last time I showed two different implementations of the naive method for generating unique coupon codes. The traditional method does this:
do
{
id = pick random number
} until id not in used numbers
code = generate code from id
The other is slightly different:
do
{
id = pick random number
code = generate code from id
} until code not in used codes
Before we start, understand that I strongly discourage you from actually implementing such code. The naive selection algorithm performs very poorly as the number of items you choose increases. But the seemingly small difference in these two implementations allows me to illustrate why I think that the second version is broken in a fundamental way.
Think about what we’re doing here. We’re obtaining a number by which we are going to identify something. Then we’re generating a string to express that number. It just so happens that in the code I’ve been discussing these past few entries, the expression is a six-digit, base-31 value.
So why are we saving the display version of the number? If we were going to display the number in decimal, there’d be no question: we’d save the binary value. After all, the user interface already knows how to display a binary value in decimal. Even if we were going to display the number in hexadecimal, octal, or binary, we’d store the binary value. We certainly wouldn’t store the string “3735928559”, “11011110101011011011111011101111”, or “DEADBEEF”. So, again, why store the string “26BB62” instead of the 32-bit value 33,554,432?
I can’t give you a good reason to do that. I can, however, give you reasons not to.
Storing a six-character string in the database takes more space than storing a 32-bit integer.
Everything you do with a six-character code string takes longer than if you were working with an integer.
Display logic should be part of the user interface rather than part of the database schema.
Changing your coding scheme becomes much more difficult.
The only argument I can come up with for storing codes rather than integer identifiers is that with the former, there’s no conversion necessary once the code is generated. While that’s true, it doesn’t hold up under scrutiny. Again, nobody would store a binary string just because the user interface wants to display the number in binary. It’s a simple number base conversion, for crying out loud!
If you’re generating unique integer values for database objects, then let the database treat them as integers. Let everybody treat them as integers. Except the users. “26BB62” is a whole lot easier to read and to type than is “33554432”.
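Here’s a sketch of what that display conversion might look like in C#. The actual 31-character alphabet isn’t shown in the original post, so the one here is made up:

// Hypothetical 31-character alphabet (digits and letters, with ambiguous ones removed).
const string Alphabet = "23456789ABCDEFGHJKLMNPQRSTUVWXZ";

// Convert an integer id to its display code: a simple base-31 conversion.
static string ToCode(long id)
{
    string code = "";
    do
    {
        code = Alphabet[(int)(id % 31)] + code;
        id /= 31;
    } while (id > 0);
    return code.PadLeft(6, Alphabet[0]);   // pad to the six-character display width
}

// And back again, when a user types a code in.
static long FromCode(string code)
{
    long id = 0;
    foreach (var c in code)
    {
        id = id * 31 + Alphabet.IndexOf(c);
    }
    return id;
}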
I’ll be charitable and say that whoever came up with the idea of storing the display representation was inexperienced. I’m much less forgiving of the person who approved the design. The combination of the naive selection algorithm, storing the generated coupon code rather than the corresponding integer, and the broken base conversion function smacks of an inexperienced programmer working without adult supervision. Or perhaps an inexperienced programmer working with incompetent adult supervision, which amounts to the same thing.
The mere existence of the naive selection algorithm in this code is a sure sign that whoever wrote the code either never studied computer science, or slept through the part of his algorithms class where they studied random selection and the birthday paradox (also, birthday problem). The comment, “There is a tiny chance that a generated code may collide with an existing code,” tells us all we need to know of how much whoever implemented the code understood about the asymptotic behavior of this algorithm. Unless your definition of “tiny” includes a 10% chance after only one percent of the codes have been generated.
The decision to store the generated code string surprises me. You’d think that a DBA would want to conserve space, processor cycles, and data transfer size. Certainly every DBA I’ve ever worked with would prefer to store and index a 32-bit integer rather than a six-character string.
The way in which the base conversion function is broken is hard evidence that whoever modified it was definitely not an experienced programmer. I suspect strongly that the person who wrote the original function knew what he was doing, but whoever came along later and modified it was totally out of his depth. That’s the only way I can explain how the function ended up the way it did.
Finally, that all this was implemented on the database looks to me like a case of seeing the world through database-colored glasses. Sure, one can implement an obfuscated unique key generator in T-SQL. Whether one should is another matter entirely. Whoever did it in this case shouldn’t have tried, because he wasn’t up to the task.
However this design came to exist, it should not have been approved. If there was any oversight at all, it should have been rejected. That it was implemented in the first place is surprising. That it was approved and put into production borders on the criminal. Either there was no oversight, or the person reviewing the implementation was totally incompetent.
With the code buried three layers deep in a database stored procedure, it was pretty safe from prying eyes. So the bug remained hidden for a couple of years. Until a crisis occurred: a duplicate code was generated. Yes, I learned that the original implementation didn’t even take into account the chance of a duplicate. Apparently somebody figured that with 887 million possible codes, the chance of getting a duplicate in the first few million was so remote as to be ignored. Little did they know that the chance of getting a duplicate within the first 35,000 codes is 50%.
I also happen to know that at this point there was oversight by a senior programmer who did understand the asymptotic behavior of the naive algorithm, and yet approved “fixing” the problem by implementing the duplicate check. He selected that quick fix rather than replacing the naive algorithm with the more robust algorithm that was proposed: one that does not suffer from the same pathological behavior. The combination of arrogance and stupidity involved in that decision boggles the mind.
The history of this bug is rooted in company culture, where the database is considered sacrosanct: the domain of DBAs, no programmers allowed, thank you very much. And although programmers have access to all of the source code, they aren’t exactly encouraged to look at parts of the system outside of their own areas. Worse, they’re actively discouraged from making suggestions for improvement.
In such an environment, it’s little surprise that the horribly broken unique key generation algorithm survives.
So much for that. Next time we’ll start looking at good ways to generate unique, “random-looking” keys.