A common question on Stack Overflow goes something like this:
I would like help designing an algorithm that takes a given number, and returns a random number that is unrelated to the first number. The stipulations being that a) the given output number will always be the same for a similar input number, and b) within a certain range (ex. 1-100), all output numbers are distinct. ie., no two different input numbers under 100 will give the same output number.
or
I’m encoding 64-bit numbers using an alphabet. But since the binary representations of the numbers are more or less sequential, a simple mapping of bits to chars results in very similar strings. I’m looking for ways to make the output more “random”, meaning more difficult to guess the input when looking at the output string.
There are variants, of course, but they’re all pretty much the same question. The programmer asking the question is looking for a way to obfuscate sequential keys. That is, there is code that generates sequential keys for some kind of record, but the programmer wants to hide the fact that they’re sequential. So if the nth key generated is G75AB, he wants the (n+1)th to be, perhaps, XQJ57 or XYZZY; certainly not G75AC, which makes it obvious that the keys are sequential.
A related Stack Overflow question is this:
What is the quickest way to check that the random string I’ve generated hasn’t been generated by me before? I’m curious as to how places like imgur (which identifies each image by a unique 5 digit pin) carry this out. Clearly, one would have at the least an O(n) solution (or at best O(n log (n))depending on the algorithm), but since I’m expecting n to be millions, this is going to be an infeasible solution.
This programmer wants to generate “random” keys. He’s decided that the way to generate unique keys is to just generate a random string, see if it’s been used, and if not go back and generate another. See How not to generate unique codes for an explanation of why random selection is an unsupportable algorithm, and How did this happen? for reasons why you shouldn’t be generating random strings.
The best way I know of to generate unique “random looking” keys is by obfuscating sequential numbers. You start with your first key, 1, obfuscate it, and encode the obfuscated number. Then increment the key and do it again.
Encoding is easy: it’s just a base conversion. One common encoding is base-64. In base-64, the integer 1795368205 is encoded as “DSUDaw”. That’s a nice encoding method, but does nothing to hide the sequential nature of keys. The number 1795368206 becomes “DiUDaw”. Anybody who understands base-64 and how numbers are stored in a computer can quickly determine that those two strings represent sequential keys.
Obfuscation can take many forms. One simple but not particularly effective obfuscation would be to invert the order. That is, imagine you’re generating keys from 1 to 100. You could subtract the sequential number from 101. So the first key value you pass to the encoding method is 100. The next is 99, etc. Somebody looking at your keys would assume that you started at 100 and went down to 1. As I said, not particularly effective.
An effective obfuscation technique would take your sequential number and somehow turn it into another number that is wildly different. So 0 becomes, perhaps, 147,486,932. And 1 becomes 46,231. 2 becomes 186,282. There’s no way a casual observer could understand that the encoded versions of those numbers represent sequential keys. But how do you do that?
The answer is a particularly cool bit of math called a modular multiplicative inverse. Despite the name, it’s not a magical incantation. Eric Lippert explains it well in his blog post, Math from scratch, part thirteen: multiplicative inverses. (In case you’re wondering, the acronym he uses in that second sentence, “WOLOG” means, “Without loss of generality“.)
One of the interesting things about the article is that it shows how to create a function that will obfuscate sequential keys that are reversible. That is, I can take a number, obfuscate it by applying a simple bit of math, and then de-obfuscate it at a later time by applying a similarly simple bit of math. The code below, which is just a slight modification of what Lippert presents in his blog post, illustrates that.
private void DoIt()
{
const long m = 101; // Number of keys + 1
const long x = 387420489; // must be coprime to m
// Compute the multiplicative inverse
var multInv = MultiplicativeInverse(x, m);
// HashSet is used to hold the obfuscated value so we can ensure that no duplicates occur.
var nums = new HashSet<long>();
// Obfuscate each number from 1 to 100.
// Show that the process can be reversed.
// Show that no duplicates are generated.
for (long i = 1; i <= 100; ++i)
{
var obfuscated = i * x % m;
var original = obfuscated * multInv % m;
Console.WriteLine("{0} => {1} => {2}", i, obfuscated, original);
if (!nums.Add(obfuscated))
{
Console.WriteLine("Duplicate");
}
}
}
private long MultiplicativeInverse(long x, long modulus)
{
return ExtendedEuclideanDivision(x, modulus).Item1 % modulus;
}
private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
{
if (a < 0)
{
var result = ExtendedEuclideanDivision(-a, b);
return Tuple.Create(-result.Item1, result.Item2);
}
if (b < 0)
{
var result = ExtendedEuclideanDivision(a, -b);
return Tuple.Create(result.Item1, -result.Item2);
}
if (b == 0)
{
return Tuple.Create(1L, 0L);
}
var q = a / b;
var r = a % b;
var rslt = ExtendedEuclideanDivision(b, r);
var s = rslt.Item1;
var t = rslt.Item2;
return Tuple.Create(t, s - q * t);
}
The first few lines of output for that program are:
1 => 43 => 1
2 => 86 => 2
3 => 28 => 3
4 => 71 => 4
5 => 13 => 5
6 => 56 => 6
7 => 99 => 7
8 => 41 => 8
9 => 84 => 9
10 => 26 => 10
If you run it on your system, you’ll find that every input number from 1 to 100 generates a unique obfuscated number, also from 1 to 100. This technique is guaranteed not to generate a duplicate. See the links above for the mathematical proof.
The multiplicative inverse method of generating obfuscated sequential keys is elegantly simple, effective, and efficient. It works for any number of keys and obfuscates them well enough that somebody would have to expend considerable effort to determine what your value of x is, even if they knew that they had two values that were generated by sequential keys.
By the way, I would suggest that if you use this technique, you don’t use the number 387420489 for your x value. The person trying to determine your obfuscation scheme might run across this article.
The choice of x here is huge. It can be any number that’s larger than m (the number of keys you want the ability to generate), and x and m must be relatively prime. The easiest way to ensure the second condition is to use a prime number for x. If you limit yourself to 64 bit prime numbers, you still have approximately 4,158E17 values to choose from. The likelihood of somebody guessing your number, or even figuring it out by brute force, is rather remote.
In my next entry, I’ll wrap this all up into a nice easy-to-use class.