If you know how many items are in a collection, picking one at random is trivial: just select a random number from 0 to count-1
, and you’re done. Things get more interesting when you don’t know how many items are in the collection. The naive way to do the selection is to first make an array. Then you can use the solution I just mentioned. For example:
readonly Random _rnd = new Random();
int GetRandomItem(IEnumerable<int> items)
{
var a = items.ToArray();
var ix = _rnd.Next(a.Length - 1);
return a[ix];
}
Whereas that works, it can be expensive or even impossible to create a new array if the collection is very large.
When the question is modified to include the restriction that the collection won’t fit in memory (say, a text file with hundreds of millions of lines), things get more interesting still. You could go through the file to count the lines, pick a random number from 0 to count - 1
, and then go back through the file to read that many lines. In code:
var count = File.ReadLines(filename).Count(); // reads the whole file
var ix = _rnd.Next(count - 1);
var sel = File.ReadLines(filename) // potentially reads the whole file
.Skip(ix-1)
.First();
return sel;
If the collection is a file, then you end up reading the file once, and at least part of the file twice. On average, you end up reading the file halfway through the second time. It works, but it’s pretty expensive.
Worse, it’s impossible to use that technique if the collection is a bunch of records coming in from a network stream. It’s not like you can rewind the stream and start over after you figure out how many records there are.
There is a way to do this that doesn’t require knowing up front how many items are in the collection, and guarantees that each item has the same probability of being selected. Using this technique, you can select any number of items in a single pass through the collection. Provided, of course, that you have enough memory to hold the selected items.
Imagine that you are asked to select an item at random from a sequence of items presented to you one at a time. You don’t know how many items will be presented, and you have to prove that each item was given equal opportunity to be selected. You can’t decide to pick the first item every time, and you can’t decide to pick the fifteenth item, for example, because there might not be fifteen items presented. What to do?
First, since you always have to be prepared to present a selected item, it follows that you must select the first item because there might be only one item presented. But you can’t unconditionally keep the first item. When the second item is presented, you have to decide whether you should keep the first item, or replace it with the second item. How? By flipping a coin. Heads you keep the first, tails you replace it. Assuming a fair flip of a fair coin, then after two items half the time you’ll be holding the first item, and half the time you’ll be holding the second item.
Now the third item comes along. How do you decide whether to keep your previously selected item, or if you should replace it with the third item? You can’t flip a coin because doing so would mean that half the time you’d keep the third item. You want to keep the third item one third of the time.
What you really want is to roll a three-sided die and if it turns up a 3, replace the currently selected item with that third item. It should be obvious that the third item will be selected one third of the time.
If you don’t believe that this technique treats all items equally (i.e. gives every item an equal chance of ending up as the selected item), consider the possibilities with three items.
First | Coin | Die | Result |
---|---|---|---|
1 | Heads | 1 | 1 |
1 | Heads | 2 | 1 |
1 | Heads | 3 | 3 |
1 | Tails | 1 | 2 |
1 | Tails | 2 | 2 |
1 | Tails | 3 | 3 |
Those are all of the possibilities when selecting from three items. Each number is selected in two of the six cases, or one third of the time.
If you continue in this manner, selecting the nth item only if rolling an n-sided die turns up n, then when you’ve reached the end of the list you will have given each item an equal chance of being selected.
Let’s see if that really does work. The program below uses the described technique to randomly select a value from 1 to 100. The program does that 10 million times and reports the results.
private void DoIt()
{
const int arraySize = 100;
const int iterations = 10*1000*1000;
var selected = new int[arraySize];
for (var i = 0; i < iterations; ++i)
{
var sel = SelectRandomInt(arraySize);
++selected[sel];
}
for (var i = 0; i < selected.Length; ++i)
{
Console.WriteLine("{0}: {1} ({2:P4})", i, selected[i],
(double) selected[i]/iterations);
}
Console.WriteLine("Expected value = {0:N0}", iterations/arraySize);
Console.WriteLine("Min value = {0:N0}", selected.Min());
Console.WriteLine("Max value = {0:N0}", selected.Max());
}
private readonly Random _rnd = new Random();
private int SelectRandomInt(int max)
{
var selected = -1;
for (int i = 0; i < max; ++i)
{
if (_rnd.Next(i + 1) == i)
selected = i;
}
return selected;
}
The “expected” number that’s output is just the number of iterations divided by the number of items. In this case, 100,000. That’s the number of times you would expect any item to be selected, given a sufficiently large number of trials. This program outputs the number of times each item is selected, but below I show just the minimum and maximum.
Expected value = 100,000
Min value = 98,012
Max value = 101,257
Not surprisingly, not every item was selected exactly 100,000 times. But it was close. The maximum variance was less than 2% on the low side, and about 1.3% on the high side. How does that compare with just selecting a random number from 0 to 99?
I replaced the SelectRandomInt
method above with one that just returns a random number from 0 to max-1
, like this:
private int SelectRandomInt(int max)
{
return _rnd.Next(max);
}
That gave me:
Expected value = 100,000
Min value = 99,069
Max value = 100,842
This second version consistently gives me results that are closer to the expected values. The variance is approximately half. That is, where the first version might vary by 2% on the low side, this version varies by 1%. I think the difference is due to the random number generator being slightly biased (that is, it doesn’t produce a perfectly uniform distribution), and because we’re selecting many more random numbers with the first version than with the second, that bias is amplified. It’s not so far out of whack that I find it unusual, but I do think it merits further study.
Random number weirdness aside, you can see here that I was able to randomly choose one item from a collection of unknown size.
You wouldn’t use that technique to select a random number from 0 to 99. A real method would take an IEnumerable<T>
and return a randomly selected item. The code looks like this:
T SelectRandomItem<T>(IEnumerable<T> source, Random rnd)
{
T selectedItem;
int nItems = 0;
foreach (var item in source)
{
++nItems;
if (rnd.Next(nItems) == 0)
{
selectedItem = item;
}
}
return selectedItem;
}
Note that I modified the test a bit. In the previous version I checked to see if the selected random number was equal to the top of the range. Here, I test to see if it’s equal to zero. The effect is the same because the conditional is selecting one item from n
possible items. The probability that the random number will be 0 is the same as the probability that it’s equal to (nItems - 1)
. The reason I changed it here is that it’s easier for me to think about it this way, and it simplifies the task of modifying this code to select multiple items at random.
Using that code is very simple. If you want to select a line from a large text file, you can do it in a single line of code:
string selected = SelectRandom(File.ReadLines("foo.txt", new Random()));
You might ask why I pass in the random number generator rather than creating a new one when the method is called, or expecting there to be one at class scope. There are several reasons.
The default Random
constructor seeds the random number generator with the current time, which is expressed in milliseconds. If I were to call SelectRandomItem
multiple times in a single millisecond (which could happen when working with short lists), I’d be very likely to select the same item each time. Consider, for example, selecting from two lists:
IEnumerable<string> firstNames = LoadFirstNames();
IEnumerable<string> lastNames = LoadLastNames();
var firstName = SelectRandomItem(firstNames);
var lastName = SelectRandomItem(lastNames);
If firstNames
and lastNames
were very short, then whenever I selected the third item from one, I’d be very likely to select the third item from the other. That wouldn’t make for a very good random name generator.
Programmers often want a repeatable random sequence, especially for testing purposes. That’s not possible if the method creates a new random number generator with every call. It’s possible if there is a random number generator dedicated to the SelectRandomItem
method, but if any other piece of code uses that random number generator or calls SelectRandomItem
, then the sequence is lost. If I want to guarantee a repeatable random sequence, then I need to supply the random number generator to the method that uses it.
Selecting multiple items
The algorithm to select multiple items in a single pass is a straightforward generalization of the single item selection algorithm. Rather than dealing with a single “currently selected” item, we maintain an array of them. And to decide whether to select a particular item, we get a random number and compare it with the number of items to be selected. So, we replace this line in the inner loop:
if (_rnd.Next(nItems) == 0)
With this one:
if (_rnd.Next(nItems) < numToSelect)
If the random number is smaller than the number of items to select, then we have to randomly replace one of the selected items. The resulting method looks like this:
T SelectRandomItems<T>(
IEnumerable<T> source, int numToSelect, Random rnd)
{
T[] selectedItems = new T[numToSelect];
int nItems = 0;
foreach (var item in source)
{
++nItems;
if (rnd.Next(nItems) < numToSelect)
{
var ix = rnd.Next(numToSelect);
selectedItems[ix] = item;
}
}
return selectedItems;
}
When selecting more than a single item, the output from that method is surprisingly similar to the output from a method that selects unique random numbers in the same range. So similar, in fact, that I can’t tell them apart. I find that a little odd. There is a definite difference in the results from selecting a single item, but when selecting multiple items the difference is not detectable. I do need to investigate why that difference in behavior exists.
And that’s all there is to it: a method that will fairly select one or more items at random from a collection of unknown size, and do it with a single scan of the entire collection. If there is a better way to do this, I don’t know what it is.
Source
I first ran across this problem in Jon Bently’s Programming Pearls. He cites Volume 2: Fundamental Algorithms of Knuth’s The Art of Computer Programming. See section 3.4.2: Random Sampling and Shuffling.