Performance testing the TopBy method

The purpose of my TopBy extension method is to provide a faster and more memory efficient way to select the top k items from a list. Absent this method, the LINQ way to get those items is to sort the sequence in descending order and then take the top k. For example:

    var oldestPeople = people
        .OrderByDescending(x => x.Age)
        .Take(10);

That will give you the 10 oldest people in the list. My TopBy method makes it a little easier:

    var oldestPeople = people.TopBy(10, x => x.Age);

That reads better to me and, done right, should be a whole lot faster than the existing LINQ alternative.

When LINQ does that selection, it has to copy the entire source list (people in this case), sort the entire list, and then return the top 10. That requires extra memory proportional to the size of the source list, and time proportional to n*log2(n). If you want to select the top 10 from a list of one million, LINQ will create another list of one million items to sort.

It should also be clear that LINQ will take almost the same amount of time to select the top 10 as it would to select the top 100,000. The bottleneck in the LINQ approach is the sort, and it always sorts the entire list no matter how few items you take.

My TopBy method uses a heap selection algorithm similar to the one that I discussed in my article When theory meets practice. This algorithm requires extra space proportional to k, the number of items to be selected. It takes time proportional to n*log2(k), meaning that it should be much faster than the LINQ method when selecting a relatively small percentage of items in the list, and it will use a whole lot less memory.
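
To put numbers on that: for n = 1,000,000 and k = 100, n*log2(n) is roughly 20 million comparisons, while n*log2(k) is under 7 million even in the worst case (and in practice far fewer, because most items never displace anything in the heap). If you want to see the shape of the algorithm, here is a minimal sketch of heap selection. It isn't the actual TopBy implementation (which also supports ThenBy and ThenByDescending); it leans on the PriorityQueue class from .NET 6, and TopBySketch is just an illustrative name.

    // A sketch of the heap-selection idea, not the actual TopBy implementation
    // (which also supports ThenBy/ThenByDescending). Requires .NET 6 or later
    // for PriorityQueue<TElement, TPriority>; the name TopBySketch is made up.
    using System;
    using System.Collections.Generic;

    public static class TopBySketchExtensions
    {
        public static IEnumerable<TSource> TopBySketch<TSource, TKey>(
            this IEnumerable<TSource> source, int count, Func<TSource, TKey> keySelector)
        {
            // Min-heap keyed by the selector: the smallest key among the
            // current "top count" items is always at the root.
            var heap = new PriorityQueue<TSource, TKey>();
            foreach (var item in source)
            {
                if (heap.Count < count)
                    heap.Enqueue(item, keySelector(item));
                else
                    // Replaces the root only when the new key is larger than
                    // the current minimum; otherwise the heap is unchanged.
                    heap.EnqueueDequeue(item, keySelector(item));
            }

            // The heap now holds the items with the largest keys. Dequeue yields
            // them smallest-key first, so fill the result array from the back to
            // return them in descending order, like OrderByDescending().Take(count).
            var result = new TSource[heap.Count];
            for (int i = result.Length - 1; i >= 0; --i)
                result[i] = heap.Dequeue();
            return result;
        }
    }

With that in place, people.TopBySketch(10, x => x.Age) returns the same ten oldest people as the OrderByDescending/Take version above, although ties may come back in a different order because this sketch doesn't do a stable ordering.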

One other benefit of the new TopBy method is that you can use it to select the top items from a list that won’t even fit in memory. Imagine, for example, that you have a text file that contains hundreds of millions of records, more than will fit in memory. You want to select the top 100 records from that file. You can’t do that with OrderBy because you can’t fit those hundreds of millions of records in memory. But TopBy only needs space for 100 records, so you can easily use it to select the records you want.
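
Something like the following would do it; the file name and the key parsing here are made up, but the shape is the point: the file is streamed one line at a time, and only the current top 100 candidates are ever held in memory.

    // Hypothetical example: "huge-records.txt" and the key parsing are
    // placeholders for whatever your records actually look like.
    // File.ReadLines streams the file one line at a time, and TopBy keeps
    // only the 100 best candidates seen so far.
    var top100 = File.ReadLines("huge-records.txt")
        .TopBy(100, line => long.Parse(line.Split(',')[0]))
        .ToList();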

That’s the theory. Let’s see how it works in practice.

The short version is that TopBy is faster than OrderBy followed by Take when the number of items you want to select is less than about 25% of the total number of items in the source list. For primitive types like integers, “about” is closer to 20%; for strings it’s closer to 40%. The more expensive the key comparisons, the larger the fraction of the list for which TopBy wins.

How much faster depends mostly on the ratio of selected items (k) to total items (n). When k/n is small, TopBy is hugely faster. For example, selecting the top 100 items from a list of 1,000,000 strings takes TopBy about 500 milliseconds on my machine. It takes the OrderBy method about 17 seconds.

When k/n approaches 0.25, they’re about the same speed, and when k/n is near 1.0, TopBy takes about twice as long. Again, TopBy has an advantage as the cost of key selection and comparison increases.

The first table below compares the performance of OrderBy and TopBy when selecting varying percentages of integers from different list sizes. The second table compares the performance when selecting strings.

| Percent selected | OrderBy (100) | TopBy (100) | OrderBy (1,000) | TopBy (1,000) | OrderBy (10,000) | TopBy (10,000) | OrderBy (100,000) | TopBy (100,000) | OrderBy (1,000,000) | TopBy (1,000,000) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.01% | | | | | 9.41 | 0.76 | 112 | 7.60 | 1,475 | 75.25 |
| 0.10% | | | 0.74 | 0.08 | 8.87 | 0.79 | 112 | 8.18 | 1,479 | 84.73 |
| 1% | 0.04 | 0.01 | 0.73 | 0.10 | 8.89 | 1.20 | 113 | 14.84 | 1,477 | 176.37 |
| 10% | 0.04 | 0.03 | 0.81 | 0.39 | 9.06 | 5.18 | 116 | 66.84 | 1,536 | 854 |
| 25% | 0.04 | 0.04 | 0.71 | 0.69 | 9.18 | 9.75 | 116 | 126 | 1,501 | 1,677 |
| 50% | 0.04 | 0.07 | 0.73 | 1.35 | 9.50 | 14.32 | 118 | 189 | 1,554 | 2,547 |
| 75% | 0.04 | 0.09 | 0.73 | 1.35 | 9.59 | 17.01 | 118 | 224 | 1,572 | 3,106 |
| 90% | 0.05 | 0.09 | 0.75 | 1.32 | 9.57 | 18.55 | 123 | 235 | 1,699 | 3,287 |

Time (ms) to select integers. Columns give the total list size; blank cells are cases where the number of items to select rounds to zero.

| Percent selected | OrderBy (100) | TopBy (100) | OrderBy (1,000) | TopBy (1,000) | OrderBy (10,000) | TopBy (10,000) | OrderBy (100,000) | TopBy (100,000) | OrderBy (1,000,000) | TopBy (1,000,000) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.01% | | | | | 97.85 | 5.14 | 1,499 | 54.43 | 17,595 | 540 |
| 0.10% | | | 6.63 | 0.53 | 94.79 | 5.40 | 1,455 | 58.73 | 16,410 | 612 |
| 1% | 0.45 | 0.06 | 7.20 | 0.78 | 91.62 | 8.42 | 1,360 | 122 | 16,670 | 1,364 |
| 10% | 0.53 | 0.14 | 6.54 | 2.40 | 91.11 | 36.16 | 1,376 | 556 | 16,877 | 6,919 |
| 25% | 0.42 | 0.27 | 6.61 | 4.62 | 100 | 69.69 | 1,337 | 967 | 17,147 | 13,375 |
| 50% | 0.44 | 0.43 | 7.21 | 7.34 | 90.45 | 104 | 1,299 | 1,481 | 16,854 | 20,561 |
| 75% | 0.46 | 0.52 | 6.47 | 8.60 | 94.60 | 124 | 1,318 | 1,823 | 17,106 | 24,166 |
| 90% | 0.46 | 0.58 | 6.93 | 9.17 | 105 | 135 | 1,400 | 1,998 | 17,886 | 26,071 |

Time (ms) to select strings. Columns give the total list size; blank cells are cases where the number of items to select rounds to zero.

The important thing to note here is that for a given list size, OrderBy followed by Take takes nearly constant time regardless of how many items are selected. That is, selecting 100 items from a list of 1 million takes approximately the same amount of time as selecting 900,000 items from the same list. In comparison, the time required by TopBy depends on the ratio of selected items to total items.

The time I spent developing the TopBy extension method is a huge win for me because I regularly select small numbers of items from very large lists. Selecting the top 100 from a list of 1 million is a common operation in the work that I do. Being able to do that in about half a second rather than 17 seconds means that my programs run much faster. They also require a whole lot less memory.

I also find myself, sometimes, having to select the top 100 items from a list that’s too large to fit in memory. Again, that’s something TopBy can do but OrderBy can’t. I realize my situation is a little out of the ordinary: most people who work with hundreds of millions of records will have them in a database and can use SQL to do the selection.

It’s important to point out, though, that my TopBy extension method isn’t the fastest possible approach. I have a custom generic selection method that’s three to five times faster and outperforms LINQ’s OrderBy even when selecting 100% of the items. Unfortunately, it doesn’t support the ThenBy and ThenByDescending methods, and it doesn’t produce a stable ordering. It works more like List<T>.Sort in that I pass it a comparison delegate. It’s useful, but it doesn’t compose well with LINQ.

All in all, this was an interesting and enjoyable diversion. I learned a lot about how LINQ to Objects works under the hood, and I ended up with a very useful extension method that makes my programs run faster and use less memory.

Below is the code I used to generate the data that went into the tables above. Of course, all tests were run with a release build with no debugger attached.


    private const int Million = 1000*1000;
    private const int Billion = 1000*Million;
    private const int NumIterations = 10;

    private static readonly double[] Percentages = new double[] {0.0001, 0.001, 0.01, 0.1, .25, .50, .75, .90};
    private static readonly int[] Sizes = new int[] {100, 1000, 10000, 100000, Million};
    private static void TestSelect()
    {
        // prime the jitter pump
        TestOrderBy(new List<int> {1, 2}, 1, 1);
        TestOrderBy(new List<string> {"foo", "bar"}, 1, 1);
        TestTopBy(new List<int> { 1, 2 }, 1, 1);
        TestTopBy(new List<string> { "foo", "bar" }, 1, 1);

        foreach (var numItems in Sizes)
        {
            foreach (var t in Percentages)
            {
                int numSelect = (int) (t*numItems);   // can be 0 for the smallest lists; those cells are blank in the tables
                // Swap the commented line to test integers instead of strings.
                //var items = CreateRandomInts(numItems);
                var items = CreateRandomStrings(numItems, 4, 20);
                var orderBy = TestOrderBy(items, numSelect, NumIterations);
                var topBy = TestTopBy(items, numSelect, NumIterations);
                Console.WriteLine("{0},{1:N2},{2},{3},{4}", numItems, t*100, numSelect, orderBy, topBy);
            }
        }
    }

    private static double TestOrderBy<T>(List<T> items, int numSelect, int numIter)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < numIter; ++i)
        {
            var sel = items.OrderByDescending(k => k).Take(numSelect).ToArray();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds/numIter;
    }

    private static double TestTopBy<T>(List<T> items, int numSelect, int numIter)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < numIter; ++i)
        {
            var sel = items.TopBy(numSelect, k => k).ToArray();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds/numIter;
    }

    private static Random rnd = new Random();

    static List<int> CreateRandomInts(int howMany)
    {
        return Enumerable.Range(0, howMany).Select(i => rnd.Next()).ToList();
    }

    static List<string> CreateRandomStrings(int howMany, int minLength, int maxLength)
    {
        var items = new List<string>(howMany);
        for (int i = 0; i < howMany; ++i)
        {
            int len = rnd.Next(minLength, maxLength + 1);   // Next's upper bound is exclusive, so add 1 to include maxLength
            var sb = new StringBuilder(len);
            for (int c = 0; c < len; ++c)
            {
                int x = rnd.Next(52);
                if (x < 26)
                    sb.Append((char)(x + 'A'));
                else
                    sb.Append((char)(x - 26 + 'a'));
            }
            items.Add(sb.ToString());
        }
        return items;
    }