ReSharper annoyances

If you’re writing C# code, you should be using ReSharper. It’s a Visual Studio plugin that examines your code in real time (as you edit), helping you avoid common errors and write better code. I admit that I found ReSharper annoying when I first started using it, but after a week or so I found it invaluable. I found it so useful at work that I purchased a license myself so that I can use it at home. It’s not exactly cheap (about $150), but I find that it improves my code and saves me time. Well worth the price.

But sometimes its suggestions are just annoying. One in particular drives me bonkers. Consider this fragment from my heap selection code:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

ReSharper suggests that I “Invert ‘if’ statement to reduce nesting.” If I tell it to perform the transformation, it turns the code into this:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) <= 0) continue;
        // The current item is larger than the smallest item.
        // So move the current item to the root and sift down.
        Swap(0, _numItems);
        SiftDown(0);
    }

So rather than having a simple conditional and a short bit of subordinate code, it inverts the conditional and adds an extra logical branch in the flow. All for no good reason. Rather than saying “If condition do x,” the code is now saying “If not condition go on to the next iteration, otherwise do x.” Negative logic and a more complicated control flow does not, in my opinion, constitute an improvement.

ReSharper is suggesting that I make my code harder to read and harder to understand.

Ostensibly, ReSharper is helping to avoid the arrow anti-pattern, but it’s being too aggressive. I fully agree that one should avoid writing such hideous code. However, the arrow anti pattern involves “excessive nested conditional operators.” One conditional operator is not excessively nested. What we have here is a simple conditional that ReSharper should ignore.

I can change ReSharper options to disable that suggestion, or I could annotate my code to disable that suggestion in this particular case. I won’t do the former because I want ReSharper to suggest improvements that make sense, and I definitely want to avoid the arrow anti-pattern. I won’t add ReSharper-specific (or any other tool-specific) annotations to my code because those annotations are not germane to the problem being solved. They’re just clutter. It’s bad enough that I have to suffer those XML documentation comments in the code.

ReSharper got that one wrong, and there’s nothing I can reasonably do to prevent it from displaying the warning.

ReSharper also warns me when I write something like this:

    private int _head = 0;

It says, “Initializing field by default value is redundant.” And ReSharper is right. But the redundancy is harmless and even helpful at times. I like this particular redundancy because it explicitly states what the initial value of that variable is. I don’t have to think, “Oh, right. The default value is zero.” I’ve turned this warning off because there is no context in which I want to be told that my preference for explicit initialization is redundant. Besides, the compiler will silently remove the redundancy, so there’s no inefficiency in the generated code.

Come to think of it, I’d like an option to flag code that doesn’t explicitly initialize something. I’d enable that option in a heartbeat.

All that said, I really like ReSharper. As annoying as it can be from time to time, I find that it really does help me write better code. Highly recommended.

Performance testing the TopBy method

The purpose of my TopBy extension method is to provide a faster and more memory efficient way to select the top k items from a list. Absent this method, the LINQ way to get those items is to sort the sequence in descending order and then take the top k. For example:

    var oldestPeople = people
        .OrderbyDescending(x => x.Age)
        .Take(10);

That will give you the 10 oldest people in the list. My TopBy method makes it a little easier:

    var oldestPeople = people.TopBy(10, x => x.Age);

That reads better to me and, done right, should be a whole lot faster than the existing LINQ alternative.

When LINQ does that selection, it has to copy the entire source list (people in this case), sort the entire list, and then return the top 10. That requires extra memory proportional to the size of the source list, and time proportional to n*log2(n). If you want to select the top 10 from a list of one million, LINQ will create another list of one million items to sort.

It should also be clear that LINQ will take almost the same amount of time to select the top 10 as it would to select the top 100,000. The bottleneck in the LINQ way of doing things is the sort, and it always has to sort the entire array.

My TopBy method uses a heap selection algorithm similar to the one that I discussed in my article When theory meets practice. This algorithm requires extra space proportional to k–the number of items to be selected. It takes time proportional to n*log2(k), meaning that it should be much faster than the LINQ method when selecting a relatively small percentage of items in the list, and it will use a whole lot less memory.

One other benefit of the new TopBy method is that you can use it to select the top items from a list that won’t even fit in memory. Imagine, for example, that you have a text file that contains hundreds of millions of records–more than will fit in memory. You want to select the top 100 records from that file. You can’t do that with OrderBy because you can’t fit those hundreds of millions of records in memory. But TopBy only needs space for 100 records, so you can easily use it to select the records you want.

That’s the theory. Let’s see how it works in practice.

The short version is that TopBy is faster than OrderBy followed by Take when the number of items that you want to select is less than about 25% of the total number of items in the source list. For primitive types like integers, “about” is closer to 20%, and for strings it’s closer to 40%. As the cost of comparisons increases, TopBy is a better choice.

How much faster depends mostly on the ratio of selected items (k) to total items (n). When k/n is small, TopBy is hugely faster. For example, selecting the top 100 items from a list of 1,000,000 strings takes TopBy about 500 milliseconds on my machine. It takes the OrderBy method about 17 seconds.

When k/n approaches 0.25, they’re about the same speed, and when k/n is near 1.0, TopBy takes about twice as long. Again, TopBy has an advantage as the cost of key selection and comparison increases.

The first table below compares the performance of OrderBy and TopBy when selecting varying percentages of integers from different list sizes. The second table compares the performance when selecting strings.

Total Items
Percent1001,00010,000100,0001,000,000
SelectedOrderByTopByOrderByTopByOrderByTopByOrderByTopByOrderByTopBy
0.01%9.410.761127.601,47575.25
0.10%0.740.088.870.791128.181,47984.73
1%0.040.010.730.108.891.2011314.841,477176.37
10%0.040.030.810.399.065.1811666.841,536854
25%0.040.040.710.699.189.751161261,5011,677
50%0.040.070.731.359.5014.321181891,5542,547
75%0.040.090.731.359.5917.011182241,5723,106
90%0.050.090.751.329.5718.551232351,6993,287
Time (ms) to select integers
Total Items
Percent1001,00010,000100,0001,000,000
SelectedOrderByTopByOrderByTopByOrderByTopByOrderByTopByOrderByTopBy
0.01%97.855.141,49954.4317,595540
0.10%6.630.5394.795.401,45558.7316,410612
1%0.450.067.200.7891.628.421,36012216,6701,364
10%0.530.146.542.4091.1136.161,37655616,8776,919
25%0.420.276.614.6210069.691,33796717,14713,375
50%0.440.437.217.3490.451041,2991,48116,85420,561
75%0.460.526.478.6094.601241,3181,82317,10624,166
90%0.460.586.939.171051351,4001,99817,88626,071
Time (ms) to select strings

The important thing to note here is that for a given list size, OrderBy followed by Take takes nearly constant time regardless of how many items are selected. That is, selecting 100 items from a list of 1 million takes approximately the same amount of time as selecting 900,000 items from the same list. In comparison, the time required by TopBy depends on the ratio of selected items to total items.

The time I spent developing the TopBy extension method is a huge win for me because I regularly select small numbers of items from very large lists. Selecting the top 100 from a list of 1 million is a common operation in the work that I do. Being able to do that in 500 milliseconds rather than 16 seconds means that my programs run much faster. They also require a whole lot less memory.

I also find myself, sometimes, having to select the top 100 items from a list that’s too large to fit in memory. Again, that’s something that TopBy can do, but OrderBy can’t. I realize that my situation is a little out of the ordinary in that most people who work with hundreds of millions of records will have them in a database and they can use SQL to do the selection.

It’s important to point out, though, that my TopBy extension method isn’t the optimum in performance. I have a custom generic selection method that’s three to five times faster, and outperforms LINQ’s OrderBy even when selecting 100% of items. Unfortunately, it doesn’t support the ThenBy and ThenByDescending methods, and it doesn’t do a stable ordering. It works more like List<T>.Sort in that I can pass it a comparison delegate. It’s useful, but using it in conjunction with LINQ is difficult.

All in all, this was an interesting and enjoyable diversion. I learned a lot about how LINQ to Objects works under the hood, and I ended up with a very useful extension method that makes my programs run faster and use less memory.

Below is the code I used to generate the data that went into the tables above. Of course, all tests were run with a release build with no debugger attached.


    private const int Million = 1000*1000;
    private const int Billion = 1000*Million;
    private const int NumIterations = 10;

    private static readonly double[] Percentages = new double[] {0.0001, 0.001, 0.01, 0.1, .25, .50, .75, .90};
    private static readonly int[] Sizes = new int[] {100, 1000, 10000, 100000, Million};
    private static void TestSelect()
    {
        // prime the jitter pump
        TestOrderBy(new List<int> {1, 2}, 1, 1);
        TestOrderBy(new List<string> {"foo", "bar"}, 1, 1);
        TestTopBy(new List<int> { 1, 2 }, 1, 1);
        TestTopBy(new List<string> { "foo", "bar" }, 1, 1);

        foreach (var numItems in Sizes)
        {
            foreach (var t in Percentages)
            {
                int numSelect = (int) (t*numItems);
                //var items = CreateRandomInts(numItems);
                var items = CreateRandomStrings(numItems, 4, 20);
                var orderBy = TestOrderBy(items, numSelect, NumIterations);
                var topBy = TestTopBy(items, numSelect, NumIterations);
                Console.WriteLine("{0},{1:N2},{2},{3},{4}", numItems, t*100, numSelect, orderBy, topBy);
            }
        }
    }

    private static double TestOrderBy<T>(List<T> items, int numSelect, int numIter)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < numIter; ++i)
        {
            var sel = items.OrderByDescending(k => k).Take(numSelect).ToArray();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds/numIter;
    }

    private static double TestTopBy<T>(List<T> items, int numSelect, int numIter)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < numIter; ++i)
        {
            var sel = items.TopBy(numSelect, k => k).ToArray();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds/numIter;
    }

    private static Random rnd = new Random();

    static List<int> CreateRandomInts(int howMany)
    {
        return Enumerable.Range(0, howMany).Select(i => rnd.Next()).ToList();
    }

    static List<string> CreateRandomStrings(int howMany, int minLength, int maxLength)
    {
        var items = new List<string>(howMany);
        for (int i = 0; i < howMany; ++i)
        {
            int len = rnd.Next(minLength, maxLength);
            var sb = new StringBuilder(len);
            for (int c = 0; c < len; ++c)
            {
                int x = rnd.Next(52);
                if (x < 26)
                    sb.Append((char)(x + 'A'));
                else
                    sb.Append((char)(x - 26 + 'a'));
            }
            items.Add(sb.ToString());
        }
        return items;
    }

Some optimizations to the TopBy method

I had planned to present the results of performance testing here, comparing my TopBy extension method against the existing LINQ alternative (OrderByDescending() followed by Take()). But while testing I came up with a couple of optimizations that provide slight performance increases. So I’ll show you those optimizations today, and then follow up with the performance tests in the next day or two.

My original code used a 3-heap because testing of my DHeap class showed that a 3-heap is almost always faster than a binary heap. But when I made the decision to use a 3-heap, I didn’t take into consideration the operations that the heap is expected to perform. If you recall from my discussion of the d-heap, heap insertions (the BubbleUp method) become faster as the value of d increases. Inserting into a 3-heap is faster than inserting into a binary heap. But heap removals (the SiftDown method) are slower as d increases.

The HeapSelector<TElement> class I presented last time never does a heap insert! The BubbleUp method doesn’t even exist in the code. The DoSelection method creates an array of the first numItems items and heapifies it, and then any changes to the heap are performed by sifting new elements down. The result is that my SiftDown method does more comparisons than it would as a binary heap, and there’s no corresponding reduction in comparisons during heap insertion.

The Heapify method in a 3-heap is faster than in a binary heap, and better locality of reference with the 3-heap offsets some of the lost performance in SiftDown, but overall the 3-heap is slightly slower with this mix of operations.

So I modified the code to use a binary heap. That simplifies the SiftDown method and provides a small increase (about 2%) in speed. Here are the modified Heapify and SiftDown methods.

    private void Heapify()
    {
        for (int i = _count / 2; i >= 0; --i)
        {
            SiftDown(i);
        }
    }

    private void SiftDown(int ix)
    {
        var child = (ix*2) + 1;
        while (child < _count)
        {
            if (child+1 < _count && Compare(child, child + 1) > 0)
                ++child;
            if (Compare(child, ix) >= 0)
                break;
            Swap(ix, child);
            ix = child;
            child = (ix*2) + 1;
        }
    }

The other optimization has to do with extracting keys from records before comparing to see if an item should go onto the heap. The original code looks like this:

    // For each remaining item . . .
    while (srcEnumerator.MoveNext())
    {
        ExtractKeys(srcEnumerator.Current, _numItems);
        if (Compare(_numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

That extracts all of the keys for an item and then compares the item against the first (smallest) item on the heap. If the new item is larger than the smallest item, then the smallest item is replaced by the new item.

One of the things I learned when working with heap selection some time ago is that with typical data, especially when you’re selecting a small percentage of the total items, only a very small number of items are actually placed on the heap. In most cases the vast majority of items examined are smaller than the smallest item on the heap. That being true, it’s useful to perform as little work as possible to determine if an item will go onto the heap.

The other observation is that, when multiple sort keys are specified, the majority of items will differ in the first key. If you’re sorting people by last name and then first name, most of them will have different last names. It makes no sense, then, to extract the first name key if it won’t ever be compared. If we had a method that extracted only those keys it needs for the comparison, we should spend a lot less time extracting keys.

With that in mind, I created the ExtractCompare method:

    private int ExtractCompare(TElement item, int ix, int x, int y)
    {
        var rslt = 0;
        var i = 0;
        while (rslt == 0 && i < _criteria.Length)
        {
            // Extract this key
            _criteria[i].ExtractKey(item, ix);
            rslt = _criteria[i].CompareKeys(x, y);
            ++i;
        }
        // If the item isn't larger, we can short circuit
        if (rslt <= 0) return rslt;

        // The new item compares larger, so it will be placed on the heap.
        // Extract the rest of the keys
        _items[ix] = item;
        while (i < _criteria.Length)
        {
            _criteria[i].ExtractKey(item, ix);
            ++i;
        }
        return rslt;
    }

You can see here that if the new item is smaller than the item it’s being compared to (which in this case will be the smallest item on the heap), the other keys aren’t even extracted. Again, why extract a key that you won’t use?

Using this new method requires a small change in the selection loop:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

The performance increase from this optimization isn’t very noticeable when you have just one ordering clause (i.e. no ThenBy or ThenByDescending). My testing shows that it’s only two to three percent in the general case. The effect becomes much more evident when you have multiple sort keys, or if extracting the keys is expensive.

I did have one reservation about this optimization. Somebody could write a key selector method that has side effects that other parts of the program (perhaps subsequent key selections) depends on. I can’t think of a good reason why anybody would write their program to depend on the key selector’s side effects, though. I gave this quite a bit of thought and came to the conclusion that depending on a key selector’s side effects is so fraught with peril anyway that adding this potential problem is no big deal. Key selection shouldn’t modify global state.

Together, these two optimizations resulted in a performance increase of from five to seven percent in the single sort key case. Not huge by any standard, but worthwhile considering the simplicity of the changes.

Performance comparisons next time. Really.

Implementing IOrderedEnumerable, Part 2: The details

In my previous post, I developed the classes that implement IOrderedEnumerable so that we can create a chain of sort criteria that the GetEnumerator method can use to select the top items from the list. The harder part is writing that GetEnumerator method.

When I started this project I intended to use my generic heap class, DHeap, to do the selection. I actually wrote code to do that and almost published it. But the code was convoluted, and used a few “clever” tricks that were too ugly to post. Besides, the code was slow. It was faster than using OrderBy and Take, but not by a whole lot. It was slow and ugly, and not something that I wanted to post when there was a better (and, as it turns out, much faster) way to do things.

It’s instructive to understand the way the LINQ to Objects OrderBy implementation works. When code calls OrderBy, the LINQ code creates a chain of ordering criteria quite similar to what I showed in my previous post. The GetEnumerator method, when it’s eventually called, first extracts the sort keys and stores them in individual arrays. For example, if the code were OrderBy(p => p.Age).ThenBy(p => p.FirstName), then there are two key arrays created: one that contains all of the ages, and one that contains all of the first names. When sorting, the keys are compared and swapped as necessary.

Doing things that way costs memory, but lets us create and compare custom keys that might take significant time to generate. Rather than generating the key each time a comparison is made, the keys are generated one time and cached. It’s less than optimum, but it’s a much more flexible way to do things.

After studying the problem a bit (that is, after my initial solution was such a disappointment), I determined that I could use a technique that’s very similar to what the LINQ to Objects implementation of OrderBy does. But my key arrays are only the size of the heap (the number of items to be selected) rather than the size of the entire list. So if the code is supposed to select 100 items, then each key array is large enough to hold 101 items–the last item used as a scratch pad.

Each HeapOrderedEnumerable<TElement, TKey> instance has a _keys array and implements three abstract methods that are defined by the base HeapOrderedEnumerable<TElement> class:

  • ExtractKey(element, index) extracts a key of type TKey from the passed element and stores it in the _keys array at the specified index.
  • Compare(x, y) uses the comparison delegate to compare the keys at indexes x and y.
  • Swap(x, y) swaps the values in the _keys array located at indexes x and y.

The HeapOrderedEnumerable<TElement,TKey> class, then, is pretty simple. Modifying the code I posted last time, I just need to add the _keys array and the three methods that I outlined above. Here’s the completed class.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
        private readonly Func<TElement, TKey> _keySelector;
        private readonly IComparer<TKey> _comparer;
        private readonly bool _descending;

        private readonly TKey[] _keys;

        internal HeapOrderedEnumerable(
            IEnumerable<TElement> source,
            int numItems,
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending) : base(source, numItems)
        {
            _keySelector = keySelector;
            _comparer = comparer ?? Comparer<TKey>.Default;
            _descending = descending;

            // Allocate one extra key for the working item.
            _keys = new TKey[numItems+1];
        }

        public override int CompareKeys(int x, int y)
        {
            return _descending
                ? _comparer.Compare(_keys[y], _keys[x])
                : _comparer.Compare(_keys[x], _keys[y]);
        }

        public override void SwapKeys(int x, int y)
        {
            var t = _keys[x];
            _keys[x] = _keys[y];
            _keys[y] = t;
        }

        public override void ExtractKey(TElement item, int ix)
        {
            _keys[ix] = _keySelector(item);
        }
    }

I then created a class called HeapSelector that, given a list of ordering criteria (the chain of HeapOrderedEnumerable<TElement,TKey> instances) and a list of items, can select the top items that match those criteria. That class maintains an array of the items currently on the heap, and calls the ExtractKeyCompare, and Swap methods described above to maintain the keys for those items. Internally, HeapSelector implements a custom d-Heap to do the actual selection. The public interface consists of the constructor and the DoSelection method. Here’s the public interface.

    internal class HeapSelector<TElement>
    {
        public HeapSelector(
            IEnumerable source,
            HeapOrderedEnumerable[] criteria,
            int numItems);

        public IEnumerable DoSelect();
    }

The full code is shown at the end of this post, along with the rest.

All that’s left is the GetEnumerator method in HeapOrderedEnumerable<TElement> which, in concept, is very simple. It just has to create an array of ordering criteria to pass to the HeapSelector, create the HeapSelector, and then return the sequence generated by the DoSelect method. It’s almost that simple, except for one complication.

OrderBy and ThenBy state that they do a stable ordering. A stable ordering simply means that items that compare equal maintain their original relative order in the output. For example, imagine that I had this list of names and ages.

    Jim,52
    Ralph,30
    Susie,37
    Mary,52
    George,47

If I were to sort those names by descending age, the first two items could be Jim followed by Mary, or Mary followed by Jim. If a stable sort isn’t specified, then either ordering is correct. But if the sort is guaranteed stable, then Jim must come before Mary in the output because Jim appeared before Mary in the original list. Note that if I reversed the sort order (i.e. sorted by ascending age), Jim would still appear before Mary.

It’s reasonable for users of TopBy to expect a stable ordering, because that’s the way that OrderBy works and, more importantly, how ThenBy is documented to work. If I want ThenBy to work with TopBy, then TopBy must do a stable ordering. That complicates things a little bit because heap selection isn’t typically a stable operation.

It can be made stable, though, by adding one final ordering criterion: the record number. In the example above, Jim would be record 0 and Mary would be record 3. Each record is assigned a unique, monotonically increasing numeric key. If the other keys compare as equal, the record numbers are compared. This guarantees the correct ordering of otherwise equal items.

It turns out that adding that final ordering isn’t terribly complicated. The code essentially tacks on a final ThenByDescending that stores the unique record number. It then walks the chain of ordering criteria to build an array that it can pass to the HeapSelector constructor. Finally, it calls the HeapSelector instance’s DoSelect method.

This whole thing turned out to be about 30 lines longer than my initial attempt, but much easier to understand. It’s also about five times faster than my original code, which is a nice bonus. The entire code is shown below. Next time we’ll do a little testing to see how it stacks up with other ways of selecting the top items.

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    namespace Heaps
    {
        public static class HeapEnumerable
        {
            public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, false);
            }

            public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, false);
            }

            public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, true);
            }

            public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, true);
            }

            internal abstract class HeapOrderedEnumerable<TElement> : IOrderedEnumerable<TElement>
            {
                private readonly IEnumerable<TElement> _source;
                private readonly int _numItems;
                internal HeapOrderedEnumerable<TElement> Parent;

                protected HeapOrderedEnumerable(
                    IEnumerable<TElement> source,
                    int numItems)
                {
                    _source = source;
                    _numItems = numItems;
                }

                public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
                    Func<TElement, TKey> keySelector,
                    IComparer<TKey> comparer, bool @descending)
                {
                    var oe = new HeapOrderedEnumerable<TElement, TKey>(
                        _source, _numItems, keySelector, comparer, descending);
                    oe.Parent = this;
                    return oe;
                }

                public IEnumerator<TElement> GetEnumerator()
                {
                    int numRecs = 0;
                    var recordKeySelector = new Func<TElement, int>(item => ++numRecs);

                    // Add a ThenByDescending for the record key.
                    // This ensures a stable ordering.
                    var oe = (HeapOrderedEnumerable<TElement>)CreateOrderedEnumerable(recordKeySelector, null, true);

                    // Get the ordering criteria, starting with the last ordering clause.
                    // Which will always be the record key ordering.
                    var criteria = oe.GetCriteria().ToArray();

                    var selector = new HeapSelector<TElement>(_source, criteria, _numItems);
                    return selector.DoSelect().GetEnumerator();
                }

                // Walks the ordering criteria to build an array that the HeapSelector can use.
                private IEnumerable<HeapOrderedEnumerable<TElement>> GetCriteria()
                {
                    var keys = new Stack<HeapOrderedEnumerable<TElement>>();

                    var current = this;
                    while (current != null)
                    {
                        keys.Push(current);
                        current = current.Parent;
                    }
                    return keys;
                }

                IEnumerator IEnumerable.GetEnumerator()
                {
                    return GetEnumerator();
                }

                // The individual ordering criteria instances (HeapOrderedEnumerable<TElement, TKey>)
                // implement these abstract methods to provice key extraction, comparison, and swapping.
                public abstract void ExtractKey(TElement item, int ix);
                public abstract int CompareKeys(int x, int y);
                public abstract void SwapKeys(int x, int y);
            }

            internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
            {
                private readonly Func<TElement, TKey> _keySelector;
                private readonly IComparer<TKey> _comparer;
                private readonly bool _descending;

                private readonly TKey[] _keys;

                internal HeapOrderedEnumerable(
                    IEnumerable<TElement> source,
                    int numItems,
                    Func<TElement, TKey> keySelector,
                    IComparer<TKey> comparer,
                    bool descending) : base(source, numItems)
                {
                    _keySelector = keySelector;
                    _comparer = comparer ?? Comparer<TKey>.Default;
                    _descending = descending;

                    // Allocate one extra key for the working item.
                    _keys = new TKey[numItems+1];
                }

                public override int CompareKeys(int x, int y)
                {
                    return _descending
                        ? _comparer.Compare(_keys[y], _keys[x])
                        : _comparer.Compare(_keys[x], _keys[y]);
                }

                public override void SwapKeys(int x, int y)
                {
                    var t = _keys[x];
                    _keys[x] = _keys[y];
                    _keys[y] = t;
                }

                public override void ExtractKey(TElement item, int ix)
                {
                    _keys[ix] = _keySelector(item);
                }
            }

            internal class HeapSelector<TElement>
            {
                private readonly IEnumerable<TElement> _source;
                private readonly HeapOrderedEnumerable<TElement>[] _criteria;
                private readonly int _numItems;
                private readonly TElement[] _items;
                private int _count;

                public HeapSelector(
                    IEnumerable<TElement> source,
                    HeapOrderedEnumerable<TElement>[] criteria,
                    int numItems)
                {
                    _source = source;
                    _criteria = criteria;
                    _numItems = numItems;
                    _items = new TElement[numItems+1];
                }

                public IEnumerable<TElement> DoSelect()
                {
                    // Build a heap from the first _numItems items
                    var srcEnumerator = _source.GetEnumerator();
                    while (_count < _numItems && srcEnumerator.MoveNext())
                    {
                        ExtractKeys(srcEnumerator.Current, _count);
                        ++_count;
                    }
                    Heapify();

                    // For each remaining item . . .
                    while (srcEnumerator.MoveNext())
                    {
                        ExtractKeys(srcEnumerator.Current, _numItems);
                        if (Compare(_numItems, 0) > 0)
                        {
                            // The current item is larger than the smallest item.
                            // So move the current item to the root and sift down.
                            Swap(0, _numItems);
                            SiftDown(0);
                        }
                    }

                    // Top N items are on the heap. Sort them.
                    int saveCount = _count;
                    while (_count > 0)
                    {
                        --_count;
                        Swap(0, _count);
                        SiftDown(0);
                    }

                    // And then return.
                    // Have to use the Take here because it's possible that saveCount
                    // will be smaller than _numItems.
                    return _items.Take(saveCount);
                }

                private const int ary = 3;

                private void Heapify()
                {
                    for (int i = _count / ary; i >= 0; --i)
                    {
                        SiftDown(i);
                    }
                }

                private void SiftDown(int ix)
                {
                    while ((ary*ix) + 1 < _count)
                    {
                        var child = (ix*ary) + 1;
                        // find the smallest child
                        var currentSmallestChild = child;
                        var maxChild = child + ary;
                        if (maxChild > _count) maxChild = _count;
                        ++child;
                        while (child < maxChild)
                        {
                            if (Compare(currentSmallestChild, child) > 0)
                                currentSmallestChild = child;
                            ++child;
                        }

                        child = currentSmallestChild;
                        if (Compare(child, ix) >= 0)
                            break;
                        Swap(ix, child);
                        ix = child;
                    }
                }

                private void ExtractKeys(TElement item, int ix)
                {
                    // Extract keys from the record into the array at index ix.
                    // Also calls the ExtractKey method for each ordering criteria.
                    _items[ix] = item;
                    foreach (var t in _criteria)
                    {
                        t.ExtractKey(item, ix);
                    }
                }

                private int Compare(int x, int y)
                {
                    // Walks the list of comparers, doing the comparisons.
                    // The first unequal comparison short-circuits the loop.
                    var rslt = 0;
                    for (int i = 0; rslt == 0 && i < _criteria.Length; ++i)
                    {
                        rslt = _criteria[i].CompareKeys(x, y);
                    }
                    return rslt;
                }

                // Swap two items. This swaps the elements in the local array,
                // and calls the Swap method for each of the ordering criteria.
                private void Swap(int x, int y)
                {
                    var temp = _items[x];
                    _items[x] = _items[y];
                    _items[y] = temp;
                    foreach (var t in _criteria)
                    {
                        t.SwapKeys(x, y);
                    }
                }
            }
        }
    }

Implementing IOrderedEnumerable, Part 1: The basics

Last time I introduced the TopBy extension method, which I want to implement for LINQ to Objects. On the surface of it, doing such a thing seems simple enough: just implement the IOrderedEnumerable interface. That turns out to be more involved than you might expect.

Documentation for IOrderedEnumerable isn’t particularly helpful. The brief description says:

Represents a sorted sequence.

There are no remarks or other detailed information. The interface has two methods: CreateOrderedEnumerable<TKey> and GetEnumerator.

If you’ve worked with LINQ at all, you know what GetEnumerator is supposed to do: generate the sequence. More correctly, it returns an enumerator that iterates through the collection, but something has to generate the items in the collection, and in keeping with the lazy nature of LINQ, we typically generate those items during enumeration.

Documentation for CreateOrderedEnumerable<TKey> isn’t much help, either. The brief description says:

Performs a subsequent ordering on the elements of an IOrderedEnumerable<TElement> according to a key.

The prototype tells us that there are three parameters: a key selector, a comparer, and a flag to indicate if the subsequent ordering should be ascending or descending. And, the only remark is:

The functionality provided by this method is like that provided by ThenBy or ThenByDescending, depending on whether descending is true or false. They both perform a subordinate ordering of an already sorted sequence of type IOrderedEnumerable<TElement>.

Not so helpful. The only clue in the documentation that even hints at an implementation strategy is in the Remarks section of ThenBy and ThenByDescending:

ThenBy and ThenByDescending are defined to extend the type IOrderedEnumerable<TElement>, which is also the return type of these methods. This design enables you to specify multiple sort criteria by applying any number of ThenBy or ThenByDescending methods.

The idea is easy enough. We want to create a chain of comparison operations so that we can apply code that does essentially this:

    if first keys are equal then
        if second keys are equal then
            if third keys are equal then
            ... etc.

It all starts with the TopBy method, which has two overloads:

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector);

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer);

There are corresponding TopByDescending methods, as well, which have the same parameters.

Those methods return an IOrderedEnumerable<TSource>, so I need to create a class that implements the IOrderedEnumerable<TSource> interface. I called that class HeapOrderedEnumerable<TElement>. In its skeletal form, it looks like this:

    internal abstract class HeapOrderedEnumerable<TElement> : IOrderedEnumerable<TElement>
    {
        public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending)
        {
            throw new NotImplementedException();
        }

        public IEnumerator<TElement> GetEnumerator()
        {
            throw new NotImplementedException();
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }

If you think about it a bit, that HeapOrderedEnumerable<TElement> class isn’t sufficient because the key type varies. What we really need is a HeapOrderedEnumerable<TElement,TKey> that we derive from HeapOrderedEnumerable<TElement>.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
    }

TopBy, then, creates and returns a HeapOrderedEnumerable<TSource, TKey>. A reference to that object is passed to ThenBy, which in turns calls the CreateOrderedEnumerable method to create additional sort criteria. It’s up to us to figure out how to chain those multiple sort criteria together so that when GetEnumerator is called, items are selected appropriately.

Clear as mud, right?

The HeapOrderedEnumerable<TElement, TKey> instance has to save the parameters that are passed to its constructor so that they can be used when it’s time to generate the sequence. The class also has to save a reference to its parent so that we can properly construct the chain of comparers when the time comes. Some of the information to be saved (the source list, number of items to select, and the parent pointer) has to be accessible by the base class. The rest can be private to the derived class.

With that information, we can create internal data definitions and constructors for the classes. First, the base class:

    // Internal data and constructor for HeapOrderedEnumerable<TElement>
    private readonly IEnumerable<TElement> _source;
    private readonly int _numItems;
    internal HeapOrderedEnumerable<TElement> Parent;

    protected HeapOrderedEnumerable(
        IEnumerable<TElement> source,
        int numItems)
    {
        _source = source;
        _numItems = numItems;
    }

Note that the Parent isn’t set by the constructor. We’ll come back to it in a minute.

The derived class, HeapOrderedEnumerable<TElement, TKey>, saves the key selector, the comparer, and the descending flag.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
        private readonly Func<TElement, TKey> _keySelector;
        private readonly IComparer<TKey> _comparer;
        private readonly bool _descending;

        internal HeapOrderedEnumerable(
            IEnumerable<TElement> source,
            int numItems,
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending) : base(source, numItems)
        {
            _keySelector = keySelector;
            _comparer = comparer ?? Comparer<TKey>.Default;
            _descending = descending;
        }
    }

Now we can write the CreateOrderedEnumerable method. All it does is create a HeapOrderedEnumerable<TElement, TKey>, and link it to the parent so that we have a chain that we can work with when GetEnumerator is called.

    public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
        Func<TElement, TKey> keySelector,
        IComparer<TKey> comparer, bool descending)
    {
        var oe = new HeapOrderedEnumerable<TElement, TKey>(
            _source, _numItems, keySelector, comparer, descending);
        oe.Parent = this;
        return oe;
    }

And, finally, the TopBy and TopByDescending methods.

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, false);
    }

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, false);
    }

    public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, true);
    }

    public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, true);
    }

Note that I don’t have to define the ThenBy or ThenByDescending methods. Those methods are already defined by LINQ. They call CreateOrderedEnumerable on the IOrderedEnumerable reference that’s passed as the first parameter. So, if there’s a TopBy followed by ThenBy, the HeapOrderedEnumerable<TElement> that TopBy created gets called. If the first function is OrderBy, then the IOrderedEnumerable implementation created by that is called.

The code now will create a chain of comparers from a TopBy or TopByDescending followed by any number of ThenBy or ThenByDescending calls. The chain is built backwards, but that’s okay; it’s easy enough to reverse the chain when it comes time to enumerate items.

Enumerating the items involves extracting and saving keys, and doing the actual heap selection. In the next post I’ll talk a bit about key selection, and start building the GetEnumerator method. I might be able to finish next time, but there’s a lot of material to cover and it might be too big for a single post. If so, we’ll finish up after that.

Creating a TopBy method for LINQ to Objects

One reason I spent so much time last fall working on my generic heap implementation was that I wanted a TopBy extension method for LINQ to Objects. So, for example, if I have a list of integers and I want the 10 largest, I could write something like:

  var top10 = theList.TopBy(10, k => k);

Without a TopBy method, the LINQ way to do that is:

  var top10 = theList.OrderByDescending(k => k).Take(10);

That’s all well and good when the list is quite small but when working with lists that contain millions of items, having to do a complete sort just so I can get the top 10 is hugely expensive in both memory and processing time. And if the list is larger than will fit in memory, the LINQ OrderBy solution flat doesn’t work.

Given my heap class, it’s easy enough to create an extension method that does something similar to what OrderBy does. If you ignore the keySelector parameter and assume that the items have a default comparer, then it’s almost trivial:

    public static IEnumerable<TSource> TopBy<TSource>(
        this IEnumerable<TSource> source,
        int numItems)
    {
        var comparer = Comparer<TSource>.Default;
        var heap = new MinDHeap<TSource>(3);
        foreach (var item in source)
        {
            if (heap.Count < numItems)
            {
                heap.Insert(item);
            }
            else if (comparer.Compare(item, heap.Peek()) > 0)
            {
                heap.ReplaceRoot(item);
            }
        }
        // and return in reverse heap order
        return heap.GetConsumingEnumerable().Reverse();
    }

Wrap that up into a static class and you can get the top 10 items from a list with:

    var top10 = theList.TopBy(10);

Adding the ability to specify a custom comparer is straightforward: something that I covered in my heap discussions. That would produce this function:

    public static IEnumerable<TSource> TopBy<TSource>(
        this IEnumerable<TSource> source,
        int numItems,
        IComparer<TSource> keyComparer);

I could probably get by with that just fine. If I need to compare on multiple fields, I can just write a different IComparer. And if I need to do some fancy key selection, I can either do it in the comparer, or write a front end filter that gets the keys that I need and builds a list of intermediate objects I can work with. It’s not something that I need to do very often.

At least, that was my thinking until I ran into a situation where building that list of temporary objects turned out to be a giant pain in the neck. Here I was able to sort with no trouble by using a key selector and several ThenBy operations, but to select the top 10 I had to jump through all kinds of hoops. It was just ugly.

Consider, for example, ordering a list of items based on how many users liked or disliked them. You want the items that have the most likes to appear on top, but if two items have the same number of likes, you want the one with the fewest dislikes to be shown first. So, given three items:

    bar, 50 likes, 8 dislikes
    foo, 10 likes, 1 dislike
    bas, 50 likes, 1 dislike

You would present them in the order (bas, bar, foo) because bar and bas both have 50 likes, but bas has fewer dislikes. The LINQ code to order them would be:

    var ordered = items
        .OrderByDescending(x => x.Likes)
        .ThenBy(x => x.Dislikes);

No custom comparer is required. But if I wanted to do that with the TopBy method shown above, I’d have to create a custom comparer and pass it. What a pain in the neck. And that’s just a simple example. If generating the keys required any calculation I’d want to pre-compute the keys once so that I wouldn’t wast time constructing a key every time I wanted to compare it. As I said, I could get by with the primitive method, but it’d be much nicer if TopBy would work the same way as OrderBy.

So I thought I’d look into creating TopBy and TopByDescending methods that I can use the same way that I use LINQ’s OrderBy and OrderByDescending. That turned out to be more involved than I thought it would be.

The key is the IOrderedEnumerable interface which, although simple enough on the surface, turns out to be a lot more complicated to implement than the documentation would lead you to believe. That’s where I’ll start next time.

Next: Implementing IOrderedEnumerable, Part 1

My next car

I drive a 1996 GMC Sonoma pickup truck that we bought new in February 1996. The truck has about 190,000 miles on it. It’s been wrecked and repaired twice, and I put a new engine in it at 120,000 miles. I keep up on the maintenance so the truck still runs well, and it looks much better than you’d expect an 18 year old truck to look. A few minor dents here and there is all.

Debra drives a 1996 Nissan Maxima that we bought new in December 1996. It has about 235,000 miles and is starting to show its age. She has to drive around town for work sometimes, so the car has to be reliable. We’ve kept up the maintenance on this car, so it runs well. But we haven’t fixed the dents from the minor wreck she had a few years ago. We had planned to replace the car by now but every time we think of getting something new we decide to wait another few months.

Debra’s new car will have to be something with four seats. We really like the Maxima and might get another. We both have a tough time with the price range, though: $30,000 to $40,000 for a car just seems crazy. There are lots of good cars in that price range, and even some nice ones in the $25,000 range (the Nissan Altima, for example). Debra wants four doors and plenty of space. Other than that, we haven’t narrowed it down. A 4 door sedan? Minivan? SUV? We’ll be looking for a long while.

I on the other hand just need basic transportation. The truck is handy for hauling things, but in truth I don’t often carry anything in the back. So I’m looking for something small and inexpensive to take me to work and back. I looked at the all electric Nissan Leaf, but can’t justify the $30,000 price tag for a basic commuter car. Even the Smart and the Fiat 500 are pretty expensive for what you get, but they’re less expensive than the Leaf and I can actually go places if I want to. The Leaf’s 100 mile range is a real drawback.

About six months ago I ran across Elio Motors, a company that’s gearing up to produce a small, affordable, and high mileage car. Last I heard (earlier this week), they expect to start shipping cars in the fourth quarter of this year.

It has two seats (front and back) and gets 80 MPG on the highway. City mileage will be about 50. It’s a 0.9 liter, 3-cylinder engine. The best part: $6,800. Yes, $6,800. It’s the perfect little commuter car.

I’m sending in my deposit to reserve one.