A closer look at pairwise merging

When I started writing the performance test code, I got sidetracked with yet another way to merge multiple sorted lists. While I was analyzing it I realized that the pairwise merge might have a problem, as well. It’s worth talking about here because knowing of the problem’s existence will help me create better test data when I finally get around to performance testing.

To review, assuming four lists named A, B, C, and D, the pairwise merge works like this:

    Merge A and B to produce list AB
    Merge C and D to produce list CD
    Merge AB and CD to produce list ABCD

You can achieve the same result without a queue:

    Merge A and B to produce AB
    Merge AB and C to produce ABC
    Merge ABC and D to produce ABCD

Although both approaches require the same number of merges, not all merges are equal.

We can estimate the speed of these algorithms by figuring out how many items they process. We already know that merging two lists that have among them n items requires processing n items. So if we sum the number of items for each two-way merge, we have the total number of items processed for the entire operation.

If we assume that each of the lists has 10 items, then the first example requires:

  • 20 iterations to merge A and B. The resulting list AB has 20 items.
  • 20 iterations to merge C and D. The resulting list CD has 20 items.
  • 40 iterations to merge AB and CD.
  • 80 total iterations.

The second example requires:

  • 20 iterations to merge A and B. The resulting list AB has 20 items.
  • 30 iterations to merge AB and C. The resulting list ABC has 30 items.
  • 40 iterations to merge ABC and D.
  • 90 total iterations.

In this case, the queue-based algorithm is faster.

But not in all cases. Consider, for example, what happens when the lists’ lengths are:

    A = 1
    B = 1
    C = 1
    D = 100

The queue merge will look at two items to make AB, 101 items to make CD, and 103 items to make ABCD. That’s a total of 206 items processed. The other algorithm will look at two items to make AB, three items to make ABC, and 103 items to make ABCD, for a total of 108 items processed. If you switch the order, though, presenting the lists as D, A, B, C, the queue merge will still process 206 items but the other version will have to look at 101 + 102 + 103, or 306 items.
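
If you want to check tallies like these without doing the arithmetic by hand, here’s a small sketch (mine, for analysis only; it assumes the usual System.Collections.Generic using) that computes the items processed by each strategy from the list lengths alone:

    // Estimates total items processed by the queue-based pairwise merge.
    static long QueueMergeCost(IEnumerable<int> lengths)
    {
        var queue = new Queue<long>();
        foreach (long len in lengths)
            queue.Enqueue(len);
        long total = 0;
        while (queue.Count > 1)
        {
            // Merging two lists processes all of their items.
            long merged = queue.Dequeue() + queue.Dequeue();
            total += merged;
            queue.Enqueue(merged);
        }
        return total;
    }

    // Estimates total items processed by the sequential pairwise merge.
    static long SequentialMergeCost(IEnumerable<int> lengths)
    {
        long total = 0, accumulated = 0;
        bool first = true;
        foreach (long len in lengths)
        {
            if (first) { accumulated = len; first = false; continue; }
            accumulated += len;   // merge the next list into the running result
            total += accumulated;
        }
        return total;
    }

QueueMergeCost(new[] { 1, 1, 1, 100 }) returns 206 no matter how you order those four lengths, while SequentialMergeCost returns 108 for { 1, 1, 1, 100 } but 306 for { 100, 1, 1, 1 }.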

The order of merges matters. A lot. Change that 100 to 1,000,000, and you start to get the idea.

If we know the lengths of the lists to be merged, we could potentially order the pairwise merges to take advantage of that information. The idea would be to order the merges so that we create intermediate lists that are relatively the same length. It wouldn’t have to be a perfect solution in order to realize a huge performance increase. A simple priority queue ordered by increasing list length would allow us to merge the shorter lists first, which would give a reasonably good solution to the problem.
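
To make the idea concrete, here’s a hypothetical sketch. It assumes materialized lists (so Count is known), a two-way Merge like the one from the earlier posts, and the PriorityQueue<TElement, TPriority> class that ships with .NET 6 and later. It’s the same greedy pairing used to build a Huffman tree.

    // Always merge the two shortest lists first, keeping intermediate
    // results as small and as evenly sized as possible.
    static List<int> MergeShortestFirst(IEnumerable<List<int>> lists)
    {
        var queue = new PriorityQueue<List<int>, int>();
        foreach (var list in lists)
            queue.Enqueue(list, list.Count);

        while (queue.Count > 1)
        {
            var l1 = queue.Dequeue();
            var l2 = queue.Dequeue();
            var merged = Merge(l1, l2).ToList(); // two-way merge, materialized
            queue.Enqueue(merged, merged.Count);
        }
        return queue.Dequeue();
    }

Note that materializing each intermediate result also gives up the O(k) extra space that makes the lazy version attractive.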

The real drawback–and it’s a deal killer–is that we don’t know how long the lists are. All of the merge code I’ve been working with assumes that the lists are of indeterminate length.

The result is that worst cases can happen. The sequential version is the real offender: hand it the largest list first and its running time degrades from O(n log2 k) to O(n*k). The queue-based version is better behaved, since its balanced pairing bounds the total work at O(n log2 k) regardless of input order, but as the example above shows, it can still process several times as many items as a merge order informed by the lists’ lengths would. Ouch.

In contrast, the heap-based merge always takes O(n log2 k) time. The reason is that every item is inserted into the heap once, and every item is removed from the heap once, regardless of the individual lists’ length.

Asymptotic analysis, then, tells us that the heap-based merge is better because it’s always O(n log2 k), whereas the pairwise merge can be O(n*k). Actual running time is a different story. If maintaining the heap is very expensive in comparison to maintaining the enumerators, then the pairwise merge can still outperform the heap-based merge.

The pairwise merge looks like it should be faster in the average case, and even in pathological cases with relatively few items. I suspect that it will be slower than the heap-based merge with many large, unevenly-sized lists.

Don’t take my word for it, though. I’ve been wrong often enough to distrust my gut feelings when it comes to performance. The only way to know for sure which is faster is to measure it.

A different way to merge multiple lists

In my previous post, I said that this time I would show full-featured MergeBy and MergeByDescending methods. Before I do that, I want to show an alternate method for merging multiple sorted lists.

If you recall, I said that the generalized k-way merge algorithm is:

    Initialize list indexes
    while any list has items
        Determine which list has the smallest current item
        Output the item from that list
        Increment that list's current index

As I illustrated last time, that does indeed work. But there is another way to do it that is, asymptotically at least, just as fast: it operates in O(n log2 k) time.

Imagine you have four sorted lists, named A, B, C, and D. Another way to merge them is this way:

    Merge A and B to create a list called AB.
    Merge C and D to create a list called CD.
    Merge AB and CD to create the final list.

To estimate the time complexity, let’s assume that each of the four lists contains the same number of items. So if the total number of items is n, then each list’s length is n/4.

Merging A and B, then, takes time proportional to n/4 + n/4, or n/2. Merging C and D also requires n/2 time. The final merge of AB and CD takes n/2 + n/2, or n. So the total amount of time it takes to merge the four lists is n/2 + n/2 + n, or 2n.

Time complexity of the k-way merge is, as I said, O(n log2 k). With four lists, that’s n * log2(4). Since log2(4) = 2, the time to merge four lists with a total of n items is n * 2, or 2n.

If there is an odd number of lists, then one of the original lists gets merged with one of the larger, already merged, lists. For example, with five lists the task becomes:

    Merge A and B to create a list called AB.
    Merge C and D to create a list called CD.
    Merge AB and E to create a list called ABE.
    Merge ABE and CD to create the final list.

To do that with an arbitrary number of lists, we create a queue that initially contains the original lists. We then remove the first two lists from the queue, merge them, and add the result back to the queue. Then take the next two, merge them, and add the result to the queue. We continue doing that until there is only one list left on the queue. That is:

    Initialize queue with all of the lists.
    while queue has more than one item
        l1 = queue.Dequeue()
        l2 = queue.Dequeue()
        rslt = Merge(l1, l2)
        queue.Enqueue(rslt)
    merged = queue.Dequeue()

At first, it seems like that method will require a lot of extra space to hold the temporary lists. Remember, the merge algorithms I’ve shown so far don’t require much in the way of extra storage: just a little memory to hold the smallest value in each of the lists. That is, O(k) extra space. Thanks to LINQ’s deferred execution, though, the pairwise method can match that: none of the intermediate merges ever materializes a list. Perhaps not surprisingly, you can do the whole thing with LINQ. Here’s the code.

    public static IEnumerable<T> MergeVersionTwo<T>(IEnumerable<IEnumerable<T>> lists)
    {
        // Populate a queue with all the lists.
        var theQueue = new Queue<IEnumerable<T>>(lists);
        if (theQueue.Count == 0) yield break;

        // Do pair-wise merge repeatedly until there is only one list left. 
        while (theQueue.Count > 1)
        {
            var l1 = theQueue.Dequeue();
            var l2 = theQueue.Dequeue();
            var rslt = Merge(l1, l2); // uses the two-way merge
            theQueue.Enqueue(rslt);
        }
        var merged = theQueue.Dequeue();
        foreach (var x in merged)
        {
            yield return x;
        }
    }

That code uses the two-way merge that I presented a while back.

When building the queue, we’re just setting things up. Everything is an IEnumerable<T>, meaning that no actual work takes place. LINQ’s deferred execution postpones the merging until we start to enumerate the result. And when we do enumerate the result, only one item is produced at a time. All of the intermediate merges are in flight at once, each advancing by a single item only when the merge above it asks for one.

Here’s some code that uses the new MergeVersionTwo method:

    var l1 = new int[] {1, 3, 9, 12, 15};
    var l2 = new int[] {2, 7, 14, 16, 19};
    var l3 = new int[] {6, 8, 13, 17, 20};
    var rslt = MergeVersionTwo(new List<int[]> {l1, l2, l3});

    foreach (var i in rslt)
        Console.WriteLine(i);

If you wrap that in a method and single-step it, you’ll see that no real work is done until you get to the foreach. Even then, all that mucking about with the queue doesn’t enumerate any of the lists until you get to the foreach at the end of the MergeVersionTwo method. If you start single-stepping (in Visual Studio, press F11 or use the “Step Into” debugger command), you’ll see it start working on the two-way merges. Watching how all that works will help you to understand just how powerful deferred execution can be.

It’s important to note that asymptotic analysis just tells us how the run time varies with the size of the input. It doesn’t say that both of the k-way merge algorithms will take the same amount of time. n log2 k just says that for a particular algorithm the amount of time varies with the size of the input and the logarithm of the number of lists. We’ll have to do some performance tests to determine which is actually faster.

That testing should prove interesting. Before we go there, though, we need to look more closely at the pairwise merge.

Merging multiple lists with LINQ

To merge multiple sorted lists efficiently, we need a better way to obtain the smallest value from among the lists’ current items. That is, given three sorted lists:

    A: 1, 8, 13, 19, 40
    B: 0, 4, 12, 41
    C: 3, 22, 30, 37, 43

The first item to output would be the 0 from list B. Then we’d output 1 from list A, 3 from list C, etc. In my previous post I showed one way to find the smallest value. That involved looking at the current item from each list to get the minimum. It works, but it’s very expensive. If you have k sorted lists, then it requires O(k) operations to determine the smallest item.

I spent a lot of time last year talking about heaps. The heap, as you recall, is a data structure that efficiently keeps track of the smallest (or largest, depending on how you structure it) item. In particular, you can remove the smallest item from the heap in O(log2 k) time, where k is the number of items in the heap. Also, insertion into the heap is an O(log2 k) operation.

If there are n items spread out over k lists, then the naive method takes O(n * k) time, and the method with the heap requires O(n log2 k) time. Using a heap can save you a lot of time, even when k is small.
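
The heap I’ll actually use later is my DHeap, but to keep this post self-contained, here’s a minimal sketch of a binary min-heap with the O(log2 k) Insert and RemoveSmallest operations described above. It’s illustrative only; it omits error checking (for example, RemoveSmallest on an empty heap).

    class MinHeap<T>
    {
        private readonly List<T> _items = new List<T>();
        private readonly IComparer<T> _comp;

        public MinHeap(IComparer<T> comparer = null)
        {
            _comp = comparer ?? Comparer<T>.Default;
        }

        public bool IsEmpty { get { return _items.Count == 0; } }

        public void Insert(T item)
        {
            // Add at the end and sift up: at most O(log n) swaps.
            _items.Add(item);
            var child = _items.Count - 1;
            while (child > 0)
            {
                var parent = (child - 1) / 2;
                if (_comp.Compare(_items[child], _items[parent]) >= 0) break;
                Swap(child, parent);
                child = parent;
            }
        }

        public T RemoveSmallest()
        {
            // Replace the root with the last item and sift down: O(log n).
            var smallest = _items[0];
            _items[0] = _items[_items.Count - 1];
            _items.RemoveAt(_items.Count - 1);
            var parent = 0;
            while (true)
            {
                var left = 2 * parent + 1;
                if (left >= _items.Count) break;
                var best = left;
                var right = left + 1;
                if (right < _items.Count && _comp.Compare(_items[right], _items[left]) < 0)
                    best = right;
                if (_comp.Compare(_items[parent], _items[best]) <= 0) break;
                Swap(parent, best);
                parent = best;
            }
            return smallest;
        }

        private void Swap(int x, int y)
        {
            var temp = _items[x];
            _items[x] = _items[y];
            _items[y] = temp;
        }
    }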

That sounds simple enough, but there’s one catch: you can’t just store values in the heap. With the value, you have to store a reference to the list itself. Otherwise you don’t know which list index to increment. Still, it’s not terribly difficult. With C#, we could use a Tuple<T1, T2>, or a simple object. Tuple is convenient, but I prefer creating a class for this kind of thing:

    class HeapItem
    {
        public int[] a;  // the array
        public int ix;   // current index
    }

Given that, and assuming we have a heap data structure, merging multiple lists becomes much easier. Following the basic algorithm from last time, the almost-C# code looks like this:

    // Initialize the result list and the heap
    var result = new List<int>();
    var heap = new MinHeap();
    foreach (var array in arrays)
        if (array.Length > 0)
            heap.Insert(new HeapItem { a = array, ix = 0 });
    
    while (!heap.IsEmpty)
    {
        HeapItem item = heap.RemoveSmallest();
        result.Add(item.a[item.ix]);  // output item
        item.ix++;                    // move to next item in this array
        // and put it into the heap if there are still items
        if (item.ix < item.a.Length)
            heap.Insert(item);
    }

I glossed over the heap implementation here. The code assumes that you have a heap implementation that will work with those HeapItem instances. I do in fact have such a thing, and we’ll use it in a bit.

Note that there is no need to explicitly check for the end of all lists. All of the lists’ indexes are initialized at 0. When a list is removed from the heap, its current item is output and then its index is incremented. If the new current index is beyond the end of the list, then the list isn’t added back to the heap. If we drop lists from the heap once they’ve been exhausted, then we know that there are still items as long as the heap is not empty. That simplifies things quite a bit.

The problem with the above is that it assumes we’re working with arrays of integers. What we really want is a generic Merge method that works with any type that implements IEnumerable<T>. That is:

    IEnumerable<T> Merge<T>(IEnumerable<IEnumerable<T>> lists);

The basic idea isn’t much different from what I presented in Merging sequences with LINQ. But rather than having only two enumerators and doing a simple comparison, I create a heap of enumerators. It looks like this:

    public static IEnumerable<T> Merge<T>(IEnumerable<IEnumerable<T>> lists)
    {
        // This builds a heap of the first items from each list
        var initialItems =
            lists.Select(l => l.GetEnumerator())
                .Where(i => i.MoveNext());

        var heap = new MinDHeap<IEnumerator<T>>(
            2, initialItems, new EnumeratorComparer<T>(Comparer<T>.Default));

        while (!heap.IsEmpty)
        {
            var i = heap.RemoveRoot();
            yield return i.Current;
            if (i.MoveNext())
            {
                heap.Insert(i);
            }
        }
    }

That code relies on an EnumeratorComparer, which is defined as:

    private class EnumeratorComparer<T> : IComparer<IEnumerator<T>>
    {
        private readonly IComparer<T> _comp;

        public EnumeratorComparer(IComparer<T> comparer)
        {
            _comp = comparer ?? Comparer<T>.Default;
        }

        public int Compare(IEnumerator<T> x, IEnumerator<T> y)
        {
            var rslt = _comp.Compare(x.Current, y.Current);
            return rslt;
        }
    }

It also depends on my DHeap, the code for which you can download from that page.

One of the cool things about using enumerators here is that the concept of the current item is built in. The logic of incrementing and checking for the end of the list is all there in the MoveNext method and the Current property. So the code to implement the merge is incredibly simple.
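
If the enumerator protocol is unfamiliar, this tiny loop is all there is to it; the merge above is just doing the same thing with several enumerators at once, ordered by the heap:

    var e = new[] { 1, 2, 3 }.AsEnumerable().GetEnumerator();
    while (e.MoveNext())               // advance; returns false when exhausted
        Console.WriteLine(e.Current);  // the enumerator's "current item"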

I should mention the initialization code, which consists of this LINQ expression:

   var initialItems =
        lists.Select(l => l.GetEnumerator())
             .Where(i => i.MoveNext());

That code is just getting the enumerators for each of the lists, positioning to the first item in the list, and adding it to the list of enumerators (which is subsequently inserted into the heap) if the list has items. It’s roughly equivalent to:

    foreach (var list in lists)
    {
        var i = list.GetEnumerator();
        if (i.MoveNext())
            initialItems.Add(i);
    }

That’s the simple method that merges multiple lists using the default comparison. That’s useful in itself, but it doesn’t provide all of the features that I included in the MergeWithBy extension. Next time, we’ll look at extending this code to provide full-featured MergeBy and MergeByDescending methods.

Actually, I went off on a tangent and presented a different way to merge multiple lists. I’ll get to the full-featured methods after exploring that.

Merging multiple sorted lists

The process of merging multiple (i.e. more than two) sorted sequences is a simple extension of merging just two sorted sequences. The two-way merge algorithm I posted last month included an optimization that I shouldn’t have introduced quite so soon. That optimization complicates the algorithm a bit. The simplest algorithm for merging two sorted lists looks like this:

    while ((not end of List A) or (not end of List B))
        if (end of List A)
            output List B current item
            advance List B index
        else if (end of List B)
            output List A current item
            advance List A index
        else if (List A current item <= List B current item)
            output List A current item
            advance List A index
        else
            output List B current item
            advance List B index

That does the same thing as what I posted before, but does it all in a single loop. My previous version is slightly more efficient because once it reaches the end of one list it just outputs the remainder of the other. Once List A is exhausted, there’s no need to keep checking it.
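
For reference, here’s a C# sketch of that earlier shape (my reconstruction, not the exact code from that post). Once one list runs dry, the tail loops copy the rest without any further comparisons:

    static IEnumerable<int> MergeTwo(IReadOnlyList<int> a, IReadOnlyList<int> b)
    {
        int ia = 0, ib = 0;
        // Merge while both lists still have items.
        while (ia < a.Count && ib < b.Count)
        {
            if (a[ia] <= b[ib])
                yield return a[ia++];
            else
                yield return b[ib++];
        }
        // One list is exhausted; output the remainder of the other.
        while (ia < a.Count) yield return a[ia++];
        while (ib < b.Count) yield return b[ib++];
    }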

You could extend that algorithm for three, four, or more lists, but the logic gets complicated in a hurry. What we really need is a generalized k-way merge algorithm that can merge items from any number of sorted lists. That algorithm, simply stated, is:

    Initialize list indexes
    while any list has items
        Determine which list has the smallest current item
        Output the item from that list
        Increment that list's current index

One way to implement that would be with an array of lists, and a corresponding array of current index pointers. The first cut of such a thing might look something like this:

    // merges multiple integer arrays
    private static int[] MergeInts(List<int[]> arrays)
    {
        var result = new List<int>();
        // Initialize indexes
        var indexes = new int[arrays.Count];
        for (int i = 0; i < arrays.Count; ++i)
        {
            indexes[i] = 0;
        }

        // Function returns true if we've reached the end of all the arrays
        var endOfAllArrays = new Func<bool>(() =>
        {
            for (int i = 0; i < indexes.Length; ++i)
            {
                if (indexes[i] < arrays[i].Length)
                    return false;
            }
            return true;
        });

        while (!endOfAllArrays())
        {
            // Find the array with the smallest current item
            int smallestItem = int.MaxValue;
            int smallestArray = 0;
            for (int i = 0; i < indexes.Length; ++i)
            {
                if (indexes[i] < arrays[i].Length)
                {
                    if (arrays[i][indexes[i]] <= smallestItem)
                    {
                        smallestArray = i;
                        smallestItem = arrays[i][indexes[i]];
                    }
                }
            }

            // output that item
            result.Add(smallestItem);

            // advance the index on that array
            ++indexes[smallestArray];
        }
        return result.ToArray();
    }

That code implements the basic algorithm exactly. It’s not especially pretty, but it gets the job done. It’s also horribly inefficient because every time through the loop it looks at all of the current items to determine which is the smallest. The complexity of this particular implementation is O(n*k), where n is the total number of items in all lists, and k is the number of lists. If you’re merging 10 lists that combined contain n items, it’s going to take twice as long as merging the same items spread out over just 5 lists.

What we need is a faster way to find the list that has the smallest current item. I mentioned in my first article (linked above) that the k-way merge can be done in O(n log2 k) time. That makes a big difference, even for small values of k. The base 2 logarithm of 4, for example, is 2, meaning that if we can improve the selection algorithm, we can merge four lists in half the time.

What data structure do we know that can return the smallest item in O(log2 k) time? The heap, maybe? It’s a surprisingly useful data structure.

Next time we’ll look at replacing the linear scan in the implementation above with a heap. That will reduce the time required to select the smallest item, and it will also eliminate the time required to determine whether we’re at the end of all the lists.

Removing duplicates

Imagine you have two lists of names in alphabetical order. The first list is of people who drove from Point A to Point B on Monday, and the second list is of people who drove that route on Tuesday. You want to build a list of people who drove that route on either day, and you don’t want any duplicates. So if Joe, Mary, and Sam drove on Monday and Joe, Mary, and Ralph drove on Tuesday, then your list would contain [Joe, Mary, Ralph, Sam].

My MergeWithBy LINQ extension can easily merge the two lists to give [Joe, Joe, Mary, Mary, Ralph, Sam], but it doesn’t remove duplicates. I could make it remove duplicates easily enough by adding another parameter to the method, but it seems more useful to have a separate method that can remove duplicates from any sorted sequence.

LINQ’s existing Distinct method can do it, but it has at least two drawbacks:

  • Because it works on unsorted sequences as well, it has to hold all of the distinct items it has seen in memory. This limits the size of sequences you can operate on.
  • It requires that I define an equality comparer if I want to do custom comparisons.

The first is a problem because I often work with very large lists that won’t fit into memory. The second is an annoyance because if I want to sort, merge, and then uniquify, I end up with two different meanings of equality: one defined by the IComparer that I pass to OrderBy and MergeWithBy, and one defined by the IEqualityComparer implementation that I pass to Distinct. In addition, the syntax is different. That is, to merge and then uniquify:

    var merged = List1.MergeWithBy(List2, k => k.Age).ThenBy(k => k.LastName);
    var unique = merged.Distinct(new ItemComparerByAgeAndName());

I’d much rather have the second line be:

    var unique = merged.UniqueBy(k => k.Age).ThenBy(k => k.LastName);

Removing duplicates from an ordered sequence is easy enough. For example, this code de-dupes an ordered list, using the passed comparer.

    IEnumerable<TSource> Unique<TSource>(IEnumerable<TSource> theList, IComparer<TSource> comparer)
    {
        var it = theList.GetEnumerator();
        var hasItems = it.MoveNext();
        if (!hasItems)
            yield break;
        
        yield return it.Current;
        var prev = it.Current;
        hasItems = it.MoveNext();
        while (hasItems)
        {
            if (comparer.Compare(prev, it.Current) != 0)
            {
                yield return it.Current;
                prev = it.Current;
            }
            hasItems = it.MoveNext();
        }
    }

Note that I used an IComparer<T> rather than an IEqualityComparer<T> here, which is different from what Distinct uses. I did that because the next step is to implement a LINQ extension, called UniqueBy, that works in the same way that OrderBy and my MergeWithBy methods work, including support for ThenBy and ThenByDescending.

I’m not going to spend any time describing how the UniqueBy and related methods work. The inner workings are sufficiently similar to the previously covered MergeWithBy that going over it again would be needless repetition. See the article I linked at the top of this post if you want more information.

UniqueBy will be faster than Distinct, and will use a lot less memory. Like my MergeWithBy, it only needs enough extra memory to store three items at any one time. But it only works on ordered sequences. If you want to uniquify an unordered list, you need to call Distinct, or you need to sort it first before calling UniqueBy. But Distinct will be faster if you have an unsorted list that fits in memory.
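
For example, given an unsorted in-memory sequence of names, either of these produces the distinct values; the first pays for an in-memory sort but keeps a single, consistent notion of ordering:

    var viaUniqueBy = names.OrderBy(k => k).UniqueBy(k => k);
    var viaDistinct = names.Distinct();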

As I showed above, using the new method is really easy: it’s just like using OrderBy or my TopBy method.

One question that comes up is whether there is any benefit to calling UniqueBy before merging sequences. That is, which of these two is better?

    var foo = L1.UniqueBy(k => k).Merge(L2.UniqueBy(k => k), k => k).UniqueBy(k => k);

OR:

    var foo = L1.Merge(L2, k => k).UniqueBy(k => k);

It should be clear that the final UniqueBy is required in the first example. Otherwise you wouldn’t remove duplicates from the merged list. You’d only remove duplicates from each of the input lists. But if “Joe” existed in both lists then “Joe” would be output twice.

You’re better off with the second example because it’s simpler, produces the same result as the first, will use slightly less memory, and will be ever so slightly faster than the first example unless each of the input lists has a high percentage of duplicate items. Even in pathological cases, though, the difference in execution speed will likely be so small as to be irrelevant. So you should stick with the simpler code.

While merging two sorted lists into a single sorted list, potentially removing duplicates, is the most common type of merge I’ve used or seen used, it’s certainly not the only merge topic. Merging is a technique that’s been used in transaction processing systems for at least 50 years. I’ll likely come back to it in the future.

Following is complete code for the UniqueBy extension.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

namespace EnumerableExtensions
{
    public static class UniqueExtension
    {
        public static IOrderedEnumerable<TSource> UniqueBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector)
        {
            return UniqueBy(list1, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> UniqueBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new UniqueOrderedEnumerable<TSource, TKey>(list1, keySelector, comparer, false);
        }

        public static IOrderedEnumerable<TSource> UniqueByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector)
        {
            return UniqueByDescending(list1, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> UniqueByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new UniqueOrderedEnumerable<TSource, TKey>(list1, keySelector, comparer, true);
        }

        internal abstract class UniqueOrderedEnumerable<TSource> : IOrderedEnumerable<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            internal UniqueOrderedEnumerable<TSource> Parent;

            protected UniqueOrderedEnumerable(
                IEnumerable<TSource> list1)
            {
                _list1 = list1;
            }

            public IOrderedEnumerable<TSource> CreateOrderedEnumerable<TKey>(
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool @descending)
            {
                var oe = new UniqueOrderedEnumerable<TSource, TKey>(
                    _list1, keySelector, comparer, descending) { Parent = this };
                return oe;
            }

            public IEnumerator<TSource> GetEnumerator()
            {
                var criteria = GetCriteria().ToArray();

                var selector = new UniqueSelector<TSource>(_list1, criteria);
                return selector.DoSelect().GetEnumerator();
            }

            // Walks the ordering criteria to build an array that the UniqueSelector can use.
            private IEnumerable<UniqueOrderedEnumerable<TSource>> GetCriteria()
            {
                var keys = new Stack<UniqueOrderedEnumerable<TSource>>();

                var current = this;
                while (current != null)
                {
                    keys.Push(current);
                    current = current.Parent;
                }
                return keys;
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }

            // The individual ordering criteria instances (UniqueOrderedEnumerable<TSource, TKey>)
            // implement these abstract methods to provide key extraction and comparison.
            public abstract void ExtractKey(TSource item, int ix);
            public abstract int CompareKeys(int x, int y);
        }

        internal class UniqueOrderedEnumerable<TSource, TKey> : UniqueOrderedEnumerable<TSource>
        {
            private readonly Func<TSource, TKey> _keySelector;
            private readonly IComparer<TKey> _comparer;
            private readonly bool _descending;
            private readonly TKey[] _keys;

            internal UniqueOrderedEnumerable(
                IEnumerable<TSource> list1,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool descending)
                : base(list1)
            {
                _keySelector = keySelector;
                _comparer = comparer ?? Comparer<TKey>.Default;
                _descending = descending;

                _keys = new TKey[2];
            }

            public override int CompareKeys(int x, int y)
            {
                return _descending
                    ? _comparer.Compare(_keys[y], _keys[x])
                    : _comparer.Compare(_keys[x], _keys[y]);
            }

            public override void ExtractKey(TSource item, int ix)
            {
                _keys[ix] = _keySelector(item);
            }
        }

        internal class UniqueSelector<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            private readonly UniqueOrderedEnumerable<TSource>[] _criteria;

            public UniqueSelector(
                IEnumerable<TSource> list1,
                UniqueOrderedEnumerable<TSource>[] criteria)
            {
                _list1 = list1;
                _criteria = criteria;
            }

            public IEnumerable<TSource> DoSelect()
            {
                // Initialize the iterator
                var i1 = _list1.GetEnumerator();

                // ix controls the key position that's loaded
                var ix = 0;

                var next = new Func<bool>(() =>
                {
                    if (!i1.MoveNext()) return false;
                    ExtractKeys(i1.Current, ix);
                    return true;
                });

                var i1HasItems = next();
                if (!i1HasItems) yield break;

                // output the first item
                yield return i1.Current;
                ix = 1;
                i1HasItems = next();
                while (i1HasItems)
                {
                    // Output the item if it's not equal to the previous item
                    if (Compare(0, 1) != 0)
                    {
                        yield return i1.Current;

                        // Change the index position for the next loaded key
                        ix = (ix == 0) ? 1 : 0;
                    }
                    i1HasItems = next();
                }
            }

            private int Compare(int x, int y)
            {
                var rslt = 0;
                for (var i = 0; rslt == 0 && i < _criteria.Length; ++i)
                {
                    rslt = _criteria[i].CompareKeys(x, y);
                }
                return rslt;
            }

            private void ExtractKeys(TSource item, int ix)
            {
                foreach (var t in _criteria)
                {
                    t.ExtractKey(item, ix);
                }
            }
        }
    }
}

Thoughts after the election

If you believe the after-election commentary, the American people have “repudiated the policies of this administration and embraced conservative ideals.” Republicans are saying that gaining control of the Senate and increasing their control of the House is “a mandate” from the American people. In a sense it is, but almost certainly not in the way that they apparently believe and, more importantly, want you to believe.

For the president’s opposing party to take control of Congress during midterm elections is nothing new. Republicans did it in 1994, and Democrats did it in 2006. Historically, it’s unusual for the opposing party not to gain seats during the midterm. The media pundits and party propaganda twits will come up with all kinds of complicated reasons, often based on the belief that the winning party lied or cheated, but I think the real reason is much simpler: voters are expressing their discontent by throwing the bums out.

Throwing the bums out is a good thing as far as I’m concerned. Unfortunately, they just elect a new crop of bums and it’s business as usual.

The American electorate has an incredibly short memory and an even more limited imagination. They’re smart enough to see that things aren’t working well, and know that we need a change in Congress. But their imagination is limited to one of two flavors: Democrat or Republican. They’ll happily throw out the Democrats in favor of the Republicans, despite having been burned by the Republicans eight or ten years ago. And at some time in the near future they’ll throw the Republicans out in favor of the Democrats, forgetting that things weren’t any better the last time Democrats controlled Congress.

For some unfathomable reason, Americans lack the imagination to throw out Democrats and Republicans. That would send the right message. As it stands, all we’re doing is swapping one crowd of blood sucking parasites for another.

I liken it to being given the choice of having your face slapped repeatedly or getting punched in the gut repeatedly. We get tired of slaps after a while and switch to the gut punchers, but then our stomachs start to hurt and we go for the face slapping again. But what we really want is for people to stop hitting us. And yet we don’t have the imagination or the will to do anything about it.

In part, that’s because we long ago allowed Congress to make rules that enforce a very strict two-party system. The party that has the most seats in the House or Senate gets to make rules and control what legislation is presented. That in itself encourages an adversarial relationship, which is especially bad when the President’s party controls one house and the opposing party controls the other. In that case, the opposing party is forced to do whatever it can to block the president’s every move. To do otherwise would alienate their base and anybody else who might be discontented enough to vote for them during the next election cycle.

When the president’s party controls both houses of Congress, we’re in real trouble. Especially when they hold a super majority that can completely block every move of the opposing party. In that case, there is no opposition to whatever grandiose scheme the president’s party can dream up. We usually regret such laws that are enacted without careful consideration and lively debate. Giving a single party full control of two of our three branches of government is dangerous. So far we’ve been fortunate that the parties aren’t quite as tightly controlled as they could be. It’s a good thing that sometimes a party member will vote against the wishes of the party leaders.

We’re best off when one party controls the Executive branch and the other party controls the Legislative. In that case, the two parties are forced to work together. When one party controls both houses of Congress, the people who elected them expect them to Get Things Done. Sure, some of the party faithful think Congress should adopt an adversarial posture towards the president, but that leads to idiocy like the Clinton impeachment trial. Most of the people will want Congress to work with the president and enact meaningful legislation. Or, one would hope, repeal stupid legislation that was imposed on us when one party had a little too much power.

If Republicans can work with the president over the next two years (one year, really, because the second year will be dedicated to the mud slinging and political posturing we call campaigning), Republicans have a chance of gaining the White House again, and perhaps can keep from losing too much ground in the Congressional elections. If, however, they adopt an adversarial role and refuse to work with the president, the 2016 elections will be a repeat of 2008.

What I’d really like to see, though, is a meaningful third party or, preferably, a serious independent (i.e. no party affiliation) movement. Our system of party politics, especially the artificial two party system, is a serious impediment. It just encourages tribalism and perpetuates the dysfunctional governance we’ve experienced for the last 25 years or more.

Timers don’t keep track of time

In computer programming, the word Timer is used to describe a hardware or software component that raises an event (“ticks”) at a specified interval. So, for example, if you want a notification once per second, you would instantiate a timer and set its period to one second. The timer API often has you specify the period in milliseconds, so you’d say 1,000 milliseconds.

In a C# program, for example, you can create a timer that ticks every second, like this:


    private System.Threading.Timer _timer;
    private int _tickCount;
    public void Test()
    {
        _timer = new System.Threading.Timer(MyTimerHandler, null, 1000, 1000);
        Console.ReadLine();
    }

    private void MyTimerHandler(object state)
    {
        ++_tickCount;
        Console.WriteLine("{0:N0} tick", _tickCount);
    }

It’s important to note that ticks don’t come at exact 1-second intervals. The timer is reasonably accurate, but it’s not perfect. How imperfect is it? Let’s find out. Below I’ve modified the little program to start a Stopwatch and output the actual elapsed time with each tick.

    private System.Threading.Timer _timer;
    private Stopwatch _stopwatch;
    private int _tickCount;
    public void Test()
    {
        _stopwatch = Stopwatch.StartNew();
        _timer = new System.Threading.Timer(MyTimerHandler, null, 1000, 1000);
        Console.ReadLine();
    }

    private void MyTimerHandler(object state)
    {
        ++_tickCount;
        Console.WriteLine("{0:N0} - {1:N0}", _tickCount, _stopwatch.ElapsedMilliseconds);
    }

On my system, that program shows the timer to be off by about 200 milliseconds every three minutes. That’s on a system that isn’t doing anything else of note: no video playing, no downloads or streaming, no heavy-duty background jobs. 200 milliseconds every three minutes doesn’t sound like much, but that’s one second every fifteen minutes, or 96 seconds every day. My granddad’s wind-up Timex kept better time than that.

Granted, the problem could be with the Stopwatch class. But it isn’t. I created a simple program that starts a Stopwatch and then outputs the elapsed time whenever I press the Enter key. I let that program run for days, and every time I hit Enter, it agreed very closely with what my wristwatch said. There was some variation, of course, because no two clocks ever agree perfectly, but the differences between the Stopwatch and my wristwatch were a lot smaller than the differences between the Stopwatch and the Timer. I have since run that experiment on multiple computers and multiple clocks, and every time the Stopwatch has been quite reliable.
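
That checking program is only a few lines. A minimal version (mine, not the original) looks like this:

    private static void StopwatchCheck()
    {
        // Note the wall-clock time when this starts, then press Enter now
        // and again and compare the reported elapsed time against a clock.
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            Console.ReadLine();
            Console.WriteLine(stopwatch.Elapsed);
        }
    }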

Things get worse. On a heavily loaded system with lots of processing going on, timer ticks can be delayed by an indeterminate time. I’ve seen ticks delayed for five seconds or more, and then I get a flood of ticks. I’ll see, for example, a tick at 5,000 milliseconds, then one at 9,000 milliseconds followed very quickly by ticks at 9,010, 9,015, and 9,030 milliseconds. What happened is that the system was busy and couldn’t schedule the timer thread at 6,000 milliseconds, so the system buffered the ticks. When it finally got around to dispatching, three other ticks had come in and it fired all four of them as quickly as it could.

This can be a huge problem on heavily loaded systems because it’s possible that multiple timer ticks can be processing concurrently.
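
A common way to keep overlapping ticks from stepping on each other is to have the handler bail out if the previous tick is still being processed. Here’s a sketch of that idea, using a hypothetical GuardedTimerHandler:

    private int _handlerRunning; // 0 = idle, 1 = a tick is being processed

    private void GuardedTimerHandler(object state)
    {
        // If a previous tick is still in progress, skip this one entirely.
        if (Interlocked.CompareExchange(ref _handlerRunning, 1, 0) != 0)
            return;
        try
        {
            // ... do the periodic work here ...
        }
        finally
        {
            Interlocked.Exchange(ref _handlerRunning, 0);
        }
    }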

The only thing a timer is good for is to give you notification on a periodic basis–approximately at the frequency you requested. A timer is pointedly not for keeping time.

Given that, you can probably explain what’s wrong with this code, which is a very simplified example of something I saw recently. The original code did some processing and periodically checked the _elapsedSeconds variable to see how much time had elapsed. I simplified it here to illustrate the folly of that approach.

    private Timer _timer;
    private int _elapsedSeconds;
    private ManualResetEvent _event;
    private void Test()
    {
        _timer = new Timer(MyTimerHandler, null, 1000, 1000);
        _event = new ManualResetEvent(false);
        _event.WaitOne();
        Console.WriteLine("One minute!");
        _timer.Dispose();
    }
    
    private void MyTimerHandler(object state)
    {
        ++_elapsedSeconds;
        if (_elapsedSeconds == 60)
        {
            _event.Set();
        }
    }

Here, the programmer is using the timer like the second hand of a clock: each tick represents one second. There are at least two problems in that code. First, we’ve already seen that the timer won’t tick at exactly one-second intervals. Second and more importantly, it’s possible on a busy system for multiple timer events to occur concurrently. Imagine what happens in this case:

  1. _elapsedSeconds is equal to 59.
  2. A tick occurs and Thread A is dispatched to handle it.
  3. Thread A increments the _elapsedSeconds value to 60.
  4. Another tick occurs and Thread B is dispatched to handle it.
  5. At the same time, Thread A is swapped out in favor of some higher priority thread.
  6. Thread B increments the _elapsedSeconds value to 61.

At this point, the _elapsedSeconds value has been incremented beyond 60, so the conditional will never be true and the event will never be set. (Well, it might: 4 billion seconds from now, when _elapsedSeconds rolls over. That’s only about 125 years.)

You could change the conditional to _elapsedSeconds >= 60, but that’s missing the point. We’ve already shown that the timer isn’t going to be particularly accurate. If you were trying to time a whole day, you could be off by a minute and a half.

The problem of concurrent execution of timer handlers is another topic entirely that I probably should cover in a separate post. It’s important you understand that it can happen, though, and you have to keep that in mind.

In programming, timers don’t keep time. Don’t try to make them. They can’t count up reliably, and they can’t count down reliably. If you want to keep track of elapsed time, start a Stopwatch and then check its Elapsed or ElapsedMilliseconds properties when necessary.
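
Here’s the earlier example reworked along those lines (a sketch, not a drop-in fix for the original code): the timer still ticks roughly once per second, but the Stopwatch is the source of truth for elapsed time, so late or concurrent ticks can’t corrupt the count:

    private System.Threading.Timer _timer;
    private Stopwatch _stopwatch;
    private ManualResetEvent _event;

    private void Test()
    {
        _stopwatch = Stopwatch.StartNew();
        _event = new ManualResetEvent(false);
        _timer = new System.Threading.Timer(MyTimerHandler, null, 1000, 1000);
        _event.WaitOne();
        Console.WriteLine("One minute!");
        _timer.Dispose();
    }

    private void MyTimerHandler(object state)
    {
        // Ask the Stopwatch how much time has really passed rather than
        // assuming that each tick represents exactly one second.
        if (_stopwatch.ElapsedMilliseconds >= 60000)
            _event.Set();
    }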

On a related note, do not use the clock to measure time.

The MergeWithBy LINQ extension

Last time, I showed how to write the standard two-way merge using LINQ. With that code, you can easily merge two sorted sequences from any type of container that implements the IEnumerable<T> interface. That in itself is very useful, and with the addition of a parameter that defines the comparison function, it likely would suffice for most merge operations. However, it’s less than satisfactory for the same reasons I pointed out in my introduction to the TopBy method earlier this year: sometimes creating an IComparer is a giant pain in the neck. To make a LINQ-compatible merge capability, I really need to implement IOrderedEnumerable.

I’ve struggled with what to call the method. Obvious candidates are Merge, MergeBy, MergeWith, and MergeWithBy, but each of those has its drawbacks. Consider, for example, the possibilities:

    var merged = List1.Merge(List2, k => k.Age).ThenBy(k => k.LastName);
    var merged = List1.MergeBy(List2, k => k.Age).ThenBy(k => k.LastName);
    var merged = List1.MergeWith(List2, k => k.Age).ThenBy(k => k.LastName);
    var merged = List1.MergeWithBy(List2, k => k.Age).ThenBy(k => k.LastName);

I’m not 100% happy with any one of those options but after struggling with it for a while I decided on MergeWithBy which, although a little clunky to the ear (MergeWithByDescending in particular), is the most descriptive. I’m merging one list with another by the specified ordering criteria.

With the naming out of the way, the easy part is defining the LINQ extension methods. These exactly mirror the standard OrderBy methods as well as my TopBy methods.

    public static class MergeExtension
    {
        public static IOrderedEnumerable<TSource> MergeWithBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            IEnumerable<TSource> list2,
            Func<TSource, TKey> keySelector)
        {
            return MergeWithBy(list1, list2, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> MergeWithBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            IEnumerable<TSource> list2,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new MergeOrderedEnumerable<TSource, TKey>(list1, list2, keySelector, comparer, false);
        }

        public static IOrderedEnumerable<TSource> MergeWithByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            IEnumerable<TSource> list2,
            Func<TSource, TKey> keySelector)
        {
            return MergeWithByDescending(list1, list2, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> MergeWithByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            IEnumerable<TSource> list2,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new MergeOrderedEnumerable<TSource, TKey>(list1, list2, keySelector, comparer, true);
        }
    }

The harder part is implementing the IOrderedEnumerable<T> interface that actually does the work. The idea behind it, as I described in my two-part series on IOrderedEnumerable, is to create, for each of the ordering criteria, a class instance that can compare the relevant key. There is a base class, MergeOrderedEnumerable<TSource>, and a derived class for each key type: MergeOrderedEnumerable<TSource, TKey>.

I covered the details of how those two classes work in the article linked above, and in the second part. For the most part, those two classes are just bookkeeping: setting things up so that a third class, the selector, can do the actual work. The full source of the merge extension class is shown at the end of this post.

The guts of the merge–the part that implements the Merge method I showed last time–is the MergeSelector.DoSelect method, shown here:

    public IEnumerable<TSource> DoSelect()
    {
        // Initialize the iterators
        var iterators = new IEnumerator<TSource>[2];

        var next = new Func<int, bool>(ix =>
        {
            if (!iterators[ix].MoveNext()) return false;
            ExtractKeys(iterators[ix].Current, ix);
            return true;
        });

        iterators[0] = _list1.GetEnumerator();
        iterators[1] = _list2.GetEnumerator();
        var i1HasItems = next(0);
        var i2HasItems = next(1);
        while (i1HasItems && i2HasItems)
        {
            if (Compare(0, 1) <= 0)
            {
                yield return iterators[0].Current;
                i1HasItems = next(0);
            }
            else
            {
                yield return iterators[1].Current;
                i2HasItems = next(1);
            }
        }

        while (i1HasItems)
        {
            yield return iterators[0].Current;
            i1HasItems = next(0);
        }

        while (i2HasItems)
        {
            yield return iterators[1].Current;
            i2HasItems = next(1);
        }
    }

This code is nearly a direct translation of the Merge I showed the last time. The primary difference in the structure of the code is that I put the iterators in an array so that I could reduce the amount of duplicated code. The next function advances to the next item in whichever list is specified. Because I have to extract keys that are maintained by the individual MergeOrderedEnumerable<TSource, TKey> instances, without that function I’d have to write code like this four times (twice for i1, and twice for i2):

    yield return i1.Current;
    i1HasItems = i1.MoveNext();
    if (i1HasItems)
    {
        ExtractKeys(i1.Current, 0);
    }

I suppose I could have combined the code:

    if ((i1HasItems = i1.MoveNext())) ExtractKeys(i1.Current, 0);

Call it a matter of style.

That’s really all there is to it. The MergeOrderedEnumerable classes look unnecessarily complex at first glance, but after you study them for a few minutes, you understand why all that code is there. It’s particularly instructive to set up a short merge of a few items, and then single-step the code. Not only will you see how all this stuff works together, you’ll gain a better understanding of how LINQ works in general.
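
For example, a little program like this (with a hypothetical Item class holding Age and LastName properties) makes a good single-stepping target:

    var people1 = new[]
    {
        new Item { Age = 25, LastName = "Baker" },
        new Item { Age = 30, LastName = "Able" }
    };
    var people2 = new[]
    {
        new Item { Age = 25, LastName = "Adams" },
        new Item { Age = 40, LastName = "Carter" }
    };
    var merged = people1.MergeWithBy(people2, k => k.Age).ThenBy(k => k.LastName);
    foreach (var p in merged)
        Console.WriteLine("{0} {1}", p.Age, p.LastName);

Both inputs are already sorted by age and then name, which is a precondition of the merge; the output is Adams, Baker, Able, Carter with ages 25, 25, 30, 40.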

That’s almost all there is to merging two sequences. The only thing remaining is the matter of uniqueness, which I’ll talk about next time in Removing duplicates.

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    namespace EnumerableExtensions
    {
        public static class MergeExtension
        {
            public static IOrderedEnumerable<TSource> MergeWithBy<TSource, TKey>(
                this IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector)
            {
                return MergeWithBy(list1, list2, keySelector, null);
            }

            public static IOrderedEnumerable<TSource> MergeWithBy<TSource, TKey>(
                this IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new MergeOrderedEnumerable<TSource, TKey>(list1, list2, keySelector, comparer, false);
            }

            public static IOrderedEnumerable<TSource> MergeWithByDescending<TSource, TKey>(
                this IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector)
            {
                return MergeWithByDescending(list1, list2, keySelector, null);
            }

            public static IOrderedEnumerable<TSource> MergeWithByDescending<TSource, TKey>(
                this IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new MergeOrderedEnumerable<TSource, TKey>(list1, list2, keySelector, comparer, true);
            }
        }

        internal abstract class MergeOrderedEnumerable<TSource> : IOrderedEnumerable<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            private readonly IEnumerable<TSource> _list2;
            internal MergeOrderedEnumerable<TSource> Parent;

            protected MergeOrderedEnumerable(
                IEnumerable<TSource> list1,
                IEnumerable<TSource> list2)
            {
                _list1 = list1;
                _list2 = list2;
            }

            public IOrderedEnumerable<TSource> CreateOrderedEnumerable<TKey>(
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool @descending)
            {
                var oe = new MergeOrderedEnumerable<TSource, TKey>(
                    _list1, _list2, keySelector, comparer, descending) {Parent = this};
                return oe;
            }

            public IEnumerator<TSource> GetEnumerator()
            {
                var criteria = GetCriteria().ToArray();

                var selector = new MergeSelector<TSource>(_list1, _list2, criteria);
                return selector.DoSelect().GetEnumerator();
            }

            // Walks the ordering criteria to build an array that the MergeSelector can use.
            private IEnumerable<MergeOrderedEnumerable<TSource>> GetCriteria()
            {
                var keys = new Stack<MergeOrderedEnumerable<TSource>>();

                var current = this;
                while (current != null)
                {
                    keys.Push(current);
                    current = current.Parent;
                }
                return keys;
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }

            // The individual ordering criteria instances (MergeOrderedEnumerable<TSource, TKey>)
            // implement these abstract methods to provide key extraction and comparison.
            public abstract void ExtractKey(TSource item, int ix);
            public abstract int CompareKeys(int x, int y);
        }

        internal class MergeOrderedEnumerable<TSource, TKey> : MergeOrderedEnumerable<TSource>
        {
            private readonly Func<TSource, TKey> _keySelector;
            private readonly IComparer<TKey> _comparer;
            private readonly bool _descending;
            private readonly TKey[] _keys;

            internal MergeOrderedEnumerable(
                IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool descending)
                : base(list1, list2)
            {
                _keySelector = keySelector;
                _comparer = comparer ?? Comparer<TKey>.Default;
                _descending = descending;

                _keys = new TKey[2];
            }

            public override int CompareKeys(int x, int y)
            {
                return _descending
                    ? _comparer.Compare(_keys[y], _keys[x])
                    : _comparer.Compare(_keys[x], _keys[y]);
            }

            public override void ExtractKey(TSource item, int ix)
            {
                _keys[ix] = _keySelector(item);
            }
        }

        internal class MergeSelector<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            private readonly IEnumerable<TSource> _list2;
            private readonly MergeOrderedEnumerable<TSource>[] _criteria;

            public MergeSelector(
                IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                MergeOrderedEnumerable<TSource>[] criteria)
            {
                _list1 = list1;
                _list2 = list2;
                _criteria = criteria;
            }

            public IEnumerable<TSource> DoSelect()
            {
                // Initialize the iterators
                var iterators = new IEnumerator<TSource>[2];

                var next = new Func<int, bool>(ix =>
                {
                    if (!iterators[ix].MoveNext()) return false;
                    ExtractKeys(iterators[ix].Current, ix);
                    return true;
                });

                iterators[0] = _list1.GetEnumerator();
                iterators[1] = _list2.GetEnumerator();
                var i1HasItems = next(0);
                var i2HasItems = next(1);
                while (i1HasItems && i2HasItems)
                {
                    if (Compare(0, 1) <= 0)
                    {
                        yield return iterators[0].Current;
                        // I could do a loop here using ExtractCompare to compare against
                        // item 2. That would reduce the key extraction as long as the
                        // new item from list 1 is smaller than the new item from list 2.
                        // Then extract all of the keys once l1 goes high.
                        // Lots of code that might not be particularly useful.
                        i1HasItems = next(0);
                    }
                    else
                    {
                        yield return iterators[1].Current;
                        i2HasItems = next(1);
                    }
                }

                while (i1HasItems)
                {
                    yield return iterators[0].Current;
                    // TODO: Could add an "extract" parameter to the next function
                    // If "extract" is false, it doesn't extract the keys.
                    // That would speed up the tail copying,
                    // but might not be worth the trouble.
                    i1HasItems = next(0);
                }

                while (i2HasItems)
                {
                    yield return iterators[1].Current;
                    i2HasItems = next(1);
                }
            }

            private int Compare(int x, int y)
            {
                var rslt = 0;
                for (var i = 0; rslt == 0 && i < _criteria.Length; ++i)
                {
                    rslt = _criteria[i].CompareKeys(x, y);
                }
                return rslt;
            }

            private void ExtractKeys(TSource item, int ix)
            {
                foreach (var t in _criteria)
                {
                    t.ExtractKey(item, ix);
                }
            }
        }
    }