Merging multiple lists with LINQ

To merge multiple sorted lists efficiently, we need a better way to obtain the smallest value from among the lists’ current items. That is, given three sorted lists:

    A: 1, 8, 13, 19, 40
    B: 0, 4, 12, 41
    C: 3, 22, 30, 37, 43

The first item to output would be the 0 from list B. Then we’d output 1 from list A, 3 from list C, etc. In my previous post I showed one way to find the smallest value. That involved looking at the current item from each list to get the minimum. It works, but it’s very expensive. If you have k sorted lists, then it requires O(k) operations to determine the smallest item.

I spent a lot of time last year talking about heaps. The heap, as you recall, is a data structure that efficiently keeps track of the smallest (or largest, depending on how you structure it) item. In particular, you can remove the smallest item from the heap in O(log2 k) time, where k is the number of items in the heap. Also, insertion into the heap is an O(log2 k) operation.

If there are n items spread out over k lists, then the naive method takes O(n * k) time, and the method with the heap requires O(n log2 k) time. Using a heap can save you a lot of time, even when k is small.

That sounds simple enough, but there’s one catch: you can’t just store values in the heap. With the value, you have to store a reference to the list itself. Otherwise you don’t know which list index to increment. Still, it’s not terribly difficult. With C#, we could use a Tuple<T1, T2>, or a simple object. Tuple is convenient, but I prefer creating a class for this kind of thing:

    class HeapItem
    {
        int[] a;  // the array
        int ix;   // current index
    }

Given that, and assuming we have a heap data structure, merging multiple lists becomes much easier. Following the basic algorithm from last time, the almost-C# code looks like this:

    // Initialize the heap
    var heap = new MinHeap();
    foreach (array in arrays)
        if (array.Length > 0)
            heap.Insert(new HeapItem(a = array; ix = 0);
    
    while (!heap.IsEmpty)
    {
        HeapItem item = heap.RemoveSmallest();
        result.Add(item.a[item.ix]);  // output item
        item.ix++;                    // move to next item in this array
        // and put it into the heap if there are still items
        if (item.ix < item.a.Length)
            heap.Insert(item);
    }

I glossed over the heap implementation here. The code assumes that you have a heap implementation that will work with those HeapItem instances. I do in fact have such a thing, and we’ll use it in a bit.

Note that there is no need to explicitly check for the end of all lists. All of the lists’ indexes are initialized at 0. When a list is removed from the heap, its current item is output and then its index is incremented. If the new current index is beyond the end of the list, then the list isn’t added back to the heap. If we drop lists from the heap once they’ve been exhausted, then we know that there are still items as long as the heap is not empty. That simplifies things quite a bit.

The problem with the above is that it assumes we’re working with arrays of integers. What we really want is a generic Merge method that works with any type that implements IEnumerable<T>. That is:

    IEnumerable<T> Merge<T>(IEnumerable<IEnumerable<TSource>> lists);

The basic idea isn’t much different from what I presented in Merging sequences with LINQ. But rather than having only two enumerators and doing a simple comparison, I create a heap of enumerators. It looks like this:

    public static IEnumerable<T> Merge<T>(IEnumerable<IEnumerable<T>> lists)
    {
        // This builds a heap of the first items from each list
        var initialItems =
            lists.Select(l => l.GetEnumerator())
                .Where(i => i.MoveNext());

        var heap = new MinDHeap<IEnumerator<T>>(
            2, initialItems, new EnumeratorComparer<T>(Comparer<T>.Default));

        while (!heap.IsEmpty)
        {
            var i = heap.RemoveRoot();
            yield return i.Current;
            if (i.MoveNext())
            {
                heap.Insert(i);
            }
        }
    }

That code relies on an EnumeratorComparer, which is defined as:

    private class EnumeratorComparer<T> : IComparer<IEnumerator<T>>
    {
        private readonly IComparer<T> _comp;

        public EnumeratorComparer(IComparer<T> comparer)
        {
            _comp = comparer ?? Comparer<T>.Default;
        }

        public int Compare(IEnumerator<T> x, IEnumerator<T> y)
        {
            var rslt = _comp.Compare(x.Current, y.Current);
            return rslt;
        }
    }

It also depends on my DHeap, the code for which you can download from that page.

One of the cool things about using enumerators here is that the concept of the current item is built in. The logic of incrementing and checking for the end of the list is all there in the MoveNext method and the Current property. So the code to implement the merge is incredibly simple.

I should mention the initialization code, which consists of this LINQ expression:

   var initialItems =
        lists.Select(l => l.GetEnumerator())
             .Where(i => i.MoveNext());

That code is just getting the enumerators for each of the lists, positioning to the first item in the list, and adding it to the list of enumerators (which is subsequently inserted into the heap) if the list has items. It’s roughly equivalent to:

    foreach (var list in lists)
    {
        var i = list.GetEnumerator();
        if (i.MoveNext)
            initialItems.Add(i);
    }

That’s the simple method that merges multiple lists using the default comparison. That’s useful in itself, but it doesn’t provide all of the features that I included in the MergeWithBy extension. Next time, we’ll look at extending this code to provide full-featured MergeBy and MergeByDescending methods.

Actually, I went off on a tangent and presented a different way to merge multiple lists. I’ll get to the full-featured methods after exploring that.