A different way to merge multiple lists

In my previous post, I said that this time I would show full-featured MergeBy and MergeByDescending methods. Before I do that, I want to show an alternate method for merging multiple sorted lists.

If you recall, I said that the generalized k-way merge algorithm is:

    Initialize list indexes
    while any list has items
        Determine which list has the smallest current item
        Output the item from that list
        Increment that list's current index

As I illustrated last time, that does indeed work. But there is another way to do it that is, asymptotically at least, just as fast. That is, it operates in O(n log2 k) time.

Imagine you have four sorted lists, named A, B, C, and D. Another way to merge them is this way:

    Merge A and B to create a list called AB.
    Merge C and D to create a list called CD.
    Merge AB and CD to create the final list.

To estimate the time complexity, let’s assume that each of the four lists contains the same number of items. So if the total number of items is n, then each list’s length is n/4.

Merging A and B, then, takes time proportional to n/4 + n/4, or n/2. Merging C and D also requires n/2 time. The final merge of AB and CD takes n/2 + n/2. So the total amount of time it takes to merge the four lists is 2n.

Time complexity of the k-way merge is, as I said, O(n log2 k). With four lists, that’s n * log2(4)log2(4) = 2, so the time to merge four lists with a total of n items is n * 2, or 2n.

If there is an odd number of lists, then one of the original lists gets merged with one of the larger, already merged, lists. For example, with five lists the task becomes:

    Merge A and B to create a list called AB.
    Merge C and D to create a list called CD.
    Merge AB and E to create a list called ABE.
    Merge ABE and CD to create the final list.

To do that with an arbitrary number of lists, we create a queue that initially contains the original lists. We then remove the first two lists from the queue, merge them, and add the result back to the queue. Then take the next two, merge them, and add the result to the queue. We continue doing that until there is only one list left on the queue. That is:

    Initialize queue with all of the lists.
    while queue has more than one item
        l1 = queue.Dequeue()
        l2 = queue.Dequeue()
        rslt = Merge(l1, l2)
        queue.Enqueue(rslt)
    merged = queue.Dequeue()

At first, it seems like that method will require a lot of extra space to hold the temporary lists. Remember, the merge algorithms I’ve shown so far don’t require much in the way of extra storage: just a little memory to hold the smallest value in each of the lists. That is, O(k) extra space. Perhaps not surprisingly, you can do the same thing with LINQ. Here’s the code.

    public static IEnumerable<T> MergeVersionTwo<T>(IEnumerable<IEnumerable<T>> lists)
    {
        // Populate a queue with all the lists.
        var theQueue = new Queue<IEnumerable<T>>(lists);
        if (theQueue.Count == 0) yield break;

        // Do pair-wise merge repeatedly until there is only one list left. 
        while (theQueue.Count > 1)
        {
            var l1 = theQueue.Dequeue();
            var l2 = theQueue.Dequeue();
            var rslt = Merge(l1, l2); // uses the two-way merge
            theQueue.Enqueue(rslt);
        }
        var merged = theQueue.Dequeue();
        foreach (var x in merged)
        {
            yield return x;
        }
    }

That code uses the two-way merge that I presented a while back.

When building the queue, we’re just setting things up. Everything is an IEnumerable<T>, meaning that no actual work takes place. LINQ’s deferred execution postpones the merging until we start to enumerate the result. And when we do enumerate the result, only one item is produced at a time. All of the intermediate merges take place simultaneously, in steps.

Here’s some code that uses the new MergeVersionTwo method:

    var l1 = new int[] {1, 3, 9, 12, 15};
    var l2 = new int[] {2, 7, 14, 16, 19};
    var l3 = new int[] {6, 8, 13, 17, 20};
    var rslt = MergeVersionTwo(new List<int[]> {l1, l2, l3});

    foreach (var i in rslt)
        Console.WriteLine(i);

If you wrap that in a method and single-step it, you’ll see that no real work is done until you get to the foreach. Even then, all that mucking about with the queue doesn’t enumerate any of the lists until you get to the foreach at the end of the MergeVersionTwo method. If you start single-stepping (in Visual Studio, press F11, or the “Step into” debugger command), you’ll see it start working on the two-way merges. Watching how all that works will help you to understand just how powerful deferred execution can be.

It’s important to note that asymptotic analysis just tells us how the run time varies with the size of the input. It doesn’t say that both of the k-way merge algorithms will take the same amount of time. n log2 k just says that for a particular algorithm the amount of time varies with the size of the input and the logarithm of the number of lists. We’ll have to do some performance tests to determine which is actually faster.

That testing should prove interesting. Before we go there, though, we need to look more closely at the pairwise merge.