A closer look at pairwise merging

When I started writing the performance test code, I got sidetracked with yet another way to merge multiple sorted lists. While I was analyzing it I realized that the pairwise merge might have a problem, as well. It’s worth talking about here because knowing of the problem’s existence will help me create better test data when I finally get around to performance testing.

To review, assuming four lists named A, B, C, and D, the pairwise merge works like this:

    Merge A and B to produce list AB
    Merge C and D to produce list CD
    Merge AB and CD to produce list ABCD

You can achieve the same result without a queue:

    Merge A and B to produce AB
    Merge AB and C to produce ABC
    Merge ABC and D to produce ABCD
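To make the two strategies concrete, here's a minimal sketch in Python. The names merge2, merge_queue, and merge_fold are mine, not anything from the merge code this series has been developing:

    from collections import deque

    def merge2(xs, ys):
        """Standard two-way merge of two sorted lists; touches every item once."""
        out, i, j = [], 0, 0
        while i < len(xs) and j < len(ys):
            if xs[i] <= ys[j]:
                out.append(xs[i])
                i += 1
            else:
                out.append(ys[j])
                j += 1
        # at most one of the lists has items left
        out.extend(xs[i:])
        out.extend(ys[j:])
        return out

    def merge_queue(lists):
        """Pairwise merge driven by a FIFO queue: AB, CD, then ABCD."""
        q = deque(lists)          # assumes at least one list
        while len(q) > 1:
            q.append(merge2(q.popleft(), q.popleft()))
        return q.pop()

    def merge_fold(lists):
        """Pairwise merge without a queue: AB, ABC, then ABCD."""
        result = lists[0]
        for lst in lists[1:]:
            result = merge2(result, lst)
        return result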

Both approaches require the same number of two-way merges, but not all merges are equal.

We can estimate the speed of these algorithms by counting how many items they process. We already know that merging two lists that together contain n items requires processing all n items. So if we sum the number of items processed by each two-way merge, we get the total number of items processed for the entire operation.

If we assume that each of the lists has 10 items, then the first example requires:

    20 iterations to merge A and B. The resulting list AB has 20 items.
    20 iterations to merge C and D. The resulting list CD has 20 items.
    40 iterations to merge AB and CD.
    80 total iterations.

The second example requires:

    20 iterations to merge A and B. The resulting list AB has 20 items.
    30 iterations to merge AB and C. The resulting list ABC has 30 items.
    40 iterations to merge ABC and D.
    90 total iterations.
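We don't have to build any lists to do this kind of counting; summing the output size of each two-way merge is enough. A throwaway sketch (again, the names are mine):

    from collections import deque

    def queue_merge_cost(sizes):
        """Items processed when merges are paired off a FIFO queue."""
        q = deque(sizes)
        total = 0
        while len(q) > 1:
            merged = q.popleft() + q.popleft()
            total += merged        # a two-way merge touches every item once
            q.append(merged)
        return total

    def fold_merge_cost(sizes):
        """Items processed when each list is folded into the running result."""
        total, acc = 0, sizes[0]
        for n in sizes[1:]:
            acc += n
            total += acc
        return total

    print(queue_merge_cost([10, 10, 10, 10]))  # 80
    print(fold_merge_cost([10, 10, 10, 10]))   # 90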

In this case, the queue-based algorithm is faster.

But not in all cases. Consider, for example, what happens when the lists’ lengths are:

    A = 1
    B = 1
    C = 1
    D = 100

The queue merge will look at two items to make AB, 101 items to make CD, and 103 items to make ABCD. That's a total of 206 items processed. The other algorithm will look at two items to make AB, three items to make ABC, and 103 items to make ABCD, for a total of 108 items processed. If you reverse the order of the lists, though, the queue merge still processes 206 items, but the other version has to look at 101 + 102 + 103, or 306 items.
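Running the counting sketch from above against those lengths reproduces the numbers:

    print(queue_merge_cost([1, 1, 1, 100]))  # 206
    print(queue_merge_cost([100, 1, 1, 1]))  # 206: the queue doesn't care
    print(fold_merge_cost([1, 1, 1, 100]))   # 108
    print(fold_merge_cost([100, 1, 1, 1]))   # 306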

The order of merges matters. A lot. Change that 100 to 1,000,000, and you start to get the idea.

If we knew the lengths of the lists to be merged, we could order the pairwise merges to take advantage of that information. The idea would be to order the merges so that the intermediate lists come out roughly the same length. The ordering wouldn't have to be perfect to realize a huge performance increase. A simple priority queue ordered by increasing list length would let us merge the shortest lists first, which gives a reasonably good solution to the problem.
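Shortest-two-first is easy to express with a heap of lengths; it's the same greedy idea used to build Huffman trees, which suggests it's actually a very good ordering, not just a reasonable one. A sketch:

    import heapq

    def ordered_merge_cost(sizes):
        """Items processed when we always merge the two shortest lists first."""
        heap = list(sizes)
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            merged = heapq.heappop(heap) + heapq.heappop(heap)
            total += merged
            heapq.heappush(heap, merged)
        return total

    print(ordered_merge_cost([1, 1, 1, 100]))  # 108: merges of 1+1, 2+1, 3+100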

The only drawback (and it's a deal killer) is that we don't know how long the lists are. All of the merge code I've been working with assumes that the lists are of indeterminate length.

The result is that worst cases can happen. Merge the lists in whatever order they're handed to us, and at some point we'll be presented with a pathological case that makes the merge perform very poorly. The queue-less version is the extreme: fold one long list together with a parade of short ones, and the running time isn't O(n log2 k), but rather O(n*k). The queue-based version can't degrade that far, because every list takes part in roughly log2 k merges, but as the example above shows, it can still do nearly twice the work of a well-ordered merge. Ouch.

In contrast, the heap-based merge always takes O(n log2 k) time. The reason is that every item is inserted into the heap once and removed from the heap once, and each of those heap operations costs O(log2 k), regardless of the individual lists' lengths.
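Here's a minimal sketch of that kind of merge. (Python's standard library ships a production version as heapq.merge; this one just shows where the bound comes from.) The heap never holds more than one entry per list, so each push and pop costs O(log2 k):

    import heapq

    def heap_merge(lists):
        """k-way merge: every item passes through a heap of at most k entries."""
        iters = [iter(lst) for lst in lists]
        heap = []
        for i, it in enumerate(iters):
            for first in it:              # seed the heap with each list's head
                heap.append((first, i))
                break
        heapq.heapify(heap)
        while heap:
            value, i = heapq.heappop(heap)      # O(log k)
            yield value
            for nxt in iters[i]:                # advance the source list
                heapq.heappush(heap, (nxt, i))  # O(log k)
                break

    print(list(heap_merge([[1, 4, 7], [2, 5], [3, 6, 8]])))
    # [1, 2, 3, 4, 5, 6, 7, 8]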

Asymptotic analysis, then, tells us that the heap-based merge is better because it's always O(n log2 k), whereas a badly ordered pairwise merge can be O(n*k). Actual running time is a different story. If maintaining the heap is expensive compared to maintaining the enumerators, the pairwise merge can still outperform the heap-based merge.

The pairwise merge looks like it should be faster in the average case, and even in pathological cases with relatively few items. I suspect that it will be slower than the heap-based merge with many large, unevenly sized lists.

Don’t take my word for it, though. I’ve been wrong often enough to distrust my gut feelings when it comes to performance. The only way to know for sure which is faster is to measure it.