When I started writing the performance test code, I got sidetracked with yet another way to merge multiple sorted lists. While I was analyzing it I realized that the pairwise merge might have a problem, as well. It’s worth talking about here because knowing of the problem’s existence will help me create better test data when I finally get around to performance testing.

To review, assuming four lists named A, B, C, and D, the pairwise merge works like this:

```
Merge A and B to produce list AB
Merge C and D to produce list CD
Merge AB and CD to produce list ABCD
```

You can achieve the same result without a queue:

```
Merge A and B to produce AB
Merge AB and C to produce ABC
Merge ABC and D to produce ABCD
```

Whereas they both require the same number of merges, not all merges are equal.

We can estimate the speed of these algorithms by figuring out how many items they process. We already know that merging two lists that have among them `n`

items requires processing `n`

items. So if we sum the number of items for each two-way merge, we have the total number of items processed for the entire operation.

If we assume that each of the lists has 10 items, then the first example requires:

20 items to merge A and B. The resulting list AB has 20 items.

20 items to merge C and D. The resulting list CD has 20 items.

40 items to merge AB and CD.

80 total items processed.

The second example requires:

20 items to merge A and B. The resulting list AB has 20 items.

30 items to merge AB and C. The resulting list ABC has 30 items.

40 items to merge ABC and D.

90 total items processed.

In this case, the queue-based algorithm is faster.

But not in all cases. Consider, for example, what happens when the lists’ lengths are:

```
A = 1
B = 1
C = 1
D = 100
```

The queue merge will look at two items to make AB, 101 items to make CD, and 103 items to make ABCD. That’s a total of 206 items processed. The other algorithm will look at two items to make AB, three items to make ABC, and 103 items to make ABCD, for a total of 108 items processed. If you switch the order, though, the queue merge will still process 203 items but the other version will have to look at 101 + 102 + 103, or 308 items.

The order of merges matters. A lot. Change that 100 to 1,000,000, and you start to get the idea.

If we know the lengths of the lists to be merged, we could potentially order the pairwise merges to take advantage of that information. The idea would be to order the merges so that we create lists that are relatively the same length. It wouldn’t have to be a *perfect* solution in order to realize a huge performance increase. A simple priority queue ordered by increasing list length would allow us to merge the shorter lists first, which would give a reasonably good solution to the problem.

The only drawback–and it’s a deal killer–is that we don’t know how long the lists are. All of the merge code I’ve been working with assumes that the lists are of indeterminate length.

The result is that worst cases can happen. If we use the queue-based pairwise merge, at some point we’ll be presented with a pathological case that makes the merge perform very poorly. In those cases, the running time isn’t O(n log_{2} k), but rather O(n*k). Ouch.

In contrast, the heap-based merge *always* takes O(n log_{2} k) time. The reason is that every item is inserted into the heap once, and every item is removed from the heap once, regardless of the individual lists’ length.

Asymptotic analysis, then, tells us that the heap-based merge is better because it’s always O(n log_{2} k), whereas the pairwise merge *can* be O(n*k). *Actual running time* is a different story. If maintaining the heap is very expensive in comparison to maintaining the enumerators, then the pairwise merge can still outperform the heap-based merge.

The pairwise merge *looks* like it should be faster in the average case, and even in pathological cases with relatively few items. I suspect that it will be slower than the heap-based merge with many large, unevenly-sized lists.

Don’t take my word for it, though. I’ve been wrong often enough to distrust my gut feelings when it comes to performance. The only way to know for sure which is faster is to measure it.

[…] Getting back to my discussion of merging multiple lists. […]