A fairly common operation in data processing applications is the sort/merge. That is, merging two or more sorted lists to create a single sorted list. In my previous work I frequently had to merge lists that contained tens or hundreds of millions of items each. When you’re working with lists that large, the choice of merge algorithm can make the difference between a few seconds’ work and an overnight job.
The naive way to merge two lists into a third is to concatenate and sort. That is, given these two lists:
List A = [1, 5, 6, 9]
List B = [2, 3, 6, 8]
You’d first create the list [1, 5, 6, 9, 2, 3, 6, 8]
, and then sort the sequence to produce [1, 2, 3, 5, 6, 6, 8, 9]
.
That technique works well enough, but it’s expensive. The best sorting algorithms take time proportional to n log2(n)
, where n
is the number of items in the list. That’s no big deal when you’re working with reasonably short lists, but sorting a few hundred million items takes significant time. Another problem is that the sort is necessarily an in-memory operation, or a very complicated disk-based merge sort which is even more expensive.
You can do much better by taking advantage of the existing ordering. We know that both lists are already sorted, so you know that the first item in the combined list will be the smallest of the first items in each of the lists. In the lists above, List A starts with 1, and List B starts with 2. You know that the first item in the new list has to be 1.
If you remove 1 from List A and repeat the process, you’re comparing 5 with 2. Obviously, you output 2 from List B and remove it from the list. Then compare 5 with 3, etc. You do this until one of the lists is exhausted, and then you output all the items from the other list. In pseudo code, it looks like this:
while (not end of List A and not end of List B)
if (List A current item <= List B current item)
output List A current item
advance List A index
else
output List B current item
advance List B index
// At this point, one of the lists is empty.
// Output remaining items from the other
while (not end of List A)
output List A current item
advance List A index
while (not end of List B)
output List B current item
advance List B index
Note that it doesn’t matter which list empties first. If List A empties first, then the condition at the beginning of the following while
loop will prevent trying to output something that isn’t there.
The algorithm above is a standard merge, which is an O(n) algorithm. If you study it for a few minutes, you’ll see that given two lists with lengths L1
and L2
the maximum number of item comparisons it will make is min(L1, L2)
. That’s a far cry from the (L1 + L2) * log2(L1 + L2)
that a sort will make in the average case.
The difference in the number of comparisons makes a huge difference. To see just how much of a difference, I wrote a little program that creates two lists and then merges them using each of the techniques. Here’s the code:
static void Main(string[] args)
{
const int numIterations = 100;
const int numItems = 1000 * 1000 * 100;
Console.WriteLine("Testing with two lists of {0:N0} items each.", numItems);
// Create two sorted lists
var l1 = CreateRandomList(numItems);
Array.Sort(l1);
var l2 = CreateRandomList(numItems);
Array.Sort(l2);
DoTest(() => MergeWithSort(l1, l2), numIterations, "Merge with sort");
DoTest(() => SimpleMerge(l1, l2), numIterations, "Simple merge");
Console.ReadLine();
}
static void DoTest(Action action, int iterations, string name)
{
Console.Write("Executing ({0}) {1:N0} times.", name, iterations);
var totalTime = TimeSpan.Zero;
var sw = new Stopwatch();
for (int i = 0; i < iterations; ++i)
{
sw.Restart();
action();
sw.Stop();
totalTime += sw.Elapsed;
Console.Write('.');
}
Console.WriteLine(" {0:N0} ms. Average {1:N2} ms",
totalTime.TotalMilliseconds, (double)totalTime.TotalMilliseconds/iterations);
}
static int[] MergeWithSort(int[] listA, int[] listB)
{
var result = new int[listA.Length + listB.Length];
Array.Copy(listA, result, listA.Length);
Array.Copy(listB, 0, result, listA.Length, listB.Length);
Array.Sort(result);
return result;
}
static int[] SimpleMerge(int[] listA, int[] listB)
{
var result = new int[listA.Length + listB.Length];
int ixListA = 0;
int ixListB = 0;
int ixResult = 0;
while (ixListA < listA.Length && ixListB < listB.Length)
{
if (listA[ixListA] <= listB[ixListB])
{
result[ixResult] = listA[ixListA];
++ixListA;
}
else
{
result[ixResult] = listB[ixListB];
++ixListB;
}
++ixResult;
}
while (ixListA < listA.Length)
{
result[ixResult] = listA[ixListA];
++ixListA;
}
while (ixListB < listB.Length)
{
result[ixResult] = listB[ixListB];
++ixListB;
}
return result;
}
private static readonly Random Rnd = new Random();
static int[] CreateRandomList(int count)
{
var result = new int[count];
for (int i = 0; i < count; ++i)
{
result[i] = Rnd.Next();
}
return result;
}
Running that with different list lengths shows how much faster the standard merge algorithm is in comparison to the combine-and-sort method, especially when the lists get very long. Here are the results:
Items | Sort | Standard merge |
---|---|---|
100 | 0.02 ms | 0.01 ms |
1,000 | 0.14 ms | 0.03 |
1,000,000 | 284.39 ms | 19.42 ms |
100,000,000 | 32.3 seconds | 1.92 seconds |
I think you can see where that’s going. It’s pretty clear that the standard merge is a whole lot faster when the lists get large. And that’s just sorting integers. As item comparisons get more expensive (sorting strings, for example), the sort gets even more expensive.
So the standard merge is faster. As I mentioned above, it can also require much less memory than the combine and sort method. Because the standard merge only has to compare the current items in each list, and once an item is output it’s never looked at again, the merge only has to hold two items in memory at any time. Granted, it needs a way to get the next item from each list, but even when you add it all up it’s still a constant amount of memory. You can merge two lists of a billion items each using exactly the same amount of memory that it takes to merge two lists of 10 items each.
The merge example I showed above merges arrays of integers, but it’s not difficult to extend it to merge IEnumerable<T> instances. With slightly more work, I can add LINQ extension methods for merging. Specifically, extension methods MergeWith
and MergeWithDescending
that merge sorted sequences in ascending or descending order, and support the ThenBy and ThenByDescending LINQ methods so that you can merge based on multiple keys in the same way that you can sort by multiple keys with OrderBy and ThenBy
.
The other thing I should mention is that, at least for now, I’m going to limit my discussion to merging two sorted sequences with the standard two-way merge algorithm shown above. There is another algorithm, called a k-way merge, that merges k sorted sequences in O(n log2 k) time, where n is the total number of items in all sequences.
The two-way merge, in fact, is just a special case of the k-way merge. That is, O(n log2 k), when k is equal to 2, works out to O(n), because log2(2) is equal to 1.
I’ll get around to discussing the k-way merge at some point.
Next time I’ll show a LINQ-ified version of the standard two-way merge, and then start looking at how to implement the MergeWith
and MergeWithDescending
extension methods.