Merging multiple sorted lists

The process of merging multiple (i.e. more than two) sorted sequences is a simple extension of merging just two sorted sequences. The two-way merge algorithm I posted last month included an optimization that I shouldn’t have introduced quite so soon. That optimization complicates the algorithm a bit. The simplest algorithm for merging two sorted lists looks like this:

    while ((not end of List A) and (not end of List B))
        if (end of List A)
            output List B current item
            advance List B index
        else if (end of List B)
            output List A current item
            advance List A index
        else if (List A current item <= List B current item)
            output List A current item
            advance List A index
        else
            output List B current item
            advance List B index

That does the same thing as what I posted before, but does it all in a single loop. My previous version is slightly more efficient because once it reaches the end of one list it just outputs the remainder of the other. Once List A is exhausted, there’s no need to keep checking it.

You could extend that algorithm for three, four, or more lists, but the logic gets complicated in a hurry. What we really need is a generalized k-way merge algorithm that can merge items from any number of sorted lists. That algorithm, simply stated, is:

    Initialize list indexes
    while any list has items
        Determine which list has the smallest current item
        Output the item from that list
        Increment that list's current index

One way to implement that would be with an array of lists, and a corresponding array of current index pointers. The first cut of such a thing might look something like this:

    // merges multiple integer arrays
    private static int[] MergeInts(List<int[]> arrays)
    {
        var result = new List<int>();
        // Initialize indexes
        var indexes = new int[arrays.Count];
        for (int i = 0; i < arrays.Count; ++i)
        {
            indexes[i] = 0;
        }

        // Function returns true if we've reached the end of all the arrays
        var endOfAllArrays = new Func<bool>(() =>
        {
            for (int i = 0; i < indexes.Length; ++i)
            {
                if (indexes[i] < arrays[i].Length)
                    return false;
            }
            return true;
        });

        while (!endOfAllArrays())
        {
            // Find the array with the smallest current item
            int smallestItem = int.MaxValue;
            int smallestArray = 0;
            for (int i = 0; i < indexes.Length; ++i)
            {
                if (indexes[i] < arrays[i].Length)
                {
                    if (arrays[i][indexes[i]] <= smallestItem)
                    {
                        smallestArray = i;
                        smallestItem = arrays[i][indexes[i]];
                    }
                }
            }

            // output that item
            result.Add(smallestItem);

            // advance the index on that array
            ++indexes[smallestArray];
        }
        return result.ToArray();
    }

That code implements the basic algorithm exactly. It’s not especially pretty, but it gets the job done. It’s also horribly inefficient because for every time through the loop it looks at all of the current items to determine which is the smallest. The complexity of this particular implementation is O(n*k), where n is the total number of items in all lists, and k is the number of lists. If you’re merging 10 lists that combined contain n items, it’s going to take twice as long to merge than if you have the same items spread out over just 5 lists.

What we need is a faster way to find the list that has the smallest current item. I mentioned in my first article (linked above) that the k-way merge is can be done in O(n log2 k) time complexity. That makes a big difference, even for small values of k. The base 2 logarithm of 4, for example, is 2, meaning that if we can improve the selection algorithm, we can merge four lists in half the time.

What data structure do we know that can return the smallest item in O(log2 k) time? The heap, maybe? It’s a surprisingly useful data structure.

Next time we’ll look at replacing the array of arrays in the implementation above with a heap. That will reduce the time required to select the smallest item, and it will also eliminate the time required to determine if we’re at the end of the list.