Merging sequences with LINQ

Last time, I introduced the concept of merging two sorted sequences and showed that the standard two-way merge is superior to the naive combine-and-sort method. The code I presented in that post was specific to sorting two arrays of integers. It’s useful as a demonstration, but in most situations the lists I want to merge contain more than just integers. It’s time to make that merge more generic by providing a LINQ implementation.

Making a LINQ version of the standard two-way merge is straightforward. The algorithm is obviously the same; I just have to modify the code that I presented last time so that it works enumerators rather than with array indexes, and modify the comparison logic so that it calls an item comparison function rather than relying on the comparison operators. A direct translation of the integer merge code, extended to support any type, looks like this:

    public static IEnumerable<T> Merge(IEnumerable<T> list1, IEnumerable<T> list2)
    {
        // Initialize everything
        var comp = Comparer.Default;
        var i1 = list1.GetEnumerator();
        var i2 = list2.GetEnumerator();
        var i1HasItems = i1.MoveNext();
        var i2HasItems = i2.MoveNext();

        // While there are items in both lists,
        // compare and pick the smallest.
        while (i1HasItems && i2HasItems)
        {
            if (comp.Compare(i1.Current, i2.Current) <= 0)
            {
                yield return i1.Current;
                i1HasItems = i1.MoveNext();
            }
            else
            {
                yield return i2.Current;
                i2HasItems = i2.MoveNext();
            }
        }

        // At this point, one of the lists has been exhausted.
        // The other still has items in it.
        // We don't know which, so just make sure both are exhausted.
        while (i1HasItems)
        {
            yield return i1.Current;
            i1HasItems = i1.MoveNext();
        }
        while (i2HasItems)
        {
            yield return i2.Current;
            i2HasItems = i2.MoveNext();
        }
    }

On the surface, that’s not a whole lot different than the integer array merge I presented last time. There are, however, significant differences in how it actually behaves, all of which have to do with deferred execution.

The previous code requires that you pass it fully-realized arrays, which it then merges into another array and returns a reference to that new array. The LINQ version above accepts two IEnumerable<T> references, which it merges and returns items one at a time with the yield return statements. The difference in behavior is quite striking.

To illustrate the difference, imagine you have two files, each of which contains a sorted list of integers: one per line. If you wanted to merge those files into a single output file using the array merge code, you would write code that does this:

    Load the first file into Array1
    Load the second file into Array2
    Result = Merge(Array1, Array2)
    Save Result to file

The important part to note here is that you will need enough memory to hold all of Array1, all of Array2, and all of the Result array simultaneously. If the files contain a billion items each, that works out to 16 gigabytes of memory: four gigabytes for each of Array1 and Array2, and eight gigabytes for the merged Result array.

For the LINQ version, you can structure things so that you need very little memory, regardless of how large the files are. Consider this code:

    public static void MergeTwoFiles(string file1, string file2, string resultFile)
    {
        var f1 = File.ReadLines(file1).Select(int.Parse);
        var f2 = File.ReadLines(file2).Select(int.Parse);
        var result = Merge(f1, f2).Select(i => i.ToString());
        File.WriteAllLines(resultFile, result);
    }

That looks like it does everything that I described in the pseudo code above, but there’s a big difference. The first line creates an enumerator that can iterate over the file line-by-line. But it doesn’t actually read the file until you call MoveNext. Same with the second line of code. Even the third line of code doesn’t actually perform any processing.

The fourth line of code, the call to File.WriteAllLines starts the ball rolling. It essentially does this:

    using (var writer = new StreamWriter(resultFile))
    {
        foreach (var r in result)
        {
            Console.WriteLine(r);
        }
    }

That foreach is where all the magic happens. The first time through, it calls MoveNext on the result enumerator, and that causes the initialization code in the Merge method to execute. That code in turn calls MoveNext on the f1 and f2 enumerators, which ends up loading the first number from each file. Then, the first iteration of the Merge loop is executed. When the code gets to a yield return, the value r is returned to the loop shown above, and that value is written to the file.

At this point, only three values have been read from the files: the first value from each of f1 and f2, and the second value from whichever file contained the lowest value.

The code continues in this way until both files are exhausted. At no time are there more than three values from the files in memory. (This isn’t strictly true because the StreamWriter and the file system are probably doing some buffering. Nevertheless, client code has to maintain only three references to values at any particular time.)

Another way to look at it is to imagine that there are three workers: Joe, Bob, and Fred. Joe and Bob each have a list of numbers that they can give to you one at a time, and Fred will write down whatever number you give him. Your merge works like this:

    Get first item from Joe
    Get first item from Bob
    while Joe and Bob both have items
        if Joe's number is bigger
            tell Fred to write Joe's number
            get the next number from Joe
        else
            tell Fred to write Bob's number
            get the next number from Bob

For brevity, I’ve skipped the code at the end of the loop that makes sure that Joe and Bob have each given you all of their items.

The key here is that you don’t really care where or how Joe and Bob are getting their numbers. They could have a big list of numbers written down on paper, or they could be pulling the numbers out of thin air as you ask for them. And Fred doesn’t need all of the numbers at once. He just needs to write down each number as you give it to him.

Because I often work with huge data sets that won’t fit into memory, I make good use of LINQ’s deferred execution, this being one particularly good example.

The Merge method above that uses IEnumerable<T> is a reasonably useful method, but it doesn’t go quite far enough. In particular, it doesn’t let me specify the comparison method, nor does it let me define a key selector. I need it to work like the OrderBy method, where I can define multiple ordering keys and key selectors. That takes a bit more work. Fortunately, I know how to do most of it because I had to implement much the same thing when I created my TopBy method earlier this year. Implementing IOrderedEnumerable for my MergeBy method should be much simpler.

We’ll dig into it next time.