Merging sequences with LINQ

Last time, I introduced the concept of merging two sorted sequences and showed that the standard two-way merge is superior to the naive combine-and-sort method. The code I presented in that post was specific to sorting two arrays of integers. It’s useful as a demonstration, but in most situations the lists I want to merge contain more than just integers. It’s time to make that merge more generic by providing a LINQ implementation.

Making a LINQ version of the standard two-way merge is straightforward. The algorithm is obviously the same; I just have to modify the code that I presented last time so that it works enumerators rather than with array indexes, and modify the comparison logic so that it calls an item comparison function rather than relying on the comparison operators. A direct translation of the integer merge code, extended to support any type, looks like this:

    public static IEnumerable<T> Merge(IEnumerable<T> list1, IEnumerable<T> list2)
    {
        // Initialize everything
        var comp = Comparer.Default;
        var i1 = list1.GetEnumerator();
        var i2 = list2.GetEnumerator();
        var i1HasItems = i1.MoveNext();
        var i2HasItems = i2.MoveNext();

        // While there are items in both lists,
        // compare and pick the smallest.
        while (i1HasItems && i2HasItems)
        {
            if (comp.Compare(i1.Current, i2.Current) <= 0)
            {
                yield return i1.Current;
                i1HasItems = i1.MoveNext();
            }
            else
            {
                yield return i2.Current;
                i2HasItems = i2.MoveNext();
            }
        }

        // At this point, one of the lists has been exhausted.
        // The other still has items in it.
        // We don't know which, so just make sure both are exhausted.
        while (i1HasItems)
        {
            yield return i1.Current;
            i1HasItems = i1.MoveNext();
        }
        while (i2HasItems)
        {
            yield return i2.Current;
            i2HasItems = i2.MoveNext();
        }
    }

On the surface, that’s not a whole lot different than the integer array merge I presented last time. There are, however, significant differences in how it actually behaves, all of which have to do with deferred execution.

The previous code requires that you pass it fully-realized arrays, which it then merges into another array and returns a reference to that new array. The LINQ version above accepts two IEnumerable<T> references, which it merges and returns items one at a time with the yield return statements. The difference in behavior is quite striking.

To illustrate the difference, imagine you have two files, each of which contains a sorted list of integers: one per line. If you wanted to merge those files into a single output file using the array merge code, you would write code that does this:

    Load the first file into Array1
    Load the second file into Array2
    Result = Merge(Array1, Array2)
    Save Result to file

The important part to note here is that you will need enough memory to hold all of Array1, all of Array2, and all of the Result array simultaneously. If the files contain a billion items each, that works out to 16 gigabytes of memory: four gigabytes for each of Array1 and Array2, and eight gigabytes for the merged Result array.

For the LINQ version, you can structure things so that you need very little memory, regardless of how large the files are. Consider this code:

    public static void MergeTwoFiles(string file1, string file2, string resultFile)
    {
        var f1 = File.ReadLines(file1).Select(int.Parse);
        var f2 = File.ReadLines(file2).Select(int.Parse);
        var result = Merge(f1, f2).Select(i => i.ToString());
        File.WriteAllLines(resultFile, result);
    }

That looks like it does everything that I described in the pseudo code above, but there’s a big difference. The first line creates an enumerator that can iterate over the file line-by-line. But it doesn’t actually read the file until you call MoveNext. Same with the second line of code. Even the third line of code doesn’t actually perform any processing.

The fourth line of code, the call to File.WriteAllLines starts the ball rolling. It essentially does this:

    using (var writer = new StreamWriter(resultFile))
    {
        foreach (var r in result)
        {
            Console.WriteLine(r);
        }
    }

That foreach is where all the magic happens. The first time through, it calls MoveNext on the result enumerator, and that causes the initialization code in the Merge method to execute. That code in turn calls MoveNext on the f1 and f2 enumerators, which ends up loading the first number from each file. Then, the first iteration of the Merge loop is executed. When the code gets to a yield return, the value r is returned to the loop shown above, and that value is written to the file.

At this point, only three values have been read from the files: the first value from each of f1 and f2, and the second value from whichever file contained the lowest value.

The code continues in this way until both files are exhausted. At no time are there more than three values from the files in memory.

Another way to look at it is to imagine that there are three workers: Joe, Bob, and Fred. Joe and Bob each have a list of numbers that they can give to you one at a time, and Fred will write down whatever number you give him. Your merge works like this:

    Get first item from Joe
    Get first item from Bob
    while Joe and Bob both have items
        if Joe's number is bigger
            tell Fred to write Joe's number
            get the next number from Joe
        else
            tell Fred to write Bob's number
            get the next number from Bob

For brevity, I’ve skipped the code at the end of the loop that makes sure that Joe and Bob have each given you all of their items.

The key here is that you don’t really care where or how Joe and Bob are getting their numbers. They could have a big list of numbers written down on paper, or they could be pulling the numbers out of thin air as you ask for them. And Fred doesn’t need all of the numbers at once. He just needs to write down each number as you give it to him.

Because I often work with huge data sets that won’t fit into memory, I make good use of LINQ’s deferred execution, this being one particularly good example.

The Merge method above that uses IEnumerable<T> is a reasonably useful method, but it doesn’t go quite far enough. In particular, it doesn’t let me specify the comparison method, nor does it let me define a key selector. I need it to work like the OrderBy method, where I can define multiple ordering keys and key selectors. That takes a bit more work. Fortunately, I know how to do most of it because I had to implement much the same thing when I created my TopBy method earlier this year. Implementing IOrderedEnumerable for my MergeBy method should be much simpler.

We’ll dig into it next time.

Merging sorted sequences

A fairly common operation in some data processing applications involves merging two or more sorted sequences. In my previous work I frequently had to merge lists that contained tens or hundreds of millions of items each. When you’re working with lists that large, the choice of merge algorithm can make the difference between a few seconds’ work and an overnight job.

The naive way to merge two lists into a third is to concatenate and sort. That is, given these two lists:

    List A = [1, 5, 6, 9]
    List B = [2, 3, 6, 8]

You’d first create the list [1, 5, 6, 9, 2, 3, 6, 8], and then sort the sequence to produce [1, 2, 3, 5, 6, 6, 8, 9].

That technique works well enough, but it’s expensive. The best sorting algorithms take time proportional to n log2(n), where n is the number of items in the list. That’s no big deal when you’re working with reasonably short lists, but sorting a few hundred million items takes significant time. Another problem is that the sort is necessarily an in-memory operation, or a very complicated disk-based merge sort which is even more expensive.

You can do much better by taking advantage of the existing ordering. We know that both lists are already sorted, so you know that the first item in the combined list will be the smallest of the first items in each of the lists. In the lists above, List A starts with 1, and List B starts with 2. You know that the first item in the new list has to be 1.

If you remove 1 from List A and repeat the process, you’re comparing 5 with 2. Obviously, you output 2 from List B and remove it from the list. Then compare 5 with 3, etc. You do this until one of the lists is exhausted, and then you output all the items from the other list. In pseudo code, it looks like this:


    while (not end of List A and not end of List B)
        if (List A current item <= List B current item)
            output List A current item
            advance List A index
        else
            output List B current item
            advance List B index

    // At this point, one of the lists is empty.
    // Output remaining items from the other
    while (not end of List A)
        output List A current item
        advance List A index

    while (not end of List B)
        output List B current item
        advance List B index

Note that it doesn’t matter which list empties first. If List A empties first, then the condition at the beginning of the following while loop will prevent trying to output something that isn’t there.

The algorithm above is a standard merge, which is an O(n) algorithm. If you study it for a few minutes, you’ll see that given two lists with lengths L1 and L2 the maximum number of item comparisons it will make is min(L1, L2). That’s a far cry from the (L1 + L2) * log2(L1 + L2) that a sort will make in the average case.

The difference in the number of comparisons makes a huge difference. To see just how much of a difference, I wrote a little program that creates two lists and then merges them using each of the techniques. Here’s the code:

    static void Main(string[] args)
    {
        const int numIterations = 100;
        const int numItems = 1000 * 1000 * 100;
        Console.WriteLine("Testing with two lists of {0:N0} items each.", numItems);

        // Create two sorted lists
        var l1 = CreateRandomList(numItems);
        Array.Sort(l1);
        var l2 = CreateRandomList(numItems);
        Array.Sort(l2);
  
        DoTest(() => MergeWithSort(l1, l2), numIterations, "Merge with sort");
        DoTest(() => SimpleMerge(l1, l2), numIterations, "Simple merge");
        Console.ReadLine();
    }

    static void DoTest(Action action, int iterations, string name)
    {
        Console.Write("Executing ({0}) {1:N0} times.", name, iterations);
        var totalTime = TimeSpan.Zero;
        var sw = new Stopwatch();
        for (int i = 0; i < iterations; ++i)
        {
            sw.Restart();
            action();
            sw.Stop();
            totalTime += sw.Elapsed;
            Console.Write('.');
        }
        Console.WriteLine(" {0:N0} ms. Average {1:N2} ms", 
            totalTime.TotalMilliseconds, (double)totalTime.TotalMilliseconds/iterations);
    }

    static int[] MergeWithSort(int[] listA, int[] listB)
    {
        var result = new int[listA.Length + listB.Length];
        Array.Copy(listA, result, listA.Length);
        Array.Copy(listB, 0, result, listA.Length, listB.Length);
        Array.Sort(result);
        return result;
    }

    static int[] SimpleMerge(int[] listA, int[] listB)
    {
        var result = new int[listA.Length + listB.Length];
        int ixListA = 0;
        int ixListB = 0;
        int ixResult = 0;
        while (ixListA < listA.Length && ixListB < listB.Length)
        {
            if (listA[ixListA] <= listB[ixListB])
            {
                result[ixResult] = listA[ixListA];
                ++ixListA;
            }
            else
            {
                result[ixResult] = listB[ixListB];
                ++ixListB;
            }
            ++ixResult;
        }

        while (ixListA < listA.Length)
        {
            result[ixResult] = listA[ixListA];
            ++ixListA;
        }

        while (ixListB < listB.Length)
        {
            result[ixResult] = listB[ixListB];
            ++ixListB;
        }

        return result;
    }

    private static readonly Random Rnd = new Random();
    static int[] CreateRandomList(int count)
    {
        var result = new int[count];
        for (int i = 0; i < count; ++i)
        {
            result[i] = Rnd.Next();
        }
        return result;
    }

Running that with different list lengths shows how much faster the standard merge algorithm is in comparison to the combine-and-sort method, especially when the lists get very long. Here are the results:

Items Sort Standard merge
100 0.02 ms 0.01 ms
1,000 0.14 ms 0.03
1,000,000 284.39 ms 19.42 ms
100,000,000 32.3 seconds 1.92 seconds

I think you can see where that’s going. It’s pretty clear that the standard merge is a whole lot faster when the lists get large. And that’s just sorting integers. As item comparisons get more expensive (sorting strings, for example), the sort gets even more expensive.

So the standard merge is faster. As I mentioned above, it can also require much less memory than the combine and sort method. Because the standard merge only has to compare the current items in each list, and once an item is output it’s never looked at again, the merge only has to hold two items in memory at any time. Granted, it needs a way to get the next item from each list, but even when you add it all up it’s still a constant amount of memory. You can merge two lists of a billion items each using exactly the same amount of memory that it takes to merge two lists of 10 items each.

The merge example I showed above merges arrays of integers, but it’s not difficult to extend it to merge IEnumerable<T> instances. With slightly more work, I can add LINQ extension methods for merging. Specifically, extension methods MergeWith and MergeWithDescending that merge sorted sequences in ascending or descending order, and support the ThenBy and ThenByDescending LINQ methods so that you can merge based on multiple keys in the same way that you can sort by multiple keys with OrderBy and ThenBy.

The other thing I should mention is that, at least for now, I’m going to limit my discussion to merging two sorted sequences with the standard two-way merge algorithm shown above. There is another algorithm, called a k-way merge, that merges k sorted sequences in O(n log2 k) time, where n is the total number of items in all sequences.

The two-way merge, in fact, is just a special case of the k-way merge. That is, O(n log2 k), when k is equal to 2, works out to O(n), because log2(2) is equal to 1.

I’ll get around to discussing the k-way merge at some point.

Next time I’ll show a LINQ-ified version of the standard two-way merge, and then start looking at how to implement the MergeWith and MergeWithDescending extension methods.

Solving the right problem

As programmers, we sometimes lose sight of the problem we’re supposed to solve because we’re focused on the problem we want to solve. Or perhaps on the problem we think we need to solve. Today’s lesson is a case in point.

Imagine you have a very large collection (many thousands) of records that contain individuals’ names and their skills. The data consists of a name followed by a comma-separated list of skills. Something like:

Jim Mischel, programming, wood carving, brewing, curmudgeon

The problem is that some of the skills are misspelled or abbreviated: “mgmt” for “Management,” for example, or “hadup” instead of “Hadoop,” etc.

Your job is to go through the records and fix the misspellings.

Seems easy enough, right? Just go through the skills, look up each word in a dictionary, and correct the entry if the word isn’t in the dictionary. But there are two problems:

  1. A lot of common terms used in describing skills are domain specific and won’t be in any dictionary.
  2. Even if there were such a dictionary, how would you determine that “mgmt” is supposed to be “Management,” or that “hadup” should be “Hadoop?”

It’s at this point that we lose sight of the real problem we’re trying to solve. Being programmers, we start thinking about figuring out edit distances, machine learning, support vector machines, etc. In no time at all we’re shaving a yak.

Writing a computer program that will detect and correct misspellings like this is hard. I’m talking Ph.D. research project hard. I guarantee that the person who asked you to solve the problem isn’t willing to let you embark on a multi-year research project.

Fortunately, you don’t have to. You can solve it in a day or two with two simple programs and a little bit of manual effort.

First, write a program that builds a dictionary of all the terms used, along with how often they’re used. You’ll end up with a very large list that looks something like:

    Management,200
    Mgmt,3
    Hadoop,10
    Hadup,1
    Mamagement,1
    ...

If you make the assumption that if a term appears very often, it’s probably spelled correctly, then all you’re really concerned about are those terms that occur infrequently. You can decide on a threshold of, say, three or five, and have your program output only those terms. You’ll probably end up with a list of a few hundred infrequently used terms.

That’s where the manual work comes in. You have to look at each word, determine if it’s a misspelling, and if so find out what the correct spelling is. You manually create a file that contains the misspelled term along with the correct term. For example:

    mgmt, Management
    mamagement, Management
    hadup, Hadoop
    wood craving, wood carving
    etc.

The second program is even easier. All it does is load the replacements file into a dictionary with the key being the misspelled term, and then reads each name record. For each name record, the program checks each term against the dictionary. If it finds a misspelling that’s in the dictionary, it writes the replacement. Otherwise, it writes the original text. The algorithm looks like this:

    dict = LoadDictionary();
    create output file
    for each record in file
        output name
        for each skill in record.skills
            if skill in dict
                output replacement text
            else
                output original skill text

This approach won’t be perfect, but it will be very good. You’ll have to tinker with the threshold value a bit to get the right balance between detecting real misspellings and “false positives”: terms that are used infrequently but aren’t actual misspellings.

You’ll also have to accept that common misspellings will not be detected. Fortunately, common misspellings will be … common. So you can add them to the replacements dictionary as they’re detected, and subsequent passes over the file will catch them. You could do some work with edit distance algorithms to catch the more common misspellings, but it would be a huge amount of work for relatively little gain.

I’ll admit that it’s somewhat dissatisfying having to manually create the replacements file, but it’s hard to argue with the result. Especially I can get somebody else to do that work. Then all I have to do is create the list of terms and their counts, and write the program that will make the replacements once the manual work is done.

It’s not sexy, but it solves the problem quickly and effectively.

The Toothpaste Rant

I’m cranky this morning. Like many people, I have kind of a ritual I go through when I get out of bed. The first thing I do is wander, bleary-eyed, into the bathroom to do my business. Then I stop by the sink and brush my teeth. The day just doesn’t seem right if I can’t brush my teeth first thing.

So why am I cranky? The toothpaste tube was empty, and I grabbed one of those little sample tubes that come in the mail from time to time. Crest wanting me to try their new flavor, I guess. Heck, it’s toothpaste, right? I squeezed a bit out onto my brush, put the thing in my mouth and … nearly vomited.

High on the list of things that you just shouldn’t mess with is toothpaste. Long before I started brushing my teeth, civilization figured out that the proper flavor for toothpaste was mint. It doesn’t really matter what kind of mint, as long as it has that … well, that minty flavor. And consistency matters, too. Toothpaste is done.

But can they leave well enough alone? No! Some idiot at the Crest toothpaste company decided that they needed a cinnamon flavor. To compete with cinnamon Altoids, perhaps? Who knows what goes through their minds? I know it’s hard to believe that they even thought of making a non-mint toothpaste, but for sure their taste testers should have nixed this crap before it ever got out of the lab. The stuff tastes like bathtub caulk mixed with just a hint of fake cinnamon–maybe from one of those Atomic Fireball candies. And it left a sour aftertaste in the back of my mouth. To top it all off, it has the consistency of bathtub caulk, too. Toothpaste is supposed to be a little moist–almost runny. It’s not supposed to suck all the moisture off your tongue when you put it in your mouth.

Fortunately, I was able to get rid of that taste by gargling with Listerine, and I brushed my teeth with Scope, if you can believe it. So I’m not as cranky as I could be. But it was a close thing.

Fair warning: check the flavor on that toothpaste tube before you squeeze some out onto your brush. You do not want the kind of rude surprise that I got this morning.

And to the people who make toothpaste: Stop messing with it! Just make sure that it cleans my teeth and leaves my mouth feeling minty fresh. Some things are perfect just the way they’ve been for a hundred years, thank you very much. Spend your time on real problems, like figuring out how to make a floss that doesn’t get stuck between my teeth, or maybe a “scrubbing bubbles” mouthwash that I can swish around instead of having to use the dang brush. But make sure that it’s a mint flavor, okay?

Stop the style bashing

I might have mentioned before that I try to spend some time every day answering questions on StackOverflow. I learn a lot about the common issues that other programmers face, and I’ve picked up quite a few tips from reading answers supplied by others. I also learn quite a bit when I answer questions; sometimes providing a complete useful answer requires a lot of research.

Note that the discussion below is specific to C#. I’m not familiar with the implementation of the bool data type in C++ or boolean in Java, and C didn’t have a separate Boolean type.

I also learn about others’ misconceptions. For example, whenever somebody posts code that compares a Boolean value to true or false, at least one person will say something like “Never write == true or == false.” One person said, “Whenever you write == true or == false, God kills a kitten.”.

Those types of comments are pretty annoying. It’s just a style thing. We all know that these two conditionals are equivalent:

    if (timer1.Enabled == true) { ... }
    if (timer1.Enabled) { ... }  // the "== true" is implied

Complaining about the code containing == true is like complaining about using braces to enclose a single statement, or the extraneous parentheses in x = x + (y/z);, or getting mad because the person didn’t write x += y/z;. It’s a style thing! If you don’t like it, just move on.

Until recently, I honestly thought that’s all it was: immature griping about a minor style issue. But recent comments indicate that several people think there’s some difference in the generated code. That is, they think that if (timer1.Enabled == true) requires an extra operation to compare the values! That’s not true at all, as you can clearly demonstrate by compiling a simple program and looking at the generated code.

    System.Timers.Timer myTimer = new Timer();
    if (timer1.Enabled) Console.WriteLine("Enabled!");
    if (timer1.Enabled == true) Console.WriteLine("Enabled");

Compiling that in Release mode with Visual Studio 2013 results in the following intermediate code. I’ve added a few comments.

  // start of first conditional: if (timer1.Enabled) ...
  IL_0000:  ldarg.0
  IL_0001:  ldfld      class [System]System.Timers.Timer sotesto.Program::myTimer
  IL_0006:  callvirt   instance bool [System]System.Timers.Timer::get_Enabled()
  IL_000b:  brfalse.s  IL_0017
  IL_000d:  ldstr      "Enabled!"
  IL_0012:  call       void [mscorlib]System.Console::WriteLine(string)
  // start of second conditional if (timer1.Enabled == true)
  IL_0017:  ldarg.0
  IL_0018:  ldfld      class [System]System.Timers.Timer sotesto.Program::myTimer
  IL_001d:  callvirt   instance bool [System]System.Timers.Timer::get_Enabled()
  IL_0022:  brfalse.s  IL_002e
  IL_0024:  ldstr      "Enabled!"
  IL_0029:  call       void [mscorlib]System.Console::WriteLine(string)

Those two bits of code are identical. Each one calls the get accessor for the myTimer.Enabled property, and then branches around the WriteLine if the value is false.

Note that this is intermediate code. The JIT compiler might be able to inline the get_Elapsed method, which would eliminate the method call. Regardless, the code for these two snippets is identical. So stop with the == true bashing, already.

Free the Scratch!

The other day I overheard two women talking about baking cakes. I wasn’t eavesdropping; they were sitting at the table next to mine. Nor was I paying much attention to the conversation, engrossed as I was in the technical manual that so often serves as my lunch partner. But then one of the women said: “You bake from scratch?” That totally broke my concentration. It was time to go anyway.

As I walked back to the office, I got to wondering what this magical “scratch” stuff is that people make things from. I often hear people say that they bake from scratch, but that’s not all it’s used for. I hear programmers and builders say they’re going to “start over from scratch.” In almost every field I’ve examined, there’s somebody making things from scratch.

It took me a while to track this down, but what I discovered flies in the face of all the science we’ve been spoon fed over the years. Scratch builders have learned how to transmute matter and make wondrous things from seemingly nothing. Science teaches us that you can’t turn lead into gold–that you can’t change the fundamental properties of matter. I haven’t seen a scratch builder turn lead into gold, but I wouldn’t put it past them. Why would they want to, anyway, when they can take beans from a plant, add some butter and sugar, and come up with chocolate? Gold has nothing on chocolate, right?

Scratch is, despite what your science teachers will try to tell you, the basic building block of the universe. Forget chemistry and quantum physics with all their arcane formulas, quirks, quarks, electron orbitals, and difficult math. Those fields of study are just red herrings intended to divert your attention from where the real action is. You won’t see “Scratch Studies” at even the best universities because the powers that be don’t want the word to get out. They don’t want just anybody knowing how to transform scratch into all the wondrous things that we see every day.

The Scratch Police let amateurs dabble in the field–baking cakes or building clunky radios from spare parts. They do this because it turns out that the principles of scratch are quite simple and they can’t stop people from discovering them. But they keep a close eye to ensure that a baker, for example, doesn’t figure out that he can use the same principles of scratch to make a CD player, a car, or anything else. If that knowledge got out, the economy would crash like nothing you’ve ever seen before.

But it’s time the word got out. We can no longer be held hostage by the Scratch Police and the small group of people who hold so tightly the knowledge of scratch. My life is forfeit, I know. The Scratch Police will come track me down when they see this blog, and I doubt that Charlie will be able to protect me from them. If you see this blog, please copy and email it to everybody on your address list. Help spread the word and free us from the tyranny of those who would keep this fundamental knowledge to themselves.

Free the Scratch!

Categories

A sample text widget

Etiam pulvinar consectetur dolor sed malesuada. Ut convallis euismod dolor nec pretium. Nunc ut tristique massa.

Nam sodales mi vitae dolor ullamcorper et vulputate enim accumsan. Morbi orci magna, tincidunt vitae molestie nec, molestie at mi. Nulla nulla lorem, suscipit in posuere in, interdum non magna.