A closer look at pairwise merging

When I started writing the performance test code, I got sidetracked with yet another way to merge multiple sorted lists. While I was analyzing it I realized that the pairwise merge might have a problem, as well. It’s worth talking about here because knowing of the problem’s existence will help me create better test data when I finally get around to performance testing.

To review, assuming four lists named A, B, C, and D, the pairwise merge works like this:

    Merge A and B to produce list AB
    Merge C and D to produce list CD
    Merge AB and CD to produce list ABCD

You can achieve the same result without a queue:

    Merge A and B to produce AB
    Merge AB and C to produce ABC
    Merge ABC and D to produce ABCD

Both approaches require the same number of merges, but not all merges are equal.

We can estimate the speed of these algorithms by figuring out how many items they process. We already know that merging two lists that have among them n items requires processing n items. So if we sum the number of items for each two-way merge, we have the total number of items processed for the entire operation.

If we assume that each of the lists has 10 items, then the first example requires:

20 items to merge A and B. The resulting list AB has 20 items.
20 items to merge C and D. The resulting list CD has 20 items.
40 items to merge AB and CD.
80 total items processed.

The second example requires:

20 items to merge A and B. The resulting list AB has 20 items.
30 items to merge AB and C. The resulting list ABC has 30 items.
40 items to merge ABC and D.
90 total items processed.

In this case, the queue-based algorithm is faster.

But not in all cases. Consider, for example, what happens when the lists’ lengths are:

    A = 1
    B = 1
    C = 1
    D = 100

The queue merge will look at two items to make AB, 101 items to make CD, and 103 items to make ABCD. That’s a total of 206 items processed. The other algorithm will look at two items to make AB, three items to make ABC, and 103 items to make ABCD, for a total of 108 items processed. If you switch the order, though, so that list D comes first, the queue merge will still process 206 items, but the other version will have to look at 101 + 102 + 103, or 306 items.

The order of merges matters. A lot. Change that 100 to 1,000,000, and you start to get the idea.

If we know the lengths of the lists to be merged, we could potentially order the pairwise merges to take advantage of that information. The idea would be to order the merges so that we create intermediate lists of roughly the same length. It wouldn’t have to be a perfect solution in order to realize a huge performance increase. A simple priority queue ordered by increasing list length would let us merge the shorter lists first, which gives a reasonably good solution to the problem.
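
To make the idea concrete, here’s a minimal sketch. It assumes we know each list’s length up front (which is exactly the assumption that fails below), and it assumes a priority queue along the lines of .NET’s PriorityQueue<TElement, TPriority>. Merge is the two-way merge used throughout this series.

    public static IEnumerable<T> MergeShortestFirst<T>(
        IEnumerable<IReadOnlyCollection<T>> lists)
    {
        // The priority is the number of items, so Dequeue always
        // returns the shortest remaining list.
        var queue = new PriorityQueue<(IEnumerable<T> Items, int Count), int>();
        foreach (var list in lists)
            queue.Enqueue((list, list.Count), list.Count);
        if (queue.Count == 0) return Enumerable.Empty<T>();

        while (queue.Count > 1)
        {
            var a = queue.Dequeue();
            var b = queue.Dequeue();
            // The merged sequence is deferred, but its length is just the
            // sum of its inputs' lengths, so we can track it without counting.
            int count = a.Count + b.Count;
            queue.Enqueue((Merge(a.Items, b.Items), count), count);
        }
        return queue.Dequeue().Items;
    }

Always merging the two shortest lists is the same trick used to build Huffman coding trees. For the 1, 1, 1, 100 example above, it produces the 108-item schedule rather than the 206-item one.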

The only drawback–and it’s a deal killer–is that we don’t know how long the lists are. All of the merge code I’ve been working with assumes that the lists are of indeterminate length.

The result is that worst cases can happen. Order the pairwise merges badly, as in the second example above, and the running time isn’t O(n log2 k), but rather O(n*k). The queue-based version limits the damage, because its merge tree is never more than log2 k levels deep, but with wildly uneven list lengths it can still process several times as many items as a well-ordered merge would. Ouch.

In contrast, the heap-based merge always takes O(n log2 k) time. The reason is that every item is inserted into the heap once, and every item is removed from the heap once, regardless of the individual lists’ length.

Asymptotic analysis, then, tells us that the heap-based merge is better because it’s always O(n log2 k), whereas a badly ordered pairwise merge can be O(n*k). Actual running time is a different story. If maintaining the heap is very expensive in comparison to maintaining the enumerators, then the pairwise merge can still outperform the heap-based merge.

The pairwise merge looks like it should be faster in the average case, and even in pathological cases with relatively few items. I suspect that it will be slower than the heap-based merge with many large, unevenly-sized lists.

Don’t take my word for it, though. I’ve been wrong often enough to distrust my gut feelings when it comes to performance. The only way to know for sure which is faster is to measure it.

A different way to merge multiple lists

In my previous post, I said that this time I would show full-featured MergeBy and MergeByDescending methods. Before I do that, I want to show an alternate method for merging multiple sorted lists.

If you recall, I said that the generalized k-way merge algorithm is:

    Initialize list indexes
    while any list has items
        Determine which list has the smallest current item
        Output the item from that list
        Increment that list's current index

As I illustrated last time, that does indeed work. But there is another way to do it that is, asymptotically at least, just as fast. That is, it operates in O(n log2 k) time.

Imagine you have four sorted lists, named A, B, C, and D. Another way to merge them is this way:

    Merge A and B to create a list called AB.
    Merge C and D to create a list called CD.
    Merge AB and CD to create the final list.

To estimate the time complexity, let’s assume that each of the four lists contains the same number of items. So if the total number of items is n, then each list’s length is n/4.

Merging A and B, then, takes time proportional to n/4 + n/4, or n/2. Merging C and D also requires n/2 time. The final merge of AB and CD takes n/2 + n/2. So the total amount of time it takes to merge the four lists is 2n.

Time complexity of the k-way merge is, as I said, O(n log2 k). With four lists, that’s n * log2(4). log2(4) = 2, so the time to merge four lists with a total of n items is n * 2, or 2n. The pairwise approach matches that bound in general: each round of merges processes each item at most once, and it takes about log2 k rounds to get down to a single list.

If there is an odd number of lists, then one of the original lists gets merged with one of the larger, already merged, lists. For example, with five lists the task becomes:

    Merge A and B to create a list called AB.
    Merge C and D to create a list called CD.
    Merge AB and E to create a list called ABE.
    Merge ABE and CD to create the final list.

To do that with an arbitrary number of lists, we create a queue that initially contains the original lists. We then remove the first two lists from the queue, merge them, and add the result back to the queue. Then take the next two, merge them, and add the result to the queue. We continue doing that until there is only one list left on the queue. That is:

    Initialize queue with all of the lists.
    while queue has more than one item
        l1 = queue.Dequeue()
        l2 = queue.Dequeue()
        rslt = Merge(l1, l2)
        queue.Enqueue(rslt)
    merged = queue.Dequeue()

At first, it seems like that method will require a lot of extra space to hold the temporary lists. Remember, the merge algorithms I’ve shown so far don’t require much in the way of extra storage: just a little memory to hold the smallest value in each of the lists. That is, O(k) extra space. Perhaps not surprisingly, you can keep that guarantee with LINQ: deferred execution means the intermediate lists are never actually materialized. Here’s the code.

    public static IEnumerable<T> MergeVersionTwo<T>(IEnumerable<IEnumerable<T>> lists)
    {
        // Populate a queue with all the lists.
        var theQueue = new Queue<IEnumerable<T>>(lists);
        if (theQueue.Count == 0) yield break;

        // Do pair-wise merge repeatedly until there is only one list left. 
        while (theQueue.Count > 1)
        {
            var l1 = theQueue.Dequeue();
            var l2 = theQueue.Dequeue();
            var rslt = Merge(l1, l2); // uses the two-way merge
            theQueue.Enqueue(rslt);
        }
        var merged = theQueue.Dequeue();
        foreach (var x in merged)
        {
            yield return x;
        }
    }

That code uses the two-way merge that I presented a while back.

When building the queue, we’re just setting things up. Everything is an IEnumerable<T>, meaning that no actual work takes place. LINQ’s deferred execution postpones the merging until we start to enumerate the result. And when we do enumerate the result, only one item is produced at a time. All of the intermediate merges proceed in lockstep, each producing an item only when the merge above it asks for one.

Single-stepping through a merge is very instructive. Here’s some code that uses it:

    var l1 = new int[] {1, 3, 9, 12, 15};
    var l2 = new int[] {2, 7, 14, 16, 19};
    var l3 = new int[] {6, 8, 13, 17, 20};
    var rslt = MergeVersionTwo(new List<int[]> {l1, l2, l3});

    foreach (var i in rslt)
        Console.WriteLine(i);

If you wrap that in a method and single-step it, you’ll see that no real work is done until you get to the foreach. Even then, all that mucking about with the queue doesn’t enumerate any of the lists until you get to the foreach at the end of the MergeVersionTwo method. If you start single-stepping (in Visual Studio, press F11, or the “Step into” debugger command), you’ll see it start working on the two-way merges. Watching how all that works will help you to understand just how powerful deferred execution can be.

It’s important to note that asymptotic analysis just tells us how the run time varies with the size of the input. It doesn’t say that both of the k-way merge algorithms will take the same amount of time. O(n log2 k) just says that, for a particular algorithm, the amount of time varies with the size of the input and the logarithm of the number of lists. We’ll have to do some performance tests to determine which is actually faster.

That testing should prove interesting. Before we go there, though, we need to look more closely at the pairwise merge.


Merging multiple lists with LINQ

To merge multiple sorted lists efficiently, we need a better way to obtain the smallest value from among the lists’ current items. That is, given three lists:

    A: 1, 8, 13, 19, 40
    B: 0, 4, 12, 41
    C: 3, 22, 30, 37, 43

The first item to output would be the 0 from list B. Then we’d output 1 from list A, 3 from list C, etc. In my previous post I showed one way to find the smallest value. That involved looking at the current item from each list to get the minimum. It works, but it’s very expensive. If you have k sorted lists, then it requires O(k) operations to determine the smallest item.

I spent a lot of time last year talking about heaps. The heap, as you recall, is a data structure that efficiently keeps track of the smallest (or largest, depending on how you structure it) item. In particular, you can remove the smallest item from the heap in O(log2 k) time, where k is the number of items in the heap. Also, insertion into the heap is an O(log2 k) operation.

If there are n items spread out over k lists, then the naive method takes O(n * k) time, and the method with the heap requires O(n log2 k) time. Using a heap can save you a lot of time, even when k is small: with 16 lists, log2(16) is 4, so the heap-based selection does roughly a quarter of the work.

That sounds simple enough, but there’s one catch: you can’t just store values in the heap. With the value, you have to store a reference to the list itself. Otherwise you don’t know which list index to increment. Still, it’s not terribly difficult. With C#, we could use a Tuple<T1, T2>, or a simple object. Tuple is convenient, but I prefer creating a class for this kind of thing:

    class HeapItem
    {
        public int[] a;  // the array
        public int ix;   // current index
    }

Given that, and assuming we have a heap data structure, merging multiple lists becomes much easier. Following the basic algorithm from last time, the almost-C# code looks like this:

    // Initialize the heap and the output list
    var heap = new MinHeap();
    var result = new List<int>();
    foreach (var array in arrays)
        if (array.Length > 0)
            heap.Insert(new HeapItem { a = array, ix = 0 });
    
    while (!heap.IsEmpty)
    {
        HeapItem item = heap.RemoveSmallest();
        result.Add(item.a[item.ix]);  // output item
        item.ix++;                    // move to next item in this array
        // and put it back into the heap if there are still items
        if (item.ix < item.a.Length)
            heap.Insert(item);
    }

I glossed over the heap implementation here. The code assumes that you have a heap implementation that will work with those HeapItem instances. I do in fact have such a thing, and we’ll use it in a bit.
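
If you want something to experiment with right away, here’s a bare-bones binary min-heap that will work with those HeapItem instances. It’s just a sketch for illustration, not the d-ary heap I actually use later in this series; it orders HeapItem instances by their current array item.

    class MinHeap
    {
        private readonly List<HeapItem> _items = new List<HeapItem>();

        public bool IsEmpty
        {
            get { return _items.Count == 0; }
        }

        // The key we order by: the current item in the wrapped array.
        private static int KeyOf(HeapItem h) { return h.a[h.ix]; }

        public void Insert(HeapItem item)
        {
            // Add at the end, then bubble up toward the root: O(log k).
            _items.Add(item);
            int i = _items.Count - 1;
            while (i > 0 && KeyOf(_items[i]) < KeyOf(_items[(i - 1) / 2]))
            {
                Swap(i, (i - 1) / 2);
                i = (i - 1) / 2;
            }
        }

        public HeapItem RemoveSmallest()
        {
            // Replace the root with the last item, then sift down: O(log k).
            HeapItem smallest = _items[0];
            _items[0] = _items[_items.Count - 1];
            _items.RemoveAt(_items.Count - 1);
            int i = 0;
            while (true)
            {
                int left = 2 * i + 1, right = 2 * i + 2, min = i;
                if (left < _items.Count && KeyOf(_items[left]) < KeyOf(_items[min]))
                    min = left;
                if (right < _items.Count && KeyOf(_items[right]) < KeyOf(_items[min]))
                    min = right;
                if (min == i) break;
                Swap(i, min);
                i = min;
            }
            return smallest;
        }

        private void Swap(int x, int y)
        {
            HeapItem t = _items[x];
            _items[x] = _items[y];
            _items[y] = t;
        }
    }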

Note that there is no need to explicitly check for the end of all lists. All of the lists’ indexes are initialized at 0. When a list is removed from the heap, its current item is output and then its index is incremented. If the new current index is beyond the end of the list, then the list isn’t added back to the heap. If we drop lists from the heap once they’ve been exhausted, then we know that there are still items as long as the heap is not empty. That simplifies things quite a bit.

The problem with the above is that it assumes we’re working with arrays of integers. What we really want is a generic Merge method that works with any type that implements IEnumerable<T>. That is:

    IEnumerable<T> Merge<T>(IEnumerable<IEnumerable<T>> lists);

The basic idea isn’t much different from what I presented in Merging sequences with LINQ. But rather than having only two enumerators and doing a simple comparison, I create a heap of enumerators. It looks like this:

    public static IEnumerable<T> Merge<T>(IEnumerable<IEnumerable<T>> lists)
    {
        // This builds a heap of the first items from each list
        var initialItems =
            lists.Select(l => l.GetEnumerator())
                .Where(i => i.MoveNext());

        var heap = new MinDHeap<IEnumerator<T>>(
            2, initialItems, new EnumeratorComparer<T>(Comparer<T>.Default));

        while (!heap.IsEmpty)
        {
            var i = heap.RemoveRoot();
            yield return i.Current;
            if (i.MoveNext())
            {
                heap.Insert(i);
            }
        }
    }

That code relies on an EnumeratorComparer, which is defined as:

    private class EnumeratorComparer<T> : IComparer<IEnumerator<T>>
    {
        private readonly IComparer<T> _comp;

        public EnumeratorComparer(IComparer<T> comparer)
        {
            _comp = comparer ?? Comparer<T>.Default;
        }

        public int Compare(IEnumerator<T> x, IEnumerator<T> y)
        {
            var rslt = _comp.Compare(x.Current, y.Current);
            return rslt;
        }
    }

It also depends on my DHeap, the code for which you can download from that page.

One of the cool things about using enumerators here is that the concept of the current item is built in. The logic of incrementing and checking for the end of the list is all there in the MoveNext method and the Current property. So the code to implement the merge is incredibly simple.

I should mention the initialization code, which consists of this LINQ expression:

   var initialItems =
        lists.Select(l => l.GetEnumerator())
             .Where(i => i.MoveNext());

That code is just getting the enumerators for each of the lists, positioning to the first item in the list, and adding it to the list of enumerators (which is subsequently inserted into the heap) if the list has items. It’s roughly equivalent to:

    foreach (var list in lists)
    {
        var i = list.GetEnumerator();
        if (i.MoveNext())
            initialItems.Add(i);
    }

That’s the simple method that merges multiple lists using the default comparison. That’s useful in itself, but it doesn’t provide all of the features that I included in the MergeWithBy extension. Next time, we’ll look at extending this code to provide full-featured MergeBy and MergeByDescending methods.

Actually, I went off on a tangent and presented a different way to merge multiple lists. I’ll get to the full-featured methods after exploring that.

Merging multiple sorted lists

The process of merging multiple (i.e. more than two) sorted sequences is a simple extension of merging just two sorted sequences. The two-way merge algorithm I posted last month included an optimization that I shouldn’t have introduced quite so soon. That optimization complicates the algorithm a bit. The simplest algorithm for merging two sorted lists looks like this:

    while ((not end of List A) or (not end of List B))
        if (end of List A)
            output List B current item
            advance List B index
        else if (end of List B)
            output List A current item
            advance List A index
        else if (List A current item <= List B current item)
            output List A current item
            advance List A index
        else
            output List B current item
            advance List B index

That does the same thing as what I posted before, but does it all in a single loop. My previous version is slightly more efficient because once it reaches the end of one list it just outputs the remainder of the other. Once List A is exhausted, there’s no need to keep checking it.
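
For reference, here’s a minimal sketch of that optimized shape, using integer arrays (the names are mine, not necessarily those from the earlier post):

    private static int[] MergeTwo(int[] a, int[] b)
    {
        var result = new int[a.Length + b.Length];
        int ia = 0, ib = 0, ir = 0;

        // Compare only while both arrays still have items.
        while (ia < a.Length && ib < b.Length)
            result[ir++] = (a[ia] <= b[ib]) ? a[ia++] : b[ib++];

        // One array is exhausted; copy the remainder of the other
        // without any further comparisons.
        while (ia < a.Length) result[ir++] = a[ia++];
        while (ib < b.Length) result[ir++] = b[ib++];

        return result;
    }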

You could extend that algorithm for three, four, or more lists, but the logic gets complicated in a hurry. What we really need is a generalized k-way merge algorithm that can merge items from any number of sorted lists. That algorithm, simply stated, is:

    Initialize list indexes
    while any list has items
        Determine which list has the smallest current item
        Output the item from that list
        Increment that list's current index

One way to implement that would be with an array of lists, and a corresponding array of current index pointers. The first cut of such a thing might look something like this:

    // merges multiple integer arrays
    private static int[] MergeInts(List<int[]> arrays)
    {
        var result = new List<int>();
        // Initialize indexes
        var indexes = new int[arrays.Count];
        for (int i = 0; i < arrays.Count; ++i)
        {
            indexes[i] = 0;
        }

        // Function returns true if we've reached the end of all the arrays
        var endOfAllArrays = new Func<bool>(() =>
        {
            for (int i = 0; i < indexes.Length; ++i)
            {
                if (indexes[i] < arrays[i].Length)
                    return false;
            }
            return true;
        });

        while (!endOfAllArrays())
        {
            // Find the array with the smallest current item
            int smallestItem = int.MaxValue;
            int smallestArray = 0;
            for (int i = 0; i < indexes.Length; ++i)
            {
                if (indexes[i] < arrays[i].Length)
                {
                    if (arrays[i][indexes[i]] <= smallestItem)
                    {
                        smallestArray = i;
                        smallestItem = arrays[i][indexes[i]];
                    }
                }
            }

            // output that item
            result.Add(smallestItem);

            // advance the index on that array
            ++indexes[smallestArray];
        }
        return result.ToArray();
    }

That code implements the basic algorithm exactly. It’s not especially pretty, but it gets the job done. It’s also horribly inefficient, because every time through the loop it looks at all of the current items to determine which is the smallest. The complexity of this particular implementation is O(n*k), where n is the total number of items in all lists, and k is the number of lists. If you’re merging 10 lists that combined contain n items, it’s going to take twice as long as if you have the same items spread out over just 5 lists.

What we need is a faster way to find the list that has the smallest current item. I mentioned in my first article (linked above) that the k-way merge can be done in O(n log2 k) time. That makes a big difference, even for small values of k. The base 2 logarithm of 4, for example, is 2, meaning that if we can improve the selection algorithm, we can merge four lists in half the time.

What data structure do we know that can return the smallest item in O(log2 k) time? The heap, maybe? It’s a surprisingly useful data structure.

Next time we’ll look at replacing the array of arrays in the implementation above with a heap. That will reduce the time required to select the smallest item, and it will also eliminate the time required to determine if we’re at the end of the list.

Removing duplicates

Imagine you have two lists of names in alphabetical order. The first list is of people who drove from Point A to Point B on Monday, and the second list is of people who drove that route on Tuesday. You want to build a list of people who drove that route on either day, and you don’t want any duplicates. So if Joe, Mary, and Sam drove on Monday and Joe, Mary, and Ralph drove on Tuesday, then your list would contain [Joe, Mary, Ralph, Sam].

My MergeWithBy LINQ extension can easily merge the two lists to give [Joe, Joe, Mary, Mary, Ralph, Sam], but it doesn’t remove duplicates. I could make it remove duplicates easily enough by adding another parameter to the method, but it seems more useful to have a separate method that can remove duplicates from any sorted sequence.
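
In code, using the MergeWithBy extension from a few posts back and the UniqueBy extension developed below, the drivers example looks like this:

    var monday = new[] { "Joe", "Mary", "Sam" };
    var tuesday = new[] { "Joe", "Mary", "Ralph" };

    var merged = monday.MergeWithBy(tuesday, name => name);
    // merged: Joe, Joe, Mary, Mary, Ralph, Sam

    var eitherDay = merged.UniqueBy(name => name);
    // eitherDay: Joe, Mary, Ralph, Sam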

LINQ’s existing Distinct method can do it, but it has at least two drawbacks:

  • Because it works on unsorted sequences as well, it has to remember every distinct item it has seen, which in the worst case means holding all of the items in memory. This limits the size of sequences you can operate on.
  • It requires that I define an equality comparer if I want to do custom comparisons.

The first is a problem because I often work with very large lists that won’t fit into memory. The second is an annoyance because if I want to sort, merge, and then uniquify, I end up with two different meanings of equality: one defined by the IComparer I pass to OrderBy and MergeWithBy, and one defined by the IEqualityComparer implementation that I pass to Distinct. In addition, the syntax is different. That is, to merge and then uniquify:

    var merged = List1.MergeWithBy(List2, k => k.Age).ThenBy(k => k.LastName);
    var unique = merged.Distinct(new ItemComparerByAgeAndName());

I’d much rather have the second line be:

    var unique = merged.UniqueBy(k => k.Age).ThenBy(k => k.LastName);

Removing duplicates from an ordered sequence is easy enough. For example, this code de-dupes an ordered list, using the passed comparer.

    IEnumerable<TSource> Unique<TSource>(IEnumerable<TSource> theList, IComparer<TSource> comparer)
    {
        var it = theList.GetEnumerator();
        var hasItems = it.MoveNext();
        if (!hasItems)
            yield break;
        
        yield return it.Current;
        var prev = it.Current;
        hasItems = it.MoveNext();
        while (hasItems)
        {
            if (comparer.Compare(prev, it.Current) != 0)
            {
                yield return it.Current;
                prev = it.Current;
            }
            hasItems = it.MoveNext();
        }
    }

Note that I used an IComparer<T> rather than an IEqualityComparer<T> here, which is different from what Distinct uses. I did that because the next step is to implement a LINQ extension, called UniqueBy, that works in the same way that OrderBy and my MergeWithBy methods work, including support for ThenBy and ThenByDescending.

I’m not going to spend any time describing how the UniqueBy and related methods work. The inner workings are sufficiently similar to the previously covered MergeWithBy that going over it again would be needless repetition. See the article I linked at the top of this post if you want more information.

UniqueBy will be faster than Distinct, and will use a lot less memory. Like my MergeWithBy, it only needs enough extra memory to store three items at any one time. But it only works on ordered sequences. If you want to uniquify an unordered list, you need to call Distinct, or you need to sort it first before calling UniqueBy. But Distinct will be faster if you have an unsorted list that fits in memory.

As I showed above, using the new method is really easy: it’s just like using OrderBy or my TopBy method.

One question that comes up is whether there is any benefit to calling UniqueBy before merging sequences. That is, which of these two is better?

    var foo = L1.UniqueBy(k => k).Merge(L2.UniqueBy(k => k), k => k).UniqueBy(k => k);

OR:

    var foo = L1.Merge(L2, k => k).UniqueBy(k => k);

It should be clear that the final UniqueBy is required in the first example. Otherwise you wouldn’t remove duplicates from the merged list. You’d only remove duplicates from each of the input lists. But if “Joe” existed in both lists then “Joe” would be output twice.

You’re better off with the second example because it’s simpler, produces the same result as the first, uses slightly less memory, and will be ever so slightly faster, unless each of the input lists has a high percentage of duplicate items. Even in pathological cases, though, the difference in execution speed will likely be so small as to be irrelevant. So you should stick with the simpler code.

Merging two sorted lists into a single sorted list, potentially removing duplicates, is the most common type of merge I’ve used or seen used, but it’s certainly not the only merge topic. Merging is a technique that’s been used in transaction processing systems for at least 50 years. I’ll likely come back to it in the future.

Following is complete code for the UniqueBy extension.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

namespace EnumerableExtensions
{
    public static class UniqueExtension
    {
        public static IOrderedEnumerable<TSource> UniqueBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector)
        {
            return UniqueBy(list1, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> UniqueBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new UniqueOrderedEnumerable<TSource, TKey>(list1, keySelector, comparer, false);
        }

        public static IOrderedEnumerable<TSource> UniqueByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector)
        {
            return UniqueByDescending(list1, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> UniqueByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new UniqueOrderedEnumerable<TSource, TKey>(list1, keySelector, comparer, true);
        }

        internal abstract class UniqueOrderedEnumerable<TSource> : IOrderedEnumerable<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            internal UniqueOrderedEnumerable<TSource> Parent;

            protected UniqueOrderedEnumerable(
                IEnumerable<TSource> list1)
            {
                _list1 = list1;
            }

            public IOrderedEnumerable<TSource> CreateOrderedEnumerable<TKey>(
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool @descending)
            {
                var oe = new UniqueOrderedEnumerable<TSource, TKey>(
                    _list1, keySelector, comparer, descending) { Parent = this };
                return oe;
            }

            public IEnumerator<TSource> GetEnumerator()
            {
                var criteria = GetCriteria().ToArray();

                var selector = new UniqueSelector<TSource>(_list1, criteria);
                return selector.DoSelect().GetEnumerator();
            }

            // Walks the ordering criteria to build an array that the UniqueSelector can use.
            private IEnumerable<UniqueOrderedEnumerable<TSource>> GetCriteria()
            {
                var keys = new Stack<UniqueOrderedEnumerable<TSource>>();

                var current = this;
                while (current != null)
                {
                    keys.Push(current);
                    current = current.Parent;
                }
                return keys;
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }

            // The individual ordering criteria instances (UniqueOrderedEnumerable<TSource, TKey>)
            // implement these abstract methods to provide key extraction and comparison.
            public abstract void ExtractKey(TSource item, int ix);
            public abstract int CompareKeys(int x, int y);
        }

        internal class UniqueOrderedEnumerable<TSource, TKey> : UniqueOrderedEnumerable<TSource>
        {
            private readonly Func<TSource, TKey> _keySelector;
            private readonly IComparer<TKey> _comparer;
            private readonly bool _descending;
            private readonly TKey[] _keys;

            internal UniqueOrderedEnumerable(
                IEnumerable<TSource> list1,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool descending)
                : base(list1)
            {
                _keySelector = keySelector;
                _comparer = comparer ?? Comparer<TKey>.Default;
                _descending = descending;

                _keys = new TKey[2];
            }

            public override int CompareKeys(int x, int y)
            {
                return _descending
                    ? _comparer.Compare(_keys[y], _keys[x])
                    : _comparer.Compare(_keys[x], _keys[y]);
            }

            public override void ExtractKey(TSource item, int ix)
            {
                _keys[ix] = _keySelector(item);
            }
        }

        internal class UniqueSelector<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            private readonly UniqueOrderedEnumerable<TSource>[] _criteria;

            public UniqueSelector(
                IEnumerable<TSource> list1,
                UniqueOrderedEnumerable<TSource>[] criteria)
            {
                _list1 = list1;
                _criteria = criteria;
            }

            public IEnumerable<TSource> DoSelect()
            {
                // Initialize the iterator
                var i1 = _list1.GetEnumerator();

                // ix controls the key position that's loaded
                var ix = 0;

                var next = new Func<bool>(() =>
                {
                    if (!i1.MoveNext()) return false;
                    ExtractKeys(i1.Current, ix);
                    return true;
                });

                var i1HasItems = next();
                if (!i1HasItems) yield break;

                // output the first item
                yield return i1.Current;
                ix = 1;
                i1HasItems = next();
                while (i1HasItems)
                {
                    // Output the item if it's not equal to the previous item
                    if (Compare(0, 1) != 0)
                    {
                        yield return i1.Current;

                        // Change the index position for the next loaded key
                        ix = (ix == 0) ? 1 : 0;
                    }
                    i1HasItems = next();
                }
            }

            private int Compare(int x, int y)
            {
                var rslt = 0;
                for (var i = 0; rslt == 0 && i < _criteria.Length; ++i)
                {
                    rslt = _criteria[i].CompareKeys(x, y);
                }
                return rslt;
            }

            private void ExtractKeys(TSource item, int ix)
            {
                foreach (var t in _criteria)
                {
                    t.ExtractKey(item, ix);
                }
            }
        }
    }
}

Thoughts after the election

If you believe the post-election commentary, the American people have “repudiated the policies of this administration and embraced conservative ideals.” Republicans are saying that gaining control of the Senate and increasing their control of the House is “a mandate” from the American people. In a sense it is, but almost certainly not in the way that they apparently believe and, more importantly, want you to believe.

For the president’s opposing party to take control of Congress during midterm elections is nothing new. Republicans did it in 1994, and Democrats did it in 2006. Historically, it’s unusual for the opposing party not to gain seats during the midterm. The media pundits and party propaganda twits will come up with all kinds of complicated reasons, often based on the belief that the winning party lied or cheated, but I think the real reason is much simpler: voters are expressing their discontent by throwing the bums out.

Throwing the bums out is a good thing as far as I’m concerned. Unfortunately, they just elect a new crop of bums and it’s business as usual.

The American electorate has an incredibly short memory and an even more limited imagination. They’re smart enough to see that things aren’t working well, and know that we need a change in Congress. But their imagination is limited to one of two flavors: Democrat or Republican. They’ll happily throw out the Democrats in favor of the Republicans, despite having been burned by the Republicans eight or ten years ago. And at some time in the near future they’ll throw the Republicans out in favor of the Democrats, forgetting that things weren’t any better the last time Democrats controlled Congress.

For some unfathomable reason, Americans lack the imagination to throw out Democrats and Republicans. That would send the right message. As it stands, all we’re doing is swapping one crowd of blood sucking parasites for another.

I liken it to being given the choice of having your face slapped repeatedly or getting punched in the gut repeatedly. We get tired of slaps after a while and switch to the gut punchers, but then our stomachs start to hurt and we go for the face slapping again. But what we really want is for people to stop hitting us. And yet we don’t have the imagination or the will to do anything about it.

In part, that’s because we long ago allowed Congress to make rules that enforce a very strict two-party system. The party that has the most seats in the House or Senate gets to make rules and control what legislation is presented. That in itself encourages an adversarial relationship, which is especially bad when the President’s party controls one house and the opposing party controls the other. In that case, the opposing party is forced to do whatever it can to block the president’s every move. To do otherwise would alienate their base and anybody else who might be discontented enough to vote for them during the next election cycle.

When the president’s party controls both houses of Congress, we’re in real trouble, especially when it holds a supermajority that can completely block every move of the opposing party. In that case, there is no opposition to whatever grandiose scheme the president’s party can dream up. We usually come to regret laws that are enacted without careful consideration and lively debate. Giving a single party full control of two of our three branches of government is dangerous. So far we’ve been fortunate that the parties aren’t quite as tightly controlled as they could be. It’s a good thing that sometimes a party member will vote against the wishes of the party leaders.

We’re best off when one party controls the Executive branch and the other party controls the Legislative. In that case, the two parties are forced to work together. When one party controls both houses of Congress, the people who elected them expect them to Get Things Done. Sure, some of the party faithful think Congress should adopt an adversarial posture towards the president, but that leads to idiocy like the Clinton impeachment trial. Most of the people will want Congress to work with the president and enact meaningful legislation. Or, one would hope, repeal stupid legislation that was imposed on us when one party had a little too much power.

If Republicans can work with the president over the next two years (one year, really, because the second year will be dedicated to the mud slinging and political posturing we call campaigning), Republicans have a chance of gaining the White House again, and perhaps can keep from losing too much ground in the Congressional elections. If, however, they adopt an adversarial role and refuse to work with the president, the 2016 elections will be a repeat of 2008.

What I’d really like to see, though, is a meaningful third party or, preferably, a serious independent (i.e. no party affiliation) movement. Our system of party politics, especially the artificial two party system, is a serious impediment. It just encourages tribalism and perpetuates the dysfunctional governance we’ve experienced for the last 25 years or more.

Timers don’t keep track of time

In computer programming, the word Timer is used to describe a hardware or software component that raises an event (“ticks”) at a specified interval. So, for example, if you want a notification once per second, you would instantiate a timer and set its period to one second. The timer API often has you specify the period in milliseconds, so you’d say 1,000 milliseconds.

In a C# program, for example, you can create a timer that ticks every second, like this:


    private System.Threading.Timer _timer;
    private int _tickCount;
    public void Test()
    {
        _timer = new System.Threading.Timer(MyTimerHandler, null, 1000, 1000);
        Console.ReadLine();
    }

    private void MyTimerHandler(object state)
    {
        ++_tickCount;
        Console.WriteLine("{0:N0} tick", _tickCount);
    }

It’s important to note that ticks don’t come at exact 1-second intervals. The timer is reasonably accurate, but it’s not perfect. How imperfect is it? Let’s find out. Below I’ve modified the little program to start a Stopwatch and output the actual elapsed time with each tick.

    private System.Threading.Timer _timer;
    private Stopwatch _stopwatch;
    private int _tickCount;
    public void Test()
    {
        _stopwatch = Stopwatch.StartNew();
        _timer = new System.Threading.Timer(MyTimerHandler, null, 1000, 1000);
        Console.ReadLine();
    }

    private void MyTimerHandler(object state)
    {
        ++_tickCount;
        Console.WriteLine("{0:N0} - {1:N0}", _tickCount, _stopwatch.ElapsedMilliseconds);
    }

On my system, that program shows the timer to be off by about 200 milliseconds every three minutes. That’s on a system that isn’t doing anything else of note: no video playing, no downloads, video streaming, or heavy duty background jobs. 200 milliseconds every three minutes doesn’t sound like much, but that’s one second every fifteen minutes, or 96 seconds every day. My granddad’s wind-up Timex kept better time than that.

Granted, the problem could be with the Stopwatch class. But it isn’t. I created a simple program that starts a Stopwatch and then outputs the elapsed time whenever I press the Enter key. I let that program run for days, and every time I hit Enter, it agreed very closely with what my wristwatch said. There was some variation, of course, because no two clocks ever agree perfectly, but the differences between Stopwatch and my wristwatch were a lot smaller than the differences between Stopwatch and the Timer. I have since run that experiment on multiple computers and multiple clocks, and every time the Stopwatch has been quite reliable.

Things get worse. On a heavily loaded system with lots of processing going on, timer ticks can be delayed by an indeterminate amount of time. I’ve seen ticks delayed for five seconds or more, followed by a flood of ticks. I’ll see, for example, a tick at 5,000 milliseconds, then one at 9,000 milliseconds followed very quickly by ticks at 9,010, 9,015, and 9,030 milliseconds. What happened is that the system was busy and couldn’t schedule the timer thread at 6,000 milliseconds, so the system buffered the ticks. When it finally got around to dispatching them, three more ticks had come in, and it fired all four as quickly as it could.

This can be a huge problem on heavily loaded systems because it’s possible that multiple timer ticks can be processing concurrently.

The only thing a timer is good for is to give you notification on a periodic basis–approximately at the frequency you requested. A timer is pointedly not for keeping time.

Given that, you can probably explain what’s wrong with this code, which is a very simplified example of something I saw recently. The original code did some processing and periodically checked the _elapsedSeconds variable to see how much time had elapsed. I simplified it here to illustrate the folly of that approach.

    private Timer _timer;
    private int _elapsedSeconds;
    private ManualResetEvent _event;
    private void Test()
    {
        _timer = new Timer(MyTimerHandler, null, 1000, 1000);
        _event = new ManualResetEvent(false);
        _event.WaitOne();
        Console.WriteLine("One minute!");
        _timer.Dispose();
    }
    
    private void MyTimerHandler(object state)
    {
        ++_elapsedSeconds;
        if (_elapsedSeconds == 60)
        {
            _event.Set();
        }
    }

Here, the programmer is using the timer like the second hand of a clock: each tick represents one second. There are at least two problems here. First, we’ve already seen that the timer won’t tick at exactly one-second intervals. Second and more importantly, it’s possible on a busy system for multiple timer events to occur concurrently. Imagine what happens in this case:

  1. _elapsedSeconds is equal to 59.
  2. A tick occurs and Thread A is dispatched to handle it.
  3. Thread A increments the _elapsedSeconds value to 60.
  4. Another tick occurs and Thread B is dispatched to handle it.
  5. At the same time, Thread A is swapped out in favor of some higher priority thread.
  6. Thread B increments the _elapsedSeconds value to 61.

At this point, the _elapsedSeconds value has been incremented beyond 60, so the conditional will never be true and the event will never be set. (Well, it might: about 4.3 billion seconds from now, when _elapsedSeconds wraps all the way around to 60 again. That’s roughly 136 years.)

You could change the conditional to _elapsedSeconds >= 60, but that’s missing the point. We’ve already shown that the timer isn’t going to be particularly accurate. If you were trying to time a whole day, you could be off by a minute and a half.

The problem of concurrent execution of timer handlers is another topic entirely that I probably should cover in a separate post. It’s important you understand that it can happen, though, and you have to keep that in mind.

In programming, timers don’t keep time. Don’t try to make them. They can’t count up reliably, and they can’t count down reliably. If you want to keep track of elapsed time, start a Stopwatch and then check its Elapsed or ElapsedMilliseconds properties when necessary.
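
To make that concrete, here’s one way to rewrite the one-minute example so that a Stopwatch keeps the time and the timer just provides a periodic opportunity to check it. Late or concurrent ticks no longer matter, because the handler tests elapsed time rather than counting ticks:

    private System.Threading.Timer _timer;
    private Stopwatch _stopwatch;
    private ManualResetEvent _event;

    private void Test()
    {
        _stopwatch = Stopwatch.StartNew();
        _event = new ManualResetEvent(false);
        _timer = new System.Threading.Timer(MyTimerHandler, null, 1000, 1000);
        _event.WaitOne();
        Console.WriteLine("One minute!");
        _timer.Dispose();
    }

    private void MyTimerHandler(object state)
    {
        // The Stopwatch keeps the time; the timer only gives us a chance
        // to look at it. Using >= means a delayed tick still fires the event.
        if (_stopwatch.Elapsed >= TimeSpan.FromMinutes(1))
        {
            _event.Set();
        }
    }

It’s still only as accurate as the granularity of the ticks, of course, but it can no longer lose the event entirely.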

On a related note, do not use the clock to measure time.

The MergeWithBy LINQ extension

Last time, I showed how to write the standard two-way merge using LINQ. With that code, you can easily merge two sorted sequences from any type of container that implements the IEnumerable<T> interface. That in itself is very useful, and with the addition of a parameter that defines the comparison function, it likely would suffice for most merge operations. However, it’s less than satisfactory for the same reasons I pointed out in my introduction to the TopBy method earlier this year: sometimes creating an IComparer is a giant pain in the neck. To make a LINQ-compatible merge capability, I really need to implement IOrderedEnumerable.

I’ve struggled with what to call the method. Obvious candidates are Merge, MergeBy, MergeWith, and MergeWithBy, but each of those has its drawbacks. Consider, for example, the possibilities:

    var merged = List1.Merge(List2, k => k.Age).ThenBy(k => k.LastName);
    var merged = List1.MergeBy(List2, k => k.Age).ThenBy(k => k.LastName);
    var merged = List1.MergeWith(List2, k => k.Age).ThenBy(k => k.LastName);
    var merged = List1.MergeWithBy(List2, k => k.Age).ThenBy(k => k.LastName);

I’m not 100% happy with any one of those options but after struggling with it for a while I decided on MergeWithBy which, although a little clunky to the ear (MergeWithByDescending in particular), is the most descriptive. I’m merging one list with another by the specified ordering criteria.

With the naming out of the way, the easy part is defining the LINQ extension methods. These exactly mirror the standard OrderBy methods as well as my TopBy methods.

    public static class MergeExtension
    {
        public static IOrderedEnumerable<TSource> MergeWithBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            IEnumerable<TSource> list2,
            Func<TSource, TKey> keySelector)
        {
            return MergeWithBy(list1, list2, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> MergeWithBy<TSource, TKey>(
            this IEnumerable<TSource> list1,
            IEnumerable<TSource> list2,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new MergeOrderedEnumerable<TSource, TKey>(list1, list2, keySelector, comparer, false);
        }

        public static IOrderedEnumerable<TSource> MergeWithByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            IEnumerable<TSource> list2,
            Func<TSource, TKey> keySelector)
        {
            return MergeWithByDescending(list1, list2, keySelector, null);
        }

        public static IOrderedEnumerable<TSource> MergeWithByDescending<TSource, TKey>(
            this IEnumerable<TSource> list1,
            IEnumerable<TSource> list2,
            Func<TSource, TKey> keySelector,
            IComparer<TKey> comparer)
        {
            return new MergeOrderedEnumerable<TSource, TKey>(list1, list2, keySelector, comparer, true);
        }
    }

The harder part is implementing the IOrderedEnumerable<T> interface that actually does the work. The idea behind it, as I described in my two-part series on IOrderedEnumerable, is to create, for each of the ordering criteria, a class instance that can compare the relevant key. There is a base class, MergeOrderedEnumerable<TSource>, and a derived class for each key type: MergeOrderedEnumerable<TSource, TKey>.

I covered the details of how those two classes work in the article linked above, and in the second part. For the most part, those two classes are just bookkeeping: setting things up so that a third class, the selector, can do the actual work. The full source of the merge extension class is shown at the end of this post.

The guts of the merge–the part that implements the Merge method I showed last time–is the MergeSelector.DoSelect method, shown here:

    public IEnumerable<TSource> DoSelect()
    {
        // Initialize the iterators
        var iterators = new IEnumerator<TSource>[2];

        var next = new Func<int, bool>(ix =>
        {
            if (!iterators[ix].MoveNext()) return false;
            ExtractKeys(iterators[ix].Current, ix);
            return true;
        });

        iterators[0] = _list1.GetEnumerator();
        iterators[1] = _list2.GetEnumerator();
        var i1HasItems = next(0);
        var i2HasItems = next(1);
        while (i1HasItems && i2HasItems)
        {
            if (Compare(0, 1) <= 0)
            {
                yield return iterators[0].Current;
                i1HasItems = next(0);
            }
            else
            {
                yield return iterators[1].Current;
                i2HasItems = next(1);
            }
        }

        while (i1HasItems)
        {
            yield return iterators[0].Current;
            i1HasItems = next(0);
        }

        while (i2HasItems)
        {
            yield return iterators[1].Current;
            i2HasItems = next(1);
        }
    }

This code is nearly a direct translation of the Merge I showed the last time. The primary difference in the structure of the code is that I put the iterators in an array so that I could reduce the amount of duplicated code. The next function advances to the next item in whichever list is specified. Because I have to extract keys that are maintained by the individual MergeOrderedEnumerable<TSource, TKey> instances, without that function I’d have to write code like this four times (twice for i1, and twice for i2):

    yield return i1.Current;
    i1HasItems = i1.MoveNext();
    if (i1HasItems)
    {
        ExtractKeys(i1.Current, 0);
    }

I suppose I could have combined the code:

    if ((i1HasItems = i1.MoveNext())) ExtractKeys(i1.Current, 0);

Call it a matter of style.

That’s really all there is to it. The MergeOrderedEnumerable classes look unnecessarily complex at first glance, but after you study them for a few minutes, you understand why all that code is there. It’s particularly instructive to set up a short merge of a few items, and then single-step the code. Not only will you see how all this stuff works together, you’ll gain a better understanding of how LINQ works in general.

That’s almost all there is to merging two sequences. The only thing remaining is the matter of uniqueness, which I’ll talk about next time in Removing duplicates.

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    namespace EnumerableExtensions
    {
        public static class MergeExtension
        {
            public static IOrderedEnumerable<TSource> MergeWithBy<TSource, TKey>(
                this IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector)
            {
                return MergeWithBy(list1, list2, keySelector, null);
            }

            public static IOrderedEnumerable<TSource> MergeWithBy<TSource, TKey>(
                this IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new MergeOrderedEnumerable<TSource, TKey>(list1, list2, keySelector, comparer, false);
            }

            public static IOrderedEnumerable<TSource> MergeWithByDescending<TSource, TKey>(
                this IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector)
            {
                return MergeWithByDescending(list1, list2, keySelector, null);
            }

            public static IOrderedEnumerable<TSource> MergeWithByDescending<TSource, TKey>(
                this IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new MergeOrderedEnumerable<TSource, TKey>(list1, list2, keySelector, comparer, true);
            }
        }

        internal abstract class MergeOrderedEnumerable<TSource> : IOrderedEnumerable<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            private readonly IEnumerable<TSource> _list2;
            internal MergeOrderedEnumerable<TSource> Parent;

            protected MergeOrderedEnumerable(
                IEnumerable<TSource> list1,
                IEnumerable<TSource> list2)
            {
                _list1 = list1;
                _list2 = list2;
            }

            public IOrderedEnumerable<TSource> CreateOrderedEnumerable<TKey>(
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool @descending)
            {
                var oe = new MergeOrderedEnumerable<TSource, TKey>(
                    _list1, _list2, keySelector, comparer, descending) {Parent = this};
                return oe;
            }

            public IEnumerator<TSource> GetEnumerator()
            {
                var criteria = GetCriteria().ToArray();

                var selector = new MergeSelector<TSource>(_list1, _list2, criteria);
                return selector.DoSelect().GetEnumerator();
            }

            // Walks the ordering criteria to build an array that the MergeSelector can use.
            private IEnumerable<MergeOrderedEnumerable<TSource>> GetCriteria()
            {
                var keys = new Stack<MergeOrderedEnumerable<TSource>>();

                var current = this;
                while (current != null)
                {
                    keys.Push(current);
                    current = current.Parent;
                }
                return keys;
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }

            // The individual ordering criteria instances (MergeOrderedEnumerable<TSource, TKey>)
            // implement these abstract methods to provide key extraction and comparison.
            public abstract void ExtractKey(TSource item, int ix);
            public abstract int CompareKeys(int x, int y);
        }

        internal class MergeOrderedEnumerable<TSource, TKey> : MergeOrderedEnumerable<TSource>
        {
            private readonly Func<TSource, TKey> _keySelector;
            private readonly IComparer<TKey> _comparer;
            private readonly bool _descending;
            private readonly TKey[] _keys;

            internal MergeOrderedEnumerable(
                IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer,
                bool descending)
                : base(list1, list2)
            {
                _keySelector = keySelector;
                _comparer = comparer ?? Comparer<TKey>.Default;
                _descending = descending;

                _keys = new TKey[2];
            }

            public override int CompareKeys(int x, int y)
            {
                return _descending
                    ? _comparer.Compare(_keys[y], _keys[x])
                    : _comparer.Compare(_keys[x], _keys[y]);
            }

            public override void ExtractKey(TSource item, int ix)
            {
                _keys[ix] = _keySelector(item);
            }
        }

        internal class MergeSelector<TSource>
        {
            private readonly IEnumerable<TSource> _list1;
            private readonly IEnumerable<TSource> _list2;
            private readonly MergeOrderedEnumerable<TSource>[] _criteria;

            public MergeSelector(
                IEnumerable<TSource> list1,
                IEnumerable<TSource> list2,
                MergeOrderedEnumerable<TSource>[] criteria)
            {
                _list1 = list1;
                _list2 = list2;
                _criteria = criteria;
            }

            public IEnumerable<TSource> DoSelect()
            {
                // Initialize the iterators
                var iterators = new IEnumerator<TSource>[2];

                var next = new Func<int, bool>(ix =>
                {
                    if (!iterators[ix].MoveNext()) return false;
                    ExtractKeys(iterators[ix].Current, ix);
                    return true;
                });

                iterators[0] = _list1.GetEnumerator();
                iterators[1] = _list2.GetEnumerator();
                var i1HasItems = next(0);
                var i2HasItems = next(1);
                while (i1HasItems && i2HasItems)
                {
                    if (Compare(0, 1) <= 0)
                    {
                        yield return iterators[0].Current;
                        // I could do a loop here using ExtractCompare to compare against
                        // item 2. That would reduce the key extraction as long as the
                        // new item from list 1 is smaller than the new item from list 2.
                        // Then extract all of the keys once l1 goes high.
                        // Lots of code that might not be particularly useful.
                        i1HasItems = next(0);
                    }
                    else
                    {
                        yield return iterators[1].Current;
                        i2HasItems = next(1);
                    }
                }

                while (i1HasItems)
                {
                    yield return iterators[0].Current;
                    // TODO: Could add an "extract" parameter to the next function
                    // If "extract" is false, it doesn't extract the keys.
                    // That would speed up the tail copying,
                    // but might not be worth the trouble.
                    i1HasItems = next(0);
                }

                while (i2HasItems)
                {
                    yield return iterators[1].Current;
                    i2HasItems = next(1);
                }
            }

            private int Compare(int x, int y)
            {
                var rslt = 0;
                for (var i = 0; rslt == 0 && i < _criteria.Length; ++i)
                {
                    rslt = _criteria[i].CompareKeys(x, y);
                }
                return rslt;
            }

            private void ExtractKeys(TSource item, int ix)
            {
                foreach (var t in _criteria)
                {
                    t.ExtractKey(item, ix);
                }
            }
        }
    }

Merging sequences with LINQ

Last time, I introduced the concept of merging two sorted sequences and showed that the standard two-way merge is superior to the naive combine-and-sort method. The code I presented in that post was specific to sorting two arrays of integers. It’s useful as a demonstration, but in most situations the lists I want to merge contain more than just integers. It’s time to make that merge more generic by providing a LINQ implementation.

Making a LINQ version of the standard two-way merge is straightforward. The algorithm is obviously the same; I just have to modify the code that I presented last time so that it works with enumerators rather than with array indexes, and modify the comparison logic so that it calls an item comparison function rather than relying on the comparison operators. A direct translation of the integer merge code, extended to support any type, looks like this:

    public static IEnumerable<T> Merge<T>(IEnumerable<T> list1, IEnumerable<T> list2)
    {
        // Initialize everything
        var comp = Comparer<T>.Default;
        var i1 = list1.GetEnumerator();
        var i2 = list2.GetEnumerator();
        var i1HasItems = i1.MoveNext();
        var i2HasItems = i2.MoveNext();

        // While there are items in both lists,
        // compare and pick the smallest.
        while (i1HasItems && i2HasItems)
        {
            if (comp.Compare(i1.Current, i2.Current) <= 0)
            {
                yield return i1.Current;
                i1HasItems = i1.MoveNext();
            }
            else
            {
                yield return i2.Current;
                i2HasItems = i2.MoveNext();
            }
        }

        // At this point, one of the lists has been exhausted.
        // The other still has items in it.
        // We don't know which, so just make sure both are exhausted.
        while (i1HasItems)
        {
            yield return i1.Current;
            i1HasItems = i1.MoveNext();
        }
        while (i2HasItems)
        {
            yield return i2.Current;
            i2HasItems = i2.MoveNext();
        }
    }

On the surface, that’s not a whole lot different than the integer array merge I presented last time. There are, however, significant differences in how it actually behaves, all of which have to do with deferred execution.

The previous code requires that you pass it fully-realized arrays, which it then merges into another array and returns a reference to that new array. The LINQ version above accepts two IEnumerable<T> references, which it merges and returns items one at a time with the yield return statements. The difference in behavior is quite striking.

To illustrate the difference, imagine you have two files, each of which contains a sorted list of integers: one per line. If you wanted to merge those files into a single output file using the array merge code, you would write code that does this:

    Load the first file into Array1
    Load the second file into Array2
    Result = Merge(Array1, Array2)
    Save Result to file

The important part to note here is that you will need enough memory to hold all of Array1, all of Array2, and all of the Result array simultaneously. If the files contain a billion items each, that works out to 16 gigabytes of memory: four gigabytes for each of Array1 and Array2, and eight gigabytes for the merged Result array.

For the LINQ version, you can structure things so that you need very little memory, regardless of how large the files are. Consider this code:

    public static void MergeTwoFiles(string file1, string file2, string resultFile)
    {
        var f1 = File.ReadLines(file1).Select(int.Parse);
        var f2 = File.ReadLines(file2).Select(int.Parse);
        var result = Merge(f1, f2).Select(i => i.ToString());
        File.WriteAllLines(resultFile, result);
    }

That looks like it does everything that I described in the pseudo code above, but there’s a big difference. The first line creates an enumerator that can iterate over the file line-by-line. But it doesn’t actually read the file until you call MoveNext. Same with the second line of code. Even the third line of code doesn’t actually perform any processing.

The fourth line of code, the call to File.WriteAllLines, starts the ball rolling. It essentially does this:

    using (var writer = new StreamWriter(resultFile))
    {
        foreach (var r in result)
        {
            writer.WriteLine(r);
        }
    }

That foreach is where all the magic happens. The first time through, it calls MoveNext on the result enumerator, and that causes the initialization code in the Merge method to execute. That code in turn calls MoveNext on the f1 and f2 enumerators, which ends up loading the first number from each file. Then, the first iteration of the Merge loop is executed. When the code gets to a yield return, the value r is returned to the loop shown above, and that value is written to the file.

At this point, only three values have been read from the files: the first value from each of f1 and f2, and the second value from whichever file contained the lowest value.

The code continues in this way until both files are exhausted. At no time are there more than three values from the files in memory.
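
If you want to watch that behavior happen, here's a little experiment of my own (it's not part of the merge code; Announce is just a hypothetical helper, and I'm assuming the usual System, System.Collections.Generic, and System.Linq usings). Each source sequence announces when it yields a value, so you can see Merge pull items one at a time:

    static IEnumerable<int> Announce(string name, params int[] items)
    {
        foreach (var item in items)
        {
            Console.WriteLine("{0} yields {1}", name, item);
            yield return item;
        }
    }

    // Pull just the first three merged values. The interleaved console
    // output shows that each source is read only as items are needed.
    foreach (var x in Merge(Announce("f1", 1, 5, 6, 9), Announce("f2", 2, 3, 6, 8)).Take(3))
    {
        Console.WriteLine("merged output: {0}", x);
    }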

Another way to look at it is to imagine that there are three workers: Joe, Bob, and Fred. Joe and Bob each have a list of numbers that they can give to you one at a time, and Fred will write down whatever number you give him. Your merge works like this:

    Get first item from Joe
    Get first item from Bob
    while Joe and Bob both have items
        if Joe's number is smaller
            tell Fred to write Joe's number
            get the next number from Joe
        else
            tell Fred to write Bob's number
            get the next number from Bob

For brevity, I’ve skipped the code at the end of the loop that makes sure that Joe and Bob have each given you all of their items.

The key here is that you don’t really care where or how Joe and Bob are getting their numbers. They could have a big list of numbers written down on paper, or they could be pulling the numbers out of thin air as you ask for them. And Fred doesn’t need all of the numbers at once. He just needs to write down each number as you give it to him.

Because I often work with huge data sets that won’t fit into memory, I make good use of LINQ’s deferred execution, this being one particularly good example.

The Merge method above that uses IEnumerable<T> is a reasonably useful method, but it doesn’t go quite far enough. In particular, it doesn’t let me specify the comparison method, nor does it let me define a key selector. I need it to work like the OrderBy method, where I can define multiple ordering keys and key selectors. That takes a bit more work. Fortunately, I know how to do most of it because I had to implement much the same thing when I created my TopBy method earlier this year. Implementing IOrderedEnumerable for my MergeWith method should be much simpler.
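
To make that goal concrete, here's the call shape I'm aiming for, patterned after OrderBy and ThenBy. This is just a sketch of the intended usage, not working code yet, and the person type with its LastName and FirstName properties is a placeholder:

    var merged = list1
        .MergeWith(list2, person => person.LastName)
        .ThenBy(person => person.FirstName);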

We’ll dig into it next time.

Merging sorted sequences

A fairly common operation in some data processing applications involves merging two or more sorted sequences. In my previous work I frequently had to merge lists that contained tens or hundreds of millions of items each. When you’re working with lists that large, the choice of merge algorithm can make the difference between a few seconds’ work and an overnight job.

The naive way to merge two lists into a third is to concatenate and sort. That is, given these two lists:

    List A = [1, 5, 6, 9]
    List B = [2, 3, 6, 8]

You’d first create the list [1, 5, 6, 9, 2, 3, 6, 8], and then sort the sequence to produce [1, 2, 3, 5, 6, 6, 8, 9].

That technique works well enough, but it’s expensive. The best sorting algorithms take time proportional to n log2(n), where n is the number of items in the list. That’s no big deal when you’re working with reasonably short lists, but sorting a few hundred million items takes significant time. Another problem is that the sort is necessarily an in-memory operation, unless you resort to a very complicated disk-based merge sort, which is even more expensive.

You can do much better by taking advantage of the existing ordering. We know that both lists are already sorted, so you know that the first item in the combined list will be the smallest of the first items in each of the lists. In the lists above, List A starts with 1, and List B starts with 2. You know that the first item in the new list has to be 1.

If you remove 1 from List A and repeat the process, you’re comparing 5 with 2. Obviously, you output 2 from List B and remove it from the list. Then compare 5 with 3, etc. You do this until one of the lists is exhausted, and then you output all the items from the other list. In pseudo code, it looks like this:


    while (not end of List A and not end of List B)
        if (List A current item <= List B current item)
            output List A current item
            advance List A index
        else
            output List B current item
            advance List B index

    // At this point, one of the lists is empty.
    // Output remaining items from the other
    while (not end of List A)
        output List A current item
        advance List A index

    while (not end of List B)
        output List B current item
        advance List B index

Note that it doesn’t matter which list empties first. If List A empties first, then the condition at the beginning of the following while loop will prevent trying to output something that isn’t there.

The algorithm above is a standard merge, which is an O(n) algorithm. If you study it for a few minutes, you’ll see that given two lists with lengths L1 and L2 it will make at most L1 + L2 - 1 item comparisons, and as few as min(L1, L2). That’s a far cry from the (L1 + L2) * log2(L1 + L2) comparisons that a sort will make in the average case.

The difference in the number of comparisons makes a huge difference. To see just how much of a difference, I wrote a little program that creates two lists and then merges them using each of the techniques. Here’s the code:

    static void Main(string[] args)
    {
        const int numIterations = 100;
        const int numItems = 1000 * 1000 * 100;
        Console.WriteLine("Testing with two lists of {0:N0} items each.", numItems);

        // Create two sorted lists
        var l1 = CreateRandomList(numItems);
        Array.Sort(l1);
        var l2 = CreateRandomList(numItems);
        Array.Sort(l2);
  
        DoTest(() => MergeWithSort(l1, l2), numIterations, "Merge with sort");
        DoTest(() => SimpleMerge(l1, l2), numIterations, "Simple merge");
        Console.ReadLine();
    }

    static void DoTest(Action action, int iterations, string name)
    {
        Console.Write("Executing ({0}) {1:N0} times.", name, iterations);
        var totalTime = TimeSpan.Zero;
        var sw = new Stopwatch();
        for (int i = 0; i < iterations; ++i)
        {
            sw.Restart();
            action();
            sw.Stop();
            totalTime += sw.Elapsed;
            Console.Write('.');
        }
        Console.WriteLine(" {0:N0} ms. Average {1:N2} ms", 
            totalTime.TotalMilliseconds, (double)totalTime.TotalMilliseconds/iterations);
    }

    static int[] MergeWithSort(int[] listA, int[] listB)
    {
        var result = new int[listA.Length + listB.Length];
        Array.Copy(listA, result, listA.Length);
        Array.Copy(listB, 0, result, listA.Length, listB.Length);
        Array.Sort(result);
        return result;
    }

    static int[] SimpleMerge(int[] listA, int[] listB)
    {
        var result = new int[listA.Length + listB.Length];
        int ixListA = 0;
        int ixListB = 0;
        int ixResult = 0;
        while (ixListA < listA.Length && ixListB < listB.Length)
        {
            if (listA[ixListA] <= listB[ixListB])
            {
                result[ixResult] = listA[ixListA];
                ++ixListA;
            }
            else
            {
                result[ixResult] = listB[ixListB];
                ++ixListB;
            }
            ++ixResult;
        }

        while (ixListA < listA.Length)
        {
            result[ixResult] = listA[ixListA];
            ++ixListA;
            ++ixResult;
        }

        while (ixListB < listB.Length)
        {
            result[ixResult] = listB[ixListB];
            ++ixListB;
            ++ixResult;
        }

        return result;
    }

    private static readonly Random Rnd = new Random();
    static int[] CreateRandomList(int count)
    {
        var result = new int[count];
        for (int i = 0; i < count; ++i)
        {
            result[i] = Rnd.Next();
        }
        return result;
    }

Running that with different list lengths shows how much faster the standard merge algorithm is in comparison to the combine-and-sort method, especially when the lists get very long. Here are the results:

    Items          Sort           Standard merge
    100            0.02 ms        0.01 ms
    1,000          0.14 ms        0.03 ms
    1,000,000      284.39 ms      19.42 ms
    100,000,000    32.3 seconds   1.92 seconds

I think you can see where that’s going. It’s pretty clear that the standard merge is a whole lot faster when the lists get large. And that’s just sorting integers. As item comparisons get more expensive (sorting strings, for example), the sort gets even more expensive.

So the standard merge is faster. As I mentioned above, it can also require much less memory than the combine and sort method. Because the standard merge only has to compare the current items in each list, and once an item is output it’s never looked at again, the merge only has to hold two items in memory at any time. Granted, it needs a way to get the next item from each list, but even when you add it all up it’s still a constant amount of memory. You can merge two lists of a billion items each using exactly the same amount of memory that it takes to merge two lists of 10 items each.

The merge example I showed above merges arrays of integers, but it’s not difficult to extend it to merge IEnumerable<T> instances. With slightly more work, I can add LINQ extension methods for merging. Specifically, extension methods MergeWith and MergeWithDescending that merge sorted sequences in ascending or descending order, and support the ThenBy and ThenByDescending LINQ methods so that you can merge based on multiple keys in the same way that you can sort by multiple keys with OrderBy and ThenBy.

The other thing I should mention is that, at least for now, I’m going to limit my discussion to merging two sorted sequences with the standard two-way merge algorithm shown above. There is another algorithm, called a k-way merge, that merges k sorted sequences in O(n log2 k) time, where n is the total number of items in all sequences.

The two-way merge, in fact, is just a special case of the k-way merge. That is, O(n log2 k), when k is equal to 2, works out to O(n), because log2(2) is equal to 1.

I’ll get around to discussing the k-way merge at some point.

Next time I’ll show a LINQ-ified version of the standard two-way merge, and then start looking at how to implement the MergeWith and MergeWithDescending extension methods.

Solving the right problem

As programmers, we sometimes lose sight of the problem we’re supposed to solve because we’re focused on the problem we want to solve. Or perhaps on the problem we think we need to solve. Today’s lesson is a case in point.

Imagine you have a very large collection (many thousands) of records that contain individuals’ names and their skills. The data consists of a name followed by a comma-separated list of skills. Something like:

Jim Mischel, programming, wood carving, brewing, curmudgeon

The problem is that some of the skills are misspelled or abbreviated: “mgmt” for “Management,” for example, or “hadup” instead of “Hadoop,” etc.

Your job is to go through the records and fix the misspellings.

Seems easy enough, right? Just go through the skills, look up each word in a dictionary, and correct the entry if the word isn’t in the dictionary. But there are two problems:

  1. A lot of common terms used in describing skills are domain specific and won’t be in any dictionary.
  2. Even if there were such a dictionary, how would you determine that “mgmt” is supposed to be “Management,” or that “hadup” should be “Hadoop?”

It’s at this point that we lose sight of the real problem we’re trying to solve. Being programmers, we start thinking about figuring out edit distances, machine learning, support vector machines, etc. In no time at all we’re shaving a yak.

Writing a computer program that will detect and correct misspellings like this is hard. I’m talking Ph.D. research project hard. I guarantee that the person who asked you to solve the problem isn’t willing to let you embark on a multi-year research project.

Fortunately, you don’t have to. You can solve it in a day or two with two simple programs and a little bit of manual effort.

First, write a program that builds a dictionary of all the terms used, along with how often they’re used. You’ll end up with a very large list that looks something like:

    Management,200
    Mgmt,3
    Hadoop,10
    Hadup,1
    Mamagement,1
    ...

If you assume that a term that appears very often is probably spelled correctly, then all you’re really concerned about are the terms that occur infrequently. You can decide on a threshold of, say, three or five, and have your program output only those terms. You’ll probably end up with a list of a few hundred infrequently used terms.
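
Here's a minimal sketch of that first program, assuming the record format shown earlier (name first, then comma-separated skills). The file name and the threshold value are placeholders, and I'm assuming the usual System, System.Collections.Generic, System.IO, and System.Linq usings:

    const int Threshold = 5;
    var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    foreach (var line in File.ReadLines("skills.csv"))
    {
        // The first field is the name; everything after it is a skill.
        foreach (var skill in line.Split(',').Skip(1).Select(s => s.Trim()))
        {
            int count;
            counts.TryGetValue(skill, out count);
            counts[skill] = count + 1;
        }
    }

    // Write out only the infrequently used terms for manual review.
    foreach (var pair in counts.Where(p => p.Value <= Threshold).OrderBy(p => p.Key))
    {
        Console.WriteLine("{0},{1}", pair.Key, pair.Value);
    }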

That’s where the manual work comes in. You have to look at each word, determine if it’s a misspelling, and if so find out what the correct spelling is. You manually create a file that contains the misspelled term along with the correct term. For example:

    mgmt, Management
    mamagement, Management
    hadup, Hadoop
    wood craving, wood carving
    etc.

The second program is even easier. All it does is load the replacements file into a dictionary with the key being the misspelled term, and then reads each name record. For each name record, the program checks each term against the dictionary. If it finds a misspelling that’s in the dictionary, it writes the replacement. Otherwise, it writes the original text. The algorithm looks like this:

    dict = LoadDictionary();
    create output file
    for each record in file
        output name
        for each skill in record.skills
            if skill in dict
                output replacement text
            else
                output original skill text
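
In C#, that might look something like this. It's a sketch under the same assumptions as the one above: placeholder file names, and a replacements file that's simple two-field CSV:

    var replacements = File.ReadLines("replacements.csv")
        .Select(line => line.Split(','))
        .ToDictionary(f => f[0].Trim(), f => f[1].Trim(),
                      StringComparer.OrdinalIgnoreCase);

    using (var writer = new StreamWriter("skills-fixed.csv"))
    {
        foreach (var line in File.ReadLines("skills.csv"))
        {
            var fields = line.Split(',').Select(s => s.Trim()).ToArray();
            // The first field is the name; check each remaining field
            // against the replacements dictionary.
            var skills = fields.Skip(1)
                .Select(s => replacements.ContainsKey(s) ? replacements[s] : s);
            writer.WriteLine(fields[0] + ", " + string.Join(", ", skills));
        }
    }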

This approach won’t be perfect, but it will be very good. You’ll have to tinker with the threshold value a bit to get the right balance between detecting real misspellings and “false positives”: terms that are used infrequently but aren’t actual misspellings.

You’ll also have to accept that common misspellings will not be detected. Fortunately, common misspellings will be … common. So you can add them to the replacements dictionary as they’re detected, and subsequent passes over the file will catch them. You could do some work with edit distance algorithms to catch the more common misspellings, but it would be a huge amount of work for relatively little gain.

I’ll admit that it’s somewhat dissatisfying having to manually create the replacements file, but it’s hard to argue with the result. Especially I can get somebody else to do that work. Then all I have to do is create the list of terms and their counts, and write the program that will make the replacements once the manual work is done.

It’s not sexy, but it solves the problem quickly and effectively.

The Toothpaste Rant

I’m cranky this morning. Like many people, I have kind of a ritual I go through when I get out of bed. The first thing I do is wander, bleary-eyed, into the bathroom to do my business. Then I stop by the sink and brush my teeth. The day just doesn’t seem right if I can’t brush my teeth first thing.

So why am I cranky? The toothpaste tube was empty, and I grabbed one of those little sample tubes that come in the mail from time to time. Crest wanting me to try their new flavor, I guess. Heck, it’s toothpaste, right? I squeezed a bit out onto my brush, put the thing in my mouth and … nearly vomited.

High on the list of things that you just shouldn’t mess with is toothpaste. Long before I started brushing my teeth, civilization figured out that the proper flavor for toothpaste was mint. It doesn’t really matter what kind of mint, as long as it has that … well, that minty flavor. And consistency matters, too. Toothpaste is done.

But can they leave well enough alone? No! Some idiot at the Crest toothpaste company decided that they needed a cinnamon flavor. To compete with cinnamon Altoids, perhaps? Who knows what goes through their minds? I know it’s hard to believe that they even thought of making a non-mint toothpaste, but for sure their taste testers should have nixed this crap before it ever got out of the lab. The stuff tastes like bathtub caulk mixed with just a hint of fake cinnamon–maybe from one of those Atomic Fireball candies. And it left a sour aftertaste in the back of my mouth. To top it all off, it has the consistency of bathtub caulk, too. Toothpaste is supposed to be a little moist–almost runny. It’s not supposed to suck all the moisture off your tongue when you put it in your mouth.

Fortunately, I was able to get rid of that taste by gargling with Listerine, and I brushed my teeth with Scope, if you can believe it. So I’m not as cranky as I could be. But it was a close thing.

Fair warning: check the flavor on that toothpaste tube before you squeeze some out onto your brush. You do not want the kind of rude surprise that I got this morning.

And to the people who make toothpaste: Stop messing with it! Just make sure that it cleans my teeth and leaves my mouth feeling minty fresh. Some things are perfect just the way they’ve been for a hundred years, thank you very much. Spend your time on real problems, like figuring out how to make a floss that doesn’t get stuck between my teeth, or maybe a “scrubbing bubbles” mouthwash that I can swish around instead of having to use the dang brush. But make sure that it’s a mint flavor, okay?

Stop the style bashing

I might have mentioned before that I try to spend some time every day answering questions on StackOverflow. I learn a lot about the common issues that other programmers face, and I’ve picked up quite a few tips from reading answers supplied by others. I also learn quite a bit when I answer questions; sometimes providing a complete useful answer requires a lot of research.

Note that the discussion below is specific to C#. I’m not familiar with the implementation of the bool data type in C++ or boolean in Java, and C traditionally had no separate Boolean type (C99 added _Bool).

I also learn about others’ misconceptions. For example, whenever somebody posts code that compares a Boolean value to true or false, at least one person will say something like “Never write == true or == false.” One person said, “Whenever you write == true or == false, God kills a kitten.”.

Those types of comments are pretty annoying. It’s just a style thing. We all know that these two conditionals are equivalent:

    if (timer1.Enabled == true) { ... }
    if (timer1.Enabled) { ... }  // the "== true" is implied

Complaining about the code containing == true is like complaining about using braces to enclose a single statement, or the extraneous parentheses in x = x + (y/z);, or getting mad because the person didn’t write x += y/z;. It’s a style thing! If you don’t like it, just move on.

Until recently, I honestly thought that’s all it was: immature griping about a minor style issue. But recent comments indicate that several people think there’s some difference in the generated code. That is, they think that if (timer1.Enabled == true) requires an extra operation to compare the values! That’s not true at all, as you can clearly demonstrate by compiling a simple program and looking at the generated code.

    System.Timers.Timer myTimer = new System.Timers.Timer();
    if (myTimer.Enabled) Console.WriteLine("Enabled!");
    if (myTimer.Enabled == true) Console.WriteLine("Enabled!");

Compiling that in Release mode with Visual Studio 2013 results in the following intermediate code. I’ve added a few comments.

  // start of first conditional: if (timer1.Enabled) ...
  IL_0000:  ldarg.0
  IL_0001:  ldfld      class [System]System.Timers.Timer sotesto.Program::myTimer
  IL_0006:  callvirt   instance bool [System]System.Timers.Timer::get_Enabled()
  IL_000b:  brfalse.s  IL_0017
  IL_000d:  ldstr      "Enabled!"
  IL_0012:  call       void [mscorlib]System.Console::WriteLine(string)
  // start of second conditional if (timer1.Enabled == true)
  IL_0017:  ldarg.0
  IL_0018:  ldfld      class [System]System.Timers.Timer sotesto.Program::myTimer
  IL_001d:  callvirt   instance bool [System]System.Timers.Timer::get_Enabled()
  IL_0022:  brfalse.s  IL_002e
  IL_0024:  ldstr      "Enabled!"
  IL_0029:  call       void [mscorlib]System.Console::WriteLine(string)

Those two bits of code are identical. Each one calls the get accessor for the myTimer.Enabled property, and then branches around the WriteLine if the value is false.

Note that this is intermediate code. The JIT compiler might be able to inline the get_Enabled method, which would eliminate the method call. Regardless, the code for these two snippets is identical. So stop with the == true bashing, already.

Free the Scratch!

The other day I overheard two women talking about baking cakes. I wasn’t eavesdropping; they were sitting at the table next to mine. Nor was I paying much attention to the conversation, engrossed as I was in the technical manual that so often serves as my lunch partner. But then one of the women said: “You bake from scratch?” That totally broke my concentration. It was time to go anyway.

As I walked back to the office, I got to wondering what this magical “scratch” stuff is that people make things from. I often hear people say that they bake from scratch, but that’s not all it’s used for. I hear programmers and builders say they’re going to “start over from scratch.” In almost every field I’ve examined, there’s somebody making things from scratch.

It took me a while to track this down, but what I discovered flies in the face of all the science we’ve been spoon fed over the years. Scratch builders have learned how to transmute matter and make wondrous things from seemingly nothing. Science teaches us that you can’t turn lead into gold–that you can’t change the fundamental properties of matter. I haven’t seen a scratch builder turn lead into gold, but I wouldn’t put it past them. Why would they want to, anyway, when they can take beans from a plant, add some butter and sugar, and come up with chocolate? Gold has nothing on chocolate, right?

Scratch is, despite what your science teachers will try to tell you, the basic building block of the universe. Forget chemistry and quantum physics with all their arcane formulas, quirks, quarks, electron orbitals, and difficult math. Those fields of study are just red herrings intended to divert your attention from where the real action is. You won’t see “Scratch Studies” at even the best universities because the powers that be don’t want the word to get out. They don’t want just anybody knowing how to transform scratch into all the wondrous things that we see every day.

The Scratch Police let amateurs dabble in the field–baking cakes or building clunky radios from spare parts. They do this because it turns out that the principles of scratch are quite simple and they can’t stop people from discovering them. But they keep a close eye to ensure that a baker, for example, doesn’t figure out that he can use the same principles of scratch to make a CD player, a car, or anything else. If that knowledge got out, the economy would crash like nothing you’ve ever seen before.

But it’s time the word got out. We can no longer be held hostage by the Scratch Police and the small group of people who hold so tightly the knowledge of scratch. My life is forfeit, I know. The Scratch Police will come track me down when they see this blog, and I doubt that Charlie will be able to protect me from them. If you see this blog, please copy and email it to everybody on your address list. Help spread the word and free us from the tyranny of those who would keep this fundamental knowledge to themselves.

Free the Scratch!

ReSharper kills comments

I’ve said before that I really like ReSharper. Apart from Visual Studio itself, it’s the most valuable development tool I have. Whereas Visual Studio increases my productivity by letting me create code faster than I would with a simple text editor and command line tools, ReSharper helps me create better code. It also provides tools that make it easier to understand unfamiliar code and refactor older code to bring it up to modern standards. Altogether, I’ll agree with my friend Dennis who says that if you’re writing C# code professionally and you’re not using ReSharper, you’re probably committing malpractice.

That said, ReSharper has a few annoying features. Fortunately, most of them can be disabled or their behavior modified through options settings. I discovered one yesterday, though, that I find particularly disturbing.

Imagine you have this declaration in your code:

    Func<bool> canExecute = () =>
    {
        // In this case, we can always execute the function because …
        return true;
    };

ReSharper helpfully suggests converting that statement body into a simple expression. Seems a reasonable thing to do. If you tell ReSharper to go ahead and make the transformation, you end up with:

    Func<bool> canExecute = () => true;

That’s right, it deleted the comment! Not a good thing.

This isn’t a particular problem because I applied the transformation individually and saw the comment disappear. But if you modify the settings to automatically apply the transformation when you run Code Cleanup, you won’t see the comments being removed. I repeat: Not a good thing.

That’s unfortunate because it forces me not to use the automatic “convert to lambda expression” function in Code Cleanup. Not a deal killer, but a little annoying.

Time management?

During the dot-com boom (1998 or 1999), one of my coworkers overheard me discussing stocks with another guy in the office. He later asked me for investment advice–a fairly odd experience for me. My investment knowledge is rudimentary at best. But I knew from previous discussions that my coworker (we’ll call him P) had some pretty significant credit card debt. The conversation went something like this:

Me: P, I can guarantee you 18% on your money.

P: Really? What stock?

Me: No stock. Just take the money you were going to buy stock with and pay off your credit card.

P: But … but … I’m wanting to save money. Invest!

Me: You’re paying 18% or more on your credit card debt. If you’re lucky you’ll average 10% on your stock investments. So even if you do well, you’re losing 8% per year. If you pay off your credit card debt, then at least you break even.

After that we got to the real issue: P was looking for tips on stocks that would skyrocket. 18%? Bah! He was looking for 1,000% gains. I told him that if I knew how to do that I wouldn’t be slaving away at the pixel mines.

I’m no investment guru, but I understand basic math.

Friends fairly regularly ask me where I find the time to do my wood carving. Some seem to think that I’m some great time management guru. Nothing could be further from the truth. I’ll admit that I’m terrible at time management. I’m easily distracted if I’m not very interested in what I have to do, and if I do get interested I’ll get lost in whatever I’m working on. It’s not uncommon for me to look at the clock and find that it’s 3:00 AM when I thought it might be approaching midnight.

Time management is not one of my strong points.

But finding time to work on my wood carving is no problem at all, and I suspect most of the people I know could do the same thing. How? Turn off your television!

According to recent Nielsen data, Americans aged 35 to 49 watch an average of 33 hours of television per week. 33 hours? That’s nearly five hours a day! And that number increases as we get older. Americans older than 65 watch an average of seven hours of television every day. What a waste.

I’ve mentioned this to people before, and they look at me like I’m crazy. “Give up my <insert name of favorite show here>? No way!”

No skin off my nose. But if you’re sitting in front of the idiot box and wondering where I find the time to do things, maybe you should re-think your priorities. My dad used to say, “We find time to do the things we think are important.” In my experience, that’s true. And if the Nielsen data is reliable, most of the people asking me where I find the time are spending five hours a day staring at the answer.

Crime and punishment

I was 13 years old when I discovered that I could sneak out my bedroom window at night after Mom had gone to bed. Dad was often out of town or working late and I’d normally be home before he returned, so I had a few hours after my official bed time to spend cutting up with my friends who also would sneak out.

One Sunday night a few of us pooled our resources and managed to get an older kid to buy us some beer and wine. I think there were four of us, a six pack of Budweiser, and a bottle of Boone’s Farm something or other. We went down by the creek and proceeded to get stupid.

I don’t remember a whole lot about the walk back home. Just little bits and pieces. We did discover some kid’s Big Wheel left in a yard at the top of a hill, and we took turns trying to ride it down the hill. As I recall, we were all too big to actually ride the darned thing. We’d just sit in it, try to make it roll, and then give up in disgust. Yeah, we were pretty drunk and probably making all kinds of noise in the neighborhood as we stumbled home.

I lived farthest from where we had been, so I got to go the last half mile by myself. I got tired of walking and sat down under a tree in somebody’s front yard. Next thing I remember, I was waking up under that tree, and throwing up in the grass. I got up, wiped my mouth, and stumbled the rest of the way home.

When I got home and tried my window, it was closed. I guess Mom discovered that I was sneaking out. I knew it was later than I usually got home after one of my after hours excursions, but I didn’t know how late. I tried the back door, which was locked, and every other window in the house. No dice. I wasn’t thinking too clearly and I needed some sleep, so I went to the front door and rang the bell.

Mom opened the door, took one look at me, and said, “You’re drunk. Go to bed.”

In my inebriated state, I wasn’t in much condition to consider the ramifications of my good fortune. No scolding or anything? I just went to my room, stripped off my clothes, and passed out on the bed. It was 2:00 AM.

Dad woke me up at 6:00 AM. “Son,” he said, “time to shower and head to school.” I was still drunk and tried to pull the “I don’t feel good” routine. But Dad was having none of it. He made it clear that being drunk or hung over was no excuse for skipping school.

I felt marginally better after a shower and a glass of orange juice, but that day was miserable. I went through most of the morning in a fog, and the afternoon was a giant headache. By the time school was over all I wanted was to go home and get some sleep. But when I got there, Mom had a bunch of chores for me to do. I didn’t get to bed until my normal bed time.

The only thing Dad said to me about that incident was that I can’t use my bad decisions as excuses to shirk my responsibilities. Neither he nor Mom ever mentioned it again.

They could have let me sleep, and punished me that afternoon: extra chores, restricting me to school and home, no phone privileges, etc. But making me go to school was the most fitting punishment Dad could have devised. He forced me to face the direct consequences of my actions. Any other punishment would have been a proxy, and not nearly as effective.

I doubt that the memory of a grounding would have stuck with me for 40 years the way my memory of that day at school has. I rarely drink on a “school night,” and if I do overindulge, I don’t use that as an excuse to skip work. All because my parents wouldn’t let me skip school because I was hung over. I’d say they did a good job teaching me a valuable life lesson.

Carving a whimsical house, Part 3

This is the third in a series of posts about carving a little whimsical house from an old cedar fence post. See Carving a whimsical house, Part 1 for the introduction.

In Part 2, I finished carving the roof. The next thing to do is establish the roof line and gables, and carve the sides down in preparation for the doors and windows. To that end, the first order of business is to draw the roof line similar to this:

[Images: roofline1, roofline2]

[Image: typhoonBud]

I use a Typhoon “bud” burr, shown on the right, to cut the roof line in the wood. If you don’t have that particular burr, a flame shape or a taper would probably do just as well. This line doesn’t have to be terribly precise; just be careful not to carve away the roof.

[Image: roofline3]

The picture on the left shows the house after I’ve cut the roof line. You probably noticed that the bottom line looks like I cut it with the same burr. I might have. That bottom line, by the way, is where I’ll carve the rocks that the house sits on.

The next thing to do is carve the space between the roof and the base relatively flat so that the roof overhangs the house. I’ve used several different burrs for this and still haven’t decided which one I like the best. The 9/16″ Typhoon that I used for rough shaping the roof works okay, although it’s kind of hard to get the middle portion. However you do it, be careful not to go too deep. Also don’t worry about getting it perfectly flat. At this point, the most important part is to carve down the triangle that makes the gable on each end.

[Images: gable1, gable2]

Above you can see that I’ve carved the gables, and then drawn lines to outline the trim board that serves to separate the wall from the gable. The next step is to relieve that board by carving down the wood above and below it.

[Image: gable3]

Carving away the wood below is easy enough: just outline with the “bud” burr and then carve the rest of the wall. Carving the gable is a bit of a pain. I’m still experimenting with the best way to do that. If the area is large enough, I’ll use the bud or a flame shaped Typhoon burr. If it’s smaller, then the ruby carver seems to work pretty well. However I’ve done it, though, I’ve not been able to get a good sharp line. I keep trying.

gable4Here’s what my house looks like after carving both gables and the walls on all four sides.

[Image: rocksbase]

Again using the Typhoon bud burr, I roughed out the rocks that make up the base of the carving. I probably made a mistake using that large burr for this, because the resulting rocks have too much space between them. I had to do a lot of smoothing and shaping work to get it to look even halfway decent in the end. Live and learn, I guess. I would suggest taking your time with a ruby carver or perhaps a long taper Typhoon if you have one. In any case, use something smaller and less aggressive than what I used.

After I finished roughing out the rocks, I spent a little time with the flap sander to smooth the walls a bit so that I had a reasonably flat surface on which to carve the doors and windows. If you don’t have a flap sander, a sanding drum will work as will any other burr that leaves a relatively smooth surface. The surface doesn’t have to be completely free from scratches, as you’ll be cleaning it up later. But you probably want something pretty flat for the door and window frames.

Next time: doors and windows.

Carving a whimsical house, Part 2

This is the second in a series of posts about carving a little whimsical house from an old cedar fence post. See Carving a whimsical house, Part 1 for the introduction.

[Image: roof1]

In the first part, I cut a piece of fence post into four blocks, each 2″ x 2″ x 3″ high, and drilled a 3/8″ hole in the center. The result is shown on the left.

I should note here that all of the carving on this piece is done with the Foredom power carver using various bits and burrs. It’s important when using these tools to wear eye protection, have some type of dust collection apparatus, and wear a respirator or some other way to protect your lungs. Power carving produces a lot of dust, and you do not want that stuff in your lungs. You might also want to wear ear plugs if your dust collection system is especially noisy.

I also want to point out that this series of posts shows how I carved one house. This isn’t the only way to do it. In fact my method is constantly changing as I become more familiar with using the tools. This is only the fifth one of these houses I’ve made, so I’m still just a beginner.

With that out of the way, let’s get started.

Oh, by the way, you can click on any of these pictures to get a larger view, although the “larger” view might not be a whole lot larger; I’m currently having trouble uploading larger pictures.

I find it best to start with the roof. Establishing the roof line sets the stage for the rest of the house. Also, if you carve the rest of the house and leave the roof for last, it’s very possible that you’ll run out of wood to complete the roof you want. The style of roof I’m creating here can eat up quite a bit of wood. I had to throw a Cottonwood bark house away once because I didn’t leave enough space for the roof. I was carving that one with knives and gouges.

[Image: roof2]

In the picture above, you can see that I’ve drawn a rough profile for the roof and chimney. Using a 1/2″ coarse Typhoon burr, I first outline the chimney.

The picture at the right shows how I begin to rough out the roof. The Typhoon burr is pretty aggressive, so I’m careful around the chimney, and I leave a lot of extra wood. I’ll come back later with a less aggressive burr and shape it.

Although the burr is aggressive, the cedar is a medium-hard wood and it takes some time to remove all that wood. Be patient and check your outline from time to time so that you don’t take off too much.

[Image: roof3]

It took me about 20 minutes to finish roughing out the roof to this point. That’s okay. I’m not in a race to see how quickly I can carve one of these little houses. I’d rather take a little extra time than get in a hurry and either destroy the carving or, worse, lose control of the tool and injure myself. Running that Typhoon burr over a thumb hurts. Trust me.

[Image: roof4]

Remember, too, that making a mistake isn’t fatal. These houses are supposed to be whimsical. They’re certainly not architecturally correct. If you inadvertently carve through the chimney, for example, don’t worry too much about it. You can always carve it to look like the chimney is falling apart. Carvers don’t make mistakes; we make adjustments.

[Image: roof5]

Next, I shape the chimney using a smaller and less aggressive Typhoon burr. I also put a few dips and humps in the roof surface in order to make it a little less uniformly flat, although that turns out not to be necessary for the roof style I chose; the process of adding the roof tiles makes for an irregular surface.

The last thing I did before beginning to carve the roof tiles is go over the roof with a 120 grit flap sander to remove most of the scratches left by the Typhoon burrs. Again, that’s not strictly necessary because the next step has me going over the roof with a finer ruby carver, but it’s part of the process for me–something I do regardless of the roof style I choose.

[Image: ruby1]

I do all of the tiling with the flame-shaped ruby carver shown on the right. I’ve tried other bits, particularly for carving the lines between roof tiles, but they end up making deep narrow lines that I then have to spend time removing. I like the flame shape, but the ruby carver is a bit less aggressive than I’d like. It also tends to clog up on cedar, and I have to clean it now and then with a brass brush. Don’t use a steel brush.

[Image: rooftile1]

I chose to do a tile roof on this house. This is something I’ve tried once before with a Cottonwood bark house, and also on an earlier Cedar house. Those two didn’t turn out so well. I experimented with one side of this roof to refine my technique. The photos and description I show here are from the second side. I think I almost have this roof style figured out.

Here you can see that I’ve outlined the first tile. The next step is to remove wood to the right and below so that the tile stands out from the roof. Then, draw lines for the next two tiles.

[Image: rooftile2]

When I started, I found it helpful to draw lines for the tiles before outlining each one. I got the hang of it after a while and began just carving the next tile without first drawing a line. Do what you feel comfortable with.

[Image: rooftile3]

Next, outline those two tiles, carve away wood to the right and below, and outline the next tile. Note that you don’t have to carve the whole rest of the roof down after outlining each tile. Instead, just carve down enough for the next tile. What you’re going for is a gentle slope from the top left to the center bottom.

[Image: rooftile4]

Work your way down and to the right, outlining and carving away wood for each tile until you get approximately to the center of the roof. When you’ve completed the left side, it should look something like the large photo on the right.

[Image: rooftile5]

The general flow of the roof should be towards the right and down. That is, tiles on the left should appear “above” tiles on the right. It’s okay if some tiles appear to stick up above where they “should” be; that’s part of the house’s whimsical nature. But do try to keep the general flow to the right and down.

[Image: rooftile6]

Next, start at the top right corner and do the same thing, but work down and to the left. Again, take your time. When you get close to the center, where the tiles from the left meet the tiles to the right, you’ll probably have to make some adjustments. If there’s a trick to making that come out just right, I don’t yet know what it is. I will say, however, that this is the best I’ve done so far.

[Image: rooftile7]

And that’s one side of the roof tiles, almost completed. You can see that I made a few mistakes, the biggest one being there on the far right where I have one tile sitting completely on top of another. It looks a bit strange, and is not what I had planned. Don’t know how I managed that.

If you have a smaller ruby or diamond carver, you might want to sharpen the edges between tiles. I suspect that with a bit more practice I’ll be able to get sharp edges between all the tiles. I did a passable job here, but some of the lines aren’t as clean as I’d like them.

That’s it for the roof tiles. Next time I’ll rough out the roof line and carve the gables.

Carving a whimsical house, Part 1

[Image: house_banner]

An online carving group with which I’m involved is doing a “friendship stick” project. Each of 10 carvers makes 10 carvings from a 2″ x 2″ x 3″ block of wood, and sends one carving to each of the other carvers. Every carving has a 3/8″ diameter hole drilled through it from top to bottom. When we receive the carvings, we display them stacked on a 3/8″ dowel.

The only rules are that the carving cannot be larger than 2″ x 2″ x 3″, and it has to have that hole in the middle. Beyond that, anything goes. I decided that I’d make little houses from a cedar fence post. The picture above is one of the houses I’ve carved for the project. This blog post is the first of several parts showing how I go from fence post to finished house.

[Image: fencepost]

Above is a 13 inch long piece of cedar fence post. We’re replacing the 30-year-old fence around our property, and I pulled this post from the ground a month or two ago. The wood isn’t really cedar, but rather Ashe Juniper, Juniperus ashei. The stuff grows like weeds all over Central Texas, and the wood is commonly used for fence posts. That it’s held the fence up for 30 years is good testament to its suitability for that purpose.

The first step in making a little house is turning this post fragment into blocks. And the first step of that process is making one side reasonably flat. After carefully checking for and removing any nails and fence staples, I took the piece over to the bandsaw. I set my fence about 1/2″ away from the blade, adjusted the height, and made the cut.

[Image: bandsaw1]

(Yes, the picture is a little blurry.)

You can see a few ripples in the cut. Normally I would have done this with a resaw blade (a 1/2″ blade with four teeth per inch), but I had the 3/16″ blade on the saw and didn’t want to mess with changing it. The resaw blade would have made for a straighter cut, but this was good enough for my purposes.

With one flat side, I could then take the piece of wood over to the table saw to finish squaring it up. I used to do this on the bandsaw, but it’s easier to get square sides with the table saw.

[Image: tablesaw1]

The picture above shows the piece with three sides squared up. After cutting the second side, I set the fence for 2″ and cut the other two sides. Then I set it for 3″ to cut the blocks to the right length.

[Image: blocks]

The resulting blocks aren’t quite square because my initial cut with the bandsaw wasn’t perfect. I could have made allowances for that and squared things up on the table saw, but it just wasn’t that important to me. As long as the blocks are approximately square and don’t exceed the size limitation, it’s good enough for making these houses.

The next step is drilling a 3/8″ hole through the center of the block. Again, perfect placement isn’t terribly important, although I don’t want to be too far off. I used to do this with a hand drill, but I recently got a drill press, which makes things a bit easier. I just mark the center by drawing lines from corner to corner, and then clamp it in my drill press.

[Image: drillpress]

Unfortunately, my drill press’s range is about two and a half inches. So in order to drill through a 3″ block of wood I had to clamp the block in my bench vise and use a hand drill to finish the job.

[Image: drill]

With the block cut and a hole in the middle, it’s time to start carving. Stay tuned.

Part 2, in which we carve the roof.