When theory meets practice

A fairly common problem in programming is selecting the “best” items from a large group. For example, I have an application that makes video recommendations based on user input. The program takes the user’s input (typically the name of a song, an artist or a band), matches that up with our index of videos, and then does some calculation to come up with a relevance score for each video. The details of that calculation are interesting, but will have to be covered another time.

When all the calculation is done, I’ll have a list of perhaps two million possible video recommendations, and I need to select the top 200 (based on relevance score) to present to the user. There are several ways to do this, as described in Selecting k smallest or largest elements.

In the discussion below, n is the total number of items in the list and k is the number that I want to select. Using the numbers above, n would be equal to 2,000,000, and k would be 200.

When I first wrote the code, I used the simple heap selection algorithm. In pseudo code, it looks like this:

Create an empty min heap
Add the first k items from the array to the heap
for items k+1 to n
    if (item > smallest item on the heap)
        Remove smallest item from the heap
        Add new item to the heap

That algorithm performs quite well with our data. It is somewhat sensitive to order, though. If the recommendations in the items array were sorted by relevance score in ascending order, it would perform very poorly because every item would have to be added to the heap, and all but the last k items would be removed from it again. Fortunately, it would be highly unlikely to see such an ordering in our data.

I was discussing this problem with a friend recently who suggested that my algorithm was less than optimal and that I should look into a better approach.

In big O notation, that heap selection algorithm’s complexity is O(n log k). Adding an item to a heap of size k takes up to log(k) operations, as does removing an item. And since in the worst case every item gets added to and removed from the heap, there will be n insertions and n removals. That’s 2*(n log k), but constant factors are removed in big O notation, so we’re left with O(n log k).

There are at least two algorithms listed in the Wikipedia article that have better theoretical running times than my simple heap selection technique. One technique involves turning the array of two million items into a max heap and then doing a breadth-first traversal of the first k nodes. Turning an array into a heap is an O(n) operation, and the breadth-first search is O(k log k), giving us a complexity of O(n + k log k).
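
The O(n) step is the standard bottom-up heap construction (Floyd's method). As a point of reference, a sketch of that step looks like this; it's illustrative only, not the code used in the tests below.

// Sketch only, not the benchmarked code: Floyd's bottom-up heap construction,
// which turns an array into a max heap in O(n) time.
static void BuildMaxHeap(int[] a)
{
    // sift down every non-leaf node, starting with the last one
    for (int i = a.Length / 2 - 1; i >= 0; --i)
    {
        int item = a[i];
        int parent = i;
        while (true)
        {
            int child = 2 * parent + 1;
            if (child >= a.Length)
                break;

            // pick the larger of the two children
            if (child + 1 < a.Length && a[child + 1] > a[child])
                ++child;

            if (a[child] <= item)
                break;

            a[parent] = a[child];
            parent = child;
        }
        a[parent] = item;
    }
}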

The other interesting algorithm is called QuickSelect. It uses the Quicksort partitioning algorithm to partition the array in linear time. QuickSelect is based on the idea of finding the median element in an array, which you can do in linear O(n) time. Given the median element, partitioning an array such that all items less than the median are to the “left,” and all items greater than or equal to the median are to the “right” is also an O(n) operation. That’s not exactly how QuickSelect works, but that’s the idea behind it. I should note that QuickSelect has expected linear time. Its worst case, though (when data is ordered or partially ordered), is really bad.

So in theory, the fastest algorithm should be QuickSelect, followed by the breadth-first heap search (which I call BFSHeap), followed by the simple heap selection algorithm (HeapSelect). I thought I’d do some tests to see how well the theory reflects reality.

The BFSHeap turned out to be a bad idea. Although the theoretical complexity analysis tells us that it should be faster than HeapSelect, the O(n) time to build the heap has a huge constant. The time to build the heap is proportional to the number of items, but the constant of proportionality is very large: it takes more than twice as long for BFSHeap to turn the array into a heap as it does for QuickSelect to select items. Because of that, I didn’t include BFSHeap in my detailed analysis. It’s interesting to note, though, that BFSHeap performs better than HeapSelect when k exceeds about 8% of n. But it’s still much slower than QuickSelect.

QuickSelect, like QuickSort, can be inefficient if the data is sorted or reverse-sorted, and is also sensitive to ordered (or partially ordered) subranges. A simple median of three (or, better, median of five) pivot selection reduces or eliminates those cases. I used a median of three in my tests. Median of five will give slightly better average performance, and much better worst case performance.
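
For the curious, a median-of-five pivot selection can be as simple as sampling five positions, sorting the samples, and dropping the median into items[left], where the Partition function below expects to find the pivot. This is just a sketch; it isn't part of the code that was benchmarked.

// Sketch only: median-of-five pivot selection. Sample five positions spread
// across the range, sort the samples by value, and swap the median into
// items[left], where Partition expects the pivot.
static void MedianOfFivePivot(int[] items, int left, int right)
{
    int[] sample =
    {
        left,
        left + (right - left) / 4,
        left + (right - left) / 2,
        left + 3 * (right - left) / 4,
        right
    };

    // insertion sort the five sampled positions by the values they hold
    for (int i = 1; i < sample.Length; ++i)
    {
        int pos = sample[i];
        int j = i - 1;
        while (j >= 0 && items[sample[j]] > items[pos])
        {
            sample[j + 1] = sample[j];
            --j;
        }
        sample[j + 1] = pos;
    }

    // sample[2] now holds the position of the median value
    int temp = items[left];
    items[left] = items[sample[2]];
    items[sample[2]] = temp;
}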

HeapSelect, as I noted above, also performs poorly with ordered data, although it’s not as sensitive to partially ordered subranges as is QuickSelect.

Here’s the data for those of you who don’t want to read the rest of this note. In short, HeapSelect is faster if you’re selecting up to about 1% of the items in the list. QuickSelect is faster beyond that, and hugely faster by the time you get to 10%. HeapSelect has the disadvantage of requiring O(k) extra space, but when k is small, that extra space won’t normally be a problem.

Times are in milliseconds, averaged over 100 runs; in every case the selection is from an array of 2,000,000 items, and the columns are values of k.

                           k=20     k=200   k=2,000  k=20,000  k=200,000
QuickSelect (random)      24.36     25.66     25.18     25.85      28.54
HeapSelect (random)        4.97      5.23      7.34     24.49     139.06
QuickSelect (sorted)      15.18     15.02     14.97     14.95      14.97
HeapSelect (sorted)      107.91    196.84    235.28    293.36     330.93
QuickSelect (reverse)      7.81      7.73      7.63      7.65       7.60
HeapSelect (reverse)       5.11      4.92      5.07      6.23      22.81
QuickSort (random)       356.90    357.95    357.96    358.24     358.01
QuickSelect1 (sorted)    153.58  1,520.43       n/a       n/a        n/a
QuickSelect1 (reverse)    80.56    791.03       n/a       n/a        n/a

All of my tests were written in C# and run in release mode with the 64-bit runtime in .NET 4.0. Each test was run 100 times after an initial run to eliminate JIT times, and the time reported is the average of those 100 runs.
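
The harness itself can be sketched along the following lines; the helper below is illustrative rather than the original test code, but it captures the setup: one untimed warm-up call, then 100 timed runs against fresh copies of the data.

// Sketch of the timing setup, not the original test code. One untimed
// warm-up call absorbs the JIT cost; the reported time is the average
// of 100 timed runs, each against a fresh copy of the data.
static double TimeIt(Action<int[], int> select, int[] source, int k)
{
    const int Runs = 100;

    select((int[])source.Clone(), k);       // warm-up run, not timed

    var sw = new System.Diagnostics.Stopwatch();
    double totalMs = 0;
    for (int run = 0; run < Runs; ++run)
    {
        int[] work = (int[])source.Clone(); // fresh copy; QuickSelect reorders the array
        sw.Restart();
        select(work, k);
        sw.Stop();
        totalMs += sw.Elapsed.TotalMilliseconds;
    }
    return totalMs / Runs;
}

// For example:
//   double avgMs = TimeIt((a, kk) => QuickSelect(a, kk), data, 200);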

The QuickSort is included for comparison. Obviously if you sort the data then taking the top k items is trivial. But it’s expensive. In the average case, QuickSelect was more than 10 times as fast as QuickSort.

As noted, QuickSelect is an O(n) algorithm. Given an array of size n, it will take the same time to run, regardless of how many items you want to select from the array. Selecting 200 items from an array of 2,000,000 takes just as long as selecting 200,000 items from the same array. My QuickSelect implementation running on the configuration I noted above runs at about 13 milliseconds per million items. Selecting from an array of 100,000 items takes about 1.3 milliseconds. With 2,000,000 items, it takes about 26 milliseconds. The times are very consistent across a large range.

QuickSelect1 is a modified QuickSelect without the median-of-three pivot selection. I included the timings to show how poorly a naive QuickSelect implementation will perform when presented with ordered data. I only included timings for selecting 20 and 200 items, as it’s obvious where things are heading: to select 2,000 items would take 10 times as long as it took to select 200 items. QuickSelect with a median-of-three pivot selection reduces the likelihood of running across these pathological cases, but it’s still possible to hit them in real-world data. Check out median of three killer for more information. A median-of-five pivot selection would be better.

HeapSelect is O(n log k) in the worst case, meaning that the more items you want to select, the longer it will take. Selecting 200 items will be a lot faster than selecting 200,000 items. How much longer is an interesting calculation. Based on the theoretical worst-case running times, the difference should be log(200,000)/log(200), or roughly 2.3 times as long. But that’s only true in the worst case, when the data is in sorted order. The real running time is much more skewed.

The running time of HeapSelect is dominated by the number of items added to the heap. When the data is unordered, the actual number of items added to the heap is proportional to k–the number of items to be selected. When k is small in comparison to n, the number of items added to the heap is also small. For example, here are some sample values, again run against an array of 2,000,000 items. m is the number of items actually added to the heap.

      k         m      m/k
      2        29    14.5
     20       249    12.45
    200     2,038    10.19
  2,000    15,831     7.915
 20,000   112,093     5.605
200,000   660,508     3.303

When selecting 200 items, only about 2,000 items are added to the heap. When selecting 200,000 items, 660,000 items are added to the heap: about 300 times more items. This explains why the actual running times are so much different from the theoretical running times. It doesn’t take 300 times as long to select 200,000 items, but it does take 25 times longer.

Note also that when k is 2, m is about 15 times k. As k increases, m also increases, but not in proportion. When k reaches 200,000, m is only about 3 times as large as k. And, of course, when k is the same size as the array, m == k.
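
Those counts come straight from the NumHeapAdds counter in the HeapSelect source at the end of this post. A loop along these lines reproduces the table; the shuffle and the array contents are just for illustration.

// Sketch: count heap additions for several values of k against a shuffled
// array of 2,000,000 items. NumHeapAdds is the counter declared with the
// HeapSelect source below; the shuffle here is only for illustration.
var rnd = new Random();
int[] data = new int[2000000];
for (int i = 0; i < data.Length; ++i)
    data[i] = i;
for (int i = data.Length - 1; i > 0; --i)   // Fisher-Yates shuffle
{
    int j = rnd.Next(i + 1);
    int temp = data[i];
    data[i] = data[j];
    data[j] = temp;
}

foreach (int k in new[] { 2, 20, 200, 2000, 20000, 200000 })
{
    NumHeapAdds = 0;
    HeapSelect(data, k);
    Console.WriteLine("k = {0:N0}  m = {1:N0}  m/k = {2:F3}",
        k, NumHeapAdds, (double)NumHeapAdds / k);
}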

In my initial tests, HeapSelect used a generic BinaryHeap collection class similar to this one. In an attempt to level the playing field (after all, QuickSelect was hand-coded), I moved the relevant portions of that BinaryHeap collection in-line so that I was operating on the items array directly. The result was between three and four times as fast as the version that used the BinaryHeap class. That made HeapSelect faster than QuickSelect when k is up to about one percent of n; with the original version, QuickSelect is faster once k exceeds one-tenth of one percent of n.

The primary reason for the difference in run time is that the custom heap code uses an array rather than a generic list, and there is minimal error checking. In addition, I was able to combine the “remove lowest and add new” sequence into a single “replace” operation–something that the generic BinaryHeap implementation didn’t supply.

The ideal selection algorithm would be a hybrid that uses the best approach based on how many items are being selected. I haven’t seen such a thing. There is a hybrid, Introselect, based on David Musser’s Introsort, which gives better performance than QuickSelect if the data is ordered, but it’s unclear to me whether it would perform better than HeapSelect when k is very small in comparison to n.
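
Rolling your own hybrid for this particular job is easy enough, though. Here's a sketch; the one percent cutoff comes from the timings above and would need tuning for different data and hardware.

// Sketch of a simple hybrid selector. The one percent cutoff comes from
// the timings above; the right value depends on the data and hardware.
static int[] HybridSelect(int[] items, int k)
{
    if (k <= items.Length / 100)
    {
        // small k: HeapSelect is faster and doesn't disturb the array
        return HeapSelect(items, k);
    }

    // larger k: QuickSelect partitions the array so that the k largest
    // items end up in the first k positions (in no particular order)
    QuickSelect(items, k);
    int[] result = new int[k];
    Array.Copy(items, result, k);
    return result;
}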

HeapSelect, by the way, has a couple of other advantages over QuickSelect and similar algorithms. First, it doesn’t modify the array. QuickSelect moves things around, which might be a bad thing. You can of course make a copy of the array and use QuickSelect on that, but then you’re using O(n) extra space. If you can’t modify the input array, then HeapSelect becomes much more attractive.

On a similar note, HeapSelect works if you don’t know the total number of items to be examined, or if you can’t hold the entire list of items in memory. Say, for example, you want the top 200 items from a file of several billion records, far more than will fit in memory. HeapSelect shines in this situation because it can start at the beginning and go through the file sequentially. When it reaches the end of the file, the heap will contain the top 200 records. Another example is selecting the top k items that you see over time, for example records coming in over a network stream. With HeapSelect, you don’t need to keep track of every item, just the top k that you’ve seen so far.
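
A streaming version is a small change to the HeapSelect code below: accept an IEnumerable<int> rather than an array. The following sketch mirrors the custom heap logic from the source at the end of this post; the file name in the usage comment is hypothetical.

// Sketch: a streaming version of HeapSelect that accepts any sequence, so
// the source can be a huge file or a network stream. The min heap logic
// mirrors the custom heap in the HeapSelect source below.
static int[] HeapSelectStream(IEnumerable<int> items, int k)
{
    int[] heap = new int[k];
    int count = 0;
    foreach (int item in items)
    {
        if (count < k)
        {
            // heap not full yet: insert the item and sift it up
            int i = count++;
            while (i > 0 && heap[(i - 1) / 2] > item)
            {
                heap[i] = heap[(i - 1) / 2];
                i = (i - 1) / 2;
            }
            heap[i] = item;
        }
        else if (item > heap[0])
        {
            // larger than the smallest of the current top k:
            // replace the root and sift the new item down
            int i = 0;
            while (i < count / 2)
            {
                int j = (2 * i) + 1;
                if (j < count - 1 && heap[j] > heap[j + 1])
                    ++j;
                if (heap[j] >= item)
                    break;
                heap[i] = heap[j];
                i = j;
            }
            heap[i] = item;
        }
    }
    return heap;
}

// Usage, with a hypothetical file of one score per line:
//   int[] top200 = HeapSelectStream(
//       File.ReadLines("scores.txt").Select(int.Parse), 200);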

The lesson here is that, whereas it’s important to know the theoretical complexity of the algorithm that you’re using, actual running times vary based on the implementation and on the nature of the data. In this case, the “inferior” HeapSelect outperforms QuickSelect when the number of items to select is small in proportion to the total number of items. QuickSelect is a better general-purpose selection algorithm and the one I would select if I could only have one, but HeapSelect is clearly better for a large number of real-world situations. It’s common to want the top .01 percent of items from a large list, and HeapSelect is four or five times faster than QuickSelect in that case.

It’s also important to remember that big O notation ignores constant factors. O(n) means that the running time will be proportional to n: for every order of magnitude change in n, the running time will increase by an order of magnitude. An O(n log n) algorithm is, theoretically, less efficient than an O(n) algorithm. However, big O doesn’t say what the constant of proportionality is. For example, the real running time of that O(n) algorithm might be 100,000 * n, whereas the real running time of the O(n log n) algorithm is 10 * n * log(n). The O(n log n) algorithm will be faster than the O(n) algorithm until 10 * log(n) exceeds 100,000.

So know your algorithms, but also know your data. When possible, select the algorithm that will give you the best performance for your expected application rather than the algorithm that gives the best overall performance.

The source for QuickSelect and HeapSelect is below.

QuickSelect source

static int QuickSelect(int[] items, int k)
{
    return QuickSelect(items, 0, items.Length - 1, k - 1);
}

static int QuickSelect(int[] items, int left, int right, int k)
{
    // get pivot position  
    int pivot = Partition(items, left, right);

    // if pivot is less than k, select from the right part  
    if (pivot < k) return QuickSelect(items, pivot + 1, right, k);

    // if pivot is greater than k, select from the left side  
    else if (pivot > k) return QuickSelect(items, left, pivot - 1, k);

    // if equal, return the value
    else return items[pivot];
}

static int Partition(int[] items, int left, int right)
{
    int i = left;
    int j = right;

    var Swap = new Action<int, int>((l, r) =>
        {
            int temp = items[l];
            items[l] = items[r];
            items[r] = temp;
        });

    // pick the pivot point and save it
    int pivot;
#if MEDIAN_OF_THREE
    // Median of three optimization improves performance in general,
    // and eliminates worst-case behavior for sorted or reverse-sorted data.
    int center = (right + left) / 2;
    if (items[center] < items[left])
        Swap(center, left);
    if (items[center] < items[right])
        Swap(center, right);
    if (items[left] < items[right])
        Swap(left, right);
    // median of [left], [middle], [right] is now at [left]
#endif
    pivot = items[left];

    // until the indices cross
    while (i < j)
    {
        // move the right pointer left until value >= pivot
        while (items[j] < pivot && i < j) --j;

        // move the right value to the left position
        // increment left pointer
        if (i != j)
        {
            items[i++] = items[j];
        }

        // move the left pointer to the right until value < pivot  
        while (items[i] >= pivot && i < j) ++i;

        // move the left value to the right position  
        // decrement right pointer  
        if (i != j) items[j--] = items[i];  
    }
    // put the pivot holder in the left spot
    items[i] = pivot;

    // return pivot location
    return i;
}

HeapSelect source

// Used to keep track of total heap additions
static int NumHeapAdds = 0;
static int[] HeapSelect(int[] items, int k)
{
#if !CUSTOM_HEAP
    var heap = new BinaryHeap<int>();
    for (int i = 0; i < k && i < items.Length; ++i)
    {
        heap.Insert(items[i]);
        ++NumHeapAdds;
    }
    for (int i = k; i < items.Length; ++i)
    {
        if (items[i] > heap.Peek())
        {
            heap.RemoveRoot();
            heap.Insert(items[i]);
            ++NumHeapAdds;
        }
    }

    return heap.ToArray();
#else
    int[] resultHeap = new int[k];
    int heapCount = 0;

    var Insert = new Action<int>((newItem) =>
        {
            int i = heapCount;
            resultHeap[heapCount++] = newItem;
            while (i > 0 && resultHeap[(i - 1) / 2] > newItem)
            {
                resultHeap[i] = resultHeap[(i - 1) / 2];
                i = (i - 1) / 2;
            }
            resultHeap[i] = newItem;
        });

    var ReplaceRoot = new Func<int, int>((newItem) =>
        {
            int rslt = resultHeap[0];
            int tmp = newItem;
            int i = 0;
            while (i < heapCount / 2)
            {
                int j = (2 * i) + 1;
                if ((j < heapCount - 1) && (resultHeap[j] > resultHeap[j + 1]))
                {
                    ++j;
                }
                if (resultHeap[j] >= tmp)
                {
                    break;
                }
                resultHeap[i] = resultHeap[j];
                i = j;
            }
            resultHeap[i] = tmp;

            return rslt;
        });

    // put the first k items on the heap
    for (int i = 0; i < k && i < items.Length; ++i)
    {
        Insert(items[i]);
        ++NumHeapAdds;
    }

    // and then check the rest
    for (int i = k; i < items.Length; ++i)
    {
        if (items[i] > resultHeap[0])
        {
            ++NumHeapAdds;
            ReplaceRoot(items[i]);
        }
    }

    return resultHeap;
#endif
}

Whittle Pup takes five

Debra and I went to Zilker Botanical Garden on Sunday, and took Whittle Pup along. He got tired after wandering around for a while and asked to take a break with his frog buddies. The frogs didn’t show up, but Pup had a good time hanging out in the pond.

When asynchronous calls complete synchronously

The .NET Framework has a rich asynchronous programming model (APM) that’s about as close to “fire and forget” as you’re likely to want. True “fire and forget” usually ends up being “fire and forget until it crashes,” so you want notification when an asynchronous task completes.

There are other asynchronous design patterns in .NET, including an event-based asynchronous pattern, the BackgroundWorker, and the Task Parallel Library, but the pattern I linked above has been around since the beginning and is still quite useful.

I call this the “callback design pattern.” The idea is that you tell the system, “perform this operation on a separate thread and call me back when it’s done.” In the example below, the MyThing class has BeginTask and EndTask methods that operate as recommended by the APM.

void DoSomething(MyThing thing)
{
    IAsyncResult ir = thing.BeginTask(TaskDoneCallback, thing);
    // Not doing anything with ir, yet.
}

// This method is called when the operation has completed.
void TaskDoneCallback(IAsyncResult ir)
{
    var thing = (MyThing)ir.AsyncState;
    var rslt = thing.EndTask(ir);
    // do something with the computed result
}

The idea here is pretty simple. The call to BeginTask starts the operation executing on a separate thread. When the operation is done, it calls TaskDoneCallback (still executing on that other thread), which can then process the result as required.

The above works great. Let me add a new wrinkle to make it more interesting.

Imagine now that you want the asynchronous task to run continually so that it can generate data whenever it’s available. What you want is for the callback to start a new asynchronous task, like this:

void TaskDoneCallback(IAsyncResult ir)
{
    var thing = (MyThing)ir.AsyncState;
    var rslt = thing.EndTask(ir);

    // Start a new async task
    thing.BeginTask(TaskDoneCallback, thing);

    // now do something with the computed result
}

Here, the callback function gets the results of a task, starts a new asynchronous task, and then processes the results returned by the first task.

Again, that all works great. You can call BeginTask once in your main program, and that task gets performed over and over until you cancel it. Assuming, of course, that you’ve added the cancellation code.

Okay, so it works great most of the time. There’s an edge case in which this can lead to unbounded stack space usage.

You see, there’s the possibility that an asynchronous call could complete synchronously. That is, something in the runtime environment determines that there’s no need (or perhaps no ability) to spin up a new thread so that it can execute the task asynchronously. Instead, it will execute the task on the calling thread. When that happens, TaskDoneCallback is executed on the calling thread. If this happens repeatedly, then the main thread begins looping and consuming stack space. It behaves as though you’d written this:

void BeginDoTask()
{
    TaskDoneCallback();
}

void TaskDoneCallback()
{
    BeginDoTask();
}

That, as you well know, will blow the stack in short order.

The solution to this problem is kind of ugly:

void AsyncLoop()
{
    for ( ; ; )  // Infinite loop!
    {
        IAsyncResult ir = thing.BeginTask(TaskDoneCallback, thing);

        // see if it completed synchronously
        if (!ir.CompletedSynchronously)
            break;
        
        this.InvokeEnd(ir);
    }
}

void TaskDoneCallback(IAsyncResult ir)
{
    if (ir.CompletedSynchronously)
        return;

    this.InvokeEnd(ir);

    this.AsyncLoop();  // Start another async operation
}

void InvokeEnd(IAsyncResult ir)
{
    var thing = (MyThing)ir.AsyncState;
    var rslt = thing.EndTask(ir);

    // now do something with the computed result
}

If the method executes asynchronously, then things work exactly as I showed earlier. But if the method completes synchronously, the callback returns immediately without doing anything, and the loop in AsyncLoop calls InvokeEnd to process the data.

There is one other difference of note in this code: it processes the computed result before making the next asynchronous call. The result is that if the “do something with the computed result” step takes any significant time, the program could lose data. This is a good argument for making your processing take as little time as possible. If it’s going to take more than a few milliseconds, you probably want to queue the item for processing by another thread.
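
One way to do that with .NET 4.0 is a simple producer/consumer handoff using BlockingCollection. The sketch below is illustrative; the Result type and the ProcessResult method are hypothetical stand-ins for whatever your EndTask returns and whatever the slow processing work is.

// Sketch: hand each result off to a worker thread so the callback can start
// the next asynchronous operation right away. BlockingCollection lives in
// System.Collections.Concurrent (.NET 4.0). The Result type and the
// ProcessResult method are hypothetical stand-ins for your own code.
BlockingCollection<Result> pending = new BlockingCollection<Result>();

// called once, at startup
void StartResultConsumer()
{
    var worker = new Thread(() =>
    {
        // blocks until an item is available; ends when CompleteAdding is called
        foreach (var rslt in pending.GetConsumingEnumerable())
            ProcessResult(rslt);
    });
    worker.IsBackground = true;
    worker.Start();
}

// and in InvokeEnd, instead of doing the slow work inline:
//     pending.Add(rslt);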

In truth, I’ve never seen this infinite recursion problem happen, and off the top of my head I can’t imagine a real-world scenario in which it would be a problem. I can, however, contrive some cases. Although I can’t see myself writing code that would suffer from this problem (basically, writing a tight loop as a chain of asynchronous calls), I can imagine others doing it.

I’d put this one far down on the list of things that can go wrong with your asynchronous code. If you’ve written code that works as I described initially, then I wouldn’t worry about changing it to the new model unless all the other bugs in your program are fixed.

Pie server in mesquite

This is a pie server, carved from a piece of mesquite I picked out of the firewood pile. Total length is about ten inches. The blade is just a little over 2-1/4 inches at its widest point.

I mentioned a week or so ago that I was having difficulty cutting a straight line with the bandsaw. I still need a bit more practice, but the biggest help was getting a better blade. I was using a cheap 1/4″, 6 tooth-per-inch blade that I picked up at Lowe’s. I broke that blade on Saturday and replaced it with a 3/16″, 4 TPI blade that I had hanging on the wall. I thought that blade was bad, because I was having trouble with it. But my troubles were apparently due to the bandsaw being misaligned, and I fixed those problems last month.

That 3/16″ blade wasn’t cheap ($25 or $30), but it makes a huge difference. Not only does it cut straighter, it cuts cleaner and faster, even through a piece of mesquite that was 6″ thick. No more cheap blades for me.

I cut the rough shape of this pie server out on the bandsaw, used a knife to clean up the cuts a bit and shape the handle, then used a belt sander clamped upside-down on the bench to flatten the blade. The rotary tool finished shaping the handle, and then it was a bunch of hand sanding up to 600 grit before finishing with Howard Butcher Block Conditioner.

The cracks radiating from the knot were fairly deep. I filled them with super glue, sanded it smooth, and repeated the process twice more to get all the cracks filled in. I’m rather happy with the result.

Christmas dog

It’s Christmas ornament time again. I’ll be making quite a few this year, including a dozen or more of these little dogs.

Like most of my little dogs, this one is just a little over two inches tall. The wood is mesquite. The hat, nose, and tongue are painted with acrylics and the entire piece is finished with Howard Feed ‘N Wax (a mixture of orange oil and beeswax).

This particular dog is the prototype. It should have a little metal loop at the top so it can hang from the tree, but I didn’t drill the pilot hole deep enough and the brass screw eye I tried to insert broke off. I’ll be sure to drill a deeper hole in the rest of them.

Getting videos from YouTube

YouTube has a very rich API that developers can call to get information about videos, post new videos, etc. Lots of Web sites use this API to do all kinds of wonderful things with YouTube videos.

So suppose you wanted to create a little application that works like an “all news” channel. But instead of just showing you CNN or MSNBC or whatever channel, it gathers news from lots of different channels and aggregates them. You then have an “all news all the time” channel that shows lots of different stories and will give you different viewpoints on the same story. Fox news, for example, will have a much different take on a story than will CNN or Al Jazeera.

The YouTube API gives you at least three ways to get new videos from a particular YouTube user. The simplest is to do a search and filter it by the YouTube author name. For example, this query:

http://gdata.youtube.com/feeds/api/videos?orderby=published&max-results=50&author=AssociatedPress&v=2

will give you the 50 most recent videos that were posted by the YouTube user “AssociatedPress”.

There’s a problem, though: the results in that feed are sometimes delayed by more than 15 minutes. If I go to the AssociatedPress page on YouTube, I’ll often see videos there that do not show up in the feed returned by that query.

You can get up-to-date results by querying the individual user:

http://gdata.youtube.com/feeds/api/users/AssociatedPress/uploads?orderby=published&max-results=50&v=2

With that query, I’m able to get the most recent videos. Anything that shows up in the YouTube page for that user also shows up in the results I get back from the query.

But that’s just one source! If I want my news channel to get data from 50 different sources, I’d have to poll each one individually. That’s not terribly difficult, but it takes time. YouTube places limits on how often I can poll for videos. To keep from being throttled, you have to wait at least 15 seconds (a minute or longer if you don’t have a developer API key) between requests to the API. So getting the latest videos from all 50 sources will take somewhere between 12 and 50 minutes. Perhaps longer.

Not many sources post a video every 12 minutes, so there’s a lot of waste involved in polling all the time. And 50 minutes is way too long to wait for an update. 50 minutes? It’s not news anymore!

There is a third way. Go to your YouTube account and subscribe to the sources that you’re interested in. According to the documentation, you can subscribe to up to 2,000 different sources, and whenever you go to your subscriptions page you’ll see the most recent videos from those sources.

The really nice thing is that you can then do a single query to get the most recent videos from all your sources:

http://gdata.youtube.com/feeds/api/users/YourUserName/newsubscriptionvideos?orderby=published&max-results=50&v=2

Replace YourUserName in the URL above with the name of the user who subscribed to the sources you want to see.

With this, you can poll YouTube once every five minutes and get the most recent videos from all of your sources.
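
A minimal poller might look something like the sketch below. It assumes the feed comes back as standard Atom, uses WebClient, XDocument, and Thread.Sleep (System.Net, System.Xml.Linq, and System.Threading), leaves YourUserName as a placeholder, and omits error handling and duplicate detection.

// Sketch: poll the subscription feed every five minutes and print what it
// returns. Assumes the feed is standard Atom; "YourUserName" is a placeholder,
// and error handling, duplicate detection, and cancellation are omitted.
const string FeedUrl =
    "http://gdata.youtube.com/feeds/api/users/YourUserName/newsubscriptionvideos" +
    "?orderby=published&max-results=50&v=2";
XNamespace atom = "http://www.w3.org/2005/Atom";

using (var client = new WebClient())
{
    while (true)
    {
        var feed = XDocument.Parse(client.DownloadString(FeedUrl));
        foreach (var entry in feed.Descendants(atom + "entry"))
        {
            Console.WriteLine("{0}  {1}",
                (string)entry.Element(atom + "published"),
                (string)entry.Element(atom + "title"));
        }
        Thread.Sleep(TimeSpan.FromMinutes(5));
    }
}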

Another benefit is that you don’t have to change your program if you want to add or remove sources. All you have to do is log in to YouTube and edit your subscriptions.

Reduce your bandwidth usage and get your videos sooner. Sounds like a great deal to me.

Fun with the bandsaw

Browsing in the store the other day, Debra showed me a set of bamboo toast tongs that she liked. I suppose I knew such things existed, but I never considered using them. I do, however, like a challenge.

So yesterday I pulled out the bandsaw (now on wheels, so it’s easy to move around) and experimented with a piece of scrap lumber. Sure enough, if I cut it right the wood would flex and not break.

I ended up making two sets of tongs. The set on top is made from mesquite and is nine inches long. The other set is mahogany, a little over eight inches long. The thing at the bottom of the picture is a simple spatula made from mesquite.

I’m beginning to believe that it’s impossible to cut a straight line with a bandsaw. At least, I haven’t figured out how, yet. Maybe I just need more practice.

The tongs are pretty easy to make. The pattern is just two lines 3/4″ apart, connected with a half circle at the top. Then move out about 1/4″ and draw the outside perimeter.

Tape or glue the pattern onto a piece of wood. Then, clamp a piece of wood to the back side and drill the 3/4″ hole at the top. The second piece of wood is clamped there to keep the piece from splintering when you drill through it.

After you’ve drilled the hole, cut the outside pattern first. Then cut the inside. If you cut the inside first, the wood tends to flex a bit when you’re running it through the bandsaw to cut the outside.

If you want to create more than one set of tongs, you can save some wood by laying them out interleaved. You have to be a bit more careful with the bandsaw cuts, but you can get three sets of tongs from less wood than it would take to make two if you laid it out the simple way.