Michael Abrash famously said, “The best optimizer is between your ears.” His point is that no amount of machine-specific optimization is going to make a difference in the running time of a bad algorithm. A poorly written Quicksort, for example, will outperform the most highly optimized Bubble sort. Selecting the right algorithm can give you orders of magnitude (or even more!) improvements in performance. Hand optimizing code usually provides small percentage improvements after an efficient algorithm is selected.
In most of what I do, performance is important, but not super-critical. That is, performance is usually “good enough” if I select an efficient algorithm. As such, I don’t worry too much about micro-level optimizations. In this environment, the .NET Framework and the C# language are very nice. I’m incredibly productive with C#. The language and the runtime library give me many tools with which to quickly and (reasonably) efficiently implement my algorithms. I know that I’m paying a performance penalty. C# applications that do a lot of processing are typically around 10% less efficient than programs compiled to native machine code. But that’s okay. Very little of what I do is so performance sensitive that I’m willing to forego the productivity of .NET for the efficiency of native code.
But when performance is critical, I get very annoyed because the C# way of doing things is often very sub-optimal.
I’ve recently been working on a system that finds similar videos based on Web links. That is, if I like a particular video that Alice linked to on her blog, then it’s likely that I’ll also like other videos that Alice linked to. If 10 people linked to the video, then I might be interested in the other videos that those people linked to. So imagine a search facility that, given a search term (say, “bluegrass”), will find all videos that have that term in the title and then does a spreading activation by activating all of the videos that are linked to by people who linked to the videos in the initial result set.
So, searching for “bluegrass” might find 3,000 videos that have the term in their titles. Those 3,000 videos are in turn linked to by a total of, say, 4,000 different people. Combined, those 4,000 people can have many millions of different links, although there will be duplicates (that is, Alice, Bob, and George could all link the same video). We can score a video by relevance and popularity by taking into account the number of links it gets and the relative weight of each referrer’s links. But to do that I have to:
- Create a referrer activation record for each referrer that links to any of the videos in the initial result set.
- Group the referrer activation records by referrer, and compute a referrer score.
- Create a video activation record for each video that the activated referrers link to.
- Group the video activation records by video, and compute the video’s score.
This isn’t difficult to do, even when the scoring calculation is involved. It’s just creating lists, sorting, and then grouping. It’s what we programmers do all too much of every day. And it shouldn’t take very long, either: sorting and grouping tens of millions of items should be a matter of a second or two on modern hardware, even without taking advantage of muliple cores.
My first working cut of this took about 20 seconds, and a relatively few minutes’ time optimizing cut that in half. And then I was stuck. The first three steps happen in well under a second. But no matter what I did, the sorting and grouping portion (step 4) would take almost 10 seconds on a typical query that had about 10 million video references. And that just seemed like way too much time. I tried lots of different things before I got it into my head to try something extreme.
My video reference record was a simple
KeyValuePair<int, double>, where the
Key is the video id, and the
Value is the score from a particular referrer. Those were being placed into a
List that I then sorted by video id (
Key). The sort was taking most of the time. And that’s just way too much time to sort 10 million items.
In retrospect, I should have thought of this earlier. But things are always obvious after the fact.
You see, I’ve run into this kind of thing before. My sort was written like this:
VideoReferences.Sort((r1, r2) => r1.Key.CompareTo(r2.Key));
I’ll let you look into the details of what exactly that syntax means, but in a nutshell it ends up creating a comparison function that is called through a delegate for each comparison. And, the structures I’m sorting (
KeyValuePair) occupy 16 bytes each, meaning that for each swap that occurs during the sort I’m moving 48 bytes around. Like I said, I don’t normally think about this stuff because performance is usually good enough.
But benchmarking results showed conclusively that the sort was taking way more time than I thought it should. So, I thought, what if I could make a list of a native type and sort that? What, indeed.
I changed my reference structure so that it takes only 8 bytes. It now looks like this:
public ulong Combined;
public float VoteScore;
public int VideoId;
For those of you who are unfamiliar with C#, that’s the C# way of declaring a union. It packs the integer video id and the floating point score (
float is fine here for individual values) into a single 8-byte structure. The
Combined field treats the two values as a single unsigned long integer, with the video id in the high 32 bits.
What’s that do for me? Well, I can then write this:
VideoReferences = new List<ulong>(NumberOfReferences);
// ... populate the list with
var v = new VidVote();
v.VideoId = id;
v.VoteScore = score;
// and then sort it.
With this structure,
Sort uses the default comparison for
ulong, which internally does a native 64-bit integer comparison directly rather than calling a comparison delegate.
The result? Sort time went from almost eight seconds to less than two seconds, and the subsequent grouping code is also faster than in the previous version. I’m a little bit annoyed that I had to resort to these bit twiddling tricks, but I can’t complain too much about a 70% decrease in run time.
And the really nice thing is that given this new structure I can see another improvement that should knock another big chunk off the run time.