One reason I spent so much time last fall working on my generic heap implementation was that I wanted a TopBy
extension method for LINQ to Objects. So, for example, if I have a list of integers and I want the 10 largest, I could write something like:
var top10 = theList.TopBy(10, k => k);
Without a TopBy
method, the LINQ way to do that is:
var top10 = theList.OrderByDescending(k => k).Take(10);
That’s all well and good when the list is quite small but when working with lists that contain millions of items, having to do a complete sort just so I can get the top 10 is hugely expensive in both memory and processing time. And if the list is larger than will fit in memory, the LINQ OrderBy
solution flat doesn’t work.
Given my heap class, it’s easy enough to create an extension method that does something similar to what OrderBy
does. If you ignore the keySelector
parameter and assume that the items have a default comparer, then it’s almost trivial:
public static IEnumerable<TSource> TopBy<TSource>( this IEnumerable<TSource> source, int numItems) { var comparer = Comparer<TSource>.Default; var heap = new MinDHeap<TSource>(3); foreach (var item in source) { if (heap.Count < numItems) { heap.Insert(item); } else if (comparer.Compare(item, heap.Peek()) > 0) { heap.ReplaceRoot(item); } } // and return in reverse heap order return heap.GetConsumingEnumerable().Reverse(); }
Wrap that up into a static class and you can get the top 10 items from a list with:
var top10 = theList.TopBy(10);
Adding the ability to specify a custom comparer is straightforward: something that I covered in my heap discussions. That would produce this function:
public static IEnumerable<TSource> TopBy<TSource>(
this IEnumerable<TSource> source,
int numItems,
IComparer<TSource> keyComparer);
I could probably get by with that just fine. If I need to compare on multiple fields, I can just write a different IComparer
. And if I need to do some fancy key selection, I can either do it in the comparer, or write a front end filter that gets the keys that I need and builds a list of intermediate objects I can work with. It’s not something that I need to do very often.
At least, that was my thinking until I ran into a situation where building that list of temporary objects turned out to be a giant pain in the neck. Here I was able to sort with no trouble by using a key selector and several ThenBy
operations, but to select the top 10 I had to jump through all kinds of hoops. It was just ugly.
Consider, for example, ordering a list of items based on how many users liked or disliked them. You want the items that have the most likes to appear on top, but if two items have the same number of likes, you want the one with the fewest dislikes to be shown first. So, given three items:
bar, 50 likes, 8 dislikes
foo, 10 likes, 1 dislike
bas, 50 likes, 1 dislike
You would present them in the order (bas, bar, foo) because bar and bas both have 50 likes, but bas has fewer dislikes. The LINQ code to order them would be:
var ordered = items
.OrderByDescending(x => x.Likes)
.ThenBy(x => x.Dislikes);
No custom comparer is required. But if I wanted to do that with the TopBy
method shown above, I’d have to create a custom comparer and pass it. What a pain in the neck. And that’s just a simple example. If generating the keys required any calculation I’d want to pre-compute the keys once so that I wouldn’t wast time constructing a key every time I wanted to compare it. As I said, I could get by with the primitive method, but it’d be much nicer if TopBy
would work the same way as OrderBy
.
So I thought I’d look into creating TopBy
and TopByDescending
methods that I can use the same way that I use LINQ’s OrderBy
and OrderByDescending
. That turned out to be more involved than I thought it would be.
The key is the IOrderedEnumerable interface which, although simple enough on the surface, turns out to be a lot more complicated to implement than the documentation would lead you to believe. That’s where I’ll start next time.