Inconceivable!

Every time I hear President Obama say something “will not be tolerated,” I’m reminded of Vizzini’s “inconceivable!”

The silly sanctions we’re placing on Russian officials will have approximately the same effect as the Bush administration’s pointless sanctions on North Korea back in 2006: banning the export of iPods, Segway scooters, and other luxury items so that Kim Jong-Il wouldn’t be able to give trinkets to his cronies.

But the administration has to bluster and posture and appear to be “doing something about the problem” even though the president and anybody with more brains than your average mailbox knows that we are completely powerless: Russia will do what Putin wants, openly and without fear of reprisal.

Why do I think we’re powerless? First, military confrontation is out of the question. Russia isn’t a little country with an insignificant military that we can overpower with a division or two of National Guard troops. Air strikes, even if we were stupid enough to try and could get past the Russian defenses, would be met with major reprisals and ultimately result in a real war that nobody wants and that, if fought, would destroy the world economy.

Not that we could make any kind of military strike. President Obama isn’t dumb enough to try convincing us that Ukraine is worth another foreign excursion, and no member of Congress who is up for re-election this year is likely to approve such a thing even if the president were to suggest it.

Second, serious economic sanctions are out of the question because large parts of Europe depend on Russian natural gas for heat. Nobody in Europe wants to risk upsetting Putin, because he could easily turn off the spigot. The first cold snap come Fall would have every citizen in Europe demanding that The Powers That Be let Russia have whatever they want, so long as the gas is turned back on. Putin isn’t beyond punishing our European allies in response to American “provocation.”

It does make for good theater, though. Too bad we can’t bottle all the hot air from both sides. Between Obama’s rhetoric, Putin’s laughable denials, and Republicans’ excoriating the president for his lack of action, we could make a serious start to reducing the amount of fossil fuels burned to keep warm next Winter.

Random selection

If you know how many items are in a collection, picking one at random is trivial: just select a random number from 0 to count-1, and you’re done. Things get more interesting when you don’t know how many items are in the collection. The naive way to do the selection is to first make an array. Then you can use the solution I just mentioned. For example:

    readonly Random _rnd = new Random();

    int GetRandomItem(IEnumerable<int> items)
    {
        var a = items.ToArray();
        var ix = _rnd.Next(a.Length);  // Next(n) returns a value in [0, n), so this is a valid index
        return a[ix];
    }

While that works, it can be expensive or even impossible to create a new array if the collection is very large.

When the question is modified to include the restriction that the collection won’t fit in memory (say, a text file with hundreds of millions of lines), things get more interesting still. You could go through the file to count the lines, pick a random number from 0 to count - 1, and then go back through the file to read that many lines. In code:

    var count = File.ReadLines(filename).Count();  // reads the whole file
    var ix = _rnd.Next(count);                     // random index in [0, count)
    var sel = File.ReadLines(filename)             // potentially reads the whole file
        .Skip(ix)
        .First();
    return sel;

If the collection is a file, you end up reading the whole file once to count the lines, and then reading part of it again to reach the selected line. On average, the second pass reads half the file. It works, but it’s pretty expensive.

Worse, it’s impossible to use that technique if the collection is a bunch of records coming in from a network stream. It’s not like you can rewind the stream and start over after you figure out how many records there are.

There is a way to do this that doesn’t require knowing up front how many items are in the collection, and guarantees that each item has the same probability of being selected. Using this technique, you can select any number of items in a single pass through the collection. Provided, of course, that you have enough memory to hold the selected items.

Imagine that you are asked to select an item at random from a sequence of items presented to you one at a time. You don’t know how many items will be presented, and you have to prove that each item was given equal opportunity to be selected. You can’t decide to pick the first item every time, and you can’t decide to pick the fifteenth item, for example, because there might not be fifteen items presented. What to do?

First, since you always have to be prepared to present a selected item, it follows that you must select the first item because there might be only one item presented. But you can’t unconditionally keep the first item. When the second item is presented, you have to decide whether you should keep the first item, or replace it with the second item. How? By flipping a coin. Heads you keep the first, tails you replace it. Assuming a fair flip of a fair coin, then after two items half the time you’ll be holding the first item, and half the time you’ll be holding the second item.

Now the third item comes along. How do you decide whether to keep your previously selected item, or if you should replace it with the third item? You can’t flip a coin because doing so would mean that half the time you’d keep the third item. You want to keep the third item one third of the time.

What you really want is to roll a three-sided die and if it turns up a 3, replace the currently selected item with that third item. It should be obvious that the third item will be selected one third of the time.

If you don’t believe that this technique treats all items equally (i.e. gives every item an equal chance of ending up as the selected item), consider the possibilities with three items.

    First   Coin    Die   Result
      1     Heads    1      1
      1     Heads    2      1
      1     Heads    3      3
      1     Tails    1      2
      1     Tails    2      2
      1     Tails    3      3

Those are all of the possibilities when selecting from three items. Each number is selected in two of the six cases, or one third of the time.

If you continue in this manner, selecting the nth item only if rolling an n-sided die turns up n, then when you’ve reached the end of the list you will have given each item an equal chance of being selected.
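
To see why this works in general, consider item k in a sequence of n items. It gets selected when it arrives with probability 1/k, and it survives each later item j (for j from k+1 through n) with probability (j-1)/j. Multiplying those together gives (1/k) × k/(k+1) × (k+1)/(k+2) × … × (n-1)/n, and everything cancels except 1/n. Every item, regardless of where it appears in the sequence, ends up selected with probability 1/n.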

Let’s see if that really does work. The program below uses the described technique to randomly select a value from 1 to 100. The program does that 10 million times and reports the results.

    private void DoIt()
    {
        const int arraySize = 100;
        const int iterations = 10*1000*1000;
        var selected = new int[arraySize];

        for (var i = 0; i < iterations; ++i)
        {
            var sel = SelectRandomInt(arraySize);
            ++selected[sel];
        }

        for (var i = 0; i < selected.Length; ++i)
        {
            Console.WriteLine("{0}: {1} ({2:P4})", i, selected[i],
                (double) selected[i]/iterations);
        }
        Console.WriteLine("Expected value = {0:N0}", iterations/arraySize);
        Console.WriteLine("Min value = {0:N0}", selected.Min());
        Console.WriteLine("Max value = {0:N0}", selected.Max());
    }

    private readonly Random _rnd = new Random();

    private int SelectRandomInt(int max)
    {
        var selected = -1;
        for (int i = 0; i < max; ++i)
        {
            if (_rnd.Next(i + 1) == i)
                selected = i;
        }
        return selected;
    }

The “expected” number that’s output is just the number of iterations divided by the number of items. In this case, 100,000. That’s the number of times you would expect any item to be selected, given a sufficiently large number of trials. This program outputs the number of times each item is selected, but below I show just the minimum and maximum.

    Expected value = 100,000
    Min value = 98,012
    Max value = 101,257

Not surprisingly, not every item was selected exactly 100,000 times. But it was close. The maximum variance was less than 2% on the low side, and about 1.3% on the high side. How does that compare with just selecting a random number from 0 to 99?

I replaced the SelectRandomInt method above with one that just returns a random number from 0 to max-1, like this:

    private int SelectRandomInt(int max)
    {
        return _rnd.Next(max);
    }

That gave me:

    Expected value = 100,000
    Min value = 99,069
    Max value = 100,842

This second version consistently gives me results that are closer to the expected values. The variance is approximately half. That is, where the first version might vary by 2% on the low side, this version varies by 1%. I think the difference is due to the random number generator being slightly biased (that is, it doesn’t produce a perfectly uniform distribution), and because we’re selecting many more random numbers with the first version than with the second, that bias is amplified. It’s not so far out of whack that I find it unusual, but I do think it merits further study.

Random number weirdness aside, you can see here that I was able to randomly choose one item from a collection of unknown size.

You wouldn’t use that technique to select a random number from 0 to 99. A real method would take an IEnumerable<T> and return a randomly selected item. The code looks like this:

    T SelectRandomItem<T>(IEnumerable<T> source, Random rnd)
    {
        T selectedItem = default(T);
        int nItems = 0;
        foreach (var item in source)
        {
            ++nItems;
            if (rnd.Next(nItems) == 0)
            {
                selectedItem = item;
            }
        }
        return selectedItem;
    }

Note that I modified the test a bit. In the previous version I checked to see if the selected random number was equal to the top of the range. Here, I test to see if it’s equal to zero. The effect is the same because the conditional is selecting one item from n possible items. The probability that the random number will be 0 is the same as the probability that it’s equal to (nItems - 1). The reason I changed it here is that it’s easier for me to think about it this way, and it simplifies the task of modifying this code to select multiple items at random.

Using that code is very simple. If you want to select a line from a large text file, you can do it in a single line of code:

    string selected = SelectRandomItem(File.ReadLines("foo.txt"), new Random());

You might ask why I pass in the random number generator rather than creating a new one when the method is called, or expecting there to be one at class scope. There are several reasons.

The default Random constructor seeds the random number generator with the current time, which is expressed in milliseconds. If I were to call SelectRandomItem multiple times in a single millisecond (which could happen when working with short lists), I’d be very likely to select the same item each time. Consider, for example, selecting from two lists:

    IEnumerable<string> firstNames = LoadFirstNames();
    IEnumerable<string> lastNames = LoadLastNames();
    var firstName = SelectRandomItem(firstNames);
    var lastName = SelectRandomItem(lastNames);

If firstNames and lastNames were very short, then whenever I selected the third item from one, I’d be very likely to select the third item from the other. That wouldn’t make for a very good random name generator.

Programmers often want a repeatable random sequence, especially for testing purposes. That’s not possible if the method creates a new random number generator with every call. It’s possible if there is a random number generator dedicated to the SelectRandomItem method, but if any other piece of code uses that random number generator or calls SelectRandomItem, then the sequence is lost. If I want to guarantee a repeatable random sequence, then I need to supply the random number generator to the method that uses it.
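
As a quick illustration of that last point (the seed value and file name here are arbitrary), supplying a seeded generator makes the selection repeatable:

    var rnd = new Random(12345);   // fixed seed for repeatability
    var first = SelectRandomItem(File.ReadLines("names.txt"), rnd);

    rnd = new Random(12345);       // same seed gives the same sequence
    var second = SelectRandomItem(File.ReadLines("names.txt"), rnd);
    // first and second refer to the same line, because both selections
    // consumed the same sequence of random numbers.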

Selecting multiple items

The algorithm to select multiple items in a single pass is a straightforward generalization of the single item selection algorithm. Rather than dealing with a single “currently selected” item, we maintain an array of them. And to decide whether to select a particular item, we get a random number and compare it with the number of items to be selected. So, we replace this line in the inner loop:

    if (_rnd.Next(nItems) == 0)

With this one:

    if (_rnd.Next(nItems) < numToSelect)

If the random number is smaller than the number of items to select, then we have to randomly replace one of the selected items. The resulting method looks like this:

    T[] SelectRandomItems<T>(
        IEnumerable<T> source, int numToSelect, Random rnd)
    {
        T[] selectedItems = new T[numToSelect];
        int nItems = 0;
        foreach (var item in source)
        {
            ++nItems;
            if (rnd.Next(nItems) < numToSelect)
            {
                var ix = rnd.Next(numToSelect);
                selectedItems[ix] = item;
            }
        }
        return selectedItems;
    }

When selecting more than a single item, the output from that method is surprisingly similar to the output from a method that selects unique random numbers in the same range. So similar, in fact, that I can’t tell them apart. I find that a little odd. There is a definite difference in the results from selecting a single item, but when selecting multiple items the difference is not detectable. I do need to investigate why that difference in behavior exists.

And that’s all there is to it: a method that will fairly select one or more items at random from a collection of unknown size, and do it with a single scan of the entire collection. If there is a better way to do this, I don’t know what it is.

Source

I first ran across this problem in Jon Bentley’s Programming Pearls. He cites Volume 2: Seminumerical Algorithms of Knuth’s The Art of Computer Programming. See section 3.4.2: Random Sampling and Shuffling.

The Aho-Corasick string search algorithm


The Aho-Corasick string search algorithm that I mentioned last time works by first building a finite state machine from the list of strings you want to find. After the state machine is built, the input is run through the state machine to find the matches. The key to understanding how the algorithm works is understanding how the state machine works. Fortunately, the original paper, Efficient String Matching: An Aid to Bibliographic Search, by Alfred V. Aho and Margaret J. Corasick, explains the state machine quite well. In fact, the paper is well enough written that I had little difficulty implementing the entire algorithm from it as my only reference.

Last time I said that I was going to explain how the algorithm works. After re-reading the paper and trying to describe the algorithm, I found myself just paraphrasing the original paper. So rather than do a bad job of explaining things, I’ll point you to the paper itself. Download it. Read it. Think about it. Read it again. It is quite approachable. I wish all the research papers I read were as well written.

Building the state machine is a multi-step process that involves first creating a goto function that defines the state transitions, and an output function that defines the outputs (if any) from each state. This process involves reading each keyword character-by-character and inserting the characters into the state machine. Then, a failure function is generated by walking the nodes in the goto function, determining the next node to visit if a failure occurs. An optional final step can be added in which the generated goto and failure functions are combined into a single state transition table. This final step makes walking the state machine easier and more efficient.
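
To make those pieces a little more concrete, here is one plausible shape for the builder’s intermediate data. This is a common way to represent an Aho-Corasick machine, not necessarily the exact layout my StateMachineBuilder uses:

    // Sketch only. State 0 is the root of the keyword trie.
    class BuilderState
    {
        // The goto function: outgoing transitions keyed by input character.
        public Dictionary<char, int> Goto = new Dictionary<char, int>();

        // The failure function: the state to fall back to when no
        // transition matches the current character.
        public int Failure;

        // The output function: keywords that end at this state.
        public List<string> Output = new List<string>();
    }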

The original paper provides pseudocode examples from which it’s pretty easy to create working code. I’ve created two classes: StateMachine, which contains the completed state machine and the searching function, and StateMachineBuilder, which builds a StateMachine from a list of key words. In the past I’ve combined these two things into the same class, but I found the design cumbersome. Having two classes seems like a cleaner design to me. Building the state machine involves a lot of intermediate data structures that the searching code shouldn’t have to be troubled with. Separating the classes gives the searcher access to only the data that it actually needs.

The public interface to the StateMachineBuilder class is pretty simple. The constructor initializes the state machine. You can then call the AddKeyword method to add each key word to the state machine. When all key words are added, a call to the CompleteAdding method will complete processing and return a reference to the generated state machine.

A convenience function, StateMachine.Create(IEnumerable<string>), allows you to create a StateMachine instance from a sequence of keywords in a single call. It’s equivalent to creating a StateMachineBuilder, adding all the keywords, and then calling CompleteAdding.

The public interfaces, then, look like this:

    public struct SearchResult
    {
        public readonly string Keyword;
        public readonly int Position;
    }

    public class StateMachine
    {
        public static StateMachine Create(IEnumerable<string> keywords);

        public IEnumerable<SearchResult> Search(IEnumerable<char> text);
    }

    public class StateMachineBuilder
    {
        public StateMachineBuilder();
        public void AddKeyword(string keyword);
        public void AddKeywords(IEnumerable<string> keywords);
        public StateMachine CompleteAdding();
    }
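
For what it’s worth, the Create convenience method can be thought of as nothing more than a thin wrapper over the builder. A minimal sketch, assuming the StateMachineBuilder interface shown above:

    public static StateMachine Create(IEnumerable<string> keywords)
    {
        var builder = new StateMachineBuilder();
        builder.AddKeywords(keywords);
        return builder.CompleteAdding();
    }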

Imagine that you’re looking in a very large text file for mentions of any name from a list of 100 names. Here’s one way you could do it:

    var names = GetListOfNames();
    var machine = StateMachine.Create(names);
    var lineNumber = 0;
    foreach (var line in File.ReadLines(inputFilename))
    {
        ++lineNumber;
        foreach (var result in machine.Search(line))
        {
            Console.WriteLine("Found '{0}' at position {1} on line{2}.",
                result.Keyword, result.Position, lineNumber);
        }
    }

If the file has no line breaks then File.ReadLines will try to read the entire file at once. That will cause a problem if the file is large. In that case you need to create a function that will return the file’s characters one at a time, and feed that to the Search method:

    foreach (var result in machine.Search(ReadFileByCharacters(inputFilename)))
    {
        Console.WriteLine("Found '{0}' at position {1}",
            result.Keyword, result.Position);
    }

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            while (!reader.EndOfStream)
            {
                char c = (char)reader.Read();
                yield return c;
            }
        }
    }

That’s a particularly inefficient way to read a file, but it’ll do the job. You can speed it up by reading a block of characters into a buffer and then returning characters from the buffer one at a time. Each time you reach the end of the buffer, read another buffer full. Repeat until you’ve reached the end of file. Like this:

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            var buffer = new char[65536];
            int charsRead;
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (var i = 0; i < charsRead; ++i)
                {
                    yield return buffer[i];
                }
            }
        }
    }

That will perform faster than the previous version because it makes many fewer calls to the runtime library.

So that’s what we’re going to build: methods that build the state machine and then use the state machine to search for strings in text. Next time we’ll dive into building the goto and output functions. In the meantime, you should download and read the original Aho-Corasick paper. It really is approachable, and explains the techniques very well.

The password rant

Every time I have to come up with a password for a Web site, I end up spending 10 minutes trying to find one that I can remember and that will conform to the rules for that site. You’d think there’d be some kind of standard. But, no … every site has its own bunch of rules.

One site requires the password to be “at least eight characters long.” Others specify a minimum length and require at least one number, one capital letter, and one special character. Others specify an absurdly short maximum length. They’ll limit the characters you can use, make you use at least two numbers, prevent you from using any word that’s in an English dictionary, or for all I know require that you type it with your toes while standing on your head. It’s maddening.

I could understand this idiocy 10 years ago. Today it’s simply unacceptable. The key to a good password is that it be long, unusual, and easy to remember. But the most important point is length. Next most important is easy to remember. And I guarantee that a password like “stupid.password.rules.suck” is easier to remember and harder to break with a brute force attack than something like “Stup1d.pw0rd”.

To make matters worse, sites often don’t tell you what the password rules are. It’ll say, “enter password.” So you enter something like “my.mom’s.maiden.name” and it will say, “Passwords must contain at least one number and one capital letter.” So then you enter “3rd.rock&Roll.act” and it will say, “Passwords may only contain the characters [list of characters that doesn’t include '&'].” So then you type, “Please.save.M3.From.Stupid.Password.Rules” and it will say, “Passwords must be between 8 and 14 characters in length.”

Why can’t they tell me all of the rules up front? Do they think I’m here to play 20 passwords? This evening my hosting provider told me that my password had expired and that I need to create a new one. Fine. I’m now on my fifth attempt at creating a password that I can remember, that others won’t likely guess, and that fits the rules that they’re revealing to me one at a time.

People who create Web sites: please update your password rules. Force me to create a password that’s at least 15 characters long, and let me put whatever characters I want in it! If you really have to put a limit on the length, make it something reasonable like at least 32 characters. Limiting me to 12 (a banking site, no less) or 14 characters and making me choose from three or four different classes of characters to boot just pisses me off and makes me think that I should find some other place to take my business.


Searching for strings in text

Every mainstream programming language’s runtime library contains a function that will find a substring inside a string. The .NET runtime library, for example, has several such methods. String.IndexOf in its many overloaded variants returns the position of the search string in the larger string, or -1 if the substring is not found. There’s also String.Contains and String.LastIndexOf, as well as Regex.Match and several others.

The String functions are great for determining if a particular string occurs in some text. But if you want to search for matches of multiple strings, you have to search for each one individually. For example, if you want to search for occurrences of the strings {he, she, his, hers}, you’d have to write something like this:

    string[] stringsToFind = new string[] {"he", "she", "his", "hers"};
    foreach (var s in stringsToFind)
    {
        int pos = 0;
        while ((pos = text.IndexOf(s, pos)) != -1)
        {
            Console.WriteLine("Found {0} at position {1}", s, pos);
            pos += s.Length;
        }
    }

That will print almost every occurrence of each word. In some situations it will miss occurrences because it skips over the beginning of a match. For example, if you’re searching for the string “abab” in the text “abababab”, it would only find matches at positions 0 and 4. It would miss the occurrence at position 2. You can fix that by changing the increment to ++pos, which resumes the search at the very next character after each match. It’ll just take longer.

The code ends up scanning the entire text for every word to be found. That’s not a problem when the text is short and the number of words to be located is small. But if you’re looking for occurrences of 1,000 different words in a gigabyte-sized document, this is going to take a long time. The amount of time taken by this algorithm is proportional to the number of words (call it N) times the square of the length of the document (call it M). That is, the complexity is O(N * M²).

Why squared? Because it’s possible that the code would have to start the search at every character in the string. Think of searching for all occurrences of the string “aa” in the text string “aaaaaaaaaaaaaaaaaaaaaaaa…” (‘a’ repeated a billion times). That wouldn’t take forever, but close enough. If the computer could do a billion searches per second, it would only take a billion seconds (almost 32 years). Ouch.

The regular expression methods can do things much faster in the general case because they can search for all four strings in a single scan of the text.

    string text = "those shells aren't hers or his.";
    Regex re = new Regex("he|hers|she|his", RegexOptions.Compiled);
    Match m = re.Match(text);
    while (m.Success)
    {
        Console.WriteLine("Found {0} at position {1}.", m.Value, m.Index);
        m = m.NextMatch();
    }

Here’s the output from running that code:

    Found she at position 6.
    Found he at position 20.
    Found his at position 28.

Again, you see that it missed some occurrences. It didn’t find “he” at position 7 (inside “shells”) or “hers” at position 20. I could fix the missing “he” by incrementing the regular expression starting position, but there’s nothing I can do about it not finding “hers”. Well, I could change the regular expression so that it’s "hers|he|she|his", which would return “hers” at position 20 but then wouldn’t find “he” at the same position. The same thing happens if I change the regular expression to "he(rs)?|she|his". A regular expression can’t find two matches at the same position.

I freely admit that I’m no expert with regular expressions. Even so, I was somewhat surprised by the regular expression finding “he” rather than “hers” at position 20. I wrongly thought that regular expression matching behavior defaulted to “leftmost-longest,” when in fact the .NET engine is a traditional NFA engine that, according to the documentation, accepts the first match it finds.

But “first match” doesn’t explain why it matches “hers” instead of “he” when I reverse the order of the words in the regular expression. After all, the engine should see “he” and stop, right? There’s something about the way the regex engine works with alternation that I don’t fully understand. I realize that I can solve the problem in this case by combining the two alternatives into "he(rs)?", but that becomes problematic when constructing a regular expression at run time from a list of words.

Understanding why alternation works this way would be nice, but I’m not so curious that I’ll spend time digging into the implementation in order to find out. For now I’ll accept that order matters in alternation constructs.

Although the regular expression version will be faster in the general case, it, too, will take essentially forever in pathological cases. And, as you’ve seen, it can’t find everything and it can’t easily be changed to do so.

Another problem is that both techniques require that the entire text be held in memory. If you’re searching for strings in a file that’s larger than available memory, you can’t use either of these methods without writing code that processes the file in chunks.

There is a better and faster way that can find all occurrences (given “hers”, for example, it will find both “he” and “hers”), doesn’t suffer from the pathological worst-case behavior, and that can process text on a character-by-character basis. It’s called the Aho-Corasick string search algorithm. Developed in the mid 1970s, it’s a surprisingly simple algorithm that scales well, is reasonably easy to implement, and performs excellently even with a straightforward implementation. I’ve used it to search for occurrences of millions of different strings among many hundreds of gigabytes of text.

Next time I’ll explain how the algorithm works, and start building code to implement it.

Hand made beer mug

Debra and I went to the Sherwood Forest Faire last weekend, and had a great time. I felt a little out of place, though, wandering around in costume and drinking mead from a plastic cup. Many people had wooden mugs, and there were mugs available for purchase at several shops. Although they were attractive, I didn’t get one. First of all, the least expensive mug I saw for sale was $45. That’s a lot for a hollowed-out piece of wood.

More importantly, the mugs they had for sale were too pretty. Most were perfectly turned on a lathe, had perfectly shaped handles, and were coated with a thick epoxy resin that turned the wooden mug into what was essentially a modern plastic mug that just looks like wood.

Sunday morning I went out to the wood pile and grabbed a hunk of mesquite that seemed about the right size. My plan was to carve a mug that would hold 16 ounces.


After cutting it to size and removing the bark with an angle grinder, I drilled a hole in the center to get started, and then used a die grinder and the Foredom power carver to hollow it. This turned out to be more difficult than I had expected due to the depth. It’s distressingly easy to lose control of the tool when you can’t see where it’s cutting. An aggressive bit spinning at 15,000 RPM will throw a piece of wood across the room when it binds up. I ended up putting a few cracks in the cup that way.


It took a long time to thin the walls from there (using a hook knife), and sand it smooth. I also had to fill the cracks that I made while hollowing. I didn’t take any pictures until after I’d completed the cup and put the oil and wax finish on it.


The finished mug is 4-1/4 inches tall and between 3-1/4 and 3-1/2 inches wide. The width varies because the mesquite log that it came from wasn’t perfectly round.


It holds just a little more than 12 ounces. I just couldn’t get it any deeper with the tools that I currently have. I’ll have to come up with a better way if I’m going to make another one of these. I’ve considered hollowing from both ends and then carving a 1/4 inch thick piece to fit inside at the bottom. The only difficulty would be getting that bottom piece carved correctly. I’m also considering a few other options.


The residue you see floating in the cup there is a little of the oil and wax. Not to worry: the oil and wax are non-toxic. I was impatient to try the mug. It really is a pleasure to drink from.

I decided not to add a handle. Maybe next time. I’ll probably come up with some sort of leather sling that fits around the mug and lets me attach it to my belt so that I can carry it around the next time we go to one of those Renaissance festivals.

ReSharper annoyances

If you’re writing C# code, you should be using ReSharper. It’s a Visual Studio plugin that examines your code in real time (as you edit), helping you avoid common errors and write better code. I admit that I found ReSharper annoying when I first started using it, but after a week or so I found it invaluable. I found it so useful at work that I purchased a license myself so that I can use it at home. It’s not exactly cheap (about $150), but I find that it improves my code and saves me time. Well worth the price.

But sometimes its suggestions are just annoying. One in particular drives me bonkers. Consider this fragment from my heap selection code:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

ReSharper suggests that I “Invert ‘if’ statement to reduce nesting.” If I tell it to perform the transformation, it turns the code into this:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) <= 0) continue;
        // The current item is larger than the smallest item.
        // So move the current item to the root and sift down.
        Swap(0, _numItems);
        SiftDown(0);
    }

So rather than having a simple conditional and a short bit of subordinate code, it inverts the conditional and adds an extra logical branch in the flow. All for no good reason. Rather than saying “If condition, do x,” the code is now saying “If not condition, go on to the next iteration; otherwise do x.” Negative logic and a more complicated control flow do not, in my opinion, constitute an improvement.

ReSharper is suggesting that I make my code harder to read and harder to understand.

Ostensibly, ReSharper is helping to avoid the arrow anti-pattern, but it’s being too aggressive. I fully agree that one should avoid writing such hideous code. However, the arrow anti-pattern involves “excessive nested conditional operators.” One conditional operator is not nested. What we have here is a simple conditional that ReSharper should ignore.

I can change ReSharper options to disable that suggestion, or I could annotate my code to disable that suggestion in this particular case. I won’t do the former because I want ReSharper to suggest improvements that make sense, and I definitely want to avoid the arrow anti-pattern. I won’t add ReSharper-specific (or any other tool-specific) annotations to my code because those annotations are not germane to the problem being solved. They’re just clutter. It’s bad enough that I have to suffer those XML documentation comments in the code.

ReSharper got that one wrong, and there’s nothing I can reasonably do to prevent it from displaying the warning.

ReSharper also warns me when I write something like this:

    private int _head = 0;

It says, “Initializing field by default value is redundant.” And ReSharper is right. But the redundancy is harmless and even helpful at times. I like this particular redundancy because it explicitly states what the initial value of that variable is. I don’t have to think, “Oh, right. The default value is zero.” I’ve turned this warning off because there is no context in which I want to be told that my preference for explicit initialization is redundant. Besides, the compiler will silently remove the redundancy, so there’s no inefficiency in the generated code.

Come to think of it, I’d like an option to flag code that doesn’t explicitly initialize something. I’d enable that option in a heartbeat.

All that said, I really like ReSharper. As annoying as it can be from time to time, I find that it really does help me write better code. Highly recommended.

Performance testing the TopBy method

The purpose of my TopBy extension method is to provide a faster and more memory efficient way to select the top k items from a list. Absent this method, the LINQ way to get those items is to sort the sequence in descending order and then take the top k. For example:

    var oldestPeople = people
        .OrderByDescending(x => x.Age)
        .Take(10);

That will give you the 10 oldest people in the list. My TopBy method makes it a little easier:

    var oldestPeople = people.TopBy(10, x => x.Age);

That reads better to me and, done right, should be a whole lot faster than the existing LINQ alternative.

When LINQ does that selection, it has to copy the entire source list (people in this case), sort the entire list, and then return the top 10. That requires extra memory proportional to the size of the source list, and time proportional to n*log2(n). If you want to select the top 10 from a list of one million, LINQ will create another list of one million items to sort.

It should also be clear that LINQ will take almost the same amount of time to select the top 10 as it would to select the top 100,000. The bottleneck in the LINQ way of doing things is the sort, and it always has to sort the entire array.

My TopBy method uses a heap selection algorithm similar to the one that I discussed in my article When theory meets practice. This algorithm requires extra space proportional to k (the number of items to be selected). It takes time proportional to n*log2(k), meaning that it should be much faster than the LINQ method when selecting a relatively small percentage of the items in the list, and it will use a whole lot less memory.
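
To put rough numbers on that: selecting the top 10 from a list of one million items costs the sort-based approach on the order of 1,000,000 * log2(1,000,000), or about 20 million comparisons, while heap selection costs at most on the order of 1,000,000 * log2(10), or about 3.3 million. In practice it does even better, because most items never displace anything on the heap and are rejected after a single comparison.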

One other benefit of the new TopBy method is that you can use it to select the top items from a list that won’t even fit in memory. Imagine, for example, that you have a text file that contains hundreds of millions of records, more than will fit in memory. You want to select the top 100 records from that file. You can’t do that with OrderBy because you can’t fit those hundreds of millions of records in memory. But TopBy only needs space for 100 records, so you can easily use it to select the records you want.
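
As a sketch of that scenario (the file name and ordering key here are made up for illustration; TopBy is the extension method developed in the earlier posts):

    // Select the 100 longest lines from a file of arbitrary size.
    // Only about 100 lines are held in memory at any time.
    var longestLines = File.ReadLines("huge-file.txt")
        .TopBy(100, line => line.Length)
        .ToList();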

That’s the theory. Let’s see how it works in practice.

The short version is that TopBy is faster than OrderBy followed by Take when the number of items that you want to select is less than about 25% of the total number of items. For primitive types like integers, “about” is closer to 20%, and for strings it’s closer to 40%. As the cost of comparisons increases, TopBy is a better choice.

How much faster depends mostly on the ratio of selected items (k) to total items (n). When k/n is small, TopBy is hugely faster. For example, selecting the top 100 items from a list of 1,000,000 strings takes TopBy about 500 milliseconds on my machine. It takes the OrderBy method about 17 seconds.

When k/n approaches 0.25, they’re about the same speed, and when k/n is near 1.0, TopBy takes about twice as long. Again, TopBy has an advantage as the cost of key selection and comparison increases.

The first table below compares the performance of OrderBy and TopBy when selecting varying percentages of integers from different list sizes. The second table compares the performance when selecting strings.

Time (in milliseconds) to select integers

    Percent                                   Total items
    selected       100            1,000           10,000           100,000          1,000,000
             OrderBy  TopBy  OrderBy  TopBy  OrderBy   TopBy   OrderBy   TopBy   OrderBy    TopBy
    0.01%                                       9.41    0.76       112    7.60     1,475    75.25
    0.10%                     0.74    0.08      8.87    0.79       112    8.18     1,479    84.73
    1%          0.04   0.01   0.73    0.10      8.89    1.20       113   14.84     1,477   176.37
    10%         0.04   0.03   0.81    0.39      9.06    5.18       116   66.84     1,536      854
    25%         0.04   0.04   0.71    0.69      9.18    9.75       116     126     1,501    1,677
    50%         0.04   0.07   0.73    1.35      9.50   14.32       118     189     1,554    2,547
    75%         0.04   0.09   0.73    1.35      9.59   17.01       118     224     1,572    3,106
    90%         0.05   0.09   0.75    1.32      9.57   18.55       123     235     1,699    3,287

Time (in milliseconds) to select strings

    Percent                                   Total items
    selected       100            1,000           10,000           100,000          1,000,000
             OrderBy  TopBy  OrderBy  TopBy  OrderBy   TopBy   OrderBy   TopBy   OrderBy    TopBy
    0.01%                                      97.85    5.14     1,499   54.43    17,595      540
    0.10%                     6.63    0.53     94.79    5.40     1,455   58.73    16,410      612
    1%          0.45   0.06   7.20    0.78     91.62    8.42     1,360     122    16,670    1,364
    10%         0.53   0.14   6.54    2.40     91.11   36.16     1,376     556    16,877    6,919
    25%         0.42   0.27   6.61    4.62       100   69.69     1,337     967    17,147   13,375
    50%         0.44   0.43   7.21    7.34     90.45     104     1,299   1,481    16,854   20,561
    75%         0.46   0.52   6.47    8.60     94.60     124     1,318   1,823    17,106   24,166
    90%         0.46   0.58   6.93    9.17       105     135     1,400   1,998    17,886   26,071

The important thing to note here is that for a given list size, OrderBy followed by Take takes nearly constant time regardless of how many items are selected. That is, selecting 100 items from a list of 1 million takes approximately the same amount of time as selecting 900,000 items from the same list. In comparison, the time required by TopBy depends on the ratio of selected items to total items.

The time I spent developing the TopBy extension method is a huge win for me because I regularly select small numbers of items from very large lists. Selecting the top 100 from a list of 1 million is a common operation in the work that I do. Being able to do that in 500 milliseconds rather than 16 seconds means that my programs run much faster. They also require a whole lot less memory.

I also find myself, sometimes, having to select the top 100 items from a list that’s too large to fit in memory. Again, that’s something that TopBy can do, but OrderBy can’t. I realize that my situation is a little out of the ordinary in that most people who work with hundreds of millions of records will have them in a database and they can use SQL to do the selection.

It’s important to point out, though, that my TopBy extension method isn’t the optimum in performance. I have a custom generic selection method that’s three to five times faster, and outperforms LINQ’s OrderBy even when selecting 100% of items. Unfortunately, it doesn’t support the ThenBy and ThenByDescending methods, and it doesn’t do a stable ordering. It works more like List<T>.Sort in that I can pass it a comparison delegate. It’s useful, but using it in conjunction with LINQ is difficult.

All in all, this was an interesting and enjoyable diversion. I learned a lot about how LINQ to Objects works under the hood, and I ended up with a very useful extension method that makes my programs run faster and use less memory.

Below is the code I used to generate the data that went into the tables above. Of course, all tests were run with a release build with no debugger attached.


    private const int Million = 1000*1000;
    private const int Billion = 1000*Million;
    private const int NumIterations = 10;

    private static readonly double[] Percentages = new double[] {0.0001, 0.001, 0.01, 0.1, .25, .50, .75, .90};
    private static readonly int[] Sizes = new int[] {100, 1000, 10000, 100000, Million};
    private static void TestSelect()
    {
        // prime the jitter pump
        TestOrderBy(new List<int> {1, 2}, 1, 1);
        TestOrderBy(new List<string> {"foo", "bar"}, 1, 1);
        TestTopBy(new List<int> { 1, 2 }, 1, 1);
        TestTopBy(new List<string> { "foo", "bar" }, 1, 1);

        foreach (var numItems in Sizes)
        {
            foreach (var t in Percentages)
            {
                int numSelect = (int) (t*numItems);
                //var items = CreateRandomInts(numItems);
                var items = CreateRandomStrings(numItems, 4, 20);
                var orderBy = TestOrderBy(items, numSelect, NumIterations);
                var topBy = TestTopBy(items, numSelect, NumIterations);
                Console.WriteLine("{0},{1:N2},{2},{3},{4}", numItems, t*100, numSelect, orderBy, topBy);
            }
        }
    }

    private static double TestOrderBy<T>(List<T> items, int numSelect, int numIter)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < numIter; ++i)
        {
            var sel = items.OrderByDescending(k => k).Take(numSelect).ToArray();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds/numIter;
    }

    private static double TestTopBy<T>(List<T> items, int numSelect, int numIter)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < numIter; ++i)
        {
            var sel = items.TopBy(numSelect, k => k).ToArray();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds/numIter;
    }

    private static Random rnd = new Random();

    static List<int> CreateRandomInts(int howMany)
    {
        return Enumerable.Range(0, howMany).Select(i => rnd.Next()).ToList();
    }

    static List<string> CreateRandomStrings(int howMany, int minLength, int maxLength)
    {
        var items = new List<string>(howMany);
        for (int i = 0; i < howMany; ++i)
        {
            int len = rnd.Next(minLength, maxLength);
            var sb = new StringBuilder(len);
            for (int c = 0; c < len; ++c)
            {
                int x = rnd.Next(52);
                if (x < 26)
                    sb.Append((char)(x + 'A'));
                else
                    sb.Append((char)(x - 26 + 'a'));
            }
            items.Add(sb.ToString());
        }
        return items;
    }

Some optimizations to the TopBy method

I had planned to present the results of performance testing here, comparing my TopBy extension method against the existing LINQ alternative (OrderByDescending() followed by Take()). But while testing I came up with a couple of optimizations that provide slight performance increases. So I’ll show you those optimizations today, and then follow up with the performance tests in the next day or two.

My original code used a 3-heap because testing of my DHeap class showed that a 3-heap is almost always faster than a binary heap. But when I made the decision to use a 3-heap, I didn’t take into consideration the operations that the heap is expected to perform. If you recall from my discussion of the d-heap, heap insertions (the BubbleUp method) become faster as the value of d increases. Inserting into a 3-heap is faster than inserting into a binary heap. But heap removals (the SiftDown method) are slower as d increases.

The HeapSelector<TElement> class I presented last time never does a heap insert! The BubbleUp method doesn’t even exist in the code. The DoSelection method creates an array of the first numItems items and heapifies it, and then any changes to the heap are performed by sifting new elements down. The result is that my SiftDown method does more comparisons than a binary heap’s would, and there’s no corresponding reduction in comparisons during heap insertion.

The Heapify method in a 3-heap is faster than in a binary heap, and better locality of reference with the 3-heap offsets some of the lost performance in SiftDown, but overall the 3-heap is slightly slower with this mix of operations.

So I modified the code to use a binary heap. That simplifies the SiftDown method and provides a small increase (about 2%) in speed. Here are the modified Heapify and SiftDown methods.

    private void Heapify()
    {
        for (int i = _count / 2; i >= 0; --i)
        {
            SiftDown(i);
        }
    }

    private void SiftDown(int ix)
    {
        var child = (ix*2) + 1;
        while (child < _count)
        {
            if (child+1 < _count && Compare(child, child + 1) > 0)
                ++child;
            if (Compare(child, ix) >= 0)
                break;
            Swap(ix, child);
            ix = child;
            child = (ix*2) + 1;
        }
    }

The other optimization has to do with extracting keys from records before comparing to see if an item should go onto the heap. The original code looks like this:

    // For each remaining item . . .
    while (srcEnumerator.MoveNext())
    {
        ExtractKeys(srcEnumerator.Current, _numItems);
        if (Compare(_numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

That extracts all of the keys for an item and then compares the item against the first (smallest) item on the heap. If the new item is larger than the smallest item, then the smallest item is replaced by the new item.

One of the things I learned when working with heap selection some time ago is that with typical data, especially when you’re selecting a small percentage of the total items, only a very small number of items are actually placed on the heap. In most cases the vast majority of items examined are smaller than the smallest item on the heap. That being true, it’s useful to perform as little work as possible to determine if an item will go onto the heap.

The other observation is that, when multiple sort keys are specified, the majority of items will differ in the first key. If you’re sorting people by last name and then first name, most of them will have different last names. It makes no sense, then, to extract the first name key if it won’t ever be compared. If we had a method that extracted only those keys it needs for the comparison, we should spend a lot less time extracting keys.

With that in mind, I created the ExtractCompare method:

    private int ExtractCompare(TElement item, int ix, int x, int y)
    {
        var rslt = 0;
        var i = 0;
        while (rslt == 0 && i < _criteria.Length)
        {
            // Extract this key
            _criteria[i].ExtractKey(item, ix);
            rslt = _criteria[i].CompareKeys(x, y);
            ++i;
        }
        // If the item isn't larger, we can short circuit
        if (rslt <= 0) return rslt;

        // The new item compares larger, so it will be placed on the heap.
        // Extract the rest of the keys
        _items[ix] = item;
        while (i < _criteria.Length)
        {
            _criteria[i].ExtractKey(item, ix);
            ++i;
        }
        return rslt;
    }

You can see here that if the new item is smaller than the item it’s being compared to (which in this case will be the smallest item on the heap), the other keys aren’t even extracted. Again, why extract a key that you won’t use?

Using this new method requires a small change in the selection loop:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

The performance increase from this optimization isn’t very noticeable when you have just one ordering clause (i.e. no ThenBy or ThenByDescending). My testing shows that it’s only two to three percent in the general case. The effect becomes much more evident when you have multiple sort keys, or if extracting the keys is expensive.

I did have one reservation about this optimization. Somebody could write a key selector method that has side effects that other parts of the program (perhaps subsequent key selections) depend on. I can’t think of a good reason why anybody would write their program to depend on the key selector’s side effects, though. I gave this quite a bit of thought and came to the conclusion that depending on a key selector’s side effects is so fraught with peril anyway that adding this potential problem is no big deal. Key selection shouldn’t modify global state.

Together, these two optimizations resulted in a performance increase of five to seven percent in the single sort key case. Not huge by any standard, but worthwhile considering the simplicity of the changes.

Performance comparisons next time. Really.

Implementing IOrderedEnumerable, Part 2: The details

In my previous post, I developed the classes that implement IOrderedEnumerable so that we can create a chain of sort criteria that the GetEnumerator method can use to select the top items from the list. The harder part is writing that GetEnumerator method.

When I started this project I intended to use my generic heap class, DHeap, to do the selection. I actually wrote code to do that and almost published it. But the code was convoluted, and used a few “clever” tricks that were too ugly to post. Besides, the code was slow. It was faster than using OrderBy and Take, but not by a whole lot. It was slow and ugly, and not something that I wanted to post when there was a better (and, as it turns out, much faster) way to do things.

It’s instructive to understand the way the LINQ to Objects OrderBy implementation works. When code calls OrderBy, the LINQ code creates a chain of ordering criteria quite similar to what I showed in my previous post. The GetEnumerator method, when it’s eventually called, first extracts the sort keys and stores them in individual arrays. For example, if the code were OrderBy(p => p.Age).ThenBy(p => p.FirstName), then there are two key arrays created: one that contains all of the ages, and one that contains all of the first names. When sorting, the keys are compared and swapped as necessary.

Doing things that way costs memory, but lets us create and compare custom keys that might take significant time to generate. Rather than generating the key each time a comparison is made, the keys are generated one time and cached. It’s less than optimum, but it’s a much more flexible way to do things.

After studying the problem a bit (that is, after my initial solution was such a disappointment), I determined that I could use a technique that’s very similar to what the LINQ to Objects implementation of OrderBy does. But my key arrays are only the size of the heap (the number of items to be selected) rather than the size of the entire list. So if the code is supposed to select 100 items, then each key array is large enough to hold 101 items, with the last slot used as a scratch pad.

Each HeapOrderedEnumerable<TElement, TKey> instance has a _keys array and implements three abstract methods that are defined by the base HeapOrderedEnumerable<TElement> class:

  • ExtractKey(element, index) extracts a key of type TKey from the passed element and stores it in the _keys array at the specified index.
  • CompareKeys(x, y) uses the comparer to compare the keys at indexes x and y.
  • SwapKeys(x, y) swaps the values in the _keys array located at indexes x and y.

The HeapOrderedEnumerable<TElement,TKey> class, then, is pretty simple. Modifying the code I posted last time, I just need to add the _keys array and the three methods that I outlined above. Here’s the completed class.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
        private readonly Func<TElement, TKey> _keySelector;
        private readonly IComparer<TKey> _comparer;
        private readonly bool _descending;

        private readonly TKey[] _keys;

        internal HeapOrderedEnumerable(
            IEnumerable<TElement> source,
            int numItems,
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending) : base(source, numItems)
        {
            _keySelector = keySelector;
            _comparer = comparer ?? Comparer<TKey>.Default;
            _descending = descending;

            // Allocate one extra key for the working item.
            _keys = new TKey[numItems+1];
        }

        public override int CompareKeys(int x, int y)
        {
            return _descending
                ? _comparer.Compare(_keys[y], _keys[x])
                : _comparer.Compare(_keys[x], _keys[y]);
        }

        public override void SwapKeys(int x, int y)
        {
            var t = _keys[x];
            _keys[x] = _keys[y];
            _keys[y] = t;
        }

        public override void ExtractKey(TElement item, int ix)
        {
            _keys[ix] = _keySelector(item);
        }
    }

I then created a class called HeapSelector that, given a list of ordering criteria (the chain of HeapOrderedEnumerable<TElement,TKey> instances) and a list of items, can select the top items that match those criteria. That class maintains an array of the items currently on the heap, and calls the ExtractKey, CompareKeys, and SwapKeys methods described above to maintain the keys for those items. Internally, HeapSelector implements a custom d-heap to do the actual selection. The public interface consists of the constructor and the DoSelect method. Here’s the public interface.

    internal class HeapSelector<TElement>
    {
        public HeapSelector(
            IEnumerable<TElement> source,
            HeapOrderedEnumerable<TElement>[] criteria,
            int numItems);

        public IEnumerable<TElement> DoSelect();
    }

The full code is shown at the end of this post, along with the rest.

All that’s left is the GetEnumerator method in HeapOrderedEnumerable<TElement> which, in concept, is very simple. It just has to create an array of ordering criteria to pass to the HeapSelector, create the HeapSelector, and then return the sequence generated by the DoSelect method. It’s almost that simple, except for one complication.

OrderBy and ThenBy state that they do a stable ordering. A stable ordering simply means that items that compare equal maintain their original relative order in the output. For example, imagine that I had this list of names and ages.

    Jim,52
    Ralph,30
    Susie,37
    Mary,52
    George,47

If I were to sort those names by descending age, the first two items could be Jim followed by Mary, or Mary followed by Jim. If a stable sort isn’t specified, then either ordering is correct. But if the sort is guaranteed stable, then Jim must come before Mary in the output because Jim appeared before Mary in the original list. Note that if I reversed the sort order (i.e. sorted by ascending age), Jim would still appear before Mary.

It’s reasonable for users of TopBy to expect a stable ordering, because that’s the way that OrderBy works and, more importantly, how ThenBy is documented to work. If I want ThenBy to work with TopBy, then TopBy must do a stable ordering. That complicates things a little bit because heap selection isn’t typically a stable operation.

It can be made stable, though, by adding one final ordering criterion: the record number. In the example above, Jim would be record 0 and Mary would be record 3. Each record is assigned a unique, monotonically increasing numeric key. If the other keys compare as equal, the record numbers are compared. This guarantees the correct ordering of otherwise equal items.
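To make the tiebreaker idea concrete, here’s a minimal standalone sketch, not the TopBy code itself: a composite comparison over hypothetical (age, record number) pairs. The real implementation expresses the same idea as an extra ordering criterion, described next.

    using System;

    static class StableCompareSketch
    {
        // Descending by age, with the record number breaking ties so that
        // earlier records always sort ahead of later ones.
        static int CompareStable(int ageX, int recX, int ageY, int recY)
        {
            int byAge = ageY.CompareTo(ageX);   // descending age
            if (byAge != 0)
                return byAge;
            return recX.CompareTo(recY);        // tie: earlier record wins
        }

        static void Main()
        {
            // Jim is record 0, Mary is record 3; both are 52.
            Console.WriteLine(CompareStable(52, 0, 52, 3) < 0);  // True: Jim sorts before Mary
        }
    }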

It turns out that adding that final ordering isn’t terribly complicated. The code essentially tacks on a final ThenByDescending that stores the unique record number. It then walks the chain of ordering criteria to build an array that it can pass to the HeapSelector constructor. Finally, it calls the HeapSelector instance’s DoSelect method.

This whole thing turned out to be about 30 lines longer than my initial attempt, but much easier to understand. It’s also about five times faster than my original code, which is a nice bonus. The entire code is shown below. Next time we’ll do a little testing to see how it stacks up against other ways of selecting the top items.

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    namespace Heaps
    {
        public static class HeapEnumerable
        {
            public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, false);
            }

            public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, false);
            }

            public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, true);
            }

            public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, true);
            }

            internal abstract class HeapOrderedEnumerable<TElement> : IOrderedEnumerable<TElement>
            {
                private readonly IEnumerable<TElement> _source;
                private readonly int _numItems;
                internal HeapOrderedEnumerable<TElement> Parent;

                protected HeapOrderedEnumerable(
                    IEnumerable<TElement> source,
                    int numItems)
                {
                    _source = source;
                    _numItems = numItems;
                }

                public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
                    Func<TElement, TKey> keySelector,
                    IComparer<TKey> comparer, bool descending)
                {
                    var oe = new HeapOrderedEnumerable<TElement, TKey>(
                        _source, _numItems, keySelector, comparer, descending);
                    oe.Parent = this;
                    return oe;
                }

                public IEnumerator<TElement> GetEnumerator()
                {
                    int numRecs = 0;
                    var recordKeySelector = new Func<TElement, int>(item => ++numRecs);

                    // Add a ThenByDescending for the record key.
                    // This ensures a stable ordering.
                    var oe = (HeapOrderedEnumerable<TElement>)CreateOrderedEnumerable(recordKeySelector, null, true);

                    // Get the ordering criteria, starting with the last ordering
                    // clause, which will always be the record key ordering.
                    var criteria = oe.GetCriteria().ToArray();

                    var selector = new HeapSelector<TElement>(_source, criteria, _numItems);
                    return selector.DoSelect().GetEnumerator();
                }

                // Walks the ordering criteria to build an array that the HeapSelector can use.
                private IEnumerable<HeapOrderedEnumerable<TElement>> GetCriteria()
                {
                    var keys = new Stack<HeapOrderedEnumerable<TElement>>();

                    var current = this;
                    while (current != null)
                    {
                        keys.Push(current);
                        current = current.Parent;
                    }
                    return keys;
                }

                IEnumerator IEnumerable.GetEnumerator()
                {
                    return GetEnumerator();
                }

                // The individual ordering criteria instances (HeapOrderedEnumerable<TElement, TKey>)
                // implement these abstract methods to provide key extraction, comparison, and swapping.
                public abstract void ExtractKey(TElement item, int ix);
                public abstract int CompareKeys(int x, int y);
                public abstract void SwapKeys(int x, int y);
            }

            internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
            {
                private readonly Func<TElement, TKey> _keySelector;
                private readonly IComparer<TKey> _comparer;
                private readonly bool _descending;

                private readonly TKey[] _keys;

                internal HeapOrderedEnumerable(
                    IEnumerable<TElement> source,
                    int numItems,
                    Func<TElement, TKey> keySelector,
                    IComparer<TKey> comparer,
                    bool descending) : base(source, numItems)
                {
                    _keySelector = keySelector;
                    _comparer = comparer ?? Comparer<TKey>.Default;
                    _descending = descending;

                    // Allocate one extra key for the working item.
                    _keys = new TKey[numItems+1];
                }

                public override int CompareKeys(int x, int y)
                {
                    return _descending
                        ? _comparer.Compare(_keys[y], _keys[x])
                        : _comparer.Compare(_keys[x], _keys[y]);
                }

                public override void SwapKeys(int x, int y)
                {
                    var t = _keys[x];
                    _keys[x] = _keys[y];
                    _keys[y] = t;
                }

                public override void ExtractKey(TElement item, int ix)
                {
                    _keys[ix] = _keySelector(item);
                }
            }

            internal class HeapSelector<TElement>
            {
                private readonly IEnumerable<TElement> _source;
                private readonly HeapOrderedEnumerable<TElement>[] _criteria;
                private readonly int _numItems;
                private readonly TElement[] _items;
                private int _count;

                public HeapSelector(
                    IEnumerable<TElement> source,
                    HeapOrderedEnumerable<TElement>[] criteria,
                    int numItems)
                {
                    _source = source;
                    _criteria = criteria;
                    _numItems = numItems;
                    _items = new TElement[numItems+1];
                }

                public IEnumerable<TElement> DoSelect()
                {
                    // Build a heap from the first _numItems items
                    var srcEnumerator = _source.GetEnumerator();
                    while (_count < _numItems && srcEnumerator.MoveNext())
                    {
                        ExtractKeys(srcEnumerator.Current, _count);
                        ++_count;
                    }
                    Heapify();

                    // For each remaining item . . .
                    while (srcEnumerator.MoveNext())
                    {
                        ExtractKeys(srcEnumerator.Current, _numItems);
                        if (Compare(_numItems, 0) > 0)
                        {
                            // The current item is larger than the smallest item.
                            // So move the current item to the root and sift down.
                            Swap(0, _numItems);
                            SiftDown(0);
                        }
                    }

                    // Top N items are on the heap. Sort them.
                    int saveCount = _count;
                    while (_count > 0)
                    {
                        --_count;
                        Swap(0, _count);
                        SiftDown(0);
                    }

                    // And then return.
                    // Have to use the Take here because it's possible that saveCount
                    // will be smaller than _numItems.
                    return _items.Take(saveCount);
                }

                private const int ary = 3;

                private void Heapify()
                {
                    for (int i = _count / ary; i >= 0; --i)
                    {
                        SiftDown(i);
                    }
                }

                private void SiftDown(int ix)
                {
                    while ((ary*ix) + 1 < _count)
                    {
                        var child = (ix*ary) + 1;
                        // find the smallest child
                        var currentSmallestChild = child;
                        var maxChild = child + ary;
                        if (maxChild > _count) maxChild = _count;
                        ++child;
                        while (child < maxChild)
                        {
                            if (Compare(currentSmallestChild, child) > 0)
                                currentSmallestChild = child;
                            ++child;
                        }

                        child = currentSmallestChild;
                        if (Compare(child, ix) >= 0)
                            break;
                        Swap(ix, child);
                        ix = child;
                    }
                }

                private void ExtractKeys(TElement item, int ix)
                {
                    // Extract keys from the record into the array at index ix.
                    // Also calls the ExtractKey method for each ordering criteria.
                    _items[ix] = item;
                    foreach (var t in _criteria)
                    {
                        t.ExtractKey(item, ix);
                    }
                }

                private int Compare(int x, int y)
                {
                    // Walks the list of comparers, doing the comparisons.
                    // The first unequal comparison short-circuits the loop.
                    var rslt = 0;
                    for (int i = 0; rslt == 0 && i < _criteria.Length; ++i)
                    {
                        rslt = _criteria[i].CompareKeys(x, y);
                    }
                    return rslt;
                }

                // Swap two items. This swaps the elements in the local array,
                // and calls the Swap method for each of the ordering criteria.
                private void Swap(int x, int y)
                {
                    var temp = _items[x];
                    _items[x] = _items[y];
                    _items[y] = temp;
                    foreach (var t in _criteria)
                    {
                        t.SwapKeys(x, y);
                    }
                }
            }
        }
    }

Implementing IOrderedEnumerable, Part 1: The basics

Last time I introduced the TopBy extension method, which I want to implement for LINQ to Objects. On the surface of it, doing such a thing seems simple enough: just implement the IOrderedEnumerable interface. That turns out to be more involved than you might expect.

Documentation for IOrderedEnumerable isn’t particularly helpful. The brief description says:

Represents a sorted sequence.

There are no remarks or other detailed information. The interface has two methods: CreateOrderedEnumerable<TKey> and GetEnumerator.

If you’ve worked with LINQ at all, you know what GetEnumerator is supposed to do: generate the sequence. More correctly, it returns an enumerator that iterates through the collection, but something has to generate the items in the collection, and in keeping with the lazy nature of LINQ, we typically generate those items during enumeration.

Documentation for CreateOrderedEnumerable<TKey> isn’t much help, either. The brief description says:

Performs a subsequent ordering on the elements of an IOrderedEnumerable<TElement> according to a key.

The prototype tells us that there are three parameters: a key selector, a comparer, and a flag to indicate if the subsequent ordering should be ascending or descending. And, the only remark is:

The functionality provided by this method is like that provided by ThenBy or ThenByDescending, depending on whether descending is true or false. They both perform a subordinate ordering of an already sorted sequence of type IOrderedEnumerable<TElement>.

Not so helpful. The only clue in the documentation that even hints at an implementation strategy is in the Remarks section of ThenBy and ThenByDescending:

ThenBy and ThenByDescending are defined to extend the type IOrderedEnumerable<TElement>, which is also the return type of these methods. This design enables you to specify multiple sort criteria by applying any number of ThenBy or ThenByDescending methods.

The idea is easy enough. We want to create a chain of comparison operations so that we can apply code that does essentially this:

    compare the first keys
    if the first keys are equal then
        compare the second keys
        if the second keys are equal then
            compare the third keys
            ... and so on
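In C#, a two-key version of that chain boils down to something like this sketch (the key selectors and comparers here are stand-ins for illustration, not the actual LINQ internals):

    using System;
    using System.Collections.Generic;

    static class ChainedCompareSketch
    {
        // Compares two elements by a primary key, falling back to a
        // secondary key only when the primary keys compare equal.
        static int Compare<T, TKey1, TKey2>(
            T x, T y,
            Func<T, TKey1> key1, Func<T, TKey2> key2,
            IComparer<TKey1> cmp1, IComparer<TKey2> cmp2)
        {
            int result = cmp1.Compare(key1(x), key1(y));
            if (result != 0)
                return result;                      // primary keys differ: done
            return cmp2.Compare(key2(x), key2(y));  // tie: use the secondary key
        }

        static void Main()
        {
            // Two people with the same age; the age comparison ties,
            // so the name comparison decides.
            int r = Compare(
                Tuple.Create("Jim", 52), Tuple.Create("Mary", 52),
                p => p.Item2, p => p.Item1,
                Comparer<int>.Default, StringComparer.Ordinal);
            Console.WriteLine(r < 0);  // True
        }
    }

Each ThenBy just adds another link to that chain.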

It all starts with the TopBy method, which has two overloads:

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector);

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer);

There are corresponding TopByDescending methods, as well, which have the same parameters.

Those methods return an IOrderedEnumerable<TSource>, so I need to create a class that implements the IOrderedEnumerable<TSource> interface. I called that class HeapOrderedEnumerable<TElement>. In its skeletal form, it looks like this:

    internal abstract class HeapOrderedEnumerable<TElement> : IOrderedEnumerable<TElement>
    {
        public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending)
        {
            throw new NotImplementedException();
        }

        public IEnumerator<TElement> GetEnumerator()
        {
            throw new NotImplementedException();
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }

If you think about it a bit, that HeapOrderedEnumerable<TElement> class isn’t sufficient because the key type varies. What we really need is a HeapOrderedEnumerable<TElement,TKey> that we derive from HeapOrderedEnumerable<TElement>.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
    }

TopBy, then, creates and returns a HeapOrderedEnumerable<TSource, TKey>. A reference to that object is passed to ThenBy, which in turn calls the CreateOrderedEnumerable method to create additional sort criteria. It’s up to us to figure out how to chain those multiple sort criteria together so that when GetEnumerator is called, items are selected appropriately.

Clear as mud, right?
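A hypothetical call chain might make it clearer (the item type and its Likes and Dislikes properties are made up for illustration):

    // TopByDescending creates the first HeapOrderedEnumerable link.
    // ThenBy, already defined by LINQ, calls CreateOrderedEnumerable on it
    // to add the second criterion. Nothing is selected until the result
    // is enumerated, here by ToList.
    var top10 = items
        .TopByDescending(10, x => x.Likes)
        .ThenBy(x => x.Dislikes)
        .ToList();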

The HeapOrderedEnumerable<TElement, TKey> instance has to save the parameters that are passed to its constructor so that they can be used when it’s time to generate the sequence. The class also has to save a reference to its parent so that we can properly construct the chain of comparers when the time comes. Some of the information to be saved (the source list, number of items to select, and the parent pointer) has to be accessible by the base class. The rest can be private to the derived class.

With that information, we can create internal data definitions and constructors for the classes. First, the base class:

    // Internal data and constructor for HeapOrderedEnumerable<TElement>
    private readonly IEnumerable<TElement> _source;
    private readonly int _numItems;
    internal HeapOrderedEnumerable<TElement> Parent;

    protected HeapOrderedEnumerable(
        IEnumerable<TElement> source,
        int numItems)
    {
        _source = source;
        _numItems = numItems;
    }

Note that the Parent isn’t set by the constructor. We’ll come back to it in a minute.

The derived class, HeapOrderedEnumerable<TElement, TKey>, saves the key selector, the comparer, and the descending flag.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
        private readonly Func<TElement, TKey> _keySelector;
        private readonly IComparer<TKey> _comparer;
        private readonly bool _descending;

        internal HeapOrderedEnumerable(
            IEnumerable<TElement> source,
            int numItems,
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending) : base(source, numItems)
        {
            _keySelector = keySelector;
            _comparer = comparer ?? Comparer<TKey>.Default;
            _descending = descending;
        }
    }

Now we can write the CreateOrderedEnumerable method. All it does is create a HeapOrderedEnumerable<TElement, TKey>, and link it to the parent so that we have a chain that we can work with when GetEnumerator is called.

    public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
        Func<TElement, TKey> keySelector,
        IComparer<TKey> comparer, bool descending)
    {
        var oe = new HeapOrderedEnumerable<TElement, TKey>(
            _source, _numItems, keySelector, comparer, descending);
        oe.Parent = this;
        return oe;
    }

And, finally, the TopBy and TopByDescending methods.

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, false);
    }

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, false);
    }

    public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, true);
    }

    public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, true);
    }

Note that I don’t have to define the ThenBy or ThenByDescending methods. Those methods are already defined by LINQ. They call CreateOrderedEnumerable on the IOrderedEnumerable reference that’s passed as the first parameter. So, if there’s a TopBy followed by ThenBy, the HeapOrderedEnumerable<TElement> that TopBy created gets called. If the first call is OrderBy instead, then the IOrderedEnumerable implementation that OrderBy created is the one that gets called.

The code now will create a chain of comparers from a TopBy or TopByDescending followed by any number of ThenBy or ThenByDescending calls. The chain is built backwards, but that’s okay; it’s easy enough to reverse the chain when it comes time to enumerate items.

Enumerating the items involves extracting and saving keys, and doing the actual heap selection. In the next post I’ll talk a bit about key selection, and start building the GetEnumerator method. I might be able to finish next time, but there’s a lot of material to cover and it might be too big for a single post. If so, we’ll finish up after that.

Creating a TopBy method for LINQ to Objects

One reason I spent so much time last fall working on my generic heap implementation was that I wanted a TopBy extension method for LINQ to Objects. So, for example, if I have a list of integers and I want the 10 largest, I could write something like:

  var top10 = theList.TopBy(10, k => k);

Without a TopBy method, the LINQ way to do that is:

  var top10 = theList.OrderByDescending(k => k).Take(10);

That’s all well and good when the list is quite small, but when working with lists that contain millions of items, having to do a complete sort just so I can get the top 10 is hugely expensive in both memory and processing time. And if the list is larger than will fit in memory, the LINQ OrderBy solution flat-out doesn’t work.

Given my heap class, it’s easy enough to create an extension method that does something similar to what OrderBy does. If you ignore the keySelector parameter and assume that the items have a default comparer, then it’s almost trivial:

    public static IEnumerable<TSource> TopBy<TSource>(
        this IEnumerable<TSource> source,
        int numItems)
    {
        var comparer = Comparer<TSource>.Default;
        var heap = new MinDHeap<TSource>(3);
        foreach (var item in source)
        {
            if (heap.Count < numItems)
            {
                heap.Insert(item);
            }
            else if (comparer.Compare(item, heap.Peek()) > 0)
            {
                heap.ReplaceRoot(item);
            }
        }
        // and return in reverse heap order
        return heap.GetConsumingEnumerable().Reverse();
    }

Wrap that up into a static class and you can get the top 10 items from a list with:

    var top10 = theList.TopBy(10);

Adding the ability to specify a custom comparer is straightforward: something that I covered in my heap discussions. That would produce this function:

    public static IEnumerable<TSource> TopBy<TSource>(
        this IEnumerable<TSource> source,
        int numItems,
        IComparer<TSource> keyComparer);

I could probably get by with that just fine. If I need to compare on multiple fields, I can just write a different IComparer. And if I need to do some fancy key selection, I can either do it in the comparer, or write a front end filter that gets the keys that I need and builds a list of intermediate objects I can work with. It’s not something that I need to do very often.

At least, that was my thinking until I ran into a situation where building that list of temporary objects turned out to be a giant pain in the neck. Here I was able to sort with no trouble by using a key selector and several ThenBy operations, but to select the top 10 I had to jump through all kinds of hoops. It was just ugly.

Consider, for example, ordering a list of items based on how many users liked or disliked them. You want the items that have the most likes to appear on top, but if two items have the same number of likes, you want the one with the fewest dislikes to be shown first. So, given three items:

    bar, 50 likes, 8 dislikes
    foo, 10 likes, 1 dislike
    bas, 50 likes, 1 dislike

You would present them in the order (bas, bar, foo) because bar and bas both have 50 likes, but bas has fewer dislikes. The LINQ code to order them would be:

    var ordered = items
        .OrderByDescending(x => x.Likes)
        .ThenBy(x => x.Dislikes);

No custom comparer is required. But if I wanted to do that with the TopBy method shown above, I’d have to create a custom comparer and pass it. What a pain in the neck. And that’s just a simple example. If generating the keys required any calculation, I’d want to pre-compute the keys once so that I wouldn’t waste time constructing a key every time I wanted to compare it. As I said, I could get by with the primitive method, but it’d be much nicer if TopBy worked the same way as OrderBy.
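Just to show what that pain looks like, here’s roughly the comparer the primitive, comparer-based TopBy overload would need. The Item class is a stand-in for whatever type actually holds the likes and dislikes:

    using System.Collections.Generic;

    // Stand-in for the real item type.
    class Item
    {
        public string Name;
        public int Likes;
        public int Dislikes;
    }

    // Items that compare larger are kept by the heap and come out first:
    // more likes wins, and among equal likes, fewer dislikes wins.
    class LikesThenDislikesComparer : IComparer<Item>
    {
        public int Compare(Item x, Item y)
        {
            int byLikes = x.Likes.CompareTo(y.Likes);
            if (byLikes != 0)
                return byLikes;
            return y.Dislikes.CompareTo(x.Dislikes);
        }
    }

You’d then call something like items.TopBy(10, new LikesThenDislikesComparer()). It works, but it’s a lot of ceremony compared to the two-line LINQ version above.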

So I thought I’d look into creating TopBy and TopByDescending methods that I can use the same way that I use LINQ’s OrderBy and OrderByDescending. That turned out to be more involved than I thought it would be.

The key is the IOrderedEnumerable interface which, although simple enough on the surface, turns out to be a lot more complicated to implement than the documentation would lead you to believe. That’s where I’ll start next time.

Next: Implementing IOrderedEnumerable, Part 1

My next car

I drive a 1996 GMC Sonoma pickup truck that we bought new in February 1996. The truck has about 190,000 miles on it. It’s been wrecked and repaired twice, and I put a new engine in it at 120,000 miles. I keep up on the maintenance so the truck still runs well, and it looks much better than you’d expect an 18 year old truck to look. A few minor dents here and there is all.

Debra drives a 1996 Nissan Maxima that we bought new in December 1996. It has about 235,000 miles and is starting to show its age. She has to drive around town for work sometimes, so the car has to be reliable. We’ve kept up the maintenance on this car, so it runs well. But we haven’t fixed the dents from the minor wreck she had a few years ago. We had planned to replace the car by now but every time we think of getting something new we decide to wait another few months.

Debra’s new car will have to be something with four seats. We really like the Maxima and might get another. We both have a tough time with the price range, though: $30,000 to $40,000 for a car just seems crazy. There are lots of good cars in that price range, and even some nice ones in the $25,000 range (the Nissan Altima, for example). Debra wants four doors and plenty of space. Other than that, we haven’t narrowed it down. A 4 door sedan? Minivan? SUV? We’ll be looking for a long while.

I on the other hand just need basic transportation. The truck is handy for hauling things, but in truth I don’t often carry anything in the back. So I’m looking for something small and inexpensive to take me to work and back. I looked at the all electric Nissan Leaf, but can’t justify the $30,000 price tag for a basic commuter car. Even the Smart and the Fiat 500 are pretty expensive for what you get, but they’re less expensive than the Leaf and I can actually go places if I want to. The Leaf’s 100 mile range is a real drawback.

About six months ago I ran across Elio Motors, a company that’s gearing up to produce a small, affordable, and high mileage car. Last I heard (earlier this week), they expect to start shipping cars in the fourth quarter of this year.

elio-motors

It has two seats (front and back) and gets 80 MPG on the highway. City mileage will be about 50. It’s a 0.9 liter, 3-cylinder engine. The best part: $6,800. Yes, $6,800. It’s the perfect little commuter car.

I’m sending in my deposit to reserve one.

Using LINQ’s ToLookup to replace dictionary creation code

I often find myself having to build a dictionary from a list of items, with each entry in the dictionary containing a subset of the list. For example, given a list of people and the shifts they’ve worked, I need to group them by shift so that all the people who worked the morning shift are grouped together. I can then operate on those sublists separately.

Rather than illustrating with a list of employees and shifts they’ve worked, imagine you have a list of words, and you want to group them by their first letter. So given the words (ant, aardvark, baboon, giraffe, tortoise, gorilla, turtle), you’d end up with a list of lists that looks something like this:

A – ant, aardvark
B – baboon
G – giraffe, gorilla
T – tortoise, turtle

You could then query the data structure to get a list of all the words that begin with a particular letter.

The traditional way to do this in C# would be to create a Dictionary<char, List<string>>, and populate it as you enumerate the words list. For example:

    private Dictionary<char, List<string>> BuildWordsByFirstChar(IEnumerable<string> words)
    {
        var wordsByChar = new Dictionary<char, List<string>>();
        foreach (var word in words)
        {
            List<string> l;
            var key = word[0];
            if (!wordsByChar.TryGetValue(key, out l))
            {
                // Key doesn't exist in dictionary.
                // Make a new entry with an empty list.
                l = new List<string>();
                wordsByChar.Add(key, l);
            }
            l.Add(word);
        }
        return wordsByChar;
    }

You can then query the resulting dictionary data structure to get the list of words that start with a particular character, or you can iterate over the dictionary keys. For example, this code will display all of the keys and their associated words:

    var wordsByChar = BuildWordsByFirstChar(_words);
    foreach (var c in wordsByChar.OrderBy(kvp => kvp.Key))
    {
        Console.Write(c.Key + ": ");
        foreach (var w in c.Value)
        {
            Console.Write(w + ",");
        }
        Console.WriteLine();
    }

I’ve written such dictionary building code countless times in the past. It’s tedious. It’s also unnecessary because you can do the same thing with LINQ in a single line of code:

    var wbcLookup = _words.ToLookup(s => s[0]);

That creates a Lookup<char, string>, which is very similar to the Dictionary<char, List<string>> that the other code creates. For example, code to output the contents is almost identical:

    foreach (var c in wbcLookup.OrderBy(g => g.Key))
    {
        Console.Write(c.Key + ": ");
        foreach (var w in c)
        {
            Console.Write(w + ",");
        }
        Console.WriteLine();
    }

There are differences, however. Whereas attempting to access a non-existent key in the dictionary will result in a KeyNotFoundException, the Lookup will return an empty list. In addition, the value returned by the dictionary is a List<string> whereas the value returned by the lookup is of type IEnumerable<string>.
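A quick illustration of the missing-key difference, assuming no words in the list start with ‘z’:

    var zWords = wbcLookup['z'];       // empty sequence, no exception
    Console.WriteLine(zWords.Any());   // False

    // The dictionary version throws instead:
    // var zList = wordsByChar['z'];   // KeyNotFoundException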

Lookup, by the way, is a convenient wrapper around IEnumerable<IGrouping<TKey, TElement>>. You can get much of the above functionality without using ToLookup at all, but rather by just grouping the elements like this:

    var groupedWords = _words.GroupBy(s => s[0]);

That gives you a grouping, but you can’t access the individual elements by key as you can with Lookup. If you really need to create a Dictionary<char, List<string>>, you can build it from the grouping, like this:

    var wbcDict = groupedWords.ToDictionary(c => c.Key, c => c.ToList());

Or, if you have a Lookup:

    var wbcDict = wbcLookup.ToDictionary(c => c.Key, c => c.ToList());

But I’ve found that in most cases a Lookup will do and there’s no need to create a Dictionary.

The primary difference between ToDictionary and ToLookup is that ToDictionary is designed to create a one-to-one mapping from a list, and ToLookup creates a one-to-many mapping.
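That distinction matters if the keys can repeat. A quick sketch using the word list from above:

    // "ant" and "aardvark" both map to the key 'a', so ToDictionary
    // throws an ArgumentException for the duplicate key.
    // var oops = _words.ToDictionary(s => s[0]);

    // ToLookup (or GroupBy) is the right tool when keys can repeat.
    var byFirstChar = _words.ToLookup(s => s[0]);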

The larger point here is that with ToLookup, I can replace a dozen or more lines of procedural dictionary creation code with a single line of code that more clearly states the intention. And if I absolutely need a Dictionary, I can just call ToDictionary on the result.

Looks like a win to me.

More fun with LINQ: SelectMany

A coworker came to me with a little coding problem yesterday, and my investigation of the problem revealed a few things about C#, the .NET Framework, and LINQ. It all started with an email in which he decried the “ickiness” of this code.

    private IEnumerable<long> GetTimeWorkedAndBreakWorkTypes(IEnumerable<Row> rows)
    {
        var workTypes = new HashSet<long>();
        foreach (var row in rows)
        {
            if (!workTypes.Contains(row.WorkTypeKey))
                workTypes.Add(row.WorkTypeKey);
            if (row.IsBreaksLimited && !workTypes.Contains(row.BreakWorkTypeKey))
                workTypes.Add(row.BreakWorkTypeKey);
        }
        return workTypes.Select(wt => wt);
    }

Simply explained, he’s creating a list of all the work types that are used in that particular row set.

We can simplify that quite a bit by removing some unnecessary code. First of all, HashSet<T>.Add fails gracefully if the item passed to it already exists in the collection. The method returns True if the item is added to the collection, and False if the item is already there.
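A quick illustration of that behavior:

    var set = new HashSet<long>();
    Console.WriteLine(set.Add(42));   // True: newly added
    Console.WriteLine(set.Add(42));   // False: already present, no exception

So the Contains checks in the original code are redundant.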

In addition, the return statement is unnecessarily complex. Select(wt => wt) doesn’t accomplish anything; it just enumerates the collection and wraps it in another IEnumerable.

Taking advantage of those observations, the simplified code becomes:

    private IEnumerable<long> GetTimeWorkedAndBreakWorkTypes(IEnumerable<Row> rows)
    {
        var workTypes = new HashSet<long>();
        foreach (var row in rows)
        {
            workTypes.Add(row.WorkTypeKey);
            if (row.IsBreaksLimited)
                workTypes.Add(row.BreakWorkTypeKey);
        }
        return workTypes;
    }

That’s pretty standard code, and about as good as you’re going to get with looping constructs. It’s not aesthetically pleasing, though. You can improve it with LINQ.

Anybody who’s worked with LINQ is familiar with the Select method and its variants to project each element of a sequence into a new form. If all he wanted to do was get a list of the individual rows’ WorkTypeKey values, for example, he could have written:

    return rows.Select(r => r.WorkTypeKey).Distinct();

But that only provides one item per row. The goal in the original code is to provide at least one but possibly two items per row. Select can’t do that. But SelectMany can. For example, if he wanted to unconditionally return both items, he could write:

    return rows.SelectMany(r => new long[] {r.WorkTypeKey, r.BreakWorkTypeKey}).Distinct();

SelectMany lets you provide a sequence for each item, and then flattens all of those sequences into a single sequence. So the result of the SelectMany would be a single sequence of long values.

Conditionally providing that second item, though, is a little more complicated. Here’s one way to do it.

    var wt =
        rows.SelectMany(
            r => r.IsBreaksLimited
                ? new[] { r.WorkTypeKey, r.BreakWorkTypeKey }
                : new[] { r.WorkTypeKey });
    return wt.Distinct();

Note that I didn’t specify long in the array declarations. The compiler can infer the type, so in this case new[] resolves to new long[]. Whether you take advantage of the type inference is a matter of style.

I see two common objections to this type of code. The first is that it’s inefficient because it creates a list and then does a Distinct call to cull the duplicates. But that’s not what happens. You have to keep in mind that LINQ is using lazy evaluation and deferred execution. The calls to SelectMany and Distinct don’t actually do anything until some code tries to enumerate the resulting sequence (by using foreach or by calling ToList(), for example). Even then, it’s not as though the code calls SelectMany to create a list and then Distinct to cull the duplicates. What really happens is that Distinct creates a HashSet and then iterates over the rows collection, checking each row in turn. Essentially, the code does this:

    var wt = new HashSet<long>();
    foreach (var row in rows)
    {
        var items = row.IsBreaksLimited
                        ? new[] {row.WorkTypeKey, row.BreakWorkTypeKey}
                        : new[] {row.WorkTypeKey};
        wt.UnionWith(items);
    }
    return wt;

So you see that the code doesn’t really create a temporary list and then de-dupe it. It does, however, create small temporary arrays, which causes some people to raise efficiency concerns. In truth, there is a small inefficiency there. But the garbage collector is optimized to efficiently handle allocation of many small, short-lived objects. So those little arrays shouldn’t cause a problem. I might think differently if the rows list were to contain millions of items, but for the handful of items it will typically contain, there’s no reason to worry about the small additional load on the garbage collector.

That said, what I really wanted to write was:

    // NOTE: This code doesn't compile!
    var wt =
        rows.SelectMany(r =>
        {
            yield return r.WorkTypeKey;
            if (r.IsBreaksLimited)
                yield return r.BreakWorkTypeKey;
        });
    return wt.Distinct();

Unfortunately, that code doesn’t compile. For reasons that I won’t go into, you can’t use yield return in a lambda or anonymous method. You can, however, write a separate method to do that, resulting in:

    private IEnumerable<long> GetRowWorkTypes(Row row)
    {
        yield return row.WorkTypeKey;
        if (row.IsBreaksLimited)
            yield return row.BreakWorkTypeKey;
    }

    private IEnumerable<long> GetTimeWorkedAndBreakWorkTypes(IEnumerable<Row> rows)
    {
        var wt = rows.SelectMany(GetRowWorkTypes);
        return wt.Distinct();
    }

I find it somewhat annoying that I have to create a separate method in order to do it this way, but the result is clear and reasonably concise.

I’m still finding out new things about LINQ. SelectMany is one of those things that I don’t need very often, but comes in very handy from time to time, this case being a prime example.

Posting the heap code

I started a series of articles about heaps a few months back, but then got sidetracked with other projects and didn’t complete my article series. All that was lacking was the code and some detailed explanation of some parts, but I’d already covered the important theoretical stuff.

I probably won’t finish those articles any time soon, but the code is complete and usable. The download file, HeapCode, is a .zip file that contains the IHeap interface and the DHeap class. Using them is straightforward. If you’re at all familiar with .NET generic collections and you’ve looked over my heap articles, you’ll be able to use the DHeap class without trouble.

For reference, the articles in the series are:

Priority queues
A better way to do it: the heap
A simple heap of integers
Uses for heaps
The d-ary heap
Heaps: cleaning up some code
The IHeap interface
Heaps: constructors for the DHeap class

Happy heaping.

Was the ACA designed to fail?

One of the things that worries me about the Affordable Care Act (Obamacare) is that it depends on a large number of young, healthy people signing up on the exchanges. The idea is that their premiums will more than pay for the care that they use, and the excess will go to pay for the older people who consume more health care dollars. It’s a big Ponzi Scheme. I explained almost four years ago why I think it won’t work. If the young and healthy don’t sign up on the exchanges, or if people consume more health care resources than the planners projected, then the whole scheme falls apart.

There are plenty of other things wrong with the ACA as well. It’s just bad legislation that was pushed through Congress in a hurry and in a somewhat irregular fashion because Democrats knew that they couldn’t get it to pass the normal way and time was running out. The ACA is something like 1,000 pages of text, and it’s doubtful that any one person understands everything in it. It’s a certainty that none of our Representatives or Senators fully understood what they were agreeing to when they voted for the thing.

It’s no secret that many of those who pushed for the ACA are unhappy that they couldn’t push through a single payer system: fully government-paid health care. I’ve heard Democrats say, in private conversations, that when the insurance companies fail to live up to the provisions of the ACA, we can finally move to a single payer system. And I begin to wonder.

I’ve long held that bad government is the result of incompetence and unintended consequences; that nobody could purposely create the inefficient, ineffective, and idiotic government bureaucracies, programs, departments, rules, regulations, commissions, etc. that we see every day. But in my more cynical moments I wonder . . .

Was the ACA crafted to fail? Was the plan all along to create a system that can’t possibly work, knowing that when it fails there will be so many people dependent on the health care subsidies that it will be politically impossible to cancel the law, and the only way forward will be to go to a single payer system? Because I think that’s what will eventually happen. Perhaps even in my lifetime.

It’s a frightening thought: that inefficient and ineffective government is created on purpose, slowly becoming larger and more intrusive. Much like the metaphorical boiling frog, we wouldn’t stand for the government we have if it had been sprung on us all at once, but we accept (with protest) continually more expensive and intrusive government if our taxes increase and our liberties erode a little at a time.

The problem, though, is that the metaphorical frog eventually dies.

Lizard on a log

I don’t remember where I got this piece of wood or even what kind it is. It’s been sitting in my garage for at least a year.

liz1

It’s pretty light, and as I recall it’s some kind of pine or cypress. A conifer, almost certainly. It’s very dry and has lots of bug holes. Turned out it still had some bugs in it, too, which I took care of later, after I’d run across them.

Wanting to carve a figure into the wood in much the same way as I did the Bristlecone bird, I chose to carve a gecko.

liz3

I first cleared the remaining bark and sapwood from the area I wanted to carve, then drew a rough outline of the gecko on it and used a thin cutter to make it really stand out.

I should note here that I did almost all the carving on this piece with my Foredom power carver. I could have used knives and gouges–the wood was soft enough–but I’m trying to get more proficient with the power carver. Plus, that dang thing can remove wood fast.

With the outline etched into the wood, the first task is to rough out the shape.

liz4

Roughing that out was a lot of work, and as it turned out I removed a lot more than I had to. Live and learn, I guess.

Granted, that rough shape is really rough. Time to start refining.

liz5

You can see that I’ve refined the shape and taken it down quite a bit. In the prior picture the rough shape was just a blob. Here it’s pretty obvious that I’m going for a lizard. But it needs more refining.

You can see here that I removed an inch or two from the front of the branch, putting the lizard’s head over the edge. The primary reason I did this was to make it easier to shape the underside of the head and neck.

liz6

Here you can see that I goofed a bit. I somehow managed to thin the lizard too much. Instead of having a nicely rounded belly, it’s almost straight. But it still resembles the gecko.

I also cut off the other end of the log. I had originally planned to do something over there, but that chunk was making it difficult to work on the right rear leg and the right side of the tail. In addition, removing that end put the lizard in the center of the picture.

By this point I’d been working on the thing for four or five hours. I was tired. Also, I’d found a wood borer with the Foredom (bug guts on the carving), so I figured I’d better cook the piece in the oven for a few hours. Otherwise the bugs would end up eating my finished carving.

liz7

I had intended to let it sit until the next day, but after it was done cooking I was refreshed. So I went back out to the garage for another hour to finish up the shaping. Above is the final shape, with only a little more work left on the feet, and a lot of sanding.

The next morning I finished the sanding. I’m still not real happy with the feet. The toes are too hard-edged. I haven’t yet figured out how to round them well. I had the same problem with the standalone geckos I’ve carved.

liz8

I had originally planned to smooth and sand the base, but when I got started it was looking kind of boring. So I opted instead to try making something that resembles tree bark.

liz9

I’ll be the first to admit that it doesn’t look much like bark, but the random squiggles are more interesting than a flat smooth surface. I think it serves to emphasize the lizard.

With that done, I finished with two coats of Deft Satin polyurethane spray.

lizf1

You can click on the above image, or on the one directly below, for a much larger view.

lizf5

And here are a few other shots from different angles.

lizf2 lizf3 lizf4

In that last picture, you can see that the wood cracked on the right front leg, just above the foot. That happened while it was cooling, after I’d cooked it in the oven (90 minutes at 200 degrees) to kill any bugs. I considered mixing some wood dust and glue to fill the crack, but figured I’d leave it alone.

It’s been said that carvers don’t make mistakes. Rather, we make adjustments. I made lots of adjustments on this piece. But it was a great learning experience and it turned out okay. Will be fun to keep around and look at in a few years.

It’s going to be a week or two before I can attack another project like this. My garage isn’t heated, so when the temperature drops much below 60 degrees it’s uncomfortably cold to be working in there. The combination of shivering and frozen fingers isn’t conducive to wood carving, and the forecast is for cold weather (some freezing, even!) for the next two weeks. If I do any carving it’ll probably be small basswood figures with knives: something I can do while sitting behind my desk.

An interesting programming puzzle

I ran across this puzzle yesterday.

Write a function that, given a positive integer, will express that number as a series of simple mathematical operations (add, subtract, multiply, divide) using only single-digit numbers. Output of the function is a string containing a postfix expression.

So, for example, 36 would result in 66* or 94*. 11 would be 92+ or 74+, etc. The number 101 could be expressed as 554**1+.

It didn’t take long to come up with a recursive function that meets the requirements. Here it is in C#:

    string GetExpression(int val)
    {
        if (val < 10)
        {
            return val.ToString();
        }
        int quo, rem;
        // first see if it's evenly divisible
        for (int i = 9; i > 1; --i)
        {
            quo = Math.DivRem(val, i, out rem);
            if (rem == 0)
            {
                if (val >= 90 || (val < 90 && quo <= 9))
                {
                    // value is (i * quo)
                    return i + GetExpression(quo) + "*";
                }
            }
        }

        quo = Math.DivRem(val, 9, out rem);
        // value is (9 * quo) + rem
        // optimization reduces (9 * 1) to 9
        var s1 = "9" + ((quo == 1) ? string.Empty : GetExpression(quo) + "*");
        var s2 = GetExpression(rem) + "+";
        return s1 + s2;
    }

The overall idea here is that first I try to divide the number evenly by a one-digit number. If it does divide evenly, then the result is (x * quotient), where x is the divisor. If the number isn’t evenly divisible by a one-digit number, then the function divides the value by nine and generates (9 * quotient) + remainder.

I made one simple optimization based on the observation that any number in the range 0-90 inclusive can be expressed either as the product of two single-digit numbers, or with an expression of the form (9 * x) + y, where x and y are one-digit numbers. 31, for example, is (9 * 3) + 4, or 93*4+ in postfix.

The function above generates the shortest possible expressions for all numbers in the range 0 through 90. I think its expressions for 91 through 101 are as short as possible, if non-obvious. But I know that it doesn’t always generate optimum expressions. For example, the generated expression for 10,000 is 85595*5+*** (it works out to (8 * 1250)). But 10,001 results in 99394*5+**4+*2+ (postfix for (9 * 1111) + 2). A shorter expression would be (10000 + 1) or, in postfix, 85595*5+***1+.
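To sanity check results like these, a tiny postfix evaluator is handy. Here’s a quick sketch (my own helper, not part of the puzzle):

    using System;
    using System.Collections.Generic;

    static class PostfixCheck
    {
        // Evaluates a postfix expression made of single-digit operands
        // and the + - * / operators.
        public static long Evaluate(string expr)
        {
            var stack = new Stack<long>();
            foreach (var c in expr)
            {
                if (char.IsDigit(c))
                {
                    stack.Push(c - '0');
                    continue;
                }
                long right = stack.Pop();
                long left = stack.Pop();
                long result;
                if (c == '+') result = left + right;
                else if (c == '-') result = left - right;
                else if (c == '*') result = left * right;
                else if (c == '/') result = left / right;
                else throw new ArgumentException("Unexpected character: " + c);
                stack.Push(result);
            }
            return stack.Pop();
        }
    }

For example, PostfixCheck.Evaluate("85595*5+***") returns 10000.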

I’m at a loss as to how I would go about ensuring an optimum expression for any positive integer. I know that I could add more special cases, but I’m already disappointed by the existing special case in my algorithm for numbers 0 through 90. Even the check for even divisibility kind of bothers me, but that’s the only way I could find to make things reasonable at all.

If the puzzle interests you, I’d be interested in any insights.

Making me crazy

I don’t travel as much as I used to, so I’m not up on all the latest changes. The last time I traveled by air was two years ago, and somebody else made the reservations. I don’t remember the last time I booked a flight.

This evening I was making reservations to go to southern California. I typically just go to Southwest Airlines because their rates are competitive if not always the lowest, and I always get good service. But seeing the cost of the flight I thought I’d shop around. Several carriers showed much lower prices for the trip I wanted. Until I took into account restrictions, extra charges, etc. Their Web sites don’t make it easy for me to feel confident that I’m getting what I think I’m getting, and by the time I added everything up it wasn’t a whole lot cheaper than Southwest.

I entered my Rapid Rewards account (Southwest’s frequent flier program) with my flight information so I’d get points for the flight. Why not, right? But then I couldn’t check out. You see, my Rapid Rewards account has my name as Jim Mischel. But new (new to me, at least) government regulations (“Safe Travel” or some such) insist that the name on the ticket match the name on my driver’s license, which is James Mischel. Uff. Southwest’s helpful Web site suggested that I change the name on my Rapid Rewards account.

But the Rapid Rewards account information page says:

For security purposes, we do not allow name changes online. Instead, please forward your Rapid Rewards card, along with photocopies of legal documentation (ex. driver license, marriage certificate, etc.) and an informal letter indicating your legal name, to Rapid Rewards, P.O. Box 36657, Dallas, TX 75235.

Just shoot me.

The only solution I could come up with was to remove the Rapid Rewards number. So I won’t get my points for the flight. Probably wouldn’t matter, anyway; I don’t fly enough for the points to mean anything.

Ain’t technology wonderful?