Carving a whimsical house, Part 3

This is the third in a series of posts about carving a little whimsical house from an old cedar fence post. See Carving a whimsical house, Part 1 for the introduction.

In Part 2, I finished carving the roof. The next thing to do is establish the roof line and gables, and carve the sides down in preparation for the doors and windows. To that end, the first order of business is to draw the roof line similar to this:

roofline1 roofline2


typhoonBud

I use a Typhoon “bud” burr, shown on the right, to cut the roof line in the wood. If you don’t have that particular burr, a flame shape or a taper would probably do just as well. This line doesn’t have to be terribly precise; just be careful not to carve away the roof.

roofline3

The picture on the left shows the house after I’ve cut the roof line. You probably noticed that the bottom line looks like I cut it with the same burr. I might have. That bottom line, by the way, is where I’ll carve the rocks that the house sits on.

The next thing to do is carve the space between the roof and the base relatively flat so that the roof overhangs the house. I’ve used several different burrs for this and still haven’t decided which one I like the best. The 9/16″ Typhoon that I used for rough shaping the roof works okay, although it’s kind of hard to get the middle portion. However you do it, be careful not to go too deep. Also don’t worry about getting it perfectly flat. At this point, the most important part is to carve down the triangle that makes the gable on each end.

gable1 gable2

Above you can see that I’ve carved the gables, and then drawn lines to outline the trim board that serves to separate the wall from the gable. The next step is to relieve that board by carving down the wood above and below it.

gable3

Carving away the wood below is easy enough: just outline with the “bud” burr and then carve the rest of the wall. Carving the gable is a bit of a pain. I’m still experimenting with the best way to do that. If the area is large enough, I’ll use the bud or a flame shaped Typhoon burr. If it’s smaller, then the ruby carver seems to work pretty well. However I’ve done it, though, I’ve not been able to get a good sharp line. I keep trying.

gable4

Here’s what my house looks like after carving both gables and the walls on all four sides.

rocksbase

Again using the Typhoon bud burr, I roughed out the rocks that make up the base of the carving. I probably made a mistake using that large burr for this, because the resulting rocks have too much space between them. I had to do a lot of smoothing and shaping work to get it to look even halfway decent in the end. Live and learn, I guess. I would suggest taking your time with a ruby carver or perhaps a long taper Typhoon if you have one. In any case, use something smaller and less aggressive than what I used.

After I finished roughing out the rocks, I spent a little time with the flap sander to smooth the walls a bit so that I had a reasonably flat surface on which to carve the doors and windows. If you don’t have a flap sander, a sanding drum will work as will any other burr that leaves a relatively smooth surface. The surface doesn’t have to be completely free from scratches, as you’ll be cleaning it up later. But you probably want something pretty flat for the door and window frames.

Next time: doors and windows.


Carving a whimsical house, Part 2

This is the second in a series of posts about carving a little whimsical house from an old cedar fence post. See Carving a whimsical house, Part 1 for the introduction.

roof1

In the first part, I cut a piece of fence post into four blocks, each 2″ x 2″ x 3″ high, and drilled a 3/8″ hole in the center. The result is shown on the left.

I should note here that all of the carving on this piece is done with the Foredom power carver using various bits and burrs. It’s important when using these tools to wear eye protection, have some type of dust collection apparatus, and wear a respirator or some other way to protect your lungs. Power carving produces a lot of dust, and you do not want that stuff in your lungs. You might also want to wear ear plugs if your dust collection system is especially noisy.

I also want to point out that this series of posts shows how I carved one house. This isn’t the only way to do it. In fact my method is constantly changing as I become more familiar with using the tools. This is only the fifth one of these houses I’ve made, so I’m still just a beginner.

With that out of the way, let’s get started.

Oh, by the way, you can click on any of these pictures to get a larger view, although the “larger” view might not be a whole lot larger; I’m currently having trouble uploading bigger pictures.

I find it best to start with the roof. Establishing the roof line sets the stage for the rest of the house. Also, if you carve the rest of the house and leave the roof for last, it’s very possible that you’ll run out of wood to complete the roof you want. The style of roof I’m creating here can eat up quite a bit of wood. I had to throw a Cottonwood bark house away once because I didn’t leave enough space for the roof. I was carving that one with knives and gouges.

roof2

In the picture above, you can see that I’ve drawn a rough profile for the roof and chimney. Using a 1/2″ coarse Typhoon burr, I first outline the chimney.

The picture at the right shows how I begin to rough out the roof. The Typhoon burr is pretty aggressive, so I’m careful around the chimney, and I leave a lot of extra wood. I’ll come back later with a less aggressive burr and shape it.

Although the burr is aggressive, the cedar is a medium-hard wood and it takes some time to remove all that wood. Be patient and check your outline from time to time so that you don’t take off too much.

roof3

It took me about 20 minutes to finish roughing out the roof to this point. That’s okay. I’m not in a race to see how quickly I can carve one of these little houses. I’d rather take a little extra time than get in a hurry and either destroy the carving or, worse, lose control of the tool and injure myself. Running that Typhoon burr over a thumb hurts. Trust me.

roof4

Remember, too, that making a mistake isn’t fatal. These houses are supposed to be whimsical. They’re certainly not architecturally correct. If you inadvertently carve through the chimney, for example, don’t worry too much about it. You can always carve it to look like the chimney is falling apart. Carvers don’t make mistakes; we make adjustments.

roof5

Next, I shape the chimney using a smaller and less aggressive Typhoon burr. I also put a few dips and humps in the roof surface in order to make it a little less uniformly flat, although that turns out not to be necessary for the roof style I chose; the process of adding the roof tiles makes for an irregular surface.

The last thing I did before beginning to carve the roof tiles is go over the roof with a 120 grit flap sander to remove most of the scratches left by the Typhoon burrs. Again, that’s not strictly necessary because the next step has me going over the roof with a finer ruby carver, but it’s part of the process for me–something I do regardless of the roof style I choose.

ruby1

I do all of the tiling with the flame-shaped ruby carver shown on the right. I’ve tried other bits, particularly for carving the lines between roof tiles, but they end up making deep narrow lines that I then have to spend time removing. I like the flame shape, but the ruby carver is a bit less aggressive than I’d like. It also tends to clog up on cedar, and I have to clean it now and then with a brass brush. Don’t use a steel brush.

rooftile1

I chose to do a tile roof on this house. This is something I’ve tried once before with a Cottonwood bark house, and also on an earlier Cedar house. Those two didn’t turn out so well. I experimented with one side of this roof to refine my technique. The photos and description I show here are from the second side. I think I almost have this roof style figured out.

Here you can see that I’ve outlined the first tile. The next step is to remove wood to the right and below so that the tile stands out from the roof. Then, draw lines for the next two tiles.

rooftile2

When I started, I found it helpful to draw lines for the tiles before outlining each one. I got the hang of it after a while and began just carving the next tile without first drawing a line. Do what you feel comfortable with.

rooftile3

Next, outline those two tiles, carve away wood to the right and below, and outline the next tile. Note that you don’t have to carve the whole rest of the roof down after outlining each tile. Instead, just carve down enough for the next tile. What you’re going for is a gentle slope from the top left to the center bottom.

rooftile4

Work your way down and to the right, outlining and carving away wood for each tile until you get approximately to the center of the roof. When you’ve completed the left side, it should look something like the large photo on the right.

rooftile5

The general flow of the roof should be towards the right and down. That is, tiles on the left should appear “above” tiles on the right. It’s okay if some tiles appear to stick up above where they “should” be; that’s part of the house’s whimsical nature. But do try to keep the general flow to the right and down.

rooftile6

Next, start at the top right corner and do the same thing, but work down and to the left. Again, take your time. When you get close to the center, where the tiles from the left meet the tiles to the right, you’ll probably have to make some adjustments. If there’s a trick to making that come out just right, I don’t yet know what it is. I will say, however, that this is the best I’ve done so far.

rooftile7

And that’s one side of the roof tiles, almost completed. You can see that I made a few mistakes, the biggest one being there on the far right where I have one tile sitting completely on top of another. It looks a bit strange, and is not what I had planned. Don’t know how I managed that.

If you have a smaller ruby or diamond carver, you might want to sharpen the edges between tiles. I suspect that with a bit more practice I’ll be able to get sharp edges between all the tiles. I did a passable job here, but some of the lines aren’t as clean as I’d like them.

That’s it for the roof tiles. Next time I’ll rough out the roof line and carve the gables.


Carving a whimsical house, Part 1

house_banner

An online carving group with which I’m involved is doing a “friendship stick” project. Each of 10 carvers makes 10 carvings from a 2″ x 2″ x 3″ block of wood, and sends one carving to each of the other carvers. Every carving has a 3/8″ diameter hole drilled through it from top to bottom. When we receive the carvings, we display them stacked on a 3/8″ dowel.

The only rules are that the carving cannot be larger than 2″ x 2″ x 3″, and it has to have that hole in the middle. Beyond that, anything goes. I decided that I’d make little houses from a cedar fence post. The picture above is one of the houses I’ve carved for the project. This blog post is the first of several parts showing how I go from fence post to finished house.

fencepost

Above is a 13 inch long piece of cedar fence post. We’re replacing the 30-year-old fence around our property, and I pulled this post from the ground a month or two ago. The wood isn’t really cedar, but rather Ashe Juniper, Juniperus ashei. The stuff grows like weeds all over Central Texas, and the wood is commonly used for fence posts. That it’s held the fence up for 30 years is a good testament to its suitability for that purpose.

The first step in making a little house is turning this post fragment into blocks. And the first step of that process is making one side reasonably flat. After carefully checking for and removing any nails and fence staples, I took the piece over to the bandsaw. I set my fence about 1/2″ away from the blade, adjusted the height, and made the cut.

bandsaw1

(Yes, the picture is a little blurry.)

You can see a few ripples in the cut. Normally I would have done this with a resaw blade (a 1/2″ blade with four teeth per inch), but I had the 3/16″ blade on the saw and didn’t want to mess with changing it. The resaw blade would have made for a straighter cut, but this was good enough for my purposes.

With one flat side, I could then take the piece of wood over to the table saw to finish squaring it up. I used to do this on the bandsaw, but it’s easier to get square sides with the table saw.

tablesaw1

The picture above shows the piece with three sides squared up. After cutting the second side, I set the fence for 2″ and cut the other two sides. Then I set it for 3″ to cut the blocks to the right length.

blocks

The resulting blocks aren’t quite square because my initial cut with the bandsaw wasn’t perfect. I could have made allowances for that and squared things up on the table saw, but it just wasn’t that important to me. As long as the blocks are approximately square and don’t exceed the size limitation, it’s good enough for making these houses.

The next step is drilling a 3/8″ hole through the center of the block. Again, perfect placement isn’t terribly important, although I don’t want to be too far off. I used to do this with a hand drill, but I recently got a drill press, which makes things a bit easier. I just mark the center by drawing lines from corner to corner, and then clamp it in my drill press.

drillpress

Unfortunately, my drill press’s travel is only about two and a half inches. So in order to drill through a 3″ block of wood, I had to clamp the block in my bench vise and use a hand drill to finish the job.

drill

With the block cut and a hole in the middle, it’s time to start carving. Stay tuned.

Part 2, in which we carve the roof.

 

Rubber soles won’t save you

It’s thunderstorm season again. The overly cautious are telling everybody about the dangers of lightning, and the brave and clueless are spreading misinformation. Dangerous misinformation, at that.

First of all, the disclaimer: lightning is dangerous. In the United States, there are about 400 injuries and 40 deaths due to lightning every year. Those numbers were much higher in the past. They’ve declined as the country has become more urban and people spend more of their time indoors.

If you can hear thunder, then a lightning strike in your area is possible. The likelihood of a strike is another matter, but if you want to be safe and follow the NOAA guidelines: “When thunder roars, go indoors.”

I’m not nearly that cautious. After all, bee stings kill 25% more people per year in the U.S. than does lightning.

That said, if you decide to stay outside despite the NOAA’s advice, there are some things you should know. You shouldn’t be near the tallest tree or structure in the area. And if you’re in the middle of an empty field, you’re the tallest object around, which puts you at much higher risk of being struck by lightning. Follow the NOAA’s outdoor guidelines.

Most importantly, don’t think that wearing shoes with rubber soles will help you. Sure, rubber is a good insulator, but all insulators have voltage ratings. Lightning melts thick glass insulators. Your wimpy little 1/2″ rubber sole might as well not exist when it comes to a million-volt lightning strike.

And don’t think that cars can’t be struck by lightning. If you believe that the rubber tires on your car are protection against lightning, take a look at this video of a moving pickup truck being struck by lightning.


Vending vent

The vending area in my office building contains a snack vending machine with the screw-type dispensers, and a soda machine with eight lines. I typically walk down there around lunch time to get a drink, and sometimes I’ll get a snack.

My first choice for a drink is Coca-Cola. There are two lines of Coke (no jokes, please) in that machine, and about half the time I go to select a Coke, there isn’t any. There’s always a Coke Zero, though, and most times there’s a Dr. Pepper. I suspect the other drinks (orange soda, Diet Dr. Pepper, and some others) are always available, too.

The snack machine suffers a similar fate. Snickers bars go quick, as do the M&M’s and a few other things. I commonly see that snack machine with six or more empty lines, and the remaining selections are far down on my list of desirable snacks.

As far as I can tell, they refill these machines every two weeks. The Coca-Cola runs out in a week or less. Same with the Snickers and other popular snacks. Whoever’s operating these vending machines is losing a whole lot of potential sales. It is not true that people will select something else if their first choice is unavailable. When I want a Snickers bar, for example, I probably won’t settle for something else. I’ll just put my dollar back in my pocket and walk away. Cheese and peanut butter crackers are not a substitute when you’re craving a Snickers.

My dad and uncle owned a vending company back in the early ’80s. I worked there briefly between the time I left school and when I got my first programming job. They would get upset about one empty line on a vending machine. I can’t imagine what they would have done if they found one of their machines out of the most popular products for an extended period.

Odds ‘n ends

Once again, a whole month has passed without a blog entry. Blogging seems like such a heavyweight operation these days, compared to tossing something out on Facebook. I find myself posting short notes rather than exploring a topic in detail. Maybe I’m just getting lazy.

  • I started a new job at the beginning of April. I’m now working for a small company called Syvance, writing control and data acquisition software for laboratory equipment. I’m learning all about auto samplers, mass spectrometers, gas chromatography, and other equipment used in science labs, and writing software that drives the instruments.
  • You’d be surprised at just how old some of the software I’m working with is. My first project involves adding support for a new device to a program that was originally written in C for Windows 3.1. The program has been upgraded over the years, but it’s still C and much of the original code remains. In particular, there’s a cooperative multitasking scheduler that was written because Windows 3.1 didn’t have threads.
  • The manufacturer of the instrument I’m working with has supplied .NET assemblies for controlling the device, and some .NET GUI controls that display status, etc. Calling those from the main program involves making calls to a mixed mode assembly that has some MFC and some C++/CLI. That DLL in turn calls into .NET assemblies that implement the actual machine control. So I’m writing code in four different languages: C, C++, C++/CLI, and C#.
  • You could think of C++/CLI as C++ with a few extensions to support the .NET runtime, but it’s not really that simple. There are distinct differences. I’m almost ashamed to admit that I don’t fully understand how the whole mixed mode thing works. I have a general idea, but there’s a lot I don’t know. That will have to do for now, because finishing this project is more important than understanding all the nuances of the underlying platform. I have to say, though, that all those different layers make me nervous.
  • I did quite a bit of carving in May. I’ll have to post pictures at some point. There’s another bird in a branch, a whimsical house carved from basswood, a hummingbird pin, a couple of Tiki carvings, a bowl that’s almost done, and I’m sure something else that I can’t recall at the moment. I hope to have pictures of those in the next few days.
  • It seems as though we’ve had more rain this Spring than any time in the last 8 or 10 years. It’s also been much cooler. Here we are at the end of May and we haven’t topped 100 degrees yet. It would be nice to have a wetter and slightly cooler summer. The only drawback is that I’ll have to mow the lawn more than in previous years, but I can live with that.
  • Debra and I discovered the joy of Renaissance fairs earlier this Spring, and spent last weekend at one. I do need to write about that and post some pictures.
  • Living life and loving it. I’m busy with work, enjoying my hobbies, and mentally feel better than I have in years. Debra is training for another competition, and she’s looking better than ever.

Squiggle birds

When I was working on my 100 Birds Project, I saved the larger scraps from the bandsaw cutoffs. I nearly always had two cutoffs that were in the shape of the bird’s side profile. The cutoffs are flat on one side and have the contour of the bird on the other side. Below are pictures of two cutoffs from a Juniper bird I did a while back, and then a bird cutout and the left side cutoff from a piece of Mesquite.

squiggle1

squiggle3squiggle2

(If you look closely, you see that the mesquite cutoff is from a different mesquite bird. But you get the idea.)

I had originally planned to take a belt sander to the curved side of the cutoffs to create a bunch of flat bird profiles that I could make into refrigerator magnets or hang from a mobile. I was working on those when my friend Mike took one and made what I call a “squiggle bird.” He sanded the flat side to follow the contour of the curved side. The result is pretty cool.

squiggle4

We made a couple dozen of those, the idea being that I could hang them from a mobile. But there are problems. Finding the balance point is somewhat difficult. If I attach a line just a few millimeters forward or back of the balance point, the bird takes a severe nose-up or nose-down attitude. I tried to solve that problem by attaching the line to the tail and to the head, but that just looked ugly. In addition, if I attach the line to the back, then the bird ends up tilted on the roll axis–as if it’s making a turn. I’ve yet to find a good way to attach the squiggle birds to a mobile.

Last week I got the idea to mount a couple of birds on a standing base.

2014-04-20 21.37.54-1

The bird on top is mesquite, and the one on the bottom is pecan. They’re attached to a mesquite log with little twigs, also carved out of mesquite.

That looks okay, and I made another just like it. But it seems a waste to use such a large piece of mesquite for the base, and it’d look neater and more compact if I swapped the birds’ positions. So I made another one with the new design.

birdplaque

Both of the birds are mesquite, and the base is a piece of mesquite from which I removed the bark and then sanded away most of the sap wood. The birds and the plaque got two coats of Watco Natural Danish Oil, and then the birds got a couple coats of Deft Gloss spray polyurethane. The birds are attached to the plaque with mesquite twigs that are approximately 3/8″ in diameter. The plaque is flat on the back, and about 1/2 inch thick at its thickest. It’s designed to hang on a wall rather than stand on a desk or shelf.

I really like how this one turned out. I suspect I’ll be making more like it. My biggest problem might be picking out wood combinations that look good together.


Inconceivable!

Every time I hear President Obama say something “will not be tolerated,” I’m reminded of Vizzini’s “inconceivable!”

The silly sanctions we’re placing on Russian officials will have approximately the same effect as the Bush administration’s pointless sanctions on North Korea back in 2006: banning the export of iPods, Segway scooters, and other luxury items so that Kim Jong-Il wouldn’t be able to give trinkets to his cronies.

But the administration has to bluster and posture and appear to be “doing something about the problem” even though the president and anybody with more brains than your average mailbox knows that we are completely powerless: Russia will do what Putin wants, openly and without fear of reprisal.

Why do I think we’re powerless? First, military confrontation is out of the question. Russia isn’t a little country with an insignificant military that we can overpower with a division or two of National Guard troops. Air strikes, even if we were stupid enough to try and could get past the Russian defenses, would be met with major reprisals and ultimately result in a real war that nobody wants and if fought would destroy the world economy.

Not that we could make any kind of military strike. President Obama isn’t dumb enough to try convincing us that Ukraine is worth another foreign excursion, and no member of Congress who is up for re-election this year is likely to approve such a thing even if the president were to suggest it.

Second, serious economic sanctions are out of the question because large parts of Europe depend on Russian natural gas for heat. Nobody in Europe wants to risk upsetting Putin, because he could easily turn off the spigot. The first cold snap come Fall would have every citizen in Europe demanding that The Powers That Be let Russia have whatever they want, so long as the gas is turned back on. Putin isn’t beyond punishing our European allies in response to American “provocation.”

It does make for good theater, though. Too bad we can’t bottle all the hot air from both sides. Between Obama’s rhetoric, Putin’s laughable denials, and Republicans’ excoriating the president for his lack of action, we could make a serious start to reducing the amount of fossil fuels burned to keep warm next Winter.

Random selection

If you know how many items are in a collection, picking one at random is trivial: just select a random number from 0 to count-1, and you’re done. Things get more interesting when you don’t know how many items are in the collection. The naive way to do the selection is to first make an array. Then you can use the solution I just mentioned. For example:

    readonly Random _rnd = new Random();

    int GetRandomItem(IEnumerable<int> items)
    {
        var a = items.ToArray();
        var ix = _rnd.Next(a.Length);  // Next(n) returns a value in [0, n-1]
        return a[ix];
    }

Whereas that works, it can be expensive or even impossible to create a new array if the collection is very large.

When the question is modified to include the restriction that the collection won’t fit in memory (say, a text file with hundreds of millions of lines), things get more interesting still. You could go through the file to count the lines, pick a random number from 0 to count - 1, and then go back through the file, skipping that many lines and taking the next one. In code:

    var count = File.ReadLines(filename).Count();  // reads the whole file
    var ix = _rnd.Next(count);  // Next(count) already returns 0 to count-1
    var sel = File.ReadLines(filename)  // potentially reads the whole file
        .Skip(ix)  // skip ix lines; the next line is line number ix
        .First();
    return sel;

If the collection is a file, then you end up reading the file once, and at least part of the file twice. On average, you end up reading the file halfway through the second time. It works, but it’s pretty expensive.

Worse, it’s impossible to use that technique if the collection is a bunch of records coming in from a network stream. It’s not like you can rewind the stream and start over after you figure out how many records there are.

There is a way to do this that doesn’t require knowing up front how many items are in the collection, and guarantees that each item has the same probability of being selected. Using this technique, you can select any number of items in a single pass through the collection. Provided, of course, that you have enough memory to hold the selected items.

Imagine that you are asked to select an item at random from a sequence of items presented to you one at a time. You don’t know how many items will be presented, and you have to prove that each item was given equal opportunity to be selected. You can’t decide to pick the first item every time, and you can’t decide to pick the fifteenth item, for example, because there might not be fifteen items presented. What to do?

First, since you always have to be prepared to present a selected item, it follows that you must select the first item because there might be only one item presented. But you can’t unconditionally keep the first item. When the second item is presented, you have to decide whether you should keep the first item, or replace it with the second item. How? By flipping a coin. Heads you keep the first, tails you replace it. Assuming a fair flip of a fair coin, then after two items half the time you’ll be holding the first item, and half the time you’ll be holding the second item.

Now the third item comes along. How do you decide whether to keep your previously selected item, or if you should replace it with the third item? You can’t flip a coin because doing so would mean that half the time you’d keep the third item. You want to keep the third item one third of the time.

What you really want is to roll a three-sided die and if it turns up a 3, replace the currently selected item with that third item. It should be obvious that the third item will be selected one third of the time.

If you don’t believe that this technique treats all items equally (i.e. gives every item an equal chance of ending up as the selected item), consider the possibilities with three items.

    First   Coin    Die   Result
      1     Heads    1      1
      1     Heads    2      1
      1     Heads    3      3
      1     Tails    1      2
      1     Tails    2      2
      1     Tails    3      3

Those are all of the possibilities when selecting from three items. Each number is selected in two of the six cases, or one third of the time.

If you continue in this manner, selecting the nth item only if rolling an n-sided die turns up n, then when you’ve reached the end of the list you will have given each item an equal chance of being selected.

Let’s see if that really does work. The program below uses the described technique to randomly select a value from 1 to 100. The program does that 10 million times and reports the results.

    private void DoIt()
    {
        const int arraySize = 100;
        const int iterations = 10*1000*1000;
        var selected = new int[arraySize];

        for (var i = 0; i < iterations; ++i)
        {
            var sel = SelectRandomInt(arraySize);
            ++selected[sel];
        }

        for (var i = 0; i < selected.Length; ++i)
        {
            Console.WriteLine("{0}: {1} ({2:P4})", i, selected[i],
                (double) selected[i]/iterations);
        }
        Console.WriteLine("Expected value = {0:N0}", iterations/arraySize);
        Console.WriteLine("Min value = {0:N0}", selected.Min());
        Console.WriteLine("Max value = {0:N0}", selected.Max());
    }

    private readonly Random _rnd = new Random();

    private int SelectRandomInt(int max)
    {
        var selected = -1;
        for (int i = 0; i < max; ++i)
        {
            if (_rnd.Next(i + 1) == i)
                selected = i;
        }
        return selected;
    }

The “expected” number that’s output is just the number of iterations divided by the number of items. In this case, 100,000. That’s the number of times you would expect any item to be selected, given a sufficiently large number of trials. This program outputs the number of times each item is selected, but below I show just the minimum and maximum.

    Expected value = 100,000
    Min value = 98,012
    Max value = 101,257

Not surprisingly, not every item was selected exactly 100,000 times. But it was close. The maximum variance was less than 2% on the low side, and about 1.3% on the high side. How does that compare with just selecting a random number from 0 to 99?

I replaced the SelectRandomInt method above with one that just returns a random number from 0 to max-1, like this:

    private int SelectRandomInt(int max)
    {
        return _rnd.Next(max);
    }

That gave me:

    Expected value = 100,000
    Min value = 99,069
    Max value = 100,842

This second version consistently gives me results that are closer to the expected values. The variance is approximately half. That is, where the first version might vary by 2% on the low side, this version varies by 1%. I think the difference is due to the random number generator being slightly biased (that is, it doesn’t produce a perfectly uniform distribution), and because we’re selecting many more random numbers with the first version than with the second, that bias is amplified. It’s not so far out of whack that I find it unusual, but I do think it merits further study.

Random number weirdness aside, you can see here that I was able to randomly choose one item from a collection of unknown size.

You wouldn’t use that technique to select a random number from 0 to 99. A real method would take an IEnumerable<T> and return a randomly selected item. The code looks like this:

    T SelectRandomItem<T>(IEnumerable<T> source, Random rnd)
    {
        T selectedItem = default(T);  // assigned on the first iteration: rnd.Next(1) is always 0
        int nItems = 0;
        foreach (var item in source)
        {
            ++nItems;
            if (rnd.Next(nItems) == 0)
            {
                selectedItem = item;
            }
        }
        return selectedItem;
    }

Note that I modified the test a bit. In the previous version I checked to see if the selected random number was equal to the top of the range. Here, I test to see if it’s equal to zero. The effect is the same because the conditional is selecting one item from n possible items. The probability that the random number will be 0 is the same as the probability that it’s equal to (nItems - 1). The reason I changed it here is that it’s easier for me to think about it this way, and it simplifies the task of modifying this code to select multiple items at random.

Using that code is very simple. If you want to select a line from a large text file, you can do it in a single line of code:

    string selected = SelectRandomItem(File.ReadLines("foo.txt"), new Random());

You might ask why I pass in the random number generator rather than creating a new one when the method is called, or expecting there to be one at class scope. There are several reasons.

The default Random constructor seeds the random number generator with the current time, which is expressed in milliseconds. If I were to call SelectRandomItem multiple times in a single millisecond (which could happen when working with short lists), I’d be very likely to select the same item each time. Consider, for example, selecting from two lists:

    IEnumerable<string> firstNames = LoadFirstNames();
    IEnumerable<string> lastNames = LoadLastNames();
    var firstName = SelectRandomItem(firstNames);
    var lastName = SelectRandomItem(lastNames);

If firstNames and lastNames were very short, then whenever I selected the third item from one, I’d be very likely to select the third item from the other. That wouldn’t make for a very good random name generator.

Programmers often want a repeatable random sequence, especially for testing purposes. That’s not possible if the method creates a new random number generator with every call. It’s possible if there is a random number generator dedicated to the SelectRandomItem method, but if any other piece of code uses that random number generator or calls SelectRandomItem, then the sequence is lost. If I want to guarantee a repeatable random sequence, then I need to supply the random number generator to the method that uses it.
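
For example (a quick sketch, reusing the hypothetical "foo.txt" from above), two calls made with identically seeded generators make the same selection, which is exactly what you want in a test:

    var a = SelectRandomItem(File.ReadLines("foo.txt"), new Random(42));
    var b = SelectRandomItem(File.ReadLines("foo.txt"), new Random(42));
    // a and b are always the same line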

Selecting multiple items

The algorithm to select multiple items in a single pass is a straightforward generalization of the single item selection algorithm. Rather than dealing with a single “currently selected” item, we maintain an array of them. And to decide whether to select a particular item, we get a random number and compare it with the number of items to be selected. So, we replace this line in the inner loop:

    if (_rnd.Next(nItems) == 0)

With this one:

    if (_rnd.Next(nItems) < numToSelect)

If the random number is smaller than the number of items to select, then we have to randomly replace one of the selected items. The resulting method looks like this:

    T[] SelectRandomItems<T>(
        IEnumerable<T> source, int numToSelect, Random rnd)
    {
        T[] selectedItems = new T[numToSelect];
        int nItems = 0;
        foreach (var item in source)
        {
            ++nItems;
            if (rnd.Next(nItems) < numToSelect)
            {
                var ix = rnd.Next(numToSelect);
                selectedItems[ix] = item;
            }
        }
        return selectedItems;
    }

When selecting more than a single item, the output from that method is surprisingly similar to the output from a method that selects unique random numbers in the same range. So similar, in fact, that I can’t tell them apart. I find that a little odd. There is a definite difference in the results from selecting a single item, but when selecting multiple items the difference is not detectable. I do need to investigate why that difference in behavior exists.

And that’s all there is to it: a method that will fairly select one or more items at random from a collection of unknown size, and do it with a single scan of the entire collection. If there is a better way to do this, I don’t know what it is.

Source

I first ran across this problem in Jon Bentley’s Programming Pearls. He cites Volume 2: Seminumerical Algorithms of Knuth’s The Art of Computer Programming. See section 3.4.2: Random Sampling and Shuffling.

The Aho-Corasick string search algorithm

The Aho-Corasick string search algorithm that I mentioned last time works by first building a finite state machine from the list of strings you want to find. After the state machine is built, the input is run through the state machine to find the matches. The key to understanding how the algorithm works is understanding how the state machine works. Fortunately, the original paper, Efficient String Matching: An Aid to Bibliographic Search, by Alfred V. Aho and Margaret J. Corasick, explains the state machine quite well. In fact, the paper is well enough written that I had little difficulty implementing the entire algorithm from it as my only reference.

Last time I said that I was going to explain how the algorithm works. After re-reading the paper and trying to describe the algorithm, I found myself just paraphrasing the original paper. So rather than do a bad job of explaining things, I’ll point you to the paper itself. Download it. Read it. Think about it. Read it again. It is quite approachable. I wish all the research papers I read were as well written.

Building the state machine is a multi-step process that involves first creating a goto function that defines the state transitions, and an output function that defines the outputs (if any) from each state. This process involves reading each keyword character-by-character and inserting the characters into the state machine. Then, a failure function is generated by walking the nodes in the goto function, determining the next node to visit if a failure occurs. An optional final step can be added in which the generated goto and failure functions are combined into a single state transition table. This final step makes walking the state machine easier and more efficient.
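
To make that more concrete, here is a minimal sketch of one way those structures might be represented and built. The class and names here are mine, purely for illustration; this is not the code behind the StateMachineBuilder class discussed below, and it omits the optional transition-table step:

    // Sketch only: the goto function is a trie stored as per-state
    // dictionaries, the output function is a set of key words per state,
    // and the failure function is computed breadth-first over the trie.
    class SketchBuilder
    {
        readonly List<Dictionary<char, int>> _gotoFn = new List<Dictionary<char, int>>();
        readonly List<HashSet<string>> _outputFn = new List<HashSet<string>>();
        int[] _failureFn;

        public void Build(IEnumerable<string> keywords)
        {
            NewState();  // state 0 is the root
            foreach (var keyword in keywords)
            {
                // Enter the key word into the trie character by character.
                int state = 0;
                foreach (var c in keyword)
                {
                    int next;
                    if (!_gotoFn[state].TryGetValue(c, out next))
                    {
                        next = NewState();
                        _gotoFn[state][c] = next;
                    }
                    state = next;
                }
                _outputFn[state].Add(keyword);  // reaching this state emits the key word
            }

            // Walk the trie breadth-first, computing where to go on failure.
            _failureFn = new int[_gotoFn.Count];
            var queue = new Queue<int>();
            foreach (var s in _gotoFn[0].Values)
                queue.Enqueue(s);  // depth-1 states fail back to the root
            while (queue.Count > 0)
            {
                int r = queue.Dequeue();
                foreach (var pair in _gotoFn[r])
                {
                    char c = pair.Key;
                    int s = pair.Value;
                    queue.Enqueue(s);
                    // Follow failure links until some state has a transition on c.
                    int f = _failureFn[r];
                    while (f != 0 && !_gotoFn[f].ContainsKey(c))
                        f = _failureFn[f];
                    int target;
                    _failureFn[s] = _gotoFn[f].TryGetValue(c, out target) ? target : 0;
                    // A state also emits everything its failure state emits.
                    _outputFn[s].UnionWith(_outputFn[_failureFn[s]]);
                }
            }
        }

        int NewState()
        {
            _gotoFn.Add(new Dictionary<char, int>());
            _outputFn.Add(new HashSet<string>());
            return _gotoFn.Count - 1;
        }
    }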

The original paper provides pseudocode examples from which it’s pretty easy to create working code. I’ve created two classes: StateMachine, which contains the completed state machine and the searching function, and StateMachineBuilder, which builds a StateMachine from a list of key words. In the past I’ve combined these two things into the same class, but I found the design cumbersome. Having two classes seems like a cleaner design to me. Building the state machine involves a lot of intermediate data structures that the searching code shouldn’t have to be troubled with. Separating the classes gives the searcher access to only the data that it actually needs.

The public interface to the StateMachineBuilder class is pretty simple. The constructor initializes the state machine. You can then call the AddKeyword method to add each key word to the state machine. When all key words are added, a call to the CompleteAdding method will complete processing and return a reference to the generated state machine.

A convenience function, StateMachine.Create(IEnumerable<string>), allows you to create a StateMachine instance from a sequence of keywords in a single call. It’s equivalent to creating a StateMachineBuilder, adding all the keywords, and then calling CompleteAdding.
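
In other words, given some IEnumerable<string> keywords, these two fragments produce equivalent state machines:

    // One call...
    var machine = StateMachine.Create(keywords);

    // ...or the explicit builder steps it stands in for.
    var builder = new StateMachineBuilder();
    builder.AddKeywords(keywords);
    machine = builder.CompleteAdding();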

The public interfaces, then, look like this:

    public struct SearchResult
    {
        public readonly string Keyword;
        public readonly int Position;
    }

    public class StateMachine
    {
        public static StateMachine Create(IEnumerable<string> keywords);

        public IEnumerable<SearchResult> Search(IEnumerable<char> text);
    }

    public class StateMachineBuilder
    {
        public StateMachineBuilder();
        public void AddKeyword(string keyword);
        public void AddKeywords(IEnumerable<string> keywords);
        public StateMachine CompleteAdding();
    }

Imagine that you’re looking in a very large text file for mentions of any name from a list of 100 names. Here’s one way you could do it:

    var names = GetListOfNames();
    var machine = StateMachine.Create(names);
    var lineNumber = 0;
    foreach (var line in File.ReadLines(inputFilename))
    {
        ++lineNumber;
        foreach (var result in machine.Search(line))
        {
            Console.WriteLine("Found '{0}' at position {1} on line{2}.",
                result.Keyword, result.Position, lineNumber);
        }
    }

If the file has no line breaks then File.ReadLines will try to read the entire file at once. That will cause a problem if the file is large. In that case you need to create a function that will return the file’s characters one at a time, and feed that to the Search method:

    foreach (var result in machine.Search(ReadFileByCharacters(inputFilename)))
    {
        Console.WriteLine("Found '{0}' at position {1}",
            result.Keyword, result.Position);
    }

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            while (!reader.EndOfStream)
            {
                char c = (char)reader.Read();
                yield return c;
            }
        }
    }

That’s a particularly inefficient way to read a file, but it’ll do the job. You can speed it up by reading a block of characters into a buffer and then returning characters from the buffer one at a time. Each time you reach the end of the buffer, read another buffer full. Repeat until you’ve reached the end of file. Like this:

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            var buffer = new char[65536];
            int charsRead;
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (var i = 0; i < charsRead; ++i)
                {
                    yield return buffer[i];
                }
            }
        }
    }

That will perform faster than the previous version because it makes many fewer calls to the runtime library.

So that’s what we’re going to build: methods that build the state machine and then use the state machine to search for strings in text. Next time we’ll dive into building the goto and output functions. In the meantime, you should download and read the original Aho-Corasick paper. It really is approachable, and explains the techniques very well.

The password rant

Every time I have to come up with a password for a Web site, I end up spending 10 minutes trying to find one that I can remember and that will conform to the rules for that site. You’d think there’d be some kind of standard. But, no … every site has its own bunch of rules.

One site requires the password to be “at least eight characters long.” Others specify a minimum length and require at least one number, one capital letter, and one special character. Others specify an absurdly short maximum length. They’ll limit the characters you can use, make you use at least two numbers, prevent you from using any word that’s in an English dictionary, or for all I know require that you type it with your toes while standing on your head. It’s maddening.

I could understand this idiocy 10 years ago. Today it’s simply unacceptable. The key to a good password is that it be long, unusual, and easy to remember. But the most important point is length. Next most important is easy to remember. And I guarantee that a password like “stupid.password.rules.suck” is easier to remember and harder to break with a brute force attack than something like “Stup1d.pw0rd”.

To make matters worse, sites often don’t tell you what the password rules are. It’ll say, “enter password.” So you enter something like “my.mom’s.maiden.name” and it will say, “Passwords must contain at least one number and one capital letter.” So then you enter “3rd.rock&Roll.act” and it will say, “Passwords may only contain the characters [list of characters that doesn’t include ‘&’].” So then you type “Please.save.M3.From.Stupid.Password.Rules” and it will say, “Passwords must be between 8 and 14 characters in length.”

Why can’t they tell me all of the rules up front? Do they think I’m here to play 20 passwords? This evening my hosting provider told me that my password had expired and that I need to create a new one. Fine. I’m now on my fifth attempt at creating a password that I can remember, that others won’t likely guess, and that fits the rules that they’re revealing to me one at a time.

People who create Web sites: please update your password rules. Force me to create a password that’s at least 15 characters long, and let me put whatever characters I want in it! If you really have to put a limit on the length, make it something reasonable like at least 32 characters. Limiting me to 12 (a banking site, no less) or 14 characters and making me choose from three or four different classes of characters to boot just pisses me off and makes me think that I should find some other place to take my business.


Searching for strings in text

Every mainstream programming language’s runtime library contains a function that will find a substring inside a string. The .NET runtime library, for example, has several such methods. String.IndexOf in its many overloaded variants returns the position of the search string in the larger string, or -1 if the substring is not found. There’s also String.Contains and String.LastIndexOf, as well as Regex.Match and several others.

The String functions are great for determining if a particular string occurs in some text. But if you want to search for matches of multiple strings, you have to search for each one individually. For example, if you want to search for occurrences of the strings {he, she, his, hers}, you’d have to write something like this:

    string[] stringsToFind = new string[] {"he", "she", "his", "hers"};
    foreach (var s in stringsToFind)
    {
        int pos = 0;
        while ((pos = text.IndexOf(s, pos)) != -1)
        {
            Console.WriteLine("Found {0} at position {1}", s, pos);
            pos += s.Length;
        }
    }

That will print almost every occurrence of each word, but in some situations it will miss one, because the restart position skips over the beginning of an overlapping match. For example, if you’re searching for the string “abab” in the text “abababab”, it would only find matches at positions 0 and 4. It would miss the occurrence at position 2. You can fix that by changing the increment to ++pos, which will cause the search to restart at each character following a match. It’ll just take longer.
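
For reference, here is the inner loop with that change applied, restarting one character past the start of each match:

    int pos = 0;
    while ((pos = text.IndexOf(s, pos)) != -1)
    {
        Console.WriteLine("Found {0} at position {1}", s, pos);
        ++pos;  // advance by one so overlapping occurrences are found
    }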

The code ends up scanning the entire text for every word to be found. That’s not a problem when the text is short and the number of words to be located is small. But if you’re looking for occurrences of 1,000 different words in a gigabyte-sized document, this is going to take a long time. The amount of time taken by this algorithm is proportional to the number of words (we’ll call it N) times the length of the document squared (call it M). That is, the complexity is O(N * M²).

Why squared? Because it’s possible that the code would have to start the search at every character in the string. Think of searching for all occurrences of the string “aa” in the text string “aaaaaaaaaaaaaaaaaaaaaaaa…” (‘a’ repeated a billion times). That wouldn’t take forever, but close enough. If the computer could do a billion searches per second, it would only take a billion seconds (almost 32 years). Ouch.

The regular expression methods can do things much faster in the general case because they can search for all four strings in a single scan of the text.

    string text = "those shells aren't hers or his.";
    Regex re = new Regex("he|hers|she|his", RegexOptions.Compiled);
    Match m = re.Match(text);
    while (m.Success)
    {
        Console.WriteLine("Found {0} at position {1}.", m.Value, m.Index);
        m = m.NextMatch();
    }

Here’s the output from running that code:

    Found she at position 6.
    Found he at position 20.
    Found his at position 28.

Again, you see that it missed some occurrences. It didn’t find the “he” inside “shells” (at position 7) or the “hers” at position 20. I could fix the missing “he” by incrementing the regular expression starting position after each match, but there’s nothing I can do about it not finding “hers”. Well, I could change the regular expression so that it’s "hers|he|she|his", which would return “hers” at position 20 but then wouldn’t find “he” at the same position. The same thing happens if I change the regular expression to "he(rs)?|she|his". A regular expression can’t find two matches at the same position.

I freely admit that I’m no expert with regular expressions. Even so, I was somewhat surprised by the regular expression finding “he” rather than “hers” at position 20. I wrongly thought that regular expression matching behavior defaulted to “longest left-most,” when in fact it’s a traditional NFA engine that, according to the documentation, accepts the first match it finds.

But “first match” doesn’t explain why it matches “hers” instead of “he” when I reverse the order of the words in the regular expression. After all, the engine should see “he” and stop, right? There’s something about the way the regex engine works with alternation that I don’t fully understand. I realize that I can solve the problem in this case by combining the two alternatives into "he(rs)?", but that becomes problematic when constructing a regular expression at run time from a list of words.
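
One partial workaround when building the pattern at run time, sketched below, is to sort the words longest-first before joining them with "|", so that the longest alternative is tried first at any given position. (It’s only partial: the result still can’t report two matches starting at the same position.)

    static Regex BuildWordRegex(IEnumerable<string> words)
    {
        // Longest-first ordering makes "hers" win over "he" when both
        // could match at the same position. Regex.Escape protects any
        // words that happen to contain regex metacharacters.
        var pattern = string.Join("|",
            words.OrderByDescending(w => w.Length).Select(Regex.Escape));
        return new Regex(pattern, RegexOptions.Compiled);
    }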

Understanding why alternation works this way would be nice, but I’m not so curious that I’ll spend time digging into the implementation in order to find out. For now I’ll accept that order matters in alternation constructs.

Although the regular expression version will be faster in the general case, it, too, will take essentially forever in pathological cases. And, as you’ve seen, it can’t find everything and it can’t easily be changed to do so.

Another problem is that both techniques require that the entire text be held in memory. If you’re searching for strings in a file that’s larger than available memory, you can’t use either of these methods without writing code that processes the file in chunks.

There is a better and faster way that can find all occurrences (given “hers”, for example, it will find both “he” and “hers”), doesn’t suffer from the pathological worst-case behavior, and that can process text on a character-by-character basis. It’s called the Aho-Corasick string search algorithm. Developed in the mid 1970s, it’s a surprisingly simple algorithm that scales well, is reasonably easy to implement, and performs excellently even with a straightforward implementation. I’ve used it to search for occurrences of millions of different strings among many hundreds of gigabytes of text.

Next time I’ll explain how the algorithm works, and start building code to implement it.

Hand made beer mug

Debra and I went to the Sherwood Forest Faire last weekend, and had a great time. I felt a little out of place, though, wandering around in costume and drinking mead from a plastic cup. Many people had wooden mugs, and there were mugs available for purchase at several shops. Although they were attractive, I didn’t get one. First of all, the least expensive mug I saw for sale was $45. That’s a lot for a hollowed-out piece of wood.

More importantly, the mugs they had for sale were too pretty. Most were perfectly turned on a lathe, had perfectly shaped handles, and were coated with a thick epoxy resin that turned the wooden mug into what was essentially a modern plastic mug that just looks like wood.

Sunday morning I went out to the wood pile and grabbed a hunk of mesquite that seemed about the right size. My plan was to carve a mug that would hold 16 ounces.

mug1

After cutting it to size and removing the bark with an angle grinder, I drilled a hole in the center to get started, and then used a die grinder and the Foredom power carver to hollow it. This turned out to be more difficult than I had expected due to the depth. It’s distressingly easy to lose control of the tool when you can’t see where it’s cutting. An aggressive bit spinning at 15,000 RPM will throw a piece of wood across the room when it binds up. I ended up putting a few cracks in the cup that way.

mug2

It took a long time to thin the walls from there (using a hook knife), and sand it smooth. I also had to fill the cracks that I made while hollowing. I didn’t take any pictures until after I’d completed the cup and put the oil and wax finish on it.

mug3

The finished mug is 4-1/4 inches tall and between 3-1/4 and 3-1/2 inches wide. The width varies because the mesquite log that it came from wasn’t perfectly round.

mug4

It holds just a little more than 12 ounces. I just couldn’t get it any deeper with the tools that I currently have. I’ll have to come up with a better way if I’m going to make another one of these. I’ve considered hollowing from both ends and then carving a 1/4 inch thick piece to fit inside at the bottom. The only difficulty would be getting that bottom piece carved correctly. I’m also considering a few other options.

mug5

The residue you see floating in the cup there is a little of the oil and wax. Not to worry: the oil and wax are non-toxic. I was impatient to try the mug. It really is a pleasure to drink from.

I decided not to add a handle. Maybe next time. I’ll probably come up with some sort of leather sling that fits around the mug and lets me attach it to my belt so that I can carry it around the next time we go to one of those Renaissance festivals.

ReSharper annoyances

If you’re writing C# code, you should be using ReSharper. It’s a Visual Studio plugin that examines your code in real time (as you edit), helping you avoid common errors and write better code. I admit that I found ReSharper annoying when I first started using it, but after a week or so I found it invaluable. I found it so useful at work that I purchased a license myself so that I can use it at home. It’s not exactly cheap (about $150), but I find that it improves my code and saves me time. Well worth the price.

But sometimes its suggestions are just annoying. One in particular drives me bonkers. Consider this fragment from my heap selection code:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

ReSharper suggests that I “Invert ‘if’ statement to reduce nesting.” If I tell it to perform the transformation, it turns the code into this:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) <= 0) continue;
        // The current item is larger than the smallest item.
        // So move the current item to the root and sift down.
        Swap(0, _numItems);
        SiftDown(0);
    }

So rather than having a simple conditional and a short bit of subordinate code, it inverts the conditional and adds an extra logical branch in the flow. All for no good reason. Rather than saying “If condition do x,” the code is now saying “If not condition go on to the next iteration, otherwise do x.” Negative logic and a more complicated control flow does not, in my opinion, constitute an improvement.

ReSharper is suggesting that I make my code harder to read and harder to understand.

Ostensibly, ReSharper is helping to avoid the arrow anti-pattern, but it’s being too aggressive. I fully agree that one should avoid writing such hideous code. However, the arrow anti-pattern involves “excessive nested conditional operators.” One conditional operator is not nested. What we have here is a simple conditional that ReSharper should ignore.

I can change ReSharper options to disable that suggestion, or I could annotate my code to disable that suggestion in this particular case. I won’t do the former because I want ReSharper to suggest improvements that make sense, and I definitely want to avoid the arrow anti-pattern. I won’t add ReSharper-specific (or any other tool-specific) annotations to my code because those annotations are not germane to the problem being solved. They’re just clutter. It’s bad enough that I have to suffer those XML documentation comments in the code.
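
For reference, the per-occurrence annotation would look something like this, assuming the inspection’s ID is InvertIf:

    // ReSharper disable once InvertIf
    if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) > 0)
    {
        Swap(0, _numItems);
        SiftDown(0);
    }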

ReSharper got that one wrong, and there’s nothing I can reasonably do to prevent it from displaying the warning.

ReSharper also warns me when I write something like this:

    private int _head = 0;

It says, “Initializing field by default value is redundant.” And ReSharper is right. But the redundancy is harmless and even helpful at times. I like this particular redundancy because it explicitly states what the initial value of that variable is. I don’t have to think, “Oh, right. The default value is zero.” I’ve turned this warning off because there is no context in which I want to be told that my preference for explicit initialization is redundant. Besides, the compiler will silently remove the redundancy, so there’s no inefficiency in the generated code.

Come to think of it, I’d like an option to flag code that doesn’t explicitly initialize something. I’d enable that option in a heartbeat.

All that said, I really like ReSharper. As annoying as it can be from time to time, I find that it really does help me write better code. Highly recommended.

Performance testing the TopBy method

The purpose of my TopBy extension method is to provide a faster and more memory-efficient way to select the top k items from a list. Absent this method, the LINQ way to get those items is to sort the sequence in descending order and then take the top k. For example:

    var oldestPeople = people
        .OrderByDescending(x => x.Age)
        .Take(10);

That will give you the 10 oldest people in the list. My TopBy method makes it a little easier:

    var oldestPeople = people.TopBy(10, x => x.Age);

That reads better to me and, done right, should be a whole lot faster than the existing LINQ alternative.

When LINQ does that selection, it has to copy the entire source list (people in this case), sort the entire list, and then return the top 10. That requires extra memory proportional to the size of the source list, and time proportional to n*log2(n). If you want to select the top 10 from a list of one million, LINQ will create another list of one million items to sort.
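
Conceptually, the OrderBy-then-Take pipeline does something like the following. This is a sketch of the costs involved, not LINQ’s actual implementation; people and Age are from the example above:

    // Copy all n items: O(n) extra memory.
    var buffer = people.ToArray();
    // Sort the entire copy: time proportional to n*log2(n).
    Array.Sort(buffer, (x, y) => y.Age.CompareTo(x.Age));
    // Keep 10 items; the rest of the sorting work is thrown away.
    var oldestPeople = buffer.Take(10);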

It should also be clear that LINQ will take almost the same amount of time to select the top 10 as it would to select the top 100,000. The bottleneck in the LINQ way of doing things is the sort, and it always has to sort the entire list.

My TopBy method uses a heap selection algorithm similar to the one that I discussed in my article When theory meets practice. This algorithm requires extra space proportional to k, the number of items to be selected. It takes time proportional to n*log2(k), meaning that it should be much faster than the LINQ method when selecting a relatively small percentage of the items in the list, and it will use a whole lot less memory.

One other benefit of the new TopBy method is that you can use it to select the top items from a list that won’t even fit in memory. Imagine, for example, that you have a text file that contains hundreds of millions of records, more than will fit in memory. You want to select the top 100 records from that file. You can’t do that with OrderBy because you can’t fit those hundreds of millions of records in memory. But TopBy only needs space for 100 records, so you can easily use it to select the records you want.
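
For example, assuming each record is a line of text with a sortable key, something like this reads the file lazily and never holds more than about 100 records’ worth of keys. The file name and the ParseTimestamp key selector are hypothetical:

    // File.ReadLines enumerates the file lazily, one line at a time.
    // TopBy maintains only a 100-item heap, so memory use stays small.
    var top100 = File.ReadLines("huge-records.txt")
        .TopBy(100, line => ParseTimestamp(line))
        .ToList();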

That’s the theory. Let’s see how it works in practice.

The short version is that TopBy is faster than OrderBy followed by Take when the number of items that you want to select is less than about 25% of the total number of items. For primitive types like integers, “about” is closer to 20%, and for strings it’s closer to 40%. As the cost of comparisons increases, TopBy is a better choice.

How much faster depends mostly on the ratio of selected items (k) to total items (n). When k/n is small, TopBy is hugely faster. For example, selecting the top 100 items from a list of 1,000,000 strings takes TopBy about 500 milliseconds on my machine. It takes the OrderBy method about 17 seconds.

When k/n approaches 0.25, they’re about the same speed, and when k/n is near 1.0, TopBy takes about twice as long. Again, TopBy has an advantage as the cost of key selection and comparison increases.

The first table below compares the performance of OrderBy and TopBy when selecting varying percentages of integers from different list sizes. The second table compares the performance when selecting strings.

Time (in milliseconds) to select integers

    Percent             Total Items
    Selected       100             1,000            10,000           100,000         1,000,000
              OrderBy TopBy   OrderBy  TopBy   OrderBy  TopBy   OrderBy  TopBy   OrderBy   TopBy
    0.01%                                          9.41   0.76       112   7.60     1,475   75.25
    0.10%                        0.74   0.08       8.87   0.79       112   8.18     1,479   84.73
    1%           0.04  0.01      0.73   0.10       8.89   1.20       113  14.84     1,477  176.37
    10%          0.04  0.03      0.81   0.39       9.06   5.18       116  66.84     1,536     854
    25%          0.04  0.04      0.71   0.69       9.18   9.75       116    126     1,501   1,677
    50%          0.04  0.07      0.73   1.35       9.50  14.32       118    189     1,554   2,547
    75%          0.04  0.09      0.73   1.35       9.59  17.01       118    224     1,572   3,106
    90%          0.05  0.09      0.75   1.32       9.57  18.55       123    235     1,699   3,287

Time (in milliseconds) to select strings

    Percent             Total Items
    Selected       100             1,000            10,000           100,000         1,000,000
              OrderBy TopBy   OrderBy  TopBy   OrderBy  TopBy   OrderBy  TopBy   OrderBy   TopBy
    0.01%                                         97.85   5.14     1,499  54.43    17,595     540
    0.10%                        6.63   0.53      94.79   5.40     1,455  58.73    16,410     612
    1%           0.45  0.06      7.20   0.78      91.62   8.42     1,360    122    16,670   1,364
    10%          0.53  0.14      6.54   2.40      91.11  36.16     1,376    556    16,877   6,919
    25%          0.42  0.27      6.61   4.62        100  69.69     1,337    967    17,147  13,375
    50%          0.44  0.43      7.21   7.34      90.45    104     1,299  1,481    16,854  20,561
    75%          0.46  0.52      6.47   8.60      94.60    124     1,318  1,823    17,106  24,166
    90%          0.46  0.58      6.93   9.17        105    135     1,400  1,998    17,886  26,071

The important thing to note here is that for a given list size, OrderBy followed by Take takes nearly constant time regardless of how many items are selected. That is, selecting 100 items from a list of 1 million takes approximately the same amount of time as selecting 900,000 items from the same list. In comparison, the time required by TopBy depends on the ratio of selected items to total items.

The TopBy extension method is a huge win for me because I regularly select small numbers of items from very large lists. Selecting the top 100 from a list of 1 million is a common operation in the work that I do. Being able to do that in 500 milliseconds rather than 17 seconds means that my programs run much faster. They also require a whole lot less memory.

I also find myself, sometimes, having to select the top 100 items from a list that’s too large to fit in memory. Again, that’s something that TopBy can do but OrderBy can’t. I realize that my situation is a little out of the ordinary: most people who work with hundreds of millions of records will have them in a database, where SQL can do the selection.

It’s important to point out, though, that my TopBy extension method isn’t the last word in performance. I have a custom generic selection method that’s three to five times faster and outperforms LINQ’s OrderBy even when selecting 100% of the items. Unfortunately, it doesn’t support the ThenBy and ThenByDescending methods, and it doesn’t do a stable ordering. It works more like List<T>.Sort in that I can pass it a comparison delegate. It’s useful, but using it in conjunction with LINQ is difficult.
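
For the curious, its shape is roughly this (a hypothetical signature, not the actual code):

    // Comparison-delegate based, in the style of List<T>.Sort. Fast, but
    // there's no key caching, no ThenBy support, and no stable ordering.
    public static List<T> SelectTop<T>(
        IEnumerable<T> items,
        int numItems,
        Comparison<T> comparison);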

All in all, this was an interesting and enjoyable diversion. I learned a lot about how LINQ to Objects works under the hood, and I ended up with a very useful extension method that makes my programs run faster and use less memory.

Below is the code I used to generate the data that went into the tables above. Of course, all tests were run with a release build with no debugger attached.


    // This is a fragment of a larger test class. It requires using directives
    // for System, System.Collections.Generic, System.Diagnostics, System.Linq,
    // and System.Text.
    private const int Million = 1000*1000;
    private const int Billion = 1000*Million;
    private const int NumIterations = 10;

    private static readonly double[] Percentages = new double[] {0.0001, 0.001, 0.01, 0.1, .25, .50, .75, .90};
    private static readonly int[] Sizes = new int[] {100, 1000, 10000, 100000, Million};

    private static void TestSelect()
    {
        // prime the jitter pump
        TestOrderBy(new List<int> {1, 2}, 1, 1);
        TestOrderBy(new List<string> {"foo", "bar"}, 1, 1);
        TestTopBy(new List<int> { 1, 2 }, 1, 1);
        TestTopBy(new List<string> { "foo", "bar" }, 1, 1);

        foreach (var numItems in Sizes)
        {
            foreach (var t in Percentages)
            {
                int numSelect = (int) (t*numItems);
                // Toggle these two lines to test integers or strings.
                //var items = CreateRandomInts(numItems);
                var items = CreateRandomStrings(numItems, 4, 20);
                var orderBy = TestOrderBy(items, numSelect, NumIterations);
                var topBy = TestTopBy(items, numSelect, NumIterations);
                Console.WriteLine("{0},{1:N2},{2},{3},{4}", numItems, t*100, numSelect, orderBy, topBy);
            }
        }
    }

    private static double TestOrderBy<T>(List<T> items, int numSelect, int numIter)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < numIter; ++i)
        {
            var sel = items.OrderByDescending(k => k).Take(numSelect).ToArray();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds/numIter;
    }

    private static double TestTopBy<T>(List<T> items, int numSelect, int numIter)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < numIter; ++i)
        {
            var sel = items.TopBy(numSelect, k => k).ToArray();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds/numIter;
    }

    private static readonly Random rnd = new Random();

    static List<int> CreateRandomInts(int howMany)
    {
        return Enumerable.Range(0, howMany).Select(i => rnd.Next()).ToList();
    }

    static List<string> CreateRandomStrings(int howMany, int minLength, int maxLength)
    {
        var items = new List<string>(howMany);
        for (int i = 0; i < howMany; ++i)
        {
            // Random.Next's upper bound is exclusive, so lengths range
            // from minLength to maxLength-1.
            int len = rnd.Next(minLength, maxLength);
            var sb = new StringBuilder(len);
            for (int c = 0; c < len; ++c)
            {
                int x = rnd.Next(52);
                if (x < 26)
                    sb.Append((char)(x + 'A'));
                else
                    sb.Append((char)(x - 26 + 'a'));
            }
            items.Add(sb.ToString());
        }
        return items;
    }

Some optimizations to the TopBy method

I had planned to present the results of performance testing here, comparing my TopBy extension method against the existing LINQ alternative (OrderByDescending() followed by Take()). But while testing I came up with a couple of optimizations that provide slight performance increases. So I’ll show you those optimizations today, and then follow up with the performance tests in the next day or two.

My original code used a 3-heap because testing of my DHeap class showed that a 3-heap is almost always faster than a binary heap. But when I made the decision to use a 3-heap, I didn’t take into consideration the operations that the heap is expected to perform. If you recall from my discussion of the d-heap, heap insertions (the BubbleUp method) become faster as the value of d increases. Inserting into a 3-heap is faster than inserting into a binary heap. But heap removals (the SiftDown method) are slower as d increases.

The HeapSelector<TElement> class I presented last time never does a heap insert! The BubbleUp method doesn’t even exist in the code. The DoSelect method creates an array of the first numItems items and heapifies it, and then any changes to the heap are performed by sifting new elements down. The result is that my SiftDown method does more comparisons than it would in a binary heap, and there’s no corresponding reduction in comparisons during heap insertion.

The Heapify method in a 3-heap is faster than in a binary heap, and better locality of reference with the 3-heap offsets some of the lost performance in SiftDown, but overall the 3-heap is slightly slower with this mix of operations.

So I modified the code to use a binary heap. That simplifies the SiftDown method and provides a small increase (about 2%) in speed. Here are the modified Heapify and SiftDown methods.

    private void Heapify()
    {
        for (int i = _count / 2; i >= 0; --i)
        {
            SiftDown(i);
        }
    }

    private void SiftDown(int ix)
    {
        var child = (ix*2) + 1;
        while (child < _count)
        {
            if (child+1 < _count && Compare(child, child + 1) > 0)
                ++child;
            if (Compare(child, ix) >= 0)
                break;
            Swap(ix, child);
            ix = child;
            child = (ix*2) + 1;
        }
    }

The other optimization has to do with extracting keys from records before comparing to see if an item should go onto the heap. The original code looks like this:

    // For each remaining item . . .
    while (srcEnumerator.MoveNext())
    {
        ExtractKeys(srcEnumerator.Current, _numItems);
        if (Compare(_numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

That extracts all of the keys for an item and then compares the item against the first (smallest) item on the heap. If the new item is larger than the smallest item, then the smallest item is replaced by the new item.

One of the things I learned when working with heap selection some time ago is that with typical data, especially when you’re selecting a small percentage of the total items, only a very small number of items are actually placed on the heap. In most cases the vast majority of items examined are smaller than the smallest item on the heap. That being true, it’s useful to perform as little work as possible to determine if an item will go onto the heap.

The other observation is that, when multiple sort keys are specified, the majority of items will differ in the first key. If you’re sorting people by last name and then first name, most of them will have different last names. It makes no sense, then, to extract the first name key if it will never be compared. If we had a method that extracted only the keys it needs for the comparison, we would spend a lot less time extracting keys.

With that in mind, I created the ExtractCompare method:

    private int ExtractCompare(TElement item, int ix, int x, int y)
    {
        var rslt = 0;
        var i = 0;
        while (rslt == 0 && i < _criteria.Length)
        {
            // Extract this key
            _criteria[i].ExtractKey(item, ix);
            rslt = _criteria[i].CompareKeys(x, y);
            ++i;
        }
        // If the item isn't larger, we can short circuit
        if (rslt <= 0) return rslt;

        // The new item compares larger, so it will be placed on the heap.
        // Extract the rest of the keys
        _items[ix] = item;
        while (i < _criteria.Length)
        {
            _criteria[i].ExtractKey(item, ix);
            ++i;
        }
        return rslt;
    }

You can see here that if the new item is smaller than the item it’s being compared to (which in this case will be the smallest item on the heap), the other keys aren’t even extracted. Again, why extract a key that you won’t use?
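
To make that concrete, consider a two-key selection like the last-name/first-name example above:

    var top10 = people
        .TopBy(10, p => p.LastName)
        .ThenBy(p => p.FirstName);
    // For the vast majority of items the LastName comparison alone decides,
    // so the FirstName key selector is never invoked for them.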

Using this new method requires a small change in the selection loop:

    while (srcEnumerator.MoveNext())
    {
        if (ExtractCompare(srcEnumerator.Current, _numItems, _numItems, 0) > 0)
        {
            // The current item is larger than the smallest item.
            // So move the current item to the root and sift down.
            Swap(0, _numItems);
            SiftDown(0);
        }
    }

The performance increase from this optimization isn’t very noticeable when you have just one ordering clause (i.e. no ThenBy or ThenByDescending). My testing shows that it’s only two to three percent in the general case. The effect becomes much more evident when you have multiple sort keys, or if extracting the keys is expensive.

I did have one reservation about this optimization. Somebody could write a key selector method with side effects that other parts of the program (perhaps subsequent key selections) depend on. I can’t think of a good reason why anybody would write their program to depend on the key selector’s side effects, though. I gave this quite a bit of thought and came to the conclusion that depending on a key selector’s side effects is already so fraught with peril that adding this potential problem is no big deal. Key selection shouldn’t modify global state.
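
A contrived example of the kind of selector that would break under this optimization:

    // BAD: the second key selector has a side effect. With ExtractCompare,
    // it runs only when the first keys tie (or the item goes on the heap),
    // so 'ties' no longer counts what this code seems to expect.
    int ties = 0;
    var top = people
        .TopBy(10, p => p.LastName)
        .ThenBy(p => { ++ties; return p.FirstName; });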

Together, these two optimizations resulted in a performance increase of five to seven percent in the single sort key case. Not huge by any standard, but worthwhile considering the simplicity of the changes.

Performance comparisons next time. Really.

Implementing IOrderedEnumerable, Part 2: The details

In my previous post, I developed the classes that implement IOrderedEnumerable so that we can create a chain of sort criteria that the GetEnumerator method can use to select the top items from the list. The harder part is writing that GetEnumerator method.

When I started this project, I intended to use my generic heap class, DHeap, to do the selection. I actually wrote code to do that and almost published it. But the code was convoluted, and it used a few “clever” tricks that were too ugly to post. It was also slow: faster than using OrderBy and Take, but not by a whole lot. Slow and ugly is not something I wanted to post when there was a better (and, as it turns out, much faster) way to do things.

It’s instructive to understand the way the LINQ to Objects OrderBy implementation works. When code calls OrderBy, LINQ creates a chain of ordering criteria quite similar to what I showed in my previous post. The GetEnumerator method, when it’s eventually called, first extracts the sort keys and stores them in individual arrays. For example, if the code were OrderBy(p => p.Age).ThenBy(p => p.FirstName), then two key arrays are created: one that contains all of the ages, and one that contains all of the first names. When sorting, the keys are compared and swapped as necessary.

Doing things that way costs memory, but it lets us create and compare custom keys that might take significant time to generate. Rather than generating a key each time a comparison is made, the keys are generated one time and cached. It’s less than optimal, but it’s a much more flexible way to do things.
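
As a rough sketch of that idea (not LINQ’s actual source), the keys live in parallel arrays and comparisons work on indexes; people, Age, and FirstName come from the example above:

    // Extract each key exactly once, up front.
    var ages = people.Select(p => p.Age).ToArray();
    var firstNames = people.Select(p => p.FirstName).ToArray();

    // Comparisons then index into the cached keys, falling through to
    // the next key array on ties. Swaps permute all the arrays in step.
    Func<int, int, int> compare = (x, y) =>
    {
        int rslt = ages[x].CompareTo(ages[y]);
        return rslt != 0 ? rslt : firstNames[x].CompareTo(firstNames[y]);
    };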

After studying the problem a bit (that is, after my initial solution was such a disappointment), I determined that I could use a technique that’s very similar to what the LINQ to Objects implementation of OrderBy does. But my key arrays are only the size of the heap (the number of items to be selected) rather than the size of the entire list. So if the code is supposed to select 100 items, then each key array is large enough to hold 101 items, with the last slot used as a scratch pad.

Each HeapOrderedEnumerable<TElement, TKey> instance has a _keys array and implements three abstract methods that are defined by the base HeapOrderedEnumerable<TElement> class:

  • ExtractKey(element, index) extracts a key of type TKey from the passed element and stores it in the _keys array at the specified index.
  • CompareKeys(x, y) uses the comparison delegate to compare the keys at indexes x and y.
  • SwapKeys(x, y) swaps the values in the _keys array located at indexes x and y.

The HeapOrderedEnumerable<TElement,TKey> class, then, is pretty simple. Modifying the code I posted last time, I just need to add the _keys array and the three methods that I outlined above. Here’s the completed class.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
        private readonly Func<TElement, TKey> _keySelector;
        private readonly IComparer<TKey> _comparer;
        private readonly bool _descending;

        private readonly TKey[] _keys;

        internal HeapOrderedEnumerable(
            IEnumerable<TElement> source,
            int numItems,
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending) : base(source, numItems)
        {
            _keySelector = keySelector;
            _comparer = comparer ?? Comparer<TKey>.Default;
            _descending = descending;

            // Allocate one extra key for the working item.
            _keys = new TKey[numItems+1];
        }

        public override int CompareKeys(int x, int y)
        {
            return _descending
                ? _comparer.Compare(_keys[y], _keys[x])
                : _comparer.Compare(_keys[x], _keys[y]);
        }

        public override void SwapKeys(int x, int y)
        {
            var t = _keys[x];
            _keys[x] = _keys[y];
            _keys[y] = t;
        }

        public override void ExtractKey(TElement item, int ix)
        {
            _keys[ix] = _keySelector(item);
        }
    }

I then created a class called HeapSelector that, given a list of ordering criteria (the chain of HeapOrderedEnumerable<TElement,TKey> instances) and a list of items, can select the top items that match those criteria. That class maintains an array of the items currently on the heap, and calls the ExtractKey, CompareKeys, and SwapKeys methods described above to maintain the keys for those items. Internally, HeapSelector implements a custom d-heap to do the actual selection. The public interface consists of the constructor and the DoSelect method. Here’s the public interface.

    internal class HeapSelector<TElement>
    {
        public HeapSelector(
            IEnumerable<TElement> source,
            HeapOrderedEnumerable<TElement>[] criteria,
            int numItems);

        public IEnumerable<TElement> DoSelect();
    }

The full code is shown at the end of this post.

All that’s left is the GetEnumerator method in HeapOrderedEnumerable<TElement> which, in concept, is very simple. It just has to create an array of ordering criteria to pass to the HeapSelector, create the HeapSelector, and then return the sequence generated by the DoSelect method. It’s almost that simple, except for one complication.

OrderBy and ThenBy are documented to do a stable ordering. A stable ordering simply means that items that compare equal maintain their original relative order in the output. For example, imagine that I had this list of names and ages.

    Jim,52
    Ralph,30
    Susie,37
    Mary,52
    George,47

If I were to sort those names by descending age, the first two items could be Jim followed by Mary, or Mary followed by Jim. If a stable sort isn’t specified, then either ordering is correct. But if the sort is guaranteed stable, then Jim must come before Mary in the output because Jim appeared before Mary in the original list. Note that if I reversed the sort order (i.e. sorted by ascending age), Jim would still appear before Mary.

It’s reasonable for users of TopBy to expect a stable ordering, because that’s the way that OrderBy works and, more importantly, how ThenBy is documented to work. If I want ThenBy to work with TopBy, then TopBy must do a stable ordering. That complicates things a little bit because heap selection isn’t typically a stable operation.

It can be made stable, though, by adding one final ordering criterion: the record number. In the example above, Jim would be record 0 and Mary would be record 3. Each record is assigned a unique, monotonically increasing numeric key. If the other keys compare as equal, the record numbers are compared. This guarantees the correct ordering of otherwise equal items.

It turns out that adding that final ordering isn’t terribly complicated. The code essentially tacks on a final ThenByDescending that stores the unique record number. It then walks the chain of ordering criteria to build an array that it can pass to the HeapSelector constructor. Finally, it calls the HeapSelector instance’s DoSelect method.

This whole thing turned out to be about 30 lines longer than my initial attempt, but much easier to understand. It’s also about five times faster than my original code, which is a nice bonus. The entire code is shown below. Next time we’ll do a little testing to see how it stacks up with other ways of selecting the top items.

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    namespace Heaps
    {
        public static class HeapEnumerable
        {
            public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, false);
            }

            public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, false);
            }

            public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, true);
            }

            public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
                this IEnumerable<TSource> source,
                int numItems,
                Func<TSource, TKey> keySelector,
                IComparer<TKey> comparer)
            {
                return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, true);
            }

            internal abstract class HeapOrderedEnumerable<TElement> : IOrderedEnumerable<TElement>
            {
                private readonly IEnumerable<TElement> _source;
                private readonly int _numItems;
                internal HeapOrderedEnumerable<TElement> Parent;

                protected HeapOrderedEnumerable(
                    IEnumerable<TElement> source,
                    int numItems)
                {
                    _source = source;
                    _numItems = numItems;
                }

                public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
                    Func<TElement, TKey> keySelector,
                    IComparer<TKey> comparer, bool descending)
                {
                    var oe = new HeapOrderedEnumerable<TElement, TKey>(
                        _source, _numItems, keySelector, comparer, descending);
                    oe.Parent = this;
                    return oe;
                }

                public IEnumerator<TElement> GetEnumerator()
                {
                    int numRecs = 0;
                    var recordKeySelector = new Func<TElement, int>(item => ++numRecs);

                    // Add a ThenByDescending for the record key.
                    // This ensures a stable ordering.
                    var oe = (HeapOrderedEnumerable<TElement>)CreateOrderedEnumerable(recordKeySelector, null, true);

                    // Get the ordering criteria, starting with the last ordering
                    // clause, which will always be the record key ordering.
                    var criteria = oe.GetCriteria().ToArray();

                    var selector = new HeapSelector<TElement>(_source, criteria, _numItems);
                    return selector.DoSelect().GetEnumerator();
                }

                // Walks the ordering criteria to build an array that the HeapSelector can use.
                private IEnumerable<HeapOrderedEnumerable<TElement>> GetCriteria()
                {
                    var keys = new Stack<HeapOrderedEnumerable<TElement>>();

                    var current = this;
                    while (current != null)
                    {
                        keys.Push(current);
                        current = current.Parent;
                    }
                    return keys;
                }

                IEnumerator IEnumerable.GetEnumerator()
                {
                    return GetEnumerator();
                }

                // The individual ordering criteria instances (HeapOrderedEnumerable<TElement, TKey>)
                // implement these abstract methods to provide key extraction, comparison, and swapping.
                public abstract void ExtractKey(TElement item, int ix);
                public abstract int CompareKeys(int x, int y);
                public abstract void SwapKeys(int x, int y);
            }

            internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
            {
                private readonly Func<TElement, TKey> _keySelector;
                private readonly IComparer<TKey> _comparer;
                private readonly bool _descending;

                private readonly TKey[] _keys;

                internal HeapOrderedEnumerable(
                    IEnumerable<TElement> source,
                    int numItems,
                    Func<TElement, TKey> keySelector,
                    IComparer<TKey> comparer,
                    bool descending) : base(source, numItems)
                {
                    _keySelector = keySelector;
                    _comparer = comparer ?? Comparer<TKey>.Default;
                    _descending = descending;

                    // Allocate one extra key for the working item.
                    _keys = new TKey[numItems+1];
                }

                public override int CompareKeys(int x, int y)
                {
                    return _descending
                        ? _comparer.Compare(_keys[y], _keys[x])
                        : _comparer.Compare(_keys[x], _keys[y]);
                }

                public override void SwapKeys(int x, int y)
                {
                    var t = _keys[x];
                    _keys[x] = _keys[y];
                    _keys[y] = t;
                }

                public override void ExtractKey(TElement item, int ix)
                {
                    _keys[ix] = _keySelector(item);
                }
            }

            internal class HeapSelector<TElement>
            {
                private readonly IEnumerable<TElement> _source;
                private readonly HeapOrderedEnumerable<TElement>[] _criteria;
                private readonly int _numItems;
                private readonly TElement[] _items;
                private int _count;

                public HeapSelector(
                    IEnumerable<TElement> source,
                    HeapOrderedEnumerable<TElement>[] criteria,
                    int numItems)
                {
                    _source = source;
                    _criteria = criteria;
                    _numItems = numItems;
                    _items = new TElement[numItems+1];
                }

                public IEnumerable<TElement> DoSelect()
                {
                    // Build a heap from the first _numItems items
                    var srcEnumerator = _source.GetEnumerator();
                    while (_count < _numItems && srcEnumerator.MoveNext())
                    {
                        ExtractKeys(srcEnumerator.Current, _count);
                        ++_count;
                    }
                    Heapify();

                    // For each remaining item . . .
                    while (srcEnumerator.MoveNext())
                    {
                        ExtractKeys(srcEnumerator.Current, _numItems);
                        if (Compare(_numItems, 0) > 0)
                        {
                            // The current item is larger than the smallest item.
                            // So move the current item to the root and sift down.
                            Swap(0, _numItems);
                            SiftDown(0);
                        }
                    }

                    // Top N items are on the heap. Sort them.
                    int saveCount = _count;
                    while (_count > 0)
                    {
                        --_count;
                        Swap(0, _count);
                        SiftDown(0);
                    }

                    // And then return.
                    // Have to use the Take here because it's possible that saveCount
                    // will be smaller than _numItems.
                    return _items.Take(saveCount);
                }

                // The heap's arity. This implementation uses a 3-heap.
                private const int ary = 3;

                private void Heapify()
                {
                    for (int i = _count / ary; i >= 0; --i)
                    {
                        SiftDown(i);
                    }
                }

                private void SiftDown(int ix)
                {
                    while ((ary*ix) + 1 < _count)
                    {
                        var child = (ix*ary) + 1;
                        // find the smallest child
                        var currentSmallestChild = child;
                        var maxChild = child + ary;
                        if (maxChild > _count) maxChild = _count;
                        ++child;
                        while (child < maxChild)
                        {
                            if (Compare(currentSmallestChild, child) > 0)
                                currentSmallestChild = child;
                            ++child;
                        }

                        child = currentSmallestChild;
                        if (Compare(child, ix) >= 0)
                            break;
                        Swap(ix, child);
                        ix = child;
                    }
                }

                private void ExtractKeys(TElement item, int ix)
                {
                    // Extract keys from the record into the array at index ix.
                    // Also calls the ExtractKey method for each ordering criteria.
                    _items[ix] = item;
                    foreach (var t in _criteria)
                    {
                        t.ExtractKey(item, ix);
                    }
                }

                private int Compare(int x, int y)
                {
                    // Walks the list of comparers, doing the comparisons.
                    // The first unequal comparison short-circuits the loop.
                    var rslt = 0;
                    for (int i = 0; rslt == 0 && i < _criteria.Length; ++i)
                    {
                        rslt = _criteria[i].CompareKeys(x, y);
                    }
                    return rslt;
                }

                // Swap two items. This swaps the elements in the local array,
                // and calls the Swap method for each of the ordering criteria.
                private void Swap(int x, int y)
                {
                    var temp = _items[x];
                    _items[x] = _items[y];
                    _items[y] = temp;
                    foreach (var t in _criteria)
                    {
                        t.SwapKeys(x, y);
                    }
                }
            }
        }
    }

Implementing IOrderedEnumerable, Part 1: The basics

Last time I introduced the TopBy extension method, which I want to implement for LINQ to Objects. On the surface of it, doing such a thing seems simple enough: just implement the IOrderedEnumerable interface. That turns out to be more involved than you might expect.

Documentation for IOrderedEnumerable isn’t particularly helpful. The brief description says:

Represents a sorted sequence.

There are no remarks or other detailed information. The interface has two methods: CreateOrderedEnumerable<TKey> and GetEnumerator.

If you’ve worked with LINQ at all, you know what GetEnumerator is supposed to do: generate the sequence. More correctly, it returns an enumerator that iterates through the collection, but something has to generate the items in the collection, and in keeping with the lazy nature of LINQ, we typically generate those items during enumeration.

Documentation for CreateOrderedEnumerable<TKey> isn’t much help, either. The brief description says:

Performs a subsequent ordering on the elements of an IOrderedEnumerable<TElement> according to a key.

The prototype tells us that there are three parameters: a key selector, a comparer, and a flag to indicate if the subsequent ordering should be ascending or descending. And, the only remark is:

The functionality provided by this method is like that provided by ThenBy or ThenByDescending, depending on whether descending is true or false. They both perform a subordinate ordering of an already sorted sequence of type IOrderedEnumerable<TElement>.

Not so helpful. The only clue in the documentation that even hints at an implementation strategy is in the Remarks section of ThenBy and ThenByDescending:

ThenBy and ThenByDescending are defined to extend the type IOrderedEnumerable<TElement>, which is also the return type of these methods. This design enables you to specify multiple sort criteria by applying any number of ThenBy or ThenByDescending methods.

The idea is easy enough. We want to create a chain of comparison operations so that we can apply code that does essentially this:

    compare first keys
    if first keys are equal then
        compare second keys
        if second keys are equal then
            compare third keys
            ... and so on

It all starts with the TopBy method, which has two overloads:

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector);

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer);

There are corresponding TopByDescending methods, as well, which have the same parameters.

Those methods return an IOrderedEnumerable<TSource>, so I need to create a class that implements the IOrderedEnumerable<TSource> interface. I called that class HeapOrderedEnumerable<TElement>. In its skeletal form, it looks like this:

    internal abstract class HeapOrderedEnumerable<TElement> : IOrderedEnumerable<TElement>
    {
        public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending)
        {
            throw new NotImplementedException();
        }

        public IEnumerator<TElement> GetEnumerator()
        {
            throw new NotImplementedException();
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }

If you think about it a bit, that HeapOrderedEnumerable<TElement> class isn’t sufficient because the key type varies. What we really need is a HeapOrderedEnumerable<TElement,TKey> that we derive from HeapOrderedEnumerable<TElement>.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
    }

TopBy, then, creates and returns a HeapOrderedEnumerable<TSource, TKey>. A reference to that object is passed to ThenBy, which in turn calls the CreateOrderedEnumerable method to create additional sort criteria. It’s up to us to figure out how to chain those multiple sort criteria together so that when GetEnumerator is called, items are selected appropriately.

Clear as mud, right?

The HeapOrderedEnumerable<TElement, TKey> instance has to save the parameters that are passed to its constructor so that they can be used when it’s time to generate the sequence. The class also has to save a reference to its parent so that we can properly construct the chain of comparers when the time comes. Some of the information to be saved (the source list, number of items to select, and the parent pointer) has to be accessible by the base class. The rest can be private to the derived class.

With that information, we can create internal data definitions and constructors for the classes. First, the base class:

    // Internal data and constructor for HeapOrderedEnumerable<TElement>
    private readonly IEnumerable<TElement> _source;
    private readonly int _numItems;
    internal HeapOrderedEnumerable<TElement> Parent;

    protected HeapOrderedEnumerable(
        IEnumerable<TElement> source,
        int numItems)
    {
        _source = source;
        _numItems = numItems;
    }

Note that the Parent isn’t set by the constructor. We’ll come back to it in a minute.

The derived class, HeapOrderedEnumerable<TElement, TKey>, saves the key selector, the comparer, and the descending flag.

    internal class HeapOrderedEnumerable<TElement, TKey> : HeapOrderedEnumerable<TElement>
    {
        private readonly Func<TElement, TKey> _keySelector;
        private readonly IComparer<TKey> _comparer;
        private readonly bool _descending;

        internal HeapOrderedEnumerable(
            IEnumerable<TElement> source,
            int numItems,
            Func<TElement, TKey> keySelector,
            IComparer<TKey> comparer,
            bool descending) : base(source, numItems)
        {
            _keySelector = keySelector;
            _comparer = comparer ?? Comparer<TKey>.Default;
            _descending = descending;
        }
    }

Now we can write the CreateOrderedEnumerable method. All it does is create a HeapOrderedEnumerable<TElement, TKey>, and link it to the parent so that we have a chain that we can work with when GetEnumerator is called.

    public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
        Func<TElement, TKey> keySelector,
        IComparer<TKey> comparer, bool descending)
    {
        var oe = new HeapOrderedEnumerable<TElement, TKey>(
            _source, _numItems, keySelector, comparer, descending);
        oe.Parent = this;
        return oe;
    }

And, finally, the TopBy and TopByDescending methods.

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, false);
    }

    public static IOrderedEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, false);
    }

    public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, null, true);
    }

    public static IOrderedEnumerable<TSource> TopByDescending<TSource, TKey>(
        this IEnumerable<TSource> source,
        int numItems,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        return new HeapOrderedEnumerable<TSource, TKey>(source, numItems, keySelector, comparer, true);
    }

Note that I don’t have to define the ThenBy or ThenByDescending methods. Those methods are already defined by LINQ. They call CreateOrderedEnumerable on the IOrderedEnumerable reference that’s passed as the first parameter. So, if a TopBy is followed by ThenBy, the CreateOrderedEnumerable method of the HeapOrderedEnumerable<TElement> that TopBy created gets called. If the first call is OrderBy instead, then the IOrderedEnumerable implementation that OrderBy created is the one that’s called.

The code now will create a chain of comparers from a TopBy or TopByDescending followed by any number of ThenBy or ThenByDescending calls. The chain is built backwards, but that’s okay; it’s easy enough to reverse the chain when it comes time to enumerate items.
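
Once GetEnumerator is written, the goal is for the familiar OrderBy-style usage to work. For example, assuming a list of people with Age and LastName properties:

    // Select the 10 oldest people; ties broken by last name.
    var oldest = people
        .TopByDescending(10, p => p.Age)
        .ThenBy(p => p.LastName);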

Enumerating the items involves extracting and saving keys, and doing the actual heap selection. In the next post I’ll talk a bit about key selection, and start building the GetEnumerator method. I might be able to finish next time, but there’s a lot of material to cover and it might be too big for a single post. If so, we’ll finish up after that.

Creating a TopBy method for LINQ to Objects

One reason I spent so much time last fall working on my generic heap implementation was that I wanted a TopBy extension method for LINQ to Objects. So, for example, if I have a list of integers and I want the 10 largest, I could write something like:

    var top10 = theList.TopBy(10, k => k);

Without a TopBy method, the LINQ way to do that is:

    var top10 = theList.OrderByDescending(k => k).Take(10);

That’s all well and good when the list is quite small, but when working with lists that contain millions of items, having to do a complete sort just so I can get the top 10 is hugely expensive in both memory and processing time. And if the list is larger than will fit in memory, the LINQ OrderBy solution flat-out doesn’t work.

Given my heap class, it’s easy enough to create an extension method that does something similar to what OrderBy does. If you ignore the keySelector parameter and assume that the items have a default comparer, then it’s almost trivial:

    public static IEnumerable<TSource> TopBy<TSource>(
        this IEnumerable<TSource> source,
        int numItems)
    {
        var comparer = Comparer<TSource>.Default;
        var heap = new MinDHeap<TSource>(3);
        foreach (var item in source)
        {
            if (heap.Count < numItems)
            {
                heap.Insert(item);
            }
            else if (comparer.Compare(item, heap.Peek()) > 0)
            {
                heap.ReplaceRoot(item);
            }
        }
        // and return in reverse heap order
        return heap.GetConsumingEnumerable().Reverse();
    }

Wrap that up into a static class and you can get the top 10 items from a list with:

    var top10 = theList.TopBy(10);

Adding the ability to specify a custom comparer is straightforward: something that I covered in my heap discussions. That would produce this function:

    public static IEnumerable<TSource> TopBy<TSource>(
        this IEnumerable<TSource> source,
        int numItems,
        IComparer<TSource> keyComparer);

I could probably get by with that just fine. If I need to compare on multiple fields, I can just write a different IComparer. And if I need to do some fancy key selection, I can either do it in the comparer, or write a front end filter that gets the keys that I need and builds a list of intermediate objects I can work with. It’s not something that I need to do very often.

At least, that was my thinking until I ran into a situation where building that list of temporary objects turned out to be a giant pain in the neck. Here I was able to sort with no trouble by using a key selector and several ThenBy operations, but to select the top 10 I had to jump through all kinds of hoops. It was just ugly.

Consider, for example, ordering a list of items based on how many users liked or disliked them. You want the items that have the most likes to appear on top, but if two items have the same number of likes, you want the one with the fewest dislikes to be shown first. So, given three items:

    bar, 50 likes, 8 dislikes
    foo, 10 likes, 1 dislike
    bas, 50 likes, 1 dislike

You would present them in the order (bas, bar, foo) because bar and bas both have 50 likes, but bas has fewer dislikes. The LINQ code to order them would be:

    var ordered = items
        .OrderByDescending(x => x.Likes)
        .ThenBy(x => x.Dislikes);

No custom comparer is required. But if I wanted to do that with the TopBy method shown above, I’d have to create a custom comparer and pass it. What a pain in the neck. And that’s just a simple example. If generating the keys required any calculation, I’d want to compute the keys once so that I wouldn’t waste time constructing a key every time I wanted to compare it. As I said, I could get by with the primitive method, but it’d be much nicer if TopBy worked the same way as OrderBy.
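
For illustration, here’s roughly what that custom comparer would look like, assuming an Item class with Likes and Dislikes properties:

    private class LikesThenDislikesComparer : IComparer<Item>
    {
        public int Compare(Item x, Item y)
        {
            // Most likes first . . .
            int rslt = y.Likes.CompareTo(x.Likes);
            if (rslt != 0) return rslt;
            // . . . then fewest dislikes.
            return x.Dislikes.CompareTo(y.Dislikes);
        }
    }

    // Used with the primitive TopBy:
    var ordered = items.TopBy(10, new LikesThenDislikesComparer());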

So I thought I’d look into creating TopBy and TopByDescending methods that I can use the same way that I use LINQ’s OrderBy and OrderByDescending. That turned out to be more involved than I thought it would be.

The key is the IOrderedEnumerable interface which, although simple enough on the surface, turns out to be a lot more complicated to implement than the documentation would lead you to believe. That’s where I’ll start next time.

Next: Implementing IOrderedEnumerable, Part 1

My next car

I drive a 1996 GMC Sonoma pickup truck that we bought new in February 1996. The truck has about 190,000 miles on it. It’s been wrecked and repaired twice, and I put a new engine in it at 120,000 miles. I keep up on the maintenance, so the truck still runs well, and it looks much better than you’d expect an 18-year-old truck to look. A few minor dents here and there is all.

Debra drives a 1996 Nissan Maxima that we bought new in December 1996. It has about 235,000 miles and is starting to show its age. She has to drive around town for work sometimes, so the car has to be reliable. We’ve kept up the maintenance on this car, so it runs well. But we haven’t fixed the dents from the minor wreck she had a few years ago. We had planned to replace the car by now but every time we think of getting something new we decide to wait another few months.

Debra’s new car will have to be something with four seats. We really like the Maxima and might get another. We both have a tough time with the price range, though: $30,000 to $40,000 for a car just seems crazy. There are lots of good cars in that price range, and even some nice ones in the $25,000 range (the Nissan Altima, for example). Debra wants four doors and plenty of space. Other than that, we haven’t narrowed it down. A 4 door sedan? Minivan? SUV? We’ll be looking for a long while.

I, on the other hand, just need basic transportation. The truck is handy for hauling things, but in truth I don’t often carry anything in the back. So I’m looking for something small and inexpensive to take me to work and back. I looked at the all-electric Nissan Leaf, but I can’t justify the $30,000 price tag for a basic commuter car. Even the Smart and the Fiat 500 are pretty expensive for what you get, but they’re less expensive than the Leaf and I can actually go places if I want to. The Leaf’s 100-mile range is a real drawback.

About six months ago I ran across Elio Motors, a company that’s gearing up to produce a small, affordable, and high mileage car. Last I heard (earlier this week), they expect to start shipping cars in the fourth quarter of this year.

elio-motors

It has two seats (front and back) and gets 80 MPG on the highway. City mileage will be about 50. It’s powered by a 0.9 liter, 3-cylinder engine. The best part: $6,800. Yes, $6,800. It’s the perfect little commuter car.

I’m sending in my deposit to reserve one.