Solving the right problem

As programmers, we sometimes lose sight of the problem we’re supposed to solve because we’re focused on the problem we want to solve. Or perhaps on the problem we think we need to solve. Today’s lesson is a case in point.

Imagine you have a very large collection (many thousands) of records that contain individuals’ names and their skills. The data consists of a name followed by a comma-separated list of skills. Something like:

Jim Mischel, programming, wood carving, brewing, curmudgeon

The problem is that some of the skills are misspelled or abbreviated: “mgmt” for “Management,” for example, or “hadup” instead of “Hadoop,” etc.

Your job is to go through the records and fix the misspellings.

Seems easy enough, right? Just go through the skills, look up each word in a dictionary, and correct the entry if the word isn’t in the dictionary. But there are two problems:

  1. A lot of common terms used in describing skills are domain specific and won’t be in any dictionary.
  2. Even if there were such a dictionary, how would you determine that “mgmt” is supposed to be “Management,” or that “hadup” should be “Hadoop?”

It’s at this point that we lose sight of the real problem we’re trying to solve. Being programmers, we start thinking about figuring out edit distances, machine learning, support vector machines, etc. In no time at all we’re shaving a yak.

Writing a computer program that will detect and correct misspellings like this is hard. I’m talking Ph.D. research project hard. I guarantee that the person who asked you to solve the problem isn’t willing to let you embark on a multi-year research project.

Fortunately, you don’t have to. You can solve it in a day or two with two simple programs and a little bit of manual effort.

First, write a program that builds a dictionary of all the terms used, along with how often they’re used. You’ll end up with a very large list that looks something like:

    Management,200
    Mgmt,3
    Hadoop,10
    Hadup,1
    Mamagement,1
    ...

If you make the assumption that if a term appears very often, it’s probably spelled correctly, then all you’re really concerned about are those terms that occur infrequently. You can decide on a threshold of, say, three or five, and have your program output only those terms. You’ll probably end up with a list of a few hundred infrequently used terms.

That’s where the manual work comes in. You have to look at each word, determine if it’s a misspelling, and if so find out what the correct spelling is. You manually create a file that contains the misspelled term along with the correct term. For example:

    mgmt, Management
    mamagement, Management
    hadup, Hadoop
    wood craving, wood carving
    etc.

The second program is even easier. All it does is load the replacements file into a dictionary with the key being the misspelled term, and then reads each name record. For each name record, the program checks each term against the dictionary. If it finds a misspelling that’s in the dictionary, it writes the replacement. Otherwise, it writes the original text. The algorithm looks like this:

    dict = LoadDictionary();
    create output file
    for each record in file
        output name
        for each skill in record.skills
            if skill in dict
                output replacement text
            else
                output original skill text

This approach won’t be perfect, but it will be very good. You’ll have to tinker with the threshold value a bit to get the right balance between detecting real misspellings and “false positives”: terms that are used infrequently but aren’t actual misspellings.

You’ll also have to accept that common misspellings will not be detected. Fortunately, common misspellings will be … common. So you can add them to the replacements dictionary as they’re detected, and subsequent passes over the file will catch them. You could do some work with edit distance algorithms to catch the more common misspellings, but it would be a huge amount of work for relatively little gain.

I’ll admit that it’s somewhat dissatisfying having to manually create the replacements file, but it’s hard to argue with the result. Especially I can get somebody else to do that work. Then all I have to do is create the list of terms and their counts, and write the program that will make the replacements once the manual work is done.

It’s not sexy, but it solves the problem quickly and effectively.

The Toothpaste Rant

I’m cranky this morning. Like many people, I have kind of a ritual I go through when I get out of bed. The first thing I do is wander, bleary-eyed, into the bathroom to do my business. Then I stop by the sink and brush my teeth. The day just doesn’t seem right if I can’t brush my teeth first thing.

So why am I cranky? The toothpaste tube was empty, and I grabbed one of those little sample tubes that come in the mail from time to time. Crest wanting me to try their new flavor, I guess. Heck, it’s toothpaste, right? I squeezed a bit out onto my brush, put the thing in my mouth and … nearly vomited.

High on the list of things that you just shouldn’t mess with is toothpaste. Long before I started brushing my teeth, civilization figured out that the proper flavor for toothpaste was mint. It doesn’t really matter what kind of mint, as long as it has that … well, that minty flavor. And consistency matters, too. Toothpaste is done.

But can they leave well enough alone? No! Some idiot at the Crest toothpaste company decided that they needed a cinnamon flavor. To compete with cinnamon Altoids, perhaps? Who knows what goes through their minds? I know it’s hard to believe that they even thought of making a non-mint toothpaste, but for sure their taste testers should have nixed this crap before it ever got out of the lab. The stuff tastes like bathtub caulk mixed with just a hint of fake cinnamon–maybe from one of those Atomic Fireball candies. And it left a sour aftertaste in the back of my mouth. To top it all off, it has the consistency of bathtub caulk, too. Toothpaste is supposed to be a little moist–almost runny. It’s not supposed to suck all the moisture off your tongue when you put it in your mouth.

Fortunately, I was able to get rid of that taste by gargling with Listerine, and I brushed my teeth with Scope, if you can believe it. So I’m not as cranky as I could be. But it was a close thing.

Fair warning: check the flavor on that toothpaste tube before you squeeze some out onto your brush. You do not want the kind of rude surprise that I got this morning.

And to the people who make toothpaste: Stop messing with it! Just make sure that it cleans my teeth and leaves my mouth feeling minty fresh. Some things are perfect just the way they’ve been for a hundred years, thank you very much. Spend your time on real problems, like figuring out how to make a floss that doesn’t get stuck between my teeth, or maybe a “scrubbing bubbles” mouthwash that I can swish around instead of having to use the dang brush. But make sure that it’s a mint flavor, okay?

Stop the style bashing

I might have mentioned before that I try to spend some time every day answering questions on StackOverflow. I learn a lot about the common issues that other programmers face, and I’ve picked up quite a few tips from reading answers supplied by others. I also learn quite a bit when I answer questions; sometimes providing a complete useful answer requires a lot of research.

Note that the discussion below is specific to C#. I’m not familiar with the implementation of the bool data type in C++ or boolean in Java, and C didn’t have a separate Boolean type.

I also learn about others’ misconceptions. For example, whenever somebody posts code that compares a Boolean value to true or false, at least one person will say something like “Never write == true or == false.” One person said, “Whenever you write == true or == false, God kills a kitten.”.

Those types of comments are pretty annoying. It’s just a style thing. We all know that these two conditionals are equivalent:

    if (timer1.Enabled == true) { ... }
    if (timer1.Enabled) { ... }  // the "== true" is implied

Complaining about the code containing == true is like complaining about using braces to enclose a single statement, or the extraneous parentheses in x = x + (y/z);, or getting mad because the person didn’t write x += y/z;. It’s a style thing! If you don’t like it, just move on.

Until recently, I honestly thought that’s all it was: immature griping about a minor style issue. But recent comments indicate that several people think there’s some difference in the generated code. That is, they think that if (timer1.Enabled == true) requires an extra operation to compare the values! That’s not true at all, as you can clearly demonstrate by compiling a simple program and looking at the generated code.

    System.Timers.Timer myTimer = new Timer();
    if (timer1.Enabled) Console.WriteLine("Enabled!");
    if (timer1.Enabled == true) Console.WriteLine("Enabled");

Compiling that in Release mode with Visual Studio 2013 results in the following intermediate code. I’ve added a few comments.

  // start of first conditional: if (timer1.Enabled) ...
  IL_0000:  ldarg.0
  IL_0001:  ldfld      class [System]System.Timers.Timer sotesto.Program::myTimer
  IL_0006:  callvirt   instance bool [System]System.Timers.Timer::get_Enabled()
  IL_000b:  brfalse.s  IL_0017
  IL_000d:  ldstr      "Enabled!"
  IL_0012:  call       void [mscorlib]System.Console::WriteLine(string)
  // start of second conditional if (timer1.Enabled == true)
  IL_0017:  ldarg.0
  IL_0018:  ldfld      class [System]System.Timers.Timer sotesto.Program::myTimer
  IL_001d:  callvirt   instance bool [System]System.Timers.Timer::get_Enabled()
  IL_0022:  brfalse.s  IL_002e
  IL_0024:  ldstr      "Enabled!"
  IL_0029:  call       void [mscorlib]System.Console::WriteLine(string)

Those two bits of code are identical. Each one calls the get accessor for the myTimer.Enabled property, and then branches around the WriteLine if the value is false.

Note that this is intermediate code. The JIT compiler might be able to inline the get_Elapsed method, which would eliminate the method call. Regardless, the code for these two snippets is identical. So stop with the == true bashing, already.

Free the Scratch!

The other day I overheard two women talking about baking cakes. I wasn’t eavesdropping; they were sitting at the table next to mine. Nor was I paying much attention to the conversation, engrossed as I was in the technical manual that so often serves as my lunch partner. But then one of the women said: “You bake from scratch?” That totally broke my concentration. It was time to go anyway.

As I walked back to the office, I got to wondering what this magical “scratch” stuff is that people make things from. I often hear people say that they bake from scratch, but that’s not all it’s used for. I hear programmers and builders say they’re going to “start over from scratch.” In almost every field I’ve examined, there’s somebody making things from scratch.

It took me a while to track this down, but what I discovered flies in the face of all the science we’ve been spoon fed over the years. Scratch builders have learned how to transmute matter and make wondrous things from seemingly nothing. Science teaches us that you can’t turn lead into gold–that you can’t change the fundamental properties of matter. I haven’t seen a scratch builder turn lead into gold, but I wouldn’t put it past them. Why would they want to, anyway, when they can take beans from a plant, add some butter and sugar, and come up with chocolate? Gold has nothing on chocolate, right?

Scratch is, despite what your science teachers will try to tell you, the basic building block of the universe. Forget chemistry and quantum physics with all their arcane formulas, quirks, quarks, electron orbitals, and difficult math. Those fields of study are just red herrings intended to divert your attention from where the real action is. You won’t see “Scratch Studies” at even the best universities because the powers that be don’t want the word to get out. They don’t want just anybody knowing how to transform scratch into all the wondrous things that we see every day.

The Scratch Police let amateurs dabble in the field–baking cakes or building clunky radios from spare parts. They do this because it turns out that the principles of scratch are quite simple and they can’t stop people from discovering them. But they keep a close eye to ensure that a baker, for example, doesn’t figure out that he can use the same principles of scratch to make a CD player, a car, or anything else. If that knowledge got out, the economy would crash like nothing you’ve ever seen before.

But it’s time the word got out. We can no longer be held hostage by the Scratch Police and the small group of people who hold so tightly the knowledge of scratch. My life is forfeit, I know. The Scratch Police will come track me down when they see this blog, and I doubt that Charlie will be able to protect me from them. If you see this blog, please copy and email it to everybody on your address list. Help spread the word and free us from the tyranny of those who would keep this fundamental knowledge to themselves.

Free the Scratch!

ReSharper kills comments

I’ve said before that I really like ReSharper. Apart from Visual Studio itself, it’s the most valuable development tool I have. Whereas Visual Studio increases my productivity by letting me create code faster than I would with a simple text editor and command line tools, ReSharper helps me create better code. It also provides tools that make it easier to understand unfamiliar code and refactor older code to bring it up to modern standards. Altogether, I’ll agree with my friend Dennis who says that if you’re writing C# code professionally and you’re not using ReSharper, you’re probably committing malpractice.

That said, ReSharper has a few annoying features. Fortunately, most of them can be disabled or their behavior modified through options settings. I discovered one yesterday, though, that I find particularly disturbing.

Imagine you have this declaration in your code:

Func<bool> canExecute = () =>
{
// In this case, we can always execute the function because …
return true;
};

ReSharper helpfully recommends converting that into a lambda expression. Seems a reasonable thing to do. If you tell ReSharper to go ahead and make the transformation, you end up with:

Func<bool> canExecute = () => true;

That’s right, it deleted the comment! Not a good thing.

This isn’t a particular problem because I applied the transformation individually and saw the comment disappear. But if you modify the settings to automatically apply the transformation when you run Code Cleanup, you won’t see the comments being removed. I repeat: Not a good thing.

That’s unfortunate because it forces me not to use the automatic “convert to lambda expression” function in Code Cleanup. Not a deal killer, but a little annoying.

Time management?

During the dot-com boom (1998 or 1999), one of my coworkers overheard me discussing stocks with another guy in the office. He later asked me for investment advice–a fairly odd experience for me. My investment knowledge is rudimentary at best. But I knew from previous discussions that my coworker (we’ll call him P) had some pretty significant credit card debt. The conversation went something like this:

Me: P, I can guarantee you 18% on your money.

P: Really? What stock?

Me: No stock. Just take the money you were going to buy stock with and pay off your credit card.

P: But … but … I’m wanting to save money. Invest!

Me: You’re paying 18% or more on your credit card debt. If you’re lucky you’ll average 10% on your stock investments. So even if you do well, you’re losing 8% per year. If you pay off your credit card debt, then at least you break even.

After that we got to the real issue: P was looking for tips on stocks that would skyrocket. 18%? Bah! He was looking for 1,000% gains. I told him that if I knew how to do that I wouldn’t be slaving away at the pixel mines.

I’m no investment guru, but I understand basic math.

Friends fairly regularly ask me where I find the time to do my wood carving. Some seem to think that I’m some great time management guru. Nothing could be further from the truth. I’ll admit that I’m terrible at time management. I’m easily distracted if I’m not very interested in what I have to do, and if I do get interested I’ll get lost in whatever I’m working on. It’s not uncommon for me to look at the clock and find that it’s 3:00 AM when I thought it might be approaching midnight.

Time management is not one of my strong points.

But finding time to work on my wood carving is no problem at all, and I suspect most of the people I know could do the same thing. How? Turn off your television!

According to recent Nielsen data, Americans aged 35 to 49 watch an average of 33 hours of television per week. 33 hours? That’s nearly five hours a day! And that number increases as we get older. Americans older than 65 watch an average of seven hours of television every day. What a waste.

I’ve mentioned this to people before, and they look at me like I’m crazy. “Give up my <insert name of favorite show here>? No way!”

No skin off my nose. But if you’re sitting in front of the idiot box and wondering where I find the time to do things, maybe you should re-think your priorities. My dad used to say, “We find time to do the things we think are important.” In my experience, that’s true. And if the Nielsen data is reliable, most of the people asking me where I find the time are spending five hours a day staring at the answer.

 

Crime and punishment

I was 13 years old when I discovered that I could sneak out my bedroom window at night after Mom had gone to bed. Dad was often out of town or working late and I’d normally be home before he returned, so I had a few hours after my official bed time to spend cutting up with my friends who also would sneak out.

One Sunday night a few of us pooled our resources and managed to get an older kid to buy us some beer and wine. I think there were four of us, a six pack of Budweiser, and a bottle of Boone’s Farm something or other. We went down by the creek and proceeded to get stupid.

I don’t remember a whole lot about the walk back home. Just little bits and pieces. We did discover some kid’s Big Wheel left in a yard at the top of a hill, and we took turns trying to ride it down the hill. As I recall, we were all too big to actually ride the darned thing. We’d just sit in it, try to make it roll, and then give up in disgust. Yeah, we were pretty drunk and probably making all kinds of noise in the neighborhood as we stumbled home.

I lived farthest from where we had been, so I got to go the last half mile by myself. I got tired of walking and sat down under a tree in somebody’s front yard. Next thing I remember, I was waking up under that tree, and throwing up in the grass. I got up, wiped my mouth, and stumbled the rest of the way home.

When I got home and tried my window, it was closed. I guess Mom discovered that I was sneaking out. I knew it was later than I usually got home after one of my after hours excursions, but I didn’t know how late. I tried the back door, which was locked, and every other window in the house. No dice. I wasn’t thinking too clearly and I needed some sleep, so I went to the front door and rang the bell.

Mom opened the door, took one look at me, and said, “You’re drunk. Go to bed.”

In my inebriated state, I wasn’t in much condition to consider the ramifications of my good fortune. No scolding or anything? I just went to my room, stripped off my clothes, and passed out on the bed. It was 2:00 AM.

Dad woke me up at 6:00 AM. “Son,” he said, “time to shower and head to school.” I was still drunk and tried to pull the “I don’t feel good” routine. But Dad was having none of it. He made it clear that being drunk or hung over was no excuse for skipping school.

I felt marginally better after a shower and a glass of orange juice, but that day was miserable. I went through most of the morning in a fog, and the afternoon was a giant headache. By the time school was over all I wanted was to go home and get some sleep. But when I got there, Mom had a bunch of chores for me to do. I didn’t get to bed until my normal bed time.

The only thing Dad said to me about that incident was that I can’t use my bad decisions as excuses to shirk my responsibilities. Neither he nor Mom ever mentioned it again.

They could have let me sleep, and punished me that afternoon: extra chores, restrict me to school and home, no phone privileges, etc. But making me to go to school was the most fitting punishment Dad could have devised. He forced me to face the direct consequences of my actions. Any other punishment would have been a proxy, and not nearly as effective.

I doubt that memory of a grounding would have stuck with me for 40 years like my memory of that day at school has. I rarely drink on a “school night,” and if I do over indulge, I don’t use that as an excuse to skip work. All because my parents wouldn’t let me skip school because I was hung over. I’d say they did a good job teaching me a valuable life lesson.

 

Carving a whimsical house, Part 3

This is the third in a series of posts about carving a little whimsical house from an old cedar fence post. See Carving a whimsical house, Part 1 for the introduction.

In Part 2, I finished carving the roof. The next thing to do is establish the roof line and gables, and carve the sides down in preparation for the doors and windows. To that end, the first order of business is to draw the roof line similar to this:

roofline1 roofline2

 

typhoonBudI use a Typhoon “bud” burr, shown on the right, to cut the roof line in the wood. If you
don’t have that particular burr, a flame shape or a taper would probably do just as well. This line doesn’t have to be terribly precise; just be careful not to carve away the roof.

roofline3The picture on the left shows the house after I’ve cut the roof line. You probably noticed that the bottom line looks like I cut it with the same burr. I might have. That bottom line, by the way, is where I’ll carve the rocks that the house sits on.

The next thing to do is carve the space between the roof and the base relatively flat so that the roof overhangs the house. I’ve used several different burrs for this and still haven’t decided which one I like the best. The 9/16″ Typhoon that I used for rough shaping the roof works okay, although it’s kind of hard to get the middle portion. However you do it, be careful not to go too deep. Also don’t worry about getting it perfectly flat. At this point, the most important part is to carve down the triangle that makes the gable on each end.

gable1 gable2

Above you can see that I’ve carved the gables, and then drawn lines to outline the trim board that serves to separate the wall from the gable. The next step is to relieve that board by carving down the wood above and below it.

gable3

Carving away the wood below is easy enough: just outline with the “bud” burr and then carve the rest of the wall. Carving the gable is a bit of a pain. I’m still experimenting with the best way to do that. If the area is large enough, I’ll use the bud or a flame shaped Typhoon burr. If it’s smaller, then the ruby carver seems to work pretty well. However I’ve done it, though, I’ve not been able to get a good sharp line. I keep trying.

gable4Here’s what my house looks like after carving both gables and the walls on all four sides.

Again using the Typhoon bud burr, I roughed out the rocks that make up the base of the carving. I probably made a mistake using that large burr for this, because the resulting rocks have too much space between them. I had to do a lot of smoothing and shaping work to get it to look even halfway decent in the rocksbaseend. Live and learn, I guess. I would suggest taking your time with a ruby carver or perhaps a long taper Typhoon if you have one. In any case, something smaller and less aggressive than what I used.

After I finished roughing out the rocks, I spent a little time with the flap sander to smooth the walls a bit so that I had a reasonably flat surface on which to carve the doors and windows. If you don’t have a flap sander, a sanding drum will work as will any other burr that leaves a relatively smooth surface. The surface doesn’t have to be completely free from scratches, as you’ll be cleaning it up later. But you probably want something pretty flat for the door and window frames.

Next time: doors and windows.

 

 

Carving a whimsical house, Part 2

This is the second in a series of posts about carving a little whimsical house from an old cedar fence post. See Carving a whimsical house, Part 1 for the introduction.

roof1In the first part, I cut a piece of fence post into four blocks, each 2″ x 2″ x 3″ high, and drilled a 3/8″ hole in the center. The result is shown on the left.

I should note here that all of the carving on this piece is done with the Foredom power carver using various bits and burrs. It’s important when using these tools to wear eye protection, have some type of dust collection apparatus, and wear a respirator or some other way to protect your lungs. Power carving produces a lot of dust, and you do not want that stuff in your lungs. You might also want to wear ear plugs if your dust collection system is especially noisy.

I also want to point out that this series of posts shows how I carved one house. This isn’t the only way to do it. In fact my method is constantly changing as I become more familiar with using the tools. This is only the fifth one of these houses I’ve made, so I’m still just a beginner.

With that out of the way, let’s get started.

Oh, by the way, you can click on any of these pictures to get a larger view. Although the “larger” view might not be a whole lot larger. I seem to be having trouble currently uploading larger pictures.

I find it best to start with the roof. Establishing the roof line sets the stage for the rest of the house. Also, if you carve the rest of the house and leave the roof for last, it’s very possible that you’ll run out of wood to complete the roof you want. The style of roof I’m creating here can eat up quite a bit of wood. I had to throw a Cottonwood bark house away once because I didn’t leave enough space for the roof. I was carving that one with knives and gouges.

roof2In the picture above, you can see that I’ve drawn a rough profile for the roof and chimney. Using a 1/2″ coarse Typhoon burr, I first outline the chimney.

The picture at the right shows how I begin to rough out the roof. The Typhoon burr is pretty aggressive, so I’m careful around the chimney, and I leave a lot of extra wood. I’ll come back later with a less aggressive burr and shape it.

Although the burr is aggressive, the cedar is a medium-hard wood and it takes some time to remove all that wood. Be patient and check your outline from time to time so that you don’t take off too much.

roof3It took me about 20 minutes to finish roughing out the roof to this point. That’s okay. I’m not in a race to see how quickly I can carve one of these little houses. I’d rather take a little extra time than get in a hurry and either destroy the carving or, worse, lose control of the tool and injure myself. Running that Typhoon burr over a thumb hurts. Trust me.

roof4Remember, too, that making a mistake isn’t fatal. These houses are supposed to be whimsical. They’re certainly not architecturally correct. If you inadvertently carve through the chimney, for example, don’t worry too much about it. You can always carve it to look like the chimney is falling apart. Carvers don’t make mistakes; we make adjustments.

roof5Next, I shape the chimney using a smaller and less aggressive Typhoon burr. I also put a few dips and humps in the roof surface in order to make it a little less uniformly flat, although that turns out not to be necessary for the roof style I chose; the process of adding the roof tiles makes for an irregular surface.

The last thing I did before beginning to carve the roof tiles is go over the roof with a 120 grit flap sander to remove most of the scratches left by the Typhoon burrs. Again, that’s not strictly necessary because the next step has me going over the roof with a finer ruby carver, but it’s part of the process for me–something I do regardless of the roof style I choose.

ruby1

I do all of the tiling with the flame-shaped ruby carver shown on the right. I’ve tried other bits, particularly for carving the lines between roof tiles, but they end up making deep narrow lines that I then have to spend time removing. I like the flame shape, but the ruby carver is a bit less aggressive than I’d like. It also tends to clog up on cedar, and I have to clean it now and then with a brass brush. Don’t use a steel brush.

rooftile1

I chose to do a tile roof on this house. This is something I’ve tried once before with a Cottonwood bark house, and also on an earlier Cedar house. Those two didn’t turn out so well. I experimented with one side of this roof to refine my technique. The photos and description I show here are from the second side. I think I almost have this roof style figured out.

Here you can see that I’ve outlined the first tile. The next step is to remove wood to the right and below so that the tile stands out from the roof. Then, draw lines for the next two tiles.

rooftile2When I started, I found it helpful to draw lines for the tiles before outlining each one. I got the hang of it after a while and began just carving the next tile without first drawing a line. Do what you feel comfortable with.

rooftile3Next, outline those two tiles, carve away wood to the right and below, and outline the next tile. Note that you don’t have to carve the whole rest of the roof down after outlining each tile. Instead, just carve down enough for the next tile. What you’re going for is a gentle slope from the top left to the center bottom.

rooftile4Work your way down and to the right, outlining and carving away wood for each tile until you get approximately to the center of the roof. When you’ve completed the left side, it should look something like the large photo on the right.rooftile5

The general flow of the roof should be towards the right and down. That is, tiles on the left should appear “above” tiles on the right. It’s okay if some tiles appear to stick up above where they “should” be; that’s part of the house’s whimsical nature. But do try to keep the general flow to the right and down.

rooftile6Next, start at the top right corner and do the same thing, but work down and to the left. Again, take your time. When you get close to the center, where the tiles from the left meet the tiles to the right, you’ll probably have to make some adjustments. If there’s a trick to making that come out just right, I don’t yet know what it is. I will say, however, that this is the best I’ve done so far.

rooftile7And that’s one side of the roof tiles, almost completed. You can see that I made a few mistakes, the biggest one being there on the far right where I have one tile sitting completely on top of another. It looks a bit strange, and is not what I had planned. Don’t know how I managed that.

If you have a smaller ruby or diamond carver, you might want to sharpen the edges between tiles. I suspect that with a bit more practice I’ll be able to get sharp edges between all the tiles. I did a passable job here, but some of the lines aren’t as clean as I’d like them.

That’s it for the roof tiles. Next time I’ll rough out the roof line and carve the gables.

 

 

Carving a whimsical house, Part 1

house_banner

An online carving group with which I’m involved is doing a “friendship stick” project. Each of 10 carvers makes 10 carvings from a 2″ x 2″ x 3″ block of wood, and sends one carving to each of the other carvers. Every carving has a 3/8″ diameter hole drilled through it from top to bottom. When we receive the carvings, we display them stacked on a 3/8″ dowel.

The only rules are that the carving cannot be larger than 2″ x 2″ x 3″, and it has to have that hole in the middle. Beyond that, anything goes. I decided that I’d make little houses from a cedar fence post. The picture above is one of the houses I’ve carved for the project. This blog post is the first of several parts showing how I go from fence post to finished house.

fencepost

Above is a 13 inch long piece of cedar fence post. We’re replacing the 30-year-old fence around our property, and I pulled this post from the ground a month or two ago. The wood isn’t really cedar, but rather Ashe Juniper, Juniperus ashei. The stuff grows like weeds all over Central Texas, and the wood is commonly used for fence posts. That it’s held the fence up for 30 years is good testament to its suitability for that purpose.

The first step in making a little house is turning this post fragment into blocks. And the first step of that process is making one side reasonably flat. After carefully checking for and removing any nails and fence staples, I took the piece over to the bandsaw. I set my fence about 1/2″ away from the blade, adjusted the height, and made the cut.

bandsaw1

(Yes, the picture is a little blurry.)

You can see a few ripples in the cut.  Normally I would have done with a resaw blade (a 1/2″ blade with four teeth per inch), but I had the 3/16″ blade on the saw and didn’t want to mess with changing it. The resaw blade would have made for a straighter cut, but this was good enough for my purposes.

With one flat side, I could then take the piece of wood over to the table saw to finish squaring it up. I used to do this on the bandsaw, but it’s easier to get square sides with the table saw.

tablesaw1

The picture above shows the piece with three sides squared up. After cutting the second side, I set the fence for 2″ and cut the other two sides. Then I set it for 3″ to cut the blocks to the right length.

blocks

The resulting blocks aren’t quite square because my initial cut with the bandsaw wasn’t perfect. I could have made allowances for that and squared things up on the table saw, but it just wasn’t that important to me. As long as the blocks are approximately square and don’t exceed the size limitation, it’s good enough for making these houses.

The next step is drilling a 3/8″ hole through the center of the block. Again, perfect placement isn’t terribly important, although I don’t want to be too far off. I used to do this with a hand drill, but I recently got a drill press, which makes things a bit easier. I just mark the center by drawing lines from corner to corner, and then clamp it in my drill press.

drillpressUnfortunately, my drill press’s range is about two and a half inches. So in order to drill through a 3″ block of wood I had to clamp the block in my bench vise and use a hand drill to finish the job.

drill

With the block cut and a hole in the middle, it’s time to start carving. Stay tuned.

Part 2, in which we carve the roof.

 

Rubber soles won’t save you

It’s thunderstorm season again. The overly cautious are telling everybody about the dangers of lightning, and the brave and clueless are spreading misinformation. Dangerous misinformation, at that.

First of all, the disclaimer: lightning is dangerous. In the United States, there are about 400 injuries and 40 deaths due to lightning every year. Those numbers were much higher in the past. They’ve declined as the country has become more urban and people spend more of their time indoors.

If you can hear thunder, then a lightning strike in your area is possible. The likelihood of a strike is another matter, but if you want to be safe and follow the NOAA guidelines: “When thunder roars, go indoors.”

I’m not nearly that cautious. After all, bee stings kill 25% more people per year in the U.S. than does lightning.

That said, if you decide to stay outside despite the NOAA’s advice, there are some things you should know. You shouldn’t be near the tallest tree or structure in the area. If you’re in the middle of an empty field, you’re at a much higher risk of being struck by lightning. Follow the NOAA’s outdoor guidelines.

Most importantly, don’t think that wearing shoes with rubber soles will help you. Sure, rubber is a good insulator, but all insulators have voltage ratings. Lightning melts thick glass insulators. Your wimpy little 1/2 rubber sole might as well not exist when it comes to a million-volt lightning strike.

And don’t think that cars can’t be struck by lightning. If you believe that the rubber tires on your car are protection against lightning, take a look at this video of a moving pickup truck being struck by lightning.

 

 

Vending vent

The vending area in my office building contains a snack vending machine with the screw-type dispensers, and a soda machine with eight lines. I typically walk down there around lunch time to get a drink, and sometimes I’ll get a snack.

My first choice for a drink is Coca-Cola. There are two lines of Coke (no jokes, please) in that machine, and about half the time I go to select a Coke, there isn’t any. There’s always a Coke Zero, though, and most times there’s a Dr. Pepper. I suspect the other drinks (Orange soda, Diet Dr. Pepper, and some others) are available, too.

The snack machine suffers a similar fate. Snickers bars go quick, as do the M&M’s and a few other things. I commonly see that snack machine with six or more empty lines, and the remaining selections are far down on my list of desirable snacks.

As far as I can tell, they refill these machines every two weeks. The Coca-Cola runs out in a week or less. Same with the Snickers and other popular snacks. Whoever’s operating these vending machines is losing a whole lot of potential sales. It is not true that people will select something else if their first choice is unavailable. When I want a Snicker’s bar, for example, I probably won’t settle for something else. I’ll just put my dollar back in my pocket and walk away. Cheese and peanut butter crackers are not a substitute when you’re craving a Snickers.

My dad and uncle owned a vending company back in the early ’80s. I worked there briefly between the time I left school and when I got my first programming job. They would get upset about one empty line on a vending machine. I can’t imagine what they would have done if they found one of their machines out of the most popular products for an extended period.

Odds ‘n ends

Once again, a whole month has passed without a blog entry. Blogging seems like such a heavyweight operation these days, compared to tossing something out on Facebook. I find myself posting short notes rather than exploring a topic in detail. Maybe I’m just getting lazy.

  • I started a new job at the beginning of April. I’m now working for a small company called Syvance, writing control and data acquisition software for laboratory equipment. I’m learning all about auto samplers, mass spectrometers, gas chromatography, and other equipment used in science labs, and writing software that drives the instruments.
  • You’d be surprised at just how old some of the software I’m working with is. My first project involves adding support for a new device to a program that was originally written in C for Windows 3.1. The program has been upgraded over the years, but it’s still C and much of the original code remains. In particular, there’s a cooperative multitasking scheduler that was written because Windows 3.1 didn’t have threads.
  • The manufacturer of the instrument I’m working with has supplied .NET assemblies for controlling the device, and some .NET GUI controls that display status, etc. Calling those from the main program involves making calls to a mixed mode assembly that has some MFC and some C++/CLI. That DLL in turn calls into .NET assemblies that implement the actual machine control. So I’m writing code in four different languages: C, C++, C++/CLI, and C#.
  • You could think of C++/CLI as C++ with a few extensions to support the .NET runtime, but it’s not really that simple. There are distinct differences. I’m almost ashamed to admit that I don’t fully understand how the whole mixed mode thing works. I have a general idea, but there’s a lot I don’t know. That will have to do for now, because finishing this project is more important than understanding all the nuances of the underlying platform. I have to say, though, that all those different layers make me nervous.
  • I did quite a bit of carving in May. I’ll have to post pictures at some point. There’s another bird in a branch, a whimsical house carved from basswood, a hummingbird pin, a couple of Tiki carvings, a bowl that’s almost done, and I’m sure something else that I can’t recall at the moment. I hope to have pictures of those in the next few days.
  • It seems as though we’ve had more rain this Spring than any time in the last 8 or 10 years. It’s also been much cooler. Here we are at the end of May and we haven’t topped 100 degrees yet. It would be nice to have a wetter and slightly cooler summer. The only drawback is that I’ll have to mow the lawn more than in previous years, but I can live with that.
  • Debra and I discovered the joy of Renaissance fairs earlier this Spring, and spent last weekend at one. I do need to write about that and post some pictures.
  • Living life and loving it. I’m busy with work, enjoying my hobbies, and mentally feel better than I have in years. Debra is training for another competition, and she’s looking better than ever.

Squiggle birds

When I was working on my 100 Birds Project, I saved the larger scraps from the bandsaw cutoffs. I nearly always had two cutouts that were in the shape of the bird’s side profile. The cutoffs are flat on one side and have the contour of the bird on the other side. Below are pictures of two cutoffs from a Juniper bird I did a while back, and then a bird cutout and the left side cutoff from a piece of Mesquite.

squiggle1

squiggle3squiggle2

(If you look closely, you see that the mesquite cutoff is from a different mesquite bird. But you get the idea.)

I had originally planned to take a belt sander to the curved side of the cutoffs to create a bunch of flat bird profiles that I could make into refrigerator magnets or hang from a mobile. I was working on those when my friend Mike took one and made what I call a “squiggle bird.” He sanded the flat side to follow the contour of the curved side. The result is pretty cool.squiggle4We made a couple dozen of those, the idea being that I could hang them from a mobile. But there are problems. Finding the balance point is somewhat difficult. If I attach a line just a few millimeters forward or back of the balance point, the bird takes a severe nose-up or nose-down attitude. I tried to solve that problem by attaching the line to the tail and to the head, but that just looked ugly. In addition, if I attach the line to the back, then the bird ends up tilted on the roll axis–as if it’s making a turn. I’ve yet to find a good way to attach the squiggle birds to a mobile.

Last week I got the idea to mount a couple of birds on a standing base.

2014-04-20 21.37.54-1

The bird on top is mesquite, and the one on the bottom is pecan. They’re attached to a mesquite log with little twigs, also carved out of mesquite.

That looks okay, and I made another just like it. But it seems a waste to use such a large piece of mesquite for the base, and it’d look neater and more compact if I swapped the birds’ positions. So I made another one with the new design.

birdplaque

Both of the birds are mesquite, and the base is a piece of mesquite from which I removed the bark and then sanded away most of the sap wood. The birds and the plaque got two coats of Watco Natural Danish Oil, and then the birds got a couple coats of Deft Gloss spray polyurethane. The birds are attached to the plaque with mesquite twigs that are approximately 3/8″ in diameter. The plaque is flat on the back, and about 1/2 inch thick at its thickest. It’s designed to hang on a wall rather than stand on a desk or shelf.

I really like how this one turned out. I suspect I’ll be making more like it. My biggest problem might be picking out wood combinations that look good together.

 

Inconceivable!

Every time I hear President Obama say something “will not be tolerated,” I’m reminded of Vizzini’s “inconceivable!”

The silly sanctions we’re placing on Russian officials will have approximately the same effect as the Bush administration’s pointless sanctions on North Korea back in 2006; banning the export of iPods, Segway scooters, and other luxury items so that Kim Jong-Il wouldn’t be able to give trinkets to his cronies.

But the administration has to bluster and posture and appear to be “doing something about the problem” even though the president and anybody with more brains than your average mailbox knows that we are completely powerless: Russia will do what Putin wants, openly and without fear of reprisal.

Why do I think we’re powerless? First, military confrontation is out of the question. Russia isn’t a little country with an insignificant military that we can overpower with a division or two of National Guard troops. Air strikes, even if we were stupid enough to try and could get past the Russian defenses, would be met with major reprisals and ultimately result in a real war that nobody wants and if fought would destroy the world economy.

Not that we could make any kind of military strike. President Obama isn’t dumb enough to try convincing us that Ukraine is worth another foreign excursion, and no member of Congress who is up for re-election this year is likely to approve such a thing even if the president were to suggest it.

Second, serious economic sanctions are out of the question because large parts of Europe depend on Russian natural gas for heat. Nobody in Europe wants to risk upsetting Putin, because he could easily turn off the spigot. The first cold snap come Fall would have every citizen in Europe demanding that The Powers That Be let Russia have whatever they want, so long as the gas is turned back on. Putin isn’t beyond punishing our European allies in response to American “provocation.”

It does make for good theater, though. Too bad we can’t bottle all the hot air from both sides. Between Obama’s rhetoric, Putin’s laughable denials, and Republicans’ excoriating the president for his lack of action, we could make a serious start to reducing the amount of fossil fuels burned to keep warm next Winter.

Random selection

If you know how many items are in a collection, picking one at random is trivial: just select a random number from 0 to count-1, and you’re done. Things get more interesting when you don’t know how many items are in the collection. The naive way to do the selection is to first make an array. Then you can use the solution I just mentioned. For example:

    readonly Random _rnd = new Random();

    int GetRandomItem(IEnumerable<int> items)
    {
        var a = items.ToArray();
        var ix = _rnd.Next(a.Length - 1);
        return a[ix];
    }

Whereas that works, it can be expensive or even impossible to create a new array if the collection is very large.

When the question is modified to include the restriction that the collection won’t fit in memory (say, a text file with hundreds of millions of lines), things get more interesting still. You could go through the file to count the lines, pick a random number from 0 to count - 1, and then go back through the file to read that many lines. In code:

    var count = File.ReadLines(filename).Count();  // reads the whole file
    var ix = _rnd.Next(count - 1);
    var sel = File.ReadLines(filename)  // potentially reads the whole file
        .Skip(ix-1)
        .First();
    return sel;

If the collection is a file, then you end up reading the file once, and at least part of the file twice. On average, you end up reading the file halfway through the second time. It works, but it’s pretty expensive.

Worse, it’s impossible to use that technique if the collection is a bunch of records coming in from a network stream. It’s not like you can rewind the stream and start over after you figure out how many records there are.

There is a way to do this that doesn’t require knowing up front how many items are in the collection, and guarantees that each item has the same probability of being selected. Using this technique, you can select any number of items in a single pass through the collection. Provided, of course, that you have enough memory to hold the selected items.

Imagine that you are asked to select an item at random from a sequence of items presented to you one at a time. You don’t know how many items will be presented, and you have to prove that each item was given equal opportunity to be selected. You can’t decide to pick the first item every time, and you can’t decide to pick the fifteenth item, for example, because there might not be fifteen items presented. What to do?

First, since you always have to be prepared to present a selected item, it follows that you must select the first item because there might be only one item presented. But you can’t unconditionally keep the first item. When the second item is presented, you have to decide whether you should keep the first item, or replace it with the second item. How? By flipping a coin. Heads you keep the first, tails you replace it. Assuming a fair flip of a fair coin, then after two items half the time you’ll be holding the first item, and half the time you’ll be holding the second item.

Now the third item comes along. How do you decide whether to keep your previously selected item, or if you should replace it with the third item? You can’t flip a coin because doing so would mean that half the time you’d keep the third item. You want to keep the third item one third of the time.

What you really want is to roll a three-sided die and if it turns up a 3, replace the currently selected item with that third item. It should be obvious that the third item will be selected one third of the time.

If you don’t believe that this technique treats all items equally (i.e. gives every item an equal chance of ending up as the selected item), consider the possibilities with three items.

First Coin Die Result
1 Heads 1 1
1 Heads 2 1
1 Heads 3 3
1 Tails 1 2
1 Tails 2 2
1 Tails 3 3

Those are all of the possibilities when selecting from three items. Each number is selected in two of the six cases, or one third of the time.

If you continue in this manner, selecting the nth item only if rolling an n-sided die turns up n, then when you’ve reached the end of the list you will have given each item an equal chance of being selected.

Let’s see if that really does work. The program below uses the described technique to randomly select a value from 1 to 100. The program does that 10 million times and reports the results.

    private void DoIt()
    {
        const int arraySize = 100;
        const int iterations = 10*1000*1000;
        var selected = new int[arraySize];

        for (var i = 0; i < iterations; ++i)
        {
            var sel = SelectRandomInt(arraySize);
            ++selected[sel];
        }

        for (var i = 0; i < selected.Length; ++i)
        {
            Console.WriteLine("{0}: {1} ({2:P4})", i, selected[i],
                (double) selected[i]/iterations);
        }
        Console.WriteLine("Expected value = {0:N0}", iterations/arraySize);
        Console.WriteLine("Min value = {0:N0}", selected.Min());
        Console.WriteLine("Max value = {0:N0}", selected.Max());
    }

    private readonly Random _rnd = new Random();

    private int SelectRandomInt(int max)
    {
        var selected = -1;
        for (int i = 0; i < max; ++i)
        {
            if (_rnd.Next(i + 1) == i)
                selected = i;
        }
        return selected;
    }

The “expected” number that’s output is just the number of iterations divided by the number of items. In this case, 100,000. That’s the number of times you would expect any item to be selected, given a sufficiently large number of trials. This program outputs the number of times each item is selected, but below I show just the minimum and maximum.

    Expected value = 100,000
    Min value = 98,012
    Max value = 101,257

Not surprisingly, not every item was selected exactly 100,000 times. But it was close. The maximum variance was less than 2% on the low side, and about 1.3% on the high side. How does that compare with just selecting a random number from 0 to 99?

I replaced the SelectRandomInt method above with one that just returns a random number from 0 to max-1, like this:

    private int SelectRandomInt(int max)
    {
        return _rnd.Next(max);
    }

That gave me:

    Expected value = 100,000
    Min value = 99,069
    Max value = 100,842

This second version consistently gives me results that are closer to the expected values. The variance is approximately half. That is, where the first version might vary by 2% on the low side, this version varies by 1%. I think the difference is due to the random number generator being slightly biased (that is, it doesn’t produce a perfectly uniform distribution), and because we’re selecting many more random numbers with the first version than with the second, that bias is amplified. It’s not so far out of whack that I find it unusual, but I do think it merits further study.

Random number weirdness aside, you can see here that I was able to randomly choose one item from a collection of unknown size.

You wouldn’t use that technique to select a random number from 0 to 99. A real method would take an IEnumerable<T> and return a randomly selected item. The code looks like this:

    T SelectRandomItem<T>(IEnumerable<T> source, Random rnd)
    {
        T selectedItem;
        int nItems = 0;
        foreach (var item in source)
        {
            ++nItems;
            if (rnd.Next(nItems) == 0)
            {
                selectedItem = item;
            }
        }
        return selectedItem;
    }

Note that I modified the test a bit. In the previous version I checked to see if the selected random number was equal to the top of the range. Here, I test to see if it’s equal to zero. The effect is the same because the conditional is selecting one item from n possible items. The probability that the random number will be 0 is the same as the probability that it’s equal to (nItems - 1). The reason I changed it here is that it’s easier for me to think about it this way, and it simplifies the task of modifying this code to select multiple items at random.

Using that code is very simple. If you want to select a line from a large text file, you can do it in a single line of code:

    string selected = SelectRandom(File.ReadLines("foo.txt", new Random()));

You might ask why I pass in the random number generator rather than creating a new one when the method is called, or expecting there to be one at class scope. There are several reasons.

The default Random constructor seeds the random number generator with the current time, which is expressed in milliseconds. If I were to call SelectRandomItem multiple times in a single millisecond (which could happen when working with short lists), I’d be very likely to select the same item each time. Consider, for example, selecting from two lists:

    IEnumerable<string> firstNames = LoadFirstNames();
    IEnumerable<string> lastNames = LoadLastNames();
    var firstName = SelectRandomItem(firstNames);
    var lastName = SelectRandomItem(lastNames);

If firstNames and lastNames were very short, then whenever I selected the third item from one, I’d be very likely to select the third item from the other. That wouldn’t make for a very good random name generator.

Programmers often want a repeatable random sequence, especially for testing purposes. That’s not possible if the method creates a new random number generator with every call. It’s possible if there is a random number generator dedicated to the SelectRandomItem method, but if any other piece of code uses that random number generator or calls SelectRandomItem, then the sequence is lost. If I want to guarantee a repeatable random sequence, then I need to supply the random number generator to the method that uses it.

Selecting multiple items

The algorithm to select multiple items in a single pass is a straightforward generalization of the single item selection algorithm. Rather than dealing with a single “currently selected” item, we maintain an array of them. And to decide whether to select a particular item, we get a random number and compare it with the number of items to be selected. So, we replace this line in the inner loop:

    if (_rnd.Next(nItems) == 0)

With this one:

    if (_rnd.Next(nItems) < numToSelect)

If the random number is smaller than the number of items to select, then we have to randomly replace one of the selected items. The resulting method looks like this:

    T SelectRandomItems<T>(
        IEnumerable<T> source, int numToSelect, Random rnd)
    {
        T[] selectedItems = new T[numToSelect];
        int nItems = 0;
        foreach (var item in source)
        {
            ++nItems;
            if (rnd.Next(nItems) < numToSelect)
            {
                var ix = rnd.Next(numToSelect);
                selectedItems[ix] = item;
            }
        }
        return selectedItems;
    }

When selecting more than a single item, the output from that method is surprisingly similar to the output from a method that selects unique random numbers in the same range. So similar, in fact, that I can’t tell them apart. I find that a little odd. There is a definite difference in the results from selecting a single item, but when selecting multiple items the difference is not detectable. I do need to investigate why that difference in behavior exists.

And that’s all there is to it: a method that will fairly select one or more items at random from a collection of unknown size, and do it with a single scan of the entire collection. If there is a better way to do this, I don’t know what it is.

Source

I first ran across this problem in Jon Bently’s Programming Pearls. He cites Volume 2: Fundamental Algorithms of Knuth’s The Art of Computer Programming. See section 3.4.2: Random Sampling and Shuffling.

The Aho-Corasick string search algorithm

Aho-Corasick String Search

The Aho-Corasick string search algorithm that I mentioned last time works by first building a finite state machine from the list of strings you want to find. After the state machine is built, the input is run through the state machine to find the matches. The key to understanding how the algorithm works is understanding how the state machine works. Fortunately, the original paper, Efficient String Matching: An Aid to Bibliographic Search, by Alfred V. Aho and Margaret J. Corasick, explains the state machine quite well. In fact, the paper is well enough written that I had little difficulty implementing the entire algorithm from it as my only reference.

Last time I said that I was going to explain how the algorithm works. After re-reading the paper and trying to describe the algorithm, I found myself just paraphrasing the original paper. So rather than do a bad job of explaining things, I’ll point you to the paper itself. Download it. Read it. Think about it. Read it again. It is quite approachable. I wish all the research papers I read were as well written.

Building the state machine is a multi-step process that involves first creating a goto function that defines the state transitions, and an output function that defines the outputs (if any) from each state. This process involves reading each keyword character-by-character and inserting the characters into the state machine. Then, a failure function is generated by walking the nodes in the goto function, determining the next node to visit if a failure occurs. An optional final step can be added in which the generated goto and failure functions are combined into a single state transition table. This final step makes walking the state machine easier and more efficient.

The original paper provides pseudocode examples from which it’s pretty easy to create working code. I’ve created two two classes: StateMachine, which contains the completed state machine and the searching function, and StateMachineBuilder, which builds a StateMachine from a list of key words. In the past I’ve combined these two things into the same class, but I found the design cumbersome. Having two classes seems like a cleaner design to me. Building the state machine involves a lot of intermediate data structures that the searching code shouldn’t have to be troubled with. Separating the classes gives the searcher access to only the data that it actually needs.

The public interface to the StateMachineBuilder class is pretty simple. The constructor initializes the state machine. You can then call the AddKeyword method to add each key word to the state machine. When all key words are added, a call to the CompleteAdding method will complete processing and return a reference to the generated state machine.

A convenience function, StateMachine.Create(IEnumerable), allows you to create a StateMachine instance from a sequence of keywords in a single call. It’s equivalent to creating a StateMachineBuilder, adding all the keywords, and then calling CompleteAdding.

The public interfaces, then, look like this:

    public struct SearchResult
    {
        public readonly string Keyword;
        public readonly int Position;
    }

    public class StateMachine
    {
        public static StateMachine Create(IEnumerable keywords);

        public IEnumerable<SearchResult> Search(IEnumerable<char>);
    }

    public class StateMachineBuilder
    {
        public StateMachineBuilder();
        public void AddKeyword(string keyword);
        public void AddKeywords(IEnumerable<string> keywords);
        public StateMachine CompleteAdding();
    }

Imagine that you’re looking in a very large text file for mentions of any from a list of 100 names. Here’s one way you could do it:

    var names = GetListOfNames();
    var machine = StateMachine.Create(names);
    var lineNumber = 0;
    foreach (var line in File.ReadLines(inputFilename))
    {
        ++lineNumber;
        foreach (var result in machine.Search(line))
        {
            Console.WriteLine("Found '{0}' at position {1} on line{2}.",
                result.Keyword, result.Position, lineNumber);
        }
    }

If the file has no line breaks then File.ReadLines will try to read the entire file at once. That will cause a problem if the file is large. In that case you need to create a function that will return the file’s characters one at a time, and feed that to the Search method:

    foreach (var result in machine.Search(ReadFileByChars(inputFilename))
    {
        Console.WriteLine("Found '{0}' at position {1}",
            result.Keyword, result.Position);
    }

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            while (!reader.EndOfStream)
            {
                char c = (char)reader.Read();
                yield return c;
            }
        }
    }

That’s a particularly inefficient way to read a file, but it’ll do the job. You can speed it up by reading a block of characters into a buffer and then returning characters from the buffer one at a time. Each time you reach the end of the buffer, read another buffer full. Repeat until you’ve reached the end of file. Like this:

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            var buffer = new char[65536];
            int charsRead;
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (var i = 0; i < charsRead; ++i)
                {
                    yield return buffer[i];
                }
            }
        }
    }

That will perform faster than the previous version because it makes many fewer calls to the runtime library.

So that’s what we’re going to build: methods that build the state machine and then use the state machine to search for strings in text. Next time we’ll dive into building the goto and output functions. In the meantime, you should download and read the original Aho-Corasick paper. It really is approachable, and explains the techniques very well.

The password rant

Every time I have to come up with a password for a Web site, I end up spending 10 minutes trying to find one that I can remember and that will conform to the rules for that site. You’d think there’d be some kind of standard. But, no … every site has its own bunch of rules.

One site requires the password to be “at least eight characters long.” Others specify a minimum length and require at least one number, one capital letter, and one special character. Others specify an absurdly short maximum length. They’ll limit the characters you can use, make you use at least two numbers, prevent you from using any word that’s in an English dictionary, or for all I know require that you type it with your toes while standing on your head. It’s maddening.

I could understand this idiocy 10 years ago. Today it’s simply unacceptable. The key to a good password is that it be long, unusual, and easy to remember. But the most important point is length. Next most important is easy to remember. And I guarantee that a password like “stupid.password.rules.suck” is easier to remember and harder to break with a brute force attack than something like “Stup1d.pw0rd”.

To make matters worse, sites often don’t tell you what the password rules are. It’ll say, “enter password.” So you enter something like “my.mom’s.maiden.name” and it will say, “Passwords must contain at least one number and one capital letter.” So then you enter “3rd.rock&Roll.act” and it will say, “Passwords may only contain the characters [list of characters that doesn't include '&']“. So then you type, “Please.save.M3.From.Stupid.Password.Rules” and it will say, “Passwords must be between 8 and 14 characters in length.”

Why can’t they tell me all of the rules up front? Do they think I’m here to play 20 passwords? This evening my hosting provider told me that my password had expired and that I need to create a new one. Fine. I’m now on my fifth attempt at creating a password that I can remember, that others won’t likely guess, and that fits the rules that they’re revealing to me one at a time.

People who create Web sites: please update your password rules. Force me to create a password that’s at least 15 characters long, and let me put whatever characters I want in it! If you really have to put a limit on the length, make it something reasonable like at least 32 characters. Limiting me to 12 (a banking site, no less) or 14 characters and making me choose from three or four different classes of characters to boot just pisses me off and makes me think that I should find some other place to take my business.

 

Searching for strings in text

Every mainstream programming language’s runtime library contains a function that will find a substring inside a string. The .NET runtime library, for example, has several such methods. String.IndexOf in its many overloaded variants returns the position of the search string in the larger string, or -1 if the substring is not found. There’s also String.Contains and String.LastIndexOf, as well as Regex.Match and several others.

The String functions are great for determining if a particular string occurs in some text. But if you want to search for matches of multiple strings, you have to search for each one individually. For example, if you want to search for occurrences of the strings {he, she, his, hers}, you’d have to write something like this:

    string[] stringsToFind = new string[]{"he", "she", "his", "hers");
    foreach (var s in stringsToFind)
    {
        int pos = 0;
        while ((pos = text.IndexOf(s, pos)) != -1)
        {
            Console.WriteLine("Found {0} at position {1}", s, pos);
            pos += s.Length;
        }
    }

That will print almost every occurrence of each word. In some situations it will miss because it skips over the beginning of a match. For example, if you’re searching for the string “abab” in the text “abababab”, it would only find matches at positions 0 and 4. It would miss the occurrence at position 2. You can fix that by changing the increment to ++pos, which will cause it to restart the search at each character. It’ll just take longer.

The code ends up scanning the entire text for every word to be found. That’s not a problem when the text is short and the number of words to be located is small. But if you’re looking for occurrences of 1,000 different words in a gigabyte-sized document, this is going to take a long time. The amount of time taken by this algorithm is proportional to the number of words (we’ll call it N) times the length of the document squared (call it M). That is, the complexity is O(N * M2).

Why squared? Because it’s possible that the code would have to start the search at every character in the string. Think of searching for all occurrences of the string “aa” in the text string “aaaaaaaaaaaaaaaaaaaaaaaa…” (‘a’ repeated a billion times). That wouldn’t take forever, but close enough. If the computer could do a billion searches per second, it would only take a billion seconds (almost 32 years). Ouch.

The regular expression methods can do things much faster in the general case because they can search for all four strings in a single scan of the text.

    string text = "those shells aren't hers or his.";
    Regex re = new Regex("he|hers|she|his", RegexOptions.Compiled);
    Match m = re.Match(text);
    while (m.Success)
    {
        Console.WriteLine("Found {0} at position {1}.", m.Value, m.Index);
        m = m.NextMatch();
    }

Here’s the output from running that code:

    Found she at position 6.
    Found he at position 20.
    Found his at position 28.

Again, you see that it missed some occurrences. It didn’t find “he” at position 6 or “hers” at position 20. I could fix the missing “he” at position 6 by incrementing the regular expression starting position, but there’s nothing I can do about it not finding “hers”. Well, I could change the regular expression so that it’s "hers|he|she|his", which would return “hers” at position 20 but then wouldn’t find “he” at the same position. The same thing happens if I change the regular expression to "he(rs)?|she|his". A regular expression can’t find two matches at the same position.

I freely admit that I’m no expert with regular expressions. Even so, I was somewhat surprised by the regular expression finding “he” rather than “hers” at position 20. I wrongly thought that regular expression matching behavior defaulted to “longest left-most,” when in fact it’s a traditional NFA engine that, according to the documentation, accepts the first match it finds.

But “first match” doesn’t explain why it matches “hers” instead of “he” when I reverse the order of the words in the regular expression. After all, the engine should see “he” and stop, right? There’s something about the way the regex engine works with alternation that I don’t fully understand. I realize that I can solve the problem in this case by combining the two alternatives into "he(rs)?", but that becomes problematic when constructing a regular expression at run time from a list of words.

Understanding why alternation works this way would be nice, but I’m not so curious that I’ll spend time digging into the implementation in order to find out. For now I’ll accept that order matters in alternation constructs.

Although the regular expression version will be faster in the general case, it, too, will take essentially forever in pathological cases. And, as you’ve seen, it can’t find everything and it can’t easily be changed to do so.

Another problem is that both techniques require that the entire text be held in memory. If you’re searching for strings in a file that’s larger than available memory, you can’t use either of these methods without writing code that processes the file in chunks.

There is a better and faster way that can find all occurrences (given “hers”, for example, it will find both “he” and “hers”), doesn’t suffer from the pathological worst-case behavior, and that can process text on a character-by-character basis. It’s called the Aho-Corasick string search algorithm. Developed in the mid 1970s, it’s a surprisingly simple algorithm that scales well, is reasonably easy to implement, and performs excellently even with a straightforward implementation. I’ve used it to search for occurrences of millions of different strings among many hundreds of gigabytes of text.

Next time I’ll explain how the algorithm works, and start building code to implement it.

Hand made beer mug

Debra and I went to the Sherwood Forest Faire last weekend, and had a great time. I felt a little out of place, though, wandering around in costume and drinking mead from a plastic cup. Many people had wooden mugs, and there were mugs available for purchase at several shops. Although they were attractive, I didn’t get one. First of all, the least expensive mug I saw for sale was $45. That’s a lot for a hollowed-out piece of wood.

More importantly, the mugs they had for sale were too pretty. Most were perfectly turned on a lathe, had perfectly shaped handles, and were coated with a thick epoxy resin that turned the wooden mug into what was essentially a modern plastic mug that just looks like wood.

Sunday morning I went out to the wood pile and grabbed a hunk of mesquite that seemed about the right size. My plan was to carve a mug that would hold 16 ounces.

mug1

After cutting it to size and removing the bark with an angle grinder, I drilled a hole in the center to get started, and then used a die grinder and the Foredom power carver to hollow it. This turned out to be more difficult than I had expected due to the depth. It’s distressingly easy to lose control of the tool when you can’t see where it’s cutting. An aggressive bit spinning at 15,000 RPM will throw a piece of wood across the room when it binds up. I ended up putting a few cracks in the cup that way.

mug2

It took a long time to thin the walls from there (using a hook knife), and sand it smooth. I also had to fill the cracks that I made while hollowing. I didn’t take any pictures until after I’d completed the cup and put the oil and wax finish on it.

mug3

The finished mug is 4-1/4 inches tall and between 3-1/4 and 3-1/2 inches wide. The width varies because the mesquite log that it came from wasn’t perfectly round.

mug4

It holds just a little more than 12 ounces. I just couldn’t get it any deeper with the tools that I currently have. I’ll have to come up with a better way if I’m going to make another one of these. I’ve considered hollowing from both ends and then carving a 1/4 inch thick piece to fit inside at the bottom. The only difficulty would be getting that bottom piece carved correctly. I’m also considering a few other options.

mug5

The residue you see floating in the cup there is a little of the oil and wax. Not to worry: the oil and wax are non-toxic. I was impatient to try the mug. It really is a pleasure to drink from.

I decided not to add a handle. Maybe next time. I’ll probably come up with some sort of leather sling that fits around the mug and lets me attach it to my belt so that I can carry it around the next time we go to one of those Renaissance festivals.