Inconceivable!

Every time I hear President Obama say something “will not be tolerated,” I’m reminded of Vizzini’s “inconceivable!”

The silly sanctions we’re placing on Russian officials will have approximately the same effect as the Bush administration’s pointless sanctions on North Korea back in 2006: banning the export of iPods, Segway scooters, and other luxury items so that Kim Jong-Il wouldn’t be able to give trinkets to his cronies.

But the administration has to bluster and posture and appear to be “doing something about the problem” even though the president and anybody with more brains than your average mailbox knows that we are completely powerless: Russia will do what Putin wants, openly and without fear of reprisal.

Why do I think we’re powerless? First, military confrontation is out of the question. Russia isn’t a little country with an insignificant military that we can overpower with a division or two of National Guard troops. Air strikes, even if we were stupid enough to try them and could get past the Russian defenses, would be met with major reprisals, ultimately leading to a real war that nobody wants and that, if fought, would destroy the world economy.

Not that we could make any kind of military strike. President Obama isn’t dumb enough to try convincing us that Ukraine is worth another foreign excursion, and no member of Congress who is up for re-election this year is likely to approve such a thing even if the president were to suggest it.

Second, serious economic sanctions are out of the question because large parts of Europe depend on Russian natural gas for heat. Nobody in Europe wants to risk upsetting Putin, because he could easily turn off the spigot. The first cold snap come Fall would have every citizen in Europe demanding that The Powers That Be let Russia have whatever it wants, so long as the gas is turned back on. Putin isn’t above punishing our European allies in response to American “provocation.”

It does make for good theater, though. Too bad we can’t bottle all the hot air from both sides. Between Obama’s rhetoric, Putin’s laughable denials, and Republicans’ excoriating the president for his lack of action, we could make a serious start to reducing the amount of fossil fuels burned to keep warm next Winter.

The Aho-Corasick string search algorithm

The Aho-Corasick string search algorithm that I mentioned last time works by first building a finite state machine from the list of strings you want to find. After the state machine is built, the input is run through the state machine to find the matches. The key to understanding how the algorithm works is understanding how the state machine works. Fortunately, the original paper, Efficient String Matching: An Aid to Bibliographic Search, by Alfred V. Aho and Margaret J. Corasick, explains the state machine quite well. In fact, the paper is well enough written that I had little difficulty implementing the entire algorithm from it as my only reference.

Last time I said that I was going to explain how the algorithm works. After re-reading the paper and trying to describe the algorithm, I found myself just paraphrasing the original paper. So rather than do a bad job of explaining things, I’ll point you to the paper itself. Download it. Read it. Think about it. Read it again. It is quite approachable. I wish all the research papers I read were as well written.

Building the state machine is a multi-step process that involves first creating a goto function that defines the state transitions, and an output function that defines the outputs (if any) from each state. This process involves reading each keyword character-by-character and inserting the characters into the state machine. Then, a failure function is generated by walking the nodes in the goto function, determining the next node to visit if a failure occurs. An optional final step can be added in which the generated goto and failure functions are combined into a single state transition table. This final step makes walking the state machine easier and more efficient.
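
To make those steps concrete, here’s a rough sketch of the construction, assuming a dictionary-based trie representation. The names (Node, BuildTrie, BuildFailure) are mine, not the paper’s, and this isn’t necessarily how the StateMachineBuilder described below will do it:

    // Assumes using System.Collections.Generic.
    // A state in the machine: its goto transitions, its failure link,
    // and the keywords (if any) that end at this state.
    class Node
    {
        public Dictionary<char, Node> Goto = new Dictionary<char, Node>();
        public Node Failure;
        public List<string> Output = new List<string>();
    }

    // Goto and output functions: insert each keyword character by
    // character; the keyword is recorded at the state where it ends.
    static Node BuildTrie(IEnumerable<string> keywords)
    {
        var root = new Node();
        foreach (var keyword in keywords)
        {
            var node = root;
            foreach (var c in keyword)
            {
                Node next;
                if (!node.Goto.TryGetValue(c, out next))
                {
                    next = new Node();
                    node.Goto.Add(c, next);
                }
                node = next;
            }
            node.Output.Add(keyword);
        }
        return root;
    }

    // Failure function: walk the trie breadth-first so that each node's
    // failure link is computed after those of all shallower nodes.
    static void BuildFailure(Node root)
    {
        var queue = new Queue<Node>();
        foreach (var child in root.Goto.Values)
        {
            child.Failure = root;   // depth-1 states fail back to the root
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var pair in node.Goto)
            {
                var child = pair.Value;
                // Follow failure links until some state has a transition
                // on this character; that state's target is the fallback.
                var f = node.Failure;
                while (f != root && !f.Goto.ContainsKey(pair.Key))
                    f = f.Failure;
                Node target;
                child.Failure = f.Goto.TryGetValue(pair.Key, out target)
                    ? target : root;
                // A state also emits everything its failure state emits.
                child.Output.AddRange(child.Failure.Output);
                queue.Enqueue(child);
            }
        }
    }

The search phase then walks the input one character at a time, following goto transitions where they exist and failure links where they don’t, emitting each state’s output list along the way.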

The original paper provides pseudocode examples from which it’s pretty easy to create working code. I’ve created two classes: StateMachine, which contains the completed state machine and the searching function, and StateMachineBuilder, which builds a StateMachine from a list of key words. In the past I’ve combined these two things into the same class, but I found the design cumbersome. Having two classes seems like a cleaner design to me. Building the state machine involves a lot of intermediate data structures that the searching code shouldn’t have to be troubled with. Separating the classes gives the searcher access to only the data that it actually needs.

The public interface to the StateMachineBuilder class is pretty simple. The constructor initializes the state machine. You can then call the AddKeyword method to add each key word to the state machine. When all key words are added, a call to the CompleteAdding method will complete processing and return a reference to the generated state machine.

A convenience function, StateMachine.Create(IEnumerable<string>), allows you to create a StateMachine instance from a sequence of keywords in a single call. It’s equivalent to creating a StateMachineBuilder, adding all the keywords, and then calling CompleteAdding.
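
In other words, Create is presumably just a thin wrapper, something like this (a sketch, not necessarily the actual implementation):

    public static StateMachine Create(IEnumerable<string> keywords)
    {
        var builder = new StateMachineBuilder();
        builder.AddKeywords(keywords);
        return builder.CompleteAdding();
    }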

The public interfaces, then, look like this:

    public struct SearchResult
    {
        public readonly string Keyword;
        public readonly int Position;
    }

    public class StateMachine
    {
        public static StateMachine Create(IEnumerable<string> keywords);

        public IEnumerable<SearchResult> Search(IEnumerable<char> text);
    }

    public class StateMachineBuilder
    {
        public StateMachineBuilder();
        public void AddKeyword(string keyword);
        public void AddKeywords(IEnumerable<string> keywords);
        public StateMachine CompleteAdding();
    }

Imagine that you’re looking in a very large text file for mentions of any of 100 names. Here’s one way you could do it:

    var names = GetListOfNames();
    var machine = StateMachine.Create(names);
    var lineNumber = 0;
    foreach (var line in File.ReadLines(inputFilename))
    {
        ++lineNumber;
        foreach (var result in machine.Search(line))
        {
            Console.WriteLine("Found '{0}' at position {1} on line {2}.",
                result.Keyword, result.Position, lineNumber);
        }
    }

If the file has no line breaks then File.ReadLines will try to read the entire file at once. That will cause a problem if the file is large. In that case you need to create a function that will return the file’s characters one at a time, and feed that to the Search method:

    foreach (var result in machine.Search(ReadFileByCharacters(inputFilename)))
    {
        Console.WriteLine("Found '{0}' at position {1}",
            result.Keyword, result.Position);
    }

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            while (!reader.EndOfStream)
            {
                char c = (char)reader.Read();
                yield return c;
            }
        }
    }

That’s a particularly inefficient way to read a file, but it’ll do the job. You can speed it up by reading a block of characters into a buffer and then returning characters from the buffer one at a time. Each time you reach the end of the buffer, read another buffer full. Repeat until you’ve reached the end of file. Like this:

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            var buffer = new char[65536];
            int charsRead;
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (var i = 0; i < charsRead; ++i)
                {
                    yield return buffer[i];
                }
            }
        }
    }

That will perform faster than the previous version because it makes many fewer calls to the runtime library.

So that’s what we’re going to build: methods that build the state machine and then use the state machine to search for strings in text. Next time we’ll dive into building the goto and output functions. In the meantime, you should download and read the original Aho-Corasick paper. It really is approachable, and explains the techniques very well.

Random selection

If you know how many items are in a collection, picking one at random is trivial: just select a random number from 0 to count-1, and you’re done. Things get more interesting when you don’t know how many items are in the collection. The naive way to do the selection is to first make an array. Then you can use the solution I just mentioned. For example:

    readonly Random _rnd = new Random();

    int GetRandomItem(IEnumerable<int> items)
    {
        var a = items.ToArray();
        var ix = _rnd.Next(a.Length);  // Next(n) returns a value in [0, n)
        return a[ix];
    }

While that works, it can be expensive or even impossible to create a new array if the collection is very large.

When the question is modified to include the restriction that the collection won’t fit in memory (say, a text file with hundreds of millions of lines), things get more interesting still. You could go through the file to count the lines, pick a random number from 0 to count - 1, and then go back through the file to read that many lines. In code:

    var count = File.ReadLines(filename).Count();  // reads the whole file
    var ix = _rnd.Next(count);                     // 0 .. count-1
    var sel = File.ReadLines(filename)  // potentially reads the whole file
        .Skip(ix)
        .First();
    return sel;

If the collection is a file, then you end up reading the file once, and at least part of the file twice. On average, you end up reading the file halfway through the second time. It works, but it’s pretty expensive.

Worse, it’s impossible to use that technique if the collection is a bunch of records coming in from a network stream. It’s not like you can rewind the stream and start over after you figure out how many records there are.

There is a way to do this that doesn’t require knowing up front how many items are in the collection, and guarantees that each item has the same probability of being selected. Using this technique, you can select any number of items in a single pass through the collection. Provided, of course, that you have enough memory to hold the selected items.

Imagine that you are asked to select an item at random from a sequence of items presented to you one at a time. You don’t know how many items will be presented, and you have to prove that each item was given equal opportunity to be selected. You can’t decide to pick the first item every time, and you can’t decide to pick the fifteenth item, for example, because there might not be fifteen items presented. What to do?

First, since you always have to be prepared to present a selected item, it follows that you must select the first item because there might be only one item presented. But you can’t unconditionally keep the first item. When the second item is presented, you have to decide whether you should keep the first item, or replace it with the second item. How? By flipping a coin. Heads you keep the first, tails you replace it. Assuming a fair flip of a fair coin, then after two items half the time you’ll be holding the first item, and half the time you’ll be holding the second item.

Now the third item comes along. How do you decide whether to keep your previously selected item, or if you should replace it with the third item? You can’t flip a coin because doing so would mean that half the time you’d keep the third item. You want to keep the third item one third of the time.

What you really want is to roll a three-sided die and if it turns up a 3, replace the currently selected item with that third item. It should be obvious that the third item will be selected one third of the time.

If you don’t believe that this technique treats all items equally (i.e. gives every item an equal chance of ending up as the selected item), consider the possibilities with three items.

    First   Coin    Die   Result
    -----   -----   ---   ------
      1     Heads    1      1
      1     Heads    2      1
      1     Heads    3      3
      1     Tails    1      2
      1     Tails    2      2
      1     Tails    3      3

Those are all of the possibilities when selecting from three items. Each number is selected in two of the six cases, or one third of the time.

If you continue in this manner, selecting the nth item only if rolling an n-sided die turns up n, then when you’ve reached the end of the list you will have given each item an equal chance of being selected.

Let’s see if that really does work. The program below uses the described technique to randomly select one of 100 values (0 through 99). The program does that 10 million times and reports the results.

    private void DoIt()
    {
        const int arraySize = 100;
        const int iterations = 10*1000*1000;
        var selected = new int[arraySize];

        for (var i = 0; i < iterations; ++i)
        {
            var sel = SelectRandomInt(arraySize);
            ++selected[sel];
        }

        for (var i = 0; i < selected.Length; ++i)
        {
            Console.WriteLine("{0}: {1} ({2:P4})", i, selected[i],
                (double) selected[i]/iterations);
        }
        Console.WriteLine("Expected value = {0:N0}", iterations/arraySize);
        Console.WriteLine("Min value = {0:N0}", selected.Min());
        Console.WriteLine("Max value = {0:N0}", selected.Max());
    }

    private readonly Random _rnd = new Random();

    private int SelectRandomInt(int max)
    {
        var selected = -1;
        for (int i = 0; i < max; ++i)
        {
            if (_rnd.Next(i + 1) == i)
                selected = i;
        }
        return selected;
    }

The “expected” number that’s output is just the number of iterations divided by the number of items. In this case, 100,000. That’s the number of times you would expect any item to be selected, given a sufficiently large number of trials. This program outputs the number of times each item is selected, but below I show just the minimum and maximum.

    Expected value = 100,000
    Min value = 98,012
    Max value = 101,257

Not surprisingly, not every item was selected exactly 100,000 times. But it was close. The largest deviation was just under 2% on the low side, and about 1.3% on the high side. How does that compare with just selecting a random number from 0 to 99?

I replaced the SelectRandomInt method above with one that just returns a random number from 0 to max-1, like this:

    private int SelectRandomInt(int max)
    {
        return _rnd.Next(max);
    }

That gave me:

    Expected value = 100,000
    Min value = 99,069
    Max value = 100,842

This second version consistently gives me results that are closer to the expected values. The variance is approximately half. That is, where the first version might vary by 2% on the low side, this version varies by 1%. I think the difference is due to the random number generator being slightly biased (that is, it doesn’t produce a perfectly uniform distribution), and because we’re selecting many more random numbers with the first version than with the second, that bias is amplified. It’s not so far out of whack that I find it unusual, but I do think it merits further study.

Random number weirdness aside, you can see here that I was able to randomly choose one item from a collection of unknown size.

You wouldn’t use that technique to select a random number from 0 to 99. A real method would take an IEnumerable<T> and return a randomly selected item. The code looks like this:

    T SelectRandomItem<T>(IEnumerable<T> source, Random rnd)
    {
        // default(T) is only returned if source is empty; the first item
        // is always selected because rnd.Next(1) can only return 0.
        T selectedItem = default(T);
        int nItems = 0;
        foreach (var item in source)
        {
            ++nItems;
            if (rnd.Next(nItems) == 0)
            {
                selectedItem = item;
            }
        }
        return selectedItem;
    }

Note that I modified the test a bit. In the previous version I checked to see if the selected random number was equal to the top of the range. Here, I test to see if it’s equal to zero. The effect is the same because the conditional is selecting one item from n possible items. The probability that the random number will be 0 is the same as the probability that it’s equal to (nItems - 1). The reason I changed it here is that it’s easier for me to think about it this way, and it simplifies the task of modifying this code to select multiple items at random.

Using that code is very simple. If you want to select a line from a large text file, you can do it in a single line of code:

    string selected = SelectRandomItem(File.ReadLines("foo.txt"), new Random());

You might ask why I pass in the random number generator rather than creating a new one when the method is called, or expecting there to be one at class scope. There are several reasons.

The default Random constructor seeds the random number generator with the current time, which is expressed in milliseconds. If I were to call SelectRandomItem multiple times in a single millisecond (which could happen when working with short lists), I’d be very likely to select the same item each time. Consider, for example, selecting from two lists:

    // Assume here a variant of SelectRandomItem that creates its own
    // Random internally instead of taking one as a parameter.
    IEnumerable<string> firstNames = LoadFirstNames();
    IEnumerable<string> lastNames = LoadLastNames();
    var firstName = SelectRandomItem(firstNames);
    var lastName = SelectRandomItem(lastNames);

If firstNames and lastNames were very short, then whenever I selected the third item from one, I’d be very likely to select the third item from the other. That wouldn’t make for a very good random name generator.

Programmers often want a repeatable random sequence, especially for testing purposes. That’s not possible if the method creates a new random number generator with every call. It’s possible if there is a random number generator dedicated to the SelectRandomItem method, but if any other piece of code uses that random number generator or calls SelectRandomItem, then the sequence is lost. If I want to guarantee a repeatable random sequence, then I need to supply the random number generator to the method that uses it.
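
For example, a repeatable selection in a test might look like this (a minimal sketch; the seed value and file name are arbitrary):

    var rnd = new Random(12345);   // fixed seed => repeatable sequence
    var line = SelectRandomItem(File.ReadLines("test-data.txt"), rnd);
    // Given the same seed and the same file, every run selects the
    // same line.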

Selecting multiple items

The algorithm to select multiple items in a single pass is a straightforward generalization of the single item selection algorithm. Rather than dealing with a single “currently selected” item, we maintain an array of them. And to decide whether to select a particular item, we get a random number and compare it with the number of items to be selected. So, we replace this line in the inner loop:

    if (_rnd.Next(nItems) == 0)

With this one:

    if (_rnd.Next(nItems) < numToSelect)

If the random number is smaller than the number of items to select, then we randomly replace one of the currently selected items. (The first numToSelect items simply fill the array; the replacement test takes over after that.) The resulting method looks like this:

    T[] SelectRandomItems<T>(
        IEnumerable<T> source, int numToSelect, Random rnd)
    {
        T[] selectedItems = new T[numToSelect];
        int nItems = 0;
        foreach (var item in source)
        {
            ++nItems;
            if (nItems <= numToSelect)
            {
                // Fill the reservoir with the first numToSelect items.
                selectedItems[nItems - 1] = item;
            }
            else if (rnd.Next(nItems) < numToSelect)
            {
                // Replace a randomly chosen selected item.
                var ix = rnd.Next(numToSelect);
                selectedItems[ix] = item;
            }
        }
        return selectedItems;
    }
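
Using it looks just like the single-item version. For example, to pull 10 random lines from a large text file in one pass (a sketch; the file name is hypothetical):

    string[] sample = SelectRandomItems(
        File.ReadLines("foo.txt"), 10, new Random());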

When selecting more than a single item, the output from that method is surprisingly similar to the output from a method that selects unique random numbers in the same range. So similar, in fact, that I can’t tell them apart. I find that a little odd. There is a definite difference in the results from selecting a single item, but when selecting multiple items the difference is not detectable. I do need to investigate why that difference in behavior exists.

And that’s all there is to it: a method that will fairly select one or more items at random from a collection of unknown size, and do it with a single scan of the entire collection. If there is a better way to do this, I don’t know what it is.

Source

I first ran across this problem in Jon Bentley’s Programming Pearls. He cites Volume 2: Seminumerical Algorithms of Knuth’s The Art of Computer Programming. See section 3.4.2: Random Sampling and Shuffling.

The password rant

Every time I have to come up with a password for a Web site, I end up spending 10 minutes trying to find one that I can remember and that will conform to the rules for that site. You’d think there’d be some kind of standard. But, no … every site has its own bunch of rules.

One site requires the password to be “at least eight characters long.” Others specify a minimum length and require at least one number, one capital letter, and one special character. Others specify an absurdly short maximum length. They’ll limit the characters you can use, make you use at least two numbers, prevent you from using any word that’s in an English dictionary, or for all I know require that you type it with your toes while standing on your head. It’s maddening.

I could understand this idiocy 10 years ago. Today it’s simply unacceptable. The key to a good password is that it be long, unusual, and easy to remember. But the most important point is length. Next most important is easy to remember. And I guarantee that a password like “stupid.password.rules.suck” is easier to remember and harder to break with a brute force attack than something like “Stup1d.pw0rd”. Even if an attacker knows that the first draws only from lowercase letters and periods, its 26 characters from a 27-symbol alphabet allow roughly 10^37 possibilities; 12 characters drawn from a 72-symbol alphabet allow only about 10^22.

To make matters worse, sites often don’t tell you what the password rules are. It’ll say, “enter password.” So you enter something like “my.mom’s.maiden.name” and it will say, “Passwords must contain at least one number and one capital letter.” So then you enter “3rd.rock&Roll.act” and it will say, “Passwords may only contain the characters [list of characters that doesn’t include ‘&’]”. So then you type, “Please.save.M3.From.Stupid.Password.Rules” and it will say, “Passwords must be between 8 and 14 characters in length.”

Why can’t they tell me all of the rules up front? Do they think I’m here to play 20 passwords? This evening my hosting provider told me that my password had expired and that I need to create a new one. Fine. I’m now on my fifth attempt at creating a password that I can remember, that others won’t likely guess, and that fits the rules that they’re revealing to me one at a time.

People who create Web sites: please update your password rules. Force me to create a password that’s at least 15 characters long, and let me put whatever characters I want in it! If you really have to put a limit on the length, make it something reasonable like at least 32 characters. Limiting me to 12 (a banking site, no less) or 14 characters and making me choose from three or four different classes of characters to boot just pisses me off and makes me think that I should find some other place to take my business.

Searching for strings in text

Every mainstream programming language’s runtime library contains a function that will find a substring inside a string. The .NET runtime library, for example, has several such methods. String.IndexOf in its many overloaded variants returns the position of the search string in the larger string, or -1 if the substring is not found. There’s also String.Contains and String.LastIndexOf, as well as Regex.Match and several others.

The String functions are great for determining if a particular string occurs in some text. But if you want to search for matches of multiple strings, you have to search for each one individually. For example, if you want to search for occurrences of the strings {he, she, his, hers}, you’d have to write something like this:

    string[] stringsToFind = new string[] {"he", "she", "his", "hers"};
    foreach (var s in stringsToFind)
    {
        int pos = 0;
        while ((pos = text.IndexOf(s, pos)) != -1)
        {
            Console.WriteLine("Found {0} at position {1}", s, pos);
            pos += s.Length;
        }
    }

That will print almost every occurrence of each word. In some situations it will miss because it skips over the beginning of a match. For example, if you’re searching for the string “abab” in the text “abababab”, it would only find matches at positions 0 and 4. It would miss the occurrence at position 2. You can fix that by changing the increment to ++pos, which will cause it to restart the search at each character. It’ll just take longer.

The code ends up scanning the entire text for every word to be found. That’s not a problem when the text is short and the number of words to be located is small. But if you’re looking for occurrences of 1,000 different words in a gigabyte-sized document, this is going to take a long time. The time taken is proportional to the number of words (we’ll call it N) times the length of the document (call it M) times the length of the words being sought; when a word can be a sizable fraction of the document, that worst case approaches O(N * M²).

Why? Because the code might have to do a nearly complete comparison starting at every character in the string. Think of searching for all occurrences of a string of half a billion ‘a’s in the text “aaaaaaaa…” (‘a’ repeated a billion times). That wouldn’t take forever, but close enough. If the computer could do a billion character comparisons per second, the search would still take several years. Ouch.

The regular expression methods can do things much faster in the general case because they can search for all four strings in a single scan of the text.

    string text = "those shells aren't hers or his.";
    Regex re = new Regex("he|hers|she|his", RegexOptions.Compiled);
    Match m = re.Match(text);
    while (m.Success)
    {
        Console.WriteLine("Found {0} at position {1}.", m.Value, m.Index);
        m = m.NextMatch();
    }

Here’s the output from running that code:

    Found she at position 6.
    Found he at position 20.
    Found his at position 28.

Again, you see that it missed some occurrences. It didn’t find “he” at position 6 or “hers” at position 20. I could fix the missing “he” at position 6 by incrementing the regular expression starting position, but there’s nothing I can do about it not finding “hers”. Well, I could change the regular expression so that it’s "hers|he|she|his", which would return “hers” at position 20 but then wouldn’t find “he” at the same position. The same thing happens if I change the regular expression to "he(rs)?|she|his". A regular expression can’t find two matches at the same position.

I freely admit that I’m no expert with regular expressions. Even so, I was somewhat surprised by the regular expression finding “he” rather than “hers” at position 20. I wrongly thought that regular expression matching behavior defaulted to “longest left-most,” when in fact it’s a traditional NFA engine that, according to the documentation, accepts the first match it finds.

But “first match” doesn’t explain why it matches “hers” instead of “he” when I reverse the order of the words in the regular expression. After all, the engine should see “he” and stop, right? There’s something about the way the regex engine works with alternation that I don’t fully understand. I realize that I can solve the problem in this case by combining the two alternatives into "he(rs)?", but that becomes problematic when constructing a regular expression at run time from a list of words.
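
A quick experiment makes the order-dependence visible, and the results are at least consistent with the engine trying the alternatives in their listed order at each position (a sketch using the same test string as above):

    string text = "those shells aren't hers or his.";
    // With "he" listed before "hers", the match at position 20 is "he".
    Console.WriteLine(new Regex("he|hers|she|his").Match(text, 20).Value);
    // With "hers" listed first, the match at position 20 is "hers".
    Console.WriteLine(new Regex("hers|he|she|his").Match(text, 20).Value);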

Understanding why alternation works this way would be nice, but I’m not so curious that I’ll spend time digging into the implementation in order to find out. For now I’ll accept that order matters in alternation constructs.

Although the regular expression version will be faster in the general case, it, too, will take essentially forever in pathological cases. And, as you’ve seen, it can’t find everything and it can’t easily be changed to do so.

Another problem is that both techniques require that the entire text be held in memory. If you’re searching for strings in a file that’s larger than available memory, you can’t use either of these methods without writing code that processes the file in chunks.

There is a better and faster way that can find all occurrences (given “hers”, for example, it will find both “he” and “hers”), doesn’t suffer from the pathological worst-case behavior, and that can process text on a character-by-character basis. It’s called the Aho-Corasick string search algorithm. Developed in the mid 1970s, it’s a surprisingly simple algorithm that scales well, is reasonably easy to implement, and performs excellently even with a straightforward implementation. I’ve used it to search for occurrences of millions of different strings among many hundreds of gigabytes of text.

Next time I’ll explain how the algorithm works, and start building code to implement it.

Hand made beer mug

Debra and I went to the Sherwood Forest Faire last weekend, and had a great time. I felt a little out of place, though, wandering around in costume and drinking mead from a plastic cup. Many people had wooden mugs, and there were mugs available for purchase at several shops. Although they were attractive, I didn’t get one. First of all, the least expensive mug I saw for sale was $45. That’s a lot for a hollowed-out piece of wood.

More importantly, the mugs they had for sale were too pretty. Most were perfectly turned on a lathe, had perfectly shaped handles, and were coated with a thick epoxy resin that turned the wooden mug into what was essentially a modern plastic mug that just looks like wood.

Sunday morning I went out to the wood pile and grabbed a hunk of mesquite that seemed about the right size. My plan was to carve a mug that would hold 16 ounces.

After cutting it to size and removing the bark with an angle grinder, I drilled a hole in the center to get started, and then used a die grinder and the Foredom power carver to hollow it. This turned out to be more difficult than I had expected due to the depth. It’s distressingly easy to lose control of the tool when you can’t see where it’s cutting. An aggressive bit spinning at 15,000 RPM will throw a piece of wood across the room when it binds up. I ended up putting a few cracks in the cup that way.

It took a long time to thin the walls from there (using a hook knife), and sand it smooth. I also had to fill the cracks that I made while hollowing. I didn’t take any pictures until after I’d completed the cup and put the oil and wax finish on it.

The finished mug is 4-1/4 inches tall and between 3-1/4 and 3-1/2 inches wide. The width varies because the mesquite log that it came from wasn’t perfectly round.

It holds just a little more than 12 ounces. I just couldn’t get it any deeper with the tools that I currently have. I’ll have to come up with a better way if I’m going to make another one of these. I’ve considered hollowing from both ends and then carving a 1/4 inch thick piece to fit inside at the bottom. The only difficulty would be getting that bottom piece carved correctly. I’m also considering a few other options.

The liquid is, of course, homebrew beer.

The residue you see floating in the cup there is a little of the oil and wax. Not to worry: the oil and wax are non-toxic. After the oil soaks into the wood, it won’t leach out into my beer. I was impatient to try the mug. It really is a pleasure to drink from.

I decided not to add a handle. Maybe next time. I’ll probably come up with some sort of leather sling that fits around the mug and lets me attach it to my belt so that I can carry it around the next time we go to one of those Renaissance festivals.