The Aho-Corasick string search algorithm

The Aho-Corasick string search algorithm that I mentioned last time works by first building a finite state machine from the list of strings you want to find. After the state machine is built, the input is run through the state machine to find the matches. The key to understanding how the algorithm works is understanding how the state machine works. Fortunately, the original paper, Efficient String Matching: An Aid to Bibliographic Search, by Alfred V. Aho and Margaret J. Corasick, explains the state machine quite well. In fact, the paper is well enough written that I had little difficulty implementing the entire algorithm from it as my only reference.

Last time I said that I was going to explain how the algorithm works. After re-reading the paper and trying to describe the algorithm, I found myself just paraphrasing the original paper. So rather than do a bad job of explaining things, I’ll point you to the paper itself. Download it. Read it. Think about it. Read it again. It is quite approachable. I wish all the research papers I read were as well written.

Building the state machine is a multi-step process that involves first creating a goto function that defines the state transitions, and an output function that defines the outputs (if any) from each state. This process involves reading each keyword character-by-character and inserting the characters into the state machine. Then, a failure function is generated by walking the nodes in the goto function, determining the next node to visit if a failure occurs. An optional final step can be added in which the generated goto and failure functions are combined into a single state transition table. This final step makes walking the state machine easier and more efficient.

The original paper provides pseudocode examples from which it’s pretty easy to create working code. I’ve created two two classes: StateMachine, which contains the completed state machine and the searching function, and StateMachineBuilder, which builds a StateMachine from a list of key words. In the past I’ve combined these two things into the same class, but I found the design cumbersome. Having two classes seems like a cleaner design to me. Building the state machine involves a lot of intermediate data structures that the searching code shouldn’t have to be troubled with. Separating the classes gives the searcher access to only the data that it actually needs.

The public interface to the StateMachineBuilder class is pretty simple. The constructor initializes the state machine. You can then call the AddKeyword method to add each key word to the state machine. When all key words are added, a call to the CompleteAdding method will complete processing and return a reference to the generated state machine.

A convenience function, StateMachine.Create(IEnumerable), allows you to create a StateMachine instance from a sequence of keywords in a single call. It’s equivalent to creating a StateMachineBuilder, adding all the keywords, and then calling CompleteAdding.

The public interfaces, then, look like this:

    public struct SearchResult
    {
        public readonly string Keyword;
        public readonly int Position;
    }

    public class StateMachine
    {
        public static StateMachine Create(IEnumerable keywords);

        public IEnumerable<SearchResult> Search(IEnumerable<char>);
    }

    public class StateMachineBuilder
    {
        public StateMachineBuilder();
        public void AddKeyword(string keyword);
        public void AddKeywords(IEnumerable<string> keywords);
        public StateMachine CompleteAdding();
    }

Imagine that you’re looking in a very large text file for mentions of any from a list of 100 names. Here’s one way you could do it:

    var names = GetListOfNames();
    var machine = StateMachine.Create(names);
    var lineNumber = 0;
    foreach (var line in File.ReadLines(inputFilename))
    {
        ++lineNumber;
        foreach (var result in machine.Search(line))
        {
            Console.WriteLine("Found '{0}' at position {1} on line{2}.",
                result.Keyword, result.Position, lineNumber);
        }
    }

If the file has no line breaks then File.ReadLines will try to read the entire file at once. That will cause a problem if the file is large. In that case you need to create a function that will return the file’s characters one at a time, and feed that to the Search method:

    foreach (var result in machine.Search(ReadFileByChars(inputFilename))
    {
        Console.WriteLine("Found '{0}' at position {1}",
            result.Keyword, result.Position);
    }

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            while (!reader.EndOfStream)
            {
                char c = (char)reader.Read();
                yield return c;
            }
        }
    }

That’s a particularly inefficient way to read a file, but it’ll do the job. You can speed it up by reading a block of characters into a buffer and then returning characters from the buffer one at a time. Each time you reach the end of the buffer, read another buffer full. Repeat until you’ve reached the end of file. Like this:

    IEnumerable<char> ReadFileByCharacters(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            var buffer = new char[65536];
            int charsRead;
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (var i = 0; i < charsRead; ++i)
                {
                    yield return buffer[i];
                }
            }
        }
    }

That will perform faster than the previous version because it makes many fewer calls to the runtime library.

So that’s what we’re going to build: methods that build the state machine and then use the state machine to search for strings in text. Next time we’ll dive into building the goto and output functions. In the meantime, you should download and read the original Aho-Corasick paper. It really is approachable, and explains the techniques very well.