Searching for strings in text

Every mainstream programming language’s runtime library contains a function that will find a substring inside a string. The .NET runtime library, for example, has several such methods. String.IndexOf in its many overloaded variants returns the position of the search string in the larger string, or -1 if the substring is not found. There’s also String.Contains and String.LastIndexOf, as well as Regex.Match and several others.

The String functions are great for determining if a particular string occurs in some text. But if you want to search for matches of multiple strings, you have to search for each one individually. For example, if you want to search for occurrences of the strings {he, she, his, hers}, you’d have to write something like this:

    string[] stringsToFind = new string[]{"he", "she", "his", "hers");
    foreach (var s in stringsToFind)
    {
        int pos = 0;
        while ((pos = text.IndexOf(s, pos)) != -1)
        {
            Console.WriteLine("Found {0} at position {1}", s, pos);
            pos += s.Length;
        }
    }

That will print almost every occurrence of each word. In some situations it will miss because it skips over the beginning of a match. For example, if you’re searching for the string “abab” in the text “abababab”, it would only find matches at positions 0 and 4. It would miss the occurrence at position 2. You can fix that by changing the increment to ++pos, which will cause it to restart the search at each character. It’ll just take longer.

The code ends up scanning the entire text for every word to be found. That’s not a problem when the text is short and the number of words to be located is small. But if you’re looking for occurrences of 1,000 different words in a gigabyte-sized document, this is going to take a long time. The amount of time taken by this algorithm is proportional to the number of words (we’ll call it N) times the length of the document squared (call it M). That is, the complexity is O(N * M2).

Why squared? Because it’s possible that the code would have to start the search at every character in the string. Think of searching for all occurrences of the string “aa” in the text string “aaaaaaaaaaaaaaaaaaaaaaaa…” (‘a’ repeated a billion times). That wouldn’t take forever, but close enough. If the computer could do a billion searches per second, it would only take a billion seconds (almost 32 years). Ouch.

The regular expression methods can do things much faster in the general case because they can search for all four strings in a single scan of the text.

    string text = "those shells aren't hers or his.";
    Regex re = new Regex("he|hers|she|his", RegexOptions.Compiled);
    Match m = re.Match(text);
    while (m.Success)
    {
        Console.WriteLine("Found {0} at position {1}.", m.Value, m.Index);
        m = m.NextMatch();
    }

Here’s the output from running that code:

    Found she at position 6.
    Found he at position 20.
    Found his at position 28.

Again, you see that it missed some occurrences. It didn’t find “he” at position 6 or “hers” at position 20. I could fix the missing “he” at position 6 by incrementing the regular expression starting position, but there’s nothing I can do about it not finding “hers”. Well, I could change the regular expression so that it’s "hers|he|she|his", which would return “hers” at position 20 but then wouldn’t find “he” at the same position. The same thing happens if I change the regular expression to "he(rs)?|she|his"A regular expression can’t find two matches at the same position.

I freely admit that I’m no expert with regular expressions. Even so, I was somewhat surprised by the regular expression finding “he” rather than “hers” at position 20. I wrongly thought that regular expression matching behavior defaulted to “longest left-most,” when in fact it’s a traditional NFA engine that, according to the documentation, accepts the first match it finds.

But “first match” doesn’t explain why it matches “hers” instead of “he” when I reverse the order of the words in the regular expression. After all, the engine should see “he” and stop, right? There’s something about the way the regex engine works with alternation that I don’t fully understand. I realize that I can solve the problem in this case by combining the two alternatives into "he(rs)?", but that becomes problematic when constructing a regular expression at run time from a list of words.

Understanding why alternation works this way would be nice, but I’m not so curious that I’ll spend time digging into the implementation in order to find out. For now I’ll accept that order matters in alternation constructs.

Although the regular expression version will be faster in the general case, it, too, will take essentially forever in pathological cases. And, as you’ve seen, it can’t find everything and it can’t easily be changed to do so.

Another problem is that both techniques require that the entire text be held in memory. If you’re searching for strings in a file that’s larger than available memory, you can’t use either of these methods without writing code that processes the file in chunks.

There is a better and faster way that can find all occurrences (given “hers”, for example, it will find both “he” and “hers”), doesn’t suffer from the pathological worst-case behavior, and that can process text on a character-by-character basis. It’s called the Aho-Corasick string search algorithm. Developed in the mid 1970s, it’s a surprisingly simple algorithm that scales well, is reasonably easy to implement, and performs excellently even with a straightforward implementation. I’ve used it to search for occurrences of millions of different strings among many hundreds of gigabytes of text.

Next time I’ll explain how the algorithm works, and start building code to implement it.