More fun with regular expressions

I’ve said before that regular expressions make my brain hurt.  I’ve also been rather outspoken on a number of occasions regarding the misuse of regular expressions.  All too often, programmers faced with any kind of parsing problem immediately reach for their regex hammer and then spend an inordinate amount of time trying to use it like a pair of pliers.

That said, regular expressions do have their uses, especially when throwing together a quick and dirty prototype.  In my case, I have a list of about 40 million titles I want to search for occurrences of user-input text strings.  I’m building a prototype, so I’m not terribly worried at the moment with things like stemming or fuzzy matches.  I just want to know how many of the titles in my collection contain the terms “the beatles” or “stairway to heaven”, for example.

I could use naive string searching.  That is, just use the built in String.IndexOf method to do a case-insensitive search of each title for each text string.  So, for instance, I’d have:

if ((title.IndexOf("the beatles", StringComparison.InvariantCultureIgnoreCase) != -1) ||
    (title.IndexOf("stairway to heaven", StringComparison.InvariantCultureIgnoreCase) != -1))
{
    // found a match
}

That mostly works, and would probably be “good enough” for most cases.  But if I were searching for “the who”, it would match “the whosits”, as well.  In addition, if I’m searching for multiple terms, I have to do each search individually.

Regular expressions let me solve both of those problems.  I can search for multiple terms with a single call by creating this regular expression:

    (the beatles)|(stairway to heaven)

which says, “match ‘the beatles’ or ‘stairway to heaven’”.

That doesn’t solve the “whosits” problem, though.  So I need to specify that I want to match only on word boundaries.

In the language of regular expressions, the character escape “\W” says, “match a non-word character.”  For simplicity, let’s just say that a “word character” is an alphanumeric character:  A through Z, a through z, and 0 through 9.  Everything else–white space, punctuation, and special characters–are “non-word characters”.  So my regular expression becomes:

    \W((the beatles)|(stairway to heaven))\W

which you can read as, “match a non-word character, followed by ‘the beatles’ or ‘stairway to heaven’, followed by a non-word character.”

All well and good, right?  Not quite.  It won’t match if the string is found at the beginning or end of the title.  So you’d get a match if the title was, “Nowhere Man by The Beatles (1965)”, but it won’t match “The Beatles – Nowhere Man”.

That’s easily fixed.  We can say, “match at the beginning of the string OR a non-word character,” and something similar for the end:

    (^|\W)((the beatles)|(stairway to heaven))(\W|$)

The character “^” is regex-ese for “match at the beginning of the string,” and “$” is for matching at the end of the string.

That regular expression works.  Slowly.  It takes about 90 seconds to search 10 million titles.

It turns out that there’s another way to search only on word boundaries.  The metasequence “\b” says, “match a word boundary.”  Note that this is a metasequence rather than a character escape.  In a sense, “\b” matches the transitions between words and non-words.  My regular expression above can be rewritten as:

    \b((the beatles)|(stairway to heaven))\b

That’s right, the beginning of the string is one of those word/non-word transitions, as is the end of the string.

In .NET, the new regular expression produces exactly the same result as the old one.  The difference is in how long it takes:  15 seconds rather than 90 seconds.  That’s six times as fast!

Moral of the story: avoid alternation in regular expressions whenever possible.

By the way, “\b” doesn’t work the same in all regex flavors.  In PHP, Ruby, and Java, “\b” only works right when applied to ASCII characters.  Multi-byte (i.e. Unicode) encodings cause “\b” to break.

Regular expressions still make my brain hurt.