Regular expressions, optimization, and practicality

Suppose you wanted a function that would replace multiple instances of a character with a single instance. So the string “+++This+is+++++a++test++” would become “+This+is+a+test+”. The easiest way to do it in C# is to employ regular expressions.

    string testo = "+++This+is+++++a++test++";
    string result = Regex.Replace(testo, @"\++", "+");

The leading slash is there because '+' is a special character in regular expressions. The slash “escapes” the character so that it will be treated as a literal.

By the way, you can’t do this reliably with String.Replace. If you wrote:

    string result = testo.Replace("++", "+");

You would end up with “++This+is+++a+test+”. String.Replace makes a replacement and moves on, so it will miss when the replacement string plus the following characters is equal to the search string.

Curious, I thought I’d write my own function to remove duplicate characters and see how it fares against the regular expression. The result is this RemoveDupes method.

    public static string RemoveDupes(string text, char dupeChar)
        if (string.IsNullOrEmpty(text))
            return text;

        var result = new StringBuilder(text.Length);
        bool dupe = false;
        for (int x = 0; x < text.Length; ++x)
            char c = text[x];
            if (c == dupeChar)
                if (!dupe)
                    dupe = true;
                dupe = false;
        return result.ToString();

That’s the most straightforward way I could think of to write it, so I wouldn’t be surprised if there are some optimizations that would make it faster.

I created the regular expression like this:

    var re = new Regex(@"\++", RegexOptions.Compiled);

And in the timed loop, called it with:

    result = re.Replace(testString, "+");

That should make the regular expression as fast as it can be.

Performance testing was a little bit surprising. The simple RemoveDupes method was 5 to 8 times as fast as the regular expression version, depending on the length of the strings and the percentage of repeated '+' characters in the string.

That the custom code was faster didn’t particularly surprise me; I’ve proven many times that a custom parser can outperform regular expressions. Two or three times as fast is what I expected. Up to eight times as fast came as a surprise.

One million iterations with a string 100 bytes long took the regular expression version 4,767 milliseconds, or 0.004768 ms per call. The same string run through the RemoveDupes method took 793 ms, or 0.000793 ms per call. The custom method is six times as fast, but the absolute difference for those million iterations is only four seconds.

Granted, the absolute differences get larger as the number of iterations or the length of the string increases. The timings (and the differences) increase linearly. If it takes RemoveDupes some N seconds to complete, it will take the regular expression approximately 6*N seconds.

But does it really matter? Most likely not. In most cases, sanitizing the input data takes up a very small percentage of the total CPU time. Optimizing it just doesn’t yield meaningful performance gains. Unless removing duplicated characters from strings is the most time consuming part of the program, this huge boost in performance would mean nothing.

Would you really notice a forty second difference in the running time of a program that takes an hour?

The other practical matter has to do with time to implement. The regular expression code already works. It took some (small) amount of time to write and test my RemoveDupes method, and the result is a very narrowly-targeted solution: it can remove duplicated characters from a string. Even the seemingly simple modification of letting it de-dupe 2-character strings would be a much more involved job, especially when you take into account the weird edge cases inherent in Unicode. Writing and debugging that would be difficult.

I think I’ll stick with regular expressions for this kind of job, even though they’re slower than the custom parser. Unless I really need the speed.