One of the benefits–or curses, depending on my mood and how urgently I need a solution–of programming computers is that I often start working on one thing and end up getting sidetracked by a piece of the problem.
Take today’s distraction, for example. I’m writing a program to experiment with some text classification using the downloaded Wikipedia database. A major part of the program is extracting terms from the individual articles. Since I don’t need anything too fancy (at least not quite yet), I figured that this would be a perfect application for regular expressions. So I coded up my term extractor and let it loose on the Wikipedia data, figuring it’d take an hour or two to process the 13 gigabytes of text.
It’s a good thing I’ve gotten into the habit of instrumenting my code and periodically outputting timing information. It was taking an average of 10 milliseconds to process each Wikipedia article. The file has about 5.8 million articles. A quick back-of-the-envelope calculation says that’ll take 58,000 seconds, or about 16 hours. I’ve been wrong before, but not often by an order of magnitude.
After removing some unnecessary code and postponing some processing that can be done against the aggregate, I cut the required time per page to about 4 milliseconds. Better, but still too much. Through a process of elimination, I finally narrowed it down to the loop that extracts terms from the document text using a regular expression. Stripped down to the critical loop, the code looks like this:
static Regex reTerm = new Regex("\\p{L}+", RegexOptions.Compiled);
static Stopwatch reElapsed = new Stopwatch();

static void DoReParse(string pageText)
{
    // Time extraction with the regular expression
    reElapsed.Start();
    Match m = reTerm.Match(pageText);
    while (m.Success)
    {
        m = m.NextMatch();
    }
    reElapsed.Stop();
}
The regular expression, \p{L}+, says, “match a string of one or more contiguous Unicode Letter characters.” reElapsed is a Stopwatch object that accumulates the time taken between the Start and Stop calls. When my program is done, I divide the accumulated time by the number of documents processed to get the average time per page. For the first 100,000 documents in the Wikipedia download, it averages about 1.5 milliseconds per page, which is pretty close to what I thought all of the processing would take–parsing included.
That seemed unreasonable, so I wrote my own parser that reads each character of the document text and pulls out the substrings of contiguous Unicode letter characters. It’s more code than the regular expression version, but it’s still pretty simple:
static Stopwatch jmElapsed = new Stopwatch();

static void DoJmParse(string pageText)
{
    // Time extraction with direct parsing
    jmElapsed.Start();
    int i = 0;
    int len = pageText.Length;
    bool inWord = false;
    int start = 0;
    for (i = 0; i < len; i++)
    {
        if (!inWord)
        {
            if (char.IsLetter(pageText[i]))
            {
                start = i;
                inWord = true;
            }
        }
        else if (!char.IsLetter(pageText[i]))
        {
            // End of a term; the complete code adds it to a list.
            string term = pageText.Substring(start, i - start);
            inWord = false;
        }
    }
    if (inWord)
    {
        // Pick up a term that runs to the end of the text.
        string term = pageText.Substring(start);
    }
    jmElapsed.Stop();
}
Both of the methods shown above are stripped-down versions of the code. The complete code adds the extracted terms to a list so that I can compare what’s extracted by each method. In all cases (the first 100,000 pages in the Wikipedia download), both methods extract the same terms. But the hand-coded parser takes an average of 0.25 milliseconds per page. The regular expression parser takes six times as long.
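The list-collecting part is roughly this; reTerms and jmTerms are placeholder names, not the actual fields in my program:

    // Rough sketch of the comparison plumbing; reTerms and jmTerms are placeholders.
    static List<string> reTerms = new List<string>();
    static List<string> jmTerms = new List<string>();

    // In DoReParse, inside the while loop:   reTerms.Add(m.Value);
    // In DoJmParse, after each Substring:    jmTerms.Add(term);

    // After each page, check that the two extractors agree
    // (SequenceEqual requires a using System.Linq directive).
    static bool TermsMatch()
    {
        return reTerms.SequenceEqual(jmTerms);
    }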
I could understand my hand-coded routine being somewhat faster because it avoids the overhead of constructing Match objects and some of the other regular expression overhead. But six times? Something smells wrong. I have to think that I’m missing something.
I tried the usual things: fiddling with the regular expression options, calling Matches to get all of the matches in a single call, etc. All to no avail. Everything I tried had either no effect or increased the running time.
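For completeness, the Matches variant I tried looked something like this (a minimal sketch; timing code omitted, and DoReParseAll is just a throwaway name):

    // Sketch of the Matches() variant; the full version adds m.Value to a list.
    static void DoReParseAll(string pageText)
    {
        foreach (Match m in reTerm.Matches(pageText))
        {
            // m.Value is the extracted term.
        }
    }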
Then I thought that the difference had to do with Unicode combining characters or surrogate pairs. That is, characters that span more than one UTF-16 code unit (char) rather than just one. My parser treats each code unit as a character, whereas the regular expression parser might be taking those multi-code-unit characters into account. But a simple test doesn’t bear that out.
Consider this character string defined in C#:
const string testo = "\u0061\u0300xyz";
The first two characters, “\u0061\u0300”, define the character à: a lower-case “a” followed by a combining grave accent. From my understanding of Unicode, this should be treated as a single character. But when I run that string through the regular expression term extractor, I get two strings: “a” and “xyz”. If the regular expression engine supported combining characters, I should get one string: “àxyz”. The documentation is suspiciously silent on the matter, but a close reading of Jeffrey Friedl’s Mastering Regular Expressions indicates that I shouldn’t expect the .NET regular expression engine to give me just the one string.
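The test itself is tiny; it amounts to this, reusing the reTerm regex from above:

    // Combining-character test, reusing reTerm (\p{L}+) from above.
    const string testo = "\u0061\u0300xyz";   // "a" + combining grave accent + "xyz"
    foreach (Match m in reTerm.Matches(testo))
    {
        Console.WriteLine(m.Value);           // prints "a", then "xyz"
    }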
So I’m at a loss. I have no idea why the regular expression version of the term extractor is so much slower than my hand-coded version. For now, I’m going to abandon the regular expression, but I’d sure like to hear from anybody who can shed some light on this for me.