FINDSTR: line is too long

Today I tried to use the Windows FINDSTR command to find occurrences of a particular string in a large (33 gigabyte) text file.  Simple enough, right?

findstr /L "xyzzy" bigfile.xml

FINDSTR immediately started giving me errors:

FINDSTR: Line 408555 is too long.
FINDSTR: Line 432128 is too long.
FINDSTR: Line 801201 is too long.
FINDSTR: Line 927897 is too long.
FINDSTR: Line 939189 is too long.
FINDSTR: Line 939189 is too long.
FINDSTR: Line 939189 is too long.
FINDSTR: Line 1006538 is too long.
FINDSTR: Line 1579088 is too long.

I couldn’t imagine why it would tell me that the lines are too long.  Unfortunately, I’m not able to view the file in-place because after all this time there still isn’t a decent text file viewer that can handle a file that large.  I can get kind of close with less, although there are problems displaying non-US character sets in a Windows console program.

In any case, I extracted those lines to a file by writing a program that scans the big file and pulls out the lines in question. That let me determine that the shortest of the lines listed above is about 3,500 characters long and the longest is about 25,000 characters. And here’s the kicker: running the same FINDSTR command on that extracted file produces no errors.
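The extraction program is nothing fancy: stream the file a line at a time and keep only the line numbers FINDSTR complained about. A minimal sketch (the file names are placeholders):

```csharp
// Stream the big file and write out only the lines whose numbers
// FINDSTR complained about. File names here are just examples.
using System.Collections.Generic;
using System.IO;

class ExtractLines
{
    static void Main()
    {
        var wanted = new HashSet<long>
            { 408555, 432128, 801201, 927897, 939189, 1006538, 1579088 };
        long lineNumber = 0;
        using (var reader = new StreamReader("bigfile.xml"))
        using (var writer = new StreamWriter("suspect-lines.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                ++lineNumber;
                if (wanted.Contains(lineNumber))
                    writer.WriteLine(line);
            }
        }
    }
}
```

Reading with StreamReader keeps memory flat no matter how big the file is, which is the whole point when the file is 33 gigabytes.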

I also noticed that FINDSTR told me three different times that line number 939,189 is too long.

Obviously, “line is too long” is a catch-all message for a number of different errors. FINDSTR has some issues.  Some time ago, I said that FINDSTR was marginally useful.  After today, I’d say it’s even less useful than I thought it was then.

GNU grep for Windows, by the way, has no problems with the file.  The only reason I used FINDSTR is because I don’t have GNU grep installed on the server where the file exists.

Oh, and Microsoft still hasn’t fixed that idiotic file caching bug.

An optimization problem

Michael Abrash famously said, “The best optimizer is between your ears.” His point is that no amount of machine-specific optimization is going to make a difference in the running time of a bad algorithm. A poorly written Quicksort, for example, will outperform the most highly optimized Bubble sort. Selecting the right algorithm can give you orders of magnitude (or even more!) improvements in performance. Hand optimizing code usually provides small percentage improvements after an efficient algorithm is selected.

In most of what I do, performance is important, but not super-critical. That is, performance is usually “good enough” if I select an efficient algorithm. As such, I don’t worry too much about micro-level optimizations. In this environment, the .NET Framework and the C# language are very nice. I’m incredibly productive with C#. The language and the runtime library give me many tools with which to quickly and (reasonably) efficiently implement my algorithms. I know that I’m paying a performance penalty. C# applications that do a lot of processing are typically around 10% less efficient than programs compiled to native machine code. But that’s okay. Very little of what I do is so performance sensitive that I’m willing to forgo the productivity of .NET for the efficiency of native code.

But when performance is critical, I get very annoyed because the C# way of doing things is often very sub-optimal.

I’ve recently been working on a system that finds similar videos based on Web links. That is, if I like a particular video that Alice linked to on her blog, then it’s likely that I’ll also like other videos that Alice linked to. If 10 people linked to the video, then I might be interested in the other videos that those people linked to. So imagine a search facility that, given a search term (say, “bluegrass”), will find all videos that have that term in the title and then does a spreading activation by activating all of the videos that are linked to by people who linked to the videos in the initial result set.

So, searching for “bluegrass” might find 3,000 videos that have the term in their titles. Those 3,000 videos are in turn linked to by a total of, say, 4,000 different people. Combined, those 4,000 people can have many millions of different links, although there will be duplicates (that is, Alice, Bob, and George could all link the same video). We can score a video by relevance and popularity by taking into account the number of links it gets and the relative weight of each referrer’s links. But to do that I have to:

  1. Create a referrer activation record for each referrer that links to any of the videos in the initial result set.
  2. Group the referrer activation records by referrer, and compute a referrer score.
  3. Create a video activation record for each video that the activated referrers link to.
  4. Group the video activation records by video, and compute the video’s score.
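The four steps above can be sketched in LINQ-ish form like this. The record types and helper functions (`LinksTo`, `LinksFrom`, `ScoreReferrer`) are invented for illustration; they stand in for whatever the real data access layer provides:

```csharp
// Requires: using System.Linq;
// Hypothetical sketch of the four-step spreading activation.
var referrerScores = initialResults
    .SelectMany(video => LinksTo(video))              // step 1: a record per referrer link
    .GroupBy(link => link.ReferrerId)                 // step 2: group by referrer...
    .ToDictionary(g => g.Key, g => ScoreReferrer(g)); // ...and compute a referrer score

var videoScores = referrerScores
    .SelectMany(kv => LinksFrom(kv.Key)               // step 3: a record per linked video
        .Select(videoId => new { videoId, score = kv.Value }))
    .GroupBy(r => r.videoId)                          // step 4: group by video...
    .ToDictionary(g => g.Key, g => g.Sum(r => r.score)); // ...and sum into a video score
```

The shape of the computation is just flatten, group, aggregate, twice over; the interesting part is making step 4 fast when the flattened list holds tens of millions of records.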

This isn’t difficult to do, even when the scoring calculation is involved. It’s just creating lists, sorting, and then grouping. It’s what we programmers do all too much of every day. And it shouldn’t take very long, either: sorting and grouping tens of millions of items should be a matter of a second or two on modern hardware, even without taking advantage of multiple cores.

My first working cut of this took about 20 seconds, and a few minutes of optimization cut that in half. And then I was stuck. The first three steps happen in well under a second. But no matter what I did, the sorting and grouping portion (step 4) would take almost 10 seconds on a typical query that had about 10 million video references. And that just seemed like way too much time. I tried lots of different things before I got it into my head to try something extreme.

My video reference record was a simple KeyValuePair<int, double>, where the Key is the video id, and the Value is the score from a particular referrer. Those were being placed into a List that I then sorted by video id (Key). The sort was taking most of the time. And that’s just way too much time to sort 10 million items.

In retrospect, I should have thought of this earlier. But things are always obvious after the fact.

You see, I’ve run into this kind of thing before. My sort was written like this:

VideoReferences.Sort((r1, r2) => r1.Key.CompareTo(r2.Key));

I’ll let you look into the details of what exactly that syntax means, but in a nutshell it ends up creating a comparison function that is called through a delegate for each comparison. And, the structures I’m sorting (KeyValuePair) occupy 16 bytes each, meaning that for each swap that occurs during the sort I’m moving 48 bytes around. Like I said, I don’t normally think about this stuff because performance is usually good enough.

But benchmarking results showed conclusively that the sort was taking way more time than I thought it should. So, I thought, what if I could make a list of a native type and sort that? What, indeed.

I changed my reference structure so that it takes only 8 bytes. It now looks like this:

[StructLayout(LayoutKind.Explicit)]
struct VidVote
{
    [FieldOffset(0)]
    public ulong Combined;
    [FieldOffset(0)]
    public float VoteScore;
    [FieldOffset(4)]
    public int VideoId;
}

For those of you who are unfamiliar with C#, that’s the C# way of declaring a union. It packs the integer video id and the floating point score (float is fine here for individual values) into a single 8-byte structure. The Combined field treats the two values as a single unsigned long integer, with the video id in the high 32 bits.

What’s that do for me? Well, I can then write this:

VideoReferences = new List<ulong>(NumberOfReferences);

// ... populate the list with packed values:
var v = new VidVote();
v.VideoId = id;
v.VoteScore = score;
VideoReferences.Add(v.Combined);

// ... and then sort it:
VideoReferences.Sort();

With this structure, Sort uses the default comparison for ulong, which internally does a native 64-bit integer comparison directly rather than calling a comparison delegate.
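Because the video id sits in the high 32 bits, sorting the raw ulongs orders the list by video id, and every vote for a given video ends up adjacent. That makes the grouping step a single linear pass that decodes each value back through the union. A sketch, assuming the VidVote struct above (EmitScore is a hypothetical stand-in for recording a finished group):

```csharp
// Walk the sorted packed values; adjacent entries share a VideoId,
// so grouping is one linear pass with no further sorting or hashing.
var v = new VidVote();
int currentId = -1;   // sentinel; assumes real video ids are nonnegative
double total = 0.0;
foreach (ulong packed in VideoReferences)
{
    v.Combined = packed;            // decode through the union
    if (v.VideoId != currentId)
    {
        if (currentId != -1)
            EmitScore(currentId, total);  // hypothetical: record the group's score
        currentId = v.VideoId;
        total = 0.0;
    }
    total += v.VoteScore;
}
if (currentId != -1)
    EmitScore(currentId, total);    // don't forget the last group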

The result? Sort time went from almost eight seconds to less than two seconds, and the subsequent grouping code is also faster than in the previous version. I’m a little bit annoyed that I had to resort to these bit twiddling tricks, but I can’t complain too much about a 70% decrease in run time.

And the really nice thing is that given this new structure I can see another improvement that should knock another big chunk off the run time.
