Large text file viewers

I’ve evaluated a lot of text file viewers for Windows, looking for one that has a decent user interface, a small set of “must have” features, and that can handle very large files.

One of the problems I have when looking for “large text file” viewers is that people have different definitions of “large.” Some programmers proudly state that their programs handle a 10 megabyte file with no problem. Others tout their ability to handle files that are “hundreds of megabytes” in size. And there are plenty that use one gigabyte as their benchmark. Granted, a gigabyte is a lot, but I regularly work with files that are tens of gigabytes. For example, the file I’ve been using recently to test file viewers is 30.83 gigabytes (33,100,508,272 bytes).

Many file viewers that claim to work with files “of any size” forget to add the fine print that says, “as long as it will fit into memory.” Many of those do indeed work on a 10 gigabyte file, provided you’re willing to wait five minutes for the entire file to be loaded into memory. I’ve also found that many of these tools use an inordinate amount of memory. One viewer, for example, requires about 4 gigabytes of memory in order to display a 2 gigabyte file. Why that is isn’t clear, but it certainly limits the size of file that can be viewed.

Of those viewers that don’t choke on my 30 gigabyte file, not one that I’ve tried to date has the combination of features and performance that I desire. A few come close, but fall down in the user interface or, more importantly, in how quickly they respond to my requests. I wrote a little about this back in April in my .NET Reference Guide. See Viewing Large Text Files. In that article, I identified many of the tools I tested and took a closer look at three of them.

I’ve thought about this particular issue quite a bit over the years, and have always stopped short of sitting down and writing my own viewer. Why? Because it really is a difficult problem, and I have other more pressing things to do. There are user interface and implementation tradeoffs that you have to make, and although I think I could do a better job with those tradeoffs than others have done (at least, I’d be happier with my implementation of the tradeoffs), I’m not convinced that the improvements I’d make would justify the time I’d spend making them. It is interesting, though, to think about the problems and possible solutions.

Requirements

In the discussion that follows, I’m going to assume a minimal graphical user interface–something about on the level of Windows Notepad without the editing capabilities. I expect the following features, at minimum:

  • Intelligent handling of UTF-16 (big-endian and little-endian) and UTF-8.
  • Ability to specify the character set.
  • Quick (“instantaneous”) load time, regardless of file size.
  • Ability to handle arbitrarily large (up to 2^64 bytes) files.
  • Vertical and horizontal scrollbars.
  • Standard movement keys (line up/down, page up/down, start/end of file).
  • Optional word wrapping.
  • Optional line number display.
  • Jump to specified line number.
  • Search for specified text.

I define “arbitrarily large” as 2^64 bytes because the most common file systems today (NTFS and the various popular Linux file systems) use 64-bit file pointers. Whenever I say “arbitrarily large file,” assume that I mean “up to 2^64 bytes.” Although I’ll limit my discussion to files that large, there’s no fundamental reason why the techniques I describe can’t apply to larger files.

One might rightly ask why I’m even thinking about files that large. It seems the height of folly to construct such big files. I tend to agree, in theory. But in practice it turns out that creating one huge XML (or some other text format) file is the easiest thing to do. The computer will be reading the data sequentially whether it comes in as one file or multiple files, but writing the program to create, manage, and transmit multiple files in order is more complicated than writing a program that handles one monolithic file. I regularly see files that are tens of gigabytes in size, and others I know talk of hundred-plus gigabyte files. I have no doubt that in ten years we’ll be talking about multi-terabyte (and larger) text files. It’s only when we want to read the files at arbitrary positions identified by line numbers that things get difficult.

Everything else is optional, although regular expression searching and the ability to select text and copy it to the clipboard would be really nice to have. Neither is particularly challenging to implement.

The character set handling is very important, although not especially difficult to implement. Notepad, for example, does a credible job of determining the text encoding, although it does guess wrong from time to time. The program should give the user the opportunity to specify the encoding if things look wrong or if he knows that the file is using a particular encoding or code page.
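
To illustrate, here is a minimal sketch (in Python) of the byte-order-mark sniffing a viewer might do, with a caller-supplied fallback standing in for the user override. The function and its default are my own invention, and a real viewer, Notepad included, also applies statistical heuristics when no BOM is present.

    # Minimal BOM sniffing: inspect the first few bytes of the file and fall
    # back to a default encoding that the user can override from the UI.
    def detect_encoding(path, default="utf-8"):        # hypothetical helper
        with open(path, "rb") as f:
            head = f.read(4)
        if head.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"                         # UTF-8 with BOM
        if head.startswith(b"\xff\xfe"):
            return "utf-16-le"                         # little-endian UTF-16
        if head.startswith(b"\xfe\xff"):
            return "utf-16-be"                         # big-endian UTF-16
        return default                                 # no BOM: guess, and let the user correct it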

In reality, there are two major requirements that drive the design of the program:

  1. Handle arbitrarily large files.
  2. Show the user what he wants to see as quickly as possible.

Before we get started talking about how to meet those requirements, I think it’s important to take a closer look at what I consider the best file viewer program available today.

Viewing files the Unix way

The GNU program less, a descendant of the classic Unix pager more, is a very effective viewer for arbitrarily large files. I’ve tested less with files larger than 100 gigabytes, and it handled them just fine. I don’t know what it would do with a terabyte-sized file, but I imagine it wouldn’t have any trouble. less is solid, uses very little memory, and works much better than many file viewers that consume hundreds of megabytes and still don’t handle large files gracefully. It’s rather humorous, and faintly depressing, to see a 25-year-old command line tool written to use minimal memory outperform modern GUI tools that throw memory at the problem. Anybody who’s working on a file viewer should study less very carefully.

less has a few features that give some ideas of how to handle very large files. One of the things that people find surprising is that less can position to any point in an arbitrarily large file very quickly. If you press > (or enter one of the other end-of-file commands) to move to the end of the file, less goes there immediately and shows you the last page full of lines from the file. It’s almost instantaneous. Similarly, entering 50% at the command prompt will take you to the halfway point in the file. Instantly.

It shouldn’t be surprising that less can position to any point in the file. After all, it’s trivial to seek into the file, read a buffer full of data, find the start of the first line in that buffer, and display it. What’s surprising to me is that so many authors of file viewers seem to think that positioning by percentage is difficult or not useful. They insist on doing things by line number which, as you’ll see, is the wrong way to think about the problem.
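
To make that concrete, here is a rough sketch of the technique, not of less’s actual implementation: positioning by percentage needs nothing more than a seek and a resynchronization to the next line boundary. The function is hypothetical and assumes a newline-delimited file in a UTF-8-compatible encoding.

    import os

    # Percentage-based positioning: seek, discard the partial line the seek
    # landed in, and return the next screenful of lines. No index is needed,
    # so it is as fast for a 30 GB file as for a 30 KB one.
    def page_at_percent(path, percent, lines_per_page=25):
        size = os.path.getsize(path)
        offset = size * percent // 100
        with open(path, "rb") as f:
            f.seek(offset)
            if offset > 0:
                f.readline()    # resync: skip ahead to the next line boundary
            return [f.readline() for _ in range(lines_per_page)]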

less can work with line numbers, too. For example, entering 50g will go to the 50th line in the file. You’ll find, though, that if you enter 1000000g, the program appears to lock up. What it’s doing is parsing the file, looking for the millionth line, and the only way to find the millionth line is to find all 999,999 lines that come before it. The program hasn’t locked up, by the way; pressing Ctrl+C will return you to the program’s command prompt.
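
Here is a sketch of what that scan amounts to (again hypothetical code, not less’s source). The cost is proportional to the target line number, because every preceding newline has to be counted.

    # Find the byte offset where line `target_line` starts by counting the
    # newlines that precede it, reading the file in fixed-size chunks.
    def offset_of_line(path, target_line, chunk_size=1 << 20):
        line, offset = 1, 0
        with open(path, "rb") as f:
            while line < target_line:
                chunk = f.read(chunk_size)
                if not chunk:
                    raise EOFError("file has fewer than %d lines" % target_line)
                pos = chunk.find(b"\n")
                while pos != -1:
                    line += 1
                    if line == target_line:
                        return offset + pos + 1        # byte after the newline
                    pos = chunk.find(b"\n", pos + 1)
                offset += len(chunk)
        return offset                                  # target_line was 1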

The command -N will turn on the line number display. When that mode is enabled, less won’t display the results of a positioning command until it has computed the line numbers. So if you enable line numbers and then enter 50% to go to the halfway point, less will display this message:

Calculating line numbers... (interrupt to abort)

You can wait for the line numbers to be calculated, or you can press Ctrl+C to interrupt. If you interrupt, less turns off the line number display and immediately displays the page of text that you asked for.

If you ask for the millionth line and wait for it to be located, less knows where you are. If you then ask for line 1,000,100, less can find it trivially because the program knows where the millionth line is, and all it has to do is search forward 100 lines. If, however, you then ask for line 500,000, less doesn’t know where the line is. The program has to start over at the beginning of the file and parse the first 499,999 lines to find the position you asked for. It’s possible, by the way, that the program will search backwards. I don’t know how it’s implemented. In either case, less knows where it is in the file, but it doesn’t know where it’s been.
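
Here is a sketch of that single-bookmark behavior, assuming, as the behavior above suggests, that the viewer remembers only the last line it located; the class and its names are my own.

    # Remember one (line, offset) bookmark. Requests at or past the bookmark
    # scan forward from it; requests before it start over from the top.
    class LineLocator:
        def __init__(self, path):
            self.path = path
            self.line, self.offset = 1, 0      # line 1 starts at byte offset 0

        def offset_of_line(self, target):
            if target < self.line:             # no memory of earlier lines
                self.line, self.offset = 1, 0
            with open(self.path, "rb") as f:
                f.seek(self.offset)
                while self.line < target:
                    if not f.readline():       # ran out of file
                        raise EOFError("fewer lines than requested")
                    self.line += 1
                self.offset = f.tell()
            return self.offset

With that bookmark, asking for line 1,000,100 right after line 1,000,000 scans only 100 more lines, but asking for line 500,000 afterwards rescans the first half million lines from the beginning, which matches the behavior described above.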

The above is not meant as criticism of less. The program is incredibly useful and works very well within its limitations. The program was designed to work on computers that are very primitive by today’s standards. And yet it often outperforms modern programs designed to take advantage of computers that were inconceivable 25 years ago. Granted, less has been updated over the years, but its basic design hasn’t changed.

If nothing else, programmers who are writing file viewers should study less because it does handle arbitrarily large files–something that most file viewers don’t do. In addition, the program has features that every file viewer should have but most don’t. In general, if less can do it but your program can’t, then there’s a problem with your program. Similarly, if your program doesn’t do anything that less can’t, or do anything better than less already does, then what’s the point of writing it?

There are two major things I’d like to improve on in less: character set handling and the speed of locating lines. There are some smaller features that I wish less had and that I’d add were I to write my own viewer, and I’d prefer a GUI, but those are relatively minor issues. Speed and character set handling would be my real motivation for writing a file viewer of my own.

Character set handling can be tricky, but it’s a known problem with good solutions. Improving the speed of locating lines, too, doesn’t involve any heavy lifting. Several other file viewers have solved it to various extents, although their solutions are, in my opinion, fundamentally flawed in several ways. In particular, the two programs (other than less) that could view my 30 GB file both appear to create an in-memory line index, which puts serious limitations on the size of files they can view. A small file that contains a lot of very short lines requires more memory than a large file that has long lines. A terabyte-sized file with lines that average 80 characters would have about 13.7 billion lines. Naively stored, that index would require more than 100 gigabytes. With a little thought, you could reduce the memory requirement by half, but 50 gigabytes is still more memory than most servers have, and six times as much memory as a typical developer machine.
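
To make the arithmetic concrete, here is the back-of-the-envelope calculation behind those numbers:

    # Size of a naive in-memory line index: one 64-bit byte offset per line.
    file_size = 2 ** 40                   # a one-terabyte file
    lines = file_size // 80               # ~13.7 billion 80-byte lines
    index_bytes = lines * 8               # 8 bytes per stored offset
    print(lines, index_bytes / 2 ** 30)   # ~13.7e9 lines, ~102 GiB of index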

I’ve given this problem some serious thought, and I think I’ve come up with a workable solution that improves on less while using a relatively small amount of memory. I’ll explain my approach in my next posting.