Last time, I discussed existing text file viewers and why I find them inadequate. It’s disappointing that GNU less is the closest thing I’ve found to a good text file viewer for Windows. Functionally,
less is almost perfect. For sure it blows away all of the other file viewers I’ve seen. Those other tools have some good ideas, but they don’t handle large files efficiently. In this article, I’ll describe how to improve the file handling, which should make it possible to write a GUI-based file viewer that has the features of
less, and better performance.
The discussion below starts very simply so as to illustrate the problems and possible solutions. The discussion ultimately converges on what I think is the most reasonable way to improve the speed of navigation within a large text file.
Note also that I’m assuming a 64-bit system, where pointers occupy eight bytes and the program has access to more than two gigabytes of memory.
Working with large files
If a file will fit into memory, displaying it is very easy. I can load the file into an array of lines and display lines on the screen as required. Allowing for line wrap adds a small bit of complexity, but it’s quite manageable. For the moment, forget that loading a large file can take quite a bit of time. At 100 megabytes a second, loading a gigabyte takes 10 seconds–an eternity if you’re a user waiting to see the file.
An array is the simplest possible data structure that lets you directly index a line by number. In most programming languages, that array will require eight bytes per line for a pointer to the actual string. The size of file you could display with this program would be limited to available memory. More specifically, if the file’s size plus
8 * LineCount is smaller than available memory, you can view the file.
Note that the line index isn’t strictly required. You could do in memory what
lessdoes on disk: start at the beginning of the file and sequentially scan it to parse out the lines. Doing that in memory would be significantly faster than reading from disk (after the file’s in memory, of course), but it would still take a very long time on a sufficiently large file.
Nobody has infinite memory, so the above won’t work very well. Typical memory on a developer’s computer these days is about eight gigabytes, so to view a very large file, you need a way to represent the important parts of a file using as little memory as is reasonably possible.
You could, if you wanted, store only the index in memory and leave the actual text of the file on disk. You’d still have to build the index, which would take just as much time as loading the entire file into memory, but you could work with a much bigger file. That is, if your average line is 80 bytes (including the line terminator), the in-memory representation of the file plus index would be 88 bytes per line. But if you’re not keeping the text, your cost is only eight bytes per line. Nothing like a 91% savings.
Such a design would likely work very well from the user’s perspective. After the wait for loading, that is. If the user asked for a particular line, the program could position to that line in the file, read it, and display a page full of data. Going to a particular line would be just as fast as positioning into the file by percentage.
Eight bytes per line is hardly free, though. My 30 GB test file has about 400 million lines in it. At eight bytes per line, the line index would occupy 3.2 gigabytes. The maximum sized file you could load on your typical developer machine would be between 60 and 70 gigabytes, and it would take about ten minutes to load the line index.
Reducing the line index size
One way to save memory would be to have the line index store the position of every other line. That would cut the size of the line index for any particular file in half. If you stored the position of every 16th line, you could make the line index 1/16th the size. You can pick any number
N for the line interval, and the index size would be correspondingly reduced.
Of course, the larger
N is, the more searching the program has to do for any particular line. If
N is 16, then, on average, the program will have to search 8 lines of data to find the starting point. If line lengths are relatively short, this isn’t a problem at all. But if some lines are very long, the search time to find an arbitrary line could be several seconds.
Another way to reduce the size of the line index is logically divide the file into blocks of some size. Imagine, for example, that you view the file as a sequence of four-gigabyte blocks. For each block, you store its start position (offset in the file), and the line positions are stored as offsets from that starting position rather than as offsets from the start of the file. Since four gigabytes can be represented by 32 bits (four bytes), the size of each line number entry is only four bytes rather than eight as in the naive index. Your index size is cut in half, and you still have every line indexed. The only cost is a small amount of complexity and minimal runtime overhead to calculate a file position given a block offset and line offset. The seek and read time swamps that overhead–it wouldn’t be noticed.
You can do even better if you use 16-megabyte blocks. 16 megabytes is 24 bits, meaning that the line offsets can be stored in three bytes each, saving you 25% over the index with four-gigabyte blocks. There is another small runtime penalty for working with three-byte quantities, but again it wouldn’t be noticed in comparison with the disk seek and read time. Altogether, we have a 62.5% decrease in memory usage over the naive index. There’s a small amount of memory overhead involved in working with the smaller blocks, but it’s not unmanagable. Each four-gigabyte block needs 256 16-gigabyte blocks, but the savings per index entry more than offsets that.
Note that the size of the index depends entirely on the number of lines in the file. A one-gigabyte file that has 250 million lines (very short lines) would require almost two gigabytes for the naive line index. The “optimized” line index cuts quite a bit off of that, but it would still require about 750 megabytes. On the other hand, an 8-gigabyte file that has 100 million lines (about 86 bytes per line) would only need about 300 megabytes for the line index. Explaining to users why your program can load an 80-gigabyte file with no trouble, but chokes on a 10-gigabyte file would be rather awkward.
Any scheme that indexes every line is going to suffer when there are many lines, because the index will occupy a very large percentage of the total file size. And any scheme that indexes every Nth line will only reduce memory usage by a factor of N. It’s also going to suffer when lines are long, because searching for lines that follow a known line number can take an arbitrarily long time. Also, indexing every Nth line still puts you at the mercy of the number of lines. Small files with many lines will still require more index memory than large files with fewer lines.
The solution is to combine the approaches: divide the file into smaller blocks as in the blocked line index, but only store the line number at the start of the block. The size of the index is then dependent entirely on the size of the file rather than on the number of lines. In addition, we can place an upper limit on the amount of time it takes to locate a line in a block. This scheme solves both problems, at the cost of a little extra processing whenever a block is read.
Suppose we use a block size of 64 megabytes (2^26 bytes). A one-gigabyte file (2^30 bytes) would require an index with 16 entries. The number wouldn’t be less than 16 entries, but there would probably be slightly more because the line lengths aren’t uniform and lines wouldn’t pack exactly into the 64 megabyte block size. Pathological cases could require up to 32 blocks for a gigabyte, but I suspect those would be pretty rare. With a little work, I could flag the start of a block as the continuation of the previous block, but it’s unlikely that the added complexity would be worthwhile.
The index entries would be larger than the line index entries, but there are a lot fewer of them. Figure eight bytes each to store the file position and the line number, and four bytes for the block length. So figure 20 bytes per block for the index. If it requires 32 blocks per gigabyte (the absolute worst case), you’re talking 640 bytes per gigabyte. The best case is 320 bytes, and 512 bytes would easily handle all but the most pathological cases. So figure 512 (2^9) index bytes for every gigabyte (2^30 bytes) indexed.
The size of the index, then, is roughly
Filesize/(2^21). So a terabyte file (2^40 bytes) would require an index of 2^19 bytes, or half a megabyte. A petabyte-sized file (1,024 terabytes, or 2^50) bytes, would only require half a gigabyte for the block index. An exabyte-sized file (2^60 bytes) would require 512 gigabytes for the index, but I figure by the time we’re working with files that large, we’ll have a lot more memory in our computers, and disk performance will be good enough that we can further increase the block size. For now, I’m happy to know that, if I were to write this, I could handle a petabyte-sized file in less than a gigabyte of RAM.
A larger block size would reduce the in-memory index size further, but there are tradeoffs that I’ll discuss below
There is one issue that the discussion above doesn’t address: lines that are longer than the buffer size. There are a couple of ways to handle that situation, the easiest being to automatically split those long lines. So if a line is 65 megabytes long, the program treats it like one line that’s 64 megabytes, followed by a second line of one megabyte. Another way is to flag a block as a continuation block. That is, store the same line number as the previous block, but set a flag to say that it’s a continuation of the previous block. There are other possibilities, none of which are particularly difficult to implement. I’ll wave my hand over the issue and call it solved.
Reviewing the requirements
Remember that my simplified requirements are:
- Handle arbitrarily large files.
- Show the user what he wants to see as quickly as possible.
If you ignore the load time, the line index described above meets both of those requirements. The index design I described above can handle a petabyte-sized file on commonly available hardware, and once the index is constructed the program would be able to show the user any line in the file almost instantly. The algorithm is very simple. The block index contains the line number at the start of each block. So the program can quickly determine in which block the requested line resides, load that block, parse the lines from it to create a per-line index for that block, and then display the line in question. A standard developer machine should be able to load and parse 64 megabytes in less than a second.
I said above that a larger block size would reduce the memory requirements, but that there are tradeoffs. The tradeoff is response time. If the user wants to position to a line that is not in the current block, the program has to load the relevant block and parse it to find the line positions in that block. A typical developer machine these days can do random reads at a rate of about 100 megabytes per second. Loading a 64 megabyte block, then, will take perhaps 700 milliseconds, leaving 300 milliseconds to parse the block and display the first page of data. If I increased the block size to 128 megabytes, it would take twice as long (two seconds) to position within the file. One second is, to me, on the edge of too slow. A two-second response time is unacceptable. If I/O speeds increase, I’m happy to increase the block size. On the same note, I might have to reduce the block size if it takes too long to read and parse 64 megabyte blocks.
The problem I’m faced with is the initial load time, which fails the second requirement. I know that I wouldn’t use the program if I had to wait five minutes or more for it to index my 30 GB file.
At this point it’s important to remember why we’re constructing an index at all. It serves two purposes: to let the user position into the file based on the line number, and to display line numbers if the user wants to see them. Being able to work with line numbers is essential, but it’s not the most important feature of a text file viewer. As we saw with
less, there are many things you can do with a file viewer that don’t involve line numbers at all.
Having to wait for the file to be indexed is especially annoying when all you want to do is look at the first part of the file. Very often, that’s what a user wants to do: start at the beginning of the file and read through it sequentially. It’s unreasonable to make the user wait for all the line numbers to be computed before showing him the first part of the file.
What if, when the user first started the program, it immediately read the first block of the file and presented it to the user? That’s what
less does. It will even display the line numbers immediately. That’s trivial to do, of course, since reading the first part of the file, calculating line numbers for that portion, and displaying the first page is very fast. Any file viewer should do that, at minimum. Remember: show the user what he wants to see as quickly as possible.
But you can improve on
less just a little bit by reading the first block (64 megabytes) of the file and parsing it to build an index of the lines in that block. Whenever the user wants to jump to a line that’s within that first block, the program already knows where that line is and has it in memory! This increases the memory footprint a little bit, as it requires that we keep track of lines in a block, but only for a single block. The worst case would be a block that contains nothing but blank lines, which would mean 64 million entries in the line index for that block, or about 128 megabytes. Not small change by any means, especially compared to
less, but quite acceptable on an 8 gigabyte machine.
If the user is paging sequentially through the file, the program continues to serve lines by looking them up in the index. When the user pages beyond the part of the file that’s already been indexed, the program loads the next block, indexes it, discards the previous block, and continues. As long as the user is paging sequentially through the file, this program acts just like
less, except that it’s building a persistent block index.
An obvious optimization would be to keep a few blocks of the file in memory in order to reduce the ping pong effect that would occur when the user is displaying the end of one block and the beginning of the next. Assume that this is done. I wave my hand over the technical details of the implementation because it’s not pertinent to the discussion at hand, and it’s not a difficult thing to do.
If the user then asks for the millionth (or ten millionth) line, the program again acts just like
less: it makes the user wait until it has parsed enough of the file to locate the millionth line. It will still take just as long to find the millionth line as
less takes, but the program is doing one other very important thing: it’s building and saving the block index.
That’s important for two reasons. First, as you saw with
less, if you jump to the ten millionth line and then ask for the five millionth line,
less has to start over at the beginning of the file and parse it to find line number 5,000,000. Granted, that’s pretty quick on modern hardware because the operating system will cache large parts of the file, but if you go to line 100,000,000 and then 50,000,000, you’re going to wait a long time. This proposed new program doesn’t suffer from that. Since it’s created the block index, it knows which block contains line 5,000,000 (or 50,000,000), and can jump immediately to that block.
The second reason it’s important is similar. With
less, if you go to line 100,000,000 and then back to the beginning of the file, and then want line 100,000,001, the
less has to scan the entire file again–all 100 million lines that it had scanned before. As I said previously,
less knows where it is, but doesn’t know where it’s been. Our program, though, knows where line 100,000,000 is, and can immediately start searching for the next line at that position.
The program would, of course, allow quick positioning to the end of the file or to any point within the file by percentage, just as
less does. It would also make the user wait (with the opportunity to cancel the wait) if line number display is turned on but the line numbers need to be computed.
In the above discussion, the program mimics
less to begin with. It can’t do anything else. But once it’s seen a particular line in the file it knows what block that line is in and can go back to it very quickly, without having to sequentially scan through the file from the beginning.
As useful as such a thing is, it would be only a small incremental improvement over
less. It’s probably worth doing and if such a program existed and provided all the other features of
less, I’d use it. But it’s not enough of an improvement to make me want to write it.
Doing better than less
The only seriously annoying thing about
less is that I have to wait for line numbers to be computed when I ask for a specific line, or when I have line numbers turned on and I position into the file using
%. Whereas line numbers are important, I hate having to wait. The program described above reduces waits after the first time it sees a line, but that initial wait time is a serious annoyance. At 100 megabytes a second, it would take five minutes to compute line numbers for my 30 GB file.
If the user is paging sequentially through a text file, there’s no perceptible delay from one page of the file to the next. The only time he has to wait is when he wants to display a line number that the program hasn’t already seen.
So what would happen if the program jumped ahead of the user? What if, when the program starts, it reads the first block of the file, displays it to the user, and then keeps reading, building the block index as it goes? Given sufficient time, the program would finally reach the end of the file and the block index would be fully constructed. What if the program could do that in the background while letting the user view the file as before?
Doing that wouldn’t be especially difficult. Any developer machine today has multiple processor cores, so a background thread that’s scanning the file isn’t going to interrupt the responsiveness of the UI. Modern programming environments make working with multiple threads much easier than they were in the past, so implementing this improvement would be relatively easy, and the result would reduce or completely eliminate the user’s need to wait for indexing. If, for example, the user started the program and spent three minutes paging through the first part of the file, the program would have had those three minutes to index a large part of the file. Even at 50 megabytes a second (half of what you would expect from a modern developer machine), the program can index three gigabytes a minute. While the user is slowly paging through the file, the program is reading ahead, indexing the file. The result should be much smaller wait times–even the first time through the file.
The user would still be able to position to any point in the file, just as before. Again, the only time he’d have to wait would be to display correct line numbers. And with background indexing of the file, even that requirement can be relaxed.
Estimating line numbers
My 30 GB test file has an average line length of about 84 bytes, and the contents of the file are reasonably uniform. Line 100,000,000 of the file is, to a fair degree of accuracy, at about position 8,400,000,000 in the file. The actual number reported by
less is 8,393,544,127, meaning that my calculation was off by 6,455,873 bytes or .07%. Then again, my average line length estimate was “about 84 bytes.” The program would have a much more accurate value.
If the program is reading and indexing the file in the background, it can keep a running average line length so that when the user asks for a line that hasn’t been read, the program can estimate the position of that line number and jump to that position in the file. Obviously, the line number display would have to indicate that the numbers shown are approximate. Would that be good enough?
It would be for me. Very often, I just want to see what’s “around” line 100,000,000, or I know there’s a particular bit of text on or near that line. If the program can put me close to that line, I can use the search facility to quickly find the exact line I’m looking for by searching in the vicinity.
The program continues to index lines in the background, and periodically updates the line number estimates. When the program gets to the block that contains the current position, it can update the display with the actual line numbers.
This is far from a perfect solution. For example, an error of .07 percent in a petabyte-sized file is a huge chunk–on the order of a terabyte. The error would be much greater if the first part of the file (the part already scanned) is not representative of the rest of the file. The program can use various adaptive techniques in an attempt to reduce the error, but nothing is going to be perfect.
The program described above would be a tremendous improvement over the venerable
less, provided that it maintained all the existing functionality that makes
less so useful. Adding the block index, background file scanning, and estimated line numbers are improvements that should be obvious to anybody who thinks about file viewers. Several of the file viewer tools I’ve seen implement some of what I described, but in fatally flawed ways. And that bothers me.
Background file scanning is an obvious improvement, and one that a few other programs already have, although some implement it stupidly.
The line index, confuses me. The designers of other file viewers had to realize that the line index limited the size of file their programs could handle. It looks like they gave it some thought, as none of the programs seem to use eight byte line indexes. Most appear to use four byte indexes, meaning that they’re doing something to reduce memory consumption. But even four bytes per line adds up when there are hundreds of millions of lines in the file. Why nobody thought of keeping just a block index and re-parsing blocks as required is a bit of a mystery. Or maybe they did think of it, and discarded the idea. That would be an even bigger mystery.
The one thing that doesn’t confuse me, but concerns me nonetheless, is the insistence on knowing the exact position of a line before allowing the user to position to that line. I can’t be the first person to think of using estimated line numbers.
For more than 25 years, hex file viewers have loaded the first page of a file immediately and have had the ability to position to any point in the file instantly. I know this to be true because I wrote such a program in 1983. The original version could handle a file up to four gigabytes in size, using minimal memory to do it. I had to wait about 15 years before I could prove that, since when I originally wrote the program even a 10 megabyte hard drive was a luxury that most didn’t enjoy. To be fair, a hex file viewer is a much simpler program than a text file viewer because there are no “lines” to worry about. As far as the hex file viewer is concerned, a file is just an array of bytes that’s rather slow to access.
My insight, and one that I have to believe others have had, is that a file is a file. It’s our interpretation that makes it a text file, a database file, or whatever. The only thing that prevents a text file viewer from instantly positioning to a particular line is not knowing exactly where that line is in the file. Realizing that, the designer of a file viewer make a choice: force the user wait for the line numbers to be calculated, or display text at the estimated line position. As a user, I’m not going to be completely happy with either (I want it now!), but I’m much happier to be given something rather than having to wait an eternity for the program to calculate line numbers. If other designers thought of this (and I have to believe that they did), they made the other choice. They elected not to potentially confuse users by showing incorrect line numbers.
And I agree that there’s potential for user confusion here. If the user asks for line 100,000,000 and the program puts him there instantly, he’s tempted to believe that he really is at line 100,000,000. That’s especially true if the user doesn’t realize that calculating line numbers takes time. But that user confusion is mitigated by two things:
- Most people who view these huge text files are programmers who will understand the issues.
- A simple popup message explaining that the line numbers are estimates.
It’s my contention that positioning to what the program thinks is the correct line and allowing the user to proceed from there is preferable to making the user wait a potentially very long time (and five seconds is a long time when you’re trying to get things done) before he can do anything at all.
Some thoughts on implementing the user interface
One of the problems I struggled with in working out the solution I described above is how the vertical scrollbar behaves. The horizontal scrollbar turns out to be easy to handle, but the vertical scrollbar required a bit more thought. I also initially thought that the optional line wrapping feature would complicate things. It turns out that if we change the way we think about those two features, their implementations become much easier.
Absent line wrapping, the horizontal scrollbar’s maximum value is set to the length of the longest line so far encountered in the file. It’s as simple as that. In the simplest implementation, the length is updated by the background process that’s scanning the file. However, it could also be updated when the foreground process positions to a point in the file that hasn’t yet been scanned. Whenever a previously unseen line is parsed, the scrollbar limit is updated if necessary. There’s a potential problem if a single line exceeds the scrollbar’s capacity (2^31 in Windows), but that can be handled by setting a limit on line length, or by using a the scaling technique that I describe below.
When line wrapping is turned on, the horizontal scrollbar is disabled or hidden, since there’s no need to scroll horizontally when the text wraps to the window width.
At first thought, the vertical scrollbar seems like a huge problem. If the interface uses it to reflect the line position in the file, then the scrollbar limits will have to be updated continually as the file is being scanned. In addition, using the scrollbar in that way would limit the number of lines that could be accurately represented to the maximum value accepted by the scrollbar. Using the standard Windows scrollbar, that limit is 2^31, meaning that you couldn’t work with a file that has more than two billion lines. Well, you could, but the code would be difficult.
If, instead, we treat the vertical scrollbar as representing the position as a percentage of the file’s content length, things are greatly simplified. If the file is less than two gigabytes, then the scrollbar limit is the file’s length and the scrollbar thumb position is equal to the exact byte position in the file. If the file is larger than two gigabytes, then we can use a constant multiplier for the scrollbar thumb position. The math is quite simple.
The scrollbar small change value is always 1. If the user clicks on a scrollbar arrow, the window scrolls one logical line, a “logical line” being defined by the state of the line wrapping flag. Similarly, the large change is however many logical lines fit in the window at its current size, using the currently selected font. Simple and effective.
One thing that always bothers me about Notepad is that if you turn on line wrapping in a very large file, the program takes a long time to re-compute the wrapping, and ends up moving you to the beginning of the file. Other editors and file viewers do the same thing, and it always struck me as kind of crazy that they would. It appears that the only reason they do this is so that they can make the scrollbar reflect the number of logical lines in the file–that is, the number of display lines it takes to show all of the wrapped lines in the document. It’s nuts.
There’s no other reason I can think of that going from horizontal scrolling to line wrapping would cause such a delay. Certainly, there’s no reason why the program can’t just wrap the current page almost instantly. The top line remains the same, and it’s trivial to line wrap that line and the following lines to fill the page. What comes before and after the current page is irrelevant.
If we think of the scrollbar position as the current position in the file, irrespective of line numbers, then all that complexity goes away.
Another benefit of viewing the scrollbar in this manner is that the user can grab the thumb and pull it down to, say, the halfway point. The program will behave as though the user asked to position at the 50% point in the file–just as if, in
less, the user had entered
50%. Simple, not confusing, and very effective.
Will I write it?
If this program existed today, I would use it. It’s a significant improvement over
less, which I consider the best text file viewer currently available. The question is whether it’s enough of an improvement for me to develop it. Writing this file viewer would entail a lot of work. The algorithms themselves aren’t especially difficult to implement, but it would take some time to get everything right.
For now, it’s just an idea. I have other fish to fry at the moment, but I sure do want this tool.