Elephant on a trampoline

I laughed out loud when I saw this picture.

I was nine years old when Dad bought a trampoline. He had us check out some books from the library so we could learn the proper way to jump. I don’t know how much attention everybody else paid to those books, but I kind of glanced through them to get ideas for weird and wacky ways I could court death.

All five of us kids got pretty good on the trampoline. My older brother and sister had better form than I did, but I was the wild man. I’d try pretty much any trick I heard about, saw a picture of, or saw somebody else do. I even made up a few myself, although I learned later that they weren’t exactly original.

I do think I came up with original ways to perform unexpected dismounts. On one occasion when I was trying a new trick, I did a back flip right off the trampoline and came down directly in front of my brother, who slowed my descent enough that I landed on my feet. My recollection is that it looked almost like we planned the whole thing. Perhaps his memory of the event is better than mine. I was a bit preoccupied with my life flashing in front of my eyes as I sailed through the air to my doom.

I spent a lot of time on that trampoline from the time Dad bought it until I was 16 or 17. One summer I made a point to spend an hour every day on the thing.

Fast forward 20 years or so when Debra bought me a trampoline for my birthday and I set it up out here in the back yard. It was fun for a while–a week or so–but then the new wore off. It wasn’t nearly as much fun as when I was younger and always had friends who would come over and play on the thing with me. And for whom I could show off my latest trick. Plus, I wasn’t in nearly as good physical shape at 35 as I was when I was 16. Jumping on a trampoline is work.

We sold the trampoline a few years later, and a few years after that we were at a friend’s house. He had a trampoline for the kids, so I took it on myself to teach them a few tricks. No crazy flips or anything–Mom didn’t want me teaching her kids that, although I did have to see if I could still do that back and a half.

I was showing them how to get some real air (feet at shoulder width, pushing off gradually at the right time, using arms to control balance, etc.). I was getting some pretty good height. Then I noticed that the kids were looking under the trampoline, pointing, and giggling. And I was transported back 30 years to when we first got the trampoline and Dad was showing us how to use it. When he jumped, the mat nearly hit the ground! We all giggled at that.

Anyway, seeing the kids pointing and laughing made me start laughing with the memory. So I bounced, shifted to a sitting position, hit the trampoline with my butt … and hit the ground! Yes, the combination of my weight and the height I was getting stretched the springs far enough that I hit the ground.

I checked after that. The trampoline’s weight limit was 200 pounds. And I only weighed 180 at the time. I guess they didn’t think a 180 pound man could get enough altitude to push the mat that far.

And that’s why I laughed out loud when I saw the elephant on the trampoline.

More about cache contention

When I started working on yesterday’s blog entry about cache contention, I built an example program that used an array to illustrate the problem. That is, rather than having a struct that contains four counters, I just allocated an array. It made the code somewhat simpler, as you can see here.

const long maxCount = 500000000;
const int numThreads = 2;
const int Multiplier = 1;
static void DoIt()
{
    long[] c = new long[Multiplier * numThreads];
    var threads = new Thread[numThreads];

    // Create the threads
    for (int i = 0; i < numThreads; ++i)
    {
        threads[i] = new Thread((s) =>
            {
                int x = (int)s;
                while (c[x] > 0)
                {
                    --c[x];
                }
            });
    }

    // start threads
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < numThreads; ++i)
    {
        int z = Multiplier * i;
        c[z] = maxCount;
        threads[i].Start(z);
    }

    // Wait for 500 ms and then access the counters.
    // This just proves that the threads are actually updating the counters.
    Thread.Sleep(500);
    for (int i = 0; i < numThreads; ++i)
    {
        Console.WriteLine(c[Multiplier * i]);
    }

    // Wait for threads to stop
    for (int i = 0; i < numThreads; ++i)
    {
        threads[i].Join();
    }
    sw.Stop();
    Console.WriteLine();
    Console.WriteLine("Elapsed time = {0:N0} ms", sw.ElapsedMilliseconds);
}

The purpose of the Multiplier in that program will become evident soon.

Run with a single thread, that code executes in about 1,700 ms on my work machine–same as the version that uses a struct. But run with two threads, the code takes a whopping 25 seconds! At first I thought that this was evidence of cache contention, so I changed the Multiplier to 8, which spreads out the counters so that they’re guaranteed to be on different cache lines. That is, the first thread will access c[0], and the second thread will access c[8].

That change did indeed improve the run time. The two thread case went from 25 seconds to about 12 seconds. Cache contention was part of the problem, but certainly not all of it. Remember, the two-thread version of my first test yesterday ran in about 2,200 ms on my system.

I ruled out array indexing overhead, figuring that if it was a problem it would show up in the single-threaded case. After ruling out everything else, I was left with two possibilities: either there’s some explicit mutual exclusion going on in the runtime, or there’s some other cache contention that I didn’t take into account.

It turns out that there is more cache contention. You just can’t see it directly because it has to do with the way that arrays are allocated.

When the runtime allocates an array, it allocates enough space for the array contents plus a little extra memory to hold metadata: the number of dimensions, the size of each dimension, etc. This is all contained in a single memory allocation. The layout of an array in memory looks like this:

---------------------------------
|  metadata  | array contents   |
---------------------------------
             ^
          array[0]

The array metadata is smaller than 64 bytes, so the chances of the first array element sharing the same cache line as part or all of the metadata are very high.

That’s half of the problem. The other half of the problem is that whenever code needs to access an element in the array, it has to read the metadata in order to do bounds checking and to compute the offset into the memory block. So whenever you write x = a[i] or a[i] = x, the code accesses that metadata.

If the first array element is on the same cache line as the parts of the metadata used for indexing, then every time you modify that first element, any other thread’s access to the metadata is going to incur a wait for the cache to be flushed. Modifying the first array element invalidates the cached metadata.

The reason it’s worse with arrays than with yesterday’s program is that every time the code executes --c[x], it actually makes two array accesses: one to read the current value, and one to write the modified value. Each of those accesses has to read the metadata, so there can be multiple stalls per iteration. That’s not true when accessing fields in a structure like we did yesterday.
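To make that concrete, the decrement expands to something like this (a sketch; the JIT may generate different code), and each indexed access consults the metadata:

// Rough expansion of --c[x]; each indexed access involves the array
// metadata for bounds checking and offset computation.
long temp = c[x];   // read the element (touches the metadata)
c[x] = temp - 1;    // write the element (touches the metadata again)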

The solution is to put some padding in the front of the array, as well as between the items we’re incrementing. As it stands right now, the indexes being used are 0, 8, 16, and 24. Shifting that right by eight elements would give us 8, 16, 24, and 32. That’s an easy change to make, as you can see here.

long[] c = new long[Multiplier * (numThreads+1)];
var threads = new Thread[numThreads];

// Create the threads
for (int i = 0; i < numThreads; ++i)
{
    threads[i] = new Thread((s) =>
        {
            int x = (int)s;
            while (c[x] > 0)
            {
                --c[x];
            }
        });
}

// start threads
var sw = Stopwatch.StartNew();
for (int i = 0; i < numThreads; ++i)
{
    int z = Multiplier * (i + 1);
    c[z] = maxCount;
    threads[i].Start(z);
}
// Wait for 500 ms and then access the counters.
// This just proves that the threads are actually updating the counters.
Thread.Sleep(500);
for (int i = 0; i < numThreads; ++i)
{
    Console.WriteLine(c[Multiplier * (i + 1)]);
}

I just showed the parts of the program that have to change. Instead of computing the index using Multiplier * i, I used Multiplier * (i + 1), and the counters array is allocated with Multiplier * (numThreads + 1) elements.

If you compile and run that program, you’ll see that the results for one, two, and three threads are almost identical. The result with four threads will be slightly higher, again because system services take a small amount of processor time away from the test program.

What I’ve been calling cache contention is more often referred to in the literature as false sharing. I’d like to thank Nicholas Butler for answering my Stack Overflow question about this, and for pointing me to his article, Concurrency Hazards: False Sharing.

More threads makes the program slower?

Multithreading is A Good Thing. Many solutions can be structured so that you can bring multiple processor cores to bear on the problem, and you can get very close to linear increases in throughput. That is, your quad-core processor will complete a job four times as fast as a single-core processor. Or very close to it. But you have to be careful how you structure your data, or you’ll find that adding threads decreases throughput.

As an example, let’s use a program that has multiple threads, each decrementing a separate counter. That is, each thread does this:

while (myCounter > 0)
{
    --myCounter;
}

myCounter, in this case, is a separate variable for each thread. The program below will handle up to four threads, each decrementing its own counter. Wrapping it all up so we can keep track of everything and supply timings takes a bit of code, but it’s still pretty simple.

struct Counters
{
    public long counter1;
    public long counter2;
    public long counter3;
    public long counter4;
}

const long maxCount = 500000000;
static void DoIt()
{
    var c = new Counters();
    c.counter1 = maxCount;
    c.counter2 = maxCount;
    c.counter3 = maxCount;
    c.counter4 = maxCount;

    int numThreads = 3;
    var threads = new Thread[4];
    threads[0] = new Thread((s) => { while (c.counter1 > 0) --c.counter1; });
    if (numThreads > 1)
        threads[1] = new Thread((s) => { while (c.counter2 > 0) --c.counter2; });
    if (numThreads > 2)
        threads[2] = new Thread((s) => { while (c.counter3 > 0) --c.counter3; });
    if (numThreads > 3)
        threads[3] = new Thread((s) => { while (c.counter4 > 0) --c.counter4; });

    // start threads
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < numThreads; ++i)
    {
        threads[i].Start();
    }
    // Wait for 500 ms and then access the counters.
    // This just proves that the threads are actually updating the counters.
    Thread.Sleep(500);
    long z1 = c.counter1;
    long z2 = c.counter2;
    long z3 = c.counter3;
    long z4 = c.counter4;
    Console.WriteLine("{0},{1},{2},{3}", z1, z2, z3, z4);
    // Wait for threads to stop
    for (int i = 0; i < numThreads; ++i)
    {
        threads[i].Join();
    }
    sw.Stop();
    Console.WriteLine();
    Console.WriteLine("Elapsed time = {0:N0} ms", sw.ElapsedMilliseconds);
}

Understand, I didn’t set out to write a program that just decrements a couple of counters. This is just a very simple illustration of a problem that I encountered. I’m well aware that having public mutable members in a struct is generally a bad idea, and I wouldn’t do it in production code. This code is for illustration purposes only.

Run with a single thread on my system, that code executes in about 1,325 ms. With two threads, you would expect about the same thing, right? After all, each thread is executing on a separate core (assuming you have multiple cores), doing the same amount of work, and there aren’t any locks protecting shared data. It’s straight computation.

Instead, it took about 1,870 ms–more than half a second longer. Three threads takes about 2,600 ms, and four threads about 2,560 ms. There’s definitely something wrong.

The timings I’m reporting here are from my machine at home, which is a Core2 quad, running at 2.4 GHz. I’m running Windows 8 Developer Preview, and the Visual Studio 11 Developer Preview with .NET 4.5. I’ve also run the tests on my development machine at the office: a Core2 Quad, 2.0 GHz with Visual Studio 2010 and .NET 4.0. My home machine is faster, of course, but the timings are relatively the same. That is, the test that takes 1,325 ms on my home machine takes 1,600 ms on my work machine.

At first I thought that for some reason the threads were not running concurrently, and re-ran the test. But, no. When this starts up, CPU usage is 25% times the number of threads. That makes sense, since I have four cores. When four threads are running, CPU usage is 100%. With two threads, it’s 50%.

The problem turns out to be cache contention: multiple cores trying to write to memory locations that are very close to each other. Let me explain cache contention to those who might not be aware of it.

The memory management hardware divides memory up into “cache lines,” which on my system are 64 bytes. When a processor core reads a value from memory, it actually loads the entire cache line that contains that single value. Those 64 bytes are stored in the core’s memory cache. The core can then read from and write to its cache, which is much faster than reading and writing main memory. The cache control circuitry in the processor takes care of copying changed data from the cache to main memory. Usually, that’s done asynchronously so there’s no performance impact on the processing unit.

Things get messy, though, when another core wants to read a value from the same cache line that the first core has modified. The memory control circuitry only knows that something in the cache line changed–not which value. So in order to ensure that the second core gets the most up-to-date value, it has to get that whole cache line from the first core. And, of course, the second core has to wait for that.

In the little program above, the individual counters are right next to each other in memory. They will occupy one or two cache lines. That is, the first one could be at the end of one cache line, with the next three being allocated on the next cache line. That’s assuming that they are actually stored in memory in the order that they’re defined. Assuming that they’re stored sequentially in memory on the same cache line, whenever any thread wants to read or update its counter, it has to wait for some other thread to flush its cache. Those cache stalls add up to a whole lot of wasted processor time.

If you know about cache contention (which I did, prior to encountering this problem), you’d never write a program that has such an obvious defect as what I show above. And I didn’t. The program in which I encountered this problem is somewhat more complex, and the cache contention took a little bit of debugging to find.

So how do I prove that cache contention really is the issue? What would happen if I added some padding to the Counters structure to force each of the counters onto a separate cache line?

[StructLayout(LayoutKind.Sequential)]
struct Counters
{
    public long counter1;
    public long a1, a2, a3, a4, a5, a6, a7;
    public long counter2;
    public long b1, b2, b3, b4, b5, b6, b7;
    public long counter3;
    public long c1, c2, c3, c4, c5, c6, c7;
    public long counter4;
    public long d1, d2, d3, d4, d5, d6, d7;
}

I just added seven long values after each counter so that, together with the counter, each group adds up to 64 bytes. I can’t guarantee that the counter is at the beginning of a cache line, but I know that, since there are 64 bytes from the start of counter1 to the start of counter2, those two values will be on different cache lines. The StructLayout attribute is there to ensure that the compiler lays the structure out in memory exactly as I have defined it.

So what happens if I run that? One thread takes 1,325 ms. Two threads, same thing. Three threads takes a little longer: about 1,350 ms. And with four threads it takes between 1,390 and 1,420 ms. The times for three and four threads aren’t terribly surprising because this isn’t the only program running on the system. System services still have to run, which means that some of my program’s threads will be swapped out. I could spend a lot of time shutting down unnecessary services so I could test that theory, but there’s really no need. I’m satisfied with the “other programs” explanation.

If you reduce the padding to just three long values (i.e. just a1, a2, and a3), the execution time is about 1,900 ms for four threads. That’s because you’re ensuring that no more than two of the counters will be on any one cache line. This is the same behavior you see with just two threads when there’s no padding at all.
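For reference, the reduced-padding struct looks like this. Each counter-plus-padding group is 32 bytes, so two groups share a 64-byte cache line:

[StructLayout(LayoutKind.Sequential)]
struct Counters
{
    public long counter1;
    public long a1, a2, a3;   // 32 bytes per counter group, so two
    public long counter2;     // groups fit on one 64-byte cache line
    public long b1, b2, b3;
    public long counter3;
    public long c1, c2, c3;
    public long counter4;
    public long d1, d2, d3;
}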

The lesson here is that you don’t want multiple threads mutating memory locations that are very close to each other. It’s bound to happen from time to time, of course, and that’s fine. A few cache flushes now and then aren’t going to kill your program. But you should avoid writing code that has critical loops in separate threads modifying adjacent memory locations. If your threads are just reading those locations, there’s no problem. Reads are essentially free. But if you modify memory close to where one of the other threads is reading, those other threads will stall until the processor core updates its cache.

All the fancy new tools that make multithreaded programming easier are wonderful things. I like the Task Parallel Library and the new Asynchronous Programming stuff that’s coming with .NET 4.5. Using those is much easier and often performs better than explicitly managing threads as I’ve shown above. Hiding complexity is usually a good thing. But those libraries don’t do everything. You still need to understand, at least to some extent, what the hardware is doing.

Writing a Web Crawler: Queue Management, Part 1

This is the fourth in a series of posts about writing a Web crawler. Read the Introduction for background and a table of contents. The previous entry is Politeness.

In the discussion of Crawling Models, I distinguished between two types of crawlers: “offline” crawlers that download documents and have a separate process create queue segments that are submitted as seeds to the crawler, and “online” crawlers that maintain a live queue. Queue management for these two types of crawlers is quite different, with the offline crawler presenting a somewhat easier problem to solve. Queue management for the online crawler gets rather complicated.

To review, the general processing model of the offline crawler is:

queue = LoadSeed();
while (queue not empty)
{
    dequeue URL
    request document
    store document
}

A separate program scans the downloaded documents and creates seeds (or segments) that are fed to the crawler. This queue building program maintains a list of URLs that have already been crawled. As it reads documents that the crawlers download, it discards those that have already been crawled or that have already been submitted in a seed. The remainder get added to the new seed. When the seed reaches some threshold, it is typically sorted by domain and then fed to the crawler.

If you think about that a moment, you’ll recognize that the crawler can’t then read the queue sequentially. If it did, then it would crawl all of the URLs for domain A, then domain B, etc. That would violate politeness if the crawler didn’t delay after each request. And if the crawler did delay after each request, it would spend most of its time waiting doing nothing.

What happens is that when the crawler loads the seed, it creates a queue of domains, and each domain has a queue of URLs. When the crawler goes to dequeue a URL, it dequeues a domain, dequeues a URL from that domain’s queue, crawls it, and then adds that domain back to the queue. If there are 100 domains in the queue, then, the first domain is visited once, followed by the other 99 domains before the first domain is visited again.

That’s a somewhat simplified explanation of one way to manage the queue, but it’s very effective. Using that technique almost guarantees that the crawler won’t hammer any domain with too many requests in too short a time. The nice thing is that it works well for a single-threaded crawler and for a multi-threaded crawler. There’s some synchronization involved in managing the queue when you have multiple threads, but that’s easily managed with a simple mutual exclusion lock. As long as a thread obtains a lock on the domain queue before dequeuing or enqueuing an item, things work well. You could use a concurrent lock-free queue rather than a lock, but the truth is that the lock just won’t be contended very often. If you’re crawling on a cable modem, you won’t be able to do more than about 30 requests per second, meaning that you won’t have to take that lock more often than 60 times per second. It’s just not an issue.
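Here’s a minimal sketch of that round-robin domain queue, guarded by a simple lock. The Domain type is a stand-in for whatever per-domain record your crawler keeps, and Crawl is a placeholder for the download-and-store step.

using System.Collections.Generic;

class Domain
{
    public string Name;
    public Queue<string> Urls = new Queue<string>();
    // In a real crawler this record also carries robots.txt data, timings, etc.
}

class DomainQueue
{
    private readonly Queue<Domain> domains = new Queue<Domain>();
    private readonly object queueLock = new object();

    public void Enqueue(Domain domain)
    {
        lock (queueLock)
        {
            domains.Enqueue(domain);
        }
    }

    // Returns the next domain to crawl, or null if the queue is empty.
    public Domain Dequeue()
    {
        lock (queueLock)
        {
            return domains.Count > 0 ? domains.Dequeue() : null;
        }
    }
}

// A crawler thread then loops: take a domain, crawl one URL, put the domain back.
// Domain d = domainQueue.Dequeue();
// if (d != null && d.Urls.Count > 0)
// {
//     Crawl(d.Urls.Dequeue());      // placeholder for the download/store step
//     if (d.Urls.Count > 0)
//         domainQueue.Enqueue(d);   // back of the line; the other domains go first
// }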

That simple domain queue handles the general case very well because there are usually many thousands of domains in the queue. But you’ll find that over time the number of domains decreases because most domains will have only a few URLs to crawl, and some domains will have hundreds or thousands. As the queue size decreases, the time between requests to the same domain decreases. At the end of a crawl, you could have very few domains in the queue. A naive implementation would end up making requests to those domains too quickly.

In general, there are two ways to handle that situation. The easiest is to have the crawler stop and save the remaining seed when the number of domains drops below some threshold. The process that’s controlling the crawler can then combine the remaining seed with the next seed to be crawled. If you have the system set up so that the crawler can request a new seed, then you can have the crawler request a new seed when the queue becomes too small. The crawler would then load that seed and add it to the existing queue.

Both of those solutions are simple to implement, but they will still violate politeness if there is no new seed to be had. In addition, the general model might not be sufficient to guarantee politeness. Remember, different domains have different politeness rules: specifically the Crawl-Delay. One domain might allow requests every 15 seconds, and another might have a delay value of 60 seconds or more. The simple queues that I outlined above do not take that into account.

A better way to handle things is to create another queue–a priority queue–that is kind of like a penalty box. After you crawl a URL from a particular domain, you add that domain to the penalty box, with an expiration time that’s equal to the current time plus the delay value. When that domain’s delay time has expired, you remove it from the penalty box and add it to the back of the domain queue. It looks something like this:

queue = LoadSeed();
while (queue not empty)
{
    dequeue domain
    dequeue URL from domain
    request document
    store document
    if domain queue not empty
    {
        add domain to penalty box
    }
}

You then need a way to remove things from the penalty box. One way is to have the downloader thread do it after crawling. That is, add the following to the loop:

    while (penalty box not empty AND first item release time < now)
    {
        remove item and add to domain queue
    }

That’s simple and effective, but somewhat wasteful. Another way is to set up a timer that checks the penalty box every second. That requires less processor time, and it’s just not going to matter if a domain remains in the penalty box for an extra second. Either way you do it, you’re guaranteeing politeness. Using the penalty box will prevent you from hitting any domain too often.
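Here’s a minimal sketch of the penalty box, reusing the Domain and DomainQueue types from the sketch above. I described it as a priority queue; for brevity this version just keeps a list and scans it, which is fine at the sizes involved when it’s checked once a second.

using System;
using System.Collections.Generic;

class PenaltyEntry
{
    public Domain Domain;
    public DateTime ReleaseTime;
}

class PenaltyBox
{
    private readonly List<PenaltyEntry> entries = new List<PenaltyEntry>();
    private readonly object boxLock = new object();

    // Called right after crawling a URL from the domain.
    public void Add(Domain domain, TimeSpan delay)
    {
        lock (boxLock)
        {
            entries.Add(new PenaltyEntry
            {
                Domain = domain,
                ReleaseTime = DateTime.UtcNow + delay
            });
        }
    }

    // Move any domains whose delay has expired back onto the domain queue.
    // Call this from the downloader threads, or from a once-a-second timer.
    public void ReleaseExpired(DomainQueue domainQueue)
    {
        DateTime now = DateTime.UtcNow;
        lock (boxLock)
        {
            for (int i = entries.Count - 1; i >= 0; --i)
            {
                if (entries[i].ReleaseTime <= now)
                {
                    domainQueue.Enqueue(entries[i].Domain);
                    entries.RemoveAt(i);
                }
            }
        }
    }
}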

I’m implying here that the domain record contains more than the domain name and a queue of URLs. You’ll also need to maintain the robots.txt information for each domain, including the Crawl-Delay value if any. If there is no Crawl-Delay, you should start with a default value (probably one minute), and then keep a running average of the amount of time it takes to request, download, and process a response for that domain. Save the last five times, average them, and use that for the delay time, perhaps multiplying by some constant first. Again, I would recommend a delay time of at least 15 seconds between requests to the same site. Larger sites might not mind you making requests more frequently, but smaller sites will be annoyed.

You’ll also notice in the above code that it doesn’t add a domain back to the domain queue if there are no URLs remaining in its queue. The implication is that the domain is discarded. That’s perhaps a reasonable thing to do if you’re running an offline crawler, although doing so will also discard the computed delay value, the robots.txt file, and any other information you’ve collected for that domain. That’s okay, although not optimal, for an offline crawler, but it is not something you want to do if you have a live queue that’s continually updated. If you did, you’d be downloading robots.txt far more often than you need to.

A better solution for the offline crawler is to have some persistent way to store information about a domain. So if the first seed you crawl has a dozen URLs for domain A and the second seed has a dozen more for the same domain, you don’t have to download robots.txt again. Instead, you can just load the persisted information. Of course, you’ll want to download robots.txt again if it’s been more than 24 hours (or whatever your cache time is) since the last time you downloaded it, but if it’s only been a few hours there’s no reason to request the document again.

With the offline crawler, you can probably get away with checking the robots.txt file when you load a new seed, provided you don’t expect the crawler to take more than 24 hours to exhaust a seed. If your crawler will be long-running (more than a day), then you’ll want to check every time before you crawl a URL. For example:

dequeue domain
if (domain.robotstxt.downloadtime < now - cachetime)
{
    request robots.txt from domain
    domain.robotstxt.downloadtime = now
}
dequeue URL from domain
if (can crawl URL) // check domain.robotstxt
{
    request document
    store document
}
// etc.

Things get complicated in a hurry, don’t they? It gets worse, but for now I’ll assume that you have some sort of domain persistence mechanism from which the crawler can get information. In a later post I’ll talk about how to build that.

If you have an offline crawler, queue management really can be as simple as what I showed above. In the simplest model, the crawler is just a consumer; it never adds to a domain’s URL queue. Domains come out of the queue, go into the penalty box, and then come out of the penalty box and go back into the domain queue. When a domain’s URL queue is exhausted, that domain is discarded. Even a multi-threaded crawler of this type requires very little “smarts” in the queue management.

The offline queue builder, too, is simple in concept. Implementation can be difficult due to the huge amount of data it has to deal with, but for a small-scale (single machine) crawler, it’s quite manageable. At the base level, it’s just a huge sort/merge operation. The general algorithm is:

for each downloaded document
{
    scan document to extract URLs
    for each extracted URL
    {
        if (URL is not in index)
        {
            place URL in queue
            store URL in index
        }
    }
}
Sort URLs in queue by domain
Save sorted queue

The harder part is maintaining the index. A relational database is effective, but not terribly efficient for this. If the number of URLs you plan to save in the index is less than 100 million, you can probably load the index from disk at the start of a run and then write it back to disk when you’re done. This will eat up lots of memory, though, since the average URL length will be in the neighborhood of 80 characters. If your index is larger than available memory, you have to come up with some hybrid solution. If you’re willing to accept some small percentage of false positives (URLs that your index says were crawled, but in actuality weren’t), then a Bloom filter is a very effective and memory efficient way to manage your index. In my experience, the benefits of a Bloom filter far outweigh the drawbacks of a few false positives.
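If you go the Bloom filter route, a bare-bones version needs only a bit array and a couple of hash functions. This is a sketch rather than a tuned implementation: the double-hashing scheme (FNV-1a combined with .NET’s string hash) is an arbitrary choice, and you’d size the bit array from your expected URL count and the false-positive rate you can live with.

using System;
using System.Collections;
using System.Collections.Generic;

class UrlBloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public UrlBloomFilter(int bitCount, int hashCount)
    {
        bits = new BitArray(bitCount);
        this.hashCount = hashCount;
    }

    public void Add(string url)
    {
        foreach (int index in Indexes(url))
            bits[index] = true;
    }

    // True means "probably already seen"; false means "definitely not seen".
    public bool ProbablyContains(string url)
    {
        foreach (int index in Indexes(url))
        {
            if (!bits[index])
                return false;
        }
        return true;
    }

    // Double hashing: two hash values combined to simulate hashCount hash functions.
    private IEnumerable<int> Indexes(string url)
    {
        int h1 = url.GetHashCode();
        int h2 = Fnv1a(url);
        for (int i = 0; i < hashCount; ++i)
        {
            long combined = (long)h1 + (long)i * h2;
            yield return (int)(Math.Abs(combined) % bits.Length);
        }
    }

    private static int Fnv1a(string s)
    {
        unchecked
        {
            int hash = (int)2166136261;
            foreach (char c in s)
                hash = (hash ^ c) * 16777619;
            return hash;
        }
    }
}

At roughly ten bits per expected URL and three or four hash functions, the false-positive rate works out to somewhere around one to two percent.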

You’ll also want some way to filter URLs and domains that are unlikely to give you useful information. Remember, there are many more URLs than you’ll ever be able to crawl. In fact, the existence of mirrors, proxies, buggy automatic URL generation, and purposely-created crawler traps makes the number of URLs essentially infinite. Even without those problems (which I’ll discuss in detail in a later post), your crawler will uncover online forums, bulletin boards, social networks, and other sites that have no useful information. You don’t want to spend your precious resources looking at unproductive sites.

If all you want is to crawl sites that you’ve identified, then the problem becomes pretty easy. You just create a table that contains the sites you want to crawl (a white list), and any URL that doesn’t match an item in the table is discarded. Your table can be a list of domain names, or it can use regular expressions that are matched against each URL. The drawback to using regular expressions, of course, is that matching each URL against dozens or hundreds of regular expressions is going to be very expensive unless you use a more advanced algorithm similar to what the Unix grep program uses. Implementation is left as an exercise to the reader.
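The plain-list version, at least, is only a few lines. The domain names below are placeholders, and this matches exact host names only; you’d loosen the comparison if you want to allow subdomains.

using System;
using System.Collections.Generic;

static class UrlWhiteList
{
    // Host names we're willing to crawl; the entries here are just examples.
    private static readonly HashSet<string> Allowed = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase)
    {
        "example.com",
        "www.example.com",
        "trains.example.org"
    };

    public static bool IsAllowed(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;               // unparseable URLs are discarded
        return Allowed.Contains(uri.Host);
    }
}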

Handling redirects

There are two types of redirects. The permanent redirect, HTTP status code 301 (Moved Permanently) is supposed to be used for things that have been moved from their original location. When a Web browser encounters a 301, it caches the redirect location (the Location header that is returned with the 301 response) so that the next time you request the original location, the browser automatically requests the new location. The 302 status code (Found) is a temporary redirect. Browsers aren’t supposed to cache any information about temporary redirects.

Redirects work well for Web browsers, but they make crawling more complicated. One way to handle them is to ignore them. That is, rather than handle them specially, just make your HTTP component automatically follow redirects. If your crawler requests http://example.com/file1.html, and it redirects to http://example.com/content/coolstuff.html, then the crawler gets whatever document is stored at the redirect location. You can examine the Content-Location response header to get the URL of the actual document–the redirect destination.

While that works, it has some drawbacks. First, it violates politeness because the HTTP component will make multiple requests in quick succession. That’s not much of a problem in a single-threaded crawler, because a browser would do the same thing. But in a multi-threaded crawler it could (and often does) result in many concurrent requests to the same domain. Imagine that you have two threads. The first requests http://example1.com/file1.html and the second requests http://example2.com/file2.html. The “example” domains, though, are owned by the same company and they’ve set it up so that any request for http://exampleN.com/* redirects to http://allexamples.com/N/*. So the first thread issues a request for http://allexamples.com/1/file1.html, and the second issues a request for http://allexamples.com/2/file2.html. You now have two threads issuing concurrent requests to the same domain, which violates politeness. I can tell you from experience that this can and does happen.

Fortunately, it doesn’t happen very often. When it does happen, it’s usually on a very large site that won’t usually notice the little bit of bandwidth that your crawler consumes. In almost all cases, politeness violations caused by redirects won’t get you into trouble. But there are other reasons you don’t want to automatically follow redirects.

If you automatically follow redirects, you bypass your URL filtering rules. Going to http://example.com/badthing/ could redirect you to http://example.com/badthing/bigfile.pdf, meaning that your crawler will end up downloading a PDF document that it can’t use. Or perhaps an image, a video, a Word document, etc. Granted, your crawler has to have logic that discards unusable documents, since it’s possible that requesting an HTML file will return an image, so this isn’t actually a new problem. But all of your URL filtering is bypassed. If the redirect location is to a domain that’s in your exclusion list, you’re going to violate politeness. If the redirect location is a document that you’ve already seen, you’re going to get duplicate documents.

And if there are multiple levels of redirection, your problems are multiplied. Double redirection is common, and triple redirection isn’t exactly rare. I’ve found that anything beyond three is usually some kind of spam site or crawler trap, so I cap my redirect following to just three levels. HTTP components that allow you to automatically follow redirects typically allow you to set the maximum redirection count in order to prevent infinite redirection (also not uncommon) from putting your crawler thread into an infinite loop.

Automatically following redirects is a really bad idea, and manually following the redirects is hard. It complicates the crawler. But you have to do it if you want to crawl the Web on a large scale. You have several options when it comes to implementing support for redirected URLs. Below I’ll present some options. All but the simplest will greatly complicate the crawler code.

The first thing is to treat all redirects the same. Ignore the difference between temporary and permanent redirects. As far as the crawler is concerned, all redirects are permanent for that crawling session.

When the crawler encounters a redirect, it should mark the original URL as crawled and also store a document that contains the redirect location (the target URL) as a link. The offline queue builder program can then process that document in the normal way and the next queue segment will contain the redirect location. This approach has the benefit of being dead simple. The only problem is one of latency; you have to wait until the next queue segment is processed before the crawler downloads the target document.
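Here’s a sketch of detecting the redirect without following it, using HttpWebRequest with automatic redirection turned off. The returned target URL is what you’d put into that link-only document; StoreRedirectDocument is a hypothetical stand-in for however your crawler stores documents.

using System;
using System.Net;

// Returns the absolute redirect target if the response is a 301 or 302,
// or null if the response is not a redirect. 4xx/5xx responses will throw
// a WebException; handling of those is omitted here.
static string GetRedirectTarget(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.AllowAutoRedirect = false;              // don't follow automatically
    request.UserAgent = "MLBot (www.metadatalabs.com/mlbot)";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        int status = (int)response.StatusCode;
        if (status != 301 && status != 302)
            return null;
        string location = response.Headers["Location"];
        if (string.IsNullOrEmpty(location))
            return null;
        // The Location header may be relative; resolve it against the original URL.
        return new Uri(new Uri(url), location).AbsoluteUri;
    }
}

// In the crawler: mark the original URL as crawled, then store a document that
// contains just the target URL so the queue builder picks it up in the next segment.
// string target = GetRedirectTarget(originalUrl);
// if (target != null)
//     StoreRedirectDocument(originalUrl, target);  // hypothetical storage call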

If you want to handle the redirect within the crawler, then things get a bit more complicated. When the crawler encounters a redirect, it has to get the target URL from the Location response header. Then, it needs to filter that URL. How much filtering you do depends on how strictly you want to adhere to the politeness policy. At minimum, the crawler should check that URL against the exclusion list (see the article on Politeness) so that your crawler doesn’t hit a domain that you’ve been asked not to crawl. The crawler also needs to check the URL against the robots.txt information for the target domain. And if you don’t have a record for that domain, you need to create one, download the robots.txt information, etc. Things get complicated in a hurry. You will find that you cannot make the following work.

// BUGGY CODE - DO NOT USE
dequeue URL
request document
while (response is a redirect)
{
    filter URL
    request target URL
}

There are just too many things going on in the “filter URL” phase, and too much can go wrong. You’re better off having the crawling thread add the target URL to the target domain’s URL queue. That complicates queue management, because you then need some way to look up a domain’s information by name, and a way to synchronize access to a domain’s URL queue. Although it seems like this complicates the crawler, the truth is that you probably would have added this more advanced queue management anyway. A crawler with a live queue requires more advanced queue management, so I’ll defer detailed discussion to the next post. You don’t need all of that complexity in order to handle redirects, but you do need the basics. This is also related to the domain persistence discussion that I put off until a later post.

It’s harder than it looks

I think by now you’re getting the idea that writing an effective and efficient Web crawler is a non-trivial bit of work. What looks like a simple queue-based graph traversal problem turns out to be very complicated. Even without redirects, the simple offline crawler is difficult because of politeness concerns. And redirects make things even harder. Maintaining the queue is simplified because it’s handled by an offline process that has lots of time, memory, and disk resources.

In my next post, I’ll talk about queue management for the adaptive crawler. That’s much more complicated because the queue becomes a much more dynamic structure that has to fit into memory, and the crawler needs to update it efficiently.

Short carving notes

I haven’t given up on the carving. I’ve spent the last few weekends cutting up wood and making cutouts for those little birds. I enjoyed making those so much that I decided to send them to family and friends for Christmas. Last week I sent out a dozen, and I have more in the works.

I also liked seeing the different woods. So I announced The Hundred Birds Project. The idea is to carve at least one bird from each of 100 different types of wood. That’s two birds a week for the next year. As of today, Debra has birds from eight different types of wood, and I have cutouts for another six or eight. Then I’ll have to start looking for new woods to carve.

I’ve updated my carving video indexes. See the tabs at the top of the page.

Birds aren’t all I’m doing. I’ve carved a few little dogs recently, and I have several other projects in various stages of planning or execution: a mesquite version of Pounce the cat, an armadillo made from mesquite, and a stylized tree made from a big slab of oak.

I’m really enjoying the power carver, but haven’t given up on the knives. In fact, last night I carved one of the bird ornaments with a knife. It didn’t take as long as I thought it would. I’ll probably carve a few more that way in the next year. It’s a relaxing project, as long as the wood I select for knife carving isn’t too hard.

Three years into this carving hobby and I’m having loads of fun. Lots of good stuff planned for next year.

Writing a Web Crawler: Politeness

This is the third in a series of posts about writing a Web crawler. Read the Introduction for background and a table of contents. The previous entry is Crawling models.

When you’re writing a Web crawler, it’s important for you to understand that your crawler is using others’ resources. Whenever you download a file from somebody else’s server, you’re consuming their bandwidth and server resources. Most site operators welcome well-behaved crawlers, because those crawlers provide exposure, which means potentially more visitors. Ill-behaved crawlers are not welcomed, and often are banned. If you operate an ill-behaved crawler, not only will your crawler be banned by sites on which it misbehaves, but also on many other sites–even sites your crawler has not yet visited.

Site operators do talk amongst themselves. There are active forums on which operators can ask questions about crawlers, and on which they post reports about ill-behaved crawlers. Some operators will ban a crawler that gets one bad report.

Site operators have the right to ban any client for any reason, and the responsibility to their legitimate users to ban crawlers and other bots that interfere with normal site operation. As the author of a Web crawler, it is your responsibility to operate it within accepted community standards and to ensure that it doesn’t interfere with operation of the sites that it crawls.

I can tell you from experience that an ill-behaved crawler will be banned, the IP addresses it crawls from reported widely on Webmaster forums, and complaints about the bot will be lodged with the ISP. Repeated reports can result in your ISP terminating your service, and exceptional cases could result in civil litigation or criminal charges. It’s very difficult to repair the reputation of an ill-behaved crawler, so it’s important that your crawler adhere to community standards from the start.

Unfortunately, those community standards are at best vaguely defined. When it comes to accepted behavior there are grey areas that site operators will interpret differently. In addition, some operators aren’t familiar with the well-known community standards, or have wildly different interpretations of their meanings. Many of the rules I outline below aren’t written anywhere else that I know of, but instead have been developed by us in response to questions, comments, and complaints from site operators who have contacted us about our crawler.

If I had to sum it up in a single rule, it would be:

You are crawling at somebody else’s expense. Act accordingly.

The rules below apply to all types of bots, not just the adaptive crawlers that this series focuses on.

Identification

Web servers keep a log of every request that comes in. That log typically includes, among other things, the date and time, the IP address from which the request was made, the URL that was requested, and a user agent string that identifies the client making the request. There is nothing you can do to prevent that information from being stored in the server log. Often, a site operator will be interested to see who is making requests, and he will use the IP address and user agent strings to determine that. Browser user agent strings are pretty easy to identify. For example, here’s a list of Internet Explorer user agent strings. If an operator sees blank or cryptic user agent strings in his log, he might ban that IP address on principle, figuring it’s an ill-behaved bot or at minimum somebody trying to “hide.”

The first thing you should do is come up with a name for your crawler and make sure that that name is included in the User-Agent HTTP header with every request the crawler makes. That string should include your crawler’s name and a link to a page that contains information about the crawler. For example, our User-Agent string is “MLBot (www.metadatalabs.com/mlbot)”.
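In code, this is just a matter of setting the header on every request the crawler makes; with HttpWebRequest, for example, it might look like this:

using System.Net;

static HttpWebRequest CreateRequest(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    // Crawler name plus a link to the information page, sent with every request.
    request.UserAgent = "MLBot (www.metadatalabs.com/mlbot)";
    return request;
}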

The page that contains information about your crawler should contain, at minimum:

  • General information about your crawler. This statement should describe you or your company, the kind of information the crawler is looking for, and what you will be doing with the information. It’s a very good idea to mention if your crawler is part of a school project, because many operators will give more leeway and can offer helpful suggestions.
  • If you think that your product will benefit the people whose sites you’re crawling, you should mention those benefits. For example, “I am building a comprehensive index of information about steam trains. When published, this index will drive many users who are interested in steam trains to your site.”
  • List the IP addresses that your crawler operates from. If you have static IP addresses, this is very easy to do. If you’re crawling from home or from some other location that has a dynamic IP address, then you should update this list as often as you can to show the current IP addresses. It is also a good idea to sign up with a dynamic DNS service and list the host name that will resolve to your server. Interested parties can then use NSLOOKUP to resolve that name to your current IP address.
  • You should mention that your crawler respects the robots.txt standard, including any extensions. See robots.txt, below.
  • Include information about how to block your crawler using robots.txt. For example, our FAQ page shows this:
    User-agent: MLBot
    Disallow: *

    You might also show examples of blocking specific directories, and you should certainly include a link to the robots.txt page.

  • Include an email address or form that people can use to contact you about your crawler.

It’s surprising how effective that page can be. Site operators are rightly suspicious of random crawlers that have no supporting information. By posting a page that describes your crawler, you rise above the majority of small-scale bots that continually plague their servers. I won’t say that operators won’t still look at you with suspicion, but they won’t typically block you out of hand. You might still be banned, especially if your crawler is ill-behaved, but the information page gives your crawler an air of legitimacy that it wouldn’t otherwise have.

I know how hard it can be to tear yourself away from working on your crawler, especially if you find writing difficult. But you should post about your crawler from time to time on your blog or FAQ site. Post about new additions, about the kinds of things you’re finding, and the interesting things you’re doing with the data you collect. If a visitor sees recent posts, he knows that you’re still actively working on your project. This isn’t directly related to politeness, but it does help improve your image.

It also helps to keep up on the latest developments in the world of search engine optimization, public sentiment about crawlers, and what other crawlers are doing. Accepted standards change over time. What was acceptable from a crawler a few years ago might now be considered impolite. It’s important that your crawler’s behavior reflect current standards.

robots.txt

The Robots Exclusion Standard is a consensus agreement that has achieved the status of an informal standard. Unfortunately, it’s never been approved as a “real” standard in the way that HTTP, SMTP, and other common Internet protocols are. As a result, there are many extensions to the standard, some contradictory. In 2008, Google, Microsoft, and Yahoo agreed on a set of common extensions to robots.txt, and many less visible crawlers followed suit. If you follow the lead of the major search engine crawlers, nobody can fault you for your handling of robots.txt. See Major search engines support robots.txt standard for more information.

At minimum, your crawler must support proper handling of the Disallow directive. Anything beyond that is optional, but helpful. You should support Allow, Sitemaps if it’s relevant to you, and wildcards. Some of the major crawlers support the Crawl-delay directive, as well, and if you can do it you should.

Related to robots.txt are the HTML Meta tags that some authors use to prevent crawling or indexing of particular pages. Some authors do not have access to their robots.txt file, so their only option is to use Meta tags in the HTML. If you are crawling or indexing HTML documents, you should support these tags.

Proper handling of robots.txt can be kind of confusing. I’ve written about this in the past. See Struggling with the Robots Exclusion Standard and More On Robots Exclusion for more information. In addition, Google’s robots.txt page has some good description of how robots.txt should work.

There are many robots.txt parser implementations available in different languages. A search for “robots.txt parser” should give you plenty of hits for whatever language you’re working in. You’re much better off using one that somebody else wrote rather than trying to implement it yourself.

You’ll want to cache the robots.txt files that you download, so that you don’t have to request robots.txt every time you request a URL from a domain. In general, a 24 hour cache time is acceptable. Some large crawlers say that they cache robots.txt for a week! I’ve found that I get very few complaints with a cache time of 24 hours. A simple MRU caching scheme with a buffer that holds one million robots.txt files works well for us. We crawl perhaps 40 million URLs per day. Your cache size will depend on how many different domains you typically crawl per day.
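Here’s a sketch of that kind of cache: a dictionary for lookup plus a linked list for recency, evicting the least recently used entry when full and treating anything older than 24 hours as expired. It stores the raw robots.txt text; in practice you’d probably cache the parsed form, and you’d wrap the calls in a lock if several threads share the cache.

using System;
using System.Collections.Generic;

class RobotsCache
{
    class Entry
    {
        public string Domain;
        public string RobotsText;       // raw robots.txt; you'd likely cache the parsed form
        public DateTime Downloaded;
    }

    private readonly int capacity;
    private static readonly TimeSpan MaxAge = TimeSpan.FromHours(24);
    private readonly Dictionary<string, LinkedListNode<Entry>> index =
        new Dictionary<string, LinkedListNode<Entry>>(StringComparer.OrdinalIgnoreCase);
    private readonly LinkedList<Entry> recency = new LinkedList<Entry>();

    public RobotsCache(int capacity)
    {
        this.capacity = capacity;
    }

    // Returns the cached robots.txt, or null if missing or more than 24 hours old.
    public string Get(string domain)
    {
        LinkedListNode<Entry> node;
        if (!index.TryGetValue(domain, out node))
            return null;
        if (DateTime.UtcNow - node.Value.Downloaded > MaxAge)
            return null;                        // expired; caller downloads it again
        recency.Remove(node);                   // mark as most recently used
        recency.AddFirst(node);
        return node.Value.RobotsText;
    }

    public void Put(string domain, string robotsText)
    {
        LinkedListNode<Entry> existing;
        if (index.TryGetValue(domain, out existing))
        {
            recency.Remove(existing);
        }
        else if (index.Count >= capacity)
        {
            index.Remove(recency.Last.Value.Domain);    // evict least recently used
            recency.RemoveLast();
        }
        var node = new LinkedListNode<Entry>(new Entry
        {
            Domain = domain,
            RobotsText = robotsText,
            Downloaded = DateTime.UtcNow
        });
        recency.AddFirst(node);
        index[domain] = node;
    }
}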

A common mistake, by the way, is for a site to report the file type of robots.txt as “text/html”. A strict interpretation of the rules says that you don’t have to recognize such a file. I’ve found that those files typically are plain text (i.e. they’re formatted correctly), but have the wrong MIME type attached to them. Our crawler will try to parse anything that calls itself robots.txt, regardless of the MIME type.

Crawling frequency

A single computer, even operating on a cable modem, can cause significant disruption to a Web server if it makes requests too often. If you have a multi-threaded crawler and a whole bunch of URLs from a single site, you might be tempted to request all of those URLs at once. Your cable modem can probably handle you making 30 concurrent requests to a server. And the server can probably handle you doing that … once. If your crawler is making many concurrent requests over a long period of time, it will impact the server. And when the owner figures out why his server stopped responding to customer requests, your bot will be blocked and possibly reported to your ISP as taking part in a denial of service attack. You must limit the rate at which your crawler makes requests to individual sites.

I noted above that some crawlers support the Crawl-delay directive in robots.txt. That directive lists the number of seconds your bot should delay between requests to the site. Not all crawlers support that directive, and relatively few robots.txt files that I’ve seen actually contain the directive. That said, you should support Crawl-delay if you can. But it can’t be your only means of limiting your crawler’s request rate. A technique that works well for us is to keep an average of the server’s response time over the last few requests, and then use a multiplier to come up with a delay time.

For example, say that the last five requests to the example.com domain averaged 1,500 milliseconds each, and you use a multiplier of 30. In that case, you would hit example.com once every 45 seconds. You can use a smaller multiplier, but I would caution against hitting any site more frequently than once every 15 seconds. Some larger sites will allow that, but many smaller sites would not be happy about it. It also depends on how many total requests you’re making. Hitting a site four times a minute for two minutes probably won’t raise any red flags. If you do it all day long, they’re going to wonder why you’re crawling them so often.
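Here’s a sketch of that calculation: keep the last five response times for the domain, average them, apply the multiplier, and never go below a 15-second floor. Folding in the robots.txt Crawl-delay is my own addition here; the numbers match the example above.

using System;
using System.Collections.Generic;

class DomainTiming
{
    private const int SamplesToKeep = 5;
    private const double Multiplier = 30.0;
    private static readonly TimeSpan MinimumDelay = TimeSpan.FromSeconds(15);

    private readonly Queue<double> lastResponseMs = new Queue<double>();

    // Record how long the most recent request/response for this domain took.
    public void AddSample(double responseMs)
    {
        lastResponseMs.Enqueue(responseMs);
        if (lastResponseMs.Count > SamplesToKeep)
            lastResponseMs.Dequeue();
    }

    // How long to wait before the next request to this domain.
    public TimeSpan NextDelay(TimeSpan? crawlDelay)
    {
        double averageMs = 0;
        foreach (double ms in lastResponseMs)
            averageMs += ms;
        if (lastResponseMs.Count > 0)
            averageMs /= lastResponseMs.Count;

        TimeSpan computed = TimeSpan.FromMilliseconds(averageMs * Multiplier);
        if (crawlDelay.HasValue && crawlDelay.Value > computed)
            computed = crawlDelay.Value;        // honor robots.txt Crawl-delay if larger
        return computed > MinimumDelay ? computed : MinimumDelay;
    }
}

With the example numbers above (an average of 1,500 ms and a multiplier of 30), NextDelay comes out to 45 seconds.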

In our crawler, the politeness delay is integrated with the queue management. I’ll discuss it in more detail when I cover queue management in a later post.

Take only what you need

To a crawler, the Web is an infinite free buffet. But as I said previously, it’s not really free. Every document you download requires a server’s bandwidth and resources. If you go to an all-you-can-eat place, load your plate with food, eat just the good stuff, and then go back for seconds, the manager will probably kick you out. The rule at the buffet is “Take only what you need, and eat all you take.” The rule is the same when crawling the Web. If you download a bunch of stuff that you don’t need, consuming somebody else’s server resources only to discard the documents, they’re likely to block you. Take only what you can use.

This is one rule that not only makes your crawler more polite, but also makes it more efficient.

An early version of my crawler was looking for links to MP3 files in HTML documents. The crawler’s general algorithm was this:

Download file
If it's an HTML file
    parse the HTML for links
    queue new links
else if it's an MP3 file
    extract metadata
    store link and metadata
else
    discard the document

That works well, but the program downloaded a huge number of documents that it would never be able to use! For example, about 20% of the documents it was downloading were image files! The crawler couldn’t do anything with a .gif or a .jpeg, but it was downloading them nonetheless. It was an incredible waste of others’ server resources and our bandwidth. By modifying the crawler so that it checked for common file extensions before queuing a URL, we increased our productivity by over 25%, and also made the crawler more polite. We were no longer taking more than we could use.

There’s no guarantee, of course, that a URL ending in “.jpg” is an image. There’s a high likelihood, though. Our experiments at the time showed that if a URL ended in “.jpg”, “.jpeg”, “.bmp”, or “.gif” (among many others), the probability of it being an image file was above 98%. Of the remaining two percent, only a very small fraction were HTML files, and none were audio files. Blocking those extensions cost us nothing and gave a huge benefit.

Images aren’t the only files that most crawlers will find useless. If you’re looking for HTML documents, then you can safely ignore URLs with extensions that are common to images, audio, video, compressed archives, PDFs, executables, and many more. I don’t have an exhaustive list of extensions that you should ignore, but the following is a good start:

".asx",     // Windows video
".bmp",     // bitmap image
".css",     // Cascading Style Sheet
".doc",     // Microsoft Word (mostly)
".docx",    // Microsoft Word
".flv",     // Old Flash video format
".gif",     // GIF image
".jpeg",    // JPEG image
".jpg",     // JPEG image
".mid",     // MIDI file
".mov",     // Quicktime movie
".mp3",     // MP3 audio
".ogg",     // .ogg format media
".pdf",     // PDF files
".png",     // image
".ppt",     // powerpoint
".ra",      // real media
".ram",     // real media
".rm",      // real media
".swf",     // Flash files
".txt",     // plain text
".wav",     // WAV format sound
".wma",     // Windows media audio
".wmv",     // Windows media video
".xml",     // XML files
".zip",     // ZIP files
".m4a",     // MP4 audio
".m4v",     // MP4 video
".mov",     // Quicktime movie
".mp4",     // MP4 video or audio
".m4b",	    // MP4 video or audio

If your crawler derives information from some of those file types, then of course you shouldn’t block them.
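Checking a candidate URL against that list is cheap: put the extensions in a hash set and test the URL’s extension before queuing it. Here’s a sketch (the set below is abbreviated; fill it from the full list above):

using System;
using System.Collections.Generic;

static class ExtensionFilter
{
    // Abbreviated here; populate it from the full list above.
    private static readonly HashSet<string> Blocked = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase)
    {
        ".bmp", ".gif", ".jpeg", ".jpg", ".png",
        ".mp3", ".mp4", ".wav", ".wmv", ".mov",
        ".pdf", ".swf", ".zip"
    };

    public static bool LooksUseful(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;                       // can't parse it; don't queue it

        string path = uri.AbsolutePath;
        int dot = path.LastIndexOf('.');
        int slash = path.LastIndexOf('/');
        string extension = dot > slash ? path.Substring(dot) : "";

        return !Blocked.Contains(extension);    // no extension comes back as ""
    }
}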

I don’t know what the distribution of files is on the Web these days. When I created that list four or five years ago, those extensions made up well over 25% of the files my crawler was encountering. Today, the crawler’s mix is something like 91% HTML files, 5% files with no extension (i.e. http://example.com/), 3% video, and every other file type is less than 0.01%. I’ve blocked lots of extensions other than those I identified above, but together they made up less than 1% of the total number of URLs I encountered.

I determined the extensions to block by logging (see below) every request the crawler makes, along with the HTTP headers that it gets back in the response. By correlating the file extension with the Content-Type header, I was able to get a list of extensions that rarely (if ever) lead to a file that I’m interested in.

Preventing your crawler from downloading stuff it can’t use makes you look more professional in the eyes of the people who own the servers you’re visiting, and also makes your crawler more efficient. You have to parse URLs anyway (another subject I’ll get to at some point), so checking for common useless extensions isn’t going to cost you significant processor time. Eliminating those useless files early makes queue management easier (less churn in the queue), and prevents you from wasting your precious bandwidth on things you can’t use.

Maintain an excluded domains file

There are some domains that, for various reasons, you don’t want your crawler to access. When we were crawling for audio files, for example, we encountered a lot of sites that had pirated music or that contained nothing but 30-second music samples. We didn’t want those files, but as far as the crawler and the ML system were concerned, those sites were gold. They had nuggets galore. Our only way to block them was to add them as exclusions.

Another reason you might want to block a site is in response to a complaint. Some site operators do not want you to crawl, period, and do not have access to robots.txt or the htaccess file in order to block your crawler. If somebody doesn’t want you on their site, you should honor their wishes.

You can go a long way with a simple text file that lists one domain per line, and perhaps a comment that says why you blocked it and when. Here’s a sample:

# sites that have requested that we not crawl
site1.com
site93.com
blog.somebody.net

# link farms that return nothing good
spamsiteA.com
spamsiteB.info

We created a file called NeverCrawl.txt in a common location. The crawler checks the last modified date on that file every 15 minutes or so. If the last modified date has changed, the crawler reads and parses the file, adding the entries to a hash table that’s checked whenever a URL is extracted from a Web page. The beauty of this solution is that it’s simple to add an exclusion (just open the text file, add the domain, and save the file), and the file format is easy to parse. To my knowledge this exclusion list has never failed. It’s also very nice being able to say to somebody, “I have instructed the crawler never to visit your site again. That exclusion should take effect within the next half hour.”
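
The mechanism is simple enough that a sketch fits in a few lines of C#. This is just an illustration of the approach, with made-up names; it is not the crawler’s actual code.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ExclusionList
{
    private readonly string _path;
    private DateTime _lastWrite = DateTime.MinValue;
    private HashSet<string> _domains = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    public ExclusionList(string path) { _path = path; }

    // Called periodically (every 15 minutes or so) by the crawler.
    public void ReloadIfChanged()
    {
        DateTime lastWrite = File.GetLastWriteTimeUtc(_path);
        if (lastWrite == _lastWrite) return;

        // One domain per line; skip blank lines and "#" comments.
        var domains = File.ReadLines(_path)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0 && !line.StartsWith("#"));
        _domains = new HashSet<string>(domains, StringComparer.OrdinalIgnoreCase);
        _lastWrite = lastWrite;
    }

    // Checked whenever a URL is extracted from a Web page. Note that this
    // matches the exact host; blocking subdomains as well takes a bit more work.
    public bool IsExcluded(Uri url)
    {
        return _domains.Contains(url.Host);
    }
}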

It would be nice, at times, to have a richer syntax for specifying which sites to block, but the cost of doing so is pretty high in comparison to its benefit. We could add regular expression support so that we could block any URL that matches a particular query string syntax, regardless of domain. That would catch Web proxies, for example, and many link farm sites. But it complicates the parsing code and could affect reliability. It’s certainly something we’d like to do at some point, but for now the cost outweighs the benefit.

Don’t go overboard on trying to identify sites to block. I know from experience that you could spend the rest of your days adding things to your exclusion file. There are approximately 100 million registered domains that have at least one viable Web page. Add to that the subdomains and the number probably approaches a billion. The vast majority of those sites don’t have anything of interest to you on them. But trying to identify and list them in a file would be impossible. Besides, if you did list 90 million sites, you wouldn’t be able to keep the exclusion list in memory.

You should depend on your ML system to keep you away from unproductive sites. Use the exclusion list to keep you away from things that the ML system thinks are productive, and from sites that you have been asked not to crawl.

Log everything

Your crawler is going to make a lot of requests. Even if you’re running a single-threaded bot that’s making an average of one request per second, you’re going to make more than 85,000 requests per day. You want to have a log of every request for two reasons:

  1. You will receive a complaint about your crawler’s behavior. The person complaining often will not supply dates and times, and if he does you’ll want to verify that it really was your crawler. It might be some other crawler that just happens to have the same name as yours, or it might be somebody spoofing your bot.
  2. You’ll want to analyze the logs to find link farms, crawler traps, and other unproductive sites, to track down errors, and to identify file extensions that are (almost) always unproductive.

You should log as much as you can with every request. At minimum you should save the date and time of the request, the request URL, the server’s HTTP status code (200, 404, 302, etc.), the Content-Type, Content-Length, the Location header (for redirects), the response URI (the address of the page that responded), and the total amount of time required for the request and response. Optionally, you can save the actual response (the HTML file received, for example), although depending on your process you might want to save that elsewhere.
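
As a sketch of what that minimum might look like (the field names are mine, not the actual log format):

// One record per request. Everything here comes straight from the request or
// the response; the serialization format (tab-separated, JSON, etc.) doesn't
// much matter as long as it's easy to parse later.
class RequestLogEntry
{
    public DateTime RequestTimeUtc;
    public string RequestUrl;
    public int StatusCode;         // 200, 404, 302, etc.
    public string ContentType;
    public long ContentLength;
    public string LocationHeader;  // for redirects
    public string ResponseUri;     // address of the page that actually responded
    public TimeSpan TotalTime;     // request sent to response completed
}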

I’ve found it very helpful to log the entire contents of any robots.txt file that I download. I’ve been able to solve some curious crawler behavior by looking at the logs to see when the crawler last downloaded robots.txt and what the file said at the time. I’ve also been able to help some operators who had bad syntax in their robots.txt, or had saved it as a .doc or .html file.

A log is essential for debugging, performance analysis, and responding to customer complaints. Keep your logs for a month. I’ve rarely had to go back more than a week or two in order to resolve a complaint, but a month gives a good buffer. And analysis done on a full month of logs can provide better data than just a day’s or a week’s worth of info. If space is a problem, compress the logs. They typically compress at the rate of five or ten to one.

Responding to complaints

Regardless of how polite your crawler is, you will receive complaints. Most of the complaints will be from reasonable people who are more curious than angry, but from time to time you will receive a scathing letter from somebody who seems completely unreasonable. Do not take an aggressive or confrontational tone with anybody who complains about your crawler’s behavior. Every complaint, regardless of its tone, is best handled by a friendly reply that thanks the person for the problem report, apologizes for problems, expresses concern at the perceived misbehavior, and asks for more information if you need it in order to research the problem. Here’s an example:

Dear Mr. Webmaster:

Thank you for contacting me about my Web crawler. I make a concerted effort to ensure that the crawler operates within accepted community standards, and I apologize if it caused problems on your site.

I would like to research this matter to discover the source of the problem, which could very well be a bug in the crawler. Can you please send me the domain name for your site, and the approximate date and time from your Web server’s logs? It would be best if I could see the exact entry from your server log. That way I can match up the crawler’s logs with your information and more reliably discover the source of the problem.

Thank you again for notifying me of this problem. I will contact you again when I have discovered the source of the problem.

For more information about my crawler and what I’m doing with the information the crawler collects, please visit my information page at http://example.com/crawlerinfo.html

You would be surprised at how effective such a simple response can be. We have received some strongly worded complaints in the past that evoked pictures of slavering monsters out for blood. A response similar to the one above resulted in a letter of apology, some help diagnosing a problem (which turned out to be a misunderstanding on the part of the site operator), and a recommendation from that person on a Webmaster forum, saying that people should allow our bot to crawl. If we had responded to his initial complaint in kind, we would have made an enemy who very likely would have posted negative information about us. Instead, this person became an advocate and publicly praised us for the way we operate. It’s difficult to imagine a better outcome.

After the initial response, research the matter thoroughly and report your findings to the person who complained. If it was a bug in your crawler, admit it. Furthermore, post a blog entry or note on your crawler’s site that explains the error, how it manifests, how long it was going on, and when it was fixed. If you can’t fix the problem immediately, post information about when it will be fixed and how to avoid it if at all possible. It pays to be up front about your crawler’s bugs. If people think you’re being open and honest, they will be much more likely to forgive the occasional error.

But know that you can’t please everybody. There will be people who, for whatever reason, will not accept your explanation. That’s what the exclusion list is for. If you’re unable to have a productive conversation with somebody who complains, apologize for the problem, add their domain to your exclusion list, and let them know that your crawler will not bother them again. There’s just nothing to be gained by arguing with an unreasonable person. Cut your losses and move on.

It’s also a good idea to monitor Webmaster forums for complaints about your crawler. You can set up a Google alert that will notify you whenever Google’s crawler encounters a new posting about your crawler. It’s a good way to gauge your crawler’s reputation, and a way that you can respond in public quickly to any complaints. The rules above still apply, perhaps even more strongly. Be courteous and helpful. Flaming somebody in a public forum will not improve your crawler’s reputation.

I covered a lot of ground in this posting. Politeness, in my opinion, is the most important issue you face when operating a Web crawler. Everything your crawler does will be controlled by the politeness policy. There are plenty of technical hurdles, but if your crawler gets a bad reputation you will find it blocked by a large number of sites, including many that have valuable information that you’re interested in. No amount of technical wizardry will allow you to get past those blocks, and you will find it very difficult to repair a bad reputation.

Design in politeness from the start. You’ll regret it if you don’t.

This isn’t the last word on politeness. Many of the issues I’ll cover later in this series have politeness ramifications. One example is the DUST (different URL, similar text) problem: preventing your crawler from visiting the same page repeatedly, using a different URL each time.

In my next installment, I’ll begin talking about queue management, which is the most technically challenging part of an adaptive crawler. It’s also related to politeness in many ways.

The death of Kim Jong-il

The big news story today was the death of Kim Jong-il, leader of North Korea. Of all the commentary I heard, one idea struck me as rather funny in a “you’re more right than you realize” kind of way. I don’t recall who it was that NPR interviewed, but his comments were to the effect that the North Korean people either wouldn’t believe that the West was supplying food aid, or they would say that the aid was “tribute” demanded by the Dear Leader in lieu of him crushing us with his army.

Regardless of whether the commenter’s characterization of the North Korean people’s intelligence is correct (I somehow doubt they’re that stupid), the “tribute” comment is more correct than perhaps he meant it to be.

North Korea has a huge military–about 1.1 million armed personnel, and a reserve force of about 7.5 million. They have modern weapons, and from all reports have enough food, fuel, and ammunition to wage war for perhaps 100 days. I suspect that 100 days is an over-estimate, but even if it were just 50 days, they could do a lot of damage. It’s doubtful, though, that they could seize, occupy, and hold much territory for very long.

North Korea does, however, have nuclear weapons. Not that I think they’d use them. At least, Kim Jong-il was smart enough not to. He didn’t give a rip about his own people, but from all reports he was rather fond of his own skin. He sure didn’t want to end up like Adolf Hitler, Nicolae Ceaușescu, Saddam Hussein, Kadafi, or any other national leader whose life ended with him being hunted down and executed by major powers or by his own people.

Kim Jong-il wouldn’t actually use his nuclear weapons (he was a lot of things, but never stupid), but he had no problems selling his nuclear technology, or using his nuclear capabilities as a bargaining chip. All he had to do was rattle his nuclear saber a bit and make vague noises about “talks.” The U.S., China, Russia, Japan, South Korea, and the U.N. would all bend over backwards to meet his demands for food aid or other concessions, just to get him to the bargaining table. And, like Lucy pulling the ball away after convincing Charlie Brown that “this time is different,” Dear Leader would invariably use some excuse–any excuse–to delay or halt talks after he got what he wanted. And, just like Charlie Brown, the world powers would fall for it every time.

Tribute? Perhaps not in the way that the term is typically used, but it amounted to the same thing. Kim Jong-il threw a tantrum and the world gave him whatever he wanted. You’ve seen the same kind of thing happen in the toy aisle of Wal-Mart a hundred times. The man was a tyrant and a master manipulator who seemed to get a real thrill out of making the rest of the world dance to his ridiculous tune.

We are well shut of Kim Jong-il. For reasons that escape me, the world kissed his ass to keep him quiet. I’ll never understand why we didn’t just ignore him. Why China kept propping him up is another mystery. I can understand South Korea treating him with kid gloves, because they didn’t want a war and because there are many South Koreans with family in the North. They don’t want another war where they might be shooting at their own kin.

Even though Kim Jong-il acted like a spoiled child, he was a known quantity. He was predictable within a fairly narrow range. We knew how to deal with him. We don’t know what form the new leadership will take in North Korea. Will Kim Jong-un take over? Will he become a figurehead who’s controlled by the old guard? That’s perhaps the scariest possibility: a young and inexperienced figurehead being controlled by the hard line old guard who have dreams of conquest. It would be enough to plunge the entire region into a war. Perhaps the entire world if China got involved on the side of the tyrants.

Let’s hope that cooler heads prevail.

Writing a Web Crawler: Crawling Models

This is the second post in a series of posts about writing a Web crawler. Read the Introduction to get the background information.

Expectations

I failed to set expectations in the Introduction, which might have misled some readers to believe that I will be presenting a fully-coded, working Web crawler. That is not the case. I will be discussing the issues in detail, and will certainly include pseudo code for some of the tasks or algorithms. It is decidedly not my intention to design, code, test, and debug a fully working Web crawler. The focus in this series is on the issues involved and possible solutions, but it will stop short of actual implementations.

The reason for that is simple: the issues are more important than the code. I wrote my large-scale Web crawler the hard way. I started with a naive approach to crawling, encountered problems, solved them, found more problems, solved them, etc. It was a fun, but frustrating and terribly inefficient way to create a Web crawler. It involved rewriting large portions of the crawler many times. It resulted in some messy code and, worse, some high level design decisions that can’t be changed without a major refactoring. Had I known the issues ahead of time, I would have designed the crawler differently and avoided a lot of pain.

I would like to save you that pain.

On to crawling

The high-level view of a Web crawler’s operation is very simple. In pseudo code, it looks like this.

queue = LoadSeed();
while (queue is not empty)
{
    dequeue url
    request document
    parse document for links and add to queue
}

Admittedly, that’s a very simplistic view of how a crawler works, but it’s essentially correct. There are, however, a number of details that aren’t included. The Web, although finite, is rather large: on the order of 100 billion documents. The number of URIs pointing to those documents, though, is much larger–way too many to count. It’s difficult to say how many pages are actually indexed by major search engines. I’ve seen estimates that anywhere from 20 percent to 60 percent of the visible Web is indexed by Google, Bing, etc. The “visible” Web includes only those pages that are reachable through a simple URL without having to log in and without having to execute JavaScript, etc. It’s unclear if the 100 billion number includes images and other downloadable files, or if it’s just HTML pages.

Whatever the case, the size of the Web is staggering–large enough that the queue would continually grow and your crawler would never stop. This is especially true if you’re running a crawler on a small scale (one machine crawling from a cable modem). It takes a huge amount of space just to store the URLs that you’re going to crawl.

Another detail missing from the pseudo code above is what to do with the documents. The algorithm just extracts and queues links. A real crawler is crawling for a purpose: typically to do something with the documents that it downloads. A search engine crawler, for example, processes the documents and builds an inverted index that can be searched very quickly to identify documents that contain requested keywords. What really happens is that the document is stored in some repository, and a separate process comes along later to read the document repository and create the index.

Note that this series will pointedly not foray into the world of creating a search engine. There is another entire field of technology involved in storing, indexing, and retrieving documents. My focus is on the crawler, which finds the documents.

The final thing missing from the simplified algorithm shown above is some means of preventing the crawler from requesting the same document multiple times. This turns out to be a very hard problem, and one I’ll cover in much detail in a later post. For now, we’ll limit our definition of “same document” to “same URI”. That is, once the crawler has downloaded the page from http://www.mischel.com/index.html, it shouldn’t go to that document again within the crawling period. I’ll define “crawling period” below.

A more realistic (although not quite real) high level view of the crawling process, then, looks like this:

queue = LoadSeed();
while (queue is not empty)
{
    dequeue url
    request document
    store document for later processing
    parse document for links
    add unseen links to queue
}
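
For a small crawl that fits in memory, that loop translates almost directly into code. Here’s a toy C# sketch; LoadSeed, StoreDocument, and ExtractLinks are placeholders, and, as the next few paragraphs explain, neither the queue nor the “seen” set will fit in memory at Web scale.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

static async Task CrawlAsync()
{
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    var queue = new Queue<Uri>(LoadSeed());

    // No error handling, politeness, or robots.txt here; those come later in the series.
    using (var client = new HttpClient())
    {
        while (queue.Count > 0)
        {
            Uri url = queue.Dequeue();
            string html = await client.GetStringAsync(url);   // request document
            StoreDocument(url, html);                          // store for later processing

            foreach (Uri link in ExtractLinks(url, html))      // parse document for links
            {
                if (seen.Add(link.AbsoluteUri))                // true only for unseen links
                    queue.Enqueue(link);
            }
        }
    }
}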

As I mentioned above, the queue size can get unmanageably large. If you start your crawler with a very small seed that contains portal sites like Yahoo and MSN, and perhaps a few popular blogging sites like WordPress and LiveJournal, your queue is going to grow very quickly. Our experience has been that, on average, an HTML page will contain about 100 links. That, to me, was an astonishingly high number. It turns out that, again on average, only 10 percent of those links will be new. That is, of the 100 links you’ll scrape from a page, only 10 of them will be new links that you haven’t seen yet. That’s a big reduction, but it’s still a lot. For every URL you take out of the queue, you’ll put 10 back in. If you crawl one page per second, you’ll have almost 900,000 URLs in your queue at the end of one day.

One page per second, by the way, is about what you can expect from a very simple single-threaded crawler over the long term. Crawling speed isn’t particularly relevant for the current discussion, so I’ll defer detailed coverage of that topic to a later posting.

It should be obvious that the queue will quickly grow beyond memory capacity, so you need some kind of backing store for it. Not only do you need to maintain the URLs that will be crawled, you also have to maintain a list of the URLs you’ve already crawled, so you don’t crawl them again. You might think of using a database for that, but you’ll quickly find that you need a very fast database cluster to handle all the lookups you’d be making. At one page per second, you’d be querying the database 100 times per second just to see if the extracted URLs are new. When you get your crawler doing 30 pages per second (easily achievable on a single machine and a cable modem), you’ll be doing 3,000 lookups, 300 insertions, and 30 updates per second. That’s going to be beyond the capacity of a simple database server.

Another problem with a “live” queue is that URLs are inserted in essentially random order. There will be blocks of URLs for the same site, but links to the mischel.com domain, for example, might appear in multiple places within the queue. There are certain benefits to sorting links by domain and crawling all of the links for a single domain, one right after the other. You can avoid the cost of repeated DNS lookups, and also get a “snapshot” of a single domain over a short period of time rather than getting some now and some later.

For those reasons, most traditional crawlers use what we’ve come to call an “offline” model. The crawler starts crawling with a queue (sometimes called a segment) that’s ordered in some fashion (typically by domain name), and as it crawls pages it just stores them. A separate process reads the stored pages, extracts links, decides which links need to be crawled, sorts them by domain, and creates a new segment. When the crawler finishes with its initial segment, it grabs a new segment and starts crawling again.
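
A rough sketch of that segment-building step, just to show its shape (the method and parameter names are mine):

using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<Uri> BuildSegment(IEnumerable<Uri> extractedLinks, ISet<string> alreadyCrawled)
{
    // Drop URLs we've already crawled, then order by domain so the crawler
    // visits all of a domain's URLs together (one DNS lookup, one "snapshot").
    return extractedLinks
        .Where(u => !alreadyCrawled.Contains(u.AbsoluteUri))
        .OrderBy(u => u.Host, StringComparer.OrdinalIgnoreCase)
        .ThenBy(u => u.AbsoluteUri, StringComparer.Ordinal);
}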

The crawler’s code, then, becomes much simpler:

while (segment is available)
{
    load segment into queue
    while (queue is not empty)
    {
        dequeue url
        request document
        store document
    }
}

One benefit of this model is that the crawler requires almost no CPU resources. All it’s doing is downloading pages and storing them in some kind of repository. It might be possible to have both the crawler and the URL extractor running at the same time on a single machine. If that’s possible, you should do it. Otherwise you’ll have to use the “crawl-process-crawl” model. That is, you create the initial segment and let the crawler complete that one segment. When the crawler is done, you harvest the links to create another segment, start the crawler again, etc. It seems clunky, but it’s quite effective. The major drawback to the crawl-process-crawl model is that you spend a lot of time not crawling, which means that your Internet connection is idle. If you’re paying big dollars for a fat pipe, that idle time is money down the drain.

The offline model is very effective for many types of crawling. If you want to exhaustively crawl a select number of sites (or the entire Internet), it works quite well. In general, if your crawler is essentially a very fast downloader that just stores whatever it finds, then this model is the way to go.

Adaptive crawling

The “secret” to doing Web-scale crawling at cable modem speeds is to limit what you crawl to just the important stuff–things that are related to whatever you’ve determined is relevant. And that requires a crawler that can react in real time to the information that it’s seeing. It has to be selective about what it downloads. This implies a live queue, a way to prioritize the URLs it encounters, and the ability to prune the queue when it becomes too large. Such a crawler is a somewhat more complicated thing to write, but it’s amazingly effective. If you’re looking for specific types of documents or documents about particular topics, then an adaptive crawler can increase the number of relevant documents you find in a given period of time.

Before I get to the numbers, let’s talk about how an adaptive crawler works. The general algorithm is:

queue = LoadSeed()
while (queue is not empty)
{
    dequeue url
    request document
    store document

    // process document
    analyze document to determine if it's relevant
    update prioritizer with information about this document

    // extract and queue links
    parse document for links
    eliminate duplicate links
    prioritize links
    add prioritized links to queue
}

Note that I show storing the document, even if it isn’t relevant. Depending on your requirements, that might not be necessary. Our crawler, for example, doesn’t store HTML pages or other documents that aren’t videos. Even when it encounters a video, the crawler only stores metadata–information about the video like the title, the author, length, etc.

In addition to the above, an adaptive crawler typically has a background thread that continually goes through the queue, re-prioritizing links based on the ever-changing information that the crawler is collecting from analyzing the documents that it finds. When the queue fills up, the lowest-priority URLs are simply removed. If they’re important (i.e. contain relevant information), the crawler will see them again.

We’ve found that re-prioritizing links in this way reduces overtraining, and also allows the crawler to adapt when it finds that a previously unproductive path becomes productive. And, conversely, to stop crawling a particular path when that path becomes unproductive.

As you can see, an adaptive crawler is quite a bit more complicated than a general-coverage crawler. It needs some way to analyze a document to derive a relevance score (what we call a “nugget function”), and some type of machine learning (ML) system that can continually learn and then prioritize URLs based on what it’s learned. Not only is this more complicated to build, it puts a much heavier load on the crawler’s memory and CPU resources.

The nature of the nugget function and the ML system will remain unspecified at this point. In later posts, I’ll discuss ways to determine a document’s relevance, and how to build a simple ML system. For now, we can treat those as black boxes.

With this model, the crawler can zero in on relevant information very quickly. If your nugget function returns true when the document is relevant and false when the document is not relevant, the ML system can reward or punish the path that led to the document. The ML system quickly learns that links from site A lead to nuggets 50% of the time, links from site B lead to nuggets only 2% of the time, and links from site C never lead to nuggets. The prioritizer can then prioritize links so that those from site A are crawled before links from site B, and links from site C aren’t crawled unless there’s nothing else to do.
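
Here’s a deliberately simplified C# sketch of a prioritizer that scores links by the observed nugget rate of the source host. It is not our ML system; it’s just the smallest thing that illustrates the reward-and-punish idea.

using System;
using System.Collections.Generic;

class HostStats { public long Pages; public long Nuggets; }

class SimplePrioritizer
{
    private readonly Dictionary<string, HostStats> _stats =
        new Dictionary<string, HostStats>(StringComparer.OrdinalIgnoreCase);

    // Called after the nugget function has scored a downloaded document.
    public void RecordResult(Uri page, bool wasNugget)
    {
        if (!_stats.TryGetValue(page.Host, out HostStats stats))
            _stats[page.Host] = stats = new HostStats();
        stats.Pages++;
        if (wasNugget) stats.Nuggets++;
    }

    // Higher score means crawl sooner. Unknown hosts get a small optimistic
    // score so the crawler keeps exploring new territory.
    public double Score(Uri link)
    {
        if (_stats.TryGetValue(link.Host, out HostStats stats) && stats.Pages > 0)
            return (double)stats.Nuggets / stats.Pages;
        return 0.05;
    }
}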

This works amazingly well.

We modified the ML system of our crawler so that it would return a random priority value for every URL given to it. That simulates randomly crawling the Web. We fed the crawler a list of 100 starting URLs, and told it to find videos. After crawling for several hours, we turned it off and measured its effectiveness. Simulating a random crawl, we found approximately one video in every 500 pages crawled. We then re-enabled the ML system and ran the test again, using the same starting point. Within two hours, the crawler was finding, on average, one new video in every 12 pages it examined. That early ML system made the crawler 40 times more productive.

Our crawler is geared towards finding video, but it would be a relatively small matter to change the nugget function and tell it to look for anything we like. An example I like to use is steam trains. I recently did a Google search for [steam trains], and got approximately 794,000 results. If we were to crawl the Web at random–just grab any old URL and download the document–then, on average, we would expect to find one document about steam trains in every 25,000 documents we examine. That’s assuming, of course, that we prevent the crawler from looking at things like videos, images, and other types of files that aren’t “documents.” We would have to examine 25 million documents in order to get 1,000 references to steam trains, and there’s no telling how relevant those documents might be. For example, we might find this page, which mentions steam trains but isn’t really relevant to them.

Even at 30 pages per second, you’re talking 833 seconds (almost 14 minutes) between hits. So if you crawled all day you’d find about 100 pages that mention steam trains, and of those a large number would be like the one I linked above–it mentions steam trains but isn’t relevant.

I would expect the behavior to be much different with the adaptive crawler. When it first started up, it would act like a naive random crawler, and its performance would be very similar. But as it began to find documents about steam trains, the URL prioritization would cause it to begin finding relevant documents much faster. Within a few hours, it would be finding a new document about steam trains every 20 seconds or so, and it would likely do even better than that in spurts. The crawler would focus on clusters of relevant documents, find links to other clusters, eventually exhaust those clusters and wander about aimlessly until it found new clusters.

I haven’t actually run that test with steam trains, but my experience with the crawler’s behavior when searching for video gives me some strong evidence that the crawler would likely find more than 10,000 relevant documents in a day, starting from scratch, using cable modem speeds. In addition, the average relevance of the documents would be much higher because the ML system would favor links from highly relevant sources over random links. If we gave the crawler a good starting point (say, the Wikipedia page on steam trains, or a steam train enthusiast’s site), its performance would be even better because it would start with a source of high-quality links (i.e. links to pages that are very likely to be relevant).

Crawling period

Throughout this discussion, I have not mentioned the dynamic nature of the Web. It’s constantly changing. It changes so fast that no search engine can be completely up to date. Early search engines worked by taking a “snapshot” of the Web over some time period. That is, they’d set their crawlers to work and crawl, say, 20 billion pages in a few days. Then the indexer would come along and start building an index from the stored documents while the crawlers started over, building a new repository that the next index run would use. The larger search engines don’t typically work this way anymore, but many smaller scale crawlers do.

A “crawl,” when using this model, is a unit of work. It might be measured by time (“today’s crawl”), or by the number of documents or sites indexed. How you measure it doesn’t really matter so much for this discussion. The point is that you don’t want to visit the same site more than once per crawl–the crawling period.

That’s fine when you’re using an offline model. An adaptive crawler, though, is often run continuously, and is expected to cover old ground from time to time in order to find new material. That steam train aficionado, for example, might have posted a new article. We want our crawler to find that new material as soon as possible.

In that model, there isn’t a “crawl” as a unit of work. As a result, you need some mechanism that allows the crawler to re-crawl a URL from time to time. I’ll cover how that’s done in a later post.

Which model to use?

If you’re building a general coverage crawler, or a crawler that’s expected to index pre-identified sites or pages, then the offline crawler is the way to go. It’s easier to build and operate, and you can take advantage of batching URLs by domain.

Adaptive crawlers are mostly used for exploration: “find documents that meet this criteria.” They are much harder to build than typical crawlers, but they can find more relevant information faster, and with less bandwidth.

One thing that discourages people from writing crawlers is the idea that they need a huge data center, a fat pipe, distributed file system, and parallel processing cluster in order to make it work. That’s just not so! If you’re looking for a relative handful of documents, you can operate an adaptive crawler on a single machine hooked to a cable modem. Even at five megabits per second, you can exhaustively search the Web for information on any topic in just a few days. The trick is limiting where you look and what you look at.

I know that’s a lot to digest in one blog entry, but the choice of crawling model affects what you can do with your crawler. The rest of this series is going to focus on the adaptive crawlers, since that’s what I’m most familiar with, and information about offline crawlers is readily available. In addition, I think most people who are interested in finding business intelligence on the Web are better served by a crawler that seeks out new sources of information rather than depending on a human to identify the sources.

In the next post, I’ll be talking about politeness, which is the most important aspect of operating a Web crawler. I’ll probably follow that up with a discussion of queue management, which turns out to be closely related to politeness in many respects.

Skip list revisited

In C# versus the data structure, I described the skip list data structure and the difficulty I saw with implementing it in .NET. I published a C# implementation a couple of months ago, and it did indeed suffer from excessive memory usage. It performed quite well, but the memory requirement of 100 bytes per node blew my memory budget.

With a little thought and some indirection, I was able to reduce that to an average of about 20 bytes per node. I published that implementation in A More Memory-Efficient Skip List. The resulting data structure is faster than my initial implementation and uses about one-fifth the memory. Overall, I’m happy with the way it works. I think the design is clever–perhaps even elegant–but the implementation is, of necessity, somewhat involved and a little messy. That said, it works well and I’m glad to have it.

My friend and development team member (co-founder, business partner, whatever) Joe Langeway floated the idea one day of building something similar to a skip list, but that had only two pointers per node: a Next pointer, and a FarForward pointer. The idea was to create a simple linked data structure that had much of the benefit of a skip list, but without all that complexity. We tossed ideas around for a day or two, and then I sat down to code it up.

The node structure is very simple. It contains the data and two link references. I’ll illustrate here with a node that contains integers, but it’s easy enough to extend this to make it work with any data type.

class Node
{
    public int Value;          // the data stored in the node
    public Node Next;          // the next node, in sorted order
    public Node FarForward;    // skips ahead of Next; null if unused
}

The idea, then, is to have the data structure assign far forward pointers as items are inserted. How to assign those far forward pointers was (and perhaps still is) a matter of some disagreement. Since I was implementing the thing, I chose what I thought was the simplest approach: when traversing the list to find the insertion spot, make the first unused far forward pointer encountered in the traversal reference the new node.

Imagine you were inserting the numbers 1, 3, and 2, in that order. The first node is inserted trivially and its far forward pointer is null. When we try to insert the second node, the code has to traverse the list (follow Next and FarForward pointers) to find the insertion spot. Along the way, the code makes the first unused (null) far forward pointer it encounters reference the new node. When the value 3 is inserted, the first node is modified so that its Next and FarForward pointers both reference the new node.

When 2 is inserted, there’s no unused forward pointer ahead of it, so the only thing pointing to the new node is the first node’s Next reference. The result is that node 1 has a Next value of 2, and a FarForward value of 3. 2.Next is 3, and 2.FarForward is null.

This data structure has two nice properties. First, it’s already in sorted order. Traversing the list in sorted order is a simple matter of following the Next links. Finding a particular value, though, isn’t necessarily a sequential search. The Find method can use the FarForward pointers to skip through the list much more quickly than examining every single node. In this particular case, for example, it only has to examine two nodes to find the value 3. It never has to examine the second node.
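
I’m not showing the real code here (see below), but a Find that uses the far forward pointers might look roughly like this. Consider it a sketch of the idea, not the implementation.

// Follow FarForward whenever it doesn't overshoot the target; otherwise take
// a single step with Next. The list is in ascending order by Value.
Node Find(Node head, int value)
{
    Node current = head;
    while (current != null && current.Value < value)
    {
        if (current.FarForward != null && current.FarForward.Value <= value)
            current = current.FarForward;   // skip ahead
        else
            current = current.Next;         // single step
    }
    return (current != null && current.Value == value) ? current : null;
}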

Implementing the first version of this took very little time, and performance is surprisingly good in the general case. I found that search takes, on average, about 10 * (Log10(n) - 1) probes, where n is the number of items in the list. So to find an item in a collection of one million takes approximately 50 probes (10 * (log10(1000000) – 1)). That’s not as fast as a skip list or a balanced binary tree, both of which would be log2(n), or in this case about 20 probes. But it’s a huge improvement over sequential search (n/2, or 500,000 probes).

Unfortunately, the data structure has a few drawbacks. It exhibits pathological behavior when the input data is sorted or reverse-sorted, and it is pretty sensitive to ordered subranges. If you insert items in sorted order, the FarForward link in every node points to the next node. And if you insert items in reverse order, all of the FarForward links are null. In either case, finding an item becomes a sequential scan of the list, taking on average n/2 probes.

The other drawback is that the number of probes required to find an item in the general case is quite variable. The average is reasonable: about 2.5 times the worst case for binary search. But it varies quite a bit. In my testing, the maximum number of probes required to find an item is approximately 3.5 times the average. So, whereas the average for a list of one million items is about 50, the maximum is somewhere around 175.

I’ve explored several different ways to alleviate the problems associated with ordered data. Preventing pathological behavior when data is ordered is not very difficult, although the result is still unacceptable, with a search time that’s about 10 times what it is for a list built from unordered data. Solving the ordered and reverse-ordered cases also complicates the code quite a bit. It’s still manageable, but not nearly as elegant as the initial implementation.

You might rightly ask why I’m even bothering with a data structure that is almost certain to be slower than existing solutions that I already have access to. There are several reasons. First (and perhaps foremost), it’s a really interesting puzzle. Just a little thought produced a simple data structure that has logarithmic complexity in the general case. It comes within spitting distance of the skip list and is much easier to understand and implement. Also, there are certain benefits to having a linked data structure whose nodes are of a fixed size, even if that data structure is somewhat slower than another data structure that has variable-sized nodes.

An interesting use of the insertion algorithm for this data structure would be to build a double-linked list from a stream of unordered data. The idea would be to treat the linked list node’s Prev pointer as a far forward pointer during construction. When the last data item is received, you can make a single sequential pass through the list to set the Prev pointers to their correct values. You could use a heap instead, but it would require more space: if your desired result is a double-linked list, this algorithm needs no memory beyond the list nodes themselves, whereas building a heap first would require O(n) extra memory.
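
A sketch of that final pass, assuming a hypothetical ListNode type with Value, Next, and Prev fields (Prev doubling as the far forward pointer during construction):

// After the last insertion, one sequential walk repairs the Prev pointers.
void FixPrevPointers(ListNode head)
{
    ListNode prev = null;
    for (ListNode current = head; current != null; current = current.Next)
    {
        current.Prev = prev;   // Prev is no longer needed as a far forward pointer
        prev = current;
    }
}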

I won’t post the code yet, since I’m still tinkering and the code is kind of ugly right now. But if you let me know that you’re interested in puzzling on this, you might give me some incentive to clean things up sooner. I’d be interested to see if a combined effort could come up with solutions to some of the problems.

Writing a Web Crawler: Introduction

This is the first of a series of posts about writing a custom Web crawler. It assumes some knowledge of what a Web crawler is, and perhaps what crawlers are typically used for. I don’t know how many posts this subject will require, but it could be a rather long series. It turns out that Web crawlers are much more complicated than they look at first.

Articles in this series:

Crawling Models
Politeness
Queue Management: Part 1

When you hear the term “Web crawler,” it’s likely that your first thought is Google. Certainly, Google’s crawler is better known than any other. It’s probably the largest, as well. There are many other crawlers, though: Microsoft’s Bing search runs one, as do Blekko and other search companies (Yandex, Baidu, Internet Archive, etc.). In addition, there are a few open source crawlers such as Nutch, Heritrix, and others. There are also commercial-licensed crawlers available.

The other thing that typically comes to mind when you think of Web crawlers is search engines. Again, Google’s crawler is used primarily to gather data for the Google search engine. When speaking of crawlers, the big search engines get all the press. After all, they’re solving big problems, and their solutions are impressive just on their sheer scale. But that’s not all that crawlers are good for.

The big search engines run what we call “general coverage” crawlers. Their intent is to index a very large part of the Web to support searching for “anything.” But there is a huge number of smaller crawlers: those that are designed to crawl a single site, for example, or those that crawl a relatively small number of sites. And there are crawlers that try to scour the entire Web to find specific information. All of these smaller crawlers are generally lumped together into a category called focused crawlers. The truth is that even the largest crawlers are focused in some way–some more tightly than others.

Smaller focused crawlers might support smaller, targeted, search engines, or they might be used to find particular information for a multitude of purposes. Perhaps a cancer researcher is using a crawler to keep abreast of advances in his field. Or a business is looking for information about competitors. Or a government agency is looking for information about terrorists. I know of projects that do all that and more.

Another class of program that automatically reads Web pages is the screen scraper. In general, the difference between a screen scraper and a crawler is that a screen scraper is typically looking for specific information on specific Web pages. For example, a program that reads the HTML page for your location from weather.com and extracts the current forecast would be a screen scraper. Typically, a scraper is a custom piece of software that is written to parse a specific page or small set of pages for very specific information. A crawler is typically more general. Crawlers are designed to traverse the Web graph (follow links from one HTML page to another). Many programs share some attributes of both crawlers and screen scrapers, so there is no clear dividing line between the two. But, in general, a crawler is an explorer that’s given a starting place and told to wander and find new things. A scraper is directed to specific pages from which it extracts very specific information.

My credentials

I’ve spent a good part of the last five years writing and maintaining a Web crawler (MLBot) that examines somewhere in the neighborhood of 40 million URLs every day. The crawler’s primary purpose is to locate and extract information from media (video and audio) files. The crawler’s basic design was pretty well fixed after three months of development, but it took about a year before we had “figured it out.” Even then, it took another year of tweaking, refactoring, and even some major architectural changes before we were happy with it. Since that time (the last three years), we’ve mostly made small incremental changes and added some features to recognize and specially process particular types of URLs that either didn’t exist when we started crawling, or that have become more important to us over time.

That said, I won’t claim to be an “expert” on crawling the Web. If there’s one thing my experiences have taught me, it’s that there’s a whole lot about the Web and how to crawl it that I just don’t know. As a small (four people) startup, we all wear many hats, and there are many things to be done. Although we know there are things our crawler could do better, we just don’t have the time to make those changes. We still discuss possible modifications to the crawler, and in many cases we have a very good idea of what changes to make in order to solve particular problems. But there are still hard problems that we know would take significant research work to solve. It’s unfortunate that we don’t have the resources to make those changes. For now, the crawler does a very good job of finding the information we want.

I have to point out here that, although I wrote almost all the code that makes up the crawler, I could not have done it without the help of my business partners. My understanding of the many challenges associated with crawling the Web, and the solutions that I implemented in code are the result of many hours spent sitting on the beanbags in front of the white board, tossing around ideas with my co-workers. In addition, major components of the crawler and some of the supporting infrastructure were contributed by David and Joe, again as a result of those long brainstorming sessions.

Based on the above, I can say with some confidence that I know a few things about crawling the Web. Again, I won’t claim to be an expert. I do, however, have a crawler that finds, on average, close to a million new videos every day from all over the Web, using a surprisingly small amount of bandwidth in the process.

Why write a Web crawler?

As I pointed out above, there are many open source and commercially licensed Web crawlers available. All can be configured to some degree through configuration files, add-ons or plug-ins, or by directly modifying the source code. But all of the crawlers you come across impose a particular crawling model, and most make assumptions about what you want to do with the data the crawler finds. Most of the crawlers I’ve seen assume some kind of search engine. Whereas it’s often possible to configure or modify those crawlers to do non-traditional things with the data, doing so is not necessarily easy. In addition, many of the crawlers assume a particular back-end data store and, again, whereas it’s possible to use a different data store, the assumptions are often deeply rooted in the code and in the crawler’s operation. It’s often more difficult to modify the existing crawler to do something non-traditional than it is to just write what you want from scratch.

For us, the decision to build our own was not a hard one at all, simply because five years ago the available crawlers were not up to the task that we envisioned. Five years ago, modifying any of the existing crawlers to do what we needed was not possible. That might be possible today, although the research I’ve done leads me to doubt that. Certainly, if I could make one of the crawlers work as we need it to, the result would require more servers and a lot more disk space than what we currently use.

If you’re thinking that you need a Web crawler, it’s definitely a good idea to look at existing solutions to see if they will meet your requirements. But there are still legitimate reasons to build your own, especially if your needs fall far outside the lines of a traditional search engine.

My next post in this series will talk about different crawling models and why the traditional model that’s used in most crawler implementations, although it works well, is not necessarily the most effective way to crawl the Web.

Shopping for a monitor

I’ve put together a desktop computer to use for development at home, and I need to get a new monitor for it. The old ViewSonic 15″ LCD that we paid $1,200 for 10 years ago is still in good shape, but its maximum resolution of 1280 x 1024 is not big enough to do serious work. I borrowed a monitor from the office, but at 1366 x 768, it’s not any better than the ViewSonic. Ideally, I’d like what I have at the office: a 24″ LCD with 1920 x 1200 resolution. But that, as it turns out, is very expensive.

You can get 1080p (1920 x 1080) monitors really cheap: about $100 for a 20″ unit. If you want a 23″ or 24″, you’ll pay $125 to $175, depending on brand and features. If you want HDMI input, you’ll pay another premium. Still, you can get a really nice 24″ 1080p monitor for under $200. You can, of course, spend as much as you want. A 60″ LCD TV makes a really nice monitor. It’ll cost you some serious cash, and it still won’t do better than 1920 x 1080.

If you want 1920 x 1200, you pay a huge premium. I think I saw one online for about $275, but in most cases you’ll have to shell out at least $300 for that extra 120 vertical pixels. That might not seem like much to some, but for a programmer or writer, 120 pixels works out to maybe 10 more lines of text. Vertical screen space is precious. Not $15 per line, though. Since I’m not willing to pay the premium for 1920 x 1200, I’m going with a 23″, 1080p model.

Graphics cards are kind of weird, too. In general, there are two types: the $20 card that will do video, and the $120 high performance card that you use for action games. As with anything, the top price is however much you want to spend. The middle ground (between $20 and $120) is mostly populated by fanless versions of the $20 cards. Trying to cut down on noise, and because I’m not much of a gamer, I’m going for the fanless cheap card. $50 isn’t exactly cheap, but that dang computer (a Dell 490 with a Core 2 Quad running at 2.5 GHz) is already too noisy. I’m not going to add more noise.

Sam Sheepdog and Ralph E. Wolf

Sam and Ralph were among my favorite Looney Tunes characters. I thought of them this morning when I said hello to my co-worker as I came into the office. So I thought I’d look them up on YouTube. Google is great. I searched for [wolf sheepdog looney tunes], and got the Wikipedia article.

All told, there were seven episodes that starred these two. The first was made in 1953, and the last in 1963. I was able to find all but the first (Don’t Give Up the Sheep) on YouTube. There are some clips of the first episode, with an alternate sound track, but I couldn’t find the original.

I did find Don’t Give Up the Sheep on mojvideo, but didn’t see a way to embed it.

Here are the others, from YouTube:

http://www.youtube.com/watch?v=IEQ1tA5QcPY

http://www.youtube.com/watch?v=WQ5ZVrHdtUY

http://www.youtube.com/watch?v=-k6s1_5BMGY

http://www.youtube.com/watch?v=Up0leZe1Un0

http://www.youtube.com/watch?v=yZKvuSYIykY

There also was a cartoon in which Taz takes the place of Ralph.

I didn’t much like the voice of Sam in that episode.

YouTube has a lot of the old Looney Tunes and Merrie Melodies cartoons. Be careful. You could spend the whole day laughing.

If you find Don’t Give Up the Sheep on YouTube, please let me know so that I can include it here.

Power carved tree ornaments

The latest issue (issue #57, Holiday 2011) of Woodcarving Illustrated Magazine has a pattern for a stylized bird that makes a great little tree ornament. I thought it would also be a good way to learn how to use my new Foredom power carver. So I spent some time with the bandsaw, cutting blanks from different types of wood, then took them over to the power carver and started shaping.

From left, the woods are mesquite, Osage orange (also known as bois d’Arc or “bodark”), mahogany, black cherry, and cedar. Click on the picture for a larger image.

I also cut one from mimosa, but already gave away the result. I gave it to the guy who gave me the wood. Fortunately, I have more of that.

The figures are about four and a half inches from beak to tail, and stand a little over two inches tall.

With the exception of the cedar, none of these woods would be considered easy to carve with edged tools. The Osage orange, though, is especially hard. I used up two sanding sleeves carving that one. But it sure is beautiful. And, yes, there’s a big crack in it. Adds character. I’m not going to throw out a perfectly good piece of wood just because there’s a crack or two in it.

I’m not about to give up my knives, but I am enjoying the power carver. Still trying to master the tool, though, and having a little trouble with symmetry and with finishing. The pictures reveal some pits and scratches that I should have smoothed before I applied the finish. I might have to start taking pictures when I think I’m done. The pictures reveal things that I just don’t see when looking at the carving in my hand.

Debra says she wants one bird of each kind of wood I use. I wonder if she realizes how many different birds she’s going to get. I still have a piece of mulberry that I need to carve up, and there are a few other types in my stash. I’m sure there’s some walnut, at least two different types of oak, and I’ve forgotten what all else.  She might end up with a dozen birds.

 

Designing a better text file viewer

Last time, I discussed existing text file viewers and why I find them inadequate. It’s disappointing that GNU less is the closest thing I’ve found to a good text file viewer for Windows. Functionally, less is almost perfect. For sure it blows away all of the other file viewers I’ve seen. Those other tools have some good ideas, but they don’t handle large files efficiently. In this article, I’ll describe how to improve the file handling, which should make it possible to write a GUI-based file viewer that has the features of less, and better performance.

The discussion below starts very simply so as to illustrate the problems and possible solutions. The discussion ultimately converges on what I think is the most reasonable way to improve the speed of navigation within a large text file.

Note also that I’m assuming a 64-bit system, where pointers occupy eight bytes and the program has access to more than two gigabytes of memory.

Working with large files

If a file will fit into memory, displaying it is very easy. I can load the file into an array of lines and display lines on the screen as required. Allowing for line wrap adds a small bit of complexity, but it’s quite manageable. For the moment, forget that loading a large file can take quite a bit of time. At 100 megabytes a second, loading a gigabyte takes 10 seconds–an eternity if you’re a user waiting to see the file.

An array is the simplest possible data structure that lets you directly index a line by number. In most programming languages, that array will require eight bytes per line for a pointer to the actual string. The size of file you could display with this program would be limited to available memory. More specifically, if the file’s size plus 8 * LineCount is smaller than available memory, you can view the file.

Note that the line index isn’t strictly required. You could do in memory what less does on disk: start at the beginning of the file and sequentially scan it to parse out the lines. Doing that in memory would be significantly faster than reading from disk (after the file’s in memory, of course), but it would still take a very long time on a sufficiently large file.

Nobody has infinite memory, so the above won’t work very well. Typical memory on a developer’s computer these days is about eight gigabytes, so to view a very large file, you need a way to represent the important parts of a file using as little memory as is reasonably possible.

You could, if you wanted, store only the index in memory and leave the actual text of the file on disk. You’d still have to build the index, which would take just as much time as loading the entire file into memory, but you could work with a much bigger file. That is, if your average line is 80 bytes (including the line terminator), the in-memory representation of the file plus index would be 88 bytes per line. But if you’re not keeping the text, your cost is only eight bytes per line. Nothing like a 91% savings.
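
A naive version of that index builder might look like the following C# sketch. A real implementation would scan a buffer at a time rather than byte by byte, but the idea is the same.

using System.Collections.Generic;
using System.IO;

static long[] BuildLineIndex(string path)
{
    var offsets = new List<long> { 0 };       // line 0 starts at offset 0
    using (var stream = File.OpenRead(path))
    {
        long position = 0;
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            position++;
            if (b == '\n')                    // the next line starts right after the LF
                offsets.Add(position);
        }
    }
    // (A trailing newline produces one extra, empty "line"; a real builder would trim it.)
    return offsets.ToArray();
}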

Such a design would likely work very well from the user’s perspective. After the wait for loading, that is. If the user asked for a particular line, the program could position to that line in the file, read it, and display a page full of data. Going to a particular line would be just as fast as positioning into the file by percentage.

Eight bytes per line is hardly free, though. My 30 GB test file has about 400 million lines in it. At eight bytes per line, the line index would occupy 3.2 gigabytes. The maximum sized file you could load on your typical developer machine would be between 60 and 70 gigabytes, and it would take about ten minutes to load the line index.

Reducing the line index size

One way to save memory would be to have the line index store the position of every other line. That would cut the size of the line index for any particular file in half. If you stored the position of every 16th line, you could make the line index 1/16th the size. You can pick any number N for the line interval, and the index size would be correspondingly reduced.

Of course, the larger N is, the more searching the program has to do for any particular line. If N is 16, then, on average, the program will have to search 8 lines of data to find the starting point. If line lengths are relatively short, this isn’t a problem at all. But if some lines are very long, the search time to find an arbitrary line could be several seconds.

Another way to reduce the size of the line index is to logically divide the file into blocks of some size. Imagine, for example, that you view the file as a sequence of four-gigabyte blocks. For each block, you store its start position (its offset in the file), and line positions within the block are stored as offsets from that starting position rather than as offsets from the start of the file. Since four gigabytes can be represented by 32 bits (four bytes), each line number entry is only four bytes rather than eight as in the naive index. Your index size is cut in half, and you still have every line indexed. The only cost is a small amount of complexity and minimal runtime overhead to calculate a file position given a block offset and line offset. The seek and read time swamps that overhead–it wouldn't be noticed.

You can do even better if you use 16-megabyte blocks. 16 megabytes is 2^24 bytes, so an offset within a block fits in 24 bits, meaning that the line offsets can be stored in three bytes each, saving you 25% over the index with four-gigabyte blocks. There is another small runtime penalty for working with three-byte quantities, but again it wouldn't be noticed in comparison with the disk seek and read time. Altogether, we have a 62.5% decrease in memory usage over the naive index. There's a small amount of memory overhead involved in working with the smaller blocks–each four-gigabyte block becomes 256 16-megabyte blocks, each with its own start position–but it's not unmanageable, and the savings per index entry more than offsets it.
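
A sketch of how the blocked, packed offsets might be stored: each block remembers its absolute start position once, and each line within it costs only three bytes. The type and names are hypothetical.

using System.Collections.Generic;

class PackedLineBlock
{
    public long BlockStart;   // absolute file position of the block (stored once per block)
    private readonly List<byte> packedOffsets = new List<byte>(); // three bytes per line

    public void AddLine(int offsetInBlock)   // offsetInBlock is less than 2^24
    {
        packedOffsets.Add((byte)(offsetInBlock & 0xFF));
        packedOffsets.Add((byte)((offsetInBlock >> 8) & 0xFF));
        packedOffsets.Add((byte)((offsetInBlock >> 16) & 0xFF));
    }

    public long LinePosition(int lineInBlock)
    {
        int i = lineInBlock * 3;
        int offset = packedOffsets[i]
                   | (packedOffsets[i + 1] << 8)
                   | (packedOffsets[i + 2] << 16);
        return BlockStart + offset;   // absolute position of the line in the file
    }
}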

Note that the size of the index depends entirely on the number of lines in the file. A one-gigabyte file that has 250 million lines (very short lines) would require almost two gigabytes for the naive line index. The “optimized” line index cuts quite a bit off of that, but it would still require about 750 megabytes. On the other hand, an 8-gigabyte file that has 100 million lines (about 86 bytes per line) would only need about 300 megabytes for the line index. Explaining to users why your program can load an 80-gigabyte file with no trouble but chokes on a 10-gigabyte file would be rather awkward.

Indexing blocks

Any scheme that indexes every line is going to suffer when there are many lines, because the index will occupy a very large percentage of the total file size. A scheme that indexes every Nth line only reduces memory usage by a factor of N, and it suffers when lines are long, because searching forward from the nearest indexed line can take an arbitrarily long time. It also still puts you at the mercy of the number of lines: a small file with many short lines will require more index memory than a large file with fewer, longer lines.

The solution is to combine the approaches: divide the file into smaller blocks as in the blocked line index, but only store the line number at the start of the block. The size of the index is then dependent entirely on the size of the file rather than on the number of lines. In addition, we can place an upper limit on the amount of time it takes to locate a line in a block. This scheme solves both problems, at the cost of a little extra processing whenever a block is read.

Suppose we use a block size of 64 megabytes (2^26 bytes). A one-gigabyte file (2^30 bytes) would require an index with 16 entries. The number wouldn’t be less than 16 entries, but there would probably be slightly more because the line lengths aren’t uniform and lines wouldn’t pack exactly into the 64 megabyte block size. Pathological cases could require up to 32 blocks for a gigabyte, but I suspect those would be pretty rare. With a little work, I could flag the start of a block as the continuation of the previous block, but it’s unlikely that the added complexity would be worthwhile.

The index entries would be larger than the line index entries, but there are a lot fewer of them. Figure eight bytes each to store the file position and the line number, and four bytes for the block length. So figure 20 bytes per block for the index. If it requires 32 blocks per gigabyte (the absolute worst case), you’re talking 640 bytes per gigabyte. The best case is 320 bytes, and 512 bytes would easily handle all but the most pathological cases. So figure 512 (2^9) index bytes for every gigabyte (2^30 bytes) indexed.
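
As a sketch, such an index entry might look like the following hypothetical type; the 8 + 8 + 4 layout gives the 20 bytes per block figured above.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct BlockIndexEntry
{
    public long FilePosition;   // where the block starts in the file (8 bytes)
    public long FirstLine;      // number of the first line in the block (8 bytes)
    public int Length;          // length of the block in bytes (4 bytes)
}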

The size of the index, then, is roughly Filesize/(2^21). So a terabyte file (2^40 bytes) would require an index of 2^19 bytes, or half a megabyte. A petabyte-sized file (1,024 terabytes, or 2^50 bytes) would only require half a gigabyte for the block index. An exabyte-sized file (2^60 bytes) would require 512 gigabytes for the index, but I figure by the time we’re working with files that large, we’ll have a lot more memory in our computers, and disk performance will be good enough that we can further increase the block size. For now, I’m happy to know that, if I were to write this, I could handle a petabyte-sized file in less than a gigabyte of RAM.

A larger block size would reduce the in-memory index size further, but there are tradeoffs that I’ll discuss below.

There is one issue that the discussion above doesn’t address: lines that are longer than the block size. There are a couple of ways to handle that situation, the easiest being to automatically split those long lines. So if a line is 65 megabytes long, the program treats it like one line that’s 64 megabytes, followed by a second line of one megabyte. Another way is to flag a block as a continuation block. That is, store the same line number as the previous block, but set a flag to say that it’s a continuation of the previous block. There are other possibilities, none of which are particularly difficult to implement. I’ll wave my hand over the issue and call it solved.

Reviewing the requirements

Remember that my simplified requirements are:

  1. Handle arbitrarily large files.
  2. Show the user what he wants to see as quickly as possible.

If you ignore the load time, the block index described above meets both of those requirements. It can handle a petabyte-sized file on commonly available hardware, and once the index is constructed the program can show the user any line in the file almost instantly. The algorithm is very simple: the block index contains the line number at the start of each block, so the program can quickly determine which block the requested line resides in, load that block, parse the lines from it to create a per-line index for that block, and then display the line in question. A standard developer machine should be able to load and parse 64 megabytes in less than a second.
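
Here's a sketch of that lookup, reusing the hypothetical BlockIndexEntry from the earlier sketch. It assumes, as the scheme above does, that every block starts at the beginning of a line.

using System.Collections.Generic;
using System.IO;
using System.Text;

static class BlockedViewer
{
    public static string GetLine(string path, List<BlockIndexEntry> index, long lineNumber)
    {
        // Binary search: find the last block whose first line is <= lineNumber.
        int lo = 0, hi = index.Count - 1, found = 0;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (index[mid].FirstLine <= lineNumber) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }

        // Read the whole block and parse its lines. This per-request work is what
        // replaces keeping a per-line index in memory.
        BlockIndexEntry entry = index[found];
        var buffer = new byte[entry.Length];
        int read = 0;
        using (var stream = File.OpenRead(path))
        {
            stream.Position = entry.FilePosition;
            while (read < buffer.Length)
            {
                int n = stream.Read(buffer, read, buffer.Length - read);
                if (n == 0) break;
                read += n;
            }
        }
        string[] lines = Encoding.UTF8.GetString(buffer, 0, read).Split('\n');
        return lines[(int)(lineNumber - entry.FirstLine)];
    }
}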

I said above that a larger block size would reduce the memory requirements, but that there are tradeoffs. The tradeoff is response time. If the user wants to position to a line that is not in the current block, the program has to load the relevant block and parse it to find the line positions in that block. A typical developer machine these days can do random reads at a rate of about 100 megabytes per second. Loading a 64 megabyte block, then, will take perhaps 700 milliseconds, leaving 300 milliseconds to parse the block and display the first page of data. If I increased the block size to 128 megabytes, it would take twice as long (two seconds) to position within the file. One second is, to me, on the edge of too slow. A two-second response time is unacceptable. If I/O speeds increase, I’m happy to increase the block size. On the same note, I might have to reduce the block size if it takes too long to read and parse 64 megabyte blocks.

The problem I’m faced with is the initial load time, which fails the second requirement. I know that I wouldn’t use the program if I had to wait five minutes or more for it to index my 30 GB file.

At this point it’s important to remember why we’re constructing an index at all. It serves two purposes: to let the user position into the file based on the line number, and to display line numbers if the user wants to see them. Being able to work with line numbers is essential, but it’s not the most important feature of a text file viewer. As we saw with less, there are many things you can do with a file viewer that don’t involve line numbers at all.

Having to wait for the file to be indexed is especially annoying when all you want to do is look at the first part of the file. Very often, that’s what a user wants to do: start at the beginning of the file and read through it sequentially. It’s unreasonable to make the user wait for all the line numbers to be computed before showing him the first part of the file.

Mimicking less

What if, when the user first started the program, it immediately read the first block of the file and presented it to the user? That’s what less does. It will even display the line numbers immediately. That’s trivial to do, of course, since reading the first part of the file, calculating line numbers for that portion, and displaying the first page is very fast. Any file viewer should do that, at minimum. Remember: show the user what he wants to see as quickly as possible.

But you can improve on less just a little bit by reading the first block (64 megabytes) of the file and parsing it to build an index of the lines in that block. Whenever the user wants to jump to a line that’s within that first block, the program already knows where that line is and has it in memory! This increases the memory footprint a little bit, as it requires that we keep track of lines in a block, but only for a single block. The worst case would be a block that contains nothing but blank lines, which would mean 64 million entries in the line index for that block; at four bytes per offset, that’s about 256 megabytes. Not small change by any means, especially compared to less, but quite acceptable on an 8 gigabyte machine.

If the user is paging sequentially through the file, the program continues to serve lines by looking them up in the index. When the user pages beyond the part of the file that’s already been indexed, the program loads the next block, indexes it, discards the previous block, and continues. As long as the user is paging sequentially through the file, this program acts just like less, except that it’s building a persistent block index.

An obvious optimization would be to keep a few blocks of the file in memory in order to reduce the ping-pong effect that would occur when the user is displaying the end of one block and the beginning of the next. Assume that this is done. I wave my hand over the technical details of the implementation because they’re not pertinent to the discussion at hand, and it’s not a difficult thing to do.

If the user then asks for the millionth (or ten millionth) line, the program again acts just like less: it makes the user wait until it has parsed enough of the file to locate the millionth line. It will still take just as long to find the millionth line as less takes, but the program is doing one other very important thing: it’s building and saving the block index.

That’s important for two reasons. First, as you saw with less, if you jump to the ten millionth line and then ask for the five millionth line, less has to start over at the beginning of the file and parse it to find line number 5,000,000. Granted, that’s pretty quick on modern hardware because the operating system will cache large parts of the file, but if you go to line 100,000,000 and then 50,000,000, you’re going to wait a long time. This proposed new program doesn’t suffer from that. Since it’s created the block index, it knows which block contains line 5,000,000 (or 50,000,000), and can jump immediately to that block.

The second reason it’s important is similar. With less, if you go to line 100,000,000 and then back to the beginning of the file, and then want line 100,000,001, less has to scan the entire file again–all 100 million lines that it had scanned before. As I said previously, less knows where it is, but doesn’t know where it’s been. Our program, though, knows where line 100,000,000 is, and can immediately start searching for the next line at that position.

The program would, of course, allow quick positioning to the end of the file or to any point within the file by percentage, just as less does. It would also make the user wait (with the opportunity to cancel the wait) if line number display is turned on but the line numbers need to be computed.

In the above discussion, the program mimics less to begin with. It can’t do anything else. But once it’s seen a particular line in the file it knows what block that line is in and can go back to it very quickly, without having to sequentially scan through the file from the beginning.

As useful as such a thing would be, it’s only a small incremental improvement over less. It’s probably worth doing, and if such a program existed and provided all the other features of less, I’d use it. But it’s not enough of an improvement to make me want to write it.

Doing better than less

The only seriously annoying thing about less is that I have to wait for line numbers to be computed when I ask for a specific line, or when I have line numbers turned on and I position into the file using %. Line numbers are important, but I hate having to wait. The program described above reduces waits after the first time it sees a line, but that initial wait is a serious annoyance. At 100 megabytes a second, it would take five minutes to compute line numbers for my 30 GB file.

If the user is paging sequentially through a text file, there’s no perceptible delay from one page of the file to the next. The only time he has to wait is when he wants to display a line number that the program hasn’t already seen.

So what would happen if the program jumped ahead of the user? What if, when the program starts, it reads the first block of the file, displays it to the user, and then keeps reading, building the block index as it goes? Given sufficient time, the program would finally reach the end of the file and the block index would be fully constructed. What if the program could do that in the background while letting the user view the file as before?

Doing that wouldn’t be especially difficult. Any developer machine today has multiple processor cores, so a background thread that’s scanning the file isn’t going to hurt the responsiveness of the UI. Modern programming environments make working with multiple threads much easier than it used to be, so implementing this improvement would be relatively easy, and the result would reduce or completely eliminate the user’s need to wait for indexing. If, for example, the user started the program and spent three minutes paging through the first part of the file, the program would have had those three minutes to index a large part of the file. Even at 50 megabytes a second (half of what you would expect from a modern developer machine), the program can index three gigabytes a minute. While the user is slowly paging through the file, the program is reading ahead, indexing the file. The result should be much smaller wait times–even the first time through the file.
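
A minimal sketch of that background scan, again using the hypothetical BlockIndexEntry type. To keep it short I read fixed-size blocks; a real implementation would end each block at its last newline so that blocks start on line boundaries.

using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class BackgroundIndexer
{
    private const int BlockSize = 64 * 1024 * 1024;   // 64 MB, as discussed above

    public readonly ConcurrentQueue<BlockIndexEntry> Index = new ConcurrentQueue<BlockIndexEntry>();
    public long LinesSeen;   // running totals, also usable for line number estimates
    public long BytesSeen;

    public Task Start(string path, CancellationToken token)
    {
        return Task.Run(() =>
        {
            using (var stream = File.OpenRead(path))
            {
                var buffer = new byte[BlockSize];
                long filePosition = 0;
                long firstLine = 0;
                int read;
                while (!token.IsCancellationRequested &&
                       (read = stream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    long linesInBlock = 0;
                    for (int i = 0; i < read; i++)
                        if (buffer[i] == '\n') linesInBlock++;

                    // Publish the entry so the UI thread can use it immediately.
                    Index.Enqueue(new BlockIndexEntry
                    {
                        FilePosition = filePosition,
                        FirstLine = firstLine,
                        Length = read
                    });

                    firstLine += linesInBlock;
                    filePosition += read;
                    Interlocked.Add(ref LinesSeen, linesInBlock);
                    Interlocked.Add(ref BytesSeen, read);
                }
            }
        }, token);
    }
}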

The user would still be able to position to any point in the file, just as before. Again, the only time he’d have to wait would be to display correct line numbers. And with background indexing of the file, even that requirement can be relaxed.

Estimating line numbers

My 30 GB test file has an average line length of about 84 bytes, and the contents of the file are reasonably uniform. Line 100,000,000 of the file is, to a fair degree of accuracy, at about position 8,400,000,000 in the file. The actual number reported by less is 8,393,544,127, meaning that my calculation was off by 6,455,873 bytes or .07%. Then again, my average line length estimate was “about 84 bytes.” The program would have a much more accurate value.

If the program is reading and indexing the file in the background, it can keep a running average line length so that when the user asks for a line that hasn’t been read, the program can estimate the position of that line number and jump to that position in the file. Obviously, the line number display would have to indicate that the numbers shown are approximate. Would that be good enough?

It would be for me. Very often, I just want to see what’s “around” line 100,000,000, or I know there’s a particular bit of text on or near that line. If the program can put me close to that line, I can use the search facility to quickly find the exact line I’m looking for by searching in the vicinity.
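
The estimate itself is trivial. Something like the following hypothetical helper, fed by the running byte and line counts from the background scan, would do it.

using System;

static class LineEstimator
{
    // bytesSeen and linesSeen come from however much of the file has been scanned so far.
    public static long EstimatePosition(long lineNumber, long bytesSeen,
                                        long linesSeen, long fileLength)
    {
        if (linesSeen == 0)
            return 0;                                  // nothing scanned yet; start at the top
        double averageLineLength = (double)bytesSeen / linesSeen;
        long estimate = (long)(lineNumber * averageLineLength);
        return Math.Min(estimate, fileLength - 1);     // never seek past the end of the file
    }
}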

The program continues to index lines in the background, and periodically updates the line number estimates. When the program gets to the block that contains the current position, it can update the display with the actual line numbers.

This is far from a perfect solution. For example, an error of .07 percent in a petabyte-sized file is a huge chunk–on the order of a terabyte. The error would be much greater if the first part of the file (the part already scanned) is not representative of the rest of the file. The program can use various adaptive techniques in an attempt to reduce the error, but nothing is going to be perfect.

The program described above would be a tremendous improvement over the venerable less, provided that it maintained all the existing functionality that makes less so useful. The block index, background file scanning, and estimated line numbers are improvements that should be obvious to anybody who thinks about file viewers. Several of the file viewer tools I’ve seen implement some of what I described, but in fatally flawed ways. And that bothers me.

Background file scanning is an obvious improvement, and one that a few other programs already have, although some implement it stupidly.

The line index confuses me. The designers of other file viewers had to realize that the line index limited the size of file their programs could handle. It looks like they gave it some thought, as none of the programs seem to use eight-byte line indexes. Most appear to use four-byte indexes, meaning that they’re doing something to reduce memory consumption. But even four bytes per line adds up when there are hundreds of millions of lines in the file. Why nobody thought of keeping just a block index and re-parsing blocks as required is a bit of a mystery. Or maybe they did think of it, and discarded the idea. That would be an even bigger mystery.

The one thing that doesn’t confuse me, but concerns me nonetheless, is the insistence on knowing the exact position of a line before allowing the user to position to that line. I can’t be the first person to think of using estimated line numbers.

For more than 25 years, hex file viewers have loaded the first page of a file immediately and have had the ability to position to any point in the file instantly. I know this to be true because I wrote such a program in 1983. The original version could handle a file up to four gigabytes in size, using minimal memory to do it. I had to wait about 15 years before I could prove that, since when I originally wrote the program even a 10 megabyte hard drive was a luxury that most didn’t enjoy. To be fair, a hex file viewer is a much simpler program than a text file viewer because there are no “lines” to worry about. As far as the hex file viewer is concerned, a file is just an array of bytes that’s rather slow to access.

My insight, and one that I have to believe others have had, is that a file is a file. It’s our interpretation that makes it a text file, a database file, or whatever. The only thing that prevents a text file viewer from instantly positioning to a particular line is not knowing exactly where that line is in the file. Realizing that, the designer of a file viewer has to make a choice: force the user to wait for the line numbers to be calculated, or display text at the estimated line position. As a user, I’m not going to be completely happy with either (I want it now!), but I’m much happier to be given something rather than having to wait an eternity for the program to calculate line numbers. If other designers thought of this (and I have to believe that they did), they made the other choice. They elected not to potentially confuse users by showing incorrect line numbers.

And I agree that there’s potential for user confusion here. If the user asks for line 100,000,000 and the program puts him there instantly, he’s tempted to believe that he really is at line 100,000,000. That’s especially true if the user doesn’t realize that calculating line numbers takes time. But that user confusion is mitigated by two things:

  1. Most people who view these huge text files are programmers who will understand the issues.
  2. A simple popup message can explain that the line numbers are estimates.

It’s my contention that positioning to what the program thinks is the correct line and allowing the user to proceed from there is preferable to making the user wait a potentially very long time (and five seconds is a long time when you’re trying to get things done) before he can do anything at all.

Some thoughts on implementing the user interface

One of the problems I struggled with in working out the solution I described above is how the vertical scrollbar behaves. The horizontal scrollbar turns out to be easy to handle, but the vertical scrollbar required a bit more thought. I also initially thought that the optional line wrapping feature would complicate things. It turns out that if we change the way we think about those two features, their implementations become much easier.

Absent line wrapping, the horizontal scrollbar’s maximum value is set to the length of the longest line so far encountered in the file. It’s as simple as that. In the simplest implementation, the length is updated by the background process that’s scanning the file. However, it could also be updated when the foreground process positions to a point in the file that hasn’t yet been scanned. Whenever a previously unseen line is parsed, the scrollbar limit is updated if necessary. There’s a potential problem if a single line exceeds the scrollbar’s capacity (2^31 in Windows), but that can be handled by setting a limit on line length, or by using the scaling technique that I describe below.

When line wrapping is turned on, the horizontal scrollbar is disabled or hidden, since there’s no need to scroll horizontally when the text wraps to the window width.

At first thought, the vertical scrollbar seems like a huge problem. If the interface uses it to reflect the line position in the file, then the scrollbar limits will have to be updated continually as the file is being scanned. In addition, using the scrollbar in that way would limit the number of lines that could be accurately represented to the maximum value accepted by the scrollbar. Using the standard Windows scrollbar, that limit is 2^31, meaning that you couldn’t work with a file that has more than two billion lines. Well, you could, but the code would be difficult.

If, instead, we treat the vertical scrollbar as representing the position as a percentage of the file’s content length, things are greatly simplified. If the file is less than two gigabytes, then the scrollbar limit is the file’s length and the scrollbar thumb position is equal to the exact byte position in the file. If the file is larger than two gigabytes, then we can use a constant multiplier for the scrollbar thumb position. The math is quite simple.
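
A sketch of that scaling, with hypothetical helper names: when the file fits within the scrollbar's range the mapping is one-to-one, otherwise a constant multiplier converts between byte positions and thumb positions.

static class ScrollbarScale
{
    public static int ThumbFromPosition(long filePosition, long fileLength)
    {
        if (fileLength <= int.MaxValue)
            return (int)filePosition;                      // 1:1 mapping for smaller files
        double scale = (double)fileLength / int.MaxValue;  // constant multiplier
        return (int)(filePosition / scale);
    }

    public static long PositionFromThumb(int thumbPosition, long fileLength)
    {
        if (fileLength <= int.MaxValue)
            return thumbPosition;
        double scale = (double)fileLength / int.MaxValue;
        return (long)(thumbPosition * scale);              // back to a byte position in the file
    }
}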

The scrollbar small change value is always 1. If the user clicks on a scrollbar arrow, the window scrolls one logical line, a “logical line” being defined by the state of the line wrapping flag. Similarly, the large change is however many logical lines fit in the window at its current size, using the currently selected font. Simple and effective.

One thing that always bothers me about Notepad is that if you turn on line wrapping in a very large file, the program takes a long time to re-compute the wrapping, and ends up moving you to the beginning of the file. Other editors and file viewers do the same thing, and it always struck me as kind of crazy that they would. It appears that the only reason they do this is so that they can make the scrollbar reflect the number of logical lines in the file–that is, the number of display lines it takes to show all of the wrapped lines in the document. It’s nuts.

There’s no other reason I can think of that going from horizontal scrolling to line wrapping would cause such a delay. Certainly, there’s no reason why the program can’t just wrap the current page almost instantly. The top line remains the same, and it’s trivial to line wrap that line and the following lines to fill the page. What comes before and after the current page is irrelevant.

If we think of the scrollbar position as the current position in the file, irrespective of line numbers, then all that complexity goes away.

Another benefit of viewing the scrollbar in this manner is that the user can grab the thumb and pull it down to, say, the halfway point. The program will behave as though the user asked to position at the 50% point in the file–just as if, in less, the user had entered 50%. Simple, not confusing, and very effective.

Will I write it?

If this program existed today, I would use it. It’s a significant improvement over less, which I consider the best text file viewer currently available. The question is whether it’s enough of an improvement for me to develop it. Writing this file viewer would entail a lot of work. The algorithms themselves aren’t especially difficult to implement, but it would take some time to get everything right.

For now, it’s just an idea. I have other fish to fry at the moment, but I sure do want this tool.

Large text file viewers

I’ve evaluated a lot of text file viewers for Windows, looking for one that has a decent user interface, a small set of “must have” features, and that can handle very large files.

One of the problems I have when looking for “large text file” viewers is that people have different definitions of “large.” Some programmers proudly state that their programs handle a 10 megabyte file with no problem. Others tout their ability to handle files that are “hundreds of megabytes” in size. And there are plenty that use one gigabyte as their benchmark. Granted, a gigabyte is a lot, but I regularly work with files that are tens of gigabytes. For example, the file I’ve been using recently to test file viewers is 30.83 gigabytes (33,100,508,272 bytes).

Many file viewers that claim to work with files “of any size” forget to add the fine print that says, “as long as it will fit into memory.” Many of those do indeed work on a 10 gigabyte file, provided you’re willing to wait five minutes for the entire file to be loaded into memory. I’ve also found that many of these tools use an inordinate amount of memory. One viewer, for example, requires about 4 gigabytes of memory in order to display a 2 gigabyte file. Why that is isn’t clear, but it certainly limits the size of file that can be viewed.

Of those viewers that don’t choke on my 30 gigabyte file, not one that I’ve tried to date has the combination of features and performance that I desire. A few come close, but fall down in the user interface or, more importantly, in quickly responding to my requests. I wrote a little about this back in April in my .NET Reference Guide. See Viewing Large Text Files. In the article, I identify many of the tools I tested and take a closer look at three of them.

I’ve thought about this particular issue quite a bit over the years, and have always stopped short of sitting down and writing my own viewer. Why? Because it really is a difficult problem, and I have other more pressing things to do. There are user interface and implementation tradeoffs that you have to make, and although I think I could do a better job with those tradeoffs than others have done (at least, I’d be happier with my implementation of the tradeoffs), I’m not convinced that the improvements I’d make would justify the time I spent on it. It is interesting, though, to think about the problems and possible solutions.

Requirements

In the discussion that follows, I’m going to assume a minimal graphical user interface–something about on the level of Windows Notepad without the editing capabilities. I expect the following features, at minimum:

  • Intelligent handling of Unicode encodings: UTF-16 big-endian, UTF-16 little-endian, and UTF-8.
  • Ability to specify the character set.
  • Quick (“instantaneous”) load time, regardless of file size.
  • Ability to handle arbitrarily large (up to 2^64 bytes) files.
  • Vertical and horizontal scrollbars.
  • Standard movement keys (line up/down, page up/down, start/end of file).
  • Optional word wrapping.
  • Optional line number display.
  • Jump to specified line number.
  • Search for specified text.

I define “arbitrarily large” as 2^64 bytes because the most common file systems today (NTFS and the various popular Linux file systems) use 64-bit file pointers. Whenever I say “arbitrarily large file,” assume that I mean “up to 2^64 bytes.” Although I’ll limit my discussion to files that large, there’s no fundamental reason why the techniques I describe can’t apply to larger files.

One might rightly ask why I’m even thinking about files that large. It seems the height of folly to construct such big files. I tend to agree, in theory. But in practice it turns out that creating one huge XML (or some other text format) file is the easiest thing to do. The computer will be reading the data sequentially whether it comes in as one file or multiple files, but writing the program to create, manage, and transmit multiple files in order is more complicated than writing a program that handles one monolithic file. I regularly see files that are tens of gigabytes in size, and others I know talk of hundred-plus gigabyte files. I have no doubt that in ten years we’ll be talking about multi-terabyte (and larger) text files. It’s only when we want to read the files at arbitrary positions identified by line numbers that things get difficult.

Everything beyond the features listed above is optional, although regular expression searching and a text selection and copy-to-clipboard feature would be really nice to have. Those aren’t particularly challenging to implement.

The character set handling is very important, although not especially difficult to implement. Notepad, for example, does a credible job of determining the text encoding, although it does guess wrong from time to time. The program should give the user the opportunity to specify the encoding if things look wrong or if he knows that the file is using a particular encoding or code page.

In reality, there are two major requirements that drive the design of the program:

  1. Handle arbitrarily large files.
  2. Show the user what he wants to see as quickly as possible.

Before we get started talking about how to meet those requirements, I think it’s important to take a closer look at what I consider the best file viewer program available today.

Viewing files the Unix way

The GNU program less, which is derived from the Unix utility of the same name, is a very effective file viewer for looking at arbitrarily large files. I’ve tested less with files larger than 100 gigabytes, and it handled them just fine. I don’t know what it would do with a terabyte-sized file, but I imagine it wouldn’t have any trouble. less is solid, uses very little memory, and works much better than many file viewers that use hundreds of megabytes of memory and don’t gracefully handle large files. It’s rather humorous and faintly depressing to see a 25-year-old command line tool written to use minimal memory outperform modern GUI tools that use hundreds of megabytes of memory. Anybody who’s working on a file viewer should study less very carefully.

less has a few features that give some ideas of how to handle very large files. One of the things that people find surprising is that less can position to any point in an arbitrarily large file very quickly. If you press > (or enter one of the other end-of-file commands) to move to the end of the file, less goes there immediately and shows you the last page full of lines from the file. It’s almost instantaneous. Similarly, entering 50% at the command prompt will take you to the halfway point in the file. Instantly.

It shouldn’t be surprising that less can position to any point in the file. After all, it’s trivial to seek into the file, read a buffer full of data, find the start of the first line in that buffer, and display it. What’s surprising to me is that so many authors of file viewers seem to think that positioning by percentage is difficult or not useful. They insist on doing things by line number which, as you’ll see, is the wrong way to think about the problem.

less can work with line numbers, too. For example, entering 50g will go to the 50th line in the file. You’ll find, though, that if you enter 1000000g, the program appears to lock up. What it’s doing is parsing the file, looking for the millionth line. And the only way to find the millionth line is to find all 999,999 lines that come before it. The program hasn’t locked up, by the way; pressing Ctrl+C will return the program’s command prompt.

The command -N will turn on the line number display. When that mode is enabled, less won’t display the results of a positioning command until it has computed the line numbers. So if you enable line numbers and then enter 50% to go to the halfway point, less will display this message:

Calculating line numbers... (interrupt to abort)

You can wait for the line numbers to be calculated, or you can press Ctrl+C to interrupt. If you interrupt, less turns off the line number display and immediately displays the page of text that you asked for.

If you ask for the millionth line and wait for it to be located, less knows where you are. If you then ask for line 1,000,100, less can find it trivially because the program knows where the millionth line is, and all it has to do is search forward 100 lines. If, however, you then ask for line 500,000, less doesn’t know where the line is. The program has to start over at the start of the file and parse the first 499,999 lines to find the position you asked for. It’s possible, by the way, that the program will search backwards. I don’t know how it’s implemented. In either case, less knows where it is in the file, but it doesn’t know where it’s been.

The above is not meant as criticism of less. The program is incredibly useful and works very well within its limitations. The program was designed to work on computers that are very primitive by today’s standards. And yet it often outperforms modern programs designed to take advantage of computers that were inconceivable 25 years ago. Granted, less has been updated over the years, but its basic design hasn’t changed.

If nothing else, programmers who are writing file viewers should study less because it does handle arbitrarily large files–something that most file viewers don’t do. In addition, the program has features that every file viewer should have but most don’t. In general, if less can do it but your program can’t, then there’s a problem with your program. Similarly, if your program doesn’t add something that less lacks, or do something better than less does it, then what’s the point of writing it?

There are two major things I’d like to improve in less: character set handling, and speed of locating lines. There are some smaller features that I wish less had and that I’d add were I to write a file viewer, and I prefer a GUI, but those are relatively minor issues. Were I to embark on actually writing a file viewer to improve on less, my motivation would be speed and character set handling.

Character set handling can be tricky, but it’s a known problem with good solutions. Improving the speed of locating lines, too, doesn’t involve any heavy lifting. Several other file viewers have solved it to various extents, although their solutions are, in my opinion, fundamentally flawed in several ways. In particular, the two programs (other than less) that could view my 30 GB file both appear to create an in-memory line index, which puts serious limitations on the size of files they can view. A small file that contains a lot of very short lines requires more memory than a large file that has long lines. A terabyte-sized file with lines that average 80 characters would have about 13.7 billion lines. Naively stored, that index would require more than 100 gigabytes. With a little thought, you could reduce the memory requirement by half, but 50 gigabytes is still more memory than most servers have, and six times as much memory as a typical developer machine.

I’ve given this problem some serious thought, and I think I’ve come up with a workable solution that improves on less while using a relatively small amount of memory. I’ll explain my approach in my next posting.
