You want it when?

The web crawler I’m working on, as I’ve mentioned before, is a distributed application. Currently it consists of a URL Server and multiple Crawlers. The basic idea is that the URL Server is a traffic director that tells each Crawler machine which Web sites to visit. Each Crawler machine hosts multiple Worker threads, each of which reads a URL from a queue, downloads the page, harvests the links, and reports the results.

Coordinating the Worker threads and the communication with the URL Server (which itself contains several threads) isn’t exactly easy, but it’s manageable. This past week I’ve been working on what I call the Crawler Control Panel: a GUI application that lets me monitor and control the URL Server and the individual Crawlers. Today I finally got around to adding the shutdown commands and making the Crawlers and URL Server shut down gracefully. That turned out to be an interesting experience.

I don’t want to just terminate a Crawler because if I do, I lose data. Each of the Worker threads is in some stage of downloading a Web page, and each also has some buffered data and a small batch of URLs (a few dozen) that it has “checked out” but not yet processed. If the Crawler terminates immediately upon receiving a shutdown message, all of that buffered data is lost. To avoid losing data, the Crawler waits for each Worker thread to finish its current URL and then tells the Worker to shut down. Because shutting down takes some time (sending results to the URL Server, updating the queue, etc.), it is done asynchronously (on a separate thread) to avoid blocking the Crawler’s main processing thread. The Crawler shuts down after all of the Worker threads have been shut down.

A much-simplified version of the Crawler’s main processing loop looks like this:

While True
    Wait for WorkerReadyEvent or CrawlerShutdownEvent
    if CrawlerShutdownEvent
        CrawlerShutdown = True
    if WorkerReadyEvent
        While there are Workers in the queue
            Get Worker from queue
            if CrawlerShutdown = True
                Shutdown Worker asynchronously
            else
                Tell Worker to process next URL
    if CrawlerShutdown = True and Number of workers = 0
        Terminate

That looks like it should work, right? Imagine my surprise when all the Worker threads shut down but the Crawler wouldn’t terminate. It’s painfully obvious now, but at the time I was scratching my head. It’s easy to see what’s wrong if you imagine that there is a single Worker thread. Everything works fine until the CrawlerShutdownEvent is signaled. Then, events happen in this order:

  1. The CrawlerShutdown flag is set.
  2. At some point, the Worker thread finishes with its current URL, gets put into the queue, and signals the WorkerReadyEvent.
  3. The Crawler dequeues the Worker, notes that CrawlerShutdown is set, and spawns a new thread to shutdown the Worker.
  4. While the Worker is shutting down, the Crawler checks to see if the number of workers is zero. Since the Worker hasn’t fully shut down, it hasn’t been removed from the list. The Crawler goes back to waiting for the next event.
  5. The Worker terminates and no more events are forthcoming.

The first idea that comes to mind–decrementing the count of Workers in the main loop–won’t work: if you did that, the program might terminate before the Worker finished saving its buffered data. No, the answer is to put a timeout (I used five seconds) on the wait at the top of the loop. So the code looks like this:

While True
    Wait up to five seconds for WorkerReadyEvent or CrawlerShutdownEvent
    if the wait didn't time out
        // do all that other stuff
    if CrawlerShutdown = True and Number of workers = 0
        Terminate
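
In C#, the fixed loop comes out looking something like the sketch below. Every name here is a placeholder for illustration rather than the crawler’s actual code; the point is the WaitAny call with a timeout and the termination check that now runs even when nothing was signaled.

using System;
using System.Collections.Generic;
using System.Threading;

// Sketch of the fixed loop. Every name here is a placeholder; the real
// crawler's types and fields are different.
class CrawlerSketch
{
    readonly AutoResetEvent workerReadyEvent = new AutoResetEvent(false);
    readonly ManualResetEvent crawlerShutdownEvent = new ManualResetEvent(false);
    readonly Queue<Worker> readyWorkers = new Queue<Worker>();
    int workerCount = 0;        // set to the Worker count at startup; each Worker
                                // decrements it as it finishes shutting down
    bool crawlerShutdown;

    public void Run()
    {
        while (true)
        {
            // Wait up to five seconds; WaitAny returns WaitHandle.WaitTimeout
            // if neither event is signaled before the timeout expires.
            int signaled = WaitHandle.WaitAny(
                new WaitHandle[] { workerReadyEvent, crawlerShutdownEvent }, 5000, false);

            if (signaled != WaitHandle.WaitTimeout)
            {
                // The shutdown event is a ManualResetEvent: once signaled it stays
                // signaled, so this check is also true on every later pass.
                if (crawlerShutdownEvent.WaitOne(0, false))
                    crawlerShutdown = true;

                lock (readyWorkers)
                {
                    while (readyWorkers.Count > 0)
                    {
                        Worker worker = readyWorkers.Dequeue();
                        if (crawlerShutdown)
                            ThreadPool.QueueUserWorkItem(delegate { worker.Shutdown(); });
                        else
                            worker.ProcessNextUrl();
                    }
                }
            }

            // This check now runs even when the wait times out, so the loop
            // can't sleep through the last Worker's exit.
            if (crawlerShutdown && workerCount == 0)
                break;
        }
    }
}

// Stand-in for the real Worker class.
class Worker
{
    public void ProcessNextUrl() { /* download the page, harvest links, report results */ }
    public void Shutdown() { /* flush buffers, return checked-out URLs, then exit */ }
}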

Sometimes–rarely, but it happens–you do want a Crawler to terminate immediately. Simple enough, right? Just signal the CrawlerShutdownEvent and pass a flag that says, in effect, “terminate right now without waiting for the Worker threads.” That was absurdly easy to implement, but then I ran into another problem.

Remember the Crawler Control Panel (CCP)? It has a persistent connection to the URL Server and all of the Crawlers. To terminate a Crawler or the URL Server, the CCP sends a message on that connection. Every such message requires a response. That usually isn’t a problem, but because things are happening on multiple threads, the Crawler sometimes terminates and closes the connection before it returns a response to the CCP. The result is a big ugly exception: “connection forcibly closed,” or something similar.
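
Presumably the fix is mostly a matter of ordering: get the response onto the wire before anything on another thread can close the connection. Something along these lines, where the handler and message names are entirely made up:

using System.Net.Sockets;
using System.Text;

// Hypothetical handler for the "terminate right now" message. The names are
// made up; the point is only the ordering: put the acknowledgment on the
// wire before anything on another thread can close the connection.
class ControlConnectionSketch
{
    public void HandleTerminateNow(NetworkStream controlStream)
    {
        byte[] ack = Encoding.ASCII.GetBytes("OK TERMINATE\n");
        controlStream.Write(ack, 0, ack.Length);
        controlStream.Flush();       // the CCP sees its response first...

        BeginImmediateShutdown();    // ...and only then does teardown start
    }

    void BeginImmediateShutdown()
    {
        // signal CrawlerShutdownEvent with the "don't wait for Workers" flag
    }
}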

Some days you just can’t win.

Anybody who’s done any multi-threaded or distributed processing is familiar with these kinds of problems. The interesting thing is that most programmers–even the experienced ones–continually run into stuff like this. Thinking about concurrency is hard. You have to think of all the wacky cases–events that can happen at any time in almost any order–and write your code so that it handles those cases intelligently.

When you’re writing this kind of code, paranoia pays. No wonder I don’t sleep nights.

Credit Card Fraud

Debra called while I was on my way to lunch this afternoon. Somebody purporting to be the Capital One fraud department had left a message on the home answering machine saying that it was imperative that I contact them. They left a toll-free callback number. That got me to thinking, though. How could I trust the number? Anybody could call and say that they’re Capital One.

So I called the customer service number on the back of my card. After wading through several levels of menus and going around the loop twice, I finally started hitting ‘0’ until I got a real person on the line. After verifying my identity, the woman confirmed that the fraud department had called, and she transferred me to them.

Ten minutes on hold later, I got to verify my identity once more and the representative asked me about some possibly fraudulent charges: one for over $1,000 at the Boeing Store, and one for a two-year subscription to Experts Exchange–neither of which I had authorized. The card is of course being canceled and I’ll receive a replacement in the mail soon. But it got me to wondering about several things.

  • Why were the charges refused? Perhaps the Web sites asked for the 3-digit security code and when the number entered wasn’t correct, they reported possible fraud? Or maybe the shipping address was different from my home address. I’m curious how Capital One decided that those charges were possibly fraudulent. It’s not inconceivable that I would have made those purchases.
  • How the heck did somebody get hold of my credit card number? I guess that in itself isn’t terribly difficult, but you’d think that somebody who was buying a two-year subscription to Experts Exchange would be smart enough to know that he couldn’t make the purchase without the card security code and a confirmed home address.
  • How many people would call a number left on an answering machine from somebody purporting to be the fraud department? I think it’s terribly irresponsible for Capital One to leave a callback number. I honestly thought the answering machine message was a scam. It would be more reasonable to say something like, “Please call the customer service number printed on the back of your card and ask to speak with the fraud department.”

Regardless, I’m happy to see that they were able to identify those unauthorized charges and prevent somebody from having fun at my expense. I hope they can track down the miscreant before he does real damage to somebody’s credit.

Build your own notebook flash drive

Further to yesterday’s entry about notebook flash drives, I forgot to mention the Addonics CF drive adapter. For $30, you get an adapter that plugs into your notebook’s IDE connector. Add a 16 GB compact flash card, and you have a solid state “hard drive” in your notebook. There’s also a dual CF adapter with a second CF slot, so you could have 32 GB. That doesn’t sound like much in today’s world of 160 GB notebook hard drives, but 32 GB is plenty for most needs–provided you don’t have 20 gigabytes of music files. Imagine what the reduced power consumption will do for your battery life.

At over $200 for a 16 GB CF card, this is still a bit too expensive for most people. But not terribly so, and not for much longer if flash memory prices continue to drop. I can almost guarantee that the next notebook computer I buy will have a compact flash storage device rather than a 2.5″ hard disk.

Odds ‘n Ends

A few items that have been gathering dust here while I bang away on the crawler.

  • Ever wonder what you could do with a terabyte of really fast storage? Check out the Tera-RamSan. I hope you have a big budget, though. The unit is undoubtedly expensive, and it requires 2,500 watts. That’s some heater!
  • Speaking of faster mass storage, several manufacturers (Samsung and SanDisk, among others) have announced or demonstrated flash replacements for laptop hard drives. These aren’t new, but soon we’ll be seeing 160 GB and possibly larger drives. They’re not exactly cheap, either, but the idea of a solid state “disk drive” for my laptop is very attractive.
  • Not as much call for this sort of thing as there used to be, but sometimes there’s no substitute for a good bit twiddling hack. Thanks to David Stafford for the link.
  • Got a crazy idea? Google has $12 billion and an acquisitions chief who’s looking for ideas that are “really crazy.”
  • If you’re looking for WordPress blog templates, check out the Theme viewer. You can search for themes based on a wide variety of criteria, and see them on a sample blog before downloading and installing. As you might expect, some work better than others. I really like this current theme (Tiga), because it includes an HTML interface for changing colors, column widths, field heights, fonts, etc. It’s very nicely done.

Computers update

After a little more than three weeks with the new computer, I’m mostly happy with it. It’s blindingly fast, both in processing and in disk access. And whisper quiet, really. I got into the office very early the other day–about 4:30 AM–when there was almost no background noise. I could barely hear the computer. The Dell flat panel monitor on my desk was making more noise than the computer, and I could hear the machine in the other room (a machine that I used to think was quiet) better than I could hear the one right next to my desk.

Windows XP 64 is working well for me, although there are a few oddities:

  • For some reason there’s no “minimize all” button on the quick launch bar. That’s a handy little tool to hit when I want to quickly see my desktop. Fortunately there is a key combination–Windows key+D–that will do the same thing, but I’m used to hitting the button with my mouse.
  • I put my Windows task bar on the right side of the screen rather than at the bottom. I started doing that years ago because vertical screen space was hard to come by. As a programmer, I want to see as many lines of code as possible. There have always been a few oddities with having the task bar over there, mostly due to application developers not taking into account the possibility. But the problem I’m having on XP 64 is not caused by that. For some reason, when I reboot the task bar takes on some default size values. I have to resize it after every reboot.
  • The Microsoft PowerToys for Windows XP don’t work on XP 64. I really miss CmdHere and SyncToy.
  • XP 64 doesn’t run 16-bit apps. I knew that going in, and I can understand why. But it’s disappointing nonetheless. I have some utility programs that I last compiled in 1987 and that I used on a regular basis until I upgraded to this machine. I can probably just recompile some of them–those that I wrote in C. The ones I wrote in assembler will take a little work to port, and those that I created with Turbo Pascal will have to be rewritten entirely.
  • The Windows SORT utility is worthless for sorting large files. It sorted a 1.6 gigabyte file just fine, but choked when I gave it a 3 gigabyte file. I can’t imagine why it runs out of memory when it’s supposed to make use of temporary files. And apparently it reads strings into fixed-length records for sorting. Seems like a bad design decision to me; a plain external merge sort handles files of arbitrary size in modest memory (there’s a bare-bones sketch after this list).
  • I’m convinced that the NT file system is not designed to reduce fragmentation. I’m working with large (2 gigabytes and more) files. The file system insists on filling the small empty spaces on the disk before allocating any of the huge unused space. As a result, I get fragmentation like you wouldn’t believe.
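
For what it’s worth, the standard way to sort a file that won’t fit in memory is an external merge sort, and it isn’t much code. Here’s a bare-bones sketch in C#: line oriented, ordinal comparison, no error handling, and a chunk size pulled out of the air.

using System;
using System.Collections.Generic;
using System.IO;

// Bare-bones external merge sort: split the input into sorted chunks on
// disk, then merge the chunks into the output. Chunk size, temp file
// handling, and the ordinal string comparison are all simplifications.
static class ExternalSortSketch
{
    public static void SortFile(string inputPath, string outputPath)
    {
        const int chunkLines = 1000000;              // tune to available memory
        List<string> tempFiles = new List<string>();

        // Pass 1: read the input in chunks, sort each chunk in memory,
        // and write each sorted chunk to its own temporary file.
        using (StreamReader reader = new StreamReader(inputPath))
        {
            List<string> chunk = new List<string>(chunkLines);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                chunk.Add(line);
                if (chunk.Count == chunkLines)
                    tempFiles.Add(WriteSortedChunk(chunk));
            }
            if (chunk.Count > 0)
                tempFiles.Add(WriteSortedChunk(chunk));
        }

        // Pass 2: k-way merge of the sorted chunks into the output file.
        StreamReader[] readers = new StreamReader[tempFiles.Count];
        string[] heads = new string[tempFiles.Count];
        for (int i = 0; i < tempFiles.Count; i++)
        {
            readers[i] = new StreamReader(tempFiles[i]);
            heads[i] = readers[i].ReadLine();
        }

        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            while (true)
            {
                int min = -1;                        // index of the smallest pending line
                for (int i = 0; i < heads.Length; i++)
                    if (heads[i] != null &&
                        (min < 0 || string.CompareOrdinal(heads[i], heads[min]) < 0))
                        min = i;
                if (min < 0)
                    break;                           // every chunk is exhausted
                writer.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();
            }
        }

        foreach (StreamReader r in readers)
            r.Close();
        foreach (string f in tempFiles)
            File.Delete(f);
    }

    // Sort one chunk in memory and write it to a temporary file.
    static string WriteSortedChunk(List<string> chunk)
    {
        chunk.Sort(StringComparer.Ordinal);
        string path = Path.GetTempFileName();
        using (StreamWriter writer = new StreamWriter(path))
            foreach (string s in chunk)
                writer.WriteLine(s);
        chunk.Clear();                               // reuse the list for the next chunk
        return path;
    }
}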

That said, I’m pretty dang happy with the new machine. It’s taken some time getting accustomed to the full-sized keyboard again after two years on the laptop, and I’m still suffering a bit from not having some tools, but in general I really like the new machine.

Bloom Filters in C#

As I’ve mentioned before, writing a Web crawler is conceptually simple: read a page, extract the links, and then go visit those links. Lather, rinse, repeat. But it gets complicated in a hurry.

The first thing that comes to mind is the problem of duplicate URLs. If you don’t have a way to discard URLs that you’ve already seen, you’ll spend most of your time revisiting pages. The naive approach of keeping an in-memory list of visited URLs breaks down pretty quickly when the number of visited sites moves into the tens of millions. Overflowing that list to disk also has problems: maintaining a sorted index and searching that index quickly are difficult problems in their own right. Databases can handle many billions of records fairly quickly, but I don’t know of a database that can keep up with a distributed crawler that’s processing almost 1,000 documents per second, each of which contains an average of more than 100 links. [As an aside, I was very surprised to find that the average number of links per HTML page–this is based on examining several million documents–is about 130. I would have guessed 15 or 20.]

I ran across a mention of Bloom filters while reading about a different Web crawler. I’ll let you get the details from the linked Wikipedia article, and just say that a Bloom filter is a space-efficient way to test whether an item belongs to a set. It uses hashing, but for the amount of memory it consumes it has a much lower false positive rate than a simple hash table. The cost is slightly more processing time, but the memory savings and the lower false positive rate are well worth it. Besides, my crawler is not at all processor bound.

I coded up a C# class to experiment with Bloom filters, and compared it with a simple hash table. The code along with more detailed explanation is available in my new article, Bloom Filters in C#.
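
If you just want the flavor of it without reading the article, here’s a stripped-down sketch (not the code from the article). It’s a bit array plus k index values per item, generated here with the common double-hashing shortcut; the bit count and hash count you pick determine the false positive rate.

using System;
using System.Collections;
using System.Collections.Generic;

// Minimal Bloom filter sketch, not the class from the article: a bit array
// plus k index values derived from two base hashes (the common
// "double hashing" shortcut). Contains() can return false positives,
// but never false negatives.
public class BloomFilterSketch
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public BloomFilterSketch(int bitCount, int hashCount)
    {
        bits = new BitArray(bitCount);
        this.hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            bits[index] = true;
    }

    // True means "probably in the set"; false means "definitely not in the set".
    public bool Contains(string item)
    {
        foreach (int index in Indexes(item))
            if (!bits[index])
                return false;
        return true;
    }

    private IEnumerable<int> Indexes(string item)
    {
        int h1 = item.GetHashCode();
        int h2 = Fnv1a(item);
        for (int i = 0; i < hashCount; i++)
        {
            int combined = unchecked(h1 + i * h2);
            yield return Math.Abs(combined % bits.Length);
        }
    }

    // A second, independent hash (FNV-1a) so the k indexes differ.
    private static int Fnv1a(string s)
    {
        unchecked
        {
            int hash = (int)2166136261;
            foreach (char c in s)
                hash = (hash ^ c) * 16777619;
            return hash;
        }
    }
}

As a rough rule of thumb, about ten bits per expected item and seven hash functions keeps the false positive rate under one percent.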

A new file system primitive?

A problem I was working on recently got me to wishing that I could lop off the front of a file. Kind of like a “truncate at front,” if you will. Truncating a file at the back end is a common operation–something we do without even thinking much about it. But lopping off the front of a file? Sounds ridiculous at first, but only because we’ve been trained to think that it’s impossible. But a lop operation could be useful in some situations.

A simple example (certainly not the only or necessarily the best example) is a queue. You’re adding new items to the end of the file and pulling items out of the file from the front. The file grows over time and there’s a huge empty space at the front. With current file systems, there are several ways around this problem:

  • As each item is removed, copy the remaining items up to replace it, and truncate the file. Although it works, this solution is very expensive time-wise.
  • Monitor the size of the empty space at the front, and when it reaches a particular size or percentage of the entire file size, move everything up and truncate the file. This is much more efficient than the previous solution, but still costs time when items are moved in the file.
  • Implement a circular queue in the file, adding new items to the hole at the front of the file as items are removed. This can be quite efficient, especially if you don’t mind the possibility of things getting out of order in the queue. If you do care about order, there’s the potential of having to move items around. But in general, a circular queue is pretty easy to implement and manages disk space well (there’s a bare-bones sketch of this approach after this list).
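
To make that third option concrete, here’s a bare-bones sketch of a circular queue of fixed-size records in a single file. The header layout, record size, and fixed capacity are simplifications for the sake of illustration; a real version would need to grow the file, handle variable-length items, and worry about crash recovery.

using System;
using System.IO;

// Sketch of a circular queue of fixed-size records in a single file.
// The header layout, record size, and fixed capacity are simplifications;
// a real version would grow the file, handle variable-length items,
// and guard against crashes and concurrent access.
public class CircularFileQueue : IDisposable
{
    private const int HeaderSize = 12;      // head index, tail index, count (3 x Int32)
    private readonly FileStream file;
    private readonly int recordSize;
    private readonly int capacity;
    private int head, tail, count;

    public CircularFileQueue(string path, int recordSize, int capacity)
    {
        this.recordSize = recordSize;
        this.capacity = capacity;
        file = new FileStream(path, FileMode.Create, FileAccess.ReadWrite);
        file.SetLength(HeaderSize + (long)recordSize * capacity);
        WriteHeader();
    }

    public void Enqueue(byte[] record)
    {
        if (count == capacity)
            throw new InvalidOperationException("queue is full");
        file.Seek(HeaderSize + (long)tail * recordSize, SeekOrigin.Begin);
        file.Write(record, 0, recordSize);
        tail = (tail + 1) % capacity;       // wrap around and reuse slots freed at the front
        count++;
        WriteHeader();
    }

    public byte[] Dequeue()
    {
        if (count == 0)
            throw new InvalidOperationException("queue is empty");
        byte[] record = new byte[recordSize];
        file.Seek(HeaderSize + (long)head * recordSize, SeekOrigin.Begin);
        file.Read(record, 0, recordSize);   // assumes a full read from a local file
        head = (head + 1) % capacity;       // the freed slot gets reused by Enqueue
        count--;
        WriteHeader();
        return record;
    }

    private void WriteHeader()
    {
        file.Seek(0, SeekOrigin.Begin);
        file.Write(BitConverter.GetBytes(head), 0, 4);
        file.Write(BitConverter.GetBytes(tail), 0, 4);
        file.Write(BitConverter.GetBytes(count), 0, 4);
        file.Flush();
    }

    public void Dispose()
    {
        file.Close();
    }
}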

But if there were a lop operation, removing an item from the queue would be as easy as adding one. As easy, in fact, as truncating a file. Why, then, is there no such operation?

Disk files are typically allocated in fixed-size blocks: four kilobytes, for example. If you create a file and only put one character in it, the file system will allocate an entire block for that file. The file allocation table (or equivalent structure) associates the file name with that particular block, and also stores either the length of the file, or the offset in the block of the last used byte.

If a file exceeds the block allocation size, the file system allocates another block, links it to the previous block, and updates the file size or last byte pointer. In this way, files of arbitrary size can be allocated and disk fragmentation is kept, if not to a minimum, at least manageable.

Truncating a file is a very simple and quick operation. If the truncation is within the last allocated block, all the file system has to do is adjust the file size. If truncating removes allocated blocks, the file system adjusts the file size pointer into the final used block and then frees any following blocks that were used by the file.

A lop operation (chopping off the front of the file) would be similarly easy to implement if the file allocation table maintained an “offset to first byte” pointer similar to the file size pointer. If you wanted to lop the file off at position 10, for example, the file system would just have to update the pointer. If a lop occurred in something other than the first block of the file, the file system would have to adjust the pointer into that block, make that block the first block of the file, and then free the previous blocks. Simple and neat.
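
No file system I know of has this, of course, so the best I can do is a toy model in C# with made-up fields, just to show how little bookkeeping is involved:

using System;
using System.Collections.Generic;

// Toy model of the bookkeeping only; nothing here talks to a real disk.
// "Blocks" are simulated block numbers, and the FirstByteOffset field is
// the hypothetical addition this entry is arguing for.
class FileEntrySketch
{
    const int BlockSize = 4096;
    public List<int> Blocks = new List<int>();   // allocated block numbers, in order
    public long FirstByteOffset;                 // offset of the first valid byte in the first block
    public long Length;                          // number of valid bytes

    // Truncate at the back: shrink the length, then free any blocks
    // that no longer hold valid data.
    public void Truncate(long newLength)
    {
        Length = Math.Min(Length, newLength);
        int needed = (int)((FirstByteOffset + Length + BlockSize - 1) / BlockSize);
        Blocks.RemoveRange(needed, Blocks.Count - needed);
    }

    // "Lop" at the front: advance the first-byte offset, then free any
    // leading blocks that now sit entirely before the first valid byte.
    public void Lop(long bytesToRemove)
    {
        bytesToRemove = Math.Min(bytesToRemove, Length);
        FirstByteOffset += bytesToRemove;
        Length -= bytesToRemove;

        int freeable = (int)(FirstByteOffset / BlockSize);
        Blocks.RemoveRange(0, freeable);
        FirstByteOffset -= (long)freeable * BlockSize;
    }
}

Either way the work is a pointer adjustment plus freeing whole blocks, and the cost doesn’t depend on how big the file is.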

The only real cost is a few extra bytes for each directory entry in the file allocation table. In the past, when disk space was expensive and precious, the idea of even one extra byte per directory entry was unthinkable. But with terabyte hard drives appearing on the market for $400 (yeah, 40 cents per gigabyte), the idea of saving a few bytes per directory entry is just silly. Even if we allocated four bytes per directory entry (which is overkill, considering that most file systems use block sizes that are less than 64K, which would require only two bytes), we’re only talking one megabyte for every 250,000 files. On my laptop, where I have about 140,000 files occupying 30 gigabytes, that works out to a little less than one megabyte per 50 gigabytes, or 0.002% of disk space. Seems like a win to me.

Are there other reasons why file systems don’t implement a lop operation? What other operations could file systems implement if designers started worrying less about using a few extra megabytes here and there?