The web crawler I’m working on, as I’ve mentioned before, is a distributed application. Currently it consists of a URL Server and multiple Crawlers. The basic idea is that the URL Server is a traffic director that tells each Crawler machine which Web sites to visit. Each Crawler machine hosts multiple Worker threads, each of which reads a URL from a queue, downloads the page, harvests the links, and reports the results.
Coordinating the Worker threads and the communication with the URL Server (which itself contains several threads) isn’t exactly easy, but it’s manageable. This past week I’ve been working on what I call the Crawler Control Panel: a GUI application that lets me monitor and control the URL Server and the individual Crawlers. Today I finally got around to adding the shutdown commands and making the Crawlers and URL Server shut down gracefully. That turned out to be an interesting experience.
I don’t want to just terminate a Crawler because if I do, I lose data. Each of the Worker threads is in some stage of downloading a Web page and it also has some buffered data and a small batch of URLs (a few dozen) that it has “checked out” but not yet processed. If the Crawler terminates immediately upon receiving a shutdown message, all of the buffered data is lost. To avoid losing data, the Crawler waits for each Worker thread to finish its current URL and then tells the Worker to shut down. Because shutting down takes some time (sending results to the URL server, updating the queue, etc.), it is done asynchronously (on a separate thread) to avoid blocking the Crawler’s main processing thread. The Crawler shuts down after all of the Worker threads have been shut down.
A much-simplified version of the Crawler’s main processing loop looks like this:
    While True
        Wait for WorkerReadyEvent or CrawlerShutdownEvent
        if CrawlerShutdownEvent
            CrawlerShutdown = True
        if WorkerReadyEvent
            While there are Workers in the queue
                Get Worker from queue
                if CrawlerShutdown = True
                    Shutdown Worker asynchronously
                else
                    Tell Worker to process next URL
        if CrawlerShutdown = True and Number of workers = 0
            Terminate
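Here’s roughly the same loop in Python. This is only a sketch, not the actual Crawler code: I’ve collapsed the two events and the ready-Worker queue into a single message queue, and the Worker methods (process_next_url, flush_and_stop) are invented for the example, but the control flow is the same, flaw and all.

    import queue
    import threading

    class ShutdownRequested:
        """Message posted when a shutdown command arrives."""
        pass

    class Crawler:
        def __init__(self, workers):
            self.events = queue.Queue()   # ready Workers and ShutdownRequested messages
            self.workers = set(workers)   # Workers that haven't fully shut down yet

        def worker_ready(self, worker):
            # Called from a Worker thread when it finishes its current URL.
            self.events.put(worker)

        def request_shutdown(self):
            # Called when the shutdown command arrives.
            self.events.put(ShutdownRequested())

        def run(self):
            crawler_shutdown = False
            while True:
                event = self.events.get()           # wait for the next event
                if isinstance(event, ShutdownRequested):
                    crawler_shutdown = True
                elif crawler_shutdown:
                    # Shut the Worker down asynchronously so this loop isn't blocked.
                    threading.Thread(target=self._shut_down_worker,
                                     args=(event,)).start()
                else:
                    event.process_next_url()        # tell the Worker to do the next URL
                if crawler_shutdown and not self.workers:
                    return

        def _shut_down_worker(self, worker):
            worker.flush_and_stop()        # send results to the URL Server, update the queue
            self.workers.discard(worker)   # the Worker leaves the set only after it finishes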
That looks like it should work, right? Imagine my surprise when all the Worker threads shut down but the Crawler wouldn’t terminate. It’s painfully obvious now, but at the time I was scratching my head. It’s easy to see what’s wrong if you imagine that there is a single Worker thread. Everything works fine until the CrawlerShutdownEvent is signaled. Then, events happen in this order:
- The CrawlerShutdown flag is set.
- At some point, the Worker thread finishes with its current URL, gets put into the queue, and signals the WorkerReadyEvent.
- The Crawler dequeues the Worker, notes that CrawlerShutdown is set, and spawns a new thread to shut down the Worker.
- While the Worker is shutting down, the Crawler checks to see if the number of workers is zero. Since the Worker hasn’t fully shut down, it hasn’t been removed from the list. The Crawler goes back to waiting for the next event.
- The Worker terminates and no more events are forthcoming.
My first idea was to decrement the count of Workers in the main loop, but that won’t work: the program might then terminate before the Worker has finished saving its buffered data. No, the answer is to put a timeout (I used five seconds) on the wait at the top of the loop. So the code looks like this:
    While True
        Wait up to five seconds for WorkerReadyEvent or CrawlerShutdownEvent
        if the wait didn't time out
            // do all that other stuff
        if CrawlerShutdown = True and Number of workers = 0
            Terminate
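In the Python sketch above, the only change is putting a timeout on the queue read and falling through when it expires:

    def run(self):
        crawler_shutdown = False
        while True:
            try:
                # Wait up to five seconds for the next event.
                event = self.events.get(timeout=5.0)
            except queue.Empty:
                event = None                        # the wait timed out
            if event is not None:
                if isinstance(event, ShutdownRequested):
                    crawler_shutdown = True
                elif crawler_shutdown:
                    threading.Thread(target=self._shut_down_worker,
                                     args=(event,)).start()
                else:
                    event.process_next_url()
            # Checked on every pass, even after a timeout, so the loop notices
            # when the last Worker finally finishes shutting down.
            if crawler_shutdown and not self.workers:
                return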
Sometimes–rarely, but it happens–you do want a Crawler to terminate immediately. Simple enough, right? Just signal the CrawlerShutdownEvent and pass a flag that says, in effect, “terminate right now without waiting for the Worker processes.” That was absurdly easy to implement, but then I ran into another problem.
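In terms of the earlier sketch, the immediate path amounts to little more than a flag on the shutdown message (names still invented); the shutdown branch in the main loop then just returns right away when the flag is set, instead of setting CrawlerShutdown and waiting for the Workers.

    class ShutdownRequested:
        """Shutdown message, now with an "immediate" option."""
        def __init__(self, immediate=False):
            # True means: terminate right now, without waiting for the Workers.
            self.immediate = immediate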
Remember the Crawler Control Panel (CCP)? It has a persistent connection to the URL Server and to each of the Crawlers. To terminate a Crawler or the URL Server, the CCP sends a message on that connection, and every such message requires a response. That usually isn’t a problem. But sometimes, because things are happening on multiple threads, the Crawler terminates and closes the connection before it returns a response to the CCP. The result is a big ugly exception: “connection forcibly closed,” or something similar.
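One way out of that particular race is to make the message handler send its response while the connection is still known to be open, and only then start the shutdown. A rough sketch, with the connection API invented for the example:

    def handle_terminate(self, connection, immediate=False):
        # Respond while the connection is still open, and only then begin shutting down.
        connection.send_response("shutting down")
        self.events.put(ShutdownRequested(immediate))
        connection.close()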
Some days you just can’t win.
Anybody who’s done any multi-threaded or distributed processing is familiar with these kinds of problems. The interesting thing is that most programmers–even the experienced ones–continually run into stuff like this. Thinking about concurrency is hard. You have to think of all the wacky cases–events that can happen at any time in almost any order–and write your code so that it handles those cases intelligently.
When you’re writing this kind of code, paranoia pays. No wonder I don’t sleep nights.