I’ve been head-down here working on the Web crawler and haven’t had much occasion to sit down and write blog entries. It’s been a very busy but interesting and rewarding time. A high-performance distributed Web crawler is a rather complex piece of work. I knew that it would be involved, but there’s a lot more to it than I originally thought. Managing hundreds of concurrent socket connections on a single machine and coordinating multiple machines gets complicated in a hurry.
It’s been a long time since I did any serious multi-threaded programming, and I’m re-learning some lessons:
- When there are multiple threads, everything is subject to concurrency issues.
- Unless you can guarantee that a data structure cannot possibly be accessed by more than one thread of execution at a time, you will eventually experience a problem if you don’t protect it with some kind of synchronization lock.
- You’ll find that guarantee very difficult to make.
- It can take hours or days before your program crashes inexplicably.
- It can take just as long to find that one place where you forgot to protect a global data structure with a mutual exclusion lock.
- If you’re developing a multi-threaded application on a single-processor machine, you’ll invariably end up finding more bugs when you move to a dual-processor machine where the computer really can do more than one thing at a time.
- Learn how to use trace logging and prepare to spend a lot of time examining the log files for problems.
- Race conditions and deadlocks are facts of life, difficult to predict, and painfully obvious in retrospect.
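The lock rule above boils down to a few lines of Python. This is a minimal sketch; the thread and iteration counts are arbitrary:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        # counter += 1 is a read-modify-write. Without the lock, two
        # threads can read the same value and an increment gets lost;
        # sometimes only after hours of running, and more often on a
        # machine that really does execute threads in parallel.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 every time with the lock
```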
It’s really good to be programming again.
David and I went to Fry’s on Monday to pick up parts for three new computers. We’ve been plugging away on this project for the last two months using our personal machines and some castoffs, and started bumping into hardware limitations. Time to upgrade.
The new machines are Intel Core 2 Duo, 2.4 gigahertz with 4 MB L2 cache. We put 4 gigabytes of RAM on two of the machines, and 8 gigs on the other. We’re going to need lots of memory.
We put two 750 GB Seagate SATA drives on each machine, striped so that Windows sees them as a single 1.36 terabyte drive. It takes approximately half of forever to format the dang thing.
I’ve installed Windows XP Pro, x64 edition. Haven’t had a chance yet to put the machine through its paces. I’ll have to find some kind of benchmark to see what it can do.
Perhaps the best thing about the new machines is the case: Antec Sonata II. With the CPU fan, power supply fan, standard case fan, and the additional front fan, the thing is still whisper quiet. It’s by far the quietest computer I’ve ever owned, including my laptop. And working on it? It’s a joy. Everything is well laid out, solidly constructed, and just plain done right. You can buy a cheaper case but as far as I’m concerned the 85 bucks for one of these is money well spent.
More once I’ve had a chance to work with the machine a bit. I can’t wait to allocate a 2 gigabyte array.
I uninstalled the Nero InCD program today while I was cleaning up my Windows machine. The uninstall program told me, before I confirmed that I wanted to remove the program, that I would have to reboot in order to complete the task. Fine. Except that when it was done uninstalling, it displayed this message box.
Would it have been so difficult for them to add a line that says, “Would you like to restart now?”
Almost two years ago in The Housing Bubble, I warned that the then-current home building boom was unsustainable and that soon we would begin to see record numbers of defaults and foreclosures. I said then that I hoped I was wrong. I wasn’t.
Last Tuesday, the Mortgage Bankers Association released their latest National Delinquency Survey (for the end of 2006) and issued a press release summarizing the report. According to the report:
- The delinquency rate for all home mortgage loans was 4.59 percent. This represents an increase in delinquencies for all loan types, but particularly for subprime and FHA loans.
- The percentage of loans in the foreclosure process was 1.19 percent of all loans outstanding. This, too, is a significant increase over the 2005 numbers.
- The foreclosure rate for subprime loans was 4.53 percent, up from 3.86 percent a year before.
Things are getting bad. Mississippi has an overall delinquency rate of 10.64 percent, Louisiana 9.1 percent, and Michigan 7.87 percent. Foreclosures in Ohio are at 3.83 percent. In Illinois, 6.22 percent of all subprime loans were in foreclosure at the end of 2006 (source). It’s almost certain that the numbers are much worse now, almost three months later.
The not surprising part of this whole mess is, now that the things many analysts had predicted and warned about two and three years ago are coming to pass, people are starting to look for somebody to blame. It couldn’t be their own dang fault for taking on the risk of getting a subprime loan. Of course not. It’s predatory lending practices or the government’s fault:
The Administration and Congress helped create the sub-prime crisis by ignoring warning signs and, as a result, they bear some responsibility for assisting the families facing payment shock and foreclosure…
This Administration and the previous Congress allowed the horse not only to gallop out of the barn but jump over the cliff as well…
We call on this Administration and this Congress to take the reins back by immediately allowing FHA to play a role in helping borrowers and passing legislation to protect not only borrowers but also the nation’s overall economy.
The denial is almost over, and the anger has begun. It’s time to find a scapegoat. I said two years ago that this was going to be worse than the fallout from the late 1980s S&L crisis. It’s going to be much worse than I ever imagined.
(Note 2007/04/18: I closed comments on this entry because it has attracted a large amount of comment spam.)
Tasha the poodle came to us in May of 1997 along with her mom, Tiffany. Tasha was tiny when we got her. I think she weighed all of five pounds. I was always afraid that I’d somehow hurt her.
Timid as she was, and kind of dull personality-wise compared with Tiffany, Tasha still managed to work herself into my heart. She’d come by and look at me with those adoring eyes and there’s no way I couldn’t love her. She wasn’t particularly snuggly towards me, but when she did curl up with me she’d just sprawl out. She’d lie there for as long as I’d pet her. Crazy little poodle.
The thing we’ll remember the most about Tasha, though, is that she always put Charlie in his place. That’s right, our six-pound toy poodle kept our 70-pound pit bull in line. If he ran her over, she’d get up and give him a nasty scolding. More than once she ended up literally hanging from his lip, she bit so hard. That was quite a trick, too, considering that she didn’t have a whole lot of teeth left. Charlie always looked properly contrite. He’d shake his head and slink away carefully so as not to incur her wrath again.
Tasha was sick a lot. She had a collapsed trachea, enlarged heart, hip problems, back problems, and undoubtedly other problems of which I know nothing. When we got Charlie five years ago, we thought Tasha was on her last legs. Charlie perked her up, though, and she was quite spry for the next few years. Last summer she got sick and bounced back. Same thing happened again in November. We’d think, “this is it,” but somehow she’d miraculously recover.
But not today. She’d been deteriorating quickly over the last several days, and this morning she wasn’t moving well at all. A few hours at the vet and we determined that she just wasn’t going to get better this time. Something was causing her vertebrae to deteriorate, making it increasingly difficult for her to walk or stand. There was nothing the doctors could do for her. They told us that they couldn’t even understand how she’d lasted this long.
We cried. Of course we did. But we couldn’t watch her in pain any longer. We petted her and said goodbye, and held her as the doctor administered the shot. Tasha was a part of our family, a source of amusement and a loyal friend. We will miss her.
After you get your basic web crawler downloading pages and extracting links, you find yourself having to make a decision: how do you feed the harvested URLs back into the crawler? For instance, if I visit www.mischel.com and extract a link to blog.mischel.com, how do I feed that new link back to the crawler so that it will download and examine that page? At first glance, it doesn’t seem like a problem at all: simply add the new link to the queue. If only it were so easy.
There are several problems one must deal with. First, you have to make sure that the crawler hasn’t already visited that page, and that the new URL isn’t already in the queue. If you just add every harvested link to the queue, you’ll soon find yourself in an infinite loop. Not a problem, right? Just keep a list of visited links. Except there are billions of URLs. If you’re going to crawl any significant portion of the web, you won’t be able to keep the queue or the visited URLs list in memory. Forget storing the things in a traditional database. SQL Server, MySQL, and the like won’t perform well enough to keep your bandwidth saturated.
You could devise a hashing scheme and keep a big array (on the order of four to eight gigabytes) of bits in memory. That would solve the problem at the risk of some small number of false positives: URLs that hash to the same key. If you have a distributed crawler, you can soup up one machine to do all the URL management tasks and have the individual crawler machines send their harvested links to it. This “URL server” makes sure that no link is visited twice, and also controls which URLs the individual crawlers visit. You can draw a pretty picture that shows how it all works together. It’s feasible, but seriously complicated.
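Scaled way down, the bit-array idea looks something like this. The hash choice and sizes here are illustrative, not what a production crawler would use:

```python
import hashlib

class SeenUrls:
    """Membership test via a bit array. A hash collision gives a false
    positive (a URL wrongly reported as seen); there are no false negatives."""
    def __init__(self, size_bits=1 << 20):  # a real crawler wants billions of bits
        self.size = size_bits
        self.bits = bytearray(size_bits // 8)

    def _index(self, url):
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        i = self._index(url)
        self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, url):
        i = self._index(url)
        return bool(self.bits[i // 8] & (1 << (i % 8)))

seen = SeenUrls()
seen.add("http://www.mischel.com/")
print("http://www.mischel.com/" in seen)  # True
```

A Bloom filter is the standard refinement of this: several independent hashes per URL drive the false-positive rate down for the same amount of memory.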
There is a simpler method. The URL server tells each crawling process which URLs to visit. They report their harvested links back to the server, which simply writes them to disk. When all of the crawlers have exhausted the links that they’ve been given, they exit. The URL server then sorts the harvested links file, eliminates duplicates by merging with the existing visited links file, and then starts another crawler pass with the newly discovered links. Granted, that sort and merge is going to take a lot of time, but the job can be split up among many machines.
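In miniature, the sort-and-merge step might look like this; the real version streams sorted files from disk rather than holding lists in memory:

```python
def merge_pass(harvested, visited):
    """harvested: links collected this pass, any order, duplicates allowed.
    visited: sorted list of URLs already crawled.
    Returns (new_urls, updated_visited), both sorted."""
    batch = sorted(set(harvested))  # the big external sort, in miniature
    new_urls, updated, j = [], [], 0
    for url in batch:
        # advance through visited, copying entries we pass over
        while j < len(visited) and visited[j] < url:
            updated.append(visited[j])
            j += 1
        if j < len(visited) and visited[j] == url:
            continue  # already crawled; skip it
        new_urls.append(url)
        updated.append(url)
    updated.extend(visited[j:])  # copy the tail of the visited list
    return new_urls, updated

new, merged = merge_pass(["b", "c", "b", "d"], ["a", "c"])
print(new)     # ['b', 'd']
print(merged)  # ['a', 'b', 'c', 'd']
```

The new_urls list seeds the next crawler pass, and updated_visited becomes the visited file that the following merge runs against.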
This approach also has the benefit of providing a stopping point where some other process can take the collected data up to this point and start processing. After all, the point of crawling the web is to collect information, not just have the crawlers out there harvesting links.
There is a major drawback to this approach. When the crawlers are idle, you’re not using any of that expensive bandwidth you’re paying for. It is possible to use a hybrid approach, where the URL server fires off a sort/merge (on a separate machine) after collecting some large number of links, and continues to keep the crawlers busy while the other machine is doing its thing. When it’s done, that other machine gives the URL server another batch of links to process. This keeps the bandwidth saturated while still simplifying the process of identifying duplicate URLs.
Of course, if you have something else to do with the bandwidth while the sort/merge is going on, then the crawl-process-crawl scheme comes in very handy indeed.
That’s a surprise, huh? You came here expecting to see the same old Random Notes page and you got something entirely different. It’s still me. But I’m in the process of changing things around a bit.
I’ve installed WordPress on my hosting server. WordPress is an online blogging tool that lets me post entries from a Web browser, and also allows readers to post comments and trackbacks. It also allows me to categorize posts and do many other blog-related things that I can’t do or can’t do easily with CityDesk.
The template here looks a little plain, I know. I’m using the default WordPress template for now. I’ll be selecting or creating a new template soon. Until then, this will do.
I haven’t decided yet whether I’ll be moving the last six years of entries to WordPress. Although WordPress will allow me to import entries, and I’m perfectly capable of writing a program that will convert the CityDesk files to WordPress import format, there are some issues like images and links to other entries that make the conversion a little bit more complicated. And there’s no way I’m going to hand-enter 1,500 historical entries.
In any case, that’s what’s happening.
If you want to see the old entries, go to http://www.mischel.com/diary/oldindex.htm.
I’m writing a Web crawler. Yeah, I know. It’s already been done. It seems like everybody’s done some Web crawling. But there’s a huge difference between dabbling at it and writing a scalable, high-performance Web crawler that can pull down hundreds or thousands of documents per second. Granted, processing more than a few dozen documents per second will require a heck of a lot more bandwidth than I currently have, and if I want to get to the thousands I’ll need multiple machines working together.
But first things first. Before you can scale up, you need something that works no matter how slow it is. This turns out to be more difficult than it seems at first glance. The idea is very simple: read a Web page, scan it for links, put those links in a queue, grab a link from the queue, lather, rinse, repeat. Ten billion times.
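That simple version fits in a page of Python using nothing but the standard library. This sketch punts on everything that makes the job hard, including politeness, robots.txt, URL canonicalization, and of course scale:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkParser(HTMLParser):
    """Collect the href of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=100):
    """Breadth-first crawl from seed, stopping after limit page fetches."""
    queue, seen = deque([seed]), {seed}
    while queue and limit > 0:
        url = queue.popleft()
        limit -= 1
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # dead link, timeout, connection refused...
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```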
Ten billion is a big number. If you process 20 documents per second, it’ll only take you 500 million seconds to finish the job. That’s almost 16 years. Obviously, you need to process things much faster than that. It’s a bandwidth issue and a machine optimization issue. Let’s look at both.
The easy part is bandwidth. It’s hard to get an authoritative figure, but a good round number for average Web page size (HTML only) is about 25 kilobytes. The best throughput I’ve seen on my cable modem is about 5 megabits (approx 500 kilobytes) per second. So figure a sustained upper limit of 20 pages per second. That’s 20 HTML pages downloaded and links extracted. Doesn’t seem so tough, right? I mean, a 2 GHz computer with a 100 megabit network card can easily keep up with that.
It’s harder than it sounds. Not because the processor can’t keep up, but because it’s incredibly difficult to saturate the bandwidth. There’s a whole lot of latency involved in reading Web pages. Most of the time, the computer is just sitting there waiting for servers to return data. The solution is to keep multiple connections going so that while some are waiting, others are pulling down data. That turns out to be a non-trivial task. First, because managing dozens or hundreds of concurrent connections is hard, and second because domain name resolution takes an absurdly long time. And it’s often implemented very inefficiently.
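The overlap idea can be sketched with a thread pool from the standard library (my crawler uses an asynchronous HTTP library rather than threads, but the principle is the same):

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    """Download one page; return (url, body), or (url, exception) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read()
    except OSError as exc:
        return url, exc

def fetch_all(urls, workers=50):
    # While some workers sit in DNS lookups or wait on slow servers,
    # others are pulling down data; that overlap is what fills the pipe.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

The right number of workers depends on your bandwidth and typical server latency; 50 is a placeholder, not a recommendation.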
There are many other issues you have to think about if you want to crawl the Web. You’d think that keeping track of the URLs you’ve visited and those that you haven’t visited yet would be easy. And it would be if the numbers were manageable. 10 billion documents means 10 billion URLs. Just the URLs themselves, stored naively, would take close to a terabyte. So you have to come up with a memory-efficient way to store them and a time-efficient way to search them.
There are other issues as well. For example, did you know that www.mischel.com and mischel.com go to exactly the same place? That’s a trivial example, but you’ll find out that some Web pages have many different URLs. If you don’t have some way of resolving those and other traps, URLs with odd query strings, unique session IDs, and any number of other oddities will quickly have you chasing your own tail.
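A first cut at canonicalization might look like the following. The www-stripping rule and the sessionid parameter name are illustrative assumptions; real rules get much hairier, and sometimes per-site:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Reduce equivalent URLs to one canonical form. This version lowercases
    the host, treats www.example.com like example.com, defaults an empty
    path to "/", drops fragments, and strips a hypothetical sessionid query
    parameter. It also ignores ports, which a real crawler couldn't do."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    query = "&".join(p for p in parts.query.split("&")
                     if p and not p.lower().startswith("sessionid="))
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", query, ""))
```

With rules like these, www.mischel.com and mischel.com normalize to the same string, so the visited-URL check sees them as one page.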
Today I finally got my first multiple-connection web crawler up and limping. It took a whole lot longer than I thought it would because the asynchronous HTTP library I downloaded was buggy and it took me forever to figure out how to make Python break out of an infinite loop. For all that’s good about Python, the development and debugging environment is distressingly primitive.
All of which is to say that I’ve latched onto a project that is very interesting and challenging, and doesn’t leave me time for much else. When I’m not banging on code, I’m reading research papers about crawling, storing, indexing, and searching large amounts of data, or thinking about better ways to implement parts of my crawler. It’s exhausting, but pleasantly so.