The ultimate developer machine?

In Understanding the Hardware, Jeff Atwood describes his “best bang for the buck developer x86 box,” at a cost of about $1,100.  The system he describes is quite a nice development machine, although it’s probably overkill for a lot of developers.  Seriously.  How many developers do you know who really need a 10,000 RPM drive and a screaming video card?

Surprisingly, he doesn’t mention what case he’s going to put all that fancy hardware in.  I’d really like to know.  I’ve mentioned before that I like the Antec Sonata cases because they’re very quiet.  But with their fans, they almost certainly create more noise than whatever Jeff’s using for passive cooling.

My development machine these days is quite a bit different from what he describes, but I realize that I have somewhat different needs.  I’ll give you a quick rundown.

Start with a Dell Precision 490 case, with power supply and motherboard.  These can be had for under $200 on eBay, or from Dell surplus suppliers.  They’re starting to become a bit scarce on the surplus market now, because most have gone off lease and Dell doesn’t make that model anymore.  One drawback to this system is that it creates a bit more noise than the Antec case, but I’ve found that I can accept a certain amount of noise.  And it’s hard to beat the price.

Add a quad-core Xeon E5335 processor running at 2 GHz.  Granted, 2 GHz isn’t exactly blindingly fast, but it’s quite well suited to the work that I do.  Unlike most developers, the code I’m working on does benefit from multiple cores.  The motherboard in this 490 has two processor sockets, so I could potentially run two of those quad-core Xeons.  And I can make good use of all eight cores.  The Xeon is pretty pricey if you buy it new.  You might consider picking one up on eBay.  We’ve purchased dozens of these processors on eBay and haven’t had a problem with any of them.

I would have been shocked a year ago if somebody told me that I’d have a need for more than 8 gigabytes of RAM.  But the stuff I’m doing is memory hungry in the extreme.  This is another reason we go for the Dell 490 motherboard:  it was one of very few that supported 16 gigabytes a year ago, and I use every bit of it.  At about $80 for four gigabytes, memory is still a bit expensive.  But the stuff we’re working on really does need all the memory it can get.

I also use a lot of disk space.  Hard disk speed is important, but capacity is way more important to me.  I’ve loaded the box with two 7,200 RPM 750-gigabyte drives.  Terabyte drives are available, but at a premium.  The 750 GB drives go for about $120, or roughly 16 cents per gigabyte.  A terabyte drive will run about $220, or 22 cents per gigabyte.  If I need more storage, I’ll find a way to shoehorn a third drive into this Dell box.

I’m not writing computer games, and I’ve turned off all the fancy Windows Aero features that do nothing but annoy me and chew up system resources.  My video card is a low-end ATI Sapphire 1650 for which we paid less than $50.  It drives my 24″ LCD at 1920 x 1200 resolution just fine.  I have no need for really high end video performance.

When you add everything up and throw in the DVD burner, we can put together one of these machines for under $1,500, which isn’t very much more than Jeff’s system once he adds the case and DVD.

I realize that I’m somewhat out of the ordinary, working with programs that require multiple cores as well as enormous amounts of memory and disk space.  I suspect that my ultimate development machine would be complete overkill for most developers.  But I find it interesting to compare what other developers need against what I’m using.

Do you have an ultimate developer machine?  Drop me a note.

An aside:
Jeff also uses the word commodification, as in, “This industry was built on the commodification of hardware. If you can snap together a Lego kit, you can build a computer.”  I had to read that twice before I realized that he wasn’t talking about turning hardware into toilets.  Commodification?  Please stop.

Is that code really from Sun?

I updated my Java runtime the other day, and now every time I open a new tab in Internet Explorer, I get this message box:

It looks like somebody at Sun forgot to sign their update agent. At least, I think this control came from Sun.  But there’s no way to be sure, is there? Do I blindly assume that this really is from Sun and that they made a mistake in generating the build, or do I do the prudent thing and permanently disallow it?

In a security-conscious world, there’s no excuse for a major player like Sun to have released something with this error. If an obvious bug like this makes it through their quality control, one wonders what other, less obvious nasties are lurking in the code.

To heck with it. If Sun wants to push their software on me, they’ll have to get it right. I’m going to disallow the update agent. If I ever need to update my Java runtime, I guess I’ll just have to do it manually.

Charlie versus the wildlife. Again.

Every time I get to thinking that maybe Charlie’s learned not to mess with the local wildlife, he does something incredibly stupid to set me straight. Last night I let him out just before going to bed. He stood there by the door for a minute and then took off around the corner after something. Thirty seconds later he was running across the yard with his face in the grass, and the unmistakable aroma of skunk assaulted my olfactory system.

Yes, Charlie got another skunk. More correctly, the skunk got him. Not only does the dog stink (he’s at the vet now, getting a skunk bath), but the skunk let loose around the side of the house–right next to the air conditioning unit. The house reeks. I’m at home today with the windows open and the whole-house fan pulling in the 95-degree air, hoping to get rid of that smell.

This is Charlie’s second skunk. I had hoped that after the last time he would have learned that the stinky black kitty with the white stripe is strictly hands-off. Sadly, he seems to be a slow learner.

C# and .NET: What’s next?

About 10 days ago, MSDN’s Channel 9 site released an hour-long video entitled Meet the Design Team, which talks in very vague terms about upcoming features in C# 4.0. You’ll learn that the language will include more dynamic constructs and built-in support for multiple cores. Honestly, that’s about all you’ll learn from watching the video. Granted, either one of those broad features implies many changes to the language and to the underlying runtime.

Improvements to the language are all well and good, but given the choice I’d rather have them address some fundamental runtime issues: the two-gigabyte limit, and garbage collection. Both of these issues have caused me no end of grief over the past year.

All things considered, the .NET garbage collector is a definite win. It handles the majority of memory management tasks much better than most programmers do. It’s not impossible to create a memory leak in a .NET program, but you really have to try. Unfortunately, garbage collection is not free. You’ll find that out pretty quickly if you write a long-running program that does a lot of string manipulation. For example, take a look at this clip, which shows bandwidth usage from a Web crawler written in .NET:

Those times of zero bandwidth usage you see coincide with the garbage collector pausing all the threads to clean things up. We lose somewhere around 10% of our potential bandwidth usage due to garbage collection. This particular graph is from a dual-core machine. The graph looks the same on a quad-core processor.
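
To make the string-manipulation point concrete, here’s a small illustrative program (a sketch of the allocation pattern, not our crawler code). The naive version allocates a new string on every concatenation; the StringBuilder version produces far less garbage:

using System;
using System.Text;

class GarbageDemo
{
    // Naive version: every += allocates a brand-new string and turns the
    // old one into garbage, so a long-running loop like this keeps the
    // garbage collector busy.
    static string BuildReportNaive(string[] items)
    {
        string report = string.Empty;
        foreach (string item in items)
        {
            report += item + Environment.NewLine;   // new string on every pass
        }
        return report;
    }

    // StringBuilder version: one buffer that grows occasionally, so there
    // are far fewer temporary objects for the collector to clean up.
    static string BuildReportBuffered(string[] items)
    {
        StringBuilder report = new StringBuilder();
        foreach (string item in items)
        {
            report.AppendLine(item);
        }
        return report.ToString();
    }

    static void Main()
    {
        string[] items = new string[10000];
        for (int i = 0; i < items.Length; i++)
        {
            items[i] = "record " + i;
        }

        // Both produce the same text; the first churns the heap much harder.
        Console.WriteLine(BuildReportNaive(items).Length);
        Console.WriteLine(BuildReportBuffered(items).Length);
    }
}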

Obviously, they’ll have to do something about the garbage collector if they’re going to support multiple cores. No amount of multi-core support in the language or in the runtime will do me a bit of good if every core stops whenever the garbage collector kicks in.

I’ve mentioned the .NET two-gigabyte limit before. The 64-bit runtime has access to as much memory as you can put in a machine, but no single object can be larger than two gigabytes.  When you’re working with data sets that contain hundreds of millions of items, that’s just not acceptable. When $2,000 will buy you a machine with 16 gigabytes of memory, it’s time that the .NET runtime give me the ability to allocate an object that makes use of that capacity.
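
Here’s a minimal sketch of the limit in action (the sizes are arbitrary, and it assumes a 64-bit process with memory to spare). A single 2.4 GB array fails even though the machine could easily hold it; splitting the data across smaller objects works, at the cost of complicating every piece of code that touches it:

using System;

class TwoGigabyteLimit
{
    static void Main()
    {
        try
        {
            // 300 million longs is roughly 2.4 GB in a single object. On the
            // 64-bit CLR this fails even with 16 GB of RAM free, because no
            // single object may exceed 2 GB.
            long[] giant = new long[300 * 1000 * 1000];
            Console.WriteLine("Allocated {0} elements", giant.Length);
        }
        catch (OutOfMemoryException)
        {
            Console.WriteLine("Single-object allocation over 2 GB failed.");
        }

        // The workaround: carve the data into chunks that each stay under the
        // limit. Total memory used is the same, but no one object crosses the
        // 2 GB line. (Assumes a 64-bit process with enough physical memory.)
        const int ChunkSize = 50 * 1000 * 1000;     // about 400 MB of longs each
        long[][] chunks = new long[6][];
        for (int i = 0; i < chunks.Length; i++)
        {
            chunks[i] = new long[ChunkSize];
        }
        Console.WriteLine("Allocated {0} chunks of {1} longs each",
            chunks.Length, ChunkSize);
    }
}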

I’m happy to see the team continue improving the C# language. I’ll undoubtedly find many of their improvements useful. But no amount of language improvement will increase my productivity if I’m hamstrung by the absurd limit on individual object size and the garbage collector continues to eat my processor cycles.

Unfortunately, we’ll have to wait a bit longer before we know what all will be included in the next versions of C# and .NET. Microsoft is keeping pretty quiet, apparently in an attempt to make a big splash at the Professional Developers Conference in October.

Anybody care to pay my way to the conference?

More URL filtering

Last week I mentioned proxies and other URL filtering issues that we’ve encountered when crawling the Web.  A problem that continually plagues us is repeated path components–URLs like these:

http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3
http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3

I don’t know why some sites do that, but a crawler can easily get caught in a trap and will generate such URLs indefinitely. Or until our self-imposed URL length limit kicks in. Most of the time when that happens, we discover that all the URLs resolve to the same file, and removing the repeated path component (i.e. creating http://www.example.com/mp3/song.mp3) is the right thing to do.

A single repeated component is by far the most common, but we frequently see two or three repeated components:

http://www.example.com/mp3/download/mp3/download/mp3/download/song.mp3
http://www.example.com/mp3/Rush/download/mp3/Rush/download/song.mp3

It’s easy enough to write regular expressions that identify the repeated path components, and replacing the repeats with a single copy is trivial. But it’s not a good general solution.  For example, this blog (and many others) uses URLs of the form blog.mischel.com/yyyy/mm/dd/post-name/, so the entry for July 7 is blog.mischel.com/2008/07/07/post-name/. Globally applying the repeated-component removal rules would break a very large number of URLs.
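
For what it’s worth, here’s a sketch of the kind of expression I mean (an illustration, not the filter we actually run). It collapses immediately repeated path segments, and the last example shows exactly how it would mangle date-based URLs like this blog’s:

using System;
using System.Text.RegularExpressions;

class RepeatedSegmentFilter
{
    // Matches one or more path segments that are immediately repeated,
    // e.g. "/mp3/mp3/mp3" or "/mp3/download/mp3/download". The lazy
    // quantifier keeps the repeating unit as short as possible, and the
    // lookahead makes sure only whole segments are matched.
    static readonly Regex RepeatedSegments =
        new Regex(@"((/[^/]+)+?)\1+(?=/|$)", RegexOptions.Compiled);

    static string CollapseRepeats(string url)
    {
        // Replace the whole run with a single copy of the repeating unit.
        return RepeatedSegments.Replace(url, "$1");
    }

    static void Main()
    {
        Console.WriteLine(CollapseRepeats(
            "http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3"));
        // -> http://www.example.com/mp3/song.mp3

        Console.WriteLine(CollapseRepeats(
            "http://www.example.com/mp3/download/mp3/download/mp3/download/song.mp3"));
        // -> http://www.example.com/mp3/download/song.mp3

        // Legitimate repetition gets mangled, which is exactly why the
        // rule can't be applied globally.
        Console.WriteLine(CollapseRepeats(
            "http://blog.mischel.com/2008/07/07/post-name/"));
        // -> http://blog.mischel.com/2008/07/post-name/  (not what we want)
    }
}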

This is one of the many URL filtering problems for which there is no good global solution.  Sometimes, repeated path components are legitimate. We can use some heuristics based on the crawl history (i.e. if /mp3/song.mp3 generates /mp3/mp3/song.mp3) to identify problem sites, but in the end we end up having to write domain-specific filtering rules. Manually identifying and coding around the dozen or so worst offenders makes a big dent in the problem.

Another per-domain problem is that of session IDs encoded within the path, or with uncommon parameter names.  For example, we can easily identify and remove common ids like PHPSESSID= and sessionid=, but these URLs will escape the filter unscathed:

http://www.example.com/file.html?exSession=123456xyzzy
http://www.example.com/file.html?exSession=845038plugh
http://www.example.com/coolstuff/123456xyzzy/index.html
http://www.example.com/coolstuff/845038plugh/index.html

It’s easy for humans to look at the first two URLs and determine that they likely go to the same place.  Same for the second pair.  The computer isn’t quite that smart, though, and making it that smart is very difficult.
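
Stripping the well-known parameter names is the easy part; a sketch of that follows (exSession is just the made-up name from the example above). The IDs embedded in the path, as in the last two URLs, are the ones that defeat this kind of rule:

using System;
using System.Text.RegularExpressions;

class SessionIdFilter
{
    // Query-string parameters we treat as session identifiers. PHPSESSID
    // and sessionid are the common ones; site-specific names get added as
    // we find them.
    static readonly Regex SessionParams = new Regex(
        @"(?<=[?&])(PHPSESSID|sessionid|exSession)=[^&]*&?",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    static string StripSessionIds(string url)
    {
        string cleaned = SessionParams.Replace(url, string.Empty);
        // Tidy up a dangling "?" or "&" left behind by the removal.
        return cleaned.TrimEnd('?', '&');
    }

    static void Main()
    {
        Console.WriteLine(StripSessionIds(
            "http://www.example.com/file.html?exSession=123456xyzzy"));
        // -> http://www.example.com/file.html

        Console.WriteLine(StripSessionIds(
            "http://www.example.com/file.html?page=2&PHPSESSID=0123456789abcdef"));
        // -> http://www.example.com/file.html?page=2
    }
}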

Developing a system that automatically identifies problem URLs and generates filtering rules is a “big-R” research project–something that we don’t have time to work on at the moment.  Even if we were to develop such a thing, it’d be pretty fragile and would require constant monitoring and tweaking. If a site’s URL format changes (something that happens with distressing frequency), the filtering rules become invalid. Usually the effect will be letting through some stuff that should have been filtered, but in rare cases a change in the input data can lead to the filter rejecting a large number of URLs that it should have passed.

When I started this project, I knew that crawling the Web was non-trivial. But it turns out that the URL filtering problem is much more complex than I expected the entire Web crawler to be.

Odds ‘n Ends

  • Tom’s Hardware is running a review of solid state drives that compares the latest generation of SSDs against current mechanical drive technology.  It’s little surprise that SSDs are in general faster than hard drives.  What I found surprising is that some SSDs actually require more power than hard drives.  Not the newer crop, though.  Even the least efficient SSD has better performance-per-watt numbers than the most efficient hard drive.  And the OCZ SATA II is very impressive.
  • Solid state drives are still very expensive, though.  The 64 gigabyte OCZ SATA II will cost you about $17 per gigabyte.  That’s the high end.  Typical SSD prices are in the $10 per gigabyte range.  That’s a whole lot more than you’ll pay for a mechanical hard drive.  You can pick up a 320 GB notebook drive for $110, or about 34 cents per gigabyte.  It’s nice to know that SSD is coming along, but it’ll be a year or two before I can justify replacing my notebook’s hard drive.
  • If you’re interested in using Windows Server 2008 as a workstation operating system, you should visit win2008workstation.com. But be careful. The site has a lot of good information, but there’s a large hacker/cracker component that sees nothing wrong with sharing component files. I wouldn’t trust downloading anything pointed to by forum posts.
  • If you’re in the market for a “dual core” laptop, be careful. Intel’s “Core Duo” processors are, in effect, two Pentium M processors on one die, and they are 32-bit processors. You probably want a machine that has a “Core 2 Duo” processor–a 64-bit part. I can’t see any reason why a typical user would want to buy a machine with a 32-bit processor.
  • Also on the subject of laptop computers, don’t assume that you’re getting the best price by buying on eBay. I compared prices for Dell laptops on eBay and at Dell Outlet. The outlet prices compare quite favorably with eBay, the only drawback being that you’ll have to pay sales tax if you buy from Dell. Still, I found plenty of eBay sales where the buyer paid more than what he would have paid at the outlet–including tax. Do your research.

Exceeding the limits

We generate a lot of data here, some of which we want to keep around. Yesterday I noticed that I was running out of space on one of my 750 GB archive drives and figured it was time to start compressing some of the data. The data in question is reasonably compressible. A quick test with Windows’ .zip file creator indicated that I’d get a 30% or better reduction in size.

The data is generated on a continuous basis by a program that is always running.  The program rotates its log once per hour, and the hourly log files can be anywhere from 75 to 200 megabytes in size.  Figuring I’d reduce the number of files while also compressing the data, I wrote a script that uses Info-ZIP’s Zip utility to create one .zip file for each day’s data.
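
Roughly, the idea is to group the hourly logs by date and hand each day’s files to Zip. Here’s a sketch of the approach (the paths and file naming are made up, and the real script differs in the details):

using System;
using System.Diagnostics;
using System.IO;

class DailyArchiver
{
    static void Main()
    {
        string logDir = @"D:\archive\logs";          // hypothetical location

        foreach (string file in Directory.GetFiles(logDir, "*.log"))
        {
            // Group by the date the log was written; one archive per day.
            DateTime day = File.GetLastWriteTime(file).Date;
            string archive = Path.Combine(
                logDir, day.ToString("yyyy-MM-dd") + ".zip");

            // "zip -j archive file" adds the file to the archive, creating
            // the archive if it doesn't exist; -j stores just the file name.
            ProcessStartInfo psi = new ProcessStartInfo("zip",
                string.Format("-j \"{0}\" \"{1}\"", archive, file));
            psi.UseShellExecute = false;

            using (Process p = Process.Start(psi))
            {
                p.WaitForExit();
            }
        }
    }
}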

And then I hit a wall.  It seems that the largest archive that Zip can create is 2 gigabytes.  As their FAQ entry about Limits says:

While the only theoretical limit on the size of an archive is given by (65,536 files x 4 GB each), realistically UnZip’s random-access operation and (partial) dependence on the stored compressed-size values limits the total size to something in the neighborhood of 2 to 4 GB. This restriction may be relaxed in a future release.

With 24 files ranging in size from 75 to 200 megabytes, it’s inevitable that some days will generate more than 3 gigabytes of data. At about 30% compression, that’s not going to fit into the 2 GB file.

My immediate solution will be to compress the files individually. It’s less than ideal, but at least it’ll give me some breathing room while I look for a new archive utility.

I’m surprised that in today’s world of cheap terabyte-sized hard drives, the most popular compression tools have the same limitations they had 20 years ago. Every modern operating system has supported files larger than 4 gigabytes for at least 10 years. It’s time our tools let us use that functionality.

I’m in the market for a good command-line compression/archiver utility that has true 64-bit file support. Any suggestions?

Going too far back

The other day I intended to close a Remote Desktop window and instead hit the Close button (the X on the right of the window’s caption bar) on the console window running our data broker. Nothing like an abnormal exit to bring the whole house of cards tumbling down.

So I went looking for a way to prevent that particular problem from occurring again. Disabling the Close button is pretty easy. In fact, there are at least two ways to do it. Neither is ideal.

The Close button is on the window’s system menu. You can get a handle to the system menu by calling the GetSystemMenu Windows API function. In addition to the buttons on the window’s caption bar, this menu also contains the menu items you see if you click on the box at the left of the window:

Given a handle to the system menu, you have (at least) two choices:

  1. Call EnableMenuItem to disable the caption bar’s Close button.
  2. Call DeleteMenu to remove the Close item from the menu. Doing so will also disable the Close button on the caption bar.

The second option looks like the better of the two, because it prevents me from hitting the Close button, and also prevents me from inadvertently clicking the Close menu item when I’m going for Edit. The C# code for the second option looks like this:

[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr GetConsoleWindow();

[DllImport("user32")]
private static extern IntPtr GetSystemMenu(IntPtr hWnd, bool bRevert);

[DllImport("user32")]
private static extern bool DeleteMenu(IntPtr hMenu, uint uPosition, uint uFlags);

private const int MF_BYPOSITION = 0x0400;

static void Main(string[] args)
{
    // Get the console window handle
    IntPtr winHandle = GetConsoleWindow();

    // Get the system menu
    IntPtr hmenu = GetSystemMenu(winHandle, false);

    // Delete the Close item from the menu
    DeleteMenu(hmenu, 6, MF_BYPOSITION);

    // rest of program follows
}

That works well, as you can see from this screen shot:

But there’s a problem. To restore the menu when your program is done, you’re supposed to call GetSystemMenu and pass true for the second parameter, telling it to restore the menu, like this:

GetSystemMenu(winHandle, true);

The result is probably not what you expect:

The system didn’t revert to the previous menu, but rather to the default system menu–the one created for every window. The Edit, Defaults, and Properties items that cmd.exe adds to the menu are gone.

Since I can’t reliably restore the menu after deleting an item, I figured I’d call EnableMenuItem to disable the Close item. Unfortunately, that doesn’t appear to be possible. At least, I haven’t been able to make it work. Since I often need the Edit menu item even after the program exits, I’m going with the first option and hoping that I don’t hit the Close menu item by mistake when going for the Edit menu while the program is running.

An aside: we have the term “fat finger” to describe hitting the wrong key on the keyboard. Is there a similar expression for making a mistake with the mouse? I suppose “mis-mouse” would do, but it doesn’t have quite the same ring to it as “fat finger.”

Proxy fits

Three years ago I mentioned anonymous proxies as a way to “anonymize” your Internet access. At the time I neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT department. For example, I know of at least one company that blocks access to slashdot.org.

You can often go around such blocks (not that I’m advocating such behavior) by using services such as SureProxy.com. When you go to SureProxy and enter the URL for slashdot, SureProxy fetches the page from slashdot and sends it to you. The URL you see will look something like this: http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/. If SureProxy isn’t blocked by your IT department, then you end up seeing the slashdot page. (Along with whatever advertisements SureProxy adds to the page.)

I’m sure this kind of thing gives corporate IT departments headaches. Their headaches are nothing compared to the problems proxies pose for Web crawlers.

The primary problem is that the proxy changes the URLs in the returned HTML page. Every link on the page is modified so that it, too, goes through the proxy. If the crawler starts crawling those URLs, it will just collect more and more of them, all of which go through the proxy. And since the proxy URL doesn’t look anything like the real URL (at least, not to the crawler), the crawler will end up viewing the same page many times: once through the real link, and once through every proxy that the link appears in.

Fortunately, it’s pretty easy to write code that will identify and eliminate the vast majority of proxy URLs. Most of the proxies I’ve encountered use CGIProxy–a free proxy script. The script itself is usually called nph-proxy.cgi or nph-proxy.pl, although I’ve also seen nph-go and nph-proy, among others. It’s easy enough to write a regular expression that looks for those file names, extracts the real URL, and discards the proxy URL. That takes care of the simple cases. The rest I’ll have to find and block manually.
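
Here’s a sketch of the kind of pattern involved (the path layout varies from one install to the next, so treat this as an illustration rather than the filter we actually run):

using System;
using System.Text.RegularExpressions;

class ProxyUrlFilter
{
    // CGIProxy-style URLs embed the real target after the script name and
    // a flags segment, e.g.
    //   http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/
    // This pattern covers the common "nph-*" layout; others need their own rules.
    static readonly Regex CgiProxyPattern = new Regex(
        @"^https?://.*?/nph-[^/]+/[^/]+/(https?)/(.+)$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the real URL if this looks like a CGIProxy link, or null if not.
    static string ExtractTarget(string url)
    {
        Match m = CgiProxyPattern.Match(url);
        if (!m.Success)
        {
            return null;
        }
        return m.Groups[1].Value + "://" + m.Groups[2].Value;
    }

    static void Main()
    {
        Console.WriteLine(ExtractTarget(
            "http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/"));
        // -> http://slashdot.org/
    }
}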

I’ve also seen proxies (Invisible Surfing is one) that use a completely different type of proxy script. They supply the target URL as an encoded query string parameter that looks something like this: http://www.invisiblesurfing.com/surf.php?q=aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0=. I’m sure that with some effort I could decode the URLs hidden in the query string, once I determined that the URL was a proxy URL. That turns out to be a rather difficult problem. Until I come up with a reliable way for the crawler to identify these types of proxy URLs, I spot-check the URLs myself and manually block the domains. It’s like playing Whac-A-Mole, though, because new proxies appear all the time.
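
For what it’s worth, the q parameter in the example above is plain Base64, so the decoding itself is the easy part. A quick sketch:

using System;
using System.Text;

class EncodedProxyDecoder
{
    // Decode the Base64-encoded target from a proxy URL of the style shown
    // above. Knowing that a URL is one of these in the first place is the
    // hard part; the decoding itself is trivial.
    static string DecodeTarget(string proxyUrl)
    {
        int pos = proxyUrl.IndexOf("?q=", StringComparison.Ordinal);
        if (pos < 0)
        {
            return null;
        }
        string encoded = proxyUrl.Substring(pos + 3);
        int amp = encoded.IndexOf('&');
        if (amp >= 0)
        {
            encoded = encoded.Substring(0, amp);
        }
        byte[] bytes = Convert.FromBase64String(encoded);
        return Encoding.ASCII.GetString(bytes);
    }

    static void Main()
    {
        Console.WriteLine(DecodeTarget(
            "http://www.invisiblesurfing.com/surf.php?q=" +
            "aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0="));
        // -> http://www.mischel.com/index.htm
    }
}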

The other problem with crawling through proxies is that it makes the crawler ignore the robots.txt file on the target Web site. Since the crawler thinks it’s accessing the proxy site, it checks the proxy’s robots.txt. As a result, the crawler undoubtedly ends up accessing (and the indexer indexing) files that it never should have crawled.

Perhaps most surprising is that proxy sites don’t have robots.txt files that disallow all crawlers. I can see no benefit for the proxy site to allow crawling. The crawlers aren’t viewing the Web pages, so the proxy site doesn’t get the benefit of people clicking on their ads. All the crawler does is waste the proxy site’s bandwidth. If somebody out there understands the business of proxy sites and can explain why they don’t take the simple step of writing a robots.txt that disallows crawling, please explain it to me in the comments, or by email. I’m very curious.

Crawler versus the URLs

When you start crawling the Web on even a small scale, you quickly learn that things aren’t nearly as neat and tidy as the RFCs would have you believe. After just a few weeks of writing code to handle all the special cases and ambiguities that crop up, you’ll start to wonder how the Web manages to work at all. Nowhere is this more evident than when working with URLs.

It’s a pleasant fantasy to believe that a document on the Web can be reached through one and only one URL. That is, our training as programmers pushes us into the belief that the URL http://www.example.com/docs/resume.html is the way to reference that particular document. It might be the preferred way, but it’s certainly not the only way. On most servers, for example, you can drop the “www”, so that http://example.com/docs/resume.html will get you to the same place. We call this “the www problem.”

That’s just the simplest example. Did you know that, on most servers, multiple slashes are irrelevant? That is, http://www.example.com/////docs////resume.html will go to the same place as the two URLs above. You can also do some path navigation within the URL, so that http://www.example.com/docs/../docs/resume.html goes to the same place as all the other examples I’ve shown.

You can also “escape” any character within a URL. For example, you can replace a slash (/) with the character string %2F, turning the original URL above into this: http://www.example.com%2Fdocs%2Fresume.html. (Strictly speaking, an escaped slash isn’t equivalent to a literal one, but URLs like that show up in the wild anyway.) Most often, escaping is used to remove embedded spaces and special characters that have particular meanings in URLs. Sometimes escaping is done automatically when a user copies a link from a browser and pastes it into an HTML authoring program.

Above are just some of the simplest examples. I haven’t even started on query strings–parameters that you can pass after the path part of a URL. But even without query strings, the number of different ways you can address a particular document on the Web is essentially infinite. And yet a crawler is expected to, as much as possible, determine the “canonical” form of a URL and crawl only that. Crawling the same document multiple times wastes bandwidth (for both the crawler and the crawlee), and results in duplicate data that can only cause more problems for the processes that come along after the crawler has stored the page.

If you haven’t written a crawler, you might think I’m just contriving examples. I’m not. The www problem in particular is a very real issue that if not addressed can cause a crawler to read a very large number of pages twice: once with the www and once without the www. The other issues are not nearly as prevalent, but they are significant–so significant that every crawler author spends a huge amount of time trying to develop heuristics for URL canonicalization. Simply following the specification in RFC 3986 will get you most of the way there, but there are ambiguities that simply cannot be resolved. So we do the best we can.
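
To give a flavor of what those heuristics look like, here’s a stripped-down sketch of a few of the steps: lower-casing the host, applying the www heuristic, collapsing runs of slashes, and letting the Uri class remove the dot segments. A real canonicalizer handles many more cases than this:

using System;
using System.Text.RegularExpressions;

class UrlCanonicalizer
{
    static string Canonicalize(string url)
    {
        Uri uri = new Uri(url);     // the Uri class removes "." and ".." segments

        // Lower-case the host and apply the "www" heuristic: treat
        // www.example.com and example.com as the same site. (It is a
        // heuristic; a few sites really do serve different content.)
        string host = uri.Host.ToLowerInvariant();
        if (host.StartsWith("www."))
        {
            host = host.Substring(4);
        }

        // Collapse runs of slashes in the path.
        string path = Regex.Replace(uri.AbsolutePath, "/{2,}", "/");

        return uri.Scheme + "://" + host + path + uri.Query;
    }

    static void Main()
    {
        Console.WriteLine(Canonicalize(
            "http://www.example.com/////docs////resume.html"));
        Console.WriteLine(Canonicalize(
            "http://www.example.com/docs/../docs/resume.html"));
        // Both print: http://example.com/docs/resume.html
    }
}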

You might also wonder where these weird URLs come from. The answer is, “everywhere.”  Scripts are high on the list of culprits. They can mangle URLs beyond belief. For example, one script I encountered had the annoying feature of re-escaping a parameter in the query string. The percent sign (%) is one of those characters that gets escaped because it has special meaning in URLs.

So imagine a script reached from the URL http://www.example.com/script.php?page=1&username=Jim%20Mischel. The script appends the username variable to the query string for all links when it generates the page, but it escapes the string. So links harvested from the page have this form: http://www.example.com/script.php?page=2&username=Jim%2520Mischel. “%25” is the escape code for the percent sign. Now imagine following a chain of 10 links all generated by that script. You end up with http://www.example.com/script.php?page=10&username=Jim%2525252525252525252520Mischel.

What’s a poor crawler to do?

We do the best we can, and we have measures in place to identify such situations so that we can improve our canonicalization code. But it’s a never-ending battle. Whenever we think we’ve seen it all, we run into another surprise.
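
For the curious, here’s a rough sketch of one such measure: unescape repeatedly until the value stops changing. It’s a detection aid more than a fix, since whether the extra escaping can safely be collapsed depends on the site:

using System;

class OverEscapeDetector
{
    // Unescape repeatedly until the string stops changing, with a cap so a
    // hostile input can't spin forever. If it takes more than one pass,
    // something upstream is re-escaping the value.
    static string UnescapeFully(string value, out int passes)
    {
        passes = 0;
        string current = value;
        for (int i = 0; i < 10; i++)
        {
            string next = Uri.UnescapeDataString(current);
            if (next == current)
            {
                break;
            }
            current = next;
            passes++;
        }
        return current;
    }

    static void Main()
    {
        int passes;
        string cleaned = UnescapeFully("Jim%252520Mischel", out passes);
        Console.WriteLine("{0}  (needed {1} passes)", cleaned, passes);
        // -> Jim Mischel  (needed 3 passes)
    }
}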