Jim’s Random Notes

Musings on technology and life

August 10th, 2008

Paranoia versus productivity

We had an interesting discussion at the office about how much validation a collection type should do in its constructor. The key question, I think, came down to this:

If the constructor can determine that using the instantiated object will throw an exception, should the constructor fail rather than returning the instantiated object?

In other words, if I know that the instantiated object won’t work, shouldn’t I just throw the exception now, rather than let you be surprised later?

There are two extremes here: 1) the constructor should go to heroic efforts, and; 2) let the buyer beware. I tend to lean towards putting the onus on the caller, figuring that whoever is instantiating the object knows what he’s doing.  Let me provide an example.

Consider the .NET SortedList generic collection type. To do its job (that is, keep a collection of items sorted), it requires a comparison function. If you don’t specify a comparison function when you call the constructor, the collection uses the default comparison function for whatever type you specify as the key. This sounds simple enough, right? A list of employees that’s sorted by employee number, for example, would be defined like this:

SortedList<int, Employee> Employees =
    new SortedList<int, Employee>();

Since the System.Int32 type (which the C# int type resolves to) implements IComparable, everything works.

But imagine you have an EmployeeNumber type:

class EmployeeNumber
{
    public string Division { get; private set; }
    public int EmpNo { get; private set; }
    public EmployeeNumber(string d, int no)
    {
        Division = d;
        EmpNo = no;
    }
}

Now, if you create a SortedList that’s keyed on that type, you’ll have:

SortedList<EmployeeNumber, Employee> Employees =
    new SortedList<EmployeeNumber, Employee>();

Allow me to show the entire program here, so we don’t get confused.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace genericsTest
{
    class EmployeeNumber
    {
        public string Division { get; private set; }
        public int EmpNo { get; private set; }
        public EmployeeNumber(string d, int no)
        {
            Division = d;
            EmpNo = no;
        }
    }

    class Employee
    {
        public string Name { get; set; }
        public Employee(string nm)
        {
            Name = nm;
        }
    }

    class Program
    {
        static SortedList<EmployeeNumber, Employee> Employees =
            new SortedList<EmployeeNumber, Employee>();

        static void Main(string[] args)
        {
            Employees.Add(new EmployeeNumber("Accounting", 1),
                new Employee("Sue"));
            Employees.Add(new EmployeeNumber("Dev", 2),
                new Employee("Jim"));
        }
    }
}

If you compile and run that program, you’ll see that it throws an exception when it tries to add the second employee to the list. The program fails because it can’t compare the items. Neither implements IComparable.

Those who lean towards the first extreme above will argue that the SortedList constructor should determine that the key type doesn’t implement IComparable, and should prevent you from instantiating the collection. It should throw an exception because it knows that trying to add items to the collection will fail.

The constructor could do this. It’s possible for the constructor to get the default comparer and call it. If the comparison function returns a value, then all is well. If it fails, then the constructor throws an exception saying, “Sorry, but you didn’t supply a comparison function.”

The only problem with that scenario is that it’s wrong. Not wrong philosophically, but wrong in a very concrete sense. Extending my example will illustrate why.

Suppose you have two different types of employee numbers. Maybe an OldEmployeeNumber that looks like the one I defined above, and a NewEmployeeNumber that has different fields. Because you want to keep both employee number types in the same list, you define a base class, EmployeeNumberBase from which they both can inherit. The definitions would look like this:

abstract class EmployeeNumber : IComparable
{
    // Some common employee number functionality goes here.

    public int CompareTo(object obj)
    {
        throw new NotImplementedException();
    }
}

class OldEmployeeNumber : EmployeeNumber, IComparable
{
    public string Division { get; private set; }
    public int EmpNo { get; private set; }
    public OldEmployeeNumber(string d, int no)
    {
        Division = d;
        EmpNo = no;
    }

    int IComparable.CompareTo(object obj)
    {
        int rslt = 0;
        if (obj is OldEmployeeNumber)
        {
            var o2 = obj as OldEmployeeNumber;
            rslt = Division.CompareTo(o2.Division);
            if (rslt == 0)
                rslt = EmpNo.CompareTo(o2.EmpNo);
        }
        else if (obj is NewEmployeeNumber)
        {
            // OldEmployeeNumber sorts before NewEmployeeNumber
            rslt = -1;
        }
        return rslt;
    }
}

class NewEmployeeNumber : EmployeeNumber, IComparable
{
    public string Country { get; private set; }
    public decimal EmpNo { get; private set; }
    public NewEmployeeNumber(string c, decimal no)
    {
        Country = c;
        EmpNo = no;
    }

    int IComparable.CompareTo(object obj)
    {
        int rslt = 0;
        if (obj is NewEmployeeNumber)
        {
            var o2 = obj as NewEmployeeNumber;
            rslt = Country.CompareTo(o2.Country);
            if (rslt == 0)
                rslt = EmpNo.CompareTo(o2.EmpNo);
        }
        else if (obj is OldEmployeeNumber)
        {
            // NewEmployeeNumber sorts after OldEmployeeNumber
            rslt = 1;
        }
        return rslt;
    }
}

Yeah, I know. That’s quite a mouthful.

The EmployeeNumberBase class implements the IComparable interface, but its implementation just throws NotImplementedException. Furthermore, the class is marked as abstract to prevent it from being instantiated. Only derived classes can be instantiated.

The derived classes each explicitly implement the IComparable interface. The company-defined sorting rules are that old employee numbers always sort in the list before new employee numbers. Within the same type, the numbers are sorted using their own rules. [Note here that my CompareTo implementations aren't terribly robust. They'll return zero (equal) if the object passed is not of a known type, and they'll fail if the passed object is null. But those details aren't terribly relevant to the example.]

Now, the Employees list is created in exactly the same way:

SortedList<EmployeeNumber, Employee> Employees =
    new SortedList<EmployeeNumber, Employee>();

We can then add items to the list:

Employees.Add(new NewEmployeeNumber("USA", 2.002m),
    new Employee("Jim"));
Employees.Add(new OldEmployeeNumber("Accounting", 1),
    new Employee("Sue"));
Employees.Add(new OldEmployeeNumber("HR", 3),
    new Employee("Dana"));

If you make those changes and run the program, you’ll see that it does indeed run, and work as expected, and I didn’t change the comparison function that the constructor sees.

If SortedList had attempted to protect me from myself–that is, call the default comparison function and throw an exception because the comparison function had failed–then this final code would not work. By trying to protect me from myself, it would have prevented me from doing what I wanted to do.

Understand, the above is something of a contrived example. I certainly can’t imagine implementing the employee list that way, even if I did have different employee number types. But somebody else might think it’s a perfectly reasonable thing to do. The point is that there could be very good reasons for instantiating a keyed collection with a key type that does not have a valid comparison function. The constructor cannot know if comparisons will fail.

Which brings us back to the original question: how hard should a collection class (or any library object) try to prevent you from instantiating an object that will fail? In my opinion, the constructor should instantiate the object if the immediate parameters look reasonable. My reasoning is that it’s extremely difficult, if not impossible, to know how the caller will be using the class. As you saw above, making broad assumptions about types in a polymorphic environment can be fatal.

This reasoning extends far beyond the question of how a collection class’s constructor should behave. As programmers, we have to strike a balance between paranoia and productivity. We have to decide daily how much trust to put in the code that calls our methods, and how much we can depend on the code we call. Do we write classes that hold the programmer’s hand to help him across the street, or do we provide a “walk” signal and a warning that says, in effect, “If you cross on red, all bets are off”?

August 4th, 2008

Multicore Crisis?

There’s been some talk recently of the next “programming crisis”: multicore computing. I’ll agree that we should be concerned, but I don’t think we’re anywhere near the crisis point. Before I address that specifically, I think it’s instructive to review the background: why multicore processors exist, how they affect existing software, and the issues involved in writing code to make use of multiple cores.

Moore’s law has been quoted and misquoted so often that it’s almost a cliché. His original statement was simply an observation on the rate at which transistor counts were increasing on integrated circuits, and that he expected the trend to continue for at least 10 years. That was 1965. The trend has continued, and there’s no indication that it will slow.

Some people think Moore’s Law has become something of a self-fulfilling prophecy: because we believe that it’s possible, somehow we strive to make it so. One wonders what would have happened if Moore had said that he expected the rate of growth to increase. Would transistor densities have increased at an exponential rate?

Self-fulfilling prophecy or not, it’s almost certain that the trend in increasing transistor densities will continue (it has through 2007) and that as a result we’ll get ever more powerful CPUs as well as faster, higher-capacity RAM. Absolute processor speed as measured by clock rate will continue to increase, but not at the astounding rates that we saw up to 2005 or so. Quantum effects and current leakage have put a little damper on the rate of growth there. Better materials will solve the problem–are solving the problem–but absent a fundamental breakthrough by the chemists working on the problem, clock speeds won’t be doubling every 18 months like they had been in the recent past. The Clock Speed Timeline graph makes this quite evident.

Today’s trend is towards multiple cores on a single processor, running at a somewhat slower clock rate. The machine I’m writing this on, for example, has a quad-core Intel Xeon processor running at 2 GHz. The clock speed is somewhat slower than you can get in a high-end Pentium, but the multiple cores provide more total computing power. Quad core processors today are quite common. Intel demonstrated an 80-core chip in February of 2007, and promised to deliver it within five years. I fully expect to have a 256-core processor in my desktop computer ten years from now.

The trend towards multiple cores and very slowly increasing clock rates has some interesting ramifications for software developers. In the past, we have depended on more RAM and faster processors to give us some very nice performance boosts. All indications are that the amount of available RAM and the size of on-chip caches will continue to grow, but we can’t count on the biannual doubling of processor speed. Unless we learn to write programs that use multiple cores, we will soon reach a very real performance ceiling.

Not all applications can benefit from multiple cores, but you’d be surprised at how many can. And even in those cases when a single program can’t make use of multiple cores, users still benefit from having a multicore processor because the machine is better at multi-tasking. Imagine running four virtual machines on one computer, for example. If the computer has a single processor core, all four virtual machines and all of the operating system services share that one core. On a quad-core processor, the work load is spread out over all four cores. The result is more processor cycles per virtual machine, meaning that all four virtual machines should run faster.

Software systems that consist of multiple mostly-independent processes can make good use of multicore processors without any modification. Consider a system consisting of two services that are constantly running. On a single-core computer, only one can actually be working at a time. You could almost double performance simply by upgrading to a dual-core processor. Such software systems are quite common, and they require no code changes in order to benefit immediately from the new multicore processor designs.

Contrary to popular belief, writing code that is explicitly multi-threaded–designed to take advantage of multiple cores–isn’t necessarily a huge step up in complexity. Such code can be much more complex than single-threaded code, but it doesn’t have to be. Some programs are more multi-threaded than others. I’ve found it useful to think of programs in terms of the following four levels of complexity:

  1. No explicit multi-threading.
  2. Infrequent, mostly independent asynchronous tasks.
  3. Loosely coupled cooperating tasks.
  4. Tightly coupled cooperating tasks.

Obviously, it’s impossible to draw exact boundaries between the levels, and many programs will use features found in two or more of the levels. In general, I would classify a program by the highest level of multi-threading features that it uses.

Level 1 requires little in the way of explanation. This is the most common type of application in use today. In a batch mode program, execution proceeds sequentially from start to finish. In a GUI program, user interface events and processing execute on the same thread. This type of application has served us well over the years.

Most Windows programmers have some experience with the next level of complexity. A GUI application that performs background processing and periodically updates the display is an example of this type of program. Typically, the program starts the background process, which from time to time raises events which the GUI thread handles and updates the display. Data and process synchronization between tasks is limited to the event handlers that respond to asynchronous events. Modern development environments make it very easy to create such programs. These programs can benefit from multiple processor cores because the background thread can operate independently of the the GUI thread, making the GUI thread much more responsive.

I have found the third level of complexity–loosely coupled cooperating tasks–to be a very useful and relatively simple way to make use of multiple cores. The idea is to construct a program that operates in an assembly line fashion. For example, consider a program that gathers input, does some complex processing of the input data, and then generates some output. Many such programs are processor bound. If you structure the program such that it maintains an input queue, a pool of independent worker threads, and an output queue, then there is little danger of running into the problems that often plague more complex programs. You have to supply synchronization (mutual exclusion locks, or similar) on the input and output queues, but the worker threads operate independently. Using this technique on a quad-core processor, it’s possible to get an almost 4x increase in throughput over a single-core processor, with very little danger of running into resource contention issues.

Written correctly, programs that have multiple tightly-coupled cooperating tasks make the best possible use of processor resources. However, explictly coding thread synchronization is perhaps the most difficult type of programming imaginable. Forgetting to lock a resource before accessing it can lead to unexplained crashes or data corruption. Holding a lock for too long can create a performance bottleneck. Locks that are too granular increase complexity and also the chance for deadlock situations. Locks that are not granular enough will stall worker threads. Race conditions are endemic. Assuming you get such a program working, even a small change will often cause new, unanticipated problems. Writing this kind of code is hard. You’re much better off re-thinking your approach to the problem and casting it as a Level 3 problem. Whatever price you pay in performance will be returned many fold in increased reliability and reduced development time.

If you’re writing a Level 3 or Level 4 program, you should very seriously consider using a existing multi-tasking library if at all possible. Doing so will require that you think about your problem differently, but you leverage a lot of known-working code that is almost certainly more robust in all ways than what you’re likely to write yourself in the time allotted. Two good examples of such libraries are the Parallel Extensions to .NET 3.5 and the Java Parallel Processing Framework. Such libraries exist for many other programming environments. Although still in their infancy, these libraries promise to greatly simplify the move to multicore. If you’re contemplating development of a program that makes good use of multiple cores, you definitely should learn about any parallel computing libraries that support your platform.

So, back to the crisis. Bob Warfield over at SmoothSpan Blog has had and continues to have quite a lot to say about it, and many others share his sentiments. I, on the other hand, don’t think we’re anywhere near the crisis point. Nor do I think we’re likely to get there. Whereas it’s true that most current software isn’t multicore ready, software developers have understood for several years now that they need to begin writing applications that take advantage of multiple processor cores. It’s likely that some shops have taken an ad hoc approach to the problem, and they’re probably suffering with the issues I pointed out above. It’s also likely that many (and I would hope, most) development shops have done the prudent thing and adopted a parallel computing library that takes care of the difficult areas, leaving the programmers to worry about their specific applications. Doing so is no different than adopting an operating system, development environment, GUI library, report generator, or any other third party component–something that development shops have long experience with.

In short, the multicore “crisis” that the doomsayers are warning us about is almost a non-issue. It’s going to require a small amount of programmer retraining and there will undoubtedly be a temporary plateau in the rate at which our processing of data increases, but in a very short time we’ll again have mainstream applications that push all this fancy hardware to its limits.

July 21st, 2008

C# and .NET: What’s Next?

About 10 days ago, MSDN’s Channel 9 site released an hour-long video entitled Meet the Design Team, that talks in very vague terms about uncoming features in C# 4.0.  You’ll learn that the language will include more dynamic constructs and built-in support for multiple cores.  Honestly, that’s about all you’ll learn from watching the video.  Granted, either one of those broad features implies many changes to the language and to the underlying runtime.

Improvements to the language are all well and good, but given the choice I’d rather have them address some fundamental runtime issues:  the two-gigabyte limit, and garbage collection.  Both of these issues have caused me no end of grief over the past year.

All things considered, the .NET garbage collector is a definite win.  It handles the majority of memory management tasks much better than most programmers.  It’s not impossible to create a memory leak in a .NET program, but you really have to try.  Unfortunately, garbage collection is not free.  You’ll find that out pretty quickly if you write a long-running program that does a lot of string manipulation.  For example, take a look at this clip, which shows bandwidth usage from a Web crawler written in .NET:

Those times of zero bandwidth usage you see coincide with the garbage collector pausing all the threads to clean things up.  We lose somewhere around 10% of our potential bandwidth usage due to garbage collection.  This particular graph is from a dual-core machine.  The graph looks the same on a quad-core processor.

Obviously, they’ll have to do something about the garbage collector if they’re going to support multiple cores.  No amount of multi-core support in the language or in the runtime will do me a bit of good if every core stops whenever the garbage collector kicks in.

I’ve mentioned the .NET two-gigabyte limit before.  The 64-bit runtime has access to as much memory as you can put in a machine, but no single object can be larger than two gigabytes.  When you’re working with data sets that contain hundreds of millions of items, that’s just not acceptable.  When $2,000 will buy you a machine with 16 gigabytes of memory, it’s time that the .NET runtime give me the ability to allocate an object that makes use of that capacity.

I’m happy to see the team continue improving the C# language.  I’ll undoubtedly find many of their improvements useful.  But no amount of language improvement will increase my productivity if I’m hamstrung by the absurd limit on individual object size and the garbage collector continues to eat my processor cycles.

Unfortunately, we’ll have to wait a bit longer before we know what all will be included in the next versions of C# and .NET.  Microsoft is keeping pretty quiet, apparently in an attempt to make a big splash at the Professional Developer’s Conference in October.

Anybody care to pay my way to the conference?

July 16th, 2008

Exceeding the Limits

We generate a lot of data here, some of which we want to keep around. Yesterday I noticed that I was running out of space on one of my 750 GB archive drives and figured it was time to start compressing some of the data. The data in question is reasonably compressible. A quick test with Windows’ .zip file creator indicated that I’d get a 30% or better reduction in size.

The data is generated on a continuous basis by a program that is always running.  The program rotates its log once per hour, and the hourly log files can be anywhere from 75 to 200 megabytes in size.  Figuring I’d reduce the number of files while also compressing the data, I wrote a script that uses INFO-ZIP’s Zip utility to create one .zip file for each day’s data.

And then I hit a wall.  It seems that the largest archive that Zip can create is 2 gigabytes.  As their FAQ entry about Limits says:

While the only theoretical limit on the size of an archive is given by (65,536 files x 4 GB each), realistically UnZip’s random-access operation and (partial) dependence on the stored compressed-size values limits the total size to something in the neighborhood of 2 to 4 GB. This restriction may be relaxed in a future release.

With 24 files ranging in size from 75 to 200 megabytes, it’s inevitable that some days will generate more than 3 gigabytes of data.  At about 30% compression, that’s not going to fit into the 2 GB file.

My immediate solution will be to compress the files individually.  It’s less than ideal, but at least it’ll give me some breathing room while I look for a new archive utility.

I’m surprised that in today’s world of cheap terabyte-sized hard drives, the most popular compression tools have the same limitations they had 20 years ago.  Every modern operating system has supported files larger than 4 gigabytes for at least 10 years.  It’s time our tools let us use that functionality.

I’m in the market for a good command-line compression/archiver utility that has true 64-bit file support.  Any suggestions?

July 14th, 2008

Going Too Far Back

The other day I intended to close a Remote Desktop window and instead hit the Close button (the X on the right of the window’s caption bar) on the console window running our data broker. Nothing like an abnormal exit to bring the whole house of cards tumbling down.

So I went looking for a way to prevent that particular problem from occurring again. Disabling the Close button is pretty easy. In fact, there are at least two ways to do it. Neither is ideal.

The Close button is on the window’s system menu. You can get a handle to the system menu by calling the GetSystemMenu Windows API function. In addition to the buttons on the window’s caption bar, this menu also contains the menu items you see if you click on the box at the left of the window:

Given a handle to the system menu, you have (at least) two choices:

  1. Call EnableMenuItem to disable the caption bar’s Close button.
  2. Call DeleteMenu to remove the Close item from the menu. Doing so will also disable the Close button on the caption bar.

The second option looks like the best, because it prevents me from hitting the Close button, and also prevents me from inadvertently clicking the Close menu item when I’m going for Edit. The C# code for the second option looks like this:

[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr GetConsoleWindow();

[DllImport("user32")]
private static extern IntPtr GetSystemMenu(IntPtr hWnd, bool bRevert);

[DllImport("user32")]
private static extern bool DeleteMenu(IntPtr hMenu, uint uPosition, uint uFlags);

private const int MF_BYPOSITION = 0x0400;

static void Main(string[] args)
{
    // Get the console window handle
    IntPtr winHandle = GetConsoleWindow();

    // Get the system menu
    IntPtr hmenu = GetSystemMenu(winHandle, false);

    // Delete the Close item from the menu
    DeleteMenu(hmenu, 6, MF_BYPOSITION);

    // rest of program follows
}

That works well, as you can see from this screen shot:

But there’s a problem. To restore the menu when your program is done, you’re supposed to call GetSystemMenu and pass true for the second parameter, telling it to restore the menu, like this:

GetSystemMenu(winHandle, true);

The result is probably not what you expect:

The system didn’t revert to the previous menu, but rather to the default system menu–the one created for every window. The Edit, Defaults, and Properties items that cmd.exe adds to the menu are gone.

Since I can’t reliably restore the menu after deleting an item, I figured I’d call EnableMenuItem to disable the Close item. Unfortunately, that doesn’t appear to be possible. At least, I haven’t been able to make it work. Since I often need the Edit menu item even after the program exits, I’m going with the first option and hoping that I don’t hit the Close menu item by mistake when going for the Edit menu while the program is running.

An aside: we have the term “fat finger” to describe hitting the wrong key on the keyboard. Is there a similar expression for making a mistake with the mouse? I suppose “mis-mouse” would do, but it doesn’t have quite the same ring to it as “fat finger.”

June 11th, 2008

Can’t Configure Windows DNS Resolver Cache

In experimenting with the program I described yesterday, I got to fiddling with the DNS resolver cache, called dnscache. Briefly, dnscache saves the results from recent DNS queries so that it doesn’t have to keep querying the DNS server. Considering that a DNS query can take 100 milliseconds or more to resolve, this can save considerable time. For example, for your browser to load this Web page, it has to make many different requests to my server: one for the base page, one for the stylesheet, one for each image, etc. It wouldn’t be uncommon to require a dozen separate requests to get all the resources that make up the page. If each resource required a separate DNS request, it would take more than a second just for DNS!

I got to wondering just how large the DNS cache is. A little bit of searching brings up any number of pages claiming that you can “speed up your connection” by tweaking the DNS resolver cache parameters. Specifically, they talk about changing registry keys for the cache hash table size, maximum time to live, etc. There’s even a Microsoft TechNet article describing these parameters for Windows Server 2003 (and, by extension, Windows XP). It’s interesting to note that the information on most of the pages claiming to speed things up conflicts rather badly with the information in the TechNet article.

After reading the tweaks and the TechNet article, I figured I’d give it a shot. I fired up the Registry Editor, made the changes, and … is it working? How can I tell? I tried browsing a few Web sites, but I couldn’t see any difference.

A little more searching and I found the command ipconfig /displaydns. This writes the contents of the DNS resolver cache to the console. A little work with the FIND utility, and I was able to count the number of entries in the cache. 34 on my Windows XP box. Interesting, considering that I set the CacheHashTableSize registry entry to over 7,000. I fiddled and tweaked, restarted the DNS Client service, flushed the cache, rebooted my computer, faced Redmond and cursed, and generally tried everything I could think of. No matter what settings I used, I always ended up with between 30 and 40 entries in my DNS cache.

On my Windows Server 2008 machine at the office, I always got between 270 and 300 entries, no matter what I tried.

So that leaves me with the following possibilities:

  1. It’s not possible to change the size of the DNS resolver cache in Windows XP or Windows Server 2008.
  2. It is possible, but the documentation is wrong.
  3. The documentation is correct as far as it goes, but it’s incomplete.
  4. The documentation is correct and complete, but I’m too dumb to make sense of it.
  5. The documented registry entries actually changed the size of the cache, but ipconfig isn’t showing me all the entries that are in the cache.

At this point, all possibilities seem almost equally likely. I could do some indirect testing based on the amount of time it takes to resolve a series of DNS requests, but even that would be inconclusive. There are no documented API calls that allow me to examine the DNS cache or its size. (And the undocumented ones aren’t described well enough to be worth checking out.) My only means of seeing what’s in the cache is the ipconfig tool.

So I ask: does anybody know how to change the size of the Windows DNS resolver cache and prove that those changes actually work? Do I have to restart the DNS Client service? Reboot the machine? Set some super magic registry entry?

Any information greatly appreciated.

June 10th, 2008

Is this really asynchronous?

I’ve been working on a relatively simple program whose purpose is to see just how fast I can issue Web requests. The idea is to get one machine hooked directly to an Internet connection and see how many concurrent connections it can maintain and how much bandwidth it can consume. A straight bandwidth test is easy: just start three or four Linux distribution downloads from different sites. That’ll usually max out a cable modem connection.

But determining the sustained concurrent connection rate is a bit more difficult. It requires that you issue a lot of requests, very quickly, for an extended period of time. By slowly increasing the number of concurrent connections and monitoring the bandwidth used, I should be able to find an optimum range of request rates: one that makes maximum use of bandwidth, but doesn’t cause requests to timeout.

My Web crawler does something similar, but it also does a whole lot of other things that make it impractical for use as a diagnostic tool.

I got the program up and limping today, and was somewhat surprised to find that it couldn’t maintain more than 15 concurrent connections for any length of time. Considering that my crawler can maintain 200 or more connections without a problem, I found that quite curious. It had to be something about the different way I was issuing requests.

Because this is a simple tool, I figured I’d use the .NET Framework’s WebClient component to issue the requests. In order to avoid the overhead of constructing a new WebClient for every request, I initialized 100 WebClient instances to be served from a queue, and then issued the requests in a loop, kind of like this:

while (!shutdown)
{
    if (currentConnections < MaxConnections)
    {
        WebClient cli = GetClientFromQueue();
        ++currentConnections;
        cli.DownloadStringAsync(GetNextUrlFromQueue());
    }
}

The actual code is a bit more involved, of course, but that’s the gist of it. The currentConnections counter gets decremented in the download completed event handler.

The important thing to note here is that I’m issuing asynchronous requests. The call to DownloadStringAsync executes on a thread pool thread. This code should issue requests at a blindingly fast rate, and keep the number of concurrent connections right near the maximum. Even with MaxConnections set to 50, the best I could do was 20 concurrent–and that for only a very short time. Most often I had somewhere between 10 and 15 concurrent connections.

After eliminating everything else, I finally got around to timing just how long it takes to issue that asynchronous request. The result was pretty surprising: in my brief tests, it took anywhere from 0 to 300 milliseconds to issue those requests. The average seemed to be around 100 or 150 ms. That would explain why I could only keep 10 or 15 connections open. If it takes 100 ms to issue a request, then I can only make 10 requests per second. Since it takes about 2 seconds (on average) to complete a request, the absolute best I’ll be able to do is 20 concurrent requests.

So I got to thinking, why would it take 100 milliseconds or more to issue an asynchronous Web request? And the only reasonable answer I could come up with was DNS: resolving the domain name. And it turns out I was right. I flushed the DNS cache and ran my test by requesting a small number of URLs from different domains. Sure enough, it averaged about 150 ms per request. I then ran the program again and it took almost no time at all to issue the requests. Why? Because the DNS cache already had those domain names resolved. Just to make sure, I flushed the DNS cache again and re-ran the test.

By the way, the HttpWebRequest.BeginGetResponse method (the low-level counterpart to WebClient.DownloadStringAsync) exhibits the same behavior. That’s not surprising, considering that WebClient uses HttpWebRequest to do its thing.

This is a fatal flaw in the design of the .NET Framework’s support for asynchronous Web requests. The whole idea of supplying asynchronous methods for I/O requests is to push the waiting off on to background threads so that the main thread can continue processing. What’s the use of providing an asynchronous method if you have to wait for a high latency task like DNS resolution to complete before the asynchronous request is issued? Why can’t the DNS resolution be done on the thread pool thread, just like the actual Web request is?

There is a way around the problem: queue a background thread to issue the asynchronous request. Yes, I know it sounds crazy, but it works. And it’s incredibly easy to do with anonymous delegates:

ThreadPool.QueueUserWorkItem(delegate(object state)
    {
        cli.DownloadStringAsync((Uri)state);
    }, GetNextUrlFromQueue());

That spawns a thread, which then issues the asynchronous Web request. The time waiting for DNS lookup is spent in the background thread rather than on the main processing thread. It looks pretty goofy, and unless it’s commented well somebody six months from now will wonder what I was smoking when I wrote it.

June 5th, 2008

Webbots, Spiders, and Screen Scrapers

Considering what I’m doing for work, you can imagine that when I ran across Michael Schrenk’s Webbots Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection of data from the Web.

Most of the book focuses on screen scrapers that download data from previously identified Web sites, parse the pages, and then store and present the data. There’s a little information on “spidering”–automatically following links from one page to another–but that’s not the primary purpose of the book. Which is probably a good thing. A Web-scale spider (or crawler) is fundamentally different than a screen scraper or a special-purpose spider that’s written to gather information from a small set of domains or very narrowly-defined pages.

The first six chapters explain why Web bots are useful, and walk you through the basics: downloading Web pages, parsing the contents, automating log in and form submission, and many other tasks that are involved in automated data collection. With plenty of PHP code examples, these chapters provide a good foundation for the next 12 chapters: Projects. In this section, we see examples of real Web bots that monitor prices, capture images, verify links, aggregate data, read email, and more. Again, with many code examples.

The first two sections cover about three-fifths of the book. If you read and follow along by trying the code examples, you’ll have a very good understanding of how to build many different types of Web bots.

The remainder of the book is divided into two sections. Part 3, Advanced Technical Considerations, briefly explains spiders, and then discusses some of the technical issues such as authentication and cookie management, cryptography, and scheduling your bots. This section has some code examples, but they aren’t the primary focus.

The fourth section, Larger Considerations, focuses on things like how to keep your bots out of trouble, legal issues, designing Web sites that are friendly to bots, and how to prevent bots from scraping your site. Again, these chapters have a few code samples, but the emphasis is on the larger issues–things to think about when you’re writing and running your bots.

Overall, I like the book. The writing is conversational, and the author obviously has a lot of experience building useful bots. The many code samples do a good job illustrating the concepts, and the projects cover the major types of bots most people would be interested in writing. Reading about the projects and some of the other ideas he presents opens up all kinds of possibilities.

The book succeeds very well in its stated mission: explaining how to build simple web bots and operate them in accordance with community standards. It’s not everything you need to know, but it’s the best introduction I’ve seen. The focus is on simple, single-threaded, bots. There’s some small mention of using multiple bots that store data in a central repository, but there’s no discussion of the issues involved in writing multi-threaded or distributed bots that can process hundreds of pages per second.

I recommend that you read this book if you’re at all interested in writing Web bots, even if you’re not familiar with or intending to use PHP. But be sure not to expect more than the book offers.

June 4th, 2008

.NET regular expressions are slow?

One of the benefits–or curses, depending on my mood and how urgently I need a solution–of programming computers is that I often start working on one thing and end up getting sidetracked by a piece of the problem.

Take today’s distraction, for example. I’m writing a program to experiment with some text classification using the downloaded Wikipedia database. A major part of the program is extracting terms from the individual articles. Since I don’t need anything too fancy (at least not quite yet), I figured that this would be a perfect application for regular expressions. So I coded up my term extractor and let it loose on the Wikipedia data, figuring it’d take an hour or two to process the 13 gigabytes of text.

It’s a good thing I’ve gotten into the habit of instrumenting my code and periodically outputting timing information. It was taking an average of 10 milliseconds to process each Wikipedia article. The file has about 5.8 million articles. A quick back of the envelope calculation says that’ll take 58,000 seconds, or 16 hours. I’ve been wrong before, but not often by an order of magnitude.

After removing some unnecessary code and postponing some processing that can be done against the aggregate, I cut the required time per page to about 4 milliseconds. Better, but still too much. Through the process of elimination, I finally narrowed it down to the loop that extracts terms from the document text using a regular expression. Stripping everything but the critical loop, it looks like this:

static Regex reTerm = new Regex("\\p{L}+", RegexOptions.Compiled);
static Stopwatch reElapsed = new Stopwatch();

static void DoReParse(string pageText)
{
    reElapsed.Start();

    Match m = reTerm.Match(pageText);
    while (m.Success)
    {
        m = m.NextMatch();
    }

    reElapsed.Stop();
}

The regular expression, \p{L}+, says, “match a string of one or more contiguous Unicode Letter characters.” reElapsed is a Stopwatch object that accumulates the time taken between the Start and Stop calls. When my program is done, I divide the accumulated time by the number of documents processed to get the average time per page. For the first 100,000 documents in the Wikipedia download, it averages about 1.5 milliseconds per page, which is pretty close to what I thought all of the processing would take–parsing included.

That seemed unreasonable, so I wrote my own parser that reads each character of the document text and pulls out the substrings of contiguous Unicode letter characters. It’s more code than the regular expression version, but it’s still pretty simple:

static void DoJmParse(string pageText)
{
    // Time extraction with direct parsing
    jmElapsed.Start();

    int i = 0;
    int len = pageText.Length;
    bool inWord = false;
    int start = 0;
    for (i = 0; i < len; i++)
    {
        if (!inWord)
        {
            if (char.IsLetter(pageText[i]))
            {
                start = i;
                inWord = true;
            }
        }
        else if (!char.IsLetter(pageText[i]))
        {
            string term = pageText.Substring(start, i - start);
            inWord = false;
        }
    }
    if (inWord)
    {
        string term = pageText.Substring(start);
    }

    jmElapsed.Stop();
}

Both of the methods shown above are stripped-down versions of the code. The complete code adds the extracted terms to a list so that I can compare what’s extracted by each method. In all cases (the first 100,000 pages in the Wikipedia download), both methods extract the same terms. But the hand-coded parser takes an average of 0.25 milliseconds per page. The regular expression parser is six times slower.

I could understand my hand-coded routine being somewhat faster because it avoids the overhead of constructing Match objects and some of the other regular expression overhead. But six times? Something smells wrong. I have to think that I’m missing something.

I tried the usual things: fiddling with the regular expression options, calling Matches to get all of the matches in a single call, etc. All to no avail. Everything I tried had either no effect or increased the running time.

Then I thought that the difference had to do with Unicode combining characters or surrogate pairs. That is, characters that take up two code units rather than just one. My parser treats each code unit as a character, whereas the regular expression parser might be taking those multi-code unit characters into account. But a simple test doesn’t bear that out.

Consider this character string defined in C#:

const string testo = "\u0061\u0300xyz";

The first two characters, “\0061\0300″, define the character à, a lower-case “a” with a grave accent. From my understanding of Unicode, this should be treated as a single character. But when I run that string through the regular expression term extractor, I get two strings: “a”, and “xyz”. If the regular expression engine supports combining characters, I should get one string: “àxyz”. The documentation is suspiciously silent on the matter, but a close reading of Jeffrey Friedl’s Mastering Regular Expressions indicates that I shouldn’t expect the .NET regular expression engine to give me just the one string.

So I’m at a loss. I have no idea why the regular expression version of the term extractor is so much slower than my hand-coded version. For now, I’m going to abandon the regular expression, but I’d sure like to hear from anybody who can shed some light on this for me.

May 21st, 2008

More on .NET Collection Sizes

Last month in HashSet Limitations, I noted what I thought was an absurd limitation on the maximum number of items that you can store in a .NET HashSet or Dictionary collection. I did more research on all the major collection types and wrote a series of articles on the topic for my .NET Reference Guide column. If you’re interested in how many items of a particular type can fit into one of the .NET collection types, you’ll want to read those four or five articles.

I mentioned last month that the HashSet and Dictionary collections appeared to have a limit of 47,995,854 items. Whereas that really is the limit if you use the code that I showed, the real limit is quite a bit larger: about 89.5 million with the 64-bit runtime (61.7 million on the 32-bit runtime) if you have String keys and your values are references. The difference has to do with the way that the data structure grows as you add items to it.

Like all of the .NET collection types, HashSet and Dictionary grow dynamically as you add items. They start out with a very small capacity–fewer than 16 items. As you add items and overflow the capacity, the collection is resized by doubling. But the capacity is not exactly doubled. It appears that the new capacity is set to the first prime number that is larger than twice the current capacity. (Some hashing algorithms perform better when the number of buckets is a prime number.) I don’t know the exact resize sequence, but I do know that there are noticeable pauses at around 5 million, 11 million, and 23 million, and then the program crashes at 47 million. The reason for the crash is that the collection is trying to resize its internal array to the next prime number that’s somewhere around 96 million. And that’s larger than the empirically determined maximum size of about 89.5 million.

So how does one get a collection to hold more than that 48 million items? By pre-allocating it. If you know you’re going to need 50 million items, you can specify that as the initial capacity when you create the object:

Dictionary<string, object> myItems = new Dictionary<string, object>(50000000);

You can then add up to 50 million items to the collection. But if you try to add more, you’ll get that OutOfMemory exception again.

That works well with Dictionary, but HashSet doesn’t have a constructor that will let you specify the initial capacity. When I was writing my .NET column on this topic, I discovered that you can create a larger HashSet indirectly, by building a List and then passing that to the HashSet constructor that takes an IEnumerable parameter. Here’s how:

static void Main(string[] args)
{
  int maxSize = 89478457;
  Console.WriteLine("Max size = {0:N0}", maxSize);

  // Initialize a List<long> to hold maxSize items
  var l = new List<long>(maxSize);

  // now add items to the HashSet
  for (long i = 0; i < maxSize; i++)
  {
    if ((i % 1000000) == 0)
    {
      Console.Write("\r{0:N0}", i);
    }
    l.Add(i);
  }
  Console.WriteLine();

  // Construct a HashSet from that list
  var h = new HashSet<long>(l);

  Console.WriteLine("{0:N0} items in the HashSet", h.Count);

  Console.Write("Press Enter:");
  Console.ReadLine();
 }

That works fine if you know in advance what items you want to put into your HashSet, but it doesn’t do you much good if you want to add things one at a time. At the time I wrote the article, I didn’t have a solution to the problem.

I later discovered that after creating the collection, you can remove all the items (either by calling Remove 89 million times, or by calling Clear) and you’re left with an empty HashSet that has a capacity of 89 million items. It’s a roundabout way to get some functionality that I think should have been included with the HashSet class, but if you need it, that’s how you do it. Understand, though, that it’s undocumented behavior that might change with the next release of the .NET Framework.

In last month’s note, I also grumbled a bit about the perceived 5x memory requirement of Dictionary and HashSet. It turns out that I was off base there. Internally, these collections store a key/value pair for each item. Since typically both the key and the value are references, that adds up to 16 bytes per item, giving a best case of about 134 million* items fitting into the .NET maximum 2 gigabyte allocation. The real limit of 89,478,457 indicates that the per-item overhead is 24 bytes (2 GB / 24 = 89,478,485), which is actually pretty reasonable.

*The number is 128 * 1024 * 1024. It seems to be understood that when we programmers say “128 K items,” we really mean 131,072 items (128 * 1,024). But there doesn’t seem to be a generally accepted terminology for millions or billions of items. I don’t hear “128 M items” or “128 G items” in conversation, nor do I recall seeing them in print. I’ve heard (and used) “128 mega items” or “128 giga items,” but those sound clunky to me. We don’t say “128 kilo items.” Is there generally accepted nomenclature for these larger magnitudes? If not, shouldn’t there be? Any suggestions?

May 12th, 2008

The New ReaderWriterLockSlim Class

Last year, in Improper Use of Exceptions, I mentioned that the ReaderWriterLock.AcquireReaderLock and ReaderWriterLock.AcquireWriterLock methods were improperly written because they throw exceptions when the lock is not available. I mentioned further that whoever designed the ReaderWriterLock should have studied the Monitor class for a more rational API.

Apparently I wasn’t the only one to think that, as .NET 3.5 introduced the ReaderWriterLockSlim class, which has a much more Monitor-like interface. ReaderWriterLockSlim reportedly has much better performance than ReaderWriterLock, as well as much simpler rules for lock recursion and upgrading/downgrading locks. The documentation says that ReaderWriterLockSlim avoids many cases of potential deadlock. All in all, it’s recommended over ReaderWriterLock for all new development.

At the time I wrote my note last year, I was especially disappointed that there was no way to wait indefinitely to acquire a reader or writer lock. It turns out that I was wrong: you can wait indefinitely if you pass Timeout.Infinite as the timeout value to AcquireReaderLock or AcquireWriterLock. It’s documented, but not very well. Rather than stating the valid values in the description of the timeout parameter, the documentation for ReaderWriterLock has a link at the end of the Remarks section says, “For valid time-out values, see ReaderWriterLock.”

I guess I need to follow the advice I read so long ago: “Periodically re-read the documentation for the functions you’re calling.”

April 24th, 2008

Calling static methods as instance methods

I was half asleep yesterday when I fired off an email to my co-workers about the String.IsNullOrEmpty method in the .NET runtime. One thing I said in the mail was:

I find it curious that IsNullOrEmpty is a static method rather than an instance method. Why can’t I say:
if (!someString.IsNullOrEmpty())

The obvious answer, as one coworker pointed out, is “because it can be null.” Calling a method on a null reference will end up throwing an exception. Embarrassing, I know. I’ll blame lack of sleep.

But I got to thinking about it later and I end up asking the same question. Let me explain.

C# 3.5 introduced extension methods, which allow you to add functionality to existing classes without having to create a derived type, recompile, or otherwise modify the original types. The idea is that you create a static class to contain the extension method, and inside that class create a static method that has a special syntax. For example, the following would create a WordCount extension method for the String class:

namespace ExtensionMethods
{
    public static class MyExtensions
    {
        public static int WordCount(this String str)
        {
            return str.Split(new char[] { ' ', '.', '?' }, StringSplitOptions.RemoveEmptyEntries).Length;
        }
    }
}

If you then reference the namespace (with a using statement) in the code that uses this method, you can get the word count for a string (assuming that ’s’ is a string) by writing:

int wc = s.WordCount();

You can also call it like this:

int wc = MyExtensions.WordCount(s);

Extension methods are certainly cool, and LINQ depends on them to do what it does. Although extension methods can be very confusing and even dangerous if used indiscriminately, they’re very useful.

Think about what happens inside the compiler in order to make this possible. When the compiler sees that special “this” syntax in the extension method’s definition, it has to make an entry in its symbol table that says, in effect, “ExtensionMethods.MyExtensions.WordCount is an instance method of type System.String.” Then, when it sees that call to s.WordCount, the compiler emits code equivalent to ExtensionMethods.MyExtensions.WordCount(s).

Now let’s go back to the static String.IsNullOrEmpty method. It is defined in the String class as:

static Boolean IsNullOrEmpty(String)

The only thing missing is the “this” syntax on the parameter. If that syntax existed, we’d be able to call IsNullOrEmpty on a null string reference, because the compiler would have converted it into a static method call. But it wouldn’t take any changes to the .NET runtime in order to make this possible. The compiler should be able to handle it.

That is, the compiler should be able to see that you’re calling a static method using instance method syntax, and fix it up accordingly. The compiler has all the information it needs in order to do that. And with the already-existing extension method machinery built into the compiler, I can’t see that this would be especially difficult to implement. Why wasn’t it?

April 9th, 2008

HashSet Limitations

Version 3.5 of the .NET runtime class library introduced the HashSet generic collection type. HashSet represents a set of values that you can quickly query to determine if a value exists in the set, or enumerate to list all of the items in the set. You can also perform standard set operations: union, intersection, determine subset or superset, etc. HashSet is a very handy thing to have. Simulating the same functionality in prior versions of .NET was very difficult.

I’ve made heavy use of HashSet in my code since it was introduced, and I’ve been very happy with its performance. Until today. Today I ran into a limitation that makes HashSet (and the generic Dictionary collection type, as well) useless for moderately large data sets. It’s a memory limitation, and how many items you can store in the HashSet depends on how large your key is.

I’ve mentioned before that the .NET runtime has a 2 gigabyte limit on the size of a single object. Even in the 64-bit version, you can’t make a single allocation that’s larger than 2 gigabytes. I’ve bumped into that limitation a few times in the past, but have been able to work around them by restructuring some things. I thought I was safe with the HashSet, though. Even with an 8-byte key, I figured I should be able to store on the order of 250 million items. I found out today that the number is quite a bit lower: a little less than 50 million. 47,995,853 to be exact. After I figured out what was causing my problem, I verified it with this program:

static void Main(string[] args)
{
    HashSet<long> bighash = new HashSet<long>();
    for (long i = 0; i < 50000000; ++i)
    {
        if ((i % 100000) == 0)
        {
            Console.Write("r{0:N0}", i);
        }
        bighash.Add(i);
    }
    Console.WriteLine();
    Console.Write("Press Enter");
    Console.ReadLine();
}

The program throws OutOfMemoryException when it tries to add the 47,995,853rd (or perhaps the 47,995,854th) item, because it’s increasing the capacity of an internal data structure and that data structure exceeds 2 gigabytes.

If I reduce the size of the key to 4 bytes (a .NET long is 8 bytes), then I can add just a little less than 100 million items before hitting the limit. Let’s think about that a little bit.

50 million keys of 8 bytes each should take up about 400 megabytes. 100 million keys of 4 bytes each should take up about 400 megabytes. I realize that there’s some overhead in a hash table to deal with collisions, but five times is excessive! I can’t imagine a hash table implementation that has an overhead of five times the total key size. And yet, that’s what we have in .NET.

It’s bad enough in today’s world, where a machine with 16 gigabytes of RAM can be had for under $2,000, that we have to deal with the 2-gigabyte-per-object limitation in .NET. But to have the runtime library’s implementation of a critical data structure squander memory in this way is too much.

Any workaround is very painful. We’ll have to write our own hash table implementation that allocates unmanaged memory and mucks around with pointers in unsafe code. We’re old C programmers, so that’s not beyond our capabilities. But it sure makes me wonder why I selected .NET for this project. In the process, we’re going to lose a lot of the functionality of Dictionary and of HashSet.

I can’t be the only one running up against these kinds of limitations. 10 years ago, a data set of 100 million items may have been considered large. Today 100 million is, at best, moderately large. There are plenty of applications that work with billions of items and today’s computers have the capacity to store them all in RAM. We damned well should be able to index them in RAM using modern tools.

I hope the .NET team is working on a solution to the 2-gigabyte limit, and I’d strongly suggest that they take a very close look at their hash table implementation.

February 5th, 2008

LINQ: I’m a believer

Language-integrated query (LINQ) is a new technology introduced with .NET 3.5. The C# 3.0 language has been extended to support it. LINQ is, in essence, a query language for in-memory data. Think of SQL for your data structures and you get the idea.

Sometimes you have to see a thing in action before you understand just how powerful it is. The other day I had the opportunity to learn that about LINQ.

Suppose you have a Dictionary of inventory items. For example:

class InventoryItem
{
    public string StockNumber {get; private set; }
    public int Count { get; set; }
    public int ReorderThreshold { get; set; }
    // other fields here
}

Dictionary<string,InventoryItem> inventory =
    new Dictionary<string, InventoryItem>();

Now, imagine you want to want to go through the list and select all the items whose Count is less than the ReorderThreshold, and output the list sorted in StockNumber order. The traditional way to do this is to go through the Dictionary one item at a time (using foreach) and create a List of items. Then, sort the list and output. Like this:

// Select items to reorder
List<InventoryItem> reorderList = new List<InventoryItem>();
foreach (var kvp in inventory)
{
    if (kvp.Value.Count <= kvp.Value.ReorderThreshold)
      reorderList.Add(kvp.Value);
}

// sort the list by stock number
reorderList.Sort(delegate(InventoryItem i1, InventoryItem i2)
{
    return i1.StockNumber.CompareTo(i2.StockNumber);
});

// Now output all the items
foreach (var item in reorderList)
{
    OutputItem(item);
}

All that goes away with LINQ. The new way to do things is build a query and let the runtime deal with selection and sorting:

var reorderList =
    from item in inventory
    where item.Count <= item.ReorderThreshold
    orderby item.StockNumber
    select item;

foreach (var item in reorderList)
{
    OutputItem(item);
}

Sure, I understand that I’m not saving any execution time. It’s even possible that the runtime generated code for selecting and sorting is less optimal than what I could write myself. But most of the time I don’t care. In a majority of cases, I won’t even notice a few milliseconds or even seconds in something like this. And it makes the code much easier to write.

I’m definitely a convert. LINQ is cool.

January 17th, 2008

Unintended Consequences

Changing any non-trivial program will almost invariably have unintended consequences. Nowhere is this more readily apparent than when you’re modifying a program that has multiple threads. A change in the behavior of one part of the program will change the way that it communicates with other threads–either directly in thread-to-thread messages or indirectly in the thread’s data access pattern. Assumptions about usage patterns creep in to the code, usually inadvertently, and when those assumptions are no longer true, all manner of odd things happen.

These things are usually obvious after the fact. For example, I have a program that loads a list of items from disk into a queue, and then begins processing those items. The program has many different threads that do various things, and it can process hundreds of items concurrently. Recently I got the brilliant idea of loading the queue asynchronously so that the program can begin processing items immediately. No reason, I thought, to wait the few minutes to load 10 million items before I start processing.

It was a fairly easy change to spawn a loader thread, and I had everything working again in a few minutes. I started the program, the loader thread began loading, and immediately my worker threads started processing items. All was well with the world and I congratulated myself on a job well done.

But there’s a twist. You see, in processing items, the worker threads put even more items in the queue. There’s a process that prunes the queue from time to time so that it doesn’t grow without bounds, but the key here is that when it comes time to shut down the program I need to save the queue state so that I can pick up where I left off next time the program starts. Still no problem, right?

Unless I try to shut down the program before it’s finished loading the queue. In that case, bad things happen. If I just assume that the loader is done, then the code that writes the queue to disk will fail when it tries to open the file for writing because the loader has it locked already. If I code the loader thread to see the shutdown message and stop loading, then the current contents of the queue in memory will overwrite the queue on disk–causing me to lose anything that wasn’t already loaded.

The only solution is to allow the loader to finish before trying to save the queue to disk.

In hindsight, of course, this is obvious. But it’s something that a lot of programmers would miss in the initial coding.

December 11th, 2007

Source control and formatting standards

Things get so much more complicated when your development team grows larger than one person. Not only coordinating the efforts of two or more people, but also agreeing on data structures and naming conventions, designing interfaces so that you insulate developers as much as possible from others’ work, and a whole host of other problems. A source code control system relieves a large part of the worry, although you should be using source code control as soon as your development team grows larger than zero people. I won’t belabor that point here, as it’s the subject for a rather long post in itself.

One of the many things programmers like to argue about is code formatting. This topic is right up there with the discussion of the best language and best text editor. And as is the case with those topics in which each programmer has his own opinion that is unquestionably right, each programmer has his own code formatting style that is The One True Way. Anybody who does things differently is, at best, suspect.

But most development shops impose code formatting standards along with naming conventions. Why? In order to reduce confusion when somebody other than the code’s original author works on it. Without a code formatting standard, confusion arises in several ways.

The format of the code tells you things about it. For example, indentation implies a compound statement. In many languages (the C-derived languages), an opening brace begins a compound statement, which is normally indented. As creatures of habit, when we see indentation we expect to see an opening brace. And vice-versa. If the brace is not where we expect it, or the indentation is different than we’re accustomed to, then our brains have to adjust. This adjustment takes longer than you might expect. For example, consider these three blocks of code, each of which does the same thing, but are formatted slightly differently:

if (someValue == 0) {
    DoSomething();
    DoSomethingElse();
}

if (someValue == 0)
{
    DoSomething();
    DoSomethingElse();
}

if (someValue == 0)
   {
   DoSomething();
   DoSomethingElse();
   }

Each of those styles has its adherents and detractors, and objectively each is as valid as the other. But two of them make my brain hurt.

Most programmers these days use editors that perform some level of automatic formatting when you enter new code. If my editor is configured to create things in the One True Way, the result of editing a file that was created with one of the heretical formatting techniques is a complete mess.

Absent imposed coding standards, there are four possible solutions to the problem, none of which is satisfactory:

  1. Just ignore the problem. Believe me when I say that you don’t want to do this. Few things are as confusing as working on a single file that has inconsistent formatting.
  2. Disable the automatic formatting features of the editor. This is possible, but makes you much less productive. The automatic formatting features save a lot of time, especially when you’re refactoring code, as it prevents you from having to manually indent or unindent code to make sure that the braces match up.
  3. Reconfigure your editor to match the coding style of whichever file you’re currently working on. This is problematic because programmers often (probably most often) are working on multiple files at the same time. No editor that I know of will allow you to specify the configuration on a per-file basis. (On a file type basis, yes. But not per file.)
  4. Instruct the editor to reformat code to your style whenever you open a file. Although this sounds like a good idea to begin with, it’s disastrous when source control is involved which, as I pointed out above, should be for every project. Source control typically saves the deltas–changes between two versions of a file–rather than saving the entire file every time it’s modified. This is a huge space savings, and also allows you to view the history to see the specific changes between two versions. If you reformat the entire file, then the source control system will assume that (almost) every line changed. The result is that your source control database will be orders of magnitude larger than it has to be, and the significant changes between versions will be lost in the reformatting noise.

There is one other option that is not currently available, but should be. If the source control system (or a filter in front of the source control system) could normalize a file–convert it to a standard form–then the fourth option above would be possible. You could get a file from source control in whatever the normalized form is, open it in your editor and reformat to your heart’s content. When you saved the file and checked it in, the source control system would reformat it to the common form and store only the deltas. This allows each programmer to view and edit code in whatever format he is most comfortable with, but also allows the source control system to be efficient and effective.

I realize that I’m waving my hand over some implementation detail, particularly the many filters that would be required to format different kinds of files. But nothing here sounds terribly difficult with today’s tools. Has this been done? If not, why?

December 6th, 2007

Can I really ignore that warning?

When you build 64-bit .NET applications with Visual Studio, the Assembly Linker issues warning messages of the form:

Assembly mscorlib.dll targets a different processor.

It will issue that warning for each of the .NET runtime assemblies that your project references. The warning occurs because the linker checks the 32-bit runtime assemblies for type information, and since you’re building a 64-bit assembly, there’s a mismatch. The documentation says that it’s safe to ignore this warning because all of the .NET assemblies are guaranteed to have the same external interface, regardless of the CPU for which they are compiled. Specifically, the documentation says:

All x86-specific common language runtime (CLR) assemblies have 64-bit counterparts (every CLR assembly will exist on all platforms). Therefore, you can safely ignore CS1607 for CLR assemblies.

And that does appear to be the case. There is no conflict when you run the application.

I contend, however, that this is a bug in the tool. Why? Because it is a spurious warning that I can’t safely eliminate. And why can’t I safely eliminate it? I’m so glad you asked.

I can eliminate the warning. All I have to do is go into the project properties and add ‘1607′ to the list of warnings I want disabled. Goodbye, warning. But doing that eliminates all occurrences of the warning in the project. If I just happen to have a conflict within my own solution (i.e. trying to link my 64-bit project with a 32-bit assembly that I created), I won’t see the warning. And I’ll get a rude shock when I deploy and try to run the resulting application.

In addition, Visual Studio 2008 will issue the same warning number (CS1607) if the AssemblyVersionInfo attribute in your AssemblyInfo.cs file is not in the recommended format. That in itself is a spurious warning in my opinion, and one that I’d love to disable. Except that if I disable it, I also disable the processor mismatch warning.

You might wonder why I even worry about it, if the warnings are ignorable. There are two reasons. First, when a solution I’m building in Visual Studio issues warnings, I have to examine each warning to make sure that I can safely ignore it. Some of my solutions have more than 20 different projects–tools and dependent assemblies. If each one issues a warning about every referenced CLR assembly targeting a different processor, I have almost 50 warnings to wade through for each build. The risk of missing a real warning in that mess of useless messages is very high.

More importantly, we’re trying to put together an automated build system. In automated builds, warnings should be treated as errors. Otherwise you have to write some kind of post-processing filter that can examine the tool output to determine whether the warnings are meaningful. Doing so introduces yet another uncertainty into the process: the correctness of the post-processing tool and any configuration file that controls it.

So I’m left with no good solution. Whether I enable or disable the the spurious warnings, there’s a very real possibility of missing an important warning. I could understand this bug existing in Visual Studio 2005, but the development team has had more than two years to work out a solution. There is no excuse for the bug remaining in Visual Studio 2008 and the latest version of the .NET tools. Rather than fixing it, they made the problem worse by overloading the warning number. That’s unforgivable.

December 5th, 2007

Some useful utilities

I’ve run across a few utilities lately that I thought others might also find useful.

ISO Recorder is a very handy way to create CDs or DVDs from ISO files. No frills: just a Windows shell extension that works. Right-click on a .ISO file, select the drive you want to burn it to, and go. I would complain about Windows lacking this feature natively, but I think I’d rather have the simple right-click-go interface of ISO Recorder than whatever overly complicated interface the Windows design team would come up with.

I’ve mentioned Info-ZIP before. They have the best command line Zip and Unzip tools that I know of. I went to the site the other day, looking to install the tools on my new Vista system, and found that they now have 64-bit versions. Now if only we could get past the 4 gigabyte limitation.

Speaking of compression, I’ve been seeing a lot of bzipped stuff for some reason lately. Unfortunately, bzip and bunzip aren’t included in the Subsystem for UNIX-based Applications tools and utilities. You can download Bzip2 and other utilities for Windows from the GnuWin32 project.

We’re using Subversion for version control here. That could be the subject for a post by itself. If you’re using Subversion on a Windows client, don’t even bother with installing the command line tools. Rather, download and install TortoiseSVN. It’s a Windows shell extension that lets you work with your version control system visually rather than futzing with the command line. It’s nicely polished, and the new version works well with Windows Vista.

Most open source or free software sites will give the MD5 hash of the files that are available for download. The md5 utility is a standard component of most Linux distributions, but Windows doesn’t include such thing out of the box. MD5sums is a handy thing to have. You can operate it from the command line, or drag files from Explorer onto the .exe. Nicely done.

December 1st, 2007

Vista Subsystem for UNIX-based Applications

The Enterprise and Ultimate editions of Windows Vista include a component called Subsystem for UNIX-based Applications, or SUA. This subsystem is also available in Windows Server 2003 R2, and will be available in Windows Server 2008 (Longhorn). SUA by itself is just a Windows component that provides platform services for UNIX_based applications. You get UNIX tools and an SDK from a separate download.

SUA is the new version of Windows Services for UNIX (SFU), which is available as a separate download for Windows 2000, Windows Server 2003, and Windows XP (except Home edition). Also see the SFU blog. According to Wikipedia, SFU has a long history.

To install SUA in Windows Vista, go to to Control Panel | Programs and Features, and click on the “Turn Windows features on or off” link. Be patient. It takes a minute or two for Windows to populate a list box of all the available features. Once you see the list, scroll down to “Subsystem for UNIX-based Applications,” and check that box. Press OK after you’ve selected any other features you want to turn on or of