Jim’s Random Notes

February 19th, 2010

It’s harder than it looks

Imagine that you have a web site that, among other things, allows your users to search for media (audio and video) using a simple query language.  So, if you want to find Britney Spears videos, you’d just type britney spears in the search box and click the Search button.  Simple, right?

Except it turns out that britney and spears are pretty common spam terms in metadata (the keywords and description fields of YouTube videos, for example).  People will upload all manner of stuff to YouTube and put bogus terms in the description in an attempt to get people to watch the video.  To reduce the number of irrelevant or inappropriate results returned (it’s probably impossible to eliminate irrelevant content), you decide to index the metadata by field and allow the user to say which fields are searched.  So, if they want just those videos that have “Britney” and “Spears” in the title field, they would type britney spears IN Title.  That doesn’t eliminate all of the spam, but it reduces it quite a bit.

It turns out that you have to make the IN case sensitive.  Otherwise you’d never be able to search for the word “in” in any metadata.  The same is true for any word that you use in your query language.  For example, if wanted all the videos that contain “Britney” or “Spears”, we’d write britney OR spears IN Title.

Still, not too hard, right?  But what if you want to search the Title field and the Description field?  At first you’d think you could write:  britney spears IN Title OR Description.  You could make that work until you take into account the possibility of more complex query expressions.  For example, let’s say you wanted a list of all videos that claim to be a Led Zeppelin song, or some version of Stairway to Heaven.  One possible query would be:

led zeppelin IN Artist OR Description OR stairway heaven IN Title

Whereas that query might look reasonable to a non-programmer, writing a computer program to properly handle the general case of queries like that is non-trivial.  The query can be parsed in several different ways.  Three of which are:

(led zeppelin IN Artist OR Description) OR (stairway heaven IN Title)
(led zeppelin IN Artist) OR )(Description OR stairway) heaven IN Title)
(led zeppelin IN Artist) OR (description OR (stairway heaven) IN Title)

All three of those interpretations are perfectly valid.  Applying rules of operator precedence can disambiguate some of the cases, but if you go through the exercise you’ll find out that IN has to have lower precedence than OR, and if you do that, then you end up with:

(led zeppelin IN Artist OR (Description OR stairway heaven)) IN Title

You end up having to either decorate the field names (i.e. “@Artist”) or group them with brackets or parentheses (i.e IN [Artist or Description]).

All of this is doable, and not especially heavy lifting as far as parsing is concerned.  But then you have to explain it to a non-technical user and make it easy for the non-technical user to use.  Otherwise, only programmers will want to (or even be able to) use it.

I’ve heard many a programmer (myself included, come to think of it) complain about a search facility that doesn’t allow complex queries.  We look at it from a programmer’s perspective and think it’d be trivial to implement a comprehensive query facility.  And in most cases they’re probably right.  You could develop a query system that anybody with a couple years’ of programming experience could use without trouble and get exact results.  And when you flipped the switch to turn it on, you’d hear crickets.  Most users don’t understand Boolean algebra or the difference in precedence between AND and OR.  Trust me, people will go somewhere else to get their information rather than have to think of how to ask for it.

What users really want is a DWIM mode:  Do What I Mean.  They want to type word soup into the search and get back exactly what they were looking for, with no false hits (i.e. asking for beatles the music group and getting back something about dung beetles because somebody misspelled “beetle”).

But DWIM doesn’t exist.  Not today, and not for a long time (perhaps ever) in the future.  As a result, we have to restrict what the user can type and very carefully specify how things will be interpreted.  We have to make it easy for the most common cases, but able to do moderately complex and powerful things.  That balance is difficult to achieve, and no matter what you come up with, somebody will complain.  You can only hope that the number of users you delight will vastly outweigh those whom you annoy.

October 21st, 2009

Sniffing network traffic

My latest crawler modifications require me to scrape Web pages that host videos so that I can obtain metadata (title, description, date posted, etc.) that we place in our index.  Unfortunately, there’s no standard way for sites to present such information.  ESPN and Vimeo have HTML <meta> tags that provide some info, but I have to go parsing through the body of the document to find the date.  (And yes, I’m aware that Vimeo has an API that will make this a moot point.  I’ll be investigating that soon.)

Other sites are much worse in that they provide no metadata in the HTML.  For example, one site’s video page is very code-heavy.  Requiring that the page be reloaded every time you request a new video would require a lot of network traffic.  Their design instead uses JavaScript to request a particular video’s metadata from a server.  Loading a new video involves downloading just a few kilobytes of data.

I spent some time this afternoon searching through the a video page HTML and the associated JavaScript, looking for the magic incantation that would get me the data I’m looking for.  The amount of code involved is staggering, and I quickly went crosseyed trying to decipher it before I hit on the idea of hooking up a sniffer to see if I could identify the HTTP request that gets the data.

It took me all of five minutes to download and install Free Http Sniffer, request a video from the site in question, and locate the magic line in the 230 or so requests that the page makes when it loads.  Problem solved.  Now all I have to do is write code that’ll transform a video page url into a request for the metadata, and I’m set.

I have no idea why I didn’t think of the sniffer earlier.  I’d used one before for a similar purpose.  I suspect I’ll be making heavy use of it in the near future as I expand the number of sites that we crawl for media.

September 19th, 2009

A small change?

I’ve been programming computers for a long time.  Getting paid to write computer programs, even, which I thought was pretty darned funny when I first started.  People were paying me to do something that I loved.  But I digress.

After 30 years, you’d think that I would have learned that there’s no such thing as a small change that you can push into production code without having to test.  You might get away with it from time to time, but eventually that arrogance is going to cost you.

But, hey, it’s a simple change!  What could go wrong?

When you hear yourself say that, think about what you’re saying.  And then spend the few minutes it will take to test your assumption.  If nothing else, you’ll save yourself the embarrassment of explaining to your business partner that you made the kind of mistake that you’d reprimand an employee for.

Fortunately, all it cost me was a little embassassment, a few hours’ lost sleep, and an additional hour of down time for the crawler.  I got off easy.

August 27th, 2009

Looking for a Ghost Replacement

When describing the problems I was having configuring our new servers, I mentioned that I was going to try using Clonezilla to speed the process.  The idea was to get Windows installed and all the other software configured on one machine, and then just clone the drive.  Seemed like a good thing to do.

So I fired up Clonezilla, fought through the user interface to tell it what I wanted backed up and where, and then pressed the any key (really!  There was a prompt that said, “Press the any key”) to start the copy.  Clonezilla promptly told me that my network card wasn’t supported.  It would have been nice if it would have checked that when I first started the program.

Slightly discouraged but not yet willing to give up, I decided to try PING.  Another cryptic user interface, but I won’t complain too much considering the price.  This time my network card was supported and after a couple of house it had created a copy of my partition.  So I fired up the next machine, ran PING, told it to copy the partition image to the disk.  That went well, too.  Except that after I was done, the machine wouldn’t boot.  The BIOS doesn’t see a bootable image on the disk.

At that point I gave up.  I’d already spent almost a full day futzing with the things.  In that time I could have installed and configured all of the machines.  (Or so I thought.)  In any case, my experiments with free drive cloning software left me disappointed.

There’s a good overview of Ghost alternatives over at pack rat studios, but I haven’t had the opportunity to try any of the others mentioned.  Clonezilla didn’t support my hardware, and PING failed for reasons unknown.  Anybody know of a package that actually works?

By the way, telling a potential user, “if your network card isn’t supported, download it and compile it into the Clonezilla package” is not likely to be met with smiles and thanks.  More likely, users—even technically competent users like me who are capable of downloading and building—are more likely to say, “no thanks,” and move on to something else.

July 21st, 2009

Another .NET Framework bug?

When faced with inexplicable program behavior, inexperienced programmers often blame the operating system, the runtime library, the compiler, or some other external force for the error.  Even when they discover that the bug is in their code, these programmers often will try to blame it on something else.  A sure sign of maturity in a programmer is how he approaches bugs.  If, when faced with a bug, he concentrates his initial efforts on looking for the problem in his own code, you know that he’s learned.

I learned long ago.  In a little more than 30 years of programming computers (about 25 years doing it full time), I can think of only a handful of cases in which a bug in one of my programs was caused by anything other than my own error.  I know that there are bugs in operating systems and runtime libraries, but it’s not often that they cause problems for me.

So imagine my surprise when, within the space of one week, I’ve identified two previously unknown (or unreported, as far as I can tell) genuine bugs in the Microsoft .NET Framework runtime libraries.  I wrote about the first one the other day.  Microsoft has acknowledged that it is a bug and is currently reviewing it.

I originally thought that the new bug was in the Uri.TryCreate method, because it throws an exception in some circumstances even though the documentation says that it won’t throw an exception but rather will return False if it’s unable to create a valid Uri from the input parameters. And although throwing an exception in this case is (maybe) a bug, the cause of the bug is something else: the Uri constructor allows you to construct invalid Uri instances.

In my particular case, my Web crawler crashed because Uri.TryCreate threw an exception. That was very unexpected, and whatever parameters caused it are lost to me. But it’s pretty rare. I pass somewhere upwards of 250 million urls through that function every day. Unable to reconstruct the exact parameters that caused the problem, I used what I learned from poking around in the disassembled code to come up with a string that illustrates the problem.

The Uri constructor creates a Uri instance from a passed string. Uri is pretty cool in that it supports many types of resources, not just HTTP.  One such type is a mailto: URI of the form mailto:jim@mischel.com.

But the Uri constructor will succeed when it should fail. For example:

Uri mailUri = new Uri("mailto:jim@mischel.comtest@mischel.com");
// trying to access mailUri.AbsoluteUri at this point will throw UriFormatException
// with the message "The host name cannot be parsed"

Passing that invalid Uri as the baseUri parameter to Uri.TryCreate throws the exception that I encountered in the crawler.

What I find most curious about all this is that the Uri class appears to have two different methods for parsing strings.  There appears to be one parser for constructing Uri instances from strings (as in the constructor), and a separate parser used internally for various things.   I know from experience that parsing URIs is a horrendously difficult problem and hard to do correctly.   Why anybody would want to write, test, and maintain two different versions of such difficult code is beyond me.

Bug reported.  Awaiting response.

July 18th, 2009

“Highly Unlikely” is not the same as “Impossible”

One of my programs crashed the other day in a very unexpected place:  inside the runtime library.  The exception stack trace is pretty clear on where the error occurred:

System.OverflowException: Negating the minimum value of a twos complement number is invalid.
   at System.Math.AbsHelper(Int32 value)
   at System.Random..ctor(Int32 Seed)
   at System.Threading.Collections.ConcurrentQueue`1.TryDequeueCore(T& result)
   at System.Threading.Collections.ConcurrentQueue`1.TryDequeue(T& result)
   at MyProgram.ThreadProc() in c:\MyProgram\Main.cs:line 118
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()

The documentation for the Random constructor says that this exception will be thrown if the Seed parameter is equal to Int32.MinValue.  Curious, I thought I’d disassemble the ConcurrentQueue class to see what’s going on.  No problem there, as it’s calling the default (parameterless) Random constructor.  So I took a look at that.

.method public hidebysig specialname rtspecialname
        instance void  .ctor() cil managed
{
  // Code size       12 (0xc)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  call       int32 System.Environment::get_TickCount()
  IL_0006:  call       instance void System.Random::.ctor(int32)
  IL_000b:  ret
} // end of method Random::.ctor

That code gets the current tick count and then passes that value as a seed to the constructor.  So, what’s wrong with this?  Take a look at the documentation for System.Environment.TickCount:

The value of this property is derived from the system timer and is stored as a 32-bit signed integer. Consequently, if the system runs continuously, TickCount will increment from zero to Int32.MaxValue for approximately 24.9 days, then jump to Int32.MinValue, which is a negative number, then increment back to zero during the next 24.9 days.

What all this means is that if your program calls the Random constructor during that one millisecond (after the system has been up for Int.MaxValue milliseconds), the value returned by System.Environment.TickCount is going to be equal to Int.MinValue, and passing that value as the seed will result in an exception.

I’ll be the first to admit that encountering this bug is highly unlikely.  Your computer has to be up and running for almost 25 days, and there’s a one-millisecond window when it’s vulnerable.  But the fact that my program crashed is proof that it is possible for this error to cause a problem.

This is a huge oversight on the part of Microsoft’s runtime library team.  I’ve reported the bug and hope they manage to get it fixed before they release .NET 4.0.

Update 2009/07/21:  Microsoft’s Base Class Library team has said that the issue has been resolved for the next major release.

March 25th, 2009

Stack Overflow

For most of the ’90s, I was a part of TeamB—a group of volunteers who helped answer questions on Borland’s Compuserve forums.  I met a bunch of really great people doing that, got some free Compuserve time, a few trips to the Bay Area, and lots of Borland products.  But mostly, I learned a heck of a lot by helping to answer users’ questions.

When Borland, Microsoft, and other development tool companies moved their online technical support to the Internet, their support was mostly done through newsgroups, and I found the signal-to-noise ratio there almost unbearable.  Except for the moderated newsgroups, which were few and far between, asking a question was like talking to a wall.  Worse, even, because a wall won’t give you wrong answers or call you stupid for doing something different.  Even with the advent of forums rather than newsgroups, online technical help was virtually non-existent for a number of years and I just stopped trying.

Enter Stack Overflow, a free programming Q&A site where you can ask questions, share your expertise, or just browse for nuggets of programming wisdom.  Stack Overflow works.  In many ways it works better than the old Borland Compuserve forums that I enjoyed so much.

Why it works is simple: they’ve found a way to reward people for supplying good answers and, to a lesser extent, asking good questions.  It all has to do with reputation:  ego.  You gain reputation points for supplying good answers, and asking good questions.  “Good” is determined by a simple up- or down-votes by site users.  As you gain reputation points, you gain the ability to help moderate the site: re-tag questions, vote to close, edit questions, etc.  And your current reputation is prominently displayed beside your name.  There are also awards (”Badges”) given for a number of different things.

If you don’t care about reputation, that’s fine.  You can use the site anonymously and still ask, answer, and comment on questions.  But Stack Overflow works because a whole lot of people there do care about their reputations.  Giving more experienced users the ability to help moderate keeps the flaming and other invective to a minimum, and the constant peer review ensures that (in general) the higher-rated answers really are the best.

My only real complaint with Stack Overflow (and it’s not huge) is that the format doesn’t encourage an ongoing threaded discussion as was available on the Compuserve forums.  That’s not a problem in most cases, but there are times when arriving at a satisfactory answer requires much back-and-forth, and it’d be nice to see questions and answers displayed in threaded newsgroup fashion.  The ability to see answers ordered by date helps a lot, as does the comments feature, and I suspect that adding a threaded view would be of only limited additional help.

Despite a few nitpicks, I’m seriously impressed with Stack Overflow.  If you have a programming question on any topic, you should search for the answer there.  And if you don’t find it, ask.  You’ll probably be surprised at the speed and the quality of the answers you get.

February 27th, 2009

Source Code is Formal Communication

When developing a new program—especially when trying out many different things—it’s common to pepper your code with various messages that are displayed when the program reaches a particular point or when it encounters a condition that you thought was impossible.  I suspect every programmer has run into a message or assertion that says, “this can’t happen.”  The joke is that it did happen.

It’s also common to be pretty liberal with terse comments that are little more than notes to oneself about the current technique being tried.

Both are understandable.  Every programmer I’ve known has done similar things.  However, it’s also quite common to leave those comments and messages in the code after the problem is solved and you’ve moved on to other things.  Particularly in a fast-paced environment, you often have little time to go through and “clean up” your code once you have it working.  And therein lies the problem.

All too often those comments and debug error messages are…well, let’s just say “colorful.”   In addition, programmers are known to editorialize in their comments—often to the point of ranting.

You may think that when you’re working by yourself or in a small team, such things shouldn’t matter.  After all, you know that your fellow team members won’t be offended by your use of colorful language.  Thinking that way is a mistake because eventually somebody else is going to see your code.  Perhaps you’ll bring on a new team member who is offended.  Or maybe you’ll want to present the code as a sample in an interview for your next job.

Either of those two would be embarrassing.  But suppose one of your colorful messages pops up on the screen when your boss is demonstrating the program to a potential client.  That could be disastrous.  I’ve seen it happen.  Fortunately it wasn’t one of my colorful error messages.

You’re mistaken if you think that source code comments pose less of a problem.  I was involved with a company that was acquired by another, and part of the “due diligence” involved having the acquiring company’s lead programmers look over the source code.  They were mostly looking for stuff that shouldn’t be there (stolen code, for example), and verifying that everyting that should be there, was.  Early on, one reviewer stumbled upon a long and profane ranting comment in the source code and as a result was highly critical of all the code he reviewed.  Fair or not, that one comment colored his perception of the entire project.

Even if you’re working by yourself, the likelihood of your code being seen by others is very high.  The likelihood of somebody else seeing the output from your programs is even higher.  People will judge you by the contents of your comments and error messages, just as they will judge you by the quality of your code and how well your programs work.

Unless you’re absolutely certain that your program will not be seen by anybody else, you should treat it like any other type of formal communication.  You wouldn’t include profanity or ranting editorializing in a design document you’re presenting to a client, so such things shouldn’t be in your programs.

February 19th, 2009

Shakespeare vs. Markov

I’m developing some code to work with Markov models, and as an early test I created a model of Shakespeare’s sonnets.  Then I had the program generate some verse from the model.  For example:

If they in thou, thy love, that old,
Thought, that which yet do the heat did excess
My love concord never saw my woeful taste
At ransoms you for that so unprovident:
For thereby beauty show’st I am fled,
Althoughts to the dead, that care
With should thus shall by their fair,
Where too much my jade,
Her loves, my heart.

The Markov model generates text one character at a time.  The next character is selected based on the distribution of characters in the original source text, and also on the “order” of the model.  In an order 0 model, character selection is based entirely on the frequency of characters in the source text.  If the letter ‘e’ appears 12% of the time in the source text, then it will appear approximately 12% of the time in the output.  Output from an order 0 model is just a weighted random sampling of the source text.  For example:

vl
cbr
  oaa,stpr iyt eimhr  rttntnettohe  Tt
chShag  flsoTa
btecl)btatef
ebshenuAievien lffetNthfean
nnnfIhwste  aek oppe

An order 1 model bases character selection on the frequency of characters that follow the current character.  That is, given the current character, the next character selected is guaranteed to be one that followed that character in the source text.  The model cannot generate a bigram (sequence of two characters) that did not exist in the source.  So, for example, the sequence ‘qz’ can’t appear in the generated text unless it was also in the source.   Verse generated from an order 1 model is slightly better than what you’ll get from an order 0 model:

Myouivey inn f th peroeethe,
Anete pe!
Touss s o by thy!
Ar temye corste d w.
Be wh s t, n rverose d th ms t,
I s s Ifiththin ay past lll s farenoulofan r, t thonard areall ngain,
Butod,
Theas,
Foutea cthakedinore,
Yot te,
Lorowor d thoodoref pathe,

Things begin to improve quickly as you move up in order of the model.  Order 2 bases character selection on the two preceeding characters, and you start to see recognizable English words although overall it looks more like some freaky foreign language:

Thich Must I winter,
The I do loodye ming to lin
So me lie stat so hou faut forse trom fore.
Whe be.
That, all shart and’s mor thee my swed thy head.
Whoun
And thawarit, a strand sabudge,
To ale whal wart,

Order 3 verse contains primarily recognizable English words.  At order 5 (the first sample above), almost all words are valid and you begin to see word pairs that existed in the original source text.  The resulting verse is reminiscent of a drunk and confused dramatics major trying to impress a group of girls with his knowledge of Shakespeare.

Order 10 verse is like Shakespeare in a blender:

Thou blind fool Love, what dost thou abuse,
The bounteous largess given thee to give?
Profitless usurer why dost thou that I do fawn upon,
Nay if thou live remembered
My deepest sense, how hard true sorrow hits,
And sealed false bonds of love doth well denote,
Love’s fire heats water, water cools not love
Which alters when it alteration finds,
Or bends with thee, wherever I abide,
The guilty goddess of my passion,
A woman’s face with nature’s truth,
And delves the parallels in beauty dyed?
Both truth and beauty should look so.

Here you sometimes see complete sonnet lines.  Other times you’ll see the first half of one line morph into the second half of another, or just pure randomness. 

Things get less interesting beyond order 10, and by order 15 the thing starts spitting out complete sonnets exactly as they were fed in.

I probably spent entirely too much time “testing” my program:  generating verse and laughing at the results.  But it was a welcome bit of humor in an otherwise somewhat dull day.

December 3rd, 2008

No IEnumerable.ForEach?

Overall, I like working with C# and the .NET Framework.  But sometimes I just can’t imagine what the designers were thinking when they put some things together.  High on the list of things I don’t understand is the lack of a ForEach extension method for the generic IEnumerable interface.

IEnumerable extension methods were added in .NET 3.5 to support LINQ. These methods enable a more functional style of programming that you just couldn’t do with earlier versions of C#. For example, in earlier versions of C# (before 3.0), if you wanted to obtain the sum of items in an array of ints, you would write:

int[] a = new int[100];
// some code here to initialize the array
// now compute the sum
int sum = 0;
foreach (int x in a)
    sum += x;

In .NET 3.0, the Array class implements the generic IEnumerable interface, and that interface has a Sum method to which you pass a Lambda expression. Computing the sum of items becomes a one-liner:

int sum = a.Sum(x => x);

That syntax looks pretty strange, but it takes very little getting used to and you begin to see how very powerful this functional style of programming is.

IEnumerable extension methods let you do lots of different things as one-liners. For example, if you have a list of employees and you want to select just those who are female, you can write a one-liner:

var FemaleEmployees = Employees.Where(e => e.Gender == 'F');

There are extension methods for many specific things: determining if all or any items in the collection meet a condition; applying an accumulator function over the list; computing average, minimum, maximum, sum; skipping items; taking a specific number of items, etc. But there is no extension method that will allow you to apply a function to all items in the list. That is, there is no IEnumerable.ForEach method.

I do a lot of list processing and my code is littered with calls to the IEnumerable extension methods. It’s very functional-looking code. But when I want to do something that’s not defined by one of the extension methods, I have to drop back into procedural mode. For example, I want to apply a function to all the items in the list:

foreach (var item in MyEnumerable)
{
    DoSomething(item);
}

That’s crazy, when I should be able to write:

MyEnumerable.ForEach(item => DoSomething(item));

The lack of ForEach is terribly annoying. What’s worse is that ForEach does exist for arrays and for the generic List class. It’s kind of strange, though. With arrays you have to call the static Array.ForEach method:

Array.ForEach(MyArray, item => DoSomething(item));

For generic List collections, it’s an instance method:

MyList.ForEach(item => DoSomething(item));

And other collection types simply don’t have a ForEach method.

It’s simple enough to add my own ForEach extension method:

public static class Extensions
{
    public static void ForEach<T>(this IEnumerable<T> source, Action<T> action)
    {
        foreach (var item in source)
        {
            action(item);
        }
    }
}

That’s what I did, and it works just fine.  But it seems such an obvious addition (i.e. everybody would want it) that I can’t for the life of me understand why it wasn’t included in the first place.

I know the guys at Microsoft who designed this stuff aren’t stupid, so I just have to think that they had a good reason (at least in their mind) for not including a ForEach extension method for the generic IEnumerable interface. But try as I might, I can’t imagine what that reason is. If you have any idea, I’d sure like to hear it.

November 23rd, 2008

An Assumption of Competence

My second programming job was with a small commercial bank in Fresno, CA, where I helped maintain the COBOL account processing software.  I was still pretty inexperienced, having only been working in the industry for about 18 months.  My previous job involved maintaining software, also in COBOL, for small banks in western Colorado.

One of the first things my new boss asked me to look at was a program that computed loan totals by category:  a two-digit code that was assigned to each loan.  Federal regulations required that we report the total number and dollar amount of all loans, by category, as well as the number and dollar amount that were 30, 60, and 90 days or more past due.  The problem was that the report program was taking way too long to run.  The bank had recently acquired a bunch of new loans, and the program’s run time had increased sharply—several times more than what one would expect from the increase in the number of loans.

Understand, this was a very simple program.  All it had to do was go through the loans sequentially and compute totals in four columns (total, 30 days past, 60 days past, 90+ days past) for each of the 100 categories.  The data structures are very simple.  I don’t remember enough COBOL to write it intelligently, so I’m showing them translated to C#:

struct CountAndTotal
{
    public int Category;
    public int Count;
    public double Total;
}

// Arrays for Total, 30, 60, and 90+ days past due
CountAndTotal[] TotalAllLoans = new CountAndTotal[100];
CountAndTotal[] Total30Past = new CountAndTotal[100];
CountAndTotal[] Total60Past = new CountAndTotal[100];
CountAndTotal[] Total90Past = new CountAndTotal[100];

I’ll admit that I was a little mystified by the Category field in the CountAndTotal structure, but figured it was an artifact from early debugging.

Those definitions and the description of the problem above lead to a simple loop:  for each loan, determine its category and payment status, and add the totals to the proper items in the arrays. The program almost writes itself:

while (!LoanFile.Eof)
{
    LoanRec loan = LoanFile.ReadNext();
    AddToTotals(loan, TotalAllLoans);
    if (loan.PastDue(90))
        AddToTotals(loan, Total90Past);
    else if (loan.PastDue(60))
        AddToTotals(loan, Total60Past);
    else if (loan.PastDue(30))
        AddToTotals(loan, Total30Past);
}

What surprised me when I looked at the code was the implementation of the AddToTotals function. You would expect it to be a simple index into the array from the loan’s category code. After all, the category was guaranteed to be in the range 0-99. That just begs for this implementation:

void AddToTotals(LoanRec loan, CountAndTotal[] Totals)
{
    ++CountAndTotal[loan.Category].Count;
    CountAndTotal[loan.Category].Total += loan.Balance;
}

What I found was quite surprising. Rather than directly index into the array of categories, the program would do a sequential search of the array to see if that category was already there. If it was, the total was added. Otherwise the program made a new entry at the next empty spot in the array. That explained the mysterious Category field and the absurdly long run time. The code is much more complicated:

void AddtoTotals(LoanRec loan, CountAndTotal[] Totals)
{
    int i = 0;
    while (i < 100)
    {
        if (Totals[i].Category == loan.Category)
        {
            ++Totals[i].Count;
            Totals[i].Total += loan.Balance;
            break;
        }
        else if (Totals[i].Category == -1)
        {
            // Unused position.  The category wasn't found in the array.
            Totals[i].Category = loan.Category.
            Totals[i].Count = 1;
            Totals[i].Total = loan.Balance;
            break;
        }
        ++i;
    }
}

The difference in run times is enormous! The first implementation accesses the array directly from the loan.Category field. The second has to search the array sequentially—an operation that involves, on average, looking at 50 different items every time.  The second version of the program is 50 times slower than the first.  In addition, the second required a subsequent sort to put things in the proper order before printing the results.

Being new at the job, I went to my boss, explained what I’d found, and said, “What am I missing?”  His response:  “Why do you think you’re missing something?”

He went on to explain that my analysis was correct, and that the industry (at least back then) was full of programmers who had no business sitting at a terminal.  It was something of a revelation to me, because I had assumed that the people who wrote this stuff really knew what they were doing.  It also taught me to question everything when faced with a problem.  It’s always a good idea to assume competence when you start debugging somebody else’s (or your own) code, but when things stop making sense, it’s time to re-evaluate that assumption.

November 4th, 2008

Interface Annoyances

We ran into a rather difficult class design problem recently that reveals a shortcoming in C# and, apparently, the .NET runtime (specifically, the Common Language Infrastructure, or CLI). It’s a pretty common problem, and I’m a little bit surprised it hasn’t been addressed.

As you know, C# doesn’t allow multiple inheritance. It does, however, allow classes to implement interfaces, which is kind of like inheriting behaviors. The big difference is that with interfaces you have to supply the implementation in your class. With inheritance, you get a default implementation along with the interface.

And before you flame me, please understand that I’m well aware of the many differences between multiple inheritance and implementing interfaces, including the many assumptions that come along with multiple inheritance. The fact remains, though, that the general behavior of a class that implements an interface will be very similar to the behavior of a class that inherits the behaviors defined by that interface. The implementations are quite different, of course, and inheritance carries with it some often unacceptable baggage, but from the outside looking in, things look very much the same.

Interfaces are very handy things, but they can get unwieldy. Consider, for example, an interface called ITextOutput that contains 4 methods:

interface ITextOutput
{
    void Write(string s);
    void Write(string fmt, params object[] parms);
    void WriteLine(string s);
    void WriteLine(string fmt, params object[] parms);
}

The idea here is that you want to give classes the ability to output text strings. If a class implements ITextOutput, then clients can call the Write or WriteLine methods, and the supplied string will go to the object’s output device. So, a class called Foo might implement the interface:

class Foo : ITextOutput
{
    public void Write(string s)
    {
        Console.WriteLine(s);
    }

    public void Write(string fmt, params object[] parms)
    {
        Write(string.Format(fmt, parms));
    }

    public void WriteLine(string s)
    {
        Write(s + Environment.NewLine);
    }

    public void WriteLine(string fmt, params object[] parms)
    {
        WriteLine(string.Format(fmt, parms));
    }
}

Easy enough, right? Except that every class that implements ITextOutput has to implement all four methods. If you look closely, you’ll see that the last three methods all end up calling the first after formatting their output. In most cases, the only method that will change across different classes that implement this interface will be the first Write method. You might want to change the output device, for example, or include a date and time stamp on the output line.

C# does not provide a good solution to this problem. As I see it, you have the following options:

  1. Implement the methods as shown in each class that implements ITextOutput. This is going to be tedious and fraught with error. Somebody is going to make a mistake in all that boilerplate code and the resulting bug will appear at the worst possible time—quite possibly after the product has shipped.
  2. Structure your class hierarchy so that every object inherits from a common TextOutputter class that implements the interface. This is a very good solution if you can do it. For those classes that inherit from some other base class, you can implement the interface as shown above. I have difficulty with this solution because it’s saying, in effect, “Foo IS-A TextOutputter that also does other stuff.” In reality what we really want to say is, “Foo IS-A object (or some other base class) that implements the ITextOutput functionality. It might sound like a fine distinction, but it matters. A lot. Especially when refactoring code.
  3. Forget about inheritance and interfaces and make Foo contain a member that implements the interface. Something like:
    class Foo
    {
        public ITextOutput Outputter = new TextOutputter();
    }

    This will work, but it’s annoying to clients. Rather than calling Foo.Write, for example, they have to call Foo.Outputter.Write. And class designers are free to change the name of the Outputter member to anything. The result is that clients can’t tell by looking at the class declaration if it implements the ITextOutput interface. Instead, they have to go looking for a member variable (or property) that implements it.

Any way you look at it, it’s messy. As a client, I’d expect class designers to bite the bullet and go with the first option, and test thoroughly. In truth, I think that’s the only reasonable option, painful as it is. As a designer, I’d grumble about the need for all that extra typing, but I’d do it. I’d be embarrassed to release a class that implemented either of the other solutions because as a user I’d be annoyed by either of the other implementations. But I sure wish there were another way to do it.

Delphi solved this problem by using what’s called implementation by delegation. The technique involves creating a member that implements the interface (similar to the third option shown above), and delegating calls to interface methods to that object. In C#, if such a feature existed, the syntax might look something like this:

class Foo: ITextOutput
{
    public ITextOutput Outputter =
        new TextOutputter() implements ITextOutput;
}

Clients could then call the Write method on an object of type Foo, just as they would with the first option. The runtime (or the compiler, maybe) would then delegate such interface calls to the Outputter member. We have the best of both worlds: real interfaces, and we don’t have to repeatedly type all that boilerplate code.

I’m not the first one to run into this problem or to suggest the solution. Steve Teixeira mentioned it in his blog over three years ago, and linked to this blog entry from 2003. Steve, by the way, is the one who came up with the idea for Delphi. He says that it’s not currently possible to do such a thing in .NET because it can’t be made verifiably type-safe. I don’t understand why not, but I’ll defer to his judgement here.

This type of thing is trivial to implement in languages that support multiple inheritance. But I don’t think I’m willing to accept the problems with multiple inheritance in order to get this one benefit. It’s a moot point anyway, as it’s quite unlikely that .NET will support multiple inheritance any time soon.

I’d sure be interested to find out how others handle this situation in C# or other .NET languages. Drop me a line and let me know.

October 16th, 2008

Optimizing the wrong thing

Today I was building a custom hash table implementation and needed a function that, given a number X, would find the next prime number that is equal to or greater than X.  Since X could be pretty large—on order of one trillion—I knew that done in a naive way, it could take a long time (in computer terms) to compute the next prime number.  So I started looking for a smarter way to do it.

The basic algorithm is pretty easy.  Given a number, X, the following will find the next prime number:

V = X
if (V % 2) == 0
    V += 1
while (!IsPrime(V))
{
    V += 2
}
// at this point, V is the next prime number >= X

The simplest way to determine if a number is prime is to see if it’s evenly divisible by any number from 2 up to and including the square root of the number in question.  That is, to determine if 101 is prime, you try dividing by 2, 3, 4, 5, 6, 7, 8, 9, and 10.  If any division results in a remainder of 0, then you know that the number is not prime and you can stop testing.  Only if all the tests fail do you know that the number is prime.  But the square root of one trillion is a million, and I didn’t relish the idea of doing many of millions of division operations to find the next prime number.

A much more efficient method to determine whether a number is prime is to divide it by all the prime numbers up to the square root.  It’s a simple optimization, right?  If the number is not evenly divisible by 2, then it certainly won’t be divisible by 4, 8, 296, or any other even number.  And if it’s not divisible by 5, it won’t be divisible by 10, 15, 20, or 2,465.

That insight can greatly decrease the amount of work you have to do in determining whether a number is prime.  After all, there are only 78,500 prime numbers between 0 and 1,000,000.  So the worst case goes from 500,000 division operations to 78,500.  The only catch is that you need to store all those prime numbers.  That costs some memory.  Naive encoding (four bytes per number) would take about 300K of RAM.  But you can use a table of bits and do it in about 125K.

I was halfway to building the table of primes when I decided to see just how long it takes to find the next prime using the naive method.  I quickly coded up the simplest version and got my answer.  On my machine using the naive method, it takes on average 30 milliseconds (3/100 second) to find the next prime number if the starting number is between 1 trillion and 2 trillion.  Granted, that’s a long time in computer terms.  But in context?

The reason I need to compute the next prime number is that in my hash table implementation the hash table size needs to be prime.  So if somebody asks for a table of 1 million items, I’ll give them 1,000,003.  And when I extend the table, I need to ensure that the new size is prime.  Resizing the table requires that I make a new array and then re-hash all the existing items.  That takes significant time when the table is large.  Fortunately, resizing is a relatively rare operation.

The point is that the function is called so infrequently, and the code that calls it takes so much time, that whatever cycles I spend computing the next prime won’t be noticed.  So the “slow,” simple, naive method wins.

I used to rant a lot about programmers spending their time optimizing the wrong thing, persuing local efficiency even when it won’t affect the overall program running time.  It’s embarrassing when I find myself falling into that trap.

August 10th, 2008

Paranoia versus productivity

We had an interesting discussion at the office about how much validation a collection type should do in its constructor. The key question, I think, came down to this:

If the constructor can determine that using the instantiated object will throw an exception, should the constructor fail rather than returning the instantiated object?

In other words, if I know that the instantiated object won’t work, shouldn’t I just throw the exception now, rather than let you be surprised later?

There are two extremes here: 1) the constructor should go to heroic efforts, and; 2) let the buyer beware. I tend to lean towards putting the onus on the caller, figuring that whoever is instantiating the object knows what he’s doing.  Let me provide an example.

Consider the .NET SortedList generic collection type. To do its job (that is, keep a collection of items sorted), it requires a comparison function. If you don’t specify a comparison function when you call the constructor, the collection uses the default comparison function for whatever type you specify as the key. This sounds simple enough, right? A list of employees that’s sorted by employee number, for example, would be defined like this:

SortedList<int, Employee> Employees =
    new SortedList<int, Employee>();

Since the System.Int32 type (which the C# int type resolves to) implements IComparable, everything works.

But imagine you have an EmployeeNumber type:

class EmployeeNumber
{
    public string Division { get; private set; }
    public int EmpNo { get; private set; }
    public EmployeeNumber(string d, int no)
    {
        Division = d;
        EmpNo = no;
    }
}

Now, if you create a SortedList that’s keyed on that type, you’ll have:

SortedList<EmployeeNumber, Employee> Employees =
    new SortedList<EmployeeNumber, Employee>();

Allow me to show the entire program here, so we don’t get confused.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace genericsTest
{
    class EmployeeNumber
    {
        public string Division { get; private set; }
        public int EmpNo { get; private set; }
        public EmployeeNumber(string d, int no)
        {
            Division = d;
            EmpNo = no;
        }
    }

    class Employee
    {
        public string Name { get; set; }
        public Employee(string nm)
        {
            Name = nm;
        }
    }

    class Program
    {
        static SortedList<EmployeeNumber, Employee> Employees =
            new SortedList<EmployeeNumber, Employee>();

        static void Main(string[] args)
        {
            Employees.Add(new EmployeeNumber("Accounting", 1),
                new Employee("Sue"));
            Employees.Add(new EmployeeNumber("Dev", 2),
                new Employee("Jim"));
        }
    }
}

If you compile and run that program, you’ll see that it throws an exception when it tries to add the second employee to the list. The program fails because it can’t compare the items. Neither implements IComparable.

Those who lean towards the first extreme above will argue that the SortedList constructor should determine that the key type doesn’t implement IComparable, and should prevent you from instantiating the collection. It should throw an exception because it knows that trying to add items to the collection will fail.

The constructor could do this. It’s possible for the constructor to get the default comparer and call it. If the comparison function returns a value, then all is well. If it fails, then the constructor throws an exception saying, “Sorry, but you didn’t supply a comparison function.”

The only problem with that scenario is that it’s wrong. Not wrong philosophically, but wrong in a very concrete sense. Extending my example will illustrate why.

Suppose you have two different types of employee numbers. Maybe an OldEmployeeNumber that looks like the one I defined above, and a NewEmployeeNumber that has different fields. Because you want to keep both employee number types in the same list, you define a base class, EmployeeNumberBase from which they both can inherit. The definitions would look like this:

abstract class EmployeeNumber : IComparable
{
    // Some common employee number functionality goes here.

    public int CompareTo(object obj)
    {
        throw new NotImplementedException();
    }
}

class OldEmployeeNumber : EmployeeNumber, IComparable
{
    public string Division { get; private set; }
    public int EmpNo { get; private set; }
    public OldEmployeeNumber(string d, int no)
    {
        Division = d;
        EmpNo = no;
    }

    int IComparable.CompareTo(object obj)
    {
        int rslt = 0;
        if (obj is OldEmployeeNumber)
        {
            var o2 = obj as OldEmployeeNumber;
            rslt = Division.CompareTo(o2.Division);
            if (rslt == 0)
                rslt = EmpNo.CompareTo(o2.EmpNo);
        }
        else if (obj is NewEmployeeNumber)
        {
            // OldEmployeeNumber sorts before NewEmployeeNumber
            rslt = -1;
        }
        return rslt;
    }
}

class NewEmployeeNumber : EmployeeNumber, IComparable
{
    public string Country { get; private set; }
    public decimal EmpNo { get; private set; }
    public NewEmployeeNumber(string c, decimal no)
    {
        Country = c;
        EmpNo = no;
    }

    int IComparable.CompareTo(object obj)
    {
        int rslt = 0;
        if (obj is NewEmployeeNumber)
        {
            var o2 = obj as NewEmployeeNumber;
            rslt = Country.CompareTo(o2.Country);
            if (rslt == 0)
                rslt = EmpNo.CompareTo(o2.EmpNo);
        }
        else if (obj is OldEmployeeNumber)
        {
            // NewEmployeeNumber sorts after OldEmployeeNumber
            rslt = 1;
        }
        return rslt;
    }
}

Yeah, I know. That’s quite a mouthful.

The EmployeeNumberBase class implements the IComparable interface, but its implementation just throws NotImplementedException. Furthermore, the class is marked as abstract to prevent it from being instantiated. Only derived classes can be instantiated.

The derived classes each explicitly implement the IComparable interface. The company-defined sorting rules are that old employee numbers always sort in the list before new employee numbers. Within the same type, the numbers are sorted using their own rules. [Note here that my CompareTo implementations aren't terribly robust. They'll return zero (equal) if the object passed is not of a known type, and they'll fail if the passed object is null. But those details aren't terribly relevant to the example.]

Now, the Employees list is created in exactly the same way:

SortedList<EmployeeNumber, Employee> Employees =
    new SortedList<EmployeeNumber, Employee>();

We can then add items to the list:

Employees.Add(new NewEmployeeNumber("USA", 2.002m),
    new Employee("Jim"));
Employees.Add(new OldEmployeeNumber("Accounting", 1),
    new Employee("Sue"));
Employees.Add(new OldEmployeeNumber("HR", 3),
    new Employee("Dana"));

If you make those changes and run the program, you’ll see that it does indeed run, and work as expected, and I didn’t change the comparison function that the constructor sees.

If SortedList had attempted to protect me from myself–that is, call the default comparison function and throw an exception because the comparison function had failed–then this final code would not work. By trying to protect me from myself, it would have prevented me from doing what I wanted to do.

Understand, the above is something of a contrived example. I certainly can’t imagine implementing the employee list that way, even if I did have different employee number types. But somebody else might think it’s a perfectly reasonable thing to do. The point is that there could be very good reasons for instantiating a keyed collection with a key type that does not have a valid comparison function. The constructor cannot know if comparisons will fail.

Which brings us back to the original question: how hard should a collection class (or any library object) try to prevent you from instantiating an object that will fail? In my opinion, the constructor should instantiate the object if the immediate parameters look reasonable. My reasoning is that it’s extremely difficult, if not impossible, to know how the caller will be using the class. As you saw above, making broad assumptions about types in a polymorphic environment can be fatal.

This reasoning extends far beyond the question of how a collection class’s constructor should behave. As programmers, we have to strike a balance between paranoia and productivity. We have to decide daily how much trust to put in the code that calls our methods, and how much we can depend on the code we call. Do we write classes that hold the programmer’s hand to help him across the street, or do we provide a “walk” signal and a warning that says, in effect, “If you cross on red, all bets are off”?

August 4th, 2008

Multicore Crisis?

There’s been some talk recently of the next “programming crisis”: multicore computing. I’ll agree that we should be concerned, but I don’t think we’re anywhere near the crisis point. Before I address that specifically, I think it’s instructive to review the background: why multicore processors exist, how they affect existing software, and the issues involved in writing code to make use of multiple cores.

Moore’s law has been quoted and misquoted so often that it’s almost a cliché. His original statement was simply an observation on the rate at which transistor counts were increasing on integrated circuits, and that he expected the trend to continue for at least 10 years. That was 1965. The trend has continued, and there’s no indication that it will slow.

Some people think Moore’s Law has become something of a self-fulfilling prophecy: because we believe that it’s possible, somehow we strive to make it so. One wonders what would have happened if Moore had said that he expected the rate of growth to increase. Would transistor densities have increased at an exponential rate?

Self-fulfilling prophecy or not, it’s almost certain that the trend in increasing transistor densities will continue (it has through 2007) and that as a result we’ll get ever more powerful CPUs as well as faster, higher-capacity RAM. Absolute processor speed as measured by clock rate will continue to increase, but not at the astounding rates that we saw up to 2005 or so. Quantum effects and current leakage have put a little damper on the rate of growth there. Better materials will solve the problem–are solving the problem–but absent a fundamental breakthrough by the chemists working on the problem, clock speeds won’t be doubling every 18 months like they had been in the recent past. The Clock Speed Timeline graph makes this quite evident.

Today’s trend is towards multiple cores on a single processor, running at a somewhat slower clock rate. The machine I’m writing this on, for example, has a quad-core Intel Xeon processor running at 2 GHz. The clock speed is somewhat slower than you can get in a high-end Pentium, but the multiple cores provide more total computing power. Quad core processors today are quite common. Intel demonstrated an 80-core chip in February of 2007, and promised to deliver it within five years. I fully expect to have a 256-core processor in my desktop computer ten years from now.

The trend towards multiple cores and very slowly increasing clock rates has some interesting ramifications for software developers. In the past, we have depended on more RAM and faster processors to give us some very nice performance boosts. All indications are that the amount of available RAM and the size of on-chip caches will continue to grow, but we can’t count on the biannual doubling of processor speed. Unless we learn to write programs that use multiple cores, we will soon reach a very real performance ceiling.

Not all applications can benefit from multiple cores, but you’d be surprised at how many can. And even in those cases when a single program can’t make use of multiple cores, users still benefit from having a multicore processor because the machine is better at multi-tasking. Imagine running four virtual machines on one computer, for example. If the computer has a single processor core, all four virtual machines and all of the operating system services share that one core. On a quad-core processor, the work load is spread out over all four cores. The result is more processor cycles per virtual machine, meaning that all four virtual machines should run faster.

Software systems that consist of multiple mostly-independent processes can make good use of multicore processors without any modification. Consider a system consisting of two services that are constantly running. On a single-core computer, only one can actually be working at a time. You could almost double performance simply by upgrading to a dual-core processor. Such software systems are quite common, and they require no code changes in order to benefit immediately from the new multicore processor designs.

Contrary to popular belief, writing code that is explicitly multi-threaded–designed to take advantage of multiple cores–isn’t necessarily a huge step up in complexity. Such code can be much more complex than single-threaded code, but it doesn’t have to be. Some programs are more multi-threaded than others. I’ve found it useful to think of programs in terms of the following four levels of complexity:

  1. No explicit multi-threading.
  2. Infrequent, mostly independent asynchronous tasks.
  3. Loosely coupled cooperating tasks.
  4. Tightly coupled cooperating tasks.

Obviously, it’s impossible to draw exact boundaries between the levels, and many programs will use features found in two or more of the levels. In general, I would classify a program by the highest level of multi-threading features that it uses.

Level 1 requires little in the way of explanation. This is the most common type of application in use today. In a batch mode program, execution proceeds sequentially from start to finish. In a GUI program, user interface events and processing execute on the same thread. This type of application has served us well over the years.

Most Windows programmers have some experience with the next level of complexity. A GUI application that performs background processing and periodically updates the display is an example of this type of program. Typically, the program starts the background process, which from time to time raises events which the GUI thread handles and updates the display. Data and process synchronization between tasks is limited to the event handlers that respond to asynchronous events. Modern development environments make it very easy to create such programs. These programs can benefit from multiple processor cores because the background thread can operate independently of the the GUI thread, making the GUI thread much more responsive.

I have found the third level of complexity–loosely coupled cooperating tasks–to be a very useful and relatively simple way to make use of multiple cores. The idea is to construct a program that operates in an assembly line fashion. For example, consider a program that gathers input, does some complex processing of the input data, and then generates some output. Many such programs are processor bound. If you structure the program such that it maintains an input queue, a pool of independent worker threads, and an output queue, then there is little danger of running into the problems that often plague more complex programs. You have to supply synchronization (mutual exclusion locks, or similar) on the input and output queues, but the worker threads operate independently. Using this technique on a quad-core processor, it’s possible to get an almost 4x increase in throughput over a single-core processor, with very little danger of running into resource contention issues.

Written correctly, programs that have multiple tightly-coupled cooperating tasks make the best possible use of processor resources. However, explictly coding thread synchronization is perhaps the most difficult type of programming imaginable. Forgetting to lock a resource before accessing it can lead to unexplained crashes or data corruption. Holding a lock for too long can create a performance bottleneck. Locks that are too granular increase complexity and also the chance for deadlock situations. Locks that are not granular enough will stall worker threads. Race conditions are endemic. Assuming you get such a program working, even a small change will often cause new, unanticipated problems. Writing this kind of code is hard. You’re much better off re-thinking your approach to the problem and casting it as a Level 3 problem. Whatever price you pay in performance will be returned many fold in increased reliability and reduced development time.

If you’re writing a Level 3 or Level 4 program, you should very seriously consider using a existing multi-tasking library if at all possible. Doing so will require that you think about your problem differently, but you leverage a lot of known-working code that is almost certainly more robust in all ways than what you’re likely to write yourself in the time allotted. Two good examples of such libraries are the Parallel Extensions to .NET 3.5 and the Java Parallel Processing Framework. Such libraries exist for many other programming environments. Although still in their infancy, these libraries promise to greatly simplify the move to multicore. If you’re contemplating development of a program that makes good use of multiple cores, you definitely should learn about any parallel computing libraries that support your platform.

So, back to the crisis. Bob Warfield over at SmoothSpan Blog has had and continues to have quite a lot to say about it, and many others share his sentiments. I, on the other hand, don’t think we’re anywhere near the crisis point. Nor do I think we’re likely to get there. Whereas it’s true that most current software isn’t multicore ready, software developers have understood for several years now that they need to begin writing applications that take advantage of multiple processor cores. It’s likely that some shops have taken an ad hoc approach to the problem, and they’re probably suffering with the issues I pointed out above. It’s also likely that many (and I would hope, most) development shops have done the prudent thing and adopted a parallel computing library that takes care of the difficult areas, leaving the programmers to worry about their specific applications. Doing so is no different than adopting an operating system, development environment, GUI library, report generator, or any other third party component–something that development shops have long experience with.

In short, the multicore “crisis” that the doomsayers are warning us about is almost a non-issue. It’s going to require a small amount of programmer retraining and there will undoubtedly be a temporary plateau in the rate at which our processing of data increases, but in a very short time we’ll again have mainstream applications that push all this fancy hardware to its limits.

July 21st, 2008

C# and .NET: What’s Next?

About 10 days ago, MSDN’s Channel 9 site released an hour-long video entitled Meet the Design Team, that talks in very vague terms about uncoming features in C# 4.0.  You’ll learn that the language will include more dynamic constructs and built-in support for multiple cores.  Honestly, that’s about all you’ll learn from watching the video.  Granted, either one of those broad features implies many changes to the language and to the underlying runtime.

Improvements to the language are all well and good, but given the choice I’d rather have them address some fundamental runtime issues:  the two-gigabyte limit, and garbage collection.  Both of these issues have caused me no end of grief over the past year.

All things considered, the .NET garbage collector is a definite win.  It handles the majority of memory management tasks much better than most programmers.  It’s not impossible to create a memory leak in a .NET program, but you really have to try.  Unfortunately, garbage collection is not free.  You’ll find that out pretty quickly if you write a long-running program that does a lot of string manipulation.  For example, take a look at this clip, which shows bandwidth usage from a Web crawler written in .NET:

Those times of zero bandwidth usage you see coincide with the garbage collector pausing all the threads to clean things up.  We lose somewhere around 10% of our potential bandwidth usage due to garbage collection.  This particular graph is from a dual-core machine.  The graph looks the same on a quad-core processor.

Obviously, they’ll have to do something about the garbage collector if they’re going to support multiple cores.  No amount of multi-core support in the language or in the runtime will do me a bit of good if every core stops whenever the garbage collector kicks in.

I’ve mentioned the .NET two-gigabyte limit before.  The 64-bit runtime has access to as much memory as you can put in a machine, but no single object can be larger than two gigabytes.  When you’re working with data sets that contain hundreds of millions of items, that’s just not acceptable.  When $2,000 will buy you a machine with 16 gigabytes of memory, it’s time that the .NET runtime give me the ability to allocate an object that makes use of that capacity.

I’m happy to see the team continue improving the C# language.  I’ll undoubtedly find many of their improvements useful.  But no amount of language improvement will increase my productivity if I’m hamstrung by the absurd limit on individual object size and the garbage collector continues to eat my processor cycles.

Unfortunately, we’ll have to wait a bit longer before we know what all will be included in the next versions of C# and .NET.  Microsoft is keeping pretty quiet, apparently in an attempt to make a big splash at the Professional Developer’s Conference in October.

Anybody care to pay my way to the conference?

July 16th, 2008

Exceeding the Limits

We generate a lot of data here, some of which we want to keep around. Yesterday I noticed that I was running out of space on one of my 750 GB archive drives and figured it was time to start compressing some of the data. The data in question is reasonably compressible. A quick test with Windows’ .zip file creator indicated that I’d get a 30% or better reduction in size.

The data is generated on a continuous basis by a program that is always running.  The program rotates its log once per hour, and the hourly log files can be anywhere from 75 to 200 megabytes in size.  Figuring I’d reduce the number of files while also compressing the data, I wrote a script that uses INFO-ZIP’s Zip utility to create one .zip file for each day’s data.

And then I hit a wall.  It seems that the largest archive that Zip can create is 2 gigabytes.  As their FAQ entry about Limits says:

While the only theoretical limit on the size of an archive is given by (65,536 files x 4 GB each), realistically UnZip’s random-access operation and (partial) dependence on the stored compressed-size values limits the total size to something in the neighborhood of 2 to 4 GB. This restriction may be relaxed in a future release.

With 24 files ranging in size from 75 to 200 megabytes, it’s inevitable that some days will generate more than 3 gigabytes of data.  At about 30% compression, that’s not going to fit into the 2 GB file.

My immediate solution will be to compress the files individually.  It’s less than ideal, but at least it’ll give me some breathing room while I look for a new archive utility.

I’m surprised that in today’s world of cheap terabyte-sized hard drives, the most popular compression tools have the same limitations they had 20 years ago.  Every modern operating system has supported files larger than 4 gigabytes for at least 10 years.  It’s time our tools let us use that functionality.

I’m in the market for a good command-line compression/archiver utility that has true 64-bit file support.  Any suggestions?

July 14th, 2008

Going Too Far Back

The other day I intended to close a Remote Desktop window and instead hit the Close button (the X on the right of the window’s caption bar) on the console window running our data broker. Nothing like an abnormal exit to bring the whole house of cards tumbling down.

So I went looking for a way to prevent that particular problem from occurring again. Disabling the Close button is pretty easy. In fact, there are at least two ways to do it. Neither is ideal.

The Close button is on the window’s system menu. You can get a handle to the system menu by calling the GetSystemMenu Windows API function. In addition to the buttons on the window’s caption bar, this menu also contains the menu items you see if you click on the box at the left of the window:

Given a handle to the system menu, you have (at least) two choices:

  1. Call EnableMenuItem to disable the caption bar’s Close button.
  2. Call DeleteMenu to remove the Close item from the menu. Doing so will also disable the Close button on the caption bar.

The second option looks like the best, because it prevents me from hitting the Close button, and also prevents me from inadvertently clicking the Close menu item when I’m going for Edit. The C# code for the second option looks like this:

[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr GetConsoleWindow();

[DllImport("user32")]
private static extern IntPtr GetSystemMenu(IntPtr hWnd, bool bRevert);

[DllImport("user32")]
private static extern bool DeleteMenu(IntPtr hMenu, uint uPosition, uint uFlags);

private const int MF_BYPOSITION = 0x0400;

static void Main(string[] args)
{
    // Get the console window handle
    IntPtr winHandle = GetConsoleWindow();

    // Get the system menu
    IntPtr hmenu = GetSystemMenu(winHandle, false);

    // Delete the Close item from the menu
    DeleteMenu(hmenu, 6, MF_BYPOSITION);

    // rest of program follows
}

That works well, as you can see from this screen shot:

But there’s a problem. To restore the menu when your program is done, you’re supposed to call GetSystemMenu and pass true for the second parameter, telling it to restore the menu, like this:

GetSystemMenu(winHandle, true);

The result is probably not what you expect:

The system didn’t revert to the previous menu, but rather to the default system menu–the one created for every window. The Edit, Defaults, and Properties items that cmd.exe adds to the menu are gone.

Since I can’t reliably restore the menu after deleting an item, I figured I’d call EnableMenuItem to disable the Close item. Unfortunately, that doesn’t appear to be possible. At least, I haven’t been able to make it work. Since I often need the Edit menu item even after the program exits, I’m going with the first option and hoping that I don’t hit the Close menu item by mistake when going for the Edit menu while the program is running.

An aside: we have the term “fat finger” to describe hitting the wrong key on the keyboard. Is there a similar expression for making a mistake with the mouse? I suppose “mis-mouse” would do, but it doesn’t have quite the same ring to it as “fat finger.”

June 11th, 2008

Can’t Configure Windows DNS Resolver Cache

In experimenting with the program I described yesterday, I got to fiddling with the DNS resolver cache, called dnscache. Briefly, dnscache saves the results from recent DNS queries so that it doesn’t have to keep querying the DNS server. Considering that a DNS query can take 100 milliseconds or more to resolve, this can save considerable time. For example, for your browser to load this Web page, it has to make many different requests to my server: one for the base page, one for the stylesheet, one for each image, etc. It wouldn’t be uncommon to require a dozen separate requests to get all the resources that make up the page. If each resource required a separate DNS request, it would take more than a second just for DNS!

I got to wondering just how large the DNS cache is. A little bit of searching brings up any number of pages claiming that you can “speed up your connection” by tweaking the DNS resolver cache parameters. Specifically, they talk about changing registry keys for the cache hash table size, maximum time to live, etc. There’s even a Microsoft TechNet article describing these parameters for Windows Server 2003 (and, by extension, Windows XP). It’s interesting to note that the information on most of the pages claiming to speed things up conflicts rather badly with the information in the TechNet article.

After reading the tweaks and the TechNet article, I figured I’d give it a shot. I fired up the Registry Editor, made the changes, and … is it working? How can I tell? I tried browsing a few Web sites, but I couldn’t see any difference.

A little more searching and I found the command ipconfig /displaydns. This writes the contents of the DNS resolver cache to the console. A little work with the FIND utility, and I was able to count the number of entries in the cache. 34 on my Windows XP box. Interesting, considering that I set the CacheHashTableSize registry entry to over 7,000. I fiddled and tweaked, restarted the DNS Client service, flushed the cache, rebooted my computer, faced Redmond and cursed, and generally tried everything I could think of. No matter what settings I used, I always ended up with between 30 and 40 entries in my DNS cache.

On my Windows Server 2008 machine at the office, I always got between 270 and 300 entries, no matter what I tried.

So that leaves me with the following possibilities:

  1. It’s not possible to change the size of the DNS resolver cache in Windows XP or Windows Server 2008.
  2. It is possible, but the documentation is wrong.
  3. The documentation is correct as far as it goes, but it’s incomplete.
  4. The documentation is correct and complete, but I’m too dumb to make sense of it.
  5. The documented registry entries actually changed the size of the cache, but ipconfig isn’t showing me all the entries that are in the cache.

At this point, all possibilities seem almost equally likely. I could do some indirect testing based on the amount of time it takes to resolve a series of DNS requests, but even that would be inconclusive. There are no documented API calls that allow me to examine the DNS cache or its size. (And the undocumented ones aren’t described well enough to be worth checking out.) My only means of seeing what’s in the cache is the ipconfig tool.

So I ask: does anybody know how to change the size of the Windows DNS resolver cache and prove that those changes actually work? Do I have to restart the DNS Client service? Reboot the machine? Set some super magic registry entry?

Any information greatly appreciated.

June 10th, 2008

Is this really asynchronous?

I’ve been working on a relatively simple program whose purpose is to see just how fast I can issue Web requests. The idea is to get one machine hooked directly to an Internet connection and see how many concurrent connections it can maintain and how much bandwidth it can consume. A straight bandwidth test is easy: just start three or four Linux distribution downloads from different sites. That’ll usually max out a cable modem connection.

But determining the sustained concurrent connection rate is a bit more difficult. It requires that you issue a lot of requests, very quickly, for an extended period of time. By slowly increasing the number of concurrent connections and monitoring the bandwidth used, I should be able to find an optimum range of request rates: one that makes maximum use of bandwidth, but doesn’t cause requests to timeout.

My Web crawler does something similar, but it also does a whole lot of other things that make it impractical for use as a diagnostic tool.

I got the program up and limping today, and was somewhat surprised to find that it couldn’t maintain more than 15 concurrent connections for any length of time. Considering that my crawler can maintain 200 or more connections without a problem, I found that quite curious. It had to be something about the different way I was issuing requests.

Because this is a simple tool, I figured I’d use the .NET Framework’s WebClient component to issue the requests. In order to avoid the overhead of constructing a new WebClient for every request, I initialized 100 WebClient instances to be served from a queue, and then issued the requests in a loop, kind of like this:

while (!shutdown)
{
    if (currentConnections < MaxConnections)
    {
        WebClient cli = GetClientFromQueue();
        ++currentConnections;
        cli.DownloadStringAsync(GetNextUrlFromQueue());
    }
}

The actual code is a bit more involved, of course, but that’s the gist of it. The currentConnections counter gets decremented in the download completed event handler.

The important thing to note here is that I’m issuing asynchronous requests. The call to DownloadStringAsync executes on a thread pool thread. This code should issue requests at a blindingly fast rate, and keep the number of concurrent connections right near the maximum. Even with MaxConnections set to 50, the best I could do was 20 concurrent–and that for only a very short time. Most often I had somewhere between 10 and 15 concurrent connections.

After eliminating everything else, I finally got around to timing just how long it takes to issue that asynchronous request. The result was pretty surprising: in my brief tests, it took anywhere from 0 to 300 milliseconds to issue those requests. The average seemed to be around 100 or 150 ms. That would explain why I could only keep 10 or 15 connections open. If it takes 100 ms to issue a request, then I can only make 10 requests per second. Since it takes about 2 seconds (on average) to complete a request, the absolute best I’ll be able to do is 20 concurrent requests.

So I got to thinking, why would it take 100 milliseconds or more to issue an asynchronous Web request? And the only reasonable answer I could come up with was DNS: resolving the domain name. And it turns out I was right. I flushed the DNS cache and ran my test by requesting a small number of URLs from different domains. Sure enough, it averaged about 150 ms per request. I then ran the program again and it took almost no time at all to issue the requests. Why? Because the DNS cache already had those domain names resolved. Just to make sure, I flushed the DNS cache again and re-ran the test.

By the way, the HttpWebRequest.BeginGetResponse method (the low-level counterpart to WebClient.DownloadStringAsync) exhibits the same behavior. That’s not surprising, considering that WebClient uses HttpWebRequest to do its thing.

This is a fatal flaw in the design of the .NET Framework’s support for asynchronous Web requests. The whole idea of supplying asynchronous methods for I/O requests is to push the waiting off on to background threads so that the main thread can continue processing. What’s the use of providing an asynchronous method if you have to wait for a high latency task like DNS resolution to complete before the asynchronous request is issued? Why can’t the DNS resolution be done on the thread pool thread, just like the actual Web request is?

There is a way around the problem: queue a background thread to issue the asynchronous request. Yes, I know it sounds crazy, but it works. And it’s incredibly easy to do with anonymous delegates:

ThreadPool.QueueUserWorkItem(delegate(object state)
    {
        cli.DownloadStringAsync((Uri)state);
    }, GetNextUrlFromQueue());

That spawns a thread, which then issues the asynchronous Web request. The time waiting for DNS lookup is spent in the background thread rather than on the main processing thread. It looks pretty goofy, and unless it’s commented well somebody six months from now will wonder what I was smoking when I wrote it.