How did this happen?

Last time I showed two different implementations of the naive method for generating unique coupon codes. The traditional method does this:

    do
    {
        id = pick random number
    } until id not in used numbers
    code = generate code from id

The other is slightly different:

    do
    {
        id = pick random number
        code = generate code from id
    } until code not in used codes

Before we start, understand that I strongly discourage you from actually implementing such code. The naive selection algorithm performs very poorly as the number of items you choose increases. But the seemingly small difference in these two implementations allows me to illustrate why I think that the second version is broken in a fundamental way.

Think about what we’re doing here. We’re obtaining a number by which we are going to identify something. Then we’re generating a string to express that number. It just so happens that in the code I’ve been discussing these past few entries, the expression is a six-digit, base-31 value.

So why are we saving the display version of the number? If we were going to display the number in decimal, there’d be no question: we’d save the binary value. After all, the user interface already knows how to display a binary value in decimal. Even if we were going to display the number in hexadecimal, octal, or binary, we’d store the binary value. We certainly wouldn’t store the string “3735928559”, “11011110101011011011111011101111”, or “DEADBEEF”. So, again, why store the string “26BB62” instead of the 32-bit value 33,554,432?
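
To make the point concrete, here is what the UI-side conversion might look like. This is a hedged Python sketch (the production code under discussion was T-SQL), using the 31-character digit set from the system in question; the function name is mine:

```python
ALPHABET = "123456789ABCDFGHJKLMNPQRSTVWXYZ"  # 31 symbols: no 0, E, I, O, or U

def to_display_code(n: int) -> str:
    """Render an integer id as a six-digit base-31 string, UI-side only."""
    digits = []
    for _ in range(6):                 # always exactly six digits
        digits.append(ALPHABET[n % 31])
        n //= 31
    return "".join(reversed(digits))
```

Called with 33,554,432 it produces “26BB62”; the database never needs to see anything but the integer.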

I can’t give you a good reason to do that. I can, however, give you reasons not to.

  1. Storing a six-character string in the database takes more space than storing a 32-bit integer.
  2. Everything you do with a six-character code string takes longer than if you were working with an integer.
  3. Display logic should be part of the user interface rather than part of the database schema.
  4. Changing your coding scheme becomes much more difficult.

The only argument I can come up with for storing codes rather than integer identifiers is that with the former, there’s no conversion necessary once the code is generated. Whereas that’s true, it doesn’t hold up under scrutiny. Again, nobody would store a binary string just because the user interface wants to display the number in binary. It’s a simple number base conversion, for crying out loud!

If you’re generating unique integer values for database objects, then let the database treat them as integers. Let everybody treat them as integers. Except the users. “26BB62” is a whole lot easier to read and to type than is “33554432”.

I’ll be charitable and say that whoever came up with the idea of storing the display representation was inexperienced. I’m much less forgiving of the person who approved the design. The combination of the naive selection algorithm, storing the generated coupon code rather than the corresponding integer, and the broken base conversion function smacks of an inexperienced programmer working without adult supervision. Or perhaps an inexperienced programmer working with incompetent adult supervision, which amounts to the same thing.

The mere existence of the naive selection algorithm in this code is a sure sign that whoever wrote the code either never studied computer science, or slept through the part of his algorithms class where they studied random selection and the birthday paradox (also known as the birthday problem). The comment, “There is a tiny chance that a generated code may collide with an existing code,” tells us all we need to know of how much whoever implemented the code understood about the asymptotic behavior of this algorithm. Unless your definition of “tiny” includes a 10% chance after only one percent of the codes have been generated.

The decision to store the generated code string surprises me. You’d think that a DBA would want to conserve space, processor cycles, and data transfer size. Certainly every DBA I’ve ever worked with would prefer to store and index a 32-bit integer rather than a six-character string.

The way in which the base conversion function is broken is hard evidence that whoever modified it was definitely not an experienced programmer. I suspect strongly that the person who wrote the original function knew what he was doing, but whoever came along later and modified it was totally out of his depth. That’s the only way I can explain how the function ended up the way it did.

Finally, that all this was implemented on the database looks to me like a case of seeing the world through database-colored glasses. Sure, one can implement an obfuscated unique key generator in T-SQL. Whether one should is another matter entirely. Whoever did it in this case shouldn’t have tried, because he wasn’t up to the task.

However this design came to exist, it should not have been approved. If there was any oversight at all, it should have been rejected. That it was implemented in the first place is surprising. That it was approved and put into production borders on the criminal. Either there was no oversight, or the person reviewing the implementation was totally incompetent.

With the code buried three layers deep in a database stored procedure, it was pretty safe from prying eyes. So the bug remained hidden for a couple of years. Until a crisis occurred: a duplicate code was generated. Yes, I learned that the original implementation didn’t even take into account the chance of a duplicate. Apparently somebody figured that with 887 million possible codes, the chance of getting a duplicate in the first few million was so remote as to be ignored. Little did they know that the chance of getting a duplicate within the first 35,000 codes is 50%.
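
That 50% figure falls straight out of the standard birthday-problem approximation, p ≈ 1 − e^(−k(k−1)/2N). A quick check in Python (my illustration, not part of the original system):

```python
import math

def collision_probability(k: int, n: int) -> float:
    # Approximate probability that at least two of k randomly chosen
    # values collide when drawn uniformly from n possibilities.
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * n))

# 35,000 codes drawn from the 887,503,681 possible six-digit base-31 codes
p = collision_probability(35_000, 887_503_681)
```

The result is right around 0.5, exactly as claimed.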

I also happen to know that at this point there was oversight by a senior programmer who did understand the asymptotic behavior of the naive algorithm, and yet approved “fixing” the problem by implementing the duplicate check. He selected that quick fix rather than replacing the naive algorithm with the more robust algorithm that was proposed: one that does not suffer from the same pathological behavior. The combination of arrogance and stupidity involved in that decision boggles the mind.

The history of this bug is rooted in company culture, where the database is considered sacrosanct: the domain of DBAs, no programmers allowed, thank you very much. And although programmers have access to all of the source code, they aren’t exactly encouraged to look at parts of the system outside of their own areas. Worse, they’re actively discouraged from making suggestions for improvement.

In such an environment, it’s little surprise that the horribly broken unique key generation algorithm survives.

So much for that. Next time we’ll start looking at good ways to generate unique, “random-looking” keys.

A broken unique key generator

Hanlon’s Razor: “Never attribute to malice that which is adequately explained by stupidity.”

Recap: in my previous post I showed the naive method of generating unique “random-looking” six-character keys, and explained why it wasn’t a good solution for generating hundreds of millions of keys. I also mentioned that I used to think that the naive method was the worst possible way somebody would try to use for generating keys, and promised that I would show a worse way in today’s post.

If you recall, the naive method does this:

    do
    {
        generate a random number
    } until you find one that hasn't been used
    mark the number as used
    create the coupon code for that number

The worse way that I mentioned attempts to do this:

    do
    {
        generate a random number
        create the coupon code for that number
    } until you find a coupon code that hasn't been used
    mark the code as used

The only difference in the overall algorithm is that the second method uses the coupon code rather than the random number to decide whether a code has been used. That might seem like a minor difference, but it’s really a major change in thinking. It’s also more expensive and yet another place for bugs to creep in.
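
The distinction is easier to see in code. In this hypothetical Python sketch, the first loop guarantees unique ids no matter what the encoder does, while the second inherits any defect in the encoder; the deliberately lossy encoder here exaggerates the effect to make the point:

```python
import random

def pick_by_id(used_ids: set, encode) -> str:
    # Method 1: uniqueness is decided on the raw id.
    while True:
        n = random.randrange(1_000_000)
        if n not in used_ids:
            used_ids.add(n)
            return encode(n)

def pick_by_code(used_codes: set, encode) -> str:
    # Method 2: uniqueness is decided on the encoded string.
    while True:
        code = encode(random.randrange(1_000_000))
        if code not in used_codes:
            used_codes.add(code)
            return code

def lossy(n: int) -> str:
    # A deliberately non-injective encoder: only 100 possible outputs.
    return f"{n % 100:06d}"

used_codes = set()
for _ in range(100):
    pick_by_code(used_codes, lossy)
# Method 2's keyspace is now exhausted: it can never yield another code.

used_ids = set()
for _ in range(200):
    pick_by_id(used_ids, lossy)
# Method 1 keeps handing out unique ids regardless of the encoder.
```

With a correct encoder the two behave identically; the difference only shows up when the encoding is broken, which is precisely the situation described below.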

Let’s defer the discussion of why the second method is worse than the first. In preparation, I want to provide a real-life example of such an implementation.

I ran across this code in a production database that’s used to generate millions of six-character coupon codes every week. When I first saw it I was surprised that somebody would use the naive method to generate unique codes. But then I studied it a little closer and was astonished at how incompetently it was written. If I were the suspicious type, I’d be tempted to think that the person who wrote (or, possibly, modified) it actually broke it on purpose. But then I remembered Hanlon’s Razor.

Here is part of a database stored procedure that implements the second algorithm I showed above.

    WHILE (@codesGenerated < @NumberToGenerate AND @Attempts < @MaxAttempts)
    BEGIN
        BEGIN TRY
            INSERT INTO [dbo].[CouponCodes]
                    ([CouponNamespaceId], [CouponId], [Code], [Used])
                VALUES
                    (@CouponNamespaceId,@CouponId,LTRIM(RTRIM([dbo].[GenerateRandomBase32](ABS(CHECKSUM(NEWID()))))), 0)

            -- If a collision happened in the Insert above then this @codesGenerated will (correctly) not happen
            SET @codesGenerated = @codesGenerated + 1  

        END TRY
        BEGIN CATCH 
            -- There is a tiny chance that a generated code may collide with an existing code.
            -- which will violate the unique constraint.
            -- Error handling code is here.
        END CATCH

        --Always increment @Attempts to prevent an infinite loop if there are too many collisions
        SET @Attempts = @Attempts + 1
    END

This is a T-SQL version of the modified naive method: pick a number, generate the coupon code, see if the coupon code has been used. In the interest of brevity, I’ve removed some irrelevant code and the error handling. It’s a reasonable implementation of that modified naive method, although I question the use of a GUID checksum as the source of random numbers. I’m also not convinced that the person who wrote the error handling understands that the ABS function will throw an exception if the parameter is -2147483648. It’s highly unlikely that CHECKSUM will generate that value, but I’ve been burned before by code that blindly calls ABS without either first checking the input parameter, or handling the exception.

Whereas I believe that those issues should be addressed, they’re pretty minor in light of what follows.

I’m also a bit amused at the comment: “There is a tiny chance that a generated code may collide with an existing code.” Again, it’s especially amusing in light of what follows.

Let’s also ignore for now the question of whether this code even belongs in the database layer. I think it’s a good discussion to have, but we’ll save it for another time.

The problem lies in GenerateRandomBase32, a SQL Server scalar function:

    ALTER FUNCTION [dbo].[GenerateRandomBase32]
    (
        @seed AS BIGINT
    )
    RETURNS CHAR(6)
    AS
    BEGIN 
        DECLARE @base BIGINT = 31
        SET @seed = @seed % 1040187391 + 33554432;
        DECLARE @answer AS CHAR(6) = '';
        DECLARE @allvalid AS VARCHAR(31);
        SET @allvalid = '123456789ABCDFGHJKLMNPQRSTVWXYZ';
        WHILE @seed > 0
            BEGIN 
                SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
                SET @seed = @seed / @base;
            END
        RETURN @answer;
    END

One problem is that the function is poorly named. The name “GenerateRandomBase32” implies that it’s going to generate a random value. At first I thought it was supposed to generate a random 32-bit value, but for reasons I’ll get to in a bit I’m now convinced that “32” was supposed to be the number base. The function doesn’t generate anything, there’s no randomness involved, and the number base is 31. A more appropriate name would be “ConvertToBase31,” or perhaps “ConvertToSixDigitBase31”. I realize that code changes over the years and this function might have done something quite different when it was originally written, but that’s no excuse for failing to change the function name to something meaningful.

On the face of it, the function looks like a T-SQL implementation of the base conversion method that I showed last time. That is, it takes a binary number and creates a base-31 representation of it. That’s what it does, in general, but there are a few oddities that will surprise you. The first is this expression at the beginning:

    SET @seed = @seed % 1040187391 + 33554432;

With that expression, the effective range of keys that the function can create base-31 representations of is 33,554,432 to 1,073,741,822: a total of 1,040,187,391 values. Remember, though, that there are only 887,503,681 possible six-character coupon codes. Some of those values are bound to produce duplicate codes.
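
A quick way to confirm that range (a hypothetical Python check; the name `mangle` is mine, not the original code's):

```python
def mangle(seed: int) -> int:
    # The T-SQL expression: @seed % 1040187391 + 33554432
    return seed % 1040187391 + 33554432

low = mangle(0)            # smallest possible result: 33,554,432
high = mangle(1040187390)  # largest possible result: 1,073,741,822
```

Note that `high` exceeds 887,503,681, the number of six-character base-31 codes, which is where the trouble starts.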

You might get the idea that this isn’t going to end well.

The peculiar thing about the expression is that, if we were generating a base-32 code, it would have the effect of preventing all codes that start with the first digit (presumably 0) from being generated. That is, if you sum the two magic numbers you end up with 1,073,741,823. It just so happens that 1,073,741,824 is the number of six-digit base-32 codes you can generate. I suspect that this code was originally written to generate base-32 codes, but for some reason later modified to eliminate ‘0’ from the valid digits, making it a base-31 code generator. But changing the number base without changing that expression has resulted in a horribly broken function.

I can’t prove that the function used to generate base-32 numbers, and was subsequently modified. I can only say that the evidence points strongly to that. I also can’t say for sure why one would want to prevent generation of codes that start with “0”. The only two reasons I can think of are:

  1. Somebody decided that numbers with leading 0’s were somehow difficult for users to handle.
  2. The programmer couldn’t figure out how to include leading 0’s when doing the base conversion.

At this point, I consider those two possibilities equally likely.

After mangling the input parameter, the code performs the standard base conversion logic, peeling out digits from lowest to highest, and creating a string. Note that each digit is prepended to the string. The code “7HNH61” would be constructed in the sequence “1”, “61”, “H61”, “NH61”, “HNH61”, “7HNH61”.

However, the range of values for which the function can generate codes exceeds the number of possible coupon codes. The function can generate codes for numbers greater than 887,503,681, which means that some of the codes generated will be seven characters long. They have to be truncated to six characters. How that truncation is done ends up making this code much worse than it appears at first.

If you pass the value 1040187360 to this function, it will generate seven characters. The first six characters generated are the code shown above, “7HNH61”. But then it generates a seventh character, “2”, which is prepended. Except the answer variable is declared as holding only six characters. So when this line of code is executed:

    SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;

The character “2” becomes the first character in the new string, and then the first five (from left to right) of the previous characters are copied. The result is “27HNH6”.

Now here’s the kicker. Guess what happens when you pass the value 1040187361. The function generates “27HNH62”, which is truncated to “27HNH6”. Duplicate! In fact, all 31 values from 1040187360 through 1040187390 generate that same code. That’s a huge oversight that seriously affects the duplicate rate.
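
To make the failure reproducible, here is a Python transcription of the broken function (hedged: T-SQL `CHAR(6)` assignment semantics are approximated by keeping only the leftmost six characters at each step):

```python
ALPHABET = "123456789ABCDFGHJKLMNPQRSTVWXYZ"  # the function's 31 valid digits

def generate_random_base32(seed: int) -> str:
    # Transcription of the broken T-SQL function.  The [:6] mimics
    # assignment to a CHAR(6) variable, which silently keeps the
    # leftmost six characters when the value is longer.
    seed = seed % 1040187391 + 33554432
    answer = ""
    while seed > 0:
        answer = (ALPHABET[seed % 31] + answer)[:6]
        seed //= 31
    return answer
```

Run it over the inputs 1,040,187,360 through 1,040,187,390 and all 31 of them collapse to the single code “27HNH6”.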

There are 186,238,142 numbers for which this function will generate a seven-character code that’s truncated. If you pass a number in the range 853,949,249 to 1,040,187,390 to this function, it will generate one of only 6,007,682 codes. Each of those codes is therefore 31 times more likely than average to be generated.

One other thing that affects the duplicate rate, albeit not very large, is that the function won’t ever generate a code for any number less than 33,554,432.

Combined, those two errors more than double the duplicate rate experienced by programs that call this function as part of the naive code generation algorithm.

Fixing that function really isn’t difficult, as I show below. The only real changes are to replace the seed mangling expression, and modify the loop so that it always generates exactly six characters.

    ALTER FUNCTION [dbo].[CreateSixDigitBase31]
    (
        @seed AS BIGINT
    )
    RETURNS CHAR(6)
    AS
    BEGIN 
        /*
         * This function generates a six-digit, base-31 representation of the passed seed parameter.
         * It is capable of generating 887,503,681 different codes (permutations of 31 items taken six
         * at a time, with repetition).
         * If the passed seed is larger than 887,503,680, the seed is taken modulo 887,503,681.
         * This function will not produce a useful value if seed is negative.
         */
        DECLARE @base BIGINT = 31
        SET @seed = @seed % 887503681;
        DECLARE @answer AS CHAR(6) = '';
        DECLARE @allvalid AS VARCHAR(31);
        SET @allvalid = '123456789ABCDFGHJKLMNPQRSTVWXYZ';
        DECLARE @counter as INT = 6;
        WHILE @counter > 0
            BEGIN 
                SET @answer = SUBSTRING(@allvalid, @seed % @base + 1, 1) + @answer;
                SET @seed = @seed / @base;
                SET @counter = @counter - 1;
            END
        RETURN @answer;
    END

I don’t particularly like that function because I suspect that the string manipulation is highly inefficient. We are in effect doing six string concatenations for every code generated. I know that doing things this way in a C# program would bring the program to a crawl, which is why the implementation in my previous post used a character array. I don’t know enough about how T-SQL manipulates strings to say how efficient this is, and I don’t know whether there’s a better alternative.
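
For comparison, here is the repaired logic rendered in Python (a sketch with a name of my choosing; it reduces the seed modulo 31^6 = 887,503,681 so that every six-digit code is reachable, and always emits exactly six digits):

```python
ALPHABET = "123456789ABCDFGHJKLMNPQRSTVWXYZ"  # 31 valid digits

def create_six_digit_base31(seed: int) -> str:
    # Always emits exactly six digits; distinct seeds below 31**6
    # always produce distinct codes.
    seed %= 31 ** 6          # 887,503,681 possible codes
    digits = []
    for _ in range(6):       # fixed six iterations, no truncation possible
        digits.append(ALPHABET[seed % 31])
        seed //= 31
    return "".join(reversed(digits))
```

The fixed iteration count is what eliminates the seven-character truncation bug, and the modulo change restores the full code space from “111111” through “ZZZZZZ”.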

With the fixed base conversion function, this naive unique code generation algorithm is about as good as it can be. Which isn’t very good. Not only does it suffer from the duplicates problem that I pointed out yesterday, it has two other, related, strikes against it: it’s implemented at the database level, and it stores the generated code rather than the seed number.

Next time I’m going to talk a bit about why I think implementing the code generation in the database is a bad idea. I’ll also touch on how it’s possible for such a horribly broken implementation to go unnoticed for so long.

Inconsistent error reporting

The systems I maintain call several different partner APIs to get player account information. We noticed recently that one of the APIs returns seemingly inconsistent error messages when presented with invalid player account numbers. For example, if we ask for information on account number 123456789, the system returns a 404, meaning that the account was not found in the system. But if we ask for account number 3420859494, the server returns a 500: internal server error. Very strange.

At first I thought it was that the account number was too long. Typically, the account numbers are eight digits, and the ones giving us errors were 10 digits in length. But when I requested account number 1420859494 (10 digits), I got a 404.

On a hunch, I tried the magic number 2147483647. Not surprisingly, that returned a 404. But 2147483648 returned a 500 error. What the heck?

The magic of 2147483647 is that it’s the largest signed integer that can be expressed in 32 bits. It appears that the API tries to convert the input to a 32 bit integer. If that conversion fails, the server throws an exception and returns a 500 internal server error.

This is the kind of error that unit testing is designed to ferret out. Had they tested the edge cases, they would have seen this inconsistent behavior and, one would hope, modified the code to handle it more reasonably.

I don’t know what to do about this. Their documentation doesn’t say specifically that the input must be a positive 32-bit integer, only that the value must be numeric. Based on the empirical evidence, I could put some validation on my end to ensure that users don’t enter account numbers larger than 2147483647, but there’s nothing preventing the API from changing underneath me and allowing larger numbers in the future. It’s unlikely that those in charge of the partner API would notify me of such a change.
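
Pending better documentation, a defensive client-side guard is at least cheap. This is a hypothetical Python sketch; the 2,147,483,647 limit is inferred from the observed behavior, not from anything the partner API documents:

```python
INT32_MAX = 2_147_483_647  # inferred limit, not documented by the API

def plausible_account_number(text: str) -> bool:
    # Reject inputs the partner API is known to choke on: non-numeric
    # strings and values that overflow a signed 32-bit integer.
    if not text.isdigit():
        return False
    return 0 < int(text) <= INT32_MAX
```

If the API is ever fixed to accept larger numbers, this check would start rejecting valid accounts, which is exactly the fragility described above.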

Debugging a water heater

About a month ago, Debra and I started noticing that hot water pressure was lower than normal.  At first I thought it was my imagination, but it steadily got worse.  My first hypothesis was sediment in the tank, which fit with what others online will say.  So I hooked up a hose, drained the tank, flushed it a bit, and then refilled it.  Still low pressure.

My next thought was to verify that the problem was with the water heater and not somewhere else in the pipes.  So I connected the cold water input directly to the hot water output, removing the water heater completely from the system.  The resulting high pressure from the hot water side confirmed that the problem was indeed with the water heater.

It took a little looking around, but I finally found the problem:  the 3″ brass nipple that connects the water heater with the hot water pipes in the house was clogged with sediment.  It was so clogged that I’m surprised any water was coming out.  I’m a little embarrassed that it took me so long to check that out.  But since I had so much time in the project and it looked like I wouldn’t have to replace the water heater, I decided to refurbish it a bit.

Electric heating elements for my water heater are about $10 each.  Those and two replacement nipples, plus gaskets for the input and output, and the special tool for removing the elements set me back a total of about $25.  It was a pretty big time investment, though.  Draining the water heater takes a long time.  A few things to keep in mind (some of which I learned from experience):

  • Before you drain your electric water heater or do any other work on it, turn off the circuit breaker.
  • If you don’t open a hot water faucet somewhere in the house, or open the T&P valve on the water heater, it’ll take almost forever to drain.  (I knew this one before, but thought I’d throw it in because a friend of mine ran into this problem.)
  • Do not try to remove the elements with pliers or a pipe wrench.  Pay the $8 for the element removal tool.
  • If the screwdriver you’re using as a handle for the element removal tool starts to bend, stop.  It’s likely you’ll break the screwdriver before the element comes out.  Find a longer and stiffer piece of metal to use as a handle.  Or dispense with the handle and wrap a pipe wrench around the element removal tool.  Works wonders.  (I was smart enough to stop when I saw the screwdriver shaft flexing.)
  • Be absolutely sure the water is below the level of the element before you try to remove it.  You will not believe how fast water can come pouring out of that hole, and you will not be able to screw the element back in with the water pouring out.  And the water is hot. (Yes, I’m guilty of this one.)
  • You can clean corrosion from heating elements by soaking them in vinegar.  If you decide to re-use your elements after cleaning and inspecting them, be sure to replace the rubber gasket.  Otherwise they will almost surely leak and you’ll have to drain the water heater again to remove and replace them.  The way I figure it, if I’m going to the trouble of draining the water heater, I’ll just replace the elements.

One other thing.  Electric water heaters contain a sacrificial anode rod that helps prevent corrosion of the tank.  The idea is that the anode, being a more active metal than what the tank is made of, will corrode first.  As long as there’s a more active metal than the tank’s metal, the tank won’t corrode (or will do so much more slowly).  Water heater warranties are typically based on how long the manufacturer thinks the anode rod will last.  You can replace the anode rod.  I haven’t tried it yet.

Most manufacturers recommend that you drain a few quarts from your water heater every three months (some say every month).  That will prevent sediment buildup in your tank.  They recommend draining the tank and inspecting the elements annually.  They also recommend an annual inspection of the anode rod.

By the way, elements that are covered in corrosion don’t work very well at all.  They require a lot more electricity to generate the same amount of heat as new elements.  Especially if you have very hard water, you’re probably money ahead if you replace the elements annually.  The money you save in electricity will more than offset the cost of the new elements, and your water will heat much faster.

Most people (myself included, usually) never think about their water heater until they have no hot water or they notice a leak.  That’s too bad, because with a little periodic maintenance a water heater can last 15 or 20 years rather than the five or so years that they typically last these days.  Considering the cost of a replacement water heater and installation (sometimes over $1,000) and the aggravation of a leaking heater or no hot water, you’re much better off with the periodic maintenance.

If you’re having a problem with your water heater, a good place to look for a solution is Waterheaterrescue.com.  Whereas it’s true that they’re trying to sell you things, they have very good information about common problems and simple solutions.  Oh, and in case you’re interested in how this stuff works: How Water Heaters Work.

Development tool trouble

In 25 years of working with computers, I’ve rarely been as frustrated as I am today.  At work I’m making some modifications to an old ASP (not .NET) project for a client, and am trying to debug. Visual InterDev debugging is kind of clunky at the best of times, but getting it set up is a nightmare!  I’m running Windows 2003 Server on my laptop here, and have been completely unsuccessful in getting Visual InterDev to stop at a breakpoint.  My biggest success so far has been totally screwing up my project so that I have to re-create it.

I had to fire up Visual Studio .NET and create a new Web project just to remember what it was like to have a reasonable development and debugging environment.  If you’ve never worked with Visual InterDev, be thankful. If you’re lucky enough to have moved from Visual InterDev to Visual Studio .NET, be thankful.  If you’re stuck working with Visual InterDev…well, I don’t actually suggest that you shoot yourself in the head, but doing so might be less painful.

Perhaps the most frustrating part of this experience is the complete lack of information in the MSDN Knowledge Base.  I can’t be the only developer who’s had trouble getting Visual InterDev to debug on Windows 2003.  The KB articles that do mention VI Debugging are specific to Windows 2000 or Windows XP, and IIS 5.  Guess what? Many settings either don’t exist in IIS 6, or they’ve been moved.  Shouldn’t Microsoft update their Knowledge Base?

Trace debugging reborn!

Back in the bad old days of computer programming, before the advent of integrated development environments (IDEs) and source-level debuggers, the most common debugging method was code tracing—placing output statements in the code that would write the state of the program or the name of the current procedure to the console or to a log file of some sort.  Placed correctly, those messages could give you good insight into your program’s operation.  The tough part was knowing where to put the trace statements so that you had enough information in the output log, but not too much.  It was a painful process, but it’s all we had at the time.  Well, there were symbolic debuggers, but most COBOL programmers didn’t understand enough assembly language to use them effectively.

With the advent of source level debuggers the fine art of program tracing was mostly lost.  Why dig through a log file to find out what your program is doing when you can just fire up the debugger and step through the code?  Source debuggers are useful tools, but a log file gives you something that a debugger can’t—a permanent record.  A log file can also be used (if you write your program to support it) to “replay” program execution.  If you write your program so that it records all inputs in the log file, then it should be a simple matter to write code that will read input from the log file to duplicate a session.  Combined with a source level debugger, the log file gives you a reliable way to duplicate most errors.
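
The replay idea in miniature (a hypothetical sketch of my own: inputs are appended to a JSON-lines log as they arrive, and the same log later drives an identical re-run through the same code path):

```python
import io
import json

def handle(request: dict) -> int:
    # Stand-in for real request processing.
    return request["a"] + request["b"]

def process(requests, log_file):
    # Record every input before acting on it.
    results = []
    for req in requests:
        log_file.write(json.dumps(req) + "\n")
        results.append(handle(req))
    return results

def replay(log_file):
    # Feed the logged inputs back through the same code path.
    return [handle(json.loads(line)) for line in log_file]

log = io.StringIO()
live = process([{"a": 1, "b": 2}, {"a": 5, "b": 7}], log)
log.seek(0)
replayed = replay(log)
```

Because the replayed run sees byte-for-byte the same inputs, it reproduces the same outputs, which is what makes a logged session so useful under a debugger.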

The lost art of program tracing is finding its way back to mainstream programming.  Operating systems have always had event log files where major events were stored, but until recently there was little attention paid to trace information.  The primary reason for renewed interest in tracing?  Web services and other programs that have no real user interface and are intended to run unattended for months or years at a time.  A log file is the only practical way to monitor such applications and diagnose problems.

The point?  I find it interesting how many times an older technology is supplanted by something “better,” and then is re-discovered a generation later and touted as the hottest new thing.  So very often replacing old technologies results in our losing something essential, but not realizing the loss until long after the old technology is mostly forgotten.

Is that a bug? Or a stupid programmer?

Slashdot ran a piece today about pet bugs—a programmer asking about “obscure, weird, and especially annoying bugs in languages/compilers/etc” that have frustrated developers.  In his query, the author writes:

Recently I found that Microsoft’s VB/VBScript(ASP) round function has problems (for example, ’round(82.845)’ returns ‘82.84’ instead of ‘82.85’). It took me an annoyingly long time to realize the problem wasn’t mine.

Never mind that this is the documented behavior of the Visual Basic round function, and in full accordance with the IEEE 754 standard.  Bankers, scientists, mathematicians, and anybody else who works with numbers have been rounding this way for over a century.
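
The same round-half-to-even rule shows up in most modern languages. Python, for instance, behaves exactly the same way on exact halves:

```python
# Round-half-to-even ("banker's rounding"): exact halves go to the
# nearest even integer, so a long run of halves doesn't all round
# in the same direction and bias the total.
examples = {
    0.5: round(0.5),   # down to 0
    1.5: round(1.5),   # up to 2
    2.5: round(2.5),   # down to 2
    3.5: round(3.5),   # up to 4
}
```

(A further wrinkle with values like 82.845: it isn’t exactly representable in binary floating point, so the value actually being rounded may already differ slightly from what the programmer typed.)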

I find it difficult to believe that the person who posted the piece on Slashdot is actually paid to write computer programs.  You’d think that, if he suspected there was a problem with the function, he’d search the Microsoft knowledge base or, horror of horrors, read the documentation.  I suspect that this person is deeply embarrassed right about now.

This ill-informed note reminded me of the compiler bug that has frustrated me the most in my career—programmers who automatically assume that their code is right and the compiler is wrong.  I can’t count the number of times a programmer has asked my assistance in isolating a compiler or runtime library bug when he hasn’t yet fully checked his own code.  Sure, compiler and runtime library bugs exist, but they’re very rare.  If there’s a bug in your program, it’s most likely of your own making.

A simple lesson in debugging

Today’s lesson, class, is on the art of finding and fixing bugs.  It’ll be a very short lesson—just one point for me to make and for you to ponder.  When you’re trying to find a bug, make as few changes as possible to the code.  Doing otherwise risks introducing new bugs or, worse, masking rather than fixing the problem.  And debugging is not the time when you optimize that routine you’ve been thinking about, or rewrite some code that you think is ugly.  Working is more important than fast, and until you get it working you can’t know where the real bottlenecks are, so any optimization you make is at best wishful thinking.

It’s such a simple lesson, and painfully obvious to anybody who takes a few minutes to think it over.  Why do so many programmers ignore it?

Lessons learned from debugging

A co-worker and I just spent two very long and tedious days tracking down and fixing a bug in the new version of our server software.  The problem would occur intermittently when multiple people were making requests.  The symptom was always a blank page returned where we expected content.  The bug was exceedingly difficult to reproduce by hand, so I wrote a little program that would make requests automatically.  That made the bug reproducible, and we learned a few lessons while tracking it down and fixing it.

Always suspect your own code first.  I knew on Wednesday evening that the failure was due to either IIS or our software dropping the request.  I’ve been in this business long enough to suspect my own code first, and I thought it exceedingly odd that IIS would be dropping a request and returning a blank page.  At first I resisted testing against another web server, but decided it couldn’t hurt.  I’m glad we had another web server to test against because it verified that the problem was with our code.

When you’re writing a log file, be sure that the information written is usable.  We were writing a tremendous amount of stuff to the log, most of which was useless for debugging purposes when multiple simultaneous requests were being processed.  In a multi-threaded application, the log information must include the thread ID and some information that ties that thread ID to the request being processed.  We spent almost a full day running tests and modifying the log output so that we could get enough information to track the bug.  My only hope is that the modified logging code remains in the shipping product.
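To make that concrete, here’s a minimal Python sketch of the kind of logging we ended up needing (the names and format string are mine, not our server’s): every line carries the thread name and a per-request ID, so interleaved output from simultaneous requests can be untangled afterward.

```python
import io
import logging
import threading

# Collect log output in memory for this sketch; a real server would
# write to a file.  The important part is the format string: every
# line carries the thread name and a per-request ID (passed via
# `extra`), so lines from concurrent requests can be correlated.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(
    "[%(threadName)s] req=%(request_id)s %(message)s"))
log = logging.getLogger("server")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_request(request_id):
    # Every line this thread writes carries both identifiers.
    log.info("reading config", extra={"request_id": request_id})
    log.info("sending response", extra={"request_id": request_id})

threads = [threading.Thread(target=handle_request, args=(i,))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(buf.getvalue())
```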

Murphy’s Law applies.  In spades.  If there’s any chance for resource contention, you can bet that it will happen.  The root of the problem was a piece of code that reads a configuration file.  This code opened the file in exclusive mode, preventing any other thread from opening it during the few milliseconds it took to read the information.  If there’s any chance of resource contention, either code for it with a mutual-exclusion lock, or restructure the code to eliminate the contention entirely.
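A minimal Python sketch of the mutual-exclusion approach (invented names; our server isn’t written in Python): a single lock serializes the brief critical section in which the configuration file is open.

```python
import os
import tempfile
import threading

# A single lock serializes access to the config file, so no thread
# ever finds it already opened exclusively by another.  (The other
# fix is to open the file in shared-read mode in the first place.)
config_lock = threading.Lock()

def read_config(path):
    with config_lock:            # only one reader at a time
        with open(path) as f:    # critical section lasts a few ms
            return f.read()

# Simulate several simultaneous requests hitting the config file.
fd, path = tempfile.mkstemp()
os.write(fd, b"timeout=30\n")
os.close(fd)

results = []
threads = [threading.Thread(target=lambda: results.append(read_config(path)))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.unlink(path)
print(len(results), "reads completed")
```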

Squashing exceptions is a very bad idea.  Although the root cause was the resource contention, we couldn’t see it, because the exception raised by the thread’s failure to open the file was being squashed by the exception handler.  We were correctly throwing an exception, but handling it improperly and not even writing a log entry.  If you’re writing code that uses exceptions, then use them!
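Here’s the shape of the anti-pattern and its fix, as a Python sketch (the names are mine): the bad version swallows the error and returns a bogus value; the good version logs the failure with context and lets the exception propagate.

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("server")

def read_config_bad(path):
    # The anti-pattern: the handler swallows the error, so the caller
    # sees a bogus empty config and the log shows nothing at all.
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

def read_config_good(path):
    # Log the failure with context, then let the exception propagate
    # so the caller (or a top-level handler) can deal with it.
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        log.exception("could not read config file %s", path)
        raise

print(repr(read_config_bad("/nonexistent/config.ini")))  # '' -- silent failure
```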

I’m happy to say that I’m not personally responsible for any of the affected code.  However, the experience has left me with a renewed appreciation for the difficulties of writing multi-threaded code, and you can bet that some of my other applications will be getting a good looking-over in the near future.

ASP problems

Remember I said I was working with ASP?  I spent a good 8 hours working with it yesterday and today, implementing an example in VBScript for our product’s extensibility SDK.  I remember now why I dislike BASIC (or Basic).  The language is pretty irregular.  I wonder if a formal grammar even exists.

For example, you can call a subroutine using either of these forms:

    subroutine param1, param2, param3
    call subroutine(param1, param2, param3)

But don’t try putting parentheses around the parameters if you don’t use the call keyword, or you’ll get an error.  There has to be a reason for this oddity in the language, but I’ve not seen it in any of the books.  Understand, this is just one of the many things that I find strange in the language.

The other problem I have is that the development environment is primitive.  The debugging environment, especially, is terrible.  The best thing I’ve found so far is what we called “the printf method” back when I was writing C code.  Except in VBScript with ASP, it’s “Response.Write”.  That works fine if you’re accessing the ASP page directly (i.e. through your browser), but if you’re accessing a CGI that in turn accesses the ASP script you’re trying to debug (it’s a distributed computing environment), then you’ve got problems.  Remember application log files?  Yeah, boy.

I understand that if I install IIS and the debugging package, I can do visual debugging with Visual InterDev.  We tried this and actually got it to work, sort of.  But we kept having to reboot the computer because it’d freeze at odd places.  After the fifth or sixth reboot, we went back to the old way.  I’d sure like to get the visual debugger up and running, though.  Perhaps I’ll have some time to play with it tomorrow.