I and a co-worker just spent two very long and tedious days tracking down and fixing a bug in the new version of our server software. The problem would occur intermittently when multiple people were making requests. Manifestation was always a blank page returned when we were expecting something else. The bug was exceedingly difficult to reproduce by hand, so I wrote a little program that would make requests automatically. That made the bug reproducible, and we learned a few lessons while tracking it down and fixing it.
Always suspect your own code first. I knew on Wednesday evening that the failure was due to either IIS or our software dropping the request. I’ve been in this business long enough to suspect my own code first, and I thought it exceedingly odd that IIS would be dropping a request and returning a blank page. At first I resisted testing against another web server, but decided it couldn’t hurt. I’m glad we had another web server to test against because it verified that the problem was with our code.
When you’re writing a log file, be sure that the information written is usable. We were writing a tremendous amount of stuff to the log, most of which was useless for debugging purposes when multiple simultaneous requests were being processed. In a multi-threaded application, the log information must include the thread ID and some information that ties that thread ID to the request being processed. We spent almost a full day running tests and modifying the log output so that we could get enough information to track the bug. My only hope is that the modified logging code remains in the shipping product.
Murphy’s Law applies. In spades. If there’s any chance for resource contention, you can bet that it will happen. The root of the problem was a piece of code that reads a configuration file. This code opened the file in exclusive mode, thus preventing any other thread from opening the file during the few milliseconds it took to read the information. If there’s any chance of resource contention, you either code for the possibility by using a mutual exclusion lock, or you write code to eliminate the possibility.
Squashing exceptions is a very bad idea. Although the root cause of the problem was the resource contention, we were unable to determine the real problem because the exception being raised by the thread’s failure to open the file was being squashed by the exception handler. We were correctly throwing an exception, but handling it improperly and not even writing a log entry. If you’re writing code that uses exceptions, then use them!
I’m happy to say that I’m not personally responsible for any of the affected code. However, the experience has left me with a renewed appreciation for the difficulties of writing multi-threaded code, and you can bet that some of my other applications will be getting a good looking over in the near future.