Fun with platform invoke

About a year ago I wrote a series of three articles about using .CAB files from .NET programs.  The last article, Creating an Object-Oriented Interface to CAB Files (that article contains links to the first two), presented a .NET class library that programmers could use to read and write CAB files from within their .NET programs.  The article has been reasonably popular, I hear, and the code for the article gets about 100 downloads a month from my site.

I’m almost embarrassed to say that the code, as presented in the article, is unstable and probably unusable in a production environment.  I subjected it to what I thought was rigorous testing by decompressing and re-compressing a rather large cabinet file from the Windows distribution media.  That all worked fine, but I apparently didn’t test enough.  In the past year I’ve had many messages from people who have had problems with the code.

I fixed the major problem several months ago after a helpful reader sent me some information.  The fixed code is available for download at http://www.mischel.com/pubs/cabdotnet4.zip.  That fixes the error that most people were asking about (a System.NullReferenceException during FdiCopy or FdiAddFile).  I neglected to take into account that the garbage collector would collect the delegates that I’d passed to FdiCreate or FciCreate.  The solution was to create private member variables to hold the delegates so that there was always a managed reference.  Problem solved.  I thought.
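To make the delegate fix concrete, here is a minimal sketch of the pattern.  It is not the code from the download: the AllocCallback delegate type and the CallbackHolder class are hypothetical names invented for illustration, and the sketch only obtains the function pointer rather than calling into the cabinet library.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical callback type, loosely modeled on a memory-allocation
// callback.  It is an assumption, not the article's actual declaration.
[UnmanagedFunctionPointer(CallingConvention.Cdecl)]
public delegate IntPtr AllocCallback(int byteCount);

public sealed class CallbackHolder
{
    // The private field is the whole fix: it keeps the delegate reachable
    // for as long as this object is alive, so the garbage collector cannot
    // collect it while native code still holds the function pointer.
    private readonly AllocCallback _alloc;

    public CallbackHolder()
    {
        _alloc = new AllocCallback(Alloc);
    }

    // Returns the raw pointer that native code would store.  The pointer
    // is only valid while the delegate in _alloc remains reachable.
    public IntPtr GetNativePointer()
    {
        return Marshal.GetFunctionPointerForDelegate(_alloc);
    }

    private static IntPtr Alloc(int byteCount)
    {
        return Marshal.AllocHGlobal(byteCount);
    }
}
```

The broken version is the one that builds the delegate inline at the call site (for example, passing new AllocCallback(Alloc) directly as an argument).  Nothing managed refers to that delegate after the call returns, so it becomes eligible for collection even though the native side still intends to call through it.  Note that the holder object itself must also stay alive for as long as the native code can call back.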

Recently, another reader contacted me about a different issue that occurred comparatively infrequently.  It was another System.NullReferenceException, but we couldn’t figure out what was causing it.  Running the program under ClrSpy didn’t reveal anything, and no amount of setting breakpoints or other debugger tricks would identify the problem.  I had to resort to that tried and true (cough, cough) method of just poking at it until it worked.  That is, in the process of changing things so that maybe I could reveal the problem, the problem went away.  I “fixed” it, and I don’t know how.

I think the problem was that the garbage collector was moving a managed structure around in memory while the unmanaged code had a reference to it.  When the managed code then tried to access the structure at its previous location, the system threw an exception.  The trouble is, I can’t prove that was the error, and there is some evidence that it couldn’t have been.  Furthermore, I can’t prove that whatever I did solved the problem.  It’s possible that I just made the problem’s occurrence less common.
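If a moved structure really were the cause (and, again, I can’t prove it), the standard guard is to pin the object so the collector cannot relocate it while native code holds its address.  The sketch below shows the general technique with GCHandle.  The CabInfo structure and the PinningSketch class are hypothetical stand-ins, not types from the CAB library.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical blittable structure standing in for whatever the native
// code keeps a pointer to.  Only blittable types can be pinned this way.
[StructLayout(LayoutKind.Sequential)]
public struct CabInfo
{
    public int Size;
    public int FileCount;
}

public static class PinningSketch
{
    public static int ReadSizeThroughPinnedPointer(CabInfo info)
    {
        object boxed = info;  // box the struct so it lives on the managed heap
        GCHandle handle = GCHandle.Alloc(boxed, GCHandleType.Pinned);
        try
        {
            // This address stays valid until the handle is freed.
            IntPtr address = handle.AddrOfPinnedObject();

            // Force a collection; a pinned object is not moved by it.
            GC.Collect();

            // Read the first field (Size) through the raw address, the way
            // native code holding the pointer would.
            return Marshal.ReadInt32(address);
        }
        finally
        {
            handle.Free();  // unpin; the address must not be used after this
        }
    }
}
```

For the duration of a single platform invoke call the marshaler already pins or copies the arguments, so explicit pinning only matters when native code keeps the pointer after the call returns.  That is exactly the situation with a long-lived context that is created once and used across many later calls.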

All of which leads me to my point:  you can’t fix a problem if you don’t know what’s causing it.  You can maybe cure a symptom, but unless you can identify the cause you have absolutely no way of knowing for sure that you fixed it.  All the testing in the world can only reveal that the problem hasn’t occurred in the situations that you’ve tested.

So now I’m left with the somewhat curious task of backing out the changes I made to the code in order to see if I can make the problem reappear.  That is, rather than poking at it until it works, I’m going to poke at it until it breaks.  If I can keep my incremental changes small enough, that should at least give me some idea of where the problem lies.  But there are serious problems with this approach.

The bug I’m trying to locate just happens sometimes.  Pretty infrequently, in fact.  I have a test suite that can make it happen, but it involves trying to decompress a large number of files and takes quite some time to run.  The bug would appear, on average, about once every other run, but I sat through three or four successful runs in a row before it appeared.  So let’s assume that I make a change and run the test.  How many times do I run the test before I decide that the change I made had no effect on the bug?  Once?  Three times?  Ten?
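One rough way to put a number on that question is to treat each test run as an independent coin flip.  That is an assumption, and the 50% failure rate is only my "about once every other run" estimate, but under it the chance of seeing n clean runs in a row while the bug is still present is (1 - 0.5)^n.  The helper below (a hypothetical name, written for this note) finds the smallest n that pushes that chance under a chosen threshold.

```csharp
using System;

public static class TestRunEstimate
{
    // Returns the smallest number of consecutive clean runs for which the
    // probability of missing a still-present bug is at most maxMissChance.
    // Assumes runs are independent and that failureRate is accurate.
    public static int CleanRunsNeeded(double failureRate, double maxMissChance)
    {
        if (failureRate <= 0.0 || failureRate >= 1.0)
            throw new ArgumentOutOfRangeException("failureRate");
        if (maxMissChance <= 0.0 || maxMissChance >= 1.0)
            throw new ArgumentOutOfRangeException("maxMissChance");

        int runs = 0;
        double chanceAllClean = 1.0;
        while (chanceAllClean > maxMissChance)
        {
            chanceAllClean *= 1.0 - failureRate;  // one more clean run
            runs++;
        }
        return runs;
    }
}
```

With a failure rate of 0.5, CleanRunsNeeded(0.5, 0.05) returns 5 and CleanRunsNeeded(0.5, 0.01) returns 7.  So even under these generous assumptions, each candidate change needs five to seven long test runs before a clean result means much, and the three or four clean runs I sat through were not especially unlikely (a chance of 12.5% and 6.25% respectively).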

Assuming that I come up with an answer to the first question, the next question is possibly worse:  how do I know if the change I made is the actual cause, or just the thing that makes the real cause show a symptom?  That is, it’s possible that the cause of the bug isn’t just one thing, but rather a combination of two or more seemingly unrelated things.  There’s absolutely no way of knowing.

This is a fundamental problem with executing in a managed environment.  I can see pretty much everything that’s going on inside the managed code and I can even single-step into the unmanaged (i.e. native) code with the debugger.  I can even step through the edges: the marshaling of data across the managed/unmanaged line.  What I can’t do (or haven’t yet figured out how to do) is peek at the garbage collector to see if it’s pulling the rug out from under me.