Scanning Slashdot today during lunch, I came across this posting about two Stanford researchers who have written a paper (it’s a PDF) showing that seemingly harmless redundant code frequently points to not-so-harmless errors. They used a tool that does static analysis of the source code to trace execution paths and such. The technology behind their tool is fascinating and something I’d like to study, given the time. But that’s beside the point.
On the surface, the paper’s primary conclusion (that redundancies flag higher-level correctness mistakes) seems obvious. After all, it’s something we programmers have suspected, even known, for quite some time. But our “knowledge” covered only the problems that arise from one particular kind of redundancy: repetition, as in “cut and paste” coding errors. The paper identifies other types of redundancies (unused assignments, dead code, superfluous conditionals). Some of these redundancies actually are errors, but many are not. The paper’s primary contribution (in my opinion) is to show that, whether or not the redundancies themselves are errors, their presence in a source file is an indicator that hard errors, real bugs, are lurking in the vicinity.

How strong an indicator? In their test of 1.6 million lines of Linux source code (2,055 files), they show that a source file containing these redundancies was 45% to 100% more likely to contain hard errors than a source file picked at random. In other words, where there’s smoke (confused code, which most likely means a confused programmer), fire is probably nearby. These methods don’t necessarily point out errors; rather, they point at likely places to look for them.
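To make that concrete, here’s a contrived C fragment of my own (not an example from the paper) containing two of the redundancies they track, an unused assignment and a superfluous conditional, and showing why the second one smells like a real bug next door:

```c
#include <stddef.h>

struct buffer {
    char  *data;
    size_t len;
};

/* Returns 0 on success, -1 on failure. */
int buffer_resize(struct buffer *buf, size_t new_len)
{
    int err = -1;               /* unused assignment: overwritten before it is ever read */

    if (buf == NULL || buf->data == NULL)
        return -1;

    if (new_len == buf->len)    /* superfluous conditional: both branches are identical... */
        buf->len = new_len;
    else
        buf->len = new_len;     /* ...which hints the programmer meant to reallocate here */

    err = 0;
    return err;
}

int main(void)
{
    struct buffer b = { .data = (char[8]){ 0 }, .len = 8 };

    /* The "resize" reports success, but the underlying 8-byte allocation
       never grows: the real bug the redundancy was pointing at. */
    return buffer_resize(&b, 16) == 0 ? 0 : 1;
}
```

A checker like the one they describe would presumably flag the dead store and the identical branches; a human sent to look at that spot would notice the missing reallocation almost immediately, which is exactly the kind of “nearby fire” the paper measures.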
A production tool based on the methods presented in this research would be invaluable for auditing. Rather than reviewing a random sample of source files, auditors could use it to zero in on the modules most likely to contain errors. Very cool stuff, and well worth the read.
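Just for fun, here’s a back-of-the-envelope sketch of what that triage step might look like. This is my own toy, not anything from the paper, and the input format (a “redundancy-count filename” pair per line, as a hypothetical analyzer might emit) is invented:

```c
/* rank.c: read "redundancy-count filename" pairs from stdin and print the
 * files in descending order of redundancy count, so an auditor can start
 * with the most suspicious modules instead of a random sample. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_FILES 4096

struct entry {
    int  count;             /* redundancies reported for this file */
    char path[256];         /* file name */
};

static int by_count_desc(const void *a, const void *b)
{
    const struct entry *ea = a, *eb = b;
    return (eb->count > ea->count) - (eb->count < ea->count);
}

int main(void)
{
    static struct entry entries[MAX_FILES];
    size_t n = 0;

    while (n < MAX_FILES &&
           scanf("%d %255s", &entries[n].count, entries[n].path) == 2)
        n++;

    qsort(entries, n, sizeof entries[0], by_count_desc);

    for (size_t i = 0; i < n; i++)
        printf("%5d  %s\n", entries[i].count, entries[i].path);

    return 0;
}
```

Pipe whatever per-file summary the analyzer produces into it and begin the review at the top of the list; that, in essence, is the auditing workflow the research suggests.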