A Better Metric than Code Coverage
My Chase of Code Coverage
Perhaps it’s because fall is upon us and this is the first year in a while that I haven’t been enrolled in a Master’s of CS program (I graduated in May), but I’m feeling a little academic. As I mentioned in my last post, I’ve been plowing ahead, following TDD to the letter, and if nothing else, I’m pleased that my code coverage sits at 100% with far less effort. I try to keep my code coverage around 100% whether or not I do TDD, so the main difference I’ve noticed is that TDD, compared with retrofitted tests, seems to hit my use cases a lot harder, rather than merely passing through the code at least once.
Now, it’s important to me to get close to or hit that 100% mark, because I know that I’m at least touching everything going into production, meaning that I don’t have anything that would blow up if the instruction pointer ever got to it, with only another bug saving me by preventing it from executing. But there is a difference between covering code and exercising it.
More than 100% Code Coverage?
As I was contemplating this last night, I realized that some lines of my TDD code, especially control flow statements, were really getting pounded. There are lines in there that are covered by dozens of tests. So, the first flicker of an idea popped into my head — what if there were two factors at play when contemplating coverage: LOC Covered/Total LOC (i.e. our current code coverage metric) and Covering tests/LOC (I’ll call this coverage density).
High coverage is a breadth-oriented thing, while high density is depth: casting a wide net versus casting a narrow net deep. And so, the ultimate solution would be to cast a wide net, deeply (assuming unlimited development time and no design constraints).
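To make those two numbers concrete, here’s a minimal sketch of how they’d be computed from per-line hit counts. This is purely illustrative; the CoverageMetrics class and the hit-count input are hypothetical, not the output of any particular coverage tool.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class CoverageMetrics
    {
        // hitsPerLine: for each line of code, the number of tests that touch it.
        // A zero means the line is never executed by the suite.
        public static void Report(IReadOnlyList<int> hitsPerLine)
        {
            int totalLoc = hitsPerLine.Count;
            int coveredLoc = hitsPerLine.Count(h => h > 0);

            // Classic code coverage: LOC covered / total LOC.
            double coverage = (double)coveredLoc / totalLoc;

            // Coverage density: covering tests / LOC, i.e. how hard
            // each line gets pounded on average.
            double density = (double)hitsPerLine.Sum() / totalLoc;

            Console.WriteLine($"Coverage: {coverage:P0}, density: {density:F1} tests per line");
        }
    }

Feeding it something like new[] { 3, 0, 12, 1 } would report 75% coverage with a density of 4.0 tests per line, which already hints at the breadth-versus-depth distinction.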
Are We Just Shifting the Goalposts?
So, coverage density sounded like sort of a heady concept, and I thought I might be onto something until I realized that it suffers from the same potential for false-positive feedback as code coverage. Specifically, I could achieve an extremely high density by making 50 copies of all of my unit tests. All of my LOC would get hit a lot more, but my test suite would be no better (in fact, it’d be worse, since it’s now clearly less efficient). So code coverage is weakened as a metric when you cheat with weak asserts, and density is weakened when you cheat by hitting the same code with identical (or near-identical) asserts.
Is there a way to use these two metrics in combination without the potential for cheating? It’s an interesting question. It’s easy enough to see that “higher is better” for both is generally, but not always, true, and that either one can be perverted by developers working under some kind of management edict demanding X coverage or, now, Y density.
Stepping Back a Bit
Well, it seems that density is really no better than code coverage, and it’s arguably more obtuse, or at least it has the potential to be, so maybe that’s not the route to go. After all, what we’re really after here is how many different scenarios a line of code gets hit under. For instance, hitting the line double result = x/y is only interesting when y is zero. If I hit it 45,000 times and achieve high density, I might as well have hit it just once unless one of those attempts tries y at zero.
Now we have something interesting. This isn’t a control flow statement, so code coverage doesn’t tell the whole story: you can cover that line easily without ever generating the problematic condition. Density is a slightly (but not much) better metric. We’re really driving at program correctness here, but since that’s a genuinely hard problem, what we’ll generally settle for is covering notable, or interesting, scenarios.
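To make that concrete, here’s a minimal NUnit-style sketch; the Calculator class and the test names are hypothetical, stand-ins for the kind of code and tests I have in mind.

    using NUnit.Framework;

    public static class Calculator
    {
        // The line in question: "covered" does not mean "exercised."
        public static double Divide(double x, double y) => x / y;
    }

    [TestFixture]
    public class DivideTests
    {
        [Test]
        public void Divide_ReturnsQuotient()
        {
            // Covers the line but never goes near the interesting case.
            Assert.AreEqual(2.0, Calculator.Divide(10.0, 5.0));
        }

        [Test]
        public void Divide_ByZero_YieldsInfinity()
        {
            // The scenario that actually matters. With doubles, C# quietly
            // returns Infinity rather than throwing; with integer division,
            // this would instead assert a DivideByZeroException.
            Assert.IsTrue(double.IsPositiveInfinity(Calculator.Divide(10.0, 0.0)));
        }
    }

Line coverage of Divide is identical whether or not that second test exists; exercising the interesting scenario is the only thing the second test adds.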
A Look at Pex
Microsoft Research made a utility called Pex (which I’ve blogged about here). Pex is an automated test generation utility that “finds interesting input-output values of your methods”. What this means, in practice, is that Pex pokes through your code looking for edge cases and anything that might be considered ‘interesting’. Often, this means conditions that cause control flow branching, but it also means things like finding our divide-by-zero case with y from earlier.
What Pex does when it finds these interesting paths is auto-generate unit tests that you can add to your suite. Since it finds hard-to-find edge cases and specializes in branching through your code, it boasts a high degree of coverage. But what I’d really be interested in seeing is stats on how many interesting paths your test suite covers versus how many there are or may be (we’d likely need a good approximation, as this problem quickly becomes computationally infeasible to answer with certainty).
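If something could enumerate (or reasonably approximate) those interesting paths and tell you which ones your existing tests already hit, the metric itself would be trivial. Here’s a minimal sketch under that assumption; IPathExplorer is a hypothetical interface standing in for whatever Pex-like analysis would supply the counts.

    // Hypothetical stand-in for whatever analysis (Pex-like or otherwise)
    // could enumerate "interesting" paths and report which ones the
    // existing test suite already exercises.
    public interface IPathExplorer
    {
        int InterestingPathsFound();      // approximated total
        int InterestingPathsExercised();  // hit by at least one existing test
    }

    public static class ScenarioCoverage
    {
        // The proposed metric: interesting paths exercised / interesting paths found.
        public static double Compute(IPathExplorer explorer)
        {
            int found = explorer.InterestingPathsFound();
            if (found == 0)
            {
                return 1.0; // nothing interesting found, so vacuously covered
            }

            return (double)explorer.InterestingPathsExercised() / found;
        }
    }

The arithmetic is the easy part, of course; the explorer is where all the hard work lives.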
I’m thinking that this has the makings of an excellent metric. Forget code coverage or my erstwhile “density” metric. At this point, you’re no longer hoping that your metric reflects something good; you’re relatively confident that it must. While this isn’t as good as some kind of formal method that proves your code correct, you can at least be confident that critical things are being exercised by your test suite, whether manual, automated, or both. And while you can achieve this to some degree by regularly using Pex, I don’t know that you can quantify it other than to say, “well, I ran Pex a whole bunch of times and it stopped finding new issues, so I think we’re good.” I’d like a real, numerical metric.
Anyway, perhaps that’s in the offing at some point. It’d certainly be nice to see, and I think it would be an advancement in the field of static analysis.