Friday, May 27, 2022

Why test coverage targets backfire

Many organizations adopt a "Percent-Test-Coverage" metric as part of their Definition of Done. Very often, managers set it, and it becomes a target that developers have to meet. This approach is often harmful, and here's why:

The good, bad and ugly of Coverage Targets

Coverage targets are often intended to improve both the product and the process, and to get developers thinking about testing.

The good

It's good that developers begin to pay closer attention to the questions "How do I test this?" and "Where are tests missing?" Unfortunately, that's about it.

The bad

The first problem is that large legacy code bases starting with zero test coverage put developers in a pickle: if I work on one component of a code base with 0% test coverage, any tests I can realistically write will keep the overall number close to zero. Hence, the necessary compromise is to set no baseline number and simply ask for the number to increase. The commonly envisioned targets of 80% or more are visions of a distant future rather than something useful today.
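To put the pickle into numbers, here is a minimal Python sketch. The sizes are made-up assumptions for illustration only: a hypothetical legacy code base of 500,000 executable lines with no tests, and a developer whose new tests exercise 2,000 lines of one component.

```python
def coverage_percent(covered_lines: int, total_lines: int) -> float:
    """Line coverage as a percentage of all executable lines."""
    return 100.0 * covered_lines / total_lines


# Hypothetical sizes, chosen only for illustration
TOTAL_LINES = 500_000          # legacy code base, 0% covered at the start
COVERED_BY_NEW_TESTS = 2_000   # lines exercised by one component's new tests

print(f"{coverage_percent(COVERED_BY_NEW_TESTS, TOTAL_LINES):.1f}%")  # 0.4%

# Lines that would have to be covered to reach an 80% target
lines_needed = TOTAL_LINES * 80 // 100
print(lines_needed)                          # 400000
print(lines_needed // COVERED_BY_NEW_TESTS)  # 200 times this effort
```

Even a solid piece of work moves the global number by less than half a percentage point, which is why asking for "any increase" is often the only workable target.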

In practice, as long as the organization is set up to reward minimizing the time invested in unit testing, developers will try to meet the coverage targets with minimum effort.

Also, when developers have no experience in writing good tests, their tests may not do what they're supposed to do.

The ugly

There are many ways to meet coverage targets that illustrate Goodhart's Law:

When a measure becomes a target, it ceases to be a good measure.

Often, developers resent writing tests only to meet the target, and consider the activity a wasteful addition to the process that provides no benefit.

At worst, developers will spend a lot of time creating tests that provide a false sense of confidence, which is even worse than knowing that you have no tests to rely on.

But how can this happen? Let's see ...

The first antipattern is, of course, to write tests that check nothing.

Assertion-free testing

Let's take a look at this example:

Production code

int divide(int x, int y) { return x / y; }

Test code

print divide(6, 2);

Our code coverage metrics will show 100% coverage. But the test will always pass, even when we break the production code. Only if we inspected the actual test output (manual effort we probably won't spend) would we see what the test does. There is no automated failure detection, and the test isn't even written in a way that would let us spot a failure: who would notice if we suddenly got a "2" instead of a "3"? We don't even know which result would have been correct!
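The following Python sketch mirrors the example. The broken_divide function (a hypothetical regression with / swapped for *) and the run_test helper are assumptions added for illustration. It shows that the assertion-free test passes against correct and broken code alike, while a test that checks the expected result catches the break. Deliberately breaking code like this and checking whether any test notices is the idea behind mutation testing.

```python
def divide(x: int, y: int) -> int:
    # Production code; // is integer division, like the int version above
    return x // y


def broken_divide(x: int, y: int) -> int:
    # Hypothetical regression: someone replaced / with *
    return x * y


def assertion_free_test(fn) -> None:
    print(fn(6, 2))  # executes the code (full coverage) but checks nothing


def asserting_test(fn) -> None:
    result = fn(6, 2)
    if result != 3:  # explicit check, so it also works under python -O
        raise AssertionError(f"expected 3, got {result}")


def run_test(test, fn) -> bool:
    """Return True if the test passes against fn, False if it fails."""
    try:
        test(fn)
        return True
    except AssertionError:
        return False


free_ok = run_test(assertion_free_test, divide)             # True (prints 3)
free_broken = run_test(assertion_free_test, broken_divide)  # True (prints 12): unnoticed
checked_ok = run_test(asserting_test, divide)               # True
checked_broken = run_test(asserting_test, broken_divide)    # False: the break is caught
```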

Testing in the wrong place

Look at this example:

class Number {
    int x;
    void setX(int val) { x = val; }
    int getX() { return x; }
    void compute() { x = x != 0 ? x / x ^ x : x - x * (x + x); }
}

Number n = new Number();
n.setX(5);
assert (n.x == 5);
assert (n.getX() == 5);

In this case, we have a code coverage of 75%: three of the class's four members are exercised, since we're testing x, the setter and the getter.

The "only" thing we're not testing is the compute function, which is actually the one place where we would expect problems: where other developers might ask, "What does that do, and why are we doing it like that?", and where various inputs could lead to undesirable outcomes.
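Here is a minimal Python sketch of what exercising compute could look like. It is an illustrative port of the class (set_x, get_x and computed are Python renderings or helpers we introduce), and it assumes C/Java semantics for the expression: / binds tighter than ^, so for nonzero x the result is (x / x) ^ x, which equals 1 ^ x. Because the article never says what compute should do, the expected values below only record what the code currently computes. Working them out by hand is exactly where "why are we doing it like that?" comes up.

```python
class Number:
    """Illustrative Python port of the Number class."""

    def __init__(self) -> None:
        self.x = 0

    def set_x(self, val: int) -> None:
        self.x = val

    def get_x(self) -> int:
        return self.x

    def compute(self) -> None:
        # Article's expression: x ? x/x^x : x-x*(x+x), where "x ?" means "x != 0"
        if self.x != 0:
            # x // x == 1, so this just flips the lowest bit of x
            self.x = (self.x // self.x) ^ self.x
        else:
            # 0 - 0 * (0 + 0) == 0
            self.x = self.x - self.x * (self.x + self.x)


def computed(start: int) -> int:
    """Hypothetical helper: set x, run compute, return the new x."""
    n = Number()
    n.set_x(start)
    n.compute()
    return n.get_x()


# Cover both branches of the conditional, not just the getter and setter
print(computed(0))  # 0
print(computed(4))  # 5
print(computed(5))  # 4
```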

Testing wrongly

Take a peek at this simple piece of code:

int inverse(int x) { return 1 / x; }
void printInverse(int x) { print "inverse(x);" }

assert (inverse(1) == 1);
assert (inverse(5) == 0);
printInverse(5);
assert (print.toHaveBeenCalled());

The issues

There are several major problems here, which in combination make this test more dangerous than helpful:

Data Quality

Depending on the language and on how x is declared, we may get a null pointer exception if the method is called without initializing x first (for example, with a null Integer in Java). We neither catch nor test this case.

Unexpected results

If we feed 0 into the function, we get a division-by-zero error. We're not testing for this, and it will lead to undesired outcomes.
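A minimal Python sketch of tests that make both bad-input cases visible instead of leaving them to chance. It assumes a Python port of inverse, where int(1 / x) truncates toward zero like C or Java integer division. In Python, dividing by zero raises ZeroDivisionError, and passing None raises TypeError, the nearest analog to the null pointer case. The raises helper is hypothetical. Whether raising is the desired behavior is still a decision the team has to make; these tests merely document it.

```python
def inverse(x: int) -> int:
    # Python port: int() truncates toward zero, like integer division in C or Java
    return int(1 / x)


def raises(exc_type: type, fn, *args) -> bool:
    """Hypothetical helper: return True if fn(*args) raises exc_type."""
    try:
        fn(*args)
    except exc_type:
        return True
    return False


# Edge cases the original tests never touch
print(raises(ZeroDivisionError, inverse, 0))  # True
print(raises(TypeError, inverse, None))       # True
```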

Missing the intent

Because of integer division, the inverse function returns 0 for every number other than 1 and -1. It probably doesn't do what it's expected to do. But how would we know? Is it poorly named, or poorly implemented?
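A test written from the intent, rather than from the implementation, exposes the mismatch. In this Python sketch we assume, as the name suggests, that inverse is meant to return the reciprocal 1/x. That assumption and the reciprocal_is_correct check are ours, not the article's.

```python
import math


def inverse(x: int) -> int:
    # Python port of the int version: int() truncates toward zero like C/Java
    return int(1 / x)


# What the implementation actually does: 0 for everything except 1 and -1
print([inverse(x) for x in (-5, -1, 1, 2, 5)])  # [0, -1, 1, 0, 0]


def reciprocal_is_correct(fn, x: int) -> bool:
    """Intent-based check, assuming inverse(x) should equal 1/x."""
    return math.isclose(fn(x), 1 / x)


print(reciprocal_is_correct(inverse, 1))  # True: the case the original test picked
print(reciprocal_is_correct(inverse, 5))  # False: 0 is not 0.2
```

The original assertion inverse(5) == 0 encodes the current behavior; the intent-based check fails and forces the question of whether the name or the implementation is wrong.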

Testing the wrong things

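printInverse prints the literal text "inverse(x);" rather than the computed value, which is most likely not what we expect. Our tests still pass, because they only check that print was called.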
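Checking that print was called says nothing about what was printed. This Python sketch contrasts the two checks: unittest.mock.patch stands in for the Jest-style print.toHaveBeenCalled(), and contextlib.redirect_stdout captures the real output. print_inverse is an illustrative port that keeps the original bug of printing the literal text.

```python
import contextlib
import io
from unittest import mock


def inverse(x: int) -> int:
    return int(1 / x)  # truncates toward zero like C/Java integer division


def print_inverse(x: int) -> None:
    print("inverse(x);")  # the bug: prints the text, not the value


# Weak check: was print called at all? (analog of print.toHaveBeenCalled())
with mock.patch("builtins.print") as fake_print:
    print_inverse(1)
weak_check_passes = fake_print.called

# Stronger check: is the printed output the computed value?
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print_inverse(1)
strong_check_passes = buffer.getvalue() == f"{inverse(1)}\n"

print(weak_check_passes)    # True
print(strong_check_passes)  # False: the output was "inverse(x);"
```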


If we rely on coverage metrics, we might assume that we have 100% test coverage, but in practice, we may have very unreliable software that doesn't even work as intended.

In short: this way of testing tests that the code does what it does, not that it does what it should.

Then what?

The situation can be remedied, but not with numerical quotas. Instead, developers need education on what to test, how to test, and how to test well.

This is a big topic, but this article already shows some extremely common pitfalls that developers can, and need to, steer clear of. Coverage metrics leave management none the wiser: the real issue is hidden behind a smoke screen of existing tests.

1 comment:

  1. So you need Mutation Testing, to verify that the alleged coverage actually means anything.