Fail Fast, Move On: Test Pyramid Explained - Part 2: Measurement Systems

Understanding the "Why" of the Test Pyramid is important in making the right decisions. This article examines the underlying foundation of testing: making statements about quality.
Why do we need to consider the test pyramid when creating our test suite?

How can we know if the software works? Whether it does what it's supposed to do? Whether it does that right?
If not, whether it's broken? What doesn't work? Why it doesn't work? What caused it to malfunction?

These are all different questions - and so the approach to answering the questions also differs. Which approach should we then take?

Let's take a look at our test pyramid.

In an attempt to answer the questions above, we need to explore:

Measurement Systems

According to Wikipedia, a measurement system includes a number of factors, including - but not limited to - these:

Miss one of the factors, and you might end up with an entirely messed up test process!

Before we can answer how these factors contribute to your testing process, we need to examine why they are relevant - and to answer the "Why" question, we need to answer the even more fundamental question:

Why test?

There are varying reasons for testing, all of which require different approaches:

Ensuring you did things right.
Ensuring you are doing things right.
Ensuring you will do things right.
Ensuring you understand things right.
Ensuring you did the right things.
...

As you might guess, a test approach to ensure you did things right will look vastly different from a test approach to ensure that you will be doing the right things.
Some approaches are more reactive in nature, while others are more proactive. Some are more concerned with the process of creating software - others are more concerned with the created software.

When no tests have previously been in place (such as in a Legacy System), you're well advised to start at the easiest level: ensuring that you did things right, i.e. ensuring that the software works as intended.
This is our classic Waterfall testing approach, where testers get confronted with allegedly "finished" software which just needs to be quality-checked.

When you have the luxury of starting with a Green Field, you're well advised to take the more challenging, yet more rewarding route: ensuring that you will be doing the right thing right - before even starting off.
This approach requires "building quality in" right from the outset, using practices such as Behaviour Driven Development, Test Driven Development and Specification by Example.

The advantage of "testing early" is that misunderstandings are caught even before they can lead to faulty software. The advantage of "testing often" is that problems get solved before they proliferate or get worse.

The desirable state

A perfect testing approach would minimize:

the risk of introducing fault into the system
the time required to detect potential fault in the system
the effort required to correct fault in the system

When taking a good look at our testing pyramid from the last article, we can observe the following:

Test Type	Prevent risk	Execution time	Correction Effort
Process Chain	Hardly helps: Often doesn't even get fixed before launch.	Might come too late in the process	Lots of pre-analysis required, potentially already proliferated.
System	Very low: Only prevents known launch failure.	Very slow, often gets skipped.	Slow,
Integration	Low: Only catches defects from proliferating in the system.	Slow, difficult to set up.	Interrupts the flow of work.
Feature&Contract	BDD: Know risk ahead.	Would run all the time while working on a feature.	Should only affect 1 method.
Unit	TDD: Know risk ahead.	Negligible. Can always run.	Minimal. Should only affect 1 line of code.

This matrix gives the impression that any test other than a Feature&Contract or Unit test doesn't even make sense from an economic perspective - yet these are the types of test that are most often neglected, while attention is paid to the upper parts of the Test Pyramid. Why does this happen?

Precision and Accuracy

Choose your poison

Let's suppose I turn on Google Maps and want to know how long my daily commute will take.
Imagine that I get to choose between two answers:
Answer #1: "Between 1 minute and 10 hours". Wow, that's helpful - not! It's an accurate answer with low precision.
Answer #2: "45 minutes, 21 seconds and 112 milliseconds". I like that. But ... when I hit the highway, there's traffic all over the place. I end up taking three hours. This answer was very precise - just also very inaccurate.

Do you prefer high accuracy and low precision - or high precision and low accuracy?
It seems like only a dunce would answer "high precision and low accuracy", because that's like having a non-winning lottery ticket.
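
To make the two answers comparable, here is a minimal sketch that treats each answer as a time interval. The class name Estimate and its fields are made up for illustration: "accurate" is read as "the real commute falls inside the interval", and "precise" as "the interval is narrow".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A commute estimate expressed as an interval in minutes (hypothetical model)."""
    low_minutes: float
    high_minutes: float

    def is_accurate(self, actual_minutes: float) -> bool:
        # Accurate: the real value lies inside the stated interval.
        return self.low_minutes <= actual_minutes <= self.high_minutes

    def width_minutes(self) -> float:
        # Precise: the narrower the interval, the higher the precision.
        return self.high_minutes - self.low_minutes


answer_1 = Estimate(1, 600)          # "Between 1 minute and 10 hours"
answer_2 = Estimate(45.35, 45.36)    # roughly "45 minutes, 21 seconds"
actual_commute = 180                 # the three hours it really took

for name, answer in (("Answer #1", answer_1), ("Answer #2", answer_2)):
    print(name, "accurate:", answer.is_accurate(actual_commute),
          "interval width:", round(answer.width_minutes(), 2), "minutes")
```

Answer #1 contains the real commute but spans almost ten hours. Answer #2 spans a fraction of a minute but misses the real commute entirely.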

Approximating meaning

When starting with nothing to begin with, it's a good idea to turn a huge fog of war into something more tangible, more solid - so we start with a test which brings us accuracy at the cost of precision. We approximate.
In the absence of a better strategy, a vague answer is better than no answer or a wrong answer. And that is how Process Chain tests are created.

Knowing nothing about the system, I can still easily answer a simple question, such as: "If I buy lettuce, bananas and napkins - will I have these exact three things shipped to my home?"
This is a typical process chain test, as it masks the complexity of the underlying process. The test requires little understanding of the system, yet allows the tester to make a definite yes/no statement about whether the system works as intended.

Unravelling complexity

When a tester's answer to a process chain test is "It doesn't work", the entire lack of accuracy in the quality statement is thrown directly at the developers, who then need to discover why it doesn't work. Testers then get trained to make the best possible statement of quality, such as "I got parsley instead of lettuce" and "The order confirmation showed lettuce" - but what the tester may never know is where the problem got introduced into the system. In a complex service landscape (potentially covering B2B suppliers, partners and service providers), the analysis process is often "Happy Hunting".

The false dichotomy

Choosing either accuracy or precision is a false dichotomy - why opt for one when you can have both? What is required is a measurement system of finer granularity.
Even in the above example, we hinted that the tester is definitely able to make a more accurate statement than "It didn't work" - and they can be more precise than that, as well. Good testers would always approximate the maximum possible accuracy and precision.
Their accuracy is only limited by logic hidden from their understanding - and their precision is only limited by the means through which they can interact with the process.
Giving testers deeper insight into the logic of a system allows them to increase their accuracy.
Giving them better means of interacting with the system allows them to increase their precision.

Under perfect conditions, a test will answer with perfect accuracy and perfect precision. And that's our Unit Test. The downside? To test for all potential issues - we need a LOT of them: Any single missing unit test means that we're punching holes into our precision statements.
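
As an illustration, here is a small sketch of what such a unit test could look like for the lettuce example. The catalog, the shelf codes and the function shelf_for are all invented for this sketch - they are not taken from any real system. The point is only that a failing test of this kind names the one function that misbehaves.

```python
import unittest

# Hypothetical catalog: product names and shelf codes are made up for illustration.
SHELF_BY_PRODUCT = {"lettuce": "A1", "parsley": "A2", "bananas": "B4", "napkins": "C2"}


def shelf_for(product: str) -> str:
    """Return the warehouse shelf a product is picked from."""
    return SHELF_BY_PRODUCT[product]


class ShelfForTest(unittest.TestCase):
    def test_lettuce_comes_from_the_lettuce_shelf(self):
        self.assertEqual(shelf_for("lettuce"), "A1")

    def test_lettuce_and_parsley_do_not_share_a_shelf(self):
        self.assertNotEqual(shelf_for("lettuce"), shelf_for("parsley"))


# Run the tests programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ShelfForTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Note that these two tests say nothing about bananas or napkins: every case that isn't written down is one of the holes mentioned above.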

Repeatability & Reproducibility

What's the most common joke among testers? "Works on my machine." - while testers consider this a developer's measly excuse for not fixing a defect, developers consider this statement sufficient proof that the test was executed sloppily. The issue? Reproducibility.
It gets worse when the tester calls in the developer to show them the problem - and: magic - it works! The issue? Repeatability.

Reproducibility

In science, reproducibility is key - a hypothesis which can't rely on reproducible evidence is subject to severe doubts, and for good reason. In order to make a reliable statement of quality, we therefore need to ensure that test results are reproducible.
This means that given the same setup, we would expect to get the same outcome.
Let's look closely at the factors affecting the reproducibility of a test:
Preconditions, the environment, the code segment in question, the method of test execution - all affect reproducibility.
As most applications are stateful (i.e. the outcome depends on the current state of the system), reproducibility requires a perfect reconstruction of the test conditions. The bigger the scope affected by the test is - the more test conditions need to be met. In the worst case scenario, the entire world could affect the test case, and our only chance of reproducing the same outcome would be to snapshot and reset the world - which, of course, we can't do.

Our goal therefore should be to minimize the essential test conditions, as every additional condition reduces reproducibility.
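
A minimal sketch of what "fewer test conditions" can mean in code, using an invented weekend-surcharge rule (the rule and both function names are assumptions made for this example). The first function silently depends on the wall clock, so its result depends on the day the test runs. The second one takes the date as an argument, so the argument is the only condition left to reproduce.

```python
from datetime import date


def weekend_surcharge_applies_hidden_state() -> bool:
    """Reads the wall clock: the outcome depends on the day the test happens to run."""
    return date.today().weekday() >= 5  # Monday is 0, so 5 and 6 are Saturday and Sunday


def weekend_surcharge_applies(order_day: date) -> bool:
    """Same rule, but the only test condition is the explicit argument."""
    return order_day.weekday() >= 5


# These checks give the same answer on any machine, on any day.
assert weekend_surcharge_applies(date(2018, 6, 23))      # a Saturday
assert not weekend_surcharge_applies(date(2018, 6, 25))  # a Monday
```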

Repeatability

Another key to hypothesis testing is being able to do the same thing over and over in order to get the same outcome. The Scientific Method requires repeatability for good reason: which conclusion do we draw when doing the same thing twice leads to different outcomes?
When we create an automated system which possibly fires the same code segment millions (or even billions) of times per day, then even a 1% fault ratio is unacceptable, so we can't rely on tests that may or may not be correct - we want the software itself to always respond in the same way, and we want our software tests to do the same.
The more often we run our tests, the more repeatability we need for our tests. When executing a test once a week, having 1% problems in our repeatability means that about once in two years, we may need to repeat a test to get the correct result. It's an entirely different story when the test is executed a few hundred times per day - even a 1% repeatability issue would mean that we're doing nothing except figuring out why the tests have failed!
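
The arithmetic behind these two cases can be sketched in a few lines. The function name is made up, and "a few hundred times per day" is taken as 300 purely as an assumption for the example.

```python
def expected_spurious_failures(runs: int, failure_percent: float) -> float:
    """Expected number of runs that fail for reasons unrelated to the software."""
    return runs * failure_percent / 100


weekly_runs_in_two_years = 2 * 52   # one run per week
runs_per_day = 300                  # assumed value for "a few hundred"

print("Weekly test, two years:", expected_spurious_failures(weekly_runs_in_two_years, 1))
print("Frequent test, one day:", expected_spurious_failures(runs_per_day, 1))
```

Under these assumptions, the weekly test misleads us about once in two years, while the frequently executed test produces several false alarms every single day.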

Flakiness

Every developer who uses a Continuous Integration (or: Deployment) pipeline has some horror stories to tell about flaky tests. Flakiness, in short, is the result of both reproducibility and repeatability issues.
Tests become flaky when either the process isn't 100% repeatable or there are some preconditions which haven't been caught in preparing the tests.
As test complexity increases, the number of factors potentially causing flakiness increases - as does the number of test steps potentially producing flaky results.

Let's re-examine our pyramid:

Test Type	Repeatability	Reproducibility	Causes of Flakiness
Process Chain	Difficult: Any change can change the outcome.	Extremely low: A consistent state across many systems is almost impossible to maintain.	Unknown changes, Unknown configuration effects, Undefined interactions, Unreliable systems, Unreliable infrastructure
System	Extremely low: Desired feature change can change overall system.	Challenging: Any system change can cause any test to fail.	Unknown configuration effects, Undefined interactions, Unreliable infrastructure
Integration	Low: Every release has new features, so tests need updates.	Low: Every feature change will change test outcomes.	Unknown configuration effects, Unreliable infrastructure
Feature&Contract	High: Feature tests are changed only when features change.	High: Feature definitions are comprehensive.	Uncoordinated changes in API definitions
Unit	High: The test outcome should only change when the code has changed.	Extremely high. A unit test always does the same one thing.	None.

We again observe that testing high up in the pyramid leads to high flakiness and poor test outcomes - whereas testing far down in the pyramid creates a higher level of quality control.

A flakiness level of 10% means that out of 10 test runs, an average of 1 fails - so if we include a test suite of 30 flaky tests into a build pipeline, we're hardly ever going to get a Green Master - and we just don't know if there's a software problem or something else is going on.
And 10% flakiness in Process Chains is not a bad value - I've seen numbers ranging as high as 50%, given stuff like network timeouts, uncommunicated downtimes, unreliable data in the test database etc.
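
To put a rough number on "hardly ever", here is a small sketch. It assumes that the tests fail independently of each other, which real pipelines won't satisfy exactly, so treat the result as an order of magnitude. The function name is invented for this example.

```python
def green_build_probability(flaky_tests: int, flakiness: float) -> float:
    """Chance that every test passes, assuming the tests fail independently."""
    return (1.0 - flakiness) ** flaky_tests


for flakiness in (0.01, 0.10, 0.50):
    chance = green_build_probability(30, flakiness)
    print(f"{flakiness:.0%} flakiness, 30 tests: {chance:.2%} green builds")
```

With 30 tests at 10% flakiness each, only about 4% of builds turn green even when the software is perfectly fine. At 50% flakiness, a green build is practically impossible.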

When we want to rely on our tests, we must guarantee 100% repeatability and reproducibility to prevent flakiness - and the only way to get there is to move tests as low in the pyramid as possible.

Conclusion

In this section, we have covered some of the critical factors contributing to a reliable testing system.
Long story short: we need a testing strategy that moves tests to the lowest levels in the pyramid, otherwise our tests will be a quality issue all by themselves!

Sunday, June 24, 2018