Sunday, June 24, 2018

Test Pyramid Explained - Part 2: Measurement Systems

Understanding the "Why" of the Test Pyramid is important in making the right decisions. This article examines the underlying foundation of testing: making statements about quality.
Why do we need to consider the test pyramid when creating our test suite?



How can we know if the software works? Whether it does what it's supposed to do? Whether it does that right?
If not, whether it's broken? What doesn't work? Why it doesn't work? What caused it to malfunction?
These are all different questions - and so the approach to answering the questions also differs. Which approach should we then take? 

Let's take a look at our test pyramid.



In an attempt to answer the questions above, we need to explore:

Measurement Systems

According to Wikipedia, a measurement system includes a number of factors, including - but not limited to - these:

Miss one of the factors, and you might end up with an entirely messed up test process!

Before we can answer how these factors contribute to your testing process, we need to examine why they are relevant - and to answer the "Why" question, we need to answer the even more fundamental question:

Why test?

There are varying reasons for testing, all of which require different approaches:

  1. Ensuring you did things right.
  2. Ensuring you are doing things right.
  3. Ensuring you will do things right.
  4. Ensuring you understand things right.
  5. Ensuring you did the right things.
  6. ...

As you might guess, a test approach to ensure you did things right will look vastly different from a test approach to ensure that you will be doing the right things.
Some approaches are more reactive in nature, while others are more proactive. Some are more concerned with the process of creating software - others are more concerned with the created software.

When no tests have previously been in place (such as in a Legacy System), you're well advised to start at the easiest level: ensuring that you did things right, i.e. ensuring that the software works as intended.
This is our classic Waterfall testing approach, where testers get confronted with allegedly "finished" software which just needs to be quality-checked.

When you have the luxury of starting with a Green Field, you're well advised to take the more challenging, yet more rewarding route: ensuring that you will be doing the right thing right - before even starting off.
This approach requires "building quality in" right from the outset, using practices such as Behaviour Driven Development, Test Driven Development and Specification by Example.

The advantage of "testing early" is that misunderstandings are caught even before they can lead to faulty software; the advantage of "testing often" is that problems get solved before they proliferate or get worse.

The desirable state

A perfect testing approach would minimize:

  • the risk of introducing fault into the system
  • the time required to detect potential fault in the system
  • the effort required to correct fault in the system

When taking a good look at our testing pyramid from the last article, we can see the following:

| Test Type | Prevent risk | Execution time | Correction Effort |
|---|---|---|---|
| Process Chain | Hardly helps: Often doesn't even get fixed before launch. | Might come too late in the process. | Lots of pre-analysis required, potentially already proliferated. |
| System | Very low: Only prevents known launch failure. | Very slow, often gets skipped. | Slow |
| Integration | Low: Only stops defects from proliferating in the system. | Slow, difficult to set up. | Interrupts the flow of work. |
| Feature&Contract | BDD: Know risk ahead. | Would run all the time while working on a feature. | Should only affect 1 method. |
| Unit | TDD: Know risk ahead. | Negligible. Can always run. | Minimal. Should only affect 1 line of code. |


This matrix gives the impression that any test other than a Feature&Contract or Unit test doesn't even make sense from an economic perspective - yet these two types of test are the ones most often neglected, while attention is paid to the upper parts of the Test Pyramid. Why does this happen?


Precision and Accuracy

Choose your poison

Let's suppose I turn on Google Maps and want to know how long my daily commute will take.
Imagine that I get to choose between two answers:
Answer #1: "Between 1 minute and 10 hours". Wow, that's helpful - not! It's an accurate answer with low precision.
Answer #2: "45 minutes, 21 seconds and 112 milliseconds". I like that. But ... when I hit the highway, there's traffic all over the place. I end up taking three hours. This answer was very precise - just also very inaccurate.

Do you prefer high accuracy and low precision - or high precision and low accuracy?
It seems like only a dunce would answer "high precision and low accuracy", because that's like having a non-winning lottery ticket.
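
To make the two terms concrete, here is a minimal sketch that treats each answer as a range of minutes. The `Estimate` class and the reading of Answer #2 as a range one millisecond wide are assumptions made purely for illustration; neither comes from Google Maps or any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A commute estimate given as a range of minutes (hypothetical type)."""
    low_minutes: float
    high_minutes: float

    def is_accurate(self, actual_minutes: float) -> bool:
        # Accurate: the real value falls inside the stated range.
        return self.low_minutes <= actual_minutes <= self.high_minutes

    def width_minutes(self) -> float:
        # Precision: the narrower the range, the more precise the claim.
        return self.high_minutes - self.low_minutes


# Answer #1: "Between 1 minute and 10 hours"
vague = Estimate(low_minutes=1, high_minutes=600)

# Answer #2: "45 minutes, 21 seconds and 112 milliseconds", read here as a
# range that is only one millisecond wide (an assumption for illustration).
exact_minutes = 45 + 21 / 60 + 0.112 / 60
pinpoint = Estimate(exact_minutes, exact_minutes + 0.001 / 60)

actual = 180  # the commute really took three hours

for name, estimate in [("vague", vague), ("pinpoint", pinpoint)]:
    print(name,
          "accurate:", estimate.is_accurate(actual),
          "width in minutes:", round(estimate.width_minutes(), 6))
```

The vague answer contains the real commute time but spans almost ten hours; the pinpoint answer spans a fraction of a second and misses the real value entirely.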

Approximating meaning

When starting with nothing, it's a good idea to turn a huge fog of war into something more tangible, more solid - so we start with a test which brings us accuracy at the cost of precision. We approximate.
In the absence of a better strategy, a vague answer is better than no answer or a wrong answer. And that is how Process Chain tests are created.

Knowing nothing about the system, I can still easily answer a simple question, such as: "If I buy lettuce, bananas and napkins - will I have these exact three things shipped to my home?"
This is a typical process chain test, as it masks the complexity of the underlying process. The test requires little understanding of the system, yet allows the tester to make a definite yes/no statement about whether the system works as intended.

Unravelling complexity

When a tester's answer to a process chain test is "It doesn't work", the entire lack of accuracy in the quality statement is thrown directly at the developers, who then need to discover why it doesn't work. Testers then get trained to make the best possible statement of quality, such as, "I got parsley instead of lettuce" and "The order confirmation showed lettuce" - but what the tester may never know is where the problem got introduced into the system. In a complex service landscape (potentially covering B2B suppliers, partners and service providers), the analysis process is often "Happy Hunting".

The false dichotomy

Choosing either accuracy or precision is a false dichotomy - why opt for one when you can have both? What is required is a measurement system of finer granularity.
Even in the above example, we hinted that the tester is definitely able to make a more accurate statement than "It didn't work" - and they can be more precise than that, as well. Good testers would always strive for the maximum possible accuracy and precision.
Their accuracy is only limited by logic hidden from their understanding - and their precision is only limited by the means through which they can interact with the process.
Giving testers deeper insight into the logic of a system allows them to increase their accuracy.
Giving them better means of interacting with the system allows them to increase their precision.

Under perfect conditions, a test will answer with perfect accuracy and perfect precision. And that's our Unit Test. The downside? To test for all potential issues - we need a LOT of them: Any single missing unit test means that we're punching holes into our precision statements.
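
The difference in granularity can be sketched with the grocery order from above. Everything in this example is hypothetical: the three step functions stand in for real services, and the picking step contains a deliberate bug so that the coarse check and the per-step checks can be compared.

```python
# Hypothetical three-step order pipeline; the picking step contains a
# deliberate bug so the two styles of test can be compared.

def confirm_order(items):
    # Step 1: echo the ordered items back to the customer.
    return list(items)


def pick_from_warehouse(items):
    # Step 2: deliberately buggy, lettuce is swapped for parsley.
    return ["parsley" if item == "lettuce" else item for item in items]


def ship(items):
    # Step 3: deliver whatever was picked.
    return list(items)


def process_chain_check(order):
    # Coarse check: one yes/no answer for the whole chain.
    return ship(pick_from_warehouse(confirm_order(order))) == order


def step_checks(order):
    # Fine-grained checks: one yes/no answer per step, each step tested
    # in isolation with a known-good input.
    return {
        "confirm_order": confirm_order(order) == order,
        "pick_from_warehouse": pick_from_warehouse(order) == order,
        "ship": ship(order) == order,
    }


order = ["lettuce", "bananas", "napkins"]
print("process chain works:", process_chain_check(order))
failing = [name for name, ok in step_checks(order).items() if not ok]
print("failing steps:", failing)
```

The process chain check only reports that something is wrong; the per-step checks name the step that introduced the parsley. The price is that each step needs its own check, which is why so many of them are required.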


Repeatability & Reproducibility

What's the most common joke among testers? "Works on my machine." - while testers consider this a developer's measly excuse for not fixing a defect, developers consider this statement sufficient proof that the test was executed sloppily. The issue? Reproducibility.
It gets worse when the tester calls in the developer to show them the problem - and: magic - it works! The issue? Repeatability.

Reproducibility

In science, reproducibility is key - a hypothesis which can't rely on reproducible evidence is subject to severe doubts, and for good reason. To make a reliable statement of quality, therefore, we need to ensure that test results are reproducible.
This means that given the same setup, we would expect to get the same outcome.
Let's look closely at the factors affecting the reproducibility of a test:
Preconditions, the environment, the code segment in question, the method of test execution - all affect reproducibility.
As most applications are stateful (i.e. the outcome depends on the current state of the system), reproducibility requires a perfect reconstruction of the test conditions. The bigger the scope affected by the test is - the more test conditions need to be met. In the worst case scenario, the entire world could affect the test case, and our only chance of reproducing the same outcome would be to snapshot and reset the world - which, of course, we can't do.

Our goal therefore should be to minimize the essential test conditions, as every additional condition reduces reproducibility.
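
One common way to shrink the set of test conditions is to turn hidden state into an explicit input. The sketch below uses an invented delivery fee rule (free on weekdays for orders of 50 or more, otherwise 4.99); the rule and both function names are assumptions for illustration only.

```python
from datetime import date


def delivery_fee_hidden_state(order_total):
    # Depends on today's date: the same call can give different results
    # on different days, so a test of it is hard to reproduce.
    is_weekend = date.today().weekday() >= 5
    return 0.0 if order_total >= 50 and not is_weekend else 4.99


def delivery_fee(order_total, order_date):
    # The date is now an explicit input: the only test conditions left
    # are the two arguments.
    is_weekend = order_date.weekday() >= 5
    return 0.0 if order_total >= 50 and not is_weekend else 4.99


# June 24, 2018 was a Sunday; June 25, 2018 was a Monday.
print(delivery_fee(60, date(2018, 6, 24)))
print(delivery_fee(60, date(2018, 6, 25)))
```

A test of `delivery_fee` gives the same outcome on any machine and on any day, while a test of `delivery_fee_hidden_state` silently depends on when it is run.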

Repeatability

Another key to hypothesis testing is being able to do the same thing over and over in order to get the same outcome. The Scientific Method requires repeatability for good reason: which conclusion do we draw when doing the same thing twice leads to different outcomes?
When we create an automated system which possibly fires the same code segment millions (or even billions) of times per day, then even a 1% fault ratio is unacceptable, so we can't rely on tests that may or may not be correct - we want the software itself to always respond in the same way, and we want our software tests to do the same.
The more often we run our tests, the more repeatability we need for our tests. When executing a test once a week, having 1% problems in our repeatability means that once in two years, we may need to repeat a test to get the correct result. It's an entirely different story when the test is executed a few hundred times per day - even a 1% repeatability issue would mean that we're doing nothing except figuring out why the tests have failed!
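
The arithmetic behind these two cases is a simple expected value. The sketch below assumes that "a few hundred times per day" means 300 runs; that figure and the function name are illustrative assumptions.

```python
def expected_false_alarms(runs, unreliable_fraction):
    # Expected number of runs whose result cannot be trusted, given the
    # fraction of runs that misbehave.
    return runs * unreliable_fraction


weekly_for_two_years = expected_false_alarms(runs=2 * 52, unreliable_fraction=0.01)
# "A few hundred times per day" is taken as 300 here (an assumed figure).
per_day_at_300_runs = expected_false_alarms(runs=300, unreliable_fraction=0.01)

print(f"weekly test, two years: about {weekly_for_two_years:.1f} false alarms")
print(f"300 runs per day: about {per_day_at_300_runs:.1f} false alarms per day")
```

Roughly one false alarm in two years is a nuisance; roughly three per day, each needing someone to work out whether the software or the test is at fault, is a permanent distraction.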


Flakiness

Every developer who uses a Continuous Integration (or: Deployment) pipeline has some horror stories to tell about flaky tests. Flakiness, in short, is the result of both reproducibility and repeatability issues.
Tests become flaky when either the process isn't 100% repeatable or there are some preconditions which haven't been caught in preparing the tests.
As test complexity increases, the number of factors potentially causing flakiness increases - as does the number of test steps potentially producing flaky results.

Let's re-examine our pyramid:

| Test Type | Repeatability | Reproducibility | Causes of Flakiness |
|---|---|---|---|
| Process Chain | Difficult: Any change can change the outcome. | Extremely low: A consistent state across many systems is almost impossible to maintain. | Unknown changes, Unknown configuration effects, Undefined interactions, Unreliable systems, Unreliable infrastructure |
| System | Extremely low: Desired feature change can change overall system. | Challenging: Any system change can cause any test to fail. | Unknown configuration effects, Undefined interactions, Unreliable infrastructure |
| Integration | Low: Every release has new features, so tests need updates. | Low: Every feature change will change test outcomes. | Unknown configuration effects, Unreliable infrastructure |
| Feature&Contract | High: Feature tests are changed only when features change. | High: Feature definitions are comprehensive. | Uncoordinated changes in API definitions |
| Unit | High: The test outcome should only change when the code has changed. | Extremely high. A unit test always does the same one thing. | None. |


We again observe that testing high up in the pyramid leads to high flakiness and poor test outcomes - whereas testing far down in the pyramid creates a higher level of quality control.

A flakiness level of 10% means that out of 10 test runs, an average of 1 fails for reasons that have nothing to do with the software - so if we include a test suite of 30 flaky tests in a build pipeline, we're hardly ever going to get a Green Master, and whenever it's red, we just don't know whether there's a software problem or something else is going on.
And 10% flakiness in Process Chains is not a bad value - I've seen numbers ranging as high as 50%, given stuff like network timeouts, uncommunicated downtimes, unreliable data in the test database etc.
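
How rarely such a pipeline turns green can be estimated with a short calculation. It assumes that each test fails spuriously with the same probability and independently of the others, which real suites only approximate; the function name is made up for this sketch.

```python
def chance_of_green_build(test_count, flakiness):
    # Probability that every test passes when the software is fine,
    # assuming each test fails spuriously and independently.
    return (1 - flakiness) ** test_count


for flakiness in (0.10, 0.50):
    p = chance_of_green_build(test_count=30, flakiness=flakiness)
    print(f"{flakiness:.0%} flaky, 30 tests: green in {p:.2%} of builds")
```

Under these assumptions, 30 tests at 10% flakiness give a green build only about 4% of the time, even when nothing is wrong with the software. At 50% flakiness a green build is practically never seen.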


When we want to rely on our tests, we must guarantee 100% repeatability and reproducibility to prevent flakiness - and the only way to get there is to move tests as low in the pyramid as possible.


Conclusion

In this section, we have covered some of the critical factors contributing to a reliable testing system.
Long story short: we need a testing strategy that moves tests to the lowest levels in the pyramid, otherwise our tests will be a quality issue all by themselves!



