Sunday, June 24, 2018

Test Pyramid Explained - Part 2: Measurement Systems

Understanding the "Why" of the Test Pyramid is important in making the right decisions. This article examines the underlying foundation of testing: making statements about quality.
Why do we need to consider the test pyramid when creating our test suite?

How can we know if the software works? Whether it does what it's supposed to do? Whether it does that right?
If not, whether it's broken? What doesn't work? Why it doesn't work? What caused it to malfunction?
These are all different questions - and so the approach to answering the questions also differs. Which approach should we then take? 

Let's take a look at our test pyramid.

In an attempt to answer the questions above, we need to explore:

Measurement Systems

According to Wikipedia, a measurement system includes a number of factors, including - but not limited to - these:

Miss one of the factors, and you might end up with an entirely messed up test process!

Before we can answer how these factors contribute to your testing process, we need to examine why they are relevant - and to answer the "Why" question, we need to answer the even more fundamental question:

Why test?

There are varying reasons for testing, all of which require different approaches:

  1. Ensuring you did things right.
  2. Ensuring you are doing things right.
  3. Ensuring you will do things right.
  4. Ensuring you understand things right.
  5. Ensuring you did the right things.
  6. ...

As you might guess, a test approach to ensure you did things right will look vastly different from a test approach to ensure that you will be doing the right things.
Some approaches are more reactive in nature, while others are more proactive. Some are more concerned with the process of creating software - others are more concerned with the created software.

When no tests have previously been in place (such as in a Legacy System), you're well advised to start at the easiest level: ensuring that you did things right, i.e. ensuring that the software works as intended.
This is our classic Waterfall testing approach, where testers get confronted with allegedly "finished" software which just needs to be quality-checked.

When you have the luxury of starting with a Green Field, you're well advised to take the more challenging, yet more rewarding route: ensuring that you will be doing the right thing right - before even starting off.
This approach requires "building quality in" right from the outset, using practices such as Behaviour Driven Development, Test Driven Development and Specification by Example.

The advantage of "testing early" is that misunderstandings are caught even before they can lead to faulty software; the advantage of "testing often" is that problems get solved before they proliferate or worsen.

The desirable state

A perfect testing approach would minimize:

  • the risk of introducing faults into the system
  • the time required to detect potential faults in the system
  • the effort required to correct faults in the system

Taking a good look at our testing pyramid from the last article, we can observe the following:

Test Type | Prevent risk | Execution time | Correction Effort
Process Chain | Hardly helps: Often doesn't even get fixed before launch. | Might come too late in the process. | Lots of pre-analysis required, potentially already proliferated.
System | Very low: Only prevents known launch failure. | Very slow, often gets skipped. | Only catches defects from proliferating in the system.
Integration | - | Slow, difficult to set up. | Interrupts the flow of work.
Feature & Contract | Know risk ahead. | Would run all the time while working on a feature. | Should only affect 1 method.
Unit | Know risk ahead. | Can always run. | Should only affect 1 line of code.

This matrix gives the impression that tests other than Feature & Contract or Unit tests don't even make sense from an economic perspective - yet these two types of test are the ones most often neglected, while attention is paid to the upper parts of the Test Pyramid. Why does this happen?

Precision and Accuracy

Choose your poison

Let's suppose I turn on Google Maps and want to know how long my daily commute will take.
Imagine that I get to choose between two answers:
Answer #1: "Between 1 minute and 10 hours". Wow, that's helpful - not! It's an accurate answer with low precision.
Answer #2: "45 minutes, 21 seconds and 112 milliseconds". I like that. But ... when I hit the highway, there's traffic all over the place. I end up taking three hours. This answer was very precise - just also very inaccurate.

Do you prefer high accuracy and low precision - or high precision and low accuracy?
It seems like only a dunce would answer "high precision and low accuracy", because that's like having a non-winning lottery ticket.
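
To make the distinction a bit more tangible, here is a minimal Python sketch. The commute numbers and the helper names `bias` and `spread` are made up for illustration: `bias` stands in for (in)accuracy, `spread` for (im)precision.

```python
from statistics import mean, pstdev

def bias(estimates, actual):
    """Accuracy proxy: distance between the average estimate and the true value."""
    return abs(mean(estimates) - actual)

def spread(estimates):
    """Precision proxy: how much the estimates scatter around their own average."""
    return pstdev(estimates)

actual_commute_min = 60  # hypothetical true commute time in minutes

# Hypothetical estimators: one scatters widely around the truth,
# the other is tightly clustered around the wrong value.
accurate_but_imprecise = [20, 100, 45, 75]
precise_but_inaccurate = [45.3, 45.4, 45.3, 45.4]

for name, estimates in [("accurate, imprecise", accurate_but_imprecise),
                        ("precise, inaccurate", precise_but_inaccurate)]:
    print(name, "bias:", bias(estimates, actual_commute_min),
          "spread:", round(spread(estimates), 2))
```

The first estimator is right on average but useless for planning a single trip; the second looks trustworthy and is consistently wrong.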

Approximating meaning

When starting from nothing, it's a good idea to turn a huge fog of war into something more tangible, more solid - so we start with a test which brings us accuracy at the cost of precision. We approximate.
In the absence of a better strategy, a vague answer is better than no answer or a wrong answer. And that is how Process Chain tests are created.

Knowing nothing about the system, I can still easily answer a simple question, such as: "If I buy lettuce, bananas and napkins - will I have these exact three things shipped to my home?"
This is a typical process chain test, as it masks the complexity of the underlying process. The test requires little understanding of the system, yet allows the tester to make a definite yes/no statement about whether the system works as intended.

Unravelling complexity

When a tester's answer to a process chain test is "It doesn't work", the entire lack of accuracy in the quality statement is thrown directly at the developers, who then need to discover why it doesn't work. Testers then get trained to make the best possible statement of quality, such as "I got parsley instead of lettuce" and "The order confirmation showed lettuce" - but what the tester may never know is where the problem got introduced into the system. In a complex service landscape (potentially covering B2B suppliers, partners and service providers), the analysis process is often "Happy Hunting".

The false dichotomy

Choosing either accuracy or precision is a false dichotomy - why opt for one when you can have both? What is required is a measurement system of finer granularity.
Even in the above example, we hinted that the tester is definitely able to make a more accurate statement than "It didn't work" - and they can be more precise than that, as well. Good testers would always approximate the maximum possible accuracy and precision.
Their accuracy is only limited by logic hidden from their understanding - and their precision is only limited by the means through which they can interact with the process.
Giving testers deeper insight into the logic of a system allows them to increase their accuracy.
Giving them better means of interacting with the system allows them to increase their precision.

Under perfect conditions, a test will answer with perfect accuracy and perfect precision. And that's our Unit Test. The downside? To test for all potential issues - we need a LOT of them: Any single missing unit test means that we're punching holes into our precision statements.

Repeatability & Reproducibility

What's the most common joke among testers? "Works on my machine." While testers consider this a developer's measly excuse for not fixing a defect, developers consider it sufficient proof that the test was executed sloppily. The issue? Reproducibility.
It gets worse when the tester calls in the developer to show them the problem - and: magic - it works! The issue? Repeatability.


In science, reproducibility is key - a hypothesis which can't rely on reproducible evidence is subject to severe doubts, and for good reason. To make a reliable statement of quality, therefore, we need to ensure that test results are reproducible.
This means that given the same setup, we would expect to get the same outcome.
Let's look closely at the factors affecting the reproducibility of a test:
Preconditions, the environment, the code segment in question, the method of test execution - all affect reproducibility.
As most applications are stateful (i.e. the outcome depends on the current state of the system), reproducibility requires a perfect reconstruction of the test conditions. The bigger the scope affected by the test is - the more test conditions need to be met. In the worst case scenario, the entire world could affect the test case, and our only chance of reproducing the same outcome would be to snapshot and reset the world - which, of course, we can't do.

Our goal therefore should be to minimize the essential test conditions, as every additional condition reduces reproducibility.
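
As a small illustration of removing a test condition, the following Python sketch passes the current date into the code instead of letting it read the system clock. The business rule and the function name `is_weekend_delivery` are hypothetical; the point is only that the test no longer depends on the day it happens to run.

```python
from datetime import date

def is_weekend_delivery(today: date) -> bool:
    """Hypothetical rule: orders placed on Saturday or Sunday count as weekend deliveries."""
    return today.weekday() >= 5  # Monday is 0, Saturday is 5, Sunday is 6

def test_saturday_is_weekend_delivery():
    # The date is an explicit input, so the outcome is the same on every run.
    assert is_weekend_delivery(date(2018, 6, 23)) is True

def test_monday_is_not_weekend_delivery():
    assert is_weekend_delivery(date(2018, 6, 25)) is False
```

Had the function called `date.today()` internally, the clock would have become one more precondition that has to be reconstructed to reproduce a result.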


Another key to hypothesis testing is being able to do the same thing over and over in order to get the same outcome. The Scientific Method requires repeatability for good reason: which conclusion do we draw when doing the same thing twice leads to different outcomes?
When we create an automated system which possibly fires the same code segment millions (or even billions) of times per day, then even a 1% fault ratio is unacceptable, so we can't rely on tests that may or may not be correct - we want the software itself to always respond in the same way, and we want our software tests to do the same.
The more often we run our tests, the more repeatability we need. When executing a test once a week, a 1% repeatability problem means that roughly once in two years, we may need to repeat a test to get the correct result. It's an entirely different story when the test is executed a few hundred times per day - even a 1% repeatability issue would mean that we're doing nothing except figuring out why the tests have failed!
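
The arithmetic behind these two cases can be sketched in a few lines of Python. The function name is made up, and the figure of 300 runs per day is merely an assumed value for "a few hundred".

```python
def expected_false_failures(runs: float, failure_rate: float) -> float:
    """Expected number of runs that fail for reasons unrelated to the software."""
    return runs * failure_rate

# One run per week: about half a false failure per year, i.e. roughly one every two years.
weekly_test_per_year = expected_false_failures(runs=52, failure_rate=0.01)

# Assumed 300 runs per day: several false failures to investigate every single day.
frequent_test_per_day = expected_false_failures(runs=300, failure_rate=0.01)

print(round(weekly_test_per_year, 2), round(frequent_test_per_day, 2))
```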


Every developer who uses a Continuous Integration (or: Deployment) pipeline has some horror stories to tell about flaky tests. Flakiness, in short, is the result of both reproducibility and repeatability issues.
Tests become flaky when either the process isn't 100% repeatable or there are some preconditions which haven't been caught in preparing the tests.
As test complexity increases, the number of factors potentially causing flakiness increases - as does the number of test steps potentially producing flaky results.

Let's re-examine our pyramid:

Test Type | Repeatability | Reproducibility | Causes of Flakiness
Process Chain | Difficult: Any change can change the outcome. | Extremely low: A consistent state across many systems is almost impossible to maintain. | Unknown changes, Unknown configuration effects, Undefined interactions, Unreliable systems, Unreliable infrastructure
System | Extremely low: Desired feature change can change overall system. | Any system change can cause any test to fail. | Unknown configuration effects, Undefined interactions, Unreliable infrastructure
Integration | Every release has new features, so tests need updates. | Every feature change will change test outcomes. | Unknown configuration effects, Unreliable infrastructure
Feature & Contract | Feature tests are changed only when features change. | Feature definitions are | Uncoordinated changes in API
Unit | The test outcome should only change when the code has changed. | Extremely high. A unit test always does the same one thing. | -

We again observe that testing high up in the pyramid leads to high flakiness and poor test outcomes - whereas testing far down in the pyramid creates a higher level of quality control.

A flakiness level of 10% means that out of 10 test runs, an average of 1 fails - so if we include a suite of 30 flaky tests in a build pipeline, we're hardly ever going to get a Green Master, and we just don't know whether there's a software problem or something else is going on.
And 10% flakiness in Process Chains is not a bad value - I've seen numbers ranging as high as 50%, given stuff like network timeouts, uncommunicated downtimes, unreliable data in the test database etc.
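
How rarely such a pipeline turns green can be estimated with a short Python sketch. It assumes that the flaky tests fail independently of each other, which is a simplification; the function name is made up.

```python
def green_build_probability(flaky_tests: int, flakiness: float) -> float:
    """Chance that every flaky test passes in one run, assuming independent failures."""
    return (1.0 - flakiness) ** flaky_tests

p_green = green_build_probability(flaky_tests=30, flakiness=0.10)
print(f"{p_green:.1%}")  # prints 4.2%
```

Under that assumption, only about one build in twenty-four would be green even if the software itself were flawless.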

When we want to rely on our tests, we must guarantee 100% repeatability and reproducibility to prevent flakiness - and the only way to get there is to move tests as low in the pyramid as possible.


In this section, we have covered some of the critical factors contributing to a reliable testing system.
Long story short: we need a testing strategy that moves tests to the lowest levels in the pyramid, otherwise our tests will be a quality issue all by themselves!

Sunday, June 17, 2018

Test Pyramid Explained - part 1

Let's take a deeper look at what the Test Pyramid is, and how it can help us achieve sustainable, high quality. In this section, we will take a look at the left part of the picture only, as understanding this portion is essential to making sense of the right side.

Yet another model of the "Test Pyramid" - there's more to it than meets the eye!

The five levels

Before we get into the "How", we will examine the "What" - the five different levels, starting from top to bottom. Why top-down? Because this is how the business looks at software.

Process Chain Tests

A process chain is a description of a user-centric feature (oftentimes, a user story), irrespective of where it is implemented. From a high level, a customer's request might be something like "I want my order shipped home."
Such a process chain may consist of a large number of technical features realized across many subsystems, some of them potentially not even software. In our example, the process chain might look like this:

  1.  User presses "Purchase" (online shop)
  2.  User makes payment (payment provider)
  3.  Order gets sent to warehouse for picking (warehouse system)
  4.  Order is picked (picker's device + warehouse system)
  5.  Package is sent for shipment (logistics + logistics payment system)
  6.  Package is shipped (logistics + logistics tracking system)
  7.  Package arrives (logistics tracking system)
  8.  Order is closed (online shop)

As we can see from this example, it's incredibly complex to test a process chain, as each system and activity has a chance to fail. The number of potential failure scenarios is nearly infinite - regardless of how many we cover, there might still be another.

The good news is that if a process chain works, it's a guarantee that all subsystems and steps worked.
At the same time, the bad news is that if the process chain doesn't work, we may need to do a lot of backtracking to discover where the failure was introduced into the system.

Regardless of how much we test elsewhere - it might just be a good idea to do at least one supervised process chain test before "going live" with a complex system. That is, if we can afford it. Many organizations might simply resort to monitoring a live system's process chain in a "friendly user pilot phase".

A process chain test might take anywhere from a few minutes to many weeks to complete. As a rule of thumb, lacking any further information, an hour to a day might be a solid guess for the execution time of such a test. This explains why we don't want hundreds of them.

System Tests

Slightly simpler than process chain tests are the commonplace system tests: The system is considered an inseparable unit - oftentimes, a "black box".

A system test would be concerned with the activities and data transfers from the time data enters into one system until the sub-process within the system is closed. Resorting to our above example, a system test of the Online Shop might look like this:

  1.  User presses "Purchase" (Webshop)
  2.  User's order data is persisted as "Payment Pending" (Database)
  3.  User is redirected to payment section (External Payment service)
  4.  Payment is authorized (External Payment service)
  5.  Payment authorization ID is persisted (Database)
  6.  Order Status is set to "Payment Complete" (Database)
  7.  User is redirected to "Thank you" page (Webshop)
  8.  Order is forwarded to Warehouse system
  9.  Warehouse System sends Order Acknowledged message
  10.  Order Status is set to "In Process" (Database)

Here we see that system tests, despite having a much smaller scope than a process chain, are still nearly as difficult to test and stabilize.

Oddly enough, many so-called "test factories" test on this level, creating complex automation scripts - oftentimes based on tools such as SeleniumIDE - which is seen as a feasible way to automate tests with little effort.
The downside of automating system tests is that a minor change in the test constellation will invalidate the test - in our example, if the "Thank You" is replaced with a modal stating "Your order has been completed.", we might have to scrap the entire test (depending on how poorly it has been written).

I have seen entire teams spending major portions of their time both figuring out why system tests failed and keeping up with all the feature changes invalidating the tests.

System tests shouldn't take all too long, but 5-15 minutes for a single automated test case isn't unheard of. Fast system tests might finish in as little as ten seconds.

Integration Tests

Integration tests are intended to check the I/O of a system's components, usually ignoring both the larger scope process chain and the lower level technical details.

An integration test assumes that the preceding steps in the source system worked - the focus is on the system's entry and exit points, considering the internal logic as a black box.

In our webshop payment example, we might consider the following autonomous integration tests:

  1. When a user presses "Purchase", all items from the basket are stored in the database (UI -> Backend)
  2. When a user is forwarded to the Payment Website, the total purchase price is correctly transferred to the payment service (Backend -> payment system)
  3. When a payment is successfully completed, the payment data is correctly stored (payment system -> Backend)
  4. When an order is correctly paid, it is forwarded to the warehouse system (Backend -> warehouse system)
  5. The Warehouse system's order acknowledge is correctly processed (warehouse system -> Backend)

Integration tests are much smaller than system tests, and the root cause of failure is much easier to isolate.
The biggest downside of integration tests is that they rely on the availability and response of the partner system. If a partner system happens to be unavailable for any reason, integration tests cannot be run.
I've seen this break the back of one webshop's test suite, which relied on a global payment provider's sandbox that failed to answer during business hours because it was constantly bombarded by thousands of clients.

Integration tests don't do all that much; their main cost is the response time of the two systems involved. Good integration tests shouldn't take more than maybe 50ms, while poor integration tests might take a few seconds.

A good way to speed up integration tests is to mock slow or unreliable partner systems. This can speed them up massively, but adds complexity to the component's test suite.
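
A minimal sketch of such a mock, using Python's standard `unittest.mock`. The function `forward_paid_order`, the `send_order` method and the status strings are assumptions loosely modelled on the webshop example above, not an actual implementation.

```python
from unittest.mock import Mock

def forward_paid_order(order: dict, warehouse_client) -> str:
    """Hypothetical backend step: forward a paid order and return its new status."""
    if order["status"] != "Payment Complete":
        raise ValueError("only paid orders may be forwarded")
    ack = warehouse_client.send_order(order["id"], order["items"])
    return "In Process" if ack == "Order Acknowledged" else "Forwarding Failed"

def test_paid_order_is_forwarded_to_warehouse():
    # The mock stands in for the real warehouse system, so no network is involved.
    warehouse = Mock()
    warehouse.send_order.return_value = "Order Acknowledged"
    order = {"id": 42, "status": "Payment Complete", "items": ["lettuce", "bananas"]}

    assert forward_paid_order(order, warehouse) == "In Process"
    warehouse.send_order.assert_called_once_with(42, ["lettuce", "bananas"])
```

The price is that the mock encodes an assumption about how the partner system answers, and that assumption has to be kept in sync with the real system by other means.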

Feature & Contract Tests

This group contains two types of testing, as these go hand in hand: Feature tests validate the internal logic of how a system processes data. Contract tests validate how data enters and exits the system.

Here's an example of a feature test:

import spock.lang.Specification

class BasketValidationResponseSpec extends Specification {

    // Bread[1] etc. is this article's shorthand for basket items; BasketPojo is the class under test.
    def "information given to customer"(BasketPojo basket, String message, Boolean status) {
        expect:
        basket.statusMessage() == message
        basket.checkState() == status

        where:
        basket                         | message           | status
        [Bread[1], Butter[1], Book[1]] | "Valid"           | true
        []                             | "Empty basket"    | false
        [Bread[199]]                   | "Too much Bread"  | false
        [Bread[1], Butter[199]]        | "Too much Butter" | false
    }
}

(please forgive my indentation, HTML indentation is a pain)

Feature tests don't rely on any external interfaces being available, making them both reliable and fast to execute. Unlike unit tests (below), they don't test each method on its own, but might test the interaction of multiple methods and/or classes.

Contract tests are the flip side of the coin here, as a feature test assumes that the data is both provided in the right way and returned in a way that the interfacing component can correctly process. In an ever-changing software world, these assumptions are often untrue - contracts help create some reliability here. I don't want to go into that topic too deeply, as contracts are an entire field of their own.
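
Without going deeper into the topic, the basic idea can be hinted at with a very reduced Python sketch. The contract, its field names and the helper `contract_violations` are all hypothetical; real contract testing tools do considerably more than this.

```python
# Hypothetical contract: the fields and types a consumer expects in an order message.
ORDER_CONTRACT = {"order_id": int, "items": list, "total_price_cents": int}

def contract_violations(message: dict, contract: dict) -> list:
    """Return human-readable contract violations (empty if the message conforms)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems

def test_order_message_honours_contract():
    message = {"order_id": 42, "items": ["lettuce"], "total_price_cents": 199}
    assert contract_violations(message, ORDER_CONTRACT) == []
```

Run on both the producing and the consuming side, a check like this catches an uncoordinated interface change without either system having to be online.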

The good news is that good feature and contract tests execute in as little as 20ms, making them both incredibly fast and reliable.

Unit tests

The bread and butter of software development are unit tests. They test single methods within a class, and good engineering practice dictates that any line of code beyond getters and setters should have an associated unit test.
The purpose of unit tests isn't so much to create feature-level or user-comprehensible test feedback as to ensure that the code is workable - even when refactored.

Unit tests will ensure your code is loosely coupled, that each method doesn't do too many things (ideal amount: one purpose per method), that involuntary design errors are quickly caught and many other things which help developers.

While well-designed Feature tests answer the question "Why" a piece of code does what it does, a unit test defines "How" the code does it. Separating these two things often doesn't make sense either - the boundary may be fluid. The main difference is that a unit test never relies on anything other than the method under test, whereas a feature test might rely on full object instantiation.
Their main "downside" is that their lifetime is coupled to the method they test - whenever the method gets adjusted, the unit test either has to stay valid or needs to be modified. If the method gets deleted, the test goes as well.
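
For comparison with the feature test above, here is what a unit test in this sense might look like, sketched in Python. The method `line_total_cents` is a made-up example of a single-purpose method; the tests touch nothing but that one method.

```python
def line_total_cents(unit_price_cents: int, quantity: int) -> int:
    """Hypothetical single-purpose method: the price of one basket line."""
    if quantity < 0:
        raise ValueError("quantity must not be negative")
    return unit_price_cents * quantity

def test_line_total_multiplies_price_by_quantity():
    assert line_total_cents(250, 3) == 750

def test_line_total_rejects_negative_quantity():
    try:
        line_total_cents(250, -1)
    except ValueError:
        return  # expected: the method refuses invalid input
    raise AssertionError("expected ValueError for a negative quantity")
```

If `line_total_cents` is deleted or its signature changes, these tests go or change with it, which is exactly the coupling described above.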

Unit tests are extremely fast. There are even tools executing the unit tests of modified code in the background while the developer is still typing. The limiting factors here are pretty much CPU speed and RAM: executing an entire project's unit test suite shouldn't take more than a minute (excluding ramp-up time of the IDE), otherwise you're probably doing something wrong.

Given these definitions, let's do a brief...


Test Type | Duration | Accuracy | Durability
Process Chain | 1h + | Very Low | Very Low
System | 1-15min | Very Low | Very Low
Unit | < 10ms | High | N/A
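
To get a feeling for what these durations mean for a whole suite, here is a rough Python calculation. The suite size of 200 tests and the 5 minutes per system test are assumptions; the other per-test durations are taken loosely from the table above, and the tests are assumed to run one after another.

```python
def suite_duration_hours(test_count: int, seconds_per_test: float) -> float:
    """Sequential runtime of a suite, in hours."""
    return test_count * seconds_per_test / 3600

TEST_COUNT = 200  # assumed suite size

process_chain = suite_duration_hours(TEST_COUNT, 60 * 60)  # 1 h per test
system = suite_duration_hours(TEST_COUNT, 5 * 60)          # assumed 5 min per test
unit = suite_duration_hours(TEST_COUNT, 0.010)             # 10 ms per test

print(f"process chain: {process_chain:.0f} h")
print(f"system: {system:.1f} h")
print(f"unit: {unit * 3600:.0f} s")
```

Under these assumptions, the same number of tests takes more than a week at the top of the pyramid and a couple of seconds at the bottom.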

If you ask me if there's any sane reason to test on the higher levels of the pyramid - I'd answer: "It's too slow, too expensive and too unreliable." At the same time, there are reasons to test high in the pyramid, including: Coarse granularity business feasibility testing, lack of lower level automation and/or lack of developer skills.

In the next article of the series, I will explain the right side of the image - the testing metrics in more detail.

Sunday, June 10, 2018

Not Scrum - not a problem

We have been warned in our CSM training: "Scrum’s roles, events, artifacts, and rules are immutable and although implementing only parts of Scrum is possible, the result is not Scrum." - any deviation from Scrum leads to a dangerous "Scrum-But" - or worse ... so, you should stick to Scrum as per Guide!

Is that even a problem? Forget it!

Why would we even care if "the result is not Scrum"?

Here are a few examples of "results that aren't Scrum" ...

Unless you are in the business of producing and selling Scrum - why would it even be a problem if "the result is not Scrum"?

Scrum is but one of many means of achieving better business outcomes. It is neither a desirable outcome, nor the focus of your attention - again, unless you're making your money from Scrum.

As agnostic agile practitioners, we aren't forced to sell Scrum. We're trying to help our clients achieve better relevant business outcomes - more sales, more revenue, new markets, happier customers. If Scrum helps us get there, we're happy with Scrum as far as it helps. When Scrum becomes a distraction or an impediment - we'll gladly throw Scrum as per Guide overboard and do something else that works.

 "If you deviate, the result is not Scrum!" is a kind of fearmongering that only works on those who don't know that there are other, equally valid approaches. There's plenty of them around.

Saturday, June 2, 2018

Things that never meant what we understood

We throw around a lot of terminology - yet we may not even know what we're saying. Here are three terms that you may have understood differently from the original author's intention:

1. Technical debt

Technical debt has been used by many to denote willfully taken shortcuts on quality.
Many developers use the term to imply that code has been developed with poor craftsmanship - for instance, lack of tests or overly complicated structure.

Ward Cunningham, the inventor of the term, originally saw Technical debt as a means of learning from the Real World - software built upon today's understanding incorporating everything we know at the moment put into use. He took the stance that it's better to ship today, learn tomorrow and then return to the code with tomorrow's knowledge - than to wait until tomorrow before even creating any code!

In his eyes, code should always look like it was consistently built "just today", never even hinting that it had looked different years ago. Technical debt was intended to be nothing more than the existence of things we can't know yet.

Technical debt always implied high quality clean code - because that is the only way to incorporate tomorrow's learning in a sustainable way without slowing down.

2. Kaizen ("Continuous Improvement")

Kaizen is often understood as an approach of getting better at doing things.
While it's laudable to improve - many improvement initiatives are rather aimless. Scrum teams in particular easily fall victim to such aimless changes when each Retrospective covers a different topic.

Taiichi Ohno, known as the father of the Toyota Production System which inspired Lean, Six Sigma - and Scrum, stated, "Where there is no standard, there can be no Kaizen".

Another thing that many of us Westerners seem to be unaware of: there's a difference between Kaizen and Kairyo - with Kaizen being an inward-focused exercise of becoming the best we can be - which in turn enables us to improve the system - and Kairyo being the exercise of improving the system itself. This, of course, means that Kaizen can never be delegated!

Kaizen requires a long-term direction towards which people desire to improve themselves. Such a direction is often absent in an agile environment - short-term thinking prevails, and people are happy having done something which improved the process a little.

What this "something" is, and how important it is in comparison to the strategic direction may elude everyone. And there's a huge chance that when we consider what we actually want to achieve, our "improvements" might even be a step in the wrong direction.
Have you ever bothered talking about where you yourself are actually heading - and why?

3. Agile

"Agile" is extremely difficult to pinpoint.  It means something different to everyone.
Some think of it as a project management methodology, while others claim "There are no agile projects".
Some think of a specific set of principles and practices, while others state these are all optional.
Some confuse a framework with agile - some go even as far as thinking that "Agile" can be packaged.
Some are even selling a certain piece of software which allegedly is "Agile".

Yet everyone seems to forget that the bunch of 17 people meeting at Snowbird were out to define a new standard for how to better develop software - and couldn't agree on much more than 6 sentences.
Especially in the light of Kaizen above - What do 'better ways' even mean, when no direction and no standard has been defined?
A lot of confusion in the agile community is caused by people standing at different points, heading into different directions (or: not even having a direction) and aiming for different things - and then telling each other what "better" is supposed to mean.

The Agile Manifesto is nothing more than a handful of things that seemed to be consistent across the different perspectives: It answers neither What, How nor Why.
To actually make meaning from that, you need to find your own direction and start moving.