Fail Fast, Move On: The defect funnel - systematically working towards high quality

Take a look at this diagram: Which of these images best describes your quality strategy?

The four stages - from left to right - are:

Automated testing - issues detected by automated test execution.
Manual testing - issues detected by manual testing efforts.
System Monitoring - issues detected by monitoring capability.
User Reports - issues encountered by users on the running system.

The bars indicate:

Red: Too many issues to deal with
Yellow: A larger but still bearable amount of issues that needs to be prioritized rather than dealt with completely.
Big green: A larger amount of issues that gets completely handled.
Small green: A negligible amount of issues that are dealt with as soon as they pop up.
No bar: "we don't do this or it doesn't do much."

The defect funnel

Although the images don't really resemble much of a funnel, this "defect funnel" is similar to a sales funnel. In the ideal world, you'd find the largest number of defects, and the most critical ones, early. As a delivery progresses through the process, both the number and the criticality of defects decrease. Let's take a look at the ideal world (which never happens in reality):

Automated testing should cover all the issues that we know can happen - and when we have a good understanding and high control of our system, that should be the bulk of all issues. If we rigorously apply Test Driven Design, we should always have automated tests run red when we create new features, so having red tests is the ideal scenario.

Manual testing - in theory - should not find "known" problems. Instead, it should focus on gray and unexplored areas: manual testing should only find problems where we don't know ahead of time what happens. That's normal in complex systems. Still, this should be significantly less than what we already know.

Monitoring is typically built to maintain technical stability - and in more progressive organizations also to generate business insights. If we find unexpected things in monitoring, it basically means that we don't know how our product works. And the amount of known problems we have should be low, because everything else is just a sign of shoddy craftsmanship.

User reports are quirks we learn from our users. Since we're the designers, creators and maintainers of our product, no user should know more about it than we do. Still, it can occasionally happen that either we choose to expose our user to a trial, or that a scenario is too far out of the norm to predict before it happened. The better our control of our system is, the lower the amount of stuff we don't see before our users.

In the real world, the funnel usually doesn't even remotely resemble a funnel at all. This should be a clear-cut sign that your process may neither be working as intended nor as designed.
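One way to make "resembles a funnel" checkable is to compare the number of issues found per stage. The following is only a minimal sketch under stated assumptions: the function name `is_funnel` and all counts are made up for illustration, and it looks at amounts only, not at criticality.

```python
# The four stages from left to right, as described above.
STAGES = ["automated testing", "manual testing", "system monitoring", "user reports"]


def is_funnel(counts):
    """Return True if the first stage finds issues and no later stage finds more
    issues than the stage before it."""
    values = [counts.get(stage, 0) for stage in STAGES]
    never_increases = all(earlier >= later for earlier, later in zip(values, values[1:]))
    return values[0] > 0 and never_increases


# Hypothetical issue counts for one release; none of these numbers come from real data.
ideal_world = {
    "automated testing": 120,
    "manual testing": 25,
    "system monitoring": 6,
    "user reports": 2,
}
no_systematic_approach = {
    "automated testing": 1,
    "manual testing": 30,
    "system monitoring": 2,
    "user reports": 80,
}

print(is_funnel(ideal_world))
print(is_funnel(no_systematic_approach))
```

A check like this says nothing about why the shape is off. It only flags that later stages are catching what earlier stages should have caught.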

No systematic quality approach

If you don't have a coherent approach to quality at all, this is most likely what things look like: If you encounter a problem, it's either by chance during testing, or because users complain.
You can't really discriminate whether the issue was caused by the latest deployment, or has been around for a while and simply never shown up before.
If there's any test automation, it's most likely just regression tests, focusing on the critical scenarios that existed for a long time. Since these tend to be stable, test automation hardly finds any issues.
System monitoring will only detect the most glaring issues - like "server down" or "tablespace full".

In such a situation, developers are fighting a losing battle: Not only do they not really know what caused the problem, or how many problems there actually are - every deployment invites problems. You never know how much effort anything takes, because of the constant interrupts to solve production issues. Reliability is low, quality is low, predictability is low - the only things that tend to be high are effort and frustration.

Hence, most larger organizations adopt systematic quality gated processes:

Waterfall: Quality Gates

Adding systematic quality control processes, with formal test cases, systematic test execution and rigorous bug tracking allows IT to discover most of the critical issues before a deployment hits the user. If this is your only quality measure, though, you're not reducing defect rates at all.
Delays in deliveries to the test environment cut down test time, so test cases get prioritized to meet time constraints.

New components are tested manually ("no time for automation") and everyone sighs with relief when the package leaves development - there's neither time, money nor mental capacity to mind Operations.
The time available to fix found issues is never enough, so defects merely get prioritized: the most critical ones get fixed, and the rest are simply released along with the new features. As a result, the long-term quality of the system degrades.

In such an environment, testers continually stumble upon legacy problems and simply learn to no longer report known issues. Quality is a mess, and every new user stumbles upon the same things.
The fortunate thing for developers is that they're no longer the only ones who get blamed and interrupted - they have the QA team to shift blame to as well.

Introduction of Agile Testing

The most notable thing about agile testing is that developers and testers are now in the same boat. By having a Definition of Done that declares no feature "Done" before tests were executed, developers no longer benefit from pushing efforts onto the next desk, and test automation - especially of new components - becomes mandatory to keep cycle times low.

What's scary is that the increased focus on quality and the introduction of agile testing techniques seem to reduce quality - the amount of issues suddenly discovered becomes immense! The truth is that the discovered issues were always there and are inherent both to the product and the process. They were just invisible.

Many teams stop at this point, because they don't get enough time to fix all known problems and stakeholders lose patience with the seeming drop in performance. Everyone knows testing is the bottleneck, and instead of pushing forward and resolving the issue once and for all, they become content with "just enough" testing.
Hence, they never reach the wonderful point where the amount of issues discovered by users starts to decline to a bearable level. But that's where the true victory of higher degrees of test automation, user-centric testing and closer collaboration with development manifests.

Shift-Left Testing

It's not enough to "do Agile Testing": we have to change the quality approach. We do this by having every team member - and users - agree on quality and acceptance criteria prior to deployment, by moving to test driven design, by formulating quality in terms of true/false verifiable scenarios prior to implementation - and finally, by automating these scenarios prior to development. That way, we break the problem of finding issues after the fact, that is, when the code is already wrong.
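To illustrate what "true/false verifiable scenarios prior to implementation" could look like, here is a minimal sketch. The free-shipping rule, the amounts and every name in it (`SCENARIOS`, `run_scenarios`, `shipping_fee`) are hypothetical and only serve the example. The scenarios are written first, run red against a feature that doesn't exist yet, and turn green once the implementation follows.

```python
# Scenarios agreed on before any code exists:
# (description, order total in cents, expected shipping fee in cents).
# The free-shipping rule and all numbers are hypothetical.
SCENARIOS = [
    ("order below the threshold pays the fee", 4999, 490),
    ("order exactly at the threshold ships free", 5000, 0),
    ("order above the threshold ships free", 12000, 0),
]


def run_scenarios(fee_function):
    """Evaluate every scenario to True (green) or False (red)."""
    results = {}
    for description, total_cents, expected_fee in SCENARIOS:
        try:
            results[description] = fee_function(total_cents) == expected_fee
        except NotImplementedError:
            # A feature that doesn't exist yet counts as a red scenario.
            results[description] = False
    return results


def shipping_fee_stub(total_cents):
    """State before implementation: the scenarios exist, the feature does not."""
    raise NotImplementedError


def shipping_fee(total_cents):
    """Implementation written afterwards to turn the scenarios green."""
    return 0 if total_cents >= 5000 else 490


print(run_scenarios(shipping_fee_stub))  # every scenario is red
print(run_scenarios(shipping_fee))       # every scenario is green
```

Amounts are kept in integer cents so that each scenario compares exact values and stays strictly true or false.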

When we first move to Shift-Left Testing, we will typically encounter a lot of situations where we discover that the system never did what it was supposed to do, and the newly designed test scenarios fail due to legacy issues. At this point, effort may have another explosion, because a lot of discussions will be required to make the system consistent. The reduction in speed and the increase in problems are a sign that you're moving in the right direction.

In the context of shift-left testing, teams often add extra capabilities to the system which mainly serve for testing purposes, but which are also great hookpoints to enlarge system monitoring to catch certain business scenarios, such as processing or procedural failures.
All of the problems thus caught earlier will not hit the user any more, and this becomes the first point where users start to notice what's going on - and begin to increase confidence in the team's efforts.

Moving to DevOps

Once you've got the quality of the creation of new features under control, it's time to enhance your sphere of control and ensure users also have a good experience of your system. You can't do that without Ops on board, and you need to start solving the issues Ops encounter with a higher priority.

Investing into monitoring for new components becomes an integral part of your quality strategy, for two reasons: First, you will need ways to test your value hypotheses against real world data, and second, since you're designing for quality, you need to ensure this design doesn't break.

You'll still be hitting legacy issues left and right - because you still never had the time to clean them up. But you start to become more aware of them as they arise, and by systematically adding monitoring hookpoints to known issues, you learn to quantify them, so that you can systematically work them off.

The "Accelerate" Stage

In their book "Accelerate", Nicole Forsgren, Jez Humble and Gene Kim describe four key metrics of high-performing organizations:

Lead time
Deployment frequency
Mean time to recover
Change Fail Percentage

Being world-class on these metrics is only possible with stringent quality control in every aspect of your process, and it's only possible if your system has high quality to begin with.
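To get a feeling for these four metrics, they can be approximated from a plain list of deployment records. The sketch below is a simplified reading under stated assumptions, not the measurement method used in the book: the `Deployment` record, its fields, the use of simple means and the sample data are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of a single change going to production."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def key_metrics(deployments, period_days):
    """Approximate the four key metrics for the given observation period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    count = len(deployments)
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "mean_lead_time": sum(lead_times, timedelta()) / count,
        "deployments_per_day": count / period_days,
        # None means no failed deployment was restored within the period.
        "mean_time_to_recover": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
        "change_fail_percentage": 100.0 * len(failures) / count,
    }


# Made-up sample: four deployments within two days, one of them failing.
sample = [
    Deployment(datetime(2020, 4, 20, 9), datetime(2020, 4, 20, 11)),
    Deployment(datetime(2020, 4, 20, 12), datetime(2020, 4, 20, 16),
               failed=True, restored_at=datetime(2020, 4, 20, 16, 30)),
    Deployment(datetime(2020, 4, 21, 9), datetime(2020, 4, 21, 12)),
    Deployment(datetime(2020, 4, 21, 13), datetime(2020, 4, 21, 16)),
]
metrics = key_metrics(sample, period_days=2)
print(metrics)
```

With this sample, the lead times are 2, 4, 3 and 3 hours, one of four changes fails, and that failure is restored after 30 minutes.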

What may come as a surprise: we're not even aiming to eliminate all known issues in design: That would be too expensive, and too slow. Instead, we're making informed optimization decisions: Does it cost more to automate a test, or to establish a monitoring ruleset that will ensure we're not running into problems? Do we try to get it right the first time, or are we willing to let our users determine whether our choice was good?

An Accelerated organization, oddly enough, will typically feature a lower degree of test automation and less manual testing than a Shift-lefted organization, because they do not gain value from these activities as much any more. For example, shooting a record of data through the system landscape and validating the critical monitoring hookpoints tends to be significantly lower effort than to design, automate, execute and maintain a complex test scenario. Plus, it speeds up the process.
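As a rough idea of what "shooting a record through the system landscape" might look like, consider the following sketch. Everything in it is an assumption made for illustration: the `Monitor` class, the three order-processing stages and the hookpoint names don't come from any real system. A probe record is sent through the stages, and afterwards we only check which critical hookpoints did not see it.

```python
class Monitor:
    """Hypothetical in-memory stand-in for a monitoring system."""

    def __init__(self):
        self.events = []

    def hook(self, hookpoint, record_id):
        self.events.append((hookpoint, record_id))

    def seen(self, record_id):
        return {hookpoint for hookpoint, rid in self.events if rid == record_id}


def receive_order(record, monitor):
    monitor.hook("order.received", record["id"])
    return record


def bill_order(record, monitor):
    # Deliberate flaw for the example: orders without an amount skip billing silently.
    if record["amount_cents"] > 0:
        monitor.hook("order.billed", record["id"])
    return record


def ship_order(record, monitor):
    monitor.hook("order.shipped", record["id"])
    return record


CRITICAL_HOOKPOINTS = {"order.received", "order.billed", "order.shipped"}


def missing_hookpoints(record, monitor):
    """Send one record through all stages and report the hookpoints it never reached."""
    for stage in (receive_order, bill_order, ship_order):
        record = stage(record, monitor)
    return CRITICAL_HOOKPOINTS - monitor.seen(record["id"])


healthy = missing_hookpoints({"id": "probe-1", "amount_cents": 1000}, Monitor())
broken = missing_hookpoints({"id": "probe-2", "amount_cents": 0}, Monitor())
print(healthy)
print(broken)
```

The check stays the same no matter how the stages work internally, which is where the lower maintenance effort compared to a full test scenario would come from.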

Saturday, April 25, 2020