Sunday, April 4, 2021

A few things you have to understand about systems

The difference between a system and a compound is that while a compound is defined by the sum of its components, a system is defined by the product of the interactions of its components.

This very simple statement has profound consequences, regardless of whether we are talking about chemical, physical, social or entire economic systems.


Decomposition and Reassembly

Classic science has it that if you de-compose a complex problem into smaller units, the complexity can be handled in individual bites. While this works great when interactions are not as prevalent, it entirely fails when the behaviour of a system is predominantly defined by component interactions.

A de-composed system missing even one of its interactions will not display the same properties as the complete system.

Modifying a de-composed system may create an entirely different system when re-assembled.


Synchronization

Interaction generates friction. The mechanism of minimizing friction is synchronization.

As friction reduces the motion energy of the affected components, the amount of friction gradually decreases until the interacting components experience minimal friction. As such, every interacting component of a system will enter a synchronized state over time.

The momentum of a system in a synchronized state will be the cumulative momentum of all components. The same holds true for inertia.

Synchronization does not equate to stability. Indeed, the process of synchronization could destabilize, and potentially destroy, the entire system.


Subsystems

On a higher level of abstraction, a subsystem behaves like a component, assuming its internal and external interactions are separate and distinct.

Interacting subsystems will generate friction until they are synchronized.

Subsystem synchronization could oscillate between different states and have different driving forces until an equilibrium is achieved.

Independent subsystems behave like components: they may be in sync within themselves, yet out of sync with each other.


Component Effectiveness

Since the components of a system are only as effective as their interactions allow, the effectiveness of any individual component is both enabled and constrained by its interactions.

Effectiveness is enabled by synchronized interactions.
Effectiveness is constrained by frictional interactions.

When a component's interactions are predominantly frictional, the component is rendered ineffective unless it's intended to be an abrasive component.


Why is any of that important?

Think about what the above means for piloting changes in parts of your system.
You may not achieve what you intend.

Thursday, April 1, 2021

Improving Code Reviews

A "code review" activity is part of many organizations' development process - and oftentimes, it sucks. It frustrates people, wastes their time and the value in improving quality is questionable at best. If you feel that's the case, here are some things you can try.




What's a Code Review?

"Code review" is a feedback process, fostering growth and learning. It should not be confused or conflated with a QA sign-off process. While finding problems in the code may be part of doing the review, that's not the key point.

So-called one-way "gate reviews" without feedback on defect-free code are a waste. A major portion of their value is missed. The best reviews won't merely help people learn where they messed up - they help people find new, better ways of doing things!

Now, let us explore five common antipatterns and what we could do about them.

Five Code Review Antipatterns and how to deal with them

Review Hierarchy

In many organizations, the Code Review process "puts people in their place": a more senior person reviews the code of more junior people and annotates everything they did wrong. Yes - this sounds exactly like a teacher grading a student's term paper, and the psychological implications are very similar.

While this does indeed foster some form of learning, it creates an anhedonic mindset: the key objective of the coding developer is to avoid the pain of criticism and rework. There is little joy in a job well done. Deming's 12th point comes to mind.

Suggestion 1: Reverse the review process. Let the junior review the senior's code, and see what happens.

Suggestion 2: Do review round robins. Everyone gets to review everyone else's code.

Suggestion 3: Have an open conversation, "How do Code Reviews affect our view of each other's professionalism?"


Huge Chunk Reviews

I'll admit that I've been both on the giving and receiving end here: Committing huge chunks of code at once and sending thousands of lines of code for review in one big bulk, without any comments. And the review outcome was, "This is garbage, don't work like that!" Rightly so. Nobody in their right mind has time to ponder such a huge amount of changes in detail. The review feedback will take a long time and probably not consider all the important points - simply because there are too many.

Code Reviews shouldn't create a huge burden, and they should have a clear focus.

Suggestion 1: State the review objective: What would you like feedback on?

Suggestion 2: Send increments into Code Review that can be thoroughly reviewed in no more than 15 minutes.

Suggestion 3: Reduce feedback intervals. For example: no more than 2 days should pass between writing a line of code and getting it reviewed.


"LGTM" or whimsical feedback

Poor reviews start with the premise that "the only purpose of a review is to find problems." On the positive end of the spectrum, this leads to a lot of standard "lgtm" (Looks Good To Me) annotations as code is simply waved through. On the opposite end, some individuals feel an almost sadistic need to let others know that there are always problems, stating "this is good" today and "this is bad" tomorrow.

Behind this antipattern is the "controller mindset": someone in the organization believes that a review is intended to tell others, "you did this wrong, you did that wrong."

You can improve this by moving away from checking the code and towards positive reinforcement, creating virtuous learning cycles.

Suggestion 1: Change the guiding question from, "What is wrong with this code?" towards, "What could I learn from this code?"

Suggestion 2: Create Working Agreements on how you want to deal with the extreme ends of the review spectrum.

Suggestion 3: Collect the learnings from Code Reviews and look at them in the Retrospective.


Ping-pong or Ghosting

Small story: One of my teams had just fixed a production problem that was causing a revenue loss of roughly €15k per day. Someone from a different team did the code review, demanded some semantic fixes, these were made - next round: lather, rinse, repeat. After 2 weeks, the reviewer went on vacation without notice. The fix got stuck in the pipeline for 5 days without response. This funny little event cost the company over €250k - roughly three years' worth of developer salary!

Things like that happen because the expectations and priorities in the team aren't aligned with business objectives and also because of a phenomenon I call "ticket talk."

Suggestion 1: Use TameFlow Kanban to make the Wait Time and Flowback caused by Code Reviews visible.

Suggestion 2: Create a Working Agreement to talk face-to-face as soon as there's flowback.

Suggestion 3: Replace Code Reviews with Pair Programming.


Preferences, emotions and opinions

Let's return to the "whimsical feedback" antipattern briefly. Many times, I see feedback over "use CamelCase instead of Snake Case", "use Tabs indentation instead of spaces" or whether a "brace should open behind the method name rather than in a new line". 

None of these make the product any better or worse. The debate over such matters can get quite heated, and potentially even escalate into a full-blown religious war. These are entirely up to personal preference, and as such, not worth a review annotation: They are red herrings. 

Suggestion 1: Formalize coding conventions and put them into your Lint / SCA config (see the sketch after these suggestions).

Suggestion 2: If you're really bothered, use a pre-commit hook to prevent checking in code that violates static code rules.

Suggestion 3: If you think a rule is missing or unproductive, bring it up in the Retrospective.
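
To make Suggestion 1 concrete, here is a minimal sketch - assuming ESLint is your linter and that these are the style decisions your team settled on - of how the contested preferences from above can be codified once and then enforced by the tool instead of by reviewers:

  // eslint.config.js - a minimal sketch, assuming ESLint 9+ flat config.
  // The three core rules below map to the style debates mentioned above; adjust to taste.
  export default [
    {
      files: ["**/*.js"],
      rules: {
        camelcase: "error",                // naming: camelCase instead of snake_case
        indent: ["error", "tab"],          // indentation: tabs (or spaces - decide once)
        "brace-style": ["error", "1tbs"],  // brace placement: opening brace on the same line
      },
    },
  ];

Once such rules live in the config, a review comment about braces or indentation is a signal to update the config, not the code under review.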


Alternative perspective

Code Reviews are just one way to improve coding within a team and/or organization. Mandatory code reviews - by default - create interrupts in the flow, reducing overall performance by a significant amount. Better options include:

  • Code Review upon request
    (e.g., "I want to talk with you about this code")
  • Code Dojos, where the entire team assembles to learn from one another.
    (SAFe's IP Iterations are great for dojos.)
  • Pair programming - making the discussion part of the process.
    (Reviews should be obsolete if you do Pairing right)

Still, if your organization mandates code reviews, try to make the best of them.


Summary (tl;dr)

Code Review is more about fast feedback learning than about "catching errors".
A positive "what can I learn" attitude makes reviews much more enjoyable and beneficial than a negative "what did they do wrong" attitude.

When reviews expose pressing problems, don't just annotate them. Engage in the discussion: "How can we work differently?"


Saturday, March 27, 2021

Tests without code knowledge are waste

I start with an outrageous claim: "Software tests made without knowledge of the code are waste."

Now, let me back that claim up with an example - a calculator applet which you can actually use:

Calculator

Have fun, play around with this little applet ... it works if your JS is enabled!

Now let us create a test strategy for this applet:

Black Box test strategy

Let's discuss how you would test this, by going into the list of features this applet offers:
  • Basic maths: Addition, Subtraction, Multiplication, Division
  • Parentheses and nested parentheses
  • Comparisons: Less, Equal, Greater ...
  • Advanced maths: Exponents, Square Root, Logarithm, Maximum, Minimum, Rounding (prefixed by "Math.")
  • Trigonometry: Sine, Cosine, Tangent, Arc functions (also prefixed)
  • Variables: defining, referencing, modifying
  • Nested functions: combining any set of aforementioned functionality
  • And a few more ...
Wow - I'm sure you can already see where our test case catalogue is going, even for minimal coverage - and we haven't even considered edge cases or negative tests yet!


How many things could go wrong in this thing? Twenty, thirty? Fifty? A thousand?
How many tests would provide an appropriate test coverage? Five, ten, a hundred?
How much effort is required for all this testing? 

Would it shock you if I said ...

All those tests are a waste of time!

Frankly speaking, I would test this entire thing with a single test case, because everything beyond that is just waste. 
I am confident that I can do that because I know the code:


  <div class="row">
	<div class="card"
             style="margin-top:1rem;
                        margin-left:2rem;
                        margin-right:2rem; width: 18rem;
                        box-shadow: 4px 4px 4px 1px rgba(0, 0, 0, 0.2);">
	  <b class="card-header text-center bg-primary text-light">
            Calculator</b>
	  <div class="card-body">
	  	<textarea class="form-control"
                          id="calc_input"
                          rows="1" cols="20"
                          style="text-align: right;"
		          oninput="output.innerHTML = eval(this.value)"
                          ></textarea>
	  </div>
	  <div class="card-footer text-right">
		  <b id="output">0</b>
	  </div>
	</div>
  </div>
  
Yes, you're seeing this right. The entire code is just look-and-feel. There is only a single executable statement - "eval(this.value)" - so we don't have a truckload of branch, path, line, statement coverage and whatnot that we need to cover:
All relevant failure opportunities are already covered by JavaScript's own tests for its own eval() function, so why would we want to test it again? 

The actual failure scenarios

Seriously, this code has only the following opportunities for failure:
  • Javascript not working (in this case, no test case will run anyways)
  • Accidentally renaming the "output" field (in this case, all tests will fail anyways)
  • User Input error (not processed)
  • Users not knowing how to use certain functions
    (which is a UX issue ... but how relevant?)
Without knowing the code, I would need to test an entire catalogue of test cases.
Knowing and understanding the code, I can reduce the entire test to a single Gherkin spec:

The only relevant test

Given: I see the calculator
When: I type "1+0".
Then: I see "1" as computation result.

So why wouldn't we want to test anything else?
Because: the way the code is written, if this test passes, then all other tests we could think of would also pass.
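
For completeness, here is what that single test could look like as an automated check - a minimal sketch, runnable with Node.js, in which the output object and the typeIntoCalculator() helper are stand-ins for the applet's output element and its oninput handler:

  // A minimal sketch of the only relevant test (stand-ins, not the real DOM).
  import assert from "node:assert";

  // Stand-in for the applet's <b id="output"> element.
  const output = { innerHTML: "0" };

  // Mirrors the applet's single executable statement: output.innerHTML = eval(this.value)
  function typeIntoCalculator(value) {
    output.innerHTML = eval(value);
  }

  // Given: I see the calculator
  // When: I type "1+0"
  typeIntoCalculator("1+0");

  // Then: I see "1" as computation result
  assert.strictEqual(String(output.innerHTML), "1");
  console.log("The only relevant test passed.");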

But why not write further tests?

A classical BDD/TDD approach would mandate us to incrementally define all application behaviours, create a failing test for each, then add passing code, then refactor the application. 
If we did this poorly, we would really end up creating hundreds of tests and writing explicit code to pass each of them - and that's actually a problem with incremental design that is unaware of the big picture!

The point is that we wrote the minimum required code to meet the specified functionality right from the start: code that has only a single failure opportunity (not being executed) - and after having this code in place, there's no way we can write another failing test that meets the feature specifications!


And that's why a close discussion between testers and developers is essential in figuring out which tests are worthwhile and which aren't.

Wednesday, March 24, 2021

The only constant is change

A classic response towards change failures in traditional organizations - most notably Software Release failures - is: "Then make changes less often." This gives a good feeling that for a prolonged period of time, things will be predictable and stable. Unfortunately, it's a local optimization that actually makes things worse in the long term. Why do we think like that, and why is that thought a problem?



The gut reaction

Let's say that a change just failed. This is an uncomfortable event. Naturally, we don't like discomfort. The easiest way to deal with discomfort is to postpone the re-occurrence of the uncomfortable event into the future. In software development, we do this by postponing the next release as far as possible.

This provides an immediate short-term effect on our psyche: We have just achieved certainty that until the next release happens, there will be no further discomfort, and thus we have reduced immediate stress by postponing it.

Unfortunately, this has done nothing to reduce the probability of that next change failing.

Now, let's see what happens as a consequence of this decision:


The immediate after-effect

Because there is now a prolonged period of time until our next release happens, and the amount of development work is not decreasing, the batch size for the next release increases by exactly the amount of delay in the release. So, if, for example, we used to release once per month and reduce this to once per three months, the batch size just went up 200% with a single decision.

Of course, this means that the scope of the next change is also going to increase: it will become more complex to deliver the next release.

The knock-on effect

A bigger, more complex change is more difficult to conduct and has a higher potential for failure. In consequence, we're going to be failing stronger and harder. If we have 3 small changes, and 1 of them fails, that's a success rate of 33%. If we now combine all 3 changes into 1 big change, that means we end up with a 100% failure rate!

You see - reducing change frequency without reducing change scope automatically increases likelihood and impact of failure.

If now we decide to follow our gut instinct again, postponing the next failure event, we end up in a vicious circle where change becomes a rare, unwanted, highly painful event: We have set the foundation for a static organization that is no longer able to adapt and meet customer demands.

The outcome

The long-term consequence of reducing change frequency is that we can only poorly correlate effort and outcome: it becomes indistinguishable what works and what doesn't - and thereby, the quality of our work, our product, our processes and our metrics deteriorates. We lose our reason to exist on the market: providing high quality and value to our customers on demand.




"If it hurts, do it more often."

Let's just follow the previous computation:
If 1 big change fails 100% of the time, maybe you can slice it into 3 smaller changes, of which 2 will succeed, reducing your failure rate by 66%?

So, instead of deciding to reduce change frequency, you decide to increase it?

The immediate after-effect

Because there is now a shorter period of time until the next release, there is less time between when something is developed and when you see the consequences. We close the feedback loop faster and learn more quickly what works and what doesn't. And since we tend to be wired not to repeat things that become painful, we do more of the things that work, and less of the things that don't.

The knock-on effect

Managing small changes is easier than managing complex change. Thereby, it becomes less risky, less work and less painful to make such small changes. Likewise, since we get faster (and more frequent) feedback on what worked, we can optimize faster for doing more of the things that provide actual value.

The outcome

By making rapid, small changes, we can quickly correlate whether we improved or worsened something, and we can respond much more flexibly towards changing circumstances. This allows us to deliver better quality and feel more confident about what we do.


Summary

The same vicious circle created by the attitude, "If we change less often, we will have fewer (but significantly more painful) events," can become a virtuous cycle if we change our attitude towards, "If it hurts, do it more often - it'll become easier and less painful each time."

Your call.

Monday, March 15, 2021

Why WSJF is Nonsense

There's a common backlog prioritization technique, suggested as standard practice in SAFe but also used elsewhere: "WSJF", "Weighted Shortest Job First" - also called "HCDF", "Highest Cost of Delay First", by Don Reinertsen.

Now, let me explain this one in (slightly oversimplified) terms:

The idea behind WSJF

It's better to gain $5000 in 2 days than to gain $10000 for a year's work.
You can still go for those 10 Grand once you have 5 Grand in your pocket, but if you do the 10-Grand job first, you'll have to see how you can survive a year penniless.

Always do the thing first that delivers the highest value and blocks your development pipeline for the shortest time. This allows you to deliver as much value as early as possible.


How to do WSJF?

WSJF is a simple four-step process:

To find out the optimal backlog position for a given item, you estimate the impact of doing the item ("value"), divide that by the investment into said item ("size"), and then put the items in relation to each other.


It's often suggested to use the "Agile Fibonacci" scale for these estimates: "1, 2, 3, 5, 8, 13, 20, 40, 80..."
The idea is that every subsequent number is "a little more, but not quite twice as much" as the previous one, so a "13" is "a little more than 8, but not quite 20". 
Since there are no in-between numbers, when you're not sure whether an item is an 8 or a 13, you can choose either, because these two numbers are adjacent and their difference is considered minuscule.

Step 1: Calculate "Value" Score for your backlog items.

Value (in SAFe) is actually three variables: User and/or Business Value, Time Criticality, Enablement and/or risk reduction. But let's not turn it into a science. It's presumed value.

Regardless of how you calculate "Value", either as one score or a sum or difference of multiple scores, you end up with a number. It becomes the numerator in your equation.

Step 2: Calculate "Size" Score for your backlog items.

"Size" is typically measured in the rubber-unit called Story Points, and regardless of what a Story Point means in your organization or how it's produced, you'll get another number - the denominator in your equation.

Step 3: Calculate "WSJF" Score for your backlog items.

"WSJF" score, in SAFe, is computed by dividing Value by Size.

For example, a Value of 20 divided by a size of 5 would give you a WSJF score of 4.

Step 4: Sort the backlog by "WSJF" Score.

As you add items, you just put them into the position the WSJF sort order suggests, with the highest score on top and the lowest score at the bottom of the backlog.
For example, if you get a WSJF of 3 and your topmost backlog item has a WSJF score of 2.5, the new item would go on top - it's assumed to be the most valuable item to deliver!
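
As a sketch - item names and scores below are invented - the whole four-step mechanism fits into a few lines, which is part of why it feels so comfortably "scientific":

  // A minimal sketch of WSJF scoring and sorting; items and scores are made up.
  const backlog = [
    { name: "Checkout redesign", value: 20, size: 5 },
    { name: "Invoice export",    value: 8,  size: 8 },
    { name: "Dark mode",         value: 3,  size: 13 },
  ];

  // Steps 1-3: estimate value and size, then divide.
  const scored = backlog.map(item => ({ ...item, wsjf: item.value / item.size }));

  // Step 4: highest WSJF score goes to the top of the backlog.
  scored.sort((a, b) => b.wsjf - a.wsjf);

  console.table(scored); // Checkout redesign (4) > Invoice export (1) > Dark mode (~0.23)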

And now ... let me dismantle the entire concept of WSJF.

Disclaimer: After reading the subsequent portion, you may feel like a dunce if you've been using WSJF in the real world.


WSJF vs. Maths

WSJF assumes estimates to be accurate. They aren't. They're guesswork, based on incomplete and biased information: we neither know how much money we will make in the future (if you do, why are you working in Development and not on the stock market?), nor do we actually know how much work something takes until we've done it. Our estimates are inaccurate.

Two terms with error

Let's keep the math simple and just state that every estimate has an error term associated with it. We can ignore an estimator's bias, assuming that it affects all items equally - although that, too, is often untrue. Anyway.

The actual numbers for an item can be written as:
Value = A(V) + E(V)  [Actual Value + Error on the Value]
Size  = A(S) + E(S)  [Actual Size + Error on the Size]

Why is this important?
Because we divide two numbers, which both contain an error term. The error term propagates.

For the following section, it's important to know that we're on a Fibonacci scale, where two adjacent numbers are roughly 60% apart.

Slight estimation Error

If we over-estimate value by a single notch, our estimate will be roughly 60% higher than the item's actual value, even if the difference between fact and assumption felt minuscule. Likewise, if we under-estimate value by a single notch, our estimate will be almost 40% lower than the actual value.

To take a specific example:
When an item is estimated at 8 (based on whatever benchmark), but turns out to actually be a 5, we overestimated it by 60%. Likewise, if it turns out to actually be a 13, we underestimated it by 38.5%.
If we're not 100% precise on our estimates, we could be off by a factor of 2.5!

The same holds true for Size. I don't want to repeat the calculation.

Larger estimation error

Remember - we're on a Fibonacci scale, and we only permitted a deviation of a single notch. If we now permit our estimates to be off by two notches, we get significantly worse numbers: all of a sudden, we could be off by a factor of more than 6!

Now, the real problem happens when we divide those two.

Square error terms

If we divide a number that is 6 times larger than it should be by a number that is 6 times smaller than it should be, the errors multiply: we end up with a result that is off by a factor of 36 - a squared error term.

Let's talk through a specific example again:
Item A was estimated at a value of 5, but its actual value was 2; it was estimated at a size of 5, but its actual size was 13. As such, it had an error of +3 in value and an error of -8 in size.
Estimated WSJF = (2 + 3) / (13 - 8) = 1
However, the actual WSJF = 2 / 13 = 0.15


Now, I hear you arguing, "The actual numbers don't matter... it's their relationship towards one another!"


Errors aren't equal

There's a problem with estimation errors: we don't know where we make them (otherwise we wouldn't make them), and we make different errors on different items (otherwise they wouldn't affect the ranking at all). Errors are errors, and they are random.

So, let me draw a small table of estimates produced for your backlog:

Item | Est. WSJF | Est. Value | Est. Size | Act. Value | Act. Size | Act. WSJF
A    | 1.6       | 8          | 5         | 5          | 5         | 1
B    | 1         | 8          | 8         | 3          | 20        | 0.15
C    | 0.6       | 3          | 5         | 8          | 2         | 4
D    | 0.4       | 5          | 13        | 13         | 2         | 6.5

Feel free to sort by "Act. WSJF" to see how you should have ordered your backlog, had you had a better crystal ball.

And that's the problem with WSJF

We turn haphazard guesswork into a science, and think we're making sound business decisions because we "have done the numbers", when in reality, we are the victim of an error that is explicitly built into our process. We make entirely pointless prioritization decisions, thinking them to be economically sound.


WSJF is merely a process to start a conversation about what we think should be priority, when our main problem is indecision.
It is a terrible process for making reliable business decisions, because it doesn't rely on facts. It relies on error-prone assumptions, and it exacerbates any error we make in the process.

Don't rely on WSJF to make sound decisions for you. 
It's a red herring.

The discussion about where and what the value is provides much more benefit than anything you can read from a WSJF table. Have the discussion. Forget the numbers.

 

Monday, February 15, 2021

The problem with maturity assessments

There are a lot of "Agile Maturity Assessments" out there, and these tend to be good ideas when dealing with teams that need a kind of vision where to improve. And yet, it's good to be cautious with such assessments, as they could really send you off on the wrong track, especially when you're a coach evaluating a team.

It's not a linear scale

To some degree, product development is like a form of art: there are standards of professionalism known to the vast majority of people in the field, and those who do not apply these standards stick out. Usually, that's a bad sign. We would probably call these people amateurs. But sometimes, the results are excellent. Is that just "Beginner's luck?"

Let me give you a practical example of how a maturity assessment could go wrong:

Suppose you meet a team of developers who don't talk much, use mostly scripts, don't even branch their code, have very few tests in their repository, and simply put every change straight into production.

A disaster, right? They need a lot of coaching to become a professional development team, won't they?
Maybe. But maybe not. Maybe they're a high-performing team, and there's nothing you can teach them, because they're far ahead of you!

The assessment

Let's take a look at this simple "Maturity assessment":
It was deliberately designed with the same content for "Amateur" and "Mastery".
Most maturity assessments we know would stop at the fourth column - at the highest standards of professionalism.

And that is exactly the reason why assessors might give terrible advice, and why ignorant coaches could do a great deal of damage. As an aside, it's also the reason why I refrain from having an opinion on a team until I have seen how they actually perform over a prolonged period of time.

True Mastery

To save some time, let's just look at the "communication" aspect in more detail here.
Communication is always good, isn't it? No - that this is a false belief!

Have you ever noticed that elderly couples don't talk much, yet don't feel there's a problem? It is because they understand each other wordlessly. She doesn't need to ask him what food he likes - she knows. He doesn't need to ask where the plates are - he knows. There is no reason why they should hold a meeting to align on meal preparations.

Communication has no value in and of itself. It has to serve a purpose. We need to communicate enough to meet the objectives of our communication. Every bit of communication beyond that is waste.
  1. Amateurish teams don't know what they need to talk about, hence they under-communicate, generating communication debt.
  2. In the first stage of professionalism, teams will establish cadenced planning and alignment events where they will communicate about key topics.
  3. In the next stage of professionalism, teams will pair on development to speed up feedback and prevent common errors that would cause rework.
  4. In the highest stage of professionalism, mob events replace discussion meetings, eliminating the delay between plan creation and execution. Cadenced planning becomes redundant and alignment implicit.
  5. And yet, true masters of communication don't need any of that after having spent sufficient time together. They already understand and can predict each other's actions, so why should they align on it?
And how about simple scripts instead of high-end, enterprise-level object-oriented programming? Again, it depends on the context. I'll leave it to you to figure that one out, although it could be quite tricky unless you've seen it.

It's quite similar with all the other aspects in our "maturity assessment." There are pretty good reasons for people who absolutely know what they are doing and who know all the tricks in the book to avoid using any of them.
It's a great exercise to reflect on the reasons why or how true mastery could make the professional patterns superfluous.

Lack of awareness

So once we realize that Amateurish behaviour and mastery might be indistinguishable from the outside, how about we ask the team? Probably not a good idea, because they can most likely explain to you why their way of working is the cherry on the cream, even when it's not. With regards from Dunning-Kruger. And even if they are well aware of the professional practices, most reasons for neglecting them are bad reasons.

Thus, you need a way to realize whether an organization has had a healthy evolution beyond the need for professional practices.

Proof in the Pudding

Again, taking a look at our communication example - the proof is in the pudding. How would you know a team is meeting its communication objectives?

When there are misunderstandings or unresolved controversies, when individuals go off on tangents, or when people deliver fragmented, piecemeal work, then clearly we have a team that has not mastered communication - irrespective of whether they are still very amateurish or have learned the professional patterns.

When a team harmoniously delivers great results, where everyone understands what to do, why, how and when - what "communication problem" would you want to solve by adding structure?

The same goes for all the other aspects of our "maturity assessment."
When a team manages to sustainably deliver high-value, high-quality outcomes, they may have no need to add anything. Most likely, the things you'd want to add would make them worse, not better.

Once a team can deliver excellent results, our key question is what we can remove while retaining that excellence. And that's the road to mastery.

Sunday, January 24, 2021

The importance of testability

In the article on Black Box Testing, we took a look at the testing nightmare caused by a product that was not designed for testing. This led to a follow-up point: "testability", which is also a quality attribute of ISO 25010. Let us examine a little closer what the value of testability for product development actually is.






What is even a "test"?

In science, every hypothesis needs to be falsifiable, i.e. it must be logically and practically possible to find counter-examples. 

Why are counter-examples so important? Let's use a real-world scenario, and start with a joke.

A sociologist, a physicist and a mathematician ride on a train across Europe. As they cross the border into Switzerland, they see a black sheep. The sociologist exclaims, "Interesting. Swiss sheep are black!" The physicist corrects, "Almost. In Switzerland, there are black sheep." The mathematician shakes her head, "We only know that in Switzerland, there exists one sheep that is black on at least one side."

  • The first statement is very easy to disprove: they just need to encounter a sheep that isn't black.
  • The second statement is very hard to disprove: even if the mathematician were right and another angle revealed the sheep to not be fully black, the statement wouldn't automatically be untrue, because there could be other sheep somewhere in Switzerland that are black.
  • Finally, the third statement holds true, because the reverse claim ("There is no sheep that is black on one side in Switzerland") has already been disproven by evidence.

This leads to a few follow up questions we need to ask about test design.


Test Precision

Imagine that our test setup to verify the above statements looked like this:

  1. Go to the travel agency.
  2. Book a flight to Zurich.
  3. Fly to Zurich.
  4. Take a taxi to the countryside.
  5. Get out at a meadow.
  6. Walk to the nearest sheep.
  7. Inspect the sheep's fur.
  8. If the sheep's fur color is black, then Swiss sheep are black.

Aside from the fact that after running this test, you might be stranded in the Winter Alps at midnight wearing nothing but your pyjamas and you're a couple hundred Euros poorer, this test could go wrong in so many ways.

For example, your travel agency might be closed. Or, you could have insufficient funds to book a flight. You could have forgotten your passport and aren't allowed to exit the airport, you could go to a meadow that has cows and not sheep, and the sheep's fur inspection might yield fleas. Which of these has anything to do with whether Swiss sheep are black?

We see this very often in "black box tests" as well:

  • it's very unclear what we're actually trying to test
  • just because the test failed, that doesn't mean that the thing we wanted to know is untrue
  • there's a huge overhead cost associated with validating our hypothesis
  • we don't return to a "clean slate" after the test
  • Success doesn't provide sufficient evidence to verify the test hypothesis.

Occam's Razor

In the 14th century, the Franciscan friar William of Ockham came up with what's known today as Occam's Razor: "entities should not be multiplied without necessity". Applied to our test above, that would mean that every test step that has nothing to do with sheep or fur color should be removed from the design.

The probability of running this test successfully increases when we isolate the test object (the sheep's fur) as far as possible and eliminate all variability from the test that isn't directly associated with the hypothesis itself.


Verifiability

We verify a hypothesis by trying to find a counter-example to what's called the "alternative hypothesis", and assuming that if the alternative is untrue, then its logical opposite - the "null hypothesis" - is true. Unfortunately, this means we have to think in reverse.

In our example: to prove that all sheep are black, we would have to sample all sheep. That's difficult. It's much easier to look for a single non-black sheep, and thereby falsify the claim that all sheep are black. If we fail to produce even one counter-example, then all sheep must be black.


Repeatability and Reproducibility

A proper test is a systematic, preferably repeatable and reproducible, way of verifying our hypothesis. That means it should be highly predictable in its outcome, and we should be able to run it as often as we want.

Again, going back to our example, if we design our test like this:

  1. Take a look at a Swiss sheep.
  2. If it is white, then sheep are not black.
  3. If sheep are not - not black, then sheep are black.

This is terrible test design, because of some obvious flaws in each step: 

  1. The setup is an uncontrolled random sample. Since Swiss sheep can be white or black, running this test on an unknown sample means we haven't ruled out anything if we happen to pick a black sheep.
  2. The alternative hypothesis is incomplete: should the sheep be brown, it is not white, so this step doesn't trigger - even though the sheep isn't black either.
  3. Assuming that step 2 didn't trigger, we would conclude that brown = black.

Since the "take a look at a Swiss sheep" is already part of the test design, each time we repeat this test, we get a different outcome, and we can't reproduce anything either, because if I run this test, my outcome will be different from yours.


Reproducibility

A reproducibility problem occurs when the same setup can generate different results for different observers. In our example, "take a look at" could be fixed - taking the mathematician's advice - by re-phrasing step 1 to: "Look at a Swiss sheep from all angles." This would lead everyone to reach the same conclusion.

We might also have to define what we call "white" and "black", or whether we would classify "brown" as a kind of white.

We increase reproducibility by being precise on how we examine the object under test, and which statements we want to make about our object under test.


Repeatability

Depending on the purpose of our test, we do well to remove the variation in our object under test. So, if our test objective is to prove or falsify that all sheep are black, we can set up a highly repeatable, highly reproducible test like this:

  1. Get a white Swiss sheep.
  2. Identify the color of the sheep.
  3. If it is not black, then the statement that Swiss sheep are black is false.

This experiment setup is going to produce the same outcome for everyone conducting the test anywhere across the globe, at any point in time.

While there is a risk that we fail in step 1 (if we can't get hold of a Swiss sheep), we could substitute the test object with a picture of a Swiss sheep without affecting the validity of the test itself.
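
Expressed as an automated check - with the sheep as a controlled, made-up fixture rather than a random encounter - the same design looks roughly like this:

  // A sketch of the repeatable, reproducible setup: the test object is a
  // controlled fixture (an invented white Swiss sheep), not a random sample.
  import assert from "node:assert";

  const testSheep = { country: "CH", color: "white" };  // step 1: get a white Swiss sheep

  const observedColor = testSheep.color;                // step 2: identify the color

  // Step 3: one controlled counter-example falsifies "Swiss sheep are black" -
  // identically for every tester, every time, anywhere.
  assert.notStrictEqual(observedColor, "black");
  console.log('The claim "Swiss sheep are black" is falsified by a controlled counter-example.');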


What is testability?

A good test setup has:
  • a verifiable test hypothesis
  • a well-defined test object
  • a precise set of test instructions
  • absolutely minimized test complexity
  • high repeatability and reproducibility
When all these are given, we can say that we have good testability. The more compromises we need to make in any direction, the worse our testability gets. 

A product that has high testability allows us to formulate and verify any relevant test hypothesis with minimal effort.

A product with poor testability has a high difficulty associated with formulating or verifying a test hypothesis. This difficulty might translate into an increase of any or all of the following:
  • complexity
  • effort
  • cost
  • duration
  • validity
  • uncertainty


In conclusion

The more often you want to test a hypothesis, the more valuable high testability becomes.
With increasing change frequency, the need to re-verify a formerly true hypothesis also increases. 

Design your product from day 1 to be highly testable.
By the time you discover that a product's testability is unsustainably low, it's often extremely expensive to notch it up to the level where you need it.

Tuesday, January 19, 2021

The problem of Black Box Tests

One of the most fundamental enablers of agile ways of working is the ability to swiftly and reliably detect problems in your product - that is, to test it efficiently and effectively. Unfortunately, the "traditional" approach of black-box testing a running application is hardly useful for this purpose. 

I have created an executable use case to illustrate the problem. 


Let's take a look at this little service, and assume that it would be your responsibility to test it in a way that you can reliably tell whether it works correctly, or which problems it has:

You can call this service in the mini applet included on this page if you have JavaScript enabled.

Alea iacta est

You need Javascript to run this demo.

Yes, just try it out - roll the dice!
That's about as simple as an application can get.
Can you imagine how much effort it takes to properly test this simple application?

It's not enough to roll the dice once and get a number between 1 and 6 - how do you know that there isn't a possibility that the application might generate results outside that range?

And how would you know that you have fair dice? Call the service a thousand times and assume that you would get an approximately even distribution of values? What would be your thresholds for assuming that the dice are "fair"? What if 5% or fewer, or 25% or more, of the results go to one number - which is statistically still possible with a decent probability?
You see the difficulty already.
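
Just to make that effort tangible, here's a sketch of what the "fair dice" check alone would involve. rollDice() is a hypothetical stand-in for one call to the service, since the real endpoint isn't part of this example:

  // A sketch of the black-box "are the dice fair?" check.
  // rollDice() is a made-up stand-in for calling the service once.
  function rollDice() {
    return Math.floor(Math.random() * 6) + 1;
  }

  const rolls = 1000;
  const counts = [0, 0, 0, 0, 0, 0];
  for (let i = 0; i < rolls; i++) counts[rollDice() - 1]++;

  // The awkward part: which deviation from the expected count still counts as "fair"?
  const expected = rolls / 6;
  counts.forEach((count, i) => {
    const deviation = (100 * Math.abs(count - expected)) / expected;
    console.log(`Face ${i + 1}: ${count} rolls (${deviation.toFixed(1)}% off expectation)`);
  });

Even perfectly fair dice will routinely deviate noticeably from a perfectly even distribution over a thousand rolls, so any threshold you pick is a judgment call - and this covers only one feature.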

But let's make this more difficult:

Hello

You need Javascript to run this demo.

What if I told you that this is a call to the same service?
Yes, exactly - you didn't know everything the service does when you created your test concept.
There's a different feature hidden in the service: if you pass a user name to the request, it will greet you!

This adds a whole new dimension to the test complexity: you have to test with - and without - a user name. And would you want to try different user names?

  But that's not everything:

You lose!

You need Javascript to run this demo.

Did you even catch that this one behaves differently?
What if I told you that this is another call to the same service?
Yes, exactly - you still didn't know everything the service does when you created your test concept.
There's another different feature hidden in the service: you can load the dice and cheat!

If you tell the service to cheat you, you will get unfair dice.

So, now you need to run your entire test set from above twice again - with and without cheating.

And we haven't even looked into whether there are multiple ways of cheating, or whether the cheating function always triggers correctly when the variable is set (hint: it doesn't). Good luck finding where the malfunction is without knowing the application.

But we're not done yet:

I win  

You need Javascript to run this demo.

Did you catch the difference here?
What if I told you that this is yet another call to yet again the same service?

There's yet another feature hidden in the service: if I use my name to cheat, I will get dice loaded in my favour!

By now, you're probably doubting whether you understood the application at all when you started testing it.

The code

Now - let me blow your mind and tell you how little source code was required to totally blow your test complexity and effort out of proportion:


That's it. This little snippet of code is entirely sufficient to keep a Black Box tester busy for hours, potentially days, and still remain unable to make a reliable statement on whether they missed anything, and which problems the product may or may not have.

Depending on how your application is designed, a few minutes of development effort can generate a humongous mountain of effort in testing.
 
And that's why you can't possibly hope to ever achieve a decent test coverage on an application without knowing the code.

Testability

There's another problem: this code wasn't written with testing in mind (or, much rather: it was purposely written with poor testability in mind - hee hee), so you have no way of ever coming up with an easier way to test this service until it's rewritten.

And that's why you can't maintain sustainable high quality unless developers and testers actively collaborate to build highly testable software that is easy to work with, both for software changes and testing. 

Think of minimizing sustainable lead time - consider the total effort from request to release, and consider it both for initial creation and future modification. There's no point in optimizing for development speed if you slow down testing more than that, and likewise, there's no point in delivering minimal code if the consequence is totally bloated test scope.

Otherwise, you'll not be very agile.

Friday, January 1, 2021

Low-Code, No-Code, Full-Code - The Testing Challenge

In the Enterprise world, there's a huge market for so-called "Low Code" and "No Code" solutions, and they do have a certain appeal: you need to do less coding and, as such, need fewer developers to achieve your business objectives, because these solutions bring a lot of "out-of-the-box" functionality.

So why is it even something to talk about - and how does that relate to "Agile" ways of working?


Let's explore this one from a quality perspective.


The Paradigms

No-Code: Configure and Go

"No Code" solutions are especially appealing to organizations that have no IT department and are looking for something that someone without IT knowledge can configure in a way that's immediately useful.

An early implementation of no-code platforms will typically not even include a staging environment where people can try things out. Many times, changes go live immediately, on a productive system. That's great for small organizations that know exactly what they're doing, because it absolutely minimizes effort and maximizes speed.
It turns into a nightmare when someone, somehow, by pure accident, manages to delete the "Order" object, and now you're happy-hunting for a couple thousand unprocessed orders that your angry customers are complaining about - with no way to remedy the system.

And it turns into an even worse nightmare when the system doesn't do what it's supposed to do, and you've got a chance smaller than hell freezing over of figuring out why the black box does what it actually does instead of what it's supposed to do.

When introducing Quality Assurance on a No-Code platform, organizations are often stuck using third-party testing software that uses slow, flaky, difficult-to-maintain, expensive UI-based tests which will eventually get in the way of high speed adaptability. Clean Code practices applied to testing are usually a rare find in such an environment.


Low-Code: Configure and Customize

"Low Code" solutions are especially appealing to managers who are out to deliver standardized software to their organization fast. Many of these systems bring huge chunks of standard capability out-of-the box and "only need customization where your organization doesn't do what everyone else does."

That sounds appealing and is a common route in many organizations, which often find out only years after the initial introduction that "you can't sustain your market position by doing what everyone else does": your business does require a lot of customization to stand out in the market, and the platform often doesn't accommodate that.

Most vendor solutions don't provide a suite of functional tests for your organization to validate the standard behaviour, which means you often end up creating duplicate or highly similar code in your customization efforts - or use standard functions that don't do what you think they would. Worse yet, many use proprietary languages that make it very difficult to test close to the code. In combination, that makes it extremely hard to test the customization you're building, and even harder to sustainably keep the platform flexible.



Full-Code: Design, Build, Optimize

"Full Code" solutions sound like the most effort and the slowest way of achieving things. But looks can be deceptive, especially to a non-expert, because a modern stack of standard frameworks like Spring, Vue and Bootstrap, can literally make it a matter of minutes for a developer to produce the same kind of results that a low-code or no-code platform configuration would provide, without any of the quality drawbacks of Low-Code or No-Code.

Your organization has full control over the quality and sustainability of a full-code solution. It depends entirely upon what kind of engineering practices you apply, which technologies you use and which standards for quality you set for yourself.


Quality Control

To sustain high quality at a rapid pace, you need full quality control:
  • You must be able to quickly validate that a component does what it's supposed to do.
  • You must be able to quickly figure out when something breaks, what it was, why and how.
  • When something breaks, you must be able to control blast radius.
  • You need a systematic way of isolating causes, effects and impact.
The most common approach to maintain these is a CI/CD pipeline that runs a robust test automation in the delivery process. To make it feasible that this control is exercised upon every single change that anyone makes at any point in time, it should not take longer than a few minutes, lest people are tempted to skip it when in a hurry.

The problem with both No-Code and Low-Code solutions is: In many cases, such platforms aren't even built for testability, and that becomes a nightmare for agile development. Instead of running a test where and how it is most efficient to run, you invest a lot of brainpower and time into figuring out how to run the test in a way that fits your technology: You have subjected quality to the technology, instead of the other way around!

In a low-code environment, this can become even more problematic, when custom components start to interfere with standard components in a way that is unknown and uncontrollable in a huge black box.


Non-functional Quality

Although I would not outright suggest opting for a full-code solution (that may not be in the best interests of your organization, and it's entirely implausible without skilled developers), I would like to share a list of non-functional quality attributes that often aren't considered when selecting a new system, platform or service.

In order to remain agile - that is, to be able to quickly, effectively and easily implement changes in a sustainable manner - your platform should also accommodate for the following non-functional quality requirements:

Testability
How much effort is it to test your business logic?
This must go far beyond having a human check briefly whether something works as intended. It needs to include ongoing execution, maintenance and control of any important tests whenever any change is made. And remember: any function you can't test may cause problems - even when you're not intentionally using it!
Traceability
How closely are cause and effect related?
You don't want a change to X also to affect Y and Z if that wasn't your intent! Are you able to isolate changes you're making - and are you able to isolate the impact of these changes?
This should apply to the initial setup as well as to the entire lifecycle of the product.
Extensibility
How much effort does it take to add, change or remove business logic?
Adding a form field to a user interface is a start, not the end. Most of your data has a business purpose, and it may need to be sent to business partners, reported in finance, analyzed in marketing etc. How much effort does it take to verify everything turns out as intended?
Flexibility
How often will you be making changes?
If you're expecting a change a year, you can permit higher test efforts per change, but when you're looking at new stuff in a weekly manner, you could be overwhelmed by high test or change efforts, and cutting corners will become almost inevitable.
Security
Can you really trust your system?
Although every system could have vulnerabilities - and standard software tends to have fewer - how can you test for zero-days unless you can fully test the intricate inner workings?
Also, some legislation, such as the GDPR, forces you to disclose certain data processing, and you may need to provide evidence of what your system does in order to comply. This is extremely difficult when the behaviour of certain aspects is a black box.
Mutability
How much effort would it take to migrate to a new platform and decommission the current platform?
When you introduce a system without understanding how much time, effort and risk is involved in a migration or decommissioning initiative, it might be easier to kill your current company and start a new business than to get rid of the current technology. That means you could find yourself in a hostage situation when the day comes that your platform is no longer the best choice for your business, and you have no choice except to keep throwing good money after bad.

As a general rule of thumb, low-code and no-code platforms tend not to emphasize these qualities - so the more value your organization places on these non-functional requirements, the less plausible such a platform becomes.

Conclusion

With all of that said, if you're in the comfortable situation of introducing a new technology, ensure that you check the non-functional requirements and don't get blinded by the cute bucket of functionality a low-code or no-code solution may offer. If your platform does poorly especially on traceability, testability or mutability, you're going to trade your agility for some extremely painful workarounds that could increase the Cost of Ownership of your solution beyond feasible limits.

It wouldn't be the first time that I'd advise a client to "trash everything and start with a blank folder. Within a few years, you'll be faster, have saved money and made better business."

Culture Conversion

Many times, I hear that "SAFe doesn't work" both from Agile Coaches and companies who've tried it, and the reasons behind the complaint tend to boil down to a single pattern that is missing in the SAFe implementation - culture conversion. Let's explore why this pattern is so important, what it is, and how to establish it.



The Culture Clash

Many enterprises are built upon classical management principles: workers are seen as lazy, selfish and disposable "resources". Decisions are made at the top, execution is delegated. We have a constant tug-of-war between "The Business" and "Development". All problems are caused by "Them" (irrespective of whom you ask) - and the key objective is always to pass the next milestone, lest heads roll. There is little space for face-level exchange of ideas, mutual problem solving, growth and learning.

If you try to use an agile approach, which is built upon an entirely different set of principles, practices and beliefs, you'll get a clash. Either workers care, or they don't. Either people are valuable, or they aren't. Either they can think, or they can't. You get the idea. Behind that is a thing called "Theory X/Y." 

Self-fulfilling prophecy

When you treat people like trash, they'll stop caring about their work. When you don't listen to your developers, they fall silent. When you punish mistakes, workers become passive. And so on. This lose-lose proposition turns into a death spiral and becomes a self-fulfilling prophecy.

Likewise, when you create an environment built upon mutuality, trust and respect, people will behave differently. Except - you can't just declare it to be so and continue sending signals that the new values are "just theoretical buzzwords that don't match our reality." Because, if you do that, this will again be a self-fulfilling prophecy.


Breaking the vicious circle

You can't change everything overnight, especially not an entire organization. Some people "get it" immediately, others take longer. Some may never get it. Even when you desire and announce a new culture, it can't be taken for granted. You have to work towards it, which can be a lot of effort when dealing with people who have built their entire careers on the ideas of the old culture.  

Resilience over robustness

A lot of this doesn't happen in the realm of processes, org charts and facts - what's truly going on happens mostly in the realm of beliefs, hopes, fears. As such, problems are often difficult to identify or pinpoint until a dangerous symptom becomes manifest. Hence, you can't simply re-design an organization to "implement" this new culture. The best you can do is institute checks and balances, early warning mechanisms, buffer zones and intentional breaking points.

Buffer Zone

Often, you may need time to collect striking evidence that would convince others to let go of certain un-helpful practices. These might include, for example, HR policies, project management or accounting practices. When you can't quite yet eliminate these things, it's quite important for the culture conversion to also include a conversion of such activities, so that they don't affect the teams. At the same time, you need a strategy laid out with clear targets for abolishing these things, lest they become "the new normal" and culture converters start believing them to be right or even essential.


The Culture Conversion Pattern

When you operate in an environment where cultural elements that conflict with the intended future culture exist and will likely interfere with the sustainability of the change, you need mechanisms that let you:

  • Establish the desirable culture
  • Minimize undesirable culture infringement
  • Mitigate damage from culture infringement
  • Install breaking points for when the undesirable culture gets too strong
  • Identify culture clash

Specific people must take on this responsibility; it's not sufficient to say "We should do this." Someone must be in control of these activities, and the entire organization must rigorously apply the above mechanisms, inspecting and adapting relentlessly upon failure.

Failure on any of these will provide a backdoor for the existing, undesirable culture to quickly usurp the new culture, and the culture change will fail.

The SAFe Zone

A healthy SAFe organization would institute the "Program Level" to provide exactly this resilience for culture conversion. The Product Management function would protect the agile organization against low value work and overburden, the RTE function would safeguard against Command and Control, and the architect would be the bulwark against unsustainable engineering. Product Owners and Scrum Masters would provide an additional safety cushion to protect the teams.

These roles must unite to drive transparent, non-political value optimization, mutual collaboration and quality-focused development practices, both towards the teams and towards the non-agile surrounding organization.


Failing Culture Conversion

Let's say your Program Level is being pressured to introduce cultural dysfunctions from the previously existing surrounding organization into the Agile Release Train, and they can't push back. In their function as a culture converter, they are now converting the new culture back into the old culture, and as such, working against the Agile Transformation. If you do not identify and deal with this issue swiftly and strongly, you're setting the fox to keep the geese: The fledgling new culture will be steamrolled by the existing culture in no time.




Summary

When you are using SAFe, ensure that the ART Roles are both willing and able to act as culture converters, and give them the support they need to function properly as such, mostly by relieving them of any and all responsibilities that relate to the "old" culture you want to abolish.

By overriding, short-circuiting or ignoring the culture conversion function, you're dooming the culture transformation - and since the new ways of working rely on the new culture, you're headed for a train wreck.

SAFe sucks when you mess up the culture conversion.



Thursday, December 10, 2020

Test Cocooning

 "How do you deal with Legacy code that lacks test coverage?" - even miniscule small changes are hazardous, and often, a necessary rewrite is postponed forever because it's such a nightmare to work with. Even if you have invested time into your test coverage after taking over the system, chances are there are still parts of the system you need to deal with that aren't covered at all. So this is what I propose in this situation:


Test Cocooning is a reversed TDD cycle, and it should be common sense.


The Cocooning process

Test Cocooning is a pretty straightforward exercise (a minimal code sketch follows the list):
  1. Based on what you think the code does, you create a cocooning test.
    1. If the test fails, you didn't understand the code correctly and you have to improve your test.
    2. If the test passes, you have covered a section of the code with a test that ensures you don't accidentally break the tested aspect of the code.
  2. Based on what you think the code does, you make a breaking change.
    1. If the test fails in the way you thought it would, you have a correct understanding of that piece of code.
    2. If the test passes, you didn't understand the code correctly and you have to improve your test (back to step 1).
    3. Intermediate activity: of course, you revert the change to restore the behaviour that you have covered with a test.
  3. Within the scope of your passing test, you begin to improve:
    1. Create lower-level tests that deal with more specifics of the tested code (e.g. unit tests).
    2. Refactor based on the continuous and recurrent execution of all the relevant tests.
    3. Refactor your tests as well.
  4. Re-run the original cocooning test to ensure you didn't mess up anywhere!
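
To make steps 1 and 2 concrete, here is a minimal sketch in Python with pytest. The function legacy_invoice_total and its behaviour are made up purely for illustration - your cocooning tests would target your actual legacy entry points.

    # Minimal cocooning sketch (pytest); legacy_invoice_total is a made-up
    # stand-in for whatever untested legacy code you are wrapping.
    def legacy_invoice_total(items, discount):
        """Pretend this is untested legacy code whose behaviour we want to pin down."""
        total = sum(item["price"] * item["quantity"] for item in items)
        return round(total * (1 - discount), 2)

    def test_cocoon_invoice_total():
        # Step 1: encode what we *think* the code does. If this fails, our
        # understanding is wrong - improve the test, not the code.
        items = [{"price": 100.0, "quantity": 2}, {"price": 50.0, "quantity": 1}]
        assert legacy_invoice_total(items, discount=0.1) == 225.0

    def test_cocoon_empty_order():
        # Pin down edge-case behaviour as well, even if it looks odd - the
        # cocoon protects observed behaviour, it doesn't judge it.
        assert legacy_invoice_total([], discount=0.0) == 0.0

Step 2, the deliberate breaking change, would then be something like temporarily mangling the discount handling inside legacy_invoice_total, confirming that the first test fails exactly as expected, and reverting.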

Once a cocooning cycle is completed, you should have reworked a small section of your Legacy code to be Clean(er) Code that is more workable for change.


Iterating

You may need to complete multiple cocooning cycles until you have a sufficient amount of certainty that you have workable code.


Backtracking

The important secret of successful Test Cocooning is that you need to backtrack both on the code and your tests - after completing all relevant cocooning cycles, you'll need to re-run:

  • your cocooning tests against the original legacy code. 
  • your unrefactored original cocooning tests against the new code.
  • your unrefactored original cocooning tests against the original legacy code.
Yes, that's painful and a lot of overhead, but it's your best bet in the face of dangerous, unworkable code, and believe me - it's a lot less painful than what you'll experience when some nasty bugs slip through because you skipped any of these.


Working Cocooned code

Once you have your test cocoon, you can work the cocooned code - only within the scope of your cocoon - to fix bugs and to build new features.

Bugfixes

Fixing bugs relies on making a controlled breach in your cocoon (a sketch follows the list below).
Metaphorically speaking, you need to be like a spider that caught a bug and sucks it dry before discarding the woven husk.
  1. Create a cocooning test for the current behaviour - one that passes under the current, faulty(!) conditions of the code segment and reproduces the bug exactly as though it were desired behaviour.
  2. Create a test which fails due to the bug, i.e. add a second test that exactly reverses the cocooned behaviour.
  3. Write the code that meets the requirement of the failing test.
    1. As a consequence, the cocooned passing test for the bug should now fail.
    2. Ensure that no other tests have failed.
    3. If another test has failed, ensure that this is intentional.
  4. Eliminate the broken cocoon test that reproduces the bug's behaviour.
    1. If there were other tests that failed, now is the time to modify these tests one by one.
  5. Backtrack as described above to ensure that nothing slipped.
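
Sticking with the made-up invoice example from above, the controlled breach could look like this sketch. Suppose, purely for illustration, that the legacy code wrongly applies the discount twice:

    def buggy_invoice_total(items, discount):
        """Made-up legacy code with an invented bug: the discount is applied twice."""
        total = sum(item["price"] * item["quantity"] for item in items)
        return round(total * (1 - discount) * (1 - discount), 2)

    def test_cocoon_current_faulty_behaviour():
        # Step 1: reproduce the bug exactly as it behaves today (250 * 0.9 * 0.9).
        # This test passes against the buggy code and gets removed in step 4,
        # once the fix is in place.
        items = [{"price": 100.0, "quantity": 2}, {"price": 50.0, "quantity": 1}]
        assert buggy_invoice_total(items, discount=0.1) == 202.5

    def test_desired_behaviour():
        # Step 2: state the desired behaviour. This test fails against the
        # buggy code and only passes once the fix is written (step 3).
        items = [{"price": 100.0, "quantity": 2}, {"price": 50.0, "quantity": 1}]
        assert buggy_invoice_total(items, discount=0.1) == 225.0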

Modifying features

Modifying existing behaviour should be treated exactly like a bugfix.

New functionality

If you plan to add new functionality to Legacy Code, your best bet is to develop this code in isolation from the cocooned legacy and only communicate via interfaces, ensuring that the cocoon doesn't break.
When you really need to invoke new code from the Legacy, treat the modification like a bugfix.
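
A minimal sketch of what that isolation could look like in Python - the names (LoyaltyPoints, legacy_checkout) are invented for illustration:

    from typing import Protocol

    class LoyaltyPoints(Protocol):
        """The narrow interface the legacy code is allowed to call - nothing more."""
        def points_for(self, invoice_total: float) -> int: ...

    class SimpleLoyaltyPoints:
        """New, test-driven functionality, developed in isolation from the legacy."""
        def points_for(self, invoice_total: float) -> int:
            return int(invoice_total // 10)

    def legacy_checkout(invoice_total: float, loyalty: LoyaltyPoints) -> str:
        # The only change inside the cocooned legacy code is this single
        # delegating call - introduce it like a bugfix, as described above.
        points = loyalty.points_for(invoice_total)
        return f"Order confirmed. You earned {points} loyalty points."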

Rewrite

A rewrite should keep the cocoon intact. Don't cling to any of the Legacy code, and consider your cocooning efforts "sunk cost" - otherwise, you risk reproducing the same mess with new statements.



Closing remarks

  1. I believe that test cocooning requires both strong test and development expertise, so if you have different specialists on your team, I would highly recommend building the cocoon as a pair.
  2. Cocoon tests are often inefficient and have poor performance. You do not need to add these tests to your CI/CD pipeline. What you must add to your pipeline are the lower-level tests that replicate the unit behaviour of the cocoon. It's sufficient to rerun the cocoon tests only when you work on the cocooned Legacy segment.
  3. Cocooning is a workaround for low-quality code. When time permits, rewrite it with Clean Code and you can discard the cocoon along with the deleted code.
  4. Do not work on Legacy Code without a solid Cocoon. The risk outweighs the effort.

Friday, December 4, 2020

Test Coverage Matrix

Whether you're transitioning towards agile ways of working on a Legacy platform or intend to step up your testing game for a larger system developed in an agile fashion, at some point, it pays to set up a Coverage Matrix to see where it pays to invest effort - and where it doesn't.



Before we start

First things first: the purpose of an agile Coverage Matrix isn't the same as a traditional project-style coverage matrix that's mostly concerned with getting the next release shipped. I don't intend to introduce a mechanism that adds significant overhead with little value, but to give you a means of starting the right discussions at the right time and to help you think in a specific direction. Caveat emptor: It's up to you to figure out how far you want to go down each of the rabbit holes. "Start really simple and incrementally improve" is good advice here!

What I'm proposing in this article will sound familiar to the apt Six Sigma practitioner as a simplified modification of the method, "Quality Function Deployment." And that's no coincidence.


Coverage Characteristics

Based on ISO/IEC 9126, quality characteristics can be grouped into Functionality, Reliability, Usability, Efficiency, Maintainability and Portability. These provide good guidance.

To simplify matters, I like to start the initial discussion by labelling the columns of the matrix:

  • Functionality ("Happy Cases")
  • Reliability ("Unhappy Cases")
  • Integration
  • Performance
  • Compliance
  • UX
Of course, we can clarify a lot more about what each of these areas means, but let's provide some leeway for the first round of discussion here. The most important thing is that everyone in the room has an aligned understanding of what these are supposed to mean.
If you are in the mood for some over-engineering, add subcategories for each coverage characteristic, such as splitting Performance into "efficiency", "speed", "scalability", "stress resilience" etc. That will bloat the matrix and may make it more appropriate to flip its rows and columns.

Test Areas

Test areas fall into multiple categories, which correlate to the "Automation Test Pyramid".

  • User journeys
  • Data flow
  • Architectural structure
  • Code behaviour
There are other kinds of test areas, such as validation of learning hypotheses around value and human behaviour, but let's ignore these here. Let's make a strong assumption that we know what "the right thing" is, and we just want to test that "we have things right." Otherwise, we'd open a can of worms here. You're free to also cover these, adding the respective complexity.


Functional areas

In each test area, you will find different functional areas, which strongly depend on what your product looks like.

User journeys

There are different user journeys with different touchpoints where your users interact with your product.

For example, a simple video player app might have one user flow for free-to-play, another for registration, another for premium top-up, and another for GDPR-compliant deregistration, as well as various flows such as "continue to watch my last video" or "download for offline viewing". These flows don't care what happens technically.


Data flow

Take a look at how the data flows through your system as certain processes get executed. Every technical flow should be consistent end-to-end.

For example, when you buy a product online, the user just presses "Purchase", and a few milliseconds later, they get a message like "Thank you for your order." The magic that happens in between is make or break for your product, but entirely irrelevant to the user. In our example, that might mean the system needs to make a purchase reservation, validate the user's identity and their payment information, conduct a payment transaction, turn the reservation into an order, ensure the order gets fulfilled, and so on. If a single step in this flow breaks, the outcome could be an economic disaster. Such tests can become a nightmare in microservice environments where the flows were never mapped out.
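
As a rough sketch - with made-up, in-memory stand-ins rather than real services - a data-flow test asserts that the data stays consistent across every step of such a flow, rather than testing any single component:

    def test_purchase_data_flow():
        # Made-up in-memory stand-ins for reservation, payment and order services.
        reservations, payments, orders = [], [], []

        # Step 1: purchase reservation
        reservations.append({"user": "u-42", "item": "premium-month", "amount": 9.99})

        # Step 2: payment transaction over the reserved amount
        payments.append({"user": "u-42", "amount": reservations[0]["amount"]})

        # Step 3: the reservation becomes an order
        orders.append({"user": "u-42", "item": reservations[0]["item"], "paid": True})

        # End-to-end consistency: same user, same item, same amount at every step.
        assert payments[0]["amount"] == reservations[0]["amount"] == 9.99
        assert orders[0]["item"] == reservations[0]["item"]
        assert orders[0]["paid"] is True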


Architectural structure

Similar to technical flow, there are multiple ways in which a transaction can occur: it can happen inside one component (e.g. frontend rendering), it can span a group of components (e.g. frontend / backend / database) or even a cluster (e.g. billing service, payment service, fulfilment service) and in the worst case, multiple ecosystems consisting of multiple services spanning multiple enterprises (e.g. Google Account, Amazon Fulfilment, Salesforce CRM, Tableau Analytics).

In architectural flow, you could list the components and their key partner interfaces. For example:

  • User Management
    • CRM API
    • DWH API
  • Payment 
    • Order API
    • Billing API

Architectural flow is important in the sense that you need to ensure that all relevant product components and their interactions are covered.

You can simplify this by first listing the relevant architectural components, and only drilling down further if you have identified a relevant hotspot.


Code behaviour

At the lowest level is always the unit test, and different components tend to have different levels of coverage - are you testing class coverage, line coverage, statement coverage, branch coverage - and what else? Clean Code? Suit yourself.

Since you can't list every single behaviour of the code you'd want to test without turning the Coverage Matrix into a copy of your source code, focus on what really matters: where do we think there's a need to do something?


Bringing the areas together

There are dependencies between the areas - you can't have a user flow without technical flow, you won't have technical flow without architectural flow, and you won't have architectural flow without code behaviour. Preferably, you don't need to test for certain user flows at all, because the technical and architectural flows already cover everything. 

If you can relate the different areas with each other, you may learn that you're duplicating or missing on key factors.


Section Weight

For each cell - that is, for each row and column combination - assign a value for how important the topic is.

For example, take the user journey "Register new account." How important do you think it is to have the happy path automated? Do you think the negative case is also important? Does this have an impact on other components, i.e. would the user get SSO capability across multiple products? Can you deal with 1000 simultaneous registrations? Is the process secure and GDPR-compliant? Are users happy with their experience?

You will quickly discover that certain rows and columns are "mission critical", so mark them red. Others will turn out to be "entirely out of scope", such as testing UX on a backend service, so mark them gray. Others will be "basically relevant" (green) or "important" (yellow).

As a result, you end up with a color-coded matrix.

The key discussion that should happen here is whether the colors are appropriate. An entirely red matrix is as unfeasible as an entirely gray matrix.


A sample row: Mission critical, important, relevant and irrelevant



Reliability Level

As the fourth activity, focus on the red and yellow cells, rate on a sliding scale how well you're doing in each area, and assign a number from 0 to 10 with this rough guidance:

  • 0 - We're doing nothing, but know we must.
  • 3 - We know that we should do more here.
  • 5 - We've got this covered, but with gaps.
  • 7 - We're doing okay here.
  • 10 - We're using an optimized, aligned, standardized, sustainable approach here.

As a consequence, the red and yellow cells should look like this:

A sample matrix with four weighted fields.

As you would probably guess by now, the next step for discussion would be to look at the big picture and ask, "What do we do with that now?"


The Matrix

Row Aggregate

For each row, figure out what the majority of colors in that row is, and use that as the color of the row. Next, add up all the numbers. This will give you a total number for the row. 

This gives you an indicator of which rows are most important to address - the ones in red with the lowest numbers.

The Row Aggregate


Column Aggregate

You can use the same approach for the columns, and you will discover which test type is covered best. I wouldn't be amazed if Unhappy Path or Compliance turned out to have poor coverage the first time you do this exercise, but the real question is again: which of the red columns has the lowest number?


The Column aggregate
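
To make the mechanics of both aggregates tangible, here is a small sketch in Python with invented scores; each scored cell holds an importance colour and a 0-10 reliability number, as described above:

    # Only the red and yellow cells carry scores; green and gray cells are omitted.
    scored_matrix = {
        "Register new account": {"Happy Cases": ("red", 3), "Compliance": ("red", 5),
                                 "Unhappy Cases": ("yellow", 7)},
        "Purchase data flow": {"Happy Cases": ("red", 5), "Integration": ("red", 2),
                               "Performance": ("yellow", 0)},
    }

    def row_aggregate(row):
        # Dominant colour plus the sum of the scores: red rows with the lowest
        # totals are the first candidates to address.
        colours = [colour for colour, _ in row.values()]
        dominant = max(set(colours), key=colours.count)
        return dominant, sum(score for _, score in row.values())

    def column_aggregate(matrix, column):
        # Same idea per column: which test type has the weakest coverage overall?
        return sum(row[column][1] for row in matrix.values() if column in row)

    for area, row in scored_matrix.items():
        print(area, row_aggregate(row))
    print("Happy Cases total:", column_aggregate(scored_matrix, "Happy Cases"))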



After conducting all the above activities, you should end up with a matrix that looks similar to this one:


A coverage matrix

Working with the Matrix

There is no "The right approach" to whether to work on improving coverage for test objects or test types - the intent is to start a discussion about "the next sensible thing to do," which totally depends on your specific context.  

In our example, the question "Should we address the badly covered but less important topic of Performance, or should we improve coverage of the architectural flow?" has no single correct answer - you could end up with different groups of people working hand in hand to improve both, or you could focus on either one.



How-To Use

You can facilitate discussions with this matrix by inviting different interest groups - business people, product people, architects, developers, testers - and asking: "Are we testing the right things right, and where or how could we improve most effectively?"

Modifications 

You can modify this matrix in whatever way you see fit: different categories for rows or columns, drill-in, drill-across - all are valid.

For example, you could look at only functional tests on user journeys and start listing the different journeys, or you could explicitly look at different ways of approaching happy path tests (e.g., focusing on covering various suppliers, inputs, processing, outputs or consumers).

KISS 

This method looks super complicated if you list out all potential scenarios and all potential test types - you'd take months to set up the board without even having a conversation. Don't. First, identify the 3-4 most critical rows and columns, and take the conversation from there. Drill in only when necessary and only where it makes sense.