Saturday, March 27, 2021

Tests without code knowledge are waste

I start with an outrageous claim: "Software tests made without knowledge of the code are waste."

Now, let me back that claim up with an example - a calculator applet which you can actually use:

[Calculator applet embedded here]

Have fun, play around with this little applet ... it works if your JS is enabled!

Now let us create a test strategy for this applet:

Black Box test strategy

Let's discuss how you would test this by going through the list of features this applet offers:
  • Basic maths: Addition, Subtraction, Multiplication, Division
  • Parentheses and nested parentheses
  • Comparisons: Less, Equal, Greater ...
  • Advanced maths: Exponents, Square Root, Logarithm, Maximum, Minimum, Rounding (prefixed by "Math.")
  • Trigonometry: Sine, Cosine, Tangent, and the arc functions (also prefixed)
  • Variables: defining, referencing, modifying
  • Nested functions: combining any set of aforementioned functionality
  • And a few more ...
Wow - I'm sure you can already see how large our test case catalogue will grow even with minimal coverage, and we haven't even considered edge cases or negative tests yet!


How many things could go wrong in this thing? Twenty, thirty? Fifty? A thousand?
How many tests would provide an appropriate test coverage? Five, ten, a hundred?
How much effort is required for all this testing? 

Would it shock you if I said ...

All those tests are a waste of time!

Frankly speaking, I would test this entire thing with a single test case, because everything beyond that is just waste. 
I am confident that I can do that because I know the code:


  <div class="row">
    <div class="card"
         style="margin-top: 1rem;
                margin-left: 2rem;
                margin-right: 2rem;
                width: 18rem;
                box-shadow: 4px 4px 4px 1px rgba(0, 0, 0, 0.2);">
      <b class="card-header text-center bg-primary text-light">
        Calculator</b>
      <div class="card-body">
        <textarea class="form-control"
                  id="calc_input"
                  rows="1" cols="20"
                  style="text-align: right;"
                  oninput="output.innerHTML = eval(this.value)"></textarea>
      </div>
      <div class="card-footer text-right">
        <b id="output">0</b>
      </div>
    </div>
  </div>
  
Yes, you're seeing this right. The entire code is just look-and-feel. There is only a single executable statement: "eval(this.value)" - so there is no truckload of branch, path, line or statement coverage that we need to achieve:
All relevant failure opportunities are already covered by the JavaScript engine's own tests for its eval() function, so why would we want to test it again?
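
To illustrate why that single statement covers the whole feature list, here is a small sketch (assuming a browser or Node.js console; the expressions are made up for illustration) showing that every listed feature is just a plain JavaScript expression that eval() already understands:

  // Every feature from the list above is an ordinary JavaScript expression,
  // so eval() handles all of them without any calculator-specific code:
  console.log(eval("1 + 2 * 3"));            // basic maths           -> 7
  console.log(eval("(1 + 2) * (3 + 4)"));    // (nested) parentheses  -> 21
  console.log(eval("3 > 2"));                // comparisons           -> true
  console.log(eval("Math.sqrt(16)"));        // advanced maths        -> 4
  console.log(eval("Math.sin(0)"));          // trigonometry          -> 0
  console.log(eval("x = 5; x * 2"));         // variables             -> 10
  console.log(eval("Math.max(Math.round(2.6), Math.min(1, 2))")); // nested functions -> 3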

The actual failure scenarios

Seriously, this code has only the following opportunities for failure:
  • JavaScript not working (in this case, no test case will run anyway)
  • Accidentally renaming the "output" field (in this case, all tests will fail anyway)
  • User input errors (the faulty input is simply not processed)
  • Users not knowing how to use certain functions
    (which is a UX issue ... but how relevant?)
Without knowing the code, I would need to work through an entire catalogue of test cases.
Knowing and understanding the code, I can reduce the entire test to a single Gherkin spec:

The only relevant test

Given: I see the calculator
When: I type "1+0".
Then: I see "1" as computation result.

So why wouldn't we want to test anything else?
Because: the way the code is written, if this test passes, then all other tests we could think of would also pass.
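
If we wanted to automate that single spec, a minimal sketch could look like the following (assuming the applet's HTML shown above is loaded in the page; the function name testCalculator is made up for illustration):

  // Minimal automation sketch for the single Gherkin spec above.
  function testCalculator() {
    const input = document.getElementById("calc_input");
    const output = document.getElementById("output");

    // When: I type "1+0"
    input.value = "1+0";
    input.dispatchEvent(new Event("input"));

    // Then: I see "1" as computation result
    if (output.innerHTML !== "1") {
      throw new Error('Expected "1", but saw "' + output.innerHTML + '"');
    }
    console.log("Calculator test passed.");
  }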

But why not write further tests?

A classical BDD/TDD approach would mandate us to incrementally define all application behaviours, create a failing test for each, then add passing code, then refactor the application.
If we did this poorly, we would really end up creating hundreds of tests and writing explicit code to pass each of them - and that's actually a problem of incremental design that is unaware of the big picture!

The point is that we wrote the minimum required code to meet the specified functionality right from the start: code that has only a single failure opportunity (not being executed) - and after having this code in place, there's no way we can write another failing test that meets the feature specifications!


And that's why a close discussion between testers and developers is essential in figuring out which tests are worthwhile and which aren't.

Wednesday, March 24, 2021

The only constant is change

A classic response to change failures in traditional organizations - most notably, software release failures - is "Then make changes less often." This gives a good feeling that for a prolonged period of time, things will be predictable and stable. Unfortunately, it's a local optimization that actually makes things worse in the long term. Why do we think like that, and why is that thought a problem?



The gut reaction

Let's say that a change just failed. This is an uncomfortable event. Naturally, we don't like discomfort. The easiest way to deal with discomfort is to postpone the recurrence of the uncomfortable event into the future. In software development, we do this by postponing the next release as far as possible.

This provides an immediate short-term effect on our psyche: We have just achieved certainty that until the next release happens, there will be no further discomfort, and thus we have reduced immediate stress by postponing it.

Unfortunately, this has done nothing to reduce the probability of that next change failing.

Now, let's see what happens as a consequence of this decision:


The immediate after-effect

Because there is now a prolonged period of time until our next release happens, and the amount of development work is not decreasing, the batch size for the next release grows in proportion to the delay. So if, for example, we used to release once per month and now release only once every three months, the batch size just tripled - a 200% increase from a single decision.

Of course, this means that the scope of the next change is also going to increase, and the next release will become more complex to deliver.

The knock-on effect

A bigger, more complex change is more difficult to conduct and has a higher potential for failure. In consequence, we're going to fail harder and more painfully. If we have 3 small changes and 1 of them fails, that's a failure rate of 33%. If we combine all 3 changes into 1 big change, the failing change is now part of that single release - and we end up with a 100% failure rate!

You see - reducing change frequency without reducing change scope automatically increases both the likelihood and the impact of failure.

If now we decide to follow our gut instinct again, postponing the next failure event, we end up in a vicious circle where change becomes a rare, unwanted, highly painful event: We have set the foundation for a static organization that is no longer able to adapt and meet customer demands.

The outcome

The long-term consequence of reducing change frequency is that we can barely correlate effort and outcome; it becomes indistinguishable what works and what doesn't - and thereby, the quality of our work, our product, our processes and our metrics deteriorates. We lose our reason to exist on the market: providing high quality and value to our customers on demand.


"If it hurts, do it more often."

Let's just follow the previous computation:
If 1 big change fails 100% of the time, maybe you can slice it into 3 smaller changes, of which 2 will succeed - cutting your failure rate from 100% down to 33%?

So, instead of deciding to reduce change frequency, you decide to increase it?

The immediate after-effect

Because there is now a shorter period of time until the next release, less time passes between developing something and seeing its consequences. We close the feedback loop faster and learn quicker what works and what doesn't. And since we tend to be wired to avoid things that become painful, we do more of the things that work, and less of the things that don't.

The knock-on effect

Managing small changes is easier than managing complex change. Thereby, it becomes less risky, less work and less painful to make multiple small changes. Likewise, since we get faster (and more frequent) feedback on what worked, we can optimize faster for doing more things that provide actual value.

The outcome

By making rapid, small changes, we can quickly correlate whether we improved or worsened something, and we can respond much more flexibly towards changing circumstances. This allows us to deliver better quality and feel more confident about what we do.


Summary

The same vicious circle created by the attitude, "If we change less often, we will have fewer (but significantly more painful) events" can become a virtuous cycle if we change our attitude towards, "If we do it more often, it'll become easier and less painful each time."

Your call.

Monday, March 15, 2021

Why WSJF is Nonsense

There's a common backlog prioritization technique, suggested as standard practice in SAFe but also used elsewhere, called "WSJF" - "Weighted Shortest Job First" - also known as "HCDF", "Highest Cost of Delay First", in Don Reinertsen's terms.

Now, let me explain this one in (slightly oversimplified) terms:

The idea behind WSJF

It's better to gain $5000 in 2 days than to gain $10000 for a year's work.
You can still go for those 10 Grand once you have 5 Grand in your pocket, but if you do the 10 Grand job first, you'll have to see how you can survive a year penniless.

Always do the thing first that delivers the highest value and blocks your development pipeline for the shortest time. This allows you to deliver the highest value as fast as possible.


How to do WSJF?

WSJF is a simple four-step process:

To find out the optimal backlog position for a given item, you estimate the impact of doing the item ("value"), divide that by the investment into said item ("size"), and then put the items in relation to each other.


It's often suggested to use the "Agile Fibonacci" scale for these estimates: "1, 2, 3, 5, 8, 13, 20, 40, 80..."
The idea is that every subsequent number is "a little more, but not quite twice as much" as the previous one, so a "13" is "a little more than 8, but not quite 20".
Since there are no in-between numbers, when you're not sure whether an item is an 8 or a 13, you can choose either, because these two numbers are adjacent and their difference is considered minuscule.

Step 1: Calculate "Value" Score for your backlog items.

Value (in SAFe) is actually three variables: User and/or Business Value, Time Criticality, Enablement and/or risk reduction. But let's not turn it into a science. It's presumed value.

Regardless of how you calculate "Value", either as one score or a sum or difference of multiple scores, you end up with a number. It becomes the numerator in your equation.

Step 2: Calculate "Size" Score for your backlog items.

"Size" is typically measured in the rubber-unit called Story Points, and regardless of what a Story Point means in your organization or how it's produced, you'll get another number - the denominator in your equation.

Step 3: Calculate "WSJF" Score for your backlog items.

"WSJF" score, in SAFe, is computed by dividing Value by Size.

For example, a Value of 20 divided by a size of 5 would give you a WSJF score of 4.

Step 4: Sort the backlog by "WSJF" Score.

As you add items, you just put them into the position the WSJF sort order suggests, with the highest score on top and the lowest score at the bottom of the backlog.
For example, if you get a WSJF of 3 and your topmost backlog item has a WSJF score of 2.5, the new item would go on top - it's assumed to be the most valuable item to deliver!
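
To make the mechanics concrete, here is a minimal sketch in JavaScript (the item names and scores are made up for illustration) that performs steps 3 and 4 on a small backlog:

  // Hypothetical backlog items with estimated Value and Size scores
  const backlog = [
    { name: "Feature A", value: 20, size: 5 },
    { name: "Feature B", value: 8,  size: 8 },
    { name: "Feature C", value: 13, size: 3 },
  ];

  // Step 3: WSJF score = Value / Size
  backlog.forEach(item => { item.wsjf = item.value / item.size; });

  // Step 4: sort by WSJF score, highest score on top
  backlog.sort((a, b) => b.wsjf - a.wsjf);

  console.table(backlog);
  // Resulting order: Feature C (4.33), Feature A (4), Feature B (1)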

And now ... let me dismantle the entire concept of WSJF.

Disclaimer: After reading the subsequent portion, you may feel like a dunce if you've been using WSJF in the real world.


WSJF vs. Maths

WSJF assumes estimates to be accurate. They aren't. They're guesswork, based on incomplete and biased information: Neither do we know how much money we will make in the future (if you did, why would you be working in development and not on the stock market?), nor do we actually know how much work something takes until we've done it. Our estimates are inaccurate.

Two terms with error

Let's keep the math simple and just state that every estimate has an associated error term. We can ignore an estimator's bias, assuming that it affects all items equally - although that, too, is often untrue. Anyway.

The estimated numbers for an item can be written as:
Estimated Value = A(V) + E(V)  [Actual Value + Error on the Value]
Estimated Size  = A(S) + E(S)  [Actual Size + Error on the Size]

Why is this important?
Because we divide two numbers, which both contain an error term. The error term propagates.

For the following section, it's important to remember that we're on a Fibonacci scale, where two adjacent numbers are roughly 60% apart (anywhere from 50% to 100%, depending on where you are on the scale).

Slight estimation error

If we over-estimate value by a single notch, our estimate is roughly 60% higher than the actual value - even if the difference between fact and assumption felt minuscule when estimating. Likewise, if we under-estimate value by a single notch, our estimate is at least 30% lower than the actual value.

To take a specific example:
When an item is estimated at 8 (from whatever benchmark), but turns out to actually be a 5, we overestimated it by 60%. Likewise, if it turns out to actually be a 13, we underestimated it by 38.5%.
So even when our estimates are off by only one notch, the actual value can range from 5 to 13 - a spread of more than a factor of 2.5!

The same holds true for Size. I don't want to repeat the calculation.

Larger estimation error

Remember - we're on a Fibonacci scale, and so far we only permitted a deviation of a single notch. If we now permit our estimates to be off by two notches, we get significantly worse numbers: all of a sudden, we could be off by a factor of more than 6 - an item estimated at 8 could actually be anything from a 3 to a 20!

Now, the real problem happens when we divide those two.

Square error terms

Imagine that we divide a number that is 6 times larger than it should be by a number that is 6 times smaller than it should be: the errors multiply, and we end up off by a factor of 36 - a squared error term.

Let's talk through a specific example again:
Item A was estimated at a value of 5, but its actual value was 2. It was estimated at a size of 5, but its actual size was 13. As such, it had an error of +3 in value, and an error of -8 in size.
Estimated WSJF = (2 + 3) / (13 - 8) = 1
However, the actual WSJF = 2 / 13 ≈ 0.15
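
Here is the same worked example as a tiny sketch (the numbers come straight from the paragraph above; the variable names are made up):

  // Item A: what we estimated vs. what turned out to be true
  const estimated = { value: 5, size: 5 };   // our guesses
  const actual    = { value: 2, size: 13 };  // reality

  const estimatedWSJF = estimated.value / estimated.size; // 5 / 5  = 1
  const actualWSJF    = actual.value / actual.size;       // 2 / 13 ≈ 0.15

  // The error on the ranking is the product of both individual errors:
  // (5/2) on value times (13/5) on size = 6.5x too optimistic
  console.log(estimatedWSJF / actualWSJF); // 6.5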


Now, I hear you arguing, "The actual numbers don't matter... it's their relationship towards one another!"


Errors aren't equal

There's a problem with estimation errors: we don't know where we make them - otherwise, we wouldn't make them - and we don't make the same error on every item - otherwise, the relative order wouldn't be affected at all. Errors are errors, and they are random.

So, let me draw a small table of estimates produced for your backlog:

Item | Est. WSJF | Est. Value | Est. Size | Act. Value | Act. Size | Act. WSJF
A    | 1.6       | 8          | 5         | 5          | 5         | 1
B    | 1         | 8          | 8         | 3          | 20        | 0.15
C    | 0.6       | 3          | 5         | 8          | 2         | 4
D    | 0.4       | 5          | 13        | 13         | 2         | 6.5

Feel free to sort by "Act. WSJF" to see how you should have ordered your backlog, had you had a better crystal ball.

And that's the problem with WSJF

We turn haphazard guesswork into a science, and think we're making sound business decisions because we "have done the numbers", when in reality, we are the victims of an error that is built right into our process. We make entirely pointless prioritization decisions, thinking them to be economically sound.


WSJF is merely a process to start a conversation about what we think should take priority, when our main problem is indecision.
It is a terrible process for making reliable business decisions, because it doesn't rely on facts. It relies on error-prone assumptions, and it exacerbates any error we make in the process.

Don't rely on WSJF to make sound decisions for you. 
It's a red herring.

The discussion about where and what the value is provides much more benefit than anything you can read from a WSJF table. Have the discussion. Forget the numbers.