Showing posts with label Six Sigma. Show all posts

Thursday, June 9, 2022

Using metrics properly

Getting metrics right is pretty difficult - many try, and usually mess up. The problem?
Metrics require a context, and they also create a context. Without a proper definition of context, metrics are useless - or worse: guide you in the wrong direction.


A Metrics system

Let's say you have a hunch, or a need, that something could - or should - be improved. To make sure you're actually improving, create a metrics system covering the topic. This system should model the organizational system adequately - that is, both simply and sufficiently - and consist of:

  • Primary metrics (things we want to budge)
  • Secondary metrics (things we expect to be related to our primary metric)
  • Indirect metrics (things we expect NOT to budge)

An example

We start with a problem statement, "Our TTM sucks." Hence, our metrics system starts with "time-to-market" as the primary metric. A common-sense assumption might be that improving TTM will make people do overtime, or that people become sloppy. Thus, we add the secondary metric "quality" - we would like to observe how a change to TTM affects quality - and the indirect metric "overtime" - we set a constraint that people shall not do extra hours.
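As a sketch, such a metrics system can be modelled as a small data structure. The class and field names here are illustrative, not a standard API:

```python
from dataclasses import dataclass, field

@dataclass
class MetricsSystem:
    """Illustrative model of the three metric types; names are made up."""
    primary: str                                   # the thing we want to budge
    secondary: dict = field(default_factory=dict)  # expected to move with the primary
    indirect: dict = field(default_factory=dict)   # expected NOT to budge

# The TTM example from the text:
system = MetricsSystem(
    primary="time-to-market",
    secondary={"quality": "observe how TTM changes affect it"},
    indirect={"overtime": "constraint: no extra hours"},
)
print(system.primary)  # -> time-to-market
```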


Systematic improvement

In order to work with your metrics system adequately, there's a common five-step process which is at the core of Six Sigma:

Define 

  • Define our problem statement: what problem do we currently face?
  • Define our primary metric.
  • Become clear on our Secondary and Indirect metrics.

Measure

  • Get data to determine where these metrics currently are.
  • Set an improvement target on our primary metric.
  • Predict the effects on secondary metrics 
  • Set boundaries on indirect metrics.

Analyze

  • Understand what's currently going on.
  • Understand why we currently see the unwanted state in the primary metric.
  • Determine what we'd like to do to budge the primary metric.

Improve

  • Make a change.
  • Observe changes to all the metrics.

Control

  • If our Primary metric budged significantly and all other metrics are where we'd expect them to be, our change was successful.
  • If that wasn't the case - we messed up. Backtrack.
  • Determine which metrics we'd like to retain in the future to make sure we're not lapsing back.
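The success check in the Control step could be sketched as a simple function - the function, metric names and numbers here are hypothetical, chosen to match the TTM example:

```python
# Hypothetical Control-step check: a change counts as successful only if the
# primary metric budged by at least the target AND every indirect metric
# stayed inside its boundary. All names and numbers are illustrative.

def change_successful(before, after, target_improvement, boundaries):
    """before/after: metric -> value; boundaries: indirect metric -> (lo, hi)."""
    primary_ok = (before["time-to-market"] - after["time-to-market"]) >= target_improvement
    indirect_ok = all(lo <= after[m] <= hi for m, (lo, hi) in boundaries.items())
    return primary_ok and indirect_ok

before = {"time-to-market": 30}                # days before the change
after = {"time-to-market": 24, "overtime": 3}  # days / extra hours after
print(change_successful(before, after, target_improvement=5,
                        boundaries={"overtime": (0, 4)}))  # -> True
```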

Metrics are thus always bounded to a specific problem you would like to address.


Pitfalls to avoid

Getting metrics systems completely right is challenging, and many organisations struggle with it.

Incomplete metric systems

The most common problem is that we only define primary metrics, which paves the way for building Cobra Farms: we improve one thing at the expense of another, possibly creating an even bigger problem that we just didn't notice.


Red Herring metrics

Another issue is confusing outcomes with indicators. This is often associated with a Cobra Farm, but from another angle: we fail to address the actual problem and instead optimize the metric itself.

For example, if management wants to reduce the number of reported defects, the easiest change is to deactivate the defect reporting tool. That reduces the number of defect reports, but doesn't improve quality.

This is also known as Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.


Vanity metrics

It's a human tendency to want to feel good about something, and metrics can serve that basic need. For example, we might track the number of hours worked per week. That metric constantly goes up, and it always hits the target. But it's not valuable: it tells us nothing about the quality or value of the work done.


Uncontrolled metrics ("waste")

We often collect data just in case - and don't connect any action trigger to it. Take deployment duration, for example: it's a standard metric provided by CI/CD tools, but in many teams, nothing happens when the numbers rocket skyward. There are no boundaries, no controls, and no actions related to the metric. If we don't act upon the data available, the data might as well not exist.
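A minimal sketch of connecting an action trigger to such a metric - the threshold and alert mechanism are assumptions for illustration:

```python
# Boundary plus action trigger for a collected metric, so the data doesn't
# sit unused. The limit and the alert callable are illustrative assumptions.

DEPLOY_DURATION_LIMIT_S = 600  # boundary: a deploy should finish within 10 minutes

def check_deploy_duration(duration_s, alert):
    """Invoke the alert action whenever the metric crosses its boundary."""
    if duration_s > DEPLOY_DURATION_LIMIT_S:
        alert(f"Deployment took {duration_s}s, above the {DEPLOY_DURATION_LIMIT_S}s limit")
        return False
    return True

check_deploy_duration(480, alert=print)  # within bounds, no action taken
check_deploy_duration(900, alert=print)  # triggers the alert
```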


Bad data

Sometimes, we have the right metric, but we're collecting the wrong data, or collecting it in the wrong way. That could range anywhere from having the wrong scale (e.g. measuring transaction duration in minutes when we should measure in milliseconds - or vice versa) to having the wrong unit (e.g. measuring customer satisfaction in number of likes instead of NPS) to having the wrong measurement point (e.g. measuring lead time from "start work" instead of from "request incoming").
Such data will lead us to draw wrong conclusions - and any of our metrics could suffer from this.


Category errors

Metrics serve a purpose, and they are defined in a context. To use the same metrics in a different context leads to absurd conclusions. For example, if team A is doing maintenance work and team B is doing new product development: team A will find it much easier to predict scope and workload, but to say that B should learn from A would be a category error.


Outdated metrics

Since we're talking about metric systems rather than individual metrics: when the organizational system on which we measure changes, our metrics may no longer make sense. Frequently revisiting our measurement system, and discarding or adjusting metrics which no longer make sense, is essential to keep our metrics system relevant.

Friday, December 4, 2020

Test Coverage Matrix

Whether you're transitioning towards agile ways of working on a Legacy platform or intend to step up your testing game for a larger system developed in an agile fashion, at some point, it pays to set up a Coverage Matrix to see where it pays to invest effort - and where it doesn't.



Before we start

First things first: the purpose of an agile Coverage Matrix isn't the same as a traditional project-style coverage matrix that's mostly concerned with getting the next release shipped. I don't intend to introduce a mechanism that adds significant overhead with little value, but to give you a means of starting the right discussions at the right time and to help you think in a specific direction. Caveat emptor: It's up to you to figure out how far you want to go down each of the rabbit holes. "Start really simple and incrementally improve" is good advice here!

What I'm proposing in this article will sound familiar to the apt Six Sigma practitioner as a simplified modification of the method, "Quality Function Deployment." And that's no coincidence.


Coverage Characteristics

Based on ISO/IEC 9126, quality characteristics can be grouped into Functionality, Reliability, Usability, Efficiency, Maintainability and Portability. These provide good guidance.

To simplify matters, I like to start the initial discussion by labelling the columns of the matrix:

  • Functionality ("Happy Cases")
  • Reliability ("Unhappy Cases")
  • Integration
  • Performance
  • Compliance
  • UX
Of course, we could clarify a lot more about what each of these areas means, but let's provide some leeway for the first round of discussion here. The most important thing is that everyone in the room has an aligned understanding of what these are supposed to mean.
If you are in the mood for some over-engineering, add subcategories for each coverage characteristic, such as splitting Performance into "efficiency", "speed", "scalability", "stress resilience" etc. That will bloat up the matrix and may make it more appropriate to flip rows and columns on the matrix.

Test Areas

Defining test areas falls into multiple categories, which correlate to the "Automation Test Pyramid". 

  • User journeys
  • Data flow
  • Architectural structure
  • Code behaviour
There are other kinds of test areas, such as validation of learning hypotheses around value and human behaviour, but let's ignore these here. Let's make a strong assumption that we know what "the right thing" is, and we just want to test that "we have things right." Otherwise, we'd open a can of worms here. You're free to also cover these, adding the respective complexity.


Functional areas

In each test area, you will find different functional areas, which strongly depend on what your product looks like.

User journeys

There are different user journeys with different touchpoints how your user interacts with your product. 

For example, a simple video player app might have one user flow for free-to-play, another for registration, another for premium top-up, and another for GDPR-compliant deregistration, as well as various flows such as "continue watching my last video" or "download for offline viewing". These flows don't care what happens technically.


Data flow

Take a look at how the data flows through your system as certain processes get executed. Every technical flow should be consistent end-to-end.

For example, when you buy a product online, the user just presses "Purchase", and a few milliseconds later they get a message like "Thank you for your order." The magic that happens in between is make-or-break for your product, but entirely irrelevant to the user. In our example, that might mean that the system needs to make a purchase reservation, validate the user's identity and their payment information, conduct a payment transaction, turn the reservation into an order, ensure the order gets fulfilled, and so on. If a single step in this flow breaks, the outcome could be an economic disaster. Such tests can become a nightmare in microservice environments where the flows were never mapped out.


Architectural structure

Similar to technical flow, there are multiple ways in which a transaction can occur: it can happen inside one component (e.g. frontend rendering), it can span a group of components (e.g. frontend / backend / database) or even a cluster (e.g. billing service, payment service, fulfilment service) and in the worst case, multiple ecosystems consisting of multiple services spanning multiple enterprises (e.g. Google Account, Amazon Fulfilment, Salesforce CRM, Tableau Analytics).

In architectural flow, you could list the components and their key partner interfaces. For example:

  • User Management
    • CRM API
    • DWH API
  • Payment 
    • Order API
    • Billing API

Architectural flow is important in the sense that you need to ensure that all relevant product components and their interactions are covered.

You can simplify this by first listing the relevant architectural components, and only drilling down further if you have identified a relevant hotspot.
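One way to use such a component list is a simple cross-check that every listed interface is touched by at least one test. The component names mirror the example above, while the test inventory is invented:

```python
# Cross-check: which listed component interfaces are not covered by any test?
# The interface list mirrors the example above; the tests are hypothetical.

interfaces = {
    "User Management": ["CRM API", "DWH API"],
    "Payment": ["Order API", "Billing API"],
}

tests_touching = {
    "test_create_user_syncs_crm": ["CRM API"],
    "test_order_triggers_billing": ["Order API", "Billing API"],
}

covered = {i for touched in tests_touching.values() for i in touched}
uncovered = [(comp, i) for comp, ifs in interfaces.items()
             for i in ifs if i not in covered]
print(uncovered)  # -> [('User Management', 'DWH API')] : a coverage hotspot
```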


Code behaviour

At the lowest level is always the unit test, and different components tend to have different levels of coverage - are you testing class coverage, line coverage, statement coverage, branch coverage - and what else? Clean Code? Suit yourself.

Since you can't list every single behaviour of the code that you'd want to test for without turning a Coverage Matrix into a copy of your source code, you'll want to focus on stuff that really matters: do we think there's a need to do something?


Bringing the areas together

There are dependencies between the areas - you can't have a user flow without technical flow, you won't have technical flow without architectural flow, and you won't have architectural flow without code behaviour. Preferably, you don't need to test for certain user flows at all, because the technical and architectural flows already cover everything. 

If you can relate the different areas with each other, you may learn that you're duplicating effort or missing key factors.


Section Weight

For each row, for each column, assign a value on how important this topic is. 

For example, take the user journey "Register new account." How important do you think it is to have the happy path automated? Do you think the negative case is also important? Does this have an impact on other components, e.g. would the user get SSO capability across multiple products? Can you deal with 1000 simultaneous registrations? Is the process secure and GDPR-compliant? Are users happy with their experience?

You will quickly discover that certain rows and columns are "mission critical", so mark them in red. Others will turn out to be "entirely out-of-scope", such as testing UX on a backend service, so mark them gray. Others will be "basically relevant" (green) or "Important" (yellow).

As a result, you end up with a color-coded matrix.

The key discussion that should happen here is whether the colors are appropriate. An entirely red matrix is as unfeasible as an entirely gray matrix.


A sample row: Mission critical, important, relevant and irrelevant



Reliability Level

As the fourth activity, focus on the red and yellow cells, take a look at how well you're doing in each area, and assign a number from 0 to 10 with this rough guidance:

  • 0 - We're doing nothing, but know we must.
  • 3 - We know that we should do more here.
  • 5 - We've got this covered, but with gaps.
  • 7 - We're doing okay here.
  • 10 - We're using an optimized, aligned, standardized, sustainable approach here.

As a consequence, the red and yellow cells should look like this:

A sample matrix with four weighted fields.

As you would probably guess by now, the next step for discussion would be to look at the big picture and ask, "What do we do with that now?"


The Matrix

Row Aggregate

For each row, figure out what the majority color in that row is, and use it as the color of the row. Next, add up all the numbers; this gives you a total for the row.

This will give you an indicator which row is most important to address - the ones in red, with the lowest number.

The Row Aggregate


Column Aggregate

You can use the same approach for the columns, and you will discover which test type is covered best. I would be amazed if Unhappy Path or Compliance didn't turn out to have poor coverage when you first do this exercise, but the real question is again: which of the red columns has the lowest number?


The Column aggregate
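The row and column aggregates can be computed mechanically. Here's an illustrative sketch with made-up scores, where None marks an out-of-scope (gray) cell:

```python
# Illustrative row/column aggregation for a coverage matrix. Each cell holds
# a 0-10 reliability score; None marks an out-of-scope cell. All values are
# invented for the example.

matrix = {
    "Register new account": {"Happy Cases": 7, "Unhappy Cases": 3, "Performance": 5},
    "Premium top-up":       {"Happy Cases": 5, "Unhappy Cases": 0, "Performance": None},
}

def row_totals(m):
    """Sum each row's scores; the lowest total marks the most urgent row."""
    return {row: sum(v for v in cells.values() if v is not None)
            for row, cells in m.items()}

def column_totals(m):
    """Sum each column's scores; the lowest total marks the least-covered test type."""
    totals = {}
    for cells in m.values():
        for col, v in cells.items():
            if v is not None:
                totals[col] = totals.get(col, 0) + v
    return totals

print(row_totals(matrix))     # -> {'Register new account': 15, 'Premium top-up': 5}
print(column_totals(matrix))  # -> {'Happy Cases': 12, 'Unhappy Cases': 3, 'Performance': 5}
```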



After conducting all the above activities, you should end up with a matrix that looks similar to this one:


A coverage matrix

Working with the Matrix

There is no single right approach to deciding whether to improve coverage for test objects or test types - the intent is to start a discussion about "the next sensible thing to do," which totally depends on your specific context.

As per our example, the question of "Should we discuss the badly covered topic Performance which isn't the most important thing, or should we cover the topic of architectural flow?" has no correct answer - you could end up with different groups of people working hand in hand to improve both of these, or you could focus on either one.



How-To Use

You can facilitate discussions with this matrix by inviting different groups of interest - business people, product people, architects, developers, testers - and start a discussion on "Are we testing the right things right, and where or how could we improve most effectively?"

Modifications 

You can modify this matrix in whatever way you see fit: different categories for rows or columns, drill-in, drill-across - all are valid.

For example, you could look at only functional tests on user journeys and start listing the different journeys, or you could explicitly look at different ways of approaching happy path tests (e.g., focusing on covering various suppliers, inputs, processing, outputs or consumers).

KISS 

This method looks super complicated if you list out all potential scenarios and all potential test types - you'd take months to set up the board without even having a conversation. Don't. First identify the 3-4 most critical rows and columns, and take the conversation from there. Drill in only when necessary and only where it makes sense.




Tuesday, November 17, 2020

Is all development work innovation? No.

In the Enterprise world, a huge portion of development work isn't all that innovative. A lot of it is merely putting existing knowledge into code. So what does that mean for our approach?

In my Six Sigma days, we used a method called "ICRA" to design high quality solutions.


Technically, this process was a funnel, reducing degrees of freedom as time progressed. We can reasonably argue about whether such a funnel is (always) appropriate in software development, or whether it's a better mental model to consider that all of these activities run in parallel at varying degrees (but that's a red herring). I would like to salvage the acronym to discriminate between four different types of development activity:

  • Innovate - Fundamental changes, or the creation of new knowledge to determine which problem to solve in what way, potentially generating a range of feasible possibilities. Example: creating a new capability, such as "automated user profiling" to learn about target audiences.
  • Configure - Choosing solutions to well-defined problems from a range of known options; could include cherry-picking and combining known solutions. Example: using a cookie-cutter template to design the new company website.
  • Realize - Both problem and solution are known; the rest is "just work", potentially lots of it. Example: integrating a 3rd-party payment API into an online shop.
  • Attenuate - Minor tweaks and adjustments to optimize a known solution or process; the key paradigm is "reduce and simplify". Example: adding a validation rule or removing redundant information.

Why this is important

Think about how you're developing: across the four activities, the probability of failure - and hence the predictable amount of scrap and rework - decreases. Accordingly, the way you manage each activity differs. A predictable, strict, regulated, failsafe procedure would be problematic during innovation and highly useful during attenuation - you don't want everything to explode when you add a single line of code to an otherwise stable system, whereas destabilizing the status quo to create a new, better future might actually be a desirable outcome of innovation.

I am not writing this to tell you "This is how you must work in this or that activity." Instead, I would invite you to ponder which patterns are helpful and efficient - and which are misleading or wasteful in context. 

By reflecting on which of the four activities you're engaged in, and on the most appropriate patterns for each of them, you may find significant change potential both for your team and for your organization - to "discover better ways of working by doing it and helping others do it."


Wednesday, January 21, 2015

Agile process models, Six Sigma and Cynefin



Back in my days as Six Sigma Master Black Belt, I learned and taught that every process can be modelled as a function. We write, on an abstract level, "y=f(x)". We use tools like SIPOC and Ishikawa Diagrams ("Fishbone") to model the relationship between inputs and outputs.

Since most processes have some variation, we accept that in practice the function really means "y=f(x+e)", where e is an error term within certain tolerance limits, typically impacting the accuracy of the model by no more than 5%. Because e is so small and we can't control it anyway, we ignore it in the model.
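To illustrate why e can be ignored, here's a toy simulation of "y=f(x+e)" with a small, centred error term; all numbers are made up for the example:

```python
import random

def process(x, e_sigma=0.02):
    """f doubles its input; e is small random noise within tight tolerance."""
    e = random.gauss(0, e_sigma)  # the error term
    return 2 * (x + e)

random.seed(42)
outputs = [process(10.0) for _ in range(1000)]
mean = sum(outputs) / len(outputs)
# Because e is small and centred on zero, outputs cluster tightly around 20,
# which is why the model can safely ignore e.
print(round(mean, 1))  # -> 20.0
```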

But what do you do when y=f(x) is simply an invalid assumption?

That's where agility comes into play.

Simple

The first difference between product manufacturing and software development is that product manufacturing is reproducible and repeatable. You know the exact process to apply to certain inputs to get the predetermined output. This is the "Simple" domain in terms of the Cynefin framework.

But nobody in their right mind would build the same software the same way twice. Rather, if you already have it, you'd reuse it - or make some changes. Regardless, the output is not identical.

Complicated

What typically happens is that once a customer (user) sees a piece of software, they get ideas of what else they need. In this case, it's more like "y=f(x,y)": Given a result, the customer defines the new result. The process isn't reproducible anymore, because the output of each stage is input again. As a Six Sigma practitioner, I already had some issues with outputs that were process inputs: SIPOC becomes confusing, but it was workable.

At this point, it makes sense to work incrementally, because you can't know the final output until you have seen intermediate output. This is why agile frameworks typically apply backlogs rather than finalized requirement specification documents. We accept that the user may need to gain some understanding of the system to decide what has value and what doesn't. We also accept that the user may change their opinion on anything that has only been a vague idea yet.
Failure to implement checks and balances will move the team into the Complex domain within a couple of weeks. Validation is not a one-time initial activity; it must actually become stricter throughout the project, lest errors accumulate and disorder leads straight into a Chaotic environment.
There is no way to reach the Simple domain, unless you acquire a good crystal ball that lets you accurately predict the future. (If you have that, you probably shouldn't develop software but play the stock market.)

Complex

You get into a mess when you can only assume that the customer likes the result, but can't know for certain whether they do. The definition of quality may be unavailable until after the result is available to users. This scenario is fairly typical in web development, so as developers we have to guess what customers will accept. When users don't like something, we made an error - and there's no way of knowing it was an error until we deployed the feature. We have to live with the fact that "y=f(x,y,e)": we can't eliminate the error term from our model; rather, we have to accommodate the ever-present risk. We can only try to minimize risk by getting frequent feedback. The more time between two deployments and the more content within a single deployment, the more likely we made some critical error in the user experience.
Processes like A-B testing and continuous delivery become critical success factors. 
While you cannot completely eliminate the randomness, creating a high-speed feedback mechanism - such as actively involving users in every minor detail produced - minimizes the effect of errors and effectively permits working similarly to a Complicated environment.
The absence of control processes which deal with randomness may quickly shift the team into the Chaotic domain.

Chaotic

The worst thing that can happen is that yesterday's truth may be today's error. Customers may change their preferences on a whim. Imagine you work in a project where yesterday, you were requested to build a web portal for e-commerce, and today the customer wants a document management system instead. Any plan is invalidated when you can't rely on your requirements, user stories or feature requests - or whatever you call them. Your model becomes a form of "y=f(e)", where neither x, your input, nor y, your previous output, are relevant indicators of success or failure.
This is where Waterfall projects with long planning phases may drift: by the time the plan is realized, demand has shifted due to factors outside the project's sphere of control. An example would be a Waterfall team building a well-defined, perfectly fine PHP platform over the past two years, meeting all business requirements, only to find out that the newly hired CTO has just announced that all newly launched systems must be implemented in pure Java.
The only good news about the Chaotic domain is: You don't have to be afraid that it gets worse. Everything the team does may be waste.
The best way to deal with the Chaotic domain is to move away from it by delivering iteratively, minimizing the effect of uncontrollable randomness on results. Frequent releases - at intervals of months rather than years - provide the opportunity to move into the Complex domain.



Conclusion

The nicest statistical process model doesn't help when there is no feasible way to keep error terms under control, or when any model you come up with has relevant variables which cannot be controlled. Six Sigma techniques for statistically modelling processes are useful when the process is repeatable and reproducible (good R&R), but they won't get you far in a domain where R&R is neither given nor desired.