
Monday, October 27, 2025

AI Estimation - helping you understand Architectural Quality

Artificial Intelligence can be a blessing and a curse. While developing VXS, I have settled into a personal reflection exercise:
At the end of every working day, I tell my IDE AI to "please explain all the changes we have made today."
Then, I feed the output into ChatGPT and ask it, "Please provide a conservative estimate for this backlog."
The result is usually something like this:

The Evening Ritual

Every evening I close the laptop with a simple ritual. I ask my IDE's assistant to explain, in plain language, everything that changed today: which modules moved, which interfaces shifted, which tests were added or deleted, where the seams tightened or tore. It replies with a neutral, factual log. No heroics, no storytelling - just the anatomy of the day's code work.

I take that log and hand it to an external estimator (I use ChatGPT) with a fixed prompt: "Give me a conservative effort estimate for the following backlog:" The prompt never changes. The intent never changes. Only the code does. And I'm not asking for a plan or a deadline. I'm asking for a mirror.

The estimator returns a number and, more importantly, a decomposition of why the number is what it is. I archive the result alongside the day's log. Tomorrow I'll do it again. Over weeks, a pattern emerges: sometimes the estimated effort grows; sometimes it shrinks; sometimes it holds steady. The numbers are not my speedometer. They're my oil pressure gauge.

How I do it

Consistency makes this work: I use the same source description (the IDE's explanatory log) and the same estimator (same prompt, same assistant). The only variation I tolerate is categorization: I separate changes into user-facing behaviour (anything with user impact) and technical behaviour (anything affecting structure or technical arrangements rather than users). The point is categorical comparability, not theatrics.

My git commit logs are the source of truth. I don't talk about work that is not visible in the codebase, so anything that doesn't lead to a technical change is excluded - and that's on purpose. We will get to that soon.
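A minimal sketch of pulling such a factual daily log straight from git (the diary path and file layout are my illustration, not a prescription):

#!/bin/sh
# Capture today's commits as the raw material for the estimator.
mkdir -p ~/dev-diary
DIARY=~/dev-diary/$(date +%F).log            # one file per day
git log --since=midnight --stat > "$DIARY"   # commits, files touched, churn
cat "$DIARY"                                 # paste this into the fixed prompt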

I never doubt the estimate. I never ask for a second opinion unless ChatGPT made obviously wrong assumptions. I don't think I'm a hero if the number is high. Nobody knows the real numbers - except me and my dev diary. I am interested in a single question: "What did today's work say about the architecture's ability to accept change?"

Why I do it

Most development "estimation" is theater. I don't have time for theater. I am not in a situation where it matters how long something takes - what matters is that it's done, and it works. I don't care how fast other people work - I care what I accomplish. I don't care what is easy or difficult - I care for the results I produce. So the estimates themselves are not the value.

And I don't care for quantity of output - what I am concerned with is malleability - the rate at which the system can absorb change. This one property tells me whether the system is getting better or deteriorating over time. The result might look like this:

[Chart: How AI Estimates Help You Understand Architectural Quality - a thirty-point line of AI-estimated effort (person-days) for one day of development work, plotted over time; strong variance with an overall rising trend, interpreted as improving architectural leverage.]

Interpreting AI estimates as "architectural telemetry" flips the estimation narrative on its head. When the estimates rise over time, the system is absorbing larger changes than before: the boundaries are clear, the tests reliably catch issues, the code is adaptive, and the coupling between components is honest. A decline in estimates, by contrast, points to decay: poor modularity, Clean Code violations, WET code, "unidentifiable bugs," unfixable tests, non-working features. These all steal my time and result in fewer changes integrated over the course of the day.

So what I end up with is not a developer performance metric - it's an architectural quality metric, and an extremely powerful indicator of Technical Debt. This measure is much more connected to business value than most conventional metrics: if I can do 3 days of work in 1 day, my engineering efforts are highly valuable. If it were half a day, I'd better throw away the entire codebase and start from scratch.

This routine also helps me check cognitive drift. Immersion blurs judgment. The AI, as an external estimator given a fixed description, stays objective enough to challenge my gut. It is neutral and unbiased, with no incentive to make me feel good or bad about my work. It is pure consequence, which removes emotional justification from the process.

How you can use this

You do not need a specific tool to adopt this. All you need is a rhythm: a factual daily log, a consistent external estimator, and a small note about your architectural intent. Done steadily, the series becomes a quiet diagnostic of design health. You will notice when your system starts resisting meaningful change, and you will see the payoff of investments into code quality long before defect curves become visible.

There's a temptation to optimize the number, and managers may be tempted to chase high numbers. But Goodhart's Law is inevitable if you try: the entire approach only works because it's not about making numbers go up. The number is just a mirror. And it helps you reflect on key design questions: "Why was this so hard, when AI said it should be easy?" - "What can we do to make it easier without getting ourselves into a pickle tomorrow?" - "What prevents or causes deterioration of change integration?" The answers will relentlessly point out the weak spots in both your architecture and your development approach.

Over time, you will notice the codebase becoming calmer. Working this way and asking these questions frequently leads to Relentless Improvement. No tests? You did poorly. Tight coupling? You blew the estimate. Hardcoding? You can't keep it. And that's the value of the exercise: a leading indicator of architectural quality that you can gather with nothing more than honesty and repetition.

Tips & Tricks

When the estimates of today's work increase after a deliberate cleanup, you're on the right track. Keep going. You've unlocked leverage: the same unit of intent now reaches farther into the system.

When estimates decrease without a special cause (such as a week full of meetings or sick days) - stop and look for debt: hidden coupling, missing tests, undocumented contracts.

When estimates stay constant while scope expands, confirm that your boundaries are carrying their weight; if they are, you've found equilibrium worth defending.

The point of the AI Estimation metric is not to go faster. The point is to stay flexible. True agility in software development is "the ability to achieve more tomorrow than you could today, with less sweat or fear." When used in this way, AI estimates will tell you whether you're creating or losing that freedom.

Tuesday, August 16, 2022

TOP Structure - the Technology Domain

Too often, organizations reduce the technical aspect of software development to coding and delivering features.

This, however, betrays a company that hasn't understood digital work yet, and it raises the question of who, if anyone, is taking care of:

Technology is the pillar of software development that might be hidden in plain sight

Engineering

Are you engineering your software properly, or just churning out code? Are you looking only at the bit of work to be done, or how it fits into the bigger picture? Do you apply scientific principles for the discovery of smart, effective and efficient solutions? How do you ensure that your solutions aren't just makeshift, but will withstand the test of time?

Automation

What do you turn into code? Only the requirement, or also things that make your work easier, raise quality, and lower the chance of failure? Do you invest in improving the automation of your quality assurance, your build processes, your deployment pipeline, your configuration management - even your IDE? How much work that a machine could do is your company still doing by hand, and how much does that cost you over the year - including all of those "oops" moments?

Monitoring

Once you've delivered something - how do you know that it works, works correctly, is being used, is being used correctly, has no side effects, and is as valuable as you thought it would be? Do you make telemetry a standard part of your applications, or do you have reasons for remaining ignorant about how your software performs in the real world?


All of the items above cost time and require skills. Are you planning sufficient capacity to do these things in your work, or are you accumulating technical debt at a meta level?

Think for a minute: How well does your team balance technological needs and opportunities with product and organizational requirements?

Friday, July 22, 2022

U-Curve Optimization doesn't apply to deployments!

Maybe you have seen this model as a suggestion for how to determine the optimum batch size for deployments in software development? It's being propagated, among other places, on the official SAFe website - unfortunately, it starts people off on the wrong foot and nudges them toward doing the wrong thing. Hence, I'd like to correct this model -


In essence, it states that "if you have high transaction costs for your deployments, you shouldn't deploy too often - wait for the point where the cost of delay is higher than the cost of a deployment." That makes sense, doesn't it?

The cause of Big Batches

Well - what's wrong with the model is the curve. Let's take a look at what it really looks like:


The difference

It's true that holding costs increase over time, but so do transaction costs. And they increase non-linearly. Anyone who has ever worked in IT will confirm that making a huge, massive change isn't faster, easier or cheaper than making a small change.

The effort of making a deployment is usually unrelated to the number of new features included in it - the effort is determined by the amount of quality control, governance and operational activity required to put a package into production. Again, experience tells us that bigger batches don't reduce the effort for QC, documentation or operations. If anything, this effort is required less often - but bigger batches typically require more tests, more documentation and more operational activity each time, and the probability of Incidents rises astronomically, which we can't exclude from the cost of change if we're halfway honest.

Metaphorically, the U-Curve graph reads like, "If exercise is tiresome, exercise less often - then you won't get tired so often." By that logic, the optimum amount of exercise is walking to the door to receive the pizza order - and if the trip to the door is too exhausting, order half a dozen pizzas at once and just eat cold pizza for a few days.

Turning back from metaphors to the world of software deployment: it's true that for some organizations, the cost of transaction exceeds the cost of holding. This means the value produced but unavailable to users is lower than the cost of making that value available - the company is losing money while IT sits on undeployed, "finished" software. The solution, of course, can't be to wait even longer before deploying and lose even more money - even if that's what many IT departments do.

Contrary to what the model shows, the optimum batch size isn't reached when the company is stuck between a rock and a hard place - at the point where the amount of money lost by not deploying is so big that it's worth spending a ton of money on making a deployment.


The mess

Let's look at some real world numbers from clients I have worked with. 

As I hinted, some companies have complex, cumbersome deployment processes that tie up dozens of people for weeks, easily costing $50000+ for a single new version. Due to the sheer amount of time and money involved, this process obviously happens as rarely as possible. Usually, these companies celebrate it as a success when they manage to go from quarterly releases to semiannual releases. But what happens to the value of the software in the meantime?

Assuming that the software produced is worth the cost of production (because if it wasn't, why build it to begin with) - if the monthly cost of development is $100k, then a quarterly release frequency means the holding cost is already at $300k, and it rises to over half a million for semiannual releases.

Given that calculation, the model suggests the optimal deployment frequency is reached when the holding cost hits the $50k transaction cost - that is, two deployments per month. That doesn't make sense, however: two deployments at $50k each per month means 100% of the budget would flow into deployment - of nothing.
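Spelled out as a quick calculation - the model's own flawed logic, using the figures above:

# Break-even per the U-curve model's own logic
monthly_dev_cost=100000   # $ of development per month
deploy_cost=50000         # $ transaction cost per deployment
# Deploy whenever accumulated holding cost equals the transaction cost:
deploys_per_month=$(( monthly_dev_cost / deploy_cost ))   # = 2
deploy_spend=$(( deploys_per_month * deploy_cost ))       # = 100000
echo "$deploys_per_month deployments/month, costing \$$deploy_spend - the entire budget"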

Thus, the downward spiral begins: fewer deployments, more value lost, declining business case, pressure to deliver more, more defects, higher cost of failure, more governance, higher cost of deployments, fewer deployments ... race to the bottom!


The solution

So, how do we break free from this death spiral?

Simple: when you're playing a losing game, change the rules.

The mental model - deployments are costly, so we should optimize our batch size and deploy only once the holding cost outweighs the cost of deployment - is flawed. We are in that situation because we have the wrong processes to begin with. We can't keep these processes. We need processes that dramatically reduce our deployment costs:


The cost of Continuous Deployment

Again, using real world data from a different client of mine: 

This development organization had a KPI on deployment costs, and they were constantly working on making deployments more reliable, easier and faster. 

Can you guess what their figures were? Given that I have anchored you at $50k before, you might think they had optimized the process down to maybe $5000 or $3000.
No! If you think so, you're off by so many orders of magnitude that it's already funny.

I attended one of their feedback events, where they reported that they had brought the average deployment cost down from $0.09 to $0.073. Yes - less than a dime!

This company made over 1000 deployments per day, so they were spending $73 a day, or $1460 a month, on deployments. Accumulated over the whole period, they were still spending over $5000 for three months' worth of software development - but the transaction cost of each single deployment is ridiculously low.

Tell me of anything in software where the holding cost is lower than 7 cents - and then tell me why we are building that thing at all. Literally: 7 cents is mere seconds of developer time!

With a Continuous Deployment process like this, anything that's worth enough for a developer to reach for their keyboard is worth deploying without delay!

And that's the key message why the U-Curve optimization model is flawed:

Anything worth developing is worth deploying immediately.

When the cost of a single deployment is so high that things already developed aren't worth deploying immediately, you need to improve your CI/CD processes - not figure out how big to make the batch.

If your processes, architecture, infrastructure or practices don't permit Continuous Deployment, the correct solution is to figure out which changes you need to make so that you can continuously deploy.







Friday, March 11, 2022

Test Automation Specialists - what a waste!

"Test Automation Engineers require someone who will write the test cases for them, and they can start automating right after the software is deployed onto the test environment."

It's 2022, and I just had this conversation yesterday. Seriously?


Overspecialization leads to Waste

If we work like that, we needlessly create bottlenecks, we test too late, our tests will be of low quality, and we will spend a lot of time on identifying, tracking and reworking defective deliveries. We produce a lot of work, but not quality.

In a well-functioning software organization, there can be no such role as a "test automation engineer" who is fed work by "test designers." 

Although it sounds "resource efficient," that approach generates massive amounts of waste across all categories of Lean Muda:

Defects

When automated tests aren't written by developers, and only available once a software package is released, we end up with three kinds of defects:

  1. Product defects, i.e. things the product should (or shouldn't) do, but that weren't correctly implemented, which require another software update.
  2. Automation defects, i.e. things the test automation should (or shouldn't) do, but that weren't correctly implemented. Until clarified, these lead to significant communication overhead.
  3. False negatives, i.e. inadequate automation not catching existing defects. This is significantly more likely when automation isn't written close to the code, and might omit dangerous cliffs in the product implementation.

Handover

Whenever actors in a process change, there's a handover. At the point of handover, we induce delay, lose information and risk introducing defects. This makes our process slow and error-prone. When testing itself becomes a source of error, we should stop and think.


Motion

Handovers also mean that we're moving artifacts - test descriptions, build versions and defects - without need. We also need to coordinate this motion, which induces additional meetings into our process, slowing us down further and adding non-value-added overhead. Motion also leads to inventory waste.


Inventory

As previously mentioned, un-automated tests pile up as inventory, as do un-tested deliveries. When tests fail, the defect notifications arrive at a point in time when developers are already working on other things. Hence, we are adding three queues to our process. Each queue contains work waiting to be done, and the combined length of all these queues is value delay for our customers.


Waiting

As mentioned in my article on CI/CD industry benchmarks, the time between a developer making a change and receiving feedback on whether everything is correct shouldn't exceed a few minutes - preferably just a few seconds. If automation only begins upon delivery, this timeframe is impossible to guarantee. The delivered product has to wait a significant amount of time before it can produce customer value, and developers will have turned their attention elsewhere.


Overprocessing

Having different people automate tests and write production code leads to "overprocessing," often called "gold-plating": As mentioned in my article about the test pyramid, we should automate as close to the code as possible. When we write high-level tests that could instead be low-level tests, we make our test suite more complex, slower and less reliable than necessary. Unit tests are extremely quick both to write and to execute; test automation engineers who automate black-box/gray-box tests spend significantly more effort to write tests with poorer outcomes.
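To make the contrast concrete, a toy sketch (shell; the function and the endpoint are invented for illustration):

# Unit level: the function is exercised directly and runs in milliseconds.
add() { echo $(( $1 + $2 )); }

test_add()
{
  [ "$(add 2 3)" = "5" ] || { echo "FAIL: add"; return 1; }
  echo "PASS: add"
}
test_add

# Gray-box equivalent: deploy the whole application, then probe it over HTTP -
# orders of magnitude slower, and flaky for reasons unrelated to the logic:
#   curl -s "http://test-env.example/api/add?a=2&b=3"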


Overproduction

Since the amounts of labor required to create a product increment in code and in test automation are extremely hard to balance, the strategy of automating after the fact usually leads to developed product features piling up, waiting to be automated. As a consequence, the decision is often made to release these without proper test automation coverage just to flush the waiting queue. This builds up "technical debt" which deteriorates product sustainability.


Human potential

When developing software with multiple people, we act as a team. We can learn from each other, and we can align: someone who can automate tests can also write features; someone who can write features can also automate. As a team, we can align during planning on "who does what," and fill our information gaps so that those doing the work have full information and don't need anyone else to fill in for them. Techniques like pair programming or code reviews ensure that even if one person makes a mistake, someone else will catch it before a defect even enters the system. It's a limiting belief to think that somehow people can't learn - we all can, if the conditions are favorable. Maybe not everyone will be the greatest developer, but we don't need to be: we work within our ability and grow our capacity a little every day.

To confine a test automation engineer to automating predefined test specifications, and to confine coders to writing code without understanding how to do so without defects, is an insult to our human intellect - and a very expensive one at that: all of the above wastes cost a lot of money that we could save if we just improved our collaboration, communication and skillset.


Lean Software Test Automation

Please, let me clarify: I don't believe that test automation expertise is a waste - the waste actually hides in the process, and in how we use our talent.

A much leaner approach, which avoids all of the above eight wastes, is to educate and empower: use QA to help developers learn how to define the right tests, how to apply processes like Test-Driven Development and Behaviour-Driven Development, and guide them on their journey to deliver Zero Defect Quality. And in reverse, have developers upskill our test automation experts so they can deliver integrated pieces of value, including working software.

The purpose of test automation in such an organization isn't to find defects - it's to maintain a high frequency of delivering high-quality products, and to serve as living documentation of all the product's behaviours, maintained as part of the development process without overhead.

Thursday, August 5, 2021

Continuous Integration Benchmark Metrics

 While Continuous Integration should be a professional software development standard by now, many organizations struggle to set it up in a way that actually works properly.

I've created a small infographic based on data taken from the CircleCI blog - to provide an overview of the key metrics you may want to control, and some figures on what the numbers should look like when benchmarked against industry performance:



The underlying data is from 2019, as I could not find data from 2021.

Key Metrics

First things first - if you're successfully validating your build on every single changed line of code and it just takes a few seconds to get feedback, tracking the individual steps would be overkill. The metrics described in this article are intended to help you locate improvement potential when you're not there yet.


Build Frequency

Build frequency is concerned with how often you integrate code from your local environment. That's important because the assumption that your local version of the code is correct and consistent with the rest of the team's work is just that - an assumption, and one that becomes less and less tenable as time passes.

By committing and creating a verified, valid build, you reset the timer on that assumption, thereby reducing the risk of future failure and rework.

A good rule of thumb is to build at least daily per team member - the elite validate their changes every couple of minutes! If you're not doing all of the following, you may have serious issues (a minimal illustration follows the list):

  • Commit isolated changes
  • Commit small changes
  • Validate the build on every single change instead of bulking up
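In practice, the rhythm can be as mundane as this (sketch; the commit message is a placeholder):

git add -p                                    # stage one isolated change
git commit -m "Extract rounding into helper"  # small, self-contained commit
git push                                      # every push triggers the pipeline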

Build Time

The amount of time it takes from a committed change until the pipeline has successfully completed - indicating that the build is valid and ready for deployment into production.

Some organizations go insanely fast, with the top projects averaging 2 seconds from commit all the way into production - and it seems to work for them. I have no insight into how much testing happens in that process - but hey, if their Mean Time to Restore (MTTR) on production failures is also just a couple of minutes, they have little to lose.

Now, let's talk about normal organizations: if you can go from Commit to Pass in about three and a half minutes, you're in the median range - half the organizations will still outperform you, half won't.

If you take longer than 28 minutes, you definitely have to improve - 95% of organizations can do better!


Build Failure Rate 

The percentage of committed changes causing a failure.

The specific root cause of the failure could be anything - build verification, compilation errors or test automation. No matter what, I'm amazed to learn that 30% of projects seem to have their engineering practice and IDE tooling so well under control that they don't have this problem at all - and that's great to hear.

If about a fifth of your changes cause a failure, you'd still pass as average - but if a third or more of your changes cause problems, you should look to improve quickly and drastically!


Pipeline Restoration Time

How long it takes to fix a problem in the pipeline.

Okay, failure happens. Not to everyone (see above), but to most. And when it does, you have failure demand - work that's only required because something failed. The top 10% of organizations recover from such a failure within 10 minutes or less, so they don't sweat much when something goes awry. If you can recover within the hour, you're still average.

From there, we quickly get into a hugely spread distribution - the median moves between 3 hours and 18 hours, and the bottom 5% take multiple days. The massive variation between 3 and 18 hours is easily explained: if you can't fix it before EOB, an entire night sits between issue and resolution.

Nightly builds, which were a pretty decent practice just a decade ago, would immediately put you at or below the median - not working between 6pm and 8am automatically pushes you above 12 hours, which already places you near the bottom.


First-time Fix Rate

Assuming you do have problems in the pipeline - which, as we saw, many don't - you occasionally need to provide a fix to return your pipeline to a green condition.
If you do CI well, the only potential problem is your latest commit. If you follow the rules on build frequency properly, the worst case scenario is reverting that change - and when you're not certain that your fix will work, reverting is the best thing you can do to return to a valid build state.
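In git terms, that worst case is two commands (sketch; your trunk may be named differently):

git revert --no-edit HEAD   # create an inverse commit; history stays intact
git push                    # the pipeline re-runs against the restored state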

Half the organizations seem to have this under control, while the bottom quartile still seems to enjoy a little bit of tinkering - with fixes being ineffective or leading to additional failures. 
If that's you, you have homework to do.


Deployment Frequency

The proof of the pudding: How often you successfully put an update into production.

Although Deployment Frequency is clearly located outside the core CI process, if you can't reliably and frequently deploy, you probably have issues you shouldn't have.

If you want to be great, aim to move from valid build to installed build many times a day. If you're content with average, once a day is probably fine. If you can't manage at least one deployment a week, your deployment process is scraping the bottom of the barrel, and you have definite room for improvement.

There are many root causes for low deployment frequency, though: technical issues, organizational issues or plain process issues. Depending on which they are, you're looking at an entirely different solution space: for example, improving technically won't help as long as your problem is an approval orgy with 17 different committees.


Conclusion

Continuous Integration is much more than having a pipeline.

Doing it well means:

  1. Integrating multiple times a day, preferably multiple times an hour
  2. Having such high quality that you can be pretty confident there are no failures in the process
  3. Not breaking a sweat when a failure does happen and you have to fix it
  4. Keeping your builds in a deployable condition at all times - with the deployment itself so safe and effortless that you can do it multiple times a day

Thousands of companies world-wide can do that already. What's stopping you?




Friday, June 14, 2019

Don't get fooled by Fake Agile

Whichever consultancy sells this model, DO NOT let them anywhere near your IT!

So, I've come across this visual buzzword bingo in three major corporations already, and all of them believe that this is "how to do Agile". And all of them are looking for "Agile Coaches" to help them implement this garbage.

This is still the same old silo thinking that never worked.

There is so much wrong with this model that it would take an entire series of articles to explain.
Long story short: all the heavy buzzwords on this image merely relabel the same old dysfunctional Waterfall. In bullet points:

  • "Design Thinking" isn't Design Thinking,
  • "Agile" isn't Agile,
  • "DevOps" isn't Devops. and 
  • Product Owners and Scrum Masters aren't, either.

Here is a video resource addressing this model:




I have no idea which consultancy is running around and selling this "Agile Factory Model" to corporations, but here's a piece of advice:
  • If someone is trying to sell this thing to you as "Agile", do what HR always does: "Thank you very much for coming. We will call you, do not call us ..."
  • If someone within your organization is proposing to implement this model, send them to a Scrum - or preferably LeSS - training!
  • If someone is actually taking steps to "implement" this thing within your organization, put them on a more helpful project. Counting the holes in the ceiling, for example.

If this has just saved you a couple million in consulting fees, you're welcome.

Wednesday, May 15, 2019

There's no such thing as a DevOps Department

"Can we set up a DevOps Department, and if so: How do we proceed in doing this?"
As the title reads, my take is: No.

Recently, I was in a conversation with a middle manager - let's call him Terry - whose organization had just recently started to move towards agility. He had a serious concern and asked for help. The following conversation ensued (with some creative license to protect the privacy of the individual):

The conversation


Terry: "I'm in charge of our upcoming DevOps department and have a question to you, as an experienced agile transformation expert: How big should the DevOps department be?"

Me: "I really don't know how to answer this, let me ask you some questions first."
Me: "What do you even mean by, 'DevOps Department'? What are they supposed to do?"

Terry: "You know, DevOps. Setting up dev tools, creating the CD pipeline, providing environments, deployment, monitoring. Regular DevOps tasks."

Me: "Why would you form a separate department for that and not let that be part of the Development teams?"

Terry: "Because if teams take care of their own tools and infrastructure, they will be using lots of different tools."

Me: "And that would be a problem, as in: how?"

Terry: "There would be lots of redundant instances of stuff like Jira and Sonar and we might be paying license fees multiple times."

Me: "And that would be a problem, as in: how?"

Terry: "Well, we need a DevOps department because otherwise, there will be lots of local optimization, extra expenses and confusion."

Me: "Do you actually observe any of these things on a significant scale?"

Terry: "Not yet, but there will be. But back to my question: How big should the department be, what's the best practice?"

Me: "Have you even researched the topic a little bit on the Internet yet?"

Terry: "I was looking how many percent of a project's budget should be assigned for the DevOps unit."
Terry: "And every resource I looked pretty much said the same thing, 'Don't do it.'.But that's not what I am looking for."

Me: "Sorry Terry, there is so much we need to clarify here that the only advice I can give you for the moment is to reflect WHY every internet resource you could find suggested to not do it."




Why not?

Maybe you are a Terry, looking for how to set up a DevOps department - so let me get into the places where I think Terry took a wrong turn.


  • Terry was still thinking in functional separation - classic Tayloristic silo structures - and instead of researching what agile organizations actually look like, he assumed agile structures work the same way.
  • Terry fell into another trap: not understanding what DevOps is, thinking it was only about tools and servers. In fact, DevOps is nothing more than removing the separation between Dev and Ops - a DevOps engineer is both Dev and Ops, not part of a separate department that does neither properly.
  • Terry presumed that "The Perfect DevOps toolbox" exists, whereas every system is different, every team is different, every customer is different. While certain tools are fairly wide-spread, it's up to the team to decide what they can work with best.
  • Terry presumed that centralization is by default an optimization. While both centralization and decentralization have trade-offs, centralization has a massive opportunity cost that needs to be warranted by a use case first!
  • Terry wanted to solve fictional problems. "Could, might, will" without observable evidence is probably the biggest source of wasted company resources. Oddly enough, DevOps is all about data-driven decision making and total, automated process control. A DevOps mindset would never permit major investments into undefined opportunities.
  • Terry was thinking in terms of classical project-based resource allocation and funding. That premise doesn't even make sense in an agile organization, and it's a major impediment to customer satisfaction and sustainable quality. DevOps is primarily a quality initiative, and when your basic organizational structure doesn't support quality, DevOps is just a buzzword.

Reasons for DevOps


Some more key problems that DevOps tries to address:
  • Organizational hoops required to get the tools developers need to do their job: Creating a different organizational hoop is just putting lipstick on a pig.
  • Developers don't know what's happening on Production: Any form of delegation doesn't solve the problem.
  • Handovers slowing down the process: It doesn't matter what the name of the handover party is, it's still a handover.
  • Manual activity and faulty documentation: Defects in the delivery chain are best prevented by making the people with an idea responsible for success. 
  • Lack of evidence: An organization with low traceability from cause to action will not improve this by further decoupling cause and action.

How-To: DevOps


  • If you have Dev and Ops, drop the labels and call all of them Developers. Oh - and don't replace it with the DevOps label, either!
  • Equip developers with the necessary knowledge to bear the responsibility of Operations activity.
  • Enable developers to use the required tools to bring every part of the environment fully under automated control (a small flavour of this follows the list),
  • Bring developers closer to the users, either by direct dialog or through data.
  • Move towards data-driven decision making on all levels: Task-wise, structurally, organizationally and strategically.
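To give a flavour of "fully under automated control" in its smallest form - a sketch assuming Docker is available, with a hypothetical image name:

# The runtime environment is described in code (a Dockerfile in the repo),
# so the same two commands work on a laptop and in the pipeline:
docker build -t myapp:latest .
docker run -d -p 8080:8080 myapp:latest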



And that is my take on the subject of setting up a DevOps department.




Wednesday, October 31, 2018

My discovery journey to FaaS

Moving from monolithic systems to autonomous services is more of a mindset challenge than a technical challenge, although depending on the architecture of a system, the latter can already be tremendous. In this article, I want to describe my own discovery journey which changed how I perceived software systems forever.


Introduction

I was working for a client on an Information Discovery project which gathered and evaluated data from a range of other platforms - essentially a Stovepipe Architecture - so we had brittle, incomplete data objects from a wide range of independent systems, all describing the same business object. There were two goals: first, link the data into a common view - and second, highlight inconsistencies for data purification. In effect, it was a fledgling Master Data Management system.



Stage 1: The Big Monolith

Although I knew about services, I never utilized them to their full potential. In stage one, my architecture looked something like this:




The central application ran ETL processes, wrote the data to its own local database, then produced a static HTML report which was sent to users via mail.
The good news is that this monolithic application was already built on Clean Code principles, so although I had a monolith, working the code was easy and the solution flexible.

The solution was well received - indeed, too well. The HTML reports were sent to various managers, who in turn sent them to their employees, who in turn sent them to their colleagues who collaborated on the relevant cleanup tasks. This was both a data protection and consistency nightmare - I couldn't trace where the data went and who was working with which version. So I moved from an HTML report to server-side rendering within the application.


Stage 2: Large Components

My application became two - an ETL component and a Rendering component, both built on the same database. I had solved the problems of inconsistent versions and access control:


But there was another issue: Not everyone needed the same data, and not everyone should see everything. There were different users with different needs. The Simplicity principle led me to client-side rendering: I would pass the data to the frontend, then let the user decide what they wanted to see.


Stage 3: The first REST API

I cut in a REST layer which would fetch the requested data from the backend based on URL parameters, then render it to the user in the browser:


This also solved another problem: I no longer needed to re-create the report in the backend whenever I changed the layout or arrangement. It enabled true Continuous Delivery - I could implement new representation-related user requests multiple times a day without having to change any data on the server. The cycle time came down to three to five relevant features per day, and so the report grew.
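From the outside, that stage-3 API boiled down to parameterized GETs along these lines (endpoint and parameter names invented for illustration):

curl -s "https://reports.internal.example/api/records?system=crm&fields=id,owner,status"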

Guess what happened next? People wanted to know what the actual data in the source system was. And that meant - there was more than one view.


Stage 4: Multiple Endpoints

Early attention to separable design and SOLID principles once again saved the day. I moved the Extraction Process out of the ETL application and into autonomous getter services: 



This meant my local database could lag behind the data users might see, but that was a price everyone was willing to pay - it was enough if the report was updated daily, as long as people had real-time access to the un-purified raw data. Another side benefit was that I pulled in pagination - increasing both the performance and the separability of the ETL process as a whole.

Loose coupling allowed me not only to have separate endpoints for each Getter - it allowed me to completely decouple each Getter from any central point: I could scale!

As the application grew in business impact, more data sources were added and additional use cases came in. (To keep the picture simple, I'm not going to add these additional sources - just imagine there are many more.)

Stage 5: Separate Use Cases

This modification was almost free - when the requirement came, I did it on a whim without even realizing what I had done until I saw the outcome:



Different user groups had different clients for different purposes. Each client needed only a subset of the available data - I could scale clients without modifying anything in the core architecture!

And finally, we had network security concerns, so we had to pull everything apart. This was a freebie from a software perspective, but not from an infrastructure perspective: I separated the repositories and the deployment processes.


Stage 6: Decentralization

In this final stage, it was impossible to still call it "one application": I ended up with a distributed system with multiple servers, each of them hosting only a small set of functions which were fully encapsulated and could be created, provisioned, updated and maintained separately. In effect, each of my functions was a service running independently of its underlying infrastructure:


And yes, this representation omits a few of the infrastructure troubles I had - most notably, availability and discovery. The good news is that once I had this pattern, adding a new service became a matter of minutes: copy and paste the infrastructure provisioning, then write the few lines of code that did what I wanted to achieve.



Conclusion

As I like to say, "The monolith in people's heads dies last". Although I knew of microservices, I used to build them in a very monolithic way: Single source repository, single database and single deployment process. This monolithic approach kept infrastructure maintenance efforts low and reduced the complexity of maintaining system integrity.

At the same time, this monolithic approach made it difficult to separate concerns or pull in the layers of abstraction needed to separate use cases, and it resulted in downtimes and a security nightmare: when you have access to the monolith, everything is fair game - both availability and confidentiality are fairly easy to compromise. The FaaS approach I discovered for myself allowed me to maximize these two core goals of data security with no additional software complexity.

Thanks to tools like Gitlab, Puppet, Docker and Kubernetes, the infrastructure complexity of migrating away from a monolith is manageable, and the benefits are worth it. And then, of course, there are clouds like AWS and Azure which make the FaaS transition almost effortless - once the code is ready.


Today, I would rather build 20 autonomous services with a handful of separate clients that I can maintain and update in seconds than a single monolith that causes me headaches on all levels, from infrastructure over testing all the way to fitness for user purpose.


This was my journey. 
Your mileage may vary.








Sunday, February 11, 2018

The Agile / DevOps ecosystem

The question, "How does an organization, especially one that is already using an agile framework, benefit from DevOps?" is very common. In this article, I provide a succinct answer to this question by examining the benefits provided by some standard frameworks.

The development lifecycle

Irrespective of which approach a company is using, it all starts with a customer need:
This need must be understood, so we need to invest some effort into understanding.
In Scrum, we call this "Refinement". It could also be called an Ideation or Prototyping process. Regardless of what you call it - it's essential to build the right product!


Once understood, our next job is to build something. Any kind of iterative approach encourages us to build just a little, then learn from that. We want to minimize our upfront planning efforts and jump right into doing delivery work. In Scrum, that's our timeboxed "Sprint".


The objective of delivery is to get something shipped right to the customer - we want to make the product available. This is often called a "Minimum Viable Product"; in Scrum, we call it the "Product Increment".

Once we've got the product out of the door, we want to learn how customers respond to it. This is where most agile implementations disconnect: Scrum has a "Review" - yet the Review happens disjoint from both the problems and the opportunities that arise when our product hits the shelves!
What happens in the "Real World" becomes the job of Support, Customer Service or Marketing.


This final piece of the puzzle is vital to build an even better product in the future. We want to iterate a full cycle in increments as small as possible:


The business opportunities of integrating this cycle are immense: anything that is of value can be turned into revenue, and anything that wastes effort can be turned into savings. To harness this potential, it's essential that a single team has the ability to recognize and create these benefits.

Now, it's time to take a closer look at some agile frameworks:

Standard frameworks

Scrum

Scrum is the most common framework. It excels as a delivery framework. Based on the Scrum Guide, it cares neither about what happens in order to get the right items into the Product Backlog, nor about what happens after the product increment has been delivered.
A small fly in the ointment is that Scrum scopes work to the timebox: it doesn't bode well for a Scrum team to deal with unplanned work. Yet everything that happens with customers happens in real time - it is hardly planned, much less does it make sense to plan it upfront.


Kanban

Like its cousin Scrum, Kanban focuses strongly on optimizing the delivery process. Kanban places less emphasis on getting a planned shipment through the door, offering higher flexibility in dealing with unplanned work. Teams that have production maintenance responsibility often fly well with Kanban.


Side note: Scaling Frameworks

Scrum scaling solutions, such as Nexus, Scrum At Scale (S@S) and LeSS, focus on doing the same thing with more people - getting better at delivery. Combining these frameworks with Enterprise Kanban practices, as suggested in SAFe, is still limited to the delivery portion.
While SAFe strongly encourages integrating DevOps, the suggested use of DevOps is still restricted to getting the product through the door in the most reliable fashion.


Design Thinking

It's often advertised that Design Thinking combines well with Scrum, and indeed it does. Design Thinking addresses understanding the customer need through systematic exploration before delving into a more costly delivery process.

Lean / Six Sigma 

Regardless of whether you're thinking Lean, Six Sigma, or the standard combination "Lean Six Sigma" - these frameworks address optimizing our existing product through better insight:


Extreme Programming

Returning to the agile hemisphere, Extreme Programming and some of its derivatives, such as Programmer Anarchy, add more emphasis on understanding the customer and providing the right solution for the customer's needs.

DevOps

DevOps is concerned with two things: getting the product to the customer with minimal overhead and delay - and understanding how the product is being used.
The first, and well-known, portion of DevOps is providing an automated delivery infrastructure - Continuous Delivery, Infrastructure as Code, containerization, logging, authentication and many other things come to mind.

In addition, DevOps, done well, provides a myriad of analytical insights into the product: which features are being used, how new features are being received, where unexpected things occur, how to solve customer issues, and so on. This requires more than technical insight - it requires harnessing business insight as well!


The big picture

DevOps is an essential piece of the puzzle on the way to hyperperformant teams. The full power of an agile team requires a smart combination of discovery and delivery methods. It's your choice whether your team sails under the Scrum, Kanban or XP banner, any other, or none at all - just make sure you cover all of the blocks:

Design Thinking shines at "Understanding customer needs". Scrum is great to cover the "Doing work on customer needs" section. Combine these two with DevOps to optimize your approach to "Working on the available product - both before and after delivery".
For best results, you might want to add a touch of Lean, an ounce of XP and a pinch of Kanban.


Closing the loop

Let's forget frameworks for a minute. What's important is that you've got everything covered. Use Design Thinking where it helps you build a better product. Apply XP where it helps you build the product better. Try Scrum elements that improve your delivery process. Utilize DevOps tools and practices to ensure you're doing the right thing when the rubber hits the road. And never forget: Lean/Six Sigma has really powerful tools that help you do the right thing better!


If you look closely, you will discover that this cycle is an implied "Build-Measure-Learn" cycle, which is the core premise of Lean Startup, yet another framework from the agile bucket.

Closing remarks

No single framework, approach or method is sufficient to deal with the complexity of creating and maintaining a successful product. All frameworks have useful and important elements to offer. Combining them for best results is a core practice of agnostic agile practitioners.

Friday, March 17, 2017

Building a CI tool from scratch in one hour

Is a CI server the right answer for your team? Do you need a CI tool? "Well, that's obvious!" Is it? You may be surprised!
This is a lesson I learned about Agile Principle #10: "Simplicity - the art of maximizing the amount of work not done - is essential."

A simple CI chain, providing essential feedback to the developer

We need a CI tool

In most software development projects, someone will say the sentence "We need a CI server" - usually right at the start of development, before the first line of code is written. In an extreme case, a company paid many thousands of Euros to set up a CI chain before any developer was using it. Tools like Jenkins, Hudson, Bamboo and the like pop into mind right away. They are very good at what they do.

When I worked with one organization that had been using Atlassian's Bamboo from day 1, I talked to a senior developer who had taken personal responsibility for creating the best possible CI process developers could get. He jested that setting up a build chain in Bamboo required a master's degree in Atlassian tooling.

This leads me to pose the question: "Do we need the complexity that a CI server creates to achieve the goal we want to achieve?"


CI isn't rocket science

In its simplest form, CI doesn't do more than retrieve the latest commit from the version control system, create a new build, deploy it - and run the tests. Developers are informed about the consequences of their actions, and that's it. Of course, complexity can be added as needed.

Do you really need a magnificent tool to do this? Is it worth sinking hours (or days) into configuring and maintaining a platform to do this trivial job for you?

My claim: You don't need a standard out-of-the-box CI tool!


How to create a CI tool

We were in a training setting - a Certified Scrum Developer course provided by Andy Schliep. It was fun, and we were building a training application (real, working software!) in Ruby. Andy set a constraint for the Definition of Done: "Works on this machine."
There were eight of us in the course creating Ruby code - so we needed Continuous Integration. In a training setting, we had neither a CI server nor the time to set one up. So we reduced it to the essentials: "What do we really need from CI?"
A developer story was born.
Since we had a cool gadget from blink1, Andy added an Acceptance Criterion: "The light must be red on a failed build and green on a successful build." blink1's API is very simple, so we got going.
Even if we had used Jenkins, the integration with our blink1 USB light would still have been an open point.

Since I love Shell, I started writing some Shell code to access the blink1 LED:
blink()
{
  ~/tools/blink1-tool "--$1" >/dev/null 2>&1
}
This small function snippet would allow me to set the color of our blink1 LED to anything I wanted - for example, red, blue - or green.

The next thing I needed was the failure state:
fail()
{
  blink red
  echo "CI step failure." | log
  exit 1
}
Of course, I'd also need a success state, which looked pretty much the same.
With that, I could already "fail the build".
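For completeness, a plausible reconstruction of that success state, mirroring fail() (not the original code):

succeed()
{
  blink green
  echo "CI step success." | log
}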

We then brainstormed which activities we would need in our CI. We ended up with the following sequence of actions:

  1. checkout the latest code from github
  2. install the bundle
  3. clean the database
  4. run the tests
  5. start the server

If all five activities succeeded, we would call this a successful build and "pass the build".

The main control logic of our CI looked like this:

ci_process()
{
  for task in checkout install_bundle clean_db run_tests; do
    do_task $task
  done
  ( succeed )
  do_task start_server
}

Of course, we still needed to fill in the logic for each task step, so we did that, too. Tapping into my teammates' knowledge of what exactly needed to be done, I wrote a function for each task as well. It looked pretty much like this:
checkout()
{
  git_feedback="`git pull`"
  echo "$git_feedback" | log
  [[ "$git_feedback" == "Already up-to-date." ]] && return 1
  return 0
}

After we had the essentials down, we started polishing the code a little, adding extra features such as turning the light blue while the build was in progress.

All together, we spent roughly one hour creating this "CI tool", and it served its purpose. It is less than 100 lines of code and can be viewed here. Of course, I should have added unit tests - but we were content with doing every unit test manually, because this was only a training exercise and I didn't have my Shell unit testing framework at hand.

Summary

You don't need to add the complexity of "yet another platform to maintain" to your development unless there are pressing reasons. Try looking into this question: "Who needs what - and what is the minimum we need?" Talk with the team, discover the real need, and find the simplest way to serve your purpose.

Only add extra tools when you have sufficient evidence to prove that a few lines of code can't meet the same purpose. Remember: Every tool adds complexity and work.



Friday, October 30, 2015

SysOps and Development: An Antipattern

Developers develop, SysOps operate. Obvious.
Well, that's classical thinking.

The consequence is that problems do not get resolved on time: Users suffer, value is lost and the reputation of the IT department is tarnished.


Obvious problems often don't get resolved in a timely manner - the loser is: everyone!

What you see above is real data taken from an organization where some Scrum teams are responsible for creating features and another team is responsible for operating the platform.

Just taking this example: the biggest current issue is an encoding problem - something that happens in the "real world", but not in a controlled development environment. Continuous Integration was successful, a feature hit the market - and boom!

Separate Responsibilities

In the example above, developers are busy churning out new features while SysOps are busy fire-fighting.
Taken from the chart, it's been 3 days since the problem started - 3 days in which SysOps are constantly stressed while developers stack more features on top of existing problems.

XP proclaims "a Red Master is developers' Priority 1" - well, whose Priority 1 should a Red Production be? SysOps', or everyone's?
In a classic Kanban system, when the production line goes down, the entire machinery stops until the issue is resolved. This also holds true for agile teams: if there is a problem anywhere in the organization - stop. Fix.
Don't go on producing more problems.

Perverse incentives

Developers are appreciated for delivering new features, while SysOps are appreciated for keeping the platform stable. Consequently, Devs prefer working on new stuff over fixing existing stuff. Unfortunately, the organization as a whole suffers. A team not rowing in the same direction can't yield better results.

The very fact that people have different incentives here implies a problem with autonomy and self-organization!
You must align the incentives of everyone working on the product, not just the development team.

Definition of "Done"

Rewarding developers more for delivering a new feature than for making an old feature stable yields the problem above: "Just add a new story to the Product Backlog and we'll deliver it with the next sprint." That is the classic approach of agile software development - "there are no bugs, only missing Acceptance Criteria, which become new stories" - combined with Scrum's "the Sprint is closed to modification".
This is a severe misunderstanding: can you actually consider a story "Done" when it causes problems for the customer?

When you see that there's an operative problem with a new feature, it means that in your retrospective, you should put your DoD under scrutiny.


DevOps: A solution

As the agile proverb goes: "You build it, you run it."
As long as developers do not feel the operational pain, they are not interested in removing it. Likewise, as long as SysOps are unable to fix the root cause of a problem in source code, they become apathetic towards code-related problems.
Move both the operative problem and the solution space into the team's sphere of responsibility, and developers will take an interest not only in the code they write and how it meets the acceptance criteria, but also in how the code actually performs in the real world.

Organizationally, this can be done as simply as moving a SysOp into the development team and then holding the team accountable for events on the production platform.
The Developers become DevOps who will come up with innovative solutions to detect, analyse and prevent problems. A DevOps engineer can solve issues in ways that would be completely unthinkable for a SysOp confronted with a "build ready" black-box software package.




Side note
The above screenshot was taken from Medullar, towards which I, personally, am completely biased: I am one of the core developers of this easy-to-use, open-source, cloud-based, universal monitoring and problem-solving solution, which was designed to require minimal intrusion and resource capacity while offering maximum flexibility.
Medullar allows you not only to detect and analyse problems in any kind of environment - it provides an API and UI to remedy them in real time. It even comes bundled with its own test framework, so you don't need anything else to improve your operative performance.

Thursday, October 22, 2015

The five key ingredients for successful software development

All you need is good code that meets the requirement, and you're set? That is short-sighted!

To be successful in the long term, you need to keep a keen eye on all of the following key processes and artefacts.

Working Software (Code)

That's obvious. If you don't have a product, you have nothing.
This not only includes what the customer sees, but also what is going on behind the scenes. The statement, "A kitchen where someone cooks is always messy" might apply to the developer's office, but it must not apply to the code.

Smelly code costs real money and binds capacity that would be better focused towards innovation and progress.

Test Suite

To understand whether your product works correctly and does the right thing, you need a proper test suite. I radically proclaim: "If your test isn't at least as clean as your productive code, don't bother."
A common (anti-)pattern I observe is that organizations de-prioritize testing in order to meet certain goals. Or they actually put an emphasis on testing, but approach it wrongly. The consequence? Flaky or unreliable tests that cause more trouble than they solve.

The money invested into testing is misspent if your tests are not well-managed, clean and comprehensible. However, abolishing your test suite equals abolishing your investment into sustainability. Don't ever reach this point.

From day 1 of a new product, invest into a proper test suite. Treat your test suite as the Holy Grail of long-term sustainability.

If you have existing software without one, go seek that Holy Grail - today!

Working Build

To actually reach Continuous Integration - or beyond, you need a smooth, swift build process. You should be able to modify your build as easily as you can modify your product's features.

I see many teams and companies that under-emphasize their build process. The sad fact is that they all run into build problems sooner or later. Some can't manage their dependencies any more, some don't understand why their build is unstable - and the build becomes a nightmare for every developer!

Like you manage your code and your test suite, you must manage your build. This includes the build management system, the build artifacts, and the build infrastructure.

Environment

A good product is not impeded by the environment, which includes technical infrastructure as well as corresponding tools and integrated components.

An overly complex or under-performing environment is equally detrimental to the success of your product.

For example, the wild growth of turnkey service solutions is the best way to lose control of what is actually going on. It is quite easy, for instance, to include a tracking or ad service. It is, however, incredibly hard to debug your software if such an external service is causing a malfunction in your product.

Another example: using a myriad of virtual machines without proper hardware management will cause tremendous trouble when you try to find out what went wrong where, and why. The time lost to a poorly managed environment can quickly exceed the time you actually spend on the product you're building.

 

Runtime

"Works on my machine" - isn't it a running gag?

How do you control that the product also works in production? Of course, you need monitoring and feedback.

But technical monitoring doesn't cut it. Have you ever taken a detailed look into your log files? Whenever I engage with a new client, I take the liberty of scanning the production logs for stack traces and the other obvious stuff ("Exception", "ERROR", "CRITICAL", "FATAL"). Doing that takes only a minute with grep, and I usually find a myriad of material for the next retro.
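The whole scan is a one-liner (sketch; the log path is an assumption):

grep -rnE "Exception|ERROR|CRITICAL|FATAL" /var/log/myapp/ | less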

Sometimes the software doesn't display any obvious sign of malfunction to your (Dev-)Ops while thousands of users around the world have no clue why they can't reach their intended goal.

Understanding and controlling your runtime properly is a tremendously important - yet oftentimes underestimated - success factor.


Conclusion

You must take care of five critical ingredients:

Code, Test, Build, Environment and Runtime.

Giving sufficient attention to all of these will drastically reduce your risk of failure.
Ignore one of them, and you will find yourself struggling to keep customer satisfaction and progress up.