Problem, Legacy Change Management
Most of us have been in the situation where making the actual software change takes just a few hours, but getting approval to go live takes almost two weeks.
It is the story of a painful go-live: an engineering team is ready to release its changes but has to work through lengthy organisational red tape, collecting 10 approvals by filling in the same forms with information that may never be used. Because change management is not fully aware of what is in the change, and because it receives so many of these requests, the tendency can be to decline some of them without giving the team a clear reason. Approval comes only after a lot of back and forth convinces them that the change is safe, and sometimes only after they see how committed you are to going live!
If this sounds familiar, it is the story of many companies that have implemented modern DevOps practices without paying much attention to change management.
There was a time when companies with large platforms, major releases every few months, and interconnected software foundations needed a central team in charge of change management, because a change in one platform would have an impact on the others. That model no longer makes sense. These days software platforms have adopted microservices architectures with CI/CD pipelines, piloting mechanisms and feature flags, and can release software quite independently. Yet because of outdated change management processes and long lead times, they still can't release daily or hourly.
Solution, Automated Change Management using AIOps
Imagine a better pipeline. The next time someone asks you about the status of a bug fix, the answer could be: we've just made the change, the change request was automatically approved, and the estimated go-live time is 10 minutes. You will get a notification on your phone when it's ready to test in prod, as you are a pilot user. This is possible using AIOps.
What's happening behind the scenes is a highly mature, automated pipeline. It executes the tests, including penetration and performance tests, and deploys the change to non-prod. If the change has been nominated for go-live, the pipeline raises the change request with the test coverage and commit messages attached, along with the configuration items (CIs) and systems that will be impacted, identified from a dependency map.
An AI agent then assesses the risk of the change based on those factors and on the attestations that show certain controls have been implemented with sufficient evidence. It also takes into account previous deployments and how similar deployments affected the systems in the past. The result is compared against the company's risk threshold to decide whether the change can be approved or rejected automatically. Within a year, 90 per cent of changes should follow this pattern. For the remaining 10 per cent, where the risk is high or the change is interdependent with other changes, CIs or systems, someone from the delivery team (and not change management) could approve the change.
A release window is then allocated automatically, and the change is deployed to staging and prod using a blue-green, canary or ramped deployment. All the automated, non-destructive business verification tests run in production, after which the change becomes available to pilot users. If all is good, we dial up to make it available to everyone.
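The risk assessment and automatic approval step described above can be sketched roughly as follows. This is a minimal illustration, not a real AIOps model: the `ChangeRequest` fields, the weights, and the default threshold of 0.3 are all assumptions made for the example, and in practice the score would come from a model trained on deployment history rather than a fixed formula.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical change request; every field name is an assumption."""
    test_coverage: float      # fraction of code covered by tests, 0.0 to 1.0
    impacted_systems: int     # number of CIs/systems found via the dependency map
    controls_attested: bool   # required controls have evidence attached
    past_failure_rate: float  # failure rate of similar past deployments, 0.0 to 1.0


def risk_score(change: ChangeRequest) -> float:
    """Return an illustrative risk score between 0.0 (safe) and 1.0 (risky)."""
    if not change.controls_attested:
        # Missing attestations always push the change to a human reviewer.
        return 1.0
    coverage_risk = 1.0 - change.test_coverage
    blast_radius_risk = min(change.impacted_systems / 10, 1.0)
    # Illustrative weights only; a real system would learn these from data.
    score = (0.3 * coverage_risk
             + 0.3 * blast_radius_risk
             + 0.4 * change.past_failure_rate)
    return round(score, 3)


def decide(change: ChangeRequest, risk_threshold: float = 0.3) -> str:
    """Auto-approve below the company's risk threshold, else ask the delivery team."""
    if risk_score(change) <= risk_threshold:
        return "auto-approved"
    return "delivery-team-review"


low_risk = ChangeRequest(0.9, 1, True, 0.05)
high_risk = ChangeRequest(0.5, 8, True, 0.4)
print(decide(low_risk))
print(decide(high_risk))
```

Note that a change exceeding the threshold is routed to the delivery team rather than to a central change management board, which matches the delegation model described above.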
If you have 30 engineering teams doing traditional change management, with manual engagements for penetration, performance and regression testing, the above pipeline could potentially save your platform in excess of 5 million dollars a year for monthly releases, because it reduces the release lead time from three weeks to minutes. You can imagine what happens if you release more frequently.
AIOps, artificial intelligence for IT operations, is intelligent automation: the use of ML on the big data we have generated at scale through our DevSecOps pipelines. We can use it for a range of initiatives such as root cause analysis, anomaly detection, event correlation across multiple monitoring systems, and proactive incident prevention. We train the models to learn from and make sense of the unmanageable mountain of data we have accumulated through our pipelines, software-defined networks, and the millions of containers we run in production.
AIOps Use Cases
We look at your past incidents, correlate them to specific code changes, and then use that insight to stop future deployments that might repeat the same failures.
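As a rough sketch of that idea, the example below counts how often each file appeared in a change linked to a past incident and blocks a deployment that touches a file with a poor track record. The incident record shape, the file paths, and the `max_incidents` cut-off are hypothetical; a real AIOps system would correlate on far richer signals than file names.

```python
from collections import Counter


def incident_counts_by_file(past_incidents):
    """Count how many past incidents each changed file was linked to."""
    counts = Counter()
    for incident in past_incidents:
        # Use a set so a file is counted once per incident.
        counts.update(set(incident["changed_files"]))
    return counts


def should_block(deployment_files, counts, max_incidents=2):
    """Return (blocked, risky_files) for a proposed deployment."""
    risky = sorted(f for f in set(deployment_files) if counts[f] >= max_incidents)
    return (len(risky) > 0, risky)


# Hypothetical incident history, each linked to the code change that caused it.
history = [
    {"id": "INC-1", "changed_files": ["billing/tax.py", "billing/api.py"]},
    {"id": "INC-2", "changed_files": ["billing/tax.py"]},
]
counts = incident_counts_by_file(history)
print(should_block(["billing/tax.py", "README.md"], counts))
print(should_block(["README.md"], counts))
```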
The simplest use cases include opening a ticket automatically when certain patterns are observed in production across multiple monitoring tools at the application, pipeline, network, and infrastructure levels, with higher accuracy in ruling out false alarms. More complicated use cases include assessing the risk of a change, as mentioned, and defining and assessing the quality of the change. The next level is auto-healing: applying automated patches to our systems based on past remediations and how effective they have been.
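The simplest of these use cases, opening a ticket only when several monitoring tools agree, might look something like the sketch below. It flags a service when alerts from at least two distinct sources arrive within a short time window, which is one basic way to rule out false alarms from a single noisy tool. The alert fields, the 300-second window, and the source names are assumptions for illustration, and the actual ticket creation is left out.

```python
def correlated_services(alerts, window_seconds=300, min_sources=2):
    """Return services with alerts from several distinct sources in one time window."""
    by_service = {}
    for alert in alerts:
        by_service.setdefault(alert["service"], []).append(alert)

    flagged = []
    for service, items in by_service.items():
        items.sort(key=lambda a: a["timestamp"])
        for i, first in enumerate(items):
            # Collect the distinct tools that alerted within the window
            # starting at this alert.
            sources = {
                a["source"]
                for a in items[i:]
                if a["timestamp"] - first["timestamp"] <= window_seconds
            }
            if len(sources) >= min_sources:
                flagged.append(service)
                break
    return sorted(flagged)


# Hypothetical alerts; timestamps are seconds since some reference point.
alerts = [
    {"service": "checkout", "source": "apm", "timestamp": 1000},
    {"service": "checkout", "source": "network", "timestamp": 1120},
    {"service": "search", "source": "infra", "timestamp": 5000},
]
for service in correlated_services(alerts):
    print(f"open ticket for {service}")
```

Here only `checkout` would get a ticket, because `search` was reported by a single tool.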
Now back to change management. To get millions of changes approved and deployed to production safely in a year, change management needs to be automated, federated, and delegated to delivery teams for the majority of changes, while maintaining a good level of audit and traceability. We can't meet complex security, regulatory and compliance concerns at scale using the traditional models when 300 teams need to do automated delivery every hour. We should also distribute the accountability for keeping the production environment safe to the broader delivery teams in charge of their changes. Currently, in many companies the change management team is seen as the gatekeepers and the rest of the organisation as cowboys who want to break our software in production. This needs to change.
We shouldn't forget the merit behind change management, however. All we are saying is that the automation and maturity of the change management practice need to grow at the same speed as technology.