My name is Mark Addleman and I’m one of the founders of kaizenOps.io. I want to reflect on why we think this is the right time for AI-backed assistance for site reliability and faster incident resolution.
Back in the before time, companies such as banks and insurers that grew by acquisition would end up with a mess of systems that made it difficult to have a unified view of their customers. These enterprises would attempt to solve this problem by writing Java applications to broker between their multiple systems. Of course, the Java applications had problems of their own: when they broke, what caused the problem? It could be any one of a dozen different backend “systems of record” or the problem could be the Java system itself. I remember a particularly intense week of hunting a severe performance bug that came down to a home-grown logging system. Another two-week hunting exercise (during which my job was threatened multiple times by a manager whose only method of motivating people was to threaten their jobs) resulted in discovering a misbehaving email server that no one knew existed.
From these bruises (and plenty of others), I came to wish there were a better way. What would the world have to look like to make tracking down performance problems so straightforward that a computer could do it?
I think that world would look a lot like the world the SRE and DevOps communities are creating today: nearly ubiquitous observability, common technologies that are used broadly, and a culture that cares about customer experience. Each of these provides a pillar upon which we can begin to radically transform how teams handle incidents and other problems. Today, in most organizations, only a handful of people can effectively handle issues when they arise. I don’t think we’re too far from a world where many more people can be effective at recovering from incidents, thus relieving beleaguered ops teams and educating the broader organization about good practices.
I want to move us in that direction. I think that kaizenOps.io can help us get there.