I want to spend a little time reflecting on the kaizenOps.io journey over the past few months. As I mentioned at the outset of this blog, many years ago, I was touched by how much Lakshay’s life was impacted by his job. Of course, everyone’s job impacts their life, but what I’m referring to here is the degree to which his life was impacted by just one production problem. As the kaizenOps.io team has interviewed many SREs over the past few months, I’ve been struck by how much the job continues to impact people’s lives, even as the particulars have changed.
Lakshay suffered from a lack of visibility. He didn’t have the data to reveal the problem. Today, it seems that the visibility problem has been solved; perhaps too well. Today’s SREs have all the data that they need. A sea of data. Drowning in data. This flood of data manifests in several ways: alert fatigue is the obvious one, and not knowing which data is relevant is another.
SREs are frustrated over how much time they spend simply figuring out which data streams among the tens (hundreds?) of thousands of data streams are relevant for solving the problem at hand. We’ve heard stories of that task alone taking more than half of the time to resolution. It’s surprising in this day and age that SREs still have to hunt through multiple products and multiple dashboards just to get started!
What is a little less surprising but, perhaps, more problematic is the difficulty of simply knowing which data streams are good health indicators for some particular piece of technology. Very often, when you hook up a monitoring tool, you suddenly get hundreds or thousands of bits of data. This introduces a whole new set of questions. What should you alert on? What does the data mean? How do you know when the tech is healthy? It’s like going to the doctor and getting a full-body scan. How useful are the raw results? They’re pretty useless unless you know how to put them into context. Once you figure out what healthy means, you can start to figure out what bad really means.
kaizenOps.io is on a mission to address these problems, and we need your help. We’re organizing a community of people to document the best ways of monitoring various technologies. As we’ve talked to SREs, we’ve found pockets of knowledge spread across the entire community. We just need to get the ball rolling on a central repository for this information. We’re excited to share much more on this soon.
Do you have any of these problems? The more we understand the problems you’re facing, the better we can help the community craft solutions. Please take a quick survey and earn yourself $10 in Amazon credit so you can get a pair of these to help with your next production incident!