Is your solution detecting actual business threats?
Reflecting on the alert fatigue problem, I think much of it comes down to conflating abnormal metric values with bad user experiences. Many monitoring products reinforce the confusion by making it easy (maybe too easy) to set alerts on abnormalities automatically. Of course, not all abnormalities result in a bad user experience, but better safe than sorry, right? Well, not necessarily. In fact, I think this approach gets things exactly backwards and leads to alert fatigue.
The three standard deviations approach
Almost all of the monitoring products that I’m aware of use fairly standard statistical analysis to determine what is normal and abnormal. The most basic form of this analysis involves computing an average and a standard deviation. As a rule, “abnormal” is defined as more than three standard deviations from the average. If the metric follows a normal distribution, about 0.3% of its values are expected to fall outside of three standard deviations even when nothing is wrong*. Is that acceptable? Well, if you are sampling the metric once per minute, that is 1,440 samples a day, so the metric is abnormal about four minutes every day just due to the rules of statistics! Is your system really in a bad state four minutes every day? Maybe (I hope not!). Maybe not. The point is that statistics alone cannot answer the question, and relying on them can drive fatigue.
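To make the arithmetic concrete, here is a minimal sketch that assumes the metric is normally distributed and sampled once per minute. The function and constant names are mine, chosen for illustration.

```python
import math


def two_sided_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))


SAMPLES_PER_DAY = 24 * 60  # one sample per minute, as in the text

p_outside = two_sided_tail(3.0)
expected_per_day = SAMPLES_PER_DAY * p_outside

print(f"share outside 3 sigma: {p_outside:.4%}")
print(f"expected abnormal minutes per day: {expected_per_day:.1f}")
```

Under those assumptions the share comes out at roughly 0.27%, or a little under four "abnormal" minutes a day on a perfectly healthy system.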
The prolonged abnormality approach
Suppose we continue with the statistical approach to setting alerts. Most monitoring tools allow you to alert only when the metric is abnormal for several minutes in a row. From a statistical standpoint, this reduces the noise coming from the three-standard-deviation rule, but it highlights a deeper question: how does the metric behave when users are having a bad experience?
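As a rough illustration of the "several minutes in a row" rule, the sketch below fires only when a run of abnormal samples reaches a minimum length. The function name, the `min_run` parameter, and the sample flags are hypothetical and not taken from any particular monitoring product.

```python
from typing import Iterable, List


def sustained_alerts(abnormal: Iterable[bool], min_run: int) -> List[int]:
    """Return the indices at which an alert would fire, i.e. the sample
    where the current run of abnormal samples first reaches min_run."""
    if min_run < 1:
        raise ValueError("min_run must be at least 1")
    fired = []
    run = 0
    for i, flag in enumerate(abnormal):
        run = run + 1 if flag else 0  # any normal sample resets the run
        if run == min_run:
            fired.append(i)
    return fired


# Hypothetical per-minute flags: one isolated blip, then a five-minute run.
flags = [False, True, False, False, True, True, True, True, True, False]

print(sustained_alerts(flags, min_run=1))  # [1, 4]: the blip also fires
print(sustained_alerts(flags, min_run=3))  # [6]: only the long run fires
```

Requiring a longer run filters out the isolated blip, but it says nothing about whether the five-minute run actually hurt anyone.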
At first glance, metrics like CPU usage seem pretty straightforward: CPU is abnormal when it goes to 100% and stays there. This isn’t always the case, however. Consider a data processing pipeline application. When CPU goes low (say, less than 10%), the pipeline could be broken. The problem gets more challenging when you think about metrics like I/O rate, which tends to vary more wildly when systems are behaving badly. Without understanding how each of your metrics relates to the business, you can’t know if abnormal means bad.
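One way to picture this is a rule that records which direction is bad for a given metric on a given system. This is only a sketch: the `BadRange` class and the thresholds are made up for illustration, and real values would have to come from knowing the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BadRange:
    """Hypothetical rule: the values that signal trouble for one metric."""
    below: Optional[float] = None  # bad when value < below
    above: Optional[float] = None  # bad when value > above

    def is_bad(self, value: float) -> bool:
        too_low = self.below is not None and value < self.below
        too_high = self.above is not None and value > self.above
        return too_low or too_high


# Illustrative thresholds only.
web_server_cpu = BadRange(above=95.0)  # pegged CPU is the worry
pipeline_cpu = BadRange(below=10.0)    # an idle pipeline may be broken

print(web_server_cpu.is_bad(8.0))  # False
print(pipeline_cpu.is_bad(8.0))    # True
```

The same 8% CPU reading is fine on one system and a warning sign on the other, which is something an average and a standard deviation cannot tell you.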
The key focus
The whole point of alerts in the first place is to know when your users are in danger of experiencing poor performance – not when statistics indicate metrics are abnormal. Why do this indirectly using infrastructure metrics? Why not alert on bad user experience directly? You don’t need to be interrupted when some piece of infrastructure goes bump and certainly not woken up when it goes bump in the night. You do want to be woken up when the business is at risk; therefore, I advocate identifying the metrics that matter to the business. I’ll address common approaches to alerting on bad user experience in a future post.
Hopefully, you can see how this reinforces the need for an “investigate abnormality” backlog that is managed differently from alerts indicating that the business is at risk. This is not a novel concept. Since my last blog post, I discovered a great paper by Rob Ewaschuk titled “My Philosophy on Alerting,” which discusses these ideas in more depth. I highly recommend it.
Along those lines, I’m curious how your organization regards the importance of the speed at which it resolves customer application issues. Please share with us via our survey here. Thanks!
* Under ideal circumstances, metrics will follow the 68-95-99.7 rule. Here, ideal circumstances means that the metric follows a normal distribution. Of course, this opens up a whole new set of questions like, how do you know the metric is normally distributed? And what happens if the metric does not follow a normal curve? There’s good reason to think that many important metrics like CPU and response time are not normally distributed. In that case, math developed by the statistician with the coolest name, Pafnuty Chebyshev, says that as much as 11% (one ninth) of a metric’s data points can fall outside three standard deviations, which is roughly 40 times the 0.3% that the normal curve predicts.
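For a sense of scale, Chebyshev's inequality says that for any distribution with a finite variance, at most 1/k² of the values lie k or more standard deviations from the mean, so at most 1/9 (about 11%) for k = 3. The sketch below compares that worst case with the normal curve and with an exponential distribution. The exponential is my own stand-in for a skewed metric, not a claim about how CPU or response time actually behaves.

```python
import math


def normal_tail(k: float) -> float:
    """P(|X - mean| > k * sd) for a normal distribution."""
    return math.erfc(k / math.sqrt(2))


def exponential_tail(k: float) -> float:
    """P(|X - mean| > k * sd) for an exponential distribution.

    Its mean and sd are equal, so for k >= 1 only the upper tail
    X > (1 + k) * mean has any mass, and that mass is exp(-(1 + k)).
    """
    if k < 1:
        raise ValueError("this closed form assumes k >= 1")
    return math.exp(-(1 + k))


def chebyshev_bound(k: float) -> float:
    """Upper bound on P(|X - mean| >= k * sd) for any finite-variance distribution."""
    return min(1.0, 1 / k**2)


SAMPLES_PER_DAY = 24 * 60  # one sample per minute

for name, p in [
    ("normal", normal_tail(3)),
    ("exponential", exponential_tail(3)),
    ("Chebyshev bound", chebyshev_bound(3)),
]:
    print(f"{name}: {p:.2%} of samples, about {SAMPLES_PER_DAY * p:.0f} minutes per day")
```

With one sample per minute, that works out to about 4 minutes a day for the normal curve, about 26 for the exponential example, and up to 160 in the worst case Chebyshev allows.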