One of the issues that I’ve run across over the years is alert fatigue. As the linked article points out, it’s not just a problem for SREs, but we’re definitely victims of it. I can’t count the number of times the question, “Hey, what is that alert about?” is answered with, “I don’t really know, that always happens.” I think this comes from a desire to make sure nothing is missed, but like a lot of well-intentioned programs, handling too much ends up meaning that you handle very little. As it turns out, humans have limited attention.
To ensure that alerts are meaningful, we need to consider the ways we can respond to them. Broadly speaking, there are only three responses to an alert:
- Follow a well-defined procedure
- Investigate to understand the source of a problem and determine a resolution
- Ignore it (of course, we’re trying to eliminate this one!)
Determining actionable alerts means going through each alert that you fire off to PagerDuty or Slack and asking some hard questions:
- Could the response to the alert be automated? (if the answer is yes but it’s not currently automated, add it to your backlog)
- Can you say, with high confidence, that the alert indicates a problem in the system which will affect your business?
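The two questions above form a simple decision procedure, and it can help to make that explicit. Here's a minimal sketch in Python; the `Alert` fields and `Disposition` outcomes are hypothetical labels for illustration, not part of any real alerting tool:

```python
from dataclasses import dataclass
from enum import Enum

class Disposition(Enum):
    AUTOMATE = "automate the response (backlog it if not yet automated)"
    PAGE = "page a human immediately"
    RETHINK = "re-evaluate or retire the alert"

@dataclass
class Alert:
    name: str
    response_is_automatable: bool       # Question 1
    clear_business_impact: bool         # Question 2

def triage(alert: Alert) -> Disposition:
    # Question 1: could the response be automated?
    if alert.response_is_automatable:
        return Disposition.AUTOMATE
    # Question 2: high confidence it affects the business?
    if alert.clear_business_impact:
        return Disposition.PAGE
    # Neither: this alert is a candidate for redesign or retirement.
    return Disposition.RETHINK
```

Running every existing alert through a checklist like this tends to surface a surprising number of `RETHINK` results.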
These questions may seem pretty straightforward, but every time I’ve seen a team wrestle with them, the conversation is laced with statements like, “Remember that time when the snarfle got all dymriddled? We set up this alert to make sure that doesn’t happen again!” Since then, there hasn’t been a problem with the snarfle, but the alert has been throwing out false positives because nobody ever knew the right threshold to avoid another dymriddle.
Essentially, we need to decide if we have high confidence that an alert represents an impact to the business. Alerts in which we have high confidence are worthy of someone’s attention immediately and thus should be Slacked and/or PagerDutied. While it may seem obvious that medium/low confidence alerts should be handled differently, in my experience, mixing these categories is exactly how alert fatigue begins.
Managing medium/low confidence alerts seems to be the Achilles’ heel in most SRE organizations – even organizations which separate high confidence alerts from the rest. I’ve seen some teams be very successful by creating a separate “this-seems-odd-and-deserves-investigation” backlog. This “investigate weirdness” backlog requires someone managing it and dedicating time to tracking down underlying issues. I hope to talk more about managing this backlog in a future post.
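The split between paging and backlogging can be sketched in a few lines. This is a hypothetical router, not a real PagerDuty or Slack integration: high-confidence alerts interrupt a human right away, while everything else lands on the investigation backlog to be reviewed on a schedule:

```python
# Hypothetical routing: names and the "high" confidence label
# are illustrative, not from any real alerting system.
investigation_backlog: list[str] = []

def route(name: str, confidence: str) -> str:
    if confidence == "high":
        # Worthy of someone's attention immediately (Slack/PagerDuty).
        return f"PAGE: {name}"
    # Medium/low confidence: queue it for scheduled review,
    # instead of waking someone up at 3 a.m.
    investigation_backlog.append(name)
    return f"BACKLOG: {name}"
```

The point of the second branch is that these alerts still get owned and worked – just not through the pager.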
Before we get there, I’m curious how your organization values resolving customer application issues quickly. Please share with us via our survey here. Thanks!
Till next time, stay alert! 🙂