An impatient SRE's guide to deleting alerts
This memo documents an opinionated methodology for triaging unactionable alerts and fighting alert fatigue, with the goal of reducing alert volume and improving the on-call experience.
When a useless alert comes in, and you still have the mental capacity and energy to do so, don’t ignore it. The next time you get dragged out of bed on a Sunday for an expiring SSL cert in a non-production environment, it’s 🔨 time.
Methodology
- Is it a known issue?
- Action: 🤫 Silence. Point to the issue tracking the fix.
- Reason: The issue is likely to page the next shift. If it is known and no short-term mitigation can be applied, there is no value in paging them again.
- Example: An incident with an alert that was silenced
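For the silencing step above, a minimal sketch using Alertmanager’s amtool (the matcher values, issue link, and duration are placeholders, not from this memo):

```sh
# Silence the known issue; the comment points the next shift at the tracking
# issue instead of paging them again.
amtool silence add alertname="DiskSpaceLow" env="production" \
  --comment="Known issue, fix tracked in <issue link>" \
  --duration="72h" \
  --author="oncall-sre"
```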
- Does this alert highlight a slow-burn problem?
- Common examples include SSL certs expiring, disks filling up, and maintenance jobs failing. These need to be dealt with, but not right away. They are also frequently cases where automation and self-healing can improve the situation.
- Action: 📎 Convert paging alert into auto-created issue.
- Reason: These alerts are usually not immediately actionable. We do not want to get paged for them at the weekend. Unless we reach a critical threshold, we can deal with them 1-2 days later.
- Example: Route SSLCertExpiresSoon alert to issue tracker
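A rough sketch of the routing step above, assuming Alertmanager in front of the pager and a hypothetical internal webhook service that turns alert payloads into issues (receiver names and the URL are made up):

```yaml
route:
  receiver: oncall-pager
  routes:
    # Slow-burn alerts carry severity=ticket in their rule definitions and
    # become auto-created issues instead of pages.
    - matchers:
        - severity = ticket
      receiver: issue-tracker

receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: "<secret>"
  - name: issue-tracker
    webhook_configs:
      # Hypothetical internal service that files an issue per alert.
      - url: https://alert-to-issue.internal.example/webhook
```

The severity: ticket label itself would be set on the alert rule, e.g. on SSLCertExpiresSoon.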
- Is this alert unactionable, not actually pointing to a user-facing problem?
- Common examples include cause-based alerts that highlight some behaviour but don’t actually impact availability. Error rates may include client-side errors or rate-limited requests, or alerts may point at non-production environments or upstream services we don’t control.
- Action: 🔥 Delete.
- Reason: Alerts that don’t point to an actual problem are worse than worthless. They make on-call a bad experience, and we should not tolerate them.
- Examples:
- Is this alert too noisy?
- Some alerts are flappy or simply too sensitive. Sometimes the monitored endpoint has too little traffic, so a single user can move the overall SLO.
- Action: 📊 Adjust thresholds. Exclude sensitive endpoints if needed.
- Reason: Noisy alerts drain precious energy during on-call shifts, and contribute to alert fatigue. “Oh, this again? Ack and ignore”.
- Examples:
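For the threshold-tuning step above, a hedged sketch of a Prometheus alerting rule (the metric and label names such as http_requests_total and handler are illustrative assumptions):

```yaml
groups:
  - name: api-availability
    rules:
      - alert: HighErrorRate
        # Exclude the low-traffic /export endpoint, where a single user could
        # trip the alert, and require the condition to hold for 15 minutes.
        expr: |
          sum(rate(http_requests_total{code=~"5..", handler!="/export"}[5m]))
            /
          sum(rate(http_requests_total{handler!="/export"}[5m]))
            > 0.05
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "5xx error ratio above 5% for 15 minutes"
```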
- Is the alert legit?
- If the alert points towards an actual user- and SLO-impacting problem in a production environment that needs immediate attention, then it’s probably legit.
- Action: 🚒 Actually investigate the alert: focus on mitigation first, then drive improvements via capacity planning, rate limiting, “corrective actions”, and the infradev process.