An impatient SRE's guide to deleting alerts

This memo documents an opinionated methodology for triaging and dealing with unactionable alerts, with the goal of reducing alert volume and alert fatigue and improving the on-call experience.

When a useless alert comes in, and you still have the mental capacity and energy to act on it, don’t just ignore it. The next time you get dragged out of bed on a Sunday for an expiring SSL cert in a non-production environment, it’s 🔨 time.

  1. Is it a known issue?
    • Action: 🤫 Silence. Point to the issue tracking the fix (see the silence sketch after this list).
    • Reason: The issue is likely to page the next shift. If the issue is known and no short-term mitigation can be applied, there is no value in paging them again.
    • Example: An incident with an alert that was silenced
  2. Does this alert highlight a slow-burn problem?
    • Common examples include SSL certs expiring, disks filling up, maintenance jobs failing. These need to be dealt with, but not right away. These are also frequently examples where automation and self-healing can improve the situation.
    • Action: 📎 Convert the paging alert into an auto-created issue (see the webhook sketch after this list).
    • Reason: These alerts are usually not immediately actionable. We do not want to get paged for them at the weekend. Unless we reach a critical threshold, we can deal with them 1-2 days later.
    • Example: Route SSLCertExpiresSoon alert to issue tracker
  3. Is this alert unactionable, i.e. not pointing to an actual user-facing problem?
  4. Is this alert too noisy?
  5. Is the alert legit?
    • If the alert points towards an actual user- and SLO-impacting problem in a production environment that needs immediate attention, then it’s probably legit.
    • Action: 🚒 Actually investigate the alert: focus on mitigation first, then drive improvements via capacity planning, rate limiting, “corrective actions”, and the infradev process.
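
For step 1, here is a minimal sketch of what “silence and point to the issue” can look like in practice, assuming a Prometheus Alertmanager reachable over its v2 HTTP API. The Alertmanager address, alert name, and issue URL are illustrative placeholders, not part of the original memo:

```python
"""Sketch: silence a known, already-tracked alert via the Alertmanager v2 API."""
from datetime import datetime, timedelta, timezone

import requests

# Assumption: your Alertmanager endpoint.
ALERTMANAGER_URL = "http://alertmanager.example.com:9093"


def silence_known_issue(alertname: str, issue_url: str, hours: int = 24) -> str:
    """Create a silence so the next shift is not paged for an issue that is already tracked."""
    now = datetime.now(timezone.utc)
    silence = {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "on-call SRE",
        # Point to the issue tracking the fix, so the silence explains itself.
        "comment": f"Known issue, fix tracked in {issue_url}",
    }
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=silence, timeout=10)
    resp.raise_for_status()
    return resp.json()["silenceID"]


if __name__ == "__main__":
    # Hypothetical alert and issue URL, purely for illustration.
    print(silence_known_issue("SSLCertExpiresSoon", "https://example.com/issues/1234"))
```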
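
For step 2, a small webhook receiver is one way to turn a paging alert into an auto-created issue. This is a sketch under the assumption that Alertmanager routes slow-burn alerts such as SSLCertExpiresSoon to this receiver; `create_issue()` is a stand-in for whatever issue-tracker API you actually use:

```python
"""Sketch: an Alertmanager webhook receiver that files an issue instead of paging."""
from flask import Flask, request

app = Flask(__name__)


def create_issue(title: str, body: str) -> None:
    """Placeholder: call your issue tracker's API here (GitLab, Jira, ...)."""
    print(f"Would create issue: {title}\n{body}")


@app.route("/alerts", methods=["POST"])
def handle_alerts():
    # Alertmanager's webhook payload carries a list of alerts, each with labels and annotations.
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        title = f"[alert] {labels.get('alertname', 'unknown')} ({labels.get('env', 'unknown')})"
        body = annotations.get("description", "No description provided.")
        create_issue(title, body)
    return "", 200


if __name__ == "__main__":
    app.run(port=8080)
```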