(Title: Name of alert)

  • What does this alert mean?

  • What factors can contribute?

  • What parts of the service are affected?

  • What action is the recipient of this alert expected to take when it fires?

  • All alerts require one or more Service Overview links

  • Team that owns the service

  • Briefly explain the metric this alert is based on and link to the metrics catalogue. What unit is it measured in? (e.g., CPU usage in percentage, request latency in milliseconds)

  • Explain the reasoning behind the chosen threshold value for triggering the alert. Is it based on historical data, best practices, or capacity planning?

  • Describe the expected behavior of the metric under normal conditions. This helps identify situations where the alert might be falsely firing.

  • Add screenshots of what a dashboard will look like when this alert is firing and when it recovers

  • Are there any specific visuals or messages one should look for in the screenshots?

  • Information on silencing the alert (if applicable). When and how can silencing be used? Are there automated silencing rules?

  • Expected frequency of the alert. Is it a high-volume alert or expected to be rare?

  • Show historical trends of the alert firing, e.g., in a Kibana dashboard

  • Guidance for assigning incident severity to this alert

  • Who is likely to be impacted by the cause of this alert?

    • All gitlab.com customers or a subset?

    • Internal customers only?

  • Things to check to determine severity

  • Prometheus link to query that triggered the alert

  • Additional monitoring dashboards

  • Link to log queries if applicable

  • Links to queries for recent related production change requests

  • Links to queries for recent cookbook or Helm MRs

  • How to properly roll back changes

  • Basic troubleshooting order

  • Additional dashboards to check

  • Useful scripts or commands

  • Links to past incidents where this alert helped identify an issue with clear resolutions

  • Internal and external dependencies which could potentially cause this alert

  • How and when to escalate

  • Slack channels where help is likely to be found:

  • Link to the definition of this alert for review and tuning

  • Advice or limitations on how we should or shouldn’t tune the alert

  • Link to edit this playbook

  • Update the template used to format this playbook

  • Related alerts (link to this /alert/ directory)

  • Related documentation