(Title: Name of alert)
Overview
-
What does this alert mean?
-
What factors can contribute?
-
What parts of the service are affected?
-
What action is the recipient of this alert expected to take when it fires?
Services
-
All alerts require one or more Service Overview links
-
Team that owns the service
Metrics
-
Briefly explain the metric this alert is based on and link to the metrics catalogue. What unit is it measured in? (e.g., CPU usage in percentage, request latency in milliseconds)
-
Explain the reasoning behind the chosen threshold value for triggering the alert. Is it based on historical data, best practices, or capacity planning?
-
Describe the expected behavior of the metric under normal conditions. This helps identify situations where the alert might be falsely firing.
-
Add screenshots of what a dashboard will look like when this alert is firing and when it recovers
-
Are there any specific visuals or messages one should look for in the screenshots?
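When the threshold is justified from historical data, a short calculation can make the reasoning reproducible. The following is a minimal sketch, assuming the threshold is set at the historical mean plus `k` standard deviations; the function name `suggest_threshold`, the value of `k`, and the sample readings are all hypothetical and not part of this template.

```python
import statistics


def suggest_threshold(samples, k=3.0):
    """Return the mean of the samples plus k population standard deviations.

    This is one possible heuristic (an assumption for illustration), not a
    required method for choosing alert thresholds.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    return mean + k * spread


# Hypothetical historical readings of the metric, in its own unit (e.g. percent).
history = [10.0, 12.0, 11.0, 13.0, 9.0]
print(f"suggested threshold: {suggest_threshold(history):.2f}")
```

Recording the input window and the value of `k` alongside the threshold lets a later reader repeat the calculation when tuning the alert.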
Alert Behavior
-
Information on silencing the alert (if applicable). When and how can silencing be used? Are there automated silencing rules?
-
Expected frequency of the alert. Is it a high-volume alert or expected to be rare?
-
Show historical trends of the alert firing, e.g., a Kibana dashboard
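If silencing is done through Alertmanager, the silencing instructions can include the request body to use. The sketch below only builds the JSON payload and does not send it; it assumes the Alertmanager v2 silences endpoint (`POST /api/v2/silences`) is the mechanism in use, which this template does not state. The helper name `build_silence` and the example alert name are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, created_by, comment, now=None):
    """Build a silence payload matching a single alert name exactly.

    Assumption: the payload shape follows the Alertmanager v2 silences API.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {
                "name": "alertname",
                "value": alertname,
                "isRegex": False,
                "isEqual": True,
            }
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Hypothetical alert name and author; replace with real values.
payload = build_silence("ExampleAlert", 2, "on-call engineer", "planned maintenance")
print(json.dumps(payload, indent=2))
```

Keeping the silence short and stating the reason in `comment` makes it easier to audit later why the alert did not fire.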
Severities
-
Guidance for assigning incident severity to this alert
-
Who is likely to be impacted by the cause of this alert?
-
All gitlab.com customers or a subset?
-
Internal customers only?
-
Things to check to determine severity
Verification
-
Prometheus link to query that triggered the alert
-
Additional monitoring dashboards
-
Link to log queries if applicable
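A Prometheus link for the triggering query can be generated from the alert expression instead of being hand-encoded. This is a minimal sketch, assuming the Prometheus 2.x expression browser, which accepts `g0.expr`, `g0.tab`, and `g0.range_input` on the `/graph` path; other versions or front ends may use a different path. The base URL, the expression, and the helper name `prometheus_graph_link` are placeholders.

```python
from urllib.parse import urlencode


def prometheus_graph_link(base_url, expr, range_input="1h"):
    """Return a URL that opens expr in the Prometheus expression browser.

    Assumption: the target server exposes the classic /graph UI path.
    """
    params = {
        "g0.expr": expr,              # the PromQL expression, URL-encoded below
        "g0.tab": "0",                # 0 selects the graph tab
        "g0.range_input": range_input,
    }
    return f"{base_url.rstrip('/')}/graph?{urlencode(params)}"


# Placeholder host and expression; substitute the alert's real query.
print(prometheus_graph_link("https://prometheus.example.com", 'up{job="api"} == 0'))
```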
Recent changes
-
Links to queries for recent related production change requests
-
Links to queries for recent cookbook or Helm MRs
-
How to properly roll back changes
Troubleshooting
-
Basic troubleshooting order
-
Additional dashboards to check
-
Useful scripts or commands
Possible Resolutions
-
Links to past incidents where this alert helped identify an issue with clear resolutions
Dependencies
-
Internal and external dependencies which could potentially cause this alert
-
How and when to escalate
-
Slack channels where help is likely to be found:
-
Link to the definition of this alert for review and tuning
-
Advice or limitations on how we should or shouldn’t tune the alert
-
Link to edit this playbook
-
Related alerts: link to this /alert/ directory
-
Related documentation