# Alerting

## Overview

- Videos and materials in ./runbooks/docs/monitoring/README.md
- Part V (chapters 18 and 19) of O’Reilly’s “Prometheus: Up & Running” book
```mermaid
graph TD
A[exporters] -->|scraping /metrics endpoint| B(Prometheus)
subgraph Prometheus
B --> C{alerting rules}
B --> D{recording rules}
D --> B
C --> E[alertmanager configured endpoints]
end
subgraph alertmanager cluster
E --> H[routing rules]
H --> I["pagerduty receiver (its config includes a PD service key)"]
H --> J["slack receiver (Slack Channel Alerts)"]
H --> S[Slackline webhook]
H --> T[DeadMan's snitch webhook]
H --> M[other receivers]
end
subgraph Pagerduty
I --> |PD Prometheus service integration| K[PD service]
K --> O["PD Escalation policy (e.g. if no response in 5 mins send a text)"]
O --> P[individual PD users]
O --> R[PD schedules: SRE, DBRE, Support, etc]
end
subgraph GCP cloud functions
S --> V[alertManagerBridge]
end
subgraph Slack
V --> L
J --> L[Slack channels]
end
subgraph Grafana
V --> U[Grafana]
end
subgraph DeadMan's Snitch
T --> W[DeadMan's Snitch]
end
subgraph other
M --> N[other]
end
```
## Prometheus

Generally speaking, alerts are triggered by Prometheus, then grouped by label, prioritized, and deduplicated by the Alertmanager.

Currently most alerts are grouped by environment and alert name. These alerts are then shipped to the Alertmanager, which applies a template to them and sends them to Slack or PagerDuty, depending on the `pager` label.
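For orientation, the grouping and pager-based routing described above corresponds to an Alertmanager route roughly like the sketch below. The receiver names and the matcher shown here are assumptions for illustration, not the production configuration.

```yaml
route:
  # Group firing alerts by environment and alert name, as described above.
  group_by: ['environment', 'alertname']
  receiver: slack_alerts          # hypothetical default (Slack) receiver
  routes:
    # Alerts carrying pager: pagerduty are routed to the on-call via PagerDuty.
    - match:
        pager: pagerduty
      receiver: pagerduty_oncall  # hypothetical PagerDuty receiver
```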
## Alert rule sample

Alert rules, alongside other config such as the Alertmanager’s URL endpoint, are part of the Prometheus config. This sample alert rule would trigger pretty much all the time:
```yaml
groups:
  - name: testing.rules
    rules:
      - alert: HawaiianHugs
        expr: node_load1{environment="gprd",job="node",type="git"} > 1
        for: 1m
        labels:
          severity: s1
        annotations:
          title: "Emergency Alert"
          description: >
            "BALLISTIC MISSILE THREAT INBOUND TO GITLAB. SEEK IMMEDIATE SHELTER.
            THIS IS NOT A DRILL."
          node_load: '{{ $value }}'
```
### Fields

- `alert`: the alert name; this will be used as a grouping mechanism.
- `expr`: the expression to evaluate; it has to evaluate to a boolean.
- `for`: how long to wait until this alert starts firing.
- `labels`: any alert that triggers will include all the labels that come from the Prometheus metric; here we can add more labels to manage the alert behavior. We could, for example, add labels such as the following (see the sketch after this list):
  - `team`: which team the alert should go to. Teams are defined in `teams.yml`.
  - `severity`: `s1`, `s2`, `s3`, and `s4`.
  - `pager`: if set to `pagerduty`, the alert will also be sent to PagerDuty, paging the production on-call.
  - `environment`: this one generally comes with the metric; we can override it to send to different channels (stg, pre, or geo).
- `annotations`: there are 2 annotations that will always be used: `title` and `description`. These should be static (they can’t have any variable in them), more on this later on. Additionally, we can define as many other annotations as we want, also covered later.
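Putting those optional labels together, the labels block of a rule might look like the following sketch (the team name here is hypothetical and would have to exist in `teams.yml`):

```yaml
labels:
  severity: s1
  pager: pagerduty   # also page the production on-call
  team: gitaly       # hypothetical team; must be defined in teams.yml
  # environment usually comes from the metric itself; override it only to
  # route to a different channel (stg, pre, or geo)
```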
### Annotations

As previously said, there are 2 static annotations that should always be used:

- `title`: this will be the first line of an alert; it should be simple and direct, stating what the problem is.
- `description`: a longer text providing data and context on what this alert is, ideally including information on how to deal with the alert or which steps to take. Since we are using YAML it’s now quite easy to write really long text, so we should be migrating the runbooks into the alerting.
Besides these, we can add any other annotation that we want. The way this works is that, for each alert that is grouped, the `fqdn` or `instance` will be printed under the description, one per line, and then any additional annotation will be printed next to it. This means we can define as many as we want; for example, in the previous alert there will be one line for each node that matches the expression, and it will include the text `node_load = X` once per host.
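As an illustration, extra annotations sit alongside `title` and `description` in the rule. Apart from `node_load` (taken from the sample above), the annotation names below are hypothetical; they only show that arbitrary additional annotations are allowed:

```yaml
annotations:
  title: "Emergency Alert"
  description: >
    Longer text with context and triage steps goes here.
  node_load: '{{ $value }}'            # printed once per matching host
  runbook: "docs/monitoring/alerts.md" # hypothetical extra annotation
```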
## Alertmanager

Alerts raised by Prometheus, as defined by the alert rules, are sent to the Alertmanager. The Alertmanager is configured with routing rules and receivers; incoming alerts from Prometheus are routed to receivers based on the routing rules.
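As a rough sketch of what receiver definitions look like in the Alertmanager config (the receiver names, channel, and key below are placeholders, not the production configuration):

```yaml
receivers:
  - name: slack_alerts                  # hypothetical receiver name
    slack_configs:
      - channel: '#alerts'              # placeholder channel
        send_resolved: true
  - name: pagerduty_oncall              # hypothetical receiver name
    pagerduty_configs:
      - service_key: '<PD service integration key>'   # never commit real keys
```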
## How to add new alerts

### Manual alerts

- Create a new file in `/mimir-rules/<tenant>` or add to an existing one.
- Add the appropriate fields.
- Add a `team` label if you want it routed to a specific Slack channel for a product team.
- Add a severity (`severity: s2` / `severity: s1`) & `pager: pagerduty` if you want it to page the on-call.

Example: registry queue
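In addition to the linked registry queue example, a minimal rule file might look like the sketch below. The file name, alert name, metric, and threshold are all hypothetical, and `<tenant>` stands for whichever tenant directory the rule belongs to:

```yaml
# /mimir-rules/<tenant>/registry-queue-example.yml  (hypothetical path)
groups:
  - name: registry.rules
    rules:
      - alert: RegistryQueueTooLarge        # hypothetical alert
        expr: registry_queue_size > 1000    # hypothetical metric and threshold
        for: 5m
        labels:
          severity: s2
          pager: pagerduty    # page the production on-call
          team: registry      # hypothetical team; routes to that team's Slack channel
        annotations:
          title: "Registry queue is too large"
          description: >
            The registry queue has been above threshold for 5 minutes.
```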
### Templating alerts

Sometimes you need to template alerts and tie them to our metric catalog. Writing them manually might not be ideal.

- Create a new jsonnet file in `/mimir-rules-jsonnet`
- Add a new alert
- Run `make generate`

Example: patroni alerts
## Where to find things

### Prometheus

We have 3 of those: 1 and 2 are internal to GitLab.com, and 3 is used for the public monitor.gitlab.net site. This is outdated; see the monitoring overview for more info.
### Alertmanager

We have one Alertmanager cluster. It is operated via helmfiles and runs in the ops GKE cluster.

It is accessed via Google IAP at https://alerts.gitlab.net

The Alertmanager-specific configuration, where routing and receivers are defined, can be found in the runbooks.

The alerting templates (which are built using Go’s text/template engine) can be found in the helmfiles.
## Silencing

In some cases you need to silence an alert for an expected condition. This should be done when there is something alarming that is understood and will eventually be fixed.

Silencing alerts helps reduce the broken-window effect: critical alarms should always be actionable, and if they aren’t, we should ideally change the alert or, if it is something temporary, silence it.
- Go to https://alerts.gitlab.net/#/silences and select “New Silence”
- Add matchers for the conditions to target the specific alert, for example:

  | Name       | Value               |
  | ---------- | ------------------- |
  | alertname  | LowDiskSpace        |
  | fqdn       | sync.geo.gitlab.com |
  | mountpoint | /var/opt/gitlab     |

- Set a duration, an owner and a comment.
- In the comment field, link to a change request, an infradev issue, or a related investigation issue with a DRI that was created from an incident. For the latter, create these issues using the “Related Issue” section of the incident template.
- Once a silence is created, the associated incident issue(s) should be closed.
Note: During a site-wide outage it may be desirable to globally silence alerts while we investigate issues. To create one, navigate to https://alerts.gitlab.net/#/silences and add a silence for `env=gprd` with a duration of <= 1 hour.
## Tips and tricks

- Trigger alerts to see how they work by writing a condition that will always be true.
- Print the whole alerting data model using a Go template printf, with a text such as `text: '{{ printf "%#v" . }}'` in the alert receiver configuration on the Alertmanager (see the sketch after this list).
- Verify the Alertmanager config using https://prometheus.io/webtools/alerting/routing-tree-editor/ . It will also visualize the routing tree for you. When pasting config, remove any ruby-templated strings and do not paste any secrets!
- Test changes to routing rules in the Alertmanager’s config using the routing tree editor and alerts, which can easily be found in Thanos.
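A sketch of where that debug template could sit, assuming a dedicated test receiver (the receiver and channel names are hypothetical):

```yaml
receivers:
  - name: slack_debug                 # hypothetical receiver for testing
    slack_configs:
      - channel: '#alerts-testing'    # placeholder channel
        # Dump the full data model passed to notification templates so you
        # can see every field available for formatting.
        text: '{{ printf "%#v" . }}'
```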