AlertmanagerNotificationsFailing
Overview
- This alert means that Alertmanager is failing to send notifications to an upstream service, usually PagerDuty or Slack.
- This can be due to an upstream service downtime or temporary networking issues.
- This affects the ability of our engineer on call (EOC) to be notified and take action on problems with the system.
- The recipient of the alert is expected to determine the cause of the notification failures and, if possible, take action to resolve the problem.
Services
Metrics
- This alert fires when 4 non-webhook notifications or 10 webhook notifications fail over the course of 5 minutes (a sketch of this condition follows this list).
- In normal circumstances, there should be no failed notifications.
- Dashboard Link
- When looking at the dashboard, anything over 0 indicates there have been recent failures.
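As a rough illustration of the thresholds above, the alert condition could be expressed in PromQL along the following lines. This is a sketch reconstructed from the description, not the exact rule deployed in the monitoring configuration, and the grouping on the integration label is an assumption.

```promql
# Sketch of the alert condition described above (not the deployed rule).
# Fires when at least 4 non-webhook or at least 10 webhook notifications
# fail within a 5-minute window.
  sum by (integration) (increase(alertmanager_notifications_failed_total{integration!="webhook"}[5m])) >= 4
or
  sum by (integration) (increase(alertmanager_notifications_failed_total{integration="webhook"}[5m])) >= 10
```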
Alert Behavior
- This alert can be silenced if we are aware of the issue and are working to resolve it.
- Additionally, consider silencing it if the problem is upstream and cannot be resolved by us.
- This alert is low volume and is expected to be rare.
Severities
- This alert is likely an S3.
- If this alert happens in conjunction with full metrics downtime, it is an S1 (a quick check is sketched after this list).
- This is a fully internal alert, primarily affecting the EOC and alerting visibility.
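One rough way to tell whether you are in the full-metrics-downtime case, assuming a Prometheus instance is still reachable, is to check that targets are still being scraped. If the query below returns no result, or Prometheus cannot be queried at all, treat the incident as an S1.

```promql
# Count of targets currently reporting as healthy. No result (or an
# unreachable Prometheus) suggests full metrics downtime rather than an
# isolated notification-delivery failure.
count(up == 1)
```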
Verification
Recent changes
- Recent Change Requests
- Recent Helm Merge Requests
- Typically, changes to alertmanager will be in the releases/30-gitlab-monitoring directory.
Troubleshooting
- Check the Alertmanager logs to find out why it could not send alerts.
  - In the gitlab-ops project of Google Cloud, open the Workloads section under the Kubernetes Engine section of the web console. Select the Alertmanager workload, named alertmanager-gitlab-monitoring-promethe-alertmanager. Here you can see details for the Alertmanager pods and select Container logs to review the logs.
  - The Alertmanager pod is very quiet except for errors, so it should be quickly obvious if it could not contact a service.
- Determine which integration is failing.
  - In Prometheus, run this query: rate(alertmanager_notifications_failed_total[10m])
  - This will give you a breakdown of which integration is failing, and from which server. A related failure-ratio query is sketched after this list.
  - For the slackline, you can view the alertManagerBridge cloud function, its logs, and code.
- Keep in mind that, if nothing has changed, the problem is likely to be on the remote side, for example a Slack or PagerDuty issue.
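If the raw failure counts are noisy, a per-integration failure ratio can make the failing integration stand out. This is a suggested companion query, assuming the standard alertmanager_notifications_total counter is also exposed alongside the failure counter.

```promql
# Fraction of notification attempts that failed per integration over the
# last 10 minutes; a value near 1 means that integration is failing nearly
# every delivery attempt.
  sum by (integration) (rate(alertmanager_notifications_failed_total[10m]))
/
  sum by (integration) (rate(alertmanager_notifications_total[10m]))
```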
Possible Resolutions
- 2023-05-17: Alertmanager failed due to Slack service degradation
- 2023-02-26: Notifications Failing due to channel name change
- 2023-01-30: Notifications failing due to template parsing errors
- 2020-08-13: AlertmanagerNotificationsFailing incident
Dependencies
- PagerDuty
- Slack
- GCP Networking
- If the problem persists with no known upstream cause, escalate to the Scalability-Observability team.
- #g_scalability-observability
- This alert is unlikely to need tuning or modification; however, in the past we have changed the wait time before the alert fires when only a few webhooks failed and it recovered immediately.