AlertmanagerNotificationsFailing
Overview
- This alert means that Alertmanager is failing to send notifications to an upstream service, usually PagerDuty or Slack.
- This can be due to an upstream service downtime or temporary networking issues.
- This affects the ability of our engineer on call (EOC) to be notified and take action on problems with the system.
- The recipient of the alert is expected to determine the cause of the notification failures and, if possible, take action to resolve the problem.
Services
Metrics
- This alert fires when 4 non-webhook notifications or 10 webhook notifications fail over the course of 5 minutes (a sketch of this condition follows this list).
- In normal circumstances, there should be no failed notifications.
- Dashboard Link
- When looking at the dashboard, anything over 0 indicates there have been recent failures.
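As a rough illustration of the thresholds above, the alert condition could be expressed in PromQL along the following lines. This is a sketch reconstructed from the description, not the exact rule deployed in the monitoring configuration, and the grouping on the integration label is an assumption.

```promql
# Sketch of the alert condition described above (not the deployed rule).
# Fires when at least 4 non-webhook or at least 10 webhook notifications
# fail within a 5-minute window.
  sum by (integration) (increase(alertmanager_notifications_failed_total{integration!="webhook"}[5m])) >= 4
or
  sum by (integration) (increase(alertmanager_notifications_failed_total{integration="webhook"}[5m])) >= 10
```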
Alert Behavior
- This alert can be silenced if we are aware of the issue and are working to resolve it.
- Additionally, consider silencing it if the problem is upstream and cannot be resolved by us.
- This alert is low volume and is expected to be rare.
Severities
- This alert is likely an S3.
- If this alert happens in conjunction with full metrics downtime, it is an S1 (a quick check is sketched after this list).
- This is a fully internal alert, primarily affecting the EOC and alerting visibility.
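One rough way to tell whether you are in the full-metrics-downtime case, assuming a Prometheus instance is still reachable, is to check that targets are still being scraped. If the query below returns no result, or Prometheus cannot be queried at all, treat the incident as an S1.

```promql
# Count of targets currently reporting as healthy. No result (or an
# unreachable Prometheus) suggests full metrics downtime rather than an
# isolated notification-delivery failure.
count(up == 1)
```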
Verification
Recent changes
- Recent Change Requests
- Recent Helm Merge Requests
- Typically, changes to alertmanager will be in the releases/30-gitlab-monitoring directory.
Troubleshooting
- Check the Alertmanager logs to find out why it could not send alerts.
  - In the gitlab-ops project of Google Cloud, open the Workloads section under the Kubernetes Engine section of the web console. Select the Alertmanager workload, named alertmanager-gitlab-monitoring-promethe-alertmanager. Here you can see details for the Alertmanager pods and select Container logs to review the logs.
  - The Alertmanager pod is very quiet except for errors, so it should be quickly obvious if it could not contact a service.
- Determine which integration is failing.
  - In Prometheus, run this query: rate(alertmanager_notifications_failed_total[10m])
  - This will give you a breakdown of which integration is failing, and from which server. A related failure-ratio query is sketched after this list.
  - For the slackline, you can view the alertManagerBridge cloud function, its logs, and code.
- Keep in mind that, if nothing has changed, the problem is likely to be on the remote side, for example a Slack or PagerDuty issue.
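If the raw failure counts are noisy, a per-integration failure ratio can make the failing integration stand out. This is a suggested companion query, assuming the standard alertmanager_notifications_total counter is also exposed alongside the failure counter.

```promql
# Fraction of notification attempts that failed per integration over the
# last 10 minutes; a value near 1 means that integration is failing nearly
# every delivery attempt.
  sum by (integration) (rate(alertmanager_notifications_failed_total[10m]))
/
  sum by (integration) (rate(alertmanager_notifications_total[10m]))
```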
Possible Resolutions
- 2023-05-17: Alertmanager failed due to Slack service degradation
- 2023-02-26: Notifications Failing due to channel name change
- 2023-01-30: Notifications failing due to template parsing errors
- 2020-08-13: AlertmanagerNotificationsFailing incident
Dependencies
- PagerDuty
- Slack
- GCP Networking
- If the problem persists with no known upstream cause, escalate to the Scalability-Observability team.
- #g_scalability-observability
- This alert is unlikely to need tuning or modification; however, in the past we have changed the wait time before the alert fires when only a few webhooks failed and it recovered immediately.