AlertmanagerNotificationsFailing

  • This alert means that Alertmanager is failing to send notifications to an upstream service, usually PagerDuty or Slack.
  • This can be due to an upstream service downtime or temporary networking issues.
  • This affects the ability of our engineer on call (EOC) to be notified and take action on problems with the system.
  • The recipient of the alert is expected to determine the cause of the notification failures and, if possible, take action to resolve the problem.
  • This alert fires when 4 non-webhook notifications or 10 webhook notifications fail over the course of 5 minutes.
  • In normal circumstances, there should be no failed notifications.
  • Dashboard Link
  • When looking at the dashboard, anything over 0 indicates there have been recent failures.
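
If the dashboard is unavailable, the same signal can be pulled straight from Prometheus. Below is a minimal sketch assuming the standard alertmanager_notifications_failed_total metric and a reachable Prometheus HTTP API; the PROM_URL placeholder and the exact query shape are assumptions, not the alert's actual rule definition.

```python
import requests

# Assumption: replace with the environment's Prometheus endpoint.
PROM_URL = "http://prometheus.example.internal:9090"

# Failed notifications over the last 5 minutes, split per integration.
# alertmanager_notifications_failed_total is the standard Alertmanager metric;
# this query is a sketch, not the exact alerting rule.
QUERY = "sum by (integration) (increase(alertmanager_notifications_failed_total[5m]))"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    integration = result["metric"].get("integration", "unknown")
    failures = float(result["value"][1])
    # Thresholds mirror the alert description above: 10 for webhook, 4 otherwise.
    threshold = 10 if integration == "webhook" else 4
    status = "at firing level" if failures >= threshold else "ok"
    print(f"{integration}: {failures:.0f} failed in 5m ({status}, threshold {threshold})")
```

Any non-zero count here corresponds to the "over 0" condition mentioned above and is worth investigating.
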
  • This alert can be silenced if we are aware of the issue and are working to resolve it.
  • Additionally, consider silencing it if the problem is upstream and cannot be resolved by us.
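
Silences are normally created through the Alertmanager UI. For completeness, here is a sketch of creating the same silence against the Alertmanager v2 API; the ALERTMANAGER_URL, duration, and comment are placeholders, and the matcher assumes the alert name at the top of this page.

```python
from datetime import datetime, timedelta, timezone

import requests

# Assumption: replace with the environment's Alertmanager endpoint.
ALERTMANAGER_URL = "http://alertmanager.example.internal:9093"

now = datetime.now(timezone.utc)
silence = {
    "matchers": [
        {"name": "alertname", "value": "AlertmanagerNotificationsFailing", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=2)).isoformat(),  # placeholder duration
    "createdBy": "eoc",
    "comment": "Known upstream issue; tracked in the incident issue.",
}

resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=silence, timeout=10)
resp.raise_for_status()
print("Created silence:", resp.json()["silenceID"])
```
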
  • This alert is low volume and is expected to be rare.
  • This alert is likely an S3.
  • If this alert happens in conjunction with full metrics downtime, it is an S1.
  • This is a fully internal alert, primarily affecting the EOC and alerting visibility.
  • Check the Alertmanager logs to find out why it could not send alerts (a command-line sketch follows these steps).
    • In the gitlab-ops project of Google Cloud, open the Workloads section under Kubernetes Engine in the web console. Select the Alertmanager workload, named alertmanager-gitlab-monitoring-promethe-alertmanager. From there you can see details for the Alertmanager pods and select Container logs to review them.
    • The Alertmanager pod is very quiet except for errors, so it should be quickly obvious if it could not contact a service.
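
If you have kubectl access to the ops cluster, the same logs can be fetched without the web console. The sketch below uses the official Python Kubernetes client; the namespace and label selector are assumptions, so confirm them against the workload details shown in the console.

```python
from kubernetes import client, config

# Assumptions: namespace and label selector are illustrative; confirm against
# the alertmanager-gitlab-monitoring-promethe-alertmanager workload in the console.
NAMESPACE = "monitoring"
LABEL_SELECTOR = "app=alertmanager"

config.load_kube_config()  # uses the current kubectl context (the ops cluster)
core = client.CoreV1Api()

pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
for pod in pods.items:
    print(f"--- {pod.metadata.name} ---")
    # The pod logs are quiet except for errors, so a short tail is usually enough.
    print(core.read_namespaced_pod_log(
        name=pod.metadata.name,
        namespace=NAMESPACE,
        container="alertmanager",
        tail_lines=200,
    ))
```
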
  • Determine which integration is failing (the integration label on the failure metric and the error lines in the logs will point to it).
  • Keep in mind that, if nothing has changed on our side, the problem is likely to be on the remote side, for example a Slack or PagerDuty issue.