
PatroniScrapeFailures

Overview

This alert fires when a configured Prometheus scrape target is not responding.
Contributing factors may include high system load, an offline node, or exporter processes that are offline or otherwise unresponsive.
When this alert fires, it may indicate a severe problem with the host, or that metrics are no longer being collected, leaving us unable to detect further problems.

Services

This particular alert is scoped only to nodes supporting the Patroni service.

Metrics

This uses the up metric, a time series that Prometheus generates automatically for every scrape target. Its value is 1 if the most recent scrape attempt succeeded and 0 if it failed.
In a healthy state, the alert query is expected to return an empty result. Note that we exclude the pgbouncer scrape job from the query, because that job is incorrectly configured in the Mimir prometheus-agent ScrapeConfig.
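The evaluation described above can be sketched in Python. This is a minimal illustration, not the alert definition itself: the label names fqdn and job, the host names, and the sample values are assumptions made for the example. It takes the minimum up value per host, skips the pgbouncer job, and reports hosts whose minimum is 0.

```python
# Hypothetical samples of the `up` series as (labels, value) pairs.
# Label names ("fqdn", "job") and all values are illustrative assumptions.
SAMPLES = [
    ({"fqdn": "patroni-01", "job": "node"}, 1),
    ({"fqdn": "patroni-01", "job": "postgres"}, 1),
    ({"fqdn": "patroni-01", "job": "pgbouncer"}, 0),  # excluded job
    ({"fqdn": "patroni-02", "job": "node"}, 1),
    ({"fqdn": "patroni-02", "job": "postgres"}, 0),
]


def hosts_with_scrape_failures(samples, excluded_jobs=("pgbouncer",)):
    """Return hosts whose minimum `up` value is 0, ignoring excluded jobs."""
    min_up = {}
    for labels, value in samples:
        if labels["job"] in excluded_jobs:
            continue
        host = labels["fqdn"]
        # `up` is either 0 or 1, so 1 is a safe starting value for min().
        min_up[host] = min(value, min_up.get(host, 1))
    return sorted(host for host, value in min_up.items() if value == 0)


print(hosts_with_scrape_failures(SAMPLES))
```

An empty list corresponds to the healthy state in which the alert query returns no result.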

Alert Behavior

This alert is intended to fire regardless of a host’s power state. Because of this, a silence should be created in Alertmanager prior to powering off any instances to avoid unwanted alerts.
This alert will fire if any single scrape target on a host is failing to be scraped. You can determine the specific failing scrape jobs by removing the min() aggregator from the Prometheus query.
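For planned power-offs, the silence can be prepared ahead of time. The sketch below builds a request body in the shape used by the Alertmanager v2 silences endpoint (POST /api/v2/silences). The fqdn label, the host name, and the build_silence helper are hypothetical; check the labels on the firing alert before relying on these matchers.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(fqdn, hours, created_by, comment, now=None):
    """Build a silence payload for one host (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    ends = now + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": "alertname", "value": "PatroniScrapeFailures",
             "isRegex": False, "isEqual": True},
            # Assumption: the alert carries an "fqdn" label naming the host.
            {"name": "fqdn", "value": fqdn,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": ends.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


payload = build_silence("patroni-02.example.internal", 2,
                        "oncall", "Planned maintenance power-off")
print(json.dumps(payload, indent=2))
```

The same silence can be created through the Alertmanager web UI; the payload only shows which fields need to be filled in.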

Severities

If this fires and you aren’t intentionally powering down a VM, always assume this is a high-severity alert.
When this fires, either a node has failed in a way that could directly impact our customers, or metrics collection has stopped and we are blind to future issues.

Verification

Determine whether a single exporter or all exporters on the host are failing to be scraped, using this example query.
Check that the host is responsive to SSH connections.
Check the GCP console for any system logs that may indicate a problem.
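The first verification step can be sketched as follows: given per-job up values for a host (what the query returns once the min() aggregator is removed), decide whether one exporter or every exporter is failing. The label names, host names, and sample values are assumptions for illustration. All exporters failing suggests a host-level problem, while a single failure points more toward that exporter process.

```python
# Hypothetical `up` samples as (labels, value); label names are assumptions.
SAMPLES = [
    ({"fqdn": "patroni-01", "job": "node"}, 0),
    ({"fqdn": "patroni-01", "job": "postgres"}, 0),
    ({"fqdn": "patroni-02", "job": "node"}, 1),
    ({"fqdn": "patroni-02", "job": "postgres"}, 0),
]


def classify_host(samples, host):
    """Return (status, failing_jobs) for one host."""
    up_by_job = {labels["job"]: value
                 for labels, value in samples
                 if labels["fqdn"] == host}
    failing = sorted(job for job, value in up_by_job.items() if value == 0)
    if not up_by_job:
        status = "no targets found"
    elif not failing:
        status = "healthy"
    elif len(failing) == len(up_by_job):
        status = "all exporters failing"
    else:
        status = "some exporters failing"
    return status, failing


for host in ("patroni-01", "patroni-02"):
    print(host, classify_host(SAMPLES, host))
```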

Troubleshooting

Check that the host is responsive to ping, SSH connections, etc.
Check the GCP console for any system logs that may indicate a problem.
If you can get an SSH connection to the host, check for OOM kills that may have impacted running exporters.
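One way to check for OOM kills is to scan the kernel log for "Killed process" entries. The sketch below parses text such as the output of dmesg or journalctl -k. The sample log lines and process names are invented for illustration, and the exact message format can vary between kernel versions.

```python
import re

# Matches the "Killed process <pid> (<name>)" part of a kernel OOM message.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) for each OOM kill found."""
    return [(int(m.group(1)), m.group(2))
            for m in OOM_KILL.finditer(log_text)]


# Illustrative log excerpt, not taken from a real host.
SAMPLE_LOG = """\
[12345.678] Out of memory: Killed process 4242 (node_exporter) total-vm:1234kB, anon-rss:567kB
[12346.000] systemd[1]: Started Session 12 of user admin.
[12399.100] Out of memory: Killed process 5151 (postgres_export) total-vm:999kB, anon-rss:111kB
"""

for pid, name in find_oom_kills(SAMPLE_LOG):
    print(f"OOM kill: pid={pid} name={name}")
```

If an exporter process appears in the results, that is a likely explanation for its scrape target going down.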

Possible Resolutions

If the host is responsive and handling query traffic normally, attempt to restart the exporter services on the machine.
If the machine is locked up or unresponsive, a reboot may be necessary.
Slack channel: #g_database_operations
Slack group: @dbo
Alert definition
Related alerts