PatroniScrapeFailures
Overview
- This alert fires when a configured Prometheus scrape target is not responding.
- Contributing factors include high system load, an offline node, or exporter processes that are offline or otherwise unresponsive.
- When this alert fires, it may indicate a severe problem with the host, or that a lack of metrics has left us blind to future problems.
Services
- This particular alert is scoped only to nodes supporting the Patroni service.
Metrics
- This uses the `up` metric that Prometheus generates automatically for every scrape target, which indicates whether the most recent scrape attempt of that target succeeded (`1`) or failed (`0`).
- The query is expected to return an empty result while all targets are healthy. Note that we exclude the pgbouncer scrape job from the query because it is incorrectly configured via the Mimir prometheus-agent ScrapeConfig.
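To illustrate why an empty result means "healthy", here is a minimal Python sketch that mimics the shape of such a query on hand-written samples. The label name `fqdn`, the job names, and the PromQL in the comment are assumptions for illustration, not the actual alert expression.

```python
# Hypothetical PromQL shape (label and job names are assumptions):
#   min by (fqdn) (up{type="patroni", job!="pgbouncer"}) == 0

def failing_hosts(samples, excluded_jobs=("pgbouncer",)):
    """Return hosts whose minimum `up` value across scrape jobs is 0."""
    minimum = {}
    for sample in samples:
        if sample["job"] in excluded_jobs:
            continue  # mirrors the pgbouncer filter described above
        host = sample["fqdn"]
        minimum[host] = min(minimum.get(host, 1), sample["up"])
    return sorted(host for host, value in minimum.items() if value == 0)


# Illustrative data only, not real scrape results.
samples = [
    {"fqdn": "patroni-01", "job": "node", "up": 1},
    {"fqdn": "patroni-01", "job": "postgres", "up": 1},
    {"fqdn": "patroni-01", "job": "pgbouncer", "up": 0},  # excluded job
    {"fqdn": "patroni-02", "job": "node", "up": 1},
    {"fqdn": "patroni-02", "job": "postgres", "up": 0},
]

print(failing_hosts(samples))  # ['patroni-02']
```

A host appears in the output as soon as any one of its non-excluded jobs reports `0`, which matches the "empty result when healthy" expectation.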
Alert Behavior
- This alert is intended to fire regardless of a host’s power state. Because of this, create a silence in Alertmanager before powering off any instances to avoid unwanted alerts.
- This alert fires if any single scrape target on a host is failing to be scraped. You can determine the specific failing scrape jobs by removing the `min()` aggregator from the Prometheus query.
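A silence for planned maintenance could be prepared along these lines. This is a sketch of a request body for the Alertmanager v2 silences endpoint (`POST /api/v2/silences`); the `fqdn` matcher label, the host name, and the helper name `build_silence` are assumptions, so check the labels on the live alert before using it.

```python
from datetime import datetime, timedelta, timezone


def build_silence(host, hours, created_by, comment, now=None):
    """Build a silence body for one host and this alert name."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "alertname", "value": "PatroniScrapeFailures",
             "isRegex": False, "isEqual": True},
            # Assumed host label; the real alert may use a different one.
            {"name": "fqdn", "value": host,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


payload = build_silence(
    "patroni-01.example.internal",  # hypothetical host
    hours=2,
    created_by="oncall",
    comment="Planned power-off",
)
print(payload["matchers"][0]["value"])  # PatroniScrapeFailures
```

Keeping the silence scoped to one host and a short window avoids hiding unrelated scrape failures on other Patroni nodes.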
Severities
- If this fires and you aren’t intentionally powering down a VM, always treat it as a high-severity alert.
- When this fires, either a node has failed in a way that could directly impact our customers, or we are blind to future issues because metrics collection is not working.
Verification
- Verify whether a single exporter or all exporters on the host are failing to be scraped, for example by querying `up` for the host without the `min()` aggregator.
- Check that the host is responsive to SSH connections.
- Check the GCP console for any system logs that may indicate a problem.
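The single-exporter versus whole-host check can be sketched as a small helper over the `result` list of a Prometheus instant query for `up`. The `fqdn` label, the job names, and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def scrape_breakdown(result, host_label="fqdn"):
    """Map each host to (failing_jobs, verdict); verdict is 'all', 'some' or 'none'."""
    by_host = defaultdict(dict)
    for series in result:
        labels = series["metric"]
        # Instant-query values arrive as [timestamp, "string value"].
        by_host[labels[host_label]][labels["job"]] = float(series["value"][1])

    report = {}
    for host, jobs in by_host.items():
        failing = sorted(job for job, value in jobs.items() if value == 0)
        if not failing:
            verdict = "none"
        elif len(failing) == len(jobs):
            verdict = "all"   # points at the host itself
        else:
            verdict = "some"  # points at individual exporters
        report[host] = (failing, verdict)
    return report


# Illustrative data only, not real scrape results.
result = [
    {"metric": {"fqdn": "patroni-01", "job": "node"}, "value": [1700000000, "1"]},
    {"metric": {"fqdn": "patroni-01", "job": "postgres"}, "value": [1700000000, "0"]},
    {"metric": {"fqdn": "patroni-02", "job": "node"}, "value": [1700000000, "0"]},
    {"metric": {"fqdn": "patroni-02", "job": "postgres"}, "value": [1700000000, "0"]},
]

print(scrape_breakdown(result)["patroni-01"])  # (['postgres'], 'some')
```

A verdict of `all` suggests moving straight to the host-level checks (SSH, GCP console), while `some` suggests looking at the named exporters first.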
Troubleshooting
- Check that the host is responsive to ping, SSH connections, etc.
- Check the GCP console for any system logs that may indicate a problem.
- If you can get an SSH connection to the host, check for OOM kills that may have impacted running exporters.
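As one way to spot OOM kills, the sketch below scans kernel log lines (for example, output saved from `dmesg` or `journalctl -k`) for the kernel's "Killed process" message. The sample lines are illustrative, and the exact message wording can vary between kernel versions, so treat the pattern as an assumption to verify on the host.

```python
import re

# Matches the "Killed process <pid> (<name>)" fragment of an OOM-killer message.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_killed(log_lines):
    """Return (pid, process_name) for each OOM kill found in the given lines."""
    hits = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Illustrative log lines, not taken from a real host.
sample_lines = [
    "[12345.678] Out of memory: Killed process 4242 (node_exporter) total-vm:1024kB",
    "[12346.000] eth0: link up",
]

print(oom_killed(sample_lines))  # [(4242, 'node_exporter')]
```

If an exporter process name shows up here, that points toward memory pressure on the host as the cause of the failed scrapes.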
Possible Resolutions
- Attempt to restart the exporter services on the machine if the host is responsive and handling query traffic normally.
- If the machine is locked up or unresponsive, a reboot may be necessary.
- Slack channel: #g_database_operations
- Slack group: @dbo