
BlackboxProbeFailures

  • The BlackboxProbeFailures alert notifies you when the success rate of probes executed by the Blackbox exporter falls below 75% for 10 minutes. The alert covers the instances in gprd, excluding the following:

    • https://ops.gitlab.net/users/sign_in
    • https://dev.gitlab.org.*
    • https://pre.gitlab.com
    • https://registry.pre.gitlab.com
    • https://status.gitlab.com
    • https://new-sentry.gitlab.net
    • https://staging.gitlab.com.*
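
The exclusion list above can be read as a set of patterns matched against each probed instance. The following is a minimal sketch only, assuming the entries are applied as regular expressions on the instance label (the trailing `.*` on two entries suggests this, but the alert rule itself is not shown here). The names `EXCLUDED_PATTERNS` and `is_excluded` are hypothetical.

```python
import re

# Patterns copied from the exclusion list above. Treating them as regular
# expressions is an assumption made for this sketch.
EXCLUDED_PATTERNS = [
    r"https://ops.gitlab.net/users/sign_in",
    r"https://dev.gitlab.org.*",
    r"https://pre.gitlab.com",
    r"https://registry.pre.gitlab.com",
    r"https://status.gitlab.com",
    r"https://new-sentry.gitlab.net",
    r"https://staging.gitlab.com.*",
]


def is_excluded(instance: str) -> bool:
    """Return True if the instance fully matches any exclusion pattern."""
    # re.fullmatch mirrors Prometheus regex label matchers, which are
    # anchored at both ends of the label value.
    return any(re.fullmatch(pattern, instance) for pattern in EXCLUDED_PATTERNS)
```

Under this reading, an entry without a trailing `.*` excludes only that exact instance, while `https://staging.gitlab.com.*` excludes every instance that starts with that prefix.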
  • A variety of factors can cause a probe to fail: a GCP outage, Cloudflare event, expired SSL certificate, or a breaking change.

  • The affected service depends on the endpoint for which the probes failed. The team that owns the service can be found in the Service Catalog by searching for the service name.

  • The recipient should check whether the endpoint is reachable. If it is not, check the logs to determine why the endpoint is unreachable, then fix the problem or escalate it.
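
A first reachability check can be scripted. The sketch below is one possible approach using only the Python standard library, not the procedure the Blackbox exporter itself follows. The function name `check_endpoint` and its `opener` parameter are hypothetical and exist only to keep the example testable. It runs from the responder's machine, so its result may differ from what the exporter sees.

```python
import socket
import ssl
import urllib.error
import urllib.request


def check_endpoint(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (reachable, detail) for a single HTTP GET against url."""
    try:
        with opener(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return False, f"HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # DNS failures, refused connections, and TLS errors are wrapped here.
        if isinstance(exc.reason, ssl.SSLError):
            return False, f"TLS error: {exc.reason}"
        return False, f"connection error: {exc.reason}"
    except (socket.timeout, TimeoutError):
        # A timeout while reading the response is raised unwrapped.
        return False, "timed out"
```

Calling `check_endpoint("https://pre.gitlab.com")` returns a pair such as `(True, "HTTP 200")` when the endpoint answers. A `TLS error` detail points toward a certificate problem, while a `connection error` points toward DNS or network causes.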

  • The metric in the provided Prometheus query is based on the success rate of probes executed by a Prometheus blackbox exporter. Link to the metrics catalog

    avg_over_time(probe_success{...}[10m]) * 100 < 75: This part of the query calculates the average success rate over the past 10 minutes. The probe_success metric indicates whether the probe was successful (1 for success, 0 for failure). Multiplying by 100 converts this rate to a percentage. The condition < 75 triggers the alert if the average success rate falls below 75%.
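
The arithmetic behind that expression can be illustrated outside Prometheus. This is a simplified sketch, assuming the 10-minute window is already available as a list of 0/1 `probe_success` samples. The helper names `success_rate_percent` and `should_fire` are hypothetical.

```python
def success_rate_percent(samples):
    """Average a window of 0/1 probe_success samples, as a percentage."""
    if not samples:
        raise ValueError("no samples in the window")
    # Multiply before dividing so simple ratios stay exact in floating point.
    return 100 * sum(samples) / len(samples)


def should_fire(samples, threshold=75.0):
    """Mirror the '< 75' condition: strictly below the threshold."""
    return success_rate_percent(samples) < threshold


# Seven successes out of ten probes in the window.
window = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
print(success_rate_percent(window), should_fire(window))
```

Because the comparison is strict, a window with exactly 75% successful probes does not satisfy the condition.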

  • Given the reliance on DNS and network connectivity, the blackbox thresholds are chosen to minimize false alerts for minor and transient problems outside our control. A false alarm is still possible, but even when the cause is not service-related, we want the engineer on call to be aware of any failure that lasts more than 10 minutes.

  • Dashboard with the alert expression

  • Example of the metric while the alert is firing:

    Alert Firing

  • Example of the metric under normal conditions:

    Alert Normal

  • A metric value below 75 shows that some probes are failing.

  • We can silence this alert by going here, finding BlackboxProbeFailures, and clicking the silence option. Silencing might be required if the alert is caused by an external dependency outside our control.

  • This alert is fairly common; past hits can be seen here.

  • Previous incidents of this alert firing

  • The alert might trigger due to a variety of factors, such as: a GCP outage, Cloudflare event, expired SSL certificate, or a breaking change related to the endpoint.

Slack channels to look for assistance: