
BlackboxProbeFailures

  • The alert BlackboxProbeFailures notifies you when the success rate of probes executed by the Blackbox exporter falls below 75% for 10 minutes. The alert covers all instances in gprd except the following:

    • https://ops.gitlab.net/users/sign_in
    • https://dev.gitlab.org.*
    • https://pre.gitlab.com
    • https://registry.pre.gitlab.com
    • https://release.gitlab.net
    • https://status.gitlab.com
    • https://new-sentry.gitlab.net
  • A variety of factors can cause a probe to fail: a GCP outage, Cloudflare event, expired SSL certificate, or a breaking change.

  • The service affected depends on the endpoint the probes failed for. The team owning the service can be determined in the Service Catalog by searching for the service name.

  • The recipient should check whether the endpoint is reachable. If it is not, check the logs, try to determine why the endpoint is unreachable, and then fix it or escalate.

  • The metric in the provided Prometheus query is based on the success rate of probes executed by a Prometheus blackbox exporter. Link to the metrics catalog

    avg_over_time(probe_success{...}[10m]) * 100 < 75: This part of the query calculates the average success rate over the past 10 minutes. The probe_success metric indicates whether the probe was successful (1 for success, 0 for failure). Multiplying by 100 converts this rate to a percentage. The condition < 75 triggers the alert if the average success rate falls below 75%.

  • Given the reliance on DNS and network connectivity, the blackbox thresholds are chosen to minimize false alerts for minor and transient problems outside our control. It’s still possible that a false alarm could result, but even if there is a non-service related cause for more than 10 minutes, we would want the engineer on call to be aware of it.
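
To make the threshold concrete, here is a minimal sketch of the alert arithmetic in plain Python. It assumes one probe_success sample per minute, which is an illustrative assumption and not the real scrape interval. The function names are hypothetical and not part of Prometheus or the exporter.

```python
def success_percentage(samples):
    """Mimic avg_over_time(probe_success[10m]) * 100 for a list of 0/1 samples."""
    if not samples:
        raise ValueError("no samples in window")
    return 100 * sum(samples) / len(samples)


def would_alert(samples, threshold=75):
    """True when the windowed success percentage is strictly below the threshold."""
    return success_percentage(samples) < threshold


# Assumed sampling: one probe result per minute over a 10-minute window.
# Two failed probes out of ten: 80% success, above the threshold.
transient = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
# Three failed probes out of ten: 70% success, below the threshold.
sustained = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

print(success_percentage(transient), would_alert(transient))
print(success_percentage(sustained), would_alert(sustained))
```

Under these assumptions, a short blip of two failed probes leaves the window at 80% and does not cross the threshold, while three failures in the same window bring it to 70% and do. Because the comparison is strict, a window at exactly 75% does not satisfy the condition.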

  • Dashboard when the alert is firing

Alert Firing

Alert Normal

  • Any value of the metric below 75 shows that probes are failing.
  • We can silence this alert by going here, finding BlackboxProbeFailures, and clicking the silence option. Silencing might be required if the alert is caused by an external dependency outside our control.

  • This alert is fairly common; past hits can be seen here

  • Previous incidents of this alert firing

  • The incident severity can range from Sev4 to Sev1 depending on the endpoint.
  • The impact depends on the endpoint being affected, as failures on certain endpoints will impact our customers.
  • Handbook Link to better decide the severity of the incident.
  • The blackbox exporter keeps logs from failed probes in memory and exposes them over a web interface. You can access it by using port forwarding, and then navigating to http://localhost:9115

    ssh blackbox-01-inf-gprd.c.gitlab-production.internal -L 9115:localhost:9115

Please note that the exporter keeps only up to 1000 results and drops older ones, so make sure to grab these as quickly as possible, before they expire.
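
Since results expire, it can help to save a copy of the page as soon as the port forward is up. The following is a minimal sketch, assuming the SSH tunnel above is already running. The helper names and the output directory are hypothetical, and the script saves only the page served at the given URL, not any pages linked from it.

```python
import datetime
import pathlib
import urllib.request


def fetch(url, timeout=10):
    """Return the raw body served at url."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_snapshot(body, out_dir, now=None):
    """Write body to a timestamped HTML file in out_dir and return its path."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: blackbox-<UTC timestamp>.html
    path = out_dir / f"blackbox-{now:%Y%m%dT%H%M%SZ}.html"
    path.write_bytes(body)
    return path
```

With the tunnel active, calling `save_snapshot(fetch("http://localhost:9115"), "blackbox-snapshots")` would write one timestamped file per call, so repeated calls during an incident do not overwrite each other.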

  • The alert might trigger due to a variety of factors, such as: a GCP outage, Cloudflare event, expired SSL certificate, or a breaking change related to the endpoint.
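
Of the causes listed above, an expired SSL certificate is one that can be checked directly. The sketch below uses only the Python standard library. The function names are hypothetical, and the host to check is whatever endpoint the failing probe targets.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=10):
    """Return the notAfter field of the certificate presented by host.

    The default context verifies the certificate, so a certificate that
    has already expired raises ssl.SSLCertVerificationError here instead
    of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Convert a notAfter string such as 'Jan  5 09:34:43 2018 GMT' to days remaining."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

Calling `days_until_expiry(fetch_not_after(host))` for the affected hostname gives the remaining validity in days. A verification error raised during the handshake is itself a strong hint that the certificate is the problem.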

Slack channels to look for assistance: