Skip to content

ContainerRegistryDBHighReplicaConnectivityQuarantineRate

This alert is triggered when replicas are being quarantined due to connectivity issues at a rate exceeding 0.05 quarantines/second for 10 minutes. Replicas are quarantined for connectivity issues when:

  • Consecutive failures: A replica fails to connect 3 times in a row;
  • Flapping behavior: A replica is added/removed from the pool 5+ times within 60 seconds.

This indicates network issues, replica instability, or infrastructure problems affecting database connectivity.

This alert is based on registry_database_lb_pool_events_total with labels event="replica_quarantined", reason="connectivity". The alert fires when the rate exceeds 0.05/second (~3 quarantines/minute) sustained for 10 minutes.

Quarantined replicas are automatically reintegrated after a 5-minute cooldown period.

This alert indicates active connectivity problems. The quarantine mechanism is protecting the load balancer from repeatedly attempting connections to unstable replicas.

  • s3: Connectivity issues are being mitigated by the quarantine mechanism, but underlying problems need investigation.
  • Metrics:

    • registry: Database Detail - Load Balancing panel
    • Check registry_database_lb_pool_events_total{event="replica_quarantined", reason="connectivity"}
    • Check registry_database_lb_pool_size for current pool size
  • Logs: Filter by json.msg: "replica quarantined" to identify affected replicas and reasons.

Recent registry deployments and configuration changes can be found here.

  1. Identify which replicas are being quarantined from logs:
  2. Check network connectivity from registry pods to those specific replicas:
    • Test connectivity using kubectl exec into a registry pod and attempting to reach the replica.
  3. Check PgBouncer status on affected replicas:
  4. Check Patroni cluster health for the affected replicas:
  5. Look for any network policy changes or firewall issues.
  • Restore network connectivity to affected replicas;
  • Investigate and fix PgBouncer issues on replica hosts;
  • Address Patroni cluster member health issues;
  • Quarantined replicas will auto-reintegrate after 5 minutes once connectivity is restored.
  • PgBouncer
  • Patroni
  • Network infrastructure

Escalate if multiple replicas are being quarantined or if pool size drops significantly:

The definition for this alert can be found at: