ContainerRegistryDBHighReplicaConnectivityQuarantineRate
Overview
Section titled “Overview”This alert is triggered when replicas are being quarantined due to connectivity issues at a rate exceeding 0.05 quarantines/second for 10 minutes. Replicas are quarantined for connectivity issues when:
- Consecutive failures: A replica fails to connect 3 times in a row;
- Flapping behavior: A replica is added/removed from the pool 5+ times within 60 seconds.
This indicates network issues, replica instability, or infrastructure problems affecting database connectivity.
Services
Section titled “Services”Metrics
Section titled “Metrics”This alert is based on registry_database_lb_pool_events_total with labels event="replica_quarantined", reason="connectivity". The alert fires when the rate exceeds 0.05/second (~3 quarantines/minute) sustained for 10 minutes.
Quarantined replicas are automatically reintegrated after a 5-minute cooldown period.
Alert Behavior
Section titled “Alert Behavior”This alert indicates active connectivity problems. The quarantine mechanism is protecting the load balancer from repeatedly attempting connections to unstable replicas.
Severities
Section titled “Severities”- s3: Connectivity issues are being mitigated by the quarantine mechanism, but underlying problems need investigation.
Verification
Section titled “Verification”-
Metrics:
registry: Database Detail- Load Balancing panel- Check
registry_database_lb_pool_events_total{event="replica_quarantined", reason="connectivity"} - Check
registry_database_lb_pool_sizefor current pool size
-
Logs: Filter by
json.msg: "replica quarantined"to identify affected replicas and reasons.
Recent changes
Section titled “Recent changes”Recent registry deployments and configuration changes can be found here.
Troubleshooting
Section titled “Troubleshooting”- Identify which replicas are being quarantined from logs:
- Kibana: replica quarantined (connectivity)
- Look for
json.db_host_addrfield to identify the affected replica.
- Check network connectivity from registry pods to those specific replicas:
- Test connectivity using
kubectl execinto a registry pod and attempting to reach the replica.
- Test connectivity using
- Check PgBouncer status on affected replicas:
- Check Patroni cluster health for the affected replicas:
- Grafana: patroni-registry Overview
- Look for replica lag, connection counts, and cluster membership.
- Look for any network policy changes or firewall issues.
Possible Resolutions
Section titled “Possible Resolutions”- Restore network connectivity to affected replicas;
- Investigate and fix PgBouncer issues on replica hosts;
- Address Patroni cluster member health issues;
- Quarantined replicas will auto-reintegrate after 5 minutes once connectivity is restored.
Dependencies
Section titled “Dependencies”- PgBouncer
- Patroni
- Network infrastructure
Escalation
Section titled “Escalation”Escalate if multiple replicas are being quarantined or if pool size drops significantly:
Definitions
Section titled “Definitions”The definition for this alert can be found at: