ContainerRegistryDBReplicaPoolDegraded
Overview
This alert is triggered when the replica pool size has dropped below 50% of the 1-day average for at least 5 minutes. This indicates a significant reduction in available replicas, which may be caused by:
- Multiple replica failures;
- Widespread network connectivity issues;
- Infrastructure problems affecting multiple hosts;
- Mass quarantining due to connectivity or lag issues.
This is an early warning before the pool becomes completely empty.
Services
Metrics
This alert compares avg(registry_database_lb_pool_size) to avg_over_time(avg(registry_database_lb_pool_size)[1d:]). The alert fires when the current pool size is less than 50% of the 1-day average.
This adaptive threshold automatically adjusts if the expected replica count changes.
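As a rough illustration, the condition can be written as a single PromQL comparison. This is a sketch based on the expressions above; the authoritative rule (including the 5-minute duration and labels) is the one linked under Definitions.

```promql
# Sketch of the alert condition: the current average pool size has dropped
# below 50% of its 1-day average. The real rule also applies a "for: 5m"
# duration before firing.
avg(registry_database_lb_pool_size)
  < 0.5 * avg_over_time(avg(registry_database_lb_pool_size)[1d:])
```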
Alert Behavior
This alert provides early warning of pool degradation. It fires before the pool is completely empty, giving operators time to investigate and respond.
Severities
- s3: Significant pool degradation increases load on remaining replicas and may impact performance, but the registry remains operational.
If this alert is followed by ContainerRegistryDBNoReplicasAvailable (s2), the situation has escalated to critical.
Verification
- Metrics: registry: Database Detail - Load Balancing panel (example queries follow this list):
  - Check current registry_database_lb_pool_size vs the historical average
  - Check registry_database_lb_pool_events_total for quarantine events
- Logs: filter by json.msg: "replica quarantined" or "removing replica" to identify what is reducing the pool size.
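Example queries for the metrics check above. This is a sketch, not the exact panel queries; in particular, the "event" label name on registry_database_lb_pool_events_total is an assumption and may differ in your environment.

```promql
# Current pool size (averaged across registry instances)
avg(registry_database_lb_pool_size)

# 1-day average that the alert compares against
avg_over_time(avg(registry_database_lb_pool_size)[1d:])

# Pool events over the last hour, broken down by event type.
# The "event" label name is an assumption; adjust to the labels
# the metric actually carries.
sum by (event) (increase(registry_database_lb_pool_events_total[1h]))
```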
Recent changes
Recent registry deployments and configuration changes can be found here.
Troubleshooting
- Identify how many replicas are currently available vs expected (see the example query after this list):
  - Grafana Explore: pool size
  - Grafana: registry Database Detail - Load Balancing panel
- Check which replicas are missing/quarantined and why:
  - Check Patroni cluster status - are replicas healthy?
  - Check network connectivity to replica hosts.
  - Check for any infrastructure incidents affecting multiple hosts.
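For a quick count of available vs expected replicas, a per-instance breakdown of the pool size in Grafana Explore can help. This is a sketch; the grouping label ("instance") is an assumption and should be adjusted to the labels actually present on the metric.

```promql
# Pool size as reported by each registry instance; compare the values
# against the expected replica count for the cluster.
avg by (instance) (registry_database_lb_pool_size)
```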
Possible Resolutions
- Restore connectivity to unavailable replicas;
- Wait for quarantined replicas to auto-reintegrate (5-minute cooldown);
- Address underlying infrastructure or network issues;
- Scale up if replicas are permanently lost.
Dependencies
- PgBouncer
- Patroni
- Network infrastructure
Escalation
Escalate immediately if the pool continues to degrade or if remaining replicas show signs of stress.
Definitions
The definition for this alert can be found at:
Related Links
- Feature runbook
- Feature technical specification
- ContainerRegistryDBNoReplicasAvailable alert (escalation)