ContainerRegistryDBNoReplicasAvailable
Overview
This alert is triggered when the replica pool size has been zero for at least 2 minutes. This is a critical condition meaning:
- All database replicas are unavailable, unreachable, or quarantined;
- All read queries are being routed to the primary database;
- Primary database load is significantly increased.
Immediate investigation is required.
Services
Metrics
This alert is based on avg(registry_database_lb_pool_size) == 0. The alert fires after the pool has been empty for 2 minutes.
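To make the firing condition concrete, here is a minimal Python sketch of how "average pool size equal to 0, sustained for 2 minutes" could be evaluated over a series of samples. It is an illustration only, not the actual alert rule: the function name `alert_fires`, the sample layout, and the choice to treat a missing sample as "condition not met" are all assumptions.

```python
def alert_fires(samples, for_seconds=120):
    """Return True if the average pool size has been 0 for at least `for_seconds`.

    `samples` is a time-ordered list of (unix_ts, [pool_size per instance]).
    This layout is hypothetical and only serves the illustration.
    """
    zero_since = None  # timestamp at which the current run of zeros started
    for ts, sizes in samples:
        if not sizes:
            # Assumption: no data means the condition is not met.
            zero_since = None
            continue
        avg = sum(sizes) / len(sizes)
        if avg == 0:
            if zero_since is None:
                zero_since = ts
        else:
            # Any non-zero average resets the 2-minute window.
            zero_since = None
    if zero_since is None:
        return False
    return samples[-1][0] - zero_since >= for_seconds
```

A single instance that still reports one healthy replica keeps the average above 0, so under this sketch the condition only holds when every reporting instance has an empty pool.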
Alert Behavior
This is the most severe replica pool alert. It indicates complete loss of read replica capacity. The registry will continue to function by routing all queries to the primary, but this significantly increases primary load and may lead to performance degradation.
Severities
- s2: Critical condition. All read traffic is hitting the primary database, which may become overloaded.
Verification
Metrics:
- registry: Database Detail - Load Balancing panel
- Check registry_database_lb_pool_size - should be 0
- Check primary database load metrics
- Check registry_database_lb_pool_events_total for recent quarantine events
Logs:
- Filter by json.msg: "replica quarantined" to see why replicas were removed
- Filter by json.msg: "no replicas available" to confirm fallback to primary
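If the log UI is unavailable, the same two filters can be applied to exported log lines. The sketch below assumes the logs are available as one JSON object per line and that the json.msg field in the log UI corresponds to a top-level `msg` key in each object. Both are assumptions, and `filter_by_msg` is a hypothetical helper, not part of any registry tooling.

```python
import json


def filter_by_msg(lines, needle):
    """Yield parsed JSON log entries whose `msg` field contains `needle`.

    Assumes one JSON object per line with a top-level `msg` key.
    """
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and needle in str(entry.get("msg", "")):
            yield entry
```

For example, `list(filter_by_msg(open("registry.log"), "replica quarantined"))` would collect the quarantine entries from a hypothetical `registry.log` export, and the same call with "no replicas available" would confirm the fallback to primary.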
Recent changes
Recent registry deployments and configuration changes can be found here.
Troubleshooting
- Immediate: Check primary database health and load - is it coping?
- Grafana: patroni-registry Overview - Primary load panels.
- Grafana Explore: primary connections
- Check Patroni cluster status - are replicas up?
- Grafana: patroni-registry Overview - Cluster membership panel.
- Check network connectivity from registry pods to all replica hosts.
- Check PgBouncer status on replica hosts:
- Review recent events - what caused all replicas to become unavailable?
- Check if replicas are quarantined (they will auto-reintegrate after 5 minutes).
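The network connectivity check above can be sketched as a plain TCP probe, run from a host or pod that shares the registry's network path to the replicas. This is a minimal illustration: `check_tcp` and `check_replicas` are hypothetical helpers, and the host names and port in the usage note are placeholders, not values from this runbook.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return (True, None) if a TCP connection succeeds, else (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False, str(exc)


def check_replicas(targets, timeout=3.0):
    """Probe each (host, port) pair and return a dict keyed by "host:port"."""
    return {f"{host}:{port}": check_tcp(host, port, timeout) for host, port in targets}
```

A call such as `check_replicas([("replica-01.example.internal", 6432)])` would probe one placeholder host on 6432, which is PgBouncer's default listen port. Substitute the actual replica hosts and the port the registry is configured to use. A successful TCP connection only shows that the port is reachable. It does not prove that PgBouncer or PostgreSQL is accepting queries.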
Possible Resolutions
- Restore network connectivity to replicas;
- Fix PgBouncer issues on replica hosts;
- Address Patroni cluster problems;
- Wait for quarantined replicas to auto-reintegrate (5-minute cooldown);
- If replicas are permanently lost, scale up or fail over.
Dependencies
- PgBouncer
- Patroni
- Consul
- Network infrastructure
Escalation
Escalate immediately - this is an s2 condition:
- g_container_registry
- s_package
- Consider paging database on-call if primary shows signs of overload.
Definitions
The definition for this alert can be found at:
Related Links
- Feature runbook
- Feature technical specification
- ContainerRegistryDBReplicaPoolDegraded alert (early warning)
- ContainerRegistryDBLoadBalancerReplicaPoolSize alert (related)