ContainerRegistryDBReplicaPoolSizeInstability
Overview
This alert fires when the standard deviation of the replica pool size, measured over a 15-minute window, stays above 1 for at least 5 minutes. This can indicate:
- Replicas frequently joining and leaving the pool;
- Intermittent connectivity or health check failures;
- Infrastructure instability affecting multiple replicas.
A stable pool should have near-zero standard deviation.
Services
Metrics
This alert is based on stddev_over_time(registry_database_lb_pool_size[15m]). The alert fires when the standard deviation exceeds 1 for 5 minutes.
For context:
- A stddev of 0 means the pool size is completely stable
- A stddev of 1 means the pool size is fluctuating by roughly ±1 replica
- Higher values indicate more severe fluctuations
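To make these values concrete, here is a minimal Python sketch of the calculation. It assumes stddev_over_time behaves like a population standard deviation over the samples in the window. The sample series and the helper names (pool_size_stddev, is_unstable) are made up for illustration and are not part of the registry or the alert definition.

```python
from statistics import pstdev

# Assumption: the alert condition is a strict "stddev > 1" comparison.
ALERT_THRESHOLD = 1.0


def pool_size_stddev(samples):
    """Population standard deviation of the pool-size samples in one window."""
    return pstdev(samples)


def is_unstable(samples, threshold=ALERT_THRESHOLD):
    """True when the window's standard deviation exceeds the threshold."""
    return pool_size_stddev(samples) > threshold


# Illustrative (made-up) pool-size samples from a single window.
stable = [4, 4, 4, 4, 4, 4]          # never changes
flapping_one = [3, 5, 3, 5, 3, 5]    # swings one replica either side of the mean
flapping_two = [2, 6, 2, 6, 2, 6]    # swings two replicas either side of the mean

for name, window in [("stable", stable),
                     ("flapping_one", flapping_one),
                     ("flapping_two", flapping_two)]:
    print(name, pool_size_stddev(window), is_unstable(window))
```

Note that under the strict comparison assumed here, a pool that swings by exactly one replica around its mean sits on the threshold and would not count as unstable; larger swings would.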
Alert Behavior
This alert complements the churn rate alert by detecting pool size instability regardless of the specific events causing it. It can catch scenarios where replicas are being added and removed in ways not captured by other alerts.
Severities
- s3: Pool instability indicates underlying infrastructure issues, but the registry can continue operating.
Verification
- Metrics:
  - registry: Database Detail - Load Balancing panel
  - Check registry_database_lb_pool_size over time to visualize fluctuations
  - Check registry_database_lb_pool_events_total for add/remove/quarantine events
- Logs: Filter by json.msg: "replica" to see all replica-related events.
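If the dashboard is unavailable, the same metric can in principle be pulled through the Prometheus HTTP API. The sketch below only builds a query_range URL and unpacks a response of the standard matrix shape. The base URL is a placeholder and the helper names are hypothetical; substitute whatever endpoint and authentication your environment actually uses.

```python
from urllib.parse import urlencode

# Hypothetical base URL, used only for illustration.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def range_query_url(expr, start, end, step="60s", base_url=PROMETHEUS_URL):
    """Build a Prometheus /api/v1/query_range URL for the given expression."""
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def values_from_response(payload):
    """Return one list of float samples per series in a matrix response."""
    return [
        [float(value) for _timestamp, value in series["values"]]
        for series in payload["data"]["result"]
    ]


# A 15-minute window expressed as Unix timestamps (illustrative values).
url = range_query_url("registry_database_lb_pool_size", 1700000000, 1700000900)
print(url)

# Made-up response body in the documented matrix format.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {}, "values": [[1700000000, "4"], [1700000060, "3"]]},
        ],
    },
}
print(values_from_response(sample_payload))
```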
Recent changes
Recent registry deployments and configuration changes can be found here.
Troubleshooting
- Check which replicas are fluctuating in/out of the pool from logs:
  - Kibana: replica quarantined
  - Kibana: replica added
  - Kibana: replica removed
  - Look for the json.db_host_addr field to identify which replicas are affected.
- Investigate if fluctuations correlate with specific events (deployments, network changes).
- Check for patterns - are the same replicas repeatedly affected?
- Review Patroni cluster health and membership:
- Check for DNS or service discovery issues:
  - Verify Consul DNS is returning consistent results for replica.patroni-registry.service.consul.
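When many replica events show up in the logs, tallying them per replica can make repeat offenders easier to spot. The following is a rough sketch, assuming the log entries have been exported as a list of dictionaries with msg and db_host_addr keys (the fields shown in Kibana as json.msg and json.db_host_addr). The export format, the addresses, and the helper names are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def tally_replica_events(records):
    """Count replica-related messages per replica address."""
    per_host = defaultdict(Counter)
    for record in records:
        msg = record.get("msg", "")
        host = record.get("db_host_addr")
        # Skip entries that are not about replicas or lack a host address.
        if host and "replica" in msg:
            per_host[host][msg] += 1
    return per_host


def repeat_offenders(per_host, min_events=2):
    """Addresses with at least min_events replica events, sorted by name."""
    return sorted(
        host for host, counts in per_host.items()
        if sum(counts.values()) >= min_events
    )


# Made-up records; the addresses are placeholders.
records = [
    {"msg": "replica quarantined", "db_host_addr": "10.0.0.1"},
    {"msg": "replica added", "db_host_addr": "10.0.0.1"},
    {"msg": "replica removed", "db_host_addr": "10.0.0.2"},
    {"msg": "replica quarantined", "db_host_addr": "10.0.0.1"},
    {"msg": "unrelated message", "db_host_addr": "10.0.0.3"},
]

tally = tally_replica_events(records)
print(dict(tally))
print(repeat_offenders(tally))
```

If the same addresses keep appearing, that points toward a problem with those specific replicas rather than pool-wide instability.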
Possible Resolutions
- Identify and fix the root cause of replica instability;
- Address network or connectivity issues;
- Resolve Patroni cluster health problems.
Dependencies
- Patroni
- DNS/Consul
- Network infrastructure
Escalation
Escalate if instability persists or worsens:
Definitions
The definition for this alert can be found at: