ContainerRegistryDBHighReplicaPoolChurnRate
Overview
This alert is triggered when the database load balancer is experiencing sustained replica DNS changes (additions and removals) at a rate exceeding 0.1 events/second for 15 minutes. This can indicate:
- DNS instability or misconfiguration;
- Network issues causing intermittent connectivity;
- Service discovery problems;
- Infrastructure instability affecting replica hosts.
The registry is able to operate during DNS churn, but frequent replica changes can lead to connection overhead and potential latency increases.
Services
Metrics
This alert is based on registry_database_lb_pool_events_total with labels event="replica_added", reason="discovered" and event="replica_removed", reason="removed_from_dns". The alert fires when the combined rate of these events exceeds 0.1/second (~6 events/minute) sustained for 15 minutes.
Note: Brief spikes during deployments are expected and the 15-minute duration helps filter these out.
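The threshold arithmetic can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above (0.1 events/second, 15 minutes), not the alert rule itself; the function names and the idea of feeding in pre-sampled rates are assumptions made for the example, and Prometheus performs the real evaluation.

```python
# Threshold and window taken from the alert description above.
THRESHOLD_PER_SECOND = 0.1
WINDOW_SECONDS = 15 * 60


def combined_rate(added_delta, removed_delta, interval_seconds):
    """Average per-second rate of add + remove events over one interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (added_delta + removed_delta) / interval_seconds


def would_fire(rates, step_seconds):
    """Approximate the 15-minute "sustained" condition.

    `rates` is a hypothetical list of combined rates sampled every
    `step_seconds`. Returns True only if the samples cover the whole
    window and every one of them exceeds the threshold.
    """
    needed = int(WINDOW_SECONDS // step_seconds)
    if needed <= 0 or len(rates) < needed:
        return False
    return all(rate > THRESHOLD_PER_SECOND for rate in rates[-needed:])
```

For example, 60 additions plus 30 removals over 15 minutes is 90 events in 900 seconds, which sits exactly at the 0.1/second boundary and would not fire on its own, since the rate has to exceed the threshold.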
Alert Behavior
This alert should be rare under normal operations. The 15-minute duration window is designed to ignore transient spikes from expected events like deployments or scaling operations.
Severities
- s3: This alert indicates infrastructure instability but does not pose an immediate availability risk.
Verification
- Metrics:
  - registry: Database Detail
  - Load Balancing panel
  - Check the registry_database_lb_pool_events_total metric for add/remove patterns
- Logs: Filter by json.msg: "replica is new" or "removing replica" to see replica pool changes.
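If the matching log lines are exported for offline inspection, a small script can tally which replicas churn most. The sketch below assumes one JSON object per line with top-level msg and db_host_addr keys (treating the json. prefix shown in Kibana as an index prefix); that export format and the function name tally_churn are assumptions for illustration only.

```python
import json
from collections import Counter

# Message substrings taken from the log filter described above.
CHURN_MESSAGES = ("replica is new", "removing replica")


def tally_churn(lines):
    """Count churn log events per (message, db_host_addr) pair.

    Assumes each line is a JSON object with `msg` and `db_host_addr`
    keys; lines that are not JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        msg = str(record.get("msg", ""))
        for needle in CHURN_MESSAGES:
            if needle in msg:
                host = record.get("db_host_addr", "unknown")
                counts[(needle, host)] += 1
                break
    return counts
```

Calling tally_churn on an open file handle and then most_common() on the result lists the replicas with the most add or remove events first.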
Recent changes
Recent registry deployments and configuration changes can be found here.
Troubleshooting
- Identify replica add/remove events from logs:
  - Kibana: replica added
  - Kibana: replica removed
  - Look for the json.db_host_addr field to identify which replicas are churning.
- Check DNS resolution for the replica hosts: are the records stable?
  - Verify Consul DNS is returning consistent results for replica.patroni-registry.service.consul.
- Check network connectivity between registry pods and replica hosts.
- Look at Patroni cluster status for any failovers or membership changes.
- Review recent infrastructure changes that might affect DNS or networking.
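One way to check whether DNS answers are stable is to resolve the name repeatedly and record every change in the returned address set. The following is a minimal sketch, not an official tool: watch_dns and resolve are hypothetical helper names, and the default resolver goes through the host's system resolver, which may cache answers or may not be pointed at Consul in a given environment.

```python
import socket
import time


def resolve(host):
    """Return the set of addresses the system resolver gives for `host`."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return frozenset(info[4][0] for info in infos)


def watch_dns(host, attempts=10, delay=1.0, resolver=resolve, sleep=time.sleep):
    """Resolve `host` repeatedly and report changes between answers.

    Returns a list of (attempt_index, added, removed) tuples, one for
    each attempt whose answer differs from the previous one. An empty
    list means the answers were identical across all attempts.
    """
    changes = []
    previous = None
    for attempt in range(attempts):
        current = resolver(host)
        if previous is not None and current != previous:
            added = sorted(current - previous)
            removed = sorted(previous - current)
            changes.append((attempt, added, removed))
        previous = current
        if attempt < attempts - 1:
            sleep(delay)
    return changes
```

Running watch_dns("replica.patroni-registry.service.consul", attempts=30, delay=2.0) from a host that resolves Consul names would show whether the address set changes between lookups. The resolver and sleep parameters are injectable so the loop can be exercised without network access.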
Possible Resolutions
- Investigate and resolve DNS instability;
- Fix network connectivity issues;
- Address any Patroni cluster instability.
Dependencies
- DNS (Consul)
- Patroni
- Network infrastructure
Escalation
Escalate if the churn rate continues to increase or if it starts affecting API latency:
Definitions
The definition for this alert can be found at: