ContainerRegistryDBHighReplicaPoolChurnRate

Overview

This alert fires when the database load balancer sees replica DNS changes (additions and removals) at a combined rate above 0.1 events/second, sustained for 15 minutes. This can indicate:

DNS instability or misconfiguration;
Network issues causing intermittent connectivity;
Service discovery problems;
Infrastructure instability affecting replica hosts.

The registry continues to operate during DNS churn, but frequent replica changes add connection overhead and can increase latency.

Services

Metrics

This alert is based on registry_database_lb_pool_events_total with labels event="replica_added", reason="discovered" and event="replica_removed", reason="removed_from_dns". The alert fires when the combined rate of these events exceeds 0.1/second (~6 events/minute) sustained for 15 minutes.

Note: Brief spikes during deployments are expected and the 15-minute duration helps filter these out.
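
To make the threshold concrete, here is a minimal Python sketch of the arithmetic. It is not the alert's actual rule expression. The function names and the (timestamp, value) sample format are hypothetical. It assumes each value is the sum of the two event counters described above, and it averages over the whole window, which simplifies how a sustained condition is really evaluated.

```python
from typing import Sequence, Tuple

THRESHOLD_PER_SECOND = 0.1   # threshold stated in this runbook
WINDOW_SECONDS = 15 * 60     # the 15-minute duration stated in this runbook


def combined_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Average per-second increase across (timestamp, counter_value) samples.

    Assumption: counter_value is replica_added + replica_removed at that
    timestamp. Counter resets are not handled in this sketch.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def exceeds_threshold(samples: Sequence[Tuple[float, float]]) -> bool:
    """True when the average rate over the samples is above the threshold."""
    return combined_rate(samples) > THRESHOLD_PER_SECOND


# Roughly how many add/remove events the threshold implies.
events_per_minute = THRESHOLD_PER_SECOND * 60
events_per_window = THRESHOLD_PER_SECOND * WINDOW_SECONDS
```

In other words, the pool has to see on the order of 90 combined add/remove events across the 15-minute window before the threshold is crossed on average.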

Alert Behavior

This alert should be rare under normal operations. The 15-minute duration window is designed to ignore transient spikes from expected events like deployments or scaling operations.

Severities

s3: This alert indicates infrastructure instability but does not pose an immediate availability risk.

Verification

Metrics:
- registry: Database Detail - Load Balancing panel
- Check the registry_database_lb_pool_events_total metric for add/remove patterns
Logs: Filter by json.msg: "replica is new" or "removing replica" to see replica pool changes.
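
If the matching log entries are exported for offline inspection, a small script can summarize which replicas are churning. The sketch below is illustrative only. It assumes newline-delimited JSON, and that the fields shown in Kibana as json.msg and json.db_host_addr (the latter is referenced under Troubleshooting) appear in the export as msg and db_host_addr. The function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

ADDED_MSG = "replica is new"       # log message named in this runbook
REMOVED_MSG = "removing replica"   # log message named in this runbook


def churn_by_replica(lines: Iterable[str]) -> Counter:
    """Count add/remove log entries per (db_host_addr, event) pair.

    Assumption: each line is one JSON object with "msg" and "db_host_addr"
    keys. Lines that are not JSON objects are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        msg = str(entry.get("msg", ""))
        if ADDED_MSG in msg:
            event = "added"
        elif REMOVED_MSG in msg:
            event = "removed"
        else:
            continue
        host = str(entry.get("db_host_addr", "unknown"))
        counts[(host, event)] += 1
    return counts
```

Calling `churn_by_replica(open("export.ndjson"))` on a hypothetical export file and printing `most_common()` on the result shows whether churn is concentrated on a few hosts or spread across the pool.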

Recent changes

Recent registry deployments and configuration changes can be found here.

Troubleshooting

Identify replica add/remove events from logs:
- Kibana: replica added
- Kibana: replica removed
- Look for json.db_host_addr field to identify which replicas are churning.
Check DNS resolution for the replica hosts - are records stable?
- Verify Consul DNS is returning consistent results for replica.patroni-registry.service.consul.
Check network connectivity between registry pods and replica hosts.
Look at Patroni cluster status for any failovers or membership changes:
- Grafana: patroni-registry Overview
Review recent infrastructure changes that might affect DNS or networking.
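
The DNS stability check above can be approximated by resolving the replica name repeatedly and counting how often the answer set changes. This is a rough sketch under stated assumptions: it uses the host's system resolver through Python's socket.getaddrinfo, which may not query Consul the same way the registry does (for example, if the registry uses SRV lookups or a dedicated nameserver). The function names, attempt count, and interval are hypothetical.

```python
import socket
import time
from typing import Callable, FrozenSet

HOSTNAME = "replica.patroni-registry.service.consul"  # name from this runbook


def resolve(hostname: str) -> FrozenSet[str]:
    """Return the addresses the system resolver reports for hostname.

    A failed lookup is treated as an empty set so it shows up as a change.
    """
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return frozenset()
    return frozenset(info[4][0] for info in infos)


def count_changes(
    hostname: str,
    resolver: Callable[[str], FrozenSet[str]] = resolve,
    attempts: int = 10,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Resolve hostname `attempts` times and count changes between answers."""
    changes = 0
    previous = resolver(hostname)
    for _ in range(attempts - 1):
        sleep(interval)
        current = resolver(hostname)
        if current != previous:
            changes += 1
            added = sorted(current - previous)
            removed = sorted(previous - current)
            print(f"change detected: added={added} removed={removed}")
        previous = current
    return changes
```

Running `count_changes(HOSTNAME)` from a host that resolves through the same DNS path as the registry pods gives a quick signal: a non-zero result over a short period suggests the records themselves are unstable, while zero points toward connectivity or Patroni membership instead.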

Possible Resolutions

Investigate and resolve DNS instability;
Fix network connectivity issues;
Address any Patroni cluster instability.

Dependencies

DNS (Consul)
Patroni
Network infrastructure

Escalation

Escalate if the churn rate continues to increase or if it starts affecting API latency:

Definitions

The definition for this alert can be found at: