ContainerRegistryDBHighReplicaPoolChurnRate

This alert is triggered when the database load balancer is experiencing sustained replica DNS changes (additions and removals) at a rate exceeding 0.1 events/second for 15 minutes. This can indicate:

  • DNS instability or misconfiguration;
  • Network issues causing intermittent connectivity;
  • Service discovery problems;
  • Infrastructure instability affecting replica hosts.

The registry continues to operate during DNS churn, but frequent replica changes add connection overhead and can increase latency.

This alert is based on registry_database_lb_pool_events_total with labels event="replica_added", reason="discovered" and event="replica_removed", reason="removed_from_dns". The alert fires when the combined rate of these events exceeds 0.1/second (~6 events/minute) sustained for 15 minutes.
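The threshold arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers in the description, not the alert rule itself: the function names, the per-step sampling, and the "N consecutive samples" model of the 15-minute condition are assumptions made for the example and simplify how Prometheus actually evaluates a rule.

```python
from typing import Sequence

THRESHOLD_PER_SECOND = 0.1       # threshold quoted in the alert description
FOR_DURATION_SECONDS = 15 * 60   # the condition must hold for 15 minutes


def combined_rate(added_delta: float, removed_delta: float, window_seconds: float) -> float:
    """Per-second rate of replica_added plus replica_removed events over a window."""
    return (added_delta + removed_delta) / window_seconds


def would_fire(rates: Sequence[float], step_seconds: float) -> bool:
    """Return True if the rate stays above the threshold for a full 15 minutes.

    `rates` holds one combined-rate sample per evaluation step, oldest first.
    Simplified model: fire after enough consecutive samples above the threshold.
    """
    needed = int(FOR_DURATION_SECONDS / step_seconds)
    streak = 0
    for rate in rates:
        # Any sample at or below the threshold resets the streak.
        streak = streak + 1 if rate > THRESHOLD_PER_SECOND else 0
        if streak >= needed:
            return True
    return False
```

For example, 40 additions and 50 removals in a 15-minute window give `combined_rate(40, 50, 900)`, which is 90 events over 900 seconds and sits exactly at the 0.1/second threshold. Because the comparison is strict, that would not count as exceeding it.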

This alert should be rare under normal operations. Brief spikes from expected events such as deployments or scaling operations are filtered out by the 15-minute duration.

  • s3: This alert indicates infrastructure instability but does not pose immediate availability risk.
  • Metrics:

    • registry: Database Detail - Load Balancing panel
    • Check the registry_database_lb_pool_events_total metric for add/remove patterns
  • Logs: Filter by json.msg: "replica is new" or "removing replica" to see replica pool changes.
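If the logs have been exported as JSON lines, the same filter can be applied offline. The sketch below assumes each raw log line is a JSON object with a top-level `msg` key (on the assumption that the `json.` prefix is added by the log indexer); the function name and the `added`/`removed` labels are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

# Message substrings quoted in the log filter above.
ADDED_MSG = "replica is new"
REMOVED_MSG = "removing replica"


def count_pool_changes(lines: Iterable[str]) -> Counter:
    """Count replica add/remove log entries in JSON-lines input."""
    counts: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        msg = str(entry.get("msg", ""))
        if ADDED_MSG in msg:
            counts["added"] += 1
        elif REMOVED_MSG in msg:
            counts["removed"] += 1
    return counts
```

Roughly equal `added` and `removed` counts suggest replicas flapping in and out of DNS, whereas a one-sided count points to a one-off membership change.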

Recent registry deployments and configuration changes can be found here.

  1. Identify replica add/remove events from logs:
  2. Check DNS resolution for the replica hosts - are records stable?
    • Verify Consul DNS is returning consistent results for replica.patroni-registry.service.consul.
  3. Check network connectivity between registry pods and replica hosts.
  4. Look at Patroni cluster status for any failovers or membership changes:
  5. Review recent infrastructure changes that might affect DNS or networking.
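For step 2, one rough way to check whether DNS records are stable is to resolve the replica name repeatedly and count how often the answer set changes. The sketch below is a minimal example under stated assumptions: it uses the system resolver through `socket.getaddrinfo`, which only reflects Consul DNS if the host it runs on is configured to forward `.consul` lookups there, and the function names and parameters are hypothetical.

```python
import socket
import time
from typing import Callable, FrozenSet, Optional

# Hostname taken from the troubleshooting step above.
REPLICA_HOST = "replica.patroni-registry.service.consul"


def resolve_ips(host: str) -> FrozenSet[str]:
    """Resolve a hostname to the set of IP addresses the system resolver returns."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return frozenset(info[4][0] for info in infos)


def count_changes(
    host: str,
    samples: int,
    interval_seconds: float = 0.0,
    resolver: Callable[[str], FrozenSet[str]] = resolve_ips,
) -> int:
    """Resolve `host` `samples` times and count how often the answer set changes."""
    changes = 0
    previous: Optional[FrozenSet[str]] = None
    for i in range(samples):
        current = resolver(host)
        if previous is not None and current != previous:
            changes += 1
        previous = current
        # Sleep between lookups, but not after the last one.
        if i < samples - 1 and interval_seconds > 0:
            time.sleep(interval_seconds)
    return changes
```

A call such as `count_changes(REPLICA_HOST, samples=30, interval_seconds=10)` would sample the name for about five minutes. Resolution failures raise `socket.gaierror` and are deliberately not caught here, since a failing lookup is itself a useful signal.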

Possible resolutions:

  • Investigate and resolve DNS instability;
  • Fix network connectivity issues;
  • Address any Patroni cluster instability.

Dependencies:

  • DNS (Consul)
  • Patroni
  • Network infrastructure

Escalate if the churn rate continues to increase or if it starts affecting API latency:

The definition for this alert can be found at: