ContainerRegistryDBReplicaPoolSizeInstability

This alert is triggered when the replica pool size has been fluctuating significantly (standard deviation > 1) over a 15-minute window for at least 5 minutes. This indicates:

  • Replicas frequently joining and leaving the pool;
  • Intermittent connectivity or health check failures;
  • Infrastructure instability affecting multiple replicas.

A stable pool should have near-zero standard deviation.

This alert is based on stddev_over_time(registry_database_lb_pool_size[15m]), the standard deviation of the pool size over the last 15 minutes. It fires when that value stays above 1 for 5 minutes.

For context:

  • A stddev of 0 means the pool size is completely stable
  • A stddev of 1 means the pool size typically deviates from its 15-minute average by about ±1 replica
  • Higher values indicate more severe fluctuations
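To make these values concrete, the following is a minimal sketch of the calculation behind stddev_over_time, which uses the population standard deviation of the samples in the window. The sample series are hypothetical (one sample per minute over 15 minutes) and are not taken from a real registry.

```python
import math


def stddev_over_window(samples):
    """Population standard deviation of the samples in one window."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return math.sqrt(variance)


# Hypothetical pool sizes, one sample per minute for 15 minutes.
stable = [3] * 15                 # no replicas joining or leaving
single_flap = [3, 2] * 7 + [3]    # one replica repeatedly dropping out
wide_flap = [4, 1] * 7 + [4]      # three replicas dropping out together

for name, series in [("stable", stable),
                     ("single_flap", single_flap),
                     ("wide_flap", wide_flap)]:
    value = stddev_over_window(series)
    print(f"{name}: stddev={value:.2f} above_threshold={value > 1}")
```

In this sketch a single replica flapping stays below the threshold of 1, while several replicas leaving and rejoining together push the value above it.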

This alert complements the churn rate alert by detecting pool size instability regardless of the specific events causing it. It can catch scenarios where replicas are being added and removed in ways not captured by other alerts.

  • s3: Pool instability indicates underlying infrastructure issues, but the registry can continue operating.
  • Metrics:

    • registry: Database Detail - Load Balancing panel
    • Check registry_database_lb_pool_size over time to visualize fluctuations
    • Check registry_database_lb_pool_events_total for add/remove/quarantine events
  • Logs: Filter by json.msg: "replica" to see all replica-related events.
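When the dashboard is not enough, the fluctuations can also be listed from exported samples. The sketch below assumes that registry_database_lb_pool_size has already been exported as (timestamp, value) pairs; the function name and the sample data are hypothetical.

```python
def pool_size_changes(samples):
    """Return (timestamp, old_size, new_size) for each change between consecutive samples."""
    changes = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur != prev:
            changes.append((ts, prev, cur))
    return changes


# Hypothetical samples: (seconds since start of window, pool size).
samples = [(0, 3), (60, 3), (120, 2), (180, 3), (240, 3), (300, 1)]

for ts, old, new in pool_size_changes(samples):
    print(f"t={ts}s: pool size {old} -> {new}")
```

The resulting timestamps can then be compared against deployment times and the replica-related log events.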

Recent registry deployments and configuration changes can be found here.

  1. Check which replicas are fluctuating in/out of the pool from logs:
  2. Investigate if fluctuations correlate with specific events (deployments, network changes).
  3. Check for patterns - are the same replicas repeatedly affected?
  4. Review Patroni cluster health and membership:
  5. Check for DNS or service discovery issues:
    • Verify Consul DNS is returning consistent results for replica.patroni-registry.service.consul.

  • Identify and fix the root cause of replica instability;
  • Address network or connectivity issues;
  • Resolve Patroni cluster health problems.

  • Patroni
  • DNS/Consul
  • Network infrastructure
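The DNS consistency check from step 5 above could be sketched as follows. This is only an illustration: it assumes the host running it resolves .consul names through Consul, and the function names, attempt count, and delay are hypothetical. It resolves the name several times and collects the distinct address sets returned; more than one distinct set suggests inconsistent answers.

```python
import socket
import time


def resolve_once(hostname, resolver=socket.getaddrinfo):
    """Return the set of IP addresses currently returned for hostname."""
    infos = resolver(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return frozenset(info[4][0] for info in infos)


def distinct_answers(hostname, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve hostname several times and return the distinct address sets seen."""
    seen = set()
    for i in range(attempts):
        try:
            seen.add(resolve_once(hostname, resolver))
        except socket.gaierror:
            seen.add(frozenset())  # record a failed lookup as an empty answer
        if i < attempts - 1:
            time.sleep(delay)
    return seen


# Example call (needs a resolver that can answer .consul queries):
#   answers = distinct_answers("replica.patroni-registry.service.consul.")
#   print("consistent" if len(answers) == 1 else f"{len(answers)} different answers")
```

Address sets are compared rather than ordered lists, so a resolver that only rotates the order of the same replicas is not reported as inconsistent.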

Escalate if instability persists or worsens:

The definition for this alert can be found at: