ContainerRegistryDBReplicaPoolSizeInstability

Overview

This alert fires when the standard deviation of the replica pool size, computed over a 15-minute window, stays above 1 for at least 5 minutes. This typically indicates one or more of the following:

Replicas frequently joining and leaving the pool;
Intermittent connectivity or health check failures;
Infrastructure instability affecting multiple replicas.

A stable pool should have near-zero standard deviation.

Services

Metrics

This alert is based on stddev_over_time(registry_database_lb_pool_size[15m]), which is the population standard deviation of the pool size samples within the last 15 minutes. The alert fires when this value stays above 1 for 5 minutes.

For context:

A stddev of 0 means the pool size is completely stable
A stddev of 1 means the pool size typically deviates from its 15-minute mean by about one replica (roughly ±1)
Higher values indicate more severe fluctuations
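
To make the values above concrete, the following is a minimal sketch that computes the same population standard deviation over made-up pool size samples. The sample series and the helper name pool_size_stddev are illustrative assumptions, not real registry data or code.

```python
from statistics import pstdev


def pool_size_stddev(samples):
    """Population standard deviation of the pool-size samples in one window.

    This mirrors what stddev_over_time computes over a range of samples.
    """
    return pstdev(samples)


# Hypothetical sample series for one 15-minute window (illustrative only).
stable = [3] * 30            # pool size never changes
one_flapping = [3, 4] * 15   # one replica repeatedly joins and leaves
two_flapping = [2, 4] * 15   # pool size swings by two replicas

for name, samples in [
    ("stable", stable),
    ("one_flapping", one_flapping),
    ("two_flapping", two_flapping),
]:
    print(name, pool_size_stddev(samples))
```

In this sketch the stable series yields 0, a single replica alternating in and out yields 0.5, and a swing of two replicas yields 1.0. Because the threshold is strictly greater than 1, a single flapping replica on its own would not normally be enough to fire this alert.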

Alert Behavior

This alert complements the churn rate alert by detecting pool size instability regardless of the specific events causing it. It can catch scenarios where replicas are being added and removed in ways not captured by other alerts.

Severities

s3: Pool instability indicates underlying infrastructure issues, but the registry can continue operating.

Verification

Metrics:
- registry: Database Detail - Load Balancing panel
- Check registry_database_lb_pool_size over time to visualize fluctuations
- Check registry_database_lb_pool_events_total for add/remove/quarantine events
Logs: Filter by json.msg: "replica" to see all replica-related events.

Recent changes

Recent registry deployments and configuration changes can be found here.

Troubleshooting

Check which replicas are fluctuating in/out of the pool from logs:
- Kibana: replica quarantined
- Kibana: replica added
- Kibana: replica removed
- Look for json.db_host_addr field to identify which replicas are affected.
Investigate if fluctuations correlate with specific events (deployments, network changes).
Check for patterns: are the same replicas repeatedly affected?
Review Patroni cluster health and membership:
- Grafana: patroni-registry Overview
Check for DNS or service discovery issues:
- Verify Consul DNS is returning consistent results for replica.patroni-registry.service.consul.
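
The DNS consistency check in the last step can be sketched as below. This is a rough illustration, not an established tool: the function names resolve_once and distinct_answers are hypothetical, and it assumes it runs on a host whose system resolver forwards .consul names to Consul DNS. Local resolver caching may hide short-lived inconsistencies.

```python
import socket


def resolve_once(hostname):
    """Resolve hostname once and return the set of addresses in the answer."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return frozenset(info[4][0] for info in infos)


def distinct_answers(hostname, attempts=5, resolver=resolve_once):
    """Resolve hostname several times and return the distinct address sets seen.

    One distinct set means the answers were consistent across attempts;
    more than one means the set of replicas changed between lookups.
    """
    return {resolver(hostname) for _ in range(attempts)}
```

Calling distinct_answers("replica.patroni-registry.service.consul.") and getting more than one address set back suggests that replicas are entering and leaving service discovery between lookups. The resolver parameter exists only so the check can be exercised without network access.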

Possible Resolutions

Identify and fix the root cause of replica instability;
Address network or connectivity issues;
Resolve Patroni cluster health problems.

Dependencies

Patroni
DNS/Consul
Network infrastructure

Escalation

Escalate if instability persists or worsens:

Definitions

The definition for this alert can be found at: