ContainerRegistryDBReplicaPoolDegraded
Overview
This alert is triggered when the replica pool size has dropped below 50% of the 1-day average for at least 5 minutes. This indicates a significant reduction in available replicas, which may be caused by:
- Multiple replica failures;
- Widespread network connectivity issues;
- Infrastructure problems affecting multiple hosts;
- Mass quarantining due to connectivity or lag issues.
This is an early warning before the pool becomes completely empty.
Services
Metrics
This alert compares avg(registry_database_lb_pool_size) to avg_over_time(avg(registry_database_lb_pool_size)[1d:]). The alert fires when the current pool size is less than 50% of the 1-day average.
This adaptive threshold automatically adjusts if the expected replica count changes.
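As a rough illustration, the condition can be written as a single PromQL comparison. This is a sketch based on the expressions above; the authoritative rule (including the 5-minute duration and labels) is the one linked under Definitions.

```promql
# Sketch of the alert condition: the current average pool size has dropped
# below 50% of its 1-day average. The real rule also applies a "for: 5m"
# duration before firing.
avg(registry_database_lb_pool_size)
  < 0.5 * avg_over_time(avg(registry_database_lb_pool_size)[1d:])
```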
Alert Behavior
This alert provides early warning of pool degradation. It fires before the pool is completely empty, giving operators time to investigate and respond.
Severities
- s3: Significant pool degradation increases load on remaining replicas and may impact performance, but the registry remains operational.
If this alert is followed by ContainerRegistryDBNoReplicasAvailable (s2), the situation has escalated to critical.
Verification
- Metrics: registry: Database Detail - Load Balancing panel (example queries follow this list):
  - Check current registry_database_lb_pool_size vs the historical average
  - Check registry_database_lb_pool_events_total for quarantine events
- Logs: filter by json.msg: "replica quarantined" or "removing replica" to identify what is reducing the pool size.
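Example queries for the metrics check above. This is a sketch, not the exact panel queries; in particular, the "event" label name on registry_database_lb_pool_events_total is an assumption and may differ in your environment.

```promql
# Current pool size (averaged across registry instances)
avg(registry_database_lb_pool_size)

# 1-day average that the alert compares against
avg_over_time(avg(registry_database_lb_pool_size)[1d:])

# Pool events over the last hour, broken down by event type.
# The "event" label name is an assumption; adjust to the labels
# the metric actually carries.
sum by (event) (increase(registry_database_lb_pool_events_total[1h]))
```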
Recent changes
Recent registry deployments and configuration changes can be found here.
Troubleshooting
- Identify how many replicas are currently available vs expected (see the example query after this list):
  - Grafana Explore: pool size
  - Grafana: registry Database Detail - Load Balancing panel
- Check which replicas are missing/quarantined and why:
  - Check Patroni cluster status - are replicas healthy?
  - Check network connectivity to replica hosts.
  - Check for any infrastructure incidents affecting multiple hosts.
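For a quick count of available vs expected replicas, a per-instance breakdown of the pool size in Grafana Explore can help. This is a sketch; the grouping label ("instance") is an assumption and should be adjusted to the labels actually present on the metric.

```promql
# Pool size as reported by each registry instance; compare the values
# against the expected replica count for the cluster.
avg by (instance) (registry_database_lb_pool_size)
```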
Possible Resolutions
- Restore connectivity to unavailable replicas;
- Wait for quarantined replicas to auto-reintegrate (5-minute cooldown);
- Address underlying infrastructure or network issues;
- Scale up if replicas are permanently lost.
Dependencies
- PgBouncer
- Patroni
- Network infrastructure
Escalation
Escalate immediately if the pool continues to degrade or if remaining replicas show signs of stress.
Definitions
The definition for this alert can be found at:
Related Links
- Feature runbook
- Feature technical specification
- ContainerRegistryDBNoReplicasAvailable alert (escalation)