ContainerRegistryDBReplicaPoolDegraded

This alert is triggered when the replica pool size has dropped below 50% of the 1-day average for at least 5 minutes. This indicates a significant reduction in available replicas, which may be caused by:

  • Multiple replica failures;
  • Widespread network connectivity issues;
  • Infrastructure problems affecting multiple hosts;
  • Mass quarantining due to connectivity or lag issues.

This is an early warning before the pool becomes completely empty.

This alert compares the current avg(registry_database_lb_pool_size) with its 1-day average, avg_over_time(avg(registry_database_lb_pool_size)[1d:]). It fires when the current pool size stays below 50% of that average for at least 5 minutes.

This adaptive threshold automatically adjusts if the expected replica count changes.
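
The comparison can be illustrated with a minimal Python sketch. This is not the alert rule itself, which is evaluated by Prometheus; the function name and the sample values are hypothetical, and the 5-minute duration requirement is not modeled.

```python
from statistics import mean


def pool_degraded(current_sizes, day_sizes, ratio=0.5):
    """Return True when the current average pool size is below `ratio`
    times the average of the samples collected over the past day.

    current_sizes: latest pool size samples (hypothetical input).
    day_sizes: pool size samples from the trailing day (hypothetical input).
    """
    if not current_sizes or not day_sizes:
        raise ValueError("need at least one sample on each side")
    return mean(current_sizes) < ratio * mean(day_sizes)


# Hypothetical samples: the pool normally holds 4 replicas.
day_samples = [4, 4, 4, 4]
print(pool_degraded([2, 2], day_samples))  # False: 2 is not below 0.5 * 4
print(pool_degraded([1, 1], day_samples))  # True: 1 is below 0.5 * 4
```

Because the baseline is a trailing average, a degradation that lasts for many hours also lowers the baseline, so the same pool size may stop counting as degraded even though nothing has recovered.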

This alert provides early warning of pool degradation. It fires before the pool is completely empty, giving operators time to investigate and respond.

  • s3: Significant pool degradation increases load on remaining replicas and may impact performance, but the registry remains operational.

If this alert is followed by ContainerRegistryDBNoReplicasAvailable (s2), the situation has escalated to critical.

  • Metrics:

    • registry: Database Detail - Load Balancing panel
    • Check current registry_database_lb_pool_size vs historical average
    • Check registry_database_lb_pool_events_total for quarantine events
  • Logs: Filter by json.msg: "replica quarantined" or "removing replica" to identify what is reducing the pool size.
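
When working from exported log lines rather than the log UI, the same filter can be approximated with a short Python sketch. It assumes each line is a JSON object with the message in a top-level msg key; that layout, the function name, and the replica field in the sample are assumptions, not the registry's documented log schema.

```python
import json

# Message fragments named in the log filter above.
PATTERNS = ("replica quarantined", "removing replica")


def pool_reduction_events(lines):
    """Yield parsed log entries whose msg contains one of PATTERNS."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        msg = str(entry.get("msg", ""))
        if any(pattern in msg for pattern in PATTERNS):
            yield entry


# Hypothetical sample lines; the "replica" field is illustrative only.
sample = [
    '{"msg": "replica quarantined", "replica": "replica-1.example"}',
    '{"msg": "request completed"}',
    "not json",
]
for event in pool_reduction_events(sample):
    print(event["msg"])
```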

Recent registry deployments and configuration changes can be found here.

  1. Identify how many replicas are currently available vs expected:
  2. Check which replicas are missing/quarantined and why:
  3. Check Patroni cluster status - are replicas healthy?
  4. Check network connectivity to replica hosts.
  5. Check for any infrastructure incidents affecting multiple hosts.
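
For the network connectivity step, a basic TCP reachability check can be sketched in Python. The host names are placeholders, and the ports are only the PostgreSQL and PgBouncer defaults (5432 and 6432), which may not match the actual deployment. A successful connection shows that the port is reachable; it says nothing about replication health or lag.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def report(replicas, timeout=3.0):
    """Return a {(host, port): reachable} mapping for the given replicas."""
    return {
        (host, port): can_connect(host, port, timeout)
        for host, port in replicas
    }


# Hypothetical replica addresses; replace with the real hosts and ports.
REPLICAS = [
    ("replica-1.example.internal", 5432),
    ("replica-2.example.internal", 6432),
]
```

Calling report(REPLICAS) from a host on the same network as the registry gives a quick view of which replicas accept connections at all.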

Possible resolutions:

  • Restore connectivity to unavailable replicas;
  • Wait for quarantined replicas to auto-reintegrate (5-minute cooldown);
  • Address underlying infrastructure or network issues;
  • Scale up if replicas are permanently lost.

Dependencies:

  • PgBouncer
  • Patroni
  • Network infrastructure

Escalate immediately if the pool continues to degrade or if the remaining replicas show signs of stress:

The definition for this alert can be found at: