
ContainerRegistryDBNoReplicasAvailable

This alert is triggered when the replica pool size has been zero for at least 2 minutes. This is a critical condition meaning:

  • All database replicas are unavailable, unreachable, or quarantined;
  • All read queries are being routed to the primary database;
  • Primary database load is significantly increased.

Immediate investigation is required.

This alert is based on avg(registry_database_lb_pool_size) == 0. The alert fires after the pool has been empty for 2 minutes.
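
The firing condition can be sketched in a few lines. This is a minimal illustration, not the alert rule itself: it assumes pool-size samples are available as time-ordered (timestamp, per-instance values) pairs, and the function name and sample layout are hypothetical.

```python
from statistics import mean

FOR_SECONDS = 120  # the alert's 2-minute hold period


def alert_fires(samples, now, hold=FOR_SECONDS):
    """Return True if avg(pool_size) has been 0 continuously for `hold` seconds.

    `samples` is a time-ordered list of (unix_ts, [pool_size per instance]).
    This layout is an assumption made for the example.
    """
    zero_since = None
    for ts, values in samples:
        if ts > now:
            break
        if values and mean(values) == 0:
            # Remember when the current run of zero readings started.
            if zero_since is None:
                zero_since = ts
        else:
            # Any non-zero average resets the hold timer.
            zero_since = None
    return zero_since is not None and now - zero_since >= hold
```

Because pool sizes are never negative, an average of 0 means every reporting instance sees an empty pool. A single non-zero sample resets the 2-minute timer.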

This is the most severe replica pool alert. It indicates complete loss of read replica capacity. The registry will continue to function by routing all queries to the primary, but this significantly increases primary load and may lead to performance degradation.
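
The fallback behaviour described above can be pictured with a small sketch. It is an illustration only, not the registry's actual load-balancing code; `pick_host` and its arguments are hypothetical names.

```python
import random


def pick_host(primary, healthy_replicas):
    """Return (host, role) for a read query.

    Reads go to a replica while the pool is non-empty; with an empty
    pool every read falls back to the primary.
    """
    if healthy_replicas:
        return random.choice(healthy_replicas), "replica"
    return primary, "primary"
```

With an empty pool, `pick_host("primary-db", [])` always returns the primary, which is why primary load rises while this alert is firing.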

  • s2: Critical condition. All read traffic is hitting the primary database, which may become overloaded.
  • Metrics:

    • registry: Database Detail - Load Balancing panel
    • Check registry_database_lb_pool_size - expected to be 0 while this alert is firing
    • Check primary database load metrics
    • Check registry_database_lb_pool_events_total for recent quarantine events
  • Logs:

    • Filter by json.msg: "replica quarantined" to see why replicas were removed
    • Filter by json.msg: "no replicas available" to confirm fallback to primary
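
If the logs have been exported for offline inspection, the same two filters can be applied with a short script. This is a sketch under assumptions: it expects newline-delimited JSON in which the message lives in a top-level `msg` field (the `json.` prefix in the filters above is assumed to be the log viewer's namespace), and `matching_entries` is a hypothetical helper name.

```python
import json

MESSAGES = {"replica quarantined", "no replicas available"}


def matching_entries(lines, messages=MESSAGES):
    """Yield parsed log entries whose `msg` field is one of `messages`."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("msg") in messages:
            yield entry
```

For example, `list(matching_entries(open("registry.log")))` would return only the quarantine and fallback entries, where `registry.log` is a placeholder file name.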

Recent registry deployments and configuration changes can be found here.

  1. Immediate: Check primary database health and load - is it coping?
  2. Check Patroni cluster status - are replicas up?
  3. Check network connectivity from registry pods to all replica hosts.
  4. Check PgBouncer status on replica hosts.
  5. Review recent events - what caused all replicas to become unavailable?
  6. Check if replicas are quarantined (they will auto-reintegrate after 5 minutes).
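
The connectivity check in step 3 can be sketched as a plain TCP probe run from a registry pod. This is a minimal example under assumptions: the host names are placeholders, and port 6432 is PgBouncer's default listen port, which may differ from the port used in this deployment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host names; substitute the real replica hosts.
    for host in ["replica-1.example.internal", "replica-2.example.internal"]:
        status = "reachable" if can_connect(host, 6432) else "UNREACHABLE"
        print(f"{host}:6432 {status}")
```

A successful TCP connection only shows that the port is open. It does not prove that PgBouncer or PostgreSQL behind it is healthy, so steps 2 and 4 are still needed.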

Possible resolutions:

  • Restore network connectivity to replicas;
  • Fix PgBouncer issues on replica hosts;
  • Address Patroni cluster problems;
  • Wait for quarantined replicas to auto-reintegrate (5-minute cooldown);
  • If replicas are permanently lost, scale up or fail over.

Dependencies:

  • PgBouncer
  • Patroni
  • Consul
  • Network infrastructure
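
When deciding whether to wait for the 5-minute quarantine cooldown or to intervene, it can help to work out when a quarantined replica becomes eligible again. The sketch below assumes the quarantine timestamp is taken from the "replica quarantined" log entry; `reintegration_eta` is a hypothetical helper, and eligibility does not guarantee that the replica will pass its health checks.

```python
from datetime import datetime, timedelta, timezone

COOLDOWN = timedelta(minutes=5)  # quarantine cooldown stated in this runbook


def reintegration_eta(quarantined_at, now=None):
    """Return (eligible_at, remaining) for a replica quarantined at `quarantined_at`.

    Both datetimes are expected to be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    eligible_at = quarantined_at + COOLDOWN
    # Clamp at zero once the cooldown has already elapsed.
    remaining = max(eligible_at - now, timedelta(0))
    return eligible_at, remaining
```

If the remaining time is zero and the pool is still empty, waiting is unlikely to help and the underlying cause needs fixing.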

Escalate immediately - this is an s2 condition:

The definition for this alert can be found at: