Skip to content

ContainerRegistryDBHighReplicaLagQuarantineRate

This alert is triggered when replicas are being quarantined due to replication lag exceeding thresholds at a rate of more than 0.05 quarantines/second for 5 minutes. This indicates:

  • High write load on the primary database;
  • Replica performance issues (slow disk, CPU saturation);
  • Replication bottlenecks;
  • Network issues affecting WAL streaming.

Replicas with excessive lag are quarantined to prevent stale reads. They are automatically reintegrated once they catch up.

This alert is based on registry_database_lb_pool_events_total with labels event="replica_quarantined", reason="replication_lag". The alert fires when the rate exceeds 0.05/second (~3 quarantines/minute) sustained for 5 minutes.

Related metrics:

  • registry_database_lb_lag_bytes - Replication lag in bytes per replica
  • registry_database_lb_lag_seconds - Replication lag in seconds per replica

This alert has a shorter duration (5 minutes) and higher threshold (0.05/sec) compared to connectivity quarantine alerts because lag-based quarantines can happen more frequently under load and indicate a more immediate performance concern.

  • s3: Replication lag is being mitigated by the quarantine mechanism, but may indicate primary database overload.
  • Metrics:

  • Logs: Filter by json.msg: "replica quarantined" AND json.reason: "replication_lag" to identify affected replicas.

Recent registry deployments and configuration changes can be found here.

  1. Check primary database WAL generation rate - is it unusually high?
  2. Check replica disk I/O and CPU utilization:
  3. Check network throughput between primary and replicas.
  4. Look for any long-running transactions or maintenance operations:
  5. Identify which replicas are being quarantined from logs:
  • Reduce write load on primary if possible;
  • Investigate and resolve replica performance bottlenecks;
  • Scale up replica resources if needed;
  • Quarantined replicas will auto-reintegrate once they catch up on lag.
  • Patroni (replication)
  • Network infrastructure (WAL streaming)

Escalate if lag-based quarantines persist or if primary shows signs of overload:

The definition for this alert can be found at: