ContainerRegistryDBHighReplicaLagQuarantineRate

Overview

This alert is triggered when replicas are being quarantined due to replication lag exceeding thresholds at a rate of more than 0.05 quarantines/second for 5 minutes. This indicates:

High write load on the primary database;
Replica performance issues (slow disk, CPU saturation);
Replication bottlenecks;
Network issues affecting WAL streaming.

Replicas with excessive lag are quarantined to prevent stale reads. They are automatically reintegrated once they catch up.

Services

Metrics

This alert is based on registry_database_lb_pool_events_total with labels event="replica_quarantined", reason="replication_lag". The alert fires when the rate exceeds 0.05/second (~3 quarantines/minute) sustained for 5 minutes.

Related metrics:

registry_database_lb_lag_bytes - Replication lag in bytes per replica
registry_database_lb_lag_seconds - Replication lag in seconds per replica

Alert Behavior

This alert has a shorter duration (5 minutes) and higher threshold (0.05/sec) compared to connectivity quarantine alerts because lag-based quarantines can happen more frequently under load and indicate a more immediate performance concern.

Severities

s3: Replication lag is being mitigated by the quarantine mechanism, but may indicate primary database overload.

Verification

Metrics:
- registry: Database Detail - Load Balancing panel
- patroni-registry: Overview - Replication lag graphs
- Check WAL generation rate on primary
Logs: Filter by json.msg: "replica quarantined" AND json.reason: "replication_lag" to identify affected replicas.

Recent changes

Recent registry deployments and configuration changes can be found here.

Troubleshooting

Check primary database WAL generation rate - is it unusually high?
- Grafana Explore: WAL generation rate
- Grafana: patroni-registry Overview - WAL panels.
Check replica disk I/O and CPU utilization:
- Grafana: patroni-registry Overview - Host metrics panels.
Check network throughput between primary and replicas.
Look for any long-running transactions or maintenance operations:
- Grafana Explore: active transactions
Identify which replicas are being quarantined from logs:
- Kibana: replica quarantined (replication_lag)
- Look for json.db_host_addr and json.lag_bytes fields.

Possible Resolutions

Reduce write load on primary if possible;
Investigate and resolve replica performance bottlenecks;
Scale up replica resources if needed;
Quarantined replicas will auto-reintegrate once they catch up on lag.

Dependencies

Patroni (replication)
Network infrastructure (WAL streaming)

Escalation

Escalate if lag-based quarantines persist or if primary shows signs of overload:

Definitions

The definition for this alert can be found at: