ContainerRegistryDBHighReplicaLagQuarantineRate
Overview
Section titled “Overview”This alert is triggered when replicas are being quarantined due to replication lag exceeding thresholds at a rate of more than 0.05 quarantines/second for 5 minutes. This indicates:
- High write load on the primary database;
- Replica performance issues (slow disk, CPU saturation);
- Replication bottlenecks;
- Network issues affecting WAL streaming.
Replicas with excessive lag are quarantined to prevent stale reads. They are automatically reintegrated once they catch up.
Services
Section titled “Services”Metrics
Section titled “Metrics”This alert is based on registry_database_lb_pool_events_total with labels event="replica_quarantined", reason="replication_lag". The alert fires when the rate exceeds 0.05/second (~3 quarantines/minute) sustained for 5 minutes.
Related metrics:
registry_database_lb_lag_bytes- Replication lag in bytes per replicaregistry_database_lb_lag_seconds- Replication lag in seconds per replica
Alert Behavior
Section titled “Alert Behavior”This alert has a shorter duration (5 minutes) and higher threshold (0.05/sec) compared to connectivity quarantine alerts because lag-based quarantines can happen more frequently under load and indicate a more immediate performance concern.
Severities
Section titled “Severities”- s3: Replication lag is being mitigated by the quarantine mechanism, but may indicate primary database overload.
Verification
Section titled “Verification”-
Metrics:
registry: Database Detail- Load Balancing panelpatroni-registry: Overview- Replication lag graphs- Check WAL generation rate on primary
-
Logs: Filter by
json.msg: "replica quarantined" AND json.reason: "replication_lag"to identify affected replicas.
Recent changes
Section titled “Recent changes”Recent registry deployments and configuration changes can be found here.
Troubleshooting
Section titled “Troubleshooting”- Check primary database WAL generation rate - is it unusually high?
- Check replica disk I/O and CPU utilization:
- Grafana: patroni-registry Overview - Host metrics panels.
- Check network throughput between primary and replicas.
- Look for any long-running transactions or maintenance operations:
- Identify which replicas are being quarantined from logs:
- Kibana: replica quarantined (replication_lag)
- Look for
json.db_host_addrandjson.lag_bytesfields.
Possible Resolutions
Section titled “Possible Resolutions”- Reduce write load on primary if possible;
- Investigate and resolve replica performance bottlenecks;
- Scale up replica resources if needed;
- Quarantined replicas will auto-reintegrate once they catch up on lag.
Dependencies
Section titled “Dependencies”- Patroni (replication)
- Network infrastructure (WAL streaming)
Escalation
Section titled “Escalation”Escalate if lag-based quarantines persist or if primary shows signs of overload:
Definitions
Section titled “Definitions”The definition for this alert can be found at: