Container Registry Database Load Balancing
Background
The Container Registry supports database load balancing. This feature is implemented as described in the technical specification.
You can follow Container Registry: Database Load Balancing (DLB) (&8591) for more updates. The rollout plan being followed is detailed here.
Alerts
| Alert | Condition | Duration | Severity |
|---|---|---|---|
| ContainerRegistryDBHighReplicaPoolChurnRate | DNS add/remove rate > 0.1/sec | 15m | s3 |
| ContainerRegistryDBHighReplicaConnectivityQuarantineRate | Connectivity quarantine rate > 0.05/sec | 10m | s3 |
| ContainerRegistryDBHighReplicaLagQuarantineRate | Lag quarantine rate > 0.05/sec | 5m | s3 |
| ContainerRegistryDBReplicaPoolSizeInstability | Pool size stddev > 1 | 5m | s3 |
| ContainerRegistryDBReplicaPoolDegraded | Pool < 50% of 1-day avg | 5m | s3 |
| ContainerRegistryDBNoReplicasAvailable | Pool size == 0 | 2m | s2 |
| ContainerRegistryDBLoadBalancerReplicaPoolSize | Pool below minimum threshold | 5m | s3/s4 |
| PatroniRegistryServiceDnsLookupsApdexSLOViolation | DNS lookup latency SLO violation | - | s3 |
The first six alerts monitor the replica connectivity tracking and quarantine mechanism introduced in MR !2596. The mechanism protects the load balancer from unstable replicas through:
- Consecutive Failure Detection: Quarantines a replica after 3 consecutive connectivity failures.
- Flapping Detection: Quarantines a replica after 5 add/remove events within a 60-second window.
Quarantined replicas are automatically reintegrated after a 5-minute cooldown period.
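To reason about when a replica should be in quarantine, the two rules and the cooldown above can be sketched as a small state tracker. This is an illustrative Python model only, not the registry's implementation: the class name `ReplicaTracker`, its methods, and the choice to reset counters on quarantine are assumptions made for the sketch. Only the thresholds (3 failures, 5 events in 60 seconds, 5-minute cooldown) come from the description above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional

# Thresholds taken from the runbook text above.
FAILURE_THRESHOLD = 3    # consecutive connectivity failures
FLAP_THRESHOLD = 5       # add/remove events inside the window
FLAP_WINDOW_S = 60.0     # flapping detection window, in seconds
COOLDOWN_S = 300.0       # quarantine duration (5 minutes), in seconds


@dataclass
class ReplicaTracker:
    """Hypothetical per-replica state; names are illustrative only."""

    consecutive_failures: int = 0
    pool_events: Deque[float] = field(default_factory=deque)
    quarantined_until: Optional[float] = None

    def record_check(self, ok: bool, now: float) -> None:
        """Record one connectivity check; a success resets the failure streak."""
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            self._quarantine(now)

    def record_pool_event(self, now: float) -> None:
        """Record one add or remove event and check for flapping."""
        self.pool_events.append(now)
        # Drop events that fell out of the 60-second window.
        while self.pool_events and now - self.pool_events[0] > FLAP_WINDOW_S:
            self.pool_events.popleft()
        if len(self.pool_events) >= FLAP_THRESHOLD:
            self._quarantine(now)

    def is_quarantined(self, now: float) -> bool:
        """Return True while the cooldown is running; reintegrate afterwards."""
        if self.quarantined_until is not None and now >= self.quarantined_until:
            self.quarantined_until = None
        return self.quarantined_until is not None

    def _quarantine(self, now: float) -> None:
        # Assumption: counters start fresh once the replica is quarantined.
        self.quarantined_until = now + COOLDOWN_S
        self.consecutive_failures = 0
        self.pool_events.clear()
```

In this model, three failed checks at t=0, 1, and 2 seconds quarantine the replica until t=302, after which `is_quarantined` returns `False` again. Comparing such a model against the quarantine-rate alerts can help judge whether a burst of quarantines is consistent with a few genuinely unstable replicas.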
The list of log entries emitted by the registry is documented here.
To find all relevant log entries, you can filter logs by json.msg: "replica" or "replicas" or "LSN" (example).
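When working from exported log lines rather than the log viewer, the same filter can be approximated offline. The sketch below assumes each line is a JSON object with a top-level `msg` field (the `json.` prefix in the query above is taken to be added by the logging pipeline); the function name `dlb_entries` is hypothetical, and the whole-word, case-insensitive match is an approximation of the query, not a guaranteed equivalent.

```python
import json
import re
from typing import Dict, Iterable, Iterator

# Whole-word match for "replica", "replicas", or "LSN", ignoring case.
PATTERN = re.compile(r"\b(replicas?|LSN)\b", re.IGNORECASE)


def dlb_entries(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield parsed log entries whose msg mentions replicas or LSNs."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        if PATTERN.search(str(entry.get("msg", ""))):
            yield entry
```

For example, `list(dlb_entries(open("registry.log")))` would collect the matching entries from a hypothetical exported file named `registry.log`.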
Metrics
The list of Prometheus metrics emitted by the registry is documented here.
There are graphs for all relevant metrics in the registry: Database Detail dashboard, under a dedicated Load Balancing row.