ContainerRegistryDBLoadBalancerReplicaPoolSize

Overview

This alert is triggered when the size of the application-side database load balancer replica pool has been below the configured minimum threshold for a prolonged period of time. This can be due to:

Missing or unhealthy/unresponsive database replica hosts;
Application unable to connect to database replica hosts due to e.g. a network issue.

The HTTP API component makes use of database load balancing. The registry is able to operate using an empty replica pool, in which case all queries are directed to primary. Therefore, this alert does not pose any immediate availability risk, but will increase the load on the primary.
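The fallback behavior described above can be illustrated with a minimal Python sketch. This is not the registry's actual implementation: the function name `pick_read_host`, the round-robin selection, and the `query_index` parameter are assumptions made purely for illustration. The only behavior taken from this runbook is that an empty pool sends every query to the primary.

```python
def pick_read_host(replica_pool, primary, query_index):
    """Return the host that should serve a query.

    Assumption: replicas are picked round-robin by query_index. The
    runbook only states that an empty pool directs all queries to the
    primary.
    """
    if not replica_pool:
        # Empty pool: no availability loss, but the primary takes all load.
        return primary
    return replica_pool[query_index % len(replica_pool)]
```

With an empty pool every call returns the primary, which is why the alert signals extra primary load rather than an outage.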

As the recipient of this alert, please confirm whether any database replica hosts are missing or unresponsive, and investigate why. Ultimately, restore the number of available replicas.

Services

Metrics

This alert is based on registry_database_lb_pool_size, which is a gauge. It measures the size (count) of the application-side load balancer replica pool reported by each registry instance. The alert observes the maximum of the reported size across all registry instances to exclude temporary fluctuations due to expected events such as scaling.

The current threshold is based on the minimum expected number of replica hosts in each environment. The metric value should equal the minimum expected number of replica hosts in each environment.
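The evaluation logic described in this section can be sketched in Python as follows. This is a simplified model, not the actual alert rule: the function name, the representation of samples as dictionaries, and the `min_breach_samples` parameter (standing in for the "prolonged period") are all hypothetical.

```python
def pool_size_alert_fires(samples, min_replicas, min_breach_samples):
    """Decide whether the alert condition holds.

    samples: list of dicts mapping registry instance name to its reported
        registry_database_lb_pool_size value, oldest first.
    min_replicas: hypothetical per-environment minimum threshold.
    min_breach_samples: hypothetical number of consecutive samples that
        must breach the threshold before the alert fires.
    """
    if len(samples) < min_breach_samples:
        return False
    recent = samples[-min_breach_samples:]
    # Take the maximum across instances so that a single instance that is
    # still filling its pool (e.g. just after scaling) does not trigger
    # the alert. Samples with no reporting instance are treated as no
    # breach (an assumption of this sketch).
    return all(
        bool(sample) and max(sample.values()) < min_replicas
        for sample in recent
    )
```

For example, with `min_replicas=2`, a window in which one instance still reports 2 does not fire, while a window in which every instance reports at most 1 does.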

Alert Behavior

This alert is expected to be rare. There are no automated silencing rules. The alert should be silenced only if the reported value is found to be incorrect and a related application change is due for deployment soon.

30 days' worth of data around enabling load balancing in staging can be observed here. We faced some network constraints at the time, so the data shows what the metric looks like when there were no replicas in the pool and then when the pool was filled with the expected number of replicas.

Severities

The registry is able to operate using an empty replica pool, in which case all queries are directed to primary. Therefore, this alert does not pose any immediate availability risk, but will increase the load on the primary.

All clients may be affected by this, but only if the primary server becomes overloaded as a side-effect, at which point queries and therefore API requests may suffer increased latency.

Please ensure the primary is not overloaded due to the missing replicas. If there is plenty of room left, then this is low severity, likely s4. Otherwise, s3 is appropriate.
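The triage rule above can be expressed as a small Python sketch. The `primary_utilization` input and the 0.8 cutoff are hypothetical; this runbook does not define what "plenty of room" means numerically, so choose a value that matches your capacity data for the primary.

```python
def suggest_severity(primary_utilization, overload_threshold=0.8):
    """Suggest an incident severity for this alert.

    primary_utilization: fraction (0.0 to 1.0) of primary capacity in use;
        how this is measured is left to the operator.
    overload_threshold: hypothetical cutoff, not taken from the runbook.
    """
    if primary_utilization >= overload_threshold:
        # Little headroom left on the primary.
        return "s3"
    # Plenty of headroom: low severity.
    return "s4"
```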

Verification

Metrics:
Logs: This query shows the error log messages that are emitted when the registry fails to connect to database replicas.

Recent changes

Recent registry deployments and configuration changes can be found here.

Before proceeding with a rollback, please:

Check the changelog in the MR that updated the registry.
Review MRs included in the related release issue.
If any MR has the label ~cannot-rollback applied, a detailed description should exist in that MR.
Otherwise, proceed to revert the commit and watch the deployment.
Review the dashboards and expect the metric to go back to normal.

Troubleshooting

We need to identify why replica(s) are unhealthy. To do so we can look at the following dashboards and metrics:

registry: Database Detail: This dashboard includes the application metrics that triggered this alarm. Look at the Load Balancing panel for more details. The included graphs allow us to identify the current pool size and when the problem started.
patroni-registry: Overview: Look at each node's metrics to identify which ones are unhealthy and why.
pgbouncer-registry: Overview: Look at the PgBouncer metrics to identify potential issues at the connection pool level.

Possible Resolutions

Resolve the underlying cause for the unhealthy state of replica(s).

If you suspect any application-side metrics issues, please inform the development team.

Dependencies

PgBouncer
Patroni
Consul

Escalate if the primary server becomes overloaded due to the missing replicas. Escalate to the development team if metric values appear to be inaccurate or further help is required for investigation:

The definition for this alert can be found at registry/registry-db.yml.