# Recovering from CI Patroni cluster lagging too much or becoming completely broken
**IMPORTANT**: This troubleshooting only applies before CI decomposition is finished (i.e. `patroni-ci` is still just a standby replica of `patroni`). After `patroni-ci` is promoted as Writer, this runbook is no longer valid.
## Symptoms

We have several alerts that detect replication problems, but this runbook should only be considered if those alerts relate to the Standby Leader of our `patroni-ci` cluster; otherwise, please treat the incident as a regular replica lagging issue.
Possible related alerts are:
- Alert that replication is stopped
- Alert that replication lag is over 2 min (over 120 min on archive and delayed replicas)
- Alert that replication lag is over 200MB
To check which node is the Standby Leader of our `patroni-ci` cluster, execute:

```shell
ssh patroni-ci-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
```
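The output is a table of cluster members; the node whose `Role` is `Standby Leader` is the one this runbook is concerned with. A hypothetical illustration (hostnames, timeline, and lag values are made up, and the exact layout varies by Patroni version):

```
+-----------------------+----------+----------------+---------+----+-----------+
| Member                | Host     | Role           | State   | TL | Lag in MB |
+-----------------------+----------+----------------+---------+----+-----------+
| patroni-ci-01-db-gprd | 10.x.x.x | Standby Leader | running |  7 |           |
| patroni-ci-02-db-gprd | 10.x.x.x | Replica        | running |  7 |         0 |
+-----------------------+----------+----------------+---------+----+-----------+
```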
## Possible checks

- Check for lag pile-up (continuous lag increase without it reducing) in the `patroni-ci` Standby Leader lag in Thanos
- Check if the CI Standby Leader can't find WAL segments from the WAL stream:
  - SSH into the Standby Leader of the `patroni-ci` cluster
  - Check the `/var/log/gitlab/postgresql/postgresql.csv` log file for errors like `FATAL,XX000,"could not receive data from WAL stream: ERROR: requested WAL segment ???????????? has already been removed"`
- Search the `patroni-ci` logs in Elastic for `FATAL` errors and messages like `XX000` or `"could not receive data from WAL stream"`
## Resolution

This procedure can recover `patroni-ci` from being broken, but it was designed as a rollback procedure in case the CI decomposition failover fails. This solution will no longer be applicable once CI decomposition is finished and the CI cluster has fully diverged from Main.
Before we’ve finished CI decomposition, the Patroni CI cluster is just another set of replicas and is only used for read-only traffic by `gitlab-rails`. This means it is quite simple to recover if the cluster becomes corrupted, too lagged behind, or otherwise unavailable: just send all CI read-only traffic to the Main Patroni replicas. The quickest way to do this is to reconfigure all Patroni Main replicas to also present as `ci-db-replica.service.consul`.
In other words, the resolution consists of temporarily routing the CI read-only workload from the `patroni-ci` cluster to our `patroni-main` replicas while we rebuild and re-sync the `patroni-ci` cluster.
To handle the CI read-only workload in case of incident, all `patroni-main` nodes have 3 additional pgbouncers deployed, listening on TCP ports 6435, 6436, and 6437. While not in use, these ports are registered under the `idle-ci-db-replica` Consul service name and, as the name suggests, nothing points at these extra pgbouncers.
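You can confirm the idle pgbouncers are registered with a quick sketch that reuses the `dig` pattern from the validation steps below (run from any Consul-joined node):

```shell
# While not in use, the extra patroni-main pgbouncer ports (6435-6437)
# should resolve under the idle-ci-db-replica service name
dig @localhost idle-ci-db-replica.service.consul +short SRV | sort -k 4
```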
## Resolution Steps - Route CI read-only workload to Main

In case of incident you will have to:
1. In `patroni-main` nodes, rename the Consul service `idle-ci-db-replica` to `ci-db-replica`. We have a sample MR for what this would involve.
2. In `patroni-ci` nodes, rename the Consul service name from `ci-db-replica` to `dormant-ci-db-replica`. We have a sample MR for what this would involve.

Note: In case these MRs are unavailable, the diffs are:
Diff for reconfiguring Patroni cluster to also present as ci-db-replica in Consul
```diff
diff --git a/roles/gprd-base-db-patroni-v12.json b/roles/gprd-base-db-patroni-v12.json
--- a/roles/gprd-base-db-patroni-v12.json
+++ b/roles/gprd-base-db-patroni-v12.json
@@ -5,9 +5,9 @@
     "gitlab-pgbouncer": {
       "consul": {
         "port_service_name_overrides": {
-          "6435": "idle-ci-db-replica",
-          "6436": "idle-ci-db-replica",
-          "6437": "idle-ci-db-replica"
+          "6435": "ci-db-replica",
+          "6436": "ci-db-replica",
+          "6437": "ci-db-replica"
         }
       },
       "listen_ports": [
```
```diff
diff --git a/roles/gprd-base-db-patroni-ci.json b/roles/gprd-base-db-patroni-ci.json
--- a/roles/gprd-base-db-patroni-ci.json
+++ b/roles/gprd-base-db-patroni-ci.json
@@ -5,7 +5,7 @@
   "default_attributes": {
     "gitlab-pgbouncer": {
       "consul": {
-        "service_name": "ci-db-replica"
+        "service_name": "dormant-ci-db-replica"
       },
       "databases": {
         "gitlabhq_production": {
```
3. You will likely want to apply this as quickly as possible by running chef directly on all the Patroni Main nodes, e.g. with the sketch below.
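   A sketch reusing the `knife ssh` pattern from the cleanup step below; the `gprd-base-db-patroni-v12` role name is taken from the diff above:

   ```shell
   # Force an immediate chef-client run on all Patroni Main (patroni-v12) nodes
   knife ssh -C 10 'roles:gprd-base-db-patroni-v12' 'sudo chef-client'
   ```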
4. Once you’ve done this, you will have to do one minor cleanup on the Patroni CI nodes: since the `gitlab-pgbouncer` cookbook does not handle renaming `service_name`, you will also need to delete `/etc/consul/conf.d/ci-db-replica*.json` from the problematic CI Patroni nodes, by executing:

   ```shell
   knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo rm -f /etc/consul/conf.d/ci-db-replica*.json'
   knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo consul reload'
   ```
5. Validate that the Consul resolver returns just `patroni-v12` (aka `patroni-main`) replica hosts by running `dig @localhost ci-db-replica.service.consul +short SRV | sort -k 4`. Name resolution for `ci-db-replica.service.consul` after the CI read-only workload is routed to Main looks like:

   ```
   $ dig @localhost ci-db-replica.service.consul +short SRV | sort -k 4
   1 1 6435 patroni-v12-01-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-01-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-01-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-02-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-02-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-02-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-03-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-03-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-03-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-04-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-04-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-04-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-06-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-06-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-06-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-07-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-07-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-07-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-08-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-08-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-08-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-09-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-09-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-09-db-gprd.node.east-us-2.consul.
   ```
6. Verify that CI read requests are shifting:
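   One hypothetical spot check: on a `patroni-main` node, established client connections to the extra pgbouncer ports should grow as CI read-only traffic shifts over:

   ```shell
   # Count established connections per extra pgbouncer port (6435-6437)
   for port in 6435 6436 6437; do
     echo -n "port ${port}: "
     sudo ss -tn state established "( sport = :${port} )" | tail -n +2 | wc -l
   done
   ```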
## Resolution Steps - Redeploy and Resync the Patroni CI cluster

Firstly, escalate the incident to a DBRE and ask them to proceed with the recovery of the broken CI Patroni cluster using a snapshot from the Master cluster (instead of `pg_basebackup`).
Once the CI Patroni cluster has fully recovered, you can revert these changes, but you should do this in 2 MRs using the following steps:
1. Change `roles/gprd-base-db-patroni-ci.json` back to `service_name: ci-db-replica`. Then wait for chef to run on the CI Patroni nodes and confirm they are correctly registering in Consul under the DNS name `ci-db-replica.service.consul`.

   You can validate by running `dig @localhost ci-db-replica.service.consul +short SRV | sort -k 4`; the resolver should return both `patroni-v12` (aka `patroni-main`) and `patroni-ci` replica hosts. Name resolution for the `ci-db-replica.service.consul` SRV name looks like:

   ```
   $ dig @localhost ci-db-replica.service.consul +short SRV | sort -k 4
   1 1 6432 patroni-ci-02-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-02-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-02-db-gprd.node.east-us-2.consul.
   1 1 6432 patroni-ci-04-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-04-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-04-db-gprd.node.east-us-2.consul.
   1 1 6432 patroni-ci-05-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-05-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-05-db-gprd.node.east-us-2.consul.
   1 1 6432 patroni-ci-06-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-06-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-06-db-gprd.node.east-us-2.consul.
   1 1 6432 patroni-ci-07-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-07-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-07-db-gprd.node.east-us-2.consul.
   1 1 6432 patroni-ci-08-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-08-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-08-db-gprd.node.east-us-2.consul.
   1 1 6432 patroni-ci-09-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-09-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-09-db-gprd.node.east-us-2.consul.
   1 1 6432 patroni-ci-10-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-10-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-10-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-01-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-01-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-01-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-02-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-02-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-02-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-03-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-03-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-03-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-04-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-04-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-04-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-06-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-06-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-06-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-07-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-07-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-07-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-08-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-08-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-08-db-gprd.node.east-us-2.consul.
   1 1 6435 patroni-v12-09-db-gprd.node.east-us-2.consul.
   1 1 6436 patroni-v12-09-db-gprd.node.east-us-2.consul.
   1 1 6437 patroni-v12-09-db-gprd.node.east-us-2.consul.
   ```
2. Revert the `port_service_name_overrides` in `roles/gprd-base-db-patroni-main.json` to `idle-ci-db-replica` so that `patroni-main` nodes stop registering in Consul for `ci-db-replica.service.consul`.
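   This is just the inverse of the first diff above, roughly:

   ```diff
   -          "6435": "ci-db-replica",
   -          "6436": "ci-db-replica",
   -          "6437": "ci-db-replica"
   +          "6435": "idle-ci-db-replica",
   +          "6436": "idle-ci-db-replica",
   +          "6437": "idle-ci-db-replica"
   ```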
3. Remove `/etc/consul/conf.d/dormant-ci-db-replica*.json` from the CI Patroni nodes, as this is no longer needed and Chef won’t clean it up for you:

   ```shell
   knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo rm -f /etc/consul/conf.d/dormant-ci-db-replica*.json'
   knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo consul reload'
   ```
4. Verify that DNS resolution for `ci-db-replica.service.consul` returns only `patroni-ci` nodes by executing `dig @localhost ci-db-replica.service.consul +short SRV | sort -k 4`; the expected shape is sketched below.
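   Based on the earlier SRV listings, the output should now contain only `patroni-ci-*` hosts on ports 6432-6434, for example:

   ```
   1 1 6432 patroni-ci-02-db-gprd.node.east-us-2.consul.
   1 1 6433 patroni-ci-02-db-gprd.node.east-us-2.consul.
   1 1 6434 patroni-ci-02-db-gprd.node.east-us-2.consul.
   ...
   ```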
5. Verify that CI read requests shifted back:
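   As a hypothetical mirror of the earlier spot check, connection counts on the `patroni-ci` pgbouncer ports should grow again while the extra `patroni-main` ports (6435-6437) drain:

   ```shell
   # On a patroni-ci node: count established connections per pgbouncer port
   for port in 6432 6433 6434; do
     echo -n "port ${port}: "
     sudo ss -tn state established "( sport = :${port} )" | tail -n +2 | wc -l
   done
   ```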