PostgresSplitBrain
Overview
- This alert, PostgresSplitBrain, detects a split-brain scenario in a PostgreSQL cluster managed by Patroni. It validates that each Patroni cluster has only one Primary node, i.e. each type of Patroni cluster has exactly one Primary node accepting read/write requests.
- An incorrect Consul or Patroni configuration can trigger this alert. Incorrect health check results can cause Patroni to incorrectly promote a new leader or fail to demote the current leader.
Services
- Postgres Overview
- Patroni Service
- Consul
- Team that owns the service: Production Engineering: Database Reliability
Metrics
- Link to the metrics catalogue
- This Prometheus expression counts the number of PostgreSQL instances in the gprd/gstg environment that are not in replica mode (pg_replication_is_replica == 0). If this count is greater than 1 for a cluster, the alert fires, indicating that more than one PostgreSQL instance is operating in read-write mode within that cluster. The condition must be true for at least 1 minute to trigger the alert (see the sketch after this list).
- In both gstg and gprd environments we have 3 Patroni clusters (main, ci and registry at the time of writing), so we should expect 3 primary nodes in total. Be sure to check the type field to distinguish the Patroni clusters and ensure there is only 1 primary per type. If we see more than three different type values it might suggest a PatroniConsulMultipleMaster situation.
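A minimal sketch of the alert condition, assuming the rule groups by environment and Patroni cluster type (the exact expression lives in the alerting rules and may differ):
# Sketch only, not the production rule: count the nodes per environment and
# cluster type that report they are NOT replicas (value 0 = read-write) and
# flag any cluster where more than one node does so.
count by (env, type) (
  pg_replication_is_replica{env=~"gprd|gstg"} == 0
) > 1
# The alerting rule would additionally carry "for: 1m" so the condition must
# hold for at least one minute before firing.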
Alert Behavior
- We can silence this alert by going here, finding the PostgresSplitBrain alert and clicking the silence option.
- This is a very rare and critical event.
- There might be a sudden spike in gitlab_schema_prevent_write errors. Link to dashboard
Severities
- This alert might create S1 incidents.
- Who is likely to be impacted by the cause of this alert?
  - Depending on the database, it could be some or all customers. If it is the main or ci database, we expect nearly all customers to be impacted. If it is the registry database, the impact would be limited to the subset of customers that depend on the registry.
- Review the Incident Severity Handbook page to identify the required Severity Level.
Verification
- To validate whether a Patroni cluster has more than one leader, a quick look at this dashboard should tell you the type of any Patroni cluster that has more than one leader.
- Executing the query pg_replication_is_replica{type="<type of cluster with more than one leader>", env="<gstg/gprd>"} on your Grafana dashboard should show you the cluster and the FQDN of the leader nodes. Note that the pg_replication_is_replica value for leader nodes is 0 (a hedged example follows this list).
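For example (a sketch only; fill in the placeholders using the dashboard above), filtering the metric down to non-replica nodes makes the offending primaries easier to spot:
# Returns one series per node that currently reports it is NOT a replica
# (value 0 = primary / read-write). More than one result for the same
# cluster type confirms the split-brain.
pg_replication_is_replica{type="<type of cluster with more than one leader>", env="<gstg/gprd>"} == 0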
Recent changes
- Check recently closed issues to determine whether a CR was completed recently that might be correlated: Recently Closed Issues
Troubleshooting
- The first step is to figure out which Patroni cluster has more than one leader; a quick look at this dashboard should tell you the type of the Patroni cluster that has more than one leader.
- Executing the query pg_replication_is_replica{type="<type of cluster with more than one leader>", env="gprd"} on your Grafana dashboard should show you the cluster and the FQDN of the leader nodes. Note that the pg_replication_is_replica value for leader nodes is 0.
- It might be helpful to look at recent MRs to see whether any changes related to the Patroni clusters rolled out recently in the chef-repo or config-mgmt.
- Steps that can be used via the Patroni CLI to remediate, for example (see also the read-only verification sketch after this block):
# (Untested steps, proceed with extreme caution)
# Connect to the affected Patroni node
ssh patroni-main-v14-03-db-gprd.c.gitlab-production.internal
# List the cluster members and the role Patroni assigns to each
sudo gitlab-patronictl list
# Move the data directory aside so this node cannot start as a primary again
sudo mv /var/opt/gitlab/postgresql/data12 /var/opt/gitlab/postgresql/data"<find the number going into the directory>"_dontstart_see_production_"<Issue_number>"
# Perform a manual failover so a single leader is elected
sudo gitlab-patronictl failover
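In addition to patronictl, you can ask PostgreSQL itself which nodes believe they are primaries. A minimal, untested sketch (assuming the gitlab-psql wrapper is available on the Patroni nodes; otherwise run psql as the postgres user):
# (Untested, read-only check) Run on each node reported as a leader.
# pg_is_in_recovery() returns "f" on a primary (read-write) and "t" on a replica.
ssh <fqdn of suspected leader node>
sudo gitlab-psql -c "SELECT pg_is_in_recovery();"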
Possible Resolutions
Dependencies
- Internal dependencies such as issues with the Patroni configuration, the Consul configuration, or false-positive health checks of nodes may cause a split-brain situation (a sketch for checking what Consul currently advertises as the leader follows this list).
- External dependencies such as a network outage or high network latency may also cause a split-brain situation.
- Please use /devoncall <incident_url> on Slack for any escalation that meets the criteria.
- Slack channels where help is likely to be found: #g_infra_database_reliability, #database
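If a Consul misconfiguration is suspected, one way to see which node Consul currently advertises as the leader is a DNS query against the local Consul agent. This is a sketch only: the service name below is an assumption and differs per Patroni cluster.
# (Sketch; assumes the Patroni leader is registered in Consul with a "master"
# tag on a service named "patroni"; adjust for the ci/registry clusters)
dig @127.0.0.1 -p 8600 +short master.patroni.service.consul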