PostgresSplitBrain
Overview
- This alert, PostgresSplitBrain, detects a split-brain scenario in a PostgreSQL cluster managed by Patroni. It validates that each Patroni cluster has only one Primary node, i.e. each type of Patroni cluster has exactly one Primary node accepting read/write requests.
- An incorrect Consul or Patroni configuration can trigger this alert. Incorrect health check results can cause Patroni to incorrectly promote a new leader or fail to demote the current leader.
Services
- Postgres Overview
- Patroni Service
- Consul
- Team that owns the service: Production Engineering: Database Reliability
Metrics
- Link to the metrics catalogue
- This Prometheus expression counts the number of PostgreSQL instances in the gprd/gstg environment that are not in replica mode (pg_replication_is_replica == 0). If this count is greater than 1 for a cluster, the alert fires, indicating that more than one PostgreSQL instance is operating in read-write mode within that cluster. The condition must be true for at least 1 minute to trigger the alert (see the sketch after this list).
- In both gstg and gprd environments we have 3 Patroni clusters (main, ci and registry at the time of writing), so we should expect 3 primary nodes in total. Be sure to check the type field to distinguish the Patroni clusters and ensure there is only 1 primary per type. If we see more than three different type values it might suggest a PatroniConsulMultipleMaster situation.
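A minimal sketch of the alert condition, assuming the rule groups by environment and Patroni cluster type (the exact expression lives in the alerting rules and may differ):
# Sketch only, not the production rule: count the nodes per environment and
# cluster type that report they are NOT replicas (value 0 = read-write) and
# flag any cluster where more than one node does so.
count by (env, type) (
  pg_replication_is_replica{env=~"gprd|gstg"} == 0
) > 1
# The alerting rule would additionally carry "for: 1m" so the condition must
# hold for at least one minute before firing.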
Alert Behavior
- We can silence this alert by going here, finding the PostgresSplitBrain alert and clicking the silence option.
- This is a very rare and critical event.
- There might be a sudden spike in gitlab_schema_prevent_write errors. Link to dashboard
Severities
- This alert might create S1 incidents.
- Who is likely to be impacted by the cause of this alert?
  - Depending on the database, it could be some or all customers. If it is the main or ci database, we expect nearly all customers to be impacted. If it is the registry database, the impact would be limited to the subset of customers that depend on the registry.
- Review the Incident Severity Handbook page to identify the required Severity Level.
Verification
- To validate whether a Patroni cluster has more than one leader, a quick look at this dashboard should tell you the type of any Patroni cluster that has more than one leader.
- Executing the query pg_replication_is_replica{type="<type of cluster with more than one leader>", env="<gstg/gprd>"} on your Grafana dashboard should show you the cluster and the FQDN of the leader nodes. Note that the pg_replication_is_replica value for leader nodes is 0 (a hedged example follows this list).
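For example (a sketch only; fill in the placeholders using the dashboard above), filtering the metric down to non-replica nodes makes the offending primaries easier to spot:
# Returns one series per node that currently reports it is NOT a replica
# (value 0 = primary / read-write). More than one result for the same
# cluster type confirms the split-brain.
pg_replication_is_replica{type="<type of cluster with more than one leader>", env="<gstg/gprd>"} == 0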
Recent changes
- Check recently closed issues to determine whether a CR was completed recently that might be correlated: Recently Closed Issues
Troubleshooting
- The first step is to figure out which Patroni cluster has more than one leader; a quick look at this dashboard should tell you the type of the Patroni cluster that has more than one leader.
- Executing the query pg_replication_is_replica{type="<type of cluster with more than one leader>", env="gprd"} on your Grafana dashboard should show you the cluster and the FQDN of the leader nodes. Note that the pg_replication_is_replica value for leader nodes is 0.
- It might be helpful to look at recent MRs to see whether any changes related to the Patroni clusters rolled out recently in the chef-repo or config-mgmt.
- Steps that can be used via the Patroni CLI to remediate, for example (see also the read-only verification sketch after this block):
# (Untested steps, proceed with extreme caution)
# Connect to the affected Patroni node
ssh patroni-main-v14-03-db-gprd.c.gitlab-production.internal
# List the cluster members and the role Patroni assigns to each
sudo gitlab-patronictl list
# Move the data directory aside so this node cannot start as a primary again
sudo mv /var/opt/gitlab/postgresql/data12 /var/opt/gitlab/postgresql/data"<find the number going into the directory>"_dontstart_see_production_"<Issue_number>"
# Perform a manual failover so a single leader is elected
sudo gitlab-patronictl failover
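In addition to patronictl, you can ask PostgreSQL itself which nodes believe they are primaries. A minimal, untested sketch (assuming the gitlab-psql wrapper is available on the Patroni nodes; otherwise run psql as the postgres user):
# (Untested, read-only check) Run on each node reported as a leader.
# pg_is_in_recovery() returns "f" on a primary (read-write) and "t" on a replica.
ssh <fqdn of suspected leader node>
sudo gitlab-psql -c "SELECT pg_is_in_recovery();"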
Possible Resolutions
Dependencies
- Internal dependencies such as issues with the Patroni configuration, the Consul configuration, or false-positive health checks of nodes may cause a split-brain situation (a sketch for checking what Consul currently advertises as the leader follows this list).
- External dependencies such as a network outage or high network latency may also cause a split-brain situation.
- Please use /devoncall <incident_url> on Slack for any escalation that meets the criteria.
- Slack channels where help is likely to be found: #g_infra_database_reliability, #database
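If a Consul misconfiguration is suspected, one way to see which node Consul currently advertises as the leader is a DNS query against the local Consul agent. This is a sketch only: the service name below is an assumption and differs per Patroni cluster.
# (Sketch; assumes the Patroni leader is registered in Consul with a "master"
# tag on a service named "patroni"; adjust for the ci/registry clusters)
dig @127.0.0.1 -p 8600 +short master.patroni.service.consul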