
PostgresSplitBrain

  • This alert, PostgresSplitBrain, detects a split-brain scenario in a PostgreSQL cluster managed by Patroni. It validates that each Patroni cluster has exactly one primary node, i.e. only one node per cluster accepting read/write (R/W) requests.
  • An incorrect Consul or Patroni configuration can trigger this alert: incorrect health-check results can cause Patroni to wrongly promote a new leader or fail to demote the current leader.
  • Link to the metrics catalogue
  • The Prometheus expression counts the number of PostgreSQL instances in the gprd/gstg environment that are not in replica mode (pg_replication_is_replica == 0). If this count is greater than 1, the alert fires: more than one PostgreSQL instance is operating in read-write mode within a cluster. The condition must hold for at least 1 minute to trigger the alert.
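The alert expression itself is not reproduced on this page; as a rough sketch based on the description above (the exact selectors and grouping labels are assumptions, check the metrics catalogue entry for the real rule):

```promql
# Fires when a cluster reports more than one non-replica (i.e. primary) node;
# the env/type label names are assumptions based on this page's description
count by (env, type) (
  pg_replication_is_replica{env=~"gprd|gstg"} == 0
) > 1
```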
  • In both gstg and gprd environments we have 3 Patroni clusters (main, ci and registry at the time of writing), so we expect 3 primary nodes in total. Be sure to check the type field to distinguish the Patroni clusters and ensure there is only 1 primary per type. If we see more than three different types of cluster, it might suggest a PatroniConsulMultipleMaster situation.
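To see how many primaries each cluster type currently reports, a query along these lines can help (the selectors are assumptions consistent with the metric described above):

```promql
# Expected result: exactly 1 per type (main, ci, registry)
count by (type) (pg_replication_is_replica{env="gprd"} == 0)
```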
  • We can silence this alert by going here, finding PostgresSplitBrain, and clicking the Silence option.
  • This is a very rare and critical event
  • There might be a sudden spike in gitlab_schema_prevent_write errors. Link to dashboard
  • This alert might create S1 incidents.
  • Who is likely to be impacted by the cause of this alert?
    • Depending on the database it could be some or all customers. If it is the main or ci database, we expect nearly all customers to be impacted. If it is the registry database, it would be the subset of customers that depend on the registry.
  • Review Incident Severity Handbook page to identify the required Severity Level
  • To validate whether a Patroni cluster has more than one leader, a quick look at this dashboard should show which type of Patroni cluster (if any) has more than one leader.

  • Executing the query pg_replication_is_replica{type="<type of cluster with more than one primary>", env="<gstg/gprd>"} in your Grafana dashboard should show the cluster and the FQDNs of the leader nodes; note that pg_replication_is_replica is 0 for leader nodes.

  • The first step is to figure out which Patroni cluster has more than one leader; a quick look at this dashboard should show which type of Patroni cluster has more than one leader.

  • Executing the query pg_replication_is_replica{type="<type of cluster with more than one primary>", env="gprd"} in your Grafana dashboard should show the cluster and the FQDNs of the leader nodes; note that pg_replication_is_replica is 0 for leader nodes.
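As a concrete sketch of that query (the type value "main" is a placeholder, substitute the affected cluster type):

```promql
# Returns one series per node currently acting as primary for this cluster type;
# the fqdn/instance labels on the results identify the leader nodes
pg_replication_is_replica{type="main", env="gprd"} == 0
```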

  • It might be helpful to look at recent MRs to see whether any changes related to the Patroni clusters rolled out recently in the chef repo or config-mgmt.

  • Steps that can be used via the Patroni CLI to remediate, for example:

```shell
# (Untested steps: proceed with extreme caution)
# SSH to a node in the affected Patroni cluster
ssh patroni-main-v14-03-db-gprd.c.gitlab-production.internal
# Check cluster membership and which nodes claim the leader role
sudo gitlab-patronictl list
# Move the data directory of the spurious primary aside so it cannot restart
sudo mv /var/opt/gitlab/postgresql/data12 /var/opt/gitlab/postgresql/data"<find the number going into the directory>"_dontstart_see_production_"<Issue_number>"
# Fail over to re-establish a single leader
sudo gitlab-patronictl failover
```