patroni-consul-postgres-pgbouncer-interactions
Quick orientation
What are these services, and how do they work together?
Here is a brief summary of how Postgres database access is supported by Patroni, Consul, PgBouncer, and our Rails app:
- Postgres is our relational database.
- Currently we have 1 writable primary instance and several read-only replica instances of Postgres.
- The replica dbs handle read-only queries and act as failover candidates in case the primary db becomes unavailable (e.g. unreachable, unresponsive) or is at risk of split-brain (e.g. it cannot see or update the Patroni cluster state).
- Patroni coordinates failover when the primary Postgres instance becomes unavailable.
- Patroni stores its cluster state in Consul.
- A Patroni agent runs on each Postgres host, monitoring that local Postgres instance and interacting with Consul to check cluster state and publish its own state.
- Database clients (Rails, Sidekiq, etc.) discover Postgres through Consul service discovery and access it through PgBouncer (a connection pooler).
- PgBouncer is a connection pooling proxy in front of Postgres. Having thousands of clients connected directly to Postgres causes significant performance overhead. To avoid this penalty, PgBouncer dynamically maps thousands of client connections to a few hundred db sessions.
- Consul advertises each PgBouncer instance as a proxy for a corresponding Postgres instance.
- Consul keeps the DNS records for the primary and replica dbs up to date; the Consul agent on each db node runs periodic service health checks against the local Patroni API.
- Because PgBouncer itself is single-threaded and CPU-bound, it can use at most 1 vCPU, so we run multiple PgBouncer instances in front of each Postgres instance to avoid CPU starvation.
- Database clients discover the IP+port of the available PgBouncer instances by sending DNS queries to a local Consul agent.
- Currently we use dedicated PgBouncer VMs for accessing the primary db, rather than local PgBouncer processes on the db host (as we do for the replica dbs).
- The primary db gets more traffic than any one replica db.
- PgBouncer appears to be more CPU-efficient on dedicated VMs than when running on the primary db host. We have a couple untested hypotheses as to why.
- The primary db’s PgBouncer VMs share a virtual IP address (a Google TCP Internal Load Balancer VIP). That ILB VIP is what Consul advertises to database clients as the primary db IP.
See here and here for more details on the purpose, behaviors, and interactions of each service.
Diagrams
Path from Rails app to primary db
graph LR
  subgraph "Web Host"
    Rails
  end
  subgraph "Network forwarding rule"
    ILB["Load balancer virtual IP"]
  end
  subgraph "PgBouncer Host A"
    PgBouncer1
  end
  subgraph "PgBouncer Host B"
    PgBouncer2
  end
  subgraph "DB Host"
    Postgres["Postgres (read/write primary mode)"]
  end
  Rails --> ILB
  ILB --> PgBouncer1
  ILB --> PgBouncer2
  PgBouncer1 --> Postgres
  PgBouncer2 --> Postgres
Rails is configured to connect to the primary db using a static host/port pointing to the ILB:
$ egrep 'host:|port:' /srv/gitlab/config/database.yml
host: "pgbouncer.int.gprd.gitlab.net"
port: 6432
port: 8600
host: "pgbouncer-ci.int.gprd.gitlab.net"
port: 6432
port: 8600
$ dig +short pgbouncer.int.gprd.gitlab.net
10.217.4.5
$ dig +short pgbouncer-ci.int.gprd.gitlab.net
10.217.16.5
Note that even though this is an internal-only IP address, that DNS A
record is served by our public DNS provider:
$ dig +noall +authority -t SOA pgbouncer.int.gprd.gitlab.net
gitlab.net. 1800 IN SOA arya.ns.cloudflare.com. dns.cloudflare.com. 2276495604 10000 2400 604800 3600
The ILB (a Google Internal TCP/UDP Load Balancer) is just a set of network configurations, not an in-line proxy for the traffic. Its forwarding rule routes each new connection to one of the registered backends (in this case, the dedicated PgBouncer hosts). For details on how to inspect the ILB's components, see the section Internal Loadbalancer (ILB).
The dedicated PgBouncer hosts proxy traffic to Postgres. They discover from Consul which Patroni node is currently the primary db.
$ sudo cat /var/opt/gitlab/pgbouncer/databases.ini | grep 'host='
gitlabhq_production = host=master.patroni.service.consul port=5432 pool_size=75 auth_user=pgbouncer
gitlabhq_production_sidekiq = host=master.patroni.service.consul port=5432 pool_size=50 auth_user=pgbouncer dbname=gitlabhq_production
$ dig +short master.patroni.service.consul
10.220.16.106
$ dig +short master.patroni.service.consul | xargs -i dig +short -x {}
patroni-06-db-gprd.c.gitlab-production.internal.
Note that PgBouncer uses the default system DNS resolver, not Consul directly. We run dnsmasq
on these hosts with a configuration to recurse to the Consul agent’s DNS port (8600), so that PgBouncer’s service discovery queries work properly.
$ cat /etc/dnsmasq.d/10-consul
server=/consul/127.0.0.1#8600
cache-size=0
PgBouncer frequently rechecks this DNS query to Consul, so it can respond promptly if a Patroni failover occurs.
$ sudo cat /var/opt/gitlab/pgbouncer/pgbouncer.ini | grep 'dns_.*_ttl'
dns_max_ttl = 2.0
dns_nxdomain_ttl = 15.0
Path from Rails app to replica db
graph LR
  subgraph "Web Host"
    Rails
  end
  subgraph "DB Host"
    PgBouncer1
    PgBouncer2
    PgBouncer3
    Postgres["Postgres (read-only replica mode)"]
  end
  Rails --> PgBouncer1
  Rails --> PgBouncer2
  Rails --> PgBouncer3
  PgBouncer1 --> Postgres
  PgBouncer2 --> Postgres
  PgBouncer3 --> Postgres
Rails is configured to connect to the read-only replica dbs using Consul service discovery. Rails sends a DNS query to the local Consul agent to get a set of DNS SRV (service) records. This provides a list of IP and port combinations to reach all available PgBouncer instances.
$ grep -A6 'load_balancing:' /srv/gitlab/config/database.yml
load_balancing:
  discover:
    nameserver: gke-gprd-us-east1-d-web-0-f4bee7c1-xwwh
    record: "db-replica.service.consul."
    record_type: "SRV"
    port: 8600
    use_tcp: true
--
load_balancing:
  discover:
    nameserver: gke-gprd-us-east1-d-web-0-f4bee7c1-xwwh
    record: "ci-db-replica.service.consul."
    record_type: "SRV"
    port: 8600
    use_tcp: true
$ dig +short -t SRV -p 8600 @127.0.0.1 db-replica.service.consul. | wc -l
27
$ dig +short -t SRV -p 8600 @127.0.0.1 db-replica.service.consul. | sort -V -k4 -k3
1 1 6432 patroni-02-db-gprd.node.east-us-2.consul.
1 1 6433 patroni-02-db-gprd.node.east-us-2.consul.
1 1 6434 patroni-02-db-gprd.node.east-us-2.consul.
1 1 6432 patroni-03-db-gprd.node.east-us-2.consul.
1 1 6433 patroni-03-db-gprd.node.east-us-2.consul.
1 1 6434 patroni-03-db-gprd.node.east-us-2.consul.
...
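For illustration, the same SRV lookup that Rails' load balancer performs can be reproduced in a few lines of Python. This is only a sketch: it assumes the third-party dnspython package is installed and that a local Consul agent is listening on its DNS port (8600).

```python
# Sketch: discover replica PgBouncer endpoints via the local Consul agent's DNS port,
# mirroring the "load_balancing: discover:" settings shown above. Assumes dnspython.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["127.0.0.1"]   # local Consul agent
resolver.port = 8600                   # Consul DNS port

answer = resolver.resolve("db-replica.service.consul.", "SRV", tcp=True)
for srv in sorted(answer, key=lambda r: (str(r.target), r.port)):
    # srv.target is the Patroni node name; srv.port is one of the PgBouncer ports (6432-6434)
    print(srv.target.to_text(), srv.port)
```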
Normal pattern of REST calls from Patroni to Consul
Patroni agent’s “loop” periodically runs a sequence of calls to Consul. Each of these REST calls from Patroni agent to Consul agent results in a single corresponding RPC call from Consul agent to Consul server:
sequenceDiagram
  participant PatroniAgent
  participant ConsulAgent
  participant ConsulServer
  PatroniAgent->>+ConsulAgent: REST call
  ConsulAgent->>+ConsulServer: RPC call
  ConsulServer->>-ConsulAgent: RPC response
  ConsulAgent->>-PatroniAgent: REST response
The normal healthy sequence of periodic calls is slightly different for the Patroni leader versus the Patroni replicas.
Patroni leader:
REST call | REST response | Purpose |
---|---|---|
GET /v1/kv/service/[CLUSTER_NAME]/?recurse=1 | All Patroni-related objects as JSON | Get complete state of Patroni cluster |
PUT /v1/session/renew/[SESSION_ID] | Session object as JSON | Renew TTL of this Patroni node’s consul session |
PUT /v1/kv/service/[CLUSTER_NAME]/optime/leader | Leader object as JSON | Write latest Postgres WAL location of primary db as Patroni leader optime |
PUT /v1/kv/service/[CLUSTER_NAME]/members/[HOSTNAME]?acquire=[SESSION_ID] | String “true” confirming success | Publish current state of this Patroni node |
Patroni replica:
REST call | REST response | Purpose |
---|---|---|
GET /v1/kv/service/[CLUSTER_NAME]/?recurse=1 | All Patroni-related objects as JSON | Get complete state of Patroni cluster |
PUT /v1/session/renew/[SESSION_ID] | Session object as JSON | Renew TTL of this Patroni node’s consul session |
PUT /v1/kv/service/[CLUSTER_NAME]/members/[HOSTNAME]?acquire=[SESSION_ID] | String “true” confirming success | Publish current state of this Patroni node |
GET /v1/kv/service/[CLUSTER_NAME]/leader?wait=[DURATION]&index=[CURRENT_VALUE] | Leader object as JSON | Sleep for the remainder of loop_wait seconds, unless Patroni leader fails |
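The last call in the replica table is a Consul "blocking query": it returns early if the leader key changes (e.g. the lock is released), otherwise it returns when the wait duration elapses. Here is an illustrative sketch of that call; it assumes the Python requests package, and the index and wait values are hypothetical.

```python
# Sketch of the replica's long-poll watch on the leader key (a Consul blocking query).
import requests

CONSUL = "http://127.0.0.1:8500"
CLUSTER = "pg-ha-cluster"          # [CLUSTER_NAME] in the tables above
last_index = 44065582              # X-Consul-Index returned by the previous read

resp = requests.get(
    f"{CONSUL}/v1/kv/service/{CLUSTER}/leader",
    params={"wait": "9.5s", "index": last_index},   # long-poll for roughly loop_wait seconds
    timeout=15,
)
if resp.status_code == 200:
    leader = resp.json()[0]
    print("leader key still present, held by session:", leader.get("Session"))
    last_index = int(resp.headers["X-Consul-Index"])
else:
    # A missing key means no Patroni node holds the cluster lock: an election is underway.
    print("leader key missing (status", resp.status_code, ")")
```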
These calls can be directly observed on a live Patroni host through a brief packet capture of:
- the Consul agent’s REST port (8500) on the loopback network interface
- the Consul server’s RPC port (8300) on the default network interface
The HTTP REST calls from Patroni agent to Consul agent over the loopback interface are unencrypted and easy to interpret. The RPC calls from Consul agent to Consul server are encrypted, but we can observe the close timing of those calls relative to the REST request/response.
sudo timeout 30 tcpdump -v -w sample_consul_rest_and_rpc_calls.pcap -c 1000 -i any 'port 8500 or 8300'
Download the above PCAP file to your workstation, and examine it with Wireshark (or its command-line version, tshark
). Remember to decode TCP port 8500 as HTTP, as shown below.
wireshark -r sample_consul_rest_and_rpc_calls.pcap -d 'tcp.port==8500,http'
tshark -r sample_consul_rest_and_rpc_calls.pcap -d 'tcp.port==8500,http' -Y 'http.request.uri' -T fields -e frame.time_relative -e http.request.method -e http.request.uri | head -n12
Typical output for primary db:
0.000000000   GET   /v1/kv/service/pg-ha-cluster/?recurse=1
0.006017000   PUT   /v1/session/renew/c255bb15-5cc3-b9ba-be46-0522fb9f883f
0.008736000   PUT   /v1/kv/service/pg-ha-cluster/optime/leader
0.019531000   PUT   /v1/kv/service/pg-ha-cluster/members/patroni-06-db-gprd.c.gitlab-production.internal?acquire=c255bb15-5cc3-b9ba-be46-0522fb9f883f
10.000356000  GET   /v1/kv/service/pg-ha-cluster/?recurse=1
10.006160000  PUT   /v1/session/renew/c255bb15-5cc3-b9ba-be46-0522fb9f883f
10.009424000  PUT   /v1/kv/service/pg-ha-cluster/optime/leader
10.020799000  PUT   /v1/kv/service/pg-ha-cluster/members/patroni-06-db-gprd.c.gitlab-production.internal?acquire=c255bb15-5cc3-b9ba-be46-0522fb9f883f
20.000465000  GET   /v1/kv/service/pg-ha-cluster/?recurse=1
20.006820000  PUT   /v1/session/renew/c255bb15-5cc3-b9ba-be46-0522fb9f883f
20.009269000  PUT   /v1/kv/service/pg-ha-cluster/optime/leader
20.020930000  PUT   /v1/kv/service/pg-ha-cluster/members/patroni-06-db-gprd.c.gitlab-production.internal?acquire=c255bb15-5cc3-b9ba-be46-0522fb9f883f
Typical output on replica db:
0.064859000   GET   /v1/kv/service/pg-ha-cluster/?recurse=1
0.070741000   PUT   /v1/session/renew/d93acdd6-51de-3bce-5014-f1d4a2cd43cf
0.078500000   PUT   /v1/kv/service/pg-ha-cluster/members/patroni-03-db-gprd.c.gitlab-production.internal?acquire=d93acdd6-51de-3bce-5014-f1d4a2cd43cf
0.088481000   GET   /v1/kv/service/pg-ha-cluster/leader?wait=9.711172819137573s&index=44065582
10.232919000  GET   /v1/kv/service/pg-ha-cluster/?recurse=1
10.236985000  PUT   /v1/session/renew/d93acdd6-51de-3bce-5014-f1d4a2cd43cf
10.239178000  PUT   /v1/kv/service/pg-ha-cluster/members/patroni-03-db-gprd.c.gitlab-production.internal?acquire=d93acdd6-51de-3bce-5014-f1d4a2cd43cf
10.255033000  GET   /v1/kv/service/pg-ha-cluster/leader?wait=9.544909477233887s&index=44065582
20.327716000  GET   /v1/kv/service/pg-ha-cluster/?recurse=1
20.332099000  PUT   /v1/session/renew/d93acdd6-51de-3bce-5014-f1d4a2cd43cf
20.334070000  PUT   /v1/kv/service/pg-ha-cluster/members/patroni-03-db-gprd.c.gitlab-production.internal?acquire=d93acdd6-51de-3bce-5014-f1d4a2cd43cf
20.347014000  GET   /v1/kv/service/pg-ha-cluster/leader?wait=9.452655792236328s&index=44065582
Where do the components run?
Service/component | Chef role | Hostname pattern | Port (Protocol) |
---|---|---|---|
Postgres | gprd-base-db-patroni | patroni-{01..NN}-db-gprd | 5432 (Pgsql) |
PgBouncer for primary db | gprd-base-db-pgbouncer | pgbouncer-{01..NN}-db-gprd | 6432 (Pgsql) |
PgBouncer for replica dbs | Same as Postgres | Same as Postgres | 6432-6434 (Pgsql) |
Patroni agent | Same as Postgres | Same as Postgres | 8009 (REST) |
Consul agent | gprd-base (recipe gitlab_consul::agent ) | Nearly all Chef-managed VMs | 8600 (DNS), 8500 (REST), 8301 (Serf LAN) |
Consul server | gprd-infra-consul | consul-{01..NN}-inf-gprd | 8600 (DNS), 8500 (REST), 8301 (Serf LAN), 8302 (Serf WAN), 8300 (RPC) |
In addition to the above Chef-managed services, we use a Google TCP Internal Loadbalancer (ILB) to provide a single virtual IP address for the pool of PgBouncer instances for the primary db. This allows clients to treat a pool of PgBouncers as a single endpoint.
Notes about the ILB:
- A Google TCP/UDP Internal Loadbalancer (ILB) is not an inline device in the network path.
- Instead, ILB is part of the control plane of the software defined network within a single geographic region.
- All backends (i.e. PgBouncer VMs) share the IP address of the ILB’s forwarding rule, and within the VPC network, each TCP/UDP connection is routed to one of those backends.
- Backend instances contact the metadata server (metadata.google.internal) to generate local routes to accept traffic for the ILB’s IP address.
Links to external docs
- Postgres docs (remember to choose the appropriate version)
- PgBouncer docs
- Patroni:
- Consul:
- Google TCP/UDP Internal Loadbalancer (ILB)
Quick reference commands
List of tools for working with these services
- gitlab-psql: Wrapper for the generic psql, authenticating and connecting to the local Postgres instance as a superuser.
- gitlab-patronictl: Wrapper for the generic patronictl, authenticating and connecting to the local Patroni agent.
- pgb-console, pgb-console-1, pgb-console-2: Wrappers for the generic psql, authenticating and connecting to each of the local PgBouncer instances on their distinctive ports. This allows running PgBouncer's Admin Console commands like SHOW HELP, SHOW STATS, RELOAD, etc.
- consul: Consul CLI tool, for querying or modifying Consul from any host running a Consul agent.
PgBouncer
All Patroni hosts are running local PgBouncer instances, including whichever host is currently the Patroni leader and primary Postgres db. The PgBouncers on the primary db are idle, since we route its traffic through the dedicated PgBouncer hosts.
The same pgb-console
script exists on the dedicated PgBouncer hosts used for routing traffic to the primary db.
Note: The mechanism for taking a PgBouncer out of service for maintenance is different for dedicated PgBouncer hosts than for a PgBouncer instance on a Patroni host.
List the PgBouncer processes.
pgrep -a -f '/usr/local/bin/pgbouncer'
pgrep -f '/usr/local/bin/pgbouncer' | xargs -r ps uwf
Each PgBouncer runs on a different TCP port and has its own script to connect to its Admin Console.
$ ls -1 /usr/local/bin/pgb-console*
/usr/local/bin/pgb-console
/usr/local/bin/pgb-console-1
/usr/local/bin/pgb-console-2
Connect to the Admin Console for each of the PgBouncer instances. This puts you in an interactive psql session, where you can run PgBouncer commands like SHOW HELP, SHOW POOLS, SHOW CLIENTS, SHOW SERVERS, etc.
Note: You can also use any psql options on the command-line or psql meta-commands in the interactive shell (e.g. \pset pager off, \x).
sudo pgb-console
sudo pgb-console-1
sudo pgb-console-2
In an interactive session, show the PgBouncer configuration (conffile, listen_port, max_client_conn, client_idle_timeout, server_lifetime, server_reset_query, etc.).
pgbouncer=# SHOW CONFIG ;
For each PgBouncer instance, show a summary of each connection pool, including:
- cl_active: number of clients that are linked to a db server connection and can process queries
- cl_waiting: number of clients who have sent queries but are waiting for a db server connection to become available
- sv_active: number of db server connections linked to a client
- sv_idle: number of db server connections that are unused and immediately available for clients
- sv_used (unintuitive): number of db server connections that have been idle for more than server_check_delay, so they need server_check_query to run on them before they can be used again
for console in /usr/local/bin/pgb-console* ; do sudo $console -c 'SHOW POOLS' | cat ; done
For each PgBouncer instance, list the configured databases and their pool size limits.
for console in /usr/local/bin/pgb-console* ; do sudo $console -c 'SHOW DATABASES' | cat ; done
List connections from each PgBouncer instance to the local Postgres database instance.
for console in /usr/local/bin/pgb-console* ; do sudo $console -c 'SHOW SERVERS' | cat ; done
List connections from clients to each PgBouncer instance.
for console in /usr/local/bin/pgb-console* ; do sudo $console -c 'SHOW CLIENTS' | cat ; done
Consul CLI
Consul queries and operations can mostly be done via either the consul CLI tool or the consul agent's HTTP REST interface (TCP port 8500).
All of these Consul queries can be run from any host running a Consul agent in the environment you want to inspect (e.g. gprd, gstg, etc.).
It does not have to be a Patroni host, because all Consul agents participating in the same gossip membership list can also make RPC calls to the Consul servers.
Get help with the consul
CLI tool’s commands and subcommands.
consul help
consul operator --help
consul kv --help
consul kv get --help
Summarize generic status info for the local host’s Consul agent.
consul info
Tail the logs from the consul agent. Ctrl-C to stop following.
consul monitor
Show the list of Consul servers, and indicate which one is currently the Consul leader.
consul operator raft list-peers
Show all KV records, including their metadata and base64-encoded values.
consul kv export
List all key names in the KV store.
consul kv export | jq '.[].key'
Show the values of all keys having prefix “service/pg-ha-cluster”. Output values are automatically decoded from base64.
consul kv get -recurse 'service/pg-ha-cluster'
Show a single KV record’s value, with and without metadata. Many of Patroni’s KV record values are JSON, so piping the raw value to jq
is sometimes more readable.
consul kv get -detailed 'service/pg-ha-cluster/config'
consul kv get 'service/pg-ha-cluster/config'
consul kv get 'service/pg-ha-cluster/config' | jq
Show the latest Patroni state self-published by each Patroni node through its Consul agent.
consul kv get -recurse -keys 'service/pg-ha-cluster/members/' | xargs -i consul kv get {} | jq -S
List all nodes running a Consul agent in this environment.
Note: Currently we run all Consul agents as though they were in the same datacenter (i.e. the same “Serf LAN”). Consul expects each “datacenter” to have its own set of Consul servers, with loose coupling between datacenters. But the Consul servers in each datacenter would maintain state separately. If we used “Serf WAN” to connect multiple regions, this command’s output would only include Consul agents in this host’s region (i.e. the Consul agents bound to this region’s set of Consul servers).
consul members
consul catalog nodes
List all service names registered to Consul.
consul catalog services
List all nodes associated with a specific registered service. (Service names are listed by consul catalog services.)
consul catalog nodes -service=patroni
consul catalog nodes -service=db-replica
List all services registered by a specific node. (Node names are listed by consul catalog nodes.)
consul catalog services -node=patroni-01-db-gprd
Put the local Consul agent (or a specific service it provides) into maintenance mode. This is comparable to failing a health check for the node or service.
Warning: This state persists across agent restarts. If maint mode is enabled, it must later be manually disabled.
consul maint -help
consul maint
consul maint -enable -reason 'Optional comment explaining why this node is being taken down'
consul maint -disable
Consul REST API: Commands to inspect/explore Patroni’s state stored in Consul’s key-value (KV) store
Consul queries and operations can mostly be done via either the consul CLI tool or the consul agent's HTTP REST interface (TCP port 8500).
All of these Consul queries can be run from any host running a Consul agent in the environment you want to inspect (e.g. gprd, gstg, etc.).
It does not have to be a Patroni host, because all Consul agents participating in the same gossip membership list can also make RPC calls to the Consul servers.
Patroni uses the HTTP REST interface to interact with Consul agent. To see these calls, run a packet capture of TCP port 8500 on the loopback interface.
Who is the current Patroni leader for the Patroni cluster named “pg-ha-cluster”?
Note: The response's Session field acts as a mutex, indicating the session id of the Consul agent running on the current Patroni leader. The Session key is deleted if the cluster lock is voluntarily released.
$ curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/leader | jq .
[
  {
    "LockIndex": 1,
    "Key": "service/pg-ha-cluster/leader",
    "Flags": 0,
    "Value": "cGF0cm9uaS0xMS1kYi1ncHJkLmMuZ2l0bGFiLXByb2R1Y3Rpb24uaW50ZXJuYWw=",
    "Session": "ee43c2cf-5b93-08b7-6900-1cf55c9e83b3",
    "CreateIndex": 34165794,
    "ModifyIndex": 34165794
  }
]
Show all the state data stored in Consul for this Patroni cluster.
Note: This is the same REST call that's run periodically by Patroni's get_cluster method. The Value field is always base-64 encoded. The decoded values are typically either JSON or plain strings.
curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/?recurse=1 | jq .
List just the Key
field of the Patroni cluster’s Consul KV keys.
curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/?recurse=1 | jq -S '.[].Key'
Extract and decode the value of one of the above records.
curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/leader | jq -r '.[].Value' | base64 -d ; echo
Or use the consul
CLI tool, and let it do the base-64 decoding for you.
consul kv get -detailed service/pg-ha-cluster/leader
Does this host’s Consul agent have a Consul session? If so, show its full details.
Notes:
- This shows the complete definition of a “consul session”.
- Each Patroni agent has its own session.
- A consul agent can use its session ID as an advisory lock (mutex) on any consul KV record. A session can claim exclusive ownership of that record by setting the record's Session attribute with its own ID value. When the session is invalidated/expired, the lock is automatically released.
- This locking mechanism is how Patroni uses Consul to ensure that only one node is the Patroni leader (represented by the consul KV record "service/[cluster_name]/leader").
$ curl -s http://127.0.0.1:8500/v1/session/node/$( hostname -s ) | jq .
[
  {
    "ID": "0e9e66a5-d17a-3543-e389-209c90731209",
    "Name": "pg-ha-cluster-patroni-09-db-gprd.c.gitlab-production.internal",
    "Node": "patroni-09-db-gprd",
    "Checks": [
      "serfHealth"
    ],
    "LockDelay": 1000000,
    "Behavior": "delete",
    "TTL": "15.0s",
    "CreateIndex": 32308118,
    "ModifyIndex": 32308118
  }
]
List the consul session id for each Patroni agent in this Patroni cluster (pg-ha-cluster).
Notes:
- Failing the health check or expiring the TTL invalidates the session. If that happens to the Patroni leader, it loses the cluster-lock, causing a failover.
- The TTL reported here is always half the value specified in the Patroni config, because Patroni divides that configured value by 2 before setting it in Consul.
curl -s http://127.0.0.1:8500/v1/session/list | jq -c '.[] | { ID, Name, TTL, Checks }' | grep 'pg-ha-cluster' | sort
Show which of the above session ids holds the lock as the Patroni leader.
Again, the existence of the Session field on this Consul record acts as the mutex. If the session is invalidated (expires, fails a health check, or is deleted), then the service/pg-ha-cluster/leader key is unlocked, meaning no Patroni node holds the Patroni "cluster lock", causing Patroni to start its leader election process.
$ curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/leader | jq -r '.[].Value' | base64 -d ; echo
Or
$ consul kv get service/pg-ha-cluster/leader
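To tie the pieces above together, here is a minimal sketch of the session-plus-acquire mechanism that Patroni's leader lock relies on. It uses a hypothetical key (demo/leader) and the Python requests package; it is not Patroni's code, and you should not run it against the real leader key.

```python
# Sketch of Consul's session-as-mutex mechanism (create session, acquire key, release).
import requests

CONSUL = "http://127.0.0.1:8500"

# 1. Create a session with a TTL; it stays valid only while it keeps being renewed
#    (and, by default, while the node's serfHealth check passes).
session_id = requests.put(
    f"{CONSUL}/v1/session/create",
    json={"Name": "demo-lock", "TTL": "15s", "Behavior": "delete"},
).json()["ID"]

# 2. Try to take the lock: the PUT returns true only if no other session currently
#    owns the key -- the same mutex semantics Patroni uses for its leader key.
acquired = requests.put(
    f"{CONSUL}/v1/kv/demo/leader",            # hypothetical key, NOT the real leader key
    params={"acquire": session_id},
    data="my-hostname",
).json()
print("lock acquired:", acquired)

# 3. Destroying the session (or letting its TTL lapse) releases the lock automatically.
requests.put(f"{CONSUL}/v1/session/destroy/{session_id}")
```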
Internal Loadbalancer (ILB)
Unlike other components, the ILB is not a service/process/instance. It is just a set of network routing rules, with no inline host or device acting as a proxy. It is purely configuration in the network control plane.
A Google Internal TCP/UDP Load Balancer consists of the following components:
- Forwarding rule, which owns the IP address of the load balancer (shared by all backends for routing purposes)
- Backend Service, which contains instance-groups and/or instances (i.e. pool members)
- Health check, for probing each backend instance
Here are gcloud
commands for inspecting the above components of the ILB.
Navigating the GCP Console web UI is also intuitive.
Find and show the internal loadbalancer’s forwarding rule.
gcloud --project='gitlab-production' compute forwarding-rules list | egrep 'NAME|pgbouncer'
gcloud --project='gitlab-production' compute forwarding-rules describe --region=us-east1 gprd-gcp-tcp-lb-internal-pgbouncer
Find and show the internal loadbalancer’s target backend service.
gcloud --project='gitlab-production' compute backend-services list --filter="name~'pgbouncer'"
gcloud --project='gitlab-production' compute backend-services describe gprd-pgbouncer-regional --region='us-east1'
Find and show the internal loadbalancer’s health check.
gcloud --project='gitlab-production' compute health-checks list | egrep 'NAME|pgbouncer'
gcloud --project='gitlab-production' compute health-checks describe gprd-pgbouncer-http
Show the latest results of health-checks against each of the backend-service’s backends (i.e. the zone-specific instance-groups).
gcloud --project='gitlab-production' compute backend-services get-health gprd-pgbouncer-regional --region='us-east1'
Show that backend service’s instance groups.
gcloud --project='gitlab-production' compute instance-groups list | egrep 'NAME|pgbouncer'
Describe those instance-groups.
( for ZONE in us-east1-{b,c,d} ; do INSTANCE_GROUP="gprd-pgbouncer-${ZONE}" ; echo -e "\nInstance-group: ${INSTANCE_GROUP}" ; gcloud --project='gitlab-production' compute instance-groups describe "${INSTANCE_GROUP}" --zone="${ZONE}" ; done )
List the instances in those instance-groups.
( for ZONE in us-east1-{b,c,d} ; do INSTANCE_GROUP="gprd-pgbouncer-${ZONE}" ; echo -e "\nInstance-group: ${INSTANCE_GROUP}" ; gcloud --project='gitlab-production' compute instance-groups list-instances "${INSTANCE_GROUP}" --zone="${ZONE}" ; done )
Background details
Purpose of each service
- Patroni provides cluster management for Postgres, automating failover of the primary db and reconfiguring replica dbs to follow the new primary db's transaction stream.
- Consul provides shared state and lock management to Patroni. It also provides DNS-based service discovery, so database clients can learn when Patroni nodes fail or change roles.
- PgBouncer provides connection pooling for Postgres. We have two separate styles of use for PgBouncer:
- Access to the primary db transits a dedicated pool of PgBouncer hosts, which discover the current Patroni leader (i.e. the primary db) by querying Consul's DNS record master.patroni.service.consul. In turn, database clients access that pool of PgBouncer hosts through a single IP address that is load balanced among the pool members by a Google Internal TCP Loadbalancer. That load-balanced IP address is published as the DNS A record pgbouncer.int.gprd.gitlab.net.
- Access to each of the replica dbs transits one of the PgBouncer instances running locally on each Patroni host. Database clients discover the list of available replica db PgBouncer instances by querying Consul's DNS SRV records for db-replica.service.consul. For historical reasons (prior to running multiple PgBouncers per replica db), Consul also publishes DNS A records for replica.patroni.service.consul pointing to the 1st PgBouncer instance (the one bound to port 6432).
Normal healthy interactions between these services
- Patroni uses Consul mainly as a lock manager and as a key-value datastore to hold Patroni cluster metadata.
- Each Patroni agent must regularly:
- Fetch the Patroni cluster’s state from Consul.
- Publish its own local node state to Consul.
- Renew its Consul session lock.
- If the Patroni leader fails to renew its session lock before the lock’s TTL expires, the other Patroni nodes will elect a new leader and trigger a Patroni failover.
- Each Patroni agent must regularly:
- Patroni agent makes REST calls to the local Consul agent.
- To build a response, the Consul agent makes RPC calls to a Consul server, which is where state data is stored. Consul agents do not locally cache KV data.
- Our Consul topology has 5 Consul servers and several hundred Consul agents.
- Consul servers:
- The Consul servers each store state locally, but only 1 (the Consul leader) accepts write requests. Typically the Consul leader also handles read requests.
- The non-leader Consul servers exist for redundancy and durability, as each Consul server stores a local copy of the event log.
- Consul’s CAP bias is to prefer consistency over availability, but with 5 servers, Consul can lose 2 and remain available.
- Unreliable network connectivity can trigger a Consul leader election, which causes Consul to be unavailable until connectivity recovers to the point that a quorum is able to elect a new leader. Such a Consul outage can in turn potentially cause Patroni to demote its leader and wait for Consul to become available so Patroni can elect a new leader.
- Consul agents:
- Every Consul agent participates in a gossip protocol (Serf) that natively provides a distributed node failure detection mechanism.
- Whenever one Consul agent probes another and fails to elicit a prompt response, the probing node announces via gossip that it suspects the buddy it just probed may be down. That gossip message quickly propagates to all other Consul agents.
- Every agent locally logs this event, so we could look on any other host running Consul agent to see roughly when the event occurred.
- This has proven to be a reliable detector for intermittent partial network outages in GCP’s infrastructure.
- Consul servers:
- We also use Consul to publish via DNS which Patroni node is the leader (i.e. the primary db) and which are healthy replica dbs.
- Database clients (e.g. Rails app instances) periodically query their host’s local Consul agent via DNS to discover the list of available databases. This list actually refers to PgBouncer instances, which proxy to Postgres itself.
- If Consul’s list of available dbs changes, our Rails app updates its internal database connection pool accordingly.
- Patroni does not update the DNS records (services in Consul) by calling Consul. Instead, the Consul agent on each db host periodically runs health checks against its local Patroni instance to check whether it is a primary or replica node, requesting http://0.0.0.0:8009/master and http://0.0.0.0:8009/replica every 2s to check the status of each service (a rough sketch follows this list).
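The real checks are defined in the Consul agent's service configuration on each db host, not in application code. Purely as an illustration of what those checks do, here is a rough Python rendering (the endpoint paths come from the text above; the 200/503 behavior is standard for the Patroni REST API).

```python
# Sketch: what each Consul agent's health check effectively does every ~2 seconds.
import requests

PATRONI_API = "http://127.0.0.1:8009"   # Patroni agent's REST port on a db host

for role in ("master", "replica"):
    try:
        status = requests.get(f"{PATRONI_API}/{role}", timeout=2).status_code
    except requests.RequestException:
        status = None
    # Patroni answers 200 when the local Postgres currently has that role, 503 otherwise;
    # Consul maps a 200 to a passing check on the corresponding service's DNS record.
    print(role, "passing" if status == 200 else "failing")
```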
Details of how Patroni uses Consul
What is Patroni’s purpose?
Patroni’s job is to provide high availability for Postgres by automatically detecting node failure, promoting a replica to become the new writable primary, and coordinating the switchover for all other replicas to start following transactions from the new primary once they reach the point of divergence between the old and new primary’s timelines.
What is Consul’s purpose?
Consul has 2 primary jobs:
- Store Patroni’s state data in a highly-available distributed datastore.
- Advertise via DNS to database clients how to connect to a primary or replica database.
- This is kept up to date by health checks run periodically from the Consul agents on the Patroni hosts.
How and why does Patroni interact with Consul as a datastore?
To accomplish its job, Patroni needs to maintain a strongly consistent and highly available representation of the state of all the Patroni cluster’s Postgres instances. It delegates the durable storage of that state data to an external storage service — what it calls its Distributed Configuration Store (DCS). Patroni supports several options for this DCS (Consul, Zookeeper, Etcd, and others); in our case, we chose to use Consul.
Patroni stores several kinds of data in its DCS (Consul), such as:
- Who is the current cluster leader? (consul key service/pg-ha-cluster/leader)
- Patroni config settings from the dcs stanza of patroni.yml. (consul key service/pg-ha-cluster/config)
- Each Patroni node periodically self-describes its status (xlog_location, timeline, role, etc.). (consul keys service/pg-ha-cluster/members/[hostname])
- Other ancillary data, including a history of past failover events, metadata about an in-progress failover, whether or not failover is currently paused, etc.
- While Patroni updates the KV records for Patroni state by calling Consul APIs, it does not directly call Consul service APIs to update service status (and hence DNS records). Instead, the Consul agent requests http://0.0.0.0:8009/master and http://0.0.0.0:8009/replica every 2s to check the status of each service using the Patroni API.
Warning: If you need to manually PAUSE Patroni (i.e. prevent failover even if the primary starts failing health checks), a Chef run on any Patroni node will revert that pause. Chef tells the Patroni agent to force the dcs settings in patroni.yml to overwrite any conflicting settings stored in Consul, and that scope unfortunately includes the consul key used for pausing failovers. So to pause Patroni (e.g. for maintenance), we must first stop Chef on all Patroni hosts.
The Consul agent does not locally cache any of the above data. Every time the Patroni agent asks the local Consul agent to read or write this data, the Consul agent must synchronously make RPC calls to a Consul server. The Patroni agent’s REST call to the Consul agent can timeout or fail if Consul agent’s RPC call to Consul server stalls or fails. (This has proven to be a common failure mode on GCP due to transient network connectivity loss.)
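As an illustration of that synchronous dependency, the sketch below reads a Patroni key through the local agent with an explicit client-side timeout. It assumes the Python requests package, and the timeout value is only indicative of Patroni's per-attempt read_timeout (a third of retry_timeout, as described later).

```python
# Sketch: a "local" REST call to the Consul agent still requires an agent->server RPC,
# so the client-side timeout has to cover that whole round trip.
import requests

try:
    resp = requests.get(
        "http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/config",
        timeout=3.3,   # illustrative per-attempt deadline
    )
    resp.raise_for_status()
    print("config ModifyIndex:", resp.json()[0]["ModifyIndex"])
except requests.exceptions.Timeout:
    # Repeated timeouts here are exactly the "troubled connectivity between Consul
    # agent and Consul server" failure mode described above.
    print("Consul agent did not answer in time (server RPC likely stalled)")
```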
What is the difference between the Patroni leader and the Consul leader?
The Patroni leader is the Patroni agent corresponding to the writable Postgres database (a.k.a. the primary Postgres db). All other Patroni nodes correspond to read-only replica Postgres databases that asynchronously replay transaction logs received from the primary Postgres database.
The Consul leader is whichever one of the Consul servers is currently accepting writes. Consul has its own internal leader-election process, independent of Patroni.
Why use one database (Consul) to manage another database (Postgres)?
Patroni uses Consul to provide high availability to Postgres through automated failover.
Postgres and Consul are both databases, but they have different strengths and weaknesses. Consul excels at storing a small amount of data, providing strong consistency guarantees while tolerating the loss of potentially multiple replicas. But Consul is not designed to handle a high write rate, and it provides just basic key-value storage. In contrast, Postgres is a much more featureful relational database and supports many concurrent writers. While Postgres natively provides replication, it does not natively provide an automated failover mechanism.
For Patroni to provide high availability (automated failover) to Postgres, it needs all Patroni agents to have a consistent view of the Patroni cluster state (i.e. who is the current leader, how stale is each replica, etc.). Patroni stores that state in its DCS (which for us is Consul), with the expectation that writes are durable and reads are strongly consistent.
How does Consul balance consistency versus availability?
Consul prefers consistency over availability. When failure conditions such as node loss or network partitions force Consul to choose between consistency and availability, Consul prefers to stop accepting writes and reads until a quorum of Consul server nodes is again reached. This avoids split-brain. To reduce the likelihood of losing quorum, Consul supports a peer group of up to 11 servers, but most production deployments use 3 or 5 (which tolerates the loss of 1 or 2 nodes respectively).
As is typical, in production we run 5 hosts as Consul servers to act as the datastore, and we run a Consul agent on every other host that needs to read or write data stored on the Consul servers.
The Consul servers participate in a strongly consistent consensus protocol (RAFT) for leader election. Only the current leader is allowed to accept writes, so that all writes are serializable. These logged writes are replicated to the other Consul servers; at least a majority (quorum) of Consul servers must receive the new log entry for the write to be considered successful (i.e. guaranteed to be present if a new leader is elected). If the current leader fails, Consul will stop accepting new writes until the surviving quorum of peers elect a new leader (which may take several seconds). Typically read requests are also handled by the Consul leader, again to provide strong consistency guarantees, but that is tunable. If a non-leader consul server receives a read request, it will forward that call to the current Consul leader.
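To make the quorum arithmetic behind "5 servers tolerate the loss of 2" concrete, here is a tiny worked example (generic RAFT majority math, not Consul-specific code):

```python
# Quorum size and fault tolerance for a RAFT peer group of n servers.
def raft_fault_tolerance(n_servers: int) -> tuple[int, int]:
    quorum = n_servers // 2 + 1          # majority needed to commit a write / elect a leader
    return quorum, n_servers - quorum    # (quorum size, servers whose loss can be tolerated)

for n in (3, 5, 7):
    quorum, tolerated = raft_fault_tolerance(n)
    print(f"{n} servers: quorum={quorum}, tolerates losing {tolerated}")
```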
Only the Consul servers participate as peers in the strongly-consistent RAFT protocol. But all Consul agents participate in a weakly-consistent gossip protocol (SERF). This supports automatic node discovery and provides distributed node failure detection.
How do Patroni’s calls to Consul let it decide when to failover?
Each Patroni agent (whether replica or primary) periodically interacts with Consul to:
- Fetch the most recently published status of its cluster peers.
- Affirm its liveness by renewing its consul session (which quickly auto-expires without these renewals).
- Publish its own current state metadata.
- If leader, publish current Postgres WAL location; otherwise, watch for any change in leadership.
If Patroni fails one of these REST calls to Consul agent, the failed call can be retried for up to its configured retry_timeout
deadline (currently 30 seconds). For the Patroni leader (i.e. the Patroni agent whose Postgres instance is currently the writable primary db), if that retry deadline is reached, Patroni will initiate failover by voluntarily releasing the Patroni cluster lock.
Similarly, if the Patroni leader’s loop takes long enough to complete that its consul session expires (TTL is currently effectively 45 seconds), then it involuntarily loses the cluster lock, which also initiates failover.
Note: Remember that Consul agent does not locally cache data; it makes synchronous remote calls to Consul server. Often when Patroni logs a timeout, it is due to troubled connectivity between Consul agent and Consul server.
See also: Normal pattern of REST calls from Patroni to Consul
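The following is a highly simplified sketch, not Patroni's actual code, of the two leader-side failure conditions just described: exhausting the retry_timeout budget on a Consul call, or letting the consul session TTL lapse. The numeric values are the current settings quoted above.

```python
# Sketch: when does the Patroni leader give up and trigger a failover?
import time

RETRY_TIMEOUT = 30   # seconds: per-operation retry budget
SESSION_TTL = 45     # seconds: effective TTL of the leader's consul session

def leader_iteration(consul_call, last_successful_renew):
    """One pass of the leader's loop, reduced to the two failure conditions above."""
    deadline = time.monotonic() + RETRY_TIMEOUT
    while True:
        try:
            consul_call()                # e.g. renew the session, write optime/leader
            break
        except IOError:
            if time.monotonic() >= deadline:
                return "release-leader-lock"    # voluntary demotion -> Patroni failover
            time.sleep(1)                       # retry until the budget is spent
    if time.monotonic() - last_successful_renew > SESSION_TTL:
        return "session-expired"                # involuntary loss of the lock -> failover
    return "still-leader"
```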
What is Consul's serfHealth check, and how can it trigger a Patroni failover?
Another way for Patroni's leader to involuntarily lose its cluster lock is if Consul's serfHealth health check fails for that host's Consul agent.
Consul uses Serf as its gossip protocol. In addition to being a medium for asynchronous eventually-consistent information sharing, Serf also provides Consul's mechanism for node failure detection. Every host running a Consul agent participates in Serf gossip traffic, including its liveness probe: serfHealth (also called SerfCheck). Each Consul agent intermittently tries to connect to a random subset of other agents, and if that attempt fails, it announces that target as being suspected of being down. The suspected-down agent has a limited window of time to actively refute that suspicion. Meanwhile, other agents will "dogpile" on health-checking the suspected-down agent, and if they concur that the agent is unresponsive, the window for refutation shortens (to reduce the time to detect a legitimate failure). After the refutation window expires, the Consul server can mark that Consul agent as failed. When the problem is resolved (e.g. agent restarted, network connectivity restored), the failed Consul agent automatically rejoins by contacting the Consul server.
If a Patroni node's Consul agent is marked by its peers as failing that serfHealth check, this immediately invalidates the Patroni agent's consul session (which, as described above, acts as a mutex). If that happens to the Patroni leader, it immediately loses the Patroni "cluster lock", triggering a Patroni failover. In contrast, if the serfHealth check fails for a non-leader Patroni node, that node is merely excluded as a candidate for leader election until it establishes a new session (which typically happens seconds after network connectivity is restored).
In summary, any host’s Consul agent can flag the Patroni leader’s Consul agent as potentially down, and if it does not promptly refute that claim, Patroni will initiate a failover — all because Consul’s serfHealth
check is part of Patroni’s contract for maintaining the validity of its consul session.
Because GCP network connectivity has proven to be intermittently unreliable, we are considering reconfiguring Patroni to ignore the serfCheck, so that its consul sessions would not be invalidated due to any one random VM in the environment having packet loss. Pros and cons are described in Issue 8050.
How does serfHealth work?
In brief: A Consul agent first sends a UDP message probe. If it gets no response, it asks a few other nodes to try the same UDP message while it concurrently tries one via TCP. If those attempts all fail, it broadcasts via gossip a suspicion that the target node is dead. If the target node does not explicitly refute that suspicion within the time limit, it is marked as failed by Consul server. For more details, continue reading.
The following notes are from a source code review of Consul 1.5.1, which includes a fork of the Serf library.
For reference, the configuration options and their defaults are defined (and well-annotated) here:
- Serf config.go, with default settings at the end
- MemberList config.go, with default settings near the end for the Serf LAN (within “datacenter”) and Serf WAN (cross-”datacenter”) scopes of communication. Most of the timeouts mentioned below are defined here.
The node failure detection behavior using Serf primitives is implemented by the probeNode method of Memberlist. Here is a walk-through of that behavior:
- Deadline for a probe is dynamic but is at least Memberlist.Config.ProbeInterval (default: 1 second).
- If the target node's current state is stateAlive (healthy), then just send a pingMsg via Gossip (UDP). Otherwise, append a suspect message to the pingMsg and send it.
- Wait for an acknowledgement response for up to Memberlist.Config.ProbeTimeout (default: 500 ms).
  - Exit successfully if the ack is received within the ProbeTimeout deadline. Else continue.
- Concurrently try 2 tactics:
  - Tactic 1: Request indirect UDP probes: Ask up to Memberlist.Config.IndirectChecks nodes (default: 3) to ping the target node on our behalf.
  - Tactic 2: Try "fallback" TCP ping: If the target node speaks protocol version 3 or higher and DisableTcpPings=false (default), then open a TCP connection and send the same pingMsg.
    - Note: The deadline variable passed to the TCP ping has roughly ProbeInterval minus ProbeTimeout time remaining (default: 1000 ms - 500 ms = 500 ms). The deadline variable was set to time.Now().Add(ProbeInterval) at the start of the probeNode method, and by this point we have already spent roughly ProbeTimeout (default: 500 ms) of that budgeted deadline.
    - Note: This is probably not a practical concern for the time scales involved, but for completeness: Both the indirect UDP pings and the TCP ping are run in goroutines, so they may not run immediately if the Golang runtime has too few threads free (e.g. on a host with very few CPUs). The TCP probe's deadline is an absolute timestamp, so any delay in scheduling a goroutine consumes part of the patience of that health check.
- Check results: Wait for the UDP-based indirect acks or nacks to arrive or time out. Then check the result of the TCP ping.
  - If any indirect UDP ping succeeds, exit successfully. If they all fail (timeout or nack), then check the result of the TCP ping.
  - If the TCP ping succeeded, log a warning and exit successfully.
    - Note: We warn if only TCP succeeds because if UDP is consistently blocked/failing, the node cannot hear most gossip and may pollute peers with its stale state via full-sync.
- If those fallback probes all failed, then:
  - Update self-awareness metrics.
  - Call method suspectNode:
    - Broadcasts via gossip the identity (name and incarnation number) of the node that failed its health check, and starts the timer for the refutation window.
    - Other nodes in the cluster receive this suspect message and also locally call suspectNode.
      - If the suspected node receives this gossip message, its local call to suspectNode notices the message refers to itself, logs a warning, and broadcasts a refutation message. If egress network traffic is working, its peers should see this broadcast and accept the refutation; otherwise, the countdown continues until the node is marked as failed.
      - The duration of the refutation window is calculated here using settings from MemberList config.go.
What happens during a Patroni leader election?
Patroni’s leader election protocol allows a period of time for all replicas to catch up on any transaction data from the old primary db that they had received but not yet applied. During that period, each Patroni node frequently updates its state (e.g. xlog replay location), so its peers know who is the freshest. At the end of the grace period, the freshest replica will be promoted to become the new primary.
Patroni then helps each replica switch to the new primary's timeline, so they have a consistent point of divergence from the old primary's timeline and can safely apply transaction logs from the new primary. If a replica fails to switch timelines for any reason, it is shut down and removed from the list of replica dbs available for handling read-only queries. (However, in this case the new primary db still needs to be manually told to stop retaining transaction logs for that dead replica.)
Meanwhile, the dedicated PgBouncer instances frequently query Consul DNS for updates, so they quickly detect that Patroni has promoted a new Postgres instance to be the primary db. PgBouncer discards any residual connections to the old primary db, opens connections to the new primary db, and starts mapping client queries to these new db connections. Clients such as Rails instances remain connected to PgBouncer throughout this process. Transactions that were in progress at the time the primary db failed (or was demoted) are aborted, and newer transactions fail until the new primary db is elected by Patroni and detected by PgBouncer.
Define the relationship between Patroni settings ttl, loop_wait, and retry_timeout
The Patroni documentation is vague about how these 3 settings are used and how to tune them:

- loop_wait: the number of seconds the loop will sleep. Default value: 10
- ttl: the TTL to acquire the leader lock. Think of it as the length of time before initiation of the automatic failover process. Default value: 30
- retry_timeout: timeout for DCS and PostgreSQL operation retries. DCS or network issues shorter than this will not cause Patroni to demote the leader. Default value: 10
The following more detailed description is based on review of the Patroni 1.6.0 source code, specifically focusing on the generic DCS code and the Consul-specific code.
- loop_wait (seconds): The minimum delay the Patroni agent waits between attempts to renew the TTL of its consul session. The consul session of the Patroni leader is effectively the leader lock; if the leader's consul session expires, the services/[cluster_name]/leader record will be deleted by the Consul server, and surviving Patroni agents will initiate a leader election.
- ttl (seconds): Effectively the TTL of the leader lock. Technically every Patroni agent has a consul session with this TTL. If the leader's consul session expires, the leader lock is released (i.e. the Consul KV record named services/[cluster_name]/leader is automatically deleted), which surviving Patroni agents detect, initiating leader election. If a non-leader Patroni node's consul session expires, it cannot publish status updates or participate in leader elections until it creates a new consul session. All Patroni agents try to renew the TTL on their consul sessions approximately every loop_wait seconds, but only the leader's session expiry can cause failover.
- retry_timeout (seconds): The cumulative max duration of all tries of any single operation (e.g. a single REST operation to Consul can be retried for up to retry_timeout seconds). A typical healthy Patroni loop makes 3-4 REST calls to Consul. If more than one is affected by slowness, the TTL could expire before Patroni issues the REST call to renew it. In the Consul-specific Patroni code, each of 3 attempts at a REST HTTP call to Consul agent gets 1/3 of this retry_timeout budget. (The Retry object's overall deadline is retry_timeout, and it wraps the HTTP client object whose read_timeout is set to 1/3 of that value.) Only the last of 3 failed attempts is logged, which is why its log message shows the HTTP timeout as being 1/3 of this configured value.

When adjusting these 3 settings, for the timeouts to work properly:

- ttl must be at least loop_wait + retry_timeout. Otherwise the leader lock will implicitly expire either between renewal attempts or before a stalled DCS operation reaches its retry_timeout. A ttl comfortably larger than this minimum tolerates multiple slow calls to Consul.
- The default settings are equivalent to: ttl = loop_wait + 2 * retry_timeout
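A quick sanity check of that rule using the default values quoted from the Patroni docs (a throwaway sketch, not Patroni code):

```python
# Tuning-rule check with the documented defaults: loop_wait=10, retry_timeout=10, ttl=30.
loop_wait, retry_timeout, ttl = 10, 10, 30

assert ttl >= loop_wait + retry_timeout, "leader lock could expire during a retried DCS call"
print("headroom beyond the minimum:", ttl - (loop_wait + retry_timeout), "seconds")  # 10
print("per-attempt HTTP read_timeout:", round(retry_timeout / 3, 2), "seconds")      # ~3.33
```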
Why is Patroni’s actual TTL half of its configured value?
Surprisingly, Patroni silently halves the configured ttl setting, because it expects Consul to silently double it.
def set_ttl(self, ttl):
    if self._client.http.set_ttl(ttl/2.0):  # Consul multiplies the TTL by 2x
        ...
Example: The default ttl
is 30 seconds, but the actual consul session’s TTL is really set to 15 seconds:
$ consul kv get service/pg-ha-cluster/config | jq -c '. | { ttl }'
{"ttl":30}
$ curl -s http://127.0.0.1:8500/v1/session/node/$( hostname -s ) | jq -c '.[] | { TTL }'
{"TTL":"15.0s"}
This hard-coded behavior may be helpful in environments that prefer a very strict TTL, typically with a very small value; in other words, environments that prefer to fail over spuriously rather than wait a little longer to see if the triggering condition was ephemeral or a false alarm. The Consul documentation also describes its TTL support as following a lazy expiry policy, but it does not claim any divide-by-two logic. Rather it states the session TTL is a lower bound (not an upper bound) for when the Consul server will actually delete/release an expired session:
When creating a session, a TTL can be specified. If the TTL interval expires without being renewed, the session has expired and an invalidation is triggered. […] The contract of a TTL is that it represents a lower bound for invalidation; that is, Consul will not expire the session before the TTL is reached, but it is allowed to delay the expiration past the TTL.
So we should be aware that whatever value we set in Patroni’s DCS ttl config, Consul is being told half of that value. Despite the heuristic testing done 2 years ago when that divide-by-two logic was added to the Patroni code, a strict interpretation of the Consul docs suggests the session could be expired as early as half the time we specify in the Patroni ttl
config.
Known failure modes
The following lists some selected illustrative failure patterns and what their symptoms look like.
What specific network paths can trigger Patroni failover if they become lossy?
See: “Concise summary of RCA” comment on issue “Why are patroni failovers occurring so often?”
Dedicated PgBouncer hosts can develop a very uneven distribution of client connections after maintenance or restart events
See: Why do the PgBouncer hosts have a very uneven distribution of client connections?