
Patroni, Consul, Postgres, and PgBouncer interactions

What are these services, and how do they work together?

Here is a brief summary of how Postgres database access is supported by Patroni, Consul, PgBouncer, and our Rails app:

  • Postgres is our relational database.
    • Currently we have 1 writable primary instance and several read-only replica instances of Postgres.
    • The replica dbs handle read-only queries and act as failover candidates in case the primary db becomes unavailable (e.g. unreachable, unresponsive) or is at risk of split-brain (e.g. unable to see or update Patroni cluster state).
  • Patroni coordinates failover when the primary Postgres instance becomes unavailable.
    • Patroni stores its cluster state in Consul.
    • A Patroni agent runs on each Postgres host, monitoring that local Postgres instance and interacting with Consul to check cluster state and publish its own state.
  • Database clients (Rails, Sidekiq, etc.) discover Postgres through Consul service discovery and access it through PgBouncer (a connection pooler).
    • PgBouncer is a connection pooling proxy in front of Postgres. Having thousands of clients connected directly to Postgres causes significant performance overhead. To avoid this penalty, PgBouncer dynamically maps thousands of client connections to a few hundred db sessions.
    • Consul advertises each PgBouncer instance as a proxy for a corresponding Postgres instance.
    • Consul keeps the DNS records for the primary and replicas up to date: the Consul agent on each db node runs periodic service health checks against the local Patroni API.
    • Because PgBouncer itself is single-threaded and CPU-bound, it can use at most 1 vCPU, so we run multiple PgBouncer instances in front of each Postgres instance to avoid CPU starvation.
    • Database clients discover the IP+port of the available PgBouncer instances by sending DNS queries to a local Consul agent.
  • Currently we use dedicated PgBouncer VMs for accessing the primary db, rather than local PgBouncer processes on the db host (as we do for the replica dbs).
    • The primary db gets more traffic than any one replica db.
    • PgBouncer appears to be more CPU-efficient on dedicated VMs than when running on the primary db host. We have a couple untested hypotheses as to why.
    • The primary db’s PgBouncer VMs share a virtual IP address (a Google TCP Internal Load Balancer VIP). That ILB VIP is what Consul advertises to database clients as the primary db IP.
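
As a quick illustration of the DNS-based service discovery described above (the same queries appear in more detail later in this page), any host running a Consul agent can ask it for the current primary and for the available replica PgBouncer endpoints:

Terminal window
dig +short -p 8600 @127.0.0.1 master.patroni.service.consul
dig +short -t SRV -p 8600 @127.0.0.1 db-replica.service.consul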

See the sections below for more details on the purpose, behaviors, and interactions of each service.

graph LR
    subgraph "Web Host"
        Rails
    end

    subgraph "Network forwarding rule"
        ILB["Load balancer virtual IP"]
    end

    subgraph "PgBouncer Host A"
        PgBouncer1
    end

    subgraph "PgBouncer Host B"
        PgBouncer2
    end

    subgraph "DB Host"
        Postgres["Postgres (read/write primary mode)"]
    end

    Rails --> ILB

    ILB --> PgBouncer1
    ILB --> PgBouncer2

    PgBouncer1 --> Postgres
    PgBouncer2 --> Postgres

Rails is configured to connect to the primary db using a static host/port pointing to the ILB:

Terminal window
$ egrep 'host:|port:' /srv/gitlab/config/database.yml
host: "pgbouncer.int.gprd.gitlab.net"
port: 6432
port: 8600
host: "pgbouncer-ci.int.gprd.gitlab.net"
port: 6432
port: 8600
$ dig +short pgbouncer.int.gprd.gitlab.net
10.217.4.5
$ dig +short pgbouncer-ci.int.gprd.gitlab.net
10.217.16.5

Note that even though this is an internal-only IP address, that DNS A record is served by our public DNS provider:

Terminal window
$ dig +noall +authority -t SOA pgbouncer.int.gprd.gitlab.net
gitlab.net. 1800 IN SOA arya.ns.cloudflare.com. dns.cloudflare.com. 2276495604 10000 2400 604800 3600

The ILB (a Google Internal TCP/UDP Load Balancer) is just a set of network configurations, not an in-line proxy for the traffic. Its forwarding-rule routes each new connection to one of the registered backends (in this case, dedicated PgBouncer hosts). For details on how to inspect the ILB’s components, see section Internal Loadbalancer (ILB).

The dedicated PgBouncer hosts proxy traffic to Postgres. They discover from Consul which Patroni node is currently the primary db.

Terminal window
$ sudo cat /var/opt/gitlab/pgbouncer/databases.ini | grep 'host='
gitlabhq_production = host=master.patroni.service.consul port=5432 pool_size=75 auth_user=pgbouncer
gitlabhq_production_sidekiq = host=master.patroni.service.consul port=5432 pool_size=50 auth_user=pgbouncer dbname=gitlabhq_production
$ dig +short master.patroni.service.consul
10.220.16.106
$ dig +short master.patroni.service.consul | xargs -i dig +short -x {}
patroni-06-db-gprd.c.gitlab-production.internal.

Note that PgBouncer uses the default system DNS resolver, not Consul directly. We run dnsmasq on these hosts with a configuration that forwards queries for the consul domain to the Consul agent’s DNS port (8600), so that PgBouncer’s service discovery queries work properly.

Terminal window
$ cat /etc/dnsmasq.d/10-consul
server=/consul/127.0.0.1#8600
cache-size=0
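
As a hedged sanity check on one of these PgBouncer hosts, the same name should resolve both through the system resolver (via dnsmasq) and directly against the Consul agent’s DNS port; the returned IP will of course vary:

Terminal window
dig +short master.patroni.service.consul
dig +short -p 8600 @127.0.0.1 master.patroni.service.consul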

PgBouncer re-resolves this DNS name frequently, so it can respond promptly if a Patroni failover occurs.

Terminal window
$ sudo cat /var/opt/gitlab/pgbouncer/pgbouncer.ini | grep 'dns_.*_ttl'
dns_max_ttl = 2.0
dns_nxdomain_ttl = 15.0
For the replica dbs, there is no ILB or dedicated PgBouncer tier: Rails connects to PgBouncer instances running locally on each Patroni host.

graph LR
    subgraph "Web Host"
        Rails
    end

    subgraph "DB Host"
        PgBouncer1
        PgBouncer2
        PgBouncer3
        Postgres["Postgres (read-only replica mode)"]
    end

    Rails --> PgBouncer1
    Rails --> PgBouncer2
    Rails --> PgBouncer3

    PgBouncer1 --> Postgres
    PgBouncer2 --> Postgres
    PgBouncer3 --> Postgres

Rails is configured to connect to the read-only replica dbs using Consul service discovery. Rails sends a DNS query to the local Consul agent to get a set of DNS SRV (service) records. This provides a list of IP and port combinations to reach all available PgBouncer instances.

Terminal window
$ grep -A6 'load_balancing:' /srv/gitlab/config/database.yml
load_balancing:
  discover:
    nameserver: gke-gprd-us-east1-d-web-0-f4bee7c1-xwwh
    record: "db-replica.service.consul."
    record_type: "SRV"
    port: 8600
    use_tcp: true
--
load_balancing:
  discover:
    nameserver: gke-gprd-us-east1-d-web-0-f4bee7c1-xwwh
    record: "ci-db-replica.service.consul."
    record_type: "SRV"
    port: 8600
    use_tcp: true
$ dig +short -t SRV -p 8600 @127.0.0.1 db-replica.service.consul. | wc -l
27
$ dig +short -t SRV -p 8600 @127.0.0.1 db-replica.service.consul. | sort -V -k4 -k3
1 1 6432 patroni-02-db-gprd.node.east-us-2.consul.
1 1 6433 patroni-02-db-gprd.node.east-us-2.consul.
1 1 6434 patroni-02-db-gprd.node.east-us-2.consul.
1 1 6432 patroni-03-db-gprd.node.east-us-2.consul.
1 1 6433 patroni-03-db-gprd.node.east-us-2.consul.
1 1 6434 patroni-03-db-gprd.node.east-us-2.consul.
...

Normal pattern of REST calls from Patroni to Consul


Patroni agent’s “loop” periodically runs a sequence of calls to Consul. Each of these REST calls from Patroni agent to Consul agent results in a single corresponding RPC call from Consul agent to Consul server:

sequenceDiagram
    participant PatroniAgent
    participant ConsulAgent
    participant ConsulServer

    PatroniAgent->>+ConsulAgent: REST call
    ConsulAgent->>+ConsulServer: RPC call
    ConsulServer->>-ConsulAgent: RPC response
    ConsulAgent->>-PatroniAgent: REST response

The normal healthy sequence of periodic calls is slightly different for the Patroni leader versus the Patroni replicas.

Patroni leader:

REST call | REST response | Purpose
GET /v1/kv/service/[CLUSTER_NAME]/?recurse=1 | All Patroni-related objects as JSON | Get complete state of Patroni cluster
PUT /v1/session/renew/[SESSION_ID] | Session object as JSON | Renew TTL of this Patroni node’s consul session
PUT /v1/kv/service/[CLUSTER_NAME]/optime/leader | Leader object as JSON | Write latest Postgres WAL location of primary db as Patroni leader optime
PUT /v1/kv/service/[CLUSTER_NAME]/members/[HOSTNAME]?acquire=[SESSION_ID] | String “true” confirming success | Publish current state of this Patroni node

Patroni replica:

REST call | REST response | Purpose
GET /v1/kv/service/[CLUSTER_NAME]/?recurse=1 | All Patroni-related objects as JSON | Get complete state of Patroni cluster
PUT /v1/session/renew/[SESSION_ID] | Session object as JSON | Renew TTL of this Patroni node’s consul session
PUT /v1/kv/service/[CLUSTER_NAME]/members/[HOSTNAME]?acquire=[SESSION_ID] | String “true” confirming success | Publish current state of this Patroni node
GET /v1/kv/service/[CLUSTER_NAME]/leader?wait=[DURATION]&index=[CURRENT_VALUE] | Leader object as JSON | Sleep for the remainder of loop_wait seconds, unless Patroni leader fails
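
The last call in the replica’s loop is a Consul blocking query: the GET on the leader key blocks for up to the given wait duration and returns early only if the leader key changes, which is how a replica reacts quickly to leadership changes while otherwise sleeping. As a hedged illustration, the same call can be made by hand with curl (index=1 is a deliberately stale index, so this returns immediately instead of blocking):

Terminal window
curl -s 'http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/leader?index=1&wait=5s' | jq '.[].ModifyIndex'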

These calls can be directly observed on a live Patroni host through a brief packet capture of:

  • the Consul agent’s REST port (8500) on the loopback network interface
  • the Consul server’s RPC port (8300) on the default network interface

The HTTP REST calls from Patroni agent to Consul agent over the loopback interface are unencrypted and easy to interpret. The RPC calls from Consul agent to Consul server are encrypted, but we can observe the close timing of those calls relative to the REST request/response.

Terminal window
sudo timeout 30 tcpdump -v -w sample_consul_rest_and_rpc_calls.pcap -c 1000 -i any 'port 8500 or 8300'

Download the above PCAP file to your workstation, and examine it with Wireshark (or its command-line version, tshark). Remember to decode TCP port 8500 as HTTP, as shown below.

Terminal window
wireshark -r sample_consul_rest_and_rpc_calls.pcap -d 'tcp.port==8500,http'
tshark -r sample_consul_rest_and_rpc_calls.pcap -d 'tcp.port==8500,http' -Y 'http.request.uri' -T fields -e frame.time_relative -e http.request.method -e http.request.uri | head -n12

Typical output for primary db:

Terminal window
0.000000000 GET /v1/kv/service/pg-ha-cluster/?recurse=1
0.006017000 PUT /v1/session/renew/c255bb15-5cc3-b9ba-be46-0522fb9f883f
0.008736000 PUT /v1/kv/service/pg-ha-cluster/optime/leader
0.019531000 PUT /v1/kv/service/pg-ha-cluster/members/patroni-06-db-gprd.c.gitlab-production.internal?acquire=c255bb15-5cc3-b9ba-be46-0522fb9f883f
10.000356000 GET /v1/kv/service/pg-ha-cluster/?recurse=1
10.006160000 PUT /v1/session/renew/c255bb15-5cc3-b9ba-be46-0522fb9f883f
10.009424000 PUT /v1/kv/service/pg-ha-cluster/optime/leader
10.020799000 PUT /v1/kv/service/pg-ha-cluster/members/patroni-06-db-gprd.c.gitlab-production.internal?acquire=c255bb15-5cc3-b9ba-be46-0522fb9f883f
20.000465000 GET /v1/kv/service/pg-ha-cluster/?recurse=1
20.006820000 PUT /v1/session/renew/c255bb15-5cc3-b9ba-be46-0522fb9f883f
20.009269000 PUT /v1/kv/service/pg-ha-cluster/optime/leader
20.020930000 PUT /v1/kv/service/pg-ha-cluster/members/patroni-06-db-gprd.c.gitlab-production.internal?acquire=c255bb15-5cc3-b9ba-be46-0522fb9f883f

Typical output on replica db:

Terminal window
0.064859000 GET /v1/kv/service/pg-ha-cluster/?recurse=1
0.070741000 PUT /v1/session/renew/d93acdd6-51de-3bce-5014-f1d4a2cd43cf
0.078500000 PUT /v1/kv/service/pg-ha-cluster/members/patroni-03-db-gprd.c.gitlab-production.internal?acquire=d93acdd6-51de-3bce-5014-f1d4a2cd43cf
0.088481000 GET /v1/kv/service/pg-ha-cluster/leader?wait=9.711172819137573s&index=44065582
10.232919000 GET /v1/kv/service/pg-ha-cluster/?recurse=1
10.236985000 PUT /v1/session/renew/d93acdd6-51de-3bce-5014-f1d4a2cd43cf
10.239178000 PUT /v1/kv/service/pg-ha-cluster/members/patroni-03-db-gprd.c.gitlab-production.internal?acquire=d93acdd6-51de-3bce-5014-f1d4a2cd43cf
10.255033000 GET /v1/kv/service/pg-ha-cluster/leader?wait=9.544909477233887s&index=44065582
20.327716000 GET /v1/kv/service/pg-ha-cluster/?recurse=1
20.332099000 PUT /v1/session/renew/d93acdd6-51de-3bce-5014-f1d4a2cd43cf
20.334070000 PUT /v1/kv/service/pg-ha-cluster/members/patroni-03-db-gprd.c.gitlab-production.internal?acquire=d93acdd6-51de-3bce-5014-f1d4a2cd43cf
20.347014000 GET /v1/kv/service/pg-ha-cluster/leader?wait=9.452655792236328s&index=44065582

The services/components involved, their Chef roles, hostname patterns, and ports:

Service/component | Chef role | Hostname pattern | Port (Protocol)
Postgres | gprd-base-db-patroni | patroni-{01..NN}-db-gprd | 5432 (Pgsql)
PgBouncer for primary db | gprd-base-db-pgbouncer | pgbouncer-{01..NN}-db-gprd | 6432 (Pgsql)
PgBouncer for replica dbs | Same as Postgres | Same as Postgres | 6432-6434 (Pgsql)
Patroni agent | Same as Postgres | Same as Postgres | 8009 (REST)
Consul agent | gprd-base (recipe gitlab_consul::agent) | Nearly all Chef-managed VMs | 8600 (DNS), 8500 (REST), 8301 (Serf LAN)
Consul server | gprd-infra-consul | consul-{01..NN}-inf-gprd | 8600 (DNS), 8500 (REST), 8301 (Serf LAN), 8302 (Serf WAN), 8300 (RPC)

In addition to the above Chef-managed services, we use a Google TCP Internal Loadbalancer (ILB) to provide a single virtual IP address for the pool of PgBouncer instances for the primary db. This allows clients to treat a pool of PgBouncers as a single endpoint.

Notes about the ILB:

  • A Google TCP/UDP Internal Loadbalancer (ILB) is not an inline device in the network path.
  • Instead, ILB is part of the control plane of the software defined network within a single geographic region.
  • All backends (i.e. PgBouncer VMs) share the IP address of the ILB’s forwarding rule, and within the VPC network, each TCP/UDP connection is routed to one of those backends.
  • Backend instances contact the metadata server (metadata.google.internal) to generate local routes to accept traffic for the ILB’s IP address.
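
As a hedged way to see that last point on one of the PgBouncer backend VMs: the ILB forwarding rule’s IP (10.217.4.5 in the dig output earlier) should appear as a local route programmed by the Google guest agent. The exact device name and output format vary by image and agent version:

Terminal window
ip route show table local | grep 10.217.4.5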

List of tools for working with these services

  • gitlab-psql: Wrapper for generic psql, authenticating and connecting to the local Postgres instance as a superuser.
  • gitlab-patronictl: Wrapper for generic patronictl, authenticating and connecting to the local Patroni agent.
  • pgb-console, pgb-console-1, pgb-console-2: Wrappers for generic psql, authenticating and connecting to each of the local PgBouncer instances on their distinct ports. This allows running PgBouncer’s Admin Console commands like SHOW HELP, SHOW STATS, RELOAD, etc.
  • consul: Consul CLI tool, for querying or modifying Consul from any host running a Consul agent.

All Patroni hosts run local PgBouncer instances, including whichever host is currently the Patroni leader and primary Postgres db. The PgBouncers on the primary db host are idle, since we route its traffic through the dedicated PgBouncer hosts.

The same pgb-console script exists on the dedicated PgBouncer hosts used for routing traffic to the primary db.

Note: The mechanism for taking a PgBouncer out of service for maintenance is different for dedicated PgBouncer hosts than for a PgBouncer instance on a Patroni host.

List the PgBouncer processes.

Terminal window
pgrep -a -f '/usr/local/bin/pgbouncer'
pgrep -f '/usr/local/bin/pgbouncer' | xargs -r ps uwf

Each PgBouncer runs on a different TCP port and has its own script to connect to its Admin Console.

Terminal window
$ ls -1 /usr/local/bin/pgb-console*
/usr/local/bin/pgb-console
/usr/local/bin/pgb-console-1
/usr/local/bin/pgb-console-2

Connect to Admin Console for each of the PgBouncer instances. This puts you in an interactive psql session, where you can run PgBouncer commands like SHOW HELP, SHOW POOLS, SHOW CLIENTS, SHOW SERVERS, etc.

Note: You can also use any psql options on the command-line or psql meta-commands in the interactive shell (e.g. \pset pager off, \x).

Terminal window
sudo pgb-console
sudo pgb-console-1
sudo pgb-console-2

In an interactive session, show the PgBouncer configuration (conffile, listen_port, max_client_conn, client_idle_timeout, server_lifetime, server_reset_query, etc.).

Terminal window
pgbouncer=# SHOW CONFIG ;

For each PgBouncer instance, show a summary of each connection pool, including:

  • cl_active: number of clients that are linked to a db server connection and can process queries
  • cl_waiting: number of clients who have sent queries but are waiting for a db server connection to become available
  • sv_active: number of db server connections linked to a client
  • sv_idle: number of db server connections that are unused and immediately available for clients
  • sv_used (unintuitive): number of db server connections that have been idle for more than server_check_delay, so they need server_check_query to run on them before they can be used again
Terminal window
for console in /usr/local/bin/pgb-console* ; do sudo $console -c 'SHOW POOLS' | cat ; done

For each PgBouncer instance, list the configured databases and their pool size limits.

Terminal window
for console in /usr/local/bin/pgb-console* ; do sudo $console -c 'SHOW DATABASES' | cat ; done

List connections from each PgBouncer instance to the local Postgres database instance.

Terminal window
for console in /usr/local/bin/pgb-console* ; do sudo $console -c 'SHOW SERVERS' | cat ; done

List connections from clients to each PgBouncer instance.

Terminal window
for console in /usr/local/bin/pgb-console* ; do sudo $console -c 'SHOW CLIENTS' | cat ; done

Consul queries and operations can mostly be done via either the consul CLI tool or the consul agent’s HTTP REST interface (TCP port 8500). All of these Consul queries can be run from any host running a Consul agent in the environment you want to inspect (e.g. gprd, gstg, etc.). It does not have to be a Patroni host, because all Consul agents participating in the same gossip membership list can also make RPC calls to the Consul servers.

Get help with the consul CLI tool’s commands and subcommands.

Terminal window
consul help
consul operator --help
consul kv --help
consul kv get --help

Summarize generic status info for the local host’s Consul agent.

Terminal window
consul info

Tail the logs from the consul agent. Ctrl-C to stop following.

Terminal window
consul monitor

Show the list of Consul servers, and indicate which one is currently the Consul leader.

Terminal window
consul operator raft list-peers

Show all KV records, including their metadata and base64-encoded values.

Terminal window
consul kv export

List all key names in the KV store.

Terminal window
consul kv export | jq '.[].key'

Show the values of all keys having prefix “service/pg-ha-cluster”. Output values are automatically decoded from base64.

Terminal window
consul kv get -recurse 'service/pg-ha-cluster'

Show a single KV record’s value, with and without metadata. Many of Patroni’s KV record values are JSON, so piping the raw value to jq is sometimes more readable.

Terminal window
consul kv get -detailed 'service/pg-ha-cluster/config'
consul kv get 'service/pg-ha-cluster/config'
consul kv get 'service/pg-ha-cluster/config' | jq

Show the latest Patroni state self-published by each Patroni node through its Consul agent.

Terminal window
consul kv get -recurse -keys 'service/pg-ha-cluster/members/' | xargs -i consul kv get {} | jq -S

List all nodes running a Consul agent in this environment.

Note: Currently we run all Consul agents as though they were in the same datacenter (i.e. the same “Serf LAN”). Consul expects each “datacenter” to have its own set of Consul servers, with loose coupling between datacenters. But the Consul servers in each datacenter would maintain state separately. If we used “Serf WAN” to connect multiple regions, this command’s output would only include Consul agents in this host’s region (i.e. the Consul agents bound to this region’s set of Consul servers).

Terminal window
consul members
consul catalog nodes

List all service names registered to Consul.

Terminal window
consul catalog services

List all nodes associated with a specific registered service. (Service names are listed by consul catalog services.)

Terminal window
consul catalog nodes -service=patroni
consul catalog nodes -service=db-replica

List all services registered by a specific node. (Node names are listed by consul catalog nodes.)

Terminal window
consul catalog services -node=patroni-01-db-gprd

Put the local Consul agent (or a specific service it provides) into maintenance mode. This is comparable to failing a health check for the node or service.

Warning: This state persists across agent restarts. If maint mode is enabled, it must later be manually disabled.

Terminal window
consul maint -help
consul maint
consul maint -enable -reason 'Optional comment explaining why this node is being taken down'
consul maint -disable

Consul REST API: Commands to inspect/explore Patroni’s state stored in Consul’s key-value (KV) store


As with the CLI commands above, these REST queries can be run from any host running a Consul agent in the environment you want to inspect (e.g. gprd, gstg, etc.); it does not have to be a Patroni host.

Patroni uses the HTTP REST interface to interact with Consul agent. To see these calls, run a packet capture of TCP port 8500 on the loopback interface.

Who is the current Patroni leader for the Patroni cluster named “pg-ha-cluster”?

Note: The response’s Session key acts as a mutex, indicating the session id of the Consul agent running on the current Patroni leader. It is deleted if the cluster lock is voluntarily released.

Terminal window
$ curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/leader | jq .
[
  {
    "LockIndex": 1,
    "Key": "service/pg-ha-cluster/leader",
    "Flags": 0,
    "Value": "cGF0cm9uaS0xMS1kYi1ncHJkLmMuZ2l0bGFiLXByb2R1Y3Rpb24uaW50ZXJuYWw=",
    "Session": "ee43c2cf-5b93-08b7-6900-1cf55c9e83b3",
    "CreateIndex": 34165794,
    "ModifyIndex": 34165794
  }
]

Show all the state data stored in Consul for this Patroni cluster.

Note: This is the same REST call that’s run periodically by Patroni’s get_cluster method. The Value field is always base-64 encoded. The decoded values are typically either JSON or plain strings.

Terminal window
curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/?recurse=1 | jq .

List just the Key field of the Patroni cluster’s Consul KV keys.

Terminal window
curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/?recurse=1 | jq -S '.[].Key'

Extract and decode the value of one of the above records.

Terminal window
curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/leader | jq -r '.[].Value' | base64 -d ; echo

Or use the consul CLI tool, and let it do the base-64 decoding for you.

Terminal window
consul kv get -detailed service/pg-ha-cluster/leader

Does this host’s Consul agent have a Consul session? If so, show its full details.

Notes:

  • This shows the complete definition of a “consul session”.
  • Each Patroni agent has its own session.
  • A consul agent can use its session ID as an advisory lock (mutex) on any consul KV record. A session can claim exclusive ownership of that record by setting the record’s Session attribute with its own ID value. When the session is invalidated/expired, the lock is automatically released.
  • This locking mechanism is how Patroni uses Consul to ensure that only one node is the Patroni leader (represented by the consul KV record “service/[cluster_name]/leader”).
Terminal window
$ curl -s http://127.0.0.1:8500/v1/session/node/$( hostname -s ) | jq .
[
  {
    "ID": "0e9e66a5-d17a-3543-e389-209c90731209",
    "Name": "pg-ha-cluster-patroni-09-db-gprd.c.gitlab-production.internal",
    "Node": "patroni-09-db-gprd",
    "Checks": [
      "serfHealth"
    ],
    "LockDelay": 1000000,
    "Behavior": "delete",
    "TTL": "15.0s",
    "CreateIndex": 32308118,
    "ModifyIndex": 32308118
  }
]

List the consul session id for each Patroni agent in this Patroni cluster (pg-ha-cluster).

Notes:

  • Failing the health check or expiring the TTL invalidates the session. If that happens to the Patroni leader, it loses the cluster-lock, causing a failover.
  • The TTL reported here is always half the value specified in the Patroni config, because Patroni divides that configured value by 2 before setting it in Consul.
Terminal window
curl -s http://127.0.0.1:8500/v1/session/list | jq -c '.[] | { ID, Name, TTL, Checks }' | grep 'pg-ha-cluster' | sort

Show which of the above session ids holds the lock as the Patroni leader.

Again, the existence of the Session field on this Consul record acts as the mutex. If the session is invalidated (expires, fails a health check, or is deleted), then the service/pg-ha-cluster/leader key is unlocked, meaning no Patroni node holds the Patroni “cluster lock”, and Patroni starts its leader election process.

Terminal window
$ curl -s http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/leader | jq -r '.[].Value' | base64 -d ; echo

Or use the consul CLI tool:

Terminal window
$ consul kv get service/pg-ha-cluster/leader

Internal Loadbalancer (ILB)

Unlike other components, the ILB is not a service/process/instance. It is just a set of network routing rules, with no inline host or device acting as a proxy. It is purely configuration in the network control plane.

A Google Internal TCP/UDP Load Balancer consists of the following components:

  • Forwarding rule, which owns the IP address of the load balancer (shared by all backends for routing purposes)
  • Backend Service, which contains instance-groups and/or instances (i.e. pool members)
  • Health check, for probing each backend instance

Here are gcloud commands for inspecting the above components of the ILB. Navigating the GCP Console web UI is also intuitive.

Find and show the internal loadbalancer’s forwarding rule.

Terminal window
gcloud --project='gitlab-production' compute forwarding-rules list | egrep 'NAME|pgbouncer'
gcloud --project='gitlab-production' compute forwarding-rules describe --region=us-east1 gprd-gcp-tcp-lb-internal-pgbouncer

Find and show the internal loadbalancer’s target backend service.

Terminal window
gcloud --project='gitlab-production' compute backend-services list --filter="name~'pgbouncer'"
gcloud --project='gitlab-production' compute backend-services describe gprd-pgbouncer-regional --region='us-east1'

Find and show the internal loadbalancer’s health check.

Terminal window
gcloud --project='gitlab-production' compute health-checks list | egrep 'NAME|pgbouncer'
gcloud --project='gitlab-production' compute health-checks describe gprd-pgbouncer-http

Show the latest results of health-checks against each of the backend-service’s backends (i.e. the zone-specific instance-groups).

Terminal window
gcloud --project='gitlab-production' compute backend-services get-health gprd-pgbouncer-regional --region='us-east1'

Show that backend service’s instance groups.

Terminal window
gcloud --project='gitlab-production' compute instance-groups list | egrep 'NAME|pgbouncer'

Describe those instance-groups.

Terminal window
( for ZONE in us-east1-{b,c,d} ; do INSTANCE_GROUP="gprd-pgbouncer-${ZONE}" ; echo -e "\nInstance-group: ${INSTANCE_GROUP}" ; gcloud --project='gitlab-production' compute instance-groups describe "${INSTANCE_GROUP}" --zone="${ZONE}" ; done )

List the instances in those instance-groups.

Terminal window
( for ZONE in us-east1-{b,c,d} ; do INSTANCE_GROUP="gprd-pgbouncer-${ZONE}" ; echo -e "\nInstance-group: ${INSTANCE_GROUP}" ; gcloud --project='gitlab-production' compute instance-groups list-instances "${INSTANCE_GROUP}" --zone="${ZONE}" ; done )
  • Patroni provides cluster management for Postgres, automating failover of the primary db and reconfiguring replica dbs to follow the new primary db’s transaction stream.
  • Consul provides shared state and lock management to Patroni. It also provides DNS-based service discovery, so database clients can learn when Patroni nodes fail or change roles.
  • PgBouncer provides connection pooling for Postgres. We have two separate styles of use for PgBouncer:
    • Access to the primary db transits a dedicated pool of PgBouncer hosts, which discover the current Patroni leader (i.e. the primary db) by querying Consul’s DNS record master.patroni.service.consul. In turn, database clients access that pool of PgBouncer hosts through a single IP address that is load balanced among the pool members by a Google Internal TCP Loadbalancer. That load balanced IP address is published as the DNS A record pgbouncer.int.gprd.gitlab.net.
    • Access to each of the replica dbs transits one of the PgBouncer instances (ports 6432-6434) running locally on each Patroni host. Database clients discover the list of available replica db PgBouncer instances by querying Consul’s DNS SRV records for db-replica.service.consul. For historical reasons (prior to running multiple PgBouncers per replica db), Consul also publishes DNS A records for replica.patroni.service.consul pointing to the 1st PgBouncer instance (the one bound to port 6432). Both lookups are shown below.
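
To illustrate those two discovery paths, both names can be queried against any local Consul agent (the SRV output earlier in this page shows the full list; head is used here only to truncate the output):

Terminal window
dig +short -t SRV -p 8600 @127.0.0.1 db-replica.service.consul. | head -n3
dig +short -p 8600 @127.0.0.1 replica.patroni.service.consul. | head -n3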

Normal healthy interactions between these services

  • Patroni uses Consul mainly as a lock manager and as a key-value datastore to hold Patroni cluster metadata.
    • Each Patroni agent must regularly:
      • Fetch the Patroni cluster’s state from Consul.
      • Publish its own local node state to Consul.
      • Renew its Consul session lock.
    • If the Patroni leader fails to renew its session lock before the lock’s TTL expires, the other Patroni nodes will elect a new leader and trigger a Patroni failover.
  • Patroni agent makes REST calls to the local Consul agent.
    • To build a response, the Consul agent makes RPC calls to a Consul server, which is where state data is stored. Consul agents do not locally cache KV data.
  • Our Consul topology has 5 Consul servers and several hundred Consul agents.
    • Consul servers:
      • The Consul servers each store state locally, but only 1 (the Consul leader) accepts write requests. Typically, only the Consul leader accepts read requests as well.
      • The non-leader Consul servers exist for redundancy and durability, as each Consul server stores a local copy of the event log.
      • Consul’s CAP bias is to prefer consistency over availability, but with 5 servers, Consul can lose 2 and remain available.
      • Unreliable network connectivity can trigger a Consul leader election, which causes Consul to be unavailable until connectivity recovers to the point that a quorum is able to elect a new leader. Such a Consul outage can in turn potentially cause Patroni to demote its leader and wait for Consul to become available so Patroni can elect a new leader.
    • Consul agents:
      • Every Consul agent participates in a gossip protocol (Serf) that natively provides a distributed node failure detection mechanism.
      • Whenever one Consul agent probes another and fails to elicit a prompt response, the probing node announces via gossip that it suspects the buddy it just probed may be down. That gossip message quickly propagates to all other Consul agents.
      • Every agent locally logs this event, so we could look on any other host running Consul agent to see roughly when the event occurred.
      • This has proven to be a reliable detector for intermittent partial network outages in GCP’s infrastructure.
  • We also use Consul to publish via DNS which Patroni node is the leader (i.e. the primary db) and which are healthy replica dbs.
    • Database clients (e.g. Rails app instances) periodically query their host’s local Consul agent via DNS to discover the list of available databases. This list actually refers to PgBouncer instances, which proxy to Postgres itself.
    • If Consul’s list of available dbs changes, our Rails app updates its internal database connection pool accordingly.
    • Patroni does not update the DNS records (services in Consul) by calling Consul. Instead, the Consul agent on each db host periodically runs health checks against its local instance to determine whether it is a primary or replica node, by requesting http://0.0.0.0:8009/master and http://0.0.0.0:8009/replica every 2s via the Patroni API. (One way to inspect these checks is shown below.)
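
As a hedged way to see those health checks from a Patroni host, list the checks registered with the local Consul agent via its agent API (check names and IDs vary by environment):

Terminal window
curl -s http://127.0.0.1:8500/v1/agent/checks | jq 'keys'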

Patroni’s job is to provide high availability for Postgres by automatically detecting node failure, promoting a replica to become the new writable primary, and coordinating the switchover for all other replicas to start following transactions from the new primary once they reach the point of divergence between the old and new primary’s timelines.

Consul has 2 primary jobs:

  • Store Patroni’s state data in a highly-available distributed datastore.
  • Advertise via DNS to database clients how to connect to a primary or replica database.
    • This is kept up to date by health checks run periodically by the Consul agents on the Patroni hosts.

How and why does Patroni interact with Consul as a datastore?


To accomplish its job, Patroni needs to maintain a strongly consistent and highly available representation of the state of all the Patroni cluster’s Postgres instances. It delegates the durable storage of that state data to an external storage service — what it calls its Distributed Configuration Store (DCS). Patroni supports several options for this DCS (Consul, Zookeeper, Etcd, and others); in our case, we chose to use Consul.

Patroni stores several kinds of data in its DCS (Consul), such as:

  • Who is the current cluster leader? (consul key service/pg-ha-cluster/leader)
  • Patroni config settings from the dcs stanza of patroni.yml. (consul key service/pg-ha-cluster/config)
  • Each Patroni node periodically self-describes its status (xlog_location, timeline, role, etc.). (consul keys service/pg-ha-cluster/members/[hostname])
  • Other ancillary data, including a history of past failover events, metadata about an in-progress failover, whether or not failover is currently paused, etc.
  • While Patroni updates KV records for Patroni state by calling Consul’s KV API, it does not directly call Consul’s service APIs to update service status (and hence DNS records). Instead, the Consul agent requests http://0.0.0.0:8009/master and http://0.0.0.0:8009/replica every 2s, using the Patroni API to check the status of each service.

Warning: If you need to manually PAUSE Patroni (i.e. prevent failover even if the primary starts failing health checks), a Chef run on any Patroni node will revert that pause. Chef tells the Patroni agent to force the dcs settings in patroni.yml to overwrite any conflicting settings stored in Consul, and that scope unfortunately includes the consul key used for pausing failovers. So to pause Patroni (e.g. for maintenance), we must first stop Chef on all Patroni hosts.
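
For reference, a minimal sketch of pausing and resuming, assuming Chef has already been stopped on all Patroni hosts as described in the warning above. gitlab-patronictl wraps patronictl, whose pause/resume subcommands toggle Patroni’s automatic failover; depending on configuration you may need to pass the cluster name (e.g. pg-ha-cluster) as an argument:

Terminal window
sudo gitlab-patronictl pause
sudo gitlab-patronictl list     # verify the cluster state (maintenance mode should be indicated while paused)
sudo gitlab-patronictl resume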

The Consul agent does not locally cache any of the above data. Every time the Patroni agent asks the local Consul agent to read or write this data, the Consul agent must synchronously make RPC calls to a Consul server. The Patroni agent’s REST call to the Consul agent can timeout or fail if Consul agent’s RPC call to Consul server stalls or fails. (This has proven to be a common failure mode on GCP due to transient network connectivity loss.)

What is the difference between the Patroni leader and the Consul leader?


The Patroni leader is the Patroni agent corresponding to the writable Postgres database (a.k.a. the primary Postgres db). All other Patroni nodes correspond to read-only replica Postgres databases that asynchronously replay transaction logs received from the primary Postgres database.

The Consul leader is whichever one of the Consul servers is currently accepting writes. Consul has its own internal leader-election process, independent of Patroni.

Why use one database (Consul) to manage another database (Postgres)?


Patroni uses Consul to provide high availability to Postgres through automated failover.

Postgres and Consul are both databases, but they have different strengths and weaknesses. Consul excels at storing a small amount of data, providing strong consistency guarantees while tolerating the loss of potentially multiple replicas. But Consul is not designed to handle a high write rate, and it provides just basic key-value storage. In contrast, Postgres is a much more featureful relational database and supports many concurrent writers. While Postgres natively provides replication, it does not natively provide an automated failover mechanism.

For Patroni to provide high availability (automated failover) to Postgres, it needs all Patroni agents to have a consistent view of the Patroni cluster state (i.e. who is the current leader, how stale is each replica, etc.). Patroni stores that state in its DCS (which for us is Consul), with the expectation that writes are durable and reads are strongly consistent.

How does Consul balance consistency versus availability?


Consul prefers consistency over availability. When failure conditions such as node loss or network partitions force Consul to choose between consistency and availability, Consul prefers to stop accepting writes and reads until a quorum of Consul server nodes is again reached. This avoids split-brain. To reduce the likelihood of losing quorum, Consul supports a peer group of up to 11 servers, but most production deployments use 3 or 5 (which tolerates the loss of 1 or 2 nodes respectively).

As is typical, in production we run 5 hosts as Consul servers to act as the datastore, and we run a Consul agent on every other host that needs to read or write data stored on the Consul servers.

The Consul servers participate in a strongly consistent consensus protocol (RAFT) for leader election. Only the current leader is allowed to accept writes, so that all writes are serializable. These logged writes are replicated to the other Consul servers; at least a majority (quorum) of Consul servers must receive the new log entry for the write to be considered successful (i.e. guaranteed to be present if a new leader is elected). If the current leader fails, Consul will stop accepting new writes until the surviving quorum of peers elect a new leader (which may take several seconds). Typically read requests are also handled by the Consul leader, again to provide strong consistency guarantees, but that is tunable. If a non-leader consul server receives a read request, it will forward that call to the current Consul leader.

Only the Consul servers participate as peers in the strongly-consistent RAFT protocol. But all Consul agents participate in a weakly-consistent gossip protocol (SERF). This supports automatic node discovery and provides distributed node failure detection.

How do Patroni’s calls to Consul let it decide when to failover?


Each Patroni agent (whether replica or primary) periodically interacts with Consul to:

  • Fetch the most recently published status of its cluster peers.
  • Affirm its liveness by renewing its consul session (which quickly auto-expires without these renewals).
  • Publish its own current state metadata.
  • If leader, publish current Postgres WAL location; otherwise, watch for any change in leadership.

If Patroni fails one of these REST calls to Consul agent, the failed call can be retried for up to its configured retry_timeout deadline (currently 30 seconds). For the Patroni leader (i.e. the Patroni agent whose Postgres instance is currently the writable primary db), if that retry deadline is reached, Patroni will initiate failover by voluntarily releasing the Patroni cluster lock.

Similarly, if the Patroni leader’s loop takes long enough to complete that its consul session expires (TTL is currently effectively 45 seconds), then it involuntarily loses the cluster lock, which also initiates failover.

Note: Remember that Consul agent does not locally cache data; it makes synchronous remote calls to Consul server. Often when Patroni logs a timeout, it is due to troubled connectivity between Consul agent and Consul server.

See also: Normal pattern of REST calls from Patroni to Consul

What is Consul’s serfHealth check, and how can it trigger a Patroni failover?


Another way for Patroni’s leader to involuntarily lose its cluster lock is if Consul’s serfHealth health check fails for that host’s Consul agent.

Consul uses Serf as its gossip protocol. In addition to being a medium for asynchronous eventually-consistent information sharing, Serf also provides Consul’s mechanism for node failure detection. Every host running a Consul agent participates in Serf gossip traffic, including its liveness probe: serfHealth (also called SerfCheck). Each Consul agent intermittently tries to connect to a random subset of other agents, and if that attempt fails, it announces that target as being suspected of being down. The suspected-down agent has a limited window of time to actively refute that suspicion. Meanwhile, other agents will “dogpile” on health-checking the suspected-down agent, and if they concur that the agent is unresponsive, the window for refutation shortens (to reduce time to detect a legitimate failure). After the refutation window expires, the Consul server can mark that Consul agent as failed. When the problem is resolved (e.g. agent restarted, network connectivity restored), the failed Consul agent automatically rejoins by contacting the Consul server.

If a Patroni node’s Consul agent is marked by its peers as failing that serfHealth check, then this immediately invalidates the Patroni agent’s consul session (which, as described above, acts as a mutex). If that happens to the Patroni leader, it immediately loses the Patroni “cluster lock”, triggering a Patroni failover. In contrast, if serfHealth check fails for a non-leader Patroni node, it is merely excluded as a candidate for leader election until it establishes a new session (which typically happens seconds after network connectivity is restored).

In summary, any host’s Consul agent can flag the Patroni leader’s Consul agent as potentially down, and if it does not promptly refute that claim, Patroni will initiate a failover — all because Consul’s serfHealth check is part of Patroni’s contract for maintaining the validity of its consul session.

Because GCP network connectivity has proven to be intermittently unreliable, we are considering reconfiguring Patroni to ignore the serfCheck, so that its consul sessions would not be invalidated due to any one random VM in the environment having packet loss. Pros and cons are described in Issue 8050.

In brief: A Consul agent first sends a UDP message probe. If it gets no response, it asks a few other nodes to try the same UDP message while it concurrently tries one via TCP. If those attempts all fail, it broadcasts via gossip a suspicion that the target node is dead. If the target node does not explicitly refute that suspicion within the time limit, it is marked as failed by Consul server. For more details, continue reading.
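
A hedged way to watch this failure-detection chatter in real time is to stream the local Consul agent’s logs at debug level; the exact wording of the log lines varies by Consul/Serf version, but they typically mention suspecting or marking a member as failed:

Terminal window
consul monitor -log-level=debug | grep -Ei 'suspect|failed'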

The following notes are from a source code review of Consul 1.5.1, which includes a fork of the Serf library.

For reference, the relevant configuration options and their defaults are defined (and well-annotated) in the memberlist library’s Config struct.

The node failure detection behavior using Serf primitives is implemented by the probeNode method of Memberlist. Here is a walk-through of that behavior:

  • Deadline for a probe is dynamic but is at least Memberlist.Config.ProbeInterval (default: 1 second).
  • If the target node’s current state is stateAlive (healthy), then just send a pingMsg via Gossip (UDP). Otherwise, append a suspect message to the pingMsg and send it.
  • Wait for an acknowledgement response for up to Memberlist.Config.ProbeTimeout (default: 500 ms).
    • Exit successfully if ack received within ProbeTimeout deadline. Else continue.
  • Concurrently try 2 tactics:
    • Tactic 1: Request indirect UDP probes: Ask up to Memberlist.Config.IndirectChecks nodes (default: 3) to ping the target node on our behalf.
    • Tactic 2: Try “fallback” TCP ping: If the target node speaks protocol version 3 or higher and DisableTcpPings=false (default), then open a TCP connection, and send the same pingMsg.
      • Note: The deadline variable passed to TCP ping has roughly ProbeInterval minus ProbeTimeout time remaining (default: 1000 ms - 500 ms = 500 ms). The deadline variable was set to time.Now().Add(ProbeInterval) at the start of the probeNode method, and by this point we have already spent roughly ProbeTimeout (default: 500 ms) of that budgeted deadline.
      • Note: This is probably not a practical concern for the time scales involved, but for completeness: Both the Indirect UDP Pings and the TCP Ping are run in goroutines, so they may not run immediately if the Golang runtime has too few threads free (e.g. on a host with very few CPUs). The TCP probe’s deadline is an absolute timestamp, so any delay in scheduling a goroutine consumes part of the patience of that health check.
  • Check results: Wait for the UDP-based indirect acks or nacks to arrive or timeout. Then check the result of the TCP ping.
    • If any indirect UDP Ping succeeds, exit successfully. If they all fail (timeout or nack), then check results of the TCP Ping.
    • If TCP Ping succeeded, log warning and exit successfully.
    • Note: We warn if only TCP succeeds because if UDP is consistently blocked/failing, the node cannot hear most gossip and may pollute peers with its stale state via full-sync.
  • If those fallback probes all failed, then:
    • Update self-awareness metrics.
    • Call method suspectNode:
      • Broadcasts via gossip the identity (name and incarnation number) of the node that failed its health check, and starts the timer for the refutation window.
  • Other nodes in the cluster receive this suspect message and also locally call suspectNode.

What happens during a Patroni leader election?


Patroni’s leader election protocol allows a period of time for all replicas to catch up on any transaction data from the old primary db that they had received but not yet applied. During that period, each Patroni node frequently updates its state (e.g. xlog replay location), so its peers know who is the freshest. At the end of the grace period, the freshest replica will be promoted to become the new primary.

Patroni then helps each replica switch to the new primary’s timeline, so they have a consistent point of divergence from the old primary’s timeline and can safely apply transaction logs from the new primary. If a replica fails to switch timelines for any reason, it is shut down and removed from the list of replica dbs available for handling read-only queries. (However, in this case the new primary db still needs to be manually told to stop retaining transaction logs for that dead replica.)

Meanwhile, the dedicated PgBouncer instances frequently query Consul DNS for updates, so they quickly detect that Patroni has promoted a new Postgres instance to be the primary db. PgBouncer discards any residual connections to the old primary db, opens connections to the new primary db, and starts mapping client queries to these new db connections. Clients such as Rails instances remain connected to PgBouncer throughout this process. Transactions that were in progress at the time the primary db failed (or was demoted) are aborted, and newer transactions fail until the new primary db is elected by Patroni and detected by PgBouncer.

Define the relationship between Patroni settings ttl, loop_wait, and retry_timeout


The Patroni documentation is vague about how these 3 settings are used and how to tune them:

  • loop_wait: the number of seconds the loop will sleep. Default value: 10
  • ttl: the TTL to acquire the leader lock. Think of it as the length of time before initiation of the automatic failover process. Default value: 30
  • retry_timeout: timeout for DCS and PostgreSQL operation retries. DCS or network issues shorter than this will not cause Patroni to demote the leader. Default value: 10

The following more detailed description is based on review of the Patroni 1.6.0 source code, specifically focusing on the generic DCS code and the Consul-specific code.

  • loop_wait (seconds): The minimum delay the Patroni agent waits between attempts to renew the TTL of its consul session. The consul session of the Patroni leader is effectively the leader lock; if the leader’s consul session expires, the service/[cluster_name]/leader record will be deleted by the Consul server, and surviving Patroni agents will initiate a leader election.
  • ttl (seconds): Effectively the TTL of the leader lock. Technically every Patroni agent has a consul session with this TTL. If the leader’s consul session expires, the leader lock is released (i.e. the Consul KV record named service/[cluster_name]/leader is automatically deleted), which surviving Patroni agents detect, initiating leader election. If a non-leader Patroni node’s consul session expires, it cannot publish status updates or participate in leader elections until it creates a new consul session. All Patroni agents try to renew the TTL on their consul sessions approximately every loop_wait seconds, but only the leader’s session expiry can cause failover.
  • retry_timeout (seconds): The cumulative max duration of all tries of any single operation (e.g. a single REST operation to Consul can be retried for up to retry_timeout seconds). A typical healthy Patroni loop makes 3-4 REST calls to Consul. If more than one is affected by slowness, the TTL could expire before Patroni issues the REST call to renew it. In the Consul-specific Patroni code, each of 3 attempts at a REST HTTP call to Consul agent gets 1/3 of this retry_timeout budget. (The Retry object’s overall deadline is retry_timeout, and it wraps the HTTP client object whose read_timeout is set to 1/3 of that value.) Only the last of the 3 failed attempts is logged, which is why its log message shows the HTTP timeout as being 1/3 of this configured value.

When adjusting these 3 settings, for the timeouts to work properly:

  • ttl must be at least loop_wait + retry_timeout. Otherwise the leader lock will implicitly expire either between renewal attempts or before a stalled DCS operation reaches its retry_timeout.
  • ttl being comfortably larger than this makes it tolerate multiple slow calls to Consul.
  • The default settings are equivalent to: ttl = loop_wait + 2 * retry_timeout
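
To sanity-check the currently effective values against that rule, the same consul kv pattern used elsewhere in this page works; with the defaults (loop_wait=10, retry_timeout=10, ttl=30), the inequality 30 >= 10 + 10 holds with room to spare:

Terminal window
consul kv get service/pg-ha-cluster/config | jq '{ttl, loop_wait, retry_timeout}'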

Why is Patroni’s actual TTL half of its configured value?


Surprisingly, Patroni silently halves the configured ttl setting, because it expects Consul to silently double it.

def set_ttl(self, ttl):
    if self._client.http.set_ttl(ttl/2.0):  # Consul multiplies the TTL by 2x
        ...

Example: The default ttl is 30 seconds, but the actual consul session’s TTL is really set to 15 seconds:

Terminal window
$ consul kv get service/pg-ha-cluster/config | jq -c '. | { ttl }'
{"ttl":30}
$ curl -s http://127.0.0.1:8500/v1/session/node/$( hostname -s ) | jq -c '.[] | { TTL }'
{"TTL":"15.0s"}

This hard-coded behavior may be helpful in environments that prefer a very strict TTL, typically with a very small value; in other words, environments that would rather fail over spuriously than wait a little longer to see if the triggering condition was ephemeral or a false alarm. The Consul documentation also describes its TTL support as following a lazy expiry policy, but it does not claim any divide-by-two logic. Rather, it states the session TTL is a lower bound (not an upper bound) for when the Consul server will actually delete/release an expired session:

When creating a session, a TTL can be specified. If the TTL interval expires without being renewed, the session has expired and an invalidation is triggered. […] The contract of a TTL is that it represents a lower bound for invalidation; that is, Consul will not expire the session before the TTL is reached, but it is allowed to delay the expiration past the TTL.

So we should be aware that whatever value we set in Patroni’s DCS ttl config, Consul is being told half of that value. Despite the heuristic testing done 2 years ago when that divide-by-two logic was added to the Patroni code, a strict interpretation of the Consul docs suggests the session could be expired as early as half the time we specify in the Patroni ttl config.

The following lists some selected illustrative failure patterns and what their symptoms look like.

What specific network paths can trigger Patroni failover if they become lossy?


See: “Concise summary of RCA” comment on issue “Why are patroni failovers occurring so often?”

Dedicated PgBouncer hosts can develop a very uneven distribution of client connections after maintenance or restart events


See: Why do the PgBouncer hosts have a very uneven distribution of client connections?
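
A hedged way to eyeball that distribution on a given PgBouncer host is to reuse the pgb-console pattern from earlier in this page and count the rows of SHOW CLIENTS per instance (assuming the wrapper passes psql options through, as it does for -c above; -t suppresses headers so only connection rows are counted):

Terminal window
for console in /usr/local/bin/pgb-console* ; do echo -n "${console}: " ; sudo $console -t -c 'SHOW CLIENTS' | grep -c . ; done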