PostgreSQL BufferMapping LWLock contention
Originally published in this reference documentation issue, and then copied into Runbooks.
Quick reference
Severe `BufferMapping` contention indicates postgres’s shared buffer pool is thrashing:
- This flavor of LWLock contention emerges from churn on the shared buffer pool.
- The workload’s working set no longer fits well into the pool of shared buffers.
- Evictions occur frequently enough to severely degrade queries.
- The instance-wide bottleneck becomes contention for the `BufferMapping` LWLocks, which mediate evicting pages from shared buffers.
- This can only happen when the working set size does not fit in the pool of shared buffers but still fits in the filesystem cache. If the filesystem cache also had a high miss rate, physical disk IO latency would become the bottleneck, rather than the shared buffer pool’s `BufferMapping` lock acquisition rate.
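A quick first check from any psql session (a sketch; on some older major versions the wait event is spelled buffer_mapping rather than BufferMapping):

```sql
-- Count active backends by wait event; a pile-up of LWLock / BufferMapping
-- waits is the signature of this contention.
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
```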
Mitigation first response
Reduce query concurrency per postgres instance:
- If affecting replicas, add another replica.
- If affecting the primary, or if replicas cannot be added quickly, moderately reduce the appropriate pgbouncers’ backend connection pool max size (a quick way to inspect current pool usage is sketched below).
- Reducing per-instance query concurrency should reduce working set size and shared buffer churn. This reduces eviction overhead, improving query response time and possibly also query throughput.
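To gauge and confirm the effect of a pool-size change, the pgbouncer admin console (connect with psql to the pgbouncer port, database pgbouncer) reports per-pool connection counts; this is a sketch, and pool names and limits depend entirely on the local configuration:

```sql
-- Run against the pgbouncer admin console, not against postgres itself.
-- sv_active shows server (backend) connections currently in use per pool;
-- SHOW CONFIG lists the pool size settings currently in effect.
SHOW POOLS;
SHOW CONFIG;
```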
Analysis
- In prometheus metrics, what activity patterns recently changed? Increases in fetched rows per second may hint at what is driving churn on the shared buffer pool (which leads to contention beyond a certain churn rate). Useful signals include:
  - the rate of index scans on each table or index
  - the ratio of index fetches per index scan on each index of any large or interesting table
- In snapshots of `pg_stat_activity`, were any queries showing unexpectedly high concurrency? High-frequency queries that touch many shared buffers may significantly contribute to buffer churn. (Example queries for these checks are sketched after this list.)
- In the `auto_explain` log events in the `postgresql.csv` logs, what queries show up often?
  - Which parts of their execution plans touch the most shared buffers?
  - This may help identify a specific index or table to focus on. Even moderate buffer usage on a frequently run query can still drive the working set size far enough above the pool size to cause severe contention.
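Example queries for the statistics checks above, sketched against the standard statistics views (the prometheus metrics are rates derived from these cumulative counters):

```sql
-- Index usage: most-scanned indexes and their average fetches per scan.
SELECT schemaname, relname, indexrelname,
       idx_scan,
       round(idx_tup_fetch::numeric / NULLIF(idx_scan, 0), 1) AS fetches_per_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 20;

-- Snapshot of query concurrency: how many active backends are currently
-- running each query text.
SELECT count(*) AS active_backends, left(query, 80) AS query_sample
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY left(query, 80)
ORDER BY active_backends DESC
LIMIT 20;
```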
Background Details
Symptoms
When we have an incident that emerges from `BufferMapping` lightweight lock contention, usually the cause is not obvious.
The more immediately visible symptoms tend to be widespread slowness and timeout errors affecting a wide variety of customer-facing endpoints — a generic “site degraded or down” suite of symptoms. Those symptoms emerge from the saturation of a resource in a common critical path shared by many endpoints (in this case one or more postgres instances).
Once we determine the bottleneck to be somewhere in the database, there are several ways to discover whether `BufferMapping` LWLock contention is consuming a significant amount of time: the `pg_stat_activity` view, the `pg_wait_sampling` extension, or the prometheus metrics and elasticsearch log events based on sampling them.
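For example, with the pg_wait_sampling extension installed, its cumulative profile view can be summarized directly (a sketch based on the extension’s documented view; reset or window it as appropriate):

```sql
-- Which wait events dominate the sampled profile; a large share of
-- LWLock / BufferMapping samples points at shared-buffer eviction contention.
SELECT p.event_type, p.event, sum(p.count) AS samples
FROM pg_wait_sampling_profile p
GROUP BY p.event_type, p.event
ORDER BY samples DESC
LIMIT 15;
```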
So what does BufferMapping contention mean?
Essentially the shared buffer pool is thrashing.
The shared buffer pool is a fixed-sized pool of 8 KB buffers, each holding a single 8 KB page of data from a file backing a specific table or index. (Custom builds of postgres may use other page sizes, but 8 KB pages are the default.)
Each time a running query needs to access table/index data from a file, it must read that page of file data into a shared buffer. If that particular page happens to already be in an existing shared buffer, it can use that buffer as-is. Otherwise, it has to choose a shared buffer and read that page into it. Because the shared buffer pool is a fixed size, reading a page into a buffer almost always requires evicting some other page from a buffer. This eviction is handled by the `BufferAlloc` function (source). In the usual case where eviction is required, it must briefly obtain an exclusive-mode lock on a specific 1 or 2 of the 16 `BufferMapping` lightweight locks (source). Obtaining those locks guarantees concurrency safety when updating the accounting metadata for the buffer — disassociating it from its old page contents and reassociating it with its new page contents.
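To see which relations currently occupy the shared buffer pool, the optional pg_buffercache extension can be queried; a sketch assuming the extension is installed in the current database (the 8 KB arithmetic matches the default page size above):

```sql
-- Top consumers of shared buffers for relations in the current database.
SELECT c.relname,
       count(*) AS buffers,
       round(count(*) * 8 / 1024.0, 1) AS approx_mb
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;
```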
Contention dynamics
Here’s where concurrency comes into play.
Obtaining 1 or 2 of those 16 locks in exclusive-mode conflicts with any other backends concurrently holding either lock in share-mode. There can be many concurrent share-mode lock holders; those do not conflict with each other. But whenever a backend needs exclusive-mode, it must both wait for existing lock holders and block subsequent lock attempts. In other words, each time a lock must be acquired in exclusive-mode, it acts as a serialization point, impeding all of the subsequent lockers for that portion of the shared buffer pool, even when they only want share-mode.
This effectively reduces query concurrency from N backends down to somewhere between 16 and 1 (trending towards 1 as the contention severity increases). This concurrency reduction lasts for whatever brief timespan exclusive-mode is in effect. Lightweight locks are typically held very briefly — often just a few microseconds between acquire and release. But when exclusive-mode is acquired too often, that effectively reduces the amount of time when multiple backends can concurrently make forward progress — reducing the system-wide query throughput and increasing query duration.
The impact of this serialization tends to be more severe when the normal number of concurrently active backends is larger, because:
- Each time exclusive-mode is needed, it blocks all other share-mode acquires of that lock.
- As more queries run concurrently, their collective working set size tends to grow, and that drives up eviction rate since the shared buffer pool’s size is fixed.
So both the probability and severity of a performance regression tend to increase in tandem with:
- the shared buffer pool’s eviction rate (i.e. the exclusive-mode lock acquisition rate of `BufferMapping` LWLocks)
- the number of concurrently active backends executing queries (since that increases working set size)
- the number of buffers visited by concurrently running queries
- a frequently run query whose execution plan touches thousands of buffers (e.g. such that its concurrency times buffer count sums to a non-trivial percentage of the shared buffer pool size and those concurrent executions are not likely to have a lot of overlap in the pages they access); the query sketched after this list can help surface such statements
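To surface frequently run statements that touch many shared buffers (the last factor above), the pg_stat_statements extension can be ranked by buffer traffic; a sketch, assuming that extension is installed:

```sql
-- Statements ranked by total shared buffers touched (hits + reads), with a
-- per-call figure to spot "moderate per call but very frequent" offenders.
SELECT calls,
       shared_blks_hit + shared_blks_read AS total_blks_touched,
       (shared_blks_hit + shared_blks_read) / NULLIF(calls, 0) AS blks_per_call,
       left(query, 80) AS query_sample
FROM pg_stat_statements
ORDER BY total_blks_touched DESC
LIMIT 20;
```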
Managing eviction rate: pool size, concurrency, execution plans for frequent queries
For `BufferMapping` LWLocks, exclusive mode is used for shared buffer eviction (as well as less common cases like invalidating a buffer when its backing page is being deleted).
Evictions are normal and happen regularly in virtually all dbs. Most contention events are trivially short, resolving in microseconds. This only becomes significant when the eviction rate causes enough contention to affect query response times and impede the query throughput rate.
When the eviction rate increases to the point where we can observe `BufferMapping` lock wait events affecting many backends concurrently, that’s a clear sign that the shared buffer pool is not big enough to hold the current “working set” of pages that are collectively required to satisfy the concurrently running queries’ execution plans. That is “cache thrashing” behavior: evicting pages frequently enough that collective overhead impedes throughput. A common flavor of this is when the working set size moderately exceeds the cache size; pages are repeatedly evicted and soon-after re-read in a repeating cycle due to a mismatch of demand vs. capacity.
Typically the OS kernel’s filesystem cache is larger than the postgres shared-buffer pool, so when postgres evicts a page from a shared buffer and then quickly re-reads it back into another shared buffer, often that does not require physical disk IO. Rather, the eviction overhead is the main concern, because it drives serialization.
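One coarse signal that the working set has outgrown the pool is the shared-buffer hit ratio from the standard statistics views (a sketch; note that blks_read counts reads from the OS, which usually land in the filesystem cache rather than physical disk):

```sql
-- Current shared_buffers setting and cumulative hit ratio for this database.
SHOW shared_buffers;

SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 4) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
```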
Solutions to this contention include:
- Reduce query concurrency per postgres instance:
  - This can take the form of:
    - Adding another replica db node to spread the workload across.
    - Reducing pgbouncer backend connection pool sizes.
  - Rationale: Fewer concurrently active backends should reduce the eviction rate (and the contention it drives) by reducing the working set size to better fit in the shared buffer pool. The shared buffer pool is a per-postgres-instance resource, so this tactic aims to reduce per-instance concurrency (either by reducing concurrent client connections or adding nodes for extra capacity).
  - How effective this is depends on the severity of the contention.
    - If contention affects the replicas, adding more replicas should always help.
    - Moderately reducing pgbouncer backend pool size may help and may be worth trying if contention affects the primary (where adding nodes is not an option).
- Query tuning to reduce buffer usage:
  - If any frequently run queries access many buffers (especially if the shared buffer pool spends a large proportion of its pages on the tables+indexes used by frequently run queries), check if those queries could be adjusted to use fewer buffers (e.g. use a better index, vacuum more often to clean up dead index item pointers). The EXPLAIN (ANALYZE, BUFFERS) sketch at the end of this section shows how to see where a query’s buffer usage comes from.
  - Or can that query be run less often (e.g. results-caching or special-casing)? Or refactored to make the common case cheaper?
- Grow the `shared_buffers` pool size:
  - Rarely a first choice, for several reasons:
    - Requires a postgres restart (intrusive). That restart temporarily reduces fleet capacity (down by 1 replica), driving more traffic onto the (already struggling) remaining replica nodes.
    - Overhead trade-offs may potentially make things worse. A larger pool of shared buffers should have a lower eviction rate at the same query concurrency (because the working set fits better into the pool), but its per-eviction overhead increases (which may offset the benefit of a lower eviction rate).
    - Lastly, growing `shared_buffers` implicitly reduces the memory available for other purposes (e.g. filesystem cache, `work_mem`, etc.). A proportionally large change to `shared_buffers` could introduce unwanted side-effects, such as shifting the bottleneck (e.g. to disk IO latency) without actually improving the query throughput and response times.
  - If we decide to try increasing `shared_buffers`, aim for a controlled experimental setup that is quick and safe to revert.
    - Example: Provision a new replica with a larger `shared_buffers` and otherwise identical specs and config. Allow for cache warming when first introduced to production traffic. Once a stable state is reached, consider setting patroni’s `noloadbalance` tag on one of the normal replicas; it remains a viable up-to-date replica, but clients will ignore it until we remove that tag. This reduces the count of active replicas to what it initially was, but now one of them has the adjusted `shared_buffers` setting. Since all active replicas should have comparable query rates, we can directly compare their usage metrics and concurrently capture any desired profiling runs. At any point, we can quickly and safely end the experiment by adding the `noloadbalance` tag to the experimental node and removing that tag from the normal healthy node.
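For the query-tuning path above, running the suspect statement with buffer accounting shows which plan nodes drive the buffer traffic. This is only a sketch; the table and predicate below are hypothetical placeholders for the real frequently run query:

```sql
-- Hypothetical example; substitute the real statement under investigation.
-- The "Buffers: shared hit=... read=..." lines on each plan node show where
-- the shared-buffer usage comes from, pointing at an index or table to focus on.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders                -- hypothetical table
WHERE customer_id = 12345  -- hypothetical predicate
ORDER BY created_at DESC
LIMIT 50;
```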