# `pg_txid_xmin_age` Saturation Alert

## Symptoms – how `pg_txid_xmin_age` may be associated with various issues

First of all, check whether there are ongoing issues that may be associated with the current `pg_txid_xmin_age` spike (examples from past incidents):
- `SubtransControlLock` spikes on standby nodes – check Thanos, using this query:

  ```promql
  sum by (fqdn) (avg_over_time(pg_stat_activity_marginalia_sampler_active_count{env="gprd", wait_event=~"[Ss]ubtrans.*"}[5m]))
  ```

  If you see a spike, this is a serious issue – most likely there are 5xx errors happening already, affecting users.
- Locking issues – a growing number of slow queries (>1 s) and timed-out queries (hitting the 15 s timeout), affecting users (if we are dealing with a long-running transaction that holds many locks, blocking other sessions).
- Autovacuum cannot delete tuples that became dead during the ongoing `pg_txid_xmin_age` spike – this can be checked in the PostgreSQL log (growing numbers of "XXXX are dead but not yet removable" reported when autovacuum finishes processing a table; see the sketch right after this list). This doesn't hurt immediately (10-20 minutes is not a big deal), but it contributes to bloat growth. A spike lasting an hour would definitely cause serious bloat growth for some tables.
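A minimal sketch for that log check (the log path is an assumption and depends on the node's `log_directory`/`log_filename` settings – adjust it as needed):

```shell
# Count recent "dead but not yet removable" reports from autovacuum; a growing count
# while the spike is ongoing confirms this symptom. The log path below is assumed.
sudo tail -n 10000 /var/log/gitlab/postgresql/current \
  | grep -c "dead but not yet removable"
```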
## What to do about it

Depending on the symptoms you observe, you may decide to choose:

- immediate intervention (see below) if symptoms are severe, or
- "watch and wait" tactics, being ready to act (if there are no serious issues).
### Diagnostics: root cause of the `pg_txid_xmin_age` spike

The spike of `pg_txid_xmin_age` can be caused by one of the following three reasons:

- AutoANALYZE on one of the large tables.
  - If it has already finished, it can be found in the PostgreSQL logs: search for "automatic analyze" and pay attention to the last value (`elapsed: XXX s`).
  - If it is still running, the logs won't show it – it is reported only at the end. But we can look at `pg_stat_activity`: search for `query ~ '^autovacuum'` and pay attention to the entries that are running ANALYZE (note that VACUUM entries are not harmful – they do not hold a snapshot, so they do not cause `pg_txid_xmin_age` spikes) and that have a `backend_xmin` value. Example snippet to use:

    ```sql
    select
      clock_timestamp() - xact_start as xact_duration,
      age(backend_xmin) as xmin_age,
      *
    from pg_stat_activity
    where query ~ '^autovacuum'
    order by 1 desc nulls last;
    ```
- A long-running transaction on the primary. Use a snippet like this in a psql session connected to the primary:

  ```sql
  select
    now(),
    now() - query_start as query_age,
    now() - xact_start as xact_age,
    pid,
    backend_type,
    state,
    client_addr,
    wait_event_type,
    wait_event,
    xact_start,
    query_start,
    state_change,
    query
  from pg_stat_activity
  where
    state != 'idle'
    and backend_type != 'autovacuum worker'
    and xact_start < now() - '60 seconds'::interval
  order by xact_age desc nulls last;
  ```

  Since we have very low values of `statement_timeout` (15 s for application sessions, but 0 for `gitlab-superuser`) and `idle_in_transaction_session_timeout` (30 s), in most cases (not all, though!) such a transaction will consist of many brief queries with brief pauses between them – so you may want to collect sampled data to understand the transaction's behavior. In this case, use a shell snippet like this one, collecting samples to a CSV file (replace `XXX` in the filename with the issue number, e.g. `production-1234`):

  ```shell
  while sleep 1; do
    sudo gitlab-psql -taX -c "copy(
      select
        now(),
        now() - query_start as query_age,
        now() - xact_start as xact_age,
        pid,
        backend_type,
        state,
        client_addr,
        wait_event_type,
        wait_event,
        xact_start,
        query_start,
        state_change,
        query
      from pg_stat_activity
      where
        state != 'idle'
        and backend_type != 'autovacuum worker'
        and xact_start < now() - '60 seconds'::interval
      order by xact_age desc nulls last
    ) to stdout csv" \
    | tee -a issue_XXX_long_tx_sampling.csv
  done
  ```

  Note that if the long-running transaction has already finished and its queries lasted less than `log_min_duration_statement` (currently 1 s), the PostgreSQL log won't have any entries for such a transaction, so we won't be able to understand what caused it and what queries it ran. That's why it is important to collect samples while it is still happening.
- A long-running transaction on a standby with `hot_standby_feedback = on`. To identify such a standby, check `pg_replication_slots` on the primary – the standby causing the issue will have a lagging `xmin` value (use `age(xmin)`; see the sketch right after this list). Then connect to that standby and use the query above to identify long-running transactions.
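A minimal sketch of that primary-side check (it relies only on the standard `pg_replication_slots` columns):

```sql
-- On the primary: find the replication slot whose xmin lags the most.
-- The slot with the largest xmin_age points at the standby that is holding back cleanup.
select
  slot_name,
  slot_type,
  active,
  age(xmin) as xmin_age,
  age(catalog_xmin) as catalog_xmin_age
from pg_replication_slots
order by xmin_age desc nulls last;
```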
Other interesting related metrics that are available:
## Mitigation

Once the root cause of the `pg_txid_xmin_age` spike is identified, consider interrupting the offending session using `pg_terminate_backend({PID})`:

- for autoANALYZE, interrupting it won't cause an issue (but it is recommended to run a manual `analyze (verbose, skip_locked) {table_name}` right after interrupting it, ensuring that autoANALYZE doesn't restart and compete with our manual attempt);
- for a regular long-running transaction, it is a business decision: the consequences of interrupting the transaction need to be analyzed. Interruption can be done using `pg_terminate_backend({PID})` as well. In any case, consider opening a https://gitlab.com/gitlab-org/gitlab issue so that product development teams can look into potential underlying problems.

Example:

```sql
select pg_terminate_backend(1451058);
```
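For the autoANALYZE case, the full sequence would look roughly like this (a sketch: the PID and `public.some_large_table` are hypothetical and must be taken from the diagnostics above):

```sql
-- Terminate the autoANALYZE worker holding the old snapshot (PID is an example).
select pg_terminate_backend(1451058);

-- Re-run ANALYZE manually right away so the table still gets fresh statistics
-- and a restarted autoANALYZE does not compete with the manual run.
analyze (verbose, skip_locked) public.some_large_table;
```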
## Post-checks

Once the root cause of the `pg_txid_xmin_age` spike is eliminated, ensure that:

- the spike has ended,
- the 5xx errors and slow Postgres queries have ended.
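A quick sanity check on the primary (a minimal sketch, assuming no other long-running transactions are expected at the moment):

```sql
-- Both values should be back to small numbers (roughly, a few seconds' worth of transactions):
-- neither live backends nor replication slots fed by hot_standby_feedback should hold old xmins.
select
  (select max(age(backend_xmin)) from pg_stat_activity)     as max_backend_xmin_age,
  (select max(age(xmin))         from pg_replication_slots) as max_slot_xmin_age;
```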