PatroniLongRunningTransactionDetected

This alert means that there are transactions older than 10 minutes running on the server listed in the alert.

This could be caused by many things, most often a large Sidekiq job. The alert will tell you which endpoint is executing the long-running transaction.

This is important because long-running transactions can prevent Postgres from running routine vacuuming, which can lead to bloat and database slowdowns.

Long-running transactions are one of the main drivers for pg_txid_xmin_age, and can result in severe performance degradation if left unaddressed.

The recipient of this alert should investigate what the long-running transaction is and whether it is going to cause performance problems. In most cases we will want to cancel the transaction.
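
To gauge how much a transaction is already holding back the xmin horizon (and therefore vacuum), a query along the following lines can be run via gitlab-psql on the affected host. This is only a sketch built from standard pg_stat_activity columns; adjust the column list and limit to taste.

-- Sketch: backends ordered by how far back they hold the xmin horizon
select pid,
       usename,
       state,
       age(backend_xmin) as xmin_age,
       now() - xact_start as xact_age,
       left(query, 80) as query_sample
from pg_stat_activity
where backend_xmin is not null
order by age(backend_xmin) desc
limit 10;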

  • Long Running Alerts Dashboard
  • This alert fires when we measure one or more transactions with an age greater than 9 minutes, AND we have been in this state for 1 minute - which gives a total of 10 minutes that a transaction can be active before alerting
  • Under normal conditions this dashboard should show lists of transactions with ages of less than 1 minute. Occasionally there will be transactions which have been running for longer, but very few should approach the threshold
  • This alert will clear once the long-running transactions are no longer active. The alert should only be silenced during an open Change Request
  • This alert should be fairly rare, and usually indicates that there is a query that is not behaving as we expect.
  • Previous Incidents for alert
  • This alert is unlikely to be causing active customer issues, and is most likely an S4
  • However, this alert could evolve into performance issues for all of GitLab.com
  • Check the Patroni SLI Overview Dashboard to determine whether we are already experiencing performance issues
  • Recent Patroni Service change issues
  • Recent Patroni Change Requests
  • This alert is likely to have been triggered by a recent deployment, rather than a database-related change.
  • If there is a deployment causing the issue, roll back the change that was deployed
  • If a change request caused the problem, follow the rollback instructions in the Change Request.

The first step is to figure out what the long-running transaction is doing. The alert will tell you which host to look at. From there, it is possible to get a query, pid, client_addr, and client_port:

iwiedler@patroni-main-2004-01-db-gprd.c.gitlab-production.internal:~$ sudo gitlab-psql
gitlabhq_production=# \x
Expanded display is on.
gitlabhq_production=# \pset pager 0
Pager usage is off.
gitlabhq_production=# select *,
now(),
now() - query_start as query_age,
now() - xact_start as xact_age
from pg_stat_activity
where
state != 'idle'
and backend_type != 'autovacuum worker'
and xact_start < now() - '60 seconds'::interval
order by now() - xact_start desc nulls last
;

In the event that the alert has already cleared because the query in question has completed or been canceled, it may still be possible to find the query in the Postgres pg_stat_activity logs. These transactions are dumped to Kibana every few minutes, so the same query may appear multiple times if it spans multiple dumps.

Usually Postgres connections go through a pgbouncer. The client_addr and client_port will tell you which one.

In this case it was pgbouncer-01-db-gprd as discovered via the client_port of 42792:

iwiedler@patroni-main-2004-01-db-gprd.c.gitlab-production.internal:~$ sudo netstat -tp | grep 42792
tcp 0 229 patroni-main:postgresql pgbouncer-01-db-g:42792 ESTABLISHED 3889825/postgres: g

Now that we have the pgbouncer address and port, we can log into that pgbouncer box and get the actual client.

This can be done by first running show sockets on the pgbouncer admin console, and finding the metadata for the backend port:

[email protected]:~$ sudo pgb-console -c 'show sockets' | grep 42792
type | user | database | state | addr | port | local_addr | local_port | connect_time | request_time | wait | wait_us | close_needed | ptr | link | remote_pid | tls | recv_pos | pkt_pos | pkt_remain | send_pos | send_remain | pkt_avail | send_avail
S | gitlab | gitlabhq_production | sv_active | 10.220.21.101 | 5432 | 10.217.4.3 | 42792 | 2022-10-28 08:33:46 UTC | 2022-10-28 09:19:56 UTC | 0 | 0 | 0 | 0x1906c30 | 0x7fc9a694e498 | 3889825 | | 0 | 0 | 0 | 0 | 0 | 0 | 0

Using this information, we can grab the link column and join it against show clients to discover the actual client:

[email protected]:~$ sudo pgb-console -c 'show clients' | grep 0x7fc9a694e498
type | user | database | state | addr | port | local_addr | local_port | connect_time | request_time | wait | wait_us | close_needed | ptr | link | remote_pid | tls
C | gitlab | gitlabhq_production | active | 10.218.5.2 | 49266 | 10.217.4.5 | 6432 | 2022-10-28 09:01:06 UTC | 2022-10-28 09:20:51 UTC | 0 | 0 | 0 | 0x7fc9a694e498 | 0x1906c30 | 0 |

Double-checking the connection tells us that it's console-01-sv-gprd:

[email protected]:~$ sudo netstat -tp | grep 49266
tcp 0 35 10.217.4.5:6432 console-01-sv-gpr:49266 ESTABLISHED 19532/pgbouncer

Resolving client address and port to process and user

Now that we’ve confirmed it’s a console user, we can look up which process on the console host holds that connection:

[email protected]:~$ sudo ss -tp | grep 49266
ESTAB 0 194 10.218.5.2:49266 10.217.4.5:6432 users:(("ruby",pid=3226026,fd=10))

We can then feed the pid into pstree to see the process hierarchy.
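
The exact invocation was not captured; a call along these lines (flags assumed, not taken from the original session) produces the kind of view shown below:

sudo pstree -pus 3226026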

systemd(1)───sshd(1369)───sshd(3225836)───sshd(3226021,arihant-rails)───bash(3226022)───script(3226024)───sudo(3226025,root)───ruby(3226026,git)─┬─{ruby}(3226033)
├─{ruby}(3226039)
├─{ruby}(3226040)
├─{ruby}(3226050)
├─{ruby}(3228019)
├─{ruby}(3228021)
├─{ruby}(3228022)
├─{ruby}(3228023)
├─{ruby}(3228024)
└─{ruby}(3228025)

In this case it was user arihant-rails who was running a script via rails console.

In most cases, we will want to cancel the query. If it’s a single long-running query, this can be done via pg_cancel_backend (passing in the pid):

gitlabhq_production=# select pg_cancel_backend(<pid>);
pg_cancel_backend
-------------------
t

However, if we are dealing with a long-running transaction consisting of many short-lived queries, it may be necessary to terminate the backend instead:

gitlabhq_production=# select pg_terminate_backend(<pid>);
pg_terminate_backend
----------------------
t
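
As a quick follow-up sketch, re-checking pg_stat_activity for the pid confirms the outcome: after pg_terminate_backend the row should be gone entirely, while after pg_cancel_backend the backend remains but should no longer be in a long-running transaction.

gitlabhq_production=# select pid, state, now() - xact_start as xact_age, left(query, 60) as query_sample
from pg_stat_activity
where pid = <pid>;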