PatroniLongRunningTransactionDetected
Overview
This alert means that there are transactions older than 10 minutes running on the server listed in the alert.
This could be caused by many things, most often a large Sidekiq job. The alert will tell you which endpoint is executing the long-running transaction.
This is important because long-running transactions can prevent Postgres from running routine vacuuming, which can lead to bloat and slowdowns of the database. Long-running transactions are one of the main drivers of pg_txid_xmin_age, and can result in severe performance degradation if left unaddressed.
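Because pg_txid_xmin_age is driven by the oldest xmin being held open, a quick way to see which backends are holding the horizon back is to sort pg_stat_activity by backend_xmin age. This is a sketch using standard Postgres columns, not part of the alerting pipeline itself:
gitlabhq_production=# select pid, usename, age(backend_xmin) as xmin_age, now() - xact_start as xact_age, left(query, 60) as query
from pg_stat_activity
where backend_xmin is not null
order by age(backend_xmin) desc
limit 5;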
The recipient of this alert should investigate what the long-running transaction is, and whether it is going to cause performance problems. In most cases we are going to want to cancel the transaction.
Services
- Patroni Service
- Team that owns the service: Production Engineering : Database Reliability
Metrics
- Long Running Alerts Dashboard
- This alert fires when we measure one or more transactions with an age greater than 9 minutes, AND we have been in this state for 1 minute, which gives a total of 10 minutes that a transaction can be active before alerting
- Under normal conditions this dashboard should show lists of transactions with ages of less than 1 minute. Occasionally there will be transactions which have been running for longer, but very few should approach the threshold
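To cross-check the dashboard from the database side, a sketch of a roughly equivalent query (run on the host named in the alert; the 9-minute cutoff mirrors the alert condition):
gitlabhq_production=# select count(*) as near_threshold, max(now() - xact_start) as oldest_xact
from pg_stat_activity
where state != 'idle' and backend_type = 'client backend'
  and xact_start < now() - interval '9 minutes';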
Alert Behavior
- This alert will clear once the long-running transactions are no longer active. The alert should only be silenced during an open Change Request
- This alert should be fairly rare, and usually indicates that there is a query that is not behaving as we expect.
- Previous Incidents for alert
Severities
- This alert is unlikely to be causing active customer issues, and is most likely an S4
- However, this alert could evolve into performance issues for all of GitLab.com
- Check the Patroni SLI Overview Dashboard to determine whether we are already experiencing performance issues
Verification
Recent changes
- Recent Patroni Service change issues
- Recent Patroni Change Requests
- This alert is likely to have been triggered by a recent deployment, rather than a database-related change.
- If there is a deployment causing the issue, roll back the change that was deployed
- If a change request caused the problem, follow the rollback instructions in the Change Request.
Troubleshooting
Getting the current query
The first step is to figure out what the long-running transaction is doing. The alert will tell you which host to look at. From there, it is possible to get a query, pid, client_addr, and client_port:
iwiedler@patroni-main-2004-01-db-gprd.c.gitlab-production.internal:~$ sudo gitlab-psql
gitlabhq_production=# \x
Expanded display is on.
gitlabhq_production=# \pset pager 0
Pager usage is off.
gitlabhq_production=# select *, now(), now() - query_start as query_age, now() - xact_start as xact_age
from pg_stat_activity
where state != 'idle' and backend_type != 'autovacuum worker' and xact_start < now() - '60 seconds'::interval
order by now() - xact_start desc nulls last;
In the event that the alert has already cleared because the query in question has completed or been canceled, it may still be possible to find the query in the Postgres pg_stat_activity logs. These transactions are dumped to Kibana every several minutes, so the same query may appear multiple times if it spans multiple dumps.
Getting the pgbouncer instance
Usually Postgres connections go through a pgbouncer. The client_addr and client_port will tell you which one.
In this case it was pgbouncer-01-db-gprd as discovered via the client_port of 42792:
iwiedler@patroni-main-2004-01-db-gprd.c.gitlab-production.internal:~$ sudo netstat -tp | grep 42792
tcp 0 229 patroni-main:postgresql pgbouncer-01-db-g:42792 ESTABLISHED 3889825/postgres: g
Resolving pgbouncer port to client
Now that we have the pgbouncer address and port, we can log into that pgbouncer box and get the actual client.
This can be done by first running show sockets on the pgbouncer admin console, and finding the metadata for the backend port:
[email protected]:~$ sudo pgb-console -c 'show sockets' | grep 42792
 type | user   | database            | state     | addr          | port | local_addr | local_port | connect_time            | request_time            | wait | wait_us | close_needed | ptr       | link           | remote_pid | tls | recv_pos | pkt_pos | pkt_remain | send_pos | send_remain | pkt_avail | send_avail
 S    | gitlab | gitlabhq_production | sv_active | 10.220.21.101 | 5432 | 10.217.4.3 | 42792      | 2022-10-28 08:33:46 UTC | 2022-10-28 09:19:56 UTC | 0    | 0       | 0            | 0x1906c30 | 0x7fc9a694e498 | 3889825    |     | 0        | 0       | 0          | 0        | 0           | 0         | 0
Using this information we can then grab the link column and join it against show clients to discover the actual client:
[email protected]:~$ sudo pgb-console -c 'show clients' | grep 0x7fc9a694e498
 type | user   | database            | state  | addr       | port  | local_addr | local_port | connect_time            | request_time            | wait | wait_us | close_needed | ptr            | link      | remote_pid | tls
 C    | gitlab | gitlabhq_production | active | 10.218.5.2 | 49266 | 10.217.4.5 | 6432       | 2022-10-28 09:01:06 UTC | 2022-10-28 09:20:51 UTC | 0    | 0       | 0            | 0x7fc9a694e498 | 0x1906c30 | 0          |
Double-checking connections tells us that it's console-01-sv-gprd:
[email protected]:~$ sudo netstat -tp | grep 49266
tcp 0 35 10.217.4.5:6432 console-01-sv-gpr:49266 ESTABLISHED 19532/pgbouncer
Resolving client address and port to process and user
Now that we’ve confirmed it’s a console user, we can look up which process on the console host holds that connection:
[email protected]:~$ sudo ss -tp | grep 49266
ESTAB 0 194 10.218.5.2:49266 10.217.4.5:6432 users:(("ruby",pid=3226026,fd=10))
And we can feed the pid into pstree to see the process hierarchy.
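The exact pstree invocation was not captured in this transcript; a command along these lines should reproduce the tree below (an assumption: -s shows parent processes, -p shows PIDs, -u shows user transitions):
sudo pstree -spu 3226026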
systemd(1)───sshd(1369)───sshd(3225836)───sshd(3226021,arihant-rails)───bash(3226022)───script(3226024)───sudo(3226025,root)───ruby(3226026,git)─┬─{ruby}(3226033)
                                                                                                                                                 ├─{ruby}(3226039)
                                                                                                                                                 ├─{ruby}(3226040)
                                                                                                                                                 ├─{ruby}(3226050)
                                                                                                                                                 ├─{ruby}(3228019)
                                                                                                                                                 ├─{ruby}(3228021)
                                                                                                                                                 ├─{ruby}(3228022)
                                                                                                                                                 ├─{ruby}(3228023)
                                                                                                                                                 ├─{ruby}(3228024)
                                                                                                                                                 └─{ruby}(3228025)
In this case it was user arihant-rails who was running a script via a Rails console.
Cancelling the query
In most cases, we will want to cancel the query. If it’s a single long-running query, this can be done via pg_cancel_backend (passing in the pid):
gitlabhq_production=# select pg_cancel_backend(<pid>);
 pg_cancel_backend
-------------------
 t
However, if we are dealing with a long-running transaction consisting of many short-lived queries, it may be necessary to terminate the backend instead:
gitlabhq_production=# select pg_terminate_backend(<pid>);
 pg_terminate_backend
----------------------
 t
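If several offending backends match the same pattern, they can be cancelled in one pass. This is a sketch rather than an established runbook step, so inspect the matched rows first (for example by running the query with the pg_cancel_backend call removed):
gitlabhq_production=# select pid, now() - xact_start as xact_age, pg_cancel_backend(pid)
from pg_stat_activity
where state != 'idle' and backend_type = 'client backend'
  and xact_start < now() - interval '10 minutes';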
Possible Resolutions
- 2023-02-01: long postgres transaction was a long-running transaction in a rails console session.
- 2021-08-18: Long running transaction on database was expected behavior as a foreign key was being added.
- 2023-10-21: Long Postgres transactions Vulnerabilities::MarkDroppedAsResolvedWorker was an application issue that warranted an InfraDev issue.
- Other incidents for this alert
Dependencies
- Background migrations
- Sidekiq jobs
- If the recipient of this alert can’t determine the cause of the long-running transaction and correct it using the troubleshooting steps above, it may be necessary to escalate
- Slack channels where help is likely to be found: #g_infra_database_reliability
- The main parameter that we can tune is the amount of time we allow before alerting. It is currently set to 10 minutes. We can also exclude certain types of transactions from alerting (see the sketch below for avoiding long transactions in console sessions).
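For ad-hoc console work, one way to avoid holding a long transaction open in the first place is to set conservative timeouts at the start of the session; a sketch using standard Postgres settings (the values shown are illustrative, not a mandated policy):
gitlabhq_production=# set statement_timeout = '5min';
SET
gitlabhq_production=# set idle_in_transaction_session_timeout = '1min';
SET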