# Sidekiq or Web/API is using most of its PgBouncer connections
## Symptoms

- Message in #alerts-general: `The pgbouncer service (main stage), pgbouncer_async_pool component has a saturation exceeding SLO`
- Message in #alerts: `PGBouncer is logging errors`
## Troubleshoot

### Imbalance on PgBouncer connections

- On the dedicated PgBouncer fleet (for read-write connections to the master), `pgbouncer_async_pool` is for Sidekiq and `pgbouncer_sync_pool` is for web/api.
- Check the PGBouncer Overview dashboard and look at `Backend Connections`. If the distribution of connections is uneven, then perhaps:
  - one PgBouncer node was down during a deployment and all (long-persisting) DB connections were made to the other nodes, or
  - the healthcheck service on one of the active nodes is down.
- Solution:
  - Check the iLB status of the primary and Sidekiq load balancers, and the metrics for the primary and Sidekiq.
  - Generally, all nodes in the load balancer should be active (before Dec 2019 we ran 2 active nodes and 1 warm spare, but now they are all active).
  - The healthcheck is a plain TCP check against PgBouncer, nothing more active than that.
  - All Puma processes that are connected to the database need to reconnect. Since almost all our Puma processes now run in Kubernetes, you need to take an action that effectively cycles all Puma pods. The simplest way to do this is to open an MR against the gitlab-com repository, adding a new annotation (e.g. `cycle-puma: true`) to the `webservice` deployment. Once this is merged, all pods will be cycled.
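Since the healthcheck is only a TCP connect, it can be reproduced by hand when you suspect a node is being marked down. A minimal Ruby sketch (the `tcp_check` helper and its host/port arguments are illustrative, not part of our tooling; PgBouncer conventionally listens on 6432):

```ruby
require 'socket'

# Minimal TCP reachability probe -- the same kind of check the load
# balancer's healthcheck performs: pass if the port accepts a connection.
def tcp_check(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  false
end
```

For example, `tcp_check('localhost', 6432)` returns `false` if nothing is listening on that port.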
### Too many Sidekiq connections

- Check the PostgreSQL dashboard and look at several graphs:
  - `PGBouncer Errors`
  - `Slow Lock Acquires`
  - `Locks across all hosts`
- Find the PostgreSQL master and take a dump of all SQL queries:

  ```sql
  COPY (SELECT * FROM pg_stat_activity) TO '/tmp/queries.csv' WITH CSV DELIMITER ',';
  ```
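To get a quick read on what those backends are doing, you can tally the dump by column. A sketch using Ruby's stdlib CSV (the inline data and column positions are made up for illustration; `pg_stat_activity`'s column order varies by PostgreSQL version, so check your dump first):

```ruby
require 'csv'

# Toy stand-in for /tmp/queries.csv. Column positions here are an
# assumption for illustration: col 0 = user, col 1 = backend state.
data = <<~CSV
  gitlab,active
  gitlab,idle in transaction
  gitlab,active
  pgbouncer,idle
CSV

# Count backends per state to see what dominates the connection pool.
state_counts = Hash.new(0)
CSV.parse(data).each { |row| state_counts[row[1]] += 1 }
puts state_counts
```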
- Download `sq.rb` and run it to log all Sidekiq jobs and their arguments:

  ```shell
  curl -o /tmp/sq.rb https://gitlab.com/gitlab-com/runbooks/-/raw/master/scripts/sidekiq/sq.rb
  sudo gitlab-rails runner /tmp/sq.rb --job-type='*' show > /tmp/sidekiq-jobs.txt
  ```
- Look inside the file for the Sidekiq queue breakdown. For example, you might see something like:

  ```
  -----------
  Queue size:
  -----------
  Gitlab::GithubImport::ImportPullRequestWorker: 17620
  PipelineProcessWorker: 11000
  GitGarbageCollectWorker: 9607
  PagesDomainVerificationWorker: 8428
  BuildFinishedWorker: 6702
  CreateGpgSignatureWorker: 4807
  BuildQueueWorker: 2792
  StageUpdateWorker: 531
  ProjectImportScheduleWorker: 243
  RepositoryUpdateMirrorWorker: 234
  BuildSuccessWorker: 45
  DetectRepositoryLanguagesWorker: 20
  BuildHooksWorker: 16
  ExpireJobCacheWorker: 14
  <snip>
  ```

  You will also see lines such as:

  ```
  ["Gitlab::GithubImport::ImportPullRequestWorker", [10267729, {"iid"=>13081, "...
  ```

  Each line represents a job encoded as a JSON payload. The first item is the class name of the worker (e.g. `Gitlab::GithubImport::ImportPullRequestWorker`). The next item in the array is the job arguments. For this worker, the first job argument is the project ID. Using this information, you can selectively kill jobs by their project ID.
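To see which project dominates the queue, you can tally those payload lines by their first job argument. A sketch with made-up sample lines (the real ones come from `/tmp/sidekiq-jobs.txt`); a regex is used rather than a JSON parser because the lines are Ruby inspect output, not strict JSON:

```ruby
# Made-up sample lines in the format sq.rb prints: [worker_class, [args...]]
lines = [
  '["Gitlab::GithubImport::ImportPullRequestWorker", [1000, {"iid"=>1}]]',
  '["Gitlab::GithubImport::ImportPullRequestWorker", [1000, {"iid"=>2}]]',
  '["Gitlab::GithubImport::ImportPullRequestWorker", [2000, {"iid"=>3}]]',
]

# Extract the worker class and the first job argument (the project ID for
# this worker), then count jobs per project.
per_project = Hash.new(0)
lines.each do |line|
  next unless line =~ /\A\["([^"]+)", \[(\d+)/
  per_project[Regexp.last_match(2).to_i] += 1 if Regexp.last_match(1).end_with?('ImportPullRequestWorker')
end
puts per_project  # e.g. {1000=>2, 2000=>1}
```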
- For example, suppose project ID 1000 appears to have many jobs, and you want to remove all jobs relating to that project. In `gitlab-rails console`, you can run:

  ```ruby
  project_id = 1000
  queues = Sidekiq::Queue.all
  queues.each do |queue|
    queue.each do |job|
      next unless job.klass == 'Gitlab::GithubImport::ImportPullRequestWorker'
      job.delete if job.args[0] == project_id
    end
  end
  ```
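Before deleting anything, it can be worth dry-running the same predicate to confirm the blast radius. A self-contained sketch of the matching logic with made-up `[class_name, args]` tuples (in the console, `job.klass` and `job.args` supply the same two pieces of data):

```ruby
# Stand-in jobs as [class_name, args] tuples, mirroring the payload format.
jobs = [
  ['Gitlab::GithubImport::ImportPullRequestWorker', [1000, { 'iid' => 1 }]],
  ['PipelineProcessWorker', [42]],
  ['Gitlab::GithubImport::ImportPullRequestWorker', [2000, { 'iid' => 2 }]],
]

project_id = 1000

# Partition instead of delete: see what WOULD be removed before acting.
to_delete, to_keep = jobs.partition do |klass, args|
  klass == 'Gitlab::GithubImport::ImportPullRequestWorker' && args[0] == project_id
end
puts "would delete #{to_delete.size}, keep #{to_keep.size}"
```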