# Sidekiq or Web/API is using most of its PgBouncer connections
## Symptoms

- Message in #alerts-general: `The pgbouncer service (main stage), pgbouncer_async_pool component has a saturation exceeding SLO`
- Message in #alerts: `PGBouncer is logging errors`
## Troubleshoot

### Imbalance on PgBouncer connections

- On the dedicated PgBouncer fleet (for read-write connections to the master), `pgbouncer_async_pool` is for Sidekiq and `pgbouncer_sync_pool` is for web/api.
- Check the PGBouncer Overview dashboard and look at `Backend Connections`. If the distribution of connections is uneven, then perhaps:
  - one PgBouncer node was down during a deployment and all (long-persisting) DB connections were made to the other nodes, or
  - the healthcheck service on one of the active nodes is down.
- Solution:
  - Check the iLB status of the primary and Sidekiq load balancers, and the metrics for the primary and Sidekiq.
  - Generally, all nodes in the load balancer should be active (before Dec 2019 we ran 2 active nodes and 1 warm spare, but now they are all active).
  - The healthcheck is a plain TCP check against PgBouncer, nothing more active than that.
  - All Puma processes that are connected to the database need to reconnect. Since almost all our Puma processes now run in Kubernetes, you need to take an action that effectively cycles all Puma pods. The simplest way to do this is to open an MR against the gitlab-com repository, adding a new annotation (e.g. `cycle-puma: true`) to the `webservice` deployment. Once this is merged, all pods will be cycled.
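Since the healthcheck is only a TCP connect, it can be reproduced by hand when you suspect a node is being marked down. A minimal Ruby sketch (the `tcp_check` helper and its host/port arguments are illustrative, not part of our tooling; PgBouncer conventionally listens on 6432):

```ruby
require 'socket'

# Minimal TCP reachability probe -- the same kind of check the load
# balancer's healthcheck performs: pass if the port accepts a connection.
def tcp_check(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  false
end
```

For example, `tcp_check('localhost', 6432)` returns `false` if nothing is listening on that port.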
### Too many Sidekiq connections

- Check the PostgreSQL dashboard and look at several graphs:
  - `PGBouncer Errors`
  - `Slow Lock Acquires`
  - `Locks across all hosts`
- Find the PostgreSQL master and take a dump of all SQL queries:

  ```sql
  COPY (SELECT * FROM pg_stat_activity) TO '/tmp/queries.csv' WITH CSV DELIMITER ',';
  ```
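To get a quick read on what those backends are doing, you can tally the dump by column. A sketch using Ruby's stdlib CSV (the inline data and column positions are made up for illustration; `pg_stat_activity`'s column order varies by PostgreSQL version, so check your dump first):

```ruby
require 'csv'

# Toy stand-in for /tmp/queries.csv. Column positions here are an
# assumption for illustration: col 0 = user, col 1 = backend state.
data = <<~CSV
  gitlab,active
  gitlab,idle in transaction
  gitlab,active
  pgbouncer,idle
CSV

# Count backends per state to see what dominates the connection pool.
state_counts = Hash.new(0)
CSV.parse(data).each { |row| state_counts[row[1]] += 1 }
puts state_counts
```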
- Download `sq.rb` and run it to log all Sidekiq jobs and their arguments:

  ```shell
  curl -o /tmp/sq.rb https://gitlab.com/gitlab-com/runbooks/-/raw/master/scripts/sidekiq/sq.rb
  sudo gitlab-rails runner /tmp/sq.rb --job-type='*' show > /tmp/sidekiq-jobs.txt
  ```
- Look inside the file for the Sidekiq queue breakdown. For example, you might see something like:

  ```
  -----------
  Queue size:
  -----------
  Gitlab::GithubImport::ImportPullRequestWorker: 17620
  PipelineProcessWorker: 11000
  GitGarbageCollectWorker: 9607
  PagesDomainVerificationWorker: 8428
  BuildFinishedWorker: 6702
  CreateGpgSignatureWorker: 4807
  BuildQueueWorker: 2792
  StageUpdateWorker: 531
  ProjectImportScheduleWorker: 243
  RepositoryUpdateMirrorWorker: 234
  BuildSuccessWorker: 45
  DetectRepositoryLanguagesWorker: 20
  BuildHooksWorker: 16
  ExpireJobCacheWorker: 14
  <snip>
  ```

  You will also see lines such as:

  ```
  ["Gitlab::GithubImport::ImportPullRequestWorker", [10267729, {"iid"=>13081, "...
  ```

  Each line represents a job encoded as a JSON payload. The first item is the class name of the worker (e.g. `Gitlab::GithubImport::ImportPullRequestWorker`). The next item in the array is the job arguments. For this worker, the first job argument is the project ID. Using this information, you can selectively kill jobs by their project ID.
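To see which project dominates the queue, you can tally those payload lines by their first job argument. A sketch with made-up sample lines (the real ones come from `/tmp/sidekiq-jobs.txt`); a regex is used rather than a JSON parser because the lines are Ruby inspect output, not strict JSON:

```ruby
# Made-up sample lines in the format sq.rb prints: [worker_class, [args...]]
lines = [
  '["Gitlab::GithubImport::ImportPullRequestWorker", [1000, {"iid"=>1}]]',
  '["Gitlab::GithubImport::ImportPullRequestWorker", [1000, {"iid"=>2}]]',
  '["Gitlab::GithubImport::ImportPullRequestWorker", [2000, {"iid"=>3}]]',
]

# Extract the worker class and the first job argument (the project ID for
# this worker), then count jobs per project.
per_project = Hash.new(0)
lines.each do |line|
  next unless line =~ /\A\["([^"]+)", \[(\d+)/
  per_project[Regexp.last_match(2).to_i] += 1 if Regexp.last_match(1).end_with?('ImportPullRequestWorker')
end
puts per_project  # e.g. {1000=>2, 2000=>1}
```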
- For example, suppose project ID 1000 appears to have many jobs, and you want to remove all jobs relating to that project. In `gitlab-rails console`, you can run:

  ```ruby
  project_id = 1000
  queues = Sidekiq::Queue.all
  queues.each do |queue|
    queue.each do |job|
      next unless job.klass == 'Gitlab::GithubImport::ImportPullRequestWorker'
      job.delete if job.args[0] == project_id
    end
  end
  ```
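Before deleting anything, it can be worth dry-running the same predicate to confirm the blast radius. A self-contained sketch of the matching logic with made-up `[class_name, args]` tuples (in the console, `job.klass` and `job.args` supply the same two pieces of data):

```ruby
# Stand-in jobs as [class_name, args] tuples, mirroring the payload format.
jobs = [
  ['Gitlab::GithubImport::ImportPullRequestWorker', [1000, { 'iid' => 1 }]],
  ['PipelineProcessWorker', [42]],
  ['Gitlab::GithubImport::ImportPullRequestWorker', [2000, { 'iid' => 2 }]],
]

project_id = 1000

# Partition instead of delete: see what WOULD be removed before acting.
to_delete, to_keep = jobs.partition do |klass, args|
  klass == 'Gitlab::GithubImport::ImportPullRequestWorker' && args[0] == project_id
end
puts "would delete #{to_delete.size}, keep #{to_keep.size}"
```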