Sidekiq or Web/API is using most of its PgBouncer connections

  • Message in #alerts-general: The pgbouncer service (main stage), pgbouncer_async_pool component has a saturation exceeding SLO
  • Message in #alerts: PGBouncer is logging errors
  1. Imbalance in pgbouncer connections
  • On the dedicated pgbouncer fleet (for rw connections to the master), pgbouncer_async_pool is for sidekiq, pgbouncer_sync_pool for web/api.
  • Check the PGBouncer Overview dashboard and look at Backend Connections:
    • if the distribution of connections is uneven (this can also be confirmed directly on the pgbouncer nodes; see the sketch at the end of this section), then perhaps:
      • one pgbouncer node was down during a deployment and all (long-lived) db connections were made to the other nodes
      • or the healthcheck service on one of the active nodes is down.
    • Solution:
      • check the iLB status of the primary and sidekiq load balancers, and the metrics for each of them
      • Generally, all nodes in the load balancer should be active (before Dec 2019 we ran 2 active nodes and 1 warm spare, but now all nodes are active)
        • The healthcheck is a plain TCP check against pgbouncer, nothing more sophisticated
      • All puma processes that are connected to the database need to reconnect. Since almost all of our puma processes now run in Kubernetes, you need to cycle all puma pods. The simplest way to do this is to open an MR against the gitlab-com repository that adds a new annotation (e.g. cycle-puma: true) to the webservice deployment; once this is merged, all pods will be cycled.
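  • To confirm an imbalance directly on the pgbouncer nodes, you can check the TCP healthcheck target and the per-pool connection counts via the pgbouncer admin console. This is only a sketch: the hostname is a placeholder, and it assumes the admin console is reachable on the standard port 6432 with a user listed in admin_users or stats_users:

    # Plain TCP check, i.e. the same thing the load balancer healthcheck does
    nc -zv <pgbouncer-node> 6432
    # Per-pool client/server connection counts on this node
    psql -h <pgbouncer-node> -p 6432 -U pgbouncer -d pgbouncer -c 'SHOW POOLS;'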
  2. Too many Sidekiq connections
  • Check the PostgreSQL dashboard and look at several graphs:

    • PGBouncer Errors
    • Slow Lock Acquires
    • Locks across all hosts
  • Find the PostgreSQL master and take a dump of all currently running SQL queries:

    COPY (SELECT * FROM pg_stat_activity) TO '/tmp/queries.csv' WITH CSV DELIMITER ',';
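  • For a quick breakdown of where those connections come from, you can also aggregate pg_stat_activity directly. This is only a sketch; it assumes you are on the current master and that the omnibus gitlab-psql wrapper is available there:

    sudo gitlab-psql -c "SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2, 3 ORDER BY count DESC;"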
  • Download sq.rb and run it to log all Sidekiq jobs and their arguments:

    curl -o /tmp/sq.rb https://gitlab.com/gitlab-com/runbooks/-/raw/master/scripts/sidekiq/sq.rb
    sudo gitlab-rails runner /tmp/sq.rb --job-type='*' show > /tmp/sidekiq-jobs.txt
  • Look inside the file for the Sidekiq queue breakdown. For example, you might see something like:

    -----------
    Queue size:
    -----------
    Gitlab::GithubImport::ImportPullRequestWorker: 17620
    PipelineProcessWorker: 11000
    GitGarbageCollectWorker: 9607
    PagesDomainVerificationWorker: 8428
    BuildFinishedWorker: 6702
    CreateGpgSignatureWorker: 4807
    BuildQueueWorker: 2792
    StageUpdateWorker: 531
    ProjectImportScheduleWorker: 243
    RepositoryUpdateMirrorWorker: 234
    BuildSuccessWorker: 45
    DetectRepositoryLanguagesWorker: 20
    BuildHooksWorker: 16
    ExpireJobCacheWorker: 14
    <snip>

    You will also see lines such as:

    ["Gitlab::GithubImport::ImportPullRequestWorker", [10267729, {"iid"=>13081, "...

    Each line represents a job, encoded as a JSON payload. The first item is the class name of the worker (e.g. Gitlab::GithubImport::ImportPullRequestWorker).

    The next item in the array is the job arguments. Inside the job arguments, the first item for this worker is the project ID. Using this information, you can selectively kill jobs by their project ID.
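    To see which project IDs dominate for a given worker, you can extract the first argument from these lines with standard shell tools. This is only a sketch and assumes the line format shown above:

    grep '^\["Gitlab::GithubImport::ImportPullRequestWorker"' /tmp/sidekiq-jobs.txt \
      | sed 's/^\["[^"]*", \[\([0-9]*\),.*/\1/' \
      | sort | uniq -c | sort -rn | head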

  • For example, suppose project ID 1000 appears to have many jobs, and you want to remove all jobs relating to that project. In gitlab-rails console, you can run:

    project_id = 1000
    queues = Sidekiq::Queue.all
    queues.each do |queue|
      queue.each do |job|
        # Only touch the worker class we care about
        next unless job.klass == 'Gitlab::GithubImport::ImportPullRequestWorker'
        # For this worker, the first job argument is the project ID
        job.delete if job.args[0] == project_id
      end
    end
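  • Afterwards, you can re-run the sq.rb report from above to confirm that the queue size for that worker has dropped, for example:

    sudo gitlab-rails runner /tmp/sq.rb --job-type='*' show | head -n 40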