Large CI pending builds
Alert Name: CICDTooManyPendingJobsPerNamespace or CICDTooManyRunningJobsPerNamespaceOnSharedRunnersGitLabOrg
The most common problem is a report that we have a large number of pending CI builds.
- Check the CI dashboard and verify that we have a large number of CI builds,
- Verify graphs and potential outcomes out of the graphs as described in CI graphs,
- Verify the number of errors,
- Verify that machines are created on shared-runners-manager-X.gitlab.com,
- Verify that docker machine operates correctly.
1. Check CI dashboard and verify that we have a large number of CI builds
Look at the graph showing the number of CI builds.
2. Verify graphs and potential outcomes out of the graphs as described in CI graphs
To understand what is wrong, you need to find the cause.
- Check runner auto-scaling: CI auto-scaling graphs, and look for the Idle number,
- Verify jobs queues: CI auto-scaling graphs. If you see a single namespace with a lot of builds, verify what projects are in that namespace and whether this is the abuser,
- Verify long polling behavior (we are not yet aware of potential problems as of now),
- Verify workhorse queueing: Workhorse queueing graphs. If you see a large number of requests ending up in the queue, it may indicate that the CI API is degraded. Verify the performance of the builds/register endpoint: https://dashboards.gitlab.net/dashboard/db/grape-endpoints?var-action=Grape%23POST%20%2Fbuilds%2Fregister&var-database=Production,
- Verify runners uptime. If the runners uptime varies, the Runners Manager is most likely dying because of a crash. This will show up in the runners manager logs:
grep panic /var/log/messages
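The panic check in the last item can be made slightly more informative by counting the matching lines and showing the most recent ones. The following is a minimal sketch, assuming the log lives at /var/log/messages as above; the function name panic_summary is hypothetical and not part of any runner tooling.

```shell
# Hypothetical helper: summarize panic lines in a runners manager log.
# Usage: panic_summary [logfile]   (defaults to /var/log/messages)
panic_summary() {
  log="${1:-/var/log/messages}"
  # grep -c prints the number of matching lines (0 if there are none).
  count=$(grep -c panic "$log")
  echo "panic lines: $count"
  # Show the last few matches so the most recent crash is visible.
  grep panic "$log" | tail -n 5
}
```

A count that keeps growing between two runs would be consistent with the Runners Manager crashing repeatedly.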
3. Verify the number of errors
Generally, a high number of errors is not a big problem, but it generates a lot of noise in the logs. It is safe to run that runbook.
You should then also cross-check the state between Digital Ocean and the runners manager as described in this issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/921 (this should be moved to a script and runbook).
5. Verify that machines are created on shared-runners-manager-X.gitlab.com
Log in to the runners manager and execute:
journalctl -xef | grep "Machine created"
You should see a constant stream of machines being created:
Mar 20 13:16:36 shared-runners-manager-2 gitlab-ci-multi-runner[19931]: time="2017-03-20T13:16:36Z" level=info msg="Machine created" fields.time=43.913563388s name=runner-4e4528ca-machine-1490015752-629c75cb-digital-ocean-4gb now=2017-03-20 13:16:36.246859005 +0000 UTC retries=0 time=43.913563388s
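To judge whether the stream really is constant, it can help to count "Machine created" events per minute instead of watching the live tail. The sketch below assumes journal lines in the format shown above (month, day, and HH:MM:SS as the first three fields); the function name machines_created_per_minute is hypothetical.

```shell
# Hypothetical helper: count "Machine created" events per minute.
# Reads journal lines in the format shown above on stdin.
machines_created_per_minute() {
  grep 'msg="Machine created"' |
    # Fields 1-3 are month, day and HH:MM:SS; keep only HH:MM of the time.
    awk '{ print $1, $2, substr($3, 1, 5) }' |
    # Group identical minutes and prefix each with its event count.
    sort | uniq -c
}
```

It could be fed with, for example, journalctl --since "1 hour ago" piped into the function. Gaps of several minutes with no entries would suggest that machine creation has stalled.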
If you don’t see it, check the logs from docker machine:
journalctl -xef | grep operation=create
Mar 20 13:17:56 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:17:56Z" level=info msg="Running pre-create checks..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:17:57 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:17:57Z" level=info msg="Creating machine..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:17:57 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:17:57Z" level=info msg="(runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb) Creating SSH key..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:17:58 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:17:58Z" level=info msg="(runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb) Creating Digital Ocean droplet..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:18:03 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:18:03Z" level=info msg="(runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb) Waiting for IP address to be assigned to the Droplet..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:18:04 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:18:04Z" level=info msg="(runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb) Created droplet ID 42980631, IP address 159.203.179.170" driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:18:34 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:18:34Z" level=info msg="Waiting for machine to be running, this may take a few minutes..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
If it fails to create you will see a message here.
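Because the healthy output above is all logged at level=info, one way to spot a failure quickly is to hide the info lines. This is only a sketch: it assumes that failures are logged at a level other than info, which this runbook does not confirm, and the function name create_failures is hypothetical.

```shell
# Hypothetical helper: show create-operation log lines that are not info-level.
# Assumption: failed creations are logged at a level other than "info".
create_failures() {
  # Keep only docker machine create lines, then drop the info-level ones.
  grep 'operation=create' | grep -v 'level=info'
}
```

It reads journal output on stdin, so it can sit at the end of the same journalctl pipeline used above. If it prints nothing while machines are still not being created, fall back to reading the unfiltered output.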
6. Verify that docker machine operates correctly
You should try to create a machine manually:
docker-machine create -d google test-machine --google-project=gitlab-ci-155816 --google-disk-size=25 --google-machine-type=n1-standard-1 --google-username=core --google-operation-backoff-initial-interval=2 --google-subnetwork=shared-runners --google-zone=us-east1-d --engine-opt=mtu=1460 --engine-opt=ipv6 --engine-opt=fixed-cidr-v6=fc00::/7 --google-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write --google-machine-image=gitlab-ci-155816/global/images/runners-coreos-stable-v20190822-0
This command should succeed. If it does not, investigate why it failed.
Once it is created, you can log in to the machine:
docker-machine ssh test-machine
Then try to run some docker containers to verify that networking and DNS work properly.
$ docker run -it docker:git /bin/sh
Afterward tear down the machine:
docker-machine rm test-machine
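The create, check, and tear-down steps above can be chained so that the test machine is always removed, even when a check fails. The following is a sketch under stated assumptions: the function name smoke_test_machine and the variables CREATE_FLAGS and DOCKER_MACHINE are hypothetical, CREATE_FLAGS is expected to hold the same driver flags as the create command above, and the nslookup of gitlab.com inside docker:git is an assumed DNS probe rather than a step from this runbook.

```shell
# Hypothetical wrapper around the manual steps above.
# CREATE_FLAGS: the driver flags from the create command above (not repeated here).
# DOCKER_MACHINE: binary to call, defaults to docker-machine.
smoke_test_machine() {
  name="${1:-test-machine}"
  dm="${DOCKER_MACHINE:-docker-machine}"
  # CREATE_FLAGS is intentionally unquoted so it splits into separate flags.
  "$dm" create $CREATE_FLAGS "$name" || { echo "create failed"; return 1; }
  rc=0
  # Assumed probe: resolve a hostname from inside a container on the new machine.
  "$dm" ssh "$name" "docker run --rm docker:git nslookup gitlab.com" ||
    { echo "container networking check failed"; rc=2; }
  # Always tear the machine down, even if the check failed.
  "$dm" rm -y "$name" || echo "teardown failed, remove $name manually"
  return $rc
}
```

A return code of 1 points at machine creation, and 2 points at docker-engine or connectivity on the machine, which matches the failure causes listed below.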
If any of the commands fails, it can mean any of the following:
- there’s a problem with docker-machine creating machine,
- there’s a problem with docker-engine on machine,
- there’s a problem with connectivity from docker-machine.
You may need to:
- verify if it’s a problem of docker version,
- verify if it’s a problem of coreos-stable,
- verify if it’s a problem of networking out of the container: DNS?