Linux CI/CD Runners fleet graceful shutdown procedure
Preflight checklist
Section titled “Preflight checklist”Before you will start any work
- Make sure that you meet Administrator prerequisites before you will start any work.
- Not in a PCL time window.
- Change Management issue was created for the change that requires the usage of Graceful Shutdown.
What is Graceful Shutdown
Section titled “What is Graceful Shutdown”However jobs are physically executed in different environments depending on the configured executors, the full data about the job, it’s current status, buffer of the log, details about connection to the execution environment etc. are stored in GitLab Runner process’ memory.
Simple termination of a Runner that is currently handling a running job have very negative consequences:
-
From GitLab’s (so also user’s!) perspective the job looks like cancelled. As such cancel is not asked nor expected by the user, it’s perceived as a system failure and in a bigger scale will be reported as a platform outage.
-
Despite the job log is anymore updated nor the status is passed, the job execution keeps happening because nothing have requested it termination. This may also have negative consequences, depending on what the job was doing.
As each of our Runner Managers is handling even up to ~800 jobs at once, terminating the process with - for example -
kill -9 PID_OF_RUNNER
means that these 800 jobs will be left unhandled and look like unwillingly canceled from the
GitLab user perspective.
For that purpose GitLab Runner have built in procedure that we name Graceful Shutdown
.
When Graceful Shutdown is initiated, Runner process stops asking for new jobs but continues to execute the ones that were already started. Only when the last job is finished and the job status is updated on GitLab side, the process exits.
Now, because jobs can hang, we need to be able to force the shutdown.
In case of our CI Runners fleet, the Runners are installed as systemd
services. The service definition is overridden
to use the SIGQUIT
signal to initiate Graceful Shutdown. The service is also configured to force-terminate the process
after 7200 seconds, which should be enough for most of the legitimate jobs to exit.
To handle Runner termination, configuration changes and new version deployments we have dedicated scripting that orchestrates all required commands.
Graceful Shutdown and different Runner Manager types (srm, gsrm, prm, gdsrm etc.)
Section titled “Graceful Shutdown and different Runner Manager types (srm, gsrm, prm, gdsrm etc.)”As it’s described, we maintain different types of Runner Managers. All managers from within one group share the same execution environment setting, capacity etc. As running Graceful Shutdown means that a Runner doesn’t request new jobs, it’s important to remember that we can shutdown only one runner from within a group at once.
For example, let’s consider we have two shared-runners-manager-X
(srm1
and srm2
) and two
private-runners-manager-X
(prm1
and prm2
). When shutting down Runner processes for whatever reason, we can
terminate for example both srm1
and prm1
at once. Or srm1
and prm2
. But we can’t never terminate both srm1
and srm2
or both prm1
and prm2
. Only one srmX
and only one prmX
can be down (or in the process of going
down with the Graceful Shutdown) at once.
Graceful Shutdown scripting
Section titled “Graceful Shutdown scripting”We have a script at /root/runner_upgrade.sh
that abstracts all the steps needed for Graceful Shutdown handling.
The script is added from the
cookbook-wrapper-gitlab-runner
cookbook.
Always use it instead of interacting directly with Systemd’s Runner service or with the process itself (for example
don’t use the kill
command!).
Graceful Shutdown procedures
Section titled “Graceful Shutdown procedures”All commands that are described bellow can be executed either directly on the host (after SSH-ing there) or through
knife ssh
. by specifying a node selector or a role selector.
With node selector remember to match only one node of a specific type at once.
With role selector, always try to use the less general role. For example, instead of using gitlab-runner-base
,
use gitlab-runner-srm
or gitlab-runner-prm
role. Also always use the -C 1
flag to instruct knife ssh
to
run the command on only one runner at a time.
How to stop Runner Manager with Graceful Shutdown
Section titled “How to stop Runner Manager with Graceful Shutdown”-
Run the stop script:
Terminal window sudo /root/runner_upgrade.sh stopThis command will suspend
chef-client
(to make sure it will not restart Runner’s service automatically nor add any other unexpected change) and terminate Runner’s process with the usage of Graceful Shutdown. -
Do whatever you needed to do with terminated Runner.
-
Restart the Runner with:
Terminal window sudo /root/runner_upgrade.sh updateThis will restore the
chef-client
, cleanup old non-removed Docker Machine VMs that were left, refresh the configuration and finally start the Runner process.
How to stop or restart Runner Manager’s VM with Graceful Shutdown
Section titled “How to stop or restart Runner Manager’s VM with Graceful Shutdown”-
Do shutdown
-
If you want to stop the VM
-
Run the script:
Terminal window sudo /root/runner_upgrade.sh stop_and_poweroffThis command will suspend
chef-client
(to make sure it will not restart Runner’s service automatically nor add any other unexpected change) and terminate Runner’s process with the usage of Graceful Shutdown. Finally it will turn off the VM. -
Do whatever you needed to do with Runner’s VM terminated.
-
Restart the VM in GCP console.
-
-
If you want to restart the VM, run the script:
Terminal window sudo /root/runner_upgrade.sh stop_and_rebootThis command will suspend
chef-client
(to make sure it will not restart Runner’s service automatically nor add any other unexpected change) and terminate Runner’s process with the usage of Graceful Shutdown. Finally it will reboot the VM.
-
-
Restart the Runner with:
Terminal window sudo /root/runner_upgrade.sh updateThis will restore the
chef-client
, cleanup old non-removed Docker Machine VMs that were left, refresh the configuration and finally start the Runner process.
How to proceed Runner Manager configuration changes with Graceful Shutdown
Section titled “How to proceed Runner Manager configuration changes with Graceful Shutdown”If your configuration change requires a shutdown of the Runner, it’s easiest to follow the deployment procedure.