Windows Autoscaling Runners
We operate two runner manager servers that run Windows and build Windows shared runners. The managers
must run Windows because the manager and the executors talk via WinRM. However, the data flow
of the Windows runner managers is the same as that of the Linux runners described in
the README. We use a custom autoscaler instead of docker-machine for these Windows runners.
An architecture diagram can be found in architecture.md.
Windows Configurations
We manage configurations of Windows servers in our ci-infrastructure-windows
project. In there you will find an ansible directory which includes Ansible playbooks and roles
used to configure the servers. Additionally, there is a packer directory which is used to build
images for the Windows managers. For now, we aren’t doing too much with Packer as we don’t have
a way to properly rebuild servers without downtime.
The Windows managers use a custom image built with Packer to build the machines that execute jobs.
Connecting to Windows
Please read the connecting to Windows documentation to install relevant software and connect to Windows.
Graceful Shutdown of Windows Runner Managers
Graceful shutdown is built into gitlab-runner.exe. To start a shutdown, open PowerShell as an administrator,
navigate to C:\GitLab-Runner, and execute .\gitlab-runner.exe stop.
This will take up to an hour to finish running jobs and finally stop.
Once it is stopped you can proceed with any maintenance you need to run.
Keep in mind that we only have two runner managers, so to avoid downtime you should stop the runner on only one of the managers at a time.
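The shutdown steps above can be sketched as the following PowerShell session. The final status check is an addition for illustration and is not a required part of the procedure; its exact output wording depends on the runner version.

```powershell
# Run from an elevated (administrator) PowerShell prompt.
Set-Location C:\GitLab-Runner

# Ask the runner to stop. Running jobs are allowed to finish first,
# so this can take up to an hour.
.\gitlab-runner.exe stop

# Confirm the runner service is no longer running before starting maintenance.
.\gitlab-runner.exe status
```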
Upgrading the Runner
Updating the Windows runner is a multi-step but straightforward process.
Updating the Windows ephemeral container image
Prepare an MR to the windows-containers
project that updates the runner version and checksum in the gitlab-runner-dependencies attributes file.
Once this is merged, the container should build and publish itself.
After the MR is merged and the CI pipelines complete, verify that the image has been created and is available, using either the GCP console or the gcloud tool.
After you’ve verified that the image is available, you can proceed to
updating Ansible.
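As a rough sketch of the gcloud check, assuming the published image is a Compute Engine image, something like the following lists the newest matching images. The project ID and name pattern are hypothetical placeholders; substitute the values used by your environment.

```powershell
# Hypothetical values: replace with the real GCP project and an image name pattern.
$Project = 'my-gcp-project'
$NamePattern = 'windows'

# List the five most recently created custom images whose name matches the pattern.
gcloud compute images list `
    --project $Project `
    --no-standard-images `
    --filter "name~$NamePattern" `
    --sort-by "~creationTimestamp" `
    --limit 5 `
    --format "table(name,family,status,creationTimestamp)"
```

An image whose status is READY and whose creation time matches the pipeline run is a reasonable sign that the build was published.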
Updating Ansible
Create an MR that updates Ansible with the new runner version value
and the new autoscaler image created above.
Both of these are declared in the gcp_role_runner_manager.yml file.
Be sure to update each autoscaler section to ensure all versions of Windows
are updated.
If you are updating the autoscaler, change the version and be sure to also update the checksum.
After merging, the CI pipeline will kick off. This is gated by a manual action. In this case, you should not run the automatically created Ansible apply job, but instead create your own. While it is not dangerous to run the apply job, it will fail because the runner process is still running.
Note: if you’re just trying to revert an autoscaler image upgrade, there is no need to proceed with the following steps to restart the runner manager processes.
Applying The Upgrade
Now that the images are recreated and Ansible is updated, it is time to execute the upgrade.
First, stop the runner gracefully on only one runner manager at a time. The instructions for doing so are earlier in this document.
After the runner process is fully stopped, you’ll create a new CI pipeline
in the ci-infrastructure-windows project.
You will need to define the ANSIBLE_HOST_LIMIT variable and set it to the name
of the runner manager that is currently stopped (either windows-shared-runners-manager-1 or 2).
This ensures that Ansible only runs on the server that is ready for the upgrade.
This is also manually gated, so you’ll need to go start the apply job after
the plan is run. Keep in mind this could take some time as Ansible on
Windows can be exceedingly slow.
When the Ansible run is completed, you can verify that the runner is
upgraded by running gitlab-runner.exe version in PowerShell.
Ansible should start the runner process automatically after it is done
running, but you should also verify that the runner process has started.
Finally, you can repeat the above process for the other manager that needs an upgrade.
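The post-upgrade checks described above might look like this in PowerShell. The Get-Service call assumes the runner was installed under the default Windows service name gitlab-runner; adjust it if the service is registered under a different name.

```powershell
Set-Location C:\GitLab-Runner

# Print the installed runner version and compare it with the version set in Ansible.
.\gitlab-runner.exe version

# Check that the runner process was started again after the Ansible run.
.\gitlab-runner.exe status

# Assumption: the Windows service uses the default name "gitlab-runner".
Get-Service -Name gitlab-runner | Select-Object Name, Status
```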
PowerShell
PowerShell is the preferred method
of interacting with Windows via the command line. While PowerShell is very complex and
powerful, below are some common commands you might use. Please note that, as with
most things Windows, these commands are not case-sensitive. You may also be interested
in reading Ryan Palo’s PowerShell Tutorial,
as it is written with those who hate PowerShell in mind and helps relate it to
more familiar bash commands.
- Get-Content is a tool similar to head, tail, and cat on Linux.
  - Ex. Get-Content -Path .\logfile.log -TotalCount 5 will get the first 5 lines of a file.
  - Ex. Get-Content -Path .\logfile.log -Tail 5 -Wait will get the last 5 lines of a file AND follow it for any changes.
  - Get-Content documentation
Third party tools
There are a few tools that are currently installed on each manager during setup.
Process Explorer is probably the most important software listed. It is a great tool that gives incredibly detailed information on processes, and it is substantially better than the built-in Task Manager. If you need to find out info on any processes, you should use this instead.
Additionally, cmder is an easier-to-use terminal emulator.
Troubleshooting
Deadman Test
A simple pipeline is executed in the windows-srm-deadman-test project on a schedule every 2 hours. This serves as a canary-type test to see whether jobs are executing and being handled correctly by the Windows Shared Runner Managers. The job it runs is extremely simple, so a failure can be assumed to be a systemic failure of the Windows Shared Runners themselves. Notifications about job failures are posted to the Slack channel #f_win_shared_runners. If a problem is suspected on the Windows Shared Runners, look at the history of past pipelines for this project, and consider triggering one manually to see how it behaves.
Shared Runners Manager Offline
If a shared runners manager is shown offline:
- Connect to the Windows runner manager by following the commands in the connecting to a Windows machine doc.
- Open the Start Menu, click Windows PowerShell, right-click the Windows PowerShell sub-menu entry, and click Run as Administrator.
- On the command line in the PowerShell window, invoke:
  C:\Gitlab-Runner\gitlab-runner.exe status
  # if down:
  C:\Gitlab-Runner\gitlab-runner.exe start
Autoscaler Logs and Docs
The autoscaler is a custom executor plugin for the GitLab Runner.
The autoscaler logs to a file located at C:\GitLab-Runner\autoscaler\autoscaler.log. This file
contains all the information regarding creation, connection, and deletion of VMs. You may want to look
here if VM creation is failing or connections from the managers are failing. This is likely
the best first place to check when issues arise.
- On the command line in the PowerShell window, invoke:
  Get-Content C:\Gitlab-Runner\autoscaler\autoscaler.log -Tail 100 | Out-Host -Paging
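When the log is large, it can help to filter it rather than page through it. The following is a minimal sketch using the autoscaler log file described above; the search pattern 'error' is only an assumption about what the relevant messages contain, so adjust it to whatever you are looking for.

```powershell
# Show the lines among the last 500 that mention "error".
# Select-String matches case-insensitively by default.
Get-Content C:\Gitlab-Runner\autoscaler\autoscaler.log -Tail 500 |
    Select-String -Pattern 'error'

# Follow the log live while reproducing a problem (press Ctrl+C to stop).
Get-Content C:\Gitlab-Runner\autoscaler\autoscaler.log -Tail 20 -Wait
```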
Firewall rules for winrm
The managers must be able to connect to the spawned VMs via ports 5985-5986. The relevant GCP firewall rules are defined in firewall.tf in our terraform repo. The ports should be opened by default on the spawned VMs during Packer image creation.
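To check connectivity from a manager to a spawned VM, a quick test along these lines may help. The IP address below is a hypothetical placeholder; take a real one from the autoscaler log or the GCP console while a job VM is running.

```powershell
# Hypothetical internal IP of a spawned VM; replace with a real address.
$VmAddress = '10.0.0.10'

# Try a TCP connection to each WinRM port and report whether it succeeded.
foreach ($Port in 5985, 5986) {
    $Result = Test-NetConnection -ComputerName $VmAddress -Port $Port -WarningAction SilentlyContinue
    '{0}:{1} reachable: {2}' -f $VmAddress, $Port, $Result.TcpTestSucceeded
}
```

A False result on both ports suggests a firewall or image problem. A False result on only one port may simply mean the VM listens on only one of the two WinRM ports.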
Using the wrong image, with missing dependencies
The image that the spawned VMs use is created by the windows-container project and defined in the group_vars in Ansible.