Windows Autoscaling Runners

We operate two runner manager servers that run Windows and build Windows shared runners. The managers must run Windows because the managers and executors communicate via WinRM. However, the data flow of the Windows runner managers is the same as that of the Linux runners described in the README. We use a custom autoscaler instead of docker-machine for these Windows runners. An architecture diagram can be found in architecture.md.

Windows Configurations

We manage the configuration of Windows servers in our ci-infrastructure-windows project. In there you will find an ansible directory, which includes the Ansible playbooks and roles used to configure the servers. Additionally, there is a packer directory, which is used to build images for the Windows managers. For now, we aren’t doing much with Packer, as we don’t have a way to properly rebuild servers without downtime.

The Windows managers use a custom image built with Packer to build the machines that execute jobs.

Connecting to Windows

Please read the connecting to Windows documentation to install relevant software and connect to Windows.

Graceful Shutdown of Windows Runner Managers

Graceful shutdown is built into gitlab-runner.exe. To start a shutdown, open PowerShell as an administrator, navigate to C:\GitLab-Runner, and execute .\gitlab-runner.exe stop. The runner can take up to an hour to finish its running jobs before it finally stops.

Once it is stopped you can proceed with any maintenance you need to run.

Keep in mind that we only have two runner managers, so to avoid downtime you should stop the runner on only one of the managers at a time.
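
The shutdown steps above can be sketched as the following PowerShell session. It uses only the commands and the install path mentioned in this document, and it assumes an elevated (administrator) prompt on the one manager being taken out of service.

```powershell
# Run in an elevated PowerShell session, on ONE runner manager only.
Set-Location C:\GitLab-Runner

# Ask the runner to stop. As noted above, this can take up to an hour
# while running jobs finish.
.\gitlab-runner.exe stop

# Confirm the runner reports as stopped before starting any maintenance.
.\gitlab-runner.exe status
```

If the status command still reports the runner as running, wait and check again rather than starting maintenance early.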

Upgrading the Runner

Updating the Windows runner is a multi-step but straightforward process.

Updating the Windows ephemeral container image

Prepare an MR to the windows-containers project that updates the runner version and checksum in the gitlab-runner-dependencies attributes file. Once this is merged, the container should build and publish itself. After the merge and once the CI pipelines complete, verify that the image has been created and is available, using either the GCP console or the gcloud tool. After you’ve verified that the image is available, you can proceed to updating Ansible.
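
A minimal sketch of the gcloud check follows, assuming the image is published as a GCP Compute Engine image. The project ID and the image name fragment are hypothetical placeholders, not values taken from this document; substitute the real ones.

```powershell
# Hypothetical placeholders: replace with the real GCP project and image name.
$project   = '<gcp-project-id>'
$imageName = '<image-name-fragment>'

# List images whose name matches the fragment, with status and creation time.
# A freshly built image should show a recent creationTimestamp and status READY.
gcloud compute images list `
    --project $project `
    --filter "name~$imageName" `
    --format "table(name,status,creationTimestamp)"
```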

Updating Ansible

Create an MR that updates Ansible with the new runner version value and new autoscaler image created above. Both of these are declared in the gcp_role_runner_manager.yml file. Be sure to update each autoscaler section to ensure all versions of Windows are updated.

If you are updating the autoscaler, change the version and be sure to also update the checksum.

After merging, the CI pipeline will kick off. This is gated by a manual action. In this case, you should not run the automatically created Ansible apply job, but instead create your own. While it is not dangerous to run the apply job, it will fail because the runner process is still running.

Note: if you’re just trying to revert an autoscaler image upgrade, there is no need to proceed with the following steps to restart the runner manager processes.

Applying The Upgrade

Now that the images are recreated and Ansible is updated, it is time to execute the upgrade.

First, stop the runner gracefully on only one runner manager at a time. The instructions for doing so are earlier in this document.

After the runner process is fully stopped, create a new CI pipeline in the ci-infrastructure-windows project. You will need to define the ANSIBLE_HOST_LIMIT variable and set it to the name of the runner manager that is currently stopped (either windows-shared-runners-manager-1 or windows-shared-runners-manager-2). This ensures that Ansible only runs on the server that is ready for the upgrade. This pipeline is also manually gated, so you’ll need to start the apply job after the plan has run. Keep in mind this could take some time, as Ansible on Windows can be exceedingly slow.

When the Ansible run is completed, you can verify that the runner is upgraded by running gitlab-runner.exe version in PowerShell. Ansible should start the runner process automatically after it is done running, but you should also verify that the runner process has started.
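
The verification described above might look like this in PowerShell, using the install path and subcommands mentioned elsewhere in this document:

```powershell
Set-Location C:\GitLab-Runner

# Print the installed runner version; it should match the version set in Ansible.
.\gitlab-runner.exe version

# Confirm that Ansible restarted the runner process after the upgrade.
.\gitlab-runner.exe status
```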

Finally, you can repeat the above process for the other manager that needs an upgrade.

Tools

PowerShell

PowerShell is the preferred method of interacting with Windows via the command line. While PowerShell is very complex and powerful, below are some common commands you might use. Please note that, as with most things on Windows, these commands are not case-sensitive. You may also be interested in reading Ryan Palo’s PowerShell Tutorial, as it is written with those who hate PowerShell in mind and helps relate it to more familiar bash commands.

Get-Content is a tool similar to head, tail, and cat on Linux.
1. Ex. Get-Content -Path .\logfile.log -TotalCount 5 will get the first 5 lines of a file.
2. Ex. Get-Content -Path .\logfile.log -Tail 5 -Wait will get the last 5 lines of a file AND follow it for any changes.
3. Get-Content documentation
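
To try these options safely, the sketch below builds a throwaway ten-line file in the temp directory and reads it back in the three ways described above. The file name and its contents are made up for illustration.

```powershell
# Create a throwaway 10-line file in the temp directory (hypothetical sample data).
$log = Join-Path ([System.IO.Path]::GetTempPath()) 'get-content-demo.log'
1..10 | ForEach-Object { "line $_" } | Set-Content -Path $log

# Equivalent of `head -n 3`: the first three lines.
$first = Get-Content -Path $log -TotalCount 3

# Equivalent of `tail -n 2`: the last two lines.
$last = Get-Content -Path $log -Tail 2

# Equivalent of `cat`: every line, returned as an array of strings.
$all = Get-Content -Path $log

$first        # line 1, line 2, line 3 (one per line)
$last         # line 9, line 10 (one per line)
$all.Count    # 10

# Clean up the sample file.
Remove-Item -Path $log
```

Adding -Wait to the -Tail form keeps the command running and prints new lines as they are appended, much like tail -f; press Ctrl+C to stop it.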

Third party tools

There are a few tools that are currently installed on each manager during setup. These are:

Process Explorer is probably the most important software listed. It is a great tool that gives incredibly detailed information on processes, and it is substantially better than the built-in Task Manager. If you need to find out information about any process, you should use this instead.

Additionally, cmder is an easier-to-use terminal emulator.

Troubleshooting

Deadman Test

A simple pipeline is executed in the windows-srm-deadman-test project on a schedule every 2 hours. This serves as a canary-type test to see whether jobs are being executed and handled correctly by the Windows Shared Runner Managers. The job it runs is extremely simple, so a failure can be assumed to be a systemic failure of the Windows Shared Runners themselves. Notifications about job failures are posted to the Slack channel #f_win_shared_runners. If a problem is suspected on the Windows Shared Runners, look at the history of past pipelines for this project, and consider triggering one manually to see how it behaves.

Shared Runners Manager Offline

If a shared runners manager is shown offline:

Connect to the Windows runner manager by following the steps in the connecting to a Windows machine doc.
Open an elevated PowerShell: click Start Menu > click Windows PowerShell > right-click the Windows PowerShell sub-menu entry > click Run as Administrator.

On the command-line in the PowerShell window invoke:

C:\Gitlab-Runner\gitlab-runner.exe status

# if down:
C:\Gitlab-Runner\gitlab-runner.exe start

Autoscaler Logs and Docs

The autoscaler is a custom executor plugin for the GitLab Runner.

The autoscaler logs to a file located at C:\GitLab-Runner\autoscaler\autoscaler.log. This file will contain all the information regarding creation, connection, and deletion of VMs. You may want to look here if VM creation is failing or connections from the managers are failing. This is likely the best first place to check when issues arise.

On the command-line in the PowerShell window invoke:

Get-Content C:\Gitlab-Runner\autoscaler\autoscaler.log -tail 100 | Out-Host -Paging
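
When the log is large, filtering it can be quicker than paging through it. The sketch below assumes that failure lines in the autoscaler log contain the word "error"; that pattern is a guess at the log format rather than something documented here, so adjust it to whatever the log actually emits.

```powershell
$log = 'C:\GitLab-Runner\autoscaler\autoscaler.log'

# Show the 20 most recent lines that match the pattern.
# Select-String is case-insensitive by default.
# Assumption: failure lines contain "error".
Select-String -Path $log -Pattern 'error' | Select-Object -Last 20

# Follow the log live while reproducing a problem (Ctrl+C to stop).
Get-Content -Path $log -Tail 20 -Wait
```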

Firewall rules for WinRM

The managers must be able to connect to the spawned VMs via ports 5985-5986. The relevant GCP firewall rules are defined in firewall.tf in our Terraform repo. The ports should already be open on the spawned VMs, as this is done during Packer image creation.
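
To check connectivity from a manager to a spawned VM, a quick TCP test on both WinRM ports can help separate firewall problems from autoscaler problems. This is a sketch only: the VM address is a hypothetical placeholder, which you would replace with the address of a VM that currently exists.

```powershell
# Hypothetical placeholder: replace with the internal IP or name of a spawned VM.
$vm = '<spawned-vm-address>'

# Test both WinRM ports (5985 and 5986).
foreach ($port in 5985, 5986) {
    Test-NetConnection -ComputerName $vm -Port $port |
        Select-Object ComputerName, RemotePort, TcpTestSucceeded
}
```

A TcpTestSucceeded value of False for a VM that is up points at the firewall rules or at the image rather than at the runner itself.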

Using the wrong image, with missing dependencies

The image that the spawned VMs use is created by the windows-containers project and defined in the group_vars in Ansible.

Architecture Diagram

windows runner diagram