macOS runner fleet deployments
Generally, we follow the standard runner blue/green deployment process.
However, given the nature of dedicated hosts and potential capacity issues, we try to do so in periods of lower utilization, such as weekends. Peak utilization seems to occur on EMEA weekdays.
View the current job count and historic utilization on the CI runners dashboard.
Further, given that we maintain a number of components for macOS (host image, job images, nesting, etc.), it is sensible to test any changes in our staging shard saas-macos-staging first. Once the change is deployed there, test pipelines can be run in the saas-macos-staging test project.
Pre-flight checks
All shards are configured to hold onto dedicated hosts rather than release them when the running instance is terminated. This is because, in the US regions at least, it is increasingly difficult to acquire macOS dedicated hosts dynamically.
Before switching deployments ensure that enough dedicated hosts exist to scale into and meet current demand.
You can view the number of jobs running currently through this ci-runners Grafana panel.
As a rule of thumb, required number of hosts = number of running jobs / 2 (rounded up), because each host can run 2 VMs concurrently.
We do not expose this data to our central metrics stack (yet),
so you must log into each AWS account and check the dedicated hosts list (under EC2), or use the AWS CLI, e.g.:
aws ec2 describe-hosts --region "us-east-1" --filter "Name=state,Values=available" --query "Hosts[?length(Instances) == \`0\`].[HostId,HostProperties.InstanceType,AvailabilityZone,State]" --output table
Once the older shard is shut down, it can take 3-4 hours before freed dedicated hosts become available again for use. See dedicated hosts overview for more details.
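The pre-flight check above can be sketched as a small shell helper. This is only an illustration, not part of our tooling: the function names (required_hosts, free_hosts, have_capacity) are hypothetical, the 2-VMs-per-host figure is the rule of thumb stated above, and free_hosts assumes the AWS CLI is already authenticated against the shard's account. The running job count still has to be read from the Grafana panel by hand.

```shell
# Hosts needed for a given number of running jobs, assuming each host runs
# 2 VMs concurrently. Rounds up so an odd job count still gets a host.
required_hosts() {
  local jobs="$1"
  echo $(( (jobs + 1) / 2 ))
}

# Count dedicated hosts in a region that are available and have no instances.
# Same filter as the describe-hosts example above, wrapped in length() so the
# CLI prints a single number. Requires credentials for the relevant account.
free_hosts() {
  local region="$1"
  aws ec2 describe-hosts --region "$region" \
    --filter "Name=state,Values=available" \
    --query "length(Hosts[?length(Instances) == \`0\`])" \
    --output text
}

# Succeeds (exit 0) when the free host count covers the running job count.
have_capacity() {
  local jobs="$1" free="$2"
  [ "$free" -ge "$(required_hosts "$jobs")" ]
}
```

For example, with 37 running jobs read from the dashboard: have_capacity 37 "$(free_hosts us-east-1)" && echo "enough free hosts to switch". Keeping the arithmetic separate from the AWS call means the capacity rule can be checked without touching any account.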
mac2.metal flakiness
Environments using mac2.metal hosts (saas-macos-staging and saas-macos-medium-m1) often experience
instance instability problems.
These problems manifest in two ways:
- Instance SSH startup failures: SSH access fails after about 5 minutes, before the instance is fully provisioned.
- Dedicated hosts become unhealthy according to AWS checks and are recycled.
The runner handles these issues correctly by terminating the instance. This does have the negative side effect of using more dedicated hosts than would otherwise be optimal.
We don’t know why these things occur specifically on the mac2.metal hosts.
The same problems are not observed on mac2-m2pro.metal machines,
which are both newer and have more resources.
mac2.metal is the oldest generation of arm64 Mac minis in AWS. It is likely that this
hardware is coming to the end of its lifetime.