CI Runners Service

Service Overview
Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22ci-runners%22%2C%20tier%3D%22sv%22%7D
Label: gitlab-com/gl-infra/production~"Service::CI Runners"

Logging

shared runners

CI Runner Overview

What are CI Runners?

CI Runners are the backbone of GitLab’s CI/CD workflows. They are specialized components responsible for executing the tasks and jobs defined in a given project’s .gitlab-ci.yml configuration file. Runners interact with GitLab’s API to receive jobs and run them in isolated environments, ensuring clean states for every pipeline execution.

Key Responsibilities

Job Execution: Execute scripts, commands, and test suites provided in the CI/CD configuration.
Resource Isolation: Maintain isolated environments to ensure jobs do not interfere with each other.
Environment Management: Set up required dependencies, containers, or virtual machines dynamically.
Scalability: Scale infrastructure dynamically based on job load.
Artifact Management: Handle the storage and transfer of job artifacts between pipeline stages.
Cache Management: Manage caching mechanisms to speed up subsequent pipeline runs.
Security Scanning: Execute security scans and vulnerability checks as part of the pipeline.

Why CI Runners Matter

Reliability: Each job runs in a clean, reproducible environment, reducing flakiness.
Automation: Automates testing, deployment, and integration processes.
Scalability: Accommodates thousands of jobs simultaneously through autoscaling.
Flexibility: Supports different environments, platforms, and architectures (Linux, Windows, macOS).
Cost Efficiency: Optimizes resource usage by spinning up environments only when needed.
Compliance: Helps maintain compliance requirements through consistent, tracked execution environments.
Debug Capability: Provides detailed logs and execution traces for troubleshooting.

Projects overview

CI Runners are managed from multiple GitLab projects.

Runner Types

GitLab supports two main categories of CI Runners:

Internal Runners

These runners are used exclusively for GitLab-managed projects. They operate within dedicated infrastructure, ensuring higher performance, reliability, and security. Examples include:

private: Dedicated to internal teams and private instances.
shared-gitlab-org: Dedicated to projects in the gitlab-org namespace.
saas-macos-staging: Specialized runners for macOS jobs on staging environments.

External Runners

These runners are used by external users of GitLab.com. The shards are listed below (see also Hosted runners for GitLab.com):

saas-linux-small-amd64
saas-linux-medium-amd64
saas-linux-large-amd64
saas-linux-xlarge-amd64
saas-linux-2xlarge-amd64
saas-linux-medium-amd64-gpu-standard
saas-linux-small-arm64
saas-linux-medium-arm64
saas-linux-large-arm64
saas-macos-medium-m1
saas-macos-large-m2pro
windows-runners

Comparison of Internal vs. External Runners:

Feature	Internal Runners	External Runners
Infrastructure	Managed by GitLab	Self-hosted or GitLab-managed
Access	Restricted to GitLab projects	Available to all users
Performance	Optimized for GitLab workflows	Dependent on host environment
Security	Jobs may share resources as VMs are reused	Enhanced isolation & dedicated resources

Runner Workflow

The workflow of a CI Runner involves multiple steps:

Pre-job Checks:
- Verify runner capabilities match job requirements
- Ensure required resources are available
Job Retrieval: Runners fetch job details from the GitLab API.
Cache Restoration:
- Restore cached dependencies
- Download artifacts from previous stages
Environment Setup: Prepare the required execution environment (e.g., Docker containers or VMs).
Job Execution: Run the scripts, commands, or pipelines as specified.
Health Checking:
- Regular status reporting to GitLab
- Monitor resource usage and job progress
Job Reporting: Send job status, logs, and artifacts back to GitLab.
Cleanup: Terminate or clean up the environment to ensure isolation.
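The ordering of these steps can be sketched in a few lines of Python. This is an illustrative model only, not GitLab Runner's implementation: the Job class, the run_job function, and the tag-matching rule are hypothetical names introduced here, and job retrieval and reporting over the GitLab API are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Job:
    """Hypothetical stand-in for a CI job; not a GitLab Runner type."""
    job_id: int
    required_tags: Set[str]
    script: List[str]
    log: List[str] = field(default_factory=list)
    status: str = "pending"


def run_job(job: Job, runner_tags: Set[str], execute: Callable[[str], int]) -> Job:
    """Walk one job through the stages listed above, in order."""
    # Pre-job checks: the runner must offer every capability the job asks for.
    if not job.required_tags <= runner_tags:
        job.status = "skipped"
        return job

    job.log.append("restore cache and artifacts")  # Cache restoration
    job.log.append("prepare environment")          # Environment setup
    try:
        # Job execution: stop at the first command with a non-zero exit code.
        job.status = "success"
        for command in job.script:
            job.log.append(f"run: {command}")
            if execute(command) != 0:
                job.status = "failed"
                break
    finally:
        # Cleanup always runs so the next job starts from a clean state.
        job.log.append("clean up environment")
    return job
```

The try/finally block mirrors the point of the Cleanup step: the environment is torn down whether the script succeeds, fails, or raises.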

High-Level Runner Architecture

graph TB
  subgraph GitLab.com
      API[GitLab API]
      Registry[Container Registry]
  end

  subgraph CI_Runner_Infrastructure
      subgraph Runner_Managers
          Private[Private Runner Manager]
          GitLab-org[GitLab-org Runner Manager]
          Other[All other Runner Managers]
          Windows[Windows Runner Manager]
          MacOS[MacOS Runner Manager]
      end

      subgraph Load_Balancing
          ILB[Internal Load Balancer]
          HAProxy[HAProxy Cluster]
      end

      subgraph Compute_Resources
          subgraph VM_Pools
              PrivatePool[Private Runner Pool]
              GitLab-orgPool[GitLab-org Runner Pool]
              OtherPools[All other Runner Pools]
              WindowsPool[Windows Runner Pool]
              MacOSPool[MacOS Runner Pool]
          end
      end

      subgraph Monitoring
          Prometheus[Prometheus]
          Mirmir[Mimir]
          AlertManager[Alert Manager]
      end
  end

  %% Connections
  API --> ILB
  ILB --> HAProxy
  HAProxy --> Runner_Managers

  Private --> PrivatePool
  GitLab-org --> GitLab-orgPool
  Other --> OtherPools
  Windows --> WindowsPool
  MacOS --> MacOSPool

  PrivatePool --> Registry
  GitLab-orgPool --> Registry
  OtherPools --> Registry
  WindowsPool --> Registry
  MacOSPool --> Registry

  %% Monitoring connections
  Runner_Managers --> Prometheus
  VM_Pools --> Prometheus
  Prometheus --> Mirmir
  Prometheus --> AlertManager

  %% Styling
  classDef primary fill:#00c7b7,stroke:#333,stroke-width:2px;
  classDef secondary fill:#6666ff,stroke:#333,stroke-width:2px;
  classDef monitoring fill:#ff9900,stroke:#333,stroke-width:2px;
  classDef compute fill:#99cc00,stroke:#333,stroke-width:2px;

  class API,Registry primary;
  class Runner_Managers secondary;
  class Prometheus,Mirmir,AlertManager monitoring;
  class VM_Pools compute;

Below is a description of the runner components and their relationships:

Components of the Runner System

Runner Managers:
- Purpose: Coordinate the retrieval and execution of jobs.
- Functionality: Manage scaling, orchestration, and job lifecycle.
- GitLab.com specifically has several types:
  - private-runners-manager
  - shared-gitlab-org-runners-manager
  - saas-macos-staging-runners-manager
  - saas-linux-large-amd64-runners-manager
  - saas-linux-xlarge-amd64-runners-manager
  - saas-linux-2xlarge-amd64-runners-manager
  - saas-linux-medium-amd64-gpu-standard-runners-manager
  - saas-linux-medium-amd64-runners-manager
  - saas-linux-small-amd64-runners-manager
  - saas-linux-small-arm64-runners-manager
  - saas-linux-medium-arm64-runners-manager
  - saas-linux-large-arm64-runners-manager
  - saas-macos-medium-m1-runners-manager
  - saas-macos-large-m2pro-runners-manager
Load Balancers:
- Purpose: Distribute job load across multiple runner managers.
- Implementation: CI runners use Internal Load Balancers (ILBs) called “ci-gateway” to reduce traffic costs and improve performance. ILBs exist in both the GSTG (staging) and GPRD (production) environments, and each environment has ILBs across different availability zones (us-east1-b, us-east1-c, us-east1-d). The ILBs connect to HAProxy nodes that have interfaces in both the main VPC and the ci-gateway VPC. The load balancers are reachable through internal FQDNs such as:
  - git-us-east1-c.ci-gateway.int.gprd.gitlab.net
  - git-us-east1-d.ci-gateway.int.gprd.gitlab.net
  - git-us-east1-c.ci-gateway.int.gstg.gitlab.net
  - git-us-east1-d.ci-gateway.int.gstg.gitlab.net
The setup helps optimize costs by keeping traffic within GCP’s internal network when possible, only routing to the public internet when necessary (like for artifact uploads/downloads).
Compute Resources:
- Virtual Machines or containers provisioned dynamically for job execution.
- Categories: Private pools, macOS pools, Windows pools, and other pools, where each shard has its own.
- Specific resource tiers (S, M, L, XL, 2XL)
Monitoring Stack:
- Components: Prometheus, Grafana, and Mimir.
- Functionality: Monitor job performance, health, and resource usage.
Network Components:
- Purpose: Handle secure communication between runners and GitLab
- Implementation: Shared VPC architecture, strict firewall rules, and network policies
- Security: Manage access controls and network isolation
- For more details see ci-runner-networking.md
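The ci-gateway FQDNs listed under Load Balancers follow a visible pattern of zone plus environment. A minimal sketch of that pattern, inferred only from the four names above, is shown here. The function name ci_gateway_fqdn is hypothetical, and the sketch does not check that a given zone actually has a DNS record.

```python
# Environments named in the component description above.
KNOWN_ENVIRONMENTS = ("gprd", "gstg")


def ci_gateway_fqdn(zone: str, environment: str) -> str:
    """Build a ci-gateway FQDN following the pattern of the listed examples.

    The pattern is inferred from the names in this document; it is an
    assumption, not a published naming contract.
    """
    if environment not in KNOWN_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return f"git-{zone}.ci-gateway.int.{environment}.gitlab.net"
```

For example, ci_gateway_fqdn("us-east1-c", "gprd") reproduces the first FQDN in the list.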

Key Takeaways for CI Runners

Consistency: Jobs always run in clean, isolated environments.
Scalability: Autoscaling ensures that infrastructure adapts to workload.
Flexibility: Supports multiple platforms, programming languages, and containerized workflows.
Control: Internal runners offer optimized performance for GitLab projects, while external runners provide user flexibility.

SSH Access

Accessing the runner-manager virtual machines (VMs) is a critical step for debugging, configuration, and administrative tasks. The following instructions outline how to set up secure SSH access to these VMs hosted in the gitlab-ci-155816 GCP project.

Steps to Configure SSH

To SSH into any VM under the gitlab-ci-155816 GCP project:

Open or create your OpenSSH client configuration file (~/.ssh/config by default).
Add the following configuration block:

Host *.gitlab-ci-155816.internal
    ProxyJump lb-bastion.ci.gitlab.com
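To see what this block does, it may help to write out the equivalent explicit command. The sketch below, which assumes nothing beyond the pattern and bastion host shown above, returns the ssh argument list that the configuration applies implicitly: hosts matching the pattern go through the bastion (the -J flag is the command-line form of ProxyJump), and all others connect directly. The function name ssh_command and the hostname in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List

HOST_PATTERN = "*.gitlab-ci-155816.internal"  # Host pattern from the config block
BASTION = "lb-bastion.ci.gitlab.com"          # ProxyJump target from the config block


def ssh_command(host: str) -> List[str]:
    """Return the explicit ssh argv equivalent to the config block above."""
    # fnmatchcase handles the single "*" wildcard the same way for this pattern.
    if fnmatchcase(host, HOST_PATTERN):
        return ["ssh", "-J", BASTION, host]
    return ["ssh", host]
```

With a placeholder hostname such as example-vm.c.gitlab-ci-155816.internal, the function returns the -J form; for any host outside the project domain it returns a plain ssh invocation.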

External Runners

See Hosted runners for GitLab.com.

Runner Deployments

Blue-Green Deployment

A blue-green deployment strategy is used to enhance the reliability and speed of GitLab Runner releases. This method involves running two clusters in parallel:

Blue Cluster: Handles active traffic and production jobs.
Green Cluster: Hosts the next release candidate for testing and gradual rollout.

Benefits

Zero Downtime: Ensures continuous operation during upgrades.
Rollbacks: Allows seamless traffic shift to a stable cluster in case of issues.
Faster Releases: Traffic can be moved between clusters instantly.
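The traffic shift and rollback described above amount to flipping which of the two clusters is active. The toy model below illustrates only that idea. The BlueGreen class is hypothetical and does not reflect how the actual deployment tooling tracks or switches clusters.

```python
from dataclasses import dataclass


@dataclass
class BlueGreen:
    """Toy model of a two-cluster deployment; names are illustrative."""
    active: str = "blue"

    @property
    def standby(self) -> str:
        # The standby cluster is whichever colour is not currently active.
        return "green" if self.active == "blue" else "blue"

    def switch(self) -> str:
        """Promote the standby cluster; calling this again is the rollback."""
        self.active = self.standby
        return self.active
```

Because a rollback is the same operation as a release, a second call to switch() returns traffic to the previously stable cluster.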

Architectures Supported

Linux Architecture: Comprehensive guide for Linux-based runner environments.
Mac Architecture: Tailored for macOS-specific workloads.
Windows Architecture: Focused on Windows VM environments and job execution.

For additional details, refer to the Blue-Green Deployment README.

Cost Factors

Overview

Cost factors determine the number of minutes deducted from a user’s account based on the type of runner and the nature of the project (public vs. private). This helps balance resource allocation and encourages efficient usage.

Cost Factor Table (WIP)

Runner Type	Public Project Factor	Private Project Factor
	0.0	1.0
	0.0	1.0
	0.0	1.0
	0.0	1.0

Key Insights

Public Projects: Benefit from free execution minutes with a cost factor of 0.0.
Private Projects: Deduct minutes at a 1.0 cost factor per minute of execution.
Specialized Runners: All specialized runners have the same cost factor, simplifying billing.
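As a worked example of the arithmetic, the minutes deducted are the job duration multiplied by the cost factor. The sketch below uses only the two factors stated in this section (0.0 for public projects, 1.0 for private projects). The function name minutes_deducted is hypothetical, and runner type is deliberately not a parameter because the table above is still a work in progress.

```python
PUBLIC_PROJECT_FACTOR = 0.0   # from the table above
PRIVATE_PROJECT_FACTOR = 1.0  # from the table above


def minutes_deducted(duration_minutes: float, is_public: bool) -> float:
    """Return the minutes charged for one job: duration times cost factor."""
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    factor = PUBLIC_PROJECT_FACTOR if is_public else PRIVATE_PROJECT_FACTOR
    return duration_minutes * factor
```

A 12.5 minute job therefore deducts 12.5 minutes from a private project and nothing from a public one.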

Monitoring

Purpose

Monitoring ensures the health, performance, and reliability of CI Runner infrastructure. A robust monitoring stack is in place, leveraging tools like Prometheus, Thanos, and Traefik.

Architecture

Prometheus Instances: Deployed with high availability (minimum two replicas).
Thanos Sidecar:
- Facilitates long-term metrics storage in Google Cloud Storage (GCS).
- Exposes gRPC endpoints for querying data.
Traefik Ingress:
- Acts as the load balancer for gRPC services.
- Enforces HTTPS with Let’s Encrypt certificates.

External Access

Each monitoring project uses reserved public IP addresses. These addresses are mapped to DNS records:

monitoring-lb.[ENVIRONMENT].ci-runners.gitlab.net: Used for accessing the Traefik dashboard and Thanos Query store.
prometheus.[ENVIRONMENT].ci-runners.gitlab.net: Directly connects to Prometheus deployments.

Authentication

Access is restricted via OAuth using Google as the identity provider.
Only @gitlab.com accounts are allowed access.

Example Diagram showing monitoring stack

CI Runners Monitoring Stack Design

Terraform Integration

The monitoring stack is managed entirely via Terraform, ensuring consistency across deployments. For details, see the CI Runners Monitoring Terraform Module.

Production Change Lock (PCL)

Purpose

To minimize disruptions, GitLab enforces a Production Change Lock during critical periods such as holidays and summits.

Standard PCL Events

Dates	Type	Reason
Recurring: Friday	Soft	Friday
Recurring: Weekend (Sat - Sun)	Soft	Weekend
