Linux OS Patching

This document provides an overview of system patching practices at GitLab, and guidance for keeping individual systems up to date with the latest security fixes.

This document targets VM instances running Linux-based operating systems that make up the fleets directly supporting GitLab.com. Unique Linux deployments or one-off systems may require additional consideration, and the scope of this document is expected to change over time as the service and its deployment evolve.

  • Identify and classify critical systems where patching is required.
  • Establish a patching cadence for systems as they exist today, determined by the system’s risk exposure.
  • Outline system owners and contacts who should be responsible for the patching of individual systems.
  • Provide an overview of the existing patching processes, and prerequisites for doing so.

System definitions:
  • Internet-facing Linux OS
    • These systems are exposed directly to the internet and have a larger attack surface because of this.
    • Examples:
      • bastion hosts
      • HAProxy
  • Internal Linux OS
    • Linux systems that are not directly accessible from external networks. Malicious access would need to traverse other secure systems first.
    • Examples:
      • Gitaly
      • Patroni
      • Redis

GitLab almost exclusively deploys Ubuntu as the base operating system for VMs supporting GitLab.com. By default, Canonical provides 5 years of security patching for its LTS releases (released in April of even-numbered years).

All of our Ubuntu systems have unattended upgrades enabled, meaning they will automatically install security patches on a daily basis.

While there is currently no guaranteed support for packages outside of the Ubuntu main repositories, we should strive to keep our fleet within the official 5 year LTS support window, as packages in additional repositories such as universe and multiverse tend to no longer receive updates past this period.

For machines that we enroll in Ubuntu Pro, ESM (Extended Security Maintenance) coverage extends security patching support for packages in the main Ubuntu repository out to 10 years. This support does not extend to packages installed via the universe or multiverse repositories, or PPA repositories maintained by parties other than Canonical. Canonical offers support for the packages in the universe repository for an additional licensing fee that we may consider in the future.
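
The support windows described above reduce to simple date arithmetic. The following is a minimal sketch, assuming version strings in "YY.MM" form and month-level precision; the function names are hypothetical, and the exact end-of-support dates published by Canonical should be treated as authoritative.

```python
from datetime import date

STANDARD_SUPPORT_YEARS = 5  # standard LTS security patching window
ESM_SUPPORT_YEARS = 10      # Ubuntu Pro ESM window for packages in main


def lts_support_windows(version: str) -> dict:
    """Return approximate end-of-support dates for an Ubuntu LTS version.

    `version` is expected in "YY.MM" form, for example "20.04".
    Dates are rounded to the first day of the release month (an assumption).
    """
    year_part, month_part = version.split(".")
    release_year = 2000 + int(year_part)
    release_month = int(month_part)
    # LTS releases ship in April of even-numbered years.
    if release_month != 4 or release_year % 2 != 0:
        raise ValueError(f"{version} does not look like an LTS release")
    return {
        "standard_end": date(release_year + STANDARD_SUPPORT_YEARS, release_month, 1),
        "esm_end": date(release_year + ESM_SUPPORT_YEARS, release_month, 1),
    }


def in_standard_support(version: str, today: date) -> bool:
    """True while the release is still inside its 5 year LTS window."""
    return today < lts_support_windows(version)["standard_end"]
```

For example, `lts_support_windows("20.04")` places the end of standard support in April 2025 and the end of ESM coverage in April 2030.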

Livepatch, a service available through Ubuntu Pro, allows us to apply High and Critical severity kernel security fixes to running machines without the need to reboot the instance. Each kernel supported by Livepatch has a limited support period of 9 to 13 months, during which it is eligible for updates; after that, a reboot into a newer kernel is required to receive further updates. This means that, at minimum, every machine should be restarted once a year to ensure critical kernel security fixes remain available.
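
The yearly restart rule can be checked mechanically. The sketch below is an illustration only: it uses system uptime as a proxy for the age of the running kernel, takes the one-year limit from the paragraph above as its default, and uses hypothetical function names. It does not query Livepatch for the kernel's actual support period.

```python
from datetime import datetime, timedelta

# Assumption: one year, per the "restarted once a year" guidance above.
MAX_TIME_BETWEEN_REBOOTS = timedelta(days=365)


def read_last_boot(now: datetime) -> datetime:
    """Derive the last boot time from /proc/uptime (Linux only)."""
    with open("/proc/uptime") as f:
        uptime_seconds = float(f.read().split()[0])
    return now - timedelta(seconds=uptime_seconds)


def reboot_due(
    last_boot: datetime,
    now: datetime,
    max_age: timedelta = MAX_TIME_BETWEEN_REBOOTS,
) -> bool:
    """True when the machine has been up for at least `max_age`."""
    return now - last_boot >= max_age
```

A stricter policy could pass `max_age=timedelta(days=270)` to stay inside the shortest Livepatch support period of roughly 9 months.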

Any of the following scenarios will qualify a system for enrollment in Ubuntu Pro:

  • The OS is no longer in its LTS support window.
  • The systems cannot be rebooted without downtime on GitLab.com.
  • There is no automation available for applying updates and rebooting the systems.
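
The enrollment criteria above are an "any of" rule, which can be sketched as a single predicate. The `SystemStatus` type and its field names are hypothetical, and the sketch assumes that partial automation counts as automation being available.

```python
from dataclasses import dataclass


@dataclass
class SystemStatus:
    in_lts_support_window: bool     # OS still inside its LTS window
    reboot_requires_downtime: bool  # rebooting causes GitLab.com downtime
    has_patch_automation: bool      # assumption: "partial" counts as True


def qualifies_for_ubuntu_pro(status: SystemStatus) -> bool:
    """A system qualifies if any one of the three scenarios applies."""
    return (
        not status.in_lts_support_window
        or status.reboot_requires_downtime
        or not status.has_patch_automation
    )
```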

The major Linux fleets that support GitLab.com are:

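| Service | Owner | Exposure | Maintenance Impact | Automation | Ubuntu Pro | Cadence (weeks) |
|---|---|---|---|---|---|---|
| GKE | foundations | external | low | partial | N/A | external |
| Runner Managers | scalability:practices | internal | low | partial | no | 8 |
| HAProxy | foundations | external | low | partial | no | 8 |
| Gitaly | data-access::durability | internal | high | no | yes | as needed |
| Patroni | reliability_database_reliability | internal | low | no | yes | 8 |
| PGBouncer | reliability_database_reliability | internal | low | no | yes | 8 |
| Redis | data-access::durability | internal | low | no | yes | 8 |
| Console | none | internal | low | no | yes | 8 |
| Deploy | none | internal | medium | no | yes | 12 |
| Bastions | none | external | low | partial | no | 8 |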

Maintenance impact:
  • Low:
    • The service is deployed in a highly available capacity and individual instances can be taken offline with no impact to service usability.
    • The service is not highly available, but brief outages do not impact any other systems or processes.
  • Medium:
    • The service is deployed in a highly available capacity, but there may be system degradation as a result of taking instances offline for maintenance.
    • The service is not deployed in a highly available capacity, and coordination may be required to prevent internal process disruption. Customers are not impacted.
  • High:
    • The service is not deployed in a highly available capacity, and downtime of portions of GitLab.com is required to facilitate patching activities.

Automation:
  • no
    • No automation exists. Maintenance activities are initiated and executed by an SRE for each component in the system.
  • partial
    • Either the initiation or the execution of maintenance must be done by hand, while the other is handled by an automated system.
  • yes
    • Initiation and execution of maintenance are handled by automated systems. No SRE involvement is required to keep systems up to date.

The patching cadence for systems will be influenced by our established SLAs for vulnerability management, weighted by the system’s maintenance impact and whether automation exists for performing maintenance on the system. The goal is for maintenance on every system to be low impact and fully automated.

General guidance for patching cadence will look like:

| Impact | Automation | Cadence |
|---|---|---|
| low | yes | 4 weeks |
| low | no | 8 weeks |
| medium | yes | 4 weeks |
| medium | no | 12 weeks |
| high | no | as needed* |
| high | yes | as needed* |

*As needed means that patching and reboots will be done to maintain security compliance by following the guidance for “Unscheduled Patching”.

Systems where the patch management lifecycle is controlled by a third party may be designated as “external”.
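
The general cadence guidance can be expressed as a lookup keyed on impact and automation. This is a minimal sketch with hypothetical names; `None` stands for "as needed", and combinations the guidance does not list (such as partial automation) raise an error instead of guessing a value.

```python
from typing import Optional

# (impact, automation) -> cadence in weeks; None means "as needed".
CADENCE_WEEKS = {
    ("low", "yes"): 4,
    ("low", "no"): 8,
    ("medium", "yes"): 4,
    ("medium", "no"): 12,
    ("high", "no"): None,
    ("high", "yes"): None,
}


def patch_cadence_weeks(impact: str, automation: str) -> Optional[int]:
    """Look up the guidance cadence for a system, in weeks."""
    key = (impact.lower(), automation.lower())
    if key not in CADENCE_WEEKS:
        raise ValueError(f"no cadence guidance for impact={impact!r}, automation={automation!r}")
    return CADENCE_WEEKS[key]
```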

It may be necessary to perform patching outside of a system’s normal cadence to respond to new threats that are discovered and are actively being exploited. In these circumstances we will look to the SLAs defined for the relevant vulnerability CVE score, and initiate patching processes as appropriate to maintain compliance and security of the systems.

A simple order of operations that ensures patches are applied first to the systems with the most risk exposure (refer to the system definitions above):

  1. Ensure all systems are under active support, and are not EoL.
  2. Ensure externally accessible systems are up to date.
  3. Ensure internally accessible systems are up to date.
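
The ordering above can be sketched as a sort key: end-of-life systems first, then externally exposed systems, then internal ones. The `System` type, its fields, and the example host names are hypothetical and only illustrate the ranking.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class System:
    name: str
    end_of_life: bool  # True if the OS is outside its support window
    exposure: str      # "external" or "internal"


def patch_priority(system: System) -> int:
    """Lower numbers are patched first."""
    if system.end_of_life:
        return 0
    if system.exposure == "external":
        return 1
    return 2


def patch_order(systems: List[System]) -> List[System]:
    # sorted() is stable, so systems with equal priority keep their input order.
    return sorted(systems, key=patch_priority)
```

For example, given an internal Redis node, an external HAProxy node, and an end-of-life console host, `patch_order` returns the console host first, then HAProxy, then Redis.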

When targeting systems to update, the first goal should be to ensure that all systems are within their designated support window. This ensures that if security vulnerabilities are found, these systems will have a path towards resolution, even if only passively.

We intentionally make no distinction between non-production and production systems: patches should always be applied to non-production systems before production, with the priority set by the risk exposure of the production system.