Cloud Connector
Cloud Connector is a way to access services common to multiple GitLab deployments, instances, and cells. It is not a dedicated service itself, but rather a collection of APIs, code, and configuration that standardizes the approach to authentication and authorization when integrating Cloud services with a GitLab instance.
This document contains general information on how Cloud Connector components are configured and operated by GitLab Inc. The intended audience is GitLab engineers and SREs who need to change the configuration of these components or triage issues with them.
See Cloud Connector architecture for more information.
Cloudflare
Any client consuming a Cloud Connector service must do so through cloud.gitlab.com, a public endpoint managed by Cloudflare. A "client" here is either a GitLab instance or an end-user application such as an IDE.
Cloudflare performs the following primary tasks for us:
- Global load balancing using Cloudflare’s Anycast network
- Enforcing WAF and other security rules such as rate limiting
- Routing requests into GitLab feature backends
The cloud.gitlab.com DNS record is fully managed by Cloudflare, i.e. Cloudflare acts as a reverse proxy. This means any client dialing this endpoint will reach a Cloudflare server, not a GitLab backend.
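You can verify this with a quick DNS lookup (a sketch; the exact addresses returned will vary):

```sh
# cloud.gitlab.com resolves to Cloudflare edge addresses, not GitLab backends,
# because Cloudflare fully manages the record and proxies all traffic.
dig +short cloud.gitlab.com
```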
See routing for more information on how requests are forwarded.
Routing and rate limits are configured here.
The default URL for the Cloudflare proxy is cloud.gitlab.com, which is used in production environments. It can be overridden by setting the CLOUD_CONNECTOR_BASE_URL environment variable. For example, we set this to cloud.staging.gitlab.com for our multi-tenant GitLab SaaS deployment. This directs any Cloud Connector traffic originating from staging.gitlab.com to cloud.staging.gitlab.com.
Self-managed customers are not expected to set this variable.
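As a minimal sketch, assuming the variable takes the full base URL including the scheme, the staging override would look something like this (in practice it is set through the deployment's configuration management rather than an interactive shell):

```sh
# Hypothetical illustration: route Cloud Connector traffic from a staging
# deployment to the staging Cloudflare proxy instead of cloud.gitlab.com.
export CLOUD_CONNECTOR_BASE_URL="https://cloud.staging.gitlab.com"
```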
Monitoring
Dashboards
Alerts
There are three ways to monitor traffic going through cloud.gitlab.com, each with their pros and cons:
- Instant Logs. Use this to monitor live traffic. This stream will only display the most basic properties of an HTTP request, such as method and path, but can be useful to filter and monitor traffic on the fly.
- Log Explorer. This tool allows querying historic logs using an SQL-like query language. It can surface all available information about HTTP requests, but has limited filtering capabilities. For example, since HTTP header fields are stored as an embedded JSON string, you cannot correlate log records with backend logs by filtering on correlation IDs.
- LogPush. This approach first pushes logs from Cloudflare to a Google Cloud Storage bucket, from which you can then stream these files to your machine as JSON, or load them into BigQuery for further analysis. To stream the last 30 minutes of HTTP request logs from a given timestamp into jq, run:

```sh
scripts/cloudflare_logs.sh -e cloud-connect-prd -d 2024-09-25T00:00 -t http -b 30 | jq .
```
If you wish to correlate log events between Cloudflare logs and the Rails application or Cloud Connector backends:
- Via request correlation IDs: Look for x-request-id in the Cloudflare RequestHeaders field. Correlate it with correlation_id in application services. This identifies an individual request.
- Via instance ID: Look for x-gitlab-instance-id in the Cloudflare RequestHeaders field. Correlate it with gitlab_instance_id in application services. This identifies an individual GitLab instance (both SM/Dedicated and gitlab.com).
- Via global user ID: Look for x-gitlab-global-user-id in the Cloudflare RequestHeaders field. Correlate it with gitlab_global_user_id in application services. This identifies an individual GitLab end-user (both SM/Dedicated and gitlab.com).
- Via caller IP: Look for ClientIP in Cloudflare logs. Correlate it with client_ip (or similar fields) in application services. This identifies either a GitLab instance or an end-user client such as an IDE from which the request originated.
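For example, when working with LogPush output, a jq filter along these lines can pull out the records for a single request (a sketch, assuming RequestHeaders is an embedded JSON string as described above; REQUEST_ID is a placeholder for a real correlation ID):

```sh
# Stream 30 minutes of logs and keep only records whose x-request-id header
# matches the correlation ID seen in the backend logs.
scripts/cloudflare_logs.sh -e cloud-connect-prd -d 2024-09-25T00:00 -t http -b 30 \
  | jq 'select((.RequestHeaders // "{}" | fromjson)["x-request-id"] == "REQUEST_ID")'
```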
Routing
While this is not systematically enforced, we require all clients that want to reach Cloud Connector backends to dial cloud.gitlab.com instead of the backends directly. Backends that use a public load balancer such as the GCP Global App LB should use Cloud Armor security policies to reject requests not coming from Cloudflare.
Routing is based on path prefix matching. Every Cloud Connector backend (e.g. the AI gateway) must be connected as a Cloudflare origin server with such a path prefix. For example, the AI gateway is routed via the /ai prefix, so requests to cloud.gitlab.com/ai/* are routed to the AI gateway, with the prefix stripped off (only * is forwarded).
You can see an example of this here.
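To make the prefix handling concrete, here is a sketch (the /v1/chat path is purely illustrative, not a real AI gateway route):

```sh
# The client dials the Cloudflare-managed endpoint using the /ai prefix...
curl https://cloud.gitlab.com/ai/v1/chat

# ...and the AI gateway origin receives the request with the prefix stripped,
# i.e. it sees the path /v1/chat.
```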
WAF Rules
The Cloud Connector zones (cloud.gitlab.com) are protected by the standard WAF rules used across GitLab, which can be found in the cloudflare-waf-rules module. This provides configuration for the WAF Custom Rules and Cloudflare Managed Rules that:
- block embargoed countries
- block requests that try to exploit some vulnerabilities
However, Cloud Connector does not use the standard Rate Limiting Rules provided by the cloudflare-waf-rules module; instead, it overrides these with customized rate limiting rules, outlined below.
Rate limiting
In addition to the standard WAF rules, we define rate limits that guard against malicious or misbehaving customer instances and clients. These rate limits are not to be confused with gitlab.com Cloudflare rate limits (which guard the gitlab.com application deployment) or application rate limits enforced in the Rails monolith itself.
Cloud Connector rate limits are instead enforced between the customer Rails instance (or end-user) and GitLab backend services. The following diagram illustrates where Cloud Connector rate limits fit into the overall rate limit setup, using the AI gateway as an example:
```mermaid
flowchart LR
    user(End-user request\nfrom IDE or web UI) --> is_dotcom
    is_dotcom{GitLab.com?} -- Y --> is_comda
    is_comda{Direct Access?} -- Y --> ccwaf
    is_comda -- N --> comwaf
    comwaf{{Cloudflare WAF / gitlab.com}} --> com_rails
    com_rails(GitLab.com Rails app /\nenforces in-app RLs /\nGitLab Inc. controls these) --> ccwaf
    is_dotcom -- N --> is_smda
    is_smda{Direct Access?} -- Y --> ccwaf
    is_smda -- N --> sm_rails
    sm_rails(Customer Rails app /\nenforces in-app RLs /\nGitLab Inc. can't control these) --> ccwaf
    ccwaf{{Cloudflare WAF / cloud.gitlab.com}} --> aigw
    aigw(AI gateway) --> vendorrl
    vendorrl{{AI vendor limits /\ndifficult to change}} --> aivendor
    aivendor(AI vendor / model)
```
The rate limits enforced in Cloudflare are specified per backend and can be classified as follows:
- Per user. Limits applied to a given user, identified by a unique global UID string.
- Per authentication attempt. Rate limits applied to clients that produce repeated 401 Unauthorized server responses. This is to prevent credential stuffing and similar attacks that brute-force authentication.
You can observe WAF events for custom rate limit rules here.
Key rotation
Cloud Connector uses JSON Web Tokens (JWTs) to authenticate requests in backends. For multi-tenant customers on gitlab.com, the Rails application issues and signs these tokens. For single-tenant customers (SM/Dedicated), CustomersDot issues and signs them. To validate tokens, Cloud Connector backends fetch the corresponding keys from the gitlab.com and CustomersDot Rails applications, respectively.
Keys should be rotated on a 6-month schedule, both in staging and production.
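To check which signing keys validators currently see, you can query the discovery endpoint referenced in the rotation steps below (a sketch, assuming the endpoint returns a standard JWKS document):

```sh
# List the key IDs currently published by the gitlab.com Rails application.
curl --silent https://gitlab.com/oauth/discovery/keys | jq '.keys[].kid'
```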
Rotating keys for gitlab.com
Keys must be rotated in staging and production. The general steps in both environments are listed below; a condensed command sketch follows the list.
- Run sudo gitlab-rake cloud_connector:keys:list to verify there is exactly one key.
- Run sudo gitlab-rake cloud_connector:keys:create to add a new key to rotate to.
- Run sudo gitlab-rake cloud_connector:keys:list to verify there are exactly two keys.
- Ensure validators have fetched the new key via OIDC Discovery. Since keys are cached both in HTTP caches and application-specific caches, this may require waiting at least 24 hours for these caches to expire. This process can be expedited by:
  - Restarting/redeploying backend services to evict their in-memory caches.
  - Purging HTTP caches in Cloudflare for the /oauth/discovery/keys endpoint.
- For the AI gateway only, ensure this dashboard shows no events.
- Run sudo gitlab-rake cloud_connector:keys:rotate to swap the current key with the new key, enacting the rotation.
- Monitor affected systems:
  - Ensure Puma and Sidekiq processes have swapped to the new key. This may take some time due to keys being cached in process memory.
  - Ensure all Puma and Sidekiq workers are now using the new key to sign requests.
- Do not proceed until:
  - Keys used to sign requests have converged fully to the new key.
  - Backends do not see elevated rates of 401 Unauthorized responses.
- Run sudo gitlab-rake cloud_connector:keys:trim to remove the now-unused key.
- Monitor affected systems as before to ensure the rotation was successful.
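For reference, the rake task sequence above condenses to the following (run on the appropriate Rails console node; the waiting and monitoring steps in between still apply):

```sh
sudo gitlab-rake cloud_connector:keys:list    # verify exactly one key exists
sudo gitlab-rake cloud_connector:keys:create  # add the new key to rotate to
sudo gitlab-rake cloud_connector:keys:list    # verify there are now two keys
# ...wait for validators to fetch the new key (see caching notes above)...
sudo gitlab-rake cloud_connector:keys:rotate  # swap the current and new keys
# ...monitor until request signing has fully converged to the new key...
sudo gitlab-rake cloud_connector:keys:trim    # remove the now-unused key
```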
Rotating keys in staging
- Run /change declare in Slack and create a C3 Change Request.
- Teleport to console-01-sv-gstg.
- Run the steps outlined above.
- Close the CR issue.
Rotating keys in production
- Run /change declare in Slack and create a C2 Change Request.
- Teleport to console-01-sv-gprd.
- Run the steps outlined above.
- Close the CR issue.
- Create a Slack reminder in #g_cloud_connector set to 6 months from now with a link to this runbook.
Rotating keys for customers.gitlab.com
Follow the instructions here.