Cloud Connector
Cloud Connector is a way to access services common to multiple GitLab deployments, instances, and cells. It is not a dedicated service itself, but rather a collection of APIs, code, and configuration that standardizes the approach to authentication and authorization when integrating Cloud services with a GitLab instance.
This document contains general information on how Cloud Connector components are configured and operated by GitLab Inc. The intended audience is GitLab engineers and SREs who need to change the configuration of these components or triage issues with them.
See Cloud Connector architecture for more information.
Cloudflare
Any client consuming a Cloud Connector service must do so through cloud.gitlab.com, a public endpoint managed by Cloudflare. A "client" here is either a GitLab instance or an end-user application such as an IDE.
Cloudflare performs the following primary tasks for us:
- Global load balancing using Cloudflare’s Anycast network
- Enforcing WAF and other security rules such as rate limiting
- Routing requests into GitLab feature backends
The cloud.gitlab.com DNS record is fully managed by Cloudflare, i.e. Cloudflare acts as a reverse proxy. This means any client dialing this endpoint will reach a Cloudflare server, not a GitLab backend.
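You can verify this with a quick DNS lookup (a sketch; the exact addresses returned will vary):

```sh
# cloud.gitlab.com resolves to Cloudflare edge addresses, not GitLab backends,
# because Cloudflare fully manages the record and proxies all traffic.
dig +short cloud.gitlab.com
```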
See routing for more information on how requests are forwarded.
Routing and rate limits are configured here.
The default URL for the Cloudflare proxy is cloud.gitlab.com, which is used in production environments. It can be overridden by setting the CLOUD_CONNECTOR_BASE_URL environment variable. For example, we set this to cloud.staging.gitlab.com for our multi-tenant GitLab SaaS deployment. This directs any Cloud Connector traffic originating from staging.gitlab.com to cloud.staging.gitlab.com.
Self-managed customers are not expected to set this variable.
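As a minimal sketch, assuming the variable takes the full base URL including the scheme, the staging override would look something like this (in practice it is set through the deployment's configuration management rather than an interactive shell):

```sh
# Hypothetical illustration: route Cloud Connector traffic from a staging
# deployment to the staging Cloudflare proxy instead of cloud.gitlab.com.
export CLOUD_CONNECTOR_BASE_URL="https://cloud.staging.gitlab.com"
```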
Monitoring
Dashboards
Alerts
There are three ways to monitor traffic going through cloud.gitlab.com, each with their pros and cons:
- Instant Logs. Use this to monitor live traffic. This stream will only display the most basic properties of an HTTP request, such as method and path, but can be useful to filter and monitor traffic on the fly.
- Log Explorer. This tool allows querying historic logs using an SQL-like query language. It can surface all available information about HTTP requests, but has limited filtering capabilities. For example, since HTTP header fields are stored as an embedded JSON string, you cannot correlate log records with backend logs by filtering on correlation IDs.
- LogPush. This approach first pushes logs from Cloudflare to a Google Cloud Storage bucket, from which you can then stream these files to your machine as JSON, or load them into BigQuery for further analysis. To stream the last 30 minutes of HTTP request logs from a given timestamp into jq, run:

```sh
scripts/cloudflare_logs.sh -e cloud-connect-prd -d 2024-09-25T00:00 -t http -b 30 | jq .
```
If you wish to correlate log events between Cloudflare logs and the Rails application or Cloud Connector backends:
- Via request correlation IDs: Look for x-request-id in the Cloudflare RequestHeaders field. Correlate it with correlation_id in application services. This identifies an individual request.
- Via instance ID: Look for x-gitlab-instance-id in the Cloudflare RequestHeaders field. Correlate it with gitlab_instance_id in application services. This identifies an individual GitLab instance (both SM/Dedicated and gitlab.com).
- Via global user ID: Look for x-gitlab-global-user-id in the Cloudflare RequestHeaders field. Correlate it with gitlab_global_user_id in application services. This identifies an individual GitLab end-user (both SM/Dedicated and gitlab.com).
- Via caller IP: Look for ClientIP in Cloudflare logs. Correlate it with client_ip (or similar fields) in application services. This identifies either a GitLab instance or an end-user client such as an IDE from which the request originated.
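For example, when working with LogPush output, a jq filter along these lines can pull out the records for a single request (a sketch, assuming RequestHeaders is an embedded JSON string as described above; REQUEST_ID is a placeholder for a real correlation ID):

```sh
# Stream 30 minutes of logs and keep only records whose x-request-id header
# matches the correlation ID seen in the backend logs.
scripts/cloudflare_logs.sh -e cloud-connect-prd -d 2024-09-25T00:00 -t http -b 30 \
  | jq 'select((.RequestHeaders // "{}" | fromjson)["x-request-id"] == "REQUEST_ID")'
```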
Routing
While this is not systematically enforced, we require all clients that want to reach Cloud Connector backends to dial cloud.gitlab.com instead of the backends directly. Backends that use a public load balancer such as the GCP Global App LB should use Cloud Armor security policies to reject requests not coming from Cloudflare.
Routing is based on path prefix matching. Every Cloud Connector backend (e.g. the AI gateway) must be connected as a Cloudflare origin server with such a path prefix. For example, the AI gateway is routed via the /ai prefix, so requests to cloud.gitlab.com/ai/* are routed to the AI gateway, with the prefix stripped off (only * is forwarded).
You can see an example of this here.
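To make the prefix handling concrete, here is a sketch (the /v1/chat path is purely illustrative, not a real AI gateway route):

```sh
# The client dials the Cloudflare-managed endpoint using the /ai prefix...
curl https://cloud.gitlab.com/ai/v1/chat

# ...and the AI gateway origin receives the request with the prefix stripped,
# i.e. it sees the path /v1/chat.
```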
WAF Rules
The Cloud Connector zones (cloud.gitlab.com) are protected by the standard WAF rules used across GitLab, which can be found in the cloudflare-waf-rules module. This provides configuration for the WAF Custom Rules and Cloudflare Managed Rules that:
- block embargoed countries
- block requests that try to exploit some vulnerabilities
However, Cloud Connector does not use the standard Rate Limiting Rules provided by the cloudflare-waf-rules module; instead, it overrides these with customized rate limiting rules, outlined below.
Rate limiting
In addition to the standard WAF rules, we define rate limits that guard against malicious or misbehaving customer instances and clients. These rate limits are not to be confused with gitlab.com Cloudflare rate limits (which guard the gitlab.com application deployment) or application rate limits enforced in the Rails monolith itself.
Cloud Connector rate limits are instead enforced between the customer Rails instance (or end-user) and GitLab backend services. The following diagram illustrates where Cloud Connector rate limits fit into the overall rate limit setup, using the AI gateway as an example:
```mermaid
flowchart LR
    user(End-user request\nfrom IDE or web UI) --> is_dotcom
    is_dotcom{GitLab.com?} -- Y --> is_comda
    is_comda{Direct Access?} -- Y --> ccwaf
    is_comda -- N --> comwaf
    comwaf{{Cloudflare WAF / gitlab.com}} --> com_rails
    com_rails(GitLab.com Rails app /\nenforces in-app RLs /\nGitLab Inc. controls these) --> ccwaf
    is_dotcom -- N --> is_smda
    is_smda{Direct Access?} -- Y --> ccwaf
    is_smda -- N --> sm_rails
    sm_rails(Customer Rails app /\nenforces in-app RLs /\nGitLab Inc. can't control these) --> ccwaf
    ccwaf{{Cloudflare WAF / cloud.gitlab.com}} --> aigw
    aigw(AI gateway) --> vendorrl
    vendorrl{{AI vendor limits /\ndifficult to change}} --> aivendor
    aivendor(AI vendor / model)
```
The rate limits enforced in Cloudflare are specified per backend and can be classified as follows:
- Per user. Limits applied to a given user, identified by a unique global UID string.
- Per authentication attempt. Rate limits applied to clients that produce repeated 401 Unauthorized server responses. This is to prevent credential stuffing and similar attacks that brute-force authentication.
You can observe WAF events for custom rate limit rules here.
Key rotation
Cloud Connector uses JSON Web Tokens (JWTs) to authenticate requests in backends. For multi-tenant customers on gitlab.com, the Rails application issues and signs these tokens. For single-tenant customers (SM/Dedicated), CustomersDot issues and signs them. To validate tokens, Cloud Connector backends fetch the corresponding keys from the gitlab.com and CustomersDot Rails applications, respectively.
Keys should be rotated on a 6-month schedule, both in staging and production.
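To check which signing keys validators currently see, you can query the discovery endpoint referenced in the rotation steps below (a sketch, assuming the endpoint returns a standard JWKS document):

```sh
# List the key IDs currently published by the gitlab.com Rails application.
curl --silent https://gitlab.com/oauth/discovery/keys | jq '.keys[].kid'
```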
Rotating keys for gitlab.com
Keys must be rotated in staging and production. The general steps in both environments are listed below; a condensed command sketch follows the list.
- Run sudo gitlab-rake cloud_connector:keys:list to verify there is exactly one key.
- Run sudo gitlab-rake cloud_connector:keys:create to add a new key to rotate to.
- Run sudo gitlab-rake cloud_connector:keys:list to verify there are exactly two keys.
- Ensure validators have fetched the new key via OIDC Discovery. Since keys are cached both in HTTP caches and application-specific caches, this may require waiting at least 24 hours for these caches to expire. This process can be expedited by:
  - Restarting/redeploying backend services to evict their in-memory caches.
  - Purging HTTP caches in Cloudflare for the /oauth/discovery/keys endpoint.
- For the AI gateway only, ensure this dashboard shows no events.
- Run sudo gitlab-rake cloud_connector:keys:rotate to swap the current key with the new key, enacting the rotation.
- Monitor affected systems:
  - Ensure Puma and Sidekiq processes have swapped to the new key. This may take some time due to keys being cached in process memory.
  - Ensure all Puma and Sidekiq workers are now using the new key to sign requests.
- Do not proceed until:
  - Keys used to sign requests have converged fully to the new key.
  - Backends do not see elevated rates of 401 Unauthorized responses.
- Run sudo gitlab-rake cloud_connector:keys:trim to remove the now-unused key.
- Monitor affected systems as before to ensure the rotation was successful.
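For reference, the rake task sequence above condenses to the following (run on the appropriate Rails console node; the waiting and monitoring steps in between still apply):

```sh
sudo gitlab-rake cloud_connector:keys:list    # verify exactly one key exists
sudo gitlab-rake cloud_connector:keys:create  # add the new key to rotate to
sudo gitlab-rake cloud_connector:keys:list    # verify there are now two keys
# ...wait for validators to fetch the new key (see caching notes above)...
sudo gitlab-rake cloud_connector:keys:rotate  # swap the current and new keys
# ...monitor until request signing has fully converged to the new key...
sudo gitlab-rake cloud_connector:keys:trim    # remove the now-unused key
```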
Rotating keys in staging
- Run /change declare in Slack and create a C3 Change Request.
- Teleport to console-01-sv-gstg.
- Run the steps outlined above.
- Close the CR issue.
Rotating keys in production
- Run /change declare in Slack and create a C2 Change Request.
- Teleport to console-01-sv-gprd.
- Run the steps outlined above.
- Close the CR issue.
- Create a Slack reminder in #g_cloud_connector set to 6 months from now with a link to this runbook.
Rotating keys for customers.gitlab.com
Follow the instructions here.