Secrets Manager GKE (OpenBao) Service

We suggest the following filters to focus on relevant project audit logs in Grafana or GCP Logs Explorer:

resource.type="k8s_container"
resource.labels.namespace_name="secrets-manager-gke"
resource.labels.container_name="app"
jsonPayload.type="response"

OpenBao emits audit events as structured JSON on the app container’s stdout/stderr; Cloud Logging surfaces them under jsonPayload.

  • jsonPayload.request.namespace.path="org_<org_id>/group_<root_namespace_id>/<obj_type>_<obj_id>/" can be used to filter audit logs to a particular project or group.
    • org_id is the organization ID.
    • root_namespace_id is the ID of the top-level group.
    • obj_type is group or project.
    • obj_id is the ID of the group or project where the secrets manager lives.
    • Example: jsonPayload.request.namespace.path="org_1/group_2377064/project_74977306/".
  • jsonPayload.request.path =~ "secrets/kv/data/explicit/.*" can be used to filter to just secret value read operations.
    • An explicit secret name can also be given with jsonPayload.request.path = "secrets/kv/data/explicit/<SECRET-NAME>".
    • This is best used in conjunction with the above.
  • jsonPayload.auth.display_name=~"pipeline_jwt" selects runner-initiated requests; jsonPayload.auth.display_name=~"gitlab_rails_jwt" selects Rails-initiated requests.
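The clauses above can be assembled programmatically. The following is a minimal Python sketch, not part of the service tooling: the function name `audit_log_filter` and its parameters are hypothetical, and it only concatenates the clauses documented above. Logs Explorer treats clauses on separate lines as an implicit AND.

```python
def audit_log_filter(org_id, root_namespace_id, obj_type, obj_id, secret_name=None):
    """Build a Logs Explorer filter for one tenant's audit log entries.

    Hypothetical helper: it only joins the documented clauses together.
    """
    if obj_type not in ("group", "project"):
        raise ValueError("obj_type must be 'group' or 'project'")

    namespace_path = f"org_{org_id}/group_{root_namespace_id}/{obj_type}_{obj_id}/"
    clauses = [
        'resource.type="k8s_container"',
        'resource.labels.namespace_name="secrets-manager-gke"',
        'resource.labels.container_name="app"',
        'jsonPayload.type="response"',
        f'jsonPayload.request.namespace.path="{namespace_path}"',
    ]
    if secret_name is not None:
        # Narrow to read operations on one explicitly named secret.
        clauses.append(
            f'jsonPayload.request.path="secrets/kv/data/explicit/{secret_name}"'
        )
    return "\n".join(clauses)


if __name__ == "__main__":
    print(audit_log_filter(1, 2377064, "project", 74977306))
```

Pasting the printed text into Logs Explorer should scope the results to the example project used above.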

We suggest the following filters to focus on service logs (non-audit) in Grafana or GCP Logs Explorer:

resource.type="k8s_container"
resource.labels.namespace_name="secrets-manager-gke"
resource.labels.container_name="app"
-jsonPayload.request.remote_address:*

OpenBao writes all output (including audit events) to stderr on GKE, so GCP marks every log entry with ERROR severity regardless of the [INFO]/[WARN] level inside the message body. Treat the level inside the message as authoritative.
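When post-processing exported log entries, the in-message level can be recovered with a small parser. This is a minimal Python sketch under the assumption that the level appears as a bracketed token in the message body, as in the `[INFO]` and `[WARN]` examples above. The function name `in_message_level` is hypothetical, and the TRACE and DEBUG levels are assumed from common logger conventions rather than taken from this document.

```python
import re

# Bracketed level token inside the message body, e.g. "[INFO]" or "[WARN]".
_LEVEL_RE = re.compile(r"\[(TRACE|DEBUG|INFO|WARN|ERROR)\]")


def in_message_level(text_payload):
    """Return the first bracketed level in the message body, or None if absent."""
    match = _LEVEL_RE.search(text_payload)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(in_message_level("[INFO] core: vault is unsealed"))
```

Lines without a bracketed level, such as the startup banner, yield `None` and can be handled separately.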

Rails and runners reach OpenBao through CloudFlare and the Runway-managed GKE Gateway, which is backed by a Google Cloud external HTTP load balancer (see Architecture). The load balancer logs show which requests reached the Gateway and how it responded. They help separate an edge, routing, or timeout problem at the load balancer from a problem inside OpenBao. Filter on the forwarding rule for this service in GCP Logs Explorer:

resource.type="http_load_balancer"
resource.labels.forwarding_rule_name="gkegw1-ltbu-secrets-manager-gke-secrets-manager-gk-7yxpw13bbomy"

The forwarding rule name is per environment.

  • Production is gkegw1-ltbu-secrets-manager-gke-secrets-manager-gk-7yxpw13bbomy in the gitlab-runway-production project.
  • Staging is gkegw1-l52v-secrets-manager-gke-secrets-manager-gk-6w373ljxugpk in the gitlab-runway-staging project.

Useful fields on each entry:

  • httpRequest.status is the HTTP status the load balancer returned.
  • httpRequest.requestUrl includes the OpenBao path, so the tenant namespace is visible, for example org_1/group_<id>/project_<id>/secrets/kv/data/explicit/<name>.
  • httpRequest.latency is the round-trip time. A value near 30s on a 504 is the load balancer’s backend timeout.
  • httpRequest.remoteIp is CloudFlare’s edge address, not the runner or Rails. All traffic arrives through CloudFlare, so the originating client is not visible here.
  • jsonPayload.statusDetails is a string value that explains the outcome at the load balancer.
  • jsonPayload.enforcedSecurityPolicy.outcome is ACCEPT when the cloudflare-ingress-only-policy edge policy allowed the request.
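To group load balancer entries by tenant, the namespace can be pulled out of `httpRequest.requestUrl`. The sketch below is illustrative only: the function name `tenant_namespace` is hypothetical, it assumes numeric IDs as in the examples in this document, and the `/v1/` prefix in the sample URL is an assumption about the URL layout. The regular expression searches anywhere in the URL, so it does not depend on that prefix.

```python
import re

# org_<org_id>/group_<root_namespace_id>/<obj_type>_<obj_id>/ with numeric IDs.
_NAMESPACE_RE = re.compile(r"org_\d+/group_\d+/(?:group|project)_\d+/")


def tenant_namespace(request_url):
    """Return the tenant namespace path found in a request URL, or None."""
    match = _NAMESPACE_RE.search(request_url)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Sample URL; the "/v1/" prefix and the secret name are assumptions.
    url = (
        "https://secrets.gitlab.com/v1/org_1/group_2377064/"
        "project_74977306/secrets/kv/data/explicit/EXAMPLE"
    )
    print(tenant_namespace(url))
```

The returned value has the same shape as `jsonPayload.request.namespace.path`, so it can be reused to look up the matching audit log entries.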

The value of statusDetails is one of the following strings, paired with the HTTP code in httpRequest.status:

  • response_sent_by_backend is the normal case, where OpenBao answered. Any 5xx here came from OpenBao itself, so check Error logs.
  • backend_timeout accompanies a 504 after the backend does not respond within the load balancer’s timeout (about 30s). This is the signature of a slow or blocked OpenBao request. Check the OpenBao service logs for the same path and time.
  • failed_to_pick_backend accompanies a 503 when there is no healthy backend to route to. Check pod health.
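The mapping above can be expressed as a small triage function. This is a minimal Python sketch; the function name `triage_lb_entry` and the returned labels are hypothetical, and only the three `statusDetails` values listed above are handled.

```python
def triage_lb_entry(status, status_details):
    """Map a load balancer log entry to a coarse triage label."""
    if status_details == "response_sent_by_backend":
        # OpenBao answered; a 5xx here came from OpenBao itself.
        if 500 <= status <= 599:
            return "openbao_error"
        return "normal"
    if status_details == "backend_timeout":
        # Backend did not respond within the load balancer's timeout.
        return "backend_timeout"
    if status_details == "failed_to_pick_backend":
        # No healthy backend to route to.
        return "no_healthy_backend"
    return "unknown"


# Next step per label, following the guidance in the list above.
NEXT_STEP = {
    "normal": "no load balancer problem",
    "openbao_error": "check the OpenBao error logs",
    "backend_timeout": "check OpenBao service logs for the same path and time",
    "no_healthy_backend": "check pod health",
    "unknown": "inspect statusDetails manually",
}

if __name__ == "__main__":
    label = triage_lb_entry(504, "backend_timeout")
    print(label, "->", NEXT_STEP[label])
```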

When debugging a Secrets Manager incident, it is useful to check the caller side to see what was sent to OpenBao.

For create, update, and delete operations on Secrets and associated Permissions, use Kibana with the data view pubsub-rails-inf-gprd-*:

json.controller : ("Projects::SecretsController" or "Groups::SecretsController") or json.meta.caller_id : graphql\:*Secret* or json.path : "/api/v4/internal/secrets_manager/audit_logs"

Three OR clauses cover the user-facing surfaces: HTML UI controllers, GraphQL mutations and resolvers, and the OpenBao→Rails audit callback Grape endpoint. The filter excludes other code declaring feature_category :secrets_management (CI Secure Files, CI job-token logging), which is unrelated to the OpenBao Secrets Manager.

Provisioning, deprovisioning, and rotation reminder workers all run under the SecretsManagement::* namespace. Use Kibana with the data view pubsub-sidekiq-inf-gprd-*:

json.class : "SecretsManagement::*"

For GitLab.com shared runners, use Kibana with the data view pubsub-runner-inf-gprd:

json.msg : ("resolving secrets" or "reading from Vault" or "creating vault client" or "inline auth JWT")

Narrow with json.job : <job_id> or json.runner : <runner_id> once the affected job or runner is identified. json.correlation_id is often empty on runner-side Vault errors (the Vault SDK emits them outside any request context) — cross-trace to Rails/Sidekiq via json.job + timestamp instead.

Read node health from a few startup log lines and cross-reference the metrics. The queries below target the app service logs.

At startup, a healthy pod logs core: vault is unsealed then core: unsealed with stored key, which confirms GCP KMS auto-unseal. Open in GCP Logs Explorer:

resource.type="k8s_container"
resource.labels.namespace_name="secrets-manager-gke"
resource.labels.container_name="app"
textPayload:"core: vault is unsealed"

For the current state, check the secrets_manager_gke_core_unsealed metric. A 0 on any pod means a KMS or unseal problem.

A pod that fails to unseal logs core: vault is sealed and never reaches core: post-unseal setup complete (search for the sealed line). See the Pod sealed at startup playbook.

Exactly one pod should be active. Every pod logs core: entering standby mode first. The pod that wins the PostgreSQL HA lock then logs core: acquired lock, enabling active operation. Open in GCP Logs Explorer:

resource.type="k8s_container"
resource.labels.namespace_name="secrets-manager-gke"
resource.labels.container_name="app"
textPayload=~"acquired lock, enabling active operation|entering standby mode"

The secrets_manager_gke_core_active metric should be 1 on exactly one pod. Zero or two active pods means HA-lock churn. See Leadership lost or failover. A standby node stays at core: entering standby mode and answers /v1/sys/health with 429, which is expected (see Healthy startup baseline).
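The "exactly one active pod" rule can be checked mechanically once the per-pod values of `secrets_manager_gke_core_active` are in hand. The following is a minimal Python sketch; the function name `active_pod_verdict`, the input shape (a mapping of pod name to metric value), and the pod names are all hypothetical.

```python
def active_pod_verdict(core_active_by_pod):
    """Return (verdict, active_pods) for a mapping of pod name -> core_active value."""
    active = sorted(pod for pod, value in core_active_by_pod.items() if value == 1)
    if len(active) == 1:
        return "ok", active
    # Zero or more than one active pod points at HA-lock churn.
    return "ha_lock_churn", active


if __name__ == "__main__":
    # Hypothetical pod names and sample values.
    print(active_pod_verdict({"pod-a": 1, "pod-b": 0}))
```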

Filter on the in-message level, because Cloud Logging tags every line ERROR. The query returns body-level [ERROR] and [WARN] lines and excludes audit events. Set the window with the time-range picker. Open in GCP Logs Explorer (last 24h):

resource.type="k8s_container"
resource.labels.namespace_name="secrets-manager-gke"
resource.labels.container_name="app"
-jsonPayload.request.remote_address:*
textPayload=~"\[ERROR\]|\[WARN\]"

To narrow to a keyword, replace the textPayload clause, for example textPayload:"failed to acquire lock". The no recovery key found WARN is non-fatal (see Healthy startup baseline).

A pod’s startup begins with the ==> OpenBao server configuration: banner and ==> OpenBao server started!, then follows the healthy startup baseline. To isolate one revision’s logs, filter on the Helm chart version label and swap in the deployed version:

resource.type="k8s_container"
resource.labels.namespace_name="secrets-manager-gke"
resource.labels.container_name="app"
-jsonPayload.request.remote_address:*
labels."k8s-pod/helm_sh/chart"="secrets-manager-gke-1.5.1"

The banner reports the exact build:

Version: OpenBao v2.5.2+v2.5.2-gitlab1, built 2026-04-22T15:17:27Z
Version Sha: 42f8b5aab6ac68424c0e9f96031759f9395c4832+932fcf892eba8d646a9bfc58a59ea3b2475b17fa

Both audit devices register during post-unseal. Look for the two core: enabled audit backend lines, path=stdout/ type=file and path=remote/ type=http.

On a healthy boot, the active pod emits this sequence in the app container. A standby follows the same path up to core: entering standby mode and stops there.

==> OpenBao server started! Log data will stream in below:
[INFO] core: vault is unsealed
[INFO] core: unsealed with stored key
[INFO] core: entering standby mode
[INFO] core: acquired lock, enabling active operation
[INFO] core: enabled audit backend: path=stdout/ type=file
[INFO] core: enabled audit backend: path=remote/ type=http
[WARN] core: post-unseal upgrade seal keys failed: error="no recovery key found"
[INFO] core: post-unseal setup complete
  • core: unsealed with stored key (alongside core: vault is unsealed) confirms KMS auto-unseal succeeded.
  • Every pod logs core: entering standby mode first. The pod that wins the PostgreSQL HA lock then logs core: acquired lock, enabling active operation. The standby stays in standby and never runs post-unseal.
  • core: post-unseal setup complete is the active pod’s ready-to-serve signal. The standby never logs that line.
  • The two enabled audit backend lines confirm both devices: file to Cloud Logging, http to Rails (see Architecture).
  • [WARN] core: post-unseal upgrade seal keys failed: error="no recovery key found" is non-fatal. No recovery key is stored yet because the cluster was initialized with recovery_shares=0, so the warning logs on every boot and the node still reaches post-unseal setup complete. production#21589 tracks recovery key generation and storage.
  • A standby answers /v1/sys/health with 429 (sealed 503, active 200). Probes use ?standbyok=true, which makes a standby return the active status code instead, so a standby passes the probe.
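The baseline above can be turned into a simple classifier for one pod's startup log lines. This is a minimal Python sketch that matches only the message substrings quoted in this section; the function name `classify_boot` and the returned labels are hypothetical.

```python
def classify_boot(lines):
    """Classify one pod's startup state from its app container log lines."""
    text = "\n".join(lines)
    unsealed = "core: vault is unsealed" in text
    standby = "core: entering standby mode" in text
    active = "core: acquired lock, enabling active operation" in text
    ready = "core: post-unseal setup complete" in text

    if not unsealed:
        return "sealed"            # never reached the unsealed line
    if ready:
        return "active-ready"      # active pod's ready-to-serve signal
    if active:
        return "active-starting"   # won the lock, post-unseal still running
    if standby:
        return "standby"           # expected resting state for a standby
    return "unsealed-unknown"


if __name__ == "__main__":
    standby_lines = [
        "[INFO] core: vault is unsealed",
        "[INFO] core: unsealed with stored key",
        "[INFO] core: entering standby mode",
    ]
    print(classify_boot(standby_lines))
```

A pod that stays at `sealed` or `active-starting` for long is the case to investigate; `standby` and `active-ready` match the healthy baseline.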

A deploy changes the image SHA. The chart’s image tag is the released commit’s short Git SHA (see Architecture). Flux applies the new manifest and Kubernetes rolls the pods. A new pod starts and unseals, then the active role transfers when the old pod releases its HA lock.

| Container | Revision | Message | Explanation | Action needed |
| --- | --- | --- | --- | --- |
| app | new | core: vault is unsealed and core: unsealed with stored key | New pod booted and KMS-unsealed | None |
| app | new | core: acquired lock, enabling active operation | New pod took over as active | Confirm exactly one active pod via core_active |
| app | old | core: vault is sealed | Outgoing pod sealing as it shuts down | Expected during rollover |
| cloud-sql-proxy | new | The proxy has started successfully and is ready for new connections! | Database proxy ready on 127.0.0.1:5432 | If absent, app cannot reach PostgreSQL |
| pod event | new | ImagePullBackOff | kubelet cannot pull the new image SHA (registry auth, or image not yet pushed) | Shows in pod events and Flux, not app logs. Check kubectl describe pod and flux get sources oci on VPN. Usually clears after the image is pushed |

GitLab Secrets Manager is a built-in secrets management solution for CI pipelines. Secrets are created and managed using GitLab UI, and consumed by CI jobs.

GitLab Secrets Manager relies on the secrets-manager-gke Runway service. The service is configured and deployed using the secrets-manager-runway project.

secrets-manager-gke runs OpenBao, which is a fork of HashiCorp Vault. The source code of OpenBao lives in openbao-internal, a build project that is intended to modify the upstream OpenBao releases.

The Rails backend and runners connect to the secrets-manager-gke service (running OpenBao) through the CloudFlare WAF and the Runway-managed GKE Gateway. Both Rails and runners use the same external URL (https://secrets.gitlab.com); there is no separate internal Runway URL on GKE.

OpenBao stores data on the Cloud SQL instance provided by Runway, and gets the unseal key from Google KMS via GCP Workload Identity (no Vault secret is needed for KMS auth on GKE).

OpenBao is configured with two audit devices that fan out every audit event in parallel:

  • file device writing JSON to the app container’s stdout (surfaced in Cloud Logging — see the Audit Logging section)
  • http device POSTing the same events to the Rails backend at https://gitlab.com/api/v4/internal/secrets_manager/audit_logs

The GitLab Secrets Manager design docs provide request flow diagrams.

flowchart TB
    CloudFlare(CloudFlare: secrets.gitlab.com)
    KMS[GCP KMS]
    PostgreSQL[GCP CloudSQL from Runway]
    Gateway[Runway GKE Gateway]

    Rails-- Manage OpenBao -->CloudFlare
    Runner-- Fetch Pipeline Secrets -->CloudFlare
    CloudFlare-->Gateway
    Gateway-->OpenBao
    OpenBao-- Decrypt Unseal Key -->KMS
    OpenBao-- Storage -->PostgreSQL

The service runs multiple OpenBao pods:

  • a single active pod
  • one or more standby pods

Pods connect to the PostgreSQL backend to store data and to acquire a lock.

On GKE, pods coordinate directly via cluster port 8201 (pod-to-pod, no LB involvement).

flowchart TD
    Ingress

    Service_OB([HTTP API])

    subgraph OpenBao
        OB_1[Primary]
        OB_2[Standby A]
        OB_3[Standby B]

        Service_Primary([Primary gRPC])
    end

    Ingress --> Service_OB
    Service_OB --> OB_1
    Service_OB --> OB_2
    Service_OB --> OB_3

    OB_2 -. forward .-> Service_Primary
    OB_3 -. forward .-> Service_Primary

    Service_Primary --> OB_1

    OB_1 -->Service_DB
    OB_1 -. lock maintenance .->Service_DB
    OB_2 -. lock monitor .->Service_DB
    OB_3 -. lock monitor .->Service_DB

    Service_DB([PostgreSQL]) -->    DB[(PostgreSQL)]

    OB_1 -- auto-unseal --> KMS
    OB_2 -- auto-unseal --> KMS
    OB_3 -- auto-unseal --> KMS

Benchmarking and sizing recommendations are covered by gitlab#589411.

The service is deployed on Runway GKE. Replicas are fixed at min_instances: 2 / max_instances: 2 — no autoscaling. Two pods provide HA: one active and one standby, coordinating leadership via the PostgreSQL lock.

Scalability is configured in default-values.yaml.

GitLab Secrets Manager is limited to the Premium and Ultimate tiers. The feature needs to be enabled in a group or project.

The service is currently deployed in a single region: us-east1 (both staging and production). Per-environment Runway configuration lives in gke-service-staging.yaml and gke-service-production.yaml.

Runway provisions and manages the Cloud SQL instance backing OpenBao. On Runway GKE, backups are always on for the Cloud SQL instance.

Runway performs backup and backup restore validation as configured for the secrets-manager-gke service. See the Runway restore validation documentation for details.

Backup procedure:

  1. Back up the Cloud SQL PostgreSQL database (runway-db-secrets-manager-gke).
  2. Back up the unseal key material stored on Google Cloud KMS. See runbooks for our internal Vault service, which similarly relies on Google Cloud KMS.

For restore, we suggest the following steps:

  1. Scale OpenBao down to zero pods.
  2. Perform the PostgreSQL restore.
  3. Scale OpenBao back up.

The Cloud SQL PostgreSQL database only contains encrypted data, and the unseal key is stored on Google KMS.

On Runway GKE, KMS authentication uses GCP Workload Identity tied to the pod’s Kubernetes service account — there is no long-lived credential or Vault secret for KMS access.

The service comes with built-in Runway observability and metrics. Additionally, the OpenBao container exposes its own metrics.

OpenBao metrics for this service use the secrets_manager_gke prefix.

Note: SLIs and alerts for secrets-manager-gke are currently driven by Runway load-balancer metrics only (see metrics-catalog/services/secrets-manager-gke.jsonnet).

The secrets_manager_gke_* metrics are emitted by the OpenBao container and can be queried directly in Mimir, but they are not bound to any SLI for this service. To chart one in the browser, open Grafana Explore on the mimir-runway datasource and replace the secrets_manager_gke_core_unsealed expression with any metric in the table. To scope to an environment, add {environment="gprd"} or {environment="gstg"}.

See OpenBao telemetry docs for the full list. The table below lists the metrics most relevant for operating the service.

| Metric | Description |
| --- | --- |
| secrets_manager_gke_audit_log_request_failure | Number of audit log request failures |
| secrets_manager_gke_audit_device_log_response_failure | Number of audit log response failures |
| secrets_manager_gke_barrier_delete | Time taken to delete an entry from the barrier |
| secrets_manager_gke_barrier_get | Time taken to get an entry from the barrier |
| secrets_manager_gke_barrier_list | Time taken to list entries in the barrier |
| secrets_manager_gke_barrier_put | Time taken to put an entry in the barrier |
| secrets_manager_gke_cache_delete | Number of delete operations on the cache |
| secrets_manager_gke_cache_hit | Number of cache hits |
| secrets_manager_gke_cache_miss | Number of cache misses |
| secrets_manager_gke_cache_write | Number of cache writes |
| secrets_manager_gke_core_active | Whether the node is active (1) or standby (0) |
| secrets_manager_gke_core_unsealed | Whether the node is unsealed (1) or sealed (0) |
| secrets_manager_gke_core_leadership_lost | Number of times leadership was lost |
| secrets_manager_gke_core_leadership_setup_failed | Number of times leadership setup failed |
| secrets_manager_gke_core_in_flight_requests | Number of concurrent requests currently being processed |
| secrets_manager_gke_rollback_inflight | Number of rollback operations currently in flight |
| secrets_manager_gke_postgres_delete | Time taken to delete an entry from the PostgreSQL storage backend |
| secrets_manager_gke_postgres_get | Time taken to get an entry from the PostgreSQL storage backend |
| secrets_manager_gke_postgres_list | Time taken to list entries in the PostgreSQL storage backend |
| secrets_manager_gke_postgres_put | Time taken to put an entry in the PostgreSQL storage backend |
| secrets_manager_gke_runtime_alloc_bytes | Number of bytes allocated by the OpenBao process |

Notes:

  • Barrier and PostgreSQL metrics are summary metrics, exposing _count, _sum, and quantile series (0.5, 0.9, 0.99).
  • PostgreSQL metrics are named postgres (not postgresql) in the telemetry output, despite the documentation listing them as postgresql.
  • OpenBao is configured to exclude high-cardinality metrics.
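Because the barrier and PostgreSQL metrics are summaries, a mean duration over a window can be derived from two samples of the `_sum` and `_count` series. The sketch below shows only the arithmetic; the function name `mean_duration` and the sample values are hypothetical, and the result is in whatever unit the `_sum` series uses.

```python
def mean_duration(earlier, later):
    """Mean duration between two (sum, count) samples of a summary metric.

    Returns None when no new observations were recorded in the window
    (or when the counters reset, making the delta non-positive).
    """
    delta_sum = later[0] - earlier[0]
    delta_count = later[1] - earlier[1]
    if delta_count <= 0:
        return None
    return delta_sum / delta_count


if __name__ == "__main__":
    # Hypothetical samples of <metric>_sum and <metric>_count at two points in time.
    print(mean_duration((10.0, 4), (25.0, 9)))
```

The quantile series (0.5, 0.9, 0.99) remain the better signal for tail latency; the mean is mainly useful as a quick trend check.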

Excluded metrics:

  • usage_gauge_period is set to 0 to exclude the following metrics:
    • token.count
    • token.count.by_policy
    • token.count.by_auth
    • token.count.by_ttl
    • expire.leases.by_expiration
    • secret.kv.count
    • identity.entity.count
    • identity.entity.alias.count
  • prefix_filter is set to exclude the following metrics:
    • audit.* — excluded except for audit.log_request_failure, audit.log_request, audit.log_response_failure, and audit.log_response
    • rollback.attempt.* — per-mount rollback counters
    • route.* — per-route request timers

Production normally has no body-level [ERROR] lines. The error signatures below indicate when something breaks.

| Container | Error message | Explanation | Action needed |
| --- | --- | --- | --- |
| app | [WARN] core: post-unseal upgrade seal keys failed: error="no recovery key found" | Non-fatal. No recovery key is stored yet (initialized with recovery_shares=0). Logs on every boot. | Generate and store the recovery keys. See production#21589. |
| caller | Failed to authenticate with OpenBao | The JWT GitLab presented was rejected because of an OIDC issuer, aud, bound_audiences, or role mismatch | Check the caller logs (Rails web or Runner) and the audit log. Verify the JWT aud matches the role’s bound_audiences. |
| app | KMS or seal errors at startup. Pod stays at core: vault is sealed, never logs post-unseal setup complete, and core_unsealed is 0. | GCP KMS auto-unseal failed (KMS unreachable, or a key or workload-identity permission issue) | See Pod sealed at startup |
| app | HA-lock or leadership errors such as failed to acquire lock, with core_leadership_lost and core_leadership_setup_failed rising and core_active not exactly one | HA-lock contention, or database connectivity affecting the lock | See Leadership lost or failover |
| app | PostgreSQL connection or timeout errors in service logs, with rising postgres_* latency | Cloud SQL connectivity or saturation | See Cloud SQL connection or latency |
| Rails | 401 Unauthorized on POST /api/v4/internal/secrets_manager/audit_logs, with audit_log_request_failure rising | The http audit device’s shared token does not match the token Rails expects (Gitlab-Openbao-Auth-Token) | See Audit events not reaching Rails |

Each playbook below pairs a symptom with where to look and what to do. Metric names omit the secrets_manager_gke_ prefix (see Metrics).

A secrets manager stays in provisioning and you cannot create secrets. Check Sidekiq logs for SecretsManagement::* and find the failing worker (ProvisionProjectSecretsManagerWorker or ProvisionGroupSecretsManagerWorker) and its json.exception.message. There is no failed state, so a stuck record stays provisioning. The maintenance cron retries the task up to three times (Retrying failed secrets_manager maintenance task), then gives up.

Fix the worker error, then re-trigger provisioning. If every tenant is affected rather than one, suspect a failed OpenBao self-init instead. See Self-init failed.

A pod never serves when the core_unsealed metric is 0, core: post-unseal setup complete is missing, and core: vault is sealed appears with KMS or seal errors in the service logs. Verify GCP KMS reachability and the workload-identity permission on the unseal key (gitlab-sm-prod-unseal in gitlab-secrets-unseal-prod).

OpenBao self-initializes only once, on the very first boot of a fresh install with an empty database. It never self-initializes on restarts or upgrades.

On that first boot, OpenBao creates the global JWT auth mount and logs core: enabled credential backend: namespace="" path=gitlab_rails_jwt/ type=jwt. Every later boot (restart, upgrade, or new pod) loads the existing mount and logs core: successfully mounted: type=jwt ... path=gitlab_rails_jwt/ instead. If neither line appears, self-init did not complete. Rails then cannot authenticate to OpenBao, every auth call returns HTTP 401, and the service is down for all tenants.

Check the startup sequence and escalate. This is a service-wide issue, not a single-tenant one (gitlab#592186).

Audit events show in Cloud Logging (the file device) but are missing in GitLab, and the audit_log_request_failure metric rises. Check the http audit device in the service logs and the Rails web audit callback (/api/v4/internal/secrets_manager/audit_logs). A 401 means the shared audit token does not match.

Terraform generates the token in config-mgmt and writes it to two Vault paths with independent version counters. OpenBao reads runway/env/<env>/service/secrets-manager-gke/openbao-audit-token, injected as the GITLAB_OPENBAO_AUDIT_TOKEN environment variable and version-pinned in gke-service-<env>.yaml. Rails reads env/<env>/ns/gitlab/openbao/audit:token, mounted from the gitlab-openbao-audit-secret ExternalSecret and version-pinned in k8s-workloads/gitlab-com. A 401 usually means the two pins drifted.

Confirm the live versions with vault kv metadata get <path> and align both. To rotate, regenerate the token in config-mgmt, then bump the version on both sides together and deploy.

Either no pod is active, or leadership is flapping. The core_active metric is not exactly one, and core_leadership_lost and core_leadership_setup_failed climb. Look for core: leadership lost, stopping active operation and HA-lock errors. The lock lives in PostgreSQL, so check Cloud SQL health next.

Symptoms are intermittent timeouts, rising latency, or lock churn. Check the postgres_* metrics, the cloud-sql-proxy sidecar logs, and the runway-db-secrets-manager-gke runbook. OpenBao recovers on its own after latency normalizes.

A pipeline job fails to resolve a secret. Check the Runner logs by job ID for JWT, OIDC, or audience errors and the path attempted. Then check the audit log for the project’s namespace.path.

A missing entry means the request never reached OpenBao (network, auth mount, or OIDC issuer). A denied response points to a CEL policy or permission. Confirm the secret exists, and that its branch and environment scope and permission grants match the job.