Secrets Manager GKE (OpenBao) Service

We suggest the following filters to focus on relevant project audit logs in Grafana or GCP Logs Explorer:

resource.type="k8s_container"
resource.labels.namespace_name="secrets-manager-gke"
resource.labels.container_name="app"
jsonPayload.type="response"

OpenBao emits audit events as structured JSON on the app container’s stdout/stderr; Cloud Logging surfaces them under jsonPayload.

  • jsonPayload.request.namespace.path="org_<org_id>/group_<root_namespace_id>/<obj_type>_<obj_id>/" can be used to filter audit logs to a particular project or group.
    • org_id is the organization ID.
    • root_namespace_id is the ID of the top-level group.
    • obj_type is group or project.
    • obj_id is the ID of the group or project where the secrets manager lives.
    • Example: jsonPayload.request.namespace.path="org_1/group_2377064/project_74977306/".
  • jsonPayload.request.path =~ "secrets/kv/data/explicit/.*" can be used to filter to just secret value read operations.
    • An explicit secret name can also be given with jsonPayload.request.path = "secrets/kv/data/explicit/<SECRET-NAME>".
    • This is best used in conjunction with the above.
  • jsonPayload.auth.display_name=~"pipeline_jwt" selects runner-initiated requests; jsonPayload.auth.display_name=~"gitlab_rails_jwt" selects Rails-initiated requests.
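The filters above can be assembled programmatically when you need to scope audit logs to one project or group. The following is a minimal sketch, not part of any GitLab tooling: the function names `namespace_path` and `audit_filter` are hypothetical, and the output is only the filter text to paste into Logs Explorer.

```python
from typing import Optional


def namespace_path(org_id: int, root_namespace_id: int, obj_type: str, obj_id: int) -> str:
    """Build the OpenBao namespace path for a group or project."""
    if obj_type not in ("group", "project"):
        raise ValueError("obj_type must be 'group' or 'project'")
    return f"org_{org_id}/group_{root_namespace_id}/{obj_type}_{obj_id}/"


def audit_filter(
    org_id: int,
    root_namespace_id: int,
    obj_type: str,
    obj_id: int,
    secret_name: Optional[str] = None,
    reads_only: bool = False,
) -> str:
    """Return a Logs Explorer filter for audit events of one project or group."""
    path = namespace_path(org_id, root_namespace_id, obj_type, obj_id)
    lines = [
        'resource.type="k8s_container"',
        'resource.labels.namespace_name="secrets-manager-gke"',
        'resource.labels.container_name="app"',
        'jsonPayload.type="response"',
        f'jsonPayload.request.namespace.path="{path}"',
    ]
    if secret_name is not None:
        # Exact match on a single explicit secret.
        lines.append(f'jsonPayload.request.path="secrets/kv/data/explicit/{secret_name}"')
    elif reads_only:
        # Regex match on all secret value reads.
        lines.append('jsonPayload.request.path=~"secrets/kv/data/explicit/.*"')
    return "\n".join(lines)


# Example using the IDs from the namespace path example above.
print(audit_filter(1, 2377064, "project", 74977306, reads_only=True))
```

Adding one of the `jsonPayload.auth.display_name` clauses by hand narrows the result further to runner-initiated or Rails-initiated requests.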

We suggest the following filters to focus on service logs (non-audit) in Grafana or GCP Logs Explorer:

resource.type="k8s_container"
resource.labels.namespace_name="secrets-manager-gke"
resource.labels.container_name="app"
-jsonPayload.request.remote_address:*

OpenBao writes all output (including audit events) to stderr on GKE, so GCP marks every log entry with ERROR severity regardless of the [INFO]/[WARN] level inside the message body. Treat the level inside the message as authoritative.
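When post-processing exported log entries, the level can be recovered from the message body instead of the GCP severity. The sketch below assumes the level appears as a bracketed token such as [INFO] or [WARN] somewhere in the message text; the helper name `message_level` and the sample messages are illustrative only.

```python
import re

# Bracketed level token as it appears inside the message body.
LEVEL_RE = re.compile(r"\[(TRACE|DEBUG|INFO|WARN|ERROR)\]")


def message_level(message: str, default: str = "UNKNOWN") -> str:
    """Return the level found in the message body, ignoring GCP severity."""
    match = LEVEL_RE.search(message)
    return match.group(1) if match else default


# Illustrative messages, not captured from the service.
print(message_level("2025-01-01T00:00:00.000Z [INFO]  core: example message"))
print(message_level("2025-01-01T00:00:01.000Z [WARN]  core: another example"))
```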

When debugging a Secrets Manager incident, it is useful to check the caller side to see what was sent to OpenBao.

For create, update, and delete operations on Secrets and associated Permissions, use Kibana with data view pubsub-rails-inf-gprd-*:

json.controller : ("Projects::SecretsController" or "Groups::SecretsController") or json.meta.caller_id : graphql\:*Secret* or json.path : "/api/v4/internal/secrets_manager/audit_logs"

Three OR clauses cover the user-facing surfaces: HTML UI controllers, GraphQL mutations and resolvers, and the OpenBao→Rails audit callback Grape endpoint. The filter excludes other code declaring feature_category :secrets_management (CI Secure Files, CI job-token logging), which is unrelated to OpenBao Secrets Manager.

For provisioning, deprovisioning, and rotation reminder workers (all under the SecretsManagement::* namespace), use Kibana with data view pubsub-sidekiq-inf-gprd-*:

json.class : "SecretsManagement::*"

For gitlab.com Shared Runners, use Kibana with data view pubsub-runner-inf-gprd:

json.msg : ("resolving secrets" or "reading from Vault" or "creating vault client" or "inline auth JWT")

Narrow with json.job : <job_id> or json.runner : <runner_id> once the affected job or runner is identified. json.correlation_id is often empty on runner-side Vault errors (the Vault SDK emits them outside any request context) — cross-trace to Rails/Sidekiq via json.job + timestamp instead.
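The narrowing step and the timestamp-based cross-trace can be sketched as follows. The helper names `runner_query` and `trace_window` are hypothetical, the five-minute padding is an arbitrary assumption, and quoting the runner ID is a choice made here for safety rather than something the original query requires.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

RUNNER_MESSAGES = (
    "resolving secrets",
    "reading from Vault",
    "creating vault client",
    "inline auth JWT",
)


def runner_query(job_id: Optional[int] = None, runner_id: Optional[str] = None) -> str:
    """Build the Kibana query for runner logs, optionally narrowed to a job or runner."""
    messages = " or ".join(f'"{m}"' for m in RUNNER_MESSAGES)
    clauses = [f"json.msg : ({messages})"]
    if job_id is not None:
        clauses.append(f"json.job : {job_id}")
    if runner_id is not None:
        clauses.append(f'json.runner : "{runner_id}"')
    return " and ".join(clauses)


def trace_window(
    event_time: datetime, padding: timedelta = timedelta(minutes=5)
) -> Tuple[datetime, datetime]:
    """Return a time range around a runner event for searching Rails/Sidekiq logs."""
    return event_time - padding, event_time + padding


print(runner_query(job_id=123))
start, end = trace_window(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))
print(start.isoformat(), end.isoformat())
```

Use the returned window as the Kibana time range when searching the Rails and Sidekiq data views for the same job ID.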

GitLab Secrets Manager is a built-in secrets management solution for CI pipelines. Secrets are created and managed using the GitLab UI, and consumed by CI jobs.

GitLab Secrets Manager relies on the secrets-manager-gke Runway service. The service is configured and deployed using the secrets-manager-runway project.

secrets-manager-gke runs OpenBao, which is a fork of HashiCorp Vault. The source code of OpenBao lives in openbao-internal, a build project that is intended to modify the upstream OpenBao releases.

The Rails backend and runners connect to the secrets-manager-gke service (running OpenBao) through the Cloudflare WAF and the Runway-managed GKE Gateway. Both Rails and runners use the same external URL (https://secrets.gitlab.com); there is no separate internal Runway URL on GKE.

OpenBao stores data on the Cloud SQL instance provided by Runway, and gets the unseal key from Google KMS via GCP Workload Identity (no Vault secret is needed for KMS auth on GKE).

OpenBao is configured with two audit devices that fan out every audit event in parallel:

  • file device writing JSON to the app container’s stdout (surfaced in Cloud Logging — see the Audit Logging section)
  • http device POSTing the same events to the Rails backend at https://gitlab.com/api/v4/internal/secrets_manager/audit_logs

The GitLab Secrets Manager design docs provide request flow diagrams.

flowchart TB
    CloudFlare(CloudFlare: secrets.gitlab.com)
    KMS[GCP KMS]
    PostgreSQL[GCP CloudSQL from Runway]
    Gateway[Runway GKE Gateway]

    Rails-- Manage OpenBao -->CloudFlare
    Runner-- Fetch Pipeline Secrets -->CloudFlare
    CloudFlare-->Gateway
    Gateway-->OpenBao
    OpenBao-- Decrypt Unseal Key -->KMS
    OpenBao-- Storage -->PostgreSQL

The service runs multiple OpenBao pods:

  • a single active pod
  • one or more standby pods

Pods connect to the PostgreSQL backend to store data and to acquire a lock.

On GKE, pods coordinate directly via cluster port 8201 (pod-to-pod, no LB involvement).

flowchart TD
    Ingress

    Service_OB([HTTP API])

    subgraph OpenBao
        OB_1[Primary]
        OB_2[Standby A]
        OB_3[Standby B]

        Service_Primary([Primary gRPC])
    end

    Ingress --> Service_OB
    Service_OB --> OB_1
    Service_OB --> OB_2
    Service_OB --> OB_3

    OB_2 -. forward .-> Service_Primary
    OB_3 -. forward .-> Service_Primary

    Service_Primary --> OB_1

    OB_1 -->Service_DB
    OB_1 -. lock maintenance .->Service_DB
    OB_2 -. lock monitor .->Service_DB
    OB_3 -. lock monitor .->Service_DB

    Service_DB([PostgreSQL]) -->    DB[(PostgreSQL)]

    OB_1 -- auto-unseal --> KMS
    OB_2 -- auto-unseal --> KMS
    OB_3 -- auto-unseal --> KMS

Benchmarking and sizing recommendations are covered by gitlab#589411.

The service is deployed on Runway GKE. Replicas are fixed at min_instances: 2 / max_instances: 2 — no autoscaling. Two pods provide HA: one active and one standby, coordinating leadership via the PostgreSQL lock.
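To build intuition for how one active pod and one standby pod coordinate through a shared lock, here is a toy in-memory model. It is not OpenBao's implementation: the class `LockTable`, the pod names, and the 15-second TTL are all hypothetical, and a real deployment relies on the PostgreSQL storage backend rather than a Python object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lock:
    holder: Optional[str] = None
    valid_until: float = 0.0


class LockTable:
    """In-memory stand-in for the shared lock held in PostgreSQL."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.lock = Lock()

    def try_acquire(self, pod: str, now: float) -> bool:
        """Take or renew the lock if it is free, expired, or already held by this pod."""
        lock = self.lock
        if lock.holder in (None, pod) or now >= lock.valid_until:
            self.lock = Lock(holder=pod, valid_until=now + self.ttl)
            return True
        return False

    def active(self, now: float) -> Optional[str]:
        """Return the pod currently holding a valid lock, if any."""
        if self.lock.holder is not None and now < self.lock.valid_until:
            return self.lock.holder
        return None


table = LockTable(ttl=15.0)
table.try_acquire("pod-a", now=0.0)   # pod-a becomes active
table.try_acquire("pod-b", now=5.0)   # standby: the lock is still held
table.try_acquire("pod-a", now=10.0)  # active pod renews the lock
table.try_acquire("pod-b", now=30.0)  # lock expired, pod-b takes over
print(table.active(now=31.0))
```

The model shows why a standby only takes over after the active pod stops renewing: until the lock expires, every acquisition attempt by the standby fails.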

Scalability is configured in default-values.yaml.

GitLab Secrets Manager is limited to the Premium and Ultimate tiers. The feature needs to be enabled in a group or project.

The service is currently deployed in a single region: us-east1 (both staging and production). Per-environment Runway configuration lives in gke-service-staging.yaml and gke-service-production.yaml.

Runway provisions and manages the Cloud SQL instance backing OpenBao. On Runway GKE, backups are always on for the Cloud SQL instance.

Runway performs backup and backup restore validation as configured for the secrets-manager-gke service. See the Runway restore validation documentation for details.

Backup procedure:

  1. Back up the Cloud SQL PostgreSQL database (runway-db-secrets-manager-gke).
  2. Back up the unseal key material stored on Google Cloud KMS. See runbooks for our internal Vault service, which similarly relies on Google Cloud KMS.

For restore, we suggest the following steps:

  1. Scale OpenBao down to zero pods.
  2. Perform the PostgreSQL restore.
  3. Scale OpenBao back up.
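The three restore steps can be written down as a dry-run command plan before anything is executed. This is only a sketch under several assumptions: that OpenBao runs as a Kubernetes Deployment whose name you supply, that manual scaling is permitted (Runway may manage replica counts through its own configuration instead), and that the restore uses an existing Cloud SQL backup ID. The function `restore_plan` is hypothetical and only returns the commands; it does not run them.

```python
from typing import List


def restore_plan(
    deployment: str,
    namespace: str,
    instance: str,
    backup_id: str,
    replicas: int,
) -> List[List[str]]:
    """Return the ordered commands for a restore without executing them."""
    return [
        # 1. Scale OpenBao down to zero pods.
        ["kubectl", "scale", f"deployment/{deployment}", "--replicas=0", "-n", namespace],
        # 2. Restore the Cloud SQL instance from the chosen backup.
        ["gcloud", "sql", "backups", "restore", backup_id, f"--restore-instance={instance}"],
        # 3. Scale OpenBao back up.
        ["kubectl", "scale", f"deployment/{deployment}", f"--replicas={replicas}", "-n", namespace],
    ]


# The deployment name and backup ID below are placeholders.
plan = restore_plan(
    deployment="<deployment-name>",
    namespace="secrets-manager-gke",
    instance="runway-db-secrets-manager-gke",
    backup_id="<backup-id>",
    replicas=2,
)
for command in plan:
    print(" ".join(command))
```

Review the printed plan with the Runway team before running any of it, since the supported restore path may differ from direct kubectl and gcloud calls.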

The Cloud SQL PostgreSQL database only contains encrypted data, and the unseal key is stored on Google KMS.

On Runway GKE, KMS authentication uses GCP Workload Identity tied to the pod’s Kubernetes service account — there is no long-lived credential or Vault secret for KMS access.

The service comes with built-in Runway observability and metrics. Additionally, the OpenBao container exposes its own metrics.

OpenBao metrics for this service use the secrets_manager_gke prefix.

Note: SLIs and alerts for secrets-manager-gke are currently driven by Runway load-balancer metrics only (see metrics-catalog/services/secrets-manager-gke.jsonnet). The secrets_manager_gke_* metrics are emitted by the OpenBao container and can be queried directly in Mimir, but they are not bound to any SLI for this service.

See OpenBao telemetry docs for the full list. The table below lists the metrics most relevant for operating the service.

| Metric | Description |
| --- | --- |
| secrets_manager_gke_audit_log_request_failure | Number of audit log request failures |
| secrets_manager_gke_audit_device_log_response_failure | Number of audit log response failures |
| secrets_manager_gke_barrier_delete | Time taken to delete an entry from the barrier |
| secrets_manager_gke_barrier_get | Time taken to get an entry from the barrier |
| secrets_manager_gke_barrier_list | Time taken to list entries in the barrier |
| secrets_manager_gke_barrier_put | Time taken to put an entry in the barrier |
| secrets_manager_gke_cache_delete | Number of delete operations on the cache |
| secrets_manager_gke_cache_hit | Number of cache hits |
| secrets_manager_gke_cache_miss | Number of cache misses |
| secrets_manager_gke_cache_write | Number of cache writes |
| secrets_manager_gke_core_active | Whether the node is active (1) or standby (0) |
| secrets_manager_gke_core_unsealed | Whether the node is unsealed (1) or sealed (0) |
| secrets_manager_gke_core_leadership_lost | Number of times leadership was lost |
| secrets_manager_gke_core_leadership_setup_failed | Number of times leadership setup failed |
| secrets_manager_gke_core_in_flight_requests | Number of concurrent requests currently being processed |
| secrets_manager_gke_rollback_inflight | Number of rollback operations currently in flight |
| secrets_manager_gke_postgres_delete | Time taken to delete an entry from the PostgreSQL storage backend |
| secrets_manager_gke_postgres_get | Time taken to get an entry from the PostgreSQL storage backend |
| secrets_manager_gke_postgres_list | Time taken to list entries in the PostgreSQL storage backend |
| secrets_manager_gke_postgres_put | Time taken to put an entry in the PostgreSQL storage backend |
| secrets_manager_gke_runtime_alloc_bytes | Number of bytes allocated by the OpenBao process |

Notes:

  • Barrier and PostgreSQL metrics are summary metrics, exposing _count, _sum, and quantile series (0.5, 0.9, 0.99).
  • PostgreSQL metrics are named postgres (not postgresql) in the telemetry output, despite the documentation listing them as postgresql.
  • OpenBao is configured to exclude high-cardinality metrics.
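Because the barrier and PostgreSQL metrics are summaries, an average latency can be derived from the `_sum` and `_count` series, and a specific quantile can be selected by label. The sketch below only builds PromQL strings for use in Mimir; the helper names `avg_latency` and `quantile_series` and the 5m window are assumptions made for illustration.

```python
PREFIX = "secrets_manager_gke"


def avg_latency(metric: str, window: str = "5m") -> str:
    """PromQL for the mean observation of a summary metric over a window."""
    name = f"{PREFIX}_{metric}"
    return f"rate({name}_sum[{window}]) / rate({name}_count[{window}])"


def quantile_series(metric: str, quantile: str = "0.99") -> str:
    """PromQL selector for one precomputed quantile of a summary metric."""
    return f'{PREFIX}_{metric}{{quantile="{quantile}"}}'


print(avg_latency("postgres_get"))
print(quantile_series("barrier_put", "0.99"))
```

Precomputed summary quantiles are calculated per pod, so they should be read per series rather than averaged across pods.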

Excluded metrics:

  • usage_gauge_period is set to 0 to exclude the following metrics:
    • token.count
    • token.count.by_policy
    • token.count.by_auth
    • token.count.by_ttl
    • expire.leases.by_expiration
    • secret.kv.count
    • identity.entity.count
    • identity.entity.alias.count
  • prefix_filter is set to exclude the following metrics:
    • audit.* — excluded except for audit.log_request_failure, audit.log_request, audit.log_response_failure, and audit.log_response
    • rollback.attempt.* — per-mount rollback counters
    • route.* — per-route request timers
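The interaction between a broad exclusion such as audit.* and its exceptions can be modeled as prefix rules where the most specific match wins. The sketch below assumes rules written with a leading "+" (allow) or "-" (block), that the longest matching prefix takes precedence, that a block wins over an allow for the same prefix, and that unmatched metrics are allowed. The rule strings are illustrative and are not copied from the service configuration, which may also include the metrics prefix in each rule.

```python
from typing import List


def is_allowed(metric: str, rules: List[str], default: bool = True) -> bool:
    """Decide whether a metric is emitted under prefix allow/block rules."""
    best_len = -1
    allowed = default
    for rule in rules:
        sign, prefix = rule[0], rule[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with '+' or '-': {rule!r}")
        if not metric.startswith(prefix):
            continue
        # Longest prefix wins; on a tie, blocking takes priority.
        if len(prefix) > best_len or (len(prefix) == best_len and sign == "-"):
            best_len = len(prefix)
            allowed = sign == "+"
    return allowed


# Hypothetical rules mirroring the exclusions listed above.
RULES = [
    "-audit.",
    "+audit.log_request",
    "+audit.log_response",
    "-rollback.attempt.",
    "-route.",
]

for name in ("audit.log_request_failure", "audit.file.log_request", "route.read.secret", "core.unsealed"):
    print(name, is_allowed(name, RULES))
```

Note that a rule like "+audit.log_request" is a prefix, so it also covers audit.log_request_failure without needing a separate entry.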