Growth – Trials Health Runbook
This runbook covers four Prometheus alerts that monitor the GitLab trial provisioning pipeline end-to-end: SaaS trial creation, self-managed (SM) trial activation, and the Workato lead handoff to Marketo/CRM.
Note: This is a suite runbook covering four related alerts. All alerts are defined in
`mimir-rules/gitlab-gprd/growth/growth-trials-health.yml` and route to `#g_growth_trials_alerts`.
Overview
These alerts fire when GitLab trial provisioning degrades in production (gprd).
The pipeline spans two services:
- CustomersDot (`gitlab-subscriptions-prod`) handles trial creation requests from GitLab.com, SM activation via `cloudActivationActivate`, and the Workato lead handoff via `Workato::CreateLeadJob`.
- GitLab.com Rails handles SaaS trial registration and Aha D14 activation tracking.
Failures here mean potential customers cannot start trials, which directly affects Growth team OKRs and revenue pipeline.
Services
- CustomersDot service overview
- Growth – Trials Health dashboard
- Team: Growth / Monetization (`#g_growth` on Slack)
- Alerts channel: `#g_growth_trials_alerts`
- CustomersDot oncall/incidents: `#f_customersdot`
Metrics
| Metric prefix | Source | Baseline (gprd) |
|---|---|---|
| `growth_trial_creation_*` | GCP log-based metric, CustomersDot | ~10K+/week (~1.67/min) |
| `growth_sm_trial_activation_*` | GCP log-based metric, `cloudActivationActivate` | ~100–123/week (~0.015/min) |
| `growth_workato_lead_*` | GCP log-based metric, Sidekiq Workato jobs | Proportional to trial creation |
| `growth_marketo_inbound_*` | GCP log-based metric, `/marketo_trial` endpoint | Low volume |
| `growth_trial_activation_aha14_*` | App-level LabKit counter, GitLab.com Rails | Subset of SaaS trials |
| `growth_sm_trial_activation_duration_seconds` | GCP log-based distribution metric, `jsonPayload.duration_s` | Trial-only request latency via lograge |
| `growth_trial_provision_latency_seconds_bucket` | App-level LabKit histogram, GitLab.com Rails (#968) | Preferred long-term; covers SaaS trial latency |
Thresholds are anchored to the baselines above. Failure rates are expected to be near zero in normal operation; alert thresholds represent a meaningful signal above noise rather than a percentage of traffic.
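As a quick unit check, the baselines in the table can be converted between per-week, per-minute, and per-second rates and compared with an alert threshold. The sketch below is plain arithmetic on figures quoted in this runbook; the helper names are illustrative and not part of any GitLab tooling.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def per_week_to_per_min(events_per_week: float) -> float:
    """Convert a weekly event count to an average per-minute rate."""
    return events_per_week / MINUTES_PER_WEEK


def per_min_to_per_sec(rate_per_min: float) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return rate_per_min / 60


# SaaS trial creation: the table quotes ~1.67/min.
saas_per_sec = per_min_to_per_sec(1.67)

# GrowthTrialCreationNearZero uses a 0.01/s threshold (see Alert Behavior).
near_zero_threshold = 0.01
fraction_of_baseline = near_zero_threshold / saas_per_sec

print(f"baseline  ~ {saas_per_sec:.4f}/s")
print(f"threshold ~ {fraction_of_baseline:.0%} of baseline")
print(f"10K/week  ~ {per_week_to_per_min(10_000):.2f}/min")
```

On these figures, 1.67/min corresponds to roughly 16.8K events/week, so the per-minute figure matches the upper end of "10K+/week", and the NearZero threshold sits at about a third of the quoted baseline. Re-run the arithmetic whenever baselines are re-validated.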
If metrics stop appearing in Grafana, check the GCP → Cloud Monitoring → OTEL → Mimir pipeline before assuming the underlying service is healthy. See the GrowthTrialCreationNearZero investigation step 4 below.
Critical filter: source_class_name="Trial"
The growth_sm_trial_activation_* metrics and the
growth_sm_trial_activation_duration_seconds latency distribution must use
jsonPayload.source_class_name="Trial" as a filter in their Cloud Logging metric
definitions. Without it, the counters include subscription activations
(cloudActivationActivate handles both), which inflates baselines and renders
alert thresholds meaningless.
This field is set via SafeRequestStore by CloudActivations::ActivateService#execute
and is present in both the lograge request log and service-level log lines.
It was added in
customers-gitlab-com!15060,
which also enables separate latency profiling of the trial vs. subscription
activation paths.
If this field is ever renamed or removed from CustomersDot, the GCP log-based metric filters will silently start counting all activations. Update the Cloud Logging metric filter expressions in
`gitlab-subscriptions-prod` and re-validate baselines before re-enabling alerts.
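A minimal sketch of a guard for the requirement above, assuming the metric filter expressions are available as plain strings (for example, copied from the Cloud Logging metric definitions). The function name and the sample filters are hypothetical; only the field name and value come from this runbook. This is a textual check, so it would not catch a negated clause.

```python
import re

# Clause this runbook requires in every SM trial activation metric filter.
REQUIRED_CLAUSE = re.compile(r'jsonPayload\.source_class_name\s*=\s*"Trial"')


def is_trial_scoped(metric_filter: str) -> bool:
    """Return True if the filter text contains the trial-only clause."""
    return REQUIRED_CLAUSE.search(metric_filter) is not None


# Hypothetical filter strings, for illustration only.
scoped = (
    'jsonPayload.params.value=~"cloudActivationActivate" '
    'jsonPayload.source_class_name="Trial"'
)
unscoped = 'jsonPayload.params.value=~"cloudActivationActivate"'

for name, metric_filter in [("scoped", scoped), ("unscoped", unscoped)]:
    status = "ok" if is_trial_scoped(metric_filter) else "MISSING source_class_name filter"
    print(f"{name}: {status}")
```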
Alert Behavior
| Alert | Severity | Condition | For | Expected frequency |
|---|---|---|---|---|
| GrowthTrialCreationFailureSpike | s3 | failure rate > 0.5/5m | 10m | Rare; spikes during bad deploys |
| GrowthSMTrialActivationFailure | s3 | failure rate > 0.1/5m | 5m | Rare; near-zero baseline |
| GrowthTrialCreationNearZero | s2 | success rate < 0.01/s | 15m | Should almost never fire |
| GrowthWorkatoLeadErrors | s4 | error rate > 0.05/5m | 10m | Occasional; Workato API instability |
Silencing: alerts can be silenced in Alertmanager during planned CustomersDot maintenance windows or when a known incident is already being worked.
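The "For" column means the condition must hold continuously for that duration before the alert fires, so a brief spike does not page anyone. The sketch below models this with one sample per minute. It is a simplified illustration of a Prometheus-style `for` clause, not a reimplementation of rule evaluation, and the sample data is made up.

```python
from typing import Sequence


def fires(samples: Sequence[float], threshold: float, for_minutes: int) -> bool:
    """Return True if samples exceed `threshold` for `for_minutes`
    consecutive one-minute evaluations (simplified `for:` model)."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= for_minutes:
            return True
    return False


# Hypothetical per-minute failure rates.
brief_spike = [0.0, 0.9, 0.8, 0.0, 0.0, 0.0]
sustained = [0.6] * 10

print(fires(brief_spike, threshold=0.5, for_minutes=10))  # False
print(fires(sustained, threshold=0.5, for_minutes=10))    # True
```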
Severities
- s2 (GrowthTrialCreationNearZero): Trial creation has effectively stopped. Create an incident immediately. Page `@growth-oncall` if outside business hours. All customers attempting to start a GitLab.com trial are affected.
- s3 (GrowthTrialCreationFailureSpike, GrowthSMTrialActivationFailure): Elevated failure rate but not a total outage. Triage within 30 minutes during business hours. Create an incident if the rate is still climbing or if more than ~50 failures have accumulated. SM activation failures affect paying SM customers trying to activate trial licenses.
- s4 (GrowthWorkatoLeadErrors): Trial creation is unaffected; only CRM/Marketo data is impacted. Investigate within the business day. No incident required unless data loss is significant and unrecoverable.
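The s3 guidance can be read as a small decision rule. The sketch below encodes it under one stated assumption: "still climbing" is interpreted as the latest rate sample exceeding the previous one, which this runbook does not define precisely. The function name is hypothetical.

```python
def s3_should_open_incident(rates: list[float], accumulated_failures: int) -> bool:
    """Apply the s3 rule: open an incident if the failure rate is still
    climbing or more than roughly 50 failures have accumulated.

    `rates` is a chronological list of recent failure-rate samples.
    "Still climbing" = latest sample > previous sample (an assumption).
    """
    climbing = len(rates) >= 2 and rates[-1] > rates[-2]
    return climbing or accumulated_failures > 50


print(s3_should_open_incident([0.6, 0.7, 0.9], accumulated_failures=12))  # True
print(s3_should_open_incident([0.9, 0.7, 0.6], accumulated_failures=12))  # False
print(s3_should_open_incident([0.9, 0.7, 0.6], accumulated_failures=80))  # True
```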
Recent changes
When an alert fires, check for recent changes that may have caused it:
- CustomersDot recent deployments — look for merges in the last 2 hours
- gitlab-org/customers-gitlab-com recent MRs
- GitLab.com recent deployments — for GrowthTrialCreationNearZero when CustomersDot is healthy
Dependencies
The trial provisioning pipeline depends on all of the following:
| Dependency | Type | Failure symptom |
|---|---|---|
| CustomersDot Rails app (`gitlab-subscriptions-prod`) | Internal | All alerts |
| GCP log-based metrics pipeline (Cloud Monitoring → OTEL → Mimir) | Internal | Metrics absent; NearZero false-positive |
| `cloud_activation_key` service | Internal | SM activation failures |
| Workato API | External | GrowthWorkatoLeadErrors only |
| Marketo API | External | Downstream of Workato; silent until Workato fails |
| GitLab.com Rails (gprd) | Internal | SaaS trial creation failures |
Alerts
GrowthTrialCreationFailureSpike
Severity: s3
Condition: growth_trial_creation_failure_total rate > 0.5/5m for 10 minutes.
What it means: CustomersDot is failing to provision SaaS trials at an elevated rate. The baseline failure rate is near zero; 0.5/5m represents a meaningful spike.
Investigation steps:
1. Open the dashboard panel: Growth – Trials Health → Trial creation failure rate
2. Check GCP Logs Explorer for CustomersDot errors:

       resource.type="gce_instance"
       logName="projects/gitlab-subscriptions-prod/logs/rails.production"
       jsonPayload.type="customersdot"
       jsonPayload.severity="ERROR"

   Project: `gitlab-subscriptions-prod`
3. Check Kibana for TrialsController errors:
   Index: `pubsub-rails-inf-gprd*`
   Filter: `json.controller: "TrialsController" AND json.status >= 500 AND json.environment: "production"`
4. Check CustomersDot incident history and recent deploys in `#f_customersdot`.
5. If the spike correlates with a deploy, consider reverting via the standard CustomersDot rollback process.
Resolution: Alert auto-resolves when the rate drops below 0.5/5m.
Possible resolutions: No past incidents documented yet. Add links here after the first real firing.
GrowthSMTrialActivationFailure
Severity: s3
Condition: growth_sm_trial_activation_failure_total rate > 0.1/5m for 5 minutes.
What it means: Self-managed trial activations via cloudActivationActivate are
failing. Baseline is near zero (~100–123 SM activations/week in gprd). Any sustained
failures are notable and affect SM customers trying to activate trials.
Investigation steps:
1. Open the dashboard panel: Growth – Trials Health → SM trial activation failure rate
2. Check GCP Logs Explorer for SM activation failures:

       resource.type="gce_instance"
       logName="projects/gitlab-subscriptions-prod/logs/rails.production"
       jsonPayload.message="Trial creation failed"

   For a full request trace, find the `correlation_id` from a failure entry, then:

       resource.type="gce_instance"
       logName="projects/gitlab-subscriptions-prod/logs/rails.production"
       jsonPayload.correlation_id="<correlation_id>"

   Source class: `Trials::SelfManaged::BaseService` (failure) and `CloudActivations::ActivateService` (activation errors like expired trial, feature flag disabled, incompatible GitLab version).
3. Verify CustomersDot is healthy (check `#f_customersdot` for recent incidents).
4. Check if the `cloud_activation_key` service or any upstream dependency is degraded.
Resolution: Alert auto-resolves when the rate drops below 0.1/5m.
Possible resolutions: No past incidents documented yet. Add links here after the first real firing.
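When several failures arrive at once, the correlation_id trace from step 2 can be repeated per request. The sketch below builds one trace filter per distinct `correlation_id`, assuming failure entries have been exported as JSON objects with a `jsonPayload` key. The sample entries and the function name are made up for illustration.

```python
LOG_NAME = "projects/gitlab-subscriptions-prod/logs/rails.production"


def trace_filters(entries: list[dict]) -> list[str]:
    """Build one Logs Explorer filter per distinct correlation_id."""
    seen: set[str] = set()
    filters = []
    for entry in entries:
        cid = entry.get("jsonPayload", {}).get("correlation_id")
        if not cid or cid in seen:
            continue  # skip entries without an id, and duplicates
        seen.add(cid)
        filters.append(
            'resource.type="gce_instance"\n'
            f'logName="{LOG_NAME}"\n'
            f'jsonPayload.correlation_id="{cid}"'
        )
    return filters


# Made-up entries for illustration.
entries = [
    {"jsonPayload": {"message": "Trial creation failed", "correlation_id": "abc123"}},
    {"jsonPayload": {"message": "Trial creation failed", "correlation_id": "abc123"}},
    {"jsonPayload": {"message": "Trial creation failed"}},
]
for log_filter in trace_filters(entries):
    print(log_filter, end="\n\n")
```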
GrowthTrialCreationNearZero
Severity: s2
Condition: growth_trial_creation_success_total rate < 0.01/s for 15 minutes
while the metric series is present.
What it means: The trial creation pipeline appears to have stopped processing successful trials entirely. Expected baseline in gprd is ~1.67/min (~10K+/week). This suggests a complete outage of trial provisioning.
Investigation steps:
1. Open the dashboard panel: Growth – Trials Health → Trial creation success rate
2. Verify CustomersDot is up and responding: check the CustomersDot service health dashboard and `#f_customersdot`.
3. Check GCP Logs Explorer for any CustomersDot application errors:

       resource.type="gce_instance"
       logName="projects/gitlab-subscriptions-prod/logs/rails.production"
       jsonPayload.type="customersdot"
       jsonPayload.severity="ERROR"

   Project: `gitlab-subscriptions-prod`
4. Confirm the GCP log-based metric is still being exported: check Cloud Monitoring metric explorer for `custom.googleapis.com/growth_trial_creation_*`. If absent, the metric pipeline (GCP → OTEL → Mimir) may be broken rather than the trial service itself. Contact `#g_observability`.
5. If CustomersDot is healthy and metrics are flowing, check GitLab.com Rails for errors at the trial registration endpoint (`/trials`).
Resolution: Alert auto-resolves when the rate rises above 0.01/s.
Possible resolutions: No past incidents documented yet. Add links here after the first real firing.
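The key distinction in the steps above is between a missing metric series (pipeline problem) and a present series with a near-zero rate (trial outage). The sketch below expresses that triage as a function. The labels and the function name are hypothetical; the 0.01/s threshold comes from the alert condition.

```python
from typing import Optional

NEAR_ZERO_THRESHOLD = 0.01  # successes per second, from the alert condition


def classify_near_zero(success_rate: Optional[float]) -> str:
    """Classify the trial-creation signal.

    `success_rate` is None when the metric series is absent.
    """
    if success_rate is None:
        # Series missing: check GCP -> OTEL -> Mimir, contact #g_observability.
        return "metrics-pipeline"
    if success_rate < NEAR_ZERO_THRESHOLD:
        # Series present but flat: check CustomersDot, then GitLab.com Rails.
        return "trial-outage"
    return "healthy"


print(classify_near_zero(None))    # metrics-pipeline
print(classify_near_zero(0.0))     # trial-outage
print(classify_near_zero(0.028))   # healthy
```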
GrowthWorkatoLeadErrors
Severity: s4
Condition: growth_workato_lead_error_total rate > 0.05/5m for 10 minutes.
What it means: The Workato::CreateLeadJob Sidekiq worker is failing to hand
off trial leads to Workato / Marketo. This does not block trial creation for
the end user but affects CRM / marketing data quality.
Investigation steps:
1. Open the dashboard panel: Growth – Trials Health → Workato lead handoff error rate
2. Check Kibana for Workato job errors:
   Index: `pubsub-sidekiq-inf-gprd*`
   Filter: `json.class: "Workato::CreateLeadJob" AND json.job_status: "fail"`
3. Check if Workato’s API endpoint is responding: ask in `#g_growth` for a Workato status check.
4. Sidekiq dead queue: check whether `Workato::CreateLeadJob` jobs are piling up in the dead queue, either by asking in `#f_customersdot` or by checking the CustomersDot Sidekiq dashboard.
Resolution: Alert auto-resolves when the rate drops below 0.05/5m.
Possible resolutions: No past incidents documented yet. Add links here after the first real firing.
Escalation
| Scenario | Action |
|---|---|
| s2 alert (NearZero), business hours | Triage immediately; post in #g_growth_trials_alerts and #g_growth |
| s2 alert, outside business hours | Page @growth-oncall; if unresponsive after 15 min, escalate to engineering manager |
| s3 alert, business hours | Triage within 30 min; post update in #g_growth_trials_alerts |
| s3 alert, outside business hours | Triage next business morning unless rate is still climbing |
| s4 alert | Investigate within the business day; no page required |
| CustomersDot incident | Coordinate in #f_customersdot |
| Metrics pipeline broken | Contact #g_observability |
Saved investigation queries
GCP Logs Explorer
All CustomersDot errors (last 30 min):

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.type="customersdot"
    jsonPayload.severity="ERROR"

Project: gitlab-subscriptions-prod
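For terminal use, the same filter can be passed to `gcloud logging read`. The sketch below only builds the command string and does not run it. It assumes the `gcloud` CLI is installed and authenticated for the project; the helper name and the default limit are illustrative choices.

```python
import shlex

PROJECT = "gitlab-subscriptions-prod"

# Same clauses as the Logs Explorer query, joined explicitly with AND.
ERROR_FILTER = " AND ".join([
    'resource.type="gce_instance"',
    f'logName="projects/{PROJECT}/logs/rails.production"',
    'jsonPayload.type="customersdot"',
    'jsonPayload.severity="ERROR"',
])


def gcloud_read_command(log_filter: str, freshness: str = "30m", limit: int = 50) -> str:
    """Return a shell-quoted `gcloud logging read` command for the filter."""
    args = [
        "gcloud", "logging", "read", log_filter,
        f"--project={PROJECT}",
        f"--freshness={freshness}",  # look back this far from now
        f"--limit={limit}",
        "--format=json",
    ]
    return shlex.join(args)


print(gcloud_read_command(ERROR_FILTER))
```

`shlex.join` quotes the filter so its embedded double quotes survive the shell; paste the printed command as-is.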
SM activation — successful activations only:

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.message="Instance successfully cloud activated"
    jsonPayload.source_class_name="Trial"

Source: CloudActivations::ActivateService#activate_instance (Gitlab::Logger)
source_class_name="Trial" scopes to trial activations (vs subscription activations).
SM activation — trial creation failures:

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.message="Trial creation failed"

Source: Trials::SelfManaged::BaseService#execute (Gitlab::Logger)
SM activation — all trial activation requests (request-level):

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.params.value=~"cloudActivationActivate"
    jsonPayload.source_class_name="Trial"

Source: lograge request log. source_class_name is set via SafeRequestStore
by CloudActivations::ActivateService#execute; null means it’s a subscription
activation, not a trial. Use jsonPayload.correlation_id to find companion
service log lines for a specific request.
SM activation — trial request latency (slow activations):

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.params.value=~"cloudActivationActivate"
    jsonPayload.source_class_name="Trial"
    jsonPayload.duration_s > 5

The jsonPayload.duration_s field (lograge) is the source for the
growth_sm_trial_activation_duration_seconds distribution metric. Use this query
to spot-check slow individual requests during a latency regression. Prior to
customers-gitlab-com!15060,
source_class_name was absent and trial/subscription latencies could not be
separated from this log.
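If request logs have already been exported as JSON, the same spot-check can be done offline. The sketch below filters for trial activations slower than a threshold, assuming each entry is a dict with a `jsonPayload` key carrying `source_class_name` and `duration_s`. The sample entries and the function name are made up for illustration.

```python
def slow_trial_activations(entries: list[dict], threshold_s: float = 5.0) -> list[float]:
    """Return sorted duration_s values above the threshold, trials only."""
    durations = []
    for entry in entries:
        payload = entry.get("jsonPayload", {})
        if payload.get("source_class_name") != "Trial":
            continue  # null or absent means a subscription activation
        duration = payload.get("duration_s")
        if duration is not None and float(duration) > threshold_s:
            durations.append(float(duration))
    return sorted(durations)


# Made-up entries for illustration.
entries = [
    {"jsonPayload": {"source_class_name": "Trial", "duration_s": 7.2}},
    {"jsonPayload": {"source_class_name": "Trial", "duration_s": 0.4}},
    {"jsonPayload": {"duration_s": 9.9}},  # no source_class_name: not a trial
]
print(slow_trial_activations(entries))  # [7.2]
```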
Trace a specific activation by correlation_id:

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.correlation_id="<paste correlation_id here>"

Note on “Trial cloud activation creation successful”: This message comes from
Trials::SelfManaged::CreateUltimateTrialService#after_trial_creation_actions and
fires only when the CloudActivation database record is persisted. It is a sub-step,
not the instance activation itself. If absent from logs, check whether
trial.create_cloud_activation is returning a non-persisted record.
Kibana
TrialsController errors (Rails):

    json.controller: "TrialsController" AND json.status >= 500 AND json.environment: "production"

Index: pubsub-rails-inf-gprd*
Workato::CreateLeadJob failures:

    json.class: "Workato::CreateLeadJob" AND json.job_status: "fail"

Index: pubsub-sidekiq-inf-gprd*
Definitions
Related Links
- Growth – Trials Health dashboard
- Growth Error Budget
- CustomersDot Overview
- Growth team handbook
- gitlab-org/growth/team-tasks#958 — parent observability issue