Growth – Trials Health Runbook

This runbook covers four Prometheus alerts that monitor the GitLab trial provisioning pipeline end-to-end: SaaS trial creation, self-managed (SM) trial activation, and the Workato lead handoff to Marketo/CRM.

Note: This is a suite runbook covering four related alerts. All alerts are defined in mimir-rules/gitlab-gprd/growth/growth-trials-health.yml and route to #g_growth_trials_alerts.


These alerts fire when GitLab trial provisioning degrades in production (gprd). The pipeline spans two services:

  • CustomersDot (gitlab-subscriptions-prod) handles trial creation requests from GitLab.com, SM activation via cloudActivationActivate, and the Workato lead handoff via Workato::CreateLeadJob.
  • GitLab.com Rails handles SaaS trial registration and Aha D14 activation tracking.

Failures here mean potential customers cannot start trials, which directly affects Growth team OKRs and revenue pipeline.

| Metric prefix | Source | Baseline (gprd) |
| --- | --- | --- |
| growth_trial_creation_* | GCP log-based metric, CustomersDot | ~10K+/week (~1.67/min) |
| growth_sm_trial_activation_* | GCP log-based metric, cloudActivationActivate | ~100–123/week (~0.015/min) |
| growth_workato_lead_* | GCP log-based metric, Sidekiq Workato jobs | Proportional to trial creation |
| growth_marketo_inbound_* | GCP log-based metric, /marketo_trial endpoint | Low volume |
| growth_trial_activation_aha14_* | App-level LabKit counter, GitLab.com Rails | Subset of SaaS trials |
| growth_sm_trial_activation_duration_seconds | GCP log-based distribution metric, jsonPayload.duration_s | Trial-only request latency via lograge |
| growth_trial_provision_latency_seconds_bucket | App-level LabKit histogram, GitLab.com Rails (#968) | Preferred long-term; covers SaaS trial latency |

Thresholds are anchored to the baselines above. Failure rates are expected to be near zero in normal operation; alert thresholds represent a meaningful signal above noise rather than a percentage of traffic.

If metrics stop appearing in Grafana, check the GCP → Cloud Monitoring → OTEL → Mimir pipeline before assuming the underlying service is healthy. See the GrowthTrialCreationNearZero investigation step 4 below.

Critical filter: source_class_name="Trial"

The growth_sm_trial_activation_* metrics and the growth_sm_trial_activation_duration_seconds latency distribution must use jsonPayload.source_class_name="Trial" as a filter in their Cloud Logging metric definitions. Without it, the counters include subscription activations (cloudActivationActivate handles both), which inflates baselines and renders alert thresholds meaningless.

This field is set via SafeRequestStore by CloudActivations::ActivateService#execute and is present in both the lograge request log and service-level log lines. It was added in customers-gitlab-com!15060, which also enables separate latency profiling of the trial vs. subscription activation paths.

If this field is ever renamed or removed from CustomersDot, the GCP log-based metric filters will silently start counting all activations. Update the Cloud Logging metric filter expressions in gitlab-subscriptions-prod and re-validate baselines before re-enabling alerts.
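For reference, a sketch of the filter the SM activation success counter's metric definition should contain, mirroring the "successful activations only" query later in this runbook (the actual metric definition in gitlab-subscriptions-prod is authoritative):

```
resource.type="gce_instance"
logName="projects/gitlab-subscriptions-prod/logs/rails.production"
jsonPayload.message="Instance successfully cloud activated"
jsonPayload.source_class_name="Trial"
```

Dropping the last line is exactly the failure mode described above: the counter silently starts including subscription activations.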

| Alert | Severity | Condition | For | Expected frequency |
| --- | --- | --- | --- | --- |
| GrowthTrialCreationFailureSpike | s3 | failure rate > 0.5/5m | 10m | Rare; spikes during bad deploys |
| GrowthSMTrialActivationFailure | s3 | failure rate > 0.1/5m | 5m | Rare; near-zero baseline |
| GrowthTrialCreationNearZero | s2 | success rate < 0.01/s | 15m | Should almost never fire |
| GrowthWorkatoLeadErrors | s4 | error rate > 0.05/5m | 10m | Occasional; Workato API instability |
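As a rough sketch, one of these rules might look like the following in mimir-rules/gitlab-gprd/growth/growth-trials-health.yml. The expression, labels, and annotation text here are assumptions for illustration; the file itself is the authoritative definition:

```yaml
groups:
  - name: growth-trials-health
    rules:
      - alert: GrowthTrialCreationFailureSpike
        # "failure rate > 0.5/5m" expressed as an increase over a 5m window
        expr: sum(increase(growth_trial_creation_failure_total[5m])) > 0.5
        for: 10m
        labels:
          severity: s3
        annotations:
          summary: "CustomersDot SaaS trial creation failures are spiking"
```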

Silencing: alerts can be silenced in Alertmanager during planned CustomersDot maintenance windows or when a known incident is already being worked.

  • s2 (GrowthTrialCreationNearZero): Trial creation has effectively stopped. Create an incident immediately. Page @growth-oncall if outside business hours. All customers attempting to start a GitLab.com trial are affected.

  • s3 (GrowthTrialCreationFailureSpike, GrowthSMTrialActivationFailure): Elevated failure rate but not a total outage. Triage within 30 minutes during business hours. Create an incident if the rate is still climbing or if more than ~50 failures have accumulated. SM activation failures affect paying SM customers trying to activate trial licenses.

  • s4 (GrowthWorkatoLeadErrors): Trial creation is unaffected; only CRM/Marketo data is impacted. Investigate within the business day. No incident required unless data loss is significant and unrecoverable.

When an alert fires, first check for recent changes that may have caused it; CustomersDot deploys and incidents are discussed in #f_customersdot.

The trial provisioning pipeline depends on all of the following:

| Dependency | Type | Failure symptom |
| --- | --- | --- |
| CustomersDot Rails app (gitlab-subscriptions-prod) | Internal | All alerts |
| GCP log-based metrics pipeline (Cloud Monitoring → OTEL → Mimir) | Internal | Metrics absent; NearZero false-positive |
| cloud_activation_key service | Internal | SM activation failures |
| Workato API | External | GrowthWorkatoLeadErrors only |
| Marketo API | External | Downstream of Workato; silent until Workato fails |
| GitLab.com Rails (gprd) | Internal | SaaS trial creation failures |

GrowthTrialCreationFailureSpike

Severity: s3
Condition: growth_trial_creation_failure_total rate > 0.5/5m for 10 minutes.

What it means: CustomersDot is failing to provision SaaS trials at an elevated rate. The baseline failure rate is near zero; 0.5/5m represents a meaningful spike.

Investigation steps:

  1. Open the dashboard panel: Growth – Trials Health → Trial creation failure rate

  2. Check GCP Logs Explorer for CustomersDot errors:

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.type="customersdot"
    jsonPayload.severity="ERROR"

    Project: gitlab-subscriptions-prod

  3. Check Kibana for TrialsController errors:

    Index: pubsub-rails-inf-gprd*
    Filter: json.controller: "TrialsController" AND json.status >= 500 AND json.environment: "production"

  4. Check CustomersDot incident history and recent deploys in #f_customersdot.

  5. If the spike correlates with a deploy, consider reverting via the standard CustomersDot rollback process.

Resolution: Alert auto-resolves when the rate drops below 0.5/5m.

Possible resolutions: No past incidents documented yet. Add links here after the first real firing.


GrowthSMTrialActivationFailure

Severity: s3
Condition: growth_sm_trial_activation_failure_total rate > 0.1/5m for 5 minutes.

What it means: Self-managed trial activations via cloudActivationActivate are failing. Baseline is near zero (~100–123 SM activations/week in gprd). Any sustained failures are notable and affect SM customers trying to activate trials.

Investigation steps:

  1. Open the dashboard panel: Growth – Trials Health → SM trial activation failure rate

  2. Check GCP Logs Explorer for SM activation failures:

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.message="Trial creation failed"

    For a full request trace, find the correlation_id from a failure entry, then:

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.correlation_id="<correlation_id>"

    Source class: Trials::SelfManaged::BaseService (failure) and CloudActivations::ActivateService (activation errors like expired trial, feature flag disabled, incompatible GitLab version).

  3. Verify CustomersDot is healthy (check #f_customersdot for recent incidents).

  4. Check if the cloud_activation_key service or any upstream dependency is degraded.

Resolution: Alert auto-resolves when the rate drops below 0.1/5m.

Possible resolutions: No past incidents documented yet. Add links here after the first real firing.


GrowthTrialCreationNearZero

Severity: s2
Condition: growth_trial_creation_success_total rate < 0.01/s for 15 minutes while the metric series is present.

What it means: The trial creation pipeline appears to have stopped processing successful trials entirely. Expected baseline in gprd is ~1.67/min (~10K+/week). This suggests a complete outage of trial provisioning.
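The "while the metric series is present" guard matters: if the log-based metric pipeline stops exporting, a plain comparison returns no data rather than firing. A hedged PromQL sketch of the two halves (the exact expressions are assumptions; see the rules file):

```promql
# Fires on a genuine outage: the series exists but its rate is below 0.01/s.
sum(rate(growth_trial_creation_success_total[5m])) < 0.01

# Companion check for a broken metric pipeline (fires when the series vanishes);
# this failure mode belongs to #g_observability, not Growth.
absent(growth_trial_creation_success_total)
```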

Investigation steps:

  1. Open the dashboard panel: Growth – Trials Health → Trial creation success rate

  2. Verify CustomersDot is up and responding — check the CustomersDot service health dashboard and #f_customersdot.

  3. Check GCP Logs Explorer for any CustomersDot application errors:

    resource.type="gce_instance"
    logName="projects/gitlab-subscriptions-prod/logs/rails.production"
    jsonPayload.type="customersdot"
    jsonPayload.severity="ERROR"

    Project: gitlab-subscriptions-prod

  4. Confirm the GCP log-based metric is still being exported — check Cloud Monitoring metric explorer for custom.googleapis.com/growth_trial_creation_*. If absent, the metric pipeline (GCP → OTEL → Mimir) may be broken rather than the trial service itself. Contact #g_observability.

  5. If CustomersDot is healthy and metrics are flowing, check GitLab.com Rails for errors at the trial registration endpoint (/trials).

Resolution: Alert auto-resolves when the rate rises above 0.01/s.

Possible resolutions: No past incidents documented yet. Add links here after the first real firing.


GrowthWorkatoLeadErrors

Severity: s4
Condition: growth_workato_lead_error_total rate > 0.05/5m for 10 minutes.

What it means: The Workato::CreateLeadJob Sidekiq worker is failing to hand off trial leads to Workato / Marketo. This does not block trial creation for the end user but affects CRM / marketing data quality.

Investigation steps:

  1. Open the dashboard panel: Growth – Trials Health → Workato lead handoff error rate

  2. Check Kibana for Workato job errors:

    Index: pubsub-sidekiq-inf-gprd*
    Filter: json.class: "Workato::CreateLeadJob" AND json.job_status: "fail"

  3. Check if Workato’s API endpoint is responding — ask in #g_growth for a Workato status check.

  4. Check whether Workato::CreateLeadJob jobs are piling up in the Sidekiq dead queue; use the CustomersDot Sidekiq dashboard or ask in #f_customersdot.

Resolution: Alert auto-resolves when the rate drops below 0.05/5m.

Possible resolutions: No past incidents documented yet. Add links here after the first real firing.


| Scenario | Action |
| --- | --- |
| s2 alert (NearZero), business hours | Triage immediately; post in #g_growth_trials_alerts and #g_growth |
| s2 alert, outside business hours | Page @growth-oncall; if unresponsive after 15 min, escalate to engineering manager |
| s3 alert, business hours | Triage within 30 min; post update in #g_growth_trials_alerts |
| s3 alert, outside business hours | Triage next business morning unless rate is still climbing |
| s4 alert | Investigate within the business day; no page required |
| CustomersDot incident | Coordinate in #f_customersdot |
| Metrics pipeline broken | Contact #g_observability |

All CustomersDot errors (last 30 min):

resource.type="gce_instance"
logName="projects/gitlab-subscriptions-prod/logs/rails.production"
jsonPayload.type="customersdot"
jsonPayload.severity="ERROR"

Project: gitlab-subscriptions-prod

SM activation — successful activations only:

resource.type="gce_instance"
logName="projects/gitlab-subscriptions-prod/logs/rails.production"
jsonPayload.message="Instance successfully cloud activated"
jsonPayload.source_class_name="Trial"

Source: CloudActivations::ActivateService#activate_instance (Gitlab::Logger). The source_class_name="Trial" filter scopes results to trial activations (vs. subscription activations).

SM activation — trial creation failures:

resource.type="gce_instance"
logName="projects/gitlab-subscriptions-prod/logs/rails.production"
jsonPayload.message="Trial creation failed"

Source: Trials::SelfManaged::BaseService#execute (Gitlab::Logger)

SM activation — all trial activation requests (request-level):

resource.type="gce_instance"
logName="projects/gitlab-subscriptions-prod/logs/rails.production"
jsonPayload.params.value=~"cloudActivationActivate"
jsonPayload.source_class_name="Trial"

Source: lograge request log. source_class_name is set via SafeRequestStore by CloudActivations::ActivateService#execute — null means it’s a subscription activation, not a trial. Use jsonPayload.correlation_id to find companion service log lines for a specific request.

SM activation — trial request latency (slow activations):

resource.type="gce_instance"
logName="projects/gitlab-subscriptions-prod/logs/rails.production"
jsonPayload.params.value=~"cloudActivationActivate"
jsonPayload.source_class_name="Trial"
jsonPayload.duration_s > 5

The jsonPayload.duration_s field (lograge) is the source for the growth_sm_trial_activation_duration_seconds distribution metric. Use this query to spot-check slow individual requests during a latency regression. Prior to customers-gitlab-com!15060, source_class_name was absent and trial/subscription latencies could not be separated from this log.
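Assuming the distribution arrives in Mimir with standard histogram suffixes (_bucket, _sum, _count; verify in the metrics explorer before relying on this), a p95 of SM trial activation latency could be graphed as:

```promql
# p95 SM trial activation latency over 5m windows (bucket suffix assumed)
histogram_quantile(
  0.95,
  sum(rate(growth_sm_trial_activation_duration_seconds_bucket[5m])) by (le)
)
```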

Trace a specific activation by correlation_id:

resource.type="gce_instance"
logName="projects/gitlab-subscriptions-prod/logs/rails.production"
jsonPayload.correlation_id="<paste correlation_id here>"

Note on “Trial cloud activation creation successful”: This message comes from Trials::SelfManaged::CreateUltimateTrialService#after_trial_creation_actions and fires only when the CloudActivation database record is persisted — it’s a sub-step, not the instance activation itself. If absent from logs, check whether trial.create_cloud_activation is returning a non-persisted record.

TrialsController errors (Rails):

json.controller: "TrialsController" AND json.status >= 500 AND json.environment: "production"

Index: pubsub-rails-inf-gprd*

Workato::CreateLeadJob failures:

json.class: "Workato::CreateLeadJob" AND json.job_status: "fail"

Index: pubsub-sidekiq-inf-gprd*