Product Analytics ClickHouse Failure Remediation, Backup & Restore Process

Follow this runbook when a Product Analytics ClickHouse Cloud database is broken beyond repair and requires remediation.

  1. Check the metrics of the failed instance. Can the issue be mitigated in the first instance by increasing the available memory?
    • The ClickHouse metrics dashboard shows the available metrics.
    • Ask a ClickHouse Cloud administrator to attempt this. @mwoolf and @dennis should cover most working hours; alternatively, ask in #f_clickhouse.
  2. Open a P1 support ticket in ClickHouse Cloud.
  3. Ping the team in #g_monitor_product_analytics and #analytics-section to make them aware of the failure.
  4. Consider pinging ClickHouse team members in clickhouse-gitlab-external431 to expedite the request.

Only ClickHouse Cloud administrators are permitted to restore a service from a backup.

  1. Create a new Admin API key
    • Set an expiration of 1 hour.
  2. Use cURL to list all clusters: curl "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services" -u '{API KEY}:{API SECRET}'
  3. Find the cluster in question and list its backups: curl "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services/bf5e7003-585d-4767-84ed-13fe3b934c8d/backups" -u '{API KEY}:{API SECRET}'
  4. Create a new service from the backup. Note the password in the response, as it is only available once. This should take around 5-10 minutes, but the timing depends on GCP:
# "name" should be the same name as the existing service, prefixed with 'restored'.
# "backupId" is the backup ID from step 3.
# The comments are kept outside the request body because JSON does not allow them.
curl -X "POST" "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services" \
-H 'Content-Type: application/json' \
-u '{API KEY}:{API SECRET}' \
-d '{
"tier": "production",
"provider": "gcp",
"region": "us-central1",
"name": "restored-product-analytics-TEST",
"idleScaling": false,
"backupId": "REPLACE ME"
}'
  5. Enable a private connection to the instance using the self-serve information.
  6. Update the secrets and connection strings in Vault so that they point to the new instance: gitlab-com/gitlab-org/analytics-section/product-analytics/analytics-stack/prd-278964/analytics-stack
  7. Redeploy the latest version of the analytics-stack.
  8. Check the following on the main team dashboard:
    • Vector is still ingesting data.
    • ClickHouse is still writing new data.
    • No unexpected errors in the configurator logs.
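
The API calls in the restore procedure (listing services, listing backups, and creating a service from a backup) repeat the same organization ID and request body, so a few small helpers can cut down on copy-paste mistakes during an incident. The following is a minimal sketch, not part of the official process. The function names and the CH_API_KEY / CH_API_SECRET environment variables are hypothetical, and the helpers assume the service name and backup ID are plain identifiers with no quotes or backslashes.

```shell
#!/bin/sh
# Sketch only: helper names and environment variables are hypothetical.

API_BASE="https://api.clickhouse.cloud/v1"
ORG_ID="8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd"  # organization ID used in this runbook

# Print the endpoint that lists all services (also the endpoint for creating one).
services_url() {
  printf '%s/organizations/%s/services\n' "$API_BASE" "$ORG_ID"
}

# Print the endpoint that lists the backups of one service.
# $1 = service ID
backups_url() {
  printf '%s/organizations/%s/services/%s/backups\n' "$API_BASE" "$ORG_ID" "$1"
}

# Print the JSON body for the restore request, using the same fields as the
# runbook example. The new service name is the existing name prefixed with
# "restored-".
# $1 = name of the existing service, $2 = backup ID
restore_payload() {
  printf '{"tier":"production","provider":"gcp","region":"us-central1","name":"restored-%s","idleScaling":false,"backupId":"%s"}\n' "$1" "$2"
}

# Example usage (needs network access and a valid Admin API key):
#   curl -u "$CH_API_KEY:$CH_API_SECRET" "$(services_url)"
#   curl -u "$CH_API_KEY:$CH_API_SECRET" "$(backups_url "$SERVICE_ID")"
#   curl -X POST -H 'Content-Type: application/json' \
#     -u "$CH_API_KEY:$CH_API_SECRET" \
#     -d "$(restore_payload product-analytics "$BACKUP_ID")" "$(services_url)"
```

Because the helpers only print URLs and the request body, their output can be inspected before anything is sent to the API.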