Product Analytics ClickHouse Failure Remediation, Backup & Restore Process
Follow this runbook when a Product Analytics ClickHouse Cloud database is broken beyond repair and requires remediation.
Check metrics
- Check the metrics of the failed instance. Can the issue be mitigated in the first instance by increasing the available memory?
- The ClickHouse metrics dashboard shows the available metrics.
- Speak to a ClickHouse Cloud administrator to attempt this. (@mwoolf, @dennis should cover most working hours - alternatively try in #f_clickhouse)
Speak to ClickHouse Cloud support
- In the first instance, open a P1 support ticket in ClickHouse Cloud.
- Ping the team in #g_monitor_product_analytics and #analytics-section to make them aware of the failure.
- Consider pinging ClickHouse team members in clickhouse-gitlab-external431 to expedite the request.
Restoring from Backup
Only ClickHouse Cloud administrators are permitted to do this.
- Create a new Admin API key
- Set an expiration of 1 hour.
- Use cURL to list all clusters:
curl "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services" \
  -u '{API_KEY}:{API_SECRET}'
- Find the cluster in question and list its backups:
curl "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services/bf5e7003-585d-4767-84ed-13fe3b934c8d/backups" \
  -u '{API_KEY}:{API_SECRET}'
- Create a new service from the backup. Make sure to note the password in the response, as it is only available once. This should take around 5-10 minutes, but the timing depends on GCP. Set "name" to the name of the existing service, prefixed with 'restored', and set "backupId" to a backup ID from the backup listing in the previous step:
curl -X POST "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services" \
  -H 'Content-Type: application/json' \
  -u '{API_KEY}:{API_SECRET}' \
  -d '{
    "tier": "production",
    "provider": "gcp",
    "region": "us-central1",
    "name": "restored-product-analytics-TEST",
    "idleScaling": false,
    "backupId": "REPLACE ME"
  }'
- Enable a private connection to the instance using the self-serve information.
- Update the secrets and connection strings in Vault to connect to the new instance.
gitlab-com/gitlab-org/analytics-section/product-analytics/analytics-stack/prd-278964/analytics-stack
- Redeploy the latest version of the analytics-stack.
- Check the following on the main team dashboard:
- Vector is still ingesting data.
- ClickHouse is still writing new data.
- No unexpected errors in the configurator logs.
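The restore request in this section is easy to get wrong when the JSON body is edited by hand under pressure. The following is a minimal sketch, not an official script, of how the body could be generated from the service name and backup ID. The function names and the CH_ORG_ID, CH_KEY_ID and CH_KEY_SECRET environment variables are hypothetical names introduced for illustration. The fixed fields (tier, provider, region, idleScaling) are copied from the request above.

```shell
# Sketch only: CH_ORG_ID, CH_KEY_ID and CH_KEY_SECRET are hypothetical
# environment variable names, not part of the original runbook.

# Print the JSON body for the "create service from backup" request.
#   $1: name of the existing (failed) service
#   $2: backup ID taken from the backup listing
# Values are inserted verbatim, so they must not contain quotes or backslashes.
build_restore_payload() {
  printf '{"tier":"production","provider":"gcp","region":"us-central1","name":"restored-%s","idleScaling":false,"backupId":"%s"}\n' "$1" "$2"
}

# Send the restore request. The response contains the one-time password,
# so it is written to a file that only the current user can read.
restore_service() {
  ( umask 077
    curl -sS -X POST \
      "https://api.clickhouse.cloud/v1/organizations/${CH_ORG_ID}/services" \
      -H 'Content-Type: application/json' \
      -u "${CH_KEY_ID}:${CH_KEY_SECRET}" \
      -d "$(build_restore_payload "$1" "$2")" \
      > restore-response.json )
}
```

Under these assumptions, running `build_restore_payload product-analytics "$BACKUP_ID"` on its own prints the body for review before anything is sent, and `restore_service product-analytics "$BACKUP_ID"` sends it. Writing the response to a file is one way to avoid losing the password in terminal scrollback. The file should be deleted once the password is stored in Vault.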