ClickHouse Cloud Failure Remediation, Backup & Restore Process
Follow this runbook when a ClickHouse Cloud database is broken beyond repair and requires remediation.
Check metrics
Check the metrics of the failed instance. Can the issue be mitigated in the first instance by increasing the available memory?
- https://clickhouse.cloud/service/ad02dd6a-1dde-4f8f-858d-37462fd06058 shows available metrics.
- Speak to a ClickHouse Cloud administrator to attempt this (#f_clickhouse).
Speak to ClickHouse Cloud support
- In the first instance, open a P1 support ticket in ClickHouse Cloud.
- Ping the team in #f_clickhouse to make them aware of the failure.
- Consider pinging ClickHouse team members in clickhouse-gitlab-external431 to expedite the request.
Restoring from Backup
Only ClickHouse Cloud administrators are permitted to do this.
1. Create a new Admin API key: https://clickhouse.cloud/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/keys
   - Set an expiration of 1 hour.
2. Use cURL to list all clusters:

       curl -u '{API_KEY}:{API_SECRET}' \
         "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services"

3. Find the cluster in question and list its backups:

       curl -u '{API_KEY}:{API_SECRET}' \
         "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services/bf5e7003-585d-4767-84ed-13fe3b934c8d/backups"

4. Create a new service from the backup. Make sure to note the password in the response; it is only available once. This should take around 5-10 minutes, but depends on GCP. Set "name" to the name of the existing service, prefixed with "restored", and set "backupId" to the backup ID from step 3:

       curl -X POST "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services" \
         -H 'Content-Type: application/json' \
         -u '{API_KEY}:{API_SECRET}' \
         -d '{
           "tier": "production",
           "provider": "gcp",
           "region": "us-central1",
           "name": "restored-gitlab-com-production-TEST",
           "idleScaling": false,
           "backupId": "REPLACE ME"
         }'

5. Enable a private connection to the instance using the self-serve information: https://clickhouse.com/docs/en/manage/security/gcp-private-service-connect#add-endpoint-id-to-services-allow-list
6. Update the secrets and connection strings in Vault to connect to the new instance. There are two places to update connection strings (one and two).
7. Redeploy the latest version of the stack.
8. Check the following on the main team dashboard:
   - ClickHouse is still writing new data.
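
The restore request above carries its JSON body inline in the cURL command, where a stray comment or quote easily produces invalid JSON. As a minimal sketch, the body could instead be built from the service name and backup ID. The helper names `restored_name` and `build_restore_payload` are hypothetical and not part of this runbook or the ClickHouse Cloud API. The field values are copied from the command above, and the sketch assumes neither argument contains characters that need JSON escaping.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; adapt before real use.

# Print the name for the restored service: the existing name prefixed with "restored-".
restored_name() {
  printf 'restored-%s' "$1"
}

# Print the JSON body for the "create service from backup" request.
#   $1 = name of the existing (failed) service
#   $2 = backup ID taken from the backups listing
# Assumption: neither value contains quotes or backslashes, so no JSON escaping is done.
build_restore_payload() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: build_restore_payload SERVICE_NAME BACKUP_ID" >&2
    return 1
  fi
  printf '{"tier":"production","provider":"gcp","region":"us-central1","name":"%s","idleScaling":false,"backupId":"%s"}' \
    "$(restored_name "$1")" "$2"
}

# Usage (not executed here; SERVICE_NAME, BACKUP_ID, API_KEY and API_SECRET are
# placeholder variables you would set yourself):
#   curl -X POST "https://api.clickhouse.cloud/v1/organizations/8a0d56e3-d8f0-4e70-80bf-a8bf6ee950bd/services" \
#     -H 'Content-Type: application/json' \
#     -u "${API_KEY}:${API_SECRET}" \
#     -d "$(build_restore_payload "$SERVICE_NAME" "$BACKUP_ID")"
```

Building the body with `printf` keeps it free of inline comments, which JSON does not allow, and refusing empty arguments avoids sending a request with a blank name or backup ID.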