Debug failed chef provisioning

We provision GCP machines with terraform and Chef. Most machines are provisioned using one of several terraform modules. For example, one of these modules would declare a bootstrap module instance, which copies a bootstrap script to the new instance and configures that script to run on boot.

This bootstrap script is responsible for enrolling the machine with our Chef server, using an initial runlist and environment obtained from GCE instance metadata.
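To check what a machine would see at bootstrap time, the metadata server can be queried from the instance itself. The following is a minimal sketch, not the actual bootstrap script: the attribute name in the usage comment (chef_run_list) is hypothetical, so check the terraform module for the real keys. The metadata URL and the Metadata-Flavor header are the standard GCE ones.

```shell
#!/usr/bin/env bash
# Sketch: read custom attributes from the GCE metadata server.
# Attribute names are assumptions; the real keys are set by the terraform modules.

METADATA_BASE="http://metadata.google.internal/computeMetadata/v1/instance/attributes"

# Build the metadata URL for one custom instance attribute.
metadata_url() {
  printf '%s/%s' "$METADATA_BASE" "$1"
}

# Fetch one attribute. Only works from inside a GCE instance;
# --fail makes curl return non-zero if the attribute does not exist.
get_metadata() {
  curl --silent --fail --header 'Metadata-Flavor: Google' "$(metadata_url "$1")"
}

# Usage (on the instance, hypothetical key):
#   get_metadata chef_run_list
```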

Most of our chef roles depend on some base role that adds ssh users and authorized_keys using a cookbook. As a base role dependency, this cookbook runs early in the provisioning process. If a later cookbook fails during the initial chef bootstrap, we usually ssh in to debug the problem.

If you are iterating on early-run recipes, or on the bootstrap script itself, the bootstrapping run may fail before it reaches the point at which you can ssh into the machine. This runbook covers that case.

The startup script logs are visible on the serial console output. They include the chef-client logs, because chef-client is run in the foreground. In GCP the serial console output can be accessed using gcloud, e.g.:

gcloud --project=gitlab-production compute instances tail-serial-port-output \
sidekiq-besteffort-06-sv-gprd --zone=us-east1-b
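Serial console output can be long, so it may help to narrow it down to chef-client failures. The sketch below assumes the failing run logged lines at ERROR or FATAL level, which is the usual chef-client log format. The helper name filter_chef_errors is made up for illustration. It uses get-serial-port-output, which prints the existing output once instead of following it.

```shell
#!/usr/bin/env bash
# Sketch: keep only lines logged at ERROR or FATAL level, prefixed with
# their line number in the serial output. "|| true" keeps the exit status
# zero when nothing matches.
filter_chef_errors() {
  grep -nE '(ERROR|FATAL):' || true
}

# Usage, with the same instance as above:
#   gcloud --project=gitlab-production compute instances get-serial-port-output \
#     sidekiq-besteffort-06-sv-gprd --zone=us-east1-b 2>/dev/null | filter_chef_errors
```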

If the serial console output is not enough and you need a shell on the machine:

  1. Using the GCP web console, add a public IP to the instance if it doesn’t already have one.
  2. Temporarily add an ssh key for your GCP user and ssh in:
     gcloud compute --project "gitlab-staging-1" ssh --zone "us-east1-d" gke-gstg-gitlab-gke-node-pool-2019092-7eaabcbf-1zl6 --tunnel-through-iap
     The --tunnel-through-iap flag is there in case the instance has no public IP.
  3. Remove the public IP (if created).
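
The steps above can also be sketched with gcloud alone, as an alternative to the web console. This is an illustration under stated assumptions, not a tested procedure: the helper name print_debug_ssh_steps is hypothetical, and it assumes the temporary access config uses gcloud's default name, external-nat (an IP added through the web console may be named differently). The function only prints the commands so that they can be reviewed before running them by hand.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the gcloud commands for the three steps.
# Assumes the access config name "external-nat", the gcloud default.
print_debug_ssh_steps() {
  local project="$1" zone="$2" instance="$3"
  # Step 1: add a public IP.
  echo "gcloud compute instances add-access-config ${instance} --project=${project} --zone=${zone}"
  # Step 2: ssh in; gcloud adds a key for your user if needed.
  echo "gcloud compute ssh ${instance} --project=${project} --zone=${zone} --tunnel-through-iap"
  # Step 3: remove the public IP again.
  echo "gcloud compute instances delete-access-config ${instance} --project=${project} --zone=${zone} --access-config-name=external-nat"
}

# Usage:
#   print_debug_ssh_steps gitlab-staging-1 us-east1-d <instance-name>
```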