Chef troubleshooting
Symptoms
HAProxy is missing workers:
lb7.cluster.gitlab.com HAProxy_gitlab_443/worker4.cluster.gitlab.com is UNKNOWN - Check output not found in local checks
Nodes are missing chef roles:
jeroen@xps15:~/src/gitlab/chef-repo$ bundle exec knife node show worker1.cluster.gitlab.com
Node Name:   worker1.cluster.gitlab.com
Environment: _default
FQDN:        worker1.cluster.gitlab.com
IP:          10.1.0.X
Run List:
Roles:
Recipes:
Platform:    ubuntu 16.04
Tags:
Knife ssh does not work:
bundle exec knife ssh "name:worker1.cluster.gitlab.com" "uptime"
WARNING: Failed to connect to  -- Errno::ECONNREFUSED: Connection refused - connect(2)
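To confirm the first symptom directly on a load balancer, the HAProxy admin socket can list backend and server states. This is only a hedged sketch: the socket path is an assumption and should be taken from the stats socket line in the local haproxy.cfg.
# dump backend/server status from the HAProxy stats socket (socket path is an assumption)
echo "show stat" | sudo socat stdio /run/haproxy/admin.sock | grep -i worker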
Resolution
1. Check if the workers have the chef role gitlab-cluster-worker. The HAProxy config is generated with a chef search on this specific role:
bundle exec knife node show worker1.cluster.gitlab.com
If not, restore the worker via knife node from file:
bundle exec knife node from file worker1.cluster.gitlab.com.json
Run chef-client on the node. When the chef-client run is finished on the nodes, force a chef-client run on the load balancers to regenerate the HAProxy config with the workers (a verification sketch follows this list):
bundle exec knife ssh -p2222 -a ipaddress role:gitlab-cluster-lb 'sudo chef-client'
bundle exec knife ssh -p2222 -a ipaddress role:gitlab-cluster-lb-pages 'sudo chef-client'
2. See resolution steps at point 1.
3. Check if the IP address is correct for the node:
bundle exec knife node show worker1.cluster.gitlab.com
If ipaddress contains a wrong public IP, update /etc/ipaddress.txt on the node and run chef-client.
If ipaddress contains a private (local) IP, make sure /etc/ipaddress.txt is set and the node has at least the chef role base-X, where X is the OS type (debian etc.); check chef-repo/roles/base-* for all current base roles.
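To verify steps 1 and 3 after chef-client has run, a chef search on the role plus a targeted attribute lookup is usually enough. This is a minimal sketch using standard knife subcommands, with the node and role names taken from the examples above.
# confirm the restored worker appears in the search that feeds the HAProxy config
bundle exec knife search node 'role:gitlab-cluster-worker' -a ipaddress
# spot-check the ipaddress attribute for a single node (step 3)
bundle exec knife node show worker1.cluster.gitlab.com -a ipaddress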
Alerts
Chef client failures have reached critical levels
Alert name: ChefClientErrorCritical
Alert text: At least 10% of type TYPE are failing chef-runs
What to do:
- Find one of the nodes that is affected:
  - The alert is summarized; click the link to the Prometheus graph from the alert (to get to the alerting environment easily) and adjust the query to just be chef_client_error > 0. It should list a metric for each node that is currently broken, from which you can select one of the type that is alerting. There will often be some correlation/commonality that stands out and suggests a suitable first candidate.
  - Alternatively, you can use Thanos, which will list the nodes in each environment.
- On that node, inspect the chef logs (sudo grep chef-client /var/log/syslog | less) to determine what's broken.
- It could be anything, but td-agent with incompatible gem combinations is a common cause. In that case you can use td-agent-gem to manually adjust the installed versions until the gems, often google-related, are all compatible with each other (compare against a still-functional node for versions if necessary), or delete all the installed gems and start again (running chef-client may bootstrap things again in that case). A hedged sketch of this gem-repair loop follows the list.
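The td-agent gem repair described above is iterative; the following is only a hedged sketch of that loop, and the gem name and version shown are illustrative rather than known-good values.
# find the failing resource in the chef logs
sudo grep chef-client /var/log/syslog | less
# list the gems installed under td-agent and compare with a healthy node
sudo td-agent-gem list | grep -i google
# pin a version that matches the healthy node (gem name/version are examples only)
sudo td-agent-gem install google-protobuf -v 3.21.12
# re-run chef to confirm the node converges cleanly again
sudo chef-client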
Overriding Alerts when Chef Cannot be Repaired
There may be times when you cannot repair Chef, but you want to stop the paging.
This incident is an example of a situation where failed Chef runs were the last runs before chef-client was disabled.
We want to remove the alerts for clarity until the root cause can be fixed.
The Prometheus Node Exporter stores chef-client metrics in /opt/prometheus/node_exporter/metrics/chef-client.prom, and we can overwrite this file's data, specifically the value of chef_client_error.
Here is an example command to clear a single VM's chef-client error:
$ sudo sed -i "s/chef_client_error 1.0/chef_client_error 0.0/" /opt/prometheus/node_exporter/metrics/chef-client.prom
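If a whole group of nodes is affected, the same sed can be pushed through the knife ssh pattern used in the Resolution section above. This is a hedged sketch; the role in the search query is illustrative and should be replaced with whatever group is actually alerting.
# clear the stale metric on every node matching a role (role name is an example)
bundle exec knife ssh -p2222 -a ipaddress 'role:gitlab-cluster-worker' \
  'sudo sed -i "s/chef_client_error 1.0/chef_client_error 0.0/" /opt/prometheus/node_exporter/metrics/chef-client.prom'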