
ChefClientErrorCritical

A significant number (10% or more) of Chef nodes are failing to complete chef-client converges in an environment (most likely GPRD).

This can be caused by a recent change to Chef cookbooks, roles, or environments. It can also be caused by third-party dependencies, such as apt, not responding, or by other services being unreachable. Sometimes it is caused by cookbook bugs that do not accommodate all edge cases.

This is not a publicly facing problem for GitLab.com users, but it can block deploys and prevent required changes to our Chef infrastructure.

When paged with this alert, investigate logs and look for chef-client errors. If it’s not clear what is wrong, compare the errors to recent changes in cookbooks and the chef-repo project to try and identify the culprit.

This alert is based on this expression: avg(chef_client_error{env="gprd"}) * 100 > 10, and its definition can be found here. This is not an auto-generated alert definition.

chef_client_error is a metric scraped from the node exporter on VMs and represents the last result of a chef-client run: 0 for success and 1 for failure.

This alert only notifies if more than 10% of the chef-client runs in a single environment have failed for the past hour. Chef-client does occasionally fail to run for various reasons. Once the number of failures climbs to 10% or more of the Chef VMs in an environment, the EoC is notified, since this could break deploys or block configuration changes needed to mitigate incidents.
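
For reference, below is a minimal sketch of what such an alerting rule could look like in standard Prometheus alerting-rule syntax. The `for: 1h` duration, label, and annotation values are assumptions based on the "past hour" behavior described above; the authoritative definition is the one linked earlier.

```yaml
# Sketch only: the real rule is the alert definition linked above.
groups:
  - name: chef
    rules:
      - alert: ChefClientErrorCritical
        # Percentage of nodes whose last chef-client run failed.
        expr: avg(chef_client_error{env="gprd"}) * 100 > 10
        # Assumption: the "past hour" behavior is implemented as a pending duration.
        for: 1h
        labels:
          severity: critical  # assumed label name and value
        annotations:
          summary: "More than 10% of chef-client runs are failing in gprd"
```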

Normally, this percentage is well below 10 and ideally close to 0, with only occasional small spikes.

If the cause of the chef-client failures will take a long time to remedy, a silence may be appropriate. Be aware that silencing this alert could blind the EoC to a new Chef convergence problem while the silence is active.

This should be a rare alert, but it has happened four times in the past ninety days.

Kibana Trends for ChefClientErrorCritical

It’s likely this is a Severity 3 or lower incident. The inability to converge chef-client should only affect internal supporters of GitLab.com and their tooling, and it should most likely be labeled as backstage.

During times when many incidents are occurring at the same time, this could warrant a higher severity if it blocks our ability to resolve other incidents.

If you do find an MR that appears strongly correlated with the incident in progress, reverting the version change in the Chef-Repo Project should be the quickest and cleanest way to roll back the changes. This works well for cookbook version bumps as well as role changes.

  1. Consider breaking down the chef-client failures by fleet type (see the PromQL sketch after this list).
  2. Look for errors in the logs for chef-client.
  3. Review recent changes to chef-repo.
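
For step 1, the same metric can be grouped in Prometheus to narrow down where failures are concentrated, and the per-node series can identify which hosts to pull logs from for step 2. The label names below (`type` for fleet and `fqdn` for the node) are assumptions; adjust them to whatever labels the node exporter metrics actually carry.

```promql
# Percentage of failing chef-client runs per fleet (label name "type" is an assumption).
avg by (type) (chef_client_error{env="gprd"}) * 100

# List the individual nodes whose last chef-client run failed
# (label name "fqdn" is an assumption; "instance" is another candidate).
chef_client_error{env="gprd"} == 1
```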

Chef cookbooks often rely on third party cookbooks, apt repositories, and the Chef server.

Chef is a shared service, and cookbook and chef-repo changes can be made by a large number of individuals. If the problem is isolated to a single fleet, consider escalating to the team that manages that fleet or service to get more context on recent changes.

The #production_engineering Slack channel might be a good first place to ask for help.

When considering modifying this alert, keep in mind that chef-client failures can be caused by outside dependencies, so some errors will always happen. The key is to prevent blockages for config management work and deploys.
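
If tuning is needed, the threshold in the expression is the main lever. For example (a sketch only, with an arbitrary threshold chosen for illustration):

```promql
# Example only: raise the threshold to tolerate more background noise.
avg(chef_client_error{env="gprd"}) * 100 > 20
```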