Troubleshooting

TODO investigate the difference between stats reported in the monitoring cluster and elastic cloud web UI

Elastic Cloud clusters’ performance can be checked using a few interfaces. They are listed below in order of most likely availability and usefulness; if one method is broken, try another.

The only logs available to us are in the Elastic Cloud web UI. To view them, go to the deployments page and click on “Logs” on the left-hand side.

You can query the API using:

  • Kibana
  • Elastic Cloud web UI (API Console on the left)
  • bash + curl
  • Curator
  • other tools (e.g. Postman)

Credentials for the APIs are in 1Password (for all clusters, the admin username is elastic).
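
For example, a quick authenticated query with bash + curl (a minimal sketch; the deployment endpoint and password are placeholders):

# list cluster nodes with their roles, heap, CPU and load
curl -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cat/nodes?v"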

Scripts useful for troubleshooting are available in the esc-tools repo.

Ideas of things to check (based on previous incidents)

On the ES cluster (example API calls follow this list):

  • do the nodes have enough disk space?
  • are shards being moved at the moment?
  • are all shards allocated? or are there any shard allocation errors?
  • what’s the indexing latency? (if it’s high, there’s a problem with performance)
  • what’s the cpu/memory/io usage?
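
A few of these can be checked quickly with the cat APIs (a sketch, assuming admin credentials and your deployment endpoint):

# disk usage and shard counts per node
curl -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cat/allocation?v"

# shard recoveries/relocations currently in progress
curl -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cat/recovery?v&active_only=true"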

See the sections below for specific scenarios.

Cluster lags behind with logs after a node was added

As the new node starts with no assigned shards, it will receive all newly created shards after a rollover, resulting in hot-spotting. We used high CPU usage as an indicator, but this was based on a guess (there was no hard evidence showing the CPU was used by shard-related processes).

Solution: See retry shard allocation
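
The underlying API call is roughly the following (a minimal sketch, assuming admin credentials; the canonical version lives in the scripts):

# retry allocations that previously hit the max retry limit
curl -XPOST -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cluster/reroute?retry_failed=true"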

_cluster/health endpoint returns anything other than green.

We have alerts for the production logging cluster (https://gitlab.com/gitlab-com/runbooks/blob/master/legacy-prometheus-rules/elastic-clusters.yml#L24-33) and the monitoring cluster (https://gitlab.com/gitlab-com/runbooks/blob/master/legacy-prometheus-rules/elastic-clusters.yml#L44-53) being in a state other than green.
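
To check the current status directly (a minimal call, assuming admin credentials):

# status is "green", "yellow" or "red"; also shows the number of unassigned shards
curl -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cluster/health?pretty"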

In the past we saw this happen in multiple scenarios. Here are a few examples:

  • an ILM policy has requirements that no node can meet (there is no prior indication of that)
  • the max allocation retry limit was reached, so allocation was stopped and never retried (for example, when a cluster becomes unhealthy and then recovers by itself)
  • storage watermarks on nodes were reached (the scheduler will refuse to allocate shards to those nodes)
  • read-only blocks on node/cluster

Shards can be unassigned for different reasons.

To find unassigned shards, use: api_calls/single/get-unassigned-shards.sh
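
If the script is not at hand, roughly the same information is available directly from the API (a sketch, assuming admin credentials):

# list unassigned shards and the reason they are unassigned
curl -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

# detailed explanation of why an unassigned shard cannot be allocated
curl -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cluster/allocation/explain?pretty"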

This happens if both the primary and the replica for a shard are unavailable. The most probable reason is a failure to allocate shards because of missing storage capacity. In this case, deleting the affected index is the easiest way to get back to a healthy state.

https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/7398

  • in Kibana, go to Management -> Index Management; if there are ILM errors, a notification box is displayed above the search box
  • in Elastic Cloud web UI:
    • check Elastic logs for any errors
  • in the monitoring cluster:
    • check cluster health
    • check indices sizes and confirm they are within the policy

For more docs, see Index Lifecycle Management.
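
You can also inspect the ILM state of an index directly (a sketch; the index pattern is an example):

# show the current ILM phase, action, and step (including failures) per index
curl -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/pubsub-rails-inf-gprd-*/_ilm/explain?pretty"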

Failure to move an index from a hot node to a warm node

This can happen, for example, when warm nodes run out of disk space. The ILM step will fail and mark the index as read-only. The cluster’s health will turn unhealthy with an error message:

"An Elasticsearch index is in a read-only state and only allows deletes"

Any subsequent ILM attempts will fail with the following error message:

blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];

In order to fix:

  • Release space on warm nodes (do not resize the cluster, as resizing will fail!). Disk space can be released by removing indices; when deciding which indices to remove, start with the oldest ones.
  • Remove blocks from indices that failed (if the entire cluster has been marked as read-only, remove that block as well). API calls for removing blocks (from indices and from the cluster) are documented in this repo, in the scripts directory; a sketch is shown after this list.
  • Retry ILM steps
  • Once the cluster is back in a healthy state, adjust ILM policy or resize the cluster
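
The block removal boils down to clearing the setting (a minimal sketch; the authoritative calls are the scripts mentioned above, and the index name is a placeholder):

# remove the read-only block from a single index
curl -XPUT -u 'elastic:<password>' -H 'Content-Type: application/json' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/<index>/_settings" -d '{"index.blocks.read_only_allow_delete": null}'

# remove the block from all indices at once
curl -XPUT -u 'elastic:<password>' -H 'Content-Type: application/json' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_all/_settings" -d '{"index.blocks.read_only_allow_delete": null}'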

Usually, our response to the following alerts is to consider scaling the cluster out.

  • if shards are distributed unequally, one node might receive a disproportionate amount of traffic, causing high CPU usage and, as a result, increased indexing latency
  • stop routing to the overloaded node and force an index rollover (incoming documents are only saved to new indices, regardless of the timestamp in the document); see the rollover call after the reroute example below
  • alternatively, you can trigger shard rebalancing. However, this may not be a good idea: if the node is already heavily loaded, making it move a shard uses even more resources and will only make things worse.
For example, to inspect shard distribution and move shards manually (replace the credentials, deployment ID, node and index names with real values):

# the node we want to offload
NODE=instance-0000000016

# list shards that are NOT on the node, sorted by size
curl -u 'xxx:yyy' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cat/shards?v&bytes=b&h=store,index,node,shard,p,state" | grep -v $NODE | sort -r

# list shards that ARE on the node, sorted by current indexing operations (iic) and size
curl -u 'xxx:yyy' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cat/shards?v&bytes=b&h=store,index,node,shard,p,state,iic&s=iic,store" | grep $NODE

# example move.json: swap shards between the overloaded node and another node
{
  "commands" : [
    {
      "move" : {
        "index" : "pubsub-consul-inf-gprd-2020.01.26-000007", "shard" : 3,
        "from_node" : "instance-0000000072", "to_node" : "instance-0000000016"
      }
    },
    {
      "move" : {
        "index" : "pubsub-rails-inf-gprd-2020.01.29-000007", "shard" : 0,
        "from_node" : "instance-0000000016", "to_node" : "instance-0000000072"
      }
    }
  ]
}

# apply the moves (and retry previously failed allocations)
curl -XPOST -u 'xxx:yyy' -H 'Content-Type: application/json' -d @move.json "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_cluster/reroute?retry_failed=true"
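
To force a rollover of the write alias (a sketch; the alias name is an example derived from the index names above):

# roll over the write alias so new documents go to a fresh index
curl -XPOST -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/pubsub-rails-inf-gprd/_rollover"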

In ES7 we rely on dynamic mappings set by the cluster. These mappings are set when the first document arrives at the cluster. If the type of a field is incorrectly detected, the cluster will fail to parse subsequent documents and will refuse to index them. The fix is to set mappings statically in those cases.

Here’s an example of a static mapping set for json.args: https://gitlab.com/gitlab-com/runbooks/merge_requests/1782

Once the index templates are updated (the above MR is merged and the CI job has successfully uploaded the templates), you’ll also need to force a rollover of indices and mark the old index as complete.
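
For illustration, a static mapping in an ES7 legacy index template looks roughly like this (a sketch; the template name, index pattern, and field type are examples, not the values from the MR above):

PUT _template/pubsub-rails-inf-gprd
{
  "index_patterns": ["pubsub-rails-inf-gprd-*"],
  "mappings": {
    "properties": {
      "json": {
        "properties": {
          "args": { "type": "text" }
        }
      }
    }
  }
}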

This can be caused by:

  • ILM errors (this can happen for a number of reasons; here’s an example bug that leads to ILM errors: https://github.com/elastic/elasticsearch/issues/44175)
  • ILM stopped (the cluster remains healthy, but no indices are rolled over, and there is no indication of this until nodes run out of disk space, at which point shards can be multiple TBs in size)
  • reaching storage watermarks (this causes ILM failures: the index is never rolled over and keeps growing, so disk usage grows indefinitely, and nothing stops the node from reaching 100% disk usage, at which point indices have to be dropped)

Our response to “ILM errors” alerts is to investigate the cause for the errors and retry ILM on indices that failed. In order to retry ILM on failed indices:

  • Kibana -> Management pane (on the left-hand side) -> Index Management -> if there are any lifecycle errors, click on “Show errors”
  • Select all indices with ILM errors
  • Manage index -> Retry ILM steps

You can also do it in a loop using a bash script available at elastic/api_calls/batch/retry-ilm.sh.
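
The underlying API call per index is roughly (a sketch; the index name is a placeholder):

# retry the failed ILM step for one index
curl -XPOST -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/<index>/_ilm/retry"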

In certain scenarios ILM can end up in a stopped state. In order to start it again:

  1. issue an HTTP request to the Elasticsearch API (e.g. using the Kibana console): POST /_ilm/start

  2. Make sure no indices have read_only blocks:

    • Kibana -> Management pane (on the left-hand side) -> Index Management -> if there are any lifecycle errors, click on “Show errors”
    • click on each index on the list (of indices with errors) and read the error message
    • if any of the error messages are about indices having a read_only block, remove the block using an API call described in runbooks/elastic/api_calls/single/remove-read_only_allow_delete-block-from-an-index.sh
    • trigger ILM retry on the index using Kibana
  3. Confirm ILM is operational again (see the status call after this list)

    • make sure indices are rolled over and moved from hot nodes to warm nodes; indications of ILM not working are an uncontrolled increase in index size and the index suffix not changing
  4. In critical situations, you might need to remove indices forcefully
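
To confirm ILM is running again (a minimal call, assuming admin credentials):

# should return "operation_mode" : "RUNNING"
curl -u 'elastic:<password>' "https://<deployment-id>.us-central1.gcp.cloud.es.io:9243/_ilm/status?pretty"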

Example error message:

illegal_argument_exception
Fielddata is disabled on text fields by default. Set fielddata=true on [json.extra.since] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.

In some cases, log records have an inconsistent structure which results in mapping conflicts. This can result in search queries failing on a subset of shards. When that happens, searches using Kibana fail.

The short-term solution is to refresh the index mappings cached in Kibana: go to Kibana -> Management -> Index Patterns -> select an index pattern -> click “Refresh field list” in the top right corner of the page.

Long term we would like to have more consistency in logging fields so that we can avoid mapping conflicts completely.

See https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/9364 for more details.