ErrorSLOViolation
Overview
What does this alert mean?
An ErrorSLOViolation occurs when the error rate for a specific application or service exceeds the defined Service Level Objective (SLO) threshold, indicating potential issues impacting functionality or user experience.
Possible Causes
Several factors can contribute to these alerts:
- Bugs in the application code causing unexpected behavior.
- Resource exhaustion (CPU, memory) leading to errors during request processing.
- External service errors propagating back to the application.
- Underlying infrastructure problems impacting application stability (e.g., database errors, network connectivity issues).
General Troubleshooting Steps
Common troubleshooting steps (these may differ slightly for each service):
- Investigate application error logs to identify the specific errors occurring.
- Analyze the timestamps and frequency of errors to pinpoint potential root causes.
- Correlate error messages with recent code deployments or infrastructure changes.
- Identify spikes in response times that might be correlated with errors. Compare current metrics with historical performance data to identify deviations.
- Monitor resource utilization (CPU, memory) for signs of bottlenecks impacting application stability (a query sketch follows this list).
- Check for recent deployments, configuration changes, or infrastructure modifications, and review any code changes introduced in the deployment for potential causes.
- Check the health of underlying infrastructure components (database, servers, network) for any reported issues.
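Several of the steps above involve comparing current error and resource metrics with recent history. The sketch below is one way to pull those series from Prometheus for side-by-side comparison; the Prometheus address, job labels, and metric expressions are assumptions, not the actual SLI recording rules for any specific service.

```python
# Sketch: fetch an error-rate series and a CPU series over the last few hours
# so current behaviour can be compared with recent history.
# PROMETHEUS, the job/namespace labels, and both expressions are assumed values.
import time

import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # assumed address


def query_range(expr: str, hours: int = 6, step: str = "60s"):
    """Run a Prometheus range query over the last `hours` hours."""
    end = time.time()
    start = end - hours * 3600
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query_range",
        params={"query": expr, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# Hypothetical expressions -- substitute the real SLI recording rules for the
# affected service.
error_ratio = query_range(
    'sum(rate(http_requests_total{job="web", code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{job="web"}[5m]))'
)
cpu_usage = query_range(
    'sum(rate(container_cpu_usage_seconds_total{namespace="web"}[5m]))'
)

for series in error_ratio + cpu_usage:
    print(series["metric"], series["values"][-3:])  # most recent samples
```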
Services
Refer to the Service Catalogue for the service owners and escalation paths.
Metrics
- The error ratio measures the proportion of requests that result in errors compared to the total number of requests. Alerts are triggered if the error ratio exceeds the defined threshold, signaling potential issues with the application.
- Under normal conditions, the error ratio should remain low, indicating that the majority of requests are successfully processed. For example, the Frontend Web Service error rate is consistently below 1%, with minor, short-lived spikes but no sustained periods above the threshold.
- Analysis of historical error data helps identify normal performance patterns and determine acceptable thresholds. For instance, if a service usually has an error ratio of 0.5%, a threshold might be set at 1.5% to account for normal fluctuations but still detect significant reliability issues (a worked example follows this list).
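As a worked illustration of the threshold logic in the last bullet, the snippet below computes an error ratio from raw request counts and compares it with an example 1.5% threshold; the counts and the threshold are illustrative values, not a real SLO.

```python
# Illustrative only: error ratio = errors / total requests, compared against
# an example 1.5% threshold (headroom above an assumed ~0.5% baseline).
def error_ratio(error_count: int, total_count: int) -> float:
    """Proportion of requests that resulted in errors."""
    return error_count / total_count if total_count else 0.0


THRESHOLD = 0.015  # 1.5%

samples = [
    ("frontend-web", 48, 10_000),   # 0.48% -- within SLO
    ("frontend-web", 210, 10_000),  # 2.10% -- would breach the threshold
]

for service, errors, total in samples:
    ratio = error_ratio(errors, total)
    status = "VIOLATION" if ratio > THRESHOLD else "ok"
    print(f"{service}: {ratio:.2%} ({status})")
```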
Example: In the graph below there was a spike in the error ratio for the sidekiq service. Upon investigation, this was found to have been caused by a Gitlab::ExclusiveLeaseHelpers::FailedToObtainLockError error.
The Kibana logs also provide insight into when the issue probably started; a query sketch for this follows.
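This is a minimal sketch of the kind of Kibana/Elasticsearch query that shows when the errors began, bucketing matching log entries per minute; the Elasticsearch endpoint, index pattern, and field names are assumptions about the logging setup and will differ per environment.

```python
# Sketch: count matching error log entries per minute to see when the spike
# started. ES_URL, INDEX, and the json.error_class field name are assumptions.
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # assumed address
INDEX = "sidekiq-logs-*"                                # hypothetical index pattern

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"match_phrase": {"json.error_class": "Gitlab::ExclusiveLeaseHelpers::FailedToObtainLockError"}},
                {"range": {"@timestamp": {"gte": "now-6h"}}},
            ]
        }
    },
    "aggs": {
        "errors_over_time": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"}
        }
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["errors_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```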
A few examples of how the metric is calculated (a hedged query sketch follows the list):
- Patroni: gitlab_component_errors
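The exact recording rules differ per service, so the query below is only a sketch of the errors-divided-by-operations shape behind metrics such as gitlab_component_errors; the rule names, labels, and Prometheus address are assumptions, and the real definitions live in the service's metrics catalogue.

```python
# Hedged sketch: current error ratio for one component, expressed as
# errors / total operations. The metric names and labels are hypothetical.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # assumed address

EXPR = (
    'sum by (type) (rate(gitlab_component_errors{type="patroni"}[5m]))'
    ' / sum by (type) (rate(gitlab_component_ops{type="patroni"}[5m]))'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": EXPR}, timeout=30)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("type"), series["value"][1])
```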
Severities
- The severity of this alert is generally what is configured on the SLI; this defaults to ~"severity::2".
- There might be customer or user impact depending on which service is affected.
Recent changes
Previous Incidents
- Error ratios may fluctuate considerably for services that receive very little traffic, and SLO targets may need to be adjusted to accommodate this.
- Temporary issues with external services can cause transient errors and trigger false alerts.
- Network connectivity problems can lead to intermittent errors and false positives.
If the issue cannot be resolved quickly, escalate to the appropriate engineering or operations team for further investigation.