ApdexSLOViolation

This alert indicates that the Apdex score for a specific service has fallen below a predefined threshold, signifying a potential performance degradation. An Apdex violation occurs when specific application or service transactions fail to complete within the defined Apdex time, indicating a decline in user experience.

Several factors can contribute to these alerts:

  • Unexpected traffic spikes leading to resource exhaustion (CPU, memory).
  • Database connection issues, slow queries or database performance issues.
  • Recent code deployments introducing bugs or performance issues.
  • External service dependencies experiencing slowdowns.
  • Server or network problems affecting service performance.

Common troubleshooting steps (these may differ slightly for each service):

  • When investigating an Apdex issue without a corresponding increase in error rates, a valuable initial step is to identify the specific Service Level Indicator (SLI) reporting elevated slow request metrics. By examining the logs for these slow requests, we can often gain insights into the nature of the performance degradation.
  • Review Apdex score details, response times, error rates, and any anomalies.
  • Review logs for errors, timeouts, or slow queries related to the affected services. Look for correlations between log entries and performance issues.
  • Check for recent deployments, configuration changes, or infrastructure modifications.
  • Identify patterns or spikes in latency and errors.
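The first step above, finding which SLI accounts for the slow requests, can be sketched as a small log-triage helper. This is only an illustration under stated assumptions: it assumes logs are available as JSON lines, and the field names `sli` and `duration_s`, the `other_sli` value, and the 1-second cutoff are hypothetical, not taken from any particular service's log schema.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def slow_requests_by_sli(log_lines: Iterable[str], threshold_s: float) -> List[Tuple[str, int]]:
    """Count requests slower than threshold_s per SLI, most affected first.

    Assumes each log line is a JSON object with hypothetical fields
    "sli" (string) and "duration_s" (number).
    """
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        duration = entry.get("duration_s")
        if isinstance(duration, (int, float)) and duration > threshold_s:
            counts[entry.get("sli", "unknown")] += 1
    return counts.most_common()


# Hypothetical sample data for illustration only.
sample_logs = [
    '{"sli": "rails_replica_sql", "duration_s": 2.5}',
    '{"sli": "rails_replica_sql", "duration_s": 0.1}',
    '{"sli": "other_sli", "duration_s": 3.0}',
    '{"sli": "rails_replica_sql", "duration_s": 4.0}',
    "not a json line",
]
ranking = slow_requests_by_sli(sample_logs, threshold_s=1.0)
```

The SLI at the top of the ranking is a reasonable place to start reading individual slow-request log entries.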

Refer to the Service Catalogue for the service owners and escalation paths.

ApdexSLOViolation Metrics

  • The main goal is to monitor these metrics and raise alerts when they violate predefined thresholds, ensuring that the service performance meets the expected standards. The Apdex score quantifies user satisfaction with an application’s response time. It classifies user experiences into three categories: satisfactory, tolerable, and frustrating.

  • An Apdex score is calculated on a scale from 0 to 1, where:

    • 1.0: All responses are satisfactory.
    • 0.5: Half the responses are satisfactory and half are not satisfactory.
    • 0.0: All responses are frustrating.
  • Under normal conditions, the Apdex score should remain consistently high, reflecting good user experience. For example:

    The Apdex score should consistently be above 0.9, with occasional minor drops but no prolonged periods below this threshold.
  • If a service typically achieves an Apdex score of 0.9, a threshold might be set at 0.8 to trigger an alert if user satisfaction drops significantly.
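To make the scoring and the threshold check concrete, here is a minimal sketch. It assumes the standard Apdex formula, in which tolerable responses count as half and the tolerable band extends to four times the satisfactory threshold; the actual SLI definition behind this alert may be computed differently. The function names, the sample durations, and the 0.5-second satisfactory threshold are illustrative assumptions. The 0.8 alert threshold is the example value from the text above.

```python
from typing import Iterable


def apdex_score(durations_s: Iterable[float], satisfied_s: float) -> float:
    """Standard Apdex: (satisfactory + tolerable / 2) / total.

    Assumption: a response is tolerable if it takes more than satisfied_s
    but at most 4 * satisfied_s; anything slower is frustrating.
    """
    satisfied = tolerating = total = 0
    for duration in durations_s:
        total += 1
        if duration <= satisfied_s:
            satisfied += 1
        elif duration <= 4 * satisfied_s:
            tolerating += 1
    if total == 0:
        raise ValueError("cannot compute Apdex without samples")
    return (satisfied + tolerating / 2) / total


def violates_slo(score: float, threshold: float = 0.8) -> bool:
    """Return True when the score is below the alerting threshold."""
    return score < threshold


# Illustrative durations in seconds: 3 satisfactory, 2 tolerable, 1 frustrating.
sample = [0.1, 0.2, 0.3, 0.9, 1.5, 5.0]
score = apdex_score(sample, satisfied_s=0.5)  # (3 + 2 / 2) / 6
alerting = violates_slo(score)
```

In this sample the score is 4/6, roughly 0.67, which is below the 0.8 threshold and would count as a violation.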

Example: In the graph below, the alert fires when an SLI (here the rails_replica_sql SLI of the patroni-ci service, main stage) has an Apdex that violates its SLO.

[Graph: Apdex of the rails_replica_sql SLI of the patroni-ci service (main stage)]

A few examples of how the metric is calculated:

  • The severity of this alert is generally the one configured on the SLI; it defaults to ~“severity::2”.
  • There might be customer impact, depending on which service is affected.

If the issue cannot be resolved quickly, escalate to the appropriate engineering or operations team for further investigation.