Rails SQL Apdex alerts
When we see an SQL Apdex alert is important to quickly asses the impact and rule out common causes like abuse and identify problematic queries. This runbook covers some of the topics that were discussed in the EOC Firedrill.
What to do in the first 5 minutes
Section titled “What to do in the first 5 minutes”- Be aware of the primary database for log queries
- Check the dashboards and log links below to asses root cause
- See if there is a quick recovery, if not page the IMOC who will bring in the CMOC in case we need to make a status page update
Dashboards
Section titled “Dashboards”- Check the GitLab general overview for service degradation
- Check the Patroni overview for the current status of Patroni.
- Look for unusual usage patterns in tuple statistics
- Look for outliers in the marginalia sampler dashboard
Metric queries
Section titled “Metric queries”- Primary queries by endpoint_id if it exists
- Grab the first
endpoint_id
, search the logs by settingjson.meta.caller_id
to theendpoint_id
and try to find a common denominator, for example,json.meta.root_namespace
. - If you don’t find a common denominator, try adding the filter
json.job_status: fail
, example; This can remove noise in some cases and help find the offender.
- Grab the first
- Slow queries on the primary
- Statement timeouts on the primary
- Locks on the primary
- Check for unusual stats for a specific relname
To find the exact query by the query_id
from thanos on the matching Postgres node where the query was handled run
select queryid, substr(query ,1, 5000) from pg_stat_statements where queryid='xxxxx';
For any of the above queries, you can search for json.fingerprint on the left list of fields, click on it to see if a particular fingerprint is dominating slow queries or timeouts. From this, you can get the full query (or the endpoint ID) which will help to narrow down the performance degradation
For more detailed information about slow queries, see the runbook for collecting pg data
Often abuse can be the source of DB degradation, to see if there might be abuse happening reference the abuse runbook