Primary Database Node CPU Saturation Analysis
Why do we need this alert?
The goal of the PrimaryDatabaseCPUSaturation alert is to let us know that there was a spike in CPU pressure on the primary database node, so we can investigate while we still have logs available and before the situation gets worse. The alert triggers when CPU pressure is more than three standard deviations above its average. More information on the PromQL query that is used can be found here. For an unfiltered visualization of this query, click here.
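For orientation, a “three standard deviations above average” rule has roughly the following shape. This is only a sketch: the metric name node_pressure_cpu_waiting_seconds_total (the node_exporter PSI metric) and the time windows are assumptions, and the alert definition linked above remains the authoritative source.

# Sketch only; the metric name and windows are assumptions, not the real alert definition.
rate(node_pressure_cpu_waiting_seconds_total{env="gprd", type="patroni"}[5m])
  > avg_over_time(rate(node_pressure_cpu_waiting_seconds_total{env="gprd", type="patroni"}[5m])[1d:5m])
    + 3 * stddev_over_time(rate(node_pressure_cpu_waiting_seconds_total{env="gprd", type="patroni"}[5m])[1d:5m])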
Such spikes are usually short-lived, and by the time we look into them the alert may have already self-resolved; we still need to investigate regardless.
To confirm the spike:
- Go to the “Host Stats” dashboard.
- Filter by the current primary node for the database. The primary can change regularly; to see which server is the current primary, view the PostgreSQL Replication Overview dashboard (or use the query sketch below).
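If the dashboard is not at hand, the current primary can also be read directly from the metrics that the queries below rely on. A minimal sketch, reusing pg_replication_is_replica exactly as it is used elsewhere in this runbook:

# Only the current primary (value 0) is returned; the instance label identifies the host.
# Use type="patroni" for the main database and type="patroni-ci" for the ci database.
pg_replication_is_replica{env="gprd", type="patroni"} == 0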
Create an issue
Capture all findings in an issue:
- Click here
- Title: YYYY-MM-DD HH:MM: CPU pressure spike on the primary node for (main|ci) database
- Labels: TBD
- Assignees: TBD
- Description: Include any relevant information that is available. When referencing sources like Mimir and Kibana, please include both a screenshot and a shareable link to the source (example).
How to identify the root cause?
Spikes in CPU pressure are often caused by queries that consume too much CPU time to execute or plan. The query itself may be fast, but be called many times per second.
Total execution time
To find the top 10 statements by execution time, use the following PromQL query:
topk(10, sum by (queryid) ( rate(pg_stat_kcache_exec_total_time{env="gprd", type="patroni"}[1m]) and on (instance) pg_replication_is_replica == 0 ))
- Go to Mimir.
- For the ci database change the type label to patroni-ci (an example is shown after this list).
- Adjust the time range to 15 minutes before and after the alert.
- On the graph, click “Show all” so that all series are rendered.
- Look for any statement(s) that stand out. In the example below, we can see that the statement with queryid 5155022932036490076 spiked at around 18:28:00.
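For example, the same query adjusted for the ci database (only the type label changes) looks like this:

topk(10, sum by (queryid) ( rate(pg_stat_kcache_exec_total_time{env="gprd", type="patroni-ci"}[1m]) and on (instance) pg_replication_is_replica == 0 ))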
Total plan time
To find the top 10 statements by plan time, use the following PromQL query:
topk(10, sum by (queryid) ( rate(pg_stat_statements_plan_seconds_total{env="gprd", type="patroni"}[1m]) and on (instance) pg_replication_is_replica == 0 ))
- Go to Mimir.
- For the ci database change the type label to patroni-ci.
- Adjust the time range to 15 minutes before and after the alert.
- On the graph, click “Show all” so that all series are rendered.
- Look for any statement(s) that stand out.
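Once a suspicious queryid has been identified, it can help to plot that statement on its own. A sketch, with <queryid> as a placeholder for the value found above (the metric carries a queryid label, as the topk query relies on):

# Replace <queryid> with the actual value, e.g. taken from the topk output above.
sum( rate(pg_stat_statements_plan_seconds_total{env="gprd", type="patroni", queryid="<queryid>"}[1m]) and on (instance) pg_replication_is_replica == 0 )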
Total number of calls
To find the top 10 statements by number of calls, use the following PromQL query:
topk(10, sum by (queryid) ( rate(pg_stat_statements_calls{env="gprd", type="patroni"}[1m]) and on (instance) pg_replication_is_replica == 0 ))
- Go to Mimir.
- For the ci database change the type label to patroni-ci.
- Adjust the time range to 15 minutes before and after the alert.
- On the graph, click “Show all” so that all series are rendered.
- Look for any statement(s) that stand out.
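As noted at the start of this section, a statement can be cheap per call yet still dominate CPU because it is called very often. To separate the two cases, the execution-time and call-rate queries above can be combined into an approximate per-call cost. This is a sketch and assumes the two metric families expose compatible queryid labels:

# Approximate mean execution time per call, per queryid (sketch; unit follows the exec-time metric).
sum by (queryid) ( rate(pg_stat_kcache_exec_total_time{env="gprd", type="patroni"}[1m]) and on (instance) pg_replication_is_replica == 0 )
/
sum by (queryid) ( rate(pg_stat_statements_calls{env="gprd", type="patroni"}[1m]) and on (instance) pg_replication_is_replica == 0 )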
Other sources
The following dashboards may also be useful:
To use these dashboards, filter by:
- environment: gprd.
- type: patroni for the main database, patroni-ci for the ci database.
- fqdn: the current primary node for the database. The primary can change regularly; to see which server is the current primary, view the PostgreSQL Replication Overview dashboard.
- queryid: for single query analysis, use the respective value.
Other possible causes
- Check for any active incidents/deployments around the time of the alert
- TODO: Add other relevant PromQL queries, e.g. LWLocks, etc.
Map queryid to SQL
To find the normalized SQL for a given queryid, follow the mapping docs. Usually, searching the logs is enough:
- Go to Kibana.
- Update the filter and set the queryid.
- If needed increase the time range.
References
- This document is based on the Database Peak Analysis Report.