Skip to content

Primary Database Node CPU Saturation Analysis

The goal of the PrimaryDatabaseCPUSaturation alert is to let us know that there was spike in CPU pressure on the primary database node, so we can investigate while we still have logs available and before the situation gets worse. The alert triggers when CPU pressure is over three standard deviations above average. More information on the PromQL query that is used can be found here. For unfiltered visualization of this query click here.

Such spikes are usually short-lived and by the time we look into it the alert may have self-resolved, though we still need to investigate.

To confirm the spike:

  • Go to the “Host Stats” dashboard.
  • Filter by the current primary node for the database. The primary can change regularly. To see which server is the current primary view the PostgreSQL Replication Overview dashboard.

Capture all findings in an issue:

  • Click here
  • Title: YYYY-MM-DD HH:MM: CPU pressure spike on the primary node for (main|ci) database
  • Labels: TBD
  • Assignees: TBD
  • Description: Include any relevant information that is available. When referencing to sources like Mimir and Kibana, please include both screenshot and sharable link to the source (example).

Spikes in CPU pressure are often caused by queries that consume too much CPU time to execute or plan. The query itself may be fast, but called many times per second.

To find the top 10 statements by execution time use the following PromQL query:

topk(10,
sum by (queryid) (
rate(pg_stat_kcache_exec_total_time{env="gprd", type="patroni"}[1m]) and on (instance) pg_replication_is_replica == 0
)
)
  • Go to Mimir.
  • For the ci database change the type label to patroni-ci.
  • Adjust time range to 15 minutes before and after the alert.
  • On the graph, click “Show all” so that all series are rendered.
  • Look for any statement(s) that stand out. In the example bellow we can see that the statement with queryid 5155022932036490076 spiked at around 18:28:00. execution time spike

To find the top 10 statements by plan time use the following PromQL query:

topk(10,
sum by (queryid) (
rate(pg_stat_statements_plan_seconds_total{env="gprd", type="patroni"}[1m]) and on (instance) pg_replication_is_replica == 0
)
)
  • Go to Mimir.
  • For the ci database change the type label to patroni-ci.
  • Adjust time range to 15 minutes before and after the alert.
  • On the graph, click “Show all” so that all series are rendered.
  • Look for any statement(s) that stand out.

To find the top 10 statements by number of calls use the following PromQL query:

topk(10,
sum by (queryid) (
rate(pg_stat_statements_calls{env="gprd", type="patroni"}[1m]) and on (instance) pg_replication_is_replica == 0
)
)
  • Go to Mimir.
  • For the ci database change the type label to patroni-ci.
  • Adjust time range to 15 minutes before and after the alert.
  • On the graph, click “Show all” so that all series are rendered.
  • Look for any statement(s) that stand out.

The following dashboards may be also useful:

To use these dashboards filter by:

  • environment: gprd.
  • type: patroni for the main database, patroni-ci for the ci database.
  • fqdn: the current primary node for the database. The primary can change regularly. To see which server is the current primary view the PostgreSQL Replication Overview dashboard.
  • queryid: for single qury analysis use the respective value.‘
  • Check for any active incidents/deployments around the time of the alert
  • TODO: Add other relevant PromQL queries, e.g. LWLocks, etc.

To find the normalized SQL for given queryid, follow the mapping docs. Usually searching the logs is enough:

  • Go to Kibana.
  • Update the filter and set the queryid.
  • If needed increase the time range.