From 15b7ff2ea0462bb7e25dae41c641c1d8680fb695 Mon Sep 17 00:00:00 2001
From: Trevor Nierman
Date: Mon, 5 May 2025 13:48:58 -0400
Subject: [PATCH] Enhancements to the prometheus high CPU FAQ page

---
 content/Products/OpenshiftMonitoring/faq.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/content/Products/OpenshiftMonitoring/faq.md b/content/Products/OpenshiftMonitoring/faq.md
index f4bae09..bc6a152 100644
--- a/content/Products/OpenshiftMonitoring/faq.md
+++ b/content/Products/OpenshiftMonitoring/faq.md
@@ -91,6 +91,20 @@ Often, when "high" CPU usage or spikes are identified it can be a symptom of exp
 
 A good place to start the investigation is the `/rules` endpoint of Prometheus and analyse any queries which might contribute to the problem by identifying excessive rule evaluation times.
 
+A list of rule evaluation times, sorted in ascending order, can be gathered with the following:
+
+```bash
+oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/rules' | jq -r '.data.groups[] | .rules[] | [.evaluationTime, .health, .name] | @tsv' | sort -g
+```
+
+An overview of the time series database (TSDB) can be retrieved with:
+
+```bash
+oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/status/tsdb' | jq .
+```
+
+Within Prometheus, the `prometheus_rule_evaluation_duration_seconds` metric can be used to view evaluation time by quantile for each instance. Additionally, the `prometheus_rule_group_last_duration_seconds` metric can be used to determine the longest-evaluating rule groups.
+
 ## How do I retrieve CPU profiles?
 
 In cases where excessive CPU usage is being reported, it might be useful to obtain [Pprof profiles](https://github.com/google/pprof/blob/02619b876842e0d0afb5e5580d3a374dad740edb/doc/README.md) from the Prometheus containers over a short time span.