Description
We've observed unexpected behavior in the rate() function when a counter resets after a container restart: the reset is not handled correctly, leading to misleading data in our Grafana dashboards.
What's your helm version?
3.11.1
What's your kubectl version?
1.31.0
Which chart?
kube-prometheus-stack
What's the chart version?
69.2.3
What happened?
We noticed a sudden drop in the graph for our 'ara' service requests.
Upon investigation, we found that one pod had a container restart, causing its counter to reset.
The rate() function did not handle this reset correctly, resulting in a significant dip in the graph from about 80k requests/s to 12k requests/s.
The behavior persisted even when we focused on a single series, ruling out the sum aggregation as the cause.
What you expected to happen?
We expected the rate() function to handle counter resets gracefully, as per the Prometheus documentation. The function should detect the reset and calculate the rate correctly, maintaining a consistent view of the request rate despite the container restart.
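For context, here is a minimal Python sketch of the reset adjustment the documentation describes. It is simplified (no extrapolation to the range boundaries), and the sample timestamps and values are made up purely for illustration:

```python
# Minimal sketch of the counter-reset adjustment described in the Prometheus docs:
# when a sample is lower than its predecessor, a reset is assumed and the
# pre-reset total is carried forward before computing the per-second rate.
# Simplified: no extrapolation to the range boundaries.

def reset_adjusted_rate(samples, range_seconds):
    """samples: list of (timestamp, counter_value) pairs inside the range window."""
    if len(samples) < 2:
        return None  # rate() needs at least two samples in the window
    adjustment = 0.0
    prev_value = samples[0][1]
    last_adjusted = prev_value
    for _, value in samples[1:]:
        if value < prev_value:        # counter reset detected
            adjustment += prev_value  # carry the pre-reset total forward
        last_adjusted = value + adjustment
        prev_value = value
    return (last_adjusted - samples[0][1]) / range_seconds

# Illustrative only: a counter that resets between t=30s and t=60s
samples = [(0, 1000.0), (30, 2500.0), (60, 200.0)]
print(reset_adjusted_rate(samples, 60))  # ~28.3/s rather than a misleading dip
```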
How to reproduce it?
1. Run a Prometheus query using the rate() function on a counter metric, such as: rate(http_server_duration_milliseconds_count{service_name="ara"}[1m])
2. Trigger a container restart for one of the pods of the service being monitored.
3. Observe the resulting graph in Grafana over the period of the restart (or query the API directly, as in the sketch below).
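To observe the dip without Grafana, the same query can be run against the Prometheus HTTP API. A rough sketch; the server URL and time window below are placeholders, not values from our setup:

```python
# Rough sketch: fetch rate() over the restart window via the Prometheus HTTP API.
# PROMETHEUS_URL and the 30-minute lookback are placeholders for illustration.
import time
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
QUERY = 'rate(http_server_duration_milliseconds_count{service_name="ara"}[1m])'

end = time.time()
start = end - 30 * 60  # adjust so the window covers the restart

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "30s"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    print(series["metric"].get("pod", "<no pod label>"))
    for ts, value in series["values"]:
        print(f"  {time.strftime('%H:%M:%S', time.gmtime(ts))}  {value}")
```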
Enter the changed values of values.yaml?
scrapeInterval: 30s
scrapeTimeout: 10s
Enter the command that you execute and failing/misfunctioning.
N/A
Anything else we need to know?
We've tested this with both Prometheus and Thanos data sources, yielding the same results.
The issue persists even when isolating a single series, ruling out problems with aggregation.
We've confirmed that the underlying counter did reset, as seen in the raw metric data.
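For reference, one way to confirm the reset from the query side is resets(); a short sketch against the instant-query endpoint, with the server URL and lookback window again being placeholders:

```python
# Sanity check: count counter resets per series with resets().
# PROMETHEUS_URL and the 30-minute lookback are placeholders for illustration.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
QUERY = 'resets(http_server_duration_milliseconds_count{service_name="ara"}[30m])'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    pod = series["metric"].get("pod", "<no pod label>")
    _, resets = series["value"]  # instant vector sample: [timestamp, "value"]
    print(f"{pod}: {resets} reset(s) in the last 30m")
```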