Introduce "integrated queue length" metrics and use it for IO classes #3210
base: master
Conversation
The class wraps an unsigned integral type (for arithmetic) and allows the caller to "checkpoint" it from time to time (it can be done as frequently as needed, though). Every checkpoint then contributes value * duration (the duration since the previous checkpoint) to the final result. When exported, the integrator provides a monotonically increasing value, i.e. a COUNTER. The monitoring system can then rate() the counter, thus providing a value averaged over the scraping period rather than a randomly taken snapshot as a gauge would. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
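For illustration, here is a minimal sketch of what such a wrapper could look like (the names, signatures and internals are my own guesses based on the description above, not the actual patch):

#include <chrono>
#include <cstdint>

// Minimal sketch of a time-weighted ("integrated") counter, assuming a
// steady clock. Every checkpoint folds value * (time since the previous
// checkpoint) into a monotonically growing sum.
template <typename T, typename Clock = std::chrono::steady_clock>
class integrated_value {
    T _value = 0;                                   // current instantaneous value
    uint64_t _integral = 0;                         // sum of value * duration, in clock ticks
    typename Clock::time_point _ts = Clock::now();  // time of the previous checkpoint
public:
    void checkpoint(typename Clock::time_point now = Clock::now()) noexcept {
        _integral += _value * (now - _ts).count();
        _ts = now;
    }
    integrated_value& operator++() noexcept { ++_value; return *this; }
    integrated_value& operator--() noexcept { --_value; return *this; }
    // Monotonically increasing value, suitable for exporting as a COUNTER
    uint64_t integral() const noexcept { return _integral; }
    T current() const noexcept { return _value; }
};

Applying rate() to the exported integral then yields the average queue length between two scrapes.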
Each priority class currently tracks the number of queued requests and the number of executing requests, and exports both as gauges. This patch switches both counters to integrated-length values and exports both as counters. Checkpointing happens before dispatching for queued requests and after dispatching for executing requests. Thus the counters provide information about the lengths of the queues the scheduler and the disk have to deal with, respectively. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
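A hypothetical usage pattern matching that description (member names are illustrative, not the actual Seastar code; it assumes the integrated_value sketch above):

// Illustrative only: a priority class keeping two integrated counters,
// checkpointing the queued one right before dispatching a batch and the
// executing one right after.
struct priority_class_sketch {
    integrated_value<unsigned> queued;     // requests waiting in the scheduler queue
    integrated_value<unsigned> executing;  // requests currently owned by the disk

    void enqueue() { ++queued; }

    void dispatch(unsigned batch) {
        queued.checkpoint();        // account for the time spent at the current queue length
        for (unsigned i = 0; i < batch; i++) {
            --queued;
            ++executing;
        }
        executing.checkpoint();     // account for the disk-side queue length after dispatch
    }

    void complete() { --executing; }
};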
Very nice, this is the right way to do those metrics IMO, i.e., the old "sample at a random moment" ones could just be deleted. If you do want a counterpart to these, one that can capture the sort of "max length" micro-behavior within a sample period which is lost in the averaging, what I have found useful is a complementary metric which is a "rolling max" of the value (sometimes a rolling min is useful, and a rolling average is essentially this integrated one). The metric exposes the max over some configurable lookback period. Ideally you align the lookback period to the scrape frequency, but it still works alright if they are mismatched.
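As a rough illustration of that rolling-max idea (my own sketch under the stated assumptions, not Redpanda's or Seastar's code):

#include <algorithm>
#include <array>
#include <chrono>
#include <cstddef>

// Tracks the maximum observed value over the last N fixed-size windows,
// so a short spike stays visible until (at least) the next scrape.
template <typename T, std::size_t Windows = 4>
class rolling_max {
    using clock = std::chrono::steady_clock;
    std::array<T, Windows> _max{};                  // per-window maxima
    std::size_t _cur = 0;                           // index of the current window
    clock::time_point _window_start = clock::now();
    clock::duration _window_len;
public:
    explicit rolling_max(clock::duration window_len) : _window_len(window_len) {}

    void sample(T value, clock::time_point now = clock::now()) {
        while (now - _window_start >= _window_len) {    // rotate into a fresh window
            _cur = (_cur + 1) % Windows;
            _max[_cur] = T{};
            _window_start += _window_len;
        }
        _max[_cur] = std::max(_max[_cur], value);
    }

    // The value to export: the maximum over the whole lookback period
    T get() const { return *std::max_element(_max.begin(), _max.end()); }
};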
// Updates the integral value by adding yet another
// value * duration component. The "now" time-point can be
// provided (e.g. for tests or optimizations)
void checkpoint(Clock::time_point now = Clock::now()) noexcept {
It feels like the need for manual checkpointing is unnecessary? We have this same kind of integrated metric in Redpanda (also for queue lengths), but it always keeps the integrated value up to date, no need to call checkpoint. Critically, it is useful that the integral method updates the integral (effectively, that it checkpoints). Otherwise it is very easy to get bogus data: e.g., the queue has 5 elements in it, but isn't checkpointed for a metric interval, then the rate drops to 0 for that period, then in the next period it jumps to 10, not reflecting reality.
integral() can be non-const and do a true update, or it can stay const and just calculate the additional term to fill out the integral since the last checkpoint (probably better?).
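A sketch of that variant, where every modification checkpoints and integral() stays const by merely adding the not-yet-accounted tail (hypothetical code, not the Redpanda implementation):

#include <chrono>
#include <cstdint>

template <typename T, typename Clock = std::chrono::steady_clock>
class auto_integrated_value {
    T _value = 0;
    uint64_t _integral = 0;
    typename Clock::time_point _ts = Clock::now();

    void account(typename Clock::time_point now) noexcept {
        _integral += _value * (now - _ts).count();
        _ts = now;
    }
public:
    // Every modification checkpoints implicitly
    void add(T delta) noexcept { account(Clock::now()); _value += delta; }
    void sub(T delta) noexcept { account(Clock::now()); _value -= delta; }

    // Const read: the checkpointed sum plus the pending value * duration tail,
    // so the exported counter never goes stale between modifications
    uint64_t integral(typename Clock::time_point now = Clock::now()) const noexcept {
        return _integral + _value * (now - _ts).count();
    }
};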
I added explicit checkpoints for several reasons:
- de-queuing happens in batches (when dispatching), so kicking the result += value * (now - ts) update once per de-queued request is overkill
- the resolution of the time since the previous update is either too expensive for frequent updates (highres clock) or will be zero most of the time (lowres clock)
- (the main reason) I want to have "queue length the dispatch loop deals with" metrics, explicitly. Checkpointing right before doing the dispatch makes that happen

But by and large yes, it can be automatically checkpointed on every modification.
Looks reasonable, but maybe there is an accepted name for such a metric? We should adopt it if so. AI tools can help find the name given a description.
When we added this, 60% of the time spent was discussing the name, IIRC. AI suggests "Integral Metrics or Cumulative Time-Weighted Metrics". I think Integral is pretty bad as it has about 2 other, arguably more common, meanings.
Cumulative is a good word here. |
Gemini suggests "request_duration_seconds_total". |
No, that's the sum of durations, not the sum of length*duration. |
So now:
I think I like total waiting time best.
I've been looking for a "max queue length between scrapes" too, but don't like re-setting the maximum on a timer. I admit it works, it just doesn't look nice.
The closest analogue is "man-hours" in pro(duct|ject|cess) management. In physics there are "something * time" units too: my favorite is from kinematics :) absement = absence + displacement
Confuses with "total latency"
reqond (REQest*secOND)
Current name, yes
There are two hard problems in computer science – cache invalidation, naming and off-by-one errors.
But the new helper class literally calculates the approximation of the integral of the XY plot |
I have a very rough draft version that collects a histogram of durations where buckets are selected based on the queue lengths. Like "the queue was sized [0,2) for 1.2 seconds, sized [2,4) for 3.2 seconds, sized [4,8) for 0.1 seconds and sized [8,infty) for 2.6 seconds". Out of it I can derive p99-ish metrics saying that "the queue was sized above N 99% of the time so far". But it has more problems than benefits for now :( Maybe something can come out of it later
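A toy sketch of that duration-per-length-bucket idea (entirely hypothetical, with hard-coded buckets, not the draft being referred to):

#include <array>
#include <chrono>
#include <cstddef>

// Accumulates how long the queue spent in each length bucket:
// [0,2), [2,4), [4,8) and [8,inf)
class queue_length_histogram {
    using clock = std::chrono::steady_clock;
    static constexpr std::array<unsigned, 3> bounds{2, 4, 8};
    std::array<clock::duration, 4> _time_in_bucket{};
    unsigned _len = 0;
    clock::time_point _ts = clock::now();

    static std::size_t bucket(unsigned len) {
        std::size_t b = 0;
        while (b < bounds.size() && len >= bounds[b]) {
            b++;
        }
        return b;
    }
public:
    // Called on every queue length change; charges the elapsed time to the
    // bucket of the previous length
    void set_length(unsigned len, clock::time_point now = clock::now()) {
        _time_in_bucket[bucket(_len)] += now - _ts;
        _ts = now;
        _len = len;
    }

    clock::duration time_in(std::size_t b) const { return _time_in_bucket[b]; }
};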
Right, that's one way to keep it "counter based" without the ugliness of the rolling window (note that the rolling window doesn't reset on scrape: it is a rolling window, separately configured, and a scrape is a "read only" operation, but perhaps you still consider it ugly), but it comes at the high cost of many metrics and imprecise identification of the max (of course, you get something for this cost: a histogram).
@travisdowns, what's your preference for the new metric's name? Provided that "integral" is still not good.
Sorry, I missed your earlier comment. Yes, it is the integral, but in English words always need 3+ meanings :). So I think the right name would be
And sorry, I had provided that offhand comment without noticing that |
Sometimes a performance metric is inherently volatile and fluctuates rapidly. A great example of such a metric is the length of task queues or IO queues. Nowadays those values are exported as gauges. When the monitoring system periodically "scrapes" (polls) those values, the collected data point is merely a snapshot of the queue state at that precise moment and doesn't really show what happens in that area. Scraping happens once every several minutes, while queue length changes almost every millisecond, so the extracted information is less than a thousandth of a percent of what is there: negligibly little.
In the old days the very same problem existed for IO request latencies. Back then the latency of requests in a class was tracked by saving the latency of the last completed request and exporting it as a gauge. Similarly, almost all the information about request latencies was lost. Commit 803e790 fixed that and introduced a monotonically growing total_queue_time value, exported as a counter and monitored using the promql rate() function. This PR extends this approach to IO queue length by introducing the integrated_length wrapper.
It makes little sense just to add up the queue lengths whenever the queue gets modified. Instead, the integrator maintains the sum of length * duration products taken at specific checkpoints, where length is the length of the queue at the moment it is checkpointed and duration is the time passed since the last checkpoint. If you draw an XY plot where X is time and Y is queue length, the accumulated value is the area below the plot. If, as promql rate() does, you take two integrated values, subtract the previous one from the latest one and divide by the duration (the difference of the X points), the result is the average Y value between the two scrape points, which is much more information than a randomly scraped point.

It's not perfect, of course. Short spikes are averaged out and thus swallowed by this metric; the shorter the spike, the less noticeable it becomes. Nonetheless, this gives much more information about how queue length changes over time than randomly taken gauge snapshots.
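A small worked example of the arithmetic the monitoring side effectively performs with rate() (the numbers here are made up):

#include <cstdio>

int main() {
    // Two scrapes of the integrated counter, 15 seconds apart. Suppose the
    // queue held 4 requests for 10s and 10 requests for 5s in between.
    double counter_prev = 1000.0;                           // length * seconds accumulated so far
    double counter_curr = counter_prev + 4 * 10 + 10 * 5;   // = 1090
    double scrape_interval = 15.0;                          // seconds

    // What promql rate() effectively computes: the average queue length
    double avg_queue_length = (counter_curr - counter_prev) / scrape_interval;
    std::printf("average queue length over the interval: %.2f\n", avg_queue_length);  // prints 6.00
    return 0;
}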