
Conversation

@xemul
Contributor

xemul commented Jan 20, 2026

Sometimes a performance metric is inherently volatile and fluctuates rapidly. A great example of such a metric is the length of task queues or IO queues. Nowadays those values are exported as gauges. When the monitoring system periodically "scrapes" (polls) them, the collected data point is merely a snapshot of the queue state at that precise moment and doesn't say much about what really happens in that area. Scraping happens once every several minutes, while the queue length changes almost every millisecond, so the extracted information is less than a thousandth of a percent of what actually happens -- negligibly little.

In the old days the very same problem existed for IO request latencies. Back then the latency of requests in a class was tracked by saving the latency of the last completed request and exporting it as a gauge. Similarly -- almost all the information about request latencies was lost. Commit 803e790 fixed that and introduced a monotonically growing total_queue_time value, exported as a counter and monitored using the promql rate() function.

This PR extends this approach to IO queue length by introducing the integrated_length wrapper.

It makes little sense to just add up the queue lengths whenever the queue gets modified. Instead, the integrator maintains the sum of length * duration products taken at specific checkpoints, where the length is the length of the queue when it's checkpoint-ed and the duration is the time passed since the previous checkpoint. If drawing the XY plot where X is time and Y is queue length, the accumulated value is then the area under the plot. If, as promql rate() does, one takes two integrated values, subtracts the previous one from the latest and divides the result by the duration (the difference of the X points), the result is the average Y value between the two scrape points. Which is much more information than a randomly scraped point carries.
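
For illustration, a minimal self-contained sketch of such an integrator (the integrated_value name, its members and the steady_clock choice are my assumptions here, not the PR's actual API):

#include <chrono>
#include <cstdint>

// Minimal sketch of a "value * duration" integrator: the accumulated result
// is the area under the value-vs-time plot, so rate() over the exported
// counter yields the time-averaged value between two scrapes.
class integrated_value {
    using clock = std::chrono::steady_clock;
    std::uint64_t _value = 0;     // current Y value, e.g. the queue length
    std::uint64_t _integral = 0;  // sum of value * duration, in value-nanoseconds
    clock::time_point _last = clock::now();
public:
    // Accumulate value * (time passed since the previous checkpoint).
    void checkpoint(clock::time_point now = clock::now()) noexcept {
        _integral += _value * std::chrono::duration_cast<std::chrono::nanoseconds>(now - _last).count();
        _last = now;
    }
    void set(std::uint64_t v) noexcept { _value = v; }
    std::uint64_t integral() const noexcept { return _integral; }  // exported as a COUNTER
};

Subtracting two such integrals and dividing by the time between them gives the average value over that interval, up to the granularity of checkpointing.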

It's not perfect, of course. Short spikes are averaged out and thus swallowed by this metric; the shorter the spike, the less noticeable it becomes. Nonetheless, this gives much more information about how the queue length changes over time, compared to randomly taken gauge snapshots.

xemul added 2 commits January 20, 2026 15:08
The class wraps an unsigned integral type (for arithmetic) and allows
the caller to "checkpoint" it from time to time (this can be done as
frequently as needed though). Every checkpoint then contributes a
value*duration product (duration since the previous checkpoint) to the
final result.

When exported, the integrator provides a monotonically increasing value,
i.e. -- a COUNTER. The monitoring system can then rate() the counter, thus
providing a value averaged over the scraping period rather than a
randomly taken snapshot, as a gauge would do.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Each priority class currently calculates the number of queued requests
and the number of executing requests, and exports both as gauges. This
patch switches both to be integrated-length values and exports them as
counters.

Checkpointing happens before dispatching for queued requests, and after
dispatching for executing requests. Thus the counters provide the info
about the lengths of the queues the scheduler and the disk have to deal
with, respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
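
Purely as an illustration of where those checkpoints would sit (the queue, the request type and the function are placeholders, not the actual Seastar code), reusing the integrated_value sketch from the description above:

#include <cstddef>
#include <deque>

struct request {};             // placeholder request type
std::deque<request> queue;     // placeholder not-yet-dispatched queue
integrated_value queue_len;    // integrated queued-requests length

void dispatch_batch(std::size_t batch) {
    // Checkpoint right before dispatching: the time elapsed since the
    // previous dispatch is accounted with the queue length the dispatch
    // loop now has to deal with.
    queue_len.set(queue.size());
    queue_len.checkpoint();

    while (batch-- > 0 && !queue.empty()) {
        queue.pop_front();     // would be submitted to the disk here
    }
}

The executing-requests counter would be handled symmetrically, with its checkpoint taken after the batch has been dispatched.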
@xemul
Contributor Author

xemul commented Jan 20, 2026

image

An example of how the two metrics look (old gauge, left side; new rate-d counter, right side) for the compaction class during the backup test

@xemul
Contributor Author

xemul commented Jan 20, 2026

image

One more example -- the commitlog queue lengths for a node lightly loaded by the latte writer

@xemul
Contributor Author

xemul commented Jan 20, 2026

image

And the proof of correctness -- a static integrator with the constant value of 1, collected, exported and rate-d similarly to the queue lengths from the examples above (not included in this PR)

@travisdowns
Contributor

travisdowns commented Jan 20, 2026

Very nice, this is the right way to do those metrics IMO, i.e., the old "sample at random moment" ones could just be deleted.

If you do want a counterpart to these, one that can capture the sort of "max length" micro-behavior within a sample period that is lost in the averaging, what I have found useful is a complementary metric which is a "rolling max" of the value (sometimes a rolling min is useful, and a rolling average is essentially this integrated one). The metric exposes the max over some configurable lookback period. Ideally you align the lookback period to the scrape frequency, but it still works alright if they are mismatched.
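
For reference, a minimal sketch of such a rolling max over a lookback window (the class name, the monotonic-deque approach and the window handling are my assumptions, not an existing Redpanda or Seastar API):

#include <chrono>
#include <cstdint>
#include <deque>
#include <utility>

// Rolling maximum over a configurable lookback window: record() pushes
// timestamped samples, max() reports the largest sample still inside the
// window. The deque is kept with non-increasing values, so both operations
// are O(1) amortized.
class rolling_max {
    using clock = std::chrono::steady_clock;
    std::chrono::milliseconds _window;
    std::deque<std::pair<clock::time_point, std::uint64_t>> _samples;

    void expire(clock::time_point now) {
        while (!_samples.empty() && now - _samples.front().first > _window) {
            _samples.pop_front();
        }
    }
public:
    explicit rolling_max(std::chrono::milliseconds window) : _window(window) {}

    void record(std::uint64_t v, clock::time_point now = clock::now()) {
        // Drop older samples that can never be the maximum again.
        while (!_samples.empty() && _samples.back().second <= v) {
            _samples.pop_back();
        }
        _samples.emplace_back(now, v);
        expire(now);
    }

    // A scrape only reads the current window maximum; nothing is reset here,
    // expired samples are simply dropped.
    std::uint64_t max(clock::time_point now = clock::now()) {
        expire(now);
        return _samples.empty() ? 0 : _samples.front().second;
    }
};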

// Updates the integral value by adding yet another
// value * duration component. The "now" time-point can be
// provided (e.g. for tests or optimizations)
void checkpoint(Clock::time_point now = Clock::now()) noexcept {
Contributor

It feels like the need for manual checkpointing is unnecessary? We have this same kind of integrated metric in Redpanda (also for queue lengths), but it always keeps the integrated value up to date, no need to call checkpoint. Critically, it is useful that the integral method updates the integral (effectively, that it checkpoints). Otherwise it is very easy to get bogus data: e.g., the queue has 5 elements in it, but isn't checkpointed for a metric interval, then the rate drops to 0 for that period, then in the next period it jumps to 10, not reflecting reality.

integral can be non-const and do a true update, or it can stay const and just calculate the additional term to fill out the integral since the last checkpoint (probably better?).
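
A sketch of the const variant suggested here, as an extra member of the integrated_value sketch from the PR description (names assumed): it folds the not-yet-checkpointed interval into the returned value without mutating any state.

// Returns the integral including the term accumulated since the last
// checkpoint, so a scrape landing between checkpoints still sees
// value * elapsed-time for that gap instead of a rate dropping to zero.
std::uint64_t integral(clock::time_point now = clock::now()) const noexcept {
    return _integral + _value * std::chrono::duration_cast<std::chrono::nanoseconds>(now - _last).count();
}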

Contributor Author

I added explicit checkpoints for several reasons

  1. de-queuing happens in batches (when dispatching), so kicking the result += value * (now - ts) update on every modification is overkill
  2. resolving the time since the previous update is either too expensive for frequent updates (high-res clock) or will be zero most of the time (low-res clock)
  3. (the main reason) I want to have a "queue length the dispatch loop deals with" metric, explicitly. Checkpointing right before doing the dispatch makes that happen

But by and large yes, it can be automatically checkpointed on every modification

@avikivity
Member

Looks reasonable, but maybe there is an accepted name for such a metric? We should adopt it if so. AI tools can help find the name given a description.

@travisdowns
Contributor

travisdowns commented Jan 20, 2026

maybe there is an accepted name for such a metric? We should adopt it if so. AI tools can help find the name given a description.

When we added this, 60% of the time was spent discussing the name IIRC.

AI suggests "Integral Metrics or Cumulative Time-Weighted Metrics"

I think Integral is pretty bad as it has about 2 other, arguably more common, meanings.

@avikivity
Member

Cumulative is a good word here.

@avikivity
Member

Gemini suggests "request_duration_seconds_total".

@avikivity
Member

No, that's the sum of durations, not the sum of length*duration.

@travisdowns
Contributor

travisdowns commented Jan 20, 2026

iostat has similar naming to that for its cumulative metric, and yes, it's confusing/wrong

@avikivity
Member

So now:

  • total waiting time
  • queue request-seconds
  • integrated queue length

I think I like total waiting time best.

@xemul
Contributor Author

xemul commented Jan 21, 2026

If you do want a counterpart to these, one that can capture the sort of "max length" micro-behavior within a sample period that is lost in the averaging, what I have found useful is a complementary metric which is a "rolling max" of the value (sometimes a rolling min is useful, and a rolling average is essentially this integrated one). The metric exposes the max over some configurable lookback period. Ideally you align the lookback period to the scrape frequency, but it still works alright if they are mismatched.

I've been looking for "max queue length between scrapes" too, but don't like re-setting the maximum on a timer. I admit it works, it just doesn't look nice.

@xemul
Contributor Author

xemul commented Jan 21, 2026

Looks reasonable, but maybe there is an accepted name for such a metric? We should adopt it if so. AI tools can help find the name given a description.

The closest analogy is "man-hours" in pro(duct|ject|cess) management.
AI suggested "average queue load", but the newly added class is not about queues, it's about integrating any fluctuating integer, so it's not a perfect match

In physics there are "something * time" units:
https://en.wikipedia.org/wiki/Action_(physics)
https://en.wikipedia.org/wiki/Impulse_(physics)

My favorite is from kinematics :) absement = absence + displacement

total waiting time

Confuses with "total latency"

queue request-seconds

reqond (REQest*secOND)

integrated queue length

Current name, yes

@xemul
Contributor Author

xemul commented Jan 21, 2026

When we added this 60% of the time spent was discussing the name IIRC.

There are two hard problems in computer science – cache invalidation, naming and off-by-one errors.
So yes, we're going to re-prove this theory here :D

@xemul
Contributor Author

xemul commented Jan 21, 2026

I think Integral is pretty bad as it has about 2 other, arguably more common, meanings.

But the new helper class literally calculates an approximation of the integral of the XY plot

@xemul
Contributor Author

xemul commented Jan 21, 2026

I've been looking for "max queue length between scrapes" too, but don't like re-setting the maximum on a timer. I admit it works, it just doesn't look nice.

I have a very rough draft version that collects a histogram of durations where the buckets are selected based on the queue length. Like "the queue was sized [0,2) for 1.2 seconds, sized [2,4) for 3.2 seconds, sized [4,8) for 0.1 seconds and sized [8,infty) for 2.6 seconds". Out of it I can derive p99-ish metrics saying that "the queue was sized above N 99% of the time so far". But it has more problems than benefits for now :( Maybe something can come out of it later
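
For illustration only, a minimal sketch of that duration-histogram idea (the bucket boundaries, the names and the checkpoint semantics are made up here, not the actual draft code):

#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Accumulates how long the tracked value spent in each bucket:
// [0,2), [2,4), [4,8), [8,inf). checkpoint() attributes the time passed
// since the previous checkpoint to the bucket of the current value.
class length_duration_histogram {
    using clock = std::chrono::steady_clock;
    std::array<std::chrono::nanoseconds, 4> _time_in_bucket{};
    std::uint64_t _value = 0;
    clock::time_point _last = clock::now();

    static std::size_t bucket_of(std::uint64_t v) noexcept {
        if (v < 2) return 0;
        if (v < 4) return 1;
        if (v < 8) return 2;
        return 3;
    }
public:
    void checkpoint(clock::time_point now = clock::now()) noexcept {
        _time_in_bucket[bucket_of(_value)] += std::chrono::duration_cast<std::chrono::nanoseconds>(now - _last);
        _last = now;
    }
    void set(std::uint64_t v) noexcept { _value = v; }
    std::chrono::nanoseconds time_in_bucket(std::size_t i) const noexcept {
        return _time_in_bucket[i];
    }
};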

@travisdowns
Contributor

I have a very rough draft version that collects a histogram of durations where the buckets are selected based on the queue length. Like "the queue was sized [0,2) for 1.2 seconds, sized [2,4) for 3.2 seconds, sized [4,8) for 0.1 seconds and sized [8,infty) for 2.6 seconds". Out of it I can derive p99-ish metrics saying that "the queue was sized above N 99% of the time so far". But it has more problems than benefits for now :( Maybe something can come out of it later

Right, that's one way to keep it "counter based" without the ugliness of the rolling window (note that the rolling window doesn't reset on scrape: it is a rolling window, separately configured; scrape is a "read only" operation, but perhaps you still consider it ugly), but it comes at a high cost of many metrics and an imprecise identification of the max (of course, you get something for this cost: a histogram).

@xemul
Contributor Author

xemul commented Jan 22, 2026

@travisdowns , what's your preference for the new metric's name? Provided "integral" is still not good

@travisdowns
Contributor

@travisdowns , what's your preference for the new metric's name? Provided "integral" is still not good

Sorry, I missed your earlier comment. Yes, it is the integral, but in English, words always need 3+ meanings :).

So I think the right name would be integrated_..., not integral_..., as you need an adjective there. "Integral" as an adjective (meaning "whole") seems wrong. It should be integrated, which as an adjective means "resulting from mathematical integration" or something.

@travisdowns
Contributor

And sorry, I had provided that offhand comment without noticing that integral was actually the chosen approach in the PR, otherwise I would have been more constructive.
