
Conversation

@xemul
Contributor

xemul commented Jan 20, 2026

Sometimes a performance metric is inherently volatile and fluctuates rapidly. A great example of such a metric is the length of task queues or IO queues. Nowadays those values are exported as gauges. When the monitoring system periodically "scrapes" (polls) them, the collected data point is merely a snapshot of the queue state at that precise moment and doesn't say much about what really happens in that area. Scraping happens once every several minutes, while the queue length changes almost every millisecond, so the extracted information is less than a thousandth of a percent of what actually happens -- negligibly little.

In the old days the very same problem existed for IO request latencies. Back then the latency of requests in a class was tracked by saving the latency of the last completed request and exporting it as a gauge. Similarly -- almost all the information about request latencies was lost. Commit 803e790 fixed that and introduced a monotonically growing total_queue_time value, exported as a counter and monitored using the promql rate() function.

This PR extends this approach to IO queue length by introducing the integrated_length wrapper.

It makes little sense to just add up the queue lengths whenever the queue gets modified. Instead, the integrator maintains the sum of length * duration products taken at specific checkpoints, where the length is the length of the queue when it's checkpoint-ed and the duration is the time passed since the previous checkpoint. If drawing the XY plot where X is time and Y is queue length, the accumulated value is then the area under the plot. If, as promql rate() does, one takes two integrated values, subtracts the previous one from the latest and divides the result by the duration (the difference of the X points), the result is the average Y value between the two scrape points. Which is much more information than a randomly scraped point carries.
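
For illustration, a minimal self-contained sketch of such an integrator (the integrated_value name, its members and the steady_clock choice are my assumptions here, not the PR's actual API):

#include <chrono>
#include <cstdint>

// Minimal sketch of a "value * duration" integrator: the accumulated result
// is the area under the value-vs-time plot, so rate() over the exported
// counter yields the time-averaged value between two scrapes.
class integrated_value {
    using clock = std::chrono::steady_clock;
    std::uint64_t _value = 0;     // current Y value, e.g. the queue length
    std::uint64_t _integral = 0;  // sum of value * duration, in value-nanoseconds
    clock::time_point _last = clock::now();
public:
    // Accumulate value * (time passed since the previous checkpoint).
    void checkpoint(clock::time_point now = clock::now()) noexcept {
        _integral += _value * std::chrono::duration_cast<std::chrono::nanoseconds>(now - _last).count();
        _last = now;
    }
    void set(std::uint64_t v) noexcept { _value = v; }
    std::uint64_t integral() const noexcept { return _integral; }  // exported as a COUNTER
};

Subtracting two such integrals and dividing by the time between them gives the average value over that interval, up to the granularity of checkpointing.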

It's not perfect, of course. Short spikes are averaged out and thus swallowed by this metric; the shorter the spike, the less noticeable it becomes. Nonetheless, this gives much more information about how the queue length changes over time, compared to randomly taken gauge snapshots.

xemul added 2 commits January 20, 2026 15:08
The class wraps an unsigned integral type (for arithmetic) and allows
the caller to "checkpoint" it from time to time (this can be done as
frequently as needed though). Every checkpoint then contributes a
value*duration product (duration since the previous checkpoint) to the
final result.

When exported, the integrator provides a monotonically increasing value,
i.e. -- a COUNTER. The monitoring system can then rate() the counter, thus
providing a value averaged over the scraping period rather than a
randomly taken snapshot, as a gauge would do.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Each priority class currently calculates the number of queued requests
and the number of executing requests, and exports both as gauges. This
patch switches both to be integrated-length values and exports them as
counters.

Checkpointing happens before dispatching for queued requests, and after
dispatching for executing requests. Thus the counters provide the info
about the lengths of the queues the scheduler and the disk have to deal
with, respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
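
Purely as an illustration of where those checkpoints would sit (the queue, the request type and the function are placeholders, not the actual Seastar code), reusing the integrated_value sketch from the description above:

#include <cstddef>
#include <deque>

struct request {};             // placeholder request type
std::deque<request> queue;     // placeholder not-yet-dispatched queue
integrated_value queue_len;    // integrated queued-requests length

void dispatch_batch(std::size_t batch) {
    // Checkpoint right before dispatching: the time elapsed since the
    // previous dispatch is accounted with the queue length the dispatch
    // loop now has to deal with.
    queue_len.set(queue.size());
    queue_len.checkpoint();

    while (batch-- > 0 && !queue.empty()) {
        queue.pop_front();     // would be submitted to the disk here
    }
}

The executing-requests counter would be handled symmetrically, with its checkpoint taken after the batch has been dispatched.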
@xemul
Contributor Author

xemul commented Jan 20, 2026

image

An example of how the two metrics look (old gauge, left side; new rate-d counter, right side) for the compaction class during the backup test

@xemul
Contributor Author

xemul commented Jan 20, 2026

image

One more example -- the commitlog queue lengths for a node lightly loaded by the latte writer

@xemul
Contributor Author

xemul commented Jan 20, 2026

image

And the proof of correctness -- a static integrator with the constant value of 1, collected, exported and rate-d similarly to the queue lengths from the examples above (not included in this PR)

@travisdowns
Contributor

travisdowns commented Jan 20, 2026

Very nice, this is the right way to do those metrics IMO, i.e., the old "sample at random moment" ones could just be deleted.

If you do want a counterpart to these, one that can capture the sort of "max length" micro-behavior within a sample period that is lost in the averaging, what I have found useful is a complementary metric which is a "rolling max" of the value (sometimes a rolling min is useful, and a rolling average is essentially this integrated one). The metric exposes the max over some configurable lookback period. Ideally you align the lookback period to the scrape frequency, but it still works alright if they are mismatched.
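
For reference, a minimal sketch of such a rolling max over a lookback window (the class name, the monotonic-deque approach and the window handling are my assumptions, not an existing Redpanda or Seastar API):

#include <chrono>
#include <cstdint>
#include <deque>
#include <utility>

// Rolling maximum over a configurable lookback window: record() pushes
// timestamped samples, max() reports the largest sample still inside the
// window. The deque is kept with non-increasing values, so both operations
// are O(1) amortized.
class rolling_max {
    using clock = std::chrono::steady_clock;
    std::chrono::milliseconds _window;
    std::deque<std::pair<clock::time_point, std::uint64_t>> _samples;

    void expire(clock::time_point now) {
        while (!_samples.empty() && now - _samples.front().first > _window) {
            _samples.pop_front();
        }
    }
public:
    explicit rolling_max(std::chrono::milliseconds window) : _window(window) {}

    void record(std::uint64_t v, clock::time_point now = clock::now()) {
        // Drop older samples that can never be the maximum again.
        while (!_samples.empty() && _samples.back().second <= v) {
            _samples.pop_back();
        }
        _samples.emplace_back(now, v);
        expire(now);
    }

    // A scrape only reads the current window maximum; nothing is reset here,
    // expired samples are simply dropped.
    std::uint64_t max(clock::time_point now = clock::now()) {
        expire(now);
        return _samples.empty() ? 0 : _samples.front().second;
    }
};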

// Updates the integral value by adding yet another
// value * duration component. The "now" time-point can be
// provided (e.g. for tests or optimizations)
void checkpoint(Clock::time_point now = Clock::now()) noexcept {
Contributor

It feels like the need for manual checkpointing is unnecessary? We have this same kind of integrated metric in Redpanda (also for queue lengths), but it always keeps the integrated value up to date, no need to call checkpoint. Critically, it is useful that the integral method updates the integral (effectively, that it checkpoints). Otherwise it is very easy to get bogus data: e.g., the queue has 5 elements in it, but isn't checkpointed for a metric interval, then the rate drops to 0 for that period, then in the next period it jumps to 10, not reflecting reality.

integral can be non-const and do a true update, or it can stay const and just calculate the additional term to fill out the integral since the last checkpoint (probably better?).
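
A sketch of the const variant suggested here, as an extra member of the integrated_value sketch from the PR description (names assumed): it folds the not-yet-checkpointed interval into the returned value without mutating any state.

// Returns the integral including the term accumulated since the last
// checkpoint, so a scrape landing between checkpoints still sees
// value * elapsed-time for that gap instead of a rate dropping to zero.
std::uint64_t integral(clock::time_point now = clock::now()) const noexcept {
    return _integral + _value * std::chrono::duration_cast<std::chrono::nanoseconds>(now - _last).count();
}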

Contributor Author

I added explicit checkpoints for several reasons

  1. de-queuing happens in batches (when dispatching), so kicking the result += value * (now - ts) update on every modification is overkill
  2. resolving the time since the previous update is either too expensive for frequent updates (high-res clock) or will be zero most of the time (low-res clock)
  3. (the main reason) I want to have a "queue length the dispatch loop deals with" metric, explicitly. Checkpointing right before doing the dispatch makes that happen

But by and large yes, it can be automatically checkpointed on every modification

@avikivity
Member

Looks reasonable, but maybe there is an accepted name for such a metric? We should adopt it if so. AI tools can help find the name given a description.

@travisdowns
Contributor

travisdowns commented Jan 20, 2026

maybe there is an accepted name for such a metric? We should adopt it if so. AI tools can help find the name given a description.

When we added this, 60% of the time was spent discussing the name IIRC.

AI suggests "Integral Metrics or Cumulative Time-Weighted Metrics"

I think Integral is pretty bad as it has about 2 other, arguably more common, meanings.

@avikivity
Member

Cumulative is a good word here.

@avikivity
Member

Gemini suggests "request_duration_seconds_total".

@avikivity
Member

No, that's the sum of durations, not the sum of length*duration.

@travisdowns
Contributor

travisdowns commented Jan 20, 2026

iostat has similar naming to that for its cumulative metric, and yes, it's confusing/wrong

@avikivity
Member

So now:

  • total waiting time
  • queue request-seconds
  • integrated queue length

I think I like total waiting time best.

@xemul
Contributor Author

xemul commented Jan 21, 2026

If you do want a counterpart to these, one that can capture the sort of "max length" micro-behavior within a sample period that is lost in the averaging, what I have found useful is a complementary metric which is a "rolling max" of the value (sometimes a rolling min is useful, and a rolling average is essentially this integrated one). The metric exposes the max over some configurable lookback period. Ideally you align the lookback period to the scrape frequency, but it still works alright if they are mismatched.

I've been looking for "max queue length between scrapes" too, but don't like re-setting the maximum on a timer. I admit it works, it just doesn't look nice.

@xemul
Contributor Author

xemul commented Jan 21, 2026

Looks reasonable, but maybe there is an accepted name for such a metric? We should adopt it if so. AI tools can help find the name given a description.

The closest analogy is "man-hours" in pro(duct|ject|cess) management.
AI suggested "average queue load", but the newly added class is not about queues, it's about integrating any fluctuating integer, so it's not a perfect match

In physics there are "something * time" units:
https://en.wikipedia.org/wiki/Action_(physics)
https://en.wikipedia.org/wiki/Impulse_(physics)

My favorite is from kinematics :) absement = absence + displacement

total waiting time

Confuses with "total latency"

queue request-seconds

reqond (REQest*secOND)

integrated queue length

Current name, yes

@xemul
Contributor Author

xemul commented Jan 21, 2026

When we added this 60% of the time spent was discussing the name IIRC.

There are two hard problems in computer science – cache invalidation, naming and off-by-one errors.
So yes, we're going to re-prove this theory here :D

@xemul
Contributor Author

xemul commented Jan 21, 2026

I think Integral is pretty bad as it has about 2 other, arguably more common, meanings.

But the new helper class literally calculates an approximation of the integral of the XY plot

@xemul
Contributor Author

xemul commented Jan 21, 2026

I've been looking for "max queue length between scrapes" too, but don't like re-setting the maximum on a timer. I admit it works, it just doesn't look nice.

I have a very rough draft version that collects a histogram of durations where the buckets are selected based on the queue length. Like "the queue was sized [0,2) for 1.2 seconds, sized [2,4) for 3.2 seconds, sized [4,8) for 0.1 seconds and sized [8,infty) for 2.6 seconds". Out of it I can derive p99-ish metrics saying that "the queue was sized above N 99% of the time so far". But it has more problems than benefits for now :( Maybe something can come out of it later
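
For illustration only, a minimal sketch of that duration-histogram idea (the bucket boundaries, the names and the checkpoint semantics are made up here, not the actual draft code):

#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Accumulates how long the tracked value spent in each bucket:
// [0,2), [2,4), [4,8), [8,inf). checkpoint() attributes the time passed
// since the previous checkpoint to the bucket of the current value.
class length_duration_histogram {
    using clock = std::chrono::steady_clock;
    std::array<std::chrono::nanoseconds, 4> _time_in_bucket{};
    std::uint64_t _value = 0;
    clock::time_point _last = clock::now();

    static std::size_t bucket_of(std::uint64_t v) noexcept {
        if (v < 2) return 0;
        if (v < 4) return 1;
        if (v < 8) return 2;
        return 3;
    }
public:
    void checkpoint(clock::time_point now = clock::now()) noexcept {
        _time_in_bucket[bucket_of(_value)] += std::chrono::duration_cast<std::chrono::nanoseconds>(now - _last);
        _last = now;
    }
    void set(std::uint64_t v) noexcept { _value = v; }
    std::chrono::nanoseconds time_in_bucket(std::size_t i) const noexcept {
        return _time_in_bucket[i];
    }
};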

@travisdowns
Contributor

I have a very rough draft version that collects a histogram of durations where the buckets are selected based on the queue length. Like "the queue was sized [0,2) for 1.2 seconds, sized [2,4) for 3.2 seconds, sized [4,8) for 0.1 seconds and sized [8,infty) for 2.6 seconds". Out of it I can derive p99-ish metrics saying that "the queue was sized above N 99% of the time so far". But it has more problems than benefits for now :( Maybe something can come out of it later

Right, that's one way to keep it "counter based" without the ugliness of the rolling window (note that the rolling window doesn't reset on scrape: it is a rolling window, separately configured; scrape is a "read only" operation, but perhaps you still consider it ugly), but it comes at a high cost of many metrics and an imprecise identification of the max (of course, you get something for this cost: a histogram).

@xemul
Contributor Author

xemul commented Jan 22, 2026

@travisdowns , what's your preference for the new metric's name? Provided "integral" is still not good

@travisdowns
Contributor

@travisdowns , what's your preference for the new metric's name? Provided "integral" is still not good

Sorry, I missed your earlier comment. Yes, it is the integral, but in English, words always need 3+ meanings :).

So I think the right name would be integrated_..., not integral_..., as you need an adjective there. "Integral" as an adjective (meaning "whole") seems wrong. It should be integrated, which as an adjective means "resulting from mathematical integration" or something.

@travisdowns
Contributor

And sorry, I had provided that offhand comment without noticing that integral was actually the chosen approach in the PR, otherwise I would have been more constructive.
