Add CPU and scheduled time timeline to operator metrics #29016

Open
lukasz-stec wants to merge 1 commit into trinodb:master from starburstdata:ls/2604/01-resource-usage-timeline
@lukasz-stec
Member

Description

Introduces ResourceUsageTimeSeriesRecorder, a fixed-size bucketed sampler that doubles its bucket width as operator execution grows. CPU and wall time are recorded across addInput, getOutput, and finish phases, merged into a single snapshot, and surfaced as "CPU and scheduled time usage over time" in operator metrics.
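
For readers unfamiliar with the technique, here is a minimal sketch of a fixed-size recorder that doubles its bucket width once samples no longer fit. It is not the code from this PR: the class name `DoublingBucketRecorder`, its method signatures, the 1-second initial width, and the pairwise compaction strategy are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the ResourceUsageTimeSeriesRecorder from this PR.
final class DoublingBucketRecorder
{
    private final long startSeconds;
    private final long[] cpuNanos;
    private final long[] wallNanos;
    private long bucketWidthSeconds = 1; // assumed initial width

    DoublingBucketRecorder(long startSeconds, int bucketCount)
    {
        if (bucketCount < 2 || bucketCount % 2 != 0) {
            throw new IllegalArgumentException("bucketCount must be even and at least 2");
        }
        this.startSeconds = startSeconds;
        this.cpuNanos = new long[bucketCount];
        this.wallNanos = new long[bucketCount];
    }

    // Adds a CPU/wall delta to the bucket that covers nowSeconds.
    void record(long nowSeconds, long cpuDeltaNanos, long wallDeltaNanos)
    {
        long offset = Math.max(0, nowSeconds - startSeconds);
        // Widen buckets until the sample fits into the fixed-size arrays.
        while (offset / bucketWidthSeconds >= cpuNanos.length) {
            compact(cpuNanos);
            compact(wallNanos);
            bucketWidthSeconds *= 2;
        }
        int index = (int) (offset / bucketWidthSeconds);
        cpuNanos[index] += cpuDeltaNanos;
        wallNanos[index] += wallDeltaNanos;
    }

    // Sums adjacent pairs into the first half of the array and clears the second half.
    private static void compact(long[] buckets)
    {
        int half = buckets.length / 2;
        for (int i = 0; i < half; i++) {
            buckets[i] = buckets[2 * i] + buckets[2 * i + 1];
        }
        Arrays.fill(buckets, half, buckets.length, 0);
    }

    long bucketWidthSeconds()
    {
        return bucketWidthSeconds;
    }

    long[] cpuNanosBuckets()
    {
        return cpuNanos.clone();
    }

    long[] wallNanosBuckets()
    {
        return wallNanos.clone();
    }
}
```

In this sketch memory stays constant regardless of how long the operator runs; the trade-off is that resolution halves each time the recorded span outgrows the arrays.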

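JMH benchmark results show small overhead: record costs about 35 ns/op regardless of recordDelayMillis, and enabling resourceTimeSeries raises operatorStatsAdd by about 14% with 2 operators, about 2% with 10, and under 1% with 100.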

Benchmark                                                  (operatorCount)  (randomBucketWidth)  (randomStartTime)  (recordDelayMillis)  (resourceTimeSeries)  (snapshotCount)  Mode  Cnt        Score      Error  Units
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                false              false                  N/A                   N/A                2  avgt   20      222.676 ±    1.748  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                false              false                  N/A                   N/A               10  avgt   20      402.062 ±    6.396  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                false              false                  N/A                   N/A              100  avgt   20     3616.961 ±  210.224  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                false               true                  N/A                   N/A                2  avgt   20      277.844 ±    1.140  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                false               true                  N/A                   N/A               10  avgt   20     1024.055 ±    5.733  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                false               true                  N/A                   N/A              100  avgt   20    11280.448 ±  824.108  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                 true              false                  N/A                   N/A                2  avgt   20      258.718 ±    2.311  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                 true              false                  N/A                   N/A               10  avgt   20      748.176 ±    7.178  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                 true              false                  N/A                   N/A              100  avgt   20     7295.883 ±  295.682  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                 true               true                  N/A                   N/A                2  avgt   20      308.245 ±    3.080  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                 true               true                  N/A                   N/A               10  avgt   20      926.292 ±    3.346  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.merge                         N/A                 true               true                  N/A                   N/A              100  avgt   20     9384.697 ±  162.079  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.operatorStatsAdd                2                  N/A                N/A                  N/A                 false              N/A  avgt   20     4879.238 ±    5.494  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.operatorStatsAdd                2                  N/A                N/A                  N/A                  true              N/A  avgt   20     5559.705 ±   40.382  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.operatorStatsAdd               10                  N/A                N/A                  N/A                 false              N/A  avgt   20    92375.376 ±  233.255  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.operatorStatsAdd               10                  N/A                N/A                  N/A                  true              N/A  avgt   20    94436.608 ±  422.369  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.operatorStatsAdd              100                  N/A                N/A                  N/A                 false              N/A  avgt   20  1416322.264 ± 9655.329  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.operatorStatsAdd              100                  N/A                N/A                  N/A                  true              N/A  avgt   20  1421749.822 ± 7786.944  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.record                        N/A                  N/A                N/A                  100                   N/A              N/A  avgt   20       34.609 ±    0.020  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.record                        N/A                  N/A                N/A                  500                   N/A              N/A  avgt   20       34.709 ±    0.048  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.record                        N/A                  N/A                N/A                 1000                   N/A              N/A  avgt   20       34.755 ±    0.090  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.record                        N/A                  N/A                N/A                 2000                   N/A              N/A  avgt   20       34.771 ±    0.077  ns/op
BenchmarkResourceUsageTimeSeriesRecorder.record                        N/A                  N/A                N/A                32000                   N/A              N/A  avgt   20       34.678 ±    0.030  ns/op

The operator metric serializes to JSON like this:

        "CPU and scheduled time usage over time" : {
          "startTimeEpochSeconds" : 1775136871,
          "bucketWidthSeconds" : 1,
          "cpuNanosBuckets" : [ 52000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 151000, 489000, 179000, 17051000 ],
          "wallNanosBuckets" : [ 57208, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 144955, 493628, 177336, 123101335 ]
        }
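
A consumer of this metric could map bucket indexes back to timestamps and compare CPU time against wall time per bucket. The sketch below assumes that bucket i covers the interval starting at startTimeEpochSeconds + i * bucketWidthSeconds; the class `TimelineBuckets` and its methods are hypothetical helpers, not part of this PR.

```java
// Hypothetical helper for reading the serialized metric; names are illustrative.
final class TimelineBuckets
{
    private TimelineBuckets() {}

    // Assumption: bucket `index` starts at start + index * width.
    static long bucketStartEpochSeconds(long startTimeEpochSeconds, long bucketWidthSeconds, int index)
    {
        return startTimeEpochSeconds + index * bucketWidthSeconds;
    }

    // Ratio of CPU time to wall time in one bucket; 0.0 for an empty bucket.
    // The ratio is not clamped, so timer granularity can push it slightly above 1.
    static double cpuToWallRatio(long[] cpuNanosBuckets, long[] wallNanosBuckets, int index)
    {
        long wall = wallNanosBuckets[index];
        return wall == 0 ? 0.0 : (double) cpuNanosBuckets[index] / wall;
    }
}
```

With the sample above, the last bucket (index 16) would start at epoch second 1775136887 and have a CPU-to-wall ratio of roughly 0.14 under this assumption.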

It enables visualizations like this:
image

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Add "CPU and scheduled time usage over time" metric to the operator stats.

cla-bot bot added the cla-signed label Apr 7, 2026
lukasz-stec force-pushed the ls/2604/01-resource-usage-timeline branch 3 times, most recently from 67c610a to 2951467, April 8, 2026 09:00
lukasz-stec requested review from losipiuk and wendigo April 8, 2026 11:04
lukasz-stec marked this pull request as ready for review April 8, 2026 11:04
lukasz-stec force-pushed the ls/2604/01-resource-usage-timeline branch from 2951467 to 3690351, April 8, 2026 13:49