
add gha resource utilization summary dashboard #4047

Open
isegall-da wants to merge 3 commits into main from isegall/gha-summary

Conversation

@isegall-da
Contributor

isegall-da commented Feb 19, 2026

I started collecting this table manually, then realized there must be a better way:

Screenshot from 2026-02-19 09-35-48

Seems to me like the conclusion is there's not much we need to change right now. So: fixes https://github.com/DACH-NY/canton-network-internal/issues/3433

@martinflorian-da @nicu-da do you agree?

Pull Request Checklist

Cluster Testing

  • If a cluster test is required, comment /cluster_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
  • If a hard-migration test is required (from the latest release), comment /hdm_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.

PR Guidelines

  • Include any change that might be observable by our partners or affect their deployment in the release notes.
  • Specify fixed issues with Fixes #n, and mention issues worked on using #n
  • Include a screenshot for frontend-related PRs - see README or use your favorite screenshot tool

Merge Guidelines

  • Make the git commit message look sensible when squash-merging on GitHub (most likely: just copy your PR description).

Signed-off-by: Itai Segall <itai.segall@digitalasset.com>
Signed-off-by: Itai Segall <itai.segall@digitalasset.com>
Signed-off-by: Itai Segall <itai.segall@digitalasset.com>
@nicu-da
Contributor

nicu-da commented Feb 19, 2026

I am questioning the data we have there a bit; when I checked Grafana previously, it was not close to what we have in that table.
And checking Grafana now (https://grafana.splice.network.canton.global/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=default&var-cluster=&var-namespace=gha-runners&refresh=10s), I don't see any pod that uses more than 5 CPUs, yet that table reports DR as using 10 CPUs.
Also, for example, the x-large runners in Grafana never used more than 32 GB of memory, yet we request 52 GB for them and the table shows them as using 40 GB.

@isegall-da
Contributor Author

I am questioning the data we have there a bit; when I checked Grafana previously, it was not close to what we have in that table. And checking Grafana now (https://grafana.splice.network.canton.global/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=default&var-cluster=&var-namespace=gha-runners&refresh=10s), I don't see any pod that uses more than 5 CPUs, yet that table reports DR as using 10 CPUs. Also, for example, the x-large runners in Grafana never used more than 32 GB of memory, yet we request 52 GB for them and the table shows them as using 40 GB.

Hmmm... The table is definitely consistent with the other dashboard that we have for this: https://grafana.splice.network.canton.global/d/ae9lqwimiigw0d/resource-utilization-detailed?orgId=1&from=now-24h&to=now&timezone=UTC&var-test_suite=disaster-recovery

(where, e.g., DR definitely does use 10 CPUs)

But of course it could be that my whole infra for reporting the metrics there is completely off...

@nicu-da
Contributor

nicu-da commented Feb 19, 2026

Hmm, something's off for sure; we use the Grafana resource utilization dashboards quite a bit, so I wonder if something's off there.
The other thing that made me question things is the 103% memory use in the table, which in theory shouldn't be possible.
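A note on why a figure above 100% is not necessarily impossible, assuming the table reports usage relative to requests rather than limits: working-set memory can exceed the request whenever the limit is higher than the request, or when there is no limit at all. A rough PromQL sketch of such a ratio, using the standard cAdvisor and kube-state-metrics metric names (the dashboard's actual query may differ):

```
# Hypothetical usage-vs-requests memory ratio per pod; values above 100%
# are possible because usage is only capped by the limit, not the request.
100 *
  sum by (pod) (container_memory_working_set_bytes{namespace="gha-runners", container!=""})
/
  sum by (pod) (kube_pod_container_resource_requests{namespace="gha-runners", resource="memory"})
```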

@isegall-da
Contributor Author

Hmm, something's off for sure; we use the Grafana resource utilization dashboards quite a bit, so I wonder if something's off there. The other thing that made me question things is the 103% memory use in the table, which in theory shouldn't be possible.

So the difference is that the k8s dashboard uses sum_rate5m while ours uses sum_irate. IIUC, the former averages over time and thus smooths out burstiness. That's why for the same container we see this in our dashboard:
Screenshot from 2026-02-19 17-29-08

but this in the k8s one:
Screenshot from 2026-02-19 17-28-58

Now the question is which one we want to base decisions on. I think that for CI run times, as long as we have bursts that actually use that CPU, it's worth allocating it. WDYT?
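For reference, a minimal sketch of the two query styles being compared, assuming the standard cAdvisor metric name and a plain namespace selector (the dashboards' actual selectors and recording rules may differ):

```
# Smoothed view (roughly what the k8s dashboard shows): CPU usage averaged
# over a 5-minute window, which flattens short bursts.
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="gha-runners", container!=""}[5m]))

# Bursty view (roughly what the summary dashboard shows): irate uses only
# the last two samples in the window, so short spikes show up near their peak.
sum by (pod) (irate(container_cpu_usage_seconds_total{namespace="gha-runners", container!=""}[5m]))
```

The two converge for steady load and diverge exactly when usage is bursty, which matches the two screenshots above.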

@nicu-da
Contributor

nicu-da commented Feb 20, 2026

I think it makes sense then; it's also something to keep in mind when checking resource usage in the future.
I do wonder if it might make sense to lower requests to account for the average usage, though, and IMO just not have limits.
What might also be worth looking at, accounting for bursts, is overall node usage. The reason I created the original issue was that, looking at how many resources we were using on one node, it seemed like we were wasting lots of resources.
We have fairly large requests, we can usually fit 1 or at most 2 runners on a single node, and it seems that leaves a lot of resources unused most of the time.
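If node packing is the concern, one way to quantify it is to compare the CPU requested by runner pods on each node against that node's allocatable CPU. A rough sketch, assuming kube-state-metrics is available and that the runners live in the gha-runners namespace:

```
# Hypothetical per-node packing view: share of each node's allocatable CPU
# taken up by requests from the runner namespace. A high value here combined
# with low actual usage means the node is reserved but mostly idle between bursts.
100 *
  sum by (node) (kube_pod_container_resource_requests{namespace="gha-runners", resource="cpu"})
/
  sum by (node) (kube_node_status_allocatable{resource="cpu"})
```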

@isegall-da
Contributor Author

just not have limits

IDK, don't we have enough noise as it is without CI runs starting to interfere with each other's resources?
