
add gha resource utilization summary dashboard #4047

Open
isegall-da wants to merge 3 commits into main from isegall/gha-summary

Conversation

@isegall-da
Contributor

isegall-da commented Feb 19, 2026

I started collecting this table manually, then realized there must be a better way:

Screenshot from 2026-02-19 09-35-48

Seems to me like the conclusion is there's not much we need to change right now. So: fixes https://github.com/DACH-NY/canton-network-internal/issues/3433

@martinflorian-da @nicu-da do you agree?

Pull Request Checklist

Cluster Testing

  • If a cluster test is required, comment /cluster_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
  • If a hard-migration test is required (from the latest release), comment /hdm_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.

PR Guidelines

  • Include any change that might be observable by our partners or affect their deployment in the release notes.
  • Specify fixed issues with Fixes #n, and mention issues worked on using #n
  • Include a screenshot for frontend-related PRs - see README or use your favorite screenshot tool

Merge Guidelines

  • Make the git commit message look sensible when squash-merging on GitHub (most likely: just copy your PR description).

Signed-off-by: Itai Segall <itai.segall@digitalasset.com>
Signed-off-by: Itai Segall <itai.segall@digitalasset.com>
Signed-off-by: Itai Segall <itai.segall@digitalasset.com>
@nicu-da
Contributor

nicu-da commented Feb 19, 2026

I am questioning the data we have there a bit; when I checked Grafana previously, it was not close to what we have in that table.
And checking Grafana now (https://grafana.splice.network.canton.global/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=default&var-cluster=&var-namespace=gha-runners&refresh=10s), I don't see any pod that uses more than 5 CPUs, yet that table reports DR as using 10 CPUs.
Also, for example, the x-large runners in Grafana never used more than 32 GB of memory, yet we request 52 GB for them and the table shows them as using 40 GB.

@isegall-da
Contributor Author

I am questioning the data we have there a bit; when I checked Grafana previously, it was not close to what we have in that table. And checking Grafana now (https://grafana.splice.network.canton.global/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=default&var-cluster=&var-namespace=gha-runners&refresh=10s), I don't see any pod that uses more than 5 CPUs, yet that table reports DR as using 10 CPUs. Also, for example, the x-large runners in Grafana never used more than 32 GB of memory, yet we request 52 GB for them and the table shows them as using 40 GB.

Hmmm... The table is definitely consistent with the other dashboard that we have for this: https://grafana.splice.network.canton.global/d/ae9lqwimiigw0d/resource-utilization-detailed?orgId=1&from=now-24h&to=now&timezone=UTC&var-test_suite=disaster-recovery

(where, e.g., DR definitely does use 10 CPUs)

But of course it could be that my whole infra for reporting the metrics there is completely off...

@nicu-da
Contributor

nicu-da commented Feb 19, 2026

Hmm, something's off for sure; we use the Grafana resource utilization dashboards quite a bit, so I wonder if something's off there.
The other thing that made me question things is the 103% memory use in the table, which in theory shouldn't be possible.
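A note on why a figure above 100% is not necessarily impossible, assuming the table reports usage relative to requests rather than limits: working-set memory can exceed the request whenever the limit is higher than the request, or when there is no limit at all. A rough PromQL sketch of such a ratio, using the standard cAdvisor and kube-state-metrics metric names (the dashboard's actual query may differ):

```
# Hypothetical usage-vs-requests memory ratio per pod; values above 100%
# are possible because usage is only capped by the limit, not the request.
100 *
  sum by (pod) (container_memory_working_set_bytes{namespace="gha-runners", container!=""})
/
  sum by (pod) (kube_pod_container_resource_requests{namespace="gha-runners", resource="memory"})
```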

@isegall-da
Contributor Author

Hmm, something's off for sure; we use the Grafana resource utilization dashboards quite a bit, so I wonder if something's off there. The other thing that made me question things is the 103% memory use in the table, which in theory shouldn't be possible.

So the difference is that the k8s dashboard uses sum_rate5m while ours uses sum_irate. IIUC, the former averages over time and thus smooths out burstiness. That's why for the same container we see this in our dashboard:
Screenshot from 2026-02-19 17-29-08

but this in the k8s one:
Screenshot from 2026-02-19 17-28-58

Now the question is which one we want to base decisions on. I think that for CI run times, as long as we have bursts that actually use that CPU, it's worth allocating it. WDYT?
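For reference, a minimal sketch of the two query styles being compared, assuming the standard cAdvisor metric name and a plain namespace selector (the dashboards' actual selectors and recording rules may differ):

```
# Smoothed view (roughly what the k8s dashboard shows): CPU usage averaged
# over a 5-minute window, which flattens short bursts.
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="gha-runners", container!=""}[5m]))

# Bursty view (roughly what the summary dashboard shows): irate uses only
# the last two samples in the window, so short spikes show up near their peak.
sum by (pod) (irate(container_cpu_usage_seconds_total{namespace="gha-runners", container!=""}[5m]))
```

The two converge for steady load and diverge exactly when usage is bursty, which matches the two screenshots above.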

@nicu-da
Contributor

nicu-da commented Feb 20, 2026

I think it makes sense then; it's also something to keep in mind when checking resource usage in the future.
I do wonder if it might make sense to lower requests to account for the average usage, though, and IMO just not have limits.
What might also be worth looking at, accounting for bursts, is overall node usage. The reason I created the original issue was that, looking at how many resources we were using on one node, it seemed like we were wasting lots of resources.
We have fairly large requests, we can usually fit 1 or at most 2 runners on a single node, and it seems that leaves a lot of resources unused most of the time.
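If node packing is the concern, one way to quantify it is to compare the CPU requested by runner pods on each node against that node's allocatable CPU. A rough sketch, assuming kube-state-metrics is available and that the runners live in the gha-runners namespace:

```
# Hypothetical per-node packing view: share of each node's allocatable CPU
# taken up by requests from the runner namespace. A high value here combined
# with low actual usage means the node is reserved but mostly idle between bursts.
100 *
  sum by (node) (kube_pod_container_resource_requests{namespace="gha-runners", resource="cpu"})
/
  sum by (node) (kube_node_status_allocatable{resource="cpu"})
```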

@isegall-da
Contributor Author

just not have limits

IDK, don't we have enough noise as it is without CI runs starting to interfere with each other's resources?
