@MartinquaXD
Contributor

Description

Plotting all (or at least the vast majority) of the time lost just running the auction (i.e. everything besides actually computing solutions) is extremely important for guiding our optimization efforts.
We already have some metrics for that, but since those are histograms we have a few issues:

  1. The granularity of a histogram depends on the buckets we define. The necessary granularity can vary a lot depending on the task, so reusing the same metric for multiple sources of overhead means we either have to introduce a TON of buckets or maintain multiple histograms (one for each source of overhead).
  2. AFAIK histograms can't be merged into one nice plot that visualizes all the overhead at once. Instead you basically have to look at each histogram individually and mentally piece everything together.
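
To illustrate the first issue, here is a minimal sketch of how a shared set of buckets hides detail. The bucket bounds and phase durations are made up for the example and are not the values used by the existing metrics.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds, shared by every phase.
BUCKETS = [0.1, 0.5, 1.0, 5.0]


def bucket_for(seconds: float) -> float:
    """Return the upper bound of the smallest bucket that contains `seconds`."""
    index = bisect_left(BUCKETS, seconds)
    return BUCKETS[index] if index < len(BUCKETS) else float("inf")


# Made-up measurements for a fast and a slow source of overhead.
fast_phase = [0.002, 0.04, 0.09]
slow_phase = [1.2, 3.0, 4.8]

# Every fast measurement lands in the 0.1s bucket and every slow one in the
# 5.0s bucket, so changes within each phase are invisible in the histogram.
fast_buckets = {bucket_for(s) for s in fast_phase}
slow_buckets = {bucket_for(s) for s in slow_phase}
print(fast_buckets, slow_buckets)
```

Resolving both phases with one histogram would require fine buckets across several orders of magnitude, which is the bucket explosion described above.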

Changes

This PR addresses both issues by measuring the overhead using 2 counters: one for the total time spent in each phase and one for the number of measurements taken.
Using gauges for this would have been a bit easier, but a gauge only reports the exact value stored at the moment Prometheus scrapes the metrics. Since the runtime of the individual sources of overhead can vary quite a bit from run to run, there is a chance that gauges misrepresent the metrics.
With the 2-counter approach we can at least always compute averages for all sources of overhead, which should hopefully give us better data.
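
The idea can be sketched as follows. This is a plain-Python simulation under stated assumptions, not the code in this PR: the class, field names, and durations are hypothetical, and a scrape is modelled as a snapshot of the counters.

```python
from dataclasses import dataclass


@dataclass
class OverheadCounters:
    """Two monotonically increasing counters for one phase (hypothetical names)."""

    total_seconds: float = 0.0  # summed duration of all measurements
    measurements: int = 0  # number of measurements taken

    def observe(self, seconds: float) -> None:
        self.total_seconds += seconds
        self.measurements += 1


def average_between(before: OverheadCounters, after: OverheadCounters) -> float:
    """Average duration of the measurements recorded between two scrapes."""
    count = after.measurements - before.measurements
    if count == 0:
        return 0.0
    return (after.total_seconds - before.total_seconds) / count


counters = OverheadCounters()
gauge = 0.0  # a gauge only remembers the last value written to it

scrape_before = OverheadCounters(counters.total_seconds, counters.measurements)
for duration in [0.25, 0.5, 2.0, 0.25]:  # made-up runtimes in seconds
    counters.observe(duration)
    gauge = duration
scrape_after = OverheadCounters(counters.total_seconds, counters.measurements)

average = average_between(scrape_before, scrape_after)
print(f"gauge at scrape time: {gauge}s, counter-based average: {average}s")
```

The gauge reports only the last run and misses the 2.0s outlier, while the counter pair still yields the average over the whole scrape interval. In PromQL the same average would typically be expressed as the rate of the duration counter divided by the rate of the measurement counter, e.g. `rate(overhead_seconds_total[5m]) / rate(overhead_measurements_total[5m])`, where both metric names are placeholders rather than the ones this PR introduces.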

As we continue to reduce this overhead it might make sense to break down some of these phases a bit more, but I think this is a good starting point. Note that a lot of the plotted phases look insignificant in my screenshot, but only because the data comes from the playground, which basically does nothing. From my previous efforts to optimize performance I know that many of these phases take a surprising amount of time.

How to test

I used #3752 to build the new dashboard in the playground and verify that things work as I intend.

As you can see, that dashboard makes it a lot easier to get a sense of ALL the auction overhead at once and how much each phase contributes to the total overhead.
[Image: Screenshot 2025-10-09 at 06 32 57]

@MartinquaXD MartinquaXD requested a review from a team as a code owner October 9, 2025 06:36
Contributor

@squadgazzz squadgazzz left a comment

LGTM. And the heatmap metrics will be dropped eventually, right?

@MartinquaXD
Contributor Author

> And the heatmap metrics will be dropped eventually, right?

Yes. I think the new approach should be strictly more useful for our purposes. But I wanted to see how it looks with actual data before deleting our current metrics.

@MartinquaXD MartinquaXD merged commit e7877c4 into main Oct 9, 2025
17 checks passed
@MartinquaXD MartinquaXD deleted the new-overhead-metrics branch October 9, 2025 15:58
@github-actions github-actions bot locked and limited conversation to collaborators Oct 9, 2025