More precise metrics for measuring auction overhead (#3754)
# Description
Plotting all (or at least the vast majority) of the time lost just
running the auction (i.e. everything besides actually computing
solutions) is extremely important for guiding our optimization efforts.
We already have some metrics for that but since those are histograms we
have a few issues:
1. The granularity of a histogram depends on the buckets we define. The
necessary granularity can vary a lot depending on the task, so reusing
the same metric for multiple sources of overhead means we either have to
introduce a TON of buckets or define multiple histograms (one for each
source of overhead).
2. AFAIK histograms can't be merged into 1 nice plot that visualizes all
the overhead at once. Instead you basically have to look at each
histogram individually and mentally piece everything together.
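
To illustrate the first issue, here is a minimal sketch of how a bucketed histogram loses resolution when phases with very different runtimes share one set of buckets. The bucket bounds and sample durations are hypothetical and are not taken from the actual metrics.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (an assumption for illustration).
BUCKETS = [0.01, 0.1, 1.0, 10.0]


def bucket_for(duration: float) -> float:
    """Return the upper bound of the smallest bucket that contains `duration`."""
    # bisect_left finds the first bound that is >= duration, which matches
    # the "less than or equal" semantics of Prometheus histogram buckets.
    index = bisect_left(BUCKETS, duration)
    return BUCKETS[index] if index < len(BUCKETS) else float("inf")


# Hypothetical samples for a fast phase (milliseconds) and a slow one (seconds).
fast_phase = [0.002, 0.004, 0.008]
slow_phase = [1.5, 4.0, 9.0]

# Every sample of a phase collapses into a single bucket, so the
# differences within each phase are invisible in the histogram.
fast_buckets = {bucket_for(d) for d in fast_phase}
slow_buckets = {bucket_for(d) for d in slow_phase}
print(fast_buckets, slow_buckets)
```

Distinguishing 2 ms from 8 ms and 1.5 s from 9 s with one shared metric would need many more bounds, which is the trade-off described in point 1.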
# Changes
This PR addresses both issues by measuring the overhead using 2
counters: one for the total time spent in each phase and one for the
number of measurements we took.
Using gauges for this would have been a bit easier, but gauges have the
issue that they only plot the exact value stored at the time when
Prometheus scrapes the metrics. Since the runtime of the individual
sources of overhead can vary quite a bit from run to run, there is a
chance that gauges misrepresent the metrics.
With the 2 counter approach we can at least always compute averages for
all sources of overhead which should hopefully give us better data.
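
A minimal sketch of the idea, written in plain Python rather than against the real metrics code: the class, the phase name `fetch_orders`, and the durations are all hypothetical. It shows how the average for a scrape interval falls out of the two counter deltas, while a gauge would only expose the last value written before the scrape.

```python
from collections import defaultdict


class OverheadCounters:
    """Two monotonically increasing counters per phase (hypothetical helper)."""

    def __init__(self):
        self.total_seconds = defaultdict(float)  # total time spent per phase
        self.measurements = defaultdict(int)     # number of measurements per phase

    def observe(self, phase: str, seconds: float) -> None:
        self.total_seconds[phase] += seconds
        self.measurements[phase] += 1

    def scrape(self, phase: str) -> tuple:
        """Return the current (total_seconds, measurements) pair, like a scrape would."""
        return (self.total_seconds[phase], self.measurements[phase])


def average_between(earlier: tuple, later: tuple):
    """Average duration of the measurements recorded between two scrapes."""
    delta_seconds = later[0] - earlier[0]
    delta_count = later[1] - earlier[1]
    return delta_seconds / delta_count if delta_count else None


counters = OverheadCounters()
gauge = None  # a gauge only remembers the most recent value

first_scrape = counters.scrape("fetch_orders")
for seconds in [0.5, 2.0, 0.5]:  # hypothetical runtimes within one scrape interval
    counters.observe("fetch_orders", seconds)
    gauge = seconds
second_scrape = counters.scrape("fetch_orders")

average = average_between(first_scrape, second_scrape)
print(average, gauge)
```

Here the counter deltas cover all three runs, including the slow 2.0 s one, whereas the gauge only reports the final 0.5 s run. In Prometheus the same division would typically be done in the dashboard query over the rates of the two counters.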
As we continue to reduce this overhead it might make sense to break down
some of these phases a bit more but I think this is a good starting
point. Note that a lot of plotted phases look insignificant in my
screenshot but only because the data comes from the playground which
basically does nothing. From my previous efforts to optimize performance
I know that many of these phases take a surprising amount of time.
## How to test
I used #3752 to build the new dashboard in the playground and verify
that things work as I intend.
As you can see that dashboard makes it a lot easier to get a sense of
ALL the auction overhead at once and how much each phase contributes to
the total overhead.
<img width="1247" height="639" alt="Screenshot 2025-10-09 at 06 32 57"
src="https://github.com/user-attachments/assets/74196838-74fc-4188-a5b9-fd8775eb5d1d"
/>