More precise metrics for measuring auction overhead #3754
Merged
Description
Plotting all (or at least the vast majority) of the time lost just running the auction (i.e. everything besides actually computing solutions) is extremely important for guiding our optimization efforts.
We already have some metrics for that but since those are histograms we have a few issues:
Changes
This PR addresses both issues by measuring the overhead using 2 counters: one for the total time spent in each phase and one for the number of measurements we took.
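The two-counter idea can be sketched as follows. This is only an illustration in plain Python, not the code from this PR: the class name `OverheadCounters`, its methods, and the phase names are hypothetical, and a real setup would use the counters of a Prometheus client library instead of dictionaries.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OverheadCounters:
    """Hypothetical stand-in for two monotonically increasing counters per phase."""

    def __init__(self):
        # Counter 1: total time (in seconds) spent in each phase.
        self.total_seconds = defaultdict(float)
        # Counter 2: number of measurements recorded for each phase.
        self.measurements = defaultdict(int)

    def observe(self, phase, seconds):
        # Both counters only ever increase, so no observation is lost
        # between two scrapes.
        self.total_seconds[phase] += seconds
        self.measurements[phase] += 1

    @contextmanager
    def measure(self, phase):
        # Time the wrapped block and record it, even if the block raises.
        start = time.monotonic()
        try:
            yield
        finally:
            self.observe(phase, time.monotonic() - start)
```

A phase would then be wrapped as `with counters.measure("some_phase"): ...`, where `"some_phase"` is a placeholder label.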
Using gauges for this would have been a bit easier, but a gauge only reports the exact value stored at the moment Prometheus scrapes the metrics. Since the runtime of the individual sources of overhead can vary quite a bit from run to run, there is a chance that gauges misrepresent the metrics.
With the 2-counter approach we can at least always compute averages for all sources of overhead, which should hopefully give us better data.
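To illustrate why the averages are more robust than a gauge, here is a small sketch with made-up durations. The helper `average_between_scrapes` and the numbers are assumptions for illustration only; they are not taken from this PR or from real measurements.

```python
def average_between_scrapes(earlier, later):
    """Average duration per measurement between two scrapes.

    `earlier` and `later` are (total_seconds, measurement_count) snapshots
    of the two counters. Returns None if nothing was measured in between.
    """
    delta_total = later[0] - earlier[0]
    delta_count = later[1] - earlier[1]
    if delta_count == 0:
        return None
    return delta_total / delta_count


# Made-up durations (seconds) of one phase across four runs between two scrapes.
durations = [0.25, 0.25, 0.25, 2.25]

# A gauge only holds the value that was stored last before the scrape.
gauge_at_scrape = durations[-1]

# The two counters accumulate every run.
before = (0.0, 0)
after = (sum(durations), len(durations))
counter_average = average_between_scrapes(before, after)

print(gauge_at_scrape)   # 2.25, the outlier alone
print(counter_average)   # 0.75, the average over all four runs
```

In PromQL the same ratio is usually written as `rate(<total>[5m]) / rate(<count>[5m])`, with `<total>` and `<count>` standing in for whatever the two counters are actually called.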
As we continue to reduce this overhead, it might make sense to break down some of these phases further, but I think this is a good starting point. Note that many of the plotted phases look insignificant in my screenshot, but only because the data comes from the playground, which basically does nothing. From my previous efforts to optimize performance I know that many of these phases take a surprising amount of time.
How to test
I used #3752 to build the planned dashboard in the playground and verify that things work as I intend.
As you can see, that dashboard makes it a lot easier to get a sense of ALL the auction overhead at once and of how much each phase contributes to the total.
