
Commit e7877c4

More precise metrics for measuring auction overhead (#3754)
# Description

Plotting the entire (or at least the vast majority of the) time lost just running the auction (i.e. everything besides actually computing solutions) is extremely important for guiding our optimization efforts. We already have some metrics for that, but since those are histograms we have a few issues:

1. The granularity of histograms depends on the buckets we define. The necessary granularity can vary a lot depending on the task, so reusing the same metric for multiple sources of overhead means we either have to introduce a TON of buckets or multiple histograms (one for each source of overhead).
2. AFAIK histograms can't be merged into one nice plot that visualizes all the overhead at once. Instead you basically have to look at each histogram individually and mentally piece everything together.

# Changes

This PR addresses both issues by measuring the overhead with 2 counters: one for the total time spent in each phase and one for how many measurements we took. Using gauges for this would have been a bit easier, but gauges only plot the exact value stored at the moment Prometheus scrapes the metrics. Since the runtime of the individual sources of overhead can vary quite a bit from run to run, there is a chance that gauges misrepresent the data. With the 2-counter approach we can always compute averages for all sources of overhead, which should hopefully give us better data.

As we continue to reduce this overhead it might make sense to break some of these phases down further, but I think this is a good starting point. Note that a lot of the plotted phases look insignificant in my screenshot, but only because the data comes from the playground, which basically does nothing. From my previous efforts to optimize performance I know that many of these phases take a surprising amount of time.

## How to test

I used #3752 to build the new dashboard in the playground and verify that things work as intended. As you can see, that dashboard makes it a lot easier to get a sense of ALL the auction overhead at once and of how much each phase contributes to the total.

<img width="1247" height="639" alt="Screenshot 2025-10-09 at 06 32 57" src="https://github.com/user-attachments/assets/74196838-74fc-4188-a5b9-fd8775eb5d1d" />
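To illustrate the 2-counter idea, here is a minimal, self-contained Rust sketch. It is not the code from this PR: `OverheadMetric` and its fields are hypothetical stand-ins for the labeled Prometheus counters `auction_overhead_time` / `auction_overhead_count` added below. It shows why an average per phase can always be recovered from the pair of counters:

```rust
// Minimal sketch of the 2-counter measurement idea (hypothetical stand-in types,
// not the actual Prometheus-backed implementation from this PR).
use std::time::Instant;

#[derive(Default)]
struct OverheadMetric {
    total_time_secs: f64, // total time spent in the phase (counter, only grows)
    count: u64,           // how many measurements were taken (counter, only grows)
}

impl OverheadMetric {
    fn measure(&mut self, start: Instant) {
        self.total_time_secs += start.elapsed().as_secs_f64();
        self.count += 1;
    }

    // A dashboard can derive the same value from the scraped counters,
    // e.g. as the increase of the time counter divided by the increase of the
    // count counter over some window.
    fn average_secs(&self) -> f64 {
        self.total_time_secs / self.count.max(1) as f64
    }
}

fn main() {
    let mut metric = OverheadMetric::default();
    for _ in 0..3 {
        let start = Instant::now();
        // ... work of one overhead phase, e.g. "serialize_request" ...
        metric.measure(start);
    }
    println!("average overhead: {:.6}s", metric.average_secs());
}
```

Unlike a gauge, both values only ever increase, so the derived average does not depend on when Prometheus happens to scrape.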
1 parent c1fc970 commit e7877c4

File tree

10 files changed: +95 -14 lines changed

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

crates/autopilot/src/infra/persistence/mod.rs

Lines changed: 2 additions & 0 deletions
@@ -72,6 +72,8 @@ impl Persistence {
         &self,
         auction: &domain::RawAuctionData,
     ) -> Result<domain::auction::Id, DatabaseError> {
+        let _timer = observe::metrics::metrics()
+            .on_auction_overhead_start("autopilot", "replace_auction_in_db");
         let auction = dto::auction::from_domain(auction.clone());
         self.postgres
             .replace_current_auction(&auction)

crates/autopilot/src/infra/solvers/dto/solve.rs

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,8 @@ impl Request {
         trusted_tokens: &HashSet<H160>,
         time_limit: Duration,
     ) -> Self {
+        let _timer =
+            observe::metrics::metrics().on_auction_overhead_start("autopilot", "serialize_request");
         let helper = RequestHelper {
             id: auction.id,
             orders: auction

crates/autopilot/src/maintenance.rs

Lines changed: 5 additions & 3 deletions
@@ -20,7 +20,7 @@ use {
         core::{AtomicU64, GenericGauge},
     },
     shared::{event_handling::AlloyEventRetriever, maintenance::Maintaining},
-    std::{future::Future, sync::Arc},
+    std::{future::Future, sync::Arc, time::Instant},
     tokio::sync::Mutex,
 };

@@ -64,7 +64,7 @@ impl Maintenance {
             return;
         }

-        let start = std::time::Instant::now();
+        let start = Instant::now();
         if let Err(err) = self.update_inner().await {
             tracing::warn!(?err, block = new_block.number, "failed to run maintenance");
             metrics().updates.with_label_values(&["error"]).inc();
@@ -82,7 +82,8 @@ impl Maintenance {
     }

     async fn update_inner(&self) -> Result<()> {
-        // All these can run independently of each other.
+        let _timer =
+            observe::metrics::metrics().on_auction_overhead_start("autopilot", "maintenance_total");
         tokio::try_join!(
             Self::timed_future(
                 "settlement_indexer",
@@ -118,6 +119,7 @@
             .maintenance_stage_time
             .with_label_values(&[label])
             .start_timer();
+        let _timer2 = observe::metrics::metrics().on_auction_overhead_start("autopilot", label);
         fut.await
     }

crates/autopilot/src/solvable_orders.rs

Lines changed: 5 additions & 2 deletions
@@ -33,10 +33,10 @@ use {
         collections::{BTreeMap, HashMap, HashSet, btree_map::Entry},
         future::Future,
         sync::Arc,
-        time::Duration,
+        time::{Duration, Instant},
     },
     strum::VariantNames,
-    tokio::{sync::Mutex, time::Instant},
+    tokio::sync::Mutex,
 };

 #[derive(prometheus_metric_storage::MetricStorage)]
@@ -166,6 +166,9 @@ impl SolvableOrdersCache {
     pub async fn update(&self, block: u64, store_events: bool) -> Result<()> {
         let start = Instant::now();

+        let _timer = observe::metrics::metrics()
+            .on_auction_overhead_start("autopilot", "update_solvabe_orders");
+
         let db_solvable_orders = self.get_solvable_orders().await?;
         tracing::trace!("fetched solvable orders from db");

crates/driver/src/domain/competition/mod.rs

Lines changed: 4 additions & 0 deletions
@@ -115,6 +115,8 @@ impl Competition {
     /// Solve an auction as part of this competition.
     pub async fn solve(&self, auction: Arc<String>) -> Result<Option<Solved>, Error> {
         let start = Instant::now();
+        let timer = ::observe::metrics::metrics()
+            .on_auction_overhead_start("driver", "pre_processing_total");

         let tasks = self
             .fetcher
@@ -179,6 +181,7 @@
             .auction_preprocessing
             .with_label_values(&["total"])
             .observe(elapsed.as_secs_f64());
+        drop(timer);
         tracing::debug!(?elapsed, "auction task execution time");

         let auction = &auction;
@@ -520,6 +523,7 @@
     {
         task::spawn_blocking(move || {
             let _timer = metrics::get().processing_stage_timer(stage);
+            let _timer2 = ::observe::metrics::metrics().on_auction_overhead_start("driver", stage);
             f()
         })
         .await

crates/driver/src/domain/competition/pre_processing.rs

Lines changed: 12 additions & 0 deletions
@@ -206,6 +206,8 @@ impl Utilities {
     async fn parse_request(&self, solve_request: Arc<String>) -> Result<Arc<Auction>> {
         let auction_dto: SolveRequest = {
             let _timer = metrics::get().processing_stage_timer("parse_dto");
+            let _timer2 =
+                observe::metrics::metrics().on_auction_overhead_start("driver", "parse_dto");
             // deserialization takes tens of milliseconds so run it on a blocking task
             tokio::task::spawn_blocking(move || {
                 serde_json::from_str(&solve_request).context("could not parse solve request")
@@ -219,6 +221,8 @@

         let auction_domain = {
             let _timer = metrics::get().processing_stage_timer("convert_to_domain");
+            let _timer2 = observe::metrics::metrics()
+                .on_auction_overhead_start("driver", "convert_to_domain");
             let app_data = self
                 .app_data_retriever
                 .as_ref()
@@ -237,6 +241,8 @@
     /// Fetches the tradable balance for every order owner.
     async fn fetch_balances(self: Arc<Self>, auction: Arc<Auction>) -> Arc<Balances> {
         let _timer = metrics::get().processing_stage_timer("fetch_balances");
+        let _timer2 =
+            observe::metrics::metrics().on_auction_overhead_start("driver", "fetch_balances");

         // Collect trader/token/source/interaction tuples for fetching available
         // balances. Note that we are pessimistic here, if a trader is selling
@@ -330,6 +336,8 @@
         };

         let _timer = metrics::get().processing_stage_timer("fetch_app_data");
+        let _timer2 =
+            observe::metrics::metrics().on_auction_overhead_start("driver", "fetch_app_data");

         let app_data = join_all(
             auction
@@ -367,6 +375,8 @@

     async fn cow_amm_orders(self: Arc<Self>, auction: Arc<Auction>) -> Arc<Vec<Order>> {
         let _timer = metrics::get().processing_stage_timer("cow_amm_orders");
+        let _timer2 =
+            observe::metrics::metrics().on_auction_overhead_start("driver", "cow_amm_orders");
         let cow_amms = self.eth.contracts().cow_amm_registry().amms().await;
         let domain_separator = self.eth.contracts().settlement_domain_separator();
         let domain_separator = model::DomainSeparator(domain_separator.0);
@@ -487,6 +497,8 @@
         auction: Arc<Auction>,
     ) -> Arc<Vec<liquidity::Liquidity>> {
         let _timer = metrics::get().processing_stage_timer("fetch_liquidity");
+        let _timer2 =
+            observe::metrics::metrics().on_auction_overhead_start("driver", "fetch_liquidity");
         let pairs = auction.liquidity_pairs();
         Arc::new(
             self.liquidity_fetcher

crates/driver/src/infra/solver/mod.rs

Lines changed: 17 additions & 8 deletions
@@ -25,7 +25,10 @@ use {
     num::BigRational,
     observe::tracing::tracing_headers,
     reqwest::header::HeaderName,
-    std::{collections::HashMap, time::Duration},
+    std::{
+        collections::HashMap,
+        time::{Duration, Instant},
+    },
     tap::TapFallible,
     thiserror::Error,
     tracing::{Instrument, instrument},
@@ -233,8 +236,9 @@ impl Solver {
         auction: &Auction,
         liquidity: &[liquidity::Liquidity],
     ) -> Result<Vec<Solution>, Error> {
+        let start = Instant::now();
+
         let flashloan_hints = self.assemble_flashloan_hints(auction);
-        // Fetch the solutions from the solver.
         let weth = self.eth.contracts().weth_address();
         let auction_dto = dto::auction::new(
             auction,
@@ -246,12 +250,6 @@
             auction.deadline(self.timeouts()).solvers(),
         );

-        if let Some(id) = auction.id() {
-            // Only auctions with IDs are real auctions (/quote requests don't have an ID,
-            // and it makes no sense to store them)
-            self.persistence.archive_auction(id, &auction_dto);
-        }
-
         let body = {
             // pre-allocate a big enough buffer to avoid re-allocating memory
             // as the request gets serialized
@@ -261,6 +259,17 @@
             String::from_utf8(buffer).expect("serde_json only writes valid utf8")
         };

+        if let Some(id) = auction.id() {
+            // Only auctions with IDs are real auctions (/quote requests don't have an ID).
+            // Only for those it makes sense to archive them and measure the execution time.
+            self.persistence.archive_auction(id, &auction_dto);
+            ::observe::metrics::metrics().measure_auction_overhead(
+                start,
+                "driver",
+                "serialize_request",
+            );
+        }
+
         let url = shared::url::join(&self.config.endpoint, "solve");
         super::observe::solver_request(&url, &body);
         let timeout = match auction.deadline(self.timeouts()).solvers().remaining() {

crates/observe/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ opentelemetry_sdk = { workspace = true }
 pin-project-lite = { workspace = true }
 prometheus = { workspace = true }
 prometheus-metric-storage = { workspace = true }
+scopeguard = { workspace = true }
 serde = { workspace = true }
 serde_json = { workspace = true }
 time = { workspace = true, features = ["macros"] }

crates/observe/src/metrics.rs

Lines changed: 46 additions & 1 deletion
@@ -1,5 +1,8 @@
 use {
-    prometheus::Encoder,
+    prometheus::{
+        Encoder,
+        core::{AtomicF64, AtomicU64, GenericCounterVec},
+    },
     std::{
         collections::HashMap,
         convert::Infallible,
@@ -9,6 +12,7 @@ use {
             OnceLock,
             atomic::{AtomicBool, Ordering},
         },
+        time::Instant,
     },
     tokio::task::{self, JoinHandle},
     warp::{Filter, Rejection, Reply},
@@ -136,3 +140,44 @@ fn handle_readiness(
         }
     })
 }
+
+/// Metrics shared by potentially all processes.
+#[derive(prometheus_metric_storage::MetricStorage)]
+pub struct Metrics {
+    /// All the time losses we incur while arbitrating the auctions
+    #[metric(labels("component", "phase"))]
+    pub auction_overhead_time: GenericCounterVec<AtomicF64>,
+
+    /// How many measurements we did for each source of overhead.
+    #[metric(labels("component", "phase"))]
+    pub auction_overhead_count: GenericCounterVec<AtomicU64>,
+}
+
+impl Metrics {
+    /// Returns a struct that measures the overhead when it gets dropped.
+    #[must_use]
+    pub fn on_auction_overhead_start<'a, 'b, 'c>(
+        &'a self,
+        component: &'b str,
+        phase: &'c str,
+    ) -> impl Drop + use<'a, 'b, 'c> {
+        let start = std::time::Instant::now();
+        scopeguard::guard(start, move |start| {
+            self.measure_auction_overhead(start, component, phase);
+        })
+    }
+
+    pub fn measure_auction_overhead(&self, start: Instant, component: &str, phase: &str) {
+        self.auction_overhead_time
+            .with_label_values(&[component, phase])
+            .inc_by(start.elapsed().as_secs_f64());
+
+        self.auction_overhead_count
+            .with_label_values(&[component, phase])
+            .inc()
+    }
+}
+
+pub fn metrics() -> &'static Metrics {
+    Metrics::instance(get_storage_registry()).unwrap()
+}
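For reference, the call sites in this PR use the new helpers roughly like this. This is a sketch, assuming the `observe` crate is a dependency of the calling crate; `do_phase_work` and the `"example_phase"` label are hypothetical placeholders:

```rust
// Sketch of how the new observe helpers are used at the call sites above.
// `do_phase_work` is a hypothetical placeholder for the work of one phase.
fn do_phase_work() {}

fn scoped_measurement() {
    // The guard records the elapsed time for ("autopilot", "example_phase")
    // when it is dropped at the end of this scope.
    let _timer = observe::metrics::metrics()
        .on_auction_overhead_start("autopilot", "example_phase");
    do_phase_work();
}

fn manual_measurement() {
    // Manual variant, as in the solver request serialization above, where the
    // measurement should only be recorded under certain conditions.
    let start = std::time::Instant::now();
    do_phase_work();
    observe::metrics::metrics().measure_auction_overhead(start, "autopilot", "example_phase");
}
```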

0 commit comments
