Commit 046f50e

ci(cpp): widen memory_per_step_growth_kb threshold to 75% (#874)
## Summary

Bump the regression threshold for `memory_per_step_growth_kb` from the generic 15% to 75%, so legitimate single-digit-KB feature growth doesn't false-fail the C++ Benchmarks gate. This is the only metric at a single-digit-KB scale, where ordinary feature work produces large *percent* swings on small *absolute* numbers.

## Root cause

`memory_per_step_growth_kb` measures KB of RSS growth per agent loop step. Pre-VLM main was ~7.2 KB/step. PR #858 (VLM image support, merged 2026-04-24) added the `Image` type, content-block JSON parsing, and new per-message storage, bumping the metric to ~11.4 KB/step. That's +4 KB absolute, a perfectly reasonable cost for a major new feature, but it reads as +58% on the percent scale, well past the 15% generic threshold.

The cached baseline never refreshed because the workflow only saves a new baseline on a successful main push (`if: github.ref == 'refs/heads/main' && github.event_name == 'push'`), and every main push since #858 has either failed (this same gate) or been cancelled. So **every PR that touches C++ keeps inheriting the stale pre-VLM baseline** and failing on this one metric.

## Why a threshold change, not a baseline bump

A baseline bump alone is a one-shot fix: the next feature that adds another KB-level allocation per message will hit the same wall. A wider threshold for *just this metric* is the structural fix: it acknowledges that on a metric whose absolute values are tiny, percent swings are noisy. Other metrics (binary size, latency, peak memory in MB) keep their tighter thresholds.

## Verification

Most recent failing run shows the new value within the new band:

```
loop_latency_median_us       1026.0   918.0   -10.5%   15%   IMPROVED
memory_baseline_kb           9272.0  9332.0    +0.6%   15%   OK
memory_peak_kb               9416.0  9560.0    +1.5%   15%   OK
memory_per_step_growth_kb       7.2    11.4   +58.3%   15%   FAIL   <-- this one
```

With the 75% threshold the FAIL becomes OK; everything else is unchanged.

## Test plan

- [ ] CI green on this PR (the same Windows benchmark job that's been red on every main push since 2026-04-24)
- [ ] After merge, the baseline auto-refreshes on the next main push, so future PRs no longer inherit the stale 7.2 KB baseline

1 parent 7e54e72 commit 046f50e

1 file changed

Lines changed: 7 additions & 0 deletions

cpp/benchmarks/bench_utils.h

@@ -180,6 +180,13 @@ inline double thresholdForMetric(const std::string& name) {
     if (name.find("binary_size") != std::string::npos) {
         return 10.0;
     }
+    // Per-step growth is a single-digit-KB metric, so a single feature addition
+    // (e.g. VLM image support adding ~4 KB/step for new content-block storage)
+    // reads as a 50%+ swing. Use a wider band so legitimate feature additions
+    // don't fail the gate while still catching true 2x-style regressions.
+    if (name == "memory_per_step_growth_kb") {
+        return 75.0;
+    }
     // All other metrics: 15% threshold
     return 15.0;
 }
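
To make the effect of the new band concrete, here is a minimal, self-contained sketch of how a per-metric threshold like this might be applied to a baseline/current pair. Only the three threshold values are taken from the diff above. `percentChange` and `exceedsThreshold` are hypothetical helpers written for illustration, and the comparison rule (fail when the percent increase is strictly greater than the threshold) is an assumption, not the actual logic in `bench_utils.h`. Any branches of the real `thresholdForMetric` that sit above the hunk are not shown in the diff and are omitted here.

```cpp
#include <cassert>
#include <string>

// Reduced copy of the thresholds visible in the hunk: 10% for binary size,
// 75% for per-step memory growth, 15% for everything else.
inline double thresholdForMetric(const std::string& name) {
    if (name.find("binary_size") != std::string::npos) {
        return 10.0;
    }
    if (name == "memory_per_step_growth_kb") {
        return 75.0;
    }
    return 15.0;
}

// Hypothetical helper: percent change of `current` relative to `baseline`.
inline double percentChange(double baseline, double current) {
    assert(baseline != 0.0);  // a zero baseline has no meaningful percent change
    return (current - baseline) / baseline * 100.0;
}

// Hypothetical gate check (assumed rule): a higher-is-worse metric fails
// when its percent increase is strictly greater than its threshold.
inline bool exceedsThreshold(const std::string& name, double baseline, double current) {
    return percentChange(baseline, current) > thresholdForMetric(name);
}
```

Under these assumptions, `exceedsThreshold("memory_per_step_growth_kb", 7.2, 11.4)` is false, because the roughly 58% increase sits inside the 75% band, while a doubling from 7.2 to 14.4 KB/step (a 100% increase) would still trip the gate, in line with the "true 2x-style regressions" mentioned in the code comment.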
