ci(cpp): widen memory_per_step_growth_kb threshold to 75% (#874)

kovtcharov · web-flow · commit 046f50e011a1 · 2026-04-28T13:20:57.000Z
## Summary

Bump the regression threshold for `memory_per_step_growth_kb` from the generic 15% to 75%, so legitimate single-digit-KB feature growth doesn't false-fail the C++ Benchmarks gate. This is the only metric at a single-digit-KB scale, where ordinary feature work produces large *percent* swings on small *absolute* numbers.

## Root cause

`memory_per_step_growth_kb` measures KB of RSS growth per agent loop step. Pre-VLM main was ~7.2 KB/step. PR #858 (VLM image support, merged 2026-04-24) added the `Image` type, content-block JSON parsing, and new per-message storage, bumping the metric to ~11.4 KB/step. That's +4 KB absolute, a perfectly reasonable cost for a major new feature, but it reads as +58% on the percent scale, well past the 15% generic threshold.

The cached baseline never refreshed because the workflow only saves a new baseline on a successful main push (`if: github.ref == 'refs/heads/main' && github.event_name == 'push'`), and every main push since #858 has either failed (this same gate) or been cancelled. So **every PR that touches C++ keeps inheriting the stale pre-VLM baseline** and failing on this one metric.

## Why a threshold change, not a baseline bump

A baseline bump alone is a one-shot fix: the next feature that adds another KB-level allocation per message will hit the same wall. A wider threshold for *just this metric* is the structural fix: it acknowledges that on a metric whose absolute values are tiny, percent swings are noisy. Other metrics (binary size, latency, peak memory in MB) keep their tighter thresholds.

## Verification

Most recent failing run shows the new value within the new band:

```
loop_latency_median_us      1026.0   918.0   -10.5%   15%   IMPROVED
memory_baseline_kb          9272.0  9332.0    +0.6%   15%   OK
memory_peak_kb              9416.0  9560.0    +1.5%   15%   OK
memory_per_step_growth_kb      7.2    11.4   +58.3%   15%   FAIL   <-- this one
```

With the 75% threshold the FAIL becomes OK; everything else is unchanged.
## Test plan

- [ ] CI green on this PR (the same Windows benchmark job that's been red on every main push since 2026-04-24)
- [ ] After merge, baseline auto-refreshes on the next main push, so future PRs no longer inherit the stale 7.2 KB baseline
diff --git a/cpp/benchmarks/bench_utils.h b/cpp/benchmarks/bench_utils.h
@@ -180,6 +180,13 @@ inline double thresholdForMetric(const std::string& name) {
     if (name.find("binary_size") != std::string::npos) {
         return 10.0;
     }
+    // Per-step growth is a single-digit-KB metric, so a single feature addition
+    // (e.g. VLM image support adding ~4 KB/step for new content-block storage)
+    // reads as a 50%+ swing. Use a wider band so legitimate feature additions
+    // don't fail the gate while still catching true 2x-style regressions.
+    if (name == "memory_per_step_growth_kb") {
+        return 75.0;
+    }
     // All other metrics: 15% threshold
     return 15.0;
 }
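
To make the effect of the new branch concrete, here is a minimal sketch of how the threshold could interact with a percent-change check. It reproduces only the branches visible in the diff above, under the placeholder name `thresholdForMetricSketch`, since the real `thresholdForMetric` in `bench_utils.h` may contain further cases before line 180. The helpers `percentChange` and `failsGate` are hypothetical, and the rule "fail when the increase exceeds the threshold" is an assumption about how the gate compares values, not something the diff shows.

```cpp
#include <string>

// Sketch of the branches visible in the diff only; the real function may
// handle additional metric names before these checks.
inline double thresholdForMetricSketch(const std::string& name) {
    if (name.find("binary_size") != std::string::npos) {
        return 10.0;
    }
    // New in this change: wider band for the single-digit-KB metric.
    if (name == "memory_per_step_growth_kb") {
        return 75.0;
    }
    // All other metrics: 15% threshold
    return 15.0;
}

// Hypothetical helper: signed percent change from baseline to current.
// Assumes baseline > 0.
inline double percentChange(double baseline, double current) {
    return (current - baseline) / baseline * 100.0;
}

// Hypothetical helper: assumes the gate fails when the percent increase
// is strictly greater than the metric's threshold.
inline bool failsGate(const std::string& name, double baseline, double current) {
    return percentChange(baseline, current) > thresholdForMetricSketch(name);
}
```

Under these assumptions, `failsGate("memory_per_step_growth_kb", 7.2, 11.4)` is false (about +58.3% against a 75% band), while a doubling from 7.2 to 14.4 (+100%) would still fail, which matches the intent stated in the code comment of catching 2x-style regressions.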
