Commit 046f50e
authored
ci(cpp): widen memory_per_step_growth_kb threshold to 75% (#874)
## Summary
Bump the regression threshold for `memory_per_step_growth_kb` from the
generic 15% to 75%, so legitimate single-digit-KB feature growth doesn't
false-fail the C++ Benchmarks gate. This is the only metric at a
single-digit-KB scale, where ordinary feature work produces large
*percent* swings on small *absolute* numbers.
## Root cause
`memory_per_step_growth_kb` measures KB of RSS growth per agent loop
step. Pre-VLM main was ~7.2 KB/step. PR #858 (VLM image support, merged
2026-04-24) added the `Image` type, content-block JSON parsing, and new
per-message storage — bumping the metric to ~11.4 KB/step. That's +4 KB
absolute, a perfectly reasonable cost for a major new feature, but it
reads as +58% on the percent scale, well past the 15% generic threshold.
The cached baseline never refreshed because the workflow only saves a
new baseline on a successful main push (`if: github.ref ==
'refs/heads/main' && github.event_name == 'push'`), and every main push
since #858 has either failed (this same gate) or been cancelled. So
**every PR that touches C++ keeps inheriting the stale pre-VLM
baseline** and failing on this one metric.
## Why a threshold change, not a baseline bump
A baseline bump alone is a one-shot fix — the next feature that adds
another KB-level allocation per message will hit the same wall. A wider
threshold for *just this metric* is the structural fix: it acknowledges
that on a metric whose absolute values are tiny, percent swings are
noisy. Other metrics (binary size, latency, peak memory in MB) keep
their tighter thresholds.
## Verification
Most recent failing run shows the new value within the new band:
```
loop_latency_median_us 1026.0 918.0 -10.5% 15% IMPROVED
memory_baseline_kb 9272.0 9332.0 +0.6% 15% OK
memory_peak_kb 9416.0 9560.0 +1.5% 15% OK
memory_per_step_growth_kb 7.2 11.4 +58.3% 15% FAIL <-- this one
```
With the 75% threshold the FAIL becomes OK; everything else is
unchanged.
## Test plan
- [ ] CI green on this PR (the same Windows benchmark job that's been
red on every main push since 2026-04-24)
- [ ] After merge, baseline auto-refreshes on the next main push, future
PRs no longer inherit the stale 7.2 KB baseline1 parent 7e54e72 commit 046f50e
1 file changed
Lines changed: 7 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
180 | 180 | | |
181 | 181 | | |
182 | 182 | | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
183 | 190 | | |
184 | 191 | | |
185 | 192 | | |
| |||
0 commit comments