
ci(cpp): widen memory_per_step_growth_kb threshold to 75% #874

Merged
kovtcharov merged 1 commit into main from
kalin/cpp-bench-threshold-per-step-growth
Apr 28, 2026

Conversation

@kovtcharov
Collaborator

Summary

Bump the regression threshold for memory_per_step_growth_kb from the generic 15% to 75%, so legitimate single-digit-KB feature growth doesn't false-fail the C++ Benchmarks gate. This is the only metric at a single-digit-KB scale, where ordinary feature work produces large percent swings on small absolute numbers.
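A minimal sketch of the kind of per-metric regression gate this PR describes (the function and threshold table are illustrative, not the actual CI code; only the metric names and the 15%/75% bands come from the PR):

```python
# Per-metric regression thresholds, expressed as fractional growth.
# Metrics not listed fall back to the generic 15% band; the
# single-digit-KB metric gets the widened 75% band from this PR.
THRESHOLDS = {
    "memory_per_step_growth_kb": 0.75,
}
DEFAULT_THRESHOLD = 0.15

def regressed(metric: str, baseline: float, current: float) -> bool:
    """True when `current` exceeds `baseline` by more than the
    metric's allowed fractional growth."""
    limit = THRESHOLDS.get(metric, DEFAULT_THRESHOLD)
    growth = (current - baseline) / baseline
    return growth > limit

# Values from the failing run: 7.2 -> 11.4 KB/step is +58.3%,
# which fails the generic 15% band but passes the 75% band.
assert regressed("memory_per_step_growth_kb", 7.2, 11.4) is False
assert regressed("memory_peak_kb", 9416.0, 9560.0) is False  # +1.5%
```

The design point is that the override table is scoped to one metric, so binary size, latency, and MB-scale memory metrics keep the tighter default.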

Root cause

memory_per_step_growth_kb measures KB of RSS growth per agent loop step. Pre-VLM main was ~7.2 KB/step. PR #858 (VLM image support, merged 2026-04-24) added the Image type, content-block JSON parsing, and new per-message storage — bumping the metric to ~11.4 KB/step. That's +4 KB absolute, a perfectly reasonable cost for a major new feature, but it reads as +58% on the percent scale, well past the 15% generic threshold.
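The percent math behind the failure, using the values quoted above: a small absolute delta on a single-digit-KB baseline reads as a large relative swing.

```python
# Values from the PR: pre-VLM main vs. post-#858 per-step RSS growth.
baseline_kb = 7.2
current_kb = 11.4
growth_pct = (current_kb - baseline_kb) / baseline_kb * 100
print(f"{growth_pct:+.1f}%")  # -> +58.3%, vs. a 15% generic threshold
```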

The cached baseline never refreshed because the workflow only saves a new baseline on a successful main push (if: github.ref == 'refs/heads/main' && github.event_name == 'push'), and every main push since #858 has either failed (this same gate) or been cancelled. So every PR that touches C++ keeps inheriting the stale pre-VLM baseline and failing on this one metric.
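For context, a workflow step of the shape described above might look like this. Only the `if:` condition is quoted from the PR; the step name, cache action, paths, and key are illustrative guesses, not the repository's actual workflow:

```yaml
# Illustrative fragment: the baseline is only saved on a successful
# push to main, which is why a red main leaves the cache stale.
- name: Save benchmark baseline
  if: github.ref == 'refs/heads/main' && github.event_name == 'push'
  uses: actions/cache/save@v4
  with:
    path: bench-baseline.json
    key: cpp-bench-baseline-${{ github.sha }}
```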

Why a threshold change, not a baseline bump

A baseline bump alone is a one-shot fix — the next feature that adds another KB-level allocation per message will hit the same wall. A wider threshold for just this metric is the structural fix: it acknowledges that on a metric whose absolute values are tiny, percent swings are noisy. Other metrics (binary size, latency, peak memory in MB) keep their tighter thresholds.

Verification

Most recent failing run shows the new value within the new band:

metric                          baseline  current    delta  limit  status
loop_latency_median_us            1026.0    918.0   -10.5%    15%  IMPROVED
memory_baseline_kb                9272.0   9332.0    +0.6%    15%  OK
memory_peak_kb                    9416.0   9560.0    +1.5%    15%  OK
memory_per_step_growth_kb            7.2     11.4   +58.3%    15%  FAIL  <-- this one

With the 75% threshold the FAIL becomes OK; everything else is unchanged.

Test plan

  • CI green on this PR (the same Windows benchmark job that's been red on every main push since 2026-04-24)
  • After merge, the baseline auto-refreshes on the next main push, so future PRs no longer inherit the stale 7.2 KB baseline

Per-step growth is a single-digit-KB metric (~7 KB pre-VLM), so #858's
legitimate +4 KB/step from the new Image type, content-block JSON parsing,
and per-message storage shows up as +58% — well past the 15% generic
threshold but only a few KB in absolute terms.

Bump just this one metric to a 75% band so feature additions don't false-
fail the gate while still catching real 2x-scale regressions. Bumping
the cached baseline alone won't help long-term: any future feature that
adds another KB-level allocation per message will hit the same wall.

Currently failing every PR that runs C++ Benchmarks on Windows because
the cached main baseline (from 2026-04-20) predates #858 and main pushes
since then have either failed or been cancelled, so the baseline never
refreshed.
@kovtcharov kovtcharov added the bug, tests, and cpp labels Apr 26, 2026
@kovtcharov kovtcharov enabled auto-merge April 26, 2026 03:51
@kovtcharov kovtcharov self-assigned this Apr 26, 2026
@kovtcharov kovtcharov added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit 046f50e Apr 28, 2026
27 of 29 checks passed
@kovtcharov kovtcharov deleted the kalin/cpp-bench-threshold-per-step-growth branch April 28, 2026 13:21
