ci(cpp): widen memory_per_step_growth_kb threshold to 75% #874
Merged
kovtcharov merged 1 commit into main on Apr 28, 2026
Per-step growth is a single-digit-KB metric (~7 KB pre-VLM), so #858's legitimate +4 KB/step from the new Image type, content-block JSON parsing, and per-message storage shows up as +58%: well past the 15% generic threshold but only a few KB in absolute terms. Bump just this one metric to a 75% band so feature additions don't false-fail the gate while still catching real 2x-scale regressions.

Bumping the cached baseline alone won't help long-term: any future feature that adds another KB-level allocation per message will hit the same wall.

Currently failing every PR that runs C++ Benchmarks on Windows because the cached main baseline (from 2026-04-20) predates #858 and main pushes since then have either failed or been cancelled, so the baseline never refreshed.
itomek
approved these changes
Apr 28, 2026
Summary
Bump the regression threshold for memory_per_step_growth_kb from the generic 15% to 75%, so legitimate single-digit-KB feature growth doesn't false-fail the C++ Benchmarks gate. This is the only metric at a single-digit-KB scale, where ordinary feature work produces large percent swings on small absolute numbers.
Root cause
memory_per_step_growth_kb measures KB of RSS growth per agent loop step. Pre-VLM main was ~7.2 KB/step. PR #858 (VLM image support, merged 2026-04-24) added the Image type, content-block JSON parsing, and new per-message storage, bumping the metric to ~11.4 KB/step. That's +4 KB absolute, a perfectly reasonable cost for a major new feature, but it reads as +58% on the percent scale, well past the 15% generic threshold.
The cached baseline never refreshed because the workflow only saves a new baseline on a successful main push (if: github.ref == 'refs/heads/main' && github.event_name == 'push'), and every main push since #858 has either failed (this same gate) or been cancelled. So every PR that touches C++ keeps inheriting the stale pre-VLM baseline and failing on this one metric.
Why a threshold change, not a baseline bump
A baseline bump alone is a one-shot fix — the next feature that adds another KB-level allocation per message will hit the same wall. A wider threshold for just this metric is the structural fix: it acknowledges that on a metric whose absolute values are tiny, percent swings are noisy. Other metrics (binary size, latency, peak memory in MB) keep their tighter thresholds.
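For readers who want to see the shape of a per-metric override, here is a minimal sketch. It is not the repository's actual comparison script: the function names, the override table, and the other metric names used below are hypothetical, and the 7.2 and 11.4 KB/step figures are the approximate values quoted in this description.

```python
# Hypothetical sketch of a per-metric regression gate. Names are illustrative
# and are NOT taken from the repository's benchmark comparison script.

DEFAULT_THRESHOLD_PCT = 15.0

# Only the single-digit-KB metric gets a wider band; everything else
# falls back to the generic threshold.
THRESHOLD_OVERRIDES_PCT = {
    "memory_per_step_growth_kb": 75.0,
}


def threshold_for(metric: str) -> float:
    """Return the allowed regression (in percent) for a metric."""
    return THRESHOLD_OVERRIDES_PCT.get(metric, DEFAULT_THRESHOLD_PCT)


def percent_change(baseline: float, current: float) -> float:
    """Relative change of current versus baseline, in percent."""
    return (current - baseline) / baseline * 100.0


def check(metric: str, baseline: float, current: float) -> str:
    """Return "FAIL" if the metric regressed past its threshold, else "OK"."""
    if percent_change(baseline, current) > threshold_for(metric):
        return "FAIL"
    return "OK"


if __name__ == "__main__":
    # Approximate values from the description: ~7.2 -> ~11.4 KB/step.
    change = percent_change(7.2, 11.4)
    status = check("memory_per_step_growth_kb", 7.2, 11.4)
    print(f"memory_per_step_growth_kb: {change:+.0f}% -> {status}")
```

A lookup table with a default keeps the change local to one metric: any metric without an entry still gets the 15% gate.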
Verification
Most recent failing run shows the new value within the new band:
With the 75% threshold the FAIL becomes OK; everything else is unchanged.
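As a rough sanity check on the band itself, the sketch below computes the largest value each threshold would accept. It assumes the cached baseline is about 7.2 KB/step and the new value about 11.4 KB/step, the approximate figures from the root-cause section rather than exact numbers from the failing run. The helper name is hypothetical.

```python
# Hypothetical helper: the largest metric value a percent threshold accepts.
def allowed_upper_bound(baseline: float, threshold_pct: float) -> float:
    return baseline * (1.0 + threshold_pct / 100.0)


BASELINE_KB = 7.2   # approximate pre-VLM value (assumption, from the description)
CURRENT_KB = 11.4   # approximate post-#858 value (assumption, from the description)

if __name__ == "__main__":
    for pct in (15.0, 75.0):
        limit = allowed_upper_bound(BASELINE_KB, pct)
        verdict = "OK" if CURRENT_KB <= limit else "FAIL"
        print(f"threshold {pct:.0f}%: limit {limit:.2f} KB/step, current is {verdict}")

    # A 2x-scale regression still lands outside the 75% band.
    doubled = 2 * BASELINE_KB
    print("2x regression caught:", doubled > allowed_upper_bound(BASELINE_KB, 75.0))
```

Under these assumed numbers the 75% band tops out near 12.6 KB/step, so the current value fits while a doubling of per-step growth would still fail the gate.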
Test plan