
ci(cpp): widen memory_per_step_growth_kb threshold to 75% #874

Merged
kovtcharov merged 1 commit into main from
kalin/cpp-bench-threshold-per-step-growth
Apr 28, 2026

Conversation

@kovtcharov
Collaborator

Summary

Bump the regression threshold for memory_per_step_growth_kb from the generic 15% to 75%, so legitimate single-digit-KB feature growth doesn't false-fail the C++ Benchmarks gate. This is the only metric at a single-digit-KB scale, where ordinary feature work produces large percent swings on small absolute numbers.
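A minimal sketch of the kind of per-metric regression gate this PR describes (the function and threshold table are illustrative, not the actual CI code; only the metric names and the 15%/75% bands come from the PR):

```python
# Per-metric regression thresholds, expressed as fractional growth.
# Metrics not listed fall back to the generic 15% band; the
# single-digit-KB metric gets the widened 75% band from this PR.
THRESHOLDS = {
    "memory_per_step_growth_kb": 0.75,
}
DEFAULT_THRESHOLD = 0.15

def regressed(metric: str, baseline: float, current: float) -> bool:
    """True when `current` exceeds `baseline` by more than the
    metric's allowed fractional growth."""
    limit = THRESHOLDS.get(metric, DEFAULT_THRESHOLD)
    growth = (current - baseline) / baseline
    return growth > limit

# Values from the failing run: 7.2 -> 11.4 KB/step is +58.3%,
# which fails the generic 15% band but passes the 75% band.
assert regressed("memory_per_step_growth_kb", 7.2, 11.4) is False
assert regressed("memory_peak_kb", 9416.0, 9560.0) is False  # +1.5%
```

The design point is that the override table is scoped to one metric, so binary size, latency, and MB-scale memory metrics keep the tighter default.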

Root cause

memory_per_step_growth_kb measures KB of RSS growth per agent loop step. Pre-VLM main was ~7.2 KB/step. PR #858 (VLM image support, merged 2026-04-24) added the Image type, content-block JSON parsing, and new per-message storage — bumping the metric to ~11.4 KB/step. That's +4 KB absolute, a perfectly reasonable cost for a major new feature, but it reads as +58% on the percent scale, well past the 15% generic threshold.
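The percent math behind the failure, using the values quoted above: a small absolute delta on a single-digit-KB baseline reads as a large relative swing.

```python
# Values from the PR: pre-VLM main vs. post-#858 per-step RSS growth.
baseline_kb = 7.2
current_kb = 11.4
growth_pct = (current_kb - baseline_kb) / baseline_kb * 100
print(f"{growth_pct:+.1f}%")  # -> +58.3%, vs. a 15% generic threshold
```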

The cached baseline never refreshed because the workflow only saves a new baseline on a successful main push (if: github.ref == 'refs/heads/main' && github.event_name == 'push'), and every main push since #858 has either failed (this same gate) or been cancelled. So every PR that touches C++ keeps inheriting the stale pre-VLM baseline and failing on this one metric.
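For context, a workflow step of the shape described above might look like this. Only the `if:` condition is quoted from the PR; the step name, cache action, paths, and key are illustrative guesses, not the repository's actual workflow:

```yaml
# Illustrative fragment: the baseline is only saved on a successful
# push to main, which is why a red main leaves the cache stale.
- name: Save benchmark baseline
  if: github.ref == 'refs/heads/main' && github.event_name == 'push'
  uses: actions/cache/save@v4
  with:
    path: bench-baseline.json
    key: cpp-bench-baseline-${{ github.sha }}
```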

Why a threshold change, not a baseline bump

A baseline bump alone is a one-shot fix — the next feature that adds another KB-level allocation per message will hit the same wall. A wider threshold for just this metric is the structural fix: it acknowledges that on a metric whose absolute values are tiny, percent swings are noisy. Other metrics (binary size, latency, peak memory in MB) keep their tighter thresholds.

Verification

Most recent failing run shows the new value within the new band:

metric                          baseline  current    delta  limit  status
loop_latency_median_us            1026.0    918.0   -10.5%    15%  IMPROVED
memory_baseline_kb                9272.0   9332.0    +0.6%    15%  OK
memory_peak_kb                    9416.0   9560.0    +1.5%    15%  OK
memory_per_step_growth_kb            7.2     11.4   +58.3%    15%  FAIL  <-- this one

With the 75% threshold the FAIL becomes OK; everything else is unchanged.

Test plan

  • CI green on this PR (the same Windows benchmark job that's been red on every main push since 2026-04-24)
  • After merge, the baseline auto-refreshes on the next main push, so future PRs no longer inherit the stale 7.2 KB baseline

Per-step growth is a single-digit-KB metric (~7 KB pre-VLM), so #858's
legitimate +4 KB/step from the new Image type, content-block JSON parsing,
and per-message storage shows up as +58% — well past the 15% generic
threshold but only a few KB in absolute terms.

Bump just this one metric to a 75% band so feature additions don't false-
fail the gate while still catching real 2x-scale regressions. Bumping
the cached baseline alone won't help long-term: any future feature that
adds another KB-level allocation per message will hit the same wall.

Currently failing every PR that runs C++ Benchmarks on Windows because
the cached main baseline (from 2026-04-20) predates #858 and main pushes
since then have either failed or been cancelled, so the baseline never
refreshed.
@kovtcharov kovtcharov added the bug, tests, and cpp labels Apr 26, 2026
@kovtcharov kovtcharov enabled auto-merge April 26, 2026 03:51
@kovtcharov kovtcharov self-assigned this Apr 26, 2026
@kovtcharov kovtcharov added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit 046f50e Apr 28, 2026
27 of 29 checks passed
@kovtcharov kovtcharov deleted the kalin/cpp-bench-threshold-per-step-growth branch April 28, 2026 13:21
