Commit 81c4393

ci(perf-gate): loosen per-kernel thresholds to absorb between-run drift

The first real-CI run of the perf gate (CI run 25614982597 on this PR's own no-op change) tripped a hard fail on `Magnitude` at 0.840x and soft-warned on five other kernels in [0.913, 0.965], despite this PR not changing any kernel code. The PR-side and main-side binaries should be byte-identical (same source, same auto-detected SIMD features, same `-C target-cpu=x86-64-v3`), so the only explanation is between-run noise on the same VM.

Empirical observation: between-run drift on otherwise-identical binaries on the same runner VM hits ~10-15% per kernel, which is considerably higher than the within-run CV% the bench reports (typically <1%). Cache state across the two consecutive `./openvx-mark` process launches, thermal headroom, and VM-host neighbour load are the usual culprits. The within-run CV% filter (`--max-cv 5.0`) doesn't catch this because it only inspects samples within a single bench process.

Recalibration:

--kernel-floor 0.85 -> 0.75
    Per-kernel hard fail now requires >25% regression. Generous enough to absorb the worst between-run drift we've observed (the 16% Magnitude blip on the failed CI run sits comfortably above the new floor).

--warn-floor 0.97 -> 0.90
    Soft-warn band moves from "any kernel slower than 3%" to "individual kernels in [-25%, -10%)". Below 10% is treated as noise and not flagged.

--geomean-floor 0.97 (unchanged)
    Aggregate move > 3% across 50+ verified kernels stays the primary gate signal. That magnitude of aggregate drift is essentially impossible to fake with single-kernel noise: it requires a real software-side regression that touches the hot path. Keeping this strict.

Self-tests on the four reference input pairs (PR12 vs pre-PR12 main, reversed, identity, same-side) still behave correctly with the new thresholds: PASS with verdict 1.375x on the real perf wins, FAIL with verdict 0.727x and 7 hard-failed kernels on the simulated regression, PASS with 1.000x on the identity pair.

Applying the new thresholds to the offending CI run's data turns its 1 hard-fail / 5 soft-warn output into the PASS verdict it should have had on a no-op PR.

Co-authored-by: Cursor <cursoragent@cursor.com>
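
The recalibration described in the commit message can be sketched as a small classifier. This is an illustration only, not the actual `perf_gate.py` logic, which the diff does not show. Assumptions: `classify` is a hypothetical helper, ratios are PR/main with higher meaning faster, and soft warnings do not fail the gate. Only `Magnitude` at 0.840 and the band endpoints 0.913 and 0.965 come from the commit message; every other kernel name and value is made-up filler.

```python
import math


def classify(ratios, geomean_floor, kernel_floor, warn_floor):
    """Return (passed, hard_fails, soft_warns, geomean) for PR/main ratios."""
    # Hard fail: ratio strictly below the per-kernel floor.
    hard = sorted(k for k, r in ratios.items() if r < kernel_floor)
    # Soft warn: ratio inside the half-open band [kernel_floor, warn_floor).
    warn = sorted(k for k, r in ratios.items() if kernel_floor <= r < warn_floor)
    # Geometric mean of all ratios, computed in log space.
    geomean = math.exp(sum(math.log(r) for r in ratios.values()) / len(ratios))
    # Assumption: only hard fails and the geomean floor decide the verdict.
    passed = not hard and geomean >= geomean_floor
    return passed, hard, warn, geomean


# Hypothetical data: KernelA..KernelE and Unchanged0..49 are invented names.
ratios = {"Magnitude": 0.840, "KernelA": 0.913, "KernelB": 0.930,
          "KernelC": 0.940, "KernelD": 0.950, "KernelE": 0.965}
ratios.update({f"Unchanged{i}": 1.0 for i in range(50)})

old = classify(ratios, geomean_floor=0.97, kernel_floor=0.85, warn_floor=0.97)
new = classify(ratios, geomean_floor=0.97, kernel_floor=0.75, warn_floor=0.90)

print("old:", old[0], old[1], len(old[2]))
print("new:", new[0], new[1], new[2])
```

Under these assumptions the old thresholds yield one hard fail and five soft warnings, while the new ones pass, with `Magnitude` at 0.840x landing in the soft-warn band [0.75, 0.90) and the five milder kernels no longer flagged.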
1 parent 068ea31 commit 81c4393

2 files changed

Lines changed: 45 additions & 11 deletions

.github/scripts/perf_gate.py

Lines changed: 21 additions & 5 deletions
@@ -13,9 +13,23 @@

  Defaults:
  --geomean-floor 0.97 (no more than 3% aggregate slowdown)
- --kernel-floor 0.85 (no kernel may regress more than 15%)
+ --kernel-floor 0.75 (no kernel may regress more than 25%)
+ --warn-floor 0.90 (soft-warn band for individual kernels in
+ [0.75, 0.90); below 10% we treat as noise)
  --max-cv 5.0 (skip kernels above this run-to-run noise)

+ The per-kernel floor is intentionally generous (0.75x = 25%
+ allowed regression) because between-run drift on otherwise-identical
+ binaries on the SAME runner VM measures ~10-15% per kernel in real
+ CI — well above the within-run CV% the bench itself reports. Cache
+ state, thermal headroom, and VM-host neighbour load are the usual
+ suspects. A tighter per-kernel floor produced false positives on
+ no-op PRs.
+
+ Aggregate moves > 3% across 50+ verified kernels are essentially
+ impossible to fake with noise, which is why the geomean floor is
+ the real gate signal — it stays at 0.97x.
+
  Each filter is applied independently; a kernel that doesn't pass the
  filters (unverified, noisy, missing on either side) is reported in a
  "skipped" section but does not contribute to the gate decision.
@@ -284,10 +298,12 @@ def main(argv: list[str]) -> int:
      p.add_argument("pr_json", help="benchmark_results.json from PR's rustVX run")
      p.add_argument("--geomean-floor", type=float, default=0.97,
                     help="Aggregate geomean floor (default 0.97 = up to 3%% regression)")
-     p.add_argument("--kernel-floor", type=float, default=0.85,
-                    help="Per-kernel floor (default 0.85 = up to 15%% regression)")
-     p.add_argument("--warn-floor", type=float, default=0.97,
-                    help="Soft warn floor (default 0.97 = warn between -3%% and -15%%)")
+     p.add_argument("--kernel-floor", type=float, default=0.75,
+                    help="Per-kernel floor (default 0.75 = up to 25%% regression; "
+                         "generous to absorb ~10-15%% between-run noise on real CI)")
+     p.add_argument("--warn-floor", type=float, default=0.90,
+                    help="Soft warn floor (default 0.90 = warn for individual "
+                         "kernels in [-25%%, -10%%); below 10%% is treated as noise)")
      p.add_argument("--max-cv", type=float, default=5.0,
                     help="Skip kernels whose CV%% exceeds this threshold (default 5.0)")
      p.add_argument("--summary-out", default=None,
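
To see why a within-run filter such as `--max-cv` cannot catch between-run drift, consider a sketch with invented throughput samples from two launches of the same binary. `cv_percent` is a hypothetical helper and the sample values are illustrative, chosen so the ratio lands near the 0.840x reported for `Magnitude`. The bench's actual CV computation is not shown in this diff and may differ.

```python
import statistics


def cv_percent(samples):
    """Coefficient of variation: sample stdev as a percentage of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0


# Hypothetical throughput samples (higher is faster) for a single kernel,
# one list per bench process launch.
main_run = [100.0, 100.5, 99.5, 100.2, 99.8]
pr_run = [84.0, 84.3, 83.8, 84.1, 83.9]

main_cv = cv_percent(main_run)
pr_cv = cv_percent(pr_run)
# Between-run ratio of means: this is what the per-kernel floor sees.
ratio = statistics.mean(pr_run) / statistics.mean(main_run)

print(f"main CV {main_cv:.2f}%  PR CV {pr_cv:.2f}%  PR/main {ratio:.3f}")
```

Each run is internally tight (both CVs are under 1%, far below the 5.0 cutoff), yet the ratio between the runs sits around 0.84. A filter that only inspects samples within one process has no way to flag that gap.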

.github/workflows/conformance.yml

Lines changed: 24 additions & 6 deletions
@@ -846,10 +846,28 @@ jobs:
  # the regression thresholds, posts a markdown verdict to the PR's job
  # summary, and exits non-zero (i.e. fails the workflow) on regression.
  #
- # Defaults:
- # * geomean PR/main >= 0.97 (no more than 3% aggregate slowdown)
- # * per-kernel PR/main >= 0.85 (no kernel may regress more than 15%)
- # * skip kernels with stability_warning / verified=false / cv_percent > 5%
+ # Threshold rationale (see `.github/scripts/perf_gate.py` for full
+ # docstring and per-flag semantics):
+ #
+ # * --geomean-floor 0.97 -> aggregate move > 3% slower fails. This
+ #                           is the real signal for actual perf bugs
+ #                           that affect multiple kernels.
+ # * --kernel-floor  0.75 -> a SINGLE-kernel hard fail requires
+ #                           > 25% regression. This is intentionally
+ #                           generous: we measured ~10-15% between-
+ #                           run drift on otherwise-identical
+ #                           binaries on the same VM (cache state,
+ #                           thermal, VM-host neighbour load), well
+ #                           above the within-run CV% the bench
+ #                           itself reports. A tighter per-kernel
+ #                           floor produced false positives on
+ #                           no-op PRs (CI run 25614982597).
+ # * --warn-floor    0.90 -> soft-warn band [0.75, 0.90). Below 10%
+ #                           we treat as noise.
+ # * --max-cv        5.0  -> auto-skip kernels above this within-
+ #                           run CV%; combined with the looser
+ #                           per-kernel floor this gives us a clean
+ #                           signal-to-noise ratio.
  #
  # Trigger:
  # * pull_request only — push events to main do not gate against
@@ -892,7 +910,7 @@ jobs:
  python3 ${{ github.workspace }}/.github/scripts/perf_gate.py \
    "$MAIN" "$PR" \
    --geomean-floor 0.97 \
-   --kernel-floor 0.85 \
-   --warn-floor 0.97 \
+   --kernel-floor 0.75 \
+   --warn-floor 0.90 \
    --max-cv 5.0 \
    --summary-out "$GITHUB_STEP_SUMMARY"
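
The workflow step passes `--summary-out "$GITHUB_STEP_SUMMARY"` and relies on the script's exit code to fail the job. A minimal sketch of that contract follows. `report` and its markdown layout are hypothetical, not the script's real output, and the verdict values are illustrative. The one external fact relied on is that GitHub Actions exposes `GITHUB_STEP_SUMMARY` as the path of a file to which a step may append Markdown.

```python
import os
import tempfile


def report(summary_path, passed, geomean, hard_fails, soft_warns):
    """Append a markdown verdict to summary_path and return an exit code."""
    lines = [
        f"## Perf gate: {'PASS' if passed else 'FAIL'}",
        f"Geomean PR/main: {geomean:.3f}x",
        f"Hard fails: {', '.join(hard_fails) or 'none'}",
        f"Soft warns: {', '.join(soft_warns) or 'none'}",
    ]
    if summary_path:  # the flag is optional, so the path may be None
        # Append rather than overwrite: other steps may share the file.
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write("\n\n".join(lines) + "\n")
    # Non-zero exit is what makes the workflow step fail.
    return 0 if passed else 1


# Temporary file standing in for $GITHUB_STEP_SUMMARY outside of Actions.
fd, demo_path = tempfile.mkstemp(suffix=".md")
os.close(fd)
exit_code = report(demo_path, True, 0.991, [], ["Magnitude"])
with open(demo_path, encoding="utf-8") as fh:
    summary_text = fh.read()
os.remove(demo_path)

print(exit_code)
print(summary_text)
```

Keeping the verdict rendering and the exit code in one function means the job summary and the pass/fail status cannot disagree, which is one plausible reason to structure a gate script this way.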
