Commit 9bf4e14
ci(perf-gate): block PRs that regress kernel performance against main (#14)
* ci(perf-gate): block PRs that regress kernel performance against main
Adds a per-PR performance regression gate that fails the workflow if
the PR's rustVX kernels are slower than main's by more than the
configured thresholds. The two builds are benchmarked back-to-back
on the *same* runner VM via openvx-mark, so hardware variance between
the two numbers is zero — every reported difference is attributable
to this PR.
This addresses the practical problem we hit while doing perf work: a
PR can claim a 50x speedup against the Khronos sample on its own
benchmark run while silently regressing some other kernel against
main, because (a) Khronos is on a different lifecycle for some
kernels (LaplacianPyramid hoist) and (b) consecutive runs on
GitHub-hosted runners can land on different physical CPUs (we
observed EPYC 7763 vs EPYC 9V74 swings of ~1.29x with no source
changes). Direct PR-vs-main on the same VM removes both confounders.
Workflow shape
--------------
* The existing `benchmark` job is extended with three new steps that
fire only on `pull_request` events:
1. Check out the merge target ref (`github.base_ref`, i.e. main
in practice) into a sub-directory.
2. Build it with `cargo build --release -p openvx-ffi`, reusing
the same auto-detected SIMD feature set the PR build used
(sse2/avx2 on x86_64-AVX2 hosts, sse2 only on older x86_64,
neon on aarch64) and the same `-C target-cpu=x86-64-v3` /
`v2` portable microarch level.
3. Build openvx-mark against main's libopenvx_ffi.so, run with
the same `--resolution FHD --iterations 20 --warmup 5`
workload as the PR run, and upload the resulting
benchmark_results.json as the new
`benchmark-results-rustvx-main` artifact.
* A new `perf-gate` job (also `pull_request`-only) downloads the two
rustVX JSONs (`benchmark-results-rustvx` from PR's run and
`benchmark-results-rustvx-main` from main's same-VM run) and runs
the new `.github/scripts/perf_gate.py` script. Exit non-zero on
regression -> workflow fails -> PR cannot merge until the
regression is addressed or the thresholds re-tuned.
Thresholds (defaults, easy to tune in the workflow file)
--------------------------------------------------------
* `--geomean-floor 0.97`: PR/main geomean across all gated kernels
must be at least 0.97x (no more than 3% aggregate slowdown).
* `--kernel-floor 0.85`: every individually gated kernel must have
PR/main >= 0.85x (no kernel may regress more than 15%).
* `--warn-floor 0.97`: kernels in [0.85, 0.97) get a soft warning
in the verdict markdown but do not fail the gate.
* `--max-cv 5.0`: kernels with `wall_clock.cv_percent > 5%` on
either side are auto-skipped (treated as too noisy to gate on);
they're reported in a "skipped" section so reviewers can see
what was filtered.
Additional automatic skips:
- any kernel with `verified=false` on either side (CTS handles
correctness; perf gate does not double-gate on it),
- any kernel with `stability_warning=true` (openvx-mark's own
stability flag),
- any kernel that's missing on one of the two sides (new in PR
or removed in PR -> reported, not gated).
Verdict reporting
-----------------
`perf_gate.py` writes a markdown block to stdout (and to
`$GITHUB_STEP_SUMMARY` when invoked from the workflow). The block
contains:
* PASS / FAIL verdict with the geomean ratio and counts.
* Threshold table (so reviewers can see what the bar was).
* Hard-regression table (kernels that failed the per-kernel floor),
sorted from worst to best, with main MP/s, PR MP/s, ratio,
per-side sustained_ms, and the reason the kernel tripped.
* Soft-regression table (warn rows).
* Healthy-kernels summary (5 closest to floor + 5 biggest wins,
to surface "almost-regressions" reviewers might want to know about
even on a passing PR).
* Skipped table with reasons.
Self-tests
----------
The script was dry-run against three input pairs from real CI runs
in this repo:
1. PR12 (perf-improvements) vs pre-PR12 main: PASS, geomean 1.375x,
0 hard failures, 0 warnings. The big wins (Box3x3 14.6x,
Gaussian3x3 12.3x, Subtract 3.76x) surface in the
"biggest wins" section. `Add` was correctly auto-skipped because
its CV% was 13.89% on PR (well above the 5% floor).
2. Reversed pair (simulates PR much slower than main): FAIL, 35
hard failures, exit 1. Top offenders sorted from worst to best
(Box3x3 0.068x, Gaussian3x3 0.081x, Subtract 0.266x, ...).
3. Identity (same JSON twice): PASS, ratio 1.000x exactly.
All three return the right exit code, which is what gates the
workflow.
Co-authored-by: Cursor <cursoragent@cursor.com>
* ci: extend paths-ignore to cover docs-only / metadata-only changes
Adds the obviously non-functional file patterns to both the push and
pull_request paths-ignore lists so a CI run isn't spent on changes
that can't possibly affect compile, test, or bench output.
Added (on top of the existing **/*.md, docs/**, LICENSE):
* .gitignore, .gitattributes, .editorconfig — repo metadata
* **/*.svg, **/*.png, **/*.jpg, **/*.jpeg, **/*.gif, **/*.webp —
image-only changes (e.g. logo updates in docs/openvx-logo.svg)
Deliberately NOT added:
* **/*.txt — could be a Cargo build input (e.g. requirements.txt)
* .github/ISSUE_TEMPLATE/** — usually .md and already covered;
leave non-md files in there unblocked so we get CI if anyone
drops a script in
Co-authored-by: Cursor <cursoragent@cursor.com>
* ci(perf-gate): loosen per-kernel thresholds to absorb between-run drift
The first real-CI run of the perf gate (CI run 25614982597 on this
PR's own no-op change) tripped a hard fail on `Magnitude` at 0.840x
and soft-warned on five other kernels in [0.913, 0.965], despite
this PR not changing any kernel code. The PR-side and main-side
binaries should be byte-identical (same source, same auto-detected
SIMD features, same `-C target-cpu=x86-64-v3`), so the only
explanation is between-run noise on the same VM.
Empirical observation: between-run drift on otherwise-identical
binaries on the same runner VM hits ~10-15% per kernel, which is
considerably higher than the within-run CV% the bench reports
(typically <1%). Cache state across the two consecutive
`./openvx-mark` process launches, thermal headroom, and VM-host
neighbour load are the usual culprits. The within-run CV% filter
(`--max-cv 5.0`) doesn't catch this because it only inspects
samples within a single bench process.
Recalibration:
--kernel-floor 0.85 -> 0.75
Per-kernel hard fail now requires >25% regression. Generous
enough to absorb the worst between-run drift we've observed
(the 16% Magnitude blip on the failed CI run sits comfortably
above the new floor).
--warn-floor 0.97 -> 0.90
Soft-warn band moves from "any kernel slower than 3%" to
"individual kernels in [-25%, -10%)". Below 10% is treated as
noise and not flagged.
--geomean-floor 0.97 (unchanged)
Aggregate move > 3% across 50+ verified kernels stays the
primary gate signal. That magnitude of aggregate drift is
essentially impossible to fake with single-kernel noise: it
requires a real software-side regression that touches the hot
path. Keeping this strict.
Self-tests on the four reference input pairs (PR12 vs pre-PR12
main, reversed, identity, same-side) still behave correctly with
the new thresholds: PASS with verdict 1.375x on the real perf wins,
FAIL with verdict 0.727x and 7 hard-failed kernels on the simulated
regression, PASS with 1.000x on the identity pair. Applying the new
thresholds to the offending CI run's data turns its 1 hard-fail /
5 soft-warn output into the PASS verdict it should have had on a
no-op PR.
Co-authored-by: Cursor <cursoragent@cursor.com>
* ci(perf-gate): refactor to self-contained 3-phase architecture
Per review feedback: the previous design had the perf gate consume
artifacts from across two different jobs running on different
runner VMs. Specifically the PR's libopenvx_ffi.so was built in the
existing `build` job (one VM) while the main rustVX was built
inside the `benchmark` job (a different VM). When those two VMs
landed on different CPU pools — which can happen on GitHub-hosted
runners — the auto-detected feature flags (sse2 / avx2) and
RUSTFLAGS (`-C target-cpu=x86-64-v3` vs `v2`) diverged, producing
non-comparable binaries. The downstream gate then saw a real
regression on `Magnitude` (PR/main = 0.746x) on a no-op PR — the
PR binary was running scalar code while the main binary was
running AVX2 code, even though both were checked out from the same
source modulo a workflow-only diff.
Refactor into a clean 3-phase architecture:
Phase 1 (parallel builds):
build — PR rustVX -> build-artifacts
build-main — main rustVX (NEW) -> build-artifacts-main
build-khronos-sample — Khronos sample -> khronos-sample-artifacts
Phase 2 (parallel CTS jobs, depend on `build`):
baseline, graph, data-objects, image-ops, vision-*, enhanced-vision
Phase 3 (parallel bench jobs):
benchmark — PR rustVX vs Khronos sample (informational; existing
job, restored to its pre-#14 shape).
perf-gate — PR rustVX vs main rustVX (gating, NEW).
Depends on `build` + `build-main`. Downloads both
rustVX archives onto a single runner VM, builds
openvx-mark twice, runs both benches back-to-back,
applies `.github/scripts/perf_gate.py` thresholds
and exits non-zero on regression. Same-VM
comparison is the whole point — both libraries are
measured on identical hardware.
Notes:
* `build-main` is structurally identical to `build` (same
CPU-detection + cargo-build steps), just with `ref: github.base_ref`
on the checkout. The `build-artifacts-main` archive is intentionally
smaller than `build-artifacts` (no CTS payload), since only
perf-gate consumes it.
* The `benchmark` job is restored to its pre-#14 shape — no longer
builds main rustVX or uploads `benchmark-results-rustvx-main`. All
PR-vs-main work happens in `perf-gate`, completely independently
of the existing `benchmark` job.
* Added a "throwaway warmup" step in perf-gate before the two
measured runs. The previous CI failure
(run 25615947512: `Magnitude` showed PR/main = 0.746x on a no-op
PR) had the consistent direction of "first bench is slower than
second bench" — likely cold OS cache / page-fault overhead biasing
the first measured run. Doing two short throwaway runs first
warms the VM state so the two real measurement runs both start
from a comparable hot-cache baseline.
* The perf-gate's `needs:` only includes the two build jobs (not
the CTS jobs that the benchmark job blocks on). Perf is
independent of correctness — CTS jobs gate that separately — so
the gate can start as soon as both Phase-1 builds complete,
giving faster signal.
* Skipped on push events to main (no merge target to diff against).
Co-authored-by: Cursor <cursoragent@cursor.com>
* ci(perf-gate): emit comprehensive per-kernel table for trend tracking
Replaces the "Healthy kernels (5 closest to floor + 5 biggest wins)"
highlight section with a full "All kernels" table that lists every
gated kernel exactly once, sorted from worst PR/main ratio to best.
Hard-fail and soft-warn highlight sections at the top still render
unchanged (useful for at-a-glance scanning), so the table is
duplicative-but-comprehensive: every regression also appears in
the all-kernels table for completeness.
The full table is what you need to track perf trajectory across
PRs over time — being able to grep one CI run's `All kernels`
section against the previous run's tells you whether a kernel is
drifting close to the floor before it actually trips the gate.
Verified locally:
* PR12 vs pre-PR12 main (real perf wins): all 51 verified kernels
render, sorted from 1.135x (smallest win) to 14.602x (Box3x3
biggest win). 0 fail, 0 warn, 51 ok. Verdict PASS.
* Reversed (simulated regression): 51 kernels render, sorted
worst-first (Box3x3 0.068x at the top). 7 fail, 44 warn.
Verdict FAIL. Hard / Soft / All-kernels sections all consistent
with each other (7 + 44 + 0 = 51 verified).
* Identity / same-side: all kernels render at ratio 1.000x.
Verdict PASS.
Co-authored-by: Cursor <cursoragent@cursor.com>
* ci(perf-gate): include skipped kernels in the comprehensive table
The previous version listed 51 gated kernels in the "All kernels"
table and put the 1 skipped kernel ('Add', CV% too noisy) in a
separate "Skipped" section with only its name and reason — the
actual numbers were not visible. For trend tracking the user
needs all 52 kernels' numbers in the same place each run.
This change:
* Extends `SkipRecord` to optionally carry the underlying main +
PR Row data so the skipped kernel can be displayed with its
actual MP/s and ms numbers.
* Folds skipped kernels into the comprehensive "All kernels"
table with a new `[skip]` status emoji and the skip reason in
the Notes column. Skipped rows still sort by ratio alongside
gated rows; rows missing on one side (no ratio computable)
sort to the very bottom and are rendered with em-dashes.
* Drops the redundant standalone "Skipped" section — its
information is now in-line with the main table.
* Updates the section heading to report all four counts:
`52 total — 0 fail, 0 warn, 51 ok, 1 skipped`.
Verified locally on the PR12-vs-pre-PR12-main pair: the table now
contains 52 rows including a `[skip] Add` row that shows the
actual 8.484x speedup but is annotated `skipped: CV% over 5.0%
(main=0.29% pr=13.89%)`. Regression case also shows all 52 rows
(7 fail + 44 warn + 0 ok + 1 skipped).
Co-authored-by: Cursor <cursoragent@cursor.com>
* ci(perf-gate): explicit AVX2 builds, 10% per-kernel floor, hardware details
Three review-driven updates:
1. Both `build` and `build-main` jobs now produce binaries with
EXPLICIT AVX2 features and `-C target-cpu=x86-64-v3`, instead
of auto-detecting from /proc/cpuinfo. This guarantees the two
rustVX libraries the perf-gate compares are built with
identical compile-time configuration regardless of which Azure
VM each Phase-1 build lands on, eliminating the silent
binary-mismatch class of false positives we saw earlier.
The x86-64-v3 microarch level (SSE4.2 + AVX + AVX2 + BMI1+2 +
FMA + F16C) is supported by every Linux runner in GitHub's
Azure pool today (AMD EPYC Milan / Genoa, Intel Cascade Lake /
Ice Lake). It's the right baseline for AMD Zen-tuned
performance work — gives rustc licence to auto-vectorise the
rest of the workspace using AVX2 / FMA / BMI2 on top of the
hand-tuned `#[target_feature(enable = "avx2")]` intrinsics
gated by the `avx2` Cargo feature on `openvx-core` and
`openvx-vision`. Falls back cleanly to NEON on aarch64 and
scalar elsewhere.
2. Per-kernel floor tightened from 0.75x (25% regression) to
0.90x (10% regression). With explicit-AVX2 builds the same-VM
noise floor sits well below 10%, so anything tripping the
gate is now a real regression worth investigating rather than
plausibly absorbable jitter. Warn floor moves correspondingly
from 0.90 to 0.95 — kernels in the 5-10% slower range get an
advisory annotation but still pass the gate.
3. Verdict markdown now includes a "### Hardware" section above
the verdict line, listing CPU model / cores / RAM / hostname
/ OS version / timestamp for both the main-side and PR-side
bench runs (parsed from each `benchmark_results.json`'s
`system` field). When the two runs report different hardware
(a real risk if our same-VM assumption ever fails), an
explicit warning callout appears below the table flagging
that the comparison may be biased.
Self-tests: PR12-vs-pre-PR12 main still produces PASS with
geomean 1.375x and 0 hard failures (the real perf wins all
exceed the new 0.90 floor by miles). Reversed pair produces a
much louder FAIL than before (51 hard regressions instead of 7,
because every kernel that's >10% slower now trips the floor).
Identity comparison stays at exactly 1.000x and 0 failures.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>1 parent 709b7da commit 9bf4e14
2 files changed
Lines changed: 852 additions & 87 deletions
0 commit comments