ci: gate benchmark regression on deterministic metrics, not ns/op (#5316)

devantler · claude · web-flow · commit 23119324c58e · 2026-06-17T22:50:38.000Z
* ci: gate benchmark regression on deterministic metrics, not ns/op The github-action-benchmark gate compared wall-clock ns/op, which varies 2-5x between runs on shared GitHub-hosted runners, producing false-positive "Performance Alert" comments on PRs that never touch the benchmarked code (e.g. #5272, a decode-only change, alerted the Marshal/Encode benchmarks it cannot reach). allocs/op and B/op stay flat across those runs - they are the real regression signal. Compare only the deterministic allocation metrics: - Derive bench-compare.txt from bench-filtered.txt with ns/op pinned to 1. The go parser emits one comparable series per metric; pinning ns/op makes both ns/op-bearing series (the bare and the "- ns/op" series) compare as ~0x of the baseline so jitter can't trip the alert, while the B/op and allocs/op series keep their real values and byte-identical names and still compare 1:1. The store job keeps using the unmodified file, so the ns/op history chart is unchanged. - PRs run -count=1 (deterministic metrics need no averaging) for a ~3x faster signal; push to main keeps -count=5 for the ns/op chart. - PRs fail on a >=1.5x allocs/B regression (fail-on-alert is now enabled for pull requests, deterministic so safe from jitter). Pushes to main never fail and just refresh the baseline, so an intentional increase self-heals without wedging main or the auto-release. Also corrects docs/AGENTS.md, which claimed main fails on a >=2x ns/op regression - fail-on-alert was unconditionally false, so it never did. Fixes #5276 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore: Apply megalinter fixes --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: devantler <26203420+devantler@users.noreply.github.com>
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
@@ -1307,12 +1307,13 @@ jobs:
         shell: bash
         env:
           PACKAGES: ${{ steps.discover.outputs.packages }}
-          # Reduce -count on PRs (-count=3) to halve benchmark wall time while
-          # keeping enough samples for the awk averaging step in
-          # "Prepare benchmark regression gate input" to produce a stable signal.
-          # On push to main (which feeds benchmark-store / the historical chart),
-          # keep -count=5 for higher-quality baselines.
-          BENCH_COUNT: ${{ github.event_name == 'push' && '5' || '3' }}
+          # The regression gate compares only the deterministic allocation
+          # metrics (B/op, allocs/op), which are independent of sample count, so
+          # PRs run -count=1 for the fastest possible signal (wall-clock ns/op is
+          # never gated — see "Compare benchmark results"). On push to main
+          # (which feeds benchmark-store / the historical ns/op chart) keep
+          # -count=5 for higher-quality wall-clock baselines.
+          BENCH_COUNT: ${{ github.event_name == 'push' && '5' || '1' }}
         run: |
           set -euo pipefail
           echo "$PACKAGES" \
@@ -1377,19 +1378,39 @@ jobs:
             }
           ' "$RUNNER_TEMP/bench.txt" > "$RUNNER_TEMP/bench-filtered.txt"
 
+          # Derive the regression-gate input by neutralizing the wall-clock
+          # metric. github-action-benchmark's "go" parser turns each
+          # "<name> <iters> <v1> <u1> <v2> <u2> ..." line into one comparable
+          # series per metric (plus a bare ns/op series). Pinning the ns/op value
+          # to a constant 1 makes both ns/op-bearing series compare as ~0x of the
+          # baseline so wall-clock jitter on shared runners can never trip the
+          # alert, while the deterministic B/op and allocs/op series keep their
+          # real values AND byte-identical names — so they still compare 1:1
+          # against the stored baseline. The unmodified bench-filtered.txt (real
+          # ns/op) still feeds the history chart via the benchmark-store job.
+          sed -E 's/[0-9]+ ns\/op/1 ns\/op/' \
+            "$RUNNER_TEMP/bench-filtered.txt" > "$RUNNER_TEMP/bench-compare.txt"
+
       - name: 📊 Compare benchmark results
         if: steps.discover.outputs.skip == 'false'
         uses: benchmark-action/github-action-benchmark@52576c92bccf6ac60c8223ec7eb2565637cae9ba # v1.22.1
         with:
           tool: go
-          output-file-path: ${{ runner.temp }}/bench-filtered.txt
+          # bench-compare.txt pins ns/op to 1, so only the deterministic
+          # B/op and allocs/op series carry real values here.
+          output-file-path: ${{ runner.temp }}/bench-compare.txt
           gh-pages-branch: benchmark-data
           benchmark-data-dir-path: dev/bench
           github-token: ${{ secrets.GITHUB_TOKEN }}
           auto-push: false
           alert-threshold: "150%"
-          fail-threshold: ${{ github.event_name != 'pull_request' && '200%' || '' }}
-          fail-on-alert: false
+          fail-threshold: "150%"
+          # Deterministic metrics are jitter-free, so PRs block on a real
+          # allocation/byte regression (≥1.5x). Pushes to main never fail
+          # (comment-on-alert/fail-on-alert are PR-only): the benchmark-store
+          # job updates the baseline so it self-heals after an intentional
+          # increase lands, avoiding a wedged main / blocked release.
+          fail-on-alert: ${{ github.event_name == 'pull_request' }}
           comment-on-alert: ${{ github.event_name == 'pull_request' }}
           summary-always: true
 
diff --git a/AGENTS.md b/AGENTS.md
@@ -348,7 +348,9 @@ For a deeper dive into KSail's design and internals, refer to:
 
 ## Benchmark Pipeline Consistency
 
-When changing `-count` in the `go test -bench` command in CI, always update the awk averaging/filtering step in "Prepare benchmark regression gate input" to produce exactly one result line per benchmark name. Failing to do so causes false-positive performance regression alerts for all benchmarks, even ones unrelated to the PR. The comparison tool (`github-action-benchmark`) expects the file it reads to contain exactly one result line per benchmark name, so repeated entries for the same benchmark must be consolidated before comparison. See [docs/BENCHMARK-REGRESSION.md](docs/BENCHMARK-REGRESSION.md) for details.
+The benchmark regression gate (the `📊 Compare benchmark results` step in `.github/workflows/ci.yaml`) compares **only the deterministic allocation metrics** (`B/op`, `allocs/op`); the noisy wall-clock `ns/op` series is neutralized in the gate input (`bench-compare.txt`, derived from `bench-filtered.txt` by pinning every `ns/op` value to `1`) because gating on `ns/op` produces false-positive alerts from 2–5× runner jitter. Pull requests run `-count=1` (deterministic metrics need no averaging) and the gate **fails the check** on a ≥1.5× allocation/byte regression; pushes to `main` run `-count=5` to feed the historical `ns/op` chart and only refresh the baseline (they never fail, so the baseline self-heals after an intentional increase lands).
+
+When changing `-count` (or otherwise editing the awk in "Prepare benchmark regression gate input"), keep it producing **exactly one result line per benchmark name**, and keep `bench-compare.txt` carrying the real `B/op`/`allocs/op` values under names byte-identical to the baseline — `github-action-benchmark` compares each line against the single stored baseline entry by name, so duplicate lines create spurious alerts and renamed series silently stop gating. See [docs/BENCHMARK-REGRESSION.md](docs/BENCHMARK-REGRESSION.md) for details.
 
 ## Active Technologies
 
diff --git a/docs/BENCHMARK-REGRESSION.md b/docs/BENCHMARK-REGRESSION.md
@@ -7,24 +7,30 @@ KSail includes automated benchmark regression testing to detect performance chan
 The benchmark jobs in the [CI workflow](../.github/workflows/ci.yaml) run on pushes to `main` and pull requests. They use [`benchmark-action/github-action-benchmark`](https://github.com/benchmark-action/github-action-benchmark) for regression detection and historical tracking.
 
 1. Discovers packages that contain benchmark functions (avoids compiling the entire module)
-2. Runs benchmarks on the current branch
-3. Compares results against the stored baseline (persisted in the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data))
-4. **Fails the CI check** if a benchmark regresses beyond the configured threshold
+2. Runs benchmarks on the current branch (`-count=1` on pull requests, `-count=5` on pushes to `main`)
+3. Compares the **deterministic allocation metrics** (`B/op`, `allocs/op`) against the stored baseline (persisted in the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data))
+4. **Fails the pull-request check** if an allocation metric regresses beyond the threshold
 
-On pushes to `main`, benchmark results are auto-pushed to the `benchmark-data` branch as the new baseline. On pull requests, results are compared against the baseline without updating it.
+Wall-clock `ns/op` is still recorded for the historical chart, but it is **never** used to gate — see [Regression Detection](#regression-detection) for why.
+
+On pushes to `main`, the full results (including `ns/op`) are auto-pushed to the `benchmark-data` branch as the new baseline. On pull requests, results are compared against the baseline without updating it.
 
 ## Regression Detection
 
-The workflow uses threshold-based regression detection:
+Shared GitHub-hosted runners have wall-clock variance of 2–5× between runs, so gating on `ns/op` produces false-positive "Performance Alert" comments on changes that never touch the benchmarked code. The gate therefore compares **only the deterministic metrics** — `allocs/op` and `B/op` — which depend on the code path, not on CPU speed or thermal throttling. A real algorithmic regression shows up in allocations; runner jitter does not.
+
+Mechanically, the `🔍 Prepare benchmark regression gate input` step writes two files:
 
-| Setting           | Value | Meaning                                                                                              |
-|-------------------|-------|------------------------------------------------------------------------------------------------------|
-| `alert-threshold` | 150%  | Marks benchmarks as regressed and posts a PR comment when ≥1.5× slower than baseline                 |
-| `fail-threshold`  | 200%  | Fails CI on non-PR runs (pushes to `main`, merge queue) when a benchmark is ≥2× slower than baseline |
+- `bench-filtered.txt` — the real measurements (including `ns/op`), used to update the baseline and the history chart.
+- `bench-compare.txt` — the same data with `ns/op` pinned to `1`, used as the gate input. `github-action-benchmark`'s `go` parser turns each line into one comparable series per metric; pinning `ns/op` makes its series compare as ~0× of the baseline (so it can never alert), while the `B/op` and `allocs/op` series keep their real values **and byte-identical names** and still compare 1:1.
 
-On pull requests, the benchmark gate is **informational only**: results are posted as a PR comment when the alert threshold is exceeded, but CI never blocks on it. This is intentional — shared GitHub Actions runners have hardware variance of 2–5× between runs, making per-PR blocking gates unreliable. Real regressions are caught on pushes to `main`.
+| Setting           | Value | Meaning                                                             |
+|-------------------|-------|---------------------------------------------------------------------|
+| `alert-threshold` | 150%  | A deterministic metric is flagged when it is ≥1.5× the baseline     |
+| `fail-threshold`  | 150%  | Threshold above which the gate fails the check (pull requests only) |
 
-On push or merge-queue events, `fail-threshold` is active: a ≥2× regression fails CI, protecting `main` from persistent algorithmic regressions.
+- **On pull requests**, a ≥1.5× regression in `allocs/op` or `B/op` posts a comment **and fails the check**, blocking the merge. Because the gated metrics are jitter-free, this gate is reliable — it won't fire on runner noise.
+- **On pushes to `main`**, the gate never fails: the `📤 Store Benchmark Data` job simply updates the baseline (and the `ns/op` chart). An intentional allocation increase that lands on `main` therefore re-baselines itself, so it can neither wedge `main` nor block the auto-release.
 
 ## Historical Results
 
@@ -61,6 +67,6 @@ Follow the conventions established in the existing benchmark files:
 
 **No baseline data yet:** The first push to `main` after enabling the workflow auto-pushes the initial baseline to the `benchmark-data` branch. PRs opened before that will skip the comparison.
 
-**Benchmark times are inconsistent:** CI runs each benchmark 3 times on pull requests (`-count=3`) and 5 times on pushes to `main` (`-count=5`). The samples are averaged into a single representative value before comparison, giving a stable 1:1 comparison against the stored baseline. On pull requests, the benchmark gate is informational only — shared CI runners can vary 2–5× in hardware speed between runs, so per-PR blocking would produce too many false positives. I/O-bound benchmarks (`BenchmarkCreateTarball_*`) are excluded from the regression gate entirely since their timing is dominated by disk-cache state rather than algorithmic complexity.
+**Benchmark times are inconsistent:** This is expected on shared CI runners (2–5× variance between runs) and is exactly why the gate ignores `ns/op` and compares only the deterministic `allocs/op`/`B/op` metrics (see [Regression Detection](#regression-detection)). CI runs each benchmark once on pull requests (`-count=1`, since allocation metrics need no averaging) and 5 times on pushes to `main` (`-count=5`, to smooth the `ns/op` history chart), averaging the samples into one representative value per benchmark. I/O-bound benchmarks (`BenchmarkCreateTarball_*`) and sub-100 `ns/op` benchmarks are excluded from the gate entirely, since their timing is dominated by disk-cache state or clock jitter rather than algorithmic complexity.
 
 **Benchmark jobs skipped:** The workflow runs on all PRs, but benchmark jobs are skipped unless a file in one of the 16 packages that contain `Benchmark*` functions, `go.mod`, `go.sum`, or `.github/workflows/ci.yaml` changed. PRs that only touch unrelated Go code (e.g. a new CLI command, documentation, or a package not in the benchmark filter) will skip benchmarks entirely. In the merge queue, benchmark jobs are always skipped.
diff --git a/docs/src/content/docs/benchmarks.mdx b/docs/src/content/docs/benchmarks.mdx
@@ -13,12 +13,14 @@ KSail includes automated benchmark regression testing to detect performance chan
 
 ## Regression Detection
 
-| Setting           | Value | Meaning                                                                                              |
-|-------------------|-------|------------------------------------------------------------------------------------------------------|
-| `alert-threshold` | 150%  | Marks benchmarks as regressed and posts a PR comment when ≥1.5× slower than baseline (never blocks CI on PRs) |
-| `fail-threshold`  | 200%  | Fails CI on non-PR runs (pushes to `main`, merge queue) when a benchmark is ≥2× slower than baseline         |
+Shared GitHub-hosted runners vary 2–5× in wall-clock speed between runs, so the gate compares **only the deterministic allocation metrics** (`allocs/op`, `B/op`) — never wall-clock `ns/op`, which is tracked for the chart above but is too noisy to gate on. A real algorithmic regression shows up in allocations; runner jitter does not.
 
-On pull requests where benchmarks run and benchmark functions are discovered, a comment is posted only when the alert threshold is exceeded, highlighting the regressed benchmarks.
+| Setting           | Value | Meaning                                                            |
+|-------------------|-------|-------------------------------------------------------------------|
+| `alert-threshold` | 150%  | A deterministic metric is flagged when it is ≥1.5× the baseline    |
+| `fail-threshold`  | 150%  | Threshold above which the gate fails the check (pull requests only) |
+
+On pull requests, a ≥1.5× regression in `allocs/op` or `B/op` posts a comment **and fails the check**, blocking the merge. Pushes to `main` never fail — they refresh the baseline and the `ns/op` chart above, so an intentional allocation increase re-baselines itself.
 
 Benchmark results are also recorded in every [CI workflow run summary](https://github.com/devantler-tech/ksail/actions/workflows/ci.yaml). On pushes to `main`, results are auto-pushed to the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data).