ci: gate benchmark regression on deterministic metrics, not ns/op (#5316)
* ci: gate benchmark regression on deterministic metrics, not ns/op
The github-action-benchmark gate compared wall-clock ns/op, which varies
2-5x between runs on shared GitHub-hosted runners, producing false-positive
"Performance Alert" comments on PRs that never touch the benchmarked code
(e.g. #5272, a decode-only change, alerted the Marshal/Encode benchmarks it
cannot reach). allocs/op and B/op stay flat across those runs - they are the
real regression signal.
Compare only the deterministic allocation metrics:
- Derive bench-compare.txt from bench-filtered.txt with ns/op pinned to 1.
The go parser emits one comparable series per metric; pinning ns/op makes
both ns/op-bearing series (the bare and the "- ns/op" series) compare as
~0x of the baseline so jitter can't trip the alert, while the B/op and
allocs/op series keep their real values and byte-identical names and still
compare 1:1. The store job keeps using the unmodified file, so the ns/op
history chart is unchanged.
- PRs run -count=1 (deterministic metrics need no averaging) for a ~3x
faster signal; push to main keeps -count=5 for the ns/op chart.
- PRs fail on a >=1.5x allocs/B regression (fail-on-alert is now enabled for
pull requests, deterministic so safe from jitter). Pushes to main never
fail and just refresh the baseline, so an intentional increase self-heals
without wedging main or the auto-release.
Also corrects docs/AGENTS.md, which claimed main fails on a >=2x ns/op
regression - fail-on-alert was unconditionally false, so it never did.
Fixes #5276
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore: Apply megalinter fixes
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: devantler <26203420+devantler@users.noreply.github.com>
File: AGENTS.md (+3 -1: 3 additions & 1 deletion)
@@ -348,7 +348,9 @@ For a deeper dive into KSail's design and internals, refer to:
348
348
349
349
## Benchmark Pipeline Consistency
350
350
351
-
When changing `-count` in the `go test -bench` command in CI, always update the awk averaging/filtering step in "Prepare benchmark regression gate input" to produce exactly one result line per benchmark name. Failing to do so causes false-positive performance regression alerts for all benchmarks, even ones unrelated to the PR. The comparison tool (`github-action-benchmark`) expects the file it reads to contain exactly one result line per benchmark name, so repeated entries for the same benchmark must be consolidated before comparison. See [docs/BENCHMARK-REGRESSION.md](docs/BENCHMARK-REGRESSION.md) for details.
351
+
The benchmark regression gate (the `📊 Compare benchmark results` step in `.github/workflows/ci.yaml`) compares **only the deterministic allocation metrics** (`B/op`, `allocs/op`); the noisy wall-clock `ns/op` series is neutralized in the gate input (`bench-compare.txt`, derived from `bench-filtered.txt` by pinning every `ns/op` value to `1`) because gating on `ns/op` produces false-positive alerts from 2–5× runner jitter. Pull requests run `-count=1` (deterministic metrics need no averaging) and the gate **fails the check** on a ≥1.5× allocation/byte regression; pushes to `main` run `-count=5` to feed the historical `ns/op` chart and only refresh the baseline (they never fail, so the baseline self-heals after an intentional increase lands).
352
+
353
+
When changing `-count` (or otherwise editing the awk in "Prepare benchmark regression gate input"), keep it producing **exactly one result line per benchmark name**, and keep `bench-compare.txt` carrying the real `B/op`/`allocs/op` values under names byte-identical to the baseline — `github-action-benchmark` compares each line against the single stored baseline entry by name, so duplicate lines create spurious alerts and renamed series silently stop gating. See [docs/BENCHMARK-REGRESSION.md](docs/BENCHMARK-REGRESSION.md) for details.
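The "exactly one result line per benchmark name" invariant can be checked mechanically. The following is a minimal sketch, not the project's actual awk step: `duplicate_benchmark_names` is a hypothetical helper, and it assumes the usual `go test -bench` layout where the benchmark name is the first whitespace-separated field of each result line.

```python
from collections import Counter


def duplicate_benchmark_names(lines):
    """Return benchmark names that appear on more than one result line."""
    # Assumption: result lines start with "Benchmark" and carry at least one
    # more field; anything else (PASS, ok, goos: ...) is ignored.
    names = [
        line.split()[0]
        for line in lines
        if line.startswith("Benchmark") and len(line.split()) > 1
    ]
    return sorted(name for name, n in Counter(names).items() if n > 1)
```

An empty return value means the file satisfies the invariant; any name in the list would be compared more than once against the single baseline entry.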
File: docs/BENCHMARK-REGRESSION.md (+18 -12: 18 additions & 12 deletions)
@@ -7,24 +7,30 @@ KSail includes automated benchmark regression testing to detect performance chan
7
7
The benchmark jobs in the [CI workflow](../.github/workflows/ci.yaml) run on pushes to `main` and pull requests. They use [`benchmark-action/github-action-benchmark`](https://github.com/benchmark-action/github-action-benchmark) for regression detection and historical tracking.
8
8
9
9
1. Discovers packages that contain benchmark functions (avoids compiling the entire module)
10
-
2. Runs benchmarks on the current branch
11
-
3. Compares results against the stored baseline (persisted in the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data))
12
-
4. **Fails the CI check** if a benchmark regresses beyond the configured threshold
10
+
2. Runs benchmarks on the current branch (`-count=1` on pull requests, `-count=5` on pushes to `main`)
11
+
3. Compares the **deterministic allocation metrics** (`B/op`, `allocs/op`) against the stored baseline (persisted in the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data))
12
+
4. **Fails the pull-request check** if an allocation metric regresses beyond the threshold
13
13
14
-
On pushes to `main`, benchmark results are auto-pushed to the `benchmark-data` branch as the new baseline. On pull requests, results are compared against the baseline without updating it.
14
+
Wall-clock `ns/op` is still recorded for the historical chart, but it is **never** used to gate — see [Regression Detection](#regression-detection) for why.
15
+
16
+
On pushes to `main`, the full results (including `ns/op`) are auto-pushed to the `benchmark-data` branch as the new baseline. On pull requests, results are compared against the baseline without updating it.
15
17
16
18
## Regression Detection
17
19
18
-
The workflow uses threshold-based regression detection:
20
+
Shared GitHub-hosted runners have wall-clock variance of 2–5× between runs, so gating on `ns/op` produces false-positive "Performance Alert" comments on changes that never touch the benchmarked code. The gate therefore compares **only the deterministic metrics** — `allocs/op` and `B/op` — which depend on the code path, not on CPU speed or thermal throttling. A real algorithmic regression shows up in allocations; runner jitter does not.
21
+
22
+
Mechanically, the `🔍 Prepare benchmark regression gate input` step writes two files:
|`alert-threshold`| 150% | Marks benchmarks as regressed and posts a PR comment when ≥1.5× slower than baseline |
23
-
|`fail-threshold`| 200% | Fails CI on non-PR runs (pushes to `main`, merge queue) when a benchmark is ≥2× slower than baseline |
24
+
- `bench-filtered.txt`: the real measurements (including `ns/op`), used to update the baseline and the history chart.
25
+
- `bench-compare.txt`: the same data with `ns/op` pinned to `1`, used as the gate input. `github-action-benchmark`'s `go` parser turns each line into one comparable series per metric; pinning `ns/op` makes its series compare as ~0× of the baseline (so it can never alert), while the `B/op` and `allocs/op` series keep their real values **and byte-identical names** and still compare 1:1.
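The pinning described above can be illustrated with a short sketch. This is not the workflow's awk; `pin_ns_per_op` is a hypothetical helper, and it assumes standard `go test -bench -benchmem` result lines in which each value is immediately followed by its unit.

```python
import re

# Matches the value token that directly precedes the "ns/op" unit.
_NS_VALUE = re.compile(r"\S+(?=\s+ns/op\b)")


def pin_ns_per_op(line: str) -> str:
    """Replace the ns/op value with 1, leaving everything else untouched.

    The benchmark name, spacing, B/op and allocs/op are preserved byte for
    byte, so those series still match the stored baseline by name.
    """
    return _NS_VALUE.sub("1", line)


def derive_compare_file(filtered_lines):
    """Build the gate input from the real measurements, line by line."""
    return [pin_ns_per_op(line) for line in filtered_lines]
```

Lines without an `ns/op` field (for example `PASS`) pass through unchanged, so the function can be applied to every line of the filtered file.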
24
26
25
-
On pull requests, the benchmark gate is **informational only**: results are posted as a PR comment when the alert threshold is exceeded, but CI never blocks on it. This is intentional — shared GitHub Actions runners have hardware variance of 2–5× between runs, making per-PR blocking gates unreliable. Real regressions are caught on pushes to `main`.
|`alert-threshold`| 150% | A deterministic metric is flagged when it is ≥1.5× the baseline |
30
+
|`fail-threshold`| 150% | Threshold above which the gate fails the check (pull requests only) |
26
31
27
-
On push or merge-queue events, `fail-threshold` is active: a ≥2× regression fails CI, protecting `main` from persistent algorithmic regressions.
32
+
- **On pull requests**, a ≥1.5× regression in `allocs/op` or `B/op` posts a comment **and fails the check**, blocking the merge. Because the gated metrics are jitter-free, this gate is reliable: it won't fire on runner noise.
33
+
- **On pushes to `main`**, the gate never fails: the `📤 Store Benchmark Data` job simply updates the baseline (and the `ns/op` chart). An intentional allocation increase that lands on `main` therefore re-baselines itself, so it can neither wedge `main` nor block the auto-release.
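The two rules above reduce to a ratio check plus an event check. The sketch below takes the ≥1.5× wording at face value; the exact boundary behaviour of `github-action-benchmark` is not verified here. `regressed` and `gate_fails` are hypothetical names, and the zero-baseline handling is an assumption.

```python
ALERT_RATIO = 1.5  # the 150% threshold stated in the document


def regressed(current: float, baseline: float, threshold: float = ALERT_RATIO) -> bool:
    """True when current is at least `threshold` times the baseline."""
    if baseline <= 0:
        # Assumption: a zero baseline has no meaningful ratio, so skip it.
        return False
    return current / baseline >= threshold


def gate_fails(event: str, current: float, baseline: float) -> bool:
    """Only pull requests can fail; pushes to main just refresh the baseline."""
    return event == "pull_request" and regressed(current, baseline)
```

A pinned `ns/op` value of `1` against a baseline of, say, 1234 gives a ratio near zero, which is why that series can never trip the check.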
28
34
29
35
## Historical Results
30
36
@@ -61,6 +67,6 @@ Follow the conventions established in the existing benchmark files:
61
67
62
68
**No baseline data yet:** The first push to `main` after enabling the workflow auto-pushes the initial baseline to the `benchmark-data` branch. PRs opened before that will skip the comparison.
63
69
64
-
**Benchmark times are inconsistent:** CI runs each benchmark 3 times on pull requests (`-count=3`) and 5 times on pushes to `main` (`-count=5`). The samples are averaged into a single representative value before comparison, giving a stable 1:1 comparison against the stored baseline. On pull requests, the benchmark gate is informational only: shared CI runners can vary 2–5× in hardware speed between runs, so per-PR blocking would produce too many false positives. I/O-bound benchmarks (`BenchmarkCreateTarball_*`) are excluded from the regression gate entirely since their timing is dominated by disk-cache state rather than algorithmic complexity.
70
+
**Benchmark times are inconsistent:** This is expected on shared CI runners (2–5× variance between runs) and is exactly why the gate ignores `ns/op` and compares only the deterministic `allocs/op`/`B/op` metrics (see [Regression Detection](#regression-detection)). CI runs each benchmark once on pull requests (`-count=1`, since allocation metrics need no averaging) and 5 times on pushes to `main` (`-count=5`, to smooth the `ns/op` history chart), averaging the samples into one representative value per benchmark. I/O-bound benchmarks (`BenchmarkCreateTarball_*`) and sub-100 `ns/op` benchmarks are excluded from the gate entirely, since their timing is dominated by disk-cache state or clock jitter rather than algorithmic complexity.
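As an illustration of averaging repeated samples into one value per benchmark, here is a minimal sketch. It is not the CI's awk step; `average_samples` is a hypothetical helper and assumes result lines of the form name, iteration count, then value/unit pairs.

```python
from collections import defaultdict


def average_samples(lines):
    """Collapse repeated Go benchmark lines into one averaged entry per name."""
    samples = defaultdict(lambda: defaultdict(list))
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].startswith("Benchmark"):
            continue  # skip PASS / ok / goos lines
        name = fields[0]
        # fields[1] is the iteration count; value/unit pairs follow it.
        for value, unit in zip(fields[2::2], fields[3::2]):
            samples[name][unit].append(float(value))
    return {
        name: {unit: sum(vals) / len(vals) for unit, vals in metrics.items()}
        for name, metrics in samples.items()
    }
```

With `-count=1` each list holds a single sample, so the average is the measured value itself, which is consistent with the note that allocation metrics need no averaging.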
65
71
66
72
**Benchmark jobs skipped:** The workflow runs on all PRs, but benchmark jobs are skipped unless a file in one of the 16 packages that contain `Benchmark*` functions, `go.mod`, `go.sum`, or `.github/workflows/ci.yaml` changed. PRs that only touch unrelated Go code (e.g. a new CLI command, documentation, or a package not in the benchmark filter) will skip benchmarks entirely. In the merge queue, benchmark jobs are always skipped.
|`alert-threshold`| 150% | Marks benchmarks as regressed and posts a PR comment when ≥1.5× slower than baseline (never blocks CI on PRs) |
19
-
|`fail-threshold`| 200% | Fails CI on non-PR runs (pushes to `main`, merge queue) when a benchmark is ≥2× slower than baseline |
16
+
Shared GitHub-hosted runners vary 2–5× in wall-clock speed between runs, so the gate compares **only the deterministic allocation metrics** (`allocs/op`, `B/op`) — never wall-clock `ns/op`, which is tracked for the chart above but is too noisy to gate on. A real algorithmic regression shows up in allocations; runner jitter does not.
20
17
21
-
On pull requests where benchmarks run and benchmark functions are discovered, a comment is posted only when the alert threshold is exceeded, highlighting the regressed benchmarks.
|`alert-threshold`| 150% | A deterministic metric is flagged when it is ≥1.5× the baseline |
21
+
|`fail-threshold`| 150% | Threshold above which the gate fails the check (pull requests only) |
22
+
23
+
On pull requests, a ≥1.5× regression in `allocs/op` or `B/op` posts a comment **and fails the check**, blocking the merge. Pushes to `main` never fail — they refresh the baseline and the `ns/op` chart above, so an intentional allocation increase re-baselines itself.
22
24
23
25
Benchmark results are also recorded in every [CI workflow run summary](https://github.com/devantler-tech/ksail/actions/workflows/ci.yaml). On pushes to `main`, results are auto-pushed to the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data).