Skip to content

Commit 2311932

Browse files
devantlerclaude
andauthored
ci: gate benchmark regression on deterministic metrics, not ns/op (#5316)
* ci: gate benchmark regression on deterministic metrics, not ns/op The github-action-benchmark gate compared wall-clock ns/op, which varies 2-5x between runs on shared GitHub-hosted runners, producing false-positive "Performance Alert" comments on PRs that never touch the benchmarked code (e.g. #5272, a decode-only change, alerted the Marshal/Encode benchmarks it cannot reach). allocs/op and B/op stay flat across those runs - they are the real regression signal. Compare only the deterministic allocation metrics: - Derive bench-compare.txt from bench-filtered.txt with ns/op pinned to 1. The go parser emits one comparable series per metric; pinning ns/op makes both ns/op-bearing series (the bare and the "- ns/op" series) compare as ~0x of the baseline so jitter can't trip the alert, while the B/op and allocs/op series keep their real values and byte-identical names and still compare 1:1. The store job keeps using the unmodified file, so the ns/op history chart is unchanged. - PRs run -count=1 (deterministic metrics need no averaging) for a ~3x faster signal; push to main keeps -count=5 for the ns/op chart. - PRs fail on a >=1.5x allocs/B regression (fail-on-alert is now enabled for pull requests, deterministic so safe from jitter). Pushes to main never fail and just refresh the baseline, so an intentional increase self-heals without wedging main or the auto-release. Also corrects docs/AGENTS.md, which claimed main fails on a >=2x ns/op regression - fail-on-alert was unconditionally false, so it never did. Fixes #5276 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore: Apply megalinter fixes --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: devantler <26203420+devantler@users.noreply.github.com>
1 parent b3d09cd commit 2311932

4 files changed

Lines changed: 58 additions & 27 deletions

File tree

.github/workflows/ci.yaml

Lines changed: 30 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1307,12 +1307,13 @@ jobs:
13071307
shell: bash
13081308
env:
13091309
PACKAGES: ${{ steps.discover.outputs.packages }}
1310-
# Reduce -count on PRs (-count=3) to halve benchmark wall time while
1311-
# keeping enough samples for the awk averaging step in
1312-
# "Prepare benchmark regression gate input" to produce a stable signal.
1313-
# On push to main (which feeds benchmark-store / the historical chart),
1314-
# keep -count=5 for higher-quality baselines.
1315-
BENCH_COUNT: ${{ github.event_name == 'push' && '5' || '3' }}
1310+
# The regression gate compares only the deterministic allocation
1311+
# metrics (B/op, allocs/op), which are independent of sample count, so
1312+
# PRs run -count=1 for the fastest possible signal (wall-clock ns/op is
1313+
# never gated — see "Compare benchmark results"). On push to main
1314+
# (which feeds benchmark-store / the historical ns/op chart) keep
1315+
# -count=5 for higher-quality wall-clock baselines.
1316+
BENCH_COUNT: ${{ github.event_name == 'push' && '5' || '1' }}
13161317
run: |
13171318
set -euo pipefail
13181319
echo "$PACKAGES" \
@@ -1377,19 +1378,39 @@ jobs:
13771378
}
13781379
' "$RUNNER_TEMP/bench.txt" > "$RUNNER_TEMP/bench-filtered.txt"
13791380
1381+
# Derive the regression-gate input by neutralizing the wall-clock
1382+
# metric. github-action-benchmark's "go" parser turns each
1383+
# "<name> <iters> <v1> <u1> <v2> <u2> ..." line into one comparable
1384+
# series per metric (plus a bare ns/op series). Pinning the ns/op value
1385+
# to a constant 1 makes both ns/op-bearing series compare as ~0x of the
1386+
# baseline so wall-clock jitter on shared runners can never trip the
1387+
# alert, while the deterministic B/op and allocs/op series keep their
1388+
# real values AND byte-identical names — so they still compare 1:1
1389+
# against the stored baseline. The unmodified bench-filtered.txt (real
1390+
# ns/op) still feeds the history chart via the benchmark-store job.
1391+
sed -E 's/[0-9]+ ns\/op/1 ns\/op/' \
1392+
"$RUNNER_TEMP/bench-filtered.txt" > "$RUNNER_TEMP/bench-compare.txt"
1393+
13801394
- name: 📊 Compare benchmark results
13811395
if: steps.discover.outputs.skip == 'false'
13821396
uses: benchmark-action/github-action-benchmark@52576c92bccf6ac60c8223ec7eb2565637cae9ba # v1.22.1
13831397
with:
13841398
tool: go
1385-
output-file-path: ${{ runner.temp }}/bench-filtered.txt
1399+
# bench-compare.txt pins ns/op to 1, so only the deterministic
1400+
# B/op and allocs/op series carry real values here.
1401+
output-file-path: ${{ runner.temp }}/bench-compare.txt
13861402
gh-pages-branch: benchmark-data
13871403
benchmark-data-dir-path: dev/bench
13881404
github-token: ${{ secrets.GITHUB_TOKEN }}
13891405
auto-push: false
13901406
alert-threshold: "150%"
1391-
fail-threshold: ${{ github.event_name != 'pull_request' && '200%' || '' }}
1392-
fail-on-alert: false
1407+
fail-threshold: "150%"
1408+
# Deterministic metrics are jitter-free, so PRs block on a real
1409+
# allocation/byte regression (≥1.5x). Pushes to main never fail
1410+
# (comment-on-alert/fail-on-alert are PR-only): the benchmark-store
1411+
# job updates the baseline so it self-heals after an intentional
1412+
# increase lands, avoiding a wedged main / blocked release.
1413+
fail-on-alert: ${{ github.event_name == 'pull_request' }}
13931414
comment-on-alert: ${{ github.event_name == 'pull_request' }}
13941415
summary-always: true
13951416

AGENTS.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -348,7 +348,9 @@ For a deeper dive into KSail's design and internals, refer to:
348348

349349
## Benchmark Pipeline Consistency
350350

351-
When changing `-count` in the `go test -bench` command in CI, always update the awk averaging/filtering step in "Prepare benchmark regression gate input" to produce exactly one result line per benchmark name. Failing to do so causes false-positive performance regression alerts for all benchmarks, even ones unrelated to the PR. The comparison tool (`github-action-benchmark`) expects the file it reads to contain exactly one result line per benchmark name, so repeated entries for the same benchmark must be consolidated before comparison. See [docs/BENCHMARK-REGRESSION.md](docs/BENCHMARK-REGRESSION.md) for details.
351+
The benchmark regression gate (the `📊 Compare benchmark results` step in `.github/workflows/ci.yaml`) compares **only the deterministic allocation metrics** (`B/op`, `allocs/op`); the noisy wall-clock `ns/op` series is neutralized in the gate input (`bench-compare.txt`, derived from `bench-filtered.txt` by pinning every `ns/op` value to `1`) because gating on `ns/op` produces false-positive alerts from 2–5× runner jitter. Pull requests run `-count=1` (deterministic metrics need no averaging) and the gate **fails the check** on a ≥1.5× allocation/byte regression; pushes to `main` run `-count=5` to feed the historical `ns/op` chart and only refresh the baseline (they never fail, so the baseline self-heals after an intentional increase lands).
352+
353+
When changing `-count` (or otherwise editing the awk in "Prepare benchmark regression gate input"), keep it producing **exactly one result line per benchmark name**, and keep `bench-compare.txt` carrying the real `B/op`/`allocs/op` values under names byte-identical to the baseline — `github-action-benchmark` compares each line against the single stored baseline entry by name, so duplicate lines create spurious alerts and renamed series silently stop gating. See [docs/BENCHMARK-REGRESSION.md](docs/BENCHMARK-REGRESSION.md) for details.
352354

353355
## Active Technologies
354356

docs/BENCHMARK-REGRESSION.md

Lines changed: 18 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -7,24 +7,30 @@ KSail includes automated benchmark regression testing to detect performance chan
77
The benchmark jobs in the [CI workflow](../.github/workflows/ci.yaml) run on pushes to `main` and pull requests. They use [`benchmark-action/github-action-benchmark`](https://github.com/benchmark-action/github-action-benchmark) for regression detection and historical tracking.
88

99
1. Discovers packages that contain benchmark functions (avoids compiling the entire module)
10-
2. Runs benchmarks on the current branch
11-
3. Compares results against the stored baseline (persisted in the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data))
12-
4. **Fails the CI check** if a benchmark regresses beyond the configured threshold
10+
2. Runs benchmarks on the current branch (`-count=1` on pull requests, `-count=5` on pushes to `main`)
11+
3. Compares the **deterministic allocation metrics** (`B/op`, `allocs/op`) against the stored baseline (persisted in the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data))
12+
4. **Fails the pull-request check** if an allocation metric regresses beyond the threshold
1313

14-
On pushes to `main`, benchmark results are auto-pushed to the `benchmark-data` branch as the new baseline. On pull requests, results are compared against the baseline without updating it.
14+
Wall-clock `ns/op` is still recorded for the historical chart, but it is **never** used to gate — see [Regression Detection](#regression-detection) for why.
15+
16+
On pushes to `main`, the full results (including `ns/op`) are auto-pushed to the `benchmark-data` branch as the new baseline. On pull requests, results are compared against the baseline without updating it.
1517

1618
## Regression Detection
1719

18-
The workflow uses threshold-based regression detection:
20+
Shared GitHub-hosted runners have wall-clock variance of 2–5× between runs, so gating on `ns/op` produces false-positive "Performance Alert" comments on changes that never touch the benchmarked code. The gate therefore compares **only the deterministic metrics**`allocs/op` and `B/op` — which depend on the code path, not on CPU speed or thermal throttling. A real algorithmic regression shows up in allocations; runner jitter does not.
21+
22+
Mechanically, the `🔍 Prepare benchmark regression gate input` step writes two files:
1923

20-
| Setting | Value | Meaning |
21-
|-------------------|-------|------------------------------------------------------------------------------------------------------|
22-
| `alert-threshold` | 150% | Marks benchmarks as regressed and posts a PR comment when ≥1.5× slower than baseline |
23-
| `fail-threshold` | 200% | Fails CI on non-PR runs (pushes to `main`, merge queue) when a benchmark is ≥2× slower than baseline |
24+
- `bench-filtered.txt` — the real measurements (including `ns/op`), used to update the baseline and the history chart.
25+
- `bench-compare.txt` — the same data with `ns/op` pinned to `1`, used as the gate input. `github-action-benchmark`'s `go` parser turns each line into one comparable series per metric; pinning `ns/op` makes its series compare as ~0× of the baseline (so it can never alert), while the `B/op` and `allocs/op` series keep their real values **and byte-identical names** and still compare 1:1.
2426

25-
On pull requests, the benchmark gate is **informational only**: results are posted as a PR comment when the alert threshold is exceeded, but CI never blocks on it. This is intentional — shared GitHub Actions runners have hardware variance of 2–5× between runs, making per-PR blocking gates unreliable. Real regressions are caught on pushes to `main`.
27+
| Setting | Value | Meaning |
28+
|-------------------|-------|---------------------------------------------------------------------|
29+
| `alert-threshold` | 150% | A deterministic metric is flagged when it is ≥1.5× the baseline |
30+
| `fail-threshold` | 150% | Threshold above which the gate fails the check (pull requests only) |
2631

27-
On push or merge-queue events, `fail-threshold` is active: a ≥2× regression fails CI, protecting `main` from persistent algorithmic regressions.
32+
- **On pull requests**, a ≥1.5× regression in `allocs/op` or `B/op` posts a comment **and fails the check**, blocking the merge. Because the gated metrics are jitter-free, this gate is reliable — it won't fire on runner noise.
33+
- **On pushes to `main`**, the gate never fails: the `📤 Store Benchmark Data` job simply updates the baseline (and the `ns/op` chart). An intentional allocation increase that lands on `main` therefore re-baselines itself, so it can neither wedge `main` nor block the auto-release.
2834

2935
## Historical Results
3036

@@ -61,6 +67,6 @@ Follow the conventions established in the existing benchmark files:
6167

6268
**No baseline data yet:** The first push to `main` after enabling the workflow auto-pushes the initial baseline to the `benchmark-data` branch. PRs opened before that will skip the comparison.
6369

64-
**Benchmark times are inconsistent:** CI runs each benchmark 3 times on pull requests (`-count=3`) and 5 times on pushes to `main` (`-count=5`). The samples are averaged into a single representative value before comparison, giving a stable 1:1 comparison against the stored baseline. On pull requests, the benchmark gate is informational only — shared CI runners can vary 2–5× in hardware speed between runs, so per-PR blocking would produce too many false positives. I/O-bound benchmarks (`BenchmarkCreateTarball_*`) are excluded from the regression gate entirely since their timing is dominated by disk-cache state rather than algorithmic complexity.
70+
**Benchmark times are inconsistent:** This is expected on shared CI runners (2–5× variance between runs) and is exactly why the gate ignores `ns/op` and compares only the deterministic `allocs/op`/`B/op` metrics (see [Regression Detection](#regression-detection)). CI runs each benchmark once on pull requests (`-count=1`, since allocation metrics need no averaging) and 5 times on pushes to `main` (`-count=5`, to smooth the `ns/op` history chart), averaging the samples into one representative value per benchmark. I/O-bound benchmarks (`BenchmarkCreateTarball_*`) and sub-100 `ns/op` benchmarks are excluded from the gate entirely, since their timing is dominated by disk-cache state or clock jitter rather than algorithmic complexity.
6571

6672
**Benchmark jobs skipped:** The workflow runs on all PRs, but benchmark jobs are skipped unless a file in one of the 16 packages that contain `Benchmark*` functions, `go.mod`, `go.sum`, or `.github/workflows/ci.yaml` changed. PRs that only touch unrelated Go code (e.g. a new CLI command, documentation, or a package not in the benchmark filter) will skip benchmarks entirely. In the merge queue, benchmark jobs are always skipped.

docs/src/content/docs/benchmarks.mdx

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -13,12 +13,14 @@ KSail includes automated benchmark regression testing to detect performance chan
1313

1414
## Regression Detection
1515

16-
| Setting | Value | Meaning |
17-
|-------------------|-------|------------------------------------------------------------------------------------------------------|
18-
| `alert-threshold` | 150% | Marks benchmarks as regressed and posts a PR comment when ≥1.5× slower than baseline (never blocks CI on PRs) |
19-
| `fail-threshold` | 200% | Fails CI on non-PR runs (pushes to `main`, merge queue) when a benchmark is ≥2× slower than baseline |
16+
Shared GitHub-hosted runners vary 2–5× in wall-clock speed between runs, so the gate compares **only the deterministic allocation metrics** (`allocs/op`, `B/op`) — never wall-clock `ns/op`, which is tracked for the chart above but is too noisy to gate on. A real algorithmic regression shows up in allocations; runner jitter does not.
2017

21-
On pull requests where benchmarks run and benchmark functions are discovered, a comment is posted only when the alert threshold is exceeded, highlighting the regressed benchmarks.
18+
| Setting | Value | Meaning |
19+
|-------------------|-------|-------------------------------------------------------------------|
20+
| `alert-threshold` | 150% | A deterministic metric is flagged when it is ≥1.5× the baseline |
21+
| `fail-threshold` | 150% | Threshold above which the gate fails the check (pull requests only) |
22+
23+
On pull requests, a ≥1.5× regression in `allocs/op` or `B/op` posts a comment **and fails the check**, blocking the merge. Pushes to `main` never fail — they refresh the baseline and the `ns/op` chart above, so an intentional allocation increase re-baselines itself.
2224

2325
Benchmark results are also recorded in every [CI workflow run summary](https://github.com/devantler-tech/ksail/actions/workflows/ci.yaml). On pushes to `main`, results are auto-pushed to the [`benchmark-data` branch](https://github.com/devantler-tech/ksail/tree/benchmark-data).
2426

0 commit comments

Comments
 (0)