Merge pull request #152 from DemchaAV/chore/bench-baseline-ab-tool

DemchaAV · web-flow · commit 2ce252dd22fa · 2026-06-09T19:47:53.000+01:00
chore(bench): cross-platform A/B benchmark tooling, refreshed baseline, dev docs
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,8 @@
+# Shell scripts must keep LF endings so they run on Linux/macOS regardless of
+# the contributor's core.autocrlf setting (a CRLF shebang breaks on Unix).
+*.sh text eol=lf
+
+# Maven wrapper: the POSIX `mvnw` must stay LF to execute on Linux/macOS/WSL
+# (a CRLF shebang is unrunnable there); the Windows `mvnw.cmd` batch stays CRLF.
+mvnw text eol=lf
+mvnw.cmd text eol=crlf
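
Review note on the `.gitattributes` hunk above: the comments say a CRLF shebang breaks on Unix. The reason is that the kernel reads the interpreter path up to the line feed, so a trailing carriage return becomes part of the path. A minimal sketch of a check for that condition follows; `has_crlf_shebang` is a hypothetical helper for illustration and is not part of this PR.

```python
def has_crlf_shebang(data: bytes) -> bool:
    """Return True when the file starts with a shebang line that ends in CRLF."""
    if not data.startswith(b"#!"):
        return False  # no shebang, so line endings cannot break the interpreter lookup
    first_line = data.split(b"\n", 1)[0]
    # With CRLF endings the first line keeps a trailing "\r" after splitting on "\n".
    return first_line.endswith(b"\r")


if __name__ == "__main__":
    print(has_crlf_shebang(b"#!/bin/sh\r\necho hi\r\n"))  # CRLF checkout
    print(has_crlf_shebang(b"#!/bin/sh\necho hi\n"))      # LF checkout
```

Pinning `*.sh` and `mvnw` to `eol=lf` makes the second case the only one a checkout can produce, whatever the contributor's `core.autocrlf` setting is.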
diff --git a/.github/workflows/ab-bench-smoke.yml b/.github/workflows/ab-bench-smoke.yml
@@ -0,0 +1,46 @@
+name: A/B bench smoke (Linux)
+
+# Proves the cross-platform A/B harness scripts/ab-bench.sh actually RUNS
+# end-to-end on Linux: the re-exec survives the in-script branch switches, the
+# unix mvnw builds each side, and the median/diff is produced. This is a SMOKE
+# test of the script orchestration, NOT a perf gate — the numbers it prints are
+# informational (CI noise). Narrowly triggered (only when the script or this
+# workflow changes, or on demand) so it never burdens unrelated PRs. The coarse
+# perf smoke + weekly benchmark diff live in ci.yml; strict JMH in benchmarks-jmh.yml.
+
+on:
+  workflow_dispatch:
+  pull_request:
+    paths:
+      - 'scripts/ab-bench.sh'
+      - '.github/workflows/ab-bench-smoke.yml'
+
+permissions:
+  contents: read
+
+jobs:
+  ab-bench-smoke:
+    name: ab-bench.sh smoke (main vs develop)
+    runs-on: ubuntu-latest
+    timeout-minutes: 25
+    env:
+      JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
+
+    steps:
+      - name: Check out repository (full history for the A/B branch switches)
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+
+      - name: Set up Temurin JDK 17
+        uses: actions/setup-java@v5
+        with:
+          distribution: temurin
+          java-version: '17'
+          cache: maven
+
+      - name: Make main + develop available as local branches
+        run: git fetch --no-tags origin main:main develop:develop
+
+      - name: A/B smoke (single run; asserts the script runs on Linux, exit 0)
+        run: bash scripts/ab-bench.sh -a main -b develop -r 1 --cooldown 0
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -31,7 +31,7 @@ When writing new code, avoid Java 21+ APIs and language constructs that don't ex
 - The blocking validation gate for repository work is `./mvnw -B -ntp clean verify`.
 - Run the guard-focused suite with `./mvnw -B -ntp "-Dtest=EnginePdfBoundaryTest,CanonicalTemplateComposerPdfBoundaryTest,PdfRenderInterfaceGuardTest,PdfRenderingSystemECSDispatchTest,DocumentationCoverageTest,DocumentationExamplesTest,CanonicalSurfaceGuardTest" test`.
 - Run a focused documentation sanity check with `./mvnw -B -ntp "-Dtest=DocumentationExamplesTest" test`.
-- Run the local benchmark wrapper with `powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1` when you change performance-sensitive code or benchmark tooling.
+- Run the local benchmark wrapper when you change performance-sensitive code or benchmark tooling: `powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1` (Windows). To compare two branches fairly, use `scripts/ab-bench.ps1` (Windows) or the cross-platform `scripts/ab-bench.sh` (Linux/macOS/Git Bash). See [docs/operations/benchmarks.md](./docs/operations/benchmarks.md).
 
 ## How to propose changes
 
@@ -110,7 +110,7 @@ See [docs/contributing/release-process.md](./docs/contributing/release-process.m
 - `aggregator/pom.xml`
   Maven reactor (aggregator POM); release tooling propagates the version bump across all modules through it in one pass
 - `baselines/`
-  Committed performance-benchmark baseline summaries (`BASELINE_SUMMARY.md`, `COMPARISON.md`) — the reference numbers benchmark runs compare against
+  Committed performance-benchmark baselines: `current-speed-full.json` is the median reference the `11-verdict-current-speed` gate judges runs against; `BASELINE_SUMMARY.md` / `COMPARISON.md` are historical pre-optimization snapshots
 
 ## Recommended workflow
 
diff --git a/baselines/current-speed-full.json b/baselines/current-speed-full.json
@@ -1,5 +1,5 @@
 {
-  "timestamp" : "2026-06-08 12:07:23",
+  "timestamp" : "2026-06-09 17:19:39",
   "profile" : "full",
   "warmupIterations" : 12,
   "measurementIterations" : 40,
@@ -8,81 +8,91 @@
   "latency" : [ {
     "scenario" : "cv-template",
     "description" : "Compose-first CV template",
-    "avgMillis" : 4.28,
-    "p50Millis" : 3.93,
-    "p95Millis" : 5.83,
-    "maxMillis" : 7.15,
-    "docsPerSecond" : 233.52,
-    "avgKilobytes" : 2.29,
-    "peakHeapMb" : 33.08
+    "avgMillis" : 2.45,
+    "p50Millis" : 2.24,
+    "p95Millis" : 3.53,
+    "maxMillis" : 4.27,
+    "docsPerSecond" : 408.15,
+    "avgKilobytes" : 2.22,
+    "peakHeapMb" : 20.0
   }, {
     "scenario" : "engine-simple",
     "description" : "One-page engine composition",
-    "avgMillis" : 3.17,
-    "p50Millis" : 2.96,
-    "p95Millis" : 5.01,
-    "maxMillis" : 5.9,
-    "docsPerSecond" : 315.87,
+    "avgMillis" : 2.0,
+    "p50Millis" : 1.8,
+    "p95Millis" : 2.88,
+    "maxMillis" : 3.97,
+    "docsPerSecond" : 498.91,
     "avgKilobytes" : 1.08,
-    "peakHeapMb" : 12.0
+    "peakHeapMb" : 8.0
   }, {
     "scenario" : "feature-rich",
     "description" : "QR, barcode, watermark, header/footer, page break",
-    "avgMillis" : 45.37,
-    "p50Millis" : 37.09,
-    "p95Millis" : 60.65,
-    "maxMillis" : 69.62,
-    "docsPerSecond" : 22.04,
-    "avgKilobytes" : 6.37,
-    "peakHeapMb" : 86.14
+    "avgMillis" : 31.99,
+    "p50Millis" : 31.32,
+    "p95Millis" : 36.35,
+    "maxMillis" : 40.65,
+    "docsPerSecond" : 31.26,
+    "avgKilobytes" : 6.33,
+    "peakHeapMb" : 58.89
   }, {
     "scenario" : "invoice-template",
     "description" : "Compose-first invoice template",
-    "avgMillis" : 19.42,
-    "p50Millis" : 18.75,
-    "p95Millis" : 27.88,
-    "maxMillis" : 34.26,
-    "docsPerSecond" : 51.5,
+    "avgMillis" : 13.12,
+    "p50Millis" : 12.88,
+    "p95Millis" : 17.01,
+    "maxMillis" : 19.6,
+    "docsPerSecond" : 76.22,
     "avgKilobytes" : 9.72,
-    "peakHeapMb" : 85.09
+    "peakHeapMb" : 45.11
+  }, {
+    "scenario" : "long-token",
+    "description" : "Long unbreakable tokens (URLs/IDs) forcing character-level wrap",
+    "avgMillis" : 3.38,
+    "p50Millis" : 3.15,
+    "p95Millis" : 4.72,
+    "maxMillis" : 5.51,
+    "docsPerSecond" : 295.43,
+    "avgKilobytes" : 3.97,
+    "peakHeapMb" : 52.0
   }, {
     "scenario" : "proposal-template",
     "description" : "Long multi-page proposal template",
-    "avgMillis" : 14.41,
-    "p50Millis" : 13.71,
-    "p95Millis" : 19.18,
-    "maxMillis" : 19.93,
-    "docsPerSecond" : 69.38,
-    "avgKilobytes" : 7.72,
-    "peakHeapMb" : 97.52
+    "avgMillis" : 9.63,
+    "p50Millis" : 9.24,
+    "p95Millis" : 12.44,
+    "maxMillis" : 13.24,
+    "docsPerSecond" : 103.84,
+    "avgKilobytes" : 7.68,
+    "peakHeapMb" : 51.99
   } ],
   "throughput" : [ {
     "scenario" : "invoice-template",
     "threads" : 1,
     "totalDocs" : 12,
-    "docsPerSecond" : 81.22,
-    "avgMillisPerDoc" : 12.31
+    "docsPerSecond" : 121.21,
+    "avgMillisPerDoc" : 8.25
   }, {
     "scenario" : "invoice-template",
     "threads" : 2,
     "totalDocs" : 24,
-    "docsPerSecond" : 158.68,
-    "avgMillisPerDoc" : 6.3
+    "docsPerSecond" : 223.67,
+    "avgMillisPerDoc" : 4.47
   }, {
     "scenario" : "invoice-template",
     "threads" : 4,
     "totalDocs" : 48,
-    "docsPerSecond" : 265.11,
-    "avgMillisPerDoc" : 3.77
+    "docsPerSecond" : 335.65,
+    "avgMillisPerDoc" : 2.98
   }, {
     "scenario" : "invoice-template",
     "threads" : 8,
     "totalDocs" : 96,
-    "docsPerSecond" : 356.61,
-    "avgMillisPerDoc" : 2.8
+    "docsPerSecond" : 414.73,
+    "avgMillisPerDoc" : 2.41
   } ],
-  "totalBytes" : 2905520,
+  "totalBytes" : 3062560,
   "aggregation" : "median",
-  "sourceCount" : 7,
-  "sourceRuns" : [ "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120624.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120635.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120645.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120655.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120704.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120713.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120722.json" ]
+  "sourceCount" : 5,
+  "sourceRuns" : [ "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171909.json", "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171916.json", "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171924.json", "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171931.json", "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171938.json" ]
 }
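
Review note on the baseline hunk above: the old and new `avgMillis` values can be checked against the ±10% average-latency band that the docs in this PR describe for `11-verdict-current-speed`. The sketch below is an illustration under assumptions: `percent_delta` and `classify` are hypothetical names, and the handling of a delta exactly on the boundary is a guess, not the documented behavior of `BenchmarkVerdictTool`. The `long-token` scenario is omitted because it has no previous value.

```python
# Old -> new avgMillis, copied from the baselines/current-speed-full.json hunk.
AVG_MILLIS = {
    "cv-template": (4.28, 2.45),
    "engine-simple": (3.17, 2.0),
    "feature-rich": (45.37, 31.99),
    "invoice-template": (19.42, 13.12),
    "proposal-template": (14.41, 9.63),
}

THRESHOLD_PCT = 10.0  # band quoted in the docs; boundary handling below is assumed


def percent_delta(old: float, new: float) -> float:
    """Relative change of new versus old in percent (negative means faster)."""
    return (new - old) / old * 100.0


def classify(delta_pct: float, threshold: float = THRESHOLD_PCT) -> str:
    """Hypothetical three-way verdict on average latency."""
    if delta_pct <= -threshold:
        return "IMPROVED"
    if delta_pct >= threshold:
        return "REGRESSED"
    return "NEUTRAL"


if __name__ == "__main__":
    for name, (old, new) in AVG_MILLIS.items():
        delta = percent_delta(old, new)
        print(f"{name:18s} {delta:+6.1f}%  {classify(delta)}")
```

Under these assumptions every scenario with a previous value falls outside the band on the faster side, which is consistent with the docs' rule that the baseline is refreshed only for an intended improvement.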
diff --git a/docs/operations/benchmarks.md b/docs/operations/benchmarks.md
@@ -53,6 +53,10 @@ The script prints numbered sections so you can map console output to the pipelin
    Diffs the newest compatible current-speed reports.
 10. `10-diff-comparative`
    Diffs the two newest comparative reports.
+11. `11-verdict-current-speed`
+   Judges the newest current-speed median against the committed baseline
+   (`baselines/current-speed-full.json`). Hard gate on average latency; peak
+   heap is advisory. See [Refreshing the committed baseline](#refreshing-the-committed-baseline-perf-gate).
 
 Each step writes a dedicated log file under `target/benchmark-runs/<timestamp>/logs/`, and the wrapper mirrors that log back to the console after the step finishes.
 
@@ -172,6 +176,115 @@ powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -SkipDiff
 powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -OpenResults
 ```
 
+## Measuring the impact of an engine change
+
+If you change the engine (layout, pagination, render ordering, PDF session, text
+measurement, fonts) and want to see how the change affects performance, pick the
+view that fits, cheapest first:
+
+- **"Did I regress?" — gate against the committed baseline.** Run a median and
+  let the `11-verdict-current-speed` step score each scenario IMPROVED /
+  NEUTRAL / REGRESSED against `baselines/current-speed-full.json` (hard gate:
+  average latency ±10%, non-zero exit on a regression):
+
+  ```powershell
+  powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile full -Repeat 5
+  ```
+
+- **"What exactly moved?" — A/B your branch against its base (any OS).** Commit
+  your change, then compare it to `develop` with the A/B scripts (see
+  [A/B comparison between two branches](#ab-comparison-between-two-branches)).
+  Both sides are rebuilt and benchmarked, with a per-scenario delta:
+
+  ```bash
+  ./scripts/ab-bench.sh -a develop -b my/engine-change -r 5
+  ```
+
+If a change is *meant* to improve performance and the gate confirms it, refresh
+the baseline so the gate ratchets down — see
+[Refreshing the committed baseline](#refreshing-the-committed-baseline-perf-gate).
+Treat laptop deltas below roughly 5-10% as inconclusive, and re-run on the final
+checkout before citing a number.
+
+## A/B comparison between two branches
+
+The wrappers above benchmark whatever is currently checked out. To answer "is
+branch B faster or slower than branch A?" fairly on a noisy laptop, use the
+dedicated A/B scripts. They **interleave** the two branches (A,B,A,B,…) so
+thermal drift averages out, **repeat** each branch and compare **medians**, and
+**cool down** between runs. Each branch is rebuilt (`install -pl .`) before its
+runs so the benchmark measures that branch's engine, and untracked benchmark
+probes are moved aside around the branch switch so they cannot break the other
+branch's compile.
+
+- **Windows (PowerShell)** — `scripts/ab-bench.ps1`, full suite (latency,
+  throughput, scalability, stress, comparative):
+
+  ```powershell
+  ./scripts/ab-bench.ps1 -BranchA main -BranchB develop -Repeat 3
+  ./scripts/ab-bench.ps1 -BranchA develop -BranchB feature/x -Repeat 5
+  ```
+
+- **Linux / macOS / Windows Git Bash** — `scripts/ab-bench.sh`, `current-speed`
+  suite (per-scenario latency + parallel throughput, the primary engine-speed
+  signal):
+
+  ```bash
+  ./scripts/ab-bench.sh -a main -b develop -r 3
+  ./scripts/ab-bench.sh --branch-a develop --branch-b feature/x --repeat 5 --cooldown 45
+  ./scripts/ab-bench.sh -a main -b origin/anothertree -r 3   # remote-only ref (detached)
+  ```
+
+Both accept any pair of checkout-able refs (local branches or `origin/<name>`).
+Deltas are reported as **B relative to A** (negative latency % and positive
+docs-per-sec % mean B is faster). The working tree must have no uncommitted
+**tracked** changes — the scripts switch branches. Output lands under
+`target/ab-compare/` and `target/benchmarks/diffs/`. Treat deltas below roughly
+5-10% on a laptop as inconclusive; close other JVMs/IDEs and stay on AC power
+for the cleanest numbers.
+
+## Refreshing the committed baseline (perf gate)
+
+`baselines/current-speed-full.json` is a committed median `current-speed` report
+that `11-verdict-current-speed` judges new runs against (hard gate: average
+latency ±10%; peak heap is advisory because GC timing makes it noisy). Refresh it **only** for an
+intended, verified improvement so the gate ratchets down — never to turn a red
+gate green. Capture a median of **≥5** runs on the branch that defines the new
+reference, with the IDE closed:
+
+**Windows (PowerShell):**
+
+```powershell
+.\mvnw.cmd -B -ntp -f benchmarks\pom.xml test-compile dependency:build-classpath -DincludeScope=test "-Dmdep.outputFile=target/benchmark.classpath"
+$cp = 'benchmarks\target\test-classes;benchmarks\target\classes;' + (Get-Content benchmarks\target\benchmark.classpath -Raw).Trim()
+1..5 | ForEach-Object { & java "-Dgraphcompose.benchmark.profile=full" -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark }
+$runs = Get-ChildItem target\benchmarks\current-speed\run-*.json | Sort-Object Name | Select-Object -Last 5 | ForEach-Object { $_.FullName }
+& java -cp "$cp" com.demcha.compose.BenchmarkMedianTool current-speed @runs
+Copy-Item target\benchmarks\aggregates\current-speed\full\latest.json baselines\current-speed-full.json -Force
+```
+
+**Linux / macOS / Git Bash:**
+
+```bash
+./mvnw -B -ntp -f benchmarks/pom.xml test-compile dependency:build-classpath -DincludeScope=test -Dmdep.outputFile=target/benchmark.classpath
+sep=':'; case "$(uname -s)" in MINGW*|MSYS*|CYGWIN*) sep=';';; esac
+cp="benchmarks/target/test-classes${sep}benchmarks/target/classes${sep}$(cat benchmarks/target/benchmark.classpath)"
+for i in 1 2 3 4 5; do java -Dgraphcompose.benchmark.profile=full -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark; done
+runs=$(ls -t target/benchmarks/current-speed/run-*.json | head -5)
+java -cp "$cp" com.demcha.compose.BenchmarkMedianTool current-speed $runs
+cp -f target/benchmarks/aggregates/current-speed/full/latest.json baselines/current-speed-full.json
+```
+
+The baseline is machine-class-specific; the JSON records provenance
+(`timestamp`, `profile`, `sourceRuns`). Validate the refresh against a *fresh*
+run — not one of the five that built the median — on that branch; it should
+score NEUTRAL and exit `0`:
+
+```bash
+java -Dgraphcompose.benchmark.profile=full -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark
+java -cp "$cp" com.demcha.compose.BenchmarkVerdictTool baselines/current-speed-full.json target/benchmarks/current-speed/latest.json
+```
+
 ## Artifact layout
 
 The wrapper writes two groups of artifacts.
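
Review note on the "A/B comparison between two branches" section added in the hunk above: the docs describe interleaving the branches (A,B,A,B,…), repeating each one, and reporting the delta of B relative to A on medians. A minimal sketch of that arithmetic follows. The function names are hypothetical, the latency samples are made-up illustration values, and none of it is taken from `scripts/ab-bench.sh` or `scripts/ab-bench.ps1`.

```python
from statistics import median


def interleaved_schedule(branch_a: str, branch_b: str, repeat: int) -> list:
    """Run order A,B,A,B,... so slow thermal drift affects both sides alike."""
    return [branch for _ in range(repeat) for branch in (branch_a, branch_b)]


def median_delta_pct(samples_a, samples_b) -> float:
    """Delta of B relative to A in percent, computed on the per-branch medians."""
    med_a = median(samples_a)
    med_b = median(samples_b)
    return (med_b - med_a) / med_a * 100.0


if __name__ == "__main__":
    print(interleaved_schedule("develop", "my/engine-change", 3))

    # Illustrative average latencies in milliseconds, one per run (not real data).
    runs_a = [10.0, 10.4, 9.8]
    runs_b = [9.0, 9.3, 12.0]  # the 12.0 outlier does not move the median
    print(f"latency delta, B vs A: {median_delta_pct(runs_a, runs_b):+.1f}%")
```

A negative latency delta means B is faster, matching the sign convention stated in the docs. Comparing medians rather than means keeps a single noisy run from deciding the result.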
diff --git a/scripts/ab-bench.ps1 b/scripts/ab-bench.ps1
diff --git a/scripts/ab-bench.sh b/scripts/ab-bench.sh