Commit 2ce252d

Merge pull request #152 from DemchaAV/chore/bench-baseline-ab-tool
chore(bench): cross-platform A/B benchmark tooling, refreshed baseline, dev docs
2 parents 0b37989 + ec3d980 commit 2ce252d

7 files changed

Lines changed: 728 additions & 47 deletions

.gitattributes

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Shell scripts must keep LF endings so they run on Linux/macOS regardless of
+# the contributor's core.autocrlf setting (a CRLF shebang breaks on Unix).
+*.sh text eol=lf
+
+# Maven wrapper: the POSIX `mvnw` must stay LF to execute on Linux/macOS/WSL
+# (a CRLF shebang is unrunnable there); the Windows `mvnw.cmd` batch stays CRLF.
+mvnw text eol=lf
+mvnw.cmd text eol=crlf
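
The CRLF-shebang failure these comments guard against can be spotted locally before it reaches CI. A minimal sketch in portable shell; `has_crlf_first_line` is a hypothetical helper and is not part of this commit:

```shell
# Hypothetical helper (not part of the commit): succeeds when the first line
# of a file ends in a carriage return, i.e. the shebang was saved with CRLF.
has_crlf_first_line() {
  cr=$(printf '\r')
  first=""
  # read fails on a file without a trailing newline, but still captures the text.
  IFS= read -r first < "$1" || true
  case "$first" in
    *"$cr") return 0 ;;
    *)      return 1 ;;
  esac
}

# Example use: report scripts whose shebang would break on Unix.
# for f in scripts/*.sh mvnw; do has_crlf_first_line "$f" && echo "CRLF: $f"; done
```

`git ls-files --eol` reports the index and working-tree line endings per file, which is a quick way to confirm that the attributes took effect.
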
.github/workflows/ab-bench-smoke.yml

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+name: A/B bench smoke (Linux)
+
+# Proves the cross-platform A/B harness scripts/ab-bench.sh actually RUNS
+# end-to-end on Linux: the re-exec survives the in-script branch switches, the
+# unix mvnw builds each side, and the median/diff is produced. This is a SMOKE
+# test of the script orchestration, NOT a perf gate — the numbers it prints are
+# informational (CI noise). Narrowly triggered (only when the script or this
+# workflow changes, or on demand) so it never burdens unrelated PRs. The coarse
+# perf smoke + weekly benchmark diff live in ci.yml; strict JMH in benchmarks-jmh.yml.
+
+on:
+  workflow_dispatch:
+  pull_request:
+    paths:
+      - 'scripts/ab-bench.sh'
+      - '.github/workflows/ab-bench-smoke.yml'
+
+permissions:
+  contents: read
+
+jobs:
+  ab-bench-smoke:
+    name: ab-bench.sh smoke (main vs develop)
+    runs-on: ubuntu-latest
+    timeout-minutes: 25
+    env:
+      JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
+
+    steps:
+      - name: Check out repository (full history for the A/B branch switches)
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+
+      - name: Set up Temurin JDK 17
+        uses: actions/setup-java@v5
+        with:
+          distribution: temurin
+          java-version: '17'
+          cache: maven
+
+      - name: Make main + develop available as local branches
+        run: git fetch --no-tags origin main:main develop:develop
+
+      - name: A/B smoke (single run; asserts the script runs on Linux, exit 0)
+        run: bash scripts/ab-bench.sh -a main -b develop -r 1 --cooldown 0

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
@@ -31,7 +31,7 @@ When writing new code, avoid Java 21+ APIs and language constructs that don't ex
 - The blocking validation gate for repository work is `./mvnw -B -ntp clean verify`.
 - Run the guard-focused suite with `./mvnw -B -ntp "-Dtest=EnginePdfBoundaryTest,CanonicalTemplateComposerPdfBoundaryTest,PdfRenderInterfaceGuardTest,PdfRenderingSystemECSDispatchTest,DocumentationCoverageTest,DocumentationExamplesTest,CanonicalSurfaceGuardTest" test`.
 - Run a focused documentation sanity check with `./mvnw -B -ntp "-Dtest=DocumentationExamplesTest" test`.
-- Run the local benchmark wrapper with `powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1` when you change performance-sensitive code or benchmark tooling.
+- Run the local benchmark wrapper when you change performance-sensitive code or benchmark tooling: `powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1` (Windows). To compare two branches fairly, use `scripts/ab-bench.ps1` (Windows) or the cross-platform `scripts/ab-bench.sh` (Linux/macOS/Git Bash). See [docs/operations/benchmarks.md](./docs/operations/benchmarks.md).

 ## How to propose changes

@@ -110,7 +110,7 @@ See [docs/contributing/release-process.md](./docs/contributing/release-process.m
 - `aggregator/pom.xml`
 Maven reactor (aggregator POM); release tooling propagates the version bump across all modules through it in one pass
 - `baselines/`
-Committed performance-benchmark baseline summaries (`BASELINE_SUMMARY.md`, `COMPARISON.md`) — the reference numbers benchmark runs compare against
+Committed performance-benchmark baselines: `current-speed-full.json` is the median reference the `11-verdict-current-speed` gate judges runs against; `BASELINE_SUMMARY.md` / `COMPARISON.md` are historical pre-optimization snapshots

 ## Recommended workflow


baselines/current-speed-full.json

Lines changed: 55 additions & 45 deletions
@@ -1,5 +1,5 @@
 {
-  "timestamp" : "2026-06-08 12:07:23",
+  "timestamp" : "2026-06-09 17:19:39",
   "profile" : "full",
   "warmupIterations" : 12,
   "measurementIterations" : 40,
@@ -8,81 +8,91 @@
   "latency" : [ {
     "scenario" : "cv-template",
     "description" : "Compose-first CV template",
-    "avgMillis" : 4.28,
-    "p50Millis" : 3.93,
-    "p95Millis" : 5.83,
-    "maxMillis" : 7.15,
-    "docsPerSecond" : 233.52,
-    "avgKilobytes" : 2.29,
-    "peakHeapMb" : 33.08
+    "avgMillis" : 2.45,
+    "p50Millis" : 2.24,
+    "p95Millis" : 3.53,
+    "maxMillis" : 4.27,
+    "docsPerSecond" : 408.15,
+    "avgKilobytes" : 2.22,
+    "peakHeapMb" : 20.0
   }, {
     "scenario" : "engine-simple",
     "description" : "One-page engine composition",
-    "avgMillis" : 3.17,
-    "p50Millis" : 2.96,
-    "p95Millis" : 5.01,
-    "maxMillis" : 5.9,
-    "docsPerSecond" : 315.87,
+    "avgMillis" : 2.0,
+    "p50Millis" : 1.8,
+    "p95Millis" : 2.88,
+    "maxMillis" : 3.97,
+    "docsPerSecond" : 498.91,
     "avgKilobytes" : 1.08,
-    "peakHeapMb" : 12.0
+    "peakHeapMb" : 8.0
   }, {
     "scenario" : "feature-rich",
     "description" : "QR, barcode, watermark, header/footer, page break",
-    "avgMillis" : 45.37,
-    "p50Millis" : 37.09,
-    "p95Millis" : 60.65,
-    "maxMillis" : 69.62,
-    "docsPerSecond" : 22.04,
-    "avgKilobytes" : 6.37,
-    "peakHeapMb" : 86.14
+    "avgMillis" : 31.99,
+    "p50Millis" : 31.32,
+    "p95Millis" : 36.35,
+    "maxMillis" : 40.65,
+    "docsPerSecond" : 31.26,
+    "avgKilobytes" : 6.33,
+    "peakHeapMb" : 58.89
   }, {
     "scenario" : "invoice-template",
     "description" : "Compose-first invoice template",
-    "avgMillis" : 19.42,
-    "p50Millis" : 18.75,
-    "p95Millis" : 27.88,
-    "maxMillis" : 34.26,
-    "docsPerSecond" : 51.5,
+    "avgMillis" : 13.12,
+    "p50Millis" : 12.88,
+    "p95Millis" : 17.01,
+    "maxMillis" : 19.6,
+    "docsPerSecond" : 76.22,
     "avgKilobytes" : 9.72,
-    "peakHeapMb" : 85.09
+    "peakHeapMb" : 45.11
+  }, {
+    "scenario" : "long-token",
+    "description" : "Long unbreakable tokens (URLs/IDs) forcing character-level wrap",
+    "avgMillis" : 3.38,
+    "p50Millis" : 3.15,
+    "p95Millis" : 4.72,
+    "maxMillis" : 5.51,
+    "docsPerSecond" : 295.43,
+    "avgKilobytes" : 3.97,
+    "peakHeapMb" : 52.0
   }, {
     "scenario" : "proposal-template",
     "description" : "Long multi-page proposal template",
-    "avgMillis" : 14.41,
-    "p50Millis" : 13.71,
-    "p95Millis" : 19.18,
-    "maxMillis" : 19.93,
-    "docsPerSecond" : 69.38,
-    "avgKilobytes" : 7.72,
-    "peakHeapMb" : 97.52
+    "avgMillis" : 9.63,
+    "p50Millis" : 9.24,
+    "p95Millis" : 12.44,
+    "maxMillis" : 13.24,
+    "docsPerSecond" : 103.84,
+    "avgKilobytes" : 7.68,
+    "peakHeapMb" : 51.99
   } ],
   "throughput" : [ {
     "scenario" : "invoice-template",
     "threads" : 1,
     "totalDocs" : 12,
-    "docsPerSecond" : 81.22,
-    "avgMillisPerDoc" : 12.31
+    "docsPerSecond" : 121.21,
+    "avgMillisPerDoc" : 8.25
   }, {
     "scenario" : "invoice-template",
     "threads" : 2,
     "totalDocs" : 24,
-    "docsPerSecond" : 158.68,
-    "avgMillisPerDoc" : 6.3
+    "docsPerSecond" : 223.67,
+    "avgMillisPerDoc" : 4.47
   }, {
     "scenario" : "invoice-template",
     "threads" : 4,
     "totalDocs" : 48,
-    "docsPerSecond" : 265.11,
-    "avgMillisPerDoc" : 3.77
+    "docsPerSecond" : 335.65,
+    "avgMillisPerDoc" : 2.98
   }, {
     "scenario" : "invoice-template",
     "threads" : 8,
     "totalDocs" : 96,
-    "docsPerSecond" : 356.61,
-    "avgMillisPerDoc" : 2.8
+    "docsPerSecond" : 414.73,
+    "avgMillisPerDoc" : 2.41
   } ],
-  "totalBytes" : 2905520,
+  "totalBytes" : 3062560,
   "aggregation" : "median",
-  "sourceCount" : 7,
-  "sourceRuns" : [ "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120624.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120635.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120645.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120655.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120704.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120713.json", "C:\\Users\\Demch\\OneDrive\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260608-120722.json" ]
+  "sourceCount" : 5,
+  "sourceRuns" : [ "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171909.json", "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171916.json", "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171924.json", "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171931.json", "C:\\Dev\\Java\\GraphCompose\\target\\benchmarks\\current-speed\\run-20260609-171938.json" ]
 }

docs/operations/benchmarks.md

Lines changed: 113 additions & 0 deletions
@@ -53,6 +53,10 @@ The script prints numbered sections so you can map console output to the pipelin
 Diffs the newest compatible current-speed reports.
 10. `10-diff-comparative`
 Diffs the two newest comparative reports.
+11. `11-verdict-current-speed`
+Judges the newest current-speed median against the committed baseline
+(`baselines/current-speed-full.json`). Hard gate on average latency; peak
+heap is advisory. See [Refreshing the committed baseline](#refreshing-the-committed-baseline-perf-gate).

 Each step writes a dedicated log file under `target/benchmark-runs/<timestamp>/logs/`, and the wrapper mirrors that log back to the console after the step finishes.

@@ -172,6 +176,115 @@ powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -SkipDiff
 powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -OpenResults
 ```

+## Measuring the impact of an engine change
+
+Changing the engine (layout, pagination, render ordering, PDF session, text
+measurement, fonts) and want to see how it moves performance? Pick the view that
+fits, cheapest first:
+
+- **"Did I regress?" — gate against the committed baseline.** Run a median and
+let the `11-verdict-current-speed` step score each scenario IMPROVED /
+NEUTRAL / REGRESSED against `baselines/current-speed-full.json` (hard gate:
+average latency ±10%, non-zero exit on a regression):
+
+```powershell
+powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile full -Repeat 5
+```
+
+- **"What exactly moved?" — A/B your branch against its base (any OS).** Commit
+your change, then compare it to `develop` with the A/B scripts (see
+[A/B comparison between two branches](#ab-comparison-between-two-branches)).
+Both sides are rebuilt and benchmarked, with a per-scenario delta:
+
+```bash
+./scripts/ab-bench.sh -a develop -b my/engine-change -r 5
+```
+
+If a change is *meant* to improve performance and the gate confirms it, refresh
+the baseline so the gate ratchets down — see
+[Refreshing the committed baseline](#refreshing-the-committed-baseline-perf-gate).
+Treat sub-~5-10% laptop deltas as inconclusive, and re-run on the final checkout
+before citing a number.
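
The IMPROVED / NEUTRAL / REGRESSED scoring described in this section can be illustrated with a small sketch. It is not the logic of `BenchmarkVerdictTool`: the `verdict` function is hypothetical, the 10% band is taken from the text, and treating a delta of exactly 10% as NEUTRAL is an assumption. The baseline value 13.12 is the `invoice-template` `avgMillis` from the refreshed baseline; the "current" values are made up for illustration.

```shell
# Sketch only: classify one scenario from baseline and current average latency
# (milliseconds). Third argument is the band in percent (default 10).
verdict() {
  awk -v base="$1" -v cur="$2" -v band="${3:-10}" 'BEGIN {
    pct = (cur - base) / base * 100          # positive means slower than baseline
    if (pct > band)       v = "REGRESSED"
    else if (pct < -band) v = "IMPROVED"
    else                  v = "NEUTRAL"      # boundary handling is an assumption
    printf "%s %+.2f%%\n", v, pct
  }'
}

verdict 13.12 13.50   # small drift inside the band
verdict 13.12 15.00   # slower by more than 10%
verdict 13.12 11.00   # faster by more than 10%
```

A real gate would exit non-zero on any REGRESSED scenario, as the text states for `11-verdict-current-speed`.
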
+
+## A/B comparison between two branches
+
+The wrappers above benchmark whatever is currently checked out. To answer "is
+branch B faster or slower than branch A?" fairly on a noisy laptop, use the
+dedicated A/B scripts. They **interleave** the two branches (A,B,A,B,…) so
+thermal drift averages out, **repeat** each branch and compare **medians**, and
+**cool down** between runs. Each branch is rebuilt (`install -pl .`) before its
+runs so the benchmark measures that branch's engine, and untracked benchmark
+probes are moved aside around the branch switch so they cannot break the other
+branch's compile.
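
The interleaving and median ideas in the paragraph above can be sketched in a few lines of portable shell. This is an illustration, not the code in `scripts/ab-bench.sh`; `interleaved_order` and `median` are hypothetical names, and the sample values are invented.

```shell
# Sketch only: print the run order for two refs repeated N times (A, B, A, B, ...).
interleaved_order() {
  i=0
  while [ "$i" -lt "$3" ]; do
    printf '%s\n%s\n' "$1" "$2"
    i=$((i + 1))
  done
}

# Median of the numbers given as arguments; with an even count, the mean of
# the two middle values.
median() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

interleaved_order main develop 3
median 13.4 12.9 13.1   # three hypothetical avgMillis samples for one branch
```

Alternating the two refs means slow drift (thermal throttling, background load) affects both sides roughly equally, and the median keeps a single outlier run from dominating the comparison.
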
+
+- **Windows (PowerShell):** `scripts/ab-bench.ps1`, full suite (latency,
+throughput, scalability, stress, comparative):
+
+```powershell
+./scripts/ab-bench.ps1 -BranchA main -BranchB develop -Repeat 3
+./scripts/ab-bench.ps1 -BranchA develop -BranchB feature/x -Repeat 5
+```
+
+- **Linux / macOS / Windows Git Bash:** `scripts/ab-bench.sh`, `current-speed`
+suite (per-scenario latency + parallel throughput, the primary engine-speed
+signal):
+
+```bash
+./scripts/ab-bench.sh -a main -b develop -r 3
+./scripts/ab-bench.sh --branch-a develop --branch-b feature/x --repeat 5 --cooldown 45
+./scripts/ab-bench.sh -a main -b origin/anothertree -r 3 # remote-only ref (detached)
+```
+
+Both accept any pair of checkout-able refs (local branches or `origin/<name>`).
+Deltas are reported as **B relative to A** (negative latency % and positive
+docs-per-sec % mean B is faster). The working tree must have no uncommitted
+**tracked** changes — the scripts switch branches. Output lands under
+`target/ab-compare/` and `target/benchmarks/diffs/`. Treat sub-~5-10% deltas on
+a laptop as inconclusive; close other JVMs/IDEs and stay on AC power for the
+cleanest numbers.
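
The "no uncommitted tracked changes" precondition can be checked by hand before starting a long A/B run. A minimal sketch, assuming `git` is on the `PATH`; `tracked_tree_is_clean` is a hypothetical helper, and the A/B scripts' own check may differ:

```shell
# Sketch only: succeed when the working tree has no modified or staged tracked
# files. Untracked files are ignored, matching the requirement in the text.
tracked_tree_is_clean() {
  dir="${1:-.}"
  # Return 2 when the directory is not inside a Git repository.
  git -C "$dir" rev-parse --git-dir >/dev/null 2>&1 || return 2
  # --untracked-files=no hides untracked files, so any output means a
  # tracked file is modified or staged.
  [ -z "$(git -C "$dir" status --porcelain --untracked-files=no)" ]
}

# Example use from the repository root:
# tracked_tree_is_clean . && ./scripts/ab-bench.sh -a main -b develop -r 3
```
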
+
+## Refreshing the committed baseline (perf gate)
+
+`baselines/current-speed-full.json` is a committed median `current-speed` report
+that `11-verdict-current-speed` judges new runs against (hard gate: average
+latency ±10%; peak heap is advisory, GC-timing noisy). Refresh it **only** for an
+intended, verified improvement so the gate ratchets down — never to turn a red
+gate green. Capture a median of **≥5** runs on the branch that defines the new
+reference, with the IDE closed:
+
+**Windows (PowerShell):**
+
+```powershell
+.\mvnw.cmd -B -ntp -f benchmarks\pom.xml test-compile dependency:build-classpath -DincludeScope=test "-Dmdep.outputFile=target/benchmark.classpath"
+$cp = 'benchmarks\target\test-classes;benchmarks\target\classes;' + (Get-Content benchmarks\target\benchmark.classpath -Raw).Trim()
+1..5 | ForEach-Object { & java "-Dgraphcompose.benchmark.profile=full" -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark }
+$runs = Get-ChildItem target\benchmarks\current-speed\run-*.json | Sort-Object Name | Select-Object -Last 5 | ForEach-Object { $_.FullName }
+& java -cp "$cp" com.demcha.compose.BenchmarkMedianTool current-speed @runs
+Copy-Item target\benchmarks\aggregates\current-speed\full\latest.json baselines\current-speed-full.json -Force
+```
+
+**Linux / macOS / Git Bash:**
+
+```bash
+./mvnw -B -ntp -f benchmarks/pom.xml test-compile dependency:build-classpath -DincludeScope=test -Dmdep.outputFile=target/benchmark.classpath
+sep=':'; case "$(uname -s)" in MINGW*|MSYS*|CYGWIN*) sep=';';; esac
+cp="benchmarks/target/test-classes${sep}benchmarks/target/classes${sep}$(cat benchmarks/target/benchmark.classpath)"
+for i in 1 2 3 4 5; do java -Dgraphcompose.benchmark.profile=full -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark; done
+runs=$(ls -t target/benchmarks/current-speed/run-*.json | head -5)
+java -cp "$cp" com.demcha.compose.BenchmarkMedianTool current-speed $runs
+cp -f target/benchmarks/aggregates/current-speed/full/latest.json baselines/current-speed-full.json
+```
+
+The baseline is machine-class-specific; the JSON records provenance
+(`timestamp`, `profile`, `sourceRuns`). Validate the refresh against a *fresh*
+run — not one of the five that built the median — on that branch; it should
+score NEUTRAL and exit `0`:
+
+```bash
+java -Dgraphcompose.benchmark.profile=full -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark
+java -cp "$cp" com.demcha.compose.BenchmarkVerdictTool baselines/current-speed-full.json target/benchmarks/current-speed/latest.json
+```
+
 ## Artifact layout

 The wrapper writes two groups of artifacts.
