Closed
Commits
18 commits
75d124f
PR 2 Task 2.1: capture pre-split slug-set fixture for deletion-safety
elijahr May 1, 2026
2df9466
PR 2 Task 2.3: bench_spsc binary (Sipsic 1p1c)
elijahr May 1, 2026
f3e1427
PR 2 Task 2.4: bench_mpsc binary (Mupsic 1p1c, 2p1c, 4p1c)
elijahr May 1, 2026
62a54ad
PR 2 Task 2.5: bench_mpmc binary (Mupmuc grid + 8p8c, Sipmuc, channels)
elijahr May 1, 2026
e8d3f5f
PR 2 Task 2.6: bench_unbounded binary (4 unbounded variants)
elijahr May 1, 2026
5615613
PR 2 Task 2.7: superset_check.py + deletion-safety wiring
elijahr May 1, 2026
0523acd
PR 2 Task 2.8: bench.yml matrix over 5 topology binaries + per-step t…
elijahr May 1, 2026
3686f51
PR 2 Task 2.9: 5-input union test in test_merge_bmf.py
elijahr May 1, 2026
7085bd7
PR 2 Task 2.10: delete bench_throughput.nim, rewire consumers
elijahr May 1, 2026
4bada5e
PR 2 Task 2.11: CHANGELOG entry under [Unreleased] for bench-rollup PR 2
elijahr May 1, 2026
6f71cc5
fix(bench): tighten bench_unbounded CI shape to fit 18-min budget
elijahr May 1, 2026
a81d483
fix(bench): gate oversubscribed C>=4 unbounded shapes behind a define
elijahr May 2, 2026
e896c6d
fix(bench): document p95 in BMF schema; clean up runner.py fragment f…
elijahr May 2, 2026
bd4ed7b
fix(bench): run upload on partial failures; correct timeout-comment
elijahr May 2, 2026
0fb3791
fix(bench): drop unparseable BMF fragments before merge
elijahr May 2, 2026
b9129ce
perf(bench): backoffOnPeerWait in busy-spin loops; document teardown …
elijahr May 3, 2026
04a4e42
docs(bench_unbounded): defend `create(T)` as alloc0, not destructor-o…
elijahr May 3, 2026
ab19733
fix(bench): pass -d:danger in runner.py build_nim() to match CI
elijahr May 5, 2026
287 changes: 164 additions & 123 deletions .github/workflows/bench.yml

Large diffs are not rendered by default.

8 changes: 7 additions & 1 deletion .gitignore
@@ -23,6 +23,9 @@ nimble.paths
# Internal planning docs
docs/plans/
deps/
# Worktree-local symlink to the main repo's deps/ folder, created by
# the worktree setup so nim.cfg's `--path:"deps/unittest2"` resolves.
deps

# Embedded repositories (use as dependencies, not submodules)
nim-typestates/
@@ -32,7 +35,10 @@ nim-unittest2/
logs/
test_typed_introspection*
benchmarks/nim/bench_latency
benchmarks/nim/bench_throughput
benchmarks/nim/bench_spsc
benchmarks/nim/bench_mpsc
benchmarks/nim/bench_mpmc
benchmarks/nim/bench_unbounded

# Compiled benchmark test binaries (extensionless executables)
benchmarks/nim/tests/t_*
51 changes: 51 additions & 0 deletions CHANGELOG.md
@@ -72,6 +72,33 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
+ throughput measures on shared per-slug histories. (Multiple
`bencher run` invocations create separate Bencher Reports and would
NOT co-locate measures — see merge rationale in design 1.)
- Four new topology-split throughput binaries replacing the legacy
`bench_throughput.nim` (PR 2):
`benchmarks/nim/bench_spsc.nim` (Sipsic 1p1c),
`benchmarks/nim/bench_mpsc.nim` (Mupsic {1,2,4}p1c),
`benchmarks/nim/bench_mpmc.nim` (Mupmuc {1,2,4}p{1,2,4}c plus 8p8c
oversubscription, Sipmuc 1p{1,2,4}c, Nim channels {1,2,4}p{1,2,4}c),
`benchmarks/nim/bench_unbounded.nim` (all four lockfreequeues
unbounded variants at their natural shapes).
Each emits BMF JSON via `--bmf-out=<path>` with the same per-slug
`throughput_ops_ms` shape as the prior binary. Each owns its own
per-binary intdefines (`-d:BenchSpscRuns/MessageCount/Warmup`,
`-d:BenchMpscRuns/...`, `-d:BenchMpmcRuns/...`, plus four pairs of
`-d:Unbounded<Variant>Runs/MessageCount` per design 2.5) so CI can
budget each topology independently.
- New `benchmarks/scripts/superset_check.py`: slug-set deletion-safety
guard that exits 0 when the post-split BMF covers every slug in the
pre-split fixture (`tests/fixtures/pre-split-slugs.json`) and
exits 1 with the missing slugs alpha-listed on stderr otherwise.
Run by `bench-upload` immediately after `merge_bmf.py` so any
silent slug regression introduced by future edits to the topology
binaries fails the PR check. Covered by 9 unit tests in
`benchmarks/tests/test_superset_check.py`.
- `benchmarks/tests/test_merge_bmf.py` gains `test_five_input_union`
covering the upload-job pipeline shape: 5 sibling fragments (one per
topology binary) merged via `merge_bmf.py` produce a single output
whose slug set is the union of the inputs' slug sets, with shared
slugs carrying measures from every input binary that emitted them.

### Changed

@@ -105,6 +132,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
`runThroughputHarness` without per-call-site type conversion. No
external API change: legacy callers that imported `./adapter` for
`PushResult` / `PopResult` continue to compile (PR 1).
- `.github/workflows/bench.yml` now runs the five topology-split
binaries (`bench_spsc`, `bench_mpsc`, `bench_mpmc`, `bench_unbounded`,
`bench_latency`) as a GitHub Actions matrix instead of the legacy
pair of bench-throughput / bench-latency jobs. Each matrix entry
has its own `timeout-minutes: 12` budget so a hang in one binary
cannot burn the entire workflow's clock; the surviving binaries
finish, the bench-upload job merges what arrived, and the operator
gets partial Bencher coverage rather than no coverage. The
bench-upload job now also runs the `superset_check.py` deletion-
safety guard between `merge_bmf.py` and `bencher run` (PR 2).
- `benchmarks/runner.py` and `lockfreequeues.nimble` `task benchmarks`
iterate the five topology-split binaries and merge their fragments
via `merge_bmf.py` (PR 2).
- `benchmarks/README.md` rewritten to describe the 5-binary pipeline
(matrix CI job, per-binary intdefines, deletion-safety guard, the
merged BMF schema where one slug can carry both throughput and
latency measures) (PR 2).

### Removed

@@ -116,6 +160,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `benchmarks/nim/bench_main.nim` — aggregator binary that wrapped
bench_throughput + bench_latency and produced a custom JSON shape.
`bench_throughput` is now the canonical entry point.
- `benchmarks/nim/bench_throughput.nim` — single multi-topology
throughput driver, replaced by the four topology-split binaries
`bench_spsc`, `bench_mpsc`, `bench_mpmc`, and `bench_unbounded`.
The pre-split slug fixture committed at
`tests/fixtures/pre-split-slugs.json` plus the `superset_check.py`
guard wired into bench.yml enforces that no slug from the legacy
binary silently disappears across the split (PR 2).

## [4.1.0] - 2026-05-01

141 changes: 92 additions & 49 deletions benchmarks/README.md
@@ -5,79 +5,101 @@ regression gate via [Bencher.dev](https://bencher.dev).

## Structure

- `nim/` - Nim benchmarks (lockfreequeues, Loony, Nim channels)
- `nim/` - Nim benchmarks (lockfreequeues + Nim channels).
- `nim/bench_common.nim` - Shared bench harness (BMF emission, stats,
Histogram with top-K + reservoir percentiles, throughput / latency
runners). One module, consumed by every per-topology bench binary.
- `nim/bench_throughput.nim` - Throughput driver. Emits Bencher Metric
Format JSON natively via `--bmf-out=<path>`.
- `nim/bench_spsc.nim` - Bounded SPSC throughput driver (Sipsic 1p1c).
- `nim/bench_mpsc.nim` - Bounded MPSC throughput driver
(Mupsic {1,2,4}p1c).
- `nim/bench_mpmc.nim` - Bounded MPMC throughput driver
(Mupmuc {1,2,4}p{1,2,4}c plus 8p8c oversubscription, Sipmuc 1p{1,2,4}c,
Nim channels {1,2,4}p{1,2,4}c).
- `nim/bench_unbounded.nim` - Unbounded throughput driver across all
four lockfreequeues unbounded variants.
- `nim/bench_latency.nim` - Latency (ping-pong RTT) driver across the
four bounded lockfreequeues variants.
- `nim/adapters/` - One file per upstream queue library
(`<library_slug>_adapter.nim`). Adapters expose a `push(value)
-> PushResult` / `pop() -> PopResult[T]` shape consumed by the
shared harness.
shared harness; multi-thread topologies bypass the generic adapter
and call `queue.getProducer(idx)` / `queue.getConsumer(idx)` directly.
- `merge_bmf.py` - Stateless union of per-binary BMF JSON fragments
into a single `merged.json` for `bencher run`. Exits 1 on
`(slug, measure)` collisions naming the colliding inputs.
- `results/` - JSON output from local benchmark runs
- `runner.py` - Orchestrates local benchmark execution
- `scripts/superset_check.py` - Slug-set deletion-safety guard. Exits
1 with the missing slug list on stderr if a post-split BMF drops
any slug present in the pre-split fixture
(`tests/fixtures/pre-split-slugs.json`).
- `results/` - JSON output from local benchmark runs.
- `runner.py` - Orchestrates local benchmark execution. Builds and
runs all five binaries, then merges their fragments via
`merge_bmf.py`.
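
The union-with-collision-check behaviour described for `merge_bmf.py` above can be sketched as follows. This is a minimal illustration, not the script itself: the function name `merge_fragments` and the in-memory `{path: bmf_dict}` input shape are assumptions made for the example.

```python
def merge_fragments(fragments):
    """Union BMF fragments given as {input_path: {slug: {measure: payload}}}.

    Raises ValueError naming both inputs when two fragments supply the
    same (slug, measure) pair. Hypothetical helper, not the real script.
    """
    merged = {}
    owner = {}  # (slug, measure) -> path of the fragment that supplied it
    for path, bmf in fragments.items():
        for slug, measures in bmf.items():
            for measure, payload in measures.items():
                key = (slug, measure)
                if key in owner:
                    raise ValueError(
                        f"collision on {slug} / {measure}: "
                        f"{owner[key]} and {path}"
                    )
                owner[key] = path
                merged.setdefault(slug, {})[measure] = payload
    return merged
```

Two fragments may share a slug as long as their measure names differ, which is how a throughput fragment and a latency fragment can both contribute to the same slug; a command-line wrapper would map the `ValueError` to exit code 1.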

## Quick Start (local)

```bash
# Run all Nim throughput benchmarks (1M messages x 33 runs - takes a while).
nim c -r -d:release -d:danger --threads:on benchmarks/nim/bench_throughput.nim
# Run every topology binary at default run shape (1M messages * 33
# runs for bounded throughput; 500K * 3 for unbounded; 100K * 33 for
# latency). Takes a while.
nimble benchmarks

# Same, but the CI wall-clock budget. Bounded variants run at 1M x 5
# runs; unbounded_mupsic is gated separately to 500k x 3 runs because
# its wall-clock cost is super-linear in message count. Bump in lockstep
# with `.github/workflows/bench.yml` if you change the CI shape.
# CI-tighter shape: pick one binary and override its per-binary
# intdefines. Each binary owns its own knobs (design doc 2.5).
nim c -r -d:release -d:danger --threads:on \
-d:MessageCount=1000000 -d:DefaultRuns=5 -d:WarmupRuns=2 \
-d:UnboundedMupsicMessageCount=500000 -d:UnboundedMupsicRuns=3 \
benchmarks/nim/bench_throughput.nim

# Emit BMF JSON natively (no Python parser; see merge step below).
./.tmp/bench_throughput --bmf-out=throughput.json
python3 benchmarks/merge_bmf.py merged.json throughput.json
-d:BenchMpmcMessageCount=100000 -d:BenchMpmcRuns=5 -d:BenchMpmcWarmup=2 \
benchmarks/nim/bench_mpmc.nim

# Emit BMF JSON natively (no Python parser; merge to combine).
./.tmp/bench_spsc --bmf-out=spsc.json
./.tmp/bench_mpsc --bmf-out=mpsc.json
./.tmp/bench_mpmc --bmf-out=mpmc.json
./.tmp/bench_unbounded --bmf-out=unbounded.json
./.tmp/bench_latency --bmf-out=latency.json
python3 benchmarks/merge_bmf.py merged.json \
spsc.json mpsc.json mpmc.json unbounded.json latency.json
```

## Metrics

- **Throughput**: `ops/ms` with N producer / N consumer threads
(mean, optional min/max for unbounded variants).
- **Latency**: RTT nanoseconds with percentiles (p50, p95, p99, p999).
(mean, lower=mean-stddev, upper=mean+stddev).
- **Latency**: RTT nanoseconds with percentiles (p50, p95, p99).
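
As a rough illustration of the throughput measure (mean with `mean - stddev` / `mean + stddev` bounds), the sketch below reduces a list of per-run ops/ms samples to one measure. The helper name `throughput_measure` is hypothetical, and the use of the sample standard deviation is an assumption; the Nim harness may compute it differently.

```python
import statistics


def throughput_measure(ops_per_ms_runs):
    """Summarise per-run throughput samples as one BMF-style measure."""
    mean = statistics.fmean(ops_per_ms_runs)
    # Assumption: sample standard deviation; a single run has no spread.
    if len(ops_per_ms_runs) > 1:
        stddev = statistics.stdev(ops_per_ms_runs)
    else:
        stddev = 0.0
    return {
        "value": mean,
        "lower_value": mean - stddev,
        "upper_value": mean + stddev,
    }
```

For example, `throughput_measure([10.0, 12.0, 14.0])` yields a mean of 12.0 with bounds 10.0 and 14.0.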

Review comment (medium):

There is an inconsistency between the metrics list and the BMF schema description. Line 67 mentions p95 as a supported percentile for latency, but the schema example (lines 118-119) and the merge test case (lines 235-236 in test_merge_bmf.py) only include p50 and p99. Please update the documentation to reflect the actual percentiles emitted by the latency benchmark.


## Cloud benchmarking (Bencher.dev)

`.github/workflows/bench.yml` runs `bench_throughput` on `ubuntu-latest`
for every PR and every push to `main`/`devel`. The workflow:
`.github/workflows/bench.yml` runs the five topology-split binaries on
`ubuntu-latest` for every PR and every push to `main`/`devel` via a
GitHub Actions matrix (one matrix entry per binary, each with its own
`timeout-minutes: 12` budget). The workflow:

1. Compiles `bench_throughput` with the CI run shape
(`-d:MessageCount=1000000 -d:DefaultRuns=5 -d:WarmupRuns=2
-d:UnboundedMupsicRuns=3 -d:UnboundedMupsicMessageCount=500000`).
2. Runs `bench_throughput --bmf-out=throughput.json`, which writes
1. Compiles each binary with its CI-tuned per-binary intdefines
(e.g. `-d:BenchSpscMessageCount=1000000 -d:BenchSpscRuns=5
-d:BenchSpscWarmup=2` for `bench_spsc`).
2. Runs the binary with `--bmf-out=<binary>.json`, which writes
Bencher Metric Format JSON natively.
3. Runs `python3 benchmarks/merge_bmf.py merged.json throughput.json`
to produce a single `merged.json` for upload. The merge step is a
no-op union today, but stays in place for the per-topology binary
split landing in PR 2-4.
4. Uploads `merged.json` to the `lockfreequeues` Bencher project via
the `bencherdev/bencher@main` action.
3. Uploads each per-binary JSON as a GitHub Actions artifact.
4. The dependent `bench-upload` job downloads every artifact, unions
them via `merge_bmf.py merged.json $(ls bmf-inputs/*.json)`, then
runs `superset_check.py tests/fixtures/pre-split-slugs.json
merged.json` to enforce deletion-safety. A single `bencher run`
uploads `merged.json` to the `lockfreequeues` Bencher project.
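
The deletion-safety check in step 4 amounts to a set difference between the fixture's slugs and the merged output's slugs. A minimal sketch follows, assuming the fixture has already been loaded into an iterable of slug strings (the on-disk layout of `pre-split-slugs.json` is not shown here, so that part is an assumption) and using hypothetical function names.

```python
import sys


def missing_slugs(fixture_slugs, merged_bmf):
    """Return fixture slugs absent from the merged BMF, alpha-sorted."""
    return sorted(set(fixture_slugs) - set(merged_bmf))


def superset_check(fixture_slugs, merged_bmf, err=None):
    """Return 0 if merged_bmf covers every fixture slug, else 1.

    Missing slugs are written one per line to `err` (stderr by default).
    """
    err = sys.stderr if err is None else err
    missing = missing_slugs(fixture_slugs, merged_bmf)
    for slug in missing:
        print(slug, file=err)
    return 1 if missing else 0
```

Extra slugs in the merged output are not an error in this sketch: the guard only protects against slugs disappearing, so new topology shapes can be added freely.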

On pull requests, Bencher posts a comparison comment against the base
branch using `--start-point-clone-thresholds` and `--start-point-reset`,
so threshold breaches show up inline.

The workflow also runs on `workflow_dispatch` for ad-hoc baseline pinning.
The workflow also runs on `workflow_dispatch` for ad-hoc baseline
pinning.

### One-time setup (maintainer)

The cloud workflow requires:

1. A Bencher.dev project named `lockfreequeues`
(create at https://bencher.dev with that exact slug).
2. A repository secret `BENCHER_API_TOKEN` containing a Bencher API token
with write access to the project.
2. A repository secret `BENCHER_API_TOKEN` containing a Bencher API
token with write access to the project.

Until those exist the `bench` workflow will fail on the upload step;
PR / push events still produce the `merged.json` artifact in the
@@ -92,28 +114,49 @@ job log so local debugging is possible without the upload.
"value": <mean ops/ms>,
"lower_value": <mean - stddev>,
"upper_value": <mean + stddev>
}
},
"latency_p50_ns": {"value": <ns>},
"latency_p95_ns": {"value": <ns>},
"latency_p99_ns": {"value": <ns>}
}
}
```

Slugs are alpha-sorted at the top level and measures are alpha-sorted
within each slug. `lower_value` / `upper_value` are omitted when the
emitter receives `NaN` sentinels for the bounds. Current slug set
emitted by `bench_throughput`:

- `lockfreequeues_sipsic/spsc/1p1c`
- `lockfreequeues_mupmuc/mpmc/{1,2,4,8}p{1,2,4,8}c`
- `lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c`
- `nim_channels/mpmc/{1,2,4}p{1,2,4}c`

## Running merge_bmf tests
emitter receives `NaN` sentinels for the bounds. After `merge_bmf.py`
unions the five binary fragments, a single slug can carry both
`throughput_ops_ms` (from the matching topology binary) AND
`latency_p50_ns` / `latency_p95_ns` / `latency_p99_ns` (from
`bench_latency`) when the slug shape matches `1p1c` on a bounded
variant.
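
The serialisation rules above (alpha-sorted slugs and measures, bounds omitted for `NaN`) can be illustrated in a few lines. The helper names are hypothetical, and treating each bound independently is an assumption; the real emitter lives in `bench_common.nim`.

```python
import json
import math


def bmf_measure(value, lower=math.nan, upper=math.nan):
    """Build one measure, dropping any bound passed as a NaN sentinel."""
    measure = {"value": value}
    if not math.isnan(lower):
        measure["lower_value"] = lower
    if not math.isnan(upper):
        measure["upper_value"] = upper
    return measure


def dump_bmf(bmf):
    """Serialise with keys sorted at every nesting level."""
    return json.dumps(bmf, indent=2, sort_keys=True)


# A merged 1p1c slug carrying both a throughput and a latency measure.
merged = {
    "lockfreequeues_sipsic/spsc/1p1c": {
        "throughput_ops_ms": bmf_measure(1200.0, 1150.0, 1250.0),
        "latency_p50_ns": bmf_measure(310.0),
    }
}
print(dump_bmf(merged))
```

Because `sort_keys=True` applies recursively, `latency_p50_ns` is written before `throughput_ops_ms` regardless of insertion order.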

Current slug set emitted across the five binaries:

- `bench_spsc`: `lockfreequeues_sipsic/spsc/1p1c`.
- `bench_mpsc`: `lockfreequeues_mupsic/mpsc/{1,2,4}p1c`.
- `bench_mpmc`: `lockfreequeues_mupmuc/mpmc/{1,2,4}p{1,2,4}c` plus
`lockfreequeues_mupmuc/mpmc/8p8c`,
`lockfreequeues_sipmuc/mpmc/1p{1,2,4}c`,
`nim_channels/mpmc/{1,2,4}p{1,2,4}c`.
- `bench_unbounded`:
`lockfreequeues_unbounded_sipsic/spsc_unbounded/1p1c`,
`lockfreequeues_unbounded_sipmuc/mpmc_unbounded/1p{1,2,4}c`,
`lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c`,
`lockfreequeues_unbounded_mupmuc/mpmc_unbounded/{1,2,4}p{1,2,4}c`.
- `bench_latency`: `lockfreequeues_sipsic/spsc/1p1c`,
`lockfreequeues_sipmuc/mpmc/1p1c`,
`lockfreequeues_mupsic/mpsc/1p1c`,
`lockfreequeues_mupmuc/mpmc/1p1c`.
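
For the throughput binaries, the `{1,2,4}p{1,2,4}c` shorthand in the list above denotes every producer/consumer combination. A small helper (hypothetical, not part of the repository) can expand such a pattern into concrete slugs, for example when comparing against a slug fixture by hand. The `bench_latency` slugs are four fixed variant/topology pairs rather than a product, so the helper is not meant for them.

```python
import itertools
import re


def expand_slug(pattern):
    """Expand `{a,b}` groups in a slug pattern as a cartesian product."""
    # re.split with a capturing group alternates literal text (even
    # indexes) with the contents of each brace group (odd indexes).
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [
        part.split(",") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


slugs = expand_slug("lockfreequeues_mupmuc/mpmc/{1,2,4}p{1,2,4}c")
print(len(slugs))  # 9 combinations, from 1p1c through 4p4c
```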

## Running merge_bmf and superset_check tests

```bash
python3 -m unittest benchmarks.tests.test_merge_bmf -v
python3 -m unittest benchmarks.tests.test_superset_check -v
```

The tests use only the Python standard library (`unittest`) and run in
< 0.1s. They cover slug regex enforcement, measure regex enforcement,
collision detection (with both colliding files named in stderr), and
alpha-sorted output.
under a second. They cover slug regex enforcement, measure regex
enforcement, collision detection (with both colliding files named in
stderr), alpha-sorted output, 5-input union (one fragment per
topology binary), and the deletion-safety contract enforced by
`superset_check.py`.
@@ -21,8 +21,9 @@
## consumer handle and exposes them so that bench code can register
## producers on the worker threads themselves.
##
## The bench harness in `bench_throughput.nim` consumes this adapter
## directly via specialized benchmark procs (mirroring the Mupmuc path).
## The bench harness in `bench_unbounded.nim` consumes this adapter
## directly via specialized benchmark procs (was `bench_throughput.nim`
## prior to the PR 2 topology split).

import lockfreequeues/unbounded_mupsic
import debra