elijahr · May 1, 2026
diff --git a/.github/workflows/bench.yml b/.github/workflows/bench.yml
diff --git a/.gitignore b/.gitignore
@@ -23,6 +23,9 @@ nimble.paths
 # Internal planning docs
 docs/plans/
 deps/
+# Worktree-local symlink to the main repo's deps/ folder, created by
+# the worktree setup so nim.cfg's `--path:"deps/unittest2"` resolves.
+deps
 
 # Embedded repositories (use as dependencies, not submodules)
 nim-typestates/
@@ -32,7 +35,10 @@ nim-unittest2/
 logs/
 test_typed_introspection*
 benchmarks/nim/bench_latency
-benchmarks/nim/bench_throughput
+benchmarks/nim/bench_spsc
+benchmarks/nim/bench_mpsc
+benchmarks/nim/bench_mpmc
+benchmarks/nim/bench_unbounded
 
 # Compiled benchmark test binaries (extensionless executables)
 benchmarks/nim/tests/t_*

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -72,6 +72,33 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   + throughput measures on shared per-slug histories. (Multiple
   `bencher run` invocations create separate Bencher Reports and would
   NOT co-locate measures — see merge rationale in design 1.)
+- Four new topology-split throughput binaries replacing the legacy
+  `bench_throughput.nim` (PR 2):
+  `benchmarks/nim/bench_spsc.nim` (Sipsic 1p1c),
+  `benchmarks/nim/bench_mpsc.nim` (Mupsic {1,2,4}p1c),
+  `benchmarks/nim/bench_mpmc.nim` (Mupmuc {1,2,4}p{1,2,4}c plus 8p8c
+    oversubscription, Sipmuc 1p{1,2,4}c, Nim channels {1,2,4}p{1,2,4}c),
+  `benchmarks/nim/bench_unbounded.nim` (all four lockfreequeues
+    unbounded variants at their natural shapes).
+  Each emits BMF JSON via `--bmf-out=<path>` with the same per-slug
+  `throughput_ops_ms` shape as the prior binary. Each owns its own
+  per-binary intdefines (`-d:BenchSpscRuns/MessageCount/Warmup`,
+  `-d:BenchMpscRuns/...`, `-d:BenchMpmcRuns/...`, plus four pairs of
+  `-d:Unbounded<Variant>Runs/MessageCount` per design 2.5) so CI can
+  budget each topology independently.
+- New `benchmarks/scripts/superset_check.py`: slug-set deletion-safety
+  guard that exits 0 when the post-split BMF covers every slug in the
+  pre-split fixture (`tests/fixtures/pre-split-slugs.json`) and
+  exits 1 with the missing slugs alpha-listed on stderr otherwise.
+  Run by `bench-upload` immediately after `merge_bmf.py` so any
+  silent slug regression introduced by future edits to the topology
+  binaries fails the PR check. Covered by 9 unit tests in
+  `benchmarks/tests/test_superset_check.py`.
+- `benchmarks/tests/test_merge_bmf.py` gains `test_five_input_union`
+  covering the upload-job pipeline shape: 5 sibling fragments (one per
+  topology binary) merged via `merge_bmf.py` produce a single output
+  whose slug set is the union of the inputs, with shared slugs carrying
+  measures from every input binary.
 
 ### Changed
 
@@ -105,6 +132,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   `runThroughputHarness` without per-call-site type conversion. No
   external API change: legacy callers that imported `./adapter` for
   `PushResult` / `PopResult` continue to compile (PR 1).
+- `.github/workflows/bench.yml` now runs the five topology-split
+  binaries (`bench_spsc`, `bench_mpsc`, `bench_mpmc`, `bench_unbounded`,
+  `bench_latency`) as a GitHub Actions matrix instead of the legacy
+  pair of bench-throughput / bench-latency jobs. Each matrix entry
+  has its own `timeout-minutes: 12` budget so a hang in one binary
+  cannot burn the entire workflow's clock; the surviving binaries
+  finish, the bench-upload job merges what arrived, and the operator
+  gets partial Bencher coverage rather than no coverage. The
+  bench-upload job now also runs the `superset_check.py` deletion-
+  safety guard between `merge_bmf.py` and `bencher run` (PR 2).
+- `benchmarks/runner.py` and `lockfreequeues.nimble` `task benchmarks`
+  iterate the five topology-split binaries and merge their fragments
+  via `merge_bmf.py` (PR 2).
+- `benchmarks/README.md` rewritten to describe the 5-binary pipeline
+  (matrix CI job, per-binary intdefines, deletion-safety guard, the
+  merged BMF schema where one slug can carry both throughput and
+  latency measures) (PR 2).
 
 ### Removed
 
@@ -116,6 +160,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `benchmarks/nim/bench_main.nim` — aggregator binary that wrapped
   bench_throughput + bench_latency and produced a custom JSON shape.
   `bench_throughput` is now the canonical entry point.
+- `benchmarks/nim/bench_throughput.nim` — single multi-topology
+  throughput driver, replaced by the four topology-split binaries
+  `bench_spsc`, `bench_mpsc`, `bench_mpmc`, and `bench_unbounded`.
+  The pre-split slug fixture committed at
+  `tests/fixtures/pre-split-slugs.json`, together with the
+  `superset_check.py` guard wired into bench.yml, ensures that no slug
+  from the legacy binary silently disappears across the split (PR 2).
 
 ## [4.1.0] - 2026-05-01
 

diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -5,79 +5,101 @@ regression gate via [Bencher.dev](https://bencher.dev).
 
 ## Structure
 
-- `nim/` - Nim benchmarks (lockfreequeues, Loony, Nim channels)
+- `nim/` - Nim benchmarks (lockfreequeues + Nim channels).
 - `nim/bench_common.nim` - Shared bench harness (BMF emission, stats,
   Histogram with top-K + reservoir percentiles, throughput / latency
   runners). One module, consumed by every per-topology bench binary.
-- `nim/bench_throughput.nim` - Throughput driver. Emits Bencher Metric
-  Format JSON natively via `--bmf-out=<path>`.
+- `nim/bench_spsc.nim` - Bounded SPSC throughput driver (Sipsic 1p1c).
+- `nim/bench_mpsc.nim` - Bounded MPSC throughput driver
+  (Mupsic {1,2,4}p1c).
+- `nim/bench_mpmc.nim` - Bounded MPMC throughput driver
+  (Mupmuc {1,2,4}p{1,2,4}c plus 8p8c oversubscription, Sipmuc 1p{1,2,4}c,
+  Nim channels {1,2,4}p{1,2,4}c).
+- `nim/bench_unbounded.nim` - Unbounded throughput driver across all
+  four lockfreequeues unbounded variants.
+- `nim/bench_latency.nim` - Latency (ping-pong RTT) driver across the
+  four bounded lockfreequeues variants.
 - `nim/adapters/` - One file per upstream queue library
   (`<library_slug>_adapter.nim`). Adapters expose a `push(value)
   -> PushResult` / `pop() -> PopResult[T]` shape consumed by the
-  shared harness.
+  shared harness; multi-thread topologies bypass the generic adapter
+  and call `queue.getProducer(idx)` / `queue.getConsumer(idx)` directly.
 - `merge_bmf.py` - Stateless union of per-binary BMF JSON fragments
   into a single `merged.json` for `bencher run`. Exits 1 on
   `(slug, measure)` collisions naming the colliding inputs.
-- `results/` - JSON output from local benchmark runs
-- `runner.py` - Orchestrates local benchmark execution
+- `scripts/superset_check.py` - Slug-set deletion-safety guard. Exits
+  1 with the missing slug list on stderr if a post-split BMF drops
+  any slug present in the pre-split fixture
+  (`tests/fixtures/pre-split-slugs.json`).
+- `results/` - JSON output from local benchmark runs.
+- `runner.py` - Orchestrates local benchmark execution. Builds and
+  runs all five binaries, then merges their fragments via
+  `merge_bmf.py`.
 
 ## Quick Start (local)
 
 ```bash
-# Run all Nim throughput benchmarks (1M messages x 33 runs - takes a while).
-nim c -r -d:release -d:danger --threads:on benchmarks/nim/bench_throughput.nim
+# Run every topology binary at default run shape (1M messages * 33
+# runs for bounded throughput; 500K * 3 for unbounded; 100K * 33 for
+# latency). Takes a while.
+nimble benchmarks
 
-# Same, but the CI wall-clock budget. Bounded variants run at 1M x 5
-# runs; unbounded_mupsic is gated separately to 500k x 3 runs because
-# its wall-clock cost is super-linear in message count. Bump in lockstep
-# with `.github/workflows/bench.yml` if you change the CI shape.
+# CI-tighter shape: pick one binary and override its per-binary
+# intdefines. Each binary owns its own knobs (design doc 2.5).
 nim c -r -d:release -d:danger --threads:on \
-  -d:MessageCount=1000000 -d:DefaultRuns=5 -d:WarmupRuns=2 \
-  -d:UnboundedMupsicMessageCount=500000 -d:UnboundedMupsicRuns=3 \
-  benchmarks/nim/bench_throughput.nim
-
-# Emit BMF JSON natively (no Python parser; see merge step below).
-./.tmp/bench_throughput --bmf-out=throughput.json
-python3 benchmarks/merge_bmf.py merged.json throughput.json
+  -d:BenchMpmcMessageCount=100000 -d:BenchMpmcRuns=5 -d:BenchMpmcWarmup=2 \
+  benchmarks/nim/bench_mpmc.nim
+
+# Emit BMF JSON natively (no Python parser; merge to combine).
+./.tmp/bench_spsc       --bmf-out=spsc.json
+./.tmp/bench_mpsc       --bmf-out=mpsc.json
+./.tmp/bench_mpmc       --bmf-out=mpmc.json
+./.tmp/bench_unbounded  --bmf-out=unbounded.json
+./.tmp/bench_latency    --bmf-out=latency.json
+python3 benchmarks/merge_bmf.py merged.json \
+  spsc.json mpsc.json mpmc.json unbounded.json latency.json
 ```
 
 ## Metrics
 
 - **Throughput**: `ops/ms` with N producer / N consumer threads
-  (mean, optional min/max for unbounded variants).
-- **Latency**: RTT nanoseconds with percentiles (p50, p95, p99, p999).
+  (mean, lower=mean-stddev, upper=mean+stddev).
+- **Latency**: RTT nanoseconds with percentiles (p50, p95, p99).
 
 ## Cloud benchmarking (Bencher.dev)
 
-`.github/workflows/bench.yml` runs `bench_throughput` on `ubuntu-latest`
-for every PR and every push to `main`/`devel`. The workflow:
+`.github/workflows/bench.yml` runs the five topology-split binaries on
+`ubuntu-latest` for every PR and every push to `main`/`devel` via a
+GitHub Actions matrix (one matrix entry per binary, each with its own
+`timeout-minutes: 12` budget). The workflow:
 
-1. Compiles `bench_throughput` with the CI run shape
-   (`-d:MessageCount=1000000 -d:DefaultRuns=5 -d:WarmupRuns=2
-   -d:UnboundedMupsicRuns=3 -d:UnboundedMupsicMessageCount=500000`).
-2. Runs `bench_throughput --bmf-out=throughput.json`, which writes
+1. Compiles each binary with its CI-tuned per-binary intdefines
+   (e.g. `-d:BenchSpscMessageCount=1000000 -d:BenchSpscRuns=5
+   -d:BenchSpscWarmup=2` for `bench_spsc`).
+2. Runs the binary with `--bmf-out=<binary>.json`, which writes
    Bencher Metric Format JSON natively.
-3. Runs `python3 benchmarks/merge_bmf.py merged.json throughput.json`
-   to produce a single `merged.json` for upload. The merge step is a
-   no-op union today, but stays in place for the per-topology binary
-   split landing in PR 2-4.
-4. Uploads `merged.json` to the `lockfreequeues` Bencher project via
-   the `bencherdev/bencher@main` action.
+3. Uploads each per-binary JSON as a GitHub Actions artifact.
+4. The dependent `bench-upload` job downloads every artifact, unions
+   them via `merge_bmf.py merged.json $(ls bmf-inputs/*.json)`, then
+   runs `superset_check.py tests/fixtures/pre-split-slugs.json
+   merged.json` to enforce deletion-safety. A single `bencher run`
+   uploads `merged.json` to the `lockfreequeues` Bencher project.
 
 On pull requests, Bencher posts a comparison comment against the base
 branch using `--start-point-clone-thresholds` and `--start-point-reset`,
 so threshold breaches show up inline.
 
-The workflow also runs on `workflow_dispatch` for ad-hoc baseline pinning.
+The workflow also runs on `workflow_dispatch` for ad-hoc baseline
+pinning.
 
 ### One-time setup (maintainer)
 
 The cloud workflow requires:
 
 1. A Bencher.dev project named `lockfreequeues`
    (create at https://bencher.dev with that exact slug).
-2. A repository secret `BENCHER_API_TOKEN` containing a Bencher API token
-   with write access to the project.
+2. A repository secret `BENCHER_API_TOKEN` containing a Bencher API
+   token with write access to the project.
 
 Until those exist the `bench` workflow will fail on the upload step;
 PR / push events still produce the `merged.json` artifact in the
@@ -92,28 +114,49 @@ job log so local debugging is possible without the upload.
       "value": <mean ops/ms>,
       "lower_value": <mean - stddev>,
       "upper_value": <mean + stddev>
-    }
+    },
+    "latency_p50_ns": {"value": <ns>},
+    "latency_p95_ns": {"value": <ns>},
+    "latency_p99_ns": {"value": <ns>}
   }
 }
 ```
 
 Slugs are alpha-sorted at the top level and measures are alpha-sorted
 within each slug. `lower_value` / `upper_value` are omitted when the
-emitter receives `NaN` sentinels for the bounds. Current slug set
-emitted by `bench_throughput`:
-
-- `lockfreequeues_sipsic/spsc/1p1c`
-- `lockfreequeues_mupmuc/mpmc/{1,2,4,8}p{1,2,4,8}c`
-- `lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c`
-- `nim_channels/mpmc/{1,2,4}p{1,2,4}c`
-
-## Running merge_bmf tests
+emitter receives `NaN` sentinels for the bounds. After `merge_bmf.py`
+unions the five binary fragments, a single slug can carry both
+`throughput_ops_ms` (from the matching topology binary) AND
+`latency_p50_ns` / `latency_p95_ns` / `latency_p99_ns` (from
+`bench_latency`) when the slug shape matches `1p1c` on a bounded
+variant.
+
+Current slug set emitted across the five binaries:
+
+- `bench_spsc`: `lockfreequeues_sipsic/spsc/1p1c`.
+- `bench_mpsc`: `lockfreequeues_mupsic/mpsc/{1,2,4}p1c`.
+- `bench_mpmc`: `lockfreequeues_mupmuc/mpmc/{1,2,4}p{1,2,4}c` plus
+  `lockfreequeues_mupmuc/mpmc/8p8c`,
+  `lockfreequeues_sipmuc/mpmc/1p{1,2,4}c`,
+  `nim_channels/mpmc/{1,2,4}p{1,2,4}c`.
+- `bench_unbounded`:
+  `lockfreequeues_unbounded_sipsic/spsc_unbounded/1p1c`,
+  `lockfreequeues_unbounded_sipmuc/mpmc_unbounded/1p{1,2,4}c`,
+  `lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c`,
+  `lockfreequeues_unbounded_mupmuc/mpmc_unbounded/{1,2,4}p{1,2,4}c`.
+- `bench_latency`:
+  `lockfreequeues_sipsic/spsc/1p1c`,
+  `lockfreequeues_sipmuc/mpmc/1p1c`,
+  `lockfreequeues_mupsic/mpsc/1p1c`,
+  `lockfreequeues_mupmuc/mpmc/1p1c`.
+
+## Running merge_bmf and superset_check tests
 
 ```bash
 python3 -m unittest benchmarks.tests.test_merge_bmf -v
+python3 -m unittest benchmarks.tests.test_superset_check -v
 ```
 
 The tests use only the Python standard library (`unittest`) and run in
-< 0.1s. They cover slug regex enforcement, measure regex enforcement,
-collision detection (with both colliding files named in stderr), and
-alpha-sorted output.
+under a second. They cover slug regex enforcement, measure regex
+enforcement, collision detection (with both colliding files named in
+stderr), alpha-sorted output, 5-input union (one fragment per
+topology binary), and the deletion-safety contract enforced by
+`superset_check.py`.
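
The merge contract the README describes (stateless union, failure on `(slug, measure)` collisions that names both inputs, alpha-sorted output) can be sketched as follows. This is a minimal illustration, not the code of `merge_bmf.py`: the names `merge_fragments`, `dump_sorted`, and `MergeCollision` are hypothetical, and the sketch raises an exception where the real script exits 1.

```python
import json


class MergeCollision(Exception):
    """Raised when two inputs define the same (slug, measure) pair."""


def merge_fragments(fragments):
    """Union BMF fragments given as {input_name: {slug: {measure: value_obj}}}."""
    merged = {}
    owner = {}  # (slug, measure) -> name of the input that defined it first
    for name, fragment in fragments.items():
        for slug, measures in fragment.items():
            for measure, value in measures.items():
                key = (slug, measure)
                if key in owner:
                    # Name both colliding inputs, as the README requires.
                    raise MergeCollision(
                        f"{slug}/{measure} defined by both {owner[key]} and {name}"
                    )
                owner[key] = name
                merged.setdefault(slug, {})[measure] = value
    return merged


def dump_sorted(merged):
    """Serialize with keys alpha-sorted at every nesting level."""
    return json.dumps(merged, indent=2, sort_keys=True)
```

Under these assumptions, a throughput fragment and a latency fragment that share the slug `lockfreequeues_sipsic/spsc/1p1c` merge into one slug carrying both measures, while feeding the same fragment in twice raises `MergeCollision`.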
diff --git a/benchmarks/nim/adapters/lockfreequeues_unbounded_mupsic_adapter.nim b/benchmarks/nim/adapters/lockfreequeues_unbounded_mupsic_adapter.nim
@@ -21,8 +21,9 @@
 ## consumer handle and exposes them so that bench code can register
 ## producers on the worker threads themselves.
 ##
-## The bench harness in `bench_throughput.nim` consumes this adapter
-## directly via specialized benchmark procs (mirroring the Mupmuc path).
+## The bench harness in `bench_unbounded.nim` consumes this adapter
+## directly via specialized benchmark procs (was `bench_throughput.nim`
+## prior to the PR 2 topology split).
 
 import lockfreequeues/unbounded_mupsic
 import debra
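
The deletion-safety contract that `superset_check.py` enforces in this PR (exit 0 when the post-split BMF covers every pre-split slug, exit 1 with the missing slugs alpha-listed on stderr) can be sketched as below. This is an illustration under stated assumptions, not the script itself: the function names are hypothetical, and because the fixture's exact layout is not shown in the diff, the sketch accepts either a BMF-shaped object keyed by slug or a plain list of slugs.

```python
import json
import sys


def slugs_of(doc):
    """Slug set of a BMF object (top-level keys) or of a plain list of slugs."""
    return set(doc.keys()) if isinstance(doc, dict) else set(doc)


def missing_slugs(pre_split, post_split):
    """Alpha-sorted slugs present pre-split but absent post-split."""
    return sorted(slugs_of(pre_split) - slugs_of(post_split))


def check(fixture_path, merged_path, err=None):
    """Return the exit code: 0 when every fixture slug survives, 1 otherwise."""
    err = sys.stderr if err is None else err
    with open(fixture_path) as f:
        pre = json.load(f)
    with open(merged_path) as f:
        post = json.load(f)
    missing = missing_slugs(pre, post)
    for slug in missing:
        print(slug, file=err)  # one missing slug per line, already sorted
    return 1 if missing else 0
```

A command-line wrapper would pass the two paths from `sys.argv` to `check` and hand its return value to `sys.exit`. Extra slugs in the post-split BMF are allowed by design; only deletions fail the check.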