bench-rollup PR 2: topology split (5 binaries) + per-step CI timeouts#21
elijahr wants to merge 18 commits into feat/bench-rollup-pr1-latency-mvp
Conversation
Code Review
This pull request decomposes the legacy bench_throughput.nim into four specialized topology-split binaries—bench_spsc, bench_mpsc, bench_mpmc, and bench_unbounded—to enable independent CI budgeting and improved timeout management. It introduces a superset_check.py utility to ensure benchmark coverage is maintained and updates the orchestration logic in runner.py and lockfreequeues.nimble. Review feedback suggests optimizing the new benchmark harnesses by replacing inefficient busy-wait loops with a backoff strategy when queues are full or empty to reduce CPU overhead.
) {.thread.} =
  for i in ctx.startIdx ..< ctx.startIdx + ctx.count:
    while not ctx.producer.push(T(i)):
      discard
The producer thread busy-waits with discard when the queue is full, which can lead to excessive CPU consumption. It would be more efficient to use a backoff strategy here, similar to what's done in other benchmark harnesses in this PR.
You will also need to add import lockfreequeues/backoff at the top of the file.
backoffOnPeerWait()
) {.thread.} =
  for i in ctx.startIdx ..< ctx.startIdx + ctx.count:
    while not ctx.queue[].push(T(i)):
      discard
The producer thread uses a tight loop with discard when the queue is full. This is inefficient as it will burn CPU cycles. It would be better to use a backoff strategy, similar to what is done in mupmucProducerThread.
The lockfreequeues/backoff module is already imported, so you can use backoffOnPeerWait() here.
backoffOnPeerWait()
let r = ctx.queue[].pop()
if r.isSome:
  inc local
The consumer threads in this file use a busy-wait loop when the queue is empty. This is inefficient and burns CPU cycles. It's better to use a backoff strategy.
This same issue is present in usipmucConsumerThread, the consumer loop in runOneUMupsicRun, and umupmucConsumerThread.
You should use backoffOnPeerWait() in an else branch when pop() returns nothing. You will also need to import lockfreequeues/backoff.
let r = ctx.queue[].pop()
if r.isSome:
  inc local
else:
  backoffOnPeerWait()
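The reviews above repeatedly ask for the same fix: replace busy-wait `discard` loops with a backoff call. A minimal Python analogue of that spin-then-sleep pattern is sketched below; `backoff_on_peer_wait` and its parameters are hypothetical illustrations of the idea, not the library's actual `backoffOnPeerWait()` implementation.

```python
import time

def backoff_on_peer_wait(attempt: int, max_spins: int = 16) -> None:
    """Hypothetical analogue of backoffOnPeerWait(): spin for the first
    few attempts, then sleep exponentially longer so a descheduled peer
    thread gets CPU time instead of burning it in a tight loop."""
    if attempt < max_spins:
        return  # pure spin: cheapest when the peer is about to make progress
    # cap the exponent so the sleep never exceeds ~1 ms
    delay_us = min(2 ** (attempt - max_spins), 1000)
    time.sleep(delay_us / 1_000_000)

def pop_with_backoff(queue_pop):
    """Retry queue_pop() until it yields a value, backing off between tries."""
    attempt = 0
    while True:
        item = queue_pop()
        if item is not None:
            return item
        backoff_on_peer_wait(attempt)
        attempt += 1
```

The early-spin phase keeps the fast path cheap; only a loop that has already failed many times pays the sleep cost.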
/gemini review
Code Review
This pull request refactors the benchmarking suite by splitting the monolithic bench_throughput.nim into five specialized binaries: bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, and bench_latency. It introduces a superset_check.py utility and a corresponding fixture to ensure no benchmark slugs are lost during this transition. The runner.py and lockfreequeues.nimble files are updated to orchestrate the new multi-binary pipeline. Feedback highlights a documentation inconsistency regarding latency percentiles and suggests cleaning up temporary JSON fragments in the runner script.
Old:
  (mean, optional min/max for unbounded variants).
  - **Latency**: RTT nanoseconds with percentiles (p50, p95, p99, p999).
New:
  (mean, lower=mean-stddev, upper=mean+stddev).
  - **Latency**: RTT nanoseconds with percentiles (p50, p95, p99).
There is an inconsistency between the metrics list and the BMF schema description. Line 67 mentions p95 as a supported percentile for latency, but the schema example (lines 118-119) and the merge test case (lines 235-236 in test_merge_bmf.py) only include p50 and p99. Please update the documentation to reflect the actual percentiles emitted by the latency benchmark.
for bin_name in NIM_BINARIES:
    bin_path = PROJECT_ROOT / ".tmp" / bin_name
    out = output_file.parent / f"{output_file.stem}-{bin_name}.json"
    print(f"  -> {bin_name} -> {out.name}")
    subprocess.run(
        [str(bin_path), f"--bmf-out={out}"],
        check=True,
    )
    fragments.append(out)
# Merge the 5 BMF fragments into the requested output file via
# merge_bmf.py. Unions per-slug measure dicts and exits 1 on
# (slug, measure) collisions.
print(f"Merging {len(fragments)} fragments -> {output_file}")
subprocess.run(
    [
        sys.executable,
        str(BENCHMARK_DIR / "merge_bmf.py"),
        str(output_file),
        *[str(p) for p in fragments],
    ],
    check=True,
)
print(f"Combined results written to {output_file}")
The run_nim function creates multiple timestamped fragment files in the results directory (e.g., nim-TIMESTAMP-bench_spsc.json) which are never deleted. This leads to unnecessary disk clutter after every benchmark run, especially since the merged file is already preserved. It is recommended to clean up these temporary fragments after the merge step is completed.
print(f"Running Nim topology-split benchmarks ({runs} runs)...")
fragments: list[Path] = []
try:
    for bin_name in NIM_BINARIES:
        bin_path = PROJECT_ROOT / ".tmp" / bin_name
        out = output_file.parent / f"{output_file.stem}-{bin_name}.json"
        print(f"  -> {bin_name} -> {out.name}")
        subprocess.run(
            [str(bin_path), f"--bmf-out={out}"],
            check=True,
        )
        fragments.append(out)
    # Merge the 5 BMF fragments into the requested output file via
    # merge_bmf.py. Unions per-slug measure dicts and exits 1 on
    # (slug, measure) collisions.
    print(f"Merging {len(fragments)} fragments -> {output_file}")
    subprocess.run(
        [
            sys.executable,
            str(BENCHMARK_DIR / "merge_bmf.py"),
            str(output_file),
            *[str(p) for p in fragments],
        ],
        check=True,
    )
    print(f"Combined results written to {output_file}")
finally:
    for p in fragments:
        try:
            p.unlink(missing_ok=True)
        except OSError:
            pass
…iles

Per Gemini review on PR #21.

1. README.md schema example listed only latency_p50_ns + latency_p99_ns, but bench_latency.nim emits all three (p50/p95/p99). Added p95 to the schema block and the prose paragraph below it. The Metrics section already correctly listed p50/p95/p99.
2. runner.py left N timestamped fragment files in RESULTS_DIR after each merged run (one per Nim binary); only the merged file is the canonical artifact. Wrap the run+merge in try/finally and unlink each fragment after, so a bench-binary failure or a merge collision still cleans up whatever was produced.
/gemini review

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

/review
Code Review by Qodo
1. Bench-upload skipped on failure
bench-upload:
  # Merges per-binary BMF artifacts into a single JSON before a single
1. Bench-upload skipped on failure 🐞 Bug ☼ Reliability
bench-upload won’t run when any bench matrix entry fails because it uses needs: [bench] without an if: always() override, so no partial merge/upload happens. This contradicts the workflow’s own comment that surviving binaries will still be merged and uploaded when one variant hangs.
Agent Prompt
### Issue description
`bench-upload` is intended to merge/upload whatever benchmark artifacts were successfully produced even if one matrix entry fails/hangs, but it currently won’t run if any `bench` matrix leg fails.
### Issue Context
- `bench` is a matrix job with `fail-fast: false`, so other legs continue.
- However, `bench-upload` has `needs: [bench]` and no `if: always()` override, so it’s skipped when any matrix leg fails.
### Fix Focus Areas
- .github/workflows/bench.yml[33-41]
- .github/workflows/bench.yml[172-181]
### Suggested fix
- Change `bench-upload` to run even if `bench` fails, while still skipping fork PRs:
- e.g. `if: always() && (github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository)`
- Optionally make the per-matrix artifact upload more resilient:
- Add `if: always()` to the upload-artifact step and set `if-no-files-found: ignore` so a timed-out run doesn’t block uploading a partially-written JSON (if present).
- Add a clear guard before merge to fail with a helpful error if `bmf-inputs/` contains zero `*.json` files.
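The suggested fix above can be sketched as a workflow fragment. This is a hypothetical illustration assembled from the conditions quoted in the finding, not the actual contents of `.github/workflows/bench.yml`:

```yaml
# Hypothetical bench.yml fragment showing the two suggested guards.
bench-upload:
  needs: [bench]
  # Run even when a bench matrix leg failed, but keep skipping fork PRs.
  if: >-
    always() &&
    (github.event_name != 'pull_request' ||
     github.event.pull_request.head.repo.full_name == github.repository)

# In the bench matrix job, make the per-binary artifact upload tolerant
# of a timed-out or crashed leg (step name and path are illustrative):
# - uses: actions/upload-artifact@v4
#   if: always()
#   with:
#     path: results/*.json
#     if-no-files-found: ignore
```

With `fail-fast: false` on the matrix plus `always()` on the downstream job, one hung binary costs only its own artifact rather than the whole merge/upload.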
/review

Persistent review updated to latest commit 9585d15

@gemini-code-assist review
Code Review
This pull request replaces the legacy bench_throughput.nim with five specialized topology-split binaries to allow independent CI budgeting and introduces a superset_check.py script to prevent benchmark coverage regressions. Review feedback highlights several improvement opportunities in the new harnesses, specifically recommending the addition of backoffOnPeerWait() to prevent busy-spinning in producer and consumer threads. Additionally, concerns were raised regarding memory safety in bench_unbounded.nim, suggesting the use of proper destructors instead of reset() and avoiding manual memory management for DebraManager to prevent resource leaks under ARC/ORC.
addr q, manager, messageCount)
reset(q)
Using reset() on types with custom destructors (like UnboundedSipmuc and DebraManager) is dangerous in modern Nim (ARC/ORC). reset merely zeros the memory and does NOT call the destructor, which will leak segments and other heap-allocated resources. Since these variables are local to the loop iteration, they will be destroyed automatically at the end of the loop. If manual destruction is required for the pointer manager, use =destroy(manager[]) before dealloc.
# reset(q) is redundant; stack object is destroyed at loop end.
=destroy(manager[])
dealloc(manager)
while not ctx.queue[].push(T(i)):
  discard
The Sipmuc producer thread busy-spins without backoff when the queue is full. This can lead to high CPU usage and starvation of consumer threads, especially in oversubscribed environments. To maintain consistency with the Mupmuc harness and the library's own livelock fixes, please add a backoff call here.
while not ctx.queue[].push(T(i)):
  backoffOnPeerWait()
while not ctx.producer.push(T(i)):
  discard
let item = ctx.consumer.pop()
if item.isSome:
  inc local
var manager = create(DebraManager[MaxThreads])
manager[] = initDebraManager[MaxThreads]()
Manual memory management via create and dealloc for DebraManager is error-prone and requires explicit destructor calls in ARC/ORC. Unless there is a specific reason to avoid the stack (e.g., object size), consider using a standard stack variable or new for a ref type to ensure safe resource management.
let r = consumer.pop()
if r.isSome:
  inc local
/ai-review

✅ Momus review posted — verdict APPROVE, 1 finding
This PR splits the legacy bench_throughput.nim into five topology-specific binaries (bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency), rewires CI to a GitHub Actions matrix with per-step timeouts, and adds a deletion-safety superset check against a pre-split slug fixture. The overall structure is sound. Key issues: (1) the CHANGELOG, README, and PR body document a timeout-minutes of 12 but the actual workflow code uses 18, a docs/code inconsistency introduced in the PR itself; (2) the Bench*TestCompileTime compile-time assertion blocks in all four new binaries are dead code — never activated by tests or CI and would fail if combined with test overrides.
Severity tally: 1 Low.
Low
- BOT-A3 (benchmarks/runner.py:58): runner.py build_nim() omits -d:danger flag present in CI compilation
Noteworthy
- The deletion-safety guard via superset_check.py + pre-split-slugs.json fixture is a thoughtful defense against silent slug regression across the binary split.
- The CI matrix with per-binary timeouts and `if: always()` on bench-upload provides real partial-coverage resilience against individual binary hangs.
- The 5-input merge test in test_merge_bmf.py correctly exercises the production pipeline shape with a shared slug carrying both throughput and latency measures.
Verdict: APPROVE.
Commands
- Comment /ai-review or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
- Reply to a finding with "won't fix", "by design", or "not a bug" to decline it.
- Reply with "instead, ..." to propose an alternative fix.

Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.
"""Run all 5 Nim topology-split binaries and merge their BMF outputs.

The `runs` argument is honored at compile time (via per-binary
`-d:Bench<Topo>Runs=<n>` overrides) — kept as a no-op runtime arg
BOT-A3 — Low (quality)
runner.py build_nim() omits -d:danger flag present in CI compilation
The CI bench.yml compiles each binary with nim c -d:release -d:danger --threads:on. The local runner.py build_nim() only passes -d:release --threads:on, omitting -d:danger. This means local benchmark runs via runner.py will produce different (slower) throughput numbers than CI, breaking reproducibility for developers.
Add `-d:danger` to the nim compile command in build_nim() to match CI.
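One way to prevent this kind of drift is to keep the CI flag set in a single list that the build helper consumes. A minimal sketch of that idea follows; `build_nim_cmd` and `CI_NIM_FLAGS` are illustrative names, not the actual `runner.py` API.

```python
from pathlib import Path

# Flags the CI workflow uses (per the finding above). Keeping them in one
# place makes it harder for runner.py and bench.yml to diverge again.
CI_NIM_FLAGS = ["-d:release", "-d:danger", "--threads:on"]

def build_nim_cmd(source: Path, out_dir: Path) -> list[str]:
    """Assemble a `nim c` compile command for one benchmark binary,
    reusing the shared flag list so local numbers match the CI baseline."""
    return [
        "nim", "c",
        *CI_NIM_FLAGS,
        f"--out={out_dir / source.stem}",
        str(source),
    ]
```

A unit test asserting `"-d:danger" in build_nim_cmd(...)` would have caught the omission before it reached review.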
Snapshots the BMF emitted by the legacy bench_throughput at the same
small-override knobs the topology-split tests will use, then commits
the JSON to tests/fixtures/pre-split-slugs.json. Track 2's deletion-
safety check (Task 2.7) verifies that the post-split union is a strict
superset of this fixture so no slug silently disappears with the split.
Adds tests/t_topology_split.nim as the RED test stub: the fixture-shape
suite (Task 2.1) passes against the committed JSON; the bench_spsc /
bench_mpsc / bench_mpmc / bench_unbounded suites (Tasks 2.3-2.6) and
the superset suite (Task 2.7) fail loudly until those binaries +
the superset_check.py helper land.
Fixture covers 11 slugs:
sipsic spsc/1p1c
mupmuc mpmc/{1,2,4,8}p{1,2,4,8}c (8p8c is the issue #15 livelock case)
unbounded_mupsic mpsc_unbounded/{1,2,4}p1c
nim_channels mpmc/{1,2,4}p{1,2,4}c
Splits the SPSC slice out of the legacy bench_throughput.nim. Wraps SipsicAdapter in a uniform queueInit closure and runs through runThroughputHarness so this binary stays under 100 lines and shares the per-run shaping (warmup, run-count, monotonic-ns timing, remainder-spread distribution) with the rest of the topology split.

Per-binary intdefines (design 2.5):
-d:BenchSpscRuns (default 33)
-d:BenchSpscMessageCount (default 1_000_000)
-d:BenchSpscWarmup (default 3)

Behind -d:BenchSpscTestCompileTime the binary's `static` block asserts the three defaults. The runtime + BMF emission contract is covered by tests/t_topology_split.nim's bench_spsc suite (Task 2.3), which compiles at 1000 messages * 2 runs and parses the emitted lockfreequeues_sipsic/spsc/1p1c slug.
Splits the MPSC slice out of the legacy bench_throughput.nim. Mupsic
requires per-thread Producer objects via queue.getProducer(idx); the
generic runThroughputHarness assumes a uniform queue.push call, so this
binary keeps the bespoke harness pattern from the legacy file:
pre-assign Producer values on the main thread (idx = 0..P-1) and ship
them into worker thread contexts. The shape-iteration loop instantiates
three concrete Mupsic types {1,2,4}p1c so P stays a static int.
Per-binary intdefines (design 2.5):
-d:BenchMpscRuns (default 33)
-d:BenchMpscMessageCount (default 1_000_000)
-d:BenchMpscWarmup (default 3)
Slug shape: lockfreequeues_mupsic/mpsc/{1,2,4}p1c. Behind
-d:BenchMpscTestCompileTime the binary's static block asserts the three
defaults. Runtime behavior covered by tests/t_topology_split.nim's
bench_mpsc suite.
Splits the MPMC slice out of the legacy bench_throughput.nim into a
single binary covering three queue families. Mupmuc and Sipmuc both
need per-thread Producer / Consumer values via getProducer(idx) /
getConsumer(idx); the harness pre-assigns those on the main thread and
ships them into worker contexts (mirrors the bench_throughput Mupmuc
pattern). Channels has a uniform push/pop interface and runs through
bench_common.runThroughputHarness.
Coverage:
Mupmuc {1,2,4} P x {1,2,4} C grid (9 shapes) plus 8p8c. 8p8c is the
issue #15 livelock regression case preserved from the pre-split
bench_throughput fixture so the deletion-safety check does not
silently lose that coverage.
Sipmuc 1p{1,2,4}c (single-producer, multi-consumer; lives under mpmc
per design 2.4).
channels {1,2,4} P x {1,2,4} C grid.
Per-binary intdefines (design 2.5):
-d:BenchMpmcRuns (default 33)
-d:BenchMpmcMessageCount (default 1_000_000)
-d:BenchMpmcWarmup (default 3)
Both lockfreequeues mupmuc and sipmuc export a `Consumer` symbol; the
contexts qualify with `mupmuc.Consumer` / `sipmuc.Consumer` to keep
type resolution unambiguous.
Splits the unbounded slice out of the legacy bench_throughput.nim.
Covers all 4 lockfreequeues unbounded variants at their natural shapes:
unbounded_sipsic spsc_unbounded/1p1c (no DEBRA)
unbounded_sipmuc mpmc_unbounded/1p{1,2,4}c (DEBRA on consumers)
unbounded_mupsic mpsc_unbounded/{1,2,4}p1c (DEBRA on producers)
unbounded_mupmuc mpmc_unbounded/{1,2,4}p{1,2,4}c grid (DEBRA both sides)
UnboundedSipsic is a plain SPSC pair; the other three use DEBRA epoch
reclamation, so producer/consumer threads call registerThread(manager)
on their own thread before queue.getProducer(handle) /
queue.getConsumer(handle). UnboundedMupsic reuses the existing
UnboundedMupsicAdapter (which pre-registers the consumer at adapter
init) and drives pop from the calling thread, mirroring the legacy
bench_throughput path. UnboundedSipmuc and UnboundedMupmuc are
constructed inline with a heap-pointer DebraManager since their
constructors take a `ptr DebraManager`.
Per-variant intdefines (design 2.5):
-d:UnboundedSipsicRuns / UnboundedSipsicMessageCount (3 / 500_000)
-d:UnboundedSipmucRuns / UnboundedSipmucMessageCount (3 / 500_000)
-d:UnboundedMupsicRuns / UnboundedMupsicMessageCount (3 / 500_000)
-d:UnboundedMupmucRuns / UnboundedMupmucMessageCount (3 / 500_000)
-d:BenchUnboundedWarmup (default 2)
Per-variant defines are chosen over a shared BenchUnboundedRuns so CI
can budget unbounded_mupmuc/4p4c independently from the cheaper
unbounded_sipsic/1p1c (design 2.5 rationale).
Adds benchmarks/scripts/superset_check.py: a slug-set subset check that exits 0 when the post-split BMF (union of bench_spsc + bench_mpsc + bench_mpmc + bench_unbounded after merge_bmf.py) covers every slug in the pre-split fixture, and exits 1 with the missing slugs listed on stderr otherwise. The check operates on JSON top-level keys only; it does not compare measure values, bounds, or ordering, so the harness is free to reshape per-run statistics across the split without tripping the deletion-safety guard.

Also wires the check into tests/t_topology_split.nim's deletion-safety suite (compile + run all four binaries at small overrides, merge via merge_bmf.py, then invoke superset_check.py against the committed fixture and assert exit 0).

benchmarks/tests/test_superset_check.py covers 9 cases: identical sets, strict superset, missing slugs (with alpha-sorted listing), empty pre, wrong arg count, missing pre file, malformed JSON, and top-level non-object. All exit codes use 1 for failure to match the existing benchmarks/merge_bmf.py contract.
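The core of the check described above fits in a few lines. This is a hedged sketch of the same contract (top-level keys only, alpha-sorted stderr listing, exit 0/1), not the committed superset_check.py; `missing_slugs` and `check` are illustrative names.

```python
import json
import sys
from pathlib import Path

def missing_slugs(pre_path: Path, post_path: Path) -> list[str]:
    """Return pre-split slugs absent from the post-split BMF.

    Only top-level JSON keys (slugs) are compared; measure values,
    bounds, and ordering are deliberately ignored."""
    pre = set(json.loads(pre_path.read_text()))
    post = set(json.loads(post_path.read_text()))
    return sorted(pre - post)  # alpha-sorted, matching the error listing

def check(pre: str, post: str) -> int:
    """Exit-code-style result: 0 = superset holds, 1 = slugs missing."""
    missing = missing_slugs(Path(pre), Path(post))
    if missing:
        print("missing slugs:", *missing, sep="\n  ", file=sys.stderr)
        return 1
    return 0
```

Operating on key sets alone is what lets the harness reshape per-run statistics without tripping the guard.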
…imeouts

Replaces the legacy bench-throughput + bench-latency job pair with a single matrix job over the 5 topology-split binaries. Each matrix entry runs in its own GitHub Actions job with an independent `timeout-minutes: 12` budget so a hang in one binary cannot burn the whole workflow's clock; the surviving binaries finish, the bench-upload job merges what arrived, and the operator gets partial Bencher coverage rather than no coverage.

The compile step's `case` sets per-binary `-d:` overrides matching design 2.5: bounded throughput at 1M messages * 5 runs * 2 warmup, unbounded at 500K * 3 runs * 2 warmup, latency at 50K messages * 11 runs * 2 warmup.

bench-upload now also runs the deletion-safety check (superset_check.py vs tests/fixtures/pre-split-slugs.json) right after merge_bmf.py, so any silent slug regression introduced by future edits to the topology binaries fails the PR check.

Artifact naming changes: each matrix entry uploads `bench-<binary>-bmf` (e.g. `bench-bench_spsc-bmf`); the upload job's download pattern is unchanged (`bench-*-bmf`) so it picks up every new artifact without further edits.

actionlint clean; yamllint shows only the pre-existing "missing document start" warning shared across all repo workflows.
Adds test_five_input_union covering the upload-job pipeline shape that PR 2's matrix produces: five sibling BMF fragments (one per binary — bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency) arrive via actions/download-artifact, are merged via merge_bmf.py, and must produce a single output whose slug set is the disjoint union of the inputs.

The test deliberately collides the sipsic 1p1c slug between bench_spsc (throughput_ops_ms) and bench_latency (latency_p50_ns, latency_p99_ns) on the SLUG axis with disjoint MEASURE keys, mirroring the production multi-measure-per-slug pattern from Track 1 Task 1.5. The merge must preserve every measure on the shared slug; collisions on (slug, measure) pairs continue to exit 1 (covered by the existing test_collision_exits_1 case, no regression).
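The merge semantics this test pins down (slugs may repeat across fragments, (slug, measure) pairs may not) can be sketched as follows. This is an illustrative model of the contract, assuming fragments are already loaded as dicts; it is not the real merge_bmf.py.

```python
def merge_bmf(fragments: list[dict]) -> dict:
    """Union per-slug measure dicts across BMF fragments.

    A slug may legitimately appear in several fragments (e.g. throughput
    from bench_spsc and latency from bench_latency); only a duplicate
    (slug, measure) pair is an error, mirroring merge_bmf.py's exit-1
    collision contract."""
    merged: dict[str, dict] = {}
    for frag in fragments:
        for slug, measures in frag.items():
            bucket = merged.setdefault(slug, {})
            for measure, value in measures.items():
                if measure in bucket:
                    raise SystemExit(f"collision: ({slug}, {measure})")
                bucket[measure] = value
    return merged
```

The shared-slug case is exactly the production shape: one slug carrying both throughput and latency measures after the five-way merge.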
Removes benchmarks/nim/bench_throughput.nim now that the five topology-
split binaries (bench_spsc / bench_mpsc / bench_mpmc / bench_unbounded
+ bench_latency from PR 1) cover its full slug surface. The pre-split
fixture committed in Task 2.1 plus the superset_check.py guard wired
into bench.yml in Task 2.8 enforce that no slug from the legacy binary
silently disappears across the split.
Live consumers updated to reference the new binaries:
benchmarks/runner.py - builds + runs all 5 binaries, then
merges fragments via merge_bmf.py.
lockfreequeues.nimble - `nimble benchmarks` task iterates the
5 binaries and merges to results/latest.json.
benchmarks/README.md - rewritten to describe the 5-binary
pipeline, the matrix CI job, and the
merged BMF schema (single slug can carry
both throughput and latency measures).
.gitignore - tracks the four new throughput binaries;
also ignores the worktree-local `deps`
symlink that nim.cfg's relative path
expects.
tests/t_bench_common.nim - removes the obsolete Task 0.10
bench_throughput integration suite (now
covered by t_topology_split.nim per
topology binary).
benchmarks/nim/adapters/lockfreequeues_unbounded_mupsic_adapter.nim
- doc-comment rewires from bench_throughput
to bench_unbounded.
Comment-only references in bench_common.nim, bench_latency.nim,
render_readme.nim, and the new bench_*.nim files describe the legacy
binary as historical context (the legacy `runThroughputBenchmark`
proc / Mupmuc registration pattern that bench_common.runThroughputHarness
factored out) and stay verbatim.
Adds Track 2 PR 2 entries to the [Unreleased] section under Added,
Changed, and Removed:
- Added: 4 new topology-split throughput binaries (bench_spsc,
bench_mpsc, bench_mpmc, bench_unbounded), superset_check.py
deletion-safety guard, 5-input merge_bmf union test.
- Changed: bench.yml matrix over 5 binaries with per-step timeouts
+ superset_check.py wired into bench-upload; runner.py and
`nimble benchmarks` task iterate the 5 binaries; README rewrite.
- Removed: bench_throughput.nim (replaced by the 4 topology-split
binaries; pre-split fixture + superset_check.py guard pin
deletion-safety).
bench_unbounded matrix item was timing out at 10 minutes after only 4 of 16 shapes (sipsic 1p1c, sipmuc 1p1c/1p2c/1p4c) due to oversubscribed shapes on 4-vCPU runners. Reduce message counts and runs across the heavy fan-out variants (mupmuc 9-shape grid is the main offender) and bump the job timeout to 18 min for cushion. Statistical signal is preserved for Bencher trend tracking; absolute throughput numbers will shift but per-shape comparisons remain meaningful.

- UnboundedSipsicMessageCount: 500K -> 200K (1 shape, runs kept at 3)
- UnboundedSipmuc: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupsic: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupmuc: 500Kx3 -> 50Kx2 (9 shapes)
- BenchUnboundedWarmup: 2 -> 1
- timeout-minutes: 12 -> 18
bench_unbounded was hanging in CI on the unbounded_sipmuc 1p4c shape:
4 consumers compete in a strict round-robin slot order (each consumer
must take its turn before another can pop), and on a 4-vCPU runner
one descheduled consumer blocks every other. The same starvation
pattern hits every unbounded mupmuc shape with C>=4.
Add `-d:BenchSkipOversubscribed` (set in PR CI bench.yml) that drops:
- unbounded_sipmuc/mpmc_unbounded/1p4c
- unbounded_mupmuc/mpmc_unbounded/{1p4c, 2p4c, 4p1c, 4p2c, 4p4c}
mupsic shapes (multi-producer, single consumer) are kept regardless:
the round-robin ordering exists on consumers only, so a single consumer
cannot starve. mupsic 4p1c is also in tests/fixtures/pre-split-slugs.json
and the deletion-safety guard would fail without it.
The full grid still runs in `bench-comparison.yml` nightly cron / on
workflow_dispatch where a beefier runner is available. PR CI now gets
the meaningful subset (4 mupmuc shapes covering each P,C in {1,2})
that fits a 4-vCPU runner without hanging.
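The shape-dropping rule described in this commit can be modeled as an explicit skip list gated by the define, which is how the commit itself enumerates it. A hedged Python sketch (the set literal below copies the shapes listed above; `keep_shape` is an illustrative name, not the Nim define logic):

```python
# (variant, producers, consumers) shapes dropped on PR CI when the
# -d:BenchSkipOversubscribed define is set, per the commit message above.
DROPPED_ON_PR_CI = {
    ("unbounded_sipmuc", 1, 4),
    ("unbounded_mupmuc", 1, 4),
    ("unbounded_mupmuc", 2, 4),
    ("unbounded_mupmuc", 4, 1),
    ("unbounded_mupmuc", 4, 2),
    ("unbounded_mupmuc", 4, 4),
}

def keep_shape(variant: str, p: int, c: int, skip_oversubscribed: bool) -> bool:
    """PR CI drops shapes where consumer-side round-robin can starve on a
    4-vCPU runner; nightly runs (skip_oversubscribed=False) keep the grid."""
    if not skip_oversubscribed:
        return True
    return (variant, p, c) not in DROPPED_ON_PR_CI
```

Note that mupsic shapes never appear in the skip set: with a single consumer, the consumer-side round-robin ordering cannot starve, so 4p1c stays and the deletion-safety fixture remains satisfied.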
Three CI hardening fixes from code review on PR 2:

1. bench-upload had needs:[bench] without if:always(), so any single matrix-leg failure would skip the merge+upload entirely. Per the job comment the design is partial-coverage tolerant: surviving binaries should still be merged and uploaded. Add if: always() && (fork-PR guard) and a guard step that fails loudly when zero JSON fragments are present.
2. The per-binary upload-artifact step now runs with if: always() and if-no-files-found: ignore so a step-level timeout flushes any partial JSON the binary wrote, and a binary that crashed before writing the file does not turn the matrix leg red for the merger.
3. Job and step comments referenced a 12-min timeout while the configured value was 18 min. Update both comments to 18.
bench-upload runs with if: always(), so a matrix leg that crashed mid-write can leave a truncated JSON fragment. merge_bmf.py exits 1 on the first malformed input, blocking uploads from surviving legs. Pre-validate with json.load and drop any that fail, emitting a ::warning:: per dropped file.
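The fragment pre-validation described above can be sketched as a small filter run before merge_bmf.py. A hedged illustration; `valid_fragments` is an assumed helper name, and the `::warning::` line uses the GitHub Actions workflow-command syntax mentioned in the commit:

```python
import json
from pathlib import Path

def valid_fragments(paths: list[Path]) -> list[Path]:
    """Drop BMF fragments that fail to parse (e.g. truncated mid-write by
    a timed-out matrix leg), emitting a GitHub Actions warning per dropped
    file, so the merge step only ever sees well-formed JSON."""
    kept: list[Path] = []
    for p in paths:
        try:
            json.loads(p.read_text())
        except (OSError, json.JSONDecodeError):
            print(f"::warning::dropping malformed BMF fragment {p.name}")
            continue
        kept.append(p)
    return kept
```

Without this filter, merge_bmf.py's exit-1-on-first-malformed-input behavior would let one crashed leg block uploads from every surviving leg.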
…order

- Add backoffOnPeerWait to producer/consumer busy loops in bench_mpmc (Sipmuc producer + Sipmuc consumer + Mupmuc consumer) and bench_mpsc (Mupsic producer + consumer) to avoid CPU starvation in oversubscribed runs. Mupmuc producer already had it.
- Add backoffOnPeerWait to all four UnboundedSipsic/USipmuc/UMupsic/UMupmuc consumer loops in bench_unbounded.nim.
- Document the per-iteration teardown order in runUSipmucShape and reference it from runUMupmucShape: reset(q) MUST precede reset(manager[]) because UnboundedSipmuc/Mupmuc =destroy unbinds from the manager, which then asserts clientCount==0. Note that under Nim 2.x ARC/ORC, system.reset(x) is =destroy(x) + =wasMoved(x) — it does NOT skip destructors (verified against project's own =destroy implementations in src/lockfreequeues/unbounded_sipmuc.nim:reset(self.manager[])).
…n-uninit

Reviewer flagged that `create(DebraManager[...])` followed by `manager[] = initDebraManager[...]()` "runs destructors on uninitialised memory." It does not. `proc create*(T: typedesc, size)` in system/memalloc.nim is `cast[ptr T](alloc0(sizeof(T) * size))` — zero-initialised allocation with no hook invocation. The follow-up assignment then performs the only construction.

Add an explanatory note in the per-iteration teardown block comment (runUSipmucShape) so this doesn't keep getting flagged. The other harness (runUMupmucShape) already back-references that block comment. No behaviour change.
CI compiles topology-split bench binaries with -d:release -d:danger; runner.py was missing -d:danger, so locally produced numbers were not comparable to the cloud baseline.
…ts (#19)

* feat(bench): stub bench_common module with public API surface (Task 0.1)

Compile-only stub for benchmarks/nim/bench_common.nim. Types and proc signatures are final per design doc section 2.1; bodies raise AssertionDefect with 'not implemented' so tasks 0.2-0.6 can be written and reviewed in parallel against this surface.

Public surface stubbed:
- Topology enum (tSpsc/tMpsc/tMpmc + unbounded variants)
- PushResult / PopResult adapter primitives
- BMFEmitter + initBMFEmitter / addMeasure / emit
- Stats helpers (mean / stddev / minVal / maxVal / percentile)
- Histogram + initHistogram / record / topK / percentile (HistogramTopK = 1000, HistogramReservoir = 99000)
- LatencyMetrics + runLatencyHarness
- ThroughputMetrics + runThroughputHarness

Test (tests/t_bench_common.nim) is compile-time-only: it references each promised symbol via 'when declared' and 'when compiles', leaving behavior tests to subsequent tasks. Stub file: 170 LOC (acceptance cap: 200).

* feat(bench): implement BMFEmitter with alpha-sorted output (Task 0.2)

initBMFEmitter / addMeasure / emit per design doc section 2.1.
- addMeasure stores via Table[slug, Table[measure, MeasureValue]]; NaN sentinel for lower/upper bounds maps to none[float](), finite values map to some(v).
- emit writes JSON with slugs alpha-sorted at the top level and measures alpha-sorted within each slug. lower_value/upper_value fields are emitted only when the corresponding bound is some.
- Output format: pretty(root, 2) plus a trailing newline so diffs in CI / local-dev surface line-level changes cleanly.

Tests (tests/t_bench_common.nim): empty emit produces {}, two slugs alpha-ordered, measures alpha-ordered within slug, NaN bounds omitted from output, finite bounds emitted as lower_value/upper_value.

* feat(bench): implement Stats helpers (Task 0.4)

mean / stddev / minVal / maxVal / percentile bodies.
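The Stats helper contract described in this commit (interpolated percentile matching numpy's default, sample stddev with ddof=1, 0.0 on empty input) can be sketched in Python. This is an illustrative model of the documented behavior, not the Nim implementation:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Sort-based percentile with linear interpolation between adjacent
    samples (numpy's default 'linear' method); p is in [0, 1].
    Returns 0.0 on empty input, per the legacy stats.nim contract."""
    if not samples:
        return 0.0
    xs = sorted(samples)
    rank = p * (len(xs) - 1)
    lo = math.floor(rank)
    hi = math.ceil(rank)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)

def stddev(samples: list[float]) -> float:
    """Sample standard deviation (ddof=1), matching the numpy default."""
    n = len(samples)
    if n < 2:
        return 0.0
    m = sum(samples) / n
    return math.sqrt(sum((x - m) ** 2 for x in samples) / (n - 1))
```

The `p * (len - 1)` rank formula is what makes p=0.5 on [0..99] yield 49.5 and p=0.0 / p=1.0 collapse to min and max, the exact parity cases the tests assert.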
Sort-based percentile uses linear interpolation between adjacent samples; matches numpy's default percentile interpolation ('linear'). stddev uses ddof=1 (sample) for consistency with numpy default. Empty-input behavior matches the legacy benchmarks/nim/stats.nim that this module replaces (returns 0.0 across the board).

Tests: numpy-baseline parity for stddev on [1,2,3,4]; min/max on unordered input; percentile(p=0.5) on [0..99] = 49.5; percentile extremes (p=0.0, p=1.0) match min and max.

* feat(bench): implement Histogram top-K + reservoir percentile (Task 0.3)

Online percentile tracker per design doc section 2.1.
- topKHeap: min-heap of HistogramTopK = 1000 largest values seen. record() pushes if heap < K; else replace if value > heap[0] (evicts the min into the body bucket via reservoirAdmit).
- reservoir: Algorithm R uniform sample of body distribution, size HistogramReservoir = 99_000.
- debugMode (initHistogram(true)): records every sample to debugAll and percentile() returns the exact sort answer over debugAll. Used as the test oracle and as a debug-only fallback per design 2.1.

Lookup rule with stratified-estimator math:
- p == 1.0 -> top-K max
- seenAll*(1-p) <= topKHeap.len -> top-K rank lookup (linear interp)
- otherwise -> reservoir, with p rescaled to p_body = (p*seenAll) / (seenAll - K) since the reservoir samples ONLY the body stratum (values that lost the top-K race). Without rescaling a reservoir-based p99 underestimates by ~25% on log-normal data.

Tests: p99 within 1% of sort fallback on 100K log-normal samples; percentile(1.0) == top-K max; p50 within 5% of sort fallback; debug mode returns exact sort answer; topK() returns ascending-sorted samples.

* feat(bench): implement runThroughputHarness (Task 0.5)

Generic N-producer / N-consumer throughput runner factored out of the legacy benchmarks/nim/bench_throughput.nim.
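The stratified lookup rule described for Histogram can be sketched in Python. This is a simplified model, not the Nim implementation: the body is kept in full rather than reservoir-sampled, and the top-K local rank mapping is one plausible choice; only the p_body rescale formula is quoted from the commit.

```python
def linear_pct(sorted_vals, p):
    # numpy-default 'linear' interpolation between adjacent order statistics
    idx = p * (len(sorted_vals) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (idx - lo) * (sorted_vals[hi] - sorted_vals[lo])

def stratified_percentile(top, body, seen_all, p):
    # top: the K largest values seen; body: values that lost the top-K race
    k = len(top)
    if p == 1.0:
        return max(top)  # exact tail max lives in the top-K stratum
    if seen_all * (1.0 - p) <= k:
        # quantile falls inside the top-K stratum: map to a local rank
        # (hypothetical mapping, not the exact Nim rank math)
        p_top = (p * seen_all - (seen_all - k)) / max(k - 1, 1)
        return linear_pct(sorted(top), min(max(p_top, 0.0), 1.0))
    # body stratum samples ONLY values that lost the top-K race,
    # so rescale p before the lookup: p_body = (p*seenAll)/(seenAll - K)
    p_body = (p * seen_all) / (seen_all - k)
    return linear_pct(sorted(body), min(p_body, 1.0))
```

Without the p_body rescale, a body-only lookup answers a quantile of the truncated distribution, which is why the commit reports a ~25% p99 underestimate on log-normal data.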
Parameterized over the queue init proc and the queue type Q (constrained at instantiation time via 'mixin push' / 'mixin pop' so adapter-specific push/pop are resolved at the call site). The remainder-spread deadlock-avoidance rule is preserved verbatim: if messageCount mod numProducers != 0, the leading producers each get one extra item so the producer-side total exactly equals messageCount; same for consumers. Without this, the harness silently degrades into a deadlock (consumers waiting forever) or strands items in the queue (consumers stop early).

Returns ThroughputMetrics{ops_ms_mean, ops_ms_stddev, runs}; warmup runs are discarded. Smoke test in tests/t_bench_common.nim with a 1024-cap channel SmokeAdapter at 1P/1C × 1000 messages × 1 run.

* feat(bench): implement runLatencyHarness (Task 0.6)

Ping-pong RTT runner. For each run:
- allocates two queues fwd + rev of type Q via queueInit()
- spawns 1 pinger + 1 ponger thread
- pinger pushes a monotonic-ns timestamp via fwd, spins on rev for the echo, records RTT (ns) into a per-run Histogram
- ponger pops from fwd, pushes the same value to rev

Aggregation per design 2.5 'Run aggregation for latency': reported percentiles are the MEAN across runs of each percentile, NOT the percentile of unioned samples. Per-run histograms isolate scheduler artifacts to the run that contains them. Returns LatencyMetrics{p50_ns, p95_ns, p99_ns, p999_ns, max_ns, samples}. samples is the cumulative across-runs sample count.

This task ships the 1P/1C smoke topology only; multi-thread sharding lands in PR 1 (Task 1.x). 1P/1C is sufficient for the smoke check because PR 1's first BMF measure uses Sipsic 1P/1C as the smoke shape.

Test: SmokeAdapter (channel-backed), 1000 messages, 1 run, 0 warmup; asserts samples >= 1000 and p50 < p99 < max.

* feat(bench): add merge_bmf.py BMF JSON merge utility (Task 0.7)

Stateless union of per-slug measure dicts across N input BMF JSON files into one output JSON, per design doc section 4.3.
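The remainder-spread rule reads naturally as a sketch (hypothetical function name; the real harness is Nim and applies the same rule to both producers and consumers):

```python
def spread_counts(message_count, n_workers):
    # Leading workers each take one extra item so per-worker totals sum
    # exactly to message_count; otherwise consumers either wait forever
    # (deadlock) or stop early and strand items in the queue.
    assert n_workers > 0 and message_count >= 0
    base, rem = divmod(message_count, n_workers)
    return [base + 1 if i < rem else base for i in range(n_workers)]
```

For example, 10 messages over 3 producers yields [4, 3, 3], which sums exactly to 10.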
Required because Bencher.dev creates a separate Report per 'bencher run' invocation; multiple uploads from sibling CI jobs would NOT co-locate multiple measures on a single per-slug history.

Contract: merge_bmf.py <output_path> <input1> [input2 ...]
- Slug regex: ^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$
- Measure regex: ^[a-z][a-z0-9_]*$
- Value fields ('value', 'lower_value', 'upper_value') must be finite, non-NaN floats. Any malformed input or non-finite value -> exit 1.
- Collision (same slug + same measure across two inputs) -> exit 1 with stderr 'error: collision on slug=<slug> measure=<measure> in files <pathA> and <pathB>'.
- Output is alpha-sorted at both top level (slugs) and within each slug (measures), 2-space indent.

Tests in benchmarks/tests/test_merge_bmf.py cover all 8 design-spec cases plus an alpha-sort assertion: zero-input usage error, single empty pass-through, single valid round-trip, two disjoint slugs, shared slug + disjoint measures, collision, NaN rejection, invalid measure key (uppercase), invalid slug shape.

* feat(bench): add 5 missing lockfreequeues adapters (Task 0.8)

Adds the 5 adapters not yet present per design doc 2.2:
- lockfreequeues_sipmuc_adapter.nim (bounded SPMC)
- lockfreequeues_mupsic_adapter.nim (bounded MPSC)
- lockfreequeues_unbounded_sipsic_adapter.nim
- lockfreequeues_unbounded_sipmuc_adapter.nim
- lockfreequeues_unbounded_mupmuc_adapter.nim

Each adapter file exports:
- 'topologiesSupported: set[Topology]' constant (consumed by PR 3 Task 3.11 to restrict shape generation; declared but unused in PR 0)
- 'make<Lib>Adapter' factory taking 'capacity: int'
- 'cleanup' tear-down proc
- 'push(item)' returning bench_common.PushResult
- 'pop()' returning bench_common.PopResult[T]

Smoke test in tests/t_bench_common.nim: each adapter pushes 100 sequential uint64s, pops them, asserts count == 100 AND set-equality with the input range.
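The documented merge contract (the regexes and collision rule above) can be sketched as an in-memory union. This hypothetical helper is not the real merge_bmf.py, which reads files and exits 1; here the failure paths raise SystemExit instead:

```python
import math
import re

SLUG_RE = re.compile(r"^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$")
MEASURE_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def merge_bmf(docs):
    # docs: list of (path, parsed BMF dict). Returns the union, or raises
    # SystemExit on malformed keys, non-finite values, or collisions.
    out, seen = {}, {}
    for path, doc in docs:
        for slug, measures in doc.items():
            if not SLUG_RE.match(slug):
                raise SystemExit(f"error: invalid slug {slug!r} in {path}")
            for measure, fields in measures.items():
                if not MEASURE_RE.match(measure):
                    raise SystemExit(f"error: invalid measure {measure!r} in {path}")
                if not all(isinstance(v, (int, float)) and math.isfinite(v)
                           for v in fields.values()):
                    raise SystemExit(f"error: non-finite value in {path}")
                if (slug, measure) in seen:
                    raise SystemExit(
                        f"error: collision on slug={slug} measure={measure} "
                        f"in files {seen[(slug, measure)]} and {path}")
                seen[(slug, measure)] = path
                out.setdefault(slug, {})[measure] = fields
    return out
```

Collision rejection (rather than last-writer-wins) is what makes the union safe across sibling CI jobs: a silent overwrite would erase one job's measures.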
Implementation note (Nim 2.2.6 codegen workaround): The unbounded adapters store the queue INLINE rather than via heap pointer. Heap-allocating UnboundedSipsic/UnboundedSipmuc/UnboundedMupmuc and then calling reset() or =destroy() on the referent from a generic adapter cleanup proc triggers 'internal error: getTypeDescAux(tyGenericParam)' OR 'internal error: expr(nkIdent); unknown node kind' in Nim 2.2.6 when bench_common is also imported. Inline storage lets the compiler emit the per-instance =destroy automatically when the adapter goes out of scope, sidestepping the bug. The DEBRA manager is still heap-allocated because the queue constructors take 'ptr DebraManager'. Track upstream Nim repro for follow-up.

* PR 0 Task 0.9: rename existing adapters with `_adapter` suffix and export `topologiesSupported`

Per design section 2.2 (`<library_slug>_adapter.nim` filename convention):
- Rename via `git mv` (history preserved):
  - lockfreequeues_sipsic.nim -> lockfreequeues_sipsic_adapter.nim
  - lockfreequeues_mupmuc.nim -> lockfreequeues_mupmuc_adapter.nim
  - lockfreequeues_unbounded_mupsic.nim -> lockfreequeues_unbounded_mupsic_adapter.nim
- Update bench_throughput.nim import paths to match.
- Add `topologiesSupported*: set[Topology]` constant to all four existing adapters (3 renamed + channels_adapter). Use selective `from ../bench_common import Topology, <member>` to avoid pulling PushResult into scope, which would conflict with adapter.PushResult (the legacy concept in ../adapter still imported by these adapters).
- `topologiesSupported` is exported in PR 0 but consumed only by PR 3 Task 3.11; the "unused export" hint is expected per the impl plan.

* PR 0 Task 0.10: native BMF emission via --bmf-out= flag in bench_throughput

Add --bmf-out=<path> flag (parsed via std/parseopt) that emits Bencher Metric Format JSON for every variant that ran.
With the flag absent the binary is bit-for-bit unchanged; the existing positional CLI (`bench_throughput sipsic mupmuc unbounded_mupsic channels`) is preserved verbatim.

Slug shape per design 2.2:
- lockfreequeues_sipsic/spsc/1p1c
- lockfreequeues_mupmuc/mpmc/{1,2,4,8}p{1,2,4,8}c
- lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c
- nim_channels/mpmc/{1,2,4}p{1,2,4}c

Each slug carries the `throughput_ops_ms` measure as {value: mean, lower_value: mean-stddev, upper_value: mean+stddev} — matching the convention bmf_adapter.py used (which this code replaces in Task 0.12).

Note on slug for built-in channels: the human-readable adapter name() returns "nim/channels" (kept for stdout text), but the slug uses underscore form `nim_channels` so it satisfies the merge_bmf.py SLUG_RE pattern `^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$`.

Per-variant {.intdefine.}s for runs/warmup added so the integration test can build a fast variant via -d:BenchSipsicRuns=2 -d:BenchSipsicWarmup=0. Default values (10 runs, 0 warmup) match the prior hard-coded behavior.

Selective `from ./bench_common import` is used to pull only {BMFEmitter, initBMFEmitter, addMeasure, emit} — importing the full module would shadow `results.ThroughputMetrics` with `bench_common.ThroughputMetrics` and break the legacy bespoke harness procs that still drive Mupmuc and UnboundedMupsic.

Test: tests/t_bench_common.nim adds an integration test that compiles bench_throughput with small overrides, runs it with --bmf-out=<tmp> sipsic, parses the resulting JSON, and asserts the expected slug and measure exist with a positive value. Stdout text output is also asserted (`Sipsic` substring) so the legacy CLI contract is exercised.

* PR 0 Task 0.11: bench.yml — native BMF + merge step, drop legacy parser

Apply the five edits from design section 3 (PR 0):
(a) Remove `Run adapter unit tests` step. The Python regex parser it sanity-checked is gone in Task 0.12; nothing left to test.
(b) Remove `Convert to BMF JSON` step.
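The per-measure shape described above ({value: mean, lower_value: mean-stddev, upper_value: mean+stddev}) and the slug convention amount to the following (hypothetical Python helpers; the real emitter is Nim):

```python
import statistics

def throughput_measure(samples):
    # One BMF measure entry: the mean with a one-stddev band as bounds.
    m = statistics.mean(samples)
    sd = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {"value": m, "lower_value": m - sd, "upper_value": m + sd}

def slug(lib, topology, producers, consumers):
    # "<lib>/<topology>/<P>p<C>c", e.g. "lockfreequeues_sipsic/spsc/1p1c"
    return f"{lib}/{topology}/{producers}p{consumers}c"
```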
bench_throughput now emits BMF natively via `--bmf-out=throughput.json` (Task 0.10).
(c) Update `Run bench_throughput` step to add the `--bmf-out=throughput.json` flag while preserving the existing positional variant filter and the 20-minute step timeout.
(d) Insert `Merge BMF JSON` step running `python3 benchmarks/merge_bmf.py merged.json throughput.json`. Single-input today, but stays in place for PR 2-4 when bench is split into per-topology binaries each producing its own fragment.
(e) Change all three `bencher run` invocations and the debug-cat step to use `--file merged.json` instead of `bench_results.json`.

actionlint clean. The bench upload contract is unchanged: same project, branch, testbed, thresholds, comparison semantics — only the file source flips from regex-parsed text to native BMF.

* PR 0 Task 0.12: delete legacy bmf_adapter.py + bench_main.nim, refactor consumers

Three legacy files removed via `git rm`:
- `benchmarks/bmf_adapter.py` — regex parser for bench_throughput stdout. Replaced by native BMF emission via `--bmf-out=` (Task 0.10).
- `benchmarks/test_bmf_adapter.py` — unit tests for the parser. Replaced by `benchmarks/tests/test_merge_bmf.py` (Task 0.7).
- `benchmarks/nim/bench_main.nim` — aggregator binary that wrapped bench_throughput + bench_latency and emitted a custom JSON shape. bench_throughput is now the canonical entry point and emits BMF directly; bench_latency follows in PR 1.

Live consumers refactored in this same task (per impl plan §0.12):
- `benchmarks/runner.py`: build_nim now compiles bench_throughput.nim; run_nim invokes `bench_throughput --bmf-out=<path>` instead of `bench_main --runs=<n> -o=<path>`. The `runs` argument is preserved in the CLI for back-compat but is now compile-time-only.
- `lockfreequeues.nimble`: `task benchmarks` switched to bench_throughput. Output path corrected to `.tmp/bench_throughput` (matching the project nim.cfg `--outdir:.tmp`).
- `benchmarks/README.md`: rewritten to document the new flow (bench_common module, adapter convention, merge_bmf.py, native BMF emission, expected slug set).
- `.gitignore`: drop the `bench_main` build-artifact ignore line.

`git grep "bmf_adapter\|bench_main"` returns only intentional historical references in code comments and CHANGELOG (prior release notes).

Acceptance: `nimble tasks` parses; `nim c benchmarks/nim/bench_throughput.nim` compiles; tests/t_bench_common.nim passes all 26 tests including the Task 0.10 --bmf-out integration test.

* PR 0 Task 0.13: render_readme.nim consumes new BMF shape

The legacy bench_main aggregator emitted JSON of shape `{results: [{implementation, thread_config, metrics: {...}}], metadata: {...}}`. Bench_main was deleted in Task 0.12, replaced by `bench_throughput --bmf-out` (Task 0.10) + `merge_bmf.py` (Task 0.7) which produce Bencher Metric Format JSON of shape `{<lib>/<topology>/<P>p<C>c: {<measure>: {value, lower_value?, upper_value?}}}`.

`benchmarks/render_readme.nim` now walks the BMF shape directly:
- New `parseSlug(slug)` helper splits `<lib>/<topology>/<P>p<C>c` into `(impl, "<P>P/<C>C")` for the two table columns. Strict on shape: rejects slugs with the wrong segment count or non-digit P/C, so malformed entries do not produce noise rows.
- `loadRows` walks every top-level slug, pulls `<slug>.throughput_ops_ms.value` for the throughput column, and `<slug>.latency_p50_ns.value` for the optional latency column (PR 1+ emits this; PR 0 does not, so all PR 0 rows show "—").
- BMF has no `metadata` block, so the meta argument is always nil. `renderTable` already accepts nil and skips the platform header in that case — no change needed.
- `when isMainModule` block guarded by `not defined(renderReadmeNoMain)` so unit tests can include this file without triggering the README rewrite. CLI invocations (`nim r benchmarks/render_readme.nim`) leave the define unset and run main as before.
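parseSlug's strict shape check translates to a short sketch (mirroring the Nim helper's documented behavior, not its actual code):

```python
import re

_SHAPE = re.compile(r"^(\d+)p(\d+)c$")

def parse_slug(slug):
    # "<lib>/<topology>/<P>p<C>c" -> (impl, "<P>P/<C>C"); returns None for
    # malformed slugs so they never become noise rows in the table.
    parts = slug.split("/")
    if len(parts) != 3:
        return None
    m = _SHAPE.match(parts[2])
    if not m:
        return None
    return parts[0], f"{m.group(1)}P/{m.group(2)}C"
```

Returning None (rather than a best-effort row) is the design choice the commit calls out: a malformed slug is dropped instead of rendered.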
Test: tests/t_render_readme.nim (new) writes a fixture BMF JSON with three slugs (mupmuc/2p2c, sipsic/1p1c, nim_channels/1p1c), runs `loadRows` and `renderBlock`, and asserts:
- 3 rows produced; impl/threads/throughput parsed correctly per slug
- alpha sort by impl preserved
- meta == nil (BMF carries no metadata)
- rendered Markdown table contains every impl name and value
- latency column is em-dash for every row (no `latency_p50_ns` in fixture)
- missing JSON file falls back to placeholder
- empty `{}` BMF document falls back to placeholder

All 4 new tests + 26 existing t_bench_common tests pass.

* PR 0 Task 0.14: CHANGELOG entry under [Unreleased] for bench-rollup PR 0

Document Track 0 of the bench-rollup feature under [Unreleased]:
- Added: bench_common.nim module, 5 new lockfreequeues adapters, merge_bmf.py CLI, --bmf-out flag, per-variant run-count intdefines.
- Changed: bench_throughput emits BMF natively, 3 adapters renamed to the <library_slug>_adapter.nim convention, render_readme.nim consumes the new BMF shape, runner.py / nimble task / benchmarks/README.md point at bench_throughput.
- Removed: bmf_adapter.py, test_bmf_adapter.py, bench_main.nim.

Each entry is one bullet (one logical change), no test plan checklist, no GitHub issue references.

* fix(bench): plug heap-allocated queue leaks in latency and throughput harness

Per Gemini review on PR #19. Adapters whose Q heap-allocates the backing queue (lockfreequeues bounded mupsic/sipmuc/mupmuc, channels) leaked the queue every time runOneLatencyRun or runOneThroughputRun returned: `var fwd = queueInit()` only frees the adapter struct's stack slot at scope exit, not the create()-d Mupsic/Sipmuc/Mupmuc/Channel pointee. Across N runs * M warmups the leak grew without bound. Three-part fix: 1.
bench_common.nim: harness now drops the queue via a uniform `mixin cleanup; when compiles(cleanup(q)): cleanup(q)` block after joinThread (and before any return path, including the elapsedNs<=0 short-circuit in runOneThroughputRun). `when compiles` keeps value-type adapters (sipsic, unbounded inline storage) usable without forcing them to define a no-op cleanup.
2. lockfreequeues_mupsic / sipmuc cleanup procs: prepend reset(a.queue[]) so Mupsic/Sipmuc's `=destroy` actually runs before the heap slot is freed. Bare dealloc skipped any queue-internal cleanup that an ARC/ORC-generated destructor would otherwise do (managed T fields, future atomic-cleanup hooks). Same fix on the mupmuc adapter inside deinitMupmucAdapter.
3. mupmuc and channels adapters gain a `cleanup` alias around their existing deinit*Adapter procs so the harness's mixin cleanup call resolves uniformly across adapter naming conventions.

Verified by smoke build of `runLatencyHarness` + `runThroughputHarness` against LockfreequeuesMupsicAdapter; no leak-on-return, percentiles and ops/ms reported as expected.

* fix(bench): destroy queue before manager; guard zero-duration throughput

Two fixes flagged by code review on PR 0:
1. Unbounded MPMC adapter cleanup() freed the DebraManager while the inline queue still held a pointer to it. The queue's =destroy then called unbindClient(self.manager[]) on freed memory. Reset the queue first (running its destructor while the manager is still valid), then dealloc the manager. Applied to both unbounded_mupmuc and unbounded_sipmuc adapters.
2. The legacy bench_throughput run procs computed messageCount/elapsedNs without guarding against elapsedNs <= 0, which can occur on fast CI runners with sub-nanosecond clock granularity. A zero-duration measurement produces inf, which merge_bmf.py rejects via math.isfinite. Return 0.0 in that case so the sample is recorded as a valid (degenerate) value.
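The zero-duration guard in fix 2 is small but load-bearing; a sketch with a hypothetical function name (the real guard lives in the Nim run procs):

```python
def ops_per_ms(message_count, elapsed_ns):
    # Fast CI runners can report elapsed_ns <= 0 (clock granularity);
    # dividing through would yield inf, which merge_bmf.py rejects via
    # math.isfinite and fail the entire upload.
    if elapsed_ns <= 0:
        return 0.0  # degenerate but finite sample
    return message_count / (elapsed_ns / 1e6)
```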
* fix(bench): guard zero-thread harness; expose adapter queues; drop non-finite BMF

Three further code-review fixes on PR 0:
1. runOneThroughputRun would crash with DivByZeroDefect on numProducers==0 / numConsumers==0 because the workload-distribution step does messageCount div N. The harness is exported, so a future caller could pass 0 thinking 'no producers' is a degenerate but valid shape. Guard with doAssert + a clear message instead of letting the worker thread crash with no context. Also assert messageCount >= 0.
2. lockfreequeues_mupsic_adapter and lockfreequeues_sipmuc_adapter stored a single Producer/Consumer at construction time but kept the underlying queue pointer non-exported, so multi-thread shapes couldn't actually register per-thread producers/consumers as the adapter docstrings already described. Export the queue field (used by today's bench_throughput multi-producer code path) and add explicit getProducer*/getConsumer* accessor procs that wrap queue.getProducer(idx)/getConsumer(idx) for harness consumers.
3. Defense in depth on the BMF emitter. The PR previously added elapsedNs <= 0 guards to the legacy bench_throughput run procs; this pass also rejects non-finite values inside addMeasure itself. If a future code path produces inf/NaN despite the upstream guard, the offending (slug, measure) entry is silently dropped instead of being written to the BMF, which merge_bmf.py would otherwise reject and fail the entire upload on. Same classification widened on toBound for lower/upper bounds.

* perf(bench): bulk-percentile to avoid 5x reservoir sort per latency run

Histogram.percentile() previously sorted the reservoir (up to 99k samples) on every call, so each latency run re-sorted the same data five times for p50/p95/p99/p999/max. Add Histogram.percentiles() that takes an array of probabilities and computes them all from a single sorted snapshot.
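The bulk API's shape, sketched in Python (the real code is Nim's Histogram.percentiles over the reservoir):

```python
def percentiles(samples, probs):
    # One sort, many lookups: p50/p95/p99/p999/max share a single sorted
    # snapshot instead of re-sorting up to 99k samples per call.
    s = sorted(samples)
    out = []
    for p in probs:
        idx = p * (len(s) - 1)
        lo = int(idx)
        hi = min(lo + 1, len(s) - 1)
        # linear interpolation between adjacent order statistics
        out.append(s[lo] + (idx - lo) * (s[hi] - s[lo]))
    return out
```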
Refactor percentile(openArray[float]) through a new percentileSorted() helper so the sort/lerp can be reused without a fresh allocation. Update runOneLatencyRun to use the bulk API.

* fix(render_readme): bump row cap; restore platform/cores/timestamp line

- Bump the row cap from 8 to 32. The bench matrix today already exceeds 8 (1 sipsic + 4 mupmuc + 3 unbounded mupsic + 3 channels = 11) and PR 3 / PR 4 add comparison adapters; the old cap silently truncated the table. PR 5 replaces this auto-rendered block with a hand-curated 4-row summary, so the higher cap is a transitional safeguard.
- Reconstruct platform / cores / timestamp metadata at render time instead of pulling from JSON. BMF has no metadata block, so the per-table metadata line was dropping silently in the legacy -> BMF migration. detectMetadata() falls back to runtime detection (hostOS/hostCPU, countProcessors, now()) and honors three env vars (BENCH_RENDER_PLATFORM / _CORES / _TIMESTAMP) for devs re-rendering historical snapshots from a different host. renderTable still honors any meta JsonNode caller passes through (forward compatibility for a future BMF-with-metadata extension).

* ci: pin sibling-dep clones to nim-debra v0.6.0 / nim-typestates v0.7.0

Both build.yml and bench.yml clone the matching siblings under `../nim-debra` and `../nim-typestates` so nim.cfg's `--path:` directives resolve. They were cloning `--branch main`, which floats: a downstream breaking change in either repo would silently propagate into our CI without the corresponding `requires` bump in lockfreequeues.nimble. Switch to specific release tags. Bumps now require touching three places in lockstep (this comment, the `requires` line, the matching release tag), which is the right cost for a dep update. Drop the stale "Once debra 0.3.0 publishes…" comment — debra is at v0.6.0 and typestates at v0.7.0, both released and registered.
* ci(bench): cost gate — drop feat/** PR base; require `bench` label

Two compounding gates so cascade-rebases of stacked feature PRs do not burn Bencher.dev metric-ingest quota:

A) `pull_request: branches:` is reduced to [main, devel]. Stacked feature-to-feature PRs (`feat/foo` -> `feat/bar`) no longer trigger this workflow at all. Only the topmost PR in any stack (the one whose base is devel/main) gets PR-time bench treatment.

B) The `Track PR benchmarks with Bencher` step is now gated on a `bench` label being present on the PR. Without the label the step is skipped even on PRs targeting devel; the bench binaries still compile and run (free GHA minutes, useful for stdout-level spot-checks), but no `bencher run` upload fires. Add the label deliberately when you want a comparison report on a revision.

The preflight step is renamed and extended so the check summary displays a clear "skipped because no `bench` label" warning when B is the gate, paralleling the existing missing-token branch. Push to main/devel (baseline upload) and workflow_dispatch (manual upload) are unchanged.

* docs(bench): align Quick Start CI command with current 1M/500k budget

* ci(bench): wire t_bench_common into nimble benchtests + CI job

Adds a 'benchtests' nimble task that compiles/runs the bench harness tests, which live outside srcDir and are intentionally not imported by tests/test.nim (avoids forcing every entry of the MM/sanitizer test matrix to compile bench harness threading machinery). Adds a 'bench-tests' job to bench.yml that invokes 'nimble benchtests' on every PR and push. The job has no secret dependencies, so fork PRs are allowed; only the github-actions[bot] loop-back guard is kept. Sibling Nim deps pin to v0.6.0 (debra) and v0.7.0 (typestates), matching the existing 'benchmark' job in this workflow.
* PR 1 Task 1.1: BenchLatencyRuns/BenchLatencyMessageCount intdefines

Exposes per-binary {.intdefine.} overrides for latency wall-time control: BenchLatencyRuns (default 33) and BenchLatencyMessageCount (default 100_000). Mirrors the BenchSipsicRuns / MessageCount pattern in bench_throughput.nim. Tests in tests/t_bench_latency.nim assert defaults compile cleanly and that -d:BenchLatencyRuns=2 -d:BenchLatencyMessageCount=1000 overrides take effect via compile-time static blocks.

Also fixes a stale import (./adapters/lockfreequeues_sipsic -> ./adapters/lockfreequeues_sipsic_adapter) so the file compiles after PR 0 Task 0.9 renamed all bench adapters with the _adapter suffix. The rest of the file (legacy runLatencyBenchmark / benchmarkLatency / isMainModule) is preserved verbatim and gets fully rewritten in Task 1.2 onto runLatencyHarness.

* PR 1 Task 1.2: rewrite bench_latency.nim onto bench_common.runLatencyHarness

Replaces the legacy ping-pong RTT body with a CLI mirroring bench_throughput's: positional variant filter (sipsic / mupmuc / sipmuc / mupsic), --bmf-out=<path> for native Bencher Metric Format JSON emission, stdout text preserved. Per-binary run-shape intdefines wired in: BenchLatencyRuns / BenchLatencyMessageCount / BenchLatencyWarmupRuns drive runLatencyHarness directly.

Slug shape per design 2.2 (table at design line 357), 1p1c smoke shape only — full grid lands in PR 2:
- lockfreequeues_sipsic/spsc/1p1c
- lockfreequeues_sipmuc/mpmc/1p1c
- lockfreequeues_mupsic/mpsc/1p1c
- lockfreequeues_mupmuc/mpmc/1p1c

Each carries latency_p50_ns / latency_p95_ns / latency_p99_ns measures (p999 / max deferred to PR 6 per impl plan).

Type unification: PushResult / PopResult were duplicated across benchmarks/nim/adapter.nim (legacy) and benchmarks/nim/bench_common.nim (new), forcing two nominally distinct enums to coexist.
Aliasing adapter.PushResult / PopResult to bench_common's via re-export unifies the surface — both adapter packs (legacy sipsic / mupmuc and the newer sipmuc / mupsic / unbounded_*) now flow through the same runLatencyHarness and runThroughputHarness without per-call-site conversion. Verified: t_bench_common (26 tests) and bench_throughput --bmf-out integration unchanged.

The Mupmuc / Sipsic adapters use getProducer(idx = 0) / getConsumer(idx = 0) which bypasses Mupmuc's threadId-based registration, so the pre-built producer / consumer slots are safe to drive from either of the harness's pinger and ponger threads (Mupmuc's underlying Vyukov per-slot CAS protocol is fully concurrent-safe).

Tests: tests/t_bench_latency.nim adds three Task 1.2 cases covering the sipsic single-variant happy path, the four-variant slug coverage required by Track 1 Acceptance Criteria, and the unknown-variant exit-1 path. All pass with -d:BenchLatencyMessageCount=200 -d:BenchLatencyRuns=2 in under ~10s on Apple silicon.

* PR 1 Tasks 1.3 + 1.4: bench-latency job + N-input merge step in bench.yml

Restructures bench.yml from one all-in-one bench-throughput job into three jobs:
- bench-throughput (renamed from 'benchmark'): builds + runs bench_throughput, uploads bench-throughput-bmf artifact.
- bench-latency (NEW): builds + runs bench_latency, uploads bench-latency-bmf artifact. Compile-time overrides: BenchLatencyMessageCount=50000, BenchLatencyRuns=11, BenchLatencyWarmupRuns=2 (chosen so the four 1p1c smoke-shape variants finish in single-digit seconds inside the 12-min job budget).
- bench-upload (NEW, needs both): downloads bench-*-bmf artifacts via actions/download-artifact@v4 pattern, runs python3 benchmarks/merge_bmf.py merged.json $(ls bmf-inputs/*.json), then runs the existing Bencher CLI upload flows (PR comparison / push baseline / workflow_dispatch).

One bencher run invocation -> one Bencher Report -> latency + throughput measures co-located on the shared per-slug history (design 1).
Bencher upload commands are unchanged in spirit: same PR / push / workflow_dispatch gating, same --threshold-measure throughput on push baselines, same yellow-warning preflight when BENCHER_API_TOKEN is unset on forks. This commit does NOT correct the historical `--threshold-measure throughput` -> `throughput_ops_ms` rename deferred from PR 0 Task 0.10; PR 6 (Task 6.2) handles per-measure threshold blocks in the same edit.

Plan reference: impl plan Track 1 Tasks 1.3 (line 350-357) and 1.4 (line 359-366). The two tasks share the same file and are tightly coupled (1.4 strictly depends on 1.3); folding them into one commit keeps the workflow file consistent at every commit boundary.

actionlint clean. SC2046 word-splitting on $(ls ...) is intentional (each path must arrive as a separate positional argv to merge_bmf.py) and is suppressed inline with a justification comment; artifact filenames are repo-controlled (no whitespace / glob chars).

* PR 1 Task 1.5: smoke test for multi-measure-per-slug merge

Adds two cases to tests/t_bench_latency.nim that validate the end-to-end shape Track 1 ships: a single slug carries BOTH `throughput_ops_ms` (from bench_throughput's BMF fragment) and `latency_p50_ns` / `latency_p99_ns` (from bench_latency) AFTER `merge_bmf.py` unions the two fragments.

1. Multi-measure union — writes synthetic throughput + latency fragments on the same slug, runs python3 benchmarks/merge_bmf.py, and asserts the merged JSON carries all three measures with their original values intact.
2. Collision guard — re-declares `throughput_ops_ms` on the same slug across both inputs and asserts merge_bmf.py exits 1 with a 'collision' message in stderr. This regression-tests PR 0 Task 0.7's collision guard against silent overwrites that would erase one of the colliding measures.

Both tests use createTempDir + defer removeDir for isolated, leak-free fixtures and run in <200ms because they bypass the bench compile + execute path entirely.
Plan reference: impl plan Track 1 Task 1.5 (line 368-376).

* PR 1 Task 1.7: CHANGELOG entry under [Unreleased] for bench-rollup PR 1

Adds two bullets to the [Unreleased] Added section covering bench_latency --bmf-out emission and the new three-job bench.yml structure (bench-throughput / bench-latency / bench-upload). Adds a Changed bullet documenting the adapter.nim PushResult / PopResult unification.

Plan reference: impl plan Track 1 Task 1.7 (line 386-393).

* fix(bench-latency): correct prefix on unknown-flag error; rename groups to variants

Per Gemini review on PR #20.
1. Unknown-flag error message hardcoded `--` regardless of cmdShortOption vs cmdLongOption. `-x` printed as `Unknown flag: --x`. Now picks the right prefix from p.kind.
2. `groups` is a leftover name from bench_throughput where positional args ARE topology groups. In bench_latency they're individual variants. Renamed for clarity.

* fix(bench): grant bench-upload contents/actions read; isolate artifact dirs

Two CI hardening fixes from code review on PR 1:
1. The bench-upload job overrode GITHUB_TOKEN permissions to only pull-requests:write + checks:write but then ran actions/checkout and actions/download-artifact, which need contents:read and actions:read. Add both so the job is robust against orgs that restrict default token permissions.
2. Switch from merge-multiple:true to per-artifact subdirectories. Today's artifact filenames (throughput.json, latency.json) are unique, but PR 2's topology split adds bench_spsc/mpsc/mpmc/unbounded binaries; each could end up writing files that collide under merge-multiple. Each artifact now lands at bmf-inputs/<artifact-name>/<file> and merge_bmf.py is invoked over the recursive find.

* fix(bench): print p999 in latency stdout for operator visibility

The harness already computes metrics.p999_ns; surfacing it during manual runs costs nothing and matches the legacy bench_latency output.
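The prefix fix in the unknown-flag commit is this small; a sketch (the real code keys off parseopt's cmdShortOption/cmdLongOption kinds, modeled here as strings):

```python
def unknown_flag_error(kind, key):
    # Choose the prefix from the parsed option kind instead of
    # hardcoding "--", so `-x` no longer reports as `--x`.
    prefix = "--" if kind == "long" else "-"
    return f"Unknown flag: {prefix}{key}"
```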
BMF wire-up for latency_p999_ns / latency_max_ns is deferred to PR 6 alongside threshold configuration.

* test(bench_latency): isolate test workspaces with createTempDir + ExeExt

Replace static /tmp/bench_latency_t1*_* paths with per-test createTempDir workspaces (via newTestWorkspace helper) cleaned up by defer: removeDir. Use ExeExt for the compiled binary name so the suffix is correct on Windows. Reshape compileBenchLatency to take an owned `dir` instead of a `suffix` so workspace lifecycle stays with the test that allocates it. Prevents collisions between parallel test runs / repeated runs in the same shell, and stops binaries from accumulating in the system temp dir.

* ci(bench-latency): pin sibling deps to v0.6.0/v0.7.0; wire t_bench_latency into benchtests

* PR 2 Task 2.1: capture pre-split slug-set fixture for deletion-safety

Snapshots the BMF emitted by the legacy bench_throughput at the same small-override knobs the topology-split tests will use, then commits the JSON to tests/fixtures/pre-split-slugs.json. Track 2's deletion-safety check (Task 2.7) verifies that the post-split union is a strict superset of this fixture so no slug silently disappears with the split.

Adds tests/t_topology_split.nim as the RED test stub: the fixture-shape suite (Task 2.1) passes against the committed JSON; the bench_spsc / bench_mpsc / bench_mpmc / bench_unbounded suites (Tasks 2.3-2.6) and the superset suite (Task 2.7) fail loudly until those binaries + the superset_check.py helper land.

Fixture covers 11 slugs:
- sipsic spsc/1p1c
- mupmuc mpmc/{1,2,4,8}p{1,2,4,8}c (8p8c is the issue #15 livelock case)
- unbounded_mupsic mpsc_unbounded/{1,2,4}p1c
- nim_channels mpmc/{1,2,4}p{1,2,4}c

* PR 2 Task 2.3: bench_spsc binary (Sipsic 1p1c)

Splits the SPSC slice out of the legacy bench_throughput.nim.
Wraps SipsicAdapter in a uniform queueInit closure and runs through runThroughputHarness so this binary stays under 100 lines and shares the per-run shaping (warmup, run-count, monotonic-ns timing, remainder-spread distribution) with the rest of the topology split.

Per-binary intdefines (design 2.5):
- -d:BenchSpscRuns (default 33)
- -d:BenchSpscMessageCount (default 1_000_000)
- -d:BenchSpscWarmup (default 3)

Behind -d:BenchSpscTestCompileTime the binary's `static` block asserts the three defaults. The runtime + BMF emission contract is covered by tests/t_topology_split.nim's bench_spsc suite (Task 2.3), which compiles at 1000 messages * 2 runs and parses the emitted lockfreequeues_sipsic/spsc/1p1c slug.

* PR 2 Task 2.4: bench_mpsc binary (Mupsic 1p1c, 2p1c, 4p1c)

Splits the MPSC slice out of the legacy bench_throughput.nim. Mupsic requires per-thread Producer objects via queue.getProducer(idx); the generic runThroughputHarness assumes a uniform queue.push call, so this binary keeps the bespoke harness pattern from the legacy file: pre-assign Producer values on the main thread (idx = 0..P-1) and ship them into worker thread contexts. The shape-iteration loop instantiates three concrete Mupsic types {1,2,4}p1c so P stays a static int.

Per-binary intdefines (design 2.5):
- -d:BenchMpscRuns (default 33)
- -d:BenchMpscMessageCount (default 1_000_000)
- -d:BenchMpscWarmup (default 3)

Slug shape: lockfreequeues_mupsic/mpsc/{1,2,4}p1c. Behind -d:BenchMpscTestCompileTime the binary's static block asserts the three defaults. Runtime behavior covered by tests/t_topology_split.nim's bench_mpsc suite.

* PR 2 Task 2.5: bench_mpmc binary (Mupmuc grid + 8p8c, Sipmuc, channels)

Splits the MPMC slice out of the legacy bench_throughput.nim into a single binary covering three queue families.
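The remainder-spread distribution mentioned above (splitting a total message count across P producers without dropping the remainder) is simple integer arithmetic; a minimal sketch in Python, where the function name is illustrative rather than the harness's actual API:

```python
def spread(total: int, producers: int) -> list[int]:
    """Give each producer total // producers messages and spread the
    remainder one-by-one over the first producers, so the counts sum
    exactly to total and differ by at most 1."""
    base, rem = divmod(total, producers)
    return [base + (1 if i < rem else 0) for i in range(producers)]

print(spread(1_000_000, 4))  # [250000, 250000, 250000, 250000]
print(spread(10, 3))         # [4, 3, 3]
```

This keeps per-producer workloads balanced at every shape in the grid, including counts that do not divide evenly.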
Mupmuc and Sipmuc both need per-thread Producer / Consumer values via getProducer(idx) / getConsumer(idx); the harness pre-assigns those on the main thread and ships them into worker contexts (mirrors the bench_throughput Mupmuc pattern). Channels has a uniform push/pop interface and runs through bench_common.runThroughputHarness.

Coverage:
- Mupmuc {1,2,4} P x {1,2,4} C grid (9 shapes) plus 8p8c. 8p8c is the issue #15 livelock regression case preserved from the pre-split bench_throughput fixture so the deletion-safety check does not silently lose that coverage.
- Sipmuc 1p{1,2,4}c (single-producer, multi-consumer; lives under mpmc per design 2.4).
- channels {1,2,4} P x {1,2,4} C grid.

Per-binary intdefines (design 2.5):
- -d:BenchMpmcRuns (default 33)
- -d:BenchMpmcMessageCount (default 1_000_000)
- -d:BenchMpmcWarmup (default 3)

Both lockfreequeues mupmuc and sipmuc export a `Consumer` symbol; the contexts qualify with `mupmuc.Consumer` / `sipmuc.Consumer` to keep type resolution unambiguous.

* PR 2 Task 2.6: bench_unbounded binary (4 unbounded variants)

Splits the unbounded slice out of the legacy bench_throughput.nim. Covers all 4 lockfreequeues unbounded variants at their natural shapes:
- unbounded_sipsic spsc_unbounded/1p1c (no DEBRA)
- unbounded_sipmuc mpmc_unbounded/1p{1,2,4}c (DEBRA on consumers)
- unbounded_mupsic mpsc_unbounded/{1,2,4}p1c (DEBRA on producers)
- unbounded_mupmuc mpmc_unbounded/{1,2,4}p{1,2,4}c grid (DEBRA both sides)

UnboundedSipsic is a plain SPSC pair; the other three use DEBRA epoch reclamation, so producer/consumer threads call registerThread(manager) on their own thread before queue.getProducer(handle) / queue.getConsumer(handle). UnboundedMupsic reuses the existing UnboundedMupsicAdapter (which pre-registers the consumer at adapter init) and drives pop from the calling thread, mirroring the legacy bench_throughput path.
UnboundedSipmuc and UnboundedMupmuc are constructed inline with a heap-pointer DebraManager since their constructors take a `ptr DebraManager`.

Per-variant intdefines (design 2.5):
- -d:UnboundedSipsicRuns / UnboundedSipsicMessageCount (3 / 500_000)
- -d:UnboundedSipmucRuns / UnboundedSipmucMessageCount (3 / 500_000)
- -d:UnboundedMupsicRuns / UnboundedMupsicMessageCount (3 / 500_000)
- -d:UnboundedMupmucRuns / UnboundedMupmucMessageCount (3 / 500_000)
- -d:BenchUnboundedWarmup (default 2)

Per-variant defines are chosen over a shared BenchUnboundedRuns so CI can budget unbounded_mupmuc/4p4c independently from the cheaper unbounded_sipsic/1p1c (design 2.5 rationale).

* PR 2 Task 2.7: superset_check.py + deletion-safety wiring

Adds benchmarks/scripts/superset_check.py: a slug-set subset check that exits 0 when the post-split BMF (union of bench_spsc + bench_mpsc + bench_mpmc + bench_unbounded after merge_bmf.py) covers every slug in the pre-split fixture, and exits 1 with the missing slugs listed on stderr otherwise. The check operates on JSON top-level keys only; it does not compare measure values, bounds, or ordering, so the harness is free to reshape per-run statistics across the split without tripping the deletion-safety guard.

Also wires the check into tests/t_topology_split.nim's deletion-safety suite (compile + run all four binaries at small overrides, merge via merge_bmf.py, then invoke superset_check.py against the committed fixture and assert exit 0). benchmarks/tests/test_superset_check.py covers 9 cases: identical sets, strict superset, missing slugs (with alpha-sorted listing), empty pre, wrong arg count, missing pre file, malformed JSON, and top-level non-object. All exit codes use 1 for failure to match the existing benchmarks/merge_bmf.py contract.

* PR 2 Task 2.8: bench.yml matrix over 5 topology binaries + per-step timeouts

Replaces the legacy bench-throughput + bench-latency job pair with a single matrix job over the 5 topology-split binaries.
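The slug-set subset check described for superset_check.py is a plain set difference over JSON top-level keys; a minimal sketch, not the script's actual source — the `check` helper name is illustrative, but the exit-code and alpha-sorted-stderr behavior follow the commit message:

```python
import json
import os
import sys
import tempfile

def check(pre_path: str, post_path: str) -> int:
    # Only top-level keys (slugs) are compared; measure values,
    # bounds, and ordering are deliberately ignored.
    with open(pre_path) as f:
        pre = json.load(f)
    with open(post_path) as f:
        post = json.load(f)
    missing = sorted(set(pre) - set(post))  # alpha-sorted listing
    if missing:
        for slug in missing:
            print(f"missing slug: {slug}", file=sys.stderr)
        return 1  # failure exit code matches the merge_bmf.py contract
    return 0

# Example: the post-split union is a strict superset of the fixture.
pre = {"lockfreequeues_sipsic/spsc/1p1c": {}}
post = {"lockfreequeues_sipsic/spsc/1p1c": {}, "nim_channels/mpmc/1p1c": {}}
d = tempfile.mkdtemp()
for name, obj in [("pre.json", pre), ("post.json", post)]:
    with open(os.path.join(d, name), "w") as f:
        json.dump(obj, f)
print(check(os.path.join(d, "pre.json"), os.path.join(d, "post.json")))  # 0
```

Because only key presence is checked, the harness can freely reshape per-run statistics without tripping the guard.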
Each matrix entry runs in its own GitHub Actions job with an independent `timeout-minutes: 12` budget so a hang in one binary cannot burn the whole workflow's clock; the surviving binaries finish, the bench-upload job merges what arrived, and the operator gets partial Bencher coverage rather than no coverage.

The compile step's `case` sets per-binary `-d:` overrides matching design 2.5: bounded throughput at 1M messages * 5 runs * 2 warmup, unbounded at 500K * 3 runs * 2 warmup, latency at 50K messages * 11 runs * 2 warmup.

bench-upload now also runs the deletion-safety check (superset_check.py vs tests/fixtures/pre-split-slugs.json) right after merge_bmf.py, so any silent slug regression introduced by future edits to the topology binaries fails the PR check.

Artifact naming changes: each matrix entry uploads `bench-<binary>-bmf` (e.g. `bench-bench_spsc-bmf`); the upload job's download pattern is unchanged (`bench-*-bmf`) so it picks up every new artifact without further edits.

actionlint clean; yamllint shows only the pre-existing "missing document start" warning shared across all repo workflows.

* PR 2 Task 2.9: 5-input union test in test_merge_bmf.py

Adds test_five_input_union covering the upload-job pipeline shape that PR 2's matrix produces: five sibling BMF fragments (one per binary — bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency) arrive via actions/download-artifact, are merged via merge_bmf.py, and must produce a single output whose slug set is the disjoint union of the inputs.

The test deliberately collides the sipsic 1p1c slug between bench_spsc (throughput_ops_ms) and bench_latency (latency_p50_ns, latency_p99_ns) on the SLUG axis with disjoint MEASURE keys, mirroring the production multi-measure-per-slug pattern from Track 1 Task 1.5. The merge must preserve every measure on the shared slug; collisions on (slug, measure) pairs continue to exit 1 (covered by the existing test_collision_exits_1 case, no regression).
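The union-with-per-measure-collision semantics described above can be sketched as follows. This is a hedged reimplementation for illustration, not the actual merge_bmf.py source; the `merge_bmf` function name is assumed:

```python
import sys

def merge_bmf(fragments):
    """fragments: list of dicts mapping slug -> {measure: value}.
    Slug collisions are fine as long as the measure keys are disjoint;
    a duplicate (slug, measure) pair is a hard failure (exit 1)."""
    merged = {}
    for frag in fragments:
        for slug, measures in frag.items():
            slot = merged.setdefault(slug, {})
            for measure, value in measures.items():
                if measure in slot:
                    print(f"collision: {slug}/{measure}", file=sys.stderr)
                    sys.exit(1)
                slot[measure] = value
    return merged

# The shared sipsic 1p1c slug keeps all three measures after the merge.
spsc = {"lockfreequeues_sipsic/spsc/1p1c": {"throughput_ops_ms": 1.0}}
lat = {"lockfreequeues_sipsic/spsc/1p1c": {"latency_p50_ns": 2.0,
                                           "latency_p99_ns": 3.0}}
out = merge_bmf([spsc, lat])
print(sorted(out["lockfreequeues_sipsic/spsc/1p1c"]))
```

The collision axis being (slug, measure) rather than slug alone is what lets one slug carry both throughput and latency history in Bencher.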
* PR 2 Task 2.10: delete bench_throughput.nim, rewire consumers

Removes benchmarks/nim/bench_throughput.nim now that the five topology-split binaries (bench_spsc / bench_mpsc / bench_mpmc / bench_unbounded + bench_latency from PR 1) cover its full slug surface. The pre-split fixture committed in Task 2.1 plus the superset_check.py guard wired into bench.yml in Task 2.8 enforce that no slug from the legacy binary silently disappears across the split.

Live consumers updated to reference the new binaries:
- benchmarks/runner.py - builds + runs all 5 binaries, then merges fragments via merge_bmf.py.
- lockfreequeues.nimble - `nimble benchmarks` task iterates the 5 binaries and merges to results/latest.json.
- benchmarks/README.md - rewritten to describe the 5-binary pipeline, the matrix CI job, and the merged BMF schema (single slug can carry both throughput and latency measures).
- .gitignore - tracks the four new throughput binaries; also ignores the worktree-local `deps` symlink that nim.cfg's relative path expects.
- tests/t_bench_common.nim - removes the obsolete Task 0.10 bench_throughput integration suite (now covered by t_topology_split.nim per topology binary).
- benchmarks/nim/adapters/lockfreequeues_unbounded_mupsic_adapter.nim - doc-comment rewires from bench_throughput to bench_unbounded.

Comment-only references in bench_common.nim, bench_latency.nim, render_readme.nim, and the new bench_*.nim files describe the legacy binary as historical context (the legacy `runThroughputBenchmark` proc / Mupmuc registration pattern that bench_common.runThroughputHarness factored out) and stay verbatim.

* PR 2 Task 2.11: CHANGELOG entry under [Unreleased] for bench-rollup PR 2

Adds Track 2 PR 2 entries to the [Unreleased] section under Added, Changed, and Removed:
- Added: 4 new topology-split throughput binaries (bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded), superset_check.py deletion-safety guard, 5-input merge_bmf union test.
- Changed: bench.yml matrix over 5 binaries with per-step timeouts + superset_check.py wired into bench-upload; runner.py and `nimble benchmarks` task iterate the 5 binaries; README rewrite.
- Removed: bench_throughput.nim (replaced by the 4 topology-split binaries; pre-split fixture + superset_check.py guard pin deletion-safety).

* fix(bench): tighten bench_unbounded CI shape to fit 18-min budget

bench_unbounded matrix item was timing out at 10 minutes after only 4 of 16 shapes (sipsic 1p1c, sipmuc 1p1c/1p2c/1p4c) due to oversubscribed shapes on 4-vCPU runners. Reduce message counts and runs across the heavy fan-out variants (mupmuc 9-shape grid is the main offender) and bump the job timeout to 18 min for cushion. Statistical signal is preserved for Bencher trend tracking; absolute throughput numbers will shift but per-shape comparisons remain meaningful.

- UnboundedSipsicMessageCount: 500K -> 200K (1 shape, runs kept at 3)
- UnboundedSipmuc: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupsic: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupmuc: 500Kx3 -> 50Kx2 (9 shapes)
- BenchUnboundedWarmup: 2 -> 1
- timeout-minutes: 12 -> 18

* fix(bench): gate oversubscribed C>=4 unbounded shapes behind a define

bench_unbounded was hanging in CI on the unbounded_sipmuc 1p4c shape: 4 consumers compete in a strict round-robin slot order (each consumer must take its turn before another can pop), and on a 4-vCPU runner one descheduled consumer blocks every other. The same starvation pattern hits every unbounded mupmuc shape with C>=4.

Add `-d:BenchSkipOversubscribed` (set in PR CI bench.yml) that drops:
- unbounded_sipmuc/mpmc_unbounded/1p4c
- unbounded_mupmuc/mpmc_unbounded/{1p4c, 2p4c, 4p1c, 4p2c, 4p4c}

mupsic shapes (multi-producer, single consumer) are kept regardless: the round-robin ordering exists on consumers only, so a single consumer cannot starve. mupsic 4p1c is also in tests/fixtures/pre-split-slugs.json and the deletion-safety guard would fail without it.
The full grid still runs in `bench-comparison.yml` nightly cron / on workflow_dispatch where a beefier runner is available. PR CI now gets the meaningful subset (4 mupmuc shapes covering each P,C in {1,2}) that fits a 4-vCPU runner without hanging.

* fix(bench): document p95 in BMF schema; clean up runner.py fragment files

Per Gemini review on PR #21.

1. README.md schema example listed only latency_p50_ns + latency_p99_ns, but bench_latency.nim emits all three (p50/p95/p99). Added p95 to the schema block and the prose paragraph below it. The Metrics section already correctly listed p50/p95/p99.
2. runner.py left N timestamped fragment files in RESULTS_DIR after each merged run (one per Nim binary) — only the merged file is the canonical artifact. Wrap the run+merge in try/finally and unlink each fragment after, so a bench-binary failure or a merge collision still cleans up whatever was produced.

* fix(bench): run upload on partial failures; correct timeout-comment

Three CI hardening fixes from code review on PR 2:

1. bench-upload had needs:[bench] without if:always(), so any single matrix-leg failure would skip the merge+upload entirely. Per the job comment the design is partial-coverage tolerant: surviving binaries should still be merged and uploaded. Add if: always() && (fork-PR guard) and a guard step that fails loudly when zero JSON fragments are present.
2. The per-binary upload-artifact step now runs with if: always() and if-no-files-found: ignore so a step-level timeout flushes any partial JSON the binary wrote, and a binary that crashed before writing the file does not turn the matrix leg red for the merger.
3. Job and step comments referenced a 12-min timeout while the configured value was 18 min. Update both comments to 18.

* fix(bench): drop unparseable BMF fragments before merge

bench-upload runs with if: always(), so a matrix leg that crashed mid-write can leave a truncated JSON fragment.
merge_bmf.py exits 1 on the first malformed input, blocking uploads from surviving legs. Pre-validate with json.load and drop any that fail, emitting a ::warning:: per dropped file.

* perf(bench): backoffOnPeerWait in busy-spin loops; document teardown order

- Add backoffOnPeerWait to producer/consumer busy loops in bench_mpmc (Sipmuc producer + Sipmuc consumer + Mupmuc consumer) and bench_mpsc (Mupsic producer + consumer) to avoid CPU starvation in oversubscribed runs. Mupmuc producer already had it.
- Add backoffOnPeerWait to all four UnboundedSipsic/USipmuc/UMupsic/UMupmuc consumer loops in bench_unbounded.nim.
- Document the per-iteration teardown order in runUSipmucShape and reference it from runUMupmucShape: reset(q) MUST precede reset(manager[]) because UnboundedSipmuc/Mupmuc =destroy unbinds from the manager, which then asserts clientCount==0. Note that under Nim 2.x ARC/ORC, system.reset(x) is =destroy(x) + =wasMoved(x) — it does NOT skip destructors (verified against project's own =destroy implementations in src/lockfreequeues/unbounded_sipmuc.nim:reset(self.manager[])).

* docs(bench_unbounded): defend `create(T)` as alloc0, not destructor-on-uninit

Reviewer flagged that `create(DebraManager[...])` followed by `manager[] = initDebraManager[...]()` "runs destructors on uninitialised memory." It does not. `proc create*(T: typedesc, size)` in system/memalloc.nim is `cast[ptr T](alloc0(sizeof(T) * size))` — zero-initialised allocation with no hook invocation. The follow-up assignment then performs the only construction.

Add an explanatory note in the per-iteration teardown block comment (runUSipmucShape) so this doesn't keep getting flagged. The other harness (runUMupmucShape) already back-references that block comment. No behaviour change.
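The fragment pre-validation from the drop-unparseable-fragments fix is a json.load gate in front of the merge; a minimal sketch, where the `valid_fragments` helper name is illustrative and the `::warning::` prefix is the standard GitHub Actions annotation format:

```python
import json
import tempfile
from pathlib import Path

def valid_fragments(paths):
    """Return only the fragment files that parse as JSON; warn on the rest."""
    good = []
    for p in map(Path, paths):
        try:
            with p.open() as f:
                json.load(f)
        except (json.JSONDecodeError, OSError):
            # Surfaced by GitHub Actions as a workflow warning annotation.
            print(f"::warning::dropping unparseable BMF fragment {p}")
            continue
        good.append(p)
    return good

# A matrix leg killed mid-write leaves a truncated fragment behind.
d = Path(tempfile.mkdtemp())
(d / "ok.json").write_text('{"slug": {}}')
(d / "truncated.json").write_text('{"slug": {')  # crashed mid-write
print([p.name for p in valid_fragments(sorted(d.glob("*.json")))])  # ['ok.json']
```

Filtering before the merge preserves merge_bmf.py's strict exit-1-on-malformed-input contract while still letting surviving legs upload.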
* fix(bench): pass -d:danger in runner.py build_nim() to match CI

CI compiles topology-split bench binaries with -d:release -d:danger; runner.py was missing -d:danger, so locally produced numbers were not comparable to the cloud baseline.

* test: t_unbounded_padding red — segment alloc returns 16B-aligned, expects 64B

PR 3 Task 3.1 (cache-line padding audit, step 1 of 3). Verifies bounded variants already conform; introduces the failing test that will drive the unbounded fix in 3.2 and 3.3.

Bounded MPMC audit (read-only verification, design doc §4.2):
- src/lockfreequeues/sipsic.nim:19-20 head/tail Atomic[int] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/sipmuc.nim:42-43 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupsic.nim:42-43 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupmuc.nim:43-44 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓

The bounded MPMC cell array also uses the documented array-of-cells trick (typestates/mpmc_cell.nim:11) so adjacent slots never share a cache line. No bounded change required.

Unbounded variants currently FAIL both audit conditions:
- Segment fields (head, tail, prevConsumerIdx, committed) lack {.align: CacheLineBytes.} pragmas, so they share cache lines with neighbouring data.
- newSegment() uses c_calloc, which on x86_64 Linux glibc returns 16-byte aligned memory (MALLOC_ALIGNMENT = 2*sizeof(size_t)). The {.align.} pragma controls only the intra-struct offset, not the base.

Test additions:
- tests/t_unbounded_padding.nim — 8 assertions across 4 unbounded variants: (a) offsetOf(SegmentField) mod CacheLineBytes == 0 for the producer/consumer coordination atomics. (b) cast[uint](segPtr) mod CacheLineBytes == 0 for a freshly-allocated Segment via the queue's allocator.
- Test-only accessors guarded by `when defined(testing):` in each unbounded module: headSegmentForTest (returns pointer) and segmentHeadOffsetForTest (returns offsets of audit-relevant fields).
  Compiled away in release.
- Wired into tests/test.nim aggregator.

RED state confirmed:

    Unbounded queue Segment cache-line padding
    [Summary] 8 tests run: 0 OK, 8 FAILED
    off.tail mod 64 was 8 (expected 0)
    off.prevConsumerIdx mod 64 was 16
    off.committed mod 64 was 24
    cast[uint](segPtr) mod 64 was 16

Existing test suite continues to compile (verified `nim c tests/test.nim`).

Refs: design doc §4.2; impl plan Task 3.1.

* feat(internal): add aligned_alloc.allocAligned via posix_memalign

PR 3 Task 3.2 (cache-line padding audit, step 2 of 3). Adds the cache-line-aligned heap allocator that replaces c_calloc in the unbounded queue Segment allocation path (call sites swap in Task 3.3).

Compile probe (impl plan §3.2.0):

    $ nim c -r tmp_probe.nim
    rc=0 align=0

verifying that std/posix.posix_memalign is available on macOS (libSystem) and Linux glibc with the same C signature, returning 64-byte aligned memory on success. Probe file deleted post-verification per impl plan.

Implementation (src/lockfreequeues/internal/aligned_alloc.nim):
- ``proc allocAligned*[T](): ptr T`` — calls posix_memalign with CacheLineBytes (64) alignment, zero-inits, raises OutOfMemDefect on rc != 0. Caller releases via the standard c_free per POSIX guarantee that posix_memalign blocks are free-compatible.
- Re-exports CacheLineBytes from atomic_dsl for callers in 3.3.
- isMainModule smoke confirms tail-call alignment.

Tests (tests/t_aligned_alloc.nim, wired into tests/test.nim):
- Tiny payload (sizeof < 64).
- Line-sized payload (sizeof == 64).
- Big payload (sizeof > 64).
- 64 sequential allocations all 64B-aligned.

All 4 cases assert pointer alignment AND zero-initialization. Result: internal/aligned_alloc.allocAligned .... [Summary] 4 tests run: 4 OK, 0 FAILED.

The unbounded padding audit test (tests/t_unbounded_padding.nim) is still RED at this commit; Task 3.3 swaps the Segment allocation call sites and adds {.align: CacheLineBytes.} pragmas, taking it GREEN.

Refs: design doc §4.2; impl plan Task 3.2.
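The two audit conditions above (intra-struct offsets and base-pointer alignment) can be illustrated outside Nim with Python's ctypes. The `PaddedSegment` layout below is a simplified stand-in for the real Segment type, and the round-up emulates what posix_memalign guarantees in the Nim fix:

```python
import ctypes

CACHE_LINE = 64

class PaddedSegment(ctypes.Structure):
    # Explicit pad arrays push each coordination field onto its own
    # 64-byte cache line (analogous to {.align: CacheLineBytes.}).
    _fields_ = [
        ("head", ctypes.c_uint64),
        ("_pad0", ctypes.c_uint8 * (CACHE_LINE - 8)),
        ("tail", ctypes.c_uint64),
        ("_pad1", ctypes.c_uint8 * (CACHE_LINE - 8)),
    ]

# Audit condition (a): field offsets are cache-line multiples.
assert PaddedSegment.head.offset % CACHE_LINE == 0
assert PaddedSegment.tail.offset % CACHE_LINE == 0

# Audit condition (b): the base address must also be cache-line aligned.
# glibc calloc only guarantees 16 bytes on x86_64, so intra-struct
# padding alone is not enough; here we emulate an over-aligned base by
# over-allocating and rounding up (posix_memalign does this natively).
raw = ctypes.create_string_buffer(ctypes.sizeof(PaddedSegment) + CACHE_LINE - 1)
aligned = (ctypes.addressof(raw) + CACHE_LINE - 1) & ~(CACHE_LINE - 1)
assert aligned % CACHE_LINE == 0
print("offsets:", PaddedSegment.head.offset, PaddedSegment.tail.offset)
```

Both conditions must hold together: a padded struct at a 16-byte base still straddles cache lines, which is exactly the RED state the audit test recorded.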
* fix(unbounded): pad Segment fields and base-align via posix_memalign

PR 3 Task 3.3 (cache-line padding audit, step 3 of 3 — GREEN). Applies the cache-line padding fix to all four unbounded queue variants and their per-segment producer/consumer atomics. Resolves the asymmetry called out in design doc §4.2: bounded variants were already padded; unbounded Segments were not, so cross-library comparison numbers in PR 3 would have been biased against lockfreequeues' unbounded types.

Production changes (4 files):
- src/lockfreequeues/unbounded_sipsic.nim: Segment.{next, head, tail} + UnboundedSipsic.{headSegment, tailSegment} gain {.align: CacheLineBytes.}. newSegment swaps c_calloc -> allocAligned.
- src/lockfreequeues/unbounded_sipmuc.nim: Segment.{next, tail, prevConsumerIdx} + UnboundedSipmuc.{headSegment, tailSegment} gain alignment. newSegment swaps to allocAligned.
- src/lockfreequeues/unbounded_mupsic.nim: Segment.{next, tail, committed} + UnboundedMupsic.{headSegment, tailSegment} gain alignment. newSegment swaps to allocAligned. ``head: int`` (single consumer, non-atomic) is intentionally left unpadded — only the consumer touches it.
- src/lockfreequeues/unbounded_mupmuc.nim: Segment.{next, tail, prevConsumerIdx, committed} + UnboundedMupmuc.{headSegment, tailSegment} gain alignment. newSegment swaps to allocAligned. The ``c_calloc`` import is dropped; ``c_free`` retained for segment release (POSIX guarantees posix_memalign blocks are free-compatible).

Cross-platform fix in aligned_alloc.nim:
- ``std/posix.posix_memalign`` triggers a clang error under ``nim cpp`` on macOS because the SDK declares the first parameter ``__unsafe_indexable`` (incomplete-type ``void *`` vs the wrapper's ``void **``).
- Replaced the ``import std/posix`` with a local ``importc, header: "<stdlib.h>"`` shim using the canonical C signature, which clang accepts in both C and C++ modes. Documented in aligned_alloc.nim header comment.
Verification:
- tests/t_unbounded_padding.nim: 8/8 GREEN under both nim c and nim cpp.
- Full suite under nim c (orc), nim cpp, --mm:arc, --mm:refc: [Summary] 200 tests run: 200 OK, 0 FAILED, 0 SKIPPED (all four variants)
- t_unbounded_*_threaded suites pass — no false-sharing-induced timing regression in the threaded code paths.

Refs: design doc §4.2; impl plan Task 3.3.

* feat(bench): add loony_adapter (unbounded MPMC) behind compile gate

PR 3 Task 3.5 (comparison MVP, first FFI adapter). Adds the Loony unbounded MPMC adapter and a parametric round-trip test in tests/t_bench_adapters.nim, both gated on -d:adapter_loony_available so the in-tree test suite passes vacuously when the package is absent.

Files:
- benchmarks/nim/adapters/loony_adapter.nim — adapter, topologiesSupported = {tMpmcUnbounded}, makeLoonyAdapter / push / pop / cleanup / name.
- tests/t_bench_adapters.nim — push/pop 1000 uint64 round-trip preserves set; parametric template extends to subsequent adapters in 3.6–3.10.

Loony storage-encoding fix (NOT a hack — required by Loony's design): Loony reserves the low 3 bits of each slot for RESUME/WRITER/READER state flags (loony/spec.nim: SLOTMASK = ~0x7). Loony's empty-slot detection reads ``(prev and SLOTMASK) == 0``, treating a zero "pointer" payload as "slot empty". Loony was designed for ref/ptr types whose addresses are 8-byte-aligned and never null. Pushing the literal value 0 (or any value whose low 3 bits are non-zero) breaks Loony's pop path.

Adapter encodes the harness's ``uint64(i)`` payload as ``(v + 1) shl 3`` on push and ``(raw shr 3) - 1`` on pop:
- ``shl 3`` keeps the low flag bits clear.
- ``+ 1`` ensures the encoded value is never zero.

This is documentation-equivalent to how Loony users push 8-byte-aligned, non-null ref addresses; it does NOT change Loony's algorithm or affect the comparison fairness — every push/pop still goes through Loony's full fast/slow paths.
First-iteration RED state (recorded for the audit):
- Plain push of ``uint64(i)`` returned only 125 of 1000 items, all multiples of 8 (i.e. the values whose ``i and 7 == 0``).
- ``shl 3`` alone returned 999 of 1000 items (everything except 0, which collided with Loony's empty-slot sentinel).
- ``(v + 1) shl 3`` returned all 1000 items, set-equal to the input.

Verification: tests/t_bench_adapters.nim with -d:adapter_loony_available + --path:loony-0.3.1 --path:arc-0.0.4 -> [Summary] 1 tests run: 1 OK. Full in-tree suite (-d:testing without the loony gate) -> 200/200 OK.

Refs: design doc §3 PR 3; impl plan Task 3.5; loony/spec.nim SLOTMASK.

* feat(bench): add boost_lockfree_queue adapter (MPMC bounded, nim cpp)

Boost.LockFree is C++ header-only (BSL-1.0). Adapter wraps boost::lockfree::queue<unsigned long long> with capacity-bounded push (bounded_push) and pop. Topology: tMpmc.

Compile gates:
- -d:adapter_boost_lockfree_queue_available -- enable
- nim cpp -- required (else hard error)

Pulls headers from the conventional install paths (apt /usr/include, brew /opt/homebrew/opt/boost/include, /usr/local/include) and accepts -d:boostIncludeDir=<path> for non-standard locations. Adds smoke_boost.nim (compile-and-run sanity for CI) and a parametric round-trip test in tests/t_bench_adapters.nim gated by both the define and 'cpp' build mode.

Adapter capacity argument is the fixed node-pool size; pushes that exceed it return prFull. Default 1024 mirrors other bounded adapters.

* feat(bench): add boost_lockfree_spsc adapter (SPSC bounded, nim cpp)

Boost.LockFree's spsc_queue is a wait-free single-producer single-consumer ring buffer. The pop overload used here -- pop(out, max_size) -- returns the count of items popped (csize_t); 1 means success, 0 means empty.

Compile gates:
- -d:adapter_boost_lockfree_spsc_available -- enable
- nim cpp -- required

Topology: tSpsc. Capacity is the fixed ring size (default 1024).
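The Loony payload encoding described above is plain bit arithmetic and round-trips losslessly; a sketch in Python, with the SLOTMASK value taken from the commit message's reference to loony/spec.nim:

```python
SLOTMASK = ~0x7  # low 3 bits reserved for Loony's RESUME/WRITER/READER flags

def encode(v: int) -> int:
    # (v + 1) shl 3: low 3 bits stay clear for flags, result is never zero,
    # so it never collides with Loony's zero empty-slot sentinel.
    return (v + 1) << 3

def decode(raw: int) -> int:
    # (raw shr 3) - 1: exact inverse of encode.
    return (raw >> 3) - 1

for i in range(1000):
    e = encode(i)
    assert e != 0              # never the empty-slot sentinel
    assert e & ~SLOTMASK == 0  # flag bits start clear
    assert decode(e) == i      # lossless round trip
print("round trip ok for 0..999")
```

Without the `+ 1`, encoding 0 yields 0 and trips the empty-slot sentinel, which is exactly the 999-of-1000 RED state recorded above.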
Header search paths and indirection rationale identical to boost_lockfree_queue adapter; smoke_boost.nim covers both adapters under one binary.

* feat(bench): add bench-ffi-crossbeam Rust cdylib (ArrayQueue + SegQueue)

C-ABI shim around crossbeam_queue::ArrayQueue<u64> (bounded MPMC) and crossbeam_queue::SegQueue<u64> (unbounded MPMC). Exposes 8 extern "C" fns (cb_array_init/push/pop/destroy + cb_seg_init/push/pop/destroy) that the Nim adapters in tasks 3.9 / 3.10 will importc.

- Cargo.toml: crate-type = [cdylib, rlib]; rlib so cargo test can link the integration test against the same code path the cdylib exports.
- rust-toolchain.toml pins stable channel for reproducibility.
- src/lib.rs: 8 extern fns; null-pointer safe; ArrayQueue::new(0) panic is intercepted (returns null) so we never unwind across FFI.
- tests/integration_test.rs: 6 tests covering round-trip set equality for both queues, full/empty edges, and null-pointer tolerance.

Build: cargo build --release -> target/release/libbench_ffi_crossbeam.{so,dylib}

.gitignore extended for benchmarks/rust/*/target/ and Cargo.lock.

* feat(bench): add crossbeam_array_queue adapter (MPMC bounded, FFI cdylib)

importc the cb_array_init/push/pop/destroy fns from benchmarks/rust/bench-ffi-crossbeam (Task 3.8) and surface them as a typed Nim adapter with topologiesSupported = {tMpmc}. The adapter avoids emitting -L/-lbench_ffi_crossbeam when its sibling crossbeam_seg_queue_adapter gate is also active, so combined builds do not produce 'ignoring duplicate libraries' warnings on Apple ld.

Adds smoke_crossbeam.nim covering both crossbeam adapters under one binary (Task 3.10 lands the seg adapter); tests/t_bench_adapters.nim extended with the gated parametric round-trip suite.
Compile gates:
- -d:adapter_crossbeam_array_queue_available
- -d:crossbeamLibDir=<path> (override search path for non-default builds)

* feat(bench): add crossbeam_seg_queue adapter (MPMC unbounded, FFI cdylib)

importc the cb_seg_init/push/pop/destroy fns from the bench-ffi-crossbeam cdylib. SegQueue is unbounded -- push always succeeds (returns prSuccess when the handle is non-null; the underlying cb_seg_push only reports failure on null pointers). topologiesSupported = {tMpmcUnbounded}.

capacity arg accepted-but-ignored so the wire-up sites (Task 3.11) can pass capacity uniformly across all adapters without a per-adapter conditional.

Compile gate: -d:adapter_crossbeam_seg_queue_available
Optional override: -d:crossbeamLibDir=<path>

* feat(bench): wire MVP comparison adapters into bench_spsc/mpmc/unbounded

Each MVP adapter is pulled in only when its compile-time gate is set; absent gates leave production builds unchanged. Variant lists are populated at compile time via 'when declared(...)' so the CLI's --list / unknown-variant error messages reflect the actual compiled-in set.

Wiring map (per design 2.4 / Track 3 §3.11):
- bench_spsc: boost_lockfree_spsc -> spsc/1p1c (gated by adapter_boost_lockfree_spsc_available)
- bench_mpmc: boost_lockfree_queue -> mpmc/{1,2,4}p{1,2,4}c; crossbeam_array_queue -> mpmc/{1,2,4}p{1,2,4}c (each gated by its respective adapter_*_available define)
- bench_unbounded: loony -> mpmc_unbounded/{1,2,4}p{1,2,4}c; crossbeam_seg_queue -> mpmc_unbounded/{1,2,4}p{1,2,4}c

bench_mpsc has no MVP coverage today: none of the MVP libraries provide a dedicated MPSC API at the natural shape; design 2.4 is silent on MPSC parity for Track 3 and the existing in-tree mupsic baseline stays the sole MPSC slug.

Verified locally: bench_mpmc + bench_unbounded produce expected crossbeam_* slugs at the small fixture (10K msg, 2 runs).
* ci(bench): add boost+loony soft-skip flow per design §2.6

workflow_dispatch inputs force_skip_boost / force_skip_loony surface the override hook; FORCE_SKIP_BOOST / FORCE_SKIP_LOONY env vars relay them into step-level if: gates. Per library: install -> smoke -> set ADAPTER_*=1 on success -> annotate on any failure. The combined defines + compile-mode (c vs cpp) flow into the per-binary compile step via a final 'Build ada…
Depends on PR #19, PR #20.
Track 2 of the bench-rollup feature. Splits the legacy `benchmarks/nim/bench_throughput.nim` into four topology-specific throughput binaries plus the latency binary already shipped in PR 1, and rewires CI to a GitHub Actions matrix with per-step timeouts so a hang in one variant cannot burn the entire workflow's wall-clock budget.
What ships
Four new throughput binaries replacing `bench_throughput.nim`:

- `benchmarks/nim/bench_spsc.nim` — Sipsic at `1p1c`.
- `benchmarks/nim/bench_mpsc.nim` — Mupsic at `{1,2,4}p1c`.
- `benchmarks/nim/bench_mpmc.nim` — Mupmuc at the `{1,2,4}p{1,2,4}c` grid plus `8p8c` (oversubscription regression case for the issue 15 livelock fix), Sipmuc at `1p{1,2,4}c`, and Nim system channels at `{1,2,4}p{1,2,4}c`.
- `benchmarks/nim/bench_unbounded.nim` — all four lockfreequeues unbounded variants (sipsic / sipmuc / mupsic / mupmuc) at their natural shapes; the DEBRA-using variants register on-thread before push/pop.

Each binary owns its own per-binary `{.intdefine.}` set per design 2.5 (`-d:BenchSpscRuns / MessageCount / Warmup`, `-d:BenchMpscRuns / ...`, `-d:BenchMpmcRuns / ...`, plus four pairs of `-d:Unbounded<Variant>Runs / MessageCount`) so CI can budget each topology independently.
Deletion-safety
Task 2.1 captured the full slug set the legacy `bench_throughput` emits at the same small-override knobs the topology-split tests use and committed it to `tests/fixtures/pre-split-slugs.json` (11 slugs: sipsic, mupmuc 1/2/4/8 grid, unbounded_mupsic 1/2/4, channels 1/2/4 grid). New `benchmarks/scripts/superset_check.py` verifies the post-split BMF (union of the four throughput binaries through `merge_bmf.py`) covers every slug in the fixture; it exits 1 with the missing slugs alpha-listed on stderr otherwise. The check is wired into the bench workflow between `merge_bmf.py` and `bencher run`, so any silent slug regression introduced by future edits to the topology binaries fails the PR check.
CI shape
`bench.yml` now runs the five topology-split binaries (`bench_spsc`, `bench_mpsc`, `bench_mpmc`, `bench_unbounded`, `bench_latency`) as a `strategy.matrix` job with `timeout-minutes: 12` per matrix entry and `timeout-minutes: 10` on the inner run step (fail fast if a variant hangs). Per-binary artifact names follow the existing `bench-*-bmf` pattern; the `bench-upload` job downloads them all, merges, runs `superset_check.py`, and performs a single `bencher run` so the multi-measure-per-slug history shape from PR 1 still works.
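A hedged sketch of the matrix shape this describes — job and step names, the binary output path, and `fail-fast: false` are illustrative assumptions; only the `strategy.matrix`, `timeout-minutes`, and `bench-<binary>-bmf` structure is taken from the description:

```yaml
jobs:
  bench:
    strategy:
      fail-fast: false      # assumption: one hung leg must not cancel the rest
      matrix:
        binary: [bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency]
    timeout-minutes: 12     # independent per-matrix-entry budget
    steps:
      - name: Run ${{ matrix.binary }}
        timeout-minutes: 10 # inner run step fails fast on a hang
        run: ./bin/${{ matrix.binary }} --bmf-out ${{ matrix.binary }}.json
      - uses: actions/upload-artifact@v4
        with:
          name: bench-${{ matrix.binary }}-bmf
          path: ${{ matrix.binary }}.json
```

With this shape a hang consumes at most one entry's budget, and `bench-upload` still merges whatever artifacts the surviving entries produced.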
Tests
- `tests/t_topology_split.nim` (new): compile + run each new binary at small overrides, parse emitted BMF, assert expected slug grid. Deletion-safety suite runs the full pipeline end-to-end.
- `benchmarks/tests/test_merge_bmf.py` gains `test_five_input_union` (1 fragment per topology binary, with the bench_spsc + bench_latency slug collision exercising the multi-measure-per-slug shape).
- `benchmarks/tests/test_superset_check.py` (new): 9 cases covering identical sets, strict superset, missing slugs (with alpha-sorted stderr listing), empty pre, wrong arg count, missing pre file, malformed JSON, and top-level non-object.
Test counts (all green locally):
Removed
- `benchmarks/nim/bench_throughput.nim` (replaced by the four topology-split binaries; pre-split fixture + superset_check.py guard pin deletion-safety).
- Rewired consumers: `benchmarks/runner.py`, the `lockfreequeues.nimble` task `benchmarks`, `benchmarks/README.md`, `.gitignore`, and `tests/t_bench_common.nim` (legacy Task 0.10 integration suite removed; equivalent coverage now in `t_topology_split.nim`).