
bench-rollup PR 2: topology split (5 binaries) + per-step CI timeouts#21

Closed
elijahr wants to merge 18 commits into feat/bench-rollup-pr1-latency-mvp from feat/bench-rollup-pr2-topology-split

Conversation

@elijahr
Owner

@elijahr elijahr commented May 1, 2026

Depends on PR #19, PR #20.

Track 2 of the bench-rollup feature. Splits the legacy
benchmarks/nim/bench_throughput.nim into four topology-specific
throughput binaries plus the latency binary already shipped in PR 1,
and rewires CI to a GitHub Actions matrix with per-step timeouts so a
hang in one variant cannot burn the entire workflow's wall-clock
budget.

What ships

Four new throughput binaries replacing bench_throughput.nim:

  • benchmarks/nim/bench_spsc.nim — Sipsic at 1p1c.
  • benchmarks/nim/bench_mpsc.nim — Mupsic at {1,2,4}p1c.
  • benchmarks/nim/bench_mpmc.nim — Mupmuc at the
    {1,2,4}p{1,2,4}c grid plus 8p8c (oversubscription regression
    case for the issue 15 livelock fix), Sipmuc at 1p{1,2,4}c, and
    Nim system channels at {1,2,4}p{1,2,4}c.
  • benchmarks/nim/bench_unbounded.nim — all four lockfreequeues
    unbounded variants (sipsic / sipmuc / mupsic / mupmuc) at their
    natural shapes; the DEBRA-using variants register on-thread before
    push/pop.

Each binary owns its own {.intdefine.} set per design 2.5
(-d:BenchSpscRuns / MessageCount / Warmup, -d:BenchMpscRuns / ...,
-d:BenchMpmcRuns / ..., plus four pairs of
-d:Unbounded<Variant>Runs / MessageCount), so CI can budget each
topology independently.

Deletion-safety

Task 2.1 captured the full slug set the legacy bench_throughput
emits under the same small-override knobs the topology-split tests
use and committed it to tests/fixtures/pre-split-slugs.json (11
slugs: sipsic, mupmuc 1/2/4/8 grid, unbounded_mupsic 1/2/4,
channels 1/2/4 grid). The new benchmarks/scripts/superset_check.py
verifies that the post-split BMF (the union of the four throughput
binaries through merge_bmf.py) covers every slug in the fixture;
otherwise it exits 1 with the missing slugs listed alphabetically
on stderr. The check is wired into the bench workflow between
merge_bmf.py and bencher run, so any silent slug regression
introduced by future edits to the topology binaries fails the PR
check.
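The core of the check is a set comparison between the fixture and the merged BMF's top-level slugs. A minimal sketch of the idea (not the actual superset_check.py, whose argument validation and error handling are richer; the fixture is assumed here to be a JSON array of slugs):

```python
import json
import sys


def missing_slugs(pre_path: str, post_path: str) -> list[str]:
    """Return pre-split slugs absent from the post-split BMF, sorted alphabetically."""
    with open(pre_path, encoding="utf-8") as f:
        pre = set(json.load(f))          # fixture: assumed JSON array of slugs
    with open(post_path, encoding="utf-8") as f:
        post = set(json.load(f).keys())  # BMF: top-level object keyed by slug
    return sorted(pre - post)


if __name__ == "__main__" and len(sys.argv) == 3:
    missing = missing_slugs(sys.argv[1], sys.argv[2])
    if missing:
        # Exit 1 with the missing slugs on stderr, mirroring the
        # behavior described above.
        print("\n".join(missing), file=sys.stderr)
        sys.exit(1)
```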

CI shape

bench.yml now runs the five topology-split binaries
(bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded,
bench_latency) as a strategy.matrix job with
timeout-minutes: 12 per matrix entry and timeout-minutes: 10 on
the inner run step (fail fast if a variant hangs). Per-binary
artifact names follow the existing bench-*-bmf pattern; the
bench-upload job downloads them all, merges, runs
superset_check.py, and performs a single bencher run so the
multi-measure-per-slug history shape from PR 1 still works.
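In outline, the matrix shape described above might look like the following fragment (step names, the run command, and paths are illustrative, not the exact bench.yml):

```yaml
jobs:
  bench:
    strategy:
      fail-fast: false
      matrix:
        binary: [bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency]
    timeout-minutes: 12        # per matrix entry
    steps:
      - name: Run ${{ matrix.binary }}
        timeout-minutes: 10    # inner step budget: a hang fails this leg only
        run: .tmp/${{ matrix.binary }} --bmf-out=${{ matrix.binary }}.json
      - uses: actions/upload-artifact@v4
        with:
          name: bench-${{ matrix.binary }}-bmf
          path: ${{ matrix.binary }}.json
```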

Tests

  • tests/t_topology_split.nim (new): compile + run each new binary
    at small overrides, parse emitted BMF, assert expected slug grid.
    Deletion-safety suite runs the full pipeline end-to-end.
  • benchmarks/tests/test_merge_bmf.py gains
    test_five_input_union (1 fragment per topology binary, with the
    bench_spsc + bench_latency slug collision exercising the
    multi-measure-per-slug shape).
  • benchmarks/tests/test_superset_check.py (new): 9 cases
    covering identical sets, strict superset, missing slugs (with
    alpha-sorted stderr listing), empty pre, wrong arg count, missing
    pre file, malformed JSON, and top-level non-object.
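The per-slug measure union that merge_bmf.py performs (and that test_five_input_union exercises) can be sketched roughly as follows — names and error handling are simplified from the real script:

```python
import json


def merge_bmf(paths: list[str]) -> dict:
    """Union per-slug measure dicts across BMF fragments.

    Two fragments emitting the same (slug, measure) pair is a
    collision: the real script exits 1; this sketch raises.
    """
    merged: dict[str, dict] = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            fragment = json.load(f)
        for slug, measures in fragment.items():
            dest = merged.setdefault(slug, {})
            for measure, value in measures.items():
                if measure in dest:
                    raise SystemExit(f"collision: ({slug}, {measure})")
                dest[measure] = value
    return merged
```

This is what makes the bench_spsc + bench_latency slug collision in the test benign: the two binaries emit the same slug with different measures, so the dicts union cleanly, while a duplicate measure under one slug still fails.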

Test counts (all green locally):

  • nimble test: 188 / 188
  • t_bench_common: 25 / 25
  • t_bench_latency: 7 / 7
  • t_topology_split: 6 / 6
  • test_merge_bmf: 11 / 11
  • test_superset_check: 9 / 9

Removed

  • benchmarks/nim/bench_throughput.nim (replaced by the four
    topology-split binaries; pre-split fixture + superset_check.py
    guard pin deletion-safety).
  • Live consumer references rewired in benchmarks/runner.py,
    lockfreequeues.nimble task benchmarks, benchmarks/README.md,
    .gitignore, and tests/t_bench_common.nim (legacy Task 0.10
    integration suite removed; equivalent coverage now in
    t_topology_split.nim).


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request decomposes the legacy bench_throughput.nim into four specialized topology-split binaries—bench_spsc, bench_mpsc, bench_mpmc, and bench_unbounded—to enable independent CI budgeting and improved timeout management. It introduces a superset_check.py utility to ensure benchmark coverage is maintained and updates the orchestration logic in runner.py and lockfreequeues.nimble. Review feedback suggests optimizing the new benchmark harnesses by replacing inefficient busy-wait loops with a backoff strategy when queues are full or empty to reduce CPU overhead.

Comment thread benchmarks/nim/bench_mpsc.nim Outdated
) {.thread.} =
  for i in ctx.startIdx ..< ctx.startIdx + ctx.count:
    while not ctx.producer.push(T(i)):
      discard

medium

The producer thread busy-waits with discard when the queue is full, which can lead to excessive CPU consumption. It would be more efficient to use a backoff strategy here, similar to what's done in other benchmark harnesses in this PR.

You will also need to add import lockfreequeues/backoff at the top of the file.

      backoffOnPeerWait()

Comment thread benchmarks/nim/bench_mpmc.nim Outdated
) {.thread.} =
  for i in ctx.startIdx ..< ctx.startIdx + ctx.count:
    while not ctx.queue[].push(T(i)):
      discard

medium

The producer thread uses a tight loop with discard when the queue is full. This is inefficient as it will burn CPU cycles. It would be better to use a backoff strategy, similar to what is done in mupmucProducerThread.

The lockfreequeues/backoff module is already imported, so you can use backoffOnPeerWait() here.

      backoffOnPeerWait()

Comment on lines +94 to +96
    let r = ctx.queue[].pop()
    if r.isSome:
      inc local

medium

The consumer threads in this file use a busy-wait loop when the queue is empty. This is inefficient and burns CPU cycles. It's better to use a backoff strategy.

This same issue is present in usipmucConsumerThread, the consumer loop in runOneUMupsicRun, and umupmucConsumerThread.

You should use backoffOnPeerWait() in an else branch when pop() returns nothing. You will also need to import lockfreequeues/backoff.

    let r = ctx.queue[].pop()
    if r.isSome:
      inc local
    else:
      backoffOnPeerWait()

@github-actions
Contributor

github-actions Bot commented May 2, 2026

🐰 Bencher Report

Branch: feat/bench-rollup-pr2-topology-split
Testbed: ubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

All benchmark results (every measure flagged ⚠️ NO THRESHOLD; values in units):

| Benchmark | latency_p50_ns | latency_p95_ns | latency_p99_ns | throughput_ops_ms |
|---|---|---|---|---|
| lockfreequeues_mupmuc/mpmc/1p1c | 274.09 | 316.64 | 330.82 | 31,314.28 |
| lockfreequeues_mupmuc/mpmc/1p2c | — | — | — | 27,850.90 |
| lockfreequeues_mupmuc/mpmc/1p4c | — | — | — | 18,705.20 |
| lockfreequeues_mupmuc/mpmc/2p1c | — | — | — | 22,177.88 |
| lockfreequeues_mupmuc/mpmc/2p2c | — | — | — | 16,559.51 |
| lockfreequeues_mupmuc/mpmc/2p4c | — | — | — | 16,410.47 |
| lockfreequeues_mupmuc/mpmc/4p1c | — | — | — | 11,719.80 |
| lockfreequeues_mupmuc/mpmc/4p2c | — | — | — | 12,614.80 |
| lockfreequeues_mupmuc/mpmc/4p4c | — | — | — | 11,691.07 |
| lockfreequeues_mupmuc/mpmc/8p8c | — | — | — | 8,094.09 |
| lockfreequeues_mupsic/mpsc/1p1c | 285.91 | 313.55 | 339.55 | 15,032.38 |
| lockfreequeues_mupsic/mpsc/2p1c | — | — | — | 15,644.26 |
| lockfreequeues_mupsic/mpsc/4p1c | — | — | — | 8,776.24 |
| lockfreequeues_sipmuc/mpmc/1p1c | 266.82 | 311.00 | 337.91 | 31,839.66 |
| lockfreequeues_sipmuc/mpmc/1p2c | — | — | — | 29,941.93 |
| lockfreequeues_sipmuc/mpmc/1p4c | — | — | — | 18,111.11 |
| lockfreequeues_sipsic/spsc/1p1c | 389.73 | 429.18 | 442.73 | 9,050.48 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/1p1c | — | — | — | 16,950.22 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/1p2c | — | — | — | 6,817.40 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/2p1c | — | — | — | 11,627.08 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/2p2c | — | — | — | 11,644.33 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/1p1c | — | — | — | 9,873.41 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/2p1c | — | — | — | 8,680.69 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/4p1c | — | — | — | 10,212.56 |
| lockfreequeues_unbounded_sipmuc/mpmc_unbounded/1p1c | — | — | — | 12,029.52 |
| lockfreequeues_unbounded_sipmuc/mpmc_unbounded/1p2c | — | — | — | 12,450.04 |
| lockfreequeues_unbounded_sipsic/spsc_unbounded/1p1c | — | — | — | 20,059.72 |
| nim_channels/mpmc/1p1c | — | — | — | 2,728.99 |
| nim_channels/mpmc/1p2c | — | — | — | 1,203.34 |
| nim_channels/mpmc/1p4c | — | — | — | 318.98 |
| nim_channels/mpmc/2p1c | — | — | — | 5,705.25 |
| nim_channels/mpmc/2p2c | — | — | — | 1,424.69 |
| nim_channels/mpmc/2p4c | — | — | — | 713.93 |
| nim_channels/mpmc/4p1c | — | — | — | 5,085.19 |
| nim_channels/mpmc/4p2c | — | — | — | 2,037.43 |
| nim_channels/mpmc/4p4c | — | — | — | 1,736.22 |
🐰 View full continuous benchmarking report in Bencher

@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 33fc9c8 to 459090f (May 2, 2026 03:50)
@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 466d152 to 6386498 (May 2, 2026 03:50)
@elijahr
Owner Author

elijahr commented May 2, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the benchmarking suite by splitting the monolithic bench_throughput.nim into five specialized binaries: bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, and bench_latency. It introduces a superset_check.py utility and a corresponding fixture to ensure no benchmark slugs are lost during this transition. The runner.py and lockfreequeues.nimble files are updated to orchestrate the new multi-binary pipeline. Feedback highlights a documentation inconsistency regarding latency percentiles and suggests cleaning up temporary JSON fragments in the runner script.

Comment thread benchmarks/README.md
(mean, optional min/max for unbounded variants).
- **Latency**: RTT nanoseconds with percentiles (p50, p95, p99, p999).
(mean, lower=mean-stddev, upper=mean+stddev).
- **Latency**: RTT nanoseconds with percentiles (p50, p95, p99).

medium

There is an inconsistency between the metrics list and the BMF schema description. Line 67 mentions p95 as a supported percentile for latency, but the schema example (lines 118-119) and the merge test case (lines 235-236 in test_merge_bmf.py) only include p50 and p99. Please update the documentation to reflect the actual percentiles emitted by the latency benchmark.

Comment thread benchmarks/runner.py Outdated
Comment on lines +64 to +86
    for bin_name in NIM_BINARIES:
        bin_path = PROJECT_ROOT / ".tmp" / bin_name
        out = output_file.parent / f"{output_file.stem}-{bin_name}.json"
        print(f"  -> {bin_name} -> {out.name}")
        subprocess.run(
            [str(bin_path), f"--bmf-out={out}"],
            check=True,
        )
        fragments.append(out)
    # Merge the 5 BMF fragments into the requested output file via
    # merge_bmf.py. Unions per-slug measure dicts and exits 1 on
    # (slug, measure) collisions.
    print(f"Merging {len(fragments)} fragments -> {output_file}")
    subprocess.run(
        [
            sys.executable,
            str(BENCHMARK_DIR / "merge_bmf.py"),
            str(output_file),
            *[str(p) for p in fragments],
        ],
        check=True,
    )
    print(f"Combined results written to {output_file}")

medium

The run_nim function creates multiple timestamped fragment files in the results directory (e.g., nim-TIMESTAMP-bench_spsc.json) which are never deleted. This leads to unnecessary disk clutter after every benchmark run, especially since the merged file is already preserved. It is recommended to clean up these temporary fragments after the merge step is completed.

    print(f"Running Nim topology-split benchmarks ({runs} runs)...")
    fragments: list[Path] = []
    try:
        for bin_name in NIM_BINARIES:
            bin_path = PROJECT_ROOT / ".tmp" / bin_name
            out = output_file.parent / f"{output_file.stem}-{bin_name}.json"
            print(f"  -> {bin_name} -> {out.name}")
            subprocess.run(
                [str(bin_path), f"--bmf-out={out}"],
                check=True,
            )
            fragments.append(out)
        # Merge the 5 BMF fragments into the requested output file via
        # merge_bmf.py. Unions per-slug measure dicts and exits 1 on
        # (slug, measure) collisions.
        print(f"Merging {len(fragments)} fragments -> {output_file}")
        subprocess.run(
            [
                sys.executable,
                str(BENCHMARK_DIR / "merge_bmf.py"),
                str(output_file),
                *[str(p) for p in fragments],
            ],
            check=True,
        )
        print(f"Combined results written to {output_file}")
    finally:
        for p in fragments:
            try:
                p.unlink(missing_ok=True)
            except OSError:
                pass

@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 459090f to 19c6dcb (May 2, 2026 05:59)
elijahr added a commit that referenced this pull request May 2, 2026
…iles

Per Gemini review on PR #21.

1. README.md schema example listed only latency_p50_ns + latency_p99_ns,
   but bench_latency.nim emits all three (p50/p95/p99). Added p95 to
   the schema block and the prose paragraph below it. The Metrics
   section already correctly listed p50/p95/p99.

2. runner.py left N timestamped fragment files in RESULTS_DIR after
   each merged run (one per Nim binary) — only the merged file is the
   canonical artifact. Wrap the run+merge in try/finally and unlink
   each fragment after, so a bench-binary failure or a merge collision
   still cleans up whatever was produced.
@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 6386498 to 5012581 (May 2, 2026 06:00)
@elijahr
Owner Author

elijahr commented May 2, 2026

/gemini review

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@elijahr
Owner Author

elijahr commented May 2, 2026

/review

@qodo-code-review

qodo-code-review Bot commented May 2, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)



Action required

1. Bench-upload skipped on failure 🐞 Bug ☼ Reliability
Description
bench-upload won’t run when any bench matrix entry fails because it uses needs: [bench]
without an if: always() override, so no partial merge/upload happens. This contradicts the
workflow’s own comment that surviving binaries will still be merged and uploaded when one variant
hangs.
Code

.github/workflows/bench.yml[R172-173]

 bench-upload:
   # Merges per-binary BMF artifacts into a single JSON before a single
Evidence
The bench job explicitly describes the desired behavior (“surviving binaries finish, the
bench-upload job merges what arrived”), but bench-upload depends on the entire bench job via
needs: [bench] and has no always() condition to run when dependencies fail. As a result, any
single matrix leg timing out/failing prevents the merge + Bencher upload step from executing,
eliminating the intended partial coverage.

.github/workflows/bench.yml[33-41]
.github/workflows/bench.yml[172-181]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`bench-upload` is intended to merge/upload whatever benchmark artifacts were successfully produced even if one matrix entry fails/hangs, but it currently won’t run if any `bench` matrix leg fails.
### Issue Context
- `bench` is a matrix job with `fail-fast: false`, so other legs continue.
- However, `bench-upload` has `needs: [bench]` and no `if: always()` override, so it’s skipped when any matrix leg fails.
### Fix Focus Areas
- .github/workflows/bench.yml[33-41]
- .github/workflows/bench.yml[172-181]
### Suggested fix
- Change `bench-upload` to run even if `bench` fails, while still skipping fork PRs:
- e.g. `if: always() && (github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository)`
- Optionally make the per-matrix artifact upload more resilient:
- Add `if: always()` to the upload-artifact step and set `if-no-files-found: ignore` so a timed-out run doesn’t block uploading a partially-written JSON (if present).
- Add a clear guard before merge to fail with a helpful error if `bmf-inputs/` contains zero `*.json` files.




Remediation recommended

2. Truncated JSON breaks merge 🐞 Bug ☼ Reliability ⭐ New
Description
The workflow uploads per-binary JSON artifacts even when the run step fails/timeouts, but
bench-upload merges every downloaded *.json and merge_bmf.py exits 1 on malformed JSON; a truncated
artifact can therefore fail the merge and block uploading results from otherwise successful matrix
legs.
Code

.github/workflows/bench.yml[R232-245]

      - name: Merge BMF JSON
-        # `merge_bmf.py` unions per-slug measure dicts across N inputs and
-        # exits 1 on (slug, measure) collisions. Pure stdlib (no third-
-        # party deps), so no Python install step is needed beyond the
-        # ubuntu-latest default.
+        # `merge_bmf.py` unions per-slug measure dicts across N inputs
+        # and exits 1 on (slug, measure) collisions. Pure stdlib (no
+        # third-party deps), so no Python install step is needed beyond
+        # the ubuntu-latest default.
        #
-        # Inputs live one subdir deep (no `merge-multiple`); enumerate
-        # all `*.json` recursively. `xargs` arranges argv splitting; we
-        # NUL-delimit to be robust against any future filename quirks.
+        # Inputs live one subdir deep under `bmf-inputs/<artifact-name>/`
+        # (no `merge-multiple`); enumerate all `*.json` recursively and
+        # NUL-delimit so filenames with whitespace cannot break argv
+        # splitting in the future.
        run: |
          find bmf-inputs/ -type f -name '*.json' -print0 | \
            xargs -0 python3 benchmarks/merge_bmf.py merged.json
Evidence
bench.yml’s per-binary artifact upload runs under if: always() and is intended to upload whatever
file exists, while bench-upload’s merge step indiscriminately feeds all downloaded *.json files
into merge_bmf.py. merge_bmf.py treats JSON decode failures as fatal and returns exit code 1, so any
truncated/invalid fragment causes the entire merge+upload job to fail instead of producing the
intended partial merged.json from the valid fragments.

.github/workflows/bench.yml[156-178]
.github/workflows/bench.yml[232-245]
benchmarks/merge_bmf.py[111-130]
benchmarks/merge_bmf.py[143-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`bench-upload` merges all downloaded `*.json` fragments, but `merge_bmf.py` exits 1 on any malformed JSON. Because the per-binary artifact upload runs under `if: always()`, a timed-out/aborted bench process could leave a truncated JSON file that still gets uploaded and then causes the merge (and Bencher upload) to fail.

### Issue Context
The workflow intent (per comments) is to still upload partial Bencher coverage when only some matrix legs succeed. A single malformed fragment defeats that by failing the merge step.

### Fix Focus Areas
- .github/workflows/bench.yml[156-178]
- .github/workflows/bench.yml[232-245]

### Suggested change
In `bench-upload`, add a validation/filter step before `merge_bmf.py`, e.g. iterate `bmf-inputs/**/*.json` and:
- attempt `python3 -c 'import json,sys; json.load(open(sys.argv[1],"r",encoding="utf-8"))' <file>`
- delete (or move aside) any fragment that fails to parse, and emit a `::warning::` naming it
- keep the existing “no fragments present” guard, but base it on *valid* fragments

This preserves the partial-coverage goal while keeping `merge_bmf.py` strict.
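A sketch of the validation filter described above (the `bmf-inputs/` layout and the `::warning::` format are assumptions from this review thread, not existing code):

```python
import json
import pathlib


def valid_fragments(root: str) -> list[pathlib.Path]:
    """Keep only fragments that parse as JSON; warn about the rest.

    A truncated artifact from a timed-out matrix leg fails json.load
    and is skipped, so merge_bmf.py only ever sees valid inputs.
    """
    keep: list[pathlib.Path] = []
    for p in sorted(pathlib.Path(root).rglob("*.json")):
        try:
            with p.open(encoding="utf-8") as f:
                json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # GitHub Actions workflow-command syntax for a warning annotation.
            print(f"::warning::skipping malformed fragment {p}")
            continue
        keep.append(p)
    return keep
```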




Advisory comments

3. Timeout comment mismatch 🐞 Bug ⚙ Maintainability
Description
The bench job comment says each matrix entry has a 12-minute job timeout, but the actual
configured timeout-minutes is 18. This is a documentation/config mismatch that can confuse CI
budgeting and troubleshooting.
Code

.github/workflows/bench.yml[R34-56]

+    # Track 2 PR 2: matrix over the 5 topology-split binaries. Each
+    # matrix entry is its own GitHub Actions job with an independent
+    # `timeout-minutes: 12` budget so a hang in one binary cannot burn
+    # the entire workflow's clock — the surviving binaries finish, the
+    # bench-upload job merges what arrived, and the operator gets
+    # partial Bencher coverage rather than no coverage. Per-binary
+    # `-d:` overrides are tuned to fit each variant's CI budget; see
+    # design 2.5 for the override table.
+    name: ${{ matrix.binary }} (ubuntu-latest)
   runs-on: ubuntu-latest
   # PRs from forks cannot read BENCHER_API_TOKEN. Skip rather than fail
   # noisily; maintainer pushes / merges still record the baseline.
   if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository
-    timeout-minutes: 30
+    strategy:
+      fail-fast: false
+      matrix:
+        binary:
+          - bench_spsc
+          - bench_mpsc
+          - bench_mpmc
+          - bench_unbounded
+          - bench_latency
+    timeout-minutes: 18
Evidence
Within the same job definition, the comment documents a 12-minute job budget while the job
configuration sets 18 minutes, so the workflow is self-inconsistent.

.github/workflows/bench.yml[34-41]
.github/workflows/bench.yml[47-56]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The bench matrix job’s comment documents a 12-minute job timeout, but the job-level `timeout-minutes` is set to 18.
### Issue Context
This appears to be either:
- an intentional tuning change (then the comment should be updated), or
- an unintended configuration drift (then `timeout-minutes` should be set back to 12).
### Fix Focus Areas
- .github/workflows/bench.yml[34-56]
### Suggested fix
- Either update the comment to state 18 minutes, or change `timeout-minutes: 18` to `timeout-minutes: 12` to match the documented intent.




Previous review results

Review updated until commit 9585d15

Results up to commit 5012581


🐞 Bugs (2) 📘 Rule violations (0)






Comment on lines 172 to 173
bench-upload:
# Merges per-binary BMF artifacts into a single JSON before a single
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Bench-upload skipped on failure 🐞 Bug ☼ Reliability

bench-upload won’t run when any bench matrix entry fails because it uses needs: [bench]
without an if: always() override, so no partial merge/upload happens. This contradicts the
workflow’s own comment that surviving binaries will still be merged and uploaded when one variant
hangs.
Agent Prompt
### Issue description
`bench-upload` is intended to merge/upload whatever benchmark artifacts were successfully produced even if one matrix entry fails/hangs, but it currently won’t run if any `bench` matrix leg fails.

### Issue Context
- `bench` is a matrix job with `fail-fast: false`, so other legs continue.
- However, `bench-upload` has `needs: [bench]` and no `if: always()` override, so it’s skipped when any matrix leg fails.

### Fix Focus Areas
- .github/workflows/bench.yml[33-41]
- .github/workflows/bench.yml[172-181]

### Suggested fix
- Change `bench-upload` to run even if `bench` fails, while still skipping fork PRs:
  - e.g. `if: always() && (github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository)`
- Optionally make the per-matrix artifact upload more resilient:
  - Add `if: always()` to the upload-artifact step and set `if-no-files-found: ignore` so a timed-out run doesn’t block uploading a partially-written JSON (if present).
- Add a clear guard before merge to fail with a helpful error if `bmf-inputs/` contains zero `*.json` files.


@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 19c6dcb to 7a6ff2c Compare May 2, 2026 10:25
elijahr added a commit that referenced this pull request May 2, 2026
…iles

Per Gemini review on PR #21.

1. README.md schema example listed only latency_p50_ns + latency_p99_ns,
   but bench_latency.nim emits all three (p50/p95/p99). Added p95 to
   the schema block and the prose paragraph below it. The Metrics
   section already correctly listed p50/p95/p99.

2. runner.py left N timestamped fragment files in RESULTS_DIR after
   each merged run (one per Nim binary) — only the merged file is the
   canonical artifact. Wrap the run+merge in try/finally and unlink
   each fragment after, so a bench-binary failure or a merge collision
   still cleans up whatever was produced.
@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 5012581 to 9585d15 Compare May 2, 2026 10:30
@elijahr

elijahr commented May 2, 2026

/review

@qodo-code-review

qodo-code-review Bot commented May 2, 2026

Persistent review updated to latest commit 9585d15

@github-actions

github-actions Bot commented May 2, 2026

Preparing review...

@elijahr

elijahr commented May 3, 2026

@gemini-code-assist review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request replaces the legacy bench_throughput.nim with five specialized topology-split binaries to allow independent CI budgeting and introduces a superset_check.py script to prevent benchmark coverage regressions. Review feedback highlights several improvement opportunities in the new harnesses, specifically recommending the addition of backoffOnPeerWait() to prevent busy-spinning in producer and consumer threads. Additionally, concerns were raised regarding memory safety in bench_unbounded.nim, suggesting the use of proper destructors instead of reset() and avoiding manual memory management for DebraManager to prevent resource leaks under ARC/ORC.

Comment on lines +210 to +211
addr q, manager, messageCount)
reset(q)

critical

Using reset() on types with custom destructors (like UnboundedSipmuc and DebraManager) is dangerous in modern Nim (ARC/ORC). reset merely zeros the memory and does NOT call the destructor, which will leak segments and other heap-allocated resources. Since these variables are local to the loop iteration, they will be destroyed automatically at the end of the loop. If manual destruction is required for the pointer manager, use =destroy(manager[]) before dealloc.

    # reset(q) is redundant; stack object is destroyed at loop end.
    =destroy(manager[])
    dealloc(manager)

Comment thread benchmarks/nim/bench_mpmc.nim Outdated
Comment on lines +168 to +169
while not ctx.queue[].push(T(i)):
discard

high

The Sipmuc producer thread busy-spins without backoff when the queue is full. This can lead to high CPU usage and starvation of consumer threads, especially in oversubscribed environments. To maintain consistency with the Mupmuc harness and the library's own livelock fixes, please add a backoff call here.

    while not ctx.queue[].push(T(i)):
      backoffOnPeerWait()

Comment thread benchmarks/nim/bench_mpsc.nim Outdated
Comment on lines +63 to +64
while not ctx.producer.push(T(i)):
discard

high

The Mupsic producer thread busy-spins without backoff. This is inconsistent with the Mupmuc implementation in bench_mpmc.nim and may cause performance degradation in oversubscribed scenarios. Please use backoffOnPeerWait().

    while not ctx.producer.push(T(i)):
      backoffOnPeerWait()

Comment on lines +176 to +178
let item = ctx.consumer.pop()
if item.isSome:
inc local

medium

The Sipmuc consumer thread busy-spins when the queue is empty. Adding a backoff call here will improve performance under contention and oversubscription by allowing producer threads to make progress.

    let item = ctx.consumer.pop()
    if item.isSome:
      inc local
    else:
      backoffOnPeerWait()

Comment on lines +206 to +207
var manager = create(DebraManager[MaxThreads])
manager[] = initDebraManager[MaxThreads]()

medium

Manual memory management via create and dealloc for DebraManager is error-prone and requires explicit destructor calls in ARC/ORC. Unless there is a specific reason to avoid the stack (e.g., object size), consider using a standard stack variable or new for a ref type to ensure safe resource management.

Comment on lines +161 to +163
let r = consumer.pop()
if r.isSome:
inc local

medium

The consumer thread in the UnboundedSipmuc harness busy-spins when the queue is empty. Please add backoffOnPeerWait() to improve efficiency during oversubscription.

      let r = consumer.pop()
      if r.isSome:
        inc local
      else:
        backoffOnPeerWait()

@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 7a6ff2c to a41a91d Compare May 3, 2026 01:48
elijahr added a commit that referenced this pull request May 3, 2026
…iles

Per Gemini review on PR #21.

1. README.md schema example listed only latency_p50_ns + latency_p99_ns,
   but bench_latency.nim emits all three (p50/p95/p99). Added p95 to
   the schema block and the prose paragraph below it. The Metrics
   section already correctly listed p50/p95/p99.

2. runner.py left N timestamped fragment files in RESULTS_DIR after
   each merged run (one per Nim binary) — only the merged file is the
   canonical artifact. Wrap the run+merge in try/finally and unlink
   each fragment after, so a bench-binary failure or a merge collision
   still cleans up whatever was produced.
@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 89eb1b9 to 2aa47ce Compare May 3, 2026 03:41
@elijahr

elijahr commented May 4, 2026

/ai-review

@axiomantic-momus

axiomantic-momus Bot commented May 4, 2026

Momus review posted — verdict APPROVE, 1 finding


@elijahr

elijahr commented May 4, 2026

/ai-review



@axiomantic-momus axiomantic-momus Bot left a comment


This PR splits the legacy bench_throughput.nim into five topology-specific binaries (bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency), rewires CI to a GitHub Actions matrix with per-step timeouts, and adds a deletion-safety superset check against a pre-split slug fixture. The overall structure is sound. Key issues: (1) the CHANGELOG, README, and PR body document a timeout-minutes of 12 but the actual workflow code uses 18, a docs/code inconsistency introduced in the PR itself; (2) the Bench*TestCompileTime compile-time assertion blocks in all four new binaries are dead code — never activated by tests or CI and would fail if combined with test overrides.

Severity tally: 1 Low.

Low

  • BOT-A3 (benchmarks/runner.py:58): runner.py build_nim() omits -d:danger flag present in CI compilation

Noteworthy

  • The deletion-safety guard via superset_check.py + pre-split-slugs.json fixture is a thoughtful defense against silent slug regression across the binary split.
  • The CI matrix with per-binary timeouts and if: always() on bench-upload provides real partial-coverage resilience against individual binary hangs.
  • The 5-input merge test in test_merge_bmf.py correctly exercises the production pipeline shape with a shared slug carrying both throughput and latency measures.

Verdict: APPROVE.

Commands
  • Comment /ai-review or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
  • Reply to a finding with won't fix, by design, or not a bug to decline it.
  • Reply with instead, ... to propose an alternative fix.

Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.

Comment thread benchmarks/runner.py
"""Run all 5 Nim topology-split binaries and merge their BMF outputs.

The `runs` argument is honored at compile time (via per-binary
`-d:Bench<Topo>Runs=<n>` overrides) — kept as a no-op runtime arg

BOT-A3 — Low (quality)
runner.py build_nim() omits -d:danger flag present in CI compilation
The CI bench.yml compiles each binary with nim c -d:release -d:danger --threads:on. The local runner.py build_nim() only passes -d:release --threads:on, omitting -d:danger. This means local benchmark runs via runner.py will produce different (slower) throughput numbers than CI, breaking reproducibility for developers.

Suggested change
Add `-d:danger` to the nim compile command in build_nim() to match CI.

@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 6c4cb5f to 2684687 Compare May 5, 2026 03:47
elijahr added 18 commits May 4, 2026 22:47
Snapshots the BMF emitted by the legacy bench_throughput at the same
small-override knobs the topology-split tests will use, then commits
the JSON to tests/fixtures/pre-split-slugs.json. Track 2's deletion-
safety check (Task 2.7) verifies that the post-split union is a strict
superset of this fixture so no slug silently disappears with the split.

Adds tests/t_topology_split.nim as the RED test stub: the fixture-shape
suite (Task 2.1) passes against the committed JSON; the bench_spsc /
bench_mpsc / bench_mpmc / bench_unbounded suites (Tasks 2.3-2.6) and
the superset suite (Task 2.7) fail loudly until those binaries +
the superset_check.py helper land.

Fixture covers 11 slugs:
  sipsic spsc/1p1c
  mupmuc mpmc/{1,2,4,8}p{1,2,4,8}c (8p8c is the issue #15 livelock case)
  unbounded_mupsic mpsc_unbounded/{1,2,4}p1c
  nim_channels mpmc/{1,2,4}p{1,2,4}c
Splits the SPSC slice out of the legacy bench_throughput.nim. Wraps
SipsicAdapter in a uniform queueInit closure and runs through
runThroughputHarness so this binary stays under 100 lines and shares
the per-run shaping (warmup, run-count, monotonic-ns timing,
remainder-spread distribution) with the rest of the topology split.

Per-binary intdefines (design 2.5):
  -d:BenchSpscRuns          (default 33)
  -d:BenchSpscMessageCount  (default 1_000_000)
  -d:BenchSpscWarmup        (default 3)

Behind -d:BenchSpscTestCompileTime the binary's `static` block
asserts the three defaults. The runtime + BMF emission contract is
covered by tests/t_topology_split.nim's bench_spsc suite (Task 2.3),
which compiles at 1000 messages * 2 runs and parses the emitted
lockfreequeues_sipsic/spsc/1p1c slug.
Splits the MPSC slice out of the legacy bench_throughput.nim. Mupsic
requires per-thread Producer objects via queue.getProducer(idx); the
generic runThroughputHarness assumes a uniform queue.push call, so this
binary keeps the bespoke harness pattern from the legacy file:
pre-assign Producer values on the main thread (idx = 0..P-1) and ship
them into worker thread contexts. The shape-iteration loop instantiates
three concrete Mupsic types {1,2,4}p1c so P stays a static int.

Per-binary intdefines (design 2.5):
  -d:BenchMpscRuns          (default 33)
  -d:BenchMpscMessageCount  (default 1_000_000)
  -d:BenchMpscWarmup        (default 3)

Slug shape: lockfreequeues_mupsic/mpsc/{1,2,4}p1c. Behind
-d:BenchMpscTestCompileTime the binary's static block asserts the three
defaults. Runtime behavior covered by tests/t_topology_split.nim's
bench_mpsc suite.
Splits the MPMC slice out of the legacy bench_throughput.nim into a
single binary covering three queue families. Mupmuc and Sipmuc both
need per-thread Producer / Consumer values via getProducer(idx) /
getConsumer(idx); the harness pre-assigns those on the main thread and
ships them into worker contexts (mirrors the bench_throughput Mupmuc
pattern). Channels has a uniform push/pop interface and runs through
bench_common.runThroughputHarness.

Coverage:
  Mupmuc {1,2,4} P x {1,2,4} C grid (9 shapes) plus 8p8c. 8p8c is the
    issue #15 livelock regression case preserved from the pre-split
    bench_throughput fixture so the deletion-safety check does not
    silently lose that coverage.
  Sipmuc 1p{1,2,4}c (single-producer, multi-consumer; lives under mpmc
    per design 2.4).
  channels {1,2,4} P x {1,2,4} C grid.

Per-binary intdefines (design 2.5):
  -d:BenchMpmcRuns          (default 33)
  -d:BenchMpmcMessageCount  (default 1_000_000)
  -d:BenchMpmcWarmup        (default 3)

Both lockfreequeues mupmuc and sipmuc export a `Consumer` symbol; the
contexts qualify with `mupmuc.Consumer` / `sipmuc.Consumer` to keep
type resolution unambiguous.
Splits the unbounded slice out of the legacy bench_throughput.nim.
Covers all 4 lockfreequeues unbounded variants at their natural shapes:

  unbounded_sipsic spsc_unbounded/1p1c                    (no DEBRA)
  unbounded_sipmuc mpmc_unbounded/1p{1,2,4}c              (DEBRA on consumers)
  unbounded_mupsic mpsc_unbounded/{1,2,4}p1c              (DEBRA on producers)
  unbounded_mupmuc mpmc_unbounded/{1,2,4}p{1,2,4}c grid   (DEBRA both sides)

UnboundedSipsic is a plain SPSC pair; the other three use DEBRA epoch
reclamation, so producer/consumer threads call registerThread(manager)
on their own thread before queue.getProducer(handle) /
queue.getConsumer(handle). UnboundedMupsic reuses the existing
UnboundedMupsicAdapter (which pre-registers the consumer at adapter
init) and drives pop from the calling thread, mirroring the legacy
bench_throughput path. UnboundedSipmuc and UnboundedMupmuc are
constructed inline with a heap-pointer DebraManager since their
constructors take a `ptr DebraManager`.

Per-variant intdefines (design 2.5):
  -d:UnboundedSipsicRuns / UnboundedSipsicMessageCount   (3 / 500_000)
  -d:UnboundedSipmucRuns / UnboundedSipmucMessageCount   (3 / 500_000)
  -d:UnboundedMupsicRuns / UnboundedMupsicMessageCount   (3 / 500_000)
  -d:UnboundedMupmucRuns / UnboundedMupmucMessageCount   (3 / 500_000)
  -d:BenchUnboundedWarmup                                (default 2)

Per-variant defines are chosen over a shared BenchUnboundedRuns so CI
can budget unbounded_mupmuc/4p4c independently from the cheaper
unbounded_sipsic/1p1c (design 2.5 rationale).
Adds benchmarks/scripts/superset_check.py: a slug-set subset check that
exits 0 when the post-split BMF (union of bench_spsc + bench_mpsc +
bench_mpmc + bench_unbounded after merge_bmf.py) covers every slug in
the pre-split fixture, and exits 1 with the missing slugs listed on
stderr otherwise. The check operates on JSON top-level keys only; it
does not compare measure values, bounds, or ordering, so the harness
is free to reshape per-run statistics across the split without
tripping the deletion-safety guard.

Also wires the check into tests/t_topology_split.nim's deletion-safety
suite (compile + run all four binaries at small overrides, merge via
merge_bmf.py, then invoke superset_check.py against the committed
fixture and assert exit 0).

benchmarks/tests/test_superset_check.py covers 9 cases: identical
sets, strict superset, missing slugs (with alpha-sorted listing),
empty pre, wrong arg count, missing pre file, malformed JSON, and
top-level non-object. All exit codes use 1 for failure to match the
existing benchmarks/merge_bmf.py contract.
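The core of that check can be sketched in a few lines of Python. `missing_slugs` is a hypothetical helper name; the real script adds the argument-count, missing-file, and malformed-JSON handling described above:

```python
import json


def missing_slugs(pre_path: str, post_path: str) -> list:
    """Return pre-split slugs absent from the post-split BMF.

    Only top-level keys (slugs) are compared; measure values, bounds,
    and ordering are deliberately ignored so per-run statistics can be
    reshaped without tripping the deletion-safety guard.
    """
    with open(pre_path) as f:
        pre = json.load(f)
    with open(post_path) as f:
        post = json.load(f)
    return sorted(set(pre) - set(post))  # alpha-sorted for stable stderr output
```

An empty result means the post-split union is a (not necessarily strict) superset and the script exits 0; otherwise the sorted missing slugs go to stderr and it exits 1.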
…imeouts

Replaces the legacy bench-throughput + bench-latency job pair with a
single matrix job over the 5 topology-split binaries. Each matrix
entry runs in its own GitHub Actions job with an independent
`timeout-minutes: 12` budget so a hang in one binary cannot burn the
whole workflow's clock; the surviving binaries finish, the
bench-upload job merges what arrived, and the operator gets partial
Bencher coverage rather than no coverage. The compile step's `case`
sets per-binary `-d:` overrides matching design 2.5: bounded
throughput at 1M messages * 5 runs * 2 warmup, unbounded at 500K * 3
runs * 2 warmup, latency at 50K messages * 11 runs * 2 warmup.

bench-upload now also runs the deletion-safety check
(superset_check.py vs tests/fixtures/pre-split-slugs.json) right after
merge_bmf.py, so any silent slug regression introduced by future
edits to the topology binaries fails the PR check.

Artifact naming changes: each matrix entry uploads
`bench-<binary>-bmf` (e.g. `bench-bench_spsc-bmf`); the upload job's
download pattern is unchanged (`bench-*-bmf`) so it picks up every
new artifact without further edits.

actionlint clean; yamllint shows only the pre-existing
"missing document start" warning shared across all repo workflows.
Adds test_five_input_union covering the upload-job pipeline shape that
PR 2's matrix produces: five sibling BMF fragments (one per binary —
bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency)
arrive via actions/download-artifact, are merged via merge_bmf.py, and
must produce a single output whose slug set is the disjoint union of
the inputs.

The test deliberately collides the sipsic 1p1c slug between bench_spsc
(throughput_ops_ms) and bench_latency (latency_p50_ns,
latency_p99_ns) on the SLUG axis with disjoint MEASURE keys, mirroring
the production multi-measure-per-slug pattern from Track 1 Task 1.5.
The merge must preserve every measure on the shared slug; collisions
on (slug, measure) pairs continue to exit 1 (covered by the existing
test_collision_exits_1 case, no regression).
Removes benchmarks/nim/bench_throughput.nim now that the five topology-
split binaries (bench_spsc / bench_mpsc / bench_mpmc / bench_unbounded
+ bench_latency from PR 1) cover its full slug surface. The pre-split
fixture committed in Task 2.1 plus the superset_check.py guard wired
into bench.yml in Task 2.8 enforce that no slug from the legacy binary
silently disappears across the split.

Live consumers updated to reference the new binaries:
  benchmarks/runner.py     - builds + runs all 5 binaries, then
                             merges fragments via merge_bmf.py.
  lockfreequeues.nimble    - `nimble benchmarks` task iterates the
                             5 binaries and merges to results/latest.json.
  benchmarks/README.md     - rewritten to describe the 5-binary
                             pipeline, the matrix CI job, and the
                             merged BMF schema (single slug can carry
                             both throughput and latency measures).
  .gitignore               - tracks the four new throughput binaries;
                             also ignores the worktree-local `deps`
                             symlink that nim.cfg's relative path
                             expects.
  tests/t_bench_common.nim - removes the obsolete Task 0.10
                             bench_throughput integration suite (now
                             covered by t_topology_split.nim per
                             topology binary).
  benchmarks/nim/adapters/lockfreequeues_unbounded_mupsic_adapter.nim
                           - doc-comment rewires from bench_throughput
                             to bench_unbounded.

Comment-only references in bench_common.nim, bench_latency.nim,
render_readme.nim, and the new bench_*.nim files describe the legacy
binary as historical context (the legacy `runThroughputBenchmark`
proc / Mupmuc registration pattern that bench_common.runThroughputHarness
factored out) and stay verbatim.
Adds Track 2 PR 2 entries to the [Unreleased] section under Added,
Changed, and Removed:
  - Added: 4 new topology-split throughput binaries (bench_spsc,
    bench_mpsc, bench_mpmc, bench_unbounded), superset_check.py
    deletion-safety guard, 5-input merge_bmf union test.
  - Changed: bench.yml matrix over 5 binaries with per-step timeouts
    + superset_check.py wired into bench-upload; runner.py and
    `nimble benchmarks` task iterate the 5 binaries; README rewrite.
  - Removed: bench_throughput.nim (replaced by the 4 topology-split
    binaries; pre-split fixture + superset_check.py guard pin
    deletion-safety).
bench_unbounded matrix item was timing out at 10 minutes after only
4 of 16 shapes (sipsic 1p1c, sipmuc 1p1c/1p2c/1p4c) due to oversubscribed
shapes on 4-vCPU runners. Reduce message counts and runs across the
heavy fan-out variants (mupmuc 9-shape grid is the main offender) and
bump the job timeout to 18 min for cushion. Statistical signal is
preserved for Bencher trend tracking; absolute throughput numbers
will shift but per-shape comparisons remain meaningful.

- UnboundedSipsicMessageCount: 500K -> 200K (1 shape, runs kept at 3)
- UnboundedSipmuc: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupsic: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupmuc: 500Kx3 -> 50Kx2  (9 shapes)
- BenchUnboundedWarmup: 2 -> 1
- timeout-minutes: 12 -> 18
bench_unbounded was hanging in CI on the unbounded_sipmuc 1p4c shape:
4 consumers compete in a strict round-robin slot order (each consumer
must take its turn before another can pop), and on a 4-vCPU runner
one descheduled consumer blocks every other. The same starvation
pattern hits every unbounded mupmuc shape with C>=4.

Add `-d:BenchSkipOversubscribed` (set in PR CI bench.yml) that drops:

  - unbounded_sipmuc/mpmc_unbounded/1p4c
  - unbounded_mupmuc/mpmc_unbounded/{1p4c, 2p4c, 4p1c, 4p2c, 4p4c}

mupsic shapes (multi-producer, single consumer) are kept regardless:
the round-robin ordering exists on consumers only, so a single consumer
cannot starve. mupsic 4p1c is also in tests/fixtures/pre-split-slugs.json
and the deletion-safety guard would fail without it.

The full grid still runs in `bench-comparison.yml` nightly cron / on
workflow_dispatch where a beefier runner is available. PR CI now gets
the meaningful subset (4 mupmuc shapes covering each P,C in {1,2})
that fits a 4-vCPU runner without hanging.
…iles

Per Gemini review on PR #21.

1. README.md schema example listed only latency_p50_ns + latency_p99_ns,
   but bench_latency.nim emits all three (p50/p95/p99). Added p95 to
   the schema block and the prose paragraph below it. The Metrics
   section already correctly listed p50/p95/p99.

2. runner.py left N timestamped fragment files in RESULTS_DIR after
   each merged run (one per Nim binary) — only the merged file is the
   canonical artifact. Wrap the run+merge in try/finally and unlink
   each fragment after, so a bench-binary failure or a merge collision
   still cleans up whatever was produced.
Three CI hardening fixes from code review on PR 2:

1. bench-upload had needs:[bench] without if:always(), so any single
   matrix-leg failure would skip the merge+upload entirely. Per the
   job comment the design is partial-coverage tolerant: surviving
   binaries should still be merged and uploaded. Add
   if: always() && (fork-PR guard) and a guard step that fails loudly
   when zero JSON fragments are present.

2. The per-binary upload-artifact step now runs with if: always() and
   if-no-files-found: ignore so a step-level timeout flushes any
   partial JSON the binary wrote, and a binary that crashed before
   writing the file does not turn the matrix leg red for the merger.

3. Job and step comments referenced a 12-min timeout while the
   configured value was 18 min. Update both comments to 18.
bench-upload runs with if: always(), so a matrix leg that crashed
mid-write can leave a truncated JSON fragment. merge_bmf.py exits 1
on the first malformed input, blocking uploads from surviving legs.
Pre-validate with json.load and drop any that fail, emitting a
::warning:: per dropped file.
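A sketch of that pre-validation step. The `valid_fragments` name and directory-scan shape are illustrative, not the actual workflow script:

```python
import json
from pathlib import Path


def valid_fragments(inputs_dir: str) -> list:
    """Keep only parseable JSON fragments; warn on the rest.

    A matrix leg that crashed or timed out mid-write can leave a
    truncated file; dropping it lets merge_bmf.py proceed with the
    fragments from surviving legs.
    """
    kept = []
    for path in sorted(Path(inputs_dir).glob("*.json")):
        try:
            with path.open() as f:
                json.load(f)
        except json.JSONDecodeError as exc:
            # GitHub Actions annotation syntax: surfaces in the run summary.
            print(f"::warning::dropping malformed fragment {path.name}: {exc}")
            continue
        kept.append(path)
    return kept
```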
…order

- Add backoffOnPeerWait to producer/consumer busy loops in
  bench_mpmc (Sipmuc producer + Sipmuc consumer + Mupmuc consumer)
  and bench_mpsc (Mupsic producer + consumer) to avoid CPU starvation
  in oversubscribed runs. Mupmuc producer already had it.
- Add backoffOnPeerWait to all four UnboundedSipsic/USipmuc/UMupsic/
  UMupmuc consumer loops in bench_unbounded.nim.
- Document the per-iteration teardown order in runUSipmucShape and
  reference it from runUMupmucShape: reset(q) MUST precede reset(manager[])
  because UnboundedSipmuc/Mupmuc =destroy unbinds from the manager,
  which then asserts clientCount==0. Note that under Nim 2.x ARC/ORC,
  system.reset(x) is =destroy(x) + =wasMoved(x) — it does NOT skip
  destructors (verified against project's own =destroy implementations
  in src/lockfreequeues/unbounded_sipmuc.nim:reset(self.manager[])).
…n-uninit

Reviewer flagged that `create(DebraManager[...])` followed by
`manager[] = initDebraManager[...]()` "runs destructors on
uninitialised memory." It does not. `proc create*(T: typedesc, size)`
in system/memalloc.nim is `cast[ptr T](alloc0(sizeof(T) * size))` —
zero-initialised allocation with no hook invocation. The follow-up
assignment then performs the only construction.

Add an explanatory note in the per-iteration teardown block comment
(runUSipmucShape) so this doesn't keep getting flagged. The other
harness (runUMupmucShape) already back-references that block comment.

No behaviour change.
CI compiles topology-split bench binaries with -d:release -d:danger;
runner.py was missing -d:danger, so locally produced numbers were not
comparable to the cloud baseline.
@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 2aa47ce to ab19733 Compare May 5, 2026 03:49
@elijahr

elijahr commented May 5, 2026

Consolidated into #19 — fast-forwarded feat/bench-rollup-pr0-bench-common to this branch's tip; all commits preserved in #19. Closing this stacked PR as redundant.

@elijahr elijahr closed this May 5, 2026
@elijahr elijahr deleted the feat/bench-rollup-pr2-topology-split branch May 5, 2026 04:19
elijahr added a commit that referenced this pull request May 5, 2026
…ts (#19)

* feat(bench): stub bench_common module with public API surface (Task 0.1)

Compile-only stub for benchmarks/nim/bench_common.nim. Types and proc
signatures are final per design doc section 2.1; bodies raise
AssertionDefect with 'not implemented' so tasks 0.2-0.6 can be written
and reviewed in parallel against this surface.

Public surface stubbed:
- Topology enum (tSpsc/tMpsc/tMpmc + unbounded variants)
- PushResult / PopResult adapter primitives
- BMFEmitter + initBMFEmitter / addMeasure / emit
- Stats helpers (mean / stddev / minVal / maxVal / percentile)
- Histogram + initHistogram / record / topK / percentile
  (HistogramTopK = 1000, HistogramReservoir = 99000)
- LatencyMetrics + runLatencyHarness
- ThroughputMetrics + runThroughputHarness

Test (tests/t_bench_common.nim) is compile-time-only: it references
each promised symbol via 'when declared' and 'when compiles', leaving
behavior tests to subsequent tasks.

Stub file: 170 LOC (acceptance cap: 200).

* feat(bench): implement BMFEmitter with alpha-sorted output (Task 0.2)

initBMFEmitter / addMeasure / emit per design doc section 2.1.

- addMeasure stores via Table[slug, Table[measure, MeasureValue]];
  NaN sentinel for lower/upper bounds maps to none[float](),
  finite values map to some(v).
- emit writes JSON with slugs alpha-sorted at the top level and
  measures alpha-sorted within each slug. lower_value/upper_value
  fields are emitted only when the corresponding bound is some.
- Output format: pretty(root, 2) plus a trailing newline so
  diffs in CI / local-dev surface line-level changes cleanly.

Tests (tests/t_bench_common.nim): empty emit produces {}, two
slugs alpha-ordered, measures alpha-ordered within slug, NaN bounds
omitted from output, finite bounds emitted as lower_value/upper_value.
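The emission contract can be illustrated with a small Python analogue (the real emitter is Nim; the `emit_bmf` name and tuple-based input shape here are illustrative):

```python
import json
import math


def emit_bmf(measures: dict) -> str:
    """measures: {slug: {measure: (value, lower, upper)}}; NaN bounds are omitted."""
    root = {}
    for slug in sorted(measures):            # slugs alpha-sorted at the top level
        root[slug] = {}
        for name in sorted(measures[slug]):  # measures alpha-sorted within each slug
            value, lower, upper = measures[slug][name]
            entry = {"value": value}
            if not math.isnan(lower):
                entry["lower_value"] = lower
            if not math.isnan(upper):
                entry["upper_value"] = upper
            root[slug][name] = entry
    # 2-space indent plus trailing newline so CI / local diffs stay line-level.
    return json.dumps(root, indent=2) + "\n"
```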

* feat(bench): implement Stats helpers (Task 0.4)

mean / stddev / minVal / maxVal / percentile bodies. Sort-based
percentile uses linear interpolation between adjacent samples;
matches numpy's default percentile interpolation ('linear').
stddev uses ddof=1 (sample) for consistency with numpy default.
Empty-input behavior matches the legacy benchmarks/nim/stats.nim
that this module replaces (returns 0.0 across the board).

Tests: numpy-baseline parity for stddev on [1,2,3,4]; min/max on
unordered input; percentile(p=0.5) on [0..99] = 49.5; percentile
extremes (p=0.0, p=1.0) match min and max.
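The interpolation rule reads, in a Python sketch (same semantics as numpy's default 'linear' method, with the legacy empty-input contract):

```python
def percentile(samples: list, p: float) -> float:
    """Sort-based percentile with linear interpolation between adjacent samples."""
    if not samples:
        return 0.0  # empty-input behavior matches the legacy stats module
    s = sorted(samples)
    rank = p * (len(s) - 1)        # fractional rank in [0, len-1]
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    frac = rank - lo
    return s[lo] + (s[hi] - s[lo]) * frac
```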

* feat(bench): implement Histogram top-K + reservoir percentile (Task 0.3)

Online percentile tracker per design doc section 2.1.

- topKHeap: min-heap of HistogramTopK = 1000 largest values seen.
  record() pushes if heap < K; else replace if value > heap[0]
  (evicts the min into the body bucket via reservoirAdmit).
- reservoir: Algorithm R uniform sample of body distribution, size
  HistogramReservoir = 99_000.
- debugMode (initHistogram(true)): records every sample to debugAll
  and percentile() returns the exact sort answer over debugAll. Used
  as the test oracle and as a debug-only fallback per design 2.1.

Lookup rule with stratified-estimator math:
  p == 1.0                          -> top-K max
  seenAll*(1-p) <= topKHeap.len     -> top-K rank lookup (linear interp)
  otherwise                         -> reservoir, with p rescaled to
                                       p_body = (p*seenAll) / (seenAll - K)
                                       since the reservoir samples ONLY the
                                       body stratum (values that lost the
                                       top-K race). Without rescaling a
                                       reservoir-based p99 underestimates
                                       by ~25% on log-normal data.

Tests: p99 within 1% of sort fallback on 100K log-normal samples;
percentile(1.0) == top-K max; p50 within 5% of sort fallback;
debug mode returns exact sort answer; topK() returns ascending-sorted
samples.
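A toy Python model of the two-stratum lookup: an exact body list stands in for the Algorithm R reservoir, K is tiny, and the rank arithmetic is approximate, but the eviction path and the p-rescaling formula mirror the description above:

```python
import heapq

K = 4  # stand-in for HistogramTopK = 1000 in the real module


class StratifiedPercentile:
    """Toy two-stratum tracker: exact top-K tail plus a full body list."""

    def __init__(self):
        self.tail = []   # min-heap of the K largest values seen
        self.body = []   # everything that lost the top-K race
        self.seen = 0

    def record(self, v):
        self.seen += 1
        if len(self.tail) < K:
            heapq.heappush(self.tail, v)
        elif v > self.tail[0]:
            # Evict the smallest tail value into the body bucket.
            self.body.append(heapq.heapreplace(self.tail, v))
        else:
            self.body.append(v)

    def percentile(self, p):
        if p == 1.0:
            return max(self.tail)
        if self.seen * (1.0 - p) <= len(self.tail):
            # Requested rank falls inside the exact tail stratum.
            rank = max(0, int(round(p * (self.seen - 1))) - (self.seen - len(self.tail)))
            return sorted(self.tail)[rank]
        # The body stratum excludes the K largest values, so rescale p
        # before the lookup; without this the estimate is biased low.
        p_body = (p * self.seen) / (self.seen - len(self.tail))
        s = sorted(self.body)
        return s[min(int(p_body * (len(s) - 1)), len(s) - 1)]
```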

* feat(bench): implement runThroughputHarness (Task 0.5)

Generic N-producer / N-consumer throughput runner factored out of
the legacy benchmarks/nim/bench_throughput.nim. Parameterized over the
queue init proc and the queue type Q (constrained at instantiation
time via 'mixin push' / 'mixin pop' so adapter-specific push/pop are
resolved at the call site).

The remainder-spread deadlock-avoidance rule is preserved verbatim:
if messageCount mod numProducers != 0, the leading producers each get
one extra item so the producer-side total exactly equals messageCount;
same for consumers. Without this, the harness silently degrades into
a deadlock (consumers waiting forever) or strands items in the queue
(consumers stop early).

Returns ThroughputMetrics{ops_ms_mean, ops_ms_stddev, runs}; warmup
runs are discarded. Smoke test in tests/t_bench_common.nim with a
1024-cap channel SmokeAdapter at 1P/1C × 1000 messages × 1 run.
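The remainder-spread rule is, in a Python sketch (the harness itself is Nim; applied identically to producer and consumer counts):

```python
def spread_counts(message_count: int, workers: int) -> list:
    """Per-worker message counts: the leading workers each absorb one
    extra item so the totals sum exactly to message_count, avoiding the
    deadlock / stranded-item failure modes described above."""
    base, rem = divmod(message_count, workers)
    return [base + 1 if i < rem else base for i in range(workers)]
```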

* feat(bench): implement runLatencyHarness (Task 0.6)

Ping-pong RTT runner. For each run:
- allocates two queues fwd + rev of type Q via queueInit()
- spawns 1 pinger + 1 ponger thread
- pinger pushes a monotonic-ns timestamp via fwd, spins on rev for
  the echo, records RTT (ns) into a per-run Histogram
- ponger pops from fwd, pushes the same value to rev
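
A minimal sketch of that run loop, substituting stdlib blocking queues for the lock-free adapters under test (so the spin-wait becomes a blocking get; names are illustrative, not the harness's):

```python
import queue
import threading
import time

def ping_pong_rtts(n):
    # One run: pinger pushes a monotonic-ns timestamp on fwd, ponger
    # echoes it back on rev, pinger records the round-trip time.
    fwd, rev, rtts = queue.Queue(), queue.Queue(), []
    def ponger():
        for _ in range(n):
            rev.put(fwd.get())           # echo the timestamp back
    t = threading.Thread(target=ponger)
    t.start()
    for _ in range(n):
        fwd.put(time.monotonic_ns())
        rtts.append(time.monotonic_ns() - rev.get())
    t.join()
    return rtts
```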

Aggregation per design 2.5 'Run aggregation for latency': reported
percentiles are the MEAN across runs of each percentile, NOT the
percentile of unioned samples. Per-run histograms isolate scheduler
artifacts to the run that contains them.
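
The difference between the two aggregation strategies shows up even on a toy example (simplified integer-rank percentile, illustrative names):

```python
def mean_of_run_percentiles(runs, p):
    # Reported value: MEAN across runs of each run's percentile.
    per_run = [sorted(r)[int(p * (len(r) - 1))] for r in runs]
    return sum(per_run) / len(per_run)

def union_percentile(runs, p):
    # The rejected alternative: percentile of the unioned samples.
    pool = sorted(s for r in runs for s in r)
    return pool[int(p * (len(pool) - 1))]
```

With one fast run and one slow run, the union lets the slow run's samples dominate the pooled tail, while the per-run mean weights each run equally.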

Returns LatencyMetrics{p50_ns, p95_ns, p99_ns, p999_ns, max_ns,
samples}. samples is the cumulative across-runs sample count.

This task ships the 1P/1C smoke topology only; multi-thread sharding
lands in PR 1 (Task 1.x). 1P/1C is sufficient for the smoke check
because PR 1's first BMF measure uses Sipsic 1P/1C as the smoke shape.

Test: SmokeAdapter (channel-backed), 1000 messages, 1 run, 0 warmup;
asserts samples >= 1000 and p50 < p99 < max.

* feat(bench): add merge_bmf.py BMF JSON merge utility (Task 0.7)

Stateless union of per-slug measure dicts across N input BMF JSON
files into one output JSON, per design doc section 4.3. Required
because Bencher.dev creates a separate Report per 'bencher run'
invocation; multiple uploads from sibling CI jobs would NOT co-locate
multiple measures on a single per-slug history.

Contract:
  merge_bmf.py <output_path> <input1> [input2 ...]

- Slug regex: ^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$
- Measure regex: ^[a-z][a-z0-9_]*$
- Value fields ('value', 'lower_value', 'upper_value') must be finite,
  non-NaN floats. Any malformed input or non-finite value -> exit 1.
- Collision (same slug + same measure across two inputs) -> exit 1
  with stderr 'error: collision on slug=<slug> measure=<measure> in
  files <pathA> and <pathB>'.
- Output is alpha-sorted at both top level (slugs) and within each
  slug (measures), 2-space indent.
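
The union/collision semantics can be sketched as follows (a condensed illustration, not the shipped merge_bmf.py, which also validates lower_value/upper_value and handles file I/O and exit codes):

```python
import math
import re

SLUG_RE = re.compile(r"^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$")
MEASURE_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def merge(docs):
    # Union per-slug measure dicts; any duplicate (slug, measure) pair
    # across inputs is a hard error, never a silent overwrite.
    out = {}
    for doc in docs:
        for slug, measures in doc.items():
            if not SLUG_RE.match(slug):
                raise ValueError(f"invalid slug {slug!r}")
            bucket = out.setdefault(slug, {})
            for measure, fields in measures.items():
                if not MEASURE_RE.match(measure):
                    raise ValueError(f"invalid measure {measure!r}")
                if measure in bucket:
                    raise ValueError(f"collision on slug={slug} measure={measure}")
                if not math.isfinite(fields.get("value", float("nan"))):
                    raise ValueError(f"non-finite value for {slug}/{measure}")
                bucket[measure] = fields
    # Alpha-sorted at both levels, matching the contract above.
    return {s: dict(sorted(out[s].items())) for s in sorted(out)}
```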

Tests in benchmarks/tests/test_merge_bmf.py cover all 9 design-spec
cases plus an alpha-sort assertion: zero-input usage error, single
empty pass-through, single valid round-trip, two disjoint slugs,
shared slug + disjoint measures, collision, NaN rejection, invalid
measure key (uppercase), invalid slug shape.

* feat(bench): add 5 missing lockfreequeues adapters (Task 0.8)

Adds the 5 adapters not yet present per design doc 2.2:
- lockfreequeues_sipmuc_adapter.nim   (bounded SPMC)
- lockfreequeues_mupsic_adapter.nim   (bounded MPSC)
- lockfreequeues_unbounded_sipsic_adapter.nim
- lockfreequeues_unbounded_sipmuc_adapter.nim
- lockfreequeues_unbounded_mupmuc_adapter.nim

Each adapter file exports:
- 'topologiesSupported: set[Topology]' constant (consumed by PR 3 Task
  3.11 to restrict shape generation; declared but unused in PR 0)
- 'make<Lib>Adapter' factory taking 'capacity: int'
- 'cleanup' tear-down proc
- 'push(item)' returning bench_common.PushResult
- 'pop()' returning bench_common.PopResult[T]

Smoke test in tests/t_bench_common.nim: each adapter pushes 100
sequential uint64s, pops them, asserts count == 100 AND set-equality
with the input range.

Implementation note (Nim 2.2.6 codegen workaround):
The unbounded adapters store the queue INLINE rather than via
heap pointer. Heap-allocating UnboundedSipsic/UnboundedSipmuc/
UnboundedMupmuc and then calling reset() or =destroy() on the
referent from a generic adapter cleanup proc triggers
'internal error: getTypeDescAux(tyGenericParam)' OR
'internal error: expr(nkIdent); unknown node kind' in Nim 2.2.6
when bench_common is also imported. Inline storage lets the
compiler emit the per-instance =destroy automatically when the
adapter goes out of scope, sidestepping the bug. The DEBRA
manager is still heap-allocated because the queue constructors
take 'ptr DebraManager'. Track upstream Nim repro for follow-up.

* PR 0 Task 0.9: rename existing adapters with `_adapter` suffix and export `topologiesSupported`

Per design section 2.2 (`<library_slug>_adapter.nim` filename convention):

- Rename via `git mv` (history preserved):
  - lockfreequeues_sipsic.nim -> lockfreequeues_sipsic_adapter.nim
  - lockfreequeues_mupmuc.nim -> lockfreequeues_mupmuc_adapter.nim
  - lockfreequeues_unbounded_mupsic.nim -> lockfreequeues_unbounded_mupsic_adapter.nim

- Update bench_throughput.nim import paths to match.

- Add `topologiesSupported*: set[Topology]` constant to all four
  existing adapters (3 renamed + channels_adapter). Use selective
  `from ../bench_common import Topology, <member>` to avoid pulling
  PushResult into scope, which would conflict with adapter.PushResult
  (the legacy concept in ../adapter still imported by these adapters).

- `topologiesSupported` is exported in PR 0 but consumed only by PR 3
  Task 3.11; the "unused export" hint is expected per the impl plan.

* PR 0 Task 0.10: native BMF emission via --bmf-out= flag in bench_throughput

Add --bmf-out=<path> flag (parsed via std/parseopt) that emits Bencher
Metric Format JSON for every variant that ran. With the flag absent the
binary is bit-for-bit unchanged; the existing positional CLI
(`bench_throughput sipsic mupmuc unbounded_mupsic channels`) is
preserved verbatim.

Slug shape per design 2.2:

  lockfreequeues_sipsic/spsc/1p1c
  lockfreequeues_mupmuc/mpmc/{1,2,4,8}p{1,2,4,8}c
  lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c
  nim_channels/mpmc/{1,2,4}p{1,2,4}c

Each slug carries the `throughput_ops_ms` measure as
{value: mean, lower_value: mean-stddev, upper_value: mean+stddev} —
matching the convention bmf_adapter.py used (which this code replaces
in Task 0.12).
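
That convention amounts to (sketch with a hypothetical helper name):

```python
def to_measure(mean, stddev):
    # Bencher Metric Format measure entry: value with symmetric
    # mean +/- stddev bounds, as described above.
    return {
        "value": mean,
        "lower_value": mean - stddev,
        "upper_value": mean + stddev,
    }
```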

Note on slug for built-in channels: the human-readable adapter name()
returns "nim/channels" (kept for stdout text), but the slug uses
underscore form `nim_channels` so it satisfies the merge_bmf.py
SLUG_RE pattern `^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$`.

Per-variant {.intdefine.}s for runs/warmup added so the integration
test can build a fast variant via -d:BenchSipsicRuns=2
-d:BenchSipsicWarmup=0. Default values (10 runs, 0 warmup) match the
prior hard-coded behavior.

Selective `from ./bench_common import` is used to pull only
{BMFEmitter, initBMFEmitter, addMeasure, emit} — importing the full
module would shadow `results.ThroughputMetrics` with
`bench_common.ThroughputMetrics` and break the legacy bespoke harness
procs that still drive Mupmuc and UnboundedMupsic.

Test: tests/t_bench_common.nim adds an integration test that compiles
bench_throughput with small overrides, runs it with --bmf-out=<tmp>
sipsic, parses the resulting JSON, and asserts the expected slug and
measure exist with a positive value. Stdout text output is also
asserted (`Sipsic` substring) so the legacy CLI contract is exercised.

* PR 0 Task 0.11: bench.yml — native BMF + merge step, drop legacy parser

Apply the five edits from design section 3 (PR 0):

(a) Remove `Run adapter unit tests` step. The Python regex parser it
    sanity-checked is gone in Task 0.12; nothing left to test.
(b) Remove `Convert to BMF JSON` step. bench_throughput now emits BMF
    natively via `--bmf-out=throughput.json` (Task 0.10).
(c) Update `Run bench_throughput` step to add the `--bmf-out=throughput.json`
    flag while preserving the existing positional variant filter and the
    20-minute step timeout.
(d) Insert `Merge BMF JSON` step running
    `python3 benchmarks/merge_bmf.py merged.json throughput.json`.
    Single-input today, but stays in place for PR 2-4 when bench is
    split into per-topology binaries each producing its own fragment.
(e) Change all three `bencher run` invocations and the debug-cat step
    to use `--file merged.json` instead of `bench_results.json`.

actionlint clean. The bench upload contract is unchanged: same
project, branch, testbed, thresholds, comparison semantics — only the
file source flips from regex-parsed text to native BMF.

* PR 0 Task 0.12: delete legacy bmf_adapter.py + bench_main.nim, refactor consumers

Three legacy files removed via `git rm`:

- `benchmarks/bmf_adapter.py` — regex parser for bench_throughput stdout.
  Replaced by native BMF emission via `--bmf-out=` (Task 0.10).
- `benchmarks/test_bmf_adapter.py` — unit tests for the parser. Replaced
  by `benchmarks/tests/test_merge_bmf.py` (Task 0.7).
- `benchmarks/nim/bench_main.nim` — aggregator binary that wrapped
  bench_throughput + bench_latency and emitted a custom JSON shape.
  Bench_throughput is now the canonical entry point and emits BMF
  directly; bench_latency follows in PR 1.

Live consumers refactored in this same task (per impl plan §0.12):

- `benchmarks/runner.py`: build_nim now compiles bench_throughput.nim;
  run_nim invokes `bench_throughput --bmf-out=<path>` instead of
  `bench_main --runs=<n> -o=<path>`. The `runs` argument is preserved
  in the CLI for back-compat but is now compile-time-only.
- `lockfreequeues.nimble`: `task benchmarks` switched to bench_throughput.
  Output path corrected to `.tmp/bench_throughput` (matching the project
  nim.cfg `--outdir:.tmp`).
- `benchmarks/README.md`: rewritten to document the new flow
  (bench_common module, adapter convention, merge_bmf.py, native BMF
  emission, expected slug set).
- `.gitignore`: drop the `bench_main` build-artifact ignore line.

`git grep "bmf_adapter\|bench_main"` returns only intentional historical
references in code comments and CHANGELOG (prior release notes).

Acceptance: `nimble tasks` parses; `nim c benchmarks/nim/bench_throughput.nim`
compiles; tests/t_bench_common.nim passes all 26 tests including the
Task 0.10 --bmf-out integration test.

* PR 0 Task 0.13: render_readme.nim consumes new BMF shape

The legacy bench_main aggregator emitted JSON of shape
`{results: [{implementation, thread_config, metrics: {...}}], metadata: {...}}`.
Bench_main was deleted in Task 0.12, replaced by `bench_throughput
--bmf-out` (Task 0.10) + `merge_bmf.py` (Task 0.7) which produce
Bencher Metric Format JSON of shape
`{<lib>/<topology>/<P>p<C>c: {<measure>: {value, lower_value?, upper_value?}}}`.

`benchmarks/render_readme.nim` now walks the BMF shape directly:

- New `parseSlug(slug)` helper splits `<lib>/<topology>/<P>p<C>c` into
  `(impl, "<P>P/<C>C")` for the two table columns. Strict on shape:
  rejects slugs with the wrong segment count or non-digit P/C, so
  malformed entries do not produce noise rows.
- `loadRows` walks every top-level slug, pulls
  `<slug>.throughput_ops_ms.value` for the throughput column, and
  `<slug>.latency_p50_ns.value` for the optional latency column
  (PR 1+ emits this; PR 0 does not, so all PR 0 rows show "—").
- BMF has no `metadata` block, so the meta argument is always nil.
  `renderTable` already accepts nil and skips the platform header in
  that case — no change needed.
- `when isMainModule` block guarded by `not defined(renderReadmeNoMain)`
  so unit tests can include this file without triggering the README
  rewrite. CLI invocations (`nim r benchmarks/render_readme.nim`) leave
  the define unset and run main as before.
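
A Python rendering of the same strict parsing rule (hypothetical name; the real parseSlug is Nim):

```python
import re

def parse_slug(slug):
    # <lib>/<topology>/<P>p<C>c  ->  (impl, "<P>P/<C>C").
    # Wrong segment count or non-digit P/C yields None so malformed
    # entries never become noise rows.
    parts = slug.split("/")
    if len(parts) != 3:
        return None
    m = re.fullmatch(r"(\d+)p(\d+)c", parts[2])
    if m is None:
        return None
    return parts[0], f"{m.group(1)}P/{m.group(2)}C"
```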

Test: tests/t_render_readme.nim (new) writes a fixture BMF JSON with
three slugs (mupmuc/2p2c, sipsic/1p1c, nim_channels/1p1c), runs
`loadRows` and `renderBlock`, and asserts:
- 3 rows produced; impl/threads/throughput parsed correctly per slug
- alpha sort by impl preserved
- meta == nil (BMF carries no metadata)
- rendered Markdown table contains every impl name and value
- latency column is em-dash for every row (no `latency_p50_ns` in fixture)
- missing JSON file falls back to placeholder
- empty `{}` BMF document falls back to placeholder

All 4 new tests + 26 existing t_bench_common tests pass.

* PR 0 Task 0.14: CHANGELOG entry under [Unreleased] for bench-rollup PR 0

Document Track 0 of the bench-rollup feature under [Unreleased]:

- Added: bench_common.nim module, 5 new lockfreequeues adapters,
  merge_bmf.py CLI, --bmf-out flag, per-variant run-count intdefines.
- Changed: bench_throughput emits BMF natively, 3 adapters renamed to
  the <library_slug>_adapter.nim convention, render_readme.nim consumes
  the new BMF shape, runner.py / nimble task / benchmarks/README.md
  point at bench_throughput.
- Removed: bmf_adapter.py, test_bmf_adapter.py, bench_main.nim.

Each entry is one bullet (one logical change), no test plan checklist,
no GitHub issue references.

* fix(bench): plug heap-allocated queue leaks in latency and throughput harness

Per Gemini review on PR #19. Adapters whose Q heap-allocates the
backing queue (lockfreequeues bounded mupsic/sipmuc/mupmuc, channels)
leaked the queue every time runOneLatencyRun or runOneThroughputRun
returned: `var fwd = queueInit()` only frees the adapter struct's
stack slot at scope exit, not the create()-d Mupsic/Sipmuc/Mupmuc/
Channel pointee. Across N runs plus M warmup runs the leak grew without bound.

Three-part fix:

1. bench_common.nim: harness now drops the queue via a uniform
   `mixin cleanup; when compiles(cleanup(q)): cleanup(q)` block after
   joinThread (and before any return path, including the elapsedNs<=0
   short-circuit in runOneThroughputRun). `when compiles` keeps
   value-type adapters (sipsic, unbounded inline storage) usable
   without forcing them to define a no-op cleanup.

2. lockfreequeues_mupsic / sipmuc cleanup procs: prepend reset(a.queue[])
   so Mupsic/Sipmuc's `=destroy` actually runs before the heap slot
   is freed. Bare dealloc skipped any queue-internal cleanup that an
   ARC/ORC-generated destructor would otherwise do (managed T fields,
   future atomic-cleanup hooks). Same fix on the mupmuc adapter inside
   deinitMupmucAdapter.

3. mupmuc and channels adapters gain a `cleanup` alias around their
   existing deinit*Adapter procs so the harness's mixin cleanup call
   resolves uniformly across adapter naming conventions.

Verified by smoke build of `runLatencyHarness` + `runThroughputHarness`
against LockfreequeuesMupsicAdapter; no leak-on-return, percentiles
and ops/ms reported as expected.

* fix(bench): destroy queue before manager; guard zero-duration throughput

Two fixes flagged by code review on PR 0:

1. Unbounded MPMC adapter cleanup() freed the DebraManager while the
   inline queue still held a pointer to it. The queue's =destroy then
   called unbindClient(self.manager[]) on freed memory. Reset the queue
   first (running its destructor while the manager is still valid),
   then dealloc the manager. Applied to both unbounded_mupmuc and
   unbounded_sipmuc adapters.

2. The legacy bench_throughput run procs computed
   messageCount/elapsedNs without guarding against elapsedNs <= 0,
   which can occur on fast CI runners with sub-nanosecond clock
   granularity. A zero-duration measurement produces inf, which
   merge_bmf.py rejects via math.isfinite. Return 0.0 in that case
   so the sample is recorded as a valid (degenerate) value.
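
The guard amounts to (illustrative sketch; the real code is Nim in bench_throughput):

```python
def ops_per_ms(message_count, elapsed_ns):
    # A non-positive elapsed time yields 0.0 instead of inf/NaN, which
    # the downstream BMF merge step would reject as non-finite.
    if elapsed_ns <= 0:
        return 0.0
    return message_count / (elapsed_ns / 1e6)
```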

* fix(bench): guard zero-thread harness; expose adapter queues; drop non-finite BMF

Three further code-review fixes on PR 0:

1. runOneThroughputRun would crash with DivByZeroDefect on
   numProducers==0 / numConsumers==0 because the workload-distribution
   step does messageCount div N. The harness is exported, so a future
   caller could pass 0 thinking 'no producers' is a degenerate but
   valid shape. Guard with doAssert + a clear message instead of
   letting the worker thread crash with no context. Also assert
   messageCount >= 0.

2. lockfreequeues_mupsic_adapter and lockfreequeues_sipmuc_adapter
   stored a single Producer/Consumer at construction time but kept the
   underlying queue pointer non-exported, so multi-thread shapes
   couldn't actually register per-thread producers/consumers as the
   adapter docstrings already described. Export the queue field
   (used by today's bench_throughput multi-producer code path) and
   add explicit getProducer*/getConsumer* accessor procs that wrap
   queue.getProducer(idx)/getConsumer(idx) for harness consumers.

3. Defense in depth on the BMF emitter. The PR previously added
   elapsedNs <= 0 guards to the legacy bench_throughput run procs;
   this pass also rejects non-finite values inside addMeasure
   itself. If a future code path produces inf/NaN despite the
   upstream guard, the offending (slug, measure) entry is silently
   dropped instead of being written to the BMF, which merge_bmf.py
   would otherwise reject and fail the entire upload on. The same
   non-finite check is applied in toBound for the lower/upper bounds.

* perf(bench): bulk-percentile to avoid 5x reservoir sort per latency run

Histogram.percentile() previously sorted the reservoir (up to 99k
samples) on every call, so each latency run re-sorted the same data
five times for p50/p95/p99/p999/max. Add Histogram.percentiles() that
takes an array of probabilities and computes them all from a single
sorted snapshot. Refactor percentile(openArray[float]) through a new
percentileSorted() helper so the sort/lerp can be reused without a
fresh allocation. Update runOneLatencyRun to use the bulk API.
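
The single-sort bulk lookup can be sketched as (illustrative Python with linear interpolation, mirroring the percentileSorted helper described above):

```python
def percentiles(samples, probs):
    # One sort, many lookups: compute every requested probability from
    # a single sorted snapshot instead of re-sorting per call.
    snap = sorted(samples)
    out = []
    for p in probs:
        pos = p * (len(snap) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(snap) - 1)
        out.append(snap[lo] + (snap[hi] - snap[lo]) * (pos - lo))
    return out
```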

* fix(render_readme): bump row cap; restore platform/cores/timestamp line

- Bump the row cap from 8 to 32. The bench matrix today already exceeds
  8 (1 sipsic + 4 mupmuc + 3 unbounded mupsic + 3 channels = 11) and PR
  3 / PR 4 add comparison adapters; the old cap silently truncated the
  table. PR 5 replaces this auto-rendered block with a hand-curated
  4-row summary, so the higher cap is a transitional safeguard.
- Reconstruct platform / cores / timestamp metadata at render time
  instead of pulling from JSON. BMF has no metadata block, so the
  per-table metadata line was dropping silently in the legacy ->
  BMF migration. detectMetadata() falls back to runtime detection
  (hostOS/hostCPU, countProcessors, now()) and honors three env vars
  (BENCH_RENDER_PLATFORM / _CORES / _TIMESTAMP) for devs re-rendering
  historical snapshots from a different host. renderTable still honors
  any meta JsonNode caller passes through (forward compatibility for
  a future BMF-with-metadata extension).
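
The override-then-fallback order can be sketched as (illustrative Python; the real detectMetadata is Nim, and only the env-var names come from the text above):

```python
import os
import platform
from datetime import datetime, timezone

def detect_metadata():
    # Env vars win (re-rendering historical snapshots from another
    # host); otherwise fall back to runtime detection.
    return {
        "platform": os.environ.get(
            "BENCH_RENDER_PLATFORM",
            f"{platform.system()}/{platform.machine()}"),
        "cores": os.environ.get("BENCH_RENDER_CORES", str(os.cpu_count())),
        "timestamp": os.environ.get(
            "BENCH_RENDER_TIMESTAMP",
            datetime.now(timezone.utc).isoformat()),
    }
```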

* ci: pin sibling-dep clones to nim-debra v0.6.0 / nim-typestates v0.7.0

Both build.yml and bench.yml clone the matching siblings under
`../nim-debra` and `../nim-typestates` so nim.cfg's `--path:` directives
resolve. They were cloning `--branch main`, which floats: a downstream
breaking change in either repo would silently propagate into our CI
without the corresponding `requires` bump in lockfreequeues.nimble.

Switch to specific release tags. Bumps now require touching three
places in lockstep (this comment, the `requires` line, the matching
release tag), which is the right cost for a dep update.

Drop the stale "Once debra 0.3.0 publishes…" comment — debra is at
v0.6.0 and typestates at v0.7.0, both released and registered.

* ci(bench): cost gate — drop feat/** PR base; require `bench` label

Two compounding gates so cascade-rebases of stacked feature PRs do not
burn Bencher.dev metric-ingest quota:

  A) `pull_request: branches:` is reduced to [main, devel]. Stacked
     feature-to-feature PRs (`feat/foo` -> `feat/bar`) no longer
     trigger this workflow at all. Only the topmost PR in any stack
     (the one whose base is devel/main) gets PR-time bench treatment.

  B) The `Track PR benchmarks with Bencher` step is now gated on a
     `bench` label being present on the PR. Without the label the step
     is skipped even on PRs targeting devel; the bench binaries still
     compile and run (free GHA minutes, useful for stdout-level
     spot-checks), but no `bencher run` upload fires. Add the label
     deliberately when you want a comparison report on a revision.

The preflight step is renamed and extended so the check summary
displays a clear "skipped because no `bench` label" warning when B is
the gate, paralleling the existing missing-token branch.

Push to main/devel (baseline upload) and workflow_dispatch (manual
upload) are unchanged.

* docs(bench): align Quick Start CI command with current 1M/500k budget

* ci(bench): wire t_bench_common into nimble benchtests + CI job

Adds a 'benchtests' nimble task that compiles/runs the bench harness
tests, which live outside srcDir and are intentionally not imported by
tests/test.nim (avoids forcing every entry of the MM/sanitizer test
matrix to compile bench harness threading machinery).

Adds a 'bench-tests' job to bench.yml that invokes 'nimble benchtests'
on every PR and push. The job has no secret dependencies, so fork PRs
are allowed; only the github-actions[bot] loop-back guard is kept.

Sibling Nim deps pin to v0.6.0 (debra) and v0.7.0 (typestates), matching
the existing 'benchmark' job in this workflow.

* PR 1 Task 1.1: BenchLatencyRuns/BenchLatencyMessageCount intdefines

Exposes per-binary {.intdefine.} overrides for latency wall-time
control: BenchLatencyRuns (default 33) and BenchLatencyMessageCount
(default 100_000). Mirrors the BenchSipsicRuns / MessageCount pattern
in bench_throughput.nim. Tests in tests/t_bench_latency.nim assert
defaults compile cleanly and that -d:BenchLatencyRuns=2
-d:BenchLatencyMessageCount=1000 overrides take effect via
compile-time static blocks.

Also fixes a stale import (./adapters/lockfreequeues_sipsic ->
./adapters/lockfreequeues_sipsic_adapter) so the file compiles after
PR 0 Task 0.9 renamed all bench adapters with the _adapter suffix.
The rest of the file (legacy runLatencyBenchmark / benchmarkLatency /
isMainModule) is preserved verbatim and gets fully rewritten in
Task 1.2 onto runLatencyHarness.

* PR 1 Task 1.2: rewrite bench_latency.nim onto bench_common.runLatencyHarness

Replaces the legacy ping-pong RTT body with a CLI mirroring
bench_throughput's: positional variant filter (sipsic / mupmuc /
sipmuc / mupsic), --bmf-out=<path> for native Bencher Metric Format
JSON emission, stdout text preserved. Per-binary run-shape
intdefines wired in: BenchLatencyRuns / BenchLatencyMessageCount /
BenchLatencyWarmupRuns drive runLatencyHarness directly.

Slug shape per design 2.2 (table at design line 357), 1p1c smoke
shape only — full grid lands in PR 2:
- lockfreequeues_sipsic/spsc/1p1c
- lockfreequeues_sipmuc/mpmc/1p1c
- lockfreequeues_mupsic/mpsc/1p1c
- lockfreequeues_mupmuc/mpmc/1p1c

Each carries latency_p50_ns / latency_p95_ns / latency_p99_ns
measures (p999 / max deferred to PR 6 per impl plan).

Type unification: PushResult / PopResult were duplicated across
benchmarks/nim/adapter.nim (legacy) and benchmarks/nim/bench_common.nim
(new), forcing two nominally distinct enums to coexist. Aliasing
adapter.PushResult / PopResult to bench_common's via re-export
unifies the surface — both adapter packs (legacy sipsic / mupmuc and
the newer sipmuc / mupsic / unbounded_*) now flow through the same
runLatencyHarness and runThroughputHarness without per-call-site
conversion. Verified: t_bench_common (26 tests) and bench_throughput
--bmf-out integration unchanged.

The Mupmuc / Sipsic adapters use getProducer(idx = 0) / getConsumer(
idx = 0), which bypasses Mupmuc's threadId-based registration, so the
pre-built producer / consumer slots are safe to drive from either the
harness's pinger or ponger thread (Mupmuc's underlying Vyukov
per-slot CAS protocol is fully concurrent-safe).

Tests: tests/t_bench_latency.nim adds three Task 1.2 cases covering
the sipsic single-variant happy path, the four-variant slug coverage
required by Track 1 Acceptance Criteria, and the unknown-variant
exit-1 path. All pass with -d:BenchLatencyMessageCount=200
-d:BenchLatencyRuns=2 in under ~10s on Apple silicon.

* PR 1 Tasks 1.3 + 1.4: bench-latency job + N-input merge step in bench.yml

Restructures bench.yml from one all-in-one bench-throughput job into
three jobs:

  bench-throughput  (renamed from 'benchmark'):
    builds + runs bench_throughput, uploads bench-throughput-bmf artifact.
  bench-latency     (NEW):
    builds + runs bench_latency, uploads bench-latency-bmf artifact.
    Compile-time overrides: BenchLatencyMessageCount=50000,
    BenchLatencyRuns=11, BenchLatencyWarmupRuns=2 (chosen so the four
    1p1c smoke-shape variants finish in single-digit seconds inside the
    12-min job budget).
  bench-upload      (NEW, needs both):
    downloads bench-*-bmf artifacts via actions/download-artifact@v4
    pattern, runs python3 benchmarks/merge_bmf.py merged.json
    $(ls bmf-inputs/*.json), then runs the existing Bencher CLI upload
    flows (PR comparison / push baseline / workflow_dispatch). One
    bencher run invocation -> one Bencher Report -> latency + throughput
    measures co-located on the shared per-slug history (design 1).

Bencher upload commands are unchanged in spirit: same PR / push /
workflow_dispatch gating, same --threshold-measure throughput on push
baselines, same yellow-warning preflight when BENCHER_API_TOKEN is
unset on forks. This commit does NOT correct the historical
`--threshold-measure throughput` -> `throughput_ops_ms` rename
deferred from PR 0 Task 0.10; PR 6 (Task 6.2) handles per-measure
threshold blocks in the same edit.

Plan reference: impl plan Track 1 Tasks 1.3 (line 350-357) and 1.4
(line 359-366). The two tasks share the same file and are tightly
coupled (1.4 strictly depends on 1.3); folding them into one commit
keeps the workflow file consistent at every commit boundary.

actionlint clean. SC2046 word-splitting on $(ls ...) is intentional
(each path must arrive as a separate positional argv to merge_bmf.py)
and is suppressed inline with a justification comment; artifact
filenames are repo-controlled (no whitespace / glob chars).

* PR 1 Task 1.5: smoke test for multi-measure-per-slug merge

Adds two cases to tests/t_bench_latency.nim that validate the
end-to-end shape Track 1 ships: a single slug carries BOTH
`throughput_ops_ms` (from bench_throughput's BMF fragment) and
`latency_p50_ns` / `latency_p99_ns` (from bench_latency) AFTER
`merge_bmf.py` unions the two fragments.

  1. Multi-measure union — writes synthetic throughput + latency
     fragments on the same slug, runs python3 benchmarks/merge_bmf.py,
     and asserts the merged JSON carries all three measures with their
     original values intact.
  2. Collision guard — re-declares `throughput_ops_ms` on the same
     slug across both inputs and asserts merge_bmf.py exits 1 with a
     'collision' message in stderr. This regression-tests PR 0 Task 0.7's
     collision guard against silent overwrites that would erase one of
     the colliding measures.

Both tests use createTempDir + defer removeDir for isolated, leak-free
fixtures and run in <200ms because they bypass the bench compile +
execute path entirely. Plan reference: impl plan Track 1 Task 1.5
(line 368-376).

* PR 1 Task 1.7: CHANGELOG entry under [Unreleased] for bench-rollup PR 1

Adds two bullets to the [Unreleased] Added section covering bench_latency
--bmf-out emission and the new three-job bench.yml structure
(bench-throughput / bench-latency / bench-upload). Adds a Changed bullet
documenting the adapter.nim PushResult / PopResult unification.

Plan reference: impl plan Track 1 Task 1.7 (line 386-393).

* fix(bench-latency): correct prefix on unknown-flag error; rename groups to variants

Per Gemini review on PR #20.

1. Unknown-flag error message hardcoded `--` regardless of cmdShortOption
   vs cmdLongOption. `-x` printed as `Unknown flag: --x`. Now picks
   the right prefix from p.kind.

2. `groups` is a leftover name from bench_throughput where positional
   args ARE topology groups. In bench_latency they're individual
   variants. Renamed for clarity.

* fix(bench): grant bench-upload contents/actions read; isolate artifact dirs

Two CI hardening fixes from code review on PR 1:

1. The bench-upload job overrode GITHUB_TOKEN permissions to only
   pull-requests:write + checks:write but then ran actions/checkout
   and actions/download-artifact, which need contents:read and
   actions:read. Add both so the job is robust against orgs that
   restrict default token permissions.

2. Switch from merge-multiple:true to per-artifact subdirectories.
   Today's artifact filenames (throughput.json, latency.json) are
   unique, but PR 2's topology split adds bench_spsc/mpsc/mpmc/
   unbounded binaries; each could end up writing files that collide
   under merge-multiple. Each artifact now lands at
   bmf-inputs/<artifact-name>/<file> and merge_bmf.py is invoked
   over the recursive find.

* fix(bench): print p999 in latency stdout for operator visibility

The harness already computes metrics.p999_ns; surfacing it during
manual runs costs nothing and matches the legacy bench_latency
output. BMF wire-up for latency_p999_ns / latency_max_ns is
deferred to PR 6 alongside threshold configuration.

* test(bench_latency): isolate test workspaces with createTempDir + ExeExt

Replace static /tmp/bench_latency_t1*_* paths with per-test
createTempDir workspaces (via newTestWorkspace helper) cleaned up by
defer: removeDir. Use ExeExt for the compiled binary name so the
suffix is correct on Windows. Reshape compileBenchLatency to take an
owned `dir` instead of a `suffix` so workspace lifecycle stays with
the test that allocates it.

Prevents collisions between parallel test runs / repeated runs in
the same shell, and stops binaries from accumulating in the system
temp dir.

* ci(bench-latency): pin sibling deps to v0.6.0/v0.7.0; wire t_bench_latency into benchtests

* PR 2 Task 2.1: capture pre-split slug-set fixture for deletion-safety

Snapshots the BMF emitted by the legacy bench_throughput at the same
small-override knobs the topology-split tests will use, then commits
the JSON to tests/fixtures/pre-split-slugs.json. Track 2's
deletion-safety check (Task 2.7) verifies that the post-split union is
a strict superset of this fixture so no slug silently disappears with
the split.

Adds tests/t_topology_split.nim as the RED test stub: the fixture-shape
suite (Task 2.1) passes against the committed JSON; the bench_spsc /
bench_mpsc / bench_mpmc / bench_unbounded suites (Tasks 2.3-2.6) and
the superset suite (Task 2.7) fail loudly until those binaries +
the superset_check.py helper land.

Fixture covers 11 slugs:
  sipsic spsc/1p1c
  mupmuc mpmc/1p1c, 2p2c, 4p4c, 8p8c (8p8c is the issue #15 livelock case)
  unbounded_mupsic mpsc_unbounded/{1,2,4}p1c
  nim_channels mpmc/1p1c, 2p2c, 4p4c
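
The deletion-safety comparison reduces to a set difference (sketch; the real check is the superset_check.py helper referenced above):

```python
def is_superset(post_split_slugs, fixture_slugs):
    # Every pre-split slug must survive the split; report any that
    # went missing so the failure is actionable.
    missing = sorted(set(fixture_slugs) - set(post_split_slugs))
    return (len(missing) == 0, missing)
```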

* PR 2 Task 2.3: bench_spsc binary (Sipsic 1p1c)

Splits the SPSC slice out of the legacy bench_throughput.nim. Wraps
SipsicAdapter in a uniform queueInit closure and runs through
runThroughputHarness so this binary stays under 100 lines and shares
the per-run shaping (warmup, run-count, monotonic-ns timing,
remainder-spread distribution) with the rest of the topology split.

Per-binary intdefines (design 2.5):
  -d:BenchSpscRuns          (default 33)
  -d:BenchSpscMessageCount  (default 1_000_000)
  -d:BenchSpscWarmup        (default 3)

Behind -d:BenchSpscTestCompileTime the binary's `static` block
asserts the three defaults. The runtime + BMF emission contract is
covered by tests/t_topology_split.nim's bench_spsc suite (Task 2.3),
which compiles at 1000 messages * 2 runs and parses the emitted
lockfreequeues_sipsic/spsc/1p1c slug.

* PR 2 Task 2.4: bench_mpsc binary (Mupsic 1p1c, 2p1c, 4p1c)

Splits the MPSC slice out of the legacy bench_throughput.nim. Mupsic
requires per-thread Producer objects via queue.getProducer(idx); the
generic runThroughputHarness assumes a uniform queue.push call, so this
binary keeps the bespoke harness pattern from the legacy file:
pre-assign Producer values on the main thread (idx = 0..P-1) and ship
them into worker thread contexts. The shape-iteration loop instantiates
three concrete Mupsic types {1,2,4}p1c so P stays a static int.

Per-binary intdefines (design 2.5):
  -d:BenchMpscRuns          (default 33)
  -d:BenchMpscMessageCount  (default 1_000_000)
  -d:BenchMpscWarmup        (default 3)

Slug shape: lockfreequeues_mupsic/mpsc/{1,2,4}p1c. Behind
-d:BenchMpscTestCompileTime the binary's static block asserts the three
defaults. Runtime behavior covered by tests/t_topology_split.nim's
bench_mpsc suite.

* PR 2 Task 2.5: bench_mpmc binary (Mupmuc grid + 8p8c, Sipmuc, channels)

Splits the MPMC slice out of the legacy bench_throughput.nim into a
single binary covering three queue families. Mupmuc and Sipmuc both
need per-thread Producer / Consumer values via getProducer(idx) /
getConsumer(idx); the harness pre-assigns those on the main thread and
ships them into worker contexts (mirrors the bench_throughput Mupmuc
pattern). Channels has a uniform push/pop interface and runs through
bench_common.runThroughputHarness.

Coverage:
  Mupmuc {1,2,4} P x {1,2,4} C grid (9 shapes) plus 8p8c. 8p8c is the
    issue #15 livelock regression case preserved from the pre-split
    bench_throughput fixture so the deletion-safety check does not
    silently lose that coverage.
  Sipmuc 1p{1,2,4}c (single-producer, multi-consumer; lives under mpmc
    per design 2.4).
  channels {1,2,4} P x {1,2,4} C grid.

Per-binary intdefines (design 2.5):
  -d:BenchMpmcRuns          (default 33)
  -d:BenchMpmcMessageCount  (default 1_000_000)
  -d:BenchMpmcWarmup        (default 3)

Both lockfreequeues mupmuc and sipmuc export a `Consumer` symbol; the
contexts qualify with `mupmuc.Consumer` / `sipmuc.Consumer` to keep
type resolution unambiguous.

* PR 2 Task 2.6: bench_unbounded binary (4 unbounded variants)

Splits the unbounded slice out of the legacy bench_throughput.nim.
Covers all 4 lockfreequeues unbounded variants at their natural shapes:

  unbounded_sipsic spsc_unbounded/1p1c                    (no DEBRA)
  unbounded_sipmuc mpmc_unbounded/1p{1,2,4}c              (DEBRA on consumers)
  unbounded_mupsic mpsc_unbounded/{1,2,4}p1c              (DEBRA on producers)
  unbounded_mupmuc mpmc_unbounded/{1,2,4}p{1,2,4}c grid   (DEBRA both sides)

UnboundedSipsic is a plain SPSC pair; the other three use DEBRA epoch
reclamation, so producer/consumer threads call registerThread(manager)
on their own thread before queue.getProducer(handle) /
queue.getConsumer(handle). UnboundedMupsic reuses the existing
UnboundedMupsicAdapter (which pre-registers the consumer at adapter
init) and drives pop from the calling thread, mirroring the legacy
bench_throughput path. UnboundedSipmuc and UnboundedMupmuc are
constructed inline with a heap-pointer DebraManager since their
constructors take a `ptr DebraManager`.

Per-variant intdefines (design 2.5):
  -d:UnboundedSipsicRuns / UnboundedSipsicMessageCount   (3 / 500_000)
  -d:UnboundedSipmucRuns / UnboundedSipmucMessageCount   (3 / 500_000)
  -d:UnboundedMupsicRuns / UnboundedMupsicMessageCount   (3 / 500_000)
  -d:UnboundedMupmucRuns / UnboundedMupmucMessageCount   (3 / 500_000)
  -d:BenchUnboundedWarmup                                (default 2)

Per-variant defines are chosen over a shared BenchUnboundedRuns so CI
can budget unbounded_mupmuc/4p4c independently from the cheaper
unbounded_sipsic/1p1c (design 2.5 rationale).

* PR 2 Task 2.7: superset_check.py + deletion-safety wiring

Adds benchmarks/scripts/superset_check.py: a slug-set subset check that
exits 0 when the post-split BMF (union of bench_spsc + bench_mpsc +
bench_mpmc + bench_unbounded after merge_bmf.py) covers every slug in
the pre-split fixture, and exits 1 with the missing slugs listed on
stderr otherwise. The check operates on JSON top-level keys only; it
does not compare measure values, bounds, or ordering, so the harness
is free to reshape per-run statistics across the split without
tripping the deletion-safety guard.

Also wires the check into tests/t_topology_split.nim's deletion-safety
suite (compile + run all four binaries at small overrides, merge via
merge_bmf.py, then invoke superset_check.py against the committed
fixture and assert exit 0).

benchmarks/tests/test_superset_check.py covers 9 cases, including identical
sets, strict superset, missing slugs (with alpha-sorted listing),
empty pre, wrong arg count, missing pre file, malformed JSON, and
top-level non-object. All exit codes use 1 for failure to match the
existing benchmarks/merge_bmf.py contract.
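A minimal Python sketch of that contract — hypothetical helper names, not the real script, which additionally handles argv parsing, file I/O, and the malformed-input cases above:

```python
# Sketch of the deletion-safety contract: compare JSON top-level keys
# only (never measure values, bounds, or ordering); report missing slugs
# alpha-sorted; exit-code semantics 0 = covered, 1 = slugs missing.
def missing_slugs(pre: dict, post: dict) -> list[str]:
    # Top-level keys are the benchmark slugs in both fixtures.
    return sorted(set(pre) - set(post))

def superset_exit_code(pre: dict, post: dict) -> int:
    return 1 if missing_slugs(pre, post) else 0
```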

* PR 2 Task 2.8: bench.yml matrix over 5 topology binaries + per-step timeouts

Replaces the legacy bench-throughput + bench-latency job pair with a
single matrix job over the 5 topology-split binaries. Each matrix
entry runs in its own GitHub Actions job with an independent
`timeout-minutes: 12` budget so a hang in one binary cannot burn the
whole workflow's clock; the surviving binaries finish, the
bench-upload job merges what arrived, and the operator gets partial
Bencher coverage rather than no coverage. The compile step's `case`
sets per-binary `-d:` overrides matching design 2.5: bounded
throughput at 1M messages * 5 runs * 2 warmup, unbounded at 500K * 3
runs * 2 warmup, latency at 50K messages * 11 runs * 2 warmup.

bench-upload now also runs the deletion-safety check
(superset_check.py vs tests/fixtures/pre-split-slugs.json) right after
merge_bmf.py, so any silent slug regression introduced by future
edits to the topology binaries fails the PR check.

Artifact naming changes: each matrix entry uploads
`bench-<binary>-bmf` (e.g. `bench-bench_spsc-bmf`); the upload job's
download pattern is unchanged (`bench-*-bmf`) so it picks up every
new artifact without further edits.

actionlint clean; yamllint shows only the pre-existing
"missing document start" warning shared across all repo workflows.

* PR 2 Task 2.9: 5-input union test in test_merge_bmf.py

Adds test_five_input_union covering the upload-job pipeline shape that
PR 2's matrix produces: five sibling BMF fragments (one per binary —
bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency)
arrive via actions/download-artifact, are merged via merge_bmf.py, and
must produce a single output whose slug set is the disjoint union of
the inputs.

The test deliberately collides the sipsic 1p1c slug between bench_spsc
(throughput_ops_ms) and bench_latency (latency_p50_ns,
latency_p99_ns) on the SLUG axis with disjoint MEASURE keys, mirroring
the production multi-measure-per-slug pattern from Track 1 Task 1.5.
The merge must preserve every measure on the shared slug; collisions
on (slug, measure) pairs continue to exit 1 (covered by the existing
test_collision_exits_1 case, no regression).
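The merge semantics the test pins down can be sketched as follows — a hypothetical function, not the real merge_bmf.py, which also handles file I/O and stderr reporting:

```python
# Sketch of the union contract: slug sets union across fragments, measures
# union per shared slug, and a collision on the same (slug, measure) pair
# is fatal (exit 1), matching the existing test_collision_exits_1 case.
def merge_bmf(fragments: list[dict]) -> dict:
    out: dict = {}
    for frag in fragments:
        for slug, measures in frag.items():
            dest = out.setdefault(slug, {})
            for measure, value in measures.items():
                if measure in dest:
                    raise SystemExit(1)  # (slug, measure) collision
                dest[measure] = value
    return out
```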

* PR 2 Task 2.10: delete bench_throughput.nim, rewire consumers

Removes benchmarks/nim/bench_throughput.nim now that the five topology-
split binaries (bench_spsc / bench_mpsc / bench_mpmc / bench_unbounded
+ bench_latency from PR 1) cover its full slug surface. The pre-split
fixture committed in Task 2.1 plus the superset_check.py guard wired
into bench.yml in Task 2.8 enforce that no slug from the legacy binary
silently disappears across the split.

Live consumers updated to reference the new binaries:
  benchmarks/runner.py     - builds + runs all 5 binaries, then
                             merges fragments via merge_bmf.py.
  lockfreequeues.nimble    - `nimble benchmarks` task iterates the
                             5 binaries and merges to results/latest.json.
  benchmarks/README.md     - rewritten to describe the 5-binary
                             pipeline, the matrix CI job, and the
                             merged BMF schema (single slug can carry
                             both throughput and latency measures).
  .gitignore               - tracks the four new throughput binaries;
                             also ignores the worktree-local `deps`
                             symlink that nim.cfg's relative path
                             expects.
  tests/t_bench_common.nim - removes the obsolete Task 0.10
                             bench_throughput integration suite (now
                             covered by t_topology_split.nim per
                             topology binary).
  benchmarks/nim/adapters/lockfreequeues_unbounded_mupsic_adapter.nim
                           - doc-comment updated to reference
                             bench_unbounded instead of bench_throughput.

Comment-only references in bench_common.nim, bench_latency.nim,
render_readme.nim, and the new bench_*.nim files describe the legacy
binary as historical context (the legacy `runThroughputBenchmark`
proc / Mupmuc registration pattern that bench_common.runThroughputHarness
factored out) and stay verbatim.

* PR 2 Task 2.11: CHANGELOG entry under [Unreleased] for bench-rollup PR 2

Adds Track 2 PR 2 entries to the [Unreleased] section under Added,
Changed, and Removed:
  - Added: 4 new topology-split throughput binaries (bench_spsc,
    bench_mpsc, bench_mpmc, bench_unbounded), superset_check.py
    deletion-safety guard, 5-input merge_bmf union test.
  - Changed: bench.yml matrix over 5 binaries with per-step timeouts
    + superset_check.py wired into bench-upload; runner.py and
    `nimble benchmarks` task iterate the 5 binaries; README rewrite.
  - Removed: bench_throughput.nim (replaced by the 4 topology-split
    binaries; pre-split fixture + superset_check.py guard pin
    deletion-safety).

* fix(bench): tighten bench_unbounded CI shape to fit 18-min budget

bench_unbounded matrix item was timing out at 10 minutes after only
4 of 16 shapes (sipsic 1p1c, sipmuc 1p1c/1p2c/1p4c) due to oversubscribed
shapes on 4-vCPU runners. Reduce message counts and runs across the
heavy fan-out variants (mupmuc 9-shape grid is the main offender) and
bump the job timeout to 18 min for cushion. Statistical signal is
preserved for Bencher trend tracking; absolute throughput numbers
will shift but per-shape comparisons remain meaningful.

- UnboundedSipsicMessageCount: 500K -> 200K (1 shape, runs kept at 3)
- UnboundedSipmuc: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupsic: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupmuc: 500Kx3 -> 50Kx2  (9 shapes)
- BenchUnboundedWarmup: 2 -> 1
- timeout-minutes: 12 -> 18
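Back-of-the-envelope work budget behind those numbers, counting only shapes × message_count × runs and ignoring warmup and per-shape contention effects:

```python
# Rough messages-pushed-per-CI-run budget for bench_unbounded, before and
# after the reduction listed above. Simplified: warmup iterations and the
# skew from oversubscribed shapes are not modeled.
def budget(shapes: int, message_count: int, runs: int) -> int:
    return shapes * message_count * runs

before = (budget(1, 500_000, 3)     # sipsic 1p1c
          + budget(3, 500_000, 3)   # sipmuc 1p{1,2,4}c
          + budget(3, 500_000, 3)   # mupsic {1,2,4}p1c
          + budget(9, 500_000, 3))  # mupmuc 9-shape grid
after = (budget(1, 200_000, 3)
         + budget(3, 100_000, 2)
         + budget(3, 100_000, 2)
         + budget(9, 50_000, 2))
# before = 24_000_000 pushes, after = 2_700_000 -- roughly a 9x reduction.
```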

* fix(bench): gate oversubscribed C>=4 unbounded shapes behind a define

bench_unbounded was hanging in CI on the unbounded_sipmuc 1p4c shape:
4 consumers compete in a strict round-robin slot order (each consumer
must take its turn before another can pop), and on a 4-vCPU runner
one descheduled consumer blocks every other. The same starvation
pattern hits every unbounded mupmuc shape with C>=4.

Add `-d:BenchSkipOversubscribed` (set in PR CI bench.yml) that drops:

  - unbounded_sipmuc/mpmc_unbounded/1p4c
  - unbounded_mupmuc/mpmc_unbounded/{1p4c, 2p4c, 4p1c, 4p2c, 4p4c}

mupsic shapes (multi-producer, single consumer) are kept regardless:
the round-robin ordering exists on consumers only, so a single consumer
cannot starve. mupsic 4p1c is also in tests/fixtures/pre-split-slugs.json
and the deletion-safety guard would fail without it.
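The drop set above can be restated as a predicate — a hypothetical Python restatement for clarity; the real gate is compile-time Nim behind `-d:BenchSkipOversubscribed`:

```python
# Predicate implied by the dropped-slug list above: only variants with
# round-robin *consumer* ordering (sipmuc, mupmuc) are affected, and only
# shapes a 4-vCPU runner cannot schedule without starvation. Hypothetical
# helper name; shapes for sipsic and mupsic always run.
def dropped(variant: str, producers: int, consumers: int) -> bool:
    if variant == "unbounded_sipmuc":
        return consumers >= 4
    if variant == "unbounded_mupmuc":
        return producers >= 4 or consumers >= 4
    return False
```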

The full grid still runs in `bench-comparison.yml` nightly cron / on
workflow_dispatch where a beefier runner is available. PR CI now gets
the meaningful subset (4 mupmuc shapes covering each P,C in {1,2})
that fits a 4-vCPU runner without hanging.

* fix(bench): document p95 in BMF schema; clean up runner.py fragment files

Per Gemini review on PR #21.

1. README.md schema example listed only latency_p50_ns + latency_p99_ns,
   but bench_latency.nim emits all three (p50/p95/p99). Added p95 to
   the schema block and the prose paragraph below it. The Metrics
   section already correctly listed p50/p95/p99.

2. runner.py left N timestamped fragment files in RESULTS_DIR after
   each merged run (one per Nim binary) — only the merged file is the
   canonical artifact. Wrap the run+merge in try/finally and unlink
   each fragment in the finally branch, so a bench-binary failure or a
   merge collision still cleans up whatever was produced.
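A sketch of that cleanup shape — hypothetical names; the real runner.py shells out to the Nim binaries and merge_bmf.py rather than merging in-process:

```python
import json
from pathlib import Path

def run_and_merge(fragments: list[Path], merged: Path) -> None:
    """Merge whatever fragment files exist, then unlink each fragment in
    `finally` so a failed bench binary or a merge collision still leaves
    only the canonical merged artifact behind. (Illustrative only.)"""
    try:
        out: dict = {}
        for frag in fragments:
            if frag.exists():
                out.update(json.loads(frag.read_text()))
        merged.write_text(json.dumps(out))
    finally:
        for frag in fragments:
            frag.unlink(missing_ok=True)
```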

* fix(bench): run upload on partial failures; correct timeout-comment

Three CI hardening fixes from code review on PR 2:

1. bench-upload had needs:[bench] without if:always(), so any single
   matrix-leg failure would skip the merge+upload entirely. Per the
   job comment the design is partial-coverage tolerant: surviving
   binaries should still be merged and uploaded. Add
   if: always() && (fork-PR guard) and a guard step that fails loudly
   when zero JSON fragments are present.

2. The per-binary upload-artifact step now runs with if: always() and
   if-no-files-found: ignore so a step-level timeout flushes any
   partial JSON the binary wrote, and a binary that crashed before
   writing the file does not turn the matrix leg red for the merger.

3. Job and step comments referenced a 12-min timeout while the
   configured value was 18 min. Update both comments to 18.

* fix(bench): drop unparseable BMF fragments before merge

bench-upload runs with if: always(), so a matrix leg that crashed
mid-write can leave a truncated JSON fragment. merge_bmf.py exits 1
on the first malformed input, blocking uploads from surviving legs.
Pre-validate with json.load and drop any that fail, emitting a
::warning:: per dropped file.
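The pre-validation step reduces to roughly this — a hypothetical helper; the real step runs in the workflow before invoking merge_bmf.py:

```python
import json
from pathlib import Path

def valid_fragments(paths: list[Path]) -> list[Path]:
    """Keep only fragments that parse as JSON; emit a GitHub Actions
    ::warning:: annotation per dropped file so a matrix leg that crashed
    mid-write cannot block the merge of surviving legs."""
    keep = []
    for p in paths:
        try:
            json.loads(p.read_text())
        except (OSError, json.JSONDecodeError):
            print(f"::warning::dropping unparseable BMF fragment {p}")
        else:
            keep.append(p)
    return keep
```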

* perf(bench): backoffOnPeerWait in busy-spin loops; document teardown order

- Add backoffOnPeerWait to producer/consumer busy loops in
  bench_mpmc (Sipmuc producer + Sipmuc consumer + Mupmuc consumer)
  and bench_mpsc (Mupsic producer + consumer) to avoid CPU starvation
  in oversubscribed runs. Mupmuc producer already had it.
- Add backoffOnPeerWait to all four UnboundedSipsic/USipmuc/UMupsic/
  UMupmuc consumer loops in bench_unbounded.nim.
- Document the per-iteration teardown order in runUSipmucShape and
  reference it from runUMupmucShape: reset(q) MUST precede reset(manager[])
  because UnboundedSipmuc/Mupmuc =destroy unbinds from the manager,
  which then asserts clientCount==0. Note that under Nim 2.x ARC/ORC,
  system.reset(x) is =destroy(x) + =wasMoved(x) — it does NOT skip
  destructors (verified against project's own =destroy implementations
  in src/lockfreequeues/unbounded_sipmuc.nim:reset(self.manager[])).

* docs(bench_unbounded): defend `create(T)` as alloc0, not destructor-on-uninit

Reviewer flagged that `create(DebraManager[...])` followed by
`manager[] = initDebraManager[...]()` "runs destructors on
uninitialised memory." It does not. `proc create*(T: typedesc, size)`
in system/memalloc.nim is `cast[ptr T](alloc0(sizeof(T) * size))` —
zero-initialised allocation with no hook invocation. The follow-up
assignment then performs the only construction.

Add an explanatory note in the per-iteration teardown block comment
(runUSipmucShape) so this doesn't keep getting flagged. The other
harness (runUMupmucShape) already back-references that block comment.

No behaviour change.

* fix(bench): pass -d:danger in runner.py build_nim() to match CI

CI compiles topology-split bench binaries with -d:release -d:danger;
runner.py was missing -d:danger, so locally produced numbers were not
comparable to the cloud baseline.

* test: t_unbounded_padding red — segment alloc returns 16B-aligned, expects 64B

PR 3 Task 3.1 (cache-line padding audit, step 1 of 3).

Verifies bounded variants already conform; introduces the failing test that
will drive the unbounded fix in 3.2 and 3.3.

Bounded MPMC audit (read-only verification, design doc §4.2):
- src/lockfreequeues/sipsic.nim:19-20    head/tail Atomic[int]    {.align: CacheLineBytes.} ✓
- src/lockfreequeues/sipmuc.nim:42-43    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupsic.nim:42-43    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupmuc.nim:43-44    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
The bounded MPMC cell array also uses the documented array-of-cells trick
(typestates/mpmc_cell.nim:11) so adjacent slots never share a cache line.
No bounded change required.

Unbounded variants currently FAIL both audit conditions:
- Segment fields (head, tail, prevConsumerIdx, committed) lack {.align: CacheLineBytes.}
  pragmas, so they share cache lines with neighbouring data.
- newSegment() uses c_calloc, which on x86_64 Linux glibc returns 16-byte
  aligned memory (MALLOC_ALIGNMENT = 2*sizeof(size_t)). The {.align.} pragma
  controls only the intra-struct offset, not the base.

Test additions:
- tests/t_unbounded_padding.nim — 8 assertions across 4 unbounded variants:
  (a) offsetOf(SegmentField) mod CacheLineBytes == 0 for the producer/consumer
      coordination atomics.
  (b) cast[uint](segPtr) mod CacheLineBytes == 0 for a freshly-allocated
      Segment via the queue's allocator.
- Test-only accessors guarded by `when defined(testing):` in each unbounded
  module: headSegmentForTest (returns pointer) and segmentHeadOffsetForTest
  (returns offsets of audit-relevant fields). Compiled away in release.
- Wired into tests/test.nim aggregator.

RED state confirmed:
  Unbounded queue Segment cache-line padding [Summary] 8 tests run: 0 OK, 8 FAILED
  off.tail mod 64 was 8 (expected 0)
  off.prevConsumerIdx mod 64 was 16
  off.committed mod 64 was 24
  cast[uint](segPtr) mod 64 was 16

Existing test suite continues to compile (verified `nim c tests/test.nim`).

Refs: design doc §4.2; impl plan Task 3.1.
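The two audit conditions and the RED numbers above reduce to modular arithmetic. The GREEN offsets below are an illustrative padded layout, not the exact post-fix field positions:

```python
CACHE_LINE = 64

def audit(offsets: dict[str, int], base_addr: int) -> list[str]:
    """Both audit conditions: every coordination-field offset AND the
    segment base address must be multiples of the cache-line size.
    (Sketch of the test's logic; the Nim test uses offsetOf + cast.)"""
    failures = [f"off.{name} mod {CACHE_LINE} was {off % CACHE_LINE}"
                for name, off in offsets.items() if off % CACHE_LINE != 0]
    if base_addr % CACHE_LINE != 0:
        failures.append(f"base mod {CACHE_LINE} was {base_addr % CACHE_LINE}")
    return failures

# Unpadded layout (RED values from the commit): 8-byte field strides and a
# 16-byte-aligned c_calloc base.
red = audit({"tail": 8, "prevConsumerIdx": 16, "committed": 24}, base_addr=0x10)
# Padded layout (GREEN, illustrative): 64-byte strides, posix_memalign base.
green = audit({"tail": 64, "prevConsumerIdx": 128, "committed": 192}, base_addr=0x40)
```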

* feat(internal): add aligned_alloc.allocAligned via posix_memalign

PR 3 Task 3.2 (cache-line padding audit, step 2 of 3).

Adds the cache-line-aligned heap allocator that replaces c_calloc in the
unbounded queue Segment allocation path (call sites swap in Task 3.3).

Compile probe (impl plan §3.2.0):
  $ nim c -r tmp_probe.nim
  rc=0
  align=0
verifying that std/posix.posix_memalign is available on macOS (libSystem)
and Linux glibc with the same C signature, returning 64-byte aligned memory
on success. Probe file deleted post-verification per impl plan.

Implementation (src/lockfreequeues/internal/aligned_alloc.nim):
- ``proc allocAligned*[T](): ptr T`` — calls posix_memalign with
  CacheLineBytes (64) alignment, zero-inits, raises OutOfMemDefect on
  rc != 0. Caller releases via the standard c_free per POSIX guarantee
  that posix_memalign blocks are free-compatible.
- Re-exports CacheLineBytes from atomic_dsl for callers in 3.3.
- An isMainModule smoke check confirms the returned pointer's alignment.

Tests (tests/t_aligned_alloc.nim, wired into tests/test.nim):
- Tiny payload (sizeof < 64).
- Line-sized payload (sizeof == 64).
- Big payload (sizeof > 64).
- 64 sequential allocations all 64B-aligned.
All 4 cases assert pointer alignment AND zero-initialization. Result:
  internal/aligned_alloc.allocAligned ....
  [Summary] 4 tests run: 4 OK, 0 FAILED.

The unbounded padding audit test (tests/t_unbounded_padding.nim) is still
RED at this commit; Task 3.3 swaps the Segment allocation call sites and
adds {.align: CacheLineBytes.} pragmas, taking it GREEN.

Refs: design doc §4.2; impl plan Task 3.2.
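For reference, the same contract expressed via Python's ctypes on a POSIX system — an illustration of posix_memalign's behavior, not the Nim shim itself. Note posix_memalign does not zero its block, so the zeroing is explicit here, mirroring the shim's zero-init:

```python
import ctypes

# POSIX-only sketch: 64-byte-aligned allocation via libc posix_memalign,
# zeroed by hand, released with plain free() per the POSIX guarantee that
# posix_memalign blocks are free-compatible.
libc = ctypes.CDLL(None, use_errno=True)
libc.posix_memalign.argtypes = [ctypes.POINTER(ctypes.c_void_p),
                                ctypes.c_size_t, ctypes.c_size_t]
libc.posix_memalign.restype = ctypes.c_int

def alloc_aligned(size: int, alignment: int = 64) -> ctypes.c_void_p:
    mem = ctypes.c_void_p()
    rc = libc.posix_memalign(ctypes.byref(mem), alignment, size)
    if rc != 0:
        raise MemoryError(f"posix_memalign rc={rc}")
    ctypes.memset(mem, 0, size)  # posix_memalign does NOT zero; do it here
    return mem

p = alloc_aligned(40)
assert p.value % 64 == 0
libc.free(p)
```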

* fix(unbounded): pad Segment fields and base-align via posix_memalign

PR 3 Task 3.3 (cache-line padding audit, step 3 of 3 — GREEN).

Applies the cache-line padding fix to all four unbounded queue variants and
their per-segment producer/consumer atomics. Resolves the asymmetry called
out in design doc §4.2: bounded variants were already padded; unbounded
Segments were not, so cross-library comparison numbers in PR 3 would have
been biased against lockfreequeues' unbounded types.

Production changes (4 files):
- src/lockfreequeues/unbounded_sipsic.nim
  Segment.{next, head, tail} + UnboundedSipsic.{headSegment, tailSegment}
  gain {.align: CacheLineBytes.}. newSegment swaps c_calloc -> allocAligned.
- src/lockfreequeues/unbounded_sipmuc.nim
  Segment.{next, tail, prevConsumerIdx} + UnboundedSipmuc.{headSegment,
  tailSegment} gain alignment. newSegment swaps to allocAligned.
- src/lockfreequeues/unbounded_mupsic.nim
  Segment.{next, tail, committed} + UnboundedMupsic.{headSegment,
  tailSegment} gain alignment. newSegment swaps to allocAligned.
  ``head: int`` (single consumer, non-atomic) is intentionally left
  unpadded — only the consumer touches it.
- src/lockfreequeues/unbounded_mupmuc.nim
  Segment.{next, tail, prevConsumerIdx, committed} + UnboundedMupmuc.{
  headSegment, tailSegment} gain alignment. newSegment swaps to
  allocAligned.

The ``c_calloc`` import is dropped; ``c_free`` retained for segment
release (POSIX guarantees posix_memalign blocks are free-compatible).

Cross-platform fix in aligned_alloc.nim:
- ``std/posix.posix_memalign`` triggers a clang error under ``nim cpp`` on
  macOS because the SDK declares the first parameter ``__unsafe_indexable``
  (incomplete-type ``void *`` vs the wrapper's ``void **``).
- Replaced the ``import std/posix`` with a local ``importc, header:
  "<stdlib.h>"`` shim using the canonical C signature, which clang accepts
  in both C and C++ modes. Documented in aligned_alloc.nim header comment.

Verification:
- tests/t_unbounded_padding.nim: 8/8 GREEN under both nim c and nim cpp.
- Full suite under nim c (orc), nim cpp, --mm:arc, --mm:refc:
    [Summary] 200 tests run: 200 OK, 0 FAILED, 0 SKIPPED  (all four variants)
- t_unbounded_*_threaded suites pass — no false-sharing-induced timing
  regression in the threaded code paths.

Refs: design doc §4.2; impl plan Task 3.3.

* feat(bench): add loony_adapter (unbounded MPMC) behind compile gate

PR 3 Task 3.5 (comparison MVP, first FFI adapter).

Adds the Loony unbounded MPMC adapter and a parametric round-trip test in
tests/t_bench_adapters.nim, both gated on -d:adapter_loony_available so
the in-tree test suite passes vacuously when the package is absent.

Files:
- benchmarks/nim/adapters/loony_adapter.nim — adapter, topologiesSupported
  = {tMpmcUnbounded}, makeLoonyAdapter / push / pop / cleanup / name.
- tests/t_bench_adapters.nim — push/pop 1000 uint64 round-trip preserves
  set; parametric template extends to subsequent adapters in 3.6–3.10.

Loony storage-encoding fix (NOT a hack — required by Loony's design):
Loony reserves the low 3 bits of each slot for RESUME/WRITER/READER state
flags (loony/spec.nim: SLOTMASK = ~0x7). Loony's empty-slot detection
reads ``(prev and SLOTMASK) == 0``, treating a zero "pointer" payload as
"slot empty". Loony was designed for ref/ptr types whose addresses are
8-byte-aligned and never null. Pushing the literal value 0 (or any value
whose low 3 bits are non-zero) breaks Loony's pop path.

Adapter encodes the harness's ``uint64(i)`` payload as ``(v + 1) shl 3``
on push and ``(raw shr 3) - 1`` on pop:
- ``shl 3`` clears the low flag bits.
- ``+ 1`` ensures the encoded value is never zero.
This matches what Loony's documentation expects of its users (pushing
8-byte-aligned, non-null ref addresses); it does NOT change Loony's
algorithm or affect
the comparison fairness — every push/pop still goes through Loony's full
fast/slow paths.
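The encoding is plain shift arithmetic — shown here in Python, ignoring uint64 wraparound, which the Nim code's unsigned ops handle natively:

```python
# Loony slot-encoding sketch: `shl 3` clears the low 3 flag bits that
# loony/spec.nim's SLOTMASK reserves; `+ 1` guarantees the encoded word is
# never 0, Loony's empty-slot sentinel. Decode inverts both steps.
def encode(v: int) -> int:
    return (v + 1) << 3

def decode(raw: int) -> int:
    return (raw >> 3) - 1

for v in (0, 1, 7, 8, 999, 2**32):
    e = encode(v)
    assert e != 0           # never collides with the empty-slot sentinel
    assert e & 0x7 == 0     # flag bits stay clear for Loony's state machine
    assert decode(e) == v   # lossless round-trip
```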

First-iteration RED state (recorded for the audit):
- Plain push of ``uint64(i)`` returned only 125 of 1000 items, all
  multiples of 8 (i.e. the values whose ``i and 7 == 0``).
- ``shl 3`` alone returned 999 of 1000 items (everything except 0,
  which collided with Loony's empty-slot sentinel).
- ``(v + 1) shl 3`` returned all 1000 items, set-equal to the input.

Verification: tests/t_bench_adapters.nim with -d:adapter_loony_available
+ --path:loony-0.3.1 --path:arc-0.0.4 -> [Summary] 1 tests run: 1 OK.
Full in-tree suite (-d:testing without the loony gate) -> 200/200 OK.

Refs: design doc §3 PR 3; impl plan Task 3.5; loony/spec.nim SLOTMASK.

* feat(bench): add boost_lockfree_queue adapter (MPMC bounded, nim cpp)

Boost.LockFree is C++ header-only (BSL-1.0). Adapter wraps
boost::lockfree::queue<unsigned long long> with capacity-bounded
push (bounded_push) and pop. Topology: tMpmc.

Compile gates:
  -d:adapter_boost_lockfree_queue_available  -- enable
  nim cpp                                     -- required (else hard error)

Pulls headers from the conventional install paths (apt /usr/include,
brew /opt/homebrew/opt/boost/include, /usr/local/include) and accepts
-d:boostIncludeDir=<path> for non-standard locations.

Adds smoke_boost.nim (compile-and-run sanity for CI) and a parametric
round-trip test in tests/t_bench_adapters.nim gated by both the define
and 'cpp' build mode.

Adapter capacity argument is the fixed node-pool size; pushes that
exceed it return prFull. Default 1024 mirrors other bounded adapters.

* feat(bench): add boost_lockfree_spsc adapter (SPSC bounded, nim cpp)

Boost.LockFree's spsc_queue is a wait-free single-producer single-consumer
ring buffer. The pop overload used here -- pop(out, max_size) -- returns
the count of items popped (csize_t); 1 means success, 0 means empty.

Compile gates:
  -d:adapter_boost_lockfree_spsc_available  -- enable
  nim cpp                                    -- required

Topology: tSpsc. Capacity is the fixed ring size (default 1024). Header
search paths and indirection rationale identical to boost_lockfree_queue
adapter; smoke_boost.nim covers both adapters under one binary.

* feat(bench): add bench-ffi-crossbeam Rust cdylib (ArrayQueue + SegQueue)

C-ABI shim around crossbeam_queue::ArrayQueue<u64> (bounded MPMC) and
crossbeam_queue::SegQueue<u64> (unbounded MPMC). Exposes 8 extern "C"
fns (cb_array_init/push/pop/destroy + cb_seg_init/push/pop/destroy)
that the Nim adapters in tasks 3.9 / 3.10 will importc.

- Cargo.toml: crate-type = [cdylib, rlib]; rlib so cargo test can link
  the integration test against the same code path the cdylib exports.
- rust-toolchain.toml pins stable channel for reproducibility.
- src/lib.rs: 8 extern fns; null-pointer safe; ArrayQueue::new(0) panic
  is intercepted (returns null) so we never unwind across FFI.
- tests/integration_test.rs: 6 tests covering round-trip set equality
  for both queues, full/empty edges, and null-pointer tolerance.

Build: cargo build --release  -> target/release/libbench_ffi_crossbeam.{so,dylib}
.gitignore extended for benchmarks/rust/*/target/ and Cargo.lock.

* feat(bench): add crossbeam_array_queue adapter (MPMC bounded, FFI cdylib)

importc the cb_array_init/push/pop/destroy fns from
benchmarks/rust/bench-ffi-crossbeam (Task 3.8) and surface them as a
typed Nim adapter with topologiesSupported = {tMpmc}.

The adapter avoids emitting -L/-lbench_ffi_crossbeam when its sibling
crossbeam_seg_queue_adapter gate is also active, so combined builds
do not produce 'ignoring duplicate libraries' warnings on Apple ld.

Adds smoke_crossbeam.nim covering both crossbeam adapters under one
binary (Task 3.10 lands the seg adapter); tests/t_bench_adapters.nim
extended with the gated parametric round-trip suite.

Compile gates:
  -d:adapter_crossbeam_array_queue_available
  -d:crossbeamLibDir=<path>   (override search path for non-default builds)

* feat(bench): add crossbeam_seg_queue adapter (MPMC unbounded, FFI cdylib)

importc the cb_seg_init/push/pop/destroy fns from the bench-ffi-crossbeam
cdylib. SegQueue is unbounded -- push always succeeds (returns prSuccess
when the handle is non-null; the underlying cb_seg_push only reports
failure on null pointers).

topologiesSupported = {tMpmcUnbounded}. capacity arg accepted-but-ignored
so the wire-up sites (Task 3.11) can pass capacity uniformly across all
adapters without a per-adapter conditional.

Compile gate: -d:adapter_crossbeam_seg_queue_available
Optional override: -d:crossbeamLibDir=<path>

* feat(bench): wire MVP comparison adapters into bench_spsc/mpmc/unbounded

Each MVP adapter is pulled in only when its compile-time gate is set;
absent gates leave production builds unchanged. Variant lists are
populated at compile time via 'when declared(...)' so the CLI's
--list / unknown-variant error messages reflect the actual compiled-in
set.

Wiring map (per design 2.4 / Track 3 §3.11):

  bench_spsc:
    - boost_lockfree_spsc -> spsc/1p1c (gated by
      adapter_boost_lockfree_spsc_available)

  bench_mpmc:
    - boost_lockfree_queue   -> mpmc/{1,2,4}p{1,2,4}c
    - crossbeam_array_queue  -> mpmc/{1,2,4}p{1,2,4}c
    (each gated by its respective adapter_*_available define)

  bench_unbounded:
    - loony                  -> mpmc_unbounded/{1,2,4}p{1,2,4}c
    - crossbeam_seg_queue    -> mpmc_unbounded/{1,2,4}p{1,2,4}c

bench_mpsc has no MVP coverage today: none of the MVP libraries provide
a dedicated MPSC API at the natural shape; design 2.4 is silent on MPSC
parity for Track 3 and the existing in-tree mupsic baseline stays the
sole MPSC slug.

Verified locally: bench_mpmc + bench_unbounded produce expected
crossbeam_* slugs at the small fixture (10K msg, 2 runs).

* ci(bench): add boost+loony soft-skip flow per design §2.6

workflow_dispatch inputs force_skip_boost / force_skip_loony surface
the override hook; FORCE_SKIP_BOOST / FORCE_SKIP_LOONY env vars relay
them into step-level if: gates.

Per library: install -> smoke -> set ADAPTER_*=1 on success -> annotate
on any failure. The combined defines + compile-mode (c vs cpp) flow
into the per-binary compile step via a final 'Build ada…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
