
PR 1: bench_latency wired to BMF (smoke shape, multi-measure-per-slug)#20

Merged
elijahr merged 10 commits into feat/bench-rollup-pr0-bench-common from
feat/bench-rollup-pr1-latency-mvp
May 5, 2026

Conversation

@elijahr
Owner

@elijahr elijahr commented May 1, 2026

Stacked PR — Depends on PR #19 (feat/bench-rollup-pr0-bench-common).

PR 1 of the bench-rollup series. Proves multi-measure-per-slug end-to-end by wiring bench_latency.nim into the BMF emission pipeline introduced in PR 0, then merging the latency fragment with the throughput fragment in CI before a single bencher run upload.

What changed

  • benchmarks/nim/bench_latency.nim — rewritten on top of bench_common.runLatencyHarness. CLI mirrors bench_throughput: positional variant filter (sipsic / mupmuc / sipmuc / mupsic), --bmf-out=<path> for native BMF emission, stdout text preserved. New per-binary intdefines: BenchLatencyRuns (default 33), BenchLatencyMessageCount (default 100_000), BenchLatencyWarmupRuns (default 3).
  • .github/workflows/bench.yml — restructured from one all-in-one job into three:
    • bench-throughput (renamed from benchmark): builds + runs bench_throughput, uploads bench-throughput-bmf artifact.
    • bench-latency (NEW): builds + runs bench_latency, uploads bench-latency-bmf artifact. CI-tuned shape: 50K messages × 11 runs × 2 warmup so the four 1p1c smoke variants finish in single-digit seconds inside the 12-min budget.
    • bench-upload (NEW, needs both): downloads bench-*-bmf artifacts, runs python3 benchmarks/merge_bmf.py merged.json $(ls bmf-inputs/*.json), then runs the existing Bencher CLI upload flows (PR comparison / push baseline / workflow_dispatch). One bencher run invocation -> one Bencher Report -> latency + throughput measures co-located on shared per-slug histories (Bencher creates a separate Report per invocation, so multiple uploads would NOT co-locate measures — see merge rationale in design 1).
  • benchmarks/nim/adapter.nim — re-exports PushResult / PopResult from bench_common instead of defining its own copies, unifying the two parallel definitions introduced by PR 0 Task 0.1. Both adapter packs (legacy lockfreequeues_sipsic / mupmuc / channels and the newer lockfreequeues_sipmuc / mupsic / unbounded_*) now flow through the same harnesses without per-call-site type conversion.
  • benchmarks/nim/adapters/lockfreequeues_sipsic_adapter.nim and lockfreequeues_mupmuc_adapter.nim — drop newPopResult / emptyPopResult constructors in favor of the plain PopResult[T] literal form (compatible with the unified non-case-object type). Mupmuc adapter doc clarifies that getProducer(idx = 0) / getConsumer(idx = 0) bypasses Mupmuc's threadId-based registration, so the pre-built producer / consumer slots are safe to drive from either of the harness's pinger and ponger threads (Mupmuc's underlying Vyukov per-slot CAS protocol is fully concurrent-safe).
  • tests/t_bench_latency.nim (NEW) — 7 tests across three suites:
    • Task 1.1: intdefine defaults + override compile-time assertions (2 tests).
    • Task 1.2: --bmf-out integration on sipsic single-variant + four-variant slug coverage + unknown-variant exit-1 (3 tests).
    • Task 1.5: synthetic throughput + latency fragment merge + collision-guard regression (2 tests).
  • CHANGELOG.md — bullets under [Unreleased] -> Added (bench_latency native BMF, bench-latency CI job) and Changed (adapter type unification).
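The merge step in bench-upload unions per-binary fragments keyed by slug. A minimal Python sketch of that union, including the collision guard the Task 1.5 regression test exercises (a hypothetical reimplementation for illustration; the real benchmarks/merge_bmf.py may differ):

```python
import json
import sys

def merge_bmf(paths):
    """Union BMF fragments: a slug may appear in several fragments,
    but each (slug, measure) pair may only be defined once."""
    merged = {}
    for path in paths:
        with open(path) as f:
            fragment = json.load(f)
        for slug, measures in fragment.items():
            slot = merged.setdefault(slug, {})
            for measure, metric in measures.items():
                if measure in slot:
                    # Collision guard: refuse to silently overwrite a measure.
                    raise SystemExit(
                        f"duplicate measure {measure!r} for slug {slug!r} in {path}")
                slot[measure] = metric
    return merged

if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: merge_bmf.py merged.json fragment1.json fragment2.json ...
    out, *inputs = sys.argv[1:]
    with open(out, "w") as f:
        json.dump(merge_bmf(inputs), f, indent=2)
```

With this shape, a throughput fragment and a latency fragment that share a slug merge into one entry carrying both measure sets, which is what lets a single bencher run upload co-locate them.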

Emitted measures (this PR)

Each of the 4 smoke-shape slugs carries latency_p50_ns, latency_p95_ns, latency_p99_ns:

  • lockfreequeues_sipsic/spsc/1p1c
  • lockfreequeues_sipmuc/mpmc/1p1c
  • lockfreequeues_mupsic/mpsc/1p1c
  • lockfreequeues_mupmuc/mpmc/1p1c

latency_p999_ns / latency_max_ns are deferred to PR 6's threshold-gating work, per the impl plan.
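For reference, a BMF fragment for one of these slugs is a JSON object keyed by benchmark slug, then by measure slug (shape per Bencher's Metric Format; the values below are illustrative, taken from the smoke report further down):

```python
import json

# Illustrative BMF fragment: one slug carrying all three latency measures.
# Each measure may additionally carry optional lower_value / upper_value bounds.
fragment = {
    "lockfreequeues_sipsic/spsc/1p1c": {
        "latency_p50_ns": {"value": 436.82},
        "latency_p95_ns": {"value": 451.00},
        "latency_p99_ns": {"value": 463.64},
    }
}

print(json.dumps(fragment, indent=2))
```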

Verification

  • nim c -r tests/test.nim — 188 core tests pass, 0 failures (no regression).
  • nim c -r tests/t_bench_common.nim — 26 tests pass (PR 0 surface unchanged).
  • nim c -r tests/t_bench_latency.nim — 7 tests pass (new).
  • actionlint .github/workflows/bench.yml — clean.
  • Local smoke: bench_latency --bmf-out=lat.json sipsic mupmuc sipmuc mupsic produces all 4 slugs with all 3 measures, p50 < p95 < p99 < max.

Follow-ups (not in this PR)

  • Task 1.6 (Bencher dashboard verification) — manual step from the impl plan; the screenshot lands in this PR description after CI green and the maintainer's Bencher dashboard inspection.
  • --threshold-measure throughput — preserved as-is in the push-baseline Bencher invocation; the BMF emission renamed the measure to throughput_ops_ms in PR 0 Task 0.10, and PR 6 (Task 6.2) reworks the threshold blocks per measure. Out of scope for PR 1.
  • benchmarks/nim/tests/t_adapter_*.nim — orphaned per-adapter unit tests still reference the pre-PR-0-Task-0.9 adapter filenames (lockfreequeues_sipsic instead of lockfreequeues_sipsic_adapter). Not wired into any CI / nimble task; pre-existing breakage. Will tidy in a later PR.
  • bench_common adapter cleanup hook: runOneLatencyRun / runOneThroughputRun don't call adapter cleanup / deinit*, leaking ~2 small heap allocations per run × 36 runs × 4 variants for the latency path. Bounded and acceptable for the 1p1c smoke shape, but PR 2's topology split should add a generic teardown hook.
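The generic teardown hook mentioned above could take the shape of "call an optional cleanup if the adapter defines one" — a Python sketch of the pattern (hypothetical names; the actual Nim harness in bench_common differs):

```python
def run_one_latency_run(adapter_factory, message_count):
    """Run one measurement pass, then tear the adapter down if it supports it."""
    adapter = adapter_factory()
    try:
        # Hypothetical per-message ping-pong measurement.
        return [adapter.ping_pong() for _ in range(message_count)]
    finally:
        cleanup = getattr(adapter, "cleanup", None)  # optional teardown hook
        if callable(cleanup):
            cleanup()
```

Adapters without a cleanup method keep working unchanged, so the hook can be adopted incrementally in PR 2.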

References

  • Design doc: /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md section 3 PR 1.
  • Implementation plan: /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md Track 1 (Tasks 1.1-1.7).

@github-actions
Contributor

github-actions Bot commented May 1, 2026

🐰 Bencher Report

Branch: feat/bench-rollup-pr1-latency-mvp
Testbed: ubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

All benchmark results (every cell reported ⚠️ NO THRESHOLD; "view plot" links omitted):

Benchmark                                           | latency_p50_ns | latency_p95_ns | latency_p99_ns | throughput_ops_ms (x 1e3)
lockfreequeues_mupmuc/mpmc/1p1c                     | 273.45         | 300.00         | 306.00         | 13.07
lockfreequeues_mupmuc/mpmc/2p2c                     |                |                |                | 20.82
lockfreequeues_mupmuc/mpmc/4p4c                     |                |                |                | 15.65
lockfreequeues_mupmuc/mpmc/8p8c                     |                |                |                | 17.98
lockfreequeues_mupsic/mpsc/1p1c                     | 280.00         | 300.82         | 309.18         |
lockfreequeues_sipmuc/mpmc/1p1c                     | 272.64         | 300.00         | 301.00         |
lockfreequeues_sipsic/spsc/1p1c                     | 436.82         | 451.00         | 463.64         | 7.91
lockfreequeues_unbounded_mupsic/mpsc_unbounded/1p1c |                |                |                | 8.53
lockfreequeues_unbounded_mupsic/mpsc_unbounded/2p1c |                |                |                | 7.32
lockfreequeues_unbounded_mupsic/mpsc_unbounded/4p1c |                |                |                | 8.74
nim_channels/mpmc/1p1c                              |                |                |                | 1.85
nim_channels/mpmc/2p2c                              |                |                |                | 1.08
nim_channels/mpmc/4p4c                              |                |                |                | 1.25
🐰 View full continuous benchmarking report in Bencher


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request rewrites the bench_latency tool to utilize the common benchmarking harness, enabling native Bencher Metric Format (BMF) output via a new --bmf-out flag. It unifies the PushResult and PopResult types across adapters by re-exporting them from bench_common, which simplifies the adapter implementations for various queue variants. Additionally, the PR introduces a new bench-latency GitHub workflow job and comprehensive integration tests for the latency benchmark and BMF merging logic. I have no feedback to provide.

@elijahr
Owner Author

elijahr commented May 2, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request rewrites the bench_latency utility to leverage the bench_common.runLatencyHarness, enabling native Bencher Metric Format (BMF) JSON output and standardizing the CLI interface. It also unifies the PushResult and PopResult types across adapters and introduces comprehensive integration tests. The review feedback highlights an opportunity to improve CLI error messages for unknown short flags and suggests renaming a variable to better reflect the variant terminology used throughout the codebase.

Comment thread benchmarks/nim/bench_latency.nim Outdated
      quit 1
    bmfOutPath = p.val
  else:
    echo "Unknown flag: --", p.key

medium

The error message for unknown flags incorrectly assumes all flags are long options (prefixed with --). If a user provides an unknown short option (e.g., -x), the output will be Unknown flag: --x, which is slightly misleading.

          echo "Unknown flag: ", (if p.kind == cmdLongOption: "--" else: "-"), p.key

Comment thread benchmarks/nim/bench_latency.nim Outdated
  if positional.len == 0:
    supported
  else:
    var groups = initHashSet[string]()

medium

The variable name groups is used here, but in the context of bench_latency, the positional arguments represent individual variants rather than groups of variants (unlike bench_throughput). Renaming this to variants would improve clarity and consistency with the rest of the file.

      var variants = initHashSet[string]()

elijahr added a commit that referenced this pull request May 2, 2026
…ps to variants

Per Gemini review on PR #20.

1. Unknown-flag error message hardcoded `--` regardless of cmdShortOption
   vs cmdLongOption. `-x` printed as `Unknown flag: --x`. Now picks
   the right prefix from p.kind.

2. `groups` is a leftover name from bench_throughput where positional
   args ARE topology groups. In bench_latency they're individual
   variants. Renamed for clarity.
@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 459090f to 19c6dcb Compare May 2, 2026 05:59
@elijahr
Owner Author

elijahr commented May 2, 2026

/gemini review

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@elijahr
Owner Author

elijahr commented May 2, 2026

/review

@qodo-code-review

qodo-code-review Bot commented May 2, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)



Action required

1. Bench-upload missing token scopes 🐞 Bug ☼ Reliability
Description
The bench-upload job sets job-level permissions but does not grant contents: read (for
actions/checkout) or actions: read (for actions/download-artifact). In repos/orgs that
restrict default GITHUB_TOKEN permissions, this job can fail to checkout or download artifacts,
preventing BMF merge and Bencher upload.
Code

.github/workflows/bench.yml[R187-208]

+    name: Merge BMF + Bencher upload (ubuntu-latest)
+    runs-on: ubuntu-latest
+    if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository
+    needs: [bench-throughput, bench-latency]
+    permissions:
+      pull-requests: write
+      checks: write
+    timeout-minutes: 10
+
+    steps:
+      - name: Checkout project
+        uses: actions/checkout@v4
+
+      - name: Download BMF artifacts
+        # Pattern matches both bench-throughput-bmf and bench-latency-bmf
+        # (and any future per-binary artifact uploaded by PR 2's topology
+        # split, which uses the same `bench-*-bmf` naming convention).
+        uses: actions/download-artifact@v4
+        with:
+          pattern: bench-*-bmf
+          path: ./bmf-inputs/
+          merge-multiple: true
Evidence
bench-upload explicitly overrides token permissions to only `pull-requests: write` and
`checks: write`, then immediately invokes `actions/checkout@v4` and `actions/download-artifact@v4`,
both of which rely on GitHub API access that is commonly controlled by the `contents` and `actions` scopes.

.github/workflows/bench.yml[182-208]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`bench-upload` overrides `GITHUB_TOKEN` permissions but does not include the permissions needed to (a) checkout the repository and (b) download workflow artifacts. This can cause 403/permission failures and stop the merge+upload pipeline.
### Issue Context
The job runs `actions/checkout@v4` and `actions/download-artifact@v4` after setting a restrictive `permissions:` block.
### Fix Focus Areas
- .github/workflows/bench.yml[182-208]
### Suggested change
Update `bench-upload.permissions` to include at least:
- `contents: read`
- `actions: read`
Keep existing:
- `pull-requests: write`
- `checks: write`
(Optionally, validate if Bencher/comment posting needs any additional scopes beyond these.)

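Applied to the job in the diff above, the amended permissions block would look like this sketch (scope names per GitHub Actions' GITHUB_TOKEN permissions model; surrounding keys reproduced from the diff):

```yaml
  bench-upload:
    needs: [bench-throughput, bench-latency]
    permissions:
      contents: read        # needed by actions/checkout
      actions: read         # needed by actions/download-artifact
      pull-requests: write  # existing: Bencher PR comment
      checks: write         # existing: Bencher check run
```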



Advisory comments

2. Artifact merge overwrite risk 🐞 Bug ☼ Reliability
Description
actions/download-artifact is configured with merge-multiple: true, so if multiple bench-*-bmf
artifacts contain the same JSON filename, later downloads can overwrite earlier ones and silently
drop measures before merge. This is especially likely if future topology-split jobs keep emitting a
generic name like throughput.json.
Code

.github/workflows/bench.yml[R200-209]

+      - name: Download BMF artifacts
+        # Pattern matches both bench-throughput-bmf and bench-latency-bmf
+        # (and any future per-binary artifact uploaded by PR 2's topology
+        # split, which uses the same `bench-*-bmf` naming convention).
+        uses: actions/download-artifact@v4
+        with:
+          pattern: bench-*-bmf
+          path: ./bmf-inputs/
+          merge-multiple: true
+
Evidence
The workflow merges multiple artifacts into a single directory (merge-multiple: true) and then
enumerates bmf-inputs/*.json for merging; filename collisions in that directory would result in
lost inputs.

.github/workflows/bench.yml[200-229]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Multiple artifacts are merged into one directory. If two artifacts contain the same JSON filename, one can overwrite the other, causing missing benchmark measures in `merged.json`.
### Issue Context
This is a latent issue that can appear when adding more per-binary/per-topology benchmark jobs.
### Fix Focus Areas
- .github/workflows/bench.yml[200-229]
### Suggested change options
1. Ensure each producer job writes uniquely named files (e.g., `throughput-sipsic.json`, `latency.json`, etc.).
2. Or download artifacts into separate subdirectories (do not use `merge-multiple: true`) and pass all discovered JSON paths to `merge_bmf.py`.

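Option 2 from the suggestion could be sketched as follows (hypothetical step names; actions/download-artifact@v4 defaults merge-multiple to false, so each artifact lands in its own subdirectory):

```yaml
      - name: Download BMF artifacts
        uses: actions/download-artifact@v4
        with:
          pattern: bench-*-bmf
          path: ./bmf-inputs/
          # no merge-multiple: each artifact gets its own subdirectory,
          # so identical filenames inside artifacts cannot collide

      - name: Merge BMF fragments
        run: |
          python3 benchmarks/merge_bmf.py merged.json \
            $(find bmf-inputs -name '*.json')
```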



Previous review results

Review updated until commit 7a6ff2c

Results up to commit 19c6dcb: same two findings as above (bench-upload missing token scopes; artifact merge overwrite risk).

Comment thread .github/workflows/bench.yml Outdated
Comment on lines +187 to +208
    name: Merge BMF + Bencher upload (ubuntu-latest)
    runs-on: ubuntu-latest
    if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository
    needs: [bench-throughput, bench-latency]
    permissions:
      pull-requests: write
      checks: write
    timeout-minutes: 10

    steps:
      - name: Checkout project
        uses: actions/checkout@v4

      - name: Download BMF artifacts
        # Pattern matches both bench-throughput-bmf and bench-latency-bmf
        # (and any future per-binary artifact uploaded by PR 2's topology
        # split, which uses the same `bench-*-bmf` naming convention).
        uses: actions/download-artifact@v4
        with:
          pattern: bench-*-bmf
          path: ./bmf-inputs/
          merge-multiple: true

Action required

1. Bench-upload missing token scopes 🐞 Bug ☼ Reliability

The bench-upload job sets job-level permissions but does not grant contents: read (for
actions/checkout) or actions: read (for actions/download-artifact). In repos/orgs that
restrict default GITHUB_TOKEN permissions, this job can fail to checkout or download artifacts,
preventing BMF merge and Bencher upload.

elijahr added a commit that referenced this pull request May 2, 2026
@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 19c6dcb to 7a6ff2c Compare May 2, 2026 10:25
@elijahr
Owner Author

elijahr commented May 2, 2026

/review

@qodo-code-review

qodo-code-review Bot commented May 2, 2026

Persistent review updated to latest commit 7a6ff2c

@github-actions
Contributor

github-actions Bot commented May 2, 2026

Preparing review...

@elijahr
Owner Author

elijahr commented May 3, 2026

@gemini-code-assist review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request rewrites the bench_latency benchmark to use the bench_common harness, enabling native Bencher Metric Format (BMF) output and aligning its CLI with bench_throughput. It unifies the PushResult and PopResult types across adapters, updates the CI workflow to merge benchmark results, and adds integration tests. Feedback suggests including the p999 latency metric in the stdout report to maintain consistency with previous versions and improve tail latency visibility.

Comment on lines +138 to +139
echo fmt" p99: {metrics.p99_ns:.0f} ns"
echo fmt" max: {metrics.max_ns:.0f} ns"

medium

The p999 latency metric is calculated by the harness and available in the metrics object, but it is omitted from the stdout report. For consistency with the previous implementation and to provide better tail latency visibility during manual runs, it should be printed alongside the other percentiles.

  echo fmt"  p99:  {metrics.p99_ns:.0f} ns"
  echo fmt"  p999: {metrics.p999_ns:.0f} ns"
  echo fmt"  max:  {metrics.max_ns:.0f} ns"

elijahr added a commit that referenced this pull request May 3, 2026
@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 7a6ff2c to a41a91d Compare May 3, 2026 01:48
@elijahr
Owner Author

elijahr commented May 3, 2026

@gemini-code-assist review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request rewrites the bench_latency utility to utilize the bench_common harness, introduces native Bencher Metric Format (BMF) output, and unifies adapter result types across different queue implementations. It also adds a comprehensive integration test suite and updates the CI workflow to merge performance metrics. Feedback was provided to improve the test suite's robustness by using unique temporary directories for build artifacts to prevent collisions and ensuring cross-platform compatibility for binary execution.

Comment thread tests/t_bench_latency.nim Outdated
Comment on lines +27 to +29
  let outBin = getTempDir() / "bench_latency_t11_defaults"
  let cmd = "nim c --threads:on -d:release -d:BenchLatencyTestCompileTime=1 " &
    "-o:" & outBin & " " & BenchLatencySrc

medium

The integration tests compile the bench_latency binary into the system temporary directory using static filenames (e.g., bench_latency_t11_defaults). This approach has two issues: it can lead to collisions if tests are run in parallel or in a shared environment, and it leaves behind compiled binaries that are never cleaned up. Consider using createTempDir to create a unique workspace for each test run and ensure the directory is removed via a defer block.
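The suggested pattern, shown in Python for illustration (the Nim version would pair createTempDir with defer: removeDir(dir)): a fresh per-invocation directory that is always removed.

```python
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def unique_workspace(prefix="bench_latency_test_"):
    """Yield a fresh per-invocation directory and always remove it, so
    parallel test runs cannot collide on a shared binary name and no
    compiled artifacts are left behind."""
    workdir = tempfile.mkdtemp(prefix=prefix)
    try:
        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)

# Usage sketch (hypothetical): compile the benchmark into the workspace.
# with unique_workspace() as d:
#     out_bin = os.path.join(d, "bench_latency")
#     subprocess.run(["nim", "c", f"-o:{out_bin}", src], check=True)
```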

Comment thread tests/t_bench_latency.nim Outdated
## Compile bench_latency.nim with extra -d: defines, return path to the
## resulting binary. Compiles in release mode for realistic timing but
## with tiny message counts so the integration test stays fast.
let outBin = getTempDir() / ("bench_latency_t" & suffix)

medium

Similar to the static binary names in the intdefine tests, compileBenchLatency creates binaries in the temporary directory that are not cleaned up. Additionally, for cross-platform compatibility, you should append ExeExt to the binary name to ensure it works correctly on Windows, although the current project seems focused on Unix-like environments.

  let outBin = getTempDir() / ("bench_latency_t" & suffix & "_" & $getCurrentProcessId() & ExeExt)

elijahr added a commit that referenced this pull request May 3, 2026
@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from a41a91d to b0c8224 Compare May 3, 2026 02:45
@elijahr
Owner Author

elijahr commented May 3, 2026

@gemini-code-assist review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the latency benchmarking infrastructure by migrating bench_latency to a shared harness, unifying adapter result types, and adding native Bencher Metric Format (BMF) support. It also introduces comprehensive integration tests for benchmark configuration and BMF merging. Review feedback suggests improving CLI robustness by supporting standard POSIX flag formats and maintaining test code consistency by utilizing existing workspace helpers.

Comment on lines +165 to +169
  of "bmf-out":
    if p.val.len == 0:
      echo "Missing value for --bmf-out"
      quit 1
    bmfOutPath = p.val

medium

The current CLI parsing for --bmf-out using parseopt expects the format --bmf-out=path. If a user provides --bmf-out path (with a space), p.val will be empty, and the program will quit with an error. While acceptable for a benchmark tool, it's worth noting that this deviates from standard POSIX-style flag parsing where both forms are typically supported.

Comment thread tests/t_bench_latency.nim

suite "bench_latency multi-measure-per-slug merge (Task 1.5)":
test "merge_bmf.py unions throughput + latency on shared slug":
let dir = createTempDir("bench_latency_t15_", "")

medium

In the Task 1.5 test suite, createTempDir is called directly instead of using the newTestWorkspace helper defined at line 19. For consistency and to ensure proper cleanup (as newTestWorkspace includes a descriptive prefix), it's better to use the helper here as well.

    let dir = newTestWorkspace("t15_merge")

elijahr added a commit that referenced this pull request May 3, 2026
@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from b0c8224 to 55c16bf Compare May 3, 2026 03:03
@elijahr
Owner Author

elijahr commented May 3, 2026

@gemini-code-assist review

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the latency benchmarking tool to utilize a shared harness and unifies the queue adapter interface by re-exporting result types from a common module. It introduces native Bencher Metric Format (BMF) output via a new CLI flag, updates the CI workflow to merge these metrics, and adds comprehensive integration tests. Feedback was provided to use a Nim template to reduce repetitive boilerplate in the benchmark variant dispatch logic.

Comment on lines +107 to +133
runLatencyHarness[SipsicAdapter[LatencyCapacity, uint64]](
  queueInit = initSipsic,
  messageCount = BenchLatencyMessageCount,
  runCount = BenchLatencyRuns,
  warmupCount = BenchLatencyWarmupRuns,
)
of "sipmuc":
  runLatencyHarness[LockfreequeuesSipmucAdapter[LatencyCapacity, 1, uint64]](
    queueInit = initSipmuc,
    messageCount = BenchLatencyMessageCount,
    runCount = BenchLatencyRuns,
    warmupCount = BenchLatencyWarmupRuns,
  )
of "mupsic":
  runLatencyHarness[LockfreequeuesMupsicAdapter[LatencyCapacity, 1, uint64]](
    queueInit = initMupsic,
    messageCount = BenchLatencyMessageCount,
    runCount = BenchLatencyRuns,
    warmupCount = BenchLatencyWarmupRuns,
  )
of "mupmuc":
  runLatencyHarness[MupmucAdapter[LatencyCapacity, uint64]](
    queueInit = initMupmuc,
    messageCount = BenchLatencyMessageCount,
    runCount = BenchLatencyRuns,
    warmupCount = BenchLatencyWarmupRuns,
  )

medium

The runLatencyHarness calls for each variant are highly repetitive. Using a template can significantly reduce this boilerplate and improve maintainability, especially as more variants or configuration parameters are added in future PRs.

    template runHarness(AdapterType, initProc): untyped =
      runLatencyHarness[AdapterType](
        queueInit = initProc,
        messageCount = BenchLatencyMessageCount,
        runCount = BenchLatencyRuns,
        warmupCount = BenchLatencyWarmupRuns,
      )

    let metrics =
      case variant
      of "sipsic": runHarness(SipsicAdapter[LatencyCapacity, uint64], initSipsic)
      of "sipmuc": runHarness(LockfreequeuesSipmucAdapter[LatencyCapacity, 1, uint64], initSipmuc)
      of "mupsic": runHarness(LockfreequeuesMupsicAdapter[LatencyCapacity, 1, uint64], initMupsic)
      of "mupmuc": runHarness(MupmucAdapter[LatencyCapacity, uint64], initMupmuc)
      else:
        raise newException(ValueError, "unknown variant: " & variant)

@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 55c16bf to 6c4cb5f Compare May 3, 2026 03:40
@elijahr
Owner Author

elijahr commented May 4, 2026

/ai-review

@axiomantic-momus

axiomantic-momus Bot commented May 4, 2026

Momus review posted — verdict APPROVE, 2 findings


@elijahr
Owner Author

elijahr commented May 4, 2026

/ai-review

2 similar comments
@elijahr
Owner Author

elijahr commented May 4, 2026

/ai-review

@elijahr
Owner Author

elijahr commented May 5, 2026

/ai-review


@axiomantic-momus axiomantic-momus Bot left a comment


PR 1 of the bench-rollup series: rewrites bench_latency.nim on bench_common.runLatencyHarness, splits bench.yml into three jobs (throughput, latency, upload), unifies PushResult/PopResult types via re-export from adapter.nim, and adds a 7-test t_bench_latency suite. Two low-severity findings: the bench-latency CI job clones deps from the mutable main branch (inconsistent with bench-throughput and build.yml which pin to release tags), and the new tests are not wired into any CI workflow or nimble task.

Severity tally: 2 Lows.

Low

  • BOT-A1 (.github/workflows/bench.yml:143): bench-latency CI job clones deps from main, making it non-deterministic
  • BOT-A2 (tests/t_bench_latency.nim:1): New bench_latency tests are not wired into any CI workflow or nimble task

Noteworthy

  • Clean unification of PushResult/PopResult via re-export from adapter.nim, eliminating the parallel definition split from PR 0.
  • merge_bmf.py collision guard test is thorough — covers both the happy-path union and the (slug, measure) collision rejection.
  • CI smoke shape tuning (50K × 11 runs × 2 warmup) is well-calibrated for the 12-min latency job budget.

Verdict: APPROVE.

Commands
  • Comment /ai-review or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
  • Reply to a finding with won't fix, by design, or not a bug to decline it.
  • Reply with instead, ... to propose an alternative fix.

Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.

Comment on lines +143 to +147
sudo apt-get -qq install -y clang

- name: Clone and install sibling Nim deps (nim-debra, nim-typestates)
run: |
set -e

BOT-A1 — Low (standards)
bench-latency CI job clones deps from main, making it non-deterministic
The new bench-latency job clones nim-debra and nim-typestates from --branch main. The sibling bench-throughput job (and build.yml) clone from pinned tags (v0.6.0, v0.7.0) with an explicit comment: "Pin to specific release tags (not main) for deterministic CI; bump in lockstep with build.yml and lockfreequeues.nimble's requires." Using main means any commit to those repos can silently break the bench-latency CI job without a corresponding change in this repo. This is inconsistent and undermines CI reproducibility.

Suggested change
        sudo apt-get -qq install -y clang

    - name: Clone and install sibling Nim deps (nim-debra, nim-typestates)
      run: |
        set -e
Change `--branch main` to `--branch v0.6.0` for nim-debra and `--branch v0.7.0` for nim-typestates in the bench-latency job, matching bench-throughput.

Comment thread tests/t_bench_latency.nim
Comment on lines +1 to +234
## Tests for benchmarks/nim/bench_latency.nim — the latency bench binary.
##
## Track 1 (PR 1) covers: per-binary intdefines (Task 1.1), --bmf-out
## emission via runLatencyHarness (Task 1.2), and multi-measure-per-slug
## merge with throughput (Task 1.5).
##
## The bench binary is invoked as a subprocess in the integration tests
## (Tasks 1.2 / 1.5); compile-time intdefine assertions (Task 1.1) live
## in the binary itself behind a `BenchLatencyTestCompileTime` flag and
## are exercised from a tiny dedicated build invocation here.

import std/[json, os, osproc, strutils, tempfiles]
import unittest2

const
  RepoRoot = currentSourcePath().parentDir.parentDir
  BenchLatencySrc = RepoRoot / "benchmarks" / "nim" / "bench_latency.nim"

proc newTestWorkspace(prefix: string): string =
  ## Allocate a private workspace dir for one test. Each test that
  ## compiles a bench_latency binary or writes a BMF file uses this so:
  ## 1. parallel runs (or repeated runs in the same shell) cannot
  ##    collide on a static `/tmp/bench_latency_t11_*` filename, and
  ## 2. compiled binaries don't accumulate in the system temp dir.
  ## The caller is responsible for `removeDir` (typically via `defer`).
  ## `prefix` is a per-test stem that matches the original static suffix
  ## so test failure messages stay legible.
  result = createTempDir("bench_latency_" & prefix & "_", "")

# ---------- Task 1.1: intdefine defaults + override ----------

suite "bench_latency intdefines (Task 1.1)":
  test "defaults: BenchLatencyRuns == 33, BenchLatencyMessageCount == 100_000":
    # Compile bench_latency.nim with -d:BenchLatencyTestCompileTime=1.
    # The binary, when this define is set, runs a `static` block that
    # asserts the two intdefine defaults; if they are wrong, compilation
    # fails with the static assert message.
    let dir = newTestWorkspace("t11_defaults")
    defer: removeDir(dir)
    let outBin = dir / ("bench_latency" & ExeExt)
    let cmd = "nim c --threads:on -d:release -d:BenchLatencyTestCompileTime=1 " &
      "-o:" & outBin & " " & BenchLatencySrc
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    if exitCode != 0:
      echo "compile output:\n", output

  test "overrides: -d:BenchLatencyRuns=2 -d:BenchLatencyMessageCount=1000 take effect":
    # Compile with overrides + a different test flag that checks the
    # overridden values rather than the defaults.
    let dir = newTestWorkspace("t11_overrides")
    defer: removeDir(dir)
    let outBin = dir / ("bench_latency" & ExeExt)
    let cmd = "nim c --threads:on -d:release " &
      "-d:BenchLatencyTestCompileTimeOverrides=1 " &
      "-d:BenchLatencyRuns=2 -d:BenchLatencyMessageCount=1000 " &
      "-o:" & outBin & " " & BenchLatencySrc
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    if exitCode != 0:
      echo "compile output:\n", output

# ---------- Task 1.2: --bmf-out integration ----------

proc compileBenchLatency(
    extraDefs: openArray[string], dir: string
): string =
  ## Compile bench_latency.nim with extra -d: defines into `dir` and
  ## return the binary path. Caller owns `dir` and must remove it.
  ## Compiles in release mode for realistic timing but with tiny
  ## message counts so the integration test stays fast.
  let outBin = dir / ("bench_latency" & ExeExt)
  var cmd = "nim c --threads:on -d:release"
  for d in extraDefs:
    cmd.add(" -d:" & d)
  cmd.add(" -o:" & outBin & " " & BenchLatencySrc)
  let (output, exitCode) = execCmdEx(cmd)
  if exitCode != 0:
    raise newException(IOError, "bench_latency compile failed:\n" & output)
  result = outBin

suite "bench_latency --bmf-out integration (Task 1.2)":
  test "sipsic variant emits latency_p50_ns / latency_p99_ns on expected slug":
    # Override message count + runs to keep the integration run under ~5s.
    let dir = newTestWorkspace("t12_sipsic")
    defer: removeDir(dir)
    let bin = compileBenchLatency(@[
      "BenchLatencyMessageCount=200",
      "BenchLatencyRuns=2",
    ], dir = dir)
    let bmfPath = dir / "bench_latency.json"
    let cmd = bin & " --bmf-out=" & bmfPath & " sipsic"
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    check fileExists(bmfPath)
    let node = parseJson(readFile(bmfPath))
    # Expected slug per design 2.2 / table at design line 357.
    let slug = "lockfreequeues_sipsic/spsc/1p1c"
    check node.hasKey(slug)
    let s = node[slug]
    check s.hasKey("latency_p50_ns")
    check s.hasKey("latency_p95_ns")
    check s.hasKey("latency_p99_ns")
    check s["latency_p50_ns"]["value"].getFloat() > 0.0
    check s["latency_p99_ns"]["value"].getFloat() >= s["latency_p50_ns"]["value"].getFloat()
    # Stdout text output preserved (acceptance: positional CLI behavior).
    check output.contains("Sipsic") or output.contains("sipsic")

  test "all four bounded variants emit latency_p50_ns / latency_p99_ns":
    # Per impl plan Track 1 Acceptance Criteria: BMF JSON contains
    # latency_p50_ns and latency_p99_ns for sipsic / sipmuc / mupsic /
    # mupmuc on the 1p1c smoke shape.
    let dir = newTestWorkspace("t12_all4")
    defer: removeDir(dir)
    let bin = compileBenchLatency(@[
      "BenchLatencyMessageCount=200",
      "BenchLatencyRuns=2",
    ], dir = dir)
    let bmfPath = dir / "bench_latency.json"
    let cmd = bin & " --bmf-out=" & bmfPath &
      " sipsic mupmuc sipmuc mupsic"
    let (_, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    let node = parseJson(readFile(bmfPath))
    let expectedSlugs = @[
      "lockfreequeues_sipsic/spsc/1p1c",
      "lockfreequeues_sipmuc/mpmc/1p1c",
      "lockfreequeues_mupsic/mpsc/1p1c",
      "lockfreequeues_mupmuc/mpmc/1p1c",
    ]
    for slug in expectedSlugs:
      check node.hasKey(slug)
      check node[slug].hasKey("latency_p50_ns")
      check node[slug].hasKey("latency_p99_ns")

  test "unknown variant exits 1":
    let dir = newTestWorkspace("t12_unknown")
    defer: removeDir(dir)
    let bin = compileBenchLatency(@[
      "BenchLatencyMessageCount=200",
      "BenchLatencyRuns=2",
    ], dir = dir)
    let cmd = bin & " bogus_variant"
    let (_, exitCode) = execCmdEx(cmd)
    check exitCode == 1

# ---------- Task 1.5: multi-measure-per-slug merge ----------
#
# Validates the end-to-end shape that Track 1 ships: a single slug
# carries BOTH `throughput_ops_ms` (from bench_throughput's BMF
# fragment) and `latency_p50_ns` / `latency_p99_ns` (from bench_latency)
# AFTER `merge_bmf.py` unions the two fragments. Production CI does this
# with real bench output; the test uses two synthetic fragments so it
# stays fast and deterministic.

const MergeBmfPath = RepoRoot / "benchmarks" / "merge_bmf.py"

suite "bench_latency multi-measure-per-slug merge (Task 1.5)":
  test "merge_bmf.py unions throughput + latency on shared slug":
    let dir = createTempDir("bench_latency_t15_", "")
    defer: removeDir(dir)
    let throughputPath = dir / "throughput.json"
    let latencyPath = dir / "latency.json"
    let mergedPath = dir / "merged.json"
    let slug = "lockfreequeues_sipsic/spsc/1p1c"

    # Synthetic throughput fragment.
    writeFile(throughputPath,
      """{
        "lockfreequeues_sipsic/spsc/1p1c": {
          "throughput_ops_ms": {
            "value": 1234.5,
            "lower_value": 1200.0,
            "upper_value": 1270.0
          }
        }
      }""")
    # Synthetic latency fragment on the SAME slug, distinct measures.
    writeFile(latencyPath,
      """{
        "lockfreequeues_sipsic/spsc/1p1c": {
          "latency_p50_ns": { "value": 250.0 },
          "latency_p99_ns": { "value": 875.0 }
        }
      }""")

    let cmd = "python3 " & MergeBmfPath & " " & mergedPath &
      " " & throughputPath & " " & latencyPath
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    if exitCode != 0:
      echo "merge stdout/stderr:\n", output
    check fileExists(mergedPath)

    let node = parseJson(readFile(mergedPath))
    check node.hasKey(slug)
    let s = node[slug]
    # All three measures from BOTH fragments must coexist on the shared slug.
    check s.hasKey("throughput_ops_ms")
    check s.hasKey("latency_p50_ns")
    check s.hasKey("latency_p99_ns")
    check s["throughput_ops_ms"]["value"].getFloat() == 1234.5
    check s["latency_p50_ns"]["value"].getFloat() == 250.0
    check s["latency_p99_ns"]["value"].getFloat() == 875.0

  test "merge_bmf.py exits 1 on collision when same measure key in both inputs":
    # Sanity: the per-slug union semantics are NOT a free-for-all;
    # ensure the collision guard from PR 0 Task 0.7 still fires when the
    # latency fragment accidentally re-declares throughput_ops_ms on the
    # same slug. This guards against silent overwrites that would erase
    # one of the measures.
    let dir = createTempDir("bench_latency_t15_collide_", "")
    defer: removeDir(dir)
    let aPath = dir / "a.json"
    let bPath = dir / "b.json"
    let mergedPath = dir / "merged.json"

    writeFile(aPath,
      """{
        "lockfreequeues_sipsic/spsc/1p1c": {
          "throughput_ops_ms": { "value": 100.0 }
        }
      }""")
    writeFile(bPath,
      """{
        "lockfreequeues_sipsic/spsc/1p1c": {
          "throughput_ops_ms": { "value": 200.0 }
        }
      }""")
    let cmd = "python3 " & MergeBmfPath & " " & mergedPath &
      " " & aPath & " " & bPath
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 1
    check output.contains("collision")

BOT-A2 — Low (tests)
New bench_latency tests are not wired into any CI workflow or nimble task
The 7 tests in tests/t_bench_latency.nim (intdefine defaults, --bmf-out integration, merge_bmf.py collision guard) are never run by CI. tests/test.nim (the only nimble test entry point) does not include or import t_bench_latency. Neither build.yml nor bench.yml invokes nim c -r tests/t_bench_latency.nim. The same gap existed for t_bench_common.nim from PR 0, but the new tests in this PR add a fresh instance of the problem. Without CI execution, these tests will silently bitrot.

elijahr added 10 commits May 4, 2026 22:46
Exposes per-binary {.intdefine.} overrides for latency wall-time
control: BenchLatencyRuns (default 33) and BenchLatencyMessageCount
(default 100_000). Mirrors the BenchSipsicRuns / MessageCount pattern
in bench_throughput.nim. Tests in tests/t_bench_latency.nim assert
defaults compile cleanly and that -d:BenchLatencyRuns=2
-d:BenchLatencyMessageCount=1000 overrides take effect via
compile-time static blocks.

Also fixes a stale import (./adapters/lockfreequeues_sipsic ->
./adapters/lockfreequeues_sipsic_adapter) so the file compiles after
PR 0 Task 0.9 renamed all bench adapters with the _adapter suffix.
The rest of the file (legacy runLatencyBenchmark / benchmarkLatency /
isMainModule) is preserved verbatim and gets fully rewritten in
Task 1.2 onto runLatencyHarness.
…Harness

Replaces the legacy ping-pong RTT body with a CLI mirroring
bench_throughput's: positional variant filter (sipsic / mupmuc /
sipmuc / mupsic), --bmf-out=<path> for native Bencher Metric Format
JSON emission, stdout text preserved. Per-binary run-shape
intdefines wired in: BenchLatencyRuns / BenchLatencyMessageCount /
BenchLatencyWarmupRuns drive runLatencyHarness directly.

Slug shape per design 2.2 (table at design line 357), 1p1c smoke
shape only — full grid lands in PR 2:
- lockfreequeues_sipsic/spsc/1p1c
- lockfreequeues_sipmuc/mpmc/1p1c
- lockfreequeues_mupsic/mpsc/1p1c
- lockfreequeues_mupmuc/mpmc/1p1c

Each carries latency_p50_ns / latency_p95_ns / latency_p99_ns
measures (p999 / max deferred to PR 6 per impl plan).

Type unification: PushResult / PopResult were duplicated across
benchmarks/nim/adapter.nim (legacy) and benchmarks/nim/bench_common.nim
(new), forcing two nominally distinct enums to coexist. Aliasing
adapter.PushResult / PopResult to bench_common's via re-export
unifies the surface — both adapter packs (legacy sipsic / mupmuc and
the newer sipmuc / mupsic / unbounded_*) now flow through the same
runLatencyHarness and runThroughputHarness without per-call-site
conversion. Verified: t_bench_common (26 tests) and bench_throughput
--bmf-out integration unchanged.

The Mupmuc / Sipsic adapters use getProducer(idx = 0) / getConsumer(
idx = 0) which bypasses Mupmuc's threadId-based registration, so the
pre-built producer / consumer slots are safe to drive from either of
the harness's pinger and ponger threads (Mupmuc's underlying
Vyukov per-slot CAS protocol is fully concurrent-safe).

Tests: tests/t_bench_latency.nim adds three Task 1.2 cases covering
the sipsic single-variant happy path, the four-variant slug coverage
required by Track 1 Acceptance Criteria, and the unknown-variant
exit-1 path. All pass with -d:BenchLatencyMessageCount=200
-d:BenchLatencyRuns=2 in under ~10s on Apple silicon.
….yml

Restructures bench.yml from one all-in-one bench-throughput job into
three jobs:

  bench-throughput  (renamed from 'benchmark'):
    builds + runs bench_throughput, uploads bench-throughput-bmf artifact.
  bench-latency     (NEW):
    builds + runs bench_latency, uploads bench-latency-bmf artifact.
    Compile-time overrides: BenchLatencyMessageCount=50000,
    BenchLatencyRuns=11, BenchLatencyWarmupRuns=2 (chosen so the four
    1p1c smoke-shape variants finish in single-digit seconds inside the
    12-min job budget).
  bench-upload      (NEW, needs both):
    downloads bench-*-bmf artifacts via actions/download-artifact@v4
    pattern, runs python3 benchmarks/merge_bmf.py merged.json
    $(ls bmf-inputs/*.json), then runs the existing Bencher CLI upload
    flows (PR comparison / push baseline / workflow_dispatch). One
    bencher run invocation -> one Bencher Report -> latency + throughput
    measures co-located on the shared per-slug history (design 1).

Bencher upload commands are unchanged in spirit: same PR / push /
workflow_dispatch gating, same --threshold-measure throughput on push
baselines, same yellow-warning preflight when BENCHER_API_TOKEN is
unset on forks. This commit does NOT correct the historical
`--threshold-measure throughput` -> `throughput_ops_ms` rename
deferred from PR 0 Task 0.10; PR 6 (Task 6.2) handles per-measure
threshold blocks in the same edit.

Plan reference: impl plan Track 1 Tasks 1.3 (line 350-357) and 1.4
(line 359-366). The two tasks share the same file and are tightly
coupled (1.4 strictly depends on 1.3); folding them into one commit
keeps the workflow file consistent at every commit boundary.

actionlint clean. SC2046 word-splitting on $(ls ...) is intentional
(each path must arrive as a separate positional argv to merge_bmf.py)
and is suppressed inline with a justification comment; artifact
filenames are repo-controlled (no whitespace / glob chars).
Adds two cases to tests/t_bench_latency.nim that validate the
end-to-end shape Track 1 ships: a single slug carries BOTH
`throughput_ops_ms` (from bench_throughput's BMF fragment) and
`latency_p50_ns` / `latency_p99_ns` (from bench_latency) AFTER
`merge_bmf.py` unions the two fragments.

  1. Multi-measure union — writes synthetic throughput + latency
     fragments on the same slug, runs python3 benchmarks/merge_bmf.py,
     and asserts the merged JSON carries all three measures with their
     original values intact.
  2. Collision guard — re-declares `throughput_ops_ms` on the same
     slug across both inputs and asserts merge_bmf.py exits 1 with a
     'collision' message in stderr. This regression-tests PR 0 Task 0.7's
     collision guard against silent overwrites that would erase one of
     the colliding measures.

Both tests use createTempDir + defer removeDir for isolated, leak-free
fixtures and run in <200ms because they bypass the bench compile +
execute path entirely. Plan reference: impl plan Track 1 Task 1.5
(line 368-376).
Adds two bullets to the [Unreleased] Added section covering bench_latency
--bmf-out emission and the new three-job bench.yml structure (bench-
throughput / bench-latency / bench-upload). Adds a Changed bullet
documenting the adapter.nim PushResult / PopResult unification.

Plan reference: impl plan Track 1 Task 1.7 (line 386-393).
…ps to variants

Per Gemini review on PR #20.

1. Unknown-flag error message hardcoded `--` regardless of cmdShortOption
   vs cmdLongOption. `-x` printed as `Unknown flag: --x`. Now picks
   the right prefix from p.kind.

2. `groups` is a leftover name from bench_throughput where positional
   args ARE topology groups. In bench_latency they're individual
   variants. Renamed for clarity.
…t dirs

Two CI hardening fixes from code review on PR 1:

1. The bench-upload job overrode GITHUB_TOKEN permissions to only
   pull-requests:write + checks:write but then ran actions/checkout
   and actions/download-artifact, which need contents:read and
   actions:read. Add both so the job is robust against orgs that
   restrict default token permissions.

2. Switch from merge-multiple:true to per-artifact subdirectories.
   Today's artifact filenames (throughput.json, latency.json) are
   unique, but PR 2's topology split adds bench_spsc/mpsc/mpmc/
   unbounded binaries; each could end up writing files that collide
   under merge-multiple. Each artifact now lands at
   bmf-inputs/<artifact-name>/<file> and merge_bmf.py is invoked
   over the recursive find.
The harness already computes metrics.p999_ns; surfacing it during
manual runs costs nothing and matches the legacy bench_latency
output. BMF wire-up for latency_p999_ns / latency_max_ns is
deferred to PR 6 alongside threshold configuration.
Replace static /tmp/bench_latency_t1*_* paths with per-test
createTempDir workspaces (via newTestWorkspace helper) cleaned up by
defer: removeDir. Use ExeExt for the compiled binary name so the
suffix is correct on Windows. Reshape compileBenchLatency to take an
owned `dir` instead of a `suffix` so workspace lifecycle stays with
the test that allocates it.

Prevents collisions between parallel test runs / repeated runs in
the same shell, and stops binaries from accumulating in the system
temp dir.
@elijahr elijahr force-pushed the feat/bench-rollup-pr1-latency-mvp branch from 6c4cb5f to 2684687 Compare May 5, 2026 03:47
@elijahr elijahr merged commit 2684687 into feat/bench-rollup-pr0-bench-common May 5, 2026
5 of 6 checks passed
@elijahr elijahr deleted the feat/bench-rollup-pr1-latency-mvp branch May 5, 2026 04:20
elijahr added a commit that referenced this pull request May 5, 2026
…ts (#19)

* feat(bench): stub bench_common module with public API surface (Task 0.1)

Compile-only stub for benchmarks/nim/bench_common.nim. Types and proc
signatures are final per design doc section 2.1; bodies raise
AssertionDefect with 'not implemented' so tasks 0.2-0.6 can be written
and reviewed in parallel against this surface.

Public surface stubbed:
- Topology enum (tSpsc/tMpsc/tMpmc + unbounded variants)
- PushResult / PopResult adapter primitives
- BMFEmitter + initBMFEmitter / addMeasure / emit
- Stats helpers (mean / stddev / minVal / maxVal / percentile)
- Histogram + initHistogram / record / topK / percentile
  (HistogramTopK = 1000, HistogramReservoir = 99000)
- LatencyMetrics + runLatencyHarness
- ThroughputMetrics + runThroughputHarness

Test (tests/t_bench_common.nim) is compile-time-only: it references
each promised symbol via 'when declared' and 'when compiles', leaving
behavior tests to subsequent tasks.

Stub file: 170 LOC (acceptance cap: 200).

* feat(bench): implement BMFEmitter with alpha-sorted output (Task 0.2)

initBMFEmitter / addMeasure / emit per design doc section 2.1.

- addMeasure stores via Table[slug, Table[measure, MeasureValue]];
  NaN sentinel for lower/upper bounds maps to none[float](),
  finite values map to some(v).
- emit writes JSON with slugs alpha-sorted at the top level and
  measures alpha-sorted within each slug. lower_value/upper_value
  fields are emitted only when the corresponding bound is some.
- Output format: pretty(root, 2) plus a trailing newline so
  diffs in CI / local-dev surface line-level changes cleanly.

Tests (tests/t_bench_common.nim): empty emit produces {}, two
slugs alpha-ordered, measures alpha-ordered within slug, NaN bounds
omitted from output, finite bounds emitted as lower_value/upper_value.
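The emission rules in this commit (alpha-sorted slugs and measures, bounds written only when present) can be mimicked in Python, where `json.dumps(..., sort_keys=True)` gives the same ordering guarantee. A sketch, not the Nim API — `emit_bmf` and its tuple input shape are illustrative:

```python
import json, math

def emit_bmf(measures):
    # measures: {slug: {measure: (value, lower, upper)}}; NaN bounds are
    # the "no bound" sentinel and are omitted from the output.
    root = {}
    for slug, ms in measures.items():
        root[slug] = {}
        for name, (value, lower, upper) in ms.items():
            entry = {"value": value}
            if not math.isnan(lower):
                entry["lower_value"] = lower
            if not math.isnan(upper):
                entry["upper_value"] = upper
            root[slug][name] = entry
    # sort_keys=True alpha-sorts slugs and, within each slug, measures;
    # the trailing newline keeps CI diffs line-clean.
    return json.dumps(root, indent=2, sort_keys=True) + "\n"

out = emit_bmf({
    "b/spsc/1p1c": {"latency_p50_ns": (250.0, math.nan, math.nan)},
    "a/spsc/1p1c": {"throughput_ops_ms": (1234.5, 1200.0, 1270.0)},
})
```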

* feat(bench): implement Stats helpers (Task 0.4)

mean / stddev / minVal / maxVal / percentile bodies. Sort-based
percentile uses linear interpolation between adjacent samples;
matches numpy's default percentile interpolation ('linear').
stddev uses ddof=1 (sample) for consistency with numpy default.
Empty-input behavior matches the legacy benchmarks/nim/stats.nim
that this module replaces (returns 0.0 across the board).

Tests: numpy-baseline parity for stddev on [1,2,3,4]; min/max on
unordered input; percentile(p=0.5) on [0..99] = 49.5; percentile
extremes (p=0.0, p=1.0) match min and max.
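The sort-based percentile with 'linear' interpolation described above reads roughly as follows in Python (a sketch, not the Nim implementation), including the documented empty-input behavior:

```python
def percentile(samples, p):
    # Sort-based percentile with linear interpolation between adjacent
    # samples, matching numpy's default 'linear' method; p is in [0, 1].
    if not samples:
        return 0.0  # empty-input behavior matches the legacy stats module
    xs = sorted(samples)
    rank = p * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    frac = rank - lo
    return xs[lo] + (xs[hi] - xs[lo]) * frac

assert percentile(list(range(100)), 0.5) == 49.5   # documented baseline case
assert percentile([3.0, 1.0, 2.0], 0.0) == 1.0     # p=0.0 is the min
assert percentile([3.0, 1.0, 2.0], 1.0) == 3.0     # p=1.0 is the max
```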

* feat(bench): implement Histogram top-K + reservoir percentile (Task 0.3)

Online percentile tracker per design doc section 2.1.

- topKHeap: min-heap of HistogramTopK = 1000 largest values seen.
  record() pushes if heap < K; else replace if value > heap[0]
  (evicts the min into the body bucket via reservoirAdmit).
- reservoir: Algorithm R uniform sample of body distribution, size
  HistogramReservoir = 99_000.
- debugMode (initHistogram(true)): records every sample to debugAll
  and percentile() returns the exact sort answer over debugAll. Used
  as the test oracle and as a debug-only fallback per design 2.1.

Lookup rule with stratified-estimator math:
  p == 1.0                          -> top-K max
  seenAll*(1-p) <= topKHeap.len     -> top-K rank lookup (linear interp)
  otherwise                         -> reservoir, with p rescaled to
                                       p_body = (p*seenAll) / (seenAll - K)
                                       since the reservoir samples ONLY the
                                       body stratum (values that lost the
                                       top-K race). Without rescaling a
                                       reservoir-based p99 underestimates
                                       by ~25% on log-normal data.
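A simplified Python model of the lookup rule above, using nearest-rank lookups and an exact, fully sorted body list in place of the reservoir — so it only illustrates the stratified rescaling, not the sampling estimator itself:

```python
def stratified_percentile(top_k_sorted, body_sorted, seen_all, p):
    # top_k_sorted: the K largest values seen (ascending).
    # body_sorted: the body stratum (here exact; the real code samples it).
    k = len(top_k_sorted)
    if p == 1.0:
        return top_k_sorted[-1]
    if seen_all * (1.0 - p) <= k:
        # Rank falls inside the top-K stratum, which holds overall
        # ranks seen_all-k .. seen_all-1.
        rank = p * seen_all - (seen_all - k)
        return top_k_sorted[min(int(rank), k - 1)]
    # The body holds only values that lost the top-K race, so rescale:
    # p_body = (p * seen_all) / (seen_all - k).
    p_body = (p * seen_all) / (seen_all - k)
    idx = min(int(p_body * (len(body_sorted) - 1)), len(body_sorted) - 1)
    return body_sorted[idx]

values = list(range(10_000))   # exact data so the sketch is checkable
top_k = values[-1000:]         # the 1000 largest
body = values[:-1000]          # everything that lost the top-K race
```

Without the rescaling step, indexing the body with the raw `p` would answer a question about the body's own distribution and systematically underestimate high percentiles, as noted above.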

Tests: p99 within 1% of sort fallback on 100K log-normal samples;
percentile(1.0) == top-K max; p50 within 5% of sort fallback;
debug mode returns exact sort answer; topK() returns ascending-sorted
samples.

* feat(bench): implement runThroughputHarness (Task 0.5)

Generic N-producer / N-consumer throughput runner factored out of
the legacy benchmarks/nim/bench_throughput.nim. Parameterized over the
queue init proc and the queue type Q (constrained at instantiation
time via 'mixin push' / 'mixin pop' so adapter-specific push/pop are
resolved at the call site).

The remainder-spread deadlock-avoidance rule is preserved verbatim:
if messageCount mod numProducers != 0, the leading producers each get
one extra item so the producer-side total exactly equals messageCount;
same for consumers. Without this, the harness silently degrades into
a deadlock (consumers waiting forever) or strands items in the queue
(consumers stop early).
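The remainder-spread rule is small enough to state directly; a Python sketch (`spread` is a hypothetical name, not the Nim proc):

```python
def spread(message_count, workers):
    # Leading workers each take one extra item when the count does not
    # divide evenly, so the per-worker totals sum exactly to message_count.
    base, extra = divmod(message_count, workers)
    return [base + 1 if i < extra else base for i in range(workers)]

counts = spread(100_003, 4)
assert counts == [25_001, 25_001, 25_001, 25_000]
assert sum(counts) == 100_003
```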

Returns ThroughputMetrics{ops_ms_mean, ops_ms_stddev, runs}; warmup
runs are discarded. Smoke test in tests/t_bench_common.nim with a
1024-cap channel SmokeAdapter at 1P/1C × 1000 messages × 1 run.

* feat(bench): implement runLatencyHarness (Task 0.6)

Ping-pong RTT runner. For each run:
- allocates two queues fwd + rev of type Q via queueInit()
- spawns 1 pinger + 1 ponger thread
- pinger pushes a monotonic-ns timestamp via fwd, spins on rev for
  the echo, records RTT (ns) into a per-run Histogram
- ponger pops from fwd, pushes the same value to rev

Aggregation per design 2.5 'Run aggregation for latency': reported
percentiles are the MEAN across runs of each percentile, NOT the
percentile of unioned samples. Per-run histograms isolate scheduler
artifacts to the run that contains them.
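A toy contrast of the two aggregation rules, using max (p100) as the percentile so the arithmetic stays visible; the run data is synthetic:

```python
# Run 1 is clean; run 2 contains one scheduler spike.
run1 = [100.0] * 99 + [120.0]
run2 = [100.0] * 99 + [9000.0]

# Design 2.5 rule: mean across runs of each run's percentile.
mean_of_per_run = (max(run1) + max(run2)) / 2   # 4560.0 — spike diluted by run count

# The rejected alternative: percentile of the unioned samples.
union_percentile = max(run1 + run2)             # 9000.0 — spike dominates outright
```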

Returns LatencyMetrics{p50_ns, p95_ns, p99_ns, p999_ns, max_ns,
samples}. samples is the cumulative across-runs sample count.

This task ships the 1P/1C smoke topology only; multi-thread sharding
lands in PR 1 (Task 1.x). 1P/1C is sufficient for the smoke check
because PR 1's first BMF measure uses Sipsic 1P/1C as the smoke shape.

Test: SmokeAdapter (channel-backed), 1000 messages, 1 run, 0 warmup;
asserts samples >= 1000 and p50 < p99 < max.

* feat(bench): add merge_bmf.py BMF JSON merge utility (Task 0.7)

Stateless union of per-slug measure dicts across N input BMF JSON
files into one output JSON, per design doc section 4.3. Required
because Bencher.dev creates a separate Report per 'bencher run'
invocation; multiple uploads from sibling CI jobs would NOT co-locate
multiple measures on a single per-slug history.

Contract:
  merge_bmf.py <output_path> <input1> [input2 ...]

- Slug regex: ^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$
- Measure regex: ^[a-z][a-z0-9_]*$
- Value fields ('value', 'lower_value', 'upper_value') must be finite,
  non-NaN floats. Any malformed input or non-finite value -> exit 1.
- Collision (same slug + same measure across two inputs) -> exit 1
  with stderr 'error: collision on slug=<slug> measure=<measure> in
  files <pathA> and <pathB>'.
- Output is alpha-sorted at both top level (slugs) and within each
  slug (measures), 2-space indent.
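The two validation regexes from the contract above can be exercised directly in Python:

```python
import re

# Regexes copied from the merge_bmf.py contract above.
SLUG_RE = re.compile(r"^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$")
MEASURE_RE = re.compile(r"^[a-z][a-z0-9_]*$")

assert SLUG_RE.match("lockfreequeues_sipsic/spsc/1p1c")
assert not SLUG_RE.match("Lockfreequeues/spsc/1p1c")    # uppercase rejected
assert not SLUG_RE.match("lockfreequeues_sipsic/spsc")  # missing NpMc segment
assert MEASURE_RE.match("latency_p50_ns")
assert not MEASURE_RE.match("Latency_P50")              # uppercase rejected
```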

Tests in benchmarks/tests/test_merge_bmf.py cover all 8 design-spec
cases plus an alpha-sort assertion: zero-input usage error, single
empty pass-through, single valid round-trip, two disjoint slugs,
shared slug + disjoint measures, collision, NaN rejection, invalid
measure key (uppercase), invalid slug shape.

* feat(bench): add 5 missing lockfreequeues adapters (Task 0.8)

Adds the 5 adapters not yet present per design doc 2.2:
- lockfreequeues_sipmuc_adapter.nim   (bounded SPMC)
- lockfreequeues_mupsic_adapter.nim   (bounded MPSC)
- lockfreequeues_unbounded_sipsic_adapter.nim
- lockfreequeues_unbounded_sipmuc_adapter.nim
- lockfreequeues_unbounded_mupmuc_adapter.nim

Each adapter file exports:
- 'topologiesSupported: set[Topology]' constant (consumed by PR 3 Task
  3.11 to restrict shape generation; declared but unused in PR 0)
- 'make<Lib>Adapter' factory taking 'capacity: int'
- 'cleanup' tear-down proc
- 'push(item)' returning bench_common.PushResult
- 'pop()' returning bench_common.PopResult[T]

Smoke test in tests/t_bench_common.nim: each adapter pushes 100
sequential uint64s, pops them, asserts count == 100 AND set-equality
with the input range.

Implementation note (Nim 2.2.6 codegen workaround):
The unbounded adapters store the queue INLINE rather than via
heap pointer. Heap-allocating UnboundedSipsic/UnboundedSipmuc/
UnboundedMupmuc and then calling reset() or =destroy() on the
referent from a generic adapter cleanup proc triggers
'internal error: getTypeDescAux(tyGenericParam)' OR
'internal error: expr(nkIdent); unknown node kind' in Nim 2.2.6
when bench_common is also imported. Inline storage lets the
compiler emit the per-instance =destroy automatically when the
adapter goes out of scope, sidestepping the bug. The DEBRA
manager is still heap-allocated because the queue constructors
take 'ptr DebraManager'. Track upstream Nim repro for follow-up.

* PR 0 Task 0.9: rename existing adapters with `_adapter` suffix and export `topologiesSupported`

Per design section 2.2 (`<library_slug>_adapter.nim` filename convention):

- Rename via `git mv` (history preserved):
  - lockfreequeues_sipsic.nim -> lockfreequeues_sipsic_adapter.nim
  - lockfreequeues_mupmuc.nim -> lockfreequeues_mupmuc_adapter.nim
  - lockfreequeues_unbounded_mupsic.nim -> lockfreequeues_unbounded_mupsic_adapter.nim

- Update bench_throughput.nim import paths to match.

- Add `topologiesSupported*: set[Topology]` constant to all four
  existing adapters (3 renamed + channels_adapter). Use selective
  `from ../bench_common import Topology, <member>` to avoid pulling
  PushResult into scope, which would conflict with adapter.PushResult
  (the legacy concept in ../adapter still imported by these adapters).

- `topologiesSupported` is exported in PR 0 but consumed only by PR 3
  Task 3.11; the "unused export" hint is expected per the impl plan.

* PR 0 Task 0.10: native BMF emission via --bmf-out= flag in bench_throughput

Add --bmf-out=<path> flag (parsed via std/parseopt) that emits Bencher
Metric Format JSON for every variant that ran. With the flag absent the
binary is bit-for-bit unchanged; the existing positional CLI
(`bench_throughput sipsic mupmuc unbounded_mupsic channels`) is
preserved verbatim.

Slug shape per design 2.2:

  lockfreequeues_sipsic/spsc/1p1c
  lockfreequeues_mupmuc/mpmc/{1,2,4,8}p{1,2,4,8}c
  lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c
  nim_channels/mpmc/{1,2,4}p{1,2,4}c

Each slug carries the `throughput_ops_ms` measure as
{value: mean, lower_value: mean-stddev, upper_value: mean+stddev} —
matching the convention bmf_adapter.py used (which this code replaces
in Task 0.12).
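How one measure entry is assembled can be sketched in Python (illustrative values and helper name; field names follow the Bencher Metric Format):

```python
import statistics

def bmf_measure(samples_ops_ms):
    """Build the {value, lower_value, upper_value} triple from run samples."""
    mean = statistics.mean(samples_ops_ms)
    stddev = statistics.stdev(samples_ops_ms) if len(samples_ops_ms) > 1 else 0.0
    return {
        "value": mean,
        "lower_value": mean - stddev,
        "upper_value": mean + stddev,
    }

# One fragment shaped like bench_throughput's SPSC smoke slug.
fragment = {
    "lockfreequeues_sipsic/spsc/1p1c": {
        "throughput_ops_ms": bmf_measure([9800.0, 10000.0, 10200.0]),
    }
}
```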

Note on slug for built-in channels: the human-readable adapter name()
returns "nim/channels" (kept for stdout text), but the slug uses
underscore form `nim_channels` so it satisfies the merge_bmf.py
SLUG_RE pattern `^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$`.

Per-variant {.intdefine.}s for runs/warmup added so the integration
test can build a fast variant via -d:BenchSipsicRuns=2
-d:BenchSipsicWarmup=0. Default values (10 runs, 0 warmup) match the
prior hard-coded behavior.

Selective `from ./bench_common import` is used to pull only
{BMFEmitter, initBMFEmitter, addMeasure, emit} — importing the full
module would shadow `results.ThroughputMetrics` with
`bench_common.ThroughputMetrics` and break the legacy bespoke harness
procs that still drive Mupmuc and UnboundedMupsic.

Test: tests/t_bench_common.nim adds an integration test that compiles
bench_throughput with small overrides, runs it with --bmf-out=<tmp>
sipsic, parses the resulting JSON, and asserts the expected slug and
measure exist with a positive value. Stdout text output is also
asserted (`Sipsic` substring) so the legacy CLI contract is exercised.

* PR 0 Task 0.11: bench.yml — native BMF + merge step, drop legacy parser

Apply the five edits from design section 3 (PR 0):

(a) Remove `Run adapter unit tests` step. The Python regex parser it
    sanity-checked is gone in Task 0.12; nothing left to test.
(b) Remove `Convert to BMF JSON` step. bench_throughput now emits BMF
    natively via `--bmf-out=throughput.json` (Task 0.10).
(c) Update `Run bench_throughput` step to add the `--bmf-out=throughput.json`
    flag while preserving the existing positional variant filter and the
    20-minute step timeout.
(d) Insert `Merge BMF JSON` step running
    `python3 benchmarks/merge_bmf.py merged.json throughput.json`.
    Single-input today, but stays in place for PR 2-4 when bench is
    split into per-topology binaries each producing its own fragment.
(e) Change all three `bencher run` invocations and the debug-cat step
    to use `--file merged.json` instead of `bench_results.json`.

actionlint clean. The bench upload contract is unchanged: same
project, branch, testbed, thresholds, comparison semantics — only the
file source flips from regex-parsed text to native BMF.

* PR 0 Task 0.12: delete legacy bmf_adapter.py + bench_main.nim, refactor consumers

Three legacy files removed via `git rm`:

- `benchmarks/bmf_adapter.py` — regex parser for bench_throughput stdout.
  Replaced by native BMF emission via `--bmf-out=` (Task 0.10).
- `benchmarks/test_bmf_adapter.py` — unit tests for the parser. Replaced
  by `benchmarks/tests/test_merge_bmf.py` (Task 0.7).
- `benchmarks/nim/bench_main.nim` — aggregator binary that wrapped
  bench_throughput + bench_latency and emitted a custom JSON shape.
  The canonical entry point is now bench_throughput, which emits BMF
  directly; bench_latency follows in PR 1.

Live consumers refactored in this same task (per impl plan §0.12):

- `benchmarks/runner.py`: build_nim now compiles bench_throughput.nim;
  run_nim invokes `bench_throughput --bmf-out=<path>` instead of
  `bench_main --runs=<n> -o=<path>`. The `runs` argument is preserved
  in the CLI for back-compat but is now compile-time-only.
- `lockfreequeues.nimble`: `task benchmarks` switched to bench_throughput.
  Output path corrected to `.tmp/bench_throughput` (matching the project
  nim.cfg `--outdir:.tmp`).
- `benchmarks/README.md`: rewritten to document the new flow
  (bench_common module, adapter convention, merge_bmf.py, native BMF
  emission, expected slug set).
- `.gitignore`: drop the `bench_main` build-artifact ignore line.

`git grep "bmf_adapter\|bench_main"` returns only intentional historical
references in code comments and CHANGELOG (prior release notes).

Acceptance: `nimble tasks` parses; `nim c benchmarks/nim/bench_throughput.nim`
compiles; tests/t_bench_common.nim passes all 26 tests including the
Task 0.10 --bmf-out integration test.

* PR 0 Task 0.13: render_readme.nim consumes new BMF shape

The legacy bench_main aggregator emitted JSON of shape
`{results: [{implementation, thread_config, metrics: {...}}], metadata: {...}}`.
That aggregator was deleted in Task 0.12, replaced by `bench_throughput
--bmf-out` (Task 0.10) + `merge_bmf.py` (Task 0.7), which produce
Bencher Metric Format JSON of shape
`{<lib>/<topology>/<P>p<C>c: {<measure>: {value, lower_value?, upper_value?}}}`.

`benchmarks/render_readme.nim` now walks the BMF shape directly:

- New `parseSlug(slug)` helper splits `<lib>/<topology>/<P>p<C>c` into
  `(impl, "<P>P/<C>C")` for the two table columns. Strict on shape:
  rejects slugs with the wrong segment count or non-digit P/C, so
  malformed entries do not produce noise rows.
- `loadRows` walks every top-level slug, pulls
  `<slug>.throughput_ops_ms.value` for the throughput column, and
  `<slug>.latency_p50_ns.value` for the optional latency column
  (PR 1+ emits this; PR 0 does not, so all PR 0 rows show "—").
- BMF has no `metadata` block, so the meta argument is always nil.
  `renderTable` already accepts nil and skips the platform header in
  that case — no change needed.
- `when isMainModule` block guarded by `not defined(renderReadmeNoMain)`
  so unit tests can include this file without triggering the README
  rewrite. CLI invocations (`nim r benchmarks/render_readme.nim`) leave
  the define unset and run main as before.
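The strict slug split can be sketched in Python (the real helper is Nim in render_readme.nim; the regex and return shape here mirror the description above, names are illustrative):

```python
import re

# <lib>/<topology>/<P>p<C>c, all segments lowercase snake_case.
SLUG_RE = re.compile(
    r"^(?P<lib>[a-z][a-z0-9_]*)/(?P<topo>[a-z][a-z0-9_]*)/(?P<p>\d+)p(?P<c>\d+)c$")

def parse_slug(slug):
    """Split a BMF slug into (impl, 'PP/CC'), or None on bad shape."""
    m = SLUG_RE.match(slug)
    if m is None:
        return None  # wrong segment count / non-digit P or C -> no noise row
    return m.group("lib"), f"{m.group('p')}P/{m.group('c')}C"
```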

Test: tests/t_render_readme.nim (new) writes a fixture BMF JSON with
three slugs (mupmuc/2p2c, sipsic/1p1c, nim_channels/1p1c), runs
`loadRows` and `renderBlock`, and asserts:
- 3 rows produced; impl/threads/throughput parsed correctly per slug
- alpha sort by impl preserved
- meta == nil (BMF carries no metadata)
- rendered Markdown table contains every impl name and value
- latency column is em-dash for every row (no `latency_p50_ns` in fixture)
- missing JSON file falls back to placeholder
- empty `{}` BMF document falls back to placeholder

All 4 new tests + 26 existing t_bench_common tests pass.

* PR 0 Task 0.14: CHANGELOG entry under [Unreleased] for bench-rollup PR 0

Document Track 0 of the bench-rollup feature under [Unreleased]:

- Added: bench_common.nim module, 5 new lockfreequeues adapters,
  merge_bmf.py CLI, --bmf-out flag, per-variant run-count intdefines.
- Changed: bench_throughput emits BMF natively, 3 adapters renamed to
  the <library_slug>_adapter.nim convention, render_readme.nim consumes
  the new BMF shape, runner.py / nimble task / benchmarks/README.md
  point at bench_throughput.
- Removed: bmf_adapter.py, test_bmf_adapter.py, bench_main.nim.

Each entry is one bullet (one logical change), no test plan checklist,
no GitHub issue references.

* fix(bench): plug heap-allocated queue leaks in latency and throughput harness

Per Gemini review on PR #19. Adapters whose Q heap-allocates the
backing queue (lockfreequeues bounded mupsic/sipmuc/mupmuc, channels)
leaked the queue every time runOneLatencyRun or runOneThroughputRun
returned: `var fwd = queueInit()` only frees the adapter struct's
stack slot at scope exit, not the create()-d Mupsic/Sipmuc/Mupmuc/
Channel pointee. Across N runs * M warmups the leak grew without bound.

Three-part fix:

1. bench_common.nim: harness now drops the queue via a uniform
   `mixin cleanup; when compiles(cleanup(q)): cleanup(q)` block after
   joinThread (and before any return path, including the elapsedNs<=0
   short-circuit in runOneThroughputRun). `when compiles` keeps
   value-type adapters (sipsic, unbounded inline storage) usable
   without forcing them to define a no-op cleanup.

2. lockfreequeues_mupsic / sipmuc cleanup procs: prepend reset(a.queue[])
   so Mupsic/Sipmuc's `=destroy` actually runs before the heap slot
   is freed. Bare dealloc skipped any queue-internal cleanup that an
   ARC/ORC-generated destructor would otherwise do (managed T fields,
   future atomic-cleanup hooks). Same fix on the mupmuc adapter inside
   deinitMupmucAdapter.

3. mupmuc and channels adapters gain a `cleanup` alias around their
   existing deinit*Adapter procs so the harness's mixin cleanup call
   resolves uniformly across adapter naming conventions.

Verified by smoke build of `runLatencyHarness` + `runThroughputHarness`
against LockfreequeuesMupsicAdapter; no leak-on-return, percentiles
and ops/ms reported as expected.

* fix(bench): destroy queue before manager; guard zero-duration throughput

Two fixes flagged by code review on PR 0:

1. Unbounded MPMC adapter cleanup() freed the DebraManager while the
   inline queue still held a pointer to it. The queue's =destroy then
   called unbindClient(self.manager[]) on freed memory. Reset the queue
   first (running its destructor while the manager is still valid),
   then dealloc the manager. Applied to both unbounded_mupmuc and
   unbounded_sipmuc adapters.

2. The legacy bench_throughput run procs computed
   messageCount/elapsedNs without guarding against elapsedNs <= 0,
   which can occur on fast CI runners when an entire run completes
   within one tick of a coarse-grained clock. A zero-duration
   measurement produces inf, which merge_bmf.py rejects via
   math.isfinite. Return 0.0 in that case so the sample is recorded
   as a valid (degenerate) value.

* fix(bench): guard zero-thread harness; expose adapter queues; drop non-finite BMF

Three further code-review fixes on PR 0:

1. runOneThroughputRun would crash with DivByZeroDefect on
   numProducers==0 / numConsumers==0 because the workload-distribution
   step does messageCount div N. The harness is exported, so a future
   caller could pass 0 thinking 'no producers' is a degenerate but
   valid shape. Guard with doAssert + a clear message instead of
   letting the worker thread crash with no context. Also assert
   messageCount >= 0.

2. lockfreequeues_mupsic_adapter and lockfreequeues_sipmuc_adapter
   stored a single Producer/Consumer at construction time but kept the
   underlying queue pointer non-exported, so multi-thread shapes
   couldn't actually register per-thread producers/consumers as the
   adapter docstrings already described. Export the queue field
   (used by today's bench_throughput multi-producer code path) and
   add explicit getProducer*/getConsumer* accessor procs that wrap
   queue.getProducer(idx)/getConsumer(idx) for harness consumers.

3. Defense in depth on the BMF emitter. The PR previously added
   elapsedNs <= 0 guards to the legacy bench_throughput run procs;
   this pass also rejects non-finite values inside addMeasure
   itself. If a future code path produces inf/NaN despite the
   upstream guard, the offending (slug, measure) entry is silently
   dropped instead of being written to the BMF, where merge_bmf.py
   would reject it and fail the entire upload. The same
   classification is widened in toBound for lower/upper bounds.

* perf(bench): bulk-percentile to avoid 5x reservoir sort per latency run

Histogram.percentile() previously sorted the reservoir (up to 99k
samples) on every call, so each latency run re-sorted the same data
five times for p50/p95/p99/p999/max. Add Histogram.percentiles() that
takes an array of probabilities and computes them all from a single
sorted snapshot. Refactor percentile(openArray[float]) through a new
percentileSorted() helper so the sort/lerp can be reused without a
fresh allocation. Update runOneLatencyRun to use the bulk API.
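The single-sort bulk API amounts to the following Python sketch (linear interpolation between ranks; the Nim helper's exact lerp may differ):

```python
def percentiles(samples, probs):
    """Sort once, then read every requested percentile off that snapshot."""
    snapshot = sorted(samples)  # one O(n log n) sort for the whole batch
    n = len(snapshot)
    out = []
    for p in probs:
        pos = p * (n - 1)       # fractional rank into the sorted snapshot
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(snapshot[lo] + (snapshot[hi] - snapshot[lo]) * frac)
    return out

# One sort serves all five probabilities instead of five sorts.
p50, p95, p99, p999, pmax = percentiles(
    range(1, 101), [0.50, 0.95, 0.99, 0.999, 1.0])
```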

* fix(render_readme): bump row cap; restore platform/cores/timestamp line

- Bump the row cap from 8 to 32. The bench matrix today already exceeds
  8 (1 sipsic + 4 mupmuc + 3 unbounded mupsic + 3 channels = 11) and PR
  3 / PR 4 add comparison adapters; the old cap silently truncated the
  table. PR 5 replaces this auto-rendered block with a hand-curated
  4-row summary, so the higher cap is a transitional safeguard.
- Reconstruct platform / cores / timestamp metadata at render time
  instead of pulling from JSON. BMF has no metadata block, so the
  per-table metadata line was dropping silently in the legacy ->
  BMF migration. detectMetadata() falls back to runtime detection
  (hostOS/hostCPU, countProcessors, now()) and honors three env vars
  (BENCH_RENDER_PLATFORM / _CORES / _TIMESTAMP) for devs re-rendering
  historical snapshots from a different host. renderTable still honors
  any meta JsonNode caller passes through (forward compatibility for
  a future BMF-with-metadata extension).
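The per-field fallback order can be sketched as a Python stand-in for the Nim detectMetadata (env var names are from this commit; the runtime-detection calls and output format are illustrative assumptions):

```python
import datetime
import os
import platform

def detect_metadata():
    """Env override first, runtime detection second, independently per field."""
    return {
        "platform": os.environ.get("BENCH_RENDER_PLATFORM")
                    or f"{platform.system().lower()}/{platform.machine()}",
        "cores": os.environ.get("BENCH_RENDER_CORES")
                 or str(os.cpu_count()),
        "timestamp": os.environ.get("BENCH_RENDER_TIMESTAMP")
                     or datetime.datetime.now().isoformat(timespec="seconds"),
    }
```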

* ci: pin sibling-dep clones to nim-debra v0.6.0 / nim-typestates v0.7.0

Both build.yml and bench.yml clone the matching siblings under
`../nim-debra` and `../nim-typestates` so nim.cfg's `--path:` directives
resolve. They were cloning `--branch main`, which floats: a downstream
breaking change in either repo would silently propagate into our CI
without the corresponding `requires` bump in lockfreequeues.nimble.

Switch to specific release tags. Bumps now require touching three
places in lockstep (this comment, the `requires` line, the matching
release tag), which is the right cost for a dep update.

Drop the stale "Once debra 0.3.0 publishes…" comment — debra is at
v0.6.0 and typestates at v0.7.0, both released and registered.

* ci(bench): cost gate — drop feat/** PR base; require `bench` label

Two compounding gates so cascade-rebases of stacked feature PRs do not
burn Bencher.dev metric-ingest quota:

  A) `pull_request: branches:` is reduced to [main, devel]. Stacked
     feature-to-feature PRs (`feat/foo` -> `feat/bar`) no longer
     trigger this workflow at all. Only the topmost PR in any stack
     (the one whose base is devel/main) gets PR-time bench treatment.

  B) The `Track PR benchmarks with Bencher` step is now gated on a
     `bench` label being present on the PR. Without the label the step
     is skipped even on PRs targeting devel; the bench binaries still
     compile and run (free GHA minutes, useful for stdout-level
     spot-checks), but no `bencher run` upload fires. Add the label
     deliberately when you want a comparison report on a revision.

The preflight step is renamed and extended so the check summary
displays a clear "skipped because no `bench` label" warning when B is
the gate, paralleling the existing missing-token branch.

Push to main/devel (baseline upload) and workflow_dispatch (manual
upload) are unchanged.

* docs(bench): align Quick Start CI command with current 1M/500k budget

* ci(bench): wire t_bench_common into nimble benchtests + CI job

Adds a 'benchtests' nimble task that compiles/runs the bench harness
tests, which live outside srcDir and are intentionally not imported by
tests/test.nim (avoids forcing every entry of the MM/sanitizer test
matrix to compile bench harness threading machinery).

Adds a 'bench-tests' job to bench.yml that invokes 'nimble benchtests'
on every PR and push. The job has no secret dependencies, so fork PRs
are allowed; only the github-actions[bot] loop-back guard is kept.

Sibling Nim deps pin to v0.6.0 (debra) and v0.7.0 (typestates), matching
the existing 'benchmark' job in this workflow.

* PR 1 Task 1.1: BenchLatencyRuns/BenchLatencyMessageCount intdefines

Exposes per-binary {.intdefine.} overrides for latency wall-time
control: BenchLatencyRuns (default 33) and BenchLatencyMessageCount
(default 100_000). Mirrors the BenchSipsicRuns / MessageCount pattern
in bench_throughput.nim. Tests in tests/t_bench_latency.nim assert
defaults compile cleanly and that -d:BenchLatencyRuns=2
-d:BenchLatencyMessageCount=1000 overrides take effect via
compile-time static blocks.

Also fixes a stale import (./adapters/lockfreequeues_sipsic ->
./adapters/lockfreequeues_sipsic_adapter) so the file compiles after
PR 0 Task 0.9 renamed all bench adapters with the _adapter suffix.
The rest of the file (legacy runLatencyBenchmark / benchmarkLatency /
isMainModule) is preserved verbatim and gets fully rewritten in
Task 1.2 onto runLatencyHarness.

* PR 1 Task 1.2: rewrite bench_latency.nim onto bench_common.runLatencyHarness

Replaces the legacy ping-pong RTT body with a CLI mirroring
bench_throughput's: positional variant filter (sipsic / mupmuc /
sipmuc / mupsic), --bmf-out=<path> for native Bencher Metric Format
JSON emission, stdout text preserved. Per-binary run-shape
intdefines wired in: BenchLatencyRuns / BenchLatencyMessageCount /
BenchLatencyWarmupRuns drive runLatencyHarness directly.

Slug shape per design 2.2 (table at design line 357), 1p1c smoke
shape only — full grid lands in PR 2:
- lockfreequeues_sipsic/spsc/1p1c
- lockfreequeues_sipmuc/mpmc/1p1c
- lockfreequeues_mupsic/mpsc/1p1c
- lockfreequeues_mupmuc/mpmc/1p1c

Each carries latency_p50_ns / latency_p95_ns / latency_p99_ns
measures (p999 / max deferred to PR 6 per impl plan).

Type unification: PushResult / PopResult were duplicated across
benchmarks/nim/adapter.nim (legacy) and benchmarks/nim/bench_common.nim
(new), forcing two nominally distinct enums to coexist. Aliasing
adapter.PushResult / PopResult to bench_common's via re-export
unifies the surface — both adapter packs (legacy sipsic / mupmuc and
the newer sipmuc / mupsic / unbounded_*) now flow through the same
runLatencyHarness and runThroughputHarness without per-call-site
conversion. Verified: t_bench_common (26 tests) and bench_throughput
--bmf-out integration unchanged.

The Mupmuc / Sipsic adapters use getProducer(idx = 0) /
getConsumer(idx = 0), which bypasses Mupmuc's threadId-based
registration, so the pre-built producer / consumer slots are safe to
drive from either the harness's pinger or ponger thread (Mupmuc's
underlying Vyukov per-slot CAS protocol is fully concurrent-safe).

Tests: tests/t_bench_latency.nim adds three Task 1.2 cases covering
the sipsic single-variant happy path, the four-variant slug coverage
required by Track 1 Acceptance Criteria, and the unknown-variant
exit-1 path. All pass with -d:BenchLatencyMessageCount=200
-d:BenchLatencyRuns=2 in under 10s on Apple silicon.

* PR 1 Tasks 1.3 + 1.4: bench-latency job + N-input merge step in bench.yml

Restructures bench.yml from one all-in-one bench-throughput job into
three jobs:

  bench-throughput  (renamed from 'benchmark'):
    builds + runs bench_throughput, uploads bench-throughput-bmf artifact.
  bench-latency     (NEW):
    builds + runs bench_latency, uploads bench-latency-bmf artifact.
    Compile-time overrides: BenchLatencyMessageCount=50000,
    BenchLatencyRuns=11, BenchLatencyWarmupRuns=2 (chosen so the four
    1p1c smoke-shape variants finish in single-digit seconds inside the
    12-min job budget).
  bench-upload      (NEW, needs both):
    downloads bench-*-bmf artifacts via actions/download-artifact@v4
    pattern, runs python3 benchmarks/merge_bmf.py merged.json
    $(ls bmf-inputs/*.json), then runs the existing Bencher CLI upload
    flows (PR comparison / push baseline / workflow_dispatch). One
    bencher run invocation -> one Bencher Report -> latency + throughput
    measures co-located on the shared per-slug history (design 1).

Bencher upload commands are unchanged in spirit: same PR / push /
workflow_dispatch gating, same --threshold-measure throughput on push
baselines, same yellow-warning preflight when BENCHER_API_TOKEN is
unset on forks. This commit does NOT correct the historical
`--threshold-measure throughput` -> `throughput_ops_ms` rename
deferred from PR 0 Task 0.10; PR 6 (Task 6.2) handles per-measure
threshold blocks in the same edit.

Plan reference: impl plan Track 1 Tasks 1.3 (line 350-357) and 1.4
(line 359-366). The two tasks share the same file and are tightly
coupled (1.4 strictly depends on 1.3); folding them into one commit
keeps the workflow file consistent at every commit boundary.

actionlint clean. SC2046 word-splitting on $(ls ...) is intentional
(each path must arrive as a separate positional argv to merge_bmf.py)
and is suppressed inline with a justification comment; artifact
filenames are repo-controlled (no whitespace / glob chars).

* PR 1 Task 1.5: smoke test for multi-measure-per-slug merge

Adds two cases to tests/t_bench_latency.nim that validate the
end-to-end shape Track 1 ships: a single slug carries BOTH
`throughput_ops_ms` (from bench_throughput's BMF fragment) and
`latency_p50_ns` / `latency_p99_ns` (from bench_latency) AFTER
`merge_bmf.py` unions the two fragments.

  1. Multi-measure union — writes synthetic throughput + latency
     fragments on the same slug, runs python3 benchmarks/merge_bmf.py,
     and asserts the merged JSON carries all three measures with their
     original values intact.
  2. Collision guard — re-declares `throughput_ops_ms` on the same
     slug across both inputs and asserts merge_bmf.py exits 1 with a
     'collision' message in stderr. This regression-tests PR 0 Task 0.7's
     collision guard against silent overwrites that would erase one of
     the colliding measures.

Both tests use createTempDir + defer removeDir for isolated, leak-free
fixtures and run in <200ms because they bypass the bench compile +
execute path entirely. Plan reference: impl plan Track 1 Task 1.5
(line 368-376).

* PR 1 Task 1.7: CHANGELOG entry under [Unreleased] for bench-rollup PR 1

Adds two bullets to the [Unreleased] Added section covering bench_latency
--bmf-out emission and the new three-job bench.yml structure
(bench-throughput / bench-latency / bench-upload). Adds a Changed
bullet documenting the adapter.nim PushResult / PopResult unification.

Plan reference: impl plan Track 1 Task 1.7 (line 386-393).

* fix(bench-latency): correct prefix on unknown-flag error; rename groups to variants

Per Gemini review on PR #20.

1. Unknown-flag error message hardcoded `--` regardless of cmdShortOption
   vs cmdLongOption. `-x` printed as `Unknown flag: --x`. Now picks
   the right prefix from p.kind.

2. `groups` is a leftover name from bench_throughput where positional
   args ARE topology groups. In bench_latency they're individual
   variants. Renamed for clarity.

* fix(bench): grant bench-upload contents/actions read; isolate artifact dirs

Two CI hardening fixes from code review on PR 1:

1. The bench-upload job overrode GITHUB_TOKEN permissions to only
   pull-requests:write + checks:write but then ran actions/checkout
   and actions/download-artifact, which need contents:read and
   actions:read. Add both so the job is robust against orgs that
   restrict default token permissions.

2. Switch from merge-multiple:true to per-artifact subdirectories.
   Today's artifact filenames (throughput.json, latency.json) are
   unique, but PR 2's topology split adds bench_spsc/mpsc/mpmc/
   unbounded binaries; each could end up writing files that collide
   under merge-multiple. Each artifact now lands at
   bmf-inputs/<artifact-name>/<file> and merge_bmf.py is invoked
   over the recursive find.

* fix(bench): print p999 in latency stdout for operator visibility

The harness already computes metrics.p999_ns; surfacing it during
manual runs costs nothing and matches the legacy bench_latency
output. BMF wire-up for latency_p999_ns / latency_max_ns is
deferred to PR 6 alongside threshold configuration.

* test(bench_latency): isolate test workspaces with createTempDir + ExeExt

Replace static /tmp/bench_latency_t1*_* paths with per-test
createTempDir workspaces (via newTestWorkspace helper) cleaned up by
defer: removeDir. Use ExeExt for the compiled binary name so the
suffix is correct on Windows. Reshape compileBenchLatency to take an
owned `dir` instead of a `suffix` so workspace lifecycle stays with
the test that allocates it.

Prevents collisions between parallel test runs / repeated runs in
the same shell, and stops binaries from accumulating in the system
temp dir.

* ci(bench-latency): pin sibling deps to v0.6.0/v0.7.0; wire t_bench_latency into benchtests

* PR 2 Task 2.1: capture pre-split slug-set fixture for deletion-safety

Snapshots the BMF emitted by the legacy bench_throughput at the same
small-override knobs the topology-split tests will use, then commits
the JSON to tests/fixtures/pre-split-slugs.json. Track 2's deletion-
safety check (Task 2.7) verifies that the post-split union is a strict
superset of this fixture so no slug silently disappears with the split.

Adds tests/t_topology_split.nim as the RED test stub: the fixture-shape
suite (Task 2.1) passes against the committed JSON; the bench_spsc /
bench_mpsc / bench_mpmc / bench_unbounded suites (Tasks 2.3-2.6) and
the superset suite (Task 2.7) fail loudly until those binaries +
the superset_check.py helper land.

Fixture covers 11 slugs:
  sipsic spsc/1p1c
  mupmuc mpmc/{1,2,4,8}p{1,2,4,8}c (8p8c is the issue #15 livelock case)
  unbounded_mupsic mpsc_unbounded/{1,2,4}p1c
  nim_channels mpmc/{1,2,4}p{1,2,4}c

* PR 2 Task 2.3: bench_spsc binary (Sipsic 1p1c)

Splits the SPSC slice out of the legacy bench_throughput.nim. Wraps
SipsicAdapter in a uniform queueInit closure and runs through
runThroughputHarness so this binary stays under 100 lines and shares
the per-run shaping (warmup, run-count, monotonic-ns timing,
remainder-spread distribution) with the rest of the topology split.

Per-binary intdefines (design 2.5):
  -d:BenchSpscRuns          (default 33)
  -d:BenchSpscMessageCount  (default 1_000_000)
  -d:BenchSpscWarmup        (default 3)

Behind -d:BenchSpscTestCompileTime the binary's `static` block
asserts the three defaults. The runtime + BMF emission contract is
covered by tests/t_topology_split.nim's bench_spsc suite (Task 2.3),
which compiles at 1000 messages * 2 runs and parses the emitted
lockfreequeues_sipsic/spsc/1p1c slug.

* PR 2 Task 2.4: bench_mpsc binary (Mupsic 1p1c, 2p1c, 4p1c)

Splits the MPSC slice out of the legacy bench_throughput.nim. Mupsic
requires per-thread Producer objects via queue.getProducer(idx); the
generic runThroughputHarness assumes a uniform queue.push call, so this
binary keeps the bespoke harness pattern from the legacy file:
pre-assign Producer values on the main thread (idx = 0..P-1) and ship
them into worker thread contexts. The shape-iteration loop instantiates
three concrete Mupsic types {1,2,4}p1c so P stays a static int.

Per-binary intdefines (design 2.5):
  -d:BenchMpscRuns          (default 33)
  -d:BenchMpscMessageCount  (default 1_000_000)
  -d:BenchMpscWarmup        (default 3)

Slug shape: lockfreequeues_mupsic/mpsc/{1,2,4}p1c. Behind
-d:BenchMpscTestCompileTime the binary's static block asserts the three
defaults. Runtime behavior covered by tests/t_topology_split.nim's
bench_mpsc suite.

* PR 2 Task 2.5: bench_mpmc binary (Mupmuc grid + 8p8c, Sipmuc, channels)

Splits the MPMC slice out of the legacy bench_throughput.nim into a
single binary covering three queue families. Mupmuc and Sipmuc both
need per-thread Producer / Consumer values via getProducer(idx) /
getConsumer(idx); the harness pre-assigns those on the main thread and
ships them into worker contexts (mirrors the bench_throughput Mupmuc
pattern). Channels has a uniform push/pop interface and runs through
bench_common.runThroughputHarness.

Coverage:
  Mupmuc {1,2,4} P x {1,2,4} C grid (9 shapes) plus 8p8c. 8p8c is the
    issue #15 livelock regression case preserved from the pre-split
    bench_throughput fixture so the deletion-safety check does not
    silently lose that coverage.
  Sipmuc 1p{1,2,4}c (single-producer, multi-consumer; lives under mpmc
    per design 2.4).
  channels {1,2,4} P x {1,2,4} C grid.

Per-binary intdefines (design 2.5):
  -d:BenchMpmcRuns          (default 33)
  -d:BenchMpmcMessageCount  (default 1_000_000)
  -d:BenchMpmcWarmup        (default 3)

Both lockfreequeues mupmuc and sipmuc export a `Consumer` symbol; the
contexts qualify with `mupmuc.Consumer` / `sipmuc.Consumer` to keep
type resolution unambiguous.

* PR 2 Task 2.6: bench_unbounded binary (4 unbounded variants)

Splits the unbounded slice out of the legacy bench_throughput.nim.
Covers all 4 lockfreequeues unbounded variants at their natural shapes:

  unbounded_sipsic spsc_unbounded/1p1c                    (no DEBRA)
  unbounded_sipmuc mpmc_unbounded/1p{1,2,4}c              (DEBRA on consumers)
  unbounded_mupsic mpsc_unbounded/{1,2,4}p1c              (DEBRA on producers)
  unbounded_mupmuc mpmc_unbounded/{1,2,4}p{1,2,4}c grid   (DEBRA both sides)

UnboundedSipsic is a plain SPSC pair; the other three use DEBRA epoch
reclamation, so producer/consumer threads call registerThread(manager)
on their own thread before queue.getProducer(handle) /
queue.getConsumer(handle). UnboundedMupsic reuses the existing
UnboundedMupsicAdapter (which pre-registers the consumer at adapter
init) and drives pop from the calling thread, mirroring the legacy
bench_throughput path. UnboundedSipmuc and UnboundedMupmuc are
constructed inline with a heap-pointer DebraManager since their
constructors take a `ptr DebraManager`.

Per-variant intdefines (design 2.5):
  -d:UnboundedSipsicRuns / UnboundedSipsicMessageCount   (3 / 500_000)
  -d:UnboundedSipmucRuns / UnboundedSipmucMessageCount   (3 / 500_000)
  -d:UnboundedMupsicRuns / UnboundedMupsicMessageCount   (3 / 500_000)
  -d:UnboundedMupmucRuns / UnboundedMupmucMessageCount   (3 / 500_000)
  -d:BenchUnboundedWarmup                                (default 2)

Per-variant defines are chosen over a shared BenchUnboundedRuns so CI
can budget unbounded_mupmuc/4p4c independently from the cheaper
unbounded_sipsic/1p1c (design 2.5 rationale).

* PR 2 Task 2.7: superset_check.py + deletion-safety wiring

Adds benchmarks/scripts/superset_check.py: a slug-set subset check that
exits 0 when the post-split BMF (union of bench_spsc + bench_mpsc +
bench_mpmc + bench_unbounded after merge_bmf.py) covers every slug in
the pre-split fixture, and exits 1 with the missing slugs listed on
stderr otherwise. The check operates on JSON top-level keys only; it
does not compare measure values, bounds, or ordering, so the harness
is free to reshape per-run statistics across the split without
tripping the deletion-safety guard.
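The check reduces to a set difference over top-level JSON keys; a minimal sketch of that core (the real superset_check.py may differ in CLI and error handling):

```python
import json
import sys

def check_superset(pre_path: str, post_path: str) -> int:
    """Exit-code-style subset check: every slug (top-level JSON key) in the
    pre-split fixture must appear in the merged post-split BMF. Measure
    values, bounds, and ordering are deliberately ignored; only key
    presence matters."""
    with open(pre_path) as f:
        pre_slugs = set(json.load(f))
    with open(post_path) as f:
        post_slugs = set(json.load(f))
    missing = sorted(pre_slugs - post_slugs)
    if missing:
        print("missing slugs: " + ", ".join(missing), file=sys.stderr)
        return 1
    return 0
```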

Also wires the check into tests/t_topology_split.nim's deletion-safety
suite (compile + run all four binaries at small overrides, merge via
merge_bmf.py, then invoke superset_check.py against the committed
fixture and assert exit 0).

benchmarks/tests/test_superset_check.py covers 9 cases, including:
identical sets, strict superset, missing slugs (with alpha-sorted
listing), empty pre, wrong arg count, missing pre file, malformed
JSON, and top-level non-object. All exit codes use 1 for failure to
match the existing benchmarks/merge_bmf.py contract.

* PR 2 Task 2.8: bench.yml matrix over 5 topology binaries + per-step timeouts

Replaces the legacy bench-throughput + bench-latency job pair with a
single matrix job over the 5 topology-split binaries. Each matrix
entry runs in its own GitHub Actions job with an independent
`timeout-minutes: 12` budget so a hang in one binary cannot burn the
whole workflow's clock; the surviving binaries finish, the
bench-upload job merges what arrived, and the operator gets partial
Bencher coverage rather than no coverage. The compile step's `case`
sets per-binary `-d:` overrides matching design 2.5: bounded
throughput at 1M messages * 5 runs * 2 warmup, unbounded at 500K * 3
runs * 2 warmup, latency at 50K messages * 11 runs * 2 warmup.

bench-upload now also runs the deletion-safety check
(superset_check.py vs tests/fixtures/pre-split-slugs.json) right after
merge_bmf.py, so any silent slug regression introduced by future
edits to the topology binaries fails the PR check.

Artifact naming changes: each matrix entry uploads
`bench-<binary>-bmf` (e.g. `bench-bench_spsc-bmf`); the upload job's
download pattern is unchanged (`bench-*-bmf`) so it picks up every
new artifact without further edits.

actionlint clean; yamllint shows only the pre-existing
"missing document start" warning shared across all repo workflows.

* PR 2 Task 2.9: 5-input union test in test_merge_bmf.py

Adds test_five_input_union covering the upload-job pipeline shape that
PR 2's matrix produces: five sibling BMF fragments (one per binary —
bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency)
arrive via actions/download-artifact, are merged via merge_bmf.py, and
must produce a single output whose slug set is the disjoint union of
the inputs.

The test deliberately collides the sipsic 1p1c slug between bench_spsc
(throughput_ops_ms) and bench_latency (latency_p50_ns,
latency_p99_ns) on the SLUG axis with disjoint MEASURE keys, mirroring
the production multi-measure-per-slug pattern from Track 1 Task 1.5.
The merge must preserve every measure on the shared slug; collisions
on (slug, measure) pairs continue to exit 1 (covered by the existing
test_collision_exits_1 case, no regression).
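The merge contract exercised here — union on slugs, hard failure on duplicate (slug, measure) pairs — can be sketched as follows (illustrative only; merge_bmf.py's actual implementation and error reporting may differ):

```python
def merge_bmf_fragments(fragments: list[dict]) -> dict:
    """Union BMF fragments keyed by slug. A slug may appear in several
    fragments as long as its measure keys are disjoint (the
    multi-measure-per-slug pattern); a duplicate (slug, measure) pair
    is a collision and aborts the merge."""
    merged: dict[str, dict] = {}
    for fragment in fragments:
        for slug, measures in fragment.items():
            bucket = merged.setdefault(slug, {})
            for measure, value in measures.items():
                if measure in bucket:
                    raise SystemExit(f"collision on ({slug}, {measure})")
                bucket[measure] = value
    return merged
```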

* PR 2 Task 2.10: delete bench_throughput.nim, rewire consumers

Removes benchmarks/nim/bench_throughput.nim now that the five topology-
split binaries (bench_spsc / bench_mpsc / bench_mpmc / bench_unbounded
+ bench_latency from PR 1) cover its full slug surface. The pre-split
fixture committed in Task 2.1 plus the superset_check.py guard wired
into bench.yml in Task 2.8 enforce that no slug from the legacy binary
silently disappears across the split.

Live consumers updated to reference the new binaries:
  benchmarks/runner.py     - builds + runs all 5 binaries, then
                             merges fragments via merge_bmf.py.
  lockfreequeues.nimble    - `nimble benchmarks` task iterates the
                             5 binaries and merges to results/latest.json.
  benchmarks/README.md     - rewritten to describe the 5-binary
                             pipeline, the matrix CI job, and the
                             merged BMF schema (single slug can carry
                             both throughput and latency measures).
  .gitignore               - ignores the four new compiled throughput
                             binaries; also ignores the worktree-local
                             `deps` symlink that nim.cfg's relative
                             path expects.
  tests/t_bench_common.nim - removes the obsolete Task 0.10
                             bench_throughput integration suite (now
                             covered by t_topology_split.nim per
                             topology binary).
  benchmarks/nim/adapters/lockfreequeues_unbounded_mupsic_adapter.nim
                           - doc-comment rewires from bench_throughput
                             to bench_unbounded.

Comment-only references in bench_common.nim, bench_latency.nim,
render_readme.nim, and the new bench_*.nim files describe the legacy
binary as historical context (the legacy `runThroughputBenchmark`
proc / Mupmuc registration pattern that bench_common.runThroughputHarness
factored out) and stay verbatim.

* PR 2 Task 2.11: CHANGELOG entry under [Unreleased] for bench-rollup PR 2

Adds Track 2 PR 2 entries to the [Unreleased] section under Added,
Changed, and Removed:
  - Added: 4 new topology-split throughput binaries (bench_spsc,
    bench_mpsc, bench_mpmc, bench_unbounded), superset_check.py
    deletion-safety guard, 5-input merge_bmf union test.
  - Changed: bench.yml matrix over 5 binaries with per-step timeouts
    + superset_check.py wired into bench-upload; runner.py and
    `nimble benchmarks` task iterate the 5 binaries; README rewrite.
  - Removed: bench_throughput.nim (replaced by the 4 topology-split
    binaries; pre-split fixture + superset_check.py guard pin
    deletion-safety).

* fix(bench): tighten bench_unbounded CI shape to fit 18-min budget

bench_unbounded matrix item was timing out at 10 minutes after only
4 of 16 shapes (sipsic 1p1c, sipmuc 1p1c/1p2c/1p4c) due to oversubscribed
shapes on 4-vCPU runners. Reduce message counts and runs across the
heavy fan-out variants (mupmuc 9-shape grid is the main offender) and
bump the job timeout to 18 min for cushion. Statistical signal is
preserved for Bencher trend tracking; absolute throughput numbers
will shift but per-shape comparisons remain meaningful.

- UnboundedSipsicMessageCount: 500K -> 200K (1 shape, runs kept at 3)
- UnboundedSipmuc: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupsic: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupmuc: 500Kx3 -> 50Kx2  (9 shapes)
- BenchUnboundedWarmup: 2 -> 1
- timeout-minutes: 12 -> 18

* fix(bench): gate oversubscribed C>=4 unbounded shapes behind a define

bench_unbounded was hanging in CI on the unbounded_sipmuc 1p4c shape:
4 consumers compete in a strict round-robin slot order (each consumer
must take its turn before another can pop), and on a 4-vCPU runner
one descheduled consumer blocks every other. The same starvation
pattern hits every unbounded mupmuc shape with C>=4.

Add `-d:BenchSkipOversubscribed` (set in PR CI bench.yml) that drops:

  - unbounded_sipmuc/mpmc_unbounded/1p4c
  - unbounded_mupmuc/mpmc_unbounded/{1p4c, 2p4c, 4p1c, 4p2c, 4p4c}

mupsic shapes (multi-producer, single consumer) are kept regardless:
the round-robin ordering exists on consumers only, so a single consumer
cannot starve. mupsic 4p1c is also in tests/fixtures/pre-split-slugs.json
and the deletion-safety guard would fail without it.
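The drop rule reduces to a predicate over (variant, P, C). A sketch of the listed shapes (function name illustrative; the real gate is a Nim compile-time define, not runtime Python):

```python
def dropped_when_oversubscribed(variant: str, producers: int, consumers: int) -> bool:
    """Mirror of the -d:BenchSkipOversubscribed drop list above. Only
    consumer-side round-robin can starve, so mupsic (single consumer)
    is never dropped; the kept mupmuc shapes cover each P,C in {1,2}."""
    if variant == "unbounded_sipmuc":
        return consumers >= 4
    if variant == "unbounded_mupmuc":
        return producers >= 4 or consumers >= 4
    return False
```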

The full grid still runs in `bench-comparison.yml` nightly cron / on
workflow_dispatch where a beefier runner is available. PR CI now gets
the meaningful subset (4 mupmuc shapes covering each P,C in {1,2})
that fits a 4-vCPU runner without hanging.

* fix(bench): document p95 in BMF schema; clean up runner.py fragment files

Per Gemini review on PR #21.

1. README.md schema example listed only latency_p50_ns + latency_p99_ns,
   but bench_latency.nim emits all three (p50/p95/p99). Added p95 to
   the schema block and the prose paragraph below it. The Metrics
   section already correctly listed p50/p95/p99.

2. runner.py left N timestamped fragment files in RESULTS_DIR after
   each merged run (one per Nim binary) — only the merged file is the
   canonical artifact. Wrap the run+merge in try/finally and unlink
   each fragment afterwards, so a bench-binary failure or a merge collision
   still cleans up whatever was produced.
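The cleanup shape is a plain try/finally around the merge; a sketch (function names and signature hypothetical, not runner.py's actual API):

```python
import os
from contextlib import suppress

def merge_then_cleanup(fragment_paths, merged_path, merge):
    """Run the merge, then unlink every per-binary fragment even if the
    merge raised (e.g. a measure collision): only the merged file is
    the canonical artifact left in RESULTS_DIR."""
    try:
        merge(fragment_paths, merged_path)
    finally:
        for frag in fragment_paths:
            with suppress(FileNotFoundError):
                os.unlink(frag)
```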

* fix(bench): run upload on partial failures; correct timeout-comment

Three CI hardening fixes from code review on PR 2:

1. bench-upload had needs:[bench] without if:always(), so any single
   matrix-leg failure would skip the merge+upload entirely. Per the
   job comment the design is partial-coverage tolerant: surviving
   binaries should still be merged and uploaded. Add
   if: always() && (fork-PR guard) and a guard step that fails loudly
   when zero JSON fragments are present.

2. The per-binary upload-artifact step now runs with if: always() and
   if-no-files-found: ignore so a step-level timeout flushes any
   partial JSON the binary wrote, and a binary that crashed before
   writing the file does not turn the matrix leg red for the merger.

3. Job and step comments referenced a 12-min timeout while the
   configured value was 18 min. Update both comments to 18.

* fix(bench): drop unparseable BMF fragments before merge

bench-upload runs with if: always(), so a matrix leg that crashed
mid-write can leave a truncated JSON fragment. merge_bmf.py exits 1
on the first malformed input, blocking uploads from surviving legs.
Pre-validate with json.load and drop any that fail, emitting a
::warning:: per dropped file.
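The pre-validation pass amounts to a json.load probe per file; a sketch of that filter (names illustrative; the in-workflow script may differ):

```python
import json

def parseable_fragments(paths):
    """Keep only fragments that parse as JSON. A matrix leg killed
    mid-write can leave a truncated file that would otherwise make
    merge_bmf.py exit 1 and block the surviving legs' upload."""
    keep = []
    for path in paths:
        try:
            with open(path) as f:
                json.load(f)
        except (OSError, json.JSONDecodeError):
            # ::warning:: is GitHub Actions' annotation syntax
            print(f"::warning::dropping unparseable BMF fragment {path}")
            continue
        keep.append(path)
    return keep
```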

* perf(bench): backoffOnPeerWait in busy-spin loops; document teardown order

- Add backoffOnPeerWait to producer/consumer busy loops in
  bench_mpmc (Sipmuc producer + Sipmuc consumer + Mupmuc consumer)
  and bench_mpsc (Mupsic producer + consumer) to avoid CPU starvation
  in oversubscribed runs. Mupmuc producer already had it.
- Add backoffOnPeerWait to all four UnboundedSipsic/USipmuc/UMupsic/
  UMupmuc consumer loops in bench_unbounded.nim.
- Document the per-iteration teardown order in runUSipmucShape and
  reference it from runUMupmucShape: reset(q) MUST precede reset(manager[])
  because UnboundedSipmuc/Mupmuc =destroy unbinds from the manager,
  which then asserts clientCount==0. Note that under Nim 2.x ARC/ORC,
  system.reset(x) is =destroy(x) + =wasMoved(x) — it does NOT skip
  destructors (verified against project's own =destroy implementations
  in src/lockfreequeues/unbounded_sipmuc.nim:reset(self.manager[])).

* docs(bench_unbounded): defend `create(T)` as alloc0, not destructor-on-uninit

Reviewer flagged that `create(DebraManager[...])` followed by
`manager[] = initDebraManager[...]()` "runs destructors on
uninitialised memory." It does not. `proc create*(T: typedesc, size)`
in system/memalloc.nim is `cast[ptr T](alloc0(sizeof(T) * size))` —
zero-initialised allocation with no hook invocation. The follow-up
assignment then performs the only construction.

Add an explanatory note in the per-iteration teardown block comment
(runUSipmucShape) so this doesn't keep getting flagged. The other
harness (runUMupmucShape) already back-references that block comment.

No behaviour change.

* fix(bench): pass -d:danger in runner.py build_nim() to match CI

CI compiles topology-split bench binaries with -d:release -d:danger;
runner.py was missing -d:danger, so locally produced numbers were not
comparable to the cloud baseline.

* test: t_unbounded_padding red — segment alloc returns 16B-aligned, expects 64B

PR 3 Task 3.1 (cache-line padding audit, step 1 of 3).

Verifies bounded variants already conform; introduces the failing test that
will drive the unbounded fix in 3.2 and 3.3.

Bounded MPMC audit (read-only verification, design doc §4.2):
- src/lockfreequeues/sipsic.nim:19-20    head/tail Atomic[int]    {.align: CacheLineBytes.} ✓
- src/lockfreequeues/sipmuc.nim:42-43    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupsic.nim:42-43    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupmuc.nim:43-44    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
The bounded MPMC cell array also uses the documented array-of-cells trick
(typestates/mpmc_cell.nim:11) so adjacent slots never share a cache line.
No bounded change required.

Unbounded variants currently FAIL both audit conditions:
- Segment fields (head, tail, prevConsumerIdx, committed) lack {.align: CacheLineBytes.}
  pragmas, so they share cache lines with neighbouring data.
- newSegment() uses c_calloc, which on x86_64 Linux glibc returns 16-byte
  aligned memory (MALLOC_ALIGNMENT = 2*sizeof(size_t)). The {.align.} pragma
  controls only the intra-struct offset, not the base.
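The failure mode is pure arithmetic: a 64-byte intra-struct offset cannot land a field on a cache-line boundary when the allocation base is only 16-byte aligned. A worked example:

```python
CACHE_LINE = 64

# glibc c_calloc guarantees only MALLOC_ALIGNMENT = 2*sizeof(size_t) = 16
# bytes on x86_64, so a base address like this is legal:
base = 0x1000 + 16

# {.align: CacheLineBytes.} fixes the field's offset inside the struct...
field_offset = 64
assert field_offset % CACHE_LINE == 0

# ...but the field's absolute address still misses the 64-byte boundary.
assert (base + field_offset) % CACHE_LINE == 16
```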

Test additions:
- tests/t_unbounded_padding.nim — 8 assertions across 4 unbounded variants:
  (a) offsetOf(SegmentField) mod CacheLineBytes == 0 for the producer/consumer
      coordination atomics.
  (b) cast[uint](segPtr) mod CacheLineBytes == 0 for a freshly-allocated
      Segment via the queue's allocator.
- Test-only accessors guarded by `when defined(testing):` in each unbounded
  module: headSegmentForTest (returns pointer) and segmentHeadOffsetForTest
  (returns offsets of audit-relevant fields). Compiled away in release.
- Wired into tests/test.nim aggregator.

RED state confirmed:
  Unbounded queue Segment cache-line padding [Summary] 8 tests run: 0 OK, 8 FAILED
  off.tail mod 64 was 8 (expected 0)
  off.prevConsumerIdx mod 64 was 16
  off.committed mod 64 was 24
  cast[uint](segPtr) mod 64 was 16

Existing test suite continues to compile (verified `nim c tests/test.nim`).

Refs: design doc §4.2; impl plan Task 3.1.

* feat(internal): add aligned_alloc.allocAligned via posix_memalign

PR 3 Task 3.2 (cache-line padding audit, step 2 of 3).

Adds the cache-line-aligned heap allocator that replaces c_calloc in the
unbounded queue Segment allocation path (call sites swap in Task 3.3).

Compile probe (impl plan §3.2.0):
  $ nim c -r tmp_probe.nim
  rc=0
  align=0
verifying that std/posix.posix_memalign is available on macOS (libSystem)
and Linux glibc with the same C signature, returning 64-byte aligned memory
on success. Probe file deleted post-verification per impl plan.

Implementation (src/lockfreequeues/internal/aligned_alloc.nim):
- ``proc allocAligned*[T](): ptr T`` — calls posix_memalign with
  CacheLineBytes (64) alignment, zero-inits, raises OutOfMemDefect on
  rc != 0. Caller releases via the standard c_free per POSIX guarantee
  that posix_memalign blocks are free-compatible.
- Re-exports CacheLineBytes from atomic_dsl for callers in 3.3.
- isMainModule smoke confirms tail-call alignment.
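The posix_memalign contract the allocator relies on can be probed from Python via ctypes, in the same spirit as the compile probe above (a throwaway sanity check, not project code; libc lookup details vary by platform):

```python
import ctypes
import ctypes.util

# Bind the canonical C signature:
#   int posix_memalign(void **memptr, size_t alignment, size_t size);
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
libc.posix_memalign.argtypes = [
    ctypes.POINTER(ctypes.c_void_p),  # void **memptr
    ctypes.c_size_t,                  # alignment: power of 2, multiple of sizeof(void*)
    ctypes.c_size_t,                  # size
]
libc.posix_memalign.restype = ctypes.c_int

CACHE_LINE = 64
ptr = ctypes.c_void_p()
rc = libc.posix_memalign(ctypes.byref(ptr), CACHE_LINE, 128)

assert rc == 0                      # returns 0 on success, else an errno value
assert ptr.value % CACHE_LINE == 0  # base is 64-byte aligned, unlike c_calloc's 16
libc.free(ptr)                      # posix_memalign blocks are free()-compatible
```

Note posix_memalign itself does not zero the block; the Nim wrapper zero-inits separately.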

Tests (tests/t_aligned_alloc.nim, wired into tests/test.nim):
- Tiny payload (sizeof < 64).
- Line-sized payload (sizeof == 64).
- Big payload (sizeof > 64).
- 64 sequential allocations all 64B-aligned.
All 4 cases assert pointer alignment AND zero-initialization. Result:
  internal/aligned_alloc.allocAligned ....
  [Summary] 4 tests run: 4 OK, 0 FAILED.

The unbounded padding audit test (tests/t_unbounded_padding.nim) is still
RED at this commit; Task 3.3 swaps the Segment allocation call sites and
adds {.align: CacheLineBytes.} pragmas, taking it GREEN.

Refs: design doc §4.2; impl plan Task 3.2.

* fix(unbounded): pad Segment fields and base-align via posix_memalign

PR 3 Task 3.3 (cache-line padding audit, step 3 of 3 — GREEN).

Applies the cache-line padding fix to all four unbounded queue variants and
their per-segment producer/consumer atomics. Resolves the asymmetry called
out in design doc §4.2: bounded variants were already padded; unbounded
Segments were not, so cross-library comparison numbers in PR 3 would have
been biased against lockfreequeues' unbounded types.

Production changes (4 files):
- src/lockfreequeues/unbounded_sipsic.nim
  Segment.{next, head, tail} + UnboundedSipsic.{headSegment, tailSegment}
  gain {.align: CacheLineBytes.}. newSegment swaps c_calloc -> allocAligned.
- src/lockfreequeues/unbounded_sipmuc.nim
  Segment.{next, tail, prevConsumerIdx} + UnboundedSipmuc.{headSegment,
  tailSegment} gain alignment. newSegment swaps to allocAligned.
- src/lockfreequeues/unbounded_mupsic.nim
  Segment.{next, tail, committed} + UnboundedMupsic.{headSegment,
  tailSegment} gain alignment. newSegment swaps to allocAligned.
  ``head: int`` (single consumer, non-atomic) is intentionally left
  unpadded — only the consumer touches it.
- src/lockfreequeues/unbounded_mupmuc.nim
  Segment.{next, tail, prevConsumerIdx, committed} + UnboundedMupmuc.{
  headSegment, tailSegment} gain alignment. newSegment swaps to
  allocAligned.

The ``c_calloc`` import is dropped; ``c_free`` retained for segment
release (POSIX guarantees posix_memalign blocks are free-compatible).

Cross-platform fix in aligned_alloc.nim:
- ``std/posix.posix_memalign`` triggers a clang error under ``nim cpp`` on
  macOS because the SDK declares the first parameter ``__unsafe_indexable``
  (incomplete-type ``void *`` vs the wrapper's ``void **``).
- Replaced the ``import std/posix`` with a local ``importc, header:
  "<stdlib.h>"`` shim using the canonical C signature, which clang accepts
  in both C and C++ modes. Documented in aligned_alloc.nim header comment.

Verification:
- tests/t_unbounded_padding.nim: 8/8 GREEN under both nim c and nim cpp.
- Full suite under nim c (orc), nim cpp, --mm:arc, --mm:refc:
    [Summary] 200 tests run: 200 OK, 0 FAILED, 0 SKIPPED  (all four variants)
- t_unbounded_*_threaded suites pass — no false-sharing-induced timing
  regression in the threaded code paths.

Refs: design doc §4.2; impl plan Task 3.3.

* feat(bench): add loony_adapter (unbounded MPMC) behind compile gate

PR 3 Task 3.5 (comparison MVP, first FFI adapter).

Adds the Loony unbounded MPMC adapter and a parametric round-trip test in
tests/t_bench_adapters.nim, both gated on -d:adapter_loony_available so
the in-tree test suite passes vacuously when the package is absent.

Files:
- benchmarks/nim/adapters/loony_adapter.nim — adapter, topologiesSupported
  = {tMpmcUnbounded}, makeLoonyAdapter / push / pop / cleanup / name.
- tests/t_bench_adapters.nim — push/pop 1000 uint64 round-trip preserves
  set; parametric template extends to subsequent adapters in 3.6–3.10.

Loony storage-encoding fix (NOT a hack — required by Loony's design):
Loony reserves the low 3 bits of each slot for RESUME/WRITER/READER state
flags (loony/spec.nim: SLOTMASK = ~0x7). Loony's empty-slot detection
reads ``(prev and SLOTMASK) == 0``, treating a zero "pointer" payload as
"slot empty". Loony was designed for ref/ptr types whose addresses are
8-byte-aligned and never null. Pushing the literal value 0 (or any value
whose low 3 bits are non-zero) breaks Loony's pop path.

Adapter encodes the harness's ``uint64(i)`` payload as ``(v + 1) shl 3``
on push and ``(raw shr 3) - 1`` on pop:
- ``shl 3`` clears the low flag bits.
- ``+ 1`` ensures the encoded value is never zero.
This matches how Loony's documented usage pushes 8-byte-aligned,
non-null ref addresses; it does NOT change Loony's algorithm or affect
the comparison fairness — every push/pop still goes through Loony's full
fast/slow paths.
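The encoding round-trip is plain bit arithmetic and easy to sanity-check outside Nim; a Python sketch of the adapter's transform:

```python
FLAG_BITS = 3  # Loony reserves the low 3 bits per slot (loony/spec.nim SLOTMASK = ~0x7)
U64 = (1 << 64) - 1

def encode(v: int) -> int:
    """uint64 payload -> slot word: low flag bits clear, never zero."""
    return ((v + 1) << FLAG_BITS) & U64

def decode(raw: int) -> int:
    return (raw >> FLAG_BITS) - 1

# The naive failure modes recorded in the RED notes above:
assert encode(0) != 0                                    # '+ 1' dodges the empty-slot sentinel
assert all(encode(v) & 0x7 == 0 for v in range(1000))    # 'shl 3' keeps the flag bits clear
assert all(decode(encode(v)) == v for v in range(1000))  # lossless round-trip
```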

First-iteration RED state (recorded for the audit):
- Plain push of ``uint64(i)`` returned only 125 of 1000 items, all
  multiples of 8 (i.e. the values whose ``i and 7 == 0``).
- ``shl 3`` alone returned 999 of 1000 items (everything except 0,
  which collided with Loony's empty-slot sentinel).
- ``(v + 1) shl 3`` returned all 1000 items, set-equal to the input.

Verification: tests/t_bench_adapters.nim with -d:adapter_loony_available
+ --path:loony-0.3.1 --path:arc-0.0.4 -> [Summary] 1 tests run: 1 OK.
Full in-tree suite (-d:testing without the loony gate) -> 200/200 OK.

Refs: design doc §3 PR 3; impl plan Task 3.5; loony/spec.nim SLOTMASK.

* feat(bench): add boost_lockfree_queue adapter (MPMC bounded, nim cpp)

Boost.LockFree is C++ header-only (BSL-1.0). Adapter wraps
boost::lockfree::queue<unsigned long long> with capacity-bounded
push (bounded_push) and pop. Topology: tMpmc.

Compile gates:
  -d:adapter_boost_lockfree_queue_available  -- enable
  nim cpp                                     -- required (else hard error)

Pulls headers from the conventional install paths (apt /usr/include,
brew /opt/homebrew/opt/boost/include, /usr/local/include) and accepts
-d:boostIncludeDir=<path> for non-standard locations.

Adds smoke_boost.nim (compile-and-run sanity for CI) and a parametric
round-trip test in tests/t_bench_adapters.nim gated by both the define
and 'cpp' build mode.

Adapter capacity argument is the fixed node-pool size; pushes that
exceed it return prFull. Default 1024 mirrors other bounded adapters.

* feat(bench): add boost_lockfree_spsc adapter (SPSC bounded, nim cpp)

Boost.LockFree's spsc_queue is a wait-free single-producer single-consumer
ring buffer. The pop overload used here -- pop(out, max_size) -- returns
the count of items popped (csize_t); 1 means success, 0 means empty.

Compile gates:
  -d:adapter_boost_lockfree_spsc_available  -- enable
  nim cpp                                    -- required

Topology: tSpsc. Capacity is the fixed ring size (default 1024). Header
search paths and indirection rationale identical to boost_lockfree_queue
adapter; smoke_boost.nim covers both adapters under one binary.

* feat(bench): add bench-ffi-crossbeam Rust cdylib (ArrayQueue + SegQueue)

C-ABI shim around crossbeam_queue::ArrayQueue<u64> (bounded MPMC) and
crossbeam_queue::SegQueue<u64> (unbounded MPMC). Exposes 8 extern "C"
fns (cb_array_init/push/pop/destroy + cb_seg_init/push/pop/destroy)
that the Nim adapters in tasks 3.9 / 3.10 will importc.

- Cargo.toml: crate-type = [cdylib, rlib]; rlib so cargo test can link
  the integration test against the same code path the cdylib exports.
- rust-toolchain.toml pins stable channel for reproducibility.
- src/lib.rs: 8 extern fns; null-pointer safe; ArrayQueue::new(0) panic
  is intercepted (returns null) so we never unwind across FFI.
- tests/integration_test.rs: 6 tests covering round-trip set equality
  for both queues, full/empty edges, and null-pointer tolerance.

Build: cargo build --release  -> target/release/libbench_ffi_crossbeam.{so,dylib}
.gitignore extended for benchmarks/rust/*/target/ and Cargo.lock.

* feat(bench): add crossbeam_array_queue adapter (MPMC bounded, FFI cdylib)

importc the cb_array_init/push/pop/destroy fns from
benchmarks/rust/bench-ffi-crossbeam (Task 3.8) and surface them as a
typed Nim adapter with topologiesSupported = {tMpmc}.

The adapter avoids emitting -L/-lbench_ffi_crossbeam when its sibling
crossbeam_seg_queue_adapter gate is also active, so combined builds
do not produce 'ignoring duplicate libraries' warnings on Apple ld.

Adds smoke_crossbeam.nim covering both crossbeam adapters under one
binary (Task 3.10 lands the seg adapter); tests/t_bench_adapters.nim
extended with the gated parametric round-trip suite.

Compile gates:
  -d:adapter_crossbeam_array_queue_available
  -d:crossbeamLibDir=<path>   (override search path for non-default builds)

* feat(bench): add crossbeam_seg_queue adapter (MPMC unbounded, FFI cdylib)

importc the cb_seg_init/push/pop/destroy fns from the bench-ffi-crossbeam
cdylib. SegQueue is unbounded -- push always succeeds (returns prSuccess
when the handle is non-null; the underlying cb_seg_push only reports
failure on null pointers).

topologiesSupported = {tMpmcUnbounded}. capacity arg accepted-but-ignored
so the wire-up sites (Task 3.11) can pass capacity uniformly across all
adapters without a per-adapter conditional.

Compile gate: -d:adapter_crossbeam_seg_queue_available
Optional override: -d:crossbeamLibDir=<path>

* feat(bench): wire MVP comparison adapters into bench_spsc/mpmc/unbounded

Each MVP adapter is pulled in only when its compile-time gate is set;
absent gates leave production builds unchanged. Variant lists are
populated at compile time via 'when declared(...)' so the CLI's
--list / unknown-variant error messages reflect the actual compiled-in
set.

Wiring map (per design 2.4 / Track 3 §3.11):

  bench_spsc:
    - boost_lockfree_spsc -> spsc/1p1c (gated by
      adapter_boost_lockfree_spsc_available)

  bench_mpmc:
    - boost_lockfree_queue   -> mpmc/{1,2,4}p{1,2,4}c
    - crossbeam_array_queue  -> mpmc/{1,2,4}p{1,2,4}c
    (each gated by its respective adapter_*_available define)

  bench_unbounded:
    - loony                  -> mpmc_unbounded/{1,2,4}p{1,2,4}c
    - crossbeam_seg_queue    -> mpmc_unbounded/{1,2,4}p{1,2,4}c

bench_mpsc has no MVP coverage today: none of the MVP libraries provide
a dedicated MPSC API at the natural shape. Design 2.4 is silent on MPSC
parity for Track 3, so the existing in-tree mupsic baseline stays the
sole MPSC slug.

Verified locally: bench_mpmc + bench_unbounded produce expected
crossbeam_* slugs at the small fixture (10K msg, 2 runs).

* ci(bench): add boost+loony soft-skip flow per design §2.6

workflow_dispatch inputs force_skip_boost / force_skip_loony surface
the override hook; FORCE_SKIP_BOOST / FORCE_SKIP_LOONY env vars relay
them into step-level if: gates.

Per library: install -> smoke -> set ADAPTER_*=1 on success -> annotate
on any failure. The combined defines + compile-mode (c vs cpp) flow
into the per-binary compile step via a final 'Build ada…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment