PR 1: bench_latency wired to BMF (smoke shape, multi-measure-per-slug) #20
Conversation
Code Review
This pull request rewrites the bench_latency tool to use the common benchmarking harness, enabling native Bencher Metric Format (BMF) output via a new --bmf-out flag. It unifies the PushResult and PopResult types across adapters by re-exporting them from bench_common, simplifying the adapter implementations for the various queue variants. The PR also adds a new bench-latency GitHub workflow job and comprehensive integration tests for the latency benchmark and the BMF merging logic. I have no feedback to provide.
Force-pushed 400e205 to 33fc9c8
Force-pushed 6a882f7 to 6d725c6
Force-pushed 33fc9c8 to 459090f
/gemini review
Code Review
This pull request rewrites the bench_latency utility to leverage the bench_common.runLatencyHarness, enabling native Bencher Metric Format (BMF) JSON output and standardizing the CLI interface. It also unifies the PushResult and PopResult types across adapters and introduces comprehensive integration tests. The review feedback highlights an opportunity to improve CLI error messages for unknown short flags and suggests renaming a variable to better reflect the variant terminology used throughout the codebase.
    quit 1
  bmfOutPath = p.val
else:
  echo "Unknown flag: --", p.key
The error message for unknown flags incorrectly assumes all flags are long options (prefixed with --). If a user provides an unknown short option (e.g., -x), the output will be Unknown flag: --x, which is slightly misleading.
echo "Unknown flag: ", (if p.kind == cmdLongOption: "--" else: "-"), p.key
if positional.len == 0:
  supported
else:
  var groups = initHashSet[string]()
The variable name groups is used here, but in the context of bench_latency, the positional arguments represent individual variants rather than groups of variants (unlike bench_throughput). Renaming this to variants would improve clarity and consistency with the rest of the file.
var variants = initHashSet[string]()
…ps to variants

Per Gemini review on PR #20.

1. Unknown-flag error message hardcoded `--` regardless of cmdShortOption vs cmdLongOption; `-x` printed as `Unknown flag: --x`. Now picks the right prefix from p.kind.
2. `groups` is a leftover name from bench_throughput, where positional args ARE topology groups. In bench_latency they're individual variants. Renamed for clarity.
Force-pushed 459090f to 19c6dcb
/gemini review
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
/review
Code Review by Qodo
1. Bench-upload missing token scopes
name: Merge BMF + Bencher upload (ubuntu-latest)
runs-on: ubuntu-latest
if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository
needs: [bench-throughput, bench-latency]
permissions:
  pull-requests: write
  checks: write
timeout-minutes: 10

steps:
  - name: Checkout project
    uses: actions/checkout@v4

  - name: Download BMF artifacts
    # Pattern matches both bench-throughput-bmf and bench-latency-bmf
    # (and any future per-binary artifact uploaded by PR 2's topology
    # split, which uses the same `bench-*-bmf` naming convention).
    uses: actions/download-artifact@v4
    with:
      pattern: bench-*-bmf
      path: ./bmf-inputs/
      merge-multiple: true
1. Bench-upload missing token scopes 🐞 Bug ☼ Reliability
The bench-upload job sets job-level permissions but does not grant contents: read (for actions/checkout) or actions: read (for actions/download-artifact). In repos/orgs that restrict default GITHUB_TOKEN permissions, this job can fail to checkout or download artifacts, preventing BMF merge and Bencher upload.
Agent Prompt
### Issue description
`bench-upload` overrides `GITHUB_TOKEN` permissions but does not include the permissions needed to (a) checkout the repository and (b) download workflow artifacts. This can cause 403/permission failures and stop the merge+upload pipeline.
### Issue Context
The job runs `actions/checkout@v4` and `actions/download-artifact@v4` after setting a restrictive `permissions:` block.
### Fix Focus Areas
- .github/workflows/bench.yml[182-208]
### Suggested change
Update `bench-upload.permissions` to include at least:
- `contents: read`
- `actions: read`
Keep existing:
- `pull-requests: write`
- `checks: write`
(Optionally, validate if Bencher/comment posting needs any additional scopes beyond these.)
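As a sketch, the amended block could look like this (scope names follow GitHub Actions' standard GITHUB_TOKEN permission model; whether Bencher needs further scopes is left unverified, as the prompt notes):

```yaml
permissions:
  contents: read        # needed by actions/checkout
  actions: read         # needed by actions/download-artifact
  pull-requests: write  # existing: PR comparison comments
  checks: write         # existing: check runs
```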
Force-pushed 19c6dcb to 7a6ff2c
/review
Persistent review updated to latest commit 7a6ff2c
@gemini-code-assist review
Code Review
This pull request rewrites the bench_latency benchmark to use the bench_common harness, enabling native Bencher Metric Format (BMF) output and aligning its CLI with bench_throughput. It unifies the PushResult and PopResult types across adapters, updates the CI workflow to merge benchmark results, and adds integration tests. Feedback suggests including the p999 latency metric in the stdout report to maintain consistency with previous versions and improve tail latency visibility.
echo fmt" p99: {metrics.p99_ns:.0f} ns"
echo fmt" max: {metrics.max_ns:.0f} ns"
The p999 latency metric is calculated by the harness and available in the metrics object, but it is omitted from the stdout report. For consistency with the previous implementation and to provide better tail latency visibility during manual runs, it should be printed alongside the other percentiles.
echo fmt" p99: {metrics.p99_ns:.0f} ns"
echo fmt" p999: {metrics.p999_ns:.0f} ns"
echo fmt" max: {metrics.max_ns:.0f} ns"
Force-pushed 7a6ff2c to a41a91d
@gemini-code-assist review
Code Review
This pull request rewrites the bench_latency utility to utilize the bench_common harness, introduces native Bencher Metric Format (BMF) output, and unifies adapter result types across different queue implementations. It also adds a comprehensive integration test suite and updates the CI workflow to merge performance metrics. Feedback was provided to improve the test suite's robustness by using unique temporary directories for build artifacts to prevent collisions and ensuring cross-platform compatibility for binary execution.
let outBin = getTempDir() / "bench_latency_t11_defaults"
let cmd = "nim c --threads:on -d:release -d:BenchLatencyTestCompileTime=1 " &
  "-o:" & outBin & " " & BenchLatencySrc
The integration tests compile the bench_latency binary into the system temporary directory using static filenames (e.g., bench_latency_t11_defaults). This approach has two issues: it can lead to collisions if tests are run in parallel or in a shared environment, and it leaves behind compiled binaries that are never cleaned up. Consider using createTempDir to create a unique workspace for each test run and ensure the directory is removed via a defer block.
## Compile bench_latency.nim with extra -d: defines, return path to the
## resulting binary. Compiles in release mode for realistic timing but
## with tiny message counts so the integration test stays fast.
let outBin = getTempDir() / ("bench_latency_t" & suffix)
Similar to the static binary names in the intdefine tests, compileBenchLatency creates binaries in the temporary directory that are not cleaned up. Additionally, for cross-platform compatibility, you should append ExeExt to the binary name to ensure it works correctly on Windows, although the current project seems focused on Unix-like environments.
let outBin = getTempDir() / ("bench_latency_t" & suffix & "_" & $getCurrentProcessId() & ExeExt)
Force-pushed a41a91d to b0c8224
@gemini-code-assist review
Code Review
This pull request refactors the latency benchmarking infrastructure by migrating bench_latency to a shared harness, unifying adapter result types, and adding native Bencher Metric Format (BMF) support. It also introduces comprehensive integration tests for benchmark configuration and BMF merging. Review feedback suggests improving CLI robustness by supporting standard POSIX flag formats and maintaining test code consistency by utilizing existing workspace helpers.
of "bmf-out":
  if p.val.len == 0:
    echo "Missing value for --bmf-out"
    quit 1
  bmfOutPath = p.val
The current CLI parsing for --bmf-out using parseopt expects the format --bmf-out=path. If a user provides --bmf-out path (with a space), p.val will be empty, and the program will quit with an error. While acceptable for a benchmark tool, it's worth noting that this deviates from standard POSIX-style flag parsing where both forms are typically supported.
suite "bench_latency multi-measure-per-slug merge (Task 1.5)":
  test "merge_bmf.py unions throughput + latency on shared slug":
    let dir = createTempDir("bench_latency_t15_", "")
In the Task 1.5 test suite, createTempDir is called directly instead of using the newTestWorkspace helper defined at line 19. For consistency and to ensure proper cleanup (as newTestWorkspace includes a descriptive prefix), it's better to use the helper here as well.
let dir = newTestWorkspace("t15_merge")
Force-pushed b0c8224 to 55c16bf
@gemini-code-assist review
Code Review
This pull request refactors the latency benchmarking tool to utilize a shared harness and unifies the queue adapter interface by re-exporting result types from a common module. It introduces native Bencher Metric Format (BMF) output via a new CLI flag, updates the CI workflow to merge these metrics, and adds comprehensive integration tests. Feedback was provided to use a Nim template to reduce repetitive boilerplate in the benchmark variant dispatch logic.
runLatencyHarness[SipsicAdapter[LatencyCapacity, uint64]](
  queueInit = initSipsic,
  messageCount = BenchLatencyMessageCount,
  runCount = BenchLatencyRuns,
  warmupCount = BenchLatencyWarmupRuns,
)
of "sipmuc":
  runLatencyHarness[LockfreequeuesSipmucAdapter[LatencyCapacity, 1, uint64]](
    queueInit = initSipmuc,
    messageCount = BenchLatencyMessageCount,
    runCount = BenchLatencyRuns,
    warmupCount = BenchLatencyWarmupRuns,
  )
of "mupsic":
  runLatencyHarness[LockfreequeuesMupsicAdapter[LatencyCapacity, 1, uint64]](
    queueInit = initMupsic,
    messageCount = BenchLatencyMessageCount,
    runCount = BenchLatencyRuns,
    warmupCount = BenchLatencyWarmupRuns,
  )
of "mupmuc":
  runLatencyHarness[MupmucAdapter[LatencyCapacity, uint64]](
    queueInit = initMupmuc,
    messageCount = BenchLatencyMessageCount,
    runCount = BenchLatencyRuns,
    warmupCount = BenchLatencyWarmupRuns,
  )
The runLatencyHarness calls for each variant are highly repetitive. Using a template can significantly reduce this boilerplate and improve maintainability, especially as more variants or configuration parameters are added in future PRs.
template runHarness(AdapterType, initProc): untyped =
runLatencyHarness[AdapterType](
queueInit = initProc,
messageCount = BenchLatencyMessageCount,
runCount = BenchLatencyRuns,
warmupCount = BenchLatencyWarmupRuns,
)
let metrics =
case variant
of "sipsic": runHarness(SipsicAdapter[LatencyCapacity, uint64], initSipsic)
of "sipmuc": runHarness(LockfreequeuesSipmucAdapter[LatencyCapacity, 1, uint64], initSipmuc)
of "mupsic": runHarness(LockfreequeuesMupsicAdapter[LatencyCapacity, 1, uint64], initMupsic)
of "mupmuc": runHarness(MupmucAdapter[LatencyCapacity, uint64], initMupmuc)
else:
raise newException(ValueError, "unknown variant: " & variant)
Force-pushed 55c16bf to 6c4cb5f
/ai-review
✅ Momus review posted — verdict APPROVE, 2 findings
/ai-review
2 similar comments
/ai-review
/ai-review
PR 1 of the bench-rollup series: rewrites bench_latency.nim on bench_common.runLatencyHarness, splits bench.yml into three jobs (throughput, latency, upload), unifies PushResult/PopResult types via re-export from adapter.nim, and adds a 7-test t_bench_latency suite. Two low-severity findings: the bench-latency CI job clones deps from the mutable main branch (inconsistent with bench-throughput and build.yml which pin to release tags), and the new tests are not wired into any CI workflow or nimble task.
Severity tally: 2 Lows.
Low
- BOT-A1 (.github/workflows/bench.yml:143): bench-latency CI job clones deps from main, making it non-deterministic
- BOT-A2 (tests/t_bench_latency.nim:1): New bench_latency tests are not wired into any CI workflow or nimble task
Noteworthy
- Clean unification of PushResult/PopResult via re-export from adapter.nim, eliminating the parallel definition split from PR 0.
- merge_bmf.py collision guard test is thorough — covers both the happy-path union and the (slug, measure) collision rejection.
- CI smoke shape tuning (50K × 11 runs × 2 warmup) is well-calibrated for the 12-min latency job budget.
Verdict: APPROVE.
Commands
- Comment /ai-review or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
- Reply to a finding with "won't fix", "by design", or "not a bug" to decline it.
- Reply with "instead, ..." to propose an alternative fix.
Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.
sudo apt-get -qq install -y clang

- name: Clone and install sibling Nim deps (nim-debra, nim-typestates)
  run: |
    set -e
BOT-A1 — Low (standards)
bench-latency CI job clones deps from main, making it non-deterministic
The new bench-latency job clones nim-debra and nim-typestates from --branch main. The sibling bench-throughput job (and build.yml) clone from pinned tags (v0.6.0, v0.7.0) with an explicit comment: "Pin to specific release tags (not main) for deterministic CI; bump in lockstep with build.yml and lockfreequeues.nimble's requires." Using main means any commit to those repos can silently break the bench-latency CI job without a corresponding change in this repo. This is inconsistent and undermines CI reproducibility.
Change `--branch main` to `--branch v0.6.0` for nim-debra and `--branch v0.7.0` for nim-typestates in the bench-latency job, matching bench-throughput.
## Tests for benchmarks/nim/bench_latency.nim — the latency bench binary.
##
## Track 1 (PR 1) covers: per-binary intdefines (Task 1.1), --bmf-out
## emission via runLatencyHarness (Task 1.2), and multi-measure-per-slug
## merge with throughput (Task 1.5).
##
## The bench binary is invoked as a subprocess in the integration tests
## (Tasks 1.2 / 1.5); compile-time intdefine assertions (Task 1.1) live
## in the binary itself behind a `BenchLatencyTestCompileTime` flag and
## are exercised from a tiny dedicated build invocation here.

import std/[json, os, osproc, strutils, tempfiles]
import unittest2

const
  RepoRoot = currentSourcePath().parentDir.parentDir
  BenchLatencySrc = RepoRoot / "benchmarks" / "nim" / "bench_latency.nim"

proc newTestWorkspace(prefix: string): string =
  ## Allocate a private workspace dir for one test. Each test that
  ## compiles a bench_latency binary or writes a BMF file uses this so:
  ## 1. parallel runs (or repeated runs in the same shell) cannot
  ##    collide on a static `/tmp/bench_latency_t11_*` filename, and
  ## 2. compiled binaries don't accumulate in the system temp dir.
  ## The caller is responsible for `removeDir` (typically via `defer`).
  ## `prefix` is a per-test stem that matches the original static suffix
  ## so test failure messages stay legible.
  result = createTempDir("bench_latency_" & prefix & "_", "")

# ---------- Task 1.1: intdefine defaults + override ----------

suite "bench_latency intdefines (Task 1.1)":
  test "defaults: BenchLatencyRuns == 33, BenchLatencyMessageCount == 100_000":
    # Compile bench_latency.nim with -d:BenchLatencyTestCompileTime=1.
    # The binary, when this define is set, runs a `static` block that
    # asserts the two intdefine defaults; if they are wrong, compilation
    # fails with the static assert message.
    let dir = newTestWorkspace("t11_defaults")
    defer: removeDir(dir)
    let outBin = dir / ("bench_latency" & ExeExt)
    let cmd = "nim c --threads:on -d:release -d:BenchLatencyTestCompileTime=1 " &
      "-o:" & outBin & " " & BenchLatencySrc
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    if exitCode != 0:
      echo "compile output:\n", output

  test "overrides: -d:BenchLatencyRuns=2 -d:BenchLatencyMessageCount=1000 take effect":
    # Compile with overrides + a different test flag that checks the
    # overridden values rather than the defaults.
    let dir = newTestWorkspace("t11_overrides")
    defer: removeDir(dir)
    let outBin = dir / ("bench_latency" & ExeExt)
    let cmd = "nim c --threads:on -d:release " &
      "-d:BenchLatencyTestCompileTimeOverrides=1 " &
      "-d:BenchLatencyRuns=2 -d:BenchLatencyMessageCount=1000 " &
      "-o:" & outBin & " " & BenchLatencySrc
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    if exitCode != 0:
      echo "compile output:\n", output

# ---------- Task 1.2: --bmf-out integration ----------

proc compileBenchLatency(
    extraDefs: openArray[string], dir: string
): string =
  ## Compile bench_latency.nim with extra -d: defines into `dir` and
  ## return the binary path. Caller owns `dir` and must remove it.
  ## Compiles in release mode for realistic timing but with tiny
  ## message counts so the integration test stays fast.
  let outBin = dir / ("bench_latency" & ExeExt)
  var cmd = "nim c --threads:on -d:release"
  for d in extraDefs:
    cmd.add(" -d:" & d)
  cmd.add(" -o:" & outBin & " " & BenchLatencySrc)
  let (output, exitCode) = execCmdEx(cmd)
  if exitCode != 0:
    raise newException(IOError, "bench_latency compile failed:\n" & output)
  result = outBin

suite "bench_latency --bmf-out integration (Task 1.2)":
  test "sipsic variant emits latency_p50_ns / latency_p99_ns on expected slug":
    # Override message count + runs to keep the integration run under ~5s.
    let dir = newTestWorkspace("t12_sipsic")
    defer: removeDir(dir)
    let bin = compileBenchLatency(@[
      "BenchLatencyMessageCount=200",
      "BenchLatencyRuns=2",
    ], dir = dir)
    let bmfPath = dir / "bench_latency.json"
    let cmd = bin & " --bmf-out=" & bmfPath & " sipsic"
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    check fileExists(bmfPath)
    let node = parseJson(readFile(bmfPath))
    # Expected slug per design 2.2 / table at design line 357.
    let slug = "lockfreequeues_sipsic/spsc/1p1c"
    check node.hasKey(slug)
    let s = node[slug]
    check s.hasKey("latency_p50_ns")
    check s.hasKey("latency_p95_ns")
    check s.hasKey("latency_p99_ns")
    check s["latency_p50_ns"]["value"].getFloat() > 0.0
    check s["latency_p99_ns"]["value"].getFloat() >= s["latency_p50_ns"]["value"].getFloat()
    # Stdout text output preserved (acceptance: positional CLI behavior).
    check output.contains("Sipsic") or output.contains("sipsic")

  test "all four bounded variants emit latency_p50_ns / latency_p99_ns":
    # Per impl plan Track 1 Acceptance Criteria: BMF JSON contains
    # latency_p50_ns and latency_p99_ns for sipsic / sipmuc / mupsic /
    # mupmuc on the 1p1c smoke shape.
    let dir = newTestWorkspace("t12_all4")
    defer: removeDir(dir)
    let bin = compileBenchLatency(@[
      "BenchLatencyMessageCount=200",
      "BenchLatencyRuns=2",
    ], dir = dir)
    let bmfPath = dir / "bench_latency.json"
    let cmd = bin & " --bmf-out=" & bmfPath &
      " sipsic mupmuc sipmuc mupsic"
    let (_, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    let node = parseJson(readFile(bmfPath))
    let expectedSlugs = @[
      "lockfreequeues_sipsic/spsc/1p1c",
      "lockfreequeues_sipmuc/mpmc/1p1c",
      "lockfreequeues_mupsic/mpsc/1p1c",
      "lockfreequeues_mupmuc/mpmc/1p1c",
    ]
    for slug in expectedSlugs:
      check node.hasKey(slug)
      check node[slug].hasKey("latency_p50_ns")
      check node[slug].hasKey("latency_p99_ns")

  test "unknown variant exits 1":
    let dir = newTestWorkspace("t12_unknown")
    defer: removeDir(dir)
    let bin = compileBenchLatency(@[
      "BenchLatencyMessageCount=200",
      "BenchLatencyRuns=2",
    ], dir = dir)
    let cmd = bin & " bogus_variant"
    let (_, exitCode) = execCmdEx(cmd)
    check exitCode == 1

# ---------- Task 1.5: multi-measure-per-slug merge ----------
#
# Validates the end-to-end shape that Track 1 ships: a single slug
# carries BOTH `throughput_ops_ms` (from bench_throughput's BMF
# fragment) and `latency_p50_ns` / `latency_p99_ns` (from bench_latency)
# AFTER `merge_bmf.py` unions the two fragments. Production CI does this
# with real bench output; the test uses two synthetic fragments so it
# stays fast and deterministic.

const MergeBmfPath = RepoRoot / "benchmarks" / "merge_bmf.py"

suite "bench_latency multi-measure-per-slug merge (Task 1.5)":
  test "merge_bmf.py unions throughput + latency on shared slug":
    let dir = createTempDir("bench_latency_t15_", "")
    defer: removeDir(dir)
    let throughputPath = dir / "throughput.json"
    let latencyPath = dir / "latency.json"
    let mergedPath = dir / "merged.json"
    let slug = "lockfreequeues_sipsic/spsc/1p1c"

    # Synthetic throughput fragment.
    writeFile(throughputPath,
      """{
        "lockfreequeues_sipsic/spsc/1p1c": {
          "throughput_ops_ms": {
            "value": 1234.5,
            "lower_value": 1200.0,
            "upper_value": 1270.0
          }
        }
      }""")
    # Synthetic latency fragment on the SAME slug, distinct measures.
    writeFile(latencyPath,
      """{
        "lockfreequeues_sipsic/spsc/1p1c": {
          "latency_p50_ns": { "value": 250.0 },
          "latency_p99_ns": { "value": 875.0 }
        }
      }""")

    let cmd = "python3 " & MergeBmfPath & " " & mergedPath &
      " " & throughputPath & " " & latencyPath
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 0
    if exitCode != 0:
      echo "merge stdout/stderr:\n", output
    check fileExists(mergedPath)

    let node = parseJson(readFile(mergedPath))
    check node.hasKey(slug)
    let s = node[slug]
    # All three measures from BOTH fragments must coexist on the shared slug.
    check s.hasKey("throughput_ops_ms")
    check s.hasKey("latency_p50_ns")
    check s.hasKey("latency_p99_ns")
    check s["throughput_ops_ms"]["value"].getFloat() == 1234.5
    check s["latency_p50_ns"]["value"].getFloat() == 250.0
    check s["latency_p99_ns"]["value"].getFloat() == 875.0

  test "merge_bmf.py exits 1 on collision when same measure key in both inputs":
    # Sanity: the per-slug union semantics are NOT a free-for-all;
    # ensure the collision guard from PR 0 Task 0.7 still fires when the
    # latency fragment accidentally re-declares throughput_ops_ms on the
    # same slug. This guards against silent overwrites that would erase
    # one of the measures.
    let dir = createTempDir("bench_latency_t15_collide_", "")
    defer: removeDir(dir)
    let aPath = dir / "a.json"
    let bPath = dir / "b.json"
    let mergedPath = dir / "merged.json"

    writeFile(aPath,
      """{
        "lockfreequeues_sipsic/spsc/1p1c": {
          "throughput_ops_ms": { "value": 100.0 }
        }
      }""")
    writeFile(bPath,
      """{
        "lockfreequeues_sipsic/spsc/1p1c": {
          "throughput_ops_ms": { "value": 200.0 }
        }
      }""")
    let cmd = "python3 " & MergeBmfPath & " " & mergedPath &
      " " & aPath & " " & bPath
    let (output, exitCode) = execCmdEx(cmd)
    check exitCode == 1
    check output.contains("collision")
BOT-A2 — Low (tests)
New bench_latency tests are not wired into any CI workflow or nimble task
The 7 tests in tests/t_bench_latency.nim (intdefine defaults, --bmf-out integration, merge_bmf.py collision guard) are never run by CI. tests/test.nim (the only nimble test entry point) does not include or import t_bench_latency. Neither build.yml nor bench.yml invokes nim c -r tests/t_bench_latency.nim. The same gap existed for t_bench_common.nim from PR 0, but the new tests in this PR add a fresh instance of the problem. Without CI execution, these tests will silently bitrot.
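A minimal way to close that gap would be a dedicated workflow step (a hedged sketch; the step name and placement are hypothetical, and the repo may prefer wiring the suite into its nimble test task instead):

```yaml
- name: Run bench_latency integration tests
  run: nim c -r --threads:on tests/t_bench_latency.nim
```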
Exposes per-binary {.intdefine.} overrides for latency wall-time
control: BenchLatencyRuns (default 33) and BenchLatencyMessageCount
(default 100_000). Mirrors the BenchSipsicRuns / MessageCount pattern
in bench_throughput.nim. Tests in tests/t_bench_latency.nim assert
defaults compile cleanly and that -d:BenchLatencyRuns=2
-d:BenchLatencyMessageCount=1000 overrides take effect via
compile-time static blocks.
Also fixes a stale import (./adapters/lockfreequeues_sipsic ->
./adapters/lockfreequeues_sipsic_adapter) so the file compiles after
PR 0 Task 0.9 renamed all bench adapters with the _adapter suffix.
The rest of the file (legacy runLatencyBenchmark / benchmarkLatency /
isMainModule) is preserved verbatim and gets fully rewritten in
Task 1.2 onto runLatencyHarness.
…Harness

Replaces the legacy ping-pong RTT body with a CLI mirroring
bench_throughput's: positional variant filter (sipsic / mupmuc / sipmuc /
mupsic), --bmf-out=<path> for native Bencher Metric Format JSON emission,
stdout text preserved. Per-binary run-shape intdefines wired in:
BenchLatencyRuns / BenchLatencyMessageCount / BenchLatencyWarmupRuns drive
runLatencyHarness directly.

Slug shape per design 2.2 (table at design line 357), 1p1c smoke shape
only; full grid lands in PR 2:
- lockfreequeues_sipsic/spsc/1p1c
- lockfreequeues_sipmuc/mpmc/1p1c
- lockfreequeues_mupsic/mpsc/1p1c
- lockfreequeues_mupmuc/mpmc/1p1c

Each carries latency_p50_ns / latency_p95_ns / latency_p99_ns measures
(p999 / max deferred to PR 6 per impl plan).

Type unification: PushResult / PopResult were duplicated across
benchmarks/nim/adapter.nim (legacy) and benchmarks/nim/bench_common.nim
(new), forcing two nominally distinct enums to coexist. Aliasing
adapter.PushResult / PopResult to bench_common's via re-export unifies the
surface: both adapter packs (legacy sipsic / mupmuc and the newer sipmuc /
mupsic / unbounded_*) now flow through the same runLatencyHarness and
runThroughputHarness without per-call-site conversion. Verified:
t_bench_common (26 tests) and bench_throughput --bmf-out integration
unchanged.

The Mupmuc / Sipsic adapters use getProducer(idx = 0) / getConsumer(idx = 0),
which bypasses Mupmuc's threadId-based registration, so the pre-built
producer / consumer slots are safe to drive from either of the harness's
pinger and ponger threads (Mupmuc's underlying Vyukov per-slot CAS protocol
is fully concurrent-safe).

Tests: tests/t_bench_latency.nim adds three Task 1.2 cases covering the
sipsic single-variant happy path, the four-variant slug coverage required
by Track 1 Acceptance Criteria, and the unknown-variant exit-1 path. All
pass with -d:BenchLatencyMessageCount=200 -d:BenchLatencyRuns=2 in under
~10s on Apple silicon.
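The BMF fragment shape this commit describes can be sketched as follows (a hedged illustration: the slug and measure key names come from this PR's tests, but the nearest-rank percentile helper is illustrative, not the harness's actual estimator):

```python
import json

def percentile(sorted_ns, q):
    # Nearest-rank percentile over an already-sorted list of ns samples.
    idx = min(len(sorted_ns) - 1, int(q * len(sorted_ns)))
    return float(sorted_ns[idx])

def bmf_fragment(slug, samples_ns):
    # One BMF slug carrying the three latency measures PR 1 ships
    # (p999 / max deferred to PR 6 per the impl plan).
    s = sorted(samples_ns)
    return {
        slug: {
            "latency_p50_ns": {"value": percentile(s, 0.50)},
            "latency_p95_ns": {"value": percentile(s, 0.95)},
            "latency_p99_ns": {"value": percentile(s, 0.99)},
        }
    }

frag = bmf_fragment("lockfreequeues_sipsic/spsc/1p1c",
                    [250, 300, 875, 260, 270])
print(json.dumps(frag, indent=2))
```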
….yml
Restructures bench.yml from one all-in-one bench-throughput job into
three jobs:
bench-throughput (renamed from 'benchmark'):
builds + runs bench_throughput, uploads bench-throughput-bmf artifact.
bench-latency (NEW):
builds + runs bench_latency, uploads bench-latency-bmf artifact.
Compile-time overrides: BenchLatencyMessageCount=50000,
BenchLatencyRuns=11, BenchLatencyWarmupRuns=2 (chosen so the four
1p1c smoke-shape variants finish in single-digit seconds inside the
12-min job budget).
bench-upload (NEW, needs both):
downloads bench-*-bmf artifacts via actions/download-artifact@v4
pattern, runs python3 benchmarks/merge_bmf.py merged.json
$(ls bmf-inputs/*.json), then runs the existing Bencher CLI upload
flows (PR comparison / push baseline / workflow_dispatch). One
bencher run invocation -> one Bencher Report -> latency + throughput
measures co-located on the shared per-slug history (design 1).
Bencher upload commands are unchanged in spirit: same PR / push /
workflow_dispatch gating, same --threshold-measure throughput on push
baselines, same yellow-warning preflight when BENCHER_API_TOKEN is
unset on forks. This commit does NOT apply the
`--threshold-measure throughput` -> `throughput_ops_ms` rename deferred
from PR 0 Task 0.10; PR 6 (Task 6.2) applies it together with per-measure
threshold blocks in the same edit.
Plan reference: impl plan Track 1 Tasks 1.3 (line 350-357) and 1.4
(line 359-366). The two tasks share the same file and are tightly
coupled (1.4 strictly depends on 1.3); folding them into one commit
keeps the workflow file consistent at every commit boundary.
actionlint clean. SC2046 word-splitting on $(ls ...) is intentional
(each path must arrive as a separate positional argv to merge_bmf.py)
and is suppressed inline with a justification comment; artifact
filenames are repo-controlled (no whitespace / glob chars).
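The word-splitting rationale above can be illustrated with a small sketch. This is a hypothetical Python helper (`build_merge_cmd` is not in the repo) showing the shape of the argv the bench-upload job must assemble; the directory layout mirrors the artifact-download step, and the filenames are examples.

```python
from pathlib import Path

def build_merge_cmd(base: Path, out: str = "merged.json") -> list[str]:
    # Collect every BMF fragment under base/bmf-inputs/ (recursively, to
    # also cover per-artifact subdirectories) and pass EACH path as its
    # own argv entry. Joining them into one string would hand
    # merge_bmf.py a single nonexistent input path, which is why the
    # workflow's $(ls ...) word-splitting is intentional.
    inputs = sorted(str(p) for p in (base / "bmf-inputs").rglob("*.json"))
    return ["python3", "benchmarks/merge_bmf.py", out, *inputs]
```

Because the filenames are repo-controlled and contain no whitespace or glob characters, the sorted list is deterministic across runs.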
Adds two cases to tests/t_bench_latency.nim that validate the
end-to-end shape Track 1 ships: a single slug carries BOTH
`throughput_ops_ms` (from bench_throughput's BMF fragment) and
`latency_p50_ns` / `latency_p99_ns` (from bench_latency) AFTER
`merge_bmf.py` unions the two fragments.
1. Multi-measure union — writes synthetic throughput + latency
fragments on the same slug, runs python3 benchmarks/merge_bmf.py,
and asserts the merged JSON carries all three measures with their
original values intact.
2. Collision guard — re-declares `throughput_ops_ms` on the same
slug across both inputs and asserts merge_bmf.py exits 1 with a
'collision' message in stderr. This regression-tests PR 0 Task 0.7's
collision guard against silent overwrites that would erase one of
the colliding measures.
Both tests use createTempDir + defer removeDir for isolated, leak-free
fixtures and run in <200ms because they bypass the bench compile +
execute path entirely. Plan reference: impl plan Track 1 Task 1.5
(line 368-376).
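The union-with-collision-guard contract those two tests exercise can be restated compactly. The real implementation lives in benchmarks/merge_bmf.py; the sketch below is an illustrative Python re-statement of the behavior, not the shipped code.

```python
def merge_fragments(fragments: list[dict]) -> dict:
    # Union per-slug measure dicts across N BMF fragments. A measure may
    # appear under a slug in at most one input; a repeat is a collision.
    merged: dict[str, dict] = {}
    for frag in fragments:
        for slug, measures in frag.items():
            slot = merged.setdefault(slug, {})
            for measure, value in measures.items():
                if measure in slot:
                    # Refuse rather than silently overwrite one of the
                    # colliding measures.
                    raise ValueError(
                        f"collision on slug={slug} measure={measure}")
                slot[measure] = value
    return merged

throughput = {"lockfreequeues_sipsic/spsc/1p1c":
              {"throughput_ops_ms": {"value": 12000.0}}}
latency = {"lockfreequeues_sipsic/spsc/1p1c":
           {"latency_p50_ns": {"value": 250.0},
            "latency_p99_ns": {"value": 1500.0}}}
merged = merge_fragments([throughput, latency])
```

After the union, the single slug carries all three measures, which is exactly the multi-measure-per-slug shape Track 1 ships.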
Adds two bullets to the [Unreleased] Added section covering bench_latency
--bmf-out emission and the new three-job bench.yml structure
(bench-throughput / bench-latency / bench-upload). Adds a Changed bullet
documenting the adapter.nim PushResult / PopResult unification.

Plan reference: impl plan Track 1 Task 1.7 (line 386-393).
…ps to variants

Per Gemini review on PR #20.

1. Unknown-flag error message hardcoded `--` regardless of cmdShortOption
   vs cmdLongOption. `-x` printed as `Unknown flag: --x`. Now picks the
   right prefix from p.kind.
2. `groups` is a leftover name from bench_throughput where positional
   args ARE topology groups. In bench_latency they're individual
   variants. Renamed for clarity.
…t dirs

Two CI hardening fixes from code review on PR 1:

1. The bench-upload job overrode GITHUB_TOKEN permissions to only
   pull-requests:write + checks:write but then ran actions/checkout and
   actions/download-artifact, which need contents:read and actions:read.
   Add both so the job is robust against orgs that restrict default
   token permissions.
2. Switch from merge-multiple:true to per-artifact subdirectories.
   Today's artifact filenames (throughput.json, latency.json) are
   unique, but PR 2's topology split adds bench_spsc/mpsc/mpmc/unbounded
   binaries; each could end up writing files that collide under
   merge-multiple. Each artifact now lands at
   bmf-inputs/<artifact-name>/<file> and merge_bmf.py is invoked over
   the recursive find.
The harness already computes metrics.p999_ns; surfacing it during manual runs costs nothing and matches the legacy bench_latency output. BMF wire-up for latency_p999_ns / latency_max_ns is deferred to PR 6 alongside threshold configuration.
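The percentile convention behind those p50/p95/p99/p999 figures (linear interpolation between adjacent order statistics, numpy's default per the Task 0.4 commit) can be sketched in a few lines. This is an illustrative Python version, not the in-tree Nim helpers in bench_common.

```python
def percentile(samples: list[float], p: float) -> float:
    # Linear interpolation between adjacent sorted samples, matching
    # numpy's default 'linear' method. Empty input returns 0.0, the
    # behavior the legacy stats module defined.
    s = sorted(samples)
    if not s:
        return 0.0
    rank = p * (len(s) - 1)      # fractional rank into the sorted array
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)
```

On [0..99] this gives p50 = 49.5 and the p = 0.0 / 1.0 extremes equal to min and max, the same parity checks the t_bench_common tests assert.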
Replace static /tmp/bench_latency_t1*_* paths with per-test createTempDir
workspaces (via newTestWorkspace helper) cleaned up by defer: removeDir.
Use ExeExt for the compiled binary name so the suffix is correct on
Windows. Reshape compileBenchLatency to take an owned `dir` instead of a
`suffix` so workspace lifecycle stays with the test that allocates it.

Prevents collisions between parallel test runs / repeated runs in the
same shell, and stops binaries from accumulating in the system temp dir.
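The owned-workspace pattern is the same in any language; a Python analogue of the createTempDir + defer: removeDir lifecycle (the `with_workspace` helper and its names are illustrative, not repo code):

```python
import tempfile, shutil
from pathlib import Path

def with_workspace(run) -> None:
    # Each test allocates its own uniquely named workspace...
    ws = Path(tempfile.mkdtemp(prefix="bench_latency_test_"))
    try:
        run(ws)
    finally:
        # ...and tears it down even when the test body fails, so nothing
        # accumulates in the system temp dir and parallel runs never
        # collide on a shared static path.
        shutil.rmtree(ws)

with_workspace(lambda ws: (ws / "bin").write_bytes(b""))
```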
…tency into benchtests
6c4cb5f to 2684687 Compare
2684687 into feat/bench-rollup-pr0-bench-common
…ts (#19)

* feat(bench): stub bench_common module with public API surface (Task 0.1)

Compile-only stub for benchmarks/nim/bench_common.nim. Types and proc
signatures are final per design doc section 2.1; bodies raise
AssertionDefect with 'not implemented' so tasks 0.2-0.6 can be written
and reviewed in parallel against this surface.

Public surface stubbed:
- Topology enum (tSpsc/tMpsc/tMpmc + unbounded variants)
- PushResult / PopResult adapter primitives
- BMFEmitter + initBMFEmitter / addMeasure / emit
- Stats helpers (mean / stddev / minVal / maxVal / percentile)
- Histogram + initHistogram / record / topK / percentile
  (HistogramTopK = 1000, HistogramReservoir = 99000)
- LatencyMetrics + runLatencyHarness
- ThroughputMetrics + runThroughputHarness

Test (tests/t_bench_common.nim) is compile-time-only: it references each
promised symbol via 'when declared' and 'when compiles', leaving behavior
tests to subsequent tasks. Stub file: 170 LOC (acceptance cap: 200).

* feat(bench): implement BMFEmitter with alpha-sorted output (Task 0.2)

initBMFEmitter / addMeasure / emit per design doc section 2.1.
- addMeasure stores via Table[slug, Table[measure, MeasureValue]]; NaN
  sentinel for lower/upper bounds maps to none[float](), finite values
  map to some(v).
- emit writes JSON with slugs alpha-sorted at the top level and measures
  alpha-sorted within each slug. lower_value/upper_value fields are
  emitted only when the corresponding bound is some.
- Output format: pretty(root, 2) plus a trailing newline so diffs in
  CI / local-dev surface line-level changes cleanly.

Tests (tests/t_bench_common.nim): empty emit produces {}, two slugs
alpha-ordered, measures alpha-ordered within slug, NaN bounds omitted
from output, finite bounds emitted as lower_value/upper_value.

* feat(bench): implement Stats helpers (Task 0.4)

mean / stddev / minVal / maxVal / percentile bodies. Sort-based
percentile uses linear interpolation between adjacent samples; matches
numpy's default percentile interpolation ('linear'). stddev uses ddof=1
(sample) for consistency with numpy default. Empty-input behavior matches
the legacy benchmarks/nim/stats.nim that this module replaces (returns
0.0 across the board).

Tests: numpy-baseline parity for stddev on [1,2,3,4]; min/max on
unordered input; percentile(p=0.5) on [0..99] = 49.5; percentile extremes
(p=0.0, p=1.0) match min and max.

* feat(bench): implement Histogram top-K + reservoir percentile (Task 0.3)

Online percentile tracker per design doc section 2.1.
- topKHeap: min-heap of HistogramTopK = 1000 largest values seen.
  record() pushes if heap < K; else replace if value > heap[0] (evicts
  the min into the body bucket via reservoirAdmit).
- reservoir: Algorithm R uniform sample of body distribution, size
  HistogramReservoir = 99_000.
- debugMode (initHistogram(true)): records every sample to debugAll and
  percentile() returns the exact sort answer over debugAll. Used as the
  test oracle and as a debug-only fallback per design 2.1.

Lookup rule with stratified-estimator math:
  p == 1.0 -> top-K max
  seenAll*(1-p) <= topKHeap.len -> top-K rank lookup (linear interp)
  otherwise -> reservoir, with p rescaled to
    p_body = (p*seenAll) / (seenAll - K)
  since the reservoir samples ONLY the body stratum (values that lost
  the top-K race). Without rescaling a reservoir-based p99
  underestimates by ~25% on log-normal data.

Tests: p99 within 1% of sort fallback on 100K log-normal samples;
percentile(1.0) == top-K max; p50 within 5% of sort fallback; debug mode
returns exact sort answer; topK() returns ascending-sorted samples.

* feat(bench): implement runThroughputHarness (Task 0.5)

Generic N-producer / N-consumer throughput runner factored out of the
legacy benchmarks/nim/bench_throughput.nim. Parameterized over the queue
init proc and the queue type Q (constrained at instantiation time via
'mixin push' / 'mixin pop' so adapter-specific push/pop are resolved at
the call site).

The remainder-spread deadlock-avoidance rule is preserved verbatim: if
messageCount mod numProducers != 0, the leading producers each get one
extra item so the producer-side total exactly equals messageCount; same
for consumers. Without this, the harness silently degrades into a
deadlock (consumers waiting forever) or strands items in the queue
(consumers stop early).

Returns ThroughputMetrics{ops_ms_mean, ops_ms_stddev, runs}; warmup runs
are discarded. Smoke test in tests/t_bench_common.nim with a 1024-cap
channel SmokeAdapter at 1P/1C × 1000 messages × 1 run.

* feat(bench): implement runLatencyHarness (Task 0.6)

Ping-pong RTT runner. For each run:
- allocates two queues fwd + rev of type Q via queueInit()
- spawns 1 pinger + 1 ponger thread
- pinger pushes a monotonic-ns timestamp via fwd, spins on rev for the
  echo, records RTT (ns) into a per-run Histogram
- ponger pops from fwd, pushes the same value to rev

Aggregation per design 2.5 'Run aggregation for latency': reported
percentiles are the MEAN across runs of each percentile, NOT the
percentile of unioned samples. Per-run histograms isolate scheduler
artifacts to the run that contains them.

Returns LatencyMetrics{p50_ns, p95_ns, p99_ns, p999_ns, max_ns, samples}.
samples is the cumulative across-runs sample count.

This task ships the 1P/1C smoke topology only; multi-thread sharding
lands in PR 1 (Task 1.x). 1P/1C is sufficient for the smoke check because
PR 1's first BMF measure uses Sipsic 1P/1C as the smoke shape.

Test: SmokeAdapter (channel-backed), 1000 messages, 1 run, 0 warmup;
asserts samples >= 1000 and p50 < p99 < max.

* feat(bench): add merge_bmf.py BMF JSON merge utility (Task 0.7)

Stateless union of per-slug measure dicts across N input BMF JSON files
into one output JSON, per design doc section 4.3. Required because
Bencher.dev creates a separate Report per 'bencher run' invocation;
multiple uploads from sibling CI jobs would NOT co-locate multiple
measures on a single per-slug history.

Contract: merge_bmf.py <output_path> <input1> [input2 ...]
- Slug regex: ^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$
- Measure regex: ^[a-z][a-z0-9_]*$
- Value fields ('value', 'lower_value', 'upper_value') must be finite,
  non-NaN floats. Any malformed input or non-finite value -> exit 1.
- Collision (same slug + same measure across two inputs) -> exit 1 with
  stderr 'error: collision on slug=<slug> measure=<measure> in files
  <pathA> and <pathB>'.
- Output is alpha-sorted at both top level (slugs) and within each slug
  (measures), 2-space indent.

Tests in benchmarks/tests/test_merge_bmf.py cover all 8 design-spec cases
plus an alpha-sort assertion: zero-input usage error, single empty
pass-through, single valid round-trip, two disjoint slugs, shared slug +
disjoint measures, collision, NaN rejection, invalid measure key
(uppercase), invalid slug shape.

* feat(bench): add 5 missing lockfreequeues adapters (Task 0.8)

Adds the 5 adapters not yet present per design doc 2.2:
- lockfreequeues_sipmuc_adapter.nim (bounded SPMC)
- lockfreequeues_mupsic_adapter.nim (bounded MPSC)
- lockfreequeues_unbounded_sipsic_adapter.nim
- lockfreequeues_unbounded_sipmuc_adapter.nim
- lockfreequeues_unbounded_mupmuc_adapter.nim

Each adapter file exports:
- 'topologiesSupported: set[Topology]' constant (consumed by PR 3 Task
  3.11 to restrict shape generation; declared but unused in PR 0)
- 'make<Lib>Adapter' factory taking 'capacity: int'
- 'cleanup' tear-down proc
- 'push(item)' returning bench_common.PushResult
- 'pop()' returning bench_common.PopResult[T]

Smoke test in tests/t_bench_common.nim: each adapter pushes 100
sequential uint64s, pops them, asserts count == 100 AND set-equality with
the input range.

Implementation note (Nim 2.2.6 codegen workaround): The unbounded
adapters store the queue INLINE rather than via heap pointer.
Heap-allocating UnboundedSipsic/UnboundedSipmuc/UnboundedMupmuc and then
calling reset() or =destroy() on the referent from a generic adapter
cleanup proc triggers 'internal error: getTypeDescAux(tyGenericParam)' OR
'internal error: expr(nkIdent); unknown node kind' in Nim 2.2.6 when
bench_common is also imported. Inline storage lets the compiler emit the
per-instance =destroy automatically when the adapter goes out of scope,
sidestepping the bug. The DEBRA manager is still heap-allocated because
the queue constructors take 'ptr DebraManager'. Track upstream Nim repro
for follow-up.

* PR 0 Task 0.9: rename existing adapters with `_adapter` suffix and
export `topologiesSupported`

Per design section 2.2 (`<library_slug>_adapter.nim` filename convention):
- Rename via `git mv` (history preserved):
  - lockfreequeues_sipsic.nim -> lockfreequeues_sipsic_adapter.nim
  - lockfreequeues_mupmuc.nim -> lockfreequeues_mupmuc_adapter.nim
  - lockfreequeues_unbounded_mupsic.nim ->
    lockfreequeues_unbounded_mupsic_adapter.nim
- Update bench_throughput.nim import paths to match.
- Add `topologiesSupported*: set[Topology]` constant to all four existing
  adapters (3 renamed + channels_adapter). Use selective
  `from ../bench_common import Topology, <member>` to avoid pulling
  PushResult into scope, which would conflict with adapter.PushResult
  (the legacy concept in ../adapter still imported by these adapters).
- `topologiesSupported` is exported in PR 0 but consumed only by PR 3
  Task 3.11; the "unused export" hint is expected per the impl plan.

* PR 0 Task 0.10: native BMF emission via --bmf-out= flag in
bench_throughput

Add --bmf-out=<path> flag (parsed via std/parseopt) that emits Bencher
Metric Format JSON for every variant that ran. With the flag absent the
binary is bit-for-bit unchanged; the existing positional CLI
(`bench_throughput sipsic mupmuc unbounded_mupsic channels`) is preserved
verbatim.

Slug shape per design 2.2:
  lockfreequeues_sipsic/spsc/1p1c
  lockfreequeues_mupmuc/mpmc/{1,2,4,8}p{1,2,4,8}c
  lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c
  nim_channels/mpmc/{1,2,4}p{1,2,4}c

Each slug carries the `throughput_ops_ms` measure as {value: mean,
lower_value: mean-stddev, upper_value: mean+stddev} — matching the
convention bmf_adapter.py used (which this code replaces in Task 0.12).

Note on slug for built-in channels: the human-readable adapter name()
returns "nim/channels" (kept for stdout text), but the slug uses
underscore form `nim_channels` so it satisfies the merge_bmf.py SLUG_RE
pattern `^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$`.

Per-variant {.intdefine.}s for runs/warmup added so the integration test
can build a fast variant via -d:BenchSipsicRuns=2 -d:BenchSipsicWarmup=0.
Default values (10 runs, 0 warmup) match the prior hard-coded behavior.

Selective `from ./bench_common import` is used to pull only {BMFEmitter,
initBMFEmitter, addMeasure, emit} — importing the full module would
shadow `results.ThroughputMetrics` with `bench_common.ThroughputMetrics`
and break the legacy bespoke harness procs that still drive Mupmuc and
UnboundedMupsic.

Test: tests/t_bench_common.nim adds an integration test that compiles
bench_throughput with small overrides, runs it with --bmf-out=<tmp>
sipsic, parses the resulting JSON, and asserts the expected slug and
measure exist with a positive value. Stdout text output is also asserted
(`Sipsic` substring) so the legacy CLI contract is exercised.

* PR 0 Task 0.11: bench.yml — native BMF + merge step, drop legacy parser

Apply the edits from design section 3 (PR 0):
(a) Remove `Run adapter unit tests` step. The Python regex parser it
    sanity-checked is gone in Task 0.12; nothing left to test.
(b) Remove `Convert to BMF JSON` step. bench_throughput now emits BMF
    natively via `--bmf-out=throughput.json` (Task 0.10).
(c) Update `Run bench_throughput` step to add the
    `--bmf-out=throughput.json` flag while preserving the existing
    positional variant filter and the 20-minute step timeout.
(d) Insert `Merge BMF JSON` step running
    `python3 benchmarks/merge_bmf.py merged.json throughput.json`.
    Single-input today, but stays in place for PR 2-4 when bench is split
    into per-topology binaries each producing its own fragment.
(e) Change all three `bencher run` invocations and the debug-cat step to
    use `--file merged.json` instead of `bench_results.json`.

actionlint clean. The bench upload contract is unchanged: same project,
branch, testbed, thresholds, comparison semantics — only the file source
flips from regex-parsed text to native BMF.

* PR 0 Task 0.12: delete legacy bmf_adapter.py + bench_main.nim, refactor
consumers

Three legacy files removed via `git rm`:
- `benchmarks/bmf_adapter.py` — regex parser for bench_throughput stdout.
  Replaced by native BMF emission via `--bmf-out=` (Task 0.10).
- `benchmarks/test_bmf_adapter.py` — unit tests for the parser. Replaced
  by `benchmarks/tests/test_merge_bmf.py` (Task 0.7).
- `benchmarks/nim/bench_main.nim` — aggregator binary that wrapped
  bench_throughput + bench_latency and emitted a custom JSON shape.
  Bench_throughput is now the canonical entry point and emits BMF
  directly; bench_latency follows in PR 1.

Live consumers refactored in this same task (per impl plan §0.12):
- `benchmarks/runner.py`: build_nim now compiles bench_throughput.nim;
  run_nim invokes `bench_throughput --bmf-out=<path>` instead of
  `bench_main --runs=<n> -o=<path>`. The `runs` argument is preserved in
  the CLI for back-compat but is now compile-time-only.
- `lockfreequeues.nimble`: `task benchmarks` switched to
  bench_throughput. Output path corrected to `.tmp/bench_throughput`
  (matching the project nim.cfg `--outdir:.tmp`).
- `benchmarks/README.md`: rewritten to document the new flow
  (bench_common module, adapter convention, merge_bmf.py, native BMF
  emission, expected slug set).
- `.gitignore`: drop the `bench_main` build-artifact ignore line.

`git grep "bmf_adapter\|bench_main"` returns only intentional historical
references in code comments and CHANGELOG (prior release notes).

Acceptance: `nimble tasks` parses;
`nim c benchmarks/nim/bench_throughput.nim` compiles;
tests/t_bench_common.nim passes all 26 tests including the Task 0.10
--bmf-out integration test.

* PR 0 Task 0.13: render_readme.nim consumes new BMF shape

The legacy bench_main aggregator emitted JSON of shape
`{results: [{implementation, thread_config, metrics: {...}}],
metadata: {...}}`. Bench_main was deleted in Task 0.12, replaced by
`bench_throughput --bmf-out` (Task 0.10) + `merge_bmf.py` (Task 0.7)
which produce Bencher Metric Format JSON of shape
`{<lib>/<topology>/<P>p<C>c: {<measure>: {value, lower_value?,
upper_value?}}}`.

`benchmarks/render_readme.nim` now walks the BMF shape directly:
- New `parseSlug(slug)` helper splits `<lib>/<topology>/<P>p<C>c` into
  `(impl, "<P>P/<C>C")` for the two table columns. Strict on shape:
  rejects slugs with the wrong segment count or non-digit P/C, so
  malformed entries do not produce noise rows.
- `loadRows` walks every top-level slug, pulls
  `<slug>.throughput_ops_ms.value` for the throughput column, and
  `<slug>.latency_p50_ns.value` for the optional latency column (PR 1+
  emits this; PR 0 does not, so all PR 0 rows show "—").
- BMF has no `metadata` block, so the meta argument is always nil.
  `renderTable` already accepts nil and skips the platform header in that
  case — no change needed.
- `when isMainModule` block guarded by `not defined(renderReadmeNoMain)`
  so unit tests can include this file without triggering the README
  rewrite. CLI invocations (`nim r benchmarks/render_readme.nim`) leave
  the define unset and run main as before.
Test: tests/t_render_readme.nim (new) writes a fixture BMF JSON with
three slugs (mupmuc/2p2c, sipsic/1p1c, nim_channels/1p1c), runs
`loadRows` and `renderBlock`, and asserts:
- 3 rows produced; impl/threads/throughput parsed correctly per slug
- alpha sort by impl preserved
- meta == nil (BMF carries no metadata)
- rendered Markdown table contains every impl name and value
- latency column is em-dash for every row (no `latency_p50_ns` in
  fixture)
- missing JSON file falls back to placeholder
- empty `{}` BMF document falls back to placeholder

All 4 new tests + 26 existing t_bench_common tests pass.

* PR 0 Task 0.14: CHANGELOG entry under [Unreleased] for bench-rollup PR 0

Document Track 0 of the bench-rollup feature under [Unreleased]:
- Added: bench_common.nim module, 5 new lockfreequeues adapters,
  merge_bmf.py CLI, --bmf-out flag, per-variant run-count intdefines.
- Changed: bench_throughput emits BMF natively, 3 adapters renamed to the
  <library_slug>_adapter.nim convention, render_readme.nim consumes the
  new BMF shape, runner.py / nimble task / benchmarks/README.md point at
  bench_throughput.
- Removed: bmf_adapter.py, test_bmf_adapter.py, bench_main.nim.

Each entry is one bullet (one logical change), no test plan checklist, no
GitHub issue references.

* fix(bench): plug heap-allocated queue leaks in latency and throughput
harness

Per Gemini review on PR #19. Adapters whose Q heap-allocates the backing
queue (lockfreequeues bounded mupsic/sipmuc/mupmuc, channels) leaked the
queue every time runOneLatencyRun or runOneThroughputRun returned:
`var fwd = queueInit()` only frees the adapter struct's stack slot at
scope exit, not the create()-d Mupsic/Sipmuc/Mupmuc/Channel pointee.
Across N runs * M warmups the leak grew without bound.

Three-part fix:
1. bench_common.nim: harness now drops the queue via a uniform
   `mixin cleanup; when compiles(cleanup(q)): cleanup(q)` block after
   joinThread (and before any return path, including the elapsedNs<=0
   short-circuit in runOneThroughputRun). `when compiles` keeps
   value-type adapters (sipsic, unbounded inline storage) usable without
   forcing them to define a no-op cleanup.
2. lockfreequeues_mupsic / sipmuc cleanup procs: prepend reset(a.queue[])
   so Mupsic/Sipmuc's `=destroy` actually runs before the heap slot is
   freed. Bare dealloc skipped any queue-internal cleanup that an
   ARC/ORC-generated destructor would otherwise do (managed T fields,
   future atomic-cleanup hooks). Same fix on the mupmuc adapter inside
   deinitMupmucAdapter.
3. mupmuc and channels adapters gain a `cleanup` alias around their
   existing deinit*Adapter procs so the harness's mixin cleanup call
   resolves uniformly across adapter naming conventions.

Verified by smoke build of `runLatencyHarness` + `runThroughputHarness`
against LockfreequeuesMupsicAdapter; no leak-on-return, percentiles and
ops/ms reported as expected.

* fix(bench): destroy queue before manager; guard zero-duration throughput

Two fixes flagged by code review on PR 0:
1. Unbounded MPMC adapter cleanup() freed the DebraManager while the
   inline queue still held a pointer to it. The queue's =destroy then
   called unbindClient(self.manager[]) on freed memory. Reset the queue
   first (running its destructor while the manager is still valid), then
   dealloc the manager. Applied to both unbounded_mupmuc and
   unbounded_sipmuc adapters.
2. The legacy bench_throughput run procs computed messageCount/elapsedNs
   without guarding against elapsedNs <= 0, which can occur on fast CI
   runners with sub-nanosecond clock granularity. A zero-duration
   measurement produces inf, which merge_bmf.py rejects via
   math.isfinite. Return 0.0 in that case so the sample is recorded as a
   valid (degenerate) value.
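The zero-duration guard can be shown in a few lines. This is a hedged Python sketch (`ops_per_ms` is a hypothetical name, not the in-tree Nim proc) of the rule the fix enforces: a degenerate but finite 0.0 survives merge_bmf.py's isfinite check, whereas inf would fail the entire upload.

```python
def ops_per_ms(message_count: int, elapsed_ns: int) -> float:
    # Guard: sub-nanosecond clock granularity on a fast runner can yield
    # elapsed_ns <= 0; dividing by it would produce inf (or raise), and
    # downstream merging rejects non-finite values.
    if elapsed_ns <= 0:
        return 0.0  # degenerate but finite, so the sample survives merging
    return message_count / (elapsed_ns / 1e6)  # elapsed ns -> ms, then ops/ms
```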
* fix(bench): guard zero-thread harness; expose adapter queues; drop
non-finite BMF

Three further code-review fixes on PR 0:
1. runOneThroughputRun would crash with DivByZeroDefect on
   numProducers==0 / numConsumers==0 because the workload-distribution
   step does messageCount div N. The harness is exported, so a future
   caller could pass 0 thinking 'no producers' is a degenerate but valid
   shape. Guard with doAssert + a clear message instead of letting the
   worker thread crash with no context. Also assert messageCount >= 0.
2. lockfreequeues_mupsic_adapter and lockfreequeues_sipmuc_adapter stored
   a single Producer/Consumer at construction time but kept the
   underlying queue pointer non-exported, so multi-thread shapes couldn't
   actually register per-thread producers/consumers as the adapter
   docstrings already described. Export the queue field (used by today's
   bench_throughput multi-producer code path) and add explicit
   getProducer*/getConsumer* accessor procs that wrap
   queue.getProducer(idx)/getConsumer(idx) for harness consumers.
3. Defense in depth on the BMF emitter. The PR previously added
   elapsedNs <= 0 guards to the legacy bench_throughput run procs; this
   pass also rejects non-finite values inside addMeasure itself. If a
   future code path produces inf/NaN despite the upstream guard, the
   offending (slug, measure) entry is silently dropped instead of being
   written to the BMF, which merge_bmf.py would otherwise reject and fail
   the entire upload on. Same classification widened on toBound for
   lower/upper bounds.

* perf(bench): bulk-percentile to avoid 5x reservoir sort per latency run

Histogram.percentile() previously sorted the reservoir (up to 99k
samples) on every call, so each latency run re-sorted the same data five
times for p50/p95/p99/p999/max. Add Histogram.percentiles() that takes an
array of probabilities and computes them all from a single sorted
snapshot. Refactor percentile(openArray[float]) through a new
percentileSorted() helper so the sort/lerp can be reused without a fresh
allocation. Update runOneLatencyRun to use the bulk API.

* fix(render_readme): bump row cap; restore platform/cores/timestamp line

- Bump the row cap from 8 to 32. The bench matrix today already exceeds 8
  (1 sipsic + 4 mupmuc + 3 unbounded mupsic + 3 channels = 11) and
  PR 3 / PR 4 add comparison adapters; the old cap silently truncated the
  table. PR 5 replaces this auto-rendered block with a hand-curated 4-row
  summary, so the higher cap is a transitional safeguard.
- Reconstruct platform / cores / timestamp metadata at render time
  instead of pulling from JSON. BMF has no metadata block, so the
  per-table metadata line was dropping silently in the legacy -> BMF
  migration. detectMetadata() falls back to runtime detection
  (hostOS/hostCPU, countProcessors, now()) and honors three env vars
  (BENCH_RENDER_PLATFORM / _CORES / _TIMESTAMP) for devs re-rendering
  historical snapshots from a different host. renderTable still honors
  any meta JsonNode caller passes through (forward compatibility for a
  future BMF-with-metadata extension).

* ci: pin sibling-dep clones to nim-debra v0.6.0 / nim-typestates v0.7.0

Both build.yml and bench.yml clone the matching siblings under
`../nim-debra` and `../nim-typestates` so nim.cfg's `--path:` directives
resolve. They were cloning `--branch main`, which floats: a downstream
breaking change in either repo would silently propagate into our CI
without the corresponding `requires` bump in lockfreequeues.nimble.
Switch to specific release tags. Bumps now require touching three places
in lockstep (this comment, the `requires` line, the matching release
tag), which is the right cost for a dep update. Drop the stale "Once
debra 0.3.0 publishes…" comment — debra is at v0.6.0 and typestates at
v0.7.0, both released and registered.

* ci(bench): cost gate — drop feat/** PR base; require `bench` label

Two compounding gates so cascade-rebases of stacked feature PRs do not
burn Bencher.dev metric-ingest quota:
A) `pull_request: branches:` is reduced to [main, devel]. Stacked
   feature-to-feature PRs (`feat/foo` -> `feat/bar`) no longer trigger
   this workflow at all. Only the topmost PR in any stack (the one whose
   base is devel/main) gets PR-time bench treatment.
B) The `Track PR benchmarks with Bencher` step is now gated on a `bench`
   label being present on the PR. Without the label the step is skipped
   even on PRs targeting devel; the bench binaries still compile and run
   (free GHA minutes, useful for stdout-level spot-checks), but no
   `bencher run` upload fires. Add the label deliberately when you want a
   comparison report on a revision.

The preflight step is renamed and extended so the check summary displays
a clear "skipped because no `bench` label" warning when B is the gate,
paralleling the existing missing-token branch. Push to main/devel
(baseline upload) and workflow_dispatch (manual upload) are unchanged.

* docs(bench): align Quick Start CI command with current 1M/500k budget

* ci(bench): wire t_bench_common into nimble benchtests + CI job

Adds a 'benchtests' nimble task that compiles/runs the bench harness
tests, which live outside srcDir and are intentionally not imported by
tests/test.nim (avoids forcing every entry of the MM/sanitizer test
matrix to compile bench harness threading machinery).

Adds a 'bench-tests' job to bench.yml that invokes 'nimble benchtests' on
every PR and push. The job has no secret dependencies, so fork PRs are
allowed; only the github-actions[bot] loop-back guard is kept. Sibling
Nim deps pin to v0.6.0 (debra) and v0.7.0 (typestates), matching the
existing 'benchmark' job in this workflow.
* PR 1 Task 1.1: BenchLatencyRuns/BenchLatencyMessageCount intdefines Exposes per-binary {.intdefine.} overrides for latency wall-time control: BenchLatencyRuns (default 33) and BenchLatencyMessageCount (default 100_000). Mirrors the BenchSipsicRuns / MessageCount pattern in bench_throughput.nim. Tests in tests/t_bench_latency.nim assert defaults compile cleanly and that -d:BenchLatencyRuns=2 -d:BenchLatencyMessageCount=1000 overrides take effect via compile-time static blocks. Also fixes a stale import (./adapters/lockfreequeues_sipsic -> ./adapters/lockfreequeues_sipsic_adapter) so the file compiles after PR 0 Task 0.9 renamed all bench adapters with the _adapter suffix. The rest of the file (legacy runLatencyBenchmark / benchmarkLatency / isMainModule) is preserved verbatim and gets fully rewritten in Task 1.2 onto runLatencyHarness. * PR 1 Task 1.2: rewrite bench_latency.nim onto bench_common.runLatencyHarness Replaces the legacy ping-pong RTT body with a CLI mirroring bench_throughput's: positional variant filter (sipsic / mupmuc / sipmuc / mupsic), --bmf-out=<path> for native Bencher Metric Format JSON emission, stdout text preserved. Per-binary run-shape intdefines wired in: BenchLatencyRuns / BenchLatencyMessageCount / BenchLatencyWarmupRuns drive runLatencyHarness directly. Slug shape per design 2.2 (table at design line 357), 1p1c smoke shape only — full grid lands in PR 2: - lockfreequeues_sipsic/spsc/1p1c - lockfreequeues_sipmuc/mpmc/1p1c - lockfreequeues_mupsic/mpsc/1p1c - lockfreequeues_mupmuc/mpmc/1p1c Each carries latency_p50_ns / latency_p95_ns / latency_p99_ns measures (p999 / max deferred to PR 6 per impl plan). Type unification: PushResult / PopResult were duplicated across benchmarks/nim/adapter.nim (legacy) and benchmarks/nim/bench_common.nim (new), forcing two nominally distinct enums to coexist. 
Aliasing adapter.PushResult / PopResult to bench_common's via re-export unifies the surface — both adapter packs (legacy sipsic / mupmuc and the newer sipmuc / mupsic / unbounded_*) now flow through the same runLatencyHarness and runThroughputHarness without per-call-site conversion. Verified: t_bench_common (26 tests) and bench_throughput --bmf-out integration unchanged. The Mupmuc / Sipsic adapters use getProducer(idx = 0) / getConsumer( idx = 0) which bypasses Mupmuc's threadId-based registration, so the pre-built producer / consumer slots are safe to drive from either of the harness's pinger and ponger threads (Mupmuc's underlying Vyukov per-slot CAS protocol is fully concurrent-safe). Tests: tests/t_bench_latency.nim adds three Task 1.2 cases covering the sipsic single-variant happy path, the four-variant slug coverage required by Track 1 Acceptance Criteria, and the unknown-variant exit-1 path. All pass with -d:BenchLatencyMessageCount=200 -d:BenchLatencyRuns=2 in under ~10s on Apple silicon. * PR 1 Tasks 1.3 + 1.4: bench-latency job + N-input merge step in bench.yml Restructures bench.yml from one all-in-one bench-throughput job into three jobs: bench-throughput (renamed from 'benchmark'): builds + runs bench_throughput, uploads bench-throughput-bmf artifact. bench-latency (NEW): builds + runs bench_latency, uploads bench-latency-bmf artifact. Compile-time overrides: BenchLatencyMessageCount=50000, BenchLatencyRuns=11, BenchLatencyWarmupRuns=2 (chosen so the four 1p1c smoke-shape variants finish in single-digit seconds inside the 12-min job budget). bench-upload (NEW, needs both): downloads bench-*-bmf artifacts via actions/download-artifact@v4 pattern, runs python3 benchmarks/merge_bmf.py merged.json $(ls bmf-inputs/*.json), then runs the existing Bencher CLI upload flows (PR comparison / push baseline / workflow_dispatch). One bencher run invocation -> one Bencher Report -> latency + throughput measures co-located on the shared per-slug history (design 1). 
Bencher upload commands are unchanged in spirit: same PR / push / workflow_dispatch gating, same --threshold-measure throughput on push baselines, same yellow-warning preflight when BENCHER_API_TOKEN is unset on forks. This commit does NOT correct the historical `--threshold-measure throughput` -> `throughput_ops_ms` rename deferred from PR 0 Task 0.10; PR 6 (Task 6.2) handles per-measure threshold blocks in the same edit.

Plan reference: impl plan Track 1 Tasks 1.3 (lines 350-357) and 1.4 (lines 359-366). The two tasks share the same file and are tightly coupled (1.4 strictly depends on 1.3); folding them into one commit keeps the workflow file consistent at every commit boundary.

actionlint clean. SC2046 word-splitting on $(ls ...) is intentional (each path must arrive as a separate positional argv to merge_bmf.py) and is suppressed inline with a justification comment; artifact filenames are repo-controlled (no whitespace / glob chars).

* PR 1 Task 1.5: smoke test for multi-measure-per-slug merge

Adds two cases to tests/t_bench_latency.nim that validate the end-to-end shape Track 1 ships: a single slug carries BOTH `throughput_ops_ms` (from bench_throughput's BMF fragment) and `latency_p50_ns` / `latency_p99_ns` (from bench_latency) AFTER `merge_bmf.py` unions the two fragments.

1. Multi-measure union — writes synthetic throughput + latency fragments on the same slug, runs python3 benchmarks/merge_bmf.py, and asserts the merged JSON carries all three measures with their original values intact.
2. Collision guard — re-declares `throughput_ops_ms` on the same slug across both inputs and asserts merge_bmf.py exits 1 with a 'collision' message on stderr. This regression-tests PR 0 Task 0.7's collision guard against silent overwrites that would erase one of the colliding measures.

Both tests use createTempDir + defer removeDir for isolated, leak-free fixtures and run in <200ms because they bypass the bench compile + execute path entirely.
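The union-with-collision-guard contract these two cases pin down can be sketched as follows. This is a hedged stand-in, not the actual merge_bmf.py source: the `merge_fragments` helper and its message format are hypothetical, but the behaviour matches what the tests assert — union of BMF fragments keyed by slug, exit 1 on any (slug, measure) collision.

```python
import json
import sys


def merge_fragments(paths):
    """Union BMF fragments keyed by slug.

    Measures that land on a shared slug must have disjoint keys;
    a repeated (slug, measure) pair aborts with exit code 1 rather
    than silently overwriting one of the colliding values.
    """
    merged = {}
    for path in paths:
        with open(path) as f:
            fragment = json.load(f)
        for slug, measures in fragment.items():
            bucket = merged.setdefault(slug, {})
            for measure, value in measures.items():
                if measure in bucket:
                    # (slug, measure) collision: refuse to overwrite
                    print(f"collision: {slug}/{measure}", file=sys.stderr)
                    sys.exit(1)
                bucket[measure] = value
    return merged
```

Under this contract a throughput fragment and a latency fragment that share a slug merge cleanly, while re-declaring the same measure on the same slug across two inputs fails loudly.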
Plan reference: impl plan Track 1 Task 1.5 (lines 368-376).

* PR 1 Task 1.7: CHANGELOG entry under [Unreleased] for bench-rollup PR 1

Adds two bullets to the [Unreleased] Added section covering bench_latency --bmf-out emission and the new three-job bench.yml structure (bench-throughput / bench-latency / bench-upload). Adds a Changed bullet documenting the adapter.nim PushResult / PopResult unification.

Plan reference: impl plan Track 1 Task 1.7 (lines 386-393).

* fix(bench-latency): correct prefix on unknown-flag error; rename groups to variants

Per Gemini review on PR #20.

1. The unknown-flag error message hardcoded `--` regardless of cmdShortOption vs cmdLongOption, so `-x` printed as `Unknown flag: --x`. Now picks the right prefix from p.kind.
2. `groups` is a leftover name from bench_throughput, where positional args ARE topology groups. In bench_latency they are individual variants. Renamed for clarity.

* fix(bench): grant bench-upload contents/actions read; isolate artifact dirs

Two CI hardening fixes from code review on PR 1:

1. The bench-upload job overrode GITHUB_TOKEN permissions to only pull-requests:write + checks:write but then ran actions/checkout and actions/download-artifact, which need contents:read and actions:read. Add both so the job is robust against orgs that restrict default token permissions.
2. Switch from merge-multiple:true to per-artifact subdirectories. Today's artifact filenames (throughput.json, latency.json) are unique, but PR 2's topology split adds bench_spsc/mpsc/mpmc/unbounded binaries; each could end up writing files that collide under merge-multiple. Each artifact now lands at bmf-inputs/<artifact-name>/<file> and merge_bmf.py is invoked over the recursive find.

* fix(bench): print p999 in latency stdout for operator visibility

The harness already computes metrics.p999_ns; surfacing it during manual runs costs nothing and matches the legacy bench_latency output.
BMF wire-up for latency_p999_ns / latency_max_ns is deferred to PR 6 alongside threshold configuration.

* test(bench_latency): isolate test workspaces with createTempDir + ExeExt

Replace static /tmp/bench_latency_t1*_* paths with per-test createTempDir workspaces (via a newTestWorkspace helper) cleaned up by defer: removeDir. Use ExeExt for the compiled binary name so the suffix is correct on Windows. Reshape compileBenchLatency to take an owned `dir` instead of a `suffix` so workspace lifecycle stays with the test that allocates it. Prevents collisions between parallel test runs / repeated runs in the same shell, and stops binaries from accumulating in the system temp dir.

* ci(bench-latency): pin sibling deps to v0.6.0/v0.7.0; wire t_bench_latency into benchtests

* PR 2 Task 2.1: capture pre-split slug-set fixture for deletion-safety

Snapshots the BMF emitted by the legacy bench_throughput at the same small-override knobs the topology-split tests will use, then commits the JSON to tests/fixtures/pre-split-slugs.json. Track 2's deletion-safety check (Task 2.7) verifies that the post-split union is a strict superset of this fixture so no slug silently disappears with the split.

Adds tests/t_topology_split.nim as the RED test stub: the fixture-shape suite (Task 2.1) passes against the committed JSON; the bench_spsc / bench_mpsc / bench_mpmc / bench_unbounded suites (Tasks 2.3-2.6) and the superset suite (Task 2.7) fail loudly until those binaries + the superset_check.py helper land.

Fixture covers 11 slugs:
- sipsic spsc/1p1c
- mupmuc mpmc/{1,2,4,8}p{1,2,4,8}c (8p8c is the issue #15 livelock case)
- unbounded_mupsic mpsc_unbounded/{1,2,4}p1c
- nim_channels mpmc/{1,2,4}p{1,2,4}c

* PR 2 Task 2.3: bench_spsc binary (Sipsic 1p1c)

Splits the SPSC slice out of the legacy bench_throughput.nim.
Wraps SipsicAdapter in a uniform queueInit closure and runs through runThroughputHarness so this binary stays under 100 lines and shares the per-run shaping (warmup, run-count, monotonic-ns timing, remainder-spread distribution) with the rest of the topology split.

Per-binary intdefines (design 2.5):
- -d:BenchSpscRuns (default 33)
- -d:BenchSpscMessageCount (default 1_000_000)
- -d:BenchSpscWarmup (default 3)

Behind -d:BenchSpscTestCompileTime the binary's `static` block asserts the three defaults. The runtime + BMF emission contract is covered by tests/t_topology_split.nim's bench_spsc suite (Task 2.3), which compiles at 1000 messages * 2 runs and parses the emitted lockfreequeues_sipsic/spsc/1p1c slug.

* PR 2 Task 2.4: bench_mpsc binary (Mupsic 1p1c, 2p1c, 4p1c)

Splits the MPSC slice out of the legacy bench_throughput.nim. Mupsic requires per-thread Producer objects via queue.getProducer(idx); the generic runThroughputHarness assumes a uniform queue.push call, so this binary keeps the bespoke harness pattern from the legacy file: pre-assign Producer values on the main thread (idx = 0..P-1) and ship them into worker thread contexts. The shape-iteration loop instantiates three concrete Mupsic types {1,2,4}p1c so P stays a static int.

Per-binary intdefines (design 2.5):
- -d:BenchMpscRuns (default 33)
- -d:BenchMpscMessageCount (default 1_000_000)
- -d:BenchMpscWarmup (default 3)

Slug shape: lockfreequeues_mupsic/mpsc/{1,2,4}p1c. Behind -d:BenchMpscTestCompileTime the binary's static block asserts the three defaults. Runtime behavior is covered by tests/t_topology_split.nim's bench_mpsc suite.

* PR 2 Task 2.5: bench_mpmc binary (Mupmuc grid + 8p8c, Sipmuc, channels)

Splits the MPMC slice out of the legacy bench_throughput.nim into a single binary covering three queue families.
Mupmuc and Sipmuc both need per-thread Producer / Consumer values via getProducer(idx) / getConsumer(idx); the harness pre-assigns those on the main thread and ships them into worker contexts (mirroring the bench_throughput Mupmuc pattern). Channels has a uniform push/pop interface and runs through bench_common.runThroughputHarness.

Coverage:
- Mupmuc {1,2,4} P x {1,2,4} C grid (9 shapes) plus 8p8c. 8p8c is the issue #15 livelock regression case preserved from the pre-split bench_throughput fixture so the deletion-safety check does not silently lose that coverage.
- Sipmuc 1p{1,2,4}c (single-producer, multi-consumer; lives under mpmc per design 2.4).
- channels {1,2,4} P x {1,2,4} C grid.

Per-binary intdefines (design 2.5):
- -d:BenchMpmcRuns (default 33)
- -d:BenchMpmcMessageCount (default 1_000_000)
- -d:BenchMpmcWarmup (default 3)

Both lockfreequeues mupmuc and sipmuc export a `Consumer` symbol; the contexts qualify with `mupmuc.Consumer` / `sipmuc.Consumer` to keep type resolution unambiguous.

* PR 2 Task 2.6: bench_unbounded binary (4 unbounded variants)

Splits the unbounded slice out of the legacy bench_throughput.nim. Covers all 4 lockfreequeues unbounded variants at their natural shapes:
- unbounded_sipsic spsc_unbounded/1p1c (no DEBRA)
- unbounded_sipmuc mpmc_unbounded/1p{1,2,4}c (DEBRA on consumers)
- unbounded_mupsic mpsc_unbounded/{1,2,4}p1c (DEBRA on producers)
- unbounded_mupmuc mpmc_unbounded/{1,2,4}p{1,2,4}c grid (DEBRA on both sides)

UnboundedSipsic is a plain SPSC pair; the other three use DEBRA epoch reclamation, so producer/consumer threads call registerThread(manager) on their own thread before queue.getProducer(handle) / queue.getConsumer(handle). UnboundedMupsic reuses the existing UnboundedMupsicAdapter (which pre-registers the consumer at adapter init) and drives pop from the calling thread, mirroring the legacy bench_throughput path.
UnboundedSipmuc and UnboundedMupmuc are constructed inline with a heap-pointer DebraManager since their constructors take a `ptr DebraManager`.

Per-variant intdefines (design 2.5):
- -d:UnboundedSipsicRuns / UnboundedSipsicMessageCount (3 / 500_000)
- -d:UnboundedSipmucRuns / UnboundedSipmucMessageCount (3 / 500_000)
- -d:UnboundedMupsicRuns / UnboundedMupsicMessageCount (3 / 500_000)
- -d:UnboundedMupmucRuns / UnboundedMupmucMessageCount (3 / 500_000)
- -d:BenchUnboundedWarmup (default 2)

Per-variant defines are chosen over a shared BenchUnboundedRuns so CI can budget unbounded_mupmuc/4p4c independently from the cheaper unbounded_sipsic/1p1c (design 2.5 rationale).

* PR 2 Task 2.7: superset_check.py + deletion-safety wiring

Adds benchmarks/scripts/superset_check.py: a slug-set subset check that exits 0 when the post-split BMF (union of bench_spsc + bench_mpsc + bench_mpmc + bench_unbounded after merge_bmf.py) covers every slug in the pre-split fixture, and exits 1 with the missing slugs listed on stderr otherwise. The check operates on JSON top-level keys only; it does not compare measure values, bounds, or ordering, so the harness is free to reshape per-run statistics across the split without tripping the deletion-safety guard.

Also wires the check into tests/t_topology_split.nim's deletion-safety suite (compile + run all four binaries at small overrides, merge via merge_bmf.py, then invoke superset_check.py against the committed fixture and assert exit 0).

benchmarks/tests/test_superset_check.py covers 9 cases: identical sets, strict superset, missing slugs (with alpha-sorted listing), empty pre, wrong arg count, missing pre file, malformed JSON, and top-level non-object. All exit codes use 1 for failure to match the existing benchmarks/merge_bmf.py contract.

* PR 2 Task 2.8: bench.yml matrix over 5 topology binaries + per-step timeouts

Replaces the legacy bench-throughput + bench-latency job pair with a single matrix job over the 5 topology-split binaries.
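The check's contract — compare top-level slug keys only, ignore measure values, list missing slugs alpha-sorted on stderr, exit 1 on any gap — can be sketched like this. This is a hypothetical stand-in for illustration, not the committed superset_check.py; the `check_superset` name and message wording are assumptions.

```python
import json
import sys


def check_superset(pre_path, post_path):
    """Exit 0 iff every slug (top-level JSON key) in the pre-split
    fixture also appears in the post-split merged BMF.

    Values, bounds, and ordering are deliberately ignored: only slug
    deletion trips the guard.
    """
    with open(pre_path) as f:
        pre = set(json.load(f))
    with open(post_path) as f:
        post = set(json.load(f))
    missing = sorted(pre - post)  # alpha-sorted listing for stable output
    if missing:
        for slug in missing:
            print(f"missing slug: {slug}", file=sys.stderr)
        sys.exit(1)
```

Because only key sets are compared, the topology binaries can reshape their per-run statistics freely without breaking deletion-safety.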
Each matrix entry runs in its own GitHub Actions job with an independent `timeout-minutes: 12` budget so a hang in one binary cannot burn the whole workflow's clock; the surviving binaries finish, the bench-upload job merges what arrived, and the operator gets partial Bencher coverage rather than no coverage.

The compile step's `case` sets per-binary `-d:` overrides matching design 2.5: bounded throughput at 1M messages * 5 runs * 2 warmup, unbounded at 500K * 3 runs * 2 warmup, latency at 50K messages * 11 runs * 2 warmup.

bench-upload now also runs the deletion-safety check (superset_check.py vs tests/fixtures/pre-split-slugs.json) right after merge_bmf.py, so any silent slug regression introduced by future edits to the topology binaries fails the PR check.

Artifact naming changes: each matrix entry uploads `bench-<binary>-bmf` (e.g. `bench-bench_spsc-bmf`); the upload job's download pattern is unchanged (`bench-*-bmf`) so it picks up every new artifact without further edits.

actionlint clean; yamllint shows only the pre-existing "missing document start" warning shared across all repo workflows.

* PR 2 Task 2.9: 5-input union test in test_merge_bmf.py

Adds test_five_input_union covering the upload-job pipeline shape that PR 2's matrix produces: five sibling BMF fragments (one per binary — bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded, bench_latency) arrive via actions/download-artifact, are merged via merge_bmf.py, and must produce a single output whose slug set is the disjoint union of the inputs.

The test deliberately collides the sipsic 1p1c slug between bench_spsc (throughput_ops_ms) and bench_latency (latency_p50_ns, latency_p99_ns) on the SLUG axis with disjoint MEASURE keys, mirroring the production multi-measure-per-slug pattern from Track 1 Task 1.5. The merge must preserve every measure on the shared slug; collisions on (slug, measure) pairs continue to exit 1 (covered by the existing test_collision_exits_1 case, no regression).
* PR 2 Task 2.10: delete bench_throughput.nim, rewire consumers

Removes benchmarks/nim/bench_throughput.nim now that the five topology-split binaries (bench_spsc / bench_mpsc / bench_mpmc / bench_unbounded + bench_latency from PR 1) cover its full slug surface. The pre-split fixture committed in Task 2.1 plus the superset_check.py guard wired into bench.yml in Task 2.8 enforce that no slug from the legacy binary silently disappears across the split.

Live consumers updated to reference the new binaries:
- benchmarks/runner.py — builds + runs all 5 binaries, then merges fragments via merge_bmf.py.
- lockfreequeues.nimble — the `nimble benchmarks` task iterates the 5 binaries and merges to results/latest.json.
- benchmarks/README.md — rewritten to describe the 5-binary pipeline, the matrix CI job, and the merged BMF schema (a single slug can carry both throughput and latency measures).
- .gitignore — tracks the four new throughput binaries; also ignores the worktree-local `deps` symlink that nim.cfg's relative path expects.
- tests/t_bench_common.nim — removes the obsolete Task 0.10 bench_throughput integration suite (now covered by t_topology_split.nim per topology binary).
- benchmarks/nim/adapters/lockfreequeues_unbounded_mupsic_adapter.nim — doc-comment rewired from bench_throughput to bench_unbounded.

Comment-only references in bench_common.nim, bench_latency.nim, render_readme.nim, and the new bench_*.nim files describe the legacy binary as historical context (the legacy `runThroughputBenchmark` proc / Mupmuc registration pattern that bench_common.runThroughputHarness factored out) and stay verbatim.

* PR 2 Task 2.11: CHANGELOG entry under [Unreleased] for bench-rollup PR 2

Adds Track 2 PR 2 entries to the [Unreleased] section under Added, Changed, and Removed:

- Added: 4 new topology-split throughput binaries (bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded), the superset_check.py deletion-safety guard, and the 5-input merge_bmf union test.
- Changed: bench.yml matrix over 5 binaries with per-step timeouts + superset_check.py wired into bench-upload; runner.py and the `nimble benchmarks` task iterate the 5 binaries; README rewrite.
- Removed: bench_throughput.nim (replaced by the 4 topology-split binaries; the pre-split fixture + superset_check.py guard pin deletion-safety).

* fix(bench): tighten bench_unbounded CI shape to fit 18-min budget

The bench_unbounded matrix item was timing out at 10 minutes after only 4 of 16 shapes (sipsic 1p1c, sipmuc 1p1c/1p2c/1p4c) due to oversubscribed shapes on 4-vCPU runners. Reduce message counts and runs across the heavy fan-out variants (the mupmuc 9-shape grid is the main offender) and bump the job timeout to 18 min for cushion. Statistical signal is preserved for Bencher trend tracking; absolute throughput numbers will shift but per-shape comparisons remain meaningful.

- UnboundedSipsicMessageCount: 500K -> 200K (1 shape, runs kept at 3)
- UnboundedSipmuc: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupsic: 500Kx3 -> 100Kx2 (3 shapes)
- UnboundedMupmuc: 500Kx3 -> 50Kx2 (9 shapes)
- BenchUnboundedWarmup: 2 -> 1
- timeout-minutes: 12 -> 18

* fix(bench): gate oversubscribed C>=4 unbounded shapes behind a define

bench_unbounded was hanging in CI on the unbounded_sipmuc 1p4c shape: 4 consumers compete in a strict round-robin slot order (each consumer must take its turn before another can pop), and on a 4-vCPU runner one descheduled consumer blocks every other. The same starvation pattern hits every unbounded mupmuc shape with C>=4.

Add `-d:BenchSkipOversubscribed` (set in PR CI bench.yml) that drops:
- unbounded_sipmuc/mpmc_unbounded/1p4c
- unbounded_mupmuc/mpmc_unbounded/{1p4c, 2p4c, 4p1c, 4p2c, 4p4c}

mupsic shapes (multi-producer, single consumer) are kept regardless: the round-robin ordering exists on consumers only, so a single consumer cannot starve. mupsic 4p1c is also in tests/fixtures/pre-split-slugs.json and the deletion-safety guard would fail without it.
The full grid still runs in `bench-comparison.yml` nightly cron / on workflow_dispatch, where a beefier runner is available. PR CI now gets the meaningful subset (4 mupmuc shapes covering each P,C in {1,2}) that fits a 4-vCPU runner without hanging.

* fix(bench): document p95 in BMF schema; clean up runner.py fragment files

Per Gemini review on PR #21.

1. The README.md schema example listed only latency_p50_ns + latency_p99_ns, but bench_latency.nim emits all three (p50/p95/p99). Added p95 to the schema block and the prose paragraph below it. The Metrics section already correctly listed p50/p95/p99.
2. runner.py left N timestamped fragment files in RESULTS_DIR after each merged run (one per Nim binary) — only the merged file is the canonical artifact. Wrap the run+merge in try/finally and unlink each fragment after, so a bench-binary failure or a merge collision still cleans up whatever was produced.

* fix(bench): run upload on partial failures; correct timeout comment

Three CI hardening fixes from code review on PR 2:

1. bench-upload had needs:[bench] without if:always(), so any single matrix-leg failure would skip the merge+upload entirely. Per the job comment the design is partial-coverage tolerant: surviving binaries should still be merged and uploaded. Add if: always() && (fork-PR guard) and a guard step that fails loudly when zero JSON fragments are present.
2. The per-binary upload-artifact step now runs with if: always() and if-no-files-found: ignore so a step-level timeout flushes any partial JSON the binary wrote, and a binary that crashed before writing the file does not turn the matrix leg red for the merger.
3. Job and step comments referenced a 12-min timeout while the configured value was 18 min. Update both comments to 18.

* fix(bench): drop unparseable BMF fragments before merge

bench-upload runs with if: always(), so a matrix leg that crashed mid-write can leave a truncated JSON fragment.
merge_bmf.py exits 1 on the first malformed input, blocking uploads from surviving legs. Pre-validate with json.load and drop any fragments that fail, emitting a ::warning:: per dropped file.

* perf(bench): backoffOnPeerWait in busy-spin loops; document teardown order

- Add backoffOnPeerWait to producer/consumer busy loops in bench_mpmc (Sipmuc producer + Sipmuc consumer + Mupmuc consumer) and bench_mpsc (Mupsic producer + consumer) to avoid CPU starvation in oversubscribed runs. The Mupmuc producer already had it.
- Add backoffOnPeerWait to all four UnboundedSipsic/USipmuc/UMupsic/UMupmuc consumer loops in bench_unbounded.nim.
- Document the per-iteration teardown order in runUSipmucShape and reference it from runUMupmucShape: reset(q) MUST precede reset(manager[]) because the UnboundedSipmuc/Mupmuc =destroy unbinds from the manager, which then asserts clientCount==0. Note that under Nim 2.x ARC/ORC, system.reset(x) is =destroy(x) + =wasMoved(x) — it does NOT skip destructors (verified against the project's own =destroy implementations in src/lockfreequeues/unbounded_sipmuc.nim: reset(self.manager[])).

* docs(bench_unbounded): defend `create(T)` as alloc0, not destructor-on-uninit

A reviewer flagged that `create(DebraManager[...])` followed by `manager[] = initDebraManager[...]()` "runs destructors on uninitialised memory." It does not. `proc create*(T: typedesc, size)` in system/memalloc.nim is `cast[ptr T](alloc0(sizeof(T) * size))` — zero-initialised allocation with no hook invocation. The follow-up assignment then performs the only construction.

Add an explanatory note in the per-iteration teardown block comment (runUSipmucShape) so this doesn't keep getting flagged. The other harness (runUMupmucShape) already back-references that block comment. No behaviour change.
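The fragment pre-validation step from the fragment-drop fix above can be sketched as follows. The `keep_valid_fragments` helper is a hypothetical stand-in for what the bench-upload step does before invoking merge_bmf.py; the exact warning wording is an assumption, but the `::warning::` prefix is the standard GitHub Actions annotation syntax.

```python
import json


def keep_valid_fragments(paths):
    """Return only the fragment paths that json.load can parse.

    A truncated fragment from a crashed matrix leg is dropped with a
    GitHub Actions ::warning:: annotation instead of failing the whole
    merge, so surviving legs still get uploaded.
    """
    valid = []
    for path in paths:
        try:
            with open(path) as f:
                json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            print(f"::warning::dropping unparseable BMF fragment {path}: {exc}")
            continue
        valid.append(path)
    return valid
```

merge_bmf.py then runs only over the surviving list, keeping its strict exit-1-on-malformed-input contract intact for inputs that do reach it.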
* fix(bench): pass -d:danger in runner.py build_nim() to match CI

CI compiles the topology-split bench binaries with -d:release -d:danger; runner.py was missing -d:danger, so locally produced numbers were not comparable to the cloud baseline.

* test: t_unbounded_padding red — segment alloc returns 16B-aligned, expects 64B

PR 3 Task 3.1 (cache-line padding audit, step 1 of 3). Verifies that the bounded variants already conform; introduces the failing test that will drive the unbounded fix in 3.2 and 3.3.

Bounded MPMC audit (read-only verification, design doc §4.2):
- src/lockfreequeues/sipsic.nim:19-20 head/tail Atomic[int] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/sipmuc.nim:42-43 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupsic.nim:42-43 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupmuc.nim:43-44 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓

The bounded MPMC cell array also uses the documented array-of-cells trick (typestates/mpmc_cell.nim:11) so adjacent slots never share a cache line. No bounded change required.

Unbounded variants currently FAIL both audit conditions:
- Segment fields (head, tail, prevConsumerIdx, committed) lack {.align: CacheLineBytes.} pragmas, so they share cache lines with neighbouring data.
- newSegment() uses c_calloc, which on x86_64 Linux glibc returns 16-byte aligned memory (MALLOC_ALIGNMENT = 2*sizeof(size_t)). The {.align.} pragma controls only the intra-struct offset, not the base.

Test additions:
- tests/t_unbounded_padding.nim — 8 assertions across 4 unbounded variants: (a) offsetOf(SegmentField) mod CacheLineBytes == 0 for the producer/consumer coordination atomics; (b) cast[uint](segPtr) mod CacheLineBytes == 0 for a freshly allocated Segment via the queue's allocator.
- Test-only accessors guarded by `when defined(testing):` in each unbounded module: headSegmentForTest (returns the pointer) and segmentHeadOffsetForTest (returns offsets of audit-relevant fields).
Compiled away in release.
- Wired into the tests/test.nim aggregator.

RED state confirmed:

  Unbounded queue Segment cache-line padding
  [Summary] 8 tests run: 0 OK, 8 FAILED
  off.tail mod 64 was 8 (expected 0)
  off.prevConsumerIdx mod 64 was 16
  off.committed mod 64 was 24
  cast[uint](segPtr) mod 64 was 16

The existing test suite continues to compile (verified via `nim c tests/test.nim`).

Refs: design doc §4.2; impl plan Task 3.1.

* feat(internal): add aligned_alloc.allocAligned via posix_memalign

PR 3 Task 3.2 (cache-line padding audit, step 2 of 3). Adds the cache-line-aligned heap allocator that replaces c_calloc in the unbounded queue Segment allocation path (call sites swap in Task 3.3).

Compile probe (impl plan §3.2.0):

  $ nim c -r tmp_probe.nim
  rc=0 align=0

verifying that std/posix.posix_memalign is available on macOS (libSystem) and Linux glibc with the same C signature, returning 64-byte aligned memory on success. The probe file was deleted post-verification per the impl plan.

Implementation (src/lockfreequeues/internal/aligned_alloc.nim):
- ``proc allocAligned*[T](): ptr T`` — calls posix_memalign with CacheLineBytes (64) alignment, zero-inits, raises OutOfMemDefect on rc != 0. The caller releases via the standard c_free per the POSIX guarantee that posix_memalign blocks are free-compatible.
- Re-exports CacheLineBytes from atomic_dsl for callers in 3.3.
- An isMainModule smoke check confirms tail-call alignment.

Tests (tests/t_aligned_alloc.nim, wired into tests/test.nim):
- Tiny payload (sizeof < 64).
- Line-sized payload (sizeof == 64).
- Big payload (sizeof > 64).
- 64 sequential allocations, all 64B-aligned.

All 4 cases assert pointer alignment AND zero-initialization. Result: internal/aligned_alloc.allocAligned — [Summary] 4 tests run: 4 OK, 0 FAILED.

The unbounded padding audit test (tests/t_unbounded_padding.nim) is still RED at this commit; Task 3.3 swaps the Segment allocation call sites and adds {.align: CacheLineBytes.} pragmas, taking it GREEN.

Refs: design doc §4.2; impl plan Task 3.2.
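The posix_memalign contract the probe verifies — rc 0 on success, a base pointer that is a multiple of the requested alignment, and free()-compatible memory — can be exercised directly from Python via ctypes on any POSIX system. This is a hedged illustration of the C-level contract only (the `alloc_aligned` wrapper below is a hypothetical name), not the Nim implementation:

```python
import ctypes

# Bind posix_memalign / free from the process's own libc.
libc = ctypes.CDLL(None, use_errno=True)
libc.posix_memalign.argtypes = [ctypes.POINTER(ctypes.c_void_p),
                                ctypes.c_size_t, ctypes.c_size_t]
libc.posix_memalign.restype = ctypes.c_int
libc.free.argtypes = [ctypes.c_void_p]


def alloc_aligned(size: int, alignment: int = 64) -> ctypes.c_void_p:
    """Allocate `size` bytes whose base address is a multiple of
    `alignment` (a power of two that is a multiple of sizeof(void*)).
    Raises MemoryError when posix_memalign reports failure (rc != 0)."""
    mem = ctypes.c_void_p()
    rc = libc.posix_memalign(ctypes.byref(mem), alignment, size)
    if rc != 0:
        raise MemoryError(f"posix_memalign returned {rc}")
    return mem


p = alloc_aligned(128)
assert p.value % 64 == 0   # base address is cache-line aligned
libc.free(p)               # posix_memalign blocks are free()-compatible
```

This mirrors the probe's `rc=0 align=0` output: return code zero and a zero remainder when the pointer is taken modulo 64.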
* fix(unbounded): pad Segment fields and base-align via posix_memalign

PR 3 Task 3.3 (cache-line padding audit, step 3 of 3 — GREEN). Applies the cache-line padding fix to all four unbounded queue variants and their per-segment producer/consumer atomics. Resolves the asymmetry called out in design doc §4.2: bounded variants were already padded; unbounded Segments were not, so cross-library comparison numbers in PR 3 would have been biased against lockfreequeues' unbounded types.

Production changes (4 files):
- src/lockfreequeues/unbounded_sipsic.nim — Segment.{next, head, tail} + UnboundedSipsic.{headSegment, tailSegment} gain {.align: CacheLineBytes.}. newSegment swaps c_calloc -> allocAligned.
- src/lockfreequeues/unbounded_sipmuc.nim — Segment.{next, tail, prevConsumerIdx} + UnboundedSipmuc.{headSegment, tailSegment} gain alignment. newSegment swaps to allocAligned.
- src/lockfreequeues/unbounded_mupsic.nim — Segment.{next, tail, committed} + UnboundedMupsic.{headSegment, tailSegment} gain alignment. newSegment swaps to allocAligned. ``head: int`` (single consumer, non-atomic) is intentionally left unpadded — only the consumer touches it.
- src/lockfreequeues/unbounded_mupmuc.nim — Segment.{next, tail, prevConsumerIdx, committed} + UnboundedMupmuc.{headSegment, tailSegment} gain alignment. newSegment swaps to allocAligned.

The ``c_calloc`` import is dropped; ``c_free`` is retained for segment release (POSIX guarantees posix_memalign blocks are free-compatible).

Cross-platform fix in aligned_alloc.nim:
- ``std/posix.posix_memalign`` triggers a clang error under ``nim cpp`` on macOS because the SDK declares the first parameter ``__unsafe_indexable`` (incomplete-type ``void *`` vs the wrapper's ``void **``).
- Replaced the ``import std/posix`` with a local ``importc, header: "<stdlib.h>"`` shim using the canonical C signature, which clang accepts in both C and C++ modes. Documented in the aligned_alloc.nim header comment.
Verification:
- tests/t_unbounded_padding.nim: 8/8 GREEN under both nim c and nim cpp.
- Full suite under nim c (orc), nim cpp, --mm:arc, --mm:refc: [Summary] 200 tests run: 200 OK, 0 FAILED, 0 SKIPPED (all four variants).
- t_unbounded_*_threaded suites pass — no false-sharing-induced timing regression in the threaded code paths.

Refs: design doc §4.2; impl plan Task 3.3.

* feat(bench): add loony_adapter (unbounded MPMC) behind compile gate

PR 3 Task 3.5 (comparison MVP, first FFI adapter). Adds the Loony unbounded MPMC adapter and a parametric round-trip test in tests/t_bench_adapters.nim, both gated on -d:adapter_loony_available so the in-tree test suite passes vacuously when the package is absent.

Files:
- benchmarks/nim/adapters/loony_adapter.nim — adapter, topologiesSupported = {tMpmcUnbounded}, makeLoonyAdapter / push / pop / cleanup / name.
- tests/t_bench_adapters.nim — a push/pop round-trip of 1000 uint64 values preserves the set; the parametric template extends to subsequent adapters in 3.6-3.10.

Loony storage-encoding fix (NOT a hack — required by Loony's design): Loony reserves the low 3 bits of each slot for RESUME/WRITER/READER state flags (loony/spec.nim: SLOTMASK = ~0x7). Loony's empty-slot detection reads ``(prev and SLOTMASK) == 0``, treating a zero "pointer" payload as "slot empty". Loony was designed for ref/ptr types whose addresses are 8-byte-aligned and never null. Pushing the literal value 0 (or any value whose low 3 bits are non-zero) breaks Loony's pop path.

The adapter encodes the harness's ``uint64(i)`` payload as ``(v + 1) shl 3`` on push and ``(raw shr 3) - 1`` on pop:
- ``shl 3`` clears the low flag bits.
- ``+ 1`` ensures the encoded value is never zero.

This is documentation-equivalent to how Loony users push 8-byte-aligned, non-null ref addresses; it does NOT change Loony's algorithm or affect comparison fairness — every push/pop still goes through Loony's full fast/slow paths.
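The encoding round-trip is easy to check in isolation. A Python sketch (using `<<`/`>>` as stand-ins for Nim's `shl`/`shr`, with an explicit mask to model uint64 wrap-around):

```python
MASK64 = (1 << 64) - 1  # model Nim's uint64 wrap-around


def encode(v: int) -> int:
    """Map a payload into Loony's slot format: the low 3 bits stay zero
    (Loony's flag space) and the result is never the all-zero
    empty-slot sentinel."""
    return ((v + 1) << 3) & MASK64


def decode(raw: int) -> int:
    """Inverse of encode: shift the flag bits out, undo the +1 bias."""
    return ((raw >> 3) - 1) & MASK64


# Every encoded value keeps the low 3 flag bits clear and is non-zero,
# so neither the flag mask nor the empty-slot check can misfire.
for v in range(1000):
    e = encode(v)
    assert e != 0 and e & 0x7 == 0
    assert decode(e) == v
```

Note that `encode(0)` yields 8, not 0 — the `+ 1` bias is exactly what keeps a zero payload from colliding with the empty-slot sentinel.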
First-iteration RED state (recorded for the audit):
- A plain push of ``uint64(i)`` returned only 125 of 1000 items, all multiples of 8 (i.e. the values where ``i and 7 == 0``).
- ``shl 3`` alone returned 999 of 1000 items (everything except 0, which collided with Loony's empty-slot sentinel).
- ``(v + 1) shl 3`` returned all 1000 items, set-equal to the input.

Verification: tests/t_bench_adapters.nim with -d:adapter_loony_available + --path:loony-0.3.1 --path:arc-0.0.4 -> [Summary] 1 tests run: 1 OK. Full in-tree suite (-d:testing without the loony gate) -> 200/200 OK.

Refs: design doc §3 PR 3; impl plan Task 3.5; loony/spec.nim SLOTMASK.

* feat(bench): add boost_lockfree_queue adapter (MPMC bounded, nim cpp)

Boost.LockFree is C++ header-only (BSL-1.0). The adapter wraps boost::lockfree::queue<unsigned long long> with capacity-bounded push (bounded_push) and pop. Topology: tMpmc.

Compile gates:
- -d:adapter_boost_lockfree_queue_available — enable
- nim cpp — required (else hard error)

Pulls headers from the conventional install paths (apt /usr/include, brew /opt/homebrew/opt/boost/include, /usr/local/include) and accepts -d:boostIncludeDir=<path> for non-standard locations.

Adds smoke_boost.nim (compile-and-run sanity check for CI) and a parametric round-trip test in tests/t_bench_adapters.nim gated by both the define and the 'cpp' build mode. The adapter's capacity argument is the fixed node-pool size; pushes that exceed it return prFull. The default of 1024 mirrors the other bounded adapters.

* feat(bench): add boost_lockfree_spsc adapter (SPSC bounded, nim cpp)

Boost.LockFree's spsc_queue is a wait-free single-producer single-consumer ring buffer. The pop overload used here — pop(out, max_size) — returns the count of items popped (csize_t); 1 means success, 0 means empty.

Compile gates:
- -d:adapter_boost_lockfree_spsc_available — enable
- nim cpp — required

Topology: tSpsc. Capacity is the fixed ring size (default 1024).
Header search paths and indirection rationale are identical to the boost_lockfree_queue adapter; smoke_boost.nim covers both adapters under one binary.

* feat(bench): add bench-ffi-crossbeam Rust cdylib (ArrayQueue + SegQueue)

A C-ABI shim around crossbeam_queue::ArrayQueue<u64> (bounded MPMC) and crossbeam_queue::SegQueue<u64> (unbounded MPMC). Exposes 8 extern "C" fns (cb_array_init/push/pop/destroy + cb_seg_init/push/pop/destroy) that the Nim adapters in Tasks 3.9 / 3.10 will importc.

- Cargo.toml: crate-type = [cdylib, rlib]; rlib so cargo test can link the integration test against the same code path the cdylib exports.
- rust-toolchain.toml pins the stable channel for reproducibility.
- src/lib.rs: 8 extern fns; null-pointer safe; the ArrayQueue::new(0) panic is intercepted (returns null) so we never unwind across FFI.
- tests/integration_test.rs: 6 tests covering round-trip set equality for both queues, full/empty edges, and null-pointer tolerance.

Build: cargo build --release -> target/release/libbench_ffi_crossbeam.{so,dylib}

.gitignore extended for benchmarks/rust/*/target/ and Cargo.lock.

* feat(bench): add crossbeam_array_queue adapter (MPMC bounded, FFI cdylib)

importc's the cb_array_init/push/pop/destroy fns from benchmarks/rust/bench-ffi-crossbeam (Task 3.8) and surfaces them as a typed Nim adapter with topologiesSupported = {tMpmc}. The adapter avoids emitting -L/-lbench_ffi_crossbeam when its sibling crossbeam_seg_queue_adapter gate is also active, so combined builds do not produce 'ignoring duplicate libraries' warnings on Apple ld.

Adds smoke_crossbeam.nim covering both crossbeam adapters under one binary (Task 3.10 lands the seg adapter); tests/t_bench_adapters.nim is extended with the gated parametric round-trip suite.
  Compile gates:
  - -d:adapter_crossbeam_array_queue_available
  - -d:crossbeamLibDir=<path> (override search path for non-default builds)

* feat(bench): add crossbeam_seg_queue adapter (MPMC unbounded, FFI cdylib)

  importc the cb_seg_init/push/pop/destroy fns from the bench-ffi-crossbeam cdylib. SegQueue is unbounded -- push always succeeds (returns prSuccess when the handle is non-null; the underlying cb_seg_push only reports failure on null pointers). topologiesSupported = {tMpmcUnbounded}.

  The capacity arg is accepted-but-ignored so the wire-up sites (Task 3.11) can pass capacity uniformly across all adapters without a per-adapter conditional.

  Compile gate: -d:adapter_crossbeam_seg_queue_available
  Optional override: -d:crossbeamLibDir=<path>

* feat(bench): wire MVP comparison adapters into bench_spsc/mpmc/unbounded

  Each MVP adapter is pulled in only when its compile-time gate is set; absent gates leave production builds unchanged. Variant lists are populated at compile time via 'when declared(...)' so the CLI's --list / unknown-variant error messages reflect the actual compiled-in set.

  Wiring map (per design 2.4 / Track 3 §3.11):

  bench_spsc:
  - boost_lockfree_spsc -> spsc/1p1c (gated by adapter_boost_lockfree_spsc_available)

  bench_mpmc:
  - boost_lockfree_queue -> mpmc/{1,2,4}p{1,2,4}c
  - crossbeam_array_queue -> mpmc/{1,2,4}p{1,2,4}c
  (each gated by its respective adapter_*_available define)

  bench_unbounded:
  - loony -> mpmc_unbounded/{1,2,4}p{1,2,4}c
  - crossbeam_seg_queue -> mpmc_unbounded/{1,2,4}p{1,2,4}c

  bench_mpsc has no MVP coverage today: none of the MVP libraries provide a dedicated MPSC API at the natural shape; design 2.4 is silent on MPSC parity for Track 3 and the existing in-tree mupsic baseline stays the sole MPSC slug.

  Verified locally: bench_mpmc + bench_unbounded produce the expected crossbeam_* slugs at the small fixture (10K msg, 2 runs).
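The bounded-vs-unbounded push contract the wiring above relies on can be modeled in a few lines. This is an illustrative Python model of the adapter semantics only, not the Nim/FFI code; the class names and the `"prSuccess"`/`"prFull"` string stand-ins are invented here:

```python
from collections import deque

PR_SUCCESS, PR_FULL = "prSuccess", "prFull"

class BoundedQueue:
    """Models an ArrayQueue / boost::lockfree::queue adapter: push reports
    prFull once the fixed capacity (node pool / ring size) is exhausted."""
    def __init__(self, capacity: int = 1024):
        self.capacity, self.items = capacity, deque()

    def push(self, v) -> str:
        if len(self.items) >= self.capacity:
            return PR_FULL
        self.items.append(v)
        return PR_SUCCESS

class UnboundedQueue(BoundedQueue):
    """Models a SegQueue adapter: capacity is accepted but ignored, and a
    push on a live handle always succeeds."""
    def push(self, v) -> str:
        self.items.append(v)
        return PR_SUCCESS

q = BoundedQueue(capacity=2)
assert [q.push(i) for i in range(3)] == [PR_SUCCESS, PR_SUCCESS, PR_FULL]

u = UnboundedQueue(capacity=2)  # capacity ignored, mirroring the seg adapter
assert all(u.push(i) == PR_SUCCESS for i in range(3))
```

Keeping one signature for both shapes is what lets the wire-up sites pass capacity uniformly without a per-adapter conditional.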
* ci(bench): add boost+loony soft-skip flow per design §2.6

  workflow_dispatch inputs force_skip_boost / force_skip_loony surface the override hook; FORCE_SKIP_BOOST / FORCE_SKIP_LOONY env vars relay them into step-level if: gates.

  Per library: install -> smoke -> set ADAPTER_*=1 on success -> annotate on any failure. The combined defines + compile-mode (c vs cpp) flow into the per-binary compile step via a final 'Build ada…
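The loony slot-encoding fix recorded in the RED-state audit above can be replayed as a small model. It assumes, per loony/spec.nim's SLOTMASK, that the low 3 bits of a stored word are reserved for slot flags and that 0 is the empty-slot sentinel; the model is Python, not the Nim adapter:

```python
SLOT_FLAG_BITS = 3                       # loony reserves the low 3 bits for slot flags
SLOTMASK = ~((1 << SLOT_FLAG_BITS) - 1)  # reading a slot masks the flag bits off

def slot_read(stored: int) -> int:
    """Model of reading a loony slot: flag bits are stripped."""
    return stored & SLOTMASK

def encode(v: int) -> int:
    """Adapter encoding: shift past the flag bits, +1 so no value maps to the
    empty-slot sentinel 0."""
    return (v + 1) << SLOT_FLAG_BITS

def decode(w: int) -> int:
    return (w >> SLOT_FLAG_BITS) - 1

# RED state 1: plain values survive the mask only when i and 7 == 0,
# i.e. 125 of the 1000 inputs (the multiples of 8).
assert len({v for v in range(1000) if slot_read(v) == v}) == 125

# RED state 2: `shl 3` alone maps 0 to 0, colliding with the sentinel.
assert (0 << SLOT_FLAG_BITS) == 0

# GREEN: (v + 1) shl 3 is nonzero for every input and round-trips exactly.
assert all(encode(v) != 0 for v in range(1000))
assert all(decode(slot_read(encode(v))) == v for v in range(1000))
```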
Stacked PR — depends on PR #19 (`feat/bench-rollup-pr0-bench-common`).

PR 1 of the bench-rollup series. Proves multi-measure-per-slug end-to-end by wiring `bench_latency.nim` into the BMF emission pipeline introduced in PR 0, then merging the latency fragment with the throughput fragment in CI before a single `bencher run` upload.

## What changed
- `benchmarks/nim/bench_latency.nim` — rewritten on top of `bench_common.runLatencyHarness`. CLI mirrors `bench_throughput`: positional variant filter (sipsic/mupmuc/sipmuc/mupsic), `--bmf-out=<path>` for native BMF emission, stdout text preserved. New per-binary intdefines: `BenchLatencyRuns` (default 33), `BenchLatencyMessageCount` (default 100_000), `BenchLatencyWarmupRuns` (default 3).
- `.github/workflows/bench.yml` — restructured from one all-in-one job into three:
  - `bench-throughput` (renamed from `benchmark`): builds + runs `bench_throughput`, uploads the `bench-throughput-bmf` artifact.
  - `bench-latency` (NEW): builds + runs `bench_latency`, uploads the `bench-latency-bmf` artifact. CI-tuned shape: 50K messages × 11 runs × 2 warmup so the four 1p1c smoke variants finish in single-digit seconds inside the 12-min budget.
  - `bench-upload` (NEW, needs both): downloads the `bench-*-bmf` artifacts, runs `python3 benchmarks/merge_bmf.py merged.json $(ls bmf-inputs/*.json)`, then runs the existing Bencher CLI upload flows (PR comparison / push baseline / workflow_dispatch). One `bencher run` invocation -> one Bencher Report -> latency + throughput measures co-located on shared per-slug histories (Bencher creates a separate Report per invocation, so multiple uploads would NOT co-locate measures — see merge rationale in design 1).
- `benchmarks/nim/adapter.nim` — re-exports `PushResult`/`PopResult` from `bench_common` instead of defining its own copies, unifying the two parallel definitions introduced by PR 0 Task 0.1. Both adapter packs (legacy `lockfreequeues_sipsic`/`mupmuc`/`channels` and the newer `lockfreequeues_sipmuc`/`mupsic`/`unbounded_*`) now flow through the same harnesses without per-call-site type conversion.
- `benchmarks/nim/adapters/lockfreequeues_sipsic_adapter.nim` and `lockfreequeues_mupmuc_adapter.nim` — drop the `newPopResult`/`emptyPopResult` constructors in favor of the plain `PopResult[T]` literal form (compatible with the unified non-case-object type).
  The Mupmuc adapter doc clarifies that `getProducer(idx = 0)`/`getConsumer(idx = 0)` bypasses Mupmuc's threadId-based registration, so the pre-built producer/consumer slots are safe to drive from either of the harness's pinger and ponger threads (Mupmuc's underlying Vyukov per-slot CAS protocol is fully concurrent-safe).
- `tests/t_bench_latency.nim` (NEW) — 7 tests across three suites: `--bmf-out` integration on sipsic single-variant + four-variant slug coverage + unknown-variant exit-1 (3 tests).
- `CHANGELOG.md` — bullets under `[Unreleased]` -> Added (bench_latency native BMF, bench-latency CI job) and Changed (adapter type unification).

## Emitted measures (this PR)
Each of the 4 smoke-shape slugs carries `latency_p50_ns`, `latency_p95_ns`, `latency_p99_ns`:

- `lockfreequeues_sipsic/spsc/1p1c`
- `lockfreequeues_sipmuc/mpmc/1p1c`
- `lockfreequeues_mupsic/mpsc/1p1c`
- `lockfreequeues_mupmuc/mpmc/1p1c`

`latency_p999_ns`/`latency_max_ns` are deferred to PR 6's threshold-gating work, per the impl plan.

## Verification
- `nim c -r tests/test.nim` — 188 core tests pass, 0 failures (no regression).
- `nim c -r tests/t_bench_common.nim` — 26 tests pass (PR 0 surface unchanged).
- `nim c -r tests/t_bench_latency.nim` — 7 tests pass (new).
- `actionlint .github/workflows/bench.yml` — clean.
- `bench_latency --bmf-out=lat.json sipsic mupmuc sipmuc mupsic` produces all 4 slugs with all 3 measures, p50 < p95 < p99 < max.

## Follow-ups (not in this PR)
- `--threshold-measure throughput` — preserved as-is in the push-baseline Bencher invocation; the BMF emission renamed the measure to `throughput_ops_ms` in PR 0 Task 0.10, and PR 6 (Task 6.2) reworks the threshold blocks per measure. Out of scope for PR 1.
- `benchmarks/nim/tests/t_adapter_*.nim` — orphaned per-adapter unit tests still reference the pre-PR-0-Task-0.9 adapter filenames (`lockfreequeues_sipsic` instead of `lockfreequeues_sipsic_adapter`). Not wired into any CI/nimble task; pre-existing breakage. Will tidy in a later PR.
- `runOneLatencyRun`/`runOneThroughputRun` don't call adapter `cleanup`/`deinit*`, leaking ~2 small heap allocations per run × 36 runs × 4 variants on the latency path. Bounded and acceptable for the 1p1c smoke shape, but PR 2's topology split should add a generic teardown hook.

## References
- `/Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md` section 3, PR 1.
- `/Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md` Track 1 (Tasks 1.1-1.7).
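For reference, the fragment-merge semantics the `bench-upload` job relies on can be sketched in a few lines of Python. This is an illustrative model, not the in-repo `benchmarks/merge_bmf.py`; it assumes the standard BMF shape (top-level benchmark slugs mapping to measure dictionaries) and simply unions slugs and, per shared slug, measures:

```python
import json
import sys

def merge_bmf(fragment_paths):
    """Union BMF fragments: every benchmark slug from every fragment is kept,
    and a slug appearing in several fragments ends up with the union of their
    measures (later fragments win if the same measure appears twice)."""
    merged = {}
    for path in fragment_paths:
        with open(path) as f:
            fragment = json.load(f)
        for slug, measures in fragment.items():
            merged.setdefault(slug, {}).update(measures)
    return merged

if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage mirrors the CI invocation: merge_bmf.py <output> <input>...
    out_path, *inputs = sys.argv[1:]
    with open(out_path, "w") as f:
        json.dump(merge_bmf(inputs), f, indent=2)
```

Under this model, merging the throughput and latency fragments leaves each 1p1c slug carrying both `throughput_ops_ms` and the `latency_*_ns` measures in one file, so a single `bencher run` upload produces one co-located Report.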