Comparison MVP: Loony, Boost.LockFree, Crossbeam adapters + cache-line padding fix#22
Conversation
There was a problem hiding this comment.
Code Review
This pull request integrates third-party comparison adapters for Loony, Boost, and Crossbeam queues into the benchmark suite, including a Rust-based FFI shim for Crossbeam. It also implements cache-line alignment for unbounded queue segments using a new allocAligned utility and field alignment pragmas to mitigate false sharing. Feedback points out a race condition in the Loony adapter's pop logic and potential overflow issues in its item encoding scheme.
| if a.queue.isEmpty(): | ||
| PopResult[T](success: false) | ||
| else: | ||
| let raw = a.queue.pop() | ||
| PopResult[T](success: true, value: T((raw shr LoonyTagShift) - 1'u64)) |
There was a problem hiding this comment.
The current pop implementation contains a race condition. In an MPMC context, isEmpty() followed by pop() is not atomic. If another consumer pops the last item between these two calls, a.queue.pop() will return the default value (0), which the adapter then incorrectly processes as a valid value (resulting in an underflow to high(uint64)). Since your encoding ensures that valid items are never 0, you should check the result of pop() directly and remove the isEmpty() check.
let raw = a.queue.pop()
if raw == 0:
PopResult[T](success: false)
else:
PopResult[T](success: true, value: T((raw shr LoonyTagShift) - 1'u64))
| discard a | ||
|
|
||
| proc push*[T](a: var LoonyAdapter[T], item: T): PushResult = | ||
| a.queue.push((uint64(item) + 1'u64) shl LoonyTagShift) |
There was a problem hiding this comment.
The encoding logic (uint64(item) + 1'u64) shl LoonyTagShift will lose information for values greater than or equal to 2^61 due to the 3-bit left shift. Additionally, if item == high(uint64), the result becomes 0, which is the 'empty' sentinel, causing the push to be effectively ignored. While this is likely safe for the current benchmark range, it should be documented as a limitation of the adapter.
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c (exactly 3 segments). PR #23's CI failed on the moodycamel slug: invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c' The same 4-segment violations exist for loony and crossbeam_seg_queue in bench_unbounded.nim, and the crossbeam_array_queue dispatch was emitting a colliding "crossbeam_queue" lib name. Apply the canonical lib slugs from the design doc so all dispatch sites yield 3 segments after the runner appends "/<topology>/<shape>": loony/LoonyQueue -> loony crossbeam_queue/SegQueue -> crossbeam_seg_queue crossbeam_queue (Array) -> crossbeam_array_queue moodycamel/ConcurrentQueue -> moodycamel The PR #22 worktree applies the loony and crossbeam fixes too; git rebase will harmonize the duplicate edits. The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]") are human-readable echo strings, not BMF slugs, and stay as-is.
4615926 to
7aed4e1
Compare
8b7a87f to
eb67e2b
Compare
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c (exactly 3 segments). PR #23's CI failed on the moodycamel slug: invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c' The same 4-segment violations exist for loony and crossbeam_seg_queue in bench_unbounded.nim, and the crossbeam_array_queue dispatch was emitting a colliding "crossbeam_queue" lib name. Apply the canonical lib slugs from the design doc so all dispatch sites yield 3 segments after the runner appends "/<topology>/<shape>": loony/LoonyQueue -> loony crossbeam_queue/SegQueue -> crossbeam_seg_queue crossbeam_queue (Array) -> crossbeam_array_queue moodycamel/ConcurrentQueue -> moodycamel The PR #22 worktree applies the loony and crossbeam fixes too; git rebase will harmonize the duplicate edits. The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]") are human-readable echo strings, not BMF slugs, and stay as-is.
eb67e2b to
daf38c5
Compare
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c (exactly 3 segments). PR #23's CI failed on the moodycamel slug: invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c' The same 4-segment violations exist for loony and crossbeam_seg_queue in bench_unbounded.nim, and the crossbeam_array_queue dispatch was emitting a colliding "crossbeam_queue" lib name. Apply the canonical lib slugs from the design doc so all dispatch sites yield 3 segments after the runner appends "/<topology>/<shape>": loony/LoonyQueue -> loony crossbeam_queue/SegQueue -> crossbeam_seg_queue crossbeam_queue (Array) -> crossbeam_array_queue moodycamel/ConcurrentQueue -> moodycamel The PR #22 worktree applies the loony and crossbeam fixes too; git rebase will harmonize the duplicate edits. The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]") are human-readable echo strings, not BMF slugs, and stay as-is.
466d152 to
6386498
Compare
daf38c5 to
ac54727
Compare
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c (exactly 3 segments). PR #23's CI failed on the moodycamel slug: invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c' The same 4-segment violations exist for loony and crossbeam_seg_queue in bench_unbounded.nim, and the crossbeam_array_queue dispatch was emitting a colliding "crossbeam_queue" lib name. Apply the canonical lib slugs from the design doc so all dispatch sites yield 3 segments after the runner appends "/<topology>/<shape>": loony/LoonyQueue -> loony crossbeam_queue/SegQueue -> crossbeam_seg_queue crossbeam_queue (Array) -> crossbeam_array_queue moodycamel/ConcurrentQueue -> moodycamel The PR #22 worktree applies the loony and crossbeam fixes too; git rebase will harmonize the duplicate edits. The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]") are human-readable echo strings, not BMF slugs, and stay as-is.
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c (exactly 3 segments). PR #23's CI failed on the moodycamel slug: invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c' The same 4-segment violations exist for loony and crossbeam_seg_queue in bench_unbounded.nim, and the crossbeam_array_queue dispatch was emitting a colliding "crossbeam_queue" lib name. Apply the canonical lib slugs from the design doc so all dispatch sites yield 3 segments after the runner appends "/<topology>/<shape>": loony/LoonyQueue -> loony crossbeam_queue/SegQueue -> crossbeam_seg_queue crossbeam_queue (Array) -> crossbeam_array_queue moodycamel/ConcurrentQueue -> moodycamel The PR #22 worktree applies the loony and crossbeam fixes too; git rebase will harmonize the duplicate edits. The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]") are human-readable echo strings, not BMF slugs, and stay as-is.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive benchmarking comparison suite (Comparison MVP) by adding adapters for third-party libraries including Loony, Boost.LockFree, and Crossbeam. It features a new Rust crate to expose Crossbeam via a C-ABI and integrates these libraries into the existing benchmark harness and CI workflows. Additionally, the PR implements significant performance optimizations by ensuring cache-line alignment for unbounded queue segments using a new allocAligned utility and {.align.} pragmas to eliminate false sharing. Feedback identifies a race condition in the Loony adapter's pop logic and a potential overflow issue in its item encoding scheme.
| proc pop*[T](a: var LoonyAdapter[T]): PopResult[T] = | ||
| if a.queue.isEmpty(): | ||
| PopResult[T](success: false) | ||
| else: | ||
| let raw = a.queue.pop() | ||
| PopResult[T](success: true, value: T((raw shr LoonyTagShift) - 1'u64)) |
There was a problem hiding this comment.
The pop implementation contains a race condition in multi-consumer scenarios. Checking isEmpty() before calling pop() is not atomic; another consumer thread could drain the queue between the check and the call, causing a.queue.pop() to return the default value (0). Because the adapter uses encoding to ensure valid items are never 0, it should check the result of pop() directly to determine if the queue was empty, rather than relying on a separate isEmpty() check.
proc pop*[T](a: var LoonyAdapter[T]): PopResult[T] =
let raw = a.queue.pop()
if raw == 0:
return PopResult[T](success: false)
PopResult[T](success: true, value: T((raw shr LoonyTagShift) - 1'u64))
| discard a | ||
|
|
||
| proc push*[T](a: var LoonyAdapter[T], item: T): PushResult = | ||
| a.queue.push((uint64(item) + 1'u64) shl LoonyTagShift) |
There was a problem hiding this comment.
The encoding logic (uint64(item) + 1'u64) shl LoonyTagShift will produce 0 if item is uint64.high (or any value where item + 1 overflows and then shifted). Since 0 is used by Loony to indicate an empty slot, pushing such values will break the queue's internal state or cause them to be misread as empty. While the current benchmark range (up to 1M) avoids this, the adapter should ideally handle the full range of uint64 or explicitly document this limitation.
6386498 to
5012581
Compare
…range Per Gemini review on PR #22. 1. Multi-consumer race in pop(): isEmpty() followed by pop() is two distinct calls. Between them, another consumer can drain the queue, leaving the second consumer to pop a default uint64 (0) and emit a spurious success with garbage payload. Switch to popping unconditionally and reading raw == 0 as the empty sentinel — the adapter's own (v+1) shl 3 encoding guarantees pushed values are never 0, so raw == 0 unambiguously means 'queue was empty when pop ran' even under concurrent draining. 2. Encoding range note in the module docstring: (uint64(item) + 1) shl 3 wraps to 0 for item == uint64.high and would be misread as an empty slot. The harness pushes uint64(i) for i in 0 ..< messageCount so the bound is academic at current tuning, but contractually the adapter accepts uint64.high - 1 as the inclusive maximum.
aa6dd1f to
fd5b154
Compare
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c (exactly 3 segments). PR #23's CI failed on the moodycamel slug: invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c' The same 4-segment violations exist for loony and crossbeam_seg_queue in bench_unbounded.nim, and the crossbeam_array_queue dispatch was emitting a colliding "crossbeam_queue" lib name. Apply the canonical lib slugs from the design doc so all dispatch sites yield 3 segments after the runner appends "/<topology>/<shape>": loony/LoonyQueue -> loony crossbeam_queue/SegQueue -> crossbeam_seg_queue crossbeam_queue (Array) -> crossbeam_array_queue moodycamel/ConcurrentQueue -> moodycamel The PR #22 worktree applies the loony and crossbeam fixes too; git rebase will harmonize the duplicate edits. The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]") are human-readable echo strings, not BMF slugs, and stay as-is.
|
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
/review |
Code Review by Qodo
1. Missing workflow token permissions
|
5012581 to
9585d15
Compare
…range Per Gemini review on PR #22. 1. Multi-consumer race in pop(): isEmpty() followed by pop() is two distinct calls. Between them, another consumer can drain the queue, leaving the second consumer to pop a default uint64 (0) and emit a spurious success with garbage payload. Switch to popping unconditionally and reading raw == 0 as the empty sentinel — the adapter's own (v+1) shl 3 encoding guarantees pushed values are never 0, so raw == 0 unambiguously means 'queue was empty when pop ran' even under concurrent draining. 2. Encoding range note in the module docstring: (uint64(item) + 1) shl 3 wraps to 0 for item == uint64.high and would be misread as an empty slot. The harness pushes uint64(i) for i in 0 ..< messageCount so the bound is academic at current tuning, but contractually the adapter accepts uint64.high - 1 as the inclusive maximum.
fd5b154 to
4ce6826
Compare
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c (exactly 3 segments). PR #23's CI failed on the moodycamel slug: invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c' The same 4-segment violations exist for loony and crossbeam_seg_queue in bench_unbounded.nim, and the crossbeam_array_queue dispatch was emitting a colliding "crossbeam_queue" lib name. Apply the canonical lib slugs from the design doc so all dispatch sites yield 3 segments after the runner appends "/<topology>/<shape>": loony/LoonyQueue -> loony crossbeam_queue/SegQueue -> crossbeam_seg_queue crossbeam_queue (Array) -> crossbeam_array_queue moodycamel/ConcurrentQueue -> moodycamel The PR #22 worktree applies the loony and crossbeam fixes too; git rebase will harmonize the duplicate edits. The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]") are human-readable echo strings, not BMF slugs, and stay as-is.
|
/review |
|
/ai-review |
There was a problem hiding this comment.
PR adds five third-party comparison adapters (Loony, Boost.LockFree, Crossbeam), a new Rust FFI crate, a dedicated Crossbeam benchmark workflow, cache-line alignment fixes for unbounded queue segments, and supporting tests/docs. The code is well-structured, the soft-skip CI pattern is well-designed, and the alignment fix is correct. Audit demoted the unpinned-deps finding from medium to low: the affected workflow is nightly-only so it does not block PRs.
Severity tally: 3 Lows.
Low
- BOT-B1 (
.github/workflows/bench-comparison.yml:70): bench-comparison.yml usesmainbranch for sibling deps instead of tagged versions - BOT-B2 (
benchmarks/rust/bench-ffi-crossbeam/Cargo.toml:1): Rust FFI crate does not setpanic = "abort"; panic-unwind across FFI boundary is UB - BOT-B3 (
benchmarks/nim/adapters/loony_adapter.nim:39): Loony adapter docstring incorrectly claims int.high is 'far below' the encoding-safe limit
Noteworthy
- The
try/finallyreplacement fortry/except/raisein the four unbounded queue variants correctly handles Defect-class exceptions (not just CatchableError), closing a memory leak in the allocation failure path. - The
crossbeam_link.nimshared module elegantly avoids duplicate{.passL.}emission when both crossbeam adapters are enabled in the same compilation unit.
Verdict: APPROVE.
Commands
- Comment
/ai-reviewor mention @axiomantic-momus[bot] to request a re-review of the latest changes. - Reply to a finding with
won't fix,by design, ornot a bugto decline it. - Reply with
instead, ...to propose an alternative fix.
Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.
| git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git | ||
| git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git |
There was a problem hiding this comment.
BOT-B1 — Low (standards)
bench-comparison.yml uses main branch for sibling deps instead of tagged versions
bench-comparison.yml clones nim-debra and nim-typestates from --branch main, but bench.yml and build.yml both pin to --branch v0.6.0 and --branch v0.7.0 respectively. This means the crossbeam comparison workflow can silently pick up breaking changes from the dependency main branches, producing compilation failures or benchmark results that are not comparable to those from bench.yml. The PR description states this workflow "mirrors bench.yml" — the dependency pinning should mirror as well. Severity demoted from medium to low: bench-comparison.yml runs on nightly cron and devel-push only, not on PR — it does not gate merges.
| git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git | |
| git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git | |
| Replace `--branch main` with `--branch v0.6.0` for nim-debra and `--branch v0.7.0` for nim-typestates on lines 70-71 of bench-comparison.yml. |
| [package] | ||
| name = "bench_ffi_crossbeam" | ||
| version = "0.1.0" | ||
| edition = "2021" | ||
| publish = false | ||
| description = "C-ABI cdylib wrapping crossbeam-queue::ArrayQueue and SegQueue for the lockfreequeues bench harness" | ||
| license = "Apache-2.0 OR MIT" | ||
|
|
||
| [lib] | ||
| crate-type = ["cdylib", "rlib"] | ||
| # rlib is included so `cargo test` can link the integration test binary | ||
| # against the same code path that the cdylib exposes. | ||
|
|
||
| [dependencies] | ||
| crossbeam-queue = "0.3" |
There was a problem hiding this comment.
BOT-B2 — Low (quality)
Rust FFI crate does not set panic = "abort"; panic-unwind across FFI boundary is UB
The cdylib exposes extern "C" functions called from Nim. Neither panic = "abort" nor catch_unwind wrappers are present. In particular, cb_seg_push calls SegQueue::push which allocates via the global allocator; if the allocator panics (e.g. OOM), the unwind crosses the C FFI boundary, which is undefined behavior per Rust's FFI rules. While OOM is extremely unlikely in CI for a bench crate, this is standard FFI hygiene and low-cost to fix.
| [package] | |
| name = "bench_ffi_crossbeam" | |
| version = "0.1.0" | |
| edition = "2021" | |
| publish = false | |
| description = "C-ABI cdylib wrapping crossbeam-queue::ArrayQueue and SegQueue for the lockfreequeues bench harness" | |
| license = "Apache-2.0 OR MIT" | |
| [lib] | |
| crate-type = ["cdylib", "rlib"] | |
| # rlib is included so `cargo test` can link the integration test binary | |
| # against the same code path that the cdylib exposes. | |
| [dependencies] | |
| crossbeam-queue = "0.3" | |
| Add `[profile.release]` with `panic = "abort"` to Cargo.toml, or wrap the crossbeam push/pop calls in `std::panic::catch_unwind`. |
| ## ``messageCount`` is bounded by ``int.high`` ≈ ``2^63``, far below the | ||
| ## safe limit at any tuning we'll ever ship, so this is academic — but | ||
| ## it is the contractual range and a debug-mode assert below makes a |
There was a problem hiding this comment.
BOT-B3 — Low (quality)
Loony adapter docstring incorrectly claims int.high is 'far below' the encoding-safe limit
The docstring states: "messageCount is bounded by int.high ≈ 2^63, far below the safe limit". However, the safe limit is (1'u64 shl 61) - 2 ≈ 2.3×10^18, while int.high = 2^63 - 1 ≈ 9.2×10^18 — roughly 4× larger than the limit. The actual benched message counts (500K–1M) are well within range, but the comment is factually wrong and could mislead someone tuning message counts in the future.
| ## ``messageCount`` is bounded by ``int.high`` ≈ ``2^63``, far below the | |
| ## safe limit at any tuning we'll ever ship, so this is academic — but | |
| ## it is the contractual range and a debug-mode assert below makes a | |
| Correct the comment to reflect that actual bench message counts (e.g., millions) are far below the limit, not `int.high`. |
2aa47ce to
ab19733
Compare
…pects 64B
PR 3 Task 3.1 (cache-line padding audit, step 1 of 3).
Verifies bounded variants already conform; introduces the failing test that
will drive the unbounded fix in 3.2 and 3.3.
Bounded MPMC audit (read-only verification, design doc §4.2):
- src/lockfreequeues/sipsic.nim:19-20 head/tail Atomic[int] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/sipmuc.nim:42-43 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupsic.nim:42-43 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupmuc.nim:43-44 head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
The bounded MPMC cell array also uses the documented array-of-cells trick
(typestates/mpmc_cell.nim:11) so adjacent slots never share a cache line.
No bounded change required.
Unbounded variants currently FAIL both audit conditions:
- Segment fields (head, tail, prevConsumerIdx, committed) lack {.align: CacheLineBytes.}
pragmas, so they share cache lines with neighbouring data.
- newSegment() uses c_calloc, which on x86_64 Linux glibc returns 16-byte
aligned memory (MALLOC_ALIGNMENT = 2*sizeof(size_t)). The {.align.} pragma
controls only the intra-struct offset, not the base.
Test additions:
- tests/t_unbounded_padding.nim — 8 assertions across 4 unbounded variants:
(a) offsetOf(SegmentField) mod CacheLineBytes == 0 for the producer/consumer
coordination atomics.
(b) cast[uint](segPtr) mod CacheLineBytes == 0 for a freshly-allocated
Segment via the queue's allocator.
- Test-only accessors guarded by `when defined(testing):` in each unbounded
module: headSegmentForTest (returns pointer) and segmentHeadOffsetForTest
(returns offsets of audit-relevant fields). Compiled away in release.
- Wired into tests/test.nim aggregator.
RED state confirmed:
Unbounded queue Segment cache-line padding [Summary] 8 tests run: 0 OK, 8 FAILED
off.tail mod 64 was 8 (expected 0)
off.prevConsumerIdx mod 64 was 16
off.committed mod 64 was 24
cast[uint](segPtr) mod 64 was 16
Existing test suite continues to compile (verified `nim c tests/test.nim`).
Refs: design doc §4.2; impl plan Task 3.1.
PR 3 Task 3.2 (cache-line padding audit, step 2 of 3).
Adds the cache-line-aligned heap allocator that replaces c_calloc in the
unbounded queue Segment allocation path (call sites swap in Task 3.3).
Compile probe (impl plan §3.2.0):
$ nim c -r tmp_probe.nim
rc=0
align=0
verifying that std/posix.posix_memalign is available on macOS (libSystem)
and Linux glibc with the same C signature, returning 64-byte aligned memory
on success. Probe file deleted post-verification per impl plan.
Implementation (src/lockfreequeues/internal/aligned_alloc.nim):
- ``proc allocAligned*[T](): ptr T`` — calls posix_memalign with
CacheLineBytes (64) alignment, zero-inits, raises OutOfMemDefect on
rc != 0. Caller releases via the standard c_free per POSIX guarantee
that posix_memalign blocks are free-compatible.
- Re-exports CacheLineBytes from atomic_dsl for callers in 3.3.
- isMainModule smoke confirms tail-call alignment.
Tests (tests/t_aligned_alloc.nim, wired into tests/test.nim):
- Tiny payload (sizeof < 64).
- Line-sized payload (sizeof == 64).
- Big payload (sizeof > 64).
- 64 sequential allocations all 64B-aligned.
All 4 cases assert pointer alignment AND zero-initialization. Result:
internal/aligned_alloc.allocAligned ....
[Summary] 4 tests run: 4 OK, 0 FAILED.
The unbounded padding audit test (tests/t_unbounded_padding.nim) is still
RED at this commit; Task 3.3 swaps the Segment allocation call sites and
adds {.align: CacheLineBytes.} pragmas, taking it GREEN.
Refs: design doc §4.2; impl plan Task 3.2.
PR 3 Task 3.3 (cache-line padding audit, step 3 of 3 — GREEN).
Applies the cache-line padding fix to all four unbounded queue variants and
their per-segment producer/consumer atomics. Resolves the asymmetry called
out in design doc §4.2: bounded variants were already padded; unbounded
Segments were not, so cross-library comparison numbers in PR 3 would have
been biased against lockfreequeues' unbounded types.
Production changes (4 files):
- src/lockfreequeues/unbounded_sipsic.nim
Segment.{next, head, tail} + UnboundedSipsic.{headSegment, tailSegment}
gain {.align: CacheLineBytes.}. newSegment swaps c_calloc -> allocAligned.
- src/lockfreequeues/unbounded_sipmuc.nim
Segment.{next, tail, prevConsumerIdx} + UnboundedSipmuc.{headSegment,
tailSegment} gain alignment. newSegment swaps to allocAligned.
- src/lockfreequeues/unbounded_mupsic.nim
Segment.{next, tail, committed} + UnboundedMupsic.{headSegment,
tailSegment} gain alignment. newSegment swaps to allocAligned.
``head: int`` (single consumer, non-atomic) is intentionally left
unpadded — only the consumer touches it.
- src/lockfreequeues/unbounded_mupmuc.nim
Segment.{next, tail, prevConsumerIdx, committed} + UnboundedMupmuc.{
headSegment, tailSegment} gain alignment. newSegment swaps to
allocAligned.
The ``c_calloc`` import is dropped; ``c_free`` retained for segment
release (POSIX guarantees posix_memalign blocks are free-compatible).
Cross-platform fix in aligned_alloc.nim:
- ``std/posix.posix_memalign`` triggers a clang error under ``nim cpp`` on
macOS because the SDK declares the first parameter ``__unsafe_indexable``
(incomplete-type ``void *`` vs the wrapper's ``void **``).
- Replaced the ``import std/posix`` with a local ``importc, header:
"<stdlib.h>"`` shim using the canonical C signature, which clang accepts
in both C and C++ modes. Documented in aligned_alloc.nim header comment.
Verification:
- tests/t_unbounded_padding.nim: 8/8 GREEN under both nim c and nim cpp.
- Full suite under nim c (orc), nim cpp, --mm:arc, --mm:refc:
[Summary] 200 tests run: 200 OK, 0 FAILED, 0 SKIPPED (all four variants)
- t_unbounded_*_threaded suites pass — no false-sharing-induced timing
regression in the threaded code paths.
Refs: design doc §4.2; impl plan Task 3.3.
PR 3 Task 3.5 (comparison MVP, first FFI adapter).
Adds the Loony unbounded MPMC adapter and a parametric round-trip test in
tests/t_bench_adapters.nim, both gated on -d:adapter_loony_available so
the in-tree test suite passes vacuously when the package is absent.
Files:
- benchmarks/nim/adapters/loony_adapter.nim — adapter, topologiesSupported
= {tMpmcUnbounded}, makeLoonyAdapter / push / pop / cleanup / name.
- tests/t_bench_adapters.nim — push/pop 1000 uint64 round-trip preserves
set; parametric template extends to subsequent adapters in 3.6–3.10.
Loony storage-encoding fix (NOT a hack — required by Loony's design):
Loony reserves the low 3 bits of each slot for RESUME/WRITER/READER state
flags (loony/spec.nim: SLOTMASK = ~0x7). Loony's empty-slot detection
reads ``(prev and SLOTMASK) == 0``, treating a zero "pointer" payload as
"slot empty". Loony was designed for ref/ptr types whose addresses are
8-byte-aligned and never null. Pushing the literal value 0 (or any value
whose low 3 bits are non-zero) breaks Loony's pop path.
Adapter encodes the harness's ``uint64(i)`` payload as ``(v + 1) shl 3``
on push and ``(raw shr 3) - 1`` on pop:
- ``shl 3`` clears the low flag bits.
- ``+ 1`` ensures the encoded value is never zero.
This is documentation-equivalent to how Loony users push 8-byte-aligned,
non-null ref addresses; it does NOT change Loony's algorithm or affect
the comparison fairness — every push/pop still goes through Loony's full
fast/slow paths.
First-iteration RED state (recorded for the audit):
- Plain push of ``uint64(i)`` returned only 125 of 1000 items, all
multiples of 8 (i.e. the values whose ``i and 7 == 0``).
- ``shl 3`` alone returned 999 of 1000 items (everything except 0,
which collided with Loony's empty-slot sentinel).
- ``(v + 1) shl 3`` returned all 1000 items, set-equal to the input.
Verification: tests/t_bench_adapters.nim with -d:adapter_loony_available
+ --path:loony-0.3.1 --path:arc-0.0.4 -> [Summary] 1 tests run: 1 OK.
Full in-tree suite (-d:testing without the loony gate) -> 200/200 OK.
Refs: design doc §3 PR 3; impl plan Task 3.5; loony/spec.nim SLOTMASK.
Boost.LockFree is C++ header-only (BSL-1.0). Adapter wraps boost::lockfree::queue<unsigned long long> with capacity-bounded push (bounded_push) and pop. Topology: tMpmc. Compile gates: -d:adapter_boost_lockfree_queue_available -- enable nim cpp -- required (else hard error) Pulls headers from the conventional install paths (apt /usr/include, brew /opt/homebrew/opt/boost/include, /usr/local/include) and accepts -d:boostIncludeDir=<path> for non-standard locations. Adds smoke_boost.nim (compile-and-run sanity for CI) and a parametric round-trip test in tests/t_bench_adapters.nim gated by both the define and 'cpp' build mode. Adapter capacity argument is the fixed node-pool size; pushes that exceed it return prFull. Default 1024 mirrors other bounded adapters.
Boost.LockFree's spsc_queue is a wait-free single-producer single-consumer ring buffer. The pop overload used here -- pop(out, max_size) -- returns the count of items popped (csize_t); 1 means success, 0 means empty. Compile gates: -d:adapter_boost_lockfree_spsc_available -- enable nim cpp -- required Topology: tSpsc. Capacity is the fixed ring size (default 1024). Header search paths and indirection rationale identical to boost_lockfree_queue adapter; smoke_boost.nim covers both adapters under one binary.
C-ABI shim around crossbeam_queue::ArrayQueue<u64> (bounded MPMC) and
crossbeam_queue::SegQueue<u64> (unbounded MPMC). Exposes 8 extern "C"
fns (cb_array_init/push/pop/destroy + cb_seg_init/push/pop/destroy)
that the Nim adapters in tasks 3.9 / 3.10 will importc.
- Cargo.toml: crate-type = [cdylib, rlib]; rlib so cargo test can link
the integration test against the same code path the cdylib exports.
- rust-toolchain.toml pins stable channel for reproducibility.
- src/lib.rs: 8 extern fns; null-pointer safe; ArrayQueue::new(0) panic
is intercepted (returns null) so we never unwind across FFI.
- tests/integration_test.rs: 6 tests covering round-trip set equality
for both queues, full/empty edges, and null-pointer tolerance.
Build: cargo build --release -> target/release/libbench_ffi_crossbeam.{so,dylib}
.gitignore extended for benchmarks/rust/*/target/ and Cargo.lock.
…lib)
importc the cb_array_init/push/pop/destroy fns from
benchmarks/rust/bench-ffi-crossbeam (Task 3.8) and surface them as a
typed Nim adapter with topologiesSupported = {tMpmc}.
The adapter avoids emitting -L/-lbench_ffi_crossbeam when its sibling
crossbeam_seg_queue_adapter gate is also active, so combined builds
do not produce 'ignoring duplicate libraries' warnings on Apple ld.
Adds smoke_crossbeam.nim covering both crossbeam adapters under one
binary (Task 3.10 lands the seg adapter); tests/t_bench_adapters.nim
extended with the gated parametric round-trip suite.
Compile gates:
-d:adapter_crossbeam_array_queue_available
-d:crossbeamLibDir=<path> (override search path for non-default builds)
…lib)
importc the cb_seg_init/push/pop/destroy fns from the bench-ffi-crossbeam
cdylib. SegQueue is unbounded -- push always succeeds (returns prSuccess
when the handle is non-null; the underlying cb_seg_push only reports
failure on null pointers).
topologiesSupported = {tMpmcUnbounded}. capacity arg accepted-but-ignored
so the wire-up sites (Task 3.11) can pass capacity uniformly across all
adapters without a per-adapter conditional.
Compile gate: -d:adapter_crossbeam_seg_queue_available
Optional override: -d:crossbeamLibDir=<path>
Each MVP adapter is pulled in only when its compile-time gate is set;
absent gates leave production builds unchanged. Variant lists are
populated at compile time via 'when declared(...)' so the CLI's
--list / unknown-variant error messages reflect the actual compiled-in
set.
Wiring map (per design 2.4 / Track 3 §3.11):
bench_spsc:
- boost_lockfree_spsc -> spsc/1p1c (gated by
adapter_boost_lockfree_spsc_available)
bench_mpmc:
- boost_lockfree_queue -> mpmc/{1,2,4}p{1,2,4}c
- crossbeam_array_queue -> mpmc/{1,2,4}p{1,2,4}c
(each gated by its respective adapter_*_available define)
bench_unbounded:
- loony -> mpmc_unbounded/{1,2,4}p{1,2,4}c
- crossbeam_seg_queue -> mpmc_unbounded/{1,2,4}p{1,2,4}c
bench_mpsc has no MVP coverage today: none of the MVP libraries provide
a dedicated MPSC API at the natural shape; design 2.4 is silent on MPSC
parity for Track 3 and the existing in-tree mupsic baseline stays the
sole MPSC slug.
Verified locally: bench_mpmc + bench_unbounded produce expected
crossbeam_* slugs at the small fixture (10K msg, 2 runs).
workflow_dispatch inputs force_skip_boost / force_skip_loony surface
the override hook; FORCE_SKIP_BOOST / FORCE_SKIP_LOONY env vars relay
them into step-level if: gates.
Per library: install -> smoke -> set ADAPTER_*=1 on success -> annotate
on any failure. The combined defines + compile-mode (c vs cpp) flow
into the per-binary compile step via a final 'Build adapter define
flags' step that populates step.outputs.{flags,mode}.
Boost (libboost-dev) is wired into bench_spsc + bench_mpmc only.
Loony (nimble registry) is wired into bench_unbounded only. Other
binaries skip both install steps entirely (matrix-binary if-guards),
so the per-binary critical path is unchanged when no MVP adapter
applies.
Both install steps use continue-on-error: true; failure flips the
binary's compile flags so MVP slugs are simply omitted from the BMF
instead of failing the workflow (design Risk 4 soft-skip pattern).
Annotate-skipped emits ::warning title=Adapter skipped which surfaces
on the PR check summary.
Crossbeam adds a Rust toolchain dependency (~5 min cold install) which
would inflate the bench.yml critical path on every PR. Per design 2.6
and impl plan §3.13, Crossbeam runs in this dedicated workflow instead:
triggers:
- schedule: nightly 04:00 UTC
- workflow_dispatch
- push to devel that touches benchmarks/rust/** or
benchmarks/nim/adapters/crossbeam_*
steps:
- dtolnay/rust-toolchain@stable + Swatinem/rust-cache@v2
- cargo build --release of bench-ffi-crossbeam
- cargo test of the cdylib (sanity)
- smoke_crossbeam.nim compile-and-run
- bench_mpmc with -d:adapter_crossbeam_array_queue_available
- bench_unbounded with -d:adapter_crossbeam_seg_queue_available
- merge_bmf.py + Bencher upload (separate Report from bench.yml)
Linker resolution: the {.passL.} flags in the Nim adapters add the
default -L; we additionally pass -Wl,-rpath,$(pwd)/.../target/release
so the dylib resolves at runtime without LD_LIBRARY_PATH.
actionlint clean.
benchmarks/README.md: - New 'Comparison MVP' subsection: table of libraries x topologies x compile gates x install commands. - Documents the bench.yml soft-skip flow (Loony + Boost) vs bench-comparison.yml nightly cron (Crossbeam) split. - Local-run snippets for each adapter. - Mentions nim/smoke/ and rust/bench-ffi-crossbeam/ in the structure list. THIRD_PARTY_LICENSES.md (new): - Per-vendor block schema (matches design doc §4.5). - Records Loony (MIT, nimble-managed), Boost.LockFree (BSL-1.0, system package), and Crossbeam (Apache-2.0 OR MIT, Cargo-managed via the in-tree cdylib shim). - Reserves entries for concurrentqueue (PR 4) and uPlot (PR 5).
Records: five third-party adapters (Loony, Boost x2, Crossbeam x2), the bench-ffi-crossbeam Rust cdylib, smoke binaries under benchmarks/nim/smoke/, the new bench-comparison.yml workflow, the soft-skip flow added to bench.yml, the new THIRD_PARTY_LICENSES.md, the cache-line padding fix for unbounded queue segments via the new internal/aligned_alloc.nim helper, and the supporting test infrastructure (t_unbounded_padding.nim, t_aligned_alloc.nim, the t_bench_adapters.nim parametric round-trip suite).
…xonomy
merge_bmf.py enforces <library>/<topology>/<P>p<C>c (exactly 3 segments).
Three call sites violated this:
- "loony/LoonyQueue" produced 4-segment slugs
- "crossbeam_queue/SegQueue" produced 4-segment slugs
- "crossbeam_queue" for ArrayQueue collided with the SegQueue lib name
Use the canonical lib slugs from the design doc ("loony",
"crossbeam_array_queue", "crossbeam_seg_queue") so merge passes.
Loony is soft-skipped in PR CI today and Crossbeam runs only in nightly
cron, so this is preventative for downstream PRs and post-merge runs.
The cache-line padding fix swapped c_calloc -> allocAligned in newSegment but missed the manager allocations in newUnboundedSipmuc/Mupsic/Mupmuc auto-create constructors that landed via #18 (e8ee15b). Manager state (epoch counter, limbo bag pointers, registration table) benefits from the same 64-byte base alignment as Segment, and the import audit dropped c_calloc as unused — so the auto-create constructors broke at compile time. Same allocAligned helper, same OutOfMemDefect failure mode, same c_free release path (POSIX guarantees posix_memalign blocks are free-compatible).
…range Per Gemini review on PR #22. 1. Multi-consumer race in pop(): isEmpty() followed by pop() is two distinct calls. Between them, another consumer can drain the queue, leaving the second consumer to pop a default uint64 (0) and emit a spurious success with garbage payload. Switch to popping unconditionally and reading raw == 0 as the empty sentinel — the adapter's own (v+1) shl 3 encoding guarantees pushed values are never 0, so raw == 0 unambiguously means 'queue was empty when pop ran' even under concurrent draining. 2. Encoding range note in the module docstring: (uint64(item) + 1) shl 3 wraps to 0 for item == uint64.high and would be misread as an empty slot. The harness pushes uint64(i) for i in 0 ..< messageCount so the bound is academic at current tuning, but contractually the adapter accepts uint64.high - 1 as the inclusive maximum.
Three code-review fixes on PR 3:
1. allocAligned[T] always passed CacheLineBytes to posix_memalign
regardless of alignof(T). For an over-aligned T (e.g. SSE/AVX
vectors or {.align: 128.} pragmas), the returned pointer would
not satisfy T's alignment contract — UB on dereference. Use
max(CacheLineBytes, alignof(T)) so the cache-line invariant is
preserved while honoring T's own alignment. Computed at compile
time; runtime alignment argument stays constant per instantiation.
2. bench-comparison.yml runs bencher run --github-actions but did not
declare job-level permissions. On repos with default read-only
GITHUB_TOKEN, this would prevent check-run / PR-annotation
publishing. Mirror bench.yml's bench-upload permissions block:
contents:read, actions:read, pull-requests:write, checks:write.
3. The loony_adapter Encoding range comment claimed
'0 .. uint64.high - 1' but the actual encoding (v + 1) shl 3
discards the top 3 bits, so the round-trip is only invertible for
v <= (1 shl 61) - 2. Document the true safe range and add a
debug-mode assert that catches misuse in non-release builds. The
assert is stripped under -d:release / -d:danger so bench
measurement runs see zero overhead.
aligned_alloc:
- Add Windows path via _aligned_malloc / _aligned_free (POSIX path
keeps posix_memalign + free). Library is no longer Linux/macOS-only.
- Add freeAligned(pointer) plus a typed ptr T overload so callers in
untyped destructor hooks share the same call site as typed callers.
- Migrate all c_free callsites that pair with allocAligned across the
four unbounded queues (segments, manager auto-create slot, retired-
segment destructor) to freeAligned. Drop the now-unused c_free
imports from each unbounded source.
unbounded_mupsic Segment.head:
- Add {.align: CacheLineBytes.} to the consumer head field. Without it,
head sat in the same 64-byte chunk as the producer-shared tail and
every consumer write invalidated producers' tail line. Extend the
cache-line padding audit (segmentHeadOffsetForTest +
t_unbounded_padding) to assert head's alignment so the regression
cannot recur.
manager-init failure path (mupmuc + mupsic + sipmuc auto-create overload):
- Replace `try ... except: cleanup; raise` with `try ... finally: if
not ok: cleanup`. Under Nim 2.0, bare `except:` matches only
CatchableError -- an OutOfMemDefect from inside initDebraManager
would skip cleanup and leak the heap-allocated DebraManager slot.
`finally` runs unconditionally on Defect-class raises too. Refresh
the comment that referred to a c_calloc zeroing path that no longer
exists (allocAligned does the zeroing now).
boost adapters (queue + spsc_queue):
- Switch backing storage from alloc0 to allocAligned[T] / freeAligned
so the placement-constructed C++ object gets cache-line alignment,
not just alignof(max_align_t).
Previous design gated the array adapter's link emission on the seg
adapter's gate being unset, so the seg adapter "owns" the flags when
both are enabled. This breaks the realistic CI configuration where
both adapter_crossbeam_*_available defines are set globally but a
given bench binary imports only one adapter. With both gates set and
only the array adapter imported, the array adapter's "skip" branch
fires and the seg adapter's emission block is never compiled in -
linker fails on undefined cb_array_* symbols.
Move the {.passL.} pragmas into crossbeam_link.nim. Both adapters
import it from inside their availability gate, so the link flags
emit exactly once per bench binary that uses ANY crossbeam adapter,
independent of which gates are set elsewhere. Nim's per-module
processing dedups the pragmas when both adapters are imported.
…ect loony encoding-range docstring - bench-comparison.yml: pin nim-debra v0.6.0 / nim-typestates v0.7.0 (was tracking 'main', non-deterministic). - bench-ffi-crossbeam: add [profile.release] panic = 'abort'. Unwinding a Rust panic across the C-ABI into Nim is undefined behavior. - loony_adapter docstring: int.high (2^63 - 1) is ABOVE the 2^61 - 2 encoding ceiling, not 'far below' it. Type bound alone does not guarantee safety; reframe around the shipped tuning range and keep the debug-mode assert as the actual safeguard.
b233388 to
41a08f3
Compare
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c (exactly 3 segments). PR #23's CI failed on the moodycamel slug: invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c' The same 4-segment violations exist for loony and crossbeam_seg_queue in bench_unbounded.nim, and the crossbeam_array_queue dispatch was emitting a colliding "crossbeam_queue" lib name. Apply the canonical lib slugs from the design doc so all dispatch sites yield 3 segments after the runner appends "/<topology>/<shape>": loony/LoonyQueue -> loony crossbeam_queue/SegQueue -> crossbeam_seg_queue crossbeam_queue (Array) -> crossbeam_array_queue moodycamel/ConcurrentQueue -> moodycamel The PR #22 worktree applies the loony and crossbeam fixes too; git rebase will harmonize the duplicate edits. The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]") are human-readable echo strings, not BMF slugs, and stay as-is.
Depends on PR #19, PR #20, PR #21.
Summary
Track 3 (Comparison MVP) of the bench-rollup feature. Adds five third-party adapters so each topology has at least three distinct libraries plotted on the same Bencher dashboard, plus the cache-line padding fix flagged by design §4.2.
What's in this PR
Comparison MVP adapters (Track 3 §3.5–3.10)
LoonyQueuempmc_unbounded-d:adapter_loony_availableboost::lockfree::queuempmc(bounded)-d:adapter_boost_lockfree_queue_available(requiresnim cpp)boost::lockfree::spsc_queuespsc(bounded)-d:adapter_boost_lockfree_spsc_available(requiresnim cpp)crossbeam_queue::ArrayQueuempmc(bounded)-d:adapter_crossbeam_array_queue_availablecrossbeam_queue::SegQueuempmc_unbounded-d:adapter_crossbeam_seg_queue_availableEach adapter is wrapped in
when defined(adapter_<lib>_available):so absent gates produce zero symbol references and production builds are unchanged. Round-trip (1000-item set equality) tests are wired intotests/t_bench_adapters.nim, gated by the same defines.New Rust crate
benchmarks/rust/bench-ffi-crossbeam/A
cdylibexposing 8extern "C"fns wrappingcrossbeam_queue::ArrayQueue<u64>andcrossbeam_queue::SegQueue<u64>.rust-toolchain.tomlpinsstable. Six integration tests cover round-trip set equality for both queues, capacity edges, empty-pop, and null-pointer tolerance.CI integration
bench.ymlnow installs Loony + Boost.LockFree on every PR via the design §2.6 soft-skip pattern (install -> smoke -> set-flag-on-success -> annotate-on-failure).workflow_dispatchacceptsforce_skip_boost/force_skip_loonyboolean inputs to exercise the skip path manually.bench-comparison.ymlruns Crossbeam on a nightly cron (0 4 * * *),workflow_dispatch, and targeted path pushes todevel. It owns its own Bencher Report. Crossbeam runs in this nightly workflow only — not inbench.yml— to keep PR critical-path time unchanged.Cache-line padding audit (Track 3 §3.1–3.3)
The four unbounded queue variants previously allocated segments via
c_calloc(16-byte ABI alignment). Now allocated viaposix_memalignthrough the newsrc/lockfreequeues/internal/aligned_alloc.nimhelper, with{.align: CacheLineBytes.}on per-segment Atomic fields. Verified bytests/t_unbounded_padding.nim(8 assertions across 4 variants, green under c/cpp/arc/refc).Documentation
THIRD_PARTY_LICENSES.mdrecords license obligations for the comparison MVP libraries (Loony MIT, Boost BSL-1.0, Crossbeam Apache-2.0 OR MIT) with placeholder slots for concurrentqueue (PR 4) and uPlot (PR 5).benchmarks/README.mdgains a "Comparison MVP" subsection with library/topology/install table and per-adapter local-run snippets.CHANGELOG.md[Unreleased]updated with Added (5 adapters, Rust crate, smoke binaries, bench-comparison.yml, soft-skip flow, THIRD_PARTY_LICENSES, aligned_alloc) and Fixed (cache-line padding for unbounded segments).Verification
nimble test-> 200 OK / 0 FAILED / 0 SKIPPED.cargo test --releaseon the Rust crate -> 6 passed.actionlintclean on bothbench.ymlandbench-comparison.yml.t_bench_adapters.nimexercises Crossbeam ArrayQueue + SegQueue locally (Cargo + cdylib available); Loony and Boost are gated and skip vacuously without their respective installs.