
Comparison MVP: Loony, Boost.LockFree, Crossbeam adapters + cache-line padding fix #22

Closed
elijahr wants to merge 22 commits into feat/bench-rollup-pr2-topology-split from feat/bench-rollup-pr3-comparison-mvp

Conversation

@elijahr
Owner

@elijahr elijahr commented May 1, 2026

Depends on PR #19, PR #20, PR #21.

Summary

Track 3 (Comparison MVP) of the bench-rollup feature. Adds five third-party adapters so each topology has at least three distinct libraries plotted on the same Bencher dashboard, plus the cache-line padding fix flagged by design §4.2.

What's in this PR

Comparison MVP adapters (Track 3 §3.5–3.10)

| Library | Variant | Topology | Compile gate |
| --- | --- | --- | --- |
| Loony | `LoonyQueue` | mpmc_unbounded | `-d:adapter_loony_available` |
| Boost.LockFree | `boost::lockfree::queue` | mpmc (bounded) | `-d:adapter_boost_lockfree_queue_available` (requires `nim cpp`) |
| Boost.LockFree | `boost::lockfree::spsc_queue` | spsc (bounded) | `-d:adapter_boost_lockfree_spsc_available` (requires `nim cpp`) |
| Crossbeam | `crossbeam_queue::ArrayQueue` | mpmc (bounded) | `-d:adapter_crossbeam_array_queue_available` |
| Crossbeam | `crossbeam_queue::SegQueue` | mpmc_unbounded | `-d:adapter_crossbeam_seg_queue_available` |

Each adapter is wrapped in when defined(adapter_<lib>_available): so absent gates produce zero symbol references and production builds are unchanged. Round-trip (1000-item set equality) tests are wired into tests/t_bench_adapters.nim, gated by the same defines.

New Rust crate benchmarks/rust/bench-ffi-crossbeam/

A cdylib exposing 8 extern "C" fns wrapping crossbeam_queue::ArrayQueue<u64> and crossbeam_queue::SegQueue<u64>. rust-toolchain.toml pins stable. Six integration tests cover round-trip set equality for both queues, capacity edges, empty-pop, and null-pointer tolerance.
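The crate's actual export list isn't reproduced in this description. As a rough sketch of the C-ABI shim pattern it describes (function and type names are hypothetical, and a `Mutex<VecDeque>` stands in for `crossbeam_queue::ArrayQueue` so the snippet compiles with std alone), the surface looks like:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Stand-in for the real crossbeam-backed queue; illustrates the handle,
// null-tolerance, capacity-edge, and empty-pop behaviors the tests cover.
pub struct QueueHandle {
    inner: Mutex<VecDeque<u64>>,
    capacity: usize,
}

#[no_mangle]
pub extern "C" fn queue_new(capacity: usize) -> *mut QueueHandle {
    Box::into_raw(Box::new(QueueHandle {
        inner: Mutex::new(VecDeque::new()),
        capacity,
    }))
}

#[no_mangle]
pub extern "C" fn queue_push(q: *mut QueueHandle, item: u64) -> bool {
    if q.is_null() {
        return false; // null-pointer tolerance
    }
    let q = unsafe { &*q };
    let mut inner = q.inner.lock().unwrap();
    if inner.len() >= q.capacity {
        return false; // bounded-queue capacity edge
    }
    inner.push_back(item);
    true
}

#[no_mangle]
pub extern "C" fn queue_pop(q: *mut QueueHandle, out: *mut u64) -> bool {
    if q.is_null() || out.is_null() {
        return false;
    }
    let q = unsafe { &*q };
    match q.inner.lock().unwrap().pop_front() {
        Some(v) => {
            unsafe { *out = v };
            true
        }
        None => false, // empty-pop reports failure instead of a sentinel
    }
}

#[no_mangle]
pub extern "C" fn queue_free(q: *mut QueueHandle) {
    if !q.is_null() {
        unsafe { drop(Box::from_raw(q)) };
    }
}
```

The Nim adapter only ever sees opaque `*mut QueueHandle` pointers and booleans, which keeps the FFI boundary free of Rust-specific types.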

CI integration

  • bench.yml now installs Loony + Boost.LockFree on every PR via the design §2.6 soft-skip pattern (install -> smoke -> set-flag-on-success -> annotate-on-failure). workflow_dispatch accepts force_skip_boost / force_skip_loony boolean inputs to exercise the skip path manually.
  • New bench-comparison.yml runs Crossbeam on a nightly cron (0 4 * * *), workflow_dispatch, and targeted path pushes to devel. It owns its own Bencher Report. Crossbeam runs in this nightly workflow only — not in bench.yml — to keep PR critical-path time unchanged.

Cache-line padding audit (Track 3 §3.1–3.3)

The four unbounded queue variants previously allocated segments via c_calloc (16-byte ABI alignment). Now allocated via posix_memalign through the new src/lockfreequeues/internal/aligned_alloc.nim helper, with {.align: CacheLineBytes.} on per-segment Atomic fields. Verified by tests/t_unbounded_padding.nim (8 assertions across 4 variants, green under c/cpp/arc/refc).
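The Nim helper itself isn't shown here; the same idea sketched in Rust with `std::alloc` (the 64-byte constant is an assumption, and taking the larger of the cache-line size and the type's own alignment keeps over-aligned types valid):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::mem::{align_of, size_of};

// Assumed cache-line size; the Nim code uses a CacheLineBytes constant.
const CACHE_LINE_BYTES: usize = 64;

/// Allocate one zero-initialized T on a cache-line boundary, mirroring
/// what a posix_memalign-based helper does.
fn alloc_aligned<T>() -> *mut T {
    let align = CACHE_LINE_BYTES.max(align_of::<T>());
    let layout = Layout::from_size_align(size_of::<T>().max(1), align).unwrap();
    unsafe { alloc_zeroed(layout) as *mut T }
}

/// Release a pointer obtained from alloc_aligned; the layout must match.
fn free_aligned<T>(p: *mut T) {
    let align = CACHE_LINE_BYTES.max(align_of::<T>());
    let layout = Layout::from_size_align(size_of::<T>().max(1), align).unwrap();
    unsafe { dealloc(p as *mut u8, layout) }
}
```

Cache-line alignment of per-segment atomics matters because two atomics sharing a 64-byte line force producers and consumers to ping-pong that line between cores (false sharing), even though they touch logically independent fields.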

Documentation

  • New THIRD_PARTY_LICENSES.md records license obligations for the comparison MVP libraries (Loony MIT, Boost BSL-1.0, Crossbeam Apache-2.0 OR MIT) with placeholder slots for concurrentqueue (PR 4) and uPlot (PR 5).
  • benchmarks/README.md gains a "Comparison MVP" subsection with library/topology/install table and per-adapter local-run snippets.
  • CHANGELOG.md [Unreleased] updated with Added (5 adapters, Rust crate, smoke binaries, bench-comparison.yml, soft-skip flow, THIRD_PARTY_LICENSES, aligned_alloc) and Fixed (cache-line padding for unbounded segments).

Verification

  • nimble test -> 200 OK / 0 FAILED / 0 SKIPPED.
  • cargo test --release on the Rust crate -> 6 passed.
  • actionlint clean on both bench.yml and bench-comparison.yml.
  • t_bench_adapters.nim exercises Crossbeam ArrayQueue + SegQueue locally (Cargo + cdylib available); Loony and Boost are gated and skip vacuously without their respective installs.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates third-party comparison adapters for Loony, Boost, and Crossbeam queues into the benchmark suite, including a Rust-based FFI shim for Crossbeam. It also implements cache-line alignment for unbounded queue segments using a new allocAligned utility and field alignment pragmas to mitigate false sharing. Feedback points out a race condition in the Loony adapter's pop logic and potential overflow issues in its item encoding scheme.

Comment on lines +65 to +69
if a.queue.isEmpty():
  PopResult[T](success: false)
else:
  let raw = a.queue.pop()
  PopResult[T](success: true, value: T((raw shr LoonyTagShift) - 1'u64))

high

The current pop implementation contains a race condition. In an MPMC context, isEmpty() followed by pop() is not atomic. If another consumer pops the last item between these two calls, a.queue.pop() will return the default value (0), which the adapter then incorrectly processes as a valid value (resulting in an underflow to high(uint64)). Since your encoding ensures that valid items are never 0, you should check the result of pop() directly and remove the isEmpty() check.

    let raw = a.queue.pop()
    if raw == 0:
      PopResult[T](success: false)
    else:
      PopResult[T](success: true, value: T((raw shr LoonyTagShift) - 1'u64))

discard a

proc push*[T](a: var LoonyAdapter[T], item: T): PushResult =
  a.queue.push((uint64(item) + 1'u64) shl LoonyTagShift)

medium

The encoding logic (uint64(item) + 1'u64) shl LoonyTagShift will lose information for values greater than or equal to 2^61 due to the 3-bit left shift. Additionally, if item == high(uint64), the result becomes 0, which is the 'empty' sentinel, causing the push to be effectively ignored. While this is likely safe for the current benchmark range, it should be documented as a limitation of the adapter.
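The failure modes described above can be checked directly. A minimal sketch (the shift amount of 3 is assumed from the `LoonyTagShift` usage in the snippet; `encode`/`decode` are illustrative names, not adapter procs):

```rust
const TAG_SHIFT: u32 = 3; // assumed value of LoonyTagShift

// Mirror of the adapter's encoding: never yields 0 for small values,
// but wraps to 0 for u64::MAX and drops high bits at or above 2^61.
fn encode(item: u64) -> u64 {
    item.wrapping_add(1).wrapping_shl(TAG_SHIFT)
}

// Inverse of encode, valid only while no bits were shifted out.
fn decode(raw: u64) -> u64 {
    (raw >> TAG_SHIFT).wrapping_sub(1)
}
```

For benchmark payloads (sequential counters far below 2^61) the round-trip is exact; the boundary cases are what the review asks to document.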

elijahr added a commit that referenced this pull request May 1, 2026
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c
(exactly 3 segments). PR #23's CI failed on the moodycamel slug:

  invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c'

The same 4-segment violations exist for loony and crossbeam_seg_queue
in bench_unbounded.nim, and the crossbeam_array_queue dispatch was
emitting a colliding "crossbeam_queue" lib name. Apply the canonical
lib slugs from the design doc so all dispatch sites yield 3 segments
after the runner appends "/<topology>/<shape>":

  loony/LoonyQueue        -> loony
  crossbeam_queue/SegQueue -> crossbeam_seg_queue
  crossbeam_queue (Array)  -> crossbeam_array_queue
  moodycamel/ConcurrentQueue -> moodycamel

The PR #22 worktree applies the loony and crossbeam fixes too;
git rebase will harmonize the duplicate edits.

The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]")
are human-readable echo strings, not BMF slugs, and stay as-is.
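merge_bmf.py itself isn't shown in this thread; a sketch of the shape rule it enforces, per the commit message above (helper names and the exact `<P>p<C>c` parsing are illustrative):

```rust
// Accepts only <library>/<topology>/<P>p<C>c: exactly 3 non-empty
// segments, with the last segment shaped like "2p4c".
fn valid_slug(slug: &str) -> bool {
    let parts: Vec<&str> = slug.split('/').collect();
    if parts.len() != 3 || parts.iter().any(|p| p.is_empty()) {
        return false;
    }
    valid_shape(parts[2])
}

// Expects digits, then 'p', then digits, then a trailing 'c'.
fn valid_shape(shape: &str) -> bool {
    let Some(rest) = shape.strip_suffix('c') else {
        return false;
    };
    let mut halves = rest.splitn(2, 'p');
    match (halves.next(), halves.next()) {
        (Some(p), Some(c)) => {
            !p.is_empty()
                && !c.is_empty()
                && p.chars().all(|ch| ch.is_ascii_digit())
                && c.chars().all(|ch| ch.is_ascii_digit())
        }
        _ => false,
    }
}
```

Under this rule, `loony/mpmc_unbounded/1p1c` passes while the 4-segment `moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c` is rejected, which matches the CI failure quoted above.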
@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 4615926 to 7aed4e1 Compare May 1, 2026 18:12
@elijahr elijahr force-pushed the feat/bench-rollup-pr3-comparison-mvp branch from 8b7a87f to eb67e2b Compare May 1, 2026 18:13
elijahr added a commit that referenced this pull request May 1, 2026
@github-actions
Contributor

github-actions Bot commented May 1, 2026

🐰 Bencher Report

Branch: feat/bench-rollup-pr3-comparison-mvp
Testbed: ubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.


All benchmark results (every series is ⚠️ NO THRESHOLD; per-cell 📈 view-plot links omitted):

| Benchmark | latency_p50_ns | latency_p95_ns | latency_p99_ns | throughput_ops_ms |
| --- | --- | --- | --- | --- |
| lockfreequeues_mupmuc/mpmc/1p1c | 280.82 | 317.00 | 402.82 | 12,178.36 |
| lockfreequeues_mupmuc/mpmc/1p2c | | | | 19,419.99 |
| lockfreequeues_mupmuc/mpmc/1p4c | | | | 15,407.78 |
| lockfreequeues_mupmuc/mpmc/2p1c | | | | 19,428.28 |
| lockfreequeues_mupmuc/mpmc/2p2c | | | | 16,699.13 |
| lockfreequeues_mupmuc/mpmc/2p4c | | | | 16,013.81 |
| lockfreequeues_mupmuc/mpmc/4p1c | | | | 16,774.94 |
| lockfreequeues_mupmuc/mpmc/4p2c | | | | 11,913.26 |
| lockfreequeues_mupmuc/mpmc/4p4c | | | | 9,718.54 |
| lockfreequeues_mupmuc/mpmc/8p8c | | | | 16,723.96 |
| lockfreequeues_mupsic/mpsc/1p1c | 284.18 | 321.82 | 449.73 | 31,949.25 |
| lockfreequeues_mupsic/mpsc/2p1c | | | | 17,146.61 |
| lockfreequeues_mupsic/mpsc/4p1c | | | | 12,744.16 |
| lockfreequeues_sipmuc/mpmc/1p1c | 265.91 | 315.36 | 410.73 | 12,664.55 |
| lockfreequeues_sipmuc/mpmc/1p2c | | | | 18,599.41 |
| lockfreequeues_sipmuc/mpmc/1p4c | | | | 13,601.47 |
| lockfreequeues_sipsic/spsc/1p1c | 386.00 | 455.45 | 583.00 | 9,760.42 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/1p1c | | | | 10,197.96 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/1p2c | | | | 4,683.84 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/2p1c | | | | 9,673.90 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/2p2c | | | | 8,660.52 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/1p1c | | | | 9,925.23 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/2p1c | | | | 9,564.59 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/4p1c | | | | 9,604.96 |
| lockfreequeues_unbounded_sipmuc/mpmc_unbounded/1p1c | | | | 15,730.01 |
| lockfreequeues_unbounded_sipmuc/mpmc_unbounded/1p2c | | | | 10,247.69 |
| lockfreequeues_unbounded_sipsic/spsc_unbounded/1p1c | | | | 31,452.98 |
| nim_channels/mpmc/1p1c | | | | 1,894.99 |
| nim_channels/mpmc/1p2c | | | | 787.50 |
| nim_channels/mpmc/1p4c | | | | 294.01 |
| nim_channels/mpmc/2p1c | | | | 4,975.00 |
| nim_channels/mpmc/2p2c | | | | 1,175.53 |
| nim_channels/mpmc/2p4c | | | | 546.04 |
| nim_channels/mpmc/4p1c | | | | 3,802.81 |
| nim_channels/mpmc/4p2c | | | | 1,673.80 |
| nim_channels/mpmc/4p4c | | | | 1,322.83 |
🐰 View full continuous benchmarking report in Bencher

@elijahr elijahr force-pushed the feat/bench-rollup-pr3-comparison-mvp branch from eb67e2b to daf38c5 Compare May 2, 2026 02:32
elijahr added a commit that referenced this pull request May 2, 2026
@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 466d152 to 6386498 Compare May 2, 2026 03:50
@elijahr elijahr force-pushed the feat/bench-rollup-pr3-comparison-mvp branch from daf38c5 to ac54727 Compare May 2, 2026 03:51
elijahr added a commit that referenced this pull request May 2, 2026
elijahr added a commit that referenced this pull request May 2, 2026
@elijahr
Owner Author

elijahr commented May 2, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive benchmarking comparison suite (Comparison MVP) by adding adapters for third-party libraries including Loony, Boost.LockFree, and Crossbeam. It features a new Rust crate to expose Crossbeam via a C-ABI and integrates these libraries into the existing benchmark harness and CI workflows. Additionally, the PR implements significant performance optimizations by ensuring cache-line alignment for unbounded queue segments using a new allocAligned utility and {.align.} pragmas to eliminate false sharing. Feedback identifies a race condition in the Loony adapter's pop logic and a potential overflow issue in its item encoding scheme.

Comment on lines +64 to +69
proc pop*[T](a: var LoonyAdapter[T]): PopResult[T] =
  if a.queue.isEmpty():
    PopResult[T](success: false)
  else:
    let raw = a.queue.pop()
    PopResult[T](success: true, value: T((raw shr LoonyTagShift) - 1'u64))

high

The pop implementation contains a race condition in multi-consumer scenarios. Checking isEmpty() before calling pop() is not atomic; another consumer thread could drain the queue between the check and the call, causing a.queue.pop() to return the default value (0). Because the adapter uses encoding to ensure valid items are never 0, it should check the result of pop() directly to determine if the queue was empty, rather than relying on a separate isEmpty() check.

  proc pop*[T](a: var LoonyAdapter[T]): PopResult[T] = 
    let raw = a.queue.pop()
    if raw == 0:
      return PopResult[T](success: false)
    PopResult[T](success: true, value: T((raw shr LoonyTagShift) - 1'u64))

discard a

proc push*[T](a: var LoonyAdapter[T], item: T): PushResult =
  a.queue.push((uint64(item) + 1'u64) shl LoonyTagShift)

medium

The encoding logic (uint64(item) + 1'u64) shl LoonyTagShift will produce 0 if item is uint64.high, because item + 1 wraps to 0 before the shift. Since 0 is used by Loony to indicate an empty slot, pushing such a value will break the queue's internal state or cause it to be misread as empty. While the current benchmark range (up to 1M) avoids this, the adapter should ideally handle the full range of uint64 or explicitly document this limitation.

@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 6386498 to 5012581 Compare May 2, 2026 06:00
elijahr added a commit that referenced this pull request May 2, 2026
…range

Per Gemini review on PR #22.

1. Multi-consumer race in pop(): isEmpty() followed by pop() is two
   distinct calls. Between them, another consumer can drain the queue,
   leaving the second consumer to pop a default uint64 (0) and emit a
   spurious success with garbage payload. Switch to popping
   unconditionally and reading raw == 0 as the empty sentinel — the
   adapter's own (v+1) shl 3 encoding guarantees pushed values are
   never 0, so raw == 0 unambiguously means 'queue was empty when pop
   ran' even under concurrent draining.

2. Encoding range note in the module docstring: (uint64(item) + 1) shl 3
   wraps to 0 for item == uint64.high and would be misread as an empty
   slot. The harness pushes uint64(i) for i in 0 ..< messageCount so
   the bound is academic at current tuning, but contractually the
   adapter accepts uint64.high - 1 as the inclusive maximum.
@elijahr elijahr force-pushed the feat/bench-rollup-pr3-comparison-mvp branch from aa6dd1f to fd5b154 Compare May 2, 2026 06:02
elijahr added a commit that referenced this pull request May 2, 2026
@elijahr
Owner Author

elijahr commented May 2, 2026

/gemini review

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@elijahr
Owner Author

elijahr commented May 2, 2026

/review

@qodo-code-review

qodo-code-review Bot commented May 2, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)



Remediation recommended

1. Missing workflow token permissions 🐞 Bug ☼ Reliability
Description
bench-comparison.yml runs bencher run --github-actions but does not declare job permissions, so
repositories with default read-only GITHUB_TOKEN permissions can fail to publish Bencher’s GitHub
checks/annotations and break the workflow’s upload stage.
Code

.github/workflows/bench-comparison.yml[R164-203]

+      - name: Install Bencher CLI
+        uses: bencherdev/bencher@main
+
+      - name: Bencher token preflight
+        run: |
+          if [ -z "$BENCHER_API_TOKEN" ]; then
+            echo "::warning title=Bencher upload skipped::BENCHER_API_TOKEN is not set; bench-comparison ran but did not upload."
+          else
+            echo "Bencher token present; proceeding with upload."
+          fi
+
+      - name: Track scheduled benchmarks with Bencher
+        if: github.event_name == 'schedule' && env.BENCHER_API_TOKEN != ''
+        run: |
+          bencher run \
+            --project lockfreequeues \
+            --token '${{ secrets.BENCHER_API_TOKEN }}' \
+            --branch devel \
+            --testbed ubuntu-latest \
+            --threshold-measure throughput \
+            --threshold-test t_test \
+            --threshold-max-sample-size 64 \
+            --threshold-lower-boundary 0.99 \
+            --thresholds-reset \
+            --adapter json \
+            --file merged.json \
+            --github-actions '${{ secrets.GITHUB_TOKEN }}' \
+            --err
+
+      - name: Track devel-push benchmarks with Bencher
+        if: github.event_name == 'push' && env.BENCHER_API_TOKEN != ''
+        run: |
+          bencher run \
+            --project lockfreequeues \
+            --token '${{ secrets.BENCHER_API_TOKEN }}' \
+            --branch devel \
+            --testbed ubuntu-latest \
+            --adapter json \
+            --file merged.json \
+            --github-actions '${{ secrets.GITHUB_TOKEN }}'
Evidence
bench-comparison.yml invokes Bencher with the GitHub Actions token, which typically requires
checks: write (and sometimes pull-requests: write) permissions. In contrast, the existing
bench.yml workflow explicitly sets those permissions for its Bencher upload job, but the new
workflow does not.

.github/workflows/bench-comparison.yml[164-214]
.github/workflows/bench.yml[331-343]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`bench-comparison.yml` calls `bencher run --github-actions ...` without setting explicit `permissions:`. On repos/orgs with default read-only token permissions, this commonly prevents check-run / PR annotation publishing and can fail the Bencher step.

### Issue Context
`bench.yml` already sets explicit permissions for its Bencher upload job; the new workflow should mirror that pattern to avoid relying on repo defaults.

### Fix Focus Areas
- .github/workflows/bench-comparison.yml[25-38]
- .github/workflows/bench-comparison.yml[164-214]
- .github/workflows/bench.yml[331-343]

### Suggested change
Add a `permissions:` block to the `bench-crossbeam` job (at minimum `checks: write`; if Bencher comments/PR annotations are desired, also `pull-requests: write`).



2. Aligned alloc ignores alignof 🐞 Bug ≡ Correctness
Description
allocAligned[T] always requests CacheLineBytes alignment, which can be smaller than
alignof(T); if a future caller uses an over-aligned T, dereferencing the returned pointer
becomes undefined behavior.
Code

src/lockfreequeues/internal/aligned_alloc.nim[R35-48]

+proc allocAligned*[T](): ptr T =
+  ## Allocate one zero-initialized ``T`` on a ``CacheLineBytes`` boundary.
+  ##
+  ## Raises ``OutOfMemDefect`` on allocation failure (matches the existing
+  ## ``c_calloc`` failure path in unbounded queue ``newSegment`` procs).
+  ##
+  ## The caller owns the returned pointer; release with ``freeAligned`` or
+  ## the standard ``c_free`` (``posix_memalign`` blocks are compatible with
+  ## ``free`` per POSIX).
+  var p: pointer
+  if posix_memalign(addr p, csize_t(CacheLineBytes), csize_t(sizeof(T))) != 0:
+    raise newException(OutOfMemDefect, "posix_memalign failed for " & $T)
+  zeroMem(p, sizeof(T))
+  result = cast[ptr T](p)
Evidence
The allocator hard-codes the requested alignment to CacheLineBytes and does not account for the
actual alignment requirement of T. While current usage appears to target cache-line alignment, the
proc is exported and generic, so it should respect alignof(T) to remain correct as usage evolves.

src/lockfreequeues/internal/aligned_alloc.nim[28-48]
src/lockfreequeues/unbounded_sipsic.nim[23-47]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`allocAligned[T]` calls `posix_memalign(..., CacheLineBytes, ...)` for every `T`. If `alignof(T) > CacheLineBytes`, the returned pointer does not satisfy `T`’s alignment contract, which is undefined behavior when the pointer is dereferenced.

### Issue Context
Even if all current callers happen to have `alignof(T) <= CacheLineBytes`, this is a generic exported allocator and is easy to reuse elsewhere.

### Fix Focus Areas
- src/lockfreequeues/internal/aligned_alloc.nim[35-48]

### Suggested change
Compute the requested alignment as `max(CacheLineBytes, alignof(T))` (and ensure it’s a valid `posix_memalign` alignment: power-of-two and multiple of `sizeof(pointer)`). Then pass that computed alignment into `posix_memalign`.




Advisory comments

3. Loony range comment incorrect 🐞 Bug ⚙ Maintainability
Description
The Loony adapter claims it supports values up to uint64.high - 1, but the (v + 1) shl 3
encoding loses the top 3 bits and is only invertible for much smaller values; this mismatch can
mislead future callers and cause silent corruption for large payloads.
Code

benchmarks/nim/adapters/loony_adapter.nim[R32-37]

+## **Encoding range.** ``push`` accepts ``T`` values in
+## ``0 .. uint64.high - 1`` (after the ``+1`` step the upper bound saturates
+## the ``uint64`` width before the ``shl 3``). ``uint64.high`` would wrap
+## to ``0`` and be misread as an empty slot. The harness pushes
+## ``uint64(i)`` for ``i in 0 ..< messageCount`` so this bound is academic
+## at current message-count tuning, but it is the contractual range.
Evidence
The comment states a near-full uint64 range, but the actual encoding multiplies by 8 via a left
shift, which discards high bits on overflow; the push/pop encoding is therefore not a bijection
across that claimed range.

benchmarks/nim/adapters/loony_adapter.nim[32-37]
benchmarks/nim/adapters/loony_adapter.nim[72-88]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The adapter documentation overstates the safe encoding domain. The current `(v + 1) shl 3` scheme is only invertible when the upper 3 bits of `(v + 1)` are zero.

### Issue Context
Today’s bench payloads are small, but the adapter is generic and the comment describes a broader contract than the code actually provides.

### Fix Focus Areas
- benchmarks/nim/adapters/loony_adapter.nim[15-37]
- benchmarks/nim/adapters/loony_adapter.nim[72-88]

### Suggested change
Update the “Encoding range” comment to the true safe range, and (optionally) add a debug-time assert in `push` that rejects values outside the safe domain (guarded so it doesn’t impact `-d:danger` bench builds).




@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 5012581 to 9585d15 Compare May 2, 2026 10:30
elijahr added a commit that referenced this pull request May 2, 2026
@elijahr elijahr force-pushed the feat/bench-rollup-pr3-comparison-mvp branch from fd5b154 to 4ce6826 Compare May 2, 2026 10:33
elijahr added a commit that referenced this pull request May 2, 2026
@elijahr
Owner Author

elijahr commented May 2, 2026

/review

@elijahr
Owner Author

elijahr commented May 4, 2026

/ai-review


@axiomantic-momus axiomantic-momus Bot left a comment


PR adds five third-party comparison adapters (Loony, Boost.LockFree, Crossbeam), a new Rust FFI crate, a dedicated Crossbeam benchmark workflow, cache-line alignment fixes for unbounded queue segments, and supporting tests/docs. The code is well-structured, the soft-skip CI pattern is well-designed, and the alignment fix is correct. Audit demoted the unpinned-deps finding from medium to low: the affected workflow is nightly-only so it does not block PRs.

Severity tally: 3 Lows.

Low

  • BOT-B1 (.github/workflows/bench-comparison.yml:70): bench-comparison.yml uses main branch for sibling deps instead of tagged versions
  • BOT-B2 (benchmarks/rust/bench-ffi-crossbeam/Cargo.toml:1): Rust FFI crate does not set panic = "abort"; panic-unwind across FFI boundary is UB
  • BOT-B3 (benchmarks/nim/adapters/loony_adapter.nim:39): Loony adapter docstring incorrectly claims int.high is 'far below' the encoding-safe limit

Noteworthy

  • The try/finally replacement for try/except/raise in the four unbounded queue variants correctly handles Defect-class exceptions (not just CatchableError), closing a memory leak in the allocation failure path.
  • The crossbeam_link.nim shared module elegantly avoids duplicate {.passL.} emission when both crossbeam adapters are enabled in the same compilation unit.

Verdict: APPROVE.

Commands
  • Comment /ai-review or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
  • Reply to a finding with won't fix, by design, or not a bug to decline it.
  • Reply with instead, ... to propose an alternative fix.

Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.

Comment thread .github/workflows/bench-comparison.yml Outdated
Comment on lines +70 to +71
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git


BOT-B1 — Low (standards)
bench-comparison.yml uses main branch for sibling deps instead of tagged versions
bench-comparison.yml clones nim-debra and nim-typestates from --branch main, but bench.yml and build.yml both pin to --branch v0.6.0 and --branch v0.7.0 respectively. This means the crossbeam comparison workflow can silently pick up breaking changes from the dependency main branches, producing compilation failures or benchmark results that are not comparable to those from bench.yml. The PR description states this workflow "mirrors bench.yml" — the dependency pinning should mirror as well. Severity demoted from medium to low: bench-comparison.yml runs on nightly cron and devel-push only, not on PR — it does not gate merges.

Suggested change
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
Replace `--branch main` with `--branch v0.6.0` for nim-debra and `--branch v0.7.0` for nim-typestates on lines 70-71 of bench-comparison.yml.

Comment on lines +1 to +15
[package]
name = "bench_ffi_crossbeam"
version = "0.1.0"
edition = "2021"
publish = false
description = "C-ABI cdylib wrapping crossbeam-queue::ArrayQueue and SegQueue for the lockfreequeues bench harness"
license = "Apache-2.0 OR MIT"

[lib]
crate-type = ["cdylib", "rlib"]
# rlib is included so `cargo test` can link the integration test binary
# against the same code path that the cdylib exposes.

[dependencies]
crossbeam-queue = "0.3"


BOT-B2 — Low (quality)
Rust FFI crate does not set panic = "abort"; panic-unwind across FFI boundary is UB
The cdylib exposes extern "C" functions called from Nim. Neither panic = "abort" nor catch_unwind wrappers are present. In particular, cb_seg_push calls SegQueue::push which allocates via the global allocator; if the allocator panics (e.g. OOM), the unwind crosses the C FFI boundary, which is undefined behavior per Rust's FFI rules. While OOM is extremely unlikely in CI for a bench crate, this is standard FFI hygiene and low-cost to fix.

Suggested change
[package]
name = "bench_ffi_crossbeam"
version = "0.1.0"
edition = "2021"
publish = false
description = "C-ABI cdylib wrapping crossbeam-queue::ArrayQueue and SegQueue for the lockfreequeues bench harness"
license = "Apache-2.0 OR MIT"
[lib]
crate-type = ["cdylib", "rlib"]
# rlib is included so `cargo test` can link the integration test binary
# against the same code path that the cdylib exposes.
[dependencies]
crossbeam-queue = "0.3"
Add `[profile.release]` with `panic = "abort"` to Cargo.toml, or wrap the crossbeam push/pop calls in `std::panic::catch_unwind`.
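The catch_unwind alternative mentioned in the suggestion can be sketched like this (hypothetical `cb_try_push` name; the panic here is a stand-in for an allocator failure inside the real crossbeam call):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

#[no_mangle]
pub extern "C" fn cb_try_push(do_panic: bool) -> bool {
    // Any panic is caught here and mapped to a failure return code,
    // so the unwind never crosses the C ABI boundary (which would be UB).
    catch_unwind(AssertUnwindSafe(|| {
        if do_panic {
            panic!("allocator failure stand-in");
        }
        true
    }))
    .unwrap_or(false)
}

fn main() {
    // Silence the default panic hook so the caught panic prints nothing.
    std::panic::set_hook(Box::new(|_| {}));
    assert!(cb_try_push(false));
    assert!(!cb_try_push(true));
}
```

`panic = "abort"` in `[profile.release]` is the cheaper fix when no unwinding is ever wanted; catch_unwind is the option when the library must stay unwind-compatible for its own tests.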

Comment on lines +39 to +41
## ``messageCount`` is bounded by ``int.high`` ≈ ``2^63``, far below the
## safe limit at any tuning we'll ever ship, so this is academic — but
## it is the contractual range and a debug-mode assert below makes a


BOT-B3 — Low (quality)
Loony adapter docstring incorrectly claims int.high is 'far below' the encoding-safe limit
The docstring states: "messageCount is bounded by int.high ≈ 2^63, far below the safe limit". However, the safe limit is (1'u64 shl 61) - 2 ≈ 2.3×10^18, while int.high = 2^63 - 1 ≈ 9.2×10^18 — roughly 4× larger than the limit. The actual benched message counts (500K–1M) are well within range, but the comment is factually wrong and could mislead someone tuning message counts in the future.

Suggested change
## ``messageCount`` is bounded by ``int.high`` ≈ ``2^63``, far below the
## safe limit at any tuning we'll ever ship, so this is academic — but
## it is the contractual range and a debug-mode assert below makes a
Correct the comment to reflect that actual bench message counts (e.g., millions) are far below the limit, not `int.high`.
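The arithmetic behind this finding checks out directly (a quick verification sketch, not project code):

```rust
fn main() {
    let safe_limit: u64 = (1u64 << 61) - 2; // encoding-safe max, ~2.3e18
    let int_high: u64 = (1u64 << 63) - 1;   // Nim int.high on 64-bit, ~9.2e18

    // int.high is ABOVE the safe limit, roughly 4x larger -- not "far below" it.
    assert!(int_high > safe_limit);
    assert_eq!(int_high / safe_limit, 4);

    // The actual benched message counts are what sit far below the limit.
    let bench_max: u64 = 1_000_000;
    assert!(bench_max < safe_limit);
}
```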

@elijahr elijahr force-pushed the feat/bench-rollup-pr2-topology-split branch from 2aa47ce to ab19733 Compare May 5, 2026 03:49
elijahr added 21 commits May 4, 2026 22:49
…pects 64B

PR 3 Task 3.1 (cache-line padding audit, step 1 of 3).

Verifies bounded variants already conform; introduces the failing test that
will drive the unbounded fix in 3.2 and 3.3.

Bounded MPMC audit (read-only verification, design doc §4.2):
- src/lockfreequeues/sipsic.nim:19-20    head/tail Atomic[int]    {.align: CacheLineBytes.} ✓
- src/lockfreequeues/sipmuc.nim:42-43    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupsic.nim:42-43    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
- src/lockfreequeues/mupmuc.nim:43-44    head/tail Atomic[uint64] {.align: CacheLineBytes.} ✓
The bounded MPMC cell array also uses the documented array-of-cells trick
(typestates/mpmc_cell.nim:11) so adjacent slots never share a cache line.
No bounded change required.

Unbounded variants currently FAIL both audit conditions:
- Segment fields (head, tail, prevConsumerIdx, committed) lack {.align: CacheLineBytes.}
  pragmas, so they share cache lines with neighbouring data.
- newSegment() uses c_calloc, which on x86_64 Linux glibc returns 16-byte
  aligned memory (MALLOC_ALIGNMENT = 2*sizeof(size_t)). The {.align.} pragma
  controls only the intra-struct offset, not the base.

Test additions:
- tests/t_unbounded_padding.nim — 8 assertions across 4 unbounded variants:
  (a) offsetOf(SegmentField) mod CacheLineBytes == 0 for the producer/consumer
      coordination atomics.
  (b) cast[uint](segPtr) mod CacheLineBytes == 0 for a freshly-allocated
      Segment via the queue's allocator.
- Test-only accessors guarded by `when defined(testing):` in each unbounded
  module: headSegmentForTest (returns pointer) and segmentHeadOffsetForTest
  (returns offsets of audit-relevant fields). Compiled away in release.
- Wired into tests/test.nim aggregator.

RED state confirmed:
  Unbounded queue Segment cache-line padding [Summary] 8 tests run: 0 OK, 8 FAILED
  off.tail mod 64 was 8 (expected 0)
  off.prevConsumerIdx mod 64 was 16
  off.committed mod 64 was 24
  cast[uint](segPtr) mod 64 was 16

Existing test suite continues to compile (verified `nim c tests/test.nim`).

Refs: design doc §4.2; impl plan Task 3.1.
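The two audit conditions above (field offsets and base address both multiples of 64) can be sketched in Rust with a cache-line-padded wrapper (stand-in types; the real audit is the Nim test listed in this commit):

```rust
// Mirrors the {.align: CacheLineBytes.} pragma: each wrapped field
// starts on its own 64-byte cache line.
#[repr(align(64))]
struct CachePadded<T>(T);

#[repr(C)]
struct Segment {
    head: CachePadded<u64>,
    tail: CachePadded<u64>,
    committed: CachePadded<u64>,
}

fn main() {
    let seg = Segment {
        head: CachePadded(0),
        tail: CachePadded(0),
        committed: CachePadded(0),
    };
    let base = &seg as *const Segment as usize;

    // Condition (a): coordination fields land at 64-byte-multiple offsets.
    assert_eq!((&seg.tail as *const _ as usize - base) % 64, 0);
    assert_eq!((&seg.committed as *const _ as usize - base) % 64, 0);

    // Condition (b): the struct's own alignment is a full cache line, so an
    // allocator that honors it yields a 64-byte-aligned base address.
    assert_eq!(std::mem::align_of::<Segment>(), 64);
    assert_eq!(base % 64, 0);
}
```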
PR 3 Task 3.2 (cache-line padding audit, step 2 of 3).

Adds the cache-line-aligned heap allocator that replaces c_calloc in the
unbounded queue Segment allocation path (call sites swap in Task 3.3).

Compile probe (impl plan §3.2.0):
  $ nim c -r tmp_probe.nim
  rc=0
  align=0
verifying that std/posix.posix_memalign is available on macOS (libSystem)
and Linux glibc with the same C signature, returning 64-byte aligned memory
on success. Probe file deleted post-verification per impl plan.

Implementation (src/lockfreequeues/internal/aligned_alloc.nim):
- ``proc allocAligned*[T](): ptr T`` — calls posix_memalign with
  CacheLineBytes (64) alignment, zero-inits, raises OutOfMemDefect on
  rc != 0. Caller releases via the standard c_free per POSIX guarantee
  that posix_memalign blocks are free-compatible.
- Re-exports CacheLineBytes from atomic_dsl for callers in 3.3.
- isMainModule smoke confirms tail-call alignment.

Tests (tests/t_aligned_alloc.nim, wired into tests/test.nim):
- Tiny payload (sizeof < 64).
- Line-sized payload (sizeof == 64).
- Big payload (sizeof > 64).
- 64 sequential allocations all 64B-aligned.
All 4 cases assert pointer alignment AND zero-initialization. Result:
  internal/aligned_alloc.allocAligned ....
  [Summary] 4 tests run: 4 OK, 0 FAILED.

The unbounded padding audit test (tests/t_unbounded_padding.nim) is still
RED at this commit; Task 3.3 swaps the Segment allocation call sites and
adds {.align: CacheLineBytes.} pragmas, taking it GREEN.

Refs: design doc §4.2; impl plan Task 3.2.
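The allocator contract described above (64-byte alignment, zero-init, null/error on OOM) looks like this in a Rust std::alloc sketch (an analogue of the posix_memalign-based allocAligned, not a translation of it):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

const CACHE_LINE_BYTES: usize = 64;

// Cache-line-aligned, zero-initialized allocation; returns null on OOM,
// analogous to posix_memalign's rc != 0 path.
unsafe fn alloc_aligned(size: usize) -> *mut u8 {
    let layout = Layout::from_size_align(size, CACHE_LINE_BYTES).expect("bad layout");
    alloc_zeroed(layout)
}

fn main() {
    unsafe {
        let size = 200; // payload larger than one cache line
        let p = alloc_aligned(size);
        assert!(!p.is_null());
        assert_eq!(p as usize % CACHE_LINE_BYTES, 0);  // base aligned
        assert!((0..size).all(|i| *p.add(i) == 0));    // zero-initialized
        // Unlike posix_memalign blocks (free-compatible per POSIX), Rust
        // requires the matching Layout on release.
        dealloc(p, Layout::from_size_align(size, CACHE_LINE_BYTES).unwrap());
    }
}
```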
PR 3 Task 3.3 (cache-line padding audit, step 3 of 3 — GREEN).

Applies the cache-line padding fix to all four unbounded queue variants and
their per-segment producer/consumer atomics. Resolves the asymmetry called
out in design doc §4.2: bounded variants were already padded; unbounded
Segments were not, so cross-library comparison numbers in PR 3 would have
been biased against lockfreequeues' unbounded types.

Production changes (4 files):
- src/lockfreequeues/unbounded_sipsic.nim
  Segment.{next, head, tail} + UnboundedSipsic.{headSegment, tailSegment}
  gain {.align: CacheLineBytes.}. newSegment swaps c_calloc -> allocAligned.
- src/lockfreequeues/unbounded_sipmuc.nim
  Segment.{next, tail, prevConsumerIdx} + UnboundedSipmuc.{headSegment,
  tailSegment} gain alignment. newSegment swaps to allocAligned.
- src/lockfreequeues/unbounded_mupsic.nim
  Segment.{next, tail, committed} + UnboundedMupsic.{headSegment,
  tailSegment} gain alignment. newSegment swaps to allocAligned.
  ``head: int`` (single consumer, non-atomic) is intentionally left
  unpadded — only the consumer touches it.
- src/lockfreequeues/unbounded_mupmuc.nim
  Segment.{next, tail, prevConsumerIdx, committed} + UnboundedMupmuc.{
  headSegment, tailSegment} gain alignment. newSegment swaps to
  allocAligned.

The ``c_calloc`` import is dropped; ``c_free`` retained for segment
release (POSIX guarantees posix_memalign blocks are free-compatible).

Cross-platform fix in aligned_alloc.nim:
- ``std/posix.posix_memalign`` triggers a clang error under ``nim cpp`` on
  macOS because the SDK declares the first parameter ``__unsafe_indexable``
  (incomplete-type ``void *`` vs the wrapper's ``void **``).
- Replaced the ``import std/posix`` with a local ``importc, header:
  "<stdlib.h>"`` shim using the canonical C signature, which clang accepts
  in both C and C++ modes. Documented in aligned_alloc.nim header comment.

Verification:
- tests/t_unbounded_padding.nim: 8/8 GREEN under both nim c and nim cpp.
- Full suite under nim c (orc), nim cpp, --mm:arc, --mm:refc:
    [Summary] 200 tests run: 200 OK, 0 FAILED, 0 SKIPPED  (all four variants)
- t_unbounded_*_threaded suites pass — no false-sharing-induced timing
  regression in the threaded code paths.

Refs: design doc §4.2; impl plan Task 3.3.
PR 3 Task 3.5 (comparison MVP, first FFI adapter).

Adds the Loony unbounded MPMC adapter and a parametric round-trip test in
tests/t_bench_adapters.nim, both gated on -d:adapter_loony_available so
the in-tree test suite passes vacuously when the package is absent.

Files:
- benchmarks/nim/adapters/loony_adapter.nim — adapter, topologiesSupported
  = {tMpmcUnbounded}, makeLoonyAdapter / push / pop / cleanup / name.
- tests/t_bench_adapters.nim — push/pop 1000 uint64 round-trip preserves
  set; parametric template extends to subsequent adapters in 3.6–3.10.

Loony storage-encoding fix (NOT a hack — required by Loony's design):
Loony reserves the low 3 bits of each slot for RESUME/WRITER/READER state
flags (loony/spec.nim: SLOTMASK = ~0x7). Loony's empty-slot detection
reads ``(prev and SLOTMASK) == 0``, treating a zero "pointer" payload as
"slot empty". Loony was designed for ref/ptr types whose addresses are
8-byte-aligned and never null. Pushing the literal value 0 (or any value
whose low 3 bits are non-zero) breaks Loony's pop path.

Adapter encodes the harness's ``uint64(i)`` payload as ``(v + 1) shl 3``
on push and ``(raw shr 3) - 1`` on pop:
- ``shl 3`` clears the low flag bits.
- ``+ 1`` ensures the encoded value is never zero.
This is documentation-equivalent to how Loony users push 8-byte-aligned,
non-null ref addresses; it does NOT change Loony's algorithm or affect
the comparison fairness — every push/pop still goes through Loony's full
fast/slow paths.

First-iteration RED state (recorded for the audit):
- Plain push of ``uint64(i)`` returned only 125 of 1000 items, all
  multiples of 8 (i.e. the values whose ``i and 7 == 0``).
- ``shl 3`` alone returned 999 of 1000 items (everything except 0,
  which collided with Loony's empty-slot sentinel).
- ``(v + 1) shl 3`` returned all 1000 items, set-equal to the input.

Verification: tests/t_bench_adapters.nim with -d:adapter_loony_available
+ --path:loony-0.3.1 --path:arc-0.0.4 -> [Summary] 1 tests run: 1 OK.
Full in-tree suite (-d:testing without the loony gate) -> 200/200 OK.

Refs: design doc §3 PR 3; impl plan Task 3.5; loony/spec.nim SLOTMASK.
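The SLOTMASK interaction recorded above can be reproduced in a small Rust sketch (illustrative of the bit behavior only; Loony's real pop path is more involved):

```rust
// Loony reserves the low 3 bits of each slot for state flags.
const SLOT_MASK: u64 = !0x7;

fn encode(v: u64) -> u64 { (v + 1) << 3 }
fn decode(raw: u64) -> u64 { (raw >> 3) - 1 }

fn main() {
    // Plain payload 5 has non-zero low bits; masked, it reads as an empty slot.
    assert_eq!(5u64 & SLOT_MASK, 0);

    // Encoded payloads keep the flag bits clear, are never zero, and round-trip.
    for v in [0u64, 1, 7, 1000] {
        let raw = encode(v);
        assert_eq!(raw & 0x7, 0);       // flag bits untouched
        assert_ne!(raw & SLOT_MASK, 0); // never mistaken for the empty sentinel
        assert_eq!(decode(raw), v);
    }
}
```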
Boost.LockFree is C++ header-only (BSL-1.0). Adapter wraps
boost::lockfree::queue<unsigned long long> with capacity-bounded
push (bounded_push) and pop. Topology: tMpmc.

Compile gates:
  -d:adapter_boost_lockfree_queue_available  -- enable
  nim cpp                                     -- required (else hard error)

Pulls headers from the conventional install paths (apt /usr/include,
brew /opt/homebrew/opt/boost/include, /usr/local/include) and accepts
-d:boostIncludeDir=<path> for non-standard locations.

Adds smoke_boost.nim (compile-and-run sanity for CI) and a parametric
round-trip test in tests/t_bench_adapters.nim gated by both the define
and 'cpp' build mode.

Adapter capacity argument is the fixed node-pool size; pushes that
exceed it return prFull. Default 1024 mirrors other bounded adapters.
Boost.LockFree's spsc_queue is a wait-free single-producer single-consumer
ring buffer. The pop overload used here -- pop(out, max_size) -- returns
the count of items popped (csize_t); 1 means success, 0 means empty.

Compile gates:
  -d:adapter_boost_lockfree_spsc_available  -- enable
  nim cpp                                    -- required

Topology: tSpsc. Capacity is the fixed ring size (default 1024). Header
search paths and indirection rationale identical to boost_lockfree_queue
adapter; smoke_boost.nim covers both adapters under one binary.
C-ABI shim around crossbeam_queue::ArrayQueue<u64> (bounded MPMC) and
crossbeam_queue::SegQueue<u64> (unbounded MPMC). Exposes 8 extern "C"
fns (cb_array_init/push/pop/destroy + cb_seg_init/push/pop/destroy)
that the Nim adapters in tasks 3.9 / 3.10 will importc.

- Cargo.toml: crate-type = [cdylib, rlib]; rlib so cargo test can link
  the integration test against the same code path the cdylib exports.
- rust-toolchain.toml pins stable channel for reproducibility.
- src/lib.rs: 8 extern fns; null-pointer safe; ArrayQueue::new(0) panic
  is intercepted (returns null) so we never unwind across FFI.
- tests/integration_test.rs: 6 tests covering round-trip set equality
  for both queues, full/empty edges, and null-pointer tolerance.

Build: cargo build --release  -> target/release/libbench_ffi_crossbeam.{so,dylib}
.gitignore extended for benchmarks/rust/*/target/ and Cargo.lock.
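The handle-based C-ABI pattern used by the shim (opaque pointer from Box::into_raw, null-safe push/pop, explicit destroy) can be sketched as follows. This stand-in uses a Mutex<VecDeque> instead of crossbeam's lock-free queues, and the q_* names are hypothetical; it illustrates the FFI surface shape, not the real cb_* functions:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

type Handle = Mutex<VecDeque<u64>>;

#[no_mangle]
pub extern "C" fn q_init() -> *mut Handle {
    Box::into_raw(Box::new(Mutex::new(VecDeque::new())))
}

#[no_mangle]
pub extern "C" fn q_push(h: *mut Handle, v: u64) -> bool {
    if h.is_null() { return false; } // null-pointer tolerance
    unsafe { &*h }.lock().unwrap().push_back(v);
    true
}

#[no_mangle]
pub extern "C" fn q_pop(h: *mut Handle, out: *mut u64) -> bool {
    if h.is_null() || out.is_null() { return false; }
    match unsafe { &*h }.lock().unwrap().pop_front() {
        Some(v) => { unsafe { *out = v }; true } // false == queue was empty
        None => false,
    }
}

#[no_mangle]
pub extern "C" fn q_destroy(h: *mut Handle) {
    if !h.is_null() { drop(unsafe { Box::from_raw(h) }); }
}

fn main() {
    let q = q_init();
    for v in 0..1000u64 { assert!(q_push(q, v)); }
    let mut out = 0u64;
    let mut got = Vec::new();
    while q_pop(q, &mut out) { got.push(out); }
    assert_eq!(got, (0..1000u64).collect::<Vec<_>>()); // round-trip preserved
    assert!(!q_push(std::ptr::null_mut(), 1));         // null tolerance
    q_destroy(q);
}
```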
…lib)

importc the cb_array_init/push/pop/destroy fns from
benchmarks/rust/bench-ffi-crossbeam (Task 3.8) and surface them as a
typed Nim adapter with topologiesSupported = {tMpmc}.

The adapter avoids emitting -L/-lbench_ffi_crossbeam when its sibling
crossbeam_seg_queue_adapter gate is also active, so combined builds
do not produce 'ignoring duplicate libraries' warnings on Apple ld.

Adds smoke_crossbeam.nim covering both crossbeam adapters under one
binary (Task 3.10 lands the seg adapter); tests/t_bench_adapters.nim
extended with the gated parametric round-trip suite.

Compile gates:
  -d:adapter_crossbeam_array_queue_available
  -d:crossbeamLibDir=<path>   (override search path for non-default builds)
…lib)

importc the cb_seg_init/push/pop/destroy fns from the bench-ffi-crossbeam
cdylib. SegQueue is unbounded -- push always succeeds (returns prSuccess
when the handle is non-null; the underlying cb_seg_push only reports
failure on null pointers).

topologiesSupported = {tMpmcUnbounded}. capacity arg accepted-but-ignored
so the wire-up sites (Task 3.11) can pass capacity uniformly across all
adapters without a per-adapter conditional.

Compile gate: -d:adapter_crossbeam_seg_queue_available
Optional override: -d:crossbeamLibDir=<path>
Each MVP adapter is pulled in only when its compile-time gate is set;
absent gates leave production builds unchanged. Variant lists are
populated at compile time via 'when declared(...)' so the CLI's
--list / unknown-variant error messages reflect the actual compiled-in
set.

Wiring map (per design 2.4 / Track 3 §3.11):

  bench_spsc:
    - boost_lockfree_spsc -> spsc/1p1c (gated by
      adapter_boost_lockfree_spsc_available)

  bench_mpmc:
    - boost_lockfree_queue   -> mpmc/{1,2,4}p{1,2,4}c
    - crossbeam_array_queue  -> mpmc/{1,2,4}p{1,2,4}c
    (each gated by its respective adapter_*_available define)

  bench_unbounded:
    - loony                  -> mpmc_unbounded/{1,2,4}p{1,2,4}c
    - crossbeam_seg_queue    -> mpmc_unbounded/{1,2,4}p{1,2,4}c

bench_mpsc has no MVP coverage today: none of the MVP libraries provide
a dedicated MPSC API at the natural shape; design 2.4 is silent on MPSC
parity for Track 3 and the existing in-tree mupsic baseline stays the
sole MPSC slug.

Verified locally: bench_mpmc + bench_unbounded produce expected
crossbeam_* slugs at the small fixture (10K msg, 2 runs).
workflow_dispatch inputs force_skip_boost / force_skip_loony surface
the override hook; FORCE_SKIP_BOOST / FORCE_SKIP_LOONY env vars relay
them into step-level if: gates.

Per library: install -> smoke -> set ADAPTER_*=1 on success -> annotate
on any failure. The combined defines + compile-mode (c vs cpp) flow
into the per-binary compile step via a final 'Build adapter define
flags' step that populates step.outputs.{flags,mode}.

Boost (libboost-dev) is wired into bench_spsc + bench_mpmc only.
Loony (nimble registry) is wired into bench_unbounded only. Other
binaries skip both install steps entirely (matrix-binary if-guards),
so the per-binary critical path is unchanged when no MVP adapter
applies.

Both install steps use continue-on-error: true; failure flips the
binary's compile flags so MVP slugs are simply omitted from the BMF
instead of failing the workflow (design Risk 4 soft-skip pattern).
Annotate-skipped emits ::warning title=Adapter skipped which surfaces
on the PR check summary.
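The install -> smoke -> set-flag -> annotate chain can be sketched in workflow YAML (hypothetical step ids and env names; the real steps live in bench.yml):

```yaml
# Hedged sketch of the design §2.6 soft-skip pattern, Boost leg.
- name: Install Boost
  id: boost-install
  if: env.FORCE_SKIP_BOOST != 'true'
  continue-on-error: true
  run: sudo apt-get install -y libboost-dev

- name: Smoke-test Boost adapters
  id: boost-smoke
  if: steps.boost-install.outcome == 'success'
  continue-on-error: true
  run: nim cpp -d:adapter_boost_lockfree_queue_available -r benchmarks/nim/smoke/smoke_boost.nim

- name: Enable Boost compile flags
  if: steps.boost-smoke.outcome == 'success'
  run: echo "ADAPTER_BOOST=1" >> "$GITHUB_ENV"

- name: Annotate skip
  if: steps.boost-install.outcome == 'failure' || steps.boost-smoke.outcome == 'failure'
  run: echo "::warning title=Adapter skipped::Boost.LockFree unavailable; MVP slugs omitted"
```

Because both fallible steps carry continue-on-error, a failure downgrades to the warning annotation instead of failing the workflow.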
Crossbeam adds a Rust toolchain dependency (~5 min cold install) which
would inflate the bench.yml critical path on every PR. Per design 2.6
and impl plan §3.13, Crossbeam runs in this dedicated workflow instead:

  triggers:
    - schedule: nightly 04:00 UTC
    - workflow_dispatch
    - push to devel that touches benchmarks/rust/** or
      benchmarks/nim/adapters/crossbeam_*

  steps:
    - dtolnay/rust-toolchain@stable + Swatinem/rust-cache@v2
    - cargo build --release of bench-ffi-crossbeam
    - cargo test of the cdylib (sanity)
    - smoke_crossbeam.nim compile-and-run
    - bench_mpmc with -d:adapter_crossbeam_array_queue_available
    - bench_unbounded with -d:adapter_crossbeam_seg_queue_available
    - merge_bmf.py + Bencher upload (separate Report from bench.yml)

Linker resolution: the {.passL.} flags in the Nim adapters add the
default -L; we additionally pass -Wl,-rpath,$(pwd)/.../target/release
so the dylib resolves at runtime without LD_LIBRARY_PATH.

actionlint clean.
benchmarks/README.md:
- New 'Comparison MVP' subsection: table of libraries x topologies x
  compile gates x install commands.
- Documents the bench.yml soft-skip flow (Loony + Boost) vs
  bench-comparison.yml nightly cron (Crossbeam) split.
- Local-run snippets for each adapter.
- Mentions nim/smoke/ and rust/bench-ffi-crossbeam/ in the structure
  list.

THIRD_PARTY_LICENSES.md (new):
- Per-vendor block schema (matches design doc §4.5).
- Records Loony (MIT, nimble-managed), Boost.LockFree (BSL-1.0,
  system package), and Crossbeam (Apache-2.0 OR MIT, Cargo-managed
  via the in-tree cdylib shim).
- Reserves entries for concurrentqueue (PR 4) and uPlot (PR 5).
Records: five third-party adapters (Loony, Boost x2, Crossbeam x2),
the bench-ffi-crossbeam Rust cdylib, smoke binaries under
benchmarks/nim/smoke/, the new bench-comparison.yml workflow, the
soft-skip flow added to bench.yml, the new THIRD_PARTY_LICENSES.md,
the cache-line padding fix for unbounded queue segments via the new
internal/aligned_alloc.nim helper, and the supporting test
infrastructure (t_unbounded_padding.nim, t_aligned_alloc.nim,
the t_bench_adapters.nim parametric round-trip suite).
…xonomy

merge_bmf.py enforces <library>/<topology>/<P>p<C>c (exactly 3 segments).
Three call sites violated this:

- "loony/LoonyQueue" produced 4-segment slugs
- "crossbeam_queue/SegQueue" produced 4-segment slugs
- "crossbeam_queue" for ArrayQueue collided with the SegQueue lib name

Use the canonical lib slugs from the design doc ("loony",
"crossbeam_array_queue", "crossbeam_seg_queue") so merge passes.

Loony is soft-skipped in PR CI today and Crossbeam runs only in nightly
cron, so this is preventative for downstream PRs and post-merge runs.
The cache-line padding fix swapped c_calloc -> allocAligned in newSegment
but missed the manager allocations in the newUnboundedSipmuc/Mupsic/Mupmuc
auto-create constructors that landed via #18 (e8ee15b). The import audit had
already dropped c_calloc as unused, so those constructors broke at compile
time. Manager state (epoch counter, limbo bag pointers, registration table)
benefits from the same 64-byte base alignment as Segment anyway.

Same allocAligned helper, same OutOfMemDefect failure mode, same c_free
release path (POSIX guarantees posix_memalign blocks are free-compatible).
…range

Per Gemini review on PR #22.

1. Multi-consumer race in pop(): isEmpty() followed by pop() is two
   distinct calls. Between them, another consumer can drain the queue,
   leaving the second consumer to pop a default uint64 (0) and emit a
   spurious success with garbage payload. Switch to popping
   unconditionally and reading raw == 0 as the empty sentinel — the
   adapter's own (v+1) shl 3 encoding guarantees pushed values are
   never 0, so raw == 0 unambiguously means 'queue was empty when pop
   ran' even under concurrent draining.

2. Encoding range note in the module docstring: (uint64(item) + 1) shl 3
   wraps to 0 for item == uint64.high and would be misread as an empty
   slot. The harness pushes uint64(i) for i in 0 ..< messageCount so
   the bound is academic at current tuning, but contractually the
   adapter accepts uint64.high - 1 as the inclusive maximum.
Three code-review fixes on PR 3:

1. allocAligned[T] always passed CacheLineBytes to posix_memalign
   regardless of alignof(T). For an over-aligned T (e.g. SSE/AVX
   vectors or {.align: 128.} pragmas), the returned pointer would
   not satisfy T's alignment contract — UB on dereference. Use
   max(CacheLineBytes, alignof(T)) so the cache-line invariant is
   preserved while honoring T's own alignment. Computed at compile
   time; runtime alignment argument stays constant per instantiation.

2. bench-comparison.yml runs bencher run --github-actions but did not
   declare job-level permissions. On repos with default read-only
   GITHUB_TOKEN, this would prevent check-run / PR-annotation
   publishing. Mirror bench.yml's bench-upload permissions block:
   contents:read, actions:read, pull-requests:write, checks:write.

3. The loony_adapter Encoding range comment claimed
   '0 .. uint64.high - 1' but the actual encoding (v + 1) shl 3
   discards the top 3 bits, so the round-trip is only invertible for
   v <= (1 shl 61) - 2. Document the true safe range and add a
   debug-mode assert that catches misuse in non-release builds. The
   assert is stripped under -d:release / -d:danger so bench
   measurement runs see zero overhead.
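Fix 1's max(CacheLineBytes, alignof(T)) rule can be sketched in Rust with an over-aligned payload (illustrative stand-in; the real change is in aligned_alloc.nim):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::mem::{align_of, size_of};

const CACHE_LINE_BYTES: usize = 64;

// Over-aligned payload: its own alignment (128) exceeds the cache line.
#[repr(align(128))]
struct Wide([u8; 128]);

unsafe fn alloc_aligned<T>() -> *mut T {
    // max() keeps the cache-line invariant while honoring T's own alignment;
    // a constant 64 here would violate Wide's alignment contract.
    let align = CACHE_LINE_BYTES.max(align_of::<T>());
    let layout = Layout::from_size_align(size_of::<T>(), align).unwrap();
    alloc_zeroed(layout) as *mut T
}

fn main() {
    unsafe {
        let p = alloc_aligned::<Wide>();
        assert!(!p.is_null());
        assert_eq!(p as usize % 128, 0); // satisfies T's alignment, not just 64
        dealloc(
            p as *mut u8,
            Layout::from_size_align(size_of::<Wide>(), 128).unwrap(),
        );
    }
}
```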
aligned_alloc:
  - Add Windows path via _aligned_malloc / _aligned_free (POSIX path
    keeps posix_memalign + free). Library is no longer Linux/macOS-only.
  - Add freeAligned(pointer) plus a typed ptr T overload so callers in
    untyped destructor hooks share the same call site as typed callers.
  - Migrate all c_free callsites that pair with allocAligned across the
    four unbounded queues (segments, manager auto-create slot, retired-
    segment destructor) to freeAligned. Drop the now-unused c_free
    imports from each unbounded source.

unbounded_mupsic Segment.head:
  - Add {.align: CacheLineBytes.} to the consumer head field. Without it,
    head sat in the same 64-byte chunk as the producer-shared tail and
    every consumer write invalidated producers' tail line. Extend the
    cache-line padding audit (segmentHeadOffsetForTest +
    t_unbounded_padding) to assert head's alignment so the regression
    cannot recur.

manager-init failure path (mupmuc + mupsic + sipmuc auto-create overload):
  - Replace `try ... except: cleanup; raise` with `try ... finally: if
    not ok: cleanup`. Under Nim 2.0, bare `except:` matches only
    CatchableError -- an OutOfMemDefect from inside initDebraManager
    would skip cleanup and leak the heap-allocated DebraManager slot.
    `finally` runs unconditionally on Defect-class raises too. Refresh
    the comment that referred to a c_calloc zeroing path that no longer
    exists (allocAligned does the zeroing now).

boost adapters (queue + spsc_queue):
  - Switch backing storage from alloc0 to allocAligned[T] / freeAligned
    so the placement-constructed C++ object gets cache-line alignment,
    not just alignof(max_align_t).
Previous design gated the array adapter's link emission on the seg
adapter's gate being unset, so the seg adapter "owns" the flags when
both are enabled. This breaks the realistic CI configuration where
both adapter_crossbeam_*_available defines are set globally but a
given bench binary imports only one adapter. With both gates set and
only the array adapter imported, the array adapter's "skip" branch
fires and the seg adapter's emission block is never compiled in -
linker fails on undefined cb_array_* symbols.

Move the {.passL.} pragmas into crossbeam_link.nim. Both adapters
import it from inside their availability gate, so the link flags
emit exactly once per bench binary that uses ANY crossbeam adapter,
independent of which gates are set elsewhere. Nim's per-module
processing dedups the pragmas when both adapters are imported.
…ect loony encoding-range docstring

- bench-comparison.yml: pin nim-debra v0.6.0 / nim-typestates v0.7.0
  (was tracking 'main', non-deterministic).
- bench-ffi-crossbeam: add [profile.release] panic = 'abort'. Unwinding
  a Rust panic across the C-ABI into Nim is undefined behavior.
- loony_adapter docstring: int.high (2^63 - 1) is ABOVE the 2^61 - 2
  encoding ceiling, not 'far below' it. Type bound alone does not
  guarantee safety; reframe around the shipped tuning range and keep
  the debug-mode assert as the actual safeguard.
@elijahr elijahr force-pushed the feat/bench-rollup-pr3-comparison-mvp branch from b233388 to 41a08f3 Compare May 5, 2026 03:50
elijahr added a commit that referenced this pull request May 5, 2026
merge_bmf.py rejects slugs that aren't <library>/<topology>/<P>p<C>c
(exactly 3 segments). PR #23's CI failed on the moodycamel slug:

  invalid slug shape 'moodycamel/ConcurrentQueue/mpmc_unbounded/1p1c'

The same 4-segment violations exist for loony and crossbeam_seg_queue
in bench_unbounded.nim, and the crossbeam_array_queue dispatch was
emitting a colliding "crossbeam_queue" lib name. Apply the canonical
lib slugs from the design doc so all dispatch sites yield 3 segments
after the runner appends "/<topology>/<shape>":

  loony/LoonyQueue        -> loony
  crossbeam_queue/SegQueue -> crossbeam_seg_queue
  crossbeam_queue (Array)  -> crossbeam_array_queue
  moodycamel/ConcurrentQueue -> moodycamel

The PR #22 worktree applies the loony and crossbeam fixes too;
git rebase will harmonize the duplicate edits.

The adapter `name` constants (e.g. "moodycamel/ConcurrentQueue[uint64]")
are human-readable echo strings, not BMF slugs, and stay as-is.
@elijahr
Owner Author

elijahr commented May 5, 2026

Consolidated into #19 — fast-forwarded feat/bench-rollup-pr0-bench-common to this branch's tip; all commits preserved in #19. Closing this stacked PR as redundant.

@elijahr elijahr closed this May 5, 2026
@elijahr elijahr deleted the feat/bench-rollup-pr3-comparison-mvp branch May 5, 2026 04:19