bench-rollup: shared harness + comparison adapters + interactive charts#19
Conversation
Code Review
This pull request overhauls the benchmarking infrastructure, replacing the legacy aggregator and regex-based parser with native Bencher Metric Format (BMF) emission in bench_throughput.nim. It introduces a shared harness module (bench_common.nim), a comprehensive set of lock-free queue adapters, and a merge_bmf.py utility for combining results. Review feedback focuses on correcting a type hint in the Python merge script and ensuring CLI error messages are directed to stderr instead of stdout to follow best practices.
```python
def _die(msg: str) -> "int":
    """Print to stderr and return exit code 1."""
    print(msg, file=sys.stderr)
    return 1
```
```nim
if p.val.len == 0:
  echo "Missing value for --bmf-out"
  quit 1
```

```nim
echo "Unknown flag: --", p.key
quit 1
```

```nim
echo "Unknown variant group: ", arg
echo "Supported: ", SupportedGroups
quit 1
```
Compile-only stub for benchmarks/nim/bench_common.nim. Types and proc signatures are final per design doc section 2.1; bodies raise AssertionDefect with 'not implemented' so tasks 0.2-0.6 can be written and reviewed in parallel against this surface.

Public surface stubbed:
- Topology enum (tSpsc/tMpsc/tMpmc + unbounded variants)
- PushResult / PopResult adapter primitives
- BMFEmitter + initBMFEmitter / addMeasure / emit
- Stats helpers (mean / stddev / minVal / maxVal / percentile)
- Histogram + initHistogram / record / topK / percentile (HistogramTopK = 1000, HistogramReservoir = 99000)
- LatencyMetrics + runLatencyHarness
- ThroughputMetrics + runThroughputHarness

Test (tests/t_bench_common.nim) is compile-time-only: it references each promised symbol via 'when declared' and 'when compiles', leaving behavior tests to subsequent tasks. Stub file: 170 LOC (acceptance cap: 200).
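A minimal sketch of the stub-plus-compile-check pattern this commit describes. The enum members and proc names come from the list above; the parameter lists and the two-symbol scope are illustrative assumptions, not the shipped 170-LOC surface.

```nim
# Stub pattern: the signature is final, the body defers to later tasks.
type
  Topology* = enum
    tSpsc, tMpsc, tMpmc   # unbounded variants elided in this sketch

proc mean*(xs: openArray[float]): float =
  raise newException(AssertionDefect, "not implemented")

# Compile-time-only test in the style of tests/t_bench_common.nim:
# reference each promised symbol without executing anything.
when declared(mean):
  when compiles(mean([1.0, 2.0])):
    discard
```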
initBMFEmitter / addMeasure / emit per design doc section 2.1.
- addMeasure stores via Table[slug, Table[measure, MeasureValue]];
NaN sentinel for lower/upper bounds maps to none[float](),
finite values map to some(v).
- emit writes JSON with slugs alpha-sorted at the top level and
measures alpha-sorted within each slug. lower_value/upper_value
fields are emitted only when the corresponding bound is some.
- Output format: pretty(root, 2) plus a trailing newline so
diffs in CI / local-dev surface line-level changes cleanly.
Tests (tests/t_bench_common.nim): empty emit produces {}, two
slugs alpha-ordered, measures alpha-ordered within slug, NaN bounds
omitted from output, finite bounds emitted as lower_value/upper_value.
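A hedged usage sketch of the emitter surface described above. Only the proc names (initBMFEmitter, addMeasure, emit) and the NaN-sentinel bound convention come from the commit message; the parameter names and emit taking a path are assumptions.

```nim
import std/math   # NaN

var em = initBMFEmitter()
# Finite bounds: lower_value/upper_value appear in the JSON.
em.addMeasure("lockfreequeues_sipsic/spsc/1p1c", "throughput_ops_ms",
              value = 1234.5, lower = 1200.0, upper = 1270.0)
# NaN sentinel: the corresponding bound is omitted entirely.
em.addMeasure("nim_channels/mpmc/1p1c", "throughput_ops_ms",
              value = 987.6, lower = NaN, upper = NaN)
# Slugs alpha-sorted at top level, measures alpha-sorted per slug,
# pretty-printed with 2-space indent plus a trailing newline.
em.emit("throughput.json")
```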
mean / stddev / minVal / maxVal / percentile bodies. Sort-based
percentile uses linear interpolation between adjacent samples;
matches numpy's default percentile interpolation ('linear').
stddev uses ddof=1 (sample) for consistency with numpy default.
Empty-input behavior matches the legacy benchmarks/nim/stats.nim
that this module replaces (returns 0.0 across the board).
Tests: numpy-baseline parity for stddev on [1,2,3,4]; min/max on
unordered input; percentile(p=0.5) on [0..99] = 49.5; percentile
extremes (p=0.0, p=1.0) match min and max.
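A self-contained sketch of the sort-based percentile with linear interpolation, matching the numpy-parity claims above; the shipped proc in bench_common.nim may differ in name and signature.

```nim
import std/algorithm

proc percentileSketch(samples: openArray[float]; p: float): float =
  if samples.len == 0:
    return 0.0                       # legacy stats.nim empty-input behavior
  var xs = @samples
  xs.sort()
  let rank = p * float(xs.high)      # fractional rank in [0, len-1]
  let lo = int(rank)
  let hi = min(lo + 1, xs.high)
  xs[lo] + (xs[hi] - xs[lo]) * (rank - float(lo))

var v = newSeq[float](100)
for i in 0 ..< 100: v[i] = float(i)
assert percentileSketch(v, 0.5) == 49.5   # the documented [0..99] case
assert percentileSketch(v, 0.0) == 0.0    # p=0.0 matches min
assert percentileSketch(v, 1.0) == 99.0   # p=1.0 matches max
```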
Online percentile tracker per design doc section 2.1.
- topKHeap: min-heap of HistogramTopK = 1000 largest values seen.
record() pushes if heap < K; else replace if value > heap[0]
(evicts the min into the body bucket via reservoirAdmit).
- reservoir: Algorithm R uniform sample of body distribution, size
HistogramReservoir = 99_000.
- debugMode (initHistogram(true)): records every sample to debugAll
and percentile() returns the exact sort answer over debugAll. Used
as the test oracle and as a debug-only fallback per design 2.1.
Lookup rule with stratified-estimator math:
- p == 1.0 -> top-K max
- seenAll*(1-p) <= topKHeap.len -> top-K rank lookup (linear interp)
- otherwise -> reservoir, with p rescaled to p_body = (p*seenAll) / (seenAll - K), since the reservoir samples ONLY the body stratum (values that lost the top-K race). Without rescaling, a reservoir-based p99 underestimates by ~25% on log-normal data.
Tests: p99 within 1% of sort fallback on 100K log-normal samples;
percentile(1.0) == top-K max; p50 within 5% of sort fallback;
debug mode returns exact sort answer; topK() returns ascending-sorted
samples.
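The lookup rule above, restated as runnable Nim. A sketch under assumptions: the heap is snapshotted into an ascending-sorted seq, interpolation in the top-K branch is elided, and the names differ from the real Histogram.

```nim
import std/algorithm

proc sortedLerp(xs: seq[float]; p: float): float =
  # Linear-interpolated percentile over a plain sample.
  if xs.len == 0: return 0.0
  var s = xs
  s.sort()
  let rank = p * float(s.high)
  let lo = int(rank)
  let hi = min(lo + 1, s.high)
  s[lo] + (s[hi] - s[lo]) * (rank - float(lo))

proc stratifiedPercentile(topK: seq[float];      # K largest values, ascending
                          reservoir: seq[float]; # Algorithm R body sample
                          seenAll: int; p: float): float =
  let k = topK.len
  if p >= 1.0:
    return topK[^1]                              # exact max
  if float(seenAll) * (1.0 - p) <= float(k):
    # The requested tail rank lands inside the exact top-K stratum.
    let fromTop = float(seenAll) * (1.0 - p)     # samples above the quantile
    let r = max(1, int(fromTop))                 # rank from the top; 1 = max
    return topK[k - r]                           # (linear interp elided)
  # Body stratum: rescale p because the reservoir never saw the top K.
  let pBody = (p * float(seenAll)) / float(seenAll - k)
  sortedLerp(reservoir, min(1.0, pBody))
```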
Generic N-producer / N-consumer throughput runner factored out of
the legacy benchmarks/nim/bench_throughput.nim. Parameterized over the
queue init proc and the queue type Q (constrained at instantiation
time via 'mixin push' / 'mixin pop' so adapter-specific push/pop are
resolved at the call site).
The remainder-spread deadlock-avoidance rule is preserved verbatim:
if messageCount mod numProducers != 0, the leading producers each get
one extra item so the producer-side total exactly equals messageCount;
same for consumers. Without this, the harness silently deadlocks
(consumers wait forever) or strands items in the queue (consumers
stop early).
Returns ThroughputMetrics{ops_ms_mean, ops_ms_stddev, runs}; warmup
runs are discarded. Smoke test in tests/t_bench_common.nim with a
1024-cap channel SmokeAdapter at 1P/1C × 1000 messages × 1 run.
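The remainder-spread rule in isolation, as a hedged sketch (splitWork is an illustrative name, not a harness proc): the first `messageCount mod n` workers each take one extra item so the totals match exactly.

```nim
proc splitWork(messageCount, n: int): seq[int] =
  let base = messageCount div n
  let rem = messageCount mod n
  result = newSeq[int](n)
  for i in 0 ..< n:
    result[i] = base + (if i < rem: 1 else: 0)

# 1000 messages over 3 producers: no truncation loss, no deadlock.
let counts = splitWork(1000, 3)
assert counts == @[334, 333, 333]
var total = 0
for c in counts: total += c
assert total == 1000
```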
Ping-pong RTT runner. For each run:
- allocates two queues fwd + rev of type Q via queueInit()
- spawns 1 pinger + 1 ponger thread
- pinger pushes a monotonic-ns timestamp via fwd, spins on rev for
the echo, records RTT (ns) into a per-run Histogram
- ponger pops from fwd, pushes the same value to rev
Aggregation per design 2.5 'Run aggregation for latency': reported
percentiles are the MEAN across runs of each percentile, NOT the
percentile of unioned samples. Per-run histograms isolate scheduler
artifacts to the run that contains them.
Returns LatencyMetrics{p50_ns, p95_ns, p99_ns, p999_ns, max_ns,
samples}. samples is the cumulative across-runs sample count.
This task ships the 1P/1C smoke topology only; multi-thread sharding
lands in PR 1 (Task 1.x). 1P/1C is sufficient for the smoke check
because PR 1's first BMF measure uses Sipsic 1P/1C as the smoke shape.
Test: SmokeAdapter (channel-backed), 1000 messages, 1 run, 0 warmup;
asserts samples >= 1000 and p50 < p99 < max.
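The aggregation rule in miniature (meanOfRuns is an illustrative helper, not part of the harness): averaging per-run percentiles keeps one scheduler-polluted run from dominating the way a pooled-sample percentile would.

```nim
proc meanOfRuns(perRunP99: openArray[float]): float =
  var sum = 0.0
  for v in perRunP99: sum += v
  sum / float(perRunP99.len)

# Two clean runs plus one run that caught a scheduler hiccup:
assert meanOfRuns([1000.0, 1100.0, 9000.0]) == 3700.0
```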
Stateless union of per-slug measure dicts across N input BMF JSON
files into one output JSON, per design doc section 4.3. Required
because Bencher.dev creates a separate Report per 'bencher run'
invocation; multiple uploads from sibling CI jobs would NOT co-locate
multiple measures on a single per-slug history.
Contract:
merge_bmf.py <output_path> <input1> [input2 ...]
- Slug regex: ^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$
- Measure regex: ^[a-z][a-z0-9_]*$
- Value fields ('value', 'lower_value', 'upper_value') must be finite,
non-NaN floats. Any malformed input or non-finite value -> exit 1.
- Collision (same slug + same measure across two inputs) -> exit 1
with stderr 'error: collision on slug=<slug> measure=<measure> in
files <pathA> and <pathB>'.
- Output is alpha-sorted at both top level (slugs) and within each
slug (measures), 2-space indent.
Tests in benchmarks/tests/test_merge_bmf.py cover all 8 design-spec
cases plus an alpha-sort assertion: zero-input usage error, single
empty pass-through, single valid round-trip, two disjoint slugs,
shared slug + disjoint measures, collision, NaN rejection, invalid
measure key (uppercase), invalid slug shape.
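The union-with-collision core, sketched in Nim for consistency with the other sketches here (the shipped tool is Python; regex validation and alpha-sorted serialization are elided).

```nim
import std/[json, tables]

proc mergeBmf(docs: openArray[(string, JsonNode)]): JsonNode =
  ## docs are (path, parsed BMF object) pairs; a collision raises.
  result = newJObject()
  var seen = initTable[string, string]()   # "slug/measure" -> source path
  for (path, doc) in docs:
    for slug, measures in doc.pairs:
      if not result.hasKey(slug):
        result[slug] = newJObject()
      for measure, value in measures.pairs:
        let key = slug & "/" & measure
        if key in seen:
          raise newException(ValueError,
            "error: collision on slug=" & slug & " measure=" & measure &
            " in files " & seen[key] & " and " & path)
        seen[key] = path
        result[slug][measure] = value
```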
Adds the 5 adapters not yet present per design doc 2.2:
- lockfreequeues_sipmuc_adapter.nim (bounded SPMC)
- lockfreequeues_mupsic_adapter.nim (bounded MPSC)
- lockfreequeues_unbounded_sipsic_adapter.nim
- lockfreequeues_unbounded_sipmuc_adapter.nim
- lockfreequeues_unbounded_mupmuc_adapter.nim

Each adapter file exports:
- 'topologiesSupported: set[Topology]' constant (consumed by PR 3 Task 3.11 to restrict shape generation; declared but unused in PR 0)
- 'make<Lib>Adapter' factory taking 'capacity: int'
- 'cleanup' tear-down proc
- 'push(item)' returning bench_common.PushResult
- 'pop()' returning bench_common.PopResult[T]

Smoke test in tests/t_bench_common.nim: each adapter pushes 100 sequential uint64s, pops them, asserts count == 100 AND set-equality with the input range.

Implementation note (Nim 2.2.6 codegen workaround): the unbounded adapters store the queue INLINE rather than via heap pointer. Heap-allocating UnboundedSipsic/UnboundedSipmuc/UnboundedMupmuc and then calling reset() or =destroy() on the referent from a generic adapter cleanup proc triggers 'internal error: getTypeDescAux(tyGenericParam)' OR 'internal error: expr(nkIdent); unknown node kind' in Nim 2.2.6 when bench_common is also imported. Inline storage lets the compiler emit the per-instance =destroy automatically when the adapter goes out of scope, sidestepping the bug. The DEBRA manager is still heap-allocated because the queue constructors take 'ptr DebraManager'. Track upstream Nim repro for follow-up.
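A skeleton of that per-adapter surface, with placeholder internals: the PushResult/PopResult shapes and the seq-backed store stand in for bench_common's real definitions and a real lock-free queue.

```nim
type
  Topology = enum
    tSpsc, tMpsc, tMpmc
  PushResult = enum
    prOk, prFull
  PopResult[T] = object
    ok: bool
    val: T

  ExampleAdapter*[T] = object
    buf: seq[T]                    # placeholder backing store

const topologiesSupported* = {tSpsc}

proc makeExampleAdapter*[T](capacity: int): ExampleAdapter[T] =
  ExampleAdapter[T](buf: newSeqOfCap[T](capacity))

proc push*[T](a: var ExampleAdapter[T]; item: T): PushResult =
  a.buf.add(item)
  prOk

proc pop*[T](a: var ExampleAdapter[T]): PopResult[T] =
  if a.buf.len == 0:
    return PopResult[T](ok: false)
  result = PopResult[T](ok: true, val: a.buf[0])
  a.buf = a.buf[1 .. ^1]

proc cleanup*[T](a: var ExampleAdapter[T]) =
  a.buf.setLen(0)                  # nothing heap-owned in this placeholder
```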
…port `topologiesSupported`

Per design section 2.2 (`<library_slug>_adapter.nim` filename convention):
- Rename via `git mv` (history preserved):
  - lockfreequeues_sipsic.nim -> lockfreequeues_sipsic_adapter.nim
  - lockfreequeues_mupmuc.nim -> lockfreequeues_mupmuc_adapter.nim
  - lockfreequeues_unbounded_mupsic.nim -> lockfreequeues_unbounded_mupsic_adapter.nim
- Update bench_throughput.nim import paths to match.
- Add `topologiesSupported*: set[Topology]` constant to all four existing adapters (3 renamed + channels_adapter). Use selective `from ../bench_common import Topology, <member>` to avoid pulling PushResult into scope, which would conflict with adapter.PushResult (the legacy concept in ../adapter still imported by these adapters).
- `topologiesSupported` is exported in PR 0 but consumed only by PR 3 Task 3.11; the "unused export" hint is expected per the impl plan.
…ughput
Add --bmf-out=<path> flag (parsed via std/parseopt) that emits Bencher
Metric Format JSON for every variant that ran. With the flag absent the
binary is bit-for-bit unchanged; the existing positional CLI
(`bench_throughput sipsic mupmuc unbounded_mupsic channels`) is
preserved verbatim.
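A sketch of how such a flag falls out of std/parseopt while positional variant names pass through untouched. The error text mirrors the hunks quoted earlier, routed to stderr per the review feedback; the loop shape itself is an assumption, not the shipped CLI.

```nim
import std/parseopt

var bmfOut = ""
var variants: seq[string]
var p = initOptParser()
while true:
  p.next()
  case p.kind
  of cmdEnd:
    break
  of cmdLongOption:
    if p.key == "bmf-out":
      if p.val.len == 0:
        stderr.writeLine "Missing value for --bmf-out"
        quit 1
      bmfOut = p.val
    else:
      stderr.writeLine "Unknown flag: --" & p.key
      quit 1
  of cmdShortOption:
    stderr.writeLine "Unknown flag: -" & p.key
    quit 1
  of cmdArgument:
    variants.add(p.key)            # e.g. "sipsic", kept positional
```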
Slug shape per design 2.2:
lockfreequeues_sipsic/spsc/1p1c
lockfreequeues_mupmuc/mpmc/{1,2,4,8}p{1,2,4,8}c
lockfreequeues_unbounded_mupsic/mpsc_unbounded/{1,2,4}p1c
nim_channels/mpmc/{1,2,4}p{1,2,4}c
Each slug carries the `throughput_ops_ms` measure as
{value: mean, lower_value: mean-stddev, upper_value: mean+stddev} —
matching the convention bmf_adapter.py used (which this code replaces
in Task 0.12).
Note on slug for built-in channels: the human-readable adapter name()
returns "nim/channels" (kept for stdout text), but the slug uses
underscore form `nim_channels` so it satisfies the merge_bmf.py
SLUG_RE pattern `^[a-z][a-z0-9_]*/[a-z][a-z0-9_]*/\d+p\d+c$`.
Per-variant {.intdefine.}s for runs/warmup added so the integration
test can build a fast variant via -d:BenchSipsicRuns=2
-d:BenchSipsicWarmup=0. Default values (10 runs, 0 warmup) match the
prior hard-coded behavior.
Selective `from ./bench_common import` is used to pull only
{BMFEmitter, initBMFEmitter, addMeasure, emit} — importing the full
module would shadow `results.ThroughputMetrics` with
`bench_common.ThroughputMetrics` and break the legacy bespoke harness
procs that still drive Mupmuc and UnboundedMupsic.
Test: tests/t_bench_common.nim adds an integration test that compiles
bench_throughput with small overrides, runs it with --bmf-out=<tmp>
sipsic, parses the resulting JSON, and asserts the expected slug and
measure exist with a positive value. Stdout text output is also
asserted (`Sipsic` substring) so the legacy CLI contract is exercised.
Apply the edits from design section 3 (PR 0):
(a) Remove `Run adapter unit tests` step. The Python regex parser it
sanity-checked is gone in Task 0.12; nothing left to test.
(b) Remove `Convert to BMF JSON` step. bench_throughput now emits BMF
natively via `--bmf-out=throughput.json` (Task 0.10).
(c) Update `Run bench_throughput` step to add the `--bmf-out=throughput.json`
flag while preserving the existing positional variant filter and the
20-minute step timeout.
(d) Insert `Merge BMF JSON` step running
`python3 benchmarks/merge_bmf.py merged.json throughput.json`.
Single-input today, but stays in place for PR 2-4 when bench is
split into per-topology binaries each producing its own fragment.
(e) Change all three `bencher run` invocations and the debug-cat step
to use `--file merged.json` instead of `bench_results.json`.
actionlint clean. The bench upload contract is unchanged: same
project, branch, testbed, thresholds, comparison semantics — only the
file source flips from regex-parsed text to native BMF.
…or consumers

Three legacy files removed via `git rm`:
- `benchmarks/bmf_adapter.py` — regex parser for bench_throughput stdout. Replaced by native BMF emission via `--bmf-out=` (Task 0.10).
- `benchmarks/test_bmf_adapter.py` — unit tests for the parser. Replaced by `benchmarks/tests/test_merge_bmf.py` (Task 0.7).
- `benchmarks/nim/bench_main.nim` — aggregator binary that wrapped bench_throughput + bench_latency and emitted a custom JSON shape. bench_throughput is now the canonical entry point and emits BMF directly; bench_latency follows in PR 1.

Live consumers refactored in this same task (per impl plan §0.12):
- `benchmarks/runner.py`: build_nim now compiles bench_throughput.nim; run_nim invokes `bench_throughput --bmf-out=<path>` instead of `bench_main --runs=<n> -o=<path>`. The `runs` argument is preserved in the CLI for back-compat but is now compile-time-only.
- `lockfreequeues.nimble`: `task benchmarks` switched to bench_throughput. Output path corrected to `.tmp/bench_throughput` (matching the project nim.cfg `--outdir:.tmp`).
- `benchmarks/README.md`: rewritten to document the new flow (bench_common module, adapter convention, merge_bmf.py, native BMF emission, expected slug set).
- `.gitignore`: drop the `bench_main` build-artifact ignore line.

`git grep "bmf_adapter\|bench_main"` returns only intentional historical references in code comments and CHANGELOG (prior release notes).

Acceptance: `nimble tasks` parses; `nim c benchmarks/nim/bench_throughput.nim` compiles; tests/t_bench_common.nim passes all 26 tests including the Task 0.10 --bmf-out integration test.
The legacy bench_main aggregator emitted JSON of shape
`{results: [{implementation, thread_config, metrics: {...}}], metadata: {...}}`.
Bench_main was deleted in Task 0.12, replaced by `bench_throughput
--bmf-out` (Task 0.10) + `merge_bmf.py` (Task 0.7) which produce
Bencher Metric Format JSON of shape
`{<lib>/<topology>/<P>p<C>c: {<measure>: {value, lower_value?, upper_value?}}}`.
`benchmarks/render_readme.nim` now walks the BMF shape directly:
- New `parseSlug(slug)` helper splits `<lib>/<topology>/<P>p<C>c` into
  `(impl, "<P>P/<C>C")` for the two table columns (see the sketch after
  this list). Strict on shape: rejects slugs with the wrong segment count
  or non-digit P/C, so malformed entries do not produce noise rows.
- `loadRows` walks every top-level slug, pulls
`<slug>.throughput_ops_ms.value` for the throughput column, and
`<slug>.latency_p50_ns.value` for the optional latency column
(PR 1+ emits this; PR 0 does not, so all PR 0 rows show "—").
- BMF has no `metadata` block, so the meta argument is always nil.
`renderTable` already accepts nil and skips the platform header in
that case — no change needed.
- `when isMainModule` block guarded by `not defined(renderReadmeNoMain)`
so unit tests can include this file without triggering the README
rewrite. CLI invocations (`nim r benchmarks/render_readme.nim`) leave
the define unset and run main as before.
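A strict parseSlug in the spirit described above; a sketch only, since the shipped helper's exact signature and failure mode may differ.

```nim
import std/[options, strutils]

proc parseSlug(slug: string): Option[(string, string)] =
  let parts = slug.split('/')
  if parts.len != 3:
    return none((string, string))        # wrong segment count
  let shape = parts[2]                   # e.g. "2p4c"
  let pIdx = shape.find('p')
  if pIdx <= 0 or not shape.endsWith('c'):
    return none((string, string))
  let pStr = shape[0 ..< pIdx]
  let cStr = shape[pIdx + 1 ..< shape.high]
  if pStr.len == 0 or cStr.len == 0:
    return none((string, string))
  for ch in pStr & cStr:
    if ch notin {'0' .. '9'}:
      return none((string, string))      # non-digit P/C
  some((parts[0], pStr & "P/" & cStr & "C"))

assert parseSlug("lockfreequeues_mupmuc/mpmc/2p4c") ==
       some(("lockfreequeues_mupmuc", "2P/4C"))
assert parseSlug("bad_slug").isNone
```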
Test: tests/t_render_readme.nim (new) writes a fixture BMF JSON with
three slugs (mupmuc/2p2c, sipsic/1p1c, nim_channels/1p1c), runs
`loadRows` and `renderBlock`, and asserts:
- 3 rows produced; impl/threads/throughput parsed correctly per slug
- alpha sort by impl preserved
- meta == nil (BMF carries no metadata)
- rendered Markdown table contains every impl name and value
- latency column is em-dash for every row (no `latency_p50_ns` in fixture)
- missing JSON file falls back to placeholder
- empty `{}` BMF document falls back to placeholder
All 4 new tests + 26 existing t_bench_common tests pass.
Document Track 0 of the bench-rollup feature under [Unreleased]:
- Added: bench_common.nim module, 5 new lockfreequeues adapters, merge_bmf.py CLI, --bmf-out flag, per-variant run-count intdefines.
- Changed: bench_throughput emits BMF natively, 3 adapters renamed to the <library_slug>_adapter.nim convention, render_readme.nim consumes the new BMF shape, runner.py / nimble task / benchmarks/README.md point at bench_throughput.
- Removed: bmf_adapter.py, test_bmf_adapter.py, bench_main.nim.

Each entry is one bullet (one logical change), no test plan checklist, no GitHub issue references.
/gemini review
Code Review
This pull request overhauls the benchmarking infrastructure by introducing a shared harness module (bench_common.nim), native Bencher Metric Format (BMF) JSON emission, and updated adapters for various lock-free queue implementations. It replaces the legacy bench_main aggregator and Python-based regex parser with a robust system including a BMF merging utility and updated documentation. Feedback identifies memory leaks in the latency and throughput harnesses due to missing queue cleanup, and resource leaks in adapters where heap-allocated queues are deallocated without invoking destructors.
```nim
proc runOneLatencyRun[Q](
    queueInit: proc(): Q,
    messageCount: int,
): LatencyMetrics =
  ## One ping-pong RTT run. Allocates fwd + rev queues of type Q,
  ## spawns 1 pinger and 1 ponger, records RTT samples, returns
  ## LatencyMetrics{p50, p95, p99, p999, max, samples}.
  ##
  ## NOTE: This task ships the 1P/1C smoke topology only. PR 1 (Task 1.x)
  ## extends to multi-pinger / multi-ponger by sharding `messageCount`
  ## across additional threads with the remainder-spread rule from
  ## runThroughputHarness. The 1P/1C path keeps RTT semantics clean
  ## (no scheduler interleave between ping and pong) which is what the
  ## p50 / p99 measurements in PR 1 hinge on.
  var fwd = queueInit()
  var rev = queueInit()
  var hist = initHistogram()

  var pingerCtx = PingerCtx[Q](
    fwd: addr fwd, rev: addr rev,
    count: messageCount,
    histogram: addr hist,
  )
  var pongerCtx = PongerCtx[Q](
    fwd: addr fwd, rev: addr rev,
    count: messageCount,
  )

  var pingerThread: Thread[ptr PingerCtx[Q]]
  var pongerThread: Thread[ptr PongerCtx[Q]]

  createThread(pingerThread, pingerThreadBody[Q], addr pingerCtx)
  createThread(pongerThread, pongerThreadBody[Q], addr pongerCtx)
  joinThread(pingerThread)
  joinThread(pongerThread)

  result = LatencyMetrics(
    p50_ns: hist.percentile(0.50),
    p95_ns: hist.percentile(0.95),
    p99_ns: hist.percentile(0.99),
    p999_ns: hist.percentile(0.999),
    max_ns: hist.percentile(1.0),
    samples: hist.seenAll,
  )
```
The runOneLatencyRun procedure leaks memory because it creates two queue instances (fwd and rev) but never calls their cleanup procedures. Since the adapters (e.g., LockfreequeuesMupsicAdapter) allocate memory on the heap, this will lead to significant memory accumulation over multiple benchmark runs.
```nim
proc runOneThroughputRun[Q](
    queueInit: proc(capacity: int): Q,
    capacity: int,
    numProducers: int,
    numConsumers: int,
    messageCount: int,
): float =
  ## One run of the throughput harness; returns ops/ms.
  ##
  ## Distribution invariants (mirrors legacy bench_throughput
  ## `runThroughputBenchmark`):
  ##   1. sum(producer.count) == messageCount
  ##   2. sum(consumer.count) == messageCount
  ## A naive `messageCount div N` truncates and breaks both when
  ## `messageCount` is not divisible by `numProducers` or `numConsumers`,
  ## which deadlocks the consumers (waiting for items the producers never
  ## enqueue) or strands items in the queue (consumers stop early).
  ## Spread the remainder over the first `messageCount mod N` workers so
  ## totals match exactly for any (P, C, messageCount) triple.
  var queue = queueInit(capacity)
  let baseP = messageCount div numProducers
  let remP = messageCount mod numProducers
  let baseC = messageCount div numConsumers
  let remC = messageCount mod numConsumers

  var producerThreads = newSeq[Thread[ptr ProducerCtx[Q]]](numProducers)
  var consumerThreads = newSeq[Thread[ptr ConsumerCtx[Q]]](numConsumers)
  var producerCtxs = newSeq[ProducerCtx[Q]](numProducers)
  var consumerCtxs = newSeq[ConsumerCtx[Q]](numConsumers)

  var nextStart = 0
  for i in 0 ..< numProducers:
    let count = baseP + (if i < remP: 1 else: 0)
    producerCtxs[i] = ProducerCtx[Q](
      queue: addr queue,
      startIdx: nextStart,
      count: count,
    )
    nextStart += count

  for i in 0 ..< numConsumers:
    let count = baseC + (if i < remC: 1 else: 0)
    consumerCtxs[i] = ConsumerCtx[Q](
      queue: addr queue,
      count: count,
    )

  # Monotonic clock — `epochTime` (wall clock) can step backward across
  # NTP adjustments and skew throughput numbers. Nanosecond precision:
  # ms-precision buckets multiple short runs into the same integer ms,
  # producing identical samples and stddev=0 on a fast CI runner. ops/ms
  # is reconstructed as a float at print time.
  let startTime = getMonoTime()

  for i in 0 ..< numProducers:
    createThread(producerThreads[i], producerThreadBody[Q],
      addr producerCtxs[i])
  for i in 0 ..< numConsumers:
    createThread(consumerThreads[i], consumerThreadBody[Q],
      addr consumerCtxs[i])

  for i in 0 ..< numProducers:
    joinThread(producerThreads[i])
  for i in 0 ..< numConsumers:
    joinThread(consumerThreads[i])

  let elapsedNs = float(inNanoseconds(getMonoTime() - startTime))
  if elapsedNs <= 0.0:
    return 0.0
  result = float(messageCount) * 1_000_000.0 / elapsedNs
```

```nim
proc cleanup*[N, P: static int, T](
    a: var LockfreequeuesMupsicAdapter[N, P, T]
) =
  if a.queue != nil:
    dealloc(a.queue)
    a.queue = nil
```
```nim
proc cleanup*[N, C: static int, T](
    a: var LockfreequeuesSipmucAdapter[N, C, T]
) =
  if a.queue != nil:
    dealloc(a.queue)
    a.queue = nil
```
… harness

Per Gemini review on PR #19. Adapters whose Q heap-allocates the backing queue (lockfreequeues bounded mupsic/sipmuc/mupmuc, channels) leaked the queue every time runOneLatencyRun or runOneThroughputRun returned: `var fwd = queueInit()` only frees the adapter struct's stack slot at scope exit, not the create()-d Mupsic/Sipmuc/Mupmuc/Channel pointee. Across N runs * M warmups the leak grew without bound.

Three-part fix:
1. bench_common.nim: harness now drops the queue via a uniform `mixin cleanup; when compiles(cleanup(q)): cleanup(q)` block after joinThread (and before any return path, including the elapsedNs<=0 short-circuit in runOneThroughputRun). `when compiles` keeps value-type adapters (sipsic, unbounded inline storage) usable without forcing them to define a no-op cleanup.
2. lockfreequeues_mupsic / sipmuc cleanup procs: prepend reset(a.queue[]) so Mupsic/Sipmuc's `=destroy` actually runs before the heap slot is freed. Bare dealloc skipped any queue-internal cleanup that an ARC/ORC-generated destructor would otherwise do (managed T fields, future atomic-cleanup hooks). Same fix on the mupmuc adapter inside deinitMupmucAdapter.
3. mupmuc and channels adapters gain a `cleanup` alias around their existing deinit*Adapter procs so the harness's mixin cleanup call resolves uniformly across adapter naming conventions.

Verified by smoke build of `runLatencyHarness` + `runThroughputHarness` against LockfreequeuesMupsicAdapter; no leak-on-return, percentiles and ops/ms reported as expected.
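The uniform hook in part 1, reduced to a standalone sketch (dropQueue, HeapAdapter, and InlineAdapter are illustrative names): adapters that define cleanup get it called; value-type adapters compile without one.

```nim
proc dropQueue[Q](q: var Q) =
  mixin cleanup
  when compiles(cleanup(q)):
    cleanup(q)        # heap-backed adapters free their queue here
  # else: nothing to do; =destroy covers inline-storage adapters

type
  HeapAdapter = object
    queue: ptr int    # stands in for a create()-d Mupsic/Sipmuc/Channel
  InlineAdapter = object
    value: int

proc cleanup(a: var HeapAdapter) =
  if a.queue != nil:
    dealloc(a.queue)
    a.queue = nil

var h = HeapAdapter(queue: create(int))
var inl = InlineAdapter(value: 1)
dropQueue(h)     # resolves and calls cleanup
dropQueue(inl)   # compiles fine; the when-branch is simply absent
```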
/gemini review

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

/review
Code Review by Qodo
Replaces the PR 5 placeholder with the real uPlot block per design §4.5 schema. Documents source, version (1.6.27), MIT license, vendored path, and a precise upgrade procedure (jsdelivr URL, SHA-256 verification via jsdelivr's package metadata).

`.gitattributes` gains two rules per design §4.6:
- `docs/assets/uplot-*.js` — linguist-vendored + linguist-generated. Confirmed via `git check-attr -a` against the vendored bundle.
- `docs/assets/bench-results/*.json` — linguist-generated only. CI-emitted snapshots are not third-party imports, so the vendored flag would mislead.
Replaces the auto-rendered four-row table with a hand-curated four-row summary covering the four bounded lockfreequeues variants at one representative shape each, followed by a link to the live chart page (design §4.4 / Task 5.8). Initial cells contain `_to be filled at next release_` placeholders; the next release PR fills them in per the "Updating the README summary" procedure documented in benchmarks/README.md.

Why hand-curated:
- The chart page already absorbs run-to-run noise; the README does not need to.
- Four cells, refreshed once per release cycle, is faster to update by hand than to re-introduce an auto-renderer the chart page has superseded.
- The README is the first artefact consumers see; a noisy CI run should not silently regress its headline numbers.

Pre-deletion release-tag check (per impl plan 5.8):
- v3.2.0 and v4.0.0 each ship `benchmarks/render_readme.nim` in their tagged tree. Deleting the file on devel does NOT mutate those tags (git tags point at frozen commits).
- No CI workflow, nimble task, or test runner references the renderer; the test suite (`tests/test.nim`) does not import `t_render_readme`. Confirmed via `git grep` against `*.yml`, `*.nimble`, `*.cfg`, and `tests/test.nim`.

Files removed:
- `benchmarks/render_readme.nim` — auto-rendered the README markers.
- `tests/t_render_readme.nim` — `include`s the renderer; would not compile after deletion.

`benchmarks/README.md` gains an "Updating the README summary" section codifying the hand-curation procedure (which shapes to read, where to read them from, when to commit). The chart-contract test runner is added to the test commands list.

Two remaining `render_readme` mentions are intentional:
- `CHANGELOG.md` PR-0 entry (historical record of the PR-0 rewrite).
- `benchmarks/README.md` explanatory note about the deletion path.
Adds Track 5 entries to [Unreleased]: chart wiring + assets, BMF snapshot publishing pipeline with three-layer loop-prevention, devel-triggered docs deploy + post-deploy asset verification, THIRD_PARTY_LICENSES + .gitattributes uPlot entries, the new chart-contract Python test, and the README BENCHMARKS markers switching from auto-rendered to hand-curated. Documents removal of `benchmarks/render_readme.nim` and its test, with the pre-deletion release-tag check noted. Manual chart performance evidence (LCP/FCP screenshot per design §3 PR 5 acceptance) will be attached to the PR description after push, since it requires a live mike deploy URL.
…y old uPlot

Per Gemini review on PR #24.
1. `docs/benchmarks.md` and `docs/assets/bench-charts.js`: `./assets/` resolves to `/<version>/benchmarks/assets/` under mkdocs `use_directory_urls` (default true), which 404s — assets actually live at `/<version>/assets/`. Switch to `../assets/` so the chart page reaches them via one level up.
2. `.github/workflows/docs.yml` mike-asset verifier: same path bug — was probing `/dev/benchmarks/assets/bench-results/latest.json` which would have returned 404 once the snapshot landed (the verify step only runs on devel push, so the bug wouldn't surface until then). Aligned with the page's `../assets/` path.
3. `bench-charts.js` ResizeObserver leak on log-scale toggle: `attachResizeObserver` was called from inside `rebuild()` which fires on every toggle. Each click left a stranded observer + a stale `plot` reference its setSize closure was still firing on. Refactored to a single observer attached after the first build, which resolves `plot` lazily through a getter — toggle just swaps the underlying instance.
4. `bench-charts.js` orphaned uPlot DOM/listeners: `rebuild()` mounted a new uPlot without destroying the previous one. Added `if (plot) plot.destroy()` at the top of `rebuild()`. Combined with the observer fix above, the toggle is now state-clean across arbitrarily many flips.

Syntax-checked bench-charts.js with `node --check`.
Three code-review fixes on PR 5:
1. Snapshot push race. The bench-upload Snapshot step pushed devel without rebasing, so a concurrent push during a long bench run would reject non-fast-forward and fail the entire job. Add a fetch + rebase + retry loop (3 attempts) and push HEAD:devel. The commit is already [skip ci], so the retry cannot trigger an infinite bench loop on a successful push.
2. Hardcoded GitHub Pages verify URL. The docs.yml verification step curled https://elijahr.github.io/lockfreequeues/... which is wrong on forks or after a repo rename. Derive the URL from github.repository_owner and github.event.repository.name so the verifier follows the repo wherever it lives.
3. Log-scale accepts non-positive. The chart defaults to log Y (distr: 3) but merge_bmf.py only enforces finiteness, so a malformed BMF could deliver a 0 or negative throughput value. Treat <= 0 as missing data in toUplotData. Throughput in ops/ms cannot legitimately be non-positive (a 0 indicates a degenerate run that bench_throughput's elapsedNs guard already rejects), so surfacing as missing is the correct semantics.
The PR4 -> PR5 rebase merged two permissions blocks: PR4's contents:read (needed by actions/checkout + actions/download-artifact when default GITHUB_TOKEN is read-only) and PR5's contents:write (needed by the snapshot-push step). YAML rejects duplicate keys, so GitHub Actions failed the workflow at parse time before any job started — visible as a workflow-level failure with no job rows. Collapse to contents:write, which is a superset of read.
- Vendor `uPlot.min.css` 1.6.27 alongside the existing IIFE JS bundle. Without the companion stylesheet, uPlot's axes, grid lines, and cursor are misaligned (uPlot ships the layout rules in CSS, not inline). Wire it via a `<link>` in benchmarks.md AND add it to mkdocs `extra_css` so it loads on every page (cheap, ~2KB) and the cursor/grid layout works regardless of which page renders the chart.
- Update .gitattributes vendoring rules to cover the .css file too.
- Update THIRD_PARTY_LICENSES upgrade procedure for the dual bundle.
- bench-charts.js: replace per-comparison regex+parseInt with a pre-parse-and-sort pass that computes (p, c, total) once per label. Non-matching labels keep a sentinel that puts them ahead of matched shapes, ordered lexicographically among themselves.
- test_bench_charts_contract: replace the manual `v == v and v not in (inf, -inf)` finite check with `math.isfinite(v)`.
…link

- Series key is now `<library> (<topology>)` instead of bare library name. A library that exposes multiple topologies (e.g. mupmuc/spsc AND mupmuc/mpmc) used to merge into one series and silently overwrite shape values.
- Cache `_shapeMap` on each library's first hover so the setCursor hook does an O(1) Map lookup per visible series instead of an O(N) linear `find` on every mousemove.
- Drop the redundant `<link rel="stylesheet">` for uplot's CSS in docs/benchmarks.md — it's already loaded globally via mkdocs.yml's `extra_css`. The duplicate caused two network requests and risked load-order conflicts.
- Document the contract-test version-string bump in the uPlot upgrade procedure, and clarify that the CSS comes via `extra_css` (not a `<link>`) so future upgraders touch the right config.

The duplicate-network-request risk and the silent-overwrite bug were the user-facing failures; the Map cache is a small ergonomic win on the same code path.
…JS parseSlug
- bench.yml: remove [skip ci] from the snapshot commit. The marker
also suppressed docs.yml, which has no paths-ignore and must run
on snapshot pushes to deploy fresh chart data to Pages. Loop
prevention is unaffected: bench.yml's existing paths-ignore on
docs/assets/bench-results/** already short-circuits the bench job
on the snapshot commit, and the bot-actor guard on bench-upload is
the defense-in-depth layer. Comment block updated to reflect the
two-layer scheme (was three layers).
- test_bench_charts_contract: SLUG_RE accepted only 3-segment slugs;
bench-charts.js parseSlug accepts 3+ via slice(1, -1).join('/').
Loosen the python regex to match so the contract test does not
reject nested-topology slugs the JS chart would render.
Track 6 (PR 6, final track in the bench-rollup feature stack) adds latency-regression gating in Bencher and the supporting BMF measures.

bench_latency.nim now emits the full p50/p95/p99/p999/max latency tuple per bounded variant slug. Each call to runLatencyHarness adds two new measures alongside the existing trio:

  em.addMeasure(slug, "latency_p999_ns", metrics.p999_ns)
  em.addMeasure(slug, "latency_max_ns", metrics.max_ns)

The merged BMF on each bench job thus carries a complete tail-latency profile per (library, topology) cell, available to the Bencher dashboard and any downstream comparison chart.

bench_common.nim raises HistogramTopK from 1000 to 5000. At production sample counts (BenchLatencyMessageCount=100_000 x BenchLatencyRuns=33, seenAll ~ 3.3M per aggregated histogram) the p999 tail rank is ~3300; K=1000 forced the p999 lookup into the rescaled-reservoir stratum, while K=5000 keeps it in the exact top-K stratum with headroom for stochastic boundary variation. Memory cost is 5000 x 8B = 40KB additional per histogram, negligible relative to the 99K-sample reservoir.

bench.yml's base-branch tracking step (push to devel/main) configures per-measure thresholds in a single bencher run invocation: latency_p99_ns with --threshold-upper-boundary 0.99 (regression = latency increase) and throughput_ops_ms with --threshold-lower-boundary 0.99 (regression = throughput drop). Both share --threshold-test t_test --threshold-max-sample-size 64, terminated by --thresholds-reset so only the explicitly-listed thresholds remain active. This also corrects a prior measure-name mismatch: the earlier --threshold-measure throughput never matched any emitted measure (the actual key is throughput_ops_ms), so the previous throughput threshold was a no-op.

Threshold activation requires >= 10 prior runs accumulated in Bencher to calibrate the t-test baseline (Task 6.4 stability soak gate). Bencher does not emit alerts on measures with insufficient sample history, so the configuration is dormant until the soak completes post-merge.

Tests:
- t_bench_common.nim: HistogramTopK >= 5000 assertion + p999 within 5% of sort fallback on 3.3M log-normal samples (27 tests, all passing).
- t_bench_latency.nim: every bounded variant slug carries latency_p999_ns + latency_max_ns with monotone tail ordering p99 <= p999 <= max (7 tests, all passing).
- tests/test.nim main suite unaffected (200 tests, all passing).

Refs design doc /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md section 3 PR 6 and impl plan /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md Track 6 (Tasks 6.1-6.5).
…ntiles

Per Gemini review on PR #25. The Track 6 Task 6.2 commentary in HistogramTopK, bench_latency.nim, t_bench_common.nim, and CHANGELOG repeatedly claimed that runLatencyHarness aggregates samples across runs ("seenAll ≈ 3.3M per aggregated histogram"). It does not — design 2.5 explicitly says "mean across runs of each percentile", so each Histogram only ever sees BenchLatencyMessageCount samples (default 100K), not MessageCount × Runs.

At the default 100K MessageCount the previous K=1000 was already adequate — TopK + Reservoir already captured every sample exactly (1000 + 99,000 ≥ 100,000), so K=1000 was never spilling p999 into the rescaled stratum at default settings.

Reframed all four sites: the K=5000 sizing is **anticipatory** for operators who override BenchLatencyMessageCount upward (~5M-sample overrides), keeping p999 (tail rank = MessageCount × 0.001) inside the exact top-K stratum at that volume. The 3.3M-sample stress test in t_bench_common.nim is reframed as a stress shape that exercises the K=5000 design choice, not a depiction of production volume.

No code logic changed; comments + CHANGELOG only.
Code review on PR 6 noted that the new HistogramTopK assertion in tests/t_bench_common.nim and the latency CLI assertions in tests/t_bench_latency.nim were not imported by tests/test.nim and not run by any CI step, so the validation added by this PR was not actually enforced in CI.

Wire up via nimble's separation-of-concerns pattern (Option B from the review):
- lockfreequeues.nimble gains a 'benchtests' task that compiles and runs t_bench_common, t_bench_latency, and t_bench_adapters under one Nim invocation each. The bench harness tests live outside srcDir; importing them from tests/test.nim would force every entry of the 8-invocation MM/sanitizer test matrix to compile bench_common's threading machinery. A dedicated task keeps the regular nimble test matrix lean while still enforcing the bench-side assertions.
- A second 'benchteststress' task enables the gated 3.3M-sample HistogramTopK p999 stress shape (slow, ~10-15s release-mode), so default benchtests stays under a second.
- bench.yml gains a 'bench-tests' job (ubuntu-latest) that runs 'nimble benchtests' once per workflow trigger, with the same fork-PR / bot-actor guards as the bench matrix jobs.

The 3.3M-sample stress test in t_bench_common is now gated behind -d:BenchCommonStress so it only runs under benchteststress, not the default benchtests run.
The previous wording called `percentile(1.0)` "the top-K head", which is misleading for a min-heap implementation: `topKHeap[0]` is the SMALLEST element in the top-K (the admission threshold for new outliers), not the largest. The implementation snapshots the heap into a sequence, sorts it, and returns `topk[^1]` — the maximum. Comment-only change; behaviour unchanged.
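A quick demonstration of the distinction using std/heapqueue (the real Histogram manages its own heap; this is just the min-heap property in isolation):

```nim
import std/heapqueue

var topK = initHeapQueue[int]()    # min-heap: smallest element sits at [0]
for v in [5, 1, 9, 3]:
  topK.push(v)

assert topK[0] == 1    # the SMALLEST kept value: the admission threshold
# percentile(1.0) needs a sorted snapshot instead:
var snapshot: seq[int]
while topK.len > 0:
  snapshot.add(topK.pop())         # pop() drains in ascending order
assert snapshot == @[1, 3, 5, 9]
assert snapshot[^1] == 9           # the true maximum
```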
Replaces the Qodo PR-Agent reusable workflow with Momus (axiomantic/momus), an open-source from-scratch PR review bot.
…KEY -> LLM_API_KEY, pass MOMUS_APP_* secrets, set trigger_mention
PR 0 of the bench-rollup feature: introduces a shared bench harness module (bench_common.nim), rewires the CI bench workflow to use native BMF JSON emission via --bmf-out, adds 5 new lockfreequeues adapters, renames 3 existing adapters, introduces merge_bmf.py for stateless BMF union, and rewrites render_readme.nim for the new BMF shape. Two substantive issues found: the per-variant warmup defaults silently dropped from 3 to 0 for bounded variants (contradicts the 'pure refactor' claim and affects benchmark numbers), and 14 new tests across t_render_readme.nim and test_merge_bmf.py are not wired into any CI workflow.
Severity tally: 2 Mediums, 1 Low.
Medium (blocking)
- BOT-A1 (benchmarks/nim/bench_throughput.nim:521): Per-variant warmup defaults silently dropped from 3 to 0 for bounded variants
- BOT-A2 (lockfreequeues.nimble:65): 14 new tests not wired into CI; merge_bmf and render_readme tests provide no gating

Low
- BOT-A3 (tests/t_bench_common.nim:256): SmokeAdapter leaks its Channel in throughput and latency smoke tests
Noteworthy
- The BMFEmitter + Histogram implementation is careful about edge cases (NaN/Inf guard in addMeasure, zero-duration guard in elapsed time calculations, deterministic reservoir seed for reproducibility).
- The merge_bmf.py collision detection and schema validation (slug regex, measure regex, NaN rejection, bool rejection) is thorough and well-tested.
- The CI bench.yml now pins sibling deps to tagged releases (v0.6.0, v0.7.0) instead of `main`, improving CI determinism.
Verdict: REQUEST_CHANGES.
Commands
- Comment `/ai-review` or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
- Reply to a finding with `won't fix`, `by design`, or `not a bug` to decline it.
- Reply with `instead, ...` to propose an alternative fix.
Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.
```nim
BenchSipsicWarmup {.intdefine.} = 0
BenchMupmucRuns {.intdefine.} = 10
BenchMupmucWarmup {.intdefine.} = 0
BenchChannelsRuns {.intdefine.} = 10
BenchChannelsWarmup {.intdefine.} = 0
```
BOT-A1 — Medium (bug)
Per-variant warmup defaults silently dropped from 3 to 0 for bounded variants
The new per-variant compile-time constants BenchSipsicWarmup, BenchMupmucWarmup, and BenchChannelsWarmup all default to 0 (lines 521-525). The old code called benchmarkThroughput/benchmarkMupmucNP*C without an explicit warmup argument, picking up the default WarmupRuns=3 (line 44). In CI the old behavior was warmup=2 (via -d:WarmupRuns=2). The PR description states 'This is a pure refactor PR. No new variants are benched, no regression vs the prior CI numbers is expected' and claims the binary is 'bit-for-bit unchanged from the prior release' when --bmf-out is absent. The warmup change contradicts both claims: the first timed runs will absorb cold-cache/JIT effects, producing systematically lower throughput numbers across all bounded variants (sipsic, mupmuc, channels) on every local and CI invocation. The unbounded_mupsic variants are unaffected — they still use WarmupRuns=3 (line 465).
Suggested fix: either set the per-variant warmup defaults to match the old behavior (BenchSipsicWarmup=3, BenchMupmucWarmup=3, BenchChannelsWarmup=3), or if the change is intentional, update the PR description and CHANGELOG to document it as a Changed behavior. Also ensure the CI bench.yml -d:WarmupRuns flag actually reaches these variants (currently it doesn't, since the isMainModule calls pass the per-variant constants explicitly).
```nim
task benchtests, "Runs the bench harness test suite":
  # The bench harness lives outside `srcDir`, so its dedicated tests
  # (`tests/t_bench_*.nim`) are NOT imported by `tests/test.nim` to
  # keep the regular `nimble test` matrix free of the bench harness's
```
BOT-A2 — Medium (tests)
14 new tests not wired into CI; merge_bmf and render_readme tests provide no gating
The PR adds 4 tests in tests/t_render_readme.nim and 10 tests in benchmarks/tests/test_merge_bmf.py, but neither file is referenced in any CI workflow or nimble task. The benchtests nimble task (line 65) only runs tests/t_bench_common.nim. The bench.yml bench-tests job only invokes nimble benchtests. The Python tests require python3 -m unittest benchmarks.tests.test_merge_bmf -v and the Nim tests require a separate compile+run of t_render_readme.nim — neither is automated. These tests cover schema validation, collision detection, slug parsing, BMF shape parsing, and edge cases (NaN rejection, missing files). Without CI gating, regressions in merge_bmf.py or render_readme.nim won't be caught pre-merge.
Suggested fix: add steps to the bench.yml `bench-tests` job to run `python3 -m unittest benchmarks.tests.test_merge_bmf -v` and to compile and run `tests/t_render_readme.nim`. Alternatively, extend the `benchtests` nimble task to include these invocations.
```nim
      queueInit = proc(cap: int): SmokeAdapter = initSmokeAdapter(cap),
      capacity = 1024,
      numProducers = 1,
      numConsumers = 1,
      messageCount = 1000,
      runCount = 1,
      warmupCount = 0,
    )
    check metrics.runs == 1
    check metrics.ops_ms_mean > 0.0

# ---------- Task 0.6: runLatencyHarness smoke ----------

suite "bench_common runLatencyHarness":
  test "smoke: 1P/1C, 1000 messages, 1 run, 0 warmup; p50<p99<max":
    let metrics = runLatencyHarness[SmokeAdapter](
      queueInit = proc(): SmokeAdapter = initSmokeAdapter(1024),
```
BOT-A3 — Low (quality)
SmokeAdapter leaks its Channel in throughput and latency smoke tests
The SmokeAdapter type (lines 256-272) allocates a Channel[uint64] via create in initSmokeAdapter but defines no cleanup proc. The bench_common harness (both runOneThroughputRun and runOneLatencyRun) uses when compiles(cleanup(queue)) to conditionally call cleanup, which never fires for SmokeAdapter. Every invocation of the runThroughputHarness and runLatencyHarness smoke tests leaks a Channel. While tests are short-lived and the OS reclaims the memory at process exit, this is a leak that could mask real issues if these smoke adapters are copied into production adapter code.
Suggested fix: add a `cleanup` proc to SmokeAdapter that closes and deallocates the channel, matching the pattern in channels_adapter.nim.
/ai-review
This omnibus PR consolidates 7 stacked PRs for the bench-rollup feature stack (81 commits, 70 files, +11,296/-1,493). It adds a shared bench harness with native BMF emission, topology-split binaries, comparison adapters for 7 external queue libraries, an interactive uPlot chart, CI matrix with per-step timeouts, cache-line-aligned segment allocation, and Bencher threshold gating. The code is thorough, well-tested, and the CI pipeline is carefully designed with soft-skip patterns and loop-prevention guards. Only one minor nit found.
Severity tally: 1 Nit.
Nit
- BOT-B1 (tests/t_aligned_alloc.nim:10): Test file uses raw c_free instead of freeAligned from the module under test.
Noteworthy
- Cache-line-aligned segment allocation (`allocAligned`/`freeAligned` with POSIX `posix_memalign` and Windows `_aligned_malloc` paths) is a genuine fix for false sharing in unbounded queues, verified by 8 dedicated assertions in `t_unbounded_padding.nim`.
- The soft-skip CI pattern (install → smoke → set-flag-on-success → annotate-on-failure) gracefully handles unavailable external libraries without failing the workflow, keeping partial bencher coverage for surviving binaries.
- Deletion-safety guard (`superset_check.py`) run in CI between merge and upload ensures the topology split never silently drops slugs — a robust regression-prevention pattern.
Verdict: APPROVE.
Commands
- Comment `/ai-review` or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
- Reply to a finding with `won't fix`, `by design`, or `not a bug` to decline it.
- Reply with `instead, ...` to propose an alternative fix.
Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.
```nim
from system/ansi_c import c_free

type
  Tiny = object
    a: int

  Line = object
    a: int
    b: int
    c: int
    d: int
    e: int
    f: int
    g: int
    h: int

  Big = object
    a: array[256, byte]

suite "internal/aligned_alloc.allocAligned":
  test "tiny payload (sizeof < CacheLineBytes) is 64B-aligned":
    let p = allocAligned[Tiny]()
    check p != nil
    check (cast[uint](p) mod CacheLineBytes.uint) == 0
    check p.a == 0  # zero-initialized
    c_free(p)

  test "line-sized payload (sizeof == CacheLineBytes) is 64B-aligned":
    let p = allocAligned[Line]()
    check p != nil
    check (cast[uint](p) mod CacheLineBytes.uint) == 0
    check p.a == 0
    check p.h == 0
    c_free(p)

  test "big payload (sizeof > CacheLineBytes) is 64B-aligned":
    let p = allocAligned[Big]()
    check p != nil
    check (cast[uint](p) mod CacheLineBytes.uint) == 0
    for i in 0 ..< 256:
      check p.a[i] == 0.byte
    c_free(p)

  test "many allocations are all 64B-aligned":
    var ptrs: array[64, ptr Line]
    for i in 0 ..< 64:
      ptrs[i] = allocAligned[Line]()
      check ptrs[i] != nil
      check (cast[uint](ptrs[i]) mod CacheLineBytes.uint) == 0
    for i in 0 ..< 64:
      c_free(ptrs[i])
```
BOT-B1 — Nit (tests)
Test file uses raw c_free instead of freeAligned from the module under test.
tests/t_aligned_alloc.nim imports lockfreequeues/internal/aligned_alloc (which exports freeAligned), but deallocates test allocations via system/ansi_c.c_free instead of using the module's own freeAligned. On POSIX (the only CI platform) freeAligned delegates to c_free, so tests pass, but the freeAligned code path added in this PR is left untested. On Windows the test would be incorrect (_aligned_free is required).
Suggested fix: replace all `c_free(p)` / `c_free(ptrs[i])` calls with `freeAligned(p)` / `freeAligned(ptrs[i])` and drop the `from system/ansi_c import c_free` import.
Drops c_free in favor of the module's freeAligned helper so the test exercises the public API symmetrically. Same underlying allocator (c_free under POSIX), but freeAligned is the documented inverse of allocAligned and the right call when _aligned_free is needed on Windows.
bench-rollup feature stack ships as 4.2.0. Version bumped from 4.1.0 (already-released; tagged v4.1.0 on 2026-05-01) and the previously [Unreleased] CHANGELOG section moved under the [4.2.0] header dated 2026-05-05. release.yml triggers on push to devel: it parses this version, creates v4.2.0 tag if absent, and publishes the GitHub Release with auto-generated notes from commit history.
This reverts commit eb56a15.
Conflict resolutions:
- src/lockfreequeues/unbounded_{mupmuc,mupsic,sipmuc}.nim: kept HEAD's
allocAligned[DebraManager[MaxThreads]]() (raises OutOfMemDefect
internally per aligned_alloc.nim) and dropped devel's c_calloc + nil
check. The cache-line padding work (PR 19) supersedes c_calloc here.
- mkdocs.yml: took devel's 'Concepts: Safety Model / Slot Ownership
Typestates' nav block. devel deleted docs/api/ops.md, so HEAD's
'Support: Operations' entry would break the docs build.
- CHANGELOG.md: combined both Unreleased sections. Promoted devel's
CASAttempt typestate restructure to a top-level BREAKING entry;
appended devel's Changed/Added/Fixed under '(typestates 0.7 uplift)'
subheaders so the bench-rollup and typestates-uplift sections are
visually separable in the next release notes.
Verified: 210/210 tests pass under nim c --threads:on tests/test.nim
with typestates 0.7.2 + this merge.
The merge from devel surfaced that this branch's build.yml + bench.yml still pinned nim-typestates@v0.7.0 / nim-debra@v0.6.0. lockfreequeues now requires typestates >= 0.7.2 (CASAttempt union, match-macro generic-context fixes) and debra >= 0.7.0 (no published tag yet); v0.7.0 fails compilation with 'match arm head must be a single state identifier' on the new opaque-state typestate match invocations. Match devel's approach: track --branch main until nimble.directory picks up the new tags. Updates 3 clone blocks (build.yml + 2 in bench.yml) plus the now-stale 'pin to release tags' rationale comments. Verified locally with main of nim-typestates (= v0.7.2 sha db7159f) and nim-debra: 210/210 tests pass.
This omnibus PR consolidates 7 stacked PRs (Tracks 0–6) into a single bench-rollup: shared harness with native BMF emission, topology-split binaries, comparison adapters (Crossbeam, Boost, Loony, MoodyCamel, threading.Channels, system.Channel), interactive uPlot charts, latency thresholds, and snapshot publishing. One confirmed bug found: the bench-comparison.yml workflow carries a stale threshold measure name that was already fixed in bench.yml's Track 6, making the Crossbeam Bencher threshold silently inoperative.
Severity tally: 1 Low.
Low
- BOT-A1 (.github/workflows/bench-comparison.yml:196): Bencher threshold measure name mismatch in bench-comparison.yml: `throughput` vs `throughput_ops_ms`
Noteworthy
- Well-structured soft-skip adapter pattern: install → smoke → set-flag-on-success → annotate-on-failure, with workflow_dispatch force-skip overrides. Graceful degradation without workflow failures.
- Three-layer loop-prevention for snapshot publishing (paths-ignore, bot-actor guard, explicit ref checkout) is carefully reasoned and avoids the common [skip ci] / docs-deploy tension.
- The Histogram top-K + Algorithm R reservoir stratified-percentile implementation in bench_common.nim is mathematically sound with proper rescaling for body-stratum lookup.
Verdict: APPROVE.
Commands
- Comment `/ai-review` or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
- Reply to a finding with `won't fix`, `by design`, or `not a bug` to decline it.
- Reply with `instead, ...` to propose an alternative fix.
Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.
```sh
--token '${{ secrets.BENCHER_API_TOKEN }}' \
--branch devel \
--testbed ubuntu-latest \
--threshold-measure throughput \
```
BOT-A1 — Low (bug)
Bencher threshold measure name mismatch in bench-comparison.yml: throughput vs throughput_ops_ms
The "Track scheduled benchmarks with Bencher" step in bench-comparison.yml uses --threshold-measure throughput but the actual measure key emitted by every bench binary (bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded) is throughput_ops_ms. This is the same bug that bench.yml's Track 6 explicitly corrected (the Track 6 comment reads: "the prior config used --threshold-measure throughput; the actual measure key ... is throughput_ops_ms"). The bench-comparison.yml was not updated to match, so the Crossbeam threshold will never match any measure — rendering Bencher regression detection on the nightly crossbeam workflow a silent no-op. Note: this workflow runs on a nightly cron schedule and is not on the PR critical path, so this is flagged as low severity (follow-up fix) rather than blocking.
Suggested fix: replace `--threshold-measure throughput` with `--threshold-measure throughput_ops_ms` on line 196 of bench-comparison.yml, matching the fix already applied in bench.yml line 773.
/ai-review
This omnibus PR rolls up 7 tracks of work (81 commits, 70 files) adding a shared bench harness, comparison adapters (Boost, Loony, Crossbeam, MoodyCamel, threading.Chan, system.Channel), topology-split binaries, interactive uPlot charts, and latency p99/p999/p9999 thresholds. Two substantive issues found: a crossbeam threshold measure name mismatch that makes Bencher gating a no-op in the comparison workflow (carried over from a bug the main bench.yml already fixed), and an incorrect rescaling formula in a docstring comment.
Severity tally: 1 Medium, 1 Low.
Medium (blocking)
- BOT-C1 (.github/workflows/bench-comparison.yml:196): Bencher threshold measure name mismatch in bench-comparison.yml: `throughput` vs `throughput_ops_ms`
Low
- BOT-C2 (benchmarks/nim/bench_common.nim:341): Histogram percentile docstring formula does not match the code
Noteworthy
- The bench_common.nim shared harness is well-structured, with proper cleanup (`when compiles(cleanup(...))`) and thorough docstrings.
- The Rust cdylib (bench-ffi-crossbeam) has proper null-pointer guards, panic safety, and comprehensive integration tests.
- The Loony adapter's `(v+1)<<3` encoding constraint is well documented, with range limits and debug asserts (see the sketch after this list).
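To make the constraint concrete, here is a hedged sketch of that encoding in Nim; `encode`, `decode`, and `TagBits` are illustrative names, not the adapter's actual API, and the sketch assumes LoonyQueue's documented use of the low 3 pointer bits for tags:

```nim
const TagBits = 3  # assumption: LoonyQueue tags the low 3 bits of 8-byte-aligned pointers

proc encode(v: uint): pointer =
  ## (v + 1) shl 3: the +1 keeps v = 0 distinguishable from a nil pointer,
  ## and the shift leaves the tag bits clear. Range limit: v must survive
  ## the 3-bit shift, hence the debug assert.
  doAssert v < (high(uint) shr TagBits)
  cast[pointer]((v + 1) shl TagBits)

proc decode(p: pointer): uint =
  ## Inverse: drop the tag bits, undo the +1 offset.
  (cast[uint](p) shr TagBits) - 1

doAssert decode(encode(0'u)) == 0'u
doAssert decode(encode(41'u)) == 41'u
```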
Verdict: REQUEST_CHANGES.
```
--token '${{ secrets.BENCHER_API_TOKEN }}' \
--branch devel \
--testbed ubuntu-latest \
--threshold-measure throughput \
```
BOT-C1 — Medium (bug)
Bencher threshold measure name mismatch in bench-comparison.yml: throughput vs throughput_ops_ms
The "Track scheduled benchmarks with Bencher" step in bench-comparison.yml uses --threshold-measure throughput but the actual measure key emitted by every bench binary (bench_spsc, bench_mpsc, bench_mpmc, bench_unbounded) is throughput_ops_ms. This is the same bug that bench.yml's Track 6 explicitly corrected (the Track 6 comment reads: "the earlier --threshold-measure throughput never matched any emitted measure (the actual key is throughput_ops_ms), so the previous throughput threshold was a no-op"). The bench-comparison.yml was missed when bench.yml was fixed. The threshold-gating for crossbeam comparison will never fire, making the scheduled regression detection silently non-functional.
Suggested change: change `--threshold-measure throughput` to `--threshold-measure throughput_ops_ms` at line 196 of bench-comparison.yml.
```nim
let pBodyClamped = max(0.0, min(1.0, pBody))
return percentile(h.reservoir, pBodyClamped)
```
BOT-C2 — Low (quality)
Histogram percentile docstring formula does not match the code
The Histogram.percentile proc's docstring says the body-stratum rescaling is p_body = (p*seenAll - K) / (seenAll - K) (subtracting K from the numerator). The code does let pBody = (pc * float(h.seenAll)) / float(bodySize) (no K subtraction). The code is correct — the body stratum holds the seenAll - K smallest samples, so the target rank within the body is p * seenAll, not p * seenAll - K. The docstring formula is wrong. The same docstring appears in the Histogram.percentiles proc at line ~371.
Suggested change: update the docstring to match the code: `p_body = (p * seenAll) / (seenAll - K)`.
BOT-C1 (Medium): bench-comparison.yml's scheduled-run threshold step used `--threshold-measure throughput` but the BMF emitter writes the measure key as `throughput_ops_ms`. The threshold therefore matched no measure and was a silent no-op; gating on Crossbeam vs lockfreequeues regressions never fired. Same bug class as PR 6's main bench.yml fix; aligning the two workflows on the canonical key. BOT-C2 (Low): Histogram percentile docstring described the body-stratum rescale as `(p*seenAll - K) / (seenAll - K)`. The actual code computes `(p*seenAll) / (seenAll - K)` (without the K subtraction in the numerator), and the worked examples below the formula already use the correct ratio (e.g. 50_000/99_000 for p=0.50). Bringing the formula and the worked examples into agreement with the implementation; tidying a stale '99_000 - 0' artefact from an earlier draft.
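A hedged check of the corrected rescale, using the commit's worked example (K = HistogramTopK = 1000 at seenAll = 100_000, p = 0.50); the constants below are illustrative, not pulled from the module:

```nim
let
  seenAll = 100_000   # total samples recorded
  k = 1_000           # HistogramTopK: count held in the exact top-K stratum
  p = 0.50            # requested global percentile

# The body stratum holds the seenAll - K smallest samples, so the global
# rank p * seenAll rescales with no K subtraction in the numerator,
# matching the code (and, after this commit, the docstring):
let pBody = (p * float(seenAll)) / float(seenAll - k)  # 50_000 / 99_000 ~= 0.50505
let pBodyClamped = max(0.0, min(1.0, pBody))
doAssert abs(pBodyClamped - 50_000.0 / 99_000.0) < 1e-12
```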
This omnibus PR consolidates 7 previously-reviewed PRs (Track 0–6) adding a shared bench harness, 9 comparison adapters, topology-split bench binaries, interactive uPlot charts, and a snapshot publishing pipeline. Overall the code is well-structured with strong defensive patterns. Found 2 non-blocking quality issues: bench-comparison.yml's nightly threshold reset may silently remove bench.yml's latency threshold configuration, and the bench-tests job clones sibling deps from main rather than the release tags advertised in the PR description.
Severity tally: 2 Low.
Low
- BOT-A2 (.github/workflows/bench-comparison.yml:189): Nightly bench-comparison bencher run may remove bench.yml's latency threshold.
- BOT-A3 (.github/workflows/bench.yml:100): bench-tests job clones sibling Nim deps from main, contradicting the PR description.
Noteworthy
- Consistent soft-skip pattern across all comparison adapters with install/smoke/set-flag stages and force_skip workflow_dispatch inputs — well-designed for CI resilience.
- Cache-line alignment fix for unbounded queue segments via posix_memalign shim is clean and well-documented with platform-conditional fallback for Windows.
Verdict: APPROVE.
```yaml
if: github.event_name == 'schedule' && env.BENCHER_API_TOKEN != ''
run: |
  bencher run \
    --project lockfreequeues \
    --token '${{ secrets.BENCHER_API_TOKEN }}' \
    --branch devel \
    --testbed ubuntu-latest \
    --threshold-measure throughput_ops_ms \
    --threshold-test t_test \
    --threshold-max-sample-size 64 \
    --threshold-lower-boundary 0.99 \
    --thresholds-reset \
    --adapter json \
    --file merged.json \
    --github-actions '${{ secrets.GITHUB_TOKEN }}' \
    --err

- name: Track devel-push benchmarks with Bencher
```
BOT-A2 — Low (quality)
Nightly bench-comparison bencher run may remove bench.yml's latency threshold.
bench-comparison.yml's scheduled (nightly cron) bencher run uploads to the same Bencher project (lockfreequeues) and branch (devel) as bench.yml's push-to-devel run, but it configures only `--threshold-measure throughput_ops_ms --thresholds-reset`. bench.yml configures BOTH throughput_ops_ms AND latency_p99_ns thresholds. Since bencher stores thresholds per project+branch and `--thresholds-reset` clears all prior thresholds, the nightly crossbeam run will remove the latency threshold that bench.yml configured on the previous push. The two workflows would ping-pong the threshold configuration: each night, bench-comparison resets to throughput-only; each devel push, bench.yml restores both throughput and latency. During the window between a nightly and the next devel push, latency regression alerting is silently absent.
Suggested change: either (a) add `--threshold-measure latency_p99_ns --threshold-test t_test --threshold-max-sample-size 64 --threshold-upper-boundary 0.99` to the bench-comparison nightly invocation so both thresholds survive, or (b) remove `--thresholds-reset` from bench-comparison and add `--threshold-measure throughput_ops_ms --threshold-test t_test --threshold-max-sample-size 64 --threshold-lower-boundary 0.99` so both workflows append rather than reset.
```yaml
uses: jiro4989/setup-nim-action@v2
with:
  nim-version: 'stable'

- name: Install build deps (Linux)
  run: |
    sudo apt-get update -q -y
    sudo apt-get -qq install -y clang

- name: Clone and install sibling Nim deps (nim-debra, nim-typestates)
  run: |
    set -e
```
BOT-A3 — Low (quality)
bench-tests job clones sibling Nim deps from main, contradicting PR description.
The PR description states the new bench-tests CI job is "pinned to v0.6.0 / v0.7.0 sibling deps", but the actual bench.yml code clones nim-debra and nim-typestates from --branch main (not tags). Meanwhile bench-comparison.yml correctly uses --branch v0.6.0 and --branch v0.7.0. Using main means any upstream commit to those repos can break CI non-deterministically. The code comment in the bench-tests job header itself says "Sibling Nim deps are pinned to release tags (not main)" which further contradicts the code's actual --branch main. The benchmark matrix job's parallel clone step has a comment that acknowledges the tradeoff ("Tracks main until nimble.directory picks up the relevant tags"), but the bench-tests job's own comment is misleading and the overall discrepancy between PR description, code comments, and code behavior is confusing for maintainers debugging CI breakage.
Suggested change: either update the PR description and the bench-tests job's code comment to accurately reflect that the bench-tests job tracks `main`, or align the code with the description by pinning to `v0.6.0` / `v0.7.0` (as bench-comparison.yml already does).
Consolidated omnibus PR for the bench-rollup feature stack. Originally landed
as 7 stacked PRs (Track 0 through Track 6); fast-forwarded into this branch
and the descendants closed.
81 commits / 70 files changed / +11,296 / -1,493.
What ships
Track 0 — Shared bench harness + native BMF emission
- benchmarks/nim/bench_common.nim: Topology enum, BMFEmitter (alpha-sorted Bencher Metric Format JSON), Histogram (min-heap top-K + Algorithm R reservoir), runThroughputHarness / runLatencyHarness, Stats helpers (emitter usage sketched below).
- benchmarks/merge_bmf.py: stateless union of BMF JSON fragments; exits 1 on (slug, measure) collisions.
- bench_throughput.nim gains --bmf-out=<path>; bit-for-bit unchanged without the flag.
- 5 new lockfreequeues adapters (among them unbounded_sipmuc and unbounded_mupmuc); 3 existing renamed to canonical <library_slug>_adapter.nim.
- Legacy bmf_adapter.py, bench_main.nim, render_readme.nim removed.
- bench.yml rewired to use native BMF; merge_bmf.py produces the single merged.json consumed by bencher run.
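A hedged usage sketch of the emitter surface; the addMeasure parameter order and emit returning a string are assumptions, and the slug and values are invented:

```nim
import bench_common  # benchmarks/nim/bench_common.nim
import std/math      # NaN, the documented "no bound" sentinel

var e = initBMFEmitter()
# assumed shape: (emitter, slug, measure, value, lowerBound, upperBound)
e.addMeasure("lockfreequeues_spsc", "throughput_ops_ms", 52417.3, NaN, NaN)
e.addMeasure("lockfreequeues_spsc", "latency_p99_ns", 812.0, 790.0, 847.0)
# emit() is assumed to return the pretty-printed JSON (alpha-sorted slugs
# and measures, trailing newline) ready to be written as a BMF fragment:
writeFile("fragment.json", e.emit())
```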
Track 1 — Latency wired to BMF, multi-measure-per-slug end-to-end
- bench_latency.nim rewritten on runLatencyHarness; CLI mirrors bench_throughput. Per-binary intdefines: BenchLatencyRuns (33), BenchLatencyMessageCount (100_000), BenchLatencyWarmupRuns (3); the pattern is sketched below.
- bench.yml split into three jobs: bench-throughput, bench-latency (new), bench-upload (merges fragments, single bencher run).
- Emits latency_p50_ns / latency_p95_ns / latency_p99_ns per slug.
- adapter.nim re-exports PushResult / PopResult from bench_common to unify the two parallel definitions.
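The per-binary intdefine pattern referenced above, as a minimal sketch; the constant names and defaults match the description, while the compile command is only an example:

```nim
const
  BenchLatencyRuns {.intdefine.} = 33
  BenchLatencyMessageCount {.intdefine.} = 100_000
  BenchLatencyWarmupRuns {.intdefine.} = 3

# Overridable per binary at compile time, e.g. for a fast local smoke run:
#   nim c -r -d:BenchLatencyRuns=3 -d:BenchLatencyMessageCount=10_000 bench_latency.nim
echo BenchLatencyRuns, " runs x ", BenchLatencyMessageCount,
     " msgs (+", BenchLatencyWarmupRuns, " warmup)"
```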
Track 2 — Topology split (5 binaries) + per-step CI timeouts
- bench_throughput.nim split into bench_spsc.nim, bench_mpsc.nim, bench_mpmc.nim, bench_unbounded.nim (plus the latency binary from Track 1), each built and run independently.
- tests/fixtures/pre-split-slugs.json + superset_check.py ensure the post-split BMF is a superset of the pre-split slug set (invariant sketched below).
- Per-step timeout-minutes so a hang in one variant cannot burn the whole budget.
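The superset invariant, as a hedged Nim sketch (the shipped check is superset_check.py, a Python script; the fixture is assumed here to be a flat JSON array of slug strings):

```nim
import std/[json, sets]

var pre = initHashSet[string]()
for s in parseFile("tests/fixtures/pre-split-slugs.json"):
  pre.incl s.getStr   # assumption: fixture is a JSON array of slugs

var post = initHashSet[string]()
for slug, _ in parseFile("merged.json").pairs:
  post.incl slug      # BMF keeps benchmark slugs at the top level

# deletion safety: the topology split may add slugs, never drop them
doAssert pre <= post, "post-split BMF lost pre-split slugs"
```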
Track 3 — Comparison MVP: Loony, Boost.LockFree, Crossbeam adapters
- Adapters: LoonyQueue (mpmc_unbounded), boost::lockfree::queue (mpmc bounded), boost::lockfree::spsc_queue (spsc bounded), crossbeam::ArrayQueue (mpmc bounded), crossbeam::SegQueue (mpmc_unbounded).
- benchmarks/rust/bench-ffi-crossbeam: cdylib (stable toolchain, panic = "abort" for cross-FFI safety).
- Soft-skip adapter pattern: install -> smoke -> set-flag-on-success -> annotate-on-failure; workflow_dispatch accepts force_skip_* inputs.
- bench-comparison.yml runs Crossbeam on a nightly cron (separate Bencher Report; not on PR critical path).
- Unbounded queue segments cache-line-aligned via posix_memalign through new internal/aligned_alloc.nim; verified by tests/t_unbounded_padding.nim (shim sketched below).
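A hedged sketch of such a shim; the proc names are illustrative, and only the POSIX path is shown, whereas the real internal/aligned_alloc.nim is described elsewhere in this thread as carrying a platform-conditional Windows fallback:

```nim
proc c_posix_memalign(memptr: ptr pointer; alignment, size: csize_t): cint
  {.importc: "posix_memalign", header: "<stdlib.h>".}
proc c_free(p: pointer) {.importc: "free", header: "<stdlib.h>".}

proc allocAligned(size: Natural; alignment: Natural = 64): pointer =
  ## 64-byte default: a typical cache-line size, so queue segments never
  ## share a cache line with unrelated allocations (false-sharing guard).
  doAssert c_posix_memalign(addr result, csize_t(alignment), csize_t(size)) == 0

let seg = allocAligned(4096)
doAssert (cast[uint](seg) and 63) == 0  # low 6 bits clear: cache-line aligned
c_free(seg)
```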
Track 4 — Comparison expansion: Threading.Channels, Nim Channel, MoodyCamel
- MoodyCamel ConcurrentQueue vendored as a single header at benchmarks/vendor/concurrentqueue/concurrentqueue.h (commit d655418bb644b7f85159d94c591d7d983949fb81); an extern "C" wrapper isolates Nim from upstream template machinery.
- THIRD_PARTY_LICENSES.md gains its first vendored entry.
Track 5 — Interactive uPlot charts + mkdocs deploy from devel
- docs/assets/uplot-1.6.27.iife.min.js vendored (47884 bytes, SHA-256 verified against jsdelivr).
- docs/assets/bench-charts.{js,css} renders an interactive throughput chart with library-toggle legend, log-scale Y toggle, hover tooltip, graceful fallbacks. Contract test test_bench_charts_contract.py guards the BMF -> chart shape.
- docs/benchmarks.md embeds the chart via mkdocs-material's md_in_html; page is registered in nav.
- merged.json is copied to docs/assets/bench-results/<sha>.json and latest.json on every devel push; commits as github-actions[bot]. Three-layer loop-prevention: paths-ignore, actor guard, narrow [skip ci].
- docs.yml extended to redeploy mike's dev/ alias on devel pushes. Post-deploy curl verifies the published BMF asset returns HTTP 200 and parses as JSON.
Track 6 — Latency thresholds + p999 / max measures
- bench_latency.nim emits latency_p999_ns and latency_max_ns alongside the p50/p95/p99 trio: a full tail-latency profile per (library, topology) cell.
- HistogramTopK 1000 -> 5000: at production sample counts (BenchLatencyMessageCount = 100_000 x BenchLatencyRuns = 33, seenAll ~ 3.3M), the p999 rank is ~3300, so K = 5000 keeps p999 in the exact top-K stratum (arithmetic checked below). Memory cost: 40 KB per histogram.
- Bencher latency_p99_ns threshold configured alongside the existing throughput threshold.
- bench-tests CI job runs the bench harness test suite (benchtests nimble task) on every PR; pinned to v0.6.0 / v0.7.0 sibling deps; no fork-PR guard (no secrets needed for unit tests).
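The K sizing claim, checked as a hedged back-of-envelope (constants mirror the stated intdefines; 8 bytes per float64 sample):

```nim
let seenAll = 100_000 * 33            # BenchLatencyMessageCount x BenchLatencyRuns = 3_300_000
let k = 5000                          # new HistogramTopK
let p999Rank = int(0.001 * float(seenAll))  # samples at or above p999: 3300
doAssert p999Rank <= k                # p999 (and p9999, max) resolve in the exact top-K stratum
doAssert k * 8 == 40_000              # 5000 float64 samples = 40 KB per histogram
```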
Tests added
- t_bench_common.nim: 26 tests — BMFEmitter ordering, Histogram percentile accuracy on log-normal input, harness smoke, all 5 new lockfreequeues adapters round-trip.
- t_bench_latency.nim: 7 tests — intdefine compile-time asserts, --bmf-out integration, multi-fragment merge + collision guard.
- t_bench_adapters.nim: round-trip set-equality tests for every comparison adapter, gated by the same -d:adapter_*_available defines.
- t_topology_split.nim: per-binary slug coverage + superset check.
- t_unbounded_padding.nim: 8 cache-line-alignment assertions across the 4 unbounded variants (c/cpp/arc/refc).
- t_aligned_alloc.nim: aligned_alloc helper unit tests.
- benchmarks/tests/test_merge_bmf.py: 10 tests covering slug regex, measure regex, collision detection, alpha-sorted output, NaN rejection (collision rule sketched after this list).
- benchmarks/tests/test_superset_check.py: deletion-safety regression tests for the topology split.
- benchmarks/tests/test_bench_charts_contract.py: 9 tests guarding the BMF -> JS chart contract.
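And the collision rule those merge tests pin down, as a hedged Nim sketch (the shipped tool is merge_bmf.py; the proc name and file names here are illustrative):

```nim
import std/json

proc mergeBmf(paths: openArray[string]): JsonNode =
  ## Stateless union of BMF fragments; a duplicate (slug, measure)
  ## pair is a hard error, mirroring merge_bmf.py's exit-1 behavior.
  result = newJObject()
  for path in paths:
    for slug, measures in parseFile(path).pairs:
      if slug notin result:
        result[slug] = newJObject()
      for measure, v in measures.pairs:
        if measure in result[slug]:
          quit("collision: " & slug & "/" & measure, 1)
        result[slug][measure] = v

when isMainModule:
  echo mergeBmf(["throughput.json", "latency.json"]).pretty
```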
Verification
All Track-level review feedback (13 lows from constituent PRs)
addressed in-tree before consolidation. Original constituent PRs were
each individually approved by Momus. Local nim test suite: 200/200
pass.
Cost gate context
Bencher upload cost gate is in place: the PR-base allowlist is limited to main/devel (no feat/**), and PR uploads require the bench label. This omnibus PR will trigger bench-throughput / bench-latency runs on push (non-billable to Bencher absent the label), but the bencher run upload step skips without it. Add the bench label deliberately to record this revision.