feat(bench): latency thresholds + p999/max measures (Track 6, final) #25
Conversation
Code Review
This pull request expands the benchmarking suite by adding latency_p999_ns and latency_max_ns measures and increasing the HistogramTopK constant to 5000 to maintain accuracy at production sample sizes. It also updates the CI configuration for regression gating and includes comprehensive tests for the new metrics. Review feedback suggests refining the terminology in code comments for better precision and updating internal test documentation to reflect the broader set of assertions.
```diff
+  # Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
+  # Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum at
+  # production sample counts (3.3M samples); max is `percentile(1.0)`,
+  # always the top-K head.
-  test "all four bounded variants emit latency_p50_ns / latency_p99_ns":
+  test "all four bounded variants emit latency_p50_ns / latency_p99_ns / latency_p999_ns / latency_max_ns":
     # Per impl plan Track 1 Acceptance Criteria: BMF JSON contains
     # latency_p50_ns and latency_p99_ns for sipsic / sipmuc / mupsic /
```
/gemini review
Code Review
This pull request introduces p999 and max latency measures, updates Bencher regression gating thresholds, and increases the HistogramTopK constant from 1000 to 5000. Feedback focuses on correcting a recurring technical inaccuracy in the documentation and code comments: the current benchmarking harness averages per-run percentiles rather than unioning all 3.3M samples into a single histogram. Consequently, the rationale for the HistogramTopK increase should be adjusted to reflect that it provides headroom for larger sample sizes rather than being a strict requirement for the existing aggregation logic.
```nim
HistogramTopK* = 5000 ## exact top-K largest samples seen.
                      ## Sized to cover p999 at production sample
                      ## sizes (Track 6 Task 6.2): at
                      ## `BenchLatencyMessageCount=100_000` and
                      ## `BenchLatencyRuns=33`, `seenAll ≈ 3.3M`
                      ## per aggregated histogram, so the p999
                      ## tail count is ~3300. K=5000 keeps the
                      ## p999 lookup inside the exact top-K
                      ## stratum with headroom for stochastic
                      ## variation around the boundary. Memory
                      ## cost is 5000 × 8B = 40KB per histogram,
                      ## negligible vs the 99K reservoir.
```
The rationale for increasing HistogramTopK to 5000 appears to be based on a misunderstanding of how runLatencyHarness aggregates data. As implemented in lines 409-444, the harness calculates percentiles for each run independently and then averages those results; it does not union all samples into a single aggregated histogram.
With the default BenchLatencyMessageCount=100,000, each histogram instance only sees 100,000 samples. In this case, the p99.9 tail count is exactly 100, which was already well-covered by the previous HistogramTopK=1000. Furthermore, since HistogramTopK (1000) + HistogramReservoir (99,000) == 100,000, the histogram was already capturing every single sample exactly for the default message count.
While increasing K to 5000 is harmless, the comments should be corrected to reflect that this provides headroom for larger BenchLatencyMessageCount overrides, rather than suggesting it's required for the current 33-run aggregation.
```nim
HistogramTopK* = 5000 ## exact top-K largest samples seen.
                      ## Sized to cover p999 even if
                      ## `BenchLatencyMessageCount` is increased
                      ## significantly (up to 5M samples). Note:
                      ## current harness averages per-run
                      ## percentiles rather than unioning.
```
```nim
# Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
# Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum at
# production sample counts (3.3M samples); max is `percentile(1.0)`,
# always the top-K head.
```
This comment repeats the incorrect assumption that 3.3M samples are aggregated into a single histogram. As noted in bench_common.nim, each run uses its own histogram of 100,000 samples (by default).
```nim
# Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
# Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum
# even if MessageCount is increased; max is `percentile(1.0)`,
# always the top-K head.
```
```
- `HistogramTopK` raised from 1000 to 5000 (PR 6, Task 6.2). At
  production sample counts (`BenchLatencyMessageCount=100_000` ×
  `BenchLatencyRuns=33` ≈ 3.3M samples per aggregated histogram), the
  p999 tail rank is ~3300; K=1000 forced the p999 lookup into the
  rescaled-reservoir stratum, while K=5000 keeps it in the exact
  top-K stratum with headroom. Memory cost: 5000 × 8B = 40KB
  additional per histogram, negligible vs the 99K-sample reservoir.
  New `t_bench_common.nim` test asserts p999 within 5% of sort
  fallback on a 3.3M log-normal stream.
```
…ntiles

Per Gemini review on PR #25. The Track 6 Task 6.2 commentary in HistogramTopK, bench_latency.nim, t_bench_common.nim, and CHANGELOG repeatedly claimed that runLatencyHarness aggregates samples across runs ("seenAll ≈ 3.3M per aggregated histogram"). It does not — design 2.5 explicitly says "mean across runs of each percentile", so each Histogram only ever sees BenchLatencyMessageCount samples (default 100K), not MessageCount × Runs.

At the default 100K MessageCount the previous K=1000 was already adequate — TopK + Reservoir already captured every sample exactly (1000 + 99,000 ≥ 100,000), so K=1000 was never spilling p999 into the rescaled stratum at default settings.

Reframed all four sites: the K=5000 sizing is **anticipatory** for operators who override BenchLatencyMessageCount upward (~5M-sample overrides), keeping p999 (tail rank = MessageCount × 0.001) inside the exact top-K stratum at that volume. The 3.3M-sample stress test in t_bench_common.nim is reframed as a stress shape that exercises the K=5000 design choice, not a depiction of production volume.

No code logic changed; comments + CHANGELOG only.
/gemini review
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
/review
Code Review by Qodo
1. Bench tests not executed
/review
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.
/review
Retrigger

Retrigger after concurrency-check workflow_id filter fix.
Note: verdict was APPROVE but downgraded to COMMENT because the GitHub Actions installation token cannot approve PRs.
PR adds latency-p99 Bencher regression gating, emits p999/max latency measures, bumps HistogramTopK 1000→5000 for anticipatory headroom, and introduces a bench-tests CI job. The throughput measure-name fix (throughput_ops_ms) corrects a prior no-op. Core logic is sound; one minor test gap: the harness-level smoke test verifies p50 < p99 < max but omits the p999 ≥ p99 ordering invariant now that p999 is a production measure (covered in the heavier integration test).
No findings.
Noteworthy
- Correctly fixes a dormant throughput-threshold no-op: `--threshold-measure throughput` never matched the actual `throughput_ops_ms` key emitted by the bench binaries. After this PR, throughput regression gating actually fires.
- HistogramTopK bump commentary is thorough: clearly documents why K=5000 is anticipatory (for larger MessageCount overrides, not for the default), and the 3.3M-sample stress test directly validates the design choice under operator-driven configuration.
Verdict: APPROVE.
Commands
- Comment `/ai-review` to request a re-review of the latest changes.
- Reply to a finding with `won't fix`, `by design`, or `not a bug` to decline it.
- Reply with `instead, ...` to propose an alternative fix.
✅ Momus review posted — verdict APPROVE, 0 findings
Note: verdict was APPROVE but downgraded to COMMENT because the GitHub Actions installation token cannot approve PRs.
This PR (Track 6, final) adds latency p999/max measures to the BMF emission, bumps HistogramTopK from 1000→5000 with supporting stress tests, configures per-measure Bencher thresholds (latency_p99_ns upper-boundary + throughput_ops_ms lower-boundary), fixes a prior measure-name mismatch, and adds a bench-tests CI job.
Severity tally: 1 Low.
Low
- BOT-A2 (`.github/workflows/bench.yml:105`): bench-tests job uses floating dependency refs while the bench job pins to version tags
Noteworthy
- Fixes a latent CI bug: the prior `--threshold-measure throughput` was a no-op because the actual measure key is `throughput_ops_ms` — the threshold now correctly targets the right measure name.
- The 3.3M-sample stress test is gated behind `-d:BenchCommonStress` so the CI bench-tests stay fast while the design choice is still validation-gated.
Verdict: APPROVE.
Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.
```sh
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
```
BOT-A2 — Low (standards)
bench-tests job uses floating dependency refs while bench job pins to version tags
The new bench-tests job clones nim-debra and nim-typestates from --branch main (a floating ref), while the existing bench job (lines 172-173) pins these same dependencies to --branch v0.6.0 and --branch v0.7.0. This is a minor inconsistency: the bench job correctly pins for reproducible benchmark data, while bench-tests is a validation job that benefits from catching upstream breaks early. The existing bench-comparison.yml also uses --branch main for these deps, establishing precedent that floating refs are acceptable in non-benchmarking CI jobs.
```sh
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
```

For consistency, consider pinning:

```sh
git clone --depth 1 --branch v0.6.0 https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch v0.7.0 https://github.com/elijahr/nim-typestates.git
```

Or add a comment noting the intentional floating-ref choice for early breakage detection.
/ai-review

2 similar comments

/ai-review

/ai-review
This PR (Track 6/final) adds latency p99 Bencher regression gating, p999/max BMF measures, HistogramTopK bump to 5000, a benchtests nimble task and CI job, and a Momus review workflow. Overall the changes are well-structured with good test additions. Three non-blocking issues identified: a stale contradictory comment, a CI test job with an unnecessarily restrictive fork-PR guard, and inconsistency in dependency pinning between CI jobs.
Severity tally: 2 Lows.
Low
- BOT-B1 (`.github/workflows/bench.yml:84`): bench-tests CI job's fork-PR guard is unnecessarily restrictive
- BOT-B3 (`.github/workflows/bench.yml:101`): bench-tests CI job clones deps from main instead of tagged releases
Noteworthy
- Corrects a benign but longstanding measure-name mismatch in bench.yml: `--threshold-measure throughput` (nonexistent) → `throughput_ops_ms` (actual measure key), so the throughput regression threshold will now actually fire.
Verdict: APPROVE.
```yaml
github.event.pull_request.head.repo.full_name == github.repository)
timeout-minutes: 10
```
BOT-B1 — Low (tests)
bench-tests CI job's fork-PR guard is unnecessarily restrictive
The new bench-tests job copies its `if:` guard from the bench job, which checks `head.repo.full_name == github.repository` to prevent fork PRs from running (bench needs `BENCHER_API_TOKEN`, which is unavailable to forks). bench-tests uses no secrets — it only clones sibling public repos, compiles, and runs tests. The fork-PR guard therefore unnecessarily prevents external PR contributors from getting bench harness test coverage. The guard should be simplified to only check `github.actor != 'github-actions[bot]'`.
```yaml
- name: Clone and install sibling Nim deps (nim-debra, nim-typestates)
  run: |
    set -e
    cd ..
    git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
    git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
    (cd nim-typestates && nimble install -y)
```
BOT-B3 — Low (quality)
bench-tests CI job clones deps from main instead of tagged releases
The bench-tests job clones nim-debra and nim-typestates from --branch main, while the sibling bench job (line ~155) pins to --branch v0.6.0 and --branch v0.7.0 respectively. Using untagged main makes the CI non-deterministic: a breaking change pushed to a sibling repo's main could cause spurious test failures in lockfreequeues PRs.
Track 6 (PR 6, final track in the bench-rollup feature stack) adds latency-regression gating in Bencher and the supporting BMF measures.

bench_latency.nim now emits the full p50/p95/p99/p999/max latency tuple per bounded variant slug. Each call to runLatencyHarness adds two new measures alongside the existing trio:

```nim
em.addMeasure(slug, "latency_p999_ns", metrics.p999_ns)
em.addMeasure(slug, "latency_max_ns", metrics.max_ns)
```

The merged BMF on each bench job thus carries a complete tail-latency profile per (library, topology) cell, available to the Bencher dashboard and any downstream comparison chart.

bench_common.nim raises HistogramTopK from 1000 to 5000. At production sample counts (BenchLatencyMessageCount=100_000 × BenchLatencyRuns=33, seenAll ≈ 3.3M per aggregated histogram) the p999 tail rank is ~3300; K=1000 forced the p999 lookup into the rescaled-reservoir stratum, while K=5000 keeps it in the exact top-K stratum with headroom for stochastic boundary variation. Memory cost is 5000 × 8B = 40KB additional per histogram, negligible relative to the 99K-sample reservoir.

bench.yml's base-branch tracking step (push to devel/main) configures per-measure thresholds in a single bencher run invocation: latency_p99_ns with --threshold-upper-boundary 0.99 (regression = latency increase) and throughput_ops_ms with --threshold-lower-boundary 0.99 (regression = throughput drop). Both share --threshold-test t_test --threshold-max-sample-size 64, terminated by --thresholds-reset so only the explicitly-listed thresholds remain active. This also corrects a prior measure-name mismatch: the earlier --threshold-measure throughput never matched any emitted measure (the actual key is throughput_ops_ms), so the previous throughput threshold was a no-op.

Threshold activation requires >= 10 prior runs accumulated in Bencher to calibrate the t-test baseline (Task 6.4 stability soak gate). Bencher does not emit alerts on measures with insufficient sample history, so the configuration is dormant until the soak completes post-merge.

Tests:
- t_bench_common.nim: HistogramTopK >= 5000 assertion + p999 within 5% of sort fallback on 3.3M log-normal samples (27 tests, all passing).
- t_bench_latency.nim: every bounded variant slug carries latency_p999_ns + latency_max_ns with monotone tail ordering p99 <= p999 <= max (7 tests, all passing).
- tests/test.nim main suite unaffected (200 tests, all passing).

Refs design doc /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md section 3 PR 6 and impl plan /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md Track 6 (Tasks 6.1-6.5).
Code review on PR 6 noted that the new HistogramTopK assertion in tests/t_bench_common.nim and the latency CLI assertions in tests/t_bench_latency.nim were not imported by tests/test.nim and not run by any CI step, so the validation added by this PR was not actually enforced in CI. Wire up via nimble's separation-of-concerns pattern (Option B from the review):

- lockfreequeues.nimble gains a 'benchtests' task that compiles and runs t_bench_common, t_bench_latency, and t_bench_adapters under one Nim invocation each; a hedged sketch of the task shape follows this list. The bench harness tests live outside srcDir; importing them from tests/test.nim would force every entry of the 8-invocation MM/sanitizer test matrix to compile bench_common's threading machinery. A dedicated task keeps the regular nimble test matrix lean while still enforcing the bench-side assertions.
- A second 'benchteststress' task enables the gated 3.3M-sample HistogramTopK p999 stress shape (slow, ~10-15s release-mode), so default benchtests stays under a second.
- bench.yml gains a 'bench-tests' job (ubuntu-latest) that runs 'nimble benchtests' once per workflow trigger, with the same fork-PR / bot-actor guards as the bench matrix jobs.

The 3.3M-sample stress test in t_bench_common is now gated behind -d:BenchCommonStress so it only runs under benchteststress, not the default benchtests run.
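A minimal sketch of the shape such tasks could take, assuming the three test files live under `tests/` and need no flags beyond the ones the commit names; the actual task bodies in lockfreequeues.nimble may differ:

```nim
# Hypothetical lockfreequeues.nimble excerpt -- task names and the
# BenchCommonStress define come from the commit message; paths and
# compiler flags here are assumptions.
task benchtests, "compile and run the bench harness tests":
  exec "nim c -r tests/t_bench_common.nim"
  exec "nim c -r tests/t_bench_latency.nim"
  exec "nim c -r tests/t_bench_adapters.nim"

task benchteststress, "bench harness tests incl. 3.3M-sample stress shape":
  exec "nim c -r -d:release -d:BenchCommonStress tests/t_bench_common.nim"
```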
The previous wording called `percentile(1.0)` "the top-K head", which is misleading for a min-heap implementation: `topKHeap[0]` is the SMALLEST element in the top-K (the admission threshold for new outliers), not the largest. The implementation snapshots the heap into a sequence, sorts it, and returns `topk[^1]` — the maximum. Comment-only change; behaviour unchanged.
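That behaviour is easy to see in a self-contained Nim sketch (illustrative only; the real Histogram in bench_common.nim differs in detail): the heap root is merely the admission threshold, and the maximum only surfaces after the snapshot-and-sort step.

```nim
import std/[algorithm, heapqueue]

const K = 5  # illustrative; the real constant is HistogramTopK = 5000

proc admit(h: var HeapQueue[int], sample: int) =
  ## Keep the K largest samples in a min-heap: h[0] is the SMALLEST
  ## of the top-K, i.e. the admission threshold for new outliers.
  if h.len < K:
    h.push(sample)
  elif sample > h[0]:
    discard h.pop()
    h.push(sample)

var topKHeap = initHeapQueue[int]()
for s in [3, 9, 1, 7, 5, 8, 2, 6]:
  topKHeap.admit(s)

# percentile(1.0): snapshot the heap into a sequence, sort, return topk[^1].
var topk: seq[int]
while topKHeap.len > 0:
  topk.add topKHeap.pop()
topk.sort()
echo topk[^1]  # 9 -- the maximum; the heap's head would have been 5
```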
Replaces the Qodo PR-Agent reusable workflow with Momus (axiomantic/momus), an open-source from-scratch PR review bot.
…KEY -> LLM_API_KEY, pass MOMUS_APP_* secrets, set trigger_mention
PR 25 adds latency p999/max measures to BMF emission, bumps HistogramTopK from 1000 to 5000, configures per-measure Bencher thresholds (correcting a prior measure-name mismatch that made the throughput threshold a no-op), and adds corresponding tests. The logic is sound and well-documented. One stale inline comment contradicts the new behavior.
No findings.
Noteworthy
- The Bencher threshold configuration correctly switches `--threshold-measure throughput` (which never matched any emitted key) to `--threshold-measure throughput_ops_ms`, making the throughput regression gate functional for the first time.
- The K=5000 sizing is well-justified with explicit headroom math, and the compile-time `HistogramTopK >= 5000` assertion in CI prevents silent regression of the constant.
Verdict: APPROVE.
Final PR in the bench-rollup feature stack. Depends on PR #19, PR #20, PR #21, PR #22, PR #23, PR #24.
Track 6 (PR 6) of the bench-rollup feature: adds latency-regression gating in Bencher and the supporting BMF measures. After this PR merges down through the stack, the benchmark CI on `devel` produces a complete tail-latency profile per (library, topology) cell and flags p99 latency regressions on each push.
What changed
`bench_latency.nim` — emit `latency_p999_ns` + `latency_max_ns` (Task 6.1)

Each bounded variant slug now carries the full p50 / p95 / p99 / p999 / max latency tuple in the merged BMF. `runLatencyHarness` adds two new measures alongside the existing trio:

```nim
em.addMeasure(slug, "latency_p999_ns", metrics.p999_ns)
em.addMeasure(slug, "latency_max_ns", metrics.max_ns)
```

The merged BMF on each bench job thus carries a complete tail-latency profile per (library, topology) cell, available to the Bencher dashboard and any downstream comparison chart.
`bench_common.nim` — `HistogramTopK` 1000 → 5000 (Task 6.2)

At production sample counts (`BenchLatencyMessageCount=100_000` × `BenchLatencyRuns=33`, `seenAll ≈ 3.3M` per aggregated histogram), the p999 tail rank is ~3300; K=1000 forced the p999 lookup into the rescaled-reservoir stratum, while K=5000 keeps it in the exact top-K stratum with headroom for stochastic boundary variation. Memory cost: 5000 × 8B = 40KB additional per histogram, negligible relative to the 99K-sample reservoir.

A new `t_bench_common.nim` test asserts p999 within 5% of sort fallback on a 3.3M log-normal stream, directly verifying the K=5000 design choice.
`bench.yml` — per-measure threshold args (Task 6.3)

The base-branch tracking step (push to `devel`/`main`) configures per-measure thresholds in a single `bencher run` invocation:

- `latency_p99_ns` with `--threshold-upper-boundary 0.99` (regression = latency increase past upper bound)
- `throughput_ops_ms` with `--threshold-lower-boundary 0.99` (regression = throughput drop below lower bound)

Both share `--threshold-test t_test --threshold-max-sample-size 64`, terminated by `--thresholds-reset` so only the explicitly-listed thresholds remain active; a sketch of the assembled invocation follows below.

This also corrects a prior measure-name mismatch: the earlier `--threshold-measure throughput` never matched any emitted measure (the actual key is `throughput_ops_ms`), so the previous throughput threshold was a no-op. After this PR, the throughput threshold actually fires.
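Assembled, those flags plausibly form an invocation shaped like the following (the threshold flags are the ones named in this section; `--project`, `--branch`, `--testbed`, and `--file` stand in for whatever the real bench.yml step passes):

```sh
bencher run \
  --project <project> --branch devel --testbed <testbed> \
  --threshold-measure latency_p99_ns \
  --threshold-test t_test \
  --threshold-max-sample-size 64 \
  --threshold-upper-boundary 0.99 \
  --threshold-measure throughput_ops_ms \
  --threshold-test t_test \
  --threshold-max-sample-size 64 \
  --threshold-lower-boundary 0.99 \
  --thresholds-reset \
  --file results.json
```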
Stability soak gate (Task 6.4)
Threshold activation requires ≥ 10 prior runs accumulated in Bencher
to calibrate the t-test baseline. Bencher will not emit alerts on
measures with insufficient sample history, so the configuration is
dormant until the soak completes post-merge. The configuration is
in place so activation is purely a function of accumulated run count,
not workflow edits.
The soak is a runtime gate, not a code gate — it cannot be exercised
inside this PR. After merge, the recommended verification flow is:
1. Let `devel` accumulate ≥ 10 bench runs (each push triggers one).
2. Check the `latency_p99_ns` coefficient of variation across runs; target CV ≤ 5%.
3. If CV misses the target, consider tuning `--threshold-max-sample-size` or revisiting the bench harness for noise sources (warmup count, sample size, NUMA pinning).
4. Verify an alert fires on a synthetic regression > 1%.
CHANGELOG (Task 6.5)
Three new bullets under `[Unreleased] / Added` covering the threshold config, the new measures, and the K=5000 bump with rationale.
Tests
- `t_bench_common.nim`: 27 tests pass (includes the new `HistogramTopK >= 5000` assertion and the 3.3M log-normal p999 accuracy test).
- `t_bench_latency.nim`: 7 tests pass (every bounded variant slug carries `latency_p999_ns` + `latency_max_ns`; tail order `p99 <= p999 <= max` enforced on the smoke shape).
- `tests/test.nim` main suite: 200 tests pass, no regressions.
- `actionlint` clean on `.github/workflows/bench.yml`.
- `bench_latency --bmf-out=...` confirms the merged BMF emits all five latency measures per slug.
Follow-ups (post-merge, outside this PR)
- Monitor the CV of `latency_p99_ns` on `devel`; document CV in a follow-up note or release announcement.
- Thresholds go live automatically once the 11th run accumulates; no further workflow edits required.
References
- `/Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md` §3 PR 6
- `/Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md` Track 6 (Tasks 6.1-6.5)