
feat(bench): latency thresholds + p999/max measures (Track 6, final)#25

Closed
elijahr wants to merge 6 commits into feat/bench-rollup-pr5-interactive-charts from feat/bench-rollup-pr6-latency-thresholds

Conversation

@elijahr
Owner

@elijahr elijahr commented May 1, 2026

Final PR in the bench-rollup feature stack. Depends on PR #19, PR #20, PR #21, PR #22, PR #23, PR #24.

Track 6 (PR 6) of the bench-rollup feature: adds latency-regression
gating in Bencher and the supporting BMF measures. After this PR
merges down through the stack, the benchmark CI on devel produces
a complete tail-latency profile per (library, topology) cell and
flags p99 latency regressions on each push.

What changed

bench_latency.nim — emit latency_p999_ns + latency_max_ns (Task 6.1)

Each bounded variant slug now carries the full p50 / p95 / p99 / p999 / max
latency tuple in the merged BMF. runLatencyHarness adds two new measures
alongside the existing trio:

em.addMeasure(slug, "latency_p999_ns", metrics.p999_ns)
em.addMeasure(slug, "latency_max_ns",  metrics.max_ns)

The merged BMF on each bench job thus carries a complete tail-latency
profile per (library, topology) cell, available to the Bencher dashboard
and any downstream comparison chart.
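
For orientation, a Python sketch of the per-slug shape the merged BMF carries after this change. The slug and numbers are illustrative placeholders, and the exact BMF envelope (each measure mapping to an object with a `value` key) should be checked against the Bencher documentation:

```python
import json

# Hypothetical metric values for one (library, topology) cell; the slug
# and numbers are illustrative, not taken from a real run.
metrics = {"p50_ns": 431.0, "p95_ns": 508.0, "p99_ns": 549.0,
           "p999_ns": 915.0, "max_ns": 22150.0}

slug = "lockfreequeues_mupmuc/mpmc/1p1c"
bmf = {
    slug: {
        "latency_p50_ns":  {"value": metrics["p50_ns"]},
        "latency_p95_ns":  {"value": metrics["p95_ns"]},
        "latency_p99_ns":  {"value": metrics["p99_ns"]},
        "latency_p999_ns": {"value": metrics["p999_ns"]},  # new in Task 6.1
        "latency_max_ns":  {"value": metrics["max_ns"]},   # new in Task 6.1
    }
}
print(json.dumps(bmf, indent=2))
```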

bench_common.nim — HistogramTopK 1000 → 5000 (Task 6.2)

At production sample counts (BenchLatencyMessageCount=100_000 ×
BenchLatencyRuns=33, seenAll ≈ 3.3M per aggregated histogram), the
p999 tail rank is ~3300; K=1000 forced the p999 lookup into the
rescaled-reservoir stratum, while K=5000 keeps it in the exact top-K
stratum with headroom for stochastic boundary variation. Memory cost:
5000 × 8B = 40KB additional per histogram, negligible relative to the
99K-sample reservoir.
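
The sizing argument reduces to simple rank arithmetic; a hedged Python sketch (helper names are hypothetical, not the harness API):

```python
def p999_tail_rank(n_samples: int) -> int:
    # Number of samples at or above the p999 cut: the top 0.1% of the stream.
    return max(1, int(n_samples * 0.001))

def p999_is_exact(n_samples: int, top_k: int) -> bool:
    # If the tail rank fits inside the exact top-K stratum, percentile(0.999)
    # reads exactly-kept samples; otherwise it falls into the
    # rescaled-reservoir stratum and becomes approximate.
    return p999_tail_rank(n_samples) <= top_k

assert p999_is_exact(3_300_000, 5000)      # rank 3300 <= 5000: exact
assert not p999_is_exact(3_300_000, 1000)  # rank 3300 > 1000: reservoir
```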

A new t_bench_common.nim test asserts p999 within 5% of sort fallback
on a 3.3M log-normal stream, directly verifying the K=5000 design choice.
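
The shape of that accuracy check can be sketched independently of the Nim harness. The Python below keeps the K largest samples in a min-heap and compares the top-K p999 against a sort fallback, at a scaled-down 330K-sample stream; the function names, distribution parameters, and sample count are illustrative, not the actual t_bench_common.nim code:

```python
import heapq
import random

def lognormal_stream(n, seed):
    rng = random.Random(seed)
    return [rng.lognormvariate(6.0, 0.5) for _ in range(n)]

def topk_percentile(samples, k, q):
    # Keep the k largest samples in a min-heap; if the requested tail
    # rank fits inside k, the percentile is read from exact samples.
    heap = []
    for v in samples:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)
    rank = max(1, round(len(samples) * (1 - q)))  # samples at/above the cut
    assert rank <= k, "tail rank outside exact top-K stratum"
    return sorted(heap, reverse=True)[rank - 1]

n = 330_000  # scaled-down stand-in for the 3.3M stress stream
samples = lognormal_stream(n, seed=42)
exact = sorted(samples)[int(n * 0.999) - 1]        # sort-fallback reference
approx = topk_percentile(samples, k=5000, q=0.999)
rel_err = abs(approx - exact) / exact
assert rel_err < 0.05
```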

bench.yml — per-measure threshold args (Task 6.3)

The base-branch tracking step (push to devel / main) configures
per-measure thresholds in a single bencher run invocation:

  • latency_p99_ns with --threshold-upper-boundary 0.99 (regression =
    latency increase past upper bound)
  • throughput_ops_ms with --threshold-lower-boundary 0.99 (regression =
    throughput drop below lower bound)

Both share --threshold-test t_test --threshold-max-sample-size 64,
terminated by --thresholds-reset so only the explicitly-listed
thresholds remain active.
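
Assembled from the flags above, the tracking step looks roughly like the following. The project/testbed slugs, the `if:` guard, and the adapter/file arguments are assumptions for illustration, not the actual bench.yml contents:

```yaml
- name: Track base branch with per-measure thresholds
  if: github.ref == 'refs/heads/devel'
  run: |
    bencher run \
      --project lockfreequeues \
      --branch devel \
      --testbed ubuntu-latest \
      --threshold-measure latency_p99_ns \
      --threshold-test t_test \
      --threshold-max-sample-size 64 \
      --threshold-upper-boundary 0.99 \
      --threshold-measure throughput_ops_ms \
      --threshold-test t_test \
      --threshold-max-sample-size 64 \
      --threshold-lower-boundary 0.99 \
      --thresholds-reset \
      --adapter json \
      --file merged-bmf.json
```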

This also corrects a prior measure-name mismatch: the earlier
--threshold-measure throughput never matched any emitted measure
(the actual key is throughput_ops_ms), so the previous throughput
threshold was a no-op. After this PR, the throughput threshold actually
fires.

Stability soak gate (Task 6.4)

Threshold activation requires ≥ 10 prior runs accumulated in Bencher
to calibrate the t-test baseline. Bencher will not emit alerts on
measures with insufficient sample history, so the configuration is
dormant until the soak completes post-merge. The configuration is
in place so activation is purely a function of accumulated run count,
not workflow edits.

The soak is a runtime gate, not a code gate — it cannot be exercised
inside this PR. After merge, the recommended verification flow is:

  1. Let devel accumulate ≥ 10 bench runs (each push triggers one).
  2. Inspect the Bencher dashboard for latency_p99_ns coefficient of
    variation across runs; target CV ≤ 5%.
  3. If CV > 5%, defer threshold activation by raising
    --threshold-max-sample-size or revisiting the bench harness for
    noise sources (warmup count, sample size, NUMA pinning).
  4. If CV ≤ 5%, the threshold is calibrated and will alert on the next
    regression > 1%.
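
Step 2's calibration check is a one-liner; a hedged sketch, where the soak readings are hypothetical:

```python
import statistics

def latency_cv(runs: list[float]) -> float:
    # Coefficient of variation of latency_p99_ns across bench runs:
    # sample stddev / mean, as a fraction.
    return statistics.stdev(runs) / statistics.mean(runs)

# Hypothetical p99 readings (ns) from 10 soak runs on devel.
soak = [549.0, 552.1, 547.8, 555.0, 550.3,
        548.9, 553.6, 551.2, 549.7, 552.8]
cv = latency_cv(soak)
assert cv <= 0.05  # calibrated: threshold activation is reasonable
```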

CHANGELOG (Task 6.5)

Three new bullets under [Unreleased] / Added covering the threshold
config, the new measures, and the K=5000 bump with rationale.

Tests

  • t_bench_common.nim: 27 tests pass (includes new HistogramTopK >= 5000
    assertion and the 3.3M log-normal p999 accuracy test).
  • t_bench_latency.nim: 7 tests pass (every bounded variant slug carries
    latency_p999_ns + latency_max_ns; tail order
    p99 <= p999 <= max enforced on the smoke shape).
  • tests/test.nim main suite: 200 tests pass, no regressions.
  • actionlint clean on .github/workflows/bench.yml.
  • Smoke run of bench_latency --bmf-out=... confirms the merged BMF
    emits all five latency measures per slug.

Follow-ups (post-merge, outside this PR)

  • Stability soak (Task 6.4) — observe ≥ 10 runs of latency_p99_ns on
    devel, document CV in a follow-up note or release announcement.
  • If latency CV ≤ 5%, the thresholds activate automatically on the
    11th run; no further workflow edits required.

References

  • Design doc: /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md §3 PR 6
  • Impl plan: /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md Track 6 (Tasks 6.1-6.5)


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request expands the benchmarking suite by adding latency_p999_ns and latency_max_ns measures and increasing the HistogramTopK constant to 5000 to maintain accuracy at production sample sizes. It also updates the CI configuration for regression gating and includes comprehensive tests for the new metrics. Review feedback suggests refining the terminology in code comments for better precision and updating internal test documentation to reflect the broader set of assertions.

Comment thread benchmarks/nim/bench_latency.nim Outdated
# Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
# Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum at
# production sample counts (3.3M samples); max is `percentile(1.0)`,
# always the top-K head.


medium

The term "top-K head" is technically misleading in the context of a min-heap (where the head is the minimum of the top-K) or a sorted sequence (where the maximum is at the tail). Using "maximum" is more precise.

  # always the top-K maximum.

Comment thread tests/t_bench_latency.nim
test "all four bounded variants emit latency_p50_ns / latency_p99_ns":
test "all four bounded variants emit latency_p50_ns / latency_p99_ns / latency_p999_ns / latency_max_ns":
# Per impl plan Track 1 Acceptance Criteria: BMF JSON contains
# latency_p50_ns and latency_p99_ns for sipsic / sipmuc / mupsic /


medium

This internal comment should be updated to reflect that the test now asserts the presence of the full latency profile, not just p50 and p99.

    # the full latency profile (p50..max) for sipsic / sipmuc / mupsic /

@elijahr force-pushed the feat/bench-rollup-pr5-interactive-charts branch from ad82876 to 4a7225a (May 1, 2026 18:16)
@elijahr force-pushed the feat/bench-rollup-pr6-latency-thresholds branch from 8675d56 to f7784ed (May 1, 2026 18:16)
@github-actions
Contributor

github-actions Bot commented May 1, 2026

🐰 Bencher Report

Branch: feat/bench-rollup-pr6-latency-thresholds
Testbed: ubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

All benchmark results (every cell was flagged ⚠️ NO THRESHOLD; per-cell "view plot" links omitted — latency measures were reported for the bounded 1p1c variants only):

| Benchmark | latency_max_ns (×1e3) | latency_p50_ns | latency_p95_ns | latency_p999_ns | latency_p99_ns | throughput_ops_ms |
| --- | --- | --- | --- | --- | --- | --- |
| lockfreequeues_mupmuc/mpmc/1p1c | 22.15 | 431.36 | 508.09 | 915.36 | 549.45 | 14,209.93 |
| lockfreequeues_mupmuc/mpmc/1p2c | — | — | — | — | — | 13,935.54 |
| lockfreequeues_mupmuc/mpmc/1p4c | — | — | — | — | — | 13,762.08 |
| lockfreequeues_mupmuc/mpmc/2p1c | — | — | — | — | — | 15,441.95 |
| lockfreequeues_mupmuc/mpmc/2p2c | — | — | — | — | — | 13,105.87 |
| lockfreequeues_mupmuc/mpmc/2p4c | — | — | — | — | — | 11,269.55 |
| lockfreequeues_mupmuc/mpmc/4p1c | — | — | — | — | — | 10,439.25 |
| lockfreequeues_mupmuc/mpmc/4p2c | — | — | — | — | — | 10,391.07 |
| lockfreequeues_mupmuc/mpmc/4p4c | — | — | — | — | — | 9,611.06 |
| lockfreequeues_mupmuc/mpmc/8p8c | — | — | — | — | — | 9,489.58 |
| lockfreequeues_mupsic/mpsc/1p1c | 28.41 | 433.18 | 475.82 | 920.73 | 504.27 | 13,944.89 |
| lockfreequeues_mupsic/mpsc/2p1c | — | — | — | — | — | 15,673.10 |
| lockfreequeues_mupsic/mpsc/4p1c | — | — | — | — | — | 8,768.87 |
| lockfreequeues_sipmuc/mpmc/1p1c | 26.09 | 428.18 | 464.55 | 799.27 | 489.18 | 14,145.84 |
| lockfreequeues_sipmuc/mpmc/1p2c | — | — | — | — | — | 14,118.85 |
| lockfreequeues_sipmuc/mpmc/1p4c | — | — | — | — | — | 13,328.74 |
| lockfreequeues_sipsic/spsc/1p1c | 32.30 | 703.73 | 757.09 | 2,495.00 | 800.36 | 8,682.44 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/1p1c | — | — | — | — | — | 9,496.42 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/1p2c | — | — | — | — | — | 5,455.25 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/2p1c | — | — | — | — | — | 9,629.60 |
| lockfreequeues_unbounded_mupmuc/mpmc_unbounded/2p2c | — | — | — | — | — | 10,251.53 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/1p1c | — | — | — | — | — | 9,486.14 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/2p1c | — | — | — | — | — | 10,135.32 |
| lockfreequeues_unbounded_mupsic/mpsc_unbounded/4p1c | — | — | — | — | — | 11,114.67 |
| lockfreequeues_unbounded_sipmuc/mpmc_unbounded/1p1c | — | — | — | — | — | 17,646.18 |
| lockfreequeues_unbounded_sipmuc/mpmc_unbounded/1p2c | — | — | — | — | — | 12,223.68 |
| lockfreequeues_unbounded_sipsic/spsc_unbounded/1p1c | — | — | — | — | — | 20,569.98 |
| moodycamel/mpmc_unbounded/1p1c | — | — | — | — | — | 21,902.30 |
| moodycamel/mpmc_unbounded/1p2c | — | — | — | — | — | 15,004.80 |
| moodycamel/mpmc_unbounded/1p4c | — | — | — | — | — | 13,514.79 |
| moodycamel/mpmc_unbounded/2p1c | — | — | — | — | — | 31,385.61 |
| moodycamel/mpmc_unbounded/2p2c | — | — | — | — | — | 12,250.68 |
| moodycamel/mpmc_unbounded/2p4c | — | — | — | — | — | 13,376.27 |
| moodycamel/mpmc_unbounded/4p1c | — | — | — | — | — | 33,493.54 |
| moodycamel/mpmc_unbounded/4p2c | — | — | — | — | — | 11,933.16 |
| moodycamel/mpmc_unbounded/4p4c | — | — | — | — | — | 14,992.85 |
| nim_channel/mpsc/1p1c | — | — | — | — | — | 2,583.71 |
| nim_channel/mpsc/2p1c | — | — | — | — | — | 4,754.77 |
| nim_channel/mpsc/4p1c | — | — | — | — | — | 4,531.03 |
| nim_channels/mpmc/1p1c | — | — | — | — | — | 2,940.04 |
| nim_channels/mpmc/1p2c | — | — | — | — | — | 919.46 |
| nim_channels/mpmc/1p4c | — | — | — | — | — | 251.18 |
| nim_channels/mpmc/2p1c | — | — | — | — | — | 4,187.10 |
| nim_channels/mpmc/2p2c | — | — | — | — | — | 1,443.78 |
| nim_channels/mpmc/2p4c | — | — | — | — | — | 603.63 |
| nim_channels/mpmc/4p1c | — | — | — | — | — | 4,763.42 |
| nim_channels/mpmc/4p2c | — | — | — | — | — | 1,947.95 |
| nim_channels/mpmc/4p4c | — | — | — | — | — | 1,645.37 |
🐰 View full continuous benchmarking report in Bencher

@elijahr force-pushed the feat/bench-rollup-pr5-interactive-charts branch from 4a7225a to af4acd7 (May 2, 2026 03:00)
@elijahr force-pushed the feat/bench-rollup-pr6-latency-thresholds branch from f7784ed to dea6f0b (May 2, 2026 03:29)
@elijahr force-pushed the feat/bench-rollup-pr5-interactive-charts branch from af4acd7 to 87f9c8f (May 2, 2026 03:52)
@elijahr force-pushed the feat/bench-rollup-pr6-latency-thresholds branch from dea6f0b to 3445be9 (May 2, 2026 03:52)
@elijahr force-pushed the feat/bench-rollup-pr5-interactive-charts branch from 87f9c8f to 1e2f5d2 (May 2, 2026 04:09)
@elijahr force-pushed the feat/bench-rollup-pr6-latency-thresholds branch from 3445be9 to dbce882 (May 2, 2026 04:09)
@elijahr
Owner Author

elijahr commented May 2, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces p999 and max latency measures, updates Bencher regression gating thresholds, and increases the HistogramTopK constant from 1000 to 5000. Feedback focuses on correcting a recurring technical inaccuracy in the documentation and code comments: the current benchmarking harness averages per-run percentiles rather than unioning all 3.3M samples into a single histogram. Consequently, the rationale for the HistogramTopK increase should be adjusted to reflect that it provides headroom for larger sample sizes rather than being a strict requirement for the existing aggregation logic.

Comment thread benchmarks/nim/bench_common.nim Outdated
Comment on lines +160 to +171
HistogramTopK* = 5000 ## exact top-K largest samples seen.
## Sized to cover p999 at production sample
## sizes (Track 6 Task 6.2): at
## `BenchLatencyMessageCount=100_000` and
## `BenchLatencyRuns=33`, `seenAll ≈ 3.3M`
## per aggregated histogram, so the p999
## tail count is ~3300. K=5000 keeps the
## p999 lookup inside the exact top-K
## stratum with headroom for stochastic
## variation around the boundary. Memory
## cost is 5000 × 8B = 40KB per histogram,
## negligible vs the 99K reservoir.


medium

The rationale for increasing HistogramTopK to 5000 appears to be based on a misunderstanding of how runLatencyHarness aggregates data. As implemented in lines 409-444, the harness calculates percentiles for each run independently and then averages those results; it does not union all samples into a single aggregated histogram.

With the default BenchLatencyMessageCount=100,000, each histogram instance only sees 100,000 samples. In this case, the p99.9 tail count is exactly 100, which was already well-covered by the previous HistogramTopK=1000. Furthermore, since HistogramTopK (1000) + HistogramReservoir (99,000) == 100,000, the histogram was already capturing every single sample exactly for the default message count.

While increasing K to 5000 is harmless, the comments should be corrected to reflect that this provides headroom for larger BenchLatencyMessageCount overrides, rather than suggesting it's required for the current 33-run aggregation.

  HistogramTopK* = 5000        ## exact top-K largest samples seen.
                               ## Sized to cover p999 even if
                               ## `BenchLatencyMessageCount` is increased
                               ## significantly (up to 5M samples). Note:
                               ## current harness averages per-run
                               ## percentiles rather than unioning.

Comment thread benchmarks/nim/bench_latency.nim Outdated
Comment on lines +147 to +150
# Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
# Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum at
# production sample counts (3.3M samples); max is `percentile(1.0)`,
# always the top-K head.


medium

This comment repeats the incorrect assumption that 3.3M samples are aggregated into a single histogram. As noted in bench_common.nim, each run uses its own histogram of 100,000 samples (by default).

  # Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
  # Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum
  # even if MessageCount is increased; max is `percentile(1.0)`,
  # always the top-K head.

Comment thread CHANGELOG.md Outdated
Comment on lines +34 to +42
- `HistogramTopK` raised from 1000 to 5000 (PR 6, Task 6.2). At
production sample counts (`BenchLatencyMessageCount=100_000` ×
`BenchLatencyRuns=33` ≈ 3.3M samples per aggregated histogram), the
p999 tail rank is ~3300; K=1000 forced the p999 lookup into the
rescaled-reservoir stratum, while K=5000 keeps it in the exact
top-K stratum with headroom. Memory cost: 5000 × 8B = 40KB
additional per histogram, negligible vs the 99K-sample reservoir.
New `t_bench_common.nim` test asserts p999 within 5% of sort
fallback on a 3.3M log-normal stream.


medium

The changelog entry should be updated to remove the incorrect rationale about 3.3M samples per aggregated histogram, as the harness averages percentiles across runs instead of unioning them.

@elijahr force-pushed the feat/bench-rollup-pr5-interactive-charts branch from 1e2f5d2 to e524cf6 (May 2, 2026 06:09)
elijahr added a commit that referenced this pull request May 2, 2026
…ntiles

Per Gemini review on PR #25.

The Track 6 Task 6.2 commentary in HistogramTopK, bench_latency.nim,
t_bench_common.nim, and CHANGELOG repeatedly claimed that
runLatencyHarness aggregates samples across runs ("seenAll ≈ 3.3M
per aggregated histogram"). It does not — design 2.5 explicitly
says "mean across runs of each percentile", so each Histogram only
ever sees BenchLatencyMessageCount samples (default 100K), not
MessageCount × Runs.

At the default 100K MessageCount the previous K=1000 was already
adequate — TopK + Reservoir already captured every sample exactly
(1000 + 99,000 ≥ 100,000), so K=1000 was never spilling p999 into
the rescaled stratum at default settings.

Reframed all four sites: the K=5000 sizing is **anticipatory** for
operators who override BenchLatencyMessageCount upward (~5M-sample
overrides), keeping p999 (tail rank = MessageCount × 0.001) inside
the exact top-K stratum at that volume. The 3.3M-sample stress test
in t_bench_common.nim is reframed as a stress shape that exercises
the K=5000 design choice, not a depiction of production volume.

No code logic changed; comments + CHANGELOG only.
@elijahr force-pushed the feat/bench-rollup-pr6-latency-thresholds branch from dbce882 to 2739f9b (May 2, 2026 06:12)
@elijahr
Owner Author

elijahr commented May 2, 2026

/gemini review

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@elijahr
Owner Author

elijahr commented May 2, 2026

/review

@qodo-code-review

qodo-code-review Bot commented May 2, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)


Remediation recommended

1. Bench tests not executed 🐞 Bug ☼ Reliability
Description
The new assertions added in t_bench_common.nim and t_bench_latency.nim are not imported by
tests/test.nim (the nimble test entrypoint) and are not run by the bench GitHub workflow, so they
won’t actually validate HistogramTopK=5000 or the new latency_p999/max measures in CI.
Code

tests/t_bench_common.nim[R241-259]

+  test "HistogramTopK is at least 5000 (anticipates MessageCount up to ~5M)":
+    check HistogramTopK >= 5000
+
+  test "p999 within 5% of sort fallback at 3.3M-sample stress shape":
+    # Stress-test the K=5000 design choice at a single-histogram volume
+    # that an operator could reach by overriding BenchLatencyMessageCount
+    # upward. At seenAll=3.3M the p999 tail count is 3300 and lies
+    # inside the K=5000 exact top-K stratum, so percentile(0.999) is
+    # read from the exact top-K heap.
+    # Tolerance is 5% per impl plan acceptance criterion. Test is
+    # heavier than the 100K case (~3.3M record() calls) but still
+    # tractable under release-mode `nimble test`.
+    let samples = generateLogNormal(3_300_000, 0xDEADBEEF'i64)
+    let exact = percentile(samples, 0.999)
+    var h = initHistogram()
+    for v in samples: h.record(v)
+    let approx = h.percentile(0.999)
+    let relErr = abs(approx - exact) / exact
+    check relErr < 0.05
Evidence
This PR adds new histogram sizing/accuracy tests and extends the latency integration tests, but the
repository’s test runner (tests/test.nim) does not import these modules, and the bench workflow’s
only explicit test execution is the adapter smoke test. Therefore these new validations are not
exercised by default CI runs.

tests/t_bench_common.nim[228-260]
tests/t_bench_latency.nim[64-124]
tests/test.nim[1-49]
.github/workflows/bench.yml[208-223]
lockfreequeues.nimble[18-32]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
New/updated tests in `tests/t_bench_common.nim` and `tests/t_bench_latency.nim` are not executed by the default test entrypoint (`tests/test.nim`) nor by any CI workflow steps, so the intended validation added in this PR is currently not enforced.

### Issue Context
- `nimble test` runs `tests/test.nim` per `lockfreequeues.nimble`.
- The bench workflow currently runs `tests/t_bench_adapters.nim` as a smoke test, but does not run the bench_common/bench_latency test modules.

### Fix Focus Areas
- tests/test.nim[1-49]
- lockfreequeues.nimble[18-32]
- .github/workflows/bench.yml[208-223]
- tests/t_bench_common.nim[228-260]
- tests/t_bench_latency.nim[64-124]

### Suggested approach
- Option A (simple): Import `t_bench_common` and `t_bench_latency` from `tests/test.nim` so they run under `nimble test`.
- Option B (preferred if you want isolation): Add a dedicated nimble task (e.g. `benchtests`) that runs these modules, and add a CI step/workflow to execute that task.
- If you wire `t_bench_common.nim` into the default suite, consider gating the 3.3M-sample stress test behind a define so regular CI remains fast while still allowing an explicit “stress” run.




@elijahr force-pushed the feat/bench-rollup-pr5-interactive-charts branch from e524cf6 to 3ec9960 (May 2, 2026 10:38)
elijahr added a commit that referenced this pull request May 2, 2026
(same commit message as above)
@elijahr force-pushed the feat/bench-rollup-pr6-latency-thresholds branch from 2739f9b to e3a5e05 (May 2, 2026 10:41)
@elijahr
Owner Author

elijahr commented May 2, 2026

/review

@qodo-code-review

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@github-actions
Contributor

github-actions Bot commented May 2, 2026

Preparing review...

@elijahr
Owner Author

elijahr commented May 2, 2026

/review

@elijahr force-pushed the feat/bench-rollup-pr5-interactive-charts branch from 57a9bc2 to ee67fe2 (May 2, 2026 11:35)
@github-actions
Contributor

github-actions Bot commented May 2, 2026

Preparing review...

elijahr added a commit that referenced this pull request May 2, 2026
(same commit message as above)
@elijahr force-pushed the feat/bench-rollup-pr6-latency-thresholds branch from b77fd58 to b98c337 (May 2, 2026 11:36)
@elijahr elijahr reopened this May 3, 2026
@elijahr
Owner Author

elijahr commented May 3, 2026

Retrigger

@elijahr elijahr closed this May 3, 2026
@elijahr elijahr reopened this May 3, 2026
@elijahr
Owner Author

elijahr commented May 3, 2026

Retrigger after concurrency-check workflow_id filter fix.

@elijahr elijahr closed this May 3, 2026
@elijahr elijahr reopened this May 3, 2026
@elijahr elijahr closed this May 3, 2026
@elijahr elijahr reopened this May 3, 2026
Contributor

@github-actions github-actions Bot left a comment


Note: verdict was APPROVE but downgraded to COMMENT because the GitHub Actions installation token cannot approve PRs.

PR adds latency-p99 Bencher regression gating, emits p999/max latency measures, bumps HistogramTopK 1000→5000 for anticipatory headroom, and introduces a bench-tests CI job. The throughput measure-name fix (throughput_ops_ms) corrects a prior no-op. Core logic is sound; one minor test gap where the harness-level smoke test verifies p50<p99<max but omits the p999≥p99 ordering invariant now that p999 is a production measure (covered in heavier integration test).

No findings.

Noteworthy

  • Correctly fixes a dormant throughput-threshold no-op: --threshold-measure throughput never matched the actual throughput_ops_ms key emitted by the bench binaries. After this PR, throughput regression gating actually fires.
  • HistogramTopK bump commentary is thorough: clearly documents why K=5000 is anticipatory (for larger MessageCount overrides, not for the default), and the 3.3M-sample stress test directly validates the design choice under operator-driven configuration.

Verdict: APPROVE.

Commands
  • Comment /ai-review to request a re-review of the latest changes.
  • Reply to a finding with won't fix, by design, or not a bug to decline it.
  • Reply with instead, ... to propose an alternative fix.

@elijahr elijahr closed this May 3, 2026
@elijahr elijahr reopened this May 3, 2026
@axiomantic-momus

axiomantic-momus Bot commented May 3, 2026

Momus review posted — verdict APPROVE, 0 findings



@axiomantic-momus axiomantic-momus Bot left a comment


Note: verdict was APPROVE but downgraded to COMMENT because the GitHub Actions installation token cannot approve PRs.

This PR (Track 6, final) adds latency p999/max measures to the BMF emission, bumps HistogramTopK from 1000→5000 with supporting stress tests, configures per-measure Bencher thresholds (latency_p99_ns upper-boundary + throughput_ops_ms lower-boundary), fixes a prior measure-name mismatch, and adds a bench-tests CI job.

Severity tally: 1 Low.

Low

  • BOT-A2 (.github/workflows/bench.yml:105): bench-tests job uses floating dependency refs while bench job pins to version tags

Noteworthy

  • Fixes a latent CI bug: prior --threshold-measure throughput was a no-op because the actual measure key is throughput_ops_ms — the threshold now correctly targets the right measure name.
  • The 3.3M-sample stress test is gated behind -d:BenchCommonStress so the CI bench-tests stay fast while the design choice is still validation-gated.

Verdict: APPROVE.

Commands
  • Comment /ai-review or mention @axiomantic-momus[bot] to request a re-review of the latest changes.
  • Reply to a finding with won't fix, by design, or not a bug to decline it.
  • Reply with instead, ... to propose an alternative fix.

Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.

Comment thread .github/workflows/bench.yml Outdated
Comment on lines +105 to +106
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git


BOT-A2 — Low (standards)
bench-tests job uses floating dependency refs while bench job pins to version tags
The new bench-tests job clones nim-debra and nim-typestates from --branch main (a floating ref), while the existing bench job (lines 172-173) pins these same dependencies to --branch v0.6.0 and --branch v0.7.0. This is a minor inconsistency: the bench job correctly pins for reproducible benchmark data, while bench-tests is a validation job that benefits from catching upstream breaks early. The existing bench-comparison.yml also uses --branch main for these deps, establishing precedent that floating refs are acceptable in non-benchmarking CI jobs.

Suggested change
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
For consistency, consider pinning:
git clone --depth 1 --branch v0.6.0 https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch v0.7.0 https://github.com/elijahr/nim-typestates.git
Or add a comment noting the intentional floating-ref choice for early breakage detection.

@elijahr
Owner Author

elijahr commented May 4, 2026

/ai-review



@axiomantic-momus axiomantic-momus Bot left a comment


This PR (Track 6/final) adds latency p99 Bencher regression gating, p999/max BMF measures, HistogramTopK bump to 5000, a benchtests nimble task and CI job, and a Momus review workflow. Overall the changes are well-structured with good test additions. Three non-blocking issues identified: a stale contradictory comment, a CI test job with an unnecessarily restrictive fork-PR guard, and inconsistency in dependency pinning between CI jobs.

Severity tally: 2 Lows.

Low

  • BOT-B1 (.github/workflows/bench.yml:84): bench-tests CI job's fork-PR guard is unnecessarily restrictive
  • BOT-B3 (.github/workflows/bench.yml:101): bench-tests CI job clones deps from main instead of tagged releases

Noteworthy

  • Corrects a benign but longstanding measure-name mismatch in bench.yml: --threshold-measure throughput (nonexistent) → throughput_ops_ms (actual measure key), so the throughput regression threshold will now actually fire.

Verdict: APPROVE.


Comment thread .github/workflows/bench.yml Outdated
Comment on lines +84 to +86
github.event.pull_request.head.repo.full_name == github.repository)
timeout-minutes: 10


BOT-B1 — Low (tests)
bench-tests CI job's fork-PR guard is unnecessarily restrictive
The new bench-tests job copies its if: guard from the bench job, which checks head.repo.full_name == github.repository to prevent fork PRs from running (bench needs BENCHER_API_TOKEN unavailable to forks). bench-tests uses no secrets — it only clones sibling public repos, compiles, and runs tests. The fork-PR guard therefore unnecessarily prevents external PR contributors from getting bench harness test coverage. The guard should be simplified to only check github.actor != 'github-actions[bot]'.

Comment on lines +101 to +107
- name: Clone and install sibling Nim deps (nim-debra, nim-typestates)
run: |
set -e
cd ..
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
(cd nim-typestates && nimble install -y)

BOT-B3 — Low (quality)
bench-tests CI job clones deps from main instead of tagged releases
The bench-tests job clones nim-debra and nim-typestates from --branch main, while the sibling bench job (line ~155) pins to --branch v0.6.0 and --branch v0.7.0 respectively. Using untagged main makes the CI non-deterministic: a breaking change pushed to a sibling repo's main could cause spurious test failures in lockfreequeues PRs.

@elijahr elijahr force-pushed the feat/bench-rollup-pr5-interactive-charts branch from fe063a2 to b01168c on May 5, 2026 03:52
elijahr added 6 commits May 4, 2026 22:53
Track 6 (PR 6, final track in the bench-rollup feature stack) adds
latency-regression gating in Bencher and the supporting BMF measures.

bench_latency.nim now emits the full p50/p95/p99/p999/max latency tuple
per bounded variant slug. Each call to runLatencyHarness adds two new
measures alongside the existing trio:

  em.addMeasure(slug, "latency_p999_ns", metrics.p999_ns)
  em.addMeasure(slug, "latency_max_ns",  metrics.max_ns)

The merged BMF on each bench job thus carries a complete tail-latency
profile per (library, topology) cell, available to the Bencher dashboard
and any downstream comparison chart.

bench_common.nim raises HistogramTopK from 1000 to 5000. At production
sample counts (BenchLatencyMessageCount=100_000 x BenchLatencyRuns=33,
seenAll ~ 3.3M per aggregated histogram) the p999 tail rank is ~3300;
K=1000 forced the p999 lookup into the rescaled-reservoir stratum,
while K=5000 keeps it in the exact top-K stratum with headroom for
stochastic boundary variation. Memory cost is 5000 x 8B = 40KB
additional per histogram, negligible relative to the 99K-sample
reservoir.

bench.yml's base-branch tracking step (push to devel/main) configures
per-measure thresholds in a single bencher run invocation: latency_p99_ns
with --threshold-upper-boundary 0.99 (regression = latency increase) and
throughput_ops_ms with --threshold-lower-boundary 0.99 (regression =
throughput drop). Both share --threshold-test t_test
--threshold-max-sample-size 64, terminated by --thresholds-reset so only
the explicitly-listed thresholds remain active.

This also corrects a prior measure-name mismatch: the earlier
--threshold-measure throughput never matched any emitted measure (the
actual key is throughput_ops_ms), so the previous throughput threshold
was a no-op.
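
The no-op can be seen with a toy lookup (measure names are from this commit; the matching logic here is a hypothetical simplification of how a threshold targets an emitted measure):

```python
# Measures actually emitted by the harness (names from this PR).
emitted = {"throughput_ops_ms": 5123.4, "latency_p99_ns": 1750.0}

def threshold_applies(threshold_measure: str) -> bool:
    # Simplified stand-in: a threshold only fires if its measure key
    # matches an emitted measure name exactly.
    return threshold_measure in emitted

print(threshold_applies("throughput"))         # old config: never matches
print(threshold_applies("throughput_ops_ms"))  # corrected config: matches
```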

Threshold activation requires >= 10 prior runs accumulated in Bencher
to calibrate the t-test baseline (Task 6.4 stability soak gate).
Bencher does not emit alerts on measures with insufficient sample
history, so the configuration is dormant until the soak completes
post-merge.

Tests:
- t_bench_common.nim: HistogramTopK >= 5000 assertion + p999 within 5%
  of sort fallback on 3.3M log-normal samples (27 tests, all passing).
- t_bench_latency.nim: every bounded variant slug carries
  latency_p999_ns + latency_max_ns with monotone tail ordering p99 <=
  p999 <= max (7 tests, all passing).
- tests/test.nim main suite unaffected (200 tests, all passing).

Refs design doc /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md
section 3 PR 6 and impl plan
/Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md
Track 6 (Tasks 6.1-6.5).
…ntiles

Per Gemini review on PR #25.

The Track 6 Task 6.2 commentary in HistogramTopK, bench_latency.nim,
t_bench_common.nim, and CHANGELOG repeatedly claimed that
runLatencyHarness aggregates samples across runs ("seenAll ≈ 3.3M
per aggregated histogram"). It does not — design 2.5 explicitly
says "mean across runs of each percentile", so each Histogram only
ever sees BenchLatencyMessageCount samples (default 100K), not
MessageCount × Runs.

At the default 100K MessageCount the previous K=1000 was already
adequate — TopK + Reservoir already captured every sample exactly
(1000 + 99,000 ≥ 100,000), so K=1000 was never spilling p999 into
the rescaled stratum at default settings.

Reframed all four sites: the K=5000 sizing is **anticipatory** for
operators who override BenchLatencyMessageCount upward (~5M-sample
overrides), keeping p999 (tail rank = MessageCount × 0.001) inside
the exact top-K stratum at that volume. The 3.3M-sample stress test
in t_bench_common.nim is reframed as a stress shape that exercises
the K=5000 design choice, not a depiction of production volume.

No code logic changed; comments + CHANGELOG only.
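
The corrected sizing argument can be sanity-checked with quick arithmetic (the 0.001 tail-rank formula and the ~5M-sample override figure are from the commit text; the helper below is illustrative):

```python
HISTOGRAM_TOP_K = 5000  # value after this PR (was 1000)

def p999_tail_rank(message_count: int) -> int:
    # Rank of the p999 sample counted from the top of the distribution,
    # per the commit's "tail rank = MessageCount x 0.001" formula.
    return int(message_count * 0.001)

# Default run: rank 100, deep inside the exact top-K stratum.
print(p999_tail_rank(100_000))
# Anticipated ~5M-sample override: rank 5000 still fits in K=5000 exactly,
# whereas the old K=1000 would have spilled it into the rescaled reservoir.
print(p999_tail_rank(5_000_000))
```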
Code review on PR 6 noted that the new HistogramTopK assertion in
tests/t_bench_common.nim and the latency CLI assertions in
tests/t_bench_latency.nim were not imported by tests/test.nim and not
run by any CI step, so the validation added by this PR was not actually
enforced in CI.

Wire up via nimble's separation-of-concerns pattern (Option B from the
review):

- lockfreequeues.nimble gains a 'benchtests' task that compiles and
  runs t_bench_common, t_bench_latency, and t_bench_adapters under one
  Nim invocation each. The bench harness tests live outside srcDir;
  importing them from tests/test.nim would force every entry of the
  8-invocation MM/sanitizer test matrix to compile bench_common's
  threading machinery. A dedicated task keeps the regular nimble test
  matrix lean while still enforcing the bench-side assertions.

- A second 'benchteststress' task enables the gated 3.3M-sample
  HistogramTopK p999 stress shape (slow, ~10-15s release-mode), so
  default benchtests stays under a second.

- bench.yml gains a 'bench-tests' job (ubuntu-latest) that runs
  'nimble benchtests' once per workflow trigger, with the same
  fork-PR / bot-actor guards as the bench matrix jobs.

The 3.3M-sample stress test in t_bench_common is now gated behind
-d:BenchCommonStress so it only runs under benchteststress, not the
default benchtests run.
The previous wording called `percentile(1.0)` "the top-K head", which
is misleading for a min-heap implementation: `topKHeap[0]` is the
SMALLEST element in the top-K (the admission threshold for new
outliers), not the largest. The implementation snapshots the heap
into a sequence, sorts it, and returns `topk[^1]` — the maximum.

Comment-only change; behaviour unchanged.
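
A hypothetical Python sketch of the min-heap top-K behaviour described above (not the Nim implementation):

```python
import heapq

def top_k(samples, k):
    # Min-heap of the k largest samples: heap[0] is the SMALLEST member,
    # i.e. the admission threshold for new outliers, not the maximum.
    heap = []
    for x in samples:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)
    return heap

heap = top_k(range(100), k=5)
print(heap[0])           # smallest of the top-5, the admission threshold
print(sorted(heap)[-1])  # percentile(1.0): snapshot, sort, take the maximum
```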
Replaces the Qodo PR-Agent reusable workflow with Momus
(axiomantic/momus), an open-source, from-scratch PR review bot.
…KEY -> LLM_API_KEY, pass MOMUS_APP_* secrets, set trigger_mention
@elijahr elijahr force-pushed the feat/bench-rollup-pr6-latency-thresholds branch from 3eee5ed to 462db8b on May 5, 2026 03:54

@axiomantic-momus axiomantic-momus Bot left a comment


PR 25 adds latency p999/max measures to BMF emission, bumps HistogramTopK from 1000 to 5000, configures per-measure Bencher thresholds (correcting a prior measure-name mismatch that made the throughput threshold a no-op), and adds corresponding tests. The logic is sound and well-documented. One stale inline comment contradicts the new behavior.

No findings.

Noteworthy

  • The Bencher threshold configuration correctly switches --threshold-measure throughput (which never matched any emitted key) to --threshold-measure throughput_ops_ms, making the throughput regression gate functional for the first time.
  • The K=5000 sizing is well-justified with explicit headroom math, and the compile-time HistogramTopK >= 5000 assertion in CI prevents silent regression of the constant.

Verdict: APPROVE.


@elijahr
Owner Author

elijahr commented May 5, 2026

Consolidated into #19 — fast-forwarded feat/bench-rollup-pr0-bench-common to this branch's tip; all commits preserved in #19. Closing this stacked PR as redundant.

@elijahr elijahr closed this May 5, 2026
@elijahr elijahr deleted the feat/bench-rollup-pr6-latency-thresholds branch May 5, 2026 04:19