feat(bench): latency thresholds + p999/max measures (Track 6, final) #25
Conversation
Code Review
This pull request expands the benchmarking suite by adding latency_p999_ns and latency_max_ns measures and increasing the HistogramTopK constant to 5000 to maintain accuracy at production sample sizes. It also updates the CI configuration for regression gating and includes comprehensive tests for the new metrics. Review feedback suggests refining the terminology in code comments for better precision and updating internal test documentation to reflect the broader set of assertions.
```diff
+  # Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
+  # Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum at
+  # production sample counts (3.3M samples); max is `percentile(1.0)`,
+  # always the top-K head.
-  test "all four bounded variants emit latency_p50_ns / latency_p99_ns":
+  test "all four bounded variants emit latency_p50_ns / latency_p99_ns / latency_p999_ns / latency_max_ns":
     # Per impl plan Track 1 Acceptance Criteria: BMF JSON contains
     # latency_p50_ns and latency_p99_ns for sipsic / sipmuc / mupsic /
```
/gemini review
Code Review
This pull request introduces p999 and max latency measures, updates Bencher regression gating thresholds, and increases the HistogramTopK constant from 1000 to 5000. Feedback focuses on correcting a recurring technical inaccuracy in the documentation and code comments: the current benchmarking harness averages per-run percentiles rather than unioning all 3.3M samples into a single histogram. Consequently, the rationale for the HistogramTopK increase should be adjusted to reflect that it provides headroom for larger sample sizes rather than being a strict requirement for the existing aggregation logic.
```nim
HistogramTopK* = 5000 ## exact top-K largest samples seen.
                      ## Sized to cover p999 at production sample
                      ## sizes (Track 6 Task 6.2): at
                      ## `BenchLatencyMessageCount=100_000` and
                      ## `BenchLatencyRuns=33`, `seenAll ≈ 3.3M`
                      ## per aggregated histogram, so the p999
                      ## tail count is ~3300. K=5000 keeps the
                      ## p999 lookup inside the exact top-K
                      ## stratum with headroom for stochastic
                      ## variation around the boundary. Memory
                      ## cost is 5000 × 8B = 40KB per histogram,
                      ## negligible vs the 99K reservoir.
```
The rationale for increasing HistogramTopK to 5000 appears to be based on a misunderstanding of how runLatencyHarness aggregates data. As implemented in lines 409-444, the harness calculates percentiles for each run independently and then averages those results; it does not union all samples into a single aggregated histogram.
With the default BenchLatencyMessageCount=100,000, each histogram instance only sees 100,000 samples. In this case, the p99.9 tail count is exactly 100, which was already well-covered by the previous HistogramTopK=1000. Furthermore, since HistogramTopK (1000) + HistogramReservoir (99,000) == 100,000, the histogram was already capturing every single sample exactly for the default message count.
While increasing K to 5000 is harmless, the comments should be corrected to reflect that this provides headroom for larger BenchLatencyMessageCount overrides, rather than suggesting it's required for the current 33-run aggregation.
```nim
HistogramTopK* = 5000 ## exact top-K largest samples seen.
                      ## Sized to cover p999 even if
                      ## `BenchLatencyMessageCount` is increased
                      ## significantly (up to 5M samples). Note:
                      ## current harness averages per-run
                      ## percentiles rather than unioning.
```
```nim
# Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
# Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum at
# production sample counts (3.3M samples); max is `percentile(1.0)`,
# always the top-K head.
```
This comment repeats the incorrect assumption that 3.3M samples are aggregated into a single histogram. As noted in bench_common.nim, each run uses its own histogram of 100,000 samples (by default).
```nim
# Track 6 Task 6.1: emit p999 + max alongside the p50/p95/p99 trio.
# Histogram K=5000 (Task 6.2) covers p999 in the exact top-K stratum
# even if MessageCount is increased; max is `percentile(1.0)`,
# always the top-K head.
```
```
- `HistogramTopK` raised from 1000 to 5000 (PR 6, Task 6.2). At
  production sample counts (`BenchLatencyMessageCount=100_000` ×
  `BenchLatencyRuns=33` ≈ 3.3M samples per aggregated histogram), the
  p999 tail rank is ~3300; K=1000 forced the p999 lookup into the
  rescaled-reservoir stratum, while K=5000 keeps it in the exact
  top-K stratum with headroom. Memory cost: 5000 × 8B = 40KB
  additional per histogram, negligible vs the 99K-sample reservoir.
  New `t_bench_common.nim` test asserts p999 within 5% of sort
  fallback on a 3.3M log-normal stream.
```
…ntiles

Per Gemini review on PR #25. The Track 6 Task 6.2 commentary in HistogramTopK, bench_latency.nim, t_bench_common.nim, and CHANGELOG repeatedly claimed that runLatencyHarness aggregates samples across runs ("seenAll ≈ 3.3M per aggregated histogram"). It does not — design 2.5 explicitly says "mean across runs of each percentile", so each Histogram only ever sees BenchLatencyMessageCount samples (default 100K), not MessageCount × Runs.

At the default 100K MessageCount the previous K=1000 was already adequate — TopK + Reservoir already captured every sample exactly (1000 + 99,000 ≥ 100,000), so K=1000 was never spilling p999 into the rescaled stratum at default settings.

Reframed all four sites: the K=5000 sizing is **anticipatory** for operators who override BenchLatencyMessageCount upward (~5M-sample overrides), keeping p999 (tail rank = MessageCount × 0.001) inside the exact top-K stratum at that volume. The 3.3M-sample stress test in t_bench_common.nim is reframed as a stress shape that exercises the K=5000 design choice, not a depiction of production volume.

No code logic changed; comments + CHANGELOG only.
/gemini review
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
/review
Code Review by Qodo
1. Bench tests not executed
/review
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.
/review
Retrigger

Retrigger after concurrency-check workflow_id filter fix.
Note: verdict was APPROVE but downgraded to COMMENT because the GitHub Actions installation token cannot approve PRs.
PR adds latency-p99 Bencher regression gating, emits p999/max latency measures, bumps HistogramTopK 1000→5000 for anticipatory headroom, and introduces a bench-tests CI job. The throughput measure-name fix (throughput_ops_ms) corrects a prior no-op. Core logic is sound; one minor test gap: the harness-level smoke test verifies p50 < p99 < max but omits the p999 ≥ p99 ordering invariant now that p999 is a production measure (covered in the heavier integration test).
No findings.
Noteworthy
- Correctly fixes a dormant throughput-threshold no-op: `--threshold-measure throughput` never matched the actual `throughput_ops_ms` key emitted by the bench binaries. After this PR, throughput regression gating actually fires.
- HistogramTopK bump commentary is thorough: clearly documents why K=5000 is anticipatory (for larger MessageCount overrides, not for the default), and the 3.3M-sample stress test directly validates the design choice under operator-driven configuration.
Verdict: APPROVE.
Commands
- Comment `/ai-review` to request a re-review of the latest changes.
- Reply to a finding with `won't fix`, `by design`, or `not a bug` to decline it.
- Reply with `instead, ...` to propose an alternative fix.
✅ Momus review posted — verdict APPROVE, 0 findings
Note: verdict was APPROVE but downgraded to COMMENT because the GitHub Actions installation token cannot approve PRs.
This PR (Track 6, final) adds latency p999/max measures to the BMF emission, bumps HistogramTopK from 1000→5000 with supporting stress tests, configures per-measure Bencher thresholds (latency_p99_ns upper-boundary + throughput_ops_ms lower-boundary), fixes a prior measure-name mismatch, and adds a bench-tests CI job.
Severity tally: 1 Low.
Low
- BOT-A2 (`.github/workflows/bench.yml:105`): bench-tests job uses floating dependency refs while the bench job pins to version tags
Noteworthy
- Fixes a latent CI bug: the prior `--threshold-measure throughput` was a no-op because the actual measure key is `throughput_ops_ms` — the threshold now correctly targets the right measure name.
- The 3.3M-sample stress test is gated behind `-d:BenchCommonStress` so the CI bench-tests stay fast while the design choice is still validation-gated.
Verdict: APPROVE.
Powered by Momus running deepseek/deepseek-v4-pro via openrouter.ai.
```sh
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
```
BOT-A2 — Low (standards)
bench-tests job uses floating dependency refs while bench job pins to version tags
The new bench-tests job clones nim-debra and nim-typestates from --branch main (a floating ref), while the existing bench job (lines 172-173) pins these same dependencies to --branch v0.6.0 and --branch v0.7.0. This is a minor inconsistency: the bench job correctly pins for reproducible benchmark data, while bench-tests is a validation job that benefits from catching upstream breaks early. The existing bench-comparison.yml also uses --branch main for these deps, establishing precedent that floating refs are acceptable in non-benchmarking CI jobs.
```sh
git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
```

For consistency, consider pinning:

```sh
git clone --depth 1 --branch v0.6.0 https://github.com/elijahr/nim-debra.git
git clone --depth 1 --branch v0.7.0 https://github.com/elijahr/nim-typestates.git
```

Or add a comment noting the intentional floating-ref choice for early breakage detection.
/ai-review

2 similar comments

/ai-review

/ai-review
This PR (Track 6/final) adds latency p99 Bencher regression gating, p999/max BMF measures, HistogramTopK bump to 5000, a benchtests nimble task and CI job, and a Momus review workflow. Overall the changes are well-structured with good test additions. Three non-blocking issues identified: a stale contradictory comment, a CI test job with an unnecessarily restrictive fork-PR guard, and inconsistency in dependency pinning between CI jobs.
Severity tally: 2 Lows.
Low
- BOT-B1 (`.github/workflows/bench.yml:84`): bench-tests CI job's fork-PR guard is unnecessarily restrictive
- BOT-B3 (`.github/workflows/bench.yml:101`): bench-tests CI job clones deps from main instead of tagged releases
Noteworthy
- Corrects a benign but longstanding measure-name mismatch in bench.yml: `--threshold-measure throughput` (nonexistent) → `throughput_ops_ms` (actual measure key), so the throughput regression threshold will now actually fire.
Verdict: APPROVE.
```yaml
github.event.pull_request.head.repo.full_name == github.repository)
timeout-minutes: 10
```
BOT-B1 — Low (tests)
bench-tests CI job's fork-PR guard is unnecessarily restrictive
The new bench-tests job copies its `if:` guard from the bench job, which checks `head.repo.full_name == github.repository` to prevent fork PRs from running (bench needs `BENCHER_API_TOKEN`, which is unavailable to forks). bench-tests uses no secrets — it only clones sibling public repos, compiles, and runs tests. The fork-PR guard therefore unnecessarily prevents external PR contributors from getting bench harness test coverage. The guard should be simplified to only check `github.actor != 'github-actions[bot]'`.
```yaml
- name: Clone and install sibling Nim deps (nim-debra, nim-typestates)
  run: |
    set -e
    cd ..
    git clone --depth 1 --branch main https://github.com/elijahr/nim-debra.git
    git clone --depth 1 --branch main https://github.com/elijahr/nim-typestates.git
    (cd nim-typestates && nimble install -y)
```
BOT-B3 — Low (quality)
bench-tests CI job clones deps from main instead of tagged releases
The bench-tests job clones nim-debra and nim-typestates from --branch main, while the sibling bench job (line ~155) pins to --branch v0.6.0 and --branch v0.7.0 respectively. Using untagged main makes the CI non-deterministic: a breaking change pushed to a sibling repo's main could cause spurious test failures in lockfreequeues PRs.
Track 6 (PR 6, final track in the bench-rollup feature stack) adds latency-regression gating in Bencher and the supporting BMF measures.

bench_latency.nim now emits the full p50/p95/p99/p999/max latency tuple per bounded variant slug. Each call to runLatencyHarness adds two new measures alongside the existing trio:

```nim
em.addMeasure(slug, "latency_p999_ns", metrics.p999_ns)
em.addMeasure(slug, "latency_max_ns", metrics.max_ns)
```

The merged BMF on each bench job thus carries a complete tail-latency profile per (library, topology) cell, available to the Bencher dashboard and any downstream comparison chart.

bench_common.nim raises HistogramTopK from 1000 to 5000. At production sample counts (BenchLatencyMessageCount=100_000 × BenchLatencyRuns=33, seenAll ≈ 3.3M per aggregated histogram) the p999 tail rank is ~3300; K=1000 forced the p999 lookup into the rescaled-reservoir stratum, while K=5000 keeps it in the exact top-K stratum with headroom for stochastic boundary variation. Memory cost is 5000 × 8B = 40KB additional per histogram, negligible relative to the 99K-sample reservoir.

bench.yml's base-branch tracking step (push to devel/main) configures per-measure thresholds in a single bencher run invocation: latency_p99_ns with --threshold-upper-boundary 0.99 (regression = latency increase) and throughput_ops_ms with --threshold-lower-boundary 0.99 (regression = throughput drop). Both share --threshold-test t_test --threshold-max-sample-size 64, terminated by --thresholds-reset so only the explicitly-listed thresholds remain active. This also corrects a prior measure-name mismatch: the earlier --threshold-measure throughput never matched any emitted measure (the actual key is throughput_ops_ms), so the previous throughput threshold was a no-op.

Threshold activation requires >= 10 prior runs accumulated in Bencher to calibrate the t-test baseline (Task 6.4 stability soak gate). Bencher does not emit alerts on measures with insufficient sample history, so the configuration is dormant until the soak completes post-merge.

Tests:
- t_bench_common.nim: HistogramTopK >= 5000 assertion + p999 within 5% of sort fallback on 3.3M log-normal samples (27 tests, all passing).
- t_bench_latency.nim: every bounded variant slug carries latency_p999_ns + latency_max_ns with monotone tail ordering p99 <= p999 <= max (7 tests, all passing).
- tests/test.nim main suite unaffected (200 tests, all passing).

Refs design doc /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md section 3 PR 6 and impl plan /Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md Track 6 (Tasks 6.1-6.5).
Code review on PR 6 noted that the new HistogramTopK assertion in tests/t_bench_common.nim and the latency CLI assertions in tests/t_bench_latency.nim were not imported by tests/test.nim and not run by any CI step, so the validation added by this PR was not actually enforced in CI. Wire up via nimble's separation-of-concerns pattern (Option B from the review):

- lockfreequeues.nimble gains a 'benchtests' task that compiles and runs t_bench_common, t_bench_latency, and t_bench_adapters under one Nim invocation each; a hedged sketch of the task shape follows this list. The bench harness tests live outside srcDir; importing them from tests/test.nim would force every entry of the 8-invocation MM/sanitizer test matrix to compile bench_common's threading machinery. A dedicated task keeps the regular nimble test matrix lean while still enforcing the bench-side assertions.
- A second 'benchteststress' task enables the gated 3.3M-sample HistogramTopK p999 stress shape (slow, ~10-15s release-mode), so default benchtests stays under a second.
- bench.yml gains a 'bench-tests' job (ubuntu-latest) that runs 'nimble benchtests' once per workflow trigger, with the same fork-PR / bot-actor guards as the bench matrix jobs.

The 3.3M-sample stress test in t_bench_common is now gated behind -d:BenchCommonStress so it only runs under benchteststress, not the default benchtests run.
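A minimal sketch of the shape such tasks could take, assuming the three test files live under `tests/` and need no flags beyond the ones the commit names; the actual task bodies in lockfreequeues.nimble may differ:

```nim
# Hypothetical lockfreequeues.nimble excerpt -- task names and the
# BenchCommonStress define come from the commit message; paths and
# compiler flags here are assumptions.
task benchtests, "compile and run the bench harness tests":
  exec "nim c -r tests/t_bench_common.nim"
  exec "nim c -r tests/t_bench_latency.nim"
  exec "nim c -r tests/t_bench_adapters.nim"

task benchteststress, "bench harness tests incl. 3.3M-sample stress shape":
  exec "nim c -r -d:release -d:BenchCommonStress tests/t_bench_common.nim"
```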
The previous wording called `percentile(1.0)` "the top-K head", which is misleading for a min-heap implementation: `topKHeap[0]` is the SMALLEST element in the top-K (the admission threshold for new outliers), not the largest. The implementation snapshots the heap into a sequence, sorts it, and returns `topk[^1]` — the maximum. Comment-only change; behaviour unchanged.
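That behaviour is easy to see in a self-contained Nim sketch (illustrative only; the real Histogram in bench_common.nim differs in detail): the heap root is merely the admission threshold, and the maximum only surfaces after the snapshot-and-sort step.

```nim
import std/[algorithm, heapqueue]

const K = 5  # illustrative; the real constant is HistogramTopK = 5000

proc admit(h: var HeapQueue[int], sample: int) =
  ## Keep the K largest samples in a min-heap: h[0] is the SMALLEST
  ## of the top-K, i.e. the admission threshold for new outliers.
  if h.len < K:
    h.push(sample)
  elif sample > h[0]:
    discard h.pop()
    h.push(sample)

var topKHeap = initHeapQueue[int]()
for s in [3, 9, 1, 7, 5, 8, 2, 6]:
  topKHeap.admit(s)

# percentile(1.0): snapshot the heap into a sequence, sort, return topk[^1].
var topk: seq[int]
while topKHeap.len > 0:
  topk.add topKHeap.pop()
topk.sort()
echo topk[^1]  # 9 -- the maximum; the heap's head would have been 5
```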
Replaces the Qodo PR-Agent reusable workflow with Momus (axiomantic/momus), an open-source from-scratch PR review bot.
…KEY -> LLM_API_KEY, pass MOMUS_APP_* secrets, set trigger_mention
PR 25 adds latency p999/max measures to BMF emission, bumps HistogramTopK from 1000 to 5000, configures per-measure Bencher thresholds (correcting a prior measure-name mismatch that made the throughput threshold a no-op), and adds corresponding tests. The logic is sound and well-documented. One stale inline comment contradicts the new behavior.
No findings.
Noteworthy
- The Bencher threshold configuration correctly switches `--threshold-measure throughput` (which never matched any emitted key) to `--threshold-measure throughput_ops_ms`, making the throughput regression gate functional for the first time.
- The K=5000 sizing is well-justified with explicit headroom math, and the compile-time `HistogramTopK >= 5000` assertion in CI prevents silent regression of the constant.
Verdict: APPROVE.
Final PR in the bench-rollup feature stack. Depends on PR #19, PR #20, PR #21, PR #22, PR #23, PR #24.
Track 6 (PR 6) of the bench-rollup feature: adds latency-regression gating in Bencher and the supporting BMF measures. After this PR merges down through the stack, the benchmark CI on `devel` produces a complete tail-latency profile per (library, topology) cell and flags p99 latency regressions on each push.
What changed
`bench_latency.nim` — emit `latency_p999_ns` + `latency_max_ns` (Task 6.1)

Each bounded variant slug now carries the full p50 / p95 / p99 / p999 / max latency tuple in the merged BMF. `runLatencyHarness` adds two new measures alongside the existing trio:

```nim
em.addMeasure(slug, "latency_p999_ns", metrics.p999_ns)
em.addMeasure(slug, "latency_max_ns", metrics.max_ns)
```

The merged BMF on each bench job thus carries a complete tail-latency profile per (library, topology) cell, available to the Bencher dashboard and any downstream comparison chart.
`bench_common.nim` — `HistogramTopK` 1000 → 5000 (Task 6.2)

At production sample counts (`BenchLatencyMessageCount=100_000` × `BenchLatencyRuns=33`, `seenAll ≈ 3.3M` per aggregated histogram), the p999 tail rank is ~3300; K=1000 forced the p999 lookup into the rescaled-reservoir stratum, while K=5000 keeps it in the exact top-K stratum with headroom for stochastic boundary variation. Memory cost: 5000 × 8B = 40KB additional per histogram, negligible relative to the 99K-sample reservoir.

A new `t_bench_common.nim` test asserts p999 within 5% of sort fallback on a 3.3M log-normal stream, directly verifying the K=5000 design choice.
`bench.yml` — per-measure threshold args (Task 6.3)

The base-branch tracking step (push to `devel`/`main`) configures per-measure thresholds in a single `bencher run` invocation:

- `latency_p99_ns` with `--threshold-upper-boundary 0.99` (regression = latency increase past upper bound)
- `throughput_ops_ms` with `--threshold-lower-boundary 0.99` (regression = throughput drop below lower bound)

Both share `--threshold-test t_test --threshold-max-sample-size 64`, terminated by `--thresholds-reset` so only the explicitly-listed thresholds remain active; a sketch of the assembled invocation follows below.

This also corrects a prior measure-name mismatch: the earlier `--threshold-measure throughput` never matched any emitted measure (the actual key is `throughput_ops_ms`), so the previous throughput threshold was a no-op. After this PR, the throughput threshold actually fires.
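Assembled, those flags plausibly form an invocation shaped like the following (the threshold flags are the ones named in this section; `--project`, `--branch`, `--testbed`, and `--file` stand in for whatever the real bench.yml step passes):

```sh
bencher run \
  --project <project> --branch devel --testbed <testbed> \
  --threshold-measure latency_p99_ns \
  --threshold-test t_test \
  --threshold-max-sample-size 64 \
  --threshold-upper-boundary 0.99 \
  --threshold-measure throughput_ops_ms \
  --threshold-test t_test \
  --threshold-max-sample-size 64 \
  --threshold-lower-boundary 0.99 \
  --thresholds-reset \
  --file results.json
```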
Stability soak gate (Task 6.4)
Threshold activation requires ≥ 10 prior runs accumulated in Bencher
to calibrate the t-test baseline. Bencher will not emit alerts on
measures with insufficient sample history, so the configuration is
dormant until the soak completes post-merge. The configuration is
in place so activation is purely a function of accumulated run count,
not workflow edits.
The soak is a runtime gate, not a code gate — it cannot be exercised
inside this PR. After merge, the recommended verification flow is:
1. Let `devel` accumulate ≥ 10 bench runs (each push triggers one).
2. Check the `latency_p99_ns` coefficient of variation across runs; target CV ≤ 5%.
3. If CV misses the target, consider tuning `--threshold-max-sample-size` or revisiting the bench harness for noise sources (warmup count, sample size, NUMA pinning).
4. Verify an alert fires on a synthetic regression > 1%.
CHANGELOG (Task 6.5)
Three new bullets under `[Unreleased] / Added` covering the threshold config, the new measures, and the K=5000 bump with rationale.
Tests
- `t_bench_common.nim`: 27 tests pass (includes the new `HistogramTopK >= 5000` assertion and the 3.3M log-normal p999 accuracy test).
- `t_bench_latency.nim`: 7 tests pass (every bounded variant slug carries `latency_p999_ns` + `latency_max_ns`; tail order `p99 <= p999 <= max` enforced on the smoke shape).
- `tests/test.nim` main suite: 200 tests pass, no regressions.
- `actionlint` clean on `.github/workflows/bench.yml`.
- `bench_latency --bmf-out=...` confirms the merged BMF emits all five latency measures per slug.
Follow-ups (post-merge, outside this PR)
- Monitor the CV of `latency_p99_ns` on `devel`; document CV in a follow-up note or release announcement.
- Thresholds go live automatically once the 11th run accumulates; no further workflow edits required.
References
- `/Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-design.md` §3 PR 6
- `/Users/eek/.local/spellbook/docs/Users-eek-Development-lockfreequeues/plans/2026-05-01-bench-rollup-impl.md` Track 6 (Tasks 6.1-6.5)