feat(loadgen): implement load test harness for messaging workers #117

Merged
Commits (36)
- c470c55 docs: add load test messaging workers design (claude)
- 70502fd docs: add load test messaging workers implementation plan (claude)
- 182ef25 feat(loadgen): scaffold main.go with subcommand dispatch (claude)
- b6e9ac1 feat(loadgen): add Preset type and four built-in presets (claude)
- 354afa0 test(loadgen): guard preset lookup ok in uniform/realistic shape tests (claude)
- 2ae8310 feat(loadgen): deterministic fixture generation from (preset, seed) (claude)
- 7cd9a80 test(loadgen): drop unused default branch in realistic room-type switch (claude)
- 3df2cd6 fix(loadgen): address gocritic/errcheck findings in preset.go (claude)
- 4641d98 refactor(loadgen): pass Preset by pointer; revert lint config bump (claude)
- b53be6b test(loadgen): cover pickMembers padding and sampleWithoutReplacement… (claude)
- 9a9b6bc feat(loadgen): Seed and Teardown mongo collections from fixtures (claude)
- 3787437 feat(loadgen): Prometheus registry with loadgen collectors (claude)
- 68e48b3 feat(loadgen): collector correlates publishes with replies and broadc… (claude)
- 3d483b8 fix(loadgen): close race in Collector samples; add coverage tests (claude)
- c10f31b feat(loadgen): percentiles, summary printer, CSV export, exit code (claude)
- 9ef41ef test(loadgen): drop redundant nolint; _test.go is already excluded fr… (claude)
- a5d86e5 feat(loadgen): open-loop generator with injected publisher (claude)
- 7e79a79 fix(loadgen): clear Collector orphans on publish failure; tighten tests (claude)
- 2bad977 feat(loadgen): JetStream consumer-lag sampler (claude)
- 9c8d962 fix(loadgen): warn (not debug) on consumer poll errors; document Snap… (claude)
- 021a409 feat(loadgen): wire seed/run/teardown subcommands in main.go (claude)
- eac94f2 fix(loadgen): skip byReqID in canonical mode to avoid false missing-r… (claude)
- b4ea921 feat(loadgen): docker-compose harness, Dockerfile, grafana dashboard (claude)
- feb4c19 fix(loadgen): drop NATS scrape job (port 8222 serves JSON, not Promet… (claude)
- d3b1e54 feat(loadgen): scoped Makefile for harness (claude)
- 6084ba7 test(loadgen): integration test for end-to-end wiring (claude)
- dd19404 docs(loadgen): add operator README (claude)
- 69c0eab test(loadgen): add unit tests for main helpers and sampler Snapshot (claude)
- 57d9f93 fix(loadgen): address final review — indexes, canonical rate, DM broa… (claude)
- 1905810 refactor(loadgen): simplify pass — pre-compute content, unify handler… (claude)
- eb8eea8 fix(loadgen): split sent counter into warmup/measured phases for clea… (claude)
- fdde0d0 fix(loadgen): index users.account so broadcast-worker enrichment isn'… (claude)
- 54acee8 perf(loadgen): dispatch publishes to worker pool; add opt-in pprof (claude)
- 45ff2ad Merge branch 'main' into claude/load-test-messaging-workers-tDKZn (hmchangw)
- 8a9e64d fix: group to channel (hmchangw)
- 6ef91ee fix linting (hmchangw)
docs/superpowers/plans/2026-04-21-load-test-messaging-workers.md — 2,780 additions, 0 deletions (large diff not rendered)
docs/superpowers/specs/2026-04-21-load-test-messaging-workers-design.md — 620 additions, 0 deletions (large diff not rendered)
docs/superpowers/specs/2026-04-24-loadgen-worker-pool-design.md — 203 additions, 0 deletions
# Loadgen Worker-Pool Dispatch + pprof — Design

## Purpose

The loadgen's actual publish rate falls materially below the target rate at
moderate throughput. At `--rate=1000` the observed rate is ~775 msg/s
(~77% delivery). Root cause: the publisher runs serially on the
`time.Ticker`'s goroutine, and `time.Ticker` drops ticks that fire while a
publish is still in progress. Any per-publish stall (NATS write-lock
contention, GC pause, scheduler hiccup) above the 1 ms/tick budget silently
loses a tick.

This spec fixes that by dispatching publishes to a small worker pool and
adding opt-in pprof so future bottlenecks are diagnosable.
## Scope

### In scope

- `Generator.Run` dispatches each tick's publish to a bounded pool of
  goroutines. The ticker itself stays punctual.
- New env var `MAX_IN_FLIGHT` (default `200`) caps concurrent publishes.
  Saturation (pool full when a tick fires) is an explicit signal, not a
  silent drop: the ticker records
  `loadgen_publish_errors_total{reason="saturated"}` and moves on.
- `MAX_IN_FLIGHT=0` falls back to the current serial behavior — useful as
  a bisection tool and a conservative option for anyone who wants
  reproducible comparisons.
- On graceful shutdown / `ctx.Done()`, `Run` returns only after all
  in-flight publishes drain (bounded by a small timeout).
- New env var `PPROF_ADDR` (default `""`, meaning disabled). When set
  (e.g. `:6060`), loadgen exposes `net/http/pprof` handlers on a
  separate HTTP server. Never on by default — pprof isn't exposed in
  production-ish deployments unless the operator opts in.
- The docker-compose loadgen service documents both new env vars.

### Out of scope

- Changes to the Collector, ConsumerSampler, Report, Preset, Seed, or
  integration test — none are on the publish hot path.
- `golang.org/x/time/rate.Limiter` — the worker-pool fix addresses the
  real structural cause (ticker/publish coupling). If worker-pool
  saturation becomes the new bottleneck, re-evaluate then.
- `sync.Pool` allocation-reuse tuning — defer until pprof identifies GC
  as the next-order concern.
- A dedicated NATS connection for publishes vs. subscriptions — only
  justified if pprof identifies the NATS write lock as the bottleneck
  after the worker pool lands.
- A default-rate bump — reasoned about separately.
## Architecture

Before:

```text
ticker goroutine: [wait tick] → publishOne (JSON + NATS write + metrics) → [wait tick] → …
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                one slow call here silently loses a tick
```

After:

```text
ticker goroutine:  [wait tick] → reserve sem slot → spawn publish goroutine → [wait tick] → …

publish goroutine: [publishOne] → release sem slot
publish goroutine: [publishOne] → release sem slot
publish goroutine: [publishOne] → release sem slot   (up to MAX_IN_FLIGHT concurrently)
```

The ticker goroutine's per-tick work shrinks to a semaphore send plus a
goroutine spawn — tens of nanoseconds. It cannot overshoot the ticker
interval at any realistic rate.
## Components

### `Generator.Run` (modified)

- Read `g.cfg.MaxInFlight` from `GeneratorConfig`.
- If `MaxInFlight <= 0`: run serially as today (preserves legacy behavior
  and gives a bisection switch).
- Else: create `sem := make(chan struct{}, MaxInFlight)` and
  `var wg sync.WaitGroup`. On each tick, do a non-blocking `select`:
  - Slot available: take it, `wg.Add(1)`, then
    `go func() { defer wg.Done(); defer func() { <-sem }(); g.publishOne(ctx) }()`.
  - No slot: increment
    `loadgen_publish_errors_total{reason="saturated"}` and continue —
    the tick is dropped, but at least it is observable.
- On `ctx.Done()`: stop the ticker, then `wg.Wait()` with a bounded grace
  period (5 s). If the grace period expires, log and return — in-flight
  goroutines complete on their own after the NATS drain in main.
### `GeneratorConfig` (modified)

Add one field:

```go
type GeneratorConfig struct {
    … existing fields …
    MaxInFlight int
}
```

### `main.go` (modified)

Add to `config`:

```go
type config struct {
    … existing fields …
    MaxInFlight int    `env:"MAX_IN_FLIGHT" envDefault:"200"`
    PProfAddr   string `env:"PPROF_ADDR" envDefault:""`
}
```

Pass `cfg.MaxInFlight` into `GeneratorConfig` when constructing the generator.

On startup, if `PProfAddr != ""`: register `net/http/pprof` handlers on a
new `http.ServeMux` and start a separate `http.Server` listening on that
addr, logging the resulting URL. The server doesn't share the metrics mux —
pprof is genuinely separate, opt-in infrastructure, and keeping it off the
metrics port avoids accidental exposure when the metrics mux is scraped by
Prometheus.

On `ctx.Done()`: gracefully shut down the pprof server with a 2 s timeout.
### Metrics

No new metrics. The existing `loadgen_publish_errors_total` counter gains
`reason="saturated"` as its single new label value for pool saturation.
This keeps the Grafana dashboard's "Publish errors/sec by reason" panel
working out of the box.

## Error handling

- `sem <- struct{}{}` never blocks because it sits inside a non-blocking
  `select` — if the pool is full, we record saturation and move on. There
  is no unbounded goroutine growth under sustained overload.
- Inside each publish goroutine, `publishOne` already handles its own
  errors (counters for marshal/publish failures, `RecordPublishFailed`
  on the Collector).
- Graceful shutdown: `Run` returns only after in-flight publishes drain
  or the bounded grace period elapses. The caller (`runRun` in `main.go`)
  already calls `collector.DiscardBefore` and `collector.Finalize` after
  `Run` returns, so late-arriving publishes correctly integrate with the
  summary.
## Testing

### New unit test

`TestGenerator_MaxInFlightZeroRunsSerially` — with `MaxInFlight=0`, the
generator's behavior is unchanged from today. Reuses the existing
`TestGenerator_SendsExpectedCount` assertion style.

### Adjusted unit test

`TestGenerator_SendsExpectedCount` — still valid with `MaxInFlight > 0`,
but the count may be closer to the theoretical target since the ticker
is no longer blocked.

### New unit test

`TestGenerator_PoolSaturationCountedAsError` — artificially slow the
publisher via an injected blocking `Publisher`. Run at a rate that
exceeds the pool's capacity. Assert that the `saturated` counter
increments.
### Integration test

No change. The existing `tools/loadgen/integration_test.go` exercises
`Generator.Run` with a fake gatekeeper + broadcast-worker and makes no
assumptions about ticker coupling.

### Coverage target

`generator.go` stays at ≥ 90% coverage for `Run`, `publishOne`, and
`content`, per the existing plan.

## Dependencies

No new third-party dependencies. All new code uses the stdlib: `net/http`,
`net/http/pprof`, `sync`.

## Rollout

- Both env vars have safe defaults (`MAX_IN_FLIGHT=200`, `PPROF_ADDR=""`).
- Existing deployments pick up the worker pool automatically, with
  improved actual-rate fidelity at moderate throughput. Operators
  concerned about the behavior change can set `MAX_IN_FLIGHT=0` to
  get the legacy serial path.
- pprof stays off unless explicitly enabled via `PPROF_ADDR`.
- Internal-only to the loadgen service; no cross-service contract
  change.

## Future work (deferred)

- Dedicated publish-side `*nats.Conn` — only if profiling identifies the
  NATS connection write lock as the remaining bottleneck.
- `sync.Pool` for `SendMessageRequest` / `MessageEvent` / byte buffers
  to reduce per-publish GC pressure — only if GC shows up in a
  profile.
- Background UUID generation — only if `crypto/rand` shows up
  prominently.
# loadgen

Capacity-baseline load generator for the single-site messaging pipeline
(`message-gatekeeper` → `MESSAGES_CANONICAL` → `message-worker` +
`broadcast-worker`). A single Go binary with three subcommands.

## Quick start

```
make -C tools/loadgen/deploy up
make -C tools/loadgen/deploy seed PRESET=medium
make -C tools/loadgen/deploy run PRESET=medium RATE=500 DURATION=60s
```

For live dashboards:

```
make -C tools/loadgen/deploy run-dashboards PRESET=medium
# Grafana at http://localhost:3000 (anonymous admin)
```

Tear down:

```
make -C tools/loadgen/deploy down
```

## Presets

| preset      | users  | rooms | notes                                                   |
|-------------|--------|-------|---------------------------------------------------------|
| `small`     | 10     | 5     | uniform, 200-byte content                               |
| `medium`    | 1 000  | 100   | uniform, 200-byte content                               |
| `large`     | 10 000 | 1 000 | uniform, 200-byte content                               |
| `realistic` | 1 000  | 100   | Zipf senders, mixed room sizes, 50–2000 bytes, mentions |

## Subcommands

- `loadgen seed --preset=<name> [--seed=42]` — idempotently populate
  MongoDB with deterministic fixtures.
- `loadgen run --preset=<name> [flags]` — open-loop publish at `--rate`
  msgs/sec for `--duration`, printing a summary at the end. Flags:
  `--seed`, `--warmup`, `--inject=frontdoor|canonical`, `--csv=<path>`.
- `loadgen teardown` — drop the three seeded collections.

## Reading the summary

- `final_pending == 0` on both durables and zero errors → the pipeline is
  sustaining your target rate.
- `final_pending` climbing, or error counts > 0 → over capacity, or a
  regression upstream of the worker.

## Non-goals

- Not a CI regression gate. Invoked manually.
- Not an auth benchmark. Uses shared `backend.creds`.
- Not a cross-site benchmark. Single-site only.
- Not an absolute-number tool. Numbers vary by host — compare runs on one
  machine across changes; don't compare across machines.
🛠️ Refactor suggestion | 🟠 Major

Add unit tests for the new exported helpers.

As per coding guidelines: "Every exported function in `pkg/` must have
corresponding test cases." Please add tests for `UserResponseWildcard`,
`RoomEventWildcard`, and `UserRoomEventWildcard` in
`pkg/subject/subject_test.go`, asserting the exact returned strings.