diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index bdce6e3..f1d6d5e 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -57,3 +57,17 @@ jobs: with: tool: cargo-audit - run: cargo audit + + msrv: + name: MSRV (Rust 1.88) + runs-on: ubuntu-latest + # Verifies the declared `rust-version`. Lints are capped to `warn` here so + # this stays a pure compile/edition/dependency gate — the `checks` job + # enforces `-D warnings` on stable. + env: + RUSTFLAGS: --cap-lints=warn + steps: + - uses: actions/checkout@v4 + - uses: dtolnay/rust-toolchain@1.88 + - uses: Swatinem/rust-cache@v2 + - run: cargo build --workspace --all-features --locked diff --git a/AGENTS.md b/AGENTS.md index daa974a..b9901ec 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -21,13 +21,13 @@ Started 2026-05-10. Target end: 2026-05-17 (1 week with 4 parallel agents instea ### Active agent registry -| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | -|---|---|---|---|---|---| -| **A** | `cold_start` + `sustained` scenarios | `feat/m2-scenarios-cs` | `crates/mcp-loadtest/src/scenario/{cold_start,sustained}.rs`; `tests/scenarios_basic.rs` | `session.rs`, `scenario/mod.rs` (trait), `metrics/*` | ✅ done | -| **B** | `hang_detect` + `deadlock_probe` scenario | `feat/m2-deadlock` | `src/hang_detector.rs` (body); `src/scenario/deadlock_probe.rs`; `tests/deadlock.rs` | `session.rs`, `scenario/mod.rs` (trait), `metrics/*` | ✅ done | -| **C** | hdrhistogram metrics layer | `feat/m2-metrics` | `src/metrics/{mod,histogram,throughput}.rs` | none — fully independent | ✅ done | -| **D** | Mock servers + Python framing fixes | `feat/m2-mocks` | `tests/fixtures/mock-{broken,slow,crash}.py` | (none) | ✅ done | -| **integration** | Wire-up + ci-checks + commit | `main` | `scenario/mod.rs` (uncomment registrations); `tests/deadlock.rs` (replace inline helper with real `DeadlockProbe`) | all M2 outputs | ✅ done | +| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | +| --------------- | ----------------------------------------- | ---------------------- | ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------- | ------- | +| **A** | `cold_start` + `sustained` scenarios | `feat/m2-scenarios-cs` | `crates/mcp-loadtest/src/scenario/{cold_start,sustained}.rs`; `tests/scenarios_basic.rs` | `session.rs`, `scenario/mod.rs` (trait), `metrics/*` | ✅ done | +| **B** | `hang_detect` + `deadlock_probe` scenario | `feat/m2-deadlock` | `src/hang_detector.rs` (body); `src/scenario/deadlock_probe.rs`; `tests/deadlock.rs` | `session.rs`, `scenario/mod.rs` (trait), `metrics/*` | ✅ done | +| **C** | hdrhistogram metrics layer | `feat/m2-metrics` | `src/metrics/{mod,histogram,throughput}.rs` | none — fully independent | ✅ done | +| **D** | Mock servers + Python framing fixes | `feat/m2-mocks` | `tests/fixtures/mock-{broken,slow,crash}.py` | (none) | ✅ done | +| **integration** | Wire-up + ci-checks + commit | `main` | `scenario/mod.rs` (uncomment registrations); `tests/deadlock.rs` (replace inline helper with real `DeadlockProbe`) | all M2 outputs | ✅ done | ### Conflicts to expect @@ -117,6 +117,7 @@ Algorithm: see DESIGN.md §15.1 for the precise decision tree. ### Mock fixtures — `tests/fixtures/mock-*.py` Contract enforced by `_common.py`: + - Read JSON-RPC frames from stdin (one per line). - Write JSON-RPC frames to stdout (one per line). - Exit 0 on stdin EOF. @@ -152,6 +153,7 @@ If you're a fresh agent / contributor entering this sprint: 7. Wait for the integration step. If your work needs an interface change: + - **STOP.** Do not break the contract. - Open an issue or sync with the main session. - All agents pause until the contract is renegotiated. @@ -164,21 +166,21 @@ Started 2026-05-11. Final sprint before user-review hand-off. **Repo stays priva ### M7 active agent registry -| Agent | Scope | Files OWNED | Status | -|---|---|---|---| -| **U** | Protocol fuzzer scenario — random/malformed JSON-RPC payloads | `src/scenario/fuzzer.rs` (new); `src/analysis/fuzz_report.rs` (new); `tests/fuzzer.rs` (new) | open | -| **V** | Coverage tracking + per-tool SLO assertions | `src/analysis/coverage.rs` (new); add `coverage: Option` to `Report`; extend `Run::execute`; `tests/coverage.rs` (new) | open | -| **W** | `mcp-loadtest serve --mcp` self-hosted MCP server | `src/serve/mod.rs` + `src/serve/tools.rs` (new); add `Serve` Cmd; `tests/serve_smoke.rs` (new) | done (2026-05-11) | -| **X** | README rewrite + cookbook | `README.md`; `docs/examples/{ci-integration,custom-scenario,debugging-deadlocks}.md` (new) | open | +| Agent | Scope | Files OWNED | Status | +| ----- | ------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | +| **U** | Protocol fuzzer scenario — random/malformed JSON-RPC payloads | `src/scenario/fuzzer.rs` (new); `src/analysis/fuzz_report.rs` (new); `tests/fuzzer.rs` (new) | done (2026-05-11) | +| **V** | Coverage tracking + per-tool SLO assertions | `src/analysis/coverage.rs` (new); add `coverage: Option` to `Report`; extend `Run::execute`; `tests/coverage.rs` (new) | done (2026-05-11) | +| **W** | `mcp-loadtest serve --mcp` self-hosted MCP server | `src/serve/mod.rs` + `src/serve/tools.rs` (new); add `Serve` Cmd; `tests/serve_smoke.rs` (new) | done (2026-05-11) | +| **X** | README rewrite + cookbook | `README.md`; `docs/examples/{ci-integration,custom-scenario,debugging-deadlocks}.md` (new) | done (2026-05-11) | ### M7 sprint exit criteria -- [ ] ci-checks green; `cargo test --workspace` 100% pass -- [ ] `mcp-loadtest fuzz --server "..."` against `mock-broken.py` reports interesting findings (panic / hang / parse error) -- [ ] Coverage shows "echo" exercised in scenarios_basic -- [ ] `mcp-loadtest serve --mcp` boots; `tools/list` returns `deadlock_probe`/`sustained_load`/`compare_runs` (minimum) -- [ ] README leads with the deadlock demo + has 3 cookbook examples -- [ ] /security-review + /release-checks pass +- [x] ci-checks green; `cargo test --workspace` 100% pass (Windows PowerShell script, 2026-05-11) +- [ ] Fuzzer scenario via `mcp-loadtest run --config ` against `mock-broken.py` reports interesting findings (dedicated `fuzz` subcommand deferred) +- [x] Coverage shows "echo" exercised in coverage/scenario tests +- [x] `mcp-loadtest serve --mcp` boots; `tools/list` returns `deadlock_probe`/`sustained_load`/`compare_runs` (minimum) +- [x] README leads with the deadlock demo + has 3 cookbook examples +- [x] /security-review + /release-checks pass (security review: 1 HIGH + 3 lower findings, all fixed in `ad2e5c6`; release-checks run — see CHANGELOG Notes + the `run_strict` / `strict_validation` e2e tests) After M7: STOP. Hand off to user for review BEFORE flipping repo public. @@ -190,12 +192,12 @@ Started 2026-05-11. Target: 1 week. ### M6 active agent registry -| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | -|---|---|---|---|---|---| -| **Q** | Real-time TUI dashboard via ratatui + crossterm | `feat/m6-tui` | `src/tui/mod.rs` (uncomment + enrich); `src/tui/dashboard.rs` (new); `tests/tui_smoke.rs` (new) | `metrics::Recorder` (clone), `report/*` | done (2026-05-11) | -| **R** | `race_detector` analyzer + `RaceCheck` scenario | `feat/m6-race` | `src/analysis/race_detector.rs`; `src/scenario/race_check.rs` (new); `tests/race_check.rs` (new) | session, scenario trait | done (2026-05-11) | -| **S** | Cross-server CLI subcommand `mcp-loadtest cross` | `feat/m6-cross` | `crates/mcp-loadtest-cli/src/cmd_cross.rs` (new); add `Cross` Cmd variant in main.rs | run, scenario, report (read) | done (2026-05-11) | -| **T** | ProcessSampler enrichment (fd + threads) + Soak leak-regression | `feat/m6-process-leak` | `src/metrics/process.rs` (extend); `src/report/mod.rs` ONLY to add `peak_fd: u64`, `peak_threads: u64` to ProcessStats (locked-API expansion); refactor `Soak` to use the new fields | sysinfo API | open | +| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | +| ----- | --------------------------------------------------------------- | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------- | ----------------- | +| **Q** | Real-time TUI dashboard via ratatui + crossterm | `feat/m6-tui` | `src/tui/mod.rs` (uncomment + enrich); `src/tui/dashboard.rs` (new); `tests/tui_smoke.rs` (new) | `metrics::Recorder` (clone), `report/*` | done (2026-05-11) | +| **R** | `race_detector` analyzer + `RaceCheck` scenario | `feat/m6-race` | `src/analysis/race_detector.rs`; `src/scenario/race_check.rs` (new); `tests/race_check.rs` (new) | session, scenario trait | done (2026-05-11) | +| **S** | Cross-server CLI subcommand `mcp-loadtest cross` | `feat/m6-cross` | `crates/mcp-loadtest-cli/src/cmd_cross.rs` (new); add `Cross` Cmd variant in main.rs | run, scenario, report (read) | done (2026-05-11) | +| **T** | ProcessSampler enrichment (fd + threads) + Soak leak-regression | `feat/m6-process-leak` | `src/metrics/process.rs` (extend); `src/report/mod.rs` ONLY to add `peak_fd: u64`, `peak_threads: u64` to ProcessStats (locked-API expansion); refactor `Soak` to use the new fields | sysinfo API | done (2026-05-11) | ### M6 conflicts to expect @@ -218,22 +220,22 @@ Started 2026-05-11. Target: 1 week. Mostly post-hoc analysis on top of the M3 `R ### M5 active agent registry -| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | -|---|---|---|---|---|---| -| **M** | `Ramp` scenario + `BreakingPointDetector` | `feat/m5-breaking-point` | `src/analysis/breaking_point.rs` (new); `src/scenario/ramp.rs` (new); `tests/breaking_point.rs` (new) | `analysis/mod.rs` (declare module), `scenario/mod.rs` (register), `metrics/*`, locked `Report` | open | -| **N** | Performance grading (A-F per latency / concurrency / error) | `feat/m5-grading` | `src/analysis/grading.rs` (new); `tests/grading.rs` (new) | `report/*` (read), `metrics/*` | open | -| **O** | Pattern engine — multi-step + weighted random + think-time | `feat/m5-patterns` | `src/scenario/pattern.rs` (new); refactor `src/scenario/sustained.rs` to drive Patterns; `tests/patterns.rs` (new) | `Session::call_tool`, locked `Scenario` trait | open | -| **P** | `compare` subcommand + soak scenario polish | `feat/m5-compare-soak` | `src/scenario/soak.rs` (new); `crates/mcp-loadtest-cli/src/cmd_compare.rs` (new); add `mcp-loadtest compare baseline.json current.json` to CLI | `report/json.rs` (read) for the on-disk shape | done | +| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | +| ----- | ----------------------------------------------------------- | ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | ----------------- | +| **M** | `Ramp` scenario + `BreakingPointDetector` | `feat/m5-breaking-point` | `src/analysis/breaking_point.rs` (new); `src/scenario/ramp.rs` (new); `tests/breaking_point.rs` (new) | `analysis/mod.rs` (declare module), `scenario/mod.rs` (register), `metrics/*`, locked `Report` | done (2026-05-11) | +| **N** | Performance grading (A-F per latency / concurrency / error) | `feat/m5-grading` | `src/analysis/grading.rs` (new); `tests/grading.rs` (new) | `report/*` (read), `metrics/*` | done (2026-05-11) | +| **O** | Pattern engine — multi-step + weighted random + think-time | `feat/m5-patterns` | `src/scenario/pattern.rs` (new); refactor `src/scenario/sustained.rs` to drive Patterns; `tests/patterns.rs` (new) | `Session::call_tool`, locked `Scenario` trait | done (2026-05-11) | +| **P** | `compare` subcommand + soak scenario polish | `feat/m5-compare-soak` | `src/scenario/soak.rs` (new); `crates/mcp-loadtest-cli/src/cmd_compare.rs` (new); add `mcp-loadtest compare baseline.json current.json` to CLI | `report/json.rs` (read) for the on-disk shape | done | ### M5 conflicts / notes - All four touch `scenario/mod.rs` (register their new module). Convention: each appends one line at the bottom; integration agent resolves any merge. - Agent O refactors `Sustained` — must keep the existing TOML config that has `tool` + `args` working (single-step pattern fallback). Existing `tests/scenarios_basic.rs` must still pass without modification. - Agent P's `compare` subcommand shape: - ``` - mcp-loadtest compare - ``` - Outputs a markdown diff highlighting regressions. JSON output via `--format json` for CI use. + ``` + mcp-loadtest compare + ``` + Outputs a markdown diff highlighting regressions. JSON output via `--format json` for CI use. - DESIGN §21 template-var pattern feature (using previous step's output in next step) is **deferred to M6**. M5's pattern engine: multi-step + weighted + think-time only. ### M5 sprint exit criteria @@ -253,11 +255,11 @@ Pre-M4 (integration agent, done): toolchain bumped 1.85→stable; `Transport` tr ### M4 active agent registry -| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | -|---|---|---|---|---|---| -| **J** | HTTP transport (Streamable HTTP simple variant) | `feat/m4-http` | `src/protocol/transport/http.rs`; `tests/http_transport.rs` (new) | `protocol/transport/mod.rs` (locked Transport trait), `session.rs` | open | -| **K** | SSE transport (event-stream subscribe + POST send) | `feat/m4-sse` | `src/protocol/transport/sse.rs`; `tests/sse_transport.rs` (new) | `protocol/transport/mod.rs`, `session.rs` | open | -| **L** | Python HTTP+SSE mock fixtures | `feat/m4-fixtures-http` | `tests/fixtures/mock-http-server.py`; `tests/fixtures/mock-sse-server.py`; `tests/fixtures/_http_common.py` (allowed) | existing fixture conventions | open | +| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | +| ----- | -------------------------------------------------- | ----------------------- | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ | ----------------- | +| **J** | HTTP transport (Streamable HTTP simple variant) | `feat/m4-http` | `src/protocol/transport/http.rs`; `tests/http_transport.rs` (new) | `protocol/transport/mod.rs` (locked Transport trait), `session.rs` | done (2026-05-11) | +| **K** | SSE transport (event-stream subscribe + POST send) | `feat/m4-sse` | `src/protocol/transport/sse.rs`; `tests/sse_transport.rs` (new) | `protocol/transport/mod.rs`, `session.rs` | done (2026-05-11) | +| **L** | Python HTTP+SSE mock fixtures | `feat/m4-fixtures-http` | `tests/fixtures/mock-http-server.py`; `tests/fixtures/mock-sse-server.py`; `tests/fixtures/_http_common.py` (allowed) | existing fixture conventions | done (2026-05-11) | Agent I (stdio extraction + Transport trait) done by integration agent pre-sprint; left as 3 agents instead of 4 to reduce coordination risk. @@ -281,13 +283,13 @@ Started 2026-05-10 (right after M2 close). Target: 1 week. ### M3 active agent registry -| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | -|---|---|---|---|---|---| -| **E** | TOML config + parse | `feat/m3-config` | `src/config.rs`; `tests/config_parse.rs` | `metrics/*`, `scenario/mod.rs`, locked types in config.rs | ✅ done | -| **F** | Markdown + JSON reporters | `feat/m3-reporters-md-json` | `src/report/markdown.rs` + `src/report/json.rs`; `tests/reporter_snapshots.rs` + 4 insta snapshots | `report/mod.rs` (locked Report + Reporter trait), `metrics/*`, `scenario/mod.rs` | ✅ done | -| **G** | Terminal reporter + sysinfo process metrics | `feat/m3-terminal-process` | `src/report/terminal.rs`; `src/metrics/process.rs`; `tests/process_metrics.rs` | `report/mod.rs`, `metrics/mod.rs` | ✅ done | -| **H** | Run orchestrator + Vibe-Trading regression test | `feat/m3-run-orchestrator` | `src/run.rs` (full body + 6 unit tests); `tests/vibe_trading_regression.rs` | everything else read-only | ✅ done | -| **integration** | CLI + reporters wiring + ci-checks + commit | `main` | `crates/mcp-loadtest-cli/src/main.rs` (clap subcommands); `crates/mcp-loadtest-cli/Cargo.toml` (deps); CHANGELOG; AGENTS | all M3 outputs | ✅ done | +| Agent | Scope | Branch | Files OWNED | Read-only deps | Status | +| --------------- | ----------------------------------------------- | --------------------------- | ------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------- | ------- | +| **E** | TOML config + parse | `feat/m3-config` | `src/config.rs`; `tests/config_parse.rs` | `metrics/*`, `scenario/mod.rs`, locked types in config.rs | ✅ done | +| **F** | Markdown + JSON reporters | `feat/m3-reporters-md-json` | `src/report/markdown.rs` + `src/report/json.rs`; `tests/reporter_snapshots.rs` + 4 insta snapshots | `report/mod.rs` (locked Report + Reporter trait), `metrics/*`, `scenario/mod.rs` | ✅ done | +| **G** | Terminal reporter + sysinfo process metrics | `feat/m3-terminal-process` | `src/report/terminal.rs`; `src/metrics/process.rs`; `tests/process_metrics.rs` | `report/mod.rs`, `metrics/mod.rs` | ✅ done | +| **H** | Run orchestrator + Vibe-Trading regression test | `feat/m3-run-orchestrator` | `src/run.rs` (full body + 6 unit tests); `tests/vibe_trading_regression.rs` | everything else read-only | ✅ done | +| **integration** | CLI + reporters wiring + ci-checks + commit | `main` | `crates/mcp-loadtest-cli/src/main.rs` (clap subcommands); `crates/mcp-loadtest-cli/Cargo.toml` (deps); CHANGELOG; AGENTS | all M3 outputs | ✅ done | ### M3 locked contracts (DO NOT modify) diff --git a/CHANGELOG.md b/CHANGELOG.md index bf3992f..50edbe7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,75 +8,79 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added + +- Opt-in strict MCP schema validation (ADR 0010): `[validation] strict = true` validates each `tools/call`'s arguments against the server's advertised `inputSchema` before sending. A dependency-free subset validator (`protocol::schema`) covers `type`/`properties`/`required`/`enum`/`items` and skips unmodeled keywords (forward-compatible, ADR 0005). Arg violations are classified as `CallOutcome::ProtocolError` so they gate a run; default (off) behaviour is byte-for-byte unchanged. Result-side validation is deferred to v0.2. +- Configurable regression thresholds (ADR 0009 follow-up): `compare` gains `--max-p99-regression-pct` / `--max-error-rate-regression-pp` / `--allow-deadlock-increase`, and the `compare_runs` MCP tool gains the matching optional args. Backed by a shared `analysis::regression::RegressionThresholds` whose `Default` reproduces the historical 10% p99 / 0.5pp / deadlock-zero-tolerance policy, so existing CI gates are unaffected unless they opt in. - Project scaffolding: workspace layout, Cargo config, CLAUDE.md hierarchy, slash commands, CI workflow. - Design document covering motivation, types, algorithms, test matrix, milestones (DESIGN.md, 20 sections). - Project-structure document for AI-assisted development workflow (PROJECT-STRUCTURE.md). - M1 protocol stack: - - `protocol::jsonrpc` — JSON-RPC 2.0 message types (`OutgoingRequest`, `OutgoingNotification`, `ResponseEnvelope`, `ResponsePayload`, `ErrorObject`). - - `protocol::mcp` — MCP types (`InitializeParams`/`Result`, `Tool`, `ListToolsResult`, `CallToolParams`, `CallToolResult`, `Content`). + - `protocol::jsonrpc` — JSON-RPC 2.0 message types (`OutgoingRequest`, `OutgoingNotification`, `ResponseEnvelope`, `ResponsePayload`, `ErrorObject`). + - `protocol::mcp` — MCP types (`InitializeParams`/`Result`, `Tool`, `ListToolsResult`, `CallToolParams`, `CallToolResult`, `Content`). - M1 `Session` — stdio MCP session that spawns a child, runs `initialize` + `notifications/initialized`, exposes `list_tools` / `call_tool` / `shutdown`. Synchronous request/response only; concurrent in-flight requests deferred to M2. - Python test fixtures: `_common.py` (framing helpers) + `mock-normal.py` (echoes args). - End-to-end integration test `happy_path.rs` covering spawn → initialize → tools/list → tools/call → shutdown against `mock-normal.py`. - M2 scenarios + metrics core (delivered via 4-agent parallel sprint): - - `scenario::Scenario` trait + `RunContext` + `ScenarioOutcome` (interface contract pinned in AGENTS.md). - - `scenario::sustained::Sustained` — concurrent-load workload (M2 sequential against single Session; multi-Session pool is M3). - - `scenario::deadlock_probe::DeadlockProbe` — Vibe-Trading-bug-class detector. Wraps each `tools/call` with `hang_detect`; bails on first deadlock to avoid flooding a wedged session. - - `scenario::cold_start::ColdStart` — placeholder (M3 will activate once `RunContext` gains a session-spawning factory). - - `hang_detector::hang_detect` — two-phase `tokio::select!` watchdog (DESIGN.md §15.1) classifying each call as Ok / Slow / Deadlock / Err. - - `metrics::Recorder` — Arc-shared, lock-free outcome counters + 16-shard `hdrhistogram` for per-call latencies (microsecond resolution, 1µs..=1h range). - - `ScenarioMetrics`, `LatencyStats` (p50/p95/p99/p999/mean/min/max/count), `ThroughputStats`, `OutcomeCounts`. + - `scenario::Scenario` trait + `RunContext` + `ScenarioOutcome` (interface contract pinned in AGENTS.md). + - `scenario::sustained::Sustained` — concurrent-load workload (M2 sequential against single Session; multi-Session pool is M3). + - `scenario::deadlock_probe::DeadlockProbe` — Vibe-Trading-bug-class detector. Wraps each `tools/call` with `hang_detect`; bails on first deadlock to avoid flooding a wedged session. + - `scenario::cold_start::ColdStart` — placeholder (M3 will activate once `RunContext` gains a session-spawning factory). + - `hang_detector::hang_detect` — two-phase `tokio::select!` watchdog (DESIGN.md §15.1) classifying each call as Ok / Slow / Deadlock / Err. + - `metrics::Recorder` — Arc-shared, lock-free outcome counters + 16-shard `hdrhistogram` for per-call latencies (microsecond resolution, 1µs..=1h range). + - `ScenarioMetrics`, `LatencyStats` (p50/p95/p99/p999/mean/min/max/count), `ThroughputStats`, `OutcomeCounts`. - M2 fixtures: `mock-broken.py` (canonical deadlock pattern), `mock-slow.py` (2s tool latency), `mock-crash.py` (~1% mid-call exit). - M2 integration tests: `scenarios_basic.rs` (sustained + cold_start placeholder + cancellation), `deadlock.rs` (`mock_normal_no_deadlock` + `mock_broken_detects_deadlock` — the killer test that catches the Vibe-Trading bug class in <7s). - M3 reports + first internal release (delivered via 4-agent parallel sprint): - - `config::Config` + `ServerConfig` + `ScenarioConfig` + `ThresholdsConfig` + `OutputConfig` — TOML schema with humantime durations, semantic validation (`Config::from_toml_str`/`from_file`), and `example_config()` printer. - - `report::Report` + `ProcessStats` + `ProcessSample` + `ServerInfo` + `ThresholdViolation` + `ReportError` + `Reporter` trait. - - `report::markdown::MarkdownReporter` — DESIGN §17.3 template (status badge, summary, latency table, errors, process line, threshold violations, trace). - - `report::json::JsonReporter` — DESIGN §17.2 schema via a `ReportView` wrapper (durations as `_ms`, ISO 8601 timestamps); `Report` stays unmodified. - - `report::terminal::TerminalReporter` — ANSI-colored compact summary; respects `NO_COLOR` / `CLICOLOR` / non-tty automatically. - - `metrics::process::ProcessSampler` — sysinfo 0.32 backed periodic RSS+CPU sampler, two-phase CPU baseline, cancellation-aware, best-effort on dead PIDs. - - `run::Run` + `RunError` + `Run::execute()` — full orchestrator: ulid run-id, run dir creation, `Session` spawn, scenario drive, metrics snapshot, threshold evaluation, bounded shutdown. - - 79 tests passing across lib + 6 integration test files. + - `config::Config` + `ServerConfig` + `ScenarioConfig` + `ThresholdsConfig` + `OutputConfig` — TOML schema with humantime durations, semantic validation (`Config::from_toml_str`/`from_file`), and `example_config()` printer. + - `report::Report` + `ProcessStats` + `ProcessSample` + `ServerInfo` + `ThresholdViolation` + `ReportError` + `Reporter` trait. + - `report::markdown::MarkdownReporter` — DESIGN §17.3 template (status badge, summary, latency table, errors, process line, threshold violations, trace). + - `report::json::JsonReporter` — DESIGN §17.2 schema via a `ReportView` wrapper (durations as `_ms`, ISO 8601 timestamps); `Report` stays unmodified. + - `report::terminal::TerminalReporter` — ANSI-colored compact summary; respects `NO_COLOR` / `CLICOLOR` / non-tty automatically. + - `metrics::process::ProcessSampler` — sysinfo 0.32 backed periodic RSS+CPU sampler, two-phase CPU baseline, cancellation-aware, best-effort on dead PIDs. + - `run::Run` + `RunError` + `Run::execute()` — full orchestrator: ulid run-id, run dir creation, `Session` spawn, scenario drive, metrics snapshot, threshold evaluation, bounded shutdown. + - 79 tests passing across lib + 6 integration test files. - M3 CLI: `mcp-loadtest example-config | run --config | deadlock-probe --server "..." | list-scenarios`. Run + DeadlockProbe write `report.md` and `metrics.json` under `runs//`; `deadlock-probe` exits non-zero when `deadlock_count > 0`. - M3 Vibe-Trading regression test: clones `HKUDS/Vibe-Trading@71220c7` (parent of PR #85) into `target/vibe-trading-fixture/`, runs `DeadlockProbe`, asserts deadlock detected. `#[ignore]`d by default; run with `cargo test --test vibe_trading_regression -- --ignored --nocapture`. - M4 transport parity (3-agent parallel sprint): - - `protocol::transport::Transport` async trait + `TransportError` (Io/Http/Closed/Timeout/Other). - - `StdioTransport` extracted from `Session` (legacy single-call path stays via `Session::spawn`). - - `HttpTransport` (Streamable HTTP simple JSON variant; SSE-response detection routes to a clear M5 deferral). - - `SseTransport` with background reader task + endpoint handshake + id-correlation buffer. - - Python fixtures `mock-http-server.py` + `mock-sse-server.py` (stdlib http.server only). - - `ServerConfig.url` + transport-aware `Run::execute` dispatch. - - Toolchain bumped 1.85 → stable to satisfy icu transitive MSRV pulled in by `url`. + - `protocol::transport::Transport` async trait + `TransportError` (Io/Http/Closed/Timeout/Other). + - `StdioTransport` extracted from `Session` (legacy single-call path stays via `Session::spawn`). + - `HttpTransport` (Streamable HTTP simple JSON variant; SSE-response detection routes to a clear M5 deferral). + - `SseTransport` with background reader task + endpoint handshake + id-correlation buffer. + - Python fixtures `mock-http-server.py` + `mock-sse-server.py` (stdlib http.server only). + - `ServerConfig.url` + transport-aware `Run::execute` dispatch. + - Toolchain bumped 1.85 → stable to satisfy icu transitive MSRV pulled in by `url`. - M5 analysis parity (4-agent parallel sprint): - - `analysis::breaking_point` (BreakingPointDetector w/ first-violator semantics on per-step deltas). - - `analysis::grading` (Grade A-F per latency/concurrency/error, worst-of-three rollup). - - `scenario::ramp` (linear-stepped concurrency; integrates with breaking_point). - - `scenario::pattern` (multi-step + weighted-random + think-time + ErrorBehavior). - - `Sustained` refactor to drive Patterns; `run_patterns` free function for multi-pattern callers. - - `scenario::soak` (periodic snapshots + leak signal via mean-latency regression). - - `mcp-loadtest compare baseline.json current.json` CLI subcommand (markdown/JSON diff). + - `analysis::breaking_point` (BreakingPointDetector w/ first-violator semantics on per-step deltas). + - `analysis::grading` (Grade A-F per latency/concurrency/error, worst-of-three rollup). + - `scenario::ramp` (linear-stepped concurrency; integrates with breaking_point). + - `scenario::pattern` (multi-step + weighted-random + think-time + ErrorBehavior). + - `Sustained` refactor to drive Patterns; `run_patterns` free function for multi-pattern callers. + - `scenario::soak` (periodic snapshots + leak signal via mean-latency regression). + - `mcp-loadtest compare baseline.json current.json` CLI subcommand (markdown/JSON diff). - M6 differentiators v1 (4-agent parallel sprint): - - `tui::dashboard` (ratatui + crossterm live polling; quits on q/Esc, propagates cancel). - - `analysis::race_detector` + `scenario::race_check` (key-sorted JSON canonicalization, divergence reporting). - - `mcp-loadtest cross --server "..." --server "..."` (side-by-side multi-server comparison with grading). - - `ProcessStats` enriched with `peak_fd`/`final_fd`/`peak_threads`/`final_threads` (best-effort: Linux /proc/fd; macOS/Windows degrade to 0). - - `Soak` linear-regression-based leak signal helper (`detect_leak`). + - `tui::dashboard` (ratatui + crossterm live polling; quits on q/Esc, propagates cancel). + - `analysis::race_detector` + `scenario::race_check` (key-sorted JSON canonicalization, divergence reporting). + - `mcp-loadtest cross --server "..." --server "..."` (side-by-side multi-server comparison with grading). + - `ProcessStats` enriched with `peak_fd`/`final_fd`/`peak_threads`/`final_threads` (best-effort: Linux /proc/fd; macOS/Windows degrade to 0). + - `Soak` linear-regression-based leak signal helper (`detect_leak`). - M7 differentiators v2 + v0.1.0 polish (4-agent parallel sprint): - - `scenario::fuzzer` (FuzzPayload enum: UnknownMethod / NumericMethod / GiantPayload / ControlChars / Nested / NullParams / StringParams; raw-byte variants documented + skipped pending Transport::raw_send hook). - - `analysis::fuzz_report` (FuzzClass classification + has_critical signal). - - `analysis::coverage` (CoverageReport: registered vs exercised tools + `coverage_pct`). - - `ToolSlo` per-tool latency budget assertions in `ThresholdsConfig`. - - `Recorder::record_tool` + `snapshot_per_tool` (per-tool counters; existing `record`/`snapshot` aggregate untouched for back-compat). - - `mcp-loadtest serve --mcp` self-hosted MCP server (DESIGN §21.2 differentiator) — exposes `deadlock_probe`/`sustained_load`/`compare_runs` as MCP tools so AI agents drive load tests directly via stdio JSON-RPC. - - README rewritten to lead with the deadlock demo + competitive positioning vs. reaatech. - - `docs/examples/{ci-integration,custom-scenario,debugging-deadlocks}.md` cookbook. + - `scenario::fuzzer` (FuzzPayload enum: UnknownMethod / NumericMethod / GiantPayload / ControlChars / Nested / NullParams / StringParams; raw-byte variants documented + skipped pending Transport::raw_send hook). + - `analysis::fuzz_report` (FuzzClass classification + has_critical signal). + - `analysis::coverage` (CoverageReport: registered vs exercised tools + `coverage_pct`). + - `ToolSlo` per-tool latency budget assertions in `ThresholdsConfig`. + - `Recorder::record_tool` + `snapshot_per_tool` (per-tool counters; existing `record`/`snapshot` aggregate untouched for back-compat). + - `mcp-loadtest serve --mcp` self-hosted MCP server (DESIGN §21.2 differentiator) — exposes `deadlock_probe`/`sustained_load`/`compare_runs` as MCP tools so AI agents drive load tests directly via stdio JSON-RPC. + - README rewritten to lead with the deadlock demo + competitive positioning vs. reaatech. + - `docs/examples/{ci-integration,custom-scenario,debugging-deadlocks}.md` cookbook. - Post-M7 competitive-gap close (3-agent parallel sprint, post-review): - - `scenario::spike` — sudden-burst concurrency pattern (baseline → spike window → cooldown). Closes the reaatech parity gap. - - `report::html::HtmlReporter` — self-contained `report.html` with inline SVG histogram, escaped HTML, no external CDN or JS. Closes the IBM/spbiju enterprise-report gap. Wired into the CLI as the `"html"` output format. - - `protocol::transport::ws::WsTransport` — WebSocket transport via tokio-tungstenite (rustls + webpki-roots); 16 MB per-frame OOM cap mirroring stdio. Activates the `"ws"` scheme that was previously parser-accepted but rejected at runtime. - - SECURITY.md — security disclosure policy at repo root (in-scope vs out-of-scope surfaces, reporting flow, recent hardening notes). + - `scenario::spike` — sudden-burst concurrency pattern (baseline → spike window → cooldown). Closes the reaatech parity gap. + - `report::html::HtmlReporter` — self-contained `report.html` with inline SVG histogram, escaped HTML, no external CDN or JS. Closes the IBM/spbiju enterprise-report gap. Wired into the CLI as the `"html"` output format. + - `protocol::transport::ws::WsTransport` — WebSocket transport via tokio-tungstenite (rustls + webpki-roots); 16 MB per-frame OOM cap mirroring stdio. Activates the `"ws"` scheme that was previously parser-accepted but rejected at runtime. + - SECURITY.md — security disclosure policy at repo root (in-scope vs out-of-scope surfaces, reporting flow, recent hardening notes). - 12 new tests bring the suite to ~255 passing (was 243): spike happy path + 5 HTML reporter (escaping, chart, violation styling) + 2 WS (echo roundtrip + scheme rejection) + 1 config parse-spike-kind + scenario name/schema asserts. ### Refactor / cleanup + - `report::common` extracted (post-M4 `/simplify` pass): `fmt_duration`, `fmt_count`, `format_server_command`, `describe_failure`, `format_iso8601_utc` shared between markdown + terminal reporters (-69 LoC net). - `ThresholdViolation.metric: String` → `kind: ThresholdKind` enum (post-M3 QF-5); JSON wire format preserved via `#[serde(rename = "metric")]` + per-variant snake_case rename. - `Report::passed()` now treats `deadlock_count > 0` as a hard failure (post-M3 QF-1). @@ -89,45 +93,62 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - ADR 0004 — strategic decision to compete head-on (Path A) over contributing to reaatech (Path B) or repositioning (Path C). ### Fixed + - `Soak::leak_threshold_mb_per_sec` renamed to `latency_drift_ms_per_sec` (the units were always ms/sec; the old name lied). - Fuzzer: skipped raw-transport iterations no longer bump `total_calls` or pollute `CallOutcome::Cancelled` (they never hit the wire). - Fuzzer: server-accepted malformed payloads now record `CallOutcome::Malformed` and bump `error_count` so threshold evaluators surface them. - Run: `memory_growth_mb` now compares `peak − final` instead of bare peak, so a steady-state high-RSS process no longer false-positives. ### Performance + - Sustained / ramp / soak scenarios: `tokio::task::yield_now()` instead of `sleep(ZERO)` to avoid registering no-op reactor timers. - Metrics: per-tool `BTreeMap` moved behind `RwLock`; fast path is now a read lock. - `cmd_cross`: cross-server runs use `futures::future::join_all` (was sequential, now N-way parallel). - Fuzzer: `LazyLock` for `GIANT_PAYLOAD` + `NESTED_PAYLOAD` so multi-MB / 100-deep payloads build once per process. - Release profile: `panic = "abort"` shaves ~600 KB off the stripped binary (5.7 MB → 5.1 MB on x86_64-pc-windows-gnu). - Hot-path zero-copy refactor (Phase 1 pre-public audit): - - `OutgoingRequest` / `OutgoingNotification` now borrow `method: &'a str` and `params: &'a P` (generic, `P: ?Sized + Serialize`). Eliminates the intermediate `serde_json::to_value()` round-trip on every tool call. - - `CallToolParams { name: &'a str, arguments: &'a Value }`. - - `Session::call_tool(&str, &Value)` — was `(&str, Value)`. Scenarios drop `.clone()` and pass `&self.args`; deep-clone of the JSON args tree per iteration is gone. + - `OutgoingRequest` / `OutgoingNotification` now borrow `method: &'a str` and `params: &'a P` (generic, `P: ?Sized + Serialize`). Eliminates the intermediate `serde_json::to_value()` round-trip on every tool call. + - `CallToolParams { name: &'a str, arguments: &'a Value }`. + - `Session::call_tool(&str, &Value)` — was `(&str, Value)`. Scenarios drop `.clone()` and pass `&self.args`; deep-clone of the JSON args tree per iteration is gone. - Transports: `pending: VecDeque` (id-mismatch buffer) is now capped at `MAX_PENDING_FRAMES = 256` in both `sse` and `ws`. Overflow surfaces `TransportError::Other` instead of growing without bound. ### Tests / benches + - Criterion microbenchmarks added under `crates/mcp-loadtest/benches/`: `record`, `histogram`, `session_loopback` (Transport-trait loopback, no I/O), `hang_detect`. Run with `cargo bench -p mcp-loadtest`. Closes the DESIGN.md §19 perf-claim gap. - Phase 3 coverage gaps closed (pre-publish audit): - - `tests/ramp.rs` — first integration test for the `ramp` scenario (was unit-only). - - `tests/spike.rs` — added `spike_against_crashing_server_survives_without_hang` (failure-mode coverage; uses `mock-crash.py`; assertions are crash-stochastic-aware so the test isn't flaky). - - `src/protocol/transport/ws.rs` — 3 new failure-mode tests: server-closes-mid-call, cancel-during-request, oversized-frame-rejected. - - `tests/reporter_snapshots.rs` — 5 new tests: substring landmarks for html + terminal (insta-snapshot parity with markdown + json was skipped because both reporters have too much structural variance for stable snapshots); empty-metrics renders without panic for html, terminal, and json (catches divide-by-zero in throughput math). + - `tests/ramp.rs` — first integration test for the `ramp` scenario (was unit-only). + - `tests/spike.rs` — added `spike_against_crashing_server_survives_without_hang` (failure-mode coverage; uses `mock-crash.py`; assertions are crash-stochastic-aware so the test isn't flaky). + - `src/protocol/transport/ws.rs` — 3 new failure-mode tests: server-closes-mid-call, cancel-during-request, oversized-frame-rejected. + - `tests/reporter_snapshots.rs` — 5 new tests: substring landmarks for html + terminal (insta-snapshot parity with markdown + json was skipped because both reporters have too much structural variance for stable snapshots); empty-metrics renders without panic for html, terminal, and json (catches divide-by-zero in throughput math). - Test suite: 260 passing (was 250 pre-Phase-3); 0 flakes in a 3-run check; 1 `#[ignore]` left (Vibe-Trading regression — requires external checkout). - `cold_start` scenario remains an intentional placeholder in v0.1.0; real handshake-time histogram measurement is queued for v0.2 (DESIGN §8). The existing `cold_start_is_an_inert_placeholder` test pins the placeholder contract so the v0.2 work is forced to update assertions. ### Changed + - `ServerConfig::stdio(command, args)` constructor + `split_server_command()` free fn relocated to `config` module (kills 4 + 3 hand-rolled literals across the codebase). - `classify_error` / `is_terminal_error` deduped into `scenario/mod.rs` (kills 3 byte-identical copies across pattern / ramp / soak). - Regression thresholds `P99_REGRESSION_PCT` + `ERROR_RATE_REGRESSION_PP` lifted into `analysis::regression` and shared between `cmd_compare` and `serve/tools::compare_runs`. +- Unified scenario builder: a single `build_scenario` factory now drives every config — `sustained` accepts weighted `patterns` / legacy `tool_call` arrays (via `PatternScenario`) alongside the single-`tool` path, and all M5–M7 kinds (`ramp` / `soak` / `spike` / `race_check` / `fuzzer` / `pattern`) dispatch through it. +- `cmd_run` split into private `builder` / `params` / `patterns` submodules (each under the 300 prod-LoC convention); public surface unchanged (`run_from_config` + the re-exported `parse_dur_str` `main.rs` shares with `deadlock-probe` / `cross`). ### Deprecated + - `DEFAULT_LEAK_THRESHOLD_MB_PER_SEC` → use `DEFAULT_LATENCY_DRIFT_MS_PER_SEC`. The old constant remains as an alias for one release and will be removed in v0.2.0. ### Notes + - ✅ The M8 file-split pass completed in the pre-publish review. All source files have production code (excluding `#[cfg(test)] mod tests`) under the 300-line convention. See `POST_PUBLISH_ISSUES.md` for the per-wave summary of what split where. - `serve` and `tui` modules will move behind cargo feature flags in a future release to keep the default build slim. - HTTP / SSE transport host-allowlist for SSRF defense is deferred. Currently mitigated by `Policy::none()` on redirects; the allowlist is operator-facing config and will land alongside the broader transport-hardening pass. +- ✅ Pre-publish security pass on the B1/B2 surface: bounded `protocol::schema` recursion (`MAX_SCHEMA_DEPTH`, defends strict mode against a maliciously deep server `inputSchema`); non-positive regression thresholds rejected at the CLI/MCP boundary (would otherwise invert the gate); `compare_runs` no longer echoes the raw caller path; `enum`-violation messages are length-capped. +- ✅ Pre-publish stabilization: `crates/mcp-loadtest` now `exclude`s the four `CLAUDE.md` scaffolding files from the published package (108 → 104 files); added `run_strict` CLI integration test covering the real `run --config` entrypoint with `[validation] strict = true` end-to-end (TOML → `Run::execute` → report-on-disk → non-zero gate); full pipeline (fmt, clippy `-D warnings`, `--locked` build/test, doc) verified on the **x86_64-pc-windows-msvc** target — the toolchain crates.io ships to Windows users. +- ✅ Supply-chain gate (`cargo deny check` / `cargo audit`): allowlisted `CDLA-Permissive-2.0` (the `webpki-roots` CA-bundle license, via `tokio-tungstenite`) and added three individually-triaged, documented advisory `ignore`s — RUSTSEC-2025-0052 (`async-std`, dev-only via `httpmock`), RUSTSEC-2024-0436 (`paste`, transitive), RUSTSEC-2026-0002 (`lru` unsoundness, transitive via `ratatui`, optional `tui` feature only, no semver fix). Rationale + revisit conditions in ADR 0011 + `POST_PUBLISH_ISSUES.md`; no actual vulnerabilities, gate still hard-fails on anything new. +- ✅ Pre-publish API-durability & standards pass (so v0.1.0's public surface is safe to commit to): + - `#[non_exhaustive]` on the `Config` family (`Config`/`ServerConfig`/`ScenarioConfig`/`ThresholdsConfig`/`OutputConfig`/`ValidationConfig`/`ConfigError`), the public error/outcome enums (`SessionError`, `RunError`, `CallOutcome`, `ThresholdKind`, `ReportError`, `TransportError`, `HangOutcome`, `SchemaPolicy`, `ValidationSite`) and `RunContext` — adding a field/variant is no longer a breaking change. New constructors keep ergonomic cross-crate construction: `Config::new` + `with_thresholds`/`with_output`/`with_validation`, `ScenarioConfig::new`, `OutputConfig::new`, `RunContext::new`. + - Crate-root docs rewritten: dropped the stale "v0.0.0 — scaffolding" status and replaced the fictional-API example with a real, compiling `no_run` one; `ValidationConfig` added to the public re-export facade. + - docs.rs builds `--all-features` (so `serve`/`tui` render); `missing_docs` raised to `deny` (lib + CLI); README repo links rewritten to absolute URLs so they resolve on crates.io. + - MSRV corrected `1.86` → **`1.88`** (the real floor — edition-2024 let-chains; verified by an actual `cargo build` on 1.88) with a new CI `msrv` job pinning it; modernized the clippy nits it surfaced (`collapsible_if`, `is_multiple_of`). + - Re-verified end-to-end on the **x86_64-pc-windows-msvc** target: fmt, clippy `-D warnings`, `--locked` build/test (incl. doctests), `doc -D warnings`, `cargo deny`, `cargo audit`, `publish --dry-run` (104 files). ## [0.1.0] — TBD diff --git a/Cargo.toml b/Cargo.toml index 59bf323..1f86194 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -12,7 +12,7 @@ license = "MIT OR Apache-2.0" repository = "https://github.com/Teerapat-Vatpitak/mcp-loadtest" homepage = "https://github.com/Teerapat-Vatpitak/mcp-loadtest" authors = ["Teerapat Vatpitak"] -rust-version = "1.86" +rust-version = "1.88" [workspace.dependencies] tokio = { version = "1", features = ["full"] } diff --git a/DESIGN.md b/DESIGN.md index 537c1f5..4393a98 100644 --- a/DESIGN.md +++ b/DESIGN.md @@ -19,6 +19,7 @@ The Model Context Protocol ecosystem is exploding — new MCP servers ship weekl ### Canonical motivating case The author hit a deadlock in [HKUDS/Vibe-Trading](https://github.com/HKUDS/Vibe-Trading) where: + - `initialize` → ✅ worked - `tools/list` → ✅ worked - `tools/call` → ❌ hung forever @@ -38,6 +39,7 @@ There are excellent tools for HTTP load testing (k6, vegeta, wrk2), gRPC (ghz), ## 2. Goals & Non-Goals ### Goals + - Detect deadlocks, hangs, livelocks under realistic concurrent load - Measure latency (p50/p95/p99), throughput, error rate - Work against any MCP server regardless of language / transport (stdio, HTTP, SSE, WebSocket) @@ -47,6 +49,7 @@ There are excellent tools for HTTP load testing (k6, vegeta, wrk2), gRPC (ghz), - Zero-config quick-start: `mcp-loadtest probe -s "python -m my_mcp"` should just work ### Non-Goals + - **Not a replacement for unit tests.** Different problem. - **Not a tool for testing MCP clients.** Client-side bugs are a separate domain. - **Not validating tool output correctness.** We test protocol-level behavior. If your tool returns wrong data, that's not what we catch. @@ -60,12 +63,14 @@ There are excellent tools for HTTP load testing (k6, vegeta, wrk2), gRPC (ghz), ### MCP protocol (relevant subset) JSON-RPC 2.0 framing over one of four transports: + - **stdio** — line-delimited JSON over child process stdin/stdout (most common, all examples in this doc focus here) - **HTTP** — Streamable HTTP (simple JSON variant); request via POST, simple JSON response - **HTTP+SSE** — request via POST, server pushes events via SSE channel - **WebSocket** — bidirectional frames Lifecycle (stdio): + ``` client → server {"method":"initialize", "params":{...}} client ← server {"result":{"protocolVersion":...,"capabilities":{...}}} @@ -78,14 +83,14 @@ client ← server {"result":{"content":[{...}]}} ### Bug classes we target -| Class | Example | Why hard to catch in unit tests | -|---|---|---| -| Lazy-init deadlock | Vibe-Trading PR #85 | Bug only surfaces with full subprocess + protocol handshake | -| Concurrent tool-call race | tools/call before tools/list completes | Need real concurrency; mocked async ≠ real async | -| Resource exhaustion | 1000 concurrent calls → fd / mem leak | Need sustained load | -| Slow-tool head-of-line | One slow tool blocks queue | Need mixed workload | -| Reconnect / mid-call kill | Connection drops between request and response | Hard to simulate without tooling | -| Notification ordering | Server sends `notifications/cancelled` mid-call | Need sequence-aware client | +| Class | Example | Why hard to catch in unit tests | +| ------------------------- | ----------------------------------------------- | ----------------------------------------------------------- | +| Lazy-init deadlock | Vibe-Trading PR #85 | Bug only surfaces with full subprocess + protocol handshake | +| Concurrent tool-call race | tools/call before tools/list completes | Need real concurrency; mocked async ≠ real async | +| Resource exhaustion | 1000 concurrent calls → fd / mem leak | Need sustained load | +| Slow-tool head-of-line | One slow tool blocks queue | Need mixed workload | +| Reconnect / mid-call kill | Connection drops between request and response | Hard to simulate without tooling | +| Notification ordering | Server sends `notifications/cancelled` mid-call | Need sequence-aware client | --- @@ -172,18 +177,18 @@ src/ ### Key crate dependencies -| Crate | Why | -|---|---| -| tokio | async runtime | -| serde / serde_json | JSON-RPC payloads | -| clap | CLI | -| toml | config | -| hdrhistogram | percentile latency | -| sysinfo | RSS/CPU per pid (cross-platform) | -| indicatif | terminal progress | -| tracing | structured logging | -| thiserror / anyhow | errors | -| tokio-util | LinesCodec for stdio framing | +| Crate | Why | +| ------------------ | -------------------------------- | +| tokio | async runtime | +| serde / serde_json | JSON-RPC payloads | +| clap | CLI | +| toml | config | +| hdrhistogram | percentile latency | +| sysinfo | RSS/CPU per pid (cross-platform) | +| indicatif | terminal progress | +| tracing | structured logging | +| thiserror / anyhow | errors | +| tokio-util | LinesCodec for stdio framing | No proc-macro magic. No "framework" — just composable structs. @@ -225,6 +230,7 @@ async fn no_deadlock_under_concurrent_calls() { ``` Design choices: + - **Builder pattern** — predictable, no struct-init explosion - **`execute()` returns `Result`** — lets tests pattern-match on metrics - **Report exposes raw histograms** — can drive custom assertions @@ -262,6 +268,7 @@ mcp-loadtest example-config > bench.toml ``` Output structure for each run: + ``` runs/2026-05-10T14-30-00/ ├── config.toml # exact config used @@ -325,23 +332,25 @@ formats = ["markdown", "json", "terminal"] ## 8. Built-in scenarios -| Scenario | Description | Detects | -|---|---|---| -| `cold_start` | Spawn → initialize → first tool call. Repeat N times. (M2 placeholder — needs session factory; tracked for follow-up.) | regression in startup time, init-time deadlocks | -| `sustained` | Constant load against one session for a fixed duration. Drives the multi-step weighted-random `pattern` engine internally. | baseline p99 latency, throughput, sustained error rate | -| `spike` | Baseline → sharp burst at peak concurrency for a fixed window → cooldown back to baseline. Models Black-Friday-style traffic spikes. | queue overflow, recovery behavior, fairness under burst | -| `ramp` | Step concurrency from `from` to `to` by `step_increment`, optionally feeding the per-step metrics into [`analysis::breaking_point`]. | finds break-point — concurrency where p99 explodes | -| `soak` | Long-duration steady load with periodic snapshots; pairs with `analysis::regression` for latency-drift and (via `ProcessSampler`) RSS-slope leak signals. | memory leaks, latency drift, throughput collapse over hours | -| `pattern` | Multi-step weighted-random tool-call sequences with per-pattern `think_time` and `ErrorBehavior`. Building block used directly by `sustained`. | realistic mixed workloads (explore-then-act, read-then-write) | -| `deadlock_probe` | initialize → tools/list → fire N `tools/call` to same tool wrapped in `hang_detect`. Bails on first deadlock to avoid flooding a wedged session. | the **Vibe-Trading bug class** specifically | -| `race_check` | Issue N identical `tools/call` and run the responses through `analysis::race_detector` (key-sorted JSON canonicalization). | non-determinism / divergent responses to identical inputs | -| `fuzzer` | Cycle through enumerated malformed-but-plausible payloads (unknown method, numeric method, giant payload, control chars, deep-nested, null/string params); classify each via `analysis::fuzz_report`. | parser bugs, type-confusion in method dispatch | +| Scenario | Description | Detects | +| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- | +| `cold_start` | Spawn → initialize → first tool call. Repeat N times. (M2 placeholder — needs session factory; tracked for follow-up.) | regression in startup time, init-time deadlocks | +| `sustained` | Constant load against one session for a fixed duration. Drives the multi-step weighted-random `pattern` engine internally. | baseline p99 latency, throughput, sustained error rate | +| `spike` | Baseline → sharp burst at peak concurrency for a fixed window → cooldown back to baseline. Models Black-Friday-style traffic spikes. | queue overflow, recovery behavior, fairness under burst | +| `ramp` | Step concurrency from `from` to `to` by `step_increment`, optionally feeding the per-step metrics into [`analysis::breaking_point`]. | finds break-point — concurrency where p99 explodes | +| `soak` | Long-duration steady load with periodic snapshots; pairs with `analysis::regression` for latency-drift and (via `ProcessSampler`) RSS-slope leak signals. | memory leaks, latency drift, throughput collapse over hours | +| `pattern` | Multi-step weighted-random tool-call sequences with per-pattern `think_time` and `ErrorBehavior`. Building block used directly by `sustained`. | realistic mixed workloads (explore-then-act, read-then-write) | +| `deadlock_probe` | initialize → tools/list → fire N `tools/call` to same tool wrapped in `hang_detect`. Bails on first deadlock to avoid flooding a wedged session. | the **Vibe-Trading bug class** specifically | +| `race_check` | Issue N identical `tools/call` and run the responses through `analysis::race_detector` (key-sorted JSON canonicalization). | non-determinism / divergent responses to identical inputs | +| `fuzzer` | Cycle through enumerated malformed-but-plausible payloads (unknown method, numeric method, giant payload, control chars, deep-nested, null/string params); classify each via `analysis::fuzz_report`. | parser bugs, type-confusion in method dispatch | Deferred to v0.2: + - `slow_mix` — 80% calls to a fast tool, 20% to a deliberately-slow tool (head-of-line blocking, fairness). Approximable today by configuring a multi-step `pattern` with weighted tools. - `reconnect` — drop session mid-call (close stdin), spawn new session, retry (resilience, leftover state, zombies). Needs the session pool that lands in M8+. Each scenario is an `impl Scenario` with two methods: + ```rust trait Scenario { async fn drive(&self, session: SessionPool, ctx: RunContext) -> ScenarioOutcome; @@ -357,27 +366,29 @@ trait Scenario { Mock MCP servers in `tests/fixtures/`. Each is a tiny Python script (chosen for ubiquity, not Rust, to make the test environment realistic). -| Mock | Behavior | Tests | -|---|---|---| -| `mock-normal.py` | Echoes args, responds in 1ms | happy-path metrics shape | -| `mock-slow.py` | Tool sleeps 2s | latency histogram correctness | -| `mock-broken.py` | Hangs on first tools/call (replicates Vibe-Trading bug) | `deadlock_probe` correctly classifies | -| `mock-crash.py` | Panics on 1% of calls | error-rate accuracy | -| `mock-leak.py` | Allocates 10 KB/call, never frees | `leak` scenario detects | -| `mock-error.py` | Returns JSON-RPC errors per spec | error classification | -| `mock-slow-init.py` | Takes 5s to respond to `initialize` | `cold_start` measures correctly | -| `mock-malformed.py` | Returns invalid JSON occasionally | parser robustness | +| Mock | Behavior | Tests | +| ------------------- | ------------------------------------------------------- | ------------------------------------- | +| `mock-normal.py` | Echoes args, responds in 1ms | happy-path metrics shape | +| `mock-slow.py` | Tool sleeps 2s | latency histogram correctness | +| `mock-broken.py` | Hangs on first tools/call (replicates Vibe-Trading bug) | `deadlock_probe` correctly classifies | +| `mock-crash.py` | Panics on 1% of calls | error-rate accuracy | +| `mock-leak.py` | Allocates 10 KB/call, never frees | `leak` scenario detects | +| `mock-error.py` | Returns JSON-RPC errors per spec | error classification | +| `mock-slow-init.py` | Takes 5s to respond to `initialize` | `cold_start` measures correctly | +| `mock-malformed.py` | Returns invalid JSON occasionally | parser robustness | Test invariant: for each (scenario × mock) pair, the report's machine-readable summary contains expected fields with expected ranges. This is the bulk of integration tests. ### Layer B — does it catch real bugs? Snapshot test against a known-buggy commit of Vibe-Trading: + - Pin to commit `~PR-85` (just before the fix) - Run `deadlock_probe` scenario - Assert: report flags ≥1 deadlock, identifies `tools/call` as the offending request Re-run against post-fix commit: + - Same scenario, expect 0 deadlocks This is the killer demo. It goes in the README. @@ -390,26 +401,27 @@ CI matrix: ubuntu-latest, macos-latest, windows-latest × stable Rust × Python ## 10. Milestones (revised 2026-05-10 — head-on competition with reaatech/mcp-load-test) -Original 3-week plan replaced after discovering [reaatech/mcp-load-test](https://github.com/reaatech/mcp-load-test) ships v0.1 functionality already (see §10.5 for parity matrix). v0.1.0 of mcp-loadtest must reach feature parity *and* surface our differentiators before re-publishing. +Original 3-week plan replaced after discovering [reaatech/mcp-load-test](https://github.com/reaatech/mcp-load-test) ships v0.1 functionality already (see §10.5 for parity matrix). v0.1.0 of mcp-loadtest must reach feature parity _and_ surface our differentiators before re-publishing. Repo was private through the M1-M7 development phase. The new public repo URL will be added once v0.1.0 is published to crates.io. M1 through M7 are all shipped. Post-M7 work (spike scenario, HTML reporter, WebSocket transport, hot-path zero-copy refactor, criterion benches) is captured under `[Unreleased]` in CHANGELOG rather than as a new milestone — the work is small + cohesive enough that bundling it into v0.1.0 makes more sense than coining "M8" for it. The "Week N" column is dropped because milestones are no longer time-boxed — they're released. -| M | Theme | Key deliverables | -|---|---|---| -| **M1** ✓ | stdio Session | `Session::spawn` → handshake → `list_tools`/`call_tool`/`shutdown`; mock-normal.py; happy-path integration test | -| **M2** ✓ | Scenarios + metrics core | `Scenario` trait; `cold_start` + `sustained` + `deadlock_probe` impls; `hang_detector` (§15.1); hdrhistogram metrics; mocks `mock-broken`/`mock-slow`/`mock-crash` + tests | -| **M3** ✓ | Reports + first internal release | TOML config; markdown / JSON / console reporters; sysinfo-based process sampling; **regression test against real Vibe-Trading commit ~PR-85** | -| **M4** ✓ | Transport parity | HTTP transport (StreamableHTTP); SSE transport; HTTP/SSE fixtures; transport-aware concurrency profiles | -| **M5** ✓ | Analysis parity | `breaking_point` detection; performance grading (A-F per latency/concurrency/error); realistic patterns (explore-then-act, read-then-write, multi-step) with weighted random + think-time; `soak` scenario polish; `compare-baselines` subcommand | -| **M6** ✓ | Differentiators v1 | Real-time terminal TUI dashboard (live latency/throughput/RSS); server resource sampling beyond RSS (CPU, fd, threads); `race_detector` scenario; cross-server compare (`run --server srv-a --server srv-b`) | -| **M7** ✓ | Differentiators v2 + v0.1 polish | Protocol fuzzer (basic — random/malformed payloads); coverage tracking (tools registered vs. exercised); per-tool SLO assertions; README rewrite with competitive positioning; `cargo install` smoke test on all 3 OS | -| **Post-M7** ✓ | Pre-public-release close-out | Spike scenario; HTML reporter; WebSocket transport; hot-path zero-copy refactor; criterion benches (DESIGN §19 claims now reproducible). See CHANGELOG `[Unreleased]`. | -| **v0.1.0-rc** | Pre-publish review in flight | repo back to **public**; crates.io publish; HN/lobste.rs/r/rust announce | -| _M8+ stretch_ | Beyond | AI-assisted pattern generator; distributed mode (multi-worker); replay/record; PyO3 binding | +| M | Theme | Key deliverables | +| ------------- | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **M1** ✓ | stdio Session | `Session::spawn` → handshake → `list_tools`/`call_tool`/`shutdown`; mock-normal.py; happy-path integration test | +| **M2** ✓ | Scenarios + metrics core | `Scenario` trait; `cold_start` + `sustained` + `deadlock_probe` impls; `hang_detector` (§15.1); hdrhistogram metrics; mocks `mock-broken`/`mock-slow`/`mock-crash` + tests | +| **M3** ✓ | Reports + first internal release | TOML config; markdown / JSON / console reporters; sysinfo-based process sampling; **regression test against real Vibe-Trading commit ~PR-85** | +| **M4** ✓ | Transport parity | HTTP transport (StreamableHTTP); SSE transport; HTTP/SSE fixtures; transport-aware concurrency profiles | +| **M5** ✓ | Analysis parity | `breaking_point` detection; performance grading (A-F per latency/concurrency/error); realistic patterns (explore-then-act, read-then-write, multi-step) with weighted random + think-time; `soak` scenario polish; `compare-baselines` subcommand | +| **M6** ✓ | Differentiators v1 | Real-time terminal TUI dashboard (live latency/throughput/RSS); server resource sampling beyond RSS (CPU, fd, threads); `race_detector` scenario; cross-server compare (`run --server srv-a --server srv-b`) | +| **M7** ✓ | Differentiators v2 + v0.1 polish | Protocol fuzzer (basic — random/malformed payloads); coverage tracking (tools registered vs. exercised); per-tool SLO assertions; README rewrite with competitive positioning; `cargo install` smoke test on all 3 OS | +| **Post-M7** ✓ | Pre-public-release close-out | Spike scenario; HTML reporter; WebSocket transport; hot-path zero-copy refactor; criterion benches (DESIGN §19 claims now reproducible). See CHANGELOG `[Unreleased]`. | +| **v0.1.0-rc** | Pre-publish review in flight | repo back to **public**; crates.io publish; HN/lobste.rs/r/rust announce | +| _M8+ stretch_ | Beyond | AI-assisted pattern generator; distributed mode (multi-worker); replay/record; PyO3 binding | **Definition of done for v0.1.0:** + - `cargo install mcp-loadtest-cli` works on Linux/macOS/Windows. - `mcp-loadtest deadlock-probe -s "python -m vibe_trading_mcp"` reproduces the original bug on commit `~PR-85`. - All §10.5 parity-must-have rows are checked. @@ -424,40 +436,42 @@ reaatech/mcp-load-test as of 2026-05-10 (TS monorepo, 77 source files, ~50% of R ### Parity — features they have, we must match before re-publishing public -| Feature | reaatech | mcp-loadtest target | Milestone | -|---|---|---|---| -| stdio transport | ✓ | ✓ | M1 | -| HTTP (StreamableHTTP) transport | ✓ | ✓ | M4 | -| SSE transport | ✓ | ✓ | M4 | -| WebSocket transport | ✗ | ✓ | Post-M7 | -| Latency histograms p50/p95/p99/p999 per tool | ✓ | ✓ | M2 | -| Breaking point detection | ✓ | ✓ | M5 | -| Performance grading A-F | ✓ | ✓ | M5 | -| Soak / leak detection | ✓ | ✓ | M5 | -| Spike scenario | ✓ | ✓ | Post-M7 | -| Compare baselines | ✓ | ✓ | M5 | -| Realistic patterns (explore-then-act, multi-step) | ✓ | ✓ | M5 | -| Console + markdown + JSON reporters | ✓ | ✓ | M3 | -| HTML reporter (self-contained) | ✗ | ✓ | Post-M7 | -| Programmatic library API | ✓ | ✓ | M2/M3 | +| Feature | reaatech | mcp-loadtest target | Milestone | +| ------------------------------------------------- | -------- | ------------------- | --------- | +| stdio transport | ✓ | ✓ | M1 | +| HTTP (StreamableHTTP) transport | ✓ | ✓ | M4 | +| SSE transport | ✓ | ✓ | M4 | +| WebSocket transport | ✗ | ✓ | Post-M7 | +| Latency histograms p50/p95/p99/p999 per tool | ✓ | ✓ | M2 | +| Breaking point detection | ✓ | ✓ | M5 | +| Performance grading A-F | ✓ | ✓ | M5 | +| Soak / leak detection | ✓ | ✓ | M5 | +| Spike scenario | ✓ | ✓ | Post-M7 | +| Compare baselines | ✓ | ✓ | M5 | +| Realistic patterns (explore-then-act, multi-step) | ✓ | ✓ | M5 | +| Console + markdown + JSON reporters | ✓ | ✓ | M3 | +| HTML reporter (self-contained) | ✗ | ✓ | Post-M7 | +| Programmatic library API | ✓ | ✓ | M2/M3 | ### Differentiators — features we have/will have that they don't -| Feature | reaatech | mcp-loadtest | Why it matters | -|---|---|---|---| -| **Deadlock detection (`deadlock_probe`)** | ✗ | ✓ M2 | Lazy-init / async-worker bugs that break in prod. Direct response to Vibe-Trading PR #85. | -| **Race detector** | ✗ | ✓ M6 | Order-sensitive concurrent tool calls; finds protocol-level race bugs. | -| **Real-time TUI dashboard** | ✗ (post-hoc only) | ✓ M6 | Watch perf cliff happen live during a run. | -| **Cross-server compare** (run vs N targets) | partial (compare baselines = 2 runs) | ✓ M6 (1 run, N targets) | Side-by-side: vendor A vs vendor B vs your fork. | -| **Server resource sampling** (CPU/fd/threads/RSS over time) | ✗ (latency only) | ✓ M6 | Find resource exhaustion before throughput collapses. | -| **Protocol fuzzer (mcp-fuzz integrated)** | ✗ | ✓ M7 | Random/malformed payloads; finds parser bugs unit tests miss. | -| **Coverage tracking** (registered vs exercised tools) | ✗ | ✓ M7 | Catch silently-broken tools that nobody tests in CI. | -| **Per-tool SLO assertions** | partial (global) | ✓ M7 | Per-tool latency/error budgets in CI. | -| **Rust perf** + static binary | ✗ (Node runtime required) | ✓ | `cargo install` → single ~5MB binary; no Node toolchain. | -| **AI-assisted pattern generator** | ✗ | ⏳ M8 stretch | LLM reads tool schemas → generates realistic call sequences. | -| **Distributed mode** | ✗ | ⏳ M8 stretch | Multiple workers driving one server (high-RPS targets). | -| **Replay / record** | ✗ | ⏳ M8 stretch | Capture prod traffic, replay deterministically. | -| **Self-hosted as MCP server** (`mcp-loadtest serve --mcp`) | ✗ | ✓ M7 | AI agents (Claude, Cursor, etc.) call `deadlock_probe` / `compare` / `report` directly via MCP. Recursive: load-test an MCP using an MCP. | +| Feature | reaatech | mcp-loadtest | Why it matters | +| ----------------------------------------------------------- | ------------------------------------ | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Deadlock detection (`deadlock_probe`)** | ✗ | ✓ M2 | Lazy-init / async-worker bugs that break in prod. Direct response to Vibe-Trading PR #85. | +| **Race detector** | ✗ | ✓ M6 | Order-sensitive concurrent tool calls; finds protocol-level race bugs. | +| **Real-time TUI dashboard** | ✗ (post-hoc only) | ✓ M6 | Watch perf cliff happen live during a run. | +| **Cross-server compare** (run vs N targets) | partial (compare baselines = 2 runs) | ✓ M6 (1 run, N targets) | Side-by-side: vendor A vs vendor B vs your fork. | +| **Server resource sampling** (CPU/fd/threads/RSS over time) | ✗ (latency only) | ✓ M6 | Find resource exhaustion before throughput collapses. | +| **Protocol fuzzer (mcp-fuzz integrated)** | ✗ | ✓ M7 | Random/malformed payloads; finds parser bugs unit tests miss. | +| **Coverage tracking** (registered vs exercised tools) | ✗ | ✓ M7 | Catch silently-broken tools that nobody tests in CI. | +| **Per-tool SLO assertions** | partial (global) | ✓ M7 | Per-tool latency/error budgets in CI. | +| **Configurable regression thresholds** | ✗ (fixed) | ✓ v0.1 | `compare` CLI flags + `compare_runs` MCP args override p99 / error-rate / deadlock policy; defaults unchanged (ADR 0009). | +| **Protocol-aware assertions** | ✗ | ✓ v0.1 | Opt-in strict mode validates `tools/call` args vs the server's advertised `inputSchema`; mismatch → `ProtocolError` gates the run. Forward-compatible, off by default (ADR 0005/0010). | +| **Rust perf** + static binary | ✗ (Node runtime required) | ✓ | `cargo install` → single ~5MB binary; no Node toolchain. | +| **AI-assisted pattern generator** | ✗ | ⏳ M8 stretch | LLM reads tool schemas → generates realistic call sequences. | +| **Distributed mode** | ✗ | ⏳ M8 stretch | Multiple workers driving one server (high-RPS targets). | +| **Replay / record** | ✗ | ⏳ M8 stretch | Capture prod traffic, replay deterministically. | +| **Self-hosted as MCP server** (`mcp-loadtest serve --mcp`) | ✗ | ✓ M7 | AI agents (Claude, Cursor, etc.) call `deadlock_probe` / `compare` / `report` directly via MCP. Recursive: load-test an MCP using an MCP. | ### Strategic positioning (for README at v0.1.0) @@ -469,16 +483,16 @@ The README at re-publish must lead with the deadlock demo (replicated Vibe-Tradi ## 11. Decisions (resolved 2026-05-10) -| # | Question | Decision | Rationale | -|---|---|---|---| -| 1 | Crate name | **`mcp-loadtest`** (lib) + **`mcp-loadtest-cli`** (bin) | descriptive, discoverable, doesn't pigeonhole to "bench" | -| 2 | License | **MIT OR Apache-2.0** (dual) | Rust ecosystem standard; MIT for individuals, Apache-2.0 for corporate patent grant | -| 3 | Repo location | **`github.com/Teerapat-Vatpitak/mcp-loadtest`** | personal handle for v0.1; transfer to `mcp-tools/` org if/when sister projects emerge | -| 4 | MCP protocol versioning | v0.1 pin to spec v1.x, warn on mismatch; `--strict-protocol` flag for fail-on-mismatch; v0.2+ detect-and-adapt | ship v0.1 fast, add complexity when justified | -| 5 | `deadlock_probe` | **both** subcommand (`mcp-loadtest deadlock-probe -s "..."`) and scenario in `run --scenario deadlock_probe` | subcommand for newcomer UX, scenario for CI; near-zero implementation cost | -| 6 | Server stderr | **always capture** to `runs//server.stderr.log`; opt-in `--tee-stderr` to also stream | stderr critical for debugging; capture is cheap; tee opt-in to avoid CI log spam | -| 7 | Diff-vs-baseline mode | **defer to M5 stretch** | v0.1 emits JSON; users diff externally. Proper baseline storage + regression detection has too many edge cases for v0.1 | -| 8 | Library API → 1.0 | When **all three**: 3 months no breaking changes + 5+ external users + 1 real bug caught in wild | calendar time + adoption + value-prop validation, all required | +| # | Question | Decision | Rationale | +| --- | ----------------------- | -------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | +| 1 | Crate name | **`mcp-loadtest`** (lib) + **`mcp-loadtest-cli`** (bin) | descriptive, discoverable, doesn't pigeonhole to "bench" | +| 2 | License | **MIT OR Apache-2.0** (dual) | Rust ecosystem standard; MIT for individuals, Apache-2.0 for corporate patent grant | +| 3 | Repo location | **`github.com/Teerapat-Vatpitak/mcp-loadtest`** | personal handle for v0.1; transfer to `mcp-tools/` org if/when sister projects emerge | +| 4 | MCP protocol versioning | v0.1 pin to spec v1.x, warn on mismatch; `--strict-protocol` flag for fail-on-mismatch; v0.2+ detect-and-adapt | ship v0.1 fast, add complexity when justified | +| 5 | `deadlock_probe` | **both** subcommand (`mcp-loadtest deadlock-probe -s "..."`) and scenario in `run --scenario deadlock_probe` | subcommand for newcomer UX, scenario for CI; near-zero implementation cost | +| 6 | Server stderr | **always capture** to `runs//server.stderr.log`; opt-in `--tee-stderr` to also stream | stderr critical for debugging; capture is cheap; tee opt-in to avoid CI log spam | +| 7 | Diff-vs-baseline mode | **defer to M5 stretch** | v0.1 emits JSON; users diff externally. Proper baseline storage + regression detection has too many edge cases for v0.1 | +| 8 | Library API → 1.0 | When **all three**: 3 months no breaking changes + 5+ external users + 1 real bug caught in wild | calendar time + adoption + value-prop validation, all required | --- @@ -803,7 +817,7 @@ Simple, but worth specifying so the report's `passed()` is unambiguous. ## 16. Mock server specs -Mocks live in `tests/fixtures/.py`. Each is < 50 lines of Python — minimal MCP server using stdio + JSON-RPC by hand (no fastmcp dep, to avoid version coupling). Shipped fixtures: `mock-normal.py`, `mock-slow.py`, `mock-broken.py`, `mock-crash.py`, plus `mock-http-server.py` and `mock-sse-server.py` (transport parity coverage). Pseudocode for each below; entries marked *(planned for v0.2)* are documented for completeness but not yet shipped. +Mocks live in `tests/fixtures/.py`. Each is < 50 lines of Python — minimal MCP server using stdio + JSON-RPC by hand (no fastmcp dep, to avoid version coupling). Shipped fixtures: `mock-normal.py`, `mock-slow.py`, `mock-broken.py`, `mock-crash.py`, plus `mock-http-server.py` and `mock-sse-server.py` (transport parity coverage). Pseudocode for each below; entries marked _(planned for v0.2)_ are documented for completeness but not yet shipped. ### 16.1 mock-normal.py @@ -866,7 +880,7 @@ while True: # Stdlib http.server only. Used by SseTransport integration tests. ``` -### 16.7 mock-leak.py *(planned for v0.2)* +### 16.7 mock-leak.py _(planned for v0.2)_ ```python # Allocates 10 KB per tools/call into a module-global list. Never frees. @@ -875,7 +889,7 @@ while True: # (t, rss) series; a real leaking fixture is still useful for end-to-end coverage. ``` -### 16.8 mock-error.py *(planned for v0.2)* +### 16.8 mock-error.py _(planned for v0.2)_ ```python # Returns JSON-RPC errors per spec: -32601 method not found, @@ -883,13 +897,13 @@ while True: # Cycles through error codes per call. Tests error classification (§18). ``` -### 16.9 mock-slow-init.py *(planned for v0.2)* +### 16.9 mock-slow-init.py _(planned for v0.2)_ ```python # Sleeps 5s on `initialize` before responding. Tests cold_start measurement. ``` -### 16.10 mock-malformed.py *(planned for v0.2)* +### 16.10 mock-malformed.py _(planned for v0.2)_ ```python # Returns invalid JSON every 10th response (truncated, missing field). @@ -927,36 +941,52 @@ Stream-friendly. Can be processed with `jq` or any line-oriented tool. ```json { - "run_id": "01HXY...", - "started_at": "2026-05-10T07:30:00Z", - "duration_secs": 60.0, - "scenario": {"kind": "Sustained", "concurrent": 50, "duration_secs": 60.0}, - "latency_ms": { - "p50": 12.3, "p95": 45.6, "p99": 123.4, "p999": 456.7, - "min": 1.2, "max": 999.9, "mean": 23.4, "stddev": 18.7, - "count": 12345 - }, - "throughput": { - "total_requests": 12345, "successful_requests": 12300, - "requests_per_sec": 205.75 - }, - "errors": { - "total": 45, - "by_category": { - "Hang": 0, "Timeout": 5, "ServerError": 30, - "ProtocolError": 10, "Crash": 0, "Malformed": 0 - } - }, - "process": { - "peak_rss_mb": 156.3, "final_rss_mb": 142.1, - "avg_cpu_pct": 23.4 - }, - "deadlock_count": 0, - "hang_count": 0, - "threshold_violations": [ - {"metric": "p99_latency", "expected": "<=100ms", "actual": "123.4ms"} - ], - "passed": false + "run_id": "01HXY...", + "started_at": "2026-05-10T07:30:00Z", + "duration_secs": 60.0, + "scenario": { + "kind": "Sustained", + "concurrent": 50, + "duration_secs": 60.0 + }, + "latency_ms": { + "p50": 12.3, + "p95": 45.6, + "p99": 123.4, + "p999": 456.7, + "min": 1.2, + "max": 999.9, + "mean": 23.4, + "stddev": 18.7, + "count": 12345 + }, + "throughput": { + "total_requests": 12345, + "successful_requests": 12300, + "requests_per_sec": 205.75 + }, + "errors": { + "total": 45, + "by_category": { + "Hang": 0, + "Timeout": 5, + "ServerError": 30, + "ProtocolError": 10, + "Crash": 0, + "Malformed": 0 + } + }, + "process": { + "peak_rss_mb": 156.3, + "final_rss_mb": 142.1, + "avg_cpu_pct": 23.4 + }, + "deadlock_count": 0, + "hang_count": 0, + "threshold_violations": [ + { "metric": "p99_latency", "expected": "<=100ms", "actual": "123.4ms" } + ], + "passed": false } ``` @@ -975,32 +1005,38 @@ JSON Schema published at `schema/metrics.v1.json` for downstream tooling. **Started:** 2026-05-10 07:30:00 UTC ## Summary + - Total requests: 12,345 - Throughput: 205.75 req/s - Error rate: 0.36% -- Deadlocks: 0 Hangs: 0 +- Deadlocks: 0 Hangs: 0 ## Latency -| p50 | p95 | p99 | p999 | max | -|---|---|---|---|---| + +| p50 | p95 | p99 | p999 | max | +| ------ | ------ | -------------- | ------- | ------- | | 12.3ms | 45.6ms | **123.4ms** ❌ | 456.7ms | 999.9ms | (latency histogram ASCII chart here) ## Errors -| Category | Count | -|---|---| -| ServerError | 30 | -| ProtocolError | 10 | -| Timeout | 5 | + +| Category | Count | +| ------------- | ----- | +| ServerError | 30 | +| ProtocolError | 10 | +| Timeout | 5 | ## Process + Peak RSS: 156.3 MB · Final RSS: 142.1 MB · Avg CPU: 23.4% ## Threshold violations + - ❌ **p99_latency**: expected ≤100ms, got 123.4ms ## Trace + Full trace: `./trace.jsonl` (12,345 events, 8.2 MB) ``` @@ -1010,17 +1046,17 @@ Full trace: `./trace.jsonl` (12,345 events, 8.2 MB) Every failure is classified into exactly one category. Used for `ErrorStats.by_category` and reporting. -| Category | Definition | Example | -|---|---|---| -| `Hang` | No response within `hang_threshold`, but response arrived before grace_period expires | tool genuinely slow under contention | -| `Deadlock` | No response after `hang_threshold + grace_period` | Vibe-Trading PR #85 | -| `Timeout` | Client-side configured deadline exceeded (separate from hang_threshold) | network buffer full | -| `ServerError` | JSON-RPC error response with `code` in `[-32099..=-32000]` (server-defined) | tool returned business error | -| `ProtocolError` | JSON-RPC error with `code` `-32600..=-32603` (transport / spec violations) | malformed request rejected | -| `Crash` | Server process exited (non-zero or signal) during call | unhandled panic | -| `Malformed` | Response was not valid JSON or didn't match JSON-RPC schema | partial response, broken framing | -| `Disconnected` | Transport closed unexpectedly mid-call | broken pipe | -| `Cancelled` | Client cancelled the request before response | scenario shutdown | +| Category | Definition | Example | +| --------------- | ------------------------------------------------------------------------------------- | ------------------------------------ | +| `Hang` | No response within `hang_threshold`, but response arrived before grace_period expires | tool genuinely slow under contention | +| `Deadlock` | No response after `hang_threshold + grace_period` | Vibe-Trading PR #85 | +| `Timeout` | Client-side configured deadline exceeded (separate from hang_threshold) | network buffer full | +| `ServerError` | JSON-RPC error response with `code` in `[-32099..=-32000]` (server-defined) | tool returned business error | +| `ProtocolError` | JSON-RPC error with `code` `-32600..=-32603` (transport / spec violations) | malformed request rejected | +| `Crash` | Server process exited (non-zero or signal) during call | unhandled panic | +| `Malformed` | Response was not valid JSON or didn't match JSON-RPC schema | partial response, broken framing | +| `Disconnected` | Transport closed unexpectedly mid-call | broken pipe | +| `Cancelled` | Client cancelled the request before response | scenario shutdown | Classification precedence: top-down. A request that hangs and then the server crashes → classified as `Crash` (the terminal event), but trace.jsonl records both `hang` and `crash` events for forensics. @@ -1030,13 +1066,13 @@ Classification precedence: top-down. A request that hangs and then the server cr `mcp-loadtest` should never be the bottleneck. -| Aspect | Target | -|---|---| -| Driver per-request CPU overhead | < 50µs (excluding JSON serialization) | -| Memory per concurrent worker | < 100KB | -| Max sustainable concurrency on a 4-core laptop | ≥ 1000 workers | -| Trace file write throughput | ≥ 100k events/sec | -| Histogram update | lock-free per-worker, merged at end | +| Aspect | Target | +| ---------------------------------------------- | ------------------------------------- | +| Driver per-request CPU overhead | < 50µs (excluding JSON serialization) | +| Memory per concurrent worker | < 100KB | +| Max sustainable concurrency on a 4-core laptop | ≥ 1000 workers | +| Trace file write throughput | ≥ 100k events/sec | +| Histogram update | lock-free per-worker, merged at end | These are tested in `benches/` (criterion). v0.1 ships with reproducible numbers in the README. @@ -1050,6 +1086,7 @@ These are tested in `benches/` (criterion). v0.1 ships with reproducible numbers - Library MSRV (minimum supported Rust version): stable - 2 (e.g. if 1.85 is current stable, MSRV is 1.83). When to commit to 1.0: + - After 3 months of v0.x with no breaking changes - After 5+ external users have integrated - After at least 1 real bug caught in the wild and reported back @@ -1070,15 +1107,15 @@ mcp-loadtest is a tool that AI agents will both **operate** (Claude Code running The single most important AI-friendly feature. mcp-loadtest exposes itself as an MCP server with these tools: -| Tool | Args | Returns | -|---|---|---| -| `deadlock_probe` | `server_command`, `tool`, `concurrent` | `{ deadlock_count, hung_for_ms[], details }` | -| `sustained_load` | `server_command`, `concurrent`, `duration_secs`, `tool`, `args` | `{ p50_ms, p99_ms, error_rate, requests_per_sec }` | -| `compare_runs` | `baseline_run_dir`, `current_run_dir` | structured diff with regression flags | -| `report_summary` | `run_dir` | markdown summary string | -| `list_recent_runs` | `limit` | run dirs with metadata | +| Tool | Args | Returns | +| ------------------ | --------------------------------------------------------------- | -------------------------------------------------- | +| `deadlock_probe` | `server_command`, `tool`, `concurrent` | `{ deadlock_count, hung_for_ms[], details }` | +| `sustained_load` | `server_command`, `concurrent`, `duration_secs`, `tool`, `args` | `{ p50_ms, p99_ms, error_rate, requests_per_sec }` | +| `compare_runs` | `baseline_run_dir`, `current_run_dir` | structured diff with regression flags | +| `report_summary` | `run_dir` | markdown summary string | +| `list_recent_runs` | `limit` | run dirs with metadata | -A user can say to Claude / Cursor / any MCP-aware agent: *"Find deadlocks in my new MCP server at `python -m foo`"* — and the agent calls `deadlock_probe` directly. **No human-in-the-loop required to spawn a child process and parse stdout** — the agent gets structured JSON back. +A user can say to Claude / Cursor / any MCP-aware agent: _"Find deadlocks in my new MCP server at `python -m foo`"_ — and the agent calls `deadlock_probe` directly. **No human-in-the-loop required to spawn a child process and parse stdout** — the agent gets structured JSON back. **Reaatech doesn't do this.** It's our most under-priced differentiator. @@ -1094,6 +1131,7 @@ Hint: server may have crashed before responding. Check stderr at: ``` vs. the bad version: + ``` Error: BrokenPipe(Os { code: 32, ... }) ``` @@ -1130,6 +1168,7 @@ LLMs use this to plan the right invocation. Reduces "I tried it but it didn't do ### 21.6 `mcp-loadtest doctor` Diagnoses common setup issues: + - Python interpreter not on PATH (for fixture-based tests). - MSVC vs GNU toolchain mismatch on Windows. - Stale `runs/` accumulation. @@ -1158,6 +1197,7 @@ A report that says `p99 latency: 234ms` is data. A report that adds `"95% of use Per-scenario copy-pasteable commands + expected output. LLMs train on README-style examples; cookbook entries make those examples concrete and executable. Examples to ship at v0.1.0: + - "Find deadlocks in my new MCP server" - "Add a regression gate to my CI" - "Compare two implementations of the same MCP server" diff --git a/POST_PUBLISH_ISSUES.md b/POST_PUBLISH_ISSUES.md index 91e2b96..64d0c55 100644 --- a/POST_PUBLISH_ISSUES.md +++ b/POST_PUBLISH_ISSUES.md @@ -8,9 +8,23 @@ Each block below is in `gh issue create` shape — copy-paste-ready once the rep ## 🔒 Security +### `chore(deps): revisit triaged RUSTSEC ignores (ADR 0011)` + +**Body:** + +> `deny.toml` ignores three advisories, each justified in ADR 0011. Re-evaluate each release: +> +> - **RUSTSEC-2026-0002** (`lru` `IterMut` Stacked-Borrows unsoundness) — transitive via `ratatui 0.29` (`lru ^0.12`), only under the optional `tui` feature. Drop the ignore + bump once `ratatui` ships on patched `lru` (≥0.13). +> - **RUSTSEC-2025-0052** (`async-std` discontinued) — dev-only via `httpmock`. Drop when `httpmock` releases without `async-std` (or swap the mock). +> - **RUSTSEC-2024-0436** (`paste` unmaintained) — transitive proc-macro. Drop when the dep tree moves off `paste` (e.g. `pastey`). +> Action: `cargo deny check` after `cargo update`; if any becomes an actual vulnerability or `lru` reaches the default build, fix immediately rather than re-ignoring. + +**Labels:** `security`, `dependencies`, `chore` + ### `feat(transport): host-allowlist for HTTP/SSE/WS to defend against SSRF` **Body:** + > Currently the redirect policy is hardened to `Policy::none()` per [ADR 0007](docs/adr/0007-transport-security-posture.md), but an operator pointing the load tester at a malicious URL can still hit any internal endpoint resolvable from the machine running the test. v0.2 should add an operator-facing allowlist (TOML config: `server.allowed_hosts = ["app.example.com"]`) and reject `connect_async`/`reqwest::send` calls when the resolved host isn't matched. See CHANGELOG `[Unreleased]` Notes. **Labels:** `security`, `enhancement`, `v0.2` @@ -19,9 +33,10 @@ Each block below is in `gh issue create` shape — copy-paste-ready once the rep ## ⚡ Performance -### `perf(scenarios): switch `args: Value` to `Arc` for hot-loop sharing` +### `perf(scenarios): switch `args: Value`to`Arc` for hot-loop sharing` **Body:** + > Pre-publish Phase 1 audit found scenarios still deep-clone `Value` once per worker spawn (the `&Value` change in commit `c8dee52` cut per-call clones, not per-worker setup). Wrapping `Sustained.args` / `Pattern.args` etc. in `Arc` lets concurrent workers share the JSON tree by reference. Touches `Session::call_tool` signature → breaking v0.2 API change. **Labels:** `performance`, `breaking`, `v0.2` @@ -29,13 +44,15 @@ Each block below is in `gh issue create` shape — copy-paste-ready once the rep ### `perf(transport): drop double-parse in SSE/WS id-extract` **Body:** + > `extract_id` parses the entire JSON twice (once for id-probe, once for full body after match). Use `simd-json` or a streaming id-extractor (parse only the first `"id":N` key). Matters at >100K iter/s — not a v0.1 blocker. **Labels:** `performance`, `v0.2` -### `perf(session): drop `String` allocation in `stdio` line trim` +### `perf(session): drop `String`allocation in`stdio` line trim` **Body:** + > `stdio.rs::request` does `self.line_buf.trim_end().to_string()`. In-place truncate plus returning `&str` would save one alloc per call. Small win at high call rates. **Labels:** `performance`, `v0.2` @@ -47,6 +64,7 @@ Each block below is in `gh issue create` shape — copy-paste-ready once the rep ### `test(scenario): land real cold_start handshake-latency test` **Body:** + > `ColdStart` is intentionally a placeholder in v0.1.0; the integration test pins the inert-placeholder contract. v0.2: implement real cold-start sampling (`Session::reinitialize` loop + per-iteration histogram) and replace the placeholder assertion with a measured-latency assertion. **Labels:** `test`, `feature`, `v0.2` @@ -54,18 +72,21 @@ Each block below is in `gh issue create` shape — copy-paste-ready once the rep ### `test(fixtures): add `mock-leak.py`, `mock-error.py`, `mock-slow-init.py`, `mock-malformed.py`` **Body:** + > DESIGN.md §16 lists 10 mock fixtures; v0.1 ships 6 (normal/slow/broken/crash/http/sse). The 4 missing fixtures gate richer scenario coverage: +> > - `mock-leak.py` — RSS grows over time → exercises `soak::detect_leak` > - `mock-error.py` — returns JSON-RPC errors deterministically → exercises error-classification scenarios > - `mock-slow-init.py` — slow initialize handshake → exercises `cold_start` (post-real-impl) > - `mock-malformed.py` — emits malformed JSON → exercises fuzzer's defensive parse paths -> Each is < 50 lines of stdlib-only Python following `_common.py`. +> Each is < 50 lines of stdlib-only Python following `_common.py`. **Labels:** `test`, `v0.2` ### `test(bench): wire criterion benches into CI baseline comparison` **Body:** + > Phase 1 added `benches/{record,histogram,session_loopback,hang_detect}.rs`. v0.2: capture baseline numbers in `bench-baseline.json` and add a `cargo bench-check` CI step that flags regressions > 10% vs baseline (same threshold as `compare` subcommand). **Labels:** `test`, `ci`, `v0.2` @@ -97,6 +118,7 @@ Public API paths preserved via `pub use` re-exports throughout. 264 tests pass, ### `feat(transport): add `Transport::raw_send(&[u8])` hook for fuzzer raw-byte payloads` **Body:** + > `Fuzzer` currently skips raw-transport payloads (`GiantPayload` raw variant, etc.) because there's no API to bypass JSON-RPC framing. v0.2: add a `raw_send` method to the `Transport` trait that lets `Fuzzer` send arbitrary bytes. Enables full coverage of the malformed-input attack surface. **Labels:** `feature`, `v0.2` @@ -104,6 +126,7 @@ Public API paths preserved via `pub use` re-exports throughout. 264 tests pass, ### `feat(cli): `--capture-stderr` flag for stdio transport` **Body:** + > Currently the spawned MCP server's stderr inherits the parent's stderr. When `mcp-loadtest` runs as a child of an LLM agent, the target server's stderr blends into the agent's view. Add a flag to redirect to a per-run file (`runs//server-stderr.log`). **Labels:** `feature`, `v0.2` @@ -111,6 +134,7 @@ Public API paths preserved via `pub use` re-exports throughout. 264 tests pass, ### `feat(cli): docker-compose generator for the `cross` subcommand` **Body:** + > IBM's mcp-context-forge perf testing wanted Docker Compose multi-server setup. We have `cross` which drives N servers but doesn't scaffold the compose file. Add `mcp-loadtest cross --emit-compose > docker-compose.yml`. **Labels:** `feature`, `v0.2` @@ -118,6 +142,7 @@ Public API paths preserved via `pub use` re-exports throughout. 264 tests pass, ### `feat(report): HTML report charts using inline JS for interactivity` **Body:** + > Current HTML reporter uses inline SVG (static, no JS). v0.2 could optionally embed Chart.js + interactive percentile sliders. Stays self-contained (`