Status: v0.1.0-rc (2026-05-11)

Author: Teerapat Vatpitak

Reviewers: (pending)
The Model Context Protocol ecosystem is exploding — new MCP servers ship weekly across Python, Node, Rust. But MCP servers fail in ways unit tests don't catch:
- Lazy-init deadlocks inside async worker threads
- Race conditions when concurrent `tools/call` requests arrive before `tools/list` completes
- Memory leaks under sustained load
- Hangs that look like work-in-progress
- Subtle protocol violations only visible at scale
The author hit a deadlock in HKUDS/Vibe-Trading where:
- `initialize` → ✅ worked
- `tools/list` → ✅ worked
- `tools/call` → ❌ hung forever
Root cause: _get_registry() lazy-init inside the FastMCP asyncio worker thread, blocking on import src.tools.shell.*. Standard pytest didn't catch this — the bug only surfaces when a real client opens a session and calls a tool through stdio.
The fix took ~5 lines (PR #85) and a regression smoke test (PR #86) — but finding the bug took hours of differential testing because no purpose-built tool exists for stress-testing MCP servers.
There are excellent tools for HTTP load testing (k6, vegeta, wrk2), gRPC (ghz), GraphQL (autocannon plugins). There is nothing for MCP. Ad-hoc Python scripts and pytest fixtures are what people use.
mcp-loadtest aims to be the canonical tool: language-agnostic, transport-agnostic, with built-in scenarios for the bug classes that actually occur.
- Detect deadlocks, hangs, livelocks under realistic concurrent load
- Measure latency (p50/p95/p99), throughput, error rate
- Work against any MCP server regardless of language / transport (stdio, HTTP, SSE, WebSocket)
- Library mode (Rust crate) for embedding in CI tests
- CLI mode for ad-hoc smoke tests and benchmarks
- Cross-platform (Linux, macOS, Windows — author runs Windows so this is a hard requirement, not aspirational)
- Zero-config quick-start: `mcp-loadtest probe -s "python -m my_mcp"` should just work
- Not a replacement for unit tests. Different problem.
- Not a tool for testing MCP clients. Client-side bugs are a separate domain.
- Not validating tool output correctness. We test protocol-level behavior. If your tool returns wrong data, that's not what we catch.
- Not a fuzzer. A protocol fuzzer (random/malformed payloads) is a different design, possibly a future sister project (`mcp-fuzz`).
- Not a benchmark suite. We provide infrastructure to bench, not a curated set of "official" benchmarks.
JSON-RPC 2.0 framing over one of four transports:
- stdio — line-delimited JSON over child process stdin/stdout (most common, all examples in this doc focus here)
- HTTP — Streamable HTTP (simple JSON variant); request via POST, simple JSON response
- HTTP+SSE — request via POST, server pushes events via SSE channel
- WebSocket — bidirectional frames
Lifecycle (stdio):
client → server {"method":"initialize", "params":{...}}
client ← server {"result":{"protocolVersion":...,"capabilities":{...}}}
client → server {"method":"notifications/initialized"} # one-way notif
client → server {"method":"tools/list"}
client ← server {"result":{"tools":[{...},...]}}
client → server {"method":"tools/call", "params":{"name":"X","arguments":{...}}}
client ← server {"result":{"content":[{...}]}}
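
The lifecycle above leaves out the JSON-RPC envelope for brevity. A minimal Python sketch of how a stdio client might frame these messages, assuming the standard JSON-RPC 2.0 `jsonrpc` and `id` fields and one JSON object per line. The helper names (`frame_request`, `frame_notification`, `match_response`) are illustrative and not part of mcp-loadtest; the `initialize` params stay elided, as in the lifecycle.

```python
import itertools
import json

_ids = itertools.count(1)  # monotonically increasing request ids


def frame_request(method, params=None):
    """Return (id, line) for a JSON-RPC 2.0 request, newline-terminated."""
    req_id = next(_ids)
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return req_id, json.dumps(msg, separators=(",", ":")) + "\n"


def frame_notification(method, params=None):
    """Notifications carry no id, so the server sends no reply to them."""
    msg = {"jsonrpc": "2.0", "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg, separators=(",", ":")) + "\n"


def match_response(line, pending):
    """Correlate a response line with an outstanding request by id."""
    msg = json.loads(line)
    return pending.pop(msg["id"], None), msg


# Handshake order from the lifecycle: initialize, initialized, tools/list.
init_id, init_line = frame_request("initialize", {})  # params elided
notif_line = frame_notification("notifications/initialized")
list_id, list_line = frame_request("tools/list")
```

Keeping a `pending` map keyed by id is what allows responses to arrive out of order under concurrent load without being attributed to the wrong call.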
| Class | Example | Why hard to catch in unit tests |
|---|---|---|
| Lazy-init deadlock | Vibe-Trading PR #85 | Bug only surfaces with full subprocess + protocol handshake |
| Concurrent tool-call race | tools/call before tools/list completes | Need real concurrency; mocked async ≠ real async |
| Resource exhaustion | 1000 concurrent calls → fd / mem leak | Need sustained load |
| Slow-tool head-of-line | One slow tool blocks queue | Need mixed workload |
| Reconnect / mid-call kill | Connection drops between request and response | Hard to simulate without tooling |
| Notification ordering | Server sends notifications/cancelled mid-call | Need sequence-aware client |
┌─────────────────────────────────────────────────────────────┐
│ CLI / Library │
│ - parse args / config │
│ - construct Run │
│ - print report │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Run (orchestrator) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ServerManager │ │ ScenarioImpl │ │ Reporter │ │
│ │ │ │ │ │ │ │
│ │ spawn(), │ │ N tokio │ │ hdrhistogram │ │
│ │ kill(), │ │ worker │ │ process stats │ │
│ │ rss/cpu │ │ tasks │ │ → markdown/json │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────────────┘ │
│ │ │ │
│ └──────┬───────────┘ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ ProtocolSession │ │
│ │ stdio framing + JSON-RPC │ │
│ │ per-call timeout (hang det.) │ │
│ │ request/response correlation │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────┐
│ Server subprocess │ (under test — any language)
└─────────────────────┘
src/
├── main.rs # CLI entry (clap)
├── lib.rs # public crate API
├── config.rs # TOML schema + parsing
├── protocol/
│ ├── mod.rs # re-exports
│ ├── jsonrpc.rs # JSON-RPC 2.0 framing
│ ├── mcp.rs # MCP request/response types
│ ├── session.rs # ProtocolSession (per-connection state)
│ └── transport/
│ ├── mod.rs
│ ├── stdio.rs # M1
│ ├── http.rs # M4
│ ├── sse.rs # M4
│ └── ws.rs # Post-M7
├── server_manager.rs # spawn/kill, env, working_dir, sysinfo
├── scenario/
│ ├── mod.rs # Scenario trait
│ ├── cold_start.rs
│ ├── sustained.rs
│ ├── deadlock_probe.rs # the Vibe-Trading-style smoke
│ ├── spike.rs # Post-M7
│ ├── ramp.rs # M5
│ └── leak.rs # M5
├── driver.rs # tokio task pool, rate limiter
├── hang_detector.rs # per-call watchdog
├── metrics/
│ ├── mod.rs
│ ├── histogram.rs # hdrhistogram wrapper
│ ├── throughput.rs
│ └── process.rs # RSS/CPU/fd via sysinfo
├── report/
│ ├── mod.rs # Report struct
│ ├── markdown.rs
│ ├── json.rs
│ └── terminal.rs # indicatif progress + final summary
└── run.rs # orchestrator that ties it all together
| Crate | Why |
|---|---|
| tokio | async runtime |
| serde / serde_json | JSON-RPC payloads |
| clap | CLI |
| toml | config |
| hdrhistogram | percentile latency |
| sysinfo | RSS/CPU per pid (cross-platform) |
| indicatif | terminal progress |
| tracing | structured logging |
| thiserror / anyhow | errors |
| tokio-util | LinesCodec for stdio framing |
No proc-macro magic. No "framework" — just composable structs.
use mcp_loadtest::{Server, Scenario, Run, Thresholds};
use serde_json::json;
use std::time::Duration;
#[tokio::test]
async fn no_deadlock_under_concurrent_calls() {
let server = Server::stdio("python")
.args(["-m", "vibe_trading_mcp"])
.env("LOG_LEVEL", "warn");
let scenario = Scenario::sustained()
.concurrent(20)
.duration(Duration::from_secs(30))
.tool("get_market_data")
.args(json!({ "ticker": "AAPL" }));
let report = Run::new(server, scenario)
.with_thresholds(Thresholds {
p99_latency: Some(Duration::from_millis(500)),
error_rate: Some(0.01),
hang_timeout: Duration::from_secs(5),
..Default::default()
})
.execute()
.await
.expect("run failed");
assert!(report.passed(), "thresholds violated: {report:?}");
assert_eq!(report.deadlock_count, 0);
}

Design choices:
- Builder pattern: predictable, no struct-init explosion
- `execute()` returns `Result<Report, RunError>`, which lets tests pattern-match on metrics
- `Report` exposes raw histograms, which can drive custom assertions
- No global state: running multiple `Run`s in parallel inside one test process is supported
# Quick health check (initialize + tools/list + 1 call per tool)
mcp-loadtest probe --server "python -m my_mcp"
# Targeted smoke for the Vibe-Trading bug class
mcp-loadtest deadlock-probe --server "python -m my_mcp" --tool get_market_data
# Run a custom scenario from CLI
mcp-loadtest run \
--server "python -m my_mcp" \
--scenario sustained \
--concurrent 50 --duration 60s \
--tool get_market_data --args '{"ticker":"AAPL"}'
# Run from config file
mcp-loadtest run --config bench.toml
# Re-render saved run
mcp-loadtest report ./runs/2026-05-10T14-30-00/
# List built-in scenarios
mcp-loadtest list-scenarios
# Print example config
mcp-loadtest example-config > bench.toml

Output structure for each run:
runs/2026-05-10T14-30-00/
├── config.toml # exact config used
├── server.stdout.log # captured server stdout
├── server.stderr.log # captured server stderr
├── trace.jsonl # per-call records (request/response/duration/error)
├── metrics.json # aggregated metrics
├── report.md # human-readable summary
└── summary.json # CI-friendly pass/fail JSON
[server]
command = "python"
args = ["-m", "my_mcp"]
env.LOG_LEVEL = "warn"
working_dir = "/path/to/proj"
transport = "stdio" # stdio | http | sse | ws
startup_timeout = "10s" # how long to wait for initialize response
[scenario]
type = "sustained" # cold_start | sustained | spike | ramp | soak | pattern | deadlock_probe | race_check | fuzzer
duration = "60s"
concurrent = 50
# scenario-specific knobs
ramp_from = 1 # ramp only
ramp_to = 100 # ramp only
spike_at = "30s" # spike only
spike_multiplier = 10 # spike only
leak_check_interval = "10s" # leak only
# What to call. Multiple entries → weighted random selection.
[[scenario.tool_call]]
name = "get_market_data"
args = { ticker = "AAPL" }
weight = 1.0
[[scenario.tool_call]]
name = "analyze_options"
args = { ticker = "SPY", expiry = "2026-06-19" }
weight = 0.3
[thresholds]
p50_latency = "100ms"
p99_latency = "500ms"
error_rate = 0.01 # fraction; 0.01 = 1%
hang_timeout = "5s" # call considered hung if no response in this long
memory_growth_mb = 50 # fail if RSS grows by more than this MB during run
[output]
report_dir = "./runs"
formats = ["markdown", "json", "terminal"]

| Scenario | Description | Detects |
|---|---|---|
| `cold_start` | Spawn → initialize → first tool call. Repeat N times. (M2 placeholder: needs a session factory; tracked for follow-up.) | regression in startup time, init-time deadlocks |
| `sustained` | Constant load against one session for a fixed duration. Drives the multi-step weighted-random pattern engine internally. | baseline p99 latency, throughput, sustained error rate |
| `spike` | Baseline → sharp burst at peak concurrency for a fixed window → cooldown back to baseline. Models Black-Friday-style traffic spikes. | queue overflow, recovery behavior, fairness under burst |
| `ramp` | Step concurrency from `from` to `to` by `step_increment`, optionally feeding the per-step metrics into [`analysis::breaking_point`]. | finds the break-point: the concurrency where p99 explodes |
| `soak` | Long-duration steady load with periodic snapshots; pairs with `analysis::regression` for latency-drift and (via `ProcessSampler`) RSS-slope leak signals. | memory leaks, latency drift, throughput collapse over hours |
| `pattern` | Multi-step weighted-random tool-call sequences with per-pattern `think_time` and `ErrorBehavior`. Building block used directly by `sustained`. | realistic mixed workloads (explore-then-act, read-then-write) |
| `deadlock_probe` | `initialize` → `tools/list` → fire N `tools/call` requests at the same tool, each wrapped in `hang_detect`. Bails on first deadlock to avoid flooding a wedged session. | the Vibe-Trading bug class specifically |
| `race_check` | Issue N identical `tools/call` requests and run the responses through `analysis::race_detector` (key-sorted JSON canonicalization). | non-determinism / divergent responses to identical inputs |
| `fuzzer` | Cycle through enumerated malformed-but-plausible payloads (unknown method, numeric method, giant payload, control chars, deep-nested, null/string params); classify each via `analysis::fuzz_report`. | parser bugs, type-confusion in method dispatch |
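
The `race_check` row describes key-sorted JSON canonicalization. A minimal Python sketch of that idea, assuming responses are already parsed into plain dicts and lists. The function names `canonicalize` and `divergent_groups` are hypothetical, and this is not the crate's `analysis::race_detector` implementation.

```python
import json
from collections import Counter


def canonicalize(response):
    """Serialize with sorted keys so that key order alone never counts as a difference."""
    return json.dumps(response, sort_keys=True, separators=(",", ":"))


def divergent_groups(responses):
    """Group responses by canonical form; more than one group means divergence."""
    return Counter(canonicalize(r) for r in responses)


# Two responses that differ only in key order, plus one with a different value.
same = [{"a": 1, "b": [1, 2]}, {"b": [1, 2], "a": 1}]
different = same + [{"a": 2, "b": [1, 2]}]
```

Sorting keys applies to nested objects as well, but array order is preserved, so a server that returns list items in a non-deterministic order would still be flagged.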
Deferred to v0.2:
- `slow_mix`: 80% of calls to a fast tool, 20% to a deliberately slow tool (head-of-line blocking, fairness). Approximable today by configuring a multi-step `pattern` with weighted tools.
- `reconnect`: drop the session mid-call (close stdin), spawn a new session, retry (resilience, leftover state, zombies). Needs the session pool that lands in M8+.
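
Weighted tool selection, as configured through `[[scenario.tool_call]]` entries and reused by `pattern`, can be sketched in a few lines of Python. This assumes each entry is a dict with an optional `weight` defaulting to 1.0, mirroring the config keys; `pick_tool_call` is an illustrative name rather than a crate API.

```python
import random


def pick_tool_call(tool_calls, rng):
    """Pick one entry with probability proportional to its weight."""
    weights = [tc.get("weight", 1.0) for tc in tool_calls]
    return rng.choices(tool_calls, weights=weights, k=1)[0]


tool_calls = [
    {"name": "get_market_data", "args": {"ticker": "AAPL"}, "weight": 1.0},
    {"name": "analyze_options", "args": {"ticker": "SPY", "expiry": "2026-06-19"}, "weight": 0.3},
]

rng = random.Random(42)  # fixed seed so a run can be replayed
counts = {"get_market_data": 0, "analyze_options": 0}
for _ in range(10_000):
    counts[pick_tool_call(tool_calls, rng)["name"]] += 1
```

With weights 1.0 and 0.3, roughly 77% of calls (1.0 / 1.3) should land on `get_market_data`. Seeding the generator is one way to make a mixed workload reproducible between runs.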
Each scenario is an impl Scenario with two methods:
trait Scenario {
async fn drive(&self, session: SessionPool, ctx: RunContext) -> ScenarioOutcome;
fn config_schema() -> serde_json::Value;
}

Mock MCP servers live in tests/fixtures/. Each is a tiny Python script (Python rather than Rust, chosen for ubiquity and to make the test environment realistic).
| Mock | Behavior | Tests |
|---|---|---|
| `mock-normal.py` | Echoes args, responds in 1ms | happy-path metrics shape |
| `mock-slow.py` | Tool sleeps 2s | latency histogram correctness |
| `mock-broken.py` | Hangs on first `tools/call` (replicates Vibe-Trading bug) | `deadlock_probe` correctly classifies |
| `mock-crash.py` | Panics on 1% of calls | error-rate accuracy |
| `mock-leak.py` | Allocates 10 KB/call, never frees | leak scenario detects |
| `mock-error.py` | Returns JSON-RPC errors per spec | error classification |
| `mock-slow-init.py` | Takes 5s to respond to `initialize` | `cold_start` measures correctly |
| `mock-malformed.py` | Returns invalid JSON occasionally | parser robustness |
Test invariant: for each (scenario × mock) pair, the report's machine-readable summary contains expected fields with expected ranges. This is the bulk of integration tests.
Snapshot test against a known-buggy commit of Vibe-Trading:
- Pin to commit `~PR-85` (just before the fix)
- Run the `deadlock_probe` scenario
- Assert: report flags ≥1 deadlock, identifies `tools/call` as the offending request
Re-run against post-fix commit:
- Same scenario, expect 0 deadlocks
This is the killer demo. It goes in the README.
CI matrix: ubuntu-latest, macos-latest, windows-latest × stable Rust × Python 3.13 (for fixtures).
Original 3-week plan replaced after discovering reaatech/mcp-load-test ships v0.1 functionality already (see §10.5 for parity matrix). v0.1.0 of mcp-loadtest must reach feature parity and surface our differentiators before re-publishing.
Repo was private through the M1-M7 development phase. v0.1.0 ships from a public repo via cargo install --git + prebuilt GitHub Release binaries; the crates.io publish is deferred so that the first release does not land on an append-only registry (ADR 0015, which amends the distribution channel of ADR 0004).
M1 through M7 are all shipped. Post-M7 work (spike scenario, HTML reporter, WebSocket transport, hot-path zero-copy refactor, criterion benches) is captured under [Unreleased] in CHANGELOG rather than as a new milestone — the work is small + cohesive enough that bundling it into v0.1.0 makes more sense than coining "M8" for it. The "Week N" column is dropped because milestones are no longer time-boxed — they're released.
| M | Theme | Key deliverables |
|---|---|---|
| M1 ✓ | stdio Session | Session::spawn → handshake → list_tools/call_tool/shutdown; mock-normal.py; happy-path integration test |
| M2 ✓ | Scenarios + metrics core | Scenario trait; cold_start + sustained + deadlock_probe impls; hang_detector (§15.1); hdrhistogram metrics; mocks mock-broken/mock-slow/mock-crash + tests |
| M3 ✓ | Reports + first internal release | TOML config; markdown / JSON / console reporters; sysinfo-based process sampling; regression test against real Vibe-Trading commit ~PR-85 |
| M4 ✓ | Transport parity | HTTP transport (StreamableHTTP); SSE transport; HTTP/SSE fixtures; transport-aware concurrency profiles |
| M5 ✓ | Analysis parity | breaking_point detection; performance grading (A-F per latency/concurrency/error); realistic patterns (explore-then-act, read-then-write, multi-step) with weighted random + think-time; soak scenario polish; compare-baselines subcommand |
| M6 ✓ | Differentiators v1 | Real-time terminal TUI dashboard (live latency/throughput/RSS); server resource sampling beyond RSS (CPU, fd, threads); race_detector scenario; cross-server compare (run --server srv-a --server srv-b) |
| M7 ✓ | Differentiators v2 + v0.1 polish | Protocol fuzzer (basic — random/malformed payloads); coverage tracking (tools registered vs. exercised); per-tool SLO assertions; README rewrite with competitive positioning; cargo install smoke test on all 3 OS |
| Post-M7 ✓ | Pre-public-release close-out | Spike scenario; HTML reporter; WebSocket transport; hot-path zero-copy refactor; criterion benches (DESIGN §19 claims now reproducible). See CHANGELOG [Unreleased]. |
| v0.1.0-rc | Pre-publish review in flight | repo back to public; cargo install --git + GitHub Release binaries (crates.io deferred — ADR 0015); HN/lobste.rs/r/rust announce |
| M8+ stretch | Beyond | AI-assisted pattern generator; distributed mode (multi-worker); replay/record; PyO3 binding |
Definition of done for v0.1.0:
- `cargo install --git <repo-url> mcp-loadtest-cli` works on Linux/macOS/Windows, and prebuilt binaries are attached to the GitHub Release (crates.io publish deferred, per ADR 0015).
- `mcp-loadtest deadlock-probe -s "python -m vibe_trading_mcp"` reproduces the original bug on commit `~PR-85`.
- All §10.5 parity-must-have rows are checked.
- All §10.5 differentiator rows are checked.
- README has side-by-side comparison table vs. reaatech, citing concrete benchmarks.
reaatech/mcp-load-test as of 2026-05-10 (TS monorepo, 77 source files, ~50% of README claims fleshed out per file-size sampling).
| Feature | reaatech | mcp-loadtest target | Milestone |
|---|---|---|---|
| stdio transport | ✓ | ✓ | M1 |
| HTTP (StreamableHTTP) transport | ✓ | ✓ | M4 |
| SSE transport | ✓ | ✓ | M4 |
| WebSocket transport | ✗ | ✓ | Post-M7 |
| Latency histograms p50/p95/p99/p999 per tool | ✓ | ✓ | M2 |
| Breaking point detection | ✓ | ✓ | M5 |
| Performance grading A-F | ✓ | ✓ | M5 |
| Soak / leak detection | ✓ | ✓ | M5 |
| Spike scenario | ✓ | ✓ | Post-M7 |
| Compare baselines | ✓ | ✓ | M5 |
| Realistic patterns (explore-then-act, multi-step) | ✓ | ✓ | M5 |
| Console + markdown + JSON reporters | ✓ | ✓ | M3 |
| HTML reporter (self-contained) | ✗ | ✓ | Post-M7 |
| Programmatic library API | ✓ | ✓ | M2/M3 |
| Feature | reaatech | mcp-loadtest | Why it matters |
|---|---|---|---|
| Deadlock detection (`deadlock_probe`) | ✗ | ✓ M2 | Lazy-init / async-worker bugs that break in prod. Direct response to Vibe-Trading PR #85. |
| Race detector | ✗ | ✓ M6 | Order-sensitive concurrent tool calls; finds protocol-level race bugs. |
| Real-time TUI dashboard | ✗ (post-hoc only) | ✓ M6 | Watch perf cliff happen live during a run. |
| Cross-server compare (run vs N targets) | partial (compare baselines = 2 runs) | ✓ M6 (1 run, N targets) | Side-by-side: vendor A vs vendor B vs your fork. |
| Server resource sampling (CPU/fd/threads/RSS over time) | ✗ (latency only) | ✓ M6 | Find resource exhaustion before throughput collapses. |
| Protocol fuzzer (mcp-fuzz integrated) | ✗ | ✓ M7 | Random/malformed payloads; finds parser bugs unit tests miss. |
| Coverage tracking (registered vs exercised tools) | ✗ | ✓ M7 | Catch silently-broken tools that nobody tests in CI. |
| Per-tool SLO assertions | partial (global) | ✓ M7 | Per-tool latency/error budgets in CI. |
| Configurable regression thresholds | ✗ (fixed) | ✓ v0.1 | compare CLI flags + compare_runs MCP args override p99 / error-rate / deadlock policy; defaults unchanged (ADR 0009). |
| Protocol-aware assertions | ✗ | ✓ v0.1 | Opt-in strict mode validates tools/call args vs the server's advertised inputSchema; mismatch → ProtocolError gates the run. Forward-compatible, off by default (ADR 0005/0010). |
| Rust perf + static binary | ✗ (Node runtime required) | ✓ | cargo install → single ~5MB binary; no Node toolchain. |
| AI-assisted pattern generator | ✗ | ⏳ M8 stretch | LLM reads tool schemas → generates realistic call sequences. |
| Distributed mode | ✗ | ⏳ M8 stretch | Multiple workers driving one server (high-RPS targets). |
| Replay / record | ✗ | ⏳ M8 stretch | Capture prod traffic, replay deterministically. |
| Self-hosted as MCP server (`mcp-loadtest serve --mcp`) | ✗ | ✓ M7 | AI agents (Claude, Cursor, etc.) call `deadlock_probe` / `compare` / `report` directly via MCP. Recursive: load-test an MCP using an MCP. |
mcp-loadtest is a load tester + bug detector for MCP servers. Match-or-exceed reaatech/mcp-load-test on every load-testing dimension, and detect classes of bugs no other tool finds: deadlocks, races, resource leaks, coverage gaps.
The README at re-publish must lead with the deadlock demo (replicated Vibe-Trading PR #85 bug, caught in 2 seconds) — not the load-testing checklist. Differentiation first; parity proves we're serious.
| # | Question | Decision | Rationale |
|---|---|---|---|
| 1 | Crate name | `mcp-loadtest` (lib) + `mcp-loadtest-cli` (bin) | descriptive, discoverable, doesn't pigeonhole to "bench" |
| 2 | License | MIT OR Apache-2.0 (dual) | Rust ecosystem standard; MIT for individuals, Apache-2.0 for corporate patent grant |
| 3 | Repo location | `github.com/Teerapat-Vatpitak/mcp-loadtest` | personal handle for v0.1; transfer to `mcp-tools/` org if/when sister projects emerge |
| 4 | MCP protocol versioning | v0.1 pins to spec v1.x and warns on mismatch; `--strict-protocol` flag for fail-on-mismatch; v0.2+ detect-and-adapt | ship v0.1 fast, add complexity when justified |
| 5 | `deadlock_probe` | both a subcommand (`mcp-loadtest deadlock-probe -s "..."`) and a scenario in `run --scenario deadlock_probe` | subcommand for newcomer UX, scenario for CI; near-zero implementation cost |
| 6 | Server stderr | always capture to `runs/<id>/server.stderr.log`; opt-in `--tee-stderr` to also stream | stderr critical for debugging; capture is cheap; tee opt-in to avoid CI log spam |
| 7 | Diff-vs-baseline mode | defer to M5 stretch | v0.1 emits JSON; users diff externally. Proper baseline storage + regression detection has too many edge cases for v0.1 |
| 8 | Library API → 1.0 | When all three: 3 months no breaking changes + 5+ external users + 1 real bug caught in wild | calendar time + adoption + value-prop validation, all required |
- `mcp-loadtest`: clear, no surprises
- `mcp-bench`: implies benchmarking specifically
- `mcphammer`: playful, memorable, but maybe too aggressive for a tool that aims to be canonical
- `mcptest`: too generic
- `mcp-stress`: accurate but slightly negative
- `lockesmith`: clever ("lock-finder for MCP servers") but obscure
Author's preference: mcp-loadtest for v0.1. Rename later if needed.
- `mcp-fuzz`: sister project for protocol fuzzing (random/malformed payloads)
- `mcp-trace`: record + replay tool for debugging production MCP issues
- Distributed mode: multiple loadtest workers driving one server (for very high RPS targets)
- GUI/web UI: render reports interactively
- Plugin system: user-defined scenarios as separate crates
- Public benchmark dataset: track perf of popular MCP servers over time (`mcp-leaderboard`)
Prioritized. Each item is a debt v0.1 explicitly took on; provenance in parentheses so a future planner can trace the contract. The bullets above remain the broader ecosystem horizon.
P1 — correctness / security debt promised in v0.1
- `cold_start` real handshake-time histogram: v0.1 ships an inert placeholder; the `cold_start_is_an_inert_placeholder` test pins the contract so this work must update it. Needs a session-spawning factory on `RunContext`. (DESIGN §8; CHANGELOG [0.1.0] Tests/benches)
- Result-side strict schema validation: v0.1 validates only `tools/call` arguments; extend the dependency-free validator to the `CallToolResult` payload. (ADR 0010; CHANGELOG [0.1.0] Added)
- DNS-rebinding defense (resolver-pinning connector): v0.1's SSRF guard blocks IP literals + enforces the host allowlist, but a hostname that resolves to a private IP is not blocked. (ADR 0012 "Open"; CHANGELOG [0.1.0] Security / Notes)
P2 — API / packaging hygiene due exactly at v0.2.0
- Remove deprecated alias `DEFAULT_LEAK_THRESHOLD_MB_PER_SEC`: kept one release as an alias for `DEFAULT_LATENCY_DRIFT_MS_PER_SEC`; removal is a documented breaking change for v0.2.0. (CHANGELOG [0.1.0] Deprecated)
- Feature-gate `serve` / `tui` behind cargo features: keep the default build slim; migrate docs/examples to show the feature flags. (CHANGELOG [0.1.0] Notes)
P3 — differentiators / ecosystem (longer horizon)
- Fuzzer raw-byte payloads: needs a `Transport::raw_send` hook; the raw variants are documented + skipped in v0.1. (CHANGELOG [0.1.0] Added, fuzzer)
- `insta` snapshot parity for html / terminal reporters: v0.1 asserts substring landmarks because both reporters have too much structural variance for stable snapshots. (CHANGELOG [0.1.0] Tests/benches)
- Sister projects: `mcp-fuzz`, `mcp-trace` (see §13 list above). (ADR 0004 Path C)
- M8+ stretch: distributed multi-worker, PyO3 binding, AI-assisted pattern generator (see §13 list above). (ADR 0004 Path C)
These are the public types code will hang off. Full definitions, not sketches.
pub struct Server {
pub command: String,
pub args: Vec<String>,
pub env: BTreeMap<String, String>, // BTreeMap for stable serialization
pub working_dir: Option<PathBuf>,
pub transport: Transport,
pub startup_timeout: Duration, // default 10s
pub shutdown_timeout: Duration, // default 5s; SIGTERM → wait → SIGKILL
}
#[derive(Clone, PartialEq, Eq)] // not `Copy`: the URL-carrying variants own heap data
pub enum Transport {
Stdio,
Http { url: String, headers: BTreeMap<String, String> }, // M4 — Streamable HTTP (simple JSON variant)
Sse { url: String, headers: BTreeMap<String, String> }, // M4
WebSocket { url: String }, // Post-M7
}
impl Server {
pub fn stdio(command: impl Into<String>) -> ServerBuilder { /* ... */ }
}

pub enum ScenarioKind {
ColdStart {
iterations: u32, // default 5
warmup: bool, // discard first iter — default true
},
Sustained {
concurrent: u32,
duration: Duration,
rate_limit: Option<u32>, // requests/sec cap; None = unbounded
},
Spike {
baseline_concurrent: u32,
spike_concurrent: u32,
baseline_duration: Duration,
spike_at: Duration,
spike_duration: Duration,
},
Ramp {
from_concurrent: u32,
to_concurrent: u32,
duration: Duration,
},
DeadlockProbe {
concurrent: u32, // default 20
hang_threshold: Duration, // default 5s
grace_period: Duration, // default 10s — after timeout, how long to wait for late responses
},
Soak {
concurrent: u32, // default 4
duration: Duration, // default 1h
sample_interval: Duration, // default 10s
latency_drift_ms_per_sec: f64, // fail if linear-regression slope on mean latency exceeds this
},
// M5+ ships additional kinds not detailed here for brevity:
// Pattern { steps, think_time, weight, error_behavior }
// RaceCheck { concurrent, tool, args }
// Fuzzer { iterations, seed, payloads }
// See crate::scenario::{pattern, race_check, fuzzer}.
}
pub struct Scenario {
pub kind: ScenarioKind,
pub tool_calls: Vec<ToolCall>, // weighted random selection
}
pub struct ToolCall {
pub name: String,
pub args: serde_json::Value,
pub weight: f64, // default 1.0
}

pub struct Run {
server: Server,
scenario: Scenario,
thresholds: Thresholds,
output_dir: Option<PathBuf>,
}
impl Run {
pub fn new(server: Server, scenario: Scenario) -> Self;
pub fn with_thresholds(self, t: Thresholds) -> Self;
pub fn with_output_dir(self, dir: PathBuf) -> Self;
pub async fn execute(self) -> Result<Report, RunError>;
}
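pub struct Thresholds {
    pub p50_latency: Option<Duration>,
    pub p95_latency: Option<Duration>,
    pub p99_latency: Option<Duration>,
    pub p999_latency: Option<Duration>,
    pub error_rate: Option<f64>, // 0.0..=1.0
    pub hang_timeout: Duration, // default 5s; used by hang_detector
    pub memory_growth_mb: Option<f64>,
}

// Implemented by hand: a derived `Default` would set `hang_timeout` to
// `Duration::ZERO` instead of the documented 5s.
impl Default for Thresholds {
    fn default() -> Self {
        Self {
            p50_latency: None,
            p95_latency: None,
            p99_latency: None,
            p999_latency: None,
            error_rate: None,
            hang_timeout: Duration::from_secs(5),
            memory_growth_mb: None,
        }
    }
}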
pub struct Report {
pub run_id: String, // ULID
pub started_at: SystemTime,
pub duration: Duration,
pub scenario_kind: ScenarioKind,
pub server_info: ServerInfo,
pub latency: LatencyStats,
pub throughput: ThroughputStats,
pub errors: ErrorStats,
pub process: ProcessStats,
pub deadlock_count: u32,
pub hang_count: u32,
pub trace_path: PathBuf,
pub threshold_violations: Vec<ThresholdViolation>,
}
impl Report {
pub fn passed(&self) -> bool { self.threshold_violations.is_empty() }
pub fn write_markdown(&self, path: &Path) -> io::Result<()>;
pub fn write_json(&self, path: &Path) -> io::Result<()>;
}
pub struct LatencyStats {
pub histogram: hdrhistogram::Histogram<u64>, // exposed for custom analysis
pub p50: Duration,
pub p95: Duration,
pub p99: Duration,
pub p999: Duration,
pub min: Duration,
pub max: Duration,
pub mean: Duration,
pub stddev: Duration,
pub count: u64,
}
pub struct ThroughputStats {
pub total_requests: u64,
pub successful_requests: u64,
pub requests_per_sec: f64,
pub timeline: Vec<(Duration, u64)>, // (offset, requests-completed-by-then) for charts
}
pub struct ErrorStats {
pub total: u64,
pub by_category: BTreeMap<ErrorCategory, u64>, // see §18
}
pub struct ProcessStats {
pub peak_rss_mb: f64,
pub final_rss_mb: f64,
pub avg_cpu_pct: f64,
pub samples: Vec<ProcessSample>,
}
pub struct ProcessSample {
pub at: Duration, // offset from run start
pub rss_mb: f64,
pub cpu_pct: f64,
}
pub struct ThresholdViolation {
pub metric: String, // e.g. "p99_latency"
pub expected: String, // e.g. "<= 500ms"
pub actual: String, // e.g. "812ms"
}

#[derive(thiserror::Error, Debug)]
pub enum RunError {
#[error("server failed to start: {0}")]
ServerStart(io::Error),
#[error("server exited unexpectedly with code {0:?}")]
ServerExit(Option<i32>),
#[error("initialize handshake failed: {0}")]
Handshake(String),
#[error("server stderr: {0}")]
ServerStderr(String),
#[error("config invalid: {0}")]
Config(String),
#[error("io: {0}")]
Io(#[from] io::Error),
#[error("internal: {0}")]
Internal(String),
}

The detection logic is the core intellectual property of this tool, so it is specified precisely enough that any implementer can reproduce it.
Per-call watchdog. Wraps every tools/call request:
Algorithm: hang_detector(req, threshold)
1. Record send_at = now().
2. Send req to server.
3. Spawn watchdog task with timer = threshold.
4. Race: watchdog completes OR response arrives.
5. If response arrives first:
duration = now() - send_at
return Ok((response, duration))
6. If watchdog completes first:
mark request_id as HUNG
continue listening for late response (up to grace_period)
if late response arrives: classify as LATE (not HUNG)
if no response within grace_period: classify as DEADLOCK
return Err(Hang { request_id, hung_for })
Hang ≠ deadlock. Hang means "no response within hang_threshold". Deadlock means "no response within hang_threshold + grace_period" — i.e. server appears genuinely stuck, not just slow.
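
The three-way outcome (on time, late, deadlocked) can be illustrated with a small asyncio sketch. It assumes the in-flight call is represented by an awaitable that resolves when the server replies. `fake_server` is a stand-in written for this example, and the string labels replace the `Ok` / `Err(Hang { .. })` return values of the algorithm above, so this is not the crate's implementation.

```python
import asyncio


async def hang_detector(response, hang_threshold, grace_period):
    """Classify one in-flight call as OK, LATE, or DEADLOCK."""
    task = asyncio.ensure_future(response)
    # Race the response against the watchdog timer.
    done, _ = await asyncio.wait({task}, timeout=hang_threshold)
    if done:
        return "OK", task.result()
    # Watchdog fired first: the call is hung. Keep listening for a late reply.
    done, _ = await asyncio.wait({task}, timeout=grace_period)
    if done:
        return "LATE", task.result()
    task.cancel()  # nothing within hang_threshold + grace_period
    return "DEADLOCK", None


async def fake_server(delay, value="pong"):
    # Stand-in for a real session: replies after `delay` seconds.
    await asyncio.sleep(delay)
    return value


async def demo():
    fast = await hang_detector(fake_server(0.0), 0.05, 0.5)
    slow = await hang_detector(fake_server(0.15), 0.05, 0.5)
    stuck = await hang_detector(fake_server(10.0), 0.05, 0.1)
    return [fast[0], slow[0], stuck[0]]


results = asyncio.run(demo())
```

`asyncio.wait` with a timeout does not cancel the pending task, which is what lets the second wait observe a late response instead of discarding it.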
The Vibe-Trading-bug-class detector. Specific call sequence designed to reproduce lazy-init races.
Algorithm: deadlock_probe(server, tool, N, hang_threshold)
1. Spawn server. Record startup_duration = time-to-stdout-EOF or initialize-response.
2. Send `initialize`. Await with timeout = startup_timeout. (fails → SERVER_INIT_ERROR)
3. Send `notifications/initialized`.
4. Send `tools/list`. Await with timeout = 1s. (fails → TOOLS_LIST_HANG)
5. Synchronization barrier — all N tasks ready to send concurrently.
6. Release barrier. All N tasks send `tools/call` to `tool` simultaneously.
7. Each task: hang_detector(req, hang_threshold).
8. After all N return (Ok or Err): wait grace_period.
9. Categorize each:
- Ok with duration → SUCCESS
- Late response within grace_period → SLOW
- No response after grace_period → DEADLOCK
10. Send shutdown notification, wait shutdown_timeout, kill if needed.
11. Report:
- if DEADLOCK count > 0 → severity=CRITICAL, "DEADLOCK DETECTED"
- else if SLOW > 0.5 * N → severity=WARNING, "concurrency degrades latency"
- else → severity=PASS
The barrier in steps 5-6 is critical. Without it, requests serialize naturally and lazy-init bugs hide. The barrier forces real concurrency at the point of greatest stress.
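
A rough Python sketch of the barrier release (steps 5-6) and the severity mapping (step 11), under the assumption that each worker is an asyncio task and that `send` stands in for one `tools/call` round trip. `fire_together` and `severity` are hypothetical names chosen for this example.

```python
import asyncio


def severity(outcomes):
    """Map per-call outcomes to the report severity from step 11."""
    if outcomes.count("DEADLOCK") > 0:
        return "CRITICAL"
    if outcomes.count("SLOW") > 0.5 * len(outcomes):
        return "WARNING"
    return "PASS"


async def fire_together(n, send):
    """Park n workers at a gate, then release them all at once."""
    gate = asyncio.Event()

    async def worker(i):
        await gate.wait()  # every worker waits here until released
        return await send(i)

    tasks = [asyncio.create_task(worker(i)) for i in range(n)]
    await asyncio.sleep(0)  # yield once so the workers reach the gate
    gate.set()              # release the barrier
    return await asyncio.gather(*tasks)


async def demo():
    in_flight = 0
    peak = 0

    async def send(i):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for a tools/call round trip
        in_flight -= 1
        return "SUCCESS"

    outcomes = await fire_together(20, send)
    return outcomes, peak


outcomes, peak = asyncio.run(demo())
```

The `peak` counter is only there to show that all 20 calls are in flight at the same time, which is the property the barrier exists to guarantee.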
Algorithm: leak_detector(server, scenario, sample_interval, growth_threshold_mb)
1. Run sustained scenario. Concurrently:
2. Every sample_interval, sample server's RSS via sysinfo.
3. After scenario completes:
4. Fit linear regression: rss_mb = a * t + b, where t in seconds
5. Predicted total growth = a * scenario.duration_secs
6. If predicted_growth > growth_threshold_mb:
classify as LEAK_DETECTED
report: slope (MB/sec), R² (fit quality), samples
7. R² < 0.5 → "noisy, can't conclude" — report as INDETERMINATE
Caveat: warmup-and-stabilize matters. First 30s of samples are discarded by default to avoid false positives from JIT / lazy-load.
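
The regression step can be sketched with ordinary least squares in plain Python. This sketch makes three assumptions: samples arrive as `(t_seconds, rss_mb)` pairs, the R² check runs before the growth check so that a noisy fit is never reported as a leak, and `NO_LEAK` is a label invented here for the passing case, since the algorithm above names only `LEAK_DETECTED` and `INDETERMINATE`.

```python
def fit_line(ts, ys):
    """Ordinary least squares for y = a * t + b; returns (a, b, r_squared)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    a = sxy / sxx
    b = mean_y - a * mean_t
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (a * t + b)) ** 2 for t, y in zip(ts, ys))
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return a, b, r2


def leak_detector(samples, duration_secs, growth_threshold_mb, warmup_secs=30.0):
    """Return (verdict, slope_mb_per_sec, r_squared)."""
    # Discard warmup samples to avoid false positives from lazy loading.
    kept = [(t, y) for t, y in samples if t >= warmup_secs]
    if len(kept) < 2:
        return "INDETERMINATE", 0.0, 0.0
    a, _, r2 = fit_line([t for t, _ in kept], [y for _, y in kept])
    if r2 < 0.5:
        return "INDETERMINATE", a, r2  # too noisy to conclude
    if a * duration_secs > growth_threshold_mb:
        return "LEAK_DETECTED", a, r2
    return "NO_LEAK", a, r2


# RSS growing by 1 MB/sec versus a perfectly flat RSS, sampled every 10s.
leaky = [(t, 100.0 + 1.0 * t) for t in range(0, 121, 10)]
steady = [(t, 100.0) for t in range(0, 121, 10)]
```

For the `leaky` series the fitted slope is 1 MB/sec, so the predicted growth over a 120s run is 120 MB, well above a 50 MB threshold.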
Algorithm: evaluate_thresholds(report, thresholds)
  For each threshold field that is Some:
    compare report's metric to threshold
    if violated: append ThresholdViolation { metric, expected, actual }
  Return: violations vec (empty means PASS).
Simple, but worth specifying so the report's passed() is unambiguous.
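As an illustration of the loop above, here is a small Python sketch. The `Thresholds` fields (`p99_latency_ms`, `error_rate`) are hypothetical stand-ins for whatever the crate actually exposes, `None` plays the role of Rust's `None`, and every metric is assumed to be "lower is better".

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Thresholds:
    # Hypothetical fields; an unset threshold is None.
    p99_latency_ms: Optional[float] = None
    error_rate: Optional[float] = None


def evaluate_thresholds(report, thresholds):
    """Return a list of violations; an empty list means PASS."""
    violations = []
    for field in fields(thresholds):
        limit = getattr(thresholds, field.name)
        if limit is None:
            continue                      # unset threshold: nothing to check
        actual = report[field.name]
        if actual > limit:                # assumes "lower is better" metrics
            violations.append(
                {"metric": field.name, "expected": f"<={limit}", "actual": str(actual)}
            )
    return violations
```

With a report of `{"p99_latency_ms": 123.4, "error_rate": 0.0036}` and only `p99_latency_ms=100.0` set, the result is a single violation for `p99_latency_ms`; with no thresholds set, the list is empty and the run passes.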
Mocks live in tests/fixtures/<name>.py. Each is < 50 lines of Python — minimal MCP server using stdio + JSON-RPC by hand (no fastmcp dep, to avoid version coupling). Shipped fixtures: mock-normal.py, mock-slow.py, mock-broken.py, mock-crash.py, mock-leak.py, mock-error.py, mock-slow-init.py, mock-malformed.py, plus mock-http-server.py and mock-sse-server.py (transport parity coverage). Pseudocode for each below.
# mock-normal.py: echoes args, responds in 1ms. Reference implementation.
# respond() is assumed to come from the shared helpers in tests/fixtures/_common.py.
import json
import sys

while True:
    line = sys.stdin.readline()
    if not line:  # EOF: the client closed stdin
        break
    msg = json.loads(line)
    if msg["method"] == "initialize":
        respond({"protocolVersion": "...", "capabilities": {...}})
    elif msg["method"] == "tools/list":
        respond({"tools": [{"name": "echo", "inputSchema": {...}}]})
    elif msg["method"] == "tools/call":
        respond({"content": [{"type": "text", "text": json.dumps(msg["params"]["arguments"])}]})

mock-slow.py: same as mock-normal, but tools/call does time.sleep(2) before responding. Used to verify latency histogram correctness (p99 should be ~2s).
# mock-broken.py: replicates the Vibe-Trading lazy-init deadlock pattern.
# initialize and tools/list work; first tools/call hangs forever.
import json
import sys
import time

calls_made = 0
while True:
    line = sys.stdin.readline()
    if not line:  # EOF: the client closed stdin
        break
    msg = json.loads(line)
    if msg["method"] in ("initialize", "tools/list"):
        respond_normally()
    elif msg["method"] == "tools/call":
        # The bug: blocking import in worker
        if calls_made == 0:
            calls_made += 1
            time.sleep(999999)  # stands in for the deadlock: never returns
        else:
            respond_normally()

deadlock_probe against this MUST report deadlock_count >= 1.
# mock-crash.py
# Panics 1% of calls (random.random() < 0.01). Tests error rate accuracy.
# Crash = exit(1), not JSON-RPC error.

# mock-http-server.py
# Streamable HTTP transport fixture. Stdlib http.server only, no fastapi/etc.
# Used by HttpTransport integration tests.

# mock-sse-server.py
# HTTP+SSE transport fixture. Endpoint handshake + id-correlated responses.
# Stdlib http.server only. Used by SseTransport integration tests.

# mock-leak.py
# Allocates 10 KB per tools/call into a module-global list. Never frees.
# Tests leak detector: slope should be ~10KB × rps.
# Today leak/drift signals are exercised via `Soak::detect_leak` over synthetic
# (t, rss) series; a real leaking fixture is still useful for end-to-end coverage.

# mock-error.py
# Returns JSON-RPC errors per spec: -32601 method not found,
# -32602 invalid params, -32603 internal error.
# Cycles through error codes per call. Tests error classification (§18).

# mock-slow-init.py
# Sleeps 5s on `initialize` before responding. Tests cold_start measurement.

# mock-malformed.py
# Returns invalid JSON every 10th response (truncated, missing field).
# Tests parser robustness: should classify as MALFORMED_RESPONSE, not crash.

All mocks share common framing helpers in tests/fixtures/_common.py (read frame, write frame, respond ok/err).
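To make the mock-normal pseudocode concrete, the following is one way the fixture and its newline-delimited framing could look in plain Python. It is a sketch under assumptions, not the shipped fixture: the `handle` and `serve` names, the `serverInfo` values, and the `-32601` fallback for unknown methods are illustrative, and the protocol version string is left elided exactly as in the pseudocode, so it must be filled in before a real client will accept the handshake.

```python
import json
import sys


def handle(msg):
    """Return the JSON-RPC response for one request, or None for notifications."""
    if "id" not in msg:
        return None  # notifications (e.g. notifications/initialized) get no reply
    method = msg.get("method")
    if method == "initialize":
        result = {
            "protocolVersion": "...",  # elided in the source; fill in the real version
            "capabilities": {"tools": {}},
            "serverInfo": {"name": "mock-normal", "version": "0"},  # illustrative
        }
    elif method == "tools/list":
        result = {"tools": [{"name": "echo", "inputSchema": {"type": "object"}}]}
    elif method == "tools/call":
        args = msg.get("params", {}).get("arguments", {})
        result = {"content": [{"type": "text", "text": json.dumps(args)}]}
    else:
        return {"jsonrpc": "2.0", "id": msg["id"],
                "error": {"code": -32601, "message": "method not found"}}
    return {"jsonrpc": "2.0", "id": msg["id"], "result": result}


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Read one JSON object per line, write one response per line."""
    for line in stdin:
        if not line.strip():
            continue
        reply = handle(json.loads(line))
        if reply is not None:
            stdout.write(json.dumps(reply) + "\n")
            stdout.flush()  # unflushed stdout looks like a hang to the client
```

Calling `serve()` with no arguments runs the loop over the process's own stdin and stdout. Keeping `handle` separate from the I/O loop lets the slow, broken and error variants reuse the framing and change only the handler.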
One JSON object per line. Schema:
{
  "ts": 0.0,                // seconds since run start (f64)
  "kind": "request|response|error|hang|deadlock|process_sample|scenario_event",
  "request_id": 123,        // matches JSON-RPC id, present for request/response/error/hang/deadlock
  "method": "tools/call",   // present for request
  "params": {...},          // present for request (compact, can be large)
  "result": {...},          // present for response (truncated to 1KB by default)
  "error": {"category": "...", "message": "...", "code": -32603}, // present for error
  "duration_ms": 12.5,      // present for response/error
  "rss_mb": 45.2,           // present for process_sample
  "cpu_pct": 12.3           // present for process_sample
}

Stream-friendly. Can be processed with jq or any line-oriented tool.
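Because each line is a standalone JSON object, a trace can be consumed with a few lines of code and no custom parser. The Python sketch below counts events per `kind` and finds the slowest response; `summarize_trace` is a hypothetical helper name, and only the field names come from the schema above.

```python
import json
from collections import Counter


def summarize_trace(lines):
    """Count events per kind and find the slowest response in a trace."""
    kinds = Counter()
    slowest = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        event = json.loads(line)
        kinds[event["kind"]] += 1
        if event["kind"] == "response":
            if slowest is None or event["duration_ms"] > slowest["duration_ms"]:
                slowest = event
    return kinds, slowest
```

Since the function takes any iterable of lines, an open file handle on `trace.jsonl` can be passed in directly and the trace is processed as a stream rather than loaded into memory.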
{
  "run_id": "01HXY...",
  "started_at": "2026-05-10T07:30:00Z",
  "duration_secs": 60.0,
  "scenario": {
    "kind": "Sustained",
    "concurrent": 50,
    "duration_secs": 60.0
  },
  "latency_ms": {
    "p50": 12.3,
    "p95": 45.6,
    "p99": 123.4,
    "p999": 456.7,
    "min": 1.2,
    "max": 999.9,
    "mean": 23.4,
    "stddev": 18.7,
    "count": 12345
  },
  "throughput": {
    "total_requests": 12345,
    "successful_requests": 12300,
    "requests_per_sec": 205.75
  },
  "errors": {
    "total": 45,
    "by_category": {
      "Hang": 0,
      "Timeout": 5,
      "ServerError": 30,
      "ProtocolError": 10,
      "Crash": 0,
      "Malformed": 0
    }
  },
  "process": {
    "peak_rss_mb": 156.3,
    "final_rss_mb": 142.1,
    "avg_cpu_pct": 23.4
  },
  "deadlock_count": 0,
  "hang_count": 0,
  "threshold_violations": [
    { "metric": "p99_latency", "expected": "<=100ms", "actual": "123.4ms" }
  ],
  "passed": false
}

On the Rust side, `metric` is a `ThresholdKind` enum (`crate::report::ThresholdKind`); serde serializes it to the string slug shown here via `#[serde(rename = "metric")]` plus per-variant snake_case, so the wire format stays stable across refactors.
JSON Schema published at schema/metrics.v1.json for downstream tooling.
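As an example of downstream tooling, a CI gate can read the metrics report and turn it into an exit code. The Python sketch below uses only field names from the sample report; the helper names are hypothetical, and treating any deadlock as a failure regardless of `passed` is an assumption of this example rather than documented behavior.

```python
def error_rate(report):
    """Fraction of requests that failed, derived from the report's own counters."""
    total = report["throughput"]["total_requests"]
    return report["errors"]["total"] / total if total else 0.0


def ci_exit_code(report):
    """0 when the run passed, 1 otherwise."""
    # Assumption: a deadlock fails the gate even if no threshold was configured.
    if report["deadlock_count"] > 0 or not report["passed"]:
        return 1
    return 0
```

For the sample report, `error_rate` gives 45 / 12345, about 0.36%, and `ci_exit_code` returns 1 because `passed` is false. The report itself can be loaded with the standard `json.load` before being passed in.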
# Run {run_id}
**Status:** ❌ FAIL (1 threshold violation)
**Server:** `python -m vibe_trading_mcp`
**Scenario:** Sustained, 50 concurrent, 60s
**Started:** 2026-05-10 07:30:00 UTC
## Summary
- Total requests: 12,345
- Throughput: 205.75 req/s
- Error rate: 0.36%
- Deadlocks: 0 Hangs: 0
## Latency
| p50 | p95 | p99 | p999 | max |
| ------ | ------ | -------------- | ------- | ------- |
| 12.3ms | 45.6ms | **123.4ms** ❌ | 456.7ms | 999.9ms |
(latency histogram ASCII chart here)
## Errors
| Category | Count |
| ------------- | ----- |
| ServerError | 30 |
| ProtocolError | 10 |
| Timeout | 5 |
## Process
Peak RSS: 156.3 MB · Final RSS: 142.1 MB · Avg CPU: 23.4%
## Threshold violations
- ❌ **p99_latency**: expected ≤100ms, got 123.4ms
## Trace
Full trace: `./trace.jsonl` (12,345 events, 8.2 MB)

Every failure is classified into exactly one category. Used for ErrorStats.by_category and reporting.
| Category | Definition | Example |
|---|---|---|
| Hang | No response within `hang_threshold`, but response arrived before `grace_period` expires | tool genuinely slow under contention |
| Deadlock | No response after `hang_threshold` + `grace_period` | Vibe-Trading PR #85 |
| Timeout | Client-side configured deadline exceeded (separate from `hang_threshold`) | network buffer full |
| ServerError | JSON-RPC error response with code in `-32099..=-32000` (server-defined) | tool returned business error |
| ProtocolError | JSON-RPC error with code in `-32603..=-32600` (transport / spec violations) | malformed request rejected |
| Crash | Server process exited (non-zero or signal) during call | unhandled panic |
| Malformed | Response was not valid JSON or didn't match JSON-RPC schema | partial response, broken framing |
| Disconnected | Transport closed unexpectedly mid-call | broken pipe |
| Cancelled | Client cancelled the request before response | scenario shutdown |
Classification precedence: when more than one category applies, the terminal event wins. A request that hangs and then sees the server crash is classified as Crash, but trace.jsonl records both the hang and the crash events for forensics.
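The two JSON-RPC rows of the table reduce to range checks on the error code. A minimal Python sketch, with a hypothetical function name; codes that fall outside both ranges (for example `-32700`, JSON-RPC's parse error) are not covered by the table, so the sketch returns `None` and leaves that decision to the caller.

```python
def classify_error_code(code):
    """Map a JSON-RPC error code to ServerError / ProtocolError, else None."""
    if -32099 <= code <= -32000:
        return "ServerError"      # server-defined range
    if -32603 <= code <= -32600:
        return "ProtocolError"    # invalid request/method/params, internal error
    return None                   # outside both ranges: not covered by the table
```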
mcp-loadtest should never be the bottleneck.
| Aspect | Target |
|---|---|
| Driver per-request CPU overhead | < 50µs (excluding JSON serialization) |
| Memory per concurrent worker | < 100KB |
| Max sustainable concurrency on a 4-core laptop | ≥ 1000 workers |
| Trace file write throughput | ≥ 100k events/sec |
| Histogram update | lock-free per-worker, merged at end |
These are tested in benches/ (criterion). v0.1 ships with reproducible numbers in the README.
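The "per-worker, merged at end" row describes a common pattern: each worker records into its own histogram with no shared state, and the histograms are summed once when the run finishes. The Python sketch below shows the idea with fixed-width buckets. The 1 ms bucket width and all function names are assumptions for illustration, and the sketch says nothing about which histogram implementation the crate uses.

```python
from collections import Counter

BUCKET_MS = 1.0  # hypothetical fixed bucket width


def record(hist, latency_ms):
    """Record one latency into a worker-local histogram (no locking needed)."""
    hist[int(latency_ms // BUCKET_MS)] += 1


def merge(hists):
    """Sum worker-local histograms into one, after the workers have stopped."""
    total = Counter()
    for hist in hists:
        total.update(hist)
    return total


def percentile(hist, q):
    """Upper bound (ms) of the bucket holding the q-th quantile, 0 < q <= 1."""
    target = q * sum(hist.values())
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= target:
            return (bucket + 1) * BUCKET_MS
    raise ValueError("empty histogram")
```

Because workers never touch each other's counters, the hot path has no contention; the cost is that percentiles are only available after the merge, and their resolution is limited by the bucket width.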
- v0.x: API can change anywhere
- v1.0: locked. Breaking changes require major version bump (semver strict)
- MCP spec: the `protocol_version` field in `initialize` is checked. Mismatch warns but does not fail by default. Override with `--strict-protocol`.
- Library MSRV (minimum supported Rust version): stable - 2 (e.g. if 1.85 is current stable, MSRV is 1.83).
When to commit to 1.0:
- After 3 months of v0.x with no breaking changes
- After 5+ external users have integrated
- After at least 1 real bug caught in the wild and reported back
mcp-loadtest is a tool that AI agents will run on their own (Claude Code running CI) and on a developer's behalf (a developer asking Claude to "load-test my MCP server"). Design accordingly.
- All public types have `#[derive(Debug, Serialize, Deserialize)]` so they're trivially JSON-able.
- The library API is documented with rustdoc examples that compile (doctested in CI). LLMs read these examples to build correct calls on the first try.
- No "you must construct in this exact order" sequencing — builders are commutative where possible.
The single most important AI-friendly feature. mcp-loadtest exposes itself as an MCP server with these tools:
| Tool | Args | Returns |
|---|---|---|
| `deadlock_probe` | `server_command`, `tool`, `concurrent` | `{ deadlock_count, hung_for_ms[], details }` |
| `sustained_load` | `server_command`, `concurrent`, `duration_secs`, `tool`, `args` | `{ p50_ms, p99_ms, error_rate, requests_per_sec }` |
| `compare_runs` | `baseline_run_dir`, `current_run_dir` | structured diff with regression flags |
| `report_summary` | `run_dir` | markdown summary string |
| `list_recent_runs` | `limit` | run dirs with metadata |
A user can say to Claude / Cursor / any MCP-aware agent: "Find deadlocks in my new MCP server at python -m foo" — and the agent calls deadlock_probe directly. No human-in-the-loop required to spawn a child process and parse stdout — the agent gets structured JSON back.
Reaatech doesn't do this. It's our most under-priced differentiator.
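For reference, this is roughly what an agent's call to the `deadlock_probe` tool looks like on the wire: a standard MCP `tools/call` request whose `arguments` carry the three fields from the table. The Python helper below is a sketch; its name is hypothetical, and the argument types (string, string, integer) are assumptions since the table lists names only.

```python
import json


def deadlock_probe_request(request_id, server_command, tool, concurrent):
    """Build the JSON-RPC `tools/call` frame for the deadlock_probe tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "deadlock_probe",
            "arguments": {
                "server_command": server_command,  # assumed to be a shell-style string
                "tool": tool,
                "concurrent": concurrent,          # assumed to be an integer
            },
        },
    })
```

For the prompt quoted above, the agent would send something like `deadlock_probe_request(1, "python -m foo", "echo", 16)`, where the tool name and concurrency are values the agent picks.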
Every Err returned to the user includes a suggested next step:
Error: server stdin closed unexpectedly during initialize handshake.
Hint: server may have crashed before responding. Check stderr at:
runs/01HXY.../server.stderr.log
Or re-run with --tee-stderr to see it live.
vs. the bad version:
Error: BrokenPipe(Os { code: 32, ... })
LLMs (and humans) act on the first; bounce off the second.
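The pattern is to attach the hint at the point where the low-level error is translated, while the context (which phase failed, where stderr was captured) is still known. A Python sketch of the idea follows; the class and function names are hypothetical, and the crate itself would express this with Rust error types.

```python
class HintedError(Exception):
    """An error that carries a suggested next step for the reader."""

    def __init__(self, message, hint):
        super().__init__(message)
        self.message = message
        self.hint = hint

    def __str__(self):
        return f"Error: {self.message}\nHint: {self.hint}"


def explain_broken_pipe(exc, stderr_log):
    """Translate a raw BrokenPipeError into the actionable form shown above."""
    if not isinstance(exc, BrokenPipeError):
        raise TypeError("expected BrokenPipeError")
    return HintedError(
        "server stdin closed unexpectedly during initialize handshake.",
        "server may have crashed before responding. Check stderr at:\n"
        f"  {stderr_log}\n"
        "Or re-run with --tee-stderr to see it live.",
    )
```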
$ mcp-loadtest deadlock-probe --explain
Algorithm:
1. Spawn server process.
2. Send `initialize`. Wait up to startup_timeout (default 10s).
3. Send `notifications/initialized`.
4. Send `tools/list`. Wait up to 1s.
5. Synchronization barrier — N concurrent `tools/call` ready to fire.
6. Release barrier. All N calls fire in parallel.
7. Each call wrapped in hang_detect(hang_threshold=5s, grace_period=10s):
- response within hang_threshold → SUCCESS
- response between threshold and grace_period → SLOW (warning)
- no response after grace_period → DEADLOCK (critical)
8. Report aggregated results.
Tunable knobs: --concurrent, --hang-threshold, --grace-period.
See DESIGN.md §15.2 for the spec source.
LLMs use this to plan the right invocation. Reduces "I tried it but it didn't do what I expected" loops.
schema/config.v1.json and schema/metrics.v1.json shipped at well-known paths. LLMs validate generated configs / parse outputs without guessing field shapes.
Diagnoses common setup issues:
- Python interpreter not on PATH (for fixture-based tests).
- MSVC vs GNU toolchain mismatch on Windows.
- Stale `runs/` accumulation.
- MCP server fails initialize: captures stderr and reports.
Outputs a checklist with ✅/❌ per item and a one-line fix per ❌. Exactly the kind of thing an LLM agent can chain into a fix-it loop.
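A checklist of this kind is a list of small checks, each returning a pass/fail flag, a label and a one-line fix. The Python sketch below implements only the interpreter-on-PATH check, using the standard `shutil.which`; the function names, the tuple shape and the rendered format are assumptions for illustration.

```python
import shutil


def check_python(which=shutil.which):
    """One check: is a Python interpreter on PATH? Returns (ok, label, fix)."""
    found = which("python3") or which("python")
    if found:
        return True, f"Python interpreter on PATH ({found})", None
    return False, "Python interpreter on PATH", "install Python or add it to PATH"


def render_checklist(results):
    """results: list of (ok, label, fix). One line per item, fix only on failure."""
    lines = []
    for ok, label, fix in results:
        mark = "✅" if ok else "❌"
        lines.append(f"{mark} {label}" + ("" if ok else f" (fix: {fix})"))
    return "\n".join(lines)
```

Passing `which` as a parameter keeps the check testable without touching the real PATH, and the one-fix-per-line output is easy for an agent to map back to a follow-up command.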
runs/<id>/trace.jsonl is line-oriented JSON with stable field names (DESIGN.md §17.1). Pipeable through jq, parseable by any agent without custom code:
$ jq 'select(.kind=="hang")' runs/01HXY.../trace.jsonl

A report that says p99 latency: 234ms is data. A report that adds "95% of users would call this acceptable; the slow tail (top 1%) is concentrated on analyze_options calls" is information. We aim for the latter: derived sentences, not just numbers.
insta::assert_snapshot! on report markdown / JSON. Output shapes are stable across releases unless explicitly changed (with CHANGELOG entry). LLM agents that parse our output don't break across patch versions.
Per-scenario copy-pasteable commands + expected output. LLMs train on README-style examples; cookbook entries make those examples concrete and executable.
Examples to ship at v0.1.0:
- "Find deadlocks in my new MCP server"
- "Add a regression gate to my CI"
- "Compare two implementations of the same MCP server"
- "Detect a memory leak before production"