mcp-loadtest — Design Document

Status: v0.1.0-rc (2026-05-11)
Author: Teerapat Vatpitak
Reviewers: (pending)

1. Motivation

The Model Context Protocol ecosystem is exploding — new MCP servers ship weekly across Python, Node, Rust. But MCP servers fail in ways unit tests don't catch:

Lazy-init deadlocks inside async worker threads
Race conditions when concurrent tools/call arrive before tools/list completes
Memory leaks under sustained load
Hangs that look like work-in-progress
Subtle protocol violations only visible at scale

Canonical motivating case

The author hit a deadlock in HKUDS/Vibe-Trading where:

initialize → ✅ worked
tools/list → ✅ worked
tools/call → ❌ hung forever

Root cause: _get_registry() lazy-init inside the FastMCP asyncio worker thread, blocking on import src.tools.shell.*. Standard pytest didn't catch this — the bug only surfaces when a real client opens a session and calls a tool through stdio.

The fix took ~5 lines (PR #85) and a regression smoke test (PR #86) — but finding the bug took hours of differential testing because no purpose-built tool exists for stress-testing MCP servers.

The gap

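There are excellent tools for HTTP load testing (k6, vegeta, wrk2), gRPC (ghz), GraphQL (autocannon plugins). When this project started, the author knew of nothing comparable for MCP (reaatech/mcp-load-test was discovered later; see §10). Ad-hoc Python scripts and pytest fixtures are what people use.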

mcp-loadtest aims to be the canonical tool: language-agnostic, transport-agnostic, with built-in scenarios for the bug classes that actually occur.

2. Goals & Non-Goals

Goals

Detect deadlocks, hangs, livelocks under realistic concurrent load
Measure latency (p50/p95/p99), throughput, error rate
Work against any MCP server regardless of language / transport (stdio, HTTP, SSE, WebSocket)
Library mode (Rust crate) for embedding in CI tests
CLI mode for ad-hoc smoke tests and benchmarks
Cross-platform (Linux, macOS, Windows — author runs Windows so this is a hard requirement, not aspirational)
Zero-config quick-start: mcp-loadtest probe -s "python -m my_mcp" should just work

Non-Goals

Not a replacement for unit tests. Different problem.
Not a tool for testing MCP clients. Client-side bugs are a separate domain.
Not validating tool output correctness. We test protocol-level behavior. If your tool returns wrong data, that's not what we catch.
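Not a full protocol fuzzer. The built-in fuzzer scenario (§8) only cycles through an enumerated set of malformed payloads. A dedicated fuzzer that generates random payloads is a different design, possibly a future sister project (mcp-fuzz).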
Not a benchmark suite. We provide infrastructure to bench, not a curated set of "official" benchmarks.

3. Background

MCP protocol (relevant subset)

JSON-RPC 2.0 framing over one of four transports:

stdio — line-delimited JSON over child process stdin/stdout (most common, all examples in this doc focus here)
HTTP — Streamable HTTP (simple JSON variant); request via POST, simple JSON response
HTTP+SSE — request via POST, server pushes events via SSE channel
WebSocket — bidirectional frames

Lifecycle (stdio):

client → server   {"method":"initialize", "params":{...}}
client ← server   {"result":{"protocolVersion":...,"capabilities":{...}}}
client → server   {"method":"notifications/initialized"}    # one-way notif
client → server   {"method":"tools/list"}
client ← server   {"result":{"tools":[{...},...]}}
client → server   {"method":"tools/call", "params":{"name":"X","arguments":{...}}}
client ← server   {"result":{"content":[{...}]}}
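The listing above abbreviates each message. On the wire, JSON-RPC 2.0 requires a "jsonrpc":"2.0" member on every message and an id on every request, while notifications carry no id. A minimal sketch of the stdio framing under those rules follows. The helper names request_line and notification_line are hypothetical, params is assumed to be already-serialized JSON, and method names are assumed to need no escaping. A real implementation would more likely build the payload with serde_json.

```rust
/// Build one newline-terminated JSON-RPC 2.0 request (hypothetical helper).
/// `params` is assumed to be already-serialized JSON, e.g. `{"name":"X"}`.
fn request_line(id: u64, method: &str, params: Option<&str>) -> String {
    let mut msg = format!("{{\"jsonrpc\":\"2.0\",\"id\":{id},\"method\":\"{method}\"");
    if let Some(p) = params {
        msg.push_str(",\"params\":");
        msg.push_str(p);
    }
    // One JSON object per line: the newline is the frame delimiter on stdio.
    msg.push_str("}\n");
    msg
}

/// Build one newline-terminated notification: same shape, but no `id`,
/// so the server sends no response.
fn notification_line(method: &str) -> String {
    format!("{{\"jsonrpc\":\"2.0\",\"method\":\"{method}\"}}\n")
}
```

The id is what lets a client match responses to requests once several tools/call requests are in flight on the same pipe.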

Bug classes we target

Class	Example	Why hard to catch in unit tests
Lazy-init deadlock	Vibe-Trading PR #85	Bug only surfaces with full subprocess + protocol handshake
Concurrent tool-call race	tools/call before tools/list completes	Need real concurrency; mocked async ≠ real async
Resource exhaustion	1000 concurrent calls → fd / mem leak	Need sustained load
Slow-tool head-of-line	One slow tool blocks queue	Need mixed workload
Reconnect / mid-call kill	Connection drops between request and response	Hard to simulate without tooling
Notification ordering	Server sends `notifications/cancelled` mid-call	Need sequence-aware client

4. Architecture

High-level

┌─────────────────────────────────────────────────────────────┐
│                       CLI / Library                          │
│  - parse args / config                                       │
│  - construct Run                                             │
│  - print report                                              │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    Run (orchestrator)                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ServerManager │  │ ScenarioImpl │  │     Reporter     │  │
│  │              │  │              │  │                  │  │
│  │ spawn(),     │  │  N tokio     │  │  hdrhistogram    │  │
│  │ kill(),      │  │  worker      │  │  process stats   │  │
│  │ rss/cpu      │  │  tasks       │  │  → markdown/json │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────────────┘  │
│         │                  │                                 │
│         └──────┬───────────┘                                 │
│                ▼                                             │
│  ┌─────────────────────────────────┐                        │
│  │       ProtocolSession           │                        │
│  │   stdio framing + JSON-RPC      │                        │
│  │   per-call timeout (hang det.)  │                        │
│  │   request/response correlation  │                        │
│  └─────────────────────────────────┘                        │
└─────────────────────────────────────────────────────────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │  Server subprocess  │  (under test — any language)
        └─────────────────────┘
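The "request/response correlation" box in ProtocolSession matters because many worker tasks share one stdio pipe, so responses can come back in any order. A minimal sketch of the idea, using only the standard library: each request registers a channel under its id, and the reader routes every response to the matching channel. The Correlator type and its methods are hypothetical names for illustration. The real session runs on tokio and may use a different mechanism.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Tracks in-flight requests so a response can be routed to its caller by id.
#[derive(Default)]
struct Correlator {
    pending: HashMap<u64, Sender<String>>,
    next_id: u64,
}

impl Correlator {
    /// Allocate a fresh request id and a receiver on which the response arrives.
    fn register(&mut self) -> (u64, Receiver<String>) {
        self.next_id += 1;
        let (tx, rx) = channel();
        self.pending.insert(self.next_id, tx);
        (self.next_id, rx)
    }

    /// Route a response payload to its caller.
    /// Returns false for unknown or already-answered ids.
    fn resolve(&mut self, id: u64, payload: String) -> bool {
        match self.pending.remove(&id) {
            Some(tx) => tx.send(payload).is_ok(),
            None => false,
        }
    }

    /// Number of requests still waiting for a response.
    fn in_flight(&self) -> usize {
        self.pending.len()
    }
}
```

An entry that stays in the pending map past its timeout is exactly what the per-call hang detection reports.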

Module layout (`src/`)

src/
├── main.rs                  # CLI entry (clap)
├── lib.rs                   # public crate API
├── config.rs                # TOML schema + parsing
├── protocol/
│   ├── mod.rs               # re-exports
│   ├── jsonrpc.rs           # JSON-RPC 2.0 framing
│   ├── mcp.rs               # MCP request/response types
│   ├── session.rs           # ProtocolSession (per-connection state)
│   └── transport/
│       ├── mod.rs
│       ├── stdio.rs         # M1
│       ├── http.rs          # M4
│       ├── sse.rs           # M4
│       └── ws.rs            # Post-M7
├── server_manager.rs        # spawn/kill, env, working_dir, sysinfo
├── scenario/
│   ├── mod.rs               # Scenario trait
│   ├── cold_start.rs
│   ├── sustained.rs
│   ├── deadlock_probe.rs    # the Vibe-Trading-style smoke
│   ├── spike.rs             # Post-M7
│   ├── ramp.rs              # M5
│   └── leak.rs              # M5
├── driver.rs                # tokio task pool, rate limiter
├── hang_detector.rs         # per-call watchdog
├── metrics/
│   ├── mod.rs
│   ├── histogram.rs         # hdrhistogram wrapper
│   ├── throughput.rs
│   └── process.rs           # RSS/CPU/fd via sysinfo
├── report/
│   ├── mod.rs               # Report struct
│   ├── markdown.rs
│   ├── json.rs
│   └── terminal.rs          # indicatif progress + final summary
└── run.rs                   # orchestrator that ties it all together

Key crate dependencies

Crate	Why
tokio	async runtime
serde / serde_json	JSON-RPC payloads
clap	CLI
toml	config
hdrhistogram	percentile latency
sysinfo	RSS/CPU per pid (cross-platform)
indicatif	terminal progress
tracing	structured logging
thiserror / anyhow	errors
tokio-util	LinesCodec for stdio framing

No custom proc-macro magic (only off-the-shelf derives such as thiserror::Error). No "framework", just composable structs.

5. Library API (Rust crate)

use mcp_loadtest::{Server, Scenario, Run, Thresholds};
use serde_json::json;
use std::time::Duration;

#[tokio::test]
async fn no_deadlock_under_concurrent_calls() {
    let server = Server::stdio("python")
        .args(["-m", "vibe_trading_mcp"])
        .env("LOG_LEVEL", "warn");

    let scenario = Scenario::sustained()
        .concurrent(20)
        .duration(Duration::from_secs(30))
        .tool("get_market_data")
        .args(json!({ "ticker": "AAPL" }));

    let report = Run::new(server, scenario)
        .with_thresholds(Thresholds {
            p99_latency: Some(Duration::from_millis(500)), // Option fields, see §14.3
            error_rate: Some(0.01),
            hang_timeout: Duration::from_secs(5),
            ..Default::default()
        })
        .execute()
        .await
        .expect("run failed");

    assert!(report.passed(), "thresholds violated: {report:?}");
    assert_eq!(report.deadlock_count, 0);
}

Design choices:

Builder pattern — predictable, no struct-init explosion
execute() returns Result<Report, RunError> — lets tests pattern-match on metrics
Report exposes raw histograms — can drive custom assertions
No global state — multiple Runs in parallel inside one test process is supported

6. CLI surface

# Quick health check (initialize + tools/list + 1 call per tool)
mcp-loadtest probe --server "python -m my_mcp"

# Targeted smoke for the Vibe-Trading bug class
mcp-loadtest deadlock-probe --server "python -m my_mcp" --tool get_market_data

# Run a custom scenario from CLI
mcp-loadtest run \
  --server "python -m my_mcp" \
  --scenario sustained \
  --concurrent 50 --duration 60s \
  --tool get_market_data --args '{"ticker":"AAPL"}'

# Run from config file
mcp-loadtest run --config bench.toml

# Re-render saved run
mcp-loadtest report ./runs/2026-05-10T14-30-00/

# List built-in scenarios
mcp-loadtest list-scenarios

# Print example config
mcp-loadtest example-config > bench.toml

Output structure for each run:

runs/2026-05-10T14-30-00/
├── config.toml          # exact config used
├── server.stdout.log    # captured server stdout
├── server.stderr.log    # captured server stderr
├── trace.jsonl          # per-call records (request/response/duration/error)
├── metrics.json         # aggregated metrics
├── report.md            # human-readable summary
└── summary.json         # CI-friendly pass/fail JSON

7. Configuration schema (TOML)

[server]
command = "python"
args = ["-m", "my_mcp"]
env.LOG_LEVEL = "warn"
working_dir = "/path/to/proj"
transport = "stdio"           # stdio | http | sse | ws
startup_timeout = "10s"       # how long to wait for initialize response

[scenario]
type = "sustained"            # cold_start | sustained | spike | ramp | soak | pattern | deadlock_probe | race_check | fuzzer
duration = "60s"
concurrent = 50

# scenario-specific knobs
ramp_from = 1                 # ramp only
ramp_to = 100                 # ramp only
spike_at = "30s"              # spike only
spike_multiplier = 10         # spike only
leak_check_interval = "10s"   # soak only

# What to call. Multiple entries → weighted random selection.
[[scenario.tool_call]]
name = "get_market_data"
args = { ticker = "AAPL" }
weight = 1.0

[[scenario.tool_call]]
name = "analyze_options"
args = { ticker = "SPY", expiry = "2026-06-19" }
weight = 0.3

[thresholds]
p50_latency = "100ms"
p99_latency = "500ms"
error_rate = 0.01             # fraction; 0.01 = 1%
hang_timeout = "5s"           # call considered hung if no response in this long
memory_growth_mb = 50         # fail if RSS grows by more than this MB during run

[output]
report_dir = "./runs"
formats = ["markdown", "json", "terminal"]
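The schema above writes every duration as a string with a unit suffix ("10s", "500ms"). A minimal sketch of how such values could be parsed, using only the standard library. The function name parse_duration and the accepted unit set (ms, s, m, h) are assumptions for illustration. This document does not say how the real config loader parses durations, and it may well rely on an existing crate such as humantime.

```rust
use std::time::Duration;

/// Parse strings like "500ms", "10s", "5m" or "1h" (hypothetical helper).
/// Returns None for a missing number, a missing unit, or an unknown unit.
fn parse_duration(s: &str) -> Option<Duration> {
    let s = s.trim();
    // Split at the first non-digit: leading digits, then the unit suffix.
    let split = s.find(|c: char| !c.is_ascii_digit())?;
    let (num, unit) = s.split_at(split);
    let n: u64 = num.parse().ok()?;
    match unit {
        "ms" => Some(Duration::from_millis(n)),
        "s" => Some(Duration::from_secs(n)),
        "m" => n.checked_mul(60).map(Duration::from_secs),
        "h" => n.checked_mul(3600).map(Duration::from_secs),
        _ => None,
    }
}
```

Rejecting a bare number such as "10" keeps a config from silently mixing up seconds and milliseconds.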

8. Built-in scenarios

Scenario	Description	Detects
`cold_start`	Spawn → initialize → first tool call. Repeat N times. (M2 placeholder — needs session factory; tracked for follow-up.)	regression in startup time, init-time deadlocks
`sustained`	Constant load against one session for a fixed duration. Drives the multi-step weighted-random `pattern` engine internally.	baseline p99 latency, throughput, sustained error rate
`spike`	Baseline → sharp burst at peak concurrency for a fixed window → cooldown back to baseline. Models Black-Friday-style traffic spikes.	queue overflow, recovery behavior, fairness under burst
`ramp`	Step concurrency from `from` to `to` by `step_increment`, optionally feeding the per-step metrics into `analysis::breaking_point`.	finds the breaking point: the concurrency at which p99 latency blows up
`soak`	Long-duration steady load with periodic snapshots; pairs with `analysis::regression` for latency-drift and (via `ProcessSampler`) RSS-slope leak signals.	memory leaks, latency drift, throughput collapse over hours
`pattern`	Multi-step weighted-random tool-call sequences with per-pattern `think_time` and `ErrorBehavior`. Building block used directly by `sustained`.	realistic mixed workloads (explore-then-act, read-then-write)
`deadlock_probe`	initialize → tools/list → fire N `tools/call` to same tool wrapped in `hang_detect`. Bails on first deadlock to avoid flooding a wedged session.	the Vibe-Trading bug class specifically
`race_check`	Issue N identical `tools/call` and run the responses through `analysis::race_detector` (key-sorted JSON canonicalization).	non-determinism / divergent responses to identical inputs
`fuzzer`	Cycle through enumerated malformed-but-plausible payloads (unknown method, numeric method, giant payload, control chars, deep-nested, null/string params); classify each via `analysis::fuzz_report`.	parser bugs, type-confusion in method dispatch
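The soak row relies on a regression slope to separate drift from noise. A minimal sketch of that calculation, assuming an ordinary least-squares fit over (elapsed seconds, mean latency in ms) snapshots. The function names slope and drifts are hypothetical, and the actual analysis::regression module may differ in detail.

```rust
/// Least-squares slope of y over x.
/// Returns None with fewer than two samples or when all x values are equal.
fn slope(samples: &[(f64, f64)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean_x = samples.iter().map(|s| s.0).sum::<f64>() / n;
    let mean_y = samples.iter().map(|s| s.1).sum::<f64>() / n;
    // slope = covariance(x, y) / variance(x)
    let sxy: f64 = samples.iter().map(|s| (s.0 - mean_x) * (s.1 - mean_y)).sum();
    let sxx: f64 = samples.iter().map(|s| (s.0 - mean_x).powi(2)).sum();
    if sxx == 0.0 {
        None
    } else {
        Some(sxy / sxx)
    }
}

/// True when latency grows faster than `max_ms_per_sec` over the run.
fn drifts(samples: &[(f64, f64)], max_ms_per_sec: f64) -> bool {
    slope(samples).map_or(false, |m| m > max_ms_per_sec)
}
```

A slope is less sensitive to one noisy snapshot than comparing the first and last sample. The same fit applied to RSS samples would give a leak signal.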

Deferred to v0.2:

slow_mix — 80% calls to a fast tool, 20% to a deliberately-slow tool (head-of-line blocking, fairness). Approximable today by configuring a multi-step pattern with weighted tools.
reconnect — drop session mid-call (close stdin), spawn new session, retry (resilience, leftover state, zombies). Needs the session pool that lands in M8+.

Each scenario is an impl Scenario with two methods:

trait Scenario {
    async fn drive(&self, session: SessionPool, ctx: RunContext) -> ScenarioOutcome;
    fn config_schema() -> serde_json::Value;
}

9. Test matrix

Layer A — does mcp-loadtest itself work?

Mock MCP servers live in tests/fixtures/. Each is a tiny Python script. Python is chosen over Rust for its ubiquity and because it makes the test environment look like the servers users actually run.

Mock	Behavior	Tests
`mock-normal.py`	Echoes args, responds in 1ms	happy-path metrics shape
`mock-slow.py`	Tool sleeps 2s	latency histogram correctness
`mock-broken.py`	Hangs on first tools/call (replicates Vibe-Trading bug)	`deadlock_probe` correctly classifies
`mock-crash.py`	Crashes on 1% of calls	error-rate accuracy
`mock-leak.py`	Allocates 10 KB/call, never frees	`soak` scenario detects the leak
`mock-error.py`	Returns JSON-RPC errors per spec	error classification
`mock-slow-init.py`	Takes 5s to respond to `initialize`	`cold_start` measures correctly
`mock-malformed.py`	Returns invalid JSON occasionally	parser robustness

Test invariant: for each (scenario × mock) pair, the report's machine-readable summary contains expected fields with expected ranges. This is the bulk of integration tests.

Layer B — does it catch real bugs?

Snapshot test against a known-buggy commit of Vibe-Trading:

Pin to commit ~PR-85 (just before the fix)
Run deadlock_probe scenario
Assert: report flags ≥1 deadlock, identifies tools/call as the offending request

Re-run against post-fix commit:

Same scenario, expect 0 deadlocks

This is the killer demo. It goes in the README.

Layer C — cross-platform

CI matrix: ubuntu-latest, macos-latest, windows-latest × stable Rust × Python 3.13 (for fixtures).

10. Milestones (revised 2026-05-10 — head-on competition with reaatech/mcp-load-test)

Original 3-week plan replaced after discovering reaatech/mcp-load-test ships v0.1 functionality already (see §10.5 for parity matrix). v0.1.0 of mcp-loadtest must reach feature parity and surface our differentiators before re-publishing.

The repo was private through the M1-M7 development phase. v0.1.0 ships from a public repo via cargo install --git plus prebuilt GitHub Release binaries. The crates.io publish is deferred because crates.io is append-only (a published version can be yanked but never deleted), and the first release should not be locked in that way (ADR 0015, which amends the distribution channel of ADR 0004).

M1 through M7 are all shipped. Post-M7 work (spike scenario, HTML reporter, WebSocket transport, hot-path zero-copy refactor, criterion benches) is captured under [Unreleased] in CHANGELOG rather than as a new milestone — the work is small + cohesive enough that bundling it into v0.1.0 makes more sense than coining "M8" for it. The "Week N" column is dropped because milestones are no longer time-boxed — they're released.

M	Theme	Key deliverables
M1 ✓	stdio Session	`Session::spawn` → handshake → `list_tools`/`call_tool`/`shutdown`; mock-normal.py; happy-path integration test
M2 ✓	Scenarios + metrics core	`Scenario` trait; `cold_start` + `sustained` + `deadlock_probe` impls; `hang_detector` (§15.1); hdrhistogram metrics; mocks `mock-broken`/`mock-slow`/`mock-crash` + tests
M3 ✓	Reports + first internal release	TOML config; markdown / JSON / console reporters; sysinfo-based process sampling; regression test against real Vibe-Trading commit ~PR-85
M4 ✓	Transport parity	HTTP transport (StreamableHTTP); SSE transport; HTTP/SSE fixtures; transport-aware concurrency profiles
M5 ✓	Analysis parity	`breaking_point` detection; performance grading (A-F per latency/concurrency/error); realistic patterns (explore-then-act, read-then-write, multi-step) with weighted random + think-time; `soak` scenario polish; `compare-baselines` subcommand
M6 ✓	Differentiators v1	Real-time terminal TUI dashboard (live latency/throughput/RSS); server resource sampling beyond RSS (CPU, fd, threads); `race_check` scenario (backed by `analysis::race_detector`); cross-server compare (`run --server srv-a --server srv-b`)
M7 ✓	Differentiators v2 + v0.1 polish	Protocol fuzzer (basic — random/malformed payloads); coverage tracking (tools registered vs. exercised); per-tool SLO assertions; README rewrite with competitive positioning; `cargo install` smoke test on all 3 OS
Post-M7 ✓	Pre-public-release close-out	Spike scenario; HTML reporter; WebSocket transport; hot-path zero-copy refactor; criterion benches (DESIGN §19 claims now reproducible). See CHANGELOG `[Unreleased]`.
v0.1.0-rc	Pre-publish review in flight	repo back to public; `cargo install --git` + GitHub Release binaries (crates.io deferred — ADR 0015); HN/lobste.rs/r/rust announce
M8+ stretch	Beyond	AI-assisted pattern generator; distributed mode (multi-worker); replay/record; PyO3 binding

Definition of done for v0.1.0:

cargo install --git <repo-url> mcp-loadtest-cli works on Linux/macOS/Windows, and prebuilt binaries are attached to the GitHub Release (crates.io publish deferred — ADR 0015).
mcp-loadtest deadlock-probe -s "python -m vibe_trading_mcp" reproduces the original bug on commit ~PR-85.
All §10.5 parity-must-have rows are checked.
All §10.5 differentiator rows are checked.
README has side-by-side comparison table vs. reaatech, citing concrete benchmarks.

10.5 Competitive parity & differentiation matrix

reaatech/mcp-load-test as of 2026-05-10 (TS monorepo, 77 source files, ~50% of README claims fleshed out per file-size sampling).

Parity: features we must match (or ship anyway) before the repo goes public again

Feature	reaatech	mcp-loadtest target	Milestone
stdio transport	✓	✓	M1
HTTP (StreamableHTTP) transport	✓	✓	M4
SSE transport	✓	✓	M4
WebSocket transport	✗	✓	Post-M7
Latency histograms p50/p95/p99/p999 per tool	✓	✓	M2
Breaking point detection	✓	✓	M5
Performance grading A-F	✓	✓	M5
Soak / leak detection	✓	✓	M5
Spike scenario	✓	✓	Post-M7
Compare baselines	✓	✓	M5
Realistic patterns (explore-then-act, multi-step)	✓	✓	M5
Console + markdown + JSON reporters	✓	✓	M3
HTML reporter (self-contained)	✗	✓	Post-M7
Programmatic library API	✓	✓	M2/M3

Differentiators — features we have/will have that they don't

Feature	reaatech	mcp-loadtest	Why it matters
Deadlock detection (`deadlock_probe`)	✗	✓ M2	Lazy-init / async-worker bugs that break in prod. Direct response to Vibe-Trading PR #85.
Race detector	✗	✓ M6	Order-sensitive concurrent tool calls; finds protocol-level race bugs.
Real-time TUI dashboard	✗ (post-hoc only)	✓ M6	Watch perf cliff happen live during a run.
Cross-server compare (run vs N targets)	partial (compare baselines = 2 runs)	✓ M6 (1 run, N targets)	Side-by-side: vendor A vs vendor B vs your fork.
Server resource sampling (CPU/fd/threads/RSS over time)	✗ (latency only)	✓ M6	Find resource exhaustion before throughput collapses.
Protocol fuzzer (mcp-fuzz integrated)	✗	✓ M7	Random/malformed payloads; finds parser bugs unit tests miss.
Coverage tracking (registered vs exercised tools)	✗	✓ M7	Catch silently-broken tools that nobody tests in CI.
Per-tool SLO assertions	partial (global)	✓ M7	Per-tool latency/error budgets in CI.
Configurable regression thresholds	✗ (fixed)	✓ v0.1	`compare` CLI flags + `compare_runs` MCP args override p99 / error-rate / deadlock policy; defaults unchanged (ADR 0009).
Protocol-aware assertions	✗	✓ v0.1	Opt-in strict mode validates `tools/call` args vs the server's advertised `inputSchema`; mismatch → `ProtocolError` gates the run. Forward-compatible, off by default (ADR 0005/0010).
Rust perf + static binary	✗ (Node runtime required)	✓	`cargo install` → single ~5MB binary; no Node toolchain.
AI-assisted pattern generator	✗	⏳ M8 stretch	LLM reads tool schemas → generates realistic call sequences.
Distributed mode	✗	⏳ M8 stretch	Multiple workers driving one server (high-RPS targets).
Replay / record	✗	⏳ M8 stretch	Capture prod traffic, replay deterministically.
Self-hosted as MCP server (`mcp-loadtest serve --mcp`)	✗	✓ M7	AI agents (Claude, Cursor, etc.) call `deadlock_probe` / `compare` / `report` directly via MCP. Recursive: load-test an MCP using an MCP.

Strategic positioning (for README at v0.1.0)

mcp-loadtest is a load tester + bug detector for MCP servers. Match-or-exceed reaatech/mcp-load-test on every load-testing dimension, and detect classes of bugs no other tool finds: deadlocks, races, resource leaks, coverage gaps.

The README at re-publish must lead with the deadlock demo (replicated Vibe-Trading PR #85 bug, caught in 2 seconds) — not the load-testing checklist. Differentiation first; parity proves we're serious.

11. Decisions (resolved 2026-05-10)

#	Question	Decision	Rationale
1	Crate name	`mcp-loadtest` (lib) + `mcp-loadtest-cli` (bin)	descriptive, discoverable, doesn't pigeonhole to "bench"
2	License	MIT OR Apache-2.0 (dual)	Rust ecosystem standard; MIT for individuals, Apache-2.0 for corporate patent grant
3	Repo location	`github.com/Teerapat-Vatpitak/mcp-loadtest`	personal handle for v0.1; transfer to `mcp-tools/` org if/when sister projects emerge
4	MCP protocol versioning	v0.1 pin to spec v1.x, warn on mismatch; `--strict-protocol` flag for fail-on-mismatch; v0.2+ detect-and-adapt	ship v0.1 fast, add complexity when justified
5	`deadlock_probe`	both subcommand (`mcp-loadtest deadlock-probe -s "..."`) and scenario in `run --scenario deadlock_probe`	subcommand for newcomer UX, scenario for CI; near-zero implementation cost
6	Server stderr	always capture to `runs/<id>/server.stderr.log`; opt-in `--tee-stderr` to also stream	stderr critical for debugging; capture is cheap; tee opt-in to avoid CI log spam
7	Diff-vs-baseline mode	defer to M5 stretch	v0.1 emits JSON; users diff externally. Proper baseline storage + regression detection has too many edge cases for v0.1
8	Library API → 1.0	When all three: 3 months no breaking changes + 5+ external users + 1 real bug caught in wild	calendar time + adoption + value-prop validation, all required

12. Naming options (decided in §11, row 1)

mcp-loadtest — clear, no surprises
mcp-bench — implies benchmarking specifically
mcphammer — playful, memorable, but maybe too aggressive for a tool that aims to be canonical
mcptest — too generic
mcp-stress — accurate but slightly negative
lockesmith — clever ("lock-finder for MCP servers") but obscure

Author's preference: mcp-loadtest for v0.1. Rename later if needed.

13. Future work (out of scope for v0.1)

mcp-fuzz — sister project for protocol fuzzing (random/malformed payloads)
mcp-trace — record + replay tool for debugging production MCP issues
Distributed mode — multiple loadtest workers driving one server (for very high RPS targets)
GUI/web UI — render reports interactively
Plugin system — user-defined scenarios as separate crates
Public benchmark dataset — track perf of popular MCP servers over time (mcp-leaderboard)

13.1 v0.2 backlog (committed deferrals from v0.1)

Prioritized. Each item is a debt v0.1 explicitly took on; provenance in parentheses so a future planner can trace the contract. The bullets above remain the broader ecosystem horizon.

P1 — correctness / security debt promised in v0.1

cold_start real handshake-time histogram — v0.1 ships an inert placeholder; the cold_start_is_an_inert_placeholder test pins the contract so this work must update it. Needs a session-spawning factory on RunContext. (DESIGN §8; CHANGELOG [0.1.0] Tests/benches)
Result-side strict schema validation — v0.1 validates only tools/call arguments; extend the dependency-free validator to the CallToolResult payload. (ADR 0010; CHANGELOG [0.1.0] Added)
DNS-rebinding defense (resolver-pinning connector) — v0.1's SSRF guard blocks IP literals + enforces the host allowlist, but a hostname that resolves to a private IP is not blocked. (ADR 0012 "Open"; CHANGELOG [0.1.0] Security / Notes)

P2 — API / packaging hygiene due exactly at v0.2.0

Remove deprecated alias DEFAULT_LEAK_THRESHOLD_MB_PER_SEC — kept one release as an alias for DEFAULT_LATENCY_DRIFT_MS_PER_SEC; removal is a documented breaking change for v0.2.0. (CHANGELOG [0.1.0] Deprecated)
Feature-gate serve / tui behind cargo features — keep the default build slim; migrate docs/examples to show the feature flags. (CHANGELOG [0.1.0] Notes)

P3 — differentiators / ecosystem (longer horizon)

Fuzzer raw-byte payloads — needs a Transport::raw_send hook; the raw variants are documented + skipped in v0.1. (CHANGELOG [0.1.0] Added — fuzzer)
insta snapshot parity for html / terminal reporters — v0.1 asserts substring landmarks because both reporters have too much structural variance for stable snapshots. (CHANGELOG [0.1.0] Tests/benches)
Sister projects — mcp-fuzz, mcp-trace (see §13 list above). (ADR 0004 Path C)
M8+ stretch — distributed multi-worker, PyO3 binding, AI-assisted pattern generator (see §13 list above). (ADR 0004 Path C)

14. Concrete Rust types

These are the public types code will hang off. Full definitions, not sketches.

14.1 Server config

use std::collections::BTreeMap;
use std::path::PathBuf;
use std::time::Duration;

pub struct Server {
    pub command: String,
    pub args: Vec<String>,
    pub env: BTreeMap<String, String>,    // BTreeMap for stable serialization
    pub working_dir: Option<PathBuf>,
    pub transport: Transport,
    pub startup_timeout: Duration,         // default 10s
    pub shutdown_timeout: Duration,        // default 5s; SIGTERM → wait → SIGKILL
}

// `Copy` cannot be derived here: the variants own `String` / `BTreeMap` data.
#[derive(Clone, PartialEq, Eq)]
pub enum Transport {
    Stdio,
    Http { url: String, headers: BTreeMap<String, String> }, // M4 — Streamable HTTP (simple JSON variant)
    Sse { url: String, headers: BTreeMap<String, String> },  // M4
    WebSocket { url: String },                               // Post-M7
}

impl Server {
    pub fn stdio(command: impl Into<String>) -> ServerBuilder { /* ... */ }
}

14.2 Scenario

pub enum ScenarioKind {
    ColdStart {
        iterations: u32,                   // default 5
        warmup: bool,                      // discard first iter — default true
    },
    Sustained {
        concurrent: u32,
        duration: Duration,
        rate_limit: Option<u32>,           // requests/sec cap; None = unbounded
    },
    Spike {
        baseline_concurrent: u32,
        spike_concurrent: u32,
        baseline_duration: Duration,
        spike_at: Duration,
        spike_duration: Duration,
    },
    Ramp {
        from_concurrent: u32,
        to_concurrent: u32,
        duration: Duration,
    },
    DeadlockProbe {
        concurrent: u32,                   // default 20
        hang_threshold: Duration,          // default 5s
        grace_period: Duration,            // default 10s — after timeout, how long to wait for late responses
    },
    Soak {
        concurrent: u32,                   // default 4
        duration: Duration,                // default 1h
        sample_interval: Duration,         // default 10s
        latency_drift_ms_per_sec: f64,     // fail if linear-regression slope on mean latency exceeds this
    },
    // M5+ ships additional kinds not detailed here for brevity:
    //   Pattern { steps, think_time, weight, error_behavior }
    //   RaceCheck { concurrent, tool, args }
    //   Fuzzer { iterations, seed, payloads }
    // See crate::scenario::{pattern, race_check, fuzzer}.
}

pub struct Scenario {
    pub kind: ScenarioKind,
    pub tool_calls: Vec<ToolCall>,         // weighted random selection
}

pub struct ToolCall {
    pub name: String,
    pub args: serde_json::Value,
    pub weight: f64,                       // default 1.0
}
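The weight field drives the "weighted random selection" over tool_calls. A minimal sketch of one common way to do it, a cumulative-weight walk, follows. The function name pick_weighted is hypothetical, weights are assumed to be finite, and the random number r (uniform in [0, 1)) is left to the caller, for example from the rand crate, so the function stays deterministic and testable.

```rust
/// Pick an index in proportion to its weight, given `r` uniform in [0, 1).
/// Entries with non-positive weight are never chosen.
/// Returns None when no entry has positive weight.
fn pick_weighted(weights: &[f64], r: f64) -> Option<usize> {
    let total: f64 = weights.iter().filter(|w| **w > 0.0).sum();
    if total <= 0.0 {
        return None;
    }
    // Walk the cumulative distribution until the target falls in a bucket.
    let mut target = r * total;
    let mut last = None;
    for (i, w) in weights.iter().enumerate() {
        if *w <= 0.0 {
            continue;
        }
        if target < *w {
            return Some(i);
        }
        target -= *w;
        last = Some(i);
    }
    // Floating-point round-off with r close to 1: fall back to the last entry.
    last
}
```

With the weights from the §7 example (1.0 and 0.3), the first tool is picked for r below 1.0 / 1.3, that is, roughly 77% of the time.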

14.3 Run + Report

pub struct Run {
    server: Server,
    scenario: Scenario,
    thresholds: Thresholds,
    output_dir: Option<PathBuf>,
}

impl Run {
    pub fn new(server: Server, scenario: Scenario) -> Self;
    pub fn with_thresholds(self, t: Thresholds) -> Self;
    pub fn with_output_dir(self, dir: PathBuf) -> Self;
    pub async fn execute(self) -> Result<Report, RunError>;
}

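// Default is implemented by hand: deriving it would zero hang_timeout
// instead of applying the documented 5s default.
pub struct Thresholds {
    pub p50_latency: Option<Duration>,
    pub p95_latency: Option<Duration>,
    pub p99_latency: Option<Duration>,
    pub p999_latency: Option<Duration>,
    pub error_rate: Option<f64>,           // 0.0..=1.0
    pub hang_timeout: Duration,            // default 5s — used by hang_detector
    pub memory_growth_mb: Option<f64>,
}

impl Default for Thresholds {
    fn default() -> Self {
        Self {
            p50_latency: None,
            p95_latency: None,
            p99_latency: None,
            p999_latency: None,
            error_rate: None,
            hang_timeout: Duration::from_secs(5),
            memory_growth_mb: None,
        }
    }
}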

pub struct Report {
    pub run_id: String,                    // ULID
    pub started_at: SystemTime,
    pub duration: Duration,
    pub scenario_kind: ScenarioKind,
    pub server_info: ServerInfo,
    pub latency: LatencyStats,
    pub throughput: ThroughputStats,
    pub errors: ErrorStats,
    pub process: ProcessStats,
    pub deadlock_count: u32,
    pub hang_count: u32,
    pub trace_path: PathBuf,
    pub threshold_violations: Vec<ThresholdViolation>,
}

impl Report {
    pub fn passed(&self) -> bool { self.threshold_violations.is_empty() }
    pub fn write_markdown(&self, path: &Path) -> io::Result<()>;
    pub fn write_json(&self, path: &Path) -> io::Result<()>;
}

pub struct LatencyStats {
    pub histogram: hdrhistogram::Histogram<u64>,  // exposed for custom analysis
    pub p50: Duration,
    pub p95: Duration,
    pub p99: Duration,
    pub p999: Duration,
    pub min: Duration,
    pub max: Duration,
    pub mean: Duration,
    pub stddev: Duration,
    pub count: u64,
}

pub struct ThroughputStats {
    pub total_requests: u64,
    pub successful_requests: u64,
    pub requests_per_sec: f64,
    pub timeline: Vec<(Duration, u64)>,    // (offset, requests-completed-by-then) for charts
}

pub struct ErrorStats {
    pub total: u64,
    pub by_category: BTreeMap<ErrorCategory, u64>,    // see §18
}

pub struct ProcessStats {
    pub peak_rss_mb: f64,
    pub final_rss_mb: f64,
    pub avg_cpu_pct: f64,
    pub samples: Vec<ProcessSample>,
}

pub struct ProcessSample {
    pub at: Duration,                      // offset from run start
    pub rss_mb: f64,
    pub cpu_pct: f64,
}

pub struct ThresholdViolation {
    pub metric: String,                    // e.g. "p99_latency"
    pub expected: String,                  // e.g. "<= 500ms"
    pub actual: String,                    // e.g. "812ms"
}
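A minimal sketch of how Thresholds could be turned into ThresholdViolation entries, using the string formats shown in the field comments above ("<= 500ms", "812ms"). The struct here is a simplified local copy, and check_latency and check_error_rate are hypothetical helpers. The crate's real evaluation code is not specified in this document.

```rust
use std::time::Duration;

/// Simplified local copy of the §14.3 type, for illustration only.
#[derive(Debug, PartialEq)]
struct ThresholdViolation {
    metric: String,
    expected: String,
    actual: String,
}

/// Compare one observed latency against an optional limit.
fn check_latency(
    metric: &str,
    limit: Option<Duration>,
    observed: Duration,
) -> Option<ThresholdViolation> {
    let limit = limit?; // threshold not set: nothing to enforce
    if observed <= limit {
        return None;
    }
    Some(ThresholdViolation {
        metric: metric.to_string(),
        expected: format!("<= {}ms", limit.as_millis()),
        actual: format!("{}ms", observed.as_millis()),
    })
}

/// Compare the observed error fraction against an optional limit.
fn check_error_rate(limit: Option<f64>, errors: u64, total: u64) -> Option<ThresholdViolation> {
    let limit = limit?;
    if total == 0 {
        return None; // no requests completed, so no rate to judge
    }
    let rate = errors as f64 / total as f64;
    if rate <= limit {
        return None;
    }
    Some(ThresholdViolation {
        metric: "error_rate".to_string(),
        expected: format!("<= {limit}"),
        actual: format!("{rate:.4}"),
    })
}
```

Collecting the Some results into threshold_violations gives Report::passed() its meaning: the run passes only when every configured limit holds.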

14.4 Errors

#[derive(thiserror::Error, Debug)]
pub enum RunError {
    #[error("server failed to start: {0}")]
    ServerStart(io::Error),

    #[error("server exited unexpectedly with code {0:?}")]
    ServerExit(Option<i32>),

    #[error("initialize handshake failed: {0}")]
    Handshake(String),

    #[error("server stderr: {0}")]
    ServerStderr(String),

    #[error("config invalid: {0}")]
    Config(String),

    #[error("io: {0}")]
    Io(#[from] io::Error),

    #[error("internal: {0}")]
    Internal(String),
}

15. Algorithm specs

The detection logic is the IP of this tool. Spec'd precisely so any implementer can reproduce.

15.1 Hang detector

Per-call watchdog. Wraps every tools/call request:

Algorithm: hang_detector(req, threshold)
1. Record send_at = now().
2. Send req to server.
3. Spawn watchdog task with timer = threshold.
4. Race: watchdog completes OR response arrives.
5. If response arrives first:
     duration = now() - send_at
     return Ok((response, duration))
6. If watchdog completes first:
     mark request_id as HUNG
     continue listening for late response (up to grace_period)
     if late response arrives: classify as LATE (not HUNG)
     if no response within grace_period: classify as DEADLOCK
     return Err(Hang { request_id, hung_for })

Hang ≠ deadlock. Hang means "no response within hang_threshold". Deadlock means "no response within hang_threshold + grace_period" — i.e. server appears genuinely stuck, not just slow.
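A minimal sketch of the two-phase wait above, using only the standard library. Assumptions: a `std::sync::mpsc` channel stands in for the response stream of one request, and the names `CallOutcome` and `classify_call` are hypothetical, not the crate's API. The `Disconnected` variant is added so that a closed channel is not mistaken for a deadlock.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical outcome type; the labels follow §15.1.
#[derive(Debug, PartialEq)]
pub enum CallOutcome {
    /// Response arrived within `hang_threshold`.
    Success(Duration),
    /// Response arrived after `hang_threshold` but within `grace_period`.
    Late(Duration),
    /// No response after `hang_threshold + grace_period`.
    Deadlock,
    /// The response channel closed before any response arrived.
    Disconnected,
}

/// Classifies one in-flight call. `rx` stands in for the response stream of a
/// single request, which is assumed to have been sent just before this call.
pub fn classify_call<T>(
    rx: &Receiver<T>,
    hang_threshold: Duration,
    grace_period: Duration,
) -> (CallOutcome, Option<T>) {
    let send_at = Instant::now();

    // Steps 3-5: race the watchdog timer against the response.
    match rx.recv_timeout(hang_threshold) {
        Ok(resp) => return (CallOutcome::Success(send_at.elapsed()), Some(resp)),
        Err(RecvTimeoutError::Disconnected) => return (CallOutcome::Disconnected, None),
        Err(RecvTimeoutError::Timeout) => {} // watchdog fired: the call is now HUNG
    }

    // Step 6: keep listening for a late response, up to grace_period.
    match rx.recv_timeout(grace_period) {
        Ok(resp) => (CallOutcome::Late(send_at.elapsed()), Some(resp)),
        Err(RecvTimeoutError::Timeout) => (CallOutcome::Deadlock, None),
        Err(RecvTimeoutError::Disconnected) => (CallOutcome::Disconnected, None),
    }
}
```

The real driver is likely async; the same shape applies with an async timeout in place of `recv_timeout`.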

15.2 Deadlock probe scenario

The Vibe-Trading-bug-class detector. Specific call sequence designed to reproduce lazy-init races.

Algorithm: deadlock_probe(server, tool, N, hang_threshold)
1. Spawn server. Record startup_duration = time-to-stdout-EOF or initialize-response.
2. Send `initialize`. Await with timeout = startup_timeout. (fails → SERVER_INIT_ERROR)
3. Send `notifications/initialized`.
4. Send `tools/list`. Await with timeout = 1s. (fails → TOOLS_LIST_HANG)
5. Synchronization barrier — all N tasks ready to send concurrently.
6. Release barrier. All N tasks send `tools/call` to `tool` simultaneously.
7. Each task: hang_detector(req, hang_threshold).
8. After all N return (Ok or Err): wait grace_period.
9. Categorize each:
     - Ok with duration → SUCCESS
     - Late response within grace_period → SLOW
     - No response after grace_period → DEADLOCK
10. Send shutdown notification, wait shutdown_timeout, kill if needed.
11. Report:
     - if DEADLOCK count > 0 → severity=CRITICAL, "DEADLOCK DETECTED"
     - else if SLOW > 0.5 * N → severity=WARNING, "concurrency degrades latency"
     - else → severity=PASS

The barrier in steps 5-6 is critical. Without it, requests serialize naturally and lazy-init bugs stay hidden. The barrier forces real concurrency at the point of greatest stress.
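The barrier release in steps 5-6 can be sketched with `std::sync::Barrier`. This is an illustration under assumptions: OS threads stand in for the driver's tasks, and `fire_concurrently` is a hypothetical helper name. In a real probe, the `call` closure would send one `tools/call` and run the hang detector.

```rust
use std::sync::{Arc, Barrier};
use std::thread;

/// Runs `n` workers that all park on a barrier and then invoke `call`
/// at approximately the same instant. Results are returned in worker order.
pub fn fire_concurrently<F, R>(n: usize, call: F) -> Vec<R>
where
    F: Fn(usize) -> R + Send + Sync + 'static,
    R: Send + 'static,
{
    let barrier = Arc::new(Barrier::new(n));
    let call = Arc::new(call);

    let handles: Vec<_> = (0..n)
        .map(|i| {
            let barrier = Arc::clone(&barrier);
            let call = Arc::clone(&call);
            thread::spawn(move || {
                // Step 5: every worker blocks here until all n are ready.
                barrier.wait();
                // Step 6: the barrier releases and all workers proceed together.
                call(i)
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Without the `barrier.wait()` line, thread start-up jitter alone can stagger the calls enough to let a lazy-init finish before the second request arrives.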

15.3 Leak detector

Algorithm: leak_detector(server, scenario, sample_interval, growth_threshold_mb)
1. Run sustained scenario. Concurrently:
2. Every sample_interval, sample server's RSS via sysinfo.
3. After scenario completes:
4. Fit linear regression: rss_mb = a * t + b, where t in seconds
5. Predicted total growth = a * scenario.duration_secs
6. If predicted_growth > growth_threshold_mb:
     classify as LEAK_DETECTED
     report: slope (MB/sec), R² (fit quality), samples
7. R² < 0.5 → "noisy, can't conclude" — report as INDETERMINATE

Caveat: warmup-and-stabilize matters. First 30s of samples are discarded by default to avoid false positives from JIT / lazy-load.
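A sketch of the fit and classification, assuming samples arrive as `(t_secs, rss_mb)` pairs. The names `LeakVerdict`, `LeakFit`, and `detect_leak` are hypothetical here and are not claimed to match `Soak::detect_leak`. One interpretation is assumed: the R² check only applies once predicted growth exceeds the threshold, so growth below the threshold is reported as no leak regardless of fit quality.

```rust
#[derive(Debug, PartialEq)]
pub enum LeakVerdict {
    LeakDetected,
    NoLeak,
    Indeterminate,
}

#[derive(Debug)]
pub struct LeakFit {
    pub slope_mb_per_sec: f64,
    pub r_squared: f64,
    pub predicted_growth_mb: f64,
    pub verdict: LeakVerdict,
}

/// Ordinary least-squares fit of rss_mb = a * t + b over the post-warmup samples.
/// Returns None when there are too few samples to fit a line.
pub fn detect_leak(
    samples: &[(f64, f64)],
    warmup_secs: f64,
    duration_secs: f64,
    growth_threshold_mb: f64,
) -> Option<LeakFit> {
    // Discard warmup samples (JIT, lazy loading) before fitting.
    let kept: Vec<(f64, f64)> = samples
        .iter()
        .copied()
        .filter(|&(t, _)| t >= warmup_secs)
        .collect();
    if kept.len() < 2 {
        return None;
    }

    let n = kept.len() as f64;
    let mean_t = kept.iter().map(|&(t, _)| t).sum::<f64>() / n;
    let mean_y = kept.iter().map(|&(_, y)| y).sum::<f64>() / n;

    let (mut sxx, mut sxy, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for &(t, y) in &kept {
        sxx += (t - mean_t) * (t - mean_t);
        sxy += (t - mean_t) * (y - mean_y);
        syy += (y - mean_y) * (y - mean_y);
    }
    if sxx == 0.0 {
        return None; // all samples share one timestamp
    }

    let slope = sxy / sxx;
    // A perfectly flat series has zero variance; treat it as a perfect fit.
    let r_squared = if syy == 0.0 { 1.0 } else { (sxy * sxy) / (sxx * syy) };
    let predicted = slope * duration_secs;

    let verdict = if predicted <= growth_threshold_mb {
        LeakVerdict::NoLeak
    } else if r_squared < 0.5 {
        LeakVerdict::Indeterminate
    } else {
        LeakVerdict::LeakDetected
    };

    Some(LeakFit {
        slope_mb_per_sec: slope,
        r_squared,
        predicted_growth_mb: predicted,
        verdict,
    })
}
```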

15.4 Threshold evaluator

Algorithm: evaluate_thresholds(report, thresholds)
For each threshold field that is Some:
  compare report's metric to threshold
  if violated: append ThresholdViolation { metric, expected, actual }
Return: violations vec — empty means PASS.

Simple, but worth specifying so the report's passed() is unambiguous.
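A sketch of the evaluator for two metrics. `ThresholdViolation` mirrors the struct in §14.3. The `Thresholds` and `Observed` types, their field names, and the `error_rate` slug are hypothetical stand-ins for the real definitions.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub struct ThresholdViolation {
    pub metric: String,
    pub expected: String,
    pub actual: String,
}

/// Hypothetical threshold set: a `None` field means "not checked".
#[derive(Default)]
pub struct Thresholds {
    pub max_p99_latency: Option<Duration>,
    pub max_error_rate: Option<f64>,
}

/// Hypothetical slice of the report that the thresholds are compared against.
pub struct Observed {
    pub p99_latency: Duration,
    pub error_rate: f64,
}

/// Returns one violation per exceeded threshold; an empty vec means PASS.
pub fn evaluate_thresholds(observed: &Observed, thresholds: &Thresholds) -> Vec<ThresholdViolation> {
    let mut violations = Vec::new();

    if let Some(limit) = thresholds.max_p99_latency {
        if observed.p99_latency > limit {
            violations.push(ThresholdViolation {
                metric: "p99_latency".to_string(),
                expected: format!("<= {}ms", limit.as_millis()),
                actual: format!("{}ms", observed.p99_latency.as_millis()),
            });
        }
    }

    if let Some(limit) = thresholds.max_error_rate {
        if observed.error_rate > limit {
            violations.push(ThresholdViolation {
                metric: "error_rate".to_string(),
                expected: format!("<= {:.2}%", limit * 100.0),
                actual: format!("{:.2}%", observed.error_rate * 100.0),
            });
        }
    }

    violations
}
```

With this shape, `passed()` reduces to checking that the returned vec is empty, which matches the definition in §14.3.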

16. Mock server specs

Mocks live in tests/fixtures/<name>.py. Each is < 50 lines of Python — minimal MCP server using stdio + JSON-RPC by hand (no fastmcp dep, to avoid version coupling). Shipped fixtures: mock-normal.py, mock-slow.py, mock-broken.py, mock-crash.py, mock-leak.py, mock-error.py, mock-slow-init.py, mock-malformed.py, plus mock-http-server.py and mock-sse-server.py (transport parity coverage). Pseudocode for each below.

16.1 mock-normal.py

# Echoes args, responds in 1ms. Reference implementation.
import json
import sys

# respond() is assumed to come from the shared helpers in _common.py.
while True:
    line = sys.stdin.readline()
    if not line:                      # EOF: client closed stdin, exit cleanly
        break
    msg = json.loads(line)
    method = msg.get("method")        # tolerate messages without a method field
    if method == "initialize":
        respond({"protocolVersion":"...", "capabilities":{...}})
    elif method == "tools/list":
        respond({"tools":[{"name":"echo","inputSchema":{...}}]})
    elif method == "tools/call":
        respond({"content":[{"type":"text","text":json.dumps(msg["params"]["arguments"])}]})

16.2 mock-slow.py

Same as mock-normal, but tools/call does time.sleep(2) before responding. Used to verify latency histogram correctness (p99 should be ~2s).

16.3 mock-broken.py

# Replicates Vibe-Trading lazy-init deadlock pattern.
# initialize and tools/list work; first tools/call hangs forever.
calls_made = 0
while True:
    msg = json.loads(sys.stdin.readline())
    if msg["method"] in ("initialize", "tools/list"):
        respond_normally()
    elif msg["method"] == "tools/call":
        # The bug: a blocking import inside the worker, simulated here with a sleep.
        if calls_made == 0:
            calls_made += 1
            time.sleep(999999)              # never returns in practice: stands in for the deadlock
        else:
            respond_normally()

deadlock_probe against this MUST report deadlock_count >= 1.

16.4 mock-crash.py

# Crashes on ~1% of calls (random.random() < 0.01). Tests error-rate accuracy.
# Crash = exit(1), not JSON-RPC error.

16.5 mock-http-server.py

# Streamable HTTP transport fixture. Stdlib http.server only — no fastapi/etc.
# Used by HttpTransport integration tests.

16.6 mock-sse-server.py

# HTTP+SSE transport fixture. Endpoint handshake + id-correlated responses.
# Stdlib http.server only. Used by SseTransport integration tests.

16.7 mock-leak.py

# Allocates 10 KB per tools/call into a module-global list. Never frees.
# Tests leak detector — slope should be ~10KB × rps.
# Today leak/drift signals are exercised via `Soak::detect_leak` over synthetic
# (t, rss) series; a real leaking fixture is still useful for end-to-end coverage.

16.8 mock-error.py

# Returns JSON-RPC errors per spec: -32601 method not found,
# -32602 invalid params, -32603 internal error.
# Cycles through error codes per call. Tests error classification (§18).

16.9 mock-slow-init.py

# Sleeps 5s on `initialize` before responding. Tests cold_start measurement.

16.10 mock-malformed.py

# Returns invalid JSON every 10th response (truncated, missing field).
# Tests parser robustness — should classify as MALFORMED_RESPONSE not crash.

All mocks share common framing helpers in tests/fixtures/_common.py (read frame, write frame, respond ok/err).

17. Output format spec

17.1 Trace format (`trace.jsonl`)

One JSON object per line. Schema:

{
  "ts": 0.0,                              // seconds since run start (f64)
  "kind": "request|response|error|hang|deadlock|process_sample|scenario_event",
  "request_id": 123,                      // matches JSON-RPC id, present for request/response/error/hang/deadlock
  "method": "tools/call",                 // present for request
  "params": {...},                        // present for request (compact, can be large)
  "result": {...},                        // present for response (truncated to 1KB by default)
  "error": {"category": "...", "message": "...", "code": -32603},  // present for error
  "duration_ms": 12.5,                    // present for response/error
  "rss_mb": 45.2,                         // present for process_sample
  "cpu_pct": 12.3                         // present for process_sample
}

Stream-friendly. Can be processed with jq or any line-oriented tool.

17.2 metrics.json

{
    "run_id": "01HXY...",
    "started_at": "2026-05-10T07:30:00Z",
    "duration_secs": 60.0,
    "scenario": {
        "kind": "Sustained",
        "concurrent": 50,
        "duration_secs": 60.0
    },
    "latency_ms": {
        "p50": 12.3,
        "p95": 45.6,
        "p99": 123.4,
        "p999": 456.7,
        "min": 1.2,
        "max": 999.9,
        "mean": 23.4,
        "stddev": 18.7,
        "count": 12345
    },
    "throughput": {
        "total_requests": 12345,
        "successful_requests": 12300,
        "requests_per_sec": 205.75
    },
    "errors": {
        "total": 45,
        "by_category": {
            "Hang": 0,
            "Timeout": 5,
            "ServerError": 30,
            "ProtocolError": 10,
            "Crash": 0,
            "Malformed": 0
        }
    },
    "process": {
        "peak_rss_mb": 156.3,
        "final_rss_mb": 142.1,
        "avg_cpu_pct": 23.4
    },
    "deadlock_count": 0,
    "hang_count": 0,
    "threshold_violations": [
        { "metric": "p99_latency", "expected": "<=100ms", "actual": "123.4ms" }
    ],
    "passed": false
}

On the Rust side, metric is a ThresholdKind enum (crate::report::ThresholdKind); serde serializes it to the string slug shown here via #[serde(rename = "metric")] plus per-variant snake_case names, so the wire format stays stable across refactors.

JSON Schema published at schema/metrics.v1.json for downstream tooling.
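The sample values above are internally consistent, which suggests how the derived fields relate. The sketch below assumes that `requests_per_sec` is total requests over run duration and that the error rate shown in reports is `errors.total / total_requests`. Both function names are hypothetical.

```rust
/// Requests per second over the whole run.
pub fn requests_per_sec(total_requests: u64, duration_secs: f64) -> f64 {
    total_requests as f64 / duration_secs
}

/// Fraction of requests that ended in any error category (0.0 for an empty run).
pub fn error_rate(error_total: u64, total_requests: u64) -> f64 {
    if total_requests == 0 {
        0.0
    } else {
        error_total as f64 / total_requests as f64
    }
}
```

For the sample run, 12345 requests over 60.0 s give 205.75 req/s, and 45 errors out of 12345 requests give an error rate of about 0.36%.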

17.3 report.md template

# Run {run_id}

**Status:** ❌ FAIL (1 threshold violation)
**Server:** `python -m vibe_trading_mcp`
**Scenario:** Sustained, 50 concurrent, 60s
**Started:** 2026-05-10 07:30:00 UTC

## Summary

- Total requests: 12,345
- Throughput: 205.75 req/s
- Error rate: 0.36%
- Deadlocks: 0 Hangs: 0

## Latency

| p50    | p95    | p99            | p999    | max     |
| ------ | ------ | -------------- | ------- | ------- |
| 12.3ms | 45.6ms | **123.4ms** ❌ | 456.7ms | 999.9ms |

(latency histogram ASCII chart here)

## Errors

| Category      | Count |
| ------------- | ----- |
| ServerError   | 30    |
| ProtocolError | 10    |
| Timeout       | 5     |

## Process

Peak RSS: 156.3 MB · Final RSS: 142.1 MB · Avg CPU: 23.4%

## Threshold violations

- ❌ **p99_latency**: expected ≤100ms, got 123.4ms

## Trace

Full trace: `./trace.jsonl` (12,345 events, 8.2 MB)

18. Error taxonomy

Every failure is classified into exactly one category. Used for ErrorStats.by_category and reporting.

| Category       | Definition                                                                               | Example                              |
| -------------- | ---------------------------------------------------------------------------------------- | ------------------------------------ |
| `Hang`         | No response within `hang_threshold`, but a response arrived before `grace_period` expired | tool genuinely slow under contention |
| `Deadlock`     | No response after `hang_threshold + grace_period`                                        | Vibe-Trading PR #85                  |
| `Timeout`      | Client-side configured deadline exceeded (separate from `hang_threshold`)                | network buffer full                  |
| `ServerError`  | JSON-RPC error response with `code` in `[-32099..=-32000]` (server-defined)              | tool returned business error         |
| `ProtocolError` | JSON-RPC error with `code` in `[-32603..=-32600]` (transport / spec violations)         | malformed request rejected           |
| `Crash`        | Server process exited (non-zero or signal) during call                                   | unhandled panic                      |
| `Malformed`    | Response was not valid JSON or didn't match JSON-RPC schema                              | partial response, broken framing     |
| `Disconnected` | Transport closed unexpectedly mid-call                                                   | broken pipe                          |
| `Cancelled`    | Client cancelled the request before response                                             | scenario shutdown                    |

Classification precedence: the terminal event wins. A request that hangs and whose server then crashes is classified as Crash, but trace.jsonl records both the hang and the crash events for forensics.
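The two code-based categories can be sketched as a single match on the JSON-RPC error code. This is a sketch under assumptions: `classify_jsonrpc_code` is a hypothetical helper, the enum is reduced to the two relevant variants, and codes outside both ranges return `None` because the taxonomy above does not assign them.

```rust
/// Reduced version of the error category enum, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ErrorCategory {
    ServerError,
    ProtocolError,
}

/// Maps a JSON-RPC error code to a category using the ranges in the table.
pub fn classify_jsonrpc_code(code: i64) -> Option<ErrorCategory> {
    match code {
        // Server-defined range.
        -32099..=-32000 => Some(ErrorCategory::ServerError),
        // Invalid request, method not found, invalid params, internal error.
        -32603..=-32600 => Some(ErrorCategory::ProtocolError),
        // Not covered by the table (for example -32700, parse error).
        _ => None,
    }
}
```

Rust range patterns require the lower bound first, so the ranges are written from the most negative code upward.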

19. Performance targets for the tool itself

mcp-loadtest should never be the bottleneck.

| Aspect                                         | Target                                |
| ---------------------------------------------- | ------------------------------------- |
| Driver per-request CPU overhead                | < 50µs (excluding JSON serialization) |
| Memory per concurrent worker                   | < 100KB                               |
| Max sustainable concurrency on a 4-core laptop | ≥ 1000 workers                        |
| Trace file write throughput                    | ≥ 100k events/sec                     |
| Histogram update                               | lock-free per-worker, merged at end   |

These are tested in benches/ (criterion). v0.1 ships with reproducible numbers in the README.
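The "lock-free per-worker, merged at end" histogram target can be illustrated with plain threads. This sketch is not the hdrhistogram-based implementation: `LocalHistogram` is a hypothetical power-of-two bucket counter, and the latencies are synthetic. The point is that each worker owns its histogram outright, so the hot path touches no lock or atomic. Merging happens once, after the workers finish.

```rust
use std::thread;

const BUCKETS: usize = 64;

/// Hypothetical stand-in for a per-worker latency histogram.
#[derive(Clone)]
pub struct LocalHistogram {
    counts: [u64; BUCKETS],
}

impl LocalHistogram {
    pub fn new() -> Self {
        Self { counts: [0; BUCKETS] }
    }

    /// Records one latency (microseconds) into a power-of-two bucket.
    pub fn record(&mut self, micros: u64) {
        let idx = (u64::BITS - micros.leading_zeros()) as usize;
        self.counts[idx.min(BUCKETS - 1)] += 1;
    }

    /// Adds another histogram's counts into this one.
    pub fn merge(&mut self, other: &LocalHistogram) {
        for (mine, theirs) in self.counts.iter_mut().zip(other.counts.iter()) {
            *mine += *theirs;
        }
    }

    pub fn total(&self) -> u64 {
        self.counts.iter().sum()
    }
}

/// Each worker records into its own histogram; results are merged after join.
pub fn run_workers(workers: usize, calls_per_worker: u64) -> LocalHistogram {
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            thread::spawn(move || {
                let mut local = LocalHistogram::new();
                for i in 0..calls_per_worker {
                    // Synthetic latency; a real worker would time a tools/call here.
                    local.record(100 + (w as u64) * 10 + i % 7);
                }
                local
            })
        })
        .collect();

    let mut merged = LocalHistogram::new();
    for h in handles {
        merged.merge(&h.join().expect("worker panicked"));
    }
    merged
}
```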

20. Versioning + stability policy

v0.x: API can change anywhere
v1.0: locked. Breaking changes require major version bump (semver strict)
MCP spec: protocol_version field in initialize is checked. Mismatch warns but does not fail by default. Override with --strict-protocol.
Library MSRV (minimum supported Rust version): stable - 2 (e.g. if 1.85 is current stable, MSRV is 1.83).

When to commit to 1.0:

After 3 months of v0.x with no breaking changes
After 5+ external users have integrated
After at least 1 real bug caught in the wild and reported back

21. AI-friendliness (design pillar)

mcp-loadtest is a tool that AI agents will operate in two ways: autonomously (Claude Code running CI) and on a developer's behalf (developers asking Claude "load-test my MCP server"). Design accordingly.

21.1 First-class library API for embedding in agent tools

All public types have #[derive(Debug, Serialize, Deserialize)] so they're trivially JSON-able.
The library API is documented with rustdoc examples that compile (doctested in CI). LLMs read these examples to build correct calls on the first try.
No "you must construct in this exact order" sequencing — builders are commutative where possible.

21.2 Self-hosted MCP server: `mcp-loadtest serve --mcp`

The single most important AI-friendly feature. mcp-loadtest exposes itself as an MCP server with these tools:

| Tool               | Args                                                            | Returns                                              |
| ------------------ | --------------------------------------------------------------- | ---------------------------------------------------- |
| `deadlock_probe`   | `server_command`, `tool`, `concurrent`                          | `{ deadlock_count, hung_for_ms[], details }`         |
| `sustained_load`   | `server_command`, `concurrent`, `duration_secs`, `tool`, `args` | `{ p50_ms, p99_ms, error_rate, requests_per_sec }`   |
| `compare_runs`     | `baseline_run_dir`, `current_run_dir`                           | structured diff with regression flags                |
| `report_summary`   | `run_dir`                                                       | markdown summary string                              |
| `list_recent_runs` | `limit`                                                         | run dirs with metadata                               |

A user can say to Claude / Cursor / any MCP-aware agent: "Find deadlocks in my new MCP server at python -m foo" — and the agent calls deadlock_probe directly. No human-in-the-loop required to spawn a child process and parse stdout — the agent gets structured JSON back.

Reaatech doesn't do this. It's our most under-priced differentiator.

21.3 Actionable error messages with hints

Every Err returned to the user includes a suggested next step:

Error: server stdin closed unexpectedly during initialize handshake.
Hint: server may have crashed before responding. Check stderr at:
      runs/01HXY.../server.stderr.log
      Or re-run with --tee-stderr to see it live.

vs. the bad version:

Error: BrokenPipe(Os { code: 32, ... })

LLMs (and humans) act on the first; bounce off the second.
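One way to get the first shape is an error type that carries its hints and renders them in `Display`. This is a minimal sketch. `HintedError` is a hypothetical name, and the layout (an `Error:` line, a `Hint:` line, continuation lines indented to match) is inferred from the example above rather than taken from the crate.

```rust
use std::fmt;

/// Hypothetical error wrapper: a message plus suggested next steps.
#[derive(Debug)]
pub struct HintedError {
    pub message: String,
    pub hints: Vec<String>,
}

impl fmt::Display for HintedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Error: {}", self.message)?;
        for (i, hint) in self.hints.iter().enumerate() {
            if i == 0 {
                // The first hint line carries the "Hint:" label.
                write!(f, "\nHint: {}", hint)?;
            } else {
                // Later lines are indented to align under the first hint.
                write!(f, "\n      {}", hint)?;
            }
        }
        Ok(())
    }
}

impl std::error::Error for HintedError {}
```

A low-level failure such as a broken pipe would be wrapped into this type at the boundary where the context (which phase, which log file) is still known.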

21.4 `--explain` flag on every subcommand

$ mcp-loadtest deadlock-probe --explain
Algorithm:
  1. Spawn server process.
  2. Send `initialize`. Wait up to startup_timeout (default 10s).
  3. Send `notifications/initialized`.
  4. Send `tools/list`. Wait up to 1s.
  5. Synchronization barrier — N concurrent `tools/call` ready to fire.
  6. Release barrier. All N calls fire in parallel.
  7. Each call wrapped in hang_detect(hang_threshold=5s, grace_period=10s):
     - response within hang_threshold → SUCCESS
     - response between threshold and grace_period → SLOW (warning)
     - no response after grace_period → DEADLOCK (critical)
  8. Report aggregated results.

Tunable knobs: --concurrent, --hang-threshold, --grace-period.
See DESIGN.md §15.2 for the spec source.

LLMs use this to plan the right invocation. Reduces "I tried it but it didn't do what I expected" loops.

21.5 JSON Schema published for config + outputs

schema/config.v1.json and schema/metrics.v1.json shipped at well-known paths. LLMs validate generated configs / parse outputs without guessing field shapes.

21.6 `mcp-loadtest doctor`

Diagnoses common setup issues:

Python interpreter not on PATH (for fixture-based tests).
MSVC vs GNU toolchain mismatch on Windows.
Stale runs/ accumulation.
MCP server fails initialize — captures stderr and reports.

Outputs a checklist with ✅/❌ per item and a one-line fix per ❌. Exactly the kind of thing an LLM agent can chain into a fix-it loop.

21.7 Trace format is LLM-readable

runs/<id>/trace.jsonl is line-oriented JSON with stable field names (DESIGN.md §17.1). Pipeable through jq, parseable by any agent without custom code:

$ jq 'select(.kind=="hang")' runs/01HXY.../trace.jsonl

21.8 Reports include "What this means" interpretation

A report that says p99 latency: 234ms is data. A report that adds "95% of users would call this acceptable; the slow tail (top 1%) is concentrated on analyze_options calls" is information. We aim for the latter — derived sentences, not just numbers.

21.9 Snapshot tests for output formats

insta::assert_snapshot! on report markdown / JSON. Output shapes are stable across releases unless explicitly changed (with CHANGELOG entry). LLM agents that parse our output don't break across patch versions.

21.10 Cookbook in `docs/examples/`

Per-scenario copy-pasteable commands + expected output. LLMs train on README-style examples; cookbook entries make those examples concrete and executable.

Examples to ship at v0.1.0:

"Find deadlocks in my new MCP server"
"Add a regression gate to my CI"
"Compare two implementations of the same MCP server"
"Detect a memory leak before production"
