mcp-loadtest — Design Document

Status: v0.1.0-rc (2026-05-11)
Author: Teerapat Vatpitak
Reviewers: (pending)


1. Motivation

The Model Context Protocol ecosystem is exploding — new MCP servers ship weekly across Python, Node, Rust. But MCP servers fail in ways unit tests don't catch:

  • Lazy-init deadlocks inside async worker threads
  • Race conditions when concurrent tools/call arrive before tools/list completes
  • Memory leaks under sustained load
  • Hangs that look like work-in-progress
  • Subtle protocol violations only visible at scale

Canonical motivating case

The author hit a deadlock in HKUDS/Vibe-Trading where:

  • initialize → ✅ worked
  • tools/list → ✅ worked
  • tools/call → ❌ hung forever

Root cause: _get_registry() lazy-init inside the FastMCP asyncio worker thread, blocking on import src.tools.shell.*. Standard pytest didn't catch this — the bug only surfaces when a real client opens a session and calls a tool through stdio.

The fix took ~5 lines (PR #85) and a regression smoke test (PR #86) — but finding the bug took hours of differential testing because no purpose-built tool exists for stress-testing MCP servers.

The gap

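There are excellent tools for HTTP load testing (k6, vegeta, wrk2), gRPC (ghz), and GraphQL (autocannon plugins). When this project started, the author knew of no equivalent for MCP (reaatech/mcp-load-test, found later, is covered in §10.5); ad-hoc Python scripts and pytest fixtures were the usual approach.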

mcp-loadtest aims to be the canonical tool: language-agnostic, transport-agnostic, with built-in scenarios for the bug classes that actually occur.


2. Goals & Non-Goals

Goals

  • Detect deadlocks, hangs, livelocks under realistic concurrent load
  • Measure latency (p50/p95/p99), throughput, error rate
  • Work against any MCP server regardless of language / transport (stdio, HTTP, SSE, WebSocket)
  • Library mode (Rust crate) for embedding in CI tests
  • CLI mode for ad-hoc smoke tests and benchmarks
  • Cross-platform (Linux, macOS, Windows — author runs Windows so this is a hard requirement, not aspirational)
  • Zero-config quick-start: mcp-loadtest probe -s "python -m my_mcp" should just work

Non-Goals

  • Not a replacement for unit tests. Different problem.
  • Not a tool for testing MCP clients. Client-side bugs are a separate domain.
  • Not validating tool output correctness. We test protocol-level behavior. If your tool returns wrong data, that's not what we catch.
  • Not a general-purpose fuzzer. The built-in fuzzer scenario (§8) cycles through a fixed list of malformed payloads; a generative protocol fuzzer is a different design, possibly a future sister project (mcp-fuzz).
  • Not a benchmark suite. We provide infrastructure to bench, not a curated set of "official" benchmarks.

3. Background

MCP protocol (relevant subset)

JSON-RPC 2.0 framing over one of four transports:

  • stdio — line-delimited JSON over child process stdin/stdout (most common, all examples in this doc focus here)
  • HTTP — Streamable HTTP (simple JSON variant); request via POST, simple JSON response
  • HTTP+SSE — request via POST, server pushes events via SSE channel
  • WebSocket — bidirectional frames

Lifecycle (stdio):

client → server   {"method":"initialize", "params":{...}}
client ← server   {"result":{"protocolVersion":...,"capabilities":{...}}}
client → server   {"method":"notifications/initialized"}    # one-way notif
client → server   {"method":"tools/list"}
client ← server   {"result":{"tools":[{...},...]}}
client → server   {"method":"tools/call", "params":{"name":"X","arguments":{...}}}
client ← server   {"result":{"content":[{...}]}}
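
The lifecycle listing above leaves out the JSON-RPC envelope. As a minimal sketch of what one line on the stdio transport might look like, the helpers below add the "jsonrpc" and "id" fields and the trailing newline. The names frame_request and frame_notification are hypothetical, the params argument is assumed to be already-serialized JSON, and method names are assumed to need no escaping; the crate itself lists serde_json and LinesCodec for this job (§4).

```rust
/// Frame a JSON-RPC 2.0 request as a single line for the stdio transport.
/// `params_json` must already be valid JSON text (assumption of this sketch).
fn frame_request(id: u64, method: &str, params_json: &str) -> String {
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{id},\"method\":\"{method}\",\"params\":{params_json}}}\n"
    )
}

/// Frame a notification. Notifications carry no `id`, so no response is expected.
fn frame_notification(method: &str) -> String {
    format!("{{\"jsonrpc\":\"2.0\",\"method\":\"{method}\"}}\n")
}
```

The id is what lets a session match a response to its request when several tools/call requests are in flight at once.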

Bug classes we target

| Class | Example | Why hard to catch in unit tests |
| --- | --- | --- |
| Lazy-init deadlock | Vibe-Trading PR #85 | Bug only surfaces with full subprocess + protocol handshake |
| Concurrent tool-call race | tools/call before tools/list completes | Need real concurrency; mocked async ≠ real async |
| Resource exhaustion | 1000 concurrent calls → fd / mem leak | Need sustained load |
| Slow-tool head-of-line | One slow tool blocks queue | Need mixed workload |
| Reconnect / mid-call kill | Connection drops between request and response | Hard to simulate without tooling |
| Notification ordering | Server sends notifications/cancelled mid-call | Need sequence-aware client |

4. Architecture

High-level

┌─────────────────────────────────────────────────────────────┐
│                       CLI / Library                          │
│  - parse args / config                                       │
│  - construct Run                                             │
│  - print report                                              │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    Run (orchestrator)                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ServerManager │  │ ScenarioImpl │  │     Reporter     │  │
│  │              │  │              │  │                  │  │
│  │ spawn(),     │  │  N tokio     │  │  hdrhistogram    │  │
│  │ kill(),      │  │  worker      │  │  process stats   │  │
│  │ rss/cpu      │  │  tasks       │  │  → markdown/json │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────────────┘  │
│         │                  │                                 │
│         └──────┬───────────┘                                 │
│                ▼                                             │
│  ┌─────────────────────────────────┐                        │
│  │       ProtocolSession           │                        │
│  │   stdio framing + JSON-RPC      │                        │
│  │   per-call timeout (hang det.)  │                        │
│  │   request/response correlation  │                        │
│  └─────────────────────────────────┘                        │
└─────────────────────────────────────────────────────────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │  Server subprocess  │  (under test — any language)
        └─────────────────────┘

Module layout (src/)

src/
├── main.rs                  # CLI entry (clap)
├── lib.rs                   # public crate API
├── config.rs                # TOML schema + parsing
├── protocol/
│   ├── mod.rs               # re-exports
│   ├── jsonrpc.rs           # JSON-RPC 2.0 framing
│   ├── mcp.rs               # MCP request/response types
│   ├── session.rs           # ProtocolSession (per-connection state)
│   └── transport/
│       ├── mod.rs
│       ├── stdio.rs         # M1
│       ├── http.rs          # M4
│       ├── sse.rs           # M4
│       └── ws.rs            # Post-M7
├── server_manager.rs        # spawn/kill, env, working_dir, sysinfo
├── scenario/
│   ├── mod.rs               # Scenario trait
│   ├── cold_start.rs
│   ├── sustained.rs
│   ├── deadlock_probe.rs    # the Vibe-Trading-style smoke
│   ├── spike.rs             # M5
│   ├── ramp.rs              # M5
│   └── leak.rs              # M5
├── driver.rs                # tokio task pool, rate limiter
├── hang_detector.rs         # per-call watchdog
├── metrics/
│   ├── mod.rs
│   ├── histogram.rs         # hdrhistogram wrapper
│   ├── throughput.rs
│   └── process.rs           # RSS/CPU/fd via sysinfo
├── report/
│   ├── mod.rs               # Report struct
│   ├── markdown.rs
│   ├── json.rs
│   └── terminal.rs          # indicatif progress + final summary
└── run.rs                   # orchestrator that ties it all together

Key crate dependencies

| Crate | Why |
| --- | --- |
| tokio | async runtime |
| serde / serde_json | JSON-RPC payloads |
| clap | CLI |
| toml | config |
| hdrhistogram | percentile latency |
| sysinfo | RSS/CPU per pid (cross-platform) |
| indicatif | terminal progress |
| tracing | structured logging |
| thiserror / anyhow | errors |
| tokio-util | LinesCodec for stdio framing |

No proc-macro magic. No "framework" — just composable structs.


5. Library API (Rust crate)

use mcp_loadtest::{Server, Scenario, Run, Thresholds};
use serde_json::json;
use std::time::Duration;

#[tokio::test]
async fn no_deadlock_under_concurrent_calls() {
    let server = Server::stdio("python")
        .args(["-m", "vibe_trading_mcp"])
        .env("LOG_LEVEL", "warn");

    let scenario = Scenario::sustained()
        .concurrent(20)
        .duration(Duration::from_secs(30))
        .tool("get_market_data")
        .args(json!({ "ticker": "AAPL" }));

    let report = Run::new(server, scenario)
        .with_thresholds(Thresholds {
            p99_latency: Some(Duration::from_millis(500)),
            error_rate: Some(0.01),
            hang_timeout: Duration::from_secs(5),
            ..Default::default()
        })
        .execute()
        .await
        .expect("run failed");

    assert!(report.passed(), "thresholds violated: {report:?}");
    assert_eq!(report.deadlock_count, 0);
}

Design choices:

  • Builder pattern — predictable, no struct-init explosion
  • execute() returns Result<Report, RunError>: tests can pattern-match on metrics
  • Report exposes raw histograms — can drive custom assertions
  • No global state — multiple Runs in parallel inside one test process is supported

6. CLI surface

# Quick health check (initialize + tools/list + 1 call per tool)
mcp-loadtest probe --server "python -m my_mcp"

# Targeted smoke for the Vibe-Trading bug class
mcp-loadtest deadlock-probe --server "python -m my_mcp" --tool get_market_data

# Run a custom scenario from CLI
mcp-loadtest run \
  --server "python -m my_mcp" \
  --scenario sustained \
  --concurrent 50 --duration 60s \
  --tool get_market_data --args '{"ticker":"AAPL"}'

# Run from config file
mcp-loadtest run --config bench.toml

# Re-render saved run
mcp-loadtest report ./runs/2026-05-10T14-30-00/

# List built-in scenarios
mcp-loadtest list-scenarios

# Print example config
mcp-loadtest example-config > bench.toml

Output structure for each run:

runs/2026-05-10T14-30-00/
├── config.toml          # exact config used
├── server.stdout.log    # captured server stdout
├── server.stderr.log    # captured server stderr
├── trace.jsonl          # per-call records (request/response/duration/error)
├── metrics.json         # aggregated metrics
├── report.md            # human-readable summary
└── summary.json         # CI-friendly pass/fail JSON

7. Configuration schema (TOML)

[server]
command = "python"
args = ["-m", "my_mcp"]
env.LOG_LEVEL = "warn"
working_dir = "/path/to/proj"
transport = "stdio"           # stdio | http | sse | ws
startup_timeout = "10s"       # how long to wait for initialize response

[scenario]
type = "sustained"            # cold_start | sustained | spike | ramp | soak | pattern | deadlock_probe | race_check | fuzzer
duration = "60s"
concurrent = 50

# scenario-specific knobs
ramp_from = 1                 # ramp only
ramp_to = 100                 # ramp only
spike_at = "30s"              # spike only
spike_multiplier = 10         # spike only
leak_check_interval = "10s"   # soak only (leak check)

# What to call. Multiple entries → weighted random selection.
[[scenario.tool_call]]
name = "get_market_data"
args = { ticker = "AAPL" }
weight = 1.0

[[scenario.tool_call]]
name = "analyze_options"
args = { ticker = "SPY", expiry = "2026-06-19" }
weight = 0.3

[thresholds]
p50_latency = "100ms"
p99_latency = "500ms"
error_rate = 0.01             # fraction; 0.01 = 1%
hang_timeout = "5s"           # call considered hung if no response in this long
memory_growth_mb = 50         # fail if RSS grows by more than this MB during run

[output]
report_dir = "./runs"
formats = ["markdown", "json", "terminal"]
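
The config above writes durations as strings such as "60s", "100ms" and "5s". A minimal std-only sketch of parsing that notation follows. The helper name parse_duration is hypothetical, and accepting "m" and "h" suffixes is an assumption of this sketch rather than something the schema states; the real parser may well delegate to an existing crate.

```rust
use std::time::Duration;

/// Parse "<integer><unit>" where unit is ms, s, m or h.
/// Returns None for a missing number, a missing unit, or an unknown unit.
fn parse_duration(s: &str) -> Option<Duration> {
    let s = s.trim();
    // Split at the first non-digit: "100ms" -> ("100", "ms").
    let split = s.find(|c: char| !c.is_ascii_digit())?;
    let (num, unit) = s.split_at(split);
    let n: u64 = num.parse().ok()?;
    match unit {
        "ms" => Some(Duration::from_millis(n)),
        "s" => Some(Duration::from_secs(n)),
        "m" => n.checked_mul(60).map(Duration::from_secs),
        "h" => n.checked_mul(3600).map(Duration::from_secs),
        _ => None,
    }
}
```

Rejecting a bare number such as "60" keeps the unit explicit, so a threshold cannot silently be read as milliseconds when seconds were meant.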

8. Built-in scenarios

| Scenario | Description | Detects |
| --- | --- | --- |
| cold_start | Spawn → initialize → first tool call. Repeat N times. (M2 placeholder: needs session factory; tracked for follow-up.) | regression in startup time, init-time deadlocks |
| sustained | Constant load against one session for a fixed duration. Drives the multi-step weighted-random pattern engine internally. | baseline p99 latency, throughput, sustained error rate |
| spike | Baseline → sharp burst at peak concurrency for a fixed window → cooldown back to baseline. Models Black-Friday-style traffic spikes. | queue overflow, recovery behavior, fairness under burst |
| ramp | Step concurrency between the from and to settings in increments of step_increment, optionally feeding the per-step metrics into [analysis::breaking_point]. | the break-point: the concurrency at which p99 explodes |
| soak | Long-duration steady load with periodic snapshots; pairs with analysis::regression for latency-drift and (via ProcessSampler) RSS-slope leak signals. | memory leaks, latency drift, throughput collapse over hours |
| pattern | Multi-step weighted-random tool-call sequences with per-pattern think_time and ErrorBehavior. Building block used directly by sustained. | realistic mixed workloads (explore-then-act, read-then-write) |
| deadlock_probe | initialize → tools/list → fire N tools/call to the same tool, each wrapped in hang_detector. Bails on first deadlock to avoid flooding a wedged session. | the Vibe-Trading bug class specifically |
| race_check | Issue N identical tools/call and run the responses through analysis::race_detector (key-sorted JSON canonicalization). | non-determinism / divergent responses to identical inputs |
| fuzzer | Cycle through enumerated malformed-but-plausible payloads (unknown method, numeric method, giant payload, control chars, deep-nested, null/string params); classify each via analysis::fuzz_report. | parser bugs, type-confusion in method dispatch |

Deferred to v0.2:

  • slow_mix — 80% calls to a fast tool, 20% to a deliberately-slow tool (head-of-line blocking, fairness). Approximable today by configuring a multi-step pattern with weighted tools.
  • reconnect — drop session mid-call (close stdin), spawn new session, retry (resilience, leftover state, zombies). Needs the session pool that lands in M8+.

Each scenario is an impl Scenario with two methods:

trait Scenario {
    async fn drive(&self, session: SessionPool, ctx: RunContext) -> ScenarioOutcome;
    fn config_schema() -> serde_json::Value;
}

9. Test matrix

Layer A — does mcp-loadtest itself work?

Mock MCP servers in tests/fixtures/. Each is a tiny Python script (chosen for ubiquity, not Rust, to make the test environment realistic).

| Mock | Behavior | Tests |
| --- | --- | --- |
| mock-normal.py | Echoes args, responds in 1ms | happy-path metrics shape |
| mock-slow.py | Tool sleeps 2s | latency histogram correctness |
| mock-broken.py | Hangs on first tools/call (replicates Vibe-Trading bug) | deadlock_probe correctly classifies |
| mock-crash.py | Crashes on 1% of calls | error-rate accuracy |
| mock-leak.py | Allocates 10 KB/call, never frees | leak scenario detects |
| mock-error.py | Returns JSON-RPC errors per spec | error classification |
| mock-slow-init.py | Takes 5s to respond to initialize | cold_start measures correctly |
| mock-malformed.py | Returns invalid JSON occasionally | parser robustness |

Test invariant: for each (scenario × mock) pair, the report's machine-readable summary contains expected fields with expected ranges. This is the bulk of integration tests.

Layer B — does it catch real bugs?

Snapshot test against a known-buggy commit of Vibe-Trading:

  • Pin to commit ~PR-85 (just before the fix)
  • Run deadlock_probe scenario
  • Assert: report flags ≥1 deadlock, identifies tools/call as the offending request

Re-run against post-fix commit:

  • Same scenario, expect 0 deadlocks

This is the killer demo. It goes in the README.

Layer C — cross-platform

CI matrix: ubuntu-latest, macos-latest, windows-latest × stable Rust × Python 3.13 (for fixtures).


10. Milestones (revised 2026-05-10 — head-on competition with reaatech/mcp-load-test)

Original 3-week plan replaced after discovering reaatech/mcp-load-test ships v0.1 functionality already (see §10.5 for parity matrix). v0.1.0 of mcp-loadtest must reach feature parity and surface our differentiators before re-publishing.

Repo was private through the M1-M7 development phase. The new public repo URL will be added once v0.1.0 is published to crates.io.

M1 through M7 are all shipped. Post-M7 work (spike scenario, HTML reporter, WebSocket transport, hot-path zero-copy refactor, criterion benches) is captured under [Unreleased] in CHANGELOG rather than as a new milestone — the work is small + cohesive enough that bundling it into v0.1.0 makes more sense than coining "M8" for it. The "Week N" column is dropped because milestones are no longer time-boxed — they're released.

| M | Theme | Key deliverables |
| --- | --- | --- |
| M1 | stdio Session | Session::spawn → handshake → list_tools/call_tool/shutdown; mock-normal.py; happy-path integration test |
| M2 | Scenarios + metrics core | Scenario trait; cold_start + sustained + deadlock_probe impls; hang_detector (§15.1); hdrhistogram metrics; mocks mock-broken/mock-slow/mock-crash + tests |
| M3 | Reports + first internal release | TOML config; markdown / JSON / console reporters; sysinfo-based process sampling; regression test against real Vibe-Trading commit ~PR-85 |
| M4 | Transport parity | HTTP transport (StreamableHTTP); SSE transport; HTTP/SSE fixtures; transport-aware concurrency profiles |
| M5 | Analysis parity | breaking_point detection; performance grading (A-F per latency/concurrency/error); realistic patterns (explore-then-act, read-then-write, multi-step) with weighted random + think-time; soak scenario polish; compare-baselines subcommand |
| M6 | Differentiators v1 | Real-time terminal TUI dashboard (live latency/throughput/RSS); server resource sampling beyond RSS (CPU, fd, threads); race_detector scenario; cross-server compare (run --server srv-a --server srv-b) |
| M7 | Differentiators v2 + v0.1 polish | Protocol fuzzer (basic: random/malformed payloads); coverage tracking (tools registered vs. exercised); per-tool SLO assertions; README rewrite with competitive positioning; cargo install smoke test on all 3 OS |
| Post-M7 | Pre-public-release close-out | Spike scenario; HTML reporter; WebSocket transport; hot-path zero-copy refactor; criterion benches (DESIGN §19 claims now reproducible). See CHANGELOG [Unreleased]. |
| v0.1.0-rc | Pre-publish review in flight | repo back to public; crates.io publish; HN/lobste.rs/r/rust announce |
| M8+ stretch | Beyond | AI-assisted pattern generator; distributed mode (multi-worker); replay/record; PyO3 binding |

Definition of done for v0.1.0:

  • cargo install mcp-loadtest-cli works on Linux/macOS/Windows.
  • mcp-loadtest deadlock-probe -s "python -m vibe_trading_mcp" reproduces the original bug on commit ~PR-85.
  • All §10.5 parity-must-have rows are checked.
  • All §10.5 differentiator rows are checked.
  • README has side-by-side comparison table vs. reaatech, citing concrete benchmarks.

10.5 Competitive parity & differentiation matrix

reaatech/mcp-load-test as of 2026-05-10 (TS monorepo, 77 source files, ~50% of README claims fleshed out per file-size sampling).

Parity — features they have, we must match before re-publishing public

Feature reaatech mcp-loadtest target Milestone
stdio transport M1
HTTP (StreamableHTTP) transport M4
SSE transport M4
WebSocket transport Post-M7
Latency histograms p50/p95/p99/p999 per tool M2
Breaking point detection M5
Performance grading A-F M5
Soak / leak detection M5
Spike scenario Post-M7
Compare baselines M5
Realistic patterns (explore-then-act, multi-step) M5
Console + markdown + JSON reporters M3
HTML reporter (self-contained) Post-M7
Programmatic library API M2/M3

Differentiators — features we have/will have that they don't

Feature reaatech mcp-loadtest Why it matters
Deadlock detection (deadlock_probe) ✓ M2 Lazy-init / async-worker bugs that break in prod. Direct response to Vibe-Trading PR #85.
Race detector ✓ M6 Order-sensitive concurrent tool calls; finds protocol-level race bugs.
Real-time TUI dashboard ✗ (post-hoc only) ✓ M6 Watch perf cliff happen live during a run.
Cross-server compare (run vs N targets) partial (compare baselines = 2 runs) ✓ M6 (1 run, N targets) Side-by-side: vendor A vs vendor B vs your fork.
Server resource sampling (CPU/fd/threads/RSS over time) ✗ (latency only) ✓ M6 Find resource exhaustion before throughput collapses.
Protocol fuzzer (mcp-fuzz integrated) ✓ M7 Random/malformed payloads; finds parser bugs unit tests miss.
Coverage tracking (registered vs exercised tools) ✓ M7 Catch silently-broken tools that nobody tests in CI.
Per-tool SLO assertions partial (global) ✓ M7 Per-tool latency/error budgets in CI.
Configurable regression thresholds ✗ (fixed) ✓ v0.1 compare CLI flags + compare_runs MCP args override p99 / error-rate / deadlock policy; defaults unchanged (ADR 0009).
Protocol-aware assertions ✓ v0.1 Opt-in strict mode validates tools/call args vs the server's advertised inputSchema; mismatch → ProtocolError gates the run. Forward-compatible, off by default (ADR 0005/0010).
Rust perf + static binary ✗ (Node runtime required) cargo install → single ~5MB binary; no Node toolchain.
AI-assisted pattern generator ⏳ M8 stretch LLM reads tool schemas → generates realistic call sequences.
Distributed mode ⏳ M8 stretch Multiple workers driving one server (high-RPS targets).
Replay / record ⏳ M8 stretch Capture prod traffic, replay deterministically.
Self-hosted as MCP server (mcp-loadtest serve --mcp) ✓ M7 AI agents (Claude, Cursor, etc.) call deadlock_probe / compare / report directly via MCP. Recursive: load-test an MCP using an MCP.

Strategic positioning (for README at v0.1.0)

mcp-loadtest is a load tester + bug detector for MCP servers. Match-or-exceed reaatech/mcp-load-test on every load-testing dimension, and detect classes of bugs no other tool finds: deadlocks, races, resource leaks, coverage gaps.

The README at re-publish must lead with the deadlock demo (replicated Vibe-Trading PR #85 bug, caught in 2 seconds) — not the load-testing checklist. Differentiation first; parity proves we're serious.


11. Decisions (resolved 2026-05-10)

| # | Question | Decision | Rationale |
| --- | --- | --- | --- |
| 1 | Crate name | mcp-loadtest (lib) + mcp-loadtest-cli (bin) | descriptive, discoverable, doesn't pigeonhole to "bench" |
| 2 | License | MIT OR Apache-2.0 (dual) | Rust ecosystem standard; MIT for individuals, Apache-2.0 for corporate patent grant |
| 3 | Repo location | github.com/Teerapat-Vatpitak/mcp-loadtest | personal handle for v0.1; transfer to mcp-tools/ org if/when sister projects emerge |
| 4 | MCP protocol versioning | v0.1 pin to spec v1.x, warn on mismatch; --strict-protocol flag for fail-on-mismatch; v0.2+ detect-and-adapt | ship v0.1 fast, add complexity when justified |
| 5 | deadlock_probe | both subcommand (mcp-loadtest deadlock-probe -s "...") and scenario in run --scenario deadlock_probe | subcommand for newcomer UX, scenario for CI; near-zero implementation cost |
| 6 | Server stderr | always capture to runs/<id>/server.stderr.log; opt-in --tee-stderr to also stream stderr | critical for debugging; capture is cheap; tee opt-in to avoid CI log spam |
| 7 | Diff-vs-baseline mode | defer to M5 stretch | v0.1 emits JSON; users diff externally. Proper baseline storage + regression detection has too many edge cases for v0.1 |
| 8 | Library API → 1.0 | When all three: 3 months no breaking changes + 5+ external users + 1 real bug caught in wild | calendar time + adoption + value-prop validation, all required |

12. Naming options (resolved in §11, decision 1)

  • mcp-loadtest — clear, no surprises
  • mcp-bench — implies benchmarking specifically
  • mcphammer — playful, memorable, but maybe too aggressive for a tool that aims to be canonical
  • mcptest — too generic
  • mcp-stress — accurate but slightly negative
  • lockesmith — clever ("lock-finder for MCP servers") but obscure

Author's preference: mcp-loadtest for v0.1. Rename later if needed.


13. Future work (out of scope for v0.1)

  • mcp-fuzz — sister project for protocol fuzzing (random/malformed payloads)
  • mcp-trace — record + replay tool for debugging production MCP issues
  • Distributed mode — multiple loadtest workers driving one server (for very high RPS targets)
  • GUI/web UI — render reports interactively
  • Plugin system — user-defined scenarios as separate crates
  • Public benchmark dataset — track perf of popular MCP servers over time (mcp-leaderboard)

14. Concrete Rust types

These are the public types code will hang off. Full definitions, not sketches.

14.1 Server config

use std::collections::BTreeMap;
use std::path::PathBuf;
use std::time::Duration;

pub struct Server {
    pub command: String,
    pub args: Vec<String>,
    pub env: BTreeMap<String, String>,    // BTreeMap for stable serialization
    pub working_dir: Option<PathBuf>,
    pub transport: Transport,
    pub startup_timeout: Duration,         // default 10s
    pub shutdown_timeout: Duration,        // default 5s; SIGTERM → wait → SIGKILL
}

// Clone but not Copy: the variants own Strings and maps, which cannot be Copy.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Transport {
    Stdio,
    Http { url: String, headers: BTreeMap<String, String> }, // M4: Streamable HTTP (simple JSON variant)
    Sse { url: String, headers: BTreeMap<String, String> },  // M4
    WebSocket { url: String },                               // Post-M7
}

impl Server {
    pub fn stdio(command: impl Into<String>) -> ServerBuilder { /* ... */ }
}

14.2 Scenario

pub enum ScenarioKind {
    ColdStart {
        iterations: u32,                   // default 5
        warmup: bool,                      // discard first iter — default true
    },
    Sustained {
        concurrent: u32,
        duration: Duration,
        rate_limit: Option<u32>,           // requests/sec cap; None = unbounded
    },
    Spike {
        baseline_concurrent: u32,
        spike_concurrent: u32,
        baseline_duration: Duration,
        spike_at: Duration,
        spike_duration: Duration,
    },
    Ramp {
        from_concurrent: u32,
        to_concurrent: u32,
        duration: Duration,
    },
    DeadlockProbe {
        concurrent: u32,                   // default 20
        hang_threshold: Duration,          // default 5s
        grace_period: Duration,            // default 10s — after timeout, how long to wait for late responses
    },
    Soak {
        concurrent: u32,                   // default 4
        duration: Duration,                // default 1h
        sample_interval: Duration,         // default 10s
        latency_drift_ms_per_sec: f64,     // fail if linear-regression slope on mean latency exceeds this
    },
    // M5+ ships additional kinds not detailed here for brevity:
    //   Pattern { steps, think_time, weight, error_behavior }
    //   RaceCheck { concurrent, tool, args }
    //   Fuzzer { iterations, seed, payloads }
    // See crate::scenario::{pattern, race_check, fuzzer}.
}

pub struct Scenario {
    pub kind: ScenarioKind,
    pub tool_calls: Vec<ToolCall>,         // weighted random selection
}

pub struct ToolCall {
    pub name: String,
    pub args: serde_json::Value,
    pub weight: f64,                       // default 1.0
}
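
The tool_calls field is documented as "weighted random selection". One plausible way to do that, sketched here std-only, is to walk the cumulative weights with a uniform sample. The function name pick_weighted is hypothetical, and taking the random number u as a parameter is a choice made for this sketch so that it stays deterministic and testable; the crate's actual selection code is not shown in this document.

```rust
/// Pick an index in proportion to its weight, given `u`, a uniform sample in [0, 1).
/// Non-positive (or NaN) weights are never selected. Returns None if nothing is selectable.
fn pick_weighted(weights: &[f64], u: f64) -> Option<usize> {
    let total: f64 = weights.iter().filter(|w| **w > 0.0).sum();
    if !(total > 0.0) {
        return None;
    }
    let mut target = u * total;
    let mut last_positive = None;
    for (i, &w) in weights.iter().enumerate() {
        if !(w > 0.0) {
            continue;
        }
        if target < w {
            return Some(i);
        }
        target -= w;
        last_positive = Some(i);
    }
    // Floating-point rounding can leave a tiny remainder; fall back to the last candidate.
    last_positive
}
```

With the weights from the §7 example (1.0 and 0.3), samples below roughly 0.77 select the first tool and the rest select the second.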

14.3 Run + Report

pub struct Run {
    server: Server,
    scenario: Scenario,
    thresholds: Thresholds,
    output_dir: Option<PathBuf>,
}

impl Run {
    pub fn new(server: Server, scenario: Scenario) -> Self;
    pub fn with_thresholds(self, t: Thresholds) -> Self;
    pub fn with_output_dir(self, dir: PathBuf) -> Self;
    pub async fn execute(self) -> Result<Report, RunError>;
}

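pub struct Thresholds {
    pub p50_latency: Option<Duration>,
    pub p95_latency: Option<Duration>,
    pub p99_latency: Option<Duration>,
    pub p999_latency: Option<Duration>,
    pub error_rate: Option<f64>,           // 0.0..=1.0
    pub hang_timeout: Duration,            // default 5s; used by hang_detector
    pub memory_growth_mb: Option<f64>,
}

// Implemented by hand: a derived Default would set hang_timeout to zero, not 5s.
impl Default for Thresholds {
    fn default() -> Self {
        Self {
            p50_latency: None,
            p95_latency: None,
            p99_latency: None,
            p999_latency: None,
            error_rate: None,
            hang_timeout: Duration::from_secs(5),
            memory_growth_mb: None,
        }
    }
}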

pub struct Report {
    pub run_id: String,                    // ULID
    pub started_at: SystemTime,
    pub duration: Duration,
    pub scenario_kind: ScenarioKind,
    pub server_info: ServerInfo,
    pub latency: LatencyStats,
    pub throughput: ThroughputStats,
    pub errors: ErrorStats,
    pub process: ProcessStats,
    pub deadlock_count: u32,
    pub hang_count: u32,
    pub trace_path: PathBuf,
    pub threshold_violations: Vec<ThresholdViolation>,
}

impl Report {
    pub fn passed(&self) -> bool { self.threshold_violations.is_empty() }
    pub fn write_markdown(&self, path: &Path) -> io::Result<()>;
    pub fn write_json(&self, path: &Path) -> io::Result<()>;
}

pub struct LatencyStats {
    pub histogram: hdrhistogram::Histogram<u64>,  // exposed for custom analysis
    pub p50: Duration,
    pub p95: Duration,
    pub p99: Duration,
    pub p999: Duration,
    pub min: Duration,
    pub max: Duration,
    pub mean: Duration,
    pub stddev: Duration,
    pub count: u64,
}

pub struct ThroughputStats {
    pub total_requests: u64,
    pub successful_requests: u64,
    pub requests_per_sec: f64,
    pub timeline: Vec<(Duration, u64)>,    // (offset, requests-completed-by-then) for charts
}

pub struct ErrorStats {
    pub total: u64,
    pub by_category: BTreeMap<ErrorCategory, u64>,    // see §18
}

pub struct ProcessStats {
    pub peak_rss_mb: f64,
    pub final_rss_mb: f64,
    pub avg_cpu_pct: f64,
    pub samples: Vec<ProcessSample>,
}

pub struct ProcessSample {
    pub at: Duration,                      // offset from run start
    pub rss_mb: f64,
    pub cpu_pct: f64,
}

pub struct ThresholdViolation {
    pub metric: String,                    // e.g. "p99_latency"
    pub expected: String,                  // e.g. "<= 500ms"
    pub actual: String,                    // e.g. "812ms"
}
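
LatencyStats reports p50 through p999 from an hdrhistogram, which buckets values. For cross-checking those numbers in a custom assertion, an exact nearest-rank percentile over raw samples could look like the sketch below. The helper name percentile is hypothetical and it is not part of the crate's API as described here.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample with at least p% of samples
/// at or below it. `p` must be in (0, 100]. Sorts `samples` in place.
fn percentile(samples: &mut [Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    samples.sort_unstable();
    // Multiply before dividing so whole-number percentiles stay exact.
    let rank = (p * samples.len() as f64 / 100.0).ceil() as usize;
    Some(samples[rank.max(1) - 1])
}
```

This keeps every sample in memory, which is fine for a test assertion but is exactly the cost a histogram avoids during long runs.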

14.4 Errors

#[derive(thiserror::Error, Debug)]
pub enum RunError {
    #[error("server failed to start: {0}")]
    ServerStart(io::Error),

    #[error("server exited unexpectedly with code {0:?}")]
    ServerExit(Option<i32>),

    #[error("initialize handshake failed: {0}")]
    Handshake(String),

    #[error("server stderr: {0}")]
    ServerStderr(String),

    #[error("config invalid: {0}")]
    Config(String),

    #[error("io: {0}")]
    Io(#[from] io::Error),

    #[error("internal: {0}")]
    Internal(String),
}

15. Algorithm specs

The detection logic is the IP of this tool. Spec'd precisely so any implementer can reproduce.

15.1 Hang detector

Per-call watchdog. Wraps every tools/call request:

Algorithm: hang_detector(req, threshold, grace_period)
1. Record send_at = now().
2. Send req to server.
3. Spawn watchdog task with timer = threshold.
4. Race: watchdog completes OR response arrives.
5. If response arrives first:
     duration = now() - send_at
     return Ok((response, duration))
6. If watchdog completes first:
     mark request_id as HUNG
     continue listening for late response (up to grace_period)
     if late response arrives: classify as LATE (not HUNG)
     if no response within grace_period: classify as DEADLOCK
     return Err(Hang { request_id, hung_for })

Hang ≠ deadlock. Hang means "no response within hang_threshold". Deadlock means "no response within hang_threshold + grace_period" — i.e. server appears genuinely stuck, not just slow.
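
The two-phase wait described above can be sketched with a blocking channel standing in for the response stream. This is a std-only illustration under stated assumptions: the real detector is described as a tokio watchdog task, the CallOutcome enum and its Closed variant (sender dropped, for example because the server exited) are hypothetical names introduced here, and the request is assumed to have been sent just before the call.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum CallOutcome {
    Ok(Duration),       // response arrived within threshold
    Late(Duration),     // response arrived after threshold, within grace_period
    Deadlock(Duration), // nothing after threshold + grace_period
    Closed,             // response channel closed before any response
}

/// Wait for the response to a request that was just sent.
fn hang_detector<T>(rx: &Receiver<T>, threshold: Duration, grace_period: Duration) -> CallOutcome {
    let send_at = Instant::now();
    // Phase 1: the watchdog window.
    match rx.recv_timeout(threshold) {
        Ok(_) => return CallOutcome::Ok(send_at.elapsed()),
        Err(RecvTimeoutError::Disconnected) => return CallOutcome::Closed,
        Err(RecvTimeoutError::Timeout) => {} // hung: keep listening
    }
    // Phase 2: grace period for late responses.
    match rx.recv_timeout(grace_period) {
        Ok(_) => CallOutcome::Late(send_at.elapsed()),
        Err(RecvTimeoutError::Disconnected) => CallOutcome::Closed,
        Err(RecvTimeoutError::Timeout) => CallOutcome::Deadlock(send_at.elapsed()),
    }
}
```

Separating Closed from Deadlock matters for reporting: a server that crashed is a different finding from one that is alive but stuck.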

15.2 Deadlock probe scenario

The Vibe-Trading-bug-class detector. Specific call sequence designed to reproduce lazy-init races.

Algorithm: deadlock_probe(server, tool, N, hang_threshold, grace_period)
1. Spawn server. Record startup_duration = time-to-stdout-EOF or initialize-response.
2. Send `initialize`. Await with timeout = startup_timeout. (fails → SERVER_INIT_ERROR)
3. Send `notifications/initialized`.
4. Send `tools/list`. Await with timeout = 1s. (fails → TOOLS_LIST_HANG)
5. Synchronization barrier — all N tasks ready to send concurrently.
6. Release barrier. All N tasks send `tools/call` to `tool` simultaneously.
7. Each task: hang_detector(req, hang_threshold).
8. After all N return (Ok or Err): wait grace_period.
9. Categorize each:
     - Ok with duration → SUCCESS
     - Late response within grace_period (LATE in §15.1) → SLOW
     - No response after grace_period → DEADLOCK
10. Send shutdown notification, wait shutdown_timeout, kill if needed.
11. Report:
     - if DEADLOCK count > 0 → severity=CRITICAL, "DEADLOCK DETECTED"
     - else if SLOW > 0.5 * N → severity=WARNING, "concurrency degrades latency"
     - else → severity=PASS

The barrier in steps 5-6 is critical. Without it, requests serialize naturally and lazy-init bugs hide. The barrier forces real concurrency at the point of greatest stress.
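
A rough illustration of the barrier release and the step 11 severity rule, using OS threads and std::sync::Barrier in place of tokio tasks. The names fire_concurrently, Severity and severity are hypothetical, and the call closure stands in for sending one tools/call through the hang detector.

```rust
use std::sync::{Arc, Barrier};
use std::thread;

/// Start `n` workers, hold them at a barrier until all are ready, then let
/// every worker run `call` at once. Results come back in worker order.
fn fire_concurrently<R, F>(n: usize, call: F) -> Vec<R>
where
    R: Send + 'static,
    F: Fn(usize) -> R + Send + Sync + 'static,
{
    let barrier = Arc::new(Barrier::new(n));
    let call = Arc::new(call);
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let barrier = Arc::clone(&barrier);
            let call = Arc::clone(&call);
            thread::spawn(move || {
                // Steps 5-6: block until all n workers have arrived, then go together.
                barrier.wait();
                (*call)(i)
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

#[derive(Debug, PartialEq)]
enum Severity {
    Pass,
    Warning,
    Critical,
}

/// Step 11: map outcome counts to a severity.
fn severity(deadlocks: usize, slow: usize, n: usize) -> Severity {
    if deadlocks > 0 {
        Severity::Critical
    } else if (slow as f64) > 0.5 * n as f64 {
        Severity::Warning
    } else {
        Severity::Pass
    }
}
```

Without the barrier, thread start-up latency alone would stagger the sends, which is the natural serialization the paragraph above warns about.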

15.3 Leak detector

Algorithm: leak_detector(server, scenario, sample_interval, growth_threshold_mb)
1. Run sustained scenario. Concurrently:
2. Every sample_interval, sample server's RSS via sysinfo.
3. After scenario completes:
4. Fit linear regression: rss_mb = a * t + b, where t in seconds
5. Predicted total growth = a * scenario.duration_secs
6. If predicted_growth > growth_threshold_mb:
     classify as LEAK_DETECTED
     report: slope (MB/sec), R² (fit quality), samples
7. R² < 0.5 → "noisy, can't conclude" — report as INDETERMINATE

Caveat: warmup-and-stabilize matters. First 30s of samples are discarded by default to avoid false positives from JIT / lazy-load.
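The regression in steps 4-7 can be sketched in plain Python. Several details here are assumptions rather than part of the spec: the function names, the `"OK"` verdict label, the handling of too-few samples, and the choice to check R² before the growth threshold (the steps do not say which takes precedence).

```python
from statistics import mean


def fit_line(ts, ys):
    """Least-squares fit ys = a * ts + b. Returns (a, b, r_squared)."""
    t_bar, y_bar = mean(ts), mean(ys)
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    a = sxy / sxx
    b = y_bar - a * t_bar
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    ss_res = sum((y - (a * t + b)) ** 2 for t, y in zip(ts, ys))
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return a, b, r2


def detect_leak(samples, duration_secs, growth_threshold_mb,
                warmup_secs=30.0, min_r2=0.5):
    """samples: list of (t_seconds, rss_mb). Returns (verdict, slope, r2)."""
    # Discard warmup samples to avoid counting JIT / lazy-load growth.
    kept = [(t, rss) for t, rss in samples if t >= warmup_secs]
    if len({t for t, _ in kept}) < 2:
        return "INDETERMINATE", None, None   # assumption: too few points to fit
    slope, _, r2 = fit_line([t for t, _ in kept], [r for _, r in kept])
    if r2 < min_r2:
        return "INDETERMINATE", slope, r2    # noisy fit, can't conclude
    if slope * duration_secs > growth_threshold_mb:
        return "LEAK_DETECTED", slope, r2    # slope is in MB/sec
    return "OK", slope, r2
```

For example, RSS growing by 0.5 MB/sec over a 120 s scenario predicts 60 MB of growth, which exceeds a 50 MB threshold, while a sawtooth between 100 and 110 MB fits so poorly that the verdict is INDETERMINATE.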

15.4 Threshold evaluator

Algorithm: evaluate_thresholds(report, thresholds)
For each threshold field that is Some:
  compare report's metric to threshold
  if violated: append ThresholdViolation { metric, expected, actual }
Return: violations vec — empty means PASS.

Simple, but worth specifying so the report's passed() is unambiguous.
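A sketch of the evaluator in Python, to make the "only Some fields are checked" rule concrete. The field names on `Thresholds`, the dict-shaped report, and the assumption that every threshold is an upper bound are illustrative and not taken from the crate.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Thresholds:
    # Hypothetical fields; None plays the role of Rust's Option::None.
    p99_latency_ms: Optional[float] = None
    error_rate: Optional[float] = None
    peak_rss_mb: Optional[float] = None


@dataclass
class ThresholdViolation:
    metric: str
    expected: str
    actual: str


def evaluate_thresholds(report, thresholds):
    """Return the list of violations; an empty list means PASS."""
    violations = []
    for field in fields(thresholds):
        limit = getattr(thresholds, field.name)
        if limit is None:
            continue  # threshold not configured: nothing to check
        actual = report[field.name]
        if actual > limit:  # assumption: every threshold is an upper bound
            violations.append(
                ThresholdViolation(field.name, f"<={limit}", f"{actual}")
            )
    return violations


def passed(report, thresholds):
    return not evaluate_thresholds(report, thresholds)
```

With only `p99_latency_ms=100.0` set, a report whose p99 is 123.4 ms yields exactly one violation, and the unset fields are never consulted.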


16. Mock server specs

Mocks live in tests/fixtures/<name>.py. Each is < 50 lines of Python — minimal MCP server using stdio + JSON-RPC by hand (no fastmcp dep, to avoid version coupling). Shipped fixtures: mock-normal.py, mock-slow.py, mock-broken.py, mock-crash.py, plus mock-http-server.py and mock-sse-server.py (transport parity coverage). Pseudocode for each below; entries marked (planned for v0.2) are documented for completeness but not yet shipped.

16.1 mock-normal.py

# Echoes args, responds in 1ms. Reference implementation.
# respond() is a shared framing helper (see tests/fixtures/_common.py).
import json
import sys

while True:
    line = sys.stdin.readline()
    if not line:                      # EOF: client closed stdin, exit cleanly
        break
    msg = json.loads(line)
    if msg["method"] == "initialize":
        respond({"protocolVersion":"...", "capabilities":{...}})
    elif msg["method"] == "tools/list":
        respond({"tools":[{"name":"echo","inputSchema":{...}}]})
    elif msg["method"] == "tools/call":
        respond({"content":[{"type":"text","text":json.dumps(msg["params"]["arguments"])}]})

16.2 mock-slow.py

Same as mock-normal, but tools/call does time.sleep(2) before responding. Used to verify latency histogram correctness (p99 should be ~2s).

16.3 mock-broken.py

# Replicates Vibe-Trading lazy-init deadlock pattern.
# initialize and tools/list work; first tools/call hangs forever.
calls_made = 0
while True:
    msg = json.loads(sys.stdin.readline())
    if msg["method"] in ("initialize", "tools/list"):
        respond_normally()
    elif msg["method"] == "tools/call":
        # The bug: blocking import in worker
        if calls_made == 0:
            calls_made += 1
            time.sleep(999999)              # simulated deadlock: never returns
        else:
            respond_normally()

deadlock_probe against this MUST report deadlock_count >= 1.

16.4 mock-crash.py

# Crashes on 1% of calls (random.random() < 0.01). Tests error rate accuracy.
# Crash = exit(1), not JSON-RPC error.

16.5 mock-http-server.py

# Streamable HTTP transport fixture. Stdlib http.server only — no fastapi/etc.
# Used by HttpTransport integration tests.

16.6 mock-sse-server.py

# HTTP+SSE transport fixture. Endpoint handshake + id-correlated responses.
# Stdlib http.server only. Used by SseTransport integration tests.

16.7 mock-leak.py (planned for v0.2)

# Allocates 10 KB per tools/call into a module-global list. Never frees.
# Tests leak detector — slope should be ~10KB × rps.
# Today leak/drift signals are exercised via `Soak::detect_leak` over synthetic
# (t, rss) series; a real leaking fixture is still useful for end-to-end coverage.

16.8 mock-error.py (planned for v0.2)

# Returns JSON-RPC errors per spec: -32601 method not found,
# -32602 invalid params, -32603 internal error.
# Cycles through error codes per call. Tests error classification (§18).

16.9 mock-slow-init.py (planned for v0.2)

# Sleeps 5s on `initialize` before responding. Tests cold_start measurement.

16.10 mock-malformed.py (planned for v0.2)

# Returns invalid JSON every 10th response (truncated, missing field).
# Tests parser robustness — should classify as MALFORMED_RESPONSE not crash.

All mocks share common framing helpers in tests/fixtures/_common.py (read frame, write frame, respond ok/err).
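For orientation, the shared helpers might look roughly like the sketch below. It assumes newline-delimited JSON framing (one message per line, as the `readline()` pseudocode above suggests); the function names and the `serve` loop are hypothetical, and `initialize` / `tools/list` handling is left out so that no protocol details are guessed.

```python
import io
import json


def read_frame(stream):
    """Read one newline-delimited JSON message; None at EOF."""
    line = stream.readline()
    if not line:
        return None
    return json.loads(line)


def write_frame(stream, msg):
    stream.write(json.dumps(msg) + "\n")
    stream.flush()


def respond_ok(stream, req, result):
    write_frame(stream, {"jsonrpc": "2.0", "id": req["id"], "result": result})


def respond_err(stream, req, code, message):
    error = {"code": code, "message": message}
    write_frame(stream, {"jsonrpc": "2.0", "id": req["id"], "error": error})


def serve(stdin, stdout):
    """Echo loop: answers tools/call, rejects other requests."""
    while (msg := read_frame(stdin)) is not None:
        if "id" not in msg:
            continue  # notification: JSON-RPC forbids a response
        if msg.get("method") == "tools/call":
            text = json.dumps(msg["params"]["arguments"])
            respond_ok(stdout, msg, {"content": [{"type": "text", "text": text}]})
        else:
            respond_err(stdout, msg, -32601, "Method not found")


def demo():
    """Feed three frames through serve() using in-memory streams."""
    frames = [
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 7, "method": "tools/call",
         "params": {"name": "echo", "arguments": {"x": 1}}},
        {"jsonrpc": "2.0", "id": 8, "method": "no/such/method"},
    ]
    stdin = io.StringIO("".join(json.dumps(f) + "\n" for f in frames))
    stdout = io.StringIO()
    serve(stdin, stdout)
    return [json.loads(line) for line in stdout.getvalue().splitlines()]
```

A real fixture would call `serve(sys.stdin, sys.stdout)`; taking the streams as parameters keeps the loop testable without spawning a process.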


17. Output format spec

17.1 Trace format (trace.jsonl)

One JSON object per line. Schema:

{
  "ts": 0.0,                              // seconds since run start (f64)
  "kind": "request|response|error|hang|deadlock|process_sample|scenario_event",
  "request_id": 123,                      // matches JSON-RPC id, present for request/response/error/hang/deadlock
  "method": "tools/call",                 // present for request
  "params": {...},                        // present for request (compact, can be large)
  "result": {...},                        // present for response (truncated to 1KB by default)
  "error": {"category": "...", "message": "...", "code": -32603},  // present for error
  "duration_ms": 12.5,                    // present for response/error
  "rss_mb": 45.2,                         // present for process_sample
  "cpu_pct": 12.3                         // present for process_sample
}

Stream-friendly. Can be processed with jq or any line-oriented tool.
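As a small illustration of line-oriented processing, the sketch below tallies event kinds and collects the request ids that hung or deadlocked. It assumes real trace lines carry the fields shown in the schema without the `//` comments; `summarize_trace` and the sample events are hypothetical.

```python
import json
from collections import Counter


def summarize_trace(lines):
    """Count events by kind and collect request ids that hung or deadlocked."""
    kinds = Counter()
    stuck_ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        kinds[event["kind"]] += 1
        if event["kind"] in ("hang", "deadlock"):
            stuck_ids.append(event["request_id"])
    return kinds, stuck_ids


# Hypothetical sample events using the schema's field names.
SAMPLE_TRACE = [
    '{"ts": 0.10, "kind": "request", "request_id": 1, "method": "tools/call"}',
    '{"ts": 0.11, "kind": "response", "request_id": 1, "duration_ms": 12.5}',
    '{"ts": 0.12, "kind": "request", "request_id": 2, "method": "tools/call"}',
    '{"ts": 5.12, "kind": "hang", "request_id": 2}',
    '{"ts": 6.00, "kind": "process_sample", "rss_mb": 45.2, "cpu_pct": 12.3}',
]
```

Because each line is a complete JSON object, the same function works on an open file handle (`summarize_trace(open(path))`) without loading the whole trace into memory.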

17.2 metrics.json

{
    "run_id": "01HXY...",
    "started_at": "2026-05-10T07:30:00Z",
    "duration_secs": 60.0,
    "scenario": {
        "kind": "Sustained",
        "concurrent": 50,
        "duration_secs": 60.0
    },
    "latency_ms": {
        "p50": 12.3,
        "p95": 45.6,
        "p99": 123.4,
        "p999": 456.7,
        "min": 1.2,
        "max": 999.9,
        "mean": 23.4,
        "stddev": 18.7,
        "count": 12345
    },
    "throughput": {
        "total_requests": 12345,
        "successful_requests": 12300,
        "requests_per_sec": 205.75
    },
    "errors": {
        "total": 45,
        "by_category": {
            "Hang": 0,
            "Timeout": 5,
            "ServerError": 30,
            "ProtocolError": 10,
            "Crash": 0,
            "Malformed": 0
        }
    },
    "process": {
        "peak_rss_mb": 156.3,
        "final_rss_mb": 142.1,
        "avg_cpu_pct": 23.4
    },
    "deadlock_count": 0,
    "hang_count": 0,
    "threshold_violations": [
        { "metric": "p99_latency", "expected": "<=100ms", "actual": "123.4ms" }
    ],
    "passed": false
}

On the Rust side, metric is a ThresholdKind enum (crate::report::ThresholdKind); serde flattens it to the string slug shown here via #[serde(rename = "metric")] + per-variant snake_case so the wire format stays stable across refactors.

JSON Schema published at schema/metrics.v1.json for downstream tooling.
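Downstream tooling can sanity-check a metrics file with a few lines of Python. The consistency rules below are inferred from the sample values above (categories sum to the error total, successes plus errors equal total requests, throughput is total divided by duration); they are assumptions, not guarantees stated by the schema. `check_metrics` is a hypothetical helper and `SAMPLE` is a trimmed copy of the example.

```python
def check_metrics(metrics):
    """Return (error_rate, problems) for a parsed metrics.json dict."""
    errors = metrics["errors"]
    tp = metrics["throughput"]
    problems = []
    if sum(errors["by_category"].values()) != errors["total"]:
        problems.append("errors.by_category does not sum to errors.total")
    if tp["successful_requests"] + errors["total"] != tp["total_requests"]:
        problems.append("successful + errors != total_requests")
    expected_rps = tp["total_requests"] / metrics["duration_secs"]
    if abs(expected_rps - tp["requests_per_sec"]) > 0.01:
        problems.append("requests_per_sec disagrees with total / duration")
    error_rate = errors["total"] / tp["total_requests"]
    return error_rate, problems


# Trimmed copy of the example document above.
SAMPLE = {
    "duration_secs": 60.0,
    "throughput": {
        "total_requests": 12345,
        "successful_requests": 12300,
        "requests_per_sec": 205.75,
    },
    "errors": {
        "total": 45,
        "by_category": {"Hang": 0, "Timeout": 5, "ServerError": 30,
                        "ProtocolError": 10, "Crash": 0, "Malformed": 0},
    },
}
```

For the sample, the error rate is 45 / 12345, which rounds to the 0.36% shown in the report template in §17.3.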

17.3 report.md template

# Run {run_id}

**Status:** ❌ FAIL (1 threshold violation)
**Server:** `python -m vibe_trading_mcp`
**Scenario:** Sustained, 50 concurrent, 60s
**Started:** 2026-05-10 07:30:00 UTC

## Summary

- Total requests: 12,345
- Throughput: 205.75 req/s
- Error rate: 0.36%
- Deadlocks: 0 · Hangs: 0

## Latency

| p50    | p95    | p99         | p999    | max     |
| ------ | ------ | ----------- | ------- | ------- |
| 12.3ms | 45.6ms | **123.4ms** | 456.7ms | 999.9ms |

(latency histogram ASCII chart here)

## Errors

| Category      | Count |
| ------------- | ----- |
| ServerError   | 30    |
| ProtocolError | 10    |
| Timeout       | 5     |

## Process

Peak RSS: 156.3 MB · Final RSS: 142.1 MB · Avg CPU: 23.4%

## Threshold violations

- **p99_latency**: expected ≤100ms, got 123.4ms

## Trace

Full trace: `./trace.jsonl` (12,345 events, 8.2 MB)

18. Error taxonomy

Every failure is classified into exactly one category. Used for ErrorStats.by_category and reporting.

| Category      | Definition                                                                           | Example                              |
| ------------- | ------------------------------------------------------------------------------------ | ------------------------------------ |
| Hang          | No response within hang_threshold, but response arrived before grace_period expires  | tool genuinely slow under contention |
| Deadlock      | No response after hang_threshold + grace_period                                      | Vibe-Trading PR #85                  |
| Timeout       | Client-side configured deadline exceeded (separate from hang_threshold)              | network buffer full                  |
| ServerError   | JSON-RPC error response with code in [-32099..=-32000] (server-defined)              | tool returned business error         |
| ProtocolError | JSON-RPC error with code -32600..=-32603 (transport / spec violations)               | malformed request rejected           |
| Crash         | Server process exited (non-zero or signal) during call                               | unhandled panic                      |
| Malformed     | Response was not valid JSON or didn't match JSON-RPC schema                          | partial response, broken framing     |
| Disconnected  | Transport closed unexpectedly mid-call                                               | broken pipe                          |
| Cancelled     | Client cancelled the request before response                                         | scenario shutdown                    |

Classification precedence: when more than one category applies, the terminal event wins. A request that hangs and then sees the server crash is classified as Crash, but trace.jsonl records both the hang and the crash events for forensics.
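The two JSON-RPC rows of the taxonomy reduce to a range check on the error code. The sketch below is illustrative: `classify_jsonrpc_error` is a hypothetical name, and returning `None` for codes outside both ranges is an assumption, since the taxonomy does not say how such codes are classified.

```python
def classify_jsonrpc_error(code):
    """Map a JSON-RPC error code to a taxonomy category, or None if unlisted."""
    if -32099 <= code <= -32000:
        return "ServerError"     # server-defined range
    if -32603 <= code <= -32600:
        return "ProtocolError"   # invalid request, method not found, etc.
    return None                  # assumption: not covered by the ranges above


# Example: the three codes that mock-error.py is planned to cycle through.
planned_mock_codes = [-32601, -32602, -32603]
categories = [classify_jsonrpc_error(c) for c in planned_mock_codes]
```

Under these ranges all three codes planned for mock-error.py (§16.8) land in ProtocolError.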


19. Performance targets for the tool itself

mcp-loadtest should never be the bottleneck.

| Aspect                                         | Target                                 |
| ---------------------------------------------- | -------------------------------------- |
| Driver per-request CPU overhead                | < 50µs (excluding JSON serialization)  |
| Memory per concurrent worker                   | < 100KB                                |
| Max sustainable concurrency on a 4-core laptop | ≥ 1000 workers                         |
| Trace file write throughput                    | ≥ 100k events/sec                      |
| Histogram update                               | lock-free per-worker, merged at end    |

These are tested in benches/ (criterion). v0.1 ships with reproducible numbers in the README.


20. Versioning + stability policy

  • v0.x: API can change anywhere
  • v1.0: locked. Breaking changes require major version bump (semver strict)
  • MCP spec: protocol_version field in initialize is checked. Mismatch warns but does not fail by default. Override with --strict-protocol.
  • Library MSRV (minimum supported Rust version): stable - 2 (e.g. if 1.85 is current stable, MSRV is 1.83).

When to commit to 1.0:

  • After 3 months of v0.x with no breaking changes
  • After 5+ external users have integrated
  • After at least 1 real bug caught in the wild and reported back

21. AI-friendliness (design pillar)

mcp-loadtest is a tool that AI agents will operate in two ways: autonomously (Claude Code running CI) and on a developer's behalf (a developer asking Claude to "load-test my MCP server"). Design accordingly.

21.1 First-class library API for embedding in agent tools

  • All public types have #[derive(Debug, Serialize, Deserialize)] so they're trivially JSON-able.
  • The library API is documented with rustdoc examples that compile (doctested in CI). LLMs read these examples to build correct calls on the first try.
  • No "you must construct in this exact order" sequencing — builders are commutative where possible.

21.2 Self-hosted MCP server: mcp-loadtest serve --mcp

The single most important AI-friendly feature. mcp-loadtest exposes itself as an MCP server with these tools:

| Tool             | Args                                                  | Returns                                              |
| ---------------- | ----------------------------------------------------- | ---------------------------------------------------- |
| deadlock_probe   | server_command, tool, concurrent                      | { deadlock_count, hung_for_ms[], details }           |
| sustained_load   | server_command, concurrent, duration_secs, tool, args | { p50_ms, p99_ms, error_rate, requests_per_sec }     |
| compare_runs     | baseline_run_dir, current_run_dir                     | structured diff with regression flags                |
| report_summary   | run_dir                                               | markdown summary string                              |
| list_recent_runs | limit                                                 | run dirs with metadata                               |

A user can say to Claude / Cursor / any MCP-aware agent: "Find deadlocks in my new MCP server at python -m foo" — and the agent calls deadlock_probe directly. No human-in-the-loop required to spawn a child process and parse stdout — the agent gets structured JSON back.
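From the agent's side, invoking one of these tools is an ordinary MCP `tools/call` request. The sketch below builds such a request for `deadlock_probe`; the argument types (a single command string, an integer concurrency) and the example values `"echo"` and `16` are assumptions, since the table lists argument names only.

```python
import json
from itertools import count

_ids = count(1)  # monotonically increasing JSON-RPC ids


def tools_call(name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tools_call(
    "deadlock_probe",
    {
        "server_command": "python -m foo",  # assumed to be one string
        "tool": "echo",                     # hypothetical tool name
        "concurrent": 16,                   # hypothetical value
    },
)
wire_line = json.dumps(request)  # one line on the stdio transport
```

The agent then reads the matching response by `id` and pulls `deadlock_count` out of the structured result.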

Reaatech doesn't do this. It's our most under-priced differentiator.

21.3 Actionable error messages with hints

Every Err returned to the user includes a suggested next step:

Error: server stdin closed unexpectedly during initialize handshake.
Hint: server may have crashed before responding. Check stderr at:
      runs/01HXY.../server.stderr.log
      Or re-run with --tee-stderr to see it live.

vs. the bad version:

Error: BrokenPipe(Os { code: 32, ... })

LLMs (and humans) act on the first; bounce off the second.
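One way to get this shape is an error type that always carries a hint next to the message. The Python sketch below only illustrates the pattern; `HintedError` and `with_hint` are hypothetical names, and the crate's actual Rust error types may be structured differently.

```python
class HintedError(Exception):
    """An error that pairs what went wrong with a suggested next step."""

    def __init__(self, message, hint):
        super().__init__(message)
        self.message = message
        self.hint = hint

    def __str__(self):
        return f"Error: {self.message}\nHint: {self.hint}"


def with_hint(cause, message, hint):
    """Wrap a low-level exception, keeping it reachable as __cause__."""
    err = HintedError(message, hint)
    err.__cause__ = cause
    return err


low_level = BrokenPipeError(32, "Broken pipe")
friendly = with_hint(
    low_level,
    "server stdin closed unexpectedly during initialize handshake.",
    "server may have crashed before responding. "
    "Re-run with --tee-stderr to see its stderr live.",
)
```

The raw OS error stays attached for debugging, while `str(friendly)` is what the user or agent sees first.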

21.4 --explain flag on every subcommand

$ mcp-loadtest deadlock-probe --explain
Algorithm:
  1. Spawn server process.
  2. Send `initialize`. Wait up to startup_timeout (default 10s).
  3. Send `notifications/initialized`.
  4. Send `tools/list`. Wait up to 1s.
  5. Synchronization barrier — N concurrent `tools/call` ready to fire.
  6. Release barrier. All N calls fire in parallel.
  7. Each call wrapped in hang_detect(hang_threshold=5s, grace_period=10s):
     - response within hang_threshold → SUCCESS
     - response between threshold and grace_period → SLOW (warning)
     - no response after grace_period → DEADLOCK (critical)
  8. Report aggregated results.

Tunable knobs: --concurrent, --hang-threshold, --grace-period.
See DESIGN.md §15.2 for the spec source.

LLMs use this to plan the right invocation. Reduces "I tried it but it didn't do what I expected" loops.

21.5 JSON Schema published for config + outputs

schema/config.v1.json and schema/metrics.v1.json shipped at well-known paths. LLMs validate generated configs / parse outputs without guessing field shapes.

21.6 mcp-loadtest doctor

Diagnoses common setup issues:

  • Python interpreter not on PATH (for fixture-based tests).
  • MSVC vs GNU toolchain mismatch on Windows.
  • Stale runs/ accumulation.
  • MCP server fails initialize — captures stderr and reports.

Outputs a checklist with ✅/❌ per item and a one-line fix per ❌. Exactly the kind of thing an LLM agent can chain into a fix-it loop.

21.7 Trace format is LLM-readable

runs/<id>/trace.jsonl is line-oriented JSON with stable field names (DESIGN.md §17.1). Pipeable through jq, parseable by any agent without custom code:

$ jq 'select(.kind=="hang")' runs/01HXY.../trace.jsonl

21.8 Reports include "What this means" interpretation

A report that says p99 latency: 234ms is data. A report that adds "95% of users would call this acceptable; the slow tail (top 1%) is concentrated on analyze_options calls" is information. We aim for the latter — derived sentences, not just numbers.

21.9 Snapshot tests for output formats

insta::assert_snapshot! on report markdown / JSON. Output shapes are stable across releases unless explicitly changed (with CHANGELOG entry). LLM agents that parse our output don't break across patch versions.

21.10 Cookbook in docs/examples/

Per-scenario copy-pasteable commands + expected output. LLMs train on README-style examples; cookbook entries make those examples concrete and executable.

Examples to ship at v0.1.0:

  • "Find deadlocks in my new MCP server"
  • "Add a regression gate to my CI"
  • "Compare two implementations of the same MCP server"
  • "Detect a memory leak before production"