Load tester and bug detector for MCP (Model Context Protocol) servers. Catches lazy-init deadlocks, concurrency races, hangs, and perf regressions that unit tests miss.
Lazy init inside an async worker thread is one of the easiest ways to ship a broken MCP server. `initialize` works, `tools/list` works, and then the first `tools/call` hangs forever. Standard pytest never sees it because the bug only surfaces when a real client opens a session and drives the protocol end to end.
The flagship example is HKUDS/Vibe-Trading PR #85: `_get_registry()` blocked on a deferred import of `src.tools.shell.*` inside FastMCP's worker thread, so every concurrent caller wedged on the same lock. The fix was five lines. Finding the bug took hours of differential testing.
mcp-loadtest finds it in seconds:

```console
$ mcp-loadtest deadlock-probe --server "python -m vibe_trading_mcp" \
    --tool analyze_options --concurrent 5 \
    --args '{"spot":450,"strike":460,"expiry_days":30}'
Run 01KR9JX7E4P638TKQM96YA0B4Z
Status: FAIL (1 deadlock)
Server: python -m vibe_trading_mcp
Scenario: deadlock_probe
Deadlocks: 1 Hangs: 0 Errors: 0
Error: DEADLOCK DETECTED — 1 deadlock(s), 0 error(s), 0 threshold violation(s)
$ echo $?
1
```

This is the bug class that breaks MCP servers in production. Unit tests don't catch it. mcp-loadtest does, and it exits non-zero so it can fail your CI gate. The regression that catches the exact Vibe-Trading commit lives at `crates/mcp-loadtest/tests/vibe_trading_regression.rs` (pinned to commit `71220c7c`, the parent of PR #85).
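The hazard is easier to see in a small sketch of the opposite shape. This is a Rust analogue, not Vibe-Trading's Python code, and it assumes the fix for this bug class is to build the registry once at startup so the worker path only ever reads it. `REGISTRY`, `init_registry`, `get_registry`, `call_tool`, and the `double` tool are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;
use std::thread;

// Hypothetical tool registry: tool name -> handler.
type Handler = fn(f64) -> f64;

static REGISTRY: OnceLock<HashMap<&'static str, Handler>> = OnceLock::new();

fn double(x: f64) -> f64 {
    x * 2.0
}

/// Called once at startup, before any worker thread serves a request.
fn init_registry() {
    REGISTRY.get_or_init(|| {
        let mut m: HashMap<&'static str, Handler> = HashMap::new();
        m.insert("double", double);
        m
    });
}

/// Worker-path accessor: read-only, never triggers initialization.
fn get_registry() -> &'static HashMap<&'static str, Handler> {
    REGISTRY.get().expect("init_registry() must run before serving")
}

fn call_tool(name: &str, x: f64) -> Option<f64> {
    get_registry().get(name).map(|h| h(x))
}

/// Simulates N concurrent tools/call requests hitting the worker path.
fn serve_concurrently(n: usize) -> Vec<Option<f64>> {
    let handles: Vec<_> = (0..n)
        .map(|i| thread::spawn(move || call_tool("double", i as f64)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

With this shape no tool call can be the thread that triggers initialization, so concurrent callers have nothing to wedge on. If startup forgets `init_registry()`, `get_registry()` panics immediately instead of hanging.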
- Bug-class detection
  - `deadlock_probe`: fires N concurrent `tools/call`s through `hang_detect`; classifies each as success / slow / deadlock (see DESIGN.md §15.2).
  - `race_check`: issues identical calls and diffs the responses to surface non-determinism (clocks, RNG, leaked state).
  - Hang detector watchdog wraps every call, so any scenario can surface a hung tool, not just the dedicated probe.
- Load testing
  - `sustained`: constant concurrency over a duration; baseline p50/p95/p99/p999 + throughput.
  - `ramp`: linear ramp of concurrency to find the break-point.
  - `soak`: long-running sustained load with periodic RSS sampling for leak hunting.
  - Cold-start measurement, weighted pattern mixes (explore-then-act, multi-step).
- Reporting
  - Markdown report at `runs/<ulid>/report.md`, self-contained `report.html` (no external deps), machine-readable `metrics.json`, ANSI terminal summary.
  - Schema-stable JSON (see `docs/schema/metrics.v1.json`); snapshot-tested so downstream LLM agents don't break on patch versions.
  - `mcp-loadtest compare baseline.json current.json` for regression diffs in CI.
  - `mcp-loadtest cross --server "..." --server "..."` for side-by-side runs across N targets.
- AI-agent friendly
  - Stable JSON output and structured error messages with `Hint:` lines.
  - `mcp-loadtest serve --mcp` exposes the tool itself as an MCP server so Claude Code, Cursor, or any MCP-aware agent can call `deadlock_probe`, `sustained_load`, and `compare_runs` as MCP tools directly. See DESIGN.md §21.2.
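The success / slow / deadlock classification can be sketched as a per-call watchdog. The exact rules live in DESIGN.md §15.2; this sketch assumes a call that returns within `hang_threshold` is a success, one that returns within the extra `grace_period` is slow, and one that has not returned by then is a deadlock. `classify_call` and `CallOutcome` are hypothetical names, not the crate's API.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
enum CallOutcome {
    Success,
    Slow,
    Deadlock,
    Failed,
}

/// Runs `call` on its own thread and classifies it against a hang budget.
fn classify_call<F>(call: F, hang_threshold: Duration, grace_period: Duration) -> CallOutcome
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let start = Instant::now();
    thread::spawn(move || {
        call();
        // The watchdog may already have given up; a failed send is fine.
        let _ = tx.send(());
    });
    match rx.recv_timeout(hang_threshold + grace_period) {
        Ok(()) if start.elapsed() <= hang_threshold => CallOutcome::Success,
        Ok(()) => CallOutcome::Slow,
        // No reply within the whole budget: treat as a deadlock.
        Err(RecvTimeoutError::Timeout) => CallOutcome::Deadlock,
        // Sender dropped without sending, i.e. the call panicked.
        Err(RecvTimeoutError::Disconnected) => CallOutcome::Failed,
    }
}
```

The timed-out worker thread is deliberately left running: Rust's standard library has no safe way to kill a thread, which is also why a hung tool call has to be detected from the outside rather than cancelled from within.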
mcp-loadtest is built to be a CI regression gate, not just a profiler. Every run resolves to a pass/fail and a non-zero exit code, so it drops straight into a pipeline:

- Threshold gating: `[thresholds]` (p50/p95/p99/p999 latency, error rate, memory growth, per-tool SLOs). Any breach → `report.passed() == false` → non-zero exit. Deadlocks are zero-tolerance.
- Baseline regression diff: `mcp-loadtest compare baseline.json current.json` flags p99 / error-rate / deadlock regressions. Thresholds default to 10% p99 / 0.5pp error rate / deadlock zero-tolerance and are configurable:

  ```bash
  mcp-loadtest compare base.json cur.json \
    --max-p99-regression-pct 15 --max-error-rate-regression-pp 1.0
  ```

  The same knobs are exposed as `compare_runs` MCP tool args for agent-driven gating (ADR 0009).
- Protocol-aware assertions: opt-in strict mode validates every `tools/call`'s arguments against the server's advertised `inputSchema` before the call. A contract mismatch is recorded as a `ProtocolError` and gates the run. Off by default (forward-compatible, ADR 0005/0010); enable per-config:

  ```toml
  [validation]
  strict = true
  ```

Full GitHub Actions example: docs/examples/ci-integration.md.
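Both gates reduce to simple arithmetic over a run's summary metrics. A minimal sketch, assuming a flattened, hypothetical `RunMetrics` struct (the real shape is `docs/schema/metrics.v1.json`), the default limits quoted above, and that any deadlock in the current run counts as a regression. Note the error-rate limit is in percentage points: going from 1.0% to 1.6% is a 0.6pp regression even though it is a 60% relative increase.

```rust
/// Hypothetical, flattened view of one run's metrics.json.
#[derive(Debug, Clone, Copy)]
struct RunMetrics {
    p99_ms: f64,
    /// Fraction of calls that errored: 0.01 means 1%.
    error_rate: f64,
    deadlocks: u32,
}

/// Threshold gating for a single run; deadlocks are zero-tolerance.
fn passes_thresholds(run: &RunMetrics, max_p99_ms: f64, max_error_rate: f64) -> bool {
    run.deadlocks == 0 && run.p99_ms <= max_p99_ms && run.error_rate <= max_error_rate
}

/// Regression limits for comparing two runs; defaults mirror the documented ones.
struct Limits {
    max_p99_regression_pct: f64,
    max_error_rate_regression_pp: f64,
}

impl Default for Limits {
    fn default() -> Self {
        Limits { max_p99_regression_pct: 10.0, max_error_rate_regression_pp: 0.5 }
    }
}

/// Returns one message per regression; an empty Vec means the gate passes.
fn regressions(base: &RunMetrics, cur: &RunMetrics, lim: &Limits) -> Vec<String> {
    let mut found = Vec::new();
    if base.p99_ms > 0.0 {
        // Relative change of p99, in percent.
        let pct = (cur.p99_ms - base.p99_ms) / base.p99_ms * 100.0;
        if pct > lim.max_p99_regression_pct {
            found.push(format!("p99 +{pct:.1}%"));
        }
    }
    // Absolute change in percentage points: difference of fractions times 100.
    let pp = (cur.error_rate - base.error_rate) * 100.0;
    if pp > lim.max_error_rate_regression_pp {
        found.push(format!("error rate +{pp:.2}pp"));
    }
    if cur.deadlocks > 0 {
        found.push(format!("{} deadlock(s)", cur.deadlocks));
    }
    found
}
```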
```bash
# Install from the public repo (not on crates.io yet — see docs/adr/0015)
cargo install --git https://github.com/Teerapat-Vatpitak/mcp-loadtest mcp-loadtest-cli
# ...or download a prebuilt binary from the GitHub Release:
# https://github.com/Teerapat-Vatpitak/mcp-loadtest/releases

# Quick deadlock smoke against a real MCP server
mcp-loadtest deadlock-probe --server "python -m my_mcp" --tool foo

# Print a starter config
mcp-loadtest example-config > bench.toml

# Sustained load from a config file
mcp-loadtest run --config bench.toml

# Compare two runs (e.g. main vs PR branch)
mcp-loadtest compare runs/baseline/metrics.json runs/current/metrics.json
```

A minimal `bench.toml`:
```toml
[server]
command = "python"
args = ["-m", "my_mcp"]
transport = "stdio"

[scenario]
type = "sustained"
duration = "60s"
concurrent = 50
tool = "get_market_data"
args = { ticker = "AAPL" }

[thresholds]
p99_latency = "500ms"
error_rate = 0.01
hang_timeout = "5s"

[output]
report_dir = "./runs"
formats = ["terminal", "markdown", "json"] # "html" is also available
```

From the CLI (the common path):
```bash
cargo run -p mcp-loadtest-cli -- deadlock-probe \
  --server "python -m my_mcp" \
  --tool get_market_data \
  --concurrent 20 \
  --args '{"ticker":"AAPL"}'
```

Library usage (pseudocode; see `crates/mcp-loadtest/tests/vibe_trading_regression.rs` for a runnable example):
```rust
// Sketch of the library API. RunContext requires run_start, cancel_token,
// metrics, hang_threshold, and grace_period — see the regression test linked
// above for the wiring.
use std::time::Duration;
use mcp_loadtest::scenario::deadlock_probe::DeadlockProbe;
use mcp_loadtest::scenario::Scenario;
use mcp_loadtest::Session;
use serde_json::json;

#[tokio::test]
async fn no_deadlock_under_concurrent_calls() {
    let mut session = Session::spawn("python", ["-m", "my_mcp"]).await.unwrap();
    let probe = DeadlockProbe {
        concurrent: 20,
        hang_threshold: Duration::from_secs(2),
        grace_period: Duration::from_secs(5),
        tool: "get_market_data".into(),
        args: json!({ "ticker": "AAPL" }),
    };
    // Build RunContext { run_start, cancel_token, metrics, hang_threshold, grace_period }.
    let outcome = probe.drive(&mut session, &ctx).await;
    assert_eq!(outcome.deadlock_count, 0);
}
```

reaatech/mcp-load-test is the closest comparable MCP load tester we're aware of. It's a TypeScript monorepo and covers the load-testing basics well. mcp-loadtest is built on a different axis: Rust performance and a static binary, plus a bug-detector layer that targets the classes of MCP failures unit tests miss.
| Feature | reaatech | mcp-loadtest |
|---|---|---|
| Deadlock detection (`deadlock_probe`) | not available | yes |
| Race / non-determinism detector | not available | yes (race_check) |
| Real-time TUI dashboard | post-hoc only | yes |
| Cross-server compare (1 run, N targets) | partial (2-run baseline diff) | yes (cross subcommand) |
| Server resource sampling over time (RSS/CPU/fd) | latency only | yes |
| Protocol fuzzer | not available | yes |
| Coverage tracking (registered vs exercised tools) | not available | yes |
| Per-tool SLO assertions | global only | yes |
| Configurable regression thresholds (CLI + MCP args) | fixed | yes |
| Protocol-aware assertions (opt-in strict `inputSchema` gating) | not available | yes |
| Self-hosted as MCP server (LLM-agent control) | not available | yes |
| HTML report | not available | yes |
| WebSocket transport | not available | yes |
| Rust perf + static binary | Node runtime required | single ~5 MB binary via cargo install |
| stdio transport | yes | yes |
| HTTP / SSE transports | yes | yes |
| Latency histograms p50/p95/p99/p999 | yes | yes |
| Breaking-point detection | yes | yes |
| Performance grading A-F | yes | yes |
| Soak / leak detection | yes | yes |
| Spike scenario (sudden burst) | yes | yes |
| Compare baselines | yes | yes |
| Realistic patterns (explore-then-act, multi-step) | yes | yes |
| Console + markdown + JSON reporters | yes | yes |
| Programmatic library API | yes | yes |
We tracked 4 direct competitors (reaatech, haakco/mcp-testing-framework, spbiju/MCP-Benchmark, IBM mcp-context-forge internal) and 6 adjacent LLM-eval frameworks (MCP-Bench, MCPBench, MCP-Universe, MCPMark, MCP-Inspector, k6-MCP). See DESIGN.md §10.5 for the full matrix and ADR 0004 for the positioning decision.
| Scenario | Detects |
|---|---|
| `cold_start` | startup time regressions, init-time deadlocks |
| `sustained` | baseline p99 latency, throughput, error rate |
| `ramp` | break-point: concurrency where p99 explodes |
| `spike` | sudden-burst load: baseline → peak window → cooldown |
| `soak` | memory leaks under sustained load |
| `deadlock_probe` | lazy-init deadlocks (the canonical Vibe-Trading bug class) |
| `race_check` | non-determinism / order-sensitive bugs |
| `pattern` | weighted random mixes (explore-then-act, read-then-write, multi-step) |
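The latency rows in the table above (`sustained` p99, the `ramp` break-point) rest on percentile math over many samples. A minimal sketch, assuming nearest-rank percentiles over raw samples and a simple break-point rule ("first concurrency level whose p99 exceeds `factor` times the lowest level's p99"); the crate may well use histograms and a different criterion, and `percentile` / `break_point` are hypothetical helpers.

```rust
use std::time::Duration;

/// Nearest-rank percentile over raw samples. `per_mille` is 500 for p50,
/// 990 for p99, 999 for p999. Integer math avoids float rounding at the edges.
fn percentile(samples: &[Duration], per_mille: usize) -> Option<Duration> {
    if samples.is_empty() || per_mille == 0 || per_mille > 1000 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(per_mille * n / 1000), a 1-based index into the sorted samples.
    let rank = (per_mille * sorted.len() + 999) / 1000;
    Some(sorted[rank - 1])
}

/// Hypothetical break-point rule for a ramp: the first concurrency level whose
/// p99 exceeds `factor` times the p99 measured at the first (lowest) level.
fn break_point(steps: &[(u32, Duration)], factor: u32) -> Option<u32> {
    let &(_, baseline) = steps.first()?;
    let limit = baseline * factor;
    steps.iter().find(|&&(_, p99)| p99 > limit).map(|&(c, _)| c)
}
```

For 100 samples of 1..=100 ms, p50 is 50 ms and p99 is 99 ms; with only 100 samples, p999 collapses onto the maximum, which is one reason long runs matter for tail latency.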
Each scenario is one `impl Scenario` in `crates/mcp-loadtest/src/scenario/` with a JSON Schema describing its config block. See DESIGN.md §8 for the full table.
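Of the scenarios above, `race_check` is the simplest to sketch: fire the same call N times concurrently and count the distinct responses. This assumes responses compare as plain strings (a real checker would more likely compare parsed JSON and skip known-volatile fields); `race_check` and `RaceVerdict` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

#[derive(Debug, PartialEq, Eq)]
enum RaceVerdict {
    /// Every call returned this same response.
    Deterministic(String),
    /// Calls disagreed; `distinct` is the number of different responses seen.
    Divergent { distinct: usize },
}

/// Issues `n` identical calls concurrently and diffs the responses.
fn race_check<F>(call: F, n: usize) -> RaceVerdict
where
    F: Fn() -> String + Send + Sync + 'static,
{
    assert!(n > 0, "race_check needs at least one call");
    let call = Arc::new(call);
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let call = Arc::clone(&call);
            thread::spawn(move || (*call)())
        })
        .collect();
    // Count how often each distinct response appears.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for handle in handles {
        let response = handle.join().expect("tool call panicked");
        *counts.entry(response).or_insert(0) += 1;
    }
    if counts.len() == 1 {
        let (only, _) = counts.into_iter().next().expect("exactly one entry");
        RaceVerdict::Deterministic(only)
    } else {
        RaceVerdict::Divergent { distinct: counts.len() }
    }
}
```

A `Divergent` verdict only says the responses differ; a timestamp or request ID inside an otherwise-correct response would trip it too, which is why field-level filtering matters in practice.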
Three worked examples in `docs/examples/`:

- CI integration: GitHub Actions workflow that runs `mcp-loadtest` on every PR and fails the build on threshold violations.
- Custom scenario: write `impl Scenario for MyThing`, register it, drive it from a TOML config. Uses `DeadlockProbe` as a reference.
- Debugging deadlocks: narrative walkthrough of what to do when `deadlock-probe` says DEADLOCK DETECTED. Stderr inspection, the lazy-init pattern that caused Vibe-Trading PR #85, and a worked-example test you can copy.
```bash
# From the public repo (not on crates.io yet — see docs/adr/0015)
cargo install --git https://github.com/Teerapat-Vatpitak/mcp-loadtest mcp-loadtest-cli
```

Or download a prebuilt binary for Linux/macOS/Windows from the GitHub Release. The crates.io publish is deferred because crates.io is append-only (a published version can be yanked but never deleted), so the first release stays off it for now (ADR 0015).
v0.1.0 is tagged (annotated) and validated: the CI checks (fmt, clippy, build, test, doc) are green on Windows with 368 tests passing, and `cargo deny` / `cargo audit` are clean. The headline demo (`deadlock_probe` catching the Vibe-Trading PR #85 bug on the unpatched commit) is in `crates/mcp-loadtest/tests/vibe_trading_regression.rs`. The repo is public and `cargo install --git` works today; prebuilt GitHub Release binaries are the next step. crates.io is deferred (ADR 0015).
```bash
git clone https://github.com/Teerapat-Vatpitak/mcp-loadtest
cd mcp-loadtest
bash scripts/ci-checks.sh   # or: pwsh scripts/ci-checks.ps1 on Windows
cargo nextest run --workspace --all-features
```

See CLAUDE.md for project conventions and CONTRIBUTING.md before opening a PR.
- DESIGN.md — what this is + how it works (21 sections)
- PROJECT-STRUCTURE.md — how the repo is laid out for AI-assisted development
- CONTRIBUTING.md — how to propose changes and open a PR
- CODE_OF_CONDUCT.md — contributor expectations
- SECURITY.md — how to report vulnerabilities
- docs/adr/ — architecture decision records
- docs/examples/ — cookbook
- CHANGELOG.md — release history
Dual-licensed under MIT OR Apache-2.0, at your option.