Teerapat-Vatpitak/mcp-loadtest

mcp-loadtest

CI License: MIT OR Apache-2.0 Rust 1.88+

Load tester and bug detector for MCP (Model Context Protocol) servers. Catches lazy-init deadlocks, concurrency races, hangs, and perf regressions that unit tests miss.

Why mcp-loadtest

Lazy-init inside an async worker thread is one of the easiest ways to ship a broken MCP server. initialize works, tools/list works, the first tools/call hangs forever. Standard pytest never sees it because the bug only surfaces when a real client opens a session and drives the protocol end-to-end.

The flagship example is HKUDS/Vibe-Trading PR #85: _get_registry() blocked on a deferred import src.tools.shell.* inside FastMCP's worker thread, so every concurrent caller wedged on the same lock. The fix was five lines. Finding the bug took hours of differential testing.

mcp-loadtest finds it in seconds:

$ mcp-loadtest deadlock-probe --server "python -m vibe_trading_mcp" \
    --tool analyze_options --concurrent 5 \
    --args '{"spot":450,"strike":460,"expiry_days":30}'

Run 01KR9JX7E4P638TKQM96YA0B4Z
Status: FAIL (1 deadlock)
Server: python -m vibe_trading_mcp
Scenario: deadlock_probe
Deadlocks: 1   Hangs: 0   Errors: 0

Error: DEADLOCK DETECTED — 1 deadlock(s), 0 error(s), 0 threshold violation(s)
$ echo $?
1

This is the bug class that breaks MCP servers in production. Unit tests don't catch it. mcp-loadtest does, and it exits non-zero so it can fail your CI gate. The regression test that reproduces the bug against the exact Vibe-Trading commit lives at crates/mcp-loadtest/tests/vibe_trading_regression.rs (pinned to commit 71220c7c, the parent of PR #85).

What it does

  • Bug-class detection
    • deadlock_probe — fires N concurrent tools/calls through hang_detect; classifies each as success / slow / deadlock (see DESIGN.md §15.2).
    • race_check — issues identical calls and diffs the responses to surface non-determinism (clocks, RNG, leaked state).
    • Hang detector watchdog wraps every call, so any scenario can surface a hung tool, not just the dedicated probe.
  • Load testing
    • sustained — constant concurrency over a duration; baseline p50/p95/p99/p999 + throughput.
    • ramp — linear ramp of concurrency to find the break-point.
    • soak — long-running sustained load with periodic RSS sampling for leak hunting.
    • Cold-start measurement, weighted pattern mixes (explore-then-act, multi-step).
  • Reporting
    • Markdown report at runs/<ulid>/report.md, self-contained report.html (no external deps), machine-readable metrics.json, ANSI terminal summary.
    • Schema-stable JSON (see docs/schema/metrics.v1.json); snapshot-tested so downstream LLM agents don't break on patch versions.
    • mcp-loadtest compare baseline.json current.json for regression diffs in CI.
    • mcp-loadtest cross --server "..." --server "..." for side-by-side runs across N targets.
  • AI-agent friendly
    • Stable JSON output and structured error messages with Hint: lines.
    • mcp-loadtest serve --mcp exposes the tool itself as an MCP server so Claude Code, Cursor, or any MCP-aware agent can call deadlock_probe, sustained_load, and compare_runs as MCP tools directly. See DESIGN.md §21.2.
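
To make the success / slow / deadlock split concrete, here is a minimal std-only sketch of a watchdog that runs each call on its own thread and classifies it by how long the reply takes. The names `Outcome`, `classify_call`, and `probe`, and the rule "deadlock means no reply within hang threshold plus grace period", are assumptions for illustration; the tool's actual classification is described in DESIGN.md §15.2 and may differ.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical outcome labels mirroring the success / slow / deadlock split.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Outcome {
    Success,
    Slow,
    Deadlock,
}

/// Runs `call` on a worker thread and classifies it by elapsed time:
/// - reply within `hang_threshold`                  -> Success
/// - reply within `hang_threshold + grace_period`   -> Slow
/// - no reply by then                               -> Deadlock
/// In this sketch a panicking call also ends up as Deadlock, because the
/// sender is dropped without ever sending.
pub fn classify_call<F>(call: F, hang_threshold: Duration, grace_period: Duration) -> Outcome
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let start = Instant::now();
    thread::spawn(move || {
        call();
        // The watchdog may already have given up; ignore the send error.
        let _ = tx.send(());
    });
    match rx.recv_timeout(hang_threshold + grace_period) {
        Ok(()) if start.elapsed() <= hang_threshold => Outcome::Success,
        Ok(()) => Outcome::Slow,
        Err(_) => Outcome::Deadlock,
    }
}

/// Fires all `calls` concurrently and returns one outcome per call, in order.
pub fn probe<F>(calls: Vec<F>, hang_threshold: Duration, grace_period: Duration) -> Vec<Outcome>
where
    F: FnOnce() + Send + 'static,
{
    let watchdogs: Vec<_> = calls
        .into_iter()
        .map(|call| thread::spawn(move || classify_call(call, hang_threshold, grace_period)))
        .collect();
    watchdogs
        .into_iter()
        .map(|w| w.join().expect("watchdog thread panicked"))
        .collect()
}
```

The watchdog never waits on the call itself, only on the channel, so a wedged call cannot wedge the probe. A real implementation would also need to cancel or reap the stuck worker, which this sketch leaves out.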

CI gating & protocol-aware assertions

mcp-loadtest is built to be a CI regression gate, not just a profiler. Every run resolves to a pass/fail and a non-zero exit code, so it drops straight into a pipeline:

  • Threshold gating: [thresholds] (p50/p95/p99/p999 latency, error rate, memory growth, per-tool SLOs). Any breach → report.passed() == false → non-zero exit. Deadlocks are zero-tolerance.

  • Baseline regression diff: mcp-loadtest compare baseline.json current.json flags p99 / error-rate / deadlock regressions. Thresholds default to 10% p99 / 0.5pp error rate / deadlock-zero-tolerance and are configurable:

    mcp-loadtest compare base.json cur.json \
        --max-p99-regression-pct 15 --max-error-rate-regression-pp 1.0

    The same knobs are exposed as compare_runs MCP tool args for agent-driven gating (ADR 0009).

  • Protocol-aware assertions — opt-in strict mode validates every tools/call's arguments against the server's advertised inputSchema before the call. A contract mismatch is recorded as a ProtocolError and gates the run. Off by default (forward-compatible, ADR 0005/0010); enable per-config:

    [validation]
    strict = true

Full GitHub Actions example: docs/examples/ci-integration.md.
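
The baseline regression rule described above (10% p99, 0.5 percentage points of error rate, zero tolerance for deadlocks) can be sketched as a small pure function. The `RunSummary` and `Limits` types and the `regressions` function are hypothetical names for illustration and are not the fields of metrics.json or the tool's API. Treating any deadlock in the current run as a regression, regardless of the baseline, is also an assumption.

```rust
/// Hypothetical subset of a run's metrics that the comparison needs.
#[derive(Debug, Clone, Copy)]
pub struct RunSummary {
    pub p99_ms: f64,
    /// Fraction of failed calls, in [0, 1].
    pub error_rate: f64,
    pub deadlocks: u32,
}

/// Hypothetical limits, named after the CLI flags quoted in the text.
#[derive(Debug, Clone, Copy)]
pub struct Limits {
    pub max_p99_regression_pct: f64,
    pub max_error_rate_regression_pp: f64,
}

impl Default for Limits {
    fn default() -> Self {
        // Defaults quoted in the text: 10% p99, 0.5 percentage points.
        Limits {
            max_p99_regression_pct: 10.0,
            max_error_rate_regression_pp: 0.5,
        }
    }
}

/// Returns one human-readable message per regression; empty means "pass".
pub fn regressions(base: &RunSummary, cur: &RunSummary, limits: &Limits) -> Vec<String> {
    let mut found = Vec::new();

    // Relative p99 change, in percent of the baseline.
    if base.p99_ms > 0.0 {
        let pct = (cur.p99_ms - base.p99_ms) / base.p99_ms * 100.0;
        if pct > limits.max_p99_regression_pct {
            found.push(format!("p99 regressed by {:.1}%", pct));
        }
    }

    // Absolute error-rate change, in percentage points.
    let pp = (cur.error_rate - base.error_rate) * 100.0;
    if pp > limits.max_error_rate_regression_pp {
        found.push(format!("error rate regressed by {:.2}pp", pp));
    }

    // Assumed zero-tolerance rule: any deadlock in the current run fails.
    if cur.deadlocks > 0 {
        found.push(format!("{} deadlock(s) in current run", cur.deadlocks));
    }

    found
}
```

The p99 limit is relative (percent of baseline) while the error-rate limit is absolute (percentage points), which is why the two are computed differently.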

Quick start

# Install from the public repo (not on crates.io yet — see docs/adr/0015)
cargo install --git https://github.com/Teerapat-Vatpitak/mcp-loadtest mcp-loadtest-cli
# ...or download a prebuilt binary from the GitHub Release:
#   https://github.com/Teerapat-Vatpitak/mcp-loadtest/releases

# Quick deadlock smoke against a real MCP server
mcp-loadtest deadlock-probe --server "python -m my_mcp" --tool foo

# Sustained load from a config file
mcp-loadtest run --config bench.toml

# Print a starter config
mcp-loadtest example-config > bench.toml

# Compare two runs (e.g. main vs PR branch)
mcp-loadtest compare runs/baseline/metrics.json runs/current/metrics.json

A minimal bench.toml:

[server]
command = "python"
args = ["-m", "my_mcp"]
transport = "stdio"

[scenario]
type = "sustained"
duration = "60s"
concurrent = 50
tool = "get_market_data"
args = { ticker = "AAPL" }

[thresholds]
p99_latency = "500ms"
error_rate = 0.01
hang_timeout = "5s"

[output]
report_dir = "./runs"
formats = ["terminal", "markdown", "json"]  # "html" is also available
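
As a rough illustration of what the [thresholds] block above gates on, the sketch below computes a nearest-rank percentile over recorded latencies and checks it against p99_latency and error_rate. The `Thresholds` struct, `percentile`, and `passes` are hypothetical names. The nearest-rank method and the inclusive comparisons are assumptions, since the source does not say how the tool computes percentiles.

```rust
use std::time::Duration;

/// Nearest-rank percentile over a sample set; `p` must be in (0, 100].
/// Returns None for an empty sample set or an out-of-range `p`.
pub fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Rank is 1-based: the smallest rank covering at least p% of the samples.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Hypothetical mirror of two keys from the [thresholds] block.
pub struct Thresholds {
    pub p99_latency: Duration,
    pub error_rate: f64,
}

/// True when both the p99 latency and the error rate are within limits.
/// A run with no latency samples is treated as passing the latency check.
pub fn passes(latencies: &[Duration], errors: usize, total: usize, t: &Thresholds) -> bool {
    let p99_ok = percentile(latencies, 99.0).map_or(true, |p99| p99 <= t.p99_latency);
    let rate = if total == 0 {
        0.0
    } else {
        errors as f64 / total as f64
    };
    p99_ok && rate <= t.error_rate
}
```

With 100 samples, p99 under this method is the 99th-smallest value, so two slow calls out of 100 are enough to move it.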

From the CLI (the common path):

cargo run -p mcp-loadtest-cli -- deadlock-probe \
    --server "python -m my_mcp" \
    --tool get_market_data \
    --concurrent 20 \
    --args '{"ticker":"AAPL"}'

Library usage (pseudocode — see crates/mcp-loadtest/tests/vibe_trading_regression.rs for a runnable example):

// Sketch of the library API. RunContext requires run_start, cancel_token,
// metrics, hang_threshold, and grace_period — see the regression test linked
// above for the wiring.
use std::time::Duration;
use mcp_loadtest::scenario::deadlock_probe::DeadlockProbe;
use mcp_loadtest::scenario::Scenario;
use mcp_loadtest::Session;
use serde_json::json;

#[tokio::test]
async fn no_deadlock_under_concurrent_calls() {
    let mut session = Session::spawn("python", ["-m", "my_mcp"]).await.unwrap();
    let probe = DeadlockProbe {
        concurrent: 20,
        hang_threshold: Duration::from_secs(2),
        grace_period: Duration::from_secs(5),
        tool: "get_market_data".into(),
        args: json!({ "ticker": "AAPL" }),
    };
    // Build RunContext { run_start, cancel_token, metrics, hang_threshold, grace_period }.
    let outcome = probe.drive(&mut session, &ctx).await;
    assert_eq!(outcome.deadlock_count, 0);
}

vs reaatech/mcp-load-test

reaatech/mcp-load-test is the only other MCP load tester we're aware of. It's a TypeScript monorepo and covers the load-testing basics well. mcp-loadtest is built on a different axis: Rust performance + a static binary, plus a bug-detector layer that targets the classes of MCP failures unit tests miss.

| Feature | reaatech | mcp-loadtest |
| --- | --- | --- |
| Deadlock detection (deadlock_probe) | not available | yes |
| Race / non-determinism detector | not available | yes (race_check) |
| Real-time TUI dashboard | post-hoc only | yes |
| Cross-server compare (1 run, N targets) | partial (2-run baseline diff) | yes (cross subcommand) |
| Server resource sampling over time (RSS/CPU/fd) | latency only | yes |
| Protocol fuzzer | not available | yes |
| Coverage tracking (registered vs exercised tools) | not available | yes |
| Per-tool SLO assertions | global only | yes |
| Configurable regression thresholds (CLI + MCP args) | fixed | yes |
| Protocol-aware assertions (opt-in strict inputSchema gating) | not available | yes |
| Self-hosted as MCP server (LLM-agent control) | not available | yes |
| HTML report | not available | yes |
| WebSocket transport | not available | yes |
| Rust perf + static binary | Node runtime required | single ~5 MB binary via cargo install |
| stdio transport | yes | yes |
| HTTP / SSE transports | yes | yes |
| Latency histograms p50/p95/p99/p999 | yes | yes |
| Breaking-point detection | yes | yes |
| Performance grading A-F | yes | yes |
| Soak / leak detection | yes | yes |
| Spike scenario (sudden burst) | yes | yes |
| Compare baselines | yes | yes |
| Realistic patterns (explore-then-act, multi-step) | yes | yes |
| Console + markdown + JSON reporters | yes | yes |
| Programmatic library API | yes | yes |

We tracked 4 direct competitors (reaatech, haakco/mcp-testing-framework, spbiju/MCP-Benchmark, IBM mcp-context-forge internal) and 6 adjacent LLM-eval frameworks (MCP-Bench, MCPBench, MCP-Universe, MCPMark, MCP-Inspector, k6-MCP). See DESIGN.md §10.5 for the full matrix and ADR 0004 for the positioning decision.

Built-in scenarios

| Scenario | Detects |
| --- | --- |
| cold_start | startup time regressions, init-time deadlocks |
| sustained | baseline p99 latency, throughput, error rate |
| ramp | break-point: the concurrency where p99 explodes |
| spike | sudden-burst load: baseline → peak window → cooldown |
| soak | memory leaks under sustained load |
| deadlock_probe | lazy-init deadlocks (the canonical Vibe-Trading bug class) |
| race_check | non-determinism / order-sensitive bugs |
| pattern | weighted random mixes (explore-then-act, read-then-write, multi-step) |

Each scenario is one impl Scenario in crates/mcp-loadtest/src/scenario/ with a JSON-Schema describing its config block. See DESIGN.md §8 for the full table.
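
For a sense of what the race_check row amounts to, the sketch below groups the responses to identical calls and reports whether they all agree. `distinct_responses` and `is_deterministic` are hypothetical names, and byte-for-byte string comparison is an assumption; the real scenario may normalize or diff responses structurally.

```rust
use std::collections::BTreeMap;

/// Counts how many times each distinct response body was seen.
/// More than one key means identical calls produced different answers.
pub fn distinct_responses(responses: &[String]) -> BTreeMap<&str, usize> {
    let mut groups = BTreeMap::new();
    for body in responses {
        *groups.entry(body.as_str()).or_insert(0) += 1;
    }
    groups
}

/// True when every response is identical (or there are no responses).
pub fn is_deterministic(responses: &[String]) -> bool {
    distinct_responses(responses).len() <= 1
}
```

Keeping the per-response counts, rather than a bare yes/no, makes it easier to tell a one-off outlier (for example a leaked timestamp) from a response that flips on every call.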

Cookbook

Three worked examples in docs/examples/:

  • CI integration — GitHub Actions workflow that runs mcp-loadtest on every PR and fails the build on threshold violations.
  • Custom scenario — write impl Scenario for MyThing, register it, drive it from a TOML config. Uses DeadlockProbe as a reference.
  • Debugging deadlocks — narrative walkthrough of what to do when deadlock-probe says DEADLOCK DETECTED. Stderr inspection, the lazy-init pattern that caused Vibe-Trading PR #85, and a worked-example test you can copy.

Install

# From the public repo (not on crates.io yet — see docs/adr/0015)
cargo install --git https://github.com/Teerapat-Vatpitak/mcp-loadtest mcp-loadtest-cli

Or download a prebuilt binary for Linux/macOS/Windows from the GitHub Release. Publishing to crates.io is deferred so that the first release does not land on an append-only registry (ADR 0015).

Status

v0.1.0 is tagged (annotated tag) and validated: the CI checks (fmt, clippy, build, test, doc) are green on Windows with 368 tests passing, and cargo deny and cargo audit are clean. The killer demo (deadlock_probe catches the Vibe-Trading PR #85 bug on the unpatched commit) is in crates/mcp-loadtest/tests/vibe_trading_regression.rs. The repo is public and cargo install --git works today; prebuilt GitHub Release binaries are the next step. crates.io is deferred (ADR 0015).

Development

git clone https://github.com/Teerapat-Vatpitak/mcp-loadtest
cd mcp-loadtest
bash scripts/ci-checks.sh        # or: pwsh scripts/ci-checks.ps1 on Windows
cargo nextest run --workspace --all-features

See CLAUDE.md for project conventions and CONTRIBUTING.md before opening a PR.

License

Dual-licensed under MIT OR Apache-2.0, at your option.
