Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Changed

  • Distribution: v0.1.0 ships via cargo install --git + prebuilt GitHub Release binaries instead of crates.io, to keep the first public release off crates.io's append-only commitment (ADR 0015 — amends the distribution channel of ADR 0004; the head-on-competition strategy is unchanged). No code change; the v0.1.0 tag is unaffected. README / DESIGN.md §10 / docs/RELEASE.md install instructions updated accordingly.

0.1.0 — 2026-05-17

First public release. See DESIGN.md §10 for milestone plan.

Added

  • Opt-in strict MCP schema validation (ADR 0010): [validation] strict = true validates each tools/call's arguments against the server's advertised inputSchema before sending. A dependency-free subset validator (protocol::schema) covers type/properties/required/enum/items and skips unmodeled keywords (forward-compatible, ADR 0005). Arg violations are classified as CallOutcome::ProtocolError so they gate a run; default (off) behaviour is byte-for-byte unchanged. Result-side validation is deferred to v0.2.
  • Configurable regression thresholds (ADR 0009 follow-up): compare gains --max-p99-regression-pct / --max-error-rate-regression-pp / --allow-deadlock-increase, and the compare_runs MCP tool gains the matching optional args. Backed by a shared analysis::regression::RegressionThresholds whose Default reproduces the historical 10% p99 / 0.5pp / deadlock-zero-tolerance policy, so existing CI gates are unaffected unless they opt in.
  • Project scaffolding: workspace layout, Cargo config, CLAUDE.md hierarchy, slash commands, CI workflow.
  • Design document covering motivation, types, algorithms, test matrix, milestones (DESIGN.md, 20 sections).
  • Project-structure document for AI-assisted development workflow (PROJECT-STRUCTURE.md).
  • M1 protocol stack:
    • protocol::jsonrpc — JSON-RPC 2.0 message types (OutgoingRequest, OutgoingNotification, ResponseEnvelope, ResponsePayload, ErrorObject).
    • protocol::mcp — MCP types (InitializeParams/Result, Tool, ListToolsResult, CallToolParams, CallToolResult, Content).
  • M1 Session — stdio MCP session that spawns a child, runs initialize + notifications/initialized, exposes list_tools / call_tool / shutdown. Synchronous request/response only; concurrent in-flight requests deferred to M2.
  • Python test fixtures: _common.py (framing helpers) + mock-normal.py (echoes args).
  • End-to-end integration test happy_path.rs covering spawn → initialize → tools/list → tools/call → shutdown against mock-normal.py.
  • M2 scenarios + metrics core (delivered via 4-agent parallel sprint):
    • scenario::Scenario trait + RunContext + ScenarioOutcome (interface contract pinned in AGENTS.md).
    • scenario::sustained::Sustained — concurrent-load workload (M2 sequential against single Session; multi-Session pool is M3).
    • scenario::deadlock_probe::DeadlockProbe — Vibe-Trading-bug-class detector. Wraps each tools/call with hang_detect; bails on first deadlock to avoid flooding a wedged session.
    • scenario::cold_start::ColdStart — placeholder (M3 will activate once RunContext gains a session-spawning factory).
    • hang_detector::hang_detect — two-phase tokio::select! watchdog (DESIGN.md §15.1) classifying each call as Ok / Slow / Deadlock / Err.
    • metrics::Recorder — Arc-shared, lock-free outcome counters + 16-shard hdrhistogram for per-call latencies (microsecond resolution, 1µs..=1h range).
    • ScenarioMetrics, LatencyStats (p50/p95/p99/p999/mean/min/max/count), ThroughputStats, OutcomeCounts.
  • M2 fixtures: mock-broken.py (canonical deadlock pattern), mock-slow.py (2s tool latency), mock-crash.py (~1% mid-call exit).
  • M2 integration tests: scenarios_basic.rs (sustained + cold_start placeholder + cancellation), deadlock.rs (mock_normal_no_deadlock + mock_broken_detects_deadlock — the killer test that catches the Vibe-Trading bug class in <7s).
  • M3 reports + first internal release (delivered via 4-agent parallel sprint):
    • config::Config + ServerConfig + ScenarioConfig + ThresholdsConfig + OutputConfig — TOML schema with humantime durations, semantic validation (Config::from_toml_str/from_file), and example_config() printer.
    • report::Report + ProcessStats + ProcessSample + ServerInfo + ThresholdViolation + ReportError + Reporter trait.
    • report::markdown::MarkdownReporter — DESIGN §17.3 template (status badge, summary, latency table, errors, process line, threshold violations, trace).
    • report::json::JsonReporter — DESIGN §17.2 schema via a ReportView wrapper (durations as _ms, ISO 8601 timestamps); Report stays unmodified.
    • report::terminal::TerminalReporter — ANSI-colored compact summary; respects NO_COLOR / CLICOLOR / non-tty automatically.
    • metrics::process::ProcessSampler — sysinfo 0.32 backed periodic RSS+CPU sampler, two-phase CPU baseline, cancellation-aware, best-effort on dead PIDs.
    • run::Run + RunError + Run::execute() — full orchestrator: ulid run-id, run dir creation, Session spawn, scenario drive, metrics snapshot, threshold evaluation, bounded shutdown.
    • 79 tests passing across lib + 6 integration test files.
  • M3 CLI: mcp-loadtest example-config | run --config <path> | deadlock-probe --server "..." | list-scenarios. Run + DeadlockProbe write report.md and metrics.json under runs/<ulid>/; deadlock-probe exits non-zero when deadlock_count > 0.
  • M3 Vibe-Trading regression test: clones HKUDS/Vibe-Trading@71220c7 (parent of PR #85) into target/vibe-trading-fixture/, runs DeadlockProbe, asserts deadlock detected. #[ignore]d by default; run with cargo test --test vibe_trading_regression -- --ignored --nocapture.
  • M4 transport parity (3-agent parallel sprint):
    • protocol::transport::Transport async trait + TransportError (Io/Http/Closed/Timeout/Other).
    • StdioTransport extracted from Session (legacy single-call path stays via Session::spawn).
    • HttpTransport (Streamable HTTP simple JSON variant; SSE-response detection routes to a clear M5 deferral).
    • SseTransport with background reader task + endpoint handshake + id-correlation buffer.
    • Python fixtures mock-http-server.py + mock-sse-server.py (stdlib http.server only).
    • ServerConfig.url + transport-aware Run::execute dispatch.
    • Toolchain bumped 1.85 → stable to satisfy icu transitive MSRV pulled in by url.
  • M5 analysis parity (4-agent parallel sprint):
    • analysis::breaking_point (BreakingPointDetector w/ first-violator semantics on per-step deltas).
    • analysis::grading (Grade A-F per latency/concurrency/error, worst-of-three rollup).
    • scenario::ramp (linear-stepped concurrency; integrates with breaking_point).
    • scenario::pattern (multi-step + weighted-random + think-time + ErrorBehavior).
    • Sustained refactor to drive Patterns; run_patterns free function for multi-pattern callers.
    • scenario::soak (periodic snapshots + leak signal via mean-latency regression).
    • mcp-loadtest compare baseline.json current.json CLI subcommand (markdown/JSON diff).
  • M6 differentiators v1 (4-agent parallel sprint):
    • tui::dashboard (ratatui + crossterm live polling; quits on q/Esc, propagates cancel).
    • analysis::race_detector + scenario::race_check (key-sorted JSON canonicalization, divergence reporting).
    • mcp-loadtest cross --server "..." --server "..." (side-by-side multi-server comparison with grading).
    • ProcessStats enriched with peak_fd/final_fd/peak_threads/final_threads (best-effort: Linux /proc/fd; macOS/Windows degrade to 0).
    • Soak linear-regression-based leak signal helper (detect_leak).
  • M7 differentiators v2 + v0.1.0 polish (4-agent parallel sprint):
    • scenario::fuzzer (FuzzPayload enum: UnknownMethod / NumericMethod / GiantPayload / ControlChars / Nested / NullParams / StringParams; raw-byte variants documented + skipped pending Transport::raw_send hook).
    • analysis::fuzz_report (FuzzClass classification + has_critical signal).
    • analysis::coverage (CoverageReport: registered vs exercised tools + coverage_pct).
    • ToolSlo per-tool latency budget assertions in ThresholdsConfig.
    • Recorder::record_tool + snapshot_per_tool (per-tool counters; existing record/snapshot aggregate untouched for back-compat).
    • mcp-loadtest serve --mcp self-hosted MCP server (DESIGN §21.2 differentiator) — exposes deadlock_probe/sustained_load/compare_runs as MCP tools so AI agents drive load tests directly via stdio JSON-RPC.
    • README rewritten to lead with the deadlock demo + competitive positioning vs. reaatech.
    • docs/examples/{ci-integration,custom-scenario,debugging-deadlocks}.md cookbook.
  • Post-M7 competitive-gap close (3-agent parallel sprint, post-review):
    • scenario::spike — sudden-burst concurrency pattern (baseline → spike window → cooldown). Closes the reaatech parity gap.
    • report::html::HtmlReporter — self-contained report.html with inline SVG histogram, escaped HTML, no external CDN or JS. Closes the IBM/spbiju enterprise-report gap. Wired into the CLI as the "html" output format.
    • protocol::transport::ws::WsTransport — WebSocket transport via tokio-tungstenite (rustls + webpki-roots); 16 MB per-frame OOM cap mirroring stdio. Activates the "ws" scheme that was previously parser-accepted but rejected at runtime.
    • SECURITY.md — security disclosure policy at repo root (in-scope vs out-of-scope surfaces, reporting flow, recent hardening notes).
  • 12 new tests bring the suite to ~255 passing (was 243): spike happy path + 5 HTML reporter (escaping, chart, violation styling) + 2 WS (echo roundtrip + scheme rejection) + 1 config parse-spike-kind + scenario name/schema asserts.
  • v0.1.0 pre-publish feature wave (4-agent parallel sprint, disjoint file ownership):
    • SSRF host-allowlist (ADR 0012): [server].allowed_hosts exact-match (ASCII case-insensitive, no wildcard) allowlist for http/sse/ws; empty/unset = allow any public host. Always-on block of private/loopback/link-local/ULA/unspecified/reserved IP literals (the SSE server-provided endpoint URL is checked too), with an operator escape hatch (list the literal, e.g. "127.0.0.1", to permit local testing). protocol::transport::HostGuard is now public (needed to construct Http/Sse/Ws transports directly via their connect(url, &guard) constructors).
    • --capture-stderr / --tee-stderr on run (ADR 0013): redirect the spawned stdio server's stderr to runs/<id>/server-stderr.log (capture) or additionally mirror it live to the parent's stderr (tee). New public API SpawnOptions / StderrMode / StderrCapture, Session::spawn_with, StdioTransport::spawn_with, Run::with_stderr_capture; the stable 2-arg Session::spawn is unchanged (delegates). Tee is a cancellation-aware, JoinHandle-tracked task that flushes before every exit.
    • mcp-loadtest doctor (DESIGN §21.6, ADR 0014): 4 best-effort checks — Python on PATH, optional --server initialize smoke (captures the server's stderr on failure), stale runs/ accumulation, Windows MSVC/GNU toolchain mismatch. ✅/❌ checklist + one-line fix per ❌; non-zero exit if any ❌.
    • --explain global flag (DESIGN §21.4, ADR 0014): static per-subcommand algorithm text, exit 0. Serviced by a pre-clap std::env::args() scan so it works without a subcommand's required args (e.g. run --explain needs no --config).
    • Actionable error hints (DESIGN §21.3, ADR 0014): hints::ErrorHint (CLI crate) maps RunError/SessionError/TransportError/ConfigError/ReportError to a one-line next step, printed after the error chain at the CLI boundary so the library error enums stay clean.
    • Python fixtures mock-leak.py (10 KB/call RSS growth), mock-error.py (cycles -32601/-32602/-32603), mock-slow-init.py (5 s initialize delay), mock-malformed.py (every 10th response is newline-terminated broken JSON) — DESIGN §16.7-16.10, no longer "planned for v0.2".
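
The configurable regression thresholds entry above describes a shared struct whose Default reproduces the historical policy (10% p99, 0.5pp error rate, zero tolerance for deadlock increases). A minimal sketch of that idea follows. The field names, the `RunSummary` type, and the `violations` helper are assumptions made for illustration; only the default values come from this changelog, and the real `analysis::regression::RegressionThresholds` may look different.

```rust
/// Hypothetical stand-in for `analysis::regression::RegressionThresholds`.
/// Field names are assumptions; only the default values come from the changelog.
#[derive(Debug, Clone, PartialEq)]
pub struct RegressionThresholds {
    /// Maximum tolerated p99 latency regression, in percent.
    pub max_p99_regression_pct: f64,
    /// Maximum tolerated error-rate regression, in percentage points.
    pub max_error_rate_regression_pp: f64,
    /// Whether a rise in the deadlock count is tolerated.
    pub allow_deadlock_increase: bool,
}

impl Default for RegressionThresholds {
    fn default() -> Self {
        Self {
            max_p99_regression_pct: 10.0,
            max_error_rate_regression_pp: 0.5,
            allow_deadlock_increase: false,
        }
    }
}

/// Hypothetical summary of one run (not the project's real report type).
#[derive(Debug, Clone, Copy)]
pub struct RunSummary {
    pub p99_ms: f64,
    pub error_rate_pct: f64,
    pub deadlock_count: u64,
}

/// Returns one human-readable line per violated threshold.
pub fn violations(base: &RunSummary, cur: &RunSummary, t: &RegressionThresholds) -> Vec<String> {
    let mut out = Vec::new();
    // Relative p99 change; skipped when the baseline is zero to avoid dividing by it.
    if base.p99_ms > 0.0 {
        let pct = (cur.p99_ms - base.p99_ms) / base.p99_ms * 100.0;
        if pct > t.max_p99_regression_pct {
            out.push(format!("p99 regressed by {pct:.1}%"));
        }
    }
    // Absolute error-rate change in percentage points.
    let pp = cur.error_rate_pct - base.error_rate_pct;
    if pp > t.max_error_rate_regression_pp {
        out.push(format!("error rate rose by {pp:.2}pp"));
    }
    if !t.allow_deadlock_increase && cur.deadlock_count > base.deadlock_count {
        out.push("deadlock count increased".to_string());
    }
    out
}
```

Under these assumptions, a caller that passes `RegressionThresholds::default()` keeps the historical gate, while a caller that loosens individual fields opts in to a softer one.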

Refactor / cleanup

  • report::common extracted (post-M4 /simplify pass): fmt_duration, fmt_count, format_server_command, describe_failure, format_iso8601_utc shared between markdown + terminal reporters (-69 LoC net).
  • ThresholdViolation.metric: String → kind: ThresholdKind enum (post-M3 QF-5); JSON wire format preserved via #[serde(rename = "metric")] + per-variant snake_case rename.
  • Report::passed() now treats deadlock_count > 0 as a hard failure (post-M3 QF-1).
  • Session::pid() accessor + Run::execute wires ProcessSampler (post-M3 QF-2).
  • CLI deadlock-probe exits non-zero on any error / threshold violation / deadlock (post-M3 QF-3).
  • AGENTS.md — multi-agent coordination contract (file ownership, locked interfaces, sprint exit criteria).
  • DESIGN.md §10 revised: 8-week head-on competition plan vs. reaatech/mcp-load-test (was 3-week solo plan).
  • DESIGN.md §10.5 — competitive parity matrix (12 reaatech features to match, 12 mcp-loadtest differentiators).
  • DESIGN.md §21 — AI-friendliness pillar (10 design principles incl. self-hosted MCP wrapper, --explain flag, doctor subcommand, structured errors with hints, JSON Schema for outputs).
  • ADR 0004 — strategic decision to compete head-on (Path A) over contributing to reaatech (Path B) or repositioning (Path C).

Fixed

  • Soak::leak_threshold_mb_per_sec renamed to latency_drift_ms_per_sec (the units were always ms/sec; the old name lied).
  • Fuzzer: skipped raw-transport iterations no longer bump total_calls or pollute CallOutcome::Cancelled (they never hit the wire).
  • Fuzzer: server-accepted malformed payloads now record CallOutcome::Malformed and bump error_count so threshold evaluators surface them.
  • Run: memory_growth_mb is now computed as peak − final rather than the bare peak, so a process with a high steady-state RSS no longer triggers a false positive.
  • SSRF guard IPv6 hole found + fixed in review: url::Url::host_str() returns bracketed IPv6 ([::1]), so the initial parse::<IpAddr>() host extraction silently let every IPv6 literal bypass the private-IP block. The guard now classifies the host via the typed url::Host enum (the parser already split host kind at parse time), closing the bypass; covered by IPv6/IPv4-mapped unit tests.
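
The IPv6 bypass described in the last entry can be reproduced with the standard library alone: a bracketed host string is not valid `IpAddr` syntax, so a parse-based check never sees the address. The sketch below is illustrative only. Both function names are hypothetical, and the bracket-stripping variant is shown just to make the failure visible; the actual fix recorded above uses the typed `url::Host` enum rather than string handling.

```rust
use std::net::IpAddr;

/// Naive extraction: parse the host string as-is. A bracketed IPv6 host
/// such as "[::1]" is not valid `IpAddr` syntax, so this returns `None`,
/// and a caller that reads `None` as "not an IP literal" skips the block.
pub fn naive_ip_literal(host: &str) -> Option<IpAddr> {
    host.parse::<IpAddr>().ok()
}

/// Bracket-aware extraction (hypothetical helper, for illustration only).
pub fn bracket_aware_ip_literal(host: &str) -> Option<IpAddr> {
    // Strip one surrounding pair of brackets if present, else use the input unchanged.
    let inner = host
        .strip_prefix('[')
        .and_then(|rest| rest.strip_suffix(']'))
        .unwrap_or(host);
    inner.parse::<IpAddr>().ok()
}
```

With the naive version, `"[::1]"` is treated like a hostname and passes; with the bracket-aware version it parses to the loopback address and can be classified.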

Security

  • SSRF hardening (ADR 0012, supersedes the ADR 0007 deferral): every outbound http/sse/ws connection — including the SSE server-provided endpoint URL — is validated against an optional exact-match host allowlist plus an always-on private/loopback/link-local/ULA/unspecified/reserved IP-literal block, before any network I/O. Complements the existing Policy::none() redirect policy. Residual documented gap: a hostname that resolves to a private IP (DNS rebinding) is not blocked in v0.1 — the allowlist is the strong control; a resolver-pinning connector is the recorded v0.2 follow-up (ADR 0012 Open).
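
The two layers described above (an optional exact-match allowlist plus an always-on block of private IP literals, with listing the literal as the escape hatch) might be combined roughly as follows. This is a sketch under stated assumptions: `GuardSketch` and its methods are hypothetical and are not the crate's `HostGuard` API, the blocked ranges cover only part of the "reserved" set mentioned above, and hostnames that resolve to private addresses are not handled, matching the documented residual gap.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical stand-in for the host guard; not the crate's real `HostGuard`.
pub struct GuardSketch {
    /// Lower-cased allowlist entries; empty means "allow any public host".
    allowed: Vec<String>,
}

fn v4_blocked(ip: Ipv4Addr) -> bool {
    ip.is_private() || ip.is_loopback() || ip.is_link_local() || ip.is_unspecified() || ip.is_broadcast()
}

fn v6_blocked(ip: Ipv6Addr) -> bool {
    if let Some(v4) = ip.to_ipv4_mapped() {
        return v4_blocked(v4); // ::ffff:a.b.c.d inherits the IPv4 verdict
    }
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}

impl GuardSketch {
    pub fn new(allowed_hosts: &[&str]) -> Self {
        Self { allowed: allowed_hosts.iter().map(|h| h.to_ascii_lowercase()).collect() }
    }

    /// `host` is a hostname or an unbracketed IP literal.
    pub fn permits(&self, host: &str) -> bool {
        let host = host.to_ascii_lowercase();
        // Exact match only: no wildcard or subdomain matching.
        let listed = self.allowed.iter().any(|h| *h == host);
        if let Ok(ip) = host.parse::<IpAddr>() {
            let blocked = match ip {
                IpAddr::V4(v4) => v4_blocked(v4),
                IpAddr::V6(v6) => v6_blocked(v6),
            };
            if blocked {
                return listed; // escape hatch: the operator listed the literal
            }
        }
        self.allowed.is_empty() || listed
    }
}
```

In this sketch the private-literal block is evaluated before the allowlist, so an empty allowlist still rejects `127.0.0.1` unless the operator lists it explicitly.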

Performance

  • Sustained / ramp / soak scenarios: tokio::task::yield_now() instead of sleep(ZERO) to avoid registering no-op reactor timers.
  • Metrics: per-tool BTreeMap moved behind RwLock; fast path is now a read lock.
  • cmd_cross: cross-server runs use futures::future::join_all (was sequential, now N-way parallel).
  • Fuzzer: LazyLock for GIANT_PAYLOAD + NESTED_PAYLOAD so multi-MB / 100-deep payloads build once per process.
  • Release profile: panic = "abort" shaves ~600 KB off the stripped binary (5.7 MB → 5.1 MB on x86_64-pc-windows-gnu).
  • Hot-path zero-copy refactor (Phase 1 pre-public audit):
    • OutgoingRequest / OutgoingNotification now borrow method: &'a str and params: &'a P (generic, P: ?Sized + Serialize). Eliminates the intermediate serde_json::to_value() round-trip on every tool call.
    • CallToolParams { name: &'a str, arguments: &'a Value }.
    • Session::call_tool(&str, &Value) — was (&str, Value). Scenarios drop .clone() and pass &self.args; deep-clone of the JSON args tree per iteration is gone.
  • Transports: pending: VecDeque<String> (id-mismatch buffer) is now capped at MAX_PENDING_FRAMES = 256 in both sse and ws. Overflow surfaces TransportError::Other instead of growing without bound.
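
The capped id-mismatch buffer in the last entry can be sketched with a plain `VecDeque`. The constant's name and value come from the entry above; `PendingOverflow` and `buffer_frame` are hypothetical names standing in for `TransportError::Other` and whatever the transports actually call internally.

```rust
use std::collections::VecDeque;

/// Cap taken from the changelog entry.
pub const MAX_PENDING_FRAMES: usize = 256;

/// Hypothetical error type standing in for `TransportError::Other`.
#[derive(Debug, PartialEq)]
pub struct PendingOverflow;

/// Buffers a frame whose id did not match the awaited response.
/// Refuses the frame once the cap is reached instead of growing without bound.
pub fn buffer_frame(pending: &mut VecDeque<String>, frame: String) -> Result<(), PendingOverflow> {
    if pending.len() >= MAX_PENDING_FRAMES {
        return Err(PendingOverflow);
    }
    pending.push_back(frame);
    Ok(())
}
```

Rejecting at the cap, rather than evicting the oldest frame, keeps already-buffered responses intact and makes the overflow visible to the caller as an error.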

Tests / benches

  • Criterion microbenchmarks added under crates/mcp-loadtest/benches/: record, histogram, session_loopback (Transport-trait loopback, no I/O), hang_detect. Run with cargo bench -p mcp-loadtest. Closes the DESIGN.md §19 perf-claim gap.
  • Phase 3 coverage gaps closed (pre-publish audit):
    • tests/ramp.rs — first integration test for the ramp scenario (was unit-only).
    • tests/spike.rs — added spike_against_crashing_server_survives_without_hang (failure-mode coverage; uses mock-crash.py; assertions are crash-stochastic-aware so the test isn't flaky).
    • src/protocol/transport/ws.rs — 3 new failure-mode tests: server-closes-mid-call, cancel-during-request, oversized-frame-rejected.
    • tests/reporter_snapshots.rs — 5 new tests: substring landmarks for html + terminal (insta-snapshot parity with markdown + json was skipped because both reporters have too much structural variance for stable snapshots); empty-metrics renders without panic for html, terminal, and json (catches divide-by-zero in throughput math).
  • Test suite: 260 passing (was 250 pre-Phase-3); 0 flakes in a 3-run check; 1 #[ignore] left (Vibe-Trading regression — requires external checkout).
  • cold_start scenario remains an intentional placeholder in v0.1.0; real handshake-time histogram measurement is queued for v0.2 (DESIGN §8). The existing cold_start_is_an_inert_placeholder test pins the placeholder contract so the v0.2 work is forced to update assertions.

Changed

  • ServerConfig::stdio(command, args) constructor + split_server_command() free fn relocated to config module (kills 4 + 3 hand-rolled literals across the codebase).
  • classify_error / is_terminal_error deduped into scenario/mod.rs (kills 3 byte-identical copies across pattern / ramp / soak).
  • Regression thresholds P99_REGRESSION_PCT + ERROR_RATE_REGRESSION_PP lifted into analysis::regression and shared between cmd_compare and serve/tools::compare_runs.
  • Unified scenario builder: a single build_scenario factory now drives every config — sustained accepts weighted patterns / legacy tool_call arrays (via PatternScenario) alongside the single-tool path, and all M5–M7 kinds (ramp / soak / spike / race_check / fuzzer / pattern) dispatch through it.
  • cmd_run split into private builder / params / patterns submodules (each under the 300 prod-LoC convention); public surface unchanged (run_from_config + the re-exported parse_dur_str main.rs shares with deadlock-probe / cross).
  • Breaking (absorbed into v0.1.0, pre-publish): HttpTransport/SseTransport/WsTransport::connect take an added &HostGuard argument; StdioTransport::spawn is now async; CLI cmd_run::run_from_config takes added capture_stderr / tee_stderr parameters. The documented Session::spawn(command, args) signature is unchanged (delegates to spawn_with).
  • DESIGN.md §16.7-16.10: dropped the (planned for v0.2) markers — the four fixtures now ship.

Deprecated

  • DEFAULT_LEAK_THRESHOLD_MB_PER_SEC → use DEFAULT_LATENCY_DRIFT_MS_PER_SEC. The old constant remains as an alias for one release and will be removed in v0.2.0.
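
One common way to keep an old constant alive for a single release is a `#[deprecated]` alias. The sketch below assumes an `f64` type and a placeholder value, neither of which is stated in this changelog; only the two constant names and the v0.2.0 removal target come from the entry above.

```rust
/// Placeholder value and type for illustration; the crate's real default is not stated here.
pub const DEFAULT_LATENCY_DRIFT_MS_PER_SEC: f64 = 1.0;

/// Old name kept as an alias for one release; using it produces a compiler warning.
#[deprecated(note = "use DEFAULT_LATENCY_DRIFT_MS_PER_SEC; scheduled for removal in v0.2.0")]
pub const DEFAULT_LEAK_THRESHOLD_MB_PER_SEC: f64 = DEFAULT_LATENCY_DRIFT_MS_PER_SEC;
```

Because the alias is defined in terms of the new constant, the two cannot drift apart during the deprecation window.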

Notes

  • ✅ The M8 file-split pass completed in the pre-publish review. All source files have production code (excluding #[cfg(test)] mod tests) under the 300-line convention. See POST_PUBLISH_ISSUES.md for the per-wave summary of what split where.
  • serve and tui modules will move behind cargo feature flags in a future release to keep the default build slim.
  • ✅ HTTP / SSE / WS transport host-allowlist for SSRF defense landed (was deferred): exact-match [server].allowed_hosts + always-on private/loopback/link-local/ULA/reserved IP-literal block, on top of the existing Policy::none() redirect policy. Supersedes the ADR 0007 deferral — see ADR 0012. Residual documented gap: hostname→private-IP (DNS rebinding) is not yet blocked in v0.1; resolver-pinning is the recorded v0.2 follow-up.
  • ✅ Pre-publish security pass on the B1/B2 surface: bounded protocol::schema recursion (MAX_SCHEMA_DEPTH, defends strict mode against a maliciously deep server inputSchema); non-positive regression thresholds rejected at the CLI/MCP boundary (would otherwise invert the gate); compare_runs no longer echoes the raw caller path; enum-violation messages are length-capped.
  • ✅ Pre-publish stabilization: crates/mcp-loadtest now excludes the four CLAUDE.md scaffolding files from the published package (108 → 104 files); added run_strict CLI integration test covering the real run --config entrypoint with [validation] strict = true end-to-end (TOML → Run::execute → report-on-disk → non-zero gate); full pipeline (fmt, clippy -D warnings, --locked build/test, doc) verified on the x86_64-pc-windows-msvc target — the toolchain crates.io ships to Windows users.
  • ✅ Supply-chain gate (cargo deny check / cargo audit): allowlisted CDLA-Permissive-2.0 (the webpki-roots CA-bundle license, via tokio-tungstenite) and added three individually-triaged, documented advisory ignores — RUSTSEC-2025-0052 (async-std, dev-only via httpmock), RUSTSEC-2024-0436 (paste, transitive), RUSTSEC-2026-0002 (lru unsoundness, transitive via ratatui, optional tui feature only, no semver fix). Rationale + revisit conditions in ADR 0011 + POST_PUBLISH_ISSUES.md; no actual vulnerabilities, gate still hard-fails on anything new.
  • ✅ Pre-publish API-durability & standards pass (so v0.1.0's public surface is safe to commit to):
    • #[non_exhaustive] on the Config family (Config/ServerConfig/ScenarioConfig/ThresholdsConfig/OutputConfig/ValidationConfig/ConfigError), the public error/outcome enums (SessionError, RunError, CallOutcome, ThresholdKind, ReportError, TransportError, HangOutcome, SchemaPolicy, ValidationSite) and RunContext — adding a field/variant is no longer a breaking change. New constructors keep ergonomic cross-crate construction: Config::new + with_thresholds/with_output/with_validation, ScenarioConfig::new, OutputConfig::new, RunContext::new.
    • Crate-root docs rewritten: dropped the stale "v0.0.0 — scaffolding" status and replaced the fictional-API example with a real, compiling no_run one; ValidationConfig added to the public re-export facade.
    • docs.rs builds --all-features (so serve/tui render); missing_docs raised to deny (lib + CLI); README repo links rewritten to absolute URLs so they resolve on crates.io.
    • MSRV corrected 1.86 → 1.88 (the real floor: edition-2024 let-chains; verified by an actual cargo build on 1.88) with a new CI msrv job pinning it; modernized the clippy nits it surfaced (collapsible_if, is_multiple_of).
    • Re-verified end-to-end on the x86_64-pc-windows-msvc target: fmt, clippy -D warnings, --locked build/test (incl. doctests), doc -D warnings, cargo deny, cargo audit, publish --dry-run (104 files).