Skip to content

v0.1.0 readiness: builder refactor + B1/B2 + full pre-publish hardening#1

Merged
Teerapat-Vatpitak merged 8 commits into
mainfrom
refactor/cmd-run-scenario-builder
May 17, 2026
Merged

v0.1.0 readiness: builder refactor + B1/B2 + full pre-publish hardening#1
Teerapat-Vatpitak merged 8 commits into
mainfrom
refactor/cmd-run-scenario-builder

Conversation

@Teerapat-Vatpitak
Copy link
Copy Markdown
Owner

What

v0.1.0 readiness: completes the in-flight scenario-builder refactor, closes the two differentiator gaps, then a full pre-publish security + stabilization + API-durability hardening pass. 8 commits (b6d7165b7e75a2).

Commit
b6d7165 refactor(cli) finish unified scenario builder; split cmd_run into builder/params/patterns (<300 prod LoC each)
ba0faae feat(compare) configurable regression thresholds — compare flags + compare_runs MCP args (ADR 0009)
10443aa feat(validation) opt-in strict MCP schema validation, dependency-free subset validator (ADR 0010)
c9089a5 docs advertise CI gating + protocol-aware assertions (README/DESIGN)
ad2e5c6 fix(security) bound schema recursion, reject non-positive thresholds, path/enum message hardening
95783e6 chore(release) exclude CLAUDE.md from package; run_strict e2e test (real run --config path)
fec989e chore(deny) allowlist CDLA-Permissive-2.0; triaged RUSTSEC ignores (ADR 0011)
b7e75a2 refactor(api) #[non_exhaustive] + constructors on the public surface; crate-root docs; docs.rs all-features; missing_docs=deny; MSRV 1.86→1.88 (verified)

Net public-API change is pre-release hardening only; Cargo.lock is identical to main (zero dependency changes across all 8 commits).

Why a PR (not a merge)

This PR exists to run the authoritative CI backstop — 3-OS (ubuntu/macOS/windows-latest) checks + cargo-audit + the new msrv (Rust 1.88) job — including the toolchains that couldn't be exercised locally. Not publishing; no tag; no crates.io.

Local verification (x86_64-pc-windows-msvc)

fmt · clippy -D warnings · build --locked · full test incl. doctests · doc -D warnings · cargo deny check · cargo audit (rc 0) · publish --dry-run -p mcp-loadtest (104 files) — all green. See CHANGELOG.md [Unreleased] for the full detail and ADRs 0009–0011 for the decisions.

Out of scope (v0.2)

Result-side schema validation, host-allowlist SSRF, distributed mode, revisiting the triaged RUSTSEC ignores — tracked in POST_PUBLISH_ISSUES.md.

cmd_run.rs grew past the 300 prod-LoC convention while the in-flight
refactor unified single-tool, weighted-pattern and legacy tool_call
configs through one build_scenario factory and added M5-M7 kind
dispatch.

Split into private builder/params/patterns submodules. The public
surface (run_from_config + the re-exported parse_dur_str shared with
deadlock-probe and cross) is unchanged and now guarded by a cmd_run
integration test. CHANGELOG and the custom-scenario doc are synced to
the new layout.

ci-checks green: fmt, clippy -D warnings, build --locked, 260+ tests,
doc -D warnings.
ADR 0009 deferred operator-configurable thresholds until a user asked;
this lands them. p99 / error-rate / deadlock-zero-tolerance were
hard-coded constants shared by the compare CLI and the compare_runs
MCP tool, so a user wanting a different CI budget had to fork.

Add analysis::regression::RegressionThresholds whose Default
reproduces the historical 10% / 0.5pp / zero-tolerance policy exactly,
so existing gates are unaffected unless they opt in. Surface overrides
where operators work: compare gains --max-p99-regression-pct /
--max-error-rate-regression-pp / --allow-deadlock-increase; the
compare_runs tool gains the matching optional args. CLI and MCP stay
in lockstep by construction (same struct).

ADR 0009 updated (open question resolved); CHANGELOG noted.
ci-checks green: fmt, clippy -D warnings, --locked tests, doc.
The protocol-aware-assertions differentiator wanted the tester to catch
when a tools/call's args don't match the server's advertised
inputSchema, but ADR 0005 deliberately chose lenient/forward-compat
parsing and the repo's supply-chain discipline rules out an
unauditable JSON Schema crate.

Add `[validation] strict = true` (default false → behaviour unchanged):
- protocol::schema, a dependency-free subset validator
  (type/properties/required/enum/items) that skips unmodeled keywords,
  so it never rejects on forward-compat grounds.
- classify_schema_violation: args violations Fail (gate as
  ProtocolError), result violations Warn (result-side deferred to
  v0.2), none Ignore.
- Session caches the tools/list registry the run already fetches (no
  extra round-trip, ADR 0006 hot path untouched when off) and rejects
  bad args before send via SessionError::SchemaViolation, mapped to
  CallOutcome::ProtocolError.

ADR 0010 added; CHANGELOG noted. New mock-schema.py fixture +
strict_validation.rs prove bad args gate and compliant args don't.
ci-checks green: fmt, clippy -D warnings, --locked tests, doc.
B1/B2 shipped real differentiators (configurable regression
thresholds, opt-in protocol-aware schema validation) but the README
and DESIGN matrix didn't surface them, so a reader couldn't tell the
tool is a CI regression gate, not just a profiler.

Add a "CI gating & protocol-aware assertions" section to the README
(threshold gating, configurable compare, strict mode, link to the
ci-integration cookbook) and two rows each to the README and
DESIGN.md §10.5 competitive matrices. Docs only — no code change;
doc -D warnings green.
Pre-publish security review of the configurable-thresholds and
strict-validation work surfaced one HIGH and three lower findings,
all in code added this branch:

- HIGH: protocol::schema::validate_at recursed with no depth bound. A
  malicious server-advertised inputSchema could stack-overflow the
  load test (an uncatchable abort) once strict mode is on. Cap at
  MAX_SCHEMA_DEPTH = 64; over-deep tails are skipped, not failed.
- MED: negative regression thresholds silently inverted the gate.
  Reject non-positive / non-finite values at both boundaries — a
  clap value_parser on the CLI flags and a finite-and-positive
  filter in compare_runs::regression_overrides.
- LOW: compare_runs canonicalize error echoed the raw caller path
  back to the MCP client; drop it.
- LOW: enum-violation messages rendered the full server enum array;
  cap the preview at 5 entries + a count.

Regression tests added for the depth cap and the threshold guard.
ci-checks green: fmt, clippy -D warnings, --locked tests, doc.
Pre-publish stabilization from a read-only audit:

- packaging: exclude the four CLAUDE.md scaffolding files from the
  published `mcp-loadtest` crate (108 -> 104 files; verified via
  `cargo package --list`). runs/ was already gitignored.
- test gap: add `crates/mcp-loadtest-cli/tests/run_strict.rs` — the
  existing strict_validation.rs drives scenario.drive() directly;
  this covers the real production entrypoint end-to-end:
  `cmd_run::run_from_config` -> Config::from_file (incl. [validation]
  TOML deserialize) -> build_scenario -> Run::execute (run.rs strict
  wiring) -> emit_reports (report.md/metrics.json on disk) ->
  non-zero gate. Asserts both the fail (bad args -> ProtocolError ->
  not passed) and pass (compliant args) paths.
- docs: tick AGENTS.md M7 `/security-review + /release-checks`;
  annotate POST_PUBLISH_ISSUES dry-run/smoke status. CHANGELOG noted.

Verified on **x86_64-pc-windows-msvc** (the target crates.io ships to
Windows users — previously only gnu was built locally): fmt, clippy
-D warnings, build --locked, full test suite (incl. run_strict /
strict_validation), doc -D warnings, publish --dry-run (lib, 104
files), release-binary smoke — all green.
`cargo deny check` (a release-checks must-pass gate) failed on the
pre-publish audit: webpki-roots' CDLA-Permissive-2.0 license wasn't
allowlisted, and three advisories tripped the default error. `cargo
audit` exits 0 — all three are unmaintained/unsound notices, not
scored vulnerabilities. Advisory checks hit the live RustSec DB, so
this surfaces regardless of the (main-identical) lockfile.

- licenses: allow CDLA-Permissive-2.0 (webpki-roots' Mozilla CA
  bundle, via tokio-tungstenite — permissive, publish-safe).
- advisories: documented per-ID ignore of RUSTSEC-2025-0052
  (async-std, dev-only via httpmock — never shipped), -2024-0436
  (paste, transitive proc-macro), -2026-0002 (lru IterMut
  unsoundness — transitive via ratatui 0.29, optional `tui` feature
  only, no semver-compatible fix). Nothing else ignored; the gate
  still hard-fails on any new or actual-vulnerability advisory.

Decision recorded in ADR 0011; revisit conditions tracked in
POST_PUBLISH_ISSUES.md; CHANGELOG noted. `cargo deny check` now
passes (advisories/bans/licenses/sources ok); `cargo audit` rc 0.
Pre-publish FULL hardening so v0.1.0's public API and crate page are
safe to commit to permanently (from a market + quality/standards
audit before `cargo publish`).

- API durability: `#[non_exhaustive]` on the Config family + public
  error/outcome enums + RunContext (adding a field/variant is no
  longer breaking — the exact pain hit internally when `validation`
  was added). Constructors keep cross-crate use ergonomic:
  Config::new + with_*, ScenarioConfig::new, OutputConfig::new,
  RunContext::new. cli + 12 integration-test helpers refactored to
  the constructors.
- Crate-root docs: dropped the stale "v0.0.0 scaffolding" status and
  the fictional-API example; replaced with a real compiling `no_run`
  example. ValidationConfig added to the re-export facade.
- Standards: docs.rs all-features metadata; missing_docs = deny
  (lib + cli; zero gaps); README repo links → absolute so they
  render on crates.io.
- MSRV corrected 1.86 -> 1.88 (the real floor: edition-2024
  let-chains; verified by an actual build on 1.88) + new CI `msrv`
  job; fixed the clippy nits it surfaced (collapsible_if,
  is_multiple_of).

Edition stays 2024 (an audit "downgrade to 2021" suggestion was
wrong — let-chains require it and dry-run already accepted it).

Re-verified on x86_64-pc-windows-msvc: fmt, clippy -D warnings,
--locked build/test (incl. doctests), doc -D warnings, cargo deny,
cargo audit, publish --dry-run (104 files). CHANGELOG noted.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7e75a2a73

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +62 to +67
/// Strict validation rejected this call before it was sent: the
/// arguments didn't match the tool's advertised `inputSchema` and the
/// policy classified it as [`SchemaPolicy::Fail`]. Only produced when
/// `[validation] strict = true`. Maps to `CallOutcome::ProtocolError`.
#[error("schema violation for tool `{tool}`: {summary}")]
SchemaViolation {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Route schema violations through shared classifier

When [validation] strict = true is used with scenarios that still bucket HangOutcome::Err/call errors directly as CallOutcome::ServerError (I checked deadlock_probe.rs and race_check.rs), this new SchemaViolation will be reported as a server error instead of the documented ProtocolError. The run may still fail on error rate, but the metrics/report category is wrong for strict-mode schema mismatches; route those handlers through scenario::classify_error or handle this variant explicitly.

Useful? React with 👍 / 👎.

@Teerapat-Vatpitak Teerapat-Vatpitak merged commit aaf0172 into main May 17, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant