
chore(testing): make ContextForge benchmarks reproducible and CI-friendly #3680

Open
lucarlig wants to merge 15 commits into main from chore/build-prod-image-with-profilers

@lucarlig lucarlig commented Mar 14, 2026

Summary

This PR adds a Rust-native, scenario-driven benchmark suite for ContextForge and wires it into the repo as a committed testing artifact instead of an ad hoc collection of scripts and flags.

The new benchmark flow centers on:

  • committed TOML scenarios under crates/contextforge_benchmark_runner/assets/scenarios/
  • a Rust benchmark runner for validate/run/report/compare workflows
  • an interactive TUI launcher exposed via make benchmark
  • a Goose-based load driver that can hit real REST and MCP JSON-RPC paths
  • a benchmark container image with optional Rust plugin wheels and profiling tools

The result is a benchmark setup that is easier to rerun, compare, extend, and eventually automate in CI.


Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

What Changed

1. Added Rust benchmark crates in the workspace layout

This PR introduces three new direct workspace crates under crates/:

  • crates/contextforge_benchmark_runner for scenario loading, stack orchestration, run execution, report regeneration, and comparison output
  • crates/contextforge_benchmark_console for the interactive TUI launcher and scenario template generator
  • crates/contextforge_goose for the Goose load driver used by benchmark scenarios

2. Added committed TOML benchmark scenarios

This PR checks in a suite of benchmark scenarios under crates/contextforge_benchmark_runner/assets/scenarios/, including coverage for:

  • A2A invoke flows
  • admin plugin inventory
  • REST discovery
  • MCP protocol and MCP runtime paths
  • MCP prompts, resources, and tools
  • rate limiter behavior and scaling
  • secret detection
  • spin detector and delay-oriented cases
  • other runtime comparison and smoke-oriented suites

These scenario files now act as the benchmark contract for runtime, load, execution, profiling, and comparison settings.

3. Added real MCP payload fixtures

This PR adds payload fixtures under crates/contextforge_benchmark_runner/assets/payloads/ for benchmark traffic that exercises:

  • tools/list
  • tools/call
  • resources/list
  • resources/read
  • prompts/list
  • prompts/get

That lets the load driver benchmark real MCP JSON-RPC prompt/resource/tool paths instead of only health or admin endpoints.

4. Added a benchmark-specific container image and entrypoints

This PR adds crates/contextforge_benchmark_runner/assets/Containerfile plus supporting entrypoint scripts so benchmark runs can build and launch a dedicated image that supports:

  • optional Rust plugin wheel builds via maturin
  • optional profiling tools
  • multiple HTTP server modes (gunicorn, granian, uvicorn)
  • the benchmark assets needed by the runner and load driver

5. Added docs, Make target, and workspace-alignment follow-up

This PR adds benchmark documentation in docs/docs/testing/benchmark-suite.md and exposes the TUI launcher through:

  • make benchmark

It also rebases the work onto current main and aligns the benchmark code with the repository’s current crates/* Rust workspace layout instead of the older tools_rust/ placement.


Why This Matters

Before this change, benchmark configuration was spread across scripts, flags, and implicit defaults. This PR moves the suite toward committed, repeatable scenarios with explicit runtime/load settings and reusable tooling around them.

That makes it easier to:

  • rerun a benchmark consistently
  • compare runs across branches or images
  • extend the suite by authoring scenario files instead of bespoke scripts
  • keep new Rust benchmark code in the same workspace model used by current main
  • capture richer benchmark/report artifacts for future regression tracking

Usage / Verification Commands

Relevant commands introduced or documented by this PR:

make benchmark
cargo run --manifest-path crates/contextforge_benchmark_runner/Cargo.toml -- validate --scenario rust-mcp-runtime-300
cargo run --manifest-path crates/contextforge_benchmark_runner/Cargo.toml -- run --scenario a2a-invoke-300 --smoke
cargo run --manifest-path crates/contextforge_benchmark_runner/Cargo.toml -- run --scenario rust-mcp-runtime-300
cargo run --manifest-path crates/contextforge_benchmark_runner/Cargo.toml -- regenerate-report --run-dir reports/benchmarks/<run-dir>
cargo run --manifest-path crates/contextforge_benchmark_runner/Cargo.toml -- compare-run --run-dir reports/benchmarks/<run-dir>
cargo test -p contextforge_benchmark_runner -p contextforge_benchmark_console -p contextforge_goose
uv run pytest tests/unit/test_rust_workspace_layout.py
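
As a rough sketch of how these commands might be chained in a CI step, the Python helper below builds the cargo invocations for a validate-then-smoke sequence. It uses only the subcommands and flags listed above; the ordering, the helper names, and the choice to validate the same scenario that is then smoke-run are assumptions rather than something this PR prescribes.

```python
import subprocess

MANIFEST = "crates/contextforge_benchmark_runner/Cargo.toml"


def runner_command(subcommand: str, *args: str) -> list[str]:
    """Build the argv for one benchmark-runner invocation via cargo."""
    return ["cargo", "run", "--manifest-path", MANIFEST, "--", subcommand, *args]


def smoke_pipeline(scenario: str) -> list[list[str]]:
    """Validate a scenario, then run it in smoke mode (assumed ordering)."""
    return [
        runner_command("validate", "--scenario", scenario),
        runner_command("run", "--scenario", scenario, "--smoke"),
    ]


if __name__ == "__main__":
    for argv in smoke_pipeline("a2a-invoke-300"):
        # check=True raises CalledProcessError, stopping at the first failing step.
        subprocess.run(argv, check=True)
```

Passing argv lists instead of shell strings avoids quoting issues if a scenario name ever contains unusual characters.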

Notes

  • This PR is broader than a TOML-config cleanup. It adds the runner, launcher, load driver, benchmark image, scenarios, payload fixtures, docs, and the repo entrypoint for using them.
  • Several scenarios compare benchmark images with and without optional Rust plugin artifacts. Those suites do not claim that the underlying product runtime has moved wholesale to Rust; they measure only the benchmarked paths described in each scenario.
  • Benchmark reports are written under reports/benchmarks/<profile>_<timestamp>/.

Refs #2473

@lucarlig lucarlig force-pushed the chore/build-prod-image-with-profilers branch 3 times, most recently from 40ff7d6 to 4849f2c, March 16, 2026 09:14
@lucarlig lucarlig added the triage, experimental, and chore labels Mar 16, 2026
@lucarlig lucarlig force-pushed the chore/build-prod-image-with-profilers branch 4 times, most recently from ba07cf3 to 21ac88a, March 16, 2026 09:31
@lucarlig lucarlig changed the title from "Chore/build prod image with profilers" to "Feature/deterministic benchamrks" Mar 16, 2026
@lucarlig lucarlig changed the title from "Feature/deterministic benchamrks" to "Feature: Make ContextForge benchmarks reproducible and CI-friendly" Mar 16, 2026
@lucarlig lucarlig force-pushed the chore/build-prod-image-with-profilers branch from 21ac88a to 8dbd4b8, March 16, 2026 11:08
@crivetimihai crivetimihai added this to the Release 1.2.0 milestone Mar 20, 2026
@crivetimihai crivetimihai added the COULD label (P3: Nice-to-have features with minimal impact if left out; included if time permits) Mar 20, 2026
@crivetimihai crivetimihai changed the title from "Feature: Make ContextForge benchmarks reproducible and CI-friendly" to "chore(testing): make ContextForge benchmarks reproducible and CI-friendly" Mar 20, 2026
@lucarlig lucarlig force-pushed the chore/build-prod-image-with-profilers branch 2 times, most recently from de19605 to f0527ef, March 26, 2026 10:45
@lucarlig lucarlig force-pushed the chore/build-prod-image-with-profilers branch 2 times, most recently from d82289c to 61bb02f, April 8, 2026 11:01
@lucarlig lucarlig marked this pull request as ready for review April 8, 2026 14:30
dima-zakharov previously approved these changes Apr 13, 2026

@dima-zakharov dima-zakharov left a comment

  • The code is well-structured and modular
  • Documentation is comprehensive
  • Test coverage is extensive

Signed-off-by: lucarlig <luca.carlig@ibm.com>
@lucarlig lucarlig marked this pull request as draft April 15, 2026 15:23
@lucarlig lucarlig marked this pull request as ready for review April 15, 2026 15:23

Labels

  • chore (Linting, formatting, dependency hygiene, or project maintenance chores)
  • COULD (P3: Nice-to-have features with minimal impact if left out; included if time permits)
  • experimental (Experimental features, test proposed MCP Specification changes)
  • triage (Issues / Features awaiting triage)


4 participants