openstudio-mcp — behavioral testing of an MCP server with ~142 tools, where a real LLM agent drives the tests end-to-end.
TL;DR — 160/167 tests passing (95.8%) in Run 13. Core methodology: each tool tested at three prompt specificity levels (L1 vague / L2 moderate / L3 explicit). Pass-rate gap between levels isolates tool-description problems from tool-design problems. System prompt is the single biggest lever (44% → 83% in one run).
Unit and integration tests verify that MCP tools work in isolation. They don't verify that an LLM agent, given a natural-language request, will discover and call the correct tool — the actual user experience.
Examples of failures only LLM tests catch:
- Agent writes raw IDF files to bypass MCP tools (guardrail regression)
- Agent loops on `list_files` forever instead of calling the right tool
- A tool exists but has a docstring so vague the agent never picks it
- A "correct but surprising" rename breaks discovery for every prompt that doesn't mention the new name
The LLM suite is the only gate that measures agent behavior end-to-end against a real Claude session hitting a real openstudio-mcp Docker container.
```
pytest (tests/llm/conftest.py)
 │
 ├─ pytest_runtest_protocol ─→ retry loop (up to LLM_TESTS_RETRIES)
 │
 └─ run_claude(prompt, ...)   (tests/llm/runner.py)
      │
      └─ subprocess: claude -p "<prompt>"
                       --output-format stream-json --verbose
                       --mcp-config <generated mcp.json>
                       --max-turns N --model sonnet
           │
           ├─ stdout ──── NDJSON stream ────→ _parse_stream_json()
           │                                       │
           │                                       └─→ ClaudeResult
           │                                           (tool_calls, tokens, cost,
           │                                            num_turns, final_text)
           │
           └─ MCP stdio → openstudio-mcp Docker container
                ├─ stdio_suppression wrapping
                ├─ 142 MCP tools
                └─ shared /runs volume (baseline models)
```
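A minimal sketch of that path, with the function and class names mirroring `runner.py` but the bodies simplified; the stream-json event fields used below are assumptions about the CLI output, not verified against the real parser:

```python
# Sketch only: simplified from tests/llm/runner.py; event field names follow
# the Claude Code stream-json output but should be treated as assumptions.
import json
import os
import subprocess
from dataclasses import dataclass, field


@dataclass
class ClaudeResult:
    tool_calls: list[str] = field(default_factory=list)  # every tool_use block, incl. builtins
    final_text: str = ""
    num_turns: int = 0

    @property
    def mcp_tool_calls(self) -> list[str]:
        # MCP tools arrive namespaced by the CLI; the exact prefix is an assumption here.
        return [t for t in self.tool_calls if t.startswith("mcp__")]


def run_claude(prompt: str, mcp_config: str, max_turns: int = 25) -> ClaudeResult:
    env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}  # nested claude -p fails otherwise
    proc = subprocess.run(
        ["claude", "-p", prompt,
         "--output-format", "stream-json", "--verbose",  # plain json drops tool_use blocks
         "--mcp-config", mcp_config,
         "--max-turns", str(max_turns), "--model", "sonnet"],
        capture_output=True, text=True, env=env,
    )
    result = ClaudeResult()
    for line in proc.stdout.splitlines():  # NDJSON: one event per line
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "assistant":
            for block in event.get("message", {}).get("content", []):
                if block.get("type") == "tool_use":
                    result.tool_calls.append(block["name"])
        elif event.get("type") == "result":
            result.final_text = event.get("result", "")
            result.num_turns = event.get("num_turns", 0)
    return result
```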
| Concern | Where | Detail |
|---|---|---|
| Subprocess spawn | `runner.py:181-239` `run_claude()` | Writes a temp `mcp.json`, spawns the CLI. Strips the `CLAUDECODE` env var (nested `claude -p` fails otherwise). |
| Output parsing | `runner.py:242-261` `_parse_stream_json()` | `--output-format stream-json --verbose` is mandatory — plain `json` drops `tool_use` blocks. |
| Tool-call extraction | `runner.py:61-106` `ClaudeResult` | Two views: `tool_calls` (all, incl. builtins like ToolSearch/Bash) and `mcp_tool_calls` (MCP-only). |
| Markers & auto-tagging | `conftest.py:42-53, 252-278` | `llm`, `tier1-4`, `stable`, `flaky`, `smoke`, `progressive`, `generic`. Auto-tagged via the `FLAKY_TESTS` frozenset. |
| Retry logic | `conftest.py:281-323` | Custom `pytest_runtest_protocol` hook. Each retry consumes one prompt from the budget. |
| Benchmark collection | `conftest.py:342-412, 434-692` | `pytest_runtest_logreport` stores per-test metrics. Session end writes `benchmark.json` / `benchmark.md` / `benchmark_history.json`. |
| Failure classification | `conftest.py:383-390` | `timeout` · `no_mcp_tool` · `wrong_tool` |
| Prompt budget | `conftest.py` `LLM_TESTS_MAX_PROMPTS` (default 180) | Hard cap prevents runaway cost during iteration. |
| Skill eval auto-discovery | `eval_parser.py:48-90` | Scrapes "Should trigger" / "Should NOT trigger" tables from `.claude/skills/*/eval.md`. |
| Var | Default | Purpose |
|---|---|---|
| `LLM_TESTS_ENABLED` | unset | Must be `1` to enable the suite |
| `LLM_TESTS_MODEL` | `sonnet` | `sonnet` / `haiku` / `opus` |
| `LLM_TESTS_RETRIES` | `0` | Retry count for non-determinism |
| `LLM_TESTS_MAX_PROMPTS` | `180` | Hard budget cap |
| `LLM_TESTS_TIER` | `all` | `1` / `2` / `3` / `4` / `all` |
| `LLM_TESTS_RUNS_DIR` | `/tmp/llm-test-runs` | Host path mounted as `/runs` in Docker |
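A sketch of how these variables could gate collection in `conftest.py`; the skip messages and the tier-marker naming are assumptions:

```python
# Sketch only: collection-time gating on the environment variables above.
import os
import pytest


def pytest_collection_modifyitems(config, items):
    enabled = os.environ.get("LLM_TESTS_ENABLED") == "1"
    tier = os.environ.get("LLM_TESTS_TIER", "all")
    skip_disabled = pytest.mark.skip(reason="set LLM_TESTS_ENABLED=1 to run the LLM suite")
    skip_tier = pytest.mark.skip(reason=f"LLM_TESTS_TIER={tier} excludes this test")
    for item in items:
        if "llm" not in item.keywords:
            continue  # unit/integration tests are unaffected
        if not enabled:
            item.add_marker(skip_disabled)
        elif tier != "all" and f"tier{tier}" not in item.keywords:
            item.add_marker(skip_tier)
```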
Ten test files, organized by what the agent is asked to do.
| File | Tier | ~Count | Purpose | Pass‑rate signal |
|---|---|---|---|---|
| `test_01_setup.py` | setup | 5 | Creates baseline/HVAC/example models in `/runs`. All other tests depend on these. Prompts use explicit tool names to minimize non-determinism. | Dependency gate |
| `test_02_tool_selection.py` | tier1 | 4 | Single-tool discovery, no model state (e.g., "What is the server status?"). Fastest tests. | Baseline discovery |
| `test_03_eval_cases.py` | tier3 | 26 | Auto-parsed from `.claude/skills/*/eval.md` "Should trigger" tables. Keeps tests DRY and co-located with skill definitions. | Skill discovery |
| `test_04_workflows.py` | tier2 | 19 | Multi-step chains (3-5 MCP calls): load → weather → HVAC → simulate → extract. | Multi-step composition |
| `test_05_guardrails.py` | tier4 | 3 | Regression gate: agent must NOT use Bash/Edit/Write to bypass MCP tools. | Safety/bypass |
| `test_06_progressive.py` | progressive | 110 | The core diagnostic. 34+ operations × 3 specificity levels. | Tool description quality |
| `test_07_fourpipe_e2e.py` | tier2 | 1 | Full retrofit on a 44-zone SystemD model using natural language (no tool names). Two simulations, 40+ turns, ~5 min. | Real-user session |
| `test_08_measure_authoring.py` | tier2 | 8 | Custom measure create/edit/test/export. Regression tests pulled from debug-session JSON exports. | Authoring workflows |
| `test_09_tool_routing.py` | tier4 | 4 | A/B baseline: all 139 tools vs. `recommend_tools` routing. Not in CI. | Tool-routing efficiency |
| `test_10_confusion_pairs.py` | tier4 | 8 | Prompts that could reasonably trigger either of two similar tools (`run_qaqc_checks` vs `validate_model`). | Disambiguation |
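Most of these files share one assertion pattern: run a prompt, then check `mcp_tool_calls` on the result. A sketch of a tier1 case, with the prompt taken from the table above and the expected tool name and fixture purely illustrative:

```python
# Sketch only: the shape of a tier1 tool-selection test; the expected tool
# name and the mcp_config fixture are illustrative, not the real test.
import pytest
from runner import run_claude  # tests/llm/runner.py in the real suite


@pytest.mark.llm
@pytest.mark.tier1
def test_server_status_discovery(mcp_config):
    result = run_claude("What is the server status?", mcp_config=mcp_config, max_turns=5)
    # The agent must call at least one MCP tool, and it must be the right one.
    assert result.mcp_tool_calls, "no MCP tool was called"
    assert any("server_status" in call for call in result.mcp_tool_calls)
```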
Each operation is tested with three prompts of increasing specificity:
| Level | Example (add HVAC) | What it measures |
|---|---|---|
| L1 — vague | "Add HVAC to the building" | Can the agent discover the tool from keyword scraps alone? → docstring keyword quality |
| L2 — moderate | "Add a VAV reheat system to all 10 zones" | With domain context, can the agent pick the right tool among near-neighbors? → tool discovery / ToolSearch |
| L3 — explicit | "Use add_baseline_system to add System 7 VAV reheat" | Given the exact tool name, does the tool work? → tool code / API correctness |
The gap between levels is the diagnostic:
- L1 fails, L2/L3 pass → docstring is missing keywords. Fast fix. (Example: adding "HVAC / heating and cooling" to `add_baseline_system` made L1 pass immediately in Run 3.)
- L2 fails, L3 passes → tool is hard to discover even with context. Fix ToolSearch indexing or the tool name.
- L3 fails → tool is broken. Fix the code.
This decomposition is why the progressive tier is the most useful part of the suite — it points at the cause, not just the symptom.
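A sketch of how one operation's three levels can be parameterized; the prompts are the add-HVAC examples from the table above, while the fixture and helper names are assumptions:

```python
# Sketch only: one operation, three specificity levels, same assertion target.
import pytest
from runner import run_claude

ADD_HVAC_PROMPTS = {
    "L1": "Add HVAC to the building",
    "L2": "Add a VAV reheat system to all 10 zones",
    "L3": "Use add_baseline_system to add System 7 VAV reheat",
}


@pytest.mark.llm
@pytest.mark.progressive
@pytest.mark.parametrize("level", ["L1", "L2", "L3"])
def test_add_hvac_progressive(level, mcp_config):
    result = run_claude(ADD_HVAC_PROMPTS[level], mcp_config=mcp_config, max_turns=15)
    called = [c for c in result.mcp_tool_calls if "add_baseline_system" in c]
    # An L3 failure means a broken tool; an L1-only failure means a docstring keyword gap.
    assert called, f"{level}: add_baseline_system was never called"
```

The same assertion runs at every level, so the pattern of failures across L1/L2/L3 identifies the failure class by itself.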
Every run_claude() call yields a ClaudeResult object. These fields are written to benchmark.json, aggregated into benchmark.md, and appended to benchmark_history.json.
Per test:
| Metric | Source | Meaning |
|---|---|---|
| `passed` | pytest outcome | Binary, after retries |
| `attempt` | retry hook | 1 = first try, 2+ = flaky |
| `duration_s` | wall clock | Includes Docker spawn + LLM inference |
| `num_turns` | CLI result | Conversation turns. High = looping. |
| `num_tool_calls` | NDJSON | Total MCP tools invoked |
| `tool_calls` | NDJSON | Ordered list — primary assertion target |
| `input_tokens` | CLI usage | Fresh tokens to the model |
| `output_tokens` | CLI usage | Tokens generated |
| `cache_read_tokens` | CLI usage | Served from the prompt cache (high = tool defs cached) |
| `cost_usd` | CLI result | Notional — free on Claude Max |
| `failure_mode` | `conftest.py:383-390` | `timeout` / `no_mcp_tool` / `wrong_tool` |
Aggregates: per-tier pass rate, per-L1/L2/L3 pass rate, token profile by tier, failed-test drill-down with tool sequences, run history (last 50 runs).
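A sketch of the aggregation step over those per-test records; the `tier` and `test_id` field names are assumptions chosen to match the table above:

```python
# Sketch only: per-tier and per-level pass rates from the collected records.
from collections import defaultdict


def aggregate(records: list[dict]) -> dict[str, float]:
    groups: dict[str, list[bool]] = defaultdict(list)
    for rec in records:
        groups[rec["tier"]].append(rec["passed"])
        for level in ("L1", "L2", "L3"):  # progressive tests carry the level in their id
            if rec["test_id"].endswith(level):
                groups[level].append(rec["passed"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items() if vals}


if __name__ == "__main__":
    demo = [
        {"tier": "tier1", "test_id": "test_server_status", "passed": True},
        {"tier": "progressive", "test_id": "test_add_hvac_L1", "passed": False},
        {"tier": "progressive", "test_id": "test_add_hvac_L3", "passed": True},
    ]
    print(aggregate(demo))  # {'tier1': 1.0, 'progressive': 0.5, 'L1': 0.0, 'L3': 1.0}
```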
Explicit gaps (things we don't measure yet):
- Parameter correctness — a test passes if the right tool is called, even with wrong arguments.
- First-attempt pass rate — retries mask flakiness. Only `attempt` captures it, not the aggregates.
- Time-to-first-tool — slow ToolSearch discovery isn't penalized.
- Cross-model comparison — all runs use one model. No GPT-4 / Gemini data to validate model-agnostic tool descriptions.
- Error recovery rate — when a tool returns `ok:False`, does the agent retry or give up?
| Run | Date | Tests | Passed | Rate | Key change |
|---|---|---|---|---|---|
| 1 | 03-05 | 50 | 22 | 44.0% | Baseline — no system prompt, wrong model path |
| 2 | 03-06 | 90 | 75 | 83.3% | +system prompt (anti-loop), model path fix, pre-check → +39pp |
| 3 | 03-07 | 90 | 82 | 91.1% | +tool description improvements → +8pp |
| 4 | 03-07 | 90 | 84 | 93.3% | Stability run (no code changes) |
| 5 | 03-10 | 107 | 103 | 96.3% | +generic access tests, cleanup |
| 6 | 03-11 | 159 | 153 | 96.2% | +progressive tier (L1/L2/L3), workflows, sim setup |
| 7 | 03-12 | 159 | 155 | 97.5% | Test consolidation (no tool changes) — high-water mark |
| 8 | 03-13 | 25 | 23 | 92.0% | Measure authoring + cooled beam (targeted runs) |
| 9a/b | 03-19 | 9 | 9 | 100% | Tool-routing A/B baseline (9 cases, neutral delta) |
| 10 | 03-19 | 172 | 166 | 96.5% | Full regression: tags, recommend_tools, search_api, docstrings — no regressions |
| 11 | 03-20 | 171 | 164 | 95.9% | +ToolSearch + wiring recipes + enriched descriptions. 7 flaky. |
| 12 | 03-20 | 170 | 163 | 95.9% | Description enrichment (all 142 tools ≥40 char). Same 7 flaky. |
| 13 | 03-26 | 230 | 160 | 95.8% | Post #40 fix + test audit. 63 skipped. 7 fail. Previously-flaky L1s all passing. |
The two big inflections are the system prompt (Run 1→2, +39pp) and progressive-tier introduction (Run 5→6, which massively expanded the test space without dropping pass rate). Everything since Run 10 sits in the 95.8-96.5% band — a regime where improvements are marginal and noise dominates.
- setup / tier1 / tier4: 100% — prerequisites, single-tool discovery, and guardrails are solid.
- progressive: 98% (108/110) — the biggest category and the most diagnostic.
- tier3 skill evals: 92% — 63 additional tests skipped due to test structure issues (these will reappear in future runs).
- tier2 workflows: 84% — lowest tier. The three failures are all `run_qaqc_checks` not being called for validation prompts, i.e. a confusion pair with `validate_model`. Multi-step chains are inherently more fragile than single-tool tests.
Left: aggregate pass rate across 42 progressive cases. L1 93% → L2 95% → L3 100%. The monotone climb is the expected signature of a healthy suite: explicit prompts always succeed, so L3 failures mean broken tools; vague prompts fail more, and the magnitude of the gap tells you how docstring-dependent discovery is.
Right: the only cases that don't pass all three levels. All others are 3/3.
| Case | Status | Root cause |
|---|---|---|
| import_floorplan | Now passing at all levels | Was flaky — no file path in vague prompt, agent correctly asks for one |
| list_dynamic_type | Now passing | "What sizing parameters?" was too vague; agent used explicit sizing tools |
| check_loads | Now passing | "What loads?" → agent inspected spaces instead of calling get_load_details |
| thermostat | Now passing | "Change thermostat settings" needs direction (up/down, by how much) |
| run_simulation | L1 FAIL (Run 13) | "Run a simulation" genuinely too vague — agent hesitates on a bare prompt |
| export_measure | L1 & L2 FAIL | Agent can't discover export_measure without the explicit name — durable description gap |
The export_measure case is the best example of a real bug the methodology catches: the tool works at L3 (so the implementation is fine), the docstring has keywords, but Claude still doesn't pick it over list_custom_measures + list_files. Fix is on the tool/description side, not the test.
Left panel (log scale): cache-read tokens dominate by 2-3 orders of magnitude. Each invocation loads ~27-50K tokens of tool definitions, and Claude's prompt cache serves them on subsequent tests. This is why a 172-test run only costs ~$12 of notional API pricing — the fresh-token footprint per test is tiny (10-30 in, 400-2800 out).
Right panel: cost and turn count per tier. Single-tool tests ≈ 3 turns, $0.06. The cooled-beam comparison workflow is a 22-turn outlier because it runs two full simulations and recovers from sim errors mid-session — it's the only test that costs >$0.10 per run.
Left: the 7 Run-13 failures fit three buckets.
| Mode | Count | Cases |
|---|---|---|
| `no_mcp_tool` — agent didn't call any MCP tool | 3 | qaqc tier2 (agent used `validate_model` instead of `run_qaqc_checks`) |
| `wrong_tool` — MCP tool called but not the expected one | 1 | run_simulation_L1 (intermittent) |
| Measure-quality assertions (new tests) | 3 | measure authoring syntax/structure checks |
The qaqc cluster is the most interesting: both tools legitimately "check the model", and validate_model is a defensible answer. This is a confusion pair that needs docstring disambiguation, not a bug.
Right: absolute pass/fail counts by run. Run 1's 28 failures stand out; runs 5-13 are in a stable <10-failure regime despite the test count roughly quadrupling.
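What docstring disambiguation for that pair could look like, assuming a FastMCP-style server object; the wording is illustrative, not the shipped description:

```python
# Sketch only: assumes a FastMCP-style server; docstring wording is illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("openstudio-mcp")


@mcp.tool()
def run_qaqc_checks(model_path: str) -> dict:
    """Run QAQC (quality assurance / quality control) checks on simulation results.

    Use this to review a *simulated* model's outputs for plausibility.
    To check that the model file itself is valid and well-formed, use
    validate_model instead.
    """
    ...
```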
- System prompts are the biggest lever. Adding anti-loop guidance to the `server.py` `instructions` was a single change that took the pass rate from 44% → 83%. Before touching individual tool docstrings, audit the server-wide prompt.
- Docstring keywords >> docstring prose. `add_baseline_system` L1 was failing until we added "HVAC / heating and cooling" to its docstring. A verbose paragraph doesn't help; a single matched keyword does. All 142 tools now have an enforced minimum description length of 40 characters.
- Progressive testing is the best diagnostic tool. L1/L2/L3 separates three failure classes (description, discovery, code) that a binary pass/fail obscures completely. Every tool should have at least one progressive case.
- L1 failures are often structural, not fixable. "What loads?" is genuinely ambiguous — a good agent asks for clarification. Don't bend a tool description to pass a vague prompt if the agent's alternative behavior is reasonable.
- Multi-step workflows are fragile. Tier 2 is consistently the lowest. ToolSearch + measure execution eats turns; one stall mid-chain fails the whole test. Keep `max_turns` generous (25+ for 3-tool chains, 40+ for e2e).
- Retries mask flakiness. The default `LLM_TESTS_RETRIES=0` gives you the honest first-attempt signal. Only add retries when you need CI-like confidence — and track the `attempt` field to see which tests are actually brittle.
- Flaky tests need a promotion path. The `FLAKY_TESTS` frozenset is the quarantine; see the sketch after this list. Pattern-match by substring, and remove patterns once a test stabilizes across 3+ runs. Don't let the list grow indefinitely.
- Description guidance alone doesn't fix L1 failures. See `benchmark-description-guidance.md` — ~35 tools got disambiguation/when-to-use/emphasis edits and the L1 pass rate did not move. The remaining failures were structural.
- NDJSON logs per test are indispensable. When a test fails, the `.ndjson` log shows the exact tool calls, arguments, error responses, and where the agent got stuck. Clearing them per run keeps disk usage sane.
- Stable/flaky classification beats "just run more tests". Iterating on `-m flaky` (~18 tests, ~10 min) is the right inner loop; the full suite is reserved for final validation.
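A sketch of the quarantine and auto-tagging mechanism referenced in the flaky-promotion lesson above; the entries shown are illustrative, the real frozenset lives in `tests/llm/conftest.py`:

```python
# Sketch only: substring-based quarantine with auto-tagging; entries are illustrative.
import pytest

FLAKY_TESTS = frozenset({
    "import_floorplan_L1",
    "run_simulation_L1",
})


def pytest_collection_modifyitems(config, items):
    for item in items:
        if "llm" not in item.keywords:
            continue
        if any(pattern in item.nodeid for pattern in FLAKY_TESTS):
            item.add_marker(pytest.mark.flaky)   # inner loop: pytest -m flaky
        else:
            item.add_marker(pytest.mark.stable)  # promoted once stable across 3+ runs
```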
```bash
# Full suite (~100-150 min)
LLM_TESTS_ENABLED=1 pytest tests/llm/ -v

# Smoke subset (~12 tests, ~10 min)
LLM_TESTS_ENABLED=1 pytest tests/llm/ -m smoke -v

# Progressive tier only (~60 min)
LLM_TESTS_ENABLED=1 pytest tests/llm/ -m progressive -v

# Iterate on flaky tests (~10 min)
LLM_TESTS_ENABLED=1 pytest tests/llm/ -m flaky -v

# Single case
LLM_TESTS_ENABLED=1 pytest tests/llm/test_06_progressive.py -k thermostat_L1 -v
```

Reports land in `$LLM_TESTS_RUNS_DIR/benchmark.md` / `benchmark.json`. After each run, copy results into `llm-test-benchmark.md` to check into version control.
- `llm-test-benchmark.md` — raw benchmark data, per-tool matrix, run history
- `frameworks-summary.md` — unit/integration/LLM side-by-side, strengths & gaps
- `benchmark-description-guidance.md` — negative-result experiment: description edits that didn't move the needle
- `testing.md` — general testing guide (unit + integration + CI)
- `plots/generate_plots.py` — reproduce every chart in this doc (`python docs/testing/plots/generate_plots.py`)




