| Run | Date | Total | Passed | Pass % | Duration | Notes |
|---|---|---|---|---|---|---|
| — | — | — | — | — | — | Full regression after tool routing (tags, recommend_tools, search_api, docstrings). No regressions — 6 failures, all known flaky. |
| 11 | 2026-03-20 | 171 | 164 | 95.9% | — | Full suite with ToolSearch + wiring recipes + enriched descriptions. 12/12 test_09 pass. 7 failures, all known flaky (replace_windows_L1 new — agent called search_api instead). |
| 12 | 2026-03-20 | 170 | 163 | 95.9% | — | Post description enrichment (all 142 tools ≥40 chars). Same 7 flaky failures. No regression. |
| 16 | — | — | — | — | 168 min | CodeMode A/B experiment (ON) — 71pp regression. 67 wrong_tool + 30 timeout + 1 no_mcp_tool. Feature kept as opt-in toggle, NOT default. See docs/knowledge/codemode-benchmark-2026-04-05.md. |

- Run 8 = combined results from two separate targeted runs (measure authoring 13/15 + cooled beam 10/10).
- Run 16 is an experimental outlier (CodeMode ON) and is excluded from the main pass-rate timeline in plots.
## Tool Verification Failures

Only cases where the expected tool wasn't called.

| Test | Expected Tool | Actually Called | Root Cause |
|---|---|---|---|
| import_floorplan_L1 | import_floorspacejs | (asks for file path) | No path in prompt — structurally vague |
| list_dynamic_type_L1 | list_model_objects | get_sizing_zone_properties | L1 "What sizing parameters?" → explicit tool |
| check_loads_L1 | get_load_details | get_space_details | "What loads?" too vague without direction |
| export_measure_L1 | export_measure | list_custom_measures | Can't discover export without explicit name |
| export_measure_L2 | export_measure | list_custom_measures + list_files | Moderate prompt still insufficient |
## Known Flaky Tests

| Test | Root Cause | Run 13 |
|---|---|---|
| import_floorplan_L1 | No file path in prompt — agent correctly asks for one | PASS |
| list_dynamic_type_L1 | L1 "sizing parameters" too vague, agent uses explicit sizing tools | PASS |
| check_loads_L1 | "What loads?" too vague, agent inspects space instead | PASS |
| thermostat_L1 | Intermittent — "change thermostat settings" needs direction | PASS |
| save_model_L1 | Intermittent | skipped |
| schedule_details_L1 | Intermittent | PASS |
| create_loads_L1 | Intermittent | PASS |
| set_wwr_L1 | Intermittent | PASS |
| ideal_air_L1 | Intermittent | PASS |
| add_hvac_L1 | Intermittent — stable since docstring fix | PASS |
| export_measure_L1/L2 | Tool not discoverable without explicit name | skipped |
| floorspacejs_to_typical | Multi-step workflow chain stalls after step 1 | PASS |
| run_simulation_L1 | Intermittent — "Run a simulation" too vague at L1 | FAIL |
| qaqc tier2 (3 cases) | Agent doesn't call run_qaqc_checks for validation prompts | FAIL |
| measure quality (3 cases) | New tests — measure code quality checks | FAIL |
## Key Lessons & Patterns

- **System prompt is the biggest lever** — adding anti-loop guidance took the pass rate from 44% → 83% in one run.
- **Tool descriptions drive L1 discovery** — adding "HVAC / heating and cooling" keywords to add_baseline_system fixed L1 discovery immediately (see the registration sketch after this list).
- **L1 failures are mostly structural** — vague prompts missing required info (file paths, direction); the correct agent behavior is to ask.
- **The L2 → L3 gap is rare** — once moderate context is given, agents find the tools. L3 is 100% across all 42 cases.
- **Progressive tests are the best diagnostic** — L1/L2/L3 cleanly separates tool-description gaps from tool-design gaps (see the test sketch after this list).
- **Multi-step workflows are fragile** — floorspacejs_to_typical consistently stalls; single-tool discovery is robust.
- **Retries help** — the default retries=2 catches transient failures; retries=1 is useful for a first-attempt signal.
- **Generic access pattern works** — inspect_component/modify_component pass at all levels, validating Phase C's dynamic property access.
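To make the tool-description lesson concrete, here is a minimal sketch of keyword-enriched registration. It assumes a FastMCP-style server; the server name, the signature, and the docstring text are illustrative, not the project's actual code.

```python
# Minimal sketch, assuming a FastMCP-style server. The server name, the
# signature, and the docstring text are illustrative, not the project's code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("openstudio-tools")

@mcp.tool()
def add_baseline_system(system_type: str) -> str:
    """Add a baseline HVAC / heating and cooling system to the model.

    Leading with the words a user would actually type ("HVAC", "heating
    and cooling") is what lets a vague L1 prompt resolve to this tool;
    the parameter schema alone is not enough.
    """
    return f"Added baseline system: {system_type}"  # placeholder body
```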
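And a sketch of the progressive L1/L2/L3 pattern with retries. The `agent` fixture, its `run`/`tools_called` API, and the prompt wording are hypothetical stand-ins for the real harness in tests/llm/; the retry line uses pytest-rerunfailures' `reruns` option as an analog of the harness's retries=2.

```python
# Sketch of a progressive L1/L2/L3 case. The `agent` fixture and its
# `run`/`tools_called` API are hypothetical stand-ins for the real
# harness in tests/llm/; the prompts are illustrative.
import pytest

PROMPTS = {
    "L1": "Change the thermostat settings",                      # vague: pure discovery
    "L2": "Set the heating setpoint on the office thermostat",   # moderate context
    "L3": "Use set_thermostat_setpoint to set heating to 20 C",  # explicit tool name
}

@pytest.mark.progressive
@pytest.mark.flaky(reruns=2)  # pytest-rerunfailures analog of the harness's retries=2
@pytest.mark.parametrize("level", ["L1", "L2", "L3"])
def test_thermostat_progressive(level, agent):
    transcript = agent.run(PROMPTS[level])
    # Pass criterion is tool verification: the expected tool was called,
    # regardless of how the agent worded its reply.
    assert "set_thermostat_setpoint" in transcript.tools_called
```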
## Running

```bash
# Full suite (~100 min)
LLM_TESTS_ENABLED=1 pytest tests/llm/ -v

# Quick smoke (~12 min, 12 tests)
LLM_TESTS_ENABLED=1 pytest tests/llm/ -m smoke -v

# Progressive only (~60 min, 102 tests)
LLM_TESTS_ENABLED=1 pytest tests/llm/ -m progressive -v

# Single case
LLM_TESTS_ENABLED=1 pytest tests/llm/test_06_progressive.py -k "thermostat_L1" -v
```
Reports are written to `LLM_TESTS_RUNS_DIR/benchmark.md` and `benchmark.json`. After a run, copy the report to `docs/testing/llm-test-benchmark.md`.
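For comparing two reports, a small sketch that diffs `benchmark.json` files; the top-level `results` mapping of test name to outcome is a guessed schema, not a documented format.

```python
# Sketch: diff two benchmark.json reports. The schema (a top-level
# "results" mapping of test name -> "pass"/"fail") is a guess, not a
# documented format.
import json
import sys

def load(path):
    with open(path) as f:
        return json.load(f)

old, new = load(sys.argv[1]), load(sys.argv[2])
for test, outcome in new.get("results", {}).items():
    before = old.get("results", {}).get(test)
    if before is not None and before != outcome:
        print(f"{test}: {before} -> {outcome}")  # regressions and recoveries
```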