## Problem
MCP tool calling is the primary interface between GAIA agents and external services. There is no automated way to validate that MCP tool calls succeed reliably across different models, prompt configurations, and tool complexities. When running on smaller models (4B on iGPU), tool selection accuracy and parameter formatting degrade significantly — but there is no systematic measurement of this.
## Proposed Solution
Build a reliability test framework for MCP tool calling that measures accuracy and robustness across configurations.
### 1. Test Suite Definition
Define a graduated test suite of 10–100 MCP tool call scenarios, categorized by complexity:
| Complexity | Description |
| --- | --- |
| Simple | Single tool call, no parameters or trivial ones |
| Moderate | Single tool call with structured parameters |
| Complex | Multi-step tool chains or conditional tool selection |
Each scenario specifies: input prompt, expected tool name, expected parameters, and expected output pattern. Prompts must reflect realistic end-user interaction patterns — not minimal or synthetic test strings.
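A minimal sketch of how a scenario could be encoded (the field names are illustrative, not an existing GAIA API):

```python
from dataclasses import dataclass

@dataclass
class ToolCallScenario:
    """One MCP tool-call test case. All field names here are illustrative."""
    scenario_id: str
    tier: str                  # "simple" | "moderate" | "complex"
    prompt: str                # realistic end-user phrasing
    expected_tools: list[str]  # ordered tool names; more than one entry for chains
    expected_params: dict      # per-tool parameter names/values to match
    output_pattern: str        # regex the final response must satisfy
```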
Example seed scenarios (illustrative, not exhaustive):
| Tier | User Prompt | Expected Tool | Notes |
| --- | --- | --- | --- |
| Simple | "List files on my desktop" | `filesystem.list_directory` | No parameters required beyond default path |
| Simple | "What MCP servers am I connected to?" | `gaia.list_mcp_servers` | Introspection call, no params |
| Moderate | "Find any documents mentioning Q1 budget" | `filesystem.search_files` | Requires `query` and `path` params |
| Moderate | "Open the GAIA documentation site" | `browser.navigate` | Requires valid URL parameter |
| Complex | "Find the largest file in Downloads and tell me what's in it" | `filesystem.list_directory` → `filesystem.read_file` | Two-step chain; second call depends on first result |
| Complex | "Search the web for the latest GAIA release and save a summary to my notes" | `web_search` → `filesystem.write_file` | Conditional: must extract content before writing |
The full test suite expands these seed scenarios to at least 20, covering common GAIA-supported MCP tools across all three tiers.
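For the complex tier, the same structure can express a chain. A hypothetical encoding of the Downloads scenario from the table above (the `<from-step-1>` placeholder is an assumption about how dependent parameters might be deferred):

```python
largest_download = ToolCallScenario(
    scenario_id="complex-001",
    tier="complex",
    prompt="Find the largest file in Downloads and tell me what's in it",
    expected_tools=[
        "filesystem.list_directory",  # step 1: enumerate Downloads
        "filesystem.read_file",       # step 2: read the largest entry
    ],
    # The read path depends on step 1's result, so only its presence is
    # asserted statically; value checks would happen at scoring time.
    expected_params={"filesystem.read_file": {"path": "<from-step-1>"}},
    output_pattern=r"\S",  # any non-empty description of the contents
)
```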
### 2. Execution Engine
- Run each scenario N times (default: 10) to measure consistency
- Record per-run: tool selected, parameters generated, success/failure, latency
- Support targeting specific model + prompt configurations (e.g., 4B on iGPU vs 30B on dGPU)
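A minimal sketch of this loop, assuming a `call_agent` entry point that returns the captured tool call and a `params_match` comparison helper (both are placeholders, not the actual GAIA API):

```python
import time
from dataclasses import dataclass

@dataclass
class RunRecord:
    scenario_id: str
    tool_selected: str | None
    params: dict
    passed: bool
    failure_mode: str | None  # "wrong_tool" | "wrong_params" | "format_error" | "timeout"
    latency_s: float

def run_scenario(scenario, call_agent, params_match, n_runs=10):
    """Execute one scenario n_runs times, recording every outcome."""
    records = []
    for _ in range(n_runs):
        start = time.perf_counter()
        try:
            call = call_agent(scenario.prompt)  # placeholder agent entry point
            latency = time.perf_counter() - start
            if call.tool_name not in scenario.expected_tools:
                passed, mode = False, "wrong_tool"
            elif not params_match(call.params,
                                  scenario.expected_params.get(call.tool_name, {})):
                passed, mode = False, "wrong_params"
            else:
                passed, mode = True, None
            records.append(RunRecord(scenario.scenario_id, call.tool_name,
                                     call.params, passed, mode, latency))
        except TimeoutError:
            records.append(RunRecord(scenario.scenario_id, None, {}, False,
                                     "timeout", time.perf_counter() - start))
    return records
```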
### 3. Scorecard Output
- Per-tool pass rate (target: 90–100%)
- Per-complexity-tier pass rate
- Latency distribution (p50, p95, p99)
- Failure mode classification: wrong tool, wrong parameters, format error, timeout
- Model × prompt × tool matrix with pass rates
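Aggregating the collected `RunRecord`s into these metrics could be a straightforward reduction (a sketch; `statistics.quantiles` with `n=100` yields the 99 cut points indexed below):

```python
import statistics
from collections import Counter

def scorecard(records):
    """Summarize run records into pass rate, latency percentiles, and failure modes."""
    total = len(records)
    latencies = sorted(r.latency_s for r in records)
    # n=100 returns 99 cut points; indices 49/94/98 give p50/p95/p99
    q = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "pass_rate": sum(r.passed for r in records) / total,
        "latency_p50": q[49],
        "latency_p95": q[94],
        "latency_p99": q[98],
        "failure_modes": dict(Counter(r.failure_mode
                                      for r in records if not r.passed)),
    }
```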
### 4. Integration with Agent Eval
Expose the suite as a new scenario type under `src/gaia/eval/` so MCP reliability runs plug into the existing agent evaluation flow and share its reporting.
## Files Likely Affected
- `src/gaia/eval/` — new scenario type for MCP reliability
- `src/gaia/mcp/` — test harness for tool call capture and replay
- `tests/` — unit tests for the framework itself
## Acceptance Criteria