
MCP tool calling reliability test framework #709

@itomek

Description

Problem

MCP tool calling is the primary interface between GAIA agents and external services. There is no automated way to validate that MCP tool calls succeed reliably across different models, prompt configurations, and tool complexities. When running on smaller models (4B on iGPU), tool selection accuracy and parameter formatting degrade significantly — but there is no systematic measurement of this.

Proposed Solution

Build a reliability test framework for MCP tool calling that measures accuracy and robustness across configurations.

1. Test Suite Definition

Define a graduated test suite of 10–100 MCP tool call scenarios, categorized by complexity:

Complexity | Description
Simple     | Single tool call, no parameters or trivial ones
Moderate   | Single tool call with structured parameters
Complex    | Multi-step tool chains or conditional tool selection

Each scenario specifies: input prompt, expected tool name, expected parameters, and expected output pattern. Prompts must reflect realistic end-user interaction patterns — not minimal or synthetic test strings.
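
One way to capture this is a small declarative record per scenario. A minimal sketch, assuming a new scenario type under src/gaia/eval/ (the class and field names below are illustrative, not an existing GAIA API):

```python
from dataclasses import dataclass, field


@dataclass
class MCPScenario:
    """One MCP tool-calling reliability scenario (hypothetical schema)."""
    id: str                            # stable identifier, e.g. "simple-list-desktop"
    tier: str                          # "simple" | "moderate" | "complex"
    prompt: str                        # realistic end-user prompt
    expected_tools: list[str]          # ordered tool names, one per expected call
    expected_params: dict = field(default_factory=dict)  # required parameter keys/values
    expected_output_pattern: str = ""  # regex the final answer should match


# Example instance mirroring the first seed scenario below.
LIST_DESKTOP = MCPScenario(
    id="simple-list-desktop",
    tier="simple",
    prompt="List files on my desktop",
    expected_tools=["filesystem.list_directory"],
)
```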

Example seed scenarios (illustrative, not exhaustive):

Tier     | User Prompt                                                                 | Expected Tool(s)                                  | Notes
Simple   | "List files on my desktop"                                                  | filesystem.list_directory                         | No parameters required beyond default path
Simple   | "What MCP servers am I connected to?"                                       | gaia.list_mcp_servers                             | Introspection call, no params
Moderate | "Find any documents mentioning Q1 budget"                                   | filesystem.search_files                           | Requires query and path params
Moderate | "Open the GAIA documentation site"                                          | browser.navigate                                  | Requires valid URL parameter
Complex  | "Find the largest file in Downloads and tell me what's in it"               | filesystem.list_directory → filesystem.read_file  | Two-step chain; second call depends on first result
Complex  | "Search the web for the latest GAIA release and save a summary to my notes" | web_search → filesystem.write_file                | Conditional: must extract content before writing

The full test suite expands these seed scenarios to at least 20, covering common GAIA-supported MCP tools across all three tiers.

2. Execution Engine

  • Run each scenario N times (default: 10) to measure consistency; a minimal execution-loop sketch follows this list
  • Record per-run: tool selected, parameters generated, success/failure, latency
  • Support targeting specific model + prompt configurations (e.g., 4B on iGPU vs 30B on dGPU)
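
A minimal sketch of this loop, assuming a hypothetical run_agent(prompt, config) entry point that returns the tool calls the agent actually made (none of these names are existing GAIA APIs):

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Outcome of a single run of one scenario on one model/prompt configuration."""
    scenario_id: str
    iteration: int
    tool_called: Optional[str]
    params: dict
    success: bool
    latency_s: float


def run_scenario(scenario, config, run_agent, n: int = 10) -> list[RunRecord]:
    """Execute one scenario n times and capture tool choice, params, success, latency."""
    records = []
    for i in range(n):
        start = time.perf_counter()
        result = run_agent(scenario.prompt, config)  # hypothetical agent entry point
        latency = time.perf_counter() - start
        calls = result.tool_calls                    # assumed: list of calls with .name / .arguments
        first = calls[0] if calls else None
        records.append(RunRecord(
            scenario_id=scenario.id,
            iteration=i,
            tool_called=first.name if first else None,
            params=first.arguments if first else {},
            success=[c.name for c in calls] == scenario.expected_tools,
            latency_s=latency,
        ))
    return records
```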

3. Scorecard Output

  • Per-tool pass rate (target: 90–100%)
  • Per-complexity-tier pass rate
  • Latency distribution (p50, p95, p99)
  • Failure mode classification: wrong tool, wrong parameters, format error, timeout (a scoring and classification sketch follows this list)
  • Model × prompt × tool matrix with pass rates
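
A rough sketch of scoring over the RunRecord list from the execution engine above; the failure-mode heuristics here are illustrative, not a fixed taxonomy:

```python
def classify_failure(record, scenario):
    """Bucket a failed run into one of the failure modes (heuristic, illustrative only)."""
    if record.success:
        return None
    if record.tool_called is None:
        return "timeout"
    if record.tool_called not in scenario.expected_tools:
        return "wrong_tool"
    if set(scenario.expected_params) - set(record.params):
        return "wrong_parameters"
    return "format_error"


def scorecard(records):
    """Aggregate pass rate and nearest-rank latency percentiles for one scenario."""
    latencies = sorted(r.latency_s for r in records)

    def pct(p):
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]

    return {
        "pass_rate": sum(r.success for r in records) / len(records),
        "p50_s": pct(0.50),
        "p95_s": pct(0.95),
        "p99_s": pct(0.99),
    }
```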

4. Integration with Agent Eval

Hook the reliability scenarios into the agent eval framework being built in #573 (Agent Eval: Replace existing eval framework) so MCP reliability results are captured and reported through the same pipeline as other agent evaluations.

Files Likely Affected

  • src/gaia/eval/ — new scenario type for MCP reliability
  • src/gaia/mcp/ — test harness for tool call capture and replay
  • tests/ — unit tests for the framework itself

Acceptance Criteria

  • Test suite with at least 20 graduated scenarios across all three complexity tiers
  • Seed scenarios from this issue are included as a baseline
  • Execution engine runs N iterations per scenario with full result capture
  • Scorecard reports per-tool and per-tier pass rates
  • Latency distribution reported per scenario
  • Failure modes classified automatically
  • Works with 4B model on iGPU configuration
  • Integrates with agent eval framework (Agent Eval: Replace existing eval framework #573)
  • Produces a configurable go/no-go readiness signal (e.g., "all Moderate-tier scenarios pass at ≥90% on target hardware"); a configuration sketch follows this list
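
The go/no-go signal could be a small configurable gate over the per-tier pass rates, for example (the thresholds, keys, and hardware label below are hypothetical):

```python
# Hypothetical gate: per-tier pass-rate thresholds on the target hardware profile.
READINESS_GATE = {
    "hardware": "4B-iGPU",
    "min_pass_rate": {"simple": 1.0, "moderate": 0.9, "complex": 0.7},
}


def is_ready(tier_pass_rates: dict, gate: dict = READINESS_GATE) -> bool:
    """Return True only if every tier meets its configured pass-rate threshold."""
    return all(
        tier_pass_rates.get(tier, 0.0) >= threshold
        for tier, threshold in gate["min_pass_rate"].items()
    )
```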
