
MCP tool calling reliability test framework #709

@itomek

Description

Problem

MCP tool calling is the primary interface between GAIA agents and external services. There is no automated way to validate that MCP tool calls succeed reliably across different models, prompt configurations, and tool complexities. When running on smaller models (4B on iGPU), tool selection accuracy and parameter formatting degrade significantly — but there is no systematic measurement of this.

Proposed Solution

Build a reliability test framework for MCP tool calling that measures accuracy and robustness across configurations.

1. Test Suite Definition

Define a graduated test suite of 10–100 MCP tool call scenarios, categorized by complexity:

Complexity | Description
Simple     | Single tool call, no parameters or trivial ones
Moderate   | Single tool call with structured parameters
Complex    | Multi-step tool chains or conditional tool selection

Each scenario specifies: input prompt, expected tool name, expected parameters, and expected output pattern. Prompts must reflect realistic end-user interaction patterns — not minimal or synthetic test strings.
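
One way to capture this is a small declarative record per scenario. A minimal sketch, assuming a new scenario type under src/gaia/eval/ (the class and field names below are illustrative, not an existing GAIA API):

```python
from dataclasses import dataclass, field


@dataclass
class MCPScenario:
    """One MCP tool-calling reliability scenario (hypothetical schema)."""
    id: str                            # stable identifier, e.g. "simple-list-desktop"
    tier: str                          # "simple" | "moderate" | "complex"
    prompt: str                        # realistic end-user prompt
    expected_tools: list[str]          # ordered tool names, one per expected call
    expected_params: dict = field(default_factory=dict)  # required parameter keys/values
    expected_output_pattern: str = ""  # regex the final answer should match


# Example instance mirroring the first seed scenario below.
LIST_DESKTOP = MCPScenario(
    id="simple-list-desktop",
    tier="simple",
    prompt="List files on my desktop",
    expected_tools=["filesystem.list_directory"],
)
```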

Example seed scenarios (illustrative, not exhaustive):

Tier     | User Prompt                                                                 | Expected Tool(s)                                  | Notes
Simple   | "List files on my desktop"                                                  | filesystem.list_directory                         | No parameters required beyond default path
Simple   | "What MCP servers am I connected to?"                                       | gaia.list_mcp_servers                             | Introspection call, no params
Moderate | "Find any documents mentioning Q1 budget"                                   | filesystem.search_files                           | Requires query and path params
Moderate | "Open the GAIA documentation site"                                          | browser.navigate                                  | Requires valid URL parameter
Complex  | "Find the largest file in Downloads and tell me what's in it"               | filesystem.list_directory → filesystem.read_file  | Two-step chain; second call depends on first result
Complex  | "Search the web for the latest GAIA release and save a summary to my notes" | web_search → filesystem.write_file                | Conditional: must extract content before writing

The full test suite expands these seed scenarios to at least 20, covering common GAIA-supported MCP tools across all three tiers.

2. Execution Engine

  • Run each scenario N times (default: 10) to measure consistency; a minimal execution-loop sketch follows this list
  • Record per-run: tool selected, parameters generated, success/failure, latency
  • Support targeting specific model + prompt configurations (e.g., 4B on iGPU vs 30B on dGPU)
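
A minimal sketch of this loop, assuming a hypothetical run_agent(prompt, config) entry point that returns the tool calls the agent actually made (none of these names are existing GAIA APIs):

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Outcome of a single run of one scenario on one model/prompt configuration."""
    scenario_id: str
    iteration: int
    tool_called: Optional[str]
    params: dict
    success: bool
    latency_s: float


def run_scenario(scenario, config, run_agent, n: int = 10) -> list[RunRecord]:
    """Execute one scenario n times and capture tool choice, params, success, latency."""
    records = []
    for i in range(n):
        start = time.perf_counter()
        result = run_agent(scenario.prompt, config)  # hypothetical agent entry point
        latency = time.perf_counter() - start
        calls = result.tool_calls                    # assumed: list of calls with .name / .arguments
        first = calls[0] if calls else None
        records.append(RunRecord(
            scenario_id=scenario.id,
            iteration=i,
            tool_called=first.name if first else None,
            params=first.arguments if first else {},
            success=[c.name for c in calls] == scenario.expected_tools,
            latency_s=latency,
        ))
    return records
```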

3. Scorecard Output

  • Per-tool pass rate (target: 90–100%)
  • Per-complexity-tier pass rate
  • Latency distribution (p50, p95, p99)
  • Failure mode classification: wrong tool, wrong parameters, format error, timeout (a scoring and classification sketch follows this list)
  • Model × prompt × tool matrix with pass rates
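
A rough sketch of scoring over the RunRecord list from the execution engine above; the failure-mode heuristics here are illustrative, not a fixed taxonomy:

```python
def classify_failure(record, scenario):
    """Bucket a failed run into one of the failure modes (heuristic, illustrative only)."""
    if record.success:
        return None
    if record.tool_called is None:
        return "timeout"
    if record.tool_called not in scenario.expected_tools:
        return "wrong_tool"
    if set(scenario.expected_params) - set(record.params):
        return "wrong_parameters"
    return "format_error"


def scorecard(records):
    """Aggregate pass rate and nearest-rank latency percentiles for one scenario."""
    latencies = sorted(r.latency_s for r in records)

    def pct(p):
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]

    return {
        "pass_rate": sum(r.success for r in records) / len(records),
        "p50_s": pct(0.50),
        "p95_s": pct(0.95),
        "p99_s": pct(0.99),
    }
```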

4. Integration with Agent Eval

Hook the reliability scenarios into the agent eval framework being built in #573 (Agent Eval: Replace existing eval framework) so MCP reliability results are captured and reported through the same pipeline as other agent evaluations.

Files Likely Affected

  • src/gaia/eval/ — new scenario type for MCP reliability
  • src/gaia/mcp/ — test harness for tool call capture and replay
  • tests/ — unit tests for the framework itself

Acceptance Criteria

  • Test suite with at least 20 graduated scenarios across all three complexity tiers
  • Seed scenarios from this issue are included as a baseline
  • Execution engine runs N iterations per scenario with full result capture
  • Scorecard reports per-tool and per-tier pass rates
  • Latency distribution reported per scenario
  • Failure modes classified automatically
  • Works with 4B model on iGPU configuration
  • Integrates with agent eval framework (Agent Eval: Replace existing eval framework #573)
  • Produces a configurable go/no-go readiness signal (e.g., "all Moderate-tier scenarios pass at ≥90% on target hardware"); a configuration sketch follows this list
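
The go/no-go signal could be a small configurable gate over the per-tier pass rates, for example (the thresholds, keys, and hardware label below are hypothetical):

```python
# Hypothetical gate: per-tier pass-rate thresholds on the target hardware profile.
READINESS_GATE = {
    "hardware": "4B-iGPU",
    "min_pass_rate": {"simple": 1.0, "moderate": 0.9, "complex": 0.7},
}


def is_ready(tier_pass_rates: dict, gate: dict = READINESS_GATE) -> bool:
    """Return True only if every tier meets its configured pass-rate threshold."""
    return all(
        tier_pass_rates.get(tier, 0.0) >= threshold
        for tier, threshold in gate["min_pass_rate"].items()
    )
```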
