A comprehensive benchmark framework for evaluating AI agents' ability to use MCP (Model Context Protocol) tools to complete real-world tasks.
This benchmark evaluates how well AI agents can:
- Understand complex, multi-step task instructions
- Select appropriate tools from a large pool of MCP servers
- Execute MCP tools in the correct order
- Produce accurate and well-formatted outputs
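One natural way to score tool-ordering correctness (a sketch, not the benchmark's actual scoring code — the function name and criterion are assumptions) is to check whether the expected tool chain appears, in order, within the agent's actual sequence of calls:

```python
def tool_order_correct(expected: list[str], actual: list[str]) -> bool:
    """Return True if every tool in `expected` appears in `actual`
    in the same relative order (i.e. `expected` is a subsequence)."""
    it = iter(actual)
    # `tool in it` advances the iterator, enforcing order.
    return all(tool in it for tool in expected)
```

A subsequence check tolerates extra exploratory calls by the agent while still penalizing out-of-order execution.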
The benchmark contains 101 tasks across three difficulty levels:
- Easy: Simple multi-step tasks
- Medium: Tasks requiring multiple tools with dependencies
- Hard: Complex tasks requiring advanced reasoning and data processing
Each task includes:
- A natural language query with realistic context
- Required tools for completion
- Execution plan with expected tool chain
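A task record combining the elements above might look like the following (a minimal sketch — the field names and values are illustrative assumptions, not the benchmark's actual schema):

```python
# Hypothetical task record; all field names and values are assumptions.
task = {
    "id": "task-042",
    "difficulty": "medium",
    "query": "Find the latest commit on the main branch and summarize it.",
    "required_tools": ["git_log", "summarize_text"],
    "expected_tool_chain": ["git_log", "summarize_text"],
}
```

Keeping the expected tool chain explicit in each record lets an evaluator compare it directly against the agent's observed tool calls.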