A comprehensive benchmark framework for evaluating AI agents' ability to use MCP (Model Context Protocol) tools to complete real-world tasks.
This benchmark evaluates how well AI agents can:
- Understand complex, multi-step task instructions
- Select appropriate tools from a large pool of MCP servers
- Execute MCP tools in the correct order
- Produce accurate and well-formatted outputs
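One natural way to score tool-ordering correctness (a sketch, not the benchmark's actual scoring code — the function name and criterion are assumptions) is to check whether the expected tool chain appears, in order, within the agent's actual sequence of calls:

```python
def tool_order_correct(expected: list[str], actual: list[str]) -> bool:
    """Return True if every tool in `expected` appears in `actual`
    in the same relative order (i.e. `expected` is a subsequence)."""
    it = iter(actual)
    # `tool in it` advances the iterator, enforcing order.
    return all(tool in it for tool in expected)
```

A subsequence check tolerates extra exploratory calls by the agent while still penalizing out-of-order execution.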
The benchmark contains 101 tasks across three difficulty levels:
- Easy: Simple multi-step tasks
- Medium: Tasks requiring multiple tools with dependencies
- Hard: Complex tasks requiring advanced reasoning and data processing
Each task includes:
- A natural language query with realistic context
- Required tools for completion
- Execution plan with expected tool chain
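A task record combining the elements above might look like the following (a minimal sketch — the field names and values are illustrative assumptions, not the benchmark's actual schema):

```python
# Hypothetical task record; all field names and values are assumptions.
task = {
    "id": "task-042",
    "difficulty": "medium",
    "query": "Find the latest commit on the main branch and summarize it.",
    "required_tools": ["git_log", "summarize_text"],
    "expected_tool_chain": ["git_log", "summarize_text"],
}
```

Keeping the expected tool chain explicit in each record lets an evaluator compare it directly against the agent's observed tool calls.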