mingyin1/CS-590
Live MCP Benchmark

A comprehensive benchmark framework for evaluating AI agents' ability to use MCP (Model Context Protocol) tools to complete real-world tasks.

Overview

This benchmark evaluates how well AI agents can:

  • Understand complex, multi-step task instructions
  • Select appropriate tools from a large pool of MCP tools
  • Execute MCP tools in the correct order
  • Produce accurate and well-formatted outputs

Dataset

The benchmark contains 101 tasks across various difficulty levels:

  • Easy: Simple multi-step tasks
  • Medium: Tasks requiring multiple tools with dependencies
  • Hard: Tasks requiring complex reasoning and data processing

Each task includes:

  • A natural language query with realistic context
  • Required tools for completion
  • Execution plan with expected tool chain
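To make the task structure above concrete, here is a minimal sketch of what a single task record might look like, with a check that it carries the three components listed. All field names and values are illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical task record; field names ("query", "required_tools",
# "execution_plan", "difficulty") are assumptions based on the README,
# not the benchmark's actual on-disk format.
task = {
    "id": "task-001",
    "difficulty": "medium",
    "query": (
        "Find the cheapest flight from NYC to SF next Friday "
        "and add it to my calendar."
    ),
    "required_tools": ["flight_search", "calendar_create_event"],
    "execution_plan": ["flight_search", "calendar_create_event"],
}


def validate(t: dict) -> bool:
    """Check that a task record has the components the README describes."""
    required = {"query", "required_tools", "execution_plan"}
    return required <= t.keys() and t.get("difficulty") in {"easy", "medium", "hard"}


print(validate(task))  # → True
```

A validator like this would let the benchmark harness reject malformed tasks before handing them to an agent, and the `execution_plan` field gives the expected tool chain to score an agent's tool-call ordering against.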
