
EVAL SYS is a living, open-source community for tracking and advancing the agentic capabilities of models. We'll be releasing benchmarks, datasets, toolchains, and models to push the field forward. Initiated by LobeHub and Allison Zhang, we would love to collaborate with research labs, MCP server maintainers, independent contributors, and more.

Join us, contribute, or reach out!


MCPMark: Stress-Testing Comprehensive MCP Use

An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).

MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.



Repositories

  • mcpmark (Public) — MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use. Python · 255 stars · 15 forks · Apache-2.0 · Updated Oct 12, 2025
  • mcpmark-experiments (Public) — Collection of evaluation results for MCPMark. 1 star · Updated Oct 1, 2025
  • mcpmark-community-experiments (Public) — 1 fork · Updated Sep 21, 2025
  • .github (Public) — Community health files for the @eval-sys organization. Updated Sep 15, 2025
  • mcp-eval-website (Public) — TypeScript · 2 stars · Updated Aug 18, 2025
