Hi Alex 👋,
Really enjoyed your RLM blog post — especially your call for stronger, more realistic benchmarks for recursive reasoning.
I’d like to suggest MCPMark (arXiv:2509.24002) as a candidate. It’s specifically built for agentic reasoning in Model Context Protocol (MCP) environments, testing how models plan, act, and verify over extended multi-turn interactions.
⸻
What MCPMark provides
• 127 tasks covering CRUD operations, tool use, state updates, and structured workflows — all verified automatically.
• Measures agent performance in realistic execution contexts, not just static comprehension.
• The challenge is reasoning and control over large, evolving context traces — not merely reading long text.
⸻
Why MCPMark fits RLM perfectly
1. Agentic by design — MCPMark centers on multi-turn, tool-using, stateful reasoning in MCP environments, where models must plan, act, and verify dynamically — exactly the kind of recursive control loops RLM aims to model and enhance.
2. Scalable context, concise output — a single task can involve ~1M tokens of accumulated context but only require a few thousand tokens of output, creating the ideal asymmetry for testing recursive reasoning and selective retrieval.
(Our latest release also includes a ReAct implementation example, so integrating RLM-based agents into MCPMark should be straightforward.)
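To make that concrete, here is a rough sketch of how an RLM backend could sit inside a ReAct-style plan/act/observe loop over an MCP task environment. All names here (`RecursiveLM`, `MCPEnvironment`, `Action`, `run_episode`) are hypothetical placeholders for illustration, not MCPMark's actual interfaces; happy to adapt it to whatever the real adapter API looks like.

```python
# Illustrative only: class and method names below are hypothetical
# placeholders, not MCPMark's actual interfaces.
from dataclasses import dataclass


@dataclass
class Action:
    """A single tool call proposed by the agent."""
    tool: str
    args: dict


@dataclass
class Step:
    """One thought -> action -> observation cycle in the trace."""
    thought: str
    action: Action | None
    observation: str | None = None


class RecursiveLM:
    """Stand-in for an RLM backend: per step it works from a compact view of
    the trace and can recursively expand sub-queries, instead of ingesting the
    full accumulated context at once."""

    def decide(self, task: str, trace: list[Step]) -> tuple[str, Action | None]:
        # Returns (thought, action); action is None when the thought is a
        # final answer. A real RLM would recursively inspect/summarise the
        # trace here.
        raise NotImplementedError


class MCPEnvironment:
    """Stand-in for one MCP-backed task environment (e.g. a single MCPMark task)."""

    def execute(self, action: Action) -> str:
        raise NotImplementedError

    def verify(self, answer: str) -> bool:
        raise NotImplementedError


def run_episode(task: str, agent: RecursiveLM, env: MCPEnvironment,
                max_steps: int = 30) -> bool:
    """ReAct-style loop: plan -> act -> observe, until the agent answers."""
    trace: list[Step] = []
    for _ in range(max_steps):
        thought, action = agent.decide(task, trace)
        if action is None:
            # Final answer produced; let the task's automatic checker score it.
            return env.verify(thought)
        observation = env.execute(action)
        trace.append(Step(thought=thought, action=action, observation=observation))
    return False  # ran out of steps without a final answer
```

The only real requirement on the RLM side is the `decide` step: the agent never needs the full raw trace in one shot, which is exactly where recursive decomposition and selective retrieval would plug in.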
⸻
Next steps
Would love to collaborate on integrating RLM with MCPMark if you're interested — e.g., adding recursive reasoning as an agent backend and comparing performance on the existing metrics (pass@1, pass@4, etc.). I'd be happy to help with the adapter or evaluation setup.
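On the metrics side, I'm assuming pass@k is computed with the standard unbiased estimator from the Codex paper (Chen et al., 2021): run n ≥ k attempts per task with c successes, estimate 1 − C(n−c, k)/C(n, k), and average over tasks. A minimal sketch, in case it helps with the evaluation setup:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn without replacement from n attempts with
    c successes is correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 2 successes out of 8 attempts on a task
print(pass_at_k(n=8, c=2, k=1))  # 0.25
print(pass_at_k(n=8, c=2, k=4))  # ~0.786
```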
Thanks again for pushing this frontier — RLM + MCPMark feels like a natural fit!