Hi Alex 👋,
Really enjoyed your RLM blog post — especially your call for stronger, more realistic benchmarks for recursive reasoning.
I’d like to suggest MCPMark (arXiv:2509.24002) as a candidate. It’s specifically built for agentic reasoning in Model Context Protocol (MCP) environments, testing how models plan, act, and verify over extended multi-turn interactions.
⸻
What MCPMark provides
• 127 tasks covering CRUD operations, tool use, state updates, and structured workflows — all verified automatically.
• Measures agent performance in realistic execution contexts, not just static comprehension.
• The challenge is reasoning and control over large, evolving context traces — not merely reading long text.
⸻
Why MCPMark fits RLM perfectly
1. Agentic by design — MCPMark centers on multi-turn, tool-using, stateful reasoning in MCP environments, where models must plan, act, and verify dynamically — exactly the kind of recursive control loops RLM aims to model and enhance.
2. Scalable context, concise output — a single task can involve ~1M tokens of accumulated context but only require a few thousand tokens of output, creating the ideal asymmetry for testing recursive reasoning and selective retrieval.
(Our latest release also includes a ReAct implementation example, so integrating RLM-based agents into MCPMark should be straightforward.)
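To make that concrete, here is a rough sketch of how an RLM backend could sit inside a ReAct-style plan/act/observe loop over an MCP task environment. All names here (`RecursiveLM`, `MCPEnvironment`, `Action`, `run_episode`) are hypothetical placeholders for illustration, not MCPMark's actual interfaces; happy to adapt it to whatever the real adapter API looks like.

```python
# Illustrative only: class and method names below are hypothetical
# placeholders, not MCPMark's actual interfaces.
from dataclasses import dataclass


@dataclass
class Action:
    """A single tool call proposed by the agent."""
    tool: str
    args: dict


@dataclass
class Step:
    """One thought -> action -> observation cycle in the trace."""
    thought: str
    action: Action | None
    observation: str | None = None


class RecursiveLM:
    """Stand-in for an RLM backend: per step it works from a compact view of
    the trace and can recursively expand sub-queries, instead of ingesting the
    full accumulated context at once."""

    def decide(self, task: str, trace: list[Step]) -> tuple[str, Action | None]:
        # Returns (thought, action); action is None when the thought is a
        # final answer. A real RLM would recursively inspect/summarise the
        # trace here.
        raise NotImplementedError


class MCPEnvironment:
    """Stand-in for one MCP-backed task environment (e.g. a single MCPMark task)."""

    def execute(self, action: Action) -> str:
        raise NotImplementedError

    def verify(self, answer: str) -> bool:
        raise NotImplementedError


def run_episode(task: str, agent: RecursiveLM, env: MCPEnvironment,
                max_steps: int = 30) -> bool:
    """ReAct-style loop: plan -> act -> observe, until the agent answers."""
    trace: list[Step] = []
    for _ in range(max_steps):
        thought, action = agent.decide(task, trace)
        if action is None:
            # Final answer produced; let the task's automatic checker score it.
            return env.verify(thought)
        observation = env.execute(action)
        trace.append(Step(thought=thought, action=action, observation=observation))
    return False  # ran out of steps without a final answer
```

The only real requirement on the RLM side is the `decide` step: the agent never needs the full raw trace in one shot, which is exactly where recursive decomposition and selective retrieval would plug in.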
⸻
Next steps
Would love to collaborate on integrating RLM with MCPMark if you're interested — e.g., adding recursive reasoning as an agent backend and comparing performance on the existing metrics (pass@1, pass@4, etc.). I'd be happy to help with the adapter or evaluation setup.
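On the metrics side, I'm assuming pass@k is computed with the standard unbiased estimator from the Codex paper (Chen et al., 2021): run n ≥ k attempts per task with c successes, estimate 1 − C(n−c, k)/C(n, k), and average over tasks. A minimal sketch, in case it helps with the evaluation setup:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn without replacement from n attempts with
    c successes is correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 2 successes out of 8 attempts on a task
print(pass_at_k(n=8, c=2, k=1))  # 0.25
print(pass_at_k(n=8, c=2, k=4))  # ~0.786
```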
Thanks again for pushing this frontier — RLM + MCPMark feels like a natural fit!