We employ the MARTI framework to train both base and reasoning models, specifically Qwen2.5-3B and DeepScaleR-1.5B-Preview. For Qwen2.5-3B, we run DeepSeek-R1-Zero-style reinforcement learning on Level 3-5 samples from the MATH dataset. DeepScaleR-1.5B-Preview, which has strong inherent reasoning capabilities but is harder to train further, undergoes Test-Time Reinforcement Learning (TTRL) adaptation on AIME benchmark data. For multi-agent reinforcement learning, we use a cluster of 3 nodes, each equipped with 8 A800 80GB GPUs, allocating one full node per agent.
We compare non-reasoning and reasoning models under various configurations and show that majority voting consistently outperforms multi-agent workflows when the underlying models are trained conventionally. This reflects known limitations of current LLM-based agent systems, such as poor role adherence and ineffective inter-agent communication.
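As a reference point, the majority-voting baseline simply selects the most frequent final answer across independently sampled responses. A minimal sketch (the function name is ours, not MARTI's API):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer among sampled responses.

    Ties break toward the first-seen answer, since Counter.most_common
    preserves insertion order for equal counts (Python 3.7+).
    """
    if not answers:
        raise ValueError("no answers to vote over")
    return Counter(answers).most_common(1)[0][0]
```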
To address this, MARTI enhances model reasoning through structured agent interactions. As shown in Figure 3 and Figure 4, our experiments show that:
- MARTI-trained base models outperform standard RL setups and rival instruct models.
- Large reasoning models trained with MARTI using TTRL achieve state-of-the-art results on challenging tasks (e.g., 66.7 AIME score with Multi-Agent Debates).
- Multi-agent RL consistently surpasses single-agent systems in performance under the same compute budget.
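For the TTRL setting above, no ground-truth labels are needed: the majority answer across a batch of rollouts for the same problem serves as a pseudo-label, and each rollout is rewarded by agreement with it. A minimal sketch of that reward rule (the function name is ours, not MARTI's API):

```python
from collections import Counter

def ttrl_rewards(rollout_answers):
    """Binary rewards for TTRL: the majority answer across rollouts
    acts as a pseudo-label; rollouts that match it get reward 1.0.

    rollout_answers: final answers extracted from N sampled rollouts
    for the same unlabeled problem.
    """
    pseudo_label = Counter(rollout_answers).most_common(1)[0][0]
    return [1.0 if a == pseudo_label else 0.0 for a in rollout_answers]
```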
Figure 3: Average scores of Qwen2.5-3B base and instruct models under different budgets and settings
Figure 4: Average scores of reasoning models under different budgets and settings
We conduct multi-agent debate (MAD) training with Qwen2.5-3B. The model is trained using REINFORCE++ on Level 3-5 samples from the MATH-500 dataset.
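For context, a debate round lets each agent revise its answer after seeing its peers' current answers, with the final answer taken by vote over the last round. A minimal sketch, assuming agents are callables of `(question, peer_answers)` — a hypothetical interface, not MARTI's actual one:

```python
from collections import Counter

def debate(agents, question, rounds=2):
    """Run a simple multi-agent debate: each agent answers, then revises
    after seeing its peers' answers; the final answer is the majority
    vote over the last round."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds - 1):
        # Each agent sees every answer except its own from the prior round.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return Counter(answers).most_common(1)[0][0]
```

With one confident agent and two agents that copy the first peer answer they see, the confident agent's answer propagates through the revision round.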
Figure 5: Accuracy of MAD (Qwen2.5-3B, MATH) on AMC and MATH
Figure 6: Training Dynamics of MAD (Qwen2.5-3B, MATH)
We evaluate a mixture-of-agents (MoA) approach using the Qwen2.5-3B model, trained on Level 3-5 samples from the MATH-500 training dataset.
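In a mixture-of-agents setup, proposer agents draft candidate answers and a single aggregator synthesizes them into one response. A minimal sketch with hypothetical callables (not MARTI's API):

```python
def mixture_of_agents(proposers, aggregator, question):
    """MoA sketch: collect drafts from proposer agents, then let one
    aggregator agent produce the final response from those drafts."""
    drafts = [propose(question) for propose in proposers]
    return aggregator(question, drafts)
```

The aggregator sees all drafts at once, so even weak proposers can contribute partial reasoning that the aggregator folds into its final answer.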
Figure 7: Accuracy of MoA (Qwen2.5-3B, MATH) on AMC and MATH
Figure 8: Training Dynamics of MoA (Qwen2.5-3B, MATH)