
Awesome RL-based Reasoning MLLMs

License: MIT | Awesome

Recent advances in leveraging reinforcement learning to enhance LLM reasoning have yielded remarkably promising results, exemplified by DeepSeek-R1, Kimi k1.5, OpenAI o3-mini, and Grok 3. These exhilarating achievements herald the ascendance of Large Reasoning Models and mark further progress along the thorny path toward Artificial General Intelligence (AGI). The study of LLM reasoning has garnered significant attention within the community, and researchers have concurrently compiled Awesome RL-based LLM Reasoning. More recently, researchers have also collected projects with detailed configurations of Large Reasoning Models in Awesome RL Reasoning Recipes ("Triple R"). Meanwhile, we have observed that remarkably strong work has already been done in the domain of RL-based Reasoning Multimodal Large Language Models (MLLMs). We aim to provide the community with a comprehensive and timely synthesis of this fascinating and promising field, along with some insights into it.

"The senses are the organs by which man perceives the world, and the soul acts through them as through tools."
β€” Leonardo da Vinci

This repository provides a valuable reference for researchers in the field of multimodality. Please start your exploration of RL-based Reasoning MLLMs!

News

🔥🔥🔥 [2025-5-24] We have written the position paper Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models, which summarizes recent advancements in RFT for MLLMs. We focus on answering three questions: (1) What background should researchers interested in this field know? (2) What has the community done? (3) What could the community do next? We hope this position paper provides valuable insights to the community at this pivotal stage in the advancement toward AGI.

📧📧📧 [2025-4-10] Based on existing work in the community, we provide some insights into this field, which you can find in the PowerPoint presentation file.


Figure 1: An overview of the works done on reinforcement fine-tuning (RFT) for multimodal large language models (MLLMs). Works are sorted by release time and are collected up to May 15, 2025.

Papers (Sorted by Release Time) 📄

Vision (Image) 👀

Vision (Video) 📹

Medical Vision 🏥

Embodied Vision 🤖

Multimodal Reward Model 💯

Audio 👂

Omni ☺️

GUI Agent 📲

Web Agent 🌏

Autonomous Driving 🚙

3D & Metaverse 🌠

Benchmarks and Datasets 📊

Open-Source Projects (Repos without Paper) 🌐

Training Framework 🗼

  • EasyR1 💻 (An Efficient, Scalable, Multi-Modality RL Training Framework)

Vision (Image) 👀

Vision (Video) 📹

Agent 👥

Contribution and Acknowledgment ❤️

This is an active repository, and your contributions are always welcome! If you have any questions about this opinionated list, do not hesitate to contact me at [email protected].

I extend my sincere gratitude to all community members who provided valuable supplementary support.

Citation 📑

If you find this repository useful for your research and applications, please star us ⭐ and consider citing:

@misc{sun2025reinforcementfinetuningpowersreasoning,
      title={Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models}, 
      author={Haoyuan Sun and Jiaqi Wu and Bo Xia and Yifu Luo and Yifei Zhao and Kai Qin and Xufei Lv and Tiantian Zhang and Yongzhe Chang and Xueqian Wang},
      year={2025},
      eprint={2505.18536},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18536}, 
}

and

@misc{sun2025RL-Reasoning-MLLMs,
  title={Awesome RL-based Reasoning MLLMs},
  author={Haoyuan Sun and Xueqian Wang},
  year={2025},
  howpublished={\url{https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs}},
  note={GitHub Repository},
}

Star Chart ⭐

Star History Chart
