Recent advances in leveraging reinforcement learning to enhance the reasoning capabilities of LLMs have yielded remarkably promising results, exemplified by DeepSeek-R1, Kimi k1.5, OpenAI o3-mini, and Grok 3. These achievements herald the ascendance of Large Reasoning Models and mark further progress along the thorny path towards Artificial General Intelligence (AGI). The study of LLM reasoning has garnered significant attention within the community, and researchers have concurrently summarized Awesome RL-based LLM Reasoning. Recently, researchers have also compiled a collection of projects with detailed configurations of Large Reasoning Models in Awesome RL Reasoning Recipes ("Triple R"). Meanwhile, we have observed that remarkably strong work has already emerged in the domain of RL-based Reasoning Multimodal Large Language Models (MLLMs). We aim to provide the community with a comprehensive and timely synthesis of this fascinating and promising field, along with our own insights into it.
This repository serves as a valuable reference for researchers in the field of multimodality. Please start your exploration of RL-based Reasoning MLLMs here!
🔥🔥🔥 [2025-5-24] We have written the position paper Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models, which summarizes recent advances on the topic of RFT for MLLMs. It focuses on answering the following three questions: 1. What background should researchers interested in this field know? 2. What has the community done? 3. What could the community do next? We hope this position paper provides valuable insights to the community at this pivotal stage of the advancement toward AGI.
🚧🚧🚧 [2025-4-10] Based on existing work in the community, we provide some insights into this field, which you can find in the PowerPoint presentation file.
Figure 1: An overview of work on reinforcement fine-tuning (RFT) for multimodal large language models (MLLMs), sorted by release time and collected up to May 15, 2025.
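Many of the works collected below adopt Group Relative Policy Optimization (GRPO) or a close variant as their RL algorithm. As a quick orientation before diving into the list, here is a minimal, illustrative Python sketch of the core GRPO idea: standardizing rule-based rewards within a group of responses sampled from the same prompt. All function and variable names are our own illustrative assumptions and do not correspond to any particular codebase listed here.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: for a group of G responses sampled from
    the same prompt, standardize their scalar rewards so that above-average
    responses receive positive advantages (illustrative sketch only)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled responses to one (image, question) pair, scored by a
# rule-based verifier (1 = correct answer, 0 = wrong), plus a small format
# bonus for emitting the expected reasoning template.
accuracy = np.array([1.0, 0.0, 1.0, 0.0])
format_bonus = np.array([0.1, 0.1, 0.0, 0.1])
print(grpo_advantages(accuracy + format_bonus))
```

Correct, well-formatted responses end up with positive advantages and dominate the policy update; no separate value network is needed, which is a key reason GRPO recurs throughout the recipes below.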
- [2508] [We-Math 2.0] We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning [Project 🌐] [Datasets 🤗] [Code 💻]
- [2508] [Skywork UniPic 2.0] Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model [Project 🌐] [Models 🤗] [Code 💻]
- [2508] [DocThinker] DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding [Code 💻]
- [2508] [AR-GRPO (generation)] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning [Models 🤗] [Code 💻]
- [2508] [M2IO-R1] M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation
- [2508] [SIFThinker] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
- [2508] [Shuffle-R1] Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [Code 💻]
- [2508] [TempFlow-GRPO (generation)] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
- [2508] [EARL (editing)] The Promise of RL for Autoregressive Image Editing [Code 💻]
- [2507] [VL-Cogito] VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning [Model 🤗] [Dataset 🤗] [Code 💻]
- [2507] [X-Omni (generation)] X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again [Project 🌐] [Models 🤗] [Dataset 🤗] [Code 💻]
- [2507] [MixGRPO (generation)] MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE [Project 🌐] [Model 🤗] [Code 💻]
- [2507] [RRVF] Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback [Model 🤗] [Dataset 🤗] [Code 💻]
- [2507] [SOPHIA] Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning
- [2507] [Spatial-VLM-Investigator] Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning [Code 💻]
- [2507] [VisionThink] VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [Models 🤗] [Datasets 🤗] [Code 💻]
- [2507] [M2-Reasoning] M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning [Model 🤗] [Code 💻]
- [2507] [SFT-RL-SynergyDilemma] The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs [Models 🤗] [Datasets 🤗] [Code 💻]
- [2507] [PAPO] PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning [Project 🌐] [Models 🤗] [Datasets 🤗] [Code 💻]
- [2507] [Skywork-R1V3] Skywork-R1V3 Technical Report [Model 🤗] [Code 💻]
- [2507] [Open-Vision-Reasoner] Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning [Project 🌐] [Models 🤗] [Code 💻]
- [2507] [GLM-4.1V-Thinking] GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning [Models 🤗] [Demo 🤗] [Code 💻]
- [2506] [MiCo] MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
- [2506] [Visual-Structures] Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs
- [2506] [APO] APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization [Code 💻]
- [2506] [MMSearch-R1] MMSearch-R1: Incentivizing LMMs to Search [Code 💻]
- [2506] [PeRL] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning [Code 💻]
- [2506] [MM-R5] MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval [Model 🤗] [Code 💻]
- [2506] [ViCrit] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [Models 🤗] [Datasets 🤗] [Code 💻]
- [2506] [ViLaSR] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [Models 🤗] [Datasets 🤗] [Code 💻]
- [2506] [Vision Matters] Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning [Model 🤗] [Datasets 🤗] [Code 💻]
- [2506] [ViGaL] Play to Generalize: Learning to Reason Through Game Play [Project 🌐] [Model 🤗] [Code 💻]
- [2506] [RAP] Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning [Code 💻]
- [2506] [RACRO] Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [Models 🤗] [Demo 🤗] [Code 💻]
- [2506] [Revisual-R1] Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning [Models 🤗] [Code 💻]
- [2506] [Rex-Thinker] Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning [Project 🌐] [Model 🤗] [Dataset 🤗] [Demo 🤗] [Code 💻]
- [2506] [ControlThinker (generation)] ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [Code 💻]
- [2506] [Multimodal DeepResearcher] Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework [Project 🌐]
- [2506] [SynthRL] SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis [Model 🤗] [Datasets 🤗] [Code 💻]
- [2506] [SRPO] SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [GThinker] GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking [Model 🤗] [Datasets 🤗] [Code 💻]
- [2505] [ReasonGen-R1 (generation)] ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL [Project 🌐] [Models 🤗] [Datasets 🤗] [Code 💻]
- [2505] [MoDoMoDo] MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning [Project 🌐] [Datasets 🤗] [Code 💻]
- [2505] [DINO-R1] DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models [Project 🌐]
- [2505] [VisualSphinx] VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL [Project 🌐] [Model 🤗] [Datasets 🤗] [Code 💻]
- [2505] [PixelThink] PixelThink: Towards Efficient Chain-of-Pixel Reasoning [Project 🌐] [Code 💻]
- [2505] [ViGoRL] Grounded Reinforcement Learning for Visual Reasoning [Project 🌐] [Code 💻]
- [2505] [Jigsaw-R1] Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles [Datasets 🤗] [Code 💻]
- [2505] [UniRL] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning [Model 🤗] [Code 💻]
- [2505] [Infi-MMR] Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models [Model 🤗] [Code 💻]
- [2505] [cadrille (generation)] cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning
- [2505] [SAM-R1] SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning
- [2505] [Thinking with Generated Images] Thinking with Generated Images [Models 🤗] [Code 💻]
- [2505] [MM-UPT] Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO [Model 🤗] [Dataset 🤗] [Code 💻]
- [2505] [RL-with-Cold-Start] Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start [Models 🤗] [Datasets 🤗] [Code 💻]
- [2505] [VRAG-RL] VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning [Models 🤗] [Code 💻]
- [2505] [MLRM-Halu] More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models [Project 🌐] [Benchmark 🤗] [Code 💻]
- [2505] [Active-O3] Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO [Project 🌐] [Model 🤗] [Code 💻]
- [2505] [RLRF (generation)] Rendering-Aware Reinforcement Learning for Vector Graphics Generation
- [2505] [VisTA] VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection [Project 🌐] [Code 💻]
- [2505] [Point-RFT] Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
- [2505] [VTool-R1] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use [Project 🌐] [Models 🤗] [Code 💻]
- [2505] [SATORI-R1] SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards [Model 🤗] [Dataset 🤗] [Code 💻]
- [2505] [URSA] URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [Model 🤗] [Datasets 🤗] [Code 💻]
- [2505] [v1] Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation [Model 🤗] [Code 💻]
- [2505] [GRE Suite] GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains [Code 💻]
- [2505] [V-Triune] One RL to See Them All: Visual Triple Unified Reinforcement Learning [Models 🤗] [Dataset 🤗] [Code 💻]
- [2505] [RePrompt (generation)] RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [Code 💻]
- [2505] [ULM-R1 (unified)] Co-Reinforcement Learning for Unified Multimodal Understanding and Generation [Datasets 🤗] [Code 💻]
- [2505] [GoT-R1 (generation)] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning [Models 🤗] [Code 💻]
- [2505] [SophiaVL-R1] SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward [Models 🤗] [Datasets 🤗] [Code 💻]
- [2505] [DPO-vs-GRPO (generation)] Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO [Code 💻]
- [2505] [R1-ShareVL] R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO [Code 💻]
- [2505] [VLM-R^3] VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
- [2505] [TON] Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [Models 🤗] [Datasets 🤗] [Code 💻]
- [2505] [Pixel Reasoner] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning [Project 🌐] [Models 🤗] [Datasets 🤗] [Demo 🤗] [Code 💻]
- [2505] [GRIT] GRIT: Teaching MLLMs to Think with Images [Project 🌐] [Demo 🤗] [Code 💻]
- [2505] [STAR-R1] STAR-R1: Spacial TrAnsformation Reasoning by Reinforcing Multimodal LLMs [Code 💻]
- [2505] [VARD (generation)] VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL
- [2505] [Chain-of-Focus] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL [Project 🌐]
- [2505] [Visionary-R1] Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning [Code 💻]
- [2505] [VisualQuality-R1] VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank [Models 🤗] [Code 💻]
- [2505] [DeepEyes] Incentivizing "Thinking with Images" via Reinforcement Learning [Project 🌐] [Model 🤗] [Dataset 🤗] [Code 💻]
- [2505] [Visual-ARFT] Visual Agentic Reinforcement Fine-Tuning [Models 🤗] [Datasets 🤗] [Code 💻]
- [2505] [UniVG-R1] UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning [Project 🌐] [Model 🤗] [Dataset 🤗] [Code 💻]
- [2505] [G1] G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning [Code 💻]
- [2505] [VisionReasoner] VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning [Model 🤗] [Dataset 🤗] [Code 💻]
- [2505] [VPRL] Visual Planning: Let's Think Only with Images [Code 💻]
- [2505] [GuardReasoner-VL] GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning [Code 💻]
- [2505] [OpenThinkIMG] OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [Model 🤗] [Datasets 🤗] [Code 💻]
- [2505] [DanceGRPO (generation)] DanceGRPO: Unleashing GRPO on Visual Generation [Project 🌐] [Code 💻]
- [2505] [Flow-GRPO (generation)] Flow-GRPO: Training Flow Matching Models via Online RL [Models 🤗] [Code 💻]
- [2505] [X-Reasoner] X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains [Code 💻]
- [2505] [T2I-R1 (generation)] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT [Code 💻]
- [2504] [FAST] Fast-Slow Thinking for Large Vision-Language Model Reasoning [Code 💻]
- [2504] [Skywork R1V2] Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning [Models 🤗] [Code 💻]
- [2504] [Relation-R1] Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension [Code 💻]
- [2504] [R1-SGG] Compile Scene Graphs with Reinforcement Learning [Code 💻]
- [2504] [NoisyRollout] Reinforcing Visual Reasoning with Data Augmentation [Models 🤗] [Datasets 🤗] [Code 💻]
- [2504] [Qwen-AD] Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning [Code 💻]
- [2504] [SimpleAR (generation)] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL [Models 🤗] [Code 💻]
- [2504] [VL-Rethinker] Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [Project 🌐] [Models 🤗] [Dataset 🤗] [Code 💻]
- [2504] [Kimi-VL] Kimi-VL Technical Report [Project 🌐] [Models 🤗] [Demo 🤗] [Code 💻]
- [2504] [VLAA-Thinking] SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [Models 🤗] [Dataset 🤗] [Code 💻]
- [2504] [Perception-R1] Perception-R1: Pioneering Perception Policy with Reinforcement Learning [Model 🤗] [Datasets 🤗] [Code 💻]
- [2504] [SoTA with Less] SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [Model 🤗] [Datasets 🤗] [Code 💻]
- [2504] [VLM-R1] VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [Model 🤗] [Dataset 🤗] [Demo 🤗] [Code 💻]
- [2504] [CrowdVLM-R1] CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward [Dataset 🤗] [Code 💻]
- [2504] [MAYE] Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [Dataset 🤗] [Code 💻]
- [2503] [Q-Insight] Q-Insight: Understanding Image Quality via Visual Reinforcement Learning [Model 🤗] [Code 💻]
- [2503] [Reason-RFT] Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2503] [OpenVLThinker] OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement [Model 🤗] [Code 💻]
- [2503] [Think or Not Think] Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [Models 🤗] [Datasets 🤗] [Code 💻]
- [2503] [OThink-MR1] OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
- [2503] [R1-VL] R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [Model 🤗] [Code 💻]
- [2503] [Skywork R1V] Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought [Model 🤗] [Code 💻]
- [2503] [R1-Onevision] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [Model 🤗] [Dataset 🤗] [Demo 🤗] [Code 💻]
- [2503] [VisualPRM] VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [Project 🌐] [Model 🤗] [Dataset 🤗] [Benchmark 🤗]
- [2503] [LMM-R1] LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL [Code 💻]
- [2503] [VisRL] VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [Project 🌐] [Code 💻]
- [2503] [Curr-ReFT] Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [Models 🤗] [Dataset 🤗] [Code 💻]
- [2503] [VisualThinker-R1-Zero] R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Code 💻]
- [2503] [Vision-R1] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [Code 💻]
- [2503] [Seg-Zero] Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [Model 🤗] [Dataset 🤗] [Code 💻]
- [2503] [MM-Eureka] MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [Models 🤗] [Dataset 🤗] [Code 💻]
- [2503] [Visual-RFT] Visual-RFT: Visual Reinforcement Fine-Tuning [Project 🌐] [Datasets 🤗] [Code 💻]
- [2501] [Kimi k1.5] Kimi k1.5: Scaling Reinforcement Learning with LLMs [Project 🌐]
- [2501] [Mulberry] Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [Model 🤗] [Code 💻]
- [2501] [Virgo] Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Model 🤗] [Code 💻]
- [2501] [Text-to-image COT] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [Project 🌐] [Model 🤗] [Code 💻]
- [2411] [InternVL2-MPO] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [Project 🌐] [Model 🤗] [Code 💻]
- [2411] [Insight-V] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [Model 🤗] [Code 💻]
- [2507] [LongVILA-R1] Scaling RL to Long Videos [Code 💻]
- [2506] [GRPO-CARE] GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [Model 🤗] [Dataset 🤗] [Code 💻]
- [2506] [Ego-R1] Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning [Project 🌐] [Models 🤗] [Datasets 🤗] [Code 💻]
- [2506] [Motion-R1 (Human Motion Generation)] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation [Project 🌐] [Code 💻]
- [2506] [VersaVid-R1] VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks [Code 💻]
- [2506] [DeepVideo-R1] DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO [Code 💻]
- [2506] [EgoVLM] EgoVLM: Policy Optimization for Egocentric Video Understanding
- [2506] [Temporal-RLT] Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [Models 🤗] [Code 💻]
- [2506] [ReAgent-V] ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [Code 💻]
- [2506] [ReFoCUS] ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding
- [2505] [TW-GRPO] Reinforcing Video Reasoning with Focused Thinking [Model 🤗] [Code 💻]
- [2505] [Spatial-MLLM] Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence [Project 🌐] [Model 🤗] [Code 💻]
- [2505] [VAU-R1] VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [MUSEG] MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding [Models 🤗] [Code 💻]
- [2505] [VerIPO] VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization [Model 🤗] [Code 💻]
- [2505] [SpaceR] SpaceR: Reinforcing MLLMs in Video Spatial Reasoning [Model 🤗] [Dataset 🤗] [Code 💻]
- [2504] [TinyLLaVA-Video-R1] TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning [Model 🤗] [Code 💻]
- [2504] [VideoChat-R1] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [Model 🤗] [Code 💻]
- [2504] [Spatial-R1] Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning [Code 💻]
- [2504] [R1-Zero-VSI] Improved Visual-Spatial Reasoning via R1-Zero-Like Training [Code 💻]
- [2503] [SEED-Bench-R1] Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [Dataset 🤗] [Code 💻]
- [2503] [Video-R1] Video-R1: Reinforcing Video Reasoning in MLLMs [Model 🤗] [Dataset 🤗] [Code 💻]
- [2503] [TimeZero] TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM [Model 🤗] [Code 💻]
- [2508] [MedReasoner] MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision [Project 🌐]
- [2507] [SmartPath-R1] A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model
- [2506] [Medical-VIE-RLVR] Efficient Medical VIE via Reinforcement Learning
- [2506] [ReasonMed] ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [Model 🤗] [Dataset 🤗] [Code 💻]
- [2506] [Med-PRM] Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards [Project 🌐] [Model 🤗] [Dataset 🤗] [Code 💻]
- [2506] [Lingshu] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [Project 🌐] [Models 🤗] [Code 💻]
- [2505] [MedCCO] Improving Medical Reasoning with Curriculum-Aware Reinforcement Learning [Code 💻]
- [2505] [Medical-VQA-GRPO] Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models
- [2505] [Patho-R1] Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner [Code 💻]
- [2505] [RCMed] Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant
- [2504] [ChestX-Reasoner] ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification
- [2504] [PathVLM-R1] PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
- [2503] [Med-R1] Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models [Model 🤗] [Code 💻]
- [2502] [MedVLM-R1] MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning [Model 🤗]
- [2508] [Affordance-R1] Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model [Model 🤗] [Code 💻]
- [2508] [VL-DAC] Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success [Code 💻]
- [2507] [ThinkAct] ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning [Project 🌐]
- [2506] [VLN-R1] VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [VIKI-R] VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [RoboRefer] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [Robot-R1] Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
- [2505] [VLA RL Study] What Can RL Bring to VLA Generalization? An Empirical Study [Project 🌐] [Models 🤗] [Code 💻]
- [2505] [VLA-RL] VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning [Code 💻]
- [2505] [ManipLVM-R1] ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models
- [2504] [Embodied-R] Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [Code 💻]
- [2503] [Embodied-Reasoner] Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [Listener-Rewarded Thinking] Listener-Rewarded Thinking in VLMs for Image Preferences [Model 🤗]
- [2505] [Skywork-VL Reward] Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning [Models 🤗] [Code 💻]
- [2505] [UnifiedReward-Think] Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning [Project 🌐] [Models 🤗] [Datasets 🤗] [Code 💻]
- [2505] [R1-Reward] R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning [Model 🤗] [Dataset 🤗] [Code 💻]
- [2507] [DMOSpeech 2] DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis [Project 🌐] [Model 🤗] [Demo 🤗] [Code 💻]
- [2506] [SoundMind] SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models [Model 🤗] [Dataset 🤗] [Code 💻]
- [2504] [SARI] SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning
- [2503] [R1-AQA] Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [Model 🤗] [Code 💻]
- [2503] [Audio-Reasoner] Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [Model 🤗] [Code 💻]
- [2506] [AV-Reasoner] AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs [Project 🌐] [Model 🤗] [Dataset 🤗] [Code 💻]
- [2505] [Omni-R1 (ZJU)] Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration [Project 🌐] [Model 🤗] [Code 💻]
- [2505] [Omni-R1 (MIT)] Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
- [2505] [EchoInk-R1] EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [Model 🤗] [Dataset 🤗] [Code 💻]
- [2503] [R1-Omni] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [Model 🤗] [Code 💻]
- [2508] [UI-Venus] UI-Venus Technical Report: Building High-performance UI Agents with RFT [Code 💻]
- [2508] [GUI-RCPO] Test-Time Reinforcement Learning for GUI Grounding via Region Consistency [Project 🌐] [Code 💻]
- [2508] [InfiGUI-G1] InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization [Models 🤗] [Code 💻]
- [2507] [UI-AGILE] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding [Models 🤗] [Dataset 🤗] [Code 💻]
- [2507] [MobileGUI-RL] MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
- [2506] [Mobile-R1] Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards [Project 🌐] [Dataset 🤗]
- [2506] [ComfyUI-R1] ComfyUI-R1: Exploring Reasoning Models for Workflow Generation [Project 🌐]
- [2506] [GUI-Critic-R1] Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [Code 💻]
- [2506] [AgentCPM-GUI] AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [Model 🤗] [Dataset 🤗] [Code 💻]
- [2505] [UI-Genie] UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents [Models 🤗] [Dataset 🤗] [Code 💻]
- [2505] [ARPO] ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay [Model 🤗] [Code 💻]
- [2505] [GUI-G1] GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents [Code 💻]
- [2505] [UIShift] UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning
- [2505] [MobileIPL] Enhance Mobile Agents Thinking Process Via Iterative Preference Learning
- [2504] [InfiGUI-R1] InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners [Model 🤗] [Code 💻]
- [2504] [GUI-R1] GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents [Model 🤗] [Dataset 🤗] [Code 💻]
- [2503] [UI-R1] UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
- [2505] [Web-Shepherd] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents [Models 🤗] [Datasets 🤗] [Code 💻]
- [2507] [DriveAgent-R1] DriveAgent-R1: Advancing VLM-based Autonomous Driving with Hybrid Thinking and Active Perception
- [2506] [Drive-R1] Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning
- [2506] [AutoVLA] AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning [Project 🌐] [Code 💻]
- [2505] [AgentThink] AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
- [2507] [3D-R1] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding [Project 🌐] [Models 🤗] [Dataset 🤗] [Code 💻]
- [2506] [Scene-R1] Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations
- [2503] [MetaSpatial] MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse [Dataset 🤗] [Code 💻]
- [2508] [MathReal] MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models [Dataset 🤗] [Code 💻]
- [2508] [DeepPHY] DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning [Code 💻]
- [2507] [Zebra-CoT] Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning [Dataset 🤗] [Code 💻]
- [2507] [Video-TT] Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding [Project 🌐] [Dataset 🤗]
- [2507] [EmbRACE-3K] EmbRACE-3K: Embodied Reasoning and Action in Complex Environments [Project 🌐] [Code 💻]
- [2506] [PhysUniBench] PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [MMReason] MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI [Code 💻]
- [2506] [MindCube] Spatial Mental Modeling from Limited Views [Project 🌐] [Models 🤗] [Dataset 🤗] [Code 💻]
- [2506] [VRBench] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [MORSE-500] MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [VideoMathQA] VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [MMRB] Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [MMR-V] MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [OmniSpatial] OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models [Project 🌐] [Dataset 🤗] [Code 💻]
- [2506] [VS-Bench] VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [Open CaptchaWorld] Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents [Dataset 🤗] [Code 💻]
- [2505] [FinMME] FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation [Dataset 🤗] [Code 💻]
- [2505] [CSVQA] CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs [Dataset 🤗] [Code 💻]
- [2505] [VideoReasonBench] VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [Video-Holmes] Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [MME-Reasoning] MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [MMPerspective] MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness [Project 🌐] [Code 💻]
- [2505] [SeePhys] SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [CXReasonBench] CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays [Code 💻]
- [2505] [OCR-Reasoning] OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [RBench-V] RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [MMMR] MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [ReasonMap] Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [PhyX] PhyX: Does Your Model Have the "Wits" for Physical Reasoning? [Project 🌐] [Dataset 🤗] [Code 💻]
- [2505] [NOVA] NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI
- [2505] [GDI-Bench] GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
- [2504] [VisuLogic] VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [Project 🌐] [Dataset 🤗] [Code 💻]
- [2504] [Video-MMLU] Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark [Project 🌐] [Dataset 🤗] [Code 💻]
- [2504] [GeoSense] GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning
- [2504] [VCR-Bench] VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning [Project 🌐] [Dataset 🤗] [Code 💻]
- [2504] [MDK12-Bench] MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models [Code 💻]
- [2503] [V1-33K] V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks [Project 🌐] [Dataset 🤗] [Code 💻]
- [2502] [MM-IQ] MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [Project 🌐] [Dataset 🤗] [Code 💻]
- [2502] [MME-CoT] MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [Project 🌐] [Dataset 🤗] [Code 💻]
- [2502] [ZeroBench] ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models [Project 🌐] [Dataset 🤗] [Code 💻]
- [2502] [HumanEval-V] HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks [Project 🌐] [Dataset 🤗] [Code 💻]
- EasyR1 💻 (An Efficient, Scalable, Multi-Modality RL Training Framework)
- R1-Multimodal-Journey 💻 (Latest progress at MM-Eureka)
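Many entries above describe themselves as "R1-style" or "rule-based" RFT. For readers new to the area, below is a minimal sketch of the kind of verifiable reward such recipes typically combine with GRPO: a format reward for a think/answer template plus an accuracy reward from answer matching. The tag names, weights, and exact-match rule are illustrative assumptions, not the choices of any specific project listed here.

```python
import re

# Assumed response template: <think>...</think><answer>...</answer>
THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Two-part verifiable reward common in R1-style recipes:
    a format reward for following the reasoning template and an
    accuracy reward for matching the reference answer. Weights
    (0.5 / 1.0) are illustrative only."""
    match = THINK_ANSWER.search(response)
    format_reward = 0.5 if match else 0.0
    answer = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip() else 0.0
    return format_reward + accuracy_reward
```

In practice, projects replace the exact-match check with task-specific verifiers (e.g., IoU thresholds for grounding, symbolic checkers for math), which is where much of the design effort in the works above goes.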
This is an active repository and your contributions are always welcome! If you have any questions about this opinionated list, please do not hesitate to contact me at [email protected].
I extend my sincere gratitude to all community members who have provided valuable supplementary support.
If you find this repository useful for your research and applications, please star us ⭐ and consider citing:
@misc{sun2025reinforcementfinetuningpowersreasoning,
      title={Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models},
      author={Haoyuan Sun and Jiaqi Wu and Bo Xia and Yifu Luo and Yifei Zhao and Kai Qin and Xufei Lv and Tiantian Zhang and Yongzhe Chang and Xueqian Wang},
      year={2025},
      eprint={2505.18536},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18536},
}

and
@misc{sun2025RL-Reasoning-MLLMs,
      title={Awesome RL-based Reasoning MLLMs},
      author={Haoyuan Sun and Xueqian Wang},
      year={2025},
      howpublished={\url{https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs}},
      note={GitHub repository},
}