The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs


Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that

  • 📌 Long-CoT SFT improves performance on difficult questions through in-depth, structured reasoning, but it introduces verbosity and degrades performance on simpler ones.
  • 📌 In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though its gains on the hardest questions are less pronounced than SFT's.
  • 📌 Surprisingly, combining them through two-stage, interleaved, or progressive training, as well as data mixing and model merging, fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length (a minimal model-merging sketch follows this list).
  • 📌 This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
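
The model-merging baseline referenced above is, in its simplest form, an element-wise interpolation of the SFT and RL checkpoints in weight space. The sketch below illustrates that idea only; the checkpoint paths and the 0.5 mixing ratio are illustrative assumptions, not the exact recipe used in this work.

```python
# Minimal sketch of weight-space model merging via linear interpolation.
# Checkpoint paths and alpha are illustrative assumptions, not the settings used in this work.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

sft = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "path/to/sft-checkpoint", torch_dtype=torch.bfloat16)
rl = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "path/to/rl-checkpoint", torch_dtype=torch.bfloat16)

alpha = 0.5  # interpolation weight given to the SFT checkpoint
rl_state = rl.state_dict()
merged = {
    # interpolate floating-point weights; leave non-float buffers untouched
    name: alpha * p + (1 - alpha) * rl_state[name] if p.is_floating_point() else p
    for name, p in sft.state_dict().items()
}

rl.load_state_dict(merged)                       # reuse one model object to hold the merged weights
rl.save_pretrained("path/to/merged-checkpoint")  # evaluate like any other checkpoint
```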

Benchmarks with difficulty level tags

MathVision    MathVerse    MathVista    MMMU (val)    MMStar
🤗 link       🤗 link      🤗 link      🤗 link       🤗 link

Our training dataset Eureka-Distill

The 34k-sample training set and 0.7k-sample validation set can be downloaded here.
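
Assuming the dataset is hosted on the Hugging Face Hub (as the 🤗 links elsewhere in this README suggest), it can typically be loaded with the `datasets` library. The repository ID and split names below are placeholders; substitute the actual ID from the download link above.

```python
# Hedged sketch for loading Eureka-Distill with the Hugging Face `datasets` library.
# "your-org/Eureka-Distill" is a placeholder repo ID; use the actual ID behind the download link.
from datasets import load_dataset

data = load_dataset("your-org/Eureka-Distill")  # downloads and caches all available splits
print(data)                                     # the README describes ~34k training and ~0.7k val samples
print(data["train"][0])                         # split names may differ; inspect `data` to confirm
```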

Fine-tuned Models

Model                     MathVision   MathVerse   MathVista   MMMU (val)   MMStar   Avg.   HF Link
Qwen2.5-VL-7B             26.1         42.3        66.4        53.5         63.2     50.3   🤗 link
+ SFT                     29.7         47.2        65.6        53.6         60.8     51.4   🤗 link
+ RL                      29.0         52.1        72.6        55.1         66.5     55.1   🤗 link
+ Two-stage SFT & RL      29.3         47.1        66.6        53.0         60.9     51.4   🤗 link
+ Interleaved SFT & RL    29.2         48.7        71.8        54.1         64.3     53.6   🤗 link
+ Progressive SFT & RL    29.8         51.0        72.4        55.5         65.9     54.9   🤗 link
+ Data Mixing             29.2         51.2        72.0        55.1         62.7     54.0   🤗 link
+ Model Merging           29.6         50.4        71.8        53.7         66.2     54.3   🤗 link
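
All checkpoints in the table are fine-tuned from Qwen2.5-VL-7B, so inference should follow the standard Qwen2.5-VL recipe in `transformers`, assuming the released weights keep the base architecture. The sketch below uses the public base model ID as a stand-in for the fine-tuned 🤗 links and assumes `qwen-vl-utils` is installed; the image path and prompt are placeholders.

```python
# Hedged inference sketch, assuming the fine-tuned checkpoints keep the base Qwen2.5-VL
# architecture. The model ID is the public base model, used as a stand-in for the 🤗 links
# in the table; swap in the checkpoint you want to evaluate.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/geometry_problem.png"},  # hypothetical image path
        {"type": "text", "text": "Solve the problem in the image. Reason step by step."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```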

Training

Code will be released soon. Please stay tuned :)

Acknowledgement

We train models with the verl and LLaMA-Factory frameworks and evaluate them with VLMEvalKit.

Citation

If you find this project useful in your research, please consider citing it with the following BibTeX entry:

@misc{chen2025synergydilemmalongcotsft,
      title={The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs}, 
      author={Jierun Chen and Tiezheng Yu and Haoli Bai and Lewei Yao and Jiannan Wu and Kaican Li and Fei Mi and Chaofan Tao and Lei Zhu and Manyi Zhang and Xiaohui Li and Lu Hou and Lifeng Shang and Qun Liu},
      year={2025},
      eprint={2507.07562},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07562}, 
}
