VideoChat-R1 & -R1.5: Spatio-Temporal RL for Video Perception and Reasoning

🔥 Updates

2025/09/26: 🔥🔥🔥 We release the VideoChat-R1.5 model on Hugging Face, along with the paper and evaluation code.
2025/09/22: 🎉🎉🎉 VideoChat-R1.5 has been accepted to NeurIPS 2025.
2025/04/22: 🔥🔥🔥 We release VideoChat-R1-caption on Hugging Face.
2025/04/14: 🔥🔥🔥 We release VideoChat-R1 and VideoChat-R1-thinking on Hugging Face.
2025/04/10: 🔥🔥🔥 We release the VideoChat-R1 paper and code.

🎯 Performances on Video Benchmarks

Across short-form & long-form videos, temporal grounding, video reasoning, and spatio-temporal perception, the model delivers consistently stronger results.

🦜 Introduction

We adopt multi-task joint RL to strengthen the model’s spatio-temporal perception and reasoning capabilities.
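
This README does not spell out the reward design, so the following is only an illustrative sketch of what a per-task reward in multi-task RL could look like. It assumes a temporal grounding task scored by temporal IoU and a multiple-choice QA task scored by exact match; the function names `temporal_iou` and `task_reward` and the task labels `"grounding"` and `"qa"` are hypothetical and are not taken from the training scripts.

```python
from typing import Tuple, Union

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time segments, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def task_reward(task: str,
                prediction: Union[Segment, str],
                target: Union[Segment, str]) -> float:
    """Dispatch to a task-specific reward (hypothetical task labels)."""
    if task == "grounding":
        # Assumption: grounding quality is measured by temporal overlap.
        return temporal_iou(prediction, target)
    if task == "qa":
        # Assumption: QA answers are compared case-insensitively.
        same = prediction.strip().lower() == target.strip().lower()
        return 1.0 if same else 0.0
    raise ValueError(f"unknown task: {task}")
```

For example, `task_reward("grounding", (10.0, 20.0), (15.0, 25.0))` scores a prediction that overlaps the target for 5 of the 15 seconds the two segments cover together. Refer to training_scripts for the rewards actually used.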

During inference, we simulate hierarchical human attention to enable the model to progressively localize the Region of Interest (ROI) within input videos. This multi-step perception process ensures that the model's performance improves with each step.
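
The progressive localization described above can be pictured as a loop that repeatedly narrows a temporal window. The sketch below is a minimal illustration under stated assumptions, not the released inference code: `iterative_perception` and `propose_roi` are hypothetical names, the ROI is simplified to a time segment, and the model call is replaced by a stub that always keeps the middle half of the current window.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def iterative_perception(duration: float,
                         propose_roi: Callable[[Segment], Segment],
                         num_steps: int = 3) -> List[Segment]:
    """Narrow the temporal window step by step.

    Returns the window after each step, starting with the full video.
    """
    window: Segment = (0.0, duration)
    history = [window]
    for _ in range(num_steps):
        start, end = propose_roi(window)
        # Clamp the proposal so it stays inside the current window.
        start = max(window[0], min(start, window[1]))
        end = max(start, min(end, window[1]))
        if (start, end) == window:
            break  # No further narrowing; stop early.
        window = (start, end)
        history.append(window)
    return history


def middle_half(window: Segment) -> Segment:
    """Stub standing in for the model: keep the middle half of the window."""
    start, end = window
    quarter = (end - start) / 4
    return (start + quarter, end - quarter)


if __name__ == "__main__":
    for step, seg in enumerate(iterative_perception(120.0, middle_half)):
        print(step, seg)
```

In a real pipeline, `propose_roi` would be replaced by a model call that re-samples frames from the current window and predicts the next region; see the Hugging Face README for the actual inference steps.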

Demo & Inference

Please refer to the Hugging Face README for the steps required to run inference.

Evaluation

See eval_scripts and lmms-eval_videochat.

Training

See training_scripts.

📄 Citation

If you find this project useful in your research, please consider citing:

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}

@article{yan2025videochatr15,
  title={VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception},
  author={Yan, Ziang and Li, Xinhao and He, Yinan and Yue, Zhengrong and Zeng, Xiangyu and Wang, Yali and Qiao, Yu and Wang, Limin and Wang, Yi},
  journal={arXiv preprint arXiv:2509.21100},
  year={2025}
}

For any inquiries regarding this work, please contact us at yanziang@pjlab.org.cn.
