Welcome to the official repository for Fostering Video Reasoning via Next-Event Prediction! 🚀
Read our paper on arXiv: 📖 2505.22457
Browse the dataset on Hugging Face: 📂 V1-33K
To advance multimodal LLMs' reasoning ability, we introduce a future prediction task and its corresponding dataset. Predicting upcoming events from historical video data presents significant challenges for current multimodal LLMs. Our task pushes these models to infer future events based on the first part of a video, with the second part serving as open-ended ground truth (Self-Supervised Learning).
🤔 Why isn’t factual answering ideal for video reasoning?
Research indicates that reasoning models like DeepSeek R1 often “over-think”, which can lead to hallucinations. When applied to video data, similar pitfalls emerge if the model is restricted to answering straightforward factual questions. For instance, querying “Where is the cat in the video?” might prompt an overly extended reasoning process, inadvertently increasing the risk of hallucinated outputs.
💡 Why is future prediction a compelling case for video reasoning?
Much like Doctor Strange's foresight in *Avengers 3: Infinity War* (2018), predicting the future demands reasoning over multiple potential outcomes. This challenge is analogous to techniques such as Monte Carlo tree search (MCTS), which systematically explores a wide array of possible scenarios. The inherent complexity of future prediction makes it a powerful task for evaluating and enhancing video reasoning capabilities.
📽️ Video Future Prediction: A Self-Supervised Task for Multimodal Reasoning
This task is inherently self-supervised (SSL): it leverages the causal logic present in video data. By dividing videos into sequential segments, we create implicit labels that embody the natural flow of cause and effect, allowing models to learn from the logical progression of events without manual annotations.

Much like *Image Contrastive Learning*, which uses inherent data structure to construct labels and guide what a model should capture, *Video Future Prediction* is grounded in the philosophy that real-world events unfold through a chain of cause and effect. It drives the model to focus on the temporal and causal dimensions that underpin real-world scenarios, and by integrating visual cues, the model develops a holistic reasoning ability to more accurately predict and interpret the progression of complex events. Moreover, as with other self-supervised and unsupervised learning tasks, data construction is relatively cheap, making it a scalable solution for enhancing multimodal reasoning capabilities.
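To make the setup concrete, here is a minimal, illustrative sketch of how such an example can be formed. This is not the actual V1-33K construction pipeline (the released dataset already provides the splits); the split ratio, prompt wording, and names below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class NextEventExample:
    """One self-supervised next-event prediction example."""
    observed_clip: tuple[float, float]  # (start, end) seconds shown to the model
    future_clip: tuple[float, float]    # (start, end) seconds held out as ground truth
    question: str                       # open-ended prompt given to the model
    answer: str                         # description of what actually happens next

def make_example(duration_s: float, split_ratio: float, future_description: str) -> NextEventExample:
    """Cut a video at split_ratio of its duration: the first part is observed, the rest is the target."""
    cut = duration_s * split_ratio
    return NextEventExample(
        observed_clip=(0.0, cut),
        future_clip=(cut, duration_s),
        question="Based on the video so far, what is most likely to happen next?",
        answer=future_description,
    )

# e.g. a 60-second clip cut at the halfway point (illustrative values)
example = make_example(60.0, 0.5, "The goalkeeper dives and blocks the penalty kick.")
print(example.observed_clip, "->", example.future_clip)
```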
- 🔍 Next-Event Prediction for video reasoning
- 🎓 Demo scripts for instruction tuning & reinforcement learning
- 🛠️ Easy to use with LLaMA-Factory and EasyR1
```bash
conda create -n video_llm python=3.10 -y
conda activate video_llm

python v1_data_download.py
```
You should now see a folder named `V1-33K/` containing:

- `first_part_video/`
- `video_dataset/`
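If you want a quick sanity check that the download completed (optional; only the folder names listed above are assumed), something like the following works:

```python
from pathlib import Path

# Check that the two expected sub-folders exist and count the files in each
root = Path("V1-33K")
for sub in ("first_part_video", "video_dataset"):
    path = root / sub
    if path.exists():
        n_files = sum(1 for p in path.rglob("*") if p.is_file())
        print(f"{path}: found ({n_files} files)")
    else:
        print(f"{path}: MISSING")
```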
- Clone the repo

  ```bash
  git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
  cd LLaMA-Factory
  ```

- Install dependencies

  ```bash
  pip install -e ".[torch,metrics]" --no-build-isolation
  ```
```bash
# From the project root
python video_data_generation.py
```

The generated data will be placed in `./LLaMA-Factory/data/`. Then move the dataset description and training config into LLaMA-Factory:

```bash
mv dataset_info.json LLaMA-Factory/data/
mv qwen2_5vl_7B_full_sft_5K.yaml LLaMA-Factory/examples/train_full/
```
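Before launching the tuning step below, you can optionally confirm the files ended up where the commands above put them (paths taken directly from the `mv` commands; nothing else is assumed):

```python
from pathlib import Path

# Destinations used by the mv commands above
expected = [
    Path("LLaMA-Factory/data/dataset_info.json"),
    Path("LLaMA-Factory/examples/train_full/qwen2_5vl_7B_full_sft_5K.yaml"),
]
for p in expected:
    print(f"{p}: {'ok' if p.exists() else 'MISSING'}")
```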
- Instruction Tuning

  ```bash
  bash video_instruction_tuning_demo.sh
  ```
- Install RL Env

  ```bash
  cd EasyR1
  pip install -e .
  ```
- Run the GRPO training demo

  ```bash
  bash video_GRPO_training_demo.sh
  ```
We run all our evaluations with lmms-eval. Besides the benchmarks already implemented in lmms-eval, we also incorporate evaluations of our FutureBench as well as SeedBench-R1 into it. To get started:
- Install lmms-eval

  ```bash
  # eval with lmms-eval
  cd third_party/lmms-eval
  pip install -e .
  ```
- Prepare the dataset

  You should also find `futurebench.json` under the same `V1-33K/` folder.

  ```bash
  # make dataset from futurebench.json
  python gen_dataset.py
  ```
- Run the inference

  Before running the following eval script, check that `dataset_path` and `cache_dir` in `third_party/lmms-eval/lmms_eval/tasks/futurebench/futurebench.yaml` are correct (an optional sanity-check sketch follows this list).

  ```bash
  bash third_party/lmms-eval/examples/eval_futurebench.sh
  ```
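If you prefer to do that sanity check programmatically, here is a minimal sketch. It makes no assumption about the schema of `futurebench.json` (it only peeks at the first entry), and it reads the task YAML as plain text rather than parsing it, since lmms-eval task configs may contain custom tags:

```python
import json
from pathlib import Path

# 1) Peek at the FutureBench annotations without assuming their schema
data = json.loads(Path("V1-33K/futurebench.json").read_text())
print(f"futurebench.json: {len(data)} entries")
first = data[0] if isinstance(data, list) else next(iter(data.values()))
if isinstance(first, dict):
    print("fields in first entry:", sorted(first.keys()))

# 2) Show the dataset_path / cache_dir lines to double-check before evaluation
cfg = Path("third_party/lmms-eval/lmms_eval/tasks/futurebench/futurebench.yaml")
for line in cfg.read_text().splitlines():
    if line.strip().startswith(("dataset_path", "cache_dir")):
        print(line.strip())
```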
To run evaluations on other benchmarks, see more settings in `third_party/lmms-eval/examples/`.
If you find this repository useful, please cite our paper:
```bibtex
@misc{wang2025fosteringvideoreasoningnextevent,
      title={Fostering Video Reasoning via Next-Event Prediction},
      author={Haonan Wang and Hongfu Liu and Xiangyan Liu and Chao Du and Kenji Kawaguchi and Ye Wang and Tianyu Pang},
      year={2025},
      eprint={2505.22457},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.22457},
}
```
😊 Happy exploring & feel free to open an issue or pull request! 🎉