
🎬 Fostering Video Reasoning via Next-Event Prediction

🚀 Toward Video Reasoning via Future Prediction 🌟



Welcome to the official repository for Fostering Video Reasoning via Next-Event Prediction! 🚀
Read our paper on arXiv: 📖 2505.22457
Browse the dataset on Hugging Face: 📂 V1-33K


Video Reasoning via Future Prediction

To advance multimodal LLMs' reasoning ability, we introduce a future prediction task and its corresponding dataset. Predicting upcoming events from the observed portion of a video poses a significant challenge for current multimodal LLMs. Our task pushes these models to infer future events from the first part of a video, with the second part serving as open-ended ground truth (self-supervised learning).

🤔 Why isn’t factual answering ideal for video reasoning?
Research indicates that reasoning models like DeepSeek R1 often “over-think”, which can lead to hallucinations. When applied to video data, similar pitfalls emerge if the model is restricted to answering straightforward factual questions. For instance, querying “Where is the cat in the video?” might prompt an overly extended reasoning process, inadvertently increasing the risk of hallucinated outputs.

💡 Why is future prediction a compelling case for video reasoning?
Much like Doctor Strange's foresight in Avengers: Infinity War (2018), predicting the future demands reasoning over multiple potential outcomes. This challenge is analogous to techniques such as Monte Carlo tree search (MCTS), which systematically explores a wide array of possible scenarios. The inherent complexity of future prediction makes it a powerful task for evaluating and enhancing video reasoning capabilities.

(Figure: assets/example.png)

📽️ Video Future Prediction: A Self-Supervised Task for Multimodal Reasoning
This task is inherently self-supervised (SSL): it leverages the causal logic present in video data. By dividing videos into sequential segments, we create implicit labels that embody the natural flow of cause and effect, allowing models to learn from the logical progression of events without manual annotation.

Much like Image Contrastive Learning, which uses inherent data structures to construct labels and guide what a model should capture, Video Future Prediction is grounded in the philosophy that real-world events unfold through a chain of cause and effect. It drives the model to focus on the temporal and causal dimensions that underpin real-world scenarios, enhancing multimodal reasoning capabilities. By integrating visual cues, the model develops a holistic reasoning ability to more accurately predict and interpret the progression of complex events.

Moreover, as with other self-supervised and unsupervised learning tasks, data construction is relatively cheap, making this a scalable approach to enhancing multimodal reasoning capabilities.
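
To make the construction concrete, below is a minimal sketch of how a single training example could be assembled once a clip has been split into an observed first part and a held-out second part. The prompt wording, file names, and field layout are illustrative assumptions, not the exact format produced by the scripts in this repo.

    # Minimal sketch of assembling one next-event-prediction example.
    # The prompt text, paths, and schema are illustrative assumptions only.
    from dataclasses import dataclass

    @dataclass
    class NextEventExample:
        video_path: str   # observed first part of the clip (model input)
        question: str     # instruction asking the model to predict what happens next
        answer: str       # description of the unseen second part (open-ended ground truth)

    def make_example(first_part_path: str, second_part_description: str) -> NextEventExample:
        question = ("<video> You are shown the beginning of a video. "
                    "Reason about what is happening and predict the most likely next event.")
        # The continuation itself provides the label, so no manual annotation is needed.
        return NextEventExample(first_part_path, question, second_part_description)

    example = make_example(
        "V1-33K/first_part_video/clip_0001.mp4",          # hypothetical file name
        "The cat jumps off the table and runs toward the door.",
    )
    print(example.question)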


📦 Features

  • 🔍 Next-Event Prediction for video reasoning
  • 🎓 Demo scripts for instruction tuning & reinforcement learning
  • 🛠️ Easy integration with LLaMA-Factory & EasyR1

🐍 Setup

1. Create a Conda environment

conda create -n video_llm python=3.10 -y
conda activate video_llm

2. Download the V1-33K dataset

python v1_data_download.py

You should now see a folder named V1-33K/ containing:

  • first_part_video/
  • video_dataset/
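
If you want to confirm the download completed, an optional check like the following can be run from the project root (the two folder names come from the listing above; the script name is hypothetical):

    # check_download.py -- optional sanity check that both folders exist and are non-empty
    from pathlib import Path

    root = Path("V1-33K")
    for sub in ("first_part_video", "video_dataset"):
        files = list((root / sub).glob("*"))
        print(f"{sub}: {len(files)} entries")
        assert files, f"{sub} is missing or empty; re-run v1_data_download.py"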

🔧 LLaMA-Factory Integration

  1. Clone the repo

    git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
    cd LLaMA-Factory
  2. Install dependencies

    pip install -e ".[torch,metrics]" --no-build-isolation

🗄️ Preparing Next-Event Prediction Data

# From the project root
python video_data_generation.py

The generated data will be placed in ./LLaMA-Factory/data/

Move the required files into LLaMA-Factory:

mv dataset_info.json LLaMA-Factory/data/
mv qwen2_5vl_7B_full_sft_5K.yaml LLaMA-Factory/examples/train_full/
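
For reference, LLaMA-Factory discovers datasets through entries in data/dataset_info.json. The file generated above already contains the real entry, so the snippet below only illustrates the general shape of such an entry for a ShareGPT-style video dataset; the dataset name, file name, and column/tag mapping are assumptions, not the values shipped with this repo.

    # Illustration only: the general shape of a dataset_info.json entry for a
    # ShareGPT-style video dataset in LLaMA-Factory. The generated file already
    # contains the real entry; the names below are placeholders.
    import json

    entry = {
        "next_event_prediction_demo": {                      # hypothetical dataset name
            "file_name": "next_event_prediction_demo.json",  # hypothetical data file
            "formatting": "sharegpt",
            "columns": {"messages": "messages", "videos": "videos"},
            "tags": {
                "role_tag": "role",
                "content_tag": "content",
                "user_tag": "user",
                "assistant_tag": "assistant",
            },
        }
    }
    print(json.dumps(entry, indent=2))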

🚀 Demo Training

  • Instruction Tuning

    bash video_instruction_tuning_demo.sh

🤖 Reinforcement Learning with GRPO

  1. Install RL Env

    cd EasyR1
    pip install -e .
  2. Run the GRPO training demo

    bash video_GRPO_training_demo.sh
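
GRPO optimizes the policy against a scalar reward for each sampled prediction. The actual reward used here is configured inside EasyR1; the snippet below is only a self-contained illustration of one simple way such a reward could be scored (lexical similarity between the predicted event and the ground-truth continuation), not the implementation in this repo.

    # Illustrative reward for next-event prediction: normalized string similarity
    # between the model's predicted event and the ground-truth continuation.
    # This is NOT the reward shipped with EasyR1 here; it only sketches the idea.
    from difflib import SequenceMatcher

    def next_event_reward(prediction: str, ground_truth: str) -> float:
        """Return a reward in [0, 1] based on normalized lexical similarity."""
        return SequenceMatcher(None, prediction.lower().strip(),
                               ground_truth.lower().strip()).ratio()

    print(next_event_reward(
        "The cat jumps off the table and runs to the door.",
        "The cat leaps down and runs toward the door.",
    ))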

🔥 Evaluation

We run all our evaluations with lmms-eval. In addition to the benchmarks already implemented in lmms-eval, we incorporate evaluations for our FutureBench and SeedBench-R1. To start:

  1. Install lmms-eval

    # eval with lmms-eval
    cd third_party/lmms-eval
    pip install -e .
  2. Prepare the dataset

    You should also find futurebench.json in the same V1-33K/ folder.

    # make dataset from futurebench.json 
    python gen_dataset.py
  3. Run the evaluation

    Before running the eval script below, check that dataset_path and cache_dir in third_party/lmms-eval/lmms_eval/tasks/futurebench/futurebench.yaml are correct (a small pre-flight check is sketched after this list).

    bash third_party/lmms-eval/examples/eval_futurebench.sh

    To run evaluations on other benchmarks, see more settings in third_party/lmms-eval/examples/.
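
As a small pre-flight check before launching the eval, the following snippet prints the two fields mentioned in step 3. The key names dataset_path and cache_dir come from the instructions above; the yaml is scanned as plain text because lmms-eval task files may contain custom tags that a strict parser would reject.

    # Pre-flight check: print the dataset_path and cache_dir lines from the
    # FutureBench task config so you can verify them before running the eval.
    from pathlib import Path

    cfg = Path("third_party/lmms-eval/lmms_eval/tasks/futurebench/futurebench.yaml")
    for line in cfg.read_text().splitlines():
        if line.strip().startswith(("dataset_path", "cache_dir")):
            print(line.strip())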


📚 Citation

If you find this repository useful, please cite our paper:

@misc{wang2025fosteringvideoreasoningnextevent,
      title={Fostering Video Reasoning via Next-Event Prediction}, 
      author={Haonan Wang and Hongfu Liu and Xiangyan Liu and Chao Du and Kenji Kawaguchi and Ye Wang and Tianyu Pang},
      year={2025},
      eprint={2505.22457},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.22457}, 
}

😊 Happy exploring & feel free to open an issue or pull request! 🎉
