Ye Liu1†, Kevin Qinghong Lin2†, Chang Wen Chen1, Mike Zheng Shou2
1The Hong Kong Polytechnic University 2Show Lab, National University of Singapore
VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers. This progressive strategy addresses the unique challenges of temporal-grounded reasoning in long videos.
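To make this workflow concrete, below is a minimal sketch of how such a progressive, role-based reasoning loop could be orchestrated. The role functions (`planner`, `grounder`, `verifier`, `answerer`) are placeholder stubs for illustration only, not the actual VideoMind API; see the paper and codebase for the real implementation.

```python
# Minimal sketch of a progressive, role-based video-reasoning loop.
# The role functions below are placeholder stubs, NOT the actual VideoMind API.
from dataclasses import dataclass


@dataclass
class Moment:
    start: float       # seconds
    end: float         # seconds
    confidence: float


def planner(question: str) -> dict:
    # Break down the task: decide whether temporal grounding is needed and,
    # if so, rewrite the question into a localization query.
    needs_grounding = any(w in question.lower() for w in ("when", "before", "after"))
    return {"needs_grounding": needs_grounding, "query": question}


def grounder(video_path: str, query: str) -> list[Moment]:
    # Localize candidate moments for the query (stubbed with fixed spans).
    return [Moment(12.0, 18.5, 0.81), Moment(40.0, 47.0, 0.42)]


def verifier(video_path: str, query: str, moment: Moment) -> bool:
    # Check that a localized moment actually supports the query.
    return moment.confidence >= 0.5


def answerer(video_path: str, question: str, moment: Moment | None) -> str:
    # Synthesize the final answer, optionally restricted to the verified clip.
    span = f"{moment.start:.1f}s-{moment.end:.1f}s" if moment else "the full video"
    return f"Answer derived from {span}."


def answer_question(video_path: str, question: str) -> str:
    plan = planner(question)                                    # 1. break down the task
    moment = None
    if plan["needs_grounding"]:
        candidates = grounder(video_path, plan["query"])        # 2. localize moments
        verified = [m for m in candidates if verifier(video_path, plan["query"], m)]
        if verified:                                            # 3. verify candidates
            moment = max(verified, key=lambda m: m.confidence)
    return answerer(video_path, question, moment)               # 4. synthesize the answer


if __name__ == "__main__":
    print(answer_question("demo.mp4", "When does the chef flip the pancake?"))
```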
- 2025.04.05: See BENCHMARK.md for evaluation results of VideoMind on public benchmarks.
- 2025.03.28: VideoMind-2B is ready on Hugging Face Spaces. Check it out!
- 2025.03.21: Code, model, and dataset release.
- 2025.03.17: Our tech report is available online.
| Benchmark | Evaluation Results (2B/7B) |
|---|---|
| ZS CG-Bench (mini) | long-acc: 31.0/38.4, rec@IoU: 8.50/9.93, acc@IoU: 4.02/4.67 |
| ZS ReXTime (val) | mIoU: 24.83/27.61, Acc: 69.06/74.59, Acc@IoU: 17.26/20.20 |
| ZS NExT-GQA (test) | mIoU: 28.6/31.4, mIoP: 36.4/39.0, Acc@GQA: 25.2/28.2 |
| ZS Charades-STA (test) | R1@0.5: 51.1/59.1, R1@0.7: 26.0/31.2, mIoU: 45.2/50.2 |
| ZS ActivityNet-Captions (val_2) | R1@0.5: 26.5/30.3, R1@0.7: 12.6/15.7, mIoU: 30.1/33.3 |
| FT QVHighlights (test) | R1@0.5: 75.42/78.53, R1@0.7: 59.35/61.09, mAP: 51.60/54.19 |
| FT TACoS (test) | R1@0.5: 26.9/36.2, R1@0.7: 15.5/21.4, mIoU: 27.4/34.4 |
| ZS Ego4D-NLQ (val) | R1@0.5: 2.9/3.7, R1@0.7: 1.2/1.7, mIoU: 4.7/5.4 |
| ZS ActivityNet-RTL (val) | P@0.5: 20.1/28.0, mIoU: 22.7/31.3 |
| ZS Video-MME (w/o subs) | All: 53.6/58.2, Long: 45.4/49.2 |
| ZS MLVU | M-Avg: 58.7/64.4 |
| ZS LVBench | Overall: 35.4/40.8 |
| ZS MVBench | Acc: 61.9/64.6 |
| ZS LongVideoBench | Acc: 48.8/56.3 |
`ZS` means zero-shot evaluation; `FT` means fine-tuned on the corresponding training set. See BENCHMARK.md for full evaluation results.
*(Demo video: demo.mp4)*
Play with our online demo, or see DEMO.md for instructions on deploying it locally.
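For a rough picture of what a local deployment could look like, here is a minimal Gradio sketch. The use of Gradio and the `run_videomind` helper are assumptions made for illustration, not the repository's actual entry point; follow DEMO.md for the real deployment steps.

```python
# Minimal sketch of a local demo UI (assumes Gradio; the inference helper
# below is a hypothetical placeholder, not the actual VideoMind entry point).
import gradio as gr


def run_videomind(video: str, question: str) -> str:
    # Placeholder: load the VideoMind checkpoint and run the agent here.
    return f"(stub) answer for {question!r} on {video}"


demo = gr.Interface(
    fn=run_videomind,
    inputs=[gr.Video(label="Video"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="VideoMind Demo (local sketch)",
)

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```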
We provide raw videos, compressed videos, and pre-processed annotations for 27 video grounding / QA datasets, including our VideoMind-SFT corpus (481K samples) for training and multiple benchmarks for evaluation. We also release the datasets used during our early exploration (but not included in the final version) to facilitate future research.
See our dataset repo for more details.
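As a rough illustration of what a pre-processed grounding / QA annotation typically contains, the snippet below parses a hypothetical JSON record. The field names are assumptions for illustration only; consult the dataset repo for the actual schema.

```python
# Hypothetical example of reading a pre-processed annotation record.
# All field names here are illustrative; the real schema is documented
# in the dataset repo.
import json

record = json.loads("""
{
  "vid": "example_0001",
  "duration": 120.5,
  "query": "the person pours water into the cup",
  "span": [34.2, 41.8],
  "question": "What does the person do after picking up the kettle?",
  "answer": "Pours water into the cup."
}
""")

start, end = record["span"]
print(f'{record["vid"]}: "{record["query"]}" grounded at {start:.1f}s-{end:.1f}s')
```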
Our codebase supports training and evaluation on 27 video datasets and benchmarks, with the following features:
- Flexible hardware settings: NVIDIA GPU / Ascend NPU, Single-Node / Multi-Node
- Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel (see the sketch after this list)
- Customizing the base LLM and conversation templates
- Monitoring the training process via Tensorboard / Wandb
- Group sampling for mixed dataset training
- Multi-process / multi-device evaluation on public benchmarks
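To make the efficient-training items above more concrete, here is a minimal sketch of wiring up BF16, FlashAttention-2 (or SDPA), and LoRA with Hugging Face Transformers and PEFT. The base model ID and LoRA hyperparameters are placeholder assumptions, not the repository's actual configuration; see TRAIN.md for the real setup.

```python
# Minimal sketch: BF16 + FlashAttention-2 + LoRA fine-tuning setup.
# Model ID and hyperparameters are placeholders, not VideoMind's actual config.
import torch
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",               # placeholder base model
    torch_dtype=torch.bfloat16,                 # BF16
    attn_implementation="flash_attention_2",    # or "sdpa"
)

lora_cfg = LoraConfig(
    r=16,                                       # placeholder rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# DeepSpeed ZeRO, Liger-Kernel, and multi-node settings would be enabled
# through the training launcher / config rather than in this snippet.
```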
See TRAIN.md for a quick start guide.
See EVAL.md for details about evaluating VideoMind on public benchmarks.
Please cite our paper if you find this project helpful.
@article{liu2025videomind,
title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2503.13444},
year={2025}
}