VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Ye Liu1†, Kevin Qinghong Lin2†, Chang Wen Chen1, Mike Zheng Shou2

1The Hong Kong Polytechnic University 2Show Lab, National University of Singapore

VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers. This design addresses the unique challenges of temporally grounded reasoning through a progressive strategy.
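
As a rough sketch of this progressive strategy (the role names follow the paper, but the stub functions below are hypothetical placeholders, not the repository's API), the agent plans, grounds, verifies, and then answers:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the progressive strategy: a planner decides whether
# temporal grounding is needed, a grounder proposes a moment, a verifier
# refines it, and an answerer responds over the selected clip. All stubs are
# placeholders, not the repository's API.

@dataclass
class Moment:
    start: float  # seconds
    end: float    # seconds

def planner(question: str) -> bool:
    # Placeholder routing rule: ground first if the question is time-sensitive.
    return any(w in question.lower() for w in ("when", "before", "after", "moment"))

def grounder(video: str, question: str) -> Moment:
    return Moment(12.0, 24.0)  # placeholder: localize a candidate moment

def verifier(video: str, question: str, m: Moment) -> Moment:
    return m  # placeholder: accept or refine the candidate moment

def answerer(video: str, question: str, m: Optional[Moment]) -> str:
    clip = f"{video}[{m.start}:{m.end}]" if m else video
    return f"answer derived from {clip}"  # placeholder: synthesize the answer

def answer_question(video: str, question: str) -> str:
    m = verifier(video, question, grounder(video, question)) if planner(question) else None
    return answerer(video, question, m)

print(answer_question("demo.mp4", "When does the person open the fridge?"))
```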

🔥 News

  • 2025.04.05 📊 See BENCHMARK.md for evaluation results of VideoMind on public benchmarks.
  • 2025.03.28 🚀 VideoMind-2B is ready on Hugging Face Spaces. Check it out!
  • 2025.03.21 ⭐️ Code, model, and dataset release.
  • 2025.03.17 🎉 Our tech report is available online.

๐Ÿ† VideoMind on Public Benchmarks

| Setting | Benchmark | Evaluation Results (2B/7B) |
| --- | --- | --- |
| ZS | CG-Bench (mini) | long-acc: 31.0/38.4, rec@IoU: 8.50/9.93, acc@IoU: 4.02/4.67 |
| ZS | ReXTime (val) | mIoU: 24.83/27.61, Acc: 69.06/74.59, Acc@IoU: 17.26/20.20 |
| ZS | NExT-GQA (test) | mIoU: 28.6/31.4, mIoP: 36.4/39.0, Acc@GQA: 25.2/28.2 |
| ZS | Charades-STA (test) | R@0.5: 51.1/59.1, R@0.7: 26.0/31.2, mIoU: 45.2/50.2 |
| ZS | ActivityNet-Captions (val_2) | R@0.5: 26.5/30.3, R@0.7: 12.6/15.7, mIoU: 30.1/33.3 |
| FT | QVHighlights (test) | R1@0.5: 75.42/78.53, R1@0.7: 59.35/61.09, mAP: 51.60/54.19 |
| FT | TACoS (test) | R@0.5: 26.9/36.2, R@0.7: 15.5/21.4, mIoU: 27.4/34.4 |
| ZS | Ego4D-NLQ (val) | R@0.3: 2.9/3.7, R@0.5: 1.2/1.7, mIoU: 4.7/5.4 |
| ZS | ActivityNet-RTL (val) | P@0.5: 20.1/28.0, mIoU: 22.7/31.3 |
| ZS | Video-MME (w/o subs) | All: 53.6/58.2, Long: 45.4/49.2 |
| ZS | MLVU | M-Avg: 58.7/64.4 |
| ZS | LVBench | Overall: 35.4/40.8 |
| ZS | MVBench | Acc: 61.9/64.6 |
| ZS | LongVideoBench | Acc: 48.8/56.3 |

ZS means zero-shot evaluation; FT denotes models fine-tuned on the corresponding training set.
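
The moment retrieval metrics above (R@0.5, R@0.7, mIoU) are thresholded on temporal IoU between the predicted and ground-truth spans. For reference, a minimal computation using the standard definition (not tied to this repo's evaluation code):

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# R@0.5 counts a prediction as a hit when its IoU reaches 0.5, e.g.:
iou = temporal_iou((10.0, 20.0), (12.0, 24.0))  # 8 / 14 ≈ 0.571 -> hit at 0.5
print(iou)
```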

See BENCHMARK.md for full evaluation results.

๐Ÿ•น๏ธ Gradio Demo

(Demo video: demo.mp4)

Play with our online demo, or see DEMO.md for guidelines on deploying it locally.

📦 Datasets

We provide raw videos, compressed videos, and pre-processed annotations for 27 video grounding / QA datasets, including our VideoMind-SFT (481K samples) for training and multiple benchmarks for evaluation. We also release the datasets used during our early exploration (not included in the final version) to facilitate future research.

See our dataset repo for more details.
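
As a loose illustration of what a pre-processed grounding annotation can look like (the field names below are hypothetical, not the released schema; refer to the dataset repo for the actual format):

```python
import json

# Hypothetical annotation record for a single grounding sample. The real field
# names and layout are defined in the dataset repo, not here.
record = {
    "vid": "video_0001",                     # placeholder video id
    "query": "the person opens the fridge",  # natural-language query
    "span": [12.4, 18.9],                    # start / end in seconds
    "duration": 120.0,                       # full video length in seconds
}

# JSONL-style round trip, one sample per line.
line = json.dumps(record)
sample = json.loads(line)
print(sample["query"], sample["span"])
```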

🚀 Training

Our codebase supports training and evaluation on 27 video datasets and benchmarks with the following features (an illustrative adapter-switching sketch follows the list).

  • Flexible hardware settings: NVIDIA GPU / Ascend NPU, Single-Node / Multi-Node
  • Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel
  • Customizing the base LLM and conversation templates
  • Monitoring the training process via Tensorboard / Wandb
  • Group sampling for mixed dataset training
  • Multi-process / multi-device evaluation on public benchmarks
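
To make the "Chain-of-LoRA" idea concrete: one frozen base model can host several role-specific LoRA adapters, and roles are switched by activating a different adapter. A sketch using Hugging Face PEFT (the model id, adapter paths, and names below are placeholders, not the released checkpoints; see TRAIN.md for the actual setup):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholders throughout: "base-llm" and the lora/* paths are illustrative.
base = AutoModelForCausalLM.from_pretrained("base-llm")
model = PeftModel.from_pretrained(base, "lora/grounder", adapter_name="grounder")
model.load_adapter("lora/verifier", adapter_name="verifier")
model.load_adapter("lora/answerer", adapter_name="answerer")

model.set_adapter("grounder")   # role switch: localize the moment
# ... grounding inference ...
model.set_adapter("verifier")   # role switch: verify / refine the moment
# ... verification inference ...
model.set_adapter("answerer")   # role switch: synthesize the final answer
```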

See TRAIN.md for a quick start guide.

🔮 Evaluation

See EVAL.md for details about evaluating VideoMind on public benchmarks.
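
Multi-process evaluation generally follows a shard-and-merge pattern: each rank handles a disjoint slice of the benchmark, then predictions are merged before scoring. A generic sketch of that pattern (illustrative only, not the repo's actual entry point):

```python
import os

# With torchrun-style launchers, RANK and WORLD_SIZE are set per process.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

samples = list(range(1000))          # placeholder benchmark samples
shard = samples[rank::world_size]    # strided split: disjoint, near-equal shards
preds = [{"id": s} for s in shard]   # placeholder per-sample predictions
print(f"rank {rank}/{world_size}: {len(preds)} predictions")
# Each rank writes its predictions; rank 0 merges them and computes metrics.
```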

📖 Citation

Please cite our paper if you find this project helpful.

@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}
