Skip to content

VisionXLab/Moment-Video

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

7 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Xiaolin Liu*, Yilun Zhu*, Xiangyu Zhao*, Xuehui Wang,

Yan Li, Xin Li, Haoyu Cao, Xing Sun,

Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang

arXiv PDF HuggingFace data Project Page

🎬 Introduction

Moment-Video is a benchmark for diagnosing the temporal fidelity of video multimodal large language models (MLLMs) on momentary visual events: localized actions or state transitions that may last only a few frames, yet determine the correct answer.

Unlike benchmarks centered on persistent objects, global scene context, or long-form semantic aggregation, Moment-Video asks whether models can notice, count, describe, and reason over brief answer-critical evidence. The benchmark contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering both real-world and virtual scenarios.

Moment-Video evaluates four complementary task types:

  • Temporal Occurrence (TO): whether a brief event or state transition happens.
  • Temporal Counting (TC): how many transient actions, object changes, or repeated events occur.
  • Action Description (AD): how a momentary event unfolds, including direction, trajectory, target, interaction, or state change.
  • Temporal Reasoning (TR): how the pre-event state, momentary event, and post-event state imply the final answer.

πŸ“Š Benchmark Performance

We evaluate 33 proprietary and open-source video MLLMs on Moment-Video. Seed-2.0-Pro, Seed-2.0-Lite, Seed-2.0-Mini, and MIMO-v2.5 use their default frame-sampling settings. All other models are evaluated with 1 FPS and a 64-frame cap, except GPT-5.4, which uses a 50-frame cap.

The best-performing model, Seed-2.0-Pro, reaches only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in current models' ability to capture and use brief but decisive visual evidence. In each column, the best result is shown in bold and the second-best result is underlined.

🎯 Overall accuracy by task type

Model TO (%) TC (%) AD (%) TR (%) Overall (%)
Seed-2.0-Pro πŸ…50.3731.1447.0842.3539.6
Seed-2.0-Lite πŸ₯ˆ34.8125.6436.0440.0031.3
Seed-2.0-Mini πŸ₯‰32.5922.8833.7725.8827.8
Gemini-3.1-Pro32.5919.7033.4434.1226.9
Gemini-3-Flash41.4818.4331.8232.9426.9
Gemini-3.1-Flash-Lite34.0718.6430.5227.0625.1
Kimi-2.625.9315.8936.6930.5924.9
Qwen3.5-27B20.7419.0732.4727.0624.1
Qwen3.5-397B-A17B27.4114.8332.7923.5322.8
Qwen3.6-27B19.2617.1631.8223.5322.5
MIMO-v2.518.5217.1630.8423.5322.1
Qwen3.5-122B-A10B25.1914.1927.6023.5320.6
GPT-5.421.4812.0830.1929.4120.4
Qwen3.6-35B-A3B17.0414.4129.2224.7120.2
Qwen3.5-35B-A3B20.0016.1026.3018.8220.0
Gemma-4-31B30.3713.5624.3522.3519.9
Qwen3.5-9B18.5213.3526.6218.8218.6
Qwen3.5-4B19.2613.7721.4317.6517.2
Qwen3-VL-235B-A22B15.5612.0823.3823.5317.0
InternVL3.5-30B-A3B12.5916.1021.7511.7617.0
Qwen3-VL-30B-A3B18.5213.1420.4520.0016.7
InternVL3.5-8B15.5615.0420.1315.2916.7
LLaVA-Video-72B13.3315.0418.8317.6516.2
InternVL3.5-241B-A28B19.2611.0220.4518.8215.7
Keye-VL-1.5-8B14.8112.9220.4514.1215.6
InternVL3.5-4B14.8113.1417.867.0614.3
GLM-4.6V15.568.6919.8121.1814.1
Qwen3-VL-8B15.5610.5917.8616.4714.0
Qwen3-VL-4B14.818.6919.4821.1813.9
LLaVA-Video-7B11.1113.1413.967.0612.6
VideoLLaMA3-7B7.5213.6911.728.3311.8
GLM-4.6V-Flash11.117.2014.2920.0011.0
VITA-1.519.268.4913.4210.5910.6

🌐 Overall accuracy by video domain

Model AIGC (%) GUI (%) Nature (%) Industry (%) Games (%) Human (%) Animal (%) Overall (%)
Seed-2.0-Pro πŸ…45.5733.9469.4441.3239.3727.2055.0039.6
Seed-2.0-Lite πŸ₯ˆ22.7827.9851.3930.5825.6226.8052.0031.3
Seed-2.0-Mini πŸ₯‰24.0522.9458.3329.7522.5020.8043.0027.8
Gemini-3.1-Pro27.8527.5244.4418.1824.3719.2046.0026.9
Gemini-3-Flash45.5720.6443.0615.7020.6223.2047.0026.9
Gemini-3.1-Flash-Lite27.8522.0241.6722.3120.0014.0057.0025.1
Kimi-2.615.1937.1641.6712.4016.2520.4034.0024.9
Qwen3.5-27B11.3921.5641.6718.1818.7520.0053.0024.1
Qwen3.5-397B-A17B16.4623.3947.2218.1817.5014.8043.0022.8
Qwen3.6-27B10.1326.6141.6719.0116.2514.4044.0022.5
MIMO-v2.52.5324.3152.7819.0119.3814.8037.0022.1
Qwen3.5-122B-A10B18.9922.0240.2817.3613.7514.4035.0020.6
GPT-5.46.3323.8540.284.9618.7518.0037.0020.4
Qwen3.6-35B-A3B3.8018.3541.6722.3115.6215.2039.0020.2
Qwen3.5-35B-A3B8.8619.2745.8319.8313.7512.8040.0020.0
Gemma-4-31B27.8511.9341.6717.3615.6215.6036.0019.9
Qwen3.5-9B10.1317.4340.2812.4011.8714.8040.0018.6
Qwen3.5-4B8.8616.9738.8913.2215.0011.2032.0017.2
Qwen3-VL-235B-A22B5.0616.9738.8910.7410.6214.0036.0017.0
InternVL3.5-30B-A3B3.8010.5538.8915.7015.0018.4027.0017.0
Qwen3-VL-30B-A3B8.8614.2243.0614.0515.629.6032.0016.7
InternVL3.5-8B3.8011.9338.8913.2213.7514.0037.0016.7
LLaVA-Video-72B0.008.2634.7219.8316.2515.2031.0016.2
InternVL3.5-241B-A28B1.2712.3940.289.0911.8713.6036.0015.7
Keye-VL-1.5-8B0.0010.0929.1717.3614.3714.4033.0015.6
InternVL3.5-4B3.8010.5529.1714.0513.7510.8030.0014.3
GLM-4.6V5.0610.5531.947.4411.2512.4033.0014.1
Qwen3-VL-8B8.8613.3041.679.926.888.8029.0014.0
Qwen3-VL-4B1.2714.2241.677.449.389.2030.0013.9
LLaVA-Video-7B0.008.7223.618.2615.6212.8023.0012.6
VideoLLaMA3-7B0.0010.4425.354.9610.1415.6415.0011.8
GLM-4.6V-Flash1.278.7243.0613.2210.006.0012.0011.0
VITA-1.50.005.3125.008.2611.258.8026.0010.6

πŸš€ Quick Start

πŸ“ 1. Data Preparation

The benchmark annotations are provided in data/annotation_all.csv and data/annotation_all.json. Place videos under the following structure:

data/videos/{Category}/{Subclass}/{Index}.mp4

For example, a sample with Category=Human, Subclass=Basketball, and Index=0001 should be stored as:

data/videos/human/basketball/1.mp4

🎞️ 2. Native Video Inference

Use this path for OpenAI-compatible endpoints that accept native video_url input.

python scripts/inference_native_video.py \
  --input-csv data/annotation_all.csv \
  --video-root data/videos \
  --output-dir result/output \
  --model Qwen3-VL-4B-Instruct \
  --base-url http://127.0.0.1:8085/v1

πŸ–ΌοΈ 3. Sampled-Frame Inference

Use this path when a model consumes sampled frames as multiple image inputs.

python scripts/inference_sampled_frames.py \
  --input-csv data/annotation_all.csv \
  --video-root data/videos \
  --output-dir result/output_frames \
  --model Kimi-2.6 \
  --base-url http://127.0.0.1:8085/v1 \
  --sample-fps 1 \
  --max-sampled-frames 64

βš–οΈ 4. LMM-as-a-Judge Evaluation

We provide an example evaluator that calls an LLM judge via an OpenAI-compatible API to assess model answers. Closed multiple-choice samples use the stored closed-question pass flag when available; open-ended samples are judged semantically against the reference answer.

python scripts/eval_llm_judge_openrouter.py \
  --input-csv result/output/result_video_{MODEL_NAME}_{TIMESTAMP}.csv \
  --output-dir result/judge \
  --workers 8

Set the appropriate API key (e.g., OPENROUTER_API_KEY) before running. The evaluator reports overall accuracy and grouped accuracy by answer type, video domain, subcategory, and task type.

πŸ“š Citation

Citation will be added after the arXiv version is available.

Coming soon.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors