Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Xiaolin Liu*, Yilun Zhu*, Xiangyu Zhao*, Xuehui Wang,

Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang

🎬 Introduction

Moment-Video is a benchmark for diagnosing the temporal fidelity of video multimodal large language models (MLLMs) on momentary visual events: localized actions or state transitions that may last only a few frames, yet determine the correct answer.

Unlike benchmarks centered on persistent objects, global scene context, or long-form semantic aggregation, Moment-Video asks whether models can notice, count, describe, and reason over brief answer-critical evidence. The benchmark contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering both real-world and virtual scenarios.

Moment-Video evaluates four complementary task types:

Temporal Occurrence (TO): whether a brief event or state transition happens.
Temporal Counting (TC): how many transient actions, object changes, or repeated events occur.
Action Description (AD): how a momentary event unfolds, including direction, trajectory, target, interaction, or state change.
Temporal Reasoning (TR): how the pre-event state, momentary event, and post-event state imply the final answer.

📊 Benchmark Performance

We evaluate 33 proprietary and open-source video MLLMs on Moment-Video. Seed-2.0-Pro, Seed-2.0-Lite, Seed-2.0-Mini, and MIMO-v2.5 use their default frame-sampling settings. All other models are evaluated with 1 FPS and a 64-frame cap, except GPT-5.4, which uses a 50-frame cap.

The best-performing model, Seed-2.0-Pro, reaches only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in current models' ability to capture and use brief but decisive visual evidence. In each column, the best result is shown in bold and the second-best result is underlined.

🎯 Overall accuracy by task type

Model	TO (%)	TC (%)	AD (%)	TR (%)	Overall (%)
Seed-2.0-Pro 🏅	50.37	31.14	47.08	42.35	39.6
Seed-2.0-Lite 🥈	34.81	25.64	36.04	40.00	31.3
Seed-2.0-Mini 🥉	32.59	22.88	33.77	25.88	27.8
Gemini-3.1-Pro	32.59	19.70	33.44	34.12	26.9
Gemini-3-Flash	41.48	18.43	31.82	32.94	26.9
Gemini-3.1-Flash-Lite	34.07	18.64	30.52	27.06	25.1
Kimi-2.6	25.93	15.89	36.69	30.59	24.9
Qwen3.5-27B	20.74	19.07	32.47	27.06	24.1
Qwen3.5-397B-A17B	27.41	14.83	32.79	23.53	22.8
Qwen3.6-27B	19.26	17.16	31.82	23.53	22.5
MIMO-v2.5	18.52	17.16	30.84	23.53	22.1
Qwen3.5-122B-A10B	25.19	14.19	27.60	23.53	20.6
GPT-5.4	21.48	12.08	30.19	29.41	20.4
Qwen3.6-35B-A3B	17.04	14.41	29.22	24.71	20.2
Qwen3.5-35B-A3B	20.00	16.10	26.30	18.82	20.0
Gemma-4-31B	30.37	13.56	24.35	22.35	19.9
Qwen3.5-9B	18.52	13.35	26.62	18.82	18.6
Qwen3.5-4B	19.26	13.77	21.43	17.65	17.2
Qwen3-VL-235B-A22B	15.56	12.08	23.38	23.53	17.0
InternVL3.5-30B-A3B	12.59	16.10	21.75	11.76	17.0
Qwen3-VL-30B-A3B	18.52	13.14	20.45	20.00	16.7
InternVL3.5-8B	15.56	15.04	20.13	15.29	16.7
LLaVA-Video-72B	13.33	15.04	18.83	17.65	16.2
InternVL3.5-241B-A28B	19.26	11.02	20.45	18.82	15.7
Keye-VL-1.5-8B	14.81	12.92	20.45	14.12	15.6
InternVL3.5-4B	14.81	13.14	17.86	7.06	14.3
GLM-4.6V	15.56	8.69	19.81	21.18	14.1
Qwen3-VL-8B	15.56	10.59	17.86	16.47	14.0
Qwen3-VL-4B	14.81	8.69	19.48	21.18	13.9
LLaVA-Video-7B	11.11	13.14	13.96	7.06	12.6
VideoLLaMA3-7B	7.52	13.69	11.72	8.33	11.8
GLM-4.6V-Flash	11.11	7.20	14.29	20.00	11.0
VITA-1.5	19.26	8.49	13.42	10.59	10.6

🌐 Overall accuracy by video domain

Model	AIGC (%)	GUI (%)	Nature (%)	Industry (%)	Games (%)	Human (%)	Animal (%)	Overall (%)
Seed-2.0-Pro 🏅	45.57	33.94	69.44	41.32	39.37	27.20	55.00	39.6
Seed-2.0-Lite 🥈	22.78	27.98	51.39	30.58	25.62	26.80	52.00	31.3
Seed-2.0-Mini 🥉	24.05	22.94	58.33	29.75	22.50	20.80	43.00	27.8
Gemini-3.1-Pro	27.85	27.52	44.44	18.18	24.37	19.20	46.00	26.9
Gemini-3-Flash	45.57	20.64	43.06	15.70	20.62	23.20	47.00	26.9
Gemini-3.1-Flash-Lite	27.85	22.02	41.67	22.31	20.00	14.00	57.00	25.1
Kimi-2.6	15.19	37.16	41.67	12.40	16.25	20.40	34.00	24.9
Qwen3.5-27B	11.39	21.56	41.67	18.18	18.75	20.00	53.00	24.1
Qwen3.5-397B-A17B	16.46	23.39	47.22	18.18	17.50	14.80	43.00	22.8
Qwen3.6-27B	10.13	26.61	41.67	19.01	16.25	14.40	44.00	22.5
MIMO-v2.5	2.53	24.31	52.78	19.01	19.38	14.80	37.00	22.1
Qwen3.5-122B-A10B	18.99	22.02	40.28	17.36	13.75	14.40	35.00	20.6
GPT-5.4	6.33	23.85	40.28	4.96	18.75	18.00	37.00	20.4
Qwen3.6-35B-A3B	3.80	18.35	41.67	22.31	15.62	15.20	39.00	20.2
Qwen3.5-35B-A3B	8.86	19.27	45.83	19.83	13.75	12.80	40.00	20.0
Gemma-4-31B	27.85	11.93	41.67	17.36	15.62	15.60	36.00	19.9
Qwen3.5-9B	10.13	17.43	40.28	12.40	11.87	14.80	40.00	18.6
Qwen3.5-4B	8.86	16.97	38.89	13.22	15.00	11.20	32.00	17.2
Qwen3-VL-235B-A22B	5.06	16.97	38.89	10.74	10.62	14.00	36.00	17.0
InternVL3.5-30B-A3B	3.80	10.55	38.89	15.70	15.00	18.40	27.00	17.0
Qwen3-VL-30B-A3B	8.86	14.22	43.06	14.05	15.62	9.60	32.00	16.7
InternVL3.5-8B	3.80	11.93	38.89	13.22	13.75	14.00	37.00	16.7
LLaVA-Video-72B	0.00	8.26	34.72	19.83	16.25	15.20	31.00	16.2
InternVL3.5-241B-A28B	1.27	12.39	40.28	9.09	11.87	13.60	36.00	15.7
Keye-VL-1.5-8B	0.00	10.09	29.17	17.36	14.37	14.40	33.00	15.6
InternVL3.5-4B	3.80	10.55	29.17	14.05	13.75	10.80	30.00	14.3
GLM-4.6V	5.06	10.55	31.94	7.44	11.25	12.40	33.00	14.1
Qwen3-VL-8B	8.86	13.30	41.67	9.92	6.88	8.80	29.00	14.0
Qwen3-VL-4B	1.27	14.22	41.67	7.44	9.38	9.20	30.00	13.9
LLaVA-Video-7B	0.00	8.72	23.61	8.26	15.62	12.80	23.00	12.6
VideoLLaMA3-7B	0.00	10.44	25.35	4.96	10.14	15.64	15.00	11.8
GLM-4.6V-Flash	1.27	8.72	43.06	13.22	10.00	6.00	12.00	11.0
VITA-1.5	0.00	5.31	25.00	8.26	11.25	8.80	26.00	10.6

🚀 Quick Start

📁 1. Data Preparation

The benchmark annotations are provided in data/annotation_all.csv and data/annotation_all.json. Place videos under the following structure:

data/videos/{Category}/{Subclass}/{Index}.mp4

For example, a sample with Category=Human, Subclass=Basketball, and Index=0001 should be stored as:

data/videos/human/basketball/1.mp4

🎞️ 2. Native Video Inference

Use this path for OpenAI-compatible endpoints that accept native video_url input.

python scripts/inference_native_video.py \
  --input-csv data/annotation_all.csv \
  --video-root data/videos \
  --output-dir result/output \
  --model Qwen3-VL-4B-Instruct \
  --base-url http://127.0.0.1:8085/v1

🖼️ 3. Sampled-Frame Inference

Use this path when a model consumes sampled frames as multiple image inputs.

python scripts/inference_sampled_frames.py \
  --input-csv data/annotation_all.csv \
  --video-root data/videos \
  --output-dir result/output_frames \
  --model Kimi-2.6 \
  --base-url http://127.0.0.1:8085/v1 \
  --sample-fps 1 \
  --max-sampled-frames 64

⚖️ 4. LMM-as-a-Judge Evaluation

We provide an example evaluator that calls an LLM judge via an OpenAI-compatible API to assess model answers. Closed multiple-choice samples use the stored closed-question pass flag when available; open-ended samples are judged semantically against the reference answer.

python scripts/eval_llm_judge_openrouter.py \
  --input-csv result/output/result_video_{MODEL_NAME}_{TIMESTAMP}.csv \
  --output-dir result/judge \
  --workers 8

Set the appropriate API key (e.g., OPENROUTER_API_KEY) before running. The evaluator reports overall accuracy and grouped accuracy by answer type, video domain, subcategory, and task type.

📚 Citation

Citation will be added after the arXiv version is available.

Coming soon.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
data		data
images		images
scripts		scripts
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

🎬 Introduction

📊 Benchmark Performance

🎯 Overall accuracy by task type

🌐 Overall accuracy by video domain

🚀 Quick Start

📁 1. Data Preparation

🎞️ 2. Native Video Inference

🖼️ 3. Sampled-Frame Inference

⚖️ 4. LMM-as-a-Judge Evaluation

📚 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

🎬 Introduction

📊 Benchmark Performance

🎯 Overall accuracy by task type

🌐 Overall accuracy by video domain

🚀 Quick Start

📁 1. Data Preparation

🎞️ 2. Native Video Inference

🖼️ 3. Sampled-Frame Inference

⚖️ 4. LMM-as-a-Judge Evaluation

📚 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages