Moment-Video is a benchmark for diagnosing the temporal fidelity of video multimodal large language models (MLLMs) on momentary visual events: localized actions or state transitions that may last only a few frames, yet determine the correct answer.
Unlike benchmarks centered on persistent objects, global scene context, or long-form semantic aggregation, Moment-Video asks whether models can notice, count, describe, and reason over brief answer-critical evidence. The benchmark contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering both real-world and virtual scenarios.
Moment-Video evaluates four complementary task types:
Temporal Occurrence (TO): whether a brief event or state transition happens.
Temporal Counting (TC): how many transient actions, object changes, or repeated events occur.
Action Description (AD): how a momentary event unfolds, including direction, trajectory, target, interaction, or state change.
Temporal Reasoning (TR): how the pre-event state, momentary event, and post-event state imply the final answer.
Benchmark Performance
We evaluate 33 proprietary and open-source video MLLMs on Moment-Video. Seed-2.0-Pro, Seed-2.0-Lite, Seed-2.0-Mini, and MIMO-v2.5 use their default frame-sampling settings. All other models are evaluated at 1 FPS with a 64-frame cap, except GPT-5.4, which uses a 50-frame cap.
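To make the sampling setting concrete, here is a minimal sketch of 1 FPS sampling with a frame cap. The function name `sample_frame_times` is hypothetical, and the behavior above the cap (evenly spaced subsampling of the 1 FPS candidates) is an assumption; the actual evaluation code may handle long videos differently.

```python
def sample_frame_times(duration_s: float, fps: float = 1.0, max_frames: int = 64) -> list[float]:
    """Return the timestamps (in seconds) at which frames would be sampled.

    Hypothetical helper: samples at `fps`, then subsamples evenly if the
    number of candidates exceeds `max_frames` (an assumed capping policy).
    """
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    if duration_s <= 0:
        return []

    # Candidate timestamps at the target rate: 0, 1/fps, 2/fps, ...
    n_candidates = max(int(duration_s * fps), 1)
    times = [i / fps for i in range(n_candidates)]

    if len(times) <= max_frames:
        return times
    if max_frames == 1:
        return [times[0]]

    # Over the cap: keep max_frames candidates spread evenly from first to last.
    step = (len(times) - 1) / (max_frames - 1)
    return [times[round(i * step)] for i in range(max_frames)]


if __name__ == "__main__":
    print(len(sample_frame_times(10.0)))                  # short clip: every second is kept
    print(len(sample_frame_times(300.0)))                 # long clip: capped at 64 frames
    print(len(sample_frame_times(300.0, max_frames=50)))  # 50-frame cap variant
```

Under this assumed policy, a 300-second clip yields 64 frames spaced 4 to 5 seconds apart, so an event that lasts only a few frames can fall entirely between two sampled frames.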
The best-performing model, Seed-2.0-Pro, reaches only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in current models' ability to capture and use brief but decisive visual evidence. In each column, the best result is shown in bold and the second-best result is underlined.
We provide an example evaluator that calls an LLM judge via an OpenAI-compatible API to assess model answers. Closed multiple-choice samples use the stored closed-question pass flag when available; open-ended samples are judged semantically against the reference answer.
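The scoring rule described above can be sketched roughly as follows. This is not the repository's evaluator: the field names (`answer_type`, `closed_pass`, `question`, `reference`, `prediction`) and the `judge` callable are hypothetical stand-ins, and falling back to the judge when a closed sample has no stored flag is an assumption.

```python
from typing import Callable

# A judge takes (question, reference answer, model answer) and returns True
# if the model answer is judged correct. In practice this would wrap a call
# to an LLM through an OpenAI-compatible API; here it is just a callable.
Judge = Callable[[str, str, str], bool]


def score_sample(sample: dict, judge: Judge) -> bool:
    """Score one sample using the stored flag or the judge (hypothetical schema)."""
    if sample.get("answer_type") == "closed":
        flag = sample.get("closed_pass")
        if flag is not None:
            # Closed multiple-choice sample with a stored pass flag: trust it.
            return bool(flag)
    # Open-ended sample, or (by assumption) a closed sample without a flag:
    # ask the judge to compare the prediction against the reference.
    return judge(sample["question"], sample["reference"], sample["prediction"])


def exact_match_judge(question: str, reference: str, prediction: str) -> bool:
    """Trivial stand-in for the LLM judge, used only to exercise the sketch."""
    return reference.strip().lower() == prediction.strip().lower()


if __name__ == "__main__":
    closed = {"answer_type": "closed", "closed_pass": True}
    opened = {
        "answer_type": "open",
        "question": "How many times does the ball bounce?",
        "reference": "Three",
        "prediction": "three",
    }
    print(score_sample(closed, exact_match_judge))
    print(score_sample(opened, exact_match_judge))
```

Passing the judge in as a callable keeps the dispatch logic testable without network access; a real judge would replace `exact_match_judge` with a semantic comparison.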
Set the appropriate API key (e.g., OPENROUTER_API_KEY) before running. The evaluator reports overall accuracy and grouped accuracy by answer type, video domain, subcategory, and task type.
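As an illustration of the reported metrics, the sketch below computes overall accuracy and per-group accuracy from a list of scored samples. The record keys (`answer_type`, `domain`, `subcategory`, `task_type`, `correct`) and the function name are assumptions chosen for this example, not the evaluator's actual schema.

```python
from collections import defaultdict

# Assumed grouping fields, mirroring the four breakdowns the evaluator reports.
GROUP_KEYS = ("answer_type", "domain", "subcategory", "task_type")


def grouped_accuracy(results: list[dict]) -> dict:
    """Return overall accuracy plus accuracy per value of each grouping field."""
    total = len(results)
    n_correct = sum(1 for r in results if r["correct"])
    report = {"overall": n_correct / total if total else 0.0}

    for key in GROUP_KEYS:
        # tally[group] = [number correct, number seen]
        tally = defaultdict(lambda: [0, 0])
        for r in results:
            counts = tally[r[key]]
            counts[0] += int(bool(r["correct"]))
            counts[1] += 1
        report[key] = {group: c / n for group, (c, n) in tally.items()}
    return report


if __name__ == "__main__":
    # Toy records with made-up group labels, for illustration only.
    demo = [
        {"answer_type": "closed", "domain": "A", "subcategory": "a1", "task_type": "TO", "correct": True},
        {"answer_type": "open", "domain": "A", "subcategory": "a2", "task_type": "TC", "correct": False},
        {"answer_type": "open", "domain": "B", "subcategory": "b1", "task_type": "TC", "correct": True},
        {"answer_type": "closed", "domain": "B", "subcategory": "b1", "task_type": "TR", "correct": False},
    ]
    report = grouped_accuracy(demo)
    print(report["overall"])
    print(report["task_type"])
```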
Citation
Citation will be added after the arXiv version is available.