feat: add Video-MME-v2 benchmark task by mwxely · Pull Request #1289 · EvolvingLMMs-Lab/lmms-eval

mwxely · 2026-04-09T01:44:59Z

Summary

Add the Video-MME-v2 benchmark (800 videos, 3,200 eight-option multiple-choice questions, A-H)
Implement grouped non-linear scoring: relevance (quadratic) + logic (chain-based with dependency DAGs)
Report per-level (3 cognitive levels), per-category (10+33 categories), and group-type breakdowns
Scoring logic ported from official evaluation code and verified against VLMEvalKit implementation
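
As a rough illustration of what "quadratic" relevance scoring can mean, the sketch below squares the fraction of correct answers within one question group. The PR description does not give the exact formula, so both the function name `relevance_score` and the formula itself are assumptions, not the ported `cal_relevance` code.

```python
from typing import Sequence


def relevance_score(correct_flags: Sequence[bool]) -> float:
    """Assumed quadratic rule: (fraction of correct answers in the group) squared."""
    if not correct_flags:
        return 0.0
    fraction = sum(bool(flag) for flag in correct_flags) / len(correct_flags)
    # Squaring penalizes partially correct groups more than plain accuracy would.
    return fraction ** 2


if __name__ == "__main__":
    # 3 of 4 correct: plain accuracy would be 0.75, the squared score is lower.
    print(relevance_score([True, True, True, False]))  # 0.5625
```

Under this assumption a group only earns full credit when every question in it is answered correctly, which is the non-linear behaviour the summary refers to.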

Tasks added

Task	Description
`videomme_v2`	Standard eval (no subtitles)
`videomme_v2_w_subtitle`	With concatenated subtitles from word-level JSONL files
`videomme_v2_reasoning`	Chain-of-thought reasoning mode (max_new_tokens=4096)
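
For the subtitle variant in the table above, a minimal sketch of the concatenated mode might look like the following. The JSONL field name `word`, the helper names, and the header sentence placed before the subtitles are all hypothetical; the PR does not show the actual file schema or prompt wording.

```python
import json
from pathlib import Path


def load_subtitle_text(path: str) -> str:
    """Join word-level JSONL records into one string; return "" if the file is missing."""
    subtitle_file = Path(path)
    if not subtitle_file.is_file():
        return ""  # graceful fallback: the task runs without subtitles
    words = []
    with subtitle_file.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            word = record.get("word", "")  # "word" is an assumed field name
            if word:
                words.append(word)
    return " ".join(words)


def prepend_subtitles(prompt: str, subtitle_text: str) -> str:
    """Put the subtitle text in front of the question prompt when any is available."""
    if not subtitle_text:
        return prompt
    # The header wording is illustrative, not the official prompt.
    return f"Subtitles for this video:\n{subtitle_text}\n\n{prompt}"
```

Returning an empty string for a missing file keeps the fallback path identical to the no-subtitle task, so a missing subtitle never changes the prompt structure.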

Details

Dataset: MME-Benchmarks/Video-MME-v2
Paper: Video-MME-v2 Tech Report
VLMEvalKit reference: open-compass/VLMEvalKit@ca905c5
Key differences from v1: 8 options (vs 4), grouped non-linear scoring (vs simple accuracy), cognitive level splits (vs duration splits)

Test plan

All 3 tasks register correctly
Unit tests: scoring functions (all 3 group structures + relevance)
Unit tests: answer extraction (11 prefixes including Final Answer:/Answer:/Option:)
Unit tests: subtitle loading + prompt integration
Unit tests: reasoning prompt format
pre-commit (black + isort) passes
Smoke test with Qwen3-VL-8B: all 3 tasks run end-to-end (--limit 8)
Prompt format aligned with official INSTRUCT_PROMPT and THINK_PROMPT
Scoring verified identical to VLMEvalKit cal_relevance + cal_logic
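
The answer-extraction step covered by the unit tests above can be sketched as follows. Only the three prefixes named in the test plan are included; the remaining eight are not listed in this PR description, and the function name `extract_answer` is an assumption rather than the name used in the task's utils.

```python
import re

# Illustrative subset: the PR mentions 11 prefixes but names only these three.
ANSWER_PREFIXES = ("Final Answer:", "Answer:", "Option:")
_PREFIX_RE = re.compile("|".join(re.escape(p) for p in ANSWER_PREFIXES), re.IGNORECASE)
_LETTER_RE = re.compile(r"\b([A-H])\b")  # 8 options, A through H


def extract_answer(response: str) -> str:
    """Return the first standalone A-H letter after the last known prefix, or ""."""
    text = response.strip()
    last_end = 0
    for match in _PREFIX_RE.finditer(text):
        last_end = match.end()  # keep only the text after the final prefix
    found = _LETTER_RE.search(text[last_end:])
    return found.group(1) if found else ""


if __name__ == "__main__":
    print(extract_answer("Let me think. Answer: A or maybe not. Final Answer: (C)"))  # C
```

Searching after the last prefix matters mostly for the reasoning task, where intermediate guesses can appear before the final answer line.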

Planned follow-ups

Interleaved subtitle mode: The official eval supports inserting subtitle tokens between video frames aligned by timestamp (--subtitle-interleave). In lmms-eval, this requires coordination with the model adapter layer (each model handles frame/text interleaving differently), so it cannot be implemented purely at the task level. Will need model-specific support.
Per-level/per-type metrics as separate columns: The official eval reports level_1/2/3, relevance_score, logic_score as independent metrics. Currently these are logged via eval_logger.info() during aggregation. Exposing them as separate reportable metrics requires adding multiple metric_list entries with dedicated aggregation functions. Current implementation reports the overall grouped score as the primary metric.

- Grouped non-linear scoring: relevance (quadratic) + logic (chain-based)
- 3 group structures: [1,2,3,4], [1,[2,3],4], [[1,2],3,4]
- Answer extraction with 11 prefix patterns (A-H range)
- Per-level, per-category, per-group-type breakdown reporting
- Prompt aligned with official INSTRUCT_PROMPT
- Verified against VLMEvalKit implementation
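
To make the three group structures concrete, here is a minimal sketch of one possible chain rule. It assumes that a nested list such as `[2,3]` is a stage whose questions must all be correct, and that credit stops at the first failed stage. The PR does not spell out the actual rule, so this is an illustration of chain-based scoring in general, not a reproduction of `cal_logic`.

```python
from typing import Dict, List, Union

Stage = Union[int, List[int]]


def logic_score(structure: List[Stage], correct: Dict[int, bool]) -> float:
    """Assumed chain rule: credit questions stage by stage until a stage fails."""
    total = 0
    credited = 0
    chain_broken = False
    for stage in structure:
        question_ids = stage if isinstance(stage, list) else [stage]
        total += len(question_ids)
        if chain_broken:
            continue  # later stages depend on the failed one and earn nothing
        if all(correct.get(qid, False) for qid in question_ids):
            credited += len(question_ids)
        else:
            chain_broken = True
    return credited / total if total else 0.0


if __name__ == "__main__":
    answers = {1: True, 2: True, 3: False, 4: True}
    print(logic_score([1, [2, 3], 4], answers))  # 0.25
```

With these assumptions, question 4 above earns nothing even though it was answered correctly, because the `[2,3]` stage it depends on failed.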

kcz358 · 2026-04-09T07:34:42Z

The aggregation could be revised a bit to report sub-category scores better, as #1285 does. You can have an agent refer to that PR and adapt it. Otherwise LGTM.

Per review feedback (ref: PR #1285 pattern):
- Report relevance/logic group-type scores separately
- Report per-level (1/2/3) scores separately
- Refactor aggregate logic into _compute_all_subscores helper
- process_results returns same entry under all 6 metric keys
- Detailed second_head/third_head breakdowns still logged

mwxely · 2026-04-09T07:53:12Z

Thanks @kcz358 for the review! Updated in 81e162e; the task now reports 6 separate metrics following the PR #1285 pattern:

videomme_v2_score (overall)
videomme_v2_relevance_score / videomme_v2_logic_score (per group type)
videomme_v2_level_1 / level_2 / level_3 (per cognitive level)

Detailed second_head/third_head breakdowns are still logged via eval_logger.info.
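
The "same entry under all 6 metric keys" pattern can be sketched as follows. The document fields `level`, `group_type`, and `answer`, the function names, and the use of plain accuracy in place of the grouped score are all simplifying assumptions made for illustration; the merged code uses a `_compute_all_subscores` helper whose body is not shown in this PR description.

```python
from typing import Any, Dict, List

METRIC_KEYS = (
    "videomme_v2_score",
    "videomme_v2_relevance_score",
    "videomme_v2_logic_score",
    "videomme_v2_level_1",
    "videomme_v2_level_2",
    "videomme_v2_level_3",
)


def process_results_sketch(doc: Dict[str, Any], prediction: str) -> Dict[str, Dict[str, Any]]:
    """Return one per-question entry, repeated under every metric key."""
    entry = {
        "level": doc["level"],            # hypothetical field name
        "group_type": doc["group_type"],  # hypothetical field name
        "correct": prediction == doc["answer"],
    }
    return {key: entry for key in METRIC_KEYS}


def aggregate_level(entries: List[Dict[str, Any]], level: int) -> float:
    """Per-level aggregation; plain accuracy stands in for the grouped score here."""
    subset = [e for e in entries if e["level"] == level]
    if not subset:
        return 0.0
    return 100.0 * sum(e["correct"] for e in subset) / len(subset)
```

Because each metric key receives the full entry, every aggregation function can filter for its own slice (a level or a group type) without the per-question step needing to know which metric it feeds.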

mwxely force-pushed the feat/video-mme-v2 branch 3 times, most recently from 272c0d2 to a98c362, April 9, 2026 02:59

mwxely added 4 commits April 9, 2026 03:32

feat(videomme_v2): add task config and default template

15733fc

- Dataset: MME-Benchmarks/Video-MME-v2 (800 videos, 3200 questions)
- 8-option MCQ (A-H) with grouped non-linear scoring
- Generation config: max_new_tokens=64, temperature=0

feat(videomme_v2): add subtitle variant (concatenated mode)

3b5ddf5

- Load word-level JSONL subtitles and prepend to prompt
- Graceful fallback when subtitle file is missing
- Task: videomme_v2_w_subtitle

feat(videomme_v2): add reasoning mode variant

73603b4

- Chain-of-thought prompt requiring Final Answer: <letter> format
- max_new_tokens=4096 for reasoning space
- Task: videomme_v2_reasoning

mwxely force-pushed the feat/video-mme-v2 branch from a98c362 to 73603b4, April 9, 2026 03:32

mwxely requested review from Luodian and kcz358 April 9, 2026 03:43

kcz358 reviewed Apr 9, 2026

kcz358 approved these changes Apr 9, 2026

kcz358 merged commit 52c5620 into main Apr 9, 2026
3 checks passed

kcz358 deleted the feat/video-mme-v2 branch April 9, 2026 09:54

kcz358 merged 5 commits into main from feat/video-mme-v2
