feat: add Video-MME-v2 benchmark task #1289

Merged: kcz358 merged 5 commits into main from feat/video-mme-v2 on Apr 9, 2026

mwxely (Collaborator) commented Apr 9, 2026

Summary

  • Add Video-MME-v2 benchmark (800 videos; 3,200 multiple-choice questions, each with 8 options A-H)
  • Implement grouped non-linear scoring: relevance (quadratic) + logic (chain-based with dependency DAGs)
  • Report per-level (3 cognitive levels), per-category (10+33 categories), and group-type breakdowns
  • Scoring logic ported from official evaluation code and verified against VLMEvalKit implementation
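
To make the grouped scoring described above concrete, here is a minimal sketch, not the ported implementation. It assumes that "quadratic" means the squared fraction of correct answers in a group, and that the chain-based logic score stops crediting answers once a step in the dependency chain fails. The function names, the linear normalization of the logic score, and the handling of nested steps such as [2, 3] are all assumptions; the official cal_relevance and cal_logic may differ.

```python
from typing import List, Sequence, Union

# A step is one question id, or a list of question ids answered in parallel.
Step = Union[int, List[int]]


def relevance_score(correct: Sequence[bool]) -> float:
    """Assumed quadratic rule: squared fraction of correct answers in a group."""
    if not correct:
        return 0.0
    return (sum(correct) / len(correct)) ** 2


def logic_score(correct: Sequence[bool], structure: Sequence[Step]) -> float:
    """Assumed chain rule: credit answers until the chain first breaks.

    A nested step such as [2, 3] is treated as parallel questions: each correct
    one earns credit, but the chain continues only if all of them are correct.
    """
    total = sum(len(step) if isinstance(step, list) else 1 for step in structure)
    credited = 0
    for step in structure:
        ids = step if isinstance(step, list) else [step]
        hits = [correct[i - 1] for i in ids]  # question ids are 1-based
        credited += sum(hits)
        if not all(hits):
            break  # later steps depend on this one, so they earn nothing
    return credited / total


# Three of four correct is worth less than 0.75 under the quadratic rule.
print(relevance_score([True, True, True, False]))          # 0.5625
# Question 2 is wrong, so questions 3 and 4 earn no credit in a linear chain.
print(logic_score([True, False, True, True], [1, 2, 3, 4]))  # 0.25
```

Under these assumptions, a model that answers only some questions in a group correctly is penalized more heavily than plain per-question accuracy would penalize it.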

Tasks added

| Task | Description |
| --- | --- |
| videomme_v2 | Standard eval (no subtitles) |
| videomme_v2_w_subtitle | With concatenated subtitles from word-level JSONL files |
| videomme_v2_reasoning | Chain-of-thought reasoning mode (max_new_tokens=4096) |

Test plan

  • All 3 tasks register correctly
  • Unit tests: scoring functions (all 3 group structures + relevance)
  • Unit tests: answer extraction (11 prefixes including Final Answer:/Answer:/Option:)
  • Unit tests: subtitle loading + prompt integration
  • Unit tests: reasoning prompt format
  • pre-commit (black + isort) passes
  • Smoke test with Qwen3-VL-8B: all 3 tasks run end-to-end (--limit 8)
  • Prompt format aligned with official INSTRUCT_PROMPT and THINK_PROMPT
  • Scoring verified identical to VLMEvalKit cal_relevance + cal_logic
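
The answer-extraction tests above could exercise something like the following sketch. It covers only the three prefixes named in the test plan (Final Answer:, Answer:, Option:); the other eight prefixes are not listed in this PR description and are not guessed here. The function name extract_answer and both regular expressions are illustrative assumptions, not the code under test.

```python
import re
from typing import Optional

# Prefix matching is case-insensitive, but the option letter must be uppercase A-H.
_PREFIXED = re.compile(r"(?i:final answer|answer|option)\s*:\s*\(?([A-H])\b")
# A bare reply such as "B", "B." or "(C)" with no prefix at all.
_BARE = re.compile(r"\(?([A-H])\)?(?:[.:,]|\s*$)")


def extract_answer(response: str) -> Optional[str]:
    """Return the chosen option letter (A-H), or None if none can be found."""
    text = response.strip()
    prefixed = _PREFIXED.findall(text)
    if prefixed:
        # The last match wins, so a chain-of-thought reply that revises itself
        # is scored on its final stated answer.
        return prefixed[-1]
    bare = _BARE.match(text)
    return bare.group(1) if bare else None


print(extract_answer("Final Answer: C"))                   # C
print(extract_answer("Answer: A. Wait. Final Answer: (D)"))  # D
print(extract_answer("I am not sure"))                     # None
```

Restricting the character class to A-H means letters outside the 8-option range are rejected rather than silently counted as wrong picks.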

Planned follow-ups

  • Interleaved subtitle mode: The official eval supports inserting subtitle tokens between video frames aligned by timestamp (--subtitle-interleave). In lmms-eval, this requires coordination with the model adapter layer (each model handles frame/text interleaving differently), so it cannot be implemented purely at the task level. Will need model-specific support.
  • Per-level/per-type metrics as separate columns: The official eval reports level_1/2/3, relevance_score, logic_score as independent metrics. Currently these are logged via eval_logger.info() during aggregation. Exposing them as separate reportable metrics requires adding multiple metric_list entries with dedicated aggregation functions. Current implementation reports the overall grouped score as the primary metric.

mwxely force-pushed the feat/video-mme-v2 branch 3 times, most recently from 272c0d2 to a98c362 (April 9, 2026 02:59)
mwxely added 4 commits April 9, 2026 03:32
- Dataset: MME-Benchmarks/Video-MME-v2 (800 videos, 3200 questions)
- 8-option MCQ (A-H) with grouped non-linear scoring
- Generation config: max_new_tokens=64, temperature=0
- Grouped non-linear scoring: relevance (quadratic) + logic (chain-based)
- 3 group structures: [1,2,3,4], [1,[2,3],4], [[1,2],3,4]
- Answer extraction with 11 prefix patterns (A-H range)
- Per-level, per-category, per-group-type breakdown reporting
- Prompt aligned with official INSTRUCT_PROMPT
- Verified against VLMEvalKit implementation
- Load word-level JSONL subtitles and prepend to prompt
- Graceful fallback when subtitle file is missing
- Task: videomme_v2_w_subtitle
- Chain-of-thought prompt requiring Final Answer: <letter> format
- max_new_tokens=4096 for reasoning space
- Task: videomme_v2_reasoning
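
The subtitle commit above (word-level JSONL loading with a graceful fallback) might look roughly like the sketch below. The record layout is an assumption: each JSONL line is taken to be an object with a "word" field. The function names and the "Subtitles:" header are also hypothetical and may not match what the task actually uses.

```python
import json
from pathlib import Path


def load_subtitle_text(path: str, word_key: str = "word") -> str:
    """Concatenate word-level subtitle records into a single string.

    Returns an empty string when the file is missing, so the caller can fall
    back to the no-subtitle prompt instead of raising.
    """
    file = Path(path)
    if not file.is_file():
        return ""
    words = []
    with file.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            word = record.get(word_key)  # "word" is an assumed field name
            if word:
                words.append(str(word).strip())
    return " ".join(words)


def build_prompt(question: str, subtitle_path: str) -> str:
    """Prepend subtitles to the question when they are available."""
    subtitles = load_subtitle_text(subtitle_path)
    if not subtitles:
        return question
    return f"Subtitles: {subtitles}\n{question}"
```

Returning an empty string rather than raising keeps videomme_v2_w_subtitle runnable on videos that ship without a subtitle file.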
mwxely force-pushed the feat/video-mme-v2 branch from a98c362 to 73603b4 (April 9, 2026 03:32)
mwxely requested review from Luodian and kcz358 April 9, 2026 03:43

The aggregation could be revised a bit to report sub-category scores better, as #1285 does. An agent can use that PR as a reference and adapt it. Otherwise LGTM.

Per review feedback (ref: PR #1285 pattern):
- Report relevance/logic group-type scores separately
- Report per-level (1/2/3) scores separately
- Refactor aggregate logic into _compute_all_subscores helper
- process_results returns same entry under all 6 metric keys
- Detailed second_head/third_head breakdowns still logged
mwxely (Collaborator, Author) commented Apr 9, 2026

Thanks @kcz358 for the review! Updated in 81e162e, which now reports 6 separate metrics following the PR #1285 pattern:

  • videomme_v2_score (overall)
  • videomme_v2_relevance_score / videomme_v2_logic_score (per group type)
  • videomme_v2_level_1 / level_2 / level_3 (per cognitive level)

Detailed second_head/third_head breakdowns are still logged via eval_logger.info.
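
The fan-out described above, where process_results returns the same entry under all six metric keys and each key's aggregation picks its own slice, can be sketched as follows. This is an illustration under assumptions: the doc fields (level, group_type, answer) are hypothetical, the level_2 and level_3 key names are expanded by analogy with videomme_v2_level_1, and plain accuracy stands in for the grouped non-linear score.

```python
from typing import Callable, Dict, List

METRIC_KEYS = [
    "videomme_v2_score",
    "videomme_v2_relevance_score",
    "videomme_v2_logic_score",
    "videomme_v2_level_1",
    "videomme_v2_level_2",  # assumed full name
    "videomme_v2_level_3",  # assumed full name
]


def process_results(doc: dict, results: List[str]) -> Dict[str, dict]:
    """Build one entry per sample and return it under every metric key."""
    entry = {
        "level": doc["level"],            # hypothetical field: 1, 2 or 3
        "group_type": doc["group_type"],  # hypothetical field: "relevance" or "logic"
        "correct": results[0].strip() == doc["answer"],
    }
    return {key: entry for key in METRIC_KEYS}


def _accuracy(entries: List[dict]) -> float:
    """Stand-in for the grouped score: percentage of correct entries."""
    if not entries:
        return 0.0
    return 100.0 * sum(e["correct"] for e in entries) / len(entries)


def make_level_aggregator(level: int) -> Callable[[List[dict]], float]:
    """Create an aggregation function that scores only one cognitive level."""
    def aggregate(entries: List[dict]) -> float:
        return _accuracy([e for e in entries if e["level"] == level])
    return aggregate


aggregate_level_1 = make_level_aggregator(1)
aggregate_level_2 = make_level_aggregator(2)
aggregate_level_3 = make_level_aggregator(3)
```

Because every key receives the full entry, each aggregation function sees all samples and can filter independently, which avoids recomputing per-sample results once per metric.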

kcz358 merged commit 52c5620 into main Apr 9, 2026
3 checks passed
kcz358 deleted the feat/video-mme-v2 branch April 9, 2026 09:54