[Benchmark Backfill] Integrate TVBench into lmms-eval by Luodian · Pull Request #1160 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-22T12:57:16Z

Summary

Integrate TVBench as a grouped benchmark task (tvbench) with 10 subset task configs under lmms_eval/tasks/tvbench/.
Add TVBench task utilities for video path resolution, prompt formatting, answer letter normalization, and accuracy scoring.
Add task docs mapping in docs/current_tasks.md and unit coverage for registration plus utility behavior.
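
For reviewers unfamiliar with this kind of utility, the answer letter normalization and accuracy scoring described above might look roughly like the sketch below. This is not the code shipped in this PR: the names `normalize_answer_letter`, `score_sample`, and `tvbench_accuracy`, the regular expressions, and the default `valid_letters="ABCD"` are all assumptions made for illustration.

```python
import re
from typing import Iterable, Optional

# Hypothetical helpers for illustration only; not the functions added by this PR.

# A response that starts with an option letter: "B", "(b)", "B.", "B) Take off ..."
_LEADING_LETTER = re.compile(r"^\(?([A-Za-z])\)?(?:[.:)\s]|$)")
# A response that names the letter later: "The answer is: A.", "Answer: B"
_ANSWER_IS = re.compile(
    r"answer\s*(?:is)?\s*:?\s*\(?([A-Za-z])\)?(?![A-Za-z])", re.IGNORECASE
)


def normalize_answer_letter(response: str, valid_letters: str = "ABCD") -> Optional[str]:
    """Return the upper-case option letter found in a response, or None.

    valid_letters is an assumed option set; it filters out stray words
    such as a leading "I" in "I think so".
    """
    text = response.strip()
    for pattern in (_LEADING_LETTER, _ANSWER_IS):
        match = pattern.search(text)
        if match and match.group(1).upper() in valid_letters:
            return match.group(1).upper()
    return None


def score_sample(response: str, reference: str) -> float:
    """Score one sample: 1.0 if the normalized letter equals the reference, else 0.0."""
    predicted = normalize_answer_letter(response)
    return 1.0 if predicted == reference.strip().upper() else 0.0


def tvbench_accuracy(scores: Iterable[float]) -> float:
    """Average per-sample scores; an empty input yields 0.0 by convention here."""
    values = list(scores)
    return sum(values) / len(values) if values else 0.0
```

Under these assumptions, a response of `B` against a reference of `A` scores 0.0, and averaging such samples gives an accuracy of 0.0. The sketch deliberately returns `None` for responses it cannot parse, so unparseable output counts as incorrect rather than raising.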

Validation

uv run python -m unittest discover -s test/eval -p "test_tvbench_task.py"
uv run python -m lmms_eval --tasks list  # output includes tvbench and all tvbench_* tasks
uv run python -m lmms_eval --model dummy_video_reader --model_args response=A,fail_on_missing=false --tasks tvbench_action_antonym --limit 1 --batch_size 1 --output_path outputs/tvbench_smoke --log_samples

Closes #1138

Smoke Validation (limit=8)

Status: PASS (LMM-289 / tvbench)

Output Table

Metric	Value
tvbench_action_antonym.tvbench_acc	0.0

Note: The smoke run covers the single subtask tvbench_action_antonym. The full group has 10 subtasks × 8 samples = 80 video calls, which is too slow for a single smoke run.

Sample Output

Sample 1 (doc_id: 0)

Input: What is the action being performed in the video? ↵ A. Put on a hat/cap. ↵ B. Take off a hat/cap. ↵ Answer with the option letter only.
Model Output: B
Reference: A
Scores: tvbench_acc = 0.0
Tokens: output=1, reasoning=0

Sample 2 (doc_id: 1)

Input: What is the action being performed in the video? ↵ A. Put on a hat/cap. ↵ B. Take off a hat/cap. ↵ Answer with the option letter only.
Model Output: B
Reference: A
Scores: tvbench_acc = 0.0
Tokens: output=1, reasoning=0
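
The Input lines in the samples above suggest a prompt made of the question, lettered options, and a trailing instruction. A minimal sketch of such formatting follows. It assumes that each ↵ in the logged input stands for a newline and that the candidate strings already end with a period. The function name `format_prompt` and its parameters are hypothetical and are not taken from this PR.

```python
import string
from typing import Sequence


def format_prompt(
    question: str,
    candidates: Sequence[str],
    post_prompt: str = "Answer with the option letter only.",
) -> str:
    """Build a multiple-choice prompt with options lettered A, B, C, ...

    Hypothetical helper for illustration; the PR's actual utility may differ.
    """
    lines = [question]
    # Pair each candidate with the next upper-case letter.
    for letter, candidate in zip(string.ascii_uppercase, candidates):
        lines.append(f"{letter}. {candidate}")
    lines.append(post_prompt)
    return "\n".join(lines)


prompt = format_prompt(
    "What is the action being performed in the video?",
    ["Put on a hat/cap.", "Take off a hat/cap."],
)
print(prompt)
```

Keeping the trailing instruction as a parameter is a design choice of this sketch, so a task config could swap it without touching the formatting logic.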

Test Params

uv run python -m lmms_eval --model openai --model_args "model_version=google/gemini-2.5-flash-lite-preview-09-2025" --tasks tvbench_action_antonym --batch_size 1 --limit 8 --log_samples

feat: integrate TVBench benchmark tasks

35f10d9

Luodian merged commit a484b24 into dev-v0d7 Feb 23, 2026
2 checks passed

Luodian deleted the feat/lmm-289-tvbench branch February 23, 2026 08:25

Luodian added a commit that referenced this pull request Feb 28, 2026

feat: integrate TVBench benchmark tasks (#1160)

468973b

Luodian merged 1 commit into dev-v0d7 from feat/lmm-289-tvbench
