
[Benchmark Backfill] Integrate TVBench into lmms-eval#1160

Merged
Luodian merged 1 commit into dev-v0d7 from feat/lmm-289-tvbench on Feb 23, 2026

Conversation

@Luodian Luodian commented Feb 22, 2026

Summary

  • Integrate TVBench as a grouped benchmark task (tvbench) with 10 subset task configs under lmms_eval/tasks/tvbench/.
  • Add TVBench task utilities for video path resolution, prompt formatting, answer letter normalization, and accuracy scoring.
  • Add task docs mapping in docs/current_tasks.md and unit coverage for registration plus utility behavior.
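
The answer letter normalization and accuracy scoring mentioned above could look roughly like the sketch below. It is illustrative only: the function names (`normalize_answer_letter`, `score_sample`, `aggregate_accuracy`) and the matching rules are assumptions, not the utilities added under lmms_eval/tasks/tvbench/.

```python
import re
from typing import List, Optional

# Hypothetical helpers for illustration only; the names and matching rules
# are assumptions, not the utilities merged in this PR.
_LEADING_LETTER = re.compile(r"\(?([A-Za-z])\)?(?![A-Za-z])")
_ANSWER_PHRASE = re.compile(
    r"answer\s*(?:is|:)?\s*\(?([A-Za-z])\)?(?![A-Za-z])", re.IGNORECASE
)


def normalize_answer_letter(response: str, num_options: int) -> Optional[str]:
    """Map a free-form model response to an option letter, or None."""
    valid = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:num_options]
    text = response.strip()
    # Try a leading letter first ("B", "b.", "(B) ..."), then "answer is B".
    for match in (_LEADING_LETTER.match(text), _ANSWER_PHRASE.search(text)):
        if match and match.group(1).upper() in valid:
            return match.group(1).upper()
    return None


def score_sample(response: str, reference: str, num_options: int) -> float:
    """Per-sample accuracy: 1.0 if the normalized letter equals the reference."""
    predicted = normalize_answer_letter(response, num_options)
    return 1.0 if predicted == reference.strip().upper() else 0.0


def aggregate_accuracy(scores: List[float]) -> float:
    """Mean of per-sample scores; 0.0 for an empty list."""
    return sum(scores) / len(scores) if scores else 0.0
```

A response that starts with a one-letter word (for example "A hat is put on") would be read as option A under these rules, so a real implementation may need stricter matching.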

Validation

  • uv run python -m unittest discover -s test/eval -p "test_tvbench_task.py"
  • uv run python -m lmms_eval --tasks list (includes tvbench and all tvbench_* tasks)
  • uv run python -m lmms_eval --model dummy_video_reader --model_args response=A,fail_on_missing=false --tasks tvbench_action_antonym --limit 1 --batch_size 1 --output_path outputs/tvbench_smoke --log_samples

Closes #1138

Smoke Validation (limit=8)

Status: PASS (LMM-289 / tvbench)

Output Table

| Metric | Value |
| --- | --- |
| tvbench_action_antonym.tvbench_acc | 0.0 |

Note: The smoke run covers only the single subtask tvbench_action_antonym. The full group has 10 subtasks × 8 samples = 80 video calls, which is too slow for a single smoke run.

Sample Output

Sample 1 (doc_id: 0)

  • Input: What is the action being performed in the video? ↵ A. Put on a hat/cap. ↵ B. Take off a hat/cap. ↵ Answer with the option letter only.
  • Model Output: B
  • Reference: A
  • Scores: tvbench_acc = 0.0
  • Tokens: output=1, reasoning=0

Sample 2 (doc_id: 1)

  • Input: What is the action being performed in the video? ↵ A. Put on a hat/cap. ↵ B. Take off a hat/cap. ↵ Answer with the option letter only.
  • Model Output: B
  • Reference: A
  • Scores: tvbench_acc = 0.0
  • Tokens: output=1, reasoning=0
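
For reference, the Input strings in the samples above are consistent with a prompt built from the question, lettered candidates, and a fixed trailing instruction. The sketch below reproduces that layout under two assumptions: that ↵ stands for a newline, and that a helper like the hypothetical `format_prompt` exists. It is not the prompt formatter added in this PR.

```python
from typing import Sequence


def format_prompt(
    question: str,
    candidates: Sequence[str],
    post_prompt: str = "Answer with the option letter only.",
) -> str:
    """Build a multiple-choice prompt with options lettered A, B, C, ...

    Hypothetical helper: the name and signature are illustrative assumptions.
    """
    lines = [question]
    for index, candidate in enumerate(candidates):
        letter = chr(ord("A") + index)
        lines.append(f"{letter}. {candidate}")
    lines.append(post_prompt)
    return "\n".join(lines)


prompt = format_prompt(
    "What is the action being performed in the video?",
    ["Put on a hat/cap.", "Take off a hat/cap."],
)
print(prompt)
```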

Test Params

uv run python -m lmms_eval --model openai --model_args "model_version=google/gemini-2.5-flash-lite-preview-09-2025" --tasks tvbench_action_antonym --batch_size 1 --limit 8 --log_samples

@Luodian Luodian merged commit a484b24 into dev-v0d7 Feb 23, 2026
2 checks passed
@Luodian Luodian deleted the feat/lmm-289-tvbench branch February 23, 2026 08:25
Luodian added a commit that referenced this pull request Feb 28, 2026