feat: integrate ViVerBench benchmark task by Luodian · Pull Request #1166 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-22T13:09:03Z

Summary

Add a new viverbench task under lmms_eval/tasks/viverbench, with a YAML config, prompt/visual builders, and boolean answer parsing and aggregation logic.
Support multi-image samples from comin/ViVerBench by decoding byte arrays into RGB PIL images, and report accuracy both per category and overall.
Add ViVerBench to docs/current_tasks.md so the docs and task discovery stay in sync.
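
As a rough illustration of the "accuracy by category + overall" aggregation mentioned above, here is a minimal sketch. It is not the code in lmms_eval/tasks/viverbench/utils.py: the function name `aggregate_accuracy` and the per-sample keys `category` and `correct` are hypothetical, and the overall score is assumed to be sample-weighted (every sample counts equally), which may differ from what the task actually does.

```python
from collections import defaultdict


def aggregate_accuracy(results):
    """Compute per-category and overall accuracy.

    `results` is an iterable of dicts with two hypothetical keys:
    'category' (str) and 'correct' (bool). These names are assumptions
    for this sketch, not the task's real schema.
    """
    totals = defaultdict(int)  # samples seen per category
    hits = defaultdict(int)    # correct samples per category
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])

    per_category = {c: hits[c] / totals[c] for c in totals}

    # Assumed: overall accuracy weights every sample equally.
    n = sum(totals.values())
    overall = sum(hits.values()) / n if n else 0.0
    return per_category, overall
```

Averaging the per-category scores instead would give each category equal weight regardless of size; which of the two the task uses is not stated in this PR description.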

Validation

uv run python -c "import subprocess, sys; out=subprocess.check_output([sys.executable, '-m', 'lmms_eval', '--tasks', 'list'], text=True); print('viverbench_in_task_list=', 'viverbench' in out)"
uv run python -m lmms_eval --model dummy_video_reader --model_args response=true --tasks viverbench --limit 8 --batch_size 1
uv run pre-commit run --files docs/current_tasks.md lmms_eval/tasks/viverbench/_default_template_yaml lmms_eval/tasks/viverbench/viverbench.yaml lmms_eval/tasks/viverbench/utils.py lmms_eval/tasks/viverbench/README.md
uv run python -m compileall lmms_eval/tasks/viverbench

Smoke Validation (limit=8)

Status: PASS (LMM-295 / viverbench)

Output Table

Metric	Value
viverbench_acc	0.75

Sample Output

Sample 1 (doc_id: 0)

Input: This image was generated from the prompt: "A majestic castle perched high on a cliff, overlooking a vast expanse of blue water. The castle, with its numerous turrets and two red flags, exudes an ancient and solemn aura. A large sailing ship gracefully glides across the water, also bearing a red flag…
Model Output: { ↵ "answer": true, ↵ "explanation": "All objects and their quantities mentioned in the prompt are correctly represented in the image, including the majestic castle with turrets and two red flags, the large sailing ship with a red flag, two weathered parchment scrolls (one unfurled, one partially rolled), two golden rings, and tall trees framing the scene." ↵ }
Reference: True
Scores: N/A
Tokens: output=337, reasoning=256

Sample 2 (doc_id: 1)

Input: This image was generated from the prompt: "A majestic castle perched high on a cliff, overlooking a vast expanse of blue water. The castle, with its numerous turrets and two red flags, exudes an ancient and solemn aura. A large sailing ship gracefully glides across the water, also bearing a red flag…
Model Output: { ↵ "answer": true, ↵ "explanation": "All objects and their quantities mentioned in the prompt are correctly represented in the image, including the castle with turrets and two red flags, the sailing ship with a red flag, two weathered parchment scrolls (one unfurled, one partially rolled), two golden rings, and the framing tall trees." ↵ }
Reference: False
Scores: N/A
Tokens: output=280, reasoning=202
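
The two samples above show the model replying with a JSON object whose `answer` field is a boolean, compared against a `True`/`False` reference. The following is a hedged sketch of how such a reply could be parsed and scored. The helper names `parse_bool_answer` and `is_correct` are hypothetical, and the regex fallback is an assumption about how loosely formatted replies might be handled, not a description of the merged parsing logic.

```python
import json
import re


def parse_bool_answer(text):
    """Return True/False if a boolean answer can be recovered, else None."""
    # First attempt: the whole reply is a JSON object with a boolean "answer".
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and isinstance(obj.get("answer"), bool):
            return obj["answer"]
    except (json.JSONDecodeError, TypeError):
        pass

    # Fallback (assumed): find an "answer": true/false pair anywhere in the
    # text, tolerating quotes around the value and mixed case.
    match = re.search(r'"answer"\s*:\s*"?(true|false)"?', text, flags=re.IGNORECASE)
    if match:
        return match.group(1).lower() == "true"
    return None


def is_correct(model_output, reference):
    """A sample counts as correct only when a parsed answer equals the reference."""
    pred = parse_bool_answer(model_output)
    return pred is not None and pred == reference
```

Under this sketch, Sample 1 (answer `true`, reference `True`) would count as correct and Sample 2 (answer `true`, reference `False`) as incorrect; unparseable replies are treated as incorrect, which is a design choice of the sketch.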

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks viverbench --batch_size 1 --limit 8.0 --log_samples

feat: integrate ViVerBench benchmark task

2e3b42a

Luodian merged commit 4e74261 into dev-v0d7 Feb 23, 2026
2 checks passed

Luodian deleted the feat/lmm-295-viverbench branch February 23, 2026 08:25

Luodian added a commit that referenced this pull request Feb 28, 2026

feat: integrate ViVerBench benchmark task (#1166)

11392d9

Luodian merged 1 commit into dev-v0d7 from feat/lmm-295-viverbench
