feat: integrate ViVerBench benchmark task #1166

Merged: Luodian merged 1 commit into dev-v0d7 from feat/lmm-295-viverbench on Feb 23, 2026
Luodian commented Feb 22, 2026

Summary

  • add a new viverbench task under lmms_eval/tasks/viverbench, with a YAML config, prompt/visual builders, and boolean answer parsing and aggregation logic
  • support multi-image samples from comin/ViVerBench by decoding byte arrays into RGB PIL images, and report accuracy per category and overall
  • add ViVerBench to docs/current_tasks.md so the docs and task discovery stay in sync
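
The per-category and overall accuracy aggregation mentioned above could look roughly like the sketch below. This is an illustration, not the code in lmms_eval/tasks/viverbench/utils.py: the function name `aggregate_accuracy` and the record keys `category` and `correct` are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_accuracy(results):
    """Return (per_category_accuracy, overall_accuracy).

    `results` is an iterable of dicts with the hypothetical keys
    'category' (str) and 'correct' (bool); the real task may use
    different field names.
    """
    totals = defaultdict(int)  # samples seen per category
    hits = defaultdict(int)    # correct predictions per category
    for record in results:
        totals[record["category"]] += 1
        hits[record["category"]] += int(record["correct"])

    per_category = {name: hits[name] / totals[name] for name in totals}
    sample_count = sum(totals.values())
    # Overall accuracy is computed over all samples, not as a mean of
    # category means, so large categories weigh more.
    overall = sum(hits.values()) / sample_count if sample_count else 0.0
    return per_category, overall
```

Computing the overall score over all samples rather than averaging the category scores is a design choice of this sketch; whether the merged task does the same is not stated in this PR.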

Validation

  • uv run python -c "import subprocess, sys; out=subprocess.check_output([sys.executable, '-m', 'lmms_eval', '--tasks', 'list'], text=True); print('viverbench_in_task_list=', 'viverbench' in out)"
  • uv run python -m lmms_eval --model dummy_video_reader --model_args response=true --tasks viverbench --limit 8 --batch_size 1
  • uv run pre-commit run --files docs/current_tasks.md lmms_eval/tasks/viverbench/_default_template_yaml lmms_eval/tasks/viverbench/viverbench.yaml lmms_eval/tasks/viverbench/utils.py lmms_eval/tasks/viverbench/README.md
  • uv run python -m compileall lmms_eval/tasks/viverbench

Smoke Validation (limit=8)

Status: PASS (LMM-295 / viverbench)

Output Table

| Metric | Value |
| --- | --- |
| viverbench_acc | 0.75 |

Sample Output

Sample 1 (doc_id: 0)

  • Input: This image was generated from the prompt: "A majestic castle perched high on a cliff, overlooking a vast expanse of blue water. The castle, with its numerous turrets and two red flags, exudes an ancient and solemn aura. A large sailing ship gracefully glides across the water, also bearing a red flag…
  • Model Output: {"answer": true, "explanation": "All objects and their quantities mentioned in the prompt are correctly represented in the image, including the majestic castle with turrets and two red flags, the large sailing ship with a red flag, two weathered parchment scrolls (one unfurled, one partially rolled), two golden rings, and tall trees framing the scene."}
  • Reference: True
  • Scores: N/A
  • Tokens: output=337, reasoning=256

Sample 2 (doc_id: 1)

  • Input: This image was generated from the prompt: "A majestic castle perched high on a cliff, overlooking a vast expanse of blue water. The castle, with its numerous turrets and two red flags, exudes an ancient and solemn aura. A large sailing ship gracefully glides across the water, also bearing a red flag…
  • Model Output: {"answer": true, "explanation": "All objects and their quantities mentioned in the prompt are correctly represented in the image, including the castle with turrets and two red flags, the sailing ship with a red flag, two weathered parchment scrolls (one unfurled, one partially rolled), two golden rings, and the framing tall trees."}
  • Reference: False
  • Scores: N/A
  • Tokens: output=280, reasoning=202
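
The samples above pair a JSON model output carrying an "answer" field with a True/False reference. A minimal sketch of how such an output might be parsed and scored follows. The helper names `parse_bool_answer` and `score` and the plain-text fallback are assumptions for illustration and may differ from the logic merged in utils.py.

```python
import json
import re


def parse_bool_answer(text):
    """Return True or False from a model output, or None if unparseable."""
    # Prefer a JSON object with an "answer" field, as in the samples above.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get("answer")
        except json.JSONDecodeError:
            value = None
        if isinstance(value, bool):
            return value
        if isinstance(value, str) and value.strip().lower() in ("true", "false"):
            return value.strip().lower() == "true"
    # Assumed fallback: the first bare true/false token in free text.
    token = re.search(r"\b(true|false)\b", text, flags=re.IGNORECASE)
    if token:
        return token.group(1).lower() == "true"
    return None


def score(model_output, reference):
    """Return 1.0 if the parsed answer matches the reference, else 0.0."""
    predicted = parse_bool_answer(model_output)
    expected = str(reference).strip().lower() == "true"
    return 1.0 if predicted is not None and predicted == expected else 0.0
```

Under this sketch, Sample 1 (answer true, reference True) would score 1.0 and Sample 2 (answer true, reference False) would score 0.0; an unparseable output counts as incorrect.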

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks viverbench --batch_size 1 --limit 8.0 --log_samples

Luodian merged commit 4e74261 into dev-v0d7 on Feb 23, 2026
2 checks passed
Luodian deleted the feat/lmm-295-viverbench branch on February 23, 2026 08:25