feat: integrate worldvqa benchmark task by Luodian · Pull Request #1168 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-22T13:41:02Z

Summary

- Add a new worldvqa task backed by moonshotai/WorldVQA, with base64 image decoding and exact-match scoring.
- Add WorldQA compatibility aliases under lmms_eval/tasks/worldvqa/ (worldvqa_gen, worldvqa_mc, worldvqa_mc_ppl) to keep parity with existing worldqa task flows.
- Update docs/current_tasks.md to document the new WorldVQA task and compatibility aliases.
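
For readers unfamiliar with the two mechanisms named in the summary, the following is a minimal sketch of base64 image decoding and exact-match scoring. It is not the code in lmms_eval/tasks/worldvqa/utils.py: the function names (decode_base64_image, normalize_answer, exact_match) and the normalization rules (lowercasing, whitespace collapsing, stripping edge punctuation) are assumptions made for illustration, and the real task may normalize differently or not at all.

```python
import base64
import binascii
import string


def decode_base64_image(payload: str) -> bytes:
    """Decode a base64 string (optionally a data URI) into raw image bytes.

    Hypothetical helper: a real task would likely pass these bytes on to an
    image library before handing the image to the model.
    """
    # Drop a "data:image/png;base64," style prefix if one is present.
    if payload.startswith("data:") and "," in payload:
        payload = payload.split(",", 1)[1]
    try:
        return base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("image payload is not valid base64") from exc


def normalize_answer(text: str) -> str:
    """Assumed normalization: lowercase, collapse whitespace, strip edge punctuation."""
    collapsed = " ".join(text.strip().lower().split())
    return collapsed.strip(string.punctuation + " ")


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize_answer(prediction) == normalize_answer(reference) else 0.0


if __name__ == "__main__":
    # Round-trip a few bytes to show the decoding path.
    encoded = base64.b64encode(b"\x89PNG").decode("ascii")
    print(decode_base64_image(encoded))
    # Strings taken from Sample 1 in this PR: different breeds, so no match.
    print(exact_match("Doberman Pinscher", "Greek Hound"))  # 0.0
```

Under this assumed normalization a trailing period, as in the "Finnish Lapphund." output of Sample 2, would not by itself cause a mismatch. Whether the merged task is that lenient is not stated in this PR.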

Verification

uv run python -c "import subprocess, sys; output = subprocess.check_output([sys.executable, '-m', 'lmms_eval', '--tasks', 'list'], text=True); print('worldvqa_present', 'worldvqa' in output)"
HF_TOKEN='' HUGGING_FACE_HUB_TOKEN='' uv run python -m lmms_eval --model dummy_video_reader --model_args response=A,fail_on_missing=false --tasks worldvqa --limit 1 --batch_size 1 --output_path /tmp/worldvqa_smoke --log_samples
lsp_diagnostics clean for lmms_eval/tasks/worldvqa/utils.py (YAML files still emit expected !function unresolved-tag schema warnings)

Closes #1146

Smoke Validation (limit=8)

Status: PASS (LMM-297 / worldvqa)

Output Table

Metric	Value
exact_match	0.25
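
As a quick reading aid for the table: assuming exact_match is reported as the mean of per-sample 0/1 scores (an assumption, since the aggregation is not shown in this PR), a value of 0.25 at limit=8 implies two exact matches. The helper name below is hypothetical.

```python
def implied_correct(mean_score: float, n_samples: int) -> int:
    """Number of exact matches implied by a mean of 0/1 scores over n_samples."""
    return round(mean_score * n_samples)


print(implied_correct(0.25, 8))  # 2
```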

Sample Output

Sample 1 (doc_id: 0)

Input: What breed of dog is in the picture? ↵ Answer briefly.
Model Output: Doberman Pinscher
Reference: Greek Hound
Scores: exact_match = 0.0
Tokens: output=95, reasoning=89

Sample 2 (doc_id: 1)

Input: What breed of dog is in the picture? ↵ Answer briefly.
Model Output: Finnish Lapphund.
Reference: European Russian Laika
Scores: exact_match = 0.0
Tokens: output=601, reasoning=594

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks worldvqa --batch_size 1 --limit 8.0 --log_samples

feat: integrate worldvqa benchmark task

04088e2

Luodian merged commit c3e3675 into dev-v0d7 Feb 23, 2026
2 checks passed

Luodian deleted the feat/lmm-297-worldvqa branch February 23, 2026 08:25

Luodian added a commit that referenced this pull request Feb 28, 2026

feat: integrate worldvqa benchmark task (#1168)

0059a43
