feat: backfill VisuLogic benchmark integration (LMM-288) by Luodian · Pull Request #1159 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-22T12:56:19Z

Summary

Added a new visulogic task integration under lmms_eval/tasks/visulogic/ with task YAML + utils.
Implemented VisuLogic image loading from VisuLogic/VisuLogic (data.jsonl + images.zip) and multiple-choice answer extraction for A/B/C/D scoring.
Updated docs/current_tasks.md to include VisuLogic in the Image Tasks benchmark index.
Scope check: confirmed no existing VisuLogic task coverage in this worktree before implementation (glob '**/*visulogic*' and repo grep returned no matches).
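For readers unfamiliar with the data.jsonl + images.zip layout mentioned above, here is a minimal sketch of pairing JSON Lines records with archive members. It is not the code in this PR's utils: the field names (`id`, `image_path`, `label`) and the member path are hypothetical stand-ins, and a tiny in-memory zip replaces the real images.zip so the example runs offline.

```python
import io
import json
import zipfile


def read_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_image_bytes(archive, member):
    """Return the raw bytes of one member of an open zip archive."""
    with archive.open(member) as handle:
        return handle.read()


def build_demo_archive():
    """Create a tiny in-memory zip standing in for images.zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("images/00000.png", b"fake-png-bytes")
    buffer.seek(0)
    return buffer


# Hypothetical record layout; the real data.jsonl fields may differ.
demo_jsonl = '{"id": 0, "image_path": "images/00000.png", "label": "A"}\n'
records = read_jsonl(demo_jsonl)

# Map each record id to the bytes of the image it points at.
with zipfile.ZipFile(build_demo_archive()) as archive:
    images = {r["id"]: load_image_bytes(archive, r["image_path"]) for r in records}
```

In a real loader the bytes would typically be decoded into an image object before being handed to the model; that step is left out here to keep the sketch dependency-free.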

Validation

uv run python -m lmms_eval --tasks list
- Result: task list includes visulogic.
uv run python -m lmms_eval --model dummy_video_reader --model_args response=A,fail_on_missing=false --tasks visulogic --limit 8 --batch_size 1
- Result: completed successfully; output metric table includes visulogic_acc = 0.125 for the smoke slice.
uv run pre-commit run --all-files
- Result: black passed, isort passed.
uv run python -m unittest discover -s test/eval -p "test_model_registry_v2.py"
- Result: command completed successfully (Ran 0 tests, OK).

Issue Link

Linear: LMM-288
GitHub: [Benchmark Backfill] Integrate VisuLogic into lmms-eval #1137

Smoke Validation (limit=8)

Status: PASS (LMM-288 / visulogic)

Output Table

Metric	Value
visulogic_acc	0.125
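
On an eight-sample slice, 0.125 corresponds to exactly one correct answer, assuming visulogic_acc is a plain mean of per-sample 0/1 scores (the PR does not spell out the aggregation). A small sketch of that arithmetic, with made-up references rather than the actual VisuLogic labels:

```python
def mean_accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Illustrative labels only, not the first eight VisuLogic references.
references = ["A", "D", "C", "B", "D", "C", "B", "D"]
predictions = ["A"] * 8  # a constant answer, matching only the first reference
score = mean_accuracy(predictions, references)
```

With so few samples the metric moves in steps of 0.125, so the smoke run checks that the pipeline works end to end rather than measuring model quality.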

Sample Output

Sample 1 (doc_id: 0)

Input: From the four given options, select the most suitable one to fill in the question mark, so that a certain regularity is presented: ↵ ↵ ↵ ↵ A: A ↵ B: B ↵ C: C ↵ D: D ↵ Answer with the option letter only.
Model Output: To determine the most suitable option, we analyze the pattern in the given figures: ↵ ↵ ↵ ### Step 1: Identify the pattern in the left set ↵ The left set consists of three figures. The first figure is a square with two intersecting diagonal lines (dividing the square into 4 regions). The second
Reference: A
Scores: visulogic_acc = 0.0
Tokens: output=5177, reasoning=5113

Sample 2 (doc_id: 1)

Input: From the four given options, choose the most suitable one to fill in the question mark, so that it presents a certain regularity: ↵ ↵ ↵ A: A ↵ B: B ↵ C: C ↵ D: D ↵ Answer with the option letter only.
Model Output: To determine the pattern, we analyze the left and right sets of figures: ↵ ↵ ↵ ### Left Set Analysis ↵ - First figure: Two triangles overlapping, forming a shape with 5 edges. ↵ - Second figure: Two squares intersecting, forming a shape with 8 edges. ↵ - **Third figure
Reference: D
Scores: visulogic_acc = 0.0
Tokens: output=4287, reasoning=4223
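
Both samples show long free-form responses even though the prompt asks for the option letter only, so the scorer has to pull a letter out of prose. The sketch below illustrates one conservative way to do that; it is an assumption about the general approach, not the extraction logic shipped in this PR, and `extract_choice` and `score_choice` are hypothetical names.

```python
import re

# A response that is nothing but a letter, e.g. "A", "(B)", "C."
_BARE_LETTER = re.compile(r"\(?([A-D])\)?[.:]?")

# An explicit phrase such as "Answer: C" or "the answer is (D)".
# Only the keyword is case-insensitive; the option letter must be uppercase.
_ANSWER_PHRASE = re.compile(
    r"(?i:answer)(?:\s+(?i:is))?\s*:?\s*\(?([A-D])\)?(?![A-Za-z])"
)


def extract_choice(response):
    """Return the chosen option letter, or None if none can be found."""
    text = response.strip()
    bare = _BARE_LETTER.fullmatch(text)
    if bare:
        return bare.group(1)
    # Prefer the last explicit statement, since reasoning may revise itself.
    phrases = _ANSWER_PHRASE.findall(text)
    return phrases[-1] if phrases else None


def score_choice(response, reference):
    """Return 1.0 when the extracted letter equals the reference, else 0.0."""
    return 1.0 if extract_choice(response) == reference else 0.0
```

Returning None for unparseable output, and scoring it as wrong, avoids crediting a response for an incidental capital letter such as the article "A" at the start of a sentence.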

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks visulogic --batch_size 1 --limit 8.0 --log_samples

* feat: integrate VisuLogic benchmark task (#1137)
* docs: add VisuLogic to current task index

Luodian added 2 commits February 22, 2026 20:55

feat: integrate VisuLogic benchmark task (#1137)

d105518

docs: add VisuLogic to current task index

ca95b4d

Luodian merged commit ff912cc into dev-v0d7 Feb 23, 2026
2 checks passed

Luodian deleted the feat/lmm-288-visulogic branch February 23, 2026 08:25

Luodian added a commit that referenced this pull request Feb 28, 2026

feat: backfill VisuLogic benchmark integration (LMM-288) (#1159)

e818fea

* feat: integrate VisuLogic benchmark task (#1137)
* docs: add VisuLogic to current task index
