fix: harden MMSI-Bench parity handling by Luodian · Pull Request #1162 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-22T13:05:04Z

Summary

Harden mmsi_bench prompt rendering by safely handling a missing or non-dict lmms_eval_specific_kwargs.
Normalize legacy positional category labels (for example, Obj.-Obj.) to the canonical MMSI-Bench labels so metric aggregation stays stable across dataset variants.
Make answer extraction case-insensitive for MCQ letters, and add focused regression tests for these edge cases.
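
For readers without the diff open, the three behaviors above can be sketched roughly as follows. This is a minimal illustration, not the code in lmms_eval/tasks/mmsi_bench/utils.py: the function names, the pre_prompt/post_prompt keys, the alias table, and the canonical spelling "Positional Relationship (Obj.–Obj.)" are all assumptions made for the example.

```python
import re
from collections.abc import Mapping
from typing import Any, Optional

# Hypothetical alias table. Only the legacy label "Obj.-Obj." is named in the
# PR; the canonical spelling is assumed from the style of the output table.
LEGACY_CATEGORY_ALIASES = {
    "Obj.-Obj.": "Positional Relationship (Obj.–Obj.)",
}


def safe_prompt_kwargs(kwargs: Any) -> dict:
    """Return prompt affixes, tolerating None or non-dict input.

    The key names are assumed for illustration.
    """
    if not isinstance(kwargs, Mapping):
        kwargs = {}
    return {
        "pre_prompt": str(kwargs.get("pre_prompt") or ""),
        "post_prompt": str(kwargs.get("post_prompt") or ""),
    }


def normalize_category(label: str) -> str:
    """Map a legacy label to its canonical form; pass others through."""
    label = label.strip()
    return LEGACY_CATEGORY_ALIASES.get(label, label)


def extract_choice(response: str) -> Optional[str]:
    """Extract an A-D option letter, ignoring case; None if not found."""
    text = response.strip()
    # Prefer a letter enclosed in backticks, as the prompt requests.
    fenced = re.findall(r"`+\s*([A-Da-d])\s*`+", text)
    if fenced:
        return fenced[-1].upper()
    # Otherwise accept a bare letter on the last non-empty line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        bare = re.fullmatch(r"\(?([A-Da-d])\)?[.:]?", lines[-1])
        if bare:
            return bare.group(1).upper()
    return None
```

Under these assumptions, extract_choice("c") and extract_choice("Some reasoning.\n\nC") both return "C", and safe_prompt_kwargs(None) returns empty strings instead of raising.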

Validation

uv run python -m unittest discover -s test/eval -p "test_mmsi_bench_utils.py"
uv run python -m py_compile lmms_eval/tasks/mmsi_bench/utils.py test/eval/test_mmsi_bench_utils.py
uv run python -m lmms_eval --model dummy_video_reader --model_args response=A,fail_on_missing=false --tasks mmsi_bench --limit 8 --batch_size 1
uv run python -c "from lmms_eval.tasks import TaskManager; tm = TaskManager(); print('mmsi_bench' in tm.all_tasks)" -> True

Tracking

Closes [Benchmark Backfill] Integrate MMSIBench into lmms-eval #1143
Linear: LMM-294

Smoke Validation (limit=8)

Status: PASS (LMM-294 / mmsi_bench)

Output Table

Metric	Value
Attribute (Meas.)	0.0
Motion (Cam.)	0.3333333333333333
Positional Relationship (Cam.–Obj.)	1.0
average	0.5

Sample Output

Sample 1 (doc_id: 0)

Input: The images are taken continuously from a first-person perspective. In which direction are you moving? ↵ Options: A: Left while moving backward, B: Forward to the left, C: Forward to the right, D: Right while moving backward ↵ Answer with the option's letter from the given choices directly. Enclose t…
Model Output: C
Reference: C
Scores: Motion (Cam.) = 1.0 (question_id: 0, l2_category: Motion (Cam.)) · average = 1.0 (question_id: 0, l2_category: Motion (Cam.))
Tokens: output=695, reasoning=694

Sample 2 (doc_id: 1)

Input: The images are taken continuously from a first-person perspective. In which direction is the camera rotating? ↵ Options: A: Back, B: Left, C: Right, D: Forward ↵ Answer with the option's letter from the given choices directly. Enclose the option's letter within ``.
Model Output: To determine the direction of the camera rotation, we analyze the continuous first-person perspective. Observing the scene, the camera's view shifts such that the objects in the tray appear to move relative to the fixed perspective. By examining the movement of the objects and the background, we can infer the rotation direction. The camera is rotating to the right. ↵ ↵ C
Reference: B
Scores: Motion (Cam.) = 0.0 (question_id: 1, l2_category: Motion (Cam.)) · average = 0.0 (question_id: 1, l2_category: Motion (Cam.))
Tokens: output=777, reasoning=706
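
The per-sample scores above carry an l2_category, which suggests that metrics are grouped per category and also pooled into an overall average. The sketch below illustrates that aggregation using the two samples shown. It assumes the average is computed over samples rather than over categories, which matches the output table (0.5 is not the unweighted mean of the three category values). The aggregate function is hypothetical, not the task's actual aggregation code.

```python
from collections import defaultdict


def aggregate(scores):
    """Aggregate (l2_category, score) pairs into per-category and overall means."""
    per_category = defaultdict(list)
    for category, score in scores:
        per_category[category].append(score)
    # Mean score within each category.
    result = {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}
    # Overall mean pooled over all samples (assumed micro-average).
    pooled = [s for vals in per_category.values() for s in vals]
    result["average"] = sum(pooled) / len(pooled) if pooled else 0.0
    return result


# The two samples shown above: doc_id 0 scored 1.0, doc_id 1 scored 0.0.
samples = [("Motion (Cam.)", 1.0), ("Motion (Cam.)", 0.0)]
print(aggregate(samples))  # {'Motion (Cam.)': 0.5, 'average': 0.5}
```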

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks mmsi_bench --batch_size 1 --limit 8.0 --log_samples

Handle missing prompt kwargs and legacy category labels so MMSI-Bench runs reliably across dataset variants. Add focused regression tests for helper behavior.

fix: harden mmsi-bench utils parsing

d85641d

Luodian merged commit 63bea01 into dev-v0d7 Feb 23, 2026
2 checks passed

Luodian deleted the feat/lmm-294-mmsibench branch February 23, 2026 08:25

Luodian added a commit that referenced this pull request Feb 28, 2026

fix: harden mmsi-bench utils parsing (#1162)

9d11279
