fix: harden MMSI-Bench parity handling #1162

Merged: Luodian merged 1 commit into dev-v0d7 from feat/lmm-294-mmsibench on Feb 23, 2026

Luodian commented on Feb 22, 2026

Summary

  • harden mmsi_bench prompt rendering by handling missing/non-dict lmms_eval_specific_kwargs safely.
  • normalize legacy positional category labels (for example Obj.-Obj.) to canonical MMSI-Bench labels so metric aggregation remains stable across dataset variants.
  • make answer extraction case-insensitive for MCQ letters and add focused regression tests for these edge cases.
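The first two bullets can be pictured with a minimal sketch. This is not the code merged in this PR: the helper names, the `pre_prompt`/`post_prompt` keys, and the canonical spelling chosen for `Obj.-Obj.` are assumptions made for illustration only.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical legacy-to-canonical mapping. Only "Obj.-Obj." is named in the
# PR summary; the canonical spelling on the right is an assumption modelled on
# the en-dash style of "Positional Relationship (Cam.–Obj.)".
LEGACY_CATEGORY_LABELS = {
    "Obj.-Obj.": "Positional Relationship (Obj.–Obj.)",
}


def get_prompt_affixes(lmms_eval_specific_kwargs: Any) -> tuple[str, str]:
    """Return (pre_prompt, post_prompt), tolerating None or non-dict input."""
    if not isinstance(lmms_eval_specific_kwargs, Mapping):
        # Missing or malformed kwargs: fall back to empty affixes.
        return "", ""
    pre = lmms_eval_specific_kwargs.get("pre_prompt") or ""
    post = lmms_eval_specific_kwargs.get("post_prompt") or ""
    return str(pre), str(post)


def normalize_category(label: str) -> str:
    """Map a legacy label to its canonical form; pass other labels through."""
    cleaned = label.strip()
    return LEGACY_CATEGORY_LABELS.get(cleaned, cleaned)
```

Returning empty strings instead of raising keeps prompt rendering usable when a task config omits the kwargs, and routing every label through one lookup means aggregation sees a single key per category regardless of which dataset variant supplied it.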

Validation

  • uv run python -m unittest discover -s test/eval -p "test_mmsi_bench_utils.py"
  • uv run python -m py_compile lmms_eval/tasks/mmsi_bench/utils.py test/eval/test_mmsi_bench_utils.py
  • uv run python -m lmms_eval --model dummy_video_reader --model_args response=A,fail_on_missing=false --tasks mmsi_bench --limit 8 --batch_size 1
  • uv run python -c "from lmms_eval.tasks import TaskManager; tm = TaskManager(); print('mmsi_bench' in tm.all_tasks)" -> True

Tracking

Smoke Validation (limit=8)

Status: PASS (LMM-294 / mmsi_bench)

Output Table

| Metric | Value |
| --- | --- |
| Attribute (Meas.) | 0.0 |
| Motion (Cam.) | 0.3333333333333333 |
| Positional Relationship (Cam.–Obj.) | 1.0 |
| average | 0.5 |
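The reported average (0.5) is not the mean of the three category values (about 0.44), which would be consistent with averaging over samples, not over categories. The sketch below illustrates that reading. The function name is hypothetical, and the per-sample split (3 + 3 + 2) is invented so that the numbers match the table; the PR does not state the actual split or the aggregation code.

```python
from collections import defaultdict


def aggregate_accuracy(results):
    """Aggregate (category, score) pairs into per-category and overall means."""
    per_category = defaultdict(list)
    for category, score in results:
        per_category[category].append(score)

    summary = {cat: sum(scores) / len(scores) for cat, scores in per_category.items()}
    # Overall average is taken over samples, not over categories.
    all_scores = [s for scores in per_category.values() for s in scores]
    summary["average"] = sum(all_scores) / len(all_scores) if all_scores else 0.0
    return summary


# Invented 8-sample split, chosen only to reproduce the values in the table.
results = (
    [("Motion (Cam.)", 1.0), ("Motion (Cam.)", 0.0), ("Motion (Cam.)", 0.0)]
    + [("Positional Relationship (Cam.–Obj.)", 1.0)] * 3
    + [("Attribute (Meas.)", 0.0)] * 2
)
summary = aggregate_accuracy(results)
```

With only 8 samples, a single category can swing between 0.0 and 1.0, so these figures confirm that the pipeline runs end to end and say little about model quality.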

Sample Output

Sample 1 (doc_id: 0)

  • Input: The images are taken continuously from a first-person perspective. In which direction are you moving? ↵ Options: A: Left while moving backward, B: Forward to the left, C: Forward to the right, D: Right while moving backward ↵ Answer with the option's letter from the given choices directly. Enclose t…
  • Model Output: C
  • Reference: C
  • Scores: Motion (Cam.) = 1.0 (question_id: 0, l2_category: Motion (Cam.)) · average = 1.0 (question_id: 0, l2_category: Motion (Cam.))
  • Tokens: output=695, reasoning=694

Sample 2 (doc_id: 1)

  • Input: The images are taken continuously from a first-person perspective. In which direction is the camera rotating? ↵ Options: A: Back, B: Left, C: Right, D: Forward ↵ Answer with the option's letter from the given choices directly. Enclose the option's letter within ``.
  • Model Output: To determine the direction of the camera rotation, we analyze the continuous first-person perspective. Observing the scene, the camera's view shifts such that the objects in the tray appear to move relative to the fixed perspective. By examining the movement of the objects and the background, we can infer the rotation direction. The camera is rotating to the right. ↵ ↵ C
  • Reference: B
  • Scores: Motion (Cam.) = 0.0 (question_id: 1, l2_category: Motion (Cam.)) · average = 0.0 (question_id: 1, l2_category: Motion (Cam.))
  • Tokens: output=777, reasoning=706
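The two samples show the response shapes an extractor has to handle: a bare letter, and free-form reasoning that ends with a letter on its own line. A minimal sketch of case-insensitive extraction follows. The function name, the regular expressions, and the fallback order are assumptions, not the logic in lmms_eval/tasks/mmsi_bench/utils.py.

```python
import re
from typing import Optional

# A letter wrapped in backticks, as the prompt requests (case-insensitive).
_ENCLOSED = re.compile(r"`+\s*([A-D])\s*`+", re.IGNORECASE)
# A line that is only an option letter, optionally parenthesised or punctuated.
_BARE = re.compile(r"^\(?([A-D])\)?[.:]?$", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen option as an uppercase letter, or None if not found."""
    text = response.strip()

    # Prefer an explicitly enclosed letter; take the last one if several occur.
    enclosed = _ENCLOSED.findall(text)
    if enclosed:
        return enclosed[-1].upper()

    # Otherwise accept the last non-empty line only if it is a bare letter.
    for line in reversed(text.splitlines()):
        line = line.strip()
        if not line:
            continue
        match = _BARE.match(line)
        return match.group(1).upper() if match else None
    return None
```

Looking only at the final non-empty line avoids misreading a stray article such as "a" inside the reasoning as option A, at the cost of returning None for responses that bury the letter mid-sentence.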

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks mmsi_bench --batch_size 1 --limit 8.0 --log_samples

Handle missing prompt kwargs and legacy category labels so MMSI-Bench runs reliably across dataset variants.

Add focused regression tests for helper behavior.

Luodian merged commit 63bea01 into dev-v0d7 on Feb 23, 2026
2 checks passed
Luodian deleted the feat/lmm-294-mmsibench branch on February 23, 2026 08:25
Luodian added a commit that referenced this pull request on Feb 28, 2026