feat: add SAT task by njb-nvidia · Pull Request #1348 · EvolvingLMMs-Lab/lmms-eval

njb-nvidia · 2026-05-20T23:07:40Z

Summary

Adds SAT (Spatial Aptitude Test) — a 150-item binary-MCQ spatial-reasoning benchmark with 1-2 images per item. Items cover object movement, ego movement, action consequences, perspective, and goal-aim reasoning.

Dataset: nv-njb/SAT — a re-host of the original array/SAT with two upstream-friendly fixes:

1. Images are stored as `Sequence(Image())` so that non-streaming `load_dataset` works. The original parquet uses a nested `list<binary>` schema for `image_bytes`, which trips pyarrow's chunked-array conversion (`Nested data conversions not implemented for chunked array outputs`). Streaming would avoid the error, but lmms-eval's `api/task.py` calls `load_dataset` with `num_proc=1`, which is incompatible with streaming, so neither path works on the original.
2. Answer order is pre-shuffled with `random.Random(42)` and `correct_answer_idx` is baked in. The downstream fork's `api/task.py` had a SAT-specific branch that shuffled answer order at load time with a non-deterministic seed; baking the order into the dataset makes evaluation reproducible without any framework patch.

All other fields (`question`, `question_type`, `correct_answer`) pass through unchanged. See the dataset card for the full schema and a citation.
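For reviewers who want to see what the answer-order fix amounts to, here is a minimal sketch of a seeded shuffle that records the index of the correct answer. It is not the script used to build nv-njb/SAT: the helper name `bake_answer_order`, the use of a single RNG shared across rows, and the sample rows are assumptions for illustration. Only the field names and `random.Random(42)` come from the description above.

```python
import random


def bake_answer_order(rows, seed=42):
    """Return copies of rows with answers shuffled and correct_answer_idx set."""
    rng = random.Random(seed)  # assumption: one seeded RNG shared across all rows
    baked = []
    for row in rows:
        answers = list(row["answers"])  # copy so the input row is not mutated
        rng.shuffle(answers)
        baked.append(
            {
                **row,
                "answers": answers,
                "correct_answer_idx": answers.index(row["correct_answer"]),
            }
        )
    return baked


# Hypothetical rows for illustration; not taken from the dataset.
rows = [
    {"question": "Did the camera move left or right?",
     "answers": ["left", "right"], "correct_answer": "left"},
    {"question": "Is the chair closer than the table?",
     "answers": ["yes", "no"], "correct_answer": "no"},
]

first = bake_answer_order(rows)
second = bake_answer_order(rows)
assert first == second  # same seed gives the same order on every run
```

Because the order is fixed once at dataset build time, the task code only has to read `correct_answer_idx` and never needs its own shuffle.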

Files

`lmms_eval/tasks/sat/sat.yaml` — task config.
`lmms_eval/tasks/sat/utils.py` — doc transforms, `MultiChoiceRegexFilter`, `NumberWordsToDigitsFilter`.
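
As a rough illustration of the kind of letter extraction a multiple-choice regex filter performs, the sketch below pulls a standalone `A` or `B` out of a free-form response. It is not the `MultiChoiceRegexFilter` from this PR: the function name `extract_choice_letter`, the pattern, and the `[invalid]` fallback are assumptions chosen for the example.

```python
import re

# Assumption: options are labelled with the uppercase letters A and B.
_CHOICE = re.compile(r"\b([AB])\b")


def extract_choice_letter(response, fallback="[invalid]"):
    """Return the first standalone option letter in response, else fallback."""
    match = _CHOICE.search(response)
    return match.group(1) if match else fallback


print(extract_choice_letter("(B)"))
print(extract_choice_letter("The answer is B."))
print(extract_choice_letter("Answer: A"))
print(extract_choice_letter("left"))
```

A pattern this simple takes the first standalone letter it sees, so a response such as "A chair is closer, so B" would be read as `A`; a production filter needs more careful handling than this sketch.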

Parity vs. local fork

Qwen3-VL-2B-Instruct, full test split (150 items), 8x H100, greedy decoding.

| Source | exact_match | Stderr | Notes |
| --- | --- | --- | --- |
| Fork | 0.6267 | ±0.0396 | non-deterministic answer order shuffle at load time |
| Upstream | 0.5933 | ±0.0402 | deterministic seed=42 shuffle baked into the dataset |
Per-doc correctness agreement: 78.0% (117/150 same verdict). The 3.3 pp delta (0.6267 vs. 0.5933) is smaller than either run's stderr of about 0.04, which is consistent with binary-MCQ noise on 150 items plus the `qwen3_vl` model-class drift we have observed on prior ports (egotaskqa, egoplan2, metavqa, openxvqa, robo_spatial).
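
The arithmetic behind the "within 1 stderr" claim can be checked with a few lines. This sketch assumes the reported stderr is the sample standard deviation of the 0/1 scores divided by sqrt(n), which reduces to sqrt(p(1-p)/(n-1)) for binary outcomes. The correct-answer counts (94 and 89) are inferred from the reported accuracies times 150; they are not stated in the PR.

```python
import math


def binary_stderr(num_correct, n):
    """Standard error of the mean of n binary scores (sample variance, n - 1)."""
    p = num_correct / n
    return math.sqrt(p * (1 - p) / (n - 1))


n = 150
fork_correct, upstream_correct = 94, 89  # inferred from 0.6267 * 150 and 0.5933 * 150

fork_acc = fork_correct / n
upstream_acc = upstream_correct / n
delta = fork_acc - upstream_acc

print(f"fork     {fork_acc:.4f} ± {binary_stderr(fork_correct, n):.4f}")        # 0.6267 ± 0.0396
print(f"upstream {upstream_acc:.4f} ± {binary_stderr(upstream_correct, n):.4f}")  # 0.5933 ± 0.0402
print(f"delta    {delta:.4f}")                                                   # 0.0333
print(f"agreement {117 / n:.1%}")                                                # 78.0%
```

Under that assumption the formula reproduces both reported stderr values, and the 0.0333 delta is below the smaller of the two (0.0396).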

Background — why this supersedes the earlier draft

A previous draft PR (#1342, closed) hit two blockers that prevented an end-to-end run on upstream main. Both are addressed by the re-host described above; no framework changes are required.

Test plan

`uv run lmms-eval --tasks sat --limit 4` smoke test
Full `test` run on 8x H100 with Qwen3-VL-2B-Instruct; score matches the fork within stderr
Non-streaming `load_dataset("nv-njb/SAT")` returns 150 docs without pyarrow errors
`correct_answer_idx` matches `answers.index(correct_answer)` for all 150 docs
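
The last check in the plan can be expressed as a small helper. This is a sketch, not the script that was run: `find_index_mismatches` is a hypothetical name, and the two in-memory docs are made up so the example runs without downloading anything. In practice the docs would be the rows returned by `load_dataset("nv-njb/SAT")` for the `test` split.

```python
def find_index_mismatches(docs):
    """Return positions of docs whose correct_answer_idx disagrees with answers."""
    mismatches = []
    for position, doc in enumerate(docs):
        try:
            expected = doc["answers"].index(doc["correct_answer"])
        except ValueError:
            expected = None  # correct answer text is missing from the options
        if expected != doc["correct_answer_idx"]:
            mismatches.append(position)
    return mismatches


# Hypothetical docs for illustration; the second one is deliberately wrong.
docs = [
    {"answers": ["left", "right"], "correct_answer": "right", "correct_answer_idx": 1},
    {"answers": ["no", "yes"], "correct_answer": "no", "correct_answer_idx": 1},
]

print(find_index_mismatches(docs))  # [1]
```

An empty list over all 150 docs corresponds to the check passing.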

SAT (Spatial Aptitude Test) is a 150-item, binary-MCQ spatial reasoning benchmark with 1-2 images per item. Items cover object movement, ego movement, action consequences, perspective, and goal-aim reasoning.

Dataset: nv-njb/SAT on Hugging Face, a re-host of array/SAT with two upstream-friendly fixes:

1. Images stored as `Sequence(Image())` so non-streaming `load_dataset` works (the original parquet's nested `list<binary>` for `image_bytes` trips pyarrow's chunked-array conversion).
2. Answer order pre-shuffled with `random.Random(42)` and `correct_answer_idx` baked in, making evaluation deterministic without any framework patch.

Metric: `exact_match` on the MCQ letter (flexible-extract filter).

njb-nvidia wants to merge 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-sat-task