feat: add SAT task #1348

Open
njb-nvidia wants to merge 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-sat-task

@njb-nvidia
Contributor

Summary

Adds SAT (Spatial Aptitude Test) — a 150-item binary-MCQ spatial-reasoning benchmark with 1-2 images per item. Items cover object movement, ego movement, action consequences, perspective, and goal-aim reasoning.

Dataset: nv-njb/SAT — a re-host of the original array/SAT with two upstream-friendly fixes:

  1. Images stored as `Sequence(Image())` so non-streaming `load_dataset` works. The original parquet uses a nested `list<binary>` schema for `image_bytes` that trips pyarrow's chunked-array conversion (`Nested data conversions not implemented for chunked array outputs`). Streaming would dodge this, but lmms-eval's `api/task.py` calls `load_dataset` with `num_proc=1`, which is incompatible with streaming, so neither path works on the original.
  2. Answer order pre-shuffled with `random.Random(42)` and `correct_answer_idx` baked in. The downstream fork's `api/task.py` had a SAT-specific branch that shuffled answer order at load time with a non-deterministic seed; baking it in makes evaluation reproducible without any framework patch.
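
For illustration, the pre-shuffle in fix 2 can be sketched as below. This is a minimal sketch, not the actual re-hosting script: it assumes each doc carries an `answers` list and a `correct_answer` string (the field names used in the test plan), the helper name `bake_answer_order` is hypothetical, and sharing one `random.Random(42)` generator across all docs is an assumption about how the seed was applied.

```python
import random


def bake_answer_order(docs, seed=42):
    """Return copies of docs with answers shuffled reproducibly.

    Each output doc also records correct_answer_idx, the position of
    correct_answer in the shuffled list. Input docs are not mutated.
    """
    rng = random.Random(seed)  # one seeded generator for all docs (assumption)
    baked = []
    for doc in docs:
        answers = list(doc["answers"])  # copy so the input stays untouched
        rng.shuffle(answers)
        baked.append(
            {
                **doc,
                "answers": answers,
                "correct_answer_idx": answers.index(doc["correct_answer"]),
            }
        )
    return baked


# Toy docs for illustration only; these are not items from the dataset.
toy_docs = [
    {"question": "toy question 1", "answers": ["left", "right"], "correct_answer": "left"},
    {"question": "toy question 2", "answers": ["closer", "further"], "correct_answer": "further"},
]

first = bake_answer_order(toy_docs)
second = bake_answer_order(toy_docs)
print(first == second)  # same seed, same order on every run
```

Because the order is fixed once and stored with the data, the evaluation harness only has to read `correct_answer_idx`; no seed handling is needed at load time.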

All other fields (`question`, `question_type`, `correct_answer`) pass through unchanged. See the dataset card for the full schema and a citation.

Files

  • `lmms_eval/tasks/sat/sat.yaml` — task config.
  • `lmms_eval/tasks/sat/utils.py` — doc transforms, `MultiChoiceRegexFilter`, `NumberWordsToDigitsFilter`.
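
As a rough picture of what a letter-extraction filter of this kind does, the sketch below pulls the first standalone option letter out of a model response and scores it by exact match. It is not the PR's `MultiChoiceRegexFilter`: the function names, the `A`/`B` option set (assumed from the binary-MCQ format), and the `[invalid]` fallback are all hypothetical.

```python
import re

# First standalone "A" or "B", optionally wrapped in parentheses.
# Case-sensitive on purpose, so the lowercase article "a" is not picked up.
_LETTER = re.compile(r"\(?\b([A-B])\b\)?")


def extract_choice_letter(response, fallback="[invalid]"):
    """Return the first standalone option letter in response, else fallback."""
    match = _LETTER.search(response.strip())
    return match.group(1) if match else fallback


def exact_match(prediction, gold):
    """Score 1.0 when the extracted letter equals the gold letter."""
    return float(prediction == gold)


print(extract_choice_letter("(B) the object moved right"))
print(exact_match(extract_choice_letter("The answer is A."), "A"))
```

A sentence-initial capital "A" used as an article would still be matched by this pattern, which is one reason a production filter usually tries stricter patterns first.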

Parity vs. local fork

Qwen3-VL-2B-Instruct, full test split (150 items), 8x H100, greedy decoding.

| Source | exact_match | Stderr | Notes |
| --- | --- | --- | --- |
| Fork | 0.6267 | ±0.0396 | non-deterministic answer order shuffle at load time |
| Upstream | 0.5933 | ±0.0402 | deterministic seed=42 shuffle baked into the dataset |

Per-doc correctness agreement: 78.0% (117/150 same verdict). The 3.3pp delta is within 1 stderr (about 4.0pp for either run), consistent with binary-MCQ noise on 150 items plus the `qwen3_vl` model-class drift we have observed on prior ports (egotaskqa, egoplan2, metavqa, openxvqa, robo_spatial).
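
The arithmetic behind the "within 1 stderr" claim can be reproduced as follows. This is a sketch under two assumptions: the correct counts (94 and 89) are inferred from the reported scores rather than stated in the PR, and the standard error uses the sample formula with an `n - 1` denominator, which happens to reproduce the reported values.

```python
import math

N = 150  # items in the test split

# Inferred from the reported scores: 0.6267 * 150 ~ 94, 0.5933 * 150 ~ 89.
FORK_CORRECT = 94
UPSTREAM_CORRECT = 89


def sample_stderr(correct, n):
    """Standard error of a mean of 0/1 scores, using the n - 1 sample variance."""
    p = correct / n
    return math.sqrt(p * (1 - p) / (n - 1))


fork_acc = FORK_CORRECT / N
upstream_acc = UPSTREAM_CORRECT / N
fork_se = sample_stderr(FORK_CORRECT, N)
upstream_se = sample_stderr(UPSTREAM_CORRECT, N)

delta = fork_acc - upstream_acc
within_one_stderr = delta < min(fork_se, upstream_se)
agreement = 117 / N  # docs where both runs reached the same verdict

print(f"fork     {fork_acc:.4f} +/- {fork_se:.4f}")
print(f"upstream {upstream_acc:.4f} +/- {upstream_se:.4f}")
print(f"delta    {delta:.4f}, within one stderr: {within_one_stderr}")
print(f"per-doc agreement {agreement:.1%}")
```

With 150 binary items, one item is worth about 0.67pp, so the 3.3pp gap corresponds to five items.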

Background — why this supersedes the earlier draft

A previous draft PR (#1342, closed) hit two blockers that prevented an end-to-end run on upstream main. Both are addressed by the re-host described above; no framework changes are required.

Test plan

  • `uv run lmms-eval --tasks sat --limit 4` smoke test
  • Full `test` run on 8x H100 with Qwen3-VL-2B-Instruct; score matches the fork within stderr
  • Non-streaming `load_dataset("nv-njb/SAT")` returns 150 docs without pyarrow errors
  • `correct_answer_idx` matches `answers.index(correct_answer)` for all 150 docs
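
The last check in the plan can be expressed as a small helper. This is a sketch only: `find_idx_mismatches` is a hypothetical name, and the toy docs stand in for the real split so the example runs without network access.

```python
def find_idx_mismatches(docs):
    """Return positions of docs whose correct_answer_idx is inconsistent.

    A doc is consistent when answers.index(correct_answer) equals
    correct_answer_idx. A correct_answer missing from answers also
    counts as a mismatch.
    """
    mismatches = []
    for position, doc in enumerate(docs):
        try:
            expected = doc["answers"].index(doc["correct_answer"])
        except ValueError:
            mismatches.append(position)
            continue
        if expected != doc["correct_answer_idx"]:
            mismatches.append(position)
    return mismatches


# Toy docs for illustration only; the second one is deliberately wrong.
toy_docs = [
    {"answers": ["left", "right"], "correct_answer": "right", "correct_answer_idx": 1},
    {"answers": ["closer", "further"], "correct_answer": "closer", "correct_answer_idx": 1},
]

print(find_idx_mismatches(toy_docs))
```

Against the real data, the same helper would be applied to the `test` split returned by `load_dataset("nv-njb/SAT")`, where an empty list is the passing result.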

SAT (Spatial Aptitude Test) is a 150-item, binary-MCQ spatial reasoning
benchmark with 1-2 images per item. Items cover object movement, ego
movement, action consequences, perspective, and goal-aim reasoning.

Dataset: nv-njb/SAT on HuggingFace — a re-host of array/SAT with two
upstream-friendly fixes:

  1. Images stored as Sequence(Image()) so non-streaming load_dataset
     works (the original parquet's nested list<binary> for image_bytes
     trips pyarrow's chunked-array conversion).
  2. Answer order pre-shuffled with random.Random(42) and
     correct_answer_idx baked in, making evaluation deterministic
     without any framework patch.

Metric: exact_match on the MCQ letter (flexible-extract filter).