feat: integrate MathKangaroo benchmark task by Luodian · Pull Request #1158 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-22T12:56:05Z

Summary

- Add a new mathkangaroo task wired to dfkiuser/kangaroo_math_mc_questions (train split), with generation defaults suited to option-letter answers.
- Implement mathkangaroo task utilities for image loading, prompt construction, and robust A-E answer extraction (including mixed labels like C/D).
- Update the task documentation mapping in docs/current_tasks.md to include MathKangaroo.
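
For reviewers who want a feel for the A-E extraction described above, here is a minimal sketch. It is not the code added in this PR. The function names (`extract_choice`, `parse_reference`, `is_correct`) are hypothetical, and it assumes that a mixed label such as C/D means either letter is accepted.

```python
import re
from typing import Optional, Set

# Hypothetical helpers for illustration only; not the utilities shipped in this PR.
_LETTER = re.compile(r"\b([A-E])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone uppercase option letter (A-E), or None."""
    match = _LETTER.search(response)
    return match.group(1) if match else None


def parse_reference(label: str) -> Set[str]:
    """Split a reference label such as "C/D" into its accepted letters.

    Assumption: a mixed label lists several acceptable answers.
    """
    return set(_LETTER.findall(label.upper()))


def is_correct(response: str, label: str) -> bool:
    """True if the extracted letter is one of the accepted reference letters."""
    choice = extract_choice(response)
    return choice is not None and choice in parse_reference(label)


if __name__ == "__main__":
    print(extract_choice("D\n1. Zuerst spellst du das Wort KANGOUROU aus"))
    print(is_correct("C", "C/D"))
```

Matching only uppercase standalone letters keeps ordinary lowercase words from being read as answers. A response that opens with an uppercase article such as "A" would still be misread, so a real implementation probably needs stricter patterns.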

Validation

- uv run python -m lmms_eval --tasks list (confirmed mathkangaroo appears)
- uv run python -m lmms_eval --model dummy_video_reader --model_args response=A --tasks mathkangaroo --limit 8 --batch_size 1 --output_path ./logs/mathkangaroo_smoke
  - Run succeeded and reported mathkangaroo_accuracy = 0.125 on the smoke subset.
- uv run pre-commit run --all-files (passed)
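
As a sanity check on the 0.125 figure: with a dummy model that always answers A and a limit of 8, the score is consistent with exactly one of the eight references being A (1/8). The sketch below assumes plain exact-match accuracy. The `accuracy` helper and the reference list are illustrative, not the dataset's actual labels.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference letter."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(predictions)


if __name__ == "__main__":
    # Illustrative references only: exactly one of the eight is "A".
    references = ["D", "B", "A", "C", "E", "C", "D", "B"]
    predictions = ["A"] * len(references)  # constant-answer dummy model
    print(accuracy(predictions, references))
```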

Tracking

Closes [Benchmark Backfill] Integrate MathKangaroo into lmms-eval #1135
Linear: LMM-286

Smoke Validation (limit=8)

Status: PASS (LMM-286 / mathkangaroo)

Output Table

Metric	Value
mathkangaroo_accuracy	0.75

Sample Output

Sample 1 (doc_id: 0)

Input: 1. Wenn die Buchstaben A, G, K, N, O, R, U die folgenden Werte haben: A = 2, G = 2, K = 10, N = 7, O = 0, R = 3, U = 1, welchen Wert hat dann das Wort KANGOUROU (so heißt Känguruh auf französisch)? ↵ Answer with the option letter (A, B, C, D, or E) only.
Model Output: D ↵ 1. Zuerst spellst du das Wort KANGOUROU aus: K, A, N, G, O, U, R, O, U. ↵ 2. Nimm die Werte der Buchstaben: K=10, A=2, N=7, G
Reference: D
Scores: mathkangaroo_accuracy = 1.0
Tokens: output=589, reasoning=525

Sample 2 (doc_id: 1)

Input: Ich denke mir eine Zahl. Dann subtrahiere ich 40 davon und addiere zum Resultat 2000. Ich erhalte 3250. Welche Zahl hatte ich mir gedacht? ↵ Answer with the option letter (A, B, C, D, or E) only.
Model Output: B ↵ Schritt-für-Schritt-Erklärung: ↵ Lass die gedachte Zahl ( x ) sein. Nach den Operationen lautet die Gleichung: ↵ [ ↵ (x - 40) + 2000 = 3250 ↵ ] ↵ Vereinfachen
Reference: B
Scores: mathkangaroo_accuracy = 1.0
Tokens: output=516, reasoning=452

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks mathkangaroo --batch_size 1 --limit 8 --log_samples

feat: integrate MathKangaroo benchmark task (#1135)

0672856

Luodian merged commit e7c89c2 into dev-v0d7 Feb 23, 2026
2 checks passed

Luodian deleted the feat/lmm-286-mathkangaroo branch February 23, 2026 08:24

Luodian added a commit that referenced this pull request Feb 28, 2026

feat: integrate MathKangaroo benchmark task (#1135) (#1158)

8d79eb6

Luodian merged 1 commit into dev-v0d7 from feat/lmm-286-mathkangaroo
