feat: integrate MathKangaroo benchmark task #1158

Merged
Luodian merged 1 commit into dev-v0d7 from feat/lmm-286-mathkangaroo
Feb 23, 2026

@Luodian Luodian commented Feb 22, 2026

Summary

  • Add a new mathkangaroo task wired to dfkiuser/kangaroo_math_mc_questions (train split) with generation defaults for option-letter answers.
  • Implement mathkangaroo task utilities for image loading, prompt construction, and robust A-E answer extraction (including mixed labels like C/D).
  • Update task documentation mapping in docs/current_tasks.md to include MathKangaroo.
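The prompt construction and answer extraction described above can be sketched as follows. This is a minimal illustration, not the PR's actual utilities: the function names (`build_prompt`, `extract_choice`, `parse_labels`, `score`) are hypothetical, and the handling of mixed labels such as C/D assumes that either letter counts as correct, which the PR description does not spell out. The post-prompt string is taken from the sample inputs in this PR.

```python
import re
from typing import Optional, Set

# Post-prompt as it appears in the sample inputs of this PR.
POST_PROMPT = "Answer with the option letter (A, B, C, D, or E) only."

# A standalone option letter: A-E not glued to other letters or digits.
_LETTER = re.compile(r"(?<![A-Za-z0-9])([A-E])(?![A-Za-z0-9])")


def build_prompt(question: str) -> str:
    """Append the option-letter instruction on its own line (hypothetical helper)."""
    return f"{question.strip()}\n{POST_PROMPT}"


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone A-E letter on the first non-empty line.

    Only the first non-empty line is inspected, so trailing explanations
    after the answer letter are ignored. Returns None if no letter is found.
    """
    for line in response.splitlines():
        line = line.strip()
        if line:
            match = _LETTER.search(line)
            return match.group(1) if match else None
    return None


def parse_labels(label: str) -> Set[str]:
    """Split a reference label such as 'C/D' into its accepted letters.

    Assumption: a mixed label means any listed letter is accepted.
    """
    return set(_LETTER.findall(label.upper()))


def score(response: str, label: str) -> float:
    """Score 1.0 if the extracted letter is among the accepted labels, else 0.0."""
    choice = extract_choice(response)
    return 1.0 if choice is not None and choice in parse_labels(label) else 0.0


print(extract_choice("D\n1. Zuerst ..."))
print(score("C", "C/D"))
```

One known limitation of this sketch: a first line that starts with the English article "A" (for example "A quick check gives C") would be read as answer A, so a real implementation would likely need stricter patterns.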

Validation

  • uv run python -m lmms_eval --tasks list (confirmed mathkangaroo appears)
  • uv run python -m lmms_eval --model dummy_video_reader --model_args response=A --tasks mathkangaroo --limit 8 --batch_size 1 --output_path ./logs/mathkangaroo_smoke
    • Run succeeded and reported mathkangaroo_accuracy = 0.125 on the smoke subset.
  • uv run pre-commit run --all-files (passed)
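The 0.125 figure is what a constant-answer baseline would produce if accuracy is the mean exact-match score and exactly one of the eight references is A. The sketch below illustrates that arithmetic under those assumptions; the `accuracy` helper is hypothetical and the reference letters are invented for illustration, not taken from the dataset.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Mean exact-match accuracy over paired predictions and references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)


# Invented reference letters for an 8-item smoke subset (exactly one "A").
references = ["D", "B", "A", "C", "E", "B", "D", "C"]

# A dummy model that always answers "A", as with response=A above.
constant_a = ["A"] * len(references)

print(accuracy(constant_a, references))  # 1 hit out of 8 -> 0.125
```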

Tracking

Smoke Validation (limit=8)

Status: PASS (LMM-286 / mathkangaroo)

Output Table

| Metric | Value |
| --- | --- |
| mathkangaroo_accuracy | 0.75 |

Sample Output

Sample 1 (doc_id: 0)

  • Input: 1. Wenn die Buchstaben A, G, K, N, O, R, U die folgenden Werte haben: A = 2, G = 2, K = 10, N = 7, O = 0, R = 3, U = 1, welchen Wert hat dann das Wort KANGOUROU (so heißt Känguruh auf französisch)? ↵ Answer with the option letter (A, B, C, D, or E) only.
  • Model Output: D ↵ 1. Zuerst spellst du das Wort KANGOUROU aus: K, A, N, G, O, U, R, O, U. ↵ 2. Nimm die Werte der Buchstaben: K=10, A=2, N=7, G
  • Reference: D
  • Scores: mathkangaroo_accuracy = 1.0
  • Tokens: output=589, reasoning=525

Sample 2 (doc_id: 1)

  • Input: Ich denke mir eine Zahl. Dann subtrahiere ich 40 davon und addiere zum Resultat 2000. Ich erhalte 3250. Welche Zahl hatte ich mir gedacht? ↵ Answer with the option letter (A, B, C, D, or E) only.
  • Model Output: B ↵ Schritt-für-Schritt-Erklärung: ↵ Lass die gedachte Zahl ( x ) sein. Nach den Operationen lautet die Gleichung: ↵ [ ↵ (x - 40) + 2000 = 3250 ↵ ] ↵ Vereinfachen
  • Reference: B
  • Scores: mathkangaroo_accuracy = 1.0
  • Tokens: output=516, reasoning=452

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks mathkangaroo --batch_size 1 --limit 8 --log_samples

@Luodian Luodian merged commit e7c89c2 into dev-v0d7 Feb 23, 2026
2 checks passed
@Luodian Luodian deleted the feat/lmm-286-mathkangaroo branch February 23, 2026 08:24