feat: add mme-cc benchmark task #1185

Merged: Luodian merged 1 commit into dev-v0d7 from feat/lmm-299-mme-cc on Feb 23, 2026

Luodian commented on Feb 22, 2026

Summary

  • add a new mme_cc benchmark task under lmms_eval/tasks/mme_cc
  • load all 12 MME-CC subtasks from MaxwellWen/MME-CC JSON files and keep task auto-discovery via YAML
  • implement prompt/visual/target helpers plus exact-match and answered-rate metrics, and include a task README with a smoke command
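The two metrics named above can be sketched roughly as follows. This is a minimal illustration, not the code in lmms_eval/tasks/mme_cc/utils.py: the function names and the whitespace normalization are assumptions, since the PR description does not show how the helpers normalize text.

```python
# Hypothetical sketch of the two metrics; names and normalization are
# assumptions, not the actual helpers from utils.py.

def normalize(text: str) -> str:
    """Trim each line and drop blank lines so layout noise is ignored."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return "\n".join(line for line in lines if line)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def answered(prediction: str) -> float:
    """1.0 if the model produced any non-whitespace output."""
    return 1.0 if prediction.strip() else 0.0


def mean_score(scores: list[float]) -> float:
    """Unweighted mean over samples (0.0 for an empty list)."""
    return sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, a fixed response such as "yes" always counts as answered but matches a reference only when the reference is literally "yes", which is consistent with a dummy run scoring 1 on answered rate and 0 on exact match.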

Validation

  • uv run pre-commit run --files lmms_eval/tasks/mme_cc/mme_cc.yaml lmms_eval/tasks/mme_cc/utils.py lmms_eval/tasks/mme_cc/README.md (pass)
  • HF_TOKEN= uv run python -m lmms_eval --model dummy_video_reader --model_args response=yes --tasks mme_cc --limit 8 --batch_size 1 --log_samples --output_path /tmp/mme_cc_smoke
    • score table:
      • mme_cc_answered_rate = 1
      • mme_cc_exact_match = 0
    • output files:
      • /tmp/mme_cc_smoke/20260222_225753_results.json
      • /tmp/mme_cc_smoke/20260222_225753_samples_mme_cc.jsonl
    • JSONL sample check: filtered_resps entries are non-empty (e.g., "yes" for doc_id 0-7)
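The JSONL sample check could be scripted along these lines. This is a hedged sketch: the field names filtered_resps and doc_id come from the check described above, but the assumption that filtered_resps is either a string or a list of strings, and the helper name itself, are illustrative only.

```python
import json


def empty_response_doc_ids(jsonl_lines):
    """Return doc_ids whose filtered_resps is missing or contains an empty entry.

    Assumes each line is one JSON object with 'doc_id' and 'filtered_resps'
    (a string or a list of strings); this layout is an assumption.
    """
    bad = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        resps = record.get("filtered_resps", [])
        if isinstance(resps, str):
            resps = [resps]
        if not resps or any(not str(r).strip() for r in resps):
            bad.append(record.get("doc_id"))
    return bad
```

A typical use would be to open /tmp/mme_cc_smoke/20260222_225753_samples_mme_cc.jsonl, pass the file object to empty_response_doc_ids, and expect an empty list back.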

Smoke Validation (limit=8)

Status: PASS (LMM-299 / mme_cc)

Output Table

| Metric | Value |
| --- | --- |
| mme_cc_exact_match | 0.125 |
| mme_cc_answered_rate | 1.000 |
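As a quick sanity check on the table, and assuming both metrics are unweighted means over the 8 limited samples (an assumption, the aggregation is not shown in this PR), the values translate back to sample counts like this:

```python
# Assumption: each metric is an unweighted mean over the --limit 8 samples.
limit = 8
exact_match_mean = 0.125
answered_rate_mean = 1.000

# Convert the means back to per-run sample counts.
matching_samples = round(exact_match_mean * limit)
answered_samples = round(answered_rate_mean * limit)

print(f"exact matches: {matching_samples} of {limit}")
print(f"answered: {answered_samples} of {limit}")
```

Under that assumption, the run produced one exact match out of eight answered samples.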

Sample Output

Sample 1 (doc_id: 0)

  • Input: Please list the math section of this image in the following format and swap the scores of deepseek-V3 and GPT-4o. Do not include any content other than this: ↵ ↵ Model name: Score ↵ Model name: Score ↵ Model name: Score ↵ ... ↵ Return only your final answer with no explanation.
  • Model Output: Grok-3: 52 ↵ Grok-3 mini: 40 ↵ Gemini-2 Pro: 36 ↵ DeepSeek-V3: 9 ↵ Claude 3.5 Sonnet: 16 ↵ GPT-4o: 39
  • Reference: Grok-3: 52 ↵ Grok-3 mini: 40 ↵ Gemini-2 Pro: 36 ↵ GPT-4o: 39 ↵ Claude 3.5 Sonnet: 16 ↵ DeepSeek-V3: 9
  • Scores: mme_cc_exact_match = 0.0 (total: 1.0, subtask: Chart_Modification) · mme_cc_answered_rate = 1.0 (total: 1.0)
  • Tokens: output=333, reasoning=275
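Sample 1 scores 0.0 even though the model output and the reference contain the same six "Model name: Score" pairs; only the line order differs. The snippet below reproduces that observation, assuming each ↵ above stands for a newline and that the metric compares the full text (the real normalization in utils.py may differ).

```python
# Lines copied from Sample 1 above; "↵" is assumed to stand for a newline.
model_output = "\n".join([
    "Grok-3: 52",
    "Grok-3 mini: 40",
    "Gemini-2 Pro: 36",
    "DeepSeek-V3: 9",
    "Claude 3.5 Sonnet: 16",
    "GPT-4o: 39",
])
reference = "\n".join([
    "Grok-3: 52",
    "Grok-3 mini: 40",
    "Gemini-2 Pro: 36",
    "GPT-4o: 39",
    "Claude 3.5 Sonnet: 16",
    "DeepSeek-V3: 9",
])

# Order-sensitive comparison, as a full-text exact match would do.
exact_text_match = model_output == reference

# Order-insensitive comparison of the individual lines.
same_pairs = sorted(model_output.splitlines()) == sorted(reference.splitlines())

print(exact_text_match, same_pairs)
```

An order-sensitive comparison rejects this output, while an order-insensitive one would accept it.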

Sample 2 (doc_id: 1)

  • Input: Extract the corresponding data of the blue bars from Store 1 to Store 5 in the following format, and exchange the maximum and minimum values of the blue bar data without including any additional content ↵ Store 1: xx ↵ Store 2: xx ↵ ... ↵ Return only your final answer with no explanation.
  • Model Output: Store 1: 25 ↵ Store 2: 32 ↵ Store 3: 1 ↵ Store 4: 30 ↵ Store 5: 37
  • Reference: Store 1: 25 ↵ Store 2: 32 ↵ Store 3: 1 ↵ Store 4: 30 ↵ Store 5: 37
  • Scores: mme_cc_exact_match = 1.0 (total: 1.0, subtask: Chart_Modification) · mme_cc_answered_rate = 1.0 (total: 1.0)
  • Tokens: output=351, reasoning=313

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks mme_cc --batch_size 1 --limit 8 --log_samples

Luodian merged commit 0c3821d into dev-v0d7 on Feb 23, 2026
2 checks passed
Luodian deleted the feat/lmm-299-mme-cc branch on February 23, 2026 08:25
Luodian added a commit that referenced this pull request Feb 28, 2026