feat: add mme-cc benchmark task by Luodian · Pull Request #1185 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-22T14:59:24Z

Summary

- Add a new mme_cc benchmark task under lmms_eval/tasks/mme_cc.
- Load all 12 MME-CC subtasks from the MaxwellWen/MME-CC JSON files and keep task auto-discovery via YAML.
- Implement prompt/visual/target helpers plus exact-match and answered-rate metrics, and include a task README with a smoke command.
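
The metric code itself is not shown on this page. The following is a minimal sketch of how exact-match and answered-rate scores of this kind are commonly computed. The function names (`normalize`, `exact_match`, `answered`, `mean`) and the whitespace normalization are assumptions for illustration, not the contents of lmms_eval/tasks/mme_cc/utils.py.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Collapse whitespace on each line and drop blank lines (assumed rule)."""
    lines = [" ".join(line.split()) for line in text.strip().splitlines()]
    return "\n".join(line for line in lines if line)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 when the normalized prediction equals the normalized reference."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def answered(prediction: str) -> float:
    """1.0 when the model produced any non-whitespace text."""
    return 1.0 if prediction.strip() else 0.0


def mean(scores: Iterable[float]) -> float:
    """Average per-sample scores; an empty input yields 0.0."""
    scores = list(scores)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical (prediction, reference) pairs, not taken from the dataset.
    pairs = [
        ("yes", "Store 1: 25"),
        ("Store 1: 25 ", "Store 1: 25"),
        ("", "no"),
    ]
    print("exact_match:", mean(exact_match(p, r) for p, r in pairs))
    print("answered_rate:", mean(answered(p) for p, _ in pairs))
```

Under these assumptions, a model that always replies "yes" scores 1 on answered rate while scoring 0 on exact match whenever no reference is literally "yes", which matches the pattern of the dummy-model smoke run reported under Validation.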

Validation

uv run pre-commit run --files lmms_eval/tasks/mme_cc/mme_cc.yaml lmms_eval/tasks/mme_cc/utils.py lmms_eval/tasks/mme_cc/README.md (pass)
HF_TOKEN= uv run python -m lmms_eval --model dummy_video_reader --model_args response=yes --tasks mme_cc --limit 8 --batch_size 1 --log_samples --output_path /tmp/mme_cc_smoke
- score table:
  - mme_cc_answered_rate = 1
  - mme_cc_exact_match = 0
- output files:
  - /tmp/mme_cc_smoke/20260222_225753_results.json
  - /tmp/mme_cc_smoke/20260222_225753_samples_mme_cc.jsonl
- JSONL sample check: filtered_resps entries are non-empty (e.g., "yes" for doc_id 0-7)
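
The JSONL sample check above could be scripted roughly as follows. This is a sketch under assumptions: it relies only on the `doc_id` and `filtered_resps` fields named in this PR, the helper name `empty_response_doc_ids` is hypothetical, and the handling of nested lists is a guess about how responses may be stored.

```python
import json
from pathlib import Path


def empty_response_doc_ids(jsonl_path):
    """Return the doc_ids whose filtered_resps are missing or blank."""
    bad = []
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines between records
            record = json.loads(line)
            resps = record.get("filtered_resps") or []
            if isinstance(resps, str):
                resps = [resps]
            # Flatten one level, in case responses are stored as nested lists
            # (an assumption about the log layout).
            flat = []
            for resp in resps:
                flat.extend(resp if isinstance(resp, list) else [resp])
            if not flat or any(not str(resp).strip() for resp in flat):
                bad.append(record.get("doc_id"))
    return bad


if __name__ == "__main__":
    # Path copied from the output file list above.
    path = "/tmp/mme_cc_smoke/20260222_225753_samples_mme_cc.jsonl"
    if Path(path).exists():
        print("doc_ids with empty responses:", empty_response_doc_ids(path))
```

An empty list returned for the samples file would correspond to the "non-empty" observation recorded above.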

Smoke Validation (limit=8)

Status: PASS (LMM-299 / mme_cc)

Output Table

Metric	Value
mme_cc_exact_match	0.125
mme_cc_answered_rate	1.000

Sample Output

Sample 1 (doc_id: 0)

Input: Please list the math section of this image in the following format and swap the scores of deepseek-V3 and GPT-4o. Do not include any content other than this: ↵ ↵ Model name: Score ↵ Model name: Score ↵ Model name: Score ↵ ... ↵ Return only your final answer with no explanation.
Model Output: Grok-3: 52 ↵ Grok-3 mini: 40 ↵ Gemini-2 Pro: 36 ↵ DeepSeek-V3: 9 ↵ Claude 3.5 Sonnet: 16 ↵ GPT-4o: 39
Reference: Grok-3: 52 ↵ Grok-3 mini: 40 ↵ Gemini-2 Pro: 36 ↵ GPT-4o: 39 ↵ Claude 3.5 Sonnet: 16 ↵ DeepSeek-V3: 9
Scores: mme_cc_exact_match = 0.0 (total: 1.0, subtask: Chart_Modification) · mme_cc_answered_rate = 1.0 (total: 1.0)
Tokens: output=333, reasoning=275
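
To see why Sample 1 scores 0.0 on exact match, the sketch below compares the model output and the reference shown above, assuming each ↵ stands for a newline. The two texts contain the same six lines, but the GPT-4o and DeepSeek-V3 rows sit in swapped positions, so a strict string comparison fails. The helper `same_lines_any_order` is hypothetical and only illustrates the difference; it is not part of the task's metrics.

```python
model_output = "\n".join([
    "Grok-3: 52",
    "Grok-3 mini: 40",
    "Gemini-2 Pro: 36",
    "DeepSeek-V3: 9",
    "Claude 3.5 Sonnet: 16",
    "GPT-4o: 39",
])

reference = "\n".join([
    "Grok-3: 52",
    "Grok-3 mini: 40",
    "Gemini-2 Pro: 36",
    "GPT-4o: 39",
    "Claude 3.5 Sonnet: 16",
    "DeepSeek-V3: 9",
])


def same_lines_any_order(a: str, b: str) -> bool:
    """Compare two multi-line answers while ignoring line order."""
    return sorted(a.splitlines()) == sorted(b.splitlines())


# 1-based positions where the two answers disagree line by line.
differing_positions = [
    index
    for index, (out_line, ref_line) in enumerate(
        zip(model_output.splitlines(), reference.splitlines()), start=1
    )
    if out_line != ref_line
]

print(model_output == reference)                      # False
print(same_lines_any_order(model_output, reference))  # True
print(differing_positions)                            # [4, 6]
```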

Sample 2 (doc_id: 1)

Input: Extract the corresponding data of the blue bars from Store 1 to Store 5 in the following format, and exchange the maximum and minimum values of the blue bar data without including any additional content ↵ Store 1: xx ↵ Store 2: xx ↵ ... ↵ Return only your final answer with no explanation.
Model Output: Store 1: 25 ↵ Store 2: 32 ↵ Store 3: 1 ↵ Store 4: 30 ↵ Store 5: 37
Reference: Store 1: 25 ↵ Store 2: 32 ↵ Store 3: 1 ↵ Store 4: 30 ↵ Store 5: 37
Scores: mme_cc_exact_match = 1.0 (total: 1.0, subtask: Chart_Modification) · mme_cc_answered_rate = 1.0 (total: 1.0)
Tokens: output=351, reasoning=313

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks mme_cc --batch_size 1 --limit 8 --log_samples

feat: add mme-cc benchmark task

0b961dc

Luodian merged commit 0c3821d into dev-v0d7 Feb 23, 2026
2 checks passed

Luodian deleted the feat/lmm-299-mme-cc branch February 23, 2026 08:25

Luodian added a commit that referenced this pull request Feb 28, 2026

feat: add mme-cc benchmark task (#1185)

a1353c4
