feat: add ZeroBench benchmark task by Luodian · Pull Request #1182 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-22T14:56:35Z

Summary

- Add a new zerobench benchmark task under lmms_eval/tasks/zerobench using YAML auto-discovery.
- Add task utilities for multi-image extraction, prompt formatting, and normalized exact-match scoring with aggregation.
- Add a task README with benchmark context, paper link, dataset link, and usage.
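
To illustrate what "normalized exact-match scoring with aggregation" could look like, here is a minimal sketch. It is not the code in lmms_eval/tasks/zerobench/utils.py: the function names, the normalization rules (trim, lowercase, collapse whitespace, drop a trailing period), and the use of a plain mean for aggregation are all assumptions made for illustration.

```python
import re
from statistics import mean


def normalize_answer(text: str) -> str:
    """Hypothetical normalization: trim, lowercase, collapse whitespace,
    and drop trailing periods. The real task may normalize differently."""
    text = text.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".")


def exact_match(prediction: str, target: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize_answer(prediction) == normalize_answer(target) else 0.0


def aggregate_exact_match(scores: list[float]) -> float:
    """Assumed aggregation: the mean of per-sample scores (0.0 if empty)."""
    return mean(scores) if scores else 0.0
```

Under these assumed rules, the prediction 1.25 against the reference 11.90 scores 0.0, and a long free-form response scores 0.0 against a short numeric reference, which matches the behavior shown in the sample output of this PR.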

Validation

lsp_diagnostics lmms_eval/tasks/zerobench/utils.py -> clean
smoke command:
- HF_TOKEN='' HUGGINGFACE_HUB_TOKEN='' uv run python -m lmms_eval --model dummy_video_reader --model_args response=42 --tasks zerobench --limit 8 --batch_size 1 --log_samples --output_path outputs/zerobench_smoke
score table output includes:
- |zerobench|0|none|0|zerobench_exact_match|↑|0|±|N/A|0|
output artifacts generated:
- outputs/zerobench_smoke/20260222_225459_results.json
- outputs/zerobench_smoke/20260222_225459_samples_zerobench.jsonl
JSONL evidence of non-empty model outputs:
- sample rows contain "filtered_resps": "42" with populated zerobench_exact_match payload
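
The JSONL check above can be scripted. The sketch below assumes each line of the samples file is one JSON object with top-level `filtered_resps` and `zerobench_exact_match` keys, as the evidence line suggests; the exact nesting in real lmms-eval output may differ, and `check_samples` is a hypothetical helper, not part of lmms-eval.

```python
import json
from pathlib import Path


def check_samples(path: str) -> dict:
    """Report rows with an empty response or a missing score payload.

    Assumes one JSON object per line with top-level 'filtered_resps'
    and 'zerobench_exact_match' keys (an assumption about the layout).
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    # A falsy value (missing key, empty string, empty list) counts as empty.
    empty = [i for i, row in enumerate(rows) if not row.get("filtered_resps")]
    missing = [i for i, row in enumerate(rows) if "zerobench_exact_match" not in row]
    return {"rows": len(rows), "empty_responses": empty, "missing_scores": missing}
```

Calling `check_samples` on the samples file listed under the output artifacts should return empty `empty_responses` and `missing_scores` lists if the smoke run behaved as described.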

Smoke Validation (limit=8)

Status: PASS (LMM-299 / zerobench)

Output Table

| Metric | Value |
|---|---|
| zerobench_exact_match | 0.0 |

Sample Output

Sample 1 (doc_id: 0)

Input: I want to purchase all the Montellier bottles from the top three shelves. How much do I save by purchasing the bottles with a loyalty card? Give your final answer in dollars. ↵ Answer with only the final answer.
Model Output: 1.25
Reference: 11.90
Scores: zerobench_exact_match = 0.0 (question_id: 1, prediction: 1.25, target: 11.90)
Tokens: output=4, reasoning=0

Sample 2 (doc_id: 1)

Input: Image 1 shows the three sample types of screws for reference. Based on this reference, we assign scores to the screws as follows: largest=1, middle=2, small=3. Now take a look at image 2. ↵ ↵ (1) Find the total score of the screws ↵ (2) Find the score of the screws that are completely below the sh…
Model Output: Here is the analysis based on the provided images and scoring system: ↵ ↵ Scoring Reference (from Image 1): ↵ * Largest screw (Longest, Cylindrical Head): Score 1 ↵ * Middle screw (Medium length, Cylindrical Head): Score 2 ↵ * Small screw (Shortest, Countersunk/Flat Head): Score 3 ↵ ↵ Analysis of Image 2: ↵ ↵ First, we identify all the screws present in Image 2 and assign their scores based on the reference: ↵ ↵ 1. Longest screw (Score 1): Top left. ↵ 2. **Medium
Reference: 2704
Scores: zerobench_exact_match = 0.0 (question_id: 2, prediction: Here is the analysis based on the provided images and scoring system: ↵ ↵ Scoring Reference (from Image 1): ↵ * Largest screw (Longest, Cylindrical Head): Score 1 ↵ * Middle screw (Medium length, Cylindrical Head): Score 2 ↵ * Small screw (Shortest, Countersunk/Flat Head): Score 3 ↵ ↵ Analysis of Image 2: ↵ ↵ First, we identify all the screws present in Image 2 and assign their scores based on the reference: ↵ ↵ 1. Longest screw (Score 1): Top left. ↵ 2. **Medium, target: 2704)
Tokens: output=127, reasoning=0

Test Params

uv run python -m lmms_eval --model openai --model_args "model_version=google/gemini-2.5-flash-lite-preview-09-2025" --tasks zerobench --batch_size 1 --limit 8 --log_samples

feat: add ZeroBench benchmark task

3be029b

Luodian merged commit a6b70f9 into dev-v0d7 Feb 23, 2026
2 checks passed

Luodian deleted the feat/lmm-299-zerobench branch February 23, 2026 08:25

Luodian added a commit that referenced this pull request Feb 28, 2026

feat: add ZeroBench benchmark task (#1182)

e40e04e
