
feat: add ZeroBench benchmark task #1182

Merged
Luodian merged 1 commit into dev-v0d7 from feat/lmm-299-zerobench on Feb 23, 2026

@Luodian Luodian commented Feb 22, 2026

Summary

  • add a new zerobench benchmark task under lmms_eval/tasks/zerobench using YAML auto-discovery
  • add task utilities for multi-image extraction, prompt formatting, and normalized exact-match scoring with aggregation
  • add task README with benchmark context, paper link, dataset link, and usage
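The scoring described in the second bullet can be sketched as follows. This is a minimal illustration, not the code in lmms_eval/tasks/zerobench/utils.py: the function names (`normalize_answer`, `exact_match`, `aggregate`) and the specific normalization rules (lowercasing, whitespace collapsing, trailing-period removal) are assumptions made for the example.

```python
import re
from statistics import mean


def normalize_answer(text) -> str:
    """Lowercase, trim, collapse whitespace, and drop trailing periods (assumed rules)."""
    text = str(text).strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".").strip()


def exact_match(prediction, target) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize_answer(prediction) == normalize_answer(target) else 0.0


def aggregate(scores) -> float:
    """Average per-sample scores; an empty run is reported as 0.0."""
    scores = list(scores)
    return mean(scores) if scores else 0.0
```

Under these assumed rules the comparison is purely textual, so numerically equal strings such as "11.9" and "11.90" would not match.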

Validation

  • lsp_diagnostics lmms_eval/tasks/zerobench/utils.py -> clean
  • smoke command:
    • HF_TOKEN='' HUGGINGFACE_HUB_TOKEN='' uv run python -m lmms_eval --model dummy_video_reader --model_args response=42 --tasks zerobench --limit 8 --batch_size 1 --log_samples --output_path outputs/zerobench_smoke
  • score table output includes:
    • |zerobench|0|none|0|zerobench_exact_match|↑|0|±|N/A|0|
  • output artifacts generated:
    • outputs/zerobench_smoke/20260222_225459_results.json
    • outputs/zerobench_smoke/20260222_225459_samples_zerobench.jsonl
  • JSONL evidence of non-empty model outputs:
    • sample rows contain "filtered_resps": "42" with populated zerobench_exact_match payload
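The JSONL check in the last bullet could be automated along these lines. This is a hedged sketch: only the `filtered_resps` key is taken from the output above, the helper name `count_nonempty_responses` is hypothetical, and handling both string and list values is an assumption about how the field may be serialized.

```python
import json


def count_nonempty_responses(lines):
    """Return (non_empty, total) over JSONL rows, judged by "filtered_resps"."""
    non_empty = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        row = json.loads(line)
        total += 1
        resp = row.get("filtered_resps")
        # Assumption: the field is either a single string or a list of strings.
        if isinstance(resp, list):
            has_text = any(str(r).strip() for r in resp)
        else:
            has_text = bool(str(resp or "").strip())
        non_empty += int(has_text)
    return non_empty, total
```

Passing an open handle on the samples file listed above (for example `open("outputs/zerobench_smoke/20260222_225459_samples_zerobench.jsonl")`) should yield equal counts when every row has a populated response.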

Smoke Validation (limit=8)

Status: PASS (LMM-299 / zerobench)

Output Table

| Metric | Value |
|---|---|
| zerobench_exact_match | 0.0 |

Sample Output

Sample 1 (doc_id: 0)

  • Input: I want to purchase all the Montellier bottles from the top three shelves. How much do I save by purchasing the bottles with a loyalty card? Give your final answer in dollars. ↵ Answer with only the final answer.
  • Model Output: 1.25
  • Reference: 11.90
  • Scores: zerobench_exact_match = 0.0 (question_id: 1, prediction: 1.25, target: 11.90)
  • Tokens: output=4, reasoning=0
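The Input above appears to be the dataset question followed by a fixed instruction on a new line. A minimal sketch of that kind of prompt formatting is shown here; the function name `format_prompt` and the `pre_prompt`/`post_prompt` parameters are hypothetical and not taken from the task's utilities, and only the suffix text comes from the sample.

```python
# Suffix observed at the end of the sample input above.
POST_PROMPT = "Answer with only the final answer."


def format_prompt(question: str, pre_prompt: str = "", post_prompt: str = POST_PROMPT) -> str:
    """Join an optional prefix, the question, and an optional suffix with newlines."""
    parts = [pre_prompt.strip(), question.strip(), post_prompt.strip()]
    # Empty parts are dropped so no stray blank lines are emitted.
    return "\n".join(p for p in parts if p)
```

Keeping the suffix as a parameter would let a task config override the instruction without changing the formatting logic.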

Sample 2 (doc_id: 1)

  • Input: Image 1 shows the three sample types of screws for reference. Based on this reference, we assign scores to the screws as follows: largest=1, middle=2, small=3. Now take a look at image 2. ↵ ↵ (1) Find the total score of the screws ↵ (2) Find the score of the screws that are completely below the sh…
  • Model Output: Here is the analysis based on the provided images and scoring system: ↵ ↵ Scoring Reference (from Image 1): ↵ * Largest screw (Longest, Cylindrical Head): Score 1 ↵ * Middle screw (Medium length, Cylindrical Head): Score 2 ↵ * Small screw (Shortest, Countersunk/Flat Head): Score 3 ↵ ↵ Analysis of Image 2: ↵ ↵ First, we identify all the screws present in Image 2 and assign their scores based on the reference: ↵ ↵ 1. Longest screw (Score 1): Top left. ↵ 2. **Medium
  • Reference: 2704
  • Scores: zerobench_exact_match = 0.0 (question_id: 2, prediction: identical to the truncated Model Output above, target: 2704)
  • Tokens: output=127, reasoning=0

Test Params

uv run python -m lmms_eval --model openai --model_args "model_version=google/gemini-2.5-flash-lite-preview-09-2025" --tasks zerobench --batch_size 1 --limit 8 --log_samples

@Luodian Luodian merged commit a6b70f9 into dev-v0d7 Feb 23, 2026
2 checks passed
@Luodian Luodian deleted the feat/lmm-299-zerobench branch February 23, 2026 08:25
Luodian added a commit that referenced this pull request Feb 28, 2026