Add WISE Benchmark Task (#1301)

Purshow · mwxely · claude · web-flow · commit b3d9dff68e1c · 2026-04-20T12:45:45.000+08:00
* Add WISE Benchmark Task

* style: fix black and isort formatting for WISE utils

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address WISE task review feedback

* fix: use Hugging Face WISE dataset

* update wise test

---------

Co-authored-by: mwxely <mwxely@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/lmms_eval/tasks/WISE/README.md b/lmms_eval/tasks/WISE/README.md
@@ -0,0 +1,72 @@
+# WISE
+
+WISE is a knowledge-intensive text-to-image benchmark that evaluates whether models can use commonsense, cultural, scientific, spatial, and temporal knowledge to generate correct images.
+
+- Paper: https://arxiv.org/abs/2503.07265
+- Dataset: https://huggingface.co/datasets/Yuwei-Niu/WISE
+
+## Overview
+
+**Dataset:** 1000 prompts loaded from the Hugging Face `test` split across 6 categories (Culture, Time, Space, Biology, Physics, Chemistry)
+
+**Evaluation:** GPT-4o judges each generated image on three dimensions:
+- **Consistency** (0-2): How well the image matches the prompt
+- **Realism** (0-2): Visual quality and photorealism
+- **Aesthetic Quality** (0-2): Artistic appeal and composition
+
+**WiScore Formula:** `(0.7 × consistency + 0.2 × realism + 0.1 × aesthetic) / 2`
+
+**Final Score:** Weighted average of the per-category WiScores, with weights equal to each category's share of the 1000 prompts (Culture: 0.4, Time: 0.167, Space: 0.133, Biology/Physics/Chemistry: 0.1 each)
+
+## Environment Variables
+
+```bash
+export WISE_API_KEY="your-api-key"                    # Judge API key
+export WISE_BASE_URL="https://api.openai.com/v1"     # Judge API endpoint
+export WISE_MODEL_NAME="gpt-4o-2024-05-13"           # Judge model name
+```
+
+## Image Generation Integration
+
+WISE uses `output_type: generate_until` because lmms-eval routes both text-only generation and image-capable model generation through the same request type.
+Image-generation model wrappers should save generated images to disk and return a JSON string like:
+
+```json
+{"text": "", "images": ["/path/to/model/output/WISE_0.png"]}
+```
+
+During scoring, WISE reads the first path in `images` and sends that file to the judge. Models that only write files without returning this JSON format are not supported by this task.
+
+## Usage
+
+### Full Evaluation with an Image-Generation Model
+
+```bash
+cd /path/to/lmms-eval
+
+export WISE_API_KEY="your-api-key"
+export WISE_BASE_URL="https://api.openai.com/v1"
+export WISE_MODEL_NAME="gpt-4o-2024-05-13"
+
+python -m lmms_eval \
+  --model your_image_generation_model \
+  --model_args pretrained=/path/to/checkpoint,mode=generate,output_dir=/path/to/lmms-eval/outputs/WISE_raw/model_name \
+  --tasks WISE \
+  --batch_size 1 \
+  --log_samples \
+  --output_path /path/to/lmms-eval/outputs/WISE_eval/model_name
+```
+
+## Metrics
+
+- `WISE_culture_score`: Culture category score (prompt_id 1-400)
+- `WISE_time_score`: Time category score (prompt_id 401-567)
+- `WISE_space_score`: Space category score (prompt_id 568-700)
+- `WISE_biology_score`: Biology category score (prompt_id 701-800)
+- `WISE_physics_score`: Physics category score (prompt_id 801-900)
+- `WISE_chemistry_score`: Chemistry category score (prompt_id 901-1000)
+- `WISE_overall_wiscore`: Weighted overall score (main metric)
+
+All scores are in the range [0.0, 1.0].
+
+This task requires a text-to-image generation model. VQA-only models and image-editing models are not supported.
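
Reviewer note on the README above: the WiScore formula, the category weights, and the `prompt_id` ranges can be combined into a small scoring sketch. This is only an illustration of the arithmetic as the README describes it, not the code in `utils.py`. The names `wiscore`, `category_of`, and `overall_wiscore` are hypothetical, and treating a category with no samples as 0.0 is an assumption made here.

```python
from typing import Dict, Iterable, List, Tuple

# prompt_id ranges (inclusive) and weights as listed in the README.
CATEGORY_RANGES: Dict[str, Tuple[int, int]] = {
    "culture": (1, 400),
    "time": (401, 567),
    "space": (568, 700),
    "biology": (701, 800),
    "physics": (801, 900),
    "chemistry": (901, 1000),
}
CATEGORY_WEIGHTS: Dict[str, float] = {
    "culture": 0.4,
    "time": 0.167,
    "space": 0.133,
    "biology": 0.1,
    "physics": 0.1,
    "chemistry": 0.1,
}


def wiscore(consistency: float, realism: float, aesthetic: float) -> float:
    """Per-sample WiScore: judge scores in 0-2 are mapped to [0, 1]."""
    return (0.7 * consistency + 0.2 * realism + 0.1 * aesthetic) / 2


def category_of(prompt_id: int) -> str:
    """Map a prompt_id to its category using the README ranges."""
    for name, (low, high) in CATEGORY_RANGES.items():
        if low <= prompt_id <= high:
            return name
    raise ValueError(f"prompt_id {prompt_id} is outside 1-1000")


def overall_wiscore(samples: Iterable[Tuple[int, float, float, float]]) -> float:
    """samples: (prompt_id, consistency, realism, aesthetic) tuples."""
    per_category: Dict[str, List[float]] = {name: [] for name in CATEGORY_RANGES}
    for prompt_id, consistency, realism, aesthetic in samples:
        per_category[category_of(prompt_id)].append(
            wiscore(consistency, realism, aesthetic)
        )
    total = 0.0
    for name, scores in per_category.items():
        # Assumption: a category with no samples contributes 0.0.
        mean = sum(scores) / len(scores) if scores else 0.0
        total += CATEGORY_WEIGHTS[name] * mean
    return total
```

Because the weights sum to 1.0 and each per-sample WiScore lies in [0, 1], the overall score stays in [0.0, 1.0], consistent with the range stated in the README.
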
diff --git a/lmms_eval/tasks/WISE/WISE.yaml b/lmms_eval/tasks/WISE/WISE.yaml
@@ -0,0 +1,51 @@
+# WISE text-to-image benchmark.
+
+dataset_path: Yuwei-Niu/WISE
+task: WISE
+test_split: test
+output_type: generate_until
+
+doc_to_visual: !function utils.wise_doc_to_visual
+doc_to_text: !function utils.wise_doc_to_text
+doc_to_target: !function utils.wise_doc_to_target
+
+generation_kwargs:
+  max_new_tokens: 64
+  temperature: 0
+  top_p: 1.0
+  num_beams: 1
+  do_sample: false
+
+process_results: !function utils.wise_process_results
+
+metric_list:
+  - metric: WISE_culture_score
+    aggregation: !function utils.wise_aggregate_culture
+    higher_is_better: true
+  - metric: WISE_time_score
+    aggregation: !function utils.wise_aggregate_time
+    higher_is_better: true
+  - metric: WISE_space_score
+    aggregation: !function utils.wise_aggregate_space
+    higher_is_better: true
+  - metric: WISE_biology_score
+    aggregation: !function utils.wise_aggregate_biology
+    higher_is_better: true
+  - metric: WISE_physics_score
+    aggregation: !function utils.wise_aggregate_physics
+    higher_is_better: true
+  - metric: WISE_chemistry_score
+    aggregation: !function utils.wise_aggregate_chemistry
+    higher_is_better: true
+  - metric: WISE_overall_wiscore
+    aggregation: !function utils.wise_aggregate_overall_wiscore
+    higher_is_better: true
+
+lmms_eval_specific_kwargs:
+  default:
+    pre_prompt: ""
+    post_prompt: ""
+
+metadata:
+  - version: 0.1
+    description: "WISE text-to-image benchmark with OpenAI-compatible GPT image judge"
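
Reviewer note on the config above: the README in this commit states that image-generation wrappers return a JSON string with an `images` list, and that scoring reads the first path in that list. A minimal sketch of that parsing step follows. It is not the actual `utils.wise_process_results` implementation; `first_image_path` is a hypothetical helper, and returning `None` for malformed responses is an assumption about how unsupported outputs might be handled.

```python
import json
from typing import Optional


def first_image_path(response: str) -> Optional[str]:
    """Return the first entry of `images`, or None if the response
    is not in the JSON format described in the README."""
    try:
        payload = json.loads(response)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON (JSONDecodeError is a ValueError).
        return None
    if not isinstance(payload, dict):
        return None
    images = payload.get("images")
    if isinstance(images, list) and images and isinstance(images[0], str):
        return images[0]
    return None
```

A caller would typically also check that the returned path exists on disk before sending the file to the judge.
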
diff --git a/lmms_eval/tasks/WISE/utils.py b/lmms_eval/tasks/WISE/utils.py