Commit b3d9dff

Purshow, mwxely, and claude authored
Add WISE Benchmark Task (#1301)
* Add WISE Benchmark Task
* style: fix black and isort formatting for WISE utils

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address WISE task review feedback
* fix: use Hugging Face WISE dataset
* update wise test

---------

Co-authored-by: mwxely <mwxely@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 118b226 commit b3d9dff

3 files changed

Lines changed: 758 additions & 0 deletions

lmms_eval/tasks/WISE/README.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# WISE

WISE is a knowledge-intensive text-to-image benchmark that evaluates whether models can use commonsense, cultural, scientific, spatial, and temporal knowledge to generate correct images.

- Paper: https://arxiv.org/abs/2503.07265
- Dataset: https://huggingface.co/datasets/Yuwei-Niu/WISE

## Overview

**Dataset:** 1000 prompts loaded from the Hugging Face `test` split across 6 categories (Culture, Time, Space, Biology, Physics, Chemistry)

**Evaluation:** GPT-4o judges each generated image on three dimensions:
- **Consistency** (0-2): How well the image matches the prompt
- **Realism** (0-2): Visual quality and photorealism
- **Aesthetic Quality** (0-2): Artistic appeal and composition

**WiScore Formula:** `(0.7 × consistency + 0.2 × realism + 0.1 × aesthetic) / 2`

**Final Score:** Weighted average across categories (Culture: 0.4, Time: 0.167, Space: 0.133, Biology/Physics/Chemistry: 0.1 each)

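The WiScore formula and the category-weighted final score can be sketched in Python as follows. This is a minimal illustration rather than the task's actual `utils.py`: the function names `wiscore` and `overall_wiscore` are hypothetical, and the weights are copied from the Overview section.

```python
# Dimension and category weights as stated in the Overview section.
DIMENSION_WEIGHTS = {"consistency": 0.7, "realism": 0.2, "aesthetic": 0.1}
CATEGORY_WEIGHTS = {
    "culture": 0.4,
    "time": 0.167,
    "space": 0.133,
    "biology": 0.1,
    "physics": 0.1,
    "chemistry": 0.1,
}


def wiscore(consistency, realism, aesthetic):
    """Per-image WiScore in [0, 1] from three judge scores, each in [0, 2]."""
    weighted = (
        DIMENSION_WEIGHTS["consistency"] * consistency
        + DIMENSION_WEIGHTS["realism"] * realism
        + DIMENSION_WEIGHTS["aesthetic"] * aesthetic
    )
    # Dividing by the maximum judge score (2) normalizes the result to [0, 1].
    return weighted / 2


def overall_wiscore(category_scores):
    """Weighted average of per-category scores (hypothetical helper).

    `category_scores` maps each category name to a score in [0, 1].
    """
    return sum(
        weight * category_scores[name]
        for name, weight in CATEGORY_WEIGHTS.items()
    )
```

Because consistency carries a weight of 0.7, an image that fully matches the prompt but scores poorly on realism and aesthetics still earns most of the available score.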
## Environment Variables

```bash
export WISE_API_KEY="your-api-key"  # Judge API key
export WISE_BASE_URL="https://api.openai.com/v1"  # Judge API endpoint
export WISE_MODEL_NAME="gpt-4o-2024-05-13"  # Judge model name
```

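For reference, one way a script could read these three variables is sketched below. The helper name `load_judge_config` is hypothetical, and both the fail-fast check and the fallback values (taken from the example exports above) are assumptions; the task's own code may handle missing variables differently.

```python
import os


def load_judge_config(env=None):
    """Collect judge settings from the WISE_* environment variables."""
    env = os.environ if env is None else env

    api_key = env.get("WISE_API_KEY")
    if not api_key:
        # Assumption: failing early is preferable to a late judge error.
        raise RuntimeError("WISE_API_KEY is not set; the judge cannot be called.")

    return {
        "api_key": api_key,
        # Assumption: fall back to the example values shown in this README.
        "base_url": env.get("WISE_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("WISE_MODEL_NAME", "gpt-4o-2024-05-13"),
    }
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.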
## Image Generation Integration

WISE uses `output_type: generate_until` because lmms-eval routes both text-only generation and image-capable model generation through the same request type.
Image-generation model wrappers should save generated images to disk and return a JSON string like:

```json
{"text": "", "images": ["/path/to/model/output/WISE_0.png"]}
```

During scoring, WISE reads the first path in `images` and sends that file to the judge. Models that only write files without returning this JSON format are not supported by this task.

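The return format described above can be illustrated with a short Python sketch. Both helpers are hypothetical: `format_generation_output` shows what a model wrapper might return after saving its images, and `first_image_path` shows one way the scoring side could extract the first path. Neither is taken from the task's `utils.py`.

```python
import json


def format_generation_output(image_paths, text=""):
    """Build the JSON string a wrapper returns after saving images to disk."""
    return json.dumps({"text": text, "images": list(image_paths)})


def first_image_path(model_output):
    """Return the first image path in a wrapper's output, or None.

    None is returned when the output is not valid JSON, is not a JSON
    object, or has no usable entry in `images`.
    """
    try:
        payload = json.loads(model_output)
    except (TypeError, ValueError):
        return None
    if not isinstance(payload, dict):
        return None
    images = payload.get("images")
    if not isinstance(images, list) or not images:
        return None
    first = images[0]
    return first if isinstance(first, str) else None
```

A wrapper that returns plain text instead of this JSON string yields `None` here, which corresponds to the unsupported case described above.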
## Usage

### Full Evaluation with an Image-Generation Model

```bash
cd /path/to/lmms-eval

export WISE_API_KEY="your-api-key"
export WISE_BASE_URL="https://api.openai.com/v1"
export WISE_MODEL_NAME="gpt-4o-2024-05-13"

python -m lmms_eval \
    --model your_image_generation_model \
    --model_args pretrained=/path/to/checkpoint,mode=generate,output_dir=/path/to/lmms-eval/outputs/WISE_raw/model_name \
    --tasks WISE \
    --batch_size 1 \
    --log_samples \
    --output_path /path/to/lmms-eval/outputs/WISE_eval/model_name
```

## Metrics

- `WISE_culture_score`: Culture category score (prompt_id 1-400)
- `WISE_time_score`: Time category score (prompt_id 401-567)
- `WISE_space_score`: Space category score (prompt_id 568-700)
- `WISE_biology_score`: Biology category score (prompt_id 701-800)
- `WISE_physics_score`: Physics category score (prompt_id 801-900)
- `WISE_chemistry_score`: Chemistry category score (prompt_id 901-1000)
- `WISE_overall_wiscore`: Weighted overall score (main metric)

All scores are in the range [0.0, 1.0].

Do not use VQA-only models or image-editing models for this task.
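
The `prompt_id` ranges listed under Metrics can be turned into a simple category lookup. The sketch below is illustrative only: `CATEGORY_RANGES` and `category_for_prompt_id` are hypothetical names, and the boundaries are copied from the metric list.

```python
# (category, first prompt_id, last prompt_id), both ends inclusive.
CATEGORY_RANGES = [
    ("culture", 1, 400),
    ("time", 401, 567),
    ("space", 568, 700),
    ("biology", 701, 800),
    ("physics", 801, 900),
    ("chemistry", 901, 1000),
]


def category_for_prompt_id(prompt_id):
    """Map a 1-based prompt_id to its WISE category name."""
    for name, first, last in CATEGORY_RANGES:
        if first <= prompt_id <= last:
            return name
    raise ValueError(f"prompt_id {prompt_id} is outside the range 1-1000")
```

The range sizes (400, 167, 133, 100, 100, 100) divided by the 1000 total prompts give exactly the category weights used for the final score, so the weighted average is proportional to each category's share of the dataset.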

lmms_eval/tasks/WISE/WISE.yaml

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# WISE text-to-image benchmark.

dataset_path: Yuwei-Niu/WISE
task: WISE
test_split: test
output_type: generate_until

doc_to_visual: !function utils.wise_doc_to_visual
doc_to_text: !function utils.wise_doc_to_text
doc_to_target: !function utils.wise_doc_to_target

generation_kwargs:
  max_new_tokens: 64
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false

process_results: !function utils.wise_process_results

metric_list:
  - metric: WISE_culture_score
    aggregation: !function utils.wise_aggregate_culture
    higher_is_better: true
  - metric: WISE_time_score
    aggregation: !function utils.wise_aggregate_time
    higher_is_better: true
  - metric: WISE_space_score
    aggregation: !function utils.wise_aggregate_space
    higher_is_better: true
  - metric: WISE_biology_score
    aggregation: !function utils.wise_aggregate_biology
    higher_is_better: true
  - metric: WISE_physics_score
    aggregation: !function utils.wise_aggregate_physics
    higher_is_better: true
  - metric: WISE_chemistry_score
    aggregation: !function utils.wise_aggregate_chemistry
    higher_is_better: true
  - metric: WISE_overall_wiscore
    aggregation: !function utils.wise_aggregate_overall_wiscore
    higher_is_better: true

lmms_eval_specific_kwargs:
  default:
    pre_prompt: ""
    post_prompt: ""

metadata:
  - version: 0.1
    description: "WISE text-to-image benchmark with OpenAI-compatible GPT image judge"
