Commit e06c725

Add WISE Benchmark Task
1 parent 118b226 commit e06c725

3 files changed

Lines changed: 758 additions & 0 deletions

lmms_eval/tasks/WISE/README.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# WISE

WISE is a knowledge-intensive text-to-image benchmark that evaluates whether models can use commonsense, cultural, scientific, spatial, and temporal knowledge to generate correct images.

- Paper: https://arxiv.org/abs/2503.07265
- Dataset: https://huggingface.co/datasets/Yuwei-Niu/WISE

## Overview

**Dataset:** 1000 prompts across 6 categories (Culture, Time, Space, Biology, Physics, Chemistry)

**Evaluation:** GPT-4o judges each generated image on three dimensions:
- **Consistency** (0-2): How well the image matches the prompt
- **Realism** (0-2): Visual quality and photorealism
- **Aesthetic Quality** (0-2): Artistic appeal and composition

**WiScore Formula:** `(0.7 × consistency + 0.2 × realism + 0.1 × aesthetic) / 2`

**Final Score:** Weighted average across categories (Culture: 0.4, Time: 0.167, Space: 0.133, Biology/Physics/Chemistry: 0.1 each)
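
The two formulas above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code shipped in this commit: the names `wiscore`, `overall_wiscore`, and `CATEGORY_WEIGHTS` are hypothetical, and the range check on judge ratings is an added assumption.

```python
# Category weights as listed in the README above.
# Hypothetical constant name; not taken from the committed utils module.
CATEGORY_WEIGHTS = {
    "culture": 0.4,
    "time": 0.167,
    "space": 0.133,
    "biology": 0.1,
    "physics": 0.1,
    "chemistry": 0.1,
}


def wiscore(consistency, realism, aesthetic):
    """Per-image WiScore in [0, 1] from three judge ratings, each in 0..2."""
    ratings = {"consistency": consistency, "realism": realism, "aesthetic": aesthetic}
    for name, value in ratings.items():
        # Assumption: out-of-range ratings are rejected rather than clipped.
        if not 0 <= value <= 2:
            raise ValueError(f"{name} must be in [0, 2], got {value}")
    return (0.7 * consistency + 0.2 * realism + 0.1 * aesthetic) / 2


def overall_wiscore(category_scores):
    """Weighted average of per-category mean WiScores."""
    missing = set(CATEGORY_WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(weight * category_scores[name] for name, weight in CATEGORY_WEIGHTS.items())
```

For example, an image rated 2 for consistency, 1 for realism, and 0 for aesthetics gets `(1.4 + 0.2 + 0.0) / 2`, which is about 0.8.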

## Environment Variables

```bash
export WISE_API_KEY="your-api-key"  # Judge API key
export WISE_BASE_URL="https://api.openai.com/v1"  # Judge API endpoint
export WISE_MODEL_NAME="gpt-4o-2024-05-13"  # Judge model name
export WISE_RAW_OUTPUT_DIR="/path/to/model/output"  # Where model saves generated images
```
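
As a rough illustration of how these four variables might be consumed on the Python side, the sketch below reads them into a small config object. Everything here is an assumption rather than the committed implementation: `JudgeConfig` and `load_judge_config` are hypothetical names, the fallback values are simply the example values shown above, and treating the API key and output directory as required is a guess.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    """Hypothetical container for the WISE judge settings."""
    api_key: str
    base_url: str
    model_name: str
    raw_output_dir: str


def load_judge_config(env=None):
    """Build a JudgeConfig from environment variables (or any mapping)."""
    env = os.environ if env is None else env

    # Assumption: the key and the image directory have no sensible default.
    api_key = env.get("WISE_API_KEY", "")
    if not api_key:
        raise RuntimeError("WISE_API_KEY is not set")
    raw_output_dir = env.get("WISE_RAW_OUTPUT_DIR", "")
    if not raw_output_dir:
        raise RuntimeError("WISE_RAW_OUTPUT_DIR is not set")

    return JudgeConfig(
        api_key=api_key,
        # Assumed defaults: the example values from the README.
        base_url=env.get("WISE_BASE_URL", "https://api.openai.com/v1"),
        model_name=env.get("WISE_MODEL_NAME", "gpt-4o-2024-05-13"),
        raw_output_dir=raw_output_dir,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise without touching the real environment.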

## Usage

### Full Evaluation with bagel

```bash
cd /pfs/weiyang/Show/lmms-eval

export WISE_API_KEY="sk-..."
export WISE_BASE_URL="https://api.bltcy.ai/v1"
export WISE_MODEL_NAME="gpt-4o"
export WISE_RAW_OUTPUT_DIR="/pfs/weiyang/Show/lmms-eval/outputs/WISE_raw/bagel_umm"

python -m lmms_eval \
    --model bagel_umm \
    --model_args pretrained=/pfs/weiyang/WISE_re/CKPT/ByteDance-Seed/BAGEL-7B-MoT,mode=generate,output_dir=/pfs/weiyang/Show/lmms-eval/outputs/WISE_raw/bagel_umm \
    --tasks WISE \
    --batch_size 1 \
    --log_samples \
    --output_path /pfs/weiyang/Show/lmms-eval/outputs/WISE_eval/bagel_umm
```

## Metrics

- `WISE_culture_score`: Culture category score (prompt_id 1-400)
- `WISE_time_score`: Time category score (prompt_id 401-567)
- `WISE_space_score`: Space category score (prompt_id 568-700)
- `WISE_biology_score`: Biology category score (prompt_id 701-800)
- `WISE_physics_score`: Physics category score (prompt_id 801-900)
- `WISE_chemistry_score`: Chemistry category score (prompt_id 901-1000)
- `WISE_overall_wiscore`: Weighted overall score (main metric)

All scores are in the range [0.0, 1.0].
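
The prompt_id ranges in the metric list amount to a simple lookup. The sketch below shows one way to map a prompt_id to its category and metric name. The helper names `category_for_prompt_id` and `metric_name` are hypothetical, and the committed utilities may organize this differently.

```python
# Inclusive prompt_id ranges per category, copied from the metric list above.
CATEGORY_RANGES = [
    ("culture", 1, 400),
    ("time", 401, 567),
    ("space", 568, 700),
    ("biology", 701, 800),
    ("physics", 801, 900),
    ("chemistry", 901, 1000),
]


def category_for_prompt_id(prompt_id):
    """Return the category whose inclusive range contains prompt_id."""
    for name, first, last in CATEGORY_RANGES:
        if first <= prompt_id <= last:
            return name
    raise ValueError(f"prompt_id {prompt_id} is outside 1-1000")


def metric_name(prompt_id):
    """Return the per-category metric that a prompt contributes to."""
    return f"WISE_{category_for_prompt_id(prompt_id)}_score"
```

The range sizes (400, 167, 133, 100, 100, 100) divided by 1000 match the category weights used for the overall score.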

Do not use VQA-only models or image-editing models for this task.

lmms_eval/tasks/WISE/WISE.yaml

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# WISE text-to-image benchmark.
#
# Dataset notes:
# - The source JSON is expected to be uploaded to Hugging Face as:
#   Yuwei-Niu/WISE/final_data.json
# - data_files maps that JSON into the lmms-eval test split without using any
#   local mount path.

dataset_path: json
dataset_kwargs:
  data_files:
    test: Yuwei-Niu/WISE
task: WISE
test_split: test
output_type: generate_until

doc_to_visual: !function utils.wise_doc_to_visual
doc_to_text: !function utils.wise_doc_to_text
doc_to_target: !function utils.wise_doc_to_target

generation_kwargs:
  max_new_tokens: 64
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false

process_results: !function utils.wise_process_results

metric_list:
  - metric: WISE_culture_score
    aggregation: !function utils.wise_aggregate_culture
    higher_is_better: true
  - metric: WISE_time_score
    aggregation: !function utils.wise_aggregate_time
    higher_is_better: true
  - metric: WISE_space_score
    aggregation: !function utils.wise_aggregate_space
    higher_is_better: true
  - metric: WISE_biology_score
    aggregation: !function utils.wise_aggregate_biology
    higher_is_better: true
  - metric: WISE_physics_score
    aggregation: !function utils.wise_aggregate_physics
    higher_is_better: true
  - metric: WISE_chemistry_score
    aggregation: !function utils.wise_aggregate_chemistry
    higher_is_better: true
  - metric: WISE_overall_wiscore
    aggregation: !function utils.wise_aggregate_overall_wiscore
    higher_is_better: true

lmms_eval_specific_kwargs:
  default:
    pre_prompt: ""
    post_prompt: ""

metadata:
  - version: 0.1
    description: "WISE text-to-image benchmark with OpenAI-compatible GPT image judge"
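
The `utils` module referenced by the `!function` tags is not shown in this view, so the following is only a guess at the shape of a per-category aggregation callable. It assumes each per-sample result is a dictionary with hypothetical `category` and `wiscore` keys, and that a category with no samples scores 0.0. None of these details are confirmed by the diff.

```python
from statistics import fmean


def make_category_aggregator(category):
    """Build an aggregation callable for one category (hypothetical helper)."""

    def aggregate(results):
        # Keep only the samples that belong to this category.
        scores = [r["wiscore"] for r in results if r["category"] == category]
        # Assumption: an empty category yields 0.0 instead of raising.
        return fmean(scores) if scores else 0.0

    return aggregate


# Illustrative stand-in for a function such as utils.wise_aggregate_culture.
aggregate_culture = make_category_aggregator("culture")
```

A factory like this would avoid writing six near-identical functions, although the YAML above names one function per category, which suggests the committed code defines them individually.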
