Qwen-Image-Bench

An evaluation toolkit for text-to-image (T2I) generation models. It uses a fine-tuned Q-Judger (Qwen3.6-27B) to score generated images across 5 hierarchical dimensions (Quality, Aesthetics, Alignment, Real-world Fidelity, Creative Generation) covering 56 fine-grained facets.

Key Features

- Evaluate any T2I model: run the judge model on your own generated images and get structured, multi-dimensional scores
- Compute scores from pre-generated responses: reproduce the leaderboard from the released benchmark dataset
- Powered by ms-swift: uses the same inference setup that produced the benchmark responses

Quick Start

# 1. Clone the repo
git clone https://github.com/QwenLM/Qwen-Image-Bench.git
cd Qwen-Image-Bench

# 2. Install dependencies
uv venv myenv --python 3.11 && source myenv/bin/activate
# Install PyTorch first: https://pytorch.org/get-started/locally/
uv pip install -r requirements.txt

# 3. Run judge on your images
python judge.py \
  --input your_data.jsonl \
  --model Qwen/Qwen-Image-Bench

Your input file should be a CSV/JSON/JSONL with three columns:

| Column | Type | Description |
| --- | --- | --- |
| `ID` | int | Prompt identifier (1–1000), must match benchmark metadata |
| `prompt` | str | The text prompt used to generate the image |
| `image_path` | str | Path to the generated image file |
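
As a minimal sketch, an input file in the JSONL form could be assembled as follows. The `write_input_jsonl` helper, the sample prompts, and the image paths are hypothetical placeholders; only the three column names and the 1–1000 ID range come from the table above.

```python
import json
from pathlib import Path


def write_input_jsonl(rows, out_path):
    """Write one JSON object per line with the three expected columns."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for row in rows:
            record = {
                "ID": int(row["ID"]),  # prompt identifier, expected in 1..1000
                "prompt": str(row["prompt"]),
                "image_path": str(row["image_path"]),
            }
            if not 1 <= record["ID"] <= 1000:
                raise ValueError(f"ID out of range: {record['ID']}")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


# Hypothetical rows: real prompts should be the benchmark prompts for these IDs.
rows = [
    {"ID": 1, "prompt": "a red bicycle leaning on a brick wall", "image_path": "outputs/0001.png"},
    {"ID": 2, "prompt": "a bowl of ramen, top-down view", "image_path": "outputs/0002.png"},
]
```

Calling `write_input_jsonl(rows, "your_data.jsonl")` would produce a file that can be passed to `--input`.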

Installation

Step-by-step

1. Create and activate a virtual environment:

uv venv myenv --python 3.11
source myenv/bin/activate

2. Install PyTorch (select the command matching your CUDA version):

See the official guide: https://pytorch.org/get-started/locally/

3. Install Python dependencies:

uv pip install -r requirements.txt

This installs all required dependencies including ms-swift.

Usage

Evaluate Your Own T2I Model (`judge.py`)

Run Judge Inference

python judge.py \
  --input your_data.jsonl \
  --model Qwen/Qwen-Image-Bench

CLI Options

| Argument | Default | Description |
| --- | --- | --- |
| `--input` | (required) | Input CSV/JSON/JSONL with `ID`, `prompt`, `image_path` |
| `--model` | (required) | HuggingFace model ID or local path |
| `--backend` | `vllm` | Inference backend: `vllm` (continuous batching, fast) or `pt` (ms-swift `PtEngine` / HF static batching) |
| `--hf-bench-repo` | (none) | HF dataset repo for bench metadata |
| `--local-metadata` | (none) | Local metadata file path (overrides default) |
| `--max-batch-size` | 24 | ms-swift `PtEngine` max_batch_size (`pt` backend only) |
| `--max-new-tokens` | 4096 | Max generation tokens |
| `--max-num-seqs` | 256 | vLLM max concurrent sequences (`vllm` backend) |
| `--gpu-memory-utilization` | 0.9 | vLLM GPU memory fraction (`vllm` backend) |
| `--tensor-parallel-size` | 1 | vLLM tensor parallel size / GPUs (`vllm` backend) |
| `--max-model-len` | (none) | vLLM max context length (`vllm` backend; default: model config) |

The default `vllm` backend runs the judge with continuous batching and PagedAttention, giving substantially higher throughput on long, variable-length (thinking-mode) generations. It requires `vllm` in the venv (see `requirements.txt`). Use `--backend pt` to fall back to the original ms-swift `PtEngine` path.
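
The backend-specific options can be kept straight with a small wrapper. The sketch below is an illustration only: `build_judge_command` is a hypothetical helper, and its strictness (rejecting a vLLM-only flag under `pt` and vice versa) is an assumption of this example, not documented behaviour of `judge.py`. The flag names themselves are taken from the table above.

```python
import shlex
import sys

# Flag groups taken from the CLI options table.
COMMON = {"hf_bench_repo", "local_metadata", "max_new_tokens"}
BACKEND_ONLY = {
    "vllm": {"max_num_seqs", "gpu_memory_utilization",
             "tensor_parallel_size", "max_model_len"},
    "pt": {"max_batch_size"},
}


def build_judge_command(input_path, model, backend="vllm", **options):
    """Assemble an argv list for judge.py (hypothetical helper)."""
    if backend not in BACKEND_ONLY:
        raise ValueError(f"unknown backend: {backend}")
    allowed = COMMON | BACKEND_ONLY[backend]
    cmd = [sys.executable, "judge.py", "--input", str(input_path),
           "--model", model, "--backend", backend]
    for name, value in options.items():
        if name not in allowed:
            raise ValueError(f"--{name.replace('_', '-')} does not apply to backend {backend!r}")
        # max_batch_size -> --max-batch-size
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


cmd = build_judge_command(
    "your_data.jsonl", "Qwen/Qwen-Image-Bench",
    backend="pt", max_batch_size=8,
)
print(shlex.join(cmd[1:]))  # the command line without the interpreter path
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root.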

Output Files

After running `judge.py`, three files are written next to your input:

| File | Contents |
| --- | --- |
| `<input>_judged.{jsonl,csv}` | Per-row results: original fields + `judge_model_output` (combined raw scores JSON) + `<dim>_judge_output` (raw judge text per L1 dimension) |
| `<input>_bench_scores.json` | Aggregated scores: `level1`, `level2`, `total` |
| `<input>_bench_scores.xlsx` | Same scores in Excel: `Level-1 Summary` sheet + one sheet per L1 dimension with L2 detail |
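
A minimal sketch of reading the aggregated JSON back in Python. Only the three top-level keys (`level1`, `level2`, `total`) come from the table above; the helper names are hypothetical, and the assumption that `level1` maps dimension names to numbers should be checked against a real output file.

```python
import json
from pathlib import Path

EXPECTED_KEYS = ("level1", "level2", "total")


def load_bench_scores(path):
    """Load a *_bench_scores.json file and check its top-level keys."""
    scores = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [k for k in EXPECTED_KEYS if k not in scores]
    if missing:
        raise KeyError(f"missing top-level keys: {missing}")
    return scores


def summarize(scores):
    """Return printable lines; assumes level1 maps dimension name -> number."""
    lines = [f"total: {scores['total']}"]
    for dim, value in scores["level1"].items():
        lines.append(f"{dim}: {value}")
    return lines
```

For example, `print("\n".join(summarize(load_bench_scores("your_data_bench_scores.json"))))` would list the total followed by one line per L1 dimension, assuming the file name follows the `<input>_bench_scores.json` pattern.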

Compute Scores from Pre-generated Responses (`compute_scores.py`)

# From local file
python compute_scores.py --input qwen_image_bench_hf_v0518.jsonl

# Or download from HuggingFace
python compute_scores.py --hf-repo Qwen/Qwen-Image-Bench

Output: `scores_result.xlsx` and `scores_detail.json`

Top-5 Models

| Model | Quality | Aesthetics | Alignment | Real-world Fidelity | Creative Generation | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| GPT Image 2 | 58.65 | 67.53 | 65.85 | 57.38 | 75.23 | 64.69 |
| Nano Banana 2.0 | 54.77 | 61.08 | 62.40 | 54.28 | 67.05 | 59.82 |
| GPT Image 1.5 | 55.14 | 60.88 | 61.72 | 53.95 | 66.35 | 59.65 |
| Nano Banana Pro | 55.67 | 60.26 | 61.25 | 54.07 | 66.23 | 59.45 |
| Qwen Image 2.0 Pro | 54.39 | 58.67 | 59.28 | 51.83 | 64.94 | 57.84 |

Full results for all 18 models are available in the paper.

Inference Parameters

The judge model uses fixed inference parameters for reproducibility:

| Parameter | Value |
| --- | --- |
| `seed` | 42 |
| `temperature` | 0 |
| `top_k` | 1 |
| `top_p` | 1.0 |
| `repetition_penalty` | 1.05 |
| `max_new_tokens` | 4096 |
| `enable_thinking` | True |
| `max_batch_size` | 24 |
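
For reference, the same settings as a Python mapping, with a sketch of how the sampling-related subset might be renamed for vLLM. This is an assumption about one possible wiring, not a description of what `judge.py` does internally; `to_vllm_sampling_kwargs` is a hypothetical helper. `enable_thinking` and `max_batch_size` are left out of the result because they are not sampling arguments.

```python
# Fixed judge settings, copied from the table above.
JUDGE_PARAMS = {
    "seed": 42,
    "temperature": 0,
    "top_k": 1,
    "top_p": 1.0,
    "repetition_penalty": 1.05,
    "max_new_tokens": 4096,
    "enable_thinking": True,
    "max_batch_size": 24,
}

# Keys that vLLM's SamplingParams accepts under the same name.
_SAMPLING_KEYS = ("seed", "temperature", "top_k", "top_p", "repetition_penalty")


def to_vllm_sampling_kwargs(params):
    """Select the sampling settings and rename the generation cap."""
    kwargs = {k: params[k] for k in _SAMPLING_KEYS}
    kwargs["max_tokens"] = params["max_new_tokens"]  # vLLM's name for this limit
    return kwargs
```

With vLLM installed, the result could be passed as `SamplingParams(**to_vllm_sampling_kwargs(JUDGE_PARAMS))`. With `temperature` 0 and `top_k` 1, decoding is greedy, which is what makes repeated runs comparable.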

Project Structure

.
├── judge.py                 # Run judge model inference on new images
├── compute_scores.py        # Compute scores from pre-generated responses
├── score_utils.py           # Score extraction, mapping, correction, aggregation
├── checklists.py            # Evaluation prompts and dimension definitions
├── backends/
│   └── ms_swift_backend.py  # ms-swift inference engine
├── metadata/
│   └── bench_metadata.json  # ID → dims_en metadata for judge inference
├── requirements.txt
└── assets/                  # Figures for documentation

Evaluation Framework

The benchmark uses a 3-level hierarchical scoring system with 5 L1 dimensions, 23 L2 sub-capabilities, and 56 L3 facets:

| L1 Dimension | L2 Sub-capabilities |
| --- | --- |
| Quality | Realism, Detail, Resolution |
| Aesthetics | Composition, Color Harmony, Lighting, Anatomical Portraiture, Emotional Expression, Style Control |
| Alignment | Attributes, Actions, Layout, Relations, Scene |
| Real-world Fidelity | Fairness, Safety & Compliance, World Knowledge |
| Creative Generation | Imagination, Feature Matching, Logical Resolution, Text Rendering, Design Applications, Visual Storytelling |

Scoring: Each L3 facet is rated 0 (Fail → 0), 1 (Pass → 60), or 2 (Excel → 100), with N/A excluded. Scores aggregate bottom-up: L3 → L2 → L1 → Overall.
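
The rating map and the bottom-up roll-up can be sketched as below. The 0/60/100 mapping and the N/A exclusion come from the paragraph above; the use of unweighted means at every level is an assumption of this sketch, and the facet names in the example are hypothetical. The Overall values in the Top-5 table are not the plain mean of the five L1 columns, so the released implementation in `score_utils.py` evidently aggregates differently in some respect.

```python
from statistics import mean

RATING_TO_SCORE = {0: 0.0, 1: 60.0, 2: 100.0}  # Fail, Pass, Excel


def facet_score(rating):
    """Map a raw rating to its score; None stands for N/A."""
    return None if rating is None else RATING_TO_SCORE[rating]


def mean_or_none(values):
    """Unweighted mean that skips N/A entries (assumed aggregation rule)."""
    kept = [v for v in values if v is not None]
    return mean(kept) if kept else None


def aggregate(ratings):
    """ratings: {L1 name: {L2 name: {L3 facet: 0 | 1 | 2 | None}}}."""
    l2_scores = {
        l1: {l2: mean_or_none(facet_score(r) for r in facets.values())
             for l2, facets in l2s.items()}
        for l1, l2s in ratings.items()
    }
    l1_scores = {l1: mean_or_none(l2s.values()) for l1, l2s in l2_scores.items()}
    overall = mean_or_none(l1_scores.values())
    return l2_scores, l1_scores, overall


# Hypothetical facet names; None marks an N/A facet.
example = {
    "Quality": {
        "Realism": {"facet_a": 2, "facet_b": 1},
        "Detail": {"facet_c": 0, "facet_d": None},
    },
    "Alignment": {"Attributes": {"facet_e": 1}},
}
l2_scores, l1_scores, overall = aggregate(example)
```

In this example Realism averages 100 and 60, Detail keeps only its one rated facet, and the N/A facet contributes nothing to any level.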

For the complete dimension hierarchy and detailed analysis, see the benchmark dataset card.

Citation

If you find this benchmark useful, please cite our paper:

@misc{li2026qwenimagebenchgenerationcreationtexttoimage,
      title={Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation}, 
      author={Niantong Li and Guangzheng Hu and Weixu Qiao and Ying Ba and Qichen Hong and Shijun Shen and Jinlin Wang and Fan Zhou and Jianye Kang and Xin Shang and Ziyi He and Wei Wang and Dalin Li and Jiahao Li and Jie Zhang and Kaiyuan Gao and Kun Yan and Lihan Jiang and Ningyuan Tang and Shengming Yin and Tianhe Wu and Xiao Xu and Xiaoyue Chen and Yuxiang Chen and Yan Shu and Yanran Zhang and Yilei Chen and Yixian Xu and Zekai Zhang and Zhendong Wang and Zihao Liu and Zikai Zhou and Hongzhu Shi and Yi Wang and Bing Zhao and Hu Wei and Lin Qu and Chenfei Wu},
      year={2026},
      eprint={2605.28091},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.28091}, 
}

License

This project is licensed under the Apache License 2.0.
