Skip to content

Stability-AI/Qwen-Image-Bench

 
 

Repository files navigation

Qwen-Image-Bench

Paper Dataset Model ModelScope License

An evaluation toolkit for text-to-image (T2I) generation models. It uses a fine-tuned Q-Judger (Qwen3.6-27B) to score generated images across 5 hierarchical dimensions (Quality, Aesthetics, Alignment, Real-world Fidelity, Creative Generation) covering 56 fine-grained facets.

Qwen-Image-Bench dimension framework and representative model outputs

Key Features

  • Evaluate any T2I model — run the judge model on your own generated images and get structured, multi-dimensional scores
  • Compute scores from pre-generated responses — reproduce the leaderboard from the released benchmark dataset
  • Powered by ms-swift — uses the same inference setup that produced the benchmark responses

Quick Start

# 1. Clone the repo
git clone https://github.com/QwenLM/Qwen-Image-Bench.git
cd Qwen-Image-Bench

# 2. Install dependencies
uv venv myenv --python 3.11 && source myenv/bin/activate
# Install PyTorch first: https://pytorch.org/get-started/locally/
uv pip install -r requirements.txt

# 3. Run judge on your images
python judge.py \
  --input your_data.jsonl \
  --model Qwen/Qwen-Image-Bench

Your input file should be a CSV/JSON/JSONL with three columns:

Column Type Description
ID int Prompt identifier (1–1000), must match benchmark metadata
prompt str The text prompt used to generate the image
image_path str Path to the generated image file

Installation

Step-by-step

1. Create and activate a virtual environment:

uv venv myenv --python 3.11
source myenv/bin/activate

2. Install PyTorch (select the command matching your CUDA version):

See the official guide: https://pytorch.org/get-started/locally/

3. Install Python dependencies:

uv pip install -r requirements.txt

This installs all required dependencies including ms-swift.

Usage

Evaluate Your Own T2I Model (judge.py)

Run Judge Inference

python judge.py \
  --input your_data.jsonl \
  --model Qwen/Qwen-Image-Bench

CLI Options

Argument Default Description
--input (required) Input CSV/JSON/JSONL with ID, prompt, image_path
--model (required) HuggingFace model ID or local path
--backend vllm Inference backend: vllm (continuous batching, fast) or pt (ms-swift PtEngine / HF static batching)
--hf-bench-repo HF dataset repo for bench metadata
--local-metadata Local metadata file path (overrides default)
--max-batch-size 24 ms-swift PtEngine max_batch_size (pt backend only)
--max-new-tokens 4096 Max generation tokens
--max-num-seqs 256 vLLM max concurrent sequences (vllm backend)
--gpu-memory-utilization 0.9 vLLM GPU memory fraction (vllm backend)
--tensor-parallel-size 1 vLLM tensor parallel size / GPUs (vllm backend)
--max-model-len vLLM max context length (vllm backend; default: model config)

The default vllm backend runs the judge with continuous batching + PagedAttention for substantially higher throughput on long, variable-length (thinking-mode) generations. It requires vllm in the venv (see requirements.txt). Use --backend pt to fall back to the original ms-swift PtEngine path.

Output Files

After running judge.py, three files are written next to your input:

File Contents
<input>_judged.{jsonl,csv} Per-row results: original fields + judge_model_output (combined raw scores JSON) + <dim>_judge_output (raw judge text per L1 dimension)
<input>_bench_scores.json Aggregated scores: level1, level2, total
<input>_bench_scores.xlsx Same scores in Excel: Level-1 Summary sheet + one sheet per L1 dimension with L2 detail

Compute Scores from Pre-generated Responses (compute_scores.py)

# From local file
python compute_scores.py --input qwen_image_bench_hf_v0518.jsonl

# Or download from HuggingFace
python compute_scores.py --hf-repo Qwen/Qwen-Image-Bench

Output: scores_result.xlsx + scores_detail.json

Top-5 Models

Model Quality Aesthetics Alignment Real-world Fidelity Creative Generation Overall
GPT Image 2 58.65 67.53 65.85 57.38 75.23 64.69
Nano Banana 2.0 54.77 61.08 62.40 54.28 67.05 59.82
GPT Image 1.5 55.14 60.88 61.72 53.95 66.35 59.65
Nano Banana Pro 55.67 60.26 61.25 54.07 66.23 59.45
Qwen Image 2.0 Pro 54.39 58.67 59.28 51.83 64.94 57.84

Full results for all 18 models are available in the paper.

Inference Parameters

The judge model uses fixed inference parameters for reproducibility:

Parameter Value
seed 42
temperature 0
top_k 1
top_p 1.0
repetition_penalty 1.05
max_new_tokens 4096
enable_thinking True
max_batch_size 24

Project Structure

.
├── judge.py                 # Run judge model inference on new images
├── compute_scores.py        # Compute scores from pre-generated responses
├── score_utils.py           # Score extraction, mapping, correction, aggregation
├── checklists.py            # Evaluation prompts and dimension definitions
├── backends/
│   └── ms_swift_backend.py  # ms-swift inference engine
├── metadata/
│   └── bench_metadata.json  # ID → dims_en metadata for judge inference
├── requirements.txt
└── assets/                  # Figures for documentation

Evaluation Framework

The benchmark uses a 3-level hierarchical scoring system with 5 L1 dimensions, 23 L2 sub-capabilities, and 56 L3 facets:

L1 Dimension L2 Sub-capabilities
Quality Realism, Detail, Resolution
Aesthetics Composition, Color Harmony, Lighting, Anatomical Portraiture, Emotional Expression, Style Control
Alignment Attributes, Actions, Layout, Relations, Scene
Real-world Fidelity Fairness, Safety & Compliance, World Knowledge
Creative Generation Imagination, Feature Matching, Logical Resolution, Text Rendering, Design Applications, Visual Storytelling

Scoring: Each L3 facet is rated 0 (Fail → 0), 1 (Pass → 60), or 2 (Excel → 100), with N/A excluded. Scores aggregate bottom-up: L3 → L2 → L1 → Overall.

For the complete dimension hierarchy and detailed analysis, see the benchmark dataset card.

Citation

If you find this benchmark useful, please cite our paper:

@misc{li2026qwenimagebenchgenerationcreationtexttoimage,
      title={Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation}, 
      author={Niantong Li and Guangzheng Hu and Weixu Qiao and Ying Ba and Qichen Hong and Shijun Shen and Jinlin Wang and Fan Zhou and Jianye Kang and Xin Shang and Ziyi He and Wei Wang and Dalin Li and Jiahao Li and Jie Zhang and Kaiyuan Gao and Kun Yan and Lihan Jiang and Ningyuan Tang and Shengming Yin and Tianhe Wu and Xiao Xu and Xiaoyue Chen and Yuxiang Chen and Yan Shu and Yanran Zhang and Yilei Chen and Yixian Xu and Zekai Zhang and Zhendong Wang and Zihao Liu and Zikai Zhou and Hongzhu Shi and Yi Wang and Bing Zhao and Hu Wei and Lin Qu and Chenfei Wu},
      year={2026},
      eprint={2605.28091},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.28091}, 
}

License

This project is licensed under the Apache License 2.0.

About

Fork of Qwen Image Bench for Stable Tribunal

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%