VisionDirector (CVPR 2026)

Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis

Overview

Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. VisionDirector is a training-free, vision-language supervisor that:

  1. Extracts structured goals from long instructions
  2. Dynamically decides between one-shot generation and staged edits
  3. Runs micro-grid sampling with semantic verification/rollback after every edit
  4. Logs goal-level rewards for transparent evaluation
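The staged-edit branch of the loop above can be sketched as follows. This is a minimal illustration, not the released agent (which is marked as coming soon): `refine`, `EditLog`, `propose`, and `score` are hypothetical names, the one-shot versus staged decision is omitted, and the "image" is a toy set of satisfied goals standing in for real pixels and a real VLM verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, TypeVar

Image = TypeVar("Image")  # stand-in for whatever image type a real pipeline uses


@dataclass
class EditLog:
    """Hypothetical goal-level log: which goals were kept and which were rolled back."""
    accepted: List[str] = field(default_factory=list)
    rolled_back: List[str] = field(default_factory=list)


def refine(
    image: Image,
    goals: List[str],
    propose: Callable[[Image, str, int], List[Image]],
    score: Callable[[Image, List[str]], float],
    grid_size: int = 4,
) -> Tuple[Image, EditLog]:
    """Apply one staged edit per goal; keep an edit only if the overall score improves."""
    log = EditLog()
    best_score = score(image, goals)
    for goal in goals:
        # Micro-grid sampling: draw a small batch of candidate edits for this goal.
        candidates = propose(image, goal, grid_size)
        top = max(candidates, key=lambda c: score(c, goals))
        top_score = score(top, goals)
        if top_score > best_score:
            # Verification passed: commit the best candidate.
            image, best_score = top, top_score
            log.accepted.append(goal)
        else:
            # Rollback: discard the edit and keep the previous image.
            log.rolled_back.append(goal)
    return image, log


# Toy stand-ins (assumptions): an "image" is the set of goals it satisfies.
def toy_propose(img, goal, n):
    if goal == "logo":  # pretend the editor cannot satisfy this goal
        return [img] * n
    return [img | {goal}] + [img] * (n - 1)


def toy_score(img, goals):
    return sum(g in img for g in goals) / len(goals)


final_image, edit_log = refine(frozenset(), ["sky", "tree", "logo"], toy_propose, toy_score)
print(sorted(final_image), edit_log)
```

In a real system `score` would be a VLM-based verifier and `propose` a call to the image editor; both are left abstract here because their interfaces are not documented in this README.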

We further fine-tune the planner with Group Relative Policy Optimization (GRPO), yielding shorter edit trajectories (3.1 vs. 4.2 steps on average) and stronger alignment.
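For readers unfamiliar with GRPO, its core idea is to score each sampled trajectory relative to the other samples in its group rather than against a learned value function. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the example rewards are assumptions for illustration; the reward design used for the planner is not described in this README.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    # eps avoids division by zero when every sample in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four planner rollouts on the same prompt.
advantages = group_relative_advantages([0.9, 0.5, 0.7, 0.5])
print(advantages)
```

Rollouts that beat the group mean receive a positive advantage and are reinforced; those below it are discouraged.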

LGBench: Long Goal Benchmark

To expose the gap between current models and real-world design requirements, we introduce LGBench, a 2,000-task benchmark:

                  T2I      I2I      Total
Tasks             1,000    1,000    2,000
Total Goals       18,035   11,217   29,252
Avg Goals/Task    18.0     11.2

📦 Dataset: huggingface.co/datasets/TruemanV5/LGBench
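The per-split statistics in the table can be recomputed from the task files with a few lines of Python. The sketch below assumes each task is a JSON object with a list under a `goals` key; that field name is a guess, so check the actual schema of `t2i_1000.json` and `i2i_1000.json` before relying on it.

```python
import json
from typing import Any, Dict, List


def goal_stats(tasks: List[Dict[str, Any]], goals_key: str = "goals") -> Dict[str, float]:
    """Count tasks and goals; goals_key is an assumed field name, not a documented one."""
    total_goals = sum(len(task[goals_key]) for task in tasks)
    return {
        "tasks": len(tasks),
        "total_goals": total_goals,
        "avg_goals_per_task": total_goals / len(tasks),
    }


# Inline toy data in the assumed shape; replace with json.load(open(path)) on a real file.
toy_tasks = json.loads('[{"goals": ["a", "b", "c"]}, {"goals": ["d"]}]')
stats = goal_stats(toy_tasks)
print(stats)
```

Applied to the figures in the table, the same division gives 18,035 / 1,000 ≈ 18.0 for T2I and 11,217 / 1,000 ≈ 11.2 for I2I.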

Results

VisionDirector achieves a new state of the art on:

  • GenEval: +7 points overall (0.94 vs. 0.87 for Qwen-Image, the strongest baseline listed)
  • ImgEdit: +0.07 absolute improvement

Model            GenEval Overall
Qwen-Image       0.87
GPT Image 1      0.84
VisionDirector   0.94

Repository Structure

VisionDirector/
├── bench/                    # LGBench benchmark data
│   ├── t2i_1000.json         # 1000 T2I tasks (18k goals)
│   ├── i2i_1000.json         # 1000 I2I tasks (11k goals)
│   └── runners/              # Model inference scripts
│
├── evaluation/               # VLM-based goal verification
│   ├── goal_verify.py        # Main evaluation script
│   ├── parallel_evaluate.py  # Multi-GPU parallel evaluation
│   └── evaluate_sources.py   # Source image quality check
│
├── agent/                    # 🚧 Coming Soon
└── training/                 # 🚧 Coming Soon

Code & Models

Component              Status
LGBench Data           ✅ Available
Evaluation Scripts     ✅ Available
VisionDirector Agent   🚧 Coming Soon
GRPO Fine-tuning       🚧 Coming Soon

Citation

@article{chu2025visiondirector,
  title={VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis},
  author={Chu, Meng and Yang, Senqiao and Che, Haoxuan and Zhang, Suiyun and Zhang, Xichen and Yu, Shaozuo and Gui, Haokun and Rao, Zhefan and Tu, Dandan and Liu, Rui and Jia, Jiaya},
  journal={arXiv preprint arXiv:2512.19243},
  year={2025}
}

Authors

  • Meng Chu (HKUST) — Project Lead
  • Senqiao Yang (CUHK)
  • Haoxuan Che (Huawei Research)
  • Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao
  • Dandan Tu, Rui Liu (Huawei Research)
  • Jiaya Jia (HKUST)

License

This project is released under the Apache 2.0 License.

Acknowledgements

  • Qwen-VL for the VLM backbone
  • FLUX for the diffusion model