VisionDirector (CVPR 2026)

Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis

Overview

Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. VisionDirector is a training-free, vision-language supervisor that:

Extracts structured goals from long instructions
Dynamically decides between one-shot generation and staged edits
Runs micro-grid sampling with semantic verification/rollback after every edit
Logs goal-level rewards for transparent evaluation
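
The verify/rollback step in the list above can be illustrated with a minimal sketch. It is not the released agent (which is still marked as coming soon): `closed_loop_refine`, `EditLog`, `propose_edits`, and `score_goals` are hypothetical names, the sum of per-goal scores is an assumed acceptance criterion, and the one-shot vs. staged decision is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class EditLog:
    """Goal-level record for one edit step (hypothetical structure)."""
    step: int
    accepted: bool
    goal_scores: List[float]


def closed_loop_refine(
    image,
    goals: Sequence[str],
    propose_edits: Callable,  # (image, goal, n) -> list of candidate images
    score_goals: Callable,    # (image, goals) -> per-goal scores in [0, 1]
    grid_size: int = 4,
    max_steps: int = 8,
    threshold: float = 0.5,
):
    """Edit one unmet goal at a time; keep an edit only if it verifies better."""
    logs: List[EditLog] = []
    best_scores = list(score_goals(image, goals))
    for step in range(max_steps):
        unmet = [i for i, s in enumerate(best_scores) if s < threshold]
        if not unmet:
            break  # every goal is verified
        # Sample a small grid of candidate edits for the first unmet goal.
        candidates = propose_edits(image, goals[unmet[0]], grid_size)
        if not candidates:
            break
        scored = [(list(score_goals(c, goals)), c) for c in candidates]
        cand_scores, cand = max(scored, key=lambda pair: sum(pair[0]))
        # Assumed criterion: accept only if the total goal score improves.
        accepted = sum(cand_scores) > sum(best_scores)
        if accepted:
            image, best_scores = cand, cand_scores
        # Otherwise roll back, i.e. keep the previous image and scores.
        logs.append(EditLog(step, accepted, list(best_scores)))
    return image, logs
```

In practice `score_goals` would be backed by a VLM verifier and `propose_edits` by an image editor; here both are plain callables so the control flow can be exercised with stubs.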

We further fine-tune the planner with Group Relative Policy Optimization (GRPO), which shortens edit trajectories from 4.2 to 3.1 steps on average and improves alignment.
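
For readers unfamiliar with GRPO: it samples a group of trajectories for the same prompt and scores each one relative to the rest of its group, rather than against a learned value function. The sketch below shows only that normalization step. How VisionDirector defines the trajectory reward is not specified here, so the example rewards (fraction of goals satisfied) are an assumption, as is the use of the population standard deviation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled edit plans for one prompt; rewards are assumed to be the
# fraction of goals each plan ends up satisfying.
advantages = group_relative_advantages([0.9, 0.6, 0.6, 0.3])
```

Plans that score above their group mean receive a positive advantage and are reinforced; plans below the mean are penalized.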

LGBench: Long Goal Benchmark

To expose the gap between current models and real-world design requirements, we introduce LGBench, a 2,000-task benchmark:

Metric	T2I	I2I	Total
Tasks	1,000	1,000	2,000
Total Goals	18,035	11,217	29,252
Avg Goals/Task	18.0	11.2	—
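
As a quick consistency check, the per-split averages follow directly from the task and goal counts in the table. The snippet below only re-derives them from those figures; the pooled average over both splits is computed here and is not a number reported by the authors.

```python
# Counts copied from the table above.
tasks = {"T2I": 1_000, "I2I": 1_000}
goals = {"T2I": 18_035, "I2I": 11_217}

# Average goals per task for each split.
avg = {split: goals[split] / tasks[split] for split in tasks}

# Pooled average over all 2,000 tasks (derived, not reported).
pooled = sum(goals.values()) / sum(tasks.values())

for split, value in avg.items():
    print(f"{split}: {value:.1f} goals/task")
print(f"Pooled: {pooled:.1f} goals/task")
```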

📦 Dataset: huggingface.co/datasets/TruemanV5/LGBench

Results

VisionDirector sets a new state of the art on two benchmarks:

GenEval: +7% overall improvement (absolute: 0.94 vs. 0.87 for Qwen-Image)
ImgEdit: +0.07 absolute improvement

Model	GenEval Overall
Qwen-Image	0.87
GPT Image 1	0.84
VisionDirector	0.94

Repository Structure

VisionDirector/
├── bench/                    # LGBench benchmark data
│   ├── t2i_1000.json         # 1000 T2I tasks (18k goals)
│   ├── i2i_1000.json         # 1000 I2I tasks (11k goals)
│   └── runners/              # Model inference scripts
│
├── evaluation/               # VLM-based goal verification
│   ├── goal_verify.py        # Main evaluation script
│   ├── parallel_evaluate.py  # Multi-GPU parallel evaluation
│   └── evaluate_sources.py   # Source image quality check
│
├── agent/                    # 🚧 Coming Soon
└── training/                 # 🚧 Coming Soon
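
Goal-level verification produces one verdict per goal, which then has to be rolled up into benchmark scores. The sketch below shows one plausible roll-up. It does not reflect the actual output format or metrics of `goal_verify.py`: the `(task_id, passed)` input shape, the function name `aggregate_goal_verdicts`, and both metric names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_goal_verdicts(verdicts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Roll per-goal pass/fail verdicts up into two summary rates."""
    per_task: Dict[str, List[bool]] = defaultdict(list)
    for task_id, passed in verdicts:
        per_task[task_id].append(bool(passed))

    n_goals = sum(len(v) for v in per_task.values())
    n_passed = sum(sum(v) for v in per_task.values())
    n_perfect = sum(all(v) for v in per_task.values())

    return {
        # Share of all goals that were verified as satisfied.
        "goal_pass_rate": n_passed / n_goals if n_goals else 0.0,
        # Share of tasks in which every goal was satisfied.
        "task_success_rate": n_perfect / len(per_task) if per_task else 0.0,
    }


summary = aggregate_goal_verdicts(
    [("t1", True), ("t1", False), ("t2", True), ("t2", True)]
)
```

Reporting both rates is useful for long prompts: a model can satisfy most goals overall while still completing few tasks in full.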

Code & Models

Component	Status
LGBench Data	✅ Available
Evaluation Scripts	✅ Available
VisionDirector Agent	🚧 Coming Soon
GRPO Fine-tuning	🚧 Coming Soon

Citation

@article{chu2025visiondirector,
  title={VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis},
  author={Chu, Meng and Yang, Senqiao and Che, Haoxuan and Zhang, Suiyun and Zhang, Xichen and Yu, Shaozuo and Gui, Haokun and Rao, Zhefan and Tu, Dandan and Liu, Rui and Jia, Jiaya},
  journal={arXiv preprint arXiv:2512.19243},
  year={2025}
}

Authors

Meng Chu (HKUST) — Project Lead
Senqiao Yang (CUHK)
Haoxuan Che (Huawei Research)
Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao
Dandan Tu, Rui Liu (Huawei Research)
Jiaya Jia (HKUST)

License

This project is released under the Apache 2.0 License.

Acknowledgements

Qwen-VL for the VLM backbone
FLUX for the diffusion model
