📄 Paper: https://arxiv.org/abs/2606.01399
🔗 Project page: https://gaohy63601-droid.github.io/promising/
Heyuan Gao1,2,\*, Bangxun Tang1,3,\*, Yiren Song1,4,\*,‡, Guian Fang1,4, Zijian He1, Jie Yang1, Mike Zheng Shou4,†
1 Utopai Studios; 2 Nanyang Technological University; 3 University of California, Irvine; 4 Show Lab, National University of Singapore
\* Equal contribution; ‡ Project leader; † Corresponding author
We present PAI-Studio, a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching the appearance of the reference scene, and achieving globally consistent illumination with realistic foreground relighting. Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting, and foreground identity preservation; they often produce static backgrounds, inconsistent boundaries, and noticeable compositing artifacts. To bridge this gap, we build on a Diffusion Transformer video backbone and reformulate the problem as an in-context conditional generation task. Through bidirectional attention, our model jointly captures foreground dynamics and background reference information within a unified architecture. We further construct a 30K-scale dataset sourced from high-quality films and online videos to support this task. Extensive evaluations demonstrate that our method significantly outperforms existing open-source and commercial API solutions.
Code release is coming soon.