
# UniD3: Unified Discrete Diffusion for Simultaneous Vision-Language Generation


## Abstract

The recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to the state-of-the-art solutions on various generation tasks.

## Setup

### Installation Requirements

The code is compatible with Python 3.8 and PyTorch 1.9.

You can create an Anaconda environment called `unid3` with the required dependencies by running:

```bash
git clone https://github.com/mhh0318/UniD3.git
cd UniD3
conda create -n unid3 python=3.8
conda activate unid3
pip install -r requirements.txt
```
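
As an optional sanity check, you can confirm that the interpreter and PyTorch versions inside the new environment roughly match the tested configuration:

```python
# Optional sanity check: print the versions in the active environment.
# The authors report testing with Python 3.8 and PyTorch 1.9.
import sys

import torch

print("Python:", sys.version.split()[0])          # expect 3.8.x
print("PyTorch:", torch.__version__)              # expect 1.9.x
print("CUDA available:", torch.cuda.is_available())
```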

### Download Pretrained Weights

Download the pretrained models from here, and save them to `pretrained_models/`.

Download the released VQ-GAN model GumbelVQGAN trained on OpenImages and put it under `./misc/taming_dvae/`.
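
Before moving on to inference, a small check like the sketch below (a convenience snippet, not part of the released code) can confirm that both checkpoint directories are populated; the exact filenames depend on the downloaded release:

```python
# Convenience check: verify that the checkpoint directories described above
# contain files. Only directory contents are checked, since the expected
# filenames depend on the downloaded release.
from pathlib import Path

for directory in (Path("pretrained_models"), Path("misc/taming_dvae")):
    files = [p for p in directory.glob("*") if p.is_file()] if directory.is_dir() else []
    status = f"{len(files)} file(s) found" if files else "missing or empty"
    print(f"{directory}/: {status}")
```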

## Quick Inference

For simultaneous vision-language generation, please run:

```bash
python ./UniDiff/dist_eval_sample.py --model CKPT_PATH --condition unconditional --log pair_samples
```

If the environment is set up correctly, this command should run without errors and generate results in the `./pair_samples` folder.
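
To browse the generated samples, a minimal sketch along the lines below can help, assuming the sampler writes standard image files (e.g. PNG) into `./pair_samples`; adjust the glob pattern if the actual output format differs:

```python
# Minimal sketch for listing generated image samples in ./pair_samples.
# Assumes standard image files (e.g. PNG); adapt the pattern if the
# released script uses a different output format.
from pathlib import Path

from PIL import Image  # pip install pillow

sample_dir = Path("pair_samples")
for img_path in sorted(sample_dir.glob("*.png")):
    with Image.open(img_path) as img:
        print(f"{img_path.name}: {img.width}x{img.height}")
```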

## Comments

## BibTeX

```bibtex
@article{hu2022unified,
  title   = {Unified Discrete Diffusion for Simultaneous Vision-Language Generation},
  author  = {Hu, Minghui and Zheng, Chuanxia and Zheng, Heliang and Cham, Tat-Jen and Wang, Chaoyue and Yang, Zuopeng and Tao, Dacheng and Suganthan, Ponnuthurai N},
  journal = {arXiv},
  year    = {2022},
}
```