ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models [Paper]
Qinyu Zhao, Stephen Gould, Liang Zheng
🔥 This repository contains the source code for our technical report. We will add more ablation and variant studies and scale up the model in future updates of the report.
👍 Our codebase builds heavily on MAR and VAR. We greatly appreciate their outstanding work.
⭐ In addition, a recent work, FractalGen, explores a similar idea. We encourage readers to refer to their paper for further insights.
Samples generated by our 213M base model.
Existing autoregressive (AR) image generative models use a token-by-token generation schema: they predict a per-token probability distribution and sample the next token from it. The main challenge is modeling the complex distribution of high-dimensional tokens. Previous methods are either too simplistic to fit the distribution or too slow at generation. Instead of fitting the distribution of a whole token at once, we explore using an AR model to generate each token feature by feature, i.e., taking the features generated so far as input and predicting the next feature. Based on this, we propose ARINAR (AR-in-AR), a bi-level AR model: the outer AR level takes previous tokens as input and predicts a condition vector for the next token, and the inner AR level then generates the features of that token autoregressively, conditioned on this vector.
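The bi-level scheme above can be sketched as follows. This is a minimal illustration with hypothetical module names and stand-in GRU backbones, not the actual implementation (which uses transformer-style AR models and a richer output distribution):

```python
import torch
import torch.nn as nn

class BiLevelAR(nn.Module):
    """Sketch of bi-level (AR-in-AR) generation, with assumed names/shapes.

    Outer level: attends over previously generated tokens and emits a
    condition vector for the next token. Inner level: generates that token's
    features one by one, conditioned on the vector and the features so far.
    """

    def __init__(self, token_dim=16, cond_dim=768):
        super().__init__()
        self.token_dim = token_dim
        self.outer = nn.GRU(token_dim, cond_dim, batch_first=True)  # stand-in for the outer AR model
        self.inner = nn.GRU(1, cond_dim, batch_first=True)          # stand-in for the inner, feature-level AR model
        self.head = nn.Linear(cond_dim, 2)                          # predicts mean and log-variance of the next feature

    @torch.no_grad()
    def generate_token(self, prev_tokens):
        # Outer level: condition vector summarizing previously generated tokens.
        _, hidden = self.outer(prev_tokens)             # (1, B, cond_dim)
        feats = []
        x = torch.zeros(prev_tokens.size(0), 1, 1)      # start-of-token input
        for _ in range(self.token_dim):
            # Inner level: predict a distribution for the next feature and sample it.
            out, hidden = self.inner(x, hidden)
            mean, logvar = self.head(out[:, -1]).chunk(2, dim=-1)
            f = mean + (0.5 * logvar).exp() * torch.randn_like(mean)
            feats.append(f)
            x = f.unsqueeze(1)                          # feed the sampled feature back in
        return torch.cat(feats, dim=-1)                 # one generated token, (B, token_dim)

model = BiLevelAR()
prev = torch.randn(2, 5, 16)        # batch of 2, five tokens already generated
token = model.generate_token(prev)
print(token.shape)                  # torch.Size([2, 16])
```

Full image generation repeats `generate_token` until all token positions are filled, appending each new token to `prev`.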
```shell
git clone https://github.com/Qinyu-Allen-Zhao/Arinar.git
cd Arinar
conda env create -f environment.yaml
conda activate arinar
```
Then, please install FlashAttention.
```shell
pip install flash-attn --no-build-isolation
```
1. Prepare the ImageNet-1k Dataset

Please download and prepare the training set of ImageNet-1k.
2. Download VAE Model Pre-Trained by MAR
```shell
python util/download.py
```
3. (Optional) Caching VAE Latents, Following MAR
```shell
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 \
main_cache.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 \
--batch_size 128 \
--data_path ${IMAGENET_PATH} --cached_path ${CACHED_PATH}
```
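Caching precomputes the VAE latent for every training image once, so training can skip the VAE encoder entirely. A minimal sketch of the idea, with a hypothetical stand-in encoder (the real script uses MAR's KL-16 VAE):

```python
import torch
import torch.nn as nn

# Stand-in encoder: 256x256 RGB -> 16x16 grid of 16-dim latents,
# matching --vae_embed_dim 16 and --vae_stride 16 above.
encoder = nn.Conv2d(3, 16, kernel_size=16, stride=16)

images = torch.randn(4, 3, 256, 256)   # fake batch standing in for ImageNet crops
with torch.no_grad():
    latents = encoder(images)          # (4, 16, 16, 16)

# One cached file per image; the training dataset later loads these directly.
for i, z in enumerate(latents):
    torch.save(z.clone(), f"/tmp/cached_latent_{i}.pt")

print(latents.shape)
```

The trade-off is disk space for the cached latents in exchange for removing the encoder's forward pass from every training step.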
Training

```shell
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=12345 \
main_mar.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 --vae_stride 16 --patch_size 1 --model mar_base --num_gaussians 4 \
--epochs 400 --warmup_epochs 100 --batch_size 64 --blr 1.0e-4 --output_dir ./outputs/mar_base_arinar_4_w768_d1 --resume ./outputs/mar_base_arinar_4_w768_d1 \
--data_path ${IMAGENET_PATH} --num_workers 2 --pin_mem \
--online_eval
```
Please add `--use_cached --cached_path ${CACHED_PATH}` if you want to train with cached VAE latents.
It takes about 8 days to train the base model on 4x A100 (80GB) GPUs with cached VAE latents.
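The `--num_gaussians 4` flag indicates that each feature's distribution is modeled as a 4-component Gaussian mixture. A minimal sketch of sampling from such a mixture head (the shapes and the way temperature scales the noise are assumptions, not the exact implementation):

```python
import torch

def sample_gmm(logits, means, logvars, temperature=1.0):
    """Sample one feature per batch element from a 1-D Gaussian mixture.

    logits:  (B, K) mixture weights (pre-softmax), K = num_gaussians
    means:   (B, K) per-component means
    logvars: (B, K) per-component log-variances
    """
    comp = torch.multinomial(torch.softmax(logits, dim=-1), 1)  # pick one component per sample
    mean = means.gather(1, comp)
    std = (0.5 * logvars.gather(1, comp)).exp() * temperature   # temperature scales sampling noise
    return mean + std * torch.randn_like(mean)                  # (B, 1)

torch.manual_seed(0)
x = sample_gmm(torch.zeros(8, 4), torch.randn(8, 4), torch.zeros(8, 4))
print(x.shape)  # torch.Size([8, 1])
```

A mixture head is far cheaper than a diffusion head at sampling time while still being expressive enough to capture multimodal per-feature distributions.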
Evaluation

```shell
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=12345 \
main_mar.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 --vae_stride 16 --patch_size 1 --model mar_base --num_gaussians 4 \
--epochs 400 --warmup_epochs 100 --batch_size 64 --blr 1.0e-4 --output_dir ./outputs/mar_base_arinar_4_w768_d1 --resume ./outputs/mar_base_arinar_4_w768_d1 \
--data_path ${IMAGENET_PATH} --num_workers 2 --pin_mem \
--temperature 1.0 --cfg 4.5 \
--evaluate
```
With `--temperature 1.1`, the best CFG scale is `--cfg 3.9`.
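The `--cfg` flag controls classifier-free guidance, which combines a class-conditional and an unconditional prediction at each generation step. A minimal sketch of the standard formulation (where exactly ARINAR applies guidance is an implementation detail of `main_mar.py`, so treat this as an assumption):

```python
import torch

def apply_cfg(cond_out, uncond_out, cfg_scale):
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one by a factor of cfg_scale.
    # cfg_scale = 1.0 recovers the purely conditional prediction.
    return uncond_out + cfg_scale * (cond_out - uncond_out)

cond = torch.tensor([1.0, 2.0])
uncond = torch.tensor([0.0, 1.0])
print(apply_cfg(cond, uncond, 4.5))  # tensor([4.5000, 5.5000])
```

Higher guidance sharpens class fidelity at the cost of diversity, which is why the best `--cfg` shifts down (4.5 to 3.9) as the sampling temperature rises from 1.0 to 1.1.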
The checkpoint we trained has been uploaded to Google Drive. Feel free to download and evaluate it.
| Model | #Parameters | FID | Time / image (s) |
|---|---|---|---|
| MAR-B | 208M | 2.31 | 65.69 |
| FractalMAR-B | 186M | 11.80 | 137.62 |
| ARINAR-B | 213M | 2.75 | 11.57 |
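From the table above, ARINAR-B roughly matches MAR-B's FID while generating images substantially faster. The per-image speedups work out as follows:

```python
# Seconds per generated image, taken from the results table above.
mar_b, fractalmar_b, arinar_b = 65.69, 137.62, 11.57

print(f"vs MAR-B:        {mar_b / arinar_b:.1f}x")        # ~5.7x faster
print(f"vs FractalMAR-B: {fractalmar_b / arinar_b:.1f}x") # ~11.9x faster
```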
If you use our codebase or our results in your research, please cite our work:
```bibtex
@article{zhao2025arinar,
  title={ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models},
  author={Zhao, Qinyu and Gould, Stephen and Zheng, Liang},
  journal={arXiv preprint arXiv:2503.02883},
  year={2025}
}
```