Qinyu Zhao1,2 · Guangting Zheng2 · Tao Yang2 · Rui Zhu2† · Xingjian Leng1 · Stephen Gould1 · Liang Zheng1
1 Australian National University
2 ByteDance Seed
† Project Lead
[2025-12-15] Initial Release with Codebase.
To set up our environment, please run:
git clone https://github.com/ByteDance-Seed/SimFlow.git
cd SimFlow
# If you use conda, please uncomment the following lines.
# conda create -n simflow python=3.11.2 -y
# conda activate simflow
pip install -r requirements.txt
Please download and extract the training split of the ImageNet-1K dataset.
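The official train archive ships as one tar of per-class tars. A minimal extraction sketch, assuming the standard ImageNet-1K file name and the `./imagenet` data path used in the training command (adjust paths to your setup):

```shell
# Unpack ILSVRC2012_img_train.tar (a tar of per-class tars) into one
# sub-directory per class. Paths here are illustrative assumptions.
extract_imagenet_train() {
  local archive="$1" out="$2"
  mkdir -p "$out"
  tar -xf "$archive" -C "$out"
  local f d
  for f in "$out"/*.tar; do
    d="${f%.tar}"
    mkdir -p "$d"
    tar -xf "$f" -C "$d"
    rm "$f"
  done
}

# Usage:
# extract_imagenet_train ILSVRC2012_img_train.tar ./imagenet/train
```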
A sample command for training SimFlow+REPA-E is shown below.
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
train_vae_w_nf.py \
--seed=0 \
--data_path="./imagenet" \
--output_dir="./output/vae_f16d64_std0_5_simflow_adaln_2222246_repaAlign3/" \
--resume="./output/vae_f16d64_std0_5_simflow_adaln_2222246_repaAlign3/" \
--batch_size=16 \
--checkpointing-steps=100000 --sampling-steps 5000 \
--loss-cfg-path="configs/vae_loss/l1_lpips_kl_gan_joint_training.yaml" \
--vae="vae_f16d64" --use_variational=False --fixed_std=0.5 \
--channels=1152 --blocks=6 --layers_per_block=2,2,2,2,2,46 --num_heads=16 \
--lr_schedule='const_then_cosine' --warmup_epochs=0 --hold_epochs=80 --min_lr=1e-6 --epochs=160 --max-train-steps=800000 \
--enc_type="dinov2-vit-b" --repa_align_depth='-1,-1,1,-1,-1,-1' \
--disturb_latents='none' \
--online_eval --eval_steps=100000 --cfg=0.0
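For reference, the flags above imply a global batch size of 256, assuming `--batch_size` is per GPU (the common convention; check the training script if unsure):

```python
# Global batch size implied by the torchrun command above.
# Assumption: --batch_size is the per-GPU batch size.
nnodes = 2            # --nnodes
gpus_per_node = 8     # --nproc_per_node
per_gpu_batch = 16    # --batch_size
global_batch = nnodes * gpus_per_node * per_gpu_batch
print(global_batch)   # 256
```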
To reproduce the reported results, run the corresponding evaluation commands below.
ImageNet 256x256 | FID = 2.15
torchrun --nproc_per_node=8 eval_vae_w_nf.py \
--seed=0 --output_dir="output/simflow_imagenet256x256" \
--resume="output/simflow_imagenet256x256" \
--vae="vae_f16d64" --use_variational=False --fixed_std=0.5 \
--channels=1152 --blocks=6 --layers_per_block=2,2,2,2,2,46 --num_heads=16 \
--evaluate --cfg=1.1 --temperature=0.95 --denoising_lr=0.25
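The `--temperature` flag follows the usual normalizing-flow convention of scaling the standard deviation of the base Gaussian before the inverse flow is applied. A generic sketch of that convention (an illustration, not the repository's actual sampler):

```python
import random
import statistics

def sample_base_latents(n, temperature=0.95, seed=0):
    """Draw n scalars from N(0, temperature^2).

    Generic normalizing-flow convention: temperature < 1 shrinks the
    base distribution, trading sample diversity for fidelity. This is
    an illustrative sketch, not the SimFlow codebase's sampler.
    """
    rng = random.Random(seed)
    return [temperature * rng.gauss(0.0, 1.0) for _ in range(n)]

z = sample_base_latents(100_000, temperature=0.95)
print(f"{statistics.pstdev(z):.2f}")  # close to 0.95
```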
ImageNet 256x256 with REPA-E | FID = 1.91
torchrun --nproc_per_node=8 eval_vae_w_nf.py \
--seed=0 --output_dir="output/simflow_imagenet256x256_repae" \
--resume="output/simflow_imagenet256x256_repae" \
--vae="vae_f16d64" --use_variational=False --fixed_std=0.5 \
--channels=1152 --blocks=6 --layers_per_block=2,2,2,2,2,46 --num_heads=16 \
--evaluate --cfg=1.1 --temperature=0.975 --denoising_lr=0.25
ImageNet 512x512 with REPA-E | FID = 2.74
torchrun --nproc_per_node=8 eval_vae_w_nf.py \
--seed=0 --output_dir="output/simflow_imagenet512x512_repae" \
--resume="output/simflow_imagenet512x512_repae" \
--resolution=512 \
--vae="vae_f16d64" --use_variational=False --fixed_std=0.5 \
--channels=1152 --blocks=6 --layers_per_block=2,2,2,2,2,46 --num_heads=16 \
--evaluate --cfg=1.0 --temperature=1.0 --eval_bsz=64
We also provide pretrained models, which can be downloaded from HuggingFace.
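For example, with the `huggingface_hub` CLI (the repository id below is a placeholder; substitute the one listed on the model card):

```shell
pip install -U "huggingface_hub[cli]"
# <repo-id> is a placeholder; use the repository listed on the
# SimFlow HuggingFace page.
huggingface-cli download <repo-id> --local-dir ./output/simflow_imagenet256x256
```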
This codebase builds upon several excellent open-source projects, including:
We sincerely thank the authors for making their work and models publicly available.
If you find our work useful, please consider citing:
@article{zhao2025simflow,
  title={SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows},
  author={Zhao, Qinyu and Zheng, Guangting and Yang, Tao and Zhu, Rui and Leng, Xingjian and Gould, Stephen and Zheng, Liang},
  journal={arXiv preprint arXiv:2512.04084},
  year={2025}
}

