Zhuoyan Luo*, Fengyuan Shi*, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
ARC Lab Tencent PCG, Tsinghua University, Nanjing University
- 🚀 Super-large Codebook: Re-implements the advanced Lookup-Free Quantizer proposed by MAGVITv2 and achieves a super-large codebook (i.e., 2^18) with strong performance (rFID 1.17).
- 💡 Auto-Regressive Innovation: Introduces asymmetric token factorization and the next sub-token prediction paradigm, enabling efficient generation with a super-large vocabulary and enhanced sub-token interactions.
- 🚀 Scalability: Validates the scalability of plain auto-regressive models across various parameter sizes (300M to 1.5B).
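The two ideas above can be illustrated with a minimal sketch (plain Python for clarity; this is not the repository's implementation, and the 12/6-bit split below is purely illustrative): a lookup-free quantizer binarizes each latent channel by sign, so the bit pattern itself is the code index and no embedding table is needed, and the resulting 18-bit index can be factorized into smaller sub-tokens that the autoregressive model predicts in turn.

```python
def lfq_index(z):
    """Lookup-free quantization sketch: binarize each latent channel by sign;
    the resulting bit pattern *is* the code index, so a d-dim latent addresses
    a 2^d codebook (d = 18 -> 262144) with no embedding-table lookup."""
    bits = [1 if v > 0 else 0 for v in z]           # sign -> {0, 1} per channel
    return sum(b << i for i, b in enumerate(bits))  # bit vector -> integer index

def factorize(index, k_low=6):
    """Asymmetric token factorization sketch (illustrative 12/6-bit split):
    instead of one 262144-way prediction, the AR model predicts a coarse
    sub-token (high bits) and then a fine sub-token (low bits)."""
    return index >> k_low, index & ((1 << k_low) - 1)

def defactorize(coarse, fine, k_low=6):
    """Recombine the two sub-tokens into the original code index."""
    return (coarse << k_low) | fine
```

For example, `lfq_index([0.3, -1.2, 0.7])` yields index 5 (bits `101`), and `factorize`/`defactorize` round-trip any 18-bit index.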
This repository provides the scripts and checkpoints to replicate our results.
- A series of visual tokenizers: (1) an image tokenizer for class-conditional image generation (8$\times$ and 16$\times$ downsampling rates with a 2^18 codebook size), (2) an image tokenizer for text-conditional image generation (2^14 and 2^18 codebook sizes with a 16$\times$ downsampling rate), and (3) a video tokenizer (2^18 codebook size with a 4$\times$8$\times$8 downsampling rate).
- A family of autoregressive models ranging from 300M to 1.5B parameters for class-conditional image generation.
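The downsampling rates above determine how many tokens the autoregressive model has to generate per image; a quick sanity check (hypothetical helper, not part of the repository):

```python
def image_token_grid(resolution, downsample):
    """Token grid for a square image: e.g. a 256x256 image with 16x
    downsampling becomes a 16x16 grid, i.e. 256 tokens."""
    assert resolution % downsample == 0
    side = resolution // downsample
    return side, side * side  # (grid side length, total token count)
```

Both 256$\times$256 at 16$\times$ and 128$\times$128 at 8$\times$ give a 16$\times$16 grid of 256 tokens.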
🤗 Open-MAGVIT2 is still under active development. Stay tuned for updates!
- $128\times 128$ Tokenizer Training

```
bash scripts/train_tokenizer/Open-MAGVIT2/run_128_L.sh MASTER_ADDR MASTER_PORT NODE_RANK
```

- $256\times 256$ Tokenizer Training

```
bash scripts/train_tokenizer/Open-MAGVIT2/run_256_L.sh MASTER_ADDR MASTER_PORT NODE_RANK
```

- $128\times 128$ Tokenizer Evaluation

```
bash scripts/evaluation/evaluation_128.sh
```

- $256\times 256$ Tokenizer Evaluation

```
bash scripts/evaluation/evaluation_256.sh
```
| Method | Token Type | #Tokens | Train Data | Codebook Size | rFID | PSNR | Codebook Utilization | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| Open-MAGVIT2-20240617 | 2D | 16 $\times$ 16 | 256 $\times$ 256 | 262144 | 1.53 | 21.53 | 100% | - |
| Open-MAGVIT2-20240617 | 2D | 16 $\times$ 16 | 128 $\times$ 128 | 262144 | 1.56 | 24.45 | 100% | - |
| Open-MAGVIT2 | 2D | 16 $\times$ 16 | 256 $\times$ 256 | 262144 | 1.17 | 21.90 | 100% | IN256_Large |
| Open-MAGVIT2 | 2D | 16 $\times$ 16 | 128 $\times$ 128 | 262144 | 1.18 | 25.08 | 100% | IN128_Large |
| Open-MAGVIT2* | 2D | 32 $\times$ 32 | 128 $\times$ 128 | 262144 | 0.34 | 26.19 | 100% | above |
(*) denotes that the results are from direct inference using the model trained with $128\times 128$ resolution, without further fine-tuning.
Please see scripts/train_autogressive/run.sh for the different model configurations.

```
bash scripts/train_autogressive/run.sh MASTER_ADDR MASTER_PORT NODE_RANK
```

Please see scripts/train_autogressive/run.sh for the sampling hyper-parameters of each model scale.

```
bash scripts/evaluation/sample_npu.sh Your_Total_Rank   # on NPU
bash scripts/evaluation/sample_gpu.sh Your_Total_Rank   # on GPU
```
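As background on what those sampling hyper-parameters control, here is a minimal temperature + top-k sampler (a generic sketch; the repository's actual sampling code and parameter values may differ):

```python
import math
import random

def sample_top_k(logits, k=2, temperature=1.0):
    """Sample one token id: divide logits by the temperature, keep only the
    k highest-scoring candidates, and draw from their softmax distribution."""
    scaled = [l / temperature for l in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    m = max(scaled[i] for i in top)                      # subtract max for stability
    weights = [math.exp(scaled[i] - m) for i in top]
    return random.choices(top, weights=weights, k=1)[0]
```

Lower temperatures and smaller k sharpen the distribution (better fidelity, less diversity), which is why these values are typically tuned per model scale.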
| Method | Params | #Tokens | FID | IS | Checkpoint |
|---|---|---|---|---|---|
| Open-MAGVIT2 | 343M | 16 $\times$ 16 | 3.08 | 258.26 | AR_256_B |
| Open-MAGVIT2 | 804M | 16 $\times$ 16 | 2.51 | 271.70 | AR_256_L |
| Open-MAGVIT2 | 1.5B | 16 $\times$ 16 | 2.33 | 271.77 | AR_256_XL |
We use LAION-COCO, CC12M, CC3M, LAION-HD, LAION-Aesthetic-umap, LAION-Aesthetic-v2, and JourneyDB for pretraining.
```
bash scripts/train_tokenizer/Open-MAGVIT2/pretrain_256.sh MASTER_ADDR MASTER_PORT NODE_RANK
```

- $256\times 256$ Tokenizer Evaluation

```
bash scripts/evaluation/evaluation_256.sh
bash scripts/evaluation/evaluation_original.sh
```
| Method | Quantizer Type | Training Data | Ratio | Resolution | Codebook Size | Checkpoint | rFID (COCO) | PSNR (COCO) | SSIM (COCO) | rFID (In1k) | PSNR (In1k) | SSIM (In1k) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LlamaGen | VQ | 70M | 16 | 256 $\times$ 256 | 16384 | - | 8.40 | 20.28 | 0.55 | 2.47 | 20.65 | 0.54 |
| Show-o | LFQ | 35M | 16 | 256 $\times$ 256 | 8192 | - | 9.26 | 20.90 | 0.59 | 3.50 | 21.34 | 0.59 |
| Cosmos | FSQ | - | 16 | 256 $\times$ 256 | 64000 | - | 11.97 | 19.22 | 0.48 | 4.57 | 19.93 | 0.49 |
| Open-MAGVIT2 | LFQ | 100M | 16 | 256 $\times$ 256 | 16384 | Pretrain_256_16384 | 7.93 | 22.21 | 0.62 | 2.55 | 22.21 | 0.62 |
| Open-MAGVIT2 | LFQ | 100M | 16 | 256 $\times$ 256 | 262144 | Pretrain_256_262144 | 6.76 | 22.31 | 0.65 | 1.67 | 22.70 | 0.64 |
| Cosmos | FSQ | - | 16 | Original | 64000 | - | 7.51 | 20.45 | 0.52 | 1.93 | 20.56 | 0.51 |
| Open-MAGVIT2 | LFQ | 100M | 16 | Original | 16384 | Pretrain_256_16384 | 6.65 | 21.61 | 0.57 | 1.39 | 21.74 | 0.56 |
| Open-MAGVIT2 | LFQ | 100M | 16 | Original | 262144 | Pretrain_256_262144 | 5.10 | 22.18 | 0.60 | 0.78 | 22.24 | 0.59 |
```
bash scripts/train_tokenizer/Open-MAGVIT2/run_video.sh MASTER_ADDR MASTER_PORT NODE_RANK
```

```
bash scripts/evaluation/evaluation_video.sh
```
| Method | Token Type | Tokens | Ratio | Train Resolution | Codebook Size | rFVD | Checkpoints |
|---|---|---|---|---|---|---|---|
| TATS | 2D | 4 $\times$ 16 $\times$ 16 | 4 $\times$ 8 $\times$ 8 | 128 $\times$ 128 | 16384 | 162 | - |
| MAGVIT | 2D | 4 $\times$ 16 $\times$ 16 | 4 $\times$ 8 $\times$ 8 | 128 $\times$ 128 | 1024 | 25 | - |
| SweetTokenizer | 1D | 256 + 1024 | - | 256 $\times$ 256 | 10481 + 11139 | 44 | - |
| LARP-L | 1D | 1024 | - | 128 $\times$ 128 | 8192 | 24 | - |
| LARP-L-Long | 1D | 1024 | - | 128 $\times$ 128 | 8192 | 24 | - |
| SweetTokenizer | 1D | 5120 | - | 256 $\times$ 256 | 10481 + 11139 | 18 | - |
| Open-MAGVIT2 | 2D | 5 $\times$ 16 $\times$ 16 | 4 $\times$ 8 $\times$ 8 | 128 $\times$ 128 | 262144 | 16 | Video_128_262144 |
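The video token counts follow from the 4$\times$8$\times$8 downsampling rate; a quick check (hypothetical helper, assuming a causal temporal scheme where the first frame is kept and every subsequent group of `t_down` frames collapses into one latent frame):

```python
def video_token_grid(frames, resolution, t_down=4, s_down=8):
    """Token grid for a video clip under 4x temporal and 8x8 spatial
    downsampling: e.g. 17 frames at 128x128 -> a 5 x 16 x 16 token grid."""
    assert (frames - 1) % t_down == 0 and resolution % s_down == 0
    t = (frames - 1) // t_down + 1   # causal: first frame + collapsed groups
    s = resolution // s_down
    return t, s, s
```

Under these assumptions, a 17-frame 128$\times$128 clip yields 5 $\times$ 16 $\times$ 16 tokens, matching the Open-MAGVIT2 row above.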

