| Model | Params | POPE | MME-P | MMB_dev | SEED | VQAv2 | GQA | MMMU | MM-Vet | TextVQA | MMStar |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-Nano-1 | 1.8B | - | - | - | - | 62.7 | - | 26.3 | - | - | - |
| VILA-U | 7B | 85.8 | 1401.8 | - | 59.0 | 79.4 | 60.8 | - | 33.5 | - | - |
| Chameleon | 7B | - | 170.0 | 31.1 | 30.6 | - | - | 25.4 | 8.3 | - | 31.1 |
| Chameleon | 30B | - | 575.3 | 32.5 | 48.5 | - | - | 38.8 | - | - | 31.8 |
| DreamLLM | 7B | - | - | 58.2 | - | 72.9 | - | - | 36.6 | - | - |
| LaVIT | 7B | - | - | - | - | 66.0 | 46.8 | - | - | - | - |
| Video-LaVIT | 7B | - | 1551.8 | 67.3 | 64.0 | - | - | - | - | - | - |
| Emu | 13B | - | - | - | - | 52.0 | - | - | - | - | - |
| Emu3 | 8B | - | - | 58.5 | 68.2 | - | - | 31.6 | - | - | - |
| NExT-GPT | 13B | - | - | - | - | 66.7 | - | - | - | - | - |
| Show-o | 1.3B | 73.8 | 948.4 | - | - | 59.3 | 48.7 | 25.1 | - | - | - |
| Janus | 1.3B | 87.0 | 1338.0 | 69.4 | 63.7 | 77.3 | 59.1 | 30.5 | 34.3 | - | - |
| JanusFlow | 1.3B | 88.0 | 1333.1 | 74.9 | 70.5 | 79.8 | 60.3 | 29.3 | 30.9 | - | - |
| Orthus | 7B | 79.6 | 1265.8 | - | - | 63.2 | 52.8 | 28.2 | - | - | - |
| Liquid† | 7B | 81.1 | 1119.3 | - | - | 71.3* | 58.4* | - | - | 42.4 | - |
| Unified-IO 2 | 6.8B | - | - | 71.5 | - | - | - | 86.2 | - | - | 61.8 |
| SEED-LLaMA | 7B | - | - | 45.8 | 51.5 | - | - | - | - | - | 31.7 |
| MUSE-VL | 7B | - | 1480.9 | 72.1 | 70.0 | - | - | 42.3 | - | - | 48.3 |
| MUSE-VL | 32B | - | 1581.6 | 81.8 | 71.0 | - | - | 50.1 | - | - | 56.7 |

| Method | Resolution | Params | #Images | FID↓ |
|---|---|---|---|---|
| SD-XL | - | 2.6B | 2000M | 9.55 |
| PixArt | - | 0.6B | 25M | 6.14 |
| Playground v2.5 | - | - | - | 4.48 |
| LWM | - | 7B | - | 17.77 |
| VILA-U | 256 | 7B | 15M | 12.81 |
| VILA-U | 384 | 7B | 15M | 7.69 |
| Show-o | - | 1.3B | 36M | 15.18 |
| Janus | - | 1.3B | - | 10.10 |
| JanusFlow | - | 1.3B | - | 9.51 |
| Liquid | 512 | 7B | 30M | 5.47 |
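
For context, the FID column above is the Fréchet Inception Distance (lower is better): real and generated images are embedded with an Inception-v3 network, each feature set is summarized by its mean and covariance, and the Fréchet distance between the two resulting Gaussians is reported. Below is a minimal NumPy/SciPy sketch of that distance; the feature-extraction step is omitted, and `frechet_distance` plus the toy statistics are illustrative names, not code from any of the listed projects.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    FID applies this to the mean/covariance of Inception-v3 pool features
    computed over real vs. generated images.
    """
    diff = mu1 - mu2
    # Matrix square root of the covariance product; tiny imaginary parts
    # introduced by numerical error are discarded.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))

# Toy usage with random stand-in features (real evaluations use tens of
# thousands of images per side).
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))
fake = rng.normal(loc=0.1, size=(1000, 64))
print(frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                       fake.mean(0), np.cov(fake, rowvar=False)))
```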
| Model | Params | Res. | Single Obj. | Two Obj. | Count. | Colors | Position | Color Attri. | Overall↑ |
|---|---|---|---|---|---|---|---|---|---|
| **Generation Model** | | | | | | | | | |
| LlamaGen | 0.8B | - | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 |
| LDM | 1.4B | - | 0.92 | 0.29 | 0.23 | 0.70 | 0.02 | 0.05 | 0.37 |
| SDv1.5 | 0.9B | - | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 | 0.43 |
| PixArt-α | 0.6B | - | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 |
| SDv2.1 | 0.9B | - | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
| DALL-E 2 | 6.5B | - | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 |
| Emu3-Gen | 8B | - | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| SDXL | 2.6B | - | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| IF-XL | 4.3B | - | 0.97 | 0.74 | 0.66 | 0.81 | 0.13 | 0.35 | 0.61 |
| DALL-E 3 | - | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| **Unified Model** | | | | | | | | | |
| Chameleon | 34B | - | - | - | - | - | - | - | 0.39 |
| LWM | 7B | - | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 | 0.47 |
| SEED-X | 17B | - | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 |
| Show-o | 1.3B | - | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| Janus | 1.3B | - | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 |
| JanusFlow | 1.3B | - | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 | 0.63 |
| Orthus | 7B | 512 | 0.99 | 0.75 | 0.26 | 0.84 | 0.28 | 0.38 | 0.58 |
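
The Overall column appears to be the unweighted mean of the six GenEval sub-task scores; below is a quick sanity check of that assumption against two rows from the table (both match to two decimals):

```python
# Assumption: GenEval Overall = mean of the six sub-task scores.
rows = {
    "Janus":     [0.97, 0.68, 0.30, 0.84, 0.46, 0.42],  # reported Overall: 0.61
    "JanusFlow": [0.97, 0.59, 0.45, 0.83, 0.53, 0.42],  # reported Overall: 0.63
}
for name, scores in rows.items():
    print(name, round(sum(scores) / len(scores), 2))
# Prints 0.61 and 0.63, matching the table.
```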
| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 17/11 | VQ-VAE | Neural Discrete Representation Learning | arXiv | |
| 20/12 | VQGAN | Taming Transformers for High-Resolution Image Synthesis | arXiv | GitHub |
| 21/10 | ViT-VQGAN | Vector-Quantized Image Modeling with Improved VQGAN | arXiv | |
| 22/03 | RQ-VAE | Autoregressive Image Generation using Residual Quantization | arXiv | GitHub |
| 22/09 | MoVQ | MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation | arXiv | |
| 22/12 | MAGVIT | MAGVIT: Masked Generative Video Transformer | arXiv | GitHub |
| 23/09 | FSQ-VQ-VAE | Finite Scalar Quantization: VQ-VAE Made Simple | arXiv | GitHub |
| 23/10 | Efficient-VQGAN | Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers | arXiv | |
| 23/10 | MAGVIT-v2 | Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | arXiv | |
| 24/06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | arXiv | GitHub |
| 24/06 | OmniTokenizer | OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | arXiv | GitHub |
| 24/06 | VQGAN-LC | Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99% | arXiv | |
| 24/09 | Open-MAGVIT2 | Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation | arXiv | GitHub |
| 24/12 | IBQ | Scalable Image Tokenization with Index Backpropagation Quantization | arXiv | GitHub |
| 24/12 | ZipAR | ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality | arXiv | |
| 24/12 | VidTok | VidTok: A Versatile and Open-Source Video Tokenizer | arXiv | GitHub |
| 25/01 | R3GAN | The GAN is dead; long live the GAN! A Modern GAN Baseline | arXiv | GitHub |
| 25/03 | FAR | Frequency Autoregressive Image Generation with Continuous Tokens | arXiv | GitHub |
| 25/03 | NFIG | NFIG: Autoregressive Image Generation with Next-Frequency Prediction | arXiv | |
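
Most entries above refine one core operation introduced by VQ-VAE: quantize each encoder feature to its nearest codebook vector, then route gradients through the non-differentiable lookup with a straight-through estimator (VQGAN adds adversarial and perceptual losses, RQ-VAE stacks residual quantizers, FSQ replaces the learned codebook with fixed scalar levels). Here is a minimal PyTorch sketch of that quantization step; shapes, names, and hyperparameters are illustrative, not drawn from any listed repository:

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook, beta=0.25):
    """Nearest-neighbor quantization with straight-through gradients (VQ-VAE style).

    z:        (N, D) encoder features
    codebook: (K, D) learnable code vectors
    Returns quantized features, code indices, and the VQ + commitment loss.
    """
    dists = torch.cdist(z, codebook) ** 2      # (N, K) squared distances
    idx = dists.argmin(dim=1)                  # nearest code per feature
    z_q = codebook[idx]                        # (N, D) quantized features

    # Codebook term pulls codes toward encoder outputs; commitment term
    # (weighted by beta) pulls encoder outputs toward their assigned codes.
    loss = F.mse_loss(z_q, z.detach()) + beta * F.mse_loss(z, z_q.detach())

    # Straight-through estimator: forward pass uses z_q, backward pass
    # copies gradients from z_q to z unchanged.
    z_q = z + (z_q - z).detach()
    return z_q, idx, loss

# Toy usage: a flattened 16x16 feature map against a 1024-entry codebook.
codebook = torch.randn(1024, 256, requires_grad=True)
z = torch.randn(16 * 16, 256, requires_grad=True)
z_q, idx, vq_loss = vector_quantize(z, codebook)
print(z_q.shape, idx.shape, vq_loss.item())
```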
| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/12 | ViLex | Visual Lexicon: Rich Image Features in Language Space | arXiv | |
| 25/03 | V2Flow | V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation | arXiv | GitHub |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/06 | SPAE | SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | arXiv | |
| 24/06 | BSQ-ViT | Image and Video Tokenization with Binary Spherical Quantization | arXiv | GitHub |
| 24/06 | TiTok | An Image is Worth 32 Tokens for Reconstruction and Generation | arXiv | GitHub |
| 24/10 | ImageFolder | ImageFolder: Autoregressive Image Generation with Folded Tokens | arXiv | GitHub |
| 24/10 | BPE-VQ | From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities | arXiv | |
| 24/11 | VQ-KD | Image Understanding Makes for A Good Tokenizer for Image Generation | arXiv | GitHub |
| 24/11 | FQGAN | Factorized Visual Tokenization and Generation | arXiv | GitHub |
| 24/12 | XQ-GAN | XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation | arXiv | GitHub |
| 24/12 | Divot | Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | arXiv | GitHub |
| 24/12 | SoftVQ-VAE | SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer | arXiv | GitHub |
| 25/01 | VA-VAE | Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models | arXiv | GitHub |
| 25/02 | ReaLS | Exploring Representation-Aligned Latent Space for Better Generation | arXiv | GitHub |
| 25/02 | MAETok | Masked Autoencoders Are Effective Tokenizers for Diffusion Models | arXiv | GitHub |
| 25/02 | QLIP | QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv | GitHub |
| 25/02 | FlexTok | FlexTok: Resampling Images into 1D Token Sequences of Flexible Length | arXiv | |
| 25/02 | UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | GitHub |
| 25/03 | USP | USP: Unified Self-Supervised Pretraining for Image Generation and Understanding | arXiv | GitHub |
| 25/03 | SemHiTok | SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation | arXiv | |
| 25/03 | Robust Tokenizer | Robust Latent Matters: Boosting Image Generation with Sampling Error | arXiv | GitHub |
| 25/03 | PCA Tokenizer | “Principal Components” Enable A New Language of Images | arXiv | GitHub |
| 25/03 | FlowMo | Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization | arXiv | Project Page |
| 25/03 | DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | GitHub |
| 25/03 | CTF | Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction | arXiv | GitHub |
| 25/03 | TokenBridge | Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation | arXiv | Project Page |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/10 | SEED | Making LLaMA SEE and Draw with SEED Tokenizer | arXiv | GitHub |
| 24/02 | AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | GitHub |
| 24/06 | LaVIT | Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | arXiv | GitHub |
| 24/09 | MIO | MIO: A Foundation Model on Multimodal Tokens | arXiv | GitHub |
| 24/12 | ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | GitHub |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/06 | Emu | Emu: Generative Pretraining in Multimodality | arXiv | GitHub |
| 23/09 | NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | arXiv | GitHub |
| 23/10 | MiniGPT-5 | MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | arXiv | |
| 23/12 | VL-GPT | VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv | GitHub |
| 24/01 | MM-Interleaved | MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | arXiv | GitHub |
| 24/02 | EasyGen | EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs | arXiv | |
| 24/03 | CoDi-2 | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | arXiv | |
| 24/04 | SEED-X | SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | GitHub |
| 24/04 | DreamLLM | DreamLLM: Synergistic Multimodal Comprehension and Creation | arXiv | GitHub |
| 24/05 | DEEM | DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception | arXiv | |
| 24/05 | X-VILA | X-VILA: Cross-Modality Alignment for Large Language Model | arXiv | GitHub |
| 24/06 | Emu2 | Generative Multimodal Models Are In-Context Learners | arXiv | GitHub |
| 24/11 | Spider | Spider: Any-to-Many Multimodal LLM | arXiv | GitHub |
| 24/12 | MetaMorph | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv | GitHub |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 22/02 | OFA | OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | arXiv | GitHub |
| 22/04 | Unified-IO | Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | arXiv | GitHub |
| 23/11 | TEAL | TEAL: Tokenize and Embed All for Multi-modal Large Language Models | arXiv | |
| 24/02 | LWM | World Model on Million-Length Video and Language with RingAttention | arXiv | GitHub |
| 24/05 | Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | arXiv | GitHub |
| 24/06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | arXiv | GitHub |
| 24/06 | 4M-21 | 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities | arXiv | |
| 24/08 | ANOLE | Anole: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | arXiv | GitHub |
| 24/08 | Show-o | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | arXiv | |
| 24/09 | Emu3 | Emu3: Next-Token Prediction is All You Need | arXiv | |
| 24/12 | Liquid | Liquid: Language Models are Scalable Multi-modal Generators | arXiv | |
| 24/12 | SynerGen-VL | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | arXiv | |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/04 | LIBRA | Libra: Building Decoupled Vision System on Large Language Models | arXiv | |
| 24/09 | VILA-U | VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | arXiv | |
| 24/11 | MUSE-VL | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv | |
| 24/12 | TokenFlow | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | arXiv | |
| 25/02 | QLIP | QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv | GitHub |
| 25/02 | UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | GitHub |
| 25/03 | DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | GitHub |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/05 | Morph-Tokens | Auto-Encoding Morph-Tokens for Multimodal LLM | arXiv | |
| 24/06 | Unified-IO 2 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action | arXiv | GitHub |
| 24/10 | Janus | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | GitHub |
| 24/12 | ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | |
| 25/01 | VARGPT | VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | GitHub |
| 25/01 | Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | GitHub |
| 25/03 | OmniMamba | OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv | GitHub |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/08 | Transfusion | Transfusion: Predict The Next Token and Diffuse Images With One Multi-Modal Model | arXiv | |
| 24/09 | MonoFormer | MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv | GitHub |
| 24/11 | JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | GitHub |
| 24/11 | JetFormer | JetFormer: An Autoregressive Generative Model of Raw Images and Text | arXiv | |
| 24/12 | CausalFusion | Causal Diffusion Transformers for Generative Modeling | arXiv | |
| 24/12 | LlamaFusion | LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv | |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/10 | MMAR | MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | arXiv | |
| 24/12 | Orthus | Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | |
| 24/12 | LatentLM | Multimodal Latent Language Modeling with Next-Token Diffusion | arXiv | |
| 25/03 | UniFluid | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv | |

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/03 | UniDiffuser | One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | arXiv | |
| 23/05 | CoDi | Any-to-Any Generation via Composable Diffusion | arXiv | Project Page |
| 23/06 | UniDiff | UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning | arXiv | |
| 24/12 | OmniFlow | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | arXiv | GitHub |
| 25/01 | Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | |
| 25/03 | X2I | X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation | arXiv | GitHub |