Mingyang-Han/Awesome-Native-Multimodal-Models
# Unified Models for Vision Language Understanding and Generation

## Benchmark Results

### Understanding

| Model | Params | POPE | MME-P | MMB_dev | SEED | VQAv2 | GQA | MMMU | MM-Vet | TextVQA | MMStar |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-Nano-1 | 1.8B | - | - | - | - | 62.7 | - | 26.3 | - | - | - |
| VILA-U | 7B | 85.8 | 1401.8 | - | 59.0 | 79.4 | 60.8 | - | 33.5 | - | - |
| Chameleon | 7B | - | 170.0 | 31.1 | 30.6 | - | - | 25.4 | 8.3 | - | 31.1 |
| Chameleon | 30B | - | 575.3 | 32.5 | 48.5 | - | - | 38.8 | - | - | 31.8 |
| DreamLLM | 7B | - | - | 58.2 | - | 72.9 | - | - | 36.6 | - | - |
| LaVIT | 7B | - | - | - | - | 66.0 | 46.8 | - | - | - | - |
| Video-LaVIT | 7B | - | 1551.8 | 67.3 | 64.0 | - | - | - | - | - | - |
| Emu | 13B | - | - | - | - | 52.0 | - | - | - | - | - |
| Emu3 | 8B | - | - | 58.5 | 68.2 | - | - | 31.6 | - | - | - |
| NExT-GPT | 13B | - | - | - | - | 66.7 | - | - | - | - | - |
| Show-o | 1.3B | 73.8 | 948.4 | - | - | 59.3 | 48.7 | 25.1 | - | - | - |
| Janus | 1.3B | 87.0 | 1338.0 | 69.4 | 63.7 | 77.3 | 59.1 | 30.5 | 34.3 | - | - |
| JanusFlow | 1.3B | 88.0 | 1333.1 | 74.9 | 70.5 | 79.8 | 60.3 | 29.3 | 30.9 | - | - |
| Orthus | 7B | 79.6 | 1265.8 | - | - | 63.2 | 52.8 | 28.2 | - | - | - |
| Liquid† | 7B | 81.1 | 1119.3 | - | - | 71.3* | 58.4* | - | - | 42.4 | - |
| Unified-IO 2 | 6.8B | - | - | 71.5 | - | - | - | 86.2 | - | - | 61.8 |
| SEED-LLaMA | 7B | - | - | 45.8 | 51.5 | - | - | - | - | - | 31.7 |
| MUSE-VL | 7B | - | 1480.9 | 72.1 | 70.0 | - | - | 42.3 | - | - | 48.3 |
| MUSE-VL | 32B | - | 1581.6 | 81.8 | 71.0 | - | - | 50.1 | - | - | 56.7 |

### MJHQ-30K

| Method | Resolution | Params | #Images | FID |
|---|---|---|---|---|
| SD-XL | - | - | 2000M | 9.55 |
| PixArt | - | - | 25M | 6.14 |
| Playground v2.5 | - | - | - | 4.48 |
| LWM | - | 7B | - | 17.77 |
| VILA-U | 256 | 7B | 15M | 12.81 |
| VILA-U | 384 | 7B | 15M | 7.69 |
| Show-o | - | 1.3B | 36M | 15.18 |
| Janus | - | 1.3B | - | 10.10 |
| JanusFlow | - | 1.3B | - | 9.51 |
| Liquid | 512 | 7B | 30M | 5.47 |
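
The MJHQ-30K numbers above are FID scores (lower is better). As an illustrative sketch (not the repo's evaluation code): FID fits a Gaussian to Inception features of real and generated images and computes the Fréchet distance between the two fits.

```python
# Hedged sketch of the FID formula between two feature Gaussians:
#   FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid_from_stats(mu1, sigma1, mu2, sigma2):
    """FID between N(mu1, sigma1) and N(mu2, sigma2).
    Uses the identity Tr((S1 S2)^{1/2}) = Tr((S1^{1/2} S2 S1^{1/2})^{1/2})
    so only symmetric-PSD square roots are needed."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    tr_covmean = np.trace(_sqrtm_psd(s1_half @ sigma2 @ s1_half))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean)

# Identical distributions give FID 0; shifting every mean coordinate
# by 1 in 4 dimensions gives FID 4.
mu, sigma = np.zeros(4), np.eye(4)
print(round(fid_from_stats(mu, sigma, mu, sigma), 6))        # 0.0
print(round(fid_from_stats(mu, sigma, mu + 1.0, sigma), 6))  # 4.0
```

In practice the statistics come from InceptionV3 pool features over the 30K reference images and an equal number of generated samples; the sketch only shows the closed-form distance itself.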

### GenEval Bench

| Model | Params | Res. | Single Obj. | Two Obj. | Count. | Colors | Position | Color Attri. | Overall↑ |
|---|---|---|---|---|---|---|---|---|---|
| *Generation Model* | | | | | | | | | |
| LlamaGen | 0.8B | - | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 |
| LDM | 1.4B | - | 0.92 | 0.29 | 0.23 | 0.70 | 0.02 | 0.05 | 0.37 |
| SDv1.5 | 0.9B | - | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 | 0.43 |
| PixArt-α | 0.6B | - | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 |
| SDv2.1 | 0.9B | - | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
| DALL-E 2 | 6.5B | - | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 |
| Emu3-Gen | 8B | - | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| SDXL | 2.6B | - | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| IF-XL | 4.3B | - | 0.97 | 0.74 | 0.66 | 0.81 | 0.13 | 0.35 | 0.61 |
| DALL-E 3 | - | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| *Unified Model* | | | | | | | | | |
| Chameleon | 34B | - | - | - | - | - | - | - | 0.39 |
| LWM | 7B | - | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 | 0.47 |
| SEED-X | 17B | - | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 |
| Show-o | 1.3B | - | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| Janus | 1.3B | - | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 |
| JanusFlow | 1.3B | - | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 | 0.63 |
| Orthus | 7B | 512 | 0.99 | 0.75 | 0.26 | 0.84 | 0.28 | 0.38 | 0.58 |
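
A note on reading the GenEval table: the Overall column matches, up to rounding of the underlying per-category numbers, the unweighted mean of the six category scores. This is my reading of the numbers, not something the repo states; a minimal sketch:

```python
# Hedged sketch (inference from the table, not repo code): GenEval's
# Overall score as the unweighted mean of the six category scores.
def geneval_overall(scores):
    """Average the six per-category scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)

# Category order: Single Obj., Two Obj., Count., Colors, Position, Color Attri.
janus = [0.97, 0.68, 0.30, 0.84, 0.46, 0.42]
janusflow = [0.97, 0.59, 0.45, 0.83, 0.53, 0.42]
print(geneval_overall(janus))      # 0.61, as in the table
print(geneval_overall(janusflow))  # 0.63, as in the table
```

A few rows (e.g. SDXL) differ by ±0.01, consistent with the per-category cells themselves being rounded before the average was taken.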

## Image Tokenizer

### VQGAN Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 17/11 | VQ-VAE | Neural Discrete Representation Learning | arXiv | - |
| 20/12 | VQGAN | Taming Transformers for High-Resolution Image Synthesis | arXiv | GitHub |
| 21/10 | ViT-VQGAN | Vector-Quantized Image Modeling with Improved VQGAN | arXiv | - |
| 22/03 | RQ-VAE | Autoregressive Image Generation using Residual Quantization | arXiv | GitHub |
| 22/09 | MoVQ | MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation | arXiv | - |
| 22/12 | MAGVIT | MAGVIT: Masked Generative Video Transformer | arXiv | GitHub |
| 23/09 | FSQ | Finite Scalar Quantization: VQ-VAE Made Simple | arXiv | GitHub |
| 23/10 | Efficient-VQGAN | Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers | arXiv | - |
| 23/10 | MAGVIT-v2 | Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | arXiv | - |
| 24/06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | arXiv | GitHub |
| 24/06 | OmniTokenizer | OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | arXiv | GitHub |
| 24/06 | VQGAN-LC | Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99% | arXiv | - |
| 24/09 | Open-MAGVIT2 | Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation | arXiv | GitHub |
| 24/12 | IBQ | Scalable Image Tokenization with Index Backpropagation Quantization | arXiv | GitHub |
| 24/12 | ZipAR | ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality | arXiv | - |
| 24/12 | VidTok | VidTok: A Versatile and Open-Source Video Tokenizer | arXiv | GitHub |
| 25/01 | R3GAN | The GAN is dead; long live the GAN! A Modern GAN Baseline | arXiv | GitHub |
| 25/03 | FAR | Frequency Autoregressive Image Generation with Continuous Tokens | arXiv | GitHub |
| 25/03 | NFIG | NFIG: Autoregressive Image Generation with Next-Frequency Prediction | arXiv | - |
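
Most tokenizers in this family descend from VQ-VAE's core operation: map each continuous encoder latent to the index of its nearest codebook entry, so an image becomes a grid of discrete tokens a language model can predict. A minimal sketch of that lookup (illustrative only, not taken from any listed codebase; the toy codebook is hypothetical):

```python
# Hedged sketch of VQ-VAE / VQGAN nearest-codebook quantization.
import numpy as np

def quantize(latents, codebook):
    """latents: (N, D) continuous vectors; codebook: (K, D) learned entries.
    Returns (token_ids, quantized_vectors)."""
    # Squared Euclidean distance from every latent to every code, expanded as
    # ||z - e||^2 = ||z||^2 - 2 z.e + ||e||^2 for a vectorized computation.
    d2 = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2.0 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    ids = d2.argmin(axis=1)        # one discrete token per latent
    return ids, codebook[ids]      # quantized vectors fed to the decoder

# Toy deterministic codebook: code k is the constant vector (k, k, ..., k).
codebook = np.arange(16.0)[:, None] * np.ones(8)   # K=16 codes, D=8 dims
latents = codebook[[3, 7, 7]] + 0.01               # slightly perturbed codes
ids, zq = quantize(latents, codebook)
print(ids.tolist())  # [3, 7, 7]: each perturbed latent snaps back to its code
```

During training the argmin is non-differentiable, which is why VQ-VAE uses a straight-through estimator and a codebook commitment loss; the entries above (FSQ, IBQ, residual quantization in RQ-VAE) are largely different answers to that optimization problem.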

### Semantic Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/12 | ViLex | Visual Lexicon: Rich Image Features in Language Space | arXiv | - |
| 25/03 | V2Flow | V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation | arXiv | GitHub |

### Semantic and Reconstructed Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/06 | SPAE | SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | arXiv | - |
| 24/06 | BSQ-ViT | Image and Video Tokenization with Binary Spherical Quantization | arXiv | GitHub |
| 24/06 | TiTok | An Image is Worth 32 Tokens for Reconstruction and Generation | arXiv | GitHub |
| 24/10 | ImageFolder | ImageFolder: Autoregressive Image Generation with Folded Tokens | arXiv | GitHub |
| 24/10 | BPE-VQ | From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities | arXiv | - |
| 24/11 | VQ-KD | Image Understanding Makes for A Good Tokenizer for Image Generation | arXiv | GitHub |
| 24/11 | FQGAN | Factorized Visual Tokenization and Generation | arXiv | GitHub |
| 24/12 | XQ-GAN | XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation | arXiv | GitHub |
| 24/12 | Divot | Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | arXiv | GitHub |
| 24/12 | SoftVQ-VAE | SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer | arXiv | GitHub |
| 25/01 | VA-VAE | Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models | arXiv | GitHub |
| 25/02 | ReaLS | Exploring Representation-Aligned Latent Space for Better Generation | arXiv | GitHub |
| 25/02 | MAETok | Masked Autoencoders Are Effective Tokenizers for Diffusion Models | arXiv | GitHub |
| 25/02 | QLIP | QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv | GitHub |
| 25/02 | FlexTok | FlexTok: Resampling Images into 1D Token Sequences of Flexible Length | arXiv | - |
| 25/02 | UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | GitHub |
| 25/03 | USP | USP: Unified Self-Supervised Pretraining for Image Generation and Understanding | arXiv | GitHub |
| 25/03 | SemHiTok | SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation | arXiv | - |
| 25/03 | Robust Tokenizer | Robust Latent Matters: Boosting Image Generation with Sampling Error | arXiv | GitHub |
| 25/03 | PCA Tokenizer | “Principal Components” Enable A New Language of Images | arXiv | GitHub |
| 25/03 | FlowMo | Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization | arXiv | Project Page |
| 25/03 | DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | GitHub |
| 25/03 | CTF | Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction | arXiv | GitHub |
| 25/03 | TokenBridge | Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation | arXiv | Project Page |

## External Image Generator

### Discrete Condition

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/10 | SEED | Making LLaMA SEE and Draw with SEED Tokenizer | arXiv | GitHub |
| 24/02 | AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | GitHub |
| 24/06 | LaVIT | Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | arXiv | GitHub |
| 24/09 | MIO | MIO: A Foundation Model on Multimodal Tokens | arXiv | GitHub |
| 24/12 | ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | GitHub |

### Continuous Condition

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/06 | Emu | Emu: Generative Pretraining in Multimodality | arXiv | GitHub |
| 23/09 | NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | arXiv | GitHub |
| 23/10 | MiniGPT-5 | MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | arXiv | - |
| 23/12 | VL-GPT | VL-GPT: A Generative Pre-Trained Transformer for Vision and Language Understanding and Generation | arXiv | GitHub |
| 24/01 | MM-Interleaved | MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-Modal Feature Synchronizer | arXiv | GitHub |
| 24/02 | EasyGen | EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs | arXiv | - |
| 24/03 | CoDi-2 | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | arXiv | - |
| 24/04 | SEED-X | SEED-X: Multimodal Models with Unified Multi-Granularity Comprehension and Generation | arXiv | GitHub |
| 24/04 | DreamLLM | DreamLLM: Synergistic Multimodal Comprehension and Creation | arXiv | GitHub |
| 24/05 | DEEM | DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception | arXiv | - |
| 24/05 | X-VILA | X-VILA: Cross-Modality Alignment for Large Language Model | arXiv | GitHub |
| 24/06 | Emu2 | Generative Multimodal Models Are In-Context Learners | arXiv | GitHub |
| 24/11 | Spider | Spider: Any-to-Many Multimodal LLM | arXiv | GitHub |
| 24/12 | MetaMorph | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv | GitHub |

## Discrete Image Modelling

### VQGAN Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 21/06 | OFA | OFA: Unifying Multimodal Pretrained Models | arXiv | GitHub |
| 22/04 | Unified-IO | Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | arXiv | GitHub |
| 23/11 | TEAL | TEAL: Tokenize and Embed All for Multi-Modal Large Language Models | arXiv | - |
| 24/02 | LWM | World Model on Million-Length Video and Language with RingAttention | arXiv | GitHub |
| 24/05 | Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | arXiv | GitHub |
| 24/06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | arXiv | GitHub |
| 24/06 | 4M-21 | 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities | arXiv | - |
| 24/08 | ANOLE | Anole: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | arXiv | GitHub |
| 24/08 | Show-o | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | arXiv | - |
| 24/09 | Emu3 | Emu3: Next-Token Prediction is All You Need | arXiv | - |
| 24/12 | Liquid | Liquid: Language Models are Scalable Multi-modal Generators | arXiv | - |
| 24/12 | SynerGen-VL | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | arXiv | - |

### Semantic Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/04 | LIBRA | Libra: Building Decoupled Vision System on Large Language Models | arXiv | - |
| 24/09 | VILA-U | VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | arXiv | - |
| 24/11 | MUSE-VL | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv | - |
| 24/12 | TokenFlow | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | arXiv | - |
| 25/02 | QLIP | QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv | GitHub |
| 25/02 | UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | GitHub |
| 25/03 | DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | GitHub |

### Decoupled Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/05 | Morph-Tokens | Auto-Encoding Morph-Tokens for Multimodal LLM | arXiv | - |
| 24/06 | Unified-IO 2 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | arXiv | GitHub |
| 24/10 | Janus | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | GitHub |
| 24/12 | ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | - |
| 25/01 | VARGPT | VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | GitHub |
| 25/01 | Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | GitHub |
| 25/03 | OmniMamba | OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv | GitHub |

## Continuous Image Modelling

### Diffusion

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/08 | Transfusion | Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | arXiv | - |
| 24/09 | MonoFormer | MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv | GitHub |
| 24/11 | JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | GitHub |
| 24/11 | JetFormer | JetFormer: An Autoregressive Generative Model of Raw Images and Text | arXiv | - |
| 24/12 | CausalFusion | Causal Diffusion Transformers for Generative Modeling | arXiv | - |
| 24/12 | LlamaFusion | LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv | - |

### AR+Diffusion

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/10 | MMAR | Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | arXiv | - |
| 24/12 | Orthus | Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | - |
| 24/12 | LatentLM | Multimodal Latent Language Modeling with Next-Token Diffusion | arXiv | - |
| 25/03 | UniFluid | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv | - |

## Diffusion Models

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/03 | UniDiffuser | One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | arXiv | - |
| 23/05 | CoDi | Any-to-Any Generation via Composable Diffusion | arXiv | Project Page |
| 23/06 | UniDiff | UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning | arXiv | - |
| 24/12 | OmniFlow | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | arXiv | GitHub |
| 25/01 | Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | - |
| 25/03 | X2I | X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation | arXiv | GitHub |
