Mingyang-Han/Awesome-Native-Multimodal-Models
# Unified Models for Vision Language Understanding and Generation

## Benchmark Results

### Understanding

| Model | Params | POPE | MME-P | MMB_dev | SEED | VQAv2 | GQA | MMMU | MM-Vet | TextVQA | MMStar |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-Nano-1 | 1.8B | - | - | - | - | 62.7 | - | 26.3 | - | - | - |
| VILA-U | 7B | 85.8 | 1401.8 | - | 59.0 | 79.4 | 60.8 | - | 33.5 | - | - |
| Chameleon | 7B | - | 170.0 | 31.1 | 30.6 | - | - | 25.4 | 8.3 | - | 31.1 |
| Chameleon | 30B | - | 575.3 | 32.5 | 48.5 | - | - | 38.8 | - | - | 31.8 |
| DreamLLM | 7B | - | - | 58.2 | - | 72.9 | - | - | 36.6 | - | - |
| LaVIT | 7B | - | - | - | - | 66.0 | 46.8 | - | - | - | - |
| Video-LaVIT | 7B | - | 1551.8 | 67.3 | 64.0 | - | - | - | - | - | - |
| Emu | 13B | - | - | - | - | 52.0 | - | - | - | - | - |
| Emu3 | 8B | - | - | 58.5 | 68.2 | - | - | 31.6 | - | - | - |
| NExT-GPT | 13B | - | - | - | - | 66.7 | - | - | - | - | - |
| Show-o | 1.3B | 73.8 | 948.4 | - | - | 59.3 | 48.7 | 25.1 | - | - | - |
| Janus | 1.3B | 87.0 | 1338.0 | 69.4 | 63.7 | 77.3 | 59.1 | 30.5 | 34.3 | - | - |
| JanusFlow | 1.3B | 88.0 | 1333.1 | 74.9 | 70.5 | 79.8 | 60.3 | 29.3 | 30.9 | - | - |
| Orthus | 7B | 79.6 | 1265.8 | - | - | 63.2 | 52.8 | 28.2 | - | - | - |
| Liquid† | 7B | 81.1 | 1119.3 | - | - | 71.3* | 58.4* | - | - | 42.4 | - |
| Unified-IO 2 | 6.8B | - | - | 71.5 | - | - | - | 86.2 | - | - | 61.8 |
| SEED-LLaMA | 7B | - | - | 45.8 | 51.5 | - | - | - | - | - | 31.7 |
| MUSE-VL | 7B | - | 1480.9 | 72.1 | 70.0 | - | - | 42.3 | - | - | 48.3 |
| MUSE-VL | 32B | - | 1581.6 | 81.8 | 71.0 | - | - | 50.1 | - | - | 56.7 |

### MJHQ-30K

| Method | Resolution | Params | #Images | FID |
|---|---|---|---|---|
| SD-XL | - | - | 2000M | 9.55 |
| PixArt | - | - | 25M | 6.14 |
| Playground v2.5 | - | - | - | 4.48 |
| LWM | - | 7B | - | 17.77 |
| VILA-U | 256 | 7B | 15M | 12.81 |
| VILA-U | 384 | 7B | 15M | 7.69 |
| Show-o | - | 1.3B | 36M | 15.18 |
| Janus | - | 1.3B | - | 10.10 |
| JanusFlow | - | 1.3B | - | 9.51 |
| Liquid | 512 | 7B | 30M | 5.47 |
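
The MJHQ-30K numbers above are FID scores (lower is better). As an illustrative sketch (not the repo's evaluation code): FID fits a Gaussian to Inception features of real and generated images and computes the Fréchet distance between the two fits.

```python
# Hedged sketch of the FID formula between two feature Gaussians:
#   FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid_from_stats(mu1, sigma1, mu2, sigma2):
    """FID between N(mu1, sigma1) and N(mu2, sigma2).
    Uses the identity Tr((S1 S2)^{1/2}) = Tr((S1^{1/2} S2 S1^{1/2})^{1/2})
    so only symmetric-PSD square roots are needed."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    tr_covmean = np.trace(_sqrtm_psd(s1_half @ sigma2 @ s1_half))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean)

# Identical distributions give FID 0; shifting every mean coordinate
# by 1 in 4 dimensions gives FID 4.
mu, sigma = np.zeros(4), np.eye(4)
print(round(fid_from_stats(mu, sigma, mu, sigma), 6))        # 0.0
print(round(fid_from_stats(mu, sigma, mu + 1.0, sigma), 6))  # 4.0
```

In practice the statistics come from InceptionV3 pool features over the 30K reference images and an equal number of generated samples; the sketch only shows the closed-form distance itself.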

### GenEval Bench

| Model | Params | Res. | Single Obj. | Two Obj. | Count. | Colors | Position | Color Attri. | Overall↑ |
|---|---|---|---|---|---|---|---|---|---|
| *Generation Model* | | | | | | | | | |
| LlamaGen | 0.8B | - | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 |
| LDM | 1.4B | - | 0.92 | 0.29 | 0.23 | 0.70 | 0.02 | 0.05 | 0.37 |
| SDv1.5 | 0.9B | - | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 | 0.43 |
| PixArt-α | 0.6B | - | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 |
| SDv2.1 | 0.9B | - | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
| DALL-E 2 | 6.5B | - | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 |
| Emu3-Gen | 8B | - | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| SDXL | 2.6B | - | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| IF-XL | 4.3B | - | 0.97 | 0.74 | 0.66 | 0.81 | 0.13 | 0.35 | 0.61 |
| DALL-E 3 | - | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| *Unified Model* | | | | | | | | | |
| Chameleon | 34B | - | - | - | - | - | - | - | 0.39 |
| LWM | 7B | - | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 | 0.47 |
| SEED-X | 17B | - | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 |
| Show-o | 1.3B | - | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| Janus | 1.3B | - | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 |
| JanusFlow | 1.3B | - | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 | 0.63 |
| Orthus | 7B | 512 | 0.99 | 0.75 | 0.26 | 0.84 | 0.28 | 0.38 | 0.58 |
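
A note on reading the GenEval table: the Overall column matches, up to rounding of the underlying per-category numbers, the unweighted mean of the six category scores. This is my reading of the numbers, not something the repo states; a minimal sketch:

```python
# Hedged sketch (inference from the table, not repo code): GenEval's
# Overall score as the unweighted mean of the six category scores.
def geneval_overall(scores):
    """Average the six per-category scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)

# Category order: Single Obj., Two Obj., Count., Colors, Position, Color Attri.
janus = [0.97, 0.68, 0.30, 0.84, 0.46, 0.42]
janusflow = [0.97, 0.59, 0.45, 0.83, 0.53, 0.42]
print(geneval_overall(janus))      # 0.61, as in the table
print(geneval_overall(janusflow))  # 0.63, as in the table
```

A few rows (e.g. SDXL) differ by ±0.01, consistent with the per-category cells themselves being rounded before the average was taken.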

## Image Tokenizer

### VQGAN Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 17/11 | VQ-VAE | Neural Discrete Representation Learning | arXiv | - |
| 20/12 | VQGAN | Taming Transformers for High-Resolution Image Synthesis | arXiv | GitHub |
| 21/10 | ViT-VQGAN | Vector-Quantized Image Modeling with Improved VQGAN | arXiv | - |
| 22/03 | RQ-VAE | Autoregressive Image Generation using Residual Quantization | arXiv | GitHub |
| 22/09 | MoVQ | MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation | arXiv | - |
| 22/12 | MAGVIT | MAGVIT: Masked Generative Video Transformer | arXiv | GitHub |
| 23/09 | FSQ | Finite Scalar Quantization: VQ-VAE Made Simple | arXiv | GitHub |
| 23/10 | Efficient-VQGAN | Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers | arXiv | - |
| 23/10 | MAGVIT-v2 | Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | arXiv | - |
| 24/06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | arXiv | GitHub |
| 24/06 | OmniTokenizer | OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | arXiv | GitHub |
| 24/06 | VQGAN-LC | Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99% | arXiv | - |
| 24/09 | Open-MAGVIT2 | Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation | arXiv | GitHub |
| 24/12 | IBQ | Scalable Image Tokenization with Index Backpropagation Quantization | arXiv | GitHub |
| 24/12 | ZipAR | ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality | arXiv | - |
| 24/12 | VidTok | VidTok: A Versatile and Open-Source Video Tokenizer | arXiv | GitHub |
| 25/01 | R3GAN | The GAN is dead; long live the GAN! A Modern GAN Baseline | arXiv | GitHub |
| 25/03 | FAR | Frequency Autoregressive Image Generation with Continuous Tokens | arXiv | GitHub |
| 25/03 | NFIG | NFIG: Autoregressive Image Generation with Next-Frequency Prediction | arXiv | - |
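
Most tokenizers in this family descend from VQ-VAE's core operation: map each continuous encoder latent to the index of its nearest codebook entry, so an image becomes a grid of discrete tokens a language model can predict. A minimal sketch of that lookup (illustrative only, not taken from any listed codebase; the toy codebook is hypothetical):

```python
# Hedged sketch of VQ-VAE / VQGAN nearest-codebook quantization.
import numpy as np

def quantize(latents, codebook):
    """latents: (N, D) continuous vectors; codebook: (K, D) learned entries.
    Returns (token_ids, quantized_vectors)."""
    # Squared Euclidean distance from every latent to every code, expanded as
    # ||z - e||^2 = ||z||^2 - 2 z.e + ||e||^2 for a vectorized computation.
    d2 = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2.0 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    ids = d2.argmin(axis=1)        # one discrete token per latent
    return ids, codebook[ids]      # quantized vectors fed to the decoder

# Toy deterministic codebook: code k is the constant vector (k, k, ..., k).
codebook = np.arange(16.0)[:, None] * np.ones(8)   # K=16 codes, D=8 dims
latents = codebook[[3, 7, 7]] + 0.01               # slightly perturbed codes
ids, zq = quantize(latents, codebook)
print(ids.tolist())  # [3, 7, 7]: each perturbed latent snaps back to its code
```

During training the argmin is non-differentiable, which is why VQ-VAE uses a straight-through estimator and a codebook commitment loss; the entries above (FSQ, IBQ, residual quantization in RQ-VAE) are largely different answers to that optimization problem.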

### Semantic Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/12 | ViLex | Visual Lexicon: Rich Image Features in Language Space | arXiv | - |
| 25/03 | V2Flow | V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation | arXiv | GitHub |

### Semantic and Reconstructed Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/06 | SPAE | SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | arXiv | - |
| 24/06 | BSQ-ViT | Image and Video Tokenization with Binary Spherical Quantization | arXiv | GitHub |
| 24/06 | TiTok | An Image is Worth 32 Tokens for Reconstruction and Generation | arXiv | GitHub |
| 24/10 | ImageFolder | ImageFolder: Autoregressive Image Generation with Folded Tokens | arXiv | GitHub |
| 24/10 | BPE-VQ | From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities | arXiv | - |
| 24/11 | VQ-KD | Image Understanding Makes for A Good Tokenizer for Image Generation | arXiv | GitHub |
| 24/11 | FQGAN | Factorized Visual Tokenization and Generation | arXiv | GitHub |
| 24/12 | XQ-GAN | XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation | arXiv | GitHub |
| 24/12 | Divot | Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | arXiv | GitHub |
| 24/12 | SoftVQ-VAE | SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer | arXiv | GitHub |
| 25/01 | VA-VAE | Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models | arXiv | GitHub |
| 25/02 | ReaLS | Exploring Representation-Aligned Latent Space for Better Generation | arXiv | GitHub |
| 25/02 | MAETok | Masked Autoencoders Are Effective Tokenizers for Diffusion Models | arXiv | GitHub |
| 25/02 | QLIP | QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv | GitHub |
| 25/02 | FlexTok | FlexTok: Resampling Images into 1D Token Sequences of Flexible Length | arXiv | - |
| 25/02 | UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | GitHub |
| 25/03 | USP | USP: Unified Self-Supervised Pretraining for Image Generation and Understanding | arXiv | GitHub |
| 25/03 | SemHiTok | SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation | arXiv | - |
| 25/03 | Robust Tokenizer | Robust Latent Matters: Boosting Image Generation with Sampling Error | arXiv | GitHub |
| 25/03 | PCA Tokenizer | “Principal Components” Enable A New Language of Images | arXiv | GitHub |
| 25/03 | FlowMo | Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization | arXiv | Project Page |
| 25/03 | DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | GitHub |
| 25/03 | CTF | Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction | arXiv | GitHub |
| 25/03 | TokenBridge | Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation | arXiv | Project Page |

## External Image Generator

### Discrete Condition

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/10 | SEED | Making LLaMA SEE and Draw with SEED Tokenizer | arXiv | GitHub |
| 24/02 | AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | GitHub |
| 24/06 | LaVIT | Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | arXiv | GitHub |
| 24/09 | MIO | MIO: A Foundation Model on Multimodal Tokens | arXiv | GitHub |
| 24/12 | ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | GitHub |

### Continuous Condition

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/06 | Emu | Emu: Generative Pretraining in Multimodality | arXiv | GitHub |
| 23/09 | NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | arXiv | GitHub |
| 23/10 | MiniGPT-5 | MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | arXiv | - |
| 23/12 | VL-GPT | VL-GPT: A Generative Pre-Trained Transformer for Vision and Language Understanding and Generation | arXiv | GitHub |
| 24/01 | MM-Interleaved | MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-Modal Feature Synchronizer | arXiv | GitHub |
| 24/02 | EasyGen | EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs | arXiv | - |
| 24/03 | CoDi-2 | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | arXiv | - |
| 24/04 | SEED-X | SEED-X: Multimodal Models with Unified Multi-Granularity Comprehension and Generation | arXiv | GitHub |
| 24/04 | DreamLLM | DreamLLM: Synergistic Multimodal Comprehension and Creation | arXiv | GitHub |
| 24/05 | DEEM | DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception | arXiv | - |
| 24/05 | X-VILA | X-VILA: Cross-Modality Alignment for Large Language Model | arXiv | GitHub |
| 24/06 | Emu2 | Generative Multimodal Models Are In-Context Learners | arXiv | GitHub |
| 24/11 | Spider | Spider: Any-to-Many Multimodal LLM | arXiv | GitHub |
| 24/12 | MetaMorph | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv | GitHub |

## Discrete Image Modelling

### VQGAN Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 21/06 | OFA | OFA: Unifying Multimodal Pretrained Models | arXiv | GitHub |
| 22/04 | Unified-IO | Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | arXiv | GitHub |
| 23/11 | TEAL | TEAL: Tokenize and Embed All for Multi-Modal Large Language Models | arXiv | - |
| 24/02 | LWM | World Model on Million-Length Video and Language with RingAttention | arXiv | GitHub |
| 24/05 | Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | arXiv | GitHub |
| 24/06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | arXiv | GitHub |
| 24/06 | 4M-21 | 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities | arXiv | - |
| 24/08 | ANOLE | Anole: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | arXiv | GitHub |
| 24/08 | Show-o | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | arXiv | - |
| 24/09 | Emu3 | Emu3: Next-Token Prediction is All You Need | arXiv | - |
| 24/12 | Liquid | Liquid: Language Models are Scalable Multi-modal Generators | arXiv | - |
| 24/12 | SynerGen-VL | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | arXiv | - |

### Semantic Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/04 | LIBRA | Libra: Building Decoupled Vision System on Large Language Models | arXiv | - |
| 24/09 | VILA-U | VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | arXiv | - |
| 24/11 | MUSE-VL | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv | - |
| 24/12 | TokenFlow | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | arXiv | - |
| 25/02 | QLIP | QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv | GitHub |
| 25/02 | UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | GitHub |
| 25/03 | DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | GitHub |

### Decoupled Encoder

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/05 | Morph-Tokens | Auto-Encoding Morph-Tokens for Multimodal LLM | arXiv | - |
| 24/06 | Unified-IO 2 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | arXiv | GitHub |
| 24/10 | Janus | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | GitHub |
| 24/12 | ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | - |
| 25/01 | VARGPT | VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | GitHub |
| 25/01 | Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | GitHub |
| 25/03 | OmniMamba | OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv | GitHub |

## Continuous Image Modelling

### Diffusion

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/08 | Transfusion | Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | arXiv | - |
| 24/09 | MonoFormer | MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv | GitHub |
| 24/11 | JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | GitHub |
| 24/11 | JetFormer | JetFormer: An Autoregressive Generative Model of Raw Images and Text | arXiv | - |
| 24/12 | CausalFusion | Causal Diffusion Transformers for Generative Modeling | arXiv | - |
| 24/12 | LlamaFusion | LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv | - |

### AR+Diffusion

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 24/10 | MMAR | Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | arXiv | - |
| 24/12 | Orthus | Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | - |
| 24/12 | LatentLM | Multimodal Latent Language Modeling with Next-Token Diffusion | arXiv | - |
| 25/03 | UniFluid | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv | - |

## Diffusion Models

| Publication Date | Method Abbreviation | Full Title | arXiv Link | Code Repository |
|---|---|---|---|---|
| 23/03 | UniDiffuser | One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | arXiv | - |
| 23/05 | CoDi | Any-to-Any Generation via Composable Diffusion | arXiv | Project Page |
| 23/06 | UniDiff | UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning | arXiv | - |
| 24/12 | OmniFlow | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | arXiv | GitHub |
| 25/01 | Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | - |
| 25/03 | X2I | X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation | arXiv | GitHub |
