EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

Highlights

🔗 Unified & Consistent: Reconciles the "granularity gap" by maintaining task-decoupled yet consistent representations within a shared space via residual evolution.

📊 Data Efficient: Trained entirely on open-source data, without any re-captioning or distillation. Using only 13M images, EvoTok achieves 0.43 rFID on ImageNet-1K (256×256).

🚀 Versatile Performance: Competitive results on 7 of 9 understanding benchmarks and on leading generation benchmarks (e.g., GenEval, GenAI-Bench).

Introduction

EvoTok is a unified image tokenizer designed to bridge the gap between high-level semantic features and low-level pixel features. By modeling visual features as an evolutionary trajectory, EvoTok achieves strong performance in both visual understanding and generation within a single, shared latent space.

EvoTok supports both multimodal understanding and text-to-image generation. For multimodal understanding, it achieves strong results across a wide range of benchmarks. For text-to-image generation, it produces 256×256 images with high visual fidelity and semantic alignment across diverse visual domains (portraits, landscapes, objects, etc.).

Installation

conda create -n evotok python==3.12 -y
conda activate evotok
pip install -r requirements.txt

# Install FlashAttention
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
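After installation, a quick sanity check can confirm that the interpreter version and key packages match what the wheel above expects (CPython 3.12, with `torch` and `flash_attn` importable). This is a minimal sketch and not part of the repository: the function name `check_environment` and the list of packages checked are assumptions made for illustration.

```python
import importlib.util
import sys


def check_environment(required=("torch", "flash_attn"), python=(3, 12)):
    """Report whether the Python version matches and each module can be found.

    Hypothetical helper: the default module list is an assumption based on the
    FlashAttention wheel name, not a list taken from requirements.txt.
    """
    report = {"python_ok": sys.version_info[:2] == tuple(python)}
    for name in required:
        # find_spec locates a top-level module without importing it
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing or mismatched'}")
```

A `False` entry only means the module could not be located in the active environment; it does not verify that the CUDA build matches the installed GPU driver.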

Dataset

We follow the data format of LLaVA-Pretrain to organize our dataset.

For understanding, <image> appears in the human turn as input:

[{"from": "human", "value": "description\n<image>"},
 {"from": "gpt",   "value": "answer."}]

For generation, <image> appears in the assistant turn as the target:

[{"from": "human", "value": "description"},
 {"from": "gpt",   "value": "<image>"}]
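The two conversation layouts above can be produced and told apart programmatically. The sketch below builds both kinds of turn lists and infers the task from which turn carries the `<image>` placeholder. The helper names (`understanding_sample`, `generation_sample`, `task_of`, `wrap`) are hypothetical, and the outer record fields `id`, `image`, and `conversations` are an assumption based on the LLaVA-Pretrain convention rather than something stated here.

```python
import json

IMAGE_TOKEN = "<image>"


def understanding_sample(question, answer):
    """Image is the input: the placeholder goes in the human turn."""
    return [
        {"from": "human", "value": f"{question}\n{IMAGE_TOKEN}"},
        {"from": "gpt", "value": answer},
    ]


def generation_sample(prompt):
    """Image is the target: the placeholder goes in the assistant turn."""
    return [
        {"from": "human", "value": prompt},
        {"from": "gpt", "value": IMAGE_TOKEN},
    ]


def task_of(turns):
    """Infer the task from which speaker's turn contains the placeholder."""
    for turn in turns:
        if IMAGE_TOKEN in turn["value"]:
            return "understanding" if turn["from"] == "human" else "generation"
    raise ValueError("no <image> placeholder found")


def wrap(sample_id, image_path, turns):
    # Assumed LLaVA-Pretrain-style wrapper; the field names are not confirmed here.
    return {"id": sample_id, "image": image_path, "conversations": turns}


if __name__ == "__main__":
    records = [
        wrap("0", "images/0.jpg", understanding_sample("What is shown?", "A cat.")),
        wrap("1", "images/1.jpg", generation_sample("a red bicycle")),
    ]
    print(json.dumps(records, indent=2))
```

Running `task_of` over a dataset file before training is a cheap way to catch records where the placeholder is missing or sits in the wrong turn.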

Training

EvoTok Tokenizer

bash evotok/scripts/train_evotok_siglip256.sh

Multimodal Understanding

# Stage 1: Pretraining
bash mllm/scripts/pretrain_understanding.sh
# Stage 2: Supervised Fine-tuning
bash mllm/scripts/sft_understanding.sh

Text-to-Image Generation

# Stage 1: Pretraining
bash mllm/scripts/pretrain_generation.sh
# Stage 2: Supervised Fine-tuning
bash mllm/scripts/sft_generation.sh

Evaluation

# Tokenizer image reconstruction
bash evotok/scripts/reconstruction_rq.sh

Acknowledgement

This project builds upon several excellent open-source works: RQ-VAE, VILA-U, TokenFlow, and LLaVA. We sincerely thank the authors for their contributions.

Citation

@misc{li2026evotok,
      title={EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation}, 
      author={Yan Li and Ning Liao and Xiangyu Zhao and Shaofeng Zhang and Xiaoxing Wang and Yifan Yang and Junchi Yan and Xue Yang},
      year={2026},
      eprint={2603.12108},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.12108}, 
}
