Survey

03 Nov 2024 Understanding Multimodal LLMs

Pre-training of Large Vision Encoders

Survey

February 3, 2026 Vision Encoders in Vision-Language Models: A Survey
- Jina AI by Elastic

Visual Backbone Networks

使用 ImageNet-1K classification 和 ImageNet-22K 训练

Thu, 22 Oct 2020 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- ViT
Mon, 10 Jan 2022 A ConvNet for the 2020s
- ConvNext

CLIP (Contrastive Language-Image Pre-Training)

Fri, 26 Feb 2021 Learning Transferable Visual Models From Natural Language Supervision
- CLIP dual-encoder architecture
- 这么简单的方法效果居然出奇的好，效果让人瞠目结舌
- openai当时还是那个力大砖飞的openai
Tue, 9 Nov 2021 FILIP: Fine-grained Interactive Language-Image Pre-Training:
- FILIP 和 ColPali 是不是很像
Mon, 15 Nov 2021 LiT: Zero-Shot Transfer with Locked-image text Tuning
- Locked-image text Tuning
Mon, 27 Mar 2023 EVA-CLIP: Improved Training Techniques for CLIP at Scale
- EVA-CLIP
Mon, 27 Mar 2023 Sigmoid Loss for Language Image Pre-Training
- Unlike standard contrastive learning with softmax normalization,
- the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization.
- The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes.
- 对比学习，batchsize 越大效果越好，32K 差不多饱和
- The input images are resized to 224×224 resolution
Wed, 12 Jul 2023 Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- To support variable aspect ratios and readily extrapolate to unseen resolutions, we introduce factorized positional embeddings, where we decompose into separate embeddings ϕx and ϕy of x and y coordinates. These are then summed together
Thu, 28 Sep 2023 Demystifying CLIP Data
Thu, 21 Dec 2023 InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- a vision encoder InternViT-6B
- a language middleware QLLaMA
  - QLLaMA is developed based on the pre-trained multilingual LLaMA, and newly added 96 learnable queries and cross-attention layers (1 billion parameters) that are randomly initialized. This manner allows QLLaMA to smoothly integrate visual elements into the language model, thereby enhancing the coherence and effectiveness of the combined features.
- Training
  - stage 1: contrastive pre-training
    - InternViT-6B vs LLaMA-7B -> contrastive loss
    - 等一下，为什么要跟 LLaMA-7B 对比学习，LLaMA-7B 输出的 embed 是用来做采样下一个词的，跟检索没啥关系啊？？？？
Tue, 29 Jul 2025 MetaCLIP 2: A Worldwide Scaling Recipe
- 在2025年训练一个输入为224的 CLIP. 可以，这很Meta

DINO

Thu, 29 Apr 2021 Emerging Properties in Self-Supervised Vision Transformers
Fri, 14 Apr 2023 DINOv2: Learning Robust Visual Features without Supervision
Thu, 14 Aug 2025 DINOv3

Generative Pretrained Visual Encoders

Thu, 21 Nov 2024 Multimodal Autoregressive Pre-training of Large Vision Encoders
- 使用自回归方法训练 Large Vision Encoders
- Causal Multimodal Decoder + Pixel MSE Loss + Cross-entropy Loss
- Architecture
  - Vision Transformer (ViT) architecture
  - Prefix Attention
    - Following El-Nouby et al. [33], we constrain the self-attention mechanism within the vision encoder by applying a prefix attention mask [95].
    - This strategy facilitates the use of bidirectional attention during inference without additional tuning.
- Post-Training
  - High-resolution Adaptation
  - Native Resolution Fine-tuning
- Multimodal Instruction Tuning
  - Architecture
    - txt: Llama 3.0 8B
    - img: AIMV2-L
    - projector: 2-layer MLP connector
  - Varying the LLM and Data Mixture
    - Across all settings, AIMV2 provides a stronger, or at worst on par, performance compared the OAI CLIP and SigLIP
  - High-Resolution via Tiling
    - We observe that the performance of all methods improves with higher resolutions,
    - with a significant improvement for TextVQA.
    - Notably, AIMV2 maintains its advantage over the baselines in high-resolution tiling settings, demonstrating its versatility.
Mon, 1 Sep 2025 OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
- 直接使用ntp（autoregressive decoder） + synthetically generated 的 image captions 效果就好？
- 也就是说从0直接训练vlm，比用预训练的Vision Encoder 效果好
- 对于 vlm 来说，合成数据训练是有效的

Multimodal Projector(abstractors)

Mon, 30 Jan 2023 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- We propose Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM
Mon, 17 Apr 2023 Visual Instruction Tuning
- We consider a simple linear layer to connect image features into the word embedding space.
Mon, 11 Dec 2023 Honeybee: Locality-enhanced Projector for Multimodal LLM
- 高效将M个视觉模态中间状态映射为N个LLM状态
- C-Abstractor & D-Abstractor

High-resolution LMMs

Mon, 18 Mar 2024 LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
- High-Resolution Image Partition Strategy
- Arbitrary Aspect Ratio Slice Encoding
- Compression Layer
- Spatial Schema for Image Slices
Wed, 18 Dec 2024 LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
- Hierarchical window (Hiwin) transformer

Vision-Language Models

Fri, 28 Jan 2022 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- loss ITC + ITM + LM
- 2022 年 11 月 30 日 chatgpt 发布之前的模型，还没有 GPT like NTP 大一统
Mon, 30 Jan 2023 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Architecture
  - txt: OPT/FlanT5
  - img: ViT-L/14 from CLIP (Radford et al., 2021) and ViT-g/14 from EVA-CLIP (Fang et al., 2022).
  - projector:
    - We propose Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM
Mon, 17 Apr 2023 Visual Instruction Tuning
- Architecture
  - txt: Vicuna
  - img: CLIP visual encoder ViT-L/14
  - projector:
    - We consider a simple linear layer to connect image features into the word embedding space.
- Training
  - Stage 1: Pre-training for Feature Alignment.
    - we keep both the visual encoder and LLM weights frozen,
    - and maximize the likelihood of (3) with trainable parameters θ = W (the projection matrix) only
  - Stage 2: Fine-tuning End-to-End.
Thu, 5 Oct 2023 Improved Baselines with Visual Instruction Tuning
- LLaVA-1.5
  - +VQA-v2, +Format prompt, +MLP VL connector, +OKVQA/OCR
  - +Region-level VQA, +Scale up resolution(336), +GQA, +ShareGPT, +Scale up LLM
  - Architecture
    - txt: Vicuna
    - img: CLIP visual encoder ViT-L/14
    - projector:
      - we find that improving the vision-language connector’s representation power with a two-layer MLP
      - can improve LLaVA’s multimodal capabilities, compared with the original linear projection.
- LLaVA-1.5-HD(AnyRes)
  - Dynamic High Resolution (split & resize)
  - we overcome this by dividing the image into smaller image patches of the resolution that the vision encoder is originally trained for, and encode them independently.
  - After obtaining the feature maps of individual patches, we then combine them into a single large feature map of the target resolution, and feed that into the LLM.
  - To provide the LLM with the global context and to reduce the artifact of the split-encode-merge operation,
  - we additionally concatenate the feature of a downsampled image to the merged feature map.
  - This allows us to scale the input to any arbitrary resolution and maintain the data efficiency of LLaVA-1.5.
  - We call this resulting model LLaVA-1.5-HD.
Thu, 24 Aug 2023 Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Architecture
  - txt: Qwen-7B
  - img: Openclip’s ViT-bigG, input images are resized to a specific resolution
  - projector:
    - Learnable Query Embs + CrossAttn + 2D absolute positional encodings
    - This mechanism compresses the visual feature sequence to a fixed length of 256
- IO
  - img tag: 448*448 resolution image -> compresses the visual feature sequence to a fixed length of 256
  - box tag: The coordinate box is expressed as (x1,y1),(x2,y2)·, where (x1, y1) and (x2, y2) are normalized values in the range [0, 1000).
  - ref tag: Its corresponding text description can be identified by text_caption.
- Training
  - Pre-training
    - The input images are resized to 224 × 224
    - We freeze the large language model and only optimize the vision encoder and VL adapter in this stage.
  - Multi-task Pre-training
    - We increase the input resolution of the visual encoder from 224 × 224 to 448 × 448
    - We unlocked the large language model and trained the whole model
  - Supervised Fine-tuning
    - In this stage, we freeze the visual encoder and optimize the language model and adapter module
Tue, 9 Apr 2024 InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
- Architecture
  - txt: InternLM2-7B
  - img: CLIP visual encoder ViT-L/14(336×336)
  - projector:
    - Dynamic Image Partition + Global-Local Format + Image 2D Structure Newline Indicator.
- Dive into Resolution
  - High-Resolution Training is Critical for HD-OCR tasks.
  - Higher Inference Resolution Leads to better results on Text-related Tasks.
- High-Resolution Strategy Ablation
  - The Role of Global-View
  - The Role of the Newline Token
  - Influence of Token Merging Strategy
Mon, 22 Apr 2024 Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
- Phi-3.5-Vision
- Architecture
  - txt: phi-3.5-mini
  - img: CLIP ViT-L/14
  - projector:
    - dynamic cropping strategy [DZZ+24b] is utilized to split the input image into a 2d array of blocks
    - InternLM-XComposer2-4KHD
Tue, 30 Apr 2024 LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild
- 使用相同的303.5M Vision Encoder，更新更大的 Qwen1.5-110B， Qwen1.5-72B，LLaMA3-8B，效果就是好
- 多模态MMMU 和单模态 MMLU & 模型大小非常相关
Mon, 18 Mar 2024 LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
- Architecture (LLaVA-1.5)
  - txt: Vicuna
  - img: CLIP visual encoder ViT-L/14
  - projector:
    - we compress the visual tokens of each image slice using a shared perceiver resampler layer,
- Modularized Visual Encoding
  - High-Resolution Image Partition Strategy
  - Arbitrary Aspect Ratio Slice Encoding
  - Compression Layer
  - Spatial Schema for Image Slices
- Ablation Study
  - (1) We replace the padding strategy of LLaVA-1.5 with the adaptive encoding strategy of LLaVA-UHD, supporting arbitrary aspect ratios while maintaining identical maximum resolutions. We can observe consistent improvement since wasted computation from padding is avoided.
  - (2) We replace the perceiver resampler of LLaVA-UHD with the 2-layer MLP of LLaVA-1.5. We observe that perceiver resampler achieves comparable or better performance than MLP, using only 12.9% computation cost.
  - (3) We further replace the LLaVA-UHD image partition strategy with the naive partition strategy [24] (i.e., fixed 2 × 2 slices). Results show that LLaVA-UHD can more properly divide images into slices for better performance.
  - (4) We remove the spatial schema from LLaVA-UHD. The performance degradation demonstrates the effectiveness and necessity of spatial schema in informing the dynamic slice positions for LMMs.
Mon, 24 Jun 2024 Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
- 内容太丰富了，但最重要的一点是使用多个 vision feature extractor 互补
- 比如 CLIP 视觉和文本对齐，OpenCLIP ConvNeXt-XXL@1024 支持高分辨率，DINOv2 ViT-L/14@518 视觉任务比较强
Sat, 3 Aug 2024 MiniCPM-V: A GPT-4V Level MLLM on Your Phone
- Architecture
  - txt:
    - MiniCPM 2B & Llama3-Instruct 8B
  - img: SigLIP SoViT-400m/14
  - projector:
    - we take advantage of the adaptive visual encoding method proposed by LLaVA-UHD
    - Image Partition & Slice Encoding
    - Token Compression
      - the visual tokens of each slice are compressed into 64 queries for MiniCPM
    - 64 queries for MiniCPM V1&2 and 96 tokens for MiniCPM-Llama3-V 2.5 through this layer.
    - Spatial Schema
- Training
  - Pre-training
    - Stage-1 224×224，只训练 compression layer
    - Stage-2 224×224 to 448×448， The whole visual encoder is trained, leaving other parameters frozen
    - Stage-3 The LLM is kept frozen to avoid disruption from the relatively low-quality pre-training data
Fri, 16 Aug 2024 xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
- Architecture
  - txt: phi-3.5-mini 4B & 14B
  - img: DFN vs SigLIP
- Training
  - Stage-1 Base Resolution Pre-training
  - Stage-2 High Resolution Pre-training
  - Single-Image Supervised Fine-tuning
  - Interleaved Multi-Image Supervised Fine-tuning
- Data Recipe
- Ablation Studies
  - Ablation on Stage-1 Pre-training
    - Few-shot Pre-training Evaluation
    - Scaling Pre-training Data
    - Visual Backbones
    - we find SigLIP provides better visual representations that boost performance on OCR tasks, and we adopt SigLIP in the final model architecture as the ViT backbone.
  - Ablation on Stage-2 Pre-training
    - Impact of Adding Stage-2 Pre-training and with Different Resolutions.
  - Ablation on Instruction Tuning
    - Perceiver Resampler vs. MLP
    - Any-Resolution Vision Token Sampling
    - Instruction-Aware Vision Token Sampling
Wed, 28 Aug 2024 Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
- STRONGER CLIP ENCODER
  - Direct interpolation (CLIP encoder) to 448 × 448 can achieve competitive performance while being more efficient
- VISION EXPERTS
  - experts
    - (1) Vision-Language Alignment: CLIP/ConvNeXt/OpenCLIP
    - (2) Object-Centric: EVA-02
    - (3) OCR: Pix2Struct
    - (4) Segmentation: SAM
    - (5) Self-supervised: DINOv2
  - distinct advantages of different experts
    - We resize the output 2D feature maps of each vision encoder using bilinear interpolation or
    - pixel shuffle (Shi et al., 2016) to ensure that the visual token number equals 1024.
    - unfreezing the vision experts again leads to consistent improvement,
      - MLLMs with these task specific vision encoders achieve optimal performance in their pretraining domains
- FUSION STRATEGY
  - (1) Sequence Append
  - (2) Channel Concatenation
  - (3) LLaVA-HR: injecting highresolution features into low-resolution vision encoders using mixture-of-resolution adapter
  - (4) Mini-Gemini: using the CLIP tokens as the low-resolution queries to cross-attend another high-resolution vision encoder in the co-located local windows
  - (5) Deformable Attention
  - Channel Concatenation stands out with the best performance, expandability, and efficiency.
- VISON-LANGUAGE PRE-ALIGNMENT
  - 1. training each pre-trained vision expert with their own projector, while keeping the language model frozen;
  - 1. combining all vision experts from the first step and training both the projector and vision experts;
  - 1. training the whole model on SFT data.
- EXTENSION TO MULTI-EXPERTS
  - introducing additional vision encoders enhances the performance
  - CLIP-448 + ConvNext-1024 组合作为baseline
Wed, 18 Sep 2024 Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
- 2B 7B 72B
- Architecture
  - txt: Qwen2 series 1.5B, 7.6B, 72B
  - img:
    - 675M Vision Encoder(DFN’s ViT) + RoPE-2D
    - Naive Dynamic Resolution
    - Multimodal Rotary Position Embedding (M-RoPE)
    - Unified Image and Video Understanding
- Training
  - Following Qwen-VL (Bai et al., 2023b), we adopt a three-stage training methodology
    - In the first stage, we focus exclusively on training the Vision Transformer (ViT) component
    - In the second stage, we unfreeze all parameters and train with a wider range of data for more comprehensive learning
    - In the final stage, we lock the ViT parameters and perform exclusive fine-tuning of the LLM using instructional datasets
- Ablation Study
  - Dynamic Resolution
  - M-RoPE
  - Model Scaling
Wed, 25 Sep 2024 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- Architecture
  - txt: OLMoE-1B-7B, OLMo-7B-1024-preview, Qwen2 7B, Qwen2 72B
  - img: OpenAI’s ViT-L/14 336px CLIP model
  - Evaluation**
    - Broadly speaking, the academic benchmark results and human evaluation agree, with the exception of Qwen2- VL ,
    - which performs strongly on the academic benchmarks and comparatively underperforms in the human evaluation
Wed, 18 Dec 2024 LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
- projector:
  - we present LLaVA-UHD v2, an MLLM with advanced perception abilities by introducing a well-designed vision-language projector, the Hierarchical window (Hiwin) transformer. Hiwin transformer enhances MLLM’s ability to capture diverse multi-modal visual granularities, by incorporating our constructed high-resolution semantic pyramid.
Mon, 20 Jan 2025 Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
- Architecture
  - txt: Qwen2.5-7B
  - img: SigLIP 448448 + ConvNeXt 512512
Wed, 19 Feb 2025 Qwen2.5-VL Technical Report
- 3B 7B 72B
- Architecture
  - txt: Qwen2.5 series
  - img:
    - redesigned Vision Transformer (ViT) architecture
    - 2D-RoPE and window attention
    - During both training and inference, the height and width of the input images are resized to multiples of 28 before being fed into the ViT
    - The vision encoder processes images by splitting them into patches with a stride of 14, generating a set of image features
  - projector:
    - we first group spatially adjacent sets of four patch features
    - These grouped features are then concatenated and passed through a two-layer multi-layer perceptron (MLP)
    - to project them into a dimension that aligns with the text embeddings used in the LLM
- Training
  - ViT
    - we train the redesigned ViT from scratch
    - CLIP pre-training, vision-language alignment, end-to-end fine-tuning
    - Native Dynamic Resolution and Frame Rate
    - Multimodal Rotary Position Embedding Aligned to Absolute Time
  - Pre-Training
    - Pre-Training Data
    - In the first phase, only the Vision Transformer (ViT) is trained to improve its alignment with the language model, laying a solid foundation for multimodal understanding
    - In the second phase, all model parameters are unfrozen, and the model is trained on a diverse set of multimodal image data to enhance its capacity to process complex visual information.
    - In the third phase, to further enhance the model’s reasoning capabilities over longer sequences, video, and agent-based data are incorporated, alongside an increase in sequence length
  - Post-training
    - Supervised Fine-Tuning (SFT)
    - Direct Preference Optimization (DPO)
Thu, 10 Apr 2025 Kimi-VL Technical Report
- Architecture
  - txt: Moonlight MoE language model 16B 2.8B activated (8K) context length
  - img: SigLIP-SO-400M + 2D RoPE +
  - projector: pixel unshuffle operation + 2×2 downsampling + Multi-Layer Perceptron (MLP)
- Muon Optimizer
- Pre-Training
  - ViT Training Stages Following CoCa’s approach
  - Joint Pre-training Stage
  - Joint Cooldown Stage
  - Joint Long-context Activation Stage
- Post-Training
  - Joint Supervised Fine-tuning (SFT)
  - Long-CoT Supervised Fine-Tuning
  - Reinforcement Learning (similar as Kimi k1.5)
Mon, 14 Apr 2025 InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
- Architecture
  - txt: Qwen2.5 series and InternLM3-8B
  - img: InternViT-300M and InternViT-6B
  - projector: Multi-Layer Perceptron (MLP) + pixel unshuffle operation
- Native Multimodal Pre-Training
  - 不使用 projector warming up
  - our method updates all model parameters jointly during multimodal pre-training
- Post-Training
  - Supervised Fine-Tuning
  - Mixed Preference Optimization (DPO)
- Test-Time Scaling
  - Visual Process Reward Model
- Infrastructure
Sun, 11 May 2025 Seed1.5-VL Technical Report
- Architecture
  - txt: Mixture-of-Experts (MoE) LLM of 20B active parameters
  - img: a 532M-parameter vision encoder + 2D RoPE + NaViT
  - projector: 2 × 2 average pooling + Multi-Layer Perceptron (MLP)
- ViT Pre-training Stage
  - MIM with 2D RoPE
    - We leverage the EVA02-CLIP-E [29] as the teacher model, and the student model is randomly initialized following the architecture defined in table 1
  - Native-Resolution Contrastive Learning
    - text encoder is initialized using the text encoder from EVA-02-CLIP-E
    - Alignment between the image and text embeddings is then achieved by jointly optimizing the SigLIP loss [171] and the SuperClass loss [52].
  - Omni-modal Pre-training
    - This stage adopts the MiCo framework
- Video Encoding
  - Dynamic Frame-Resolution Sampling
- Pre-training
  - Stage 0 Projector Warmup
  - Stage 1 Vision-Language Alignment
  - Stage 2 Multimodal Pre-training
- Post-training
  - Supervised Fine-tuning
  - Reinforcement Learning from Human Feedback
    - VLM as a Reward Model
  - Reinforcement Learning with Verifiable Rewards
    - Visual STEM
    - Visual Perception and Reasoning
  - Hybrid Reinforcement Learning
    - our training is a combination of RLHF and RLVR.
    - Format reward
    - Hybrid reward
    - Shared critic
    - KL coefficients
  - Iterative Update by Rejection Sampling Fine-tuning
  - Post-Training Framework
    - We conduct hybrid reinforcement learning with both human feedback (RLHF) and verifier feedback (RLVF) of Seed1.5-VL on a verl-based [122] framework.
Wed, 4 Jun 2025 MiMo-VL Technical Report
- Architecture
  - txt: MiMo-7B
  - img: Qwen2.5-VL
  - projector: Multi-Layer Perceptron (MLP)
- Training
  - Stage 1 Projector Warmup
  - Stage 2 Vision-Language Alignment
  - Stage 3 Multimodal Pre-training
  - Stage 4 Long-context SFT
- Post-Training
  - Reinforcement Learning with Verifiable Rewards (On-Policy GRPO)
    - Visual Reasoning\Text Reasoning\Image Grounding\Visual Counting\Temporal Video Grounding
  - Mixed On-Policy Reinforcement Learning
- Discussion
  - Boosting Reasoning Capability in Pre-training
  - On-Policy RL v.s. Vanilla GRPO
  - Interference Between RL Tasks
Tue, 1 Jul 2025 GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Architecture
  - txt: GLM as the LLM
  - img: AIMv2- Huge
  - projector: MLP adapter
- Pre-training
  - Pre-training Data
  - Training Recipe
    - Multimodal pre-training
    - Long-context continual training
- Supervised Fine-Tuning
  - Supervised Fine-Tuning Data
  - Training Recipe
    - Interestingly, we observe that even when cold-start training uses noisy reasoning data, which contain formatting inconsistencies or repetitive patterns, subsequent RL remains effective. This suggests that imperfect reasoning traces can still provide useful guidance. Nonetheless, models initialized with clean and consistent data show more stable RL convergence and achieve higher overall performance.
- Reinforcement Learning: What’s Challenging and What Works (GRPO)
  - Data preparation
  - Reward system
    - The extraction of the final answer in RLVR.
    - Avoid reward hacking
    - Domain-specific reward system
  - Reinforcement Learning with Curriculum Sampling (RLCS)
  - Infrastructure
- Evaluation
Thu, 31 Jul 2025 [https://stepfun.ai/research/zh/step3](Step3: Cost-Effective Multimodal Intelligence)
- VL 开启 MOE 时代
- Architecture
  - txt: Step3 321B total parameters and 38B active
  - img: 16 spatial down-sampling + Eva-CLIP 5B
Wed, 6 Aug 2025 dots.vlm1
- 1.2B vision encoder + DeepSeek V3 ┓( ´∀` )┏
Mon, 11 Aug 2025 GLM-4.5V
- technical_report: https://github.com/zai-org/GLM-V/blob/main/resources/GLM-4.5V_technical_report.pdf
- txt: GLM-4.5-Air (106B parameters, 12B active)
- SFT+RL 相对 SFT 都所有项目都有比较大的提高
Mon, 25 Aug 2025 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
- Architecture
  - txt: Qwen3 series and GPT-OSS-20B
  - img: InternViT-300M and InternViT-6B
  - projector: Multi-Layer Perceptron (MLP) + pixel unshuffle operation
- Pre-Training
  - next token prediction (NTP) ~250B tokens
- Post-Training
  - Supervised Fine-Tuning(SFT) ~130B tokens
  - Cascade Reinforcement Learning (Cascade RL) 270K samples
    - mixed preference optimization (MPO)
    - GSPO
  - Visual Consistency Learning (ViCO) ~30Btokens
    - ConsistencyTraining
    - Router Training (InternVL3.5-Flash)
Thu, 28 Aug 2025 R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
- Architecture
  - txt: Qwen3-4B
  - img: SigLIP2-So400m enhanced with the AnyRes strategy
  - projector: Multi-Layer Perceptron (MLP)
- Training
  - Stage 1 MLP Warmup
  - Stage 2 Vision-Language Alignment
  - Stage 3 Joint Multimodal Pre-training
- Post-Training
  - Stage 1 Bi-Mode Annealing
Tue, 21 Oct 2025 DeepSeek-OCR: Contexts Optical Compression
- Architecture
  - txt: DeepSeek3B-MoE-A570M
  - img: 80M SAM-base + 300M CLIP-large
  - projector:
    - we borrow from Vary [36] and use a 2-layer convolutional module to perform 16× downsampling of vision tokens
    - the DeepEncoder will segment it into 1024/16×1024/16=4096 patch tokens.
  - Vision Tokens
    - Native Resolution
      - Tiny, 512, 64, resize
      - Small, 640, 100, resize
      - Base, 1024, 256, padding
      - Large, 1280, 400, padding
    - Dynamic Resolution
      - Gundam 640+1024, n×100+256, resize + padding
      - Gundam-M, 1024+1280, n×256+400, resize + padding
  - Conclusion
    - In this technical report, we propose DeepSeek-OCR and preliminarily validate the feasibility of contexts optical compression through this model, demonstrating that the model can effectively decode text tokens exceeding 10 times the quantity from a small number of vision tokens.
Wed, 3 Dec 2025 Jina-VLM: Small Multilingual Vision Language Model
- Architecture
  - txt: Qwen3-1.7B-Base
  - img: SigLIP2-So400M/14-384 (global thumbnail+overlapping tiles)
  - projector:
    - attention pooling over 2×2 patch neighborhoods, using meanpooled features as queries
27 Jan 2026 DeepSeek OCR2
- MinerU2.5 90.67
- PaddleOCR-VL 92.86
- DeepSeek-OCR 87.36
- DeepSeek-OCR2 91.09
27 Jan 2026 Kimi K2.5
- Kimi-K2-Base(1T) + visual(MoonViT 400M)
- Key Features
  - Native Multimodality
  - Coding with Vision
  - Agent Swarm

Video Understanding

Mon, 1 Sep 2025 Kwai Keye-VL 1.5 Technical Report
- Architecture
  - txt: Qwen3-8B
  - img: SigLIP-400M-384-14 + 2D RoPE -> native-resolution ViT
  - projector: MLP 2×2 Patch Merge
  - video: Slow-Fast Video Encoding
- Pre-Training
  - Stage 1: cross-modal alignment
  - Stage 2: multi-task pre-training
  - Stage 3: annealing
  - Sequence Length Extension to 128K
- Post-Training
  - Non-Reasoning Stage: SFT + MPO
  - Keye-Reward Model (SFT+RL training process)
    - LongCoT Cold-Start
    - Iterative General RL
      - General RLVR Training (GSPO)
      - Progressive Hint Sampling
      - Iterative General RL & Cold-Start Enhancement
    - Alignment RL
      - Reward System Design
- Evaluation
  - Keye-VL-1.5 8B-Thinking
  - Keye-VL-Preview 8B-Thinking
  - Qwen2.5-VL 7B
  - InternVL3 8B
  - MiMo-VL 7B-RL 2508
  - GPT-4o
  - Claude 3.7 Sonnet
- Ablation Studies
  - Effects of SFT, MPO, and Long CoT Cold Start
  - Effectiveness of Expert Models and Model Merging
  - Effectiveness of Alignment Reinforcement Learning
  - Effect of Partial Solutions During RL Phase
  - Impact of Rejection Sampling on SFT and RL Performance

Multimodal+Thinking

Mon, 14 Apr 2025 InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
- Post-Training
  - Supervised Fine-Tuning
  - Mixed Preference Optimization (DPO)
- Test-Time Scaling
  - Visual Process Reward Model
Wed, 4 Jun 2025 MiMo-VL Technical Report
- Post-Training
  - Reinforcement Learning with Verifiable Rewards (On-Policy GRPO)
    - Visual Reasoning\Text Reasoning\Image Grounding\Visual Counting\Temporal Video Grounding
  - Mixed On-Policy Reinforcement Learning
Tue, 1 Jul 2025 GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Reinforcement Learning: What’s Challenging and What Works (GRPO)
  - Data preparation
  - Reward system
    - The extraction of the final answer in RLVR.
    - Avoid reward hacking
    - Domain-specific reward system
  - Reinforcement Learning with Curriculum Sampling (RLCS)
  - Infrastructure
Sun, 11 May 2025 Seed1.5-VL Technical Report
- Hybrid Reinforcement Learning
  - our training is a combination of RLHF and RLVR.
Mon, 11 Aug 2025 GLM-4.5V
- technical_report: https://github.com/zai-org/GLM-V/blob/main/resources/GLM-4.5V_technical_report.pdf
- txt: GLM-4.5-Air (106B parameters, 12B active)
- SFT+RL 相对 SFT 都所有项目都有比较大的提高
Mon, 25 Aug 2025 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
- Post-Training
  - Supervised Fine-Tuning(SFT) ~130B tokens
  - Cascade Reinforcement Learning (Cascade RL) 270K samples
    - mixed preference optimization (MPO)
    - GSPO
Thu, 28 Aug 2025 R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
- we first propose bi-mode annealing, which is designed to train a model that is inherently capable of both thinking and non-thinking modes in general domains.
- Then, we perform bimode annealing by mixing these datasets and obtain R-4B-Base. This lays a solid foundation for the model’s subsequent auto-thinking training in general-purpose domains.
Mon, 1 Sep 2025 Kwai Keye-VL 1.5 Technical Report
- Post-Training
  - Non-Reasoning Stage: SFT + MPO
  - Keye-Reward Model (SFT+RL training process)
    - LongCoT Cold-Start
    - Iterative General RL
      - General RLVR Training (GSPO)
      - Progressive Hint Sampling
      - Iterative General RL & Cold-Start Enhancement
    - Alignment RL
      - Reward System Design

Knowledge Distillation

Sun, 10 Dec 2023 AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One
- CLIP, DINOv2, and SAM
Tue, 10 Dec 2024 RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models
- Challenges
  - 3.1. Achieving Multi-Resolution Robustness
    - where feature distributions shift significantly based on input resolution
    - Specifically, low-resolution inputs yield DINO-like features,
    - while high-resolution inputs produce SAM-like features
    - We trace this behavior to the student learning from different teachers at different resolutions during training
  - 3.2. Token Count
    - an excessive number of vision tokens can negatively impact performance or lead to sequence overflows
- Method
  - Finding 1. High-resolution inference through tiling causes the vision encoder to lose global context and exhibit poor scaling equivariance.
  - Finding 2. For the student model to be consistently accurate across resolutions, it is sufficient to match all teachers at all resolutions, and to train at two resolutions simultaneously in the final training stage.
  - Finding 3. Mosaic augmentation greatly reduces the training cost associated with learning from high-resolution teachers and eliminates the need for feature interpolation. Student quality is even improved with this optimization.
  - Finding 4. PHI Standardization helps balance the energy spent learning from each teacher.
  - Finding 5. All teachers are beneficial, including SAM, despite recent trends. It also has broad downstream applicability, granting our student the same abilities.
  - Finding 6. Minimizing the number of partitions seems to be beneficial, assuming you can afford the teacher overhead. Under compute constraints, partitioning is an effective strategy to reduce the overhead.
  - SigLIP Teacher
    - Our choice is validated by the significant improvements observed in VLM tasks
  - Finding 7. Token Merging is very effective at retaining the most diverse information under high compression ratios.
  - Finding 8. Intermediate layer activations greatly benefit downstream tasks if a non-linear transformation is employed.
- Ablation Studies
  - 𝒜: RADIOv2.1-L*: Baseline
  - ℬ: 𝒜 + multi-res: Eliminate modes
  - 𝒞: ℬ - OpenAICLIP + SigLIP: Better VLM
  - 𝒟: 𝒞 + ViT-H: Bigger backbone
  - ℰ: 𝒟 + Token Merging: Improve VLM

Token Merging

Thu, 30 Mar 2023Token Merging for Fast Stable Diffusion

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Survey

Pre-training of Large Vision Encoders

Survey

Visual Backbone Networks

CLIP (Contrastive Language-Image Pre-Training)

DINO

Generative Pretrained Visual Encoders

Multimodal Projector(abstractors)

High-resolution LMMs

Vision-Language Models

Video Understanding

Multimodal+Thinking

Knowledge Distillation

Token Merging

FilesExpand file tree

awesome_vlm.md

Latest commit

History

awesome_vlm.md

File metadata and controls

Survey

Pre-training of Large Vision Encoders

Survey

Visual Backbone Networks

CLIP (Contrastive Language-Image Pre-Training)

DINO

Generative Pretrained Visual Encoders

Multimodal Projector(abstractors)

High-resolution LMMs

Vision-Language Models

Video Understanding

Multimodal+Thinking

Knowledge Distillation

Token Merging