title PaliGemma 2: A Family of Versatile VLMs for Transfer
source arxiv
arxiv_id 2412.03555
url https://arxiv.org/abs/2412.03555
authors
Andreas Steiner
André Susano Pinto
Michael Tschannen
Daniel Keysers
Xiao Wang
Yonatan Bitton
Alexey Gritsenko
Matthias Minderer
Anthony Sherbondy
Shangbang Long
Siyang Qin
Reeve Ingle
Emanuele Bugliarello
Sahar Kazemzadeh
Thomas Mesnard
Ibrahim Alabdulmohsin
Lucas Beyer
Xiaohua Zhai
published 2024-12-04
categories
cs.CV
primary_category cs.CV
fetched_at 2026-05-29T00:00:00Z
topics
multimodal
language-models
computer-vision
aliases
PaliGemma 2: A Family of Versatile VLMs for Transfer
PaliGemma 2
tags
topic/multimodal
topic/language-models
topic/computer-vision
level/frontier
medium/paper
task/multimodal
task/language
technique/transformer
technique/attention
technique/lora-peft
technique/embeddings

Abstract

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.

Why it matters

  • PaliGemma 2 scales a vision-language model family from 2B to 27B parameters by pairing the SigLIP-So400m vision encoder with the full Gemma 2 language model range, providing a systematic study of size vs. resolution trade-offs for transfer learning.
  • Training at three resolutions (224px, 448px, 896px) in multiple stages equips the base models with broad knowledge for transfer via fine-tuning, and the paper analyzes how factors such as learning rate affect transfer and how task type, model size, and resolution interact.
  • The paper widens the set of transfer tasks beyond those covered by PaliGemma, adding OCR-related tasks (table structure recognition, molecular structure recognition, music score recognition) as well as long fine-grained captioning and radiography report generation, and reports state-of-the-art results on them.
  • As an open model family, PaliGemma 2 provides the community with strong, scalable VLM baselines specifically designed for fine-tuning and transfer rather than direct instruction-following deployment.
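
One way to see why resolution matters for the size vs. resolution trade-off is to count how many image tokens the vision encoder hands to the language model at each training resolution. The sketch below assumes a ViT-style encoder with square 14px patches and no token pooling; the patch size is not stated in this entry, so treat `PATCH_SIZE` and the helper `image_tokens` as illustrative assumptions rather than the paper's implementation.

```python
# Rough sketch: image-token count per input resolution for a patch-based
# vision encoder. PATCH_SIZE = 14 is an assumption (not stated in this entry).
PATCH_SIZE = 14

# The three training resolutions named in the abstract.
RESOLUTIONS = (224, 448, 896)


def image_tokens(resolution_px: int, patch_size: int = PATCH_SIZE) -> int:
    """Return the number of patch tokens for a square image.

    Hypothetical helper: assumes non-overlapping square patches and
    that every patch becomes exactly one token (no pooling).
    """
    if resolution_px % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution_px // patch_size
    return patches_per_side * patches_per_side


TOKENS = {res: image_tokens(res) for res in RESOLUTIONS}

if __name__ == "__main__":
    for res, count in TOKENS.items():
        print(f"{res}px -> {count} image tokens")
```

Under these assumptions, each doubling of the resolution quadruples the number of image tokens, which is one reason the paper's analysis of resolution against model size and task type is useful when budgeting compute for fine-tuning.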

Source: https://arxiv.org/abs/2412.03555. This entry contains only the paper's abstract and metadata; read the full paper at the link.