Skip to content

Releases: huggingface/transformers

Patch release v4.50.1

25 Mar 14:34
Compare
Choose a tag to compare

Patch release v4.50.1

There were some very minor bugs with the new hub kernels, and with remote code that we had to fix

Release v4.50.0

21 Mar 13:40
Compare
Choose a tag to compare

Release v4.50.0

New Model Additions

Model-based releases

Starting with version v4.49.0, we have been doing model-based releases, additionally to our traditional, software-based monthly releases. These model-based releases provide a tag from which models may be installed.

Contrarily to our software-releases; these are not pushed to pypi and are kept on our GitHub. Each release has a tag attributed to it, such as:

  • v4.49.0-Gemma-3
  • v4.49.0-AyaVision

⚠️ As bugs are identified and fixed on each model, the release tags are updated so that installing from that tag always gives the best experience possible with that model.

Each new model release will always be based on the current state of the main branch at the time of its creation. This ensures that new models start with the latest features and fixes available.

For example, if two models—Gemma-3 and AyaVision—are released from main, and then a fix for gemma3 is merged, it will look something like this:

              o---- v4.49.0-Gemma-3 (includes AyaVision, plus main fixes)
            /                  \  
---o--o--o--o--o-- (fix for gemma3) --o--o--o main
       \          
        o---- v4.49.0-AyaVision

We strive to merge model specific fixes on their respective branches as fast as possible!

Gemma 3

image

Gemma 3 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

The Gemma 3 model was proposed by Google. It is a vision-language model composed by a SigLIP vision encoder and a Gemma 2 language decoder linked by a multimodal linear projection.

It cuts an image into a fixed number of tokens same way as Siglip if the image does not exceed certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller pacthes and concatenates them with the base image embedding.

One particularity is that the model uses bidirectional attention on all the image tokens. Also, the model interleaves sliding window local attention with full causal attention in the language backbone, where each sixth layer is a full causal attention layer.

Shield Gemma2

ShieldGemma 2 is built on Gemma 3, is a 4 billion (4B) parameter model that checks the safety of both synthetic and natural images against key categories to help you build robust datasets and models. With this addition to the Gemma family of models, researchers and developers can now easily minimize the risk of harmful content in their models across key areas of harm as defined below:

  • No Sexually Explicit content: The image shall not contain content that depicts explicit or graphic sexual acts (e.g., pornography, erotic nudity, depictions of rape or sexual assault).
  • No Dangerous Content: The image shall not contain content that facilitates or encourages activities that could cause real-world harm (e.g., building firearms and explosive devices, promotion of terrorism, instructions for suicide).
  • No Violence/Gore content: The image shall not contain content that depicts shocking, sensational, or gratuitous violence (e.g., excessive blood and gore, gratuitous violence against animals, extreme injury or moment of death).

We recommend using ShieldGemma 2 as an input filter to vision language models, or as an output filter of image generation systems. To train a robust image safety model, we curated training datasets of natural and synthetic images and instruction-tuned Gemma 3 to demonstrate strong performance.

Aya Vision

AyaVision is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

image

The Aya Vision 8B and 32B models is a state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.

Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Whereas, Aya Vision 32B uses Aya Expanse 32B as the language model.

Key features of Aya Vision include:

  • Multimodal capabilities in 23 languages
  • Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
  • High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
  • Seamless integration of visual and textual information in 23 languages.

Mistral 3.1

Mistral 3.1 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

image

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

It is ideal for:

  • Fast-response conversational agents.
  • Low-latency function calling.
  • Subject matter experts via fine-tuning.
  • Local inference for hobbyists and organizations handling sensitive data.
  • Programming and math reasoning.
  • Long document understanding.
  • Visual understanding.

Smol VLM 2

SmolVLM-2 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

image

SmolVLM2 is an adaptation of the Idefics3 model with two main differences:

  • It uses SmolLM2 for the text model.
  • It supports multi-image and video inputs

SigLIP-2

SigLIP-2 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

image

The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin,
Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen,
Andreas Steiner and Xiaohua Zhai.

The model comes in two variants

  1. FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
  2. NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

Prompt Depth Anything

PromptDepthAnything is a high-resolution, accurate metric depth estimation model that leverages prompting, inspired by its success in vision-language (VLMs) and large language models (LLMs). Using iPhone LiDAR as a prompt, the model generates precise depth maps at up to 4K resolution, unlocking the potential of depth foundation models.

image

New tool: attention visualization

We add a new tool to transformers to visualize the attention layout of a given model. It only requires a model ID as input, and will load the relevant tokenizer/model and display what the attention mask looks like. Some examples:

from transformers.utils.attention_visualizer import AttentionMaskVisualizer
visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-3B-Instruct")
visualizer("A normal attention mask")

visualizer = AttentionMaskVisualizer("mistralai/Mistral-Small-24B-Instruct-2501")
visualizer("A normal attention mask with a long text to see how it is displayed, and if it is displayed correctly")

visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
visualizer("<img> You are an assistant.", suffix = "What is on the image?")

visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me") # we should have slidiing on non sliding side by side

visualizer = AttentionMaskVisualizer("google/gemma-3-27b-it")
visualizer("<img>You are an assistant. Make sure you print me") # we should have slidiing on non sliding side by side

image

Deprecating transformers.agents in favor of smolagents

We are deprecating transformers.agents in favour of the smolagents library. Read...

Read more

Mistral 3 (Based on v4.49.0)

18 Mar 16:06
e959530
Compare
Choose a tag to compare

A new model is added to transformers: Mistral 3.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Mistral-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Mistral 3

image

The model is detailed in the following blog post.
The models are available on the Hub with the following tag: mistral3

Overview

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

It is ideal for:

  • Fast-response conversational agents.
  • Low-latency function calling.
  • Subject matter experts via fine-tuning.
  • Local inference for hobbyists and organizations handling sensitive data.
  • Programming and math reasoning.
  • Long document understanding.
  • Visual understanding.

This model was contributed by cyrilvallez and yonigozlan.

The original code can be found here and here.

Usage example

Inference with Pipeline

Here is how you can use the image-text-to-text pipeline to perform inference with the Mistral3 models in just a few lines of code:

>>> from transformers import pipeline

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
... ]

>>> pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'

Inference on a single image

This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...             {"type": "text", "text": "Describe this image"},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
"The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...

Text-only generation

This example shows how to generate text using the Mistral3 model without providing any image input.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = ".mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
>>> user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."

>>> messages = [
...    {"role": "system", "content": SYSTEM_PROMPT},
...    {"role": "user", "content": user_prompt},
... ]

>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=text, return_tensors="pt").to(0, dtype=torch.float16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
>>> decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]

>>> print(decoded_output)
"1. À plus tard!
2. Salut, à plus!
3. À toute!
4. À la prochaine!
5. Je me casse, à plus!

```
 /\_/\
( o.o )
 > ^ <
```"

Batched image and text inputs

Mistral3 models also support batched image and text inputs.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...                 {"type": "text", "text": "Describe this image"},
...             ],
...         },
...     ],
... ]


>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path"
, "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]

Batched multi-image input and quantization with BitsAndBytes

This implementation of the Mistral3 models supports batched text-images inputs with different number of images for each text.
This example also how to use BitsAndBytes to load the model in 4bit quantization.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> model = AutoModelForImageTextToText.from_pretrained(
...     model_checkpoint, quantization_config=quantization_config
... )

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
>>> ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n", "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]

Gemma 3 (Based on v4.49.0)

18 Mar 13:40
Compare
Choose a tag to compare

A new model is added to transformers: Gemma 3.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Gemma-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Gemma 3

image

The model is detailed in the following blog post.
The models and demos using the model are available in the following collection.

A Space to play around with the 12B-it flavor is available here.

Overview

The Gemma 3 model was proposed by Google. It is a vision-language model composed by a SigLIP vision encoder and a Gemma 2 language decoder linked by a multimodal linear projection.

It cuts an image into a fixed number of tokens same way as Siglip if the image does not exceed certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller pacthes and concatenates them with the base image embedding.

One particularity is that the model uses bidirectional attention on all the image tokens. Also, the model interleaves sliding window local attention with full causal attention in the language backbone, where each sixth layer is a full causal attention layer.

Usage tips

  • For image+text and image-only inputs use Gemma3ForConditionalGeneration.
  • For text-only inputs use Gemma3ForCausalLM for generation to avoid loading the vision tower.
  • Each sample can contain multiple images, and the number of images can vary between samples. However make sure to pass correctly batched images to the processor, where each batch is a list of one or more images.
  • The text passed to the processor should have the "<start_of_image_>" token where the images should be inserted.
  • The processor has its own apply_chat_template method to convert chat messages to text that can then be passed as text to the processor. You can also get a vectorized output from apply_chat_template. See the examples below for more details on how to use it.

Image cropping for high resolution images

The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set do_pan_and_scan=True to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher resolution images.

Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it", padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    do_pan_and_scan=True,
).to(model.device)

Usage Example

Single-image Inference

from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True)[inputs.input_ids.shape[1]: ])

Multi-image Inference

from transformers import AutoTokenizer, Gemma3ForCausalLM

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url_cow},
            {"type": "image", "url": url_stop},
            {"type": "text", "text": "Are these two images identical?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True)[inputs.input_ids.shape[1]: ])

Text-only inference

from transformers import AutoTokenizer, Gemma3ForCausalLM

model_id = "google/gemma-3-1b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")

input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=100)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(text)

Aya Vision (Based on v4.49.0)

04 Mar 13:58
84f0186
Compare
Choose a tag to compare

A new model is added to transformers: Aya Vision.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-AyaVision.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Aya Vision

image

The model is detailed in the following blog post.

Overview

The Aya Vision 8B and 32B models is a state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.

Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Whereas, Aya Vision 32B uses Aya Expanse 32B as the language model.

Key features of Aya Vision include:

  • Multimodal capabilities in 23 languages
  • Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
  • High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
  • Seamless integration of visual and textual information in 23 languages.

Usage Example

Here's an example usage of the Aya Vision model.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "CohereForAI/aya-vision-32b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the aya-vision chat template
messages = [
    {"role": "user",
     "content": [
       {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
        {"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
    ]},
    ]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs, 
    max_new_tokens=300, 
    do_sample=True, 
    temperature=0.3,
)

print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

SigLIP-2 (Based on v4.49.0)

21 Feb 10:28
a957b79
Compare
Choose a tag to compare

A new model is added to transformers: SigLIP-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SigLIP-2.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

SigLIP2

image

The paper page for the model is available here.
It is detailed in the following blog post.

The models and demos using the model are available in the following collection.

Overview

The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin,
Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen,
Andreas Steiner and Xiaohua Zhai.

The model comes in two variants

  1. FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
  2. NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

The abstract from the paper is the following:

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success
of the original SigLIP. In this second iteration, we extend the original image-text training objective with
several prior, independently developed techniques into a unified recipe—this includes decoder-based
pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With
these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities,
including zero-shot classification (best SigLIP 2 ViT-g/16 achieves 85.0% ImageNet zero-shot
accuracy), image-text retrieval, and transfer performance when extracting visual representations for
Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements
on localization and dense prediction tasks. We also train variants which support multiple resolutions
and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that
includes de-biasing techniques, leading to much better multilingual understanding and improved fair-
ness. To provide users with the ability to trade-off inference cost with performance, we release model
checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).

Usage tips

  • Usage of SigLIP2 is similar to SigLIP and CLIP. The main difference from CLIP is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
  • Training is supported but does not use torch.distributed utilities which may limit the scalability of batch size. However, DDP and FDSP works on single-node multi-gpu setup.
  • When using the standalone [GemmaTokenizerFast] make sure to pass padding="max_length" and max_length=64 as that's how the model was trained.
  • Model was trained with lowercased text, make sure you make the same preprocessing for your text labels.
  • To get the same results as the pipeline, a prompt template of "this is a photo of {label}" should be used.
  • The NaFlex variant supports processing images at higher resolutions by adjusting the max_num_patches parameter in the Processor. The default value is max_num_patches=256. Increasing max_num_patches to 1024 (4x) will approximately double processed image height and width, while preserving the aspect ratio.

drawing

This model was contributed by qubvel.
The original code can be found here.

Usage example

There are 2 main ways to use SigLIP2: either using the pipeline API, which abstracts away all the complexity for you, or by using the Siglip2Model class yourself.

FixRes variant

Pipeline API

The pipeline allows to use the model in a few lines of code:

>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests

>>> # load pipe
>>> image_classifier = pipeline(
...     task="zero-shot-image-classification",
...     model="google/siglip2-base-patch16-224",
... )

>>> # load image
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # inference
>>> candidate_labels = ["2 cats", "a plane", "a remote"]
>>> outputs = image_classifier(image, candidate_labels=candidate_labels)
>>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
>>> print(outputs)
[{'score': 0.1499, 'label': '2 cats'}, {'score': 0.0008, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]

Using the model yourself

If you want to do the pre- and postprocessing yourself, here's how to do that:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch

>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]

# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
>>> inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
15.0% that image 0 is '2 cats'

NaFlex variant

NaFlex combines ideas from FlexiViT, i.e. supporting multiple, predefined sequence lengths
with a single ViT model, and NaViT, namely processing images at their native aspect ratio.
This enables processing different types of images at appropriate resolution, e.g. using a
larger resolution to process document images, while at the same time minimizing the impact
of aspect ratio distortion on certain inference tasks, e.g. on OCR.

Given a patch size and target sequence length, NaFlex preprocesses the data by first resizing
the input image such that the height and width after resizing are multiples of the patch size,
while

1. keeping the aspect ratio distortion as small as possible
2. producing a sequence length of at most the desired target sequence length (`max_num_patches`)

The resulting distortion in width and height is at most (patch_size - 1) / width and
(patch_size - 1) / height, respectively, which tends to be small for common resolutions and aspect ratios.
After resizing, the image is split into a sequence of patches, and a mask with padding information is added.

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch

>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]

# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing
# higher values e.g. `max_num_patches=512`
>>> inputs = processor(text=texts, images=image, max_num_patches=256, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
21.1% that image 0 is '2 cats'

SmolVLM-2 (Based on v4.49.0)

20 Feb 14:08
4397dfc
Compare
Choose a tag to compare

A new model is added to transformers: SmolVLM-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SmolVLM-2.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

SmolVLM-2

image

SmolVLM-2 is detailed in the following blog post.

The models and demos using the model are available in the following collection.

Overview

SmolVLM2 is an adaptation of the Idefics3 model with two main differences:

  • It uses SmolLM2 for the text model.
  • It supports multi-image and video inputs

Usage tips

Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.

Videos should not be upsampled.

If do_resize is set to True, the model resizes images so that the longest edge is 4*512 pixels by default.
The default resizing behavior can be customized by passing a dictionary to the size parameter. For example, {"longest_edge": 4 * 512} is the default, but you can change it to a different value if needed.

Here’s how to control resizing and set a custom size:

image_processor = SmolVLMImageProcessor(do_resize=True, size={"longest_edge": 2 * 512}, max_image_size=512)

Additionally, the max_image_size parameter, which controls the size of each square patch the image is decomposed into, is set to 512 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the max_image_size parameter.

This model was contributed by orrzohar.

Usage example

Single Media inference

The model can accept both images and videos as input, but you should use only one of the modalities at a time. Here's an example code for that.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Instruct")
model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

conversation = [
    {
        "role": "user",
        "content":[
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
generated_texts = processor.batch_decode(output_ids, skip_special_tokens=True)
print(generated_texts)


# Video
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])

v4.49.0: Helium, Qwen2.5-VL, SuperGlue, Granite Vision, Zamba2, GOT-OCR 2.0, DAB-DETR, Depth Pro, RT-DETRv2, GPTQModel

17 Feb 15:19
Compare
Choose a tag to compare

New models

Helium

Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.

image

Qwen2.5-VL

The Qwen2.5-VL model is an update to Qwen2-VL from Qwen team, Alibaba Group.

The abstract from this update is the following:

Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.

image

SuperGlue

The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.

This model consists of matching two sets of interest points detected in an image. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

image

Granite Vision Support

The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LlaVa-NeXT models to support additional aspect ratios.

Zamba2

Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.

Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (Specifically Mamba) and transformer, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.

image

GOT-OCR 2.0

GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.

image

DAB-DETR

DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.

image

Depth PRO

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

RT-DETRv2

An improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 refines RT-DETR by introducing selective multi-scale feature extraction, a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 increase in mAP metrics on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.

Transformers-CLI

Transformers' CLI welcomes a new command: chat. This command starts a conversation with the model of your choosing directly in your terminal.

This feature exists in TRL and has been migrated to transformers for easier usage.

ezgif-56c494108b6d77

Processor Standardization

An ongoing work is to standardize the image processors so that their API is equivalent. Additionally, the processors are given a fast variant so that they are never blockers in the image processing pipelines.

In this release, several processors have been standardized and have seen their fast version be contributed.

Breaking changes

DPT segmentation maps

DPT image processors did not support segmentation_maps, instead only requiring images. This has been fixed.
This adds an argument to the preprocess method, therefore users using arguments as positional arguments with that method may see changed behavior. We recommend using keyword arguments for such methods so as to not be bothered by the addition of new features.

  • 🔴 🔴 🔴 Added segmentation maps support for DPT image processor by @simonreise in #34345

Image classification pipeline and single vs multi-label

The problem_type in the config.json file was read incorrectly by the pipeline, which mapped single-label to multi-label losses, and vice-versa. This has been fixed.

  • 🚨🚨🚨 image-classification pipeline single-label and multi-label prob type squashing fns (sigmoid vs softmax) are backwards by @rwightman in #35848

Fixing the LayerNorm beta/gamma renames

The description of the pull request is the easiest way to understand the problem, why it exists, and how it is solved; please read the description below:

  • 🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. by @rwightman in #35615

VLM cleanup

The ignore_index property of the llava configuration has been removed as it was not serving a purpose.

Quantization

Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.

Additionally, we're replacing the AutoGPTQ implementaiton with GPTQModel from ModelCloud (see repository here)).

GPTQModel originated as major refractor of AutoGPTQ but is now a full-stand-in replacement with cleaner api, up-to-date model support, faster inference, higher quality quants.

  • Enable gptqmodel by @ jiqing-feng in #35012
  • Split and clean up GGUF quantization tests by @Isotr0py in #35502
  • Display warning for unknown quants config instead of an error by @SunMarc in #35963
  • Adding FP8 Quantization to transformers by @MekkCyber in #36026
  • New HIGGS quantization interfaces, JIT kernel compilation support. by @BlackSamorez in #36148

Generate

  • [generate] revert change in Aria: the maximum cache length must match max_length by @gante in #36120
  • 🧹 remove generate-related objects and methods scheduled for removal in v4.48 by @gante in #35677
  • [generate] ...
Read more

Patch release v4.48.3

07 Feb 10:10
Compare
Choose a tag to compare

Patch release v4.48.3

This ends the python3.9 issues mostly!

For some very niche cases, the new rope embedding introduced device failures

Num items in batch

  • Fix model kwargs (#35875) by @muellerzr: this is long due, sorry that it took so long. Some models were not compatible with the num_items_in_batch

Finally the fix to Gemma2 is propagated to paligemma2!

Patch release v4.48.2

30 Jan 19:52
Compare
Choose a tag to compare

Patch release v4.48.2

Sorry because the fixes for num_items_in_batches are not done yet 😓 To follow along see this PR, a new patch will be available soon!

Now, we mostly had BC issue with python version 3.9:

Then we had a small regression for DBRX saving:

Finally we have a fix for gemma and the hybrid attention architectures:

Miscellaneous: