
Conversation

@tomaarsen
Member

Hello!

Pull Request overview

  • Introduce cross-modality and multi-modality support via refactors of SentenceTransformer, Router, and Transformer
  • Modularize the CrossEncoder class, initially by subclassing SentenceTransformer, though long-term I want it to subclass a new shared superclass

Details

This pull request is very much a work-in-progress, although it is already functional. In short:

  1. Transformer now works with an AutoProcessor and handles inputs through it. The processor accepts multiple modalities.
  2. SentenceTransformer, Transformer, and Router check the modality of the inputs; only one modality is allowed per inference call.
  3. Router has been adapted to allow for modality-based routing.
  4. There is a strict distinction between a model with modalities ["text", "image"] and one with [("text", "image")]. The former is cross-modal, i.e. you can pass either text or images, and you can then compare the embeddings across the modalities. The latter is multi-modal, i.e. you pass text AND images at the same time, and this joint input results in one embedding output. The "one input in, one embedding out" behaviour is a core feature (see the sketch right after this list).
  5. Multimodal models can be called with lists of dictionaries using modalities as keys, e.g. model.encode([{"text": "This is my <image>", "image": "cat.jpg"}, ...]).
  6. Transformer is designed to be somewhat flexible moving forward. Model authors can specify which modalities are supported, which methods on the AutoModel need to be called, and which output keys need to be used from those methods' outputs. The goal is to have strong defaults as well.
  7. model.modalities gives a list of supported modalities, e.g. SentenceTransformer("laion/clap-htsat-unfused").modalities is ['text', 'audio'].
  8. The supported modalities are text, image, audio, video, combinations of the previous, and message. The latter uses processor.apply_chat_template.
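
For illustration, here is a minimal cross-modal sketch (assuming encode() accepts PIL images directly for image inputs, as in the Router example further down; cat.jpg is a placeholder file):

from PIL import Image
from sentence_transformers import SentenceTransformer

# Cross-modal model: accepts either text OR images, and the embeddings are comparable across modalities
model = SentenceTransformer("openai/clip-vit-base-patch32")
print(model.modalities)  # expected: ['text', 'image']

# Only one modality is allowed per encode() call
text_embeddings = model.encode(["a photo of a cat", "a photo of a dog"])
image_embeddings = model.encode([Image.open("cat.jpg")])

# Compare embeddings across modalities
print(model.similarity(text_embeddings, image_embeddings))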

Here are two cross-modal models that I trained:

This is an incomplete list of models that you can simply initialize with SentenceTransformer(model_name) (a rough multi-modal usage sketch follows the list):

  • text
    • all-MiniLM-L6-v2
    • google/embeddinggemma-300m
    • Qwen/Qwen3-Embedding-0.6B
    • google/gemma-3-1b-pt
  • image
    • google/vit-base-patch16-224-in21k
    • facebook/deit-base-distilled-patch16-224
    • facebook/dinov2-with-registers-small
    • DeepGlint-AI/mlcd-vit-bigG-patch14-336
    • microsoft/resnet-18
    • timm/mobilenetv4_conv_medium.e500_r256_in1k
    • timm/convnext_base.clip_laion2b
    • microsoft/beit-base-patch16-224
    • google/bit-50
    • microsoft/conditional-detr-resnet-50
  • audio
    • facebook/wav2vec2-large-960h-lv60-self
    • nari-labs/Dia-1.6B-0626
    • facebook/hubert-large-ls960-ft
  • cross text+image
    • kakaobrain/align-base
    • apple/aimv2-large-patch14-224-lit
    • openai/clip-vit-base-patch32
    • google/siglip-base-patch16-224
  • cross text+audio
    • laion/clap-htsat-unfused
  • multi text+image
    • google/paligemma-3b-mix-448
    • ds4sd/SmolDocling-256M-preview
    • ibm-granite/granite-docling-258M
    • ibm-granite/granite-vision-3.3-2b
  • multi text+image as message
    • deepseek-community/deepseek-vl-1.3b-chat
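
As a rough multi-modal usage sketch, using the dictionary input format from point 5 above (the model choice, prompt text, and image file names are just illustrative placeholders):

from sentence_transformers import SentenceTransformer

# Multi-modal model: one joint text+image input produces one embedding
model = SentenceTransformer("ibm-granite/granite-docling-258M")

embeddings = model.encode(
    [
        {"text": "This is my <image>", "image": "cat.jpg"},
        {"text": "An invoice: <image>", "image": "invoice.png"},
    ]
)
print(embeddings.shape)  # one embedding per dictionary input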

For more complex setups, you can use a Router, e.g. when a single transformer model doesn't cover the modalities that you're after:

from PIL import Image
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Dense, Pooling, Router, Transformer

# Create separate encoders for different modalities
text_encoder = Transformer("sentence-transformers/all-MiniLM-L6-v2")
# Project to 768 dims to match image encoder
text_dense = Dense(text_encoder.get_word_embedding_dimension(), 768, module_input_name="token_embeddings")
image_encoder = Transformer(
    "ModernVBERT/modernvbert",
    model_args={"trust_remote_code": True},
    tokenizer_args={"trust_remote_code": True},
    config_args={"trust_remote_code": True},
)
pooling = Pooling(text_encoder.get_word_embedding_dimension())

# Route based on modality
router = Router(
    sub_modules={
        "text": [text_encoder, text_dense],
        "image": [image_encoder],
    },
    route_mappings={
        (None, "text"): "text",  # Any task with text goes to text encoder
        (None, ("text", "image")): "image",  # Any task with text-image together goes to image encoder
    },
)

model = SentenceTransformer(modules=[router, pooling])

# Modality is automatically inferred
text_embedding = model.encode("A photo of a cat")
multimodal_embedding = model.encode({"text": "A photo of a <image>", "image": Image.open("cat.jpg")})

similarity = model.similarity(text_embedding, multimodal_embedding)

For the text modality, this example uses all-MiniLM-L6-v2 with a linear layer that projects the token embeddings to 768 dimensions before pooling. For the multimodal text+image route, it uses ModernVBERT/modernvbert, a model that supports both text AND image inputs simultaneously for one output embedding. This model could be trained to perform multimodal retrieval or similar tasks.

There are currently a lot of tiny breaking changes that I want to iron out. If I can't get rid of all of them, then sadly this refactor will have to wait until a v6.0 release, which I would normally only do alongside the introduction of a new archetype, Late Interaction in this case. Tons of TODOs also remain.

cc @NohTow - this should in theory allow you to work on multi-modal/cross-modal Late Interaction. One big annoyance for you, for now, will likely be that many architectures like CLIP/CLAP default to using get_text_features/get_..._features from the transformers model, and these methods all output pooled embeddings rather than token embeddings. That is okay-ish for Sentence Transformers, but a big problem for LI models.
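
To illustrate the difference with plain transformers (independent of this PR): get_text_features returns one pooled, projected vector per input, whereas LI needs the per-token hidden states:

import torch
from transformers import AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
inputs = tokenizer(["a photo of a cat"], return_tensors="pt")

with torch.no_grad():
    # Pooled + projected: shape (batch_size, projection_dim), fine for ST-style embeddings
    pooled = model.get_text_features(**inputs)
    # Per-token hidden states: shape (batch_size, seq_len, hidden_size), what LI models need
    token_states = model.text_model(**inputs).last_hidden_state

print(pooled.shape, token_states.shape)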

  • Tom Aarsen

@coreintelligence

coreintelligence commented Nov 12, 2025

Question: In addition to Qwen/Qwen3-Embedding-#B, would rerankers also be supported as initializers for the CrossEncoder class (e.g. Qwen3-Reranker-#B)?

@tomaarsen
Member Author

@coreintelligence Yes, that is the intention. The 'Qwen3-Reranker-#B' rerankers are a new style of reranker based on CausalLM models: they use specific prompt templates, and the scores for specific tokens (e.g. "yes", "no", "1", "0") are used to compute a relevance score. Other examples are https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v2 and https://huggingface.co/ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b.
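
For the general pattern, here is a rough sketch of how such CausalLM rerankers compute a score with plain transformers (the prompt template is heavily simplified and the model name is just an example; check the respective model cards for the exact format):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Reranker-0.6B"  # illustrative; each reranker has its own required template
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

query = "What is the capital of France?"
document = "Paris is the capital of France."
# Simplified prompt; the real templates are more involved
prompt = f"Query: {query}\nDocument: {document}\nIs the document relevant to the query? Answer yes or no:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

yes_id = tokenizer.convert_tokens_to_ids("yes")
no_id = tokenizer.convert_tokens_to_ids("no")
# The probability mass on "yes" vs. "no" acts as the relevance score
score = torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)[0].item()
print(score)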

  • Tom Aarsen

tomaarsen added a commit to omkar-334/sentence-transformers that referenced this pull request Dec 5, 2025
This will align better with my goals for the big refactor of huggingface#3554, where these methods will be called _multi_process and _multi_process_worker
tomaarsen added a commit that referenced this pull request Dec 5, 2025
* add multiprocessing support for Cross Encoder

* Rename _predict_multi_process... -> _multi_process...

This will align better with my goals for the big refactor of #3554, where these methods will be called _multi_process and _multi_process_worker

* Add test suite for multi-processing, mirroring the test suite for ST models

* Reorder kwargs to match ST

* Change how device is determined, matching ST

* Add device, pool, and chunk_size to other predict typings

* Upgrade  with multi-gpu reranking

* Update test_hard_negatives test, simplify mine_hard_negatives slightly

---------

Co-authored-by: Tom Aarsen <[email protected]>
@tomaarsen
Member Author

TODO for myself: Consider hard-deprecating safe_serialization when moving to transformers v5 in this release.

@tomaarsen tomaarsen requested a review from Copilot December 15, 2025 19:12
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
