Supported Models

The Multimodal Embedding Serving microservice supports multiple vision-language models for generating embeddings from text, images, and videos.

Available Models

CLIP (Contrastive Language-Image Pretraining)

| Model ID | Architecture | Embedding Dimension |
| --- | --- | --- |
| CLIP/clip-vit-b-32 | ViT-B-32 | 512 |
| CLIP/clip-vit-b-16 | ViT-B-16 | 512 |
| CLIP/clip-vit-l-14 | ViT-L-14 | 768 |
| CLIP/clip-vit-h-14 | ViT-H-14 | 1024 |

Standard OpenAI CLIP models for general-purpose vision-language understanding.

CN-CLIP (Chinese CLIP)

| Model ID | Architecture | Embedding Dimension |
| --- | --- | --- |
| CN-CLIP/cn-clip-vit-b-16 | ViT-B-16 | 512 |
| CN-CLIP/cn-clip-vit-l-14 | ViT-L-14 | 768 |
| CN-CLIP/cn-clip-vit-h-14 | ViT-H-14 | 1024 |

Chinese-optimized CLIP models supporting both Chinese and English text.

MobileCLIP

| Model ID | Architecture | Embedding Dimension |
| --- | --- | --- |
| MobileCLIP/mobileclip_s0 | MobileCLIP-S0 | 512 |
| MobileCLIP/mobileclip_s1 | MobileCLIP-S1 | 512 |
| MobileCLIP/mobileclip_s2 | MobileCLIP-S2 | 512 |
| MobileCLIP/mobileclip_b | MobileCLIP-B | 512 |
| MobileCLIP/mobileclip_blt | MobileCLIP-BLT | 512 |

Lightweight CLIP models designed for mobile and edge deployment.

SigLIP

| Model ID | Architecture | Embedding Dimension |
| --- | --- | --- |
| SigLIP/siglip2-vit-b-16 | ViT-B-16 | 768 |
| SigLIP/siglip2-vit-l-16 | ViT-L-16 | 1024 |
| SigLIP/siglip2-so400m-patch16-384 | ViT-So400M | 1152 |

CLIP-style models trained with a pairwise sigmoid loss instead of the softmax contrastive loss used by standard CLIP.

BLIP-2 (Semantic Search / Retrieval)

| Model ID | Architecture | Embedding Dimension | HuggingFace Model | Handler |
| --- | --- | --- | --- | --- |
| Blip2/blip2_transformers | BLIP-2 + Q-Former | 256 | Salesforce/blip2-itm-vit-g | Transformers |

The BLIP-2 handler uses Blip2ForImageTextRetrieval from HuggingFace Transformers with projection layers (768D→256D) to generate compact embeddings.

For detailed architecture and implementation details, see BLIP-2 Transformers Guide.
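
As a point of orientation only, the snippet below sketches the projection step with random tensors standing in for real Q-Former outputs; it is not the handler's implementation (which wraps Blip2ForImageTextRetrieval directly), and the tensor shapes are illustrative.

```python
# Illustrative sketch only: projecting Q-Former features (768-D) down to the
# compact 256-D embedding space and L2-normalizing them.
import torch
import torch.nn.functional as F

qformer_dim, embed_dim = 768, 256                # dimensions cited above
vision_projection = torch.nn.Linear(qformer_dim, embed_dim)

# Stand-in for a real Q-Former output: 32 learned query tokens, 768-D each.
qformer_output = torch.randn(1, 32, qformer_dim)

# Project to 256-D and L2-normalize; retrieval scoring typically compares a
# text embedding against all projected query tokens and keeps the maximum.
image_embeds = F.normalize(vision_projection(qformer_output), dim=-1)
print(image_embeds.shape)                        # torch.Size([1, 32, 256])
```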

Qwen Text Embeddings

| Model ID | Hugging Face Repo | Embedding Dimension | Precision | Notes |
| --- | --- | --- | --- | --- |
| QwenText/qwen3-embedding-0.6b | Qwen/Qwen3-Embedding-0.6B | 1024 | INT8 | Text-only, instruction-aware, Context Length: 32k |
| QwenText/qwen3-embedding-4b | Qwen/Qwen3-Embedding-4B | 2560 | INT8 | Text-only, instruction-aware, Context Length: 32k |
| QwenText/qwen3-embedding-8b | Qwen/Qwen3-Embedding-8B | 4096 | INT8 | Text-only, instruction-aware, Context Length: 32k |

The Qwen text embedding handler provides high-quality multilingual embeddings optimized with OpenVINO. These models:

  • Are text-only and do not expose image or video encoders.
  • Automatically wrap queries using the recommended instruction template ("Instruct: {task_description}\nQuery:{query}"); see the sketch after this list.
  • Convert to OpenVINO INT8 format on first use and store compiled artifacts under the configured EMBEDDING_OV_MODELS_DIR.
  • Require trust_remote_code=true (handled by the factory).
  • Support Intel GPU execution via OpenVINO.
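
A minimal sketch of the query wrapping mentioned in the list above; the helper name and the example task description are hypothetical, and only the template string comes from the list itself.

```python
# Minimal sketch of the instruction wrapping described above.
# `format_query` and the sample task description are hypothetical;
# the template string is the one quoted in the list above.
def format_query(query: str, task_description: str) -> str:
    return f"Instruct: {task_description}\nQuery:{query}"

print(format_query(
    "best lightweight CLIP model for edge devices",
    "Given a web search query, retrieve relevant passages that answer the query",
))
```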

Use the /model/capabilities endpoint to inspect which modalities the currently loaded model supports.

Model Configuration

Set your chosen model using environment variables:

```bash
# Example: Using BLIP-2 (Transformers)
export EMBEDDING_MODEL_NAME="Blip2/blip2_transformers"

# Example: Using CLIP
export EMBEDDING_MODEL_NAME="CLIP/clip-vit-b-16"

# Example: Using MobileCLIP
export EMBEDDING_MODEL_NAME="MobileCLIP/mobileclip_s0"

# Example: Using Qwen text embeddings (INT8 OpenVINO)
export EMBEDDING_MODEL_NAME="QwenText/qwen3-embedding-0.6b"
export EMBEDDING_USE_OV=true
export EMBEDDING_DEVICE=GPU  # or CPU/AUTO
export EMBEDDING_OV_MODELS_DIR=/app/ov_models

source setup.sh
```

All models support OpenVINO optimization for Intel hardware acceleration:

```bash
export EMBEDDING_USE_OV=true
export EMBEDDING_DEVICE=CPU  # or GPU
```

OpenVINO Conversion Support

The service can convert any supported model to OpenVINO automatically. The conversion step detects whether the model is hosted on the Hugging Face Hub and picks the appropriate conversion path.
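
As a rough sketch of what such a dispatch can look like (the function, the custom-model loader, and the example input shape below are assumptions for illustration, not the service's actual code): Hub-hosted models can be exported with optimum-intel, while custom checkpoints can be converted directly with openvino.convert_model.

```python
# Hedged sketch of a Hub-vs-custom conversion dispatch; the function name,
# the loader, and the dispatch logic are assumptions, not the service's code.
from huggingface_hub import model_info
from huggingface_hub.utils import HfHubHTTPError


def convert_to_openvino(model_id: str, output_dir: str) -> None:
    try:
        model_info(model_id)          # raises if the repo is not on the Hub
        on_hub = True
    except HfHubHTTPError:
        on_hub = False

    if on_hub:
        # Hub-hosted models: let optimum-intel handle the export.
        from optimum.intel import OVModelForFeatureExtraction
        ov_model = OVModelForFeatureExtraction.from_pretrained(model_id, export=True)
        ov_model.save_pretrained(output_dir)
    else:
        # Custom checkpoints: convert the torch module directly.
        import openvino as ov
        import torch
        torch_model = load_custom_model(model_id)      # hypothetical loader
        converted = ov.convert_model(torch_model, example_input=torch.randn(1, 3, 224, 224))
        ov.save_model(converted, f"{output_dir}/model.xml")
```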

Supported Input Formats

  • Text: UTF-8 strings (available for all models)
  • Images: JPEG, PNG, WebP, and other formats supported by PIL, supplied base64-encoded. Not available for the Qwen text-only models.
  • Videos: any format supported by FFmpeg (MP4, AVI, MOV, etc.), supplied base64-encoded. Not available for the Qwen text-only models.

All models are compatible with the OpenAI embeddings API format.
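
Binary inputs are carried as base64 strings; a minimal sketch of preparing one is shown below (the file name is a placeholder, and the exact request field that carries the encoded data is defined by the service's API reference).

```python
# Base64-encode an image file so it can be placed in a JSON request body.
# "example.jpg" is a placeholder; any PIL-supported format works for images.
import base64

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
```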

API Usage

Query available models:

```bash
curl http://localhost:9777/model/list
```

Get current model information:

```bash
curl http://localhost:9777/model/current
```

Inspect modality support for the active model:

```bash
curl http://localhost:9777/model/capabilities
```
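
Since the models follow the OpenAI embeddings API format, a text embedding request can be issued as sketched below; the /v1/embeddings path and the payload fields are assumptions based on the OpenAI format, not endpoints confirmed by this page.

```python
# Hedged example of an OpenAI-format embeddings request.
# The endpoint path and payload fields are assumptions based on the
# OpenAI embeddings API, not confirmed for this service.
import requests

resp = requests.post(
    "http://localhost:9777/v1/embeddings",
    json={
        "model": "CLIP/clip-vit-b-16",
        "input": "a photo of a cat sitting on a windowsill",
    },
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))   # 512 for CLIP/clip-vit-b-16
```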

Related Documentation