The Multimodal Embedding Serving microservice supports multiple vision-language models for generating embeddings from text, images, and videos.
| Model ID | Architecture | Embedding Dimension |
|---|---|---|
| CLIP/clip-vit-b-32 | ViT-B-32 | 512 |
| CLIP/clip-vit-b-16 | ViT-B-16 | 512 |
| CLIP/clip-vit-l-14 | ViT-L-14 | 768 |
| CLIP/clip-vit-h-14 | ViT-H-14 | 1024 |
Standard OpenAI CLIP models for general-purpose vision-language understanding.
| Model ID | Architecture | Embedding Dimension |
|---|---|---|
| CN-CLIP/cn-clip-vit-b-16 | ViT-B-16 | 512 |
| CN-CLIP/cn-clip-vit-l-14 | ViT-L-14 | 768 |
| CN-CLIP/cn-clip-vit-h-14 | ViT-H-14 | 1024 |
Chinese-optimized CLIP models supporting both Chinese and English text.
| Model ID | Architecture | Embedding Dimension |
|---|---|---|
| MobileCLIP/mobileclip_s0 | MobileCLIP-S0 | 512 |
| MobileCLIP/mobileclip_s1 | MobileCLIP-S1 | 512 |
| MobileCLIP/mobileclip_s2 | MobileCLIP-S2 | 512 |
| MobileCLIP/mobileclip_b | MobileCLIP-B | 512 |
| MobileCLIP/mobileclip_blt | MobileCLIP-BLT | 512 |
Lightweight CLIP models designed for mobile and edge deployment.
| Model ID | Architecture | Embedding Dimension |
|---|---|---|
| SigLIP/siglip2-vit-b-16 | ViT-B-16 | 768 |
| SigLIP/siglip2-vit-l-16 | ViT-L-16 | 1024 |
| SigLIP/siglip2-so400m-patch16-384 | ViT-So400M | 1152 |
CLIP-style models trained with a sigmoid loss rather than the softmax-based contrastive loss of standard CLIP.
| Model ID | Architecture | Embedding Dimension | HuggingFace Model | Handler |
|---|---|---|---|---|
| Blip2/blip2_transformers | BLIP-2 + Q-Former | 256 | Salesforce/blip2-itm-vit-g | Transformers |
The BLIP-2 handler uses Blip2ForImageTextRetrieval from HuggingFace Transformers with projection layers (768-D → 256-D) to generate compact embeddings.
For architecture and implementation details, see the BLIP-2 Transformers Guide.
| Model ID | Hugging Face Repo | Embedding Dimension | Precision | Notes |
|---|---|---|---|---|
| QwenText/qwen3-embedding-0.6b | Qwen/Qwen3-Embedding-0.6B | 1024 | INT8 | Text-only, instruction-aware, 32k context length |
| QwenText/qwen3-embedding-4b | Qwen/Qwen3-Embedding-4B | 2560 | INT8 | Text-only, instruction-aware, 32k context length |
| QwenText/qwen3-embedding-8b | Qwen/Qwen3-Embedding-8B | 4096 | INT8 | Text-only, instruction-aware, 32k context length |
The Qwen text embedding handler provides high-quality multilingual embeddings optimized with OpenVINO. These models:
- Are text-only and do not expose image or video encoders.
- Automatically wrap queries using the recommended instruction template `"Instruct: {task_description}\nQuery:{query}"` (see the sketch after this list).
- Convert to OpenVINO INT8 format on first use and store compiled artifacts under the configured `EMBEDDING_OV_MODELS_DIR`.
- Require `trust_remote_code=true` (handled by the factory).
- Support Intel GPU execution via OpenVINO.
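Here is a minimal sketch of what that wrapping produces for a single query; the task description and query are invented examples, while the template string is the one quoted above:

```bash
# Illustrative only: reproduce the instruction template from the list above.
# The task description and query below are made-up examples.
task="Given a web search query, retrieve relevant passages"
query="What is OpenVINO?"
printf 'Instruct: %s\nQuery:%s\n' "$task" "$query"
```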
Use the /model/capabilities endpoint to inspect which modalities the currently loaded model supports.
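For example (the output shown is an illustrative sketch; the actual response fields are not documented in this section and may differ):

```bash
curl http://localhost:9777/model/capabilities
# Illustrative response shape (field names are an assumption):
# {"model": "QwenText/qwen3-embedding-0.6b", "text": true, "image": false, "video": false}
```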
Set your chosen model using environment variables:
```bash
# Example: Using BLIP-2 (Transformers)
export EMBEDDING_MODEL_NAME="Blip2/blip2_transformers"

# Example: Using CLIP
export EMBEDDING_MODEL_NAME="CLIP/clip-vit-b-16"

# Example: Using MobileCLIP
export EMBEDDING_MODEL_NAME="MobileCLIP/mobileclip_s0"

# Example: Using Qwen text embeddings (INT8 OpenVINO)
export EMBEDDING_MODEL_NAME="QwenText/qwen3-embedding-0.6b"
export EMBEDDING_USE_OV=true
export EMBEDDING_DEVICE=GPU  # or CPU/AUTO
export EMBEDDING_OV_MODELS_DIR=/app/ov_models

source setup.sh
```

All models support OpenVINO optimization for Intel hardware acceleration:
```bash
export EMBEDDING_USE_OV=true
export EMBEDDING_DEVICE=CPU  # or GPU
```

The service supports automatic OpenVINO conversion for all models. The conversion process automatically detects whether a model has HuggingFace Hub support and uses the appropriate conversion method.
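After the first conversion, compiled artifacts are reused across restarts. To confirm the cache was written, list the configured models directory (the default path below follows the `EMBEDDING_OV_MODELS_DIR` example above; the layout inside it is implementation-defined):

```bash
# List cached OpenVINO artifacts; falls back to the example path used earlier.
ls -R "${EMBEDDING_OV_MODELS_DIR:-/app/ov_models}"
```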
- Text: UTF-8 strings (available for all models)
- Images: JPEG, PNG, WebP, and other formats supported by PIL, supplied base64-encoded. Not available for the Qwen text-only models.
- Videos: any format supported by FFmpeg (MP4, AVI, MOV, etc.), supplied base64-encoded. Not available for the Qwen text-only models. A minimal encoding sketch follows this list.
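Since image and video payloads are base64-encoded, a minimal preparation step might look like this (file names are placeholders; the request field that carries the encoded data depends on the API schema and is not shown here):

```bash
# Base64-encode media files for inclusion in a request body.
# -w 0 (GNU coreutils) disables line wrapping; on macOS use `base64 -i <file>`.
base64 -w 0 photo.png > photo.b64
base64 -w 0 clip.mp4 > clip.b64
```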
All models are compatible with the OpenAI embeddings API format.
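As a sketch, a request in the OpenAI embeddings style might look like the following; the `/embeddings` path is an assumption (this section does not document the route), while the port and model ID follow the examples above:

```bash
# Hypothetical OpenAI-style embeddings request; the endpoint path is assumed.
curl http://localhost:9777/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "CLIP/clip-vit-b-16", "input": ["a photo of a cat"]}'
```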
Query available models:
```bash
curl http://localhost:9777/model/list
```

Get current model information:

```bash
curl http://localhost:9777/model/current
```

Inspect modality support for the active model:

```bash
curl http://localhost:9777/model/capabilities
```

- Get Started: Step-by-step deployment instructions
- Quick Reference: Essential commands and configurations
- SDK Usage: Python SDK integration guide
- Overview: Architecture and capabilities overview
- BLIP-2 Transformers Guide: Detailed BLIP-2 implementation guide