Embeddings

vllm-mlx supports text embeddings using mlx-embeddings, providing an OpenAI-compatible /v1/embeddings endpoint.

Installation

pip install "mlx-embeddings>=0.0.5"

Quick Start

Start the server with an embedding model

# Pre-load a specific embedding model at startup
vllm-mlx serve my-llm-model --embedding-model mlx-community/all-MiniLM-L6-v2-4bit

If you omit --embedding-model, the model named in the request is loaded lazily on the first /v1/embeddings request.

Generate embeddings with the OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Single text
response = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input="Hello world"
)
print(response.data[0].embedding[:5])  # First 5 dimensions

# Batch of texts
response = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=[
        "I love machine learning",
        "Deep learning is fascinating",
        "Natural language processing rocks"
    ]
)
for item in response.data:
    print(f"Text {item.index}: {len(item.embedding)} dimensions")
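Once you have the vectors, a common next step is to compare them. The sketch below ranks candidates by cosine similarity to a query using only the standard library. The helper names `cosine_similarity` and `rank_by_similarity` are illustrative (they are not part of vllm-mlx or the OpenAI SDK), and the toy 3-dimensional vectors stand in for real `item.embedding` values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors; avoids division by zero
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print(ranking[0][0])  # index of the closest candidate
```

With a real response, you would build the inputs with something like `[item.embedding for item in response.data]`.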

Using curl

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/all-MiniLM-L6-v2-4bit",
    "input": ["Hello world", "How are you?"]
  }'

Supported Models

Any BERT, XLM-RoBERTa, or ModernBERT model from HuggingFace that is compatible with mlx-embeddings can be used. Some examples:

| Model | Use Case | Size |
|-------|----------|------|
| `mlx-community/all-MiniLM-L6-v2-4bit` | Fast, compact | Small |
| `mlx-community/embeddinggemma-300m-6bit` | High quality | 300M |
| `mlx-community/bge-large-en-v1.5-4bit` | Best for English | Large |

Model Management

Lazy loading

By default, the embedding model is loaded on the first /v1/embeddings request. You can switch models between requests; the previously loaded model is unloaded automatically.
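To make the lazy-loading behavior concrete, here is a minimal sketch of a "single slot" cache: nothing is loaded until the first request, a repeated model name reuses the loaded model, and a new name replaces it. `LazyModelSlot` and its `loader` argument are hypothetical names used for illustration only; this is not vllm-mlx's actual implementation.

```python
class LazyModelSlot:
    """Holds at most one loaded model; loads on first use, swaps on name change."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical callable: model name -> model object
        self._name = None
        self._model = None
        self.load_count = 0    # kept only to make the behavior observable

    def get(self, name):
        if name != self._name:
            # Drop the reference to the previous model before loading the new one.
            self._model = None
            self._model = self._loader(name)
            self._name = name
            self.load_count += 1
        return self._model


# A stand-in loader that returns a string instead of a real model.
slot = LazyModelSlot(loader=lambda name: f"<model {name}>")
slot.get("model-a")  # first request: loads model-a
slot.get("model-a")  # same model: reused, no reload
slot.get("model-b")  # different model: model-a is dropped, model-b is loaded
print(slot.load_count)
```

Under a scheme like this, alternating between two models on every request would reload a model each time, so it is probably worth grouping requests by model where you can.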

Pre-loading at startup

Use --embedding-model to load a model at startup. When this flag is set, only that specific model can be used for embeddings:

vllm-mlx serve my-llm-model --embedding-model mlx-community/all-MiniLM-L6-v2-4bit

Requesting a different model returns a 400 error.
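On the client side you may want to catch that 400 and surface a clearer message. The sketch below does this with the standard library only. The function name `request_embeddings` and the injectable `opener` parameter are assumptions made for this example, and the exact error body the server returns is not specified here, so the code inspects only the status code.

```python
import json
import urllib.error
import urllib.request


def request_embeddings(base_url, model, texts, opener=urllib.request.urlopen):
    """POST to {base_url}/embeddings and turn an HTTP 400 into a ValueError."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(req) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 400:
            # A 400 can have other causes too; a model mismatch is one to check first.
            raise ValueError(
                f"server returned 400 for model {model!r}; if it was started with "
                "--embedding-model, only that model is accepted"
            ) from err
        raise  # other HTTP errors are passed through unchanged
```

A call would look like `request_embeddings("http://localhost:8000/v1", "mlx-community/all-MiniLM-L6-v2-4bit", ["Hello world"])`. The `opener` argument exists only so the function can be exercised without a running server.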

API Reference

POST /v1/embeddings

Create embeddings for the given input text(s).

Request body:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `model` | string | Yes | Model name from HuggingFace |
| `input` | string or list[string] | Yes | Text(s) to embed |

Response:

{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.023, -0.982, ...]},
    {"object": "embedding", "index": 1, "embedding": [0.112, -0.543, ...]}
  ],
  "model": "mlx-community/all-MiniLM-L6-v2-4bit",
  "usage": {"prompt_tokens": 12, "total_tokens": 12}
}
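If you call the endpoint without the OpenAI SDK, you will be handling this JSON yourself. The sketch below pulls the vectors out of a parsed response and orders them by `index` so they line up with the input texts. `vectors_in_order` is an illustrative helper name, and the sample vectors are shortened to two values for readability.

```python
def vectors_in_order(response):
    """Return the embeddings from a parsed response, sorted by their `index` field."""
    if response.get("object") != "list":
        raise ValueError("unexpected response object")
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Sample payload shaped like the response above (entries deliberately out of order).
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.112, -0.543]},
        {"object": "embedding", "index": 0, "embedding": [0.023, -0.982]},
    ],
    "model": "mlx-community/all-MiniLM-L6-v2-4bit",
    "usage": {"prompt_tokens": 12, "total_tokens": 12},
}
vectors = vectors_in_order(sample)
print(len(vectors), sample["usage"]["total_tokens"])
```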

Python API

Direct usage without server

from vllm_mlx.embedding import EmbeddingEngine

engine = EmbeddingEngine("mlx-community/all-MiniLM-L6-v2-4bit")
engine.load()

vectors = engine.embed(["Hello world", "How are you?"])
print(f"Dimensions: {len(vectors[0])}")

tokens = engine.count_tokens(["Hello world"])
print(f"Token count: {tokens}")
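For a large corpus you may prefer to embed in fixed-size chunks rather than in one call. The helper below is a sketch under two assumptions: that `engine.embed` accepts a list of strings and returns one vector per text in the same order (as the example above suggests), and that a batch size of 32 is a reasonable starting point, which is an arbitrary choice rather than a documented recommendation. `embed_in_batches` and `fake_embed` are hypothetical names; `fake_embed` stands in for `engine.embed` so the sketch runs without a model.

```python
def embed_in_batches(embed, texts, batch_size=32):
    """Call `embed` on consecutive slices of `texts` and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed(texts[start:start + batch_size]))
    return vectors


# Stand-in for engine.embed: maps each text to a one-element vector holding its length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]


result = embed_in_batches(fake_embed, ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2)
print(result)
```

With a loaded engine, the call would be `embed_in_batches(engine.embed, texts)`.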

Troubleshooting

mlx-embeddings not installed

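Install the package, quoting the version specifier so the shell does not treat > as an output redirect:

pip install "mlx-embeddings>=0.0.5"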

Model not found

Make sure the model name matches a HuggingFace repository compatible with mlx-embeddings. You can pre-download models:

huggingface-cli download mlx-community/all-MiniLM-L6-v2-4bit
