Local embeddings server for Apple Silicon using MLX, providing OpenAI-compatible API endpoints.
This project enables running embedding models locally on Apple Silicon (M1/M2/M3) using MLX, Apple's machine learning framework. It provides several key benefits:
- Cost-Effective: No API fees for embeddings - run unlimited inference locally
- Privacy: All data processing happens on your local machine
- Performance: MLX is optimized for Apple Silicon, providing fast inference with efficient memory usage
- OpenAI-Compatible API: Drop-in replacement for OpenAI embeddings API
- LiteLLM Integration: Can be proxied through LiteLLM alongside other providers for unified access
- MLX Format Support: Alternatives like LM Studio don't properly recognize MLX embedding models as embedding types (issue #808), making a dedicated MLX server necessary
Requests flow through the following stack:

```
Your Application
    ↓
LiteLLM (optional proxy)
    ↓
mlx-serve (localhost:8000)
    ↓
MLX Framework (Apple Silicon)
    ↓
Local Embedding Model (Qwen3-Embedding-4B-4bit-DWQ)
```
Requirements:

- Apple Silicon Mac (M1/M2/M3)
- Python 3.12+
- uv for package management
```bash
# Install dependencies
uv sync
```

Start the embeddings server with the default model:
```bash
./start_embeddings.sh
```

Or specify a different model:
```bash
./start_embeddings.sh mlx-community/qwen3-embedding-0.6b-8bit
```

The server will:
- Listen on `http://127.0.0.1:8000`
- Log to `logs/embeddings_server.log`
- Store its PID in `embeddings_server.pid`
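Once it is running, you can smoke-test the endpoint with a plain HTTP request. This is a minimal sketch that assumes the server exposes the standard OpenAI-compatible `/v1/embeddings` route:

```python
import json
import urllib.request

# Minimal request against the OpenAI-compatible embeddings route.
payload = {
    "model": "mlx-community/Qwen3-Embedding-4B-4bit-DWQ",
    "input": "hello world",
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer dummy"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# The response mirrors the OpenAI schema: embedding objects live under "data".
print(len(body["data"][0]["embedding"]), "dimensions")
```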
Stop the server with:

```bash
kill $(cat embeddings_server.pid)
```

Add to your LiteLLM config:
```yaml
model_list:
  - model_name: qwen/qwen3-embedding-4b
    litellm_params:
      model: mlx-community/Qwen3-Embedding-4B-4bit-DWQ
      api_base: http://127.0.0.1:8000/v1
      api_key: dummy  # required but not used
```
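With that entry in place, clients reach the model through LiteLLM under the aliased name. The sketch below assumes the LiteLLM proxy is running on its default port 4000; adjust the address and key to match your proxy setup:

```python
import openai

# Point the OpenAI client at the LiteLLM proxy rather than at mlx-serve directly.
client = openai.OpenAI(
    api_key="dummy",  # replace with your LiteLLM virtual key if one is configured
    base_url="http://127.0.0.1:4000/v1",  # assumed default LiteLLM proxy address
)

response = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",  # the model_name alias from the config above
    input="Your text to embed",
)
print(len(response.data[0].embedding))
```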
You can also call the server directly with the OpenAI Python client:

```python
import openai

client = openai.OpenAI(
api_key="dummy", # required but not used
base_url="http://127.0.0.1:8000/v1"
)
response = client.embeddings.create(
model="mlx-community/Qwen3-Embedding-4B-4bit-DWQ",
input="Your text to embed"
)
embeddings = response.data[0].embedding
```
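The `input` field also accepts a list of strings in the OpenAI API; assuming the local server honors that, you can batch texts and compare the resulting vectors, for example with a plain cosine similarity:

```python
import math

import openai

client = openai.OpenAI(api_key="dummy", base_url="http://127.0.0.1:8000/v1")

texts = ["MLX runs on Apple Silicon", "Embeddings are numeric vectors"]
batch = client.embeddings.create(
    model="mlx-community/Qwen3-Embedding-4B-4bit-DWQ",
    input=texts,  # one embedding is returned per input string
)
vec_a, vec_b = (item.embedding for item in batch.data)

# Plain-Python cosine similarity between the two embedding vectors.
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
print(f"cosine similarity: {dot / norm:.3f}")
```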
The server can be customized by editing `start_embeddings.sh`:

- `MODEL`: The MLX model to use (default: `Qwen3-Embedding-4B-4bit-DWQ`)
- `HOST`: Server host (default: `127.0.0.1`)
- `PORT`: Server port (default: `8000`)
- `--max-concurrency`: Maximum concurrent requests (default: 4)
- `--queue-timeout`: Request queue timeout in seconds (default: 300)
- `--queue-size`: Maximum queue size (default: 100)
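Those limits matter once many requests arrive at the same time. A quick way to exercise them is to send parallel requests from a thread pool; this is only a sketch, assuming the server address and model used above:

```python
from concurrent.futures import ThreadPoolExecutor

import openai

client = openai.OpenAI(api_key="dummy", base_url="http://127.0.0.1:8000/v1")

def embed(text: str) -> int:
    """Request one embedding and return its dimensionality."""
    result = client.embeddings.create(
        model="mlx-community/Qwen3-Embedding-4B-4bit-DWQ",
        input=text,
    )
    return len(result.data[0].embedding)

# Eight parallel requests: with --max-concurrency 4, the extras wait in the queue
# (up to --queue-size entries, each for at most --queue-timeout seconds).
with ThreadPoolExecutor(max_workers=8) as pool:
    dims = list(pool.map(embed, [f"document {i}" for i in range(8)]))
print(dims)
```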
Common MLX embedding models:
- `mlx-community/Qwen3-Embedding-4B-4bit-DWQ` (default) - 4-bit quantized, ~2 GB
- `mlx-community/qwen3-embedding-0.6b-8bit` - 8-bit quantized, smaller and faster
- Other MLX-compatible embedding models from Hugging Face
View server logs:
```bash
tail -f logs/embeddings_server.log
```

Related projects:

- mlx-openai-server - OpenAI-compatible API server for MLX models
- MLX - Apple's machine learning framework
This project configuration is provided as-is for local development use.