MLX Serve Embeddings

Local embeddings server for Apple Silicon using MLX, providing OpenAI-compatible API endpoints.

Why This Setup?

This project enables running embedding models locally on Apple Silicon (M1/M2/M3) using MLX, Apple's machine learning framework. It provides several key benefits:

  • Cost-Effective: No API fees for embeddings - run unlimited inference locally
  • Privacy: All data processing happens on your local machine
  • Performance: MLX is optimized for Apple Silicon, providing fast inference with efficient memory usage
  • OpenAI-Compatible API: Drop-in replacement for OpenAI embeddings API
  • LiteLLM Integration: Can be proxied through LiteLLM alongside other providers for unified access
  • MLX Format Support: Alternatives like LM Studio don't properly recognize MLX embedding models as embedding types (issue #808), making a dedicated MLX server necessary

Architecture

Your Application
        ↓
    LiteLLM (optional proxy)
        ↓
mlx-serve (localhost:8000)
        ↓
MLX Framework (Apple Silicon)
        ↓
Local Embedding Model (Qwen3-Embedding-4B-4bit-DWQ)

Quick Start

Prerequisites

  • Apple Silicon Mac (M1/M2/M3)
  • Python 3.12+
  • uv for package management
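
If uv isn't installed yet, Homebrew is one common way to get it on macOS; this is a sketch, and the official standalone installer from Astral works just as well:

# Install uv via Homebrew (one of several supported install methods)
brew install uv

# uv can also provision the required Python version if your system one is older
uv python install 3.12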

Installation
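
If you don't have a local checkout yet, clone the repository first (HTTPS URL inferred from the repository name; substitute your preferred remote):

git clone https://github.com/aperepel/mlx-serve-embeddings.git
cd mlx-serve-embeddings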

# Install dependencies
uv sync

Running the Server

Start the embeddings server with the default model:

./start_embeddings.sh

Or specify a different model:

./start_embeddings.sh mlx-community/qwen3-embedding-0.6b-8bit

The server will:

  • Listen on http://127.0.0.1:8000
  • Log to logs/embeddings_server.log
  • Store its PID in embeddings_server.pid
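
Once it's up, a quick smoke test against the OpenAI-compatible embeddings route confirms everything is wired correctly (model name matches the default above):

# Request a single embedding and truncate the long JSON response for readability
curl -s http://127.0.0.1:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-Embedding-4B-4bit-DWQ", "input": "hello"}' | head -c 300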

Stopping the Server

kill $(cat embeddings_server.pid)
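
A slightly more defensive variant, for cases where the PID file is stale, might look like this (sketch):

# Signal the server only if the PID file exists and the process is still alive, then clean up
if [ -f embeddings_server.pid ] && kill -0 "$(cat embeddings_server.pid)" 2>/dev/null; then
  kill "$(cat embeddings_server.pid)"
fi
rm -f embeddings_server.pid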

Usage with LiteLLM

Add to your LiteLLM config:

model_list:
  - model_name: qwen/qwen3-embedding-4b
    litellm_params:
      model: mlx-community/Qwen3-Embedding-4B-4bit-DWQ
      api_base: http://127.0.0.1:8000/v1
      api_key: dummy  # required but not used
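
With that entry in place, clients talk to the proxy using the model_name alias instead of the MLX server directly. A minimal check, assuming the LiteLLM proxy is running on its default port 4000 (add an Authorization header if your proxy is configured with a master key):

# Request an embedding through the LiteLLM proxy rather than hitting port 8000 directly
curl -s http://127.0.0.1:4000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen/qwen3-embedding-4b", "input": "hello"}'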

Direct API Usage

import openai

client = openai.OpenAI(
    api_key="dummy",  # required but not used
    base_url="http://127.0.0.1:8000/v1"
)

response = client.embeddings.create(
    model="mlx-community/Qwen3-Embedding-4B-4bit-DWQ",
    input="Your text to embed"
)

embeddings = response.data[0].embedding

Configuration

The server can be customized by editing start_embeddings.sh (see the example after this list):

  • MODEL: The MLX model to use (default: Qwen3-Embedding-4B-4bit-DWQ)
  • HOST: Server host (default: 127.0.0.1)
  • PORT: Server port (default: 8000)
  • --max-concurrency: Maximum concurrent requests (default: 4)
  • --queue-timeout: Request queue timeout in seconds (default: 300)
  • --queue-size: Maximum queue size (default: 100)
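
For example, switching to the smaller model on a different port would mean edits along these lines; the variable names follow the list above, so check them against the actual script:

# Illustrative edits inside start_embeddings.sh
MODEL="mlx-community/qwen3-embedding-0.6b-8bit"
HOST="127.0.0.1"
PORT=8001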

Available Models

Common MLX embedding models:

  • mlx-community/Qwen3-Embedding-4B-4bit-DWQ (default) - 4-bit quantized, ~2GB
  • mlx-community/qwen3-embedding-0.6b-8bit - 8-bit quantized, smaller/faster
  • Other MLX-compatible embedding models from Hugging Face
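
Models are fetched from the Hugging Face Hub; if you'd rather pre-download one than wait on the first request, the huggingface_hub CLI can populate the local cache (sketch, assuming the CLI is installed):

# Pre-fetch the default model into the local Hugging Face cache
huggingface-cli download mlx-community/Qwen3-Embedding-4B-4bit-DWQ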

Logs

View server logs:

tail -f logs/embeddings_server.log
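
To narrow the output to problems only, a simple filter over the same file works:

# Show only lines that look like errors or warnings
grep -iE "error|warn" logs/embeddings_server.log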

Dependencies

  • mlx-openai-server - OpenAI-compatible API server for MLX models
  • MLX - Apple's machine learning framework

License

This project configuration is provided as-is for local development use.
