Rapid-MLX

Run AI on your Mac. Faster than anything else.

Run local AI models on your Mac — no cloud, no API costs. Works with Cursor, Claude Code, and any OpenAI-compatible app.

pip install → serve Gemma 4 26B → chat + tool calling → works with PydanticAI, LangChain, Aider, and more.

Your Mac	Model	Speed (tok/s)	Notes
16 GB MacBook Air	Qwen3.5-4B	160 tok/s	Chat, coding, tools
32+ GB Mac Mini / Studio	Nemotron-Nano 30B	141 tok/s	🆕 Fastest 30B, 100% tools
32+ GB Mac Mini / Studio	Qwen3.6-35B	95 tok/s	256 experts, 262K context
64 GB Mac Mini / Studio	Qwen3.5-35B	83 tok/s	Best balance of smart + fast
96+ GB Mac Studio / Pro	Qwen3.5-122B	57 tok/s	Frontier-level intelligence
128+ GB Mac Studio Ultra	🆕 DeepSeek V4 Flash 158B-A13B	31-56 tok/s	Day-0 frontier MoE, 1M context

New to local AI? Quick glossary

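tok/s (tokens per second) — how fast the AI generates text. A token is a word or word fragment (roughly three-quarters of an English word on average), so tok/s runs a little higher than words per second. Higher = faster.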
4bit / 8bit — compression levels for models. 4bit uses less memory (recommended); 8bit is higher quality.
TTFT (Time To First Token) — how long before the AI starts responding.
Tool calling — the AI can call functions in your code. Used by Cursor, Claude Code, and coding assistants.
OpenAI API compatible — Rapid-MLX speaks the same language as ChatGPT's API, so any app that works with ChatGPT can work with Rapid-MLX by just changing the server address.
Ollama / llama.cpp — other popular tools for running local AI. In the benchmarks below, Rapid-MLX runs 1.3-3.2x faster than Ollama on Apple Silicon, depending on the model.

Quick Start

Step 1 — Install (pick one):

# Homebrew (recommended — just works, no Python version issues)
brew install raullenchai/rapid-mlx/rapid-mlx

# pip (requires Python 3.10+ — macOS ships 3.9, so install Python first if needed)
pip install rapid-mlx

# Or one-liner with auto-setup (installs Python if needed)
curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash

Vision/multimodal models (Gemma 4, Qwen-VL, etc.) need extras: pip install 'rapid-mlx[vision]'. Text-only install is ~460 MB; vision adds ~322 MB. See Optional Extras for the full list.

"No matching distribution" error? Your Python is too old. Run python3 --version — if it says 3.9, install a newer Python: brew install python@3.12 then python3.12 -m pip install rapid-mlx

Tapping homebrew/core / Operation not permitted during brew install? Brew 5.x's install sandbox can't auto-tap homebrew/core mid-install. Pre-tap it once, then retry:
brew tap homebrew/core --force   # ~1.3 GB, one-time
brew install raullenchai/rapid-mlx/rapid-mlx

Step 2 — Serve a model:

rapid-mlx serve qwen3.5-4b

First run downloads the model (~2.5 GB) — you'll see a progress bar. Wait for Ready: http://localhost:8000/v1.

Want vision? pip install 'rapid-mlx[vision]' then rapid-mlx serve gemma-4-26b (~14 GB).

Step 3 — Chat (open a second terminal tab):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hello"}]}'

That's it — you now have an OpenAI-compatible AI server on localhost:8000. Point any app at http://localhost:8000/v1 and it just works.

Want a Claude Code-like TUI? Rapid-MLX is the backend — pair it with an open-source agent CLI like OpenCode or codex for the full slash-commands / tool-use / multi-turn experience. Run rapid-mlx agents opencode --setup (or codex --setup) to wire it up automatically.

Tip: Run rapid-mlx models to see all available model aliases. For a larger, more capable model, try rapid-mlx serve qwen3.5-9b (~5 GB).

More install options

From source (for development):

git clone https://github.com/raullenchai/Rapid-MLX.git
cd Rapid-MLX && pip install -e .

Vision models (adds mlx-vlm + opencv + torch, ~322 MB extra):

pip install 'rapid-mlx[vision]'

Audio (TTS/STT via mlx-audio):

pip install 'rapid-mlx[audio]'

Try it with Python (make sure the server is running, then pip install openai):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # any value works, no real key needed

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)

Works With

Agent Harnesses (MHI-tested)

Harness	Type	Notes
Hermes Agent	Agent	62 tools, multi-turn (test)
PydanticAI	Framework	Typed agents, structured output (test)
LangChain	Framework	`ChatOpenAI`, tools, streaming (test)
smolagents	Framework	CodeAgent + ToolCallingAgent (test)
OpenClaude (Anthropic SDK)	Agent	`CLAUDE_CODE_USE_OPENAI=1` (test)
Aider	Agent	CLI edit-and-commit, architect mode (test)
Goose	Agent	Ollama provider via `OLLAMA_HOST`
OpenCode	TUI Agent	Claude Code-like terminal UX, OpenAI-compat provider
Claw Code	Agent	OpenAI & Anthropic endpoints

UI / IDE Clients

Client	Status	Setup
Cursor	Compatible	Settings → OpenAI Base URL
Continue.dev	Compatible	VS Code / JetBrains extension
LibreChat	Tested	Docker (test)
Open WebUI	Tested	Docker (test)
Any OpenAI-compatible app	Compatible	Point at `http://localhost:8000/v1`

Model-Harness Index (MHI)

MHI measures how well a model works with a specific agent harness. It combines three dimensions:

Dimension	Weight	What it measures	Source
Tool Calling	50%	Can the model+harness execute function calls correctly?	`rapid-mlx agents --test`
HumanEval	30%	Can the model generate correct code?	HumanEval (10 tasks)
MMLU	20%	Does the harness degrade base knowledge?	tinyMMLU (10 tasks)

MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU (scale 0-100)

Model	Best MHI	Best Harness	Tool Calling
Qwopus 27B	92	All (Hermes, PydanticAI, LangChain, smolagents)	100%
Qwen3.5 27B	82	Hermes / PydanticAI / LangChain	100%
Llama 3.3 70B	83	smolagents (text-based)	100%
Nemotron Nano 30B	59	PydanticAI / LangChain	91-93%
Gemma 4 26B	62	Hermes / smolagents	100%

Full MHI table (25 model-harness combinations) + methodology

Run rapid-mlx agents to see all supported agents and python3 scripts/mhi_eval.py to compute MHI on your own setup.

Model + Harness	Tool Calling	HumanEval	MMLU	MHI
Qwopus 27B + Hermes	100%	80%	90%	92
Qwopus 27B + PydanticAI	100%	80%	90%	92
Qwen3.5 27B + Hermes	100%	40%	100%	82
Llama 3.3 70B + smolagents	100%	50%	90%	83
DeepSeek-R1 32B + smolagents	100%	30%	100%	79
Gemma 4 26B + Hermes	100%	0%	60%	62
Nemotron Nano 30B + PydanticAI	93%	0%	60%	59
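
The weighting can be checked against the rows above with a few lines of Python. This is a minimal sketch rather than the logic in scripts/mhi_eval.py: the function name `mhi` and the round-half-up rule are assumptions, chosen because they reproduce the published scores.

```python
def mhi(tool_calling: int, humaneval: int, mmlu: int) -> int:
    """Combine three percentage scores (0-100) into one MHI score.

    Weights follow the MHI formula: 50% tool calling, 30% HumanEval,
    20% MMLU. Integer arithmetic avoids float rounding surprises, and the
    +50 before the floor division rounds halves up (an assumption).
    """
    weighted = 50 * tool_calling + 30 * humaneval + 20 * mmlu
    return (weighted + 50) // 100


# Scores copied from the table: (tool calling %, HumanEval %, MMLU %).
rows = {
    "Qwopus 27B + Hermes": (100, 80, 90),
    "Llama 3.3 70B + smolagents": (100, 50, 90),
    "Nemotron Nano 30B + PydanticAI": (93, 0, 60),
}
for name, scores in rows.items():
    print(f"{name}: MHI {mhi(*scores)}")
```

Because tool calling carries half the weight, a model with perfect tool calling scores at least 50 even when it fails every HumanEval task.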

Quick setup for popular apps:

Cursor: Settings → Models → Add Model:

OpenAI API Base:  http://localhost:8000/v1
API Key:          not-needed
Model name:       default          (or qwen3.5-9b — either works)

Cursor's agent/composer mode uses tool calls automatically — Rapid-MLX handles them natively with Qwen3.5 models, no extra flags needed.

Claw Code:

export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
claw --model "openai/default" prompt "summarize this repo"

OpenClaude:

CLAUDE_CODE_USE_OPENAI=1 OPENAI_BASE_URL=http://localhost:8000/v1 \
OPENAI_API_KEY=not-needed OPENAI_MODEL=default openclaude -p "hello"

Hermes Agent (~/.hermes/config.yaml):

model:
  provider: "custom"
  default: "default"
  base_url: "http://localhost:8000/v1"
  context_length: 32768

Goose:

GOOSE_PROVIDER=ollama OLLAMA_HOST=http://localhost:8000 \
GOOSE_MODEL=default goose run --text "hello"

Claude Code:

# Claude Code reads ANTHROPIC_BASE_URL; requests go to the /v1/messages route
ANTHROPIC_BASE_URL=http://localhost:8000 claude

More client setup instructions

Continue.dev (~/.continue/config.yaml):

models:
  - name: rapid-mlx
    provider: openai
    model: default
    apiBase: http://localhost:8000/v1
    apiKey: not-needed

Aider:

aider --openai-api-base http://localhost:8000/v1 --openai-api-key not-needed

Swival (~/.swival/config.toml):

[profiles.rapidmlx]
provider = "generic"
base_url = "http://127.0.0.1:8000"
model = "default"

Run with:

swival --profile rapidmlx "summarize this repo"

Open WebUI (Docker one-liner):

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e ENABLE_OLLAMA_API=False \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=not-needed \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

OpenCode (opencode.json in your project root):

{
  "provider": {
    "openai": {
      "api": "http://localhost:8000/v1",
      "models": {
        "default": {
          "name": "rapid-mlx local",
          "limit": { "context": 32768, "output": 8192 }
        }
      },
      "options": { "apiKey": "not-needed" }
    }
  }
}

PydanticAI (pip install pydantic-ai):

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIChatModel(
    model_name="default",
    provider=OpenAIProvider(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",
    ),
)
agent = Agent(model)
print(agent.run_sync("What is 2+2?").output)

smolagents (pip install smolagents):

from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(
    model_id="default",
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
)
agent = CodeAgent(tools=[], model=model)
agent.run("What is 5 multiplied by 7?")

LibreChat (librechat.yaml, under endpoints.custom):

- name: "Rapid-MLX"
  apiKey: "rapid-mlx"
  baseURL: "http://localhost:8000/v1/"
  models:
    default: ["default"]
    fetch: true
  titleConvo: true
  titleModel: "current_model"
  modelDisplayLabel: "Rapid-MLX"

Anthropic SDK (pip install anthropic):

from anthropic import Anthropic
client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="default",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Say hello"}],
)
print(message.content[0].text)

Choose Your Model

What fits my Mac?

The model has to fit in your Mac's RAM. If your Mac slows down or Activity Monitor shows red memory pressure, pick a smaller model from the table below.

Your Mac	Best Model	RAM Used	Speed	Quality
16 GB MacBook Air/Pro	Qwen3.5-4B 4bit	2.4 GB	160 tok/s	Good for chat and simple tasks
24 GB MacBook Pro	Qwen3.5-9B 4bit	5.1 GB	108 tok/s	Great all-rounder
32 GB Mac Mini / Studio	Qwen3.5-27B 4bit	15.3 GB	39 tok/s	Solid coding model
32 GB Mac Mini / Studio	🆕 Nemotron-Nano 30B 4bit	18 GB	141 tok/s	Fastest 30B, 100% tool calling
32 GB Mac Mini / Studio	Qwen3.6-35B-A3B 4bit	20 GB	95 tok/s	256 MoE experts, 262K context
36 GB MacBook Pro M3/M4 Pro	Qwen3.5-27B 4bit	15.3 GB	39 tok/s	Same as 32 GB — extra headroom for long contexts
48 GB Mac Mini / Studio	Qwen3.5-35B-A3B 8bit	37 GB	83 tok/s	Sweet spot — smart + fast
64 GB Mac Mini / Studio	Qwen3.5-35B-A3B 8bit	37 GB	83 tok/s	Same model, more room for KV cache
96 GB Mac Studio / Pro	Qwen3.5-122B mxfp4	65 GB	57 tok/s	Best model, fits comfortably
128 GB Mac Studio / Pro	🆕 DeepSeek V4 Flash 2-bit DQ	91 GB	56 tok/s	158B-A13B frontier MoE, day-0 (chat only)
192 GB Mac Studio / Pro	Qwen3.5-122B 8bit	130 GB	44 tok/s	Maximum quality
256 GB Mac Studio Ultra	🆕 DeepSeek V4 Flash 8-bit	136 GB	31 tok/s	158B-A13B frontier MoE, 1M context (chat only)

4bit vs 8bit: 4bit models are compressed to use less memory (recommended for most users). 8bit models are higher quality but need more RAM. "mxfp4" is a high-quality 4bit format.
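
As a rough rule of thumb, weight memory is parameter count times bits per weight divided by eight. The sketch below applies that arithmetic. The helper name `weight_gb` is hypothetical, and the result is only a lower bound: quantization metadata, any layers kept at higher precision, and the KV cache all add to it, which is consistent with the table's RAM figures being somewhat higher.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Lower-bound estimate of weight memory in GB (10^9 bytes).

    params_billion * 1e9 weights, each taking bits / 8 bytes, divided by
    1e9 bytes per GB simplifies to params_billion * bits / 8.
    """
    return params_billion * bits_per_weight / 8


for name, params, bits in [
    ("Qwen3.5-4B 4bit", 4, 4),
    ("Qwen3.5-27B 4bit", 27, 4),
    ("Qwen3.5-35B-A3B 8bit", 35, 8),
]:
    print(f"{name}: at least {weight_gb(params, bits):.1f} GB of weights")
```

Leave headroom beyond this estimate for macOS itself and for long contexts.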

Full model lineup

58 short aliases across 21 families ship in v0.6.37. Run rapid-mlx models for the live list with quant tier, MoE / hybrid flags, and DFlash eligibility.

Show all 58 aliases by family

Family	Aliases	Notable
Qwen3.5	`qwen3.5-4b`, `-9b`, `-27b`, `-27b-8bit` ✨, `-35b`, `-35b-4bit`, `-122b`, `-122b-8bit`	DeltaNet hybrid; 27b-8bit = DFlash
Qwen3.6	`qwen3.6-27b`, `-27b-8bit` ✨, `-27b-ud`, `-35b`, `-35b-6bit`, `-35b-8bit`, `-35b-dwq`, `-35b-ud`	262K ctx, 256 MoE experts; 27b-8bit = DFlash
Qwen3	`qwen3-coder`, `qwen3-coder-30b`, `qwen3-vl-4b`, `-8b`, `-30b`	Coding + vision
Qwopus	`qwopus-9b`, `qwopus-27b`, `qwopus-27b-8bit`	92 MHI on tool calling
DeepSeek	`deepseek-r1-8b`, `-32b`, `deepseek-v4-flash` (2/4/8-bit)	R1 reasoning + V4 Flash 158B-A13B day-0
Gemma	`gemma-3n-e4b`, `gemma-4-26b`, `-31b`, `gemma3-1b`, `-12b`, `-27b`	Vision-capable (gemma-4)
Llama / Hermes	`llama3-1b`, `-3b`, `hermes3-8b`, `hermes4-70b`
GLM	`glm4.5-air`, `glm4.7-9b`
GPT-OSS	`gpt-oss-20b`	Harmony native
MiniMax / Kimi	`minimax-m2.5`, `kimi-48b`, `kimi-k2.5`
Mistral / Devstral	`mistral-24b`, `devstral-24b`, `devstral-v2-24b`, `ministral-3b`
Other	`phi4-14b`, `smollm3-3b`, `nemotron-30b` / `-nano`, `bonsai-1.7b/4b/8b`, `granite4-tiny`

✨ = DFlash speculative decoding enabled by default. rapid-mlx info <alias> shows per-alias capabilities.

Copy-paste commands

Pick the one that matches your Mac. Short aliases work — run rapid-mlx models to see all available models.

# 16 GB — lightweight, fast
rapid-mlx serve qwen3.5-4b --port 8000

# 24 GB — best small model
rapid-mlx serve qwen3.5-9b --port 8000

# 32 GB — solid coding model
rapid-mlx serve qwen3.5-27b --port 8000

# 32 GB — Nemotron Nano (fastest 30B, 141 tok/s, NVIDIA MoE)
rapid-mlx serve nemotron-30b --port 8000

# 32+ GB — Qwen 3.6 (256 experts, 262K context)
rapid-mlx serve qwen3.6-35b --port 8000

# 64 GB — sweet spot
rapid-mlx serve qwen3.5-35b --prefill-step-size 8192 --port 8000  # faster first response

# 96+ GB — best model
rapid-mlx serve qwen3.5-122b --prefill-step-size 8192 --port 8000

# Coding agent — fast MoE, great for Claude Code / Cursor
rapid-mlx serve qwen3-coder --prefill-step-size 8192 --port 8000  # MoE = only uses part of the model, so it's fast

# Vision — image understanding (see note below)
rapid-mlx serve qwen3-vl-4b --mllm --port 8000

Vision deps: Install into the same environment where rapid-mlx lives:

install.sh users: ~/.rapid-mlx/bin/pip install 'rapid-mlx[vision]'

pip users: pip install 'rapid-mlx[vision]' (in the same venv)

brew users: $(brew --prefix)/opt/rapid-mlx/libexec/bin/pip install 'rapid-mlx[vision]'

Parser auto-detection & manual overrides

Parsers are auto-detected from the model name — you don't need to specify --tool-call-parser or --reasoning-parser for supported families. Explicit flags always override auto-detection.

Model Family	Auto-detected `--tool-call-parser`	Auto-detected `--reasoning-parser`	Notes
Qwen3.5 (all sizes)	`hermes`	`qwen3`	Recommended — 100% tool calling
🆕 Qwen3.6	`qwen3_coder_xml`	`qwen3`	XML tool format, 262K context
Qwen3-Coder-Next	`hermes`	(none)	Fast coding, non-thinking mode
DeepSeek R1-0528 / V3.1	`deepseek_v31`	`deepseek_r1`	Dedicated V3.1 parser
DeepSeek R1 (older)	`deepseek`	`deepseek_r1`	With reasoning
DeepSeek V3 / V2.5	`deepseek`	(none)	No reasoning parser
GLM-4.7	`glm47`	(none)	100% tool calling
MiniMax-M2.5	`minimax`	`minimax`	XML tool format
GPT-OSS	`harmony`	`harmony`	Native format
Kimi-Linear	`kimi`	(none)	Kimi tool format
Llama 3.x	`llama`	(none)	JSON tool format
Mistral / Devstral	`hermes`	(none)	Hermes-compatible
Gemma	`hermes`	(none)	Hermes-compatible
Phi-3/4	`hermes`	(none)	Hermes-compatible
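
Name-based detection with explicit overrides can be pictured as an ordered lookup. The sketch below is illustrative only: the substring patterns, the `detect_parsers` helper, and the match order are assumptions, and only a few rows of the table are encoded. The real matching rules live in Rapid-MLX itself.

```python
from typing import Optional, Tuple

# (substring of the lower-cased model name, tool parser, reasoning parser).
# Parser names are taken from the table above; the patterns are guesses.
RULES = [
    ("qwen3.6", "qwen3_coder_xml", "qwen3"),
    ("qwen3.5", "hermes", "qwen3"),
    ("gpt-oss", "harmony", "harmony"),
    ("glm-4.7", "glm47", None),
    ("llama-3", "llama", None),
]


def detect_parsers(
    model_name: str,
    tool_call_parser: Optional[str] = None,
    reasoning_parser: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (tool parser, reasoning parser); explicit arguments win."""
    detected_tool, detected_reasoning = None, None
    name = model_name.lower()
    for pattern, tool, reasoning in RULES:
        if pattern in name:
            detected_tool, detected_reasoning = tool, reasoning
            break
    return (tool_call_parser or detected_tool,
            reasoning_parser or detected_reasoning)


print(detect_parsers("mlx-community/Qwen3.5-4B-MLX-4bit"))
print(detect_parsers("qwen3.6-35b", tool_call_parser="hermes"))
```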

All 17 parsers include automatic recovery — if a quantized model outputs broken tool calls as text, they're auto-converted back to structured format.

Benchmarks

Tested on a Mac Studio M3 Ultra (256 GB). Rapid-MLX uses Apple's MLX framework, which is purpose-built for unified memory and ships native Metal compute kernels; this is why it beats C++-based engines (Ollama, llama.cpp) on most models. Ollama numbers were measured with v0.20.4 (the latest release at the time of testing, with its MLX backend).

Model	Rapid-MLX	Best Alternative	Speedup
Phi-4 Mini 14B	180 tok/s	77 (mlx-lm) / 56 (Ollama)	2.3x / 3.2x
Qwen3.5-4B	160 tok/s	155 (mlx-lm serve)	1.0x
Nemotron-Nano 30B	141 tok/s · 100% tools	—	—
🆕 DeepSeek V4 Flash 158B-A13B (2-bit DQ)	56 tok/s	— (only MLX engine, day-0)	—
🆕 DeepSeek V4 Flash 158B-A13B (8-bit)	31 tok/s	— (only MLX engine, day-0)	—
GPT-OSS 20B	127 tok/s · 100% tools	79 (mlx-lm serve)	1.6x
Qwen3.5-9B	108 tok/s	41 (Ollama)	2.6x
Qwen3.6-35B-A3B	95 tok/s · 100% tools	—	—
Kimi-Linear-48B	94 tok/s · 100% tools	— (only engine)	—
Gemma 4 26B-A4B	85 tok/s	68 (Ollama)	1.3x
Gemma 4 E4B	83 tok/s	—	—
Qwen3.5-35B-A3B	83 tok/s · 100% tools	75 (oMLX)	1.1x
Qwen3-Coder 80B	74 tok/s · 100% tools	69 (mlx-lm serve)	1.1x
Qwen3.5-122B	44 tok/s · 100% tools	43 (mlx-lm serve)	~1.0x
Gemma 4 31B	31 tok/s	—	—

Full benchmark data with all models, TTFT tables, DeltaNet snapshots, and engine comparison below.

TTFT — Prompt Cache Advantage

Prompt cache keeps multi-turn conversations fast. For standard transformers, KV cache trimming gives sub-100ms TTFT. For hybrid RNN models (Qwen3.5 DeltaNet), we use state snapshots — the first technique to bring prompt cache to non-trimmable architectures on MLX.

Pure KV cache (transformers):

Model	Rapid-MLX (cached)	mlx-lm serve	Speedup
Kimi-Linear-48B	0.08s	—	—
Llama 3.2 3B	0.10s	—	—
Hermes-3-Llama 8B	0.10s	0.18s	1.8x
Phi-4 Mini 14B	0.13s	0.15s	1.2x
Devstral-Small-2 24B	0.13s	0.38s	2.9x
Mistral Small 24B	0.13s	0.38s	2.9x
GLM-4.7-Flash 9B	0.13s	0.23s	1.8x
GLM-4.5-Air	0.14s	0.47s	3.4x
Qwen3-Coder-Next 80B	0.16s	0.27s	1.7x
GPT-OSS 20B	0.16s	0.27s	1.7x
Qwen3.5-9B	0.22s	0.26s	1.2x
Gemma 4 E4B	0.25s	— (day-0)	—
Gemma 4 26B-A4B	0.25s	— (day-0)	—
Gemma 4 31B	0.34s	0.57s (mlx-vlm bf16)	1.7x

DeltaNet state snapshots (hybrid RNN + attention):

Qwen3.5 uses Gated DeltaNet (75% RNN) + full attention (25% KV). Other engines recreate the entire cache from scratch every request — we snapshot the RNN state at the system prompt boundary, restoring in ~0.1ms instead of re-running hundreds of tokens through the recurrent layers.

Model	Cold TTFT	Snapshot TTFT	Speedup
Qwen3-Coder-Next 6bit (48L)	0.66s	0.16s	4.3x
Qwen3.5-35B-A3B 8bit (40L)	0.49s	0.19s	2.6x
Qwen3.5-27B 4bit (40L)	0.58s	0.27s	2.1x
Qwen3.5-9B 4bit (40L)	0.27s	0.22s	1.2x
Qwen3.5-4B 4bit (32L)	0.24s	0.16s	1.5x
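
The idea behind state snapshots can be shown with a toy recurrent layer. This is a conceptual sketch, not Rapid-MLX's implementation: `ToyRecurrentLayer` and its update rule are invented for illustration. The point is that a recurrent state cannot be trimmed back to a prefix the way a KV cache can, so the state is deep-copied at the system-prompt boundary and restored on later requests.

```python
import copy


class ToyRecurrentLayer:
    """Stand-in for a recurrent layer: the state folds in every token seen."""

    def __init__(self) -> None:
        self.state = [0.0, 0.0]

    def step(self, token_id: int) -> None:
        # Arbitrary update rule; a real layer applies learned weights.
        self.state = [0.5 * self.state[0] + token_id,
                      0.25 * self.state[1] - token_id]

    def run(self, token_ids: list) -> None:
        for token_id in token_ids:
            self.step(token_id)


system_prompt = [11, 42, 7, 19]  # shared prefix across requests
user_turn = [3, 5]

# First request: process the prefix once and snapshot the state.
layer = ToyRecurrentLayer()
layer.run(system_prompt)
snapshot = copy.deepcopy(layer.state)

# Later request: restore the snapshot and process only the new tokens.
restored = ToyRecurrentLayer()
restored.state = copy.deepcopy(snapshot)
restored.run(user_turn)

# Reference: recompute everything from scratch.
cold = ToyRecurrentLayer()
cold.run(system_prompt + user_turn)

print("restored state matches cold recompute:", restored.state == cold.state)
```

The restored path touches two tokens instead of six, and the saving grows with the length of the shared prefix.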

Capability Comparison

Feature	Rapid-MLX	oMLX	Ollama	llama.cpp	mlx-lm serve
Tool calling	100% (Qwen/GLM/GPT-OSS/Kimi)	N/A	100% (Qwen)	80% (Phi-4)	N/A
Tool call recovery	100%	N/A	100%	100%	N/A
Tool injection fallback	Yes	No	No	No	No
Think-tag leak	0%	N/A	0%	0%	N/A
Prompt cache	KV + DeltaNet	No	No	No	No
Vision	Yes	Yes	Yes	No	No
Audio (STT/TTS)	Yes	No	No	No	No
17 tool parsers	Yes	No	No	No	No
Cloud routing	Yes	No	No	No	No
Streaming	Yes	Yes	Yes	Yes	Yes
OpenAI API	Yes	Yes	Yes	Yes	Yes

Optimization Techniques Per Model

Technique	What it does	Models
KV prompt cache	Trim KV cache to common prefix, skip re-prefill	All transformer models
DeltaNet state snapshots	Deep-copy RNN state at prefix boundary, restore in ~0.1ms	Qwen3.5 (4B, 9B, 27B, 35B, 122B), Qwen3-Coder-Next
Hybrid cache sync	Keep trimmable KV + non-trimmable RNN layers in sync	Qwen3.5 (Gated DeltaNet + attention)
Tool logits bias	Jump-forward decoding — bias logits toward structured tokens	All models with `--enable-tool-logits-bias`
Auto tool recovery	Detect broken text-format tool calls, convert to structured	All 17 parser formats (incl. Gemma 4)
TurboQuant V-cache	Rotate + Lloyd-Max compress V cache (86% savings on dense models)	All models with `--kv-cache-turboquant`
KV cache quantization	Quantize prefix cache entries to reduce memory	All models with `--kv-cache-quantization`
DFlash speculative decoding	Block-diffusion drafter, parallel draft + verify	`qwen3.5-27b-8bit`, `qwen3.6-27b-8bit` (single-user)
SuffixDecoding	Drafter-free, statistical n-gram lookup speculative decoding	All BatchedEngine models with `--suffix-decoding`
Prefill chunking	Configurable step size for large-prompt throughput	All models
Cloud routing	Offload high-token requests to cloud LLM when local is slow	All models with `--cloud-model`

Eval benchmarks (20 models, 4 suites)

Tool calling (30 scenarios), coding (HumanEval+), reasoning (MATH-500), general knowledge (MMLU-Pro). Top models:

Model	Decode	Tools	Code	Reason	General	Avg
Qwen3.5-122B 8bit	44 tok/s	87%	90%	90%	90%	89%
Qwen3.5-35B 8bit	83 tok/s	90%	90%	80%	80%	85%
Qwen3-Coder-Next 4bit	74 tok/s	90%	90%	70%	70%	80%
Qwen3.5-27B 4bit	39 tok/s	83%	90%	50%	80%	76%
Qwen3.5-9B 4bit	108 tok/s	83%	70%	60%	70%	71%

Run your own: python scripts/benchmark_engines.py --engine rapid-mlx ollama --runs 3

Features

Tool Calling

Full OpenAI-compatible tool calling with 17 parser formats and automatic recovery when quantized models break the format. 4-bit models tend to degrade after several tool-calling rounds and start emitting tool calls as plain text; Rapid-MLX detects the broken output and converts it back to structured tool_calls.
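
Recovery of this kind can be sketched for one format. The example assumes Hermes-style `<tool_call>` tags wrapping a JSON object; the helper name `recover_tool_calls` and the regular expression are illustrative, and Rapid-MLX's own parsers cover more formats and failure modes than this.

```python
import json
import re
import uuid

# A JSON object wrapped in <tool_call> tags, possibly spanning several lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def recover_tool_calls(text: str) -> tuple:
    """Split model text into (remaining content, OpenAI-style tool_calls)."""
    tool_calls = []
    kept = []
    cursor = 0
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
            name = payload["name"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # leave unparseable blocks in the visible text
        kept.append(text[cursor:match.start()])
        cursor = match.end()
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": name,
                # The OpenAI format carries arguments as a JSON string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    kept.append(text[cursor:])
    return "".join(kept).strip(), tool_calls


raw = ('Let me check.\n<tool_call>\n'
       '{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>')
content, calls = recover_tool_calls(raw)
print(content)
print(calls[0]["function"]["name"], calls[0]["function"]["arguments"])
```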

Reasoning Separation

Models with chain-of-thought (Qwen3, DeepSeek-R1) output reasoning in a separate reasoning_content field — cleanly separated from content in streaming mode. Works with Qwen3, DeepSeek-R1, MiniMax, and GPT-OSS reasoning formats.
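
On the client side, the two fields can be accumulated separately while streaming. The sketch below works on plain dictionaries and assumes each chunk follows the OpenAI streaming shape with an extra `reasoning_content` key on the delta; the sample chunks and the `collect_stream` helper are made up for illustration.

```python
def collect_stream(chunks: list) -> tuple:
    """Accumulate (reasoning, answer) text from streamed chat chunks."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("reasoning_content"):
                reasoning_parts.append(delta["reasoning_content"])
            if delta.get("content"):
                content_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(content_parts)


# Hypothetical chunks, shaped like parsed server-sent events.
sample_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"reasoning_content": "2 + 2 "}}]},
    {"choices": [{"delta": {"reasoning_content": "is 4."}}]},
    {"choices": [{"delta": {"content": "The answer "}}]},
    {"choices": [{"delta": {"content": "is 4."}}]},
]
reasoning, answer = collect_stream(sample_chunks)
print("reasoning:", reasoning)
print("answer:", answer)
```

A UI can then show the answer immediately and keep the reasoning collapsed or hidden.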

Prompt Cache

Persistent cache across requests: only new tokens are prefilled on each turn. Standard transformers use KV cache trimming. Hybrid models (Qwen3.5 DeltaNet) use RNN state snapshots, which restore non-trimmable layers from memory instead of re-computing them. TTFT improves by 1.2-4.3x in the benchmarks above, depending on the model. Always on, no flags needed.
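
The trimming step for standard transformers amounts to finding the longest shared token prefix. A minimal sketch, assuming the cache is keyed by a flat list of token ids; the function names are hypothetical, and holding back the last token so the model still has something to run is an assumption about how such caches usually behave, not a statement about Rapid-MLX internals.

```python
def common_prefix_len(cached: list, prompt: list) -> int:
    """Number of leading tokens shared by the cache and the new prompt."""
    count = 0
    for cached_token, prompt_token in zip(cached, prompt):
        if cached_token != prompt_token:
            break
        count += 1
    return count


def plan_prefill(cached_tokens: list, prompt_tokens: list) -> tuple:
    """Return (cache entries to keep, tokens that still need prefill)."""
    shared = common_prefix_len(cached_tokens, prompt_tokens)
    # Always leave at least one token to process so logits can be produced.
    keep = max(min(shared, len(prompt_tokens) - 1), 0)
    return keep, prompt_tokens[keep:]


cached = [1, 2, 3, 4, 5, 6]        # previous turn, including the old reply
prompt = [1, 2, 3, 4, 9, 10, 11]   # same system prompt, new user message
keep, todo = plan_prefill(cached, prompt)
print(f"keep {keep} cached tokens, prefill {len(todo)} new tokens")
```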

Smart Cloud Routing

Large-context requests auto-route to a cloud LLM (GPT-5, Claude, etc.) when local prefill would be slow. The routing decision is based on the number of new tokens left to prefill after a cache hit, not on the total prompt length. Example: --cloud-model openai/gpt-5 --cloud-threshold 20000
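
The routing rule can be written as a small predicate. This is a sketch under stated assumptions: the function name is hypothetical, and whether the real comparison is strict or inclusive is not documented here, so a strict greater-than is assumed.

```python
from typing import Optional


def should_route_to_cloud(
    prompt_tokens: int,
    cached_tokens: int,
    cloud_model: Optional[str],
    cloud_threshold: int = 20000,
) -> bool:
    """Route only when a cloud model is set and the uncached part is large."""
    if cloud_model is None:
        return False  # cloud routing is disabled by default
    new_tokens = max(prompt_tokens - cached_tokens, 0)
    return new_tokens > cloud_threshold


# A 50K-token prompt with 45K already cached stays local.
print(should_route_to_cloud(50_000, 45_000, "openai/gpt-5"))
# The same prompt with a cold cache goes to the cloud model.
print(should_route_to_cloud(50_000, 0, "openai/gpt-5"))
```

Counting only uncached tokens means a long conversation keeps running locally as long as each turn adds a modest amount of new text.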

Multimodal

Vision, audio (STT/TTS), video understanding, and text embeddings — all through the same OpenAI-compatible API.

DFlash Speculative Decoding (single-user)

z-lab's block-diffusion drafter (via mlx-vlm) accelerates single-stream generation on validated Qwen3.5/3.6 27B aliases. Currently enabled by default on:

Alias	Drafter	Avg speedup	Min / Max
`qwen3.6-27b-8bit`	`z-lab/Qwen3.6-27B-DFlash`	1.49×	1.06× / 2.07×
`qwen3.5-27b-8bit`	`z-lab/Qwen3.5-27B-DFlash`	1.31×	0.59× / 2.15×

pip install 'rapid-mlx[dflash]'
rapid-mlx info qwen3.5-27b-8bit       # check per-gate eligibility
rapid-mlx serve qwen3.5-27b-8bit --enable-dflash

Workload sensitivity: speedup varies by entropy. Coding / math / summarization typically see 1.5-2.7×; high-entropy creative writing and long-form chat can dip to 0.6-0.9× because the drafter's training distribution diverges from open-ended generation. This is a known pattern in spec-decode literature (arXiv 2604.14682, AdaEDL) — not a bug. Other Qwen3.5/3.6 sizes (35B-A3B MoE, 122B-A10B MoE) were benched and rejected because their average speedup was below the gate.

v1 limitations: DFlash mode runs a dedicated single-user server (mlx-vlm doesn't expose a batched DFlash kernel yet). Tool calling, MCP, and embeddings aren't available in DFlash mode — restart without --enable-dflash for those.

Also: logprobs API, structured JSON output (response_format), continuous batching, KV cache quantization (--kv-cache-quantization), and 3200+ tests.

Server Flags Reference

You don't need any flags to get started — the defaults work for most setups. These are for advanced tuning.

Core

Flag	Description	Default
`<model>`	HuggingFace model name, local path, or alias (positional arg)	(required)
`--host`	Host to bind to	`0.0.0.0`
`--port`	Port to bind to	`8000`
`--max-tokens`	Default max tokens for generation	`32768`

Tool Calling & Reasoning

Flag	Description	Default
`--tool-call-parser`	Parser: `hermes`, `minimax`, `qwen`, `llama`, `deepseek`, etc.	(auto-detected)
`--reasoning-parser`	Parser: `qwen3`, `deepseek_r1`, `minimax`, `gpt_oss`	(auto-detected)
`--enable-tool-logits-bias`	Jump-forward decoding for faster tool calls	off

Performance

Flag	Description	Default
`--prefill-step-size`	Tokens per prefill chunk	`2048`
`--kv-cache-turboquant`	TurboQuant V-cache compression (3-4 bit, 86% savings on dense models)	off
`--kv-cache-quantization`	Quantize prefix cache entries for memory savings	off
`--enable-prefix-cache`	Cache common prefixes across requests	off
`--enable-dflash`	DFlash speculative decoding (single-user; `qwen3.5-27b-8bit` / `qwen3.6-27b-8bit`)	off
`--suffix-decoding`	Drafter-free n-gram speculative decoding (BatchedEngine path)	off
`--enable-mtp`	MTP head speculative decoding (requires MTP-trained model)	off
`--gpu-memory-utilization`	Fraction of device memory to use (0.0-1.0)	`0.90`
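
To see what `--prefill-step-size` changes, it helps to count the chunks a long prompt is split into. The sketch below only does the arithmetic; `prefill_chunks` is a hypothetical helper, and the engine's actual scheduling may differ.

```python
def prefill_chunks(prompt_len: int, step_size: int = 2048) -> list:
    """Return (start, end) token ranges processed in each prefill step."""
    return [
        (start, min(start + step_size, prompt_len))
        for start in range(0, prompt_len, step_size)
    ]


prompt_len = 20_000
for step in (2048, 8192):
    chunks = prefill_chunks(prompt_len, step)
    print(f"step size {step}: {len(chunks)} chunks, last is {chunks[-1]}")
```

Larger steps mean fewer passes over the model for the same prompt, typically at the cost of more memory per step.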

Cloud Routing

Flag	Description	Default
`--cloud-model`	litellm model string (e.g. `openai/gpt-5`)	(disabled)
`--cloud-threshold`	New token threshold to trigger cloud routing	`20000`

Security & Other

Flag	Description	Default
`--api-key`	API key for authentication	(no auth)
`--rate-limit`	Requests per minute per client	(unlimited)
`--timeout`	Request timeout in seconds	`300`
`--mllm`	Force multimodal (vision) mode	auto-detect
`--mcp-config`	MCP configuration file for tool integration	(none)
`--embedding-model`	Pre-load embedding model at startup	(none)

Common Issues

"parameters not found in model" warnings at startup — Normal for VLMs. Vision weights are auto-skipped.

Out of memory / very slow (<5 tok/s) — The model is too big for your RAM. See "What fits my Mac?" and try a smaller quantization (4bit) or a smaller model.

Empty responses — Remove --reasoning-parser for non-thinking models.

Tool calls as plain text — Set the correct --tool-call-parser for your model. Even without it, Rapid-MLX auto-recovers most cases.

Slow first response — Two possible causes: (1) Qwen3.5 models reason before answering; add --no-thinking to skip reasoning and get faster responses. (2) Cold start on long prompts; add --prefill-step-size 8192 to speed up prefill. Subsequent turns hit the prompt cache and are 10-30x faster.

Other issues? Run rapid-mlx doctor for self-diagnostics.

Optional Extras

The base pip install rapid-mlx is ~460 MB and covers all text-only models. Vision, audio, and other features ship as opt-in extras:

Extra	Install	Adds	What it unlocks
`vision`	`pip install 'rapid-mlx[vision]'`	~322 MB	Gemma 4, Qwen-VL, video understanding (mlx-vlm + opencv + torch)
`audio`	`pip install 'rapid-mlx[audio]'`	~600 MB	TTS / STT (mlx-audio + spacy + scipy)
`embeddings`	`pip install 'rapid-mlx[embeddings]'`	~50 MB	`/v1/embeddings` endpoint (mlx-embeddings)
`chat`	`pip install 'rapid-mlx[chat]'`	~150 MB	Built-in Gradio chat UI
`guided`	`pip install 'rapid-mlx[guided]'`	~80 MB	Schema-constrained JSON generation (outlines)
`all`	`pip install 'rapid-mlx[all]'`	~1.1 GB	Vision + audio + chat + embeddings

If you installed via Homebrew and want vision/audio support, use pip install 'rapid-mlx[vision]' (or [audio]) inside your own Python 3.10+ venv — that gives you the full feature set without rebuilding the brew formula.

Troubleshooting

Run the built-in self-diagnostic (works from pip install, no dev tools needed):

rapid-mlx doctor

Rapid-MLX Doctor
============================================================
  [metal] OK        # Apple Silicon Metal GPU available
  [imports] OK      # Core modules import cleanly
  [cli] OK          # CLI commands respond
  [model_load] OK   # Inference pipeline works
Result: PASS

Telemetry

Rapid-MLX can send anonymous usage data to help us prioritise the right models and catch regressions. It is off by default and never starts collecting without your explicit opt-in.

What we collect (only if you opt in)

Subcommand names (serve / chat / agents / bench / doctor)
Model alias names (qwen3.5-9b) or canonical HF repo IDs (mlx-community/...) — local paths are redacted to <local>
Bucketed counts: prompt/completion tokens, TTFT, tokens/sec — never exact values
Error categories + a hash fingerprint of the failure site (exception class name + per-frame file:function:lineno only — never the message text or absolute paths)
OS, arch, Apple chip name, RAM (rounded to GB), Python major.minor
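
Bucketing and fingerprinting of this kind can be sketched as follows. The power-of-two bucket edges, the 16-character truncated SHA-256, and both helper names are assumptions made for illustration; the actual schema is the one documented in vllm_mlx/telemetry/schema.py.

```python
import hashlib
import os


def bucket(value: int) -> str:
    """Map an exact count to a power-of-two range label."""
    if value <= 0:
        return "0"
    lower = 1 << (value.bit_length() - 1)
    return f"{lower}-{2 * lower - 1}"


def fingerprint(exc_class: str, frames: list) -> str:
    """Hash the exception class and file:function:lineno of each frame.

    Only the file's base name is used, so absolute paths never enter the
    hash input, and the exception message is not an input at all.
    """
    parts = [exc_class] + [
        f"{os.path.basename(path)}:{func}:{line}" for path, func, line in frames
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]


print(bucket(1500))
print(fingerprint("ValueError", [("/Users/me/site-packages/app/cli.py", "main", 42)]))
```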

What we never collect

Prompts, completions, tool-call arguments, file contents, or any user-generated text
Local file paths, working directory, or model paths beyond their HF repo ID
IPs or hostnames (Phase 2 will route through a Cloudflare Worker that strips IPs before forwarding to the aggregator; Phase 1 ships no transport at all)
API keys, environment variable values, auth headers
Stack trace messages or argument values

Manage it

rapid-mlx telemetry status     # show current state and why
rapid-mlx telemetry preview    # print the exact JSON payload that would be sent
rapid-mlx telemetry enable     # opt in
rapid-mlx telemetry disable    # opt out
rapid-mlx telemetry reset      # delete consent + client-id files (re-prompts on next run)

Force-disable in scripts / CI

Either of these always wins, regardless of stored consent:

RAPID_MLX_TELEMETRY=0 rapid-mlx serve qwen3.5-9b
rapid-mlx --no-telemetry serve qwen3.5-9b

There is intentionally no env-var equivalent for force-on — opting in must be an explicit one-time rapid-mlx telemetry enable. CI agents will never silently contribute.
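
The precedence described above fits in a few lines. This is a sketch of the rule as stated, not the code in vllm_mlx/telemetry/: the function name and the way arguments are passed in are assumptions.

```python
def telemetry_enabled(stored_consent: bool, env: dict, argv: list) -> bool:
    """Apply the force-disable switches before consulting stored consent."""
    if env.get("RAPID_MLX_TELEMETRY") == "0":
        return False  # environment variable always wins
    if "--no-telemetry" in argv:
        return False  # so does the CLI flag
    # No switch can force telemetry on; only stored opt-in consent can.
    return stored_consent


print(telemetry_enabled(True, {"RAPID_MLX_TELEMETRY": "0"}, ["serve", "qwen3.5-9b"]))
print(telemetry_enabled(False, {"RAPID_MLX_TELEMETRY": "1"}, ["serve", "qwen3.5-9b"]))
```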

Where the code lives

Everything is in vllm_mlx/telemetry/ — read it. Phase 1 (this release) ships the consent mechanism and CLI surface; no network code is in the codebase yet. Phase 2 will add the transport behind the same opt-in gate; the schema is documented in vllm_mlx/telemetry/schema.py. Tracking issue: #236.

Development

Quick start

git clone https://github.com/raullenchai/Rapid-MLX.git
cd Rapid-MLX
pip install -e ".[dev]"

Testing

Two layers: user-facing doctor (ships with pip) and dev test suite (source checkout only).

Dev test commands

Command	What	Time	Needs server?
`make lint`	ruff lint	~10s	No
`make test`	pytest unit suite (3200+ tests)	~30s	No
`make smoke`	lint + unit	~1 min	No
`make stress`	8-scenario stress test	~5 min	Yes
`make soak`	10-min agent soak test	10 min	Yes

For stress/soak, start a server first:

rapid-mlx serve mlx-community/Qwen3.5-4B-MLX-4bit --enable-auto-tool-choice --tool-call-parser hermes
# In another terminal:
make stress

Or use the script directly for more options:

python scripts/dev_test.py smoke              # lint + unit
python scripts/dev_test.py stress --port 8000 # custom port
python scripts/dev_test.py full               # everything

Regression harness (multi-model)

make check              # 1 model (~10 min, auto starts server)
make full               # 3 models + 12 agent profiles (~1 hr)
make benchmark          # all local models (overnight)

Architecture

vllm_mlx/
  server.py              # App factory + model loading + CLI entry
  config/                # ServerConfig singleton
  service/
    helpers.py           # Shared request helpers
    postprocessor.py     # Streaming pipeline (100% test coverage)
  routes/
    chat.py              # /v1/chat/completions
    completions.py       # /v1/completions
    anthropic.py         # /v1/messages (Anthropic API)
    health.py, models.py, embeddings.py, audio.py, mcp_routes.py
  engine/                # BatchedEngine (continuous batching)
  reasoning/             # 7 reasoning parsers (Qwen3, DeepSeek, MiniMax, ...)
  tool_parsers/          # 17 tool call parsers
  speculative/           # DFlash, SuffixDecoding, MTP drafters
  agents/                # 12 agent profiles (YAML)
  runtime/               # Model registry, cache persistence
  doctor/                # User self-diagnostic
scripts/                 # Dev-only (NOT shipped with pip)
  dev_test.py            # Unified test entry point
  stress_test.py         # 8-scenario stress test
  agent_soak_test.py     # 10-min agent soak test
  cross_model_stress.py  # Multi-model validation
tests/                   # pytest unit tests (3200+)
harness/                 # Regression baselines + thresholds

Roadmap

Technique	Expected Gain	Status
DFlash — block-diffusion drafter, single-user	1.3-2× decode	Shipping (qwen3.5-27b-8bit, qwen3.6-27b-8bit)
SuffixDecoding — drafter-free n-gram speculative	1.1-1.5× decode	Shipping (`--suffix-decoding`, per-model tier sweep ongoing)
MTP — Multi-Token Prediction head	1.4-1.7× decode	Experimental (requires MTP-trained checkpoint)
EAGLE-3 — feature-level draft on Metal	3-6.5× decode	Not started
ReDrafter — Apple's RNN draft head	1.4-1.5× decode	Not started

Contributing

We welcome contributions of all sizes! See CONTRIBUTING.md for setup and guidelines.

Easy first contributions (no model download needed):

Add a model alias — map a short name to a HuggingFace model ID
Request model support — tell us which model you want

Testing contributions (needs a Mac with Apple Silicon):

Benchmark a model and share results
Test with your favorite AI client (Cursor, Aider, LangChain, etc.)
Report a bug

License

Apache 2.0 — see LICENSE.
