Releases: raullenchai/Rapid-MLX
v0.3.12
What's New
API Spec Compliance
- Model name validation — Unknown model names now return HTTP 404 with available models listed, instead of silently using the loaded model
- `GET /v1/models/{model_id}` — per-model retrieve endpoint added (OpenAI spec)
UX Improvements
- Status endpoint — changed `"stopped"` → `"idle"` when no generation is active (avoids confusion with server state)
- Cache stats — clear message for text-only models instead of the misleading "mlx_vlm not loaded" error
- README — fixed `--model` flag docs → `<model>` positional arg
Quality
- Found and fixed via 30-round new-user simulation testing covering OpenAI SDK, Anthropic SDK, LangChain, Aider, curl edge cases, concurrency, tool calling, JSON mode, and more
Full Changelog: v0.3.11...v0.3.12
v0.3.11 — Fix streaming tool call XML leak
Bug Fix
- Streaming tool call XML leak: after structured `tool_calls` chunks were emitted, the model's raw XML echo (e.g., `<tool_call><function=...>`) leaked as additional `content` chunks. Now suppressed in both reasoning and non-reasoning streaming paths.
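The fix can be pictured as a filter over the chunk stream — a hedged sketch assuming each chunk carries either `tool_calls` or `content`:

```python
def drop_xml_echo(chunks):
    """Once structured tool_calls chunks have been emitted, suppress any
    later raw-content chunks (the model's XML echo). Illustrative only;
    the real streaming paths are more involved."""
    saw_tool_calls = False
    for chunk in chunks:
        if chunk.get("tool_calls"):
            saw_tool_calls = True
            yield chunk
        elif saw_tool_calls and chunk.get("content"):
            continue  # e.g. "<tool_call><function=...>" echoed by the model
        else:
            yield chunk
```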
Full Changelog
v0.3.10 — README Accuracy & Reasoning Leak Fix
Phase 3 deep testing (autoresearch across multiple models and frameworks) found and fixed:
Bug Fix
- Non-streaming reasoning leak: the reasoning preamble leaked into the `content` field when tool calls were present in non-streaming mode. Removed the `not tool_calls` guard so reasoning extraction always runs.
README Fixes
- OpenCode config: fixed to the correct `provider.openai.api` format (was using a fictional `openai-compatible` provider)
- Capability table: renamed `mlx-lm` → `mlx-lm serve`; fixed Streaming & OpenAI API from "No" to "Yes"
- Feature count: 35 → 37 to match actual subcategory totals (15+3+9+6+4)
- Reasoning claim: softened "0% leak rate" to "cleanly separated in streaming mode"
- Anthropic SDK: added to Works With table with setup instructions (`/v1/messages` endpoint)
Full Changelog
v0.3.9
Fixes from Autoresearch Round 2
Deep integration testing across 10 agentic frameworks (Aider, OpenCode, LangChain, LiteLLM, CrewAI, Cline, Continue.dev, OpenAI SDK, AutoGen, Semantic Kernel) — all connecting to Rapid-MLX as their LLM backend.
v0.3.9 Fixes
- System-only messages no longer crash — Jinja2 `TemplateError("No user query found")` now returns HTTP 400 with a descriptive error instead of HTTP 500. Covers streaming, non-streaming, and Anthropic endpoints.
- `prompt_tokens` now reported on `/v1/completions` — was always 0, now correctly counts prompt tokens. Fixes token usage tracking in Continue.dev, LiteLLM, and other tools.
v0.3.8 Fixes (included)
- Stop sequence truncation (P1) — both streaming and non-streaming
- `n > 1` / `max_tokens < 1` / `temperature > 2` validation (P2)
- `<|eom_id|>` / `<|python_tag|>` Llama token stripping (P2)
- `completion_tokens` off-by-one in non-streaming (P3)
- uvicorn `timeout_keep_alive=30` for agentic clients (P2)
Framework Compatibility
| Framework | Stars | Status |
|---|---|---|
| Aider | 95k | 5/5 tasks PASS |
| LangChain | 113k | 5/5 PASS (chains, agents, tools) |
| AutoGen | 55k | PASS (multi-agent + tool calling) |
| CrewAI | 46k | PASS |
| LiteLLM | 39k | 7/8 PASS |
| Semantic Kernel | 27k | PASS |
| LlamaIndex | 40k | PASS |
| Cline patterns | 59k | PASS |
| Continue.dev patterns | 20k | PASS |
| OpenCode | 120k | PASS |
v0.3.8
Fixes (8 bugs from autoresearch deep testing)
20 rounds of integration testing across OpenAI SDK, Aider, LangChain, LiteLLM, Cline, and OpenCode patterns found and fixed:
P1 — Critical
- Stop sequences not forwarded — `stream_chat()` and `chat()` in SimpleEngine accepted `stop` but never passed it downstream
- Stop sequences included in output — the OpenAI spec requires truncation at the stop point; now properly truncated in both streaming and non-streaming
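Truncation at the stop point reduces to cutting at the earliest match; a minimal sketch (hypothetical helper — the real engine also handles stop sequences that span streamed chunks):

```python
def truncate_at_stop(text: str, stop: list[str]) -> str:
    """Cut generated text at the earliest stop sequence. The stop string
    itself is excluded from the output, per the OpenAI spec."""
    cut = len(text)
    for seq in stop:
        idx = text.find(seq)
        if 0 <= idx < cut:
            cut = idx
    return text[:cut]
```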
P2 — Important
- `n > 1` silently ignored — now returns HTTP 400 with a clear message
- Negative `max_tokens` accepted — now validates ≥ 1
- Temperature out of range — now validates 0–2 per OpenAI spec
- Llama special tokens leaking — `<|eom_id|>` and `<|python_tag|>` now stripped
- uvicorn keep-alive too short — increased to 30s for agentic long-poll clients (Aider, Cline)
P3 — Minor
- `completion_tokens` off-by-one — the non-streaming path re-encoded output text; now collects actual token IDs from generation
All fixes verified against live server. 6 files changed, +54/−11.
v0.3.7
What's New
Extensive fresh-user testing (4 rounds, 4 personas) found and fixed critical onboarding issues.
Critical Fixes
- Auto-detection wired to CLI — `rapid-mlx serve` now auto-detects reasoning and tool parsers from the model name. Previously only `python -m vllm_mlx.server` had this, causing think tags to leak into content for all CLI users.
- Streaming fixed with `--no-thinking` — the `enable_thinking` kwarg was leaking to `stream_generate()`, which doesn't accept it, crashing all streaming requests. Fixed for both LLM and MLLM paths.
- `--no-thinking` actually works — previously it only disabled the parser; the model still generated think tokens. Now forces `enable_thinking=False` in the chat template across all endpoints.
UX Improvements
- `--version` / `-V` flag — `rapid-mlx --version` now works
- Port check before model load — detects port conflicts before wasting 5GB+ of RAM loading the model
- Clean error for missing models — shows helpful message instead of 48-line Python traceback
- Empty messages returns 400 — instead of 500 Internal Server Error
- `owned_by: "rapid-mlx"` — rebranded from the old `vllm-mlx` name in API responses
- README curl example — removed the `!` that caused zsh history expansion errors
- 36GB/48GB RAM rows in the "What fits my Mac?" table (M3/M4 Pro configs)
- Thinking model tip in Quick Start for first-time users
v0.3.6
Fixes
- Standalone Python installer — updated from the defunct `indygreg/python-build-standalone` to `astral-sh/python-build-standalone` (CPython 3.12.13); users without Homebrew can now install again
- Metal shader warmup — moved to a FastAPI lifespan hook so it works for all engine types (simple, batched, hybrid), not just SimpleEngine
- Think tag leakage — the Anthropic `/v1/messages` endpoint now uses the reasoning parser to strip `<think>` tags from streaming and non-streaming responses
- Disk space check — warns before model download if available disk space is insufficient
- `generate_warmup()` added to `BatchedEngine` and `HybridEngine` (previously inherited a no-op from the base class)
v0.3.5 — UX Polish
What's New
API Response Clarity
- Responses return the actual model name — no longer echoes back arbitrary client input. Sending `model: "gpt-4o"` now correctly returns the real loaded model (e.g. `mlx-community/Llama-3.2-3B-Instruct-4bit`), preventing confusion about which model is running.
- `/v1/models` lists alias + full name — if you started with `rapid-mlx serve llama3-3b`, the models endpoint shows both `llama3-3b` and the full HuggingFace path.
Quieter Defaults
- Security warning removed from default output — the `SECURITY WARNING: Server running without API key` message is now debug-level only. Local inference doesn't need auth; the warning was causing unnecessary anxiety for new users.
Full Changelog
- fix: UX friction — model name echo, security warning, alias in /v1/models (#41)
v0.3.4 — Onboarding Polish
What's New
Onboarding Fixes (from real user testing)
- Homebrew install fixed — was missing uvicorn/fastapi deps, now installs from PyPI correctly
- Quick Start rewritten — 3 clear steps with separate code blocks, bold "open second terminal" instruction
- pip on macOS — instructions now include venv creation (required on Sonoma+)
- install.sh — post-install message uses the short alias (`qwen3.5-9b`) instead of the full HF path
- Vision deps — install instructions now cover all 3 methods (install.sh, pip, brew)
- "What's next" links — Quick Start now guides users to Choose Your Model and Works With sections
- CLI — bench-kv-cache output prints `rapid-mlx` instead of the legacy `vllm-mlx`
Full Changelog
- fix: onboarding P0 — pip venv + model field clarity + 9 friction points (#40)
v0.3.3 — Model Aliases
What's New
Model Alias Registry
No more memorizing HuggingFace paths. Use short names:
```bash
rapid-mlx serve qwen3.5-9b --port 8000
# resolves to mlx-community/Qwen3.5-9B-4bit
```

List all 20 aliases with `rapid-mlx models`.
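Under the hood an alias registry is just a name→path mapping; a tiny sketch using two aliases that appear elsewhere in these notes (the real registry ships 20 entries, and this is not the project's actual code):

```python
# Illustrative alias table: short name -> full HuggingFace path.
ALIASES = {
    "qwen3.5-9b": "mlx-community/Qwen3.5-9B-4bit",
    "llama3-3b": "mlx-community/Llama-3.2-3B-Instruct-4bit",
}

def resolve(name: str) -> str:
    """Resolve a short alias to its full HuggingFace path; unknown names
    (e.g. explicit HF paths) pass through unchanged."""
    return ALIASES.get(name, name)
```

Passing unknown names through unchanged keeps full HuggingFace paths working exactly as before.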
README Polish
- Vision install instructions now cover both `install.sh` and `pip` paths
- Benchmark header clarified (mlx-lm serve, not vllm-mlx)
- Copy-paste commands use short aliases
- Minor clarity improvements for new users