Releases: raullenchai/Rapid-MLX

v0.3.12

22 Mar 00:41

What's New

API Spec Compliance

  • Model name validation — Unknown model names now return HTTP 404 with available models listed, instead of silently using the loaded model
  • GET /v1/models/{model_id} — Per-model retrieve endpoint added (OpenAI spec)
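
The 404 behavior can be sketched roughly as follows (function name and error shape are illustrative, not the project's actual implementation):

```python
def retrieve_model(model_id, available):
    """Illustrative sketch: unknown model names return HTTP 404 with the
    available models listed, instead of silently using the loaded model."""
    if model_id not in available:
        return 404, {"error": {
            "message": f"Model '{model_id}' not found",
            "available_models": available,
        }}
    # Per-model retrieve response, following the OpenAI spec shape
    return 200, {"id": model_id, "object": "model", "owned_by": "rapid-mlx"}
```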

UX Improvements

  • Status endpoint — Changed "stopped" → "idle" when no generation is active (avoids confusion with server state)
  • Cache stats — Clear message for text-only models instead of misleading "mlx_vlm not loaded" error
  • README — Fixed --model flag docs → <model> positional arg

Quality

  • Found and fixed via 30-round new-user simulation testing covering OpenAI SDK, Anthropic SDK, LangChain, Aider, curl edge cases, concurrency, tool calling, JSON mode, and more

Full Changelog: v0.3.11...v0.3.12

v0.3.11

21 Mar 21:54

v0.3.11 — Fix streaming tool call XML leak

Bug Fix

  • Streaming tool call XML leak: After structured tool_calls chunks were emitted, the model's raw XML echo (e.g., <tool_call><function=...>) leaked as additional content chunks. Now suppressed in both reasoning and non-reasoning streaming paths.
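
The suppression logic can be sketched like this (a minimal illustration; the chunk shape and matching rule are assumptions, not the actual code):

```python
def filter_stream_chunks(chunks):
    """Sketch: once structured tool_call chunks have been emitted, drop any
    trailing raw-XML echo (e.g. '<tool_call>...') from content chunks."""
    saw_tool_call = False
    for chunk in chunks:
        if chunk.get("tool_calls"):
            saw_tool_call = True
            yield chunk
        elif saw_tool_call and chunk.get("content", "").lstrip().startswith("<tool_call>"):
            continue  # suppress the model's leaked XML echo
        else:
            yield chunk
```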

Full Changelog

v0.3.10...v0.3.11

v0.3.10

21 Mar 21:19

v0.3.10 — README Accuracy & Reasoning Leak Fix

Phase 3 deep testing (autoresearch across multiple models and frameworks) found and fixed:

Bug Fix

  • Non-streaming reasoning leak: Reasoning preamble leaked into content field when tool calls were present in non-streaming mode. Removed not tool_calls guard so reasoning extraction always runs.
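
A rough sketch of the separation (illustrative only; the real reasoning parser is model-specific): the key point is that the split now runs unconditionally, with no `not tool_calls` guard.

```python
import re

def split_reasoning(text):
    """Sketch: separate a <think>...</think> preamble from the content.
    Runs regardless of whether tool calls are present — the old
    `not tool_calls` guard caused the non-streaming leak."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2)
    return None, text
```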

README Fixes

  • OpenCode config: Fixed to correct provider.openai.api format (was using fictional openai-compatible provider)
  • Capability table: Renamed mlx-lm → mlx-lm serve, fixed Streaming & OpenAI API from "No" to "Yes"
  • Feature count: 35 → 37 to match actual subcategory totals (15+3+9+6+4)
  • Reasoning claim: Softened "0% leak rate" to "cleanly separated in streaming mode"
  • Anthropic SDK: Added to Works With table with setup instructions (/v1/messages endpoint)

Full Changelog

v0.3.9...v0.3.10

v0.3.9

21 Mar 17:14

Fixes from Autoresearch Round 2

Deep integration testing across 10 agentic frameworks (Aider, OpenCode, LangChain, LiteLLM, CrewAI, Cline, Continue.dev, OpenAI SDK, AutoGen, Semantic Kernel) — all connecting to Rapid-MLX as their LLM backend.

v0.3.9 Fixes

  • System-only messages no longer crash — Jinja2 TemplateError ("No user query found") now returns HTTP 400 with descriptive error instead of HTTP 500. Covers streaming, non-streaming, and Anthropic endpoints.
  • prompt_tokens now reported on /v1/completions — was always 0, now correctly counts prompt tokens. Fixes token usage tracking in Continue.dev, LiteLLM, and other tools.
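
A minimal sketch of the new guard (names hypothetical; the real fix maps the Jinja2 TemplateError to an HTTP 400 response):

```python
def validate_messages(messages):
    """Reject message lists with no user turn up front, so the chat
    template never raises "No user query found" as an HTTP 500."""
    if not any(m.get("role") == "user" for m in messages):
        # surfaced to the client as HTTP 400 with a descriptive body
        raise ValueError("messages must contain at least one 'user' message")
    return messages
```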

v0.3.8 Fixes (included)

  • Stop sequence truncation (P1) — both streaming and non-streaming
  • n > 1 / max_tokens < 1 / temperature > 2 validation (P2)
  • <|eom_id|> / <|python_tag|> Llama token stripping (P2)
  • completion_tokens off-by-one in non-streaming (P3)
  • uvicorn timeout_keep_alive=30 for agentic clients (P2)

Framework Compatibility

| Framework | Stars | Status |
|---|---|---|
| Aider | 95k | 5/5 tasks PASS |
| LangChain | 113k | 5/5 PASS (chains, agents, tools) |
| AutoGen | 55k | PASS (multi-agent + tool calling) |
| CrewAI | 46k | PASS |
| LiteLLM | 39k | 7/8 PASS |
| Semantic Kernel | 27k | PASS |
| LlamaIndex | 40k | PASS |
| Cline patterns | 59k | PASS |
| Continue.dev patterns | 20k | PASS |
| OpenCode | 120k | PASS |

v0.3.8

21 Mar 16:48

Fixes (8 bugs from autoresearch deep testing)

20 rounds of integration testing across OpenAI SDK, Aider, LangChain, LiteLLM, Cline, and OpenCode patterns found and fixed:

P1 — Critical

  • Stop sequences not forwarded — stream_chat() and chat() in SimpleEngine accepted stop but never passed it downstream
  • Stop sequences included in output — OpenAI spec requires truncation at the stop point; now properly truncated in both streaming and non-streaming
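
The truncation rule can be sketched with a small helper (illustrative, not the project's actual code): cut at the earliest match of any stop sequence and exclude the stop text itself.

```python
def truncate_at_stop(text, stop_sequences):
    """Truncate at the first occurrence of any stop sequence, excluding
    the stop sequence from the output, per the OpenAI spec."""
    cut = len(text)
    for stop in stop_sequences or []:
        idx = text.find(stop)
        if idx != -1 and idx < cut:
            cut = idx
    return text[:cut]
```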

P2 — Important

  • n > 1 silently ignored — now returns HTTP 400 with clear message
  • Negative max_tokens accepted — now validates ≥ 1
  • Temperature out of range — now validates 0–2 per OpenAI spec
  • Llama special tokens leaking — <|eom_id|> and <|python_tag|> now stripped
  • uvicorn keep-alive too short — increased to 30s for agentic long-poll clients (Aider, Cline)
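
The three validation bullets above could look roughly like this (a hedged sketch; parameter handling in the actual server may differ):

```python
def validate_params(n=1, max_tokens=None, temperature=1.0):
    """Sketch of the request validations: each failure is surfaced
    to the client as HTTP 400 with a clear message."""
    if n != 1:
        raise ValueError("n > 1 is not supported")
    if max_tokens is not None and max_tokens < 1:
        raise ValueError("max_tokens must be >= 1")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
```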

P3 — Minor

  • completion_tokens off-by-one — non-streaming path re-encoded output text; now collects actual token IDs from generation

All fixes verified against live server. 6 files changed, +54/−11.

v0.3.7

21 Mar 15:20

What's New

Extensive fresh-user testing (4 rounds, 4 personas) found and fixed critical onboarding issues.

Critical Fixes

  • Auto-detection wired to CLI — rapid-mlx serve now auto-detects reasoning and tool parsers from model name. Previously only python -m vllm_mlx.server had this, causing think tags to leak into content for all CLI users.
  • Streaming fixed with --no-thinking — enable_thinking kwarg was leaking to stream_generate(), which doesn't accept it, crashing all streaming requests. Fixed for both LLM and MLLM paths.
  • --no-thinking actually works — previously only disabled the parser; model still generated think tokens. Now forces enable_thinking=False in the chat template across all endpoints.
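
The kwarg-leak fix reduces to routing `enable_thinking` to the chat template only; a hypothetical sketch (names illustrative):

```python
def split_generation_kwargs(kwargs):
    """Sketch: `enable_thinking` belongs to the chat template, not to
    stream_generate(), which doesn't accept it and would crash."""
    template_kwargs = {}
    if "enable_thinking" in kwargs:
        template_kwargs["enable_thinking"] = kwargs.pop("enable_thinking")
    return template_kwargs, kwargs
```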

UX Improvements

  • --version / -V flag — rapid-mlx --version now works
  • Port check before model load — detects port conflicts before wasting 5GB+ RAM loading the model
  • Clean error for missing models — shows helpful message instead of 48-line Python traceback
  • Empty messages returns 400 — instead of 500 Internal Server Error
  • owned_by: "rapid-mlx" — rebranded from old vllm-mlx name in API responses
  • README curl example — removed ! that caused zsh history expansion errors
  • 36GB/48GB RAM rows in "What fits my Mac?" table (M3/M4 Pro configs)
  • Thinking model tip in Quick Start for first-time users

v0.3.6

21 Mar 14:04

Fixes

  • Standalone Python installer — updated from defunct indygreg/python-build-standalone to astral-sh/python-build-standalone (cpython 3.12.13); users without Homebrew can now install again
  • Metal shader warmup — moved to FastAPI lifespan hook so it works for all engine types (simple, batched, hybrid), not just SimpleEngine
  • Think tag leakage — Anthropic /v1/messages endpoint now uses the reasoning parser to strip <think> tags from streaming and non-streaming responses
  • Disk space check — warns before model download if available disk is insufficient
  • generate_warmup() added to BatchedEngine and HybridEngine (previously inherited no-op from base)

v0.3.5 — UX Polish

21 Mar 13:10

What's New

API Response Clarity

  • Responses return actual model name — no longer echoes back arbitrary client input. Sending model: "gpt-4o" now correctly returns the real loaded model (e.g. mlx-community/Llama-3.2-3B-Instruct-4bit), preventing confusion about which model is running.
  • /v1/models lists alias + full name — if you started with rapid-mlx serve llama3-3b, the models endpoint shows both llama3-3b and the full HuggingFace path.
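
Both behaviors can be sketched together (response fields follow the OpenAI /v1/models shape; the functions themselves are illustrative):

```python
def models_list(alias, full_name):
    """Sketch: /v1/models advertises both the short alias and the
    full HuggingFace path of the loaded model."""
    return {"object": "list", "data": [
        {"id": name, "object": "model", "owned_by": "rapid-mlx"}
        for name in (alias, full_name)
    ]}

def response_model_field(requested_model, loaded_model):
    """Sketch: responses report the model actually loaded,
    never an echo of the client's `model` field."""
    return loaded_model
```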

Quieter Defaults

  • Security warning removed from default output — the SECURITY WARNING: Server running without API key message is now debug-level only. Local inference doesn't need auth; the warning was causing unnecessary anxiety for new users.

Full Changelog

  • fix: UX friction — model name echo, security warning, alias in /v1/models (#41)

v0.3.4 — Onboarding Polish

21 Mar 12:59

What's New

Onboarding Fixes (from real user testing)

  • Homebrew install fixed — was missing uvicorn/fastapi deps, now installs from PyPI correctly
  • Quick Start rewritten — 3 clear steps with separate code blocks, bold "open second terminal" instruction
  • pip on macOS — instructions now include venv creation (required on Sonoma+)
  • install.sh — post-install message uses short alias (qwen3.5-9b) instead of full HF path
  • Vision deps — install instructions now cover all 3 methods (install.sh, pip, brew)
  • "What's next" links — Quick Start now guides users to Choose Your Model and Works With sections
  • CLI — bench-kv-cache output prints rapid-mlx instead of legacy vllm-mlx

Full Changelog

  • fix: onboarding P0 — pip venv + model field clarity + 9 friction points (#40)

v0.3.3 — Model Aliases

21 Mar 12:39

What's New

Model Alias Registry

No more memorizing HuggingFace paths. Use short names:

```shell
rapid-mlx serve qwen3.5-9b --port 8000
# resolves to mlx-community/Qwen3.5-9B-4bit
```

List all 20 aliases with rapid-mlx models.
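
Resolution presumably reduces to a dictionary lookup with pass-through for full paths; a sketch using the one mapping shown above (the real registry has 20 entries):

```python
# Excerpt only — illustrative; the actual registry holds 20 aliases.
ALIASES = {"qwen3.5-9b": "mlx-community/Qwen3.5-9B-4bit"}

def resolve_model(name):
    """Return the full HuggingFace path for a known alias;
    unknown names pass through unchanged (assumed full paths)."""
    return ALIASES.get(name, name)
```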

README Polish

  • Vision install instructions now cover both install.sh and pip paths
  • Benchmark header clarified (mlx-lm serve, not vllm-mlx)
  • Copy-paste commands use short aliases
  • Minor clarity improvements for new users

Full Changelog

  • feat: model alias registry (#39)
  • fix: README polish for non-tech users (#38)