Releases: raullenchai/Rapid-MLX

v0.3.12

22 Mar 00:41

What's New

API Spec Compliance

  • Model name validation — Unknown model names now return HTTP 404 with available models listed, instead of silently using the loaded model
  • GET /v1/models/{model_id} — Per-model retrieve endpoint added (OpenAI spec)
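
The 404 behavior can be sketched roughly as follows (function name and error shape are illustrative, not the project's actual implementation):

```python
def retrieve_model(model_id, available):
    """Illustrative sketch: unknown model names return HTTP 404 with the
    available models listed, instead of silently using the loaded model."""
    if model_id not in available:
        return 404, {"error": {
            "message": f"Model '{model_id}' not found",
            "available_models": available,
        }}
    # Per-model retrieve response, following the OpenAI spec shape
    return 200, {"id": model_id, "object": "model", "owned_by": "rapid-mlx"}
```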

UX Improvements

  • Status endpoint — Changed "stopped" → "idle" when no generation is active (avoids confusion with server state)
  • Cache stats — Clear message for text-only models instead of misleading "mlx_vlm not loaded" error
  • README — Fixed --model flag docs → <model> positional arg

Quality

  • Found and fixed via 30-round new-user simulation testing covering OpenAI SDK, Anthropic SDK, LangChain, Aider, curl edge cases, concurrency, tool calling, JSON mode, and more

Full Changelog: v0.3.11...v0.3.12

v0.3.11

21 Mar 21:54

v0.3.11 — Fix streaming tool call XML leak

Bug Fix

  • Streaming tool call XML leak: After structured tool_calls chunks were emitted, the model's raw XML echo (e.g., <tool_call><function=...>) leaked as additional content chunks. Now suppressed in both reasoning and non-reasoning streaming paths.
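
The suppression logic can be sketched like this (a minimal illustration; the chunk shape and matching rule are assumptions, not the actual code):

```python
def filter_stream_chunks(chunks):
    """Sketch: once structured tool_call chunks have been emitted, drop any
    trailing raw-XML echo (e.g. '<tool_call>...') from content chunks."""
    saw_tool_call = False
    for chunk in chunks:
        if chunk.get("tool_calls"):
            saw_tool_call = True
            yield chunk
        elif saw_tool_call and chunk.get("content", "").lstrip().startswith("<tool_call>"):
            continue  # suppress the model's leaked XML echo
        else:
            yield chunk
```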

Full Changelog

v0.3.10...v0.3.11

v0.3.10

21 Mar 21:19

v0.3.10 — README Accuracy & Reasoning Leak Fix

Phase 3 deep testing (autoresearch across multiple models and frameworks) found and fixed:

Bug Fix

  • Non-streaming reasoning leak: Reasoning preamble leaked into content field when tool calls were present in non-streaming mode. Removed not tool_calls guard so reasoning extraction always runs.
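
A rough sketch of the separation (illustrative only; the real reasoning parser is model-specific): the key point is that the split now runs unconditionally, with no `not tool_calls` guard.

```python
import re

def split_reasoning(text):
    """Sketch: separate a <think>...</think> preamble from the content.
    Runs regardless of whether tool calls are present — the old
    `not tool_calls` guard caused the non-streaming leak."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2)
    return None, text
```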

README Fixes

  • OpenCode config: Fixed to correct provider.openai.api format (was using fictional openai-compatible provider)
  • Capability table: Renamed mlx-lm → mlx-lm serve, fixed Streaming & OpenAI API from "No" to "Yes"
  • Feature count: 35 → 37 to match actual subcategory totals (15+3+9+6+4)
  • Reasoning claim: Softened "0% leak rate" to "cleanly separated in streaming mode"
  • Anthropic SDK: Added to Works With table with setup instructions (/v1/messages endpoint)

Full Changelog

v0.3.9...v0.3.10

v0.3.9

21 Mar 17:14

Fixes from Autoresearch Round 2

Deep integration testing across 10 agentic frameworks (Aider, OpenCode, LangChain, LiteLLM, CrewAI, Cline, Continue.dev, OpenAI SDK, AutoGen, Semantic Kernel) — all connecting to Rapid-MLX as their LLM backend.

v0.3.9 Fixes

  • System-only messages no longer crash — Jinja2 TemplateError ("No user query found") now returns HTTP 400 with descriptive error instead of HTTP 500. Covers streaming, non-streaming, and Anthropic endpoints.
  • prompt_tokens now reported on /v1/completions — was always 0, now correctly counts prompt tokens. Fixes token usage tracking in Continue.dev, LiteLLM, and other tools.
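
A minimal sketch of the new guard (names hypothetical; the real fix maps the Jinja2 TemplateError to an HTTP 400 response):

```python
def validate_messages(messages):
    """Reject message lists with no user turn up front, so the chat
    template never raises "No user query found" as an HTTP 500."""
    if not any(m.get("role") == "user" for m in messages):
        # surfaced to the client as HTTP 400 with a descriptive body
        raise ValueError("messages must contain at least one 'user' message")
    return messages
```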

v0.3.8 Fixes (included)

  • Stop sequence truncation (P1) — both streaming and non-streaming
  • n > 1 / max_tokens < 1 / temperature > 2 validation (P2)
  • <|eom_id|> / <|python_tag|> Llama token stripping (P2)
  • completion_tokens off-by-one in non-streaming (P3)
  • uvicorn timeout_keep_alive=30 for agentic clients (P2)

Framework Compatibility

| Framework | Stars | Status |
|---|---|---|
| Aider | 95k | 5/5 tasks PASS |
| LangChain | 113k | 5/5 PASS (chains, agents, tools) |
| AutoGen | 55k | PASS (multi-agent + tool calling) |
| CrewAI | 46k | PASS |
| LiteLLM | 39k | 7/8 PASS |
| Semantic Kernel | 27k | PASS |
| LlamaIndex | 40k | PASS |
| Cline patterns | 59k | PASS |
| Continue.dev patterns | 20k | PASS |
| OpenCode | 120k | PASS |

v0.3.8

21 Mar 16:48

Fixes (8 bugs from autoresearch deep testing)

20 rounds of integration testing across OpenAI SDK, Aider, LangChain, LiteLLM, Cline, and OpenCode patterns found and fixed:

P1 — Critical

  • Stop sequences not forwarded — stream_chat() and chat() in SimpleEngine accepted stop but never passed it downstream
  • Stop sequences included in output — OpenAI spec requires truncation at the stop point; now properly truncated in both streaming and non-streaming
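
The truncation rule can be sketched with a small helper (illustrative, not the project's actual code): cut at the earliest match of any stop sequence and exclude the stop text itself.

```python
def truncate_at_stop(text, stop_sequences):
    """Truncate at the first occurrence of any stop sequence, excluding
    the stop sequence from the output, per the OpenAI spec."""
    cut = len(text)
    for stop in stop_sequences or []:
        idx = text.find(stop)
        if idx != -1 and idx < cut:
            cut = idx
    return text[:cut]
```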

P2 — Important

  • n > 1 silently ignored — now returns HTTP 400 with clear message
  • Negative max_tokens accepted — now validates ≥ 1
  • Temperature out of range — now validates 0–2 per OpenAI spec
  • Llama special tokens leaking — <|eom_id|> and <|python_tag|> now stripped
  • uvicorn keep-alive too short — increased to 30s for agentic long-poll clients (Aider, Cline)
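
The three validation bullets above could look roughly like this (a hedged sketch; parameter handling in the actual server may differ):

```python
def validate_params(n=1, max_tokens=None, temperature=1.0):
    """Sketch of the request validations: each failure is surfaced
    to the client as HTTP 400 with a clear message."""
    if n != 1:
        raise ValueError("n > 1 is not supported")
    if max_tokens is not None and max_tokens < 1:
        raise ValueError("max_tokens must be >= 1")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
```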

P3 — Minor

  • completion_tokens off-by-one — non-streaming path re-encoded output text; now collects actual token IDs from generation

All fixes verified against live server. 6 files changed, +54/−11.

v0.3.7

21 Mar 15:20

What's New

Extensive fresh-user testing (4 rounds, 4 personas) found and fixed critical onboarding issues.

Critical Fixes

  • Auto-detection wired to CLI — rapid-mlx serve now auto-detects reasoning and tool parsers from model name. Previously only python -m vllm_mlx.server had this, causing think tags to leak into content for all CLI users.
  • Streaming fixed with --no-thinking — enable_thinking kwarg was leaking to stream_generate(), which doesn't accept it, crashing all streaming requests. Fixed for both LLM and MLLM paths.
  • --no-thinking actually works — previously only disabled the parser; model still generated think tokens. Now forces enable_thinking=False in the chat template across all endpoints.
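
The kwarg-leak fix reduces to routing `enable_thinking` to the chat template only; a hypothetical sketch (names illustrative):

```python
def split_generation_kwargs(kwargs):
    """Sketch: `enable_thinking` belongs to the chat template, not to
    stream_generate(), which doesn't accept it and would crash."""
    template_kwargs = {}
    if "enable_thinking" in kwargs:
        template_kwargs["enable_thinking"] = kwargs.pop("enable_thinking")
    return template_kwargs, kwargs
```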

UX Improvements

  • --version / -V flag — rapid-mlx --version now works
  • Port check before model load — detects port conflicts before wasting 5GB+ RAM loading the model
  • Clean error for missing models — shows helpful message instead of 48-line Python traceback
  • Empty messages returns 400 — instead of 500 Internal Server Error
  • owned_by: "rapid-mlx" — rebranded from old vllm-mlx name in API responses
  • README curl example — removed ! that caused zsh history expansion errors
  • 36GB/48GB RAM rows in "What fits my Mac?" table (M3/M4 Pro configs)
  • Thinking model tip in Quick Start for first-time users

v0.3.6

21 Mar 14:04

Fixes

  • Standalone Python installer — updated from defunct indygreg/python-build-standalone to astral-sh/python-build-standalone (cpython 3.12.13); users without Homebrew can now install again
  • Metal shader warmup — moved to FastAPI lifespan hook so it works for all engine types (simple, batched, hybrid), not just SimpleEngine
  • Think tag leakage — Anthropic /v1/messages endpoint now uses the reasoning parser to strip <think> tags from streaming and non-streaming responses
  • Disk space check — warns before model download if available disk is insufficient
  • generate_warmup() added to BatchedEngine and HybridEngine (previously inherited no-op from base)

v0.3.5 — UX Polish

21 Mar 13:10

What's New

API Response Clarity

  • Responses return actual model name — no longer echoes back arbitrary client input. Sending model: "gpt-4o" now correctly returns the real loaded model (e.g. mlx-community/Llama-3.2-3B-Instruct-4bit), preventing confusion about which model is running.
  • /v1/models lists alias + full name — if you started with rapid-mlx serve llama3-3b, the models endpoint shows both llama3-3b and the full HuggingFace path.
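
Both behaviors can be sketched together (response fields follow the OpenAI /v1/models shape; the functions themselves are illustrative):

```python
def models_list(alias, full_name):
    """Sketch: /v1/models advertises both the short alias and the
    full HuggingFace path of the loaded model."""
    return {"object": "list", "data": [
        {"id": name, "object": "model", "owned_by": "rapid-mlx"}
        for name in (alias, full_name)
    ]}

def response_model_field(requested_model, loaded_model):
    """Sketch: responses report the model actually loaded,
    never an echo of the client's `model` field."""
    return loaded_model
```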

Quieter Defaults

  • Security warning removed from default output — the SECURITY WARNING: Server running without API key message is now debug-level only. Local inference doesn't need auth; the warning was causing unnecessary anxiety for new users.

Full Changelog

  • fix: UX friction — model name echo, security warning, alias in /v1/models (#41)

v0.3.4 — Onboarding Polish

21 Mar 12:59

What's New

Onboarding Fixes (from real user testing)

  • Homebrew install fixed — was missing uvicorn/fastapi deps, now installs from PyPI correctly
  • Quick Start rewritten — 3 clear steps with separate code blocks, bold "open second terminal" instruction
  • pip on macOS — instructions now include venv creation (required on Sonoma+)
  • install.sh — post-install message uses short alias (qwen3.5-9b) instead of full HF path
  • Vision deps — install instructions now cover all 3 methods (install.sh, pip, brew)
  • "What's next" links — Quick Start now guides users to Choose Your Model and Works With sections
  • CLI — bench-kv-cache output prints rapid-mlx instead of legacy vllm-mlx

Full Changelog

  • fix: onboarding P0 — pip venv + model field clarity + 9 friction points (#40)

v0.3.3 — Model Aliases

21 Mar 12:39

What's New

Model Alias Registry

No more memorizing HuggingFace paths. Use short names:

```shell
rapid-mlx serve qwen3.5-9b --port 8000
# resolves to mlx-community/Qwen3.5-9B-4bit
```

List all 20 aliases with rapid-mlx models.
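
Resolution presumably reduces to a dictionary lookup with pass-through for full paths; a sketch using the one mapping shown above (the real registry has 20 entries):

```python
# Excerpt only — illustrative; the actual registry holds 20 aliases.
ALIASES = {"qwen3.5-9b": "mlx-community/Qwen3.5-9B-4bit"}

def resolve_model(name):
    """Return the full HuggingFace path for a known alias;
    unknown names pass through unchanged (assumed full paths)."""
    return ALIASES.get(name, name)
```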

README Polish

  • Vision install instructions now cover both install.sh and pip paths
  • Benchmark header clarified (mlx-lm serve, not vllm-mlx)
  • Copy-paste commands use short aliases
  • Minor clarity improvements for new users

Full Changelog

  • feat: model alias registry (#39)
  • fix: README polish for non-tech users (#38)