
@raullenchai released this 20 Mar 19:13 · 37 commits to main since this release · 6589519

Rapid-MLX v0.3.0

A major performance and compatibility release: 140 commits since v0.2.6.

Highlights

  • DeltaNet state snapshots — 1.5-4.3x TTFT (time to first token) speedup for Qwen3.5 hybrid RNN models; the first prompt-cache implementation for non-trimmable architectures on MLX.
  • MTP (multi-token prediction) — 1.4x decode throughput from optimistic decoding in SimpleEngine.
  • Tool injection fallback — system-prompt injection for models with broken chat templates; Mistral, Gemma, and Devstral go from 0% to 100% tool-calling success.
  • SSE streaming optimization — pre-computed response templates plus micro-optimizations, for a +10.5% composite improvement.
  • Auto parser detection — tool call and reasoning parsers auto-detected from model name. No more manual --tool-call-parser flags for supported families.
  • 22 models benchmarked across 6 engines (Rapid-MLX, upstream vllm-mlx, mlx-lm, oMLX, Ollama, llama.cpp).
  • CI expanded — 9 new test files added (15 → 24 test files in CI).
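To illustrate the pre-computed SSE template idea from the highlights above (a minimal sketch; all function names and structure here are assumptions for illustration, not Rapid-MLX internals): the invariant parts of each `data:` frame can be rendered once per request, so the per-token hot path does a single JSON-escape and string concatenation.

```python
import json

def make_chunk_template(request_id, model):
    # Render the invariant parts of an OpenAI-style streaming chunk once.
    head = {
        "id": request_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": ""}}],
    }
    rendered = json.dumps(head, separators=(",", ":"))
    # Split around the empty content field; tokens are spliced in later.
    prefix, suffix = rendered.split('"content":""')
    return prefix + '"content":', suffix

def render_sse(prefix, suffix, token):
    # Per-token work: one JSON-escape of the token text, one concat.
    return f"data: {prefix}{json.dumps(token)}{suffix}\n\n"
```

The trade-off is standard: instead of serializing a full dict per token, the loop only escapes the new token text, which is where a composite throughput gain of this kind would come from.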

New parsers

  • deepseek_v31 — dedicated DeepSeek V3.1/R1-0528 tool parser
  • kimi — Kimi-Linear tool format
  • glm47 — GLM-4.7 tool format
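Auto parser detection (see the highlights above) presumably keys off substrings of the model name. A minimal sketch of how such a lookup could work, using the parser identifiers from this release; the mapping table and function name are invented for illustration:

```python
# Hypothetical name→parser table; first match wins. The parser identifiers
# (deepseek_v31, kimi, glm47) are real names from this release; the
# substrings and lookup logic are assumptions.
_PARSER_PATTERNS = [
    ("deepseek", "deepseek_v31"),
    ("kimi", "kimi"),
    ("glm-4.7", "glm47"),
]

def detect_tool_parser(model_name):
    name = model_name.lower()
    for pattern, parser in _PARSER_PATTERNS:
        if pattern in name:
            return parser
    return None  # unknown family: fall back to a manual --tool-call-parser
```

For unmatched families, the manual `--tool-call-parser` flag mentioned in the highlights remains the fallback.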

Breaking changes

None. Drop-in upgrade from v0.2.6.

Install

pip install git+https://github.com/raullenchai/Rapid-MLX.git@v0.3.0

Full changelog: v0.2.6...v0.3.0