# Rapid-MLX v0.3.0
140 commits since v0.2.6. Major performance and compatibility release.
## Highlights
- DeltaNet state snapshots — 1.5-4.3x time-to-first-token (TTFT) speedup for Qwen3.5 hybrid RNN models; the first prompt-cache implementation for non-trimmable architectures on MLX.
- MTP multi-token prediction — 1.4x optimistic decode throughput in SimpleEngine.
- Tool injection fallback — system prompt injection for models with broken chat templates. Mistral, Gemma, and Devstral go from 0% to 100% tool calling.
- SSE streaming optimization — pre-computed templates and micro-optimizations yield a 10.5% composite improvement.
- Auto parser detection — tool call and reasoning parsers auto-detected from the model name. No more manual `--tool-call-parser` flags for supported families.
- 22 models benchmarked across 6 engines (Rapid-MLX, upstream vllm-mlx, mlx-lm, oMLX, Ollama, llama.cpp).
- CI expanded — 9 new test files added (15 → 24 test files in CI).
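The tool-injection fallback above can be sketched roughly as follows. This is a hypothetical illustration only: the function name, prompt wording, and message shape are assumptions, not Rapid-MLX's actual code. The idea is to serialize tool schemas into the system message when a model's chat template cannot render tools natively.

```python
import json

def inject_tools_into_system_prompt(messages, tools):
    """Hypothetical fallback: embed tool schemas in the system prompt
    for models whose chat templates lack native tool support."""
    tool_block = "You can call these tools:\n" + json.dumps(tools, indent=2)
    patched = list(messages)
    if patched and patched[0].get("role") == "system":
        # Append the tool definitions to the existing system message.
        patched[0] = {**patched[0],
                      "content": patched[0]["content"] + "\n\n" + tool_block}
    else:
        # No system message yet: prepend one carrying the tool schemas.
        patched.insert(0, {"role": "system", "content": tool_block})
    return patched
```

Under this scheme the model sees its tools as ordinary prompt text, which is why templates that are "broken" for native tool calling can still reach full tool-calling coverage.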
## New parsers
- `deepseek_v31` — dedicated DeepSeek V3.1/R1-0528 tool parser
- `kimi` — Kimi-Linear tool format
- `glm47` — GLM-4.7 tool format
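Auto-detection of parsers like these from the model name can be sketched as a simple substring lookup. A minimal hypothetical illustration, assuming a family-to-parser mapping; this is not Rapid-MLX's actual detection logic:

```python
from typing import Optional

# Hypothetical family -> parser mapping (illustrative only).
PARSER_BY_FAMILY = {
    "deepseek": "deepseek_v31",
    "kimi": "kimi",
    "glm": "glm47",
}

def detect_tool_parser(model_name: str) -> Optional[str]:
    """Pick a tool-call parser from the model name, or None if unknown."""
    name = model_name.lower()
    for family, parser in PARSER_BY_FAMILY.items():
        if family in name:
            return parser
    return None
```

With a lookup of this shape, `detect_tool_parser("zai-org/GLM-4.7")` would resolve to `glm47` without any manual flag.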
## Breaking changes
None. Drop-in upgrade from v0.2.6.
## Install
```shell
pip install git+https://github.com/raullenchai/Rapid-MLX.git@v0.3.0
```

Full changelog: v0.2.6...v0.3.0