# Rapid-MLX v0.3.0
140 commits since v0.2.6. Major performance and compatibility release.
## Highlights
- DeltaNet state snapshots — 1.5-4.3x time-to-first-token (TTFT) speedup for Qwen3.5 hybrid RNN models; the first prompt-cache implementation for non-trimmable architectures on MLX.
- MTP multi-token prediction — 1.4x optimistic decode throughput in SimpleEngine.
- Tool injection fallback — system prompt injection for models with broken chat templates. Mistral, Gemma, and Devstral go from 0% to 100% tool calling.
- SSE streaming optimization — pre-computed templates and micro-optimizations yield a 10.5% composite improvement.
- Auto parser detection — tool call and reasoning parsers auto-detected from the model name. No more manual `--tool-call-parser` flags for supported families.
- 22 models benchmarked across 6 engines (Rapid-MLX, upstream vllm-mlx, mlx-lm, oMLX, Ollama, llama.cpp).
- CI expanded — 9 new test files added (15 → 24 test files in CI).
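The tool-injection fallback above can be sketched roughly as follows. This is a hypothetical illustration only: the function name, prompt wording, and message shape are assumptions, not Rapid-MLX's actual code. The idea is to serialize tool schemas into the system message when a model's chat template cannot render tools natively.

```python
import json

def inject_tools_into_system_prompt(messages, tools):
    """Hypothetical fallback: embed tool schemas in the system prompt
    for models whose chat templates lack native tool support."""
    tool_block = "You can call these tools:\n" + json.dumps(tools, indent=2)
    patched = list(messages)
    if patched and patched[0].get("role") == "system":
        # Append the tool definitions to the existing system message.
        patched[0] = {**patched[0],
                      "content": patched[0]["content"] + "\n\n" + tool_block}
    else:
        # No system message yet: prepend one carrying the tool schemas.
        patched.insert(0, {"role": "system", "content": tool_block})
    return patched
```

Under this scheme the model sees its tools as ordinary prompt text, which is why templates that are "broken" for native tool calling can still reach full tool-calling coverage.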
## New parsers
- `deepseek_v31` — dedicated DeepSeek V3.1/R1-0528 tool parser
- `kimi` — Kimi-Linear tool format
- `glm47` — GLM-4.7 tool format
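Auto-detection of parsers like these from the model name can be sketched as a simple substring lookup. A minimal hypothetical illustration, assuming a family-to-parser mapping; this is not Rapid-MLX's actual detection logic:

```python
from typing import Optional

# Hypothetical family -> parser mapping (illustrative only).
PARSER_BY_FAMILY = {
    "deepseek": "deepseek_v31",
    "kimi": "kimi",
    "glm": "glm47",
}

def detect_tool_parser(model_name: str) -> Optional[str]:
    """Pick a tool-call parser from the model name, or None if unknown."""
    name = model_name.lower()
    for family, parser in PARSER_BY_FAMILY.items():
        if family in name:
            return parser
    return None
```

With a lookup of this shape, `detect_tool_parser("zai-org/GLM-4.7")` would resolve to `glm47` without any manual flag.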
## Breaking changes
None. Drop-in upgrade from v0.2.6.
## Install
```shell
pip install git+https://github.com/raullenchai/Rapid-MLX.git@v0.3.0
```

Full changelog: v0.2.6...v0.3.0