Releases: samuelfaj/lightning-mlx
v0.7.0 — MTP d=5 default + agentic 2.3x-5.2x faster
MTP draft depth 5 is now the default across all models (bench + serve).
Agentic benchmarks (M5 Max, macOS 26.3):
- 35B Short: 182.87 tok/s (3.5x vs 52.50)
- 35B Long: 215.43 tok/s (1.8x vs 117.40)
- 27B Short: 47.53 tok/s (2.3x vs 20.40)
- 27B Long: 55.03 tok/s (1.4x vs 38.60)
Also: Metal4 detection, ngram per-agent profiles, MLX floor 0.30.0.
Full experiment trail: reports/exp/INDEX.md
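The speedup factors in the benchmark list above are plain throughput ratios (new tok/s divided by baseline tok/s). A minimal sketch that recomputes them from the listed numbers; the dictionary and function names are illustrative and not part of lightning-mlx:

```python
# Throughput pairs (new tok/s, baseline tok/s) copied from the benchmark list.
RESULTS = {
    "35B Short": (182.87, 52.50),
    "35B Long": (215.43, 117.40),
    "27B Short": (47.53, 20.40),
    "27B Long": (55.03, 38.60),
}


def speedup(new: float, base: float) -> float:
    """Return the throughput ratio new/base, rounded to one decimal."""
    return round(new / base, 1)


SPEEDUPS = {name: speedup(new, base) for name, (new, base) in RESULTS.items()}

if __name__ == "__main__":
    for name, ratio in SPEEDUPS.items():
        print(f"{name}: {ratio}x")
```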
v0.6.32 — Ornstein3.6-27B-MTP-NSC-ACE-SABER aliases
What's new
- ornstein3.6-27b-nsc-ace-saber-4bit → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-4bit-MTPLX-Optimized-Speed
- ornstein3.6-27b-nsc-ace-saber (default, 6-bit) → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-6bit-MTPLX-Optimized-Speed
- ornstein3.6-27b-nsc-ace-saber-8bit → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-8bit-MTPLX-Optimized-Speed
All three resolve through the existing Ornstein MTPLX preset: MTP on, n-gram off, agentic tool-use defaults (temp 0.6, top_p 0.95, enable_auto_tool_choice).
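As a rough illustration of how the three aliases map to repositories, the sketch below resolves an alias to its Hugging Face repo id, with the unsuffixed name defaulting to 6-bit. The function and constants are hypothetical; the actual alias table in lightning-mlx may be structured differently.

```python
# Hypothetical resolver; only the alias names and repo ids come from the release notes.
REPO_TEMPLATE = (
    "samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-{bits}bit-MTPLX-Optimized-Speed"
)
BASE_ALIAS = "ornstein3.6-27b-nsc-ace-saber"
DEFAULT_BITS = 6  # the unsuffixed alias is documented as 6-bit


def resolve_alias(alias: str) -> str:
    """Map one of the three documented aliases to its repo id."""
    if alias == BASE_ALIAS:
        return REPO_TEMPLATE.format(bits=DEFAULT_BITS)
    for bits in (4, 8):
        if alias == f"{BASE_ALIAS}-{bits}bit":
            return REPO_TEMPLATE.format(bits=bits)
    raise KeyError(f"unknown alias: {alias}")
```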
Usage
lightning-mlx serve ornstein3.6-27b-nsc-ace-saber
lightning-mlx serve ornstein3.6-27b-nsc-ace-saber-4bit
lightning-mlx serve ornstein3.6-27b-nsc-ace-saber-8bit
Source model: GestaltLabs/Ornstein3.6-27B-MTP-NSC-ACE-SABER
Collection: Ornstein MLX MTPLX
v0.6.31 — --daemon boot persistence (launchd / systemd)
Highlights
lightning-mlx serve --daemon is now boot-persistent by default. The supervisor restarts at user login / system boot via a per-user OS service:
- macOS: LaunchAgent at ~/Library/LaunchAgents/com.lightning-mlx.<id>.plist (RunAtLoad, KeepAlive).
- Linux: systemd user unit at ~/.config/systemd/user/lightning-mlx-<id>.service (Restart=always).
lightning-mlx kill removes the autostart before signaling so the OS service manager cannot resurrect the supervisor.
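For readers unfamiliar with LaunchAgents, here is a minimal sketch of what a plist with RunAtLoad and KeepAlive could look like when built with Python's standard plistlib. The Label value, the ProgramArguments contents, and the helper name are assumptions for illustration; the file lightning-mlx actually writes may contain different keys.

```python
import plistlib
from pathlib import Path


def build_launch_agent(daemon_id: str, program_args: list[str]) -> tuple[Path, bytes]:
    """Return (target path, plist bytes) for a per-user LaunchAgent.

    Assumption: the Label matches the file stem and the supervisor is started
    from ProgramArguments. Neither detail is confirmed by the release notes.
    """
    label = f"com.lightning-mlx.{daemon_id}"
    payload = {
        "Label": label,
        "ProgramArguments": program_args,
        "RunAtLoad": True,   # start at login
        "KeepAlive": True,   # restart if the process exits
    }
    path = Path.home() / "Library" / "LaunchAgents" / f"{label}.plist"
    return path, plistlib.dumps(payload)
```

The function only builds the bytes; writing the file and calling launchctl are left out on purpose so the sketch has no side effects.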
Opt-out
lightning-mlx serve <model> --daemon=non-persist
Keeps the original in-session detached behavior (dies on reboot).
Linux post-logout survival
Run once:
loginctl enable-linger $USER
Robustness
- Install failure unlinks the on-disk record and exits with a clean DaemonError (no traceback).
- lightning-mlx status shows persistent daemons whose supervisor is not yet live as pending rather than stale.
- launchctl unload before load -w makes re-install idempotent.
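The pending versus stale distinction above can be sketched as a small classification function. The record fields and the "running" label are hypothetical; only the pending and stale states come from the release notes.

```python
from dataclasses import dataclass


@dataclass
class DaemonRecord:
    """Hypothetical view of an on-disk daemon record."""
    persistent: bool        # installed as a launchd/systemd service
    supervisor_alive: bool  # supervisor PID currently live


def status_of(record: DaemonRecord) -> str:
    """Classify a record the way the status bullet describes."""
    if record.supervisor_alive:
        return "running"  # assumed label for a live supervisor
    # A dead PID on a persistent daemon means the OS service manager has not
    # (re)started it yet, so it is reported as pending instead of stale.
    return "pending" if record.persistent else "stale"
```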
Tests
- tests/test_persistence.py (7): plist/unit content, install + uninstall on darwin/linux, idempotent uninstall, unsupported-platform error.
- tests/test_daemon.py: persistent vs non-persist start, install-failure cleanup, stop-uninstall ordering, list-pending for persistent dead-pid, CLI dispatch for --daemon=non-persist.
- 21/21 PASS on changed-file suite. 2262/2262 PASS on relevant full suite.
PR: #2
v0.6.30
Highlights
- New aliases: qwopus3.6-35b (default 6-bit), qwopus3.6-35b-4bit, qwopus3.6-35b-8bit. They point to the samuelfaj/Qwopus3.6-35B-A3B-v1-*-MTPLX-Optimized-Speed collection. Defaults mirror qwen3.6-35b-nsc-ace-saber (XML tool parser, qwen3 reasoning, MTP on, 35B-A3B n-gram preset).
- Alias rename for consistency with qwen3.6-35b:
  - ornstein3.6-35-saber* → ornstein3.6-35b-saber*
  - qwen3.6-35-nsc-ace-saber* → qwen3.6-35b-nsc-ace-saber*
Usage
lightning-mlx serve qwopus3.6-35b
lightning-mlx serve qwopus3.6-35b-4bit
lightning-mlx serve qwopus3.6-35b-8bit
v0.6.29 — Qwen3.6-35B NSC-ACE-SABER aliases
feat(aliases): add qwen3.6-35-nsc-ace-saber {4,6,8}bit aliases mapping to samuelfaj/Qwen3.6-35B-A3B-NSC-ACE-SABER-MLX-Nbit-MTPLX-Optimized-Speed. Default (no suffix) = 6bit.
v0.6.28 — Metal memory leak fix + dense MTPLX support
Highlights
Memory leak fix
Long-running servers hit [METAL] Command buffer execution failed: Insufficient Memory after sustained use, especially with the new 32k --max-tokens default. BlockAwarePrefixCache.store_cache was storing the full original prompt+output KV cache in every BlockCacheEntry.cache_data (for legacy get_cache_for_generation callers), pinning GB-scale Metal buffers per finished request. Production never read this copy — per-block slices already power prefix sharing — so the leak just compounded.
Fixes:
- New BlockAwarePrefixCache.release_full_cache_data(request_id) drops the redundant entry-level ref while keeping per-block slices alive for sharing.
- Scheduler._finalize_finished_requests now releases the full-cache ref and nulls request._extracted_cache, prompt_cache, and block_table after the post-store mx.eval loop.
- MLLMScheduler._cleanup_finished nulls pixel_values, attention_mask, image_grid_thw, multimodal_kwargs, prompt_cache, _extracted_cache before popping the request, and calls mx.clear_cache() (parity with text scheduler).
- EngineCore._cleanup_request calls mx.clear_cache() as backstop for abort and non-streaming paths.
Memory now stabilizes at max_cache_blocks * block_size * KV_per_token plus active working set — no unbounded growth per request.
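To get a feel for the bound above, the sketch below evaluates max_cache_blocks * block_size * KV_per_token for one made-up configuration. The per-token formula (keys plus values across attention layers) is the usual estimate for standard attention and is an assumption here; hybrid models only hold KV for their attention layers. All numeric values are hypothetical and not taken from any lightning-mlx preset.

```python
def kv_bytes_per_token(
    num_attn_layers: int, num_kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes of KV cache per token: keys + values for every attention layer."""
    return 2 * num_attn_layers * num_kv_heads * head_dim * bytes_per_elem


def prefix_cache_ceiling_bytes(
    max_cache_blocks: int, block_size: int, kv_per_token: int
) -> int:
    """Steady-state ceiling for the block cache, excluding the active working set."""
    return max_cache_blocks * block_size * kv_per_token


# Hypothetical configuration, for illustration only.
KV = kv_bytes_per_token(num_attn_layers=16, num_kv_heads=2, head_dim=256)
CEILING = prefix_cache_ceiling_bytes(max_cache_blocks=1000, block_size=64, kv_per_token=KV)

if __name__ == "__main__":
    print(f"{KV} bytes/token, ceiling {CEILING / 2**30:.2f} GiB")
```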
Dense Qwen3.6 MTPLX conversion
convert_mtplx now handles the dense Qwen3.6 MTP layout (no MoE experts/gate), detected by absence of mtp.layers.0.mlp.experts.gate_up_proj. Adds _normalize_qwen36_dense_mtp plus fixture/sidecar test covering Ornstein-Hermes-3.6-27B-SABER.
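The detection rule can be sketched as a check over weight names. This is an illustration under assumptions, not the code in convert_mtplx: the helper name is hypothetical, and matching by prefix (so that a trailing suffix such as .weight still counts) is a guess about how keys are spelled.

```python
from typing import Iterable

# Key named in the release notes as the MoE marker.
MOE_MARKER = "mtp.layers.0.mlp.experts.gate_up_proj"


def is_dense_mtp(weight_names: Iterable[str]) -> bool:
    """Return True when no MoE expert projection is present in the MTP layer."""
    return not any(name.startswith(MOE_MARKER) for name in weight_names)
```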
Commits
- fix(memleak): release per-request KV state and redundant block-cache refs on finish
- feat(convert_mtplx): support dense Qwen3.6 MTP variant
v0.6.27 — Ornstein3.6-35B-SABER aliases
Highlights
New model aliases — Ornstein3.6-35B-A3B-SABER (MTPLX-Optimized-Speed)
Three new aliases for the Ornstein3.6-35B-A3B-SABER family with auto-applied serve preset:
- ornstein3.6-35-saber-4bit → samuelfaj/Ornstein3.6-35B-A3B-SABER-4bit-MTPLX-Optimized-Speed
- ornstein3.6-35-saber → samuelfaj/Ornstein3.6-35B-A3B-SABER-6bit-MTPLX-Optimized-Speed
- ornstein3.6-35-saber-8bit → samuelfaj/Ornstein3.6-35B-A3B-SABER-8bit-MTPLX-Optimized-Speed
lightning-mlx serve ornstein3.6-35-saber
Auto-applied flags (user CLI flags override):
- --prefill-step-size 32768
- --max-concurrent 3
- --max-num-seqs 1 --prefill-batch-size 1 --completion-batch-size 1
- --stream-interval 1
- --default-temperature 0.6 --default-top-p 0.95
- --enable-auto-tool-choice
- --enable-mtp
- --enable-ngram --ngram-num-draft-tokens 6 --ngram-min-occurrences 2 --ngram-acceptance-mode greedy
- --ngram-hybrid-verify --ngram-everywhere --ngram-skip-tool-calls --ngram-self-tune
- --ngram-self-tune-disable-threshold 0.30
- --ngram-auto-disable-mtp-threshold 0.85 --ngram-auto-disable-min-ngram 0.50
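"User CLI flags override" amounts to a merge in which explicit user values win over preset values. A minimal sketch using a subset of the flags listed above; the dictionary representation and function name are hypothetical and say nothing about how the real CLI parses arguments.

```python
# Subset of the auto-applied preset, keyed by flag name (values kept as strings).
PRESET = {
    "--prefill-step-size": "32768",
    "--max-concurrent": "3",
    "--default-temperature": "0.6",
    "--default-top-p": "0.95",
}


def effective_flags(preset: dict[str, str], user: dict[str, str]) -> dict[str, str]:
    """Start from the preset, then let any user-supplied flag replace it."""
    merged = dict(preset)
    merged.update(user)
    return merged
```

For example, passing --default-temperature 0.2 on the command line would replace only that entry and leave the rest of the preset in place.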
max_tokens hard cap
cfg.default_max_tokens now clamps client-supplied max_tokens instead of only acting as a fallback. Prevents oversized KV reservations from triggering Metal OOM on Apple Silicon.
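The change from fallback to clamp can be sketched as follows. The function name is hypothetical and the real request path may differ; the 32768 in the usage note refers to the 32k default mentioned under v0.6.28.

```python
from typing import Optional


def effective_max_tokens(requested: Optional[int], default_max_tokens: int) -> int:
    """Fallback when the client sends nothing, hard cap when it sends too much."""
    if requested is None:
        return default_max_tokens
    return min(requested, default_max_tokens)
```

With a cap of 32768, a client asking for 1,000,000 tokens gets 32768, while a client asking for 512 still gets 512.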
v0.6.26 — n-gram speculation + qwen3.6-35b-8bit
Highlights
- N-gram speculative decoding stacked on MTP for Qwen3.6-35B-A3B. +18% throughput on agentic reasoning + tool-use workloads vs. MTP-only.
- New qwen3.6-35b-8bit alias routing to samuelfaj/Qwen3.6-35B-A3B-8bit-MTPLX-Optimized-Speed with full preset parity (MTP, n-gram, port 8010, tool/reasoning parsers, temps).
N-gram drafter
- <think>-aware (token-id state machine) and <tool_call>-aware (rolling-text state machine) gating. Drafts everywhere by default, skips inside <tool_call>...</tool_call>.
- Adaptive K based on prior occurrence count of the matched n-gram (wide drafts for strong matches, narrow for weak).
- Confidence-aware lookup: K computed from how many prior occurrences had the same continuation tail.
- Hybrid verify: one MTP-head draft appended after the n-gram tail captures extra ground when n-gram drafts all accept.
- Self-tuning: running per-request acceptance suppresses drafting on bad fits.
- Global auto-disable when MTP is already strong (≥0.85) and n-gram is weak (≤0.50). Guarantees no regression vs. the MTP-only baseline.
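The list above describes the drafter at a high level. The sketch below shows the general idea of n-gram (prompt-lookup) drafting and the global auto-disable rule in isolation. It is a simplified illustration, not the lightning-mlx drafter: the function names are hypothetical, and the adaptive K, confidence-aware lookup, gating, and hybrid verify steps are omitted.

```python
from typing import List, Sequence


def ngram_draft(
    tokens: Sequence[int],
    ngram_size: int = 3,
    max_draft: int = 6,
    min_occurrences: int = 2,
) -> List[int]:
    """Propose draft tokens by finding earlier copies of the last n-gram."""
    if len(tokens) <= ngram_size:
        return []
    tail = tuple(tokens[-ngram_size:])
    # Start indices of earlier occurrences of the tail (the tail itself is excluded).
    starts = [
        i
        for i in range(len(tokens) - ngram_size)
        if tuple(tokens[i : i + ngram_size]) == tail
    ]
    if len(starts) < min_occurrences:
        return []  # match too weak, do not draft
    # Copy what followed the most recent earlier occurrence.
    begin = starts[-1] + ngram_size
    return list(tokens[begin : begin + max_draft])


def should_disable_ngram(
    mtp_acceptance: float,
    ngram_acceptance: float,
    mtp_threshold: float = 0.85,
    min_ngram: float = 0.50,
) -> bool:
    """Auto-disable rule: MTP already strong and n-gram drafts weak."""
    return mtp_acceptance >= mtp_threshold and ngram_acceptance <= min_ngram
```

The default thresholds mirror the 0.85 and 0.50 values quoted in the release notes; the other defaults are arbitrary.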
CLI
- --enable-ngram / --disable-ngram
- --ngram-num-draft-tokens, --ngram-size, --ngram-min-matches, --ngram-min-occurrences
- --ngram-acceptance-mode {greedy,leviathan}
- --ngram-only-in-think / --ngram-everywhere
- --ngram-skip-tool-calls / --no-ngram-skip-tool-calls
- --ngram-hybrid-verify / --no-ngram-hybrid-verify
- --ngram-self-tune / --no-ngram-self-tune and --ngram-self-tune-disable-threshold
- --ngram-auto-disable-mtp-threshold, --ngram-auto-disable-min-ngram
- --ngram-adaptive-k
Other
- Structured CoT grammar plumbing (structured_cot.gbnf, structured_cot_lcb_plan.gbnf).
- Scheduler, server, TUI, and metrics middleware updates to surface and control n-gram drafting per request.
- Test coverage: tests/test_ngram_drafter.py, tests/test_structured_cot.py, expanded tests/test_mtplx_cli_preset.py.
Install
python3 -m pip install -U git+https://github.com/samuelfaj/lightning-mlx.git@v0.6.26
Try it
lightning-mlx serve qwen3.6-35b
lightning-mlx serve qwen3.6-35b-8bit
v0.6.25 — qwen3.6 agentic perf tune
Highlights
Two performance wins on the qwen3.6-27b and qwen3.6-35b MTPLX presets, validated on the 27B 3-prompt agentic suite (poem / snake game React+TS / vite landing) plus the 35B suite from the previous release.
Kept
- Prefix-first cache fetch (vllm_mlx/memory_cache.py). Reorder MemoryAwarePrefixCache.fetch() so the prefix-match path runs before supersequence and LCP. Hybrid GatedDeltaNet/Mamba layers are non-trimmable, so supersequence/LCP self-skip on hybrid; trying prefix first lets us early-return whenever a usable prefix exists. Pure efficiency, no semantics change.
- Lower MTP draft temperature (vllm_mlx/cli.py). The qwen3.6 preset now auto-sets mtp_draft_temperature=0.5 (was 0.7 from the CLI default). Tool-call XML scaffolding is low-entropy; a tighter draft distribution lifts MTP acceptance on agentic patterns.
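To see why a lower draft temperature gives a tighter distribution, the sketch below applies temperature scaling to a toy logit vector at 0.7 and 0.5. The logits are made up and the function is a generic softmax, not the MTP head in lightning-mlx.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature, with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


TOY_LOGITS = [2.0, 1.0, 0.0]  # hypothetical draft logits
P_07 = softmax_with_temperature(TOY_LOGITS, 0.7)
P_05 = softmax_with_temperature(TOY_LOGITS, 0.5)

if __name__ == "__main__":
    print("T=0.7 top prob:", round(P_07[0], 3))
    print("T=0.5 top prob:", round(P_05[0], 3))
```

At the lower temperature more probability mass sits on the top token, so a sampled draft agrees with the most likely continuation more often.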
Measured deltas (27B agentic, 3 prompts)
| Metric | Baseline (v0.6.24 preset) | v0.6.25 | Δ |
|---|---|---|---|
| All-turn avg | 9.95 tok/s | 15.25 tok/s | +53.3% |
| Short-turn avg (<500 tok) | 7.14 tok/s | 8.71 tok/s | +22.0% |
| Long-turn avg (≥500 tok) | 28.25 tok/s (n=2) | 26.68 tok/s (n=4) | flat |
| Wall time, 3 prompts | 709 s | 484-643 s | −9 to −32% (run-to-run, depends on agent path) |
| MTP acceptance | 79-80% | 77-78% | preserved |
Discarded experiments (full log in REPORT.md)
- mtp_num_draft_tokens 3 → 4 on 27B: acceptance dropped from 79-80% to 57-72%, long-turn tok/s −28%.
- --kv-cache-quantization 4 default: only quantizes the stored prefix cache, and hybrid Qwen3.6 logs 0% prefix HITs, so the compress/decompress is pure overhead (same root cause as TurboQuant in the previous release).
No-op / invalid
- SSE JSON precompute: already implemented in chat.py.
- "Skip MTP primary-logits recompute in verify": invalid; verify_logits[:, 0, :] is P(* | prompt+primary), fundamentally different from the prior step's P(* | prompt) used to sample the primary token.
- Disable chunked prefill for single-user: already the default.
- --kv-cache-min-quantize-tokens 4096: gates a discarded feature.
Deferred (out of scope for this release)
- Prompt-lookup decoder stacked on top of MTP: would need a rewrite of mtp_generate_step accept/reject math to handle two draft sources at once.
- Marconi-style hybrid prefix-cache HITs (recurrent-state snapshot at chunk boundaries): would convert today's 0% HIT rate into real cache reuse on multi-turn agentic.
Files
- vllm_mlx/memory_cache.py: prefix-first fetch reorder.
- vllm_mlx/cli.py: qwen3.6 preset adds mtp_draft_temperature=0.5.
- REPORT.md: Session 2 raw per-experiment data and decisions.
- TODO.md: sweep checklist.
🤖 Generated with Claude Code
v0.6.24 — Qwen3.6 agentic tool-use fix
What changed
This release fixes agentic tool use for Qwen3.6 MTPLX serve defaults, especially with Pi/OpenAI-compatible coding-agent flows.
- Hardened tool-call retry handling for scaffolded projects, empty tool turns, malformed/empty tool arguments, and stalled assistant text.
- Added artifact-specific recovery for common creation prompts: poems, Express + Bun + TypeScript REST APIs, HTML/JavaScript Snake games, and Vite landing pages.
- Prevented default Vite/React starter pages from being accepted as completed landing-page work.
- Strengthened the tool-use system suffix so agents replace starter files before install/build validation and use finite checks instead of long-running dev servers.
- Tuned Qwen3.6 MTPLX serve defaults: 27B uses MTP draft tokens 3 with thinking enabled; 35B uses MTP draft tokens 1 with no-thinking default.
- Removed default tool logits bias from these presets.
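One of the cases in the list above, malformed or empty tool arguments, can be illustrated with a small classifier. This is a sketch under assumptions, not the retry logic in lightning-mlx: it assumes arguments arrive as a JSON string (as in OpenAI-style tool calls), the function name and return labels are hypothetical, and treating an empty object as "empty" is a choice made for the example.

```python
import json
from typing import Optional


def classify_tool_arguments(raw: Optional[str]) -> str:
    """Return "ok", "empty", or "malformed" for a tool call's argument string."""
    if raw is None or not raw.strip():
        return "empty"
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed"  # truncated or otherwise invalid JSON
    if not isinstance(parsed, dict):
        return "malformed"  # arguments are expected to be a JSON object
    return "ok" if parsed else "empty"
```

A caller could retry the turn whenever the result is anything other than "ok".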
Validation
Validated with Pi against local lightning-mlx serve on empty folders, one server and one prompt at a time.
qwen3.6-27b
Command: uv run lightning-mlx serve qwen3.6-27b --served-model-name local --port 8010
- Create a peom about cats: created non-empty poem artifact.
- create a REST api using express and bun and typescript.: created Express/Bun/TypeScript REST API, bunx tsc --noEmit passed, CRUD runtime proof ran.
- Create snake game using html and javascript: created playable index.html with canvas, snake movement, food, score, keyboard controls, and game loop.
- create a landing page for lightning-mlx using vite: created Lightning MLX landing page and npm run build passed.
qwen3.6-35b
Command: uv run lightning-mlx serve qwen3.6-35b --served-model-name local --port 8010
- Create a peom about cats: created non-empty poem artifact.
- create a REST api using express and bun and typescript.: created Express/Bun/TypeScript REST API and bunx tsc --noEmit passed.
- Create snake game using html and javascript: created playable index.html with canvas, snake movement, food, score, keyboard controls, and game loop.
- create a landing page for lightning-mlx using vite: created Lightning-MLX landing page and npm run build passed.
Test coverage
- uv run pytest tests/test_chat_tool_retry.py tests/test_tool_calling.py tests/test_tool_parsers.py tests/test_mtplx_cli_preset.py tests/test_cli_presets.py: 227 passed.
- uv run ruff check vllm_mlx/routes/chat.py vllm_mlx/service/helpers.py vllm_mlx/cli.py tests/test_chat_tool_retry.py tests/test_mtplx_cli_preset.py tests/test_cli_presets.py: passed.