Releases: samuelfaj/lightning-mlx

v0.7.0 — MTP d=5 default + agentic 2.3x-5.2x faster

24 May 01:16

MTP draft depth 5 is now the default across all models (bench + serve).

Agentic benchmarks (M5 Max, macOS 26.3):

  • 35B Short: 182.87 tok/s (3.5x vs 52.50)
  • 35B Long: 215.43 tok/s (1.8x vs 117.40)
  • 27B Short: 47.53 tok/s (2.3x vs 20.40)
  • 27B Long: 55.03 tok/s (1.4x vs 38.60)
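
The speedup factors quoted above follow from dividing each new throughput by its baseline. A minimal check, using only the numbers from the list (the dictionary layout and function name are illustrative):

```python
# Throughput pairs from the list above: (new tok/s, baseline tok/s).
results = {
    "35B Short": (182.87, 52.50),
    "35B Long": (215.43, 117.40),
    "27B Short": (47.53, 20.40),
    "27B Long": (55.03, 38.60),
}


def speedup(new: float, old: float) -> float:
    """Ratio of new to baseline throughput, rounded to one decimal place."""
    return round(new / old, 1)


speedups = {name: speedup(new, old) for name, (new, old) in results.items()}
for name, value in speedups.items():
    print(f"{name}: {value}x")
```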

Also: Metal4 detection, per-agent n-gram profiles, and a minimum MLX version of 0.30.0.

Full experiment trail: reports/exp/INDEX.md

v0.6.32 — Ornstein3.6-27B-MTP-NSC-ACE-SABER aliases

14 May 01:59

What's new

  • ornstein3.6-27b-nsc-ace-saber-4bit → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-4bit-MTPLX-Optimized-Speed
  • ornstein3.6-27b-nsc-ace-saber (default, 6-bit) → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-6bit-MTPLX-Optimized-Speed
  • ornstein3.6-27b-nsc-ace-saber-8bit → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-8bit-MTPLX-Optimized-Speed

All three resolve through the existing Ornstein MTPLX preset: MTP on, n-gram off, agentic tool-use defaults (temp 0.6, top_p 0.95, enable_auto_tool_choice).

Usage

lightning-mlx serve ornstein3.6-27b-nsc-ace-saber
lightning-mlx serve ornstein3.6-27b-nsc-ace-saber-4bit
lightning-mlx serve ornstein3.6-27b-nsc-ace-saber-8bit

Source model: GestaltLabs/Ornstein3.6-27B-MTP-NSC-ACE-SABER

Collection: Ornstein MLX MTPLX

v0.6.31 — --daemon boot persistence (launchd / systemd)

13 May 17:20
98eeff1

Highlights

lightning-mlx serve --daemon is now boot-persistent by default. The supervisor restarts at user login / system boot via a per-user OS service:

  • macOS: LaunchAgent at ~/Library/LaunchAgents/com.lightning-mlx.<id>.plist (RunAtLoad, KeepAlive).
  • Linux: systemd user unit at ~/.config/systemd/user/lightning-mlx-<id>.service (Restart=always).

lightning-mlx kill removes the autostart before signaling so the OS service manager cannot resurrect the supervisor.
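
As a rough sketch of what the two service definitions might contain, the snippet below builds a LaunchAgent plist with RunAtLoad and KeepAlive and a systemd user unit with Restart=always. Only those keys and the file naming come from the notes above; the function names, the Label value, the WantedBy target, and the supervisor command are assumptions made for illustration, not the project's actual implementation.

```python
import plistlib
import shlex
from typing import List


def build_launch_agent(daemon_id: str, program_args: List[str]) -> bytes:
    """Serialize a LaunchAgent plist that starts at login and is kept alive."""
    payload = {
        "Label": f"com.lightning-mlx.{daemon_id}",  # assumed to mirror the file name
        "ProgramArguments": program_args,
        "RunAtLoad": True,
        "KeepAlive": True,
    }
    return plistlib.dumps(payload)


def build_systemd_unit(daemon_id: str, program_args: List[str]) -> str:
    """Render a systemd user unit that restarts the process whenever it exits."""
    return "\n".join([
        "[Unit]",
        f"Description=lightning-mlx daemon {daemon_id}",
        "",
        "[Service]",
        f"ExecStart={shlex.join(program_args)}",
        "Restart=always",
        "",
        "[Install]",
        "WantedBy=default.target",  # assumed target for a user unit
        "",
    ])


# Hypothetical supervisor command; the real arguments are not given in these notes.
args = ["/usr/local/bin/lightning-mlx", "serve", "qwen3.6-35b"]
agent = plistlib.loads(build_launch_agent("demo", args))
unit = build_systemd_unit("demo", args)
print(agent["Label"])
print(unit)
```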

Opt-out

lightning-mlx serve <model> --daemon=non-persist

Keeps the original in-session detached behavior (dies on reboot).

Linux post-logout survival

Run once:

loginctl enable-linger $USER

Robustness

  • Install failure unlinks the on-disk record and exits with a clean DaemonError (no traceback).
  • lightning-mlx status shows persistent daemons whose supervisor is not yet live as pending rather than stale.
  • launchctl unload before load -w makes re-install idempotent.

Tests

  • tests/test_persistence.py (7): plist/unit content, install + uninstall on darwin/linux, idempotent uninstall, unsupported-platform error.
  • tests/test_daemon.py: persistent vs non-persist start, install-failure cleanup, stop-uninstall ordering, list-pending for persistent dead-pid, CLI dispatch for --daemon=non-persist.
  • 21/21 PASS on changed-file suite. 2262/2262 PASS on relevant full suite.

PR: #2

v0.6.30

13 May 03:35

Highlights

  • New aliases: qwopus3.6-35b (default 6-bit), qwopus3.6-35b-4bit, qwopus3.6-35b-8bit — point to the samuelfaj/Qwopus3.6-35B-A3B-v1-*-MTPLX-Optimized-Speed collection. Defaults mirror qwen3.6-35b-nsc-ace-saber (XML tool parser, qwen3 reasoning, MTP on, 35B-A3B n-gram preset).
  • Alias rename for consistency with qwen3.6-35b:
    • ornstein3.6-35-saber* → ornstein3.6-35b-saber*
    • qwen3.6-35-nsc-ace-saber* → qwen3.6-35b-nsc-ace-saber*

Usage

lightning-mlx serve qwopus3.6-35b
lightning-mlx serve qwopus3.6-35b-4bit
lightning-mlx serve qwopus3.6-35b-8bit

v0.6.29 — Qwen3.6-35B NSC-ACE-SABER aliases

12 May 02:35

feat(aliases): add qwen3.6-35-nsc-ace-saber {4,6,8}bit aliases mapping to samuelfaj/Qwen3.6-35B-A3B-NSC-ACE-SABER-MLX-Nbit-MTPLX-Optimized-Speed. Default (no suffix) = 6bit.

v0.6.28 — Metal memory leak fix + dense MTPLX support

11 May 17:11

Highlights

Memory leak fix

Long-running servers hit [METAL] Command buffer execution failed: Insufficient Memory after sustained use, especially with the new 32k --max-tokens default. BlockAwarePrefixCache.store_cache was storing the full original prompt+output KV cache in every BlockCacheEntry.cache_data (for legacy get_cache_for_generation callers), pinning GB-scale Metal buffers per finished request. Production never read this copy — per-block slices already power prefix sharing — so the leak just compounded.

Fixes:

  • New BlockAwarePrefixCache.release_full_cache_data(request_id) drops the redundant entry-level ref while keeping per-block slices alive for sharing.
  • Scheduler._finalize_finished_requests now releases the full-cache ref and nulls request._extracted_cache, prompt_cache, and block_table after the post-store mx.eval loop.
  • MLLMScheduler._cleanup_finished nulls pixel_values, attention_mask, image_grid_thw, multimodal_kwargs, prompt_cache, _extracted_cache before popping the request, and calls mx.clear_cache() (parity with text scheduler).
  • EngineCore._cleanup_request calls mx.clear_cache() as backstop for abort and non-streaming paths.
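
The first fix can be pictured with a toy model: each entry holds per-block slices plus a redundant full copy, and releasing the full copy leaves the slices available for sharing. The classes below are simplified stand-ins written for this illustration; they are not the real BlockAwarePrefixCache or BlockCacheEntry, whose fields and signatures may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ToyBlockCacheEntry:
    block_slices: List[Any]            # per-block KV slices used for prefix sharing
    cache_data: Optional[Any] = None   # redundant full prompt+output KV copy


class ToyPrefixCache:
    def __init__(self) -> None:
        self.entries: Dict[str, ToyBlockCacheEntry] = {}

    def store_cache(self, request_id: str, full_cache: Any, block_slices: List[Any]) -> None:
        # Before the fix, the full copy stayed pinned here for every finished request.
        self.entries[request_id] = ToyBlockCacheEntry(block_slices, full_cache)

    def release_full_cache_data(self, request_id: str) -> bool:
        """Drop the entry-level reference; leave the per-block slices untouched."""
        entry = self.entries.get(request_id)
        if entry is None or entry.cache_data is None:
            return False
        entry.cache_data = None
        return True


cache = ToyPrefixCache()
cache.store_cache("req-1", full_cache=bytearray(1024), block_slices=[b"blk0", b"blk1"])
released = cache.release_full_cache_data("req-1")
print(released, cache.entries["req-1"].block_slices)
```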

Memory now stabilizes at max_cache_blocks * block_size * KV_per_token plus active working set — no unbounded growth per request.
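
To get a feel for the size of that bound, here is a back-of-the-envelope calculation. The formula for the cache term is the one stated above; the per-token KV size assumes a standard attention layout (keys plus values, per layer, per KV head), and every numeric value is a made-up example rather than a real model or server configuration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every attention layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def prefix_cache_bound(max_cache_blocks: int, block_size: int, kv_per_token: int) -> int:
    """Upper bound on prefix-cache memory: blocks * tokens per block * bytes per token."""
    return max_cache_blocks * block_size * kv_per_token


# Hypothetical configuration: 16 attention layers, 4 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=16, num_kv_heads=4, head_dim=128)
bound = prefix_cache_bound(max_cache_blocks=1024, block_size=64, kv_per_token=per_token)
print(f"{per_token} bytes/token, cache bound {bound / 2**30:.1f} GiB")
```

The active working set comes on top of this figure, and hybrid models with recurrent layers carry state that this simple estimate does not cover.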

Dense Qwen3.6 MTPLX conversion

convert_mtplx now handles the dense Qwen3.6 MTP layout (no MoE experts/gate), detected by absence of mtp.layers.0.mlp.experts.gate_up_proj. Adds _normalize_qwen36_dense_mtp plus fixture/sidecar test covering Ornstein-Hermes-3.6-27B-SABER.
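
The detection rule amounts to a membership check over the checkpoint's tensor names. A minimal sketch, assuming the names are available as strings and that a substring match is sufficient; the helper name and the example tensor names other than the marker are hypothetical:

```python
from typing import Iterable

# Marker named in the release note; its absence signals the dense layout.
MOE_MARKER = "mtp.layers.0.mlp.experts.gate_up_proj"


def is_dense_mtp_layout(weight_names: Iterable[str]) -> bool:
    """Return True when no tensor name contains the MoE expert marker."""
    return not any(MOE_MARKER in name for name in weight_names)


# Illustrative tensor names only; real checkpoints will list many more.
moe_names = ["mtp.layers.0.mlp.experts.gate_up_proj", "mtp.layers.0.mlp.gate.weight"]
dense_names = ["mtp.layers.0.mlp.gate_proj.weight", "mtp.layers.0.mlp.up_proj.weight"]
print(is_dense_mtp_layout(moe_names), is_dense_mtp_layout(dense_names))
```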

Commits

  • fix(memleak): release per-request KV state and redundant block-cache refs on finish
  • feat(convert_mtplx): support dense Qwen3.6 MTP variant

v0.6.27 — Ornstein3.6-35B-SABER aliases

11 May 13:56

Highlights

New model aliases — Ornstein3.6-35B-A3B-SABER (MTPLX-Optimized-Speed)

Three new aliases for the Ornstein3.6-35B-A3B-SABER family with auto-applied serve preset:

  • ornstein3.6-35-saber-4bit → samuelfaj/Ornstein3.6-35B-A3B-SABER-4bit-MTPLX-Optimized-Speed
  • ornstein3.6-35-saber → samuelfaj/Ornstein3.6-35B-A3B-SABER-6bit-MTPLX-Optimized-Speed
  • ornstein3.6-35-saber-8bit → samuelfaj/Ornstein3.6-35B-A3B-SABER-8bit-MTPLX-Optimized-Speed

lightning-mlx serve ornstein3.6-35-saber

Auto-applied flags (user CLI flags override):

  • --prefill-step-size 32768
  • --max-concurrent 3
  • --max-num-seqs 1 --prefill-batch-size 1 --completion-batch-size 1
  • --stream-interval 1
  • --default-temperature 0.6 --default-top-p 0.95
  • --enable-auto-tool-choice
  • --enable-mtp
  • --enable-ngram --ngram-num-draft-tokens 6 --ngram-min-occurrences 2 --ngram-acceptance-mode greedy
  • --ngram-hybrid-verify --ngram-everywhere --ngram-skip-tool-calls --ngram-self-tune
  • --ngram-self-tune-disable-threshold 0.30
  • --ngram-auto-disable-mtp-threshold 0.85 --ngram-auto-disable-min-ngram 0.50

max_tokens hard cap

cfg.default_max_tokens now clamps client-supplied max_tokens instead of only acting as a fallback. Prevents oversized KV reservations from triggering Metal OOM on Apple Silicon.
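
The change in behavior can be summarized in a few lines. This is a sketch of the clamping rule as described, not the project's code; the function name is hypothetical and 32768 is used only because the notes mention a 32k default.

```python
from typing import Optional


def resolve_max_tokens(requested: Optional[int], default_max_tokens: int) -> int:
    """Use the default when the client sends nothing, otherwise cap the request at it."""
    if requested is None:
        return default_max_tokens  # fallback role, unchanged from earlier behavior
    return min(requested, default_max_tokens)  # new: the default is also a hard cap


print(resolve_max_tokens(None, 32768))
print(resolve_max_tokens(100_000, 32768))
print(resolve_max_tokens(512, 32768))
```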

v0.6.26 — n-gram speculation + qwen3.6-35b-8bit

10 May 16:22

Highlights

  • N-gram speculative decoding stacked on MTP for Qwen3.6-35B-A3B. +18% throughput on agentic reasoning + tool-use workloads vs. MTP-only.
  • New qwen3.6-35b-8bit alias routing to samuelfaj/Qwen3.6-35B-A3B-8bit-MTPLX-Optimized-Speed with full preset parity (MTP, n-gram, port 8010, tool/reasoning parsers, temps).

N-gram drafter

  • <think>-aware (token-id state machine) and <tool_call>-aware (rolling-text state machine) gating. Drafts everywhere by default, skips inside <tool_call>...</tool_call>.
  • Adaptive K based on prior occurrence count of the matched n-gram (wide drafts for strong matches, narrow for weak).
  • Confidence-aware lookup: K computed from how many prior occurrences had the same continuation tail.
  • Hybrid verify: one MTP-head draft appended after the n-gram tail captures extra ground when n-gram drafts all accept.
  • Self-tuning: running per-request acceptance suppresses drafting on bad fits.
  • Global auto-disable when MTP is already strong (≥0.85) and n-gram is weak (≤0.50). Guarantees no regression vs. the MTP-only baseline.
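
A stripped-down illustration of two of these ideas: n-gram lookup drafting with an occurrence-based draft width, and the global auto-disable rule. The thresholds 0.85 and 0.50 come from the notes; the function names, the "twice the occurrence count" width rule, and the choice to continue from the most recent match are assumptions for this sketch and leave out the think/tool-call gating, hybrid verify, and self-tuning.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], ngram_size: int = 3,
                max_draft: int = 6, min_occurrences: int = 2) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the context."""
    if len(tokens) <= ngram_size:
        return []
    tail = tuple(tokens[-ngram_size:])
    # Start indices of earlier occurrences of the tail (the tail itself is excluded).
    starts = [i for i in range(len(tokens) - ngram_size)
              if tuple(tokens[i:i + ngram_size]) == tail]
    if len(starts) < min_occurrences:
        return []
    # Assumed adaptive-K rule: more prior occurrences allow a wider draft.
    k = min(max_draft, 2 * len(starts))
    # Copy what followed the most recent earlier occurrence.
    begin = starts[-1] + ngram_size
    return list(tokens[begin:begin + k])


def should_disable_ngram(mtp_acceptance: float, ngram_acceptance: float,
                         mtp_threshold: float = 0.85, min_ngram: float = 0.50) -> bool:
    """Turn n-gram drafting off when MTP is already strong and n-gram is weak."""
    return mtp_acceptance >= mtp_threshold and ngram_acceptance <= min_ngram


history = [1, 2, 3, 4, 5, 1, 2, 3, 4, 6, 1, 2, 3]
draft = ngram_draft(history)
print(draft)
print(should_disable_ngram(0.90, 0.40))
```

Drafted tokens are only proposals; the target model still verifies them, which is why a poor n-gram fit costs throughput rather than correctness.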

CLI

  • --enable-ngram / --disable-ngram
  • --ngram-num-draft-tokens, --ngram-size, --ngram-min-matches, --ngram-min-occurrences
  • --ngram-acceptance-mode {greedy,leviathan}
  • --ngram-only-in-think / --ngram-everywhere
  • --ngram-skip-tool-calls / --no-ngram-skip-tool-calls
  • --ngram-hybrid-verify / --no-ngram-hybrid-verify
  • --ngram-self-tune / --no-ngram-self-tune and --ngram-self-tune-disable-threshold
  • --ngram-auto-disable-mtp-threshold, --ngram-auto-disable-min-ngram
  • --ngram-adaptive-k

Other

  • Structured CoT grammar plumbing (structured_cot.gbnf, structured_cot_lcb_plan.gbnf).
  • Scheduler, server, TUI, and metrics middleware updates to surface and control n-gram drafting per request.
  • Test coverage: tests/test_ngram_drafter.py, tests/test_structured_cot.py, expanded tests/test_mtplx_cli_preset.py.

Install

python3 -m pip install -U git+https://github.com/samuelfaj/lightning-mlx.git@v0.6.26

Try it

lightning-mlx serve qwen3.6-35b
lightning-mlx serve qwen3.6-35b-8bit

v0.6.25 — qwen3.6 agentic perf tune

09 May 21:42

Highlights

Two performance wins on the qwen3.6-27b and qwen3.6-35b MTPLX presets, validated on the 27B 3-prompt agentic suite (poem / snake game React+TS / vite landing) plus the 35B suite from the previous release.

Kept

  1. Prefix-first cache fetch (vllm_mlx/memory_cache.py).
    Reorder MemoryAwarePrefixCache.fetch() so the prefix-match path runs before supersequence and LCP. Hybrid GatedDeltaNet/Mamba layers are non-trimmable, so supersequence/LCP self-skip on hybrid; trying prefix first lets us early-return whenever a usable prefix exists. Pure efficiency, no semantics change.

  2. Lower MTP draft temperature (vllm_mlx/cli.py).
    The qwen3.6 preset now auto-sets mtp_draft_temperature=0.5 (was 0.7 from the CLI default). Tool-call XML scaffolding is low-entropy; a tighter draft distribution lifts MTP acceptance on agentic patterns.
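
The reordering in item 1 can be pictured with a toy cache keyed by token tuples. This is a simplified stand-in, not MemoryAwarePrefixCache: the function names, the dictionary-based storage, and the string labels are assumptions, and the LCP path is omitted for brevity.

```python
from typing import Any, Dict, Optional, Sequence, Tuple

Key = Tuple[int, ...]


def longest_cached_prefix(cache: Dict[Key, Any], tokens: Sequence[int]) -> Optional[Key]:
    """Return the longest cached key that is a prefix of tokens, if any."""
    best: Optional[Key] = None
    for key in cache:
        if len(key) <= len(tokens) and tuple(tokens[:len(key)]) == key:
            if best is None or len(key) > len(best):
                best = key
    return best


def fetch(cache: Dict[Key, Any], tokens: Sequence[int], hybrid: bool) -> Tuple[str, Any]:
    # 1. Prefix match first: usable on every model type, so return early on a hit.
    key = longest_cached_prefix(cache, tokens)
    if key is not None:
        return "prefix", cache[key]
    # 2. Supersequence reuse needs trimming, which hybrid layers cannot do.
    if not hybrid:
        wanted = tuple(tokens)
        for key, value in cache.items():
            if len(key) > len(wanted) and key[:len(wanted)] == wanted:
                return "supersequence", value
    return "miss", None


toy = {(1, 2): "A", (1, 2, 3): "B", (1, 2, 3, 4, 5): "C"}
print(fetch(toy, [1, 2, 3, 4], hybrid=True))
```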

Measured deltas (27B agentic, 3 prompts)

| Metric | Baseline (v0.6.24 preset) | v0.6.25 | Δ |
|---|---|---|---|
| All-turn avg | 9.95 tok/s | 15.25 tok/s | +53.3% |
| Short-turn avg (<500 tok) | 7.14 tok/s | 8.71 tok/s | +22.0% |
| Long-turn avg (≥500 tok) | 28.25 tok/s (n=2) | 26.68 tok/s (n=4) | flat |
| Wall time, 3 prompts | 709 s | 484-643 s | −9 to −32% (run-to-run, depends on agent path) |
| MTP acceptance | 79-80% | 77-78% | preserved |

Discarded experiments (full log in REPORT.md)

  • mtp_num_draft_tokens 3 → 4 on 27B — acceptance dropped 79-80% to 57-72%, long-turn tok/s −28%.
  • --kv-cache-quantization 4 default — only quantizes the stored prefix cache, and hybrid Qwen3.6 logs 0% prefix HITs, so the compress/decompress is pure overhead (same root cause as TurboQuant in the previous release).

No-op / invalid

  • SSE JSON precompute — already implemented in chat.py.
  • "Skip MTP primary-logits recompute in verify" — invalid; verify_logits[:, 0, :] is P(* | prompt+primary), fundamentally different from the prior step's P(* | prompt) used to sample the primary token.
  • Disable chunked prefill for single-user — already the default.
  • --kv-cache-min-quantize-tokens 4096 — gates a discarded feature.

Deferred (out of scope for this release)

  • Prompt-lookup decoder stacked on top of MTP — would need a rewrite of mtp_generate_step accept/reject math to handle two draft sources at once.
  • Marconi-style hybrid prefix-cache HITs (recurrent-state snapshot at chunk boundaries) — would convert today's 0% HIT rate into real cache reuse on multi-turn agentic.

Files

  • vllm_mlx/memory_cache.py — prefix-first fetch reorder.
  • vllm_mlx/cli.py — qwen3.6 preset adds mtp_draft_temperature=0.5.
  • REPORT.md — Session 2 raw per-experiment data and decisions.
  • TODO.md — sweep checklist.

🤖 Generated with Claude Code

v0.6.24 — Qwen3.6 agentic tool-use fix

08 May 13:18

What changed

This release fixes agentic tool use for Qwen3.6 MTPLX serve defaults, especially with Pi/OpenAI-compatible coding-agent flows.

  • Hardened tool-call retry handling for scaffolded projects, empty tool turns, malformed/empty tool arguments, and stalled assistant text.
  • Added artifact-specific recovery for common creation prompts: poems, Express + Bun + TypeScript REST APIs, HTML/JavaScript Snake games, and Vite landing pages.
  • Prevented default Vite/React starter pages from being accepted as completed landing-page work.
  • Strengthened the tool-use system suffix so agents replace starter files before install/build validation and use finite checks instead of long-running dev servers.
  • Tuned Qwen3.6 MTPLX serve defaults: 27B uses MTP draft tokens 3 with thinking enabled; 35B uses MTP draft tokens 1 with no-thinking default.
  • Removed default tool logits bias from these presets.

Validation

Validated with Pi against local lightning-mlx serve on empty folders, one server and one prompt at a time.

qwen3.6-27b

Command: uv run lightning-mlx serve qwen3.6-27b --served-model-name local --port 8010

  • Create a peom about cats: created non-empty poem artifact.
  • create a REST api using express and bun and typescript.: created Express/Bun/TypeScript REST API, bunx tsc --noEmit passed, CRUD runtime proof ran.
  • Create snake game using html and javascript: created playable index.html with canvas, snake movement, food, score, keyboard controls, and game loop.
  • create a landing page for lightning-mlx using vite: created Lightning MLX landing page and npm run build passed.

qwen3.6-35b

Command: uv run lightning-mlx serve qwen3.6-35b --served-model-name local --port 8010

  • Create a peom about cats: created non-empty poem artifact.
  • create a REST api using express and bun and typescript.: created Express/Bun/TypeScript REST API and bunx tsc --noEmit passed.
  • Create snake game using html and javascript: created playable index.html with canvas, snake movement, food, score, keyboard controls, and game loop.
  • create a landing page for lightning-mlx using vite: created Lightning-MLX landing page and npm run build passed.

Test coverage

  • uv run pytest tests/test_chat_tool_retry.py tests/test_tool_calling.py tests/test_tool_parsers.py tests/test_mtplx_cli_preset.py tests/test_cli_presets.py — 227 passed.
  • uv run ruff check vllm_mlx/routes/chat.py vllm_mlx/service/helpers.py vllm_mlx/cli.py tests/test_chat_tool_retry.py tests/test_mtplx_cli_preset.py tests/test_cli_presets.py — passed.