Releases · samuelfaj/lightning-mlx

ornstein3.6-27b-nsc-ace-saber-4bit → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-4bit-MTPLX-Optimized-Speed
ornstein3.6-27b-nsc-ace-saber (default, 6-bit) → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-6bit-MTPLX-Optimized-Speed
ornstein3.6-27b-nsc-ace-saber-8bit → samuelfaj/Ornstein3.6-27B-MTP-NSC-ACE-SABER-8bit-MTPLX-Optimized-Speed

All three resolve through the existing Ornstein MTPLX preset: MTP on, n-gram off, agentic tool-use defaults (temp 0.6, top_p 0.95, enable_auto_tool_choice).

Usage

lightning-mlx serve ornstein3.6-27b-nsc-ace-saber
lightning-mlx serve ornstein3.6-27b-nsc-ace-saber-4bit
lightning-mlx serve ornstein3.6-27b-nsc-ace-saber-8bit

Source model: GestaltLabs/Ornstein3.6-27B-MTP-NSC-ACE-SABER

Collection: Ornstein MLX MTPLX

Assets 2

13 May 17:20

samuelfaj

v0.6.31

98eeff1

v0.6.31 — --daemon boot persistence (launchd / systemd)

Highlights

lightning-mlx serve --daemon is now boot-persistent by default. The supervisor restarts at user login / system boot via a per-user OS service:

macOS: LaunchAgent at ~/Library/LaunchAgents/com.lightning-mlx.<id>.plist (RunAtLoad, KeepAlive).
Linux: systemd user unit at ~/.config/systemd/user/lightning-mlx-<id>.service (Restart=always).

lightning-mlx kill removes the autostart before signaling so the OS service manager cannot resurrect the supervisor.

Opt-out

lightning-mlx serve <model> --daemon=non-persist

Keeps the original in-session detached behavior (dies on reboot).

Linux post-logout survival

Run once:

loginctl enable-linger $USER

Robustness

Install failure unlinks the on-disk record and exits with a clean DaemonError (no traceback).
lightning-mlx status shows persistent daemons whose supervisor is not yet live as pending rather than stale.
launchctl unload before load -w makes re-install idempotent.

Tests

tests/test_persistence.py (7): plist/unit content, install + uninstall on darwin/linux, idempotent uninstall, unsupported-platform error.
tests/test_daemon.py: persistent vs non-persist start, install-failure cleanup, stop-uninstall ordering, list-pending for persistent dead-pid, CLI dispatch for --daemon=non-persist.
21/21 PASS on changed-file suite. 2262/2262 PASS on relevant full suite.

PR: #2

Assets 2

13 May 03:35

samuelfaj

v0.6.30

381d53b

v0.6.30

Highlights

New aliases: qwopus3.6-35b (default 6-bit), qwopus3.6-35b-4bit, qwopus3.6-35b-8bit — point to the samuelfaj/Qwopus3.6-35B-A3B-v1-*-MTPLX-Optimized-Speed collection. Defaults mirror qwen3.6-35b-nsc-ace-saber (XML tool parser, qwen3 reasoning, MTP on, 35B-A3B n-gram preset).
Alias rename for consistency with qwen3.6-35b:
- ornstein3.6-35-saber* → ornstein3.6-35b-saber*
- qwen3.6-35-nsc-ace-saber* → qwen3.6-35b-nsc-ace-saber*

Usage

lightning-mlx serve qwopus3.6-35b
lightning-mlx serve qwopus3.6-35b-4bit
lightning-mlx serve qwopus3.6-35b-8bit

Assets 2

12 May 02:35

samuelfaj

v0.6.29

5b7a11e

v0.6.29 — Qwen3.6-35B NSC-ACE-SABER aliases

feat(aliases): add qwen3.6-35-nsc-ace-saber {4,6,8}bit aliases mapping to samuelfaj/Qwen3.6-35B-A3B-NSC-ACE-SABER-MLX-Nbit-MTPLX-Optimized-Speed. Default (no suffix) = 6bit.

Assets 2

11 May 17:11

samuelfaj

v0.6.28

6c1e4db

v0.6.28 — Metal memory leak fix + dense MTPLX support

Highlights

Memory leak fix

Long-running servers hit [METAL] Command buffer execution failed: Insufficient Memory after sustained use, especially with the new 32k --max-tokens default. BlockAwarePrefixCache.store_cache was storing the full original prompt+output KV cache in every BlockCacheEntry.cache_data (for legacy get_cache_for_generation callers), pinning GB-scale Metal buffers per finished request. Production never read this copy — per-block slices already power prefix sharing — so the leak just compounded.

Fixes:

New BlockAwarePrefixCache.release_full_cache_data(request_id) drops the redundant entry-level ref while keeping per-block slices alive for sharing.
Scheduler._finalize_finished_requests now releases the full-cache ref and nulls request._extracted_cache, prompt_cache, and block_table after the post-store mx.eval loop.
MLLMScheduler._cleanup_finished nulls pixel_values, attention_mask, image_grid_thw, multimodal_kwargs, prompt_cache, _extracted_cache before popping the request, and calls mx.clear_cache() (parity with text scheduler).
EngineCore._cleanup_request calls mx.clear_cache() as backstop for abort and non-streaming paths.

Memory now stabilizes at max_cache_blocks * block_size * KV_per_token plus active working set — no unbounded growth per request.

Dense Qwen3.6 MTPLX conversion

convert_mtplx now handles the dense Qwen3.6 MTP layout (no MoE experts/gate), detected by absence of mtp.layers.0.mlp.experts.gate_up_proj. Adds _normalize_qwen36_dense_mtp plus fixture/sidecar test covering Ornstein-Hermes-3.6-27B-SABER.

Commits

fix(memleak): release per-request KV state and redundant block-cache refs on finish
feat(convert_mtplx): support dense Qwen3.6 MTP variant

Assets 2

11 May 13:56

samuelfaj

v0.6.27

f3c5301

v0.6.27 — Ornstein3.6-35B-SABER aliases

Highlights

New model aliases — Ornstein3.6-35B-A3B-SABER (MTPLX-Optimized-Speed)

Three new aliases for the Ornstein3.6-35B-A3B-SABER family with auto-applied serve preset:

ornstein3.6-35-saber-4bit → samuelfaj/Ornstein3.6-35B-A3B-SABER-4bit-MTPLX-Optimized-Speed
ornstein3.6-35-saber → samuelfaj/Ornstein3.6-35B-A3B-SABER-6bit-MTPLX-Optimized-Speed
ornstein3.6-35-saber-8bit → samuelfaj/Ornstein3.6-35B-A3B-SABER-8bit-MTPLX-Optimized-Speed

lightning-mlx serve ornstein3.6-35-saber

Auto-applied flags (user CLI flags override):

--prefill-step-size 32768
--max-concurrent 3
--max-num-seqs 1 --prefill-batch-size 1 --completion-batch-size 1
--stream-interval 1
--default-temperature 0.6 --default-top-p 0.95
--enable-auto-tool-choice
--enable-mtp
--enable-ngram --ngram-num-draft-tokens 6 --ngram-min-occurrences 2 --ngram-acceptance-mode greedy
--ngram-hybrid-verify --ngram-everywhere --ngram-skip-tool-calls --ngram-self-tune
--ngram-self-tune-disable-threshold 0.30
--ngram-auto-disable-mtp-threshold 0.85 --ngram-auto-disable-min-ngram 0.50

max_tokens hard cap

cfg.default_max_tokens now clamps client-supplied max_tokens instead of only acting as a fallback. Prevents oversized KV reservations from triggering Metal OOM on Apple Silicon.

Assets 2

10 May 16:22

samuelfaj

v0.6.26

3600f06

v0.6.26 — n-gram speculation + qwen3.6-35b-8bit

Highlights

N-gram speculative decoding stacked on MTP for Qwen3.6-35B-A3B. +18% throughput on agentic reasoning + tool-use workloads vs. MTP-only.
New qwen3.6-35b-8bit alias routing to samuelfaj/Qwen3.6-35B-A3B-8bit-MTPLX-Optimized-Speed with full preset parity (MTP, n-gram, port 8010, tool/reasoning parsers, temps).

N-gram drafter

<think>-aware (token-id state machine) and <tool_call>-aware (rolling-text state machine) gating. Drafts everywhere by default, skips inside <tool_call>...</tool_call>.
Adaptive K based on prior occurrence count of the matched n-gram (wide drafts for strong matches, narrow for weak).
Confidence-aware lookup: K computed from how many prior occurrences had the same continuation tail.
Hybrid verify: one MTP-head draft appended after the n-gram tail captures extra ground when n-gram drafts all accept.
Self-tuning: running per-request acceptance suppresses drafting on bad fits.
Global auto-disable when MTP is already strong (≥0.85) and n-gram is weak (≤0.50). Guarantees no regression vs. the MTP-only baseline.

CLI

--enable-ngram / --disable-ngram
--ngram-num-draft-tokens, --ngram-size, --ngram-min-matches, --ngram-min-occurrences
--ngram-acceptance-mode {greedy,leviathan}
--ngram-only-in-think / --ngram-everywhere
--ngram-skip-tool-calls / --no-ngram-skip-tool-calls
--ngram-hybrid-verify / --no-ngram-hybrid-verify
--ngram-self-tune / --no-ngram-self-tune and --ngram-self-tune-disable-threshold
--ngram-auto-disable-mtp-threshold, --ngram-auto-disable-min-ngram
--ngram-adaptive-k

Other

Structured CoT grammar plumbing (structured_cot.gbnf, structured_cot_lcb_plan.gbnf).
Scheduler, server, TUI, and metrics middleware updates to surface and control n-gram drafting per request.
Test coverage: tests/test_ngram_drafter.py, tests/test_structured_cot.py, expanded tests/test_mtplx_cli_preset.py.

Install

python3 -m pip install -U git+https://github.com/samuelfaj/lightning-mlx.git@v0.6.26

Try it

lightning-mlx serve qwen3.6-35b
lightning-mlx serve qwen3.6-35b-8bit

Assets 2

09 May 21:42

samuelfaj

v0.6.25

6ee13a9

v0.6.25 — qwen3.6 agentic perf tune

Highlights

Two performance wins on the qwen3.6-27b and qwen3.6-35b MTPLX presets, validated on the 27B 3-prompt agentic suite (poem / snake game React+TS / vite landing) plus the 35B suite from the previous release.

Kept

Prefix-first cache fetch (vllm_mlx/memory_cache.py).
Reorder MemoryAwarePrefixCache.fetch() so the prefix-match path runs before supersequence and LCP. Hybrid GatedDeltaNet/Mamba layers are non-trimmable, so supersequence/LCP self-skip on hybrid; trying prefix first lets us early-return whenever a usable prefix exists. Pure efficiency, no semantics change.
Lower MTP draft temperature (vllm_mlx/cli.py).
The qwen3.6 preset now auto-sets mtp_draft_temperature=0.5 (was 0.7 from the CLI default). Tool-call XML scaffolding is low-entropy; a tighter draft distribution lifts MTP acceptance on agentic patterns.

Measured deltas (27B agentic, 3 prompts)

Metric	Baseline (v0.6.24 preset)	v0.6.25	Δ
All-turn avg	9.95 tok/s	15.25 tok/s	+53.3%
Short-turn avg (<500 tok)	7.14 tok/s	8.71 tok/s	+22.0%
Long-turn avg (≥500 tok)	28.25 tok/s (n=2)	26.68 tok/s (n=4)	flat
Wall time, 3 prompts	709 s	484-643 s	−9 to −32% (run-to-run, depends on agent path)
MTP acceptance	79-80%	77-78%	preserved

Discarded experiments (full log in `REPORT.md`)

mtp_num_draft_tokens 3 → 4 on 27B — acceptance dropped 79-80% to 57-72%, long-turn tok/s −28%.
--kv-cache-quantization 4 default — only quantizes the stored prefix cache, and hybrid Qwen3.6 logs 0% prefix HITs, so the compress/decompress is pure overhead (same root cause as TurboQuant in the previous release).

No-op / invalid

SSE JSON precompute — already implemented in chat.py.
"Skip MTP primary-logits recompute in verify" — invalid; verify_logits[:, 0, :] is P(* | prompt+primary), fundamentally different from the prior step's P(* | prompt) used to sample the primary token.
Disable chunked prefill for single-user — already the default.
--kv-cache-min-quantize-tokens 4096 — gates a discarded feature.

Deferred (out of scope for this release)

Prompt-lookup decoder stacked on top of MTP — would need a rewrite of mtp_generate_step accept/reject math to handle two draft sources at once.
Marconi-style hybrid prefix-cache HITs (recurrent-state snapshot at chunk boundaries) — would convert today's 0% HIT rate into real cache reuse on multi-turn agentic.

Files

vllm_mlx/memory_cache.py — prefix-first fetch reorder.
vllm_mlx/cli.py — qwen3.6 preset adds mtp_draft_temperature=0.5.
REPORT.md — Session 2 raw per-experiment data and decisions.
TODO.md — sweep checklist.

🤖 Generated with Claude Code

Assets 2

08 May 13:18

samuelfaj

v0.6.24

3ea1697

v0.6.24 — Qwen3.6 agentic tool-use fix

What changed

This release fixes agentic tool use for Qwen3.6 MTPLX serve defaults, especially with Pi/OpenAI-compatible coding-agent flows.

Hardened tool-call retry handling for scaffolded projects, empty tool turns, malformed/empty tool arguments, and stalled assistant text.
Added artifact-specific recovery for common creation prompts: poems, Express + Bun + TypeScript REST APIs, HTML/JavaScript Snake games, and Vite landing pages.
Prevented default Vite/React starter pages from being accepted as completed landing-page work.
Strengthened the tool-use system suffix so agents replace starter files before install/build validation and use finite checks instead of long-running dev servers.
Tuned Qwen3.6 MTPLX serve defaults: 27B uses MTP draft tokens 3 with thinking enabled; 35B uses MTP draft tokens 1 with no-thinking default.
Removed default tool logits bias from these presets.

Validation

Validated with Pi against local lightning-mlx serve on empty folders, one server and one prompt at a time.

qwen3.6-27b

Command: uv run lightning-mlx serve qwen3.6-27b --served-model-name local --port 8010

Create a peom about cats: created non-empty poem artifact.
create a REST api using express and bun and typescript.: created Express/Bun/TypeScript REST API, bunx tsc --noEmit passed, CRUD runtime proof ran.
Create snake game using html and javascript: created playable index.html with canvas, snake movement, food, score, keyboard controls, and game loop.
create a landing page for lightning-mlx using vite: created Lightning MLX landing page and npm run build passed.

qwen3.6-35b

Command: uv run lightning-mlx serve qwen3.6-35b --served-model-name local --port 8010

Create a peom about cats: created non-empty poem artifact.
create a REST api using express and bun and typescript.: created Express/Bun/TypeScript REST API and bunx tsc --noEmit passed.
Create snake game using html and javascript: created playable index.html with canvas, snake movement, food, score, keyboard controls, and game loop.
create a landing page for lightning-mlx using vite: created Lightning-MLX landing page and npm run build passed.

Test coverage

uv run pytest tests/test_chat_tool_retry.py tests/test_tool_calling.py tests/test_tool_parsers.py tests/test_mtplx_cli_preset.py tests/test_cli_presets.py — 227 passed.
uv run ruff check vllm_mlx/routes/chat.py vllm_mlx/service/helpers.py vllm_mlx/cli.py tests/test_chat_tool_retry.py tests/test_mtplx_cli_preset.py tests/test_cli_presets.py — passed.

Assets 2

Releases: samuelfaj/lightning-mlx

v0.7.0 — MTP d=5 default + agentic 2.3x-5.2x faster

Uh oh!

v0.6.32 — Ornstein3.6-27B-MTP-NSC-ACE-SABER aliases

What's new

Usage

Uh oh!

v0.6.31 — --daemon boot persistence (launchd / systemd)

Highlights

Opt-out

Linux post-logout survival

Robustness

Tests

Uh oh!

v0.6.30

Highlights

Usage

Uh oh!

v0.6.29 — Qwen3.6-35B NSC-ACE-SABER aliases

Uh oh!

v0.6.28 — Metal memory leak fix + dense MTPLX support

Highlights

Memory leak fix

Dense Qwen3.6 MTPLX conversion

Commits

Uh oh!

v0.6.27 — Ornstein3.6-35B-SABER aliases

Highlights

New model aliases — Ornstein3.6-35B-A3B-SABER (MTPLX-Optimized-Speed)

max_tokens hard cap

Uh oh!

v0.6.26 — n-gram speculation + qwen3.6-35b-8bit

Highlights

N-gram drafter

CLI

Other

Install

Try it

Uh oh!

v0.6.25 — qwen3.6 agentic perf tune

Highlights

Kept

Measured deltas (27B agentic, 3 prompts)

Discarded experiments (full log in REPORT.md)

No-op / invalid

Deferred (out of scope for this release)

Files

Uh oh!

v0.6.24 — Qwen3.6 agentic tool-use fix

What changed

Validation

qwen3.6-27b

qwen3.6-35b

Test coverage

Uh oh!

Discarded experiments (full log in `REPORT.md`)