Skip to content

Commit 3600f06

Browse files
samuelfajclaude
andcommitted
feat(speculative): n-gram drafter stacked on MTP for Qwen3.6-35B-A3B
- New per-request n-gram speculative drafter with <think> and <tool_call> state machines, adaptive K based on n-gram match confidence, hybrid verify (append MTP draft after n-gram tail), per-request self-tuning, and global auto-disable when MTP is strong and n-gram is weak. - Auto-enabled by the qwen3.6-35b preset and the new qwen3.6-35b-8bit preset. +18% throughput on agentic reasoning + tool-use workloads vs. MTP-only. - New qwen3.6-35b-8bit alias routing to samuelfaj/Qwen3.6-35B-A3B-8bit-MTPLX-Optimized-Speed with full preset parity (MTP, n-gram, port 8010, tool/reasoning parsers, temps). - Structured CoT grammar plumbing (structured_cot.gbnf, lcb_plan.gbnf). - Scheduler, server, TUI, and metrics middleware updates to surface and control n-gram drafting per request. - Test coverage for drafter, structured CoT, and CLI preset parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 70ac799 commit 3600f06

18 files changed

Lines changed: 2579 additions & 39 deletions

README.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,19 @@ create the snake game using react and typescript
4848

4949
You can check for more benchmarks (for non-optmized models) in [Rapid-MLX](https://github.com/raullenchai/Rapid-MLX).
5050

51+
## N-gram Speculation (Qwen3.6-35B-A3B)
52+
53+
On the 35B-A3B preset, n-gram (prompt-lookup) drafting is layered on top of MTP for **+18% throughput** on mixed reasoning + tool-use workloads (vs. MTP-only). Auto-enabled for `qwen3.6-35b` and `qwen3.6-35b-8bit`.
54+
55+
Highlights:
56+
57+
- **`<think>`-aware** and **`<tool_call>`-aware** state machines: drafts everywhere by default but skips inside `<tool_call>...</tool_call>` regions where structure repeats but content varies.
58+
- **Adaptive K** based on n-gram match confidence (prior occurrence count): wide drafts for strong matches, narrow drafts for weak ones.
59+
- **Hybrid verify**: append one MTP-head draft after the n-gram tail to capture extra ground when n-gram drafts all accept.
60+
- **Self-tuning**: per-request running acceptance suppresses drafting on bad fits; global auto-disable when MTP is already strong (≥0.85) and n-gram is weak (≤0.50). Guarantees no regression vs. the MTP-only baseline.
61+
62+
Tunable via `--enable-ngram` / `--disable-ngram` and `--ngram-*` flags on `lightning-mlx serve` and `bench`.
63+
5164
## Install
5265

5366
```bash
@@ -81,8 +94,11 @@ Best optimized models:
8194
```bash
8295
lightning-mlx serve qwen3.6-27b
8396
lightning-mlx serve qwen3.6-35b
97+
lightning-mlx serve qwen3.6-35b-8bit
8498
```
8599

100+
`qwen3.6-35b-8bit` mirrors the `qwen3.6-35b` preset (MTP, n-gram, port 8010, tool/reasoning parsers) but routes to the 8-bit MTPLX-optimized weights for higher quality on memory-rich Macs.
101+
86102
Local model path works too:
87103

88104
```bash
@@ -147,6 +163,7 @@ curl http://localhost:8010/v1/chat/completions \
147163

148164
- **2.75x faster short agentic turns** in the benchmark fixture.
149165
- **1.96x higher all-turn throughput** versus the MLX baseline.
166+
- **+18% throughput on Qwen3.6-35B-A3B** with n-gram + MTP stacked speculation.
150167
- **Successful artifact generation** where baseline timed out.
151168
- **OpenAI-compatible API** for local tools, agents, editors, and CLIs.
152169
- **Apple Silicon first**: built around MLX and local Mac inference.

tests/test_mtplx_cli_preset.py

Lines changed: 129 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,10 @@
11
import argparse
22

33
from vllm_mlx.cli import (
4+
_QWEN36_35B_8BIT_MTPLX_MODEL,
45
_QWEN36_35B_MTPLX_MODEL,
56
_QWEN36_MTPLX_MODEL,
7+
_apply_qwen36_35b_defaults,
68
_apply_qwen36_mtplx_preset,
79
)
810
from vllm_mlx.scheduler import SchedulerConfig
@@ -32,6 +34,21 @@ def _serve_args(**overrides):
3234
"no_thinking": False,
3335
"log_level": "INFO",
3436
"enable_tool_logits_bias": False,
37+
# N-gram defaults (preset overrides these for 35B-A3B).
38+
"enable_ngram": False,
39+
"ngram_num_draft_tokens": 4,
40+
"ngram_size": 3,
41+
"ngram_min_matches": 2,
42+
"ngram_only_in_think": True,
43+
"ngram_acceptance_mode": "greedy",
44+
"ngram_min_occurrences": 1,
45+
"ngram_adaptive_k": True,
46+
"ngram_auto_disable_mtp_threshold": 0.0,
47+
"ngram_auto_disable_min_ngram": 0.50,
48+
"ngram_hybrid_verify": False,
49+
"ngram_skip_tool_calls": True,
50+
"ngram_self_tune": True,
51+
"ngram_self_tune_disable_threshold": 0.30,
3552
}
3653
values.update(overrides)
3754
return argparse.Namespace(**values)
@@ -144,3 +161,115 @@ def test_qwen36_mtplx_preset_keeps_explicit_prefill_step_size():
144161

145162
def test_scheduler_default_prefill_step_size_is_sustained():
146163
assert SchedulerConfig().prefill_step_size == 8192
164+
165+
166+
def test_qwen36_35b_serve_preset_enables_ngram_with_tuned_defaults():
167+
args = _serve_args(model=_QWEN36_35B_MTPLX_MODEL)
168+
169+
_apply_qwen36_mtplx_preset(args, ["serve", _QWEN36_35B_MTPLX_MODEL])
170+
171+
# N-gram is auto-enabled for 35B-A3B with the validated agentic
172+
# configuration.
173+
assert args.enable_ngram is True
174+
assert args.ngram_num_draft_tokens == 6
175+
assert args.ngram_min_occurrences == 2
176+
assert args.ngram_acceptance_mode == "greedy"
177+
assert args.ngram_hybrid_verify is True
178+
assert args.ngram_only_in_think is False # everywhere
179+
assert args.ngram_skip_tool_calls is True
180+
assert args.ngram_self_tune is True
181+
assert args.ngram_self_tune_disable_threshold == 0.30
182+
assert args.ngram_auto_disable_mtp_threshold == 0.85
183+
assert args.ngram_auto_disable_min_ngram == 0.50
184+
185+
186+
def test_qwen36_35b_serve_preset_disable_ngram_flag_overrides():
187+
args = _serve_args(model=_QWEN36_35B_MTPLX_MODEL)
188+
189+
_apply_qwen36_mtplx_preset(
190+
args,
191+
["serve", _QWEN36_35B_MTPLX_MODEL, "--disable-ngram"],
192+
)
193+
194+
assert args.enable_ngram is False
195+
196+
197+
def test_qwen36_35b_serve_preset_keeps_explicit_ngram_overrides():
198+
args = _serve_args(
199+
model=_QWEN36_35B_MTPLX_MODEL,
200+
ngram_num_draft_tokens=8,
201+
ngram_min_occurrences=4,
202+
ngram_hybrid_verify=False,
203+
)
204+
205+
_apply_qwen36_mtplx_preset(
206+
args,
207+
[
208+
"serve",
209+
_QWEN36_35B_MTPLX_MODEL,
210+
"--ngram-num-draft-tokens",
211+
"8",
212+
"--ngram-min-occurrences",
213+
"4",
214+
],
215+
)
216+
217+
assert args.ngram_num_draft_tokens == 8
218+
assert args.ngram_min_occurrences == 4
219+
# User did NOT pass --ngram-hybrid-verify, so the preset still flips
220+
# it on (the existing hybrid_verify=False in args is the parser's
221+
# default, not an explicit override).
222+
assert args.ngram_hybrid_verify is True
223+
224+
225+
def test_qwen36_35b_serve_preset_no_hybrid_verify_overrides():
226+
args = _serve_args(
227+
model=_QWEN36_35B_MTPLX_MODEL,
228+
ngram_hybrid_verify=False,
229+
)
230+
231+
_apply_qwen36_mtplx_preset(
232+
args,
233+
[
234+
"serve",
235+
_QWEN36_35B_MTPLX_MODEL,
236+
"--no-ngram-hybrid-verify",
237+
],
238+
)
239+
240+
assert args.ngram_hybrid_verify is False
241+
242+
243+
def test_qwen36_35b_8bit_alias_matches_4bit_preset():
244+
"""8bit alias must apply identical defaults — only model differs."""
245+
a = _serve_args(
246+
model=_QWEN36_35B_MTPLX_MODEL, _original_alias="qwen3.6-35b"
247+
)
248+
b = _serve_args(
249+
model=_QWEN36_35B_8BIT_MTPLX_MODEL,
250+
_original_alias="qwen3.6-35b-8bit",
251+
)
252+
253+
_apply_qwen36_mtplx_preset(a, ["serve", "qwen3.6-35b"])
254+
_apply_qwen36_35b_defaults(a, ["serve", "qwen3.6-35b"])
255+
_apply_qwen36_mtplx_preset(b, ["serve", "qwen3.6-35b-8bit"])
256+
_apply_qwen36_35b_defaults(b, ["serve", "qwen3.6-35b-8bit"])
257+
258+
da, db = vars(a), vars(b)
259+
diffs = {k: (da[k], db[k]) for k in da if da[k] != db[k]}
260+
assert diffs == {
261+
"model": (_QWEN36_35B_MTPLX_MODEL, _QWEN36_35B_8BIT_MTPLX_MODEL),
262+
"_original_alias": ("qwen3.6-35b", "qwen3.6-35b-8bit"),
263+
}
264+
265+
266+
def test_qwen36_27b_serve_preset_does_not_enable_ngram():
267+
"""27B model should not get the 35B-only ngram preset."""
268+
args = _serve_args(model=_QWEN36_MTPLX_MODEL)
269+
270+
_apply_qwen36_mtplx_preset(args, ["serve", _QWEN36_MTPLX_MODEL])
271+
272+
assert args.enable_ngram is False
273+
assert args.ngram_num_draft_tokens == 4 # parser default unchanged
274+
assert args.ngram_min_occurrences == 1
275+
assert args.ngram_hybrid_verify is False

0 commit comments

Comments
 (0)