v0.0.27 — Speculative decoding & Responses API
This stable release closes the Gemma 4 MTP and Qwen 3.5 DFlash speculative-decoding epic with full server-side integration, adds the OpenAI Responses API (Phase 1), brings progress bars and a security-hardened core to mlxcel download, and adds block-level partial cache adoption to the APC scheduler.
New Features
- End-to-end speculative decoding for Gemma 4 MTP and Qwen 3.5 DFlash drafter families (epic #633, #632). New Drafter trait + DrafterKind enum + model_type auto-detection (#624). Ported drafter components: MaskedEmbedder for Gemma 4 E2B / E4B (#627), drafter masks (bidirectional full + sliding-window) and normalize_batched_shared_kv_states (#628), Gemma4AssistantDraftModel (4-layer drafter + pre/post projections) (#626), and DFlashDraftModel (5-layer drafter + DFlashAttention + DFlashKVCache) (#635). Target-side hooks: Gemma 4 return_hidden / return_shared_kv / rollback_speculative_cache (#625); Qwen 3.5 return_hidden + capture_layer_ids + GDN-aware rollback_speculative_cache (#634). Round loops: DFlash single-batch (#636), MtpGenerator single-batch (#629), batched DFlash with continuous batching + GDN-aware rollback (#637), and batched MTP with continuous batching + left-padding normalization (#631). Real-model byte-equality end-to-end tests (#685) and a greedy-parity + perf benchmark harness (#632).
- Server speculative dispatch. Speculative dispatch resolution and MtpTarget adapters wired into the server (#666); the assistant model paths now plug into the real MaskedEmbedder and make_drafter_masks; speculative dispatch is wired into the scheduler via per-request B=1 bursts (#670) and a B>1 batched path via MtpBatchedGenerator / DFlashBatchedGenerator (#684). Per-request properties propagated through the speculative-burst path: cancellation propagation through MtpGenerator / DFlashGenerator (#681), token_history threading through the speculative-burst first sample (#682), logprobs support (#686), thinking-budget enforcement (#687), and prompt-cache donate symmetric with the classic path (#680) and into the B>1 batched arm (#689).
- --draft-kind {dflash,mtp} and --draft-block-size CLI flags on both mlxcel and mlxcel-server (#630).
- OpenAI Responses API (Phase 1) at /v1/responses for both binaries (#622, #623). Conversation store with shared-LRU semantics, response.created / response.in_progress / response.completed SSE event stream, reasoning-trace forwarding, response cancellation, and four new CLI flags. User guide at docs/responses-api.md.
- APC block-level partial cache adoption in the scheduler (#580, #607). When APC is on, a prompt sharing the first N blocks with a cached entry but diverging at block N+1 reuses blocks 0..N and re-prefills only from the divergence boundary. APC-off retains bit-exact prior behaviour. Three components: DetachedKVCache::trim_to and DetachedCacheSet::truncate_to (FP16, INT8, Turbo4, Turbo4Delegated sidecars), relaxed PromptCacheStore containment gate, and Scheduler::try_adopt_cached_prefix truncate-on-adopt. Bench procedure in docs/apc-partial-adoption-bench.md.
- Nemotron H Nano Omni audio modality (#582, #609). Parakeet/Conformer sound encoder, mel-spectrogram feature extractor, audio projector, and runtime path that merges audio token embeddings at sound_context_token_id and interleaves them with vision tokens. Loader applies the upstream sanitize_audio_weights transpose pass. Bring-up guide at docs/nemotron-h-nano-omni-audio-bringup.md.
- mlxcel download / mlxcel-server download progress bars (#648, #649). New src/downloader/progress.rs module provides terminal-aware suppression, a MultiProgress factory, and 6 suppression unit tests. The downloader streams files via reqwest to a NamedTempFile and atomically renames into place.
- Server --max-kv-size flag matching llama-server, plus a tightened chat-completion response envelope (#618).
- Tokenizer support for multi-token think and tool-call sequences so chat templates that emit <think> / <tool_call> across multiple BPE tokens stream and parse correctly (#590, #613).
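The block-level partial adoption rule above can be pictured with a small std-only sketch. It is not the mlxcel implementation: BLOCK_SIZE, shared_block_count, AdoptPlan, and plan_adoption are hypothetical names, the block size of 4 is arbitrary, and cached entries are assumed to hold whole blocks only.

```rust
/// Hypothetical block size, chosen only for illustration.
const BLOCK_SIZE: usize = 4;

/// Count the leading full blocks on which `prompt` and `cached` agree.
fn shared_block_count(prompt: &[u32], cached: &[u32]) -> usize {
    prompt
        .chunks_exact(BLOCK_SIZE)
        .zip(cached.chunks_exact(BLOCK_SIZE))
        .take_while(|(a, b)| a == b)
        .count()
}

/// Hypothetical adoption plan: how many cached tokens can be reused and
/// whether the cached entry has to be cut back before it is adopted.
#[derive(Debug, PartialEq)]
struct AdoptPlan {
    /// Tokens covered by reused blocks; prefill resumes at this index.
    reuse_tokens: usize,
    /// True only on a partial hit (assumes the entry holds whole blocks).
    truncate_cached_entry: bool,
}

fn plan_adoption(prompt: &[u32], cached: &[u32]) -> AdoptPlan {
    let reuse_tokens = shared_block_count(prompt, cached) * BLOCK_SIZE;
    AdoptPlan {
        reuse_tokens,
        // A full-prefix hit keeps the entry untouched; only a shorter
        // consistent prefix requires truncation.
        truncate_cached_entry: reuse_tokens < cached.len(),
    }
}

fn main() {
    // Cached entry: three full blocks of tokens 0..12.
    let cached: Vec<u32> = (0..12).collect();
    // New prompt: identical for two blocks, diverging inside the third.
    let mut prompt = cached.clone();
    prompt[9] = 99;
    prompt.extend([100, 101]);

    let plan = plan_adoption(&prompt, &cached);
    println!("{plan:?}");
}
```

Comparing whole blocks rather than single tokens keeps the reuse boundary aligned with the cache's storage unit, which is why a divergence in the middle of a block discards that entire block in this sketch.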
Improvements
- StreamFilter extended to handle multi-token markers and to reset state when a partial marker is broken by a non-marker token (#613).
- Speculative drafter epic follow-ups hardened post-merge, covering miscellaneous invariants surfaced by integration testing against the real z-lab/Qwen3.5-4B-DFlash checkpoint and the mlx-community/gemma-4-* drafter variants.
- README and speculative decoding guidance refreshed to match current code paths and the latest M1 Ultra / M5 Max benchmarks (#700, #702).
- Qwen 3.5 DFlash greedy-argmax decode-path optimization that drops the per-decode-step copy and an unnecessary argmax temporary, restoring decode tok/s on Qwen 3.5 32B / 9B DFlash configurations.
- Avoid slow Gemma 4 MTP singleton bursts: the speculative-burst path now correctly short-circuits to the classic path when the batch size collapses to 1 with no draft tokens accepted, eliminating a per-step over-evaluation regression (#698).
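The multi-token marker handling mentioned for StreamFilter can be illustrated with a toy matcher over token ids. This is a sketch under assumptions, not the actual StreamFilter: MarkerMatcher and Event are hypothetical names, and the marker token ids are made up.

```rust
/// What the matcher reports after each incoming token.
#[derive(Debug, PartialEq)]
enum Event {
    /// Tokens released as ordinary output.
    Text(Vec<u32>),
    /// The complete marker sequence was seen.
    Marker,
    /// The token was held back because it may be part of a marker.
    Pending,
}

/// Hypothetical matcher for one marker that spans several tokens.
struct MarkerMatcher {
    marker: Vec<u32>,
    /// Number of marker tokens matched so far.
    matched: usize,
}

impl MarkerMatcher {
    fn new(marker: Vec<u32>) -> Self {
        assert!(!marker.is_empty(), "marker must contain at least one token");
        Self { marker, matched: 0 }
    }

    fn push(&mut self, token: u32) -> Event {
        if token == self.marker[self.matched] {
            self.matched += 1;
            if self.matched == self.marker.len() {
                self.matched = 0;
                return Event::Marker;
            }
            return Event::Pending;
        }
        // The partial marker was broken: release the held tokens and reset.
        let mut released = self.marker[..self.matched].to_vec();
        self.matched = 0;
        if token == self.marker[0] {
            // The breaking token may itself begin a new marker, so hold it.
            // (Markers with repeating prefixes would need a KMP-style
            // fallback, which this sketch omits.)
            self.matched = 1;
        } else {
            released.push(token);
        }
        Event::Text(released)
    }
}

fn main() {
    // Pretend a marker such as <think> is encoded as the two tokens 10, 11.
    let mut matcher = MarkerMatcher::new(vec![10, 11]);
    for token in [10, 11, 10, 5, 7] {
        println!("{token} -> {:?}", matcher.push(token));
    }
}
```

Holding tokens back while a match is in progress is what keeps half of a marker from leaking into the streamed text; the reset path returns those tokens as soon as the match fails.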
Bug Fixes
- DFlash drafter lazy-bind for the upstream z-lab/Qwen3.5-4B-DFlash checkpoint: Drafter::bind was previously not called on the DFlash family, causing an internal cache mis-binding on the first speculative burst. The drafter now performs lazy-bind on first use, matching the MTP path (#683).
- Enable DFlash for Qwen 3.5 VLM text requests: pure-text generations against a Qwen 3.5 VLM checkpoint can now resolve a DFlash drafter when one is available, instead of silently falling back to the classic path (#694).
- Speculative-rollback safety: validate trimmable cache and reserve the last token in prefill so a rolled-back speculative burst always lands on a valid sampling boundary (#612).
- Prompt cache RadixTrie: pop_prefixes now uses correct immediate-prefix semantics (#617).
- MiniMax M2 parallel tool calling parser correctly emits one ChatToolCall per parallel call instead of merging them into a single call (#616).
- Server tool-call buffering: preserve token positions when buffering parallel tool calls (#615); skip the tool→normal transition when tool_call_end is empty so streaming continues correctly for templates without an explicit close marker (#614).
- video_url allowlist TOCTOU race closed by passing the resolved OwnedFd to ffmpeg via /dev/fd/N instead of re-opening the path inside the subprocess. Symlink swaps between metadata and the subsequent open now cannot mis-route the subprocess (#601, #611).
- Gemma 4: skip k_proj / v_proj / k_norm weight load for KV-shared layers. The previous load step would error out on real Gemma 4 E2B / E4B checkpoints that omit these tensors per KV-shared design (#608).
- Nemotron-H: default time_step_limit to (0.0, +inf) regardless of time_step_min / time_step_max, matching upstream mlx-lm even when only one bound is supplied (#619).
- gated_delta masked Metal kernel variants: zero-init y[dv_idx] when the mask is false (#610).
- Tests: add max_kv_size field to the ServeArgs test fixture (#620).
- Address upstream-sync review follow-ups carried over from the v0.0.26 sync cycle.
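The speculative-rollback safety fix (#612) combines two ideas: check that the cache can be trimmed before rolling back, and hold the last prompt token out of prefill. A rough sketch of both follows, using a toy cache; ToyCache, rollback, and split_prefill are hypothetical names and do not mirror the real cache types.

```rust
/// Hypothetical stand-in for a KV cache, tracking one entry per token.
struct ToyCache {
    tokens: Vec<u32>,
    /// Some cache layouts cannot be shortened; model that as a flag.
    trimmable: bool,
}

impl ToyCache {
    /// Drop the last `rejected` entries after a speculative burst.
    /// Returns the number of entries kept, or an error when the rollback
    /// cannot be performed safely.
    fn rollback(&mut self, rejected: usize) -> Result<usize, String> {
        if !self.trimmable {
            return Err("cache does not support trimming".to_string());
        }
        if rejected > self.tokens.len() {
            return Err(format!(
                "cannot drop {rejected} of {} entries",
                self.tokens.len()
            ));
        }
        let keep = self.tokens.len() - rejected;
        self.tokens.truncate(keep);
        Ok(keep)
    }
}

/// Split a prompt into the part that is prefilled and the final token that
/// is held back, so there is always one token left to sample from.
fn split_prefill(prompt: &[u32]) -> Option<(&[u32], u32)> {
    let (last, head) = prompt.split_last()?;
    Some((head, *last))
}

fn main() {
    let prompt = [1, 2, 3, 4];
    let (head, last) = split_prefill(&prompt).expect("non-empty prompt");

    // Prefill everything except the reserved token, then feed that token
    // together with three speculative draft tokens.
    let mut cache = ToyCache { tokens: head.to_vec(), trimmable: true };
    cache.tokens.push(last);
    cache.tokens.extend([50, 51, 52]);

    // Suppose verification rejected the last two drafts.
    let kept = cache.rollback(2).expect("trimmable cache");
    println!("cache holds {kept} entries after rollback");
}
```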
CI/CD Improvements
- Bump GitHub Actions to Node 24 runtime to clear the Node 20 deprecation warning on macOS runners.
Technical Details
- MLX C++ upstream pin: unchanged from v0.0.26. Fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ continue to use the validated mlx::core::fast::metal_kernel, mlx::core::full, mlx::core::Shape, mlx::core::float32, mlx::core::int32, and metal::fast::exp surface from the previous cycle.
- Drafter dispatch architecture: the Drafter trait + DrafterKind enum lets a single scheduler dispatch path drive both Gemma 4 MTP (re-using parent layer 0–3 hidden states with cross-attention into a 4-layer drafter) and Qwen 3.5 DFlash (5-layer drafter + GDN cache rollback) without per-family branching in the hot loop. Burst dispatch decisions live in src/server/batch/speculative_burst.rs (speculative_dispatch::resolve).
- Speculative-burst integration with prompt cache: the donate path is symmetric with the classic path (donate_finished_sequence_cache) and applies to both the B=1 burst and the B>1 batched arm, so cache hit rates are unaffected by the dispatch decision.
- APC partial adoption invariant: the truncate_to operation is only invoked when apc_consistent_prefix_len < entry_len, so a full-prefix hit takes the existing fast path and incurs no slicing overhead.
- Downloader streaming: stream_file (outer) handles destination resolution, cache-hit detection, and atomic rename; stream_to_tempfile (inner) performs the network read into a NamedTempFile. The outer/inner split keeps the progress bar coverage clean and lets tests stub the network path.
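To make the drafter dispatch idea concrete, here is a minimal trait-object sketch. Only the names Drafter and DrafterKind and the flag values mtp and dflash come from these notes; the kind and propose methods, the ToyMtp and ToyDFlash types, make_drafter, and burst are assumptions for illustration, and the toy proposals are placeholders rather than real drafting logic.

```rust
use std::str::FromStr;

/// The two drafter families named in this release.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DrafterKind {
    Mtp,
    DFlash,
}

/// Parse the values accepted by `--draft-kind {dflash,mtp}`.
impl FromStr for DrafterKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "mtp" => Ok(Self::Mtp),
            "dflash" => Ok(Self::DFlash),
            other => Err(format!("unknown draft kind: {other}")),
        }
    }
}

/// Assumed shape of the trait; the real method set is not listed here.
trait Drafter {
    fn kind(&self) -> DrafterKind;
    /// Propose up to `block` draft tokens following `last`.
    fn propose(&mut self, last: u32, block: usize) -> Vec<u32>;
}

struct ToyMtp;
struct ToyDFlash;

impl Drafter for ToyMtp {
    fn kind(&self) -> DrafterKind {
        DrafterKind::Mtp
    }
    fn propose(&mut self, last: u32, block: usize) -> Vec<u32> {
        // Placeholder proposal: count upward from the last token.
        (1..=block).map(|i| last + i as u32).collect()
    }
}

impl Drafter for ToyDFlash {
    fn kind(&self) -> DrafterKind {
        DrafterKind::DFlash
    }
    fn propose(&mut self, last: u32, block: usize) -> Vec<u32> {
        // Placeholder proposal: repeat the last token.
        vec![last; block]
    }
}

/// The family is resolved once, outside the hot loop.
fn make_drafter(kind: DrafterKind) -> Box<dyn Drafter> {
    match kind {
        DrafterKind::Mtp => Box::new(ToyMtp),
        DrafterKind::DFlash => Box::new(ToyDFlash),
    }
}

/// One burst step: no per-family branching, only a trait call.
fn burst(drafter: &mut dyn Drafter, last: u32, block: usize) -> Vec<u32> {
    drafter.propose(last, block)
}

fn main() {
    let kind: DrafterKind = "dflash".parse().expect("valid draft kind");
    let mut drafter = make_drafter(kind);
    let drafts = burst(drafter.as_mut(), 7, 4);
    println!("{:?} proposed {:?}", drafter.kind(), drafts);
}
```

Resolving the family once in make_drafter and calling through the trait afterwards is the general pattern the bullet above describes; the real dispatch additionally has to manage target-side hooks and cache rollback, which this sketch leaves out.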
Dependencies
- No new direct dependency additions over v0.0.26. reqwest 0.12 (existing) is used for the streaming downloader path.
Breaking Changes
- None for end users. The new --draft-kind / --draft-block-size CLI flags default to off, so the classic non-speculative path is unchanged unless the user opts in.
Known Issues
- B=1 burst on the smallest dense Gemma 4 configurations (e.g. gemma-4-e2b-it-4bit) is currently bounded by per-iteration overhead; the singleton-burst short-circuit (#698) covers this on the hot path but the steady-state speedup for B=1 is checkpoint-dependent. See docs/speculative-decoding.md for tuning guidance.
- DFlash speed-up on Qwen 3.5 VLM text-only requests requires a real DFlash drafter checkpoint to be present alongside the target; without one, the path silently falls back to classic decode (this is intentional but can surprise users expecting a server-side warning).
What's Changed
- feat(server): APC block-level partial cache adoption in scheduler (#580) by @inureyes in lablup/mlxcel-internal#607
- fix(gemma4): skip k_proj/v_proj/k_norm load for KV-shared layers by @inureyes in lablup/mlxcel-internal#608
- feat(nemotron-h-nano-omni): port Parakeet audio encoder (#582) by @inureyes in lablup/mlxcel-internal#609
- fix(server): close TOCTOU race in video allowlist by passing fd to ffmpeg (#601) by @inureyes in lablup/mlxcel-internal#611
- fix(gated_delta): zero-init y[dv_idx] when mask is false in masked Metal kernel variants by @inureyes in lablup/mlxcel-internal#610
- fix(speculative): validate trimmable cache and reserve last token in prefill by @inureyes in lablup/mlxcel-internal#612
- feat(tokenizer): support multi-token think/tool-call sequences (#590) by @inureyes in lablup/mlxcel-internal#613
- fix(server): skip tool→normal transition when tool_call_end is empty by @inureyes in lablup/mlxcel-internal#614
- fix(server): preserve token positions when buffering parallel tool calls by @inureyes in lablup/mlxcel-internal#615
- fix(tool-parsers): MiniMax M2 parallel tool calling parser by @inureyes in lablup/mlxcel-internal#616
- fix(prompt_cache): add pop_prefixes to RadixTrie with correct immediate-prefix semantics by @inureyes in lablup/mlxcel-internal#617
- feat(server): expose --max-kv-size flag and tighten chat-completion response by @inureyes in lablup/mlxcel-internal#618
- fix(nemotron_h): default time_step_limit to (0.0, inf) regardless of time_step_min/max by @inureyes in lablup/mlxcel-internal#619
- fix(tests): add max_kv_size field to ServeArgs test fixture by @inureyes in lablup/mlxcel-internal#620
- fix: address upstream sync review follow-ups by @inureyes in lablup/mlxcel-internal#621
- feat(server): implement OpenAI Responses API (Phase 1) for /v1/responses (#622) by @inureyes in lablup/mlxcel-internal#623
- feat(download): show per-file + aggregate progress bars during mlxcel download (#648) by @inureyes in lablup/mlxcel-internal#649
- chore(tokenizer): replace .map_or(false, ...) with .is_some_and(...) for clippy 1.93 by @inureyes in lablup/mlxcel-internal#651
- chore(download): PR #649 security hardening (M1/M2/L1/L2/L3/L5/L6) by @inureyes in lablup/mlxcel-internal#652
- feat(speculative): introduce Drafter trait + DrafterKind enum + model_type auto-detection (#624) by @inureyes in lablup/mlxcel-internal#653
- feat(qwen3_5): expose return_hidden + capture_layer_ids + GDN-aware rollback_speculative_cache for DFlash (#634) by @inureyes in lablup/mlxcel-internal#654
- feat(gemma4): expose return_hidden / return_shared_kv / rollback_speculative_cache for MTP (#625) by @inureyes in lablup/mlxcel-internal#655
- feat(speculative): port MaskedEmbedder for Gemma 4 E2B / E4B drafters (#627) by @inureyes in lablup/mlxcel-internal#656
- feat(speculative): port drafter masks + normalize_batched_shared_kv_states for Gemma 4 MTP (#628) by @inureyes in lablup/mlxcel-internal#657
- feat(speculative): port DFlashDraftModel (5-layer drafter + DFlashAttention + DFlashKVCache) (#635) by @inureyes in lablup/mlxcel-internal#659
- feat(speculative): port Gemma4AssistantDraftModel (4-layer drafter + pre/post projections) (#626) by @inureyes in lablup/mlxcel-internal#658
- feat(speculative): implement DFlash round loop (single-batch path) (#636) by @inureyes in lablup/mlxcel-internal#661
- feat(speculative): implement MtpGenerator round loop (single-batch path) (#629) by @inureyes in lablup/mlxcel-internal#662
- feat(speculative): batched DFlash round loop with continuous batching and GDN-aware rollback (#637) by @inureyes in lablup/mlxcel-internal#663
- feat(cli): add --draft-kind {dflash,mtp} and --draft-block-size flags (#630) by @inureyes in lablup/mlxcel-internal#664
- feat(speculative): batched MTP round loop with continuous batching and left-padding normalization (#631) by @inureyes in lablup/mlxcel-internal#665
- test(speculative): greedy-parity + perf benchmarks for speculative drafter pairings (#632) by @inureyes in lablup/mlxcel-internal#667
- feat(gemma4_assistant): wire real MaskedEmbedder and make_drafter_masks (deferred from #626) by @inureyes in lablup/mlxcel-internal#668
- feat(server): wire speculative dispatch resolution and MtpTarget adapters (#666) by @inureyes in lablup/mlxcel-internal#669
- feat(server): wire speculative dispatch into scheduler via B=1 bursts (#670) by @inureyes in lablup/mlxcel-internal#671
- feat(server): cancellation propagation through MtpGenerator and DFlashGenerator by @inureyes in lablup/mlxcel-internal#681
- feat(server): thread sampling penalties through speculative-burst first sample by @inureyes in lablup/mlxcel-internal#682
- fix(core): DFlash drafter lazy-bind for upstream z-lab/Qwen3.5-4B-DFlash checkpoint by @inureyes in lablup/mlxcel-internal#683
- test(speculative): real-model byte-equality end-to-end test for speculative bursts by @inureyes in lablup/mlxcel-internal#685
- feat(server): logprobs support in speculative-burst path by @inureyes in lablup/mlxcel-internal#686
- feat(server): thinking-budget enforcement in speculative-burst path by @inureyes in lablup/mlxcel-internal#687
- feat(server): speculative-burst prompt-cache donate symmetric with classic path (#673) by @inureyes in lablup/mlxcel-internal#680
- feat(server): B>1 batched speculative dispatch via MtpBatchedGenerator/DFlashBatchedGenerator by @inureyes in lablup/mlxcel-internal#684
- feat(server): wire donate_finished_sequence_cache into B>1 batched speculative-burst arm by @inureyes in lablup/mlxcel-internal#689
- fix: harden speculative drafter epic follow-ups by @inureyes in lablup/mlxcel-internal#690
- fix: enable DFlash for Qwen3.5 VLM text requests by @inureyes in lablup/mlxcel-internal#694
- fix(perf): optimize Qwen3.5 DFlash greedy argmax by @inureyes in lablup/mlxcel-internal#695
- docs: refresh README to match current code and benchmarks by @inureyes in lablup/mlxcel-internal#700
- fix(perf): avoid slow Gemma 4 MTP singleton bursts by @inureyes in lablup/mlxcel-internal#698
- docs: update speculative decoding guidance by @inureyes in lablup/mlxcel-internal#702
Full Changelog: lablup/mlxcel-internal@v0.0.26...v0.0.27