Release v0.0.27 · lablup/mlxcel

v0.0.27 — Speculative decoding & Responses API

This stable release closes the Gemma 4 MTP and Qwen 3.5 DFlash speculative-decoding epic with full server-side integration, adds the OpenAI Responses API (Phase 1), brings progress bars + a security-hardened core to mlxcel download, and adds block-level partial cache adoption to the APC scheduler.

New Features

End-to-end speculative decoding for Gemma 4 MTP and Qwen 3.5 DFlash drafter families (epic #633, #632). New Drafter trait + DrafterKind enum + model_type auto-detection (#624). Ported drafter components: MaskedEmbedder for Gemma 4 E2B / E4B (#627), drafter masks (bidirectional full + sliding-window) and normalize_batched_shared_kv_states (#628), Gemma4AssistantDraftModel (4-layer drafter + pre/post projections) (#626), and DFlashDraftModel (5-layer drafter + DFlashAttention + DFlashKVCache) (#635). Target-side hooks: Gemma 4 return_hidden / return_shared_kv / rollback_speculative_cache (#625); Qwen 3.5 return_hidden + capture_layer_ids + GDN-aware rollback_speculative_cache (#634). Round loops: DFlash single-batch (#636), MtpGenerator single-batch (#629), batched DFlash with continuous batching + GDN-aware rollback (#637), and batched MTP with continuous batching + left-padding normalization (#631). Real-model byte-equality end-to-end tests (#685) and a greedy-parity + perf benchmark harness (#632).
Server speculative dispatch. Speculative dispatch resolution and MtpTarget adapters are wired into the server (#666); the assistant model paths now plug into the real MaskedEmbedder and make_drafter_masks (#668); speculative dispatch is wired into the scheduler via per-request B=1 bursts (#670), with a B>1 batched path via MtpBatchedGenerator / DFlashBatchedGenerator (#684). Per-request properties are propagated through the speculative-burst path: cancellation through MtpGenerator / DFlashGenerator (#681), token_history threading through the speculative-burst first sample (#682), logprobs support (#686), thinking-budget enforcement (#687), and prompt-cache donation symmetric with the classic path (#680), extended to the B>1 batched arm (#689).
--draft-kind {dflash,mtp} and --draft-block-size CLI flags on both mlxcel and mlxcel-server (#630).
OpenAI Responses API (Phase 1) at /v1/responses for both binaries (#622, #623). Conversation store with shared-LRU semantics, response.created / response.in_progress / response.completed SSE event stream, reasoning-trace forwarding, response cancellation, and four new CLI flags. User guide at docs/responses-api.md.
APC block-level partial cache adoption in the scheduler (#580, #607). When APC is on, a prompt that shares its first N blocks with a cached entry and diverges in the next block reuses those N blocks and re-prefills only from the divergence boundary. With APC off, behaviour remains bit-exact with previous releases. Three components: DetachedKVCache::trim_to and DetachedCacheSet::truncate_to (FP16, INT8, Turbo4, Turbo4Delegated sidecars), a relaxed PromptCacheStore containment gate, and Scheduler::try_adopt_cached_prefix truncate-on-adopt. Bench procedure in docs/apc-partial-adoption-bench.md.
Nemotron H Nano Omni audio modality (#582, #609). Parakeet/Conformer sound encoder, mel-spectrogram feature extractor, audio projector, and runtime path that merges audio token embeddings at sound_context_token_id and interleaves them with vision tokens. Loader applies the upstream sanitize_audio_weights transpose pass. Bring-up guide at docs/nemotron-h-nano-omni-audio-bringup.md.
mlxcel download / mlxcel-server download progress bars (#648, #649). New src/downloader/progress.rs module provides terminal-aware suppression, a MultiProgress factory, and 6 suppression unit tests. The downloader streams files via reqwest to a NamedTempFile and atomically renames into place.
Server --max-kv-size flag matching llama-server, plus a tightened chat-completion response envelope (#618).
Tokenizer support for multi-token think and tool-call sequences so chat templates that emit <think> / <tool_call> across multiple BPE tokens stream and parse correctly (#590, #613).
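
The APC partial-adoption entry above can be illustrated with a minimal sketch. It is not the scheduler's code: CachedEntry, adoptable_prefix_len, and adopt are hypothetical names, the cache is modelled as a plain token vector, and only the name truncate_to is borrowed from the notes. It assumes that only whole blocks can be reused, so the shared prefix is rounded down to a block boundary.

```rust
/// Number of leading tokens a new prompt could adopt from a cached entry,
/// rounded down to a whole number of blocks (hypothetical helper).
fn adoptable_prefix_len(prompt: &[u32], cached: &[u32], block_size: usize) -> usize {
    assert!(block_size > 0, "block_size must be non-zero");
    // Length of the longest common token prefix.
    let common = prompt
        .iter()
        .zip(cached.iter())
        .take_while(|(a, b)| a == b)
        .count();
    // Assumption: a partially matching block is not reused and is re-prefilled.
    (common / block_size) * block_size
}

/// Hypothetical stand-in for a cached KV entry: one slot per cached token.
struct CachedEntry {
    tokens: Vec<u32>,
}

impl CachedEntry {
    /// Drop everything past `len` tokens (the truncate-on-adopt step).
    fn truncate_to(&mut self, len: usize) {
        self.tokens.truncate(len);
    }
}

/// Adopt the shared prefix and return the prompt suffix that still needs prefill.
fn adopt<'a>(prompt: &'a [u32], entry: &mut CachedEntry, block_size: usize) -> &'a [u32] {
    let keep = adoptable_prefix_len(prompt, &entry.tokens, block_size);
    // Truncate only when the shared prefix is shorter than the entry,
    // so a full-prefix hit skips the slicing work entirely.
    if keep < entry.tokens.len() {
        entry.truncate_to(keep);
    }
    &prompt[keep..]
}
```

With a block size of 4, a prompt that matches a cached entry for six tokens adopts the first four and re-prefills from the fifth token onward.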

Improvements

StreamFilter extended to handle multi-token markers and to reset state when a partial marker is broken by a non-marker token (#613).
Post-merge hardening of the speculative drafter epic follow-ups, covering miscellaneous invariants surfaced by integration testing against the real z-lab/Qwen3.5-4B-DFlash checkpoint and the mlx-community/gemma-4-* drafter variants (#690).
README and speculative decoding guidance refreshed to match current code paths and the latest M1 Ultra / M5 Max benchmarks (#700, #702).
Qwen 3.5 DFlash greedy-argmax decode-path optimization that drops the per-decode-step copy and an unnecessary argmax temporary, restoring decode tok/s on Qwen 3.5 32B / 9B DFlash configurations (#695).
Avoid slow Gemma 4 MTP singleton bursts — the speculative-burst path now correctly short-circuits to the classic path when the batch size collapses to 1 with no draft tokens accepted, eliminating a per-step over-evaluation regression (#698).
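
The StreamFilter behaviour described above (matching a marker that spans several tokens and resetting when a partial match is broken) can be sketched as follows. This is an illustrative toy, not the project's StreamFilter: MarkerFilter, Event, and the decoded-piece representation are assumptions, and the restart logic is deliberately naive rather than a full KMP-style matcher.

```rust
/// What the filter emits for each incoming piece (hypothetical type).
#[derive(Debug, PartialEq)]
enum Event {
    /// Ordinary text that should be streamed to the client.
    Text(String),
    /// The complete multi-token marker was seen.
    Marker,
}

/// Toy filter for one marker split across several decoded token pieces,
/// e.g. ["<", "think", ">"].
struct MarkerFilter {
    marker: Vec<String>,
    /// Pieces held back while they still match a prefix of the marker.
    held: Vec<String>,
}

impl MarkerFilter {
    fn new(marker: &[&str]) -> Self {
        assert!(!marker.is_empty(), "marker must have at least one piece");
        Self {
            marker: marker.iter().map(|s| s.to_string()).collect(),
            held: Vec::new(),
        }
    }

    /// Feed one decoded piece; returns zero or more events.
    fn push(&mut self, piece: &str) -> Vec<Event> {
        let mut out = Vec::new();
        // Case 1: the piece extends the current partial match.
        if piece == self.marker[self.held.len()] {
            self.held.push(piece.to_string());
            if self.held.len() == self.marker.len() {
                self.held.clear();
                out.push(Event::Marker);
            }
            return out;
        }
        // Case 2: a partial match was broken by a non-marker piece.
        // Flush what was held back as plain text and reset the state.
        if !self.held.is_empty() {
            out.push(Event::Text(self.held.concat()));
            self.held.clear();
            // The breaking piece may itself start a new marker.
            if piece == self.marker[0] {
                self.held.push(piece.to_string());
                return out;
            }
        }
        out.push(Event::Text(piece.to_string()));
        out
    }
}
```

Because the restart is naive, this sketch is only adequate for markers whose first piece does not recur inside the marker, which holds for sequences such as <think> or <tool_call>.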

Bug Fixes

DFlash drafter lazy-bind for the upstream z-lab/Qwen3.5-4B-DFlash checkpoint — Drafter::bind was previously not called on the DFlash family, causing an internal cache mis-binding on the first speculative burst. The drafter now performs lazy-bind on first use, matching the MTP path (#683).
Enable DFlash for Qwen 3.5 VLM text requests — pure-text generations against a Qwen 3.5 VLM checkpoint can now resolve a DFlash drafter when one is available, instead of silently falling back to the classic path (#694).
Speculative-rollback safety: validate trimmable cache and reserve the last token in prefill so a rolled-back speculative burst always lands on a valid sampling boundary (#612).
Prompt cache RadixTrie: pop_prefixes now uses correct immediate-prefix semantics (#617).
MiniMax M2 parallel tool calling parser correctly emits one ChatToolCall per parallel call instead of merging them into a single call (#616).
Server tool-call buffering: preserve token positions when buffering parallel tool calls (#615); skip the tool→normal transition when tool_call_end is empty so streaming continues correctly for templates without an explicit close marker (#614).
video_url allowlist TOCTOU race closed by passing the resolved OwnedFd to ffmpeg via /dev/fd/N instead of re-opening the path inside the subprocess. Symlink swaps between metadata and the subsequent open now cannot mis-route the subprocess (#601, #611).
Gemma 4: skip k_proj / v_proj / k_norm weight load for KV-shared layers — the previous load step would error out on real Gemma 4 E2B / E4B checkpoints that omit these tensors per KV-shared design (#608).
Nemotron-H: default time_step_limit to (0.0, +inf) regardless of time_step_min / time_step_max, matching upstream mlx-lm even when only one bound is supplied (#619).
gated_delta masked Metal kernel variants: zero-init y[dv_idx] when the mask is false (#610).
Tests: add max_kv_size field to the ServeArgs test fixture (#620).
Address upstream-sync review follow-ups carried over from the v0.0.26 sync cycle (#621).
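
One plausible reading of the speculative-rollback fix (#612) is sketched below. The names split_for_prefill and SpecCache, the length-only cache model, and the exact checks are assumptions made for illustration; the real cache types and validation live in the project and may differ.

```rust
/// Split a prompt into the part that is prefilled up front and the single
/// token held back, so the first decode step always has a token to sample from.
/// Returns None for an empty prompt.
fn split_for_prefill(prompt: &[u32]) -> Option<(&[u32], u32)> {
    let (last, head) = prompt.split_last()?;
    Some((head, *last))
}

/// Hypothetical cache model: tracks only its length and whether it can be trimmed.
struct SpecCache {
    len: usize,
    trimmable: bool,
}

impl SpecCache {
    /// Undo the cache positions written for rejected draft tokens.
    /// `drafted` is the number of speculative positions written this burst,
    /// `accepted` is how many of them the target model confirmed.
    fn rollback(&mut self, drafted: usize, accepted: usize) -> Result<(), String> {
        if accepted > drafted {
            return Err("accepted exceeds drafted".to_string());
        }
        let rejected = drafted - accepted;
        if rejected == 0 {
            // Nothing to undo; even a non-trimmable cache is fine here.
            return Ok(());
        }
        if !self.trimmable {
            return Err("cache is not trimmable; speculation cannot roll back".to_string());
        }
        if rejected > self.len {
            return Err("rollback would trim past the start of the cache".to_string());
        }
        self.len -= rejected;
        Ok(())
    }
}
```

In this sketch a burst that drafts four tokens and has one accepted trims three positions, while a non-trimmable cache rejects the rollback and keeps its length unchanged.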

CI/CD Improvements

Bump GitHub Actions to Node 24 runtime to clear the Node 20 deprecation warning on macOS runners.

Technical Details

MLX C++ upstream pin: unchanged from v0.0.26 — fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ continue to use the validated mlx::core::fast::metal_kernel, mlx::core::full, mlx::core::Shape, mlx::core::float32, mlx::core::int32, and metal::fast::exp surface from the previous cycle.
Drafter dispatch architecture: the Drafter trait + DrafterKind enum lets a single scheduler dispatch path drive both Gemma 4 MTP (re-using parent layer 0–3 hidden states with cross-attention into a 4-layer drafter) and Qwen 3.5 DFlash (5-layer drafter + GDN cache rollback) without per-family branching in the hot loop. Burst dispatch decisions live in src/server/batch/speculative_burst.rs (speculative_dispatch::resolve).
Speculative-burst integration with prompt cache: the donate path is symmetric with the classic path (donate_finished_sequence_cache) and applies to both the B=1 burst and the B>1 batched arm, so cache hit rates are unaffected by the dispatch decision.
APC partial adoption invariant: the truncate_to operation is only invoked when apc_consistent_prefix_len < entry_len, so a full-prefix hit takes the existing fast path and incurs no slicing overhead.
Downloader streaming: stream_file (outer) handles destination resolution, cache-hit detection, and atomic rename; stream_to_tempfile (inner) performs the network read into a NamedTempFile. Outer/inner split keeps the progress bar coverage clean and lets tests stub the network path.
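
The drafter dispatch idea above can be sketched as a trait plus a family tag, with one round function that never branches on the family. The method signatures, ToyDrafter, and speculative_round are hypothetical; only the names Drafter and DrafterKind come from the notes. For clarity the sketch queries the target once per token, whereas a real implementation would presumably verify the whole draft block in a single forward pass.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DrafterKind {
    Mtp,
    DFlash,
}

/// Assumed shape of the drafter interface (signatures are illustrative).
trait Drafter {
    fn kind(&self) -> DrafterKind;
    /// Propose up to `block_size` tokens continuing after `context`.
    fn draft(&mut self, context: &[u32], block_size: usize) -> Vec<u32>;
}

/// One greedy speculative round. Drafted tokens are accepted while they match
/// the target's greedy choice; at the first mismatch the target's token is
/// appended instead. If every draft is accepted, the target adds one more token.
/// Returns the number of accepted draft tokens.
fn speculative_round(
    drafter: &mut dyn Drafter,
    target_next: &dyn Fn(&[u32]) -> u32,
    context: &mut Vec<u32>,
    block_size: usize,
) -> usize {
    let proposal = drafter.draft(context.as_slice(), block_size);
    let mut accepted = 0;
    for tok in proposal {
        let expected = target_next(context.as_slice());
        if tok != expected {
            context.push(expected); // correction from the target
            return accepted;
        }
        context.push(tok);
        accepted += 1;
    }
    let bonus = target_next(context.as_slice());
    context.push(bonus);
    accepted
}

/// Toy drafter for demonstration: guesses "previous token + 1" and can be
/// told to guess wrong at one position inside the block.
struct ToyDrafter {
    kind: DrafterKind,
    wrong_at: usize,
}

impl Drafter for ToyDrafter {
    fn kind(&self) -> DrafterKind {
        self.kind
    }

    fn draft(&mut self, context: &[u32], block_size: usize) -> Vec<u32> {
        let last = *context.last().unwrap_or(&0);
        (0..block_size)
            .map(|i| {
                let guess = last + 1 + i as u32;
                if i == self.wrong_at { guess + 100 } else { guess }
            })
            .collect()
    }
}
```

Every token appended in a round is the target's own greedy choice at that position, so the output matches plain greedy decoding regardless of which drafter family is plugged in. That is the property the greedy-parity tests check.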

Dependencies

No new direct dependency additions over v0.0.26. reqwest 0.12 (existing) is used for the streaming downloader path.

Breaking Changes

None for end users. The new --draft-kind / --draft-block-size CLI flags default to off, so the classic non-speculative path is unchanged unless the user opts in.
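
The opt-in behaviour can be pictured with a small hand-rolled parser. It is only an illustration: the real binaries' argument parsing is not shown in these notes, and DraftKind and parse_draft_kind are hypothetical names. The point is that an absent flag resolves to no drafter, which corresponds to the unchanged classic path.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DraftKind {
    DFlash,
    Mtp,
}

/// Look for `--draft-kind <value>` in an argument list.
/// Ok(None) means the flag was not given, i.e. speculation stays off.
fn parse_draft_kind(args: &[&str]) -> Result<Option<DraftKind>, String> {
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        if *arg == "--draft-kind" {
            return match iter.next().copied() {
                Some("dflash") => Ok(Some(DraftKind::DFlash)),
                Some("mtp") => Ok(Some(DraftKind::Mtp)),
                Some(other) => Err(format!("unknown draft kind: {other}")),
                None => Err("--draft-kind requires a value".to_string()),
            };
        }
    }
    Ok(None)
}
```

Under these assumptions, an invocation without --draft-kind yields Ok(None), and only an explicit dflash or mtp value selects a drafter family.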

Known Issues

B=1 burst on the smallest dense Gemma 4 configurations (e.g. gemma-4-e2b-it-4bit) is currently bounded by per-iteration overhead; the singleton-burst short-circuit (#698) covers this on the hot path but the steady-state speedup for B=1 is checkpoint-dependent. See docs/speculative-decoding.md for tuning guidance.
DFlash speed-up on Qwen 3.5 VLM text-only requests requires a real DFlash drafter checkpoint to be present alongside the target; without one, the path silently falls back to classic decode (this is intentional but can surprise users expecting a server-side warning).

What's Changed

feat(server): APC block-level partial cache adoption in scheduler (#580) by @inureyes in lablup/mlxcel-internal#607
fix(gemma4): skip k_proj/v_proj/k_norm load for KV-shared layers by @inureyes in lablup/mlxcel-internal#608
feat(nemotron-h-nano-omni): port Parakeet audio encoder (#582) by @inureyes in lablup/mlxcel-internal#609
fix(server): close TOCTOU race in video allowlist by passing fd to ffmpeg (#601) by @inureyes in lablup/mlxcel-internal#611
fix(gated_delta): zero-init y[dv_idx] when mask is false in masked Metal kernel variants by @inureyes in lablup/mlxcel-internal#610
fix(speculative): validate trimmable cache and reserve last token in prefill by @inureyes in lablup/mlxcel-internal#612
feat(tokenizer): support multi-token think/tool-call sequences (#590) by @inureyes in lablup/mlxcel-internal#613
fix(server): skip tool→normal transition when tool_call_end is empty by @inureyes in lablup/mlxcel-internal#614
fix(server): preserve token positions when buffering parallel tool calls by @inureyes in lablup/mlxcel-internal#615
fix(tool-parsers): MiniMax M2 parallel tool calling parser by @inureyes in lablup/mlxcel-internal#616
fix(prompt_cache): add pop_prefixes to RadixTrie with correct immediate-prefix semantics by @inureyes in lablup/mlxcel-internal#617
feat(server): expose --max-kv-size flag and tighten chat-completion response by @inureyes in lablup/mlxcel-internal#618
fix(nemotron_h): default time_step_limit to (0.0, inf) regardless of time_step_min/max by @inureyes in lablup/mlxcel-internal#619
fix(tests): add max_kv_size field to ServeArgs test fixture by @inureyes in lablup/mlxcel-internal#620
fix: address upstream sync review follow-ups by @inureyes in lablup/mlxcel-internal#621
feat(server): implement OpenAI Responses API (Phase 1) for /v1/responses (#622) by @inureyes in lablup/mlxcel-internal#623
feat(download): show per-file + aggregate progress bars during mlxcel download (#648) by @inureyes in lablup/mlxcel-internal#649
chore(tokenizer): replace .map_or(false, ...) with .is_some_and(...) for clippy 1.93 by @inureyes in lablup/mlxcel-internal#651
chore(download): PR #649 security hardening (M1/M2/L1/L2/L3/L5/L6) by @inureyes in lablup/mlxcel-internal#652
feat(speculative): introduce Drafter trait + DrafterKind enum + model_type auto-detection (#624) by @inureyes in lablup/mlxcel-internal#653
feat(qwen3_5): expose return_hidden + capture_layer_ids + GDN-aware rollback_speculative_cache for DFlash (#634) by @inureyes in lablup/mlxcel-internal#654
feat(gemma4): expose return_hidden / return_shared_kv / rollback_speculative_cache for MTP (#625) by @inureyes in lablup/mlxcel-internal#655
feat(speculative): port MaskedEmbedder for Gemma 4 E2B / E4B drafters (#627) by @inureyes in lablup/mlxcel-internal#656
feat(speculative): port drafter masks + normalize_batched_shared_kv_states for Gemma 4 MTP (#628) by @inureyes in lablup/mlxcel-internal#657
feat(speculative): port DFlashDraftModel (5-layer drafter + DFlashAttention + DFlashKVCache) (#635) by @inureyes in lablup/mlxcel-internal#659
feat(speculative): port Gemma4AssistantDraftModel (4-layer drafter + pre/post projections) (#626) by @inureyes in lablup/mlxcel-internal#658
feat(speculative): implement DFlash round loop (single-batch path) (#636) by @inureyes in lablup/mlxcel-internal#661
feat(speculative): implement MtpGenerator round loop (single-batch path) (#629) by @inureyes in lablup/mlxcel-internal#662
feat(speculative): batched DFlash round loop with continuous batching and GDN-aware rollback (#637) by @inureyes in lablup/mlxcel-internal#663
feat(cli): add --draft-kind {dflash,mtp} and --draft-block-size flags (#630) by @inureyes in lablup/mlxcel-internal#664
feat(speculative): batched MTP round loop with continuous batching and left-padding normalization (#631) by @inureyes in lablup/mlxcel-internal#665
test(speculative): greedy-parity + perf benchmarks for speculative drafter pairings (#632) by @inureyes in lablup/mlxcel-internal#667
feat(gemma4_assistant): wire real MaskedEmbedder and make_drafter_masks (deferred from #626) by @inureyes in lablup/mlxcel-internal#668
feat(server): wire speculative dispatch resolution and MtpTarget adapters (#666) by @inureyes in lablup/mlxcel-internal#669
feat(server): wire speculative dispatch into scheduler via B=1 bursts (#670) by @inureyes in lablup/mlxcel-internal#671
feat(server): cancellation propagation through MtpGenerator and DFlashGenerator by @inureyes in lablup/mlxcel-internal#681
feat(server): thread sampling penalties through speculative-burst first sample by @inureyes in lablup/mlxcel-internal#682
fix(core): DFlash drafter lazy-bind for upstream z-lab/Qwen3.5-4B-DFlash checkpoint by @inureyes in lablup/mlxcel-internal#683
test(speculative): real-model byte-equality end-to-end test for speculative bursts by @inureyes in lablup/mlxcel-internal#685
feat(server): logprobs support in speculative-burst path by @inureyes in lablup/mlxcel-internal#686
feat(server): thinking-budget enforcement in speculative-burst path by @inureyes in lablup/mlxcel-internal#687
feat(server): speculative-burst prompt-cache donate symmetric with classic path (#673) by @inureyes in lablup/mlxcel-internal#680
feat(server): B>1 batched speculative dispatch via MtpBatchedGenerator/DFlashBatchedGenerator by @inureyes in lablup/mlxcel-internal#684
feat(server): wire donate_finished_sequence_cache into B>1 batched speculative-burst arm by @inureyes in lablup/mlxcel-internal#689
fix: harden speculative drafter epic follow-ups by @inureyes in lablup/mlxcel-internal#690
fix: enable DFlash for Qwen3.5 VLM text requests by @inureyes in lablup/mlxcel-internal#694
fix(perf): optimize Qwen3.5 DFlash greedy argmax by @inureyes in lablup/mlxcel-internal#695
docs: refresh README to match current code and benchmarks by @inureyes in lablup/mlxcel-internal#700
fix(perf): avoid slow Gemma 4 MTP singleton bursts by @inureyes in lablup/mlxcel-internal#698
docs: update speculative decoding guidance by @inureyes in lablup/mlxcel-internal#702

Full Changelog: lablup/mlxcel-internal@v0.0.26...v0.0.27
