v0.0.27

@inureyes inureyes released this 18 May 06:51

v0.0.27 — Speculative decoding & Responses API

This stable release closes the Gemma 4 MTP and Qwen 3.5 DFlash speculative-decoding epic with full server-side integration, adds the OpenAI Responses API (Phase 1), brings progress bars and a security-hardened core to mlxcel download, and introduces block-level partial cache adoption in the APC scheduler.

New Features

  • End-to-end speculative decoding for Gemma 4 MTP and Qwen 3.5 DFlash drafter families (epic #633, #632). New Drafter trait + DrafterKind enum + model_type auto-detection (#624). Ported drafter components: MaskedEmbedder for Gemma 4 E2B / E4B (#627), drafter masks (bidirectional full + sliding-window) and normalize_batched_shared_kv_states (#628), Gemma4AssistantDraftModel (4-layer drafter + pre/post projections) (#626), and DFlashDraftModel (5-layer drafter + DFlashAttention + DFlashKVCache) (#635). Target-side hooks: Gemma 4 return_hidden / return_shared_kv / rollback_speculative_cache (#625); Qwen 3.5 return_hidden + capture_layer_ids + GDN-aware rollback_speculative_cache (#634). Round loops: DFlash single-batch (#636), MtpGenerator single-batch (#629), batched DFlash with continuous batching + GDN-aware rollback (#637), and batched MTP with continuous batching + left-padding normalization (#631). Real-model byte-equality end-to-end tests (#685) and a greedy-parity + perf benchmark harness (#632).
  • Server speculative dispatch. Speculative dispatch resolution and MtpTarget adapters wired into the server (#666); the assistant model paths now plug into the real MaskedEmbedder and make_drafter_masks; speculative dispatch is wired into the scheduler via per-request B=1 bursts (#670) and a B>1 batched path via MtpBatchedGenerator / DFlashBatchedGenerator (#684). Per-request properties propagated through the speculative-burst path: cancellation propagation through MtpGenerator / DFlashGenerator (#681), token_history threading through the speculative-burst first sample (#682), logprobs support (#686), thinking-budget enforcement (#687), and prompt-cache donate symmetric with the classic path (#680) and into the B>1 batched arm (#689).
  • --draft-kind {dflash,mtp} and --draft-block-size CLI flags on both mlxcel and mlxcel-server (#630).
  • OpenAI Responses API (Phase 1) at /v1/responses for both binaries (#622, #623). Conversation store with shared-LRU semantics, response.created / response.in_progress / response.completed SSE event stream, reasoning-trace forwarding, response cancellation, and four new CLI flags. User guide at docs/responses-api.md.
  • APC block-level partial cache adoption in the scheduler (#580, #607). When APC is on, a prompt sharing the first N blocks with a cached entry but diverging at block N+1 reuses blocks 0..N and re-prefills only from the divergence boundary. APC-off retains bit-exact prior behaviour. Three components: DetachedKVCache::trim_to and DetachedCacheSet::truncate_to (FP16, INT8, Turbo4, Turbo4Delegated sidecars), relaxed PromptCacheStore containment gate, and Scheduler::try_adopt_cached_prefix truncate-on-adopt. Bench procedure in docs/apc-partial-adoption-bench.md.
  • Nemotron H Nano Omni audio modality (#582, #609). Parakeet/Conformer sound encoder, mel-spectrogram feature extractor, audio projector, and runtime path that merges audio token embeddings at sound_context_token_id and interleaves them with vision tokens. Loader applies the upstream sanitize_audio_weights transpose pass. Bring-up guide at docs/nemotron-h-nano-omni-audio-bringup.md.
  • mlxcel download / mlxcel-server download progress bars (#648, #649). New src/downloader/progress.rs module provides terminal-aware suppression, a MultiProgress factory, and 6 suppression unit tests. The downloader streams files via reqwest to a NamedTempFile and atomically renames into place.
  • Server --max-kv-size flag matching llama-server, plus a tightened chat-completion response envelope (#618).
  • Tokenizer support for multi-token think and tool-call sequences so chat templates that emit <think> / <tool_call> across multiple BPE tokens stream and parse correctly (#590, #613).
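
The APC partial-adoption behaviour described above (reuse blocks 0..N, re-prefill from the divergence boundary) can be illustrated with a minimal sketch. This is not the scheduler's code: `PrefixAdoption`, `plan_adoption`, and the token-slice representation are hypothetical names chosen for illustration, and the real implementation operates on KV cache blocks rather than raw token ids.

```rust
/// Hypothetical result of comparing a new prompt against one cached entry.
#[derive(Debug, PartialEq, Eq)]
pub struct PrefixAdoption {
    /// Number of leading full blocks that match and can be reused.
    pub shared_blocks: usize,
    /// Number of prompt tokens covered by those blocks; prefill resumes here.
    pub reuse_len: usize,
    /// True when the cached entry extends past the reused prefix and would
    /// need truncating before adoption (assumed to mirror the
    /// `apc_consistent_prefix_len < entry_len` condition).
    pub needs_truncate: bool,
}

/// Compare `prompt` and `cached` one full block at a time and stop at the
/// first block that differs. Trailing partial blocks are never reused.
pub fn plan_adoption(prompt: &[u32], cached: &[u32], block_size: usize) -> PrefixAdoption {
    assert!(block_size > 0, "block_size must be non-zero");
    let shared_blocks = prompt
        .chunks_exact(block_size)
        .zip(cached.chunks_exact(block_size))
        .take_while(|(a, b)| a == b)
        .count();
    let reuse_len = shared_blocks * block_size;
    PrefixAdoption {
        shared_blocks,
        reuse_len,
        needs_truncate: reuse_len < cached.len(),
    }
}
```

With a block size of 2, a prompt that matches a cached entry for three blocks and diverges in the fourth reuses 6 tokens and flags the entry for truncation. A prompt that covers the whole cached entry leaves `needs_truncate` false, which corresponds to the full-prefix fast path.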

Improvements

  • StreamFilter extended to handle multi-token markers and to reset state when a partial marker is broken by a non-marker token (#613).
  • Speculative drafter epic follow-ups hardened post-merge: fixes for miscellaneous invariants surfaced by integration testing against the real z-lab/Qwen3.5-4B-DFlash checkpoint and the mlx-community/gemma-4-* drafter variants.
  • README and speculative decoding guidance refreshed to match current code paths and the latest M1 Ultra / M5 Max benchmarks (#700).
  • Qwen 3.5 DFlash greedy-argmax decode-path optimization that drops the per-decode-step copy and an unnecessary argmax temporary, restoring decode tok/s on Qwen 3.5 32B / 9B DFlash configurations.
  • Avoid slow Gemma 4 MTP singleton bursts — the speculative-burst path now correctly short-circuits to the classic path when the batch size collapses to 1 with no draft tokens accepted, eliminating a per-step over-evaluation regression (#698).
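
To make the StreamFilter change concrete, here is a small sketch of matching a marker that spans several tokens and resetting when a partial match is broken. It is an illustration under assumptions, not the project's StreamFilter: `MarkerMatcher` and `Step` are hypothetical names, and markers are modelled as token-id sequences.

```rust
/// What the caller should do after feeding one token.
#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    /// Token is held back because it may be part of the marker.
    Pending,
    /// The full marker was seen; the held tokens are swallowed.
    Matched,
    /// No marker here: these tokens can be emitted as normal output.
    Flush(Vec<u32>),
}

pub struct MarkerMatcher {
    marker: Vec<u32>,
    /// How many leading marker tokens are currently held back.
    matched: usize,
}

impl MarkerMatcher {
    pub fn new(marker: Vec<u32>) -> Self {
        assert!(!marker.is_empty(), "marker must contain at least one token");
        Self { marker, matched: 0 }
    }

    pub fn push(&mut self, token: u32) -> Step {
        if token == self.marker[self.matched] {
            self.matched += 1;
            if self.matched == self.marker.len() {
                self.matched = 0;
                return Step::Matched;
            }
            return Step::Pending;
        }
        // The partial marker was broken by a non-marker token. Rebuild the
        // held tokens plus the new one, keep the longest tail that could
        // still start a marker, and flush everything before it.
        let mut buf = self.marker[..self.matched].to_vec();
        buf.push(token);
        let keep = (1..buf.len())
            .rev()
            .find(|&k| buf[buf.len() - k..] == self.marker[..k])
            .unwrap_or(0);
        self.matched = keep;
        buf.truncate(buf.len() - keep);
        Step::Flush(buf)
    }
}
```

Feeding a marker of `[10, 11, 12]` the tokens `10, 11, 5` holds back the first two and then flushes all three once `5` breaks the match, so no output is lost when a sequence only looks like the start of a marker.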

Bug Fixes

  • DFlash drafter lazy-bind for the upstream z-lab/Qwen3.5-4B-DFlash checkpoint — Drafter::bind was previously not called on the DFlash family, causing an internal cache mis-binding on the first speculative burst. The drafter now performs lazy-bind on first use, matching the MTP path (#683).
  • Enable DFlash for Qwen 3.5 VLM text requests — pure-text generations against a Qwen 3.5 VLM checkpoint can now resolve a DFlash drafter when one is available, instead of silently falling back to the classic path (#694).
  • Speculative-rollback safety: validate trimmable cache and reserve the last token in prefill so a rolled-back speculative burst always lands on a valid sampling boundary (#612).
  • Prompt cache RadixTrie: pop_prefixes now uses correct immediate-prefix semantics (#617).
  • MiniMax M2 parallel tool calling parser correctly emits one ChatToolCall per parallel call instead of merging them into a single call (#616).
  • Server tool-call buffering: preserve token positions when buffering parallel tool calls (#615); skip the tool→normal transition when tool_call_end is empty so streaming continues correctly for templates without an explicit close marker (#614).
  • video_url allowlist TOCTOU race closed by passing the resolved OwnedFd to ffmpeg via /dev/fd/N instead of re-opening the path inside the subprocess. Symlink swaps between the metadata check and the subsequent open can no longer mis-route the subprocess (#601, #611).
  • Gemma 4: skip k_proj / v_proj / k_norm weight load for KV-shared layers — the previous load step would error out on real Gemma 4 E2B / E4B checkpoints that omit these tensors per KV-shared design (#608).
  • Nemotron-H: default time_step_limit to (0.0, +inf) regardless of time_step_min / time_step_max, matching upstream mlx-lm even when only one bound is supplied (#619).
  • gated_delta masked Metal kernel variants: zero-init y[dv_idx] when the mask is false (#610).
  • Tests: add max_kv_size field to the ServeArgs test fixture (#620).
  • Address upstream-sync review follow-ups carried over from the v0.0.26 sync cycle.
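
As an illustration of the Nemotron-H time_step_limit fix listed above, the following sketch shows one way the defaulting rule could look. The struct, its field types, and the rule that an explicitly supplied time_step_limit wins are assumptions for illustration only. The release note states only that the default is (0.0, +inf) regardless of time_step_min / time_step_max.

```rust
/// Hypothetical subset of a Nemotron-H config relevant to the time-step clamp.
#[derive(Debug, Default)]
pub struct TimeStepConfig {
    /// Explicit clamp range, if the checkpoint config provides one (assumed).
    pub time_step_limit: Option<(f32, f32)>,
    pub time_step_min: Option<f32>,
    pub time_step_max: Option<f32>,
}

/// Resolve the clamp range. When no explicit limit is present the default is
/// (0.0, +inf); `time_step_min` and `time_step_max` are deliberately ignored,
/// even if only one of them is supplied.
pub fn resolve_time_step_limit(cfg: &TimeStepConfig) -> (f32, f32) {
    cfg.time_step_limit.unwrap_or((0.0, f32::INFINITY))
}
```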

CI/CD Improvements

  • Bump GitHub Actions to Node 24 runtime to clear the Node 20 deprecation warning on macOS runners.

Technical Details

  • MLX C++ upstream pin: unchanged from v0.0.26 — fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ continue to use the validated mlx::core::fast::metal_kernel, mlx::core::full, mlx::core::Shape, mlx::core::float32, mlx::core::int32, and metal::fast::exp surface from the previous cycle.
  • Drafter dispatch architecture: the Drafter trait + DrafterKind enum lets a single scheduler dispatch path drive both Gemma 4 MTP (re-using parent layer 0–3 hidden states with cross-attention into a 4-layer drafter) and Qwen 3.5 DFlash (5-layer drafter + GDN cache rollback) without per-family branching in the hot loop. Burst dispatch decisions live in src/server/batch/speculative_burst.rs (speculative_dispatch::resolve).
  • Speculative-burst integration with prompt cache: the donate path is symmetric with the classic path (donate_finished_sequence_cache) and applies to both the B=1 burst and the B>1 batched arm, so cache hit rates are unaffected by the dispatch decision.
  • APC partial adoption invariant: the truncate_to operation is only invoked when apc_consistent_prefix_len < entry_len, so a full-prefix hit takes the existing fast path and incurs no slicing overhead.
  • Downloader streaming: stream_file (outer) handles destination resolution, cache-hit detection, and atomic rename; stream_to_tempfile (inner) performs the network read into a NamedTempFile. The outer/inner split keeps progress-bar coverage clean and lets tests stub the network path.
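
The outer/inner downloader split can be sketched with the standard library alone. The function names are borrowed from the note above, but the signatures, the ".partial" suffix, and the use of a generic `Read` in place of a reqwest response body and a NamedTempFile are all assumptions made to keep the example self-contained.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Inner step (sketch): copy bytes from `source` into a temporary file that
/// sits next to `dest`, so the later rename stays on one filesystem.
fn stream_to_tempfile<R: Read>(mut source: R, dest: &Path) -> io::Result<PathBuf> {
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".partial"); // hypothetical suffix
    let tmp = PathBuf::from(tmp_name);
    let mut file = File::create(&tmp)?;
    io::copy(&mut source, &mut file)?;
    file.sync_all()?;
    Ok(tmp)
}

/// Outer step (sketch): skip the download on a cache hit, otherwise stream
/// to a temporary file and rename it into place.
/// Returns `true` when new data was written, `false` on a cache hit.
pub fn stream_file<R: Read>(source: R, dest: &Path) -> io::Result<bool> {
    if dest.exists() {
        return Ok(false);
    }
    if let Some(parent) = dest.parent() {
        fs::create_dir_all(parent)?;
    }
    let tmp = stream_to_tempfile(source, dest)?;
    // Rename is the commit point: readers see either no file or the
    // complete file, never a half-written one.
    fs::rename(&tmp, dest)?;
    Ok(true)
}
```

Because the inner function accepts any `Read`, a test can pass an in-memory byte slice instead of a network stream, which is the kind of stubbing the split is meant to allow.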

Dependencies

  • No new direct dependency additions over v0.0.26. reqwest 0.12 (existing) is used for the streaming downloader path.

Breaking Changes

  • None for end users. The new --draft-kind / --draft-block-size CLI flags default to off, so the classic non-speculative path is unchanged unless the user opts in.
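
The opt-in behaviour can be pictured as an optional setting that stays unset unless the flag is passed. The sketch below is an assumption-laden illustration: the variant names, `draft_kind_from_args`, and the hand-rolled argument scan are hypothetical and do not reflect how mlxcel actually parses its command line. Only the flag name and its two accepted values come from this release.

```rust
use std::str::FromStr;

/// Drafter families selectable via `--draft-kind` (variant names assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DrafterKind {
    Dflash,
    Mtp,
}

impl FromStr for DrafterKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dflash" => Ok(DrafterKind::Dflash),
            "mtp" => Ok(DrafterKind::Mtp),
            other => Err(format!("unknown draft kind: {other}")),
        }
    }
}

/// Look for `--draft-kind <value>` in `args`. `Ok(None)` means the flag was
/// not given, so the classic non-speculative path would be used.
pub fn draft_kind_from_args(args: &[&str]) -> Result<Option<DrafterKind>, String> {
    match args.iter().position(|a| *a == "--draft-kind") {
        None => Ok(None),
        Some(i) => {
            let value = args
                .get(i + 1)
                .ok_or_else(|| "--draft-kind requires a value".to_string())?;
            value.parse::<DrafterKind>().map(Some)
        }
    }
}
```

Modelling the setting as `Option<DrafterKind>` keeps "flag absent" distinct from any drafter choice, which is one simple way to guarantee that existing invocations are unaffected.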

Known Issues

  • B=1 burst on the smallest dense Gemma 4 configurations (e.g. gemma-4-e2b-it-4bit) is currently bounded by per-iteration overhead; the singleton-burst short-circuit (#698) covers this on the hot path but the steady-state speedup for B=1 is checkpoint-dependent. See docs/speculative-decoding.md for tuning guidance.
  • DFlash speed-up on Qwen 3.5 VLM text-only requests requires a real DFlash drafter checkpoint to be present alongside the target; without one, the path silently falls back to classic decode (this is intentional but can surprise users expecting a server-side warning).

What's Changed

Full Changelog: lablup/mlxcel-internal@v0.0.26...v0.0.27