Releases: lablup/mlxcel

v0.0.27

18 May 06:51

v0.0.27 — Speculative decoding & Responses API

This stable release closes the Gemma 4 MTP and Qwen 3.5 DFlash speculative-decoding epic with full server-side integration, adds the OpenAI Responses API (Phase 1), brings progress bars + a security-hardened core to mlxcel download, and adds block-level partial cache adoption to the APC scheduler.

New Features

  • End-to-end speculative decoding for Gemma 4 MTP and Qwen 3.5 DFlash drafter families (epic #633, #632):
    • New Drafter trait + DrafterKind enum + model_type auto-detection (#624)
    • Ported drafter components: MaskedEmbedder for Gemma 4 E2B / E4B (#627), drafter masks (bidirectional full + sliding-window) and normalize_batched_shared_kv_states (#628), Gemma4AssistantDraftModel (4-layer drafter + pre/post projections) (#626), and DFlashDraftModel (5-layer drafter + DFlashAttention + DFlashKVCache) (#635)
    • Target-side hooks: Gemma 4 return_hidden / return_shared_kv / rollback_speculative_cache (#625); Qwen 3.5 return_hidden + capture_layer_ids + GDN-aware rollback_speculative_cache (#634)
    • Round loops: DFlash single-batch (#636), MtpGenerator single-batch (#629), batched DFlash with continuous batching + GDN-aware rollback (#637), and batched MTP with continuous batching + left-padding normalization (#631)
    • Real-model byte-equality end-to-end tests (#685) and a greedy-parity + perf benchmark harness (#632)
  • Server speculative dispatch:
    • Speculative dispatch resolution and MtpTarget adapters wired into the server (#666); the assistant model paths now plug into the real MaskedEmbedder and make_drafter_masks
    • Scheduler integration via per-request B=1 bursts (#670) and a B>1 batched path via MtpBatchedGenerator / DFlashBatchedGenerator (#684)
    • Per-request properties propagated through the speculative-burst path: cancellation propagation through MtpGenerator / DFlashGenerator (#681), token_history threading through the speculative-burst first sample (#682), logprobs support (#686), thinking-budget enforcement (#687), and prompt-cache donation that is symmetric with the classic path (#680) and extended to the B>1 batched arm (#689)
  • --draft-kind {dflash,mtp} and --draft-block-size CLI flags on both mlxcel and mlxcel-server (#630).
  • OpenAI Responses API (Phase 1) at /v1/responses for both binaries (#622, #623). Conversation store with shared-LRU semantics, response.created / response.in_progress / response.completed SSE event stream, reasoning-trace forwarding, response cancellation, and four new CLI flags. User guide at docs/responses-api.md.
  • APC block-level partial cache adoption in the scheduler (#580, #607). When APC is on, a prompt sharing the first N blocks with a cached entry but diverging at block N+1 reuses blocks 0..N and re-prefills only from the divergence boundary. APC-off retains bit-exact prior behaviour. Three components: DetachedKVCache::trim_to and DetachedCacheSet::truncate_to (FP16, INT8, Turbo4, Turbo4Delegated sidecars), relaxed PromptCacheStore containment gate, and Scheduler::try_adopt_cached_prefix truncate-on-adopt. Bench procedure in docs/apc-partial-adoption-bench.md.
  • Nemotron H Nano Omni audio modality (#582, #609). Parakeet/Conformer sound encoder, mel-spectrogram feature extractor, audio projector, and runtime path that merges audio token embeddings at sound_context_token_id and interleaves them with vision tokens. Loader applies the upstream sanitize_audio_weights transpose pass. Bring-up guide at docs/nemotron-h-nano-omni-audio-bringup.md.
  • mlxcel download / mlxcel-server download progress bars (#648, #649). New src/downloader/progress.rs module provides terminal-aware suppression, a MultiProgress factory, and 6 suppression unit tests. The downloader streams files via reqwest to a NamedTempFile and atomically renames into place.
  • Server --max-kv-size flag matching llama-server, plus a tightened chat-completion response envelope (#618).
  • Tokenizer support for multi-token think and tool-call sequences so chat templates that emit <think> / <tool_call> across multiple BPE tokens stream and parse correctly (#590, #613).
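
The APC partial-adoption rule in the list above (reuse blocks 0..N, re-prefill from the divergence boundary) can be illustrated with a small sketch. This is not the scheduler's code: `shared_block_count`, `AdoptionPlan`, and `plan_adoption` are hypothetical names, and tokens are compared directly where a real implementation would compare block hashes.

```rust
/// Hypothetical helper: number of leading *full* blocks that two token
/// sequences have in common. A trailing partial block is never counted.
fn shared_block_count(prompt: &[u32], cached: &[u32], block_size: usize) -> usize {
    assert!(block_size > 0, "block_size must be positive");
    prompt
        .chunks_exact(block_size)
        .zip(cached.chunks_exact(block_size))
        .take_while(|(a, b)| a == b)
        .count()
}

/// Hypothetical description of what a scheduler would do with a cached entry.
#[derive(Debug, PartialEq)]
struct AdoptionPlan {
    /// Tokens whose KV state can be reused from the cached entry.
    reuse_tokens: usize,
    /// True when the cached entry is longer than the shared prefix and
    /// therefore has to be truncated before adoption.
    truncate_cached: bool,
}

fn plan_adoption(prompt: &[u32], cached: &[u32], block_size: usize) -> AdoptionPlan {
    let reuse_tokens = shared_block_count(prompt, cached, block_size) * block_size;
    AdoptionPlan {
        reuse_tokens,
        // A full-prefix hit needs no truncation and can take a fast path.
        truncate_cached: reuse_tokens < cached.len(),
    }
}
```

With a block size of 2, a cached entry `[1,2,3,4,5,6,7,8]` and a prompt `[1,2,3,4,5,6,9,9,10]` share three blocks, so six tokens are reused, the cached entry is truncated, and prefill resumes at token index 6.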

Improvements

  • StreamFilter extended to handle multi-token markers and to reset state when a partial marker is broken by a non-marker token (#613).
  • Post-merge hardening of the speculative drafter epic: fixes for miscellaneous invariants surfaced by integration testing against the real z-lab/Qwen3.5-4B-DFlash checkpoint and the mlx-community/gemma-4-* drafter variants.
  • README and speculative decoding guidance refreshed to match current code paths and the latest M1 Ultra / M5 Max benchmarks (#700).
  • Qwen 3.5 DFlash greedy-argmax decode-path optimization that drops the per-decode-step copy and an unnecessary argmax temporary, restoring decode tok/s on Qwen 3.5 32B / 9B DFlash configurations.
  • Avoid slow Gemma 4 MTP singleton bursts — the speculative-burst path now correctly short-circuits to the classic path when the batch size collapses to 1 with no draft tokens accepted, eliminating a per-step over-evaluation regression (#698).
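
The multi-token marker handling described for StreamFilter above (buffer while a marker might be forming, reset when a non-marker token breaks it) can be sketched as follows. This is a minimal illustration, not the project's StreamFilter: `MarkerFilter` and `FilterOutput` are hypothetical, only one marker is tracked, and the marker is assumed to start on a token boundary.

```rust
/// Hypothetical result of feeding one decoded token piece to the filter.
#[derive(Debug, PartialEq)]
enum FilterOutput {
    /// Text that can be streamed to the client.
    Text(String),
    /// The full marker was recognised and swallowed.
    Marker,
    /// The piece is held back because it may be the start of a marker.
    Pending,
}

/// Hypothetical single-marker filter.
struct MarkerFilter {
    marker: String,
    pending: String,
}

impl MarkerFilter {
    fn new(marker: &str) -> Self {
        Self { marker: marker.to_string(), pending: String::new() }
    }

    fn push(&mut self, piece: &str) -> FilterOutput {
        self.pending.push_str(piece);
        if self.pending == self.marker {
            // Marker completed across one or more tokens.
            self.pending.clear();
            FilterOutput::Marker
        } else if self.marker.starts_with(self.pending.as_str()) {
            // Still a strict prefix of the marker: keep buffering.
            FilterOutput::Pending
        } else {
            // Partial marker broken by a non-marker token: flush the
            // buffered text and reset the state.
            FilterOutput::Text(std::mem::take(&mut self.pending))
        }
    }
}
```

Feeding `"<th"`, `"ink"`, `">"` to a filter built with `MarkerFilter::new("<think>")` yields `Pending`, `Pending`, `Marker`; feeding `"<th"` followed by `"e"` flushes `"<the"` as ordinary text.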

Bug Fixes

  • DFlash drafter lazy-bind for the upstream z-lab/Qwen3.5-4B-DFlash checkpoint — Drafter::bind was previously not called on the DFlash family, causing an internal cache mis-binding on the first speculative burst. The drafter now performs lazy-bind on first use, matching the MTP path (#683).
  • Enable DFlash for Qwen 3.5 VLM text requests — pure-text generations against a Qwen 3.5 VLM checkpoint can now resolve a DFlash drafter when one is available, instead of silently falling back to the classic path (#694).
  • Speculative-rollback safety: validate trimmable cache and reserve the last token in prefill so a rolled-back speculative burst always lands on a valid sampling boundary (#612).
  • Prompt cache RadixTrie: pop_prefixes now uses correct immediate-prefix semantics (#617).
  • MiniMax M2 parallel tool calling parser correctly emits one ChatToolCall per parallel call instead of merging them into a single call (#616).
  • Server tool-call buffering: preserve token positions when buffering parallel tool calls (#615); skip the tool→normal transition when tool_call_end is empty so streaming continues correctly for templates without an explicit close marker (#614).
  • video_url allowlist TOCTOU race closed by passing the resolved OwnedFd to ffmpeg via /dev/fd/N instead of re-opening the path inside the subprocess. A symlink swap between the metadata check and the subsequent open can no longer mis-route the subprocess (#601, #611).
  • Gemma 4: skip k_proj / v_proj / k_norm weight load for KV-shared layers — the previous load step would error out on real Gemma 4 E2B / E4B checkpoints that omit these tensors per KV-shared design (#608).
  • Nemotron-H: default time_step_limit to (0.0, +inf) regardless of time_step_min / time_step_max, matching upstream mlx-lm even when only one bound is supplied (#619).
  • gated_delta masked Metal kernel variants: zero-init y[dv_idx] when the mask is false (#610).
  • Tests: add max_kv_size field to the ServeArgs test fixture (#620).
  • Address upstream-sync review follow-ups carried over from the v0.0.26 sync cycle.
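
For readers unfamiliar with why a speculative burst needs a rollback at all, the sketch below shows generic greedy acceptance: the target's argmax tokens are compared against the drafted block, the matching prefix is kept, and the rejected tail determines how many speculative cache positions have to be trimmed. `accept_greedy` is a hypothetical helper written for this note, not the logic behind rollback_speculative_cache.

```rust
/// Hypothetical greedy verification step.
///
/// `draft` holds the k drafted tokens. `target` holds k + 1 greedy tokens
/// from the target model, where `target[i]` is its choice after the prefix
/// extended by `draft[..i]`.
///
/// Returns the tokens to emit and the number of drafted positions that
/// were rejected and must be rolled back from the cache.
fn accept_greedy(draft: &[u32], target: &[u32]) -> (Vec<u32>, usize) {
    assert_eq!(target.len(), draft.len() + 1, "target must have one extra token");
    // Longest prefix on which drafter and target agree.
    let accepted = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();
    // Accepted draft tokens, then the target's own token at the first
    // mismatch (or the bonus token when everything was accepted).
    let mut emitted = draft[..accepted].to_vec();
    emitted.push(target[accepted]);
    (emitted, draft.len() - accepted)
}
```

For a draft of `[5, 6, 7]` and target tokens `[5, 6, 9, 1]`, two draft tokens are accepted, `[5, 6, 9]` is emitted, and one drafted position is rolled back.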

CI/CD Improvements

  • Bump GitHub Actions to Node 24 runtime to clear the Node 20 deprecation warning on macOS runners.

Technical Details

  • MLX C++ upstream pin: unchanged from v0.0.26 — fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ continue to use the validated mlx::core::fast::metal_kernel, mlx::core::full, mlx::core::Shape, mlx::core::float32, mlx::core::int32, and metal::fast::exp surface from the previous cycle.
  • Drafter dispatch architecture: the Drafter trait + DrafterKind enum lets a single scheduler dispatch path drive both Gemma 4 MTP (re-using parent layer 0–3 hidden states with cross-attention into a 4-layer drafter) and Qwen 3.5 DFlash (5-layer drafter + GDN cache rollback) without per-family branching in the hot loop. Burst dispatch decisions live in src/server/batch/speculative_burst.rs (speculative_dispatch::resolve).
  • Speculative-burst integration with prompt cache: the donate path is symmetric with the classic path (donate_finished_sequence_cache) and applies to both the B=1 burst and the B>1 batched arm, so cache hit rates are unaffected by the dispatch decision.
  • APC partial adoption invariant: the truncate_to operation is only invoked when apc_consistent_prefix_len < entry_len, so a full-prefix hit takes the existing fast path and incurs no slicing overhead.
  • Downloader streaming: stream_file (outer) handles destination resolution, cache-hit detection, and atomic rename; stream_to_tempfile (inner) performs the network read into a NamedTempFile. Outer/inner split keeps the progress bar coverage clean and lets tests stub the network path.
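
The outer/inner split described for the downloader can be sketched with the standard library alone. The function names follow the bullet above, but the signatures and bodies are assumptions: the real code streams from reqwest into a NamedTempFile, while this sketch accepts any `Read` source (which is what makes the network path easy to stub) and uses a hypothetical `.part` sibling file as the temporary.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Hypothetical temp-file naming: "<dest file name>.part" in the same
/// directory, so the final rename stays on one filesystem.
fn temp_path_for(dest: &Path) -> io::Result<PathBuf> {
    let name = dest.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name")
    })?;
    let mut tmp_name = name.to_os_string();
    tmp_name.push(".part");
    Ok(dest.with_file_name(tmp_name))
}

/// Inner step: copy bytes from any reader into the temporary file.
fn stream_to_tempfile<R: Read>(mut source: R, tmp: &Path) -> io::Result<u64> {
    let mut file = File::create(tmp)?;
    let written = io::copy(&mut source, &mut file)?;
    file.sync_all()?;
    Ok(written)
}

/// Outer step: cache-hit detection, then rename into place.
/// Returns Ok(false) on a cache hit and Ok(true) after a fresh download.
fn stream_file<R: Read>(source: R, dest: &Path) -> io::Result<bool> {
    if dest.exists() {
        return Ok(false);
    }
    let tmp = temp_path_for(dest)?;
    match stream_to_tempfile(source, &tmp) {
        Ok(_) => {
            // Readers never observe a half-written destination file.
            fs::rename(&tmp, dest)?;
            Ok(true)
        }
        Err(err) => {
            // Best-effort cleanup of the partial file.
            let _ = fs::remove_file(&tmp);
            Err(err)
        }
    }
}
```

A test can pass a byte slice such as `&b"hello"[..]` as the source, which exercises the rename and cache-hit logic without touching the network.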

Dependencies

  • No new direct dependency additions over v0.0.26. reqwest 0.12 (existing) is used for the streaming downloader path.

Breaking Changes

  • None for end users. The new --draft-kind / --draft-block-size CLI flags default to off, so the classic non-speculative path is unchanged unless the user opts in.
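
The opt-in behaviour above can be pictured as an optional flag value: when --draft-kind is absent, no drafter is selected and the classic path runs. The sketch below is an assumption about shape only. The accepted strings `dflash` and `mtp` come from the flag description in these notes, while the variant names and the `parse_draft_kind` helper are hypothetical.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the drafter selection enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DrafterKind {
    DFlash,
    Mtp,
}

impl FromStr for DrafterKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dflash" => Ok(DrafterKind::DFlash),
            "mtp" => Ok(DrafterKind::Mtp),
            other => Err(format!("unknown draft kind: {other}")),
        }
    }
}

/// `None` means the flag was not passed, so speculative decoding stays off.
fn parse_draft_kind(flag: Option<&str>) -> Result<Option<DrafterKind>, String> {
    flag.map(|s| s.parse::<DrafterKind>()).transpose()
}
```

Here `parse_draft_kind(None)` returns `Ok(None)`, `parse_draft_kind(Some("mtp"))` returns `Ok(Some(DrafterKind::Mtp))`, and an unrecognised value is reported as an error rather than silently ignored.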

Known Issues

  • B=1 burst on the smallest dense Gemma 4 configurations (e.g. gemma-4-e2b-it-4bit) is currently bounded by per-iteration overhead; the singleton-burst short-circuit (#698) covers this on the hot path but the steady-state speedup for B=1 is checkpoint-dependent. See docs/speculative-decoding.md for tuning guidance.
  • DFlash speed-up on Qwen 3.5 VLM text-only requests requires a real DFlash drafter checkpoint to be present alongside the target; without one, the path silently falls back to classic decode (this is intentional but can surprise users expecting a server-side warning)...

v0.0.26

18 May 10:01

What's Changed

mlxcel v0.0.26 ships the TurboQuant KV cache family (3-4 bit/value compression via Walsh-Hadamard rotation + PolarQuant), Automatic Prefix Caching with hash blocks, OpenAI-compatible json_schema constrained decoding, an in-tree HuggingFace download subcommand, VLM video input infrastructure (Gemma 4 + new Youtu-VL and Nemotron H Nano Omni vision models), and a substantial decode hot-path performance pass on top of the v0.0.25 prompt-prefix KV cache foundation.

New Features

  • TurboQuant KV cache — full mode family wired through KVCacheMode:
    • Turbo4 symmetric with per-model allowlist (#476)
    • Turbo4Asym Fp16-K + Turbo4-V (#474)
    • Turbo3Asym 3-bit asymmetric Fp16-K + Turbo3-V (#477)
    • Turbo4Delegated FP16 hot tail + packed turbo cold body (#479)
    • Walsh-Hadamard transform op (#470) and PolarQuant Lloyd-Max codebook generator (#472)
    • RotatingKVCache (sliding-window) integration (B9), Boundary-V layer protection (#478), packed-aware PagedKvLayout (#482), sparse-V dequant scaffolding (#480)
    • llama-server flag parity: --cache-type-k / --cache-type-v accept mlxcel_turbo* variants (#484)
    • KV cache quantization extended to continuous batching (#545)
    • Unified TurboQuant CLI flags across all binaries (#567)
    • User guide and validated config matrix (#485)
  • Automatic Prefix Caching (APC) with hash blocks (#552) — hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache
  • OpenAI-compatible response_format: json_schema constrained decoding via llguidance (#550) — same backend as upstream mlx-vlm PR #1047, with 64 KiB / 32-depth / 64-$ref schema limits, SHA-256 fingerprinted tokenizer-environment cache, and reusable per-sequence mask_buf / bias_buf allocations
  • mlxcel download / mlxcel-server download subcommand (#457, #486) — fetch HuggingFace model snapshots without Python tooling, with allow-list file filter, atomic writes, and path-traversal defense
  • Paged scheduler dispatch on PagedKvLayout::cache_mode (#508)
  • /health endpoint exposes context_size and tool_call_parser (#549, #572)
  • VLM video input infrastructure:
    • Gemma 4 video support and VLM video input pipeline (#553)
    • video_url content blocks wired through /v1/chat/completions (#596)
    • Single-pass ffmpeg frame extraction with Drop guard (#597)
    • Content-preservation tests for video frame extraction (#598)
  • New models: Youtu-VL (#555); Nemotron H Nano Omni vision (#554, #595)
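
The Walsh-Hadamard rotation used by the TurboQuant family above is a standard transform, and a textbook in-place version is short. The sketch below is a generic CPU implementation under stated assumptions: the `fwht` name, the f32 element type, and the 1/sqrt(n) normalisation are choices made here, not details taken from the mlxcel op or its Metal kernels.

```rust
/// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so that the
/// transform is orthonormal and applying it twice restores the input.
/// The length must be a power of two.
fn fwht(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // Butterfly pass: combine elements that are `h` apart.
        for start in (0..n).step_by(2 * h) {
            for i in start..start + h {
                let (a, b) = (data[i], data[i + h]);
                data[i] = a + b;
                data[i + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for value in data.iter_mut() {
        *value *= scale;
    }
}
```

Applied to `[1.0, 0.0, 0.0, 0.0]`, the transform spreads the single spike into four equal values of 0.5, which is the property that makes rotated vectors friendlier to low-bit quantization. A second application returns the original vector.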

Improvements

  • Sparse-V Metal kernel: fused per-thread SDPA skipping (#505); precomputed kernel rescale dropping per-token threadgroup barriers (#520)
  • Turbo4Delegated decode hot path:
    • Unified K storage to drop per-step K concat (#527)
    • Cold-V dequant cache across decode steps (#525) → cold-V dequant Metal kernel retiring the FP16 memo (#530)
    • Steel-attention-envelope fused SDPA kernel (#531)
    • Parallelized Pass 1 softmax in turbo4_delegated_steel_sdpa (#534)
    • Delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537)
    • Compressed fold moved before decode
  • Compressed dequant-SDPA paths for TurboQuant decode (#562)
  • Server hot-path: thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556)
  • Quality and speed gates:
    • Wikitext-2 PPL + NIAH harness (#475) with the full 283K-token test split (#492) and per-model results committed (#493)
    • VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for #510
    • KV speed gate matrix runner with M1 Ultra (#509) and M5 Max readings
  • Benchmarks: M1 Ultra refresh to 2026-05-08 (#577); full M1 Ultra column resync in benchmarks-by-hardware.md (#578)

Bug Fixes

  • TurboQuant continuous batching: correct batch cache offset merging when batches with different cache offsets are joined or split (#564); Turbo3 split-flag, documentation alignment, and ENV_LOCK race in concurrent process startup (#573)
  • Vision / VLM mixed batching:
    • Per-sequence MRoPE alignment for mixed VL+text batches (#558)
    • Per-sequence per_layer_inputs for Gemma 4 E2B/E4B VLM (#561)
    • Mixed-length batching support for Gemma 4 (#560)
    • Relaxed cached-position shape check in Qwen VL chunked prefill (#557)
    • Qwen3.5-MoE batch-size validation on cached position_ids reuse (#559)
  • Streaming and sampling:
    • Streamed detokenization for byte-fallback tokens (#570)
    • Top-p filter correctness for batched logits (#569)
    • Token queue timeout handling during long prefills (#571)
    • StreamFilter extended to cover Hermes-style <tool_call> and Mistral Nemo [TOOL_CALLS] markers (#551, #576) — partial-marker buffering at token boundaries; Gemma 4 <|tool_call> suppression unaffected
  • Models:
    • Gemma3-4B attention SIGABRT from sliding-window mask T_k mismatch on long-context prompts (#507)
    • Preserve Qwen2 fused QKV bias when present in checkpoint (#517)
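
As background for the batched top-p entry above, the sketch below shows nucleus filtering applied independently to each row of a batch. It is not a reconstruction of the fix in #569: it works on probabilities that are assumed to be already normalised (the real path operates on logits), and `top_p_mask_row` and `top_p_mask_batch` are hypothetical names.

```rust
/// Keep-mask for one row: the smallest set of highest-probability tokens
/// whose cumulative probability reaches `top_p`.
fn top_p_mask_row(probs: &[f32], top_p: f32) -> Vec<bool> {
    // Token indices sorted by descending probability.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    let mut keep = vec![false; probs.len()];
    let mut cumulative = 0.0f32;
    for idx in order {
        keep[idx] = true;
        cumulative += probs[idx];
        if cumulative >= top_p {
            break;
        }
    }
    keep
}

/// Each sequence in the batch gets its own cutoff. Sharing one cutoff
/// across rows would filter some sequences too aggressively or too little.
fn top_p_mask_batch(batch: &[Vec<f32>], top_p: f32) -> Vec<Vec<bool>> {
    batch.iter().map(|row| top_p_mask_row(row, top_p)).collect()
}
```

With `top_p = 0.7`, the row `[0.5, 0.3, 0.2]` keeps its first two tokens while the row `[0.1, 0.1, 0.8]` keeps only its last, so the two rows end up with different nucleus sizes.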

CI/CD Improvements

None.

Technical Details

  • Refactor: unified TurboQuant KV-cache CLI flags across mlxcel, mlxcel-server, and mlxcel download so all binaries accept the same --kv-cache-mode / --cache-type-{k,v} syntax (#567)
  • Test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506)
  • Post-merge review hardening for the Nemotron-H Nano Omni vision PR (#595)
  • mlx-lm version reference in docs bumped 0.31.2 → 0.31.3 (#606); bridge-overhead-microbench's v0.31.2 reference is preserved because it pins the MLX C++ runtime, not mlx-lm Python

Dependencies

  • MLX upstream pin bumped twice:
    • First to v0.32.0 / c9aa5605 (#565)
    • Then forward to 84961223 covering 3 PRs:
      • #3443 splits the CUDA qmm_naive / qmm_sm80 kernel bodies into new qmm_naive.cuh / qmm_sm80.cuh headers without changing the public ABI consumed by mlxcel's patches/mlx/backend/cuda/quantized/qmm/qmm.h
      • #3463 routes the CPU JIT preamble through JitCompiler::get_preamble() and renames the prebuilt symbol from get_kernel_preamble to get_prebuilt_preamble (mlxcel does not call either directly)
      • #3475 fixes contiguity-flag accuracy in AsStrided by computing data_size from the actually-occupied stride range
    • Three-location pin update applied to src/lib/mlx-cpp/CMakeLists.txt, src/lib/mlxcel-core/build.rs, and .github/workflows/release.yml per CLAUDE.md
    • Fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ re-validated against both bumps; symbols unchanged

Breaking Changes

None.

Known Issues

  • Small dense Qwen3/3.5 decode regression on M1 Ultra under investigation: qwen3-0.6b (-25.2%), qwen3-1.7b (-10.8%), qwen3.5-0.8b (-11.1%), molmo2-4b (-10.7%) between 4-04 and 5-08 baselines; pattern suggests fixed-cost overhead in the decode hot path (likely K-storage unification #527 or sparse-V around that period). Bisect not yet completed.
  • mlx-lm baseline re-measurement pending for model_tests_m1ultra.md Performance Comparison vs mlx-lm percentages (still anchored to 4-04 baseline; needs one mlx-lm 0.31.2 cycle on M1 Ultra to restore coherence).
  • M5 Max re-bench cycle pending — would naturally close the Notable Regressions table entries that are currently artifacts of M1 Ultra (5-08) vs M5 Max (4-13) measurement vintage gap.
