Releases: lablup/mlxcel
v0.0.27
v0.0.27 — Speculative decoding & Responses API
This stable release closes the Gemma 4 MTP and Qwen 3.5 DFlash speculative-decoding epic with full server-side integration, adds the OpenAI Responses API (Phase 1), brings progress bars + a security-hardened core to mlxcel download, and adds block-level partial cache adoption to the APC scheduler.
New Features
- End-to-end speculative decoding for Gemma 4 MTP and Qwen 3.5 DFlash drafter families (epic #633, #632). New `Drafter` trait + `DrafterKind` enum + `model_type` auto-detection (#624). Ported drafter components: `MaskedEmbedder` for Gemma 4 E2B / E4B (#627), drafter masks (bidirectional full + sliding-window) and `normalize_batched_shared_kv_states` (#628), `Gemma4AssistantDraftModel` (4-layer drafter + pre/post projections) (#626), and `DFlashDraftModel` (5-layer drafter + `DFlashAttention` + `DFlashKVCache`) (#635). Target-side hooks: Gemma 4 `return_hidden` / `return_shared_kv` / `rollback_speculative_cache` (#625); Qwen 3.5 `return_hidden` + `capture_layer_ids` + GDN-aware `rollback_speculative_cache` (#634). Round loops: DFlash single-batch (#636), `MtpGenerator` single-batch (#629), batched DFlash with continuous batching + GDN-aware rollback (#637), and batched MTP with continuous batching + left-padding normalization (#631). Real-model byte-equality end-to-end tests (#685) and a greedy-parity + perf benchmark harness (#632).
- Server speculative dispatch. Speculative dispatch resolution and `MtpTarget` adapters wired into the server (#666); the assistant model paths now plug into the real `MaskedEmbedder` and `make_drafter_masks`; speculative dispatch is wired into the scheduler via per-request B=1 bursts (#670) and a B>1 batched path via `MtpBatchedGenerator` / `DFlashBatchedGenerator` (#684). Per-request properties propagated through the speculative-burst path: cancellation propagation through `MtpGenerator` / `DFlashGenerator` (#681), `token_history` threading through the speculative-burst first sample (#682), logprobs support (#686), thinking-budget enforcement (#687), and prompt-cache donate symmetric with the classic path (#680) and into the B>1 batched arm (#689).
- `--draft-kind {dflash,mtp}` and `--draft-block-size` CLI flags on both `mlxcel` and `mlxcel-server` (#630).
- OpenAI Responses API (Phase 1) at `/v1/responses` for both binaries (#622, #623). Conversation store with shared-LRU semantics, `response.created` / `response.in_progress` / `response.completed` SSE event stream, reasoning-trace forwarding, response cancellation, and four new CLI flags. User guide at `docs/responses-api.md`.
- APC block-level partial cache adoption in the scheduler (#580, #607). When APC is on, a prompt sharing the first N blocks with a cached entry but diverging at block N+1 reuses blocks 0..N and re-prefills only from the divergence boundary. APC-off retains bit-exact prior behaviour. Three components: `DetachedKVCache::trim_to` and `DetachedCacheSet::truncate_to` (FP16, INT8, Turbo4, Turbo4Delegated sidecars), relaxed `PromptCacheStore` containment gate, and `Scheduler::try_adopt_cached_prefix` truncate-on-adopt. Bench procedure in `docs/apc-partial-adoption-bench.md`.
- Nemotron H Nano Omni audio modality (#582, #609). Parakeet/Conformer sound encoder, mel-spectrogram feature extractor, audio projector, and runtime path that merges audio token embeddings at `sound_context_token_id` and interleaves them with vision tokens. Loader applies the upstream `sanitize_audio_weights` transpose pass. Bring-up guide at `docs/nemotron-h-nano-omni-audio-bringup.md`.
- `mlxcel download` / `mlxcel-server download` progress bars (#648, #649). New `src/downloader/progress.rs` module provides terminal-aware suppression, a `MultiProgress` factory, and 6 suppression unit tests. The downloader streams files via `reqwest` to a `NamedTempFile` and atomically renames into place.
- Server `--max-kv-size` flag matching llama-server, plus a tightened chat-completion response envelope (#618).
- Tokenizer support for multi-token think and tool-call sequences so chat templates that emit `<think>` / `<tool_call>` across multiple BPE tokens stream and parse correctly (#590, #613).
Improvements
- `StreamFilter` extended to handle multi-token markers and to reset state when a partial marker is broken by a non-marker token (#613).
- Speculative drafter epic follow-ups hardened post-merge, covering miscellaneous invariants surfaced by integration testing against the real `z-lab/Qwen3.5-4B-DFlash` checkpoint and the `mlx-community/gemma-4-*` drafter variants.
- README and speculative decoding guidance refreshed to match current code paths and the latest M1 Ultra / M5 Max benchmarks (#700).
- Qwen 3.5 DFlash greedy-argmax decode-path optimization that drops the per-decode-step copy and an unnecessary argmax temporary, restoring decode tok/s on Qwen 3.5 32B / 9B DFlash configurations.
- Avoid slow Gemma 4 MTP singleton bursts — the speculative-burst path now correctly short-circuits to the classic path when the batch size collapses to 1 with no draft tokens accepted, eliminating a per-step over-evaluation regression (#698).
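
The `StreamFilter` reset behaviour mentioned in this list can be pictured with a small stand-alone sketch. `MarkerFilter` and its methods are hypothetical names, not the real type. The sketch assumes a single marker that always begins on a token boundary, and it does not re-scan flushed text for a new marker start.

```rust
/// Hypothetical stand-in for a stream filter; not the real `StreamFilter`.
struct MarkerFilter {
    marker: String,
    /// Token text held back while it could still grow into the marker.
    pending: String,
}

impl MarkerFilter {
    fn new(marker: &str) -> Self {
        Self { marker: marker.to_string(), pending: String::new() }
    }

    /// Feed one decoded token piece; returns the text that is safe to emit now.
    fn push(&mut self, piece: &str) -> String {
        self.pending.push_str(piece);
        if self.pending == self.marker {
            // The full marker arrived across several pieces: swallow it.
            self.pending.clear();
            return String::new();
        }
        if self.marker.starts_with(&self.pending) {
            // Still a strict prefix of the marker: keep buffering.
            return String::new();
        }
        // A non-marker piece broke the partial match: flush and reset state.
        std::mem::take(&mut self.pending)
    }
}
```

With the marker `<think>`, feeding `<th` then `ink>` emits nothing, while feeding `<th` then `en` releases the buffered `<then` in one piece.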
Bug Fixes
- DFlash drafter lazy-bind for the upstream `z-lab/Qwen3.5-4B-DFlash` checkpoint: `Drafter::bind` was previously not called on the DFlash family, causing an internal cache mis-binding on the first speculative burst. The drafter now performs lazy-bind on first use, matching the MTP path (#683).
- Enable DFlash for Qwen 3.5 VLM text requests: pure-text generations against a Qwen 3.5 VLM checkpoint can now resolve a DFlash drafter when one is available, instead of silently falling back to the classic path (#694).
- Speculative-rollback safety: validate trimmable cache and reserve the last token in prefill so a rolled-back speculative burst always lands on a valid sampling boundary (#612).
- Prompt cache RadixTrie: `pop_prefixes` now uses correct immediate-prefix semantics (#617).
- MiniMax M2 parallel tool calling parser correctly emits one `ChatToolCall` per parallel call instead of merging them into a single call (#616).
- Server tool-call buffering: preserve token positions when buffering parallel tool calls (#615); skip the tool→normal transition when `tool_call_end` is empty so streaming continues correctly for templates without an explicit close marker (#614).
- `video_url` allowlist TOCTOU race closed by passing the resolved `OwnedFd` to ffmpeg via `/dev/fd/N` instead of re-opening the path inside the subprocess. Symlink swaps between `metadata` and the subsequent open now cannot mis-route the subprocess (#601, #611).
- Gemma 4: skip `k_proj` / `v_proj` / `k_norm` weight load for KV-shared layers. The previous load step would error out on real Gemma 4 E2B / E4B checkpoints that omit these tensors per KV-shared design (#608).
- Nemotron-H: default `time_step_limit` to `(0.0, +inf)` regardless of `time_step_min` / `time_step_max`, matching upstream mlx-lm even when only one bound is supplied (#619).
- `gated_delta` masked Metal kernel variants: zero-init `y[dv_idx]` when the mask is false (#610).
- Tests: add `max_kv_size` field to the `ServeArgs` test fixture (#620).
- Address upstream-sync review follow-ups carried over from the v0.0.26 sync cycle.
CI/CD Improvements
- Bump GitHub Actions to Node 24 runtime to clear the Node 20 deprecation warning on macOS runners.
Technical Details
- MLX C++ upstream pin: unchanged from v0.0.26. Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` continue to use the validated `mlx::core::fast::metal_kernel`, `mlx::core::full`, `mlx::core::Shape`, `mlx::core::float32`, `mlx::core::int32`, and `metal::fast::exp` surface from the previous cycle.
- Drafter dispatch architecture: the `Drafter` trait + `DrafterKind` enum lets a single scheduler dispatch path drive both Gemma 4 MTP (re-using parent layer 0–3 hidden states with cross-attention into a 4-layer drafter) and Qwen 3.5 DFlash (5-layer drafter + GDN cache rollback) without per-family branching in the hot loop. Burst dispatch decisions live in `src/server/batch/speculative_burst.rs` (`speculative_dispatch::resolve`).
- Speculative-burst integration with prompt cache: the donate path is symmetric with the classic path (`donate_finished_sequence_cache`) and applies to both the B=1 burst and the B>1 batched arm, so cache hit rates are unaffected by the dispatch decision.
- APC partial adoption invariant: the `truncate_to` operation is only invoked when `apc_consistent_prefix_len < entry_len`, so a full-prefix hit takes the existing fast path and incurs no slicing overhead.
- Downloader streaming: `stream_file` (outer) handles destination resolution, cache-hit detection, and atomic rename; `stream_to_tempfile` (inner) performs the network read into a `NamedTempFile`. The outer/inner split keeps the progress bar coverage clean and lets tests stub the network path.
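
The APC partial-adoption rule and its truncate-only-when-shorter invariant can be illustrated with a small sketch. This is an assumption-laden illustration, not the scheduler's code: `BLOCK_SIZE`, `shared_block_count`, and `plan_adoption` are hypothetical, and blocks are compared token by token rather than by hash.

```rust
/// Hypothetical block size in tokens; the real value is not stated in these notes.
const BLOCK_SIZE: usize = 4;

/// Number of leading whole blocks on which `cached` and `prompt` agree.
fn shared_block_count(cached: &[u32], prompt: &[u32]) -> usize {
    cached
        .chunks_exact(BLOCK_SIZE)
        .zip(prompt.chunks_exact(BLOCK_SIZE))
        .take_while(|(a, b)| a == b)
        .count()
}

/// Returns (tokens reusable from the cached entry, whether the entry
/// must be truncated before adoption).
fn plan_adoption(cached: &[u32], prompt: &[u32]) -> (usize, bool) {
    let consistent_prefix_len = shared_block_count(cached, prompt) * BLOCK_SIZE;
    // Mirror of the stated invariant: truncate only when the consistent
    // prefix is shorter than the cached entry; a full-prefix hit skips it.
    let needs_truncate = consistent_prefix_len < cached.len();
    (consistent_prefix_len, needs_truncate)
}
```

In this sketch a 12-token cached entry and a prompt that diverges inside the third block yield 8 reusable tokens with truncation, while a prompt that merely extends the cached entry reuses all 12 tokens with no truncation.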
Dependencies
- No new direct dependency additions over v0.0.26. `reqwest` 0.12 (existing) is used for the streaming downloader path.
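
The stream-to-temp-file-then-rename pattern used by the downloader can be sketched with the standard library alone. This is only an illustration under stated assumptions: the real path uses `reqwest` and `NamedTempFile`, whereas this sketch reads from any `Read` source and uses a hypothetical `.part` sibling file; `fetch_to` and `stream_to_temp` are made-up names.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Inner step: copy everything from `source` into the temporary path.
fn stream_to_temp(mut source: impl Read, tmp: &Path) -> io::Result<u64> {
    let mut file = File::create(tmp)?;
    let written = io::copy(&mut source, &mut file)?;
    // Flush data to disk before the file becomes visible under its final name.
    file.sync_all()?;
    Ok(written)
}

/// Outer step: returns `Ok(false)` on a cache hit, `Ok(true)` after a download.
fn fetch_to(source: impl Read, dest: &Path) -> io::Result<bool> {
    if dest.exists() {
        return Ok(false); // cache hit: leave the existing file untouched
    }
    // Hypothetical temp name in the same directory, so the rename does not
    // cross filesystems.
    let tmp = dest.with_extension("part");
    stream_to_temp(source, &tmp)?;
    // A rename within one filesystem replaces the name in a single step,
    // so readers never observe a half-written destination.
    fs::rename(&tmp, dest)?;
    Ok(true)
}
```

Keeping the read loop in its own function is what lets a test pass an in-memory byte slice in place of a network response.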
Breaking Changes
- None for end users. The new `--draft-kind` / `--draft-block-size` CLI flags default to off, so the classic non-speculative path is unchanged unless the user opts in.
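
The opt-in behaviour can be modelled as an optional value: an absent flag means the classic path. The sketch below is illustrative only. The accepted strings `dflash` and `mtp` come from the flag description, but the variant names and the `resolve_draft_kind` helper are assumptions rather than the project's actual definitions.

```rust
use std::str::FromStr;

/// Illustrative enum; the variant names are guesses, not the real `DrafterKind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DrafterKind {
    DFlash,
    Mtp,
}

impl FromStr for DrafterKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dflash" => Ok(DrafterKind::DFlash),
            "mtp" => Ok(DrafterKind::Mtp),
            other => Err(format!("unknown draft kind: {other}")),
        }
    }
}

/// `None` models the flag being absent, which selects the classic path.
fn resolve_draft_kind(flag: Option<&str>) -> Result<Option<DrafterKind>, String> {
    flag.map(|value| value.parse::<DrafterKind>()).transpose()
}
```

Representing the default as `None` rather than as a third variant keeps the non-speculative path free of any drafter state.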
Known Issues
- B=1 burst on the smallest dense Gemma 4 configurations (e.g. gemma-4-e2b-it-4bit) is currently bounded by per-iteration overhead; the singleton-burst short-circuit (#698) covers this on the hot path but the steady-state speedup for B=1 is checkpoint-dependent. See `docs/speculative-decoding.md` for tuning guidance.
- DFlash speed-up on Qwen 3.5 VLM text-only requests requires a real DFlash drafter checkpoint to be present alongside the target; without one, the path silently falls back to classic decode (this is intentional but can surprise users expecting a server-side warning)...
v0.0.26
What's Changed
mlxcel v0.0.26 ships the TurboQuant KV cache family (3-4 bit/value compression via Walsh-Hadamard rotation + PolarQuant), Automatic Prefix Caching with hash blocks, OpenAI-compatible json_schema constrained decoding, an in-tree HuggingFace download subcommand, VLM video input infrastructure (Gemma 4 + new Youtu-VL and Nemotron H Nano Omni vision models), and a substantial decode hot-path performance pass on top of the v0.0.25 prompt-prefix KV cache foundation.
New Features
- TurboQuant KV cache: full mode family wired through `KVCacheMode`:
  - `Turbo4` symmetric with per-model allowlist (#476)
  - `Turbo4Asym` Fp16-K + Turbo4-V (#474)
  - `Turbo3Asym` 3-bit asymmetric Fp16-K + Turbo3-V (#477)
  - `Turbo4Delegated` FP16 hot tail + packed turbo cold body (#479)
  - Walsh-Hadamard transform op (#470) and PolarQuant Lloyd-Max codebook generator (#472)
  - `RotatingKVCache` (sliding-window) integration (B9), Boundary-V layer protection (#478), packed-aware `PagedKvLayout` (#482), sparse-V dequant scaffolding (#480)
  - llama-server flag parity: `--cache-type-k` / `--cache-type-v` accept `mlxcel_turbo*` variants (#484)
  - KV cache quantization extended to continuous batching (#545)
  - Unified TurboQuant CLI flags across all binaries (#567)
  - User guide and validated config matrix (#485)
- Automatic Prefix Caching (APC) with hash blocks (#552): hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache
- OpenAI-compatible `response_format: json_schema` constrained decoding via `llguidance` (#550): same backend as upstream mlx-vlm PR #1047, with 64 KiB / 32-depth / 64-`$ref` schema limits, SHA-256 fingerprinted tokenizer-environment cache, and reusable per-sequence `mask_buf` / `bias_buf` allocations
- `mlxcel download` / `mlxcel-server download` subcommand (#457, #486): fetch HuggingFace model snapshots without Python tooling, with allow-list file filter, atomic writes, and path-traversal defense
- Paged scheduler dispatch on `PagedKvLayout::cache_mode` (#508)
- `/health` endpoint exposes `context_size` and `tool_call_parser` (#549, #572)
- VLM video input infrastructure:
  - Gemma 4 video support and VLM video input pipeline (#553)
  - `video_url` content blocks wired through `/v1/chat/completions` (#596)
  - Single-pass ffmpeg frame extraction with `Drop` guard (#597)
  - Content-preservation tests for video frame extraction (#598)
- New models: Youtu-VL (#555); Nemotron H Nano Omni vision (#554, #595)
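
For readers unfamiliar with the Walsh-Hadamard rotation that TurboQuant builds on, the following is a minimal sketch of the textbook in-place fast transform. It is not the op added in #470: the function name `fwht` is hypothetical, and the choice to leave the result unnormalized is an assumption made for clarity.

```rust
/// In-place fast Walsh-Hadamard transform (unnormalized).
/// The slice length must be a power of two.
fn fwht(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // Butterfly over pairs that are `h` apart within blocks of `2 * h`.
        for start in (0..n).step_by(2 * h) {
            for i in start..start + h {
                let (a, b) = (data[i], data[i + h]);
                data[i] = a + b;
                data[i + h] = a - b;
            }
        }
        h *= 2;
    }
}
```

Applying the unnormalized transform twice returns the input scaled by its length, which is why implementations usually fold a `1/sqrt(n)` factor into each pass when an orthonormal rotation is wanted.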
Improvements
- Sparse-V Metal kernel: fused per-thread SDPA skipping (#505); precomputed kernel rescale dropping per-token threadgroup barriers (#520)
- Turbo4Delegated decode hot path:
- Unified K storage to drop per-step K concat (#527)
- Cold-V dequant cache across decode steps (#525) → cold-V dequant Metal kernel retiring the FP16 memo (#530)
- Steel-attention-envelope fused SDPA kernel (#531)
- Parallelized Pass 1 softmax in `turbo4_delegated_steel_sdpa` (#534)
- Delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537)
- Compressed fold moved before decode
- Compressed dequant-SDPA paths for TurboQuant decode (#562)
- Server hot-path: thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556)
- Quality and speed gates:
- Wikitext-2 PPL + NIAH harness (#475) with the full 283K-token test split (#492) and per-model results committed (#493)
- VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for #510
- KV speed gate matrix runner with M1 Ultra (#509) and M5 Max readings
- Benchmarks: M1 Ultra refresh to 2026-05-08 (#577); full M1 Ultra column resync in `benchmarks-by-hardware.md` (#578)
Bug Fixes
- TurboQuant continuous batching: correct batch cache offset merging when batches with different cache offsets are joined or split (#564); Turbo3 split-flag, documentation alignment, and `ENV_LOCK` race in concurrent process startup (#573)
- Vision / VLM mixed batching:
  - Per-sequence MRoPE alignment for mixed VL+text batches (#558)
  - Per-sequence `per_layer_inputs` for Gemma 4 E2B/E4B VLM (#561)
  - Mixed-length batching support for Gemma 4 (#560)
  - Relaxed cached-position shape check in Qwen VL chunked prefill (#557)
  - Qwen3.5-MoE batch-size validation on cached `position_ids` reuse (#559)
- Streaming and sampling:
  - Streamed detokenization for byte-fallback tokens (#570)
  - Top-p filter correctness for batched logits (#569)
  - Token queue timeout handling during long prefills (#571)
  - `StreamFilter` extended to cover Hermes-style `<tool_call>` and Mistral Nemo `[TOOL_CALLS]` markers (#551, #576): partial-marker buffering at token boundaries; Gemma 4 `<|tool_call>` suppression unaffected
- Models:
  - Gemma3-4B attention SIGABRT from sliding-window mask `T_k` mismatch on long-context prompts (#507)
  - Preserve Qwen2 fused QKV bias when present in checkpoint (#517)
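
The notes do not describe what was wrong with the batched top-p filter, so the following only illustrates the property such a filter needs: each row of the batch is filtered independently. It is a plain-Rust sketch over probabilities with hypothetical names (`top_p_row`, `top_p_batch`), and it leaves the surviving probabilities unnormalized.

```rust
/// Keep the smallest set of highest-probability entries whose cumulative
/// mass reaches `top_p`, and zero out the rest of the row.
fn top_p_row(probs: &mut [f32], top_p: f32) {
    // Indices sorted by descending probability.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    let mut keep = vec![false; probs.len()];
    let mut cumulative = 0.0f32;
    for &idx in &order {
        keep[idx] = true;
        cumulative += probs[idx];
        if cumulative >= top_p {
            break;
        }
    }
    for (p, kept) in probs.iter_mut().zip(&keep) {
        if !*kept {
            *p = 0.0;
        }
    }
}

/// Apply the filter to every row separately, so one sequence's
/// distribution never influences another's cutoff.
fn top_p_batch(batch: &mut [Vec<f32>], top_p: f32) {
    for row in batch.iter_mut() {
        top_p_row(row, top_p);
    }
}
```

With `top_p = 0.7`, a row `[0.5, 0.3, 0.2]` keeps its first two entries, while a row `[0.1, 0.1, 0.8]` in the same batch keeps only the last one.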
CI/CD Improvements
None.
Technical Details
- Refactor: unified TurboQuant KV-cache CLI flags across `mlxcel`, `mlxcel-server`, and `mlxcel download` so all binaries accept the same `--kv-cache-mode` / `--cache-type-{k,v}` syntax (#567)
- Test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506)
- Post-merge review hardening for the Nemotron-H Nano Omni vision PR (#595)
- mlx-lm version reference in docs bumped 0.31.2 → 0.31.3 (#606); `bridge-overhead-microbench`'s v0.31.2 reference is preserved because it pins the MLX C++ runtime, not mlx-lm Python
Dependencies
- MLX upstream pin bumped twice:
  - First to v0.32.0 / `c9aa5605` (#565)
  - Then forward to `84961223` covering 3 PRs:
    - #3443 splits the CUDA `qmm_naive` / `qmm_sm80` kernel bodies into new `qmm_naive.cuh` / `qmm_sm80.cuh` headers without changing the public ABI consumed by mlxcel's `patches/mlx/backend/cuda/quantized/qmm/qmm.h`
    - #3463 routes the CPU JIT preamble through `JitCompiler::get_preamble()` and renames the prebuilt symbol from `get_kernel_preamble` to `get_prebuilt_preamble` (mlxcel does not call either directly)
    - #3475 fixes contiguity-flag accuracy in `AsStrided` by computing `data_size` from the actually-occupied stride range
  - Three-location pin update applied to `src/lib/mlx-cpp/CMakeLists.txt`, `src/lib/mlxcel-core/build.rs`, and `.github/workflows/release.yml` per `CLAUDE.md`
  - Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` re-validated against both bumps; symbols unchanged
Breaking Changes
None.
Known Issues
- Small dense Qwen3/3.5 decode regression on M1 Ultra under investigation: qwen3-0.6b (-25.2%), qwen3-1.7b (-10.8%), qwen3.5-0.8b (-11.1%), molmo2-4b (-10.7%) between 4-04 and 5-08 baselines; pattern suggests fixed-cost overhead in the decode hot path (likely K-storage unification #527 or sparse-V around that period). Bisect not yet completed.
- mlx-lm baseline re-measurement pending for the `model_tests_m1ultra.md` Performance Comparison `vs mlx-lm` percentages (still anchored to the 4-04 baseline; needs one mlx-lm 0.31.2 cycle on M1 Ultra to restore coherence).
- M5 Max re-bench cycle pending; it would naturally close the Notable Regressions table entries that are currently artifacts of the measurement vintage gap between M1 Ultra (5-08) and M5 Max (4-13).
What's Changed
- feat: add download subcommand for HuggingFace model repositories by @inureyes in lablup/mlxcel-internal#486
- feat(core): port PolarQuant Lloyd-Max codebook generator to Rust (#472) by @inureyes in lablup/mlxcel-internal#487
- feat(core): add Walsh-Hadamard transform op for TurboQuant KV cache (#470) by @inureyes in lablup/mlxcel-internal#488
- feat(core): KVCacheMode::Turbo4Asym — Fp16-K + Turbo4-V (#474) by @inureyes in lablup/mlxcel-internal#490
- feat(core): TurboQuant KV cache quality gate — wikitext-2 PPL + NIAH (#475) by @inureyes in lablup/mlxcel-internal#491
- feat(core): KVCacheMode::Turbo4 (symmetric) with per-model allowlist (#476) by @inureyes in lablup/mlxcel-internal#494
- feat(core): TurboQuant + RotatingKVCache (sliding window) — B9 by @inureyes in lablup/mlxcel-internal#496
- feat(core): KVCacheMode::Turbo4Delegated — FP16 hot tail + packed turbo cold body (#479) by @inureyes in lablup/mlxcel-internal#495
- test(fixtures): replace wikitext-2 placeholder with full 283K-token test split (#492) by @inureyes in lablup/mlxcel-internal#497
- feat(core): Boundary-V layer protection for Turbo4* KV cache modes (#478) by @inureyes in lablup/mlxcel-internal#499
- feat(core): sparse-V dequant scaffolding (#480) by @inureyes in lablup/mlxcel-internal#498
- test(bench): commit per-model PPL+NIAH gate results on Apple Silicon (#493) by @inureyes in lablup/mlxcel-internal#500
- feat(core): KVCacheMode::Turbo3Asym — 3-bit asymmetric KV (Fp16-K + Turbo3-V) (#477) by @inureyes in lablup/mlxcel-internal#503
- feat(core): packed-aware PagedKvLayout for Turbo4 KV (#482) by @inureyes in lablup/mlxcel-internal#502
- feat(server): --cache-type-k/--cache-type-v server flag parity with llama-server (#484) by @inureyes in lablup/mlxcel-internal#501
- docs: TurboQuant KV cache user guide and validated config matrix (B12 / #485) by @inureyes in lablup/mlxcel-internal#504
- feat(core): fused Sparse-V Metal kernel — fused per-thread SDPA skipping (#505) by @inureyes in lablup/mlxcel-internal#511
...