Releases · lablup/mlxcel

18 May 06:51

inureyes

v0.0.27

bbc5777

v0.0.27 Latest

v0.0.27 — Speculative decoding & Responses API

This stable release closes the Gemma 4 MTP and Qwen 3.5 DFlash speculative-decoding epic with full server-side integration, adds the OpenAI Responses API (Phase 1), brings progress bars + a security-hardened core to mlxcel download, and adds block-level partial cache adoption to the APC scheduler.

New Features

End-to-end speculative decoding for Gemma 4 MTP and Qwen 3.5 DFlash drafter families (epic #633, #632):
- New Drafter trait + DrafterKind enum + model_type auto-detection (#624)
- Ported drafter components: MaskedEmbedder for Gemma 4 E2B / E4B (#627), drafter masks (bidirectional full + sliding-window) and normalize_batched_shared_kv_states (#628), Gemma4AssistantDraftModel (4-layer drafter + pre/post projections) (#626), and DFlashDraftModel (5-layer drafter + DFlashAttention + DFlashKVCache) (#635)
- Target-side hooks: Gemma 4 return_hidden / return_shared_kv / rollback_speculative_cache (#625); Qwen 3.5 return_hidden + capture_layer_ids + GDN-aware rollback_speculative_cache (#634)
- Round loops: DFlash single-batch (#636), MtpGenerator single-batch (#629), batched DFlash with continuous batching + GDN-aware rollback (#637), and batched MTP with continuous batching + left-padding normalization (#631)
- Real-model byte-equality end-to-end tests (#685) and a greedy-parity + perf benchmark harness (#632)
Server speculative dispatch:
- Speculative dispatch resolution and MtpTarget adapters wired into the server (#666); the assistant model paths now plug into the real MaskedEmbedder and make_drafter_masks
- Speculative dispatch wired into the scheduler via per-request B=1 bursts (#670) and a B>1 batched path via MtpBatchedGenerator / DFlashBatchedGenerator (#684)
- Per-request properties propagated through the speculative-burst path: cancellation propagation through MtpGenerator / DFlashGenerator (#681), token_history threading through the speculative-burst first sample (#682), logprobs support (#686), thinking-budget enforcement (#687), and prompt-cache donate symmetric with the classic path (#680) and into the B>1 batched arm (#689)
--draft-kind {dflash,mtp} and --draft-block-size CLI flags on both mlxcel and mlxcel-server (#630).
OpenAI Responses API (Phase 1) at /v1/responses for both binaries (#622, #623). Conversation store with shared-LRU semantics, response.created / response.in_progress / response.completed SSE event stream, reasoning-trace forwarding, response cancellation, and four new CLI flags. User guide at docs/responses-api.md.
APC block-level partial cache adoption in the scheduler (#580, #607). When APC is on, a prompt sharing the first N blocks with a cached entry but diverging at block N+1 reuses blocks 0..N and re-prefills only from the divergence boundary. APC-off retains bit-exact prior behaviour. Three components: DetachedKVCache::trim_to and DetachedCacheSet::truncate_to (FP16, INT8, Turbo4, Turbo4Delegated sidecars), relaxed PromptCacheStore containment gate, and Scheduler::try_adopt_cached_prefix truncate-on-adopt. Bench procedure in docs/apc-partial-adoption-bench.md.
Nemotron H Nano Omni audio modality (#582, #609). Parakeet/Conformer sound encoder, mel-spectrogram feature extractor, audio projector, and runtime path that merges audio token embeddings at sound_context_token_id and interleaves them with vision tokens. Loader applies the upstream sanitize_audio_weights transpose pass. Bring-up guide at docs/nemotron-h-nano-omni-audio-bringup.md.
mlxcel download / mlxcel-server download progress bars (#648, #649). New src/downloader/progress.rs module provides terminal-aware suppression, a MultiProgress factory, and 6 suppression unit tests. The downloader streams files via reqwest to a NamedTempFile and atomically renames into place.
Server --max-kv-size flag matching llama-server, plus a tightened chat-completion response envelope (#618).
Tokenizer support for multi-token think and tool-call sequences so chat templates that emit <think> / <tool_call> across multiple BPE tokens stream and parse correctly (#590, #613).
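
The APC partial-adoption item above (reuse blocks 0..N, re-prefill from the divergence boundary) can be illustrated with a small block-matching sketch. This is not the scheduler's code: `shared_block_count`, `plan_prefill`, and the `block_size` parameter are hypothetical names, and the real implementation compares hash-keyed blocks rather than raw token slices.

```rust
/// Number of leading full blocks that `prompt` and `cached` have in common.
/// `block_size` is an assumed parameter; these notes do not state the real value.
pub fn shared_block_count(prompt: &[u32], cached: &[u32], block_size: usize) -> usize {
    assert!(block_size > 0, "block_size must be non-zero");
    prompt
        .chunks_exact(block_size)
        .zip(cached.chunks_exact(block_size))
        .take_while(|(a, b)| a == b)
        .count()
}

/// Returns (tokens reusable from the cached entry, tokens that still need prefill).
pub fn plan_prefill(prompt: &[u32], cached: &[u32], block_size: usize) -> (usize, usize) {
    let reused = shared_block_count(prompt, cached, block_size) * block_size;
    (reused, prompt.len() - reused)
}
```

With a block size of 2, a 9-token prompt that matches a cached entry for its first three blocks and diverges in the fourth would reuse 6 tokens and re-prefill the remaining 3.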

Improvements

StreamFilter extended to handle multi-token markers and to reset state when a partial marker is broken by a non-marker token (#613).
Post-merge hardening of the speculative drafter epic: fixes for miscellaneous invariants surfaced by integration testing against the real z-lab/Qwen3.5-4B-DFlash checkpoint and the mlx-community/gemma-4-* drafter variants.
README and speculative decoding guidance refreshed to match current code paths and the latest M1 Ultra / M5 Max benchmarks (#700).
Qwen 3.5 DFlash greedy-argmax decode-path optimization that drops the per-decode-step copy and an unnecessary argmax temporary, restoring decode tok/s on Qwen 3.5 32B / 9B DFlash configurations.
Avoid slow Gemma 4 MTP singleton bursts — the speculative-burst path now correctly short-circuits to the classic path when the batch size collapses to 1 with no draft tokens accepted, eliminating a per-step over-evaluation regression (#698).
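
The StreamFilter item above describes two behaviours: holding back tokens while they could still form a multi-token marker, and resetting when a non-marker token breaks the partial match. A minimal sketch of that idea follows. `MarkerFilter` is a hypothetical type, and treating tokens as already-decoded strings is an assumption; the real StreamFilter interface is not shown in these notes.

```rust
/// Holds back tokens that may be the start of a multi-token marker.
pub struct MarkerFilter {
    marker: Vec<String>,
    held: Vec<String>,
}

impl MarkerFilter {
    pub fn new(marker: &[&str]) -> Self {
        assert!(!marker.is_empty(), "marker must contain at least one token");
        Self {
            marker: marker.iter().map(|s| s.to_string()).collect(),
            held: Vec::new(),
        }
    }

    /// True while the held tokens still match the beginning of the marker.
    fn held_is_marker_prefix(&self) -> bool {
        self.held.len() <= self.marker.len()
            && self.held.iter().zip(&self.marker).all(|(h, m)| h == m)
    }

    /// Feed one token. Returns the tokens that are now safe to emit and
    /// whether the full marker was just completed (and swallowed).
    pub fn push(&mut self, token: &str) -> (Vec<String>, bool) {
        self.held.push(token.to_string());
        let mut out = Vec::new();
        // A broken partial marker releases held tokens from the front until
        // the remainder is a valid marker prefix again (possibly empty).
        while !self.held.is_empty() && !self.held_is_marker_prefix() {
            out.push(self.held.remove(0));
        }
        if self.held.len() == self.marker.len() {
            self.held.clear();
            return (out, true);
        }
        (out, false)
    }
}
```

Releasing tokens from the front, rather than flushing everything at once, keeps a marker detectable even when it begins inside a broken partial match.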

Bug Fixes

DFlash drafter lazy-bind for the upstream z-lab/Qwen3.5-4B-DFlash checkpoint — Drafter::bind was previously not called on the DFlash family, causing an internal cache mis-binding on the first speculative burst. The drafter now performs lazy-bind on first use, matching the MTP path (#683).
Enable DFlash for Qwen 3.5 VLM text requests — pure-text generations against a Qwen 3.5 VLM checkpoint can now resolve a DFlash drafter when one is available, instead of silently falling back to the classic path (#694).
Speculative-rollback safety: validate trimmable cache and reserve the last token in prefill so a rolled-back speculative burst always lands on a valid sampling boundary (#612).
Prompt cache RadixTrie: pop_prefixes now uses correct immediate-prefix semantics (#617).
MiniMax M2 parallel tool calling parser correctly emits one ChatToolCall per parallel call instead of merging them into a single call (#616).
Server tool-call buffering: preserve token positions when buffering parallel tool calls (#615); skip the tool→normal transition when tool_call_end is empty so streaming continues correctly for templates without an explicit close marker (#614).
video_url allowlist TOCTOU race closed by passing the resolved OwnedFd to ffmpeg via /dev/fd/N instead of re-opening the path inside the subprocess. Symlink swaps between metadata and the subsequent open now cannot mis-route the subprocess (#601, #611).
Gemma 4: skip k_proj / v_proj / k_norm weight load for KV-shared layers — the previous load step would error out on real Gemma 4 E2B / E4B checkpoints that omit these tensors per KV-shared design (#608).
Nemotron-H: default time_step_limit to (0.0, +inf) regardless of time_step_min / time_step_max, matching upstream mlx-lm even when only one bound is supplied (#619).
gated_delta masked Metal kernel variants: zero-init y[dv_idx] when the mask is false (#610).
Tests: add max_kv_size field to the ServeArgs test fixture (#620).
Address upstream-sync review follow-ups carried over from the v0.0.26 sync cycle.
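
For readers unfamiliar with why rollback needs a valid boundary (#612), the following sketch shows the generic greedy draft-then-verify step. It is the textbook scheme, not mlxcel's implementation: `verify_greedy` and `BurstOutcome` are hypothetical names, and the target is assumed to supply one greedy choice per draft position plus one extra.

```rust
/// Outcome of verifying one speculative burst under greedy decoding.
#[derive(Debug, PartialEq, Eq)]
pub struct BurstOutcome {
    /// Accepted draft tokens followed by one token chosen by the target.
    pub committed: Vec<u32>,
    /// Number of speculative cache positions that must be trimmed.
    pub rollback: usize,
}

/// `draft` holds k drafted tokens. `target_argmax` holds the target model's
/// greedy choice at each of the k draft positions plus one more (k + 1 total).
pub fn verify_greedy(draft: &[u32], target_argmax: &[u32]) -> BurstOutcome {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    // Accept the longest prefix on which drafter and target agree.
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    let mut committed = draft[..accepted].to_vec();
    // The target's own choice replaces the first rejected draft token,
    // or extends the sequence when every draft token was accepted.
    committed.push(target_argmax[accepted]);
    BurstOutcome {
        committed,
        rollback: draft.len() - accepted,
    }
}
```

Because every committed token equals the target's greedy choice, the output matches classic greedy decoding, which is the property the byte-equality tests check. The `rollback` count is what a trimmable cache has to support.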

CI/CD Improvements

Bump GitHub Actions to Node 24 runtime to clear the Node 20 deprecation warning on macOS runners.

Technical Details

MLX C++ upstream pin: unchanged from v0.0.26 — fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ continue to use the validated mlx::core::fast::metal_kernel, mlx::core::full, mlx::core::Shape, mlx::core::float32, mlx::core::int32, and metal::fast::exp surface from the previous cycle.
Drafter dispatch architecture: the Drafter trait + DrafterKind enum lets a single scheduler dispatch path drive both Gemma 4 MTP (re-using parent layer 0–3 hidden states with cross-attention into a 4-layer drafter) and Qwen 3.5 DFlash (5-layer drafter + GDN cache rollback) without per-family branching in the hot loop. Burst dispatch decisions live in src/server/batch/speculative_burst.rs (speculative_dispatch::resolve).
Speculative-burst integration with prompt cache: the donate path is symmetric with the classic path (donate_finished_sequence_cache) and applies to both the B=1 burst and the B>1 batched arm, so cache hit rates are unaffected by the dispatch decision.
APC partial adoption invariant: the truncate_to operation is only invoked when apc_consistent_prefix_len < entry_len, so a full-prefix hit takes the existing fast path and incurs no slicing overhead.
Downloader streaming: stream_file (outer) handles destination resolution, cache-hit detection, and atomic rename; stream_to_tempfile (inner) performs the network read into a NamedTempFile. Outer/inner split keeps the progress bar coverage clean and lets tests stub the network path.
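
The outer/inner split described in the downloader note can be sketched with the standard library alone. The function names mirror the ones mentioned above, but the signatures, the generic `Read` source standing in for the reqwest body, the `.part` sibling file standing in for NamedTempFile, and the progress callback are all assumptions made for illustration.

```rust
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::Path;

/// Inner step: copy bytes from any reader into a temporary file,
/// reporting each chunk size to `on_bytes` (e.g. to advance a progress bar).
fn stream_to_tempfile<R: Read>(
    mut src: R,
    tmp: &Path,
    mut on_bytes: impl FnMut(u64),
) -> io::Result<u64> {
    let mut out = File::create(tmp)?;
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break;
        }
        out.write_all(&buf[..n])?;
        total += n as u64;
        on_bytes(n as u64);
    }
    out.sync_all()?;
    Ok(total)
}

/// Outer step: cache-hit detection, destination setup, and the final rename.
/// Returns Ok(false) when the destination already exists and nothing was written.
pub fn stream_file<R: Read>(
    src: R,
    dest: &Path,
    on_bytes: impl FnMut(u64),
) -> io::Result<bool> {
    if dest.exists() {
        return Ok(false);
    }
    if let Some(parent) = dest.parent() {
        fs::create_dir_all(parent)?;
    }
    // Hypothetical temp naming; a sibling path keeps the rename on one filesystem.
    let tmp = dest.with_extension("part");
    stream_to_tempfile(src, &tmp, on_bytes)?;
    fs::rename(&tmp, dest)?;
    Ok(true)
}
```

Because the inner function accepts any `Read`, a test can pass an in-memory byte slice instead of a network response, which is the stubbing benefit the note refers to.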

Dependencies

No new direct dependency additions over v0.0.26. reqwest 0.12 (existing) is used for the streaming downloader path.

Breaking Changes

None for end users. The new --draft-kind / --draft-block-size CLI flags default to off, so the classic non-speculative path is unchanged unless the user opts in.
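
The opt-in behaviour can be pictured as an optional flag value that resolves to the classic path when absent. In the sketch below only the flag values `dflash` and `mtp` and the name DrafterKind come from these notes; the enum variants, `DecodePath`, and `resolve_path` are hypothetical, and the model_type auto-detection from #624 is not modelled.

```rust
use std::str::FromStr;

/// Drafter families selectable through `--draft-kind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DrafterKind {
    DFlash,
    Mtp,
}

impl FromStr for DrafterKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dflash" => Ok(Self::DFlash),
            "mtp" => Ok(Self::Mtp),
            other => Err(format!("unknown draft kind: {other}")),
        }
    }
}

/// Which decode path a request takes (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum DecodePath {
    Classic,
    Speculative(DrafterKind),
}

/// No flag means the classic non-speculative path; a flag must parse cleanly.
pub fn resolve_path(draft_kind_flag: Option<&str>) -> Result<DecodePath, String> {
    match draft_kind_flag {
        None => Ok(DecodePath::Classic),
        Some(raw) => raw.parse::<DrafterKind>().map(DecodePath::Speculative),
    }
}
```

Modelling the flag as an `Option` makes the default explicit: an existing invocation that never passes `--draft-kind` cannot end up on a speculative path by accident.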

Known Issues

B=1 burst on the smallest dense Gemma 4 configurations (e.g. gemma-4-e2b-it-4bit) is currently bounded by per-iteration overhead; the singleton-burst short-circuit (#698) covers this on the hot path but the steady-state speedup for B=1 is checkpoint-dependent. See docs/speculative-decoding.md for tuning guidance.
DFlash speed-up on Qwen 3.5 VLM text-only requests requires a real DFlash drafter checkpoint to be present alongside the target; without one, the path silently falls back to classic decode (this is intentional but can surprise users expecting a server-side warning)...

Contributors

inureyes

18 May 10:01

inureyes

v0.0.26

9779bcf

v0.0.26

What's Changed

mlxcel v0.0.26 ships the TurboQuant KV cache family (3-4 bit/value compression via Walsh-Hadamard rotation + PolarQuant), Automatic Prefix Caching with hash blocks, OpenAI-compatible json_schema constrained decoding, an in-tree HuggingFace download subcommand, VLM video input infrastructure (Gemma 4 + new Youtu-VL and Nemotron H Nano Omni vision models), and a substantial decode hot-path performance pass on top of the v0.0.25 prompt-prefix KV cache foundation.

New Features

TurboQuant KV cache — full mode family wired through KVCacheMode:
- Turbo4 symmetric with per-model allowlist (#476)
- Turbo4Asym Fp16-K + Turbo4-V (#474)
- Turbo3Asym 3-bit asymmetric Fp16-K + Turbo3-V (#477)
- Turbo4Delegated FP16 hot tail + packed turbo cold body (#479)
- Walsh-Hadamard transform op (#470) and PolarQuant Lloyd-Max codebook generator (#472)
- RotatingKVCache (sliding-window) integration (B9), Boundary-V layer protection (#478), packed-aware PagedKvLayout (#482), sparse-V dequant scaffolding (#480)
- llama-server flag parity: --cache-type-k / --cache-type-v accept mlxcel_turbo* variants (#484)
- KV cache quantization extended to continuous batching (#545)
- Unified TurboQuant CLI flags across all binaries (#567)
- User guide and validated config matrix (#485)
Automatic Prefix Caching (APC) with hash blocks (#552) — hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache
OpenAI-compatible response_format: json_schema constrained decoding via llguidance (#550) — same backend as upstream mlx-vlm PR #1047, with 64 KiB / 32-depth / 64-$ref schema limits, SHA-256 fingerprinted tokenizer-environment cache, and reusable per-sequence mask_buf / bias_buf allocations
mlxcel download / mlxcel-server download subcommand (#457, #486) — fetch HuggingFace model snapshots without Python tooling, with allow-list file filter, atomic writes, and path-traversal defense
Paged scheduler dispatch on PagedKvLayout::cache_mode (#508)
/health endpoint exposes context_size and tool_call_parser (#549, #572)
VLM video input infrastructure:
- Gemma 4 video support and VLM video input pipeline (#553)
- video_url content blocks wired through /v1/chat/completions (#596)
- Single-pass ffmpeg frame extraction with Drop guard (#597)
- Content-preservation tests for video frame extraction (#598)
New models: Youtu-VL (#555); Nemotron H Nano Omni vision (#554, #595)
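
For context on the Walsh-Hadamard transform op listed under TurboQuant (#470), here is the standard in-place butterfly algorithm. It is a generic reference sketch, not the mlxcel kernel: the function name `fwht` is hypothetical, the transform is left unnormalized, and whatever scaling or sign handling TurboQuant applies around it is not shown.

```rust
/// In-place fast Walsh-Hadamard transform (unnormalized).
/// The input length must be a power of two.
pub fn fwht(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // Combine pairs that are `h` apart within each block of size 2h.
        for start in (0..n).step_by(h * 2) {
            for i in start..start + h {
                let (a, b) = (data[i], data[i + h]);
                data[i] = a + b;
                data[i + h] = a - b;
            }
        }
        h *= 2;
    }
}
```

Applying the unnormalized transform twice returns the input scaled by its length, so dividing by the length (or scaling each pass by the inverse square root of the length) gives a self-inverse rotation.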

Improvements

Sparse-V Metal kernel: fused per-thread SDPA skipping (#505); precomputed kernel rescale dropping per-token threadgroup barriers (#520)
Turbo4Delegated decode hot path:
- Unified K storage to drop per-step K concat (#527)
- Cold-V dequant cache across decode steps (#525) → cold-V dequant Metal kernel retiring the FP16 memo (#530)
- Steel-attention-envelope fused SDPA kernel (#531)
- Parallelized Pass 1 softmax in turbo4_delegated_steel_sdpa (#534)
- Delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537)
- Compressed fold moved before decode
Compressed dequant-SDPA paths for TurboQuant decode (#562)
Server hot-path: thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556)
Quality and speed gates:
- Wikitext-2 PPL + NIAH harness (#475) with the full 283K-token test split (#492) and per-model results committed (#493)
- VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for #510
- KV speed gate matrix runner with M1 Ultra (#509) and M5 Max readings
Benchmarks: M1 Ultra refresh to 2026-05-08 (#577); full M1 Ultra column resync in benchmarks-by-hardware.md (#578)
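
As a reminder of what the PPL figure in the quality gates measures, perplexity is the exponential of the mean negative log-likelihood per token. The helper below is a generic sketch with a hypothetical name; it assumes natural-log token probabilities and says nothing about how the harness batches or windows the Wikitext-2 split.

```rust
/// Perplexity from per-token natural-log probabilities.
pub fn perplexity(token_logprobs: &[f64]) -> f64 {
    assert!(!token_logprobs.is_empty(), "need at least one token");
    let mean_nll = -token_logprobs.iter().sum::<f64>() / token_logprobs.len() as f64;
    mean_nll.exp()
}
```

A model that assigns probability 0.25 to every token scores a perplexity of about 4, which makes the metric easy to sanity-check when comparing KV cache modes.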

Bug Fixes

TurboQuant continuous batching: correct batch cache offset merging when batches with different cache offsets are joined or split (#564); Turbo3 split-flag, documentation alignment, and ENV_LOCK race in concurrent process startup (#573)
Vision / VLM mixed batching:
- Per-sequence MRoPE alignment for mixed VL+text batches (#558)
- Per-sequence per_layer_inputs for Gemma 4 E2B/E4B VLM (#561)
- Mixed-length batching support for Gemma 4 (#560)
- Relaxed cached-position shape check in Qwen VL chunked prefill (#557)
- Qwen3.5-MoE batch-size validation on cached position_ids reuse (#559)
Streaming and sampling:
- Streamed detokenization for byte-fallback tokens (#570)
- Top-p filter correctness for batched logits (#569)
- Token queue timeout handling during long prefills (#571)
- StreamFilter extended to cover Hermes-style <tool_call> and Mistral Nemo [TOOL_CALLS] markers (#551, #576) — partial-marker buffering at token boundaries; Gemma 4 <|tool_call> suppression unaffected
Models:
- Gemma3-4B attention SIGABRT from sliding-window mask T_k mismatch on long-context prompts (#507)
- Preserve Qwen2 fused QKV bias when present in checkpoint (#517)
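
The top-p fix for batched logits (#569) concerns applying the nucleus cutoff to each sequence separately. The sketch below shows one way to do that per row. It is an illustration under assumptions rather than the fixed code: the function name is hypothetical, and it operates on already-normalized probabilities instead of raw logits.

```rust
/// Apply nucleus (top-p) filtering to each row of a batch independently.
/// Tokens outside the nucleus are set to 0.0 and the rest are renormalized.
pub fn top_p_filter_batched(probs: &mut [Vec<f32>], top_p: f32) {
    for row in probs.iter_mut() {
        // Token indices sorted by descending probability for this row only.
        let mut order: Vec<usize> = (0..row.len()).collect();
        order.sort_by(|&a, &b| row[b].total_cmp(&row[a]));

        // Keep the smallest set of tokens whose cumulative mass reaches top_p.
        let mut keep = vec![false; row.len()];
        let mut cumulative = 0.0f32;
        for &idx in &order {
            keep[idx] = true;
            cumulative += row[idx];
            if cumulative >= top_p {
                break;
            }
        }

        let kept_mass: f32 = row
            .iter()
            .zip(&keep)
            .filter(|(_, k)| **k)
            .map(|(p, _)| *p)
            .sum();
        for (p, k) in row.iter_mut().zip(&keep) {
            *p = if *k && kept_mass > 0.0 { *p / kept_mass } else { 0.0 };
        }
    }
}
```

Sorting and accumulating inside the per-row loop is the point of the example: a cutoff computed across the whole batch would let one sequence's distribution decide which tokens another sequence keeps.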

CI/CD Improvements

None.

Technical Details

Refactor: unified TurboQuant KV-cache CLI flags across mlxcel, mlxcel-server, and mlxcel download so all binaries accept the same --kv-cache-mode / --cache-type-{k,v} syntax (#567)
Test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506)
Post-merge review hardening for the Nemotron-H Nano Omni vision PR (#595)
mlx-lm version reference in docs bumped 0.31.2 → 0.31.3 (#606); bridge-overhead-microbench's v0.31.2 reference is preserved because it pins the MLX C++ runtime, not mlx-lm Python

Dependencies

MLX upstream pin bumped twice:
- First to v0.32.0 / c9aa5605 (#565)
- Then forward to 84961223 covering 3 PRs:
  - #3443 splits the CUDA qmm_naive / qmm_sm80 kernel bodies into new qmm_naive.cuh / qmm_sm80.cuh headers without changing the public ABI consumed by mlxcel's patches/mlx/backend/cuda/quantized/qmm/qmm.h
  - #3463 routes the CPU JIT preamble through JitCompiler::get_preamble() and renames the prebuilt symbol from get_kernel_preamble to get_prebuilt_preamble (mlxcel does not call either directly)
  - #3475 fixes contiguity-flag accuracy in AsStrided by computing data_size from the actually-occupied stride range
- Three-location pin update applied to src/lib/mlx-cpp/CMakeLists.txt, src/lib/mlxcel-core/build.rs, and .github/workflows/release.yml per CLAUDE.md
- Fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ re-validated against both bumps; symbols unchanged

Breaking Changes

None.

Known Issues

Small dense Qwen3/3.5 decode regression on M1 Ultra under investigation: qwen3-0.6b (-25.2%), qwen3-1.7b (-10.8%), qwen3.5-0.8b (-11.1%), molmo2-4b (-10.7%) between the 4-04 and 5-08 baselines. The pattern suggests a fixed-cost overhead in the decode hot path, likely from the K-storage unification (#527) or the sparse-V work from the same period. The bisect is not yet complete.
mlx-lm baseline re-measurement pending for model_tests_m1ultra.md Performance Comparison vs mlx-lm percentages (still anchored to 4-04 baseline; needs one mlx-lm 0.31.2 cycle on M1 Ultra to restore coherence).
M5 Max re-bench cycle pending — would naturally close the Notable Regressions table entries that are currently artifacts of M1 Ultra (5-08) vs M5 Max (4-13) measurement vintage gap.

What's Changed

feat: add download subcommand for HuggingFace model repositories by @inureyes in lablup/mlxcel-internal#486
feat(core): port PolarQuant Lloyd-Max codebook generator to Rust (#472) by @inureyes in lablup/mlxcel-internal#487
feat(core): add Walsh-Hadamard transform op for TurboQuant KV cache (#470) by @inureyes in lablup/mlxcel-internal#488
feat(core): KVCacheMode::Turbo4Asym — Fp16-K + Turbo4-V (#474) by @inureyes in lablup/mlxcel-internal#490
feat(core): TurboQuant KV cache quality gate — wikitext-2 PPL + NIAH (#475) by @inureyes in lablup/mlxcel-internal#491
feat(core): KVCacheMode::Turbo4 (symmetric) with per-model allowlist (#476) by @inureyes in lablup/mlxcel-internal#494
feat(core): TurboQuant + RotatingKVCache (sliding window) — B9 by @inureyes in lablup/mlxcel-internal#496
feat(core): KVCacheMode::Turbo4Delegated — FP16 hot tail + packed turbo cold body (#479) by @inureyes in lablup/mlxcel-internal#495
test(fixtures): replace wikitext-2 placeholder with full 283K-token test split (#492) by @inureyes in lablup/mlxcel-internal#497
feat(core): Boundary-V layer protection for Turbo4* KV cache modes (#478) by @inureyes in lablup/mlxcel-internal#499
feat(core): sparse-V dequant scaffolding (#480) by @inureyes in lablup/mlxcel-internal#498
test(bench): commit per-model PPL+NIAH gate results on Apple Silicon (#493) by @inureyes in lablup/mlxcel-internal#500
feat(core): KVCacheMode::Turbo3Asym — 3-bit asymmetric KV (Fp16-K + Turbo3-V) (#477) by @inureyes in lablup/mlxcel-internal#503
feat(core): packed-aware PagedKvLayout for Turbo4 KV (#482) by @inureyes in lablup/mlxcel-internal#502
feat(server): --cache-type-k/--cache-type-v server flag parity with llama-server (#484) by @inureyes in lablup/mlxcel-internal#501
docs: TurboQuant KV cache user guide and validated config matrix (B12 / #485) by @inureyes in lablup/mlxcel-internal#504
feat(core): fused Sparse-V Metal kernel — fused per-thread SDPA skipping (#505) by @inureyes in lablup/mlxcel-internal#511
...

Contributors

inureyes
