Release v0.0.26 · lablup/mlxcel

What's Changed

mlxcel v0.0.26 ships the TurboQuant KV cache family (3-4 bit/value compression via Walsh-Hadamard rotation + PolarQuant), Automatic Prefix Caching with hash blocks, OpenAI-compatible json_schema constrained decoding, an in-tree HuggingFace download subcommand, VLM video input infrastructure (Gemma 4 + new Youtu-VL and Nemotron H Nano Omni vision models), and a substantial decode hot-path performance pass on top of the v0.0.25 prompt-prefix KV cache foundation.

New Features

TurboQuant KV cache — full mode family wired through KVCacheMode:
- Turbo4 symmetric with per-model allowlist (#476)
- Turbo4Asym Fp16-K + Turbo4-V (#474)
- Turbo3Asym 3-bit asymmetric Fp16-K + Turbo3-V (#477)
- Turbo4Delegated FP16 hot tail + packed turbo cold body (#479)
- Walsh-Hadamard transform op (#470) and PolarQuant Lloyd-Max codebook generator (#472)
- RotatingKVCache (sliding-window) integration (B9), Boundary-V layer protection (#478), packed-aware PagedKvLayout (#482), sparse-V dequant scaffolding (#480)
- llama-server flag parity: --cache-type-k / --cache-type-v accept mlxcel_turbo* variants (#484)
- KV cache quantization extended to continuous batching (#545)
- Unified TurboQuant CLI flags across all binaries (#567)
- User guide and validated config matrix (#485)
Automatic Prefix Caching (APC) with hash blocks (#552) — hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache
OpenAI-compatible response_format: json_schema constrained decoding via llguidance (#550) — same backend as upstream mlx-vlm PR #1047, with 64 KiB / 32-depth / 64-$ref schema limits, SHA-256 fingerprinted tokenizer-environment cache, and reusable per-sequence mask_buf / bias_buf allocations
mlxcel download / mlxcel-server download subcommand (#457, #486) — fetch HuggingFace model snapshots without Python tooling, with allow-list file filter, atomic writes, and path-traversal defense
Paged scheduler dispatch on PagedKvLayout::cache_mode (#508)
/health endpoint exposes context_size and tool_call_parser (#549, #572)
VLM video input infrastructure:
- Gemma 4 video support and VLM video input pipeline (#553)
- video_url content blocks wired through /v1/chat/completions (#596)
- Single-pass ffmpeg frame extraction with Drop guard (#597)
- Content-preservation tests for video frame extraction (#598)
New models: Youtu-VL (#555); Nemotron H Nano Omni vision (#554, #595)

Improvements

Sparse-V Metal kernel: fused per-thread SDPA skipping (#505); precomputed kernel rescale dropping per-token threadgroup barriers (#520)
Turbo4Delegated decode hot path:
- Unified K storage to drop per-step K concat (#527)
- Cold-V dequant cache across decode steps (#525) → cold-V dequant Metal kernel retiring the FP16 memo (#530)
- Steel-attention-envelope fused SDPA kernel (#531)
- Parallelized Pass 1 softmax in turbo4_delegated_steel_sdpa (#534)
- Delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537)
- Compressed fold moved before decode
Compressed dequant-SDPA paths for TurboQuant decode (#562)
Server hot-path: thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556)
Quality and speed gates:
- Wikitext-2 PPL + NIAH harness (#475) with the full 283K-token test split (#492) and per-model results committed (#493)
- VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for #510
- KV speed gate matrix runner with M1 Ultra (#509) and M5 Max readings
Benchmarks: M1 Ultra refresh to 2026-05-08 (#577); full M1 Ultra column resync in benchmarks-by-hardware.md (#578)

Bug Fixes

TurboQuant continuous batching: correct batch cache offset merging when batches with different cache offsets are joined or split (#564); Turbo3 split-flag, documentation alignment, and ENV_LOCK race in concurrent process startup (#573)
Vision / VLM mixed batching:
- Per-sequence MRoPE alignment for mixed VL+text batches (#558)
- Per-sequence per_layer_inputs for Gemma 4 E2B/E4B VLM (#561)
- Mixed-length batching support for Gemma 4 (#560)
- Relaxed cached-position shape check in Qwen VL chunked prefill (#557)
- Qwen3.5-MoE batch-size validation on cached position_ids reuse (#559)
Streaming and sampling:
- Streamed detokenization for byte-fallback tokens (#570)
- Top-p filter correctness for batched logits (#569)
- Token queue timeout handling during long prefills (#571)
- StreamFilter extended to cover Hermes-style <tool_call> and Mistral Nemo [TOOL_CALLS] markers (#551, #576) — partial-marker buffering at token boundaries; Gemma 4 <|tool_call> suppression unaffected
Models:
- Gemma3-4B attention SIGABRT from sliding-window mask T_k mismatch on long-context prompts (#507)
- Preserve Qwen2 fused QKV bias when present in checkpoint (#517)

CI/CD Improvements

None.

Technical Details

Refactor: unified TurboQuant KV-cache CLI flags across mlxcel, mlxcel-server, and mlxcel download so all binaries accept the same --kv-cache-mode / --cache-type-{k,v} syntax (#567)
Test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506)
Post-merge review hardening for the Nemotron-H Nano Omni vision PR (#595)
mlx-lm version reference in docs bumped 0.31.2 → 0.31.3 (#606); bridge-overhead-microbench's v0.31.2 reference is preserved because it pins the MLX C++ runtime, not mlx-lm Python

Dependencies

MLX upstream pin bumped twice:
- First to v0.32.0 / c9aa5605 (#565)
- Then forward to 84961223 covering 3 PRs:
  - #3443 splits the CUDA qmm_naive / qmm_sm80 kernel bodies into new qmm_naive.cuh / qmm_sm80.cuh headers without changing the public ABI consumed by mlxcel's patches/mlx/backend/cuda/quantized/qmm/qmm.h
  - #3463 routes the CPU JIT preamble through JitCompiler::get_preamble() and renames the prebuilt symbol from get_kernel_preamble to get_prebuilt_preamble (mlxcel does not call either directly)
  - #3475 fixes contiguity-flag accuracy in AsStrided by computing data_size from the actually-occupied stride range
- Three-location pin update applied to src/lib/mlx-cpp/CMakeLists.txt, src/lib/mlxcel-core/build.rs, and .github/workflows/release.yml per CLAUDE.md
- Fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ re-validated against both bumps; symbols unchanged

Breaking Changes

None.

Known Issues

Small dense Qwen3/3.5 decode regression on M1 Ultra under investigation: qwen3-0.6b (-25.2%), qwen3-1.7b (-10.8%), qwen3.5-0.8b (-11.1%), molmo2-4b (-10.7%) between 4-04 and 5-08 baselines; pattern suggests fixed-cost overhead in the decode hot path (likely K-storage unification #527 or sparse-V around that period). Bisect not yet completed.
mlx-lm baseline re-measurement pending for model_tests_m1ultra.md Performance Comparison vs mlx-lm percentages (still anchored to 4-04 baseline; needs one mlx-lm 0.31.2 cycle on M1 Ultra to restore coherence).
M5 Max re-bench cycle pending — would naturally close the Notable Regressions table entries that are currently artifacts of M1 Ultra (5-08) vs M5 Max (4-13) measurement vintage gap.

What's Changed

feat: add download subcommand for HuggingFace model repositories by @inureyes in lablup/mlxcel-internal#486
feat(core): port PolarQuant Lloyd-Max codebook generator to Rust (#472) by @inureyes in lablup/mlxcel-internal#487
feat(core): add Walsh-Hadamard transform op for TurboQuant KV cache (#470) by @inureyes in lablup/mlxcel-internal#488
feat(core): KVCacheMode::Turbo4Asym — Fp16-K + Turbo4-V (#474) by @inureyes in lablup/mlxcel-internal#490
feat(core): TurboQuant KV cache quality gate — wikitext-2 PPL + NIAH (#475) by @inureyes in lablup/mlxcel-internal#491
feat(core): KVCacheMode::Turbo4 (symmetric) with per-model allowlist (#476) by @inureyes in lablup/mlxcel-internal#494
feat(core): TurboQuant + RotatingKVCache (sliding window) — B9 by @inureyes in lablup/mlxcel-internal#496
feat(core): KVCacheMode::Turbo4Delegated — FP16 hot tail + packed turbo cold body (#479) by @inureyes in lablup/mlxcel-internal#495
test(fixtures): replace wikitext-2 placeholder with full 283K-token test split (#492) by @inureyes in lablup/mlxcel-internal#497
feat(core): Boundary-V layer protection for Turbo4* KV cache modes (#478) by @inureyes in lablup/mlxcel-internal#499
feat(core): sparse-V dequant scaffolding (#480) by @inureyes in lablup/mlxcel-internal#498
test(bench): commit per-model PPL+NIAH gate results on Apple Silicon (#493) by @inureyes in lablup/mlxcel-internal#500
feat(core): KVCacheMode::Turbo3Asym — 3-bit asymmetric KV (Fp16-K + Turbo3-V) (#477) by @inureyes in lablup/mlxcel-internal#503
feat(core): packed-aware PagedKvLayout for Turbo4 KV (#482) by @inureyes in lablup/mlxcel-internal#502
feat(server): --cache-type-k/--cache-type-v server flag parity with llama-server (#484) by @inureyes in lablup/mlxcel-internal#501
docs: TurboQuant KV cache user guide and validated config matrix (B12 / #485) by @inureyes in lablup/mlxcel-internal#504
feat(core): fused Sparse-V Metal kernel — fused per-thread SDPA skipping (#505) by @inureyes in lablup/mlxcel-internal#511
fix(test): swap B3 Qwen2.5-1.5B fixture to base variant (#506) by @inureyes in lablup/mlxcel-internal#512
fix(utils): Gemma3-4B attention SIGABRT — sliding-window mask T_k mismatch (#507) by @inureyes in lablup/mlxcel-internal#513
feat(server): paged scheduler dispatch on PagedKvLayout::cache_mode (#508) by @inureyes in lablup/mlxcel-internal#514
test(bench): TurboQuant KV speed gate matrix runner + first M1 Ultra reading (#509) by @inureyes in lablup/mlxcel-internal#515
test(bench): TurboQuant Turbo4Asym VLM structural smoke (#510) by @inureyes in lablup/mlxcel-internal#516
fix: preserve Qwen2 fused QKV bias by @inureyes in lablup/mlxcel-internal#517
test(bench): TurboQuant KV speed gate matrix on M5 Max (#509) by @inureyes in lablup/mlxcel-internal#519
perf(core): precompute Sparse-V kernel rescale to drop per-token threadgroup barriers (#520) by @inureyes in lablup/mlxcel-internal#523
perf(core): cache cold-V dequant across decode steps in Turbo4Delegated by @inureyes in lablup/mlxcel-internal#525
perf(core): unify K storage in Turbo4Delegated to drop per-step K concat (#527) by @inureyes in lablup/mlxcel-internal#529
perf(turbo): retire PR-#525 cold-V FP16 memo and add cold-V dequant Metal kernel for Turbo4Delegated by @inureyes in lablup/mlxcel-internal#530
test(turbo): VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for #510 by @inureyes in lablup/mlxcel-internal#518
perf(turbo): steel-attention-envelope fused SDPA kernel for Turbo4Delegated (#531) by @inureyes in lablup/mlxcel-internal#532
bench(turbo): M5 Max gate-miss readings for Turbo4Delegated steel envelope (#531) by @inureyes in lablup/mlxcel-internal#533
perf(turbo): parallelize Pass 1 softmax in turbo4_delegated_steel_sdpa (#534) by @inureyes in lablup/mlxcel-internal#535
update: add delegated FP16 predecode compaction by @inureyes in lablup/mlxcel-internal#536
update: add lazy delegated FP16 sidecars by @inureyes in lablup/mlxcel-internal#537
Move Turbo4Delegated compressed fold before decode by @inureyes in lablup/mlxcel-internal#538
fix(vision): relax cached-position shape check in Qwen VL chunked prefill by @inureyes in lablup/mlxcel-internal#557
fix(vision): per-sequence MRoPE alignment for mixed VL+text batches by @inureyes in lablup/mlxcel-internal#558
fix(qwen3_5_moe): validate seq_length on cached position_ids reuse by @inureyes in lablup/mlxcel-internal#559
fix(vision): support mixed-length batching for Gemma 4 by @inureyes in lablup/mlxcel-internal#560
fix(vision): per-sequence per_layer_inputs for Gemma 4 E2B/E4B VLM by @inureyes in lablup/mlxcel-internal#561
perf(turbo): add compressed dequant-SDPA paths by @inureyes in lablup/mlxcel-internal#562
update: bump MLX pin to v0.32.0 (c9aa5605) by @inureyes in lablup/mlxcel-internal#565
fix(turbo): correct batch cache offset merging in TurboQuant continuous batching by @inureyes in lablup/mlxcel-internal#564
refactor: unify TurboQuant KV-cache CLI flags across mlxcel binaries by @inureyes in lablup/mlxcel-internal#567
feat(server): KV cache quantization for continuous batching (#545) by @inureyes in lablup/mlxcel-internal#568
fix(sampling): correct top_p filter for batched logits by @inureyes in lablup/mlxcel-internal#569
fix(server): correct streamed detokenization for byte-fallback tokens by @inureyes in lablup/mlxcel-internal#570
fix(server): handle token queue timeout during long prefills by @inureyes in lablup/mlxcel-internal#571
feat(server): include context_size and tool_call_parser in /health endpoint by @inureyes in lablup/mlxcel-internal#572
fix: Turbo3 split-flag, doc, and ENV_LOCK race (#573) by @inureyes in lablup/mlxcel-internal#574
feat(server): json_schema response_format support (#550) by @inureyes in lablup/mlxcel-internal#575
fix(server): strip tool-call markup from streamed delta.content by @inureyes in lablup/mlxcel-internal#576
update: refresh M1 Ultra benchmark to 2026-05-08 by @inureyes in lablup/mlxcel-internal#577
update: full M1 Ultra column resync in benchmarks-by-hardware.md by @inureyes in lablup/mlxcel-internal#578
feat(server): Automatic Prefix Caching (APC) with hash blocks (#552) by @inureyes in lablup/mlxcel-internal#579
feat(vision): Gemma 4 video support and VLM video input infrastructure (#553) by @inureyes in lablup/mlxcel-internal#581
feat(models): add Nemotron H Nano Omni vision (#554) by @inureyes in lablup/mlxcel-internal#583
feat(models): add Youtu-VL (#555) by @inureyes in lablup/mlxcel-internal#584
perf(server): thread-local generation stream and uniform-batch RoPE collapse (#556) by @inureyes in lablup/mlxcel-internal#585
fix: harden post-merge review findings for issues #550-#556 by @inureyes in lablup/mlxcel-internal#586
fix(nemotron-h-nano-omni): close PR #583 deferred validation gaps by @inureyes in lablup/mlxcel-internal#595
perf(multimodal): single-pass ffmpeg frame extraction and Drop guard (#597) by @inureyes in lablup/mlxcel-internal#599
feat(server): wire video_url content blocks through chat completion handler (#596) by @inureyes in lablup/mlxcel-internal#600
test(multimodal): content-preservation tests for video frame extraction (#598) by @inureyes in lablup/mlxcel-internal#602
update: bump MLX upstream pin to 84961223 (PRs #3443 #3463 #3475) by @inureyes in lablup/mlxcel-internal#605
docs: bump mlx-lm version reference to 0.31.3 by @inureyes in lablup/mlxcel-internal#606

Full Changelog: lablup/mlxcel-internal@v0.0.25...v0.0.26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.0.26

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

What's Changed

New Features

Improvements

Bug Fixes

CI/CD Improvements

Technical Details

Dependencies

Breaking Changes

Known Issues

What's Changed

Contributors

Uh oh!