v0.0.26
What's Changed
mlxcel v0.0.26 ships the TurboQuant KV cache family (3-4 bit/value compression via Walsh-Hadamard rotation + PolarQuant), Automatic Prefix Caching with hash blocks, OpenAI-compatible json_schema constrained decoding, an in-tree HuggingFace download subcommand, VLM video input infrastructure (Gemma 4 + new Youtu-VL and Nemotron H Nano Omni vision models), and a substantial decode hot-path performance pass on top of the v0.0.25 prompt-prefix KV cache foundation.
New Features
- TurboQuant KV cache — full mode family wired through
KVCacheMode:Turbo4symmetric with per-model allowlist (#476)Turbo4AsymFp16-K + Turbo4-V (#474)Turbo3Asym3-bit asymmetric Fp16-K + Turbo3-V (#477)Turbo4DelegatedFP16 hot tail + packed turbo cold body (#479)- Walsh-Hadamard transform op (#470) and PolarQuant Lloyd-Max codebook generator (#472)
RotatingKVCache(sliding-window) integration (B9), Boundary-V layer protection (#478), packed-awarePagedKvLayout(#482), sparse-V dequant scaffolding (#480)- llama-server flag parity:
--cache-type-k/--cache-type-vacceptmlxcel_turbo*variants (#484) - KV cache quantization extended to continuous batching (#545)
- Unified TurboQuant CLI flags across all binaries (#567)
- User guide and validated config matrix (#485)
- Automatic Prefix Caching (APC) with hash blocks (#552) — hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache
- OpenAI-compatible
response_format: json_schemaconstrained decoding viallguidance(#550) — same backend as upstream mlx-vlm PR #1047, with 64 KiB / 32-depth / 64-$refschema limits, SHA-256 fingerprinted tokenizer-environment cache, and reusable per-sequencemask_buf/bias_bufallocations mlxcel download/mlxcel-server downloadsubcommand (#457, #486) — fetch HuggingFace model snapshots without Python tooling, with allow-list file filter, atomic writes, and path-traversal defense- Paged scheduler dispatch on
PagedKvLayout::cache_mode(#508) /healthendpoint exposescontext_sizeandtool_call_parser(#549, #572)- VLM video input infrastructure:
- Gemma 4 video support and VLM video input pipeline (#553)
video_urlcontent blocks wired through/v1/chat/completions(#596)- Single-pass ffmpeg frame extraction with
Dropguard (#597) - Content-preservation tests for video frame extraction (#598)
- New models: Youtu-VL (#555); Nemotron H Nano Omni vision (#554, #595)
Improvements
- Sparse-V Metal kernel: fused per-thread SDPA skipping (#505); precomputed kernel rescale dropping per-token threadgroup barriers (#520)
- Turbo4Delegated decode hot path:
- Unified K storage to drop per-step K concat (#527)
- Cold-V dequant cache across decode steps (#525) → cold-V dequant Metal kernel retiring the FP16 memo (#530)
- Steel-attention-envelope fused SDPA kernel (#531)
- Parallelized Pass 1 softmax in
turbo4_delegated_steel_sdpa(#534) - Delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537)
- Compressed fold moved before decode
- Compressed dequant-SDPA paths for TurboQuant decode (#562)
- Server hot-path: thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556)
- Quality and speed gates:
- Wikitext-2 PPL + NIAH harness (#475) with the full 283K-token test split (#492) and per-model results committed (#493)
- VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for #510
- KV speed gate matrix runner with M1 Ultra (#509) and M5 Max readings
- Benchmarks: M1 Ultra refresh to 2026-05-08 (#577); full M1 Ultra column resync in
benchmarks-by-hardware.md(#578)
Bug Fixes
- TurboQuant continuous batching: correct batch cache offset merging when batches with different cache offsets are joined or split (#564); Turbo3 split-flag, documentation alignment, and
ENV_LOCKrace in concurrent process startup (#573) - Vision / VLM mixed batching:
- Per-sequence MRoPE alignment for mixed VL+text batches (#558)
- Per-sequence
per_layer_inputsfor Gemma 4 E2B/E4B VLM (#561) - Mixed-length batching support for Gemma 4 (#560)
- Relaxed cached-position shape check in Qwen VL chunked prefill (#557)
- Qwen3.5-MoE batch-size validation on cached
position_idsreuse (#559)
- Streaming and sampling:
- Streamed detokenization for byte-fallback tokens (#570)
- Top-p filter correctness for batched logits (#569)
- Token queue timeout handling during long prefills (#571)
StreamFilterextended to cover Hermes-style<tool_call>and Mistral Nemo[TOOL_CALLS]markers (#551, #576) — partial-marker buffering at token boundaries; Gemma 4<|tool_call>suppression unaffected
- Models:
- Gemma3-4B attention SIGABRT from sliding-window mask
T_kmismatch on long-context prompts (#507) - Preserve Qwen2 fused QKV bias when present in checkpoint (#517)
- Gemma3-4B attention SIGABRT from sliding-window mask
CI/CD Improvements
None.
Technical Details
- Refactor: unified TurboQuant KV-cache CLI flags across
mlxcel,mlxcel-server, andmlxcel downloadso all binaries accept the same--kv-cache-mode/--cache-type-{k,v}syntax (#567) - Test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506)
- Post-merge review hardening for the Nemotron-H Nano Omni vision PR (#595)
- mlx-lm version reference in docs bumped 0.31.2 → 0.31.3 (#606);
bridge-overhead-microbench's v0.31.2 reference is preserved because it pins the MLX C++ runtime, not mlx-lm Python
Dependencies
- MLX upstream pin bumped twice:
- First to v0.32.0 /
c9aa5605(#565) - Then forward to
84961223covering 3 PRs:- #3443 splits the CUDA
qmm_naive/qmm_sm80kernel bodies into newqmm_naive.cuh/qmm_sm80.cuhheaders without changing the public ABI consumed by mlxcel'spatches/mlx/backend/cuda/quantized/qmm/qmm.h - #3463 routes the CPU JIT preamble through
JitCompiler::get_preamble()and renames the prebuilt symbol fromget_kernel_preambletoget_prebuilt_preamble(mlxcel does not call either directly) - #3475 fixes contiguity-flag accuracy in
AsStridedby computingdata_sizefrom the actually-occupied stride range
- #3443 splits the CUDA
- Three-location pin update applied to
src/lib/mlx-cpp/CMakeLists.txt,src/lib/mlxcel-core/build.rs, and.github/workflows/release.ymlperCLAUDE.md - Fused Metal kernel launchers in
src/lib/mlx-cpp/turbo/re-validated against both bumps; symbols unchanged
- First to v0.32.0 /
Breaking Changes
None.
Known Issues
- Small dense Qwen3/3.5 decode regression on M1 Ultra under investigation: qwen3-0.6b (-25.2%), qwen3-1.7b (-10.8%), qwen3.5-0.8b (-11.1%), molmo2-4b (-10.7%) between 4-04 and 5-08 baselines; pattern suggests fixed-cost overhead in the decode hot path (likely K-storage unification #527 or sparse-V around that period). Bisect not yet completed.
- mlx-lm baseline re-measurement pending for
model_tests_m1ultra.mdPerformance Comparisonvs mlx-lmpercentages (still anchored to 4-04 baseline; needs one mlx-lm 0.31.2 cycle on M1 Ultra to restore coherence). - M5 Max re-bench cycle pending — would naturally close the Notable Regressions table entries that are currently artifacts of M1 Ultra (5-08) vs M5 Max (4-13) measurement vintage gap.
What's Changed
- feat: add download subcommand for HuggingFace model repositories by @inureyes in lablup/mlxcel-internal#486
- feat(core): port PolarQuant Lloyd-Max codebook generator to Rust (#472) by @inureyes in lablup/mlxcel-internal#487
- feat(core): add Walsh-Hadamard transform op for TurboQuant KV cache (#470) by @inureyes in lablup/mlxcel-internal#488
- feat(core): KVCacheMode::Turbo4Asym — Fp16-K + Turbo4-V (#474) by @inureyes in lablup/mlxcel-internal#490
- feat(core): TurboQuant KV cache quality gate — wikitext-2 PPL + NIAH (#475) by @inureyes in lablup/mlxcel-internal#491
- feat(core): KVCacheMode::Turbo4 (symmetric) with per-model allowlist (#476) by @inureyes in lablup/mlxcel-internal#494
- feat(core): TurboQuant + RotatingKVCache (sliding window) — B9 by @inureyes in lablup/mlxcel-internal#496
- feat(core): KVCacheMode::Turbo4Delegated — FP16 hot tail + packed turbo cold body (#479) by @inureyes in lablup/mlxcel-internal#495
- test(fixtures): replace wikitext-2 placeholder with full 283K-token test split (#492) by @inureyes in lablup/mlxcel-internal#497
- feat(core): Boundary-V layer protection for Turbo4* KV cache modes (#478) by @inureyes in lablup/mlxcel-internal#499
- feat(core): sparse-V dequant scaffolding (#480) by @inureyes in lablup/mlxcel-internal#498
- test(bench): commit per-model PPL+NIAH gate results on Apple Silicon (#493) by @inureyes in lablup/mlxcel-internal#500
- feat(core): KVCacheMode::Turbo3Asym — 3-bit asymmetric KV (Fp16-K + Turbo3-V) (#477) by @inureyes in lablup/mlxcel-internal#503
- feat(core): packed-aware PagedKvLayout for Turbo4 KV (#482) by @inureyes in lablup/mlxcel-internal#502
- feat(server): --cache-type-k/--cache-type-v server flag parity with llama-server (#484) by @inureyes in lablup/mlxcel-internal#501
- docs: TurboQuant KV cache user guide and validated config matrix (B12 / #485) by @inureyes in lablup/mlxcel-internal#504
- feat(core): fused Sparse-V Metal kernel — fused per-thread SDPA skipping (#505) by @inureyes in lablup/mlxcel-internal#511
- fix(test): swap B3 Qwen2.5-1.5B fixture to base variant (#506) by @inureyes in lablup/mlxcel-internal#512
- fix(utils): Gemma3-4B attention SIGABRT — sliding-window mask T_k mismatch (#507) by @inureyes in lablup/mlxcel-internal#513
- feat(server): paged scheduler dispatch on PagedKvLayout::cache_mode (#508) by @inureyes in lablup/mlxcel-internal#514
- test(bench): TurboQuant KV speed gate matrix runner + first M1 Ultra reading (#509) by @inureyes in lablup/mlxcel-internal#515
- test(bench): TurboQuant Turbo4Asym VLM structural smoke (#510) by @inureyes in lablup/mlxcel-internal#516
- fix: preserve Qwen2 fused QKV bias by @inureyes in lablup/mlxcel-internal#517
- test(bench): TurboQuant KV speed gate matrix on M5 Max (#509) by @inureyes in lablup/mlxcel-internal#519
- perf(core): precompute Sparse-V kernel rescale to drop per-token threadgroup barriers (#520) by @inureyes in lablup/mlxcel-internal#523
- perf(core): cache cold-V dequant across decode steps in Turbo4Delegated by @inureyes in lablup/mlxcel-internal#525
- perf(core): unify K storage in Turbo4Delegated to drop per-step K concat (#527) by @inureyes in lablup/mlxcel-internal#529
- perf(turbo): retire PR-#525 cold-V FP16 memo and add cold-V dequant Metal kernel for Turbo4Delegated by @inureyes in lablup/mlxcel-internal#530
- test(turbo): VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for #510 by @inureyes in lablup/mlxcel-internal#518
- perf(turbo): steel-attention-envelope fused SDPA kernel for Turbo4Delegated (#531) by @inureyes in lablup/mlxcel-internal#532
- bench(turbo): M5 Max gate-miss readings for Turbo4Delegated steel envelope (#531) by @inureyes in lablup/mlxcel-internal#533
- perf(turbo): parallelize Pass 1 softmax in turbo4_delegated_steel_sdpa (#534) by @inureyes in lablup/mlxcel-internal#535
- update: add delegated FP16 predecode compaction by @inureyes in lablup/mlxcel-internal#536
- update: add lazy delegated FP16 sidecars by @inureyes in lablup/mlxcel-internal#537
- Move Turbo4Delegated compressed fold before decode by @inureyes in lablup/mlxcel-internal#538
- fix(vision): relax cached-position shape check in Qwen VL chunked prefill by @inureyes in lablup/mlxcel-internal#557
- fix(vision): per-sequence MRoPE alignment for mixed VL+text batches by @inureyes in lablup/mlxcel-internal#558
- fix(qwen3_5_moe): validate seq_length on cached position_ids reuse by @inureyes in lablup/mlxcel-internal#559
- fix(vision): support mixed-length batching for Gemma 4 by @inureyes in lablup/mlxcel-internal#560
- fix(vision): per-sequence per_layer_inputs for Gemma 4 E2B/E4B VLM by @inureyes in lablup/mlxcel-internal#561
- perf(turbo): add compressed dequant-SDPA paths by @inureyes in lablup/mlxcel-internal#562
- update: bump MLX pin to v0.32.0 (c9aa5605) by @inureyes in lablup/mlxcel-internal#565
- fix(turbo): correct batch cache offset merging in TurboQuant continuous batching by @inureyes in lablup/mlxcel-internal#564
- refactor: unify TurboQuant KV-cache CLI flags across mlxcel binaries by @inureyes in lablup/mlxcel-internal#567
- feat(server): KV cache quantization for continuous batching (#545) by @inureyes in lablup/mlxcel-internal#568
- fix(sampling): correct top_p filter for batched logits by @inureyes in lablup/mlxcel-internal#569
- fix(server): correct streamed detokenization for byte-fallback tokens by @inureyes in lablup/mlxcel-internal#570
- fix(server): handle token queue timeout during long prefills by @inureyes in lablup/mlxcel-internal#571
- feat(server): include context_size and tool_call_parser in /health endpoint by @inureyes in lablup/mlxcel-internal#572
- fix: Turbo3 split-flag, doc, and ENV_LOCK race (#573) by @inureyes in lablup/mlxcel-internal#574
- feat(server): json_schema response_format support (#550) by @inureyes in lablup/mlxcel-internal#575
- fix(server): strip tool-call markup from streamed delta.content by @inureyes in lablup/mlxcel-internal#576
- update: refresh M1 Ultra benchmark to 2026-05-08 by @inureyes in lablup/mlxcel-internal#577
- update: full M1 Ultra column resync in benchmarks-by-hardware.md by @inureyes in lablup/mlxcel-internal#578
- feat(server): Automatic Prefix Caching (APC) with hash blocks (#552) by @inureyes in lablup/mlxcel-internal#579
- feat(vision): Gemma 4 video support and VLM video input infrastructure (#553) by @inureyes in lablup/mlxcel-internal#581
- feat(models): add Nemotron H Nano Omni vision (#554) by @inureyes in lablup/mlxcel-internal#583
- feat(models): add Youtu-VL (#555) by @inureyes in lablup/mlxcel-internal#584
- perf(server): thread-local generation stream and uniform-batch RoPE collapse (#556) by @inureyes in lablup/mlxcel-internal#585
- fix: harden post-merge review findings for issues #550-#556 by @inureyes in lablup/mlxcel-internal#586
- fix(nemotron-h-nano-omni): close PR #583 deferred validation gaps by @inureyes in lablup/mlxcel-internal#595
- perf(multimodal): single-pass ffmpeg frame extraction and Drop guard (#597) by @inureyes in lablup/mlxcel-internal#599
- feat(server): wire video_url content blocks through chat completion handler (#596) by @inureyes in lablup/mlxcel-internal#600
- test(multimodal): content-preservation tests for video frame extraction (#598) by @inureyes in lablup/mlxcel-internal#602
- update: bump MLX upstream pin to 84961223 (PRs #3443 #3463 #3475) by @inureyes in lablup/mlxcel-internal#605
- docs: bump mlx-lm version reference to 0.31.3 by @inureyes in lablup/mlxcel-internal#606
Full Changelog: lablup/mlxcel-internal@v0.0.25...v0.0.26