v0.0.26

@inureyes released this 18 May 10:01
· 77 commits to main since this release

What's Changed

mlxcel v0.0.26 ships the TurboQuant KV cache family (3-4 bit/value compression via Walsh-Hadamard rotation + PolarQuant), Automatic Prefix Caching with hash blocks, OpenAI-compatible json_schema constrained decoding, an in-tree HuggingFace download subcommand, VLM video input infrastructure (Gemma 4 + new Youtu-VL and Nemotron H Nano Omni vision models), and a substantial decode hot-path performance pass on top of the v0.0.25 prompt-prefix KV cache foundation.

New Features

  • TurboQuant KV cache — full mode family wired through KVCacheMode:
    • Turbo4 symmetric with per-model allowlist (#476)
    • Turbo4Asym Fp16-K + Turbo4-V (#474)
    • Turbo3Asym 3-bit asymmetric Fp16-K + Turbo3-V (#477)
    • Turbo4Delegated FP16 hot tail + packed turbo cold body (#479)
    • Walsh-Hadamard transform op (#470) and PolarQuant Lloyd-Max codebook generator (#472)
    • RotatingKVCache (sliding-window) integration (B9), Boundary-V layer protection (#478), packed-aware PagedKvLayout (#482), sparse-V dequant scaffolding (#480)
    • llama-server flag parity: --cache-type-k / --cache-type-v accept mlxcel_turbo* variants (#484)
    • KV cache quantization extended to continuous batching (#545)
    • Unified TurboQuant CLI flags across all binaries (#567)
    • User guide and validated config matrix (#485)
  • Automatic Prefix Caching (APC) with hash blocks (#552) — hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache
  • OpenAI-compatible response_format: json_schema constrained decoding via llguidance (#550) — same backend as upstream mlx-vlm PR #1047, with 64 KiB / 32-depth / 64-$ref schema limits, SHA-256 fingerprinted tokenizer-environment cache, and reusable per-sequence mask_buf / bias_buf allocations
  • mlxcel download / mlxcel-server download subcommand (#457, #486) — fetch HuggingFace model snapshots without Python tooling, with allow-list file filter, atomic writes, and path-traversal defense
  • Paged scheduler dispatch on PagedKvLayout::cache_mode (#508)
  • /health endpoint exposes context_size and tool_call_parser (#549, #572)
  • VLM video input infrastructure:
    • Gemma 4 video support and VLM video input pipeline (#553)
    • video_url content blocks wired through /v1/chat/completions (#596)
    • Single-pass ffmpeg frame extraction with Drop guard (#597)
    • Content-preservation tests for video frame extraction (#598)
  • New models: Youtu-VL (#555); Nemotron H Nano Omni vision (#554, #595)
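
For readers unfamiliar with the rotation step named in the TurboQuant entries above, the following is a minimal CPU sketch of a fast Walsh-Hadamard transform. It is not taken from mlxcel: the function names `fwht_in_place` and `fwht_orthonormal` are hypothetical, and the op shipped in #470 may differ in layout, normalization, and backend. The sketch only illustrates the butterfly structure that lets a Hadamard rotation run in O(n log n), which is the usual reason this kind of rotation is applied before low-bit quantization.

```rust
/// Unnormalized in-place fast Walsh-Hadamard transform (illustrative only).
/// Panics unless `data.len()` is a power of two.
fn fwht_in_place(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // Each pass combines pairs of elements that are `h` apart.
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (data[i], data[i + h]);
                data[i] = a + b;
                data[i + h] = a - b;
            }
        }
        h *= 2;
    }
}

/// Orthonormal variant: scaling by 1/sqrt(n) makes the transform its own
/// inverse, so applying it twice restores the input.
fn fwht_orthonormal(data: &mut [f32]) {
    fwht_in_place(data);
    let scale = 1.0 / (data.len() as f32).sqrt();
    for x in data.iter_mut() {
        *x *= scale;
    }
}
```

Because the orthonormal form is self-inverse, the same routine can serve for both rotating values before quantization and un-rotating them after dequantization in a sketch like this one.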

Improvements

  • Sparse-V Metal kernel: fused per-thread SDPA skipping (#505); precomputed kernel rescale dropping per-token threadgroup barriers (#520)
  • Turbo4Delegated decode hot path:
    • Unified K storage to drop per-step K concat (#527)
    • Cold-V dequant cache across decode steps (#525) → cold-V dequant Metal kernel retiring the FP16 memo (#530)
    • Steel-attention-envelope fused SDPA kernel (#531)
    • Parallelized Pass 1 softmax in turbo4_delegated_steel_sdpa (#534)
    • Delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537)
    • Compressed fold moved before decode
  • Compressed dequant-SDPA paths for TurboQuant decode (#562)
  • Server hot-path: thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556)
  • Quality and speed gates:
    • Wikitext-2 PPL + NIAH harness (#475) with the full 283K-token test split (#492) and per-model results committed (#493)
    • VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for #510
    • KV speed gate matrix runner with M1 Ultra (#509) and M5 Max readings
  • Benchmarks: M1 Ultra refresh to 2026-05-08 (#577); full M1 Ultra column resync in benchmarks-by-hardware.md (#578)

Bug Fixes

  • TurboQuant continuous batching: corrected cache-offset merging when batches with different cache offsets are joined or split (#564); fixed the Turbo3 split-flag, documentation alignment, and an ENV_LOCK race in concurrent process startup (#573)
  • Vision / VLM mixed batching:
    • Per-sequence MRoPE alignment for mixed VL+text batches (#558)
    • Per-sequence per_layer_inputs for Gemma 4 E2B/E4B VLM (#561)
    • Mixed-length batching support for Gemma 4 (#560)
    • Relaxed cached-position shape check in Qwen VL chunked prefill (#557)
    • Qwen3.5-MoE batch-size validation on cached position_ids reuse (#559)
  • Streaming and sampling:
    • Streamed detokenization for byte-fallback tokens (#570)
    • Top-p filter correctness for batched logits (#569)
    • Token queue timeout handling during long prefills (#571)
    • StreamFilter extended to cover Hermes-style <tool_call> and Mistral Nemo [TOOL_CALLS] markers (#551, #576) — partial-marker buffering at token boundaries; Gemma 4 <|tool_call> suppression unaffected
  • Models:
    • Gemma3-4B attention SIGABRT from sliding-window mask T_k mismatch on long-context prompts (#507)
    • Preserve Qwen2 fused QKV bias when present in checkpoint (#517)
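
The partial-marker buffering mentioned in the StreamFilter entry above can be pictured with a small hold-back buffer. This is a sketch under assumptions, not mlxcel's StreamFilter: the `MarkerHoldback` type and `longest_suffix_prefix` helper are hypothetical names, it tracks a single marker, and it simply discards whatever follows a completed marker, where a real filter would presumably route that text to a tool-call parser.

```rust
/// Hypothetical single-marker hold-back buffer (illustrative only).
struct MarkerHoldback {
    marker: String,
    pending: String,
}

impl MarkerHoldback {
    fn new(marker: &str) -> Self {
        Self { marker: marker.to_string(), pending: String::new() }
    }

    /// Feeds one decoded chunk. Returns the text that is safe to emit and
    /// a flag that is true once the complete marker has been seen.
    fn push(&mut self, chunk: &str) -> (String, bool) {
        self.pending.push_str(chunk);
        if let Some(pos) = self.pending.find(self.marker.as_str()) {
            // Emit only the text before the marker; this sketch drops the rest.
            let out = self.pending[..pos].to_string();
            self.pending.clear();
            return (out, true);
        }
        // Hold back the longest tail that could still grow into the marker.
        let keep = longest_suffix_prefix(&self.pending, &self.marker);
        let split = self.pending.len() - keep;
        let rest = self.pending.split_off(split);
        let out = std::mem::replace(&mut self.pending, rest);
        (out, false)
    }
}

/// Length in bytes of the longest proper prefix of `marker` that `text`
/// ends with, or 0 if there is none.
fn longest_suffix_prefix(text: &str, marker: &str) -> usize {
    let max = text.len().min(marker.len().saturating_sub(1));
    (1..=max)
        .rev()
        .find(|&k| marker.is_char_boundary(k) && text.ends_with(&marker[..k]))
        .unwrap_or(0)
}
```

With the marker `<tool_call>`, pushing `"Hello <to"` emits `"Hello "` and holds back `"<to"`; if the next chunk does not continue the marker, the held text is released unchanged.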

CI/CD Improvements

None.

Technical Details

  • Refactor: unified TurboQuant KV-cache CLI flags across mlxcel, mlxcel-server, and mlxcel download so all binaries accept the same --kv-cache-mode / --cache-type-{k,v} syntax (#567)
  • Test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506)
  • Post-merge review hardening for the Nemotron-H Nano Omni vision PR (#595)
  • mlx-lm version reference in docs bumped 0.31.2 → 0.31.3 (#606); bridge-overhead-microbench's v0.31.2 reference is preserved because it pins the MLX C++ runtime, not mlx-lm Python

Dependencies

  • MLX upstream pin bumped twice:
    • First to v0.32.0 / c9aa5605 (#565)
    • Then forward to 84961223 covering 3 PRs:
      • #3443 splits the CUDA qmm_naive / qmm_sm80 kernel bodies into new qmm_naive.cuh / qmm_sm80.cuh headers without changing the public ABI consumed by mlxcel's patches/mlx/backend/cuda/quantized/qmm/qmm.h
      • #3463 routes the CPU JIT preamble through JitCompiler::get_preamble() and renames the prebuilt symbol from get_kernel_preamble to get_prebuilt_preamble (mlxcel does not call either directly)
      • #3475 fixes contiguity-flag accuracy in AsStrided by computing data_size from the actually-occupied stride range
    • Three-location pin update applied to src/lib/mlx-cpp/CMakeLists.txt, src/lib/mlxcel-core/build.rs, and .github/workflows/release.yml per CLAUDE.md
    • Fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ re-validated against both bumps; symbols unchanged

Breaking Changes

None.

Known Issues

  • A decode regression on small dense Qwen3/3.5 models on M1 Ultra is under investigation: qwen3-0.6b (-25.2%), qwen3-1.7b (-10.8%), qwen3.5-0.8b (-11.1%), molmo2-4b (-10.7%) between the 4-04 and 5-08 baselines. The pattern suggests fixed-cost overhead in the decode hot path (likely the K-storage unification in #527 or the sparse-V changes from around that period). The bisect is not yet complete.
  • The "Performance Comparison vs mlx-lm" percentages in model_tests_m1ultra.md are still anchored to the 4-04 baseline; one mlx-lm 0.31.2 measurement cycle on M1 Ultra is needed to restore coherence.
  • An M5 Max re-bench cycle is pending. It would close the Notable Regressions table entries that are currently artifacts of the measurement-date gap between M1 Ultra (5-08) and M5 Max (4-13).

Full Changelog: lablup/mlxcel-internal@v0.0.25...v0.0.26