release: v0.0.26

inureyes · inureyes · commit 9779bcf4beb5 · 2026-05-10T21:51:36.000+09:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,16 +6,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ## [Unreleased]
 
-### Added
-- `mlxcel download` / `mlxcel-server download` subcommand to fetch HuggingFace model repository snapshots without Python tooling. Uses `hf-hub` with an allow-list file filter (SafeTensors, tokenizer, and config files only), cache-hit detection, and formatted per-file progress output. Supports `--local-dir`, `--revision`, `--token`, and `--force`. Default destination mirrors the `models/<repo-basename>` convention from AGENTS.md (#457).
-- `/health` endpoint now includes `context_size` (the configured `--ctx-size` value; `0` means model default) and `tool_call_parser` (`"mlxcel"` when the chat template exposes the `tools` variable, `null` otherwise). Both fields are present once a model is loaded; `context_size` is absent while loading, `tool_call_parser` serializes as `null` during startup so monitoring clients can distinguish "template has no tool support" from "model not yet loaded". The tool-support heuristic is extracted into a shared `template_mentions_tools()` helper used by both the health route and the existing `compute_supports_tools` fallback path (#549).
-- OpenAI-compatible `response_format: {"type": "json_schema", ...}` structured-output support for `/v1/chat/completions` and `/v1/completions`. Constrained decoding via `llguidance` (the same backend used by upstream mlx-vlm PR #1047) ensures every emitted token keeps the partial output conforming to the supplied schema. Per-request schema validation enforces a 64 KiB size cap, a 32-level nesting depth limit, and a 64-entry `$ref` count limit so an adversarial schema cannot exhaust CPU or memory during grammar compilation. The tokenizer environment is cached by SHA-256 fingerprint so consecutive requests to the same model share the build cost (~1–2 s for a 150k-vocab tokenizer). Reusable per-sequence `mask_buf` and `bias_buf` allocations eliminate per-token `Vec` allocation on the hot decode path. The legacy `json_object` mode is rejected with a clean 400 in this MVP; `json_schema` with a well-formed schema is the supported path. Supported on HuggingFace BPE tokenizers; SentencePiece and Tiktoken backends return a clean `UnsupportedTokenizer` error. Verbose llguidance internals are never surfaced in public error messages — they are routed to server-side tracing only (#550).
+## [v0.0.26] - 2026-05-10
 
-### Fixed
-- `StreamFilter` extended to cover Hermes-style `<tool_call>` / `</tool_call>` and Mistral Nemo `[TOOL_CALLS]` markers, which previously leaked raw markup into `delta.content` during streaming. Partial-marker buffering at token boundaries correctly holds back prefixes (e.g. `<tool_`) until the full tag can be confirmed, then releases them to `delta.content` if they turn out not to be a boundary. Gemma 4 `<|tool_call>` suppression is unaffected; the delimiter table ordering ensures the Gemma 4 pipe-delimited form wins the tiebreak over the Hermes plain form (#551).
+### Added
+- **TurboQuant KV cache.** New 3–4 bit KV cache compression family built on a Walsh–Hadamard transform op (#470) and a Lloyd-Max PolarQuant codebook generator ported to Rust (#472). Four KV cache modes wired through `KVCacheMode`: `Turbo4` symmetric with per-model allowlist (#476), `Turbo4Asym` Fp16-K + Turbo4-V (#474), `Turbo3Asym` 3-bit Fp16-K + Turbo3-V (#477), and `Turbo4Delegated` with a FP16 hot tail + packed turbo cold body (#479). TurboQuant + `RotatingKVCache` integration covers sliding-window attention (B9). Sparse-V dequant scaffolding (#480), Boundary-V layer protection that keeps the first/last layer at FP16 (#478), and a packed-aware `PagedKvLayout` (#482) round out the runtime. Quality gates: wikitext-2 PPL + NIAH harness (#475), full 283K-token test split fixture (#492), per-model PPL/NIAH results committed (#493), and VLM B3 quality gates with image-token kurtosis (#510). Speed gate matrix runner with M1 Ultra (#509) and M5 Max readings. User guide and validated config matrix published (#485).
+- **Server flag parity for KV quantization.** `--cache-type-k` and `--cache-type-v` flags accept `f16`, `q8_0`, `q4_0`, etc. matching llama-server semantics, with TurboQuant modes exposed as `mlxcel_turbo*` variants (#484). KV cache quantization extended to continuous batching (#545) and a unified `--kv-cache-mode` flag layout shared across `mlxcel`, `mlxcel-server`, and `mlxcel download` (#567).
+- **Automatic Prefix Caching (APC) with hash blocks** (#552). Hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache, enabling shared physical blocks across requests with the same hashed prefix without per-request token-prefix matching cost.
+- **OpenAI-compatible `response_format: {"type": "json_schema", ...}`** structured-output support for `/v1/chat/completions` and `/v1/completions` (#550). Constrained decoding via `llguidance` (the same backend used by upstream mlx-vlm PR #1047) ensures every emitted token keeps the partial output conforming to the supplied schema. Per-request schema validation enforces a 64 KiB size cap, a 32-level nesting depth limit, and a 64-entry `$ref` count limit so an adversarial schema cannot exhaust CPU or memory during grammar compilation. The tokenizer environment is cached by SHA-256 fingerprint so consecutive requests to the same model share the build cost (~1–2 s for a 150k-vocab tokenizer). Reusable per-sequence `mask_buf` and `bias_buf` allocations eliminate per-token `Vec` allocation on the hot decode path. The legacy `json_object` mode is rejected with a clean 400 in this MVP; `json_schema` with a well-formed schema is the supported path. Supported on HuggingFace BPE tokenizers; SentencePiece and Tiktoken backends return a clean `UnsupportedTokenizer` error. Verbose llguidance internals are never surfaced in public error messages — they are routed to server-side tracing only.
+- **`mlxcel download` / `mlxcel-server download` subcommand** to fetch HuggingFace model repository snapshots without Python tooling (#457, #486). Uses `hf-hub` with an allow-list file filter (SafeTensors, tokenizer, and config files only), cache-hit detection, and formatted per-file progress output. Supports `--local-dir`, `--revision`, `--token`, and `--force`. Default destination mirrors the `models/<repo-basename>` convention from AGENTS.md.
+- **`/health` endpoint** now includes `context_size` (the configured `--ctx-size` value; `0` means model default) and `tool_call_parser` (`"mlxcel"` when the chat template exposes the `tools` variable, `null` otherwise) (#549, #572). Both fields are present once a model is loaded; `context_size` is absent while loading, `tool_call_parser` serializes as `null` during startup so monitoring clients can distinguish "template has no tool support" from "model not yet loaded". The tool-support heuristic is extracted into a shared `template_mentions_tools()` helper used by both the health route and the existing `compute_supports_tools` fallback path.
+- **Paged scheduler dispatch on `PagedKvLayout::cache_mode`** so the scheduler routes batches into the matching paged decode kernel for each KV cache mode (#508).
+- **Video input infrastructure for VLMs.** Gemma 4 video support with the new VLM video input pipeline (#553); ffmpeg-backed frame extraction with single-pass extraction and a `Drop` guard for cleanup of temporary frame files (#597); `video_url` content blocks wired through `/v1/chat/completions` (#596); content-preservation tests covering the frame extraction path (#598).
+- **New models.** Youtu-VL vision-language model (#555); Nemotron H Nano Omni vision (#554) plus follow-up correctness/validation hardening (#595).
+- Multi-task M1 Ultra benchmark refresh to 2026-05-08 (#577) and full M1 Ultra column resync in `benchmarks-by-hardware.md` (#578).
 
 ### Changed
-- `MLX` upstream pin bumped from `c9aa5605` to `84961223` (3 commits, PRs #3443 / #3463 / #3475). PR #3443 splits the CUDA `qmm_naive` / `qmm_sm80` kernel bodies into new `qmm_naive.cuh` / `qmm_sm80.cuh` headers without changing the public ABI consumed by mlxcel's `patches/mlx/backend/cuda/quantized/qmm/qmm.h`; PR #3463 routes the CPU JIT preamble through `JitCompiler::get_preamble()` and renames the prebuilt symbol from `get_kernel_preamble` to `get_prebuilt_preamble` (mlxcel does not call either directly); PR #3475 fixes contiguity-flag accuracy in `AsStrided` by computing `data_size` from the actually-occupied stride range. Three-location pin update applied to `src/lib/mlx-cpp/CMakeLists.txt`, `src/lib/mlxcel-core/build.rs`, and `.github/workflows/release.yml` per `CLAUDE.md`. Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` revalidated against the new pin: `mlx::core::fast::metal_kernel`, `mlx::core::full`, `mlx::core::Shape`, `mlx::core::float32`, `mlx::core::int32`, and `metal::fast::exp` symbols are unchanged across the bump.
+- **MLX upstream pin bumped twice.** First from the v0.0.25 baseline (`5d7e96cd`) to v0.32.0 / `c9aa5605` (#565), then forward to `84961223` covering 3 PRs: #3443 splits the CUDA `qmm_naive` / `qmm_sm80` kernel bodies into new `qmm_naive.cuh` / `qmm_sm80.cuh` headers without changing the public ABI consumed by mlxcel's `patches/mlx/backend/cuda/quantized/qmm/qmm.h`; #3463 routes the CPU JIT preamble through `JitCompiler::get_preamble()` and renames the prebuilt symbol from `get_kernel_preamble` to `get_prebuilt_preamble` (mlxcel does not call either directly); #3475 fixes contiguity-flag accuracy in `AsStrided` by computing `data_size` from the actually-occupied stride range. Three-location pin update applied to `src/lib/mlx-cpp/CMakeLists.txt`, `src/lib/mlxcel-core/build.rs`, and `.github/workflows/release.yml` per `CLAUDE.md`. Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` re-validated against both bumps: `mlx::core::fast::metal_kernel`, `mlx::core::full`, `mlx::core::Shape`, `mlx::core::float32`, `mlx::core::int32`, and `metal::fast::exp` symbols unchanged.
+- **Refactor:** unified TurboQuant KV-cache CLI flags across `mlxcel`, `mlxcel-server`, and `mlxcel download` so all binaries accept the same `--kv-cache-mode` / `--cache-type-{k,v}` syntax (#567).
+- mlx-lm version reference in docs bumped from 0.31.2 to 0.31.3 (#606). The `bridge-overhead-microbench` reference at v0.31.2 is preserved because it pins the MLX C++ runtime, not the mlx-lm Python package.
+
+### Performance
+- **Sparse-V kernel:** fused per-thread Metal kernel that skips the full SDPA pass when sparse-V dequant predicts zero contribution (#505); precomputed kernel rescale to drop per-token threadgroup barriers (#520).
+- **Turbo4Delegated decode hot path:** unified K storage to drop the per-step K concat (#527); cold-V dequant cache across decode steps (#525) followed by a cold-V dequant Metal kernel that retires the FP16 memo (#530); steel-attention-envelope fused SDPA kernel (#531) with parallelized Pass 1 softmax (#534); delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537); compressed fold moved before decode.
+- **Compressed dequant-SDPA paths** for TurboQuant decode (#562).
+- **Server hot-path:** thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556).
+
+### Fixed
+- **TurboQuant continuous batching:** correct batch cache offset merging when batches with different cache offsets are joined or split (#564); Turbo3 split-flag, documentation alignment, and an `ENV_LOCK` race in concurrent process startup (#573).
+- **Vision / VLM mixed batching:** per-sequence MRoPE alignment for mixed VL+text batches (#558); per-sequence `per_layer_inputs` for Gemma 4 E2B/E4B VLM (#561); mixed-length batching support for Gemma 4 (#560); relaxed cached-position shape check in Qwen VL chunked prefill (#557); Qwen3.5-MoE batch-size validation on cached `position_ids` reuse (#559).
+- **Streaming and sampling:** correct streamed detokenization for byte-fallback tokens that previously leaked raw byte fragments to the client (#570); top-p filter correctness for batched logits (#569); token queue timeout handling during long prefills so clients no longer see spurious 408s on slow first-token paths (#571); `StreamFilter` extended to cover Hermes-style `<tool_call>` / `</tool_call>` and Mistral Nemo `[TOOL_CALLS]` markers, which previously leaked raw markup into `delta.content` during streaming (#551, #576). Partial-marker buffering at token boundaries correctly holds back prefixes (e.g. `<tool_`) until the full tag can be confirmed, then releases them to `delta.content` if they turn out not to be a boundary. Gemma 4 `<|tool_call>` suppression is unaffected; the delimiter table ordering ensures the Gemma 4 pipe-delimited form wins the tiebreak over the Hermes plain form.
+- **Models:** Gemma3-4B attention SIGABRT from a sliding-window mask `T_k` mismatch on long-context prompts (#507); preserve Qwen2 fused QKV bias when it is present in the checkpoint (#517); test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506); harden post-merge review findings on the Nemotron-H Nano Omni vision PR.
 
 ### Security
 - Path-traversal defense in the downloader: `is_safe_relative_path` pre-filters each sibling filename returned by the HuggingFace API (rejects absolute paths, `..` components, backslash separators, and empty components). A secondary canonicalized `starts_with` guard on the resolved destination path is applied before writing each file. Download target files are written to a temporary path and atomically renamed into place, preventing partial writes from leaving corrupt files in the output directory (fixes C1 and H1 from security review of #457).
@@ -563,6 +583,7 @@ Initial public release of mlxcel.
 - GitHub Actions release workflow for macOS ARM64
 - Profile mode for prefill/decode timing analysis
 
+[v0.0.26]: https://github.com/lablup/mlxcel/compare/v0.0.25...v0.0.26
 [v0.0.25]: https://github.com/lablup/mlxcel/compare/v0.0.24...v0.0.25
 [v0.0.24]: https://github.com/lablup/mlxcel/compare/v0.0.23...v0.0.24
 [v0.0.23]: https://github.com/lablup/mlxcel/compare/v0.0.22...v0.0.23
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "mlxcel"
-version = "0.0.25"
+version = "0.0.26"
 edition = "2024"
 description = "High-performance LLM/VLM/VLA inference on Apple Silicon and CUDA GPUs"
 repository = "https://github.com/lablup/mlxcel"
diff --git a/debian/changelog b/debian/changelog
@@ -1,3 +1,67 @@
+mlxcel (0.0.26-1~jammy1) jammy; urgency=medium
+
+  * Feat: TurboQuant KV cache — Walsh-Hadamard transform op + PolarQuant
+    Lloyd-Max codebook generator porting (#470, #472), KVCacheMode::Turbo4
+    symmetric (#476), Turbo4Asym Fp16-K + Turbo4-V (#474), Turbo3Asym
+    Fp16-K + Turbo3-V (#477), Turbo4Delegated FP16 hot tail + packed
+    turbo cold body (#479), TurboQuant + RotatingKVCache sliding window
+    B9, sparse-V dequant scaffolding (#480), Boundary-V layer protection
+    (#478), packed-aware PagedKvLayout (#482)
+  * Feat: --cache-type-k / --cache-type-v server flag parity with
+    llama-server for TurboQuant modes (#484)
+  * Feat: KV cache quantization for continuous batching (#545); unified
+    TurboQuant KV-cache CLI flags across mlxcel binaries (#567);
+    compressed dequant-SDPA paths (#562)
+  * Perf: Fused Sparse-V Metal kernel skipping per-thread SDPA (#505);
+    precomputed Sparse-V kernel rescale dropping per-token threadgroup
+    barriers (#520); cold-V dequant cache + Metal kernel for
+    Turbo4Delegated (#525, #530); unified K storage dropping per-step K
+    concat (#527); steel-attention-envelope fused SDPA kernel (#531);
+    parallelized Pass 1 softmax in turbo4_delegated_steel_sdpa (#534);
+    delegated FP16 predecode compaction + lazy sidecars (#536, #537)
+  * Test: TurboQuant wikitext-2 PPL + NIAH quality gate (#475, #492,
+    #493); VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for
+    #510; KV speed gate matrix on M1 Ultra and M5 Max (#509)
+  * Feat: Automatic Prefix Caching (APC) with hash blocks (#552)
+  * Feat: OpenAI-compatible response_format json_schema constrained
+    decoding via llguidance with schema size/depth/$ref limits (#550)
+  * Feat: mlxcel download / mlxcel-server download HuggingFace
+    subcommand with allow-list filter, atomic writes, and path-traversal
+    defense (#457, #486)
+  * Feat: /health endpoint exposes context_size and tool_call_parser
+    (#549)
+  * Feat: Paged scheduler dispatch on PagedKvLayout::cache_mode (#508);
+    thread-local generation stream and uniform-batch RoPE collapse (#556)
+  * Feat: Gemma 4 video support and VLM video input infrastructure
+    (#553); Youtu-VL model (#555); Nemotron H Nano Omni vision (#554);
+    video_url content blocks wired through chat completion handler
+    (#596); single-pass ffmpeg frame extraction with Drop guard (#597)
+  * Fix: Per-sequence MRoPE alignment for mixed VL+text batches (#558);
+    per-sequence per_layer_inputs for Gemma 4 E2B/E4B VLM (#561);
+    mixed-length batching for Gemma 4 (#560); relaxed cached-position
+    shape check in Qwen VL chunked prefill (#557); Qwen3.5-MoE batch
+    size validation on cached position_ids reuse (#559)
+  * Fix: Streamed detokenization for byte-fallback tokens (#570); top_p
+    filter for batched logits (#569); token queue timeout during long
+    prefills (#571); tool-call markup stripped from streamed
+    delta.content covering Hermes <tool_call> and Mistral Nemo
+    [TOOL_CALLS] (#551, #576)
+  * Fix: TurboQuant batch cache offset merging (#564); Turbo3 split-flag
+    + ENV_LOCK race (#573)
+  * Fix: Gemma3-4B attention SIGABRT — sliding-window mask T_k mismatch
+    (#507); Qwen2 fused QKV bias preservation (#517);
+    nemotron-h-nano-omni post-merge validation gaps (#583)
+  * Update: MLX upstream pin bumped twice — first to v0.32.0
+    (c9aa5605, #565), then to 84961223 covering CUDA QMM kernel split,
+    JIT preamble rename, and AsStrided contiguity-flag fix (PRs #3443
+    #3463 #3475)
+  * Docs: TurboQuant KV cache user guide and validated config matrix
+    (#485); mlx-lm version reference 0.31.2 → 0.31.3 (#606); M1 Ultra
+    benchmark refresh to 2026-05-08 (#577) and benchmarks-by-hardware.md
+    full M1 Ultra column resync (#578)
+
+ -- Jeongkyu Shin <inureyes@gmail.com>  Sun, 10 May 2026 18:00:00 +0900
+
 mlxcel (0.0.25-1~jammy1) jammy; urgency=medium
 
   * Feat: Cross-sequence prompt-prefix KV cache — KVCache trim/detach/adopt