Commit 9779bcf

release: v0.0.26
1 parent 4497104 commit 9779bcf

4 files changed

Lines changed: 94 additions & 9 deletions

CHANGELOG.md

Lines changed: 28 additions & 7 deletions
@@ -6,16 +6,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ## [Unreleased]
 
-### Added
-- `mlxcel download` / `mlxcel-server download` subcommand to fetch HuggingFace model repository snapshots without Python tooling. Uses `hf-hub` with an allow-list file filter (SafeTensors, tokenizer, and config files only), cache-hit detection, and formatted per-file progress output. Supports `--local-dir`, `--revision`, `--token`, and `--force`. Default destination mirrors the `models/<repo-basename>` convention from AGENTS.md (#457).
-- `/health` endpoint now includes `context_size` (the configured `--ctx-size` value; `0` means model default) and `tool_call_parser` (`"mlxcel"` when the chat template exposes the `tools` variable, `null` otherwise). Both fields are present once a model is loaded; `context_size` is absent while loading, `tool_call_parser` serializes as `null` during startup so monitoring clients can distinguish "template has no tool support" from "model not yet loaded". The tool-support heuristic is extracted into a shared `template_mentions_tools()` helper used by both the health route and the existing `compute_supports_tools` fallback path (#549).
-- OpenAI-compatible `response_format: {"type": "json_schema", ...}` structured-output support for `/v1/chat/completions` and `/v1/completions`. Constrained decoding via `llguidance` (the same backend used by upstream mlx-vlm PR #1047) ensures every emitted token keeps the partial output conforming to the supplied schema. Per-request schema validation enforces a 64 KiB size cap, a 32-level nesting depth limit, and a 64-entry `$ref` count limit so an adversarial schema cannot exhaust CPU or memory during grammar compilation. The tokenizer environment is cached by SHA-256 fingerprint so consecutive requests to the same model share the build cost (~1–2 s for a 150k-vocab tokenizer). Reusable per-sequence `mask_buf` and `bias_buf` allocations eliminate per-token `Vec` allocation on the hot decode path. The legacy `json_object` mode is rejected with a clean 400 in this MVP; `json_schema` with a well-formed schema is the supported path. Supported on HuggingFace BPE tokenizers; SentencePiece and Tiktoken backends return a clean `UnsupportedTokenizer` error. Verbose llguidance internals are never surfaced in public error messages — they are routed to server-side tracing only (#550).
+## [v0.0.26] - 2026-05-10
 
-### Fixed
-- `StreamFilter` extended to cover Hermes-style `<tool_call>` / `</tool_call>` and Mistral Nemo `[TOOL_CALLS]` markers, which previously leaked raw markup into `delta.content` during streaming. Partial-marker buffering at token boundaries correctly holds back prefixes (e.g. `<tool_`) until the full tag can be confirmed, then releases them to `delta.content` if they turn out not to be a boundary. Gemma 4 `<|tool_call>` suppression is unaffected; the delimiter table ordering ensures the Gemma 4 pipe-delimited form wins the tiebreak over the Hermes plain form (#551).
+### Added
+- **TurboQuant KV cache.** New 3–4 bit KV cache compression family built on a Walsh–Hadamard transform op (#470) and a Lloyd-Max PolarQuant codebook generator ported to Rust (#472). Four KV cache modes wired through `KVCacheMode`: `Turbo4` symmetric with per-model allowlist (#476), `Turbo4Asym` Fp16-K + Turbo4-V (#474), `Turbo3Asym` 3-bit Fp16-K + Turbo3-V (#477), and `Turbo4Delegated` with a FP16 hot tail + packed turbo cold body (#479). TurboQuant + `RotatingKVCache` integration covers sliding-window attention (B9). Sparse-V dequant scaffolding (#480), Boundary-V layer protection that keeps the first/last layer at FP16 (#478), and a packed-aware `PagedKvLayout` (#482) round out the runtime. Quality gates: wikitext-2 PPL + NIAH harness (#475), full 283K-token test split fixture (#492), per-model PPL/NIAH results committed (#493), and VLM B3 quality gates with image-token kurtosis (#510). Speed gate matrix runner with M1 Ultra (#509) and M5 Max readings. User guide and validated config matrix published (#485).
+- **Server flag parity for KV quantization.** `--cache-type-k` and `--cache-type-v` flags accept `f16`, `q8_0`, `q4_0`, etc. matching llama-server semantics, with TurboQuant modes exposed as `mlxcel_turbo*` variants (#484). KV cache quantization extended to continuous batching (#545) and a unified `--kv-cache-mode` flag layout shared across `mlxcel`, `mlxcel-server`, and `mlxcel download` (#567).
+- **Automatic Prefix Caching (APC) with hash blocks** (#552). Hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache, enabling shared physical blocks across requests with the same hashed prefix without per-request token-prefix matching cost.
+- **OpenAI-compatible `response_format: {"type": "json_schema", ...}`** structured-output support for `/v1/chat/completions` and `/v1/completions` (#550). Constrained decoding via `llguidance` (the same backend used by upstream mlx-vlm PR #1047) ensures every emitted token keeps the partial output conforming to the supplied schema. Per-request schema validation enforces a 64 KiB size cap, a 32-level nesting depth limit, and a 64-entry `$ref` count limit so an adversarial schema cannot exhaust CPU or memory during grammar compilation. The tokenizer environment is cached by SHA-256 fingerprint so consecutive requests to the same model share the build cost (~1–2 s for a 150k-vocab tokenizer). Reusable per-sequence `mask_buf` and `bias_buf` allocations eliminate per-token `Vec` allocation on the hot decode path. The legacy `json_object` mode is rejected with a clean 400 in this MVP; `json_schema` with a well-formed schema is the supported path. Supported on HuggingFace BPE tokenizers; SentencePiece and Tiktoken backends return a clean `UnsupportedTokenizer` error. Verbose llguidance internals are never surfaced in public error messages — they are routed to server-side tracing only.
+- **`mlxcel download` / `mlxcel-server download` subcommand** to fetch HuggingFace model repository snapshots without Python tooling (#457, #486). Uses `hf-hub` with an allow-list file filter (SafeTensors, tokenizer, and config files only), cache-hit detection, and formatted per-file progress output. Supports `--local-dir`, `--revision`, `--token`, and `--force`. Default destination mirrors the `models/<repo-basename>` convention from AGENTS.md.
+- **`/health` endpoint** now includes `context_size` (the configured `--ctx-size` value; `0` means model default) and `tool_call_parser` (`"mlxcel"` when the chat template exposes the `tools` variable, `null` otherwise) (#549, #572). Both fields are present once a model is loaded; `context_size` is absent while loading, `tool_call_parser` serializes as `null` during startup so monitoring clients can distinguish "template has no tool support" from "model not yet loaded". The tool-support heuristic is extracted into a shared `template_mentions_tools()` helper used by both the health route and the existing `compute_supports_tools` fallback path.
+- **Paged scheduler dispatch on `PagedKvLayout::cache_mode`** so the scheduler routes batches into the matching paged decode kernel for each KV cache mode (#508).
+- **Video input infrastructure for VLMs.** Gemma 4 video support with the new VLM video input pipeline (#553); ffmpeg-backed frame extraction with single-pass extraction and a `Drop` guard for cleanup of temporary frame files (#597); `video_url` content blocks wired through `/v1/chat/completions` (#596); content-preservation tests covering the frame extraction path (#598).
+- **New models.** Youtu-VL vision-language model (#555); Nemotron H Nano Omni vision (#554) plus follow-up correctness/validation hardening (#595).
+- Multi-task M1 Ultra benchmark refresh to 2026-05-08 (#577) and full M1 Ultra column resync in `benchmarks-by-hardware.md` (#578).
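
The three per-request limits in the `json_schema` entry above (64 KiB size, 32 nesting levels, 64 `$ref` entries) can be checked in a single pass before any grammar is compiled. The following is a minimal sketch under stated assumptions, not mlxcel's implementation: the function, constant, and error names are hypothetical, and it scans raw JSON text with the standard library only instead of walking a parsed tree.

```rust
// Limits quoted in the changelog entry; the constant names are hypothetical.
const MAX_SCHEMA_BYTES: usize = 64 * 1024;
const MAX_DEPTH: usize = 32;
const MAX_REFS: usize = 64;

#[derive(Debug, PartialEq)]
enum SchemaLimitError {
    TooLarge,
    TooDeep,
    TooManyRefs,
}

/// Scans the raw schema text once and fails as soon as a limit is exceeded.
/// Assumes the text is otherwise well-formed JSON (no syntax validation here).
fn check_schema_limits(schema: &str) -> Result<(), SchemaLimitError> {
    if schema.len() > MAX_SCHEMA_BYTES {
        return Err(SchemaLimitError::TooLarge);
    }
    let bytes = schema.as_bytes();
    let (mut depth, mut refs) = (0usize, 0usize);
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'{' | b'[' => {
                depth += 1;
                if depth > MAX_DEPTH {
                    return Err(SchemaLimitError::TooDeep);
                }
            }
            b'}' | b']' => depth = depth.saturating_sub(1),
            b'"' => {
                // Walk to the closing quote, stepping over escaped characters.
                let start = i + 1;
                let mut j = start;
                while j < bytes.len() && bytes[j] != b'"' {
                    j += if bytes[j] == b'\\' { 2 } else { 1 };
                }
                let end = j.min(bytes.len());
                // A string literal followed by ':' is an object key.
                let mut k = j + 1;
                while k < bytes.len() && bytes[k].is_ascii_whitespace() {
                    k += 1;
                }
                let is_key = k < bytes.len() && bytes[k] == b':';
                if is_key && bytes[start..end] == b"$ref"[..] {
                    refs += 1;
                    if refs > MAX_REFS {
                        return Err(SchemaLimitError::TooManyRefs);
                    }
                }
                i = j; // now at the closing quote; the increment below skips it
            }
            _ => {}
        }
        i += 1;
    }
    Ok(())
}
```

A caller would presumably run a check like this before handing the schema to the grammar compiler, so an oversized or deeply nested schema is rejected without paying the compilation cost. How mlxcel reports these failures to the client is not stated in the changelog.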
 
 ### Changed
-- `MLX` upstream pin bumped from `c9aa5605` to `84961223` (3 commits, PRs #3443 / #3463 / #3475). PR #3443 splits the CUDA `qmm_naive` / `qmm_sm80` kernel bodies into new `qmm_naive.cuh` / `qmm_sm80.cuh` headers without changing the public ABI consumed by mlxcel's `patches/mlx/backend/cuda/quantized/qmm/qmm.h`; PR #3463 routes the CPU JIT preamble through `JitCompiler::get_preamble()` and renames the prebuilt symbol from `get_kernel_preamble` to `get_prebuilt_preamble` (mlxcel does not call either directly); PR #3475 fixes contiguity-flag accuracy in `AsStrided` by computing `data_size` from the actually-occupied stride range. Three-location pin update applied to `src/lib/mlx-cpp/CMakeLists.txt`, `src/lib/mlxcel-core/build.rs`, and `.github/workflows/release.yml` per `CLAUDE.md`. Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` revalidated against the new pin: `mlx::core::fast::metal_kernel`, `mlx::core::full`, `mlx::core::Shape`, `mlx::core::float32`, `mlx::core::int32`, and `metal::fast::exp` symbols are unchanged across the bump.
+- **MLX upstream pin bumped twice.** First from the v0.0.25 baseline (`5d7e96cd`) to v0.32.0 / `c9aa5605` (#565), then forward to `84961223` covering 3 PRs: #3443 splits the CUDA `qmm_naive` / `qmm_sm80` kernel bodies into new `qmm_naive.cuh` / `qmm_sm80.cuh` headers without changing the public ABI consumed by mlxcel's `patches/mlx/backend/cuda/quantized/qmm/qmm.h`; #3463 routes the CPU JIT preamble through `JitCompiler::get_preamble()` and renames the prebuilt symbol from `get_kernel_preamble` to `get_prebuilt_preamble` (mlxcel does not call either directly); #3475 fixes contiguity-flag accuracy in `AsStrided` by computing `data_size` from the actually-occupied stride range. Three-location pin update applied to `src/lib/mlx-cpp/CMakeLists.txt`, `src/lib/mlxcel-core/build.rs`, and `.github/workflows/release.yml` per `CLAUDE.md`. Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` re-validated against both bumps: `mlx::core::fast::metal_kernel`, `mlx::core::full`, `mlx::core::Shape`, `mlx::core::float32`, `mlx::core::int32`, and `metal::fast::exp` symbols unchanged.
+- **Refactor:** unified TurboQuant KV-cache CLI flags across `mlxcel`, `mlxcel-server`, and `mlxcel download` so all binaries accept the same `--kv-cache-mode` / `--cache-type-{k,v}` syntax (#567).
+- mlx-lm version reference in docs bumped from 0.31.2 to 0.31.3 (#606). The `bridge-overhead-microbench` reference at v0.31.2 is preserved because it pins the MLX C++ runtime, not the mlx-lm Python package.
+
+### Performance
+- **Sparse-V kernel:** fused per-thread Metal kernel that skips the full SDPA pass when sparse-V dequant predicts zero contribution (#505); precomputed kernel rescale to drop per-token threadgroup barriers (#520).
+- **Turbo4Delegated decode hot path:** unified K storage to drop the per-step K concat (#527); cold-V dequant cache across decode steps (#525) followed by a cold-V dequant Metal kernel that retires the FP16 memo (#530); steel-attention-envelope fused SDPA kernel (#531) with parallelized Pass 1 softmax (#534); delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537); compressed fold moved before decode.
+- **Compressed dequant-SDPA paths** for TurboQuant decode (#562).
+- **Server hot-path:** thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556).
+
+### Fixed
+- **TurboQuant continuous batching:** correct batch cache offset merging when batches with different cache offsets are joined or split (#564); Turbo3 split-flag, documentation alignment, and an `ENV_LOCK` race in concurrent process startup (#573).
+- **Vision / VLM mixed batching:** per-sequence MRoPE alignment for mixed VL+text batches (#558); per-sequence `per_layer_inputs` for Gemma 4 E2B/E4B VLM (#561); mixed-length batching support for Gemma 4 (#560); relaxed cached-position shape check in Qwen VL chunked prefill (#557); Qwen3.5-MoE batch-size validation on cached `position_ids` reuse (#559).
+- **Streaming and sampling:** correct streamed detokenization for byte-fallback tokens that previously leaked raw byte fragments to the client (#570); top-p filter correctness for batched logits (#569); token queue timeout handling during long prefills so clients no longer see spurious 408s on slow first-token paths (#571); `StreamFilter` extended to cover Hermes-style `<tool_call>` / `</tool_call>` and Mistral Nemo `[TOOL_CALLS]` markers, which previously leaked raw markup into `delta.content` during streaming (#551, #576). Partial-marker buffering at token boundaries correctly holds back prefixes (e.g. `<tool_`) until the full tag can be confirmed, then releases them to `delta.content` if they turn out not to be a boundary. Gemma 4 `<|tool_call>` suppression is unaffected; the delimiter table ordering ensures the Gemma 4 pipe-delimited form wins the tiebreak over the Hermes plain form.
+- **Models:** Gemma3-4B attention SIGABRT from a sliding-window mask `T_k` mismatch on long-context prompts (#507); preserve Qwen2 fused QKV bias when it is present in the checkpoint (#517); test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506); harden post-merge review findings on the Nemotron-H Nano Omni vision PR.
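
The partial-marker buffering described in the `StreamFilter` entry above can be illustrated with a small hold-back buffer. This is a sketch under assumptions, not the actual `StreamFilter`: `HoldBackFilter`, `MARKERS`, `push`, and `finish` are hypothetical names, the marker table covers only the Hermes and Mistral Nemo forms named in the entry, and the sketch drops only the marker text itself, whereas the real filter presumably also handles the tool-call body.

```rust
// Hypothetical marker table; the real delimiter table is not shown in the changelog.
const MARKERS: [&str; 3] = ["<tool_call>", "</tool_call>", "[TOOL_CALLS]"];

#[derive(Default)]
struct HoldBackFilter {
    // Text received but not yet emitted because it may be part of a marker.
    pending: String,
}

impl HoldBackFilter {
    /// Feeds one streamed chunk and returns the text that is safe to emit now.
    fn push(&mut self, chunk: &str) -> String {
        self.pending.push_str(chunk);
        let mut out = String::new();

        // Drop every complete marker, emitting the text that precedes it.
        loop {
            let hit = MARKERS
                .iter()
                .filter_map(|m| self.pending.find(*m).map(|pos| (pos, m.len())))
                .min();
            match hit {
                Some((pos, len)) => {
                    out.push_str(&self.pending[..pos]);
                    self.pending.drain(..pos + len);
                }
                None => break,
            }
        }

        // Hold back the longest tail that is a proper prefix of some marker.
        let total = self.pending.len();
        let keep = (1..=total)
            .rev()
            .filter(|&n| self.pending.is_char_boundary(total - n))
            .find(|&n| {
                let tail = &self.pending[total - n..];
                MARKERS.iter().any(|m| m.len() > n && m.starts_with(tail))
            })
            .unwrap_or(0);

        // Everything before the held-back tail cannot be part of a marker.
        let cut = total - keep;
        out.push_str(&self.pending[..cut]);
        self.pending.drain(..cut);
        out
    }

    /// Releases whatever is still held back once the stream ends.
    fn finish(&mut self) -> String {
        std::mem::take(&mut self.pending)
    }
}
```

Feeding `"abc<tool_"` emits only `"abc"` and holds `"<tool_"`; a following `"call>x"` completes the marker, which is dropped, and emits `"x"`. If the next chunk had been `"box>"` instead, the held prefix would be released as ordinary text.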
 
 ### Security
 - Path-traversal defense in the downloader: `is_safe_relative_path` pre-filters each sibling filename returned by the HuggingFace API (rejects absolute paths, `..` components, backslash separators, and empty components). A secondary canonicalized `starts_with` guard on the resolved destination path is applied before writing each file. Download target files are written to a temporary path and atomically renamed into place, preventing partial writes from leaving corrupt files in the output directory (fixes C1 and H1 from security review of #457).
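
The pre-filter rules listed in the Security entry above translate into a short predicate. The function name comes from the changelog, but the body below is an assumption based only on the four rejection rules the entry lists. It does not cover the secondary canonicalized `starts_with` guard or the atomic rename.

```rust
/// Returns true only for a relative, forward-slash path with no empty or
/// parent-directory components. Sketch only; the real function may differ.
fn is_safe_relative_path(name: &str) -> bool {
    // Reject empty names, absolute paths, and backslash separators up front.
    if name.is_empty() || name.starts_with('/') || name.contains('\\') {
        return false;
    }
    // Every '/'-separated component must be non-empty and must not be "..".
    name.split('/').all(|part| !part.is_empty() && part != "..")
}
```

A purely lexical check like this cannot see symlinks, which is presumably why the entry pairs it with a guard on the canonicalized destination path.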
@@ -563,6 +583,7 @@ Initial public release of mlxcel.
 - GitHub Actions release workflow for macOS ARM64
 - Profile mode for prefill/decode timing analysis
 
+[v0.0.26]: https://github.com/lablup/mlxcel/compare/v0.0.25...v0.0.26
 [v0.0.25]: https://github.com/lablup/mlxcel/compare/v0.0.24...v0.0.25
 [v0.0.24]: https://github.com/lablup/mlxcel/compare/v0.0.23...v0.0.24
 [v0.0.23]: https://github.com/lablup/mlxcel/compare/v0.0.22...v0.0.23

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "mlxcel"
-version = "0.0.25"
+version = "0.0.26"
 edition = "2024"
 description = "High-performance LLM/VLM/VLA inference on Apple Silicon and CUDA GPUs"
 repository = "https://github.com/lablup/mlxcel"

debian/changelog

Lines changed: 64 additions & 0 deletions
@@ -1,3 +1,67 @@
+mlxcel (0.0.26-1~jammy1) jammy; urgency=medium
+
+  * Feat: TurboQuant KV cache — Walsh-Hadamard transform op + PolarQuant
+    Lloyd-Max codebook generator porting (#470, #472), KVCacheMode::Turbo4
+    symmetric (#476), Turbo4Asym Fp16-K + Turbo4-V (#474), Turbo3Asym
+    Fp16-K + Turbo3-V (#477), Turbo4Delegated FP16 hot tail + packed
+    turbo cold body (#479), TurboQuant + RotatingKVCache sliding window
+    B9, sparse-V dequant scaffolding (#480), Boundary-V layer protection
+    (#478), packed-aware PagedKvLayout (#482)
+  * Feat: --cache-type-k / --cache-type-v server flag parity with
+    llama-server for TurboQuant modes (#484)
+  * Feat: KV cache quantization for continuous batching (#545); unified
+    TurboQuant KV-cache CLI flags across mlxcel binaries (#567);
+    compressed dequant-SDPA paths (#562)
+  * Perf: Fused Sparse-V Metal kernel skipping per-thread SDPA (#505);
+    precomputed Sparse-V kernel rescale dropping per-token threadgroup
+    barriers (#520); cold-V dequant cache + Metal kernel for
+    Turbo4Delegated (#525, #530); unified K storage dropping per-step K
+    concat (#527); steel-attention-envelope fused SDPA kernel (#531);
+    parallelized Pass 1 softmax in turbo4_delegated_steel_sdpa (#534);
+    delegated FP16 predecode compaction + lazy sidecars (#536, #537)
+  * Test: TurboQuant wikitext-2 PPL + NIAH quality gate (#475, #492,
+    #493); VLM B3 quality gates (PPL + NIAH + image-token kurtosis) for
+    #510; KV speed gate matrix on M1 Ultra and M5 Max (#509)
+  * Feat: Automatic Prefix Caching (APC) with hash blocks (#552)
+  * Feat: OpenAI-compatible response_format json_schema constrained
+    decoding via llguidance with schema size/depth/$ref limits (#550)
+  * Feat: mlxcel download / mlxcel-server download HuggingFace
+    subcommand with allow-list filter, atomic writes, and path-traversal
+    defense (#457, #486)
+  * Feat: /health endpoint exposes context_size and tool_call_parser
+    (#549)
+  * Feat: Paged scheduler dispatch on PagedKvLayout::cache_mode (#508);
+    thread-local generation stream and uniform-batch RoPE collapse (#556)
+  * Feat: Gemma 4 video support and VLM video input infrastructure
+    (#553); Youtu-VL model (#555); Nemotron H Nano Omni vision (#554);
+    video_url content blocks wired through chat completion handler
+    (#596); single-pass ffmpeg frame extraction with Drop guard (#597)
+  * Fix: Per-sequence MRoPE alignment for mixed VL+text batches (#558);
+    per-sequence per_layer_inputs for Gemma 4 E2B/E4B VLM (#561);
+    mixed-length batching for Gemma 4 (#560); relaxed cached-position
+    shape check in Qwen VL chunked prefill (#557); Qwen3.5-MoE batch
+    size validation on cached position_ids reuse (#559)
+  * Fix: Streamed detokenization for byte-fallback tokens (#570); top_p
+    filter for batched logits (#569); token queue timeout during long
+    prefills (#571); tool-call markup stripped from streamed
+    delta.content covering Hermes <tool_call> and Mistral Nemo
+    [TOOL_CALLS] (#551, #576)
+  * Fix: TurboQuant batch cache offset merging (#564); Turbo3 split-flag
+    + ENV_LOCK race (#573)
+  * Fix: Gemma3-4B attention SIGABRT — sliding-window mask T_k mismatch
+    (#507); Qwen2 fused QKV bias preservation (#517);
+    nemotron-h-nano-omni post-merge validation gaps (#583)
+  * Update: MLX upstream pin bumped twice — first to v0.32.0
+    (c9aa5605, #565), then to 84961223 covering CUDA QMM kernel split,
+    JIT preamble rename, and AsStrided contiguity-flag fix (PRs #3443
+    #3463 #3475)
+  * Docs: TurboQuant KV cache user guide and validated config matrix
+    (#485); mlx-lm version reference 0.31.2 → 0.31.3 (#606); M1 Ultra
+    benchmark refresh to 2026-05-08 (#577) and benchmarks-by-hardware.md
+    full M1 Ultra column resync (#578)
+
+ -- Jeongkyu Shin <inureyes@gmail.com>  Sun, 10 May 2026 18:00:00 +0900
+
 mlxcel (0.0.25-1~jammy1) jammy; urgency=medium
 
   * Feat: Cross-sequence prompt-prefix KV cache — KVCache trim/detach/adopt