You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: CHANGELOG.md
+28-7Lines changed: 28 additions & 7 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -6,16 +6,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
6
6
7
7
## [Unreleased]
8
8
9
-
### Added
10
-
-`mlxcel download` / `mlxcel-server download` subcommand to fetch HuggingFace model repository snapshots without Python tooling. Uses `hf-hub` with an allow-list file filter (SafeTensors, tokenizer, and config files only), cache-hit detection, and formatted per-file progress output. Supports `--local-dir`, `--revision`, `--token`, and `--force`. Default destination mirrors the `models/<repo-basename>` convention from AGENTS.md (#457).
11
-
-`/health` endpoint now includes `context_size` (the configured `--ctx-size` value; `0` means model default) and `tool_call_parser` (`"mlxcel"` when the chat template exposes the `tools` variable, `null` otherwise). Both fields are present once a model is loaded; `context_size` is absent while loading, `tool_call_parser` serializes as `null` during startup so monitoring clients can distinguish "template has no tool support" from "model not yet loaded". The tool-support heuristic is extracted into a shared `template_mentions_tools()` helper used by both the health route and the existing `compute_supports_tools` fallback path (#549).
12
-
- OpenAI-compatible `response_format: {"type": "json_schema", ...}` structured-output support for `/v1/chat/completions` and `/v1/completions`. Constrained decoding via `llguidance` (the same backend used by upstream mlx-vlm PR #1047) ensures every emitted token keeps the partial output conforming to the supplied schema. Per-request schema validation enforces a 64 KiB size cap, a 32-level nesting depth limit, and a 64-entry `$ref` count limit so an adversarial schema cannot exhaust CPU or memory during grammar compilation. The tokenizer environment is cached by SHA-256 fingerprint so consecutive requests to the same model share the build cost (~1–2 s for a 150k-vocab tokenizer). Reusable per-sequence `mask_buf` and `bias_buf` allocations eliminate per-token `Vec` allocation on the hot decode path. The legacy `json_object` mode is rejected with a clean 400 in this MVP; `json_schema` with a well-formed schema is the supported path. Supported on HuggingFace BPE tokenizers; SentencePiece and Tiktoken backends return a clean `UnsupportedTokenizer` error. Verbose llguidance internals are never surfaced in public error messages — they are routed to server-side tracing only (#550).
9
+
## [v0.0.26] - 2026-05-10
13
10
14
-
### Fixed
15
-
-`StreamFilter` extended to cover Hermes-style `<tool_call>` / `</tool_call>` and Mistral Nemo `[TOOL_CALLS]` markers, which previously leaked raw markup into `delta.content` during streaming. Partial-marker buffering at token boundaries correctly holds back prefixes (e.g. `<tool_`) until the full tag can be confirmed, then releases them to `delta.content` if they turn out not to be a boundary. Gemma 4 `<|tool_call>` suppression is unaffected; the delimiter table ordering ensures the Gemma 4 pipe-delimited form wins the tiebreak over the Hermes plain form (#551).
11
+
### Added
12
+
- **TurboQuant KV cache.** New 3–4 bit KV cache compression family built on a Walsh–Hadamard transform op (#470) and a Lloyd-Max PolarQuant codebook generator ported to Rust (#472). Four KV cache modes wired through `KVCacheMode`: `Turbo4` symmetric with per-model allowlist (#476), `Turbo4Asym` Fp16-K + Turbo4-V (#474), `Turbo3Asym` 3-bit Fp16-K + Turbo3-V (#477), and `Turbo4Delegated` with a FP16 hot tail + packed turbo cold body (#479). TurboQuant + `RotatingKVCache` integration covers sliding-window attention (B9). Sparse-V dequant scaffolding (#480), Boundary-V layer protection that keeps the first/last layer at FP16 (#478), and a packed-aware `PagedKvLayout` (#482) round out the runtime. Quality gates: wikitext-2 PPL + NIAH harness (#475), full 283K-token test split fixture (#492), per-model PPL/NIAH results committed (#493), and VLM B3 quality gates with image-token kurtosis (#510). Speed gate matrix runner with M1 Ultra (#509) and M5 Max readings. User guide and validated config matrix published (#485).
13
+
-**Server flag parity for KV quantization.**`--cache-type-k` and `--cache-type-v` flags accept `f16`, `q8_0`, `q4_0`, etc. matching llama-server semantics, with TurboQuant modes exposed as `mlxcel_turbo*` variants (#484). KV cache quantization extended to continuous batching (#545) and a unified `--kv-cache-mode` flag layout shared across `mlxcel`, `mlxcel-server`, and `mlxcel download` (#567).
14
+
-**Automatic Prefix Caching (APC) with hash blocks** (#552). Hash-keyed block-table prefix reuse on top of v0.0.25's cross-sequence prompt-prefix KV cache, enabling shared physical blocks across requests with the same hashed prefix without per-request token-prefix matching cost.
15
+
- **OpenAI-compatible `response_format: {"type": "json_schema", ...}`** structured-output support for `/v1/chat/completions` and `/v1/completions` (#550). Constrained decoding via `llguidance` (the same backend used by upstream mlx-vlm PR #1047) ensures every emitted token keeps the partial output conforming to the supplied schema. Per-request schema validation enforces a 64 KiB size cap, a 32-level nesting depth limit, and a 64-entry `$ref` count limit so an adversarial schema cannot exhaust CPU or memory during grammar compilation. The tokenizer environment is cached by SHA-256 fingerprint so consecutive requests to the same model share the build cost (~1–2 s for a 150k-vocab tokenizer). Reusable per-sequence `mask_buf` and `bias_buf` allocations eliminate per-token `Vec` allocation on the hot decode path. The legacy `json_object` mode is rejected with a clean 400 in this MVP; `json_schema` with a well-formed schema is the supported path. Supported on HuggingFace BPE tokenizers; SentencePiece and Tiktoken backends return a clean `UnsupportedTokenizer` error. Verbose llguidance internals are never surfaced in public error messages — they are routed to server-side tracing only.
16
+
-**`mlxcel download` / `mlxcel-server download` subcommand** to fetch HuggingFace model repository snapshots without Python tooling (#457, #486). Uses `hf-hub` with an allow-list file filter (SafeTensors, tokenizer, and config files only), cache-hit detection, and formatted per-file progress output. Supports `--local-dir`, `--revision`, `--token`, and `--force`. Default destination mirrors the `models/<repo-basename>` convention from AGENTS.md.
17
+
-**`/health` endpoint** now includes `context_size` (the configured `--ctx-size` value; `0` means model default) and `tool_call_parser` (`"mlxcel"` when the chat template exposes the `tools` variable, `null` otherwise) (#549, #572). Both fields are present once a model is loaded; `context_size` is absent while loading, `tool_call_parser` serializes as `null` during startup so monitoring clients can distinguish "template has no tool support" from "model not yet loaded". The tool-support heuristic is extracted into a shared `template_mentions_tools()` helper used by both the health route and the existing `compute_supports_tools` fallback path.
18
+
-**Paged scheduler dispatch on `PagedKvLayout::cache_mode`** so the scheduler routes batches into the matching paged decode kernel for each KV cache mode (#508).
19
+
-**Video input infrastructure for VLMs.** Gemma 4 video support with the new VLM video input pipeline (#553); ffmpeg-backed frame extraction with single-pass extraction and a `Drop` guard for cleanup of temporary frame files (#597); `video_url` content blocks wired through `/v1/chat/completions` (#596); content-preservation tests covering the frame extraction path (#598).
20
+
-**New models.** Youtu-VL vision-language model (#555); Nemotron H Nano Omni vision (#554) plus follow-up correctness/validation hardening (#595).
21
+
- Multi-task M1 Ultra benchmark refresh to 2026-05-08 (#577) and full M1 Ultra column resync in `benchmarks-by-hardware.md` (#578).
16
22
17
23
### Changed
18
-
- `MLX` upstream pin bumped from `c9aa5605` to `84961223` (3 commits, PRs #3443 / #3463 / #3475). PR #3443 splits the CUDA `qmm_naive` / `qmm_sm80` kernel bodies into new `qmm_naive.cuh` / `qmm_sm80.cuh` headers without changing the public ABI consumed by mlxcel's `patches/mlx/backend/cuda/quantized/qmm/qmm.h`; PR #3463 routes the CPU JIT preamble through `JitCompiler::get_preamble()` and renames the prebuilt symbol from `get_kernel_preamble` to `get_prebuilt_preamble` (mlxcel does not call either directly); PR #3475 fixes contiguity-flag accuracy in `AsStrided` by computing `data_size` from the actually-occupied stride range. Three-location pin update applied to `src/lib/mlx-cpp/CMakeLists.txt`, `src/lib/mlxcel-core/build.rs`, and `.github/workflows/release.yml` per `CLAUDE.md`. Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` revalidated against the new pin: `mlx::core::fast::metal_kernel`, `mlx::core::full`, `mlx::core::Shape`, `mlx::core::float32`, `mlx::core::int32`, and `metal::fast::exp` symbols are unchanged across the bump.
24
+
- **MLX upstream pin bumped twice.** First from the v0.0.25 baseline (`5d7e96cd`) to v0.32.0 / `c9aa5605` (#565), then forward to `84961223` covering 3 PRs: #3443 splits the CUDA `qmm_naive` / `qmm_sm80` kernel bodies into new `qmm_naive.cuh` / `qmm_sm80.cuh` headers without changing the public ABI consumed by mlxcel's `patches/mlx/backend/cuda/quantized/qmm/qmm.h`; #3463 routes the CPU JIT preamble through `JitCompiler::get_preamble()` and renames the prebuilt symbol from `get_kernel_preamble` to `get_prebuilt_preamble` (mlxcel does not call either directly); #3475 fixes contiguity-flag accuracy in `AsStrided` by computing `data_size` from the actually-occupied stride range. Three-location pin update applied to `src/lib/mlx-cpp/CMakeLists.txt`, `src/lib/mlxcel-core/build.rs`, and `.github/workflows/release.yml` per `CLAUDE.md`. Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` re-validated against both bumps: `mlx::core::fast::metal_kernel`, `mlx::core::full`, `mlx::core::Shape`, `mlx::core::float32`, `mlx::core::int32`, and `metal::fast::exp` symbols unchanged.
25
+
-**Refactor:** unified TurboQuant KV-cache CLI flags across `mlxcel`, `mlxcel-server`, and `mlxcel download` so all binaries accept the same `--kv-cache-mode` / `--cache-type-{k,v}` syntax (#567).
26
+
- mlx-lm version reference in docs bumped from 0.31.2 to 0.31.3 (#606). The `bridge-overhead-microbench` reference at v0.31.2 is preserved because it pins the MLX C++ runtime, not the mlx-lm Python package.
27
+
28
+
### Performance
29
+
-**Sparse-V kernel:** fused per-thread Metal kernel that skips the full SDPA pass when sparse-V dequant predicts zero contribution (#505); precomputed kernel rescale to drop per-token threadgroup barriers (#520).
30
+
-**Turbo4Delegated decode hot path:** unified K storage to drop the per-step K concat (#527); cold-V dequant cache across decode steps (#525) followed by a cold-V dequant Metal kernel that retires the FP16 memo (#530); steel-attention-envelope fused SDPA kernel (#531) with parallelized Pass 1 softmax (#534); delegated FP16 predecode compaction (#536) and lazy delegated FP16 sidecars (#537); compressed fold moved before decode.
31
+
-**Compressed dequant-SDPA paths** for TurboQuant decode (#562).
32
+
-**Server hot-path:** thread-local generation stream and uniform-batch RoPE collapse to remove per-request allocation in the steady-state batching loop (#556).
33
+
34
+
### Fixed
35
+
-**TurboQuant continuous batching:** correct batch cache offset merging when batches with different cache offsets are joined or split (#564); Turbo3 split-flag, documentation alignment, and an `ENV_LOCK` race in concurrent process startup (#573).
36
+
-**Vision / VLM mixed batching:** per-sequence MRoPE alignment for mixed VL+text batches (#558); per-sequence `per_layer_inputs` for Gemma 4 E2B/E4B VLM (#561); mixed-length batching support for Gemma 4 (#560); relaxed cached-position shape check in Qwen VL chunked prefill (#557); Qwen3.5-MoE batch-size validation on cached `position_ids` reuse (#559).
37
+
-**Streaming and sampling:** correct streamed detokenization for byte-fallback tokens that previously leaked raw byte fragments to the client (#570); top-p filter correctness for batched logits (#569); token queue timeout handling during long prefills so clients no longer see spurious 408s on slow first-token paths (#571); `StreamFilter` extended to cover Hermes-style `<tool_call>` / `</tool_call>` and Mistral Nemo `[TOOL_CALLS]` markers, which previously leaked raw markup into `delta.content` during streaming (#551, #576). Partial-marker buffering at token boundaries correctly holds back prefixes (e.g. `<tool_`) until the full tag can be confirmed, then releases them to `delta.content` if they turn out not to be a boundary. Gemma 4 `<|tool_call>` suppression is unaffected; the delimiter table ordering ensures the Gemma 4 pipe-delimited form wins the tiebreak over the Hermes plain form.
38
+
-**Models:** Gemma3-4B attention SIGABRT from a sliding-window mask `T_k` mismatch on long-context prompts (#507); preserve Qwen2 fused QKV bias when it is present in the checkpoint (#517); test fixture swap to Qwen2.5-1.5B base variant for the B3 quality gate (#506); harden post-merge review findings on the Nemotron-H Nano Omni vision PR.
19
39
20
40
### Security
21
41
- Path-traversal defense in the downloader: `is_safe_relative_path` pre-filters each sibling filename returned by the HuggingFace API (rejects absolute paths, `..` components, backslash separators, and empty components). A secondary canonicalized `starts_with` guard on the resolved destination path is applied before writing each file. Download target files are written to a temporary path and atomically renamed into place, preventing partial writes from leaving corrupt files in the output directory (fixes C1 and H1 from security review of #457).
@@ -563,6 +583,7 @@ Initial public release of mlxcel.
0 commit comments