update: bump MLX upstream pin to 84961223 (PRs )

inureyes · web-flow · commit 4497104e6a0b · 2026-05-10T21:39:03.000+09:00
Picks up upstream (CUDA qmm_naive / qmm_sm80 kernel bodies extracted into new qmm_naive.cuh / qmm_sm80.cuh headers — public ABI of the symbols declared in mlxcel's patches/.../qmm.h is unchanged),
(CPU JIT preamble routed through JitCompiler::get_preamble
and the prebuilt symbol renamed from get_kernel_preamble to get_prebuilt_preamble — mlxcel does not call either directly), and
(AsStrided contiguity-flag accuracy fix in mlx/backend/common,
computing data_size from the actually-occupied stride range).
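
The commit message only states that data_size is now derived from the stride range a view actually occupies. As a rough illustration of that idea (not the upstream MLX code), the sketch below computes the span between the lowest and highest element a strided view can address. The function name `occupied_extent` and the choice of element-unit strides are assumptions made for this example.

```rust
/// Number of elements between the lowest and highest address a strided
/// view can touch, inclusive. Strides are in elements, not bytes.
/// Hypothetical helper for illustration; not taken from MLX.
fn occupied_extent(shape: &[usize], strides: &[isize]) -> usize {
    assert_eq!(shape.len(), strides.len(), "one stride per dimension");
    // A view with any zero-length dimension addresses nothing.
    if shape.iter().any(|&dim| dim == 0) {
        return 0;
    }
    let (mut lowest, mut highest) = (0isize, 0isize);
    for (&dim, &stride) in shape.iter().zip(strides) {
        // Offset of the last index along this dimension.
        let reach = (dim as isize - 1) * stride;
        if reach >= 0 {
            highest += reach;
        } else {
            lowest += reach;
        }
    }
    (highest - lowest + 1) as usize
}
```

For a 2x3 view with strides [6, 1] this gives 9, while the element count is only 6, so sizing the buffer from the shape product alone would understate the occupied range.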

Three-location pin update applied per CLAUDE.md:
 - src/lib/mlx-cpp/CMakeLists.txt (GIT_TAG)
 - src/lib/mlxcel-core/build.rs (MLX_EXPECTED_COMMIT)
 - .github/workflows/release.yml (MLX_EXPECTED_COMMIT env)
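
Because the same commit hash has to appear in all three files, a small consistency check can catch a partial bump. The following is a minimal sketch, assuming the pin is the only 40-character hex token in each relevant snippet; `find_commit_hashes` and `pins_agree` are hypothetical helpers and not part of mlxcel.

```rust
/// Return every 40-character hex token found in `text`.
/// Hypothetical helper for illustration.
fn find_commit_hashes(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_ascii_hexdigit())
        .filter(|token| token.len() == 40)
        .collect()
}

/// True when every source contains at least one commit hash and all of
/// the hashes it contains equal `expected`.
fn pins_agree(sources: &[&str], expected: &str) -> bool {
    sources.iter().all(|source| {
        let hashes = find_commit_hashes(source);
        !hashes.is_empty() && hashes.iter().all(|hash| *hash == expected)
    })
}
```

In practice the three sources would be the contents of the files listed above, read with `std::fs::read_to_string`.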

Patch headers retargeted to the new commit:
 - patches/mlx/backend/cuda/quantized/qmm/qmm.h
 - patches/mlx/backend/cuda/quantized/quantized.cpp

Fused Metal kernel launchers in src/lib/mlx-cpp/turbo/ revalidated on Apple Silicon. The relevant symbols (mlx::core::fast::metal_kernel, mlx::core::full, mlx::core::Shape, mlx::core::float32, mlx::core::int32, metal::fast::exp) are unchanged across the bump; the three required correctness tests pass with significant headroom on the RMS<5e-3 gate:

 - sparse_v_kernel_threshold_zero_matches_graph: OK
 - delegated_fused_kernel_matches_reference_over_200_steps: RMS = 1.7263e-4
 - delegated_steel_envelope_matches_cold_only_fused_over_200_steps: RMS = 1.5259e-4

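
For readers unfamiliar with this kind of gate, here is a minimal sketch of an RMS check. It assumes the reported figure is the root-mean-square of the element-wise difference between the fused-kernel output and the reference output, which the commit message does not spell out. `rms_error`, `passes_gate`, and `RMS_GATE` are hypothetical names rather than the actual test code.

```rust
/// Threshold quoted in the commit message (RMS < 5e-3).
const RMS_GATE: f64 = 5e-3;

/// Root-mean-square of the element-wise difference between two outputs.
/// Hypothetical helper; accumulates in f64 to limit rounding error.
fn rms_error(actual: &[f32], reference: &[f32]) -> f64 {
    assert_eq!(actual.len(), reference.len(), "outputs must match in length");
    assert!(!actual.is_empty(), "outputs must be non-empty");
    let sum_of_squares: f64 = actual
        .iter()
        .zip(reference)
        .map(|(&a, &r)| {
            let diff = a as f64 - r as f64;
            diff * diff
        })
        .sum();
    (sum_of_squares / actual.len() as f64).sqrt()
}

/// True when the error is strictly below the gate.
fn passes_gate(rms: f64) -> bool {
    rms < RMS_GATE
}
```

Under that assumption, both reported values (1.7263e-4 and 1.5259e-4) sit more than an order of magnitude below the threshold.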
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -236,7 +236,7 @@ jobs:
         env:
           # Must match GIT_TAG in src/lib/mlx-cpp/CMakeLists.txt and
           # MLX_EXPECTED_COMMIT in src/lib/mlxcel-core/build.rs
-          MLX_EXPECTED_COMMIT: "c9aa560577d4f41677bc5830a8b7e806a07d4c6f"
+          MLX_EXPECTED_COMMIT: "84961223c02925bef6bef95d3a0a046779bde935"
         run: |
           # Check every _deps directory for a valid .mlx-build-commit marker.
           # If the marker is missing or doesn't match, purge that _deps/ entirely.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,6 +14,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 ### Fixed
 - `StreamFilter` extended to cover Hermes-style `<tool_call>` / `</tool_call>` and Mistral Nemo `[TOOL_CALLS]` markers, which previously leaked raw markup into `delta.content` during streaming. Partial-marker buffering at token boundaries correctly holds back prefixes (e.g. `<tool_`) until the full tag can be confirmed, then releases them to `delta.content` if they turn out not to be a boundary. Gemma 4 `<|tool_call>` suppression is unaffected; the delimiter table ordering ensures the Gemma 4 pipe-delimited form wins the tiebreak over the Hermes plain form (#551).
 
+### Changed
+- `MLX` upstream pin bumped from `c9aa5605` to `84961223` (3 commits, PRs #3443 / #3463 / #3475). PR #3443 splits the CUDA `qmm_naive` / `qmm_sm80` kernel bodies into new `qmm_naive.cuh` / `qmm_sm80.cuh` headers without changing the public ABI consumed by mlxcel's `patches/mlx/backend/cuda/quantized/qmm/qmm.h`; PR #3463 routes the CPU JIT preamble through `JitCompiler::get_preamble()` and renames the prebuilt symbol from `get_kernel_preamble` to `get_prebuilt_preamble` (mlxcel does not call either directly); PR #3475 fixes contiguity-flag accuracy in `AsStrided` by computing `data_size` from the actually-occupied stride range. Three-location pin update applied to `src/lib/mlx-cpp/CMakeLists.txt`, `src/lib/mlxcel-core/build.rs`, and `.github/workflows/release.yml` per `CLAUDE.md`. Fused Metal kernel launchers in `src/lib/mlx-cpp/turbo/` revalidated against the new pin: `mlx::core::fast::metal_kernel`, `mlx::core::full`, `mlx::core::Shape`, `mlx::core::float32`, `mlx::core::int32`, and `metal::fast::exp` symbols are unchanged across the bump.
+
 ### Security
 - Path-traversal defense in the downloader: `is_safe_relative_path` pre-filters each sibling filename returned by the HuggingFace API (rejects absolute paths, `..` components, backslash separators, and empty components). A secondary canonicalized `starts_with` guard on the resolved destination path is applied before writing each file. Download target files are written to a temporary path and atomically renamed into place, preventing partial writes from leaving corrupt files in the output directory (fixes C1 and H1 from security review of #457).
 - Structured-output schema limits (64 KiB serialized size, 32 nesting depth, 64 `$ref` count) and tightened `llguidance` parser caps (`max_grammar_size: 100 000`, `max_lexer_states: 50 000`) applied before grammar compilation so an adversarial client cannot use the schema endpoint as a CPU/memory exhaustion vector. Schema content is never echoed in public error messages (#550).
diff --git a/src/lib/mlx-cpp/CMakeLists.txt b/src/lib/mlx-cpp/CMakeLists.txt
@@ -89,7 +89,7 @@ else()
   FetchContent_Declare(
     mlx
     GIT_REPOSITORY "https://github.com/ml-explore/mlx.git"
-    GIT_TAG c9aa560577d4f41677bc5830a8b7e806a07d4c6f)
+    GIT_TAG 84961223c02925bef6bef95d3a0a046779bde935)
 
   # Use FetchContent_Populate + add_subdirectory so we can apply source
   # overlays before the MLX build system processes the files.
diff --git a/src/lib/mlx-cpp/patches/mlx/backend/cuda/quantized/qmm/qmm.h b/src/lib/mlx-cpp/patches/mlx/backend/cuda/quantized/qmm/qmm.h
@@ -1,7 +1,9 @@
 // Copyright © 2026 Apple Inc.
-// Patched by mlxcel: matches upstream c9aa5605. Declarations carry the
+// Patched by mlxcel: matches upstream 84961223. Declarations carry the
 // optional<array> lhs_indices / rhs_indices parameters on qmm_sm80 and
-// qmm_naive; no functional change between 68cf2fdd and c9aa5605.
+// qmm_naive; upstream #3443 (c9aa5605..84961223) split the kernel body
+// into qmm_naive.cuh / qmm_sm80.cuh while preserving the public ABI of
+// the symbols declared here.
 
 #pragma once
 
diff --git a/src/lib/mlx-cpp/patches/mlx/backend/cuda/quantized/quantized.cpp b/src/lib/mlx-cpp/patches/mlx/backend/cuda/quantized/quantized.cpp
@@ -1,10 +1,12 @@
 // Copyright © 2025 Apple Inc.
 // Patched by mlxcel: ensure input contiguity in QuantizedMatmul for
 // non-contiguous 3D batched weights (e.g. GLM-4 MLA embed_q with
-// transpose=false). Synced to upstream c9aa5605, which folds in the
-// #3469 cutlass-half-type fix (ensure_row_contiguous on x and indices)
-// and continues to accept the optional<array> lhs_indices / rhs_indices
-// parameters on qmm_sm80 / qmm_naive.
+// transpose=false). Synced to upstream 84961223 (post-c9aa5605: PR #3443
+// extracted qmm_naive / qmm_sm80 kernel bodies into .cuh headers but
+// preserved the public function signatures consumed here, including the
+// #3469 cutlass-half-type fix that landed at c9aa5605: ensure_row_contiguous
+// on x and indices, plus the optional<array> lhs_indices / rhs_indices
+// parameters on qmm_sm80 / qmm_naive).
 
 #include "mlx/backend/cuda/quantized/quantized.h"
 #include "mlx/backend/cuda/device.h"
diff --git a/src/lib/mlxcel-core/build.rs b/src/lib/mlxcel-core/build.rs
@@ -134,7 +134,7 @@ fn main() {
 }
 
 /// Expected MLX git commit — must match GIT_TAG in mlx-cpp/CMakeLists.txt.
-const MLX_EXPECTED_COMMIT: &str = "c9aa560577d4f41677bc5830a8b7e806a07d4c6f";
+const MLX_EXPECTED_COMMIT: &str = "84961223c02925bef6bef95d3a0a046779bde935";
 
 /// Purge stale cached MLX build artifacts before CMake runs.
 ///
