v8: GPUs, MXFP & NVFP4 for Rust, Python, and C++ by ashvardanian · Pull Request #350 · ashvardanian/NumKong

ashvardanian · 2026-04-19T19:48:32Z

GPU backends
Block-scaled tensors (Feature: MXFP4 support, NVFP4, and block-scaling #298)
Top-K many-to-many kernels
Geo-temporal cross-products

Range: 455d535..247d8c5

Adds a struct-of-arrays block-scaled tensor across C, C++, Rust, and Python: packed micro-float elements, a per-block scale tensor, and an optional per-tensor scale (NVFP4).

The C kernel `nk_cast_block_scaled` encodes, decodes, and transcodes along the last axis across serial, Haswell, Skylake, Icelake, and NEON; the shared UE8M0 scale follows the OCP MX v1.0 floor formula so a block's own maximum is never clipped, with RNE element rounding and an E5M2 saturate clamp. Bit-to-byte sizes now use the round-up helpers and a sub-byte SIMD over-read was fixed.

C++ adds `scaled_tensor`/`scaled_tensor_view`/`scaled_tensor_span` composing the existing tensor family; `cast()` encodes, decodes, and transcodes in place through a single block-scaled marshalling helper and is rank-general over the last axis. The cast backends were tidied too: the Haswell FP4 path now mirrors Icelake, and the SIMD loop counter and block-amax helper names were unified.

Rust adds `ScaledTensor`/`View`/`Span` composing `Tensor`, with explicit per-format impls (no macros) and rank-general `try_cast_to_scaled` / `try_cast_dense` verbs funneled through one FFI helper. `TensorView` is now `Copy` (an immutable view is a shared borrow), removing the hand-rolled reborrow.

Python adds a `ScaledTensor` produced and consumed via `astype`, with read-only `elements`/`block_scales`/`tensor_scale`/`block_size` attributes; the components are DLPack- and CUDA-array-interface-exportable for zero-copy GPU interop.

Tests cover byte-identity against the C reference, rank-1/2/3 round-trips, slicing, and transcode in every binding.
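To make the struct-of-arrays layout concrete, here is a rough sketch of how the component sizes could be derived for blocks laid along the last axis. The names `divide_round_up` and `block_scaled_sizes` are hypothetical, not NumKong APIs. The block sizes (32 for MXFP4, 16 for NVFP4) and the one-byte block scale come from the public format descriptions rather than from this PR, and packing each row independently is an assumption.

```python
from math import prod


def divide_round_up(value: int, divisor: int) -> int:
    """Ceiling division, the same idea as the round-up size helpers."""
    return (value + divisor - 1) // divisor


def block_scaled_sizes(shape, bits_per_element: int, block_size: int):
    """Return (element_bytes, block_scale_count) for last-axis blocks.

    Assumption: every row is packed on its own and each block scale
    occupies one byte (UE8M0 for MX formats, E4M3 for NVFP4).
    """
    *outer, last = shape
    rows = prod(outer)  # the empty product is 1 for rank-1 shapes
    element_bytes = rows * divide_round_up(last * bits_per_element, 8)
    block_scales = rows * divide_round_up(last, block_size)
    return element_bytes, block_scales


mxfp4 = block_scaled_sizes((4, 64), bits_per_element=4, block_size=32)
nvfp4 = block_scaled_sizes((4, 64), bits_per_element=4, block_size=16)
# An odd last axis needs the round-up on both the bytes and the block count.
odd = block_scaled_sizes((3, 33), bits_per_element=4, block_size=32)
```

The per-tensor scale used by NVFP4 would sit next to these two buffers as a single extra value, which is why the layout stays struct-of-arrays rather than interleaving scales with elements.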

The Python `Tensor(...)` constructor mishandled non-C-contiguous buffers — it honored only the outermost stride and assumed packed inner axes, silently corrupting Fortran-order, transposed, and strided NumPy inputs. It now routes through `linearize_cast_into`, which walks every axis' stride. `ScaledTensor.astype("<block-scaled>")` now transcodes between block-scaled formats (e.g. NVFP4 -> MXFP8) instead of raising, matching the C++ and Rust bindings. Adds coverage in both Python and Rust for non-contiguous / transposed construction and for transcoding (verified equal to decode-then-encode).
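The stride bug is easier to see on a tiny buffer. The sketch below is not the binding's code: `linearize` and `outermost_only` are hypothetical helpers that contrast walking every axis' stride with the old behavior as described above (honor the outermost stride, assume packed inner axes).

```python
import struct
from itertools import product
from math import prod


def linearize(buffer: bytes, shape, strides, fmt="<f"):
    """Read a strided buffer in C order, honoring every axis' byte stride."""
    out = []
    for index in product(*(range(extent) for extent in shape)):
        offset = sum(i * s for i, s in zip(index, strides))
        out.append(struct.unpack_from(fmt, buffer, offset)[0])
    return out


def outermost_only(buffer: bytes, shape, strides, fmt="<f"):
    """Reconstruction of the described bug: inner axes assumed packed."""
    itemsize = struct.calcsize(fmt)
    inner = prod(shape[1:])
    out = []
    for row in range(shape[0]):
        base = row * strides[0]
        for j in range(inner):
            out.append(struct.unpack_from(fmt, buffer, base + j * itemsize)[0])
    return out


# The matrix [[1, 2, 3], [4, 5, 6]] stored in Fortran (column-major) order.
raw = struct.pack("<6f", 1.0, 4.0, 2.0, 5.0, 3.0, 6.0)
shape = (2, 3)
strides = (4, 8)  # bytes: rows are 4 apart, columns are 8 apart

correct = linearize(raw, shape, strides)
corrupted = outermost_only(raw, shape, strides)
```

With these inputs `correct` is `[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`, while `corrupted` is `[1.0, 4.0, 2.0, 4.0, 2.0, 5.0]`: no error is raised, the values are simply wrong.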

Several SIMD kernels and Python-binding paths mishandled degenerate shapes and strides, ranging from a hang to out-of-bounds reads and writes. The reduce kernels (skylake/icelake/haswell/sierra/alder/neonbfdot) hit an infinite loop, SIGFPE, or stack overflow when handed stride_bytes == 0; their `!aligned` serial fallbacks now also catch `stride_elements == 0`. The sub-byte cast oracle in cast/serial.h hard-coded its pack/unpack loops to four bytes regardless of count, reading and writing past the buffer on any odd i4/u4/e2m1 length -- every site now bounds the loop by nk_size_divide_round_up_(count, 2) and guards the odd tail. Rank-0 and empty reductions no longer touch a non-existent element: C++ moments()/minmax() indexed stride_bytes(SIZE_MAX) on a 0-D view, the Python rank-0 path fabricated a zero stride, and empty minmax primed its accumulators from element [0].

The Python bindings gained the missing input validation. Dense-metric `out=` buffers are checked for rank and capacity before writing (an undersized out overflowed the heap); parse_tensor now requires exact inner-axis contiguity, rejecting the negative/zero strides (e.g. x[::-1]) that the old signed `> itemsize` check let walk off the buffer; and DLPack import rejects negative extents and a NULL data pointer. The Rust symmetric matrix verbs reject non-contiguous-row (transposed) views, matching the packed and parallel paths.

The test suites were deduplicated and extended over the same edges. The Python *_float/*_integer reduction and arithmetic pairs collapse into parametrized tests (removing dead precise_*/baseline_sum helpers and a stale DLPack helper), and new cases cover empty/0-D tensors, NaN/Inf preservation on dense casts, and block-scaled round-trip / idempotence / byte-identity across all seven formats, plus the degenerate inputs above. Net test lines shrink while coverage grows.

Comments were stripped of decorative banner separators and ephemeral numbered "Phase"/"Step"/"Option" labels, whose ordering is implied by code order. Also sets the rustfmt line width to 120 and bumps the Rust MSRV to 1.73 for usize::div_ceil.
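The sub-byte fix described above (bound the loop by a round-up division and guard the odd tail) can be sketched as follows. This is an illustration only: `pack_u4` and `unpack_u4` are hypothetical names, and the low-nibble-first order is an assumption, not something the PR text states.

```python
def divide_round_up(count: int, divisor: int) -> int:
    """Ceiling division, standing in for the round-up size helper."""
    return (count + divisor - 1) // divisor


def pack_u4(values) -> bytes:
    """Pack unsigned 4-bit values two per byte (assumed low nibble first)."""
    packed = bytearray(divide_round_up(len(values), 2))
    for i, value in enumerate(values):
        nibble = min(max(value, 0), 15)  # clamp into the u4 range
        packed[i // 2] |= nibble << (4 * (i % 2))
    return bytes(packed)


def unpack_u4(packed: bytes, count: int):
    """Unpack exactly `count` nibbles, never reading past the last byte."""
    byte_count = divide_round_up(count, 2)
    if byte_count > len(packed):
        raise ValueError("buffer too small for the requested count")
    out = []
    for byte_index in range(byte_count):
        byte = packed[byte_index]
        out.append(byte & 0x0F)
        if 2 * byte_index + 1 < count:  # guard the odd tail
            out.append(byte >> 4)
    return out


packed_three = pack_u4([1, 2, 3])
round_trip = unpack_u4(packed_three, 3)
```

Three values occupy two bytes here, and the high nibble of the last byte is left untouched on unpack, which is the case a fixed-width loop gets wrong.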

The block-scaling cast paths reference NULL (`from_tensor_scale != NULL`) in cast/{serial,neon,icelake,skylake,haswell}.h, and capabilities.h passes NULL to sysctlbyname. NumKong headers intentionally avoid <stddef.h>, so NULL resolves only transitively through the C translation units' <stdlib.h>. The Swift module and WASM/WASI clang builds compile the headers standalone and fail with "use of undeclared identifier 'NULL'", which blocked the Swift, WASM, and wheel CI jobs. Add an NK_NULL macro to types.h (mirroring StringZilla's SZ_NULL: __null on GCC/Clang, ((void *)0) otherwise) and use it at those call sites -- no header dependency is introduced. Also drop the stale float8_e8m0fnu case from test_ml_dtypes_incompatible_rejected: e8m0 is the OCP UE8M0 scale format, now a first-class dtype via block scaling, so it is accepted and round-trips exactly. A positive test covers it.

Add nk_size_mul_checked_ to types.h and route the shape-product and cdist size computations in python/{each,distance}.c through it, so a buffer reporting an overflowing shape raises OverflowError instead of wrapping into an undersized allocation walked at the true extent. The elementwise scalar-array paths returned NULL without setting an exception when the buffer dtype is unsupported (a CPython protocol violation); they now raise TypeError. The packed and symmetric matrix verbs leaked the owned output tensor on an invalid row range; they now Py_DECREF it. The DLPack importer propagates a PyObject_IsTrue error and nulls owner->managed if PyCapsule_SetName fails, avoiding a double-free of the producer tensor.
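The overflow guard on shape products can be illustrated in Python by emulating a fixed-width size type. This is a sketch under assumptions: `size_mul_checked` and `checked_shape_product` are hypothetical analogues, the real signature of `nk_size_mul_checked_` is not reproduced here, and a 64-bit size type is assumed.

```python
SIZE_MAX = 2**64 - 1  # assumption: the size type is 64 bits wide


def size_mul_checked(a: int, b: int):
    """Return (ok, product); ok is False when a * b would not fit."""
    if a != 0 and b > SIZE_MAX // a:
        return False, 0
    return True, a * b


def checked_shape_product(shape) -> int:
    """Multiply the extents, raising instead of silently wrapping."""
    total = 1
    for extent in shape:
        if extent < 0:
            raise ValueError("negative extent")
        ok, total = size_mul_checked(total, extent)
        if not ok:
            raise OverflowError("shape product overflows the size type")
    return total


small = checked_shape_product((3, 4, 5))

# What unchecked fixed-width arithmetic would produce for a hostile shape:
wrapped = (2**32 * 2**32) & SIZE_MAX

try:
    checked_shape_product((2**32, 2**32))
    overflow_raised = False
except OverflowError:
    overflow_raised = True
```

The `wrapped` value shows the hazard: the unchecked product of two 2^32 extents wraps to 0, so an allocation sized from it would be far smaller than the extent the kernel later walks.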

cibuildwheel 4.0 removed the `cpython-freethreading` enable group (free-threaded CPython is built by default now), so the pinned `CIBW_ENABLE` aborts every wheel job with "Unknown enable group". Drop it from all seven wheel matrices; `cp31Xt-*` selectors still build the no-GIL wheels.

Stop pinning the `Visual Studio 17 2022` generator in the Windows C test and release jobs; let CMake auto-detect whichever Visual Studio the runner ships (windows-2025 now carries VS18). `-A` still selects the target architecture.

Intel's downloadmirror.intel.com now sits behind an AWS WAF challenge (HTTP 202, x-amzn-waf-action: challenge) that hands scripted clients an HTML page instead of the tarball, so `tar -xf` fails. Fetch SDE from the petarpetrovt/setup-sde GitHub-hosted mirror instead.

cibuildwheel 4.x removed support for Python 3.13 free-threading entirely (`cp313t` is no longer a build identifier and the tool errors out on the selector). Python 3.14 free-threading is the first supported tier and is built by default — no enable group needed. Drop the now-invalid `313t` matrix rows across every wheel platform; `314t` continues to ship.

`nk_size_mul_checked_` takes `nk_size_t *`, but the cdist and elementwise bindings passed `&size_t`. On x86_64 `nk_size_t` is `unsigned long long` while `size_t` is `unsigned long` (same width, distinct types), so GCC's `-Wincompatible-pointer-types` fires — and gcc-toolset-14 in the manylinux images treats it as a hard error, breaking every Linux wheel. clang and older gcc only warned, so it slipped through local builds. Do the count arithmetic in `nk_size_t` (the helper's and `nk_cast`'s domain) and convert to `size_t` only at the libc boundary, which also drops the now-redundant `(nk_size_t)total_elements` casts at the cast calls.

The ARM64 Windows wheels are cross-compiled on an x64 host. `/openmp:llvm` makes the extension import `libomp140.aarch64.dll`, which ships with Visual Studio's ARM64 cross-tools but is not on PATH, so delvewheel could not find it to bundle and the repair step failed. Add a step (ARM64 jobs only) that locates `libomp140.aarch64.dll` in the Visual Studio install and puts its directory on PATH. delvewheel searches PATH (matching the DLL by architecture), so it now vendors the runtime into the wheel. OpenMP stays enabled on every platform.

`tensor_slice_suffix_(all_t, rest...)` computed `inner.byte_data() - first_row.byte_data()` to locate the sliced sub-shape. When `rest...` over-slices the row to an empty tensor, `inner.byte_data()` is null, and a pointer subtraction with a null operand is undefined behavior. Detect the null inner and return an empty view first; a valid rank-0 scalar slice keeps a non-null pointer and is unaffected.

When the source point cloud has zero variance (all points identical), the similarity scale is mathematically undefined. The backends disagreed: the serial oracle and skylake's f32 path divided unguarded and yielded Inf/NaN, while Haswell, genoa, every NEON/RVV/v128 path, and skylake's f16/f64/bf16 paths guarded the division to return 0. Returning 0 silently produces a plausible-looking but wrong transform. Drop the guards so every backend propagates the Inf/NaN of the undefined division, matching the serial reference and making the degenerate case an explicit signal rather than a quiet 0. Non-degenerate inputs are unchanged (mesh suite passes at machine-epsilon error).

Construction sizes storage as `shape.product() / dimensions_per_value()`; for a rank-0 tensor the empty product is 1, so it allocates one element. `Drop` special-cased `ndim == 0` to a storage count of 0 and skipped the deallocation, leaking that element. Compute the count the same way construction does so the element is freed (rank-0 sub-byte stays at 0 via the same integer division, matching its dangling allocation).
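The mismatch between construction and `Drop` comes down to one formula. The Rust code is not shown in this PR text, so the following is only a Python rendering of the arithmetic described above; `storage_count` and `old_drop_count` are hypothetical names.

```python
from math import prod


def storage_count(shape, dimensions_per_value: int) -> int:
    """Count used at construction: product of extents over values per slot."""
    return prod(shape) // dimensions_per_value  # prod(()) == 1 for rank 0


def old_drop_count(shape, dimensions_per_value: int) -> int:
    """The described bug: rank 0 was special-cased to zero elements."""
    if len(shape) == 0:
        return 0
    return prod(shape) // dimensions_per_value


scalar_allocated = storage_count((), 1)   # one element is allocated
scalar_freed_old = old_drop_count((), 1)  # zero elements were freed
scalar_sub_byte = storage_count((), 2)    # 1 // 2 == 0, nothing to free
matrix_sub_byte = storage_count((3, 4), 2)
```

Using the same function on both sides removes the special case: full-width rank-0 tensors free their single element, and rank-0 sub-byte tensors still resolve to zero.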

The in-place ops (scale/add, scalar add/sub/mul, sin/cos/atan, and add/sub/mul-tensor) routed through `try_reborrow_tensor_inplace`, which fabricated a read-only TensorView aliasing the mutable TensorSpan over the same storage. The kernel walk then formed an overlapping `&[T]` + `&mut [T]` over those bytes — undefined behavior under Stacked/Tree Borrows (the C kernels tolerate the aliasing, but Rust's model does not). In-place mutation now lives on TensorSpan and operates on its own storage: each per-type `each_*_inplace` entry derives both raw pointers from the single `&mut` (source == dest, which the C kernels accept), and the binary ops keep `other` as a disjoint `&[T]`. No read-view of the target is ever constructed. `try_reborrow_tensor_inplace` is deleted; Tensor's in-place methods delegate to `self.span()`. Error handling is made consistent: unary in-place is infallible, binary validates shapes and returns Result. Adds in-place == out-of-place regression tests.

Run `cargo +nightly fmt` so the crate matches rustfmt.toml (fn_single_line, max_width=120, unstable_features). Pure formatting — no behavior change. Committing the reflow on its own keeps it out of logic diffs and makes future `fmt` runs no-ops.

The `all_t` slice overload guarded `inner.byte_data() - first_row.byte_data()` against a null inner, but the parallel `range` overload did the same pointer subtraction unguarded — UB when a nested slice over-slices to an empty tensor. Extract one `slice_inner_byte_offset_` helper (also rejecting a null first_row) and use it in both overloads.

The WASM-via-Wasmer job built `wasmer-cli` from `--git main` to get relaxed-SIMD support. Upstream's main now pulls a private SSH submodule (git@github.com:wasmerio/quickjs.git) that fails to authenticate in CI, so the install aborts. Relaxed-SIMD now ships in released Wasmer, so install the prebuilt latest release via get.wasmer.io — no source build, no submodule fetch.

Two RVV cast bugs, both verified against the serial oracle under qemu-riscv64 (CI only compiles RVV, so these were never caught):

- f32→f16 (`nk_f32m2_to_f16m1_rvv_`) was a simplified conversion that clamped the exponent and rounded half-up, so overflow did not saturate to infinity (65520 → a finite value, 70000 → garbage), denormals and underflow were wrong, and rounding was not round-to-nearest-even. Rewritten to compute every IEEE bucket (zero, inf/nan, underflow, denormal, normal, overflow) with RNE and merge by exponent mask, matching nk_f32_to_f16_serial bit-for-bit (incl. 200 random values).
- The packed i4/u4 ↔ i8/u8 casts sized their vector loop as `count / 2`, dropping the trailing element on odd counts (and skipping the whole cast at count == 1). Add a scalar tail for the odd high nibble, clamping packs to [-8,7]/[0,15] to match the vector helpers.
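The f32→f16 buckets listed above can be written out in scalar form. This is a sketch of the generic IEEE 754 conversion with round-to-nearest-even, not a port of the RVV kernel or of `nk_f32_to_f16_serial`; `f32_to_f16_bits` is a hypothetical name, and the quiet-NaN payload chosen for NaN inputs is an assumption.

```python
import struct


def f32_to_f16_bits(value: float) -> int:
    """Convert a value (first rounded to f32) to IEEE f16 bits using RNE."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    sign = (bits >> 16) & 0x8000
    biased = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF

    if biased == 0xFF:  # inf/nan bucket
        return sign | 0x7C00 | (0x0200 if mantissa else 0)

    exponent = biased - 127
    if exponent > 15:  # overflow bucket: saturate to infinity
        return sign | 0x7C00

    if exponent >= -14:  # normal bucket
        result = ((exponent + 15) << 10) | (mantissa >> 13)
        remainder = mantissa & 0x1FFF
        if remainder > 0x1000 or (remainder == 0x1000 and (result & 1)):
            result += 1  # a carry may ripple up to 0x7C00, i.e. infinity
        return sign | result

    if exponent < -25:  # underflow bucket (also covers zero and f32 denormals)
        return sign

    # denormal bucket: shift the 24-bit significand into the 10-bit field
    significand = mantissa | 0x800000
    shift = -1 - exponent
    result = significand >> shift
    remainder = significand & ((1 << shift) - 1)
    halfway = 1 << (shift - 1)
    if remainder > halfway or (remainder == halfway and (result & 1)):
        result += 1
    return sign | result


tie_to_infinity = f32_to_f16_bits(65520.0)  # halfway above the f16 maximum
far_overflow = f32_to_f16_bits(-70000.0)
largest_finite = f32_to_f16_bits(65504.0)
```

65520 sits exactly halfway between the largest finite half (65504) and the next representable step, so the tie rounds to the even neighbor and the carry lands on the infinity encoding `0x7C00`; a half-up conversion with a clamped exponent cannot produce that.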

ashvardanian added 7 commits April 19, 2026 18:57

Add: Draft block-scaling casts

8916f4d

Add: NVFP4 & MX-blocked formats for C++

cc82fbc

Make: Wire WASI to CTest

247d8c5

Merge: New ISAs & Modular Design with Block Scaling

e8b5e51

ashvardanian force-pushed the main-v8-block-scaling branch from a080325 to aa7dddb Compare June 12, 2026 11:57

ashvardanian linked an issue Jun 12, 2026 that may be closed by this pull request

Feature: MXFP4 support, NVFP4, and block-scaling #298

Open

3 tasks

ashvardanian changed the title ~~NVFP4 for Rust & C++~~ v8: GPUs, MXFP & NVFP4 for Rust, Python, and C++ Jun 12, 2026

ashvardanian added 14 commits June 12, 2026 13:00

ashvardanian wants to merge 21 commits into main-dev from main-v8-block-scaling
