v8: GPUs, MXFP & NVFP4 for Rust, Python, and C++ #350

Open. ashvardanian wants to merge 21 commits into main-dev from main-v8-block-scaling


@ashvardanian (Owner) commented Apr 19, 2026

Adds a struct-of-arrays block-scaled tensor across C, C++, Rust, and Python:
packed micro-float elements, a per-block scale tensor, and an optional
per-tensor scale (NVFP4). The C kernel `nk_cast_block_scaled` encodes, decodes,
and transcodes along the last axis across serial, Haswell, Skylake, Icelake, and
NEON; the shared UE8M0 scale follows the OCP MX v1.0 floor formula so a block's
own maximum is never clipped, with RNE element rounding and an E5M2 saturate
clamp. Bit-to-byte sizes now use the round-up helpers and a sub-byte SIMD
over-read was fixed.
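
As a rough illustration of the floor formula mentioned above, the sketch below computes a shared UE8M0 scale byte for one block in pure Python. It is not the `nk_cast_block_scaled` kernel: the function names, the `ELEMENT_EMAX` table, and the choice of byte 0 for an all-zero block are assumptions made for this example, and inputs are assumed finite.

```python
import math

# Exponent of the largest normal value in each element format (emax_elem in OCP MX v1.0).
ELEMENT_EMAX = {"e2m1": 2, "e2m3": 2, "e3m2": 4, "e4m3": 8, "e5m2": 15}
UE8M0_BIAS = 127


def ue8m0_scale_byte(block, element_format):
    """Return the biased UE8M0 byte for the scale 2**(floor(log2(amax)) - emax_elem)."""
    amax = max((abs(value) for value in block), default=0.0)
    if amax == 0.0:
        return 0  # assumption: an all-zero (or empty) block takes the smallest scale
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1 exactly.
    _, exponent = math.frexp(amax)
    shared_exponent = (exponent - 1) - ELEMENT_EMAX[element_format]
    # 0xFF is reserved for NaN in UE8M0, so clamp the biased value into [0, 254].
    return max(0, min(254, shared_exponent + UE8M0_BIAS))


def ue8m0_to_float(scale_byte):
    """Decode a UE8M0 byte back into the power-of-two scale it represents."""
    return math.ldexp(1.0, scale_byte - UE8M0_BIAS)
```

For example, a block whose largest magnitude is 3.0 gets the scale 2**(1 - 2) = 0.5 when the elements are E2M1.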

C++ adds `scaled_tensor`/`scaled_tensor_view`/`scaled_tensor_span` composing the
existing tensor family; `cast()` encodes, decodes, and transcodes in place
through a single block-scaled marshalling helper and is rank-general over the
last axis. The cast backends were tidied too: the Haswell FP4 path now mirrors
Icelake, and the SIMD loop counter and block-amax helper names were unified.

Rust adds `ScaledTensor`/`View`/`Span` composing `Tensor`, with explicit
per-format impls (no macros) and rank-general `try_cast_to_scaled` /
`try_cast_dense` verbs funneled through one FFI helper. `TensorView` is now
`Copy` (an immutable view is a shared borrow), removing the hand-rolled reborrow.

Python adds a `ScaledTensor` produced and consumed via `astype`, with read-only
`elements`/`block_scales`/`tensor_scale`/`block_size` attributes; the components
are DLPack- and CUDA-array-interface-exportable for zero-copy GPU interop. Tests
cover byte-identity against the C reference, rank-1/2/3 round-trips, slicing, and
transcode in every binding.

The Python `Tensor(...)` constructor mishandled non-C-contiguous buffers: it
honored only the outermost stride and assumed packed inner axes, silently
corrupting Fortran-order, transposed, and strided NumPy inputs. It now routes
through `linearize_cast_into`, which walks every axis's stride.
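
To make the stride bug concrete, here is a minimal pure-Python sketch of gathering a strided N-D buffer into C order by honoring every axis's byte stride. The `linearize` function is a hypothetical stand-in written for this example; it is not the `linearize_cast_into` routine and only mirrors the idea described above.

```python
import struct
from itertools import product


def linearize(buffer, shape, strides, fmt="<f", offset=0):
    """Copy a strided N-D buffer into a C-ordered list, using the stride of every axis."""
    out = []
    # product() iterates indices in row-major (C) order: the last axis varies fastest.
    for index in product(*(range(extent) for extent in shape)):
        byte_offset = offset + sum(i * stride for i, stride in zip(index, strides))
        out.append(struct.unpack_from(fmt, buffer, byte_offset)[0])
    return out


# A 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored in Fortran (column-major) order.
fortran_bytes = struct.pack("<6f", 1, 4, 2, 5, 3, 6)
rows = linearize(fortran_bytes, shape=(2, 3), strides=(4, 8))
```

Reading this buffer with only the outer stride and an assumed packed inner axis would yield 1, 4, 2 for the first row; walking both strides recovers 1, 2, 3.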

`ScaledTensor.astype("<block-scaled>")` now transcodes between block-scaled
formats (e.g. NVFP4 -> MXFP8) instead of raising, matching the C++ and Rust
bindings. Adds coverage in both Python and Rust for non-contiguous / transposed
construction and for transcoding (verified equal to decode-then-encode).

Several SIMD kernels and Python-binding paths mishandled degenerate shapes and
strides, ranging from a hang to out-of-bounds reads and writes. The reduce
kernels (skylake/icelake/haswell/sierra/alder/neonbfdot) hit an infinite loop,
SIGFPE, or stack overflow when handed `stride_bytes == 0`; their `!aligned`
serial fallbacks now also catch `stride_elements == 0`. The sub-byte cast
oracle in `cast/serial.h` hard-coded its pack/unpack loops to four bytes
regardless of count, reading and writing past the buffer on any odd i4/u4/e2m1
length -- every site now bounds the loop by `nk_size_divide_round_up_(count, 2)`
and guards the odd tail. Rank-0 and empty reductions no longer touch a
non-existent element: C++ `moments()`/`minmax()` indexed `stride_bytes(SIZE_MAX)` on
a 0-D view, the Python rank-0 path fabricated a zero stride, and empty `minmax`
primed its accumulators from element [0].
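
The round-up bound and the odd-tail guard can be sketched in a few lines of Python. This is an illustration, not the code in `cast/serial.h`: the helper names are made up for the example, and storing the first element of each pair in the low nibble is an assumption about the packing order.

```python
def divide_round_up(count, divisor):
    """Number of `divisor`-sized groups needed to hold `count` items."""
    return (count + divisor - 1) // divisor


def pack_u4(values):
    """Pack 4-bit values two per byte; assumes the first of each pair goes in the low nibble."""
    packed = bytearray(divide_round_up(len(values), 2))
    for byte_index in range(len(packed)):
        low = values[2 * byte_index] & 0x0F
        high_index = 2 * byte_index + 1
        # Odd tail: the last byte has no second element, so its high nibble stays zero.
        high = values[high_index] & 0x0F if high_index < len(values) else 0
        packed[byte_index] = low | (high << 4)
    return bytes(packed)


def unpack_u4(packed, count):
    """Unpack exactly `count` 4-bit values without reading past the last needed byte."""
    out = []
    for byte_index in range(divide_round_up(count, 2)):
        byte = packed[byte_index]
        out.append(byte & 0x0F)
        if 2 * byte_index + 1 < count:  # skip the padding nibble on odd counts
            out.append(byte >> 4)
    return out
```

The loop bound depends only on `count`, so a three-element input touches two bytes rather than a fixed four.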

The Python bindings gained the missing input validation. Dense-metric `out=`
buffers are checked for rank and capacity before writing (an undersized out
overflowed the heap); parse_tensor now requires exact inner-axis contiguity,
rejecting the negative/zero strides (e.g. x[::-1]) that the old signed
`> itemsize` check let walk off the buffer; and DLPack import rejects negative
extents and a NULL data pointer. The Rust symmetric matrix verbs reject
non-contiguous-row (transposed) views, matching the packed and parallel paths.

The test suites were deduplicated and extended over the same edges. The Python
*_float/*_integer reduction and arithmetic pairs collapse into parametrized
tests (removing dead precise_*/baseline_sum helpers and a stale DLPack helper),
and new cases cover empty/0-D tensors, NaN/Inf preservation on dense casts, and
block-scaled round-trip / idempotence / byte-identity across all seven formats,
plus the degenerate inputs above. Net test lines shrink while coverage grows.

Comments were stripped of decorative banner separators and ephemeral numbered
"Phase"/"Step"/"Option" labels, whose ordering is implied by code order. Also
sets the rustfmt line width to 120 and bumps the Rust MSRV to 1.73 for
usize::div_ceil.

@ashvardanian force-pushed the main-v8-block-scaling branch from a080325 to aa7dddb on June 12, 2026 11:57
@ashvardanian linked an issue on Jun 12, 2026 that may be closed by this pull request (3 tasks)
@ashvardanian changed the title from "NVFP4 for Rust & C++" to "v8: GPUs, MXFP & NVFP4 for Rust, Python, and C++" on Jun 12, 2026

The block-scaling cast paths reference NULL (`from_tensor_scale != NULL`) in
cast/{serial,neon,icelake,skylake,haswell}.h, and capabilities.h passes NULL to
sysctlbyname. NumKong headers intentionally avoid <stddef.h>, so NULL resolves
only transitively through the C translation units' <stdlib.h>. The Swift module
and WASM/WASI clang builds compile the headers standalone and fail with
"use of undeclared identifier 'NULL'", which blocked the Swift, WASM, and wheel
CI jobs.

Add an NK_NULL macro to types.h (mirroring StringZilla's SZ_NULL: __null on
GCC/Clang, ((void *)0) otherwise) and use it at those call sites -- no header
dependency is introduced.

Also drop the stale float8_e8m0fnu case from test_ml_dtypes_incompatible_rejected:
e8m0 is the OCP UE8M0 scale format, now a first-class dtype via block scaling, so
it is accepted and round-trips exactly. A positive test covers it.

Add `nk_size_mul_checked_` to types.h and route the shape-product and cdist size
computations in python/{each,distance}.c through it, so a buffer reporting an
overflowing shape raises OverflowError instead of wrapping into an undersized
allocation walked at the true extent.
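
The failure mode is easier to see with numbers. The Python sketch below models a 64-bit unsigned size and a checked multiply; `size_mul_checked` and `checked_shape_product` are hypothetical names for this example and do not reproduce the C helper's signature (which, per a later commit in this thread, writes through an `nk_size_t *`).

```python
SIZE_MAX = 2**64 - 1  # assumption: a 64-bit unsigned size type


def size_mul_checked(a, b):
    """Return a * b, or raise OverflowError if the product exceeds SIZE_MAX."""
    if a != 0 and b > SIZE_MAX // a:
        raise OverflowError(f"{a} * {b} does not fit in 64 bits")
    return a * b


def checked_shape_product(shape):
    """Multiply all extents of a shape, refusing to wrap around."""
    total = 1
    for extent in shape:
        total = size_mul_checked(total, extent)
    return total


def wrapping_shape_product(shape):
    """What unchecked 64-bit arithmetic would compute instead."""
    total = 1
    for extent in shape:
        total = (total * extent) & SIZE_MAX
    return total
```

A shape of `(2**33, 2**33)` wraps to 0 under unchecked arithmetic, which is how an allocation ends up far smaller than the extent later walked.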

The elementwise scalar-array paths returned NULL without setting an exception
when the buffer dtype is unsupported (a CPython protocol violation); they now
raise TypeError. The packed and symmetric matrix verbs leaked the owned output
tensor on an invalid row range; they now Py_DECREF it. The DLPack importer
propagates a PyObject_IsTrue error and nulls owner->managed if PyCapsule_SetName
fails, avoiding a double-free of the producer tensor.

cibuildwheel 4.0 removed the `cpython-freethreading` enable group
(free-threaded CPython is built by default now), so the pinned
`CIBW_ENABLE` aborts every wheel job with "Unknown enable group". Drop
it from all seven wheel matrices; `cp31Xt-*` selectors still build the
no-GIL wheels.

Stop pinning the `Visual Studio 17 2022` generator in the Windows C test
and release jobs; let CMake auto-detect whichever Visual Studio the
runner ships (windows-2025 now carries VS18). `-A` still selects the
target architecture.

Intel's downloadmirror.intel.com now sits behind an AWS WAF challenge
(HTTP 202, x-amzn-waf-action: challenge) that hands scripted clients an
HTML page instead of the tarball, so `tar -xf` fails. Fetch SDE from the
petarpetrovt/setup-sde GitHub-hosted mirror instead.

cibuildwheel 4.x removed support for Python 3.13 free-threading entirely
(`cp313t` is no longer a build identifier and the tool errors out on the
selector). Python 3.14 free-threading is the first supported tier and is
built by default, so no enable group is needed. Drop the now-invalid `313t`
matrix rows across every wheel platform; `314t` continues to ship.

`nk_size_mul_checked_` takes `nk_size_t *`, but the cdist and elementwise
bindings passed the address of a `size_t`. On x86_64 `nk_size_t` is `unsigned long long`
while `size_t` is `unsigned long` (same width, distinct types), so GCC's
`-Wincompatible-pointer-types` fires, and gcc-toolset-14 in the manylinux
images treats it as a hard error, breaking every Linux wheel. clang and
older gcc only warned, so it slipped through local builds.

Do the count arithmetic in `nk_size_t` (the helper's and `nk_cast`'s
domain) and convert to `size_t` only at the libc boundary, which also
drops the now-redundant `(nk_size_t)total_elements` casts at the cast
calls.

The ARM64 Windows wheels are cross-compiled on an x64 host. `/openmp:llvm`
makes the extension import `libomp140.aarch64.dll`, which ships with
Visual Studio's ARM64 cross-tools but is not on PATH, so delvewheel could
not find it to bundle and the repair step failed.

Add a step (ARM64 jobs only) that locates `libomp140.aarch64.dll` in the
Visual Studio install and puts its directory on PATH. delvewheel searches
PATH (matching the DLL by architecture), so it now vendors the runtime
into the wheel. OpenMP stays enabled on every platform.

`tensor_slice_suffix_(all_t, rest...)` computed
`inner.byte_data() - first_row.byte_data()` to locate the sliced sub-shape.
When `rest...` over-slices the row to an empty tensor, `inner.byte_data()`
is null, and a pointer subtraction with a null operand is undefined
behavior. Detect the null inner and return an empty view first; a valid
rank-0 scalar slice keeps a non-null pointer and is unaffected.

When the source point cloud has zero variance (all points identical), the
similarity scale is mathematically undefined. The backends disagreed: the
serial oracle and skylake's f32 path divided unguarded and yielded
Inf/NaN, while Haswell, genoa, every NEON/RVV/v128 path, and skylake's
f16/f64/bf16 paths guarded the division to return 0.

Returning 0 silently produces a plausible-looking but wrong transform.
Drop the guards so every backend propagates the Inf/NaN of the undefined
division, matching the serial reference and making the degenerate case an
explicit signal rather than a quiet 0. Non-degenerate inputs are
unchanged (mesh suite passes at machine-epsilon error).
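
The difference between the two behaviors can be shown with a one-dimensional simplification, where the similarity scale reduces to |covariance| / variance of the source. This is an illustrative sketch, not any NumKong kernel: the function names and the `guard_zero_variance` switch are invented for the example, and `ieee_divide` exists only because Python raises on float division by zero where C yields Inf/NaN.

```python
import math


def ieee_divide(numerator, denominator):
    """Divide like C floating point: x/0 gives +/-Inf and 0/0 gives NaN instead of raising."""
    if denominator != 0.0:
        return numerator / denominator
    if numerator == 0.0 or math.isnan(numerator):
        return math.nan
    return math.copysign(math.inf, numerator) * math.copysign(1.0, denominator)


def similarity_scale_1d(source, target, guard_zero_variance=False):
    """Scale of the best 1-D similarity transform mapping `source` onto `target`."""
    count = len(source)
    source_mean = sum(source) / count
    target_mean = sum(target) / count
    covariance = sum((s - source_mean) * (t - target_mean) for s, t in zip(source, target))
    variance = sum((s - source_mean) ** 2 for s in source)
    if guard_zero_variance and variance == 0.0:
        return 0.0  # the old guarded behavior: a quiet, plausible-looking 0
    return ieee_divide(abs(covariance), variance)
```

With all source points identical both sums are zero, so the unguarded path returns NaN in this sketch while the guarded path returns 0.
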

Construction sizes storage as `shape.product() / dimensions_per_value()`;
for a rank-0 tensor the empty product is 1, so it allocates one element.
`Drop` special-cased `ndim == 0` to a storage count of 0 and skipped the
deallocation, leaking that element. Compute the count the same way
construction does so the element is freed (rank-0 sub-byte stays at 0 via
the same integer division, matching its dangling allocation).
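
The arithmetic behind this fix is small enough to check by hand. The sketch below restates the storage-count rule in Python; `storage_count` is a hypothetical helper for the example, not the Rust implementation.

```python
import math


def storage_count(shape, dimensions_per_value=1):
    """Stored values needed for a shape; the product of an empty shape is 1."""
    return math.prod(shape) // dimensions_per_value


rank0_dense = storage_count(())        # one element is allocated for a rank-0 tensor
rank0_sub_byte = storage_count((), 2)  # two 4-bit dimensions per byte: 1 // 2 == 0
matrix = storage_count((3, 4))
```

Using the same expression at construction and at drop time keeps the two counts from disagreeing on the rank-0 case.
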

The in-place ops (scale/add, scalar add/sub/mul, sin/cos/atan, and
add/sub/mul-tensor) routed through `try_reborrow_tensor_inplace`, which
fabricated a read-only TensorView aliasing the mutable TensorSpan over the
same storage. The kernel walk then formed an overlapping `&[T]` + `&mut [T]`
over those bytes, which is undefined behavior under Stacked/Tree Borrows (the C
kernels tolerate the aliasing, but Rust's model does not).

In-place mutation now lives on TensorSpan and operates on its own storage:
each per-type `each_*_inplace` entry derives both raw pointers from the
single `&mut` (source == dest, which the C kernels accept), and the
binary ops keep `other` as a disjoint `&[T]`. No read-view of the target
is ever constructed. `try_reborrow_tensor_inplace` is deleted; Tensor's
in-place methods delegate to `self.span()`. Error handling is made
consistent: unary in-place is infallible, binary validates shapes and
returns Result. Adds in-place == out-of-place regression tests.

Run `cargo +nightly fmt` so the crate matches rustfmt.toml (fn_single_line,
max_width=120, unstable_features). Pure formatting, no behavior change.
Committing the reflow on its own keeps it out of logic diffs and makes
future `fmt` runs no-ops.

The `all_t` slice overload guarded `inner.byte_data() - first_row.byte_data()`
against a null inner, but the parallel `range` overload did the same pointer
subtraction unguarded, which is UB when a nested slice over-slices to an empty
tensor. Extract one `slice_inner_byte_offset_` helper (also rejecting a null
first_row) and use it in both overloads.

The WASM-via-Wasmer job built `wasmer-cli` from `--git main` to get
relaxed-SIMD support. Upstream's main now pulls a private SSH submodule
(git@github.com:wasmerio/quickjs.git) that fails to authenticate in CI, so
the install aborts. Relaxed-SIMD now ships in released Wasmer, so install
the prebuilt latest release via get.wasmer.io, with no source build and no
submodule fetch.

Two RVV cast bugs, both verified against the serial oracle under
qemu-riscv64 (CI only compiles RVV, so these were never caught):

- f32→f16 (`nk_f32m2_to_f16m1_rvv_`) was a simplified conversion that
  clamped the exponent and rounded half-up, so overflow did not saturate
  to infinity (65520 → a finite value, 70000 → garbage), denormals and
  underflow were wrong, and rounding was not round-to-nearest-even.
  Rewritten to compute every IEEE bucket (zero, inf/nan, underflow,
  denormal, normal, overflow) with RNE and merge by exponent mask,
  matching nk_f32_to_f16_serial bit-for-bit (incl. 200 random values).

- The packed i4/u4 ↔ i8/u8 casts sized their vector loop as `count / 2`,
  dropping the trailing element on odd counts (and skipping the whole
  cast at count == 1). Add a scalar tail for the odd high nibble,
  clamping packs to [-8,7]/[0,15] to match the vector helpers.
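
For the first bullet, the bucket-by-bucket conversion can be written as a scalar reference in Python. This is a sketch of IEEE f32 to f16 conversion with round-to-nearest-even, not the RVV kernel or `nk_f32_to_f16_serial`; the function name is made up for the example, and mapping every NaN to a quiet NaN with a fixed payload is an assumption.

```python
import struct


def f32_to_f16_bits(value):
    """Convert a float (taken as f32) to IEEE binary16 bits using round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    sign = (bits >> 16) & 0x8000
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF

    if exponent == 0xFF:  # Inf stays Inf; any NaN becomes a quiet NaN (assumed payload)
        return sign | 0x7C00 | (0x0200 if mantissa else 0)
    unbiased = exponent - 127
    if unbiased > 15:  # overflow: too large for f16, saturate to infinity
        return sign | 0x7C00
    if unbiased >= -14:  # normal f16: drop the low 13 mantissa bits
        magnitude = ((unbiased + 15) << 10) | (mantissa >> 13)
        remainder, halfway = mantissa & 0x1FFF, 0x1000
    elif unbiased >= -25:  # denormal f16: shift the full 24-bit significand
        shift = -unbiased - 1
        significand = mantissa | 0x800000
        magnitude = significand >> shift
        remainder, halfway = significand & ((1 << shift) - 1), 1 << (shift - 1)
    else:  # underflow (including f32 zeros and denormals): signed zero
        return sign
    # Round to nearest, ties to even; a carry may correctly roll into the exponent.
    if remainder > halfway or (remainder == halfway and (magnitude & 1)):
        magnitude += 1
    return sign | magnitude
```

Under these rules 65520 is an exact tie that rounds up to infinity (0x7C00), while 65504 remains the largest finite f16 (0x7BFF).
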
Development

Successfully merging this pull request may close these issues.

Feature: MXFP4 support, NVFP4, and block-scaling

1 participant