Skip to content

CUDA & C++ 20 Compatibility, DLPack, Float8 & 3D Mesh Speedups#349

Merged
ashvardanian merged 8 commits into
mainfrom
main-dev
Apr 20, 2026
Merged

CUDA & C++ 20 Compatibility, DLPack, Float8 & 3D Mesh Speedups#349
ashvardanian merged 8 commits into
mainfrom
main-dev

Conversation

@ashvardanian

@ashvardanian ashvardanian commented Apr 16, 2026

Copy link
Copy Markdown
Owner

CUDA & C++ 20 Compatibility

NVCC 13 caps its language-standard flag at C++20, and our multi-argument subscript overloads from C++23 P2128 made tensor.hpp unparseable by cudafe++. We added call-operator primaries that mirror every multi-argument subscript overload in the tensor view, span, and owning container types, and kept the bracket sugar behind an __cpp_multidimensional_subscript feature test so older toolchains pick the portable spelling automatically. Downstream CUDA callers now parse the tensor header without touching their language-standard flag.

DLPack & Zero-copy Exchange with PyTorch, JAX, & Arrow

Tensors now exchange zero-copy in both directions with every Python framework that implements the DLPack protocol — PyTorch, NumPy, JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla — using DLPack 1.3's versioned capsules and the max_version handshake. This finally carries semantic dtype identity across the bridge: bf16 and the four narrow float variants E4M3FN, E5M2, E2M3, and E3M2 round-trip without losing their type, where PEP 3118 and the legacy array interface previously degraded them to raw unsigned bytes. The importer accepts every device whose pointer is host-dereferenceable — plain CPU, pinned host memory on CUDA and ROCm, CUDA managed unified memory, Intel oneAPI host and shared USM, and Metal on Apple Silicon — while pure device memory is rejected with the offending device code named. The exporter stays strict and only emits the CPU device. Sub-byte types u1, u4, and i4 ride as byte containers, and the ABI is declared inline as six structs and twelve device codes rather than vendoring an external header, mirroring NumPy's own approach. Validated against torch 2.11, numpy 2.4, jax 0.10, tensorflow 2.21, pyarrow 23, cupy 13.6, and onnxruntime 1.24 on H100 with 127 tests passing.

import numpy as np, torch, numkong as nk

# NumKong → PyTorch: zero-copy FP8 round-trip preserves dtype identity.
src = torch.zeros(4, 6, dtype=torch.float8_e4m3fn)
nk_view = nk.from_dlpack(src); pt_back = torch.from_dlpack(nk_view)
assert nk_view.shape == (4, 6) and nk_view.dtype == "e4m3" and pt_back.dtype == torch.float8_e4m3fn

# Mutation through one view is visible through the other — proves zero-copy.
tensor = nk.Tensor(np.arange(24, dtype=np.float32).reshape(4, 6)); pt = torch.from_dlpack(tensor)
pt[0, 0] = 99; assert np.asarray(tensor)[0, 0] == 99

Upstream DLPack PRs that this bridge interoperates with, already referenced from our DLPack interop source:

Faster Single-Pass 3D Mesh Alignment Algorithms

Kabsch/Umeyama mesh alignment now folds into a single pass via the trace identity

$$\mathrm{SSD} = \lVert a - \bar a \rVert^2 + \lVert b - \bar b \rVert^2 - 2,\mathrm{tr}(R \cdot H)$$

replacing the earlier two-pass approach — covariance first, transformed-SSD second — across all nine backends. An identity-dominant short-circuit skips SVD entirely when $H$ approximates a positive diagonal, saving around 500 cycles on already-aligned inputs. Two new backends land alongside: a Genoa kernel that uses VDPBF16PS for channel-grouped bf16 reductions, and a NEON+FP16FML kernel that uses vfmlalq for fp16 widening FMA, while the existing NEON+BFDOT path picks up vbfdotq_f32 for its bf16 stats pass. A centered-RMSD bug in the NEON and NEON+FP16FML paths is fixed in passing.

Faster Float8 Linear Algebra on x86

Pairwise FP8 distance kernels — sqeuclidean, euclidean, and angular — on Skylake and Haswell now compute the squared difference directly in F32 after a free-shift widen. E5M2 abuses its shared exponent bias with F16: a byte-to-word unpack against zero places the byte as a valid F16 encoding. E4M3 uses a Giesen-style fake-F16 cast that shifts the mantissa up by seven, reinjects the sign at bit fifteen, widens with vcvtph2ps, and multiplies by 256 to correct the bias delta. Per-pair speedups range from 1.4× for E4M3 angular on Skylake to 4.9× for E5M2 sqeuclidean on Haswell on a pinned Xeon 6776P. The redundant Genoa E5M2 pairwise kernels are deleted because the rewritten Skylake path runs on Genoa silicon and beats the old vdpbf16ps-chain form by 2.4×.

Stateful FP8 GEMMs follow the same trajectory. E5M2 byte-packs into a new dtype-specific update helper that runs two FMA chains into a single state accumulator, landing at 1.4–2.5× on Skylake and up to 3.2× on Haswell for the packed dot, angular, and euclidean variants. E4M3 GEMMs on Skylake switch to an asymmetric F16-pack scheme where A streams as F32 while B is pre-cast at pack time and stored as F16, halving packed-B memory with compute neutral against baseline. Granite Rapids gets a brand-new E5M2 GEMM that packs E5M2 into F16 with a single byte shift and runs TDPFP16PS over F16 tiles, beating the Sapphire AMX BF16 path on E5M2 inputs with better intermediate precision at the same throughput. Dispatch wires it ahead of Sapphire AMX so Granite hardware automatically picks it up.

NVCC 13 caps `--std` at `c++20`. Tensor's multi-arg `operator[]` (C++23
P2128) made `tensor.hpp` unparseable by cudafe++. Back-port:

Add `operator()` primaries mirroring every multi-arg `operator[]`
overload in `tensor_view`, `tensor_span`, and `tensor`. Keep the `[]`
sugar behind `NK_HAS_MULTIDIMENSIONAL_SUBSCRIPT_`, a tensor.hpp-local
feature-test macro that folds `__cpp_multidimensional_subscript >= 202110L`.
`test_tensor.cpp` exercises both syntaxes side-by-side on P2128-capable
compilers.
…NFHM kernels

Replace the two-pass kabsch/umeyama pattern (centroid+covariance pass, then a
separate transformed_ssd pass) with a single-pass trace-identity fold:
SSD = ‖a-ā‖² + ‖b-b̄‖² − 2·trace(R·H). Norm-squared accumulators piggyback on
the existing stats loop, eliminating ~150 lines of transformed_ssd helpers per
backend. Applied to all 9 backends (serial, haswell, skylake, genoa, neon,
neonbfdot, neonfhm, v128relaxed, rvv). Added identity-dominant short-circuit
that skips SVD when H ≈ diag(positive), saving ~500 cy on aligned inputs.

New backends: mesh/genoa.h (VDPBF16PS channel-grouped bf16), mesh/neonfhm.h
(vfmlalq fp16 widening FMA). Integrated vbfdotq_f32 into neonbfdot bf16 stats
pass. Fixed rmsd_f16 centered-RMSD bug in neon/neonfhm. Refactored QR helper
to drop unused R output. Consistent snake_case naming throughout (cross_covariance,
optimal_rotation, svd_left/diagonal/right, trace_rotation_covariance). Added
math-notation glossary to each backend header.
@ashvardanian ashvardanian changed the title CUDA & C++ 20 Compatibility, NVFP4, Faster 3D Meshes, & Black-Scaled Matrices CUDA & C++ 20 Compatibility, DLPack, NVFP4, Faster 3D Meshes, Complex Math, Black-Scaled Matrices Apr 18, 2026
@ashvardanian ashvardanian changed the title CUDA & C++ 20 Compatibility, DLPack, NVFP4, Faster 3D Meshes, Complex Math, Black-Scaled Matrices CUDA & C++ 20 Compatibility, DLPack, NVFP4, Faster 3D Meshes, Complex Math, Black-Scaled Matrices, Pre-fill & Decode Interop Apr 18, 2026
Tensors now exchange zero-copy in both directions with PyTorch, NumPy,
JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla --
every library implementing the Python Array API DLPack protocol.
PEP 3118 and __array_interface__ cannot describe bf16 / fp8 / fp6, so
those dtypes round-tripped through NumPy as raw uint bytes and lost
their semantic identity. DLPack 1.3 carries kDLBfloat / kDLFloat8_e4m3fn
/ kDLFloat8_e5m2 / kDLFloat6_* codes -- the dtype now survives the bridge.

The importer accepts every device_type whose pointer is host-dereferenceable:
plain CPU, pinned host (CUDA / ROCm), cudaMallocManaged unified memory,
Intel oneAPI host / shared USM, and Metal on Apple Silicon. Pure device
memory (kDLCUDA, kDLROCM, kDLOpenCL, kDLVulkan, kDLWebGPU, kDLHexagon,
kDLMAIA, kDLTrn, kDLVPI, kDLExtDev) is rejected with the device code
named. The exporter stays strict -- only (kDLCPU, 0) is emitted.

Legacy v0 ("dltensor") and versioned v1.3 ("dltensor_versioned") capsules
are both supported via the max_version handshake. FP6 silently upgrades
to v1 because the IS_SUBBYTE_TYPE_PADDED flag -- required for NumKong's
byte-padded layout -- only exists on the versioned struct. Sub-byte
types (u1, u4, i4) exchange as byte containers. Consumed capsules are
renamed to "used_dltensor[_versioned]" per spec; the producer's deleter
runs exactly once via an internal owner that holds the GIL across
Py_XDECREF and PyMem_Free, since PyTorch's c10 finalizer invokes
deleters without the GIL on CPython 3.12.

The exchanged ABI is declared inline (six structs, twelve device codes,
version.major checked at runtime) rather than vendoring dmlc/dlpack.h,
mirroring NumPy's approach in numpy/_core/src/multiarray/dlpack.c.

Per-framework export + import matrices cover every native dtype,
including bf16 + fp8_e4m3fn + fp8_e5m2 on PyTorch and bf16 on JAX /
TensorFlow / MLX. Three CuPy tests enforce the device-acceptance
contract on real hardware (H100): cudaMalloc rejects, cudaMallocManaged
accepts, cudaMallocHost accepts. With torch 2.11 / numpy 2.4 / jax 0.10
/ tensorflow-cpu 2.21 / pyarrow 23 / cupy-cuda12x 13.6 / onnxruntime
1.24 on Python 3.12: 127 passed, 1 skipped (ONNX RT inference build
lacks to_dlpack per microsoft/onnxruntime#23110).
Apply `#pragma GCC optimize("no-tree-vectorize","no-tree-slp-vectorize",
"no-ipa-cp-clone","no-inline")` (Clang: `noinline` attribute push) around
the serial kernel instantiation regions in dots, mesh, spatials, cast, each,
spatial, geospatial, reduce, maxsim, dot, and sparse `*/serial.h` files.
Without this, -O3 + LTO cross-TU codegen cloned serial kernels into
AVX-512 bodies under dispatch callers, wasting binary and violating the
nk_*_serial-as-scalar-oracle contract that tests and numerical-stability
docs rely on.

Also collapse the `_aligned_` / `_1x8_aligned_` fast-path helpers inside
`nk_define_cross_packed_` / `nk_define_cross_compensated_packed_` and the
`_galloping_search_` / `_linear_scan_` helpers inside
`nk_define_sparse_intersect_` from NK_PUBLIC to NK_INTERNAL so they inline
into their dispatchers — removes structural path divergence, not just
compiler policy.

Portable libnumkong.so: 4.51 MB → 3.56 MB (−930 KB, −21%). Side effect:
graniteamx kernels that LTO was silently DCE-ing now survive to link
(+38 KB), which was a latent perf regression on Granite Rapids.
Replace the Skylake & Haswell E5M2/E4M3 sqeuclidean/euclidean/angular kernels
with direct (a-b)² in F32 after a free-shift widen, and delete the now-
redundant Genoa E5M2 pairwise path (the algebraic a²+b²−2ab form via three
vdpbf16ps chains saturated the reservation station for ~77% scheduler stalls).

E5M2 shares F16's exponent bias (15), so `vpunpck*bw` against zero places the
E5M2 byte in the high half of each 16-bit lane — free widen + <<8 shift at
once, no explicit slli. E4M3 has no such alignment, but a Giesen-style cast
fits: `((byte & 0x7F) << 7) | ((byte & 0x80) << 8)` yields a fake F16 whose
F16 value differs from the true E4M3 magnitude by exactly 2⁸ (bias delta);
`vcvtph2ps` widens, multiply by 256 corrects. Subnormals handle themselves via
F16 subnormal semantics; NaN (|byte|==0x7F) is blended explicitly. The new
E4M3 cast replaces the 10-op i32-lane bit-math helper in cast/{skylake,haswell}.h
so every downstream caller (dots_packed, angulars_symmetric, each_scale, etc.)
transparently picks up the faster path.

Inner-loop widths: Skylake E5M2 64 lanes/iter (half extracts free via
castsi512_si256), Skylake E4M3 32 lanes/iter, Haswell both at 16 lanes/iter.
Two F32 accumulators per kernel break the FMA dependency chain. Tail uses
a zero-padded wide load (sum-shape reductions tolerate zero contributions)
so the slow branch runs the same body as the fast branch — single-dichotomy
`if (n < FAST_WIDTH) else` inside the `if+goto _cycle` idiom.

Measured cycles/pair on a pinned Xeon 6776P core (D=128 → D=768):
  sqeuclidean_e5m2_skylake   47.1 → 13.1 ns  (3.6×) ; 263 → 73.2 ns  (3.6×)
  euclidean_e5m2_skylake     48.5 → 14.2 ns  (3.4×) ; 265 → 77.0 ns  (3.4×)
  angular_e5m2_skylake       53.5 → 25.5 ns  (2.1×) ; 280 → 94.1 ns  (3.0×)
  sqeuclidean_e4m3_skylake   54.2 → 34.3 ns  (1.6×) ; 317 → 193 ns   (1.6×)
  euclidean_e4m3_skylake     55.3 → 34.3 ns  (1.6×) ; 315 → 193 ns   (1.6×)
  angular_e4m3_skylake       62.0 → 44.3 ns  (1.4×) ; 335 → 217 ns   (1.5×)
  sqeuclidean_e5m2_haswell   82.8 → 16.9 ns  (4.9×) ; 436 → 93.9 ns  (4.6×)
  euclidean_e5m2_haswell     79.6 → 18.5 ns  (4.3×) ; 429 → 96.6 ns  (4.4×)
  angular_e5m2_haswell       89.9 → 28.7 ns  (3.1×) ; 453 → 147 ns   (3.1×)
  sqeuclidean_e4m3_haswell    105 → 59.5 ns  (1.8×) ; 583 → 347 ns   (1.7×)
  euclidean_e4m3_haswell      103 → 43.4 ns  (2.4×) ; 560 → 244 ns   (2.3×)
  angular_e4m3_haswell        112 → 72.4 ns  (1.5×) ; 572 → 382 ns   (1.5×)

Deleted nk_sqeuclidean_e5m2_genoa / nk_euclidean_e5m2_genoa /
nk_angular_e5m2_genoa (32.0 / 32.0 / 45.0 ns at D=128); the rewritten Skylake
kernel is 2.4× faster and runs on Genoa silicon (AVX-512F+BW, no FP16 needed).
Dispatch entries dropped in c/dispatch_e5m2.c; genoa bf16 kernels,
dot_e5m2_genoa, and all e5m2 packed/symmetric/reduce_moments kernels stay.

Max ULP within NK_ULP_THRESHOLD_F16 (32) for all 12 rewritten kernels.
Stateful / GEMM helpers (dots_*_{skylake,haswell}, angulars_*_*, euclideans_*_*)
remain within ±5% of baseline since they route through the same cast helpers —
transparent beneficiaries of the simpler E4M3 cast.
@ashvardanian ashvardanian changed the title CUDA & C++ 20 Compatibility, DLPack, NVFP4, Faster 3D Meshes, Complex Math, Black-Scaled Matrices, Pre-fill & Decode Interop CUDA & C++ 20 Compatibility, DLPack, Faster 3D Meshes, & Float8 Speedups Apr 19, 2026
@ashvardanian ashvardanian changed the title CUDA & C++ 20 Compatibility, DLPack, Faster 3D Meshes, & Float8 Speedups CUDA & C++ 20 Compatibility, DLPack, Float8 & 3D Mesh Speedups Apr 19, 2026
… kernel

E5M2 cast (`cast/{skylake,haswell}.h`):
- Rewrite `nk_e5m2x{16,8}_to_f32x{16,8}_*` as 3-op (cvtepu8 + slli 8 + cvtph_ps).
  E5M2 shares F16 bias, so `byte << 8` is the matching F16 encoding.

Pairwise dot (`dot/{skylake,haswell}.h`):
- E5M2 widened to 64/16-lane multi-chain inline unpack:
  Skylake 3.17-3.94×, Haswell 2.07-5.05× across D=100..768.
- E4M3 unchanged (16-lane single-chain was already at the cast cost limit).

Stateful GEMM (`dot/`, `dots/{skylake,haswell}.h`):
- E5M2: byte-pack + new dtype-specific `nk_dot_e5m2x{64,32}_update_*` with
  two independent FMA chains folding into the single state accumulator.
  Skylake dots/angulars/euclideans 1.42-2.52×, Haswell up to 3.17×.
- E4M3 Skylake: asymmetric F16-pack (A streamed as F32, B pre-cast and
  stored as F16, widened to F32 on load). 50% memory savings for the
  packed B; compute ~0.89-1.16× of baseline (accepted tradeoff).
- E4M3 Haswell: byte-pack at depth=32, neutral 0.95-1.03×.

New Granite Rapids E5M2 GEMM (`dots/graniteamx.h`, `spatials/graniteamx.h`):
- Pack E5M2 → F16 via `byte << 8`, run TDPFP16PS over F16 tiles to F32.
  Beats Sapphire AMX BF16 path on E5M2 inputs (better intermediate
  precision, same throughput). Wired through `dispatch_e5m2.c` ahead
  of Sapphire AMX.

Sapphire AMX (`dots/sapphireamx.h`): keeps icelake LUT cast helpers
(empirical eval showed Genoa-Giesen path regressed; revert preserved).

Spatials (`spatials/skylake.h`): E4M3 normalize_packed packed_value_type
follows dots packed_value_type (f16) for matching norm offset.

Bonus: collapse split string literals across JS/Python error messages
(`javascript/numkong.c`, `python/{each,matrix,tensor}.c`).

ULP: dots/spatials max_ulp ≤ 1 vs scalar reference (threshold 32).
@ashvardanian ashvardanian merged commit d3dfc17 into main Apr 20, 2026
101 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant