CUDA & C++ 20 Compatibility, DLPack, Float8 & 3D Mesh Speedups by ashvardanian · Pull Request #349 · ashvardanian/NumKong

ashvardanian · 2026-04-16T13:27:18Z

CUDA & C++ 20 Compatibility

NVCC 13 caps its language-standard flag at C++20, and our multi-argument subscript overloads from C++23 P2128 made tensor.hpp unparseable by cudafe++. We added call-operator primaries that mirror every multi-argument subscript overload in the tensor view, span, and owning container types, and kept the bracket sugar behind an __cpp_multidimensional_subscript feature test so older toolchains pick the portable spelling automatically. Downstream CUDA callers now parse the tensor header without touching their language-standard flag.

DLPack & Zero-copy Exchange with PyTorch, JAX, & Arrow

Tensors now exchange zero-copy in both directions with every Python framework that implements the DLPack protocol — PyTorch, NumPy, JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla — using DLPack 1.3's versioned capsules and the max_version handshake. This finally carries semantic dtype identity across the bridge: bf16 and the four narrow float variants E4M3FN, E5M2, E2M3, and E3M2 round-trip without losing their type, where PEP 3118 and the legacy array interface previously degraded them to raw unsigned bytes. The importer accepts every device whose pointer is host-dereferenceable — plain CPU, pinned host memory on CUDA and ROCm, CUDA managed unified memory, Intel oneAPI host and shared USM, and Metal on Apple Silicon — while pure device memory is rejected with the offending device code named. The exporter stays strict and only emits the CPU device. Sub-byte types u1, u4, and i4 ride as byte containers, and the ABI is declared inline as six structs and twelve device codes rather than vendoring an external header, mirroring NumPy's own approach. Validated against torch 2.11, numpy 2.4, jax 0.10, tensorflow 2.21, pyarrow 23, cupy 13.6, and onnxruntime 1.24 on H100 with 127 tests passing.

import numpy as np, torch, numkong as nk

# NumKong → PyTorch: zero-copy FP8 round-trip preserves dtype identity.
src = torch.zeros(4, 6, dtype=torch.float8_e4m3fn)
nk_view = nk.from_dlpack(src); pt_back = torch.from_dlpack(nk_view)
assert nk_view.shape == (4, 6) and nk_view.dtype == "e4m3" and pt_back.dtype == torch.float8_e4m3fn

# Mutation through one view is visible through the other — proves zero-copy.
tensor = nk.Tensor(np.arange(24, dtype=np.float32).reshape(4, 6)); pt = torch.from_dlpack(tensor)
pt[0, 0] = 99; assert np.asarray(tensor)[0, 0] == 99

Upstream DLPack PRs that this bridge interoperates with, already referenced from our DLPack interop source:

pytorch/pytorch#57110 — PyTorch DLPack protocol, later upgraded to DLPack 1.0 with the max_version handshake
tensorflow/community#180 — TensorFlow DLPack RFC, implemented as tf.experimental.dlpack
microsoft/onnxruntime#23110 — ONNX Runtime OrtValue DLPack enabled by default for inference

Faster Single-Pass 3D Mesh Alignment Algorithms

Kabsch/Umeyama mesh alignment now folds into a single pass via the trace identity

$$\mathrm{SSD} = \lVert a - \bar a \rVert^2 + \lVert b - \bar b \rVert^2 - 2,\mathrm{tr}(R \cdot H)$$

replacing the earlier two-pass approach — covariance first, transformed-SSD second — across all nine backends. An identity-dominant short-circuit skips SVD entirely when $H$ approximates a positive diagonal, saving around 500 cycles on already-aligned inputs. Two new backends land alongside: a Genoa kernel that uses VDPBF16PS for channel-grouped bf16 reductions, and a NEON+FP16FML kernel that uses vfmlalq for fp16 widening FMA, while the existing NEON+BFDOT path picks up vbfdotq_f32 for its bf16 stats pass. A centered-RMSD bug in the NEON and NEON+FP16FML paths is fixed in passing.

Faster Float8 Linear Algebra on x86

Pairwise FP8 distance kernels — sqeuclidean, euclidean, and angular — on Skylake and Haswell now compute the squared difference directly in F32 after a free-shift widen. E5M2 abuses its shared exponent bias with F16: a byte-to-word unpack against zero places the byte as a valid F16 encoding. E4M3 uses a Giesen-style fake-F16 cast that shifts the mantissa up by seven, reinjects the sign at bit fifteen, widens with vcvtph2ps, and multiplies by 256 to correct the bias delta. Per-pair speedups range from 1.4× for E4M3 angular on Skylake to 4.9× for E5M2 sqeuclidean on Haswell on a pinned Xeon 6776P. The redundant Genoa E5M2 pairwise kernels are deleted because the rewritten Skylake path runs on Genoa silicon and beats the old vdpbf16ps-chain form by 2.4×.

Stateful FP8 GEMMs follow the same trajectory. E5M2 byte-packs into a new dtype-specific update helper that runs two FMA chains into a single state accumulator, landing at 1.4–2.5× on Skylake and up to 3.2× on Haswell for the packed dot, angular, and euclidean variants. E4M3 GEMMs on Skylake switch to an asymmetric F16-pack scheme where A streams as F32 while B is pre-cast at pack time and stored as F16, halving packed-B memory with compute neutral against baseline. Granite Rapids gets a brand-new E5M2 GEMM that packs E5M2 into F16 with a single byte shift and runs TDPFP16PS over F16 tiles, beating the Sapphire AMX BF16 path on E5M2 inputs with better intermediate precision at the same throughput. Dispatch wires it ahead of Sapphire AMX so Granite hardware automatically picks it up.

NVCC 13 caps `--std` at `c++20`. Tensor's multi-arg `operator[]` (C++23 P2128) made `tensor.hpp` unparseable by cudafe++. Back-port: Add `operator()` primaries mirroring every multi-arg `operator[]` overload in `tensor_view`, `tensor_span`, and `tensor`. Keep the `[]` sugar behind `NK_HAS_MULTIDIMENSIONAL_SUBSCRIPT_`, a tensor.hpp-local feature-test macro that folds `__cpp_multidimensional_subscript >= 202110L`. `test_tensor.cpp` exercises both syntaxes side-by-side on P2128-capable compilers.

…NFHM kernels Replace the two-pass kabsch/umeyama pattern (centroid+covariance pass, then a separate transformed_ssd pass) with a single-pass trace-identity fold: SSD = ‖a-ā‖² + ‖b-b̄‖² − 2·trace(R·H). Norm-squared accumulators piggyback on the existing stats loop, eliminating ~150 lines of transformed_ssd helpers per backend. Applied to all 9 backends (serial, haswell, skylake, genoa, neon, neonbfdot, neonfhm, v128relaxed, rvv). Added identity-dominant short-circuit that skips SVD when H ≈ diag(positive), saving ~500 cy on aligned inputs. New backends: mesh/genoa.h (VDPBF16PS channel-grouped bf16), mesh/neonfhm.h (vfmlalq fp16 widening FMA). Integrated vbfdotq_f32 into neonbfdot bf16 stats pass. Fixed rmsd_f16 centered-RMSD bug in neon/neonfhm. Refactored QR helper to drop unused R output. Consistent snake_case naming throughout (cross_covariance, optimal_rotation, svd_left/diagonal/right, trace_rotation_covariance). Added math-notation glossary to each backend header.

Tensors now exchange zero-copy in both directions with PyTorch, NumPy, JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla -- every library implementing the Python Array API DLPack protocol. PEP 3118 and __array_interface__ cannot describe bf16 / fp8 / fp6, so those dtypes round-tripped through NumPy as raw uint bytes and lost their semantic identity. DLPack 1.3 carries kDLBfloat / kDLFloat8_e4m3fn / kDLFloat8_e5m2 / kDLFloat6_* codes -- the dtype now survives the bridge. The importer accepts every device_type whose pointer is host-dereferenceable: plain CPU, pinned host (CUDA / ROCm), cudaMallocManaged unified memory, Intel oneAPI host / shared USM, and Metal on Apple Silicon. Pure device memory (kDLCUDA, kDLROCM, kDLOpenCL, kDLVulkan, kDLWebGPU, kDLHexagon, kDLMAIA, kDLTrn, kDLVPI, kDLExtDev) is rejected with the device code named. The exporter stays strict -- only (kDLCPU, 0) is emitted. Legacy v0 ("dltensor") and versioned v1.3 ("dltensor_versioned") capsules are both supported via the max_version handshake. FP6 silently upgrades to v1 because the IS_SUBBYTE_TYPE_PADDED flag -- required for NumKong's byte-padded layout -- only exists on the versioned struct. Sub-byte types (u1, u4, i4) exchange as byte containers. Consumed capsules are renamed to "used_dltensor[_versioned]" per spec; the producer's deleter runs exactly once via an internal owner that holds the GIL across Py_XDECREF and PyMem_Free, since PyTorch's c10 finalizer invokes deleters without the GIL on CPython 3.12. The exchanged ABI is declared inline (six structs, twelve device codes, version.major checked at runtime) rather than vendoring dmlc/dlpack.h, mirroring NumPy's approach in numpy/_core/src/multiarray/dlpack.c. Per-framework export + import matrices cover every native dtype, including bf16 + fp8_e4m3fn + fp8_e5m2 on PyTorch and bf16 on JAX / TensorFlow / MLX. Three CuPy tests enforce the device-acceptance contract on real hardware (H100): cudaMalloc rejects, cudaMallocManaged accepts, cudaMallocHost accepts. With torch 2.11 / numpy 2.4 / jax 0.10 / tensorflow-cpu 2.21 / pyarrow 23 / cupy-cuda12x 13.6 / onnxruntime 1.24 on Python 3.12: 127 passed, 1 skipped (ONNX RT inference build lacks to_dlpack per microsoft/onnxruntime#23110).

Apply `#pragma GCC optimize("no-tree-vectorize","no-tree-slp-vectorize", "no-ipa-cp-clone","no-inline")` (Clang: `noinline` attribute push) around the serial kernel instantiation regions in dots, mesh, spatials, cast, each, spatial, geospatial, reduce, maxsim, dot, and sparse `*/serial.h` files. Without this, -O3 + LTO cross-TU codegen cloned serial kernels into AVX-512 bodies under dispatch callers, wasting binary and violating the nk_*_serial-as-scalar-oracle contract that tests and numerical-stability docs rely on. Also collapse the `_aligned_` / `_1x8_aligned_` fast-path helpers inside `nk_define_cross_packed_` / `nk_define_cross_compensated_packed_` and the `_galloping_search_` / `_linear_scan_` helpers inside `nk_define_sparse_intersect_` from NK_PUBLIC to NK_INTERNAL so they inline into their dispatchers — removes structural path divergence, not just compiler policy. Portable libnumkong.so: 4.51 MB → 3.56 MB (−930 KB, −21%). Side effect: graniteamx kernels that LTO was silently DCE-ing now survive to link (+38 KB), which was a latent perf regression on Granite Rapids.

Replace the Skylake & Haswell E5M2/E4M3 sqeuclidean/euclidean/angular kernels with direct (a-b)² in F32 after a free-shift widen, and delete the now- redundant Genoa E5M2 pairwise path (the algebraic a²+b²−2ab form via three vdpbf16ps chains saturated the reservation station for ~77% scheduler stalls). E5M2 shares F16's exponent bias (15), so `vpunpck*bw` against zero places the E5M2 byte in the high half of each 16-bit lane — free widen + <<8 shift at once, no explicit slli. E4M3 has no such alignment, but a Giesen-style cast fits: `((byte & 0x7F) << 7) | ((byte & 0x80) << 8)` yields a fake F16 whose F16 value differs from the true E4M3 magnitude by exactly 2⁸ (bias delta); `vcvtph2ps` widens, multiply by 256 corrects. Subnormals handle themselves via F16 subnormal semantics; NaN (|byte|==0x7F) is blended explicitly. The new E4M3 cast replaces the 10-op i32-lane bit-math helper in cast/{skylake,haswell}.h so every downstream caller (dots_packed, angulars_symmetric, each_scale, etc.) transparently picks up the faster path. Inner-loop widths: Skylake E5M2 64 lanes/iter (half extracts free via castsi512_si256), Skylake E4M3 32 lanes/iter, Haswell both at 16 lanes/iter. Two F32 accumulators per kernel break the FMA dependency chain. Tail uses a zero-padded wide load (sum-shape reductions tolerate zero contributions) so the slow branch runs the same body as the fast branch — single-dichotomy `if (n < FAST_WIDTH) else` inside the `if+goto _cycle` idiom. Measured cycles/pair on a pinned Xeon 6776P core (D=128 → D=768): sqeuclidean_e5m2_skylake 47.1 → 13.1 ns (3.6×) ; 263 → 73.2 ns (3.6×) euclidean_e5m2_skylake 48.5 → 14.2 ns (3.4×) ; 265 → 77.0 ns (3.4×) angular_e5m2_skylake 53.5 → 25.5 ns (2.1×) ; 280 → 94.1 ns (3.0×) sqeuclidean_e4m3_skylake 54.2 → 34.3 ns (1.6×) ; 317 → 193 ns (1.6×) euclidean_e4m3_skylake 55.3 → 34.3 ns (1.6×) ; 315 → 193 ns (1.6×) angular_e4m3_skylake 62.0 → 44.3 ns (1.4×) ; 335 → 217 ns (1.5×) sqeuclidean_e5m2_haswell 82.8 → 16.9 ns (4.9×) ; 436 → 93.9 ns (4.6×) euclidean_e5m2_haswell 79.6 → 18.5 ns (4.3×) ; 429 → 96.6 ns (4.4×) angular_e5m2_haswell 89.9 → 28.7 ns (3.1×) ; 453 → 147 ns (3.1×) sqeuclidean_e4m3_haswell 105 → 59.5 ns (1.8×) ; 583 → 347 ns (1.7×) euclidean_e4m3_haswell 103 → 43.4 ns (2.4×) ; 560 → 244 ns (2.3×) angular_e4m3_haswell 112 → 72.4 ns (1.5×) ; 572 → 382 ns (1.5×) Deleted nk_sqeuclidean_e5m2_genoa / nk_euclidean_e5m2_genoa / nk_angular_e5m2_genoa (32.0 / 32.0 / 45.0 ns at D=128); the rewritten Skylake kernel is 2.4× faster and runs on Genoa silicon (AVX-512F+BW, no FP16 needed). Dispatch entries dropped in c/dispatch_e5m2.c; genoa bf16 kernels, dot_e5m2_genoa, and all e5m2 packed/symmetric/reduce_moments kernels stay. Max ULP within NK_ULP_THRESHOLD_F16 (32) for all 12 rewritten kernels. Stateful / GEMM helpers (dots_*_{skylake,haswell}, angulars_*_*, euclideans_*_*) remain within ±5% of baseline since they route through the same cast helpers — transparent beneficiaries of the simpler E4M3 cast.

… kernel E5M2 cast (`cast/{skylake,haswell}.h`): - Rewrite `nk_e5m2x{16,8}_to_f32x{16,8}_*` as 3-op (cvtepu8 + slli 8 + cvtph_ps). E5M2 shares F16 bias, so `byte << 8` is the matching F16 encoding. Pairwise dot (`dot/{skylake,haswell}.h`): - E5M2 widened to 64/16-lane multi-chain inline unpack: Skylake 3.17-3.94×, Haswell 2.07-5.05× across D=100..768. - E4M3 unchanged (16-lane single-chain was already at the cast cost limit). Stateful GEMM (`dot/`, `dots/{skylake,haswell}.h`): - E5M2: byte-pack + new dtype-specific `nk_dot_e5m2x{64,32}_update_*` with two independent FMA chains folding into the single state accumulator. Skylake dots/angulars/euclideans 1.42-2.52×, Haswell up to 3.17×. - E4M3 Skylake: asymmetric F16-pack (A streamed as F32, B pre-cast and stored as F16, widened to F32 on load). 50% memory savings for the packed B; compute ~0.89-1.16× of baseline (accepted tradeoff). - E4M3 Haswell: byte-pack at depth=32, neutral 0.95-1.03×. New Granite Rapids E5M2 GEMM (`dots/graniteamx.h`, `spatials/graniteamx.h`): - Pack E5M2 → F16 via `byte << 8`, run TDPFP16PS over F16 tiles to F32. Beats Sapphire AMX BF16 path on E5M2 inputs (better intermediate precision, same throughput). Wired through `dispatch_e5m2.c` ahead of Sapphire AMX. Sapphire AMX (`dots/sapphireamx.h`): keeps icelake LUT cast helpers (empirical eval showed Genoa-Giesen path regressed; revert preserved). Spatials (`spatials/skylake.h`): E4M3 normalize_packed packed_value_type follows dots packed_value_type (f16) for matching norm offset. Bonus: collapse split string literals across JS/Python error messages (`javascript/numkong.c`, `python/{each,matrix,tensor}.c`). ULP: dots/spatials max_ulp ≤ 1 vs scalar reference (threshold 32).

ashvardanian added 4 commits April 15, 2026 11:03

Make: Normalize base PowerPC & LoongArch cap for JS

ab81191

Make: Enable symbol exports for nk_shared Emscripten builds

482e4fd

ashvardanian changed the title ~~CUDA & C++ 20 Compatibility, NVFP4, Faster 3D Meshes, & Black-Scaled Matrices~~ CUDA & C++ 20 Compatibility, DLPack, NVFP4, Faster 3D Meshes, Complex Math, Black-Scaled Matrices Apr 18, 2026

ashvardanian changed the title ~~CUDA & C++ 20 Compatibility, DLPack, NVFP4, Faster 3D Meshes, Complex Math, Black-Scaled Matrices~~ CUDA & C++ 20 Compatibility, DLPack, NVFP4, Faster 3D Meshes, Complex Math, Black-Scaled Matrices, Pre-fill & Decode Interop Apr 18, 2026

ashvardanian added 3 commits April 18, 2026 19:08

ashvardanian changed the title ~~CUDA & C++ 20 Compatibility, DLPack, NVFP4, Faster 3D Meshes, Complex Math, Black-Scaled Matrices, Pre-fill & Decode Interop~~ CUDA & C++ 20 Compatibility, DLPack, Faster 3D Meshes, & Float8 Speedups Apr 19, 2026

ashvardanian changed the title ~~CUDA & C++ 20 Compatibility, DLPack, Faster 3D Meshes, & Float8 Speedups~~ CUDA & C++ 20 Compatibility, DLPack, Float8 & 3D Mesh Speedups Apr 19, 2026

ashvardanian merged commit d3dfc17 into main Apr 20, 2026
101 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA & C++ 20 Compatibility, DLPack, Float8 & 3D Mesh Speedups#349

CUDA & C++ 20 Compatibility, DLPack, Float8 & 3D Mesh Speedups#349
ashvardanian merged 8 commits into
mainfrom
main-dev

ashvardanian commented Apr 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ashvardanian commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!