Releases: ashvardanian/NumKong
v7.6: CUDA & C++20 Compatibility, DLPack 1.3 Views, Float8 & 3D Mesh Speedups
CUDA & C++20 Compatibility
NVCC 13 caps its language-standard flag at C++20, and our multi-argument subscript overloads from C++23 P2128 made `tensor.hpp` unparseable by `cudafe++`. We added call-operator primaries that mirror every multi-argument subscript overload in the tensor view, span, and owning container types, and kept the bracket sugar behind an `__cpp_multidimensional_subscript` feature test so older toolchains pick the portable spelling automatically. Downstream CUDA callers now parse the tensor header without touching their language-standard flag.
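A minimal sketch of the pattern, using a hypothetical `view_2d` type rather than NumKong's actual classes:

```cpp
#include <cstddef>

template <typename scalar_type_>
struct view_2d {
    scalar_type_ *data_;
    std::size_t stride_;

    // Call-operator primary: parses under C++20, including NVCC 13's cudafe++.
    scalar_type_ &operator()(std::size_t row, std::size_t col) const noexcept {
        return data_[row * stride_ + col];
    }

#if defined(__cpp_multidimensional_subscript)
    // C++23 P2128 bracket sugar, exposed only when the feature test passes.
    scalar_type_ &operator[](std::size_t row, std::size_t col) const noexcept {
        return operator()(row, col);
    }
#endif
};
```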
DLPack & Zero-copy Exchange with PyTorch, JAX, & Arrow
Tensors now exchange zero-copy in both directions with every Python framework that implements the DLPack protocol (PyTorch, NumPy, JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla), using DLPack 1.3's versioned capsules and the `max_version` handshake. This finally carries semantic dtype identity across the bridge: bf16 and the four narrow float variants E4M3FN, E5M2, E2M3, and E3M2 round-trip without losing their type, where PEP 3118 and the legacy array interface previously degraded them to raw unsigned bytes.

The importer accepts every device whose pointer is host-dereferenceable (plain CPU, pinned host memory on CUDA and ROCm, CUDA managed unified memory, Intel oneAPI host and shared USM, and Metal on Apple Silicon), while pure device memory is rejected with the offending device code named. The exporter stays strict and only emits the CPU device. Sub-byte types `u1`, `u4`, and `i4` ride as byte containers, and the ABI is declared inline as six structs and twelve device codes rather than vendoring an external header, mirroring NumPy's own approach. Validated against torch 2.11, numpy 2.4, jax 0.10, tensorflow 2.21, pyarrow 23, cupy 13.6, and onnxruntime 1.24 on H100 with 127 tests passing.
```py
import numpy as np, torch, numkong as nk

# NumKong ↔ PyTorch: zero-copy FP8 round-trip preserves dtype identity.
src = torch.zeros(4, 6, dtype=torch.float8_e4m3fn)
nk_view = nk.from_dlpack(src)
pt_back = torch.from_dlpack(nk_view)
assert nk_view.shape == (4, 6) and nk_view.dtype == "e4m3" and pt_back.dtype == torch.float8_e4m3fn

# Mutation through one view is visible through the other: proves zero-copy.
tensor = nk.Tensor(np.arange(24, dtype=np.float32).reshape(4, 6))
pt = torch.from_dlpack(tensor)
pt[0, 0] = 99
assert np.asarray(tensor)[0, 0] == 99
```

Upstream DLPack PRs that this bridge interoperates with, already referenced from our DLPack interop source:
- pytorch/pytorch#57110: PyTorch DLPack protocol, later upgraded to DLPack 1.0 with the `max_version` handshake
- tensorflow/community#180: TensorFlow DLPack RFC, implemented as `tf.experimental.dlpack`
- microsoft/onnxruntime#23110: ONNX Runtime `OrtValue` DLPack enabled by default for inference
Faster Single-Pass 3D Mesh Alignment Algorithms
Kabsch/Umeyama mesh alignment now folds into a single pass via the trace identity sketched below, replacing the earlier two-pass approach (covariance first, transformed-SSD second) across all nine backends. An identity-dominant short-circuit skips SVD entirely when the cross-covariance is already dominated by its diagonal and the optimal rotation is numerically the identity.
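For centered point clouds $P, Q \in \mathbb{R}^{N \times 3}$ with cross-covariance $H = P^\top Q = U \Sigma V^\top$, the standard reflection-corrected form of that identity (reconstructed here for reference) reads:

$$\min_R \sum_{i=1}^{N} \lVert R\,q_i - p_i \rVert^2 = \operatorname{tr}(P^\top P) + \operatorname{tr}(Q^\top Q) - 2\,(\sigma_1 + \sigma_2 + d\,\sigma_3), \qquad d = \operatorname{sign}\det(V U^\top),$$

so a single pass can accumulate $\operatorname{tr}(P^\top P)$, $\operatorname{tr}(Q^\top Q)$, and the nine entries of $H$ together, and the minimized SSD then follows from the singular values without ever applying the rotation.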
Faster Float8 Linear Algebra on x86
Pairwise FP8 distance kernels (sqeuclidean, euclidean, and angular) on Skylake and Haswell now compute the squared difference directly in F32 after a free-shift widen. E5M2 exploits sharing F16's exponent width and bias: a byte-to-word unpack against zero places each byte in the high half of a 16-bit lane, which is already a valid F16 encoding of the same value. E4M3 uses a Giesen-style fake-F16 cast that shifts the mantissa up by seven, reinjects the sign at bit fifteen, widens with `vcvtph2ps`, and multiplies by 256 to correct the bias delta. Per-pair speedups range from 1.4× for E4M3 angular on Skylake to 4.9× for E5M2 sqeuclidean on Haswell on a pinned Xeon 6776P. The redundant Genoa E5M2 pairwise kernels are deleted: the rewritten Skylake path runs on Genoa silicon and beats the old `vdpbf16ps`-chain form by 2.4×.
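A scalar sketch of both widening tricks; the shipped kernels implement these with SIMD unpacks and `vcvtph2ps`, and the helper names here are illustrative:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode IEEE binary16 bits to a float, to check the tricks below.
float f16_bits_to_f32(std::uint16_t h) {
    int exp = (h >> 10) & 0x1F, man = h & 0x3FF;
    float mag = exp == 0 ? std::ldexp(man / 1024.0f, -14)
                         : std::ldexp(1.0f + man / 1024.0f, exp - 15);
    return (h >> 15) ? -mag : mag;
}

// E5M2 shares F16's exponent width and bias, so placing the byte in the
// high half of a 16-bit word is already the F16 encoding of the same value.
std::uint16_t e5m2_to_f16_bits(std::uint8_t b) { return std::uint16_t(b) << 8; }

// Giesen-style E4M3 fake-F16: exponent+mantissa shifted up by 7, sign
// reinjected at bit 15. The bias delta (15 - 7) makes the fake F16 decode
// to value / 256, so the widened result is multiplied back by 256.
float e4m3_to_f32(std::uint8_t b) {
    std::uint16_t fake = std::uint16_t((b & 0x7F) << 7) | std::uint16_t((b & 0x80) << 8);
    return f16_bits_to_f32(fake) * 256.0f;
}

int main() {
    std::printf("%g\n", f16_bits_to_f32(e5m2_to_f16_bits(0x3C))); // 0x3C in E5M2 is 1.0
    std::printf("%g\n", e4m3_to_f32(0x38));                       // 0x38 in E4M3 is 1.0
}
```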
Stateful FP8 GEMMs follow the same trajectory. E5M2 byte-packs into a new dtype-specific update helper that runs two FMA chains into a single state accumulator, landing at 1.4–2.5× on Skylake and up to 3.2× on Haswell for the packed dot, angular, and euclidean variants. E4M3 GEMMs on Skylake switch to an asymmetric F16-pack scheme where A streams as F32 while B is pre-cast at pack time and stored as F16, halving packed-B memory while staying compute-neutral against baseline. Granite Rapids gets a brand-new E5M2 GEMM that packs E5M2 into F16 with a single byte shift and runs `TDPFP16PS` over F16 tiles, beating the Sapphire AMX BF16 path on E5M2 inputs with better intermediate precision at the same throughput. Dispatch wires it ahead of Sapphire AMX, so Granite hardware picks it up automatically.
Minor
- Add: DLPack 1.3 interop bridge for numkong.Tensor (ea74fe1)
- Add: Back-port tensor API to C++20 for CUDA (ad93068)
Patch
- Improve: FP8 GEMM throughput on Skylake/Haswell + Granite Rapids E5M2 kernel (c19bec9)
- Improve: FP8 pairwise distance kernels via Giesen trick + F16 widen path (679f55f)
- Fix: Keep `*_serial` kernels scalar across LTO (455d535)
- Make: Enable symbol exports for `nk_shared` Emscripten builds (482e4fd)
- Improve: SSD trace-identity fold across all mesh backends + Genoa/NEONFHM kernels (e9d40e5)
- Make: Normalize base PowerPC & LoongArch cap for JS (ab81191)
v7.5: Parallelism & Portability
- Built-in OpenMP bundling for JS & Python
- Intel Granite Rapids F16 & F32 GEMMs
- Faster bit-vector population counts for Arm NEON
- SME compatibility with non-Apple Clang on Apple machines
- Hardening against MSan SVE false-positives, thanks to @alexey-milovidov
- Hardening against GCC 13 Arm NEON code-gen bugs, thanks to @swasik
- `_into` & `_parallel` GEMM Rust APIs: reusing memory & ForkUnion pools
- De-vectorize serial kernels with compiler flags
- Compress source & binary distributions for Windows
- Pre-build & share FreeBSD, PowerPC, RISC-V, & LoongArch libs
Minor
- Add: NEON popcount kernel for nk_reduce_moments_u1 (2181e0c)
- Add: Tensor constructors, sealed trait family, div_ceil cleanup (2792279)
- Add: Span-based matrix `_into` APIs, parallel Hammings/Jaccards, full-crate docs (99289df)
- Add: OpenMP for Python & JavaScript (499ecc9)
- Add: Granite Rapids AMX for F16 & F32 (28036ea)
Patch
- Fix: Native ISA probe on Apple Clang + compile/runtime glyph (bc13e02)
- Make: Detect illegal instructions in macOS CI (289cdaf)
- Fix: Drop `-march=` on macOS setup.py builds (28aac74)
- Fix: Exclude `std::signal` from WASM builds (14814c5)
- Improve: Drop GNU statement-expression macros in SVE reduce helpers (b8b4ca0)
- Make: Drop `+nosimd` from AArch64 baseline (23f5195)
- Make: Forbid auto-vectorization in portable baseline builds (43e8324)
- Make: Pin TU baseline to per-arch ABI floor across build systems (453ed5f)
- Fix: Mitigate GCC 13 wrong BF16 splat in Arm NEON (#346) (fc3d8ec)
- Improve: Log faulting capability detection (a401f8a)
- Improve: Log faulting kernel on fatal signals in `nk_test` (22c7c79)
- Make: Normalize Python test dependencies across CI and docs (8a0f3d4)
- Make: Baseline-only ISA for shared-library test, harden Windows CI (1907685)
- Fix: Wrong compiler probes for SMEBF16 & SMEBI32 (8b19ddb)
- Make: Log host CPU capabilities in macOS and Windows CI jobs (988eeb2)
- Fix: Pre-declare OpenMP loop counter, universal libomp for macOS (493a021)
- Fix: Use int for OpenMP loop counters, absolute libomp install name (ccc0118)
- Fix: GCC requires +sme prefix in target attribute for _arm_sc* stubs (291dc0a)
- Fix: Signed OpenMP iterators, source-built libomp, JS KMP guard (dc1ae75)
- Fix: OpenMP wheel builds on macOS and Windows (f569121)
- Fix: Add target("sme") to _arm_sc* stubs for GCC compatibility (ad2add0)
- Fix: Unpoison SVE scalar reductions for MemorySanitizer (#342) (b42eda7)
- Improve: Move SME runtime stubs to types.h as weak inline definitions (64ca934)
- Improve: Manual SME streaming control, single enter/exit per API call (6432837)
- Fix: Update `cdist` edge-case test for re-added `threads=` kwarg (50681af)
- Make: Allow force-enabling ISA targets via environment variables (0e58702)
- Improve: Abandon F32 → F64 via Ozaki on Granite Rapids (94a5f19)
- Make: FreeBSD, PPC64le, LoongArch, RISC-V releases & compress Windows (a9a0d83)
- Make: Standardize CI compilers and add Windows test job (9a22ea4)
- Make: Shrink serial fallbacks with scoped size optimization (83154a8)
- Make: Compress Windows builds (e30ad3d)
- Fix: Streaming-compatible stubs for LLVM SME builds (0be7b2f)
v7.4.5: Faster RMSD
v7.4.4: CI & MSVC Hardening
- Fix: ARMv7 Rust cross-compilation with CC for versioned GCC (a5e67e6)
- Make: `check_source_runs`-probing like `march=native` on MSVC (7a152f3)
- Fix: Drop `_MM_FROUND_NO_EXC` from `_mm256_cvtps_ph` calls (8649b0c)
- Fix: Guard against old MSVC preprocessor (25d3304)
- Make: Enforce newer preprocessor in MSVC (be966af)
- Make: Cleaner CIBW artifact names & env forwarding (a6cf642)
- Make: Forward cross-compilation flags for macOS wheels (6ed3b8c)
- Make: Split ppc64le, s390x, i686 CIBW runs (c01795c)
Release v7.4.3
Patch
- Fix: Require AArch64 for NEON kernels (2ba1b34)
- Docs: Table order & formatting (8673a56)
- Make: Avoid `--all-features` in Rust cross-compilation CI (8be8bff)
- Improve: Arm32 compatibility (6404172)
- Make: `cancel-in-progress` CI to shift compute resources (dfc8fa0)
- Improve: Harden Swift SDK for 6.1+ toolkit (965cd52)
- Make: Strip `.unsafeFlags` & list platforms for SPM consumption (b061b78)
- Make: Expose `CNumKongDispatch` target to Swift users (6aa00a8)
Release v7.4.2
Release v7.4.1
v7.4: Fast Tensor Contractions
- Faster tensor contractions
- Faster GEMM "packers" with SIMD
- New SVE+SDOT kernels for `i8`
- MSVC build stability on Arm
Minor
- Add: WASM elementwise ops & spatial mini-float kernels (81b8c44)
- Add: WASM type-casting kernels (e09df31)
- Add: SVE+SDOT ops for 8-bit integers (913fc6b)
Patch
- Fix: Misplaced NEON loads/stores in Sierra (05e3045)
- Fix: Avoid unconditional `np` symbols (9dffb68)
- Make: Resolve probe locations for NPM consumers (c602f45)
- Docs: Refined "What's Inside" (28f35cd)
- Docs: Mini-float kernel selection strategy (04e6598)
- Improve: Accelerate PyTests, reduce `Decimal` use (2417248)
- Make: Move `.pyi` for PyLance (688ec2d)
- Fix: Inconsistent SME function qualifiers (5b4148a)
- Improve: Smaller test inputs under QEMU (ee36bf2)
- Improve: Vectorize GEMM "packers" (86127a4)
- Make: Longer timeouts for QEMU in CI (a9cc732)
- Fix: `vec_t` store helper args order (eecbcac)
- Fix: Negative stride tensor reductions (3ea81be)
- Improve: Recursive stride collapsing and axis-lane fast paths for N-D reductions (cf8eaf6)
- Improve: Faster reductions in strided tensors (61651ed)
- Improve: Wider NEON curved, mesh, & probability F16 kernels (1c17678)
- Fix: Harden mini-float type-casting (1911b89)
- Make: Ship `win32-arm64` NPM builds (578b7ad)
- Make: Auto-bump JS platform-specific versions (5617f75)
- Fix: `vcombine` instead of initializer lists for NEON arrays in MSVC (906c178)
- Fix: Avoid flaky `vld1_f16` for MSVC (7a987d2)
v7.3: Hardened Arm Kernels, Upgraded CI, Citations, & Docs
This release hardens Arm kernels across NEON, SVE, and SME. The most widespread fix replaces `_x` (don't-care) predicated intrinsics with `_m` (merging) variants that force inactive lanes to a known value: lanes left undefined by `_x` could carry stale register data into reductions, producing wrong results for non-power-of-two dimensions on real SVE hardware. Partial-tail padding in BMOPA is fixed for sub-32-bit types, and strided reductions in NEON are hardened against off-by-ones in non-contiguous layouts.
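A minimal sketch of the hazard and the fix, assuming a plain sum reduction rather than NumKong's actual kernels:

```cpp
#include <arm_sve.h>
#include <cstdint>

// Sum `n` floats where `n` need not be a multiple of the vector length.
float sum_f32(float const *data, std::int64_t n) {
    svfloat32_t acc = svdup_f32(0.0f);
    for (std::int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t vals = svld1_f32(pg, data + i);
        // `_m` merges inactive lanes from `acc`, keeping them well-defined;
        // the `_x` form leaves them undefined, so stale register contents
        // could leak into the horizontal reduction below on real hardware.
        acc = svadd_f32_m(pg, acc, vals);
    }
    return svaddv_f32(svptrue_b32(), acc);
}
```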
Thanks to the @ClickHouse team for help hardening tail loads and @albumentations-team for strided reductions!
On the performance side, NEON gets faster in-vector finalizers, `vcvt_high` for cheaper F16/BF16 widening, and new SDOT fallbacks for `i4` and `e3m2` that previously required SME, bringing sub-byte arithmetic to the much larger NEON install base. Streaming SVE picks up Giesen's trick for E4M3 → F16 and faster mini-float norms. SME GEMMs use fewer branches in the inner loop.
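As a minimal sketch of the `vcvt_high` saving (the helper name is illustrative, not NumKong's API):

```cpp
#include <arm_neon.h>

// Widen 8 F16 lanes to 2x4 F32 lanes. The naive top half needs an extract
// (`vget_high_f16`) before converting; `vcvt_high_f32_f16` reads the top
// half directly and saves that instruction.
inline void widen_f16x8(float16x8_t h, float32x4_t *lo, float32x4_t *hi) {
    *lo = vcvt_f32_f16(vget_low_f16(h));
    *hi = vcvt_high_f32_f16(h); // instead of vcvt_f32_f16(vget_high_f16(h))
}
```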
Also, NumKong now ships a `CITATION.cff`: hit "Cite this repository" on GitHub to grab it in case you are writing a paper on a related topic.
Minor
- Add: NEON & SDOT fallbacks for `i4` & `e3m2` (0c6afa5)
Patch
- Docs: M5 perf stats for Wasmtime v43 (43c2881)
- Fix: Alternative MSVC-friendly cast (4744b9b)
- Make: Disable LTCG due to MSVC issues (3d37684)
- Make: Try `PREBUILDS_ONLY=0` in CI (64c5f95)
- Improve: Lower NEONHALF → NEON requirements (37f99ec)
- Fix: Wire `nk_cast_neon` benchmarks (3793af2)
- Docs: Apple M5 native stats for secondary workloads (d7c81c4)
- Improve: Faster in-vector 4-way finalizers in NEON (968dcd1)
- Improve: Drop `nk_f16x4_to_f32x4_neon` (84bb20a)
- Improve: `vcvt_high` for faster unpacking (a5f4a19)
- Docs: Refresh GEMM/SYRK measurements Apple M4 → M5 (3e010de)
- Fix: Harden strided reductions in NEON & AVX2 (61ac67b)
- Fix: Double-counted tail in Skylake `f64` RMSD, Kabsch, and Umeyama (5391344)
- Improve: Share `decimal.Context.traps` rules (3c28ae9)
- Fix: Padding partial tail 32-bit words for `BMOPA` (2598487)
- Fix: Missing scale type definitions of mini-floats (91862da)
- Fix: Scalar buffer cast internal overwrites & aliasing (7b0e129)
- Fix: Top-bottom variable names (a014134)
- Improve: Giesen's E4M3 → F16 in Streaming SVE (25322b5)
- Improve: Fewer branches in SME GEMMs (858263c)
- Fix: Up-round dimensions count in sub-byte C++ tests (87a72d0)
- Make: Focus on M4 CPUs for SME probing (5ff63eb)
- Improve: PyTesting across more shapes (4bc3e44)
- Improve: Cleaner type-casting & promotion rules (23c2474)
- Make: Hide formatting commits for v7-7.2 (f6ce2da)
- Make: Native addon resolution for Deno & Bun (0d502d5)
- Docs: Citations (6220137)
- Improve: Faster mini-float norms in Streaming SVE (088de57)
- Make: Integrate PyRight (0fe56c0)
- Fix: F16 norms in SSVE skipped odd entries (bf3bfee)
- Fix: Harden SVE MaxSim upcasting logic (803eb33)
- Fix: Disable `FPCR.AH` bit (7b2b850)
- Make: Node 24 for trusted publishing (9f1a4ef)
- Fix: `_m` to zero-out predicated SVE/SME ops (16c157b)
- Fix: `_m` to zero-out predicated SVE lanes in `spatial/` (ac27cde)
- Make: Replace stale `prebuildify` (74c5454)