Releases: ashvardanian/NumKong

v7.6: CUDA & C++20 Compatibility, DLPack 1.3 Views, Float8 & 3D Mesh Speedups

20 Apr 00:38

CUDA & C++20 Compatibility

NVCC 13 caps its language-standard flag at C++20, and our multi-argument subscript overloads from C++23 P2128 made tensor.hpp unparseable by cudafe++. We added call-operator primaries that mirror every multi-argument subscript overload in the tensor view, span, and owning container types, and kept the bracket sugar behind an __cpp_multidimensional_subscript feature test so older toolchains pick the portable spelling automatically. Downstream CUDA callers now parse the tensor header without touching their language-standard flag.

DLPack & Zero-copy Exchange with PyTorch, JAX, & Arrow

Tensors now exchange zero-copy in both directions with every Python framework that implements the DLPack protocol — PyTorch, NumPy, JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla — using DLPack 1.3's versioned capsules and the max_version handshake. This finally carries semantic dtype identity across the bridge: bf16 and the four narrow float variants E4M3FN, E5M2, E2M3, and E3M2 round-trip without losing their type, where PEP 3118 and the legacy array interface previously degraded them to raw unsigned bytes. The importer accepts every device whose pointer is host-dereferenceable — plain CPU, pinned host memory on CUDA and ROCm, CUDA managed unified memory, Intel oneAPI host and shared USM, and Metal on Apple Silicon — while pure device memory is rejected with the offending device code named. The exporter stays strict and only emits the CPU device. Sub-byte types u1, u4, and i4 ride as byte containers, and the ABI is declared inline as six structs and twelve device codes rather than vendoring an external header, mirroring NumPy's own approach. Validated against torch 2.11, numpy 2.4, jax 0.10, tensorflow 2.21, pyarrow 23, cupy 13.6, and onnxruntime 1.24 on H100 with 127 tests passing.

import numpy as np, torch, numkong as nk

# NumKong → PyTorch: zero-copy FP8 round-trip preserves dtype identity.
src = torch.zeros(4, 6, dtype=torch.float8_e4m3fn)
nk_view = nk.from_dlpack(src)
pt_back = torch.from_dlpack(nk_view)
assert nk_view.shape == (4, 6) and nk_view.dtype == "e4m3"
assert pt_back.dtype == torch.float8_e4m3fn

# Mutation through one view is visible through the other — proves zero-copy.
tensor = nk.Tensor(np.arange(24, dtype=np.float32).reshape(4, 6))
pt = torch.from_dlpack(tensor)
pt[0, 0] = 99
assert np.asarray(tensor)[0, 0] == 99

The bridge also interoperates with upstream DLPack PRs, which are referenced directly from our DLPack interop source.

Faster Single-Pass 3D Mesh Alignment Algorithms

Kabsch/Umeyama mesh alignment now folds into a single pass via the trace identity

$$\mathrm{SSD} = \lVert a - \bar a \rVert^2 + \lVert b - \bar b \rVert^2 - 2\,\mathrm{tr}(R \cdot H)$$

replacing the earlier two-pass approach — covariance first, transformed-SSD second — across all nine backends. An identity-dominant short-circuit skips SVD entirely when $H$ approximates a positive diagonal, saving around 500 cycles on already-aligned inputs. Two new backends land alongside: a Genoa kernel that uses VDPBF16PS for channel-grouped bf16 reductions, and a NEON+FP16FML kernel that uses vfmlalq for fp16 widening FMA, while the existing NEON+BFDOT path picks up vbfdotq_f32 for its bf16 stats pass. A centered-RMSD bug in the NEON and NEON+FP16FML paths is fixed in passing.
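
The trace identity above can be checked in a few lines of NumPy. This is a conceptual sketch of the math, not the library's internals: variable names are ours, and the optimal rotation is computed with the textbook Kabsch SVD step.

```python
import numpy as np

# Build two roughly aligned 3D point sets: b is a rotated, shifted, noisy a.
rng = np.random.default_rng(0)
a = rng.standard_normal((100, 3))
rot, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
rot *= np.sign(np.linalg.det(rot))                  # force a proper rotation
b = a @ rot.T + rng.standard_normal(3) + 0.01 * rng.standard_normal((100, 3))

ac, bc = a - a.mean(0), b - b.mean(0)               # center both sets
h = ac.T @ bc                                       # 3x3 cross-covariance, one pass
u, s, vt = np.linalg.svd(h)
d = np.sign(np.linalg.det(vt.T @ u.T))              # reflection guard
r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T             # optimal rotation (Kabsch)

# Single-pass SSD via the trace identity vs. an explicit transformed second pass.
ssd_trace = (ac**2).sum() + (bc**2).sum() - 2.0 * np.trace(r @ h)
ssd_direct = ((ac @ r.T - bc) ** 2).sum()
assert np.isclose(ssd_trace, ssd_direct)
```

The identity holds for any rotation $R$, which is what lets the kernels fold the SSD into the same pass that accumulates the covariance $H$.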

Faster Float8 Linear Algebra on x86

Pairwise FP8 distance kernels — sqeuclidean, euclidean, and angular — on Skylake and Haswell now compute the squared difference directly in F32 after a free-shift widen. E5M2 abuses its shared exponent bias with F16: a byte-to-word unpack against zero places the byte as a valid F16 encoding. E4M3 uses a Giesen-style fake-F16 cast that shifts the mantissa up by seven, reinjects the sign at bit fifteen, widens with vcvtph2ps, and multiplies by 256 to correct the bias delta. Per-pair speedups range from 1.4× for E4M3 angular on Skylake to 4.9× for E5M2 sqeuclidean on Haswell on a pinned Xeon 6776P. The redundant Genoa E5M2 pairwise kernels are deleted because the rewritten Skylake path runs on Genoa silicon and beats the old vdpbf16ps-chain form by 2.4×.
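
Both widening tricks can be modeled in plain NumPy. This is a conceptual sketch only: the real kernels do the same bit moves with SIMD unpacks and vcvtph2ps, and E4M3FN NaN encodings are out of scope here.

```python
import numpy as np

def e5m2_to_f16(b):
    # E5M2 shares F16's exponent width and bias, so placing the byte in the
    # high half of a 16-bit word is already a valid F16 encoding of the value.
    return (b.astype(np.uint16) << 8).view(np.float16)

def e4m3_to_f32(b):
    # Giesen-style fake-F16 cast: shift the 7 magnitude bits up by 7, reinject
    # the sign at bit 15, reinterpret as F16, then multiply by 256 to fix the
    # exponent-bias delta (F16 bias 15 vs. E4M3 bias 7, i.e. 2^8).
    magnitude = (b & np.uint8(0x7F)).astype(np.uint16) << 7
    sign = (b & np.uint8(0x80)).astype(np.uint16) << 8
    return (magnitude | sign).view(np.float16).astype(np.float32) * 256.0

assert e5m2_to_f16(np.uint8([0x3C]))[0] == 1.0   # E5M2 0x3C -> F16 0x3C00 = 1.0
assert e4m3_to_f32(np.uint8([0x38]))[0] == 1.0   # E4M3: s=0, e=7, m=0 -> 1.0
assert e4m3_to_f32(np.uint8([0xC0]))[0] == -2.0  # E4M3: s=1, e=8, m=0 -> -2.0
```

Note that the E4M3 trick also lands subnormals in the F16 subnormal range, so the 256× correction covers them for free.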

Stateful FP8 GEMMs follow the same trajectory. E5M2 byte-packs into a new dtype-specific update helper that runs two FMA chains into a single state accumulator, landing at 1.4–2.5× on Skylake and up to 3.2× on Haswell for the packed dot, angular, and euclidean variants. E4M3 GEMMs on Skylake switch to an asymmetric F16-pack scheme where A streams as F32 while B is pre-cast at pack time and stored as F16, halving packed-B memory while staying compute-neutral against the baseline. Granite Rapids gets a brand-new E5M2 GEMM that packs E5M2 into F16 with a single byte shift and runs TDPFP16PS over F16 tiles, beating the Sapphire AMX BF16 path on E5M2 inputs with better intermediate precision at the same throughput. Dispatch wires it ahead of Sapphire AMX so Granite hardware automatically picks it up.
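
The asymmetric pack scheme is easy to sketch in NumPy. This is illustrative only: shapes and names are ours, and NumPy's matmul stands in for the actual tiled GEMM kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((128, 64)).astype(np.float32)
b = rng.standard_normal((64, 96)).astype(np.float32)

# B is cast once at pack time and stored as F16, halving its footprint;
# A keeps streaming in full F32.
b_packed = b.astype(np.float16)
assert b_packed.nbytes == b.nbytes // 2

c_ref = a @ b                            # all-F32 baseline
c_mix = a @ b_packed.astype(np.float32)  # widen B tiles back to F32 at compute time
assert np.allclose(c_ref, c_mix, atol=0.05)
```

The only precision loss is the one-time F32 → F16 rounding of B, which is why the scheme stays close to the all-F32 baseline.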

Minor

  • Add: DLPack 1.3 interop bridge for numkong.Tensor (ea74fe1)
  • Add: Back-port tensor API to C++20 for CUDA (ad93068)

Patch

  • Improve: FP8 GEMM throughput on Skylake/Haswell + Granite Rapids E5M2 kernel (c19bec9)
  • Improve: FP8 pairwise distance kernels via Giesen trick + F16 widen path (679f55f)
  • Fix: Keep *_serial kernels scalar across LTO (455d535)
  • Make: Enable symbol exports for nk_shared Emscripten builds (482e4fd)
  • Improve: SSD trace-identity fold across all mesh backends + Genoa/NEONFHM kernels (e9d40e5)
  • Make: Normalize base PowerPC & LoongArch cap for JS (ab81191)

v7.5: Parallelism & Portability

14 Apr 08:55

  • Built-in OpenMP bundling for JS & Python 🐍
  • Intel Granite Rapids 🪨 F16 → F32 GEMMs 💎
  • Faster bit-vector population counts for Arm NEON 🦾
  • SME compatibility with non-Apple Clang on Apple machines 🍏
  • Hardening against MSan SVE false-positives, thanks to @alexey-milovidov 🦺
  • Hardening against GCC 13 Arm NEON code-gen bugs, thanks to @swasik 🐂
  • _into & _parallel GEMM Rust APIs: reusing memory & ForkUnion pools 🆕
  • De-vectorize serial kernels with compiler flags 🎏
  • Compress source & binary distributions for Windows 🗜️
  • Pre-build & share FreeBSD, PowerPC, RISC-V, & LoongArch libs 🤗

Minor

  • Add: NEON popcount kernel for nk_reduce_moments_u1 (2181e0c)
  • Add: Tensor constructors, sealed trait family, div_ceil cleanup (2792279)
  • Add: Span-based matrix _into APIs, parallel Hammings/Jaccards, full-crate docs (99289df)
  • Add: OpenMP for Python & JavaScript (499ecc9)
  • Add: Granite Rapids AMX for F16 & F32 (28036ea)

Patch

  • Fix: Native ISA probe on Apple Clang + compile/runtime glyph (bc13e02)
  • Make: Detect illegal instructions in macOS CI (289cdaf)
  • Fix: Drop -march= on macOS setup.py builds (28aac74)
  • Fix: Exclude std::signal from WASM builds (14814c5)
  • Improve: Drop GNU statement-expression macros in SVE reduce helpers (b8b4ca0)
  • Make: Drop +nosimd from AArch64 baseline (23f5195)
  • Make: Forbid auto-vectorization in portable baseline builds (43e8324)
  • Make: Pin TU baseline to per-arch ABI floor across build systems (453ed5f)
  • Fix: Mitigate GCC 13 wrong BF16 splat in Arm NEON (#346) (fc3d8ec)
  • Improve: Log faulting capability detection (a401f8a)
  • Improve: Log faulting kernel on fatal signals in nk_test (22c7c79)
  • Make: Normalize Python test dependencies across CI and docs (8a0f3d4)
  • Make: Baseline-only ISA for shared-library test, harden Windows CI (1907685)
  • Fix: Wrong compiler probes for SMEBF16 & SMEBI32 (8b19ddb)
  • Make: Log host CPU capabilities in macOS and Windows CI jobs (988eeb2)
  • Fix: Pre-declare OpenMP loop counter, universal libomp for macOS (493a021)
  • Fix: Use int for OpenMP loop counters, absolute libomp install name (ccc0118)
  • Fix: GCC requires +sme prefix in target attribute for _arm_sc* stubs (291dc0a)
  • Fix: Signed OpenMP iterators, source-built libomp, JS KMP guard (dc1ae75)
  • Fix: OpenMP wheel builds on macOS and Windows (f569121)
  • Fix: Add target("sme") to _arm_sc* stubs for GCC compatibility (ad2add0)
  • Fix: Unpoison SVE scalar reductions for MemorySanitizer (#342) (b42eda7)
  • Improve: Move SME runtime stubs to types.h as weak inline definitions (64ca934)
  • Improve: Manual SME streaming control, single enter/exit per API call (6432837)
  • Fix: Update cdist edge-case test for re-added threads= kwarg (50681af)
  • Make: Allow force-enabling ISA targets via environment variables (0e58702)
  • Improve: Abandon F32β†’F64 via Ozaki on Granite Rapids (94a5f19)
  • Make: FreeBSD, PPC64le, LoongArch, RISC-V releases & compress Windows (a9a0d83)
  • Make: Standardize CI compilers and add Windows test job (9a22ea4)
  • Make: Shrink serial fallbacks with scoped size optimization (83154a8)
  • Make: Compress Windows builds (e30ad3d)
  • Fix: Streaming-compatible stubs for LLVM SME builds (0be7b2f)

v7.4.5: Faster RMSD

06 Apr 21:04

  • Improve: Vectorize F32 SME MaxSim finalizer (0daacf3)
  • Improve: Remove centering from RMSD kernels (1a83ab4)
  • Fix: Emulated vs native test durations (4266451)

v7.4.4: CI & MSVC Hardening

06 Apr 12:17

  • Fix: ARMv7 Rust cross-compilation with CC for versioned GCC (a5e67e6)
  • Make: check_source_runs-probing like march=native on MSVC (7a152f3)
  • Fix: Drop _MM_FROUND_NO_EXC from _mm256_cvtps_ph calls (8649b0c)
  • Fix: Guard against old MSVC preprocessor (25d3304)
  • Make: Enforce newer preprocessor in MSVC (be966af)
  • Make: Cleaner CIBW artifact names & env forwarding (a6cf642)
  • Make: Forward cross-compilation flags for macOS wheels (6ed3b8c)
  • Make: Split ppc64le, s390x, i686 CIBW runs (c01795c)

Release v7.4.3

05 Apr 16:34

Release: v7.4.3 [skip ci]

Patch

  • Fix: Require AArch64 for NEON kernels (2ba1b34)
  • Docs: Table order & formatting (8673a56)
  • Make: Avoid --all-features in Rust cross-compilation CI (8be8bff)
  • Improve: Arm32 compatibility (6404172)
  • Make: cancel-in-progress CI to shift compute resources (dfc8fa0)
  • Improve: Harden Swift SDK for 6.1+ toolkit (965cd52)
  • Make: Strip .unsafeFlags & list platforms for SPM consumption (b061b78)
  • Make: Expose CNumKongDispatch target to Swift users (6aa00a8)

Release v7.4.2

05 Apr 09:07

Release: v7.4.2 [skip ci]

Patch

  • Docs: Shrink tables in the main README (6d2ea34)
  • Make: Inline PowerShell cross-compilation logic in CI (974c30c)
  • Make: Define _ARM64_ for Arm JS builds in MSVC (f303042)
  • Make: Skip same-named artifacts on CI reruns (7c098e5)

Release v7.4.1

05 Apr 00:12

Release: v7.4.1 [skip ci]

Patch

  • Make: Set repository.url for NPM (385480d)
  • Make: Pull MSVC ARM64 Cross-Compiler (e20c93e)
  • Fix: Swap f16x8 for u16x8 in cast_neon (154ec5d)

v7.4: Fast Tensor Contractions

04 Apr 23:26

  • Faster tensor contractions
  • Faster GEMM "packers" with SIMD
  • New SVE+SDOT kernels for i8
  • MSVC build stability on Arm

Minor

  • Add: WASM elementwise ops & spatial mini-float kernels (81b8c44)
  • Add: WASM type-casting kernels (e09df31)
  • Add: SVE+SDOT ops for 8-bit integers (913fc6b)

Patch

  • Fix: Misplaced NEON loads/stores in Sierra (05e3045)
  • Fix: Avoid unconditional np symbols (9dffb68)
  • Make: Resolve probe locations for NPM consumers (c602f45)
  • Docs: Refined "What's Inside" (28f35cd)
  • Docs: Mini-float kernel selection strategy (04e6598)
  • Improve: Accelerate PyTests, reduce Decimal use (2417248)
  • Make: Move .pyi for PyLance (688ec2d)
  • Fix: Inconsistent SME function qualifiers (5b4148a)
  • Improve: Smaller test inputs under QEMU (ee36bf2)
  • Improve: Vectorize GEMM "packers" (86127a4)
  • Make: Longer timeouts for QEMU in CI (a9cc732)
  • Fix: vec_t store helper args order (eecbcac)
  • Fix: Negative stride tensor reductions (3ea81be)
  • Improve: Recursive stride collapsing and axis-lane fast paths for N-D reductions (cf8eaf6)
  • Improve: Faster reductions in strided tensors (61651ed)
  • Improve: Wider NEON curved, mesh, & probability F16 kernels (1c17678)
  • Fix: Harden mini-float type-casting (1911b89)
  • Make: Ship win32-arm64 NPM builds (578b7ad)
  • Make: Auto-bump JS platform-specific versions (5617f75)
  • Fix: vcombine instead of initializer lists for NEON arrays in MSVC (906c178)
  • Fix: Avoid flaky vld1_f16 for MSVC (7a987d2)

v7.3: Hardened Arm Kernels, Upgraded CI, Citations, & Docs

02 Apr 22:48

This release hardens Arm kernels across NEON, SVE, and SME. The most widespread fix replaces _x (don't-care) predicated intrinsics with _m (merge-with-zero) variants — inactive lanes left undefined by _x could carry stale data into reductions, producing wrong results for non-power-of-two dimensions on real SVE hardware. Partial-tail padding in BMOPA is fixed for sub-32-bit types, and strided reductions in NEON are hardened against off-by-one in non-contiguous layouts.
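
The failure mode is easy to model in plain NumPy. This is a conceptual picture of predicated lanes, not actual SVE intrinsics: with don't-care predication the inactive tail lanes may hold whatever was left in the register, while merging against zero makes them contribute nothing.

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0, 4.0])          # one SIMD register's worth
predicate = np.array([True, True, True, False])  # tail: only 3 lanes are valid
stale = np.array([0.0, 0.0, 0.0, 7.0])           # leftover data in lane 3

# _x-style: the undefined lane leaks stale data into the reduction.
dont_care = np.where(predicate, vector, stale).sum()
# merge-with-zero: inactive lanes are forced to a known value first.
merged = np.where(predicate, vector, 0.0).sum()

assert dont_care == 13.0  # wrong: 7.0 from the stale lane was accumulated
assert merged == 6.0      # right: only the 3 predicated lanes contribute
```

On real hardware the bug only surfaces when dimensions are not a multiple of the vector length, which is exactly the non-power-of-two case the release notes call out.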

Thanks to the @ClickHouse team for help hardening tail loads and @albumentations-team for strided reductions!

On the performance side, NEON gets faster in-vector finalizers, vcvt_high for cheaper F16/BF16 widening, and new SDOT fallbacks for i4 and e3m2 that previously required SME — bringing sub-byte arithmetic to the much larger NEON install base. Streaming SVE picks up Giesen's trick for E4M3 → F16 and faster mini-float norms. SME GEMMs use fewer branches in the inner loop.

Also, NumKong now ships a CITATION.cff — hit "Cite this repository" on GitHub to grab it in case you are writing a paper on a related topic 🤗

Minor

  • Add: NEON & SDOT fallbacks for i4 & e3m2 (0c6afa5)

Patch

  • Docs: M5 perf stats for Wasmtime v43 (43c2881)
  • Fix: Alternative MSVC-friendly cast (4744b9b)
  • Make: Disable LTCG due to MSVC issues (3d37684)
  • Make: Try PREBUILDS_ONLY=0 in CI (64c5f95)
  • Improve: Lower NEONHALF → NEON requirements (37f99ec)
  • Fix: Wire nk_cast_neon benchmarks (3793af2)
  • Docs: Apple M5 native stats for secondary workloads (d7c81c4)
  • Improve: Faster in-vector 4-way finalizers in NEON (968dcd1)
  • Improve: Drop nk_f16x4_to_f32x4_neon (84bb20a)
  • Improve: vcvt_high for faster unpacking (a5f4a19)
  • Docs: Refresh GEMM/SYRK measurements Apple M4 → M5 (3e010de)
  • Fix: Harden strided reductions in NEON & AVX2 (61ac67b)
  • Fix: Double-counted tail in Skylake f64 RMSD, Kabsch, and Umeyama (5391344)
  • Improve: Share decimal.Context.traps rules (3c28ae9)
  • Fix: Padding partial tail 32-bit words for BMOPA (2598487)
  • Fix: Missing scale type definitions of mini-floats (91862da)
  • Fix: Scalar buffer cast internal overwrites & aliasing (7b0e129)
  • Fix: Top-bottom variable names (a014134)
  • Improve: Giesen's E4M3 → F16 in Streaming SVE (25322b5)
  • Improve: Fewer branches in SME GEMMs (858263c)
  • Fix: Up-round dimensions count in sub-byte C++ tests (87a72d0)
  • Make: Focus on M4 CPUs for SME probing (5ff63eb)
  • Improve: PyTesting across more shapes (4bc3e44)
  • Improve: Cleaner type-casting & promotion rules (23c2474)
  • Make: Hide formatting commits for v7-7.2 (f6ce2da)
  • Make: Native addon resolution for Deno & Bun (0d502d5)
  • Docs: Citations (6220137)
  • Improve: Faster mini-float norms in Streaming SVE (088de57)
  • Make: Integrate PyRight (0fe56c0)
  • Fix: F16 norms in SSVE skipped odd entries (bf3bfee)
  • Fix: Harden SVE MaxSim upcasting logic (803eb33)
  • Fix: Disable FPCR.AH bit (7b2b850)
  • Make: Node 24 for trusted publishing (9f1a4ef)
  • Fix: _m to zero-out predicated SVE/SME ops (16c157b)
  • Fix: _m to zero-out predicated SVE lanes in spatial/ (ac27cde)
  • Make: Replace stale prebuildify (74c5454)

Release v7.2.4

28 Mar 23:38

Release: v7.2.4 [skip ci]

Patch

  • Make: 2h timeout budget for JS & Py builds (2e8f081)