Releases: ashvardanian/NumKong
v7.6: CUDA & C++20 Compatibility, DLPack 1.3 Views, Float8 & 3D Mesh Speedups
CUDA & C++20 Compatibility
NVCC 13 caps its language-standard flag at C++20, and our multi-argument subscript overloads from C++23 P2128 made `tensor.hpp` unparseable by `cudafe++`. We added call-operator primaries that mirror every multi-argument subscript overload in the tensor view, span, and owning container types, and kept the bracket sugar behind an `__cpp_multidimensional_subscript` feature test so older toolchains pick the portable spelling automatically. Downstream CUDA callers now parse the tensor header without touching their language-standard flag.
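A minimal sketch of the pattern, using a hypothetical `view_2d` type rather than NumKong's actual classes:

```cpp
#include <cstddef>

template <typename scalar_type_>
struct view_2d {
    scalar_type_ *data_;
    std::size_t stride_;

    // Call-operator primary: parses under C++20, including NVCC 13's cudafe++.
    scalar_type_ &operator()(std::size_t row, std::size_t col) const noexcept {
        return data_[row * stride_ + col];
    }

#if defined(__cpp_multidimensional_subscript)
    // C++23 P2128 bracket sugar, exposed only when the feature test passes.
    scalar_type_ &operator[](std::size_t row, std::size_t col) const noexcept {
        return operator()(row, col);
    }
#endif
};
```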
DLPack & Zero-copy Exchange with PyTorch, JAX, & Arrow
Tensors now exchange zero-copy in both directions with every Python framework that implements the DLPack protocol (PyTorch, NumPy, JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla), using DLPack 1.3's versioned capsules and the `max_version` handshake. This finally carries semantic dtype identity across the bridge: bf16 and the four narrow float variants E4M3FN, E5M2, E2M3, and E3M2 round-trip without losing their type, where PEP 3118 and the legacy array interface previously degraded them to raw unsigned bytes.

The importer accepts every device whose pointer is host-dereferenceable (plain CPU, pinned host memory on CUDA and ROCm, CUDA managed unified memory, Intel oneAPI host and shared USM, and Metal on Apple Silicon), while pure device memory is rejected with the offending device code named. The exporter stays strict and only emits the CPU device. Sub-byte types `u1`, `u4`, and `i4` ride as byte containers, and the ABI is declared inline as six structs and twelve device codes rather than vendoring an external header, mirroring NumPy's own approach. Validated against torch 2.11, numpy 2.4, jax 0.10, tensorflow 2.21, pyarrow 23, cupy 13.6, and onnxruntime 1.24 on H100 with 127 tests passing.
```py
import numpy as np, torch, numkong as nk

# NumKong ↔ PyTorch: zero-copy FP8 round-trip preserves dtype identity.
src = torch.zeros(4, 6, dtype=torch.float8_e4m3fn)
nk_view = nk.from_dlpack(src)
pt_back = torch.from_dlpack(nk_view)
assert nk_view.shape == (4, 6) and nk_view.dtype == "e4m3" and pt_back.dtype == torch.float8_e4m3fn

# Mutation through one view is visible through the other: proves zero-copy.
tensor = nk.Tensor(np.arange(24, dtype=np.float32).reshape(4, 6))
pt = torch.from_dlpack(tensor)
pt[0, 0] = 99
assert np.asarray(tensor)[0, 0] == 99
```

Upstream DLPack PRs that this bridge interoperates with, already referenced from our DLPack interop source:
- pytorch/pytorch#57110: PyTorch DLPack protocol, later upgraded to DLPack 1.0 with the `max_version` handshake
- tensorflow/community#180: TensorFlow DLPack RFC, implemented as `tf.experimental.dlpack`
- microsoft/onnxruntime#23110: ONNX Runtime `OrtValue` DLPack enabled by default for inference
Faster Single-Pass 3D Mesh Alignment Algorithms
Kabsch/Umeyama mesh alignment now folds into a single pass via the trace identity sketched below, replacing the earlier two-pass approach (covariance first, transformed-SSD second) across all nine backends. An identity-dominant short-circuit skips SVD entirely when the cross-covariance is already dominated by its diagonal and the optimal rotation is numerically the identity.
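For centered point clouds $P, Q \in \mathbb{R}^{N \times 3}$ with cross-covariance $H = P^\top Q = U \Sigma V^\top$, the standard reflection-corrected form of that identity (reconstructed here for reference) reads:

$$\min_R \sum_{i=1}^{N} \lVert R\,q_i - p_i \rVert^2 = \operatorname{tr}(P^\top P) + \operatorname{tr}(Q^\top Q) - 2\,(\sigma_1 + \sigma_2 + d\,\sigma_3), \qquad d = \operatorname{sign}\det(V U^\top),$$

so a single pass can accumulate $\operatorname{tr}(P^\top P)$, $\operatorname{tr}(Q^\top Q)$, and the nine entries of $H$ together, and the minimized SSD then follows from the singular values without ever applying the rotation.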
Faster Float8 Linear Algebra on x86
Pairwise FP8 distance kernels (sqeuclidean, euclidean, and angular) on Skylake and Haswell now compute the squared difference directly in F32 after a free-shift widen. E5M2 exploits sharing F16's exponent width and bias: a byte-to-word unpack against zero places each byte in the high half of a 16-bit lane, which is already a valid F16 encoding of the same value. E4M3 uses a Giesen-style fake-F16 cast that shifts the mantissa up by seven, reinjects the sign at bit fifteen, widens with `vcvtph2ps`, and multiplies by 256 to correct the bias delta. Per-pair speedups range from 1.4× for E4M3 angular on Skylake to 4.9× for E5M2 sqeuclidean on Haswell on a pinned Xeon 6776P. The redundant Genoa E5M2 pairwise kernels are deleted: the rewritten Skylake path runs on Genoa silicon and beats the old `vdpbf16ps`-chain form by 2.4×.
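A scalar sketch of both widening tricks; the shipped kernels implement these with SIMD unpacks and `vcvtph2ps`, and the helper names here are illustrative:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode IEEE binary16 bits to a float, to check the tricks below.
float f16_bits_to_f32(std::uint16_t h) {
    int exp = (h >> 10) & 0x1F, man = h & 0x3FF;
    float mag = exp == 0 ? std::ldexp(man / 1024.0f, -14)
                         : std::ldexp(1.0f + man / 1024.0f, exp - 15);
    return (h >> 15) ? -mag : mag;
}

// E5M2 shares F16's exponent width and bias, so placing the byte in the
// high half of a 16-bit word is already the F16 encoding of the same value.
std::uint16_t e5m2_to_f16_bits(std::uint8_t b) { return std::uint16_t(b) << 8; }

// Giesen-style E4M3 fake-F16: exponent+mantissa shifted up by 7, sign
// reinjected at bit 15. The bias delta (15 - 7) makes the fake F16 decode
// to value / 256, so the widened result is multiplied back by 256.
float e4m3_to_f32(std::uint8_t b) {
    std::uint16_t fake = std::uint16_t((b & 0x7F) << 7) | std::uint16_t((b & 0x80) << 8);
    return f16_bits_to_f32(fake) * 256.0f;
}

int main() {
    std::printf("%g\n", f16_bits_to_f32(e5m2_to_f16_bits(0x3C))); // 0x3C in E5M2 is 1.0
    std::printf("%g\n", e4m3_to_f32(0x38));                       // 0x38 in E4M3 is 1.0
}
```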
Stateful FP8 GEMMs follow the same trajectory. E5M2 byte-packs into a new dtype-specific update helper that runs two FMA chains into a single state accumulator, landing at 1.4–2.5× on Skylake and up to 3.2× on Haswell for the packed dot, angular, and euclidean variants. E4M3 GEMMs on Skylake switch to an asymmetric F16-pack scheme where A streams as F32 while B is pre-cast at pack time and stored as F16, halving packed-B memory while staying compute-neutral against baseline. Granite Rapids gets a brand-new E5M2 GEMM that packs E5M2 into F16 with a single byte shift and runs `TDPFP16PS` over F16 tiles, beating the Sapphire AMX BF16 path on E5M2 inputs with better intermediate precision at the same throughput. Dispatch wires it ahead of Sapphire AMX, so Granite hardware picks it up automatically.
Minor
- Add: DLPack 1.3 interop bridge for numkong.Tensor (ea74fe1)
- Add: Back-port tensor API to C++20 for CUDA (ad93068)
Patch
- Improve: FP8 GEMM throughput on Skylake/Haswell + Granite Rapids E5M2 kernel (c19bec9)
- Improve: FP8 pairwise distance kernels via Giesen trick + F16 widen path (679f55f)
- Fix: Keep `*_serial` kernels scalar across LTO (455d535)
- Make: Enable symbol exports for `nk_shared` Emscripten builds (482e4fd)
- Improve: SSD trace-identity fold across all mesh backends + Genoa/NEONFHM kernels (e9d40e5)
- Make: Normalize base PowerPC & LoongArch cap for JS (ab81191)
v7.5: Parallelism & Portability
- Built-in OpenMP bundling for JS & Python
- Intel Granite Rapids F16 & F32 GEMMs
- Faster bit-vector population counts for Arm NEON
- SME compatibility with non-Apple Clang on Apple machines
- Hardening against MSan SVE false-positives, thanks to @alexey-milovidov
- Hardening against GCC 13 Arm NEON code-gen bugs, thanks to @swasik
- `_into` & `_parallel` GEMM Rust APIs: reusing memory & ForkUnion pools
- De-vectorize serial kernels with compiler flags
- Compress source & binary distributions for Windows
- Pre-build & share FreeBSD, PowerPC, RISC-V, & LoongArch libs
Minor
- Add: NEON popcount kernel for nk_reduce_moments_u1 (2181e0c)
- Add: Tensor constructors, sealed trait family, div_ceil cleanup (2792279)
- Add: Span-based matrix `_into` APIs, parallel Hammings/Jaccards, full-crate docs (99289df)
- Add: OpenMP for Python & JavaScript (499ecc9)
- Add: Granite Rapids AMX for F16 & F32 (28036ea)
Patch
- Fix: Native ISA probe on Apple Clang + compile/runtime glyph (bc13e02)
- Make: Detect illegal instructions in macOS CI (289cdaf)
- Fix: Drop `-march=` on macOS setup.py builds (28aac74)
- Fix: Exclude `std::signal` from WASM builds (14814c5)
- Improve: Drop GNU statement-expression macros in SVE reduce helpers (b8b4ca0)
- Make: Drop `+nosimd` from AArch64 baseline (23f5195)
- Make: Forbid auto-vectorization in portable baseline builds (43e8324)
- Make: Pin TU baseline to per-arch ABI floor across build systems (453ed5f)
- Fix: Mitigate GCC 13 wrong BF16 splat in Arm NEON (#346) (fc3d8ec)
- Improve: Log faulting capability detection (a401f8a)
- Improve: Log faulting kernel on fatal signals in `nk_test` (22c7c79)
- Make: Normalize Python test dependencies across CI and docs (8a0f3d4)
- Make: Baseline-only ISA for shared-library test, harden Windows CI (1907685)
- Fix: Wrong compiler probes for SMEBF16 & SMEBI32 (8b19ddb)
- Make: Log host CPU capabilities in macOS and Windows CI jobs (988eeb2)
- Fix: Pre-declare OpenMP loop counter, universal libomp for macOS (493a021)
- Fix: Use int for OpenMP loop counters, absolute libomp install name (ccc0118)
- Fix: GCC requires +sme prefix in target attribute for _arm_sc* stubs (291dc0a)
- Fix: Signed OpenMP iterators, source-built libomp, JS KMP guard (dc1ae75)
- Fix: OpenMP wheel builds on macOS and Windows (f569121)
- Fix: Add target("sme") to _arm_sc* stubs for GCC compatibility (ad2add0)
- Fix: Unpoison SVE scalar reductions for MemorySanitizer (#342) (b42eda7)
- Improve: Move SME runtime stubs to types.h as weak inline definitions (64ca934)
- Improve: Manual SME streaming control, single enter/exit per API call (6432837)
- Fix: Update `cdist` edge-case test for re-added `threads=` kwarg (50681af)
- Make: Allow force-enabling ISA targets via environment variables (0e58702)
- Improve: Abandon F32 → F64 via Ozaki on Granite Rapids (94a5f19)
- Make: FreeBSD, PPC64le, LoongArch, RISC-V releases & compress Windows (a9a0d83)
- Make: Standardize CI compilers and add Windows test job (9a22ea4)
- Make: Shrink serial fallbacks with scoped size optimization (83154a8)
- Make: Compress Windows builds (e30ad3d)
- Fix: Streaming-compatible stubs for LLVM SME builds (0be7b2f)
v7.4.5: Faster RMSD
v7.4.4: CI & MSVC Hardening
- Fix: ARMv7 Rust cross-compilation with CC for versioned GCC (a5e67e6)
- Make: `check_source_runs`-probing like `march=native` on MSVC (7a152f3)
- Fix: Drop `_MM_FROUND_NO_EXC` from `_mm256_cvtps_ph` calls (8649b0c)
- Fix: Guard against old MSVC preprocessor (25d3304)
- Make: Enforce newer preprocessor in MSVC (be966af)
- Make: Cleaner CIBW artifact names & env forwarding (a6cf642)
- Make: Forward cross-compilation flags for macOS wheels (6ed3b8c)
- Make: Split ppc64le, s390x, i686 CIBW runs (c01795c)
Release v7.4.3
Patch
- Fix: Require AArch64 for NEON kernels (2ba1b34)
- Docs: Table order & formatting (8673a56)
- Make: Avoid `--all-features` in Rust cross-compilation CI (8be8bff)
- Improve: Arm32 compatibility (6404172)
- Make: `cancel-in-progress` CI to shift compute resources (dfc8fa0)
- Improve: Harden Swift SDK for 6.1+ toolkit (965cd52)
- Make: Strip `.unsafeFlags` & list platforms for SPM consumption (b061b78)
- Make: Expose `CNumKongDispatch` target to Swift users (6aa00a8)
Release v7.4.2
Release v7.4.1
v7.4: Fast Tensor Contractions
- Faster tensor contractions
- Faster GEMM "packers" with SIMD
- New SVE+SDOT kernels for `i8`
- MSVC build stability on Arm
Minor
- Add: WASM elementwise ops & spatial mini-float kernels (81b8c44)
- Add: WASM type-casting kernels (e09df31)
- Add: SVE+SDOT ops for 8-bit integers (913fc6b)
Patch
- Fix: Misplaced NEON loads/stores in Sierra (05e3045)
- Fix: Avoid unconditional `np` symbols (9dffb68)
- Make: Resolve probe locations for NPM consumers (c602f45)
- Docs: Refined "What's Inside" (28f35cd)
- Docs: Mini-float kernel selection strategy (04e6598)
- Improve: Accelerate PyTests, reduce `Decimal` use (2417248)
- Make: Move `.pyi` for PyLance (688ec2d)
- Fix: Inconsistent SME function qualifiers (5b4148a)
- Improve: Smaller test inputs under QEMU (ee36bf2)
- Improve: Vectorize GEMM "packers" (86127a4)
- Make: Longer timeouts for QEMU in CI (a9cc732)
- Fix: `vec_t` store helper args order (eecbcac)
- Fix: Negative stride tensor reductions (3ea81be)
- Improve: Recursive stride collapsing and axis-lane fast paths for N-D reductions (cf8eaf6)
- Improve: Faster reductions in strided tensors (61651ed)
- Improve: Wider NEON curved, mesh, & probability F16 kernels (1c17678)
- Fix: Harden mini-float type-casting (1911b89)
- Make: Ship `win32-arm64` NPM builds (578b7ad)
- Make: Auto-bump JS platform-specific versions (5617f75)
- Fix: `vcombine` instead of initializer lists for NEON arrays in MSVC (906c178)
- Fix: Avoid flaky `vld1_f16` for MSVC (7a987d2)
v7.3: Hardened Arm Kernels, Upgraded CI, Citations, & Docs
This release hardens Arm kernels across NEON, SVE, and SME. The most widespread fix replaces `_x` (don't-care) predicated intrinsics with `_m` (merging) variants that force inactive lanes to a known value: lanes left undefined by `_x` could carry stale register data into reductions, producing wrong results for non-power-of-two dimensions on real SVE hardware. Partial-tail padding in BMOPA is fixed for sub-32-bit types, and strided reductions in NEON are hardened against off-by-ones in non-contiguous layouts.
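A minimal sketch of the hazard and the fix, assuming a plain sum reduction rather than NumKong's actual kernels:

```cpp
#include <arm_sve.h>
#include <cstdint>

// Sum `n` floats where `n` need not be a multiple of the vector length.
float sum_f32(float const *data, std::int64_t n) {
    svfloat32_t acc = svdup_f32(0.0f);
    for (std::int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t vals = svld1_f32(pg, data + i);
        // `_m` merges inactive lanes from `acc`, keeping them well-defined;
        // the `_x` form leaves them undefined, so stale register contents
        // could leak into the horizontal reduction below on real hardware.
        acc = svadd_f32_m(pg, acc, vals);
    }
    return svaddv_f32(svptrue_b32(), acc);
}
```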
Thanks to the @ClickHouse team for help hardening tail loads and @albumentations-team for strided reductions!
On the performance side, NEON gets faster in-vector finalizers, `vcvt_high` for cheaper F16/BF16 widening, and new SDOT fallbacks for `i4` and `e3m2` that previously required SME, bringing sub-byte arithmetic to the much larger NEON install base. Streaming SVE picks up Giesen's trick for E4M3 → F16 and faster mini-float norms. SME GEMMs use fewer branches in the inner loop.
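As a minimal sketch of the `vcvt_high` saving (the helper name is illustrative, not NumKong's API):

```cpp
#include <arm_neon.h>

// Widen 8 F16 lanes to 2x4 F32 lanes. The naive top half needs an extract
// (`vget_high_f16`) before converting; `vcvt_high_f32_f16` reads the top
// half directly and saves that instruction.
inline void widen_f16x8(float16x8_t h, float32x4_t *lo, float32x4_t *hi) {
    *lo = vcvt_f32_f16(vget_low_f16(h));
    *hi = vcvt_high_f32_f16(h); // instead of vcvt_f32_f16(vget_high_f16(h))
}
```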
Also, NumKong now ships a `CITATION.cff`: hit "Cite this repository" on GitHub to grab it in case you are writing a paper on a related topic.
Minor
- Add: NEON & SDOT fallbacks for `i4` & `e3m2` (0c6afa5)
Patch
- Docs: M5 perf stats for Wasmtime v43 (43c2881)
- Fix: Alternative MSVC-friendly cast (4744b9b)
- Make: Disable LTCG due to MSVC issues (3d37684)
- Make: Try `PREBUILDS_ONLY=0` in CI (64c5f95)
- Improve: Lower NEONHALF → NEON requirements (37f99ec)
- Fix: Wire `nk_cast_neon` benchmarks (3793af2)
- Docs: Apple M5 native stats for secondary workloads (d7c81c4)
- Improve: Faster in-vector 4-way finalizers in NEON (968dcd1)
- Improve: Drop `nk_f16x4_to_f32x4_neon` (84bb20a)
- Improve: `vcvt_high` for faster unpacking (a5f4a19)
- Docs: Refresh GEMM/SYRK measurements Apple M4 → M5 (3e010de)
- Fix: Harden strided reductions in NEON & AVX2 (61ac67b)
- Fix: Double-counted tail in Skylake `f64` RMSD, Kabsch, and Umeyama (5391344)
- Improve: Share `decimal.Context.traps` rules (3c28ae9)
- Fix: Padding partial tail 32-bit words for `BMOPA` (2598487)
- Fix: Missing scale type definitions of mini-floats (91862da)
- Fix: Scalar buffer cast internal overwrites & aliasing (7b0e129)
- Fix: Top-bottom variable names (a014134)
- Improve: Giesen's E4M3 → F16 in Streaming SVE (25322b5)
- Improve: Fewer branches in SME GEMMs (858263c)
- Fix: Up-round dimensions count in sub-byte C++ tests (87a72d0)
- Make: Focus on M4 CPUs for SME probing (5ff63eb)
- Improve: PyTesting across more shapes (4bc3e44)
- Improve: Cleaner type-casting & promotion rules (23c2474)
- Make: Hide formatting commits for v7-7.2 (f6ce2da)
- Make: Native addon resolution for Deno & Bun (0d502d5)
- Docs: Citations (6220137)
- Improve: Faster mini-float norms in Streaming SVE (088de57)
- Make: Integrate PyRight (0fe56c0)
- Fix: F16 norms in SSVE skipped odd entries (bf3bfee)
- Fix: Harden SVE MaxSim upcasting logic (803eb33)
- Fix: Disable `FPCR.AH` bit (7b2b850)
- Make: Node 24 for trusted publishing (9f1a4ef)
- Fix: `_m` to zero-out predicated SVE/SME ops (16c157b)
- Fix: `_m` to zero-out predicated SVE lanes in `spatial/` (ac27cde)
- Make: Replace stale `prebuildify` (74c5454)