Releases: ashvardanian/StringZilla
v4.2: Faster Hashing and SHA-256
User-facing updates:
- 🆕 SHA-256 checksums
- 🆕 Detect compilation settings
Implementation details:
- 🆕 Intel Goldmont capabilities level
- 🆕 Arm NEON+SHA capabilities level
- Hardened Rust builds & capability masking
- Faster buffer filling in sz_hash in NEON backend
- Fixed tail handling in sz_copy in SVE backend
Minor
- Add: Check comp-time capabilities (3347be4)
- Add: sz_cap_goldmont_k capability! (f70e927)
- Add: neon+sha new capability! (fcb68a4)
- Add: Sha256 to bench_token (bb077da)
- Add: hmac_sha256 APIs (bf1971e)
- Add: Sha256 class for Python (6ae7b75)
- Add: Initial Sha256 variant for NEON (bd35030)
- Add: SHA256 for Arm (20672dd)
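When validating a new SHA-256 or HMAC backend, a known-good reference is handy. The sketch below uses only Python's standard hashlib and hmac modules, not StringZilla itself: these notes do not spell out the signatures of the new Sha256 class or hmac_sha256 APIs, so none are assumed here. The helper name hmac_sha256_reference is hypothetical; it follows the RFC 2104 construction of HMAC on top of plain SHA-256.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 consumes its input in 64-byte blocks


def hmac_sha256_reference(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 per RFC 2104, built from two plain SHA-256 calls."""
    # Keys longer than one block are hashed first; shorter keys are zero-padded.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")
    inner_pad = bytes(b ^ 0x36 for b in key)
    outer_pad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(inner_pad + message).digest()
    return hashlib.sha256(outer_pad + inner).digest()


digest = hashlib.sha256(b"abc").hexdigest()
tag = hmac_sha256_reference(b"Jefe", b"what do ya want for nothing?")
print(digest)
print(tag.hex())
```

Comparing the output of any other implementation against these values, byte for byte, is a quick sanity check before benchmarking.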
Patch
- Fix: Avoid unaligned SHA loads on ArmV7 (ebf0503)
- Fix: Sign conversion warning (3c3e5fc)
- Make: before-all for dnf on Fedora & apt on Debian (fc74452)
- Make: Consume env-vars for Rust backend builds (222fc39)
- Improve: Amortize bench_unary costs (d8d19ce)
- Fix: Init uint32x4_t on MSVC (3dce631)
- Improve: Bring back SVE2 hash for short inputs (b1c750b)
- Improve: More sorting tests (61e08ce)
- Improve: Simplify SVE memory-ops (35e2236)
- Fix: sz_copy_sve tail issue (2fa818d)
- Fix: Avoid <arm_neon_sve_bridge.h> (e5b4496)
- Improve: Different SHA pipeline for AArch64 (c8aafd3)
- Improve: Try better SHA pipelining (313f71f)
- Improve: Faster 2-block SHA256 on NEON (9425341)
- Improve: Deprecate SVE2 hashing (9cb1588)
- Improve: Try using non-temporal SVE loads (4572a63)
- Fix: svlasta_u64(svpfalse_b()) UB (4833e83)
- Improve: Westmere-like hash updates in NEON (064355f)
- Improve: Hardening Rust builds (d6a9ba6)
- Fix: Type-casting on Arm (03a0340)
v4.1: Intel Westmere Kernels
Thanks to @Algunenano and the broader ClickHouse team for their help back-porting StringZilla kernels to older CPUs 🤗
With this release:
- Substring search and hashing on CPUs from Westmere to Haswell will become at least 2x faster.
- Inferring Skylake capabilities in dynamic dispatch won't require VAES extensions, which are only needed for Ice Lake and newer.
- MSVC will correctly detect Haswell, Ice Lake, and NEON capabilities for compile-time dispatch; its macros offer no way to differentiate other platforms.
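The dispatch change above can be illustrated with a small capability-mask model. This is a sketch under stated assumptions, not StringZilla's implementation: the Capability enum, the REQUIRED_FEATURES table, the feature strings, and both helper functions are hypothetical names chosen for illustration. The point it shows is that a capability level is granted only from the features that level actually needs, so a Skylake-class CPU without VAES still gets the Skylake level.

```python
from enum import IntFlag, auto


class Capability(IntFlag):
    """Hypothetical capability bits; not StringZilla's actual constants."""
    SERIAL = auto()
    WESTMERE = auto()
    HASWELL = auto()
    SKYLAKE = auto()
    ICE = auto()


# Illustrative feature requirements per level (assumed, simplified).
REQUIRED_FEATURES = {
    Capability.WESTMERE: {"sse4.2", "aes"},
    Capability.HASWELL: {"avx2"},
    Capability.SKYLAKE: {"avx512f", "avx512bw", "avx512vl"},  # no "vaes" needed
    Capability.ICE: {"avx512vbmi", "vaes"},
}

# Most capable level first, so dispatch picks the best available backend.
PREFERENCE = (Capability.ICE, Capability.SKYLAKE, Capability.HASWELL, Capability.WESTMERE)


def infer_capabilities(cpu_features: set) -> Capability:
    """Grant every level whose required features are all present."""
    caps = Capability.SERIAL
    for level, needed in REQUIRED_FEATURES.items():
        if needed <= cpu_features:
            caps |= level
    return caps


def pick_backend(caps: Capability) -> str:
    """Return the name of the most capable backend present in the mask."""
    for level in PREFERENCE:
        if level in caps:
            return level.name.lower()
    return "serial"


skylake_cpu = {"sse4.2", "aes", "avx2", "avx512f", "avx512bw", "avx512vl"}
backend = pick_backend(infer_capabilities(skylake_cpu))
print(backend)
```

Masking out a level at runtime is then a single bitwise operation on the inferred mask, for example `caps & ~Capability.SKYLAKE` to force the Haswell path.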
Minor
Patch
Release v4.0.15
Release v4.0.14
Release v4.0.13
v4.0.12: Zero-Copy for Rust and Python
This release fixes a critical bug where non-owning Strs slices incorrectly copied the entire parent data during GPU memory allocation, instead of just the slice portion. The fix ensures proper handling of the Apache Arrow-compatible StringTape format, with correct offset normalization for zero-copy operations. GPU memory management is also more efficient: when the data already resides in GPU memory, the parent chain is traversed to find it and the unnecessary re-allocation is skipped.
A new stringzillas.to_device() function enables explicit GPU memory pre-allocation, useful for testing and performance optimization:
import stringzilla as sz
import stringzillas as szs
# Create strings and slices
strs = sz.Strs(["hello", "world", "test", "data"])
slice_view = strs[1:3] # Non-owning view of ["world", "test"]
# Pre-allocate on GPU (if available)
gpu_strs = szs.to_device(strs)
gpu_slice = szs.to_device(slice_view) # Correctly handles slice offsets
Cross-platform builds are now more stable with fixes for Windows ARM64 cross-compilation, ensuring mutually exclusive architecture flags prevent header conflicts. The CI/CD pipeline correctly generates stringzillas-cuda packages by properly propagating environment variables through cibuildwheel. Enhanced test coverage includes complex Unicode scenarios with RTL text, emoji sequences, and different normalization forms. Documentation has been extended with Rust examples showcasing zero-copy compute_into APIs using StringTape format.
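To make "offset normalization" concrete, here is a minimal pure-Python model of an Arrow-style string column: one contiguous byte buffer plus N+1 offsets. The TapeView class and its methods are hypothetical and only illustrate the idea; they are not the StringTape or Strs implementation. A slice shares the parent buffer without copying, and normalization copies only the bytes the slice covers while rebasing the offsets to start at zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TapeView:
    """Hypothetical Arrow-style layout: string i spans data[offsets[i]:offsets[i + 1]]."""
    data: bytes
    offsets: List[int]

    @classmethod
    def from_strings(cls, strings: List[str]) -> "TapeView":
        encoded = [s.encode("utf-8") for s in strings]
        offsets = [0]
        for chunk in encoded:
            offsets.append(offsets[-1] + len(chunk))
        return cls(b"".join(encoded), offsets)

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> str:
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")

    def slice(self, start: int, stop: int) -> "TapeView":
        # Non-owning view: same buffer object, offsets still index into the parent.
        return TapeView(self.data, self.offsets[start:stop + 1])

    def normalized(self) -> "TapeView":
        # Copy only the bytes the view covers and rebase offsets to zero.
        base, end = self.offsets[0], self.offsets[-1]
        return TapeView(self.data[base:end], [o - base for o in self.offsets])


parent = TapeView.from_strings(["hello", "world", "test", "data"])
view = parent.slice(1, 3)    # ["world", "test"], shares parent.data
compact = view.normalized()  # what would be sent to another device
print(compact.data, compact.offsets)
```

In this model the bug described above corresponds to transferring `view.data` (all 18 bytes of the parent) instead of `compact.data` (the 9 bytes the slice actually covers).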
Patch
- Make: Mutually exclusive platform flags (773d959)
- Fix: Skip to_device tests w/out GPUs found (d701592)
- Make: Propagate SZ_TARGET into CIBW env (b222e82)
- Improve: Avoid realloc for on-GPU views (77e67cf)
- Docs: Zero-copy Rust compute_into API with StringTape (f4ad81e)
- Improve: Validate to_device(Strs) for unicode (f3c5357)
- Improve: Pre-send to GPU with to_device (c78cd21)
- Fix: Same Strs slicing as in StringTape (b4f8d12)
Release v4.0.11
Release v4.0.10
Release v4.0.9
Release v4.0.8
Release: v4.0.8 [skip ci]
Patch
- Docs: Outdated algorithm details (68dc092)
- Make: Guess platform for PyPI and sdist (e1966de)
- Make: Embed SZ_TARGET.env for PyPI sdists (57822a4)
- Make: Require serial package for parallel PyPI packages (d3ef8c8)
- Make: Move Py benchmarks to StringWa.rs (52c90da)
- Improve: Compare Levenshtein to CuDF (fe1e32b)
- Make: Bump StringTape (da974f5)
- Improve: Take slices in compute_into (b507a8f)
- Improve: compute_into APIs for Rust (103019c)
- Docs: Ship different __description__s (1467001)
- Make: Suppress warnings in Windows CIBW (a879dd0)
- Make: Drop Windows before-test override (d19e2fa)
- Improve: Skip sz PyTests in szs runs (8abcdab)
- Fix: Windows compilation issues (fee2ffc)
- Fix: Safer conversion to RawParts (a9cfa22)