Releases: ashvardanian/StringZilla

Release v4.0.7

10 Sep 11:07

Release: v4.0.7 [skip ci]

Patch

  • Fix: Use unsigned offsets & smaller tapes (71f6069)
  • Improve: Clippy's static Rust suggestions (7282fc2)
  • Fix: Revert zero-copy Rust interfaces (c12a9c8)
  • Fix: Visibility warnings (6dc2412)
  • Fix: Input traits & docs consistency (3bf13da)
  • Improve: Support zero-copy tape inputs in Rust (ad40e6a)
  • Make: Allow partial wheel uploads to PyPI (5dacaa9)

Release v4.0.6

09 Sep 14:03

Release: v4.0.6 [skip ci]

Patch

  • Fix: Move macros for MSVC visibility (e5c19ee)
  • Make: Require macOS 10.13 for x86 (f136fd8)
  • Make: Only pack include/ from Fork Union (948ae74)
  • Fix: Missing void specialization for enable_if (225d55a)
  • Fix: Can't use __attribute__ with MSVC (f2170d8)
  • Make: Hide Fork Union crate info for packaging (fd3485c)

Release v4.0.5

09 Sep 13:10

Release: v4.0.5 [skip ci]

Patch

  • Make: Don't mix C & C++ flags in setup.py (7fc35fe)
  • Make: Clang qualifiers flag differs from GCC (3f66d18)
  • Make: Dynamically pull deps & run tests in CIBW (0075a42)
  • Fix: Expose unused NW getters (de6a7b2)
  • Make: Enable uv editable installations (7359e95)
  • Make: Pull submodules for Crates build (5537935)

Release v4.0.4

09 Sep 11:35

Release: v4.0.4 [skip ci]

Patch

  • Make: Validate NPM distro contents (04a3425)
  • Make: Use relative path for Crates (b9f1bf9)
  • Make: Explicit build_backend for CIBW (f582504)

Release v4.0.3

09 Sep 11:00

Release: v4.0.3 [skip ci]

Patch

  • Make: Separate release CI for separate badges (c89c741)
  • Fix: Move subs cost matrix creation to szs:: (c7d4005)
  • Make: Put fork_union sources into crate (3fc0e17)
  • Make: Don't stop building wheels on 1st fail (941da38)
  • Make: Conditionally pull NumPy (3fec3eb)

Release v4.0.2

09 Sep 10:11

Release: v4.0.2 [skip ci]

Patch

  • Make: Override get_requires_for_build_wheel (c83f1d3)

Release v4.0.1

09 Sep 09:39

Release: v4.0.1 [skip ci]

Patch

  • Make: Delay NumPy include path resolution (ca8186b)

StringZilla 4 CUDA 🥳

08 Sep 20:38

This release entirely refactors the codebase, separating the single-header implementation into separate headers. Moreover, it brings faster kernels for:

  • Sorting of string sequences and pointer-sized integers 🔀
  • Levenshtein edit distances for fuzzy matching of UTF-8 or DNA 🧬
  • Needleman-Wunsch and Smith-Waterman scoring, see affine-gaps ☣️
  • Multi-pattern sketching and fingerprinting on GPUs 🔍
  • AES-based portable general-purpose hashing functions #️⃣

And it ships with many more community contributions!

Huge thanks to our partners at Nebius for their continued support and endless stream of GPU installations for the most demanding computational workloads, both in AI and beyond!

Why Split the Files? Matching SimSIMD Design

Sadly, most modern software development tooling is subpar. VS Code is just as slow and unresponsive as the older Atom and other web-based editors, while LSP implementations for C++ are equally sluggish and completely mess up code highlighting for files over 5,000 Lines Of Code (LOC). So I've unbundled the single-header solution into multiple headers, similar to SimSIMD.

Also, similar to SimSIMD, CPU feature detection has been reworked, separating the serial implementations from the Haswell, Skylake, Ice Lake, NEON, and SVE backends.

Faster Sequence Alignment & Scoring on GPUs

Biology is set to be one of the driving forces of the 21st century, and biological DNA/RNA/protein data is already one of the fastest-growing data modalities, outpacing Moore's Law. Still, most BioInformatics software today is flawed and pretty slow. Last year, I helped several BioTech and Pharma companies scale up their data-processing capacity with specialized optimizations for various niche use cases, but I also wanted StringZilla to include baseline kernels for the most crucial algorithms, covering:

  • Levenshtein distance computation for binary & UTF-8 data for general-purpose fuzzy string matching on CPUs and GPUs,
  • Needleman-Wunsch and Smith-Waterman global and local sequence alignment with both linear and affine Gotoh gap extensions, so common in BioInformatics, on both CPUs and GPUs.

These kernels are hardly state-of-the-art at this point, but should provide a good baseline, ensuring correctness and equivalent outputs across different CPU & GPU brands.
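For intuition, here's a minimal serial sketch of the dynamic programming that these kernels accelerate. It's illustrative only, not StringZilla's optimized implementation; Needleman-Wunsch differs mainly in using a substitution matrix and gap penalties instead of unit costs:

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Classic two-row Levenshtein distance in O(|a| * |b|) time, O(|b|) space.
// Optimized kernels evaluate the same recurrence along DP diagonals instead.
std::size_t levenshtein_serial(std::string_view a, std::string_view b) {
    std::vector<std::size_t> previous(b.size() + 1), current(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) previous[j] = j;
    for (std::size_t i = 0; i != a.size(); ++i) {
        current[0] = i + 1;
        for (std::size_t j = 0; j != b.size(); ++j)
            current[j + 1] = std::min({
                previous[j] + (a[i] != b[j]), // substitution or match
                current[j] + 1,               // insertion
                previous[j + 1] + 1,          // deletion
            });
        std::swap(previous, current);
    }
    return previous[b.size()];
}
```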

Faster Sorting

Our old algorithm didn't perform any memory allocations and tried to fit too much into the provided buffers. The new breaking change in the API allows passing a memory allocator, making the implementation more flexible. It now works fine on both 32-bit and 64-bit systems.

The new serial algorithm is often 5 times faster than the std::sort function in the C++ Standard Template Library for a vector of strings. It's also typically 10 times faster than the qsort_r function in the GNU C library. There are even quicker versions available for Ice Lake CPUs with AVX-512 and Arm CPUs with SVE.
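To make the comparison concrete, here's roughly what those serial baselines look like; a sketch assuming glibc for qsort_r, not the actual benchmark harness:

```cpp
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// GNU-specific comparator signature for qsort_r: the third argument is
// an opaque context pointer, unused here.
static int compare_c_strings(void const *a, void const *b, void *) {
    return std::strcmp(*(char const *const *)a, *(char const *const *)b);
}

void sort_baselines(std::vector<std::string> &strings,
                    std::vector<char const *> &c_strings) {
    // The C++ STL baseline: comparison-based sorting of string objects.
    std::sort(strings.begin(), strings.end());
    // The GNU C library baseline: comparison-based sorting of raw pointers.
    qsort_r(c_strings.data(), c_strings.size(), sizeof(char const *),
            &compare_c_strings, nullptr);
}
```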

Faster Hashing Algorithms

Our old algorithm was a variation of the Karp-Rabin hash and was designed more for rolling-hashing workloads. Sadly, such hashing schemes don't pass SMHasher and similar hash-testing suites, and a better solution was needed. For years, I have been contemplating designing a general-purpose hash function based on AES instructions, which have been implemented in hardware for several CPU generations now. As discussed with @jandrewrogers, and as can be seen in his AquaHash project, those instructions provide an almost unmatched amount of mixing logic per CPU cycle of latency.

Many popular hash libraries, such as AHash in the Rust ecosystem, cleverly combine AES instructions with 8-bit shuffles and 64-bit additions. However, they rarely harness the full power of the CPU due to the constraints of Rust tooling and the complexity of using masked x86 AVX-512 and predicated Arm SVE2 instructions. StringZilla does that and ticks a few more boxes:

  • Outputs 64-bit hashes and passes the SMHasher --extra tests.
  • Is fast for both short strings (latency) and long strings (throughput).
  • Supports incremental (streaming) hashing for data arriving in chunks.
  • Supports custom seeds, with the seed affecting every bit of the output.
  • Provides dynamic dispatch for different architectures to simplify deployment.
  • Documents its logic and guarantees the same output across different platforms.

The resulting kernels are both fast and high-quality, often computing four hashes simultaneously, which makes them handy not only for hashing itself, but also for higher-level operations like database-style hash joins and set intersections, as well as advanced sequence-alignment algorithms for BioInformatics.
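To illustrate why AES rounds are such an attractive mixing primitive, here's a toy sketch using a single AESENC per 16-byte block. This is a hypothetical illustration of the idea, not StringZilla's actual hash, and it won't pass SMHasher as-is:

```cpp
#include <immintrin.h> // AES-NI intrinsics; compile with -maes
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy AES-based mixer: one AESENC round per 16-byte block. A single round
// applies byte-level substitution, permutation, and key mixing at once,
// which is why these instructions pack so much mixing per cycle.
std::uint64_t toy_aes_hash(char const *data, std::size_t length, std::uint64_t seed) {
    __m128i const key = _mm_set1_epi64x(0x9E3779B97F4A7C15LL); // arbitrary odd constant
    __m128i state = _mm_set1_epi64x((long long)seed);
    __m128i block;
    for (; length >= 16; data += 16, length -= 16) {
        std::memcpy(&block, data, 16);
        state = _mm_aesenc_si128(_mm_xor_si128(state, block), key);
    }
    char tail[16] = {0}; // zero-pad the last partial block
    std::memcpy(tail, data, length);
    std::memcpy(&block, tail, 16);
    state = _mm_aesenc_si128(_mm_xor_si128(state, block), _mm_set1_epi64x((long long)length));
    state = _mm_aesenc_si128(state, key); // a couple of finalization rounds
    state = _mm_aesenc_si128(state, key);
    std::uint64_t result;
    std::memcpy(&result, &state, sizeof(result));
    return result;
}
```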

Fingerprinting, Sketching, and Min-Hashing... using 52-bit Floats?!

Even though Rabin-Karp rolling hashes are now deprecated for general-purpose workloads, it's hard to argue with their usefulness in "fingerprinting" or "sketching" tasks, where a fixed-size feature vector is needed to compare the contents of variable-length strings. The features should be as different as possible, covering various substring lengths, so polynomial rolling hashes fit nicely!
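As a refresher, here's the rolling update that makes polynomial hashes so convenient for sketching; a minimal sketch with the modulus taken as 2^64 via unsigned wraparound for brevity, whereas real fingerprinting kernels use prime moduli:

```cpp
#include <cstddef>
#include <cstdint>

// Polynomial hash of a k-byte window: h = sum of s[i+j] * B^(k-1-j).
// Sliding the window by one byte costs just two multiplications,
// with all arithmetic implicitly modulo 2^64 here.
struct RollingHash {
    std::uint64_t base, top_power, state = 0; // top_power == base^(k-1)
    RollingHash(std::uint64_t b, std::size_t k) : base(b), top_power(1) {
        for (std::size_t i = 1; i < k; ++i) top_power *= b;
    }
    void push(std::uint8_t incoming) { state = state * base + incoming; }
    void roll(std::uint8_t outgoing, std::uint8_t incoming) {
        state = (state - outgoing * top_power) * base + incoming;
    }
};
```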

That said, implementing modular arithmetic over 64-bit integers is extremely expensive on Intel CPUs:

  • VPMULLQ (ZMM, ZMM, ZMM) for _mm512_mullo_epi64:
    • on Intel Ice Lake: 15 cycles on port 0.
    • on AMD Zen4: 3 cycles on ports 0 or 1.
  • VPMULLD (ZMM, ZMM, ZMM) for _mm512_mullo_epi32:
    • on Intel Ice Lake: 10 cycles on port 0.
    • on AMD Zen4: 3 cycles on ports 0 or 1.
  • VPMULLW (ZMM, ZMM, ZMM) for _mm512_mullo_epi16:
    • on Intel Ice Lake: 5 cycles on port 0.
    • on AMD Zen4: 3 cycles on ports 0 or 1.
  • VPMADD52LUQ (ZMM, ZMM, ZMM) for _mm512_madd52lo_epu64 for 52-bit multiplication:
    • on Intel Ice Lake: 4 cycles on port 0.
    • on AMD Zen4: 4 cycles on ports 0 or 1.

It's even pricier on Nvidia GPUs, but as you can see above, we may have a way out! x86 has a cheap instruction for 52-bit integer multiplication & addition. Moreover, 52 is precisely the number of bits we can safely use to store an integer exactly inside a double-precision floating-point number, opening doors to a really weird, but equally portable implementation of rolling hash functions across CPUs and GPUs, built on 64-bit floats and Barrett reductions for modular arithmetic, to avoid division!
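Here's a minimal scalar sketch of that trick, assuming a modulus small enough that every intermediate product fits exactly into a double's 52-bit mantissa (plus the implicit leading bit); vectorized kernels can build on the same idea with FMA and VPMADD52:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Modular multiply-add on doubles, Barrett-style: estimate the quotient
// with a precomputed reciprocal, then correct by at most one step.
// Exact as long as a * b + c stays below 2^53, doubles' exact-integer range.
struct FloatingModulo {
    double modulus, reciprocal;
    explicit FloatingModulo(double p) : modulus(p), reciprocal(1.0 / p) {}
    double mul_add_mod(double a, double b, double c) const {
        double product = a * b + c;
        double quotient = std::floor(product * reciprocal);
        double remainder = product - quotient * modulus;
        if (remainder < 0) remainder += modulus;        // quotient overshot by one
        if (remainder >= modulus) remainder -= modulus; // quotient undershot by one
        return remainder;
    }
};

// Toy polynomial rolling hash: h = h * base + byte (mod p), division-free.
double toy_float_hash(unsigned char const *data, std::size_t length) {
    FloatingModulo const mod(2147483647.0); // the Mersenne prime 2^31 - 1
    double hash = 0;
    for (std::size_t i = 0; i != length; ++i)
        hash = mod.mul_add_mod(hash, 257.0, (double)data[i]);
    return hash;
}
```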

Next Steps

  • Ditch using constant memory in CUDA. The original design was to load the substitution tables for NW and SW into shared memory, but a simpler design with constant memory was eventually accepted. That proved to be a mistake. The performance numbers are way lower than expected.
  • Use Distributed Shared Memory for 10x larger NW and SW inputs. That would require additional variants of linear_score_on_each_cuda_warp_ and affine_score_on_each_cuda_warp_, which scale to 16 blocks of (228 - 1) KB shared memory each, totaling 3.5 MB of shared memory that can be used for alignments. With u32 scores and 3 DP diagonals (in the case of non-affine gaps), that should significantly accelerate the alignment of ~300K-long strings. But keep in mind that cluster.sync() is still very expensive (about 1300 cycles), only 40% less than the 2200 cycles of grid.sync().
  • Add asynchronous host callbacks for updating the addressable memory region in NW and SW.
  • Potentially bring back stringzilla_bare builds on MSVC, if needed.

Major

  • Break: Output error messages via C API (fc94890)
  • Break: New wording for incremental hashers (66898ad)
  • Break: Rust namespaces layout (b3338bb)
  • Break: Refactor Strs ops (009b975)
  • Break: Rename again (fb56b60)
  • Break: Drop fingerprinting bench (3e04157)
  • Break: sz::edit_distance -> Levenshtein (d44beb4)
  • Break: C++ lookup and fill_random (1ce830b)
  • Break: charset/generate -> byteset/fill_random (2ce2b49)
  • Break: New calling convention in similarity.h (095bc2d)
  • Break: Return error-codes in sort functions (944804e)
  • Break: look_up_transform to lookup API (e0055d5)
  • Break: checksum to bytesum, new hash, and PRNG (71f1f4b)
  • Break: Pointer-sized N-gram Sorting (0c38bff)
  • Break: sz_sort now takes allocators (ec81663)
  • Break: Deprecate old fingerprinting (38014ee)
  • Break: Replace char_set constructor with literals (2c49eae)

Minor

  • Add: Str.count_byteset for Python (802d699)
  • Add: GoLang official support (234b758)
  • Add: HashMap traits for Rust (e555cc3)
  • Add: try_resize_and_overwrite (e00a98b)
  • Add: Hashers for Swift (be363c9)
  • Add: Big-endian SWAR backends (607dd14)
  • Add: Zero-copy JS wrapper for buffers (4bf2dd6)
  • Add: sz.fill_random for Python (3565f9b)
  • Add: Allow resetting the dispatch table (68172a0)
  • Add: Ephemeral GPU executors if no device is passed (6431900)
  • Add: Capabilities getters for SW (86c89b7)
  • Add: StringZillas for Rust draft (48a1120)
  • Add: Fingerprinting benchmarks (471649e)
  • Add: On GPU fingerprints in ...

Release v3.12.6

07 Aug 23:06

Release: v3.12.6 [skip ci]

Patch

  • Fix: Skip second half of AVX2 copy (a198630)
  • Fix: LUT API name (c78adca)

Release v3.12.5

18 Apr 20:43

Release: v3.12.5 [skip ci]

Patch

  • Make: Upgrade to newer CMake (25224bb)