Releases: ashvardanian/StringZilla

v3.4.0

02 Mar 06:22

3.4.0 (2024-03-02)

Add

  • Strs.sample() functionality (0e5c2f5)
  • Lazy iterators for Python (3b6cddd)
  • Python slices with steps for Strs (fd48df9)
  • Similarity measures for Rust (30398bc)
  • Port randomize and sz_generate to Rust (c35a832)

Fix

  • split_iter(..., keepseparator=True) (3f9f197)
  • Str() in Str() checks in Python (f8d59d9)
  • Handle NULL PRNGs (6998bcf)
  • Missing pytest.mark.skipif for NumPy and Arrow (40eb12d)
  • No return in void functions in C 99 (6798c4e)
  • Syntax issues (34997a3)

Improve

  • Dynamic-dispatch for sz_generate (1c813e4)
  • Faster rich comparisons (65564b9)
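
The lazy iterators and the split_iter(..., keepseparator=True) fix listed above are easier to picture with a small example. The following is a pure-Python sketch of the idea only, not StringZilla's code: the function reuses the name split_iter for readability, and it assumes that keepseparator leaves the separator attached to the end of each piece, which is an assumption about the intended behavior.

```python
from typing import Iterator


def split_iter(text: str, separator: str, keepseparator: bool = False) -> Iterator[str]:
    """Lazily yield the pieces of `text` delimited by `separator`.

    Hypothetical stand-in for illustration; not the StringZilla binding.
    Assumption: with keepseparator=True the separator stays at the end
    of the piece that precedes it.
    """
    if not separator:
        raise ValueError("empty separator")
    start = 0
    while True:
        found = text.find(separator, start)
        if found == -1:
            # Trailing piece: nothing follows it, so it carries no separator.
            yield text[start:]
            return
        end = found + len(separator) if keepseparator else found
        yield text[start:end]
        start = found + len(separator)


pieces = split_iter("a,b,,c", ",", keepseparator=True)
print(next(pieces))  # a,   (only the first piece has been computed so far)
print(list(pieces))  # ['b,', ',', 'c']
```

Because the function is a generator, no list of pieces is allocated unless the caller asks for one, and with keepseparator=True joining the pieces reproduces the input exactly.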

v3.3.1

27 Feb 23:10

3.3.1 (2024-02-27)

Docs

Fix

  • sz_generate out-of-bounds (949fe42)

v3.3.0

24 Feb 05:37

3.3.0 (2024-02-24)

Add

  • CLI, offset_within, write_to (4c738ea)

Docs

Fix

  • Missing STL-compat imports (ec8abc2)
  • popcount & unaligned loads on Win32 (a975c16)

Improve

  • sz_find_neon for different lengths (9ce74c7)
  • Anomaly selection strategy for UTF8 (5be069d)

Make

  • Enable SIMD for Rust Crates (81fe9f9)
  • Extend CPython labels for visibility (92e4bc6)
  • iOS, tvOS, watchOS builds (65edb67), closes #86
  • Package CLI for PyPI (657416e)
  • Test on Alpine & Windows (84d78c7)
  • Use NEON in AArch64 crates (fe0cfbd)

v3.2.0

19 Feb 00:59

3.2.0 (2024-02-19)

Add

Fix

  • SZ_NULL type cast on Windows (e3dedc1)
  • Alpine builds without ASAN (85f69ab)
  • Missing __SIZE_TYPE__ in MSVC (3e4ee7a)
  • Missing <numeric> include for iota (9c3dae7)
  • Missing uniform char distribution in STL (e21cbf1)
  • MSVC internal compiler error (ccb7dac)
  • popcount & unaligned loads on Win32 (3e124e8)

Improve

  • Avoid implicit builtins (de5d38e)
  • Default to misaligned loads on x86 (edcf9aa)
  • Pass bounds to Levenshtein API (57a5c12)
  • SWAR for sz_equal_serial (b63622c)
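
For readers unfamiliar with the term, SWAR ("SIMD within a register") means processing several bytes at once inside an ordinary machine word. The sketch below shows the idea for an equality check in Python, comparing eight bytes per step. It is only an illustration under assumptions: equal_swar is a hypothetical name, Python integers stand in for 64-bit registers, and the actual sz_equal_serial in C may be structured differently.

```python
def equal_swar(a: bytes, b: bytes) -> bool:
    """Compare two byte strings one 8-byte word at a time.

    Hypothetical illustration of the SWAR idea; not the library's code.
    """
    if len(a) != len(b):
        return False
    n = len(a)
    full = n - n % 8  # number of bytes covered by whole 8-byte words
    for i in range(0, full, 8):
        # Load 8 bytes from each input as a single 64-bit integer.
        word_a = int.from_bytes(a[i:i + 8], "little")
        word_b = int.from_bytes(b[i:i + 8], "little")
        # XOR is non-zero if any of the 8 byte lanes differs.
        if word_a ^ word_b:
            return False
    # Fewer than 8 bytes remain: fall back to a byte-by-byte check.
    for i in range(full, n):
        if a[i] != b[i]:
            return False
    return True


print(equal_swar(b"hello, stringzilla", b"hello, stringzilla"))  # True
print(equal_swar(b"hello, stringzilla", b"hello, stringzillA"))  # False
```

In C the same pattern replaces eight byte comparisons and branches with one load, one XOR, and one branch per word, which is where the speedup over a naive loop would come from.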

Make

v3.1.2

18 Feb 05:44

3.1.2 (2024-02-18)

Make

v3.1.1

15 Feb 00:57

3.1.1 (2024-02-15)

Make

  • Rename wheels to avoid conflicts (0690e2e)

v3.1: Extreme Compatibility with big-endian and 32-bit platforms, and UTF8 support in Levenshtein distances

15 Feb 00:37

StringZilla v3: with bindings for C++, Rust, and Swift, AVX-512 acceleration, Levenshtein distances & Needleman-Wunsch scores, faster sorting and rolling fingerprints

06 Feb 23:46

This is the largest StringZilla release to date 🥳
It includes, among other things:

  • 🤝 Mostly STL-compatible sz::string and sz::string_view with a superset of C++20 features back-ported to C++11.
  • 🦥 Lazily-evaluated ranges, avoiding memory allocations for bulk-search and split operations.
  • ⏪ Character-set search and symmetric reverse-order APIs for all search interfaces.
  • 🧬 String-similarity measures, like the Levenshtein distance and Needleman-Wunsch scores for bioinformatics.
  • 🏎️ AVX-512 backend, faster sorting, and a bunch of other performance improvements.
  • 🔀 Runtime-dispatch system to select the fastest SIMD implementation from a precompiled library.
  • 🛠️ Improved stability and test coverage, thanks to @kmapb!
  • 🍏 First bindings for Swift, thanks to @vmanot!
  • 🦀 First bindings for Rust, thanks to @michaelgrigoryan25!

...and of course, a rant about STL, and a new meme. Check out the README.md for a much longer list of features, benchmarks, algorithmic design decisions, and open questions 🤗
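
As a reminder of what the Levenshtein distance mentioned in the list above measures, here is a minimal pure-Python sketch of the classic two-row dynamic-programming (Wagner-Fischer) recurrence. It is not StringZilla's implementation or API: the name edit_distance is a hypothetical stand-in, and the sketch compares Unicode code points rather than raw bytes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, keeping two rows of the DP matrix.

    Hypothetical reference sketch; not the library's accelerated kernel.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, so rows stay short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("flaw", "lawn"))       # 2
```

The sketch takes time proportional to the product of the two lengths and memory proportional to the shorter one, which is the baseline that vectorized implementations try to improve on.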

Throughput Benchmarks

find the first occurrence of a random word from text, ≅ 5 bytes long

  • LibC, strstr [1]: x86: 7.4 · arm: 2.0 GB/s
  • C++ Standard, .find: x86: 2.9 · arm: 1.6 GB/s
  • Python, .find: x86: 1.1 · arm: 0.6 GB/s
  • StringZilla, sz_find: x86: 10.6 · arm: 7.1 GB/s

find the last occurrence of a random word from text, ≅ 5 bytes long

  • C++ Standard, .rfind: x86: 0.5 · arm: 0.4 GB/s
  • Python, .rfind: x86: 0.9 · arm: 0.5 GB/s
  • StringZilla, sz_rfind: x86: 10.8 · arm: 6.7 GB/s

find the first occurrence of any of 6 whitespaces [2]

  • LibC, strcspn [1]: x86: 0.74 · arm: 0.29 GB/s
  • C++ Standard, .find_first_of: x86: 0.25 · arm: 0.23 GB/s
  • Python, re.finditer: x86: 0.06 · arm: 0.02 GB/s
  • StringZilla, sz_find_charset: x86: 0.43 · arm: 0.23 GB/s

find the last occurrence of any of 6 whitespaces [2]

  • C++ Standard, .find_last_of: x86: 0.25 · arm: 0.25 GB/s
  • StringZilla, sz_rfind_charset: x86: 0.43 · arm: 0.23 GB/s

generate a random string from the given alphabet, 20 bytes long [5]

  • LibC, rand() % n: x86: 18.0 · arm: 9.4 MB/s
  • C++ Standard, uniform_int_distribution: x86: 47.2 · arm: 20.4 MB/s
  • Python, join(random.choices(...)): x86: 13.3 · arm: 5.9 MB/s
  • StringZilla, sz_generate: x86: 56.2 · arm: 25.8 MB/s

compute the sorting permutation, ≅ 8 million English words [6]

  • LibC, qsort_r: x86: 3.55 · arm: 5.77 s
  • C++ Standard, std::sort: x86: 2.79 · arm: 4.02 s
  • Python, numpy.argsort: x86: 7.58 · arm: 13.00 s
  • StringZilla, sz_sort: x86: 1.91 · arm: 2.37 s

Levenshtein edit distance, ≅ 5 bytes long

  • Python, via jellyfish [3]: x86: 1,550 · arm: 2,220 ns
  • StringZilla, sz_edit_distance: x86: 99 · arm: 180 ns

Needleman-Wunsch alignment scores, ≅ 10 K amino acids long

  • Python, via biopython [4]: x86: 257 · arm: 367 ms
  • StringZilla, sz_alignment_score: x86: 73 · arm: 177 ms
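
Two of the workloads above may be unfamiliar: generating a random string from an alphabet, and computing a sorting permutation (the order of indices that would sort the input, rather than the sorted input itself). The sketch below shows both in plain Python. The random-string helper follows the join(random.choices(...)) baseline named in the benchmark; the permutation helper uses the standard library as a stand-in for numpy.argsort, which is an assumption made to keep the example dependency-free. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def random_string(alphabet: str, length: int, rng: random.Random) -> str:
    """Draw `length` characters from `alphabet`, with replacement."""
    return "".join(rng.choices(alphabet, k=length))


def sorting_permutation(words: Sequence[str]) -> List[int]:
    """Return the indices that would put `words` in ascending order."""
    return sorted(range(len(words)), key=words.__getitem__)


rng = random.Random(42)  # fixed seed so repeated runs draw the same string
sample = random_string("ACGT", 20, rng)
print(len(sample))  # 20

words = ["pear", "apple", "fig"]
order = sorting_permutation(words)
print(order)                      # [1, 2, 0]
print([words[i] for i in order])  # ['apple', 'fig', 'pear']
```

Returning a permutation instead of moving the strings lets the caller reorder several parallel arrays with one sort.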

Supported Functionality

Functionality · Maturity · C 99 · C++ 11 · Python · Swift · Rust
Substring Search · 🌳
Character Set Search · 🌳
Edit Distance · 🧐
Small String Class · 🧐
Sorting & Sequence Operations · 🚧
Lazy Ranges, Compressed Arrays · 🧐
Hashes & Fingerprints · 🚧

Note

Current StringZilla design assumes little-endian architecture, ASCII or UTF-8 encoding, and 64-bit address space.
This covers most modern CPUs, including x86, Arm, RISC-V.
Feel free to open an issue if you need support for other architectures.

🌳 parts are used in production.
🧐 parts are in beta.
🚧 parts are under active development, and are likely to break in subsequent releases.

Quick...

v2.0.4

04 Jan 21:54

2.0.4 (2024-01-04)

Fix

v2.0.3

19 Nov 01:39

2.0.3 (2023-11-19)

Fix

  • Returning NULL without setting the error (922d7c5)