Releases: ashvardanian/StringZilla
v3.4.0
3.4.0 (2024-03-02)
Add
- `Strs.sample()` functionality (0e5c2f5)
- Lazy iterators for Python (3b6cddd)
- Python slices with steps for `Strs` (fd48df9)
- Similarity measures for Rust (30398bc)
feat
- port randomize and sz_generate to Rust (c35a832)
Fix
- `split_iter(..., keepseparator=True)` (3f9f197)
- `Str() in Str()` checks in Python (f8d59d9)
- Handle `NULL` PRNGs (6998bcf)
- Missing `pytest.mark.skipif` for NumPy and Arrow (40eb12d)
- no `return` in `void` funcs in C 99 (6798c4e)
- Syntax issues (34997a3)
Improve
v3.3.1
v3.3.0
v3.2.0
3.2.0 (2024-02-19)
Add
- Hamming distances (4033d7b)
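
For readers unfamiliar with the measure, the following is a minimal pure-Python sketch of a Hamming distance over bytes. It is not StringZilla's implementation; the function name `hamming_distance` and the choice to reject inputs of different lengths are assumptions made for illustration.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        # Assumption: unequal lengths are rejected; a library may choose otherwise.
        raise ValueError("Hamming distance requires equal-length inputs")
    # Booleans sum as 0/1, so this counts mismatching byte pairs.
    return sum(x != y for x, y in zip(a, b))


distance = hamming_distance(b"karolin", b"kathrin")
```
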
Fix
- `SZ_NULL` type cast on Windows (e3dedc1)
- Alpine builds without ASAN (85f69ab)
- Missing `__SIZE_TYPE__` in MSVC (3e4ee7a)
- missing `<numeric>` include for `iota` (9c3dae7)
- missing uniform `char` distribution in STL (e21cbf1)
- MSVC internal compiler error (ccb7dac)
- popcount & unaligned loads on Win32 (3e124e8)
Improve
- Avoid implicit builtins (de5d38e)
- Default to misaligned loads on x86 (edcf9aa)
- Pass bounds to Levenshtein API (57a5c12)
- SWAR for `sz_equal_serial` (b63622c)
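
To illustrate why a bound is useful in a Levenshtein API, here is a rough pure-Python sketch that stops early once the distance can no longer fall below the bound. The function name `bounded_levenshtein` and the "return the bound when exceeded" convention are assumptions for this example, not the library's documented behavior.

```python
def bounded_levenshtein(a: str, b: str, bound: int) -> int:
    """Levenshtein distance capped at `bound`.

    Returns the exact distance when it is below `bound`, otherwise `bound`.
    """
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) >= bound:
        return bound
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        # Row minima never decrease, so the final distance cannot drop below this.
        if min(current) >= bound:
            return bound
        previous = current
    return min(previous[-1], bound)
```

With a small bound, dissimilar strings are rejected after a few rows instead of filling the whole matrix.
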
Make
- Add PyPy builds (e4e1f20)
v3.1.2
v3.1.1
v3.1: Extreme Compatibility with big-endian and 32-bit platforms, and UTF8 support in Levenshtein distances
Seems like StringZilla now supports more platforms than NumPy 🤯
Special thanks to @WillisMedwell for MSVC compatibility patches and to @michaelgrigoryan25 for new Rust interfaces 🤗
StringZilla v3: with bindings for C++, Rust, and Swift, AVX-512 acceleration, Levenshtein distances & Needleman-Wunsch scores, faster sorting and rolling fingerprints
This is the largest StringZilla release to date 🥳
It includes, among other things:
- 🤝 Mostly STL-compatible `sz::string` and `sz::string_view` with a superset of C++20 features back-ported to C++11.
- 🦥 Lazily-evaluated ranges, avoiding memory allocations for bulk-`search` and `split` operations.
- ⏪ Character-set search and symmetric reverse-order APIs for all search interfaces.
- 🧬 String-similarity measures, like the Levenshtein distance and Needleman-Wunsch scores for bioinformatics.
- 🏎️ AVX-512 backend, faster sorting, and a bunch of other performance improvements.
- 🔀 Runtime-dispatch system to select the fastest SIMD implementation from a precompiled library.
- 🛠️ Improved stability and test coverage, thanks to @kmapb!
- 🍏 First bindings for Swift, thanks to @vmanot!
- 🦀 First bindings for Rust, thanks to @michaelgrigoryan25!
...and of course, a rant about STL, and a new meme. Check out the README.md for a much longer list of features, benchmarks, algorithmic design decisions, and open questions 🤗
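
The idea behind lazily-evaluated ranges can be sketched in plain Python with a generator. This is only an illustration of the concept, not StringZilla's API: `lazy_split` is a hypothetical helper, and unlike a view-based implementation it still copies each yielded piece.

```python
from typing import Iterator


def lazy_split(text: str, separator: str) -> Iterator[str]:
    """Yield the pieces of `text` one at a time, without building a full list."""
    if not separator:
        raise ValueError("empty separator")
    start = 0
    while True:
        end = text.find(separator, start)
        if end == -1:
            # No more separators: the remainder is the last piece.
            yield text[start:]
            return
        yield text[start:end]
        start = end + len(separator)


# Only the first piece is materialized; the rest of the text is never scanned.
first = next(lazy_split("alpha,beta,gamma", ","))
```
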
Throughput Benchmarks
| LibC | C++ Standard | Python | StringZilla |
|---|---|---|---|
| find the first occurrence of a random word from text, ≅ 5 bytes long | | | |
| `strstr` <sup>1</sup><br>x86: 7.4 · arm: 2.0 GB/s | `.find`<br>x86: 2.9 · arm: 1.6 GB/s | `.find`<br>x86: 1.1 · arm: 0.6 GB/s | `sz_find`<br>x86: 10.6 · arm: 7.1 GB/s |
| find the last occurrence of a random word from text, ≅ 5 bytes long | | | |
| ❌ | `.rfind`<br>x86: 0.5 · arm: 0.4 GB/s | `.rfind`<br>x86: 0.9 · arm: 0.5 GB/s | `sz_rfind`<br>x86: 10.8 · arm: 6.7 GB/s |
| find the first occurrence of any of 6 whitespaces <sup>2</sup> | | | |
| `strcspn` <sup>1</sup><br>x86: 0.74 · arm: 0.29 GB/s | `.find_first_of`<br>x86: 0.25 · arm: 0.23 GB/s | `re.finditer`<br>x86: 0.06 · arm: 0.02 GB/s | `sz_find_charset`<br>x86: 0.43 · arm: 0.23 GB/s |
| find the last occurrence of any of 6 whitespaces <sup>2</sup> | | | |
| ❌ | `.find_last_of`<br>x86: 0.25 · arm: 0.25 GB/s | ❌ | `sz_rfind_charset`<br>x86: 0.43 · arm: 0.23 GB/s |
| generate a random string from the given alphabet, 20 bytes long <sup>5</sup> | | | |
| `rand() % n`<br>x86: 18.0 · arm: 9.4 MB/s | `uniform_int_distribution`<br>x86: 47.2 · arm: 20.4 MB/s | `join(random.choices(...))`<br>x86: 13.3 · arm: 5.9 MB/s | `sz_generate`<br>x86: 56.2 · arm: 25.8 MB/s |
| compute the sorting permutation, ≅ 8 million English words <sup>6</sup> | | | |
| `qsort_r`<br>x86: 3.55 · arm: 5.77 s | `std::sort`<br>x86: 2.79 · arm: 4.02 s | `numpy.argsort`<br>x86: 7.58 · arm: 13.00 s | `sz_sort`<br>x86: 1.91 · arm: 2.37 s |
| Levenshtein edit distance, ≅ 5 bytes long | | | |
| ❌ | ❌ | via `jellyfish` <sup>3</sup><br>x86: 1,550 · arm: 2,220 ns | `sz_edit_distance`<br>x86: 99 · arm: 180 ns |
| Needleman-Wunsch alignment scores, ≅ 10 K aminoacids long | | | |
| ❌ | ❌ | via `biopython` <sup>4</sup><br>x86: 257 · arm: 367 ms | `sz_alignment_score`<br>x86: 73 · arm: 177 ms |
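
As a reference for what the Needleman-Wunsch row measures, below is a minimal pure-Python sketch of a global alignment score with a linear gap penalty. It is not how `sz_alignment_score` is implemented; the function name `alignment_score` and the default `match`, `mismatch`, and `gap` values are illustrative assumptions, whereas real bioinformatics workloads typically use a full substitution matrix.

```python
def alignment_score(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score, keeping only two rows."""
    # Aligning a prefix of `b` against an empty string costs one gap per character.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap]
        for j, cb in enumerate(b, start=1):
            substitution = match if ca == cb else mismatch
            current.append(max(previous[j - 1] + substitution,  # align ca with cb
                               previous[j] + gap,               # gap in b
                               current[j - 1] + gap))           # gap in a
        previous = current
    return previous[-1]
```

Keeping two rows instead of the full matrix reduces memory from quadratic to linear in the input length, which matters for inputs around 10 K characters.
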
Supported Functionality
| Functionality | Maturity | C 99 | C++ 11 | Python | Swift | Rust |
|---|---|---|---|---|---|---|
| Substring Search | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Character Set Search | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Edit Distance | 🧐 | ✅ | ✅ | ✅ | ✅ | ❌ |
| Small String Class | 🧐 | ✅ | ✅ | ❌ | ❌ | ❌ |
| Sorting & Sequence Operations | 🚧 | ✅ | ✅ | ✅ | ❌ | ❌ |
| Lazy Ranges, Compressed Arrays | 🧐 | ❌ | ✅ | ✅ | ❌ | ❌ |
| Hashes & Fingerprints | 🚧 | ✅ | ✅ | ❌ | ❌ | ❌ |
Note
Current StringZilla design assumes little-endian architecture, ASCII or UTF-8 encoding, and 64-bit address space.
This covers most modern CPUs, including x86, Arm, RISC-V.
Feel free to open an issue if you need support for other architectures.
🌳 parts are used in production.
🧐 parts are in beta.
🚧 parts are under active development, and are likely to break in subsequent releases.
