Releases: ashvardanian/StringZilla
v2.0.2
v2.0.1
v2: 5x swifter CPython bindings and first NodeJS bindings
Python
So why would anyone replace the easy-to-use PyBind11 with almost 2,000 lines of pure CPython bindings?! Of course, to lower the latency! PyBind11 wraps every C++ object with a smart pointer, puts a hash table next to it, and addresses function pointers with std::string key lookups 🤯
Let's see where that gets us when benchmarking on the "Leipzig1M" dataset. The bandwidth-oriented functions are just as fast as before:
- Hashing the dataset: 77 ms for 🐍 vs 16 ms for 🦖 ~ 4.5x faster
- Counting the number of "the": 151 ms for 🐍 vs 45 ms for 🦖 ~ 3.3x faster
- Split all whitespace-delimited words: 782 ms for 🐍 vs 338 ms for 🦖 ~ 2.3x faster
- Split around every "the": 240 ms for 🐍 vs 48 ms for 🦖 ~ 5x faster
What about the latency-oriented ones?
- Find the first whitespace: 1 µs for 🐍 vs 3 µs for 🦖 ~ 3x slower, down from 15 µs and 15x slower previously
- Partition around the first whitespace: 73 ms for 🐍 vs 33 µs for 🦖 ~ 2212x faster 🥳
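To get a feel for the difference between the two groups of numbers, here is a minimal timing sketch using only native Python `str` methods. It is not the benchmark used for the figures above: the `best_time` helper and the synthetic `text` are hypothetical stand-ins for illustration, and the "Leipzig1M" dataset is not loaded here.

```python
import timeit

def best_time(fn, repeats=5, number=1000):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(fn, repeat=repeats, number=number)
    return min(timings) / number

# Hypothetical stand-in for a large dataset: about 4.4 MB of words.
text = "the quick brown fox jumps over the lazy dog " * 100_000

# Latency-oriented: the match sits at the very start of the string,
# so the cost of the call itself dominates the measurement.
find_latency = best_time(lambda: text.find(" "))

# Native str.partition copies the remainder of the string into a new
# object, so its cost grows with the input even for an early match.
partition_latency = best_time(lambda: text.partition(" "), number=20)

# Bandwidth-oriented: the whole buffer is scanned on every call.
count_time = best_time(lambda: text.count("the"), number=5)

print(f"find first whitespace: {find_latency * 1e6:.2f} µs")
print(f"partition:             {partition_latency * 1e6:.2f} µs")
print(f"count 'the':           {count_time * 1e3:.2f} ms")
```

The same helper could be pointed at the corresponding methods of another string class to compare per-call overhead against throughput, assuming those methods mirror the native `str` signatures.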
JavaScript
In an effort to bring faster string operations to JavaScript, we have started a NodeJS binding together with @nairihar. It's just a skeleton and performs poorly for now, but you can use it as a starting point to help us implement a faster Str class for JavaScript 🤗
v1.2.2
v1.2.1
v1.2.0
v1.1.3
v1.1.2
v1.1.1
v1.1.0
1.1.0 (2023-08-06)
New Functionality
Do you want to work with large arrays of separate strings? There is a way! The following code is now valid:
from stringzilla import Str, File, Strs
text: Str = Str('... very large string or file ...')
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
lines.append(shuffled_copy.pop(0))
lines.append('Pythonic string')
lines.extend(shuffled_copy)
Performance
You can expect even those trivial operations to be 8x faster than native Python 🤯
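For comparison, the same sequence of operations can be written against a plain Python `list` of `str`. This is only a sketch of what a native baseline might look like, not the benchmark behind the 8x figure. The sample `text` is made up, and a seeded `random.Random` is assumed to be a fair counterpart to `shuffle(seed=42)`; the resulting order is not expected to match StringZilla's.

```python
import random

# Hypothetical small input; a real comparison would use a large file.
text = "banana\ncherry\napple\ndate"
lines = text.split("\n")

# In-place sort and seeded in-place shuffle on a native list.
lines.sort()
random.Random(42).shuffle(lines)

# Copies that leave `lines` untouched.
sorted_copy = sorted(lines)
shuffled_copy = random.Random(42).sample(lines, k=len(lines))

# The usual list mutations.
lines.append(shuffled_copy.pop(0))
lines.append("Pythonic string")
lines.extend(shuffled_copy)

print(len(lines), sorted_copy)
```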