So why would anyone replace the easy-to-use PyBind11 with almost 2,000 lines of pure CPython bindings?! Of course, to lower the latency! PyBind11 wraps every C++ object with a smart pointer, puts a hash table next to it, and addresses function pointers with std::string key lookups 🤯

Let's see where that gets us when benchmarking on the "Leipzig1M" dataset. The bandwidth-oriented functions are just as fast as before:

Hashing the dataset: 77 ms for 🐍 vs 16 ms for 🦖 ~ 4.5x faster
Counting the number of "the": 151 ms for 🐍 vs 45 ms for 🦖 ~ 3.3x faster
Split all whitespace-delimited words: 782 ms for 🐍 vs 338 ms for 🦖 ~ 2.3x faster
Split around every "the": 240 ms for 🐍 vs 48 ms for 🦖 ~ 5x faster

What about the latency-oriented ones?

Find the first whitespace: 1 µs for 🐍 vs 3 µs for 🦖 ~ 3x slower, where previously it was 15 µs and 15x slower
Partition around the first whitespace: 73 ms for 🐍 vs 33 µs for 🦖 ~ 2212x faster 🥳
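
To get a feel for the difference between latency-oriented and bandwidth-oriented calls, here is a minimal timing sketch for the native Python side only. The `best_time` helper and the synthetic `text` are assumptions made for illustration; they are not the harness behind the numbers above, and the real dataset would have to be loaded in place of the repeated sentence.

```python
import timeit


def best_time(func, repeat=5, number=1000):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Synthetic stand-in for a large corpus (hypothetical, not "Leipzig1M").
text = "the quick brown fox jumps over the lazy dog " * 10_000

# Latency-oriented: the answer is found within the first few bytes.
find_s = best_time(lambda: text.find(" "))
# Native partition also copies the head and the tail into new strings.
partition_s = best_time(lambda: text.partition(" "))

# Bandwidth-oriented: the whole buffer has to be scanned.
count_s = best_time(lambda: text.count("the"), number=10)

print(f"find first whitespace:      {find_s * 1e6:10.2f} µs")
print(f"partition at 1st whitespace:{partition_s * 1e6:10.2f} µs")
print(f"count 'the':                {count_s * 1e6:10.2f} µs")
```

Native `str.partition` builds new string objects for both sides of the separator, so its cost grows with the length of the text even though the separator sits near the start. That is presumably why a view-based partition can win by such a wide margin.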

JavaScript

In an effort to bring faster string operations to JavaScript, together with @nairihar, we have started the NodeJS binding. It's just a skeleton and has poor performance for now, but you can use it as a starting point to help us implement a faster Str class for JavaScript 🤗

Contributors

nairihar

18 Sep 19:18

ashvardanian

v1.2.2

eaa53b6

v1.2.2

1.2.2 (2023-09-18)

Fix

Use different functions depending on arch (4f1414e)

18 Sep 19:11

ashvardanian

v1.2.1

e130dc5

v1.2.1

1.2.1 (2023-09-18)

Fix

strzl_sort_config_t symbol (78a2e80)

18 Sep 19:08

ashvardanian

v1.2.0

8818b73

v1.2.0

1.2.0 (2023-09-18)

Add

Baseline NodeJS binding (cbdd2c9)
Initial Levenstein distance (9710194)
Levenstein distance (24821de)
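
For readers unfamiliar with the metric named in these commits (usually spelled "Levenshtein"), here is a minimal pure-Python sketch of the classic dynamic-programming formulation. It is a reference for what the distance means, not the library's implementation, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # Distances between the empty prefix of `a` and every prefix of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use is proportional to the shorter string.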

Make

Publish StringZilla to NPM (54eb891)
Update release tasks (42f51cc)

31 Aug 11:23

ashvardanian

v1.1.3

616be6c

v1.1.3

1.1.3 (2023-08-31)

Make

Explicitly UTF-8 encoding on Windows (50d78ca)

31 Aug 11:14

ashvardanian

v1.1.2

be6d80d

v1.1.2

1.1.2 (2023-08-31)

Docs

Add development plans (f521fea)
Less sections (b4eefe8)
Make front page easier on the eye (b29ce03)
Refresh intro (54eda40)
Restructure groups (9994117)

Make

Fetch before rebase (4764048)

29 Aug 19:02

ashvardanian

v1.1.1

e011ce4

v1.1.1

1.1.1 (2023-08-29)

Docs

Add Apache 2.0 LICENSE (89cbce2)
Update [skip release] (33cb115)
Update LICENSE and table (6ce22bd)
Update table [skip release] (32acf5e)

Improve

Loading large files into memory (0cf1388)
Test reproducibility of the shuffle (50ac20a)

Make

Move Python bindings (ff93b16)

06 Aug 16:53

ashvardanian

v1.1.0

fab854d

v1.1.0

1.1.0 (2023-08-06)

New Functionality

Do you want to work with large arrays of separate strings? There is a way! The following code is now valid:

from stringzilla import Str, File, Strs

text: Str = Str('... very large string or file ...')
lines: Strs = text.split(separator='\n')
lines.sort()            # reorders the collection in place
lines.shuffle(seed=42)  # in-place shuffle with a fixed seed

# Counterparts that return a new collection instead
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)

# Collection-level append and extend
lines.append(shuffled_copy.pop(0))
lines.append('Pythonic string')
lines.extend(shuffled_copy)

Performance

You can expect even those trivial operations to be 8x faster than native Python 🤯
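
For comparison, the native Python baseline for the same sequence of operations might look like the sketch below. It uses only the standard library; the tiny four-line `text` is a placeholder chosen for illustration, and nothing here reproduces the benchmark behind the figure above.

```python
import random

# Placeholder input; a real comparison would use a very large text.
text = "banana\ncherry\napple\ndate"
lines = text.split("\n")

lines.sort()                      # in-place sort
random.Random(42).shuffle(lines)  # in-place shuffle with a fixed seed

sorted_copy = sorted(lines)       # new list, original order untouched
shuffled_copy = random.Random(42).sample(lines, k=len(lines))

lines.append(shuffled_copy.pop(0))
lines.append("Pythonic string")
lines.extend(shuffled_copy)

print(len(lines), sorted_copy)
```

Every element of a native list is a separate heap-allocated `str` object, which is one plausible source of the overhead that a contiguous collection of string views avoids.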

Add

Collection-level append, extend (9a2b357)
random shuffle for strings collections (36c1a58)

Fix

static_cast for Clang builds (bd0a671)
Counting substrings with allowoverlap (5234e8a)

Releases: ashvardanian/StringZilla
