Release v4.4: Case-Folding UTF-8 in AVX-512 · ashvardanian/StringZilla

To my knowledge, this is the first ever properly vectorized case-folding (aka .to_lower()) implementation compliant with Unicode (v17) and using SIMD (AVX-512 for Intel Ice Lake and newer). The results are remarkable across most languages, but it wasn't trivial to achieve. Unlike dense linear algebra workloads, such as in SimSIMD, no shared logic holds across all languages and code points here. After all, Unicode began in 1989 and covers languages and writing systems that took thousands of years to develop and decades to be organized into a standardized set of rules.

This implementation focuses on locale-independent conversion. It covers every one of the 1000+ folding rules in the Unicode CaseFolding.txt file, including:

simple cases, like ASCII English letters: 'A' → 'a'.
complex Latin extensions, where one codepoint expands into multiple characters: 'ẞ' → "ss".
ligatures and mathematical symbols, like 'ﬃ' → "ffi".
less common bicameral alphabets, including Armenian, Georgian, Vietnamese, and others.
fast memcpy-like paths for unicameral scripts, like Chinese, Japanese, and Korean.
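
The categories above can be reproduced with Python's built-in `str.casefold()`, which also applies locale-independent full case folding. This is only an illustration of the rules, not StringZilla's API or kernels; the helper `folded_utf8_lengths` is a hypothetical name introduced here. It also shows why folding is not the same as lowercasing, and that the UTF-8 byte length can change, which is part of what makes an in-register implementation hard.

```python
# Illustration with Python's standard library only (assumption: this mirrors
# the Unicode full case-folding rules; it does not call StringZilla).
examples = {
    "A": "a",            # simple ASCII letter
    "\u1E9E": "ss",      # LATIN CAPITAL LETTER SHARP S expands to two characters
    "\uFB03": "ffi",     # LATIN SMALL LIGATURE FFI expands to three characters
    "\u0531": "\u0561",  # ARMENIAN CAPITAL LETTER AYB -> small letter ayb
    "\u6F22\u5B57": "\u6F22\u5B57",  # unicameral CJK text is left unchanged
}


def folded_utf8_lengths(text):
    """Return (bytes before, bytes after) folding, both measured in UTF-8."""
    folded = text.casefold()
    return len(text.encode("utf-8")), len(folded.encode("utf-8"))


for source, expected in examples.items():
    folded = source.casefold()
    assert folded == expected
    before, after = folded_utf8_lengths(source)
    print(f"{source!r} -> {folded!r}: {before} -> {after} UTF-8 bytes")

# Folding differs from lowercasing: the capital sharp S lowercases to a single
# code point (U+00DF), but folds to the two-letter sequence "ss".
assert "\u1E9E".lower() == "\u00DF"
assert "\u00DF".casefold() == "ss"
```

Note that the capital sharp S shrinks from 3 bytes to 2 under folding, so the output of a folding routine cannot be assumed to have the same length as its input.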

To benchmark all of those, I've extended the StringWars benchmarks with new bench_unicode.rs and bench_unicode.py scripts, plus a bench_unicode.md report produced for two dozen datasets pulled from the Leipzig Wikipedia corpora. On most languages the performance is great, except for Georgian and Vietnamese for now, where StringZilla is on par with or slower than the standard library (🦀 is Rust, 🐍 is Python):

Language	Standard 🦀	StringZilla 🦀	Speedup	Standard 🐍	StringZilla 🐍	Speedup
English 🇬🇧	482 MB/s	7.53 GB/s	16x	257 MB/s	3.14 GB/s	12x
German 🇩🇪	432 MB/s	2.59 GB/s	6x	260 MB/s	1.81 GB/s	7x
Russian 🇷🇺	217 MB/s	2.20 GB/s	10x	470 MB/s	1.56 GB/s	3x
French 🇫🇷	346 MB/s	1.84 GB/s	5x	274 MB/s	1.37 GB/s	5x
Greek 🇬🇷	220 MB/s	1.00 GB/s	5x	431 MB/s	779 MB/s	2x
Armenian 🇦🇲	223 MB/s	908 MB/s	4x	470 MB/s	746 MB/s	2x
Vietnamese 🇻🇳	265 MB/s	352 MB/s	1x	340 MB/s	291 MB/s	1x
Arabic 🇸🇦	232 MB/s	1004 MB/s	4x	467 MB/s	1.80 GB/s	4x
Bengali 🇧🇩	314 MB/s	6.17 GB/s	20x	694 MB/s	2.91 GB/s	4x
Chinese 🇨🇳	325 MB/s	1.21 GB/s	4x	697 MB/s	886 MB/s	1x
Czech 🇨🇿	322 MB/s	827 MB/s	3x	292 MB/s	688 MB/s	2x
Dutch 🇳🇱	471 MB/s	4.73 GB/s	10x	262 MB/s	2.97 GB/s	11x
Farsi 🇮🇷	235 MB/s	858 MB/s	4x	475 MB/s	1.42 GB/s	3x
Georgian 🇬🇪	294 MB/s	192 MB/s	1x	689 MB/s	488 MB/s	1x
Hebrew 🇮🇱	233 MB/s	1.01 GB/s	4x	473 MB/s	1.86 GB/s	4x
Hindi 🇮🇳	293 MB/s	6.32 GB/s	22x	682 MB/s	3.14 GB/s	5x
Italian 🇮🇹	439 MB/s	2.29 GB/s	5x	268 MB/s	1.93 GB/s	7x
Japanese 🇯🇵	330 MB/s	3.51 GB/s	11x	726 MB/s	2.00 GB/s	3x
Korean 🇰🇷	314 MB/s	861 MB/s	3x	623 MB/s	2.80 GB/s	4x
Lithuanian 🇱🇹	352 MB/s	864 MB/s	2x	274 MB/s	728 MB/s	3x
Polish 🇵🇱	364 MB/s	939 MB/s	3x	277 MB/s	786 MB/s	3x
Portuguese 🇧🇷	395 MB/s	2.38 GB/s	6x	270 MB/s	1.79 GB/s	7x
Spanish 🇪🇸	414 MB/s	2.38 GB/s	6x	272 MB/s	1.80 GB/s	7x
Tamil 🇮🇳	306 MB/s	6.05 GB/s	20x	712 MB/s	3.03 GB/s	4x
Turkish 🇹🇷	326 MB/s	852 MB/s	3x	284 MB/s	706 MB/s	2x
Ukrainian 🇺🇦	217 MB/s	2.09 GB/s	10x	476 MB/s	1.58 GB/s	3x

For a complete comparison, go to StringWars 😉
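
For readers who want a rough local baseline, the throughput and speedup figures in the table can be approximated with a few lines of Python. This is a minimal sketch and not the StringWars harness: the function names `throughput_mb_per_s` and `speedup`, the sample text, and the choice of `str.casefold` as the baseline are all assumptions made for illustration.

```python
import time


def throughput_mb_per_s(text, fold=str.casefold, repeats=5):
    """Best-of-N throughput of `fold` over `text`, in MB/s of UTF-8 input."""
    size_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fold(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return size_bytes / max(best, 1e-9) / 1e6


def speedup(baseline_mb_s, candidate_mb_s):
    """Ratio of candidate to baseline throughput; below 1.0 means slower."""
    return candidate_mb_s / baseline_mb_s


# Hypothetical sample text; the real benchmarks use Leipzig Wikipedia corpora.
sample = "Stra\u00DFe und GR\u00D6SSE " * 100_000
print(f"str.casefold baseline: {throughput_mb_per_s(sample):.0f} MB/s")

# Ratios recomputed from two Rust rows of the table above (values in MB/s).
print(f"English:  {speedup(482, 7530):.2f}x")
print(f"Georgian: {speedup(294, 192):.2f}x")
```

Recomputing the ratios this way is consistent with the speedup columns being rounded to the nearest whole number, which is why a row where StringZilla is currently slower, such as Georgian in Rust, still reads as 1x.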

Minor

Add: Fast path for Georgian case-folding (fa7422c)
Add: Case-insensitive ops for Python (d88e30a)
Add: Dispatch case-insensitive search (4ae91c0)
Add: Serial case-insensitive find & compare (4b18f05)

Patch

Fix: Eszett hex parsing warnings in Clang (8b27080)
Fix: Avoid __builtin missing on MSVC (fdc95f3)
Fix: Uninitialized values warning (b84c83e)
Improve: Safer & faster case-folding on Ice Lake (bcd5d16)
Improve: Case-folding on Ice Lake (bb23b60)
Fix: Move Ice Lake kernels out of Haswell scope (b7cc2c4)
Improve: Rename functions towards utf8_case* (44fbb92)
Improve: Faster serial Unicode folding (aa1b21b)
Improve: Re-group folding by char-length (c3586e2)
Docs: Avoid locale-specific Unicode rules (333a778)
Docs: Emoji-free doc section titles (#284) (dc11b40)
