v4.4: Case-Folding UTF-8 in AVX-512
To my knowledge, this is the first properly vectorized case-folding (aka .to_lower()) implementation that is compliant with Unicode (v17) and uses SIMD (AVX-512 on Intel Ice Lake and newer). The results are remarkable across most languages, but getting there wasn't trivial. Unlike dense linear algebra workloads, such as those in SimSIMD, no shared logic holds across all languages and code points here. After all, Unicode began in 1989 and covers languages and writing systems that took thousands of years to develop and decades to organize into a standardized set of rules.
This implementation focuses on locale-independent conversion. It covers every one of the 1000+ character folding rules in CaseFolding.txt of the Unicode spec, including:
- simple cases, like ASCII English letters: 'A' → 'a'.
- complex Latin extensions, where one codepoint expands into multiple characters: 'ẞ' → "ss".
- ligatures and mathematical symbols, like 'ﬃ' → "ffi".
- less common bicameral alphabets, including Armenian, Georgian, Vietnamese, and others.
- fast `memcpy`-like paths for unicameral scripts, like Chinese, Japanese, and Korean.
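As a quick reference for what these rule classes look like in practice, here is a minimal sketch using Python's built-in `str.casefold()`, which applies the locale-independent full folding rules of the Unicode version bundled with the interpreter. It is the standard library, not StringZilla's API, and the helper `utf8_len` is a hypothetical name introduced only for illustration. It also shows why the folding cannot be done in place: the UTF-8 byte length may shrink or grow.

```python
# Reference behaviour from Python's standard library, not StringZilla's API.
# Each key folds to the value under locale-independent full case folding.
examples = {
    "A": "a",                          # simple ASCII
    "\u1E9E": "ss",                    # 'ẞ' expands into two code points
    "\uFB03": "ffi",                   # 'ﬃ' ligature expands into three
    "\u0531": "\u0561",                # Armenian 'Ա' → 'ա'
    "\u0390": "\u03B9\u0308\u0301",    # Greek 'ΐ' → 'ι' + two combining marks
    "\u0130": "i\u0307",               # 'İ' without Turkish-specific rules
    "中文": "中文",                     # unicameral script stays unchanged
}


def utf8_len(text: str) -> int:
    """Hypothetical helper: number of bytes in the UTF-8 encoding."""
    return len(text.encode("utf-8"))


for source, expected in examples.items():
    folded = source.casefold()
    assert folded == expected, (source, folded)
    print(f"{source!r} -> {folded!r}: {utf8_len(source)} -> {utf8_len(folded)} bytes")
```

Note that 'ẞ' shrinks from 3 to 2 bytes while 'ΐ' grows from 2 to 6, so an output buffer has to be sized for the worst case rather than for the input length.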
To benchmark all of those, I've extended the StringWars benchmarks with the new bench_unicode.rs and bench_unicode.py scripts, plus a bench_unicode.md report covering two dozen datasets pulled from the Leipzig Wikipedia corpora. Throughput improves on most languages; Georgian and Vietnamese are the exceptions for now, staying roughly on par with or below the standard library:
| Language | Standard 🦀 | StringZilla 🦀 | Speedup | Standard 🐍 | StringZilla 🐍 | Speedup |
|---|---|---|---|---|---|---|
| English 🇬🇧 | 482 MB/s | 7.53 GB/s | 16x | 257 MB/s | 3.14 GB/s | 12x |
| German 🇩🇪 | 432 MB/s | 2.59 GB/s | 6x | 260 MB/s | 1.81 GB/s | 7x |
| Russian 🇷🇺 | 217 MB/s | 2.20 GB/s | 10x | 470 MB/s | 1.56 GB/s | 3x |
| French 🇫🇷 | 346 MB/s | 1.84 GB/s | 5x | 274 MB/s | 1.37 GB/s | 5x |
| Greek 🇬🇷 | 220 MB/s | 1.00 GB/s | 5x | 431 MB/s | 779 MB/s | 2x |
| Armenian 🇦🇲 | 223 MB/s | 908 MB/s | 4x | 470 MB/s | 746 MB/s | 2x |
| Vietnamese 🇻🇳 | 265 MB/s | 352 MB/s | 1x | 340 MB/s | 291 MB/s | 1x |
| Arabic 🇸🇦 | 232 MB/s | 1004 MB/s | 4x | 467 MB/s | 1.80 GB/s | 4x |
| Bengali 🇧🇩 | 314 MB/s | 6.17 GB/s | 20x | 694 MB/s | 2.91 GB/s | 4x |
| Chinese 🇨🇳 | 325 MB/s | 1.21 GB/s | 4x | 697 MB/s | 886 MB/s | 1x |
| Czech 🇨🇿 | 322 MB/s | 827 MB/s | 3x | 292 MB/s | 688 MB/s | 2x |
| Dutch 🇳🇱 | 471 MB/s | 4.73 GB/s | 10x | 262 MB/s | 2.97 GB/s | 11x |
| Farsi 🇮🇷 | 235 MB/s | 858 MB/s | 4x | 475 MB/s | 1.42 GB/s | 3x |
| Georgian 🇬🇪 | 294 MB/s | 192 MB/s | 1x | 689 MB/s | 488 MB/s | 1x |
| Hebrew 🇮🇱 | 233 MB/s | 1.01 GB/s | 4x | 473 MB/s | 1.86 GB/s | 4x |
| Hindi 🇮🇳 | 293 MB/s | 6.32 GB/s | 22x | 682 MB/s | 3.14 GB/s | 5x |
| Italian 🇮🇹 | 439 MB/s | 2.29 GB/s | 5x | 268 MB/s | 1.93 GB/s | 7x |
| Japanese 🇯🇵 | 330 MB/s | 3.51 GB/s | 11x | 726 MB/s | 2.00 GB/s | 3x |
| Korean 🇰🇷 | 314 MB/s | 861 MB/s | 3x | 623 MB/s | 2.80 GB/s | 4x |
| Lithuanian 🇱🇹 | 352 MB/s | 864 MB/s | 2x | 274 MB/s | 728 MB/s | 3x |
| Polish 🇵🇱 | 364 MB/s | 939 MB/s | 3x | 277 MB/s | 786 MB/s | 3x |
| Portuguese 🇧🇷 | 395 MB/s | 2.38 GB/s | 6x | 270 MB/s | 1.79 GB/s | 7x |
| Spanish 🇪🇸 | 414 MB/s | 2.38 GB/s | 6x | 272 MB/s | 1.80 GB/s | 7x |
| Tamil 🇮🇳 | 306 MB/s | 6.05 GB/s | 20x | 712 MB/s | 3.03 GB/s | 4x |
| Turkish 🇹🇷 | 326 MB/s | 852 MB/s | 3x | 284 MB/s | 706 MB/s | 2x |
| Ukrainian 🇺🇦 | 217 MB/s | 2.09 GB/s | 10x | 476 MB/s | 1.58 GB/s | 3x |
For a complete comparison, go to StringWars 😉
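To make the units in the table concrete, here is a rough sketch of how a throughput figure and a speedup factor can be derived. It is not the StringWars harness: `throughput_mb_per_s` and `speedup` are hypothetical helpers, the sample text is synthetic, and only Python's built-in `str.casefold` is timed. Rounding to the nearest integer is an assumption that happens to reproduce the factors listed above.

```python
import time


def throughput_mb_per_s(fold, text: str, repeats: int = 5) -> float:
    """Hypothetical helper: best-of-N throughput in MB/s of UTF-8 input."""
    payload_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fold(text)
        best = min(best, time.perf_counter() - start)
    return payload_bytes / best / 1e6


def speedup(baseline_mb_s: float, candidate_mb_s: float) -> int:
    """Hypothetical helper: ratio rounded to the nearest integer."""
    return round(candidate_mb_s / baseline_mb_s)


# Synthetic mixed-script sample, not one of the Leipzig datasets.
sample = "Straße ΑΒΓ Привет " * 100_000
baseline = throughput_mb_per_s(str.casefold, sample)
print(f"str.casefold: {baseline:.0f} MB/s")

# English row of the Rust columns: 482 MB/s vs 7.53 GB/s.
print(f"English speedup: {speedup(482, 7530)}x")
```

Absolute numbers from a sketch like this depend heavily on the machine and the text mix, so they are not comparable to the table directly.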
Minor
- Add: Fast path for Georgian case-folding (fa7422c)
- Add: Case-insensitive ops for Python (d88e30a)
- Add: Dispatch case-insensitive search (4ae91c0)
- Add: Serial case-insensitive find & compare (4b18f05)
Patch
- Fix: Eszett hex parsing warnings in Clang (8b27080)
- Fix: Avoid `__builtin` missing on MSVC (fdc95f3)
- Fix: Uninitialized values warning (b84c83e)
- Improve: Safer & faster case-folding on Ice Lake (bcd5d16)
- Improve: Case-folding on Ice Lake (bb23b60)
- Fix: Move Ice Lake kernels out of Haswell scope (b7cc2c4)
- Improve: Rename functions towards `utf8_case*` (44fbb92)
- Improve: Faster serial Unicode folding (aa1b21b)
- Improve: Re-group folding by char-length (c3586e2)
- Docs: Avoid locale-specific Unicode rules (333a778)
- Docs: Emoji-free doc section titles (#284) (dc11b40)