Description
`norm_bench` compares the normalization performance of ICU4X against other implementations, including ICU4C. The test strings are relatively long (multiple memory pages). To avoid testing the `rust_icu` buffer sizing logic, you need the `rawdec` branch of my fork of `rust_icu`.
With #2378 applied, English UTF-16 NFC to NFC and NFD to NFD show ICU4X performing a bit better than ICU4C, so the Latin passthrough comparison works as intended. (Both ICU4X and ICU4C pass through based on comparing with a boundary value.)
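To make the passthrough comparison concrete, here is a minimal sketch of the boundary-value idea: code units below a per-form boundary are known to be normalization-invariant and are copied through without a trie lookup. The boundary constant and function name are illustrative assumptions, not the actual ICU4X code.

```rust
// Assumption for illustration: everything below U+0300 (start of combining
// diacritics) passes through for NFC. The real boundary logic is more nuanced.
const NFC_PASSTHROUGH_BOUND: u16 = 0x0300;

/// Length of the prefix of `src` (UTF-16 code units) that can be
/// passed through unchanged by comparing against a boundary value.
fn passthrough_prefix_len(src: &[u16], bound: u16) -> usize {
    src.iter().take_while(|&&u| u < bound).count()
}

fn main() {
    let ascii: Vec<u16> = "Hello, world".encode_utf16().collect();
    // Entirely below the boundary: the whole string passes through.
    assert_eq!(passthrough_prefix_len(&ascii, NFC_PASSTHROUGH_BOUND), ascii.len());

    let mixed: Vec<u16> = "e\u{0301}".encode_utf16().collect(); // 'e' + combining acute
    assert_eq!(passthrough_prefix_len(&mixed, NFC_PASSTHROUGH_BOUND), 1);
}
```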
However, since Hanzi is normalization-invariant, ICU4X and ICU4C should both stay on a fast-path loop with one trie lookup per character, each lookup yielding the trie's default value. Yet, ICU4X is slower than ICU4C for Chinese (UTF-16 NFC to NFC, UTF-16 NFD to NFD). The difference is not about trie type fast vs. small. With #2410 applied, the difference shouldn't be about ICU4C using macros for the trie fast-path.
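The expected Hanzi fast path can be sketched as follows. This is not the real ICU4X inner loop; the toy trie is a plain function and the ranges and values are assumptions, but it shows the shape of the loop: one lookup per BMP code unit, and the loop continues as long as the lookup returns the trie's default value.

```rust
const TRIE_DEFAULT: u32 = 0;

/// Toy stand-in for the code point trie. CJK Unified Ideographs resolve
/// to the default value; combining diacritics get a non-default value
/// (the value 1 is arbitrary for this sketch).
fn trie_get(c: u32) -> u32 {
    match c {
        0x4E00..=0x9FFF => TRIE_DEFAULT, // Hanzi: normalization-invariant
        0x0300..=0x036F => 1,            // combining marks: need real work
        _ => TRIE_DEFAULT,
    }
}

/// Length of the UTF-16 prefix that stays on the fast path.
fn fast_path_len(src: &[u16]) -> usize {
    let mut i = 0;
    while i < src.len() {
        let u = src[i];
        // Surrogates leave the fast path in this sketch; so does any
        // code unit whose trie value is not the default.
        if (0xD800..=0xDFFF).contains(&u) || trie_get(u as u32) != TRIE_DEFAULT {
            break;
        }
        i += 1;
    }
    i
}

fn main() {
    let hanzi: Vec<u16> = "\u{4F60}\u{597D}".encode_utf16().collect(); // 你好
    // All-Hanzi input never leaves the fast path.
    assert_eq!(fast_path_len(&hanzi), hanzi.len());
}
```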
Possible differences:
- Perhaps on this level of loop tightness, branching on the normalization supplement presence discriminant (for seeing if there's a UTS46 or K normalization supplement) is significant.
- Perhaps on this level of loop tightness, branching on the `ZeroVec` borrowed vs. owned discriminant is significant.
- Perhaps on this level of loop tightness, branching on the Hangul range check is significant.
- Perhaps all of the above combined?
- Perhaps the CPU speculates into the weeds on the surrogate handling path instead of full CPU capability going to the BMP case? (I tried putting an `inline(never)` hurdle on the surrogate path to discourage speculation that way, but that didn't help.)
- The trie uses two `ZeroVec`s such that the index for reading from the "data" `ZeroVec` depends on what was read from the "index" `ZeroVec`. Perhaps the unaligned-capable access from the "index" `ZeroVec` interferes with the CPU speculatively performing the "data" `ZeroVec` access.
- ICU4X uses 32-bit trie values while ICU4C uses 16-bit trie values, so ICU4X might have a larger working set for the trie. However, since Hanzi should always resolve to the default value of the trie without reading all over the "data" `ZeroVec`, the actual hot working set difference seems implausible as a cause.
- Upon inspecting the trie value received, ICU4C seems to have less branchy code than ICU4X for making the decision to keep looping in the fast-path loop. However, for Hanzi, ICU4X should make the decision in the first branch of the branchy code.
- Optimizations happening in an unlucky way. (Getting the Latin passthrough in ICU4X to perform as well as ICU4C was surprisingly sensitive to compiler optimizations and memory copy working set sizes going the right way: Details of code seemingly outside the tightest loop had a notable impact on the performance of the tightest loop.)
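To illustrate the dependent-load hypothesis from the list above: in a two-level trie, the offset into the "data" array depends on a value first read from the "index" array, so the second load cannot issue until the first resolves. The struct, shift width, and table sizes below are made up for the example; the real code point trie layout is more involved.

```rust
// Two-level lookup with a dependent load: the "data" read depends on
// the "index" read. The shift of 6 (64-entry blocks) is an assumption
// for this sketch, not the real trie's block size.
const SHIFT: u32 = 6;

struct TwoLevelTrie {
    index: Vec<u16>,
    data: Vec<u32>,
    default: u32,
}

impl TwoLevelTrie {
    fn get(&self, c: u32) -> u32 {
        let i = (c >> SHIFT) as usize;
        match self.index.get(i) {
            // The `data` access cannot start until the `index` read resolves,
            // which is the speculation hazard hypothesized above.
            Some(&block) => self.data[block as usize + (c as usize & ((1 << SHIFT) - 1))],
            None => self.default,
        }
    }
}

fn main() {
    // One 64-entry data block, all default: every in-range code point
    // resolves to the default value without touching other data blocks.
    let trie = TwoLevelTrie {
        index: vec![0; 1024],
        data: vec![0; 64],
        default: 0,
    };
    assert_eq!(trie.get(0x4E00), 0); // a Hanzi code point yields the default
}
```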