Description
`norm_bench` compares the normalization performance of ICU4X against other implementations, including ICU4C. The test strings are relatively long (multiple memory pages). To avoid testing the `rust_icu` buffer sizing logic, you need the `rawdec` branch of my fork of `rust_icu`.
With #2378 applied, English UTF-16 NFC to NFC and NFD to NFD show ICU4X performing a bit better than ICU4C, so the Latin passthrough comparison works as intended. (Both ICU4X and ICU4C pass through based on comparing with a boundary value.)
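To make the passthrough comparison concrete, here is a minimal sketch of the boundary-value idea: code units below a per-form boundary are known to be normalization-invariant and are copied through without a trie lookup. The boundary constant and function name are illustrative assumptions, not the actual ICU4X code.

```rust
// Assumption for illustration: everything below U+0300 (start of combining
// diacritics) passes through for NFC. The real boundary logic is more nuanced.
const NFC_PASSTHROUGH_BOUND: u16 = 0x0300;

/// Length of the prefix of `src` (UTF-16 code units) that can be
/// passed through unchanged by comparing against a boundary value.
fn passthrough_prefix_len(src: &[u16], bound: u16) -> usize {
    src.iter().take_while(|&&u| u < bound).count()
}

fn main() {
    let ascii: Vec<u16> = "Hello, world".encode_utf16().collect();
    // Entirely below the boundary: the whole string passes through.
    assert_eq!(passthrough_prefix_len(&ascii, NFC_PASSTHROUGH_BOUND), ascii.len());

    let mixed: Vec<u16> = "e\u{0301}".encode_utf16().collect(); // 'e' + combining acute
    assert_eq!(passthrough_prefix_len(&mixed, NFC_PASSTHROUGH_BOUND), 1);
}
```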
However, since Hanzi is normalization-invariant, ICU4X and ICU4C should both stay on a fast-path loop with one trie lookup per character, each lookup yielding the trie's default value. Yet, ICU4X is slower than ICU4C for Chinese (UTF-16 NFC to NFC, UTF-16 NFD to NFD). The difference is not about trie type fast vs. small. With #2410 applied, the difference shouldn't be about ICU4C using macros for the trie fast-path.
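The expected Hanzi fast path can be sketched as follows. This is not the real ICU4X inner loop; the toy trie is a plain function and the ranges and values are assumptions, but it shows the shape of the loop: one lookup per BMP code unit, and the loop continues as long as the lookup returns the trie's default value.

```rust
const TRIE_DEFAULT: u32 = 0;

/// Toy stand-in for the code point trie. CJK Unified Ideographs resolve
/// to the default value; combining diacritics get a non-default value
/// (the value 1 is arbitrary for this sketch).
fn trie_get(c: u32) -> u32 {
    match c {
        0x4E00..=0x9FFF => TRIE_DEFAULT, // Hanzi: normalization-invariant
        0x0300..=0x036F => 1,            // combining marks: need real work
        _ => TRIE_DEFAULT,
    }
}

/// Length of the UTF-16 prefix that stays on the fast path.
fn fast_path_len(src: &[u16]) -> usize {
    let mut i = 0;
    while i < src.len() {
        let u = src[i];
        // Surrogates leave the fast path in this sketch; so does any
        // code unit whose trie value is not the default.
        if (0xD800..=0xDFFF).contains(&u) || trie_get(u as u32) != TRIE_DEFAULT {
            break;
        }
        i += 1;
    }
    i
}

fn main() {
    let hanzi: Vec<u16> = "\u{4F60}\u{597D}".encode_utf16().collect(); // 你好
    // All-Hanzi input never leaves the fast path.
    assert_eq!(fast_path_len(&hanzi), hanzi.len());
}
```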
Possible differences:
- Perhaps on this level of loop tightness, branching on the normalization supplement presence discriminant (for seeing if there's a UTS46 or K normalization supplement) is significant.
- Perhaps on this level of loop tightness, branching on the `ZeroVec` borrowed vs. owned discriminant is significant.
- Perhaps on this level of loop tightness, branching on the Hangul range check is significant.
- Perhaps all of the above combined?
- Perhaps the CPU speculates into the weeds on the surrogate handling path instead of full CPU capability going to the BMP case? (I tried putting an `inline(never)` hurdle on the surrogate path to discourage speculation that way, but that didn't help.)
- The trie uses two `ZeroVec`s such that the index for reading from the "data" `ZeroVec` depends on what was read from the "index" `ZeroVec`. Perhaps the unaligned-capable access from the "index" `ZeroVec` interferes with the CPU speculatively performing the "data" `ZeroVec` access.
- ICU4X uses 32-bit trie values while ICU4C uses 16-bit trie values, so ICU4X might have a larger working set for the trie. However, since Hanzi should always resolve to the default value of the trie without reading all over the "data" `ZeroVec`, the actual hot working set difference seems implausible as a cause.
- Upon inspecting the trie value received, ICU4C seems to have less branchy code than ICU4X for making the decision to keep looping in the fast-path loop. However, for Hanzi, ICU4X should make the decision in the first branch of the branchy code.
- Optimizations happening in an unlucky way. (Getting the Latin passthrough in ICU4X to perform as well as ICU4C was surprisingly sensitive to compiler optimizations and memory copy working set sizes going the right way: Details of code seemingly outside the tightest loop had a notable impact on the performance of the tightest loop.)
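To illustrate the dependent-load hypothesis from the list above: in a two-level trie, the offset into the "data" array depends on a value first read from the "index" array, so the second load cannot issue until the first resolves. The struct, shift width, and table sizes below are made up for the example; the real code point trie layout is more involved.

```rust
// Two-level lookup with a dependent load: the "data" read depends on
// the "index" read. The shift of 6 (64-entry blocks) is an assumption
// for this sketch, not the real trie's block size.
const SHIFT: u32 = 6;

struct TwoLevelTrie {
    index: Vec<u16>,
    data: Vec<u32>,
    default: u32,
}

impl TwoLevelTrie {
    fn get(&self, c: u32) -> u32 {
        let i = (c >> SHIFT) as usize;
        match self.index.get(i) {
            // The `data` access cannot start until the `index` read resolves,
            // which is the speculation hazard hypothesized above.
            Some(&block) => self.data[block as usize + (c as usize & ((1 << SHIFT) - 1))],
            None => self.default,
        }
    }
}

fn main() {
    // One 64-entry data block, all default: every in-range code point
    // resolves to the default value without touching other data blocks.
    let trie = TwoLevelTrie {
        index: vec![0; 1024],
        data: vec![0; 64],
        default: 0,
    };
    assert_eq!(trie.get(0x4E00), 0); // a Hanzi code point yields the default
}
```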