Note: This issue is not ready to be worked on in main. The implementation should go on top of https://github.com/hsivonen/icu4x/tree/nfdsinglemark , which hasn't landed yet. I'm filing this now so that we have an issue on file in case I don't have the time to do this myself in the next few days.
Even with the optimizations present in https://github.com/hsivonen/icu4x/tree/nfdsinglemark , Hangul NFD to NFC UTF-16 is still slower than ICU4C.
In the composing normalizer in the slice mode (not in iterator mode), we should have a fast path that knows how to consume conjoining jamo and ASCII without normal trie lookups and exits back to the normal path when encountering something else.
We should enter the Hangul NFD to NFC fast path when the flag indicating that we have just formed an LV syllable from L and V, with no ccc != 0 marks following, is true:
https://github.com/hsivonen/icu4x/blob/2e2611c6442737aa546ec16407136d31fb2b3e62/components/normalizer/src/lib.rs#L1913
The fast path should most likely be specialized for UTF-16 so that we don't do surrogate checks; surrogates would instead exit the fast path because they are neither conjoining jamo nor ASCII. (In principle, we could even specialize which UTF-8 lead bytes to handle, but that's less likely to be worthwhile.)
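To make the proposed shape concrete, here is a rough standalone sketch of what a UTF-16 fast path could look like. It is not code from the branch: the function name `hangul_fast_path`, its signature, and the use of a `Vec<u16>` as the sink are all assumptions made for illustration. The sketch also assumes the caller hands over the just-formed LV syllable without having written it out yet, and that the slice ends at the end of the input. An ASCII unit is only consumed when the next unit is also handled by the fast path, so that a following combining mark is left for the normal path to compose.

```rust
// Hangul composition constants from the Unicode Standard.
const S_BASE: u16 = 0xAC00;
const L_BASE: u16 = 0x1100;
const V_BASE: u16 = 0x1161;
const T_BASE: u16 = 0x11A7;
const L_COUNT: u16 = 19;
const V_COUNT: u16 = 21;
const T_COUNT: u16 = 28;

/// True for the units the fast path understands: ASCII and the
/// conjoining jamo block U+1100..=U+11FF. Surrogates fail this check
/// like any other unit, so no dedicated surrogate test is needed.
fn is_fast_path_unit(u: u16) -> bool {
    u < 0x80 || (0x1100..=0x11FF).contains(&u)
}

/// Hypothetical fast path (illustrative name and signature).
/// `lv` is an LV syllable that was just composed and not yet written.
/// Returns how many code units of `src` were consumed; the caller
/// resumes the normal path at that offset.
fn hangul_fast_path(lv: u16, src: &[u16], sink: &mut Vec<u16>) -> usize {
    // An LV syllable that can still absorb a trailing consonant.
    let mut pending = Some(lv);
    let mut i = 0;
    while i < src.len() {
        let u = src[i];
        // wrapping_sub turns each range check into one comparison.
        let t = u.wrapping_sub(T_BASE + 1);
        if t < T_COUNT - 1 {
            // Trailing consonant: LV + T -> LVT, else pass through.
            match pending.take() {
                Some(syllable) => sink.push(syllable + t + 1),
                None => sink.push(u),
            }
            i += 1;
            continue;
        }
        let l = u.wrapping_sub(L_BASE);
        if l < L_COUNT {
            if let Some(&next) = src.get(i + 1) {
                let v = next.wrapping_sub(V_BASE);
                if v < V_COUNT {
                    // L + V -> LV; keep it pending for a possible T.
                    if let Some(syllable) = pending.take() {
                        sink.push(syllable);
                    }
                    pending = Some(S_BASE + (l * V_COUNT + v) * T_COUNT);
                    i += 2;
                    continue;
                }
            }
        }
        if u < 0x80 {
            // Leave this ASCII unit to the normal path if the next
            // unit might be a combining mark that composes with it.
            if let Some(&next) = src.get(i + 1) {
                if !is_fast_path_unit(next) {
                    break;
                }
            }
        } else if !is_fast_path_unit(u) {
            break; // Surrogate or anything else: exit the fast path.
        }
        // ASCII or a jamo that composes with nothing here.
        if let Some(syllable) = pending.take() {
            sink.push(syllable);
        }
        sink.push(u);
        i += 1;
    }
    // Flush whatever is still pending before returning.
    if let Some(syllable) = pending {
        sink.push(syllable);
    }
    i
}
```

In the real normalizer, the sink, the pending-starter bookkeeping, and the handoff back to the trie-based path would have to follow the existing invariants in `lib.rs`; the sketch only shows the range checks and the L/V/T arithmetic that replace the trie lookups.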