Implement Hangul NFD to NFC fast path #7516

Note: This issue is not ready to be worked on in `main`. The implementation should go on top of https://github.com/hsivonen/icu4x/tree/nfdsinglemark , which hasn't landed yet. I'm filing this now so that we have an issue on file in case I don't have time to do this myself in the next few days.

Even with the optimizations present in https://github.com/hsivonen/icu4x/tree/nfdsinglemark , Hangul NFD to NFC UTF-16 is still slower than ICU4C.

In the slice mode (not the iterator mode) of the composing normalizer, we should have a fast path that consumes conjoining jamo and ASCII without normal trie lookups and exits back to the normal path when it encounters anything else.

We should enter the Hangul NFD to NFC fast path when the following flag is true. The flag records that we just formed an LV syllable from L and V and that no ccc != 0 marks followed:
https://github.com/hsivonen/icu4x/blob/2e2611c6442737aa546ec16407136d31fb2b3e62/components/normalizer/src/lib.rs#L1913
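
For reference, the composition that sets this flag is the arithmetic Hangul composition from the Unicode Standard. The following is a minimal sketch, not the code in the linked branch: the function names `compose_lv` and `compose_lvt` are hypothetical, and the check for following ccc != 0 marks is omitted because it needs normalization data.

```rust
// Constants from the Unicode Standard's Hangul syllable composition algorithm.
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const S_COUNT: u32 = L_COUNT * V_COUNT * T_COUNT; // 11172

/// Hypothetical helper: composes a leading consonant and a vowel into an
/// LV syllable. `Some` here corresponds to "we just formed an LV syllable".
fn compose_lv(l: char, v: char) -> Option<char> {
    let l_index = (l as u32).checked_sub(L_BASE).filter(|&i| i < L_COUNT)?;
    let v_index = (v as u32).checked_sub(V_BASE).filter(|&i| i < V_COUNT)?;
    char::from_u32(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
}

/// Hypothetical helper: composes an LV syllable and a trailing consonant
/// into an LVT syllable. Returns `None` if `lv` already has a trailing
/// consonant or is not a Hangul syllable at all.
fn compose_lvt(lv: char, t: char) -> Option<char> {
    let s_index = (lv as u32)
        .checked_sub(S_BASE)
        .filter(|&i| i < S_COUNT && i % T_COUNT == 0)?;
    let t_index = (t as u32)
        .checked_sub(T_BASE)
        .filter(|&i| (1..T_COUNT).contains(&i))?;
    char::from_u32(S_BASE + s_index + t_index)
}
```

Only an LV syllable can absorb a following T, which is why the state right after forming LV is a natural entry point for a jamo-specific loop.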

The fast path should most likely be specialized for UTF-16 so that we don't do surrogate checks; surrogates exit the fast path simply because they are neither conjoining jamo nor ASCII. (In principle, we could even specialize which UTF-8 lead bytes to handle, but that's less likely to be worthwhile.)
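
A rough sketch of what such a UTF-16 loop could look like follows. It is an illustration under assumptions, not the proposed implementation: the function name `hangul_fast_path`, its signature, and the convention of keeping the last starter pending instead of writing it out are all hypothetical. The pending starter is handed back on exit so that the normal path can still compose it with whatever follows (for example, an ASCII letter followed by a combining mark).

```rust
const S_BASE: u16 = 0xAC00;
const L_BASE: u16 = 0x1100;
const V_BASE: u16 = 0x1161;
const T_BASE: u16 = 0x11A7;
const L_COUNT: u16 = 19;
const V_COUNT: u16 = 21;
const T_COUNT: u16 = 28;
const S_COUNT: u16 = 11172;

/// Hypothetical fast path. `pending` is the starter that has not been
/// written to `out` yet (on entry, the LV syllable that was just formed).
/// Returns the number of code units consumed from `input` and the starter
/// that is still pending when the loop exits.
fn hangul_fast_path(input: &[u16], mut pending: u16, out: &mut Vec<u16>) -> (usize, u16) {
    for (i, &u) in input.iter().enumerate() {
        let composed = if (V_BASE..V_BASE + V_COUNT).contains(&u) {
            // Vowel: composes only if the pending starter is a leading consonant.
            if (L_BASE..L_BASE + L_COUNT).contains(&pending) {
                Some(S_BASE + ((pending - L_BASE) * V_COUNT + (u - V_BASE)) * T_COUNT)
            } else {
                None
            }
        } else if (T_BASE + 1..T_BASE + T_COUNT).contains(&u) {
            // Trailing consonant: composes only if the pending starter is LV.
            let is_lv = (S_BASE..S_BASE + S_COUNT).contains(&pending)
                && (pending - S_BASE) % T_COUNT == 0;
            if is_lv { Some(pending + (u - T_BASE)) } else { None }
        } else if u < 0x80 || (L_BASE..L_BASE + L_COUNT).contains(&u) {
            // ASCII or leading consonant: starts a new pending starter.
            None
        } else {
            // Anything else, including surrogates, exits without a
            // dedicated surrogate check.
            return (i, pending);
        };
        match composed {
            Some(s) => pending = s,
            None => {
                out.push(pending);
                pending = u;
            }
        }
    }
    (input.len(), pending)
}
```

For example, entering with pending U+AC00 and the input U+11A8, U+0020, U+1112, U+1161, U+11AB, U+00E9 writes U+AC01 and U+0020 to the output, consumes five code units, and exits at U+00E9 with U+D55C still pending.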
