Implement Hangul NFD to NFC fast path #7516

@hsivonen

Note: This issue is not ready to be worked on in main. The implementation should be built on top of https://github.com/hsivonen/icu4x/tree/nfdsinglemark , which hasn't landed yet. I'm filing this now so that an issue is on file in case I don't have time to do this myself in the next few days.

Even with the optimizations present in https://github.com/hsivonen/icu4x/tree/nfdsinglemark , Hangul NFD to NFC UTF-16 is still slower than ICU4C.

In the slice mode (not the iterator mode) of the composing normalizer, we should have a fast path that consumes conjoining jamo and ASCII without normal trie lookups and exits back to the normal path on encountering anything else.

We should enter the Hangul NFD to NFC fast path when the flag is set that records that we just formed an LV syllable from L and V, with no ccc != 0 marks following:
https://github.com/hsivonen/icu4x/blob/2e2611c6442737aa546ec16407136d31fb2b3e62/components/normalizer/src/lib.rs#L1913
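For context on that entry condition, here is a minimal sketch of the algorithmic Hangul composition from the Unicode Standard: L + V gives an LV syllable, and an LV syllable + T gives an LVT syllable. The function names `compose_lv` and `compose_lvt` are hypothetical and are not taken from the ICU4X code; only the constants come from the standard. Once an LV syllable has been formed, the only thing that can still change it is a directly following T, which is presumably what makes this state a natural place to enter the fast path.

```rust
// Constants of the Hangul composition algorithm in the Unicode Standard.
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one below the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28; // 27 trailing consonants plus "no trailing consonant"
const S_COUNT: u32 = L_COUNT * V_COUNT * T_COUNT;

/// Hypothetical helper: composes L + V into an LV syllable.
/// Returns `None` if either argument is outside the composable jamo ranges.
fn compose_lv(l: char, v: char) -> Option<char> {
    // `wrapping_sub` turns "below the base" into a huge index that fails the range check.
    let li = (l as u32).wrapping_sub(L_BASE);
    let vi = (v as u32).wrapping_sub(V_BASE);
    if li < L_COUNT && vi < V_COUNT {
        char::from_u32(S_BASE + (li * V_COUNT + vi) * T_COUNT)
    } else {
        None
    }
}

/// Hypothetical helper: composes an LV syllable + T into an LVT syllable.
/// Returns `None` if `lv` already has a trailing consonant or `t` is not a composable T.
fn compose_lvt(lv: char, t: char) -> Option<char> {
    let si = (lv as u32).wrapping_sub(S_BASE);
    let ti = (t as u32).wrapping_sub(T_BASE);
    // `ti == 0` would be T_BASE itself, which is not a trailing consonant.
    if si < S_COUNT && si % T_COUNT == 0 && ti >= 1 && ti < T_COUNT {
        char::from_u32(lv as u32 + ti)
    } else {
        None
    }
}
```

For example, `compose_lv('\u{1112}', '\u{1161}')` yields U+D558, and `compose_lvt` of that result with U+11AB yields U+D55C.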

The fast path should most likely be specialized for UTF-16 so that we don't do surrogate checks; surrogates instead exit the fast path because they are neither conjoining jamo nor ASCII. (In principle, we could even specialize which UTF-8 lead bytes to handle, but that's less likely to be worthwhile.)
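A rough sketch of what such a UTF-16 fast path could look like, written as a standalone function rather than against the actual normalizer internals. The name `hangul_fast_path_utf16`, its signature, and the choice to keep composition state in the output buffer are all assumptions made for illustration. Only ASCII and the composable L, V, and T ranges are handled; every other code unit, surrogates included, falls into the catch-all arm and exits, so no dedicated surrogate check is needed.

```rust
const S_BASE: u16 = 0xAC00; // first precomposed Hangul syllable
const S_COUNT: u16 = 11172; // 19 * 21 * 28 syllables
const L_BASE: u16 = 0x1100;
const V_BASE: u16 = 0x1161;
const T_BASE: u16 = 0x11A7; // one below the first trailing consonant
const V_COUNT: u16 = 21;
const T_COUNT: u16 = 28;

/// True if `u` is a precomposed syllable without a trailing consonant.
fn is_lv(u: u16) -> bool {
    let si = u.wrapping_sub(S_BASE);
    si < S_COUNT && si % T_COUNT == 0
}

/// Hypothetical fast path: consumes ASCII and composable conjoining jamo from
/// `input`, appends NFC output to `out`, and returns how many code units were
/// consumed. The caller would resume the normal path at that index.
fn hangul_fast_path_utf16(input: &[u16], out: &mut Vec<u16>) -> usize {
    let mut i = 0;
    while i < input.len() {
        let u = input[i];
        match u {
            // ASCII and leading consonants are copied through unchanged.
            0x0000..=0x007F | 0x1100..=0x1112 => out.push(u),
            // A vowel composes with a directly preceding leading consonant.
            0x1161..=0x1175 => match out.last().copied() {
                Some(l) if (0x1100..=0x1112).contains(&l) => {
                    out.pop();
                    out.push(S_BASE + ((l - L_BASE) * V_COUNT + (u - V_BASE)) * T_COUNT);
                }
                _ => out.push(u),
            },
            // A trailing consonant composes with a directly preceding LV syllable.
            0x11A8..=0x11C2 => match out.last().copied() {
                Some(s) if is_lv(s) => {
                    out.pop();
                    out.push(s + (u - T_BASE));
                }
                _ => out.push(u),
            },
            // Anything else, including surrogates, exits the fast path.
            _ => break,
        }
        i += 1;
    }
    // A trailing ASCII unit may compose with a following combining mark
    // (for example 'e' + U+0301), so hand it back to the normal path.
    if i > 0 && i < input.len() && input[i - 1] < 0x80 {
        out.pop();
        i -= 1;
    }
    i
}
```

Because the function inspects the last unit already in `out`, it can be entered right after the normal path has written an LV syllable: a T at the start of `input` then completes that syllable. Whether the real implementation should track this state in the output buffer or in a separate flag is a design question this sketch does not try to settle.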

Metadata

Assignees

No one assigned

Labels

A-performance (Area: Performance (CPU, Memory))
C-collator (Component: Collation, normalization)
blocked (A dependency must be resolved before this is actionable)