Consider providing mapping to Standardized Variants for CJK Compatibility Ideographs

Upon comparing the feature set of the `unicode-normalization` crate with the feature set of `icu_normalizer`, I discovered that [`unicode-normalization` supports mapping CJK Compatibility Ideographs to Standardized Variants](https://github.com/unicode-rs/unicode-normalization/pull/70).

[Unicode 15.0](https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf) says (page 932; PDF page 958):

> **CJK Compatibility Ideographs.** There are 1,002 standardized variation sequences for CJK compatibility ideographs. One sequence is defined for each CJK compatibility ideograph in the Unicode Standard. These sequences are defined to address a normalization issue for these ideographs.
>
> Implementations or users sometimes need a CJK compatibility ideograph to be distinct from its corresponding CJK unified ideograph. For example, a distinct glyphic form may be expected for a particular text. However, CJK compatibility ideographs have canonical equivalence mappings to their corresponding CJK unified ideograph, which means that such distinctions are lost whenever Unicode normalization is applied. Using the variation sequence preserves the distinction found in the original, non-normalized text, even when normalization is later applied.
>
> Because variation sequences are not affected by Unicode normalization, an implementation which uses the corresponding standardized variation sequence can safely maintain the intended distinction for that CJK compatibility ideograph, even in plain text.
>
> It is important to distinguish standardized variation sequences for CJK compatibility ideographs from the variation sequences that are registered in the Ideographic Variation Database (IVD). The former are normalization-stable representations of the CJK compatibility ideographs; they are defined in StandardizedVariants.txt, and there is precisely one variation sequence for each CJK compatibility ideograph. The latter are also stable under normalization, but correspond to implementation-specific glyphs in a registry entry.

Technically, `icu_normalizer` could support this mapping followed by NFD, or this mapping followed by NFC, by representing the mapping as a `DecompositionSupplementV1` with an associated `DecompositionTablesV1`. Somewhat unfortunately, the mappings to two BMP characters would still be stored in `DecompositionTablesV1`, since the in-trie pairs are reserved for the case where the canonical combining class of the second character is non-zero. It might be worth considering whether to relax that invariant for supplements, which are never used with the collator. (The invariant is collator-motivated in the first place.)
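To illustrate what "this mapping followed by NFD" buys, here is a minimal, self-contained sketch. It is not `icu_normalizer` or `unicode-normalization` code: the two-entry table is a hand-picked excerpt (to be checked against StandardizedVariants.txt rather than trusted), and `to_standardized_variants` and `toy_canonical_decompose` are hypothetical helpers. The toy decomposition only knows the excerpt and stands in for real NFD.

```rust
/// Hand-picked excerpt, NOT the full set of 1,002 sequences (assumption):
/// (compatibility ideograph, unified ideograph, variation selector).
const EXCERPT: &[(char, char, char)] = &[
    ('\u{F900}', '\u{8C48}', '\u{FE00}'),
    ('\u{FA10}', '\u{585A}', '\u{FE00}'),
];

/// Hypothetical helper: replace each compatibility ideograph in the
/// excerpt with its standardized variation sequence.
fn to_standardized_variants(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match EXCERPT.iter().find(|&&(compat, _, _)| compat == c) {
            Some(&(_, unified, selector)) => {
                out.push(unified);
                out.push(selector);
            }
            None => out.push(c),
        }
    }
    out
}

/// Toy stand-in for canonical decomposition, restricted to the excerpt:
/// a compatibility ideograph becomes its unified ideograph; everything
/// else (including variation selectors) passes through unchanged.
fn toy_canonical_decompose(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            EXCERPT
                .iter()
                .find(|&&(compat, _, _)| compat == c)
                .map_or(c, |&(_, unified, _)| unified)
        })
        .collect()
}

fn main() {
    let original = "\u{F900}";

    // Decomposition alone collapses the compatibility ideograph.
    assert_eq!(toy_canonical_decompose(original), "\u{8C48}");

    // Mapping first, then decomposing, keeps the distinction.
    let mapped = to_standardized_variants(original);
    assert_eq!(toy_canonical_decompose(&mapped), "\u{8C48}\u{FE00}");
}
```

The point of the sketch is only the ordering: once the input has been mapped to the variation sequence, a later normalization pass has nothing left to collapse.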

As for use cases, I spot-checked the IRG source of a handful of the compatibility characters. I saw one KP-source character. Other than that, the BMP ones that I happened to check were K-source, and the Plane 2 ones were T-source from the higher planes of CNS 11643. Given the usage ratio of Hangul vs. Hanja for the Korean language, and given that the higher planes of CNS 11643 are rare for Traditional Chinese, this feature seems to me, lacking proper domain expertise, more like a historical-text-relevant feature than a modern-text-relevant one, but I'd appreciate a characterization by someone with domain expertise.

Across GitHub, I found 3 users of this feature in `unicode-normalization`:

* https://github.com/sunfishcode/basic-text (by the implementor of the `unicode-normalization` feature)
* https://github.com/logannc/fuzzywuzzy-rs (unclear to me why you'd want this for a fuzzy match; I'd expect a fuzzy match not to want to distinguish the variations)
* https://github.com/crlf0710/runestr-rs
