Consider providing mapping to Standardized Variants for CJK Compatibility Ideographs #2886

Open
@hsivonen

Upon comparing the feature set of the unicode-normalization crate with the feature set of icu_normalizer, I discovered that unicode-normalization supports mapping CJK Compatibility Ideographs to Standardized Variants.

Unicode 15.0 says (page 932; PDF page 958):

CJK Compatibility Ideographs. There are 1,002 standardized variation sequences for CJK compatibility ideographs. One sequence is defined for each CJK compatibility ideograph in the Unicode Standard. These sequences are defined to address a normalization issue for these ideographs.

Implementations or users sometimes need a CJK compatibility ideograph to be distinct from its corresponding CJK unified ideograph. For example, a distinct glyphic form may be expected for a particular text. However, CJK compatibility ideographs have canonical equivalence mappings to their corresponding CJK unified ideograph, which means that such distinctions are lost whenever Unicode normalization is applied. Using the variation sequence preserves the distinction found in the original, non-normalized text, even when normalization is later applied.

Because variation sequences are not affected by Unicode normalization, an implementation which uses the corresponding standardized variation sequence can safely maintain the intended distinction for that CJK compatibility ideograph, even in plain text.

It is important to distinguish standardized variation sequences for CJK compatibility ideographs from the variation sequences that are registered in the Ideographic Variation Database (IVD). The former are normalization-stable representations of the CJK compatibility ideographs; they are defined in StandardizedVariants.txt, and there is precisely one variation sequence for each CJK compatibility ideograph. The latter are also stable under normalization, but correspond to implementation-specific glyphs in a registry entry.
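
To make the quoted behavior concrete, here is a minimal std-only sketch. It is not the unicode-normalization or icu_normalizer implementation. The three-entry `SAMPLE` table and both function names are hypothetical, and the entries were copied by hand, so they should be checked against StandardizedVariants.txt before being relied on. `canonical_decompose` stands in for the canonical mapping of just these characters; a real NFD pass does much more.

```rust
/// Hand-copied sample of (compatibility ideograph, unified ideograph,
/// variation selector). Assumption: verify against StandardizedVariants.txt.
const SAMPLE: &[(char, char, char)] = &[
    ('\u{F900}', '\u{8C48}', '\u{FE00}'),
    ('\u{2F8A6}', '\u{6148}', '\u{FE00}'),
    ('\u{2F999}', '\u{831D}', '\u{FE00}'),
];

/// Stand-in for canonical decomposition, limited to the sample table:
/// each compatibility ideograph collapses to its unified ideograph, so the
/// distinction is lost. Everything else passes through unchanged.
fn canonical_decompose(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            SAMPLE
                .iter()
                .find(|&&(compat, _, _)| compat == c)
                .map_or(c, |&(_, unified, _)| unified)
        })
        .collect()
}

/// Replaces each compatibility ideograph in the sample table with its
/// standardized variation sequence (unified ideograph + variation selector).
fn to_standardized_variants(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match SAMPLE.iter().find(|&&(compat, _, _)| compat == c) {
            Some(&(_, unified, selector)) => {
                out.push(unified);
                out.push(selector);
            }
            None => out.push(c),
        }
    }
    out
}
```

Applying `to_standardized_variants` first and `canonical_decompose` second leaves the variation selectors in place, because neither the unified ideographs nor the selectors have a further mapping. That is the "mapping followed by NFD" order discussed in this issue.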

Technically, icu_normalizer could support this mapping followed by NFD, or this mapping followed by NFC, by representing the mapping as a DecompositionSupplementV1 with an associated DecompositionTablesV1. Somewhat unfortunately, even the mappings to two BMP characters would still be stored in DecompositionTablesV1, since the in-trie pairs are reserved for the case where the canonical combining class of the second character is non-zero. It might be worthwhile to consider whether it would make sense to relax that invariant for supplements, which are never used with the collator. (The invariant is collator-motivated in the first place.)
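
As a rough illustration of why a pair of BMP characters could fit in a single trie value, the sketch below packs two 16-bit code points into one `u32`. This is purely hypothetical and is not icu_normalizer's actual trie value layout; the names `pack_bmp_pair` and `unpack_bmp_pair` are invented for this example.

```rust
/// Hypothetical packing, NOT the real icu_normalizer layout: stores two BMP
/// code points in one u32, first in the high 16 bits, second in the low 16.
/// Returns None if either character lies outside the BMP.
fn pack_bmp_pair(first: char, second: char) -> Option<u32> {
    let (a, b) = (first as u32, second as u32);
    if a <= 0xFFFF && b <= 0xFFFF {
        Some((a << 16) | b)
    } else {
        None
    }
}

/// Inverse of `pack_bmp_pair`. Returns None if either half is not a valid
/// Unicode scalar value (for example, a surrogate code point).
fn unpack_bmp_pair(packed: u32) -> Option<(char, char)> {
    let first = char::from_u32(packed >> 16)?;
    let second = char::from_u32(packed & 0xFFFF)?;
    Some((first, second))
}
```

Under this assumed scheme, a mapping such as U+831D followed by U+FE00 would fit inline, while a mapping whose base ideograph is in a supplementary plane would still need a table entry.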

As for use cases, I spot-checked the IRG source of a handful of the compatibility characters. I saw one KP-source character. Other than that, the BMP ones that I happened to check were K-source, and the Plane 2 ones were T-source from the higher planes of CNS 11643. Given the usage ratio of Hangul vs. Hanja for the Korean language and the rarity of the higher planes of CNS 11643 in Traditional Chinese, this seems to me, lacking proper domain expertise, more like a feature relevant to historical text than to modern text, but I'd appreciate a characterization by someone with domain expertise.

Across GitHub, I found 3 users of this feature in unicode-normalization:

Labels

C-collator: Component: Collation, normalization
S-medium: Size: Less than a week (larger bug fix or enhancement)
T-enhancement: Type: Nice-to-have but not required
