Chained Transliterator ID parsing #3991

Open
@skius

Depends on the runtime parsing discussed in #3849.

In ICU4C/J, transliterators can be loaded not only by a single ID but also by chaining several transliterator IDs (including filters) together. Example: `[a-z] ; [a] Remove ; Latin-Greek/BGN`. These "chains" are equivalent to the transform rule source obtained by applying `chain.split(";").map(|elt| format!(":: {elt} ;")).collect::<String>()`, e.g. `:: [a-z] ; :: [a] Remove ; :: Latin-Greek/BGN ;` (up to whitespace), so the same data struct can be reused, with the only overhead being a few empty VZVs.
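For illustration, the chain-to-rule-source mapping can be written as a small standalone function. This is only a sketch: `chain_to_rule_source` is a hypothetical name, not an existing ICU4X API, and it assumes that `;` never occurs inside an element (for example inside a filter set). Unlike the one-liner above, it trims each element and joins with a single space so the output has normalized whitespace.

```rust
/// Hypothetical helper (not an ICU4X API): converts a chained ID such as
/// `[a-z] ; [a] Remove ; Latin-Greek/BGN` into equivalent rule source.
///
/// Assumption: `;` never occurs inside an element (e.g. within a filter set).
fn chain_to_rule_source(chain: &str) -> String {
    chain
        .split(';')
        // Drop the whitespace surrounding each chain element.
        .map(str::trim)
        // Skip empty elements, e.g. those produced by a trailing `;`.
        .filter(|elt| !elt.is_empty())
        // Each element becomes a `:: <element> ;` rule.
        .map(|elt| format!(":: {elt} ;"))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Called with the chain from the example, this should return `:: [a-z] ; :: [a] Remove ; :: Latin-Greek/BGN ;`, which could then be handed to whatever runtime rule parser comes out of #3849.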

This is primarily a convenience feature for runtime construction: users do not have to write a dummy source file containing the mapping explained above. Because these chains use legacy IDs while ICU4X data uses BCP-47 IDs, the whole issue of mapping legacy IDs to BCP-47 IDs applies (#3891). I suggest supporting chains of BCP-47 IDs instead of chains of legacy IDs. Support for this is also on the roadmap for ICU: https://unicode-org.atlassian.net/browse/ICU-22474

Labels: C-transliterator (Component: transliterator), C-unicode (Component: Props, sets, tries)
