Transliterator datagen should allow for slicing individual baked data transliterators #6249

Open
@Manishearth

Description

Related: #3966

Currently, all Transliterator data lives under a single data key, with each transliterator distinguished only by its und-t-blah locale. This is hard to slice: users basically have to run datagen manually to get any slicing at all.

For blob data I'm not too worried about that: it would be nice to still have ways to slice that (#3966), but I'm okay with people performing some manual slicing here, because automatic slicing would potentially have to parse the transliterators themselves [1].

But for baked data, this is not great.

I think we can structure transliterator baked data somewhat differently. Datagen could produce the following:

const DATA_TRANSLITERATOR_LATIN_HAN = ...;
const DATA_TRANSLITERATOR_LATIN_GREEK = ...;

const DATA_TRANSLITERATOR_RULES_V1: icu_provider_baked::zerotrie::Data<icu::experimental::transliterate::provider::TransliteratorRulesV1> = {
    const TRIE: _ = ...;
    // The combined table refers to the per-transliterator constants above.
    const VALUES: _ = [DATA_TRANSLITERATOR_LATIN_GREEK, DATA_TRANSLITERATOR_LATIN_HAN, ...];
    ...
};

pub mod ctors {
    // Each constructor references only its own constant.
    pub fn new_transliterator_latin_han() -> Transliterator {
        Transliterator::new_internal(DATA_TRANSLITERATOR_LATIN_HAN, ...)
    }
}
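
To make the intended shape concrete, here is a minimal, self-contained sketch of the layout above. Everything in it is a stand-in: RuleData, the simplified Transliterator, the ALL table (a plain array in place of the zerotrie), lookup(), and the placeholder IDs and rule strings are hypothetical and not ICU4X APIs. The idea it tries to show is that each constructor names exactly one data item, so a binary that only calls one constructor would not need to reference the combined table.

```rust
// All names below are hypothetical stand-ins, not real ICU4X APIs.

/// Stand-in for the baked payload of a single transliterator.
#[derive(Debug, PartialEq)]
pub struct RuleData {
    pub id: &'static str,
    pub rules: &'static str,
}

/// Stand-in for the runtime transliterator built from one payload.
#[derive(Debug)]
pub struct Transliterator {
    data: &'static RuleData,
}

impl Transliterator {
    fn new_internal(data: &'static RuleData) -> Self {
        Transliterator { data }
    }

    pub fn id(&self) -> &'static str {
        self.data.id
    }
}

// One item per transliterator (placeholder IDs and rule strings).
pub static DATA_TRANSLITERATOR_LATIN_GREEK: RuleData = RuleData {
    id: "latin-greek",
    rules: "<rules elided>",
};
pub static DATA_TRANSLITERATOR_LATIN_HAN: RuleData = RuleData {
    id: "latin-han",
    rules: "<rules elided>",
};

// Combined table for dynamic lookup; a plain array stands in for the zerotrie.
pub static ALL: [&RuleData; 2] = [
    &DATA_TRANSLITERATOR_LATIN_GREEK,
    &DATA_TRANSLITERATOR_LATIN_HAN,
];

/// Dynamic lookup by ID; this path references every transliterator.
pub fn lookup(id: &str) -> Option<&'static RuleData> {
    ALL.iter().copied().find(|d| d.id == id)
}

pub mod ctors {
    use super::{Transliterator, DATA_TRANSLITERATOR_LATIN_HAN};

    /// References only the Latin-Han item, not the combined table.
    pub fn new_transliterator_latin_han() -> Transliterator {
        Transliterator::new_internal(&DATA_TRANSLITERATOR_LATIN_HAN)
    }
}
```

In this sketch, only lookup() touches ALL. Code that sticks to ctors::new_transliterator_latin_han() never mentions the other items, which is what would give the linker a chance to drop them.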

Ideally, ::new_internal() has a solution to #3966, where you can pass in something like Transliterator::new_internal(DATA_TRANSLITERATOR_LATIN_HAN, TransliteratorDeps { casemapper: Some(CaseMapper::new()), normalizer: ..., ... }).
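
One possible reading of that dependency struct, as a rough sketch: each optional dependency is an Option field, and new_internal() fails if the data needs something the caller did not supply. The types here (CaseMapper, Normalizer, RuleData with its needs_* flags, MissingDep) are simplified, hypothetical stand-ins, and the choice of which transliterator needs which dependency is arbitrary; how #3966 would actually be solved is not settled by this.

```rust
// Hypothetical stand-ins; none of these are real ICU4X types or signatures.

#[derive(Debug, Default)]
pub struct CaseMapper;

#[derive(Debug, Default)]
pub struct Normalizer;

/// Optional dependencies the caller chooses to link in.
#[derive(Debug, Default)]
pub struct TransliteratorDeps {
    pub casemapper: Option<CaseMapper>,
    pub normalizer: Option<Normalizer>,
}

/// Stand-in payload that records which dependencies its rules use.
#[derive(Debug)]
pub struct RuleData {
    pub needs_casemapper: bool,
    pub needs_normalizer: bool,
}

#[derive(Debug, PartialEq)]
pub enum MissingDep {
    CaseMapper,
    Normalizer,
}

#[derive(Debug)]
pub struct Transliterator {
    pub data: &'static RuleData,
    pub deps: TransliteratorDeps,
}

impl Transliterator {
    /// Fails if the data requires a dependency that was not passed in.
    pub fn new_internal(
        data: &'static RuleData,
        deps: TransliteratorDeps,
    ) -> Result<Self, MissingDep> {
        if data.needs_casemapper && deps.casemapper.is_none() {
            return Err(MissingDep::CaseMapper);
        }
        if data.needs_normalizer && deps.normalizer.is_none() {
            return Err(MissingDep::Normalizer);
        }
        Ok(Transliterator { data, deps })
    }
}

// Which dependencies this data needs is arbitrary, for illustration only.
pub static DATA_TRANSLITERATOR_LATIN_HAN: RuleData = RuleData {
    needs_casemapper: true,
    needs_normalizer: false,
};
```

A caller that passes TransliteratorDeps { casemapper: Some(CaseMapper), ..Default::default() } only pulls in the case mapper. A generated constructor could fill in exactly the dependencies its data needs.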

The calling crate can then write pub use ctors::* somewhere to re-export these constructors.

Footnotes

  1. Maybe we can have a transliterator!() macro that embeds the transliterator string into the binary so that keyextract can pick it up and read it.
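
     A very rough sketch of what such a macro might look like, purely as an assumption: it stores the ID next to a marker prefix in a static marked #[used], and evaluates to the bare ID. The "\ntranslit:" marker is invented for this example, and how keyextract would actually detect and read it is left open.

```rust
// Hypothetical sketch; the marker format is invented for illustration.
macro_rules! transliterator {
    ($id:literal) => {{
        // #[used] asks the compiler to keep this static in the object file
        // even though only a slice of its contents is read below.
        #[used]
        static TAGGED: &str = concat!("\ntranslit:", $id, "\n");
        // Strip the marker prefix and the trailing newline again.
        &TAGGED["\ntranslit:".len()..TAGGED.len() - 1]
    }};
}

/// Example caller: the tagged ID string ends up embedded in the binary.
pub fn latin_han_id() -> &'static str {
    transliterator!("latin-han")
}
```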
