Description
Related: #3966
Currently with Transliterator, all transliterators are under the same data key, as different und-t-blah
locales. This is hard to slice; it basically requires users to manually run datagen to get any slicing.
For blob data I'm not too worried about that: it would be nice to still have ways to slice that (#3966), but I'm okay with people performing some manual slicing here, because automatic slicing would potentially have to parse the transliterators themselves [1].
But for baked data, this is not great.
I think we can structure transliterator baked data somewhat differently: datagen can produce the following:
    const DATA_TRANSLITERATOR_LATIN_HAN = ...;
    const DATA_TRANSLITERATOR_LATIN_GREEK = ...;
    const DATA_TRANSLITERATOR_RULES_V1: icu_provider_baked::zerotrie::Data<icu::experimental::transliterate::provider::TransliteratorRulesV1> = {
        const TRIE: _ = ...;
        const VALUES: _ = [DATA_TRANSLITERATOR_LATIN_GREEK, DATA_TRANSLITERATOR_LATIN_HAN, ...];
        ...
    };
    pub mod ctors {
        pub fn new_transliterator_latin_han() -> Transliterator {
            // No trailing semicolon: the constructed value is the return value.
            Transliterator::new_internal(DATA_TRANSLITERATOR_LATIN_HAN, ...)
        }
    }
Ideally, ::new_internal() has a solution to #3966, where you can pass in something like Transliterator::new_internal(DATA_TRANSLITERATOR_LATIN_HAN, TransliteratorDeps { casemapper: Some(CaseMapper::new()), normalizer: ..., ... }).
And then the calling crate can call pub use ctors::* somewhere.
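To make the shape of this proposal concrete, here is a minimal self-contained sketch. Everything in it is a hypothetical stand-in, not the real ICU4X API: `RuleData`, `Normalizer`, `DepError`, the stub `CaseMapper`, the `needs_*` flags, and the placeholder ids are all assumptions made for illustration. The idea it tries to show is that each transliterator gets its own constant and its own constructor, and that dependencies are passed explicitly through `TransliteratorDeps`, so a binary that only calls `new_transliterator_latin_han()` never names the other data.

```rust
// Hypothetical stand-ins throughout; none of these are the real ICU4X types.

/// Stand-in for the compiled rules of a single transliterator.
pub struct RuleData {
    id: &'static str,
    needs_casemapper: bool,
    needs_normalizer: bool,
}

pub struct CaseMapper;
impl CaseMapper {
    pub const fn new() -> Self {
        CaseMapper
    }
}

pub struct Normalizer;
impl Normalizer {
    pub const fn new() -> Self {
        Normalizer
    }
}

/// Dependencies are supplied by the caller, so only what is named gets linked.
#[derive(Default)]
pub struct TransliteratorDeps {
    pub casemapper: Option<CaseMapper>,
    pub normalizer: Option<Normalizer>,
}

#[derive(Debug, PartialEq)]
pub enum DepError {
    MissingCaseMapper,
    MissingNormalizer,
}

pub struct Transliterator {
    rules: &'static RuleData,
    deps: TransliteratorDeps,
}

impl Transliterator {
    /// Checks that every dependency the rules ask for was actually supplied.
    pub fn new_internal(
        rules: &'static RuleData,
        deps: TransliteratorDeps,
    ) -> Result<Self, DepError> {
        if rules.needs_casemapper && deps.casemapper.is_none() {
            return Err(DepError::MissingCaseMapper);
        }
        if rules.needs_normalizer && deps.normalizer.is_none() {
            return Err(DepError::MissingNormalizer);
        }
        Ok(Transliterator { rules, deps })
    }

    pub fn id(&self) -> &'static str {
        self.rules.id
    }

    pub fn has_casemapper(&self) -> bool {
        self.deps.casemapper.is_some()
    }
}

// One item per transliterator, as datagen might emit them (contents made up).
pub static DATA_TRANSLITERATOR_LATIN_HAN: RuleData = RuleData {
    id: "latin-han",
    needs_casemapper: false,
    needs_normalizer: true,
};
pub static DATA_TRANSLITERATOR_LATIN_GREEK: RuleData = RuleData {
    id: "latin-greek",
    needs_casemapper: true,
    needs_normalizer: false,
};

pub mod ctors {
    use super::*;

    /// Generated constructor: references only the Latin-Han data and its deps.
    pub fn new_transliterator_latin_han() -> Transliterator {
        Transliterator::new_internal(
            &DATA_TRANSLITERATOR_LATIN_HAN,
            TransliteratorDeps {
                normalizer: Some(Normalizer::new()),
                ..Default::default()
            },
        )
        .expect("generated constructor supplies every required dependency")
    }
}

// The calling crate re-exports the generated constructors.
pub use ctors::*;
```

In this sketch `new_internal` returns a `Result` so that a missing dependency is visible; whether the real constructor should be fallible is an open design question and not something the proposal above specifies.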
Footnotes
1. Maybe we can have a transliterator!() macro that embeds the transliterator string into the binary so that keyextract can pick it up and read it.