Description
It would be nice to cut out the middle man and construct as much data as possible directly from "the source". The `icuexportdata` we currently use contains:
- casemapping, properties, normalization
  - These should only require data from the Unicode Character Database. We should add the UCD as a ground truth data source (maybe using the existing `ucd_parse` crate) and generate the data from it.
- segmentation dictionaries
  - This is probably not something that can be upstreamed into Unicode, but we should consider whether a non-ICU4C location could be more appropriate (we already use a dedicated data source for LSTM segmentation models).
- collation
  - I think the source of truth here is CLDR, so we should ideally generate this from CLDR data.
I think it's desirable for ICU4X to be as independent of ICU4C as possible, in order to identify and upstream any custom ICU4C behaviour.
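
To make the casemapping bullet above a bit more concrete, here is a rough sketch of reading simple case mappings straight from a `UnicodeData.txt` record. It is hand-rolled on purpose and does not use the `ucd_parse` crate's API or anything from the existing datagen code; `SimpleCaseMapping`, `parse_hex_field` and `parse_unicode_data_line` are hypothetical names made up for this example. It relies only on the documented 15-field, semicolon-separated layout of `UnicodeData.txt`, where fields 12 to 14 hold the simple uppercase, lowercase and titlecase mappings.

```rust
/// Simple case mappings read from one `UnicodeData.txt` record.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq, Eq)]
struct SimpleCaseMapping {
    code_point: u32,
    upper: Option<u32>,
    lower: Option<u32>,
    title: Option<u32>,
}

/// Parses a hexadecimal code point field.
/// An empty field means "no mapping"; in this sketch a malformed
/// field is also treated as `None` rather than reported as an error.
fn parse_hex_field(field: &str) -> Option<u32> {
    if field.is_empty() {
        None
    } else {
        u32::from_str_radix(field, 16).ok()
    }
}

/// Parses one line of `UnicodeData.txt`.
/// Returns `None` if the line does not have exactly 15 fields
/// or if the code point field is not valid hexadecimal.
fn parse_unicode_data_line(line: &str) -> Option<SimpleCaseMapping> {
    let fields: Vec<&str> = line.trim_end().split(';').collect();
    if fields.len() != 15 {
        return None;
    }
    Some(SimpleCaseMapping {
        code_point: u32::from_str_radix(fields[0], 16).ok()?,
        upper: parse_hex_field(fields[12]),
        lower: parse_hex_field(fields[13]),
        title: parse_hex_field(fields[14]),
    })
}
```

Calling `parse_unicode_data_line` on the record for U+0041 yields a lowercase mapping to U+0061 and no uppercase mapping. Full case mapping would additionally need `SpecialCasing.txt` and `CaseFolding.txt`, which this sketch ignores.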