Reduce ICU4X's dependence on ICU4C data

It would be nice to cut out the middleman and construct as much data as possible directly from the original source. The `icuexportdata` we currently use contains:

* casemapping, properties, normalization
  * These should only require data from the Unicode Character Database. We should add the UCD as a ground truth data source (maybe using the existing [`ucd_parse`](https://docs.rs/ucd-parse/latest/ucd_parse/) crate) and generate the data from it.
* segmentation dictionaries
  * This is probably not something that can be upstreamed into Unicode, but we should consider whether a non-ICU4C location would be more appropriate (we already use a dedicated data source for the LSTM segmentation models).
* collation
  * I think the source of truth here is CLDR, so we should ideally generate this from CLDR data.
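
To make the UCD option a bit more concrete, here is a minimal sketch of reading one record of `UnicodeData.txt` with only the standard library. It is not how ICU4X's datagen works, and it does not use `ucd_parse`; the `UcdRecord` struct and the `parse_hex` and `parse_unicode_data_line` helpers are hypothetical names made up for this illustration. The sketch assumes the documented layout of 15 semicolon-separated fields per line and ignores the `<..., First>` / `<..., Last>` range records.

```rust
/// One record from UnicodeData.txt, reduced to the fields used in this sketch.
/// (Hypothetical type for illustration, not an ICU4X or ucd_parse type.)
#[derive(Debug, PartialEq)]
struct UcdRecord {
    code_point: u32,
    general_category: String,
    canonical_combining_class: u8,
    simple_uppercase: Option<u32>,
    simple_lowercase: Option<u32>,
}

/// Parses a hexadecimal code point field; an empty field means "no value".
fn parse_hex(field: &str) -> Option<u32> {
    if field.is_empty() {
        None
    } else {
        u32::from_str_radix(field, 16).ok()
    }
}

/// Parses a single UnicodeData.txt line.
/// Returns None if the line does not have exactly 15 fields or if the
/// code point or combining class cannot be parsed.
fn parse_unicode_data_line(line: &str) -> Option<UcdRecord> {
    let fields: Vec<&str> = line.trim_end().split(';').collect();
    if fields.len() != 15 {
        return None;
    }
    Some(UcdRecord {
        code_point: parse_hex(fields[0])?,
        general_category: fields[2].to_string(),
        canonical_combining_class: fields[3].parse().ok()?,
        // Fields 12 and 13 hold the simple uppercase and lowercase mappings.
        simple_uppercase: parse_hex(fields[12]),
        simple_lowercase: parse_hex(fields[13]),
    })
}
```

Feeding every line of the file through `parse_unicode_data_line` would yield the raw material for simple case mappings, general categories, and combining classes; full case mappings and normalization data would additionally need other UCD files such as `SpecialCasing.txt`.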

I think it's desirable for ICU4X to be as independent of ICU4C as possible, in order to identify and upstream any custom ICU4C behaviour.

Reduce ICU4X's dependence on ICU4C data #4602
