
Reduce ICU4X's dependence on ICU4C data #4602

Open
@robertbastian

It would be nice to cut out the middleman and construct as much data as possible directly from "the source". The icuexportdata we currently use contains:

  • casemapping, properties, normalization
    • These should only require data from the Unicode Character Database. We should add the UCD as a ground truth data source (maybe using the existing ucd_parse crate) and generate the data from it.
  • segmentation dictionaries
    • This is probably not something that can be upstreamed into Unicode, but we should consider whether a non-ICU4C location could be more appropriate (we already use a dedicated data source for LSTM segmentation models).
  • collation
    • I think the source of truth here is CLDR, so we should ideally generate this from CLDR data.
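
To make the UCD item concrete, here is a minimal sketch of reading simple case mappings straight from `UnicodeData.txt` records. It relies only on the documented 15-field, semicolon-separated layout of that file (fields 12, 13, and 14 hold the simple uppercase, lowercase, and titlecase mappings). The names `SimpleCaseMapping` and `parse_unicode_data_line` are hypothetical; they are not part of ICU4X or ucd_parse, and real datagen would also need `SpecialCasing.txt`, `CaseFolding.txt`, and range handling.

```rust
/// Simple case mappings for one code point, as found in a UnicodeData.txt record.
/// Hypothetical type for illustration only; not an ICU4X or ucd_parse API.
#[derive(Debug, PartialEq, Eq)]
pub struct SimpleCaseMapping {
    pub code_point: u32,
    pub uppercase: Option<u32>,
    pub lowercase: Option<u32>,
    pub titlecase: Option<u32>,
}

/// Parses a hexadecimal code point field; empty or malformed fields yield `None`.
fn parse_hex(field: &str) -> Option<u32> {
    let field = field.trim();
    if field.is_empty() {
        None
    } else {
        u32::from_str_radix(field, 16).ok()
    }
}

/// Parses one UnicodeData.txt record and extracts its simple case mappings.
/// Returns `None` if the record does not have exactly 15 fields or if the
/// code point field is not valid hexadecimal.
pub fn parse_unicode_data_line(line: &str) -> Option<SimpleCaseMapping> {
    let fields: Vec<&str> = line.trim_end().split(';').collect();
    if fields.len() != 15 {
        return None;
    }
    Some(SimpleCaseMapping {
        code_point: parse_hex(fields[0])?,
        uppercase: parse_hex(fields[12]),
        lowercase: parse_hex(fields[13]),
        titlecase: parse_hex(fields[14]),
    })
}
```

Feeding it the record `0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;` gives code point U+0041 with a lowercase mapping to U+0061 and no uppercase or titlecase mapping.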

I think it's desirable for ICU4X to be as independent of ICU4C as possible, in order to identify and upstream any custom ICU4C behaviour.

Labels

  • A-data (Area: Data coverage or quality)
  • C-data-infra (Component: provider, datagen, fallback, adapters)
  • S-epic (Size: Major project; create smaller child issues)
  • help wanted (Issue needs an assignee)
