Add more collator data and filtering to testdata; add transliterator attributes domain by robertbastian · Pull Request #7679 · unicode-org/icu4x

robertbastian · 2026-02-23T15:01:05Z

Currently, some of the testdata is not generated only because some files are not downloaded into the repo's source data. However, this is not what clients can or should do. All of this should be controllable through options on SourceDataProvider, in particular marker attribute filters.

Running make_testdata.rs with SourceDataProvider::new() instead of SourceDataProvider::new_testing() should yield the same results. new_testing is convenient for avoiding the network, but it should not behave differently.
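For illustration, here is a minimal, self-contained sketch of what a per-domain marker attribute filter does conceptually. `ToyProvider`, its `keeps` method, and `testdata_like_provider` are hypothetical stand-ins written for this example; they are not the ICU4X `SourceDataProvider` API and only mirror the builder shape of the `with_marker_attributes_filter` call in the diff.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a provider holding per-domain attribute filters.
/// This is NOT the ICU4X `SourceDataProvider`; it only mirrors the builder shape.
#[derive(Default)]
struct ToyProvider {
    filters: HashMap<String, Box<dyn Fn(&str) -> bool>>,
}

impl ToyProvider {
    /// Registers a predicate for one attributes domain (builder style).
    fn with_marker_attributes_filter(
        mut self,
        domain: &str,
        filter: impl Fn(&str) -> bool + 'static,
    ) -> Self {
        self.filters.insert(domain.to_string(), Box::new(filter));
        self
    }

    /// Keeps an attribute if its domain has no filter, or if the filter accepts it.
    fn keeps(&self, domain: &str, attrs: &str) -> bool {
        self.filters.get(domain).map_or(true, |f| f(attrs))
    }
}

/// Builds a toy provider with the same numbering-system allowlist as the diff.
fn testdata_like_provider() -> ToyProvider {
    ToyProvider::default().with_marker_attributes_filter("numbering_system", |attrs| {
        matches!(attrs, "arab" | "beng" | "cakm" | "latn" | "thai")
    })
}

fn main() {
    let provider = testdata_like_provider();
    for attrs in ["latn", "thai", "tibt"] {
        println!("{attrs}: kept = {}", provider.keeps("numbering_system", attrs));
    }
}
```

The point of the sketch is that the filter is an explicit option on the provider, so the same output can be reproduced whether or not the full source data is present.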

Manishearth · 2026-02-23T17:49:30Z

provider/source/src/tests/make_testdata.rs

        )
    })
+    .with_marker_attributes_filter("numbering_system", |attrs| {
+        matches!(attrs.as_str(), "arab" | "beng" | "cakm" | "latn" | "thai")


TIL we filter this for testdata.

observation: we do not filter this for bakeddata, which is the correct behavior

We have filters because it reduces the number of json files.

we currently generate the numbering systems for all locales in the data. because we only have select locales in the testdata, this behaves differently between new_testing and new. should we actually filter the numbering systems by locales somehow? the code is:

icu4x/provider/source/src/decimal/mod.rs

Lines 102 to 120 in 8b31bf1

    /// Produce `DataIdentifier`'s for all *used* numbering systems in the form und/<numsys>
    fn iter_ids_for_used_numbers(&self) -> Result<HashSet<DataIdentifierCow<'static>>, DataError> {
        Ok(self
            .cldr()?
            .numbers()
            .list_locales()?
            .flat_map(|locale| {
                self.get_supported_numsys_for_langid(&locale, false)
                    .expect("All languages from list_locales should be present")
                    .into_iter()
                    .map(move |nsname| {
                        DataIdentifierBorrowed::for_marker_attributes_and_locale(
                            DataMarkerAttributes::try_from_str(&nsname).unwrap(),
                            &locale!("und").into(),
                        )
                        .into_owned()
                    })
            })
            .collect())

I don't understand the question?
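To make the question in this thread concrete, the following sketch shows how the set of "used" numbering systems depends on which locales are visible. It is an assumption-laden toy: `used_numbering_systems` and `sample_data` are hypothetical names, the locale-to-numbering-system table is illustrative sample data rather than CLDR, and none of this is the ICU4X implementation.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Collects every numbering system used by any locale accepted by `keep_locale`.
fn used_numbering_systems<'a>(
    numsys_by_locale: &BTreeMap<&'a str, Vec<&'a str>>,
    keep_locale: impl Fn(&str) -> bool,
) -> BTreeSet<&'a str> {
    numsys_by_locale
        .iter()
        // Drop locales that the caller did not select.
        .filter(|&(&locale, _)| keep_locale(locale))
        // Flatten the per-locale lists into one deduplicated set.
        .flat_map(|(_, systems)| systems.iter().copied())
        .collect()
}

/// Illustrative sample data only (hypothetical, not read from CLDR).
fn sample_data() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("ar", vec!["arab", "latn"]),
        ("bn", vec!["beng", "latn"]),
        ("en", vec!["latn"]),
        ("th", vec!["thai", "latn"]),
    ])
}

fn main() {
    let data = sample_data();
    // All locales visible: analogous to full source data.
    let all = used_numbering_systems(&data, |_| true);
    // Only a subset of locales visible: analogous to reduced test source data.
    let subset = used_numbering_systems(&data, |l| matches!(l, "en" | "th"));
    println!("all locales:    {all:?}");
    println!("subset locales: {subset:?}");
}
```

With fewer locales in the input, fewer numbering systems count as used, which is why the result can differ between a full and a reduced data source unless an explicit filter pins the set.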

sffc · 2026-02-23T17:50:17Z

components/decimal/src/provider.rs

    [char; 10],
    #[cfg(feature = "datagen")]
-    attributes_domain = "numbering_system"
+    attributes_domain = "numbering-system"


Question: why did you change the casing? And in make_testdata.rs, you use _. I wasn't sure whether we should use _ or -, so I started using _ on my newly added attribute domains since it seems that's what we are using elsewhere.

make_testdata was an oversight.

we use - in most ICU4X identifiers, I don't think we use _ anywhere. - is easier to type and more pleasing to read

It's not just here, it is other places, too:

https://github.com/search?q=repo%3Aunicode-org%2Ficu4x%20attributes_domain&type=code

I don't want us to become inconsistent. I would rather make a separate PR to change all the instances at the same time, rather than switching just this one. (Please don't change all the others in this PR)

sffc · 2026-02-23T21:31:03Z

components/decimal/src/provider.rs

    [char; 10],
    #[cfg(feature = "datagen")]
-    attributes_domain = "numbering_system"
+    attributes_domain = "numbering-system"


It's not just here, it is other places, too:

https://github.com/search?q=repo%3Aunicode-org%2Ficu4x%20attributes_domain&type=code

I don't want us to become inconsistent. I would rather make a separate PR to change all the instances at the same time, rather than switching just this one. (Please don't change all the others in this PR)

robertbastian added 2 commits February 23, 2026 15:46

add missing collation testdata

3993bc1

add more attributes filters to testdata

c7c6c30

robertbastian requested review from a team, Manishearth and sffc as code owners February 23, 2026 15:01

Manishearth approved these changes Feb 23, 2026

sffc reviewed Feb 23, 2026

-

8ecf73d

robertbastian requested review from Manishearth and sffc February 23, 2026 18:02

sffc requested changes Feb 23, 2026

_

f1c812f

robertbastian requested a review from sffc February 23, 2026 21:57

sffc changed the title ~~Make testdata more realistic~~ Add more collator data to testdata; add transliterator attributes domain Feb 24, 2026

sffc changed the title ~~Add more collator data to testdata; add transliterator attributes domain~~ Add more collator data and filtering to testdata; add transliterator attributes domain Feb 24, 2026

sffc approved these changes Feb 24, 2026

robertbastian wants to merge 4 commits into unicode-org:main from robertbastian:testdata
