Skip to content

Comments

Add more collator data and filtering to testdata; add transliterator attributes domain#7679

Open
robertbastian wants to merge 4 commits intounicode-org:mainfrom
robertbastian:testdata
Open

Add more collator data and filtering to testdata; add transliterator attributes domain#7679
robertbastian wants to merge 4 commits intounicode-org:mainfrom
robertbastian:testdata

Conversation

@robertbastian
Copy link
Member

Currently some of the testdata is only not generated because some files are not downloaded to repo source data. However, this is not what clients can/should do, all of this should be controllable through options on SourceDataProvider, in particular marker attribute filters.

Running make_testdata.rs with SourceDataProvider::new() instead of SourceDataProvider::new_testing() should yield the same results, new_testing is convenient for avoiding the network, but should not behave differently.

@robertbastian robertbastian requested review from a team, Manishearth and sffc as code owners February 23, 2026 15:01
)
})
.with_marker_attributes_filter("numbering_system", |attrs| {
matches!(attrs.as_str(), "arab" | "beng" | "cakm" | "latn" | "thai")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL we filter this for testdata.

observation: we do not filter this for bakeddata, which is the correct behavior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have filters because it reduces the number of json files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we currently generate the numbering systems for all locales in the data. because we only have select locales in the testdata, this behaves differently between new_testing and new. should we actually filter the numbering systems by locales somehow? the code is:

/// Produce `DataIdentifier`'s for all *used* numbering systems in the form und/<numsys>
fn iter_ids_for_used_numbers(&self) -> Result<HashSet<DataIdentifierCow<'static>>, DataError> {
Ok(self
.cldr()?
.numbers()
.list_locales()?
.flat_map(|locale| {
self.get_supported_numsys_for_langid(&locale, false)
.expect("All languages from list_locales should be present")
.into_iter()
.map(move |nsname| {
DataIdentifierBorrowed::for_marker_attributes_and_locale(
DataMarkerAttributes::try_from_str(&nsname).unwrap(),
&locale!("und").into(),
)
.into_owned()
})
})
.collect())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the question?

[char; 10],
#[cfg(feature = "datagen")]
attributes_domain = "numbering_system"
attributes_domain = "numbering-system"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: why did you change the casing? And in make_testdata.rs, you use _. I wasn't sure whether we should use _ or - so I started using _ on my newly added attribute domains since it seems that what we are using elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_testdata was an oversight.

we use - in most ICU4X identifiers, I don't think we use _ anywhere. - is easier to type and more pleasing to read

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just here, it is other places, too:

https://github.com/search?q=repo%3Aunicode-org%2Ficu4x%20attributes_domain&type=code

I don't want us to become inconsistent. I would rather make a separate PR to change all the instances at the same time, rather than switching just this one. (Please don't change all the others in this PR)

[char; 10],
#[cfg(feature = "datagen")]
attributes_domain = "numbering_system"
attributes_domain = "numbering-system"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just here, it is other places, too:

https://github.com/search?q=repo%3Aunicode-org%2Ficu4x%20attributes_domain&type=code

I don't want us to become inconsistent. I would rather make a separate PR to change all the instances at the same time, rather than switching just this one. (Please don't change all the others in this PR)

@robertbastian robertbastian requested a review from sffc February 23, 2026 21:57
@sffc sffc changed the title Make testdata more realistic Add more collator data to testdata; add transliterator attributes domain Feb 24, 2026
@sffc sffc changed the title Add more collator data to testdata; add transliterator attributes domain Add more collator data and filtering to testdata; add transliterator attributes domain Feb 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants