Skip to content

Conversation

@opnuub
Copy link
Contributor

@opnuub opnuub commented Jan 15, 2026

@sffc
Modified:

  • provider/icu4x-datagen/src/main.rs
  • provider/source/src/lib.rs
  • provider/source/src/segmenter/mod.rs (to import unihan.rs)
  • provider/source/src/segmenter/unihan.rs (to use LiteMap<codepoint, int8> via test)
  • provider/source/src/source.rs (added UnihanCache with proprocessing)
  • tools/make/download-repo-sources/globs.rs.data (added filename IRGSources.txt)
  • tools/make/download-repo-sources/src/main.rs

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! A bunch of small comments.

})
.collect::<Vec<_>>()
.join(",\n ");
#[allow(dead_code)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be dead code?

Suggested change
#[allow(dead_code)]

fs::write(
&irg_path,
fs::read_to_string(&irg_path)?
.lines()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Since this file is very big, do the more efficient line-by-line reading and writing, as described in

https://doc.rust-lang.org/rust-by-example/std_misc/file/read_lines.html

For writing, you can write each line to the output file using writeln!

&irg_path,
fs::read_to_string(&irg_path)?
.lines()
.filter(|l| l.contains("kRSUnicode") || l.starts_with('#'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: The filtering for just kRSUnicode seems fine. It keeps the file manageable in size.

}

#[derive(Debug)]
#[allow(dead_code)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove all dead_code attributes; they shouldn't be necessary

};
let clean_str = radical_str.replace('\'', "");
if let Ok(value) = clean_str.parse::<u8>() {
map.insert(codepoint, IRGValue { value });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: use map.try_append() instead, since I think these should be in ascending order of code point value. It is much more efficient than map.insert().

.strip_prefix("U+")
.and_then(|hex| u32::from_str_radix(hex, 16).ok())
.and_then(char::from_u32)
.unwrap_or('\u{4E00}');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think you should be more strict. If the file isn't the format you expect, either return a DataError or panic. It's datagen so panicking is okay.

// } else {
// println!("Char: {} (U+{:04X}) => Not found in IRG sources", char, char as u32);
// }
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Delete old commented code that you don't plan to use.

}

#[cfg(feature = "networking")]
pub fn with_unihan_for_tag(self, tag: &str) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: Not sure if it should be called "tag" or something else like "release" or "version". I sent an email.

/// The segmentation LSTM model tag that has been verified to work with this version of `SourceDataProvider`.
pub const TESTED_SEGMENTER_LSTM_TAG: &'static str = "v0.1.0";

pub const TESTED_UNIHAN_TAG: &'static str = "latest";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: This should be a stable version number such as 16.0.0

Self {
unihan_paths: Some(Arc::new(UnihanCache {
root: AbstractFs::new_from_url(format!(
"https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the URL should not have the extra "UCD" in it, because we want the directory where you can put in all the different version tags.

#[arg(help = "Download Unihan data from unicode.org.")]
#[cfg_attr(not(feature = "networking"), arg(hide = true))]
#[cfg(feature = "provider")]
unihan_tag: String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: should this be unicode_tag? Unihan is versioned with Unicode

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that Unihan and UCD would be two different sources, since Unihan is an extra zip file separate from the UCD.

I have an email thread about this, too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two different zip files yes, but we probably don't want to allow version skew

Copy link
Member

@sffc sffc Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah --ucd-tag is probably the right thing to do, and it'll be used for both data sources (Unihan and a future non-Unihan UCD).

But, we should keep --unihan-root.

Self {
unihan_paths: Some(Arc::new(UnihanCache {
root: AbstractFs::new_from_url(format!(
"https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be:

Suggested change
"https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip"
"https://unicode.org/Public/{tag}/ucd/Unihan.zip"

your URL only works with latest


impl UnihanCache {
#[allow(dead_code)]
pub(crate) fn irg_sources(&self) -> Result<LiteMap<char, IRGValue>, DataError> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: LiteMap is nice for small code size and no-std environments, but it has a limited API and is probably less performant than std maps. in icu_provider_source, we usually use HashMap, and BTreeMap when sorted iteration is important.

Copy link
Member

@sffc sffc Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested LiteMap because the file is sorted in ascending code point order so it's efficient to build a LiteMap from it. (see other comment about using try_append instead of insert)

@robertbastian robertbastian changed the title Datagen for radicals Add UCD source to datagen Jan 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants