Add UCD source to datagen #7436

opnuub · 2026-01-15T17:43:38Z

@sffc
Modified:

provider/icu4x-datagen/src/main.rs
provider/source/src/lib.rs
provider/source/src/segmenter/mod.rs (to import unihan.rs)
provider/source/src/segmenter/unihan.rs (to use LiteMap<codepoint, int8> via test)
provider/source/src/source.rs (added UnihanCache with preprocessing)
tools/make/download-repo-sources/globs.rs.data (added filename IRGSources.txt)
tools/make/download-repo-sources/src/main.rs

sffc

Nice work! A bunch of small comments.

sffc · 2026-01-16T00:50:54Z

tools/make/download-repo-sources/src/main.rs

        })
        .collect::<Vec<_>>()
        .join(",\n                        ");
+    #[allow(dead_code)]


Shouldn't be dead code?

Suggested change

-    #[allow(dead_code)]

sffc · 2026-01-16T00:52:14Z

tools/make/download-repo-sources/src/main.rs

+    fs::write(
+        &irg_path,
+        fs::read_to_string(&irg_path)?
+            .lines()


Nit: Since this file is very big, do the more efficient line-by-line reading and writing, as described in

https://doc.rust-lang.org/rust-by-example/std_misc/file/read_lines.html

For writing, you can write each line to the output file using writeln!
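
A minimal sketch of that streaming approach, using only the standard library. The helper name `filter_irg_lines` is hypothetical and not from the PR. It is generic over `BufRead` and `Write` so the same code works on files or in-memory buffers, and it keeps the same filter the PR uses (lines containing `kRSUnicode` or starting with `#`).

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical helper (not from the PR): copies only the wanted lines from
/// `input` to `output`, holding one line in memory at a time.
/// Returns the number of lines that were kept.
fn filter_irg_lines<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<usize> {
    let mut kept = 0;
    for line in input.lines() {
        // Each item is an io::Result<String>; propagate read errors.
        let line = line?;
        if line.contains("kRSUnicode") || line.starts_with('#') {
            // writeln! appends the newline that `lines()` stripped.
            writeln!(output, "{line}")?;
            kept += 1;
        }
    }
    output.flush()?;
    Ok(kept)
}
```

With real files, the input would be wrapped in `std::io::BufReader` and the output in `std::io::BufWriter`. Because the PR reads and writes the same path, a streaming version would presumably need to write to a separate temporary file and rename it over the original afterwards; that part is an assumption and is not shown here.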

sffc · 2026-01-16T00:54:50Z

tools/make/download-repo-sources/src/main.rs

+        &irg_path,
+        fs::read_to_string(&irg_path)?
+            .lines()
+            .filter(|l| l.contains("kRSUnicode") || l.starts_with('#'))


Praise: The filtering for just kRSUnicode seems fine. It keeps the file manageable in size.

sffc · 2026-01-16T00:55:33Z

provider/source/src/source.rs

 }
+
+#[derive(Debug)]
+#[allow(dead_code)]


Please remove all dead_code attributes; they shouldn't be necessary

sffc · 2026-01-16T00:57:41Z

provider/source/src/source.rs

+            };
+            let clean_str = radical_str.replace('\'', "");
+            if let Ok(value) = clean_str.parse::<u8>() {
+                map.insert(codepoint, IRGValue { value });


Suggestion: use map.try_append() instead, since I think these should be in ascending order of code point value. It is much more efficient than map.insert().
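
To illustrate why appending is cheaper, here is a std-only sketch. `SortedVecMap` is a hypothetical stand-in for a sorted-vector map, not the real `LiteMap` type, and its `try_append` only mirrors my understanding of `LiteMap::try_append` (push when the key is greater than the last key, otherwise hand the pair back).

```rust
/// Hypothetical stand-in for a sorted-vector map such as LiteMap.
struct SortedVecMap {
    entries: Vec<(char, u8)>,
}

impl SortedVecMap {
    fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Pushes to the end if `key` is greater than the current last key.
    /// No search and no shifting, so this is O(1) amortized.
    /// If the key is out of order, the pair is returned to the caller.
    fn try_append(&mut self, key: char, value: u8) -> Option<(char, u8)> {
        match self.entries.last() {
            Some((last, _)) if *last >= key => Some((key, value)),
            _ => {
                self.entries.push((key, value));
                None
            }
        }
    }

    /// General insert: a binary search, then a shift of every later element
    /// when the key is new. Returns the previous value if the key existed.
    fn insert(&mut self, key: char, value: u8) -> Option<u8> {
        match self.entries.binary_search_by(|(k, _)| k.cmp(&key)) {
            Ok(i) => Some(std::mem::replace(&mut self.entries[i].1, value)),
            Err(i) => {
                self.entries.insert(i, (key, value));
                None
            }
        }
    }
}
```

If the input file really is in ascending code point order, every call takes the fast path. A returned `Some(..)` would signal an ordering violation, which the caller could treat as an error.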

sffc · 2026-01-16T01:02:51Z

provider/source/src/source.rs

+                .strip_prefix("U+")
+                .and_then(|hex| u32::from_str_radix(hex, 16).ok())
+                .and_then(char::from_u32)
+                .unwrap_or('\u{4E00}');


Nit: I think you should be more strict. If the file isn't the format you expect, either return a DataError or panic. It's datagen so panicking is okay.
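
A sketch of what the stricter parsing could look like. The `ParseError` enum and the function name `parse_code_point` are hypothetical; in the PR the error would more likely be mapped to a `DataError`, or the caller could simply `.expect(..)` the result, since panicking is acceptable in datagen.

```rust
/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingPrefix(String),
    BadHex(String),
    NotAScalarValue(u32),
}

/// Parses a field such as "U+4E00" into a `char`.
/// Every malformed input is reported instead of falling back to a default.
fn parse_code_point(field: &str) -> Result<char, ParseError> {
    let hex = field
        .strip_prefix("U+")
        .ok_or_else(|| ParseError::MissingPrefix(field.to_string()))?;
    let value = u32::from_str_radix(hex, 16)
        .map_err(|_| ParseError::BadHex(field.to_string()))?;
    // Surrogates and values above U+10FFFF are not valid `char`s.
    char::from_u32(value).ok_or(ParseError::NotAScalarValue(value))
}
```

Compared with `unwrap_or('\u{4E00}')`, a malformed line can no longer silently assign its value to U+4E00.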

sffc · 2026-01-16T01:04:01Z

provider/source/src/segmenter/unihan.rs

+        //     } else {
+        //         println!("Char: {} (U+{:04X}) => Not found in IRG sources", char, char as u32);
+        //     }
+        // }


Nit: Delete old commented code that you don't plan to use.

sffc · 2026-01-16T01:05:50Z

provider/source/src/lib.rs

    }

+    #[cfg(feature = "networking")]
+    pub fn with_unihan_for_tag(self, tag: &str) -> Self {


Thought: Not sure if it should be called "tag" or something else like "release" or "version". I sent an email.

sffc · 2026-01-16T01:09:35Z

provider/source/src/lib.rs

    /// The segmentation LSTM model tag that has been verified to work with this version of `SourceDataProvider`.
    pub const TESTED_SEGMENTER_LSTM_TAG: &'static str = "v0.1.0";

+    pub const TESTED_UNIHAN_TAG: &'static str = "latest";


Issue: This should be a stable version number such as 16.0.0

sffc · 2026-01-16T01:10:00Z

provider/source/src/lib.rs

+        Self {
+            unihan_paths: Some(Arc::new(UnihanCache {
+                root: AbstractFs::new_from_url(format!(
+                    "https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip"


I think the URL should not have the extra "UCD" in it, because we want the directory where you can put in all the different version tags.

robertbastian · 2026-01-16T10:10:19Z

provider/icu4x-datagen/src/main.rs

+    #[arg(help = "Download Unihan data from unicode.org.")]
+    #[cfg_attr(not(feature = "networking"), arg(hide = true))]
+    #[cfg(feature = "provider")]
+    unihan_tag: String,


thought: should this be unicode_tag? Unihan is versioned with Unicode

I was thinking that Unihan and UCD would be two different sources, since Unihan is an extra zip file separate from the UCD.

I have an email thread about this, too.

two different zip files yes, but we probably don't want to allow version skew

Yeah --ucd-tag is probably the right thing to do, and it'll be used for both data sources (Unihan and a future non-Unihan UCD).

But, we should keep --unihan-root.

robertbastian · 2026-01-16T10:12:16Z

provider/source/src/lib.rs

+        Self {
+            unihan_paths: Some(Arc::new(UnihanCache {
+                root: AbstractFs::new_from_url(format!(
+                    "https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip"


this needs to be:

Suggested change

-                    "https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip"
+                    "https://unicode.org/Public/{tag}/ucd/Unihan.zip"

your URL only works with latest
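
For reference, a tiny sketch of how the suggested URL pattern expands for a pinned version. The function name `unihan_url` is hypothetical; the pattern is the one from the suggestion above, and `16.0.0` is the example version mentioned earlier in this review.

```rust
/// Hypothetical helper: builds the Unihan.zip URL for a given Unicode version tag.
fn unihan_url(tag: &str) -> String {
    format!("https://unicode.org/Public/{tag}/ucd/Unihan.zip")
}
```

Calling `unihan_url("16.0.0")` yields `https://unicode.org/Public/16.0.0/ucd/Unihan.zip`.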

robertbastian · 2026-01-16T10:14:01Z

provider/source/src/source.rs

+
+impl UnihanCache {
+    #[allow(dead_code)]
+    pub(crate) fn irg_sources(&self) -> Result<LiteMap<char, IRGValue>, DataError> {


issue: LiteMap is nice for small code size and no-std environments, but it has a limited API and is probably less performant than std maps. in icu_provider_source, we usually use HashMap, and BTreeMap when sorted iteration is important.

I suggested LiteMap because the file is sorted in ascending code point order so it's efficient to build a LiteMap from it. (see other comment about using try_append instead of insert)

Datagen for radicals

5e7f738

opnuub requested review from a team, Manishearth, robertbastian and sffc as code owners January 15, 2026 17:43

Manishearth removed request for a team, Manishearth and robertbastian January 15, 2026 21:42

opnuub added 2 commits January 15, 2026 22:22

fixed CI issues

6c4cf02

truncated irgsources.txt

2993553

sffc reviewed Jan 16, 2026

robertbastian reviewed Jan 16, 2026

robertbastian changed the title from "Datagen for radicals" to "Add UCD source to datagen" on Jan 16, 2026
