-
Notifications
You must be signed in to change notification settings - Fork 255
Add UCD source to datagen #7436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
sffc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! A bunch of small comments.
| }) | ||
| .collect::<Vec<_>>() | ||
| .join(",\n "); | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be dead code?
| #[allow(dead_code)] |
| fs::write( | ||
| &irg_path, | ||
| fs::read_to_string(&irg_path)? | ||
| .lines() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Since this file is very big, do the more efficient line-by-line reading and writing, as described in
https://doc.rust-lang.org/rust-by-example/std_misc/file/read_lines.html
For writing, you can write each line to the output file using writeln!
| &irg_path, | ||
| fs::read_to_string(&irg_path)? | ||
| .lines() | ||
| .filter(|l| l.contains("kRSUnicode") || l.starts_with('#')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Praise: The filtering for just kRSUnicode seems fine. It keeps the file manageable in size.
| } | ||
|
|
||
| #[derive(Debug)] | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove all dead_code attributes; they shouldn't be necessary
| }; | ||
| let clean_str = radical_str.replace('\'', ""); | ||
| if let Ok(value) = clean_str.parse::<u8>() { | ||
| map.insert(codepoint, IRGValue { value }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: use map.try_append() instead, since I think these should be in ascending order of code point value. It is much more efficient than map.insert().
| .strip_prefix("U+") | ||
| .and_then(|hex| u32::from_str_radix(hex, 16).ok()) | ||
| .and_then(char::from_u32) | ||
| .unwrap_or('\u{4E00}'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think you should be more strict. If the file isn't the format you expect, either return a DataError or panic. It's datagen so panicking is okay.
| // } else { | ||
| // println!("Char: {} (U+{:04X}) => Not found in IRG sources", char, char as u32); | ||
| // } | ||
| // } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Delete old commented code that you don't plan to use.
| } | ||
|
|
||
| #[cfg(feature = "networking")] | ||
| pub fn with_unihan_for_tag(self, tag: &str) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: Not sure if it should be called "tag" or something else like "release" or "version". I sent an email.
| /// The segmentation LSTM model tag that has been verified to work with this version of `SourceDataProvider`. | ||
| pub const TESTED_SEGMENTER_LSTM_TAG: &'static str = "v0.1.0"; | ||
|
|
||
| pub const TESTED_UNIHAN_TAG: &'static str = "latest"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: This should be a stable version number such as 16.0.0
| Self { | ||
| unihan_paths: Some(Arc::new(UnihanCache { | ||
| root: AbstractFs::new_from_url(format!( | ||
| "https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the URL should not have the extra "UCD" in it, because we want the directory where you can put in all the different version tags.
| #[arg(help = "Download Unihan data from unicode.org.")] | ||
| #[cfg_attr(not(feature = "networking"), arg(hide = true))] | ||
| #[cfg(feature = "provider")] | ||
| unihan_tag: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: should this be unicode_tag? Unihan is versioned with Unicode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that Unihan and UCD would be two different sources, since Unihan is an extra zip file separate from the UCD.
I have an email thread about this, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two different zip files yes, but we probably don't want to allow version skew
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah --ucd-tag is probably the right thing to do, and it'll be used for both data sources (Unihan and a future non-Unihan UCD).
But, we should keep --unihan-root.
| Self { | ||
| unihan_paths: Some(Arc::new(UnihanCache { | ||
| root: AbstractFs::new_from_url(format!( | ||
| "https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be:
| "https://www.unicode.org/Public/UCD/{tag}/ucd/Unihan.zip" | |
| "https://unicode.org/Public/{tag}/ucd/Unihan.zip" |
your URL only works with latest
|
|
||
| impl UnihanCache { | ||
| #[allow(dead_code)] | ||
| pub(crate) fn irg_sources(&self) -> Result<LiteMap<char, IRGValue>, DataError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: LiteMap is nice for small code size and no-std environments, but it has a limited API and is probably less performant than std maps. in icu_provider_source, we usually use HashMap, and BTreeMap when sorted iteration is important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested LiteMap because the file is sorted in ascending code point order so it's efficient to build a LiteMap from it. (see other comment about using try_append instead of insert)
@sffc
Modified: