
Conversation

@evgenyrp (Collaborator) commented Nov 18, 2025

@evgenyrp requested a review from @ZJaume on November 19, 2025 00:30
# failed to download/unpack
# - mtdata_Statmt-ccaligned-1-eng-zho_CN # ~5,669,496 sentences (640.7 MB)
# - mtdata_Statmt-ccaligned-1-eng-zho_TW # ~2,407,082 sentences (272.0 MB)
- url_https://storage.googleapis.com/releng-translations-dev/data/en-zh_hant/hplt_v1_1.[LANG].zst
@evgenyrp (Collaborator, Author) commented:

I uploaded OPUS HPLT v1.1 to GCS as it's not available through the API.

@evgenyrp (Collaborator, Author) commented:

@ZJaume FYI, these are the hacks I'm adding to train zh-hant. They are not meant to land as-is on main.

 parser.add_argument("--language", type=str, help="The BCP 47 language tag of the dataset")
-parser.add_argument("--src", type=bool, help="Source language of a language pair")
-parser.add_argument("--trg", type=bool, help="Target language of a language pair")
+parser.add_argument("--src", type=str, help="Source language of a language pair")
@evgenyrp (Collaborator, Author) commented:

This bug caused everything to always be filtered out and never converted to one type of Chinese. It was likely the case for Simplified as well, but it's not critical there since there was enough data.
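
For context on why type=bool breaks the check: argparse calls the given type on the raw command-line string, and bool() is truthy for any non-empty string, so --src zh arrives as True instead of "zh" and a comparison against a language code can never match. A standalone demonstration (not the pipeline's actual parser):

import argparse

parser = argparse.ArgumentParser()
# Buggy: bool("zh") is True, like bool() of any non-empty string.
parser.add_argument("--src", type=bool)
args = parser.parse_args(["--src", "zh"])
print(args.src)           # True, not "zh"
print(args.src == "zh")   # False, so the Chinese-conversion branch is never taken

# Fixed: keep the value as a string.
fixed = argparse.ArgumentParser()
fixed.add_argument("--src", type=str)
print(fixed.parse_args(["--src", "zh"]).src == "zh")  # True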

@ZJaume (Collaborator) left a comment:

LGTM

@evgenyrp (Collaborator, Author) commented:

mono HPLT was filtered down to just 29M lines (52 on import and everything else during cleaning). I wonder if it's too aggressive. I tried the fasttext model manually and it makes mistakes quite often. (import, stats, cleaning)
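
For anyone repeating the manual check, a minimal sketch with the fasttext Python bindings could look like the following; the model path and the sample sentences are placeholders, and the pipeline's own LID wrapper may differ:

import fasttext

# Placeholder path: point it at whichever LID binary the pipeline downloads.
model = fasttext.load_model("lid_model.bin")

samples = [
    "這是一個繁體中文的句子。",  # Traditional Chinese
    "这是一个简体中文的句子。",  # Simplified Chinese
]
for line in samples:
    labels, probs = model.predict(line, k=3)   # top-3 guesses expose the confusions
    print(line, list(zip(labels, probs)))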

@evgenyrp (Collaborator, Author) commented Nov 20, 2025

Only 24M of parallel corpus examples left (stats, group)

When I trained zh-en in Simplified, corpus size was 86M (stats, group)

@ZJaume do you have ideas why there's such a big difference? Could it be the new NLLB fasttext model? We currently identify only the zh code with it, without the script. We do the conversion from Simplified to Traditional before that, at the data import step. I wonder if the model didn't like the converted texts. (The script-aware labels are sketched at the end of this comment.)

The NLLB corpus was filtered from 71M down to 24M.
61M of the 71M are examples that were converted from Simplified to Traditional.
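
For what it's worth, the NLLB/OpenLID label set is script-aware (e.g. zho_Hans, zho_Hant, yue_Hant), so the script information is available even though we currently match only the bare zh code. A throwaway sketch of reading the full label (model path is a placeholder; label format assumed to be fasttext's usual __label__ prefix):

import fasttext

model = fasttext.load_model("lid_model.bin")  # placeholder path

def lang_and_script(line: str):
    labels, probs = model.predict(line, k=1)
    # Labels look like "__label__zho_Hant": strip the prefix, split lang and script.
    lang, _, script = labels[0].removeprefix("__label__").partition("_")
    return lang, script, float(probs[0])

print(lang_and_script("這是一個繁體中文的句子。"))  # expect something like ("zho", "Hant", ...)
print(lang_and_script("这是一个简体中文的句子。"))  # expect something like ("zho", "Hans", ...)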

@evgenyrp changed the title from "[WIP] Chinese Traditional" to "[Experiment] Chinese Traditional" on Nov 20, 2025
@ZJaume (Collaborator) commented Nov 21, 2025

So, regarding mono HPLT, this seems to be the cleaning summary:

22735836 HPLT_LID_SEGMENT
29798622 DUPLICATE
2028464 CLEAN_LID
3920256 RATIO_ALPHA
3683762 RATIO_CHARS
11074337 TOO_LONG

things that I think may be addressed:

  • Relax the too-long rule, which is limited to 150 chars for CJK; the debug.txt lines discarded by this rule don't seem long enough to deserve discarding, I would say.
  • Ratio alpha discards a significant number of sentences that are actually only, or almost only, "alpha". So either the opuscleaner.filters.clean_common regex for Hant is missing those characters, or they are Simplified characters that LID is not discarding, or mixed Traditional and Simplified scripts, or they are characters valid in both scripts and the regex only includes Traditional (?). I don't really know what to do with this (a rough sketch of a Unicode-aware alpha-ratio check follows this list).
  • I believe we cannot do anything about the duplicates. Maybe there's more boilerplate for Chinese than for other languages in HPLT because the text extractor doesn't handle it well.
  • The HPLT LID-by-segment condition could be relaxed to accept Cantonese. I know that Cantonese and Mandarin in Traditional script are usually difficult to distinguish with langid models. We could trust only the document-level LID and treat the segment-level LID as unreliable. However, I don't know whether that's a safe decision. If you do this, take into account that HPLT v3 has a mistake: yue_Hant appears as zho_Hant in the seg_langs field.
  • I've now realised we didn't change the LID cleaning for monolingual data to use the NLLB model. But it seems that this model is pretty bad at distinguishing Cantonese from Mandarin Traditional. So maybe, for this specific case, you could use OpenLID-v2 or be permissive and allow every sentence classified as yue.
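
For reference, a Unicode-aware version of the alpha-ratio check could look like the sketch below; the function names and the threshold are assumptions, not the actual opuscleaner.filters.clean_common implementation:

import unicodedata

def alpha_ratio(line: str) -> float:
    """Fraction of non-space characters that are letters (Latin, Han, kana, ...)."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    # CJK ideographs are category "Lo", so a category-based check counts them as letters.
    letters = sum(1 for c in chars if unicodedata.category(c).startswith("L"))
    return letters / len(chars)

def keep(line: str, threshold: float = 0.5) -> bool:
    # If the real filter's regex misses part of the Traditional (or mixed) ranges,
    # a line like the one below scores far lower there and gets discarded.
    return alpha_ratio(line) >= threshold

print(keep("這是一個繁體中文的句子。"))  # True with a category-based letter check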

Maybe all these things are useful to you for parallel data, but I haven't had time to look at it. Will do on Monday.

@evgenyrp (Collaborator, Author) commented:

eflomal fails with OOM on the 200M back-translated corpus (https://firefox-ci-tc.services.mozilla.com/tasks/UJbEJNKJQrKGe7UFiqDrQQ/runs/3/logs/public/logs/live.log). I wonder if it's because the vocab is more diverse for zh-hant, or maybe the back-translation is of low quality.

@ZJaume (Collaborator) commented Nov 27, 2025

Ok, so I took a look at the parallel data, and it's a bit difficult to tell what's happening with the cleaning without a debug log for the filters. It was probably in part because of the lid218e model; let's see what comes out of it after the change to OpenLID.

I think it may be a good option to take a step back and start from parallel data that's originally written in Hant, then add the converted data and see if it improves. As far as I can tell, we are only using HPLT as the "native" corpus and the rest is converted. There are only a few more such corpora in OPUS, but training only on those would take the conversion out of the equation when working out what's hurting quality. Also, a small, but maybe not that small, thing is that it seems we are converting everything to Traditional here:

ch_type = self._detect(line)
if ch_type in (ch_type.none, to):
    new_line = line
else:
    new_line = self._convert_line(line, to)
    stats.script_conversion.converted += 1
out_file.write(new_line)

and there is at least one detection case that we shouldn't convert, which is hanzidentifier.BOTH. Then there's the hanzidentifier.MIXED case, but converting in that case might be the right choice.
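
A minimal sketch of the suggested guard, assuming hanzidentifier's module-level constants and an OpenCC s2t converter; the pipeline's actual _detect / _convert_line helpers may behave differently:

import hanzidentifier
import opencc

# "s2t" is an assumption; some OpenCC bindings expect "s2t.json".
converter = opencc.OpenCC("s2t")

def to_traditional(line: str) -> str:
    script = hanzidentifier.identify(line)
    # Leave alone anything that is already Traditional, has no Chinese at all,
    # or is valid in BOTH scripts (the case flagged above).
    if script in (hanzidentifier.TRADITIONAL, hanzidentifier.UNKNOWN, hanzidentifier.BOTH):
        return line
    # SIMPLIFIED and MIXED get converted; converting MIXED is the debatable case.
    return converter.convert(line)

print(to_traditional("简体中文测试"))  # converted to Traditional
print(to_traditional("中文"))          # identical in both scripts, left unchanged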

@evgenyrp (Collaborator, Author) commented:

It's 71M sentences of parallel corpus vs 24M before, after switching to OpenLID-v2 and relaxing the length limit a little. I think the main issue was the language ID.

And 86M of HPLT mono after implementing your suggestions, so we're probably good.

@evgenyrp (Collaborator, Author) commented:

and there is at least one detection case that we shouldn't convert, which is hanzidentifier.BOTH

Maybe, but hypothetically it shouldn't hurt.

@evgenyrp (Collaborator, Author) commented Dec 4, 2025

The corpus size is 84M now, after adding OPUS zh_TW, but the backward model's chrF validation curve is slightly lower: https://wandb.ai/moz-translations/zh-en?nw=nwuserepavlov
