
Conversation

@evgenyrp (Collaborator) commented Nov 18, 2025

@evgenyrp requested a review from @ZJaume on November 19, 2025 00:30
# failed to download/unpack
# - mtdata_Statmt-ccaligned-1-eng-zho_CN # ~5,669,496 sentences (640.7 MB)
# - mtdata_Statmt-ccaligned-1-eng-zho_TW # ~2,407,082 sentences (272.0 MB)
- url_https://storage.googleapis.com/releng-translations-dev/data/en-zh_hant/hplt_v1_1.[LANG].zst
@evgenyrp (Collaborator, Author) commented:

I uploaded OPUS HPLT v1.1 to GCS as it's not available through the API.

@evgenyrp (Collaborator, Author) commented:

@ZJaume FYI, these are the hacks I'm adding to train zh-hant. They are not meant to land as-is on main.

 parser.add_argument("--language", type=str, help="The BCP 47 language tag of the dataset")
-parser.add_argument("--src", type=bool, help="Source language of a language pair")
-parser.add_argument("--trg", type=bool, help="Target language of a language pair")
+parser.add_argument("--src", type=str, help="Source language of a language pair")
@evgenyrp (Collaborator, Author) commented:

This bug caused everything to always be filtered out and never converted to one type of Chinese. It was likely the case for Simplified as well, but it's not critical there since there was enough data.
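
For context on why type=bool breaks the check: argparse calls the given type on the raw command-line string, and bool() is truthy for any non-empty string, so --src zh arrives as True instead of "zh" and a comparison against a language code can never match. A standalone demonstration (not the pipeline's actual parser):

import argparse

parser = argparse.ArgumentParser()
# Buggy: bool("zh") is True, like bool() of any non-empty string.
parser.add_argument("--src", type=bool)
args = parser.parse_args(["--src", "zh"])
print(args.src)           # True, not "zh"
print(args.src == "zh")   # False, so the Chinese-conversion branch is never taken

# Fixed: keep the value as a string.
fixed = argparse.ArgumentParser()
fixed.add_argument("--src", type=str)
print(fixed.parse_args(["--src", "zh"]).src == "zh")  # True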

@ZJaume (Collaborator) left a comment:

LGTM

@evgenyrp (Collaborator, Author) commented:

mono HPLT was filtered down to just 29M lines (52 on import and everything else during cleaning). I wonder if it's too aggressive. I tried the fasttext model manually and it makes mistakes quite often. (import, stats, cleaning)
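
For anyone repeating the manual check, a minimal sketch with the fasttext Python bindings could look like the following; the model path and the sample sentences are placeholders, and the pipeline's own LID wrapper may differ:

import fasttext

# Placeholder path: point it at whichever LID binary the pipeline downloads.
model = fasttext.load_model("lid_model.bin")

samples = [
    "這是一個繁體中文的句子。",  # Traditional Chinese
    "这是一个简体中文的句子。",  # Simplified Chinese
]
for line in samples:
    labels, probs = model.predict(line, k=3)   # top-3 guesses expose the confusions
    print(line, list(zip(labels, probs)))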

@evgenyrp (Collaborator, Author) commented Nov 20, 2025

Only 24M of parallel corpus examples left (stats, group)

When I trained zh-en in Simplified, corpus size was 86M (stats, group)

@ZJaume do you have ideas why there's such a big difference? Could it be the new NLLB fasttext model? We currently identify only the zh code with it, without the script. We do the conversion from Simplified to Traditional before that, at the data import step. I wonder if the model didn't like the converted texts. (The script-aware labels are sketched at the end of this comment.)

The NLLB corpus was filtered from 71M down to 24M.
61M of the 71M are examples that were converted from Simplified to Traditional.
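
For what it's worth, the NLLB/OpenLID label set is script-aware (e.g. zho_Hans, zho_Hant, yue_Hant), so the script information is available even though we currently match only the bare zh code. A throwaway sketch of reading the full label (model path is a placeholder; label format assumed to be fasttext's usual __label__ prefix):

import fasttext

model = fasttext.load_model("lid_model.bin")  # placeholder path

def lang_and_script(line: str):
    labels, probs = model.predict(line, k=1)
    # Labels look like "__label__zho_Hant": strip the prefix, split lang and script.
    lang, _, script = labels[0].removeprefix("__label__").partition("_")
    return lang, script, float(probs[0])

print(lang_and_script("這是一個繁體中文的句子。"))  # expect something like ("zho", "Hant", ...)
print(lang_and_script("这是一个简体中文的句子。"))  # expect something like ("zho", "Hans", ...)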

@evgenyrp changed the title from "[WIP] Chinese Traditional" to "[Experiment] Chinese Traditional" on Nov 20, 2025
@ZJaume (Collaborator) commented Nov 21, 2025

So, regarding mono HPLT, this seems to be the cleaning summary:

22735836 HPLT_LID_SEGMENT
29798622 DUPLICATE
2028464 CLEAN_LID
3920256 RATIO_ALPHA
3683762 RATIO_CHARS
11074337 TOO_LONG

things that I think may be addressed:

  • Relax the too-long rule, which is limited to 150 chars for CJK; the debug.txt lines discarded by this rule don't seem long enough to deserve discarding, I would say.
  • Ratio alpha discards a significant number of sentences that are actually only, or almost only, "alpha". So either the opuscleaner.filters.clean_common regex for Hant is missing those characters, or they are Simplified characters that LID is not discarding, or mixed Traditional and Simplified scripts, or they are characters valid in both scripts and the regex only includes Traditional (?). I don't really know what to do with this (a rough sketch of a Unicode-aware alpha-ratio check follows this list).
  • I believe we cannot do anything about the duplicates. Maybe there's more boilerplate for Chinese than for other languages in HPLT because the text extractor doesn't handle it well.
  • The HPLT LID-by-segment condition could be relaxed to accept Cantonese. I know that Cantonese and Mandarin in Traditional script are usually difficult to distinguish with langid models. We could trust only the document-level LID and treat the segment-level LID as unreliable. However, I don't know whether that's a safe decision. If you do this, take into account that HPLT v3 has a mistake: yue_Hant appears as zho_Hant in the seg_langs field.
  • I've now realised we didn't change the LID cleaning for monolingual data to use the NLLB model. But it seems that this model is pretty bad at distinguishing Cantonese from Mandarin Traditional. So maybe, for this specific case, you could use OpenLID-v2 or be permissive and allow every sentence classified as yue.
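
For reference, a Unicode-aware version of the alpha-ratio check could look like the sketch below; the function names and the threshold are assumptions, not the actual opuscleaner.filters.clean_common implementation:

import unicodedata

def alpha_ratio(line: str) -> float:
    """Fraction of non-space characters that are letters (Latin, Han, kana, ...)."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    # CJK ideographs are category "Lo", so a category-based check counts them as letters.
    letters = sum(1 for c in chars if unicodedata.category(c).startswith("L"))
    return letters / len(chars)

def keep(line: str, threshold: float = 0.5) -> bool:
    # If the real filter's regex misses part of the Traditional (or mixed) ranges,
    # a line like the one below scores far lower there and gets discarded.
    return alpha_ratio(line) >= threshold

print(keep("這是一個繁體中文的句子。"))  # True with a category-based letter check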

Maybe all these things are useful to you for parallel data, but I haven't had time to look at it. Will do on Monday.

@evgenyrp (Collaborator, Author) commented:

eflomal fails with OOM on the 200M back-translated corpus (https://firefox-ci-tc.services.mozilla.com/tasks/UJbEJNKJQrKGe7UFiqDrQQ/runs/3/logs/public/logs/live.log). I wonder if it's because the vocab is more diverse for zh-hant, or maybe the back-translation is of low quality.

@ZJaume (Collaborator) commented Nov 27, 2025

Ok, so I took a look at the parallel data, and it's a bit difficult to tell what's happening with the cleaning without a debug log for the filters. It was probably in part because of the lid218e model; let's see what comes out of it after the change to OpenLID.

I think it may be a good option to take a step back and start from parallel data that's originally written in Hant, then add the converted data and see if it improves. As far as I can tell, we are only using HPLT as the "native" corpus and the rest is converted. There are only a few more such corpora in OPUS, but training only on those would take the conversion out of the equation when working out what's hurting quality. Also, a small, but maybe not that small, thing is that it seems we are converting everything to Traditional here:

ch_type = self._detect(line)
if ch_type in (ch_type.none, to):
    new_line = line
else:
    new_line = self._convert_line(line, to)
    stats.script_conversion.converted += 1
out_file.write(new_line)

and there is at least one detection case that we shouldn't convert, which is hanzidentifier.BOTH. Then there's the hanzidentifier.MIXED case, but converting in that case might be the right choice.
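
A minimal sketch of the suggested guard, assuming hanzidentifier's module-level constants and an OpenCC s2t converter; the pipeline's actual _detect / _convert_line helpers may behave differently:

import hanzidentifier
import opencc

# "s2t" is an assumption; some OpenCC bindings expect "s2t.json".
converter = opencc.OpenCC("s2t")

def to_traditional(line: str) -> str:
    script = hanzidentifier.identify(line)
    # Leave alone anything that is already Traditional, has no Chinese at all,
    # or is valid in BOTH scripts (the case flagged above).
    if script in (hanzidentifier.TRADITIONAL, hanzidentifier.UNKNOWN, hanzidentifier.BOTH):
        return line
    # SIMPLIFIED and MIXED get converted; converting MIXED is the debatable case.
    return converter.convert(line)

print(to_traditional("简体中文测试"))  # converted to Traditional
print(to_traditional("中文"))          # identical in both scripts, left unchanged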

@evgenyrp (Collaborator, Author) commented:

It's 71M sentences of parallel corpus vs 24M before, after switching to OpenLID-v2 and relaxing the length limit a little. I think the main issue was the language ID.

And 86M of HPLT mono after implementing your suggestions, so we're probably good.

@evgenyrp (Collaborator, Author) commented:

and there is at least one detection case that we shouldn't convert, which is hanzidentifier.BOTH

Maybe, but hypothetically it shouldn't hurt.

@evgenyrp (Collaborator, Author) commented Dec 4, 2025

The corpus size is 84M now, after adding OPUS zh_TW, but the backward model's chrF validation curve is slightly lower: https://wandb.ai/moz-translations/zh-en?nw=nwuserepavlov
