[Experiment] Chinese Traditional #1288
base: main
Conversation
  # failed to download/unpack
  # - mtdata_Statmt-ccaligned-1-eng-zho_CN  # ~5,669,496 sentences (640.7 MB)
  # - mtdata_Statmt-ccaligned-1-eng-zho_TW  # ~2,407,082 sentences (272.0 MB)
+ - url_https://storage.googleapis.com/releng-translations-dev/data/en-zh_hant/hplt_v1_1.[LANG].zst
I uploaded OPUS HPLT v1.1 to GCS as it's not available through the API.
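For context, here is a minimal sketch of how those uploaded shards could be fetched and decompressed outside the pipeline. The URL template comes from the config line above; the [LANG] substitutions and the use of the requests and zstandard libraries are assumptions for illustration, not pipeline code.

    # Hypothetical helper: fetch and decompress the GCS-hosted OPUS HPLT v1.1 shards.
    # The URL template is taken from the config above; the [LANG] values and the
    # choice of libraries are assumptions.
    import requests
    import zstandard

    URL_TEMPLATE = (
        "https://storage.googleapis.com/releng-translations-dev/"
        "data/en-zh_hant/hplt_v1_1.[LANG].zst"
    )

    def fetch_shard(lang: str, out_path: str) -> None:
        url = URL_TEMPLATE.replace("[LANG]", lang)
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            dctx = zstandard.ZstdDecompressor()
            with open(out_path, "wb") as out:
                dctx.copy_stream(resp.raw, out)

    for lang in ("en", "zh_hant"):  # assumed shard names
        fetch_shard(lang, f"hplt_v1_1.{lang}.txt")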
@ZJaume FYI, these are the hacks I'm adding to train zh-hant. They are not meant to be landed on main as is.
  parser.add_argument("--language", type=str, help="The BCP 47 language tag of the dataset")
- parser.add_argument("--src", type=bool, help="Source language of a language pair")
- parser.add_argument("--trg", type=bool, help="Target language of a language pair")
+ parser.add_argument("--src", type=str, help="Source language of a language pair")
This bug caused everything to always be filtered and never converted to one type of Chinese. It was likely the case for Simplified as well, but there it's not critical because there was enough data.
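To illustrate the failure mode (a minimal reproduction, not the pipeline's actual filtering code): argparse applies the type callable to the raw string, and bool("zh") is True, so the parsed values become booleans rather than language codes and any comparison against a language tag fails.

    # Minimal reproduction of the argparse `type=bool` pitfall; illustrative only.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--src", type=bool, help="Source language of a language pair")
    parser.add_argument("--trg", type=bool, help="Target language of a language pair")
    args = parser.parse_args(["--src", "en", "--trg", "zh"])

    # argparse calls bool() on the raw strings: bool("en") and bool("zh") are both True.
    print(args.src, args.trg)   # True True
    print(args.trg == "zh")     # False, so a zh-specific branch never triggers

    # With type=str the values stay as "en" and "zh", and the Chinese
    # conversion/filtering logic can actually run.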
ZJaume left a comment:
LGTM
Only 24M parallel corpus examples are left (stats, group). When I trained zh-en in Simplified, the corpus size was 86M (stats, group).

@ZJaume do you have ideas why there is such a big difference? Could it be the new NLLB fastText model? We currently identify only the bare zh code with it, without the script. We do the conversion from Simplified to Traditional before that, at the data import step, so I wonder if the model didn't like the converted texts. NLLB filtering went from 71M down to 24M.
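For reference, a small sketch of the kind of script-aware check being discussed, assuming a fastText model whose labels carry the script (e.g. __label__zho_Hans / __label__zho_Hant, as in NLLB/OpenLID-style models). The model path, label format, and threshold are assumptions, not the pipeline's actual configuration.

    # Sketch of script-aware language id filtering; the model file and the
    # __label__zho_Hant label format are assumptions about NLLB/OpenLID-style models.
    import fasttext

    model = fasttext.load_model("lid.bin")  # hypothetical path

    def keep_traditional(line: str, threshold: float = 0.5) -> bool:
        labels, probs = model.predict(line.strip().replace("\n", " "))
        label = labels[0].replace("__label__", "")
        # Matching only a bare "zh"/"zho" label would reject Hant-converted text
        # if the model assigns it a script-specific label instead.
        return label == "zho_Hant" and probs[0] >= threshold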
So, regarding mono HPLT, this seems to be the cleaning summary; things that I think may be addressed:
Maybe all these things are useful to you for parallel data too, but I haven't had time to look at it. Will do on Monday.
eflomal fails with OOM on the 200M back-translated corpus: https://firefox-ci-tc.services.mozilla.com/tasks/UJbEJNKJQrKGe7UFiqDrQQ/runs/3/logs/public/logs/live.log I wonder if it's because the vocabulary is more diverse for zh-hant, or maybe the back-translations are of low quality.
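One possible workaround, sketched below rather than something the pipeline does today, is to shard the back-translated corpus and align each shard separately, so a single eflomal run never has to hold all 200M pairs. The shard size and output file names here are assumptions.

    # Hypothetical sharding helper: split a large parallel corpus into fixed-size
    # chunks so each alignment job fits in memory.
    import itertools

    def shard_corpus(src_path: str, trg_path: str, lines_per_shard: int = 10_000_000) -> int:
        with open(src_path, encoding="utf-8") as src, open(trg_path, encoding="utf-8") as trg:
            shard = 0
            while True:
                src_chunk = list(itertools.islice(src, lines_per_shard))
                trg_chunk = list(itertools.islice(trg, lines_per_shard))
                if not src_chunk:
                    return shard
                with open(f"shard.{shard}.src", "w", encoding="utf-8") as s_out, \
                     open(f"shard.{shard}.trg", "w", encoding="utf-8") as t_out:
                    s_out.writelines(src_chunk)
                    t_out.writelines(trg_chunk)
                shard += 1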
Ok, so I took a look at parallel, and it's a bit difficult to tell what's happening with the cleaning without a debugging log for the filters. Probably it was in part because of […].

I think it may be a good option to take a step back and start from parallel data that's originally written in Hant, then add converted data and see if it improves. As far as I can tell, we are only using HPLT as a "native" corpus and the rest is converted. There are only a few more in OPUS, but training only on that can take the conversion out of the equation of what's hurting quality.

Also, a small but maybe not that small thing: it seems we are converting everything to Traditional here (translations/pipeline/data/cjk.py, lines 65 to 71 in a3406bc), and there is at least one detection case that we shouldn't convert, which is […].
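For illustration, a sketch of the kind of guarded conversion being discussed, using OpenCC. The "s2t" config is a real OpenCC profile, but the detection helper is hypothetical, and the actual logic in cjk.py (lines 65 to 71 referenced above) may differ.

    # Sketch of Simplified-to-Traditional conversion with a guard for text that
    # should be left alone. `looks_traditional` is a hypothetical stand-in for
    # whatever detection case the pipeline should respect.
    from opencc import OpenCC

    s2t = OpenCC("s2t")

    def looks_traditional(line: str) -> bool:
        # Hypothetical heuristic: if conversion changes nothing, the line is
        # already Traditional (or script-neutral) and shouldn't be touched.
        return s2t.convert(line) == line

    def to_traditional(line: str) -> str:
        if looks_traditional(line):
            return line
        return s2t.convert(line)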
It's 71M sentences of parallel corpus vs 24M before, after switching to openlid-v2 and relaxing the length filtering a little. I think the main issue was in lang id. And 86M of HPLT mono after implementing your suggestions, so we're probably good.
Maybe, but it shouldn't hurt, hypothetically.
The corpus size is 84M now, after adding OPUS zh_TW, but the backward model's chrF validation curve during training is slightly lower: https://wandb.ai/moz-translations/zh-en?nw=nwuserepavlov
Training dashboard