Hi, thanks for releasing OmniVoice with an open finetune pipeline — it's the most promising path we've found for low-resource Caucasus languages.
We are building an open-source Chechen (che, ~1.5M speakers, Cyrillic script) MT+TTS toolkit (https://github.com/alpha-service/chechen-language-toolkit) and are preparing a TTS corpus: ~20–40 h of clean single-narrator audio (verse-aligned audiobook data), possibly supplemented by additional studio recordings.
Chechen is not among the 646 pretraining languages, but Kabardian (108 h) and Ossetic (1.4 h) are, so the model has already seen related Caucasus phonologies (ejectives, pharyngeals).
Questions:
1. **`language_id` for a new language:** when finetuning on a language outside the 646, is it better to omit `language_id` in the JSONL, or can/should we introduce a new id (e.g. `che`)? Does the id influence anything beyond a learned embedding lookup?
2. **Official `che` support:** are there any plans to extend language coverage? Would a clean, openly licensed Chechen dataset (with text alignments) be of interest upstream once we clear licensing?
3. **Recipe sanity check:** for ~20–40 h of single-narrator data in a new language, would you still recommend the `run_finetune.sh` defaults as a starting point? We saw your note in issue #5 ("Fine-tuning guidance for low-resource languages?") that 10 h works. Is the community finding that an LR of 2e-5 (vs. the default 5e-5) helps avoid cross-lingual forgetting consistent with your experience?
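To make the first question concrete, here is a minimal sketch of the two manifest variants we are weighing. Only `language_id` comes from the finetune docs as we understand them; the other field names (`id`, `audio_path`, `text`), the helper `manifest_line`, and the sample values are our own placeholders and may not match what the pipeline actually expects.

```python
import json


def manifest_line(utt_id, audio_path, text, language_id=None):
    """Serialize one utterance as a JSONL line.

    Field names other than language_id are hypothetical placeholders.
    language_id is left out of the record entirely when it is None.
    """
    record = {"id": utt_id, "audio_path": audio_path, "text": text}
    if language_id is not None:
        record["language_id"] = language_id
    # ensure_ascii=False keeps the Cyrillic text readable in the file
    return json.dumps(record, ensure_ascii=False)


# Variant A: no language_id at all
line_a = manifest_line("utt_0001", "wavs/utt_0001.wav", "Маршалла")

# Variant B: a new id ("che") that is not among the pretraining languages
line_b = manifest_line("utt_0001", "wavs/utt_0001.wav", "Маршалла", language_id="che")

print(line_a)
print(line_b)
```

If variant B is accepted, we would like to understand whether an unseen id maps to a freshly initialized embedding or falls back to some default.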
Happy to share results back — we plan to publish the finetuned model and the data pipeline openly.