Chechen (che): new-language finetune guidance + potential open dataset contribution (~20-40 h) #185

Hi, thanks for releasing OmniVoice with an open finetune pipeline — it's the most promising path we've found for low-resource Caucasus languages.

We are building an open-source Chechen (che, ~1.5M speakers, Cyrillic script) MT+TTS toolkit (https://github.com/alpha-service/chechen-language-toolkit) and are preparing a TTS corpus: ~20–40 h of clean single-narrator audio (verse-aligned audiobook data), possibly supplemented by additional studio recordings.

Chechen is not among the 646 pretraining languages, but Kabardian (108 h) and Ossetic (1.4 h) are, so the model has already seen related Caucasus phonologies (ejectives, pharyngeals).

Questions:

1. **language_id for a new language**: for finetuning on a language outside the 646, is it better to omit `language_id` in the JSONL, or can/should we introduce a new id (e.g. `che`)? Does the id influence anything beyond a learned embedding lookup?
2. **Official `che` support**: any plans to extend language coverage? Would a clean, openly-licensed Chechen dataset (with text alignments) be of interest upstream once we clear licensing?
3. **Recipe sanity check**: for ~20–40 h of single-narrator data in a new language, would you still recommend the `run_finetune.sh` defaults as a starting point (we saw your note in #5 that 10 h works)? Also, is the community finding that LR 2e-5 (vs. the default 5e-5) avoids cross-lingual forgetting consistent with your experience?
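
To make question 1 concrete, here is a minimal sketch of the two manifest variants we are weighing. Only the `language_id` key is taken from the finetune JSONL as we understand it; the `audio_path` and `text` field names, the clip paths, the sample sentences, and the `make_entry` / `to_jsonl` helpers are placeholders for illustration and may not match the actual OmniVoice schema.

```python
import json


def make_entry(audio_path, text, language_id=None):
    """Build one manifest record; the language_id key is omitted when None."""
    # NOTE: "audio_path" and "text" are assumed field names, not the confirmed schema.
    entry = {"audio_path": audio_path, "text": text}
    if language_id is not None:
        entry["language_id"] = language_id
    return entry


def to_jsonl(entries):
    """Serialize records as JSON Lines, one object per line."""
    # ensure_ascii=False keeps the Cyrillic text human-readable in the file.
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries)


# Placeholder clips and sentences, not real corpus entries.
samples = [
    ("clips/0001.wav", "Нохчийн мотт"),
    ("clips/0002.wav", "Де дика дойла"),
]

# Variant A: no language_id at all.
without_id = to_jsonl(make_entry(path, text) for path, text in samples)

# Variant B: a new id ("che") that is outside the 646 pretraining languages.
with_new_id = to_jsonl(
    make_entry(path, text, language_id="che") for path, text in samples
)

print(without_id)
print(with_new_id)
```

If variant B is viable, we would also like to know whether an unseen id gets a freshly initialized embedding or falls back to some default.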

Happy to share results back — we plan to publish the finetuned model and the data pipeline openly.

