Chechen (che): new-language finetune guidance + potential open dataset contribution (~20-40 h) #185

@alpha-service

Hi, thanks for releasing OmniVoice with an open finetune pipeline — it's the most promising path we've found for low-resource Caucasus languages.

We are building an open-source Chechen (che, ~1.5M speakers, Cyrillic script) MT+TTS toolkit (https://github.com/alpha-service/chechen-language-toolkit) and are preparing a TTS corpus: ~20–40 h of clean single-narrator audio (verse-aligned audiobook data) with a possibility of additional studio recordings.

Chechen is not among the 646 pretrain languages, but Kabardian (108 h) and Ossetic (1.4 h) are, so the model has already seen related Caucasus phonologies (ejectives, pharyngeals).

Questions:

  1. language_id for a new language: for finetuning on a language outside the 646, is it better to omit language_id in the JSONL, or can/should we introduce a new id (e.g. che)? Does the id influence anything beyond a learned embedding lookup?
  2. Official che support: any plans to extend language coverage? Would a clean, openly-licensed Chechen dataset (with text alignments) be of interest upstream once we clear licensing?
  3. Recipe sanity check: for ~20–40 h of single-narrator data on a new language, would you still recommend the run_finetune.sh defaults as a starting point? We saw your note in #5 ("Fine-tuning guidance for low-resource languages?") that 10 h works. Also, is the community finding that LR 2e-5 (vs. the default 5e-5) avoids cross-lingual forgetting consistent with your experience?
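
To make question 1 concrete, here is a rough sketch of the two manifest variants we are weighing. Only `language_id` comes from the finetune JSONL as we understand it; the `audio_path` and `text` field names, the clip path, and the `manifest_line` helper are placeholders we made up for illustration and may not match the actual schema.

```python
import json


def manifest_line(audio_path, text, language_id=None):
    """Serialize one utterance as a JSONL record.

    NOTE: "audio_path" and "text" are hypothetical field names used only
    for illustration; only "language_id" is taken from the question above.
    """
    record = {"audio_path": audio_path, "text": text}
    if language_id is not None:
        # Variant B: introduce a new id for the unseen language.
        record["language_id"] = language_id
    # ensure_ascii=False keeps the Cyrillic text readable in the file.
    return json.dumps(record, ensure_ascii=False)


# Variant A: omit language_id entirely.
line_a = manifest_line("clips/0001.wav", "Маршалла ду хьоьга")

# Variant B: same utterance tagged with a new id, "che".
line_b = manifest_line("clips/0001.wav", "Маршалла ду хьоьга", language_id="che")

print(line_a)
print(line_b)
```

If the id only feeds an embedding lookup, we assume variant B would start from an untrained embedding for `che`, which is part of why we are asking which variant you would recommend.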

Happy to share results back — we plan to publish the finetuned model and the data pipeline openly.
