Wav2Vec2 ASR Recipes

We provide two fairseq2 recipes, which are pre-configured training and evaluation workflows that combine models, datasets, and hyperparameters into reproducible experiments that you can run with a single command.

wav2vec2.asr.recipe - Training recipe for fine-tuning our CTC checkpoints or training a CTC model from our W2V encoder checkpoint (see Training Strategies below).

wav2vec2.asr.eval.recipe - Evaluation recipe for testing model performance on fleurs-mls-mini from our data preparation tutorial.

Each recipe stores its configurations as YAML files in its own config directory (wav2vec2.asr.config and wav2vec2.asr.eval.config, respectively). The evaluation recipe reuses the datasets from the training recipe.

Dataset Backends

Our dataset implementation supports flexible combinations of storage backends (parquet or manifest files) and task backends (ASR or SSL).

The SSL task returns a SequenceBatch (audio-only) rather than a Seq2SeqBatch (audio + text), so integrating it into the ASR recipe is non-trivial; it is kept here as a reference. We also include the manifest-based storage implementation as an alternative to parquet; the codebase includes comprehensive comments to guide implementation.

Usage

Set an output directory for the resulting artifacts (model checkpoints during training, generated hypotheses during evaluation) and run the recipe:

> cd omnilingual_asr
> export OUTPUT_DIR="/path/to/artifact/directory"
> python -m workflows.recipes.wav2vec2.asr $OUTPUT_DIR --config-file workflows/recipes/wav2vec2/asr/configs/ctc-finetune.yaml
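
Evaluation follows the same pattern. The eval module path and config filename below are assumptions based on the directory layout in the next section; check the eval configs directory for the available YAML files:

> python -m workflows.recipes.wav2vec2.asr.eval $OUTPUT_DIR --config-file workflows/recipes/wav2vec2/asr/eval/configs/<your-eval-config>.yaml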

Core Recipe Structure

.
└── wav2vec2/asr
    ├── eval
    │   ├── configs/            # Eval recipe YAML configs
    │   ├── default_config.py
    │   └── recipe.py           # Eval logic
    ├── configs/                # Train recipe YAML configs
    ├── criterion.py
    ├── dataset_selector.py     # Dataset backend switching
    ├── default_config.py
    ├── recipe.py               # Train logic
    └── wer_calculator.py       # WER metric

Training Strategies

We offer the following recommendations for users who are compute-constrained and wish to fine-tune our smaller CTC checkpoints on specific low-resource languages. As reported in Section 5.7.5 of the paper, fine-tuning smaller-scale CTC models in these settings produced models that were competitive with our 7B LLM ASR model on those languages. The optimal fine-tuning hyperparameters will vary from language to language, but the following presets performed generally well and serve as a good starting point.

dataset:
  (...)
  asr_task_config:
    max_audio_len: 960_000      # 60s at 16kHz
    max_num_elements: 7_680_000 # at most eight 60s samples, or more shorter samples

optimizer:
  config:
    lr: 1e-05

trainer:
  grad_accumulation:
    num_batches: 4 # Increase gradient accumulation if you run out of memory during training; we use 32 GPUs for the 300M model, 64 for 1B, and 96 for 3B

regime:
  num_steps: 5_000
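
As a rough sanity check on batch sizing (assuming max_num_elements counts raw audio samples at 16kHz and batches are full): 7_680_000 samples is about 480s of audio per device batch, i.e. at most eight 60s utterances. With num_batches: 4 gradient accumulation on 32 GPUs, one optimizer step then covers roughly 32 × 4 × 480s ≈ 17 hours of audio.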

We provide an example configuration under ctc-finetune-recommendation.yaml for further fine-tuning our CTC checkpoint; use ctc-from-encoder-recommendation.yaml to train your own CTC model from our W2V encoder checkpoint.
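
For example, assuming these files live alongside ctc-finetune.yaml in the training configs directory shown above:

> python -m workflows.recipes.wav2vec2.asr $OUTPUT_DIR --config-file workflows/recipes/wav2vec2/asr/configs/ctc-finetune-recommendation.yaml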