Per-domain breakdowns are included automatically based on the `domain_label` field.
## CoVoST 2
CoVoST 2 is a large-scale multilingual corpus for speech recognition (ASR) and speech translation (AST), built on Common Voice audio with translation references from Facebook's [CoVoST v2](https://github.com/facebookresearch/covost) release.
**Tasks:** ASR (monolingual transcription) and AST (X→en / en→X translation)
**Splits:** `validation`, `test`
For non-alphabetic scripts (`zh-CN`, `ja`), evaluation reports Character Error Rate (CER) instead of Word Error Rate (WER); the choice is made per-sample via the `use_cer` flag set during data preparation.
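
A rough sketch of how such a per-sample metric switch can work; the `edit_distance` and `error_rate` helpers below are illustrative, not the benchmark's actual implementation — only the `use_cer` flag and the WER/CER distinction come from the description above:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def error_rate(reference: str, hypothesis: str, use_cer: bool) -> float:
    """WER over whitespace-split words, or CER over characters when use_cer is set."""
    ref = list(reference) if use_cer else reference.split()
    hyp = list(hypothesis) if use_cer else hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

For space-delimited scripts `error_rate(..., use_cer=False)` counts word-level edits; for `zh-CN` or `ja`, where whitespace tokenization is meaningless, the character-level variant is the fairer measure.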
### Dataset Location
- Benchmark is defined in [`nemo_skills/dataset/covost2/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/covost2/__init__.py)
- Original benchmark source is hosted on [GitHub](https://github.com/facebookresearch/covost)
608
+
609
+
### Preparing CoVoST 2 Data
Unlike most other benchmarks on this page, **CoVoST 2 does not auto-download audio**. You must provide a Common Voice extraction with the layout:
```
<cv_data_dir>/
    <lang>/
        <split>/
            common_voice_<lang>_<id>.wav
```

and the corresponding `validated.tsv` (columns: `path, split, lang, sentence`).
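
A minimal sketch of how a `validated.tsv` row maps onto that layout; the `wav_path` helper is hypothetical, and only the directory structure and column names come from the description above:

```python
from pathlib import Path

def wav_path(cv_data_dir: str, row: dict) -> Path:
    """Resolve a validated.tsv row (path, split, lang, sentence) to the
    wav file location expected by the CoVoST 2 preparation layout."""
    # The 'path' column names the original Common Voice clip (often .mp3);
    # the extraction described above stores the same clip as .wav.
    stem = Path(row["path"]).stem  # e.g. common_voice_de_12345
    return Path(cv_data_dir) / row["lang"] / row["split"] / f"{stem}.wav"

row = {"path": "common_voice_de_12345.mp3", "split": "test",
       "lang": "de", "sentence": "Beispielsatz"}
wav_path("/data/cv", row)  # /data/cv/de/test/common_voice_de_12345.wav
```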
The `--languages` flag selects which CoVoST 2 languages are prepared. For ASR it filters the source-language audio that is transcribed; for AST every valid X→en / en→X pair touching the listed languages is included. Omit it to prepare all 21 supported languages.
Each `--task` produces a separate manifest: `{split}-asr.jsonl` or `{split}-ast.jsonl` (e.g. `test-asr.jsonl`).
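
The pair expansion and manifest naming described above can be sketched as follows; both helpers are hypothetical, and the real preparation additionally excludes direction/language combinations that CoVoST 2 does not ship:

```python
def ast_pairs(languages):
    """Enumerate the X->en and en->X pairs touching the requested languages."""
    pairs = set()
    for lang in languages:
        if lang != "en":
            pairs.add((lang, "en"))  # X -> en
            pairs.add(("en", lang))  # en -> X
    return sorted(pairs)

def manifest_name(split: str, task: str) -> str:
    """Manifest naming used above: {split}-{task}.jsonl."""
    return f"{split}-{task}.jsonl"

ast_pairs(["de"])            # [('de', 'en'), ('en', 'de')]
manifest_name("test", "asr")  # 'test-asr.jsonl'
```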
## FLEURS
[FLEURS](https://huggingface.co/datasets/google/fleurs) (Few-shot Learning Evaluation of Universal Representations of Speech) is Google's multilingual speech benchmark covering 102 locales. It supports both ASR and AST.
**Splits:** `train`, `dev`, `test`
CER (rather than WER) is used for these locales: `cmn_hans_cn`, `yue_hant_hk`, `ja_jp`, `th_th`, `lo_la`, `my_mm`, `km_kh`, `ko_kr`, `vi_vn`.
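
A minimal sketch of that locale-based metric selection; the `metric_for` helper is illustrative, but the locale set is exactly the one listed above:

```python
# FLEURS locales scored with CER instead of WER (non-space-delimited scripts).
CER_LOCALES = {
    "cmn_hans_cn", "yue_hant_hk", "ja_jp", "th_th", "lo_la",
    "my_mm", "km_kh", "ko_kr", "vi_vn",
}

def metric_for(locale: str) -> str:
    """Return 'cer' for the locales above, 'wer' otherwise."""
    return "cer" if locale in CER_LOCALES else "wer"
```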
### Dataset Location
- Benchmark is defined in [`nemo_skills/dataset/fleurs/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/fleurs/__init__.py)
- Original dataset is hosted on [HuggingFace](https://huggingface.co/datasets/google/fleurs)
664
+
665
+
### Preparing FLEURS Data
Audio is downloaded automatically from HuggingFace. As with CoVoST 2, `--task` produces `{split}-asr.jsonl` or `{split}-ast.jsonl`.
The `--languages` flag selects which FLEURS locales are prepared. For ASR it filters the source-language audio that is transcribed; for AST every (`en_us` → locale) and (locale → `en_us`) pair across the listed locales is included. Omit it to prepare all 102 locales.