
Commit add4364

naymaraq and claude authored

Add Covost2 audio benchmark (#1397)

Signed-off-by: naymaraq <dkaramyan@nvidia.com>
Signed-off-by: Dav Karamyan <47416614+naymaraq@users.noreply.github.com>
Co-authored-by: naymaraq <dkaramyan@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent f81cb4e · commit add4364

4 files changed

File tree

docs/evaluation/speech-audio.md

Lines changed: 99 additions & 0 deletions
```

Per-domain breakdowns are included automatically based on the `domain_label` field.
## CoVoST 2

CoVoST 2 is a large-scale multilingual corpus for speech recognition (ASR) and speech translation (AST), built on Common Voice audio with translation references from Facebook's [CoVoST v2](https://github.com/facebookresearch/covost) release.

**Tasks:** ASR (monolingual transcription) and AST (X→en / en→X translation)

**Splits:** `validation`, `test`

For non-alphabetic scripts (`zh-CN`, `ja`), evaluation reports Character Error Rate (CER) instead of Word Error Rate (WER); the choice is made per-sample via the `use_cer` flag set during data preparation.
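The WER/CER distinction amounts to running the same edit-distance metric over words versus characters. A minimal, self-contained sketch (illustrative only; the actual scorer applies its own text normalization before computing error rates):

```python
def _edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (1-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # dp[j] is D(i-1, j), dp[j-1] is D(i, j-1), prev is D(i-1, j-1)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[len(hyp)]


def error_rate(reference, hypothesis, use_cer=False):
    """WER over whitespace tokens, or CER over characters when `use_cer` is set."""
    ref = list(reference) if use_cer else reference.split()
    hyp = list(hypothesis) if use_cer else hypothesis.split()
    return _edit_distance(ref, hyp) / max(len(ref), 1)
```

With `use_cer=True` (as set for `zh-CN` and `ja` samples), spacing no longer matters and every character substitution counts individually.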

### Dataset Location

- Benchmark is defined in [`nemo_skills/dataset/covost2/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/covost2/__init__.py)
- Original benchmark source is hosted on [GitHub](https://github.com/facebookresearch/covost)
### Preparing CoVoST 2 Data

Unlike most other benchmarks on this page, **CoVoST 2 does not auto-download audio**. You must provide a Common Voice extraction with the layout:

```
<cv_data_dir>/
    <lang>/
        <split>/
            common_voice_<lang>_<id>.wav
```

and the corresponding `validated.tsv` (columns: `path, split, lang, sentence`).
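Under that layout, each `validated.tsv` row determines exactly one expected audio file. As a sketch (the helper name `audio_path` is hypothetical, and the assumption that the TSV `path` column carries the Common Voice clip name is ours, not the tool's):

```python
from pathlib import Path


def audio_path(cv_data_dir, row):
    """Map one validated.tsv row to its expected .wav location.

    Assumes the layout above: <cv_data_dir>/<lang>/<split>/common_voice_<lang>_<id>.wav,
    where `row["path"]` holds the Common Voice clip name (often an .mp3 filename).
    """
    clip_id = Path(row["path"]).stem  # strip any original extension
    return Path(cv_data_dir) / row["lang"] / row["split"] / f"{clip_id}.wav"
```

Checking a few rows this way before running `ns prepare_data` is a quick sanity test that your extraction matches the expected layout.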

The `--languages` flag selects which CoVoST 2 languages are prepared. For ASR it filters the source-language audio that is transcribed; for AST every valid X→en / en→X pair touching the listed languages is included. Omit it to prepare all 21 supported languages.
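The AST pair expansion described above can be sketched as follows (a simplification: `ast_pairs` is a hypothetical helper, and it ignores the fact that CoVoST 2 defines only a subset of valid en→X directions):

```python
def ast_pairs(languages):
    """Expand a `--languages` list into X→en / en→X translation directions."""
    pairs = []
    for lang in languages:
        if lang != "en":
            pairs.append((lang, "en"))  # X→en
            pairs.append(("en", lang))  # en→X (only kept when CoVoST 2 defines it)
    return pairs
```

So `--languages de fr` yields the de→en, en→de, fr→en, and en→fr directions, subject to the validity filtering the preparation script applies.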

=== "ASR"

    ```bash
    ns prepare_data covost2 \
        --data_dir /path/to/data \
        --cluster <cluster_name> \
        --task ASR \
        --languages de fr \
        --split test \
        --cv_data_dir /workspace/datasets/covost2 \
        --validated_tsv /workspace/datasets/covost2/validated.tsv
    ```

=== "AST"

    ```bash
    ns prepare_data covost2 \
        --data_dir /path/to/data \
        --cluster <cluster_name> \
        --task AST \
        --languages de fr es \
        --split test \
        --cv_data_dir /workspace/datasets/covost2 \
        --validated_tsv /workspace/datasets/covost2/validated.tsv
    ```

Each `--task` produces a separate manifest: `{split}-asr.jsonl` or `{split}-ast.jsonl` (e.g. `test-asr.jsonl`).
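As an illustration of the manifest format, an AST record might be shaped like this. Only the `extra_fields.src_text` / `extra_fields.tgt_text` keys are attested (by the judge configuration in `covost2/__init__.py` in this commit); every other field name here is hypothetical:

```python
import json

# Hypothetical AST manifest record: the extra_fields keys mirror the
# source_key/reference_key the judge pipeline reads; other fields are
# illustrative only, not the tool's actual schema.
record = {
    "audio_filepath": "/workspace/datasets/covost2/de/test/common_voice_de_123.wav",
    "extra_fields": {"src_text": "Guten Morgen.", "tgt_text": "Good morning."},
}
line = json.dumps(record)  # one such line per sample in {split}-ast.jsonl
```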

## FLEURS

[FLEURS](https://huggingface.co/datasets/google/fleurs) (Few-shot Learning Evaluation of Universal Representations of Speech) is Google's multilingual speech benchmark covering 102 locales. It supports both ASR and AST.

**Splits:** `train`, `dev`, `test`

CER (rather than WER) is used for these locales: `cmn_hans_cn`, `yue_hant_hk`, `ja_jp`, `th_th`, `lo_la`, `my_mm`, `km_kh`, `ko_kr`, `vi_vn`.
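The locale-to-metric rule is a simple set membership check; as a sketch (`metric_for_locale` is a hypothetical helper, not part of the codebase):

```python
# Locales listed above that are scored with CER instead of WER.
CER_LOCALES = {
    "cmn_hans_cn", "yue_hant_hk", "ja_jp", "th_th", "lo_la",
    "my_mm", "km_kh", "ko_kr", "vi_vn",
}


def metric_for_locale(locale):
    """Pick the error metric for a FLEURS locale per the rule above."""
    return "cer" if locale in CER_LOCALES else "wer"
```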

### Dataset Location

- Benchmark is defined in [`nemo_skills/dataset/fleurs/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/fleurs/__init__.py)
- Original dataset is hosted on [HuggingFace](https://huggingface.co/datasets/google/fleurs)

### Preparing FLEURS Data

Audio is downloaded automatically from HuggingFace. As with CoVoST 2, `--task` produces `{split}-asr.jsonl` or `{split}-ast.jsonl`.

The `--languages` flag selects which FLEURS locales are prepared. For ASR it filters the source-language audio that is transcribed; for AST every (`en_us` → locale) and (locale → `en_us`) pair across the listed locales is included. Omit it to prepare all 102 locales.

=== "ASR"

    ```bash
    ns prepare_data fleurs \
        --data_dir /path/to/data \
        --cluster <cluster_name> \
        --task ASR \
        --languages en_us de_de fr_fr \
        --split test
    ```

=== "AST"

    ```bash
    ns prepare_data fleurs \
        --data_dir /path/to/data \
        --cluster <cluster_name> \
        --task AST \
        --languages en_us de_de fr_fr es_419 it_it ja_jp \
        --split test
    ```
Lines changed: 22 additions & 0 deletions
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

REQUIRES_DATA_DIR = True
METRICS_TYPE = "audio"
EVAL_ARGS = "++eval_type=audio ++eval_config.normalization_mode=multilingual"
GENERATION_ARGS = "++prompt_format=openai ++enable_audio=true"
JUDGE_PIPELINE_ARGS = {
    "source_key": "extra_fields.src_text",
    "reference_key": "extra_fields.tgt_text",
}
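The `EVAL_ARGS` and `GENERATION_ARGS` strings above are Hydra-style `++key=value` overrides. A minimal sketch of how such a string decomposes into key/value pairs (illustrative only; the real pipeline hands these straight to its config system, and `parse_overrides` is a hypothetical helper):

```python
def parse_overrides(args):
    """Split a '++key=value ++key2=value2' override string into a dict."""
    out = {}
    for token in args.split():
        key, _, value = token.lstrip("+").partition("=")
        out[key] = value
    return out
```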
