Commit 7dcb0a6

Add ContextASR-Bench benchmark (contextual ASR with NE-WER/NE-FNR metrics) (#1405)
Signed-off-by: Kunal Dhawan <kunaldhawan97@gmail.com>
1 parent f57b173 commit 7dcb0a6

14 files changed

Lines changed: 1030 additions & 0 deletions

core/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 
 bs4
 compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@e01a5d2
+contractions
 datasets
 editdistance
 evalplus @ git+https://github.com/evalplus/evalplus@c91370f

docs/evaluation/speech-audio.md

Lines changed: 107 additions & 0 deletions
@@ -483,3 +483,110 @@ Numb3rs reports the following metrics:
- **success_rate**: Percentage of samples with WER < 0.5

Per-category breakdowns (e.g., `numb3rs-numb3rs_CARDINAL`, `numb3rs-numb3rs_MONEY`) are included automatically.

## ContextASR-Bench

ContextASR-Bench evaluates contextual ASR performance by measuring how well models transcribe speech when given different levels of contextual information. It measures how accurately named entities are transcribed, alongside standard WER.

**Dataset:** [MrSupW/ContextASR-Bench](https://huggingface.co/datasets/MrSupW/ContextASR-Bench) (English Speech subset: 15,326 samples, ~188 hours, 116,167 named entities across 10+ domains)

**Evaluation Modes:**

- `contextasr-bench.contextless`: Plain transcription (no context)
- `contextasr-bench.coarse`: Domain label provided as context
- `contextasr-bench.fine`: Domain label + entity list provided as context
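
The three modes differ only in how much context accompanies the audio. The sketch below illustrates that idea with a hypothetical `build_prompt` helper; the function name, its parameters, and the prompt wording are all assumptions made for illustration, since the prompts the benchmark actually uses are not shown in this diff.

```python
def build_prompt(mode, domain_label=None, entities=None):
    """Hypothetical prompt builder; the wording is illustrative only."""
    base = "Transcribe the audio."
    if mode == "contextless":
        # No context at all: plain transcription.
        return base
    if mode == "coarse":
        # Only the domain label is supplied.
        return f"{base} The topic is: {domain_label}."
    if mode == "fine":
        # Domain label plus the list of entities that may occur.
        entity_text = ", ".join(entities or [])
        return f"{base} The topic is: {domain_label}. It may mention: {entity_text}."
    raise ValueError(f"unknown mode: {mode}")


print(build_prompt("fine", domain_label="Medicine", entities=["aspirin", "ibuprofen"]))
```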

**Metrics:**

- **WER**: Word Error Rate (corpus-level)
- **NE-WER**: Named Entity WER, i.e. WER computed on fuzzy-matched entity token sequences
- **NE-FNR**: Named Entity False Negative Rate, i.e. the fraction of reference entities not found in the transcription
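
As a rough illustration of two of these metrics, the sketch below computes a corpus-level WER and a simplified NE-FNR in plain Python. It is not the benchmark's implementation: the helper names are made up, entities are matched by exact lowercase phrase rather than fuzzy matching, and NE-WER is left out because its matching procedure is not described in this diff.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def ne_fnr(samples):
    """Fraction of reference entities missing from the hypothesis (exact match only)."""
    total = missed = 0
    for hyp, entities in samples:
        padded = f" {hyp.lower()} "
        for entity in entities:
            total += 1
            if f" {entity.lower()} " not in padded:
                missed += 1
    return missed / total


pairs = [
    ("the patient took aspirin daily", "the patient took aspirin daily"),
    ("call doctor alvarez tomorrow", "call doctor alvarado tomorrow"),
]
samples = [(pairs[0][1], ["aspirin"]), (pairs[1][1], ["doctor alvarez"])]
print(f"WER: {corpus_wer(pairs):.3f}")
print(f"NE-FNR: {ne_fnr(samples):.3f}")
```

In this toy example one of nine reference words is wrong and one of two reference entities is missed, which shows why an entity-focused metric can look much worse than the overall WER.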

### Dataset Location

* Benchmark is defined in `nemo_skills/dataset/contextasr-bench/__init__.py`
* Original dataset is hosted on [HuggingFace](https://huggingface.co/datasets/MrSupW/ContextASR-Bench)

### Preparing ContextASR-Bench Data

ContextASR-Bench requires audio files for meaningful evaluation. **Audio files are downloaded
automatically by default** from HuggingFace (~22 GB, may take 30-60 minutes).

```bash
ns prepare_data contextasr-bench
```

!!! warning "Large download"

    The automatic download fetches ~22 GB of audio data (JSONL + 8 tar files) from HuggingFace.
    This can take 30-60 minutes depending on network speed. If you already have the data
    downloaded, use `--data_dir` to skip the download.

To download to a specific directory, or to use pre-downloaded data:

```bash
ns prepare_data contextasr-bench --data_dir=/path/to/ContextASR-Bench
```

If the directory already contains `ContextASR-Speech_English.jsonl`, the existing data is
used directly. If the file is missing, data is downloaded there automatically.

To use a custom audio path prefix (e.g., for container mount points):

```bash
ns prepare_data contextasr-bench --data_dir=/path/to/ContextASR-Bench --audio-prefix /data/contextasr
```

### Running ContextASR-Bench Evaluation

Evaluate all three modes:

```bash
ns eval \
    --cluster=local \
    --benchmarks=contextasr-bench \
    --server_type=openai \
    --server_address=http://localhost:8000/v1 \
    --model=Qwen/Qwen3-Omni-7B \
    --output_dir=/workspace/contextasr-eval \
    --data_dir=/path/to/ContextASR-Bench
```

Evaluate a single mode:

```bash
ns eval --benchmarks=contextasr-bench.fine ...
```

### Understanding ContextASR-Bench Results

```
<output_dir>/
└── eval-results/
    └── contextasr-bench/
        ├── metrics.json                          # Overall aggregate
        ├── contextasr-bench.contextless/
        │   └── metrics.json
        ├── contextasr-bench.coarse/
        │   └── metrics.json
        └── contextasr-bench.fine/
            └── metrics.json
```
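
To inspect these files programmatically, a small script along the following lines may help. It is a minimal sketch that assumes only the directory layout shown above; `load_metrics` is a hypothetical helper, and nothing is assumed about the keys inside each `metrics.json`.

```python
import json
from pathlib import Path


def load_metrics(output_dir):
    """Load every metrics.json under eval-results/contextasr-bench, keyed by folder name."""
    root = Path(output_dir) / "eval-results" / "contextasr-bench"
    results = {}
    # "**" also matches the root itself, so the aggregate file is included.
    for path in sorted(root.glob("**/metrics.json")):
        with path.open() as f:
            results[path.parent.name] = json.load(f)
    return results


if __name__ == "__main__":
    # The output directory from the eval command above; adjust as needed.
    for name, metrics in load_metrics("/workspace/contextasr-eval").items():
        print(name, json.dumps(metrics, indent=2))
```

The aggregate file is keyed as `contextasr-bench`, and each mode is keyed by its sub-directory name, such as `contextasr-bench.fine`.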

Example output:

```
----------------------- contextasr-bench.contextless -----------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | wer   | ne_wer | ne_fnr | num_entries
pass@1          | 128        | 12000       | 97.73%       | 2.27% | 7.83%  | 9.08%  | 15326

------------------------- contextasr-bench.coarse --------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | wer   | ne_wer | ne_fnr | num_entries
pass@1          | 128        | 12000       | 97.83%       | 2.17% | 8.11%  | 9.32%  | 15326

-------------------------- contextasr-bench.fine ---------------------------
evaluation_mode | avg_tokens | gen_seconds | success_rate | wer   | ne_wer | ne_fnr | num_entries
pass@1          | 128        | 12000       | 98.87%       | 1.13% | 1.55%  | 0.53%  | 15326
```

Per-domain breakdowns are included automatically based on the `domain_label` field.
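
One way to read the example table is to compare the fine mode against the contextless baseline as a relative error reduction. The snippet below does that arithmetic using the example values printed above; the helper name is an assumption, and the numbers illustrate the output format rather than reported results.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent: (baseline - improved) / baseline * 100."""
    return (baseline - improved) / baseline * 100


# (contextless, fine) pairs copied from the example output, in percent.
example = {"wer": (2.27, 1.13), "ne_wer": (7.83, 1.55), "ne_fnr": (9.08, 0.53)}
for metric, (contextless, fine) in example.items():
    print(f"{metric}: {relative_reduction(contextless, fine):.1f}% relative reduction")
```

With these example values the entity-focused metrics improve far more than overall WER (about 80% and 94% versus about 50%), while the coarse mode's NE-WER and NE-FNR are slightly higher than the contextless ones.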
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""ContextASR-Bench: Contextual ASR evaluation benchmark.

Evaluates ASR models across three context settings:
- Contextless: Plain transcription
- Coarse-grained: Domain label provided as context
- Fine-grained: Domain label + entity list provided as context

Metrics: WER, NE-WER (entity-focused WER with fuzzy matching), NE-FNR (entity miss rate)

Dataset: https://huggingface.co/datasets/MrSupW/ContextASR-Bench
Paper: ContextASR-Bench (English Speech subset, 15,326 samples, ~188 hours)
"""

REQUIRES_DATA_DIR = True
IS_BENCHMARK_GROUP = True
SCORE_MODULE = "nemo_skills.dataset.contextasr-bench.contextasr_score"

BENCHMARKS = {
    "contextasr-bench.contextless": {},
    "contextasr-bench.coarse": {},
    "contextasr-bench.fine": {},
}
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""ContextASR-Bench coarse mode: domain label provided as context."""

METRICS_TYPE = "contextasr"
EVAL_ARGS = "++eval_type=contextasr"
GENERATION_ARGS = "++prompt_format=openai ++enable_audio=true"
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


def compute_score(combined_metrics: dict) -> dict:
    """Aggregate metrics from the three ContextASR-Bench sub-benchmarks.

    Computes weighted averages of WER, NE-WER, NE-FNR across contextless,
    coarse, and fine evaluation modes.
    """
    main_names = ["contextless", "coarse", "fine"]
    benchmarks = {k: v for k, v in combined_metrics.items() if k.split(".")[-1] in main_names}

    if not benchmarks:
        return {}

    first_benchmark = next(iter(benchmarks.values()))
    eval_modes = list(first_benchmark.keys())

    aggregated = {}
    for eval_mode in eval_modes:
        total_entries = 0
        weighted_success = 0.0
        total_gen_seconds = 0
        weighted_tokens = 0.0
        weighted_wer = 0.0
        weighted_ne_wer = 0.0
        weighted_ne_fnr = 0.0
        wer_entries = 0
        ne_wer_entries = 0
        ne_fnr_entries = 0

        for benchmark_data in benchmarks.values():
            if eval_mode not in benchmark_data:
                continue

            metrics = benchmark_data[eval_mode]
            num_entries = metrics["num_entries"]
            if num_entries == 0:
                continue

            total_entries += num_entries
            weighted_success += metrics["success_rate"] * num_entries
            total_gen_seconds += metrics["gen_seconds"]
            weighted_tokens += metrics["avg_tokens"] * num_entries

            if "wer" in metrics:
                weighted_wer += metrics["wer"] * num_entries
                wer_entries += num_entries
            if "ne_wer" in metrics:
                weighted_ne_wer += metrics["ne_wer"] * num_entries
                ne_wer_entries += num_entries
            if "ne_fnr" in metrics:
                weighted_ne_fnr += metrics["ne_fnr"] * num_entries
                ne_fnr_entries += num_entries

        if total_entries == 0:
            continue

        agg = {
            "avg_tokens": int(weighted_tokens / total_entries),
            "gen_seconds": total_gen_seconds,
            "success_rate": weighted_success / total_entries,
            "num_entries": total_entries,
        }

        if wer_entries > 0:
            agg["wer"] = round(weighted_wer / wer_entries, 2)
        if ne_wer_entries > 0:
            agg["ne_wer"] = round(weighted_ne_wer / ne_wer_entries, 2)
        if ne_fnr_entries > 0:
            agg["ne_fnr"] = round(weighted_ne_fnr / ne_fnr_entries, 2)

        aggregated[eval_mode] = agg

    return aggregated
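
The entry-weighted averaging that `compute_score` applies to each metric can be checked on a toy input. The sketch below isolates that one step in a hypothetical `weighted_average` helper (not part of the commit) and feeds it the WER values from the documentation's example output, assuming they are stored as percentages.

```python
def weighted_average(per_mode, metric):
    """Average a metric across sub-benchmarks, weighting each by its num_entries."""
    rows = [m for m in per_mode.values() if metric in m and m["num_entries"] > 0]
    total = sum(m["num_entries"] for m in rows)
    weighted = sum(m[metric] * m["num_entries"] for m in rows)
    return round(weighted / total, 2)


# Illustrative values taken from the example output in the docs.
per_mode = {
    "contextasr-bench.contextless": {"wer": 2.27, "num_entries": 15326},
    "contextasr-bench.coarse": {"wer": 2.17, "num_entries": 15326},
    "contextasr-bench.fine": {"wer": 1.13, "num_entries": 15326},
}
print(weighted_average(per_mode, "wer"))

# With unequal sizes the larger sub-benchmark dominates the result.
uneven = {"a": {"wer": 10.0, "num_entries": 1}, "b": {"wer": 20.0, "num_entries": 3}}
print(weighted_average(uneven, "wer"))
```

Because all three modes cover the same 15,326 samples, the weighted average reduces to a plain mean here. Weighting per-benchmark WER by entry count is also not the same as recomputing a corpus-level WER over the pooled reference words, so the aggregate is best read as a summary figure.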
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""ContextASR-Bench contextless mode: plain transcription without any context."""

METRICS_TYPE = "contextasr"
EVAL_ARGS = "++eval_type=contextasr"
GENERATION_ARGS = "++prompt_format=openai ++enable_audio=true"
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""ContextASR-Bench fine mode: domain label and entity list provided as context."""

METRICS_TYPE = "contextasr"
EVAL_ARGS = "++eval_type=contextasr"
GENERATION_ARGS = "++prompt_format=openai ++enable_audio=true"
