Add LibriSpeechMix benchmark with ASR and SA-ASR support #1399

vmendelev wants to merge 3 commits into
Conversation
📝 Walkthrough

Adds comprehensive LibriSpeechMix dataset support with 12 benchmark variants across 2 modes (ASR and speaker-attributed ASR), 2 splits, and 3 audio mix configurations. Includes data preparation with audio file handling and mixing, permutation-invariant evaluation for overlapped speech, WER scoring, and supporting infrastructure.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant OpenSLR as OpenSLR (LibriSpeech)
    participant Manifest as Manifest Handler
    participant AudioMix as Audio Mixer
    participant JSONL as JSONL Builder
    User->>PrepareScript: Call prepare (mode, split, mix)
    PrepareScript->>OpenSLR: Download/validate FLAC splits
    OpenSLR-->>PrepareScript: FLAC files
    PrepareScript->>Manifest: Load mix manifest (SHA256 check)
    Manifest-->>PrepareScript: Tracks & delays per utterance
    PrepareScript->>AudioMix: Convert FLAC→WAV, cache
    AudioMix-->>PrepareScript: Cached WAV files
    PrepareScript->>AudioMix: Mix tracks with offsets
    AudioMix-->>PrepareScript: Mixed WAV file
    PrepareScript->>JSONL: Build record (prompt, answer, audio path, duration)
    JSONL-->>PrepareScript: JSONL line with metadata
    PrepareScript->>User: Write JSONL + materialized audio
```
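For orientation, a minimal, self-contained sketch of the "mix tracks with offsets" step in the flow above, assuming all tracks share one sample rate; the function name and structure here are hypothetical and not taken from prepare.py:

```python
# Hypothetical sketch of overlapping mono tracks at given delays; not the prepare.py code.
import numpy as np
import soundfile as sf

def mix_tracks(wav_paths, delays_sec, out_path):
    tracks = []
    sample_rate = None
    for path, delay in zip(wav_paths, delays_sec, strict=True):
        audio, sr = sf.read(path, dtype="float32")
        if sample_rate is None:
            sample_rate = sr
        assert sr == sample_rate, "all tracks must share one sample rate"
        tracks.append((audio, round(delay * sr)))  # delay converted to frames

    # Output length must cover the latest-ending track.
    total = max(offset + len(audio) for audio, offset in tracks)
    mixed = np.zeros(total, dtype=np.float32)
    for audio, offset in tracks:
        mixed[offset : offset + len(audio)] += audio  # simple additive overlap

    sf.write(out_path, mixed, sample_rate)
```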
```mermaid
sequenceDiagram
    participant Evaluator as evaluate_librispeechmix_asr()
    participant Preprocessor as Normalize & Split
    participant Aligner as Permutation Aligner
    participant WERCalc as WER Calculator
    Evaluator->>Preprocessor: references (list), hypothesis (str)
    Preprocessor->>Preprocessor: Normalize text, split into streams
    Preprocessor-->>Aligner: normalized hypothesis streams, ref streams
    Aligner->>Aligner: Try all hypothesis permutations
    Aligner->>WERCalc: Best alignment pair
    WERCalc->>WERCalc: Count subs/ins/dels/ref_words
    WERCalc-->>Aligner: WER component counts
    Aligner-->>Evaluator: aligned refs, preds, WER metrics
    Evaluator->>Evaluator: Map to reference_streams, predicted_streams
    Evaluator-->>Evaluator: Return dict with WER & streams
```
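To make the permutation-invariant step in this flow concrete, here is a simplified, hedged sketch; it is not the evaluate_librispeechmix_asr() code, and a plain word-level edit distance stands in for the project's WER calculator:

```python
# Simplified permutation-invariant WER: try every assignment of hypothesis
# streams to reference streams and keep the one with the fewest word errors.
from itertools import permutations

def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Return (edit_distance, num_reference_words) via word-level Levenshtein."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1], len(r)

def permutation_invariant_wer(references: list[str], hypotheses: list[str]) -> float:
    # Pad so both sides have the same number of streams.
    n = max(len(references), len(hypotheses))
    refs = references + [""] * (n - len(references))
    hyps = hypotheses + [""] * (n - len(hypotheses))
    ref_words = sum(len(r.split()) for r in refs)
    best_errors = None
    for perm in permutations(hyps):
        errors = sum(word_errors(r, h)[0] for r, h in zip(refs, perm))
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / max(ref_words, 1)
```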
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
🧪 Generate unit tests (beta)
Actionable comments posted: 2
🧹 Nitpick comments (8)
nemo_skills/dataset/librispeechmix/asr-dev-clean-2mix/__init__.py (1)
15-19: Please ensure LibriSpeechMix is covered in Slurm benchmark CI.

Since this adds a new audio modality path with non-trivial ASR/SA-ASR scoring, it’s worth validating this benchmark group in slurm tests for end-to-end regressions.
Based on learnings, "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/librispeechmix/asr-dev-clean-2mix/__init__.py` around lines 15 - 19, Add the new LibriSpeechMix audio benchmark to the Slurm CI test matrix so end-to-end ASR/SA-ASR scoring is validated: update the Slurm benchmark configuration (the list of datasets used by CI) to include LibriSpeechMix and ensure the audio modality and evaluation flags (referencing METRICS_TYPE, EVAL_ARGS, GENERATION_ARGS, EVAL_SPLIT, DEFAULT_SPLIT from the new dataset) are enabled in the test job; also make sure the CI job runs the relevant scoring pipeline for audio/ASR to exercise the non-trivial metrics logic.

nemo_skills/dataset/librispeechmix/librispeechmix_score.py (1)
74-80: Potential fragility: assumes all benchmarks share the same eval_modes structure.

Lines 74-75 extract `eval_modes` from the first benchmark and apply it to all. If benchmarks have inconsistent structures (e.g., different evaluation configurations were run), this could silently skip modes or fail.

Given that all LibriSpeechMix benchmarks are generated from the same template, this is likely safe in practice, but consider adding a defensive check or comment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/librispeechmix/librispeechmix_score.py` around lines 74 - 80, The code assumes eval_modes from first_benchmark applies to all entries (first_benchmark, eval_modes, combined_metrics, grouped) which is fragile; update the logic to defensively verify or unify modes: either compute the union of all benchmark keys (e.g., gather all keys across combined_metrics values into a single eval_modes set) or add an explicit consistency check that every benchmark in combined_metrics has the same set of eval modes and raise/log an informative error if not; ensure you update any downstream usage that references eval_modes to use the new unified/validated value and keep references to first_benchmark and grouped behavior consistent.
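For illustration, a hedged sketch of the defensive check described in this prompt; the combined_metrics shape (benchmark name mapping to a dict keyed by eval mode) is assumed here, not verified against the actual file:

```python
# Sketch: validate that every benchmark exposes the same eval modes, or fail with a clear error.
def collect_eval_modes(combined_metrics: dict[str, dict]) -> set[str]:
    per_benchmark = {name: set(metrics.keys()) for name, metrics in combined_metrics.items()}
    all_modes = set().union(*per_benchmark.values()) if per_benchmark else set()
    inconsistent = {name: modes for name, modes in per_benchmark.items() if modes != all_modes}
    if inconsistent:
        # Either raise, or log and continue with the union so no mode is silently skipped.
        raise ValueError(f"Benchmarks disagree on eval modes: {inconsistent}")
    return all_modes
```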
nemo_skills/dataset/librispeechmix/prepare.py (4)

227-228: Add `strict=True` to zip() for consistency.

Same rationale as above—fail fast if `delays` and `durations` have mismatched lengths.

♻️ Proposed fix

```diff
 def _estimate_mixed_duration(row: dict) -> float:
-    return max(delay + duration for delay, duration in zip(row["delays"], row["durations"]))
+    return max(delay + duration for delay, duration in zip(row["delays"], row["durations"], strict=True))
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/librispeechmix/prepare.py` around lines 227 - 228, Update the _estimate_mixed_duration function to use zip(..., strict=True) when iterating over row["delays"] and row["durations"] so the function fails fast on length mismatches; locate the zip call in _estimate_mixed_duration and add the strict=True argument to the zip invocation.
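For readers unfamiliar with the flag these comments keep proposing, a small illustration of how zip(..., strict=True) (Python 3.10+) turns silent truncation into an immediate error; the values below are made up:

```python
# Toy illustration of zip(strict=True); not data from the actual manifests.
delays = [0.0, 1.2]
durations = [3.4, 2.8, 5.0]  # one extra entry, e.g. from a malformed manifest

# Default zip silently stops at the shorter input:
print(list(zip(delays, durations)))           # [(0.0, 3.4), (1.2, 2.8)]

# strict=True fails fast instead:
try:
    list(zip(delays, durations, strict=True))
except ValueError as err:
    print(err)                                # zip() argument 2 is longer than argument 1
```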
202-216: Add `strict=True` to zip() to catch manifest inconsistencies early.

If `relative_wav_paths` and `delays` have mismatched lengths due to a malformed manifest, the current code would silently truncate. Adding `strict=True` makes this fail fast with a clear error.

♻️ Proposed fix

```diff
-    for relative_wav_path, delay in zip(relative_wav_paths, delays):
+    for relative_wav_path, delay in zip(relative_wav_paths, delays, strict=True):
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/librispeechmix/prepare.py` around lines 202 - 216, The loop over relative_wav_paths and delays silently truncates if the lists differ in length; update the zip call in the prepare routine that iterates those variables to use strict=True (i.e., replace zip(relative_wav_paths, delays) with zip(relative_wav_paths, delays, strict=True)) so mismatched manifest entries raise immediately; ensure this change is applied where tracks are appended (tracks.append((audio, delay_frames))) and that any Python version compatibility expectations are documented if needed.
317-320: Add `strict=True` to zip() for SA-ASR reference building.

This ensures `speaker_profile_index` and `texts` have matching lengths.

♻️ Proposed fix

```diff
 labeled_references = [
     f"speaker_{speaker_idx}: {text}"
-    for speaker_idx, text in sorted(zip(row["speaker_profile_index"], row["texts"]), key=lambda item: item[0])
+    for speaker_idx, text in sorted(zip(row["speaker_profile_index"], row["texts"], strict=True), key=lambda item: item[0])
 ]
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/librispeechmix/prepare.py` around lines 317 - 320, The list comprehension building labeled_references uses zip(row["speaker_profile_index"], row["texts"]) without strict checking; change that zip call to zip(..., strict=True) so mismatched lengths raise immediately. Update the expression that constructs labeled_references (the comprehension referencing speaker_idx/text from zip of row["speaker_profile_index"] and row["texts"]) to pass strict=True to zip to enforce equal lengths.
214: Remove unnecessary `int()` cast.

`round()` already returns an integer in Python 3 when called with a single argument (i.e., with `ndigits` omitted), so the `int()` wrapper is redundant here.

♻️ Proposed fix

```diff
-    delay_frames = int(round(delay * current_sample_rate))
+    delay_frames = round(delay * current_sample_rate)
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/librispeechmix/prepare.py` at line 214, Remove the redundant int() cast when computing delay_frames: replace the expression delay_frames = int(round(delay * current_sample_rate)) with delay_frames = round(delay * current_sample_rate) (locate the assignment to delay_frames in prepare.py where delay and current_sample_rate are used) so the result remains an integer without an unnecessary conversion.

nemo_skills/evaluation/evaluator/audio.py (1)
497-545: Add `strict=True` to zip() for speaker index matching.

If `speaker_profile_index` and `references` have mismatched lengths due to data corruption, the current code would silently truncate. Adding `strict=True` ensures a clear error.

♻️ Proposed fix

```diff
 reference_map = {
     f"speaker_{speaker_idx}": preprocess_asr_text(reference, mode=normalization_mode)
-    for speaker_idx, reference in zip(speaker_profile_index, references)
+    for speaker_idx, reference in zip(speaker_profile_index, references, strict=True)
 }
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/audio.py` around lines 497 - 545, In evaluate_librispeechmix_sa_asr update the reference_map creation to use zip(..., strict=True) to ensure speaker_profile_index and references have the same length (so mismatched lengths raise a ValueError instead of silently truncating); locate the dict comprehension that currently calls zip(speaker_profile_index, references) and add the strict=True argument to that zip invocation to enforce matching lengths.

tests/test_librispeechmix.py (1)
75-136: Consider adding this dataset to Slurm tests for comprehensive evaluation.

The test coverage here is good for unit-level validation. Based on learnings, when enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive end-to-end evaluation on a cluster.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_librispeechmix.py` around lines 75 - 136, Add this librispeechmix dataset/test to the Slurm end-to-end test suite so cluster-level evaluation runs it: update the Slurm test matrix to include the test named test_librispeechmix_prepare_records_writes_absolute_paths (which exercises build_output_record, write_benchmark_records and resolve_base_data_dir) and ensure the job provides the required fixtures (audio_prefix/source/raw_librispeech_root, materialize_audio true and tmp data staging) so the mixed audio file is materialized and assertions pass; also include any necessary dataset assets and paths in the Slurm job setup so the test can run non-interactively on the cluster.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 74b098d0-ca7b-4c5b-99f9-a1024f20710a
⛔ Files ignored due to path filters (6)
- `nemo_skills/dataset/librispeechmix/manifests/dev-clean-1mix.jsonl.gz` is excluded by `!**/*.gz`
- `nemo_skills/dataset/librispeechmix/manifests/dev-clean-2mix.jsonl.gz` is excluded by `!**/*.gz`
- `nemo_skills/dataset/librispeechmix/manifests/dev-clean-3mix.jsonl.gz` is excluded by `!**/*.gz`
- `nemo_skills/dataset/librispeechmix/manifests/test-clean-1mix.jsonl.gz` is excluded by `!**/*.gz`
- `nemo_skills/dataset/librispeechmix/manifests/test-clean-2mix.jsonl.gz` is excluded by `!**/*.gz`
- `nemo_skills/dataset/librispeechmix/manifests/test-clean-3mix.jsonl.gz` is excluded by `!**/*.gz`
📒 Files selected for processing (22)
- `MANIFEST.in`
- `docs/evaluation/speech-audio.md`
- `nemo_skills/dataset/librispeechmix/__init__.py`
- `nemo_skills/dataset/librispeechmix/asr-dev-clean-1mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/asr-dev-clean-2mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/asr-dev-clean-3mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/asr-test-clean-1mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/asr-test-clean-2mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/asr-test-clean-3mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/librispeechmix_score.py`
- `nemo_skills/dataset/librispeechmix/prepare.py`
- `nemo_skills/dataset/librispeechmix/sa-asr-dev-clean-1mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/sa-asr-dev-clean-2mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/sa-asr-dev-clean-3mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/sa-asr-test-clean-1mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/sa-asr-test-clean-2mix/__init__.py`
- `nemo_skills/dataset/librispeechmix/sa-asr-test-clean-3mix/__init__.py`
- `nemo_skills/evaluation/evaluator/audio.py`
- `nemo_skills/evaluation/metrics/audio_metrics.py`
- `nemo_skills/pipeline/prepare_data.py`
- `tests/test_external_benchmarks.py`
- `tests/test_librispeechmix.py`
docs/evaluation/speech-audio.md

## LibriSpeechMix

LibriSpeechMix evaluates overlapped transcription and speaker-attributed transcription on mixtures derived from LibriSpeech `dev-clean` and `test-clean`.

### Dataset Location

- Benchmark group is defined in [`nemo_skills/dataset/librispeechmix/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/librispeechmix/__init__.py)
- Official manifests come from [NaoyukiKanda/LibriSpeechMix](https://github.com/NaoyukiKanda/LibriSpeechMix)
- Source speech audio comes from [LibriSpeech OpenSLR-12](https://www.openslr.org/12/)

### Supported Benchmarks

- Overlapped ASR:
  `librispeechmix.asr-dev-clean-1mix`,
  `librispeechmix.asr-dev-clean-2mix`,
  `librispeechmix.asr-dev-clean-3mix`,
  `librispeechmix.asr-test-clean-1mix`,
  `librispeechmix.asr-test-clean-2mix`,
  `librispeechmix.asr-test-clean-3mix`
- Speaker-attributed ASR:
  `librispeechmix.sa-asr-dev-clean-1mix`,
  `librispeechmix.sa-asr-dev-clean-2mix`,
  `librispeechmix.sa-asr-dev-clean-3mix`,
  `librispeechmix.sa-asr-test-clean-1mix`,
  `librispeechmix.sa-asr-test-clean-2mix`,
  `librispeechmix.sa-asr-test-clean-3mix`

### Preparing LibriSpeechMix Data

LibriSpeechMix downloads LibriSpeech `dev-clean` and `test-clean` from OpenSLR, caches source WAV files for speaker profiles, synthesizes mixed WAVs, and writes benchmark JSONL files under your external `--data_dir`.

```bash
ns prepare_data librispeechmix --data_dir=/path/to/data --cluster=<cluster_name>
```

Prepare only specific splits, mixtures, or modes:

```bash
ns prepare_data librispeechmix \
    --data_dir=/path/to/data \
    --splits dev-clean \
    --mixes 2mix 3mix \
    --modes asr sa-asr
```

Override the absolute audio-path prefix embedded in JSONL files:

```bash
ns prepare_data librispeechmix \
    --data_dir=/path/to/data \
    --audio-prefix /dataset/librispeechmix/audio
```

### Evaluation Assumptions

- `1mix` uses standard WER against the single reference transcript.
- `2mix` and `3mix` use permutation-invariant WER over newline-separated hypothesized utterances.
- SA-ASR expects speaker-labeled lines in the format `speaker_<profile_index>: <transcript>`.
- SA-ASR scoring matches hypotheses to the reference `speaker_profile_index` values instead of transcript order.
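To make the SA-ASR assumptions above concrete, a hedged sketch of how speaker-labeled hypothesis lines could be paired with references keyed by profile index; this is illustrative only and not the evaluate_librispeechmix_sa_asr implementation:

```python
# Illustrative sketch: match "speaker_<idx>: <transcript>" hypothesis lines to references.
import re

def split_sa_asr_hypothesis(hypothesis: str) -> dict[str, str]:
    """Collect hypothesis transcripts keyed by their speaker label."""
    result: dict[str, str] = {}
    for line in hypothesis.splitlines():
        match = re.match(r"\s*(speaker_\d+)\s*:\s*(.*)", line)
        if match:
            label, text = match.groups()
            result[label] = text.strip()
    return result

def pair_for_scoring(reference_map: dict[str, str], hypothesis: str) -> list[tuple[str, str]]:
    # reference_map is keyed by speaker profile index, e.g. {"speaker_3": "...", "speaker_7": "..."}.
    hyp_map = split_sa_asr_hypothesis(hypothesis)
    # Score each reference against the hypothesis carrying the same speaker label;
    # a missing label simply counts every reference word as an error.
    return [(ref, hyp_map.get(label, "")) for label, ref in reference_map.items()]
```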
Add explicit LibriSpeechMix evaluation commands and expected tested-model results.
The section documents preparation and scoring assumptions well, but it still needs runnable ns eval examples for LibriSpeechMix and expected result snapshots for tested models.
📝 Suggested doc patch
### Evaluation Assumptions
- `1mix` uses standard WER against the single reference transcript.
- `2mix` and `3mix` use permutation-invariant WER over newline-separated hypothesized utterances.
- SA-ASR expects speaker-labeled lines in the format `speaker_<profile_index>: <transcript>`.
- SA-ASR scoring matches hypotheses to the reference `speaker_profile_index` values instead of transcript order.
+
+### Running LibriSpeechMix Evaluation
+
+```bash
+ns eval \
+ --cluster=<cluster_name> \
+ --output_dir=/workspace/librispeechmix-eval \
+ --benchmarks=librispeechmix \
+ --model=/path/to/model \
+ --server_type=<server_type> \
+ --server_gpus=1 \
+ --data_dir=/path/to/data
+```
+
+Evaluate a specific variant:
+
+```bash
+ns eval \
+ --benchmarks=librispeechmix.sa-asr-test-clean-2mix \
+ --cluster=<cluster_name> \
+ --output_dir=/workspace/librispeechmix-sa-asr-eval \
+ --model=/path/to/model \
+ --server_type=<server_type> \
+ --server_gpus=1 \
+ --data_dir=/path/to/data
+```
+
+### Expected Results (Tested Models)
+
+Add one or more validated `metrics.json` snippets (WER, success_rate, num_entries) from tested model runs so users can sanity-check setup correctness.

As per coding guidelines, "**/{benchmarks,docs}/**/*.{md,py}: When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/evaluation/speech-audio.md` around lines 420 - 479, Add runnable ns eval
examples and expected tested-model result snapshots to the LibriSpeechMix docs:
insert two example commands (a generic benchmark run using
--benchmarks=librispeechmix and a variant for sa-asr like
--benchmarks=librispeechmix.sa-asr-test-clean-2mix) showing typical flags
(--cluster, --output_dir, --model, --server_type, --server_gpus, --data_dir) and
include one or more validated metrics.json snippets (WER, success_rate,
num_entries) from tested runs so users can sanity-check; update the
LibriSpeechMix section (heading LibriSpeechMix and examples under Preparing
LibriSpeechMix Data / Evaluation Assumptions) to include these ns eval examples
and expected results.
nemo_skills/evaluation/metrics/audio_metrics.py

```python
        elif self.wer_total_ref_words > 0:
            agg_metrics["wer"] = round(100.0 * self.wer_total_errors / self.wer_total_ref_words, 2)
```
Expose corpus-level WER in printable metrics for aggregate-only runs.
At Line 304, corpus WER is now correctly computed from totals, but metrics_to_print() still only exposes wer when self.wer_references or self.wer_scores exists. In aggregate-only flows, wer is computed but can be omitted from printed output.
Suggested fix
```diff
diff --git a/nemo_skills/evaluation/metrics/audio_metrics.py b/nemo_skills/evaluation/metrics/audio_metrics.py
@@
-        if self.wer_references or self.wer_scores:
+        if self.wer_references or self.wer_scores or self.wer_total_ref_words > 0:
             base_metrics["wer"] = as_percentage
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/evaluation/metrics/audio_metrics.py` around lines 304 - 305,
metrics_to_print() currently only adds "wer" to agg_metrics when
self.wer_references or self.wer_scores exist, so corpus-level WER computed from
self.wer_total_errors/self.wer_total_ref_words can be omitted in aggregate-only
runs; update metrics_to_print() to check for positive self.wer_total_ref_words
(or self.wer_total_errors) and, if present, add agg_metrics["wer"] = round(100.0
* self.wer_total_errors / self.wer_total_ref_words, 2) so the computed corpus
WER is always exposed even when self.wer_references/self.wer_scores are empty,
leaving existing per-utterance logic unchanged.
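For intuition, a minimal sketch of the corpus-level aggregation this comment describes; the class and attribute names mirror those mentioned in the review, but this is not the actual audio_metrics.py code:

```python
# Hypothetical aggregation sketch: corpus WER from accumulated error/word totals.
class WerTotals:
    def __init__(self) -> None:
        self.wer_total_errors = 0
        self.wer_total_ref_words = 0

    def update(self, errors: int, ref_words: int) -> None:
        self.wer_total_errors += errors
        self.wer_total_ref_words += ref_words

    def corpus_wer(self) -> float | None:
        # Only report WER once at least one reference word has been seen.
        if self.wer_total_ref_words > 0:
            return round(100.0 * self.wer_total_errors / self.wer_total_ref_words, 2)
        return None
```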
Summary
Verification
Summary by CodeRabbit
Release Notes
New Features
Documentation
Improvements