Add LibriSpeechMix benchmark with ASR and SA-ASR support#1399

Open
vmendelev wants to merge 3 commits into main from vmendelev/issue-65-librispeechmix

Conversation

@vmendelev (Collaborator) commented Apr 24, 2026

Summary

  • add the LibriSpeechMix benchmark and packaged manifests to NeMo Skills
  • add LibriSpeechMix prepare-data and audio evaluation support for ASR and SA-ASR
  • fix corpus-level WER aggregation and cover the regression with tests
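The corpus-level WER fix matters because averaging per-utterance WERs over-weights short utterances; corpus WER accumulates error and reference-word totals first and divides once. A minimal sketch of that aggregation (illustrative only, not the repository's implementation):

```python
def corpus_wer(utterances: list[tuple[int, int]]) -> float:
    """Compute corpus-level WER from per-utterance (errors, ref_words) pairs.

    Totals are accumulated first and divided once, instead of averaging
    per-utterance WERs (which would over-weight short utterances).
    """
    total_errors = sum(errors for errors, _ in utterances)
    total_ref_words = sum(ref_words for _, ref_words in utterances)
    return round(100.0 * total_errors / total_ref_words, 2)


# One short utterance with 1 of 2 words wrong, one long utterance fully correct:
print(corpus_wer([(1, 2), (0, 98)]))  # 1.0, not the 25.0 a naive average gives
```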

Verification

  • python ./.symphony/bin/run_local_job --venv /home/vmendelev/tmp/gitlab-symphony-v2/workspaces/vmendelev-work-kb65/.symphony/artifacts/venvs/skills65 --cwd /home/vmendelev/tmp/gitlab-symphony-v2/workspaces/vmendelev-work-kb65 -- bash ./p.sh -m pytest Skills/tests/test_librispeechmix.py Skills/tests/test_external_benchmarks.py -q
  • python ./.symphony/bin/run_local_job --venv /home/vmendelev/tmp/gitlab-symphony-v2/workspaces/vmendelev-work-kb65/.symphony/artifacts/venvs/skills65 --cwd /home/vmendelev/tmp/gitlab-symphony-v2/workspaces/vmendelev-work-kb65 -- bash ./p.sh -m nemo_skills.pipeline.cli prepare_data librispeechmix --cluster oci_iad --config_dir /home/vmendelev/workspace/expressiveness/src/ns_eval/cluster_configs --data_dir /lustre/fsw/portfolios/llmservice/users/vmendelev/workspace/data/issue65-librispeechmix --installation_command "pip install soundfile"

Summary by CodeRabbit

Release Notes

  • New Features

    • Added LibriSpeechMix benchmark support with 12 evaluation configurations
    • Support for ASR and speaker-attributed ASR evaluation modes
    • Multi-speaker audio mixing variants (1mix, 2mix, 3mix)
  • Documentation

    • Added comprehensive guide for LibriSpeechMix speech evaluation and preparation
  • Improvements

    • Enhanced audio metrics computation for overlapped and multi-speaker scenarios

coderabbitai bot (Contributor) commented Apr 24, 2026

📝 Walkthrough

Adds comprehensive LibriSpeechMix dataset support with 12 benchmark variants across 2 modes (ASR and speaker-attributed ASR), 2 splits, and 3 audio mix configurations. Includes data preparation with audio file handling and mixing, permutation-invariant evaluation for overlapped speech, WER scoring, and supporting infrastructure.

Changes

  • Manifest & Documentation — MANIFEST.in, docs/evaluation/speech-audio.md: Updated packaging to include .gz files; added comprehensive LibriSpeechMix benchmark documentation with dataset description, evaluation modes, data preparation commands, and hypothesis alignment specifications.
  • Core LibriSpeechMix Package — nemo_skills/dataset/librispeechmix/__init__.py: Introduced benchmark group metadata, declared supported modes (asr, sa-asr), splits, and mixes, and auto-generated 12 benchmark registry entries keyed by librispeechmix.{mode}-{split}-{mix}.
  • Dataset Variant Configurations — nemo_skills/dataset/librispeechmix/asr-*/__init__.py, nemo_skills/dataset/librispeechmix/sa-asr-*/__init__.py: Created 12 dataset variant __init__.py files (one per benchmark) exporting consistent METRICS_TYPE, DEFAULT_SPLIT, EVAL_SPLIT, EVAL_ARGS, and GENERATION_ARGS constants for evaluation and generation configuration.
  • Data Preparation — nemo_skills/dataset/librispeechmix/prepare.py: Implemented a dataset preparation script with OpenSLR LibriSpeech download/validation, manifest loading, FLAC-to-WAV conversion with caching, multi-track audio mixing using manifest delay offsets, and JSONL record generation for ASR and speaker-attributed ASR modes.
  • Scoring & Metrics — nemo_skills/dataset/librispeechmix/librispeechmix_score.py, nemo_skills/evaluation/metrics/audio_metrics.py: Added LibriSpeechMix scoring with per-mode aggregation and WER computation from corpus totals; updated audio metrics to compute aggregated WER from accumulated error/word counts when per-example data is unavailable.
  • Evaluation — nemo_skills/evaluation/evaluator/audio.py: Added evaluate_librispeechmix_asr() with permutation-invariant stream alignment and normalization-aware WER; added evaluate_librispeechmix_sa_asr() for speaker-attributed transcript matching and scoring; integrated both into main evaluator routing.
  • Pipeline Infrastructure — nemo_skills/pipeline/prepare_data.py: Added a helper to inject the NEMO_SKILLS_DATA_DIR environment variable into the prepare-data command invocation when a dataset directory is specified.
  • Tests — tests/test_librispeechmix.py, tests/test_external_benchmarks.py: Added a comprehensive test suite validating benchmark registration, evaluation correctness (permutation-invariant WER, speaker matching), data preparation output (audio materialization, JSONL paths), environment variable handling, and metrics aggregation.
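The auto-generated registry summarized above (12 entries from 2 modes × 2 splits × 3 mixes) can be sketched as follows; the key format matches the summary, but the value shape of the real registry in nemo_skills/dataset/librispeechmix/__init__.py may differ:

```python
from itertools import product

MODES = ("asr", "sa-asr")
SPLITS = ("dev-clean", "test-clean")
MIXES = ("1mix", "2mix", "3mix")

# Each benchmark is keyed librispeechmix.{mode}-{split}-{mix};
# the value dict here is illustrative.
BENCHMARKS = {
    f"librispeechmix.{mode}-{split}-{mix}": {"mode": mode, "split": split, "mix": mix}
    for mode, split, mix in product(MODES, SPLITS, MIXES)
}

print(len(BENCHMARKS))  # 12
print("librispeechmix.sa-asr-test-clean-2mix" in BENCHMARKS)  # True
```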

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant OpenSLR as OpenSLR (LibriSpeech)
    participant Manifest as Manifest Handler
    participant AudioMix as Audio Mixer
    participant JSONL as JSONL Builder

    User->>PrepareScript: Call prepare (mode, split, mix)
    PrepareScript->>OpenSLR: Download/validate FLAC splits
    OpenSLR-->>PrepareScript: FLAC files
    PrepareScript->>Manifest: Load mix manifest (SHA256 check)
    Manifest-->>PrepareScript: Tracks & delays per utterance
    PrepareScript->>AudioMix: Convert FLAC→WAV, cache
    AudioMix-->>PrepareScript: Cached WAV files
    PrepareScript->>AudioMix: Mix tracks with offsets
    AudioMix-->>PrepareScript: Mixed WAV file
    PrepareScript->>JSONL: Build record (prompt, answer, audio path, duration)
    JSONL-->>PrepareScript: JSONL line with metadata
    PrepareScript->>User: Write JSONL + materialized audio
```
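The "Mix tracks with offsets" step in the diagram above amounts to summing each track into an output buffer at its delay offset (in frames). A simplified, dependency-free sketch; the real prepare.py operates on WAV sample arrays and handles sample rates:

```python
def mix_tracks(tracks: list[tuple[list[float], int]]) -> list[float]:
    """Sum tracks into one signal, each offset by its delay in frames.

    `tracks` holds (samples, delay_frames) pairs; the output length is the
    largest delay + track length. Illustrative sketch only.
    """
    total_frames = max(delay + len(samples) for samples, delay in tracks)
    mixed = [0.0] * total_frames
    for samples, delay in tracks:
        for i, sample in enumerate(samples):
            mixed[delay + i] += sample
    return mixed


print(mix_tracks([([0.5, 0.5, 0.5], 0), ([0.25, 0.25], 1)]))
# [0.5, 0.75, 0.75]
```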
```mermaid
sequenceDiagram
    participant Evaluator as evaluate_librispeechmix_asr()
    participant Preprocessor as Normalize & Split
    participant Aligner as Permutation Aligner
    participant WERCalc as WER Calculator

    Evaluator->>Preprocessor: references (list), hypothesis (str)
    Preprocessor->>Preprocessor: Normalize text, split into streams
    Preprocessor-->>Aligner: normalized hypothesis streams, ref streams
    Aligner->>Aligner: Try all hypothesis permutations
    Aligner->>WERCalc: Best alignment pair
    WERCalc->>WERCalc: Count subs/ins/dels/ref_words
    WERCalc-->>Aligner: WER component counts
    Aligner-->>Evaluator: aligned refs, preds, WER metrics
    Evaluator->>Evaluator: Map to reference_streams, predicted_streams
    Evaluator-->>Evaluator: Return dict with WER & streams
```
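The permutation-invariant alignment in the diagram above can be sketched with a plain word-level edit distance and a brute-force search over hypothesis-stream orderings (cheap for 2-3 streams). This is an illustrative reimplementation, not the code in evaluator/audio.py:

```python
from itertools import permutations


def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Levenshtein distance over words; returns (errors, ref_word_count)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, ref_word in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hyp_word in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,                      # deletion
                         cur[j - 1] + 1,                   # insertion
                         prev[j - 1] + (ref_word != hyp_word))  # substitution
        prev = cur
    return prev[-1], len(r)


def permutation_invariant_wer(refs: list[str], hyps: list[str]) -> float:
    """Try every ordering of hypothesis streams against the reference
    streams and keep the one with the fewest total word errors.
    Assumes equal stream counts (the sketch ignores count mismatches)."""
    best_errors, total_ref = None, 0
    for perm in permutations(hyps):
        pairs = [word_errors(r, h) for r, h in zip(refs, perm)]
        errors = sum(e for e, _ in pairs)
        if best_errors is None or errors < best_errors:
            best_errors = errors
            total_ref = sum(n for _, n in pairs)
    return best_errors / total_ref


# Swapped streams still score perfectly:
print(permutation_invariant_wer(["hello world", "good morning"],
                                ["good morning", "hello world"]))  # 0.0
```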

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 48.89%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)

  • Description Check ✅ — Skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ — The title clearly and concisely summarizes the main change: adding the LibriSpeechMix benchmark with support for ASR and SA-ASR evaluation modes.
  • Linked Issues Check ✅ — Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ — Skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (8)
nemo_skills/dataset/librispeechmix/asr-dev-clean-2mix/__init__.py (1)

15-19: Please ensure LibriSpeechMix is covered in Slurm benchmark CI.

Since this adds a new audio modality path with non-trivial ASR/SA-ASR scoring, it’s worth validating this benchmark group in slurm tests for end-to-end regressions.

Based on learnings, "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation".

nemo_skills/dataset/librispeechmix/librispeechmix_score.py (1)

74-80: Potential fragility: assumes all benchmarks share the same eval_modes structure.

Lines 74-75 extract eval_modes from the first benchmark and apply it to all. If benchmarks have inconsistent structures (e.g., different evaluation configurations were run), this could silently skip modes or fail.

Given that all LibriSpeechMix benchmarks are generated from the same template, this is likely safe in practice, but consider adding a defensive check or comment.

nemo_skills/dataset/librispeechmix/prepare.py (4)

227-228: Add strict=True to zip() for consistency.

Same rationale as above—fail fast if delays and durations have mismatched lengths.

♻️ Proposed fix
 def _estimate_mixed_duration(row: dict) -> float:
-    return max(delay + duration for delay, duration in zip(row["delays"], row["durations"]))
+    return max(delay + duration for delay, duration in zip(row["delays"], row["durations"], strict=True))

202-216: Add strict=True to zip() to catch manifest inconsistencies early.

If relative_wav_paths and delays have mismatched lengths due to a malformed manifest, the current code would silently truncate. Adding strict=True makes this fail fast with a clear error.

♻️ Proposed fix
-    for relative_wav_path, delay in zip(relative_wav_paths, delays):
+    for relative_wav_path, delay in zip(relative_wav_paths, delays, strict=True):

317-320: Add strict=True to zip() for SA-ASR reference building.

This ensures speaker_profile_index and texts have matching lengths.

♻️ Proposed fix
         labeled_references = [
             f"speaker_{speaker_idx}: {text}"
-            for speaker_idx, text in sorted(zip(row["speaker_profile_index"], row["texts"]), key=lambda item: item[0])
+            for speaker_idx, text in sorted(zip(row["speaker_profile_index"], row["texts"], strict=True), key=lambda item: item[0])
         ]

214-214: Remove unnecessary int() cast.

round() already returns an integer in Python 3 when called with a single argument (ndigits omitted). The int() wrapper is redundant. (Note that round(x, 0) on a float would still return a float, but no ndigits is passed here.)

♻️ Proposed fix
-        delay_frames = int(round(delay * current_sample_rate))
+        delay_frames = round(delay * current_sample_rate)
nemo_skills/evaluation/evaluator/audio.py (1)

497-545: Add strict=True to zip() for speaker index matching.

If speaker_profile_index and references have mismatched lengths due to data corruption, the current code would silently truncate. Adding strict=True ensures a clear error.

♻️ Proposed fix
     reference_map = {
         f"speaker_{speaker_idx}": preprocess_asr_text(reference, mode=normalization_mode)
-        for speaker_idx, reference in zip(speaker_profile_index, references)
+        for speaker_idx, reference in zip(speaker_profile_index, references, strict=True)
     }
tests/test_librispeechmix.py (1)

75-136: Consider adding this dataset to Slurm tests for comprehensive evaluation.

The test coverage here is good for unit-level validation. Based on learnings, when enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive end-to-end evaluation on a cluster.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 74b098d0-ca7b-4c5b-99f9-a1024f20710a

📥 Commits

Reviewing files that changed from the base of the PR and between c231992 and 0e136ba.

⛔ Files ignored due to path filters (6)
  • nemo_skills/dataset/librispeechmix/manifests/dev-clean-1mix.jsonl.gz is excluded by !**/*.gz
  • nemo_skills/dataset/librispeechmix/manifests/dev-clean-2mix.jsonl.gz is excluded by !**/*.gz
  • nemo_skills/dataset/librispeechmix/manifests/dev-clean-3mix.jsonl.gz is excluded by !**/*.gz
  • nemo_skills/dataset/librispeechmix/manifests/test-clean-1mix.jsonl.gz is excluded by !**/*.gz
  • nemo_skills/dataset/librispeechmix/manifests/test-clean-2mix.jsonl.gz is excluded by !**/*.gz
  • nemo_skills/dataset/librispeechmix/manifests/test-clean-3mix.jsonl.gz is excluded by !**/*.gz
📒 Files selected for processing (22)
  • MANIFEST.in
  • docs/evaluation/speech-audio.md
  • nemo_skills/dataset/librispeechmix/__init__.py
  • nemo_skills/dataset/librispeechmix/asr-dev-clean-1mix/__init__.py
  • nemo_skills/dataset/librispeechmix/asr-dev-clean-2mix/__init__.py
  • nemo_skills/dataset/librispeechmix/asr-dev-clean-3mix/__init__.py
  • nemo_skills/dataset/librispeechmix/asr-test-clean-1mix/__init__.py
  • nemo_skills/dataset/librispeechmix/asr-test-clean-2mix/__init__.py
  • nemo_skills/dataset/librispeechmix/asr-test-clean-3mix/__init__.py
  • nemo_skills/dataset/librispeechmix/librispeechmix_score.py
  • nemo_skills/dataset/librispeechmix/prepare.py
  • nemo_skills/dataset/librispeechmix/sa-asr-dev-clean-1mix/__init__.py
  • nemo_skills/dataset/librispeechmix/sa-asr-dev-clean-2mix/__init__.py
  • nemo_skills/dataset/librispeechmix/sa-asr-dev-clean-3mix/__init__.py
  • nemo_skills/dataset/librispeechmix/sa-asr-test-clean-1mix/__init__.py
  • nemo_skills/dataset/librispeechmix/sa-asr-test-clean-2mix/__init__.py
  • nemo_skills/dataset/librispeechmix/sa-asr-test-clean-3mix/__init__.py
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/evaluation/metrics/audio_metrics.py
  • nemo_skills/pipeline/prepare_data.py
  • tests/test_external_benchmarks.py
  • tests/test_librispeechmix.py

Comment on lines +420 to +479
## LibriSpeechMix

LibriSpeechMix evaluates overlapped transcription and speaker-attributed transcription on mixtures derived from LibriSpeech `dev-clean` and `test-clean`.

### Dataset Location

- Benchmark group is defined in [`nemo_skills/dataset/librispeechmix/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/librispeechmix/__init__.py)
- Official manifests come from [NaoyukiKanda/LibriSpeechMix](https://github.com/NaoyukiKanda/LibriSpeechMix)
- Source speech audio comes from [LibriSpeech OpenSLR-12](https://www.openslr.org/12/)

### Supported Benchmarks

- Overlapped ASR:
`librispeechmix.asr-dev-clean-1mix`,
`librispeechmix.asr-dev-clean-2mix`,
`librispeechmix.asr-dev-clean-3mix`,
`librispeechmix.asr-test-clean-1mix`,
`librispeechmix.asr-test-clean-2mix`,
`librispeechmix.asr-test-clean-3mix`
- Speaker-attributed ASR:
`librispeechmix.sa-asr-dev-clean-1mix`,
`librispeechmix.sa-asr-dev-clean-2mix`,
`librispeechmix.sa-asr-dev-clean-3mix`,
`librispeechmix.sa-asr-test-clean-1mix`,
`librispeechmix.sa-asr-test-clean-2mix`,
`librispeechmix.sa-asr-test-clean-3mix`

### Preparing LibriSpeechMix Data

LibriSpeechMix downloads LibriSpeech `dev-clean` and `test-clean` from OpenSLR, caches source WAV files for speaker profiles, synthesizes mixed WAVs, and writes benchmark JSONL files under your external `--data_dir`.

```bash
ns prepare_data librispeechmix --data_dir=/path/to/data --cluster=<cluster_name>
```

Prepare only specific splits, mixtures, or modes:

```bash
ns prepare_data librispeechmix \
--data_dir=/path/to/data \
--splits dev-clean \
--mixes 2mix 3mix \
--modes asr sa-asr
```

Override the absolute audio-path prefix embedded in JSONL files:

```bash
ns prepare_data librispeechmix \
--data_dir=/path/to/data \
--audio-prefix /dataset/librispeechmix/audio
```

### Evaluation Assumptions

- `1mix` uses standard WER against the single reference transcript.
- `2mix` and `3mix` use permutation-invariant WER over newline-separated hypothesized utterances.
- SA-ASR expects speaker-labeled lines in the format `speaker_<profile_index>: <transcript>`.
- SA-ASR scoring matches hypotheses to the reference `speaker_profile_index` values instead of transcript order.
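The speaker-index matching described in the assumptions above can be sketched as parsing hypothesis lines into a label-to-transcript map and pairing against references by label rather than by line order. The names below are illustrative, not the actual evaluator API:

```python
import re

# Matches lines of the form "speaker_<profile_index>: <transcript>".
LINE_RE = re.compile(r"^(speaker_\d+):\s*(.*)$")


def parse_speaker_lines(text: str) -> dict[str, str]:
    """Parse 'speaker_<idx>: transcript' lines into {label: transcript}."""
    out: dict[str, str] = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            label, transcript = m.groups()
            # Concatenate if the same speaker appears on multiple lines.
            out[label] = (out.get(label, "") + " " + transcript).strip()
    return out


# Hypothetical references keyed by speaker_profile_index:
refs = {"speaker_0": "hello world", "speaker_3": "good morning"}
hyp = parse_speaker_lines("speaker_3: good morning\nspeaker_0: hello world")

# Hypotheses are paired to references by speaker label, not by line order;
# a missing label simply scores against an empty hypothesis.
paired = {label: (refs[label], hyp.get(label, "")) for label in refs}
```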


⚠️ Potential issue | 🟡 Minor

Add explicit LibriSpeechMix evaluation commands and expected tested-model results.

The section documents preparation and scoring assumptions well, but it still needs runnable ns eval examples for LibriSpeechMix and expected result snapshots for tested models.

📝 Suggested doc patch
 ### Evaluation Assumptions
 
 - `1mix` uses standard WER against the single reference transcript.
 - `2mix` and `3mix` use permutation-invariant WER over newline-separated hypothesized utterances.
 - SA-ASR expects speaker-labeled lines in the format `speaker_<profile_index>: <transcript>`.
 - SA-ASR scoring matches hypotheses to the reference `speaker_profile_index` values instead of transcript order.
+
+### Running LibriSpeechMix Evaluation
+
+```bash
+ns eval \
+  --cluster=<cluster_name> \
+  --output_dir=/workspace/librispeechmix-eval \
+  --benchmarks=librispeechmix \
+  --model=/path/to/model \
+  --server_type=<server_type> \
+  --server_gpus=1 \
+  --data_dir=/path/to/data
+```
+
+Evaluate a specific variant:
+
+```bash
+ns eval \
+  --benchmarks=librispeechmix.sa-asr-test-clean-2mix \
+  --cluster=<cluster_name> \
+  --output_dir=/workspace/librispeechmix-sa-asr-eval \
+  --model=/path/to/model \
+  --server_type=<server_type> \
+  --server_gpus=1 \
+  --data_dir=/path/to/data
+```
+
+### Expected Results (Tested Models)
+
+Add one or more validated `metrics.json` snippets (WER, success_rate, num_entries) from tested model runs so users can sanity-check setup correctness.

As per coding guidelines, "**/{benchmarks,docs}/**/*.{md,py}: When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models".


Comment on lines +304 to +305
elif self.wer_total_ref_words > 0:
agg_metrics["wer"] = round(100.0 * self.wer_total_errors / self.wer_total_ref_words, 2)

⚠️ Potential issue | 🟡 Minor

Expose corpus-level WER in printable metrics for aggregate-only runs.

At Line 304, corpus WER is now correctly computed from totals, but metrics_to_print() still only exposes wer when self.wer_references or self.wer_scores exists. In aggregate-only flows, wer is computed but can be omitted from printed output.

Suggested fix
diff --git a/nemo_skills/evaluation/metrics/audio_metrics.py b/nemo_skills/evaluation/metrics/audio_metrics.py
@@
-        if self.wer_references or self.wer_scores:
+        if self.wer_references or self.wer_scores or self.wer_total_ref_words > 0:
             base_metrics["wer"] = as_percentage
