Conversation
Rename fleurs_score.py to audio_score.py with a generalized docstring and remove covost2_score.py (which was byte-identical apart from docstring). Both fleurs and covost2 SCORE_MODULE now point to the shared module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: naymaraq <dkaramyan@nvidia.com>
📝 Walkthrough

This PR unifies audio dataset preparation for the CoVoST2 and FLEURS benchmarks by removing task-specific CLI flags and always generating both ASR and speech-translation outputs in a single preparation run. Configuration constants move from parent modules to subtask-level modules, benchmark groups are registered with shared scoring logic, and both prepare scripts are refactored with dedicated helper functions.

Changes: Audio Benchmark Group Unification
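Since the walkthrough mentions benchmark groups registered with shared scoring logic, here is a minimal sketch of what that registration presumably looks like in each dataset's `__init__.py` — the `SCORE_MODULE` constant comes from the commit message above, but treating it as a dotted module path is an assumption:

```python
# nemo_skills/dataset/covost2/__init__.py (sketch, not the verbatim file)
# Both covost2 and fleurs point group scoring at the same shared module,
# so the aggregation logic touched by this PR lives in exactly one place.
SCORE_MODULE = "nemo_skills.dataset.fleurs.audio_score"
```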
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (4 passed)
Actionable comments posted: 7
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 91617366-5404-439c-8001-aeba9664776c
📒 Files selected for processing (11)
- docs/evaluation/speech-audio.md
- nemo_skills/dataset/covost2/__init__.py
- nemo_skills/dataset/covost2/asr/__init__.py
- nemo_skills/dataset/covost2/prepare.py
- nemo_skills/dataset/covost2/st/__init__.py
- nemo_skills/dataset/fleurs/__init__.py
- nemo_skills/dataset/fleurs/asr/__init__.py
- nemo_skills/dataset/fleurs/audio_score.py
- nemo_skills/dataset/fleurs/prepare.py
- nemo_skills/dataset/fleurs/st/__init__.py
- nemo_skills/pipeline/summarize_results.py
```
<data_dir>/covost2/
  asr/test.jsonl  # one record per (lang, audio) for transcription
  st/test.jsonl   # one record per (src→tgt, audio) for translation
  audio/<lang>/<split>/...wav
```
Add language identifiers to fenced output blocks.
Lines 638 and 676 use unlabeled fenced blocks, which triggers MD040. Mark them as text.
📝 Suggested fix

````diff
-```
+```text
 <data_dir>/covost2/
   asr/test.jsonl  # one record per (lang, audio) for transcription
   st/test.jsonl   # one record per (src→tgt, audio) for translation
   audio/<lang>/<split>/...wav
@@
-```
+```text
 <data_dir>/fleurs/
   asr/test.jsonl  # one record per (locale, audio) for transcription
   st/test.jsonl   # one record per (src→tgt, audio) for translation
   audio/<locale>/...wav
````
Also applies to: 676-681
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 638-638: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/evaluation/speech-audio.md` around lines 638 - 643, The two unlabeled
fenced blocks that start with the directory listings "<data_dir>/covost2/" and
"<data_dir>/fleurs/" should be marked with a language identifier to satisfy
MD040; update each opening triple-backtick for the blocks containing those
listings to use "```text" so the blocks are labeled as text (look for the blocks
that contain "asr/test.jsonl", "st/test.jsonl", and "audio/<lang>..." /
"audio/<locale>...").
```python
_build_record(
    expected_answer=item["translation"],
    instruction=get_st_instruction(tgt_lang),
    container_audio_path=cpath,
    duration=duration,
    subset_for_metrics=tag,
    task_type="AST",
    extra_fields={
```
CoVoST2 ST manifest entries are still labeled AST.
Line 259 uses task_type="AST" in the ST path. This can conflict with the new covost2.st subtask convention and downstream grouping.
🔧 Suggested fix

```diff
-    task_type="AST",
+    task_type="ST",
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/covost2/prepare.py` around lines 253 - 260, The manifest
entries for CoVoST2 are being labeled with task_type="AST"; locate the
_build_record(...) call that currently passes task_type="AST" (the call building
ST records in prepare.py) and change the task_type value to "ST" so CoVoST2
entries follow the ST convention used by downstream grouping and the covost2.st
subtask.
```python
first_benchmark = next(iter(benchmarks.values()))
eval_modes = list(first_benchmark.keys())
```
Aggregate evaluation modes from all sub-benchmarks, not just the first one.
Current logic can drop valid eval modes that exist only in later benchmarks, causing incomplete group scores.
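To see the failure mode, consider a hypothetical `benchmarks` mapping (names and modes invented for illustration) where one eval mode exists only in the second sub-benchmark:

```python
# Hypothetical input shape; real values come from the summarize_results pipeline.
benchmarks = {
    "fleurs.asr": {"greedy": {"num_entries": 10}},
    "fleurs.st": {"greedy": {"num_entries": 5}, "pass@4": {"num_entries": 5}},
}

# Current logic: only the first sub-benchmark's modes survive.
eval_modes = list(next(iter(benchmarks.values())).keys())
assert eval_modes == ["greedy"]  # "pass@4" is silently dropped

# Union over all sub-benchmarks keeps every mode.
eval_modes = sorted({mode for data in benchmarks.values() for mode in data})
assert eval_modes == ["greedy", "pass@4"]
```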
💡 Suggested fix

```diff
-    first_benchmark = next(iter(benchmarks.values()))
-    eval_modes = list(first_benchmark.keys())
+    eval_modes = sorted({mode for benchmark_data in benchmarks.values() for mode in benchmark_data.keys()})
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/fleurs/audio_score.py` around lines 32 - 34, The current
code takes eval_modes from only first_benchmark (first_benchmark =
next(iter(benchmarks.values())); eval_modes = list(first_benchmark.keys())),
which can omit modes present only in other benchmarks; change the logic to
aggregate keys across all benchmarks by iterating over benchmarks.values(),
collecting the union of all sub-benchmark keys (e.g., via a set) and then using
that combined list for eval_modes so group scoring covers every eval mode
defined anywhere in benchmarks.
```python
num_entries = metrics.get("num_entries", 0)
if num_entries == 0:
    continue

total_entries += num_entries
total_gen_seconds += metrics.get("gen_seconds", 0)
weighted_success += metrics.get("success_rate", 0.0) * num_entries
weighted_tokens += metrics.get("avg_tokens", 0.0) * num_entries
weighted_no_answer += metrics.get("no_answer", 0.0) * num_entries
```
Fail fast on missing required metric keys instead of defaulting to zero.
Using defaults here can silently corrupt aggregate values when input schema is incomplete.
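A toy illustration of that corruption (values invented):

```python
metrics = {"num_entries": 100}  # "success_rate" accidentally missing

# With .get(): this benchmark contributes 0.0 success over 100 entries,
# silently dragging the group average down.
weighted = metrics.get("success_rate", 0.0) * metrics["num_entries"]
assert weighted == 0.0

# With direct access the schema gap surfaces immediately as a KeyError.
weighted = metrics["success_rate"] * metrics["num_entries"]  # raises KeyError
```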
💡 Suggested fix

```diff
-num_entries = metrics.get("num_entries", 0)
+num_entries = metrics["num_entries"]
 if num_entries == 0:
     continue
 total_entries += num_entries
-total_gen_seconds += metrics.get("gen_seconds", 0)
-weighted_success += metrics.get("success_rate", 0.0) * num_entries
-weighted_tokens += metrics.get("avg_tokens", 0.0) * num_entries
-weighted_no_answer += metrics.get("no_answer", 0.0) * num_entries
+total_gen_seconds += metrics["gen_seconds"]
+weighted_success += metrics["success_rate"] * num_entries
+weighted_tokens += metrics["avg_tokens"] * num_entries
+weighted_no_answer += metrics["no_answer"] * num_entries
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/fleurs/audio_score.py` around lines 51 - 59, The
aggregation block currently uses metrics.get(...) which silently defaults
missing keys to zero and corrupts totals; change all uses of metrics.get to
direct key access (metrics["num_entries"], metrics["gen_seconds"],
metrics["success_rate"], metrics["avg_tokens"], metrics["no_answer"]) so a
KeyError surfaces immediately for missing required keys, and keep the existing
conditional on num_entries (num_entries = metrics["num_entries"]; if num_entries
== 0: continue) while still updating total_entries, total_gen_seconds,
weighted_success, weighted_tokens, and weighted_no_answer with the directly
accessed values.
```python
for src_locale, tgt_locale in pairs:
    locale_audio_dir = audio_dir / src_locale
    if not no_audio:
        locale_audio_dir.mkdir(parents=True, exist_ok=True)
    target_by_id = get_target_index(tgt_locale)
    tag = f"{src_locale}->{tgt_locale}"
    for source_row in tqdm(load_fleurs(src_locale, split, local_dir=local_dir), desc=tag):
        target_row = target_by_id.get(source_row["id"])
```
Cache source rows per locale in ST collection.
Line 243 reloads full locale data per (src_locale, tgt_locale) pair. For en_us, this repeats many times and significantly increases prep time.
♻️ Suggested fix

```diff
 def _collect_st_records(
@@
 ) -> list[dict]:
     pairs = build_translation_pairs(languages)
     target_cache: dict[str, dict[int, dict]] = {}
+    source_cache: dict[str, list[dict]] = {}
@@
     def get_target_index(locale: str) -> dict[int, dict]:
         if locale not in target_cache:
             target_cache[locale] = index_by_id(load_fleurs(locale, split, local_dir=local_dir))
         return target_cache[locale]
+
+    def get_source_rows(locale: str) -> list[dict]:
+        if locale not in source_cache:
+            source_cache[locale] = load_fleurs(locale, split, local_dir=local_dir)
+        return source_cache[locale]
@@
-        for source_row in tqdm(load_fleurs(src_locale, split, local_dir=local_dir), desc=tag):
+        for source_row in tqdm(get_source_rows(src_locale), desc=tag):
```

As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible".
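As a design note, the same one-load-per-locale behavior could also come from memoization rather than a hand-rolled dict; a sketch assuming `load_fleurs` is deterministic for a fixed `(split, local_dir)` (the factory function here is hypothetical):

```python
from functools import lru_cache

def make_source_loader(split, local_dir, load_fleurs):
    """Build a per-locale loader that calls load_fleurs at most once per locale."""

    @lru_cache(maxsize=None)
    def get_source_rows(locale: str) -> tuple:
        # Return a tuple so the cached rows cannot be mutated in place by callers.
        return tuple(load_fleurs(locale, split, local_dir=local_dir))

    return get_source_rows
```

Either way, en_us is no longer re-read for every (src_locale, tgt_locale) pair.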
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/fleurs/prepare.py` around lines 237 - 244, The loop
currently calls load_fleurs(src_locale, ...) inside the inner (src_locale,
tgt_locale) pair loop, causing repeated reloads for the same src_locale (e.g.,
en_us); instead, load and cache the source rows once per src_locale before
iterating its tgt_locales (e.g., build a dict mapping id->row from load_fleurs
for that src_locale and reuse it), move the audio_dir.mkdir(parents=True,
exist_ok=True) to run once per src_locale, and then inside the inner loop look
up target_row from the cached mapping (use existing helper get_target_index only
for target locale); update references to source_row to use the cached data so
load_fleurs is not re-run for every pair.
```python
target_row = target_by_id.get(source_row["id"])
if target_row is None:
    continue
```
Fail fast instead of silently dropping missing aligned targets.
At Line 244, target_by_id.get(...) + Line 246 continue can silently lose ST samples when alignment breaks. Use direct key access so mismatches fail loudly.
🔧 Suggested fix

```diff
-        target_row = target_by_id.get(source_row["id"])
-        if target_row is None:
-            continue
+        target_row = target_by_id[source_row["id"]]
```

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
target_row = target_by_id[source_row["id"]]
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/fleurs/prepare.py` around lines 244 - 246, The code
silently skips missing aligned targets by using
target_by_id.get(source_row["id"]) and continue; change this to directly access
the mapping (e.g., target_row = target_by_id[source_row["id"]]) so a KeyError is
raised when alignment is broken, or wrap that access in a small try/except that
raises a descriptive error mentioning the source id and file/context; update the
logic in the loop where target_row, target_by_id and source_row["id"] are used
so failures fail fast instead of being silently dropped.
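If a more descriptive failure is preferred over a bare KeyError, the try/except variant the prompt describes might look like this (exception type and message wording are assumptions):

```python
try:
    target_row = target_by_id[source_row["id"]]
except KeyError:
    raise RuntimeError(
        f"Broken FLEURS alignment: source id {source_row['id']!r} "
        f"({src_locale}) has no matching row for target locale {tgt_locale}"
    ) from None
```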
```diff
 _build_record(
     expected_answer=target_row["raw_transcription"],
     instruction=get_st_instruction(tgt_locale),
     container_audio_path=cpath,
     duration=duration,
-    subset_for_metrics=subset_for_metrics,
-    task_type=task_type,
-    extra_fields=extra_fields,
-    source=source_text,
-    reference=reference_text,
+    subset_for_metrics=tag,
+    task_type="AST",
+    extra_fields=extra,
```
ST records are still tagged as AST.
Line 270 sets task_type="AST" even though this PR moves to st subtasks. This can create inconsistent metadata across generated manifests and downstream grouping.
🔧 Suggested fix

```diff
-    task_type="AST",
+    task_type="ST",
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
_build_record(
    expected_answer=target_row["raw_transcription"],
    instruction=get_st_instruction(tgt_locale),
    container_audio_path=cpath,
    duration=duration,
    subset_for_metrics=tag,
    task_type="ST",
    extra_fields=extra,
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/fleurs/prepare.py` around lines 264 - 271, The
_build_record call currently hardcodes task_type="AST" which is inconsistent
with the new subtask naming; change the task_type argument in the _build_record
invocation (the call that passes expected_answer,
instruction=get_st_instruction(tgt_locale), container_audio_path=cpath, etc.) to
use the correct subtask label "st" (or whatever canonical lowercase "st" is used
elsewhere) so ST records are consistently tagged as st across manifests and
downstream grouping.
Summary
Reshape fleurs and covost2 into benchmark groups with asr/ and st/ subtasks, mirroring the contextasr-bench layout. One prepare_data invocation per dataset now produces both subtask manifests in one shot.
Summary by CodeRabbit

Release Notes

New Features

Documentation

Improvements

- Removed the `--task` flag requirement.