eval kit support v2 #1295
Conversation
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
📝 Walkthrough
Integrates VLMEvalKit into NeMo Skills: adds docs, dataset resolver and eval_kit dataset scaffold, EvalKit and Megatron (mcore) self-contained generation tasks, EvalKitMetrics, pipeline hooks for self-contained tasks and packaging, evaluator updates, and a runtime requirements placeholder.
Sequence Diagram(s)
sequenceDiagram
participant Pipeline as NeMo Pipeline
participant Task as EvalKitGenerationTask
participant VLMDataset as VLMEvalKit (dataset build)
participant Model as MultiModalMCore / VLLMLocal
participant AsyncIO as Async JSONL Writer
participant Evaluator as VLMEvalKit Evaluator
participant Output as Result Files
Pipeline->>Task: initialize(cfg)
Task->>VLMDataset: build_dataset(vlm_dataset)
VLMDataset-->>Task: dataset
Task->>Task: setup_work_directories()
Task->>AsyncIO: start_writer()
loop per sample
Task->>Model: generate(sample)
Model-->>Task: generation
Task->>AsyncIO: write_result(jsonl)
end
Task->>Evaluator: evaluate_results()
Evaluator-->>Task: eval_kit_metrics.json
Task->>Output: write_metrics_and_done()
Output-->>Pipeline: results_complete
sequenceDiagram
participant Pipeline as NeMo Pipeline
participant Task as MegatronMCoreGenerationTask
participant Data as DataLoader / Sharder
participant Prompt as PromptFiller
participant MCore as MultiModalMCore
participant IO as Per-rank JSONL IO
participant Eval as ASR/WER Evaluator
Pipeline->>Task: initialize(cfg)
Task->>MCore: _make_mcore_model(config)
Task->>Data: load_and_shard_data()
opt prompt_config
Task->>Prompt: fill_prompt()
Prompt-->>Task: prompt_template
end
loop samples (per rank)
Task->>MCore: _generate_for_sample(messages/prompt)
MCore-->>Task: raw_output
Task->>IO: write_rank_output()
end
Task->>Task: merge_rank_outputs()
Task->>Eval: _evaluate_results()
Eval-->>Task: wer_metrics
Task->>IO: write_eval_kit_metrics.json + .done
IO-->>Pipeline: completed_with_metrics
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 12
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/eval-kit.md`:
- Around line 238-245: The fenced code block in docs/evaluation/eval-kit.md
showing the results directory tree is missing a language label; change the
opening fence from ``` to ```text so markdownlint recognizes it as a plain-text
block (keep the closing ``` unchanged) to fix the lint error.
- Around line 76-95: Add a concrete, known-good metric example to the eval_kit
docs: insert a short example block showing the exact command (using the existing
run example with server_type=megatron, ++model_type=mcore, ++model_config and
++load_dir) and append an "Expected result" line that names the benchmark
(eval_kit.LibriSpeech_test_clean) and a validated metric (e.g., WER=12.3% on
that checkpoint). Place this example near the Mode 1: Megatron in-process
(mcore) section (around the existing eval command shown) and mirror similar
expected-result additions where the docs reference benchmarks (the areas around
the other noted sections for 118-133 and 247-253) so readers have one concrete
score to verify bring-up.
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 523-535: The early-return for missing generations only returns
global wer/bleu/cer fields, which skips per-field leaderboard metrics when
reference_fields is used; update the missing-generation branches in the
evaluator (where task_type, generation are checked) to detect reference_fields
and populate per-field defaults (e.g. for each suffix in reference_fields emit
wer_<suffix>=1.0 and is_correct_<suffix>=False, and for ASR-PC also populate
wer_c_<suffix>, wer_pc_<suffix>, per_<suffix>=1.0) in addition to the global
fields; make the same change in the other missing-generation block near the
later duplicated section so empty generations count as worst-case failures for
per-field ASR_LEADERBOARD metrics.
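A minimal sketch of the suggested default-filling, assuming `reference_fields` is an iterable of field suffixes and the result dict uses the key patterns named above (the actual evaluator structure is not shown here):

```python
def fill_missing_generation_defaults(result: dict, task_type: str, reference_fields=None) -> dict:
    """Populate worst-case metrics when the model produced an empty generation."""
    # Global worst-case scores (mirrors the existing early-return fields).
    result.update({"wer": 1.0, "cer": 1.0, "bleu": 0.0})
    # Per-field worst-case scores so leaderboard aggregation still sees every field.
    for suffix in reference_fields or []:
        result[f"wer_{suffix}"] = 1.0
        result[f"is_correct_{suffix}"] = False
        if task_type == "ASR-PC":
            result[f"wer_c_{suffix}"] = 1.0
            result[f"wer_pc_{suffix}"] = 1.0
            result[f"per_{suffix}"] = 1.0
    return result
```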
In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 79-86: The flattener in get_metrics()/summarize_results currently
only keeps numeric scalars from eval_kit_results and drops ASR Eval Kit payloads
where the score is inside a string field (e.g., value["result"] is a JSON
string); update the logic that iterates eval_kit_results (the loop over key,
value and nested sub_key, sub_value that writes into agg_dict) to detect when a
value or sub_value is a string containing JSON (specifically the ASR payload),
parse that JSON and extract concrete numeric metrics (e.g., score, wer,
num_entries) and insert them into agg_dict using normalized keys like
"{key}_result_score" or "{key}_{sub_key}_score" so summarize_results and the
written eval_kit_metrics.json preserve the actual numeric Eval Kit scores
instead of only num_entries.
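A sketch of the JSON-unpacking step, assuming the ASR payload is a JSON object serialized into a string field; the helper name and key layout are illustrative, not the module's actual API:

```python
import json


def flatten_eval_kit_value(agg_dict: dict, key: str, value, sub_key: str | None = None) -> None:
    """Write numeric metrics into agg_dict, unpacking JSON-in-string Eval Kit payloads."""
    prefix = f"{key}_{sub_key}" if sub_key else key
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        agg_dict[prefix] = value
        return
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return  # plain string, nothing numeric to keep
        if isinstance(parsed, dict):
            for metric in ("score", "wer", "num_entries"):
                if isinstance(parsed.get(metric), (int, float)):
                    agg_dict[f"{prefix}_{metric}"] = parsed[metric]
```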
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 540-546: The current write-out in eval_kit.py turns non-dict
tabular eval_result into {"result": str(eval_result)}, losing structured
metrics; update the block handling eval_result in the function that writes
eval_kit_metrics.json so that when eval_result is a pandas.DataFrame or any
DataFrame-like (e.g., pyarrow table) you convert it to a JSON-serializable
structure (e.g., DataFrame.to_dict(orient="records") or to_dict() for a
single-row metrics map) and for numpy/scalar types normalize to native Python
types before json.dump; keep the output key as a dict (e.g., metrics_data) and
only fall back to str(eval_result) for truly non-serializable objects so
EvalKitMetrics can consume machine-readable aggregates.
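One possible normalization helper, duck-typed so pandas stays an optional dependency; the `metrics_data` key and the fallback key are assumptions from this comment, not existing code:

```python
def to_serializable(eval_result) -> dict:
    """Convert an Eval Kit return value into something json.dump can handle."""
    if isinstance(eval_result, dict):
        return eval_result
    if hasattr(eval_result, "to_dict"):  # pandas DataFrame / Series, or similar tabular objects
        try:
            return {"metrics_data": eval_result.to_dict(orient="records")}
        except TypeError:
            return {"metrics_data": eval_result.to_dict()}
    if hasattr(eval_result, "item"):  # numpy scalar -> native Python type
        return {"result": eval_result.item()}
    return {"result": str(eval_result)}  # last resort for truly non-serializable objects
```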
- Around line 95-97: The skip_filled parameter is accepted but ignored; either
enforce it or implement resume behavior: update _start_async_writer() to check
the skip_filled flag and, when True, avoid deleting or truncating an existing
output_file and instead open it for appending/skip already-processed entries
(implement any required logic to detect/skippable records), or if you choose not
to support resume for this task, validate skip_filled early (e.g., in VLMEvalKit
initializer or the config validation path) and raise a clear exception
(ValueError) when skip_filled is True; apply the same change/validation to the
other occurrence referenced around the block at lines 328-331 so the flag is not
silently ignored.
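If resume support is not implemented, a fail-fast check along these lines would stop the flag from being silently ignored (the function and message wording are placeholders for whatever the task config actually exposes):

```python
def validate_skip_filled(skip_filled: bool) -> None:
    """Reject an unsupported resume flag instead of dropping it silently."""
    if skip_filled:
        raise ValueError(
            "skip_filled=True is not supported by the Eval Kit generation task: "
            "the async writer rewrites the output file from scratch. "
            "Remove the flag or implement append/skip logic in _start_async_writer()."
        )
```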
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 130-135: METRICS_TYPE_OVERRIDE currently forces summarize to
expect eval_kit_metrics.json but _evaluate_results() only computes ASR WER via
asr_wer() and writes {"wer":...}, and generate() swallows exceptions so non-ASR
benchmarks produce missing/incorrect metrics silently; update
_evaluate_results() to branch on the benchmark/type (or a metric_type field) and
compute/write the appropriate metric payload (not only wer) for each supported
eval_kit scenario, ensure asr_wer() is only invoked when the benchmark is ASR,
and remove or rework the broad exception handling in generate()/the 526-527
catch so unexpected failures propagate instead of creating a .done marker; refer
to METRICS_TYPE_OVERRIDE, _evaluate_results(), asr_wer(), and generate() when
making the changes.
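A dispatch-table sketch of the branching idea; `asr_wer` is the VLMEvalKit helper already used in the diff, while the metric-type key and any non-ASR entries are assumptions to be filled in per supported benchmark:

```python
METRIC_DISPATCH = {
    # asr_wer is the same VLMEvalKit function the current code calls unconditionally.
    "asr": lambda results: {"wer": asr_wer(results)},
    # Add one entry per supported eval_kit benchmark type (translation, QA, ...).
}


def evaluate_results(results, metric_type: str) -> dict:
    """Compute the metric payload for the benchmark type instead of always running WER."""
    try:
        evaluate = METRIC_DISPATCH[metric_type]
    except KeyError:
        # Propagate instead of writing a .done marker with bogus metrics.
        raise ValueError(f"Unsupported eval_kit metric type: {metric_type}") from None
    return evaluate(results)
```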
- Around line 267-330: The _build_mcore_messages function currently buffers all
text in text_parts and appends one combined text at the end, which reorders and
collapses interleaved media/text; change it to preserve inline ordering by
emitting entries into mcore as you iterate messages and content (instead of
accumulating into text_parts): whenever you encounter a text string or a content
item with type "text", append a {"type":"text","value":...} to mcore immediately
(optionally merging only consecutive text fragments), and for image/audio items
resolve paths via _resolve_path and append their {"type":"image"/"sound",...}
entries in the same spot; remove the final combined_text join/append and ensure
use of the existing symbols messages, mcore, text_parts (or drop text_parts) and
_resolve_path to locate the changes.
In `@nemo_skills/pipeline/eval.py`:
- Around line 448-459: The code currently calls generation task classes'
get_extra_package_dirs() with no environment context, so env-only cluster YAML
vars (from pipeline_utils.get_env_variables(cluster_config) stored in env_vars)
aren't visible; modify the loop over benchmarks_dict to pass env_vars into
get_extra_package_dirs when supported: import inspect, check the signature of
task_cls.get_extra_package_dirs and if it accepts a parameter call
task_cls.get_extra_package_dirs(env_vars), otherwise call it with no args as a
fallback, preserve the seen_pkg_dirs logic and LOG.info usage, and ensure
extra_pkg_dirs still becomes None when empty; reference generation_task_class,
get_extra_package_dirs, pipeline_utils.get_env_variables, env_vars and
benchmarks_dict in your change.
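A small helper capturing the signature-check idea; it assumes `get_extra_package_dirs` is either zero-argument (current form) or accepts the env-vars mapping:

```python
import inspect


def collect_extra_package_dirs(task_cls, env_vars: dict) -> list:
    """Call get_extra_package_dirs with env context when the task class supports it."""
    getter = task_cls.get_extra_package_dirs
    if inspect.signature(getter).parameters:
        return getter(env_vars)  # newer, env-aware signature
    return getter()  # legacy zero-argument signature
```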
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 448-452: The current batching only forces separate jobs for
self-contained tasks (has_self_contained) but still allows mixing different
generation task classes (e.g., eval_kit.* in vllm mode vs plain GenerationTask)
into the same job which causes _apply_task_overrides() to pick one env/container
for the whole batch; update the logic that sets num_jobs (or the batching step)
to either (a) enforce one job per distinct generation task class present in the
batch (treat each unique task class as its own group) or (b) add a validation
step before batching that inspects the task classes in each proposed batch and
raises/adjusts when they do not all agree on runtime overrides (so
_apply_task_overrides() will not be applied to a mixed-class batch). Reference
has_self_contained, num_jobs, total_evals and
nemo_skills/pipeline/eval.py::_apply_task_overrides() when making the change.
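Option (a) could be as simple as grouping by task class before batching; this sketch assumes each benchmark entry exposes `generation_task_class`, as referenced above:

```python
from collections import defaultdict


def group_benchmarks_by_task_class(benchmarks_dict: dict) -> dict:
    """Return {task_class: [benchmark names]} so each job batch holds one task class only."""
    groups = defaultdict(list)
    for name, benchmark in benchmarks_dict.items():
        groups[benchmark.generation_task_class].append(name)
    return dict(groups)  # schedule one job (or job group) per key
```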
- Around line 406-410: When detecting a self-contained task (inside the block
that checks task_cls and hasattr(..., "is_self_contained")), fail fast if
server_parameters["server_gpus"] is falsy instead of leaving ba.num_gpus unset:
after setting ba.self_contained_task = True, check
server_parameters["server_gpus"] and if it is None/0/False raise a clear
exception (e.g., ValueError) describing that a self-contained task requires a
GPU allocation; otherwise assign ba.num_gpus = server_parameters["server_gpus"]
so num_gpus is always set for self-contained tasks.
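A sketch of the fail-fast check, extracted into a helper so it is testable; the names are taken from this comment and stand in for the real objects:

```python
def ensure_self_contained_gpus(task_cls, server_parameters: dict) -> int:
    """Return the GPU count for a self-contained task, failing fast when it is missing."""
    server_gpus = server_parameters.get("server_gpus")
    if not server_gpus:
        raise ValueError(
            f"Self-contained task {task_cls.__name__} requires a GPU allocation; "
            "set server_gpus for this benchmark."
        )
    return server_gpus
```

The caller would assign `ba.num_gpus = ensure_self_contained_gpus(task_cls, server_parameters)` right after setting `ba.self_contained_task = True`.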
In `@nemo_skills/pipeline/utils/generation.py`:
- Around line 465-467: The code currently omits ++input_file when input_file is
None which can silently allow callers to produce generation commands with
neither input_file nor input_dir; update the generation command builder (the
block that sets common_args and uses input_file/input_dir) to keep the
conditional inclusion of ++input_file but add explicit validation: if the job is
not self-contained (e.g., a parameter/self_contained flag) and both input_file
and input_dir are None, raise a clear ValueError/ConfigurationError; reference
the variables common_args, input_file, input_dir and the surrounding generation
function so the check runs before building common_args.
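A possible validation guard, run before `common_args` is assembled; the `self_contained` flag name is an assumption standing in for however the builder detects self-contained jobs:

```python
def validate_generation_inputs(input_file, input_dir, self_contained: bool) -> None:
    """Reject generation commands that would end up with no data source at all."""
    if not self_contained and input_file is None and input_dir is None:
        raise ValueError(
            "Generation command needs either ++input_file or ++input_dir; "
            "got neither, and the task is not self-contained."
        )
```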
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a9e490c0-062e-44ab-80b0-e7bb8dcfc02c
📒 Files selected for processing (16)
docs/evaluation/eval-kit.md
docs/evaluation/index.md
nemo_skills/dataset/eval_kit/__init__.py
nemo_skills/dataset/utils.py
nemo_skills/evaluation/evaluator/audio.py
nemo_skills/evaluation/metrics/eval_kit_metrics.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/evaluation/metrics/translation_metrics.py
nemo_skills/inference/eval/eval_kit.py
nemo_skills/inference/factory.py
nemo_skills/inference/generate.py
nemo_skills/inference/mcore_skills.py
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/eval.py
nemo_skills/pipeline/utils/generation.py
requirements/eval-kit.txt
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Actionable comments posted: 2
♻️ Duplicate comments (1)
nemo_skills/pipeline/eval.py (1)
448-459: ⚠️ Potential issue | 🟠 Major: Resolve extra package dirs from cluster-config env vars too.
`env_vars = pipeline_utils.get_env_variables(cluster_config)` is already available above, but `get_extra_package_dirs()` is still called without that context. If `NEMO_SKILLS_VLMEVALKIT_PATH` is defined only in the cluster YAML, packaging will miss the Eval Kit sources and the submitted job won't ship them.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/eval.py` around lines 448 - 459, The loop that builds extra_pkg_dirs calls task_cls.get_extra_package_dirs() without passing env_vars (from pipeline_utils.get_env_variables(cluster_config)), which misses dirs declared via cluster-config env vars; update the code in the extra_pkg_dirs construction to call get_extra_package_dirs with env_vars when available: attempt task_cls.get_extra_package_dirs(env_vars) and fall back to task_cls.get_extra_package_dirs() (e.g., catch TypeError) so benchmarks_dict -> ba.generation_task_class and its get_extra_package_dirs can resolve cluster-config paths like NEMO_SKILLS_VLMEVALKIT_PATH.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 51-54: The current logic only swaps the first "python -m " to
"torchrun ..." because combined_cmd.replace(..., 1) targets a single occurrence,
leaving later module launches unwrapped; update the transformation so every
"python -m " in combined_cmd is prefixed with "torchrun --nproc_per_node
{job_num_gpus} -m" when any task class has USE_TORCHRUN and job_num_gpus > 1
(i.e., remove the count-limited replace and perform a global replacement or
regex substitution of the token), ensuring this change is applied where
combined_cmd is built and referencing combined_cmd, task_classes, USE_TORCHRUN,
and job_num_gpus.
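A global substitution sketch for the torchrun wrapping; regex-based so every module launch in the combined command gets the prefix:

```python
import re


def wrap_python_launches_with_torchrun(combined_cmd: str, job_num_gpus: int) -> str:
    """Prefix every `python -m` launch in the command string with a torchrun invocation."""
    return re.sub(
        r"\bpython -m ",
        f"torchrun --nproc_per_node {job_num_gpus} -m ",
        combined_cmd,
    )
```

The caller would apply this only when some task class sets USE_TORCHRUN and `job_num_gpus > 1`, matching the condition described above.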
- Around line 57-61: The current loop silently falls back to the default
container when a task class sets CONTAINER_KEY that is not present in
cluster_config["containers"]; change the logic so that if a task class defines
key = getattr(tc, "CONTAINER_KEY", None) and key is truthy you must index
cluster_config["containers"] directly and raise a clear error when the key is
missing (e.g., if key not in cluster_config["containers"]: raise KeyError or
ValueError with a message including the missing key and available container
keys), otherwise set container = cluster_config["containers"][key]; this removes
the silent fallback to "nemo-skills" and surfaces misconfiguration.
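A sketch of the strict lookup; the default container key "nemo-skills" is taken from this comment, and the loop shape is illustrative:

```python
def resolve_container(task_classes, cluster_config: dict) -> str:
    """Pick the job container, raising instead of silently falling back on unknown keys."""
    containers = cluster_config["containers"]
    container = containers["nemo-skills"]  # default when no task class overrides it
    for tc in task_classes:
        key = getattr(tc, "CONTAINER_KEY", None)
        if not key:
            continue
        if key not in containers:
            raise ValueError(
                f"{tc.__name__} requests container key '{key}', which is not defined in "
                f"cluster_config['containers'] (available: {sorted(containers)})."
            )
        container = containers[key]
    return container
```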
---
Duplicate comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 448-459: The loop that builds extra_pkg_dirs calls
task_cls.get_extra_package_dirs() without passing env_vars (from
pipeline_utils.get_env_variables(cluster_config)), which misses dirs declared
via cluster-config env vars; update the code in the extra_pkg_dirs construction
to call get_extra_package_dirs with env_vars when available: attempt
task_cls.get_extra_package_dirs(env_vars) and fall back to
task_cls.get_extra_package_dirs() (e.g., catch TypeError) so benchmarks_dict ->
ba.generation_task_class and its get_extra_package_dirs can resolve
cluster-config paths like NEMO_SKILLS_VLMEVALKIT_PATH.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b1d043dc-51e8-4595-b471-cbdf663baccf
📒 Files selected for processing (1)
nemo_skills/pipeline/eval.py
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
🧹 Nitpick comments (1)
nemo_skills/pipeline/utils/eval.py (1)
118-162: Complex conditional logic for input_file resolution. The nested conditionals for determining `input_file` and `check_path` are difficult to follow. Consider extracting this into a dedicated helper function with clearer variable names, which would improve readability and testability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 118 - 162, Extract the nested input file resolution logic into a helper function (e.g., resolve_input_file) that takes cluster_config, data_path, local_data_path, benchmark, split, is_on_cluster, data_dir, and skip_input_file and returns input_file and check_path; preserve the existing use of pipeline_utils.is_mounted_filepath, pipeline_utils.get_unmounted_path, and pipeline_utils.cluster_path_exists and keep the same fallback behavior for mounted vs unmounted paths and data_dir overrides, converting unmounted_path to str as before and keeping the same ValueError messages when Path(check_path).exists() or cluster_path_exists checks fail; replace the in-place block in eval.py with a call to this new function to improve readability and testability.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 118-162: Extract the nested input file resolution logic into a
helper function (e.g., resolve_input_file) that takes cluster_config, data_path,
local_data_path, benchmark, split, is_on_cluster, data_dir, and skip_input_file
and returns input_file and check_path; preserve the existing use of
pipeline_utils.is_mounted_filepath, pipeline_utils.get_unmounted_path, and
pipeline_utils.cluster_path_exists and keep the same fallback behavior for
mounted vs unmounted paths and data_dir overrides, converting unmounted_path to
str as before and keeping the same ValueError messages when
Path(check_path).exists() or cluster_path_exists checks fail; replace the
in-place block in eval.py with a call to this new function to improve
readability and testability.
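A possible shape for that helper (signature only; the body would absorb the mounted/unmounted branching currently inlined in the eval pipeline, and the parameter names are taken from the prompt above):

```python
def resolve_input_file(
    cluster_config: dict,
    data_path: str,
    benchmark: str,
    split: str,
    data_dir,
    is_on_cluster: bool,
    skip_input_file: bool,
):
    """Return (input_file, check_path) for one benchmark, or (None, None) when skipped."""
    if skip_input_file:
        return None, None
    raise NotImplementedError("extraction target for the nested conditionals described above")
```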
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 93452798-c58c-4601-ac7c-7efe62ef8d30
📒 Files selected for processing (3)
docs/evaluation/eval-kit.md
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/eval.py
sorry, things are super busy, but I will try to review this next week
[like] George Zelenfroynd reacted to your message.
melllinia
left a comment
Overall looks good. Can you please address the comments? Since there are core changes, let's also wait for a review from @Kipok or @gwarmstrong. Thank you!
for item in content:
    if isinstance(item, dict):
        if item.get("type") == "text" and "text" in item:
            text_parts.append(item["text"].strip())
Text parts are accumulated and appended at the end, after all media items. This can destroy text/media ordering ("Describe [image1] vs [image2]" becomes [image1, image2, "Describe vs"]).
Suggestion: emit entries into the mcore list in-order during iteration, rather than buffering all text for a final append.
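A sketch of what in-order emission could look like; the media key names (`item["image"]`, `item["audio"]`) and the `"sound"` type string are assumptions based on the review discussion, and `resolve_path` stands in for the existing `_resolve_path` helper:

```python
def build_mcore_entries(content: list, resolve_path) -> list:
    """Emit mcore entries in the same order the user interleaved text and media."""
    mcore = []
    for item in content:
        if not isinstance(item, dict):
            continue
        if item.get("type") == "text" and "text" in item:
            mcore.append({"type": "text", "value": item["text"].strip()})
        elif item.get("type") == "image":
            mcore.append({"type": "image", "value": resolve_path(item["image"])})
        elif item.get("type") == "audio":
            mcore.append({"type": "sound", "value": resolve_path(item["audio"])})
    return mcore
```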
fout.write(json.dumps(entry) + "\n")

# Compute WER using VLMEvalKit (same function as eval_kit path)
wer_score = asr_wer(results)
This always calls asr_wer() regardless of the benchmark type. For non-ASR datasets, this may produce meaningless metrics. Can we filter based on dataset_name or a metric_type config field?
generation = strip_helpful_prefixes(generation)

# Normalise AudioBench speech-translation task types (ST-EN-ZH -> Translation)
_ASR_TYPES = {"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"}
ASR_LEADERBOARD is referenced in _ASR_TYPES, but the asr-leaderboard dataset preparation has been updated to use task_type="ASR" because evaluation for the two types was identical. Can we consider safely removing the ASR_LEADERBOARD references to avoid confusion, please?
    effective_extra_args = extra_arguments
elif hasattr(generation_task, "configure_client_overrides"):
    # rsplit to handle URLs like http://host:port (takes last colon)
    host, port = (job_server_address or "localhost:5000").rsplit(":", 1)
For a URL like http://host:5000, rsplit(":", 1) produces host = "http://host", which could cause malformed URLs downstream. Can we use urllib.parse.urlparse for proper parsing here?
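A parsing sketch using the standard library; the `//` guard keeps bare `host:port` strings working, since urlparse otherwise treats `localhost` as a scheme:

```python
from urllib.parse import urlparse


def split_server_address(address: str | None, default: str = "localhost:5000") -> tuple:
    """Split a server address into (host, port), tolerating scheme-prefixed URLs."""
    raw = address or default
    parsed = urlparse(raw if "://" in raw else f"//{raw}")
    return parsed.hostname, parsed.port
```

For `http://host:5000` this yields `("host", 5000)`, and for `localhost:5000` it yields `("localhost", 5000)`.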
@@ -0,0 +1,45 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Can you please update the copyright from 2025 to 2026 for newly created files?
@melllinia on the way + let's have a quick chat offline
Added the ability to run Eval Kit natively in NeMo Skills: running benchmarks from EK in NeMo Skills through its path, and NS benchmarks through the EK path.
Summary by CodeRabbit
New Features
Documentation