Adding an option to store benchmarks in external repo by Kipok · Pull Request #1240 · NVIDIA-NeMo/Skills

Kipok · 2026-02-12T23:39:54Z

The main motivation for this PR is to enable registering external benchmarks, so that people can use nemo-skills with custom private eval repositories without having to maintain a whole fork. This required some restructuring, which resulted in the following changes:

- Added an option to specify a benchmark as a full path to the dataset folder. Also added logic to register a "benchmark_map.json" that maps short names to full paths, which simplifies referencing external benchmarks.
- Added fairly extensive tests and documentation for the new logic.
- Added an option to pass evaluators as file::class, similar to what was already supported for metrics.
- Made the summarize results command from eval specify the metrics type directly. This way it works inside the container without having to reference the init from external benchmarks, which might be in a different path.
- Added a REQUIRES_DATA_DIR option directly to init. It was previously hardcoded, which made it hard to maintain.
- Committed the BFCL init.py files explicitly. They weren't dynamic, so having them in the repo makes things simpler and more explicit.
- Removed the previous logic to register extra datasets, which only worked with "simple" benchmarks that didn't require custom evaluation / generation / metrics.
- Removed DATASET_GROUP from inits: we didn't really use it much, it made little sense, it was inconsistent, and it makes things hard with external benchmarks.
- Changed the base generate API to add a prompt_format argument. This isn't strictly needed for this PR, but it made an example in the docs much simpler. Previously it was hard to switch from our prompt format in the first turn to the openai format in the second turn; with this argument, we can override the prompt format and reuse a lot more code.
- Removed the hardcoded packaging logic for datasets with subfolders; packaging now picks up all jsonl files automatically.
- Made a pythonic interface for launching / stopping the server. Again, not strictly needed, but it made the new tests I added much faster since we reuse the server in them (otherwise we would have to re-spin it 12 times). This will also make test_eval.py much better in the future if we add it there, since it will allow us to run judge benchmarks, but I haven't made that change yet.
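A minimal sketch of the three ways a benchmark can be referenced after this change (built-in name, short name from `benchmark_map.json`, or full path to the dataset folder). The function name `resolve_benchmark`, the `BUILTIN` set, and the lookup order are assumptions made for illustration; the real logic lives in `nemo_skills/dataset/utils.py` and may differ.

```python
from pathlib import Path

# Hypothetical stand-in for the benchmarks shipped inside nemo_skills/dataset.
BUILTIN = {"gsm8k", "aime24"}


def resolve_benchmark(name: str, benchmark_map: dict[str, str]) -> tuple[str, str]:
    """Classify a benchmark reference and return (kind, location)."""
    if name in BUILTIN:
        return "builtin", name
    if name in benchmark_map:
        # Short name registered in benchmark_map.json -> full path.
        return "map", benchmark_map[name]
    if Path(name).is_dir():
        # Full path to the dataset folder given directly.
        return "path", str(Path(name).resolve())
    raise ValueError(
        f"Unknown benchmark {name!r}: not built-in, not in the benchmark map, "
        "and not an existing directory"
    )
```

Raising on an unresolvable name, rather than falling back silently, keeps typos in `--benchmarks` from turning into confusing downstream failures.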

Summary by CodeRabbit

New Features
- External/custom benchmark support: add, prepare, package, and evaluate benchmarks outside the repo; server launch/stop utilities for evaluations.
- prompt_format passthrough to customize prompt formatting across generation and evaluation tasks.
Improvements
- Simplified dataset resolution and data-dir/container handling for prepare/package/eval flows.
- More robust CLI behavior for prepare/eval/summarize and improved packaging of external datasets.
Documentation
- Added guide for defining and integrating external benchmarks.
Tests
- New end-to-end integration tests for external benchmark prepare + eval.

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps · 2026-02-12T23:39:58Z

Too many files changed for review. (180 files found, 100 file limit)

coderabbitai · 2026-02-12T23:55:22Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Wide refactor to dataset initialization and preparation: removes many DATASET_GROUP exports, introduces REQUIRES_DATA_DIR/HAS_DYNAMIC_INIT flags, modularizes RULER/RULER2 prepare flows, adds external-benchmark support (loading, packaging, tests), enhances dataset resolution and evaluator dispatch, and threads prompt_format through generation/eval call paths.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docs & config: `docs/evaluation/custom-benchmarks.md`, `docs/evaluation/index.md`, `mkdocs.yml`, `.gitignore` | Add documentation for external/custom benchmarks, update navigation, un-ignore a dummy external benchmark file and consolidate bfcl ignore patterns. |
| Dataset flag removals & additions: `nemo_skills/dataset/.../__init__.py` (many files), `nemo_skills/dataset/ruler/__init__.py`, `nemo_skills/dataset/ruler2/__init__.py` | Remove `DATASET_GROUP` from many dataset packages; add/replace with `REQUIRES_DATA_DIR=True` and/or `HAS_DYNAMIC_INIT=True` in select packages; minor formatting tweaks. |
| BFCL subpackages: `nemo_skills/dataset/bfcl_v3/.../__init__.py`, `nemo_skills/dataset/bfcl_v4/.../__init__.py`, `nemo_skills/dataset/bfcl_v3/prepare.py` | Add `METRICS_TYPE`, `GENERATION_ARGS`, `GENERATION_MODULE` constants across many bfcl subpackages; remove `DEFAULT_SETTINGS` and stop auto-writing `__init__.py` in bfcl_v3 prepare. |
| RULER modularization: `nemo_skills/dataset/ruler/prepare.py`, `.../prepare_common.py`, `.../prepare_data.py`, `.../prepare_init.py` | Split monolithic RULER prepare into `prepare_common`, `prepare_data`, `prepare_init`; top-level `prepare.py` now delegates to those scripts. |
| RULER2 modularization: `nemo_skills/dataset/ruler2/prepare.py`, `.../prepare_common.py`, `.../prepare_data.py`, `.../prepare_init.py` | Same modularization for RULER2: task logic moved into `prepare_data`/`prepare_init`, `prepare.py` reduced to a launcher. |
| Dataset resolution & utils: `nemo_skills/dataset/utils.py` | New resolution APIs: `get_dataset_path`, `get_extra_benchmark_map`, `_load_external_dataset`, and `get_dataset_module` updated to support builtin, map-key, and path-based datasets; removed prior cluster/path helpers. |
| Dataset prepare entrypoint & CLI: `nemo_skills/dataset/prepare.py`, `nemo_skills/pipeline/prepare_data.py` | Add CLI parsing helper, change `prepare_datasets` to accept `prepare_entrypoint`, refactor pipeline prepare flow to handle external datasets, dynamic init vs non-split flows, container path mapping, and data_dir copying. |
| Packaging & packager helpers: `nemo_skills/pipeline/utils/packager.py`, `nemo_skills/pipeline/utils/eval.py`, `nemo_skills/pipeline/utils/__init__.py` | Add `resolve_external_data_path`, allow `ignore_if_registered` in repo registration, improve include patterns to capture generated JSONL from git roots, and re-export `resolve_external_data_path`. |
| Evaluation dispatch & metrics: `nemo_skills/evaluation/evaluator/__init__.py`, `nemo_skills/evaluation/metrics/compute_metrics.py` | Add `_resolve_eval_type` to support module::Class and file::Class path formats, dynamic loading/dispatch, enhanced error messages; simplify `ComputeMetrics` init to infer metric type from dataset module. |
| Pipeline, remove extra-dataset plumbing: `nemo_skills/pipeline/eval.py`, `nemo_skills/pipeline/summarize_results.py` | Remove `extra_datasets`/`extra_datasets_type` and related data_dir plumbing from eval and summarize flows; drop `ExtraDatasetType` usage. |
| Generation & prompt_format plumbing: `nemo_skills/inference/generate.py`, many `nemo_skills/inference/`, `recipes/` | Add optional `prompt_format` parameter to `fill_prompt` and `process_single_datapoint` signatures across many generation/eval tasks and thread it through where applicable. |
| Server orchestration: `nemo_skills/pipeline/start_server.py` | Add `launch_server` and `stop_server` helpers; refactor `start_server` to centralize server/tunnel lifecycle and cleanup. |
| External benchmark fixtures & tests: `tests/data/dummy_external_benchmark/**`, `tests/gpu-tests/test_external_benchmark_eval.py`, `tests/test_external_benchmarks.py` | Add a dummy external benchmark repo (datasets, prepare scripts, evaluator, metrics, prompts, benchmark_map), extensive unit and GPU/container integration tests validating external benchmark prepare, packaging, resolution, and evaluation flows. |
| Tests & small updates: `tests/test_datasets.py`, `tests/test_configs.py`, `tests/gpu-tests/run_qwen.sh`, many tests/runner scripts | Remove DATASET_GROUP validation test, tweak mock configs, update prepare_data call sites to accept explicit `datasets` list and validate `wrap_arguments` behavior; add GPU test invocation. |

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer/CLI
    participant Prepare as Pipeline prepare
    participant Utils as Dataset Utils / Packager
    participant Container as Container / Job
    participant Eval as Evaluation runtime
    participant Metrics as ComputeMetrics

    Dev->>Prepare: run prepare (dataset path or map key)
    Prepare->>Utils: resolve dataset via get_dataset_path (path | map key | builtin)
    alt external dataset
        Utils->>Utils: register_external_repo / resolve_external_data_path
        Prepare->>Container: package & mount external repo
    end
    Prepare->>Container: invoke dataset prepare entrypoint (prepare_init/prepare_data)
    Container->>Prepare: produce prepared JSONL (test.jsonl)
    Dev->>Eval: run eval CLI (benchmark)
    Eval->>Utils: import dataset module (reads METRICS_TYPE / GENERATION_MODULE)
    Eval->>Eval: _resolve_eval_type -> load evaluator (module::Class or file::Class)
    Eval->>Metrics: instantiate and run metrics evaluator
    Metrics-->>Dev: write metrics.json (results)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Add RULERv2 #1106 — overlaps RULER / ruler2 prepare scripts and dataset-init modularization.
BFCLv4 support #908 — overlaps BFCL dataset additions and prepare logic for bfcl_v3/bfcl_v4.
Moving evaluation inside generation class and enforcing empty generations when remove_thinking=True #958 — related changes to evaluator dispatch/registration and dynamic evaluator resolution.

Suggested labels

enhancement, run GPU tests

Suggested reviewers

gwarmstrong
activatedgeek

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.46% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Adding an option to store benchmarks in external repo' accurately summarizes the main change: enabling external benchmark repositories. It is clear, specific, and directly reflects the primary objective of the PR. |
| Merge Conflict Detection | ✅ Passed | ✅ No merge conflicts detected when merging into `main` |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_skills/inference/generate.py (1)
583-591: ⚠️ Potential issue | 🔴 Critical

Breaking change: ArenaJudge will crash when calling super().process_single_datapoint().

ArenaJudge.process_single_datapoint calls super().process_single_datapoint(gen_base_data, all_data) at lines 152-153. This invokes the base class's process_single_datapoint, which then calls self.fill_prompt(data_point, all_data, prompt_format) as a positional call (line 692). Since self is an ArenaJudge instance, this attempts to pass 3 positional arguments to ArenaJudge.fill_prompt, which only accepts 2 parameters. This will raise TypeError: fill_prompt() takes 3 positional arguments but 4 were given.

To fix this, apply both of the following changes:

1. Pass prompt_format as a keyword argument in the base class:
-"prompt": self.fill_prompt(data_point, all_data, prompt_format),
+"prompt": self.fill_prompt(data_point, all_data, prompt_format=prompt_format),
2. Update ArenaJudge.fill_prompt to accept the new parameter:
-def fill_prompt(self, data_point, data):
+def fill_prompt(self, data_point, data, prompt_format=None):
Both changes are needed for full compatibility.
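The failure mode can be reproduced in isolation. The sketch below uses hypothetical stand-in classes (`BaseTask`, `OldJudge`, `FixedJudge`), not the real generation task or `ArenaJudge`; it only illustrates why an override with the old signature breaks once the base class starts forwarding `prompt_format`.

```python
class BaseTask:
    def fill_prompt(self, data_point, data, prompt_format=None):
        return {"q": data_point, "format": prompt_format}

    def process_single_datapoint(self, data_point, all_data, prompt_format=None):
        # The base class now forwards prompt_format to whatever fill_prompt
        # the instance provides, including subclass overrides.
        return self.fill_prompt(data_point, all_data, prompt_format=prompt_format)


class OldJudge(BaseTask):
    # Override written before prompt_format existed: it cannot receive the new argument.
    def fill_prompt(self, data_point, data):
        return {"q": data_point}


class FixedJudge(BaseTask):
    # Override updated to accept and forward the new parameter.
    def fill_prompt(self, data_point, data, prompt_format=None):
        return super().fill_prompt(data_point, data, prompt_format=prompt_format)


def call_fails(task: BaseTask) -> bool:
    """Return True if processing a datapoint raises TypeError."""
    try:
        task.process_single_datapoint("x", [])
    except TypeError:
        return True
    return False
```

Here `call_fails(OldJudge())` is true and `call_fails(FixedJudge())` is false, which is why every override of `fill_prompt` has to be updated alongside the base class.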

🤖 Fix all issues with AI agents

In `@docs/evaluation/custom-benchmarks.md`:
- Around line 198-208: The example calls LOG.info in the generate function but
never defines LOG; add the logger setup by importing logging and get_logger_name
(from nemo_skills.utils) and creating LOG =
logging.getLogger(get_logger_name(__file__)); ensure these imports and the LOG
definition appear near the top of the example so LOG is available when
generate(cfg: WordCountGenerationConfig) calls LOG.info.
- Around line 264-277: In the update method, fix the NameError by passing the
correct variable name to _compute_pass_at_k: replace the incorrect
predicted_answer with the defined predicted_answers; specifically, in the
update(self, predictions) body ensure the call to _compute_pass_at_k uses
predicted_answers (the list defined from predictions) so it matches the later
_compute_majority_at_k call and the local variable name.

In `@nemo_skills/dataset/bfcl_v3/parallel/__init__.py`:
- Around line 1-3: The file is missing the standard NVIDIA Apache 2.0 license
header; add the exact repo-standard copyright/license header comment block to
the top of this __init__.py (and any other new bfcl_v3 __init__.py files) above
the existing constants METRICS_TYPE, GENERATION_ARGS, and GENERATION_MODULE so
the file matches other files in the PR.

In `@nemo_skills/dataset/prepare.py`:
- Around line 36-49: The CLI flag --retries currently defaults to 0 while the
prepare_datasets function signature defaults to 3, causing inconsistent
behavior; pick one canonical default and make both match (either change the
parser add_argument call for "--retries" to default=3 or change the
prepare_datasets signature to retries=0) so callers get the same retry behavior
whether invoked from the CLI or as a library function; update the parser
add_argument("--retries", ...) and/or the prepare_datasets(...) retries
parameter accordingly.
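One way to keep the two defaults from drifting apart is to define the value once and reference it from both places. This is a sketch only: `DEFAULT_RETRIES`, its value, and the placeholder function body are assumptions, not the code in `nemo_skills/dataset/prepare.py`.

```python
import argparse

DEFAULT_RETRIES = 3  # single source of truth; the value itself is an assumption


def prepare_datasets(datasets: list[str], retries: int = DEFAULT_RETRIES) -> dict:
    # Placeholder body: report what would be used instead of preparing anything.
    return {"datasets": datasets, "retries": retries}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("datasets", nargs="*")
    # Same constant as the function signature, so CLI and library callers agree.
    parser.add_argument("--retries", type=int, default=DEFAULT_RETRIES)
    return parser
```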

In `@nemo_skills/dataset/ruler/prepare_common.py`:
- Around line 33-39: The message MISSING_RULER_ARGS_MESSAGE currently begins
with "ERROR:" but parse_args_and_prepare_args exits with SystemExit(0),
producing a success status while signaling an error; fix by making them
consistent: either remove the "ERROR:" prefix from MISSING_RULER_ARGS_MESSAGE if
skipping is intentional, or change the exit in parse_args_and_prepare_args to
SystemExit(1) to indicate failure; locate and update the symbols
MISSING_RULER_ARGS_MESSAGE and the exit call in parse_args_and_prepare_args
(also consider how prepare.py's subprocess `check=True` will interpret the
chosen exit code) so the log text and exit code match the intended behavior.

In `@nemo_skills/dataset/ruler/prepare_data.py`:
- Around line 49-57: The git-LFS check in the "if 'cwe' in tasks" block
currently only catches subprocess.CalledProcessError but will crash with a
FileNotFoundError when git is not installed; update the exception handling
around the subprocess.run(["git", "lfs", "--version"]) call in prepare_data.py
to catch both subprocess.CalledProcessError and FileNotFoundError (or a broad
OSError) and then print the existing friendly message and exit(1) so missing git
or missing git-lfs both produce the same helpful output.
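A minimal sketch of the suggested handling, using a hypothetical helper name `command_works`. It treats "binary ran but failed" and "binary not found" the same way, which is the behavior the comment asks for.

```python
import subprocess


def command_works(argv: list[str]) -> bool:
    """Return True if argv runs and exits with status 0."""
    try:
        subprocess.run(argv, check=True, capture_output=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # CalledProcessError: the binary ran but failed (e.g. git without git-lfs).
        # FileNotFoundError: the binary itself is missing (e.g. git not installed).
        return False
    return True
```

The call site would then be something like `if not command_works(["git", "lfs", "--version"]): ...`, printing the existing friendly message before exiting with status 1.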

In `@nemo_skills/dataset/ruler2/prepare_data.py`:
- Around line 25-438: Many near-duplicate functions (e.g.,
prepare_mk_niah_basic, prepare_mk_niah_easy, prepare_mv_niah_basic,
prepare_qa_hard, etc.) repeat subprocess.run logic with only module name
(prepare_niah, prepare_mmlu, prepare_qa) and a few flag values changing; replace
them with one generic runner (e.g., run_prepare_task) that accepts a task key
and common params (output_folder, tokenizer_type, tokenizer_path, length,
dataset_size) and use a declarative dict mapping task keys to module name plus
per-task flag/value dicts; the runner should build the argv list by starting
with ["python","-m", module] then iterating the flag dict to append "--flag",
str(value), call subprocess.run(..., check=True), and replace all prepare_*
callers with a single call to run_prepare_task(task_key, ...).
- Around line 479-491: The current main block uses parse_known_args and assigns
unknown to `_`, silently discarding unrecognized CLI flags; change this to
either call parser.parse_args() to let argparse reject unknown args, or keep
parse_known_args but immediately check the returned unknown list and raise a
clear error (or call parser.error()) if it's non-empty; update the block around
build_prepare_parser, parse_known_args, and the call site before prepare_dataset
to perform this validation so typos/unsupported flags are not ignored.
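The declarative approach from the first item above could look roughly like this. The module names come from the review comment; the task keys, flag names, and flag values in `TASKS` are illustrative placeholders, not the real RULER2 configuration.

```python
import subprocess

# Hypothetical task table: each entry names the module to run and its task-specific flags.
TASKS = {
    "mk_niah_basic": {"module": "prepare_niah", "flags": {"num_needle_k": 1}},
    "qa_hard": {"module": "prepare_qa", "flags": {"difficulty": "hard"}},
}


def build_argv(task_key: str, **common) -> list[str]:
    """Build the command line for one task from the table plus shared parameters."""
    if task_key not in TASKS:
        raise ValueError(f"Unknown task {task_key!r}. Available: {sorted(TASKS)}")
    spec = TASKS[task_key]
    argv = ["python", "-m", spec["module"]]
    # Shared parameters first, then per-task flags (which win on a name clash).
    for flag, value in {**common, **spec["flags"]}.items():
        argv += [f"--{flag}", str(value)]
    return argv


def run_prepare_task(task_key: str, **common) -> None:
    subprocess.run(build_argv(task_key, **common), check=True)
```

Keeping `build_argv` separate from the `subprocess.run` call makes the table easy to unit test without launching any process.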

In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 174-177: The debug print in the class-instantiation branch should
be removed or converted to a proper logger call; replace the line
`print(f"evaluator: {evaluator}")` with either nothing or `LOG.debug("evaluator:
%s", evaluator)` (or equivalent) inside the block where `is_class` is true after
`evaluator = obj(eval_config)` so you don't emit noisy stdout during
`evaluator.eval_full()` runs.

In `@nemo_skills/pipeline/eval.py`:
- Around line 745-746: The current f-string will inject the literal "None" when
both metric_type and benchmark_args.metrics_type are None; compute an effective
value (e.g., effective_metric_type = metric_type or benchmark_args.metrics_type)
and only append the "--metric_type=..." flag to the command when
effective_metric_type is truthy, so summarize_results never receives the string
"None" as a metric type.
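A small sketch of the conditional flag. The `--metric_type` flag name is taken from the comment above; the function name and the `summarize_results` placeholder command are assumptions for illustration.

```python
from typing import Optional


def build_summarize_command(
    results_dir: str,
    metric_type: Optional[str] = None,
    benchmark_metrics_type: Optional[str] = None,
) -> str:
    effective_metric_type = metric_type or benchmark_metrics_type
    parts = ["summarize_results", results_dir]  # placeholder command name
    if effective_metric_type:
        # Only emit the flag when there is a real value, never the string "None".
        parts.append(f"--metric_type={effective_metric_type}")
    return " ".join(parts)
```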

In `@nemo_skills/pipeline/utils/packager.py`:
- Around line 191-194: The loop that builds include_patterns iterates over
dataset_dir.rglob("*.jsonl") and currently calls
include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) inside that
loop, creating duplicate entries; move the append call outside the for f in
dataset_dir.rglob("*.jsonl") loop (or append once conditionally after detecting
at least one JSONL) so include_pattern_relative_paths only adds
str(nemo_skills_dir.parent) a single time when dataset files exist; update the
block around dataset_dir, include_patterns and include_pattern_relative_paths in
packager.py accordingly.

🧹 Nitpick comments (17)

nemo_skills/inference/generate.py (1)
583-585: Add type hints for the new prompt_format parameters.

Per coding guidelines, simple types should have type hints.
-    def fill_prompt(self, data_point, data, prompt_format=None):
+    def fill_prompt(self, data_point, data, prompt_format: str | None = None):
-    async def process_single_datapoint(self, data_point, all_data, prompt_format=None):
+    async def process_single_datapoint(self, data_point, all_data, prompt_format: str | None = None):
As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code"

Also applies to: 680-681
nemo_skills/dataset/ruler/prepare_common.py (1)
89-94: Add return type hint.

Per project guidelines, use type hints for simple types.
Suggested fix
-def parse_args_and_prepare_args(parser: argparse.ArgumentParser):
+def parse_args_and_prepare_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, str]:
nemo_skills/dataset/ruler2/prepare_common.py (1)
75-77: Add return type hint.

Same as the ruler counterpart — add a return type hint for consistency.
Suggested fix
-def parse_known_args(parser: argparse.ArgumentParser):
+def parse_known_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, list[str]]:
nemo_skills/dataset/ruler/prepare_data.py (2)
60-60: Pass a string instead of a single-element list when using shell=True.

When shell=True, passing a list is misleading — the first element becomes the shell command string, and subsequent elements become arguments to the shell itself (not the command). Use a plain string here.
Suggested fix
-    subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
+    subprocess.run("pip install wonderwords html2text tenacity", check=True, shell=True)
62-69: Manual __enter__/__exit__ on TemporaryDirectory is fragile.

Calling dunder methods directly bypasses the context manager protocol's guarantees. Consider restructuring so the conditional temp directory uses a proper context manager or a simpler pattern:
Suggested refactor
-    if tmp_data_dir is not None:
-        tmpdirname = tmp_data_dir
-        Path(tmpdirname).mkdir(parents=True, exist_ok=True)
-        tmpdir_context = None
-    else:
-        tmpdir_context = tempfile.TemporaryDirectory()
-        tmpdirname = tmpdir_context.__enter__()
-
-    try:
+    with tempfile.TemporaryDirectory() as _tmpdir:
+        if tmp_data_dir is not None:
+            tmpdirname = tmp_data_dir
+            Path(tmpdirname).mkdir(parents=True, exist_ok=True)
+        else:
+            tmpdirname = _tmpdir
+
This removes the manual __enter__/__exit__ and the try/finally block entirely. The unused TemporaryDirectory is cheap when tmp_data_dir is provided.
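If creating an unused temporary directory is undesirable, `contextlib.ExitStack` gives the same cleanup guarantee while only creating the directory when it is needed. This is an alternative sketch, not the refactor proposed above; `run_in_workdir` and the marker file are hypothetical.

```python
import contextlib
import tempfile
from pathlib import Path
from typing import Optional


def run_in_workdir(tmp_data_dir: Optional[str] = None) -> str:
    with contextlib.ExitStack() as stack:
        if tmp_data_dir is not None:
            # Caller-provided directory: create it and leave it in place afterwards.
            workdir = Path(tmp_data_dir)
            workdir.mkdir(parents=True, exist_ok=True)
        else:
            # Registered on the stack, so it is removed when the with-block exits.
            workdir = Path(stack.enter_context(tempfile.TemporaryDirectory()))
        (workdir / "marker.txt").write_text("prepared")
        return str(workdir)
```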
nemo_skills/dataset/ruler2/prepare_init.py (1)

61-69: Silently discarding unknown CLI arguments may hide user errors.

Line 64 discards unknown args. Since prepare.py forwards all sys.argv to both prepare_init.py and prepare_data.py, this is understandable — init doesn't need data-prep args. However, if run standalone, typos or unsupported flags will be silently ignored.

Consider at minimum logging the discarded args for debuggability, or documenting that this script is intended to be invoked via prepare.py.
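A sketch covering both options discussed for unknown arguments: log what is being dropped when the script is driven by `prepare.py`, or reject it when running standalone. The helper name and the `strict` switch are assumptions for illustration.

```python
import argparse
import logging

LOG = logging.getLogger(__name__)


def parse_args_reporting_unknown(
    parser: argparse.ArgumentParser, argv: list[str], strict: bool = False
) -> tuple[argparse.Namespace, list[str]]:
    args, unknown = parser.parse_known_args(argv)
    if unknown:
        if strict:
            # parser.error prints usage plus the message and exits with status 2.
            parser.error(f"unrecognized arguments: {' '.join(unknown)}")
        LOG.warning("Ignoring unrecognized arguments: %s", unknown)
    return args, unknown
```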
nemo_skills/dataset/ruler2/prepare_data.py (3)
459-460: Remove commented-out code.

The commented-out pip install on line 460 is dead code. If the dependency installation is needed, it should be documented or handled in a setup step, not left as a comment in runtime code.

441-476: No validation of task names — KeyError with no helpful message.

If a user passes an invalid task name, prepare_task[task] on line 466 raises a bare KeyError. Consider validating upfront with a clear error message listing available tasks.
Proposed fix
+    invalid_tasks = set(tasks) - set(prepare_task.keys())
+    if invalid_tasks:
+        raise ValueError(f"Unknown tasks: {invalid_tasks}. Available: {list(prepare_task.keys())}")
+
     with concurrent.futures.ThreadPoolExecutor() as executor:
441-441: Add type hints to function signatures.

All public functions in this file lack type hints. At minimum, prepare_dataset and the individual prepare_* functions should annotate their parameters with basic types (str, int, list[str], etc.). As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."
nemo_skills/pipeline/utils/packager.py (1)

82-129: Potential relative_to failure when repo_path is a subdirectory of the git root.

Line 106 validates that local_data_path is relative to repo_meta.path, but line 119 calls local_data_path.relative_to(effective_root) where effective_root is the git root. If repo_meta.path is registered as a parent of the actual git root (unlikely but not prevented by RepoMetadata), or if the repo structure has symlinks that break the ancestor chain, this relative_to call would raise an unhandled ValueError.

The happy path (repo_path is at or below git_root) is safe because git_root is always an ancestor of repo_path, hence also of local_data_path. Just something to be aware of in edge cases.
nemo_skills/pipeline/prepare_data.py (2)
213-225: --prepare_entrypoint prepare_data.py is appended after _build_command which already joined prepare_unknown_args.

Line 225 appends --prepare_entrypoint prepare_data.py after the unknown args were already joined on line 113 (inside _build_command). The resulting command would look like:
python -m nemo_skills.dataset.prepare <datasets> <unknown_args> --prepare_entrypoint prepare_data.py
This works because argparse parses named arguments positionally-independently, but the command string looks a bit odd. Consider passing prepare_entrypoint through the unknown args or appending it inside _build_command for clarity.

227-232: When executor == "none", _get_container_dataset_path returns container paths that won't exist locally.

After tracing the control flow, it appears that data_dir being set always implies containerized execution (lines 173-178 require cluster when data_dir is set, and line 181-185 only drops to executor="none" when data_dir is absent). So this path is safe.

However, this invariant is implicit and fragile — a future change to the early-exit logic could silently break the cp commands. Consider adding a defensive assertion or comment.
💡 Suggested comment for future maintainers
     if data_dir:
+        # data_dir implies containerized execution (executor != "none"),
+        # so container paths are valid for cp commands below.
         command += f" && mkdir -p {data_dir}"
nemo_skills/evaluation/evaluator/__init__.py (2)

92-117: _resolve_eval_type — clean centralized dispatch.

The dual-format support (built-in keys and :: path format) is well-structured. One minor note: getattr(module, attr_str) on line 109 will raise an AttributeError with a generic message if the attribute doesn't exist. Consider wrapping it to provide a more descriptive error mentioning the eval_type string.

147-154: supports_single_eval instantiates the evaluator just to check a capability flag.

obj(config) (line 153) constructs a full evaluator instance solely to call supports_single_eval(). If the constructor has side effects or is expensive, this is wasteful. Since supports_single_eval in BaseEvaluator only checks whether eval_single is overridden (a class-level property), this could be a static/class-level check instead. Low priority given this matches the previous pattern.
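The class-level check could be sketched as follows. `SketchBaseEvaluator` and its subclasses are stand-ins written for this example, not the real `BaseEvaluator`; the point is only that an override can be detected without constructing an instance.

```python
class SketchBaseEvaluator:
    def __init__(self, config):
        self.config = config

    async def eval_single(self, data_point):
        raise NotImplementedError

    @classmethod
    def supports_single_eval(cls) -> bool:
        # True only when a subclass defines its own eval_single.
        return cls.eval_single is not SketchBaseEvaluator.eval_single


class BatchOnlyEvaluator(SketchBaseEvaluator):
    pass


class ExpensiveSingleEvaluator(SketchBaseEvaluator):
    def __init__(self, config):
        # Stands in for a costly constructor that the capability check should not trigger.
        raise RuntimeError("constructor should not run for a capability check")

    async def eval_single(self, data_point):
        return {"ok": True}
```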

docs/evaluation/custom-benchmarks.md (1)

27-27: Add language specifiers to fenced code blocks.

Lines 27, 135, 240, and 281 have fenced code blocks without a language identifier. Consider adding text or the appropriate language (e.g., bash) to satisfy linting and improve rendering.
nemo_skills/dataset/utils.py (2)
62-75: Consider caching get_extra_benchmark_map() to avoid repeated file I/O.

get_extra_benchmark_map() is called here and again inside get_dataset_module(). Each call re-reads and parses the JSON file. If these functions are called in a loop over multiple datasets, the map file will be loaded on every iteration. A simple @functools.lru_cache on get_extra_benchmark_map (keyed on the env var value) would eliminate redundant reads.

55-59: Add type hints to new functions.

Per coding guidelines, simple types should be annotated. All new public functions (get_dataset_name, get_dataset_path, get_extra_benchmark_map, get_dataset_module, get_default_dataset_module) and the internal _load_external_dataset lack parameter and return type hints.

For example:
-def get_dataset_name(dataset):
+def get_dataset_name(dataset: str) -> str:
-def get_dataset_path(dataset):
+def get_dataset_path(dataset: str) -> Path:
-def get_extra_benchmark_map():
+def get_extra_benchmark_map() -> dict[str, str]:
As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."

Also applies to: 62-75, 95-104, 107-111, 114-155

Kipok · 2026-02-14T06:40:58Z

@gwarmstrong please have a look when you get a chance, slurm tests look good

Jorjeous

That's huge
@melllinia please check that your datasets do not rely on the vllm / speech lm dataset group

gwarmstrong

handful of comments and questions

gwarmstrong · 2026-02-17T18:55:33Z

+# ignore_if_registered avoids errors when the module is imported more than once.
+register_external_repo(
+    RepoMetadata(name="my_benchmarks", path=Path(__file__).parents[2]),
+    ignore_if_registered=True,


when do you anticipate this happening? if multiple custom benchmarks are used? or if you are writing to the same name as an existing dataset?

if multiple benchmarks are used. I think main use-case will be to have a single internal repo with 10s of internal benchmarks. Then each of them has to have this register call and if a few are specified together, it will fail if we don't ignore registered

what do you mean by "specified together"? Is this targeted at clashing across namespaces (e.g., two different benchmarks register a "my_dataset" dataset?)? I'm wary of ignores, because they can make it easy to do something different than you intend.

I mean if internal benchmarks repo has "benchmark1" and "Benchmark2" in different folders under dataset. Both have to have this call in their init, but if you do --benchmarks=benchmark1,benchmark2 it will fail without that ignore argument

In what case would you want to not use ignore_if_registered then?

tbh, I'd remove that parameter and make it default behavior. But that would be a breaking change, so I didn't do it. I guess we can maybe add some check that if name / path pair is the same, then we don't error out, but if name is same, but path is different, then we do? As that's the only dangerous case here

I'm still a little confused why the double trigger occurs, but I think what you've said in the last comment here seems like a good solution.

gwarmstrong · 2026-02-17T18:57:42Z

+In `GENERATION_ARGS` this is referenced as:
+
+```
++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml


Is there any value to allowing the benchmark_dataset.jsonl to account for prompt configs too? or otherwise use the :: syntax in some way

not sure I fully understand, can you clarify this please? Prompt is just a yaml file, how would we use :: syntax?

Sorry, I mean more so is being able to specify it relative to the custom benchmark, like my_benchmarks/prompt/eval/word_count/default.yaml. It seems minor, but the /nemo_run/code mount causes a lot of friction for people, and I think allowing the relative usage would make the interface a bit more ergonomic for a lot of users.

got it, will update

gwarmstrong · 2026-02-17T19:36:08Z

+    Relative paths in the map are resolved relative to the map file's directory.
    """
-    Get dataset module either in default folder or in extra datasets folder.
+    map_path = os.environ.get("NEMO_SKILLS_EXTRA_BENCHMARK_MAP")


can we pass this as an argument so there is a little more configurability? e.g., this would enable passing benchmark_map as an arg at the python pipeline level, which I think would be more convenient when writing python than exporting as env var

will update
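A sketch of what an argument-first lookup might look like, with the environment variable as a fallback and relative entries resolved against the map file's directory (as the docstring in the diff above states). The function name `load_benchmark_map` and its signature are assumptions, not the final API.

```python
import json
import os
from pathlib import Path
from typing import Optional


def load_benchmark_map(map_path: Optional[str] = None) -> dict[str, Path]:
    # An explicit argument wins; otherwise fall back to the environment variable.
    map_path = map_path or os.environ.get("NEMO_SKILLS_EXTRA_BENCHMARK_MAP")
    if not map_path:
        return {}
    map_file = Path(map_path).resolve()
    raw = json.loads(map_file.read_text(encoding="utf-8"))
    # Joining with "/" leaves absolute entries unchanged and anchors
    # relative ones at the directory that contains the map file.
    return {name: (map_file.parent / entry).resolve() for name, entry in raw.items()}
```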

gwarmstrong · 2026-02-17T19:41:59Z

+        module_str, attr_str = eval_type.split("::", 1)
+        if Path(module_str).is_file():
+            module = import_from_path(module_str)
+        else:
+            module = importlib.import_module(module_str)
+        obj = getattr(module, attr_str)
+        is_class = inspect.isclass(obj) and issubclass(obj, BaseEvaluator)
+        return obj, is_class


is it possible to consolidate the logic between this and nemo_skills.mcp.utils.locate?
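For reference, a standalone version of the `::` resolution in the diff above might look like the sketch below. It uses only the standard library in place of the repo's `import_from_path` helper, and `resolve_object` is a hypothetical name; it is not the consolidated function being asked about.

```python
import importlib
import importlib.util
from pathlib import Path


def resolve_object(spec: str):
    """Resolve 'package.module::attr' or '/path/to/file.py::attr' to a Python object."""
    if "::" not in spec:
        raise ValueError(f"Expected '<module or file>::<attribute>', got {spec!r}")
    module_str, attr_str = spec.split("::", 1)
    if Path(module_str).is_file():
        # File path: load the module directly from its location.
        mod_spec = importlib.util.spec_from_file_location(Path(module_str).stem, module_str)
        module = importlib.util.module_from_spec(mod_spec)
        mod_spec.loader.exec_module(module)
    else:
        # Dotted name: regular import.
        module = importlib.import_module(module_str)
    try:
        return getattr(module, attr_str)
    except AttributeError as e:
        raise AttributeError(f"{attr_str!r} not found in {module_str!r} (from {spec!r})") from e
```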

gwarmstrong · 2026-02-17T19:43:23Z

-            return super().fill_prompt(data_point, data)
+        prompt_format = prompt_format or self.cfg.prompt_format
+        if prompt_format == "openai":
+            return super().fill_prompt(data_point, data, prompt_format)


can we use keyword arguments for this? I find it good practice for clarity when there's more than one or two arguments to the function

Signed-off-by: Igor Gitman <igitman@nvidia.com>

gwarmstrong

Just a couple residual comments. If we prefer to merge quicker, I think both honestly can be refined over time if they prove to be issues.

gwarmstrong · 2026-02-19T20:16:44Z

+## Quick start
+
+1. **Create a repo** with `benchmark_map.json`, a dataset `__init__.py`, and a `prepare.py`.
+2. **Set the env var** `NEMO_SKILLS_EXTRA_BENCHMARK_MAP` to point at your `benchmark_map.json` (`name -> path` structure).


maybe we can have this demo prefer the pipeline argument now? Without having a concrete use case that I've iterated on myself a bit, I'm 50/50 on which is generally easier to pass

let me keep the current one, but we can update it later if that's not how most people use it

gwarmstrong · 2026-02-19T20:17:44Z

+# ignore_if_registered avoids errors when the module is imported more than once.
+register_external_repo(
+    RepoMetadata(name="my_benchmarks", path=Path(__file__).parents[2]),
+    ignore_if_registered=True,


I'm still a little confused why the double trigger occurs, but I think what you've said in the last comment here seems like a good solution.

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 144c70b Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 15:26:33 2026 -0800 Adding an option to store benchmarks in external repo (#1240) Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit ef0a890 Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Date: Wed Feb 11 12:03:16 2026 +0400 Gnalbandyan/add physics (#1214) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> commit bd9d30c Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Tue Feb 10 15:13:27 2026 -0800 LCB generic prompting (#1215) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit 7d6c49a Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Sat Feb 7 08:45:46 2026 -0800 Add support for different variations of nemo-rl (#1220) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit b19ba96 Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 6 21:40:56 2026 -0800 Add multi-node sandbox support for SLURM clusters (#1218) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 8950bb0 Author: anowaczynski-nvidia <anowaczynski@nvidia.com> Date: Sat Feb 7 01:38:00 2026 +0100 support structured outputs in hle judge for optional AA compatibility (#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b84f7a2 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 6 14:51:02 2026 -0800 A small update on running tests docs (#1219) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 8e838e1 Author: George Armstrong <georgea@nvidia.com> Date: Thu Feb 5 18:01:35 2026 -0800 feat: add flag to disable sandbox replay (#1217) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5fd9085 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 5 15:57:01 2026 -0800 Add an option to limit number of tool calls (#1216) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit d820200 Author: Igor Gitman <igitman@nvidia.com> Date: Tue Feb 3 10:43:55 2026 -0800 Add arena-hard v2 (#1205) Signed-off-by: bzantium <ryumin93@gmail.com> 
Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: bzantium <ryumin93@gmail.com> commit a30920e Author: Igor Gitman <igitman@nvidia.com> Date: Mon Feb 2 10:53:55 2026 -0800 Fix mkdocs warnings (#1204) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 19d7788 Author: Ivan <imoshkov@nvidia.com> Date: Mon Feb 2 23:25:13 2026 +0500 Fix infinite wait in sandbox.wait_for_sandbox (#1206) Signed-off-by: i-vainn <imoshkov@nvidia.com> commit 3e65fbf Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Fri Jan 30 19:38:38 2026 -0800 Improve tts (#1203) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 250c862 Author: Nick Ludwig <nliudvig@nvidia.com> Date: Fri Jan 30 22:12:29 2026 +0400 SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202) Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> commit 7ded756 Author: Ivan <imoshkov@nvidia.com> Date: Fri Jan 30 09:57:41 2026 +0500 Add proper token counting to code execution model (#1184) Signed-off-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b986304 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Jan 29 17:57:07 2026 -0800 Upgrade containers (#1198) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 3b44f02 Author: Dan Lord <blahblahasdf@gmail.com> Date: Thu Jan 29 16:40:47 2026 -0800 Fix incorrect string format (#1199) Signed-off-by: dlord <dlord@nvidia.com> commit c4854b8 Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Thu Jan 29 13:43:36 2026 -0800 Update nemo-rl to latest (#1087) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

Kipok added 23 commits February 10, 2026 16:31

Explicitly add bfcl init files as they are static

f5ebde1

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Refactor ruler into prepare_data + prepare_init

c28972e

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Basic split into init and data prepare in pipeline

f49dc2a

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Partial refactoring - remove module download logic

e3688de

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Always specify metrics type in eval

edab4db

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Basic support for specifying external path

27f24fb

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add basic support for extra dataset map

deb17ad

Signed-off-by: Igor Gitman <igitman@nvidia.com>
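
The PR description says a "benchmark_map.json" file maps a short benchmark name to the full path of its dataset folder. A minimal sketch of that kind of lookup follows. The function name `resolve_benchmark`, its arguments, the rule that relative entries are resolved against the map file's folder, and the fallback of treating an unmapped name as a path are all assumptions made for illustration, not the nemo-skills implementation.

```python
import json
from pathlib import Path


def resolve_benchmark(name, benchmark_map_path=None):
    """Return the dataset folder for `name` (hypothetical helper).

    If a map file is given and contains `name`, use the mapped path;
    otherwise treat `name` itself as a path to the dataset folder.
    """
    if benchmark_map_path is not None:
        map_path = Path(benchmark_map_path)
        mapping = json.loads(map_path.read_text())
        if name in mapping:
            target = Path(mapping[name])
            # Assumption: relative entries are relative to the map file's folder.
            if not target.is_absolute():
                target = map_path.parent / target
            return target.resolve()
    # Assumption: anything not in the map is already a path.
    return Path(name)
```

With a map such as `{"my_bench": "benchmarks/my_bench"}`, `resolve_benchmark("my_bench", "/repo/benchmark_map.json")` would return `/repo/benchmarks/my_bench` under these assumptions.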

Add dynamic data prep

b22930b

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Remove data group from inits

dfb40d2

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Clean up data groups and properly track external datasets

ec5353f

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Remove logic to add lean headers

2b853ee

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add packaging for jsonl files in external repos

506d909

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add explicit jsonl files for packaging

52cedd9

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix packaging for external repos

98a7623

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Tmp remove scicode

bb25cfe

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add a way to pass evaluator as a string

147a459

Signed-off-by: Igor Gitman <igitman@nvidia.com>
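
The PR description mentions passing evaluators as `file::class`, similar to what was already done for metrics. One way such a string could be turned into a class is sketched below using only the standard library. The helper name `load_class_from_spec` and the exact error handling are hypothetical; the real loader in the repository may differ.

```python
import importlib.util
from pathlib import Path


def load_class_from_spec(spec: str):
    """Load a class from a 'path/to/file.py::ClassName' string (hypothetical helper)."""
    file_part, sep, class_name = spec.partition("::")
    if not sep or not class_name:
        raise ValueError(f"expected 'file.py::ClassName', got {spec!r}")
    path = Path(file_part).resolve()
    # Build a module spec directly from the file so the file does not
    # need to be importable through sys.path.
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot import module from {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, class_name)
```

Loading by file path rather than by dotted module name is what lets an evaluator live in an external repository that is not installed as a package.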

Update to rglob for main datasets

cc2d744

Signed-off-by: Igor Gitman <igitman@nvidia.com>
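
The PR description notes that packaging now picks up all jsonl files automatically instead of relying on hardcoded subfolder logic. A rough sketch of that idea with `pathlib.Path.rglob` is shown here. The function name `find_jsonl_files` is hypothetical and the actual packaging code is not reproduced.

```python
from pathlib import Path


def find_jsonl_files(dataset_root):
    """Return all *.jsonl files under dataset_root, as paths relative to it."""
    root = Path(dataset_root)
    # rglob walks every subfolder, so nested benchmark splits are found
    # without listing the subfolders explicitly.
    return sorted(p.relative_to(root) for p in root.rglob("*.jsonl"))
```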

Add relative path resolution

d8d5444

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix data_dir usage

f5d79f8

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Refactor prepare data logic to fix data dir issue

643438f

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Unconditional trigger for init

3b33b36

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix resolve

fbef304

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add docs

72a2c82

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Revert remove scicode

8fac5a8

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai Bot reviewed Feb 12, 2026

Kipok added 3 commits February 12, 2026 16:06

Small doc update

bd0bada

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add tests

6fa989d

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add license

0c946e2

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok added 2 commits February 13, 2026 16:55

Roll-back split to prepare data and prepare init as it's not needed

aca2da1

Signed-off-by: Igor Gitman <igitman@nvidia.com>

rollback api change for datasets

f682692

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok added the run GPU tests label Feb 14, 2026

Bug fix for files check

ca7241f

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok added run GPU tests and removed run GPU tests labels Feb 14, 2026

Add uncommitted env var

9c7b433

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok added run GPU tests and removed run GPU tests labels Feb 14, 2026

Kipok requested a review from gwarmstrong February 14, 2026 06:40

Kipok added 2 commits February 13, 2026 22:44

Adjust constraint slightly

3f8180a

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Merge branch 'main' into igitman/benchmarks-plugin

c78c3fe

Jorjeous reviewed Feb 17, 2026

gwarmstrong reviewed Feb 17, 2026

Kipok added 4 commits February 17, 2026 12:49

Merge branch 'main' into igitman/benchmarks-plugin

57a2141

Use kwargs for prompt fill

01462f3

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Use locate and add explicit extra_benchmark_map argument

0711997

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add relative config resolution

24ea940

Signed-off-by: Igor Gitman <igitman@nvidia.com>

gwarmstrong approved these changes Feb 19, 2026

Merge branch 'main' into igitman/benchmarks-plugin

dc5f646

Kipok mentioned this pull request Feb 19, 2026

Refactor registering external repositories #1260

Open

Kipok merged commit 144c70b into main Feb 19, 2026
5 checks passed

Kipok deleted the igitman/benchmarks-plugin branch February 19, 2026 23:26

coderabbitai Bot mentioned this pull request Feb 21, 2026

Add DSBench-DA evaluation #1254

Merged

talorabr pushed a commit to talorabr/Nemo-Skills that referenced this pull request Feb 22, 2026

Adding an option to store benchmarks in external repo (NVIDIA-NeMo#1240)

ec19ba7

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Adding an option to store benchmarks in external repo (#1240)

8f16223

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Adding an option to store benchmarks in external repo (#1240)

e02c560

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

Kipok commented Feb 12, 2026 • edited

greptile-apps Bot commented Feb 12, 2026 • edited

coderabbitai Bot commented Feb 12, 2026 • edited

coderabbitai Bot left a comment

Kipok commented Feb 14, 2026

Jorjeous left a comment

gwarmstrong left a comment

gwarmstrong left a comment
