Adding an option to store benchmarks in external repo #1240

Merged
Kipok merged 63 commits into main from igitman/benchmarks-plugin on Feb 19, 2026
@Kipok Kipok commented Feb 12, 2026

Collaborator

The main motivation for this PR is to enable registering external benchmarks, so that people can use nemo-skills with custom private eval repositories without having to maintain a whole fork. This required some restructuring, which resulted in the following changes:

  1. Added an option for a benchmark to be specified as a full path to the dataset folder. Also added logic to register a "benchmark_map.json" that maps a short name to a full path, which simplifies referencing external benchmarks.
  2. Added tests and documentation for the new logic; both should be quite extensive.
  3. Added an option to pass evaluators as file::class, similar to what was already done for metrics.
  4. Made the summarize results command from eval specify the metrics type directly. This way it works inside the container without having to reference the init from external benchmarks, which might be in a different path.
  5. Added a REQUIRES_DATA_DIR option to the init directly. It was previously hardcoded, which made it hard to maintain.
  6. Committed the BFCL init.py files explicitly. They weren't dynamic, so having them in the repo makes things simpler and more explicit.
  7. Removed the previous logic to register extra datasets, which only worked with "simple" benchmarks that didn't require custom evaluation / generation / metrics.
  8. Removed DATASET_GROUP from inits: we didn't really use it much, it made little sense, was inconsistent, and made things hard with external benchmarks.
  9. Changed the base generate API to add a prompt_format argument. This isn't strictly needed for this PR, but it made an example in the docs much simpler. Previously it was hard to switch from our prompt format in the first turn to the openai format in the second turn; with this argument, we can override the prompt format and reuse a lot more code.
  10. Removed hardcoded packaging logic for datasets with subfolders and made it dynamically pick up all jsonl files automatically.
  11. Made a pythonic interface for launch / stop server. Again, not strictly needed, but it made the new tests I added much faster since we reuse the server in them (otherwise we would have to re-spin it 12 times). This will also make test_eval.py much better in the future if we add it there, since it will allow us to run judge benchmarks, but I haven't made this change yet.
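
To make item 1 concrete, here is a minimal sketch of how a benchmark reference could be resolved either through a "benchmark_map.json" lookup or as a direct folder path. The function name `resolve_benchmark`, the lookup order, and the folder layout are assumptions for illustration only; they are not the actual nemo-skills implementation.

```python
import json
import tempfile
from pathlib import Path


def resolve_benchmark(name_or_path: str, benchmark_map: dict) -> Path:
    """Hypothetical helper: resolve a short name or a full folder path."""
    # A short name registered in the map is looked up first (assumed order).
    if name_or_path in benchmark_map:
        return Path(benchmark_map[name_or_path])
    # Otherwise treat the value as a path to the dataset folder.
    candidate = Path(name_or_path)
    if candidate.is_dir():
        return candidate
    raise ValueError(
        f"Unknown benchmark {name_or_path!r}: not in the map and not an existing folder"
    )


with tempfile.TemporaryDirectory() as tmp:
    # Illustrative layout of an external benchmark repo.
    dataset_dir = Path(tmp) / "my_benchmarks" / "dataset" / "word_count"
    dataset_dir.mkdir(parents=True)

    # The map file stores short name -> full path, as described in item 1.
    map_file = Path(tmp) / "benchmark_map.json"
    map_file.write_text(json.dumps({"word_count": str(dataset_dir)}))

    benchmark_map = json.loads(map_file.read_text())
    by_name = resolve_benchmark("word_count", benchmark_map)
    by_path = resolve_benchmark(str(dataset_dir), benchmark_map)
    same_target = by_name == by_path  # both references point at one folder
```

Both forms end up at the same dataset folder, which is the property that lets a short name stand in for a long external path.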

Summary by CodeRabbit

  • New Features

    • External/custom benchmark support: add, prepare, package, and evaluate benchmarks outside the repo; server launch/stop utilities for evaluations.
    • prompt_format passthrough to customize prompt formatting across generation and evaluation tasks.
  • Improvements

    • Simplified dataset resolution and data-dir/container handling for prepare/package/eval flows.
    • More robust CLI behavior for prepare/eval/summarize and improved packaging of external datasets.
  • Documentation

    • Added guide for defining and integrating external benchmarks.
  • Tests

    • New end-to-end integration tests for external benchmark prepare + eval.

Kipok added 23 commits February 10, 2026 16:31
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@greptile-apps

greptile-apps Bot commented Feb 12, 2026

Contributor

Too many files changed for review. (180 files found, 100 file limit)

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai

coderabbitai Bot commented Feb 12, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Wide refactor to dataset initialization and preparation: removes many DATASET_GROUP exports, introduces REQUIRES_DATA_DIR/HAS_DYNAMIC_INIT flags, modularizes RULER/RULER2 prepare flows, adds external-benchmark support (loading, packaging, tests), enhances dataset resolution and evaluator dispatch, and threads prompt_format through generation/eval call paths.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Docs & config | `docs/evaluation/custom-benchmarks.md`, `docs/evaluation/index.md`, `mkdocs.yml`, `.gitignore` | Add documentation for external/custom benchmarks, update navigation, un-ignore a dummy external benchmark file and consolidate bfcl ignore patterns. |
| Dataset flag removals & additions | `nemo_skills/dataset/.../__init__.py` (many files), `nemo_skills/dataset/ruler/__init__.py`, `nemo_skills/dataset/ruler2/__init__.py` | Remove DATASET_GROUP from many dataset packages; add/replace with REQUIRES_DATA_DIR=True and/or HAS_DYNAMIC_INIT=True in select packages; minor formatting tweaks. |
| BFCL subpackages | `nemo_skills/dataset/bfcl_v3/.../__init__.py`, `nemo_skills/dataset/bfcl_v4/.../__init__.py`, `nemo_skills/dataset/bfcl_v3/prepare.py` | Add METRICS_TYPE, GENERATION_ARGS, GENERATION_MODULE constants across many bfcl subpackages; remove DEFAULT_SETTINGS and stop auto-writing `__init__.py` in bfcl_v3 prepare. |
| RULER modularization | `nemo_skills/dataset/ruler/prepare.py`, `.../prepare_common.py`, `.../prepare_data.py`, `.../prepare_init.py` | Split monolithic RULER prepare into prepare_common, prepare_data, prepare_init; top-level prepare.py now delegates to those scripts. |
| RULER2 modularization | `nemo_skills/dataset/ruler2/prepare.py`, `.../prepare_common.py`, `.../prepare_data.py`, `.../prepare_init.py` | Same modularization for RULER2: task logic moved into prepare_data/prepare_init, prepare.py reduced to a launcher. |
| Dataset resolution & utils | `nemo_skills/dataset/utils.py` | New resolution APIs: get_dataset_path, get_extra_benchmark_map, _load_external_dataset, and get_dataset_module updated to support builtin, map-key, and path-based datasets; removed prior cluster/path helpers. |
| Dataset prepare entrypoint & CLI | `nemo_skills/dataset/prepare.py`, `nemo_skills/pipeline/prepare_data.py` | Add CLI parsing helper, change prepare_datasets to accept prepare_entrypoint, refactor pipeline prepare flow to handle external datasets, dynamic init vs non-split flows, container path mapping, and data_dir copying. |
| Packaging & packager helpers | `nemo_skills/pipeline/utils/packager.py`, `nemo_skills/pipeline/utils/eval.py`, `nemo_skills/pipeline/utils/__init__.py` | Add resolve_external_data_path, allow ignore_if_registered in repo registration, improve include patterns to capture generated JSONL from git roots, and re-export resolve_external_data_path. |
| Evaluation dispatch & metrics | `nemo_skills/evaluation/evaluator/__init__.py`, `nemo_skills/evaluation/metrics/compute_metrics.py` | Add _resolve_eval_type to support module::Class and file::Class path formats, dynamic loading/dispatch, enhanced error messages; simplify ComputeMetrics init to infer metric type from dataset module. |
| Pipeline: remove extra-dataset plumbing | `nemo_skills/pipeline/eval.py`, `nemo_skills/pipeline/summarize_results.py` | Remove extra_datasets/extra_datasets_type and related data_dir plumbing from eval and summarize flows; drop ExtraDatasetType usage. |
| Generation & prompt_format plumbing | `nemo_skills/inference/generate.py`, many `nemo_skills/inference/*`, `recipes/*` | Add optional prompt_format parameter to fill_prompt and process_single_datapoint signatures across many generation/eval tasks and thread it through where applicable. |
| Server orchestration | `nemo_skills/pipeline/start_server.py` | Add launch_server and stop_server helpers; refactor start_server to centralize server/tunnel lifecycle and cleanup. |
| External benchmark fixtures & tests | `tests/data/dummy_external_benchmark/**`, `tests/gpu-tests/test_external_benchmark_eval.py`, `tests/test_external_benchmarks.py` | Add a dummy external benchmark repo (datasets, prepare scripts, evaluator, metrics, prompts, benchmark_map), extensive unit and GPU/container integration tests validating external benchmark prepare, packaging, resolution, and evaluation flows. |
| Tests & small updates | `tests/test_datasets.py`, `tests/test_configs.py`, `tests/gpu-tests/run_qwen.sh`, many tests/runner scripts | Remove DATASET_GROUP validation test, tweak mock configs, update prepare_data call sites to accept explicit datasets list and validate wrap_arguments behavior; add GPU test invocation. |

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer/CLI
    participant Prepare as Pipeline prepare
    participant Utils as Dataset Utils / Packager
    participant Container as Container / Job
    participant Eval as Evaluation runtime
    participant Metrics as ComputeMetrics

    Dev->>Prepare: run prepare (dataset path or map key)
    Prepare->>Utils: resolve dataset via get_dataset_path (path | map key | builtin)
    alt external dataset
        Utils->>Utils: register_external_repo / resolve_external_data_path
        Prepare->>Container: package & mount external repo
    end
    Prepare->>Container: invoke dataset prepare entrypoint (prepare_init/prepare_data)
    Container->>Prepare: produce prepared JSONL (test.jsonl)
    Dev->>Eval: run eval CLI (benchmark)
    Eval->>Utils: import dataset module (reads METRICS_TYPE / GENERATION_MODULE)
    Eval->>Eval: _resolve_eval_type -> load evaluator (module::Class or file::Class)
    Eval->>Metrics: instantiate and run metrics evaluator
    Metrics-->>Dev: write metrics.json (results)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

enhancement, run GPU tests

Suggested reviewers

  • gwarmstrong
  • activatedgeek
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.46% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Adding an option to store benchmarks in external repo' accurately summarizes the main change: enabling external benchmark repositories. It is clear, specific, and directly reflects the primary objective of the PR. |
| Merge Conflict Detection | ✅ Passed | ✅ No merge conflicts detected when merging into main |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)

583-591: ⚠️ Potential issue | 🔴 Critical

Breaking change: ArenaJudge will crash when calling super().process_single_datapoint().

ArenaJudge.process_single_datapoint calls super().process_single_datapoint(gen_base_data, all_data) at lines 152-153. This invokes the base class's process_single_datapoint, which then calls self.fill_prompt(data_point, all_data, prompt_format) as a positional call (line 692). Since self is an ArenaJudge instance, this attempts to pass 3 positional arguments to ArenaJudge.fill_prompt, which only accepts 2 parameters. This will raise TypeError: fill_prompt() takes 3 positional arguments but 4 were given.

To fix this, either:

  1. Pass prompt_format as a keyword argument in the base class:
-"prompt": self.fill_prompt(data_point, all_data, prompt_format),
+"prompt": self.fill_prompt(data_point, all_data, prompt_format=prompt_format),
  2. Update ArenaJudge.fill_prompt to accept the new parameter:
-def fill_prompt(self, data_point, data):
+def fill_prompt(self, data_point, data, prompt_format=None):

Both changes are needed for full compatibility.
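
The failure mode can be reproduced with a small, self-contained sketch. The classes below are hypothetical stand-ins, not the real generation task or ArenaJudge, and the methods are synchronous for brevity (the real `process_single_datapoint` is async). The sketch suggests that an override with the old two-parameter signature fails whether `prompt_format` is passed positionally or by keyword, while an override that accepts the new parameter works with both call styles.

```python
class BaseTask:
    """Stand-in for the base generation task (not the real class)."""

    def fill_prompt(self, data_point, data, prompt_format=None):
        return {"data_point": data_point, "prompt_format": prompt_format}

    def process_positional(self, data_point, all_data, prompt_format=None):
        # Mirrors the flagged call: prompt_format is passed positionally.
        return self.fill_prompt(data_point, all_data, prompt_format)

    def process_keyword(self, data_point, all_data, prompt_format=None):
        # Suggested change 1: pass prompt_format as a keyword argument.
        return self.fill_prompt(data_point, all_data, prompt_format=prompt_format)


class LegacyJudge(BaseTask):
    # Old-style override that does not know about prompt_format.
    def fill_prompt(self, data_point, data):
        return {"data_point": data_point}


class UpdatedJudge(BaseTask):
    # Suggested change 2: the override accepts the new parameter.
    def fill_prompt(self, data_point, data, prompt_format=None):
        return {"data_point": data_point, "prompt_format": prompt_format}


def outcome(method) -> str:
    """Call the method and report whether it raised TypeError."""
    try:
        method({"question": "q"}, [])
        return "ok"
    except TypeError:
        return "TypeError"


results = {
    "legacy_positional": outcome(LegacyJudge().process_positional),
    "legacy_keyword": outcome(LegacyJudge().process_keyword),
    "updated_positional": outcome(UpdatedJudge().process_positional),
    "updated_keyword": outcome(UpdatedJudge().process_keyword),
}
```

In this sketch only the updated override succeeds, so updating the subclass signature is the change that removes the crash; the keyword call in the base class mainly guards against overrides that reorder their parameters.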

🤖 Fix all issues with AI agents
In `@docs/evaluation/custom-benchmarks.md`:
- Around line 198-208: The example calls LOG.info in the generate function but
never defines LOG; add the logger setup by importing logging and get_logger_name
(from nemo_skills.utils) and creating LOG =
logging.getLogger(get_logger_name(__file__)); ensure these imports and the LOG
definition appear near the top of the example so LOG is available when
generate(cfg: WordCountGenerationConfig) calls LOG.info.
- Around line 264-277: In the update method, fix the NameError by passing the
correct variable name to _compute_pass_at_k: replace the incorrect
predicted_answer with the defined predicted_answers; specifically, in the
update(self, predictions) body ensure the call to _compute_pass_at_k uses
predicted_answers (the list defined from predictions) so it matches the later
_compute_majority_at_k call and the local variable name.

In `@nemo_skills/dataset/bfcl_v3/parallel/__init__.py`:
- Around line 1-3: The file is missing the standard NVIDIA Apache 2.0 license
header; add the exact repo-standard copyright/license header comment block to
the top of this __init__.py (and any other new bfcl_v3 __init__.py files) above
the existing constants METRICS_TYPE, GENERATION_ARGS, and GENERATION_MODULE so
the file matches other files in the PR.

In `@nemo_skills/dataset/prepare.py`:
- Around line 36-49: The CLI flag --retries currently defaults to 0 while the
prepare_datasets function signature defaults to 3, causing inconsistent
behavior; pick one canonical default and make both match (either change the
parser add_argument call for "--retries" to default=3 or change the
prepare_datasets signature to retries=0) so callers get the same retry behavior
whether invoked from the CLI or as a library function; update the parser
add_argument("--retries", ...) and/or the prepare_datasets(...) retries
parameter accordingly.

In `@nemo_skills/dataset/ruler/prepare_common.py`:
- Around line 33-39: The message MISSING_RULER_ARGS_MESSAGE currently begins
with "ERROR:" but parse_args_and_prepare_args exits with SystemExit(0),
producing a success status while signaling an error; fix by making them
consistent: either remove the "ERROR:" prefix from MISSING_RULER_ARGS_MESSAGE if
skipping is intentional, or change the exit in parse_args_and_prepare_args to
SystemExit(1) to indicate failure; locate and update the symbols
MISSING_RULER_ARGS_MESSAGE and the exit call in parse_args_and_prepare_args
(also consider how prepare.py's subprocess `check=True` will interpret the
chosen exit code) so the log text and exit code match the intended behavior.

In `@nemo_skills/dataset/ruler/prepare_data.py`:
- Around line 49-57: The git-LFS check in the "if 'cwe' in tasks" block
currently only catches subprocess.CalledProcessError but will crash with a
FileNotFoundError when git is not installed; update the exception handling
around the subprocess.run(["git", "lfs", "--version"]) call in prepare_data.py
to catch both subprocess.CalledProcessError and FileNotFoundError (or a broad
OSError) and then print the existing friendly message and exit(1) so missing git
or missing git-lfs both produce the same helpful output.

In `@nemo_skills/dataset/ruler2/prepare_data.py`:
- Around line 25-438: Many near-duplicate functions (e.g.,
prepare_mk_niah_basic, prepare_mk_niah_easy, prepare_mv_niah_basic,
prepare_qa_hard, etc.) repeat subprocess.run logic with only module name
(prepare_niah, prepare_mmlu, prepare_qa) and a few flag values changing; replace
them with one generic runner (e.g., run_prepare_task) that accepts a task key
and common params (output_folder, tokenizer_type, tokenizer_path, length,
dataset_size) and use a declarative dict mapping task keys to module name plus
per-task flag/value dicts; the runner should build the argv list by starting
with ["python","-m", module] then iterating the flag dict to append "--flag",
str(value), call subprocess.run(..., check=True), and replace all prepare_*
callers with a single call to run_prepare_task(task_key, ...).
- Around line 479-491: The current main block uses parse_known_args and assigns
unknown to `_`, silently discarding unrecognized CLI flags; change this to
either call parser.parse_args() to let argparse reject unknown args, or keep
parse_known_args but immediately check the returned unknown list and raise a
clear error (or call parser.error()) if it's non-empty; update the block around
build_prepare_parser, parse_known_args, and the call site before prepare_dataset
to perform this validation so typos/unsupported flags are not ignored.

In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 174-177: The debug print in the class-instantiation branch should
be removed or converted to a proper logger call; replace the line
`print(f"evaluator: {evaluator}")` with either nothing or `LOG.debug("evaluator:
%s", evaluator)` (or equivalent) inside the block where `is_class` is true after
`evaluator = obj(eval_config)` so you don't emit noisy stdout during
`evaluator.eval_full()` runs.

In `@nemo_skills/pipeline/eval.py`:
- Around line 745-746: The current f-string will inject the literal "None" when
both metric_type and benchmark_args.metrics_type are None; compute an effective
value (e.g., effective_metric_type = metric_type or benchmark_args.metrics_type)
and only append the "--metric_type=..." flag to the command when
effective_metric_type is truthy, so summarize_results never receives the string
"None" as a metric type.

In `@nemo_skills/pipeline/utils/packager.py`:
- Around line 191-194: The loop that builds include_patterns iterates over
dataset_dir.rglob("*.jsonl") and currently calls
include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) inside that
loop, creating duplicate entries; move the append call outside the for f in
dataset_dir.rglob("*.jsonl") loop (or append once conditionally after detecting
at least one JSONL) so include_pattern_relative_paths only adds
str(nemo_skills_dir.parent) a single time when dataset files exist; update the
block around dataset_dir, include_patterns and include_pattern_relative_paths in
packager.py accordingly.
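
As an illustration of the `nemo_skills/pipeline/eval.py` item in the list above, the following sketch appends the `--metric_type` flag only when an effective value exists. The function name `build_summarize_command` and the `base_command` placeholder are hypothetical; only the flag name comes from the review comment.

```python
from typing import Optional


def build_summarize_command(
    base_command: str,
    metric_type: Optional[str] = None,
    benchmark_metrics_type: Optional[str] = None,
) -> str:
    """Hypothetical helper: never emit a literal '--metric_type=None'."""
    # An explicit metric_type takes priority over the benchmark default.
    effective_metric_type = metric_type or benchmark_metrics_type
    parts = [base_command]
    if effective_metric_type:
        parts.append(f"--metric_type={effective_metric_type}")
    return " ".join(parts)


no_flag = build_summarize_command("summarize-placeholder")
with_default = build_summarize_command("summarize-placeholder", None, "math")
with_override = build_summarize_command("summarize-placeholder", "code", "math")
```

When neither value is set the flag is simply omitted, so the downstream command can fall back to its own default instead of receiving the string "None".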
🧹 Nitpick comments (17)
nemo_skills/inference/generate.py (1)

583-585: Add type hints for the new prompt_format parameters.

Per coding guidelines, simple types should have type hints.

-    def fill_prompt(self, data_point, data, prompt_format=None):
+    def fill_prompt(self, data_point, data, prompt_format: str | None = None):
-    async def process_single_datapoint(self, data_point, all_data, prompt_format=None):
+    async def process_single_datapoint(self, data_point, all_data, prompt_format: str | None = None):

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code"

Also applies to: 680-681

nemo_skills/dataset/ruler/prepare_common.py (1)

89-94: Add return type hint.

Per project guidelines, use type hints for simple types.

Suggested fix
-def parse_args_and_prepare_args(parser: argparse.ArgumentParser):
+def parse_args_and_prepare_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, str]:
nemo_skills/dataset/ruler2/prepare_common.py (1)

75-77: Add return type hint.

Same as the ruler counterpart — add a return type hint for consistency.

Suggested fix
-def parse_known_args(parser: argparse.ArgumentParser):
+def parse_known_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, list[str]]:
nemo_skills/dataset/ruler/prepare_data.py (2)

60-60: Pass a string instead of a single-element list when using shell=True.

When shell=True, passing a list is misleading — the first element becomes the shell command string, and subsequent elements become arguments to the shell itself (not the command). Use a plain string here.

Suggested fix
-    subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
+    subprocess.run("pip install wonderwords html2text tenacity", check=True, shell=True)
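
The behavior described above can be observed directly with a harmless command. This is a small sketch that assumes a POSIX system, where `shell=True` runs `/bin/sh -c` and any extra list elements are handed to the shell as its own positional parameters rather than to the command.

```python
import subprocess

# With shell=True and a list, only the first element is the command string.
# The second element becomes $0 of the shell, not an argument to echo.
list_result = subprocess.run(
    ["echo $0", "second-element"],
    shell=True,
    capture_output=True,
    text=True,
    check=True,
)
list_output = list_result.stdout.strip()

# With a plain string, the whole command line is interpreted as written.
string_result = subprocess.run(
    "echo one two",
    shell=True,
    capture_output=True,
    text=True,
    check=True,
)
string_output = string_result.stdout.strip()
```

Here `list_output` holds the second list element because the shell expanded `$0`, which shows why a single string is the less surprising form when `shell=True` is used.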

62-69: Manual __enter__/__exit__ on TemporaryDirectory is fragile.

Calling dunder methods directly bypasses the context manager protocol's guarantees. Consider restructuring so the conditional temp directory uses a proper context manager or a simpler pattern:

Suggested refactor
-    if tmp_data_dir is not None:
-        tmpdirname = tmp_data_dir
-        Path(tmpdirname).mkdir(parents=True, exist_ok=True)
-        tmpdir_context = None
-    else:
-        tmpdir_context = tempfile.TemporaryDirectory()
-        tmpdirname = tmpdir_context.__enter__()
-
-    try:
+    with tempfile.TemporaryDirectory() as _tmpdir:
+        if tmp_data_dir is not None:
+            tmpdirname = tmp_data_dir
+            Path(tmpdirname).mkdir(parents=True, exist_ok=True)
+        else:
+            tmpdirname = _tmpdir
+

This removes the manual __enter__/__exit__ and the try/finally block entirely. The unused TemporaryDirectory is cheap when tmp_data_dir is provided.
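
Another option, shown here as a sketch rather than the project's code, is to pick the context manager up front with `contextlib.nullcontext`, which avoids creating an unused temporary directory. The helper name `working_dir` is hypothetical.

```python
import contextlib
import tempfile
from pathlib import Path
from typing import Optional


def working_dir(tmp_data_dir: Optional[str] = None):
    """Hypothetical helper: yield a user-provided dir or a fresh temp dir."""
    if tmp_data_dir is not None:
        # A caller-provided directory is created if needed and never deleted.
        Path(tmp_data_dir).mkdir(parents=True, exist_ok=True)
        return contextlib.nullcontext(tmp_data_dir)
    # Otherwise the temporary directory is cleaned up when the block exits.
    return tempfile.TemporaryDirectory()


with working_dir() as tmpdirname:
    auto_dir = Path(tmpdirname)
    existed_inside = auto_dir.is_dir()
auto_removed = not auto_dir.exists()
```

Both branches are entered with an ordinary `with` statement, so cleanup still runs if the body raises.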

nemo_skills/dataset/ruler2/prepare_init.py (1)

61-69: Silently discarding unknown CLI arguments may hide user errors.

Line 64 discards unknown args. Since prepare.py forwards all sys.argv to both prepare_init.py and prepare_data.py, this is understandable — init doesn't need data-prep args. However, if run standalone, typos or unsupported flags will be silently ignored.

Consider at minimum logging the discarded args for debuggability, or documenting that this script is intended to be invoked via prepare.py.
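
A minimal sketch of that suggestion follows. The parser, the `--length` flag, and the `strict` switch are illustrative assumptions; only `argparse.ArgumentParser.parse_known_args` and `parser.error` are standard library behavior.

```python
import argparse
import logging

LOG = logging.getLogger("prepare_init_sketch")


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser; the real script defines its own flags.
    parser = argparse.ArgumentParser(prog="prepare_init_sketch")
    parser.add_argument("--length", type=int, default=4096)
    return parser


def parse_and_report_unknown(parser, argv, strict=False):
    """Parse argv and surface unrecognized arguments instead of dropping them."""
    args, unknown = parser.parse_known_args(argv)
    if unknown and strict:
        # parser.error prints usage to stderr and exits with status 2.
        parser.error(f"unrecognized arguments: {' '.join(unknown)}")
    if unknown:
        LOG.warning("Ignoring unrecognized arguments: %s", unknown)
    return args, unknown


parser = build_parser()
args, unknown = parse_and_report_unknown(
    parser, ["--length", "8192", "--tokenzier", "x"]
)
```

The lenient mode keeps the forwarding from prepare.py working, while the warning makes a typo such as the misspelled flag above visible in the logs.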

nemo_skills/dataset/ruler2/prepare_data.py (3)

459-460: Remove commented-out code.

The commented-out pip install on line 460 is dead code. If the dependency installation is needed, it should be documented or handled in a setup step, not left as a comment in runtime code.


441-476: No validation of task names — KeyError with no helpful message.

If a user passes an invalid task name, prepare_task[task] on line 466 raises a bare KeyError. Consider validating upfront with a clear error message listing available tasks.

Proposed fix
+    invalid_tasks = set(tasks) - set(prepare_task.keys())
+    if invalid_tasks:
+        raise ValueError(f"Unknown tasks: {invalid_tasks}. Available: {list(prepare_task.keys())}")
+
     with concurrent.futures.ThreadPoolExecutor() as executor:

441-441: Add type hints to function signatures.

All public functions in this file lack type hints. At minimum, prepare_dataset and the individual prepare_* functions should annotate their parameters with basic types (str, int, list[str], etc.). As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."

nemo_skills/pipeline/utils/packager.py (1)

82-129: Potential relative_to failure when repo_path is a subdirectory of the git root.

Line 106 validates that local_data_path is relative to repo_meta.path, but line 119 calls local_data_path.relative_to(effective_root) where effective_root is the git root. If repo_meta.path is registered as a parent of the actual git root (unlikely but not prevented by RepoMetadata), or if the repo structure has symlinks that break the ancestor chain, this relative_to call would raise an unhandled ValueError.

The happy path (repo_path is at or below git_root) is safe because git_root is always an ancestor of repo_path, hence also of local_data_path. Just something to be aware of in edge cases.

nemo_skills/pipeline/prepare_data.py (2)

213-225: --prepare_entrypoint prepare_data.py is appended after _build_command which already joined prepare_unknown_args.

Line 225 appends --prepare_entrypoint prepare_data.py after the unknown args were already joined on line 113 (inside _build_command). The resulting command would look like:

python -m nemo_skills.dataset.prepare <datasets> <unknown_args> --prepare_entrypoint prepare_data.py

This works because argparse parses named arguments positionally-independently, but the command string looks a bit odd. Consider passing prepare_entrypoint through the unknown args or appending it inside _build_command for clarity.


227-232: When executor == "none", _get_container_dataset_path returns container paths that won't exist locally.

After tracing the control flow, it appears that data_dir being set always implies containerized execution (lines 173-178 require cluster when data_dir is set, and line 181-185 only drops to executor="none" when data_dir is absent). So this path is safe.

However, this invariant is implicit and fragile — a future change to the early-exit logic could silently break the cp commands. Consider adding a defensive assertion or comment.

💡 Suggested comment for future maintainers
     if data_dir:
+        # data_dir implies containerized execution (executor != "none"),
+        # so container paths are valid for cp commands below.
         command += f" && mkdir -p {data_dir}"
nemo_skills/evaluation/evaluator/__init__.py (2)

92-117: _resolve_eval_type — clean centralized dispatch.

The dual-format support (built-in keys and :: path format) is well-structured. One minor note: getattr(module, attr_str) on line 109 will raise an AttributeError with a generic message if the attribute doesn't exist. Consider wrapping it to provide a more descriptive error mentioning the eval_type string.
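
For readers unfamiliar with the `::` format, here is a rough sketch of how a `file::Class` or `module::Class` string could be resolved, with the more descriptive error suggested above. It is not the actual `_resolve_eval_type`: the function name, the `.py` suffix check, and the omission of built-in keys are simplifying assumptions.

```python
import importlib
import importlib.util
import tempfile
from pathlib import Path


def resolve_eval_type(eval_type: str):
    """Hypothetical resolver for 'path/to/file.py::Name' or 'pkg.module::Name'."""
    if "::" not in eval_type:
        raise ValueError(f"Expected '<file or module>::<name>', got {eval_type!r}")
    location, attr_name = eval_type.split("::", 1)
    if location.endswith(".py"):
        # Load the module straight from a file path.
        path = Path(location)
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"Cannot load {location!r} for eval_type {eval_type!r}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        module = importlib.import_module(location)
    try:
        return getattr(module, attr_name)
    except AttributeError as exc:
        # Mention the full eval_type string so the user can spot the typo.
        raise AttributeError(
            f"eval_type {eval_type!r}: {location!r} has no attribute {attr_name!r}"
        ) from exc


with tempfile.TemporaryDirectory() as tmp:
    evaluator_file = Path(tmp) / "my_evaluator.py"
    evaluator_file.write_text("class WordCountEvaluator:\n    pass\n")
    resolved_cls = resolve_eval_type(f"{evaluator_file}::WordCountEvaluator")
    resolved_name = resolved_cls.__name__
```

Wrapping `getattr` keeps the original exception as the cause while the message points back at the string the user actually typed.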


147-154: supports_single_eval instantiates the evaluator just to check a capability flag.

obj(config) (line 153) constructs a full evaluator instance solely to call supports_single_eval(). If the constructor has side effects or is expensive, this is wasteful. Since supports_single_eval in BaseEvaluator only checks whether eval_single is overridden (a class-level property), this could be a static/class-level check instead. Low priority given this matches the previous pattern.
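
A class-level check along these lines could look like the sketch below. The `BaseEvaluator` here is a simplified stand-in written for this example, not the real nemo-skills class, and it assumes the capability is fully determined by whether `eval_single` is overridden.

```python
class BaseEvaluator:
    """Simplified stand-in for an evaluator base class."""

    def __init__(self, config):
        self.config = config

    async def eval_single(self, data_point):
        raise NotImplementedError

    @classmethod
    def supports_single_eval(cls) -> bool:
        # Compare the function objects on the classes; no instance is needed.
        return cls.eval_single is not BaseEvaluator.eval_single


class ExpensiveEvaluator(BaseEvaluator):
    def __init__(self, config):
        # Simulates a constructor that should not run for a capability check.
        raise RuntimeError("constructor should not be called")

    async def eval_single(self, data_point):
        return {"is_correct": True}


class BatchOnlyEvaluator(BaseEvaluator):
    pass


single_supported = ExpensiveEvaluator.supports_single_eval()
batch_only_supported = BatchOnlyEvaluator.supports_single_eval()
```

Because the check runs on the class, the costly constructor in `ExpensiveEvaluator` is never invoked.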

docs/evaluation/custom-benchmarks.md (1)

27-27: Add language specifiers to fenced code blocks.

Lines 27, 135, 240, and 281 have fenced code blocks without a language identifier. Consider adding text or the appropriate language (e.g., bash) to satisfy linting and improve rendering.

nemo_skills/dataset/utils.py (2)

62-75: Consider caching get_extra_benchmark_map() to avoid repeated file I/O.

get_extra_benchmark_map() is called here and again inside get_dataset_module(). Each call re-reads and parses the JSON file. If these functions are called in a loop over multiple datasets, the map file will be loaded on every iteration. A simple @functools.lru_cache on get_extra_benchmark_map (keyed on the env var value) would eliminate redundant reads.
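
One way to do this is sketched below. The environment variable name `EXTRA_BENCHMARK_MAP_SKETCH` and the helper `_load_map` are hypothetical; the cache is keyed on the path read from the environment, so changing the variable still loads the new file.

```python
import functools
import json
import os
import tempfile

# Hypothetical variable name; the real code may read a different one.
MAP_ENV_VAR = "EXTRA_BENCHMARK_MAP_SKETCH"


@functools.lru_cache(maxsize=None)
def _load_map(map_path: str) -> dict:
    # Runs once per distinct path; later calls reuse the parsed result.
    with open(map_path) as f:
        return json.load(f)


def get_extra_benchmark_map() -> dict:
    map_path = os.environ.get(MAP_ENV_VAR)
    if not map_path:
        return {}
    return _load_map(map_path)


with tempfile.TemporaryDirectory() as tmp:
    map_file = os.path.join(tmp, "benchmark_map.json")
    with open(map_file, "w") as f:
        json.dump({"word_count": "/path/to/word_count"}, f)
    os.environ[MAP_ENV_VAR] = map_file
    first = get_extra_benchmark_map()
    second = get_extra_benchmark_map()
    cache_hits = _load_map.cache_info().hits
os.environ.pop(MAP_ENV_VAR, None)
```

One caveat of this design is that the cached dictionary is shared between callers, so it should be treated as read-only, and edits to the file during a run are not picked up.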


55-59: Add type hints to new functions.

Per coding guidelines, simple types should be annotated. All new public functions (get_dataset_name, get_dataset_path, get_extra_benchmark_map, get_dataset_module, get_default_dataset_module) and the internal _load_external_dataset lack parameter and return type hints.

For example:

-def get_dataset_name(dataset):
+def get_dataset_name(dataset: str) -> str:
-def get_dataset_path(dataset):
+def get_dataset_path(dataset: str) -> Path:
-def get_extra_benchmark_map():
+def get_extra_benchmark_map() -> dict[str, str]:

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."

Also applies to: 62-75, 95-104, 107-111, 114-155

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok

Kipok commented Feb 14, 2026

Collaborator Author

@gwarmstrong please have a look when you get a chance, slurm tests look good

@Jorjeous Jorjeous left a comment

Member

That's huge
@melllinia plz check that your datasets do not rely on the vllm / speech lm dataset group

@gwarmstrong gwarmstrong left a comment

Collaborator

handful of comments and questions

# ignore_if_registered avoids errors when the module is imported more than once.
register_external_repo(
RepoMetadata(name="my_benchmarks", path=Path(__file__).parents[2]),
ignore_if_registered=True,

Collaborator

when do you anticipate this happening? if multiple custom benchmarks are used? or if you are writing to the same name as an existing dataset?

Collaborator Author

if multiple benchmarks are used. I think main use-case will be to have a single internal repo with 10s of internal benchmarks. Then each of them has to have this register call and if a few are specified together, it will fail if we don't ignore registered

Collaborator

what do you mean by "specified together"? Is this targeted at clashing across namespaces (e.g., two different benchmarks register a "my_dataset" dataset?)? I'm wary of ignores, because they can make it easy to do something different than you intend.

Collaborator Author

I mean the case where the internal benchmarks repo has "benchmark1" and "benchmark2" in different folders under dataset. Both have to have this call in their init, but if you do --benchmarks=benchmark1,benchmark2 it will fail without that ignore argument.

Collaborator

In what case would you want to not use ignore_if_registered then?

Collaborator Author

tbh, I'd remove that parameter and make it the default behavior, but that would be a breaking change, so I didn't do it. Maybe we can add a check: if the name / path pair is the same, we don't error out, but if the name is the same and the path is different, we do. That's the only dangerous case here.

Collaborator

I'm still a little confused why the double trigger occurs, but I think what you've said in the last comment here seems like a good solution.

Comment thread docs/evaluation/custom-benchmarks.md Outdated
In `GENERATION_ARGS` this is referenced as:

```
++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml
```

Collaborator

Is there any value in allowing the benchmark_dataset.jsonl to account for prompt configs too? Or otherwise using the :: syntax in some way?

Collaborator Author

not sure I fully understand, can you clarify this please? Prompt is just a yaml file, how would we use :: syntax?

Collaborator

Sorry, what I mean is being able to specify it relative to the custom benchmark, like my_benchmarks/prompt/eval/word_count/default.yaml. It seems minor, but the /nemo_run/code mount causes a lot of friction for people, and I think allowing relative usage would make the interface more ergonomic for a lot of users.

Collaborator Author

got it, will update
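
One way the repo-relative form requested above might be resolved is sketched below. This is an assumption about the approach, not the code that landed: `resolve_prompt_config` and the `repo_roots` mapping are hypothetical, and the mount path is only reused from the example in the docs.

```python
from pathlib import Path


def resolve_prompt_config(prompt_config: str, repo_roots: dict[str, Path]) -> Path:
    """Map 'repo_name/rest/of/path.yaml' onto the registered root of repo_name.

    Absolute paths, and paths whose first component is not a known repo,
    are returned unchanged.
    """
    path = Path(prompt_config)
    if path.is_absolute() or not path.parts:
        return path
    repo_name, *rest = path.parts
    root = repo_roots.get(repo_name)
    if root is None:
        return path
    return root.joinpath(*rest)


# Hypothetical mapping from registered repo name to its location in the container.
roots = {"my_benchmarks": Path("/nemo_run/code/my_benchmarks")}
resolved = resolve_prompt_config("my_benchmarks/prompt/eval/word_count/default.yaml", roots)
```

The user then only writes the part of the path that lives in their own repo, and the mount prefix stays an implementation detail.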

Comment thread nemo_skills/dataset/utils.py Outdated
Relative paths in the map are resolved relative to the map file's directory.
"""
Get dataset module either in default folder or in extra datasets folder.
map_path = os.environ.get("NEMO_SKILLS_EXTRA_BENCHMARK_MAP")

Collaborator

can we pass this as an argument so there is a little more configurability? e.g., this would enable passing benchmark_map as an arg at the python pipeline level, which I think would be more convenient when writing python than exporting as env var

Collaborator Author

will update
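
A minimal sketch of the suggestion, assuming the map is a flat JSON object of `name -> path`: an explicit argument takes precedence and the environment variable is the fallback. `load_benchmark_map` is a hypothetical helper, not the function in `nemo_skills/dataset/utils.py`.

```python
import json
import os
from pathlib import Path
from typing import Optional


def load_benchmark_map(map_path: Optional[str] = None) -> dict[str, Path]:
    """Load a name -> path map, preferring the explicit argument over the env var."""
    map_path = map_path or os.environ.get("NEMO_SKILLS_EXTRA_BENCHMARK_MAP")
    if not map_path:
        return {}
    map_file = Path(map_path).resolve()
    with open(map_file, encoding="utf-8") as fin:
        raw = json.load(fin)
    # Relative entries are resolved against the directory holding the map file.
    # Joining with an absolute entry leaves that absolute path unchanged.
    return {name: (map_file.parent / rel).resolve() for name, rel in raw.items()}
```

A Python pipeline could then pass the path directly, while command-line users keep exporting the variable.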

Comment on lines +104 to +111
module_str, attr_str = eval_type.split("::", 1)
if Path(module_str).is_file():
    module = import_from_path(module_str)
else:
    module = importlib.import_module(module_str)
obj = getattr(module, attr_str)
is_class = inspect.isclass(obj) and issubclass(obj, BaseEvaluator)
return obj, is_class

Collaborator

Is it possible to consolidate the logic between this and nemo_skills.mcp.utils.locate?
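
A shared helper for the `file::attr` and `module::attr` forms might look like the sketch below. It uses only the standard library; `load_attr` is a hypothetical name, and nothing here is taken from the existing `locate` function, whose signature is not shown in this thread.

```python
import importlib
import importlib.util
from pathlib import Path
from typing import Any


def load_attr(spec: str) -> Any:
    """Resolve 'path/to/file.py::Name' or 'package.module::Name' to an object."""
    target, sep, attr = spec.partition("::")
    if not sep or not target or not attr:
        raise ValueError(f"Expected '<file or module>::<attribute>', got '{spec}'")
    if Path(target).is_file():
        # Load the file as a module without registering it in sys.modules.
        module_spec = importlib.util.spec_from_file_location(Path(target).stem, target)
        if module_spec is None or module_spec.loader is None:
            raise ImportError(f"Cannot load a module from '{target}'")
        module = importlib.util.module_from_spec(module_spec)
        module_spec.loader.exec_module(module)
    else:
        module = importlib.import_module(target)
    return getattr(module, attr)
```

Callers that need the evaluator-specific check (class versus function) could apply it to the returned object, which keeps the loading logic in one place.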

return super().fill_prompt(data_point, data)
prompt_format = prompt_format or self.cfg.prompt_format
if prompt_format == "openai":
    return super().fill_prompt(data_point, data, prompt_format)

Collaborator

can we use keyword arguments for this? I find it good practice for clarity when there's more than one or two arguments to the function

Collaborator Author

sure
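
The keyword-argument form requested above can be illustrated with a toy example. `BaseTask` and `MultiTurnTask` are hypothetical classes, and the real `fill_prompt` signature may differ from the one assumed here.

```python
from typing import Optional


class BaseTask:
    # Hypothetical base class; the real signature may differ.
    def fill_prompt(self, data_point: dict, data: list, prompt_format: Optional[str] = None) -> str:
        return f"{prompt_format or 'default'}:{data_point['question']}"


class MultiTurnTask(BaseTask):
    def __init__(self, default_format: str) -> None:
        self.default_format = default_format

    def fill_prompt(self, data_point: dict, data: list, prompt_format: Optional[str] = None) -> str:
        prompt_format = prompt_format or self.default_format
        # Passing prompt_format by keyword keeps the call readable and keeps it
        # correct if parameters are later added or reordered in the base class.
        return super().fill_prompt(data_point, data, prompt_format=prompt_format)


task = MultiTurnTask(default_format="ns")
first_turn = task.fill_prompt({"question": "2+2?"}, [])
second_turn = task.fill_prompt({"question": "2+2?"}, [], prompt_format="openai")
```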

Comment thread nemo_skills/inference/eval/bfcl.py
Comment thread nemo_skills/pipeline/eval.py
Comment thread nemo_skills/pipeline/start_server.py
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>

@gwarmstrong gwarmstrong left a comment

Collaborator

Just a couple residual comments. If we prefer to merge quicker, I think both honestly can be refined over time if they prove to be issues.

## Quick start

1. **Create a repo** with `benchmark_map.json`, a dataset `__init__.py`, and a `prepare.py`.
2. **Set the env var** `NEMO_SKILLS_EXTRA_BENCHMARK_MAP` to point at your `benchmark_map.json` (`name -> path` structure).

Collaborator

maybe we can have this demo prefer the pipeline argument now? Without having a concrete use case that I've iterated on myself a bit, I'm 50/50 on which is generally easier to pass

Collaborator Author

let me keep the current one, but we can update it later if that's not how most people use it

@Kipok Kipok merged commit 144c70b into main Feb 19, 2026
5 checks passed
@Kipok Kipok deleted the igitman/benchmarks-plugin branch February 19, 2026 23:26
talorabr pushed a commit to talorabr/Nemo-Skills that referenced this pull request Feb 22, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
sgunasekar added a commit that referenced this pull request Mar 24, 2026
commit f5c0c53
Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com>
Date:   Mon Mar 16 16:45:33 2026 +0400

    Add Global PIQA benchmark (#1299)

    Signed-off-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 86071c1
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Thu Mar 12 21:16:32 2026 -0700

    fixing sandbox use for livecodebench (#1304)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 4928ef5
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 12 15:28:41 2026 -0700

    nano v3 math tool calling slurm test (#1303)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d4e4450
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 12 14:17:03 2026 -0700

    fix: restore SIGINT handler in sandbox shell worker to prevent session resets (#1302)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 2b0a84d
Author: Mahan <25934206+MahanFathi@users.noreply.github.com>
Date:   Thu Mar 12 00:07:49 2026 -0400

    Add HotpotQA multi-hop QA benchmark (#1292)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: Meriem B. <113170426+ka00ri@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Prasoon Varshney <prasoon1995@gmail.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 75314b6
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Thu Mar 12 08:06:51 2026 +0400

    Gnalbandyan/ugph hle verified (#1293)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8bbf387
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Mar 11 15:48:21 2026 -0700

    build: fix gpu ci (#1301)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 005cd03
Author: Vahid Noroozi <VahidooX@users.noreply.github.com>
Date:   Tue Mar 10 12:52:27 2026 -0700

    Fix 1-hour client timeout in long-running generation jobs (#1297)

    Signed-off-by: vahidoox <vnoroozi@nvidia.com>

commit 596b888
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Tue Mar 10 19:11:26 2026 +0100

    skip output-rs*_submissions.jsonl files when summarizing critpt (#1300)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

commit fe92aec
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Tue Mar 10 00:00:57 2026 +0100

    use output-rs prefix when detecting sampling results (#1296)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit f6f7041
Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com>
Date:   Tue Mar 10 02:40:06 2026 +0400

    Add MMMLU benchmark (#1281)

    Signed-off-by: naymaraq <dkaramyan@nvidia.com>
    Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
    Co-authored-by: naymaraq <dkaramyan@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

    Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>