Skip to content

Evaluation on LiveCodeBench-Cpp#885

Merged
wasiahmad merged 46 commits into
mainfrom
livecodebench_cpp
Oct 13, 2025
Merged

Evaluation on LiveCodeBench-Cpp#885
wasiahmad merged 46 commits into
mainfrom
livecodebench_cpp

Conversation

@wasiahmad

@wasiahmad wasiahmad commented Oct 3, 2025

Copy link
Copy Markdown
Collaborator

In this PR, we are adding evaluation support for LiveCodeBench-CPP through nemo-skills. Primary changes are:

  • Dataset download through nemo-skills/dataset/livecodebench-cpp
  • Evaluation logic implemented at nemo_skills/evaluation/evaluator/livecodebench.py

Summary by CodeRabbit

  • New Features

    • Added C++ LiveCodeBench support: data-prep utility, default eval/generation configs, and two dataset splits for benchmarking.
    • Evaluation: sandbox readiness probe plus retry/backoff with configurable timeout and retry settings.
  • Bug Fixes

    • Reduced false-positive sandbox/mounts warnings by tightening the warning condition.
  • Documentation

    • Updated evaluation guidance and examples, including sandbox-related flags and Pypy3 guidance.
  • Chores

    • Updated copyright/license years.

Note

Adds LiveCodeBench-CPP dataset/eval support, introduces sandbox availability checks with retry/backoff across evaluators, and updates docs and warnings.

  • Benchmarks:
    • livecodebench-cpp: New dataset with defaults (__init__.py) and data prep (prepare.py) producing v5_2408_2501.jsonl and v6_2408_2505.jsonl.
  • Evaluators:
    • LiveCodeBench (evaluation/evaluator/livecodebench.py):
      • Add sandbox availability probe and retry/backoff wrapper for code execution; configurable timeout_buffer and num_retries.
      • Normalize release_version handling (prefix only for Python); force python interpreter for C++; gate Pypy3 on sandbox availability.
    • OJBench (evaluation/evaluator/ojbench.py):
      • Reuse sandbox context/utilities, add retries/backoff and timeout_buffer/num_retries; require sandbox availability before running.
  • Pipeline:
    • pipeline/utils/eval.py: Reduce false-positive sandbox mount warnings by tightening keep_mounts_for_sandbox condition.
  • Docs:
    • Update LiveCodeBench guidance: Pypy3 requires --with_sandbox --keep_mounts_for_sandbox; add livecodebench-cpp section with data prep/eval notes.
  • Chore:
    • Update copyright years.

Written by Cursor Bugbot for commit e4b3cb8. This will update automatically on new commits. Configure here.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@wasiahmad wasiahmad marked this pull request as draft October 3, 2025 08:45
@coderabbitai

coderabbitai Bot commented Oct 3, 2025

Copy link
Copy Markdown
Contributor

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a LiveCodeBench-CPP dataset and prepare script, adds sandbox availability probes and httpx-backed retry helpers to LiveCodeBench and OJBench evaluators, tightens a sandbox-mounts warning condition, updates docs to require sandbox for some runs and add C++ guidance, bumps license years, and removes two sandbox constants.

Changes

Cohort / File(s) Summary
Docs updates
docs/evaluation/code.md
Removed default --with_sandbox for standard Python runs; made PyPy3 guidance require --with_sandbox and --keep_mounts_for_sandbox; added LiveCodeBench-CPP subsection with dataset splits and C++ benchmark guidance; updated OJBench sample run to include sandbox flags; minor wording tweaks.
LiveCodeBench-CPP dataset (new)
nemo_skills/dataset/livecodebench-cpp/__init__.py, nemo_skills/dataset/livecodebench-cpp/prepare.py
Added module constants (DATASET_GROUP, METRICS_TYPE, EVAL_SPLIT, EVAL_ARGS, GENERATION_ARGS) and new prepare.py that loads HF splits, formats prompts (with/without starter code), cleans fields (adds task_id, question), normalizes indentation, and writes per-split JSONL for v5_2408_2501 and v6_2408_2505.
LiveCodeBench evaluator resilience
nemo_skills/evaluation/evaluator/livecodebench.py
Introduced httpx-backed retry helper execute_in_sandbox_with_retries, sandbox probe is_sandbox_available, new config fields timeout_buffer and num_retries, updated _install_packages_in_sandbox signature to accept eval config, refactored sandbox install/eval flows to use retries and timeout_buffer, adjusted release_version handling and orchestrator sandbox/local routing, and added logging around availability and retries.
OJBench evaluator changes
nemo_skills/evaluation/evaluator/ojbench.py
Migrated to a Sandbox class API and retry utilities, added timeout_buffer and num_retries to OJBenchConfig, changed install_packages to accept a Sandbox instance, refactored eval_ojbench_async to use execute_in_sandbox_with_retries, and added sandbox readiness checks in the synchronous wrapper (errors if sandbox unavailable).
Eval pipeline warning condition
nemo_skills/pipeline/utils/eval.py
Tightened mounts warning: now requires benchmark requires_sandbox, per-benchmark keep_mounts_for_sandbox true, and global keep-mounts-for-sandbox false before emitting the warning.
LiveCodeBench dataset module updates
nemo_skills/dataset/livecodebench/__init__.py, nemo_skills/dataset/livecodebench/prepare.py
Bumped copyright year to 2025; removed REQUIRES_SANDBOX and KEEP_MOUNTS_FOR_SANDBOX constants from __init__.py; no functional changes to prepare.py.
OJBench year updates
nemo_skills/dataset/ojbench/__init__.py, nemo_skills/dataset/ojbench/prepare.py
Updated license header years to 2025 only; no behavioral changes.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant Orchestrator as eval_livecodebench / eval_ojbench
    participant Sandbox as Sandbox API
    participant Local as Local Runner

    User->>Orchestrator: Start evaluation(config)
    Orchestrator->>Sandbox: is_sandbox_available(sandbox_config)
    alt Sandbox available
        Note over Orchestrator,Sandbox: Install + run via sandbox (retries)
        Orchestrator->>Sandbox: _install_packages_in_sandbox(sandbox, eval_config)
        Orchestrator->>Sandbox: execute_in_sandbox_with_retries(cmd, timeout=base+timeout_buffer, num_retries)
        Sandbox-->>Orchestrator: Results / logs
    else Sandbox unavailable
        Note over Orchestrator,Local: Fallback to local execution or error (per evaluator)
        Orchestrator->>Local: Run evaluation locally (adjust release_version handling)
        Local-->>Orchestrator: Results / logs
    end
    Orchestrator-->>User: Aggregated results
Loading
sequenceDiagram
    autonumber
    participant Prep as prepare.py (livecodebench-cpp)
    participant HF as load_dataset
    participant FS as Filesystem

    Prep->>HF: load_dataset(split)
    HF-->>Prep: Dataset
    Prep->>Prep: clean_data (cast fields, add task_id, build question)
    alt starter_code present
        Prep->>Prep: format prompt with starter code template (tabs preserved)
    else no starter_code
        Prep->>Prep: format prompt without starter code
    end
    Prep->>FS: Write JSONL per split (adds subset_for_metrics, release_version)
    FS-->>Prep: Done
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble bytes beneath the hedge,
C++ tasks lined up on the ledge,
Retries like carrots, one by one,
Sandboxes warm beneath the sun,
Prompts and tabs — a rabbit's run 🐇💻

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly captures the primary change of the pull request, which is adding evaluation support for the LiveCodeBench-CPP benchmark, and is directly related to the main content of the changeset. It is clear, concise, and specific enough for teammates to understand the focus when scanning the history. There is no extraneous information or ambiguity in the wording.
Docstring Coverage ✅ Passed Docstring coverage is 91.67% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch livecodebench_cpp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@wasiahmad wasiahmad changed the title Evaluation on LiveCodeBench-C++ Evaluation on LiveCodeBench-Cpp Oct 3, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@wasiahmad wasiahmad marked this pull request as ready for review October 3, 2025 21:22
@wasiahmad wasiahmad requested a review from Kipok October 3, 2025 21:23
@wasiahmad wasiahmad enabled auto-merge (squash) October 3, 2025 21:23
@wasiahmad

Copy link
Copy Markdown
Collaborator Author

@Kipok I have tested this PR, and it works fine. The only issue that needs to be addressed is the use of keep_mounts_for_sandbox=True is not required for LiveCodeBench-CPP but right now, if we don't set it to True, the evaluation fails with complain that custom_output_file is missing within the sandbox container.

Please suggest how to resolve it. Thanks!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f94fdbd and 0c5ca9f.

📒 Files selected for processing (4)
  • docs/evaluation/code.md (1 hunks)
  • nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench-cpp/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

266-266: Link text should be descriptive

(MD059, descriptive-link-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests

Comment thread docs/evaluation/code.md
Comment thread nemo_skills/dataset/livecodebench-cpp/prepare.py
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/evaluation/code.md (1)

265-267: Use descriptive link text for the dataset reference.

Replace the "here" anchor with descriptive text (e.g., “LiveCodeBench-CPP dataset on Hugging Face”) to satisfy MD059 and make the documentation clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c5ca9f and ffdad7d.

📒 Files selected for processing (1)
  • docs/evaluation/code.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

266-266: Link text should be descriptive

(MD059, descriptive-link-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/evaluation/code.md (1)

265-267: Use descriptive link text for the dataset URL.

Markdownlint (rule MD059) flags “here” as non-descriptive. Please change the anchor text to something like “LiveCodeBench-CPP dataset” so readers know what the link targets without context.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffdad7d and ee346da.

📒 Files selected for processing (1)
  • docs/evaluation/code.md (2 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

266-266: Link text should be descriptive

(MD059, descriptive-link-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee346da and 05b94d4.

📒 Files selected for processing (2)
  • nemo_skills/dataset/livecodebench/__init__.py (1 hunks)
  • nemo_skills/pipeline/utils/eval.py (1 hunks)
🔇 Additional comments (1)
nemo_skills/dataset/livecodebench/__init__.py (1)

20-20: LGTM: Sandbox requirement properly declared.

Marking the existing LiveCodeBench dataset as requiring sandbox evaluation is appropriate for code execution benchmarks and aligns with the PR's sandbox-aware evaluation framework.

Comment thread nemo_skills/pipeline/utils/eval.py
@wasiahmad

Copy link
Copy Markdown
Collaborator Author

@Kipok I have tested this PR, and it works fine. The only issue that needs to be addressed is the use of keep_mounts_for_sandbox=True is not required for LiveCodeBench-CPP but right now, if we don't set it to True, the evaluation fails with complain that custom_output_file is missing within the sandbox container.

Please suggest how to resolve it. Thanks!

As per my understanding, I think we need to have the following in dataset __init__.py script for LiveCodeBench.

REQUIRES_SANDBOX = True
KEEP_MOUNTS_FOR_SANDBOX = True

In that case, the issue gets resolved.

gwarmstrong and others added 4 commits October 9, 2025 09:45
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_skills/pipeline/utils/exp.py (1)

528-548: Propagate QoS to the sandbox executor

When --with-sandbox is used together with --qos, the sandbox leg still calls get_executor(...) without forwarding qos. That leg will therefore run with the cluster default QoS (or fail on clusters that require an explicit QoS), while the main/server legs honor the user-specified value. Please pass qos=qos into this get_executor call so all heterogenous groups share the same Slurm QoS.

nemo_skills/inference/model/parallel_thinking.py (1)

344-352: Wire the configured endpoint_type into downstream requests.

ParallelThinkingConfig.endpoint_type is never propagated, so flipping it from the default has no effect—all calls still rely on implicit inference. Please default kwargs["endpoint_type"] to self.cfg.endpoint_type before we fan out to _get_multiple_solutions and _generate_parallel_thinking_contraction.

-        result = {}  # Result dictionary
-        local_random = random.Random(kwargs.get("random_seed", 0))
+        kwargs = dict(kwargs)
+        kwargs.setdefault("endpoint_type", self.cfg.endpoint_type)
+
+        result = {}  # Result dictionary
+        local_random = random.Random(kwargs.get("random_seed", 0))
🧹 Nitpick comments (6)
nemo_skills/dataset/ruler/prepare.py (1)

32-34: Update the comment to reflect the new parameter.

The comment references "completions api" but the parameter has been changed to endpoint_type=text. Consider updating the comment for clarity.

Apply this diff to update the comment:

-    # ruler is adding prefix for assistant response, so it has to go through completions api
+    # ruler is adding prefix for assistant response, so it must use text endpoint type
     "++start_assistant_response_key=generation "
     "++inference.endpoint_type=text "
nemo_skills/inference/patch_litellm_logging.py (2)

61-73: Consider broader exception handling for robustness.

The current implementation only catches ModuleNotFoundError, which handles the case when litellm is not installed. However, if litellm's internal structure changes (e.g., the module path or attribute names change), ImportError or AttributeError could be raised instead, causing the patch to fail unexpectedly.

Apply this diff to handle additional exception types:

     try:
         import litellm.litellm_core_utils.logging_worker as logging_worker_module
 
         logging_worker_module.LoggingWorker = NoOpLoggingWorker
         logging_worker_module.GLOBAL_LOGGING_WORKER = NoOpLoggingWorker()
-    except ModuleNotFoundError:
-        # Ensure compatibility with different litellm versions
+    except (ModuleNotFoundError, ImportError, AttributeError):
+        # Ensure compatibility with different litellm versions or when litellm is unavailable
         pass

61-73: Document when and where this patch should be applied.

The function patch_litellm_logging_worker() is defined but there's no indication in this file about when or where it should be called (e.g., during application startup, before using litellm APIs, etc.). This could lead to confusion about proper usage.

Consider adding a module-level docstring or expanding the function docstring with usage guidance, for example:

"""
Patches the litellm LoggingWorker to disable its functionality.
This prevents any logging worker from keeping the client alive.

Usage:
    Call this function once during application initialization, before any
    litellm API calls are made:
    
    >>> from nemo_skills.inference.patch_litellm_logging import patch_litellm_logging_worker
    >>> patch_litellm_logging_worker()
"""
nemo_skills/training/openrlhf/math_reward.py (1)

61-63: Consider concurrent execution for async calls.

The current implementation calls asyncio.run() sequentially for each judge prompt, creating a new event loop for each item. This is inefficient compared to concurrent execution.

For better performance, consider gathering all async calls:

-    outputs = [
-        asyncio.run(llm.generate_async(prompt=jp, stop_phrases=prompt.stop_phrases)) for jp in judge_prompts
-    ]
+    async def generate_all():
+        return await asyncio.gather(*[
+            llm.generate_async(prompt=jp, stop_phrases=prompt.stop_phrases) for jp in judge_prompts
+        ])
+    outputs = asyncio.run(generate_all())

Note: Given the comment at line 60 indicating this module is no longer actively supported, this optimization may be deferred.

tests/test_code_execution.py (1)

113-113: Minor: Consider fixing the grammatical error in the timeout message.

The test correctly validates the new timeout message format. However, the message "Execution timed out after 1 seconds" has a grammatical error (should be "1 second"). While the test is correct, consider updating the implementation to use proper singular/plural forms.

Example fix for the implementation:

# In the timeout handling code
timeout_msg = f"Execution timed out after {timeout} second{'s' if timeout != 1 else ''}\n"
nemo_skills/inference/model/tool_call.py (1)

94-96: Consider enforcing zip strictness
asyncio.gather guarantees the same number of results you scheduled, so adding strict=True would automatically flag any future divergence in tool-call bookkeeping instead of silently truncating: for tool_call, tool_result in zip(tool_calls, tool_results, strict=True).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5db2f25 and 9c1060b.

📒 Files selected for processing (71)
  • .github/workflows/copyright-check.yml (1 hunks)
  • README.md (1 hunks)
  • docs/basics/inference.md (3 hunks)
  • docs/basics/prompt-format.md (1 hunks)
  • docs/evaluation/index.md (2 hunks)
  • docs/evaluation/long-context.md (1 hunks)
  • docs/evaluation/multilingual.md (3 hunks)
  • docs/index.md (1 hunks)
  • docs/pipelines/generation.md (8 hunks)
  • docs/releases/openmathinstruct2/dataset.md (6 hunks)
  • docs/releases/openmathreasoning/evaluation.md (2 hunks)
  • docs/tutorials/posts/gpt-oss-python.md (2 hunks)
  • nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (4 hunks)
  • nemo_skills/code_execution/sandbox.py (5 hunks)
  • nemo_skills/conversion/save_sharded_state.py (0 hunks)
  • nemo_skills/dataset/bfcl_v3/utils.py (1 hunks)
  • nemo_skills/dataset/flores200/__init__.py (1 hunks)
  • nemo_skills/dataset/flores200/prepare.py (1 hunks)
  • nemo_skills/dataset/livecodebench-cpp/prepare.py (1 hunks)
  • nemo_skills/dataset/livecodebench/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench/prepare.py (1 hunks)
  • nemo_skills/dataset/math/fix_ref_solns.py (1 hunks)
  • nemo_skills/dataset/ojbench/__init__.py (1 hunks)
  • nemo_skills/dataset/ojbench/prepare.py (1 hunks)
  • nemo_skills/dataset/ruler/prepare.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
  • nemo_skills/inference/chat_interface/chat_service.py (2 hunks)
  • nemo_skills/inference/eval/bfcl.py (7 hunks)
  • nemo_skills/inference/eval/bfcl_utils.py (1 hunks)
  • nemo_skills/inference/eval/scicode.py (1 hunks)
  • nemo_skills/inference/factory.py (1 hunks)
  • nemo_skills/inference/generate.py (15 hunks)
  • nemo_skills/inference/merge_chunks.py (1 hunks)
  • nemo_skills/inference/model/base.py (7 hunks)
  • nemo_skills/inference/model/code_execution.py (8 hunks)
  • nemo_skills/inference/model/context_retry.py (13 hunks)
  • nemo_skills/inference/model/openai.py (1 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (2 hunks)
  • nemo_skills/inference/model/tool_call.py (6 hunks)
  • nemo_skills/inference/model/utils.py (3 hunks)
  • nemo_skills/inference/model/vllm.py (1 hunks)
  • nemo_skills/inference/patch_litellm_logging.py (1 hunks)
  • nemo_skills/mcp/adapters.py (4 hunks)
  • nemo_skills/mcp/config.py (1 hunks)
  • nemo_skills/mcp/tool_manager.py (1 hunks)
  • nemo_skills/mcp/tool_providers.py (1 hunks)
  • nemo_skills/pipeline/eval.py (3 hunks)
  • nemo_skills/pipeline/generate.py (2 hunks)
  • nemo_skills/pipeline/megatron_lm/train.py (2 hunks)
  • nemo_skills/pipeline/nemo_rl/grpo.py (3 hunks)
  • nemo_skills/pipeline/nemo_rl/sft.py (3 hunks)
  • nemo_skills/pipeline/prepare_data.py (2 hunks)
  • nemo_skills/pipeline/run_cmd.py (2 hunks)
  • nemo_skills/pipeline/start_server.py (2 hunks)
  • nemo_skills/pipeline/utils/exp.py (5 hunks)
  • nemo_skills/pipeline/verl/ppo.py (2 hunks)
  • nemo_skills/prompt/config/multilingual/__init__.py (1 hunks)
  • nemo_skills/prompt/config/multilingual/segment-translation.yaml (1 hunks)
  • nemo_skills/prompt/few_shot_examples/lean4.py (1 hunks)
  • nemo_skills/prompt/utils.py (4 hunks)
  • nemo_skills/training/merge_packed_data.py (1 hunks)
  • nemo_skills/training/openrlhf/math_reward.py (2 hunks)
  • requirements/main.txt (2 hunks)
  • tests/slurm-tests/gpt_oss_python_aime25/check_results.py (2 hunks)
  • tests/slurm-tests/gpt_oss_python_aime25/run_test.py (1 hunks)
  • tests/test_code_execution.py (1 hunks)
  • tests/test_mcp_clients.py (3 hunks)
💤 Files with no reviewable changes (1)
  • nemo_skills/conversion/save_sharded_state.py
✅ Files skipped from review due to trivial changes (17)
  • README.md
  • nemo_skills/inference/eval/bfcl_utils.py
  • nemo_skills/inference/factory.py
  • docs/index.md
  • nemo_skills/mcp/tool_providers.py
  • nemo_skills/dataset/ojbench/init.py
  • nemo_skills/dataset/math/fix_ref_solns.py
  • nemo_skills/prompt/few_shot_examples/lean4.py
  • docs/basics/prompt-format.md
  • docs/evaluation/index.md
  • docs/evaluation/long-context.md
  • nemo_skills/prompt/config/multilingual/init.py
  • nemo_skills/dataset/ojbench/prepare.py
  • nemo_skills/dataset/bfcl_v3/utils.py
  • nemo_skills/dataset/livecodebench/prepare.py
  • nemo_skills/mcp/config.py
  • nemo_skills/inference/merge_chunks.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/livecodebench-cpp/prepare.py
🧰 Additional context used
🧬 Code graph analysis (20)
nemo_skills/inference/model/openai.py (2)
nemo_skills/inference/model/base.py (2)
  • _build_responses_request_params (209-210)
  • _build_chat_request_params (202-203)
nemo_skills/inference/model/vllm.py (2)
  • _build_responses_request_params (142-148)
  • _build_chat_request_params (102-140)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
  • TranslationMetrics (22-81)
tests/test_mcp_clients.py (1)
nemo_skills/mcp/tool_manager.py (3)
  • list_tools (51-52)
  • ToolManager (64-164)
  • list_all_tools (111-151)
nemo_skills/inference/model/code_execution.py (1)
nemo_skills/inference/model/base.py (1)
  • EndpointType (38-41)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/inference/model/base.py (1)
  • EndpointType (38-41)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
  • write_data_to_file (23-36)
  • main (39-47)
nemo_skills/inference/eval/bfcl.py (3)
nemo_skills/inference/generate.py (6)
  • generate (676-703)
  • generate (711-716)
  • GenerateSolutionsConfig (81-233)
  • GenerationTask (240-703)
  • InferenceConfig (59-77)
  • generate_with_semaphore (554-562)
nemo_skills/inference/model/base.py (1)
  • EndpointType (38-41)
nemo_skills/prompt/utils.py (1)
  • get_token_count (310-369)
tests/slurm-tests/gpt_oss_python_aime25/check_results.py (1)
tests/slurm-tests/utils.py (3)
  • assert_all (46-58)
  • load_json (18-21)
  • soft_assert (36-43)
nemo_skills/inference/model/vllm.py (2)
nemo_skills/inference/model/base.py (2)
  • _build_responses_request_params (209-210)
  • _build_chat_request_params (202-203)
nemo_skills/inference/model/openai.py (2)
  • _build_responses_request_params (155-160)
  • _build_chat_request_params (84-153)
nemo_skills/mcp/adapters.py (1)
nemo_skills/inference/model/base.py (1)
  • EndpointType (38-41)
nemo_skills/training/openrlhf/math_reward.py (1)
nemo_skills/inference/model/base.py (1)
  • generate_async (213-315)
nemo_skills/inference/chat_interface/chat_service.py (2)
nemo_skills/inference/model/base.py (1)
  • generate_async (213-315)
nemo_skills/inference/model/code_execution.py (1)
  • generate_async (241-300)
nemo_skills/inference/model/base.py (7)
nemo_skills/inference/patch_litellm_logging.py (1)
  • patch_litellm_logging_worker (61-73)
nemo_skills/utils.py (1)
  • get_logger_name (130-134)
nemo_skills/inference/model/context_retry.py (2)
  • ContextLimitRetryConfig (114-157)
  • with_context_retry (160-180)
nemo_skills/inference/model/utils.py (3)
  • ServerTokenizer (54-83)
  • WrapperAutoTokenizer (86-102)
  • trim_after_stop_phrases (27-33)
nemo_skills/inference/model/openai.py (3)
  • _build_responses_request_params (155-160)
  • _build_chat_request_params (84-153)
  • _build_completion_request_params (64-82)
nemo_skills/inference/model/vllm.py (3)
  • _build_responses_request_params (142-148)
  • _build_chat_request_params (102-140)
  • _build_completion_request_params (63-100)
nemo_skills/inference/model/tool_call.py (1)
  • generate_async (98-160)
nemo_skills/inference/generate.py (3)
nemo_skills/inference/model/base.py (2)
  • EndpointType (38-41)
  • generate_async (213-315)
nemo_skills/evaluation/evaluator/__init__.py (2)
  • get_evaluator_class (98-108)
  • supports_single_eval (111-117)
nemo_skills/prompt/utils.py (1)
  • get_token_count (310-369)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
  • BaseMetrics (23-434)
  • as_float (449-452)
nemo_skills/inference/model/tool_call.py (2)
nemo_skills/mcp/adapters.py (3)
  • format_tool_list_by_endpoint_type (51-76)
  • format_tool_response_by_endpoint_type (96-111)
  • get_tool_details_by_endpoint_type (114-124)
nemo_skills/inference/model/base.py (1)
  • EndpointType (38-41)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
  • write_data_to_file (23-38)
  • main (41-54)
nemo_skills/inference/eval/scicode.py (1)
nemo_skills/inference/generate.py (1)
  • process_single_datapoint (527-552)
nemo_skills/inference/model/context_retry.py (1)
nemo_skills/inference/model/utils.py (5)
  • is_context_window_exceeded_error (36-51)
  • encode (61-74)
  • encode (93-98)
  • ServerTokenizer (54-83)
  • WrapperAutoTokenizer (86-102)
nemo_skills/evaluation/evaluator/livecodebench.py (2)
nemo_skills/code_execution/sandbox.py (2)
  • Sandbox (45-334)
  • get_sandbox (419-422)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (36-92)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/multilingual.md

76-76: Link text should be descriptive

(MD059, descriptive-link-text)


149-149: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.13.3)
tests/test_mcp_clients.py

264-264: Do not call getattr with a constant attribute value. It is not any safer than normal property access.

Replace getattr with attribute access

(B009)

nemo_skills/inference/eval/bfcl.py

139-139: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


140-140: Use raise without specifying exception name

Remove exception name

(TRY201)


272-272: Use raise without specifying exception name

Remove exception name

(TRY201)

tests/slurm-tests/gpt_oss_python_aime25/check_results.py

24-24: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)


54-54: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/mcp/adapters.py

76-76: Avoid specifying long messages outside the exception class

(TRY003)


111-111: Avoid specifying long messages outside the exception class

(TRY003)


123-123: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/model/base.py

298-298: Avoid specifying long messages outside the exception class

(TRY003)


301-301: Consider moving this statement to an else block

(TRY300)


310-310: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


313-313: Use raise without specifying exception name

Remove exception name

(TRY201)


452-452: Unused method argument: kwargs

(ARG002)


509-509: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/generate.py

229-229: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/model/tool_call.py

73-73: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


95-95: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/inference/model/context_retry.py

192-192: Use raise without specifying exception name

Remove exception name

(TRY201)


207-207: Consider moving this statement to an else block

(TRY300)


208-208: Do not catch blind exception: Exception

(BLE001)


226-226: Use raise without specifying exception name

Remove exception name

(TRY201)


241-241: Consider moving this statement to an else block

(TRY300)


242-242: Do not catch blind exception: Exception

(BLE001)

nemo_skills/evaluation/evaluator/livecodebench.py

67-67: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


128-128: Consider moving this statement to an else block

(TRY300)


134-134: subprocess call: check for execution of untrusted input

(S603)


138-138: Consider moving this statement to an else block

(TRY300)


140-140: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


155-155: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


198-198: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


219-219: Avoid specifying long messages outside the exception class

(TRY003)


221-221: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (24)
requirements/main.txt (2)

29-29: LGTM on new dependencies.

The additions of langcodes and sacrebleu align well with the PR's goal of adding LiveCodeBench-CPP evaluation support and likely support translation/multilingual features and BLEU metrics mentioned in the AI summary.

Also applies to: 39-39


30-30: litellm 1.77.5 has no known security advisories. All reported vulnerabilities were patched by versions ≤1.61.15; 1.77.5 is unaffected.

nemo_skills/dataset/livecodebench/__init__.py (2)

1-1: LGTM! Copyright year updated.

The copyright year update from 2024 to 2025 is appropriate.


15-20: Removal of sandbox constants is safe
No references to REQUIRES_SANDBOX or KEEP_MOUNTS_FOR_SANDBOX exist in the Python LiveCodeBench evaluator; functionality remains unchanged.

nemo_skills/inference/model/context_retry.py (11)

30-110: LGTM! Enhanced error parsing covers multiple formats.

The expanded pattern matching now handles 5 different context window error formats, including the new message_tokens_overflow case (Pattern 5) when max_tokens goes negative. The progressive fallback through patterns is well-structured.


183-214: LGTM! Improved error handling with better error categorization.

The restructured soft-fail logic now:

  • Clearly distinguishes context window errors from other exceptions
  • Returns error_reason="context_window_exceeded" for better error tracking
  • Attempts retry with modified kwargs when strategy is configured

Note: The Ruff warnings (TRY201, TRY300, BLE001) are false positives. The raise error preserves the original exception, the nested try-catch is appropriate for retry logic, and catching Exception is necessary to handle any error from the retry attempt.


217-248: LGTM! Sync version mirrors async improvements.

The sync handler implements the same improved error handling as the async version. The code duplication is appropriate given the fundamental difference between sync and async execution.

Note: Same Ruff warnings apply here as in the async version - these are false positives.


251-288: LGTM! Added essential guard for prompt overflow.

The early exit at lines 271-274 correctly prevents attempting to reduce generation tokens when the prompt itself overflows the context limit. This saves a futile retry attempt.


291-317: LGTM! Safety margin prevents edge-case failures.

The safe_remaining_budget calculation (lines 299-306) adds a necessary buffer to account for special tokens, preventing failures at exact context boundaries. The logic correctly checks if messages already consume the available space.


320-362: LGTM! Direct calculation for overflow cases is more efficient.

When message_tokens_overflow is present (lines 328-334), the code directly calculates the required prompt reduction instead of relying on indirect token budget calculations. This is both clearer and more accurate.


365-382: LGTM! Consistent kwargs access pattern.

Accessing the prompt via kwargs["prompt"] (line 372) maintains consistency with how other parameters are retrieved throughout the module.


385-408: LGTM! Proper handling of tools parameter for chat templates.

Extracting tools from kwargs (line 394) and passing it to the message trimming functions ensures accurate tokenization when function calling is involved.


411-500: LGTM! Tools parameter enables accurate tokenization.

Both trimming functions now accept and pass the tools parameter to tokenizer.encode (lines 424, 468), ensuring correct token counts when chat templates include function calling definitions.


525-531: LGTM! Error categorization improves observability.

Adding the error_reason parameter (line 525) enables structured error tracking. Callers can now identify context window errors specifically via error="context_window_exceeded" while preserving detailed diagnostics in detailed_error.


121-121: Verify special token budget adequacy. Automated search didn’t locate any tokenizer or prompt-template definitions—manually confirm that reducing num_special_tokens_budget to 50 still covers the maximum special-token overhead for all supported model prompt templates and tokenizers.

nemo_skills/pipeline/generate.py (1)

107-107: LGTM! QoS parameter integration is consistent.

The qos parameter has been properly added and forwarded through the task creation pipeline. This change is part of a consistent pattern across multiple pipeline files (generate.py, megatron_lm/train.py, verl/ppo.py, nemo_rl/grpo.py, nemo_rl/sft.py, eval.py) enabling Slurm QoS specification for job submissions.

Also applies to: 314-314

tests/slurm-tests/gpt_oss_python_aime25/check_results.py (1)

34-76: LGTM! Timeout validation logic is well-structured.

The new timeout checking functionality correctly:

  • Parses and aggregates timeout counts from JSONL output files
  • Validates per-file and total timeout thresholds using soft assertions
  • Extracts seeds from filenames with proper error handling

The implementation integrates cleanly with the existing test workflow.

docs/basics/inference.md (1)

37-37: LGTM! Documentation correctly updated for async generation.

The documentation properly reflects the migration from synchronous to asynchronous generation, with consistent updates across all examples. The asyncio import is correctly added, and the usage pattern is clear.

Also applies to: 43-43, 73-73, 85-85

nemo_skills/prompt/config/multilingual/segment-translation.yaml (1)

1-3: LGTM! Translation prompt config is well-structured.

The prompt template is clear and concise, with appropriate placeholders for text and target language. The format aligns with the dataset usage patterns in FLORES200 and WMT24PP.

docs/releases/openmathreasoning/evaluation.md (1)

107-107: LGTM! Documentation correctly updated for endpoint_type parameter.

The change from ++use_completions_api=True to ++inference.endpoint_type=text is consistent across both examples and aligns with the broader endpoint-type migration in the codebase.

Also applies to: 130-130

nemo_skills/evaluation/metrics/map_metrics.py (1)

36-36: LGTM! Translation metrics correctly integrated.

The TranslationMetrics import and registration in METRICS_MAP follow the established pattern. The implementation properly extends BaseMetrics with BLEU-based evaluation.

Also applies to: 60-60

nemo_skills/inference/chat_interface/chat_service.py (1)

17-17: LGTM! Async migration correctly implemented.

The change from generate_sync to asyncio.run(llm.generate_async(...)) maintains the synchronous interface of stream_chat while using the new async generation pathway. This is appropriate since the function signature returns Iterator[str] (not AsyncIterator), indicating it's designed to be called from synchronous contexts.

Also applies to: 60-69

nemo_skills/inference/model/vllm.py (1)

142-148: LGTM! Implementation follows the established pattern.

The _build_responses_request_params method correctly delegates to _build_chat_request_params and renames the messages parameter to input. Note that this differs from the OpenAI implementation which also renames max_completion_tokensmax_output_tokens. If this reflects actual API differences between VLLM and OpenAI responses endpoints, the implementation is correct.

nemo_skills/inference/model/openai.py (1)

155-160: Ensure OpenAI Responses API mapping: _build_responses_request_params renames both messagesinput and max_completion_tokensmax_output_tokens, unlike VLLMModel. Verify this aligns with the official OpenAI Responses endpoint specification.

Comment on lines +39 to +47
def main(args):
datasets = {}
for lang in args.target_languages:
datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]

data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / f"{args.split}.jsonl"
write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Respect the requested split when loading WMT24PP.

We expose --split (defaulting to test), yet we unconditionally fetch the train split via ["train"]. That silently pipes training data into evaluation, invalidating metrics. Please honor the CLI flag.

-        datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}", split=args.split)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def main(args):
datasets = {}
for lang in args.target_languages:
datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / f"{args.split}.jsonl"
write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)
def main(args):
datasets = {}
for lang in args.target_languages:
datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}", split=args.split)
data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / f"{args.split}.jsonl"
write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)
🤖 Prompt for AI Agents
nemo_skills/dataset/wmt24pp/prepare.py around lines 39 to 47: the code currently
unconditionally selects the "train" split when loading "google/wmt24pp",
ignoring the CLI --split flag; change the dataset loading to use the requested
split (e.g., use load_dataset("google/wmt24pp", f"en-{lang}", split=args.split)
or index with [args.split]) so the script actually loads the specified split
(default test) before writing to the output file.

Comment on lines +201 to +209
evaluate_fn(
custom_output_file=jsonl_file,
release_version=f"release_{release_version}",
k_list=[1],
language=eval_config.language,
test_file=eval_config.test_file,
num_process_evaluate=eval_config.num_processes,
timeout=eval_config.timeout,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix the mismatched release_version in the local path.

_preprocess_and_validate_file already returns the exact release_version string stored in the samples (e.g., "release_cpp_202409"). The sandbox flow forwards it unchanged, but the local flow prepends "release_" again, resulting in "release_release_cpp_202409". That value doesn’t exist in the LiveCodeBench package, so local evaluations will now fail with a missing-release error. Please drop the extra prefix (or normalize both paths the same way).

Apply this diff:

-        evaluate_fn(
-            custom_output_file=jsonl_file,
-            release_version=f"release_{release_version}",
+        evaluate_fn(
+            custom_output_file=jsonl_file,
+            release_version=release_version,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
evaluate_fn(
custom_output_file=jsonl_file,
release_version=f"release_{release_version}",
k_list=[1],
language=eval_config.language,
test_file=eval_config.test_file,
num_process_evaluate=eval_config.num_processes,
timeout=eval_config.timeout,
)
evaluate_fn(
custom_output_file=jsonl_file,
release_version=release_version,
k_list=[1],
language=eval_config.language,
test_file=eval_config.test_file,
num_process_evaluate=eval_config.num_processes,
timeout=eval_config.timeout,
)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/livecodebench.py around lines 201 to 209,
the local call to evaluate_fn is prepending "release_" to release_version,
causing "release_release_..." mismatches with samples; remove the extra prefix
and pass the release_version string returned by _preprocess_and_validate_file
directly (or normalize both sandbox and local flows to the same format) so
evaluate_fn receives the exact release tag present in the LiveCodeBench package.

Comment on lines +65 to 76
tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type)

##
# TODO(sanyamk): Not all tool arguments might necessarily be in JSON format.
# Kept here to handle errors for now.
try:
tool_args = json.loads(tool_args)
except json.decoder.JSONDecodeError as e:
LOG.error(f"Tool arguments are not in JSON format: {tool_args}")
LOG.exception(e)
return {"error": "Tool argument parsing failed."}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Handle dict tool arguments for responses endpoint
For EndpointType.responses, get_tool_details_by_endpoint_type (see nemo_skills/mcp/adapters.py, lines 113-123) already returns tool_call["arguments"], which is a dict. Passing that dict into json.loads raises a TypeError, so tool execution now crashes before the try/except (you only catch JSONDecodeError). Please decode only when the payload is a string, or otherwise accept the existing dict. One possible fix:

-        try:
-            tool_args = json.loads(tool_args)
-        except json.decoder.JSONDecodeError as e:
+        try:
+            if isinstance(tool_args, str):
+                tool_args = json.loads(tool_args)
+        except json.decoder.JSONDecodeError as e:
             LOG.error(f"Tool arguments are not in JSON format: {tool_args}")
             LOG.exception(e)
             return {"error": "Tool argument parsing failed."}
+        except TypeError as e:
+            LOG.error("Tool arguments must be str or dict; got %s", type(tool_args))
+            LOG.exception(e)
+            return {"error": "Tool argument parsing failed."}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type)
##
# TODO(sanyamk): Not all tool arguments might necessarily be in JSON format.
# Kept here to handle errors for now.
try:
tool_args = json.loads(tool_args)
except json.decoder.JSONDecodeError as e:
LOG.error(f"Tool arguments are not in JSON format: {tool_args}")
LOG.exception(e)
return {"error": "Tool argument parsing failed."}
tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type)
##
# TODO(sanyamk): Not all tool arguments might necessarily be in JSON format.
# Kept here to handle errors for now.
try:
if isinstance(tool_args, str):
tool_args = json.loads(tool_args)
except json.decoder.JSONDecodeError as e:
LOG.error(f"Tool arguments are not in JSON format: {tool_args}")
LOG.exception(e)
return {"error": "Tool argument parsing failed."}
except TypeError as e:
LOG.error("Tool arguments must be str or dict; got %s", type(tool_args))
LOG.exception(e)
return {"error": "Tool argument parsing failed."}
🧰 Tools
🪛 Ruff (0.13.3)

73-73: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


74-74: Redundant exception object included in logging.exception call

(TRY401)

🤖 Prompt for AI Agents
In nemo_skills/inference/model/tool_call.py around lines 65 to 76, the code
always calls json.loads on tool_args which can be a dict for
EndpointType.responses causing a TypeError; update the logic to only call
json.loads when tool_args is a string (e.g., isinstance(tool_args, str)); if
tool_args is already a dict, skip decoding and use it as-is; keep the existing
JSONDecodeError handling for string parsing and optionally catch TypeError if
non-str/non-dict types may appear and return the same error response.


Currently, the async_loop function in generate.py sometimes gets stuck forever because some of the coroutines in the litellm logging worker are not finished.
Debugging why the logger is not finished is non-trivial, so we just patch it to disable its functionality.
The behavior is that it keeps slurm jobs from existing and we waste gpus.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix typo in documentation.

"existing" should be "exiting" (jobs should exit, not exist).

Apply this diff:

-The behavior is that it keeps slurm jobs from existing and we waste gpus.
+The behavior is that it keeps slurm jobs from exiting and we waste gpus.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
The behavior is that it keeps slurm jobs from existing and we waste gpus.
The behavior is that it keeps slurm jobs from exiting and we waste gpus.
🤖 Prompt for AI Agents
In nemo_skills/inference/patch_litellm_logging.py around line 21, the
documentation/comment contains a typo: "existing" should be "exiting"; update
the sentence so it reads that slurm jobs keep from exiting (not existing) to
reflect correct behavior and save GPUs.

Comment on lines +105 to +109
return {
"type": "function_call_output",
"call_id": tool_call["call_id"],
"output": json.dumps(result) if not isinstance(result, str) else result,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix responses tool-call output key.

For the Responses API path we return "call_id": tool_call["call_id"], but the function-call object coming from OpenAI (and from model_dump()) exposes this identifier as "id". Using "call_id" raises a KeyError as soon as we format any tool output under the responses endpoint. Please grab the "id" field instead.

-        return {
-            "type": "function_call_output",
-            "call_id": tool_call["call_id"],
-            "output": json.dumps(result) if not isinstance(result, str) else result,
-        }
+        return {
+            "type": "function_call_output",
+            "call_id": tool_call["id"],
+            "output": json.dumps(result) if not isinstance(result, str) else result,
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return {
"type": "function_call_output",
"call_id": tool_call["call_id"],
"output": json.dumps(result) if not isinstance(result, str) else result,
}
return {
"type": "function_call_output",
"call_id": tool_call["id"],
"output": json.dumps(result) if not isinstance(result, str) else result,
}
🤖 Prompt for AI Agents
In nemo_skills/mcp/adapters.py around lines 105 to 109, the responses endpoint
incorrectly accesses tool_call["call_id"] which raises a KeyError because the
OpenAI Responses/function-call object uses "id"; update the returned mapping to
use tool_call["id"] instead of tool_call["call_id"] (optionally add a fallback
like tool_call.get("id") or tool_call.get("call_id") if you want backward
compatibility).

"++inference.top_p=1.0 "
"++prompt_config=gpt-oss/math "
"++use_completions_api=true "
"++inference.endpoint_type=text "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

LGTM! Configuration flag updated to use new endpoint_type parameter.

The change from ++use_completions_api=true to ++inference.endpoint_type=text aligns with the broader refactoring to endpoint_type-based configuration mentioned in the PR.

Verify that the old use_completions_api flag has been replaced consistently across the codebase:


🏁 Script executed:

#!/bin/bash
# Description: Search for any remaining usage of the old completions API flag

# Search for old flag usage
rg -n 'use_completions_api' --type py

Length of output: 305


Remove deprecated use_completions_api in generate.py
The test change is correct but nemo_skills/inference/generate.py still defines and checks use_completions_api. Remove the parameter and its deprecation branch (lines 89, 228–229) and switch to inference.endpoint_type only.

🤖 Prompt for AI Agents
In nemo_skills/inference/generate.py around lines 89 and 228-229, remove the
deprecated use_completions_api parameter and the conditional branch that checks
it; delete any parameter definition, default, or docstring references to
use_completions_api and remove the deprecation warning branch. Replace any logic
that selects behavior based on use_completions_api with a single check using
inference.endpoint_type (e.g., compare strings like "completions" vs "chat" or
whatever existing endpoint_type values are used) so all code paths rely only on
inference.endpoint_type. Ensure imports, function signatures, and tests are
updated accordingly and no dead references to use_completions_api remain.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
nemo_skills/evaluation/evaluator/livecodebench.py (1)

252-279: Consider using LOG.exception() for preprocessing errors.

At line 263, use LOG.exception() instead of LOG.error() when logging preprocessing errors to include the full traceback.

Apply this diff:

         except ValueError as e:
-            LOG.error(f"Skipping {jsonl_file} due to pre-processing error: {e}")
+            LOG.exception(f"Skipping {jsonl_file} due to pre-processing error.")
             continue

Otherwise, LGTM: Local evaluation path mirrors sandbox path well.

The local evaluation flow is consistent with the sandbox path, with the same preprocessing, release_version handling (prepending "release_" for Python), and postprocessing logic.

docs/evaluation/code.md (1)

326-327: Verify if sandbox flags are automatically set for OJBench.

A past reviewer (Kipok) questioned whether these flags are necessary, suggesting they might be set automatically. Since LiveCodeBench has automatic sandbox routing logic, please confirm whether OJBench requires explicit sandbox flags or if they're inferred from the benchmark configuration.

🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/livecodebench.py (2)

183-202: Consider using LOG.exception() for better error diagnostics.

At line 201, when logging errors during package installation, use LOG.exception() instead of LOG.error() to automatically include the full traceback. This will help diagnose installation failures more quickly.

Apply this diff:

         except (subprocess.CalledProcessError, ImportError) as e:
-            LOG.error(f"Failed to install/import 'livecodebench'. Please install it manually. Error: {e}")
+            LOG.exception(f"Failed to install/import 'livecodebench'. Please install it manually.")
             raise

205-249: Consider using LOG.exception() for preprocessing errors.

Similar to the previous comment, at line 216, use LOG.exception() instead of LOG.error() when logging preprocessing errors to include the full traceback.

Apply this diff:

             except ValueError as e:
-                LOG.error(f"Skipping {jsonl_file} due to pre-processing error: {e}")
+                LOG.exception(f"Skipping {jsonl_file} due to pre-processing error.")
                 continue

Otherwise, LGTM: Sandbox evaluation path is well-structured.

The async evaluation flow with retry logic, proper timeout calculation (timeout * len(samples) + timeout_buffer), and consistent preprocessing/postprocessing is well-implemented.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1060b and d8db2c7.

📒 Files selected for processing (3)
  • docs/evaluation/code.md (3 hunks)
  • nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/livecodebench-cpp/init.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (2)
nemo_skills/code_execution/sandbox.py (1)
  • Sandbox (45-334)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (36-92)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

265-265: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)


112-112: Do not catch blind exception: Exception

(BLE001)


126-126: Avoid specifying long messages outside the exception class

(TRY003)


131-131: Avoid specifying long messages outside the exception class

(TRY003)


189-189: Consider moving this statement to an else block

(TRY300)


195-195: subprocess call: check for execution of untrusted input

(S603)


199-199: Consider moving this statement to an else block

(TRY300)


201-201: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


216-216: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


263-263: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


287-287: Avoid specifying long messages outside the exception class

(TRY003)


297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cursor Bugbot
🔇 Additional comments (5)
nemo_skills/evaluation/evaluator/livecodebench.py (3)

51-90: LGTM: Robust retry mechanism with proper backoff.

The async context manager and retry logic are well-implemented. The exponential backoff and clear error logging will help diagnose connectivity issues.


117-162: LGTM: Clean consolidation of preprocessing and postprocessing logic.

These helper functions eliminate code duplication between the sandbox and local evaluation paths. The validation logic (checking for empty samples and consistent release versions) is appropriate, and the in-place file modification approach works well with the evaluation flow.


282-302: LGTM: Well-designed routing logic with proper validation.

The main entry point effectively:

  • Validates interpreter constraints per language (Python allows python/pypy3, C++ requires python)
  • Forces the correct interpreter for C++ with a clear warning
  • Checks sandbox availability before routing
  • Enforces sandbox requirement for pypy3 (raises error if unavailable)
  • Gracefully falls back to local evaluation when sandbox is unavailable

The warning message at lines 289-293 addresses the past review comment about explaining why the Python interpreter is used for C++ evaluation.

docs/evaluation/code.md (2)

224-227: Verify if --with_sandbox flag is necessary.

The documentation states that --with_sandbox --keep_mounts_for_sandbox flags must be passed for Pypy3 evaluation. However, the code at lines 295-297 automatically checks sandbox availability and routes accordingly, raising an error if pypy3 is used without an available sandbox.

The --with_sandbox flag might be implicitly handled, though --keep_mounts_for_sandbox may still be necessary for file access (as mentioned in the PR objectives). Please confirm whether users truly need to pass these flags explicitly.


262-267: LGTM: C++ benchmark documentation is clear and correct.

The new section properly:

  • Links to both the dataset definition and the original Hugging Face source
  • Specifies the exact benchmark entrypoint (--benchmarks=livecodebench-cpp)
  • Names both available splits (v5_2408_2501 and v6_2408_2505)
  • References the Python instructions for additional details

This addresses the past review comment about providing C++-specific instructions.

Comment thread nemo_skills/evaluation/evaluator/livecodebench.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
nemo_skills/evaluation/evaluator/livecodebench.py (2)

93-115: Consider narrowing the exception handling scope.

The function correctly implements sandbox availability checking, but catching all Exception types on line 112 is overly broad. While this is acceptable for an availability check that should never fail the caller, you should consider whether certain exception types (e.g., KeyboardInterrupt, SystemExit) should propagate.

As per the Ruff hint (BLE001), you could narrow the exception handling:

         LOG.info("Sandbox connection successful. Sandbox is available.")
         return True
     except httpx.NetworkError as e:
         LOG.warning(f"Sandbox is unavailable due to a network error: {type(e).__name__} - {e}")
         return False
-    except Exception as e:
+    except (RuntimeError, ValueError, TypeError) as e:
         LOG.warning(f"An unexpected error occurred while checking sandbox availability: {e}")
         return False

Alternatively, if you want to keep the broad catch for robustness, the current implementation is acceptable since it logs the error details.


288-293: Consider adding an inline comment explaining the interpreter constraint.

While the warning message clearly explains that C++ evaluation requires the Python interpreter to run the harness, an inline comment would improve code readability for future maintainers.

     if eval_config.language == "cpp" and eval_config.interpreter != "python":
+        # The C++ evaluation harness is written in Python, so we must use the Python interpreter
+        # to execute it, regardless of the target language being C++
         LOG.warning(
             f"For C++ evaluation, the harness must be run with 'python'. "
             f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'."
         )
         eval_config.interpreter = "python"

Based on learnings (past review comment from Kipok suggesting to add a comment about why Python interpreter is used for C++ tests).

docs/evaluation/code.md (1)

334-335: Remove redundant sandbox flags from code example
In docs/evaluation/code.md (lines 334–335), the --with_sandbox and --keep_mounts_for_sandbox flags can be omitted—they’re already enabled by REQUIRES_SANDBOX = True and KEEP_MOUNTS_FOR_SANDBOX = True in nemo_skills/dataset/ojbench/__init__.py.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1060b and 8b48105.

📒 Files selected for processing (3)
  • docs/evaluation/code.md (3 hunks)
  • nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/livecodebench-cpp/init.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
  • Sandbox (45-334)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

273-273: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)


112-112: Do not catch blind exception: Exception

(BLE001)


297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (9)
docs/evaluation/code.md (2)

232-232: Documentation correctly reflects sandbox requirement for Pypy3.

The updated instructions clearly state that sandbox flags are required for Pypy3 evaluation, which aligns with the PR objectives and the resolution of the custom_output_file issue.


270-275: LGTM! Previous review concerns have been addressed.

The livecodebench-cpp documentation now correctly:

  • Fixes the typo ("repare" → "prepare")
  • Explicitly instructs users to target the C++ benchmark entrypoint (--benchmarks=livecodebench-cpp)
  • Specifies the correct split names (v5_2408_2501 or v6_2408_2505)

The implementation matches the suggestions from the previous review.

nemo_skills/evaluation/evaluator/livecodebench.py (7)

27-28: LGTM! Import required for network error handling.

The httpx import is necessary for catching httpx.NetworkError in the new retry logic.


47-48: LGTM! New configuration fields support retry logic.

The addition of timeout_buffer and num_retries fields appropriately extends the configuration to support the new retry and timeout management features.


62-91: LGTM! Retry logic implements exponential backoff correctly.

The implementation properly:

  • Catches httpx.NetworkError specifically for transient network issues
  • Uses exponential backoff (2^attempt) between retries
  • Logs warnings with attempt counts and wait times
  • Re-raises the last exception after all retries are exhausted

165-180: LGTM! Function signature updated to support retry configuration.

The changes correctly:

  • Accept eval_config to access interpreter and num_retries
  • Use the new _execute_in_sandbox_with_retries helper
  • Maintain backward compatibility with retry logic

219-220: LGTM! Consistent release_version handling across sandbox and local paths.

Both evaluation paths now apply the same logic: prepending "release_" to the release_version when language == "python". This resolves the previous review concern about mismatched handling between sandbox and local flows. The conditional prefix suggests that C++ releases use a different naming convention that doesn't require the "release_" prefix.

Also applies to: 266-267


236-242: LGTM! Timeout calculation includes buffer for improved reliability.

The timeout now accounts for per-sample timeout plus a configurable buffer (timeout_buffer), which helps prevent premature timeouts during batch evaluation of large sample sets.


295-297: LGTM! Sandbox availability check ensures pypy3 requirements are met.

The code correctly:

  • Checks sandbox availability before attempting evaluation
  • Raises a clear error when pypy3 is configured but sandbox is unavailable
  • Routes to the appropriate evaluation path based on availability

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
cursor[bot]

This comment was marked as outdated.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)

288-293: Consider adding a comment explaining the C++ interpreter override.

The code forces the interpreter to "python" for C++ evaluations, overriding any user configuration. While this is correct (the LiveCodeBench evaluation harness is Python-based, even for C++ submissions), a brief comment would help future maintainers understand why this override is necessary.

Apply this diff:

     if eval_config.language == "cpp" and eval_config.interpreter != "python":
         LOG.warning(
-            f"For C++ evaluation, the harness must be run with 'python'. "
+            # The LiveCodeBench evaluation harness for C++ is Python-based, regardless of the code being evaluated
+            f"For C++ evaluation, the harness must be run with 'python'. "
             f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'."
         )
         eval_config.interpreter = "python"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b48105 and f311171.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
  • Sandbox (45-334)
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)


112-112: Do not catch blind exception: Exception

(BLE001)


297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: unit-tests
🔇 Additional comments (7)
nemo_skills/evaluation/evaluator/livecodebench.py (7)

27-28: LGTM!

The httpx import is required for the new network retry logic introduced in _execute_in_sandbox_with_retries.


42-48: LGTM!

The new configuration fields (timeout_buffer, num_retries) are well-suited for controlling the retry behavior and timeout calculations introduced in this PR.


62-91: LGTM!

The retry mechanism with exponential backoff appropriately handles transient network failures when communicating with the sandbox service. The function correctly forwards arguments to sandbox.execute_code.


93-115: LGTM!

The sandbox availability check correctly probes connectivity by executing a simple shell command. The broad exception handling on line 112 is appropriate for an availability check, as it should gracefully handle any unexpected errors and report the sandbox as unavailable. The static analysis warnings (TRY300, BLE001) can be safely ignored in this context.


165-180: LGTM!

The function correctly integrates the retry logic and properly uses eval_config to determine the appropriate interpreter and number of retries.


295-302: LGTM!

The sandbox availability check and path selection logic correctly handles the requirement that pypy3 must run in a sandbox environment, while allowing fallback to local evaluation for standard Python when the sandbox is unavailable.


219-226: Verify sample release_version format

Confirm that the values returned by _preprocess_and_validate_file for Python samples do not already include the "release_" prefix; otherwise the code at lines 219–220 will produce "release_release_…" mismatches.

@wasiahmad

wasiahmad commented Oct 12, 2025

Copy link
Copy Markdown
Collaborator Author

@Kipok this PR is working. I have made the following key changes.

1. We check if Sandbox is available for LCB evaluation.

async def is_sandbox_available(sandbox_config: dict) -> bool:
    """
    Checks if the sandbox service is running and accessible by sending a test request.

    Args:
        sandbox_config: The configuration dictionary for the sandbox.

    Returns:
        True if a connection can be established, False otherwise.
    """
    LOG.info(f"Attempting to connect to sandbox with config: {sandbox_config}")
    try:
        async with sandbox_context(sandbox_config) as sandbox:
            await _execute_in_sandbox_with_retries(sandbox, 1, "true", language="shell", timeout=5)
        LOG.info("Sandbox connection successful. Sandbox is available.")
        return True
    except httpx.NetworkError as e:
        LOG.warning(f"Sandbox is unavailable due to a network error: {type(e).__name__} - {e}")
        return False
    except Exception as e:
        LOG.warning(f"An unexpected error occurred while checking sandbox availability: {e}")
        return False

2. Core logic when to use Sandbox

sandbox_is_ready = asyncio.run(is_sandbox_available(eval_config.sandbox))
if eval_config.interpreter == "pypy3" and not sandbox_is_ready:
    raise RuntimeError("The 'pypy3' interpreter requires a running sandbox, but the service was unreachable.")

if sandbox_is_ready:
    asyncio.run(eval_livecodebench_async(cfg, eval_config))
else:
    eval_livecodebench_without_sandbox(cfg, eval_config)

3. Execute code in Sandbox with a number of retries

async def _execute_in_sandbox_with_retries(
    sandbox: Sandbox,
    num_retries: int,
    *args,
    **kwargs,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """
    Executes a command in the sandbox, retrying on network errors.

    Uses exponential backoff for delays between retries.
    """
    last_exception = None
    for attempt in range(num_retries):
        try:
            return await sandbox.execute_code(*args, **kwargs)
        except httpx.NetworkError as e:
            last_exception = e
            if attempt == num_retries - 1:
                break

            wait_time = 2**attempt
            LOG.warning(
                f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. "
                f"Retrying in {wait_time} second(s)..."
            )
            await asyncio.sleep(wait_time)

    LOG.error(f"All {num_retries} sandbox connection attempts failed.")
    raise last_exception

Currently, the default num_retries is set to 3.

Feel free to take a look at the entire script - nemo_skills/evaluation/evaluator/livecodebench.py.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@wasiahmad wasiahmad requested a review from Kipok October 13, 2025 04:01

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/livecodebench.py (2)

112-114: Narrow the exception handling to avoid masking unexpected errors.

Catching a generic Exception after the specific httpx.NetworkError can suppress unexpected bugs (e.g., AttributeError, TypeError). Consider either:

  • Catching specific exceptions expected during sandbox initialization (e.g., ValueError, ConnectionError)
  • Or, logging the full traceback and re-raising if the exception is not one you expect

Example approach:

     except httpx.NetworkError as e:
         LOG.warning(f"Sandbox is unavailable due to a network error: {type(e).__name__} - {e}")
         return False
-    except Exception as e:
-        LOG.warning(f"An unexpected error occurred while checking sandbox availability: {e}")
+    except (ValueError, ConnectionError) as e:
+        LOG.warning(f"Sandbox configuration or connection error: {type(e).__name__} - {e}")
         return False
+    except Exception as e:
+        LOG.error(f"Unexpected error during sandbox availability check: {e}", exc_info=True)
+        raise

288-294: Add a comment explaining why C++ evaluation requires the Python interpreter.

The code forces interpreter = "python" for C++ evaluation, but the rationale isn't immediately clear from the code alone. Adding a brief comment would help future maintainers understand that the evaluation harness itself is written in Python, even though it evaluates C++ code.

Based on past review comments

Apply this diff:

     if eval_config.language == "cpp" and eval_config.interpreter != "python":
+        # The LiveCodeBench evaluation harness is implemented in Python,
+        # so C++ code evaluation must use the Python interpreter to run the harness.
         LOG.warning(
             f"For C++ evaluation, the harness must be run with 'python'. "
             f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'."
         )
         eval_config.interpreter = "python"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02ad4b3 and aea6c79.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
  • Sandbox (45-334)
🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)


112-112: Do not catch blind exception: Exception

(BLE001)


297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: Cursor Bugbot
🔇 Additional comments (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)

219-220: No changes needed to release_version prefixing. The evaluator’s logic aligns with the dataset prepare scripts: Python samples are loaded via parse_data(release_{version}) and C++ via raw splits (v5_2408_2501/v6_2408_2505), so conditional prefixing is correct.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)

288-293: Consider adding an inline comment.

The logic correctly forces the Python interpreter for C++ evaluation, but an inline comment explaining the rationale would improve maintainability.

For example:

     if eval_config.language == "cpp" and eval_config.interpreter != "python":
+        # C++ evaluation harness in livecodebench must run with Python interpreter
+        # regardless of the configured interpreter for test execution
         LOG.warning(
             f"For C++ evaluation, the harness must be run with 'python'. "
             f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'."
         )
         eval_config.interpreter = "python"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02ad4b3 and 54a2aae.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
  • Sandbox (45-334)
🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)


112-112: Do not catch blind exception: Exception

(BLE001)


297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: Cursor Bugbot
🔇 Additional comments (8)
nemo_skills/evaluation/evaluator/livecodebench.py (8)

27-28: LGTM!

The httpx import is necessary for handling NetworkError exceptions in the new retry logic.


42-42: LGTM!

The updated comment clearly indicates the valid language options.


47-48: LGTM!

The new configuration fields timeout_buffer and num_retries provide necessary control over sandbox operations.


93-115: LGTM!

The sandbox availability check correctly probes connectivity using a lightweight test command with retry logic.


165-180: LGTM!

The updated signature and use of eval_config improve consistency and enable proper retry behavior for package installation.


205-249: LGTM!

The sandbox evaluation flow correctly:

  • Uses the updated installation helper with retry support
  • Applies consistent release_version prefixing for Python (matching the local path)
  • Incorporates timeout buffer and retry logic for resilient execution

266-267: LGTM!

The conditional release_version prefixing now matches the sandbox path, resolving the previous mismatch issue.


295-302: LGTM!

The orchestration logic correctly:

  • Checks sandbox availability before proceeding
  • Enforces sandbox requirement for PyPy3
  • Routes to the appropriate evaluation path based on availability

Comment on lines +62 to +91
async def _execute_in_sandbox_with_retries(
sandbox: Sandbox,
num_retries: int,
*args,
**kwargs,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
"""
Executes a command in the sandbox, retrying on network errors.

Uses exponential backoff for delays between retries.
"""
last_exception = None
for attempt in range(num_retries):
try:
return await sandbox.execute_code(*args, **kwargs)
except httpx.NetworkError as e:
last_exception = e
if attempt == num_retries - 1:
break

wait_time = 2**attempt
LOG.warning(
f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. "
f"Retrying in {wait_time} second(s)..."
)
await asyncio.sleep(wait_time)

LOG.error(f"All {num_retries} sandbox connection attempts failed.")
raise last_exception

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix incorrect return type annotation.

The return type annotation specifies Tuple[Dict[str, Any], Dict[str, Any]], but sandbox.execute_code returns Tuple[Dict, str] where the second element is a session ID string, not a dictionary.

Apply this diff to correct the type annotation:

 async def _execute_in_sandbox_with_retries(
     sandbox: Sandbox,
     num_retries: int,
     *args,
     **kwargs,
-) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+) -> Tuple[Dict[str, Any], str]:
     """
     Executes a command in the sandbox, retrying on network errors.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def _execute_in_sandbox_with_retries(
sandbox: Sandbox,
num_retries: int,
*args,
**kwargs,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
"""
Executes a command in the sandbox, retrying on network errors.
Uses exponential backoff for delays between retries.
"""
last_exception = None
for attempt in range(num_retries):
try:
return await sandbox.execute_code(*args, **kwargs)
except httpx.NetworkError as e:
last_exception = e
if attempt == num_retries - 1:
break
wait_time = 2**attempt
LOG.warning(
f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. "
f"Retrying in {wait_time} second(s)..."
)
await asyncio.sleep(wait_time)
LOG.error(f"All {num_retries} sandbox connection attempts failed.")
raise last_exception
async def _execute_in_sandbox_with_retries(
sandbox: Sandbox,
num_retries: int,
*args,
**kwargs,
) -> Tuple[Dict[str, Any], str]:
"""
Executes a command in the sandbox, retrying on network errors.
Uses exponential backoff for delays between retries.
"""
last_exception = None
for attempt in range(num_retries):
try:
return await sandbox.execute_code(*args, **kwargs)
except httpx.NetworkError as e:
last_exception = e
if attempt == num_retries - 1:
break
wait_time = 2**attempt
LOG.warning(
f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. "
f"Retrying in {wait_time} second(s)..."
)
await asyncio.sleep(wait_time)
LOG.error(f"All {num_retries} sandbox connection attempts failed.")
raise last_exception
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/livecodebench.py around lines 62 to 91 the
function _execute_in_sandbox_with_retries has an incorrect return type
annotation Tuple[Dict[str, Any], Dict[str, Any]]; update it to reflect the
actual sandbox.execute_code return type (Tuple[Dict[str, Any], str]) by changing
the annotation to return Tuple[Dict[str, Any], str] and adjust any imports/types
if necessary so the second element is typed as a string rather than a dict.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)

62-90: Fix the return type annotation.

The function annotation specifies Tuple[Dict[str, Any], Dict[str, Any]], but sandbox.execute_code returns Tuple[Dict[str, Any], str] where the second element is a session ID string (see nemo_skills/code_execution/sandbox.py:151). This was flagged in a prior review but remains unaddressed.

Apply this diff:

 async def execute_in_sandbox_with_retries(
     sandbox: Sandbox,
     num_retries: int,
     *args,
     **kwargs,
-) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+) -> Tuple[Dict[str, Any], str]:
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/livecodebench.py (1)

288-293: Consider adding an inline comment for clarity.

The warning log at lines 289-292 explains why the interpreter is forced to python for C++ evaluation. An inline comment before line 288 would help future maintainers understand this requirement without checking logs.

Apply this diff:

+    # C++ evaluation requires the LiveCodeBench harness to run with Python,
+    # regardless of the configured interpreter setting
     if eval_config.language == "cpp" and eval_config.interpreter != "python":
nemo_skills/evaluation/evaluator/ojbench.py (1)

26-30: Consider extracting shared utilities to a common module.

The sandbox utilities (execute_in_sandbox_with_retries, is_sandbox_available, sandbox_context) are imported from livecodebench.py. While this works, it creates a dependency where ojbench imports from livecodebench, which couples unrelated evaluators. Consider extracting these utilities to a shared module like nemo_skills/evaluation/evaluator/sandbox_utils.py for better modularity.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54a2aae and e4b3cb8.

📒 Files selected for processing (3)
  • docs/evaluation/code.md (2 hunks)
  • nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)
  • nemo_skills/evaluation/evaluator/ojbench.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
  • Sandbox (45-334)
nemo_skills/evaluation/evaluator/ojbench.py (3)
nemo_skills/code_execution/sandbox.py (1)
  • Sandbox (45-334)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (36-92)
nemo_skills/evaluation/evaluator/livecodebench.py (3)
  • execute_in_sandbox_with_retries (62-90)
  • is_sandbox_available (93-114)
  • sandbox_context (52-59)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

273-273: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)


112-112: Do not catch blind exception: Exception

(BLE001)


297-297: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/ojbench.py

146-146: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (12)
docs/evaluation/code.md (2)

232-232: LGTM!

The Pypy3 evaluation instructions clearly state the sandbox requirement and provide the correct flags. The guidance aligns with the evaluator implementation that enforces sandbox availability for pypy3.


270-275: LGTM!

The livecodebench-cpp section is complete and accurate. It correctly references the cpp benchmark entrypoint and provides clear data preparation and evaluation instructions. The typo has been fixed, and the guidance mirrors the Python livecodebench workflow appropriately.

nemo_skills/evaluation/evaluator/livecodebench.py (6)

47-48: LGTM!

The new configuration fields timeout_buffer and num_retries provide sensible defaults (60 seconds and 3 retries) for managing sandbox execution reliability. These parameters are appropriately used throughout the evaluator to handle network instability.


93-114: LGTM!

The sandbox availability check correctly uses generated_code="true" and handles both network errors and unexpected exceptions gracefully. The broad exception handler at line 112 is appropriate for an availability probe, where any failure should be treated as "unavailable" rather than propagated.


165-180: LGTM!

The package installation function correctly selects the appropriate pip command and git URL based on the interpreter, uses retry logic for network resilience, and provides clear logging. The error handling is appropriate.


205-249: LGTM!

The sandboxed evaluation flow correctly installs packages, preprocesses files, applies the release_ prefix for Python language, and uses retry logic with appropriate timeouts. The timeout calculation (timeout * len(samples) + timeout_buffer) provides adequate buffer for evaluation completion.


252-279: LGTM!

The local evaluation flow now correctly handles release_version consistently with the sandboxed path. Both flows add the release_ prefix when language == "python", ensuring the evaluate function receives the correct release tag format.


282-302: LGTM!

The orchestration logic correctly validates interpreter settings, enforces sandbox requirements for pypy3, checks sandbox availability, and routes evaluation to the appropriate implementation. The C++ interpreter forcing is well-explained in the log warning.

nemo_skills/evaluation/evaluator/ojbench.py (4)

40-41: LGTM!

The timeout and retry configuration fields match the pattern established in livecodebench.py, providing consistent reliability handling across evaluators.


44-67: LGTM!

The package installation function correctly uses retry logic for both cloning and installation, provides clear error logging, and follows the same pattern as the livecodebench evaluator for consistency.


70-137: LGTM!

The async evaluation flow correctly manages sandbox lifecycle, installs packages with retry logic, and executes evaluation with appropriate timeout buffering. The error handling and result processing logic is sound.


139-146: LGTM!

The orchestration correctly checks sandbox availability and enforces the sandbox requirement for OJBench evaluation. Unlike LiveCodeBench, OJBench has no local fallback, which is appropriate given that the evaluation harness requires sandbox isolation for accurate results.

@wasiahmad wasiahmad merged commit ad8e8f8 into main Oct 13, 2025
7 of 8 checks passed
@wasiahmad wasiahmad deleted the livecodebench_cpp branch October 13, 2025 23:07
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@coderabbitai coderabbitai Bot mentioned this pull request Feb 5, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dgitman <dgitman@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants