Evaluation on LiveCodeBench-Cpp by wasiahmad · Pull Request #885 · NVIDIA-NeMo/Skills

wasiahmad · 2025-10-03T08:45:44Z

In this PR, we are adding evaluation support for LiveCodeBench-CPP through nemo-skills. Primary changes are:

Dataset download through nemo-skills/dataset/livecodebench-cpp
Evaluation logic implemented at nemo_skills/evaluation/evaluator/livecodebench.py

Summary by CodeRabbit

New Features
- Added C++ LiveCodeBench support: data-prep utility, default eval/generation configs, and two dataset splits for benchmarking.
- Evaluation: sandbox readiness probe plus retry/backoff with configurable timeout and retry settings.
Bug Fixes
- Reduced false-positive sandbox/mounts warnings by tightening the warning condition.
Documentation
- Updated evaluation guidance and examples, including sandbox-related flags and Pypy3 guidance.
Chores
- Updated copyright/license years.

Note

Adds LiveCodeBench-CPP dataset/eval support, introduces sandbox availability checks with retry/backoff across evaluators, and updates docs and warnings.

Benchmarks:
- livecodebench-cpp: New dataset with defaults (__init__.py) and data prep (prepare.py) producing v5_2408_2501.jsonl and v6_2408_2505.jsonl.
Evaluators:
- LiveCodeBench (evaluation/evaluator/livecodebench.py):
  - Add sandbox availability probe and retry/backoff wrapper for code execution; configurable timeout_buffer and num_retries.
  - Normalize release_version handling (prefix only for Python); force python interpreter for C++; gate Pypy3 on sandbox availability.
- OJBench (evaluation/evaluator/ojbench.py):
  - Reuse sandbox context/utilities, add retries/backoff and timeout_buffer/num_retries; require sandbox availability before running.
Pipeline:
- pipeline/utils/eval.py: Reduce false-positive sandbox mount warnings by tightening keep_mounts_for_sandbox condition.
Docs:
- Update LiveCodeBench guidance: Pypy3 requires --with_sandbox --keep_mounts_for_sandbox; add livecodebench-cpp section with data prep/eval notes.
Chore:
- Update copyright years.

^{Written by Cursor Bugbot for commit e4b3cb8. This will update automatically on new commits. Configure here.}

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai · 2025-10-03T08:45:51Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a LiveCodeBench-CPP dataset and prepare script, adds sandbox availability probes and httpx-backed retry helpers to LiveCodeBench and OJBench evaluators, tightens a sandbox-mounts warning condition, updates docs to require sandbox for some runs and add C++ guidance, bumps license years, and removes two sandbox constants.

Changes

Cohort / File(s)	Summary
Docs updates `docs/evaluation/code.md`	Removed default `--with_sandbox` for standard Python runs; made PyPy3 guidance require `--with_sandbox` and `--keep_mounts_for_sandbox`; added LiveCodeBench-CPP subsection with dataset splits and C++ benchmark guidance; updated OJBench sample run to include sandbox flags; minor wording tweaks.
LiveCodeBench-CPP dataset (new) `nemo_skills/dataset/livecodebench-cpp/__init__.py`, `nemo_skills/dataset/livecodebench-cpp/prepare.py`	Added module constants (`DATASET_GROUP`, `METRICS_TYPE`, `EVAL_SPLIT`, `EVAL_ARGS`, `GENERATION_ARGS`) and new `prepare.py` that loads HF splits, formats prompts (with/without starter code), cleans fields (adds `task_id`, `question`), normalizes indentation, and writes per-split JSONL for `v5_2408_2501` and `v6_2408_2505`.
LiveCodeBench evaluator resilience `nemo_skills/evaluation/evaluator/livecodebench.py`	Introduced `httpx`-backed retry helper `execute_in_sandbox_with_retries`, sandbox probe `is_sandbox_available`, new config fields `timeout_buffer` and `num_retries`, updated `_install_packages_in_sandbox` signature to accept eval config, refactored sandbox install/eval flows to use retries and `timeout_buffer`, adjusted `release_version` handling and orchestrator sandbox/local routing, and added logging around availability and retries.
OJBench evaluator changes `nemo_skills/evaluation/evaluator/ojbench.py`	Migrated to a Sandbox class API and retry utilities, added `timeout_buffer` and `num_retries` to `OJBenchConfig`, changed `install_packages` to accept a `Sandbox` instance, refactored `eval_ojbench_async` to use `execute_in_sandbox_with_retries`, and added sandbox readiness checks in the synchronous wrapper (errors if sandbox unavailable).
Eval pipeline warning condition `nemo_skills/pipeline/utils/eval.py`	Tightened mounts warning: now requires benchmark `requires_sandbox`, per-benchmark `keep_mounts_for_sandbox` true, and global `keep-mounts-for-sandbox` false before emitting the warning.
LiveCodeBench dataset module updates `nemo_skills/dataset/livecodebench/__init__.py`, `nemo_skills/dataset/livecodebench/prepare.py`	Bumped copyright year to 2025; removed `REQUIRES_SANDBOX` and `KEEP_MOUNTS_FOR_SANDBOX` constants from `__init__.py`; no functional changes to `prepare.py`.
OJBench year updates `nemo_skills/dataset/ojbench/__init__.py`, `nemo_skills/dataset/ojbench/prepare.py`	Updated license header years to 2025 only; no behavioral changes.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant Orchestrator as eval_livecodebench / eval_ojbench
    participant Sandbox as Sandbox API
    participant Local as Local Runner

    User->>Orchestrator: Start evaluation(config)
    Orchestrator->>Sandbox: is_sandbox_available(sandbox_config)
    alt Sandbox available
        Note over Orchestrator,Sandbox: Install + run via sandbox (retries)
        Orchestrator->>Sandbox: _install_packages_in_sandbox(sandbox, eval_config)
        Orchestrator->>Sandbox: execute_in_sandbox_with_retries(cmd, timeout=base+timeout_buffer, num_retries)
        Sandbox-->>Orchestrator: Results / logs
    else Sandbox unavailable
        Note over Orchestrator,Local: Fallback to local execution or error (per evaluator)
        Orchestrator->>Local: Run evaluation locally (adjust release_version handling)
        Local-->>Orchestrator: Results / logs
    end
    Orchestrator-->>User: Aggregated results

sequenceDiagram
    autonumber
    participant Prep as prepare.py (livecodebench-cpp)
    participant HF as load_dataset
    participant FS as Filesystem

    Prep->>HF: load_dataset(split)
    HF-->>Prep: Dataset
    Prep->>Prep: clean_data (cast fields, add task_id, build question)
    alt starter_code present
        Prep->>Prep: format prompt with starter code template (tabs preserved)
    else no starter_code
        Prep->>Prep: format prompt without starter code
    end
    Prep->>FS: Write JSONL per split (adds subset_for_metrics, release_version)
    FS-->>Prep: Done

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble bytes beneath the hedge,
C++ tasks lined up on the ledge,
Retries like carrots, one by one,
Sandboxes warm beneath the sun,
Prompts and tabs — a rabbit's run 🐇💻

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly captures the primary change of the pull request, which is adding evaluation support for the LiveCodeBench-CPP benchmark, and is directly related to the main content of the changeset. It is clear, concise, and specific enough for teammates to understand the focus when scanning the history. There is no extraneous information or ambiguity in the wording.
Docstring Coverage	✅ Passed	Docstring coverage is 91.67% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch livecodebench_cpp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

wasiahmad · 2025-10-03T21:26:09Z

@Kipok I have tested this PR, and it works fine. The only issue that needs to be addressed is the use of keep_mounts_for_sandbox=True is not required for LiveCodeBench-CPP but right now, if we don't set it to True, the evaluation fails with complain that custom_output_file is missing within the sandbox container.

Please suggest how to resolve it. Thanks!

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f94fdbd and 0c5ca9f.

📒 Files selected for processing (4)

docs/evaluation/code.md (1 hunks)
nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks)
nemo_skills/dataset/livecodebench-cpp/prepare.py (1 hunks)
nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

266-266: Link text should be descriptive

(MD059, descriptive-link-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: pre-commit
GitHub Check: unit-tests

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

docs/evaluation/code.md (1)

265-267: Use descriptive link text for the dataset reference.

Replace the "here" anchor with descriptive text (e.g., “LiveCodeBench-CPP dataset on Hugging Face”) to satisfy MD059 and make the documentation clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c5ca9f and ffdad7d.

📒 Files selected for processing (1)

docs/evaluation/code.md (1 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

266-266: Link text should be descriptive

(MD059, descriptive-link-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

docs/evaluation/code.md (1)

265-267: Use descriptive link text for the dataset URL.

Markdownlint (rule MD059) flags “here” as non-descriptive. Please change the anchor text to something like “LiveCodeBench-CPP dataset” so readers know what the link targets without context.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffdad7d and ee346da.

📒 Files selected for processing (1)

docs/evaluation/code.md (2 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

266-266: Link text should be descriptive

(MD059, descriptive-link-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee346da and 05b94d4.

📒 Files selected for processing (2)

nemo_skills/dataset/livecodebench/__init__.py (1 hunks)
nemo_skills/pipeline/utils/eval.py (1 hunks)

🔇 Additional comments (1)

nemo_skills/dataset/livecodebench/__init__.py (1)

20-20: LGTM: Sandbox requirement properly declared.

Marking the existing LiveCodeBench dataset as requiring sandbox evaluation is appropriate for code execution benchmarks and aligns with the PR's sandbox-aware evaluation framework.

wasiahmad · 2025-10-04T17:19:20Z

@Kipok I have tested this PR, and it works fine. The only issue that needs to be addressed is the use of keep_mounts_for_sandbox=True is not required for LiveCodeBench-CPP but right now, if we don't set it to True, the evaluation fails with complain that custom_output_file is missing within the sandbox container.

Please suggest how to resolve it. Thanks!

As per my understanding, I think we need to have the following in dataset __init__.py script for LiveCodeBench.

REQUIRES_SANDBOX = True
KEEP_MOUNTS_FOR_SANDBOX = True

In that case, the issue gets resolved.

Signed-off-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

nemo_skills/pipeline/utils/exp.py (1)

528-548: Propagate QoS to the sandbox executor

When --with-sandbox is used together with --qos, the sandbox leg still calls get_executor(...) without forwarding qos. That leg will therefore run with the cluster default QoS (or fail on clusters that require an explicit QoS), while the main/server legs honor the user-specified value. Please pass qos=qos into this get_executor call so all heterogenous groups share the same Slurm QoS.
nemo_skills/inference/model/parallel_thinking.py (1)
344-352: Wire the configured endpoint_type into downstream requests.

ParallelThinkingConfig.endpoint_type is never propagated, so flipping it from the default has no effect—all calls still rely on implicit inference. Please default kwargs["endpoint_type"] to self.cfg.endpoint_type before we fan out to _get_multiple_solutions and _generate_parallel_thinking_contraction.
-        result = {}  # Result dictionary
-        local_random = random.Random(kwargs.get("random_seed", 0))
+        kwargs = dict(kwargs)
+        kwargs.setdefault("endpoint_type", self.cfg.endpoint_type)
+
+        result = {}  # Result dictionary
+        local_random = random.Random(kwargs.get("random_seed", 0))

🧹 Nitpick comments (6)

nemo_skills/dataset/ruler/prepare.py (1)
32-34: Update the comment to reflect the new parameter.

The comment references "completions api" but the parameter has been changed to endpoint_type=text. Consider updating the comment for clarity.

Apply this diff to update the comment:
-    # ruler is adding prefix for assistant response, so it has to go through completions api
+    # ruler is adding prefix for assistant response, so it must use text endpoint type
     "++start_assistant_response_key=generation "
     "++inference.endpoint_type=text "
nemo_skills/inference/patch_litellm_logging.py (2)
61-73: Consider broader exception handling for robustness.

The current implementation only catches ModuleNotFoundError, which handles the case when litellm is not installed. However, if litellm's internal structure changes (e.g., the module path or attribute names change), ImportError or AttributeError could be raised instead, causing the patch to fail unexpectedly.

Apply this diff to handle additional exception types:
     try:
         import litellm.litellm_core_utils.logging_worker as logging_worker_module
 
         logging_worker_module.LoggingWorker = NoOpLoggingWorker
         logging_worker_module.GLOBAL_LOGGING_WORKER = NoOpLoggingWorker()
-    except ModuleNotFoundError:
-        # Ensure compatibility with different litellm versions
+    except (ModuleNotFoundError, ImportError, AttributeError):
+        # Ensure compatibility with different litellm versions or when litellm is unavailable
         pass
61-73: Document when and where this patch should be applied.

The function patch_litellm_logging_worker() is defined but there's no indication in this file about when or where it should be called (e.g., during application startup, before using litellm APIs, etc.). This could lead to confusion about proper usage.

Consider adding a module-level docstring or expanding the function docstring with usage guidance, for example:
"""
Patches the litellm LoggingWorker to disable its functionality.
This prevents any logging worker from keeping the client alive.

Usage:
    Call this function once during application initialization, before any
    litellm API calls are made:
    
    >>> from nemo_skills.inference.patch_litellm_logging import patch_litellm_logging_worker
    >>> patch_litellm_logging_worker()
"""
nemo_skills/training/openrlhf/math_reward.py (1)
61-63: Consider concurrent execution for async calls.

The current implementation calls asyncio.run() sequentially for each judge prompt, creating a new event loop for each item. This is inefficient compared to concurrent execution.

For better performance, consider gathering all async calls:
-    outputs = [
-        asyncio.run(llm.generate_async(prompt=jp, stop_phrases=prompt.stop_phrases)) for jp in judge_prompts
-    ]
+    async def generate_all():
+        return await asyncio.gather(*[
+            llm.generate_async(prompt=jp, stop_phrases=prompt.stop_phrases) for jp in judge_prompts
+        ])
+    outputs = asyncio.run(generate_all())
Note: Given the comment at line 60 indicating this module is no longer actively supported, this optimization may be deferred.
tests/test_code_execution.py (1)
113-113: Minor: Consider fixing the grammatical error in the timeout message.

The test correctly validates the new timeout message format. However, the message "Execution timed out after 1 seconds" has a grammatical error (should be "1 second"). While the test is correct, consider updating the implementation to use proper singular/plural forms.

Example fix for the implementation:
# In the timeout handling code
timeout_msg = f"Execution timed out after {timeout} second{'s' if timeout != 1 else ''}\n"
nemo_skills/inference/model/tool_call.py (1)

94-96: Consider enforcing zip strictness
asyncio.gather guarantees the same number of results you scheduled, so adding strict=True would automatically flag any future divergence in tool-call bookkeeping instead of silently truncating: for tool_call, tool_result in zip(tool_calls, tool_results, strict=True).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5db2f25 and 9c1060b.

📒 Files selected for processing (71)

.github/workflows/copyright-check.yml (1 hunks)
README.md (1 hunks)
docs/basics/inference.md (3 hunks)
docs/basics/prompt-format.md (1 hunks)
docs/evaluation/index.md (2 hunks)
docs/evaluation/long-context.md (1 hunks)
docs/evaluation/multilingual.md (3 hunks)
docs/index.md (1 hunks)
docs/pipelines/generation.md (8 hunks)
docs/releases/openmathinstruct2/dataset.md (6 hunks)
docs/releases/openmathreasoning/evaluation.md (2 hunks)
docs/tutorials/posts/gpt-oss-python.md (2 hunks)
nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (4 hunks)
nemo_skills/code_execution/sandbox.py (5 hunks)
nemo_skills/conversion/save_sharded_state.py (0 hunks)
nemo_skills/dataset/bfcl_v3/utils.py (1 hunks)
nemo_skills/dataset/flores200/__init__.py (1 hunks)
nemo_skills/dataset/flores200/prepare.py (1 hunks)
nemo_skills/dataset/livecodebench-cpp/prepare.py (1 hunks)
nemo_skills/dataset/livecodebench/__init__.py (1 hunks)
nemo_skills/dataset/livecodebench/prepare.py (1 hunks)
nemo_skills/dataset/math/fix_ref_solns.py (1 hunks)
nemo_skills/dataset/ojbench/__init__.py (1 hunks)
nemo_skills/dataset/ojbench/prepare.py (1 hunks)
nemo_skills/dataset/ruler/prepare.py (1 hunks)
nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks)
nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
nemo_skills/inference/chat_interface/chat_service.py (2 hunks)
nemo_skills/inference/eval/bfcl.py (7 hunks)
nemo_skills/inference/eval/bfcl_utils.py (1 hunks)
nemo_skills/inference/eval/scicode.py (1 hunks)
nemo_skills/inference/factory.py (1 hunks)
nemo_skills/inference/generate.py (15 hunks)
nemo_skills/inference/merge_chunks.py (1 hunks)
nemo_skills/inference/model/base.py (7 hunks)
nemo_skills/inference/model/code_execution.py (8 hunks)
nemo_skills/inference/model/context_retry.py (13 hunks)
nemo_skills/inference/model/openai.py (1 hunks)
nemo_skills/inference/model/parallel_thinking.py (2 hunks)
nemo_skills/inference/model/tool_call.py (6 hunks)
nemo_skills/inference/model/utils.py (3 hunks)
nemo_skills/inference/model/vllm.py (1 hunks)
nemo_skills/inference/patch_litellm_logging.py (1 hunks)
nemo_skills/mcp/adapters.py (4 hunks)
nemo_skills/mcp/config.py (1 hunks)
nemo_skills/mcp/tool_manager.py (1 hunks)
nemo_skills/mcp/tool_providers.py (1 hunks)
nemo_skills/pipeline/eval.py (3 hunks)
nemo_skills/pipeline/generate.py (2 hunks)
nemo_skills/pipeline/megatron_lm/train.py (2 hunks)
nemo_skills/pipeline/nemo_rl/grpo.py (3 hunks)
nemo_skills/pipeline/nemo_rl/sft.py (3 hunks)
nemo_skills/pipeline/prepare_data.py (2 hunks)
nemo_skills/pipeline/run_cmd.py (2 hunks)
nemo_skills/pipeline/start_server.py (2 hunks)
nemo_skills/pipeline/utils/exp.py (5 hunks)
nemo_skills/pipeline/verl/ppo.py (2 hunks)
nemo_skills/prompt/config/multilingual/__init__.py (1 hunks)
nemo_skills/prompt/config/multilingual/segment-translation.yaml (1 hunks)
nemo_skills/prompt/few_shot_examples/lean4.py (1 hunks)
nemo_skills/prompt/utils.py (4 hunks)
nemo_skills/training/merge_packed_data.py (1 hunks)
nemo_skills/training/openrlhf/math_reward.py (2 hunks)
requirements/main.txt (2 hunks)
tests/slurm-tests/gpt_oss_python_aime25/check_results.py (2 hunks)
tests/slurm-tests/gpt_oss_python_aime25/run_test.py (1 hunks)
tests/test_code_execution.py (1 hunks)
tests/test_mcp_clients.py (3 hunks)

💤 Files with no reviewable changes (1)

nemo_skills/conversion/save_sharded_state.py

✅ Files skipped from review due to trivial changes (17)

README.md
nemo_skills/inference/eval/bfcl_utils.py
nemo_skills/inference/factory.py
docs/index.md
nemo_skills/mcp/tool_providers.py
nemo_skills/dataset/ojbench/init.py
nemo_skills/dataset/math/fix_ref_solns.py
nemo_skills/prompt/few_shot_examples/lean4.py
docs/basics/prompt-format.md
docs/evaluation/index.md
docs/evaluation/long-context.md
nemo_skills/prompt/config/multilingual/init.py
nemo_skills/dataset/ojbench/prepare.py
nemo_skills/dataset/bfcl_v3/utils.py
nemo_skills/dataset/livecodebench/prepare.py
nemo_skills/mcp/config.py
nemo_skills/inference/merge_chunks.py

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/dataset/livecodebench-cpp/prepare.py

🧰 Additional context used

🧬 Code graph analysis (20)

nemo_skills/inference/model/openai.py (2)

nemo_skills/inference/model/base.py (2)

_build_responses_request_params (209-210)

_build_chat_request_params (202-203)

nemo_skills/inference/model/vllm.py (2)

_build_responses_request_params (142-148)

_build_chat_request_params (102-140)

nemo_skills/evaluation/metrics/map_metrics.py (1)

nemo_skills/evaluation/metrics/translation_metrics.py (1)

TranslationMetrics (22-81)

tests/test_mcp_clients.py (1)

nemo_skills/mcp/tool_manager.py (3)

list_tools (51-52)

ToolManager (64-164)

list_all_tools (111-151)

nemo_skills/inference/model/code_execution.py (1)

nemo_skills/inference/model/base.py (1)

EndpointType (38-41)

nemo_skills/inference/model/parallel_thinking.py (1)

nemo_skills/inference/model/base.py (1)

EndpointType (38-41)

nemo_skills/dataset/flores200/prepare.py (1)

nemo_skills/dataset/wmt24pp/prepare.py (2)

write_data_to_file (23-36)

main (39-47)

nemo_skills/inference/eval/bfcl.py (3)

nemo_skills/inference/generate.py (6)

generate (676-703)

generate (711-716)

GenerateSolutionsConfig (81-233)

GenerationTask (240-703)

InferenceConfig (59-77)

generate_with_semaphore (554-562)

nemo_skills/inference/model/base.py (1)

EndpointType (38-41)

nemo_skills/prompt/utils.py (1)

get_token_count (310-369)

tests/slurm-tests/gpt_oss_python_aime25/check_results.py (1)

tests/slurm-tests/utils.py (3)

assert_all (46-58)

load_json (18-21)

soft_assert (36-43)

nemo_skills/inference/model/vllm.py (2)

nemo_skills/inference/model/base.py (2)

_build_responses_request_params (209-210)

_build_chat_request_params (202-203)

nemo_skills/inference/model/openai.py (2)

_build_responses_request_params (155-160)

_build_chat_request_params (84-153)

nemo_skills/mcp/adapters.py (1)

nemo_skills/inference/model/base.py (1)

EndpointType (38-41)

nemo_skills/training/openrlhf/math_reward.py (1)

nemo_skills/inference/model/base.py (1)

generate_async (213-315)

nemo_skills/inference/chat_interface/chat_service.py (2)

nemo_skills/inference/model/base.py (1)

generate_async (213-315)

nemo_skills/inference/model/code_execution.py (1)

generate_async (241-300)

nemo_skills/inference/model/base.py (7)

nemo_skills/inference/patch_litellm_logging.py (1)

patch_litellm_logging_worker (61-73)

nemo_skills/utils.py (1)

get_logger_name (130-134)

nemo_skills/inference/model/context_retry.py (2)

ContextLimitRetryConfig (114-157)

with_context_retry (160-180)

nemo_skills/inference/model/utils.py (3)

ServerTokenizer (54-83)

WrapperAutoTokenizer (86-102)

trim_after_stop_phrases (27-33)

nemo_skills/inference/model/openai.py (3)

_build_responses_request_params (155-160)

_build_chat_request_params (84-153)

_build_completion_request_params (64-82)

nemo_skills/inference/model/vllm.py (3)

_build_responses_request_params (142-148)

_build_chat_request_params (102-140)

_build_completion_request_params (63-100)

nemo_skills/inference/model/tool_call.py (1)

generate_async (98-160)

nemo_skills/inference/generate.py (3)

nemo_skills/inference/model/base.py (2)

EndpointType (38-41)

generate_async (213-315)

nemo_skills/evaluation/evaluator/__init__.py (2)

get_evaluator_class (98-108)

supports_single_eval (111-117)

nemo_skills/prompt/utils.py (1)

get_token_count (310-369)

nemo_skills/evaluation/metrics/translation_metrics.py (1)

nemo_skills/evaluation/metrics/base.py (2)

BaseMetrics (23-434)

as_float (449-452)

nemo_skills/inference/model/tool_call.py (2)

nemo_skills/mcp/adapters.py (3)

format_tool_list_by_endpoint_type (51-76)

format_tool_response_by_endpoint_type (96-111)

get_tool_details_by_endpoint_type (114-124)

nemo_skills/inference/model/base.py (1)

EndpointType (38-41)

nemo_skills/dataset/wmt24pp/prepare.py (1)

nemo_skills/dataset/flores200/prepare.py (2)

write_data_to_file (23-38)

main (41-54)

nemo_skills/inference/eval/scicode.py (1)

nemo_skills/inference/generate.py (1)

process_single_datapoint (527-552)

nemo_skills/inference/model/context_retry.py (1)

nemo_skills/inference/model/utils.py (5)

is_context_window_exceeded_error (36-51)

encode (61-74)

encode (93-98)

ServerTokenizer (54-83)

WrapperAutoTokenizer (86-102)

nemo_skills/evaluation/evaluator/livecodebench.py (2)

nemo_skills/code_execution/sandbox.py (2)

Sandbox (45-334)

get_sandbox (419-422)

nemo_skills/evaluation/evaluator/code.py (1)

preprocess_code (36-92)

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/multilingual.md

76-76: Link text should be descriptive

(MD059, descriptive-link-text)

149-149: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.13.3)

tests/test_mcp_clients.py

264-264: Do not call getattr with a constant attribute value. It is not any safer than normal property access.

Replace getattr with attribute access

(B009)

nemo_skills/inference/eval/bfcl.py

139-139: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

140-140: Use raise without specifying exception name

Remove exception name

(TRY201)

272-272: Use raise without specifying exception name

Remove exception name

(TRY201)

tests/slurm-tests/gpt_oss_python_aime25/check_results.py

24-24: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

54-54: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/mcp/adapters.py

76-76: Avoid specifying long messages outside the exception class

(TRY003)

111-111: Avoid specifying long messages outside the exception class

(TRY003)

123-123: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/model/base.py

298-298: Avoid specifying long messages outside the exception class

(TRY003)

301-301: Consider moving this statement to an else block

(TRY300)

310-310: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

313-313: Use raise without specifying exception name

Remove exception name

(TRY201)

452-452: Unused method argument: kwargs

(ARG002)

509-509: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/generate.py

229-229: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/model/tool_call.py

73-73: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

95-95: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/inference/model/context_retry.py

192-192: Use raise without specifying exception name

Remove exception name

(TRY201)

207-207: Consider moving this statement to an else block

(TRY300)

208-208: Do not catch blind exception: Exception

(BLE001)

226-226: Use raise without specifying exception name

Remove exception name

(TRY201)

241-241: Consider moving this statement to an else block

(TRY300)

242-242: Do not catch blind exception: Exception

(BLE001)

nemo_skills/evaluation/evaluator/livecodebench.py

67-67: Avoid specifying long messages outside the exception class

(TRY003)

72-72: Avoid specifying long messages outside the exception class

(TRY003)

128-128: Consider moving this statement to an else block

(TRY300)

134-134: subprocess call: check for execution of untrusted input

(S603)

138-138: Consider moving this statement to an else block

(TRY300)

140-140: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

155-155: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

198-198: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

219-219: Avoid specifying long messages outside the exception class

(TRY003)

221-221: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (24)

requirements/main.txt (2)

29-29: LGTM on new dependencies.

The additions of langcodes and sacrebleu align well with the PR's goal of adding LiveCodeBench-CPP evaluation support and likely support translation/multilingual features and BLEU metrics mentioned in the AI summary.

Also applies to: 39-39

30-30: litellm 1.77.5 has no known security advisories. All reported vulnerabilities were patched by versions ≤1.61.15; 1.77.5 is unaffected.

nemo_skills/dataset/livecodebench/__init__.py (2)

1-1: LGTM! Copyright year updated.

The copyright year update from 2024 to 2025 is appropriate.

15-20: Removal of sandbox constants is safe
No references to REQUIRES_SANDBOX or KEEP_MOUNTS_FOR_SANDBOX exist in the Python LiveCodeBench evaluator; functionality remains unchanged.

nemo_skills/inference/model/context_retry.py (11)

30-110: LGTM! Enhanced error parsing covers multiple formats.

The expanded pattern matching now handles 5 different context window error formats, including the new message_tokens_overflow case (Pattern 5) when max_tokens goes negative. The progressive fallback through patterns is well-structured.

183-214: LGTM! Improved error handling with better error categorization.

The restructured soft-fail logic now:

Clearly distinguishes context window errors from other exceptions

Returns error_reason="context_window_exceeded" for better error tracking

Attempts retry with modified kwargs when strategy is configured

Note: The Ruff warnings (TRY201, TRY300, BLE001) are false positives. The raise error preserves the original exception, the nested try-catch is appropriate for retry logic, and catching Exception is necessary to handle any error from the retry attempt.

217-248: LGTM! Sync version mirrors async improvements.

The sync handler implements the same improved error handling as the async version. The code duplication is appropriate given the fundamental difference between sync and async execution.

Note: Same Ruff warnings apply here as in the async version - these are false positives.

251-288: LGTM! Added essential guard for prompt overflow.

The early exit at lines 271-274 correctly prevents attempting to reduce generation tokens when the prompt itself overflows the context limit. This saves a futile retry attempt.

291-317: LGTM! Safety margin prevents edge-case failures.

The safe_remaining_budget calculation (lines 299-306) adds a necessary buffer to account for special tokens, preventing failures at exact context boundaries. The logic correctly checks if messages already consume the available space.

320-362: LGTM! Direct calculation for overflow cases is more efficient.

When message_tokens_overflow is present (lines 328-334), the code directly calculates the required prompt reduction instead of relying on indirect token budget calculations. This is both clearer and more accurate.

365-382: LGTM! Consistent kwargs access pattern.

Accessing the prompt via kwargs["prompt"] (line 372) maintains consistency with how other parameters are retrieved throughout the module.

385-408: LGTM! Proper handling of tools parameter for chat templates.

Extracting tools from kwargs (line 394) and passing it to the message trimming functions ensures accurate tokenization when function calling is involved.

411-500: LGTM! Tools parameter enables accurate tokenization.

Both trimming functions now accept and pass the tools parameter to tokenizer.encode (lines 424, 468), ensuring correct token counts when chat templates include function calling definitions.

525-531: LGTM! Error categorization improves observability.

Adding the error_reason parameter (line 525) enables structured error tracking. Callers can now identify context window errors specifically via error="context_window_exceeded" while preserving detailed diagnostics in detailed_error.

121-121: Verify special token budget adequacy. Automated search didn’t locate any tokenizer or prompt-template definitions—manually confirm that reducing num_special_tokens_budget to 50 still covers the maximum special-token overhead for all supported model prompt templates and tokenizers.

nemo_skills/pipeline/generate.py (1)

107-107: LGTM! QoS parameter integration is consistent.

The qos parameter has been properly added and forwarded through the task creation pipeline. This change is part of a consistent pattern across multiple pipeline files (generate.py, megatron_lm/train.py, verl/ppo.py, nemo_rl/grpo.py, nemo_rl/sft.py, eval.py) enabling Slurm QoS specification for job submissions.

Also applies to: 314-314

tests/slurm-tests/gpt_oss_python_aime25/check_results.py (1)

34-76: LGTM! Timeout validation logic is well-structured.

The new timeout checking functionality correctly:

Parses and aggregates timeout counts from JSONL output files

Validates per-file and total timeout thresholds using soft assertions

Extracts seeds from filenames with proper error handling

The implementation integrates cleanly with the existing test workflow.

docs/basics/inference.md (1)

37-37: LGTM! Documentation correctly updated for async generation.

The documentation properly reflects the migration from synchronous to asynchronous generation, with consistent updates across all examples. The asyncio import is correctly added, and the usage pattern is clear.

Also applies to: 43-43, 73-73, 85-85

nemo_skills/prompt/config/multilingual/segment-translation.yaml (1)

1-3: LGTM! Translation prompt config is well-structured.

The prompt template is clear and concise, with appropriate placeholders for text and target language. The format aligns with the dataset usage patterns in FLORES200 and WMT24PP.

docs/releases/openmathreasoning/evaluation.md (1)

107-107: LGTM! Documentation correctly updated for endpoint_type parameter.

The change from ++use_completions_api=True to ++inference.endpoint_type=text is consistent across both examples and aligns with the broader endpoint-type migration in the codebase.

Also applies to: 130-130

nemo_skills/evaluation/metrics/map_metrics.py (1)

36-36: LGTM! Translation metrics correctly integrated.

The TranslationMetrics import and registration in METRICS_MAP follow the established pattern. The implementation properly extends BaseMetrics with BLEU-based evaluation.

Also applies to: 60-60

nemo_skills/inference/chat_interface/chat_service.py (1)

17-17: LGTM! Async migration correctly implemented.

The change from generate_sync to asyncio.run(llm.generate_async(...)) maintains the synchronous interface of stream_chat while using the new async generation pathway. This is appropriate since the function signature returns Iterator[str] (not AsyncIterator), indicating it's designed to be called from synchronous contexts.

Also applies to: 60-69

nemo_skills/inference/model/vllm.py (1)

142-148: LGTM! Implementation follows the established pattern.

The _build_responses_request_params method correctly delegates to _build_chat_request_params and renames the messages parameter to input. Note that this differs from the OpenAI implementation which also renames max_completion_tokens → max_output_tokens. If this reflects actual API differences between VLLM and OpenAI responses endpoints, the implementation is correct.

nemo_skills/inference/model/openai.py (1)

155-160: Ensure OpenAI Responses API mapping: _build_responses_request_params renames both messages→input and max_completion_tokens→max_output_tokens, unlike VLLMModel. Verify this aligns with the official OpenAI Responses endpoint specification.

coderabbitai · 2025-10-09T16:53:46Z

+def main(args):
+    datasets = {}
+    for lang in args.target_languages:
+        datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+
+    data_dir = Path(__file__).absolute().parent
+    data_dir.mkdir(exist_ok=True)
+    output_file = data_dir / f"{args.split}.jsonl"
+    write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)


⚠️ Potential issue | 🔴 Critical

Respect the requested split when loading WMT24PP.

We expose --split (defaulting to test), yet we unconditionally fetch the train split via ["train"]. That silently pipes training data into evaluation, invalidating metrics. Please honor the CLI flag.

- datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"] + datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}", split=args.split)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def main(args):

datasets = {}

for lang in args.target_languages:

datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]

data_dir = Path(__file__).absolute().parent

data_dir.mkdir(exist_ok=True)

output_file = data_dir / f"{args.split}.jsonl"

write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)

def main(args):

datasets = {}

for lang in args.target_languages:

datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}", split=args.split)

data_dir = Path(__file__).absolute().parent

data_dir.mkdir(exist_ok=True)

output_file = data_dir / f"{args.split}.jsonl"

write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)

🤖 Prompt for AI Agents

nemo_skills/dataset/wmt24pp/prepare.py around lines 39 to 47: the code currently unconditionally selects the "train" split when loading "google/wmt24pp", ignoring the CLI --split flag; change the dataset loading to use the requested split (e.g., use load_dataset("google/wmt24pp", f"en-{lang}", split=args.split) or index with [args.split]) so the script actually loads the specified split (default test) before writing to the output file.

coderabbitai · 2025-10-09T16:53:47Z

+        evaluate_fn(
+            custom_output_file=jsonl_file,
+            release_version=f"release_{release_version}",
+            k_list=[1],
+            language=eval_config.language,
+            test_file=eval_config.test_file,
+            num_process_evaluate=eval_config.num_processes,
+            timeout=eval_config.timeout,
+        )


⚠️ Potential issue | 🟠 Major

Fix the mismatched release_version in the local path.

_preprocess_and_validate_file already returns the exact release_version string stored in the samples (e.g., "release_cpp_202409"). The sandbox flow forwards it unchanged, but the local flow prepends "release_" again, resulting in "release_release_cpp_202409". That value doesn’t exist in the LiveCodeBench package, so local evaluations will now fail with a missing-release error. Please drop the extra prefix (or normalize both paths the same way).

Apply this diff:

- evaluate_fn( - custom_output_file=jsonl_file, - release_version=f"release_{release_version}", + evaluate_fn( + custom_output_file=jsonl_file, + release_version=release_version,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

evaluate_fn(

custom_output_file=jsonl_file,

release_version=f"release_{release_version}",

k_list=[1],

language=eval_config.language,

test_file=eval_config.test_file,

num_process_evaluate=eval_config.num_processes,

timeout=eval_config.timeout,

)

evaluate_fn(

custom_output_file=jsonl_file,

release_version=release_version,

k_list=[1],

language=eval_config.language,

test_file=eval_config.test_file,

num_process_evaluate=eval_config.num_processes,

timeout=eval_config.timeout,

)

🤖 Prompt for AI Agents

In nemo_skills/evaluation/evaluator/livecodebench.py around lines 201 to 209, the local call to evaluate_fn is prepending "release_" to release_version, causing "release_release_..." mismatches with samples; remove the extra prefix and pass the release_version string returned by _preprocess_and_validate_file directly (or normalize both sandbox and local flows to the same format) so evaluate_fn receives the exact release tag present in the LiveCodeBench package.

coderabbitai · 2025-10-09T16:53:47Z

+        tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type)

        ##
        # TODO(sanyamk): Not all tool arguments might necessarily be in JSON format.
        #   Kept here to handle errors for now.
        try:
            tool_args = json.loads(tool_args)
        except json.decoder.JSONDecodeError as e:
+            LOG.error(f"Tool arguments are not in JSON format: {tool_args}")
            LOG.exception(e)
            return {"error": "Tool argument parsing failed."}



⚠️ Potential issue | 🔴 Critical

Handle dict tool arguments for responses endpoint
For EndpointType.responses, get_tool_details_by_endpoint_type (see nemo_skills/mcp/adapters.py, lines 113-123) already returns tool_call["arguments"], which is a dict. Passing that dict into json.loads raises a TypeError, so tool execution now crashes before the try/except (you only catch JSONDecodeError). Please decode only when the payload is a string, or otherwise accept the existing dict. One possible fix:

- try: - tool_args = json.loads(tool_args) - except json.decoder.JSONDecodeError as e: + try: + if isinstance(tool_args, str): + tool_args = json.loads(tool_args) + except json.decoder.JSONDecodeError as e: LOG.error(f"Tool arguments are not in JSON format: {tool_args}") LOG.exception(e) return {"error": "Tool argument parsing failed."} + except TypeError as e: + LOG.error("Tool arguments must be str or dict; got %s", type(tool_args)) + LOG.exception(e) + return {"error": "Tool argument parsing failed."}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type)

##

# TODO(sanyamk): Not all tool arguments might necessarily be in JSON format.

# Kept here to handle errors for now.

try:

tool_args = json.loads(tool_args)

except json.decoder.JSONDecodeError as e:

LOG.error(f"Tool arguments are not in JSON format: {tool_args}")

LOG.exception(e)

return {"error": "Tool argument parsing failed."}

tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type)

##

# TODO(sanyamk): Not all tool arguments might necessarily be in JSON format.

# Kept here to handle errors for now.

try:

if isinstance(tool_args, str):

tool_args = json.loads(tool_args)

except json.decoder.JSONDecodeError as e:

LOG.error(f"Tool arguments are not in JSON format: {tool_args}")

LOG.exception(e)

return {"error": "Tool argument parsing failed."}

except TypeError as e:

LOG.error("Tool arguments must be str or dict; got %s", type(tool_args))

LOG.exception(e)

return {"error": "Tool argument parsing failed."}

🧰 Tools

🪛 Ruff (0.13.3)

73-73: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

74-74: Redundant exception object included in logging.exception call

(TRY401)

🤖 Prompt for AI Agents

In nemo_skills/inference/model/tool_call.py around lines 65 to 76, the code always calls json.loads on tool_args which can be a dict for EndpointType.responses causing a TypeError; update the logic to only call json.loads when tool_args is a string (e.g., isinstance(tool_args, str)); if tool_args is already a dict, skip decoding and use it as-is; keep the existing JSONDecodeError handling for string parsing and optionally catch TypeError if non-str/non-dict types may appear and return the same error response.

coderabbitai · 2025-10-09T16:53:47Z

+
+Currently, the async_loop function in generate.py sometimes gets stuck forever because some of the coroutines in the litellm logging worker are not finished.
+Debugging why the logger is not finished is non-trivial, so we just patch it to disable its functionality.
+The behavior is that it keeps slurm jobs from existing and we waste gpus.


⚠️ Potential issue | 🟡 Minor

Fix typo in documentation.

"existing" should be "exiting" (jobs should exit, not exist).

Apply this diff:

-The behavior is that it keeps slurm jobs from existing and we waste gpus. +The behavior is that it keeps slurm jobs from exiting and we waste gpus.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

The behavior is that it keeps slurm jobs from existing and we waste gpus.

The behavior is that it keeps slurm jobs from exiting and we waste gpus.

🤖 Prompt for AI Agents

In nemo_skills/inference/patch_litellm_logging.py around line 21, the documentation/comment contains a typo: "existing" should be "exiting"; update the sentence so it reads that slurm jobs keep from exiting (not existing) to reflect correct behavior and save GPUs.

coderabbitai · 2025-10-09T16:53:47Z

+        return {
+            "type": "function_call_output",
+            "call_id": tool_call["call_id"],
+            "output": json.dumps(result) if not isinstance(result, str) else result,
+        }


⚠️ Potential issue | 🔴 Critical

Fix responses tool-call output key.

For the Responses API path we return "call_id": tool_call["call_id"], but the function-call object coming from OpenAI (and from model_dump()) exposes this identifier as "id". Using "call_id" raises a KeyError as soon as we format any tool output under the responses endpoint. Please grab the "id" field instead.

- return { - "type": "function_call_output", - "call_id": tool_call["call_id"], - "output": json.dumps(result) if not isinstance(result, str) else result, - } + return { + "type": "function_call_output", + "call_id": tool_call["id"], + "output": json.dumps(result) if not isinstance(result, str) else result, + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

return {

"type": "function_call_output",

"call_id": tool_call["call_id"],

"output": json.dumps(result) if not isinstance(result, str) else result,

}

return {

"type": "function_call_output",

"call_id": tool_call["id"],

"output": json.dumps(result) if not isinstance(result, str) else result,

}

🤖 Prompt for AI Agents

In nemo_skills/mcp/adapters.py around lines 105 to 109, the responses endpoint incorrectly accesses tool_call["call_id"] which raises a KeyError because the OpenAI Responses/function-call object uses "id"; update the returned mapping to use tool_call["id"] instead of tool_call["call_id"] (optionally add a fallback like tool_call.get("id") or tool_call.get("call_id") if you want backward compatibility).

coderabbitai · 2025-10-09T16:53:47Z

            "++inference.top_p=1.0 "
            "++prompt_config=gpt-oss/math "
-            "++use_completions_api=true "
+            "++inference.endpoint_type=text "


🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

LGTM! Configuration flag updated to use new endpoint_type parameter.

The change from ++use_completions_api=true to ++inference.endpoint_type=text aligns with the broader refactoring to endpoint_type-based configuration mentioned in the PR.

Verify that the old use_completions_api flag has been replaced consistently across the codebase:

🏁 Script executed:

#!/bin/bash # Description: Search for any remaining usage of the old completions API flag # Search for old flag usage rg -n 'use_completions_api' --type py

Length of output: 305

Remove deprecated use_completions_api in generate.py
The test change is correct but nemo_skills/inference/generate.py still defines and checks use_completions_api. Remove the parameter and its deprecation branch (lines 89, 228–229) and switch to inference.endpoint_type only.

🤖 Prompt for AI Agents

In nemo_skills/inference/generate.py around lines 89 and 228-229, remove the deprecated use_completions_api parameter and the conditional branch that checks it; delete any parameter definition, default, or docstring references to use_completions_api and remove the deprecation warning branch. Replace any logic that selects behavior based on use_completions_api with a single check using inference.endpoint_type (e.g., compare strings like "completions" vs "chat" or whatever existing endpoint_type values are used) so all code paths rely only on inference.endpoint_type. Ensure imports, function signatures, and tests are updated accordingly and no dead references to use_completions_api remain.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

nemo_skills/evaluation/evaluator/livecodebench.py (1)
252-279: Consider using LOG.exception() for preprocessing errors.

At line 263, use LOG.exception() instead of LOG.error() when logging preprocessing errors to include the full traceback.

Apply this diff:
         except ValueError as e:
-            LOG.error(f"Skipping {jsonl_file} due to pre-processing error: {e}")
+            LOG.exception(f"Skipping {jsonl_file} due to pre-processing error.")
             continue
Otherwise, LGTM: Local evaluation path mirrors sandbox path well.

The local evaluation flow is consistent with the sandbox path, with the same preprocessing, release_version handling (prepending "release_" for Python), and postprocessing logic.
docs/evaluation/code.md (1)

326-327: Verify if sandbox flags are automatically set for OJBench.

A past reviewer (Kipok) questioned whether these flags are necessary, suggesting they might be set automatically. Since LiveCodeBench has automatic sandbox routing logic, please confirm whether OJBench requires explicit sandbox flags or if they're inferred from the benchmark configuration.

🧹 Nitpick comments (2)

nemo_skills/evaluation/evaluator/livecodebench.py (2)
183-202: Consider using LOG.exception() for better error diagnostics.

At line 201, when logging errors during package installation, use LOG.exception() instead of LOG.error() to automatically include the full traceback. This will help diagnose installation failures more quickly.

Apply this diff:
         except (subprocess.CalledProcessError, ImportError) as e:
-            LOG.error(f"Failed to install/import 'livecodebench'. Please install it manually. Error: {e}")
+            LOG.exception(f"Failed to install/import 'livecodebench'. Please install it manually.")
             raise
205-249: Consider using LOG.exception() for preprocessing errors.

Similar to the previous comment, at line 216, use LOG.exception() instead of LOG.error() when logging preprocessing errors to include the full traceback.

Apply this diff:
             except ValueError as e:
-                LOG.error(f"Skipping {jsonl_file} due to pre-processing error: {e}")
+                LOG.exception(f"Skipping {jsonl_file} due to pre-processing error.")
                 continue
Otherwise, LGTM: Sandbox evaluation path is well-structured.

The async evaluation flow with retry logic, proper timeout calculation (timeout * len(samples) + timeout_buffer), and consistent preprocessing/postprocessing is well-implemented.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1060b and d8db2c7.

📒 Files selected for processing (3)

docs/evaluation/code.md (3 hunks)
nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks)
nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/dataset/livecodebench-cpp/init.py

🧰 Additional context used

🧬 Code graph analysis (1)

nemo_skills/evaluation/evaluator/livecodebench.py (2)

nemo_skills/code_execution/sandbox.py (1)

Sandbox (45-334)

nemo_skills/evaluation/evaluator/code.py (1)

preprocess_code (36-92)

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

265-265: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.13.3)

nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)

112-112: Do not catch blind exception: Exception

(BLE001)

126-126: Avoid specifying long messages outside the exception class

(TRY003)

131-131: Avoid specifying long messages outside the exception class

(TRY003)

189-189: Consider moving this statement to an else block

(TRY300)

195-195: subprocess call: check for execution of untrusted input

(S603)

199-199: Consider moving this statement to an else block

(TRY300)

201-201: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

216-216: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

263-263: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

287-287: Avoid specifying long messages outside the exception class

(TRY003)

297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Cursor Bugbot

🔇 Additional comments (5)

nemo_skills/evaluation/evaluator/livecodebench.py (3)

51-90: LGTM: Robust retry mechanism with proper backoff.

The async context manager and retry logic are well-implemented. The exponential backoff and clear error logging will help diagnose connectivity issues.

117-162: LGTM: Clean consolidation of preprocessing and postprocessing logic.

These helper functions eliminate code duplication between the sandbox and local evaluation paths. The validation logic (checking for empty samples and consistent release versions) is appropriate, and the in-place file modification approach works well with the evaluation flow.

282-302: LGTM: Well-designed routing logic with proper validation.

The main entry point effectively:

Validates interpreter constraints per language (Python allows python/pypy3, C++ requires python)

Forces the correct interpreter for C++ with a clear warning

Checks sandbox availability before routing

Enforces sandbox requirement for pypy3 (raises error if unavailable)

Gracefully falls back to local evaluation when sandbox is unavailable

The warning message at lines 289-293 addresses the past review comment about explaining why the Python interpreter is used for C++ evaluation.

docs/evaluation/code.md (2)

224-227: Verify if --with_sandbox flag is necessary.

The documentation states that --with_sandbox --keep_mounts_for_sandbox flags must be passed for Pypy3 evaluation. However, the code at lines 295-297 automatically checks sandbox availability and routes accordingly, raising an error if pypy3 is used without an available sandbox.

The --with_sandbox flag might be implicitly handled, though --keep_mounts_for_sandbox may still be necessary for file access (as mentioned in the PR objectives). Please confirm whether users truly need to pass these flags explicitly.

262-267: LGTM: C++ benchmark documentation is clear and correct.

The new section properly:

Links to both the dataset definition and the original Hugging Face source

Specifies the exact benchmark entrypoint (--benchmarks=livecodebench-cpp)

Names both available splits (v5_2408_2501 and v6_2408_2505)

References the Python instructions for additional details

This addresses the past review comment about providing C++-specific instructions.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

nemo_skills/evaluation/evaluator/livecodebench.py (2)
93-115: Consider narrowing the exception handling scope.

The function correctly implements sandbox availability checking, but catching all Exception types on line 112 is overly broad. While this is acceptable for an availability check that should never fail the caller, you should consider whether certain exception types (e.g., KeyboardInterrupt, SystemExit) should propagate.

As per the Ruff hint (BLE001), you could narrow the exception handling:
         LOG.info("Sandbox connection successful. Sandbox is available.")
         return True
     except httpx.NetworkError as e:
         LOG.warning(f"Sandbox is unavailable due to a network error: {type(e).__name__} - {e}")
         return False
-    except Exception as e:
+    except (RuntimeError, ValueError, TypeError) as e:
         LOG.warning(f"An unexpected error occurred while checking sandbox availability: {e}")
         return False
Alternatively, if you want to keep the broad catch for robustness, the current implementation is acceptable since it logs the error details.

288-293: Consider adding an inline comment explaining the interpreter constraint.

While the warning message clearly explains that C++ evaluation requires the Python interpreter to run the harness, an inline comment would improve code readability for future maintainers.
     if eval_config.language == "cpp" and eval_config.interpreter != "python":
+        # The C++ evaluation harness is written in Python, so we must use the Python interpreter
+        # to execute it, regardless of the target language being C++
         LOG.warning(
             f"For C++ evaluation, the harness must be run with 'python'. "
             f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'."
         )
         eval_config.interpreter = "python"
Based on learnings (past review comment from Kipok suggesting to add a comment about why Python interpreter is used for C++ tests).
docs/evaluation/code.md (1)

334-335: Remove redundant sandbox flags from code example
In docs/evaluation/code.md (lines 334–335), the --with_sandbox and --keep_mounts_for_sandbox flags can be omitted—they’re already enabled by REQUIRES_SANDBOX = True and KEEP_MOUNTS_FOR_SANDBOX = True in nemo_skills/dataset/ojbench/__init__.py.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1060b and 8b48105.

📒 Files selected for processing (3)

docs/evaluation/code.md (3 hunks)
nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks)
nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/dataset/livecodebench-cpp/init.py

🧰 Additional context used

🧬 Code graph analysis (1)

nemo_skills/evaluation/evaluator/livecodebench.py (1)

nemo_skills/code_execution/sandbox.py (1)

Sandbox (45-334)

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

273-273: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.13.3)

nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)

112-112: Do not catch blind exception: Exception

(BLE001)

297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

🔇 Additional comments (9)

docs/evaluation/code.md (2)

232-232: Documentation correctly reflects sandbox requirement for Pypy3.

The updated instructions clearly state that sandbox flags are required for Pypy3 evaluation, which aligns with the PR objectives and the resolution of the custom_output_file issue.

270-275: LGTM! Previous review concerns have been addressed.

The livecodebench-cpp documentation now correctly:

Fixes the typo ("repare" → "prepare")

Explicitly instructs users to target the C++ benchmark entrypoint (--benchmarks=livecodebench-cpp)

Specifies the correct split names (v5_2408_2501 or v6_2408_2505)

The implementation matches the suggestions from the previous review.

nemo_skills/evaluation/evaluator/livecodebench.py (7)

27-28: LGTM! Import required for network error handling.

The httpx import is necessary for catching httpx.NetworkError in the new retry logic.

47-48: LGTM! New configuration fields support retry logic.

The addition of timeout_buffer and num_retries fields appropriately extends the configuration to support the new retry and timeout management features.

62-91: LGTM! Retry logic implements exponential backoff correctly.

The implementation properly:

Catches httpx.NetworkError specifically for transient network issues

Uses exponential backoff (2^attempt) between retries

Logs warnings with attempt counts and wait times

Re-raises the last exception after all retries are exhausted

165-180: LGTM! Function signature updated to support retry configuration.

The changes correctly:

Accept eval_config to access interpreter and num_retries

Use the new _execute_in_sandbox_with_retries helper

Maintain backward compatibility with retry logic

219-220: LGTM! Consistent release_version handling across sandbox and local paths.

Both evaluation paths now apply the same logic: prepending "release_" to the release_version when language == "python". This resolves the previous review concern about mismatched handling between sandbox and local flows. The conditional prefix suggests that C++ releases use a different naming convention that doesn't require the "release_" prefix.

Also applies to: 266-267

236-242: LGTM! Timeout calculation includes buffer for improved reliability.

The timeout now accounts for per-sample timeout plus a configurable buffer (timeout_buffer), which helps prevent premature timeouts during batch evaluation of large sample sets.

295-297: LGTM! Sandbox availability check ensures pypy3 requirements are met.

The code correctly:

Checks sandbox availability before attempting evaluation

Raises a clear error when pypy3 is configured but sandbox is unavailable

Routes to the appropriate evaluation path based on availability

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

nemo_skills/evaluation/evaluator/livecodebench.py (1)
288-293: Consider adding a comment explaining the C++ interpreter override.

The code forces the interpreter to "python" for C++ evaluations, overriding any user configuration. While this is correct (the LiveCodeBench evaluation harness is Python-based, even for C++ submissions), a brief comment would help future maintainers understand why this override is necessary.

Apply this diff:
     if eval_config.language == "cpp" and eval_config.interpreter != "python":
         LOG.warning(
-            f"For C++ evaluation, the harness must be run with 'python'. "
+            # The LiveCodeBench evaluation harness for C++ is Python-based, regardless of the code being evaluated
+            f"For C++ evaluation, the harness must be run with 'python'. "
             f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'."
         )
         eval_config.interpreter = "python"

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b48105 and f311171.

📒 Files selected for processing (1)

nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

nemo_skills/evaluation/evaluator/livecodebench.py (1)

nemo_skills/code_execution/sandbox.py (1)

Sandbox (45-334)

🪛 Ruff (0.13.3)

nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)

112-112: Do not catch blind exception: Exception

(BLE001)

297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Cursor Bugbot
GitHub Check: unit-tests

🔇 Additional comments (7)

nemo_skills/evaluation/evaluator/livecodebench.py (7)

27-28: LGTM!

The httpx import is required for the new network retry logic introduced in _execute_in_sandbox_with_retries.

42-48: LGTM!

The new configuration fields (timeout_buffer, num_retries) are well-suited for controlling the retry behavior and timeout calculations introduced in this PR.

62-91: LGTM!

The retry mechanism with exponential backoff appropriately handles transient network failures when communicating with the sandbox service. The function correctly forwards arguments to sandbox.execute_code.

93-115: LGTM!

The sandbox availability check correctly probes connectivity by executing a simple shell command. The broad exception handling on line 112 is appropriate for an availability check, as it should gracefully handle any unexpected errors and report the sandbox as unavailable. The static analysis warnings (TRY300, BLE001) can be safely ignored in this context.

165-180: LGTM!

The function correctly integrates the retry logic and properly uses eval_config to determine the appropriate interpreter and number of retries.

295-302: LGTM!

The sandbox availability check and path selection logic correctly handles the requirement that pypy3 must run in a sandbox environment, while allowing fallback to local evaluation for standard Python when the sandbox is unavailable.

219-226: Verify sample release_version format

Confirm that the values returned by _preprocess_and_validate_file for Python samples do not already include the "release_" prefix; otherwise the code at lines 219–220 will produce "release_release_…" mismatches.

wasiahmad · 2025-10-12T19:48:55Z

@Kipok this PR is working. I have made the following key changes.

1. We check if Sandbox is available for LCB evaluation.

async def is_sandbox_available(sandbox_config: dict) -> bool:
    """
    Checks if the sandbox service is running and accessible by sending a test request.

    Args:
        sandbox_config: The configuration dictionary for the sandbox.

    Returns:
        True if a connection can be established, False otherwise.
    """
    LOG.info(f"Attempting to connect to sandbox with config: {sandbox_config}")
    try:
        async with sandbox_context(sandbox_config) as sandbox:
            await _execute_in_sandbox_with_retries(sandbox, 1, "true", language="shell", timeout=5)
        LOG.info("Sandbox connection successful. Sandbox is available.")
        return True
    except httpx.NetworkError as e:
        LOG.warning(f"Sandbox is unavailable due to a network error: {type(e).__name__} - {e}")
        return False
    except Exception as e:
        LOG.warning(f"An unexpected error occurred while checking sandbox availability: {e}")
        return False

2. Core logic when to use Sandbox

sandbox_is_ready = asyncio.run(is_sandbox_available(eval_config.sandbox))
if eval_config.interpreter == "pypy3" and not sandbox_is_ready:
    raise RuntimeError("The 'pypy3' interpreter requires a running sandbox, but the service was unreachable.")

if sandbox_is_ready:
    asyncio.run(eval_livecodebench_async(cfg, eval_config))
else:
    eval_livecodebench_without_sandbox(cfg, eval_config)

3. Execute code in Sandbox with a number of retries

async def _execute_in_sandbox_with_retries(
    sandbox: Sandbox,
    num_retries: int,
    *args,
    **kwargs,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """
    Executes a command in the sandbox, retrying on network errors.

    Uses exponential backoff for delays between retries.
    """
    last_exception = None
    for attempt in range(num_retries):
        try:
            return await sandbox.execute_code(*args, **kwargs)
        except httpx.NetworkError as e:
            last_exception = e
            if attempt == num_retries - 1:
                break

            wait_time = 2**attempt
            LOG.warning(
                f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. "
                f"Retrying in {wait_time} second(s)..."
            )
            await asyncio.sleep(wait_time)

    LOG.error(f"All {num_retries} sandbox connection attempts failed.")
    raise last_exception

Currently, the default num_retries is set to 3.

Feel free to take a look at the entire script - nemo_skills/evaluation/evaluator/livecodebench.py.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

nemo_skills/evaluation/evaluator/livecodebench.py (2)
112-114: Narrow the exception handling to avoid masking unexpected errors.

Catching a generic Exception after the specific httpx.NetworkError can suppress unexpected bugs (e.g., AttributeError, TypeError). Consider either:

Catching specific exceptions expected during sandbox initialization (e.g., ValueError, ConnectionError)

Or, logging the full traceback and re-raising if the exception is not one you expect

Example approach:
     except httpx.NetworkError as e:
         LOG.warning(f"Sandbox is unavailable due to a network error: {type(e).__name__} - {e}")
         return False
-    except Exception as e:
-        LOG.warning(f"An unexpected error occurred while checking sandbox availability: {e}")
+    except (ValueError, ConnectionError) as e:
+        LOG.warning(f"Sandbox configuration or connection error: {type(e).__name__} - {e}")
         return False
+    except Exception as e:
+        LOG.error(f"Unexpected error during sandbox availability check: {e}", exc_info=True)
+        raise
288-294: Add a comment explaining why C++ evaluation requires the Python interpreter.

The code forces interpreter = "python" for C++ evaluation, but the rationale isn't immediately clear from the code alone. Adding a brief comment would help future maintainers understand that the evaluation harness itself is written in Python, even though it evaluates C++ code.

Based on past review comments

Apply this diff:
     if eval_config.language == "cpp" and eval_config.interpreter != "python":
+        # The LiveCodeBench evaluation harness is implemented in Python,
+        # so C++ code evaluation must use the Python interpreter to run the harness.
         LOG.warning(
             f"For C++ evaluation, the harness must be run with 'python'. "
             f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'."
         )
         eval_config.interpreter = "python"

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02ad4b3 and aea6c79.

📒 Files selected for processing (1)

nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

nemo_skills/evaluation/evaluator/livecodebench.py (1)

nemo_skills/code_execution/sandbox.py (1)

Sandbox (45-334)

🪛 Ruff (0.14.0)

nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)

112-112: Do not catch blind exception: Exception

(BLE001)

297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: Cursor Bugbot

🔇 Additional comments (1)

nemo_skills/evaluation/evaluator/livecodebench.py (1)

219-220: No changes needed to release_version prefixing. The evaluator’s logic aligns with the dataset prepare scripts: Python samples are loaded via parse_data(release_{version}) and C++ via raw splits (v5_2408_2501/v6_2408_2505), so conditional prefixing is correct.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

nemo_skills/evaluation/evaluator/livecodebench.py (1)
288-293: Consider adding an inline comment.

The logic correctly forces the Python interpreter for C++ evaluation, but an inline comment explaining the rationale would improve maintainability.

For example:
     if eval_config.language == "cpp" and eval_config.interpreter != "python":
+        # C++ evaluation harness in livecodebench must run with Python interpreter
+        # regardless of the configured interpreter for test execution
         LOG.warning(
             f"For C++ evaluation, the harness must be run with 'python'. "
             f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'."
         )
         eval_config.interpreter = "python"

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02ad4b3 and 54a2aae.

📒 Files selected for processing (1)

nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

nemo_skills/evaluation/evaluator/livecodebench.py (1)

nemo_skills/code_execution/sandbox.py (1)

Sandbox (45-334)

🪛 Ruff (0.14.0)

nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)

112-112: Do not catch blind exception: Exception

(BLE001)

297-297: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: Cursor Bugbot

🔇 Additional comments (8)

nemo_skills/evaluation/evaluator/livecodebench.py (8)

27-28: LGTM!

The httpx import is necessary for handling NetworkError exceptions in the new retry logic.

42-42: LGTM!

The updated comment clearly indicates the valid language options.

47-48: LGTM!

The new configuration fields timeout_buffer and num_retries provide necessary control over sandbox operations.

93-115: LGTM!

The sandbox availability check correctly probes connectivity using a lightweight test command with retry logic.

165-180: LGTM!

The updated signature and use of eval_config improve consistency and enable proper retry behavior for package installation.

205-249: LGTM!

The sandbox evaluation flow correctly:

Uses the updated installation helper with retry support

Applies consistent release_version prefixing for Python (matching the local path)

Incorporates timeout buffer and retry logic for resilient execution

266-267: LGTM!

The conditional release_version prefixing now matches the sandbox path, resolving the previous mismatch issue.

295-302: LGTM!

The orchestration logic correctly:

Checks sandbox availability before proceeding

Enforces sandbox requirement for PyPy3

Routes to the appropriate evaluation path based on availability

coderabbitai · 2025-10-13T21:55:27Z

+async def _execute_in_sandbox_with_retries(
+    sandbox: Sandbox,
+    num_retries: int,
+    *args,
+    **kwargs,
+) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+    """
+    Executes a command in the sandbox, retrying on network errors.
+
+    Uses exponential backoff for delays between retries.
+    """
+    last_exception = None
+    for attempt in range(num_retries):
+        try:
+            return await sandbox.execute_code(*args, **kwargs)
+        except httpx.NetworkError as e:
+            last_exception = e
+            if attempt == num_retries - 1:
+                break
+
+            wait_time = 2**attempt
+            LOG.warning(
+                f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. "
+                f"Retrying in {wait_time} second(s)..."
+            )
+            await asyncio.sleep(wait_time)
+
+    LOG.error(f"All {num_retries} sandbox connection attempts failed.")
+    raise last_exception
+


⚠️ Potential issue | 🟡 Minor

Fix incorrect return type annotation.

The return type annotation specifies Tuple[Dict[str, Any], Dict[str, Any]], but sandbox.execute_code returns Tuple[Dict, str] where the second element is a session ID string, not a dictionary.

Apply this diff to correct the type annotation:

async def _execute_in_sandbox_with_retries( sandbox: Sandbox, num_retries: int, *args, **kwargs, -) -> Tuple[Dict[str, Any], Dict[str, Any]]: +) -> Tuple[Dict[str, Any], str]: """ Executes a command in the sandbox, retrying on network errors.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

async def _execute_in_sandbox_with_retries(

sandbox: Sandbox,

num_retries: int,

*args,

**kwargs,

) -> Tuple[Dict[str, Any], Dict[str, Any]]:

"""

Executes a command in the sandbox, retrying on network errors.

Uses exponential backoff for delays between retries.

"""

last_exception = None

for attempt in range(num_retries):

try:

return await sandbox.execute_code(*args, **kwargs)

except httpx.NetworkError as e:

last_exception = e

if attempt == num_retries - 1:

break

wait_time = 2**attempt

LOG.warning(

f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. "

f"Retrying in {wait_time} second(s)..."

)

await asyncio.sleep(wait_time)

LOG.error(f"All {num_retries} sandbox connection attempts failed.")

raise last_exception

async def _execute_in_sandbox_with_retries(

sandbox: Sandbox,

num_retries: int,

*args,

**kwargs,

) -> Tuple[Dict[str, Any], str]:

"""

Executes a command in the sandbox, retrying on network errors.

Uses exponential backoff for delays between retries.

"""

last_exception = None

for attempt in range(num_retries):

try:

return await sandbox.execute_code(*args, **kwargs)

except httpx.NetworkError as e:

last_exception = e

if attempt == num_retries - 1:

break

wait_time = 2**attempt

LOG.warning(

f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. "

f"Retrying in {wait_time} second(s)..."

)

await asyncio.sleep(wait_time)

LOG.error(f"All {num_retries} sandbox connection attempts failed.")

raise last_exception

🤖 Prompt for AI Agents

In nemo_skills/evaluation/evaluator/livecodebench.py around lines 62 to 91 the function _execute_in_sandbox_with_retries has an incorrect return type annotation Tuple[Dict[str, Any], Dict[str, Any]]; update it to reflect the actual sandbox.execute_code return type (Tuple[Dict[str, Any], str]) by changing the annotation to return Tuple[Dict[str, Any], str] and adjust any imports/types if necessary so the second element is typed as a string rather than a dict.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

nemo_skills/evaluation/evaluator/livecodebench.py (1)
62-90: Fix the return type annotation.

The function annotation specifies Tuple[Dict[str, Any], Dict[str, Any]], but sandbox.execute_code returns Tuple[Dict[str, Any], str] where the second element is a session ID string (see nemo_skills/code_execution/sandbox.py:151). This was flagged in a prior review but remains unaddressed.

Apply this diff:
 async def execute_in_sandbox_with_retries(
     sandbox: Sandbox,
     num_retries: int,
     *args,
     **kwargs,
-) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+) -> Tuple[Dict[str, Any], str]:

🧹 Nitpick comments (2)

nemo_skills/evaluation/evaluator/livecodebench.py (1)
288-293: Consider adding an inline comment for clarity.

The warning log at lines 289-292 explains why the interpreter is forced to python for C++ evaluation. An inline comment before line 288 would help future maintainers understand this requirement without checking logs.

Apply this diff:
+    # C++ evaluation requires the LiveCodeBench harness to run with Python,
+    # regardless of the configured interpreter setting
     if eval_config.language == "cpp" and eval_config.interpreter != "python":
nemo_skills/evaluation/evaluator/ojbench.py (1)

26-30: Consider extracting shared utilities to a common module.

The sandbox utilities (execute_in_sandbox_with_retries, is_sandbox_available, sandbox_context) are imported from livecodebench.py. While this works, it creates a dependency where ojbench imports from livecodebench, which couples unrelated evaluators. Consider extracting these utilities to a shared module like nemo_skills/evaluation/evaluator/sandbox_utils.py for better modularity.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54a2aae and e4b3cb8.

📒 Files selected for processing (3)

docs/evaluation/code.md (2 hunks)
nemo_skills/evaluation/evaluator/livecodebench.py (9 hunks)
nemo_skills/evaluation/evaluator/ojbench.py (4 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

nemo_skills/evaluation/evaluator/livecodebench.py (1)

nemo_skills/code_execution/sandbox.py (1)

Sandbox (45-334)

nemo_skills/evaluation/evaluator/ojbench.py (3)

nemo_skills/code_execution/sandbox.py (1)

Sandbox (45-334)

nemo_skills/evaluation/evaluator/code.py (1)

preprocess_code (36-92)

nemo_skills/evaluation/evaluator/livecodebench.py (3)

execute_in_sandbox_with_retries (62-90)

is_sandbox_available (93-114)

sandbox_context (52-59)

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

273-273: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.14.0)

nemo_skills/evaluation/evaluator/livecodebench.py

108-108: Consider moving this statement to an else block

(TRY300)

112-112: Do not catch blind exception: Exception

(BLE001)

297-297: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/ojbench.py

146-146: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Cursor Bugbot
GitHub Check: unit-tests
GitHub Check: pre-commit

🔇 Additional comments (12)

docs/evaluation/code.md (2)

232-232: LGTM!

The Pypy3 evaluation instructions clearly state the sandbox requirement and provide the correct flags. The guidance aligns with the evaluator implementation that enforces sandbox availability for pypy3.

270-275: LGTM!

The livecodebench-cpp section is complete and accurate. It correctly references the cpp benchmark entrypoint and provides clear data preparation and evaluation instructions. The typo has been fixed, and the guidance mirrors the Python livecodebench workflow appropriately.

nemo_skills/evaluation/evaluator/livecodebench.py (6)

47-48: LGTM!

The new configuration fields timeout_buffer and num_retries provide sensible defaults (60 seconds and 3 retries) for managing sandbox execution reliability. These parameters are appropriately used throughout the evaluator to handle network instability.

93-114: LGTM!

The sandbox availability check correctly uses generated_code="true" and handles both network errors and unexpected exceptions gracefully. The broad exception handler at line 112 is appropriate for an availability probe, where any failure should be treated as "unavailable" rather than propagated.

165-180: LGTM!

The package installation function correctly selects the appropriate pip command and git URL based on the interpreter, uses retry logic for network resilience, and provides clear logging. The error handling is appropriate.

205-249: LGTM!

The sandboxed evaluation flow correctly installs packages, preprocesses files, applies the release_ prefix for Python language, and uses retry logic with appropriate timeouts. The timeout calculation (timeout * len(samples) + timeout_buffer) provides adequate buffer for evaluation completion.

252-279: LGTM!

The local evaluation flow now correctly handles release_version consistently with the sandboxed path. Both flows add the release_ prefix when language == "python", ensuring the evaluate function receives the correct release tag format.

282-302: LGTM!

The orchestration logic correctly validates interpreter settings, enforces sandbox requirements for pypy3, checks sandbox availability, and routes evaluation to the appropriate implementation. The C++ interpreter forcing is well-explained in the log warning.

nemo_skills/evaluation/evaluator/ojbench.py (4)

40-41: LGTM!

The timeout and retry configuration fields match the pattern established in livecodebench.py, providing consistent reliability handling across evaluators.

44-67: LGTM!

The package installation function correctly uses retry logic for both cloning and installation, provides clear error logging, and follows the same pattern as the livecodebench evaluator for consistency.

70-137: LGTM!

The async evaluation flow correctly manages sandbox lifecycle, installs packages with retry logic, and executes evaluation with appropriate timeout buffering. The error handling and result processing logic is sound.

139-146: LGTM!

The orchestration correctly checks sandbox availability and enforces the sandbox requirement for OJBench evaluation. Unlike LiveCodeBench, OJBench has no local fallback, which is appropriate given that the evaluation harness requires sandbox isolation for accurate results.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: dgitman <dgitman@nvidia.com>

adding eval support for lcb-c++

7690e27

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

wasiahmad marked this pull request as draft October 3, 2025 08:45

setting language default

1b8fe33

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

wasiahmad changed the title ~~Evaluation on LiveCodeBench-C++~~ Evaluation on LiveCodeBench-Cpp Oct 3, 2025

wasiahmad added 4 commits October 3, 2025 02:08

prefix for release version is only needed for python

4931f45

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

for c++, we need python interpretor for corresponding eval harness

07deb25

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

code eval docs updated

7e6720a

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

Merge branch 'main' into livecodebench_cpp

0c5ca9f

wasiahmad marked this pull request as ready for review October 3, 2025 21:22

wasiahmad requested a review from Kipok October 3, 2025 21:23

wasiahmad enabled auto-merge (squash) October 3, 2025 21:23

coderabbitai Bot reviewed Oct 3, 2025

View reviewed changes

Comment thread docs/evaluation/code.md

Comment thread nemo_skills/dataset/livecodebench-cpp/prepare.py

minor eval docs update

ffdad7d

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai Bot reviewed Oct 3, 2025

View reviewed changes

minor eval docs update

ee346da

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai Bot reviewed Oct 3, 2025

View reviewed changes

wasiahmad added 9 commits October 3, 2025 16:25

debugging generate with sandbox

84a8a0e

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

debugging generate with sandbox

535c8af

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

debugging generate with sandbox

e29d05d

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

debugging generate with sandbox

d3f83ea

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

debugging generate with sandbox

ac2dbda

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

debugging generate with sandbox

a846042

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

debugging generate with sandbox

dc91b99

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

getting back all changes

05b94d4

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

setting KEEP_MOUNTS_FOR_SANDBOX = True for lcb eval

6547325

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai Bot reviewed Oct 4, 2025

View reviewed changes

Comment thread nemo_skills/pipeline/utils/eval.py

Merge remote-tracking branch 'origin/main' into livecodebench_cpp

e3cc020

gwarmstrong and others added 4 commits October 9, 2025 09:45

Fix MCP tests (#916)

a8a0952

Signed-off-by: George Armstrong <georgea@nvidia.com>

Addding BFCL headers (#917)

03ff6a9

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>

updated year in copyright message

9c1060b

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

lcb-cpp eval without sandbox

1bc7c26

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai Bot reviewed Oct 9, 2025

View reviewed changes

wasiahmad added 3 commits October 9, 2025 09:57

minor doc update

f5ee7b6

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

sanbodx use logic updated with retries

d8db2c7

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

resolving conflicts

8b48105

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai Bot reviewed Oct 12, 2025

View reviewed changes

Comment thread nemo_skills/evaluation/evaluator/livecodebench.py

coderabbitai Bot reviewed Oct 12, 2025

View reviewed changes

minor bug fix

f311171

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

This comment was marked as outdated.

Sign in to view

coderabbitai Bot reviewed Oct 12, 2025

View reviewed changes

removing unwanted print statement

bd19311

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

wasiahmad requested a review from Kipok October 13, 2025 04:01

wasiahmad added 4 commits October 13, 2025 08:41

Merge branch 'main' into livecodebench_cpp

bf31d28

Merge branch 'main' into livecodebench_cpp

02ad4b3

added a comment

aea6c79

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

Merge branch 'main' into livecodebench_cpp

54a2aae

coderabbitai Bot reviewed Oct 13, 2025

View reviewed changes

ojbench uses tries when using sandbox

e4b3cb8

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai Bot reviewed Oct 13, 2025

View reviewed changes

Kipok approved these changes Oct 13, 2025

View reviewed changes

wasiahmad merged commit ad8e8f8 into main Oct 13, 2025
7 of 8 checks passed

wasiahmad deleted the livecodebench_cpp branch October 13, 2025 23:07

dgtm777 pushed a commit that referenced this pull request Oct 29, 2025

Evaluation on LiveCodeBench-Cpp (#885)

f80fa7c

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai Bot mentioned this pull request Feb 5, 2026

LCB generic prompting #1215

Merged

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Evaluation on LiveCodeBench-Cpp (#885)

09b96dd

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: dgitman <dgitman@nvidia.com>

	The behavior is that it keeps slurm jobs from existing and we waste gpus.
	The behavior is that it keeps slurm jobs from exiting and we waste gpus.

Conversation

wasiahmad commented Oct 3, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Oct 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

wasiahmad commented Oct 3, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wasiahmad commented Oct 4, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

wasiahmad commented Oct 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. We check if Sandbox is available for LCB evaluation.

2. Core logic when to use Sandbox

3. Execute code in Sandbox with a number of retries

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 13, 2025

wasiahmad commented Oct 3, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Oct 3, 2025 •

edited

Loading

wasiahmad commented Oct 12, 2025 •

edited

Loading