Evaluation on LiveCodeBench-Cpp#885
Conversation
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAdds a LiveCodeBench-CPP dataset and prepare script, adds sandbox availability probes and httpx-backed retry helpers to LiveCodeBench and OJBench evaluators, tightens a sandbox-mounts warning condition, updates docs to require sandbox for some runs and add C++ guidance, bumps license years, and removes two sandbox constants. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant Orchestrator as eval_livecodebench / eval_ojbench
participant Sandbox as Sandbox API
participant Local as Local Runner
User->>Orchestrator: Start evaluation(config)
Orchestrator->>Sandbox: is_sandbox_available(sandbox_config)
alt Sandbox available
Note over Orchestrator,Sandbox: Install + run via sandbox (retries)
Orchestrator->>Sandbox: _install_packages_in_sandbox(sandbox, eval_config)
Orchestrator->>Sandbox: execute_in_sandbox_with_retries(cmd, timeout=base+timeout_buffer, num_retries)
Sandbox-->>Orchestrator: Results / logs
else Sandbox unavailable
Note over Orchestrator,Local: Fallback to local execution or error (per evaluator)
Orchestrator->>Local: Run evaluation locally (adjust release_version handling)
Local-->>Orchestrator: Results / logs
end
Orchestrator-->>User: Aggregated results
sequenceDiagram
autonumber
participant Prep as prepare.py (livecodebench-cpp)
participant HF as load_dataset
participant FS as Filesystem
Prep->>HF: load_dataset(split)
HF-->>Prep: Dataset
Prep->>Prep: clean_data (cast fields, add task_id, build question)
alt starter_code present
Prep->>Prep: format prompt with starter code template (tabs preserved)
else no starter_code
Prep->>Prep: format prompt without starter code
end
Prep->>FS: Write JSONL per split (adds subset_for_metrics, release_version)
FS-->>Prep: Done
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
|
@Kipok I have tested this PR, and it works fine. The only issue that needs to be addressed is the use of Please suggest how to resolve it. Thanks! |
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
docs/evaluation/code.md(1 hunks)nemo_skills/dataset/livecodebench-cpp/__init__.py(1 hunks)nemo_skills/dataset/livecodebench-cpp/prepare.py(1 hunks)nemo_skills/evaluation/evaluator/livecodebench.py(4 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
266-266: Link text should be descriptive
(MD059, descriptive-link-text)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/evaluation/code.md (1)
265-267: Use descriptive link text for the dataset reference.Replace the
"here"anchor with descriptive text (e.g., “LiveCodeBench-CPP dataset on Hugging Face”) to satisfy MD059 and make the documentation clearer.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/evaluation/code.md(1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
266-266: Link text should be descriptive
(MD059, descriptive-link-text)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/evaluation/code.md (1)
265-267: Use descriptive link text for the dataset URL.Markdownlint (rule MD059) flags “here” as non-descriptive. Please change the anchor text to something like “LiveCodeBench-CPP dataset” so readers know what the link targets without context.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/evaluation/code.md(2 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
266-266: Link text should be descriptive
(MD059, descriptive-link-text)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_skills/dataset/livecodebench/__init__.py(1 hunks)nemo_skills/pipeline/utils/eval.py(1 hunks)
🔇 Additional comments (1)
nemo_skills/dataset/livecodebench/__init__.py (1)
20-20: LGTM: Sandbox requirement properly declared.Marking the existing LiveCodeBench dataset as requiring sandbox evaluation is appropriate for code execution benchmarks and aligns with the PR's sandbox-aware evaluation framework.
As per my understanding, I think we need to have the following in dataset In that case, the issue gets resolved. |
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
nemo_skills/pipeline/utils/exp.py (1)
528-548: Propagate QoS to the sandbox executorWhen
--with-sandboxis used together with--qos, the sandbox leg still callsget_executor(...)without forwardingqos. That leg will therefore run with the cluster default QoS (or fail on clusters that require an explicit QoS), while the main/server legs honor the user-specified value. Please passqos=qosinto thisget_executorcall so all heterogenous groups share the same Slurm QoS.nemo_skills/inference/model/parallel_thinking.py (1)
344-352: Wire the configured endpoint_type into downstream requests.
ParallelThinkingConfig.endpoint_typeis never propagated, so flipping it from the default has no effect—all calls still rely on implicit inference. Please defaultkwargs["endpoint_type"]toself.cfg.endpoint_typebefore we fan out to_get_multiple_solutionsand_generate_parallel_thinking_contraction.- result = {} # Result dictionary - local_random = random.Random(kwargs.get("random_seed", 0)) + kwargs = dict(kwargs) + kwargs.setdefault("endpoint_type", self.cfg.endpoint_type) + + result = {} # Result dictionary + local_random = random.Random(kwargs.get("random_seed", 0))
🧹 Nitpick comments (6)
nemo_skills/dataset/ruler/prepare.py (1)
32-34: Update the comment to reflect the new parameter.The comment references "completions api" but the parameter has been changed to
endpoint_type=text. Consider updating the comment for clarity.Apply this diff to update the comment:
- # ruler is adding prefix for assistant response, so it has to go through completions api + # ruler is adding prefix for assistant response, so it must use text endpoint type "++start_assistant_response_key=generation " "++inference.endpoint_type=text "nemo_skills/inference/patch_litellm_logging.py (2)
61-73: Consider broader exception handling for robustness.The current implementation only catches
ModuleNotFoundError, which handles the case when litellm is not installed. However, if litellm's internal structure changes (e.g., the module path or attribute names change),ImportErrororAttributeErrorcould be raised instead, causing the patch to fail unexpectedly.Apply this diff to handle additional exception types:
try: import litellm.litellm_core_utils.logging_worker as logging_worker_module logging_worker_module.LoggingWorker = NoOpLoggingWorker logging_worker_module.GLOBAL_LOGGING_WORKER = NoOpLoggingWorker() - except ModuleNotFoundError: - # Ensure compatibility with different litellm versions + except (ModuleNotFoundError, ImportError, AttributeError): + # Ensure compatibility with different litellm versions or when litellm is unavailable pass
61-73: Document when and where this patch should be applied.The function
patch_litellm_logging_worker()is defined but there's no indication in this file about when or where it should be called (e.g., during application startup, before using litellm APIs, etc.). This could lead to confusion about proper usage.Consider adding a module-level docstring or expanding the function docstring with usage guidance, for example:
""" Patches the litellm LoggingWorker to disable its functionality. This prevents any logging worker from keeping the client alive. Usage: Call this function once during application initialization, before any litellm API calls are made: >>> from nemo_skills.inference.patch_litellm_logging import patch_litellm_logging_worker >>> patch_litellm_logging_worker() """nemo_skills/training/openrlhf/math_reward.py (1)
61-63: Consider concurrent execution for async calls.The current implementation calls
asyncio.run()sequentially for each judge prompt, creating a new event loop for each item. This is inefficient compared to concurrent execution.For better performance, consider gathering all async calls:
- outputs = [ - asyncio.run(llm.generate_async(prompt=jp, stop_phrases=prompt.stop_phrases)) for jp in judge_prompts - ] + async def generate_all(): + return await asyncio.gather(*[ + llm.generate_async(prompt=jp, stop_phrases=prompt.stop_phrases) for jp in judge_prompts + ]) + outputs = asyncio.run(generate_all())Note: Given the comment at line 60 indicating this module is no longer actively supported, this optimization may be deferred.
tests/test_code_execution.py (1)
113-113: Minor: Consider fixing the grammatical error in the timeout message.The test correctly validates the new timeout message format. However, the message "Execution timed out after 1 seconds" has a grammatical error (should be "1 second"). While the test is correct, consider updating the implementation to use proper singular/plural forms.
Example fix for the implementation:
# In the timeout handling code timeout_msg = f"Execution timed out after {timeout} second{'s' if timeout != 1 else ''}\n"nemo_skills/inference/model/tool_call.py (1)
94-96: Consider enforcing zip strictness
asyncio.gatherguarantees the same number of results you scheduled, so addingstrict=Truewould automatically flag any future divergence in tool-call bookkeeping instead of silently truncating:for tool_call, tool_result in zip(tool_calls, tool_results, strict=True).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (71)
.github/workflows/copyright-check.yml(1 hunks)README.md(1 hunks)docs/basics/inference.md(3 hunks)docs/basics/prompt-format.md(1 hunks)docs/evaluation/index.md(2 hunks)docs/evaluation/long-context.md(1 hunks)docs/evaluation/multilingual.md(3 hunks)docs/index.md(1 hunks)docs/pipelines/generation.md(8 hunks)docs/releases/openmathinstruct2/dataset.md(6 hunks)docs/releases/openmathreasoning/evaluation.md(2 hunks)docs/tutorials/posts/gpt-oss-python.md(2 hunks)nemo_skills/code_execution/local_sandbox/local_sandbox_server.py(4 hunks)nemo_skills/code_execution/sandbox.py(5 hunks)nemo_skills/conversion/save_sharded_state.py(0 hunks)nemo_skills/dataset/bfcl_v3/utils.py(1 hunks)nemo_skills/dataset/flores200/__init__.py(1 hunks)nemo_skills/dataset/flores200/prepare.py(1 hunks)nemo_skills/dataset/livecodebench-cpp/prepare.py(1 hunks)nemo_skills/dataset/livecodebench/__init__.py(1 hunks)nemo_skills/dataset/livecodebench/prepare.py(1 hunks)nemo_skills/dataset/math/fix_ref_solns.py(1 hunks)nemo_skills/dataset/ojbench/__init__.py(1 hunks)nemo_skills/dataset/ojbench/prepare.py(1 hunks)nemo_skills/dataset/ruler/prepare.py(1 hunks)nemo_skills/dataset/wmt24pp/__init__.py(1 hunks)nemo_skills/dataset/wmt24pp/prepare.py(1 hunks)nemo_skills/evaluation/evaluator/livecodebench.py(4 hunks)nemo_skills/evaluation/metrics/map_metrics.py(2 hunks)nemo_skills/evaluation/metrics/translation_metrics.py(1 hunks)nemo_skills/inference/chat_interface/chat_service.py(2 hunks)nemo_skills/inference/eval/bfcl.py(7 hunks)nemo_skills/inference/eval/bfcl_utils.py(1 hunks)nemo_skills/inference/eval/scicode.py(1 hunks)nemo_skills/inference/factory.py(1 hunks)nemo_skills/inference/generate.py(15 hunks)nemo_skills/inference/merge_chunks.py(1 hunks)nemo_skills/inference/model/base.py(7 hunks)nemo_skills/inference/model/code_execution.py(8 hunks)nemo_skills/inference/model/context_retry.py(13 hunks)nemo_skills/inference/model/openai.py(1 hunks)nemo_skills/inference/model/parallel_thinking.py(2 hunks)nemo_skills/inference/model/tool_call.py(6 hunks)nemo_skills/inference/model/utils.py(3 hunks)nemo_skills/inference/model/vllm.py(1 hunks)nemo_skills/inference/patch_litellm_logging.py(1 hunks)nemo_skills/mcp/adapters.py(4 hunks)nemo_skills/mcp/config.py(1 hunks)nemo_skills/mcp/tool_manager.py(1 hunks)nemo_skills/mcp/tool_providers.py(1 hunks)nemo_skills/pipeline/eval.py(3 hunks)nemo_skills/pipeline/generate.py(2 hunks)nemo_skills/pipeline/megatron_lm/train.py(2 hunks)nemo_skills/pipeline/nemo_rl/grpo.py(3 hunks)nemo_skills/pipeline/nemo_rl/sft.py(3 hunks)nemo_skills/pipeline/prepare_data.py(2 hunks)nemo_skills/pipeline/run_cmd.py(2 hunks)nemo_skills/pipeline/start_server.py(2 hunks)nemo_skills/pipeline/utils/exp.py(5 hunks)nemo_skills/pipeline/verl/ppo.py(2 hunks)nemo_skills/prompt/config/multilingual/__init__.py(1 hunks)nemo_skills/prompt/config/multilingual/segment-translation.yaml(1 hunks)nemo_skills/prompt/few_shot_examples/lean4.py(1 hunks)nemo_skills/prompt/utils.py(4 hunks)nemo_skills/training/merge_packed_data.py(1 hunks)nemo_skills/training/openrlhf/math_reward.py(2 hunks)requirements/main.txt(2 hunks)tests/slurm-tests/gpt_oss_python_aime25/check_results.py(2 hunks)tests/slurm-tests/gpt_oss_python_aime25/run_test.py(1 hunks)tests/test_code_execution.py(1 hunks)tests/test_mcp_clients.py(3 hunks)
💤 Files with no reviewable changes (1)
- nemo_skills/conversion/save_sharded_state.py
✅ Files skipped from review due to trivial changes (17)
- README.md
- nemo_skills/inference/eval/bfcl_utils.py
- nemo_skills/inference/factory.py
- docs/index.md
- nemo_skills/mcp/tool_providers.py
- nemo_skills/dataset/ojbench/init.py
- nemo_skills/dataset/math/fix_ref_solns.py
- nemo_skills/prompt/few_shot_examples/lean4.py
- docs/basics/prompt-format.md
- docs/evaluation/index.md
- docs/evaluation/long-context.md
- nemo_skills/prompt/config/multilingual/init.py
- nemo_skills/dataset/ojbench/prepare.py
- nemo_skills/dataset/bfcl_v3/utils.py
- nemo_skills/dataset/livecodebench/prepare.py
- nemo_skills/mcp/config.py
- nemo_skills/inference/merge_chunks.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/livecodebench-cpp/prepare.py
🧰 Additional context used
🧬 Code graph analysis (20)
nemo_skills/inference/model/openai.py (2)
nemo_skills/inference/model/base.py (2)
_build_responses_request_params(209-210)_build_chat_request_params(202-203)nemo_skills/inference/model/vllm.py (2)
_build_responses_request_params(142-148)_build_chat_request_params(102-140)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
TranslationMetrics(22-81)
tests/test_mcp_clients.py (1)
nemo_skills/mcp/tool_manager.py (3)
list_tools(51-52)ToolManager(64-164)list_all_tools(111-151)
nemo_skills/inference/model/code_execution.py (1)
nemo_skills/inference/model/base.py (1)
EndpointType(38-41)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/inference/model/base.py (1)
EndpointType(38-41)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
write_data_to_file(23-36)main(39-47)
nemo_skills/inference/eval/bfcl.py (3)
nemo_skills/inference/generate.py (6)
generate(676-703)generate(711-716)GenerateSolutionsConfig(81-233)GenerationTask(240-703)InferenceConfig(59-77)generate_with_semaphore(554-562)nemo_skills/inference/model/base.py (1)
EndpointType(38-41)nemo_skills/prompt/utils.py (1)
get_token_count(310-369)
tests/slurm-tests/gpt_oss_python_aime25/check_results.py (1)
tests/slurm-tests/utils.py (3)
assert_all(46-58)load_json(18-21)soft_assert(36-43)
nemo_skills/inference/model/vllm.py (2)
nemo_skills/inference/model/base.py (2)
_build_responses_request_params(209-210)_build_chat_request_params(202-203)nemo_skills/inference/model/openai.py (2)
_build_responses_request_params(155-160)_build_chat_request_params(84-153)
nemo_skills/mcp/adapters.py (1)
nemo_skills/inference/model/base.py (1)
EndpointType(38-41)
nemo_skills/training/openrlhf/math_reward.py (1)
nemo_skills/inference/model/base.py (1)
generate_async(213-315)
nemo_skills/inference/chat_interface/chat_service.py (2)
nemo_skills/inference/model/base.py (1)
generate_async(213-315)nemo_skills/inference/model/code_execution.py (1)
generate_async(241-300)
nemo_skills/inference/model/base.py (7)
nemo_skills/inference/patch_litellm_logging.py (1)
patch_litellm_logging_worker(61-73)nemo_skills/utils.py (1)
get_logger_name(130-134)nemo_skills/inference/model/context_retry.py (2)
ContextLimitRetryConfig(114-157)with_context_retry(160-180)nemo_skills/inference/model/utils.py (3)
ServerTokenizer(54-83)WrapperAutoTokenizer(86-102)trim_after_stop_phrases(27-33)nemo_skills/inference/model/openai.py (3)
_build_responses_request_params(155-160)_build_chat_request_params(84-153)_build_completion_request_params(64-82)nemo_skills/inference/model/vllm.py (3)
_build_responses_request_params(142-148)_build_chat_request_params(102-140)_build_completion_request_params(63-100)nemo_skills/inference/model/tool_call.py (1)
generate_async(98-160)
nemo_skills/inference/generate.py (3)
nemo_skills/inference/model/base.py (2)
EndpointType(38-41)generate_async(213-315)nemo_skills/evaluation/evaluator/__init__.py (2)
get_evaluator_class(98-108)supports_single_eval(111-117)nemo_skills/prompt/utils.py (1)
get_token_count(310-369)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
BaseMetrics(23-434)as_float(449-452)
nemo_skills/inference/model/tool_call.py (2)
nemo_skills/mcp/adapters.py (3)
format_tool_list_by_endpoint_type(51-76)format_tool_response_by_endpoint_type(96-111)get_tool_details_by_endpoint_type(114-124)nemo_skills/inference/model/base.py (1)
EndpointType(38-41)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
write_data_to_file(23-38)main(41-54)
nemo_skills/inference/eval/scicode.py (1)
nemo_skills/inference/generate.py (1)
process_single_datapoint(527-552)
nemo_skills/inference/model/context_retry.py (1)
nemo_skills/inference/model/utils.py (5)
is_context_window_exceeded_error(36-51)encode(61-74)encode(93-98)ServerTokenizer(54-83)WrapperAutoTokenizer(86-102)
nemo_skills/evaluation/evaluator/livecodebench.py (2)
nemo_skills/code_execution/sandbox.py (2)
Sandbox(45-334)get_sandbox(419-422)nemo_skills/evaluation/evaluator/code.py (1)
preprocess_code(36-92)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/multilingual.md
76-76: Link text should be descriptive
(MD059, descriptive-link-text)
149-149: Link text should be descriptive
(MD059, descriptive-link-text)
🪛 Ruff (0.13.3)
tests/test_mcp_clients.py
264-264: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
nemo_skills/inference/eval/bfcl.py
139-139: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
140-140: Use raise without specifying exception name
Remove exception name
(TRY201)
272-272: Use raise without specifying exception name
Remove exception name
(TRY201)
tests/slurm-tests/gpt_oss_python_aime25/check_results.py
24-24: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
54-54: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/mcp/adapters.py
76-76: Avoid specifying long messages outside the exception class
(TRY003)
111-111: Avoid specifying long messages outside the exception class
(TRY003)
123-123: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/inference/model/base.py
298-298: Avoid specifying long messages outside the exception class
(TRY003)
301-301: Consider moving this statement to an else block
(TRY300)
310-310: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
313-313: Use raise without specifying exception name
Remove exception name
(TRY201)
452-452: Unused method argument: kwargs
(ARG002)
509-509: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/inference/generate.py
229-229: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/inference/model/tool_call.py
73-73: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
95-95: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/inference/model/context_retry.py
192-192: Use raise without specifying exception name
Remove exception name
(TRY201)
207-207: Consider moving this statement to an else block
(TRY300)
208-208: Do not catch blind exception: Exception
(BLE001)
226-226: Use raise without specifying exception name
Remove exception name
(TRY201)
241-241: Consider moving this statement to an else block
(TRY300)
242-242: Do not catch blind exception: Exception
(BLE001)
nemo_skills/evaluation/evaluator/livecodebench.py
67-67: Avoid specifying long messages outside the exception class
(TRY003)
72-72: Avoid specifying long messages outside the exception class
(TRY003)
128-128: Consider moving this statement to an else block
(TRY300)
134-134: subprocess call: check for execution of untrusted input
(S603)
138-138: Consider moving this statement to an else block
(TRY300)
140-140: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
155-155: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
198-198: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
219-219: Avoid specifying long messages outside the exception class
(TRY003)
221-221: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (24)
requirements/main.txt (2)
29-29: LGTM on new dependencies.The additions of
langcodesandsacrebleualign well with the PR's goal of adding LiveCodeBench-CPP evaluation support and likely support translation/multilingual features and BLEU metrics mentioned in the AI summary.Also applies to: 39-39
30-30: litellm 1.77.5 has no known security advisories. All reported vulnerabilities were patched by versions ≤1.61.15; 1.77.5 is unaffected.nemo_skills/dataset/livecodebench/__init__.py (2)
1-1: LGTM! Copyright year updated.The copyright year update from 2024 to 2025 is appropriate.
15-20: Removal of sandbox constants is safe
No references to REQUIRES_SANDBOX or KEEP_MOUNTS_FOR_SANDBOX exist in the Python LiveCodeBench evaluator; functionality remains unchanged.nemo_skills/inference/model/context_retry.py (11)
30-110: LGTM! Enhanced error parsing covers multiple formats.The expanded pattern matching now handles 5 different context window error formats, including the new
message_tokens_overflowcase (Pattern 5) when max_tokens goes negative. The progressive fallback through patterns is well-structured.
183-214: LGTM! Improved error handling with better error categorization.The restructured soft-fail logic now:
- Clearly distinguishes context window errors from other exceptions
- Returns
error_reason="context_window_exceeded"for better error tracking- Attempts retry with modified kwargs when strategy is configured
Note: The Ruff warnings (TRY201, TRY300, BLE001) are false positives. The
raise errorpreserves the original exception, the nested try-catch is appropriate for retry logic, and catchingExceptionis necessary to handle any error from the retry attempt.
217-248: LGTM! Sync version mirrors async improvements.The sync handler implements the same improved error handling as the async version. The code duplication is appropriate given the fundamental difference between sync and async execution.
Note: Same Ruff warnings apply here as in the async version - these are false positives.
251-288: LGTM! Added essential guard for prompt overflow.The early exit at lines 271-274 correctly prevents attempting to reduce generation tokens when the prompt itself overflows the context limit. This saves a futile retry attempt.
291-317: LGTM! Safety margin prevents edge-case failures.The
safe_remaining_budgetcalculation (lines 299-306) adds a necessary buffer to account for special tokens, preventing failures at exact context boundaries. The logic correctly checks if messages already consume the available space.
320-362: LGTM! Direct calculation for overflow cases is more efficient.When
message_tokens_overflowis present (lines 328-334), the code directly calculates the required prompt reduction instead of relying on indirect token budget calculations. This is both clearer and more accurate.
365-382: LGTM! Consistent kwargs access pattern.Accessing the prompt via
kwargs["prompt"](line 372) maintains consistency with how other parameters are retrieved throughout the module.
385-408: LGTM! Proper handling of tools parameter for chat templates.Extracting
toolsfrom kwargs (line 394) and passing it to the message trimming functions ensures accurate tokenization when function calling is involved.
411-500: LGTM! Tools parameter enables accurate tokenization.Both trimming functions now accept and pass the
toolsparameter totokenizer.encode(lines 424, 468), ensuring correct token counts when chat templates include function calling definitions.
525-531: LGTM! Error categorization improves observability.Adding the
error_reasonparameter (line 525) enables structured error tracking. Callers can now identify context window errors specifically viaerror="context_window_exceeded"while preserving detailed diagnostics indetailed_error.
121-121: Verify special token budget adequacy. Automated search didn’t locate any tokenizer or prompt-template definitions—manually confirm that reducingnum_special_tokens_budgetto 50 still covers the maximum special-token overhead for all supported model prompt templates and tokenizers.nemo_skills/pipeline/generate.py (1)
107-107: LGTM! QoS parameter integration is consistent.The
qosparameter has been properly added and forwarded through the task creation pipeline. This change is part of a consistent pattern across multiple pipeline files (generate.py, megatron_lm/train.py, verl/ppo.py, nemo_rl/grpo.py, nemo_rl/sft.py, eval.py) enabling Slurm QoS specification for job submissions.Also applies to: 314-314
tests/slurm-tests/gpt_oss_python_aime25/check_results.py (1)
34-76: LGTM! Timeout validation logic is well-structured.The new timeout checking functionality correctly:
- Parses and aggregates timeout counts from JSONL output files
- Validates per-file and total timeout thresholds using soft assertions
- Extracts seeds from filenames with proper error handling
The implementation integrates cleanly with the existing test workflow.
docs/basics/inference.md (1)
37-37: LGTM! Documentation correctly updated for async generation.The documentation properly reflects the migration from synchronous to asynchronous generation, with consistent updates across all examples. The
asyncioimport is correctly added, and the usage pattern is clear.Also applies to: 43-43, 73-73, 85-85
nemo_skills/prompt/config/multilingual/segment-translation.yaml (1)
1-3: LGTM! Translation prompt config is well-structured.The prompt template is clear and concise, with appropriate placeholders for text and target language. The format aligns with the dataset usage patterns in FLORES200 and WMT24PP.
docs/releases/openmathreasoning/evaluation.md (1)
107-107: LGTM! Documentation correctly updated for endpoint_type parameter.The change from
++use_completions_api=Trueto++inference.endpoint_type=textis consistent across both examples and aligns with the broader endpoint-type migration in the codebase.Also applies to: 130-130
nemo_skills/evaluation/metrics/map_metrics.py (1)
36-36: LGTM! Translation metrics correctly integrated.The TranslationMetrics import and registration in METRICS_MAP follow the established pattern. The implementation properly extends BaseMetrics with BLEU-based evaluation.
Also applies to: 60-60
nemo_skills/inference/chat_interface/chat_service.py (1)
17-17: LGTM! Async migration correctly implemented.The change from
generate_synctoasyncio.run(llm.generate_async(...))maintains the synchronous interface ofstream_chatwhile using the new async generation pathway. This is appropriate since the function signature returnsIterator[str](notAsyncIterator), indicating it's designed to be called from synchronous contexts.Also applies to: 60-69
nemo_skills/inference/model/vllm.py (1)
142-148: LGTM! Implementation follows the established pattern.The
_build_responses_request_paramsmethod correctly delegates to_build_chat_request_paramsand renames themessagesparameter toinput. Note that this differs from the OpenAI implementation which also renamesmax_completion_tokens→max_output_tokens. If this reflects actual API differences between VLLM and OpenAI responses endpoints, the implementation is correct.nemo_skills/inference/model/openai.py (1)
155-160: Ensure OpenAI Responses API mapping:_build_responses_request_paramsrenames bothmessages→inputandmax_completion_tokens→max_output_tokens, unlike VLLMModel. Verify this aligns with the official OpenAI Responses endpoint specification.
| def main(args): | ||
| datasets = {} | ||
| for lang in args.target_languages: | ||
| datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"] | ||
|
|
||
| data_dir = Path(__file__).absolute().parent | ||
| data_dir.mkdir(exist_ok=True) | ||
| output_file = data_dir / f"{args.split}.jsonl" | ||
| write_data_to_file(output_file, datasets, tgt_languages=args.target_languages) |
There was a problem hiding this comment.
Respect the requested split when loading WMT24PP.
We expose --split (defaulting to test), yet we unconditionally fetch the train split via ["train"]. That silently pipes training data into evaluation, invalidating metrics. Please honor the CLI flag.
- datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+ datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}", split=args.split)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def main(args): | |
| datasets = {} | |
| for lang in args.target_languages: | |
| datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"] | |
| data_dir = Path(__file__).absolute().parent | |
| data_dir.mkdir(exist_ok=True) | |
| output_file = data_dir / f"{args.split}.jsonl" | |
| write_data_to_file(output_file, datasets, tgt_languages=args.target_languages) | |
| def main(args): | |
| datasets = {} | |
| for lang in args.target_languages: | |
| datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}", split=args.split) | |
| data_dir = Path(__file__).absolute().parent | |
| data_dir.mkdir(exist_ok=True) | |
| output_file = data_dir / f"{args.split}.jsonl" | |
| write_data_to_file(output_file, datasets, tgt_languages=args.target_languages) |
🤖 Prompt for AI Agents
nemo_skills/dataset/wmt24pp/prepare.py around lines 39 to 47: the code currently
unconditionally selects the "train" split when loading "google/wmt24pp",
ignoring the CLI --split flag; change the dataset loading to use the requested
split (e.g., use load_dataset("google/wmt24pp", f"en-{lang}", split=args.split)
or index with [args.split]) so the script actually loads the specified split
(default test) before writing to the output file.
| evaluate_fn( | ||
| custom_output_file=jsonl_file, | ||
| release_version=f"release_{release_version}", | ||
| k_list=[1], | ||
| language=eval_config.language, | ||
| test_file=eval_config.test_file, | ||
| num_process_evaluate=eval_config.num_processes, | ||
| timeout=eval_config.timeout, | ||
| ) |
There was a problem hiding this comment.
Fix the mismatched release_version in the local path.
_preprocess_and_validate_file already returns the exact release_version string stored in the samples (e.g., "release_cpp_202409"). The sandbox flow forwards it unchanged, but the local flow prepends "release_" again, resulting in "release_release_cpp_202409". That value doesn’t exist in the LiveCodeBench package, so local evaluations will now fail with a missing-release error. Please drop the extra prefix (or normalize both paths the same way).
Apply this diff:
- evaluate_fn(
- custom_output_file=jsonl_file,
- release_version=f"release_{release_version}",
+ evaluate_fn(
+ custom_output_file=jsonl_file,
+ release_version=release_version,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| evaluate_fn( | |
| custom_output_file=jsonl_file, | |
| release_version=f"release_{release_version}", | |
| k_list=[1], | |
| language=eval_config.language, | |
| test_file=eval_config.test_file, | |
| num_process_evaluate=eval_config.num_processes, | |
| timeout=eval_config.timeout, | |
| ) | |
| evaluate_fn( | |
| custom_output_file=jsonl_file, | |
| release_version=release_version, | |
| k_list=[1], | |
| language=eval_config.language, | |
| test_file=eval_config.test_file, | |
| num_process_evaluate=eval_config.num_processes, | |
| timeout=eval_config.timeout, | |
| ) |
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/livecodebench.py around lines 201 to 209,
the local call to evaluate_fn is prepending "release_" to release_version,
causing "release_release_..." mismatches with samples; remove the extra prefix
and pass the release_version string returned by _preprocess_and_validate_file
directly (or normalize both sandbox and local flows to the same format) so
evaluate_fn receives the exact release tag present in the LiveCodeBench package.
| tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type) | ||
|
|
||
| ## | ||
| # TODO(sanyamk): Not all tool arguments might necessarily be in JSON format. | ||
| # Kept here to handle errors for now. | ||
| try: | ||
| tool_args = json.loads(tool_args) | ||
| except json.decoder.JSONDecodeError as e: | ||
| LOG.error(f"Tool arguments are not in JSON format: {tool_args}") | ||
| LOG.exception(e) | ||
| return {"error": "Tool argument parsing failed."} | ||
|
|
There was a problem hiding this comment.
Handle dict tool arguments for responses endpoint
For EndpointType.responses, get_tool_details_by_endpoint_type (see nemo_skills/mcp/adapters.py, lines 113-123) already returns tool_call["arguments"], which is a dict. Passing that dict into json.loads raises a TypeError, so tool execution now crashes before the try/except (you only catch JSONDecodeError). Please decode only when the payload is a string, or otherwise accept the existing dict. One possible fix:
- try:
- tool_args = json.loads(tool_args)
- except json.decoder.JSONDecodeError as e:
+ try:
+ if isinstance(tool_args, str):
+ tool_args = json.loads(tool_args)
+ except json.decoder.JSONDecodeError as e:
LOG.error(f"Tool arguments are not in JSON format: {tool_args}")
LOG.exception(e)
return {"error": "Tool argument parsing failed."}
+ except TypeError as e:
+ LOG.error("Tool arguments must be str or dict; got %s", type(tool_args))
+ LOG.exception(e)
+ return {"error": "Tool argument parsing failed."}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type) | |
| ## | |
| # TODO(sanyamk): Not all tool arguments might necessarily be in JSON format. | |
| # Kept here to handle errors for now. | |
| try: | |
| tool_args = json.loads(tool_args) | |
| except json.decoder.JSONDecodeError as e: | |
| LOG.error(f"Tool arguments are not in JSON format: {tool_args}") | |
| LOG.exception(e) | |
| return {"error": "Tool argument parsing failed."} | |
| tool_name, tool_args = get_tool_details_by_endpoint_type(tool_call, endpoint_type) | |
| ## | |
| # TODO(sanyamk): Not all tool arguments might necessarily be in JSON format. | |
| # Kept here to handle errors for now. | |
| try: | |
| if isinstance(tool_args, str): | |
| tool_args = json.loads(tool_args) | |
| except json.decoder.JSONDecodeError as e: | |
| LOG.error(f"Tool arguments are not in JSON format: {tool_args}") | |
| LOG.exception(e) | |
| return {"error": "Tool argument parsing failed."} | |
| except TypeError as e: | |
| LOG.error("Tool arguments must be str or dict; got %s", type(tool_args)) | |
| LOG.exception(e) | |
| return {"error": "Tool argument parsing failed."} |
🧰 Tools
🪛 Ruff (0.13.3)
73-73: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
74-74: Redundant exception object included in logging.exception call
(TRY401)
🤖 Prompt for AI Agents
In nemo_skills/inference/model/tool_call.py around lines 65 to 76, the code
always calls json.loads on tool_args which can be a dict for
EndpointType.responses causing a TypeError; update the logic to only call
json.loads when tool_args is a string (e.g., isinstance(tool_args, str)); if
tool_args is already a dict, skip decoding and use it as-is; keep the existing
JSONDecodeError handling for string parsing and optionally catch TypeError if
non-str/non-dict types may appear and return the same error response.
|
|
||
| Currently, the async_loop function in generate.py sometimes gets stuck forever because some of the coroutines in the litellm logging worker are not finished. | ||
| Debugging why the logger is not finished is non-trivial, so we just patch it to disable its functionality. | ||
| The behavior is that it keeps slurm jobs from existing and we waste gpus. |
There was a problem hiding this comment.
Fix typo in documentation.
"existing" should be "exiting" (jobs should exit, not exist).
Apply this diff:
-The behavior is that it keeps slurm jobs from existing and we waste gpus.
+The behavior is that it keeps slurm jobs from exiting and we waste gpus.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| The behavior is that it keeps slurm jobs from existing and we waste gpus. | |
| The behavior is that it keeps slurm jobs from exiting and we waste gpus. |
🤖 Prompt for AI Agents
In nemo_skills/inference/patch_litellm_logging.py around line 21, the
documentation/comment contains a typo: "existing" should be "exiting"; update
the sentence so it reads that slurm jobs keep from exiting (not existing) to
reflect correct behavior and save GPUs.
| return { | ||
| "type": "function_call_output", | ||
| "call_id": tool_call["call_id"], | ||
| "output": json.dumps(result) if not isinstance(result, str) else result, | ||
| } |
There was a problem hiding this comment.
Fix responses tool-call output key.
For the Responses API path we return "call_id": tool_call["call_id"], but the function-call object coming from OpenAI (and from model_dump()) exposes this identifier as "id". Using "call_id" raises a KeyError as soon as we format any tool output under the responses endpoint. Please grab the "id" field instead.
- return {
- "type": "function_call_output",
- "call_id": tool_call["call_id"],
- "output": json.dumps(result) if not isinstance(result, str) else result,
- }
+ return {
+ "type": "function_call_output",
+ "call_id": tool_call["id"],
+ "output": json.dumps(result) if not isinstance(result, str) else result,
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| return { | |
| "type": "function_call_output", | |
| "call_id": tool_call["call_id"], | |
| "output": json.dumps(result) if not isinstance(result, str) else result, | |
| } | |
| return { | |
| "type": "function_call_output", | |
| "call_id": tool_call["id"], | |
| "output": json.dumps(result) if not isinstance(result, str) else result, | |
| } |
🤖 Prompt for AI Agents
In nemo_skills/mcp/adapters.py around lines 105 to 109, the responses endpoint
incorrectly accesses tool_call["call_id"] which raises a KeyError because the
OpenAI Responses/function-call object uses "id"; update the returned mapping to
use tool_call["id"] instead of tool_call["call_id"] (optionally add a fallback
like tool_call.get("id") or tool_call.get("call_id") if you want backward
compatibility).
| "++inference.top_p=1.0 " | ||
| "++prompt_config=gpt-oss/math " | ||
| "++use_completions_api=true " | ||
| "++inference.endpoint_type=text " |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
LGTM! Configuration flag updated to use new endpoint_type parameter.
The change from ++use_completions_api=true to ++inference.endpoint_type=text aligns with the broader refactoring to endpoint_type-based configuration mentioned in the PR.
Verify that the old use_completions_api flag has been replaced consistently across the codebase:
🏁 Script executed:
#!/bin/bash
# Description: Search for any remaining usage of the old completions API flag
# Search for old flag usage
rg -n 'use_completions_api' --type pyLength of output: 305
Remove deprecated use_completions_api in generate.py
The test change is correct but nemo_skills/inference/generate.py still defines and checks use_completions_api. Remove the parameter and its deprecation branch (lines 89, 228–229) and switch to inference.endpoint_type only.
🤖 Prompt for AI Agents
In nemo_skills/inference/generate.py around lines 89 and 228-229, remove the
deprecated use_completions_api parameter and the conditional branch that checks
it; delete any parameter definition, default, or docstring references to
use_completions_api and remove the deprecation warning branch. Replace any logic
that selects behavior based on use_completions_api with a single check using
inference.endpoint_type (e.g., compare strings like "completions" vs "chat" or
whatever existing endpoint_type values are used) so all code paths rely only on
inference.endpoint_type. Ensure imports, function signatures, and tests are
updated accordingly and no dead references to use_completions_api remain.
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
252-279: Consider usingLOG.exception()for preprocessing errors.At line 263, use
LOG.exception()instead ofLOG.error()when logging preprocessing errors to include the full traceback.Apply this diff:
except ValueError as e: - LOG.error(f"Skipping {jsonl_file} due to pre-processing error: {e}") + LOG.exception(f"Skipping {jsonl_file} due to pre-processing error.") continue
Otherwise, LGTM: Local evaluation path mirrors sandbox path well.
The local evaluation flow is consistent with the sandbox path, with the same preprocessing, release_version handling (prepending "release_" for Python), and postprocessing logic.
docs/evaluation/code.md (1)
326-327: Verify if sandbox flags are automatically set for OJBench.A past reviewer (Kipok) questioned whether these flags are necessary, suggesting they might be set automatically. Since LiveCodeBench has automatic sandbox routing logic, please confirm whether OJBench requires explicit sandbox flags or if they're inferred from the benchmark configuration.
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/livecodebench.py (2)
183-202: Consider usingLOG.exception()for better error diagnostics.At line 201, when logging errors during package installation, use
LOG.exception()instead ofLOG.error()to automatically include the full traceback. This will help diagnose installation failures more quickly.Apply this diff:
except (subprocess.CalledProcessError, ImportError) as e: - LOG.error(f"Failed to install/import 'livecodebench'. Please install it manually. Error: {e}") + LOG.exception(f"Failed to install/import 'livecodebench'. Please install it manually.") raise
205-249: Consider usingLOG.exception()for preprocessing errors.Similar to the previous comment, at line 216, use
LOG.exception()instead ofLOG.error()when logging preprocessing errors to include the full traceback.Apply this diff:
except ValueError as e: - LOG.error(f"Skipping {jsonl_file} due to pre-processing error: {e}") + LOG.exception(f"Skipping {jsonl_file} due to pre-processing error.") continue
Otherwise, LGTM: Sandbox evaluation path is well-structured.
The async evaluation flow with retry logic, proper timeout calculation (
timeout * len(samples) + timeout_buffer), and consistent preprocessing/postprocessing is well-implemented.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
docs/evaluation/code.md(3 hunks)nemo_skills/dataset/livecodebench-cpp/__init__.py(1 hunks)nemo_skills/evaluation/evaluator/livecodebench.py(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/livecodebench-cpp/init.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (2)
nemo_skills/code_execution/sandbox.py (1)
Sandbox(45-334)nemo_skills/evaluation/evaluator/code.py (1)
preprocess_code(36-92)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
265-265: Link text should be descriptive
(MD059, descriptive-link-text)
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/livecodebench.py
108-108: Consider moving this statement to an else block
(TRY300)
112-112: Do not catch blind exception: Exception
(BLE001)
126-126: Avoid specifying long messages outside the exception class
(TRY003)
131-131: Avoid specifying long messages outside the exception class
(TRY003)
189-189: Consider moving this statement to an else block
(TRY300)
195-195: subprocess call: check for execution of untrusted input
(S603)
199-199: Consider moving this statement to an else block
(TRY300)
201-201: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
216-216: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
263-263: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
287-287: Avoid specifying long messages outside the exception class
(TRY003)
297-297: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (5)
nemo_skills/evaluation/evaluator/livecodebench.py (3)
51-90: LGTM: Robust retry mechanism with proper backoff.The async context manager and retry logic are well-implemented. The exponential backoff and clear error logging will help diagnose connectivity issues.
117-162: LGTM: Clean consolidation of preprocessing and postprocessing logic.These helper functions eliminate code duplication between the sandbox and local evaluation paths. The validation logic (checking for empty samples and consistent release versions) is appropriate, and the in-place file modification approach works well with the evaluation flow.
282-302: LGTM: Well-designed routing logic with proper validation.The main entry point effectively:
- Validates interpreter constraints per language (Python allows
python/pypy3, C++ requirespython)- Forces the correct interpreter for C++ with a clear warning
- Checks sandbox availability before routing
- Enforces sandbox requirement for
pypy3(raises error if unavailable)- Gracefully falls back to local evaluation when sandbox is unavailable
The warning message at lines 289-293 addresses the past review comment about explaining why the Python interpreter is used for C++ evaluation.
docs/evaluation/code.md (2)
224-227: Verify if--with_sandboxflag is necessary.The documentation states that
--with_sandbox --keep_mounts_for_sandboxflags must be passed for Pypy3 evaluation. However, the code at lines 295-297 automatically checks sandbox availability and routes accordingly, raising an error if pypy3 is used without an available sandbox.The
--with_sandboxflag might be implicitly handled, though--keep_mounts_for_sandboxmay still be necessary for file access (as mentioned in the PR objectives). Please confirm whether users truly need to pass these flags explicitly.
262-267: LGTM: C++ benchmark documentation is clear and correct.The new section properly:
- Links to both the dataset definition and the original Hugging Face source
- Specifies the exact benchmark entrypoint (
--benchmarks=livecodebench-cpp)- Names both available splits (
v5_2408_2501andv6_2408_2505)- References the Python instructions for additional details
This addresses the past review comment about providing C++-specific instructions.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
nemo_skills/evaluation/evaluator/livecodebench.py (2)
93-115: Consider narrowing the exception handling scope.The function correctly implements sandbox availability checking, but catching all
Exceptiontypes on line 112 is overly broad. While this is acceptable for an availability check that should never fail the caller, you should consider whether certain exception types (e.g.,KeyboardInterrupt,SystemExit) should propagate.As per the Ruff hint (BLE001), you could narrow the exception handling:
LOG.info("Sandbox connection successful. Sandbox is available.") return True except httpx.NetworkError as e: LOG.warning(f"Sandbox is unavailable due to a network error: {type(e).__name__} - {e}") return False - except Exception as e: + except (RuntimeError, ValueError, TypeError) as e: LOG.warning(f"An unexpected error occurred while checking sandbox availability: {e}") return FalseAlternatively, if you want to keep the broad catch for robustness, the current implementation is acceptable since it logs the error details.
288-293: Consider adding an inline comment explaining the interpreter constraint.While the warning message clearly explains that C++ evaluation requires the Python interpreter to run the harness, an inline comment would improve code readability for future maintainers.
if eval_config.language == "cpp" and eval_config.interpreter != "python": + # The C++ evaluation harness is written in Python, so we must use the Python interpreter + # to execute it, regardless of the target language being C++ LOG.warning( f"For C++ evaluation, the harness must be run with 'python'. " f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'." ) eval_config.interpreter = "python"Based on learnings (past review comment from Kipok suggesting to add a comment about why Python interpreter is used for C++ tests).
docs/evaluation/code.md (1)
334-335: Remove redundant sandbox flags from code example
Indocs/evaluation/code.md(lines 334–335), the--with_sandboxand--keep_mounts_for_sandboxflags can be omitted—they’re already enabled byREQUIRES_SANDBOX = TrueandKEEP_MOUNTS_FOR_SANDBOX = Trueinnemo_skills/dataset/ojbench/__init__.py.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
docs/evaluation/code.md(3 hunks)nemo_skills/dataset/livecodebench-cpp/__init__.py(1 hunks)nemo_skills/evaluation/evaluator/livecodebench.py(9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/livecodebench-cpp/init.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
Sandbox(45-334)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
273-273: Link text should be descriptive
(MD059, descriptive-link-text)
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/livecodebench.py
108-108: Consider moving this statement to an else block
(TRY300)
112-112: Do not catch blind exception: Exception
(BLE001)
297-297: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (9)
docs/evaluation/code.md (2)
232-232: Documentation correctly reflects sandbox requirement for Pypy3.The updated instructions clearly state that sandbox flags are required for Pypy3 evaluation, which aligns with the PR objectives and the resolution of the custom_output_file issue.
270-275: LGTM! Previous review concerns have been addressed.The livecodebench-cpp documentation now correctly:
- Fixes the typo ("repare" → "prepare")
- Explicitly instructs users to target the C++ benchmark entrypoint (
--benchmarks=livecodebench-cpp)- Specifies the correct split names (
v5_2408_2501orv6_2408_2505)The implementation matches the suggestions from the previous review.
nemo_skills/evaluation/evaluator/livecodebench.py (7)
27-28: LGTM! Import required for network error handling.The
httpximport is necessary for catchinghttpx.NetworkErrorin the new retry logic.
47-48: LGTM! New configuration fields support retry logic.The addition of
timeout_bufferandnum_retriesfields appropriately extends the configuration to support the new retry and timeout management features.
62-91: LGTM! Retry logic implements exponential backoff correctly.The implementation properly:
- Catches
httpx.NetworkErrorspecifically for transient network issues- Uses exponential backoff (2^attempt) between retries
- Logs warnings with attempt counts and wait times
- Re-raises the last exception after all retries are exhausted
165-180: LGTM! Function signature updated to support retry configuration.The changes correctly:
- Accept
eval_configto access interpreter and num_retries- Use the new
_execute_in_sandbox_with_retrieshelper- Maintain backward compatibility with retry logic
219-220: LGTM! Consistent release_version handling across sandbox and local paths.Both evaluation paths now apply the same logic: prepending "release_" to the
release_versionwhenlanguage == "python". This resolves the previous review concern about mismatched handling between sandbox and local flows. The conditional prefix suggests that C++ releases use a different naming convention that doesn't require the "release_" prefix.Also applies to: 266-267
236-242: LGTM! Timeout calculation includes buffer for improved reliability.The timeout now accounts for per-sample timeout plus a configurable buffer (
timeout_buffer), which helps prevent premature timeouts during batch evaluation of large sample sets.
295-297: LGTM! Sandbox availability check ensures pypy3 requirements are met.The code correctly:
- Checks sandbox availability before attempting evaluation
- Raises a clear error when pypy3 is configured but sandbox is unavailable
- Routes to the appropriate evaluation path based on availability
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
288-293: Consider adding a comment explaining the C++ interpreter override.The code forces the interpreter to
"python"for C++ evaluations, overriding any user configuration. While this is correct (the LiveCodeBench evaluation harness is Python-based, even for C++ submissions), a brief comment would help future maintainers understand why this override is necessary.Apply this diff:
if eval_config.language == "cpp" and eval_config.interpreter != "python": LOG.warning( - f"For C++ evaluation, the harness must be run with 'python'. " + # The LiveCodeBench evaluation harness for C++ is Python-based, regardless of the code being evaluated + f"For C++ evaluation, the harness must be run with 'python'. " f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'." ) eval_config.interpreter = "python"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/livecodebench.py(9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
Sandbox(45-334)
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/livecodebench.py
108-108: Consider moving this statement to an else block
(TRY300)
112-112: Do not catch blind exception: Exception
(BLE001)
297-297: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: unit-tests
🔇 Additional comments (7)
nemo_skills/evaluation/evaluator/livecodebench.py (7)
27-28: LGTM!The
httpximport is required for the new network retry logic introduced in_execute_in_sandbox_with_retries.
42-48: LGTM!The new configuration fields (
timeout_buffer,num_retries) are well-suited for controlling the retry behavior and timeout calculations introduced in this PR.
62-91: LGTM!The retry mechanism with exponential backoff appropriately handles transient network failures when communicating with the sandbox service. The function correctly forwards arguments to
sandbox.execute_code.
93-115: LGTM!The sandbox availability check correctly probes connectivity by executing a simple shell command. The broad exception handling on line 112 is appropriate for an availability check, as it should gracefully handle any unexpected errors and report the sandbox as unavailable. The static analysis warnings (TRY300, BLE001) can be safely ignored in this context.
165-180: LGTM!The function correctly integrates the retry logic and properly uses
eval_configto determine the appropriate interpreter and number of retries.
295-302: LGTM!The sandbox availability check and path selection logic correctly handles the requirement that
pypy3must run in a sandbox environment, while allowing fallback to local evaluation for standard Python when the sandbox is unavailable.
219-226: Verify samplerelease_versionformatConfirm that the values returned by
_preprocess_and_validate_filefor Python samples do not already include the"release_"prefix; otherwise the code at lines 219–220 will produce"release_release_…"mismatches.
|
@Kipok this PR is working. I have made the following key changes. 1. We check if Sandbox is available for LCB evaluation.2. Core logic when to use Sandbox3. Execute code in Sandbox with a number of retriesCurrently, the default Feel free to take a look at the entire script - nemo_skills/evaluation/evaluator/livecodebench.py. |
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/livecodebench.py (2)
112-114: Narrow the exception handling to avoid masking unexpected errors.Catching a generic
Exceptionafter the specifichttpx.NetworkErrorcan suppress unexpected bugs (e.g.,AttributeError,TypeError). Consider either:
- Catching specific exceptions expected during sandbox initialization (e.g.,
ValueError,ConnectionError)- Or, logging the full traceback and re-raising if the exception is not one you expect
Example approach:
except httpx.NetworkError as e: LOG.warning(f"Sandbox is unavailable due to a network error: {type(e).__name__} - {e}") return False - except Exception as e: - LOG.warning(f"An unexpected error occurred while checking sandbox availability: {e}") + except (ValueError, ConnectionError) as e: + LOG.warning(f"Sandbox configuration or connection error: {type(e).__name__} - {e}") return False + except Exception as e: + LOG.error(f"Unexpected error during sandbox availability check: {e}", exc_info=True) + raise
288-294: Add a comment explaining why C++ evaluation requires the Python interpreter.The code forces
interpreter = "python"for C++ evaluation, but the rationale isn't immediately clear from the code alone. Adding a brief comment would help future maintainers understand that the evaluation harness itself is written in Python, even though it evaluates C++ code.Based on past review comments
Apply this diff:
if eval_config.language == "cpp" and eval_config.interpreter != "python": + # The LiveCodeBench evaluation harness is implemented in Python, + # so C++ code evaluation must use the Python interpreter to run the harness. LOG.warning( f"For C++ evaluation, the harness must be run with 'python'. " f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'." ) eval_config.interpreter = "python"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/livecodebench.py(9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
Sandbox(45-334)
🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/livecodebench.py
108-108: Consider moving this statement to an else block
(TRY300)
112-112: Do not catch blind exception: Exception
(BLE001)
297-297: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: Cursor Bugbot
🔇 Additional comments (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
219-220: No changes needed to release_version prefixing. The evaluator’s logic aligns with the dataset prepare scripts: Python samples are loaded viaparse_data(release_{version})and C++ via raw splits (v5_2408_2501/v6_2408_2505), so conditional prefixing is correct.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
288-293: Consider adding an inline comment.The logic correctly forces the Python interpreter for C++ evaluation, but an inline comment explaining the rationale would improve maintainability.
For example:
if eval_config.language == "cpp" and eval_config.interpreter != "python": + # C++ evaluation harness in livecodebench must run with Python interpreter + # regardless of the configured interpreter for test execution LOG.warning( f"For C++ evaluation, the harness must be run with 'python'. " f"Ignoring configured interpreter '{eval_config.interpreter}' and using 'python'." ) eval_config.interpreter = "python"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/livecodebench.py(9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
Sandbox(45-334)
🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/livecodebench.py
108-108: Consider moving this statement to an else block
(TRY300)
112-112: Do not catch blind exception: Exception
(BLE001)
297-297: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: Cursor Bugbot
🔇 Additional comments (8)
nemo_skills/evaluation/evaluator/livecodebench.py (8)
27-28: LGTM!The
httpximport is necessary for handlingNetworkErrorexceptions in the new retry logic.
42-42: LGTM!The updated comment clearly indicates the valid language options.
47-48: LGTM!The new configuration fields
timeout_bufferandnum_retriesprovide necessary control over sandbox operations.
93-115: LGTM!The sandbox availability check correctly probes connectivity using a lightweight test command with retry logic.
165-180: LGTM!The updated signature and use of
eval_configimprove consistency and enable proper retry behavior for package installation.
205-249: LGTM!The sandbox evaluation flow correctly:
- Uses the updated installation helper with retry support
- Applies consistent
release_versionprefixing for Python (matching the local path)- Incorporates timeout buffer and retry logic for resilient execution
266-267: LGTM!The conditional
release_versionprefixing now matches the sandbox path, resolving the previous mismatch issue.
295-302: LGTM!The orchestration logic correctly:
- Checks sandbox availability before proceeding
- Enforces sandbox requirement for PyPy3
- Routes to the appropriate evaluation path based on availability
| async def _execute_in_sandbox_with_retries( | ||
| sandbox: Sandbox, | ||
| num_retries: int, | ||
| *args, | ||
| **kwargs, | ||
| ) -> Tuple[Dict[str, Any], Dict[str, Any]]: | ||
| """ | ||
| Executes a command in the sandbox, retrying on network errors. | ||
|
|
||
| Uses exponential backoff for delays between retries. | ||
| """ | ||
| last_exception = None | ||
| for attempt in range(num_retries): | ||
| try: | ||
| return await sandbox.execute_code(*args, **kwargs) | ||
| except httpx.NetworkError as e: | ||
| last_exception = e | ||
| if attempt == num_retries - 1: | ||
| break | ||
|
|
||
| wait_time = 2**attempt | ||
| LOG.warning( | ||
| f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. " | ||
| f"Retrying in {wait_time} second(s)..." | ||
| ) | ||
| await asyncio.sleep(wait_time) | ||
|
|
||
| LOG.error(f"All {num_retries} sandbox connection attempts failed.") | ||
| raise last_exception | ||
|
|
There was a problem hiding this comment.
Fix incorrect return type annotation.
The return type annotation specifies Tuple[Dict[str, Any], Dict[str, Any]], but sandbox.execute_code returns Tuple[Dict, str] where the second element is a session ID string, not a dictionary.
Apply this diff to correct the type annotation:
async def _execute_in_sandbox_with_retries(
sandbox: Sandbox,
num_retries: int,
*args,
**kwargs,
-) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+) -> Tuple[Dict[str, Any], str]:
"""
Executes a command in the sandbox, retrying on network errors.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def _execute_in_sandbox_with_retries( | |
| sandbox: Sandbox, | |
| num_retries: int, | |
| *args, | |
| **kwargs, | |
| ) -> Tuple[Dict[str, Any], Dict[str, Any]]: | |
| """ | |
| Executes a command in the sandbox, retrying on network errors. | |
| Uses exponential backoff for delays between retries. | |
| """ | |
| last_exception = None | |
| for attempt in range(num_retries): | |
| try: | |
| return await sandbox.execute_code(*args, **kwargs) | |
| except httpx.NetworkError as e: | |
| last_exception = e | |
| if attempt == num_retries - 1: | |
| break | |
| wait_time = 2**attempt | |
| LOG.warning( | |
| f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. " | |
| f"Retrying in {wait_time} second(s)..." | |
| ) | |
| await asyncio.sleep(wait_time) | |
| LOG.error(f"All {num_retries} sandbox connection attempts failed.") | |
| raise last_exception | |
| async def _execute_in_sandbox_with_retries( | |
| sandbox: Sandbox, | |
| num_retries: int, | |
| *args, | |
| **kwargs, | |
| ) -> Tuple[Dict[str, Any], str]: | |
| """ | |
| Executes a command in the sandbox, retrying on network errors. | |
| Uses exponential backoff for delays between retries. | |
| """ | |
| last_exception = None | |
| for attempt in range(num_retries): | |
| try: | |
| return await sandbox.execute_code(*args, **kwargs) | |
| except httpx.NetworkError as e: | |
| last_exception = e | |
| if attempt == num_retries - 1: | |
| break | |
| wait_time = 2**attempt | |
| LOG.warning( | |
| f"Sandbox connection error (attempt {attempt + 1}/{num_retries}): {e}. " | |
| f"Retrying in {wait_time} second(s)..." | |
| ) | |
| await asyncio.sleep(wait_time) | |
| LOG.error(f"All {num_retries} sandbox connection attempts failed.") | |
| raise last_exception |
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/livecodebench.py around lines 62 to 91 the
function _execute_in_sandbox_with_retries has an incorrect return type
annotation Tuple[Dict[str, Any], Dict[str, Any]]; update it to reflect the
actual sandbox.execute_code return type (Tuple[Dict[str, Any], str]) by changing
the annotation to return Tuple[Dict[str, Any], str] and adjust any imports/types
if necessary so the second element is typed as a string rather than a dict.
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
62-90: Fix the return type annotation.The function annotation specifies
Tuple[Dict[str, Any], Dict[str, Any]], butsandbox.execute_codereturnsTuple[Dict[str, Any], str]where the second element is a session ID string (seenemo_skills/code_execution/sandbox.py:151). This was flagged in a prior review but remains unaddressed.Apply this diff:
async def execute_in_sandbox_with_retries( sandbox: Sandbox, num_retries: int, *args, **kwargs, -) -> Tuple[Dict[str, Any], Dict[str, Any]]: +) -> Tuple[Dict[str, Any], str]:
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
288-293: Consider adding an inline comment for clarity.The warning log at lines 289-292 explains why the interpreter is forced to
pythonfor C++ evaluation. An inline comment before line 288 would help future maintainers understand this requirement without checking logs.Apply this diff:
+ # C++ evaluation requires the LiveCodeBench harness to run with Python, + # regardless of the configured interpreter setting if eval_config.language == "cpp" and eval_config.interpreter != "python":nemo_skills/evaluation/evaluator/ojbench.py (1)
26-30: Consider extracting shared utilities to a common module.The sandbox utilities (
execute_in_sandbox_with_retries,is_sandbox_available,sandbox_context) are imported fromlivecodebench.py. While this works, it creates a dependency where ojbench imports from livecodebench, which couples unrelated evaluators. Consider extracting these utilities to a shared module likenemo_skills/evaluation/evaluator/sandbox_utils.pyfor better modularity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
docs/evaluation/code.md(2 hunks)nemo_skills/evaluation/evaluator/livecodebench.py(9 hunks)nemo_skills/evaluation/evaluator/ojbench.py(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/evaluation/evaluator/livecodebench.py (1)
nemo_skills/code_execution/sandbox.py (1)
Sandbox(45-334)
nemo_skills/evaluation/evaluator/ojbench.py (3)
nemo_skills/code_execution/sandbox.py (1)
Sandbox(45-334)nemo_skills/evaluation/evaluator/code.py (1)
preprocess_code(36-92)nemo_skills/evaluation/evaluator/livecodebench.py (3)
execute_in_sandbox_with_retries(62-90)is_sandbox_available(93-114)sandbox_context(52-59)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
273-273: Link text should be descriptive
(MD059, descriptive-link-text)
🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/livecodebench.py
108-108: Consider moving this statement to an else block
(TRY300)
112-112: Do not catch blind exception: Exception
(BLE001)
297-297: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/evaluation/evaluator/ojbench.py
146-146: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (12)
docs/evaluation/code.md (2)
232-232: LGTM!The Pypy3 evaluation instructions clearly state the sandbox requirement and provide the correct flags. The guidance aligns with the evaluator implementation that enforces sandbox availability for pypy3.
270-275: LGTM!The livecodebench-cpp section is complete and accurate. It correctly references the cpp benchmark entrypoint and provides clear data preparation and evaluation instructions. The typo has been fixed, and the guidance mirrors the Python livecodebench workflow appropriately.
nemo_skills/evaluation/evaluator/livecodebench.py (6)
47-48: LGTM!The new configuration fields
timeout_bufferandnum_retriesprovide sensible defaults (60 seconds and 3 retries) for managing sandbox execution reliability. These parameters are appropriately used throughout the evaluator to handle network instability.
93-114: LGTM!The sandbox availability check correctly uses
generated_code="true"and handles both network errors and unexpected exceptions gracefully. The broad exception handler at line 112 is appropriate for an availability probe, where any failure should be treated as "unavailable" rather than propagated.
165-180: LGTM!The package installation function correctly selects the appropriate pip command and git URL based on the interpreter, uses retry logic for network resilience, and provides clear logging. The error handling is appropriate.
205-249: LGTM!The sandboxed evaluation flow correctly installs packages, preprocesses files, applies the
release_prefix for Python language, and uses retry logic with appropriate timeouts. The timeout calculation (timeout * len(samples) + timeout_buffer) provides adequate buffer for evaluation completion.
252-279: LGTM!The local evaluation flow now correctly handles
release_versionconsistently with the sandboxed path. Both flows add therelease_prefix whenlanguage == "python", ensuring the evaluate function receives the correct release tag format.
282-302: LGTM!The orchestration logic correctly validates interpreter settings, enforces sandbox requirements for pypy3, checks sandbox availability, and routes evaluation to the appropriate implementation. The C++ interpreter forcing is well-explained in the log warning.
nemo_skills/evaluation/evaluator/ojbench.py (4)
40-41: LGTM!The timeout and retry configuration fields match the pattern established in livecodebench.py, providing consistent reliability handling across evaluators.
44-67: LGTM!The package installation function correctly uses retry logic for both cloning and installation, provides clear error logging, and follows the same pattern as the livecodebench evaluator for consistency.
70-137: LGTM!The async evaluation flow correctly manages sandbox lifecycle, installs packages with retry logic, and executes evaluation with appropriate timeout buffering. The error handling and result processing logic is sound.
139-146: LGTM!The orchestration correctly checks sandbox availability and enforces the sandbox requirement for OJBench evaluation. Unlike LiveCodeBench, OJBench has no local fallback, which is appropriate given that the evaluation harness requires sandbox isolation for accurate results.
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: dgitman <dgitman@nvidia.com>
In this PR, we are adding evaluation support for LiveCodeBench-CPP through nemo-skills. Primary changes are:
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores
Note
Adds LiveCodeBench-CPP dataset/eval support, introduces sandbox availability checks with retry/backoff across evaluators, and updates docs and warnings.
livecodebench-cpp: New dataset with defaults (__init__.py) and data prep (prepare.py) producingv5_2408_2501.jsonlandv6_2408_2505.jsonl.evaluation/evaluator/livecodebench.py):timeout_bufferandnum_retries.release_versionhandling (prefix only for Python); forcepythoninterpreter for C++; gate Pypy3 on sandbox availability.evaluation/evaluator/ojbench.py):timeout_buffer/num_retries; require sandbox availability before running.pipeline/utils/eval.py: Reduce false-positive sandbox mount warnings by tighteningkeep_mounts_for_sandboxcondition.--with_sandbox --keep_mounts_for_sandbox; addlivecodebench-cppsection with data prep/eval notes.Written by Cursor Bugbot for commit e4b3cb8. This will update automatically on new commits. Configure here.