BFCLv4 support#908
Conversation
WalkthroughIntroduces BFCL v4: new dataset metadata and splits, scoring extended with memory/web-search/format_sensitivity and agentic/hallucination components, data prepare/utils for v4 ingestion and OpenAI-style tools, evaluator/inference memory & web-search integrations, new WebSearchAPI, and related infra updates. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Prep as prepare.py
participant Git as GitHub
participant FS as Local FS
participant Utils as utils.convert_to_tool
Note over Prep,Git: BFCL v4 download & per-category prepare
Prep->>Git: shallow clone(repo_url, depth=1)
Git-->>Prep: repo_root
Prep->>Prep: discover BFCL_v4 JSON files (filter memory/web_search/format_sensitivity)
loop per category file
Prep->>Prep: load_dataset_entry(target_folder, category)
Prep->>Utils: func_doc_language_specific_pre_processing + convert_to_tool
Utils-->>Prep: tool descriptors
Prep->>FS: write split dir, `__init__.py`, and `test.jsonl`
end
sequenceDiagram
autonumber
participant Eval as Evaluator
participant Score as bfcl_v4.bfcl_score
participant Metrics as Metrics Store
Note over Eval,Score: Scoring includes agentic (memory/web_search) & hallucination components
Eval->>Score: compute_score(metrics)
activate Score
Score->>Score: validate GLOBAL_MAX_K
Score->>Score: get_accuracy_dict(metrics, category, optional=True)
Score->>Score: calculate_agentic_accuracy(metrics)
Score->>Metrics: read MEMORY & WEB_SEARCH category metrics
Metrics-->>Score: per-category accuracies
Score->>Score: aggregate -> overall, agentic, hallucination sub-scores
Score-->>Eval: bfcl_v4 result (overall + sub-scores)
deactivate Score
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
nemo_skills/dataset/bfcl_v4/__init__.py(1 hunks)nemo_skills/dataset/bfcl_v4/bfcl_score.py(4 hunks)nemo_skills/dataset/bfcl_v4/constants.py(2 hunks)nemo_skills/dataset/bfcl_v4/prepare.py(7 hunks)nemo_skills/dataset/bfcl_v4/utils.py(2 hunks)nemo_skills/evaluation/evaluator/bfcl.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
nemo_skills/dataset/bfcl_v4/utils.py (7)
convert_to_tool(172-184)func_doc_language_specific_pre_processing(83-134)is_multi_turn(39-43)load_file(62-71)clean_up_memory_prereq_entries(187-217)populate_initial_settings_for_memory_test_cases(220-239)populate_initial_settings_for_web_search_test_cases(242-258)
🪛 Ruff (0.13.3)
nemo_skills/dataset/bfcl_v4/prepare.py
108-108: Undefined name test_cases_to_generate
(F821)
145-145: Starting a process with a partial executable path
(S607)
169-169: Using .strip() with multi-character strings is misleading
(B005)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (15)
nemo_skills/evaluation/evaluator/bfcl.py (1)
44-44: LGTM - Consistent v3 to v4 migration.The file naming and category extraction updates correctly align with BFCL v4 conventions across all three locations.
Also applies to: 51-51, 79-79
nemo_skills/dataset/bfcl_v4/__init__.py (2)
35-40: LGTM - New module exports added.The new splits for web search, memory, and format sensitivity are properly integrated into the SPLITS list.
45-47: LGTM - Version references updated.Both SCORE_MODULE and BENCHMARKS correctly reference bfcl_v4.
nemo_skills/dataset/bfcl_v4/utils.py (4)
27-43: LGTM - Helper predicates added.The new category detection functions (is_memory, is_memory_prereq, is_web_search, is_multi_turn) provide clear, reusable predicates for identifying test categories.
46-59: LGTM - Test category extraction added.The extract_test_category_from_id function properly handles both normal and format_sensitivity test entry IDs, with clear logic for removing the "_prereq" suffix when requested.
187-217: LGTM - Memory prereq cleanup logic added.The clean_up_memory_prereq_entries function correctly:
- Removes orphaned memory-prerequisite test cases
- Filters out already-generated entries from dependency lists
220-258: LGTM - Initial settings population added.Both populate_initial_settings_for_memory_test_cases and populate_initial_settings_for_web_search_test_cases provide clear configuration injection for their respective test categories.
nemo_skills/dataset/bfcl_v4/prepare.py (2)
95-95: LGTM - Signature simplified.Removing the
model_typeparameter simplifies the interface. Ensure downstream callers are updated accordingly.
169-169: LGTM - Correct prefix removal.Using
.lstrip("BFCL_v4_")correctly removes the prefix. The static analysis warning about multi-character strings is a false positive in this case, as the intent is to remove the exact prefix string.nemo_skills/dataset/bfcl_v4/bfcl_score.py (4)
51-62: LGTM - New agentic categories added.The MEMORY, WEB_SEARCH, and FORMAT_SENSITIVITY constants properly define the new test categories for BFCL v4.
94-98: LGTM - Optional category lookup added.The optional parameter allows graceful handling of missing categories like format_sensitivity, which may not always be present.
186-203: LGTM - Agentic accuracy calculation added.The calculate_agentic_accuracy function correctly computes combined accuracy for memory and web search categories, with optional format_sensitivity inclusion.
209-229: LGTM - Agentic metrics integrated.The compute_score function correctly integrates agentic accuracy metrics (memory and web search) into the overall accuracy calculation and result structure.
nemo_skills/dataset/bfcl_v4/constants.py (2)
18-68: LGTM - Comprehensive category definitions.The category constants are well-organized with clear derivations:
- MEMORY_CATEGORY derived from ALL_AVAILABLE_MEMORY_BACKENDS
- SINGLE_TURN_CATEGORY composed from NON_LIVE_CATEGORY and LIVE_CATEGORY
- AGENTIC_CATEGORY combines MEMORY_CATEGORY and WEB_SEARCH_CATEGORY
- Clear separation between scoring and non-scoring categories
70-97: LGTM - Test collection mapping updated.The TEST_COLLECTION_MAPPING uses the new category constants consistently and adds new top-level groups ("memory", "web_search", "agentic") for v4.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
231-232: Fix typo in dictionary key.The key "format_sensistivity" contains a typo and should be "format_sensitivity" to match the constant and source key.
Apply this diff to fix the typo:
if "format_sensitivity" in agentic_accuracy: - res["format_sensistivity"] = agentic_accuracy["format_sensitivity"] + res["format_sensitivity"] = agentic_accuracy["format_sensitivity"]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_skills/dataset/bfcl_v4/__init__.py(2 hunks)nemo_skills/dataset/bfcl_v4/bfcl_score.py(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/bfcl_v4/init.py
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
nemo_skills/dataset/bfcl_v4/bfcl_score.py
[warning] 1-1: ruff-format reformatted this file during hooks. Please review formatting changes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (2)
51-63: LGTM! New agentic constants are well-defined.The new constants for memory, web search, and format sensitivity categories align with the BFCL v4 update.
186-203: LGTM! Agentic accuracy calculation is sound.The function correctly aggregates memory and web search accuracies, and conditionally includes format sensitivity. However, this depends on fixing the critical bug in
get_accuracy_dict(lines 94-98) for the optional parameter to work correctly.
45b6755 to
5f5b3f8
Compare
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/inference/eval/bfcl.py (1)
169-177: Normalize client parser output for no-tool-call case to list-of-strings.Avoid returning a string (later
extendwill split it). Keepgenerationas list.- except Exception: - generation = parsed_response["content"] if isinstance(parsed_response["content"], str) else "" - tool_call_ids = [] + except (KeyError, TypeError, ValueError): + text = parsed_response["content"] if isinstance(parsed_response["content"], str) else "" + generation = [text] if text else [] + tool_call_ids = []
♻️ Duplicate comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
95-101: Fix optional access to avoid KeyError when category missing.Unconditionally indexing
metrics[...]defeats theoptionalguard.-def get_accuracy_dict(metrics, category, optional=False): +def get_accuracy_dict(metrics, category, optional=False): # reporting aggregation for pass@1[avg-of-k] (for highest k) if available - if optional and f"bfcl_v4.{category}" not in metrics: - category_dict = {} - category_dict = metrics[f"bfcl_v4.{category}"] + key = f"bfcl_v4.{category}" + if optional and key not in metrics: + return {"accuracy": 0, "num_entries": 0} + category_dict = metrics[key]nemo_skills/dataset/bfcl_v4/prepare.py (1)
155-157: Version pinning concern remains unresolved.As noted in a previous review, shallow-cloning HEAD without version pinning risks future breakage if the upstream repository structure changes. This concern from the prior review has not been addressed in the current implementation.
Consider one of these approaches:
- Clone with an explicit tag:
["git", "clone", "--depth=1", "--branch", "v4", repo_url, temp_dir]- Add a post-clone checkout:
subprocess.run(["git", "-C", temp_dir, "checkout", KNOWN_COMMIT], check=True)Store the commit hash or tag as a constant at the module level for easy updates.
🧹 Nitpick comments (8)
nemo_skills/inference/eval/bfcl.py (5)
403-404: Prevent accidental char-wise extend when generation is string.Even with parser fixes, defensively normalize to list.
- current_turn_response.extend(model_response["generation"]) + gen = model_response["generation"] + current_turn_response.extend(gen if isinstance(gen, list) else [gen])
256-263: Use pathlib and robust key extraction for MemoryAPI path fix.Avoid string replace and single-element list indexing; use pathlib and next(iter(...)).
- for datapoint in data: - if "initial_config" in datapoint and list(datapoint["initial_config"].keys())[0].startswith("MemoryAPI"): - datapoint["initial_config"][list(datapoint["initial_config"].keys())[0]]["model_result_dir"] = self.cfg.output_file.replace("/output.jsonl", "") + for datapoint in data: + if "initial_config" in datapoint: + api_key = next(iter(datapoint["initial_config"]), None) + if api_key and api_key.startswith("MemoryAPI"): + datapoint["initial_config"][api_key]["model_result_dir"] = str(Path(self.cfg.output_file).parent)Add this import at the top of the file:
+from pathlib import Path
270-273: Avoidasyncio.runin environments with a running event loop.This will raise at runtime (e.g., Jupyter/async runners). Consider refactoring to async
load_dataand invoking from the existing loop, or provide a helper that schedules coroutines when a loop is running.Would you like a small refactor to make prereq processing async-safe?
430-436: Use iterator for first (and only) instance.Satisfy RUF015 and avoid list materialization.
Based on static analysis hints
- memory_instance: "MemoryAPI" = list(involved_instances.values())[0] + memory_instance: "MemoryAPI" = next(iter(involved_instances.values()))
451-455: Same iterator fix when flushing memory.Based on static analysis hints
- memory_instance: "MemoryAPI" = list(involved_instances.values())[0] + memory_instance: "MemoryAPI" = next(iter(involved_instances.values()))nemo_skills/dataset/bfcl_v4/utils.py (1)
96-103: Guard nested type mapping to avoid KeyError on unseen types.Use
mapping.get(..., "string")for robustness; recurse safely.- properties[key]["items"]["type"] = mapping[properties[key]["items"]["type"]] + item_type = properties[key]["items"].get("type") + properties[key]["items"]["type"] = mapping.get(item_type, "string") if properties[key]["items"]["type"] == "array" and "items" in properties[key]["items"]: - properties[key]["items"]["items"]["type"] = mapping[properties[key]["items"]["items"]["type"]] + nested_item_type = properties[key]["items"]["items"].get("type") + properties[key]["items"]["items"]["type"] = mapping.get(nested_item_type, "string") elif properties[key]["items"]["type"] == "object" and "properties" in properties[key]["items"]: properties[key]["items"]["properties"] = _cast_to_openai_type( properties[key]["items"]["properties"], mapping )nemo_skills/inference/eval/bfcl_web_search.py (1)
265-267: Narrow exception handling in fetch_url_content.Catching broad
Exceptionhides actionable errors; preferrequests.exceptions.RequestException.- except Exception as e: + except requests.exceptions.RequestException as e: return {"error": f"An error occurred while fetching {url}: {str(e)}"}nemo_skills/dataset/bfcl_v4/prepare.py (1)
201-208: Remove unused function parameter.The
argsparameter is not used in the function body. If the--model_typeargument parsed on line 213 was intended for future use, consider removing both the argument parsing and the parameter until it's actually needed.Apply this diff:
-def main(args): +def main(): LOG.warning( "Currently processing according to the OpenAI model style which works for most models, including Qwen/Llama-Nemotron/DeepSeek." ) download_and_process_bfcl_data( REPO_URL, DATA_FOLDER_PATH, output_dir=os.path.join(os.path.dirname(__file__)), )And remove the unused argument parsing:
if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("--model_type", type=str, default=None, required=False) - args = parser.parse_args() - - main(args) + main()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.gitignore(1 hunks)dockerfiles/Dockerfile.nemo-skills(1 hunks)nemo_skills/dataset/bfcl_v3/__init__.py(1 hunks)nemo_skills/dataset/bfcl_v3/bfcl_score.py(3 hunks)nemo_skills/dataset/bfcl_v3/constants.py(4 hunks)nemo_skills/dataset/bfcl_v3/prepare.py(4 hunks)nemo_skills/dataset/bfcl_v4/__init__.py(1 hunks)nemo_skills/dataset/bfcl_v4/bfcl_score.py(1 hunks)nemo_skills/dataset/bfcl_v4/constants.py(1 hunks)nemo_skills/dataset/bfcl_v4/prepare.py(1 hunks)nemo_skills/dataset/bfcl_v4/utils.py(1 hunks)nemo_skills/evaluation/evaluator/bfcl.py(3 hunks)nemo_skills/inference/eval/bfcl.py(6 hunks)nemo_skills/inference/eval/bfcl_utils.py(3 hunks)nemo_skills/inference/eval/bfcl_web_search.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
download_and_process_bfcl_data(140-198)
nemo_skills/inference/eval/bfcl.py (2)
nemo_skills/inference/generate.py (3)
setup_prompt(365-378)load_data(418-435)process_single_datapoint(537-562)nemo_skills/inference/eval/bfcl_utils.py (3)
convert_to_function_call(80-91)is_empty_execute_response(184-189)execute_multi_turn_func_call(94-181)
nemo_skills/evaluation/evaluator/bfcl.py (1)
nemo_skills/utils.py (2)
get_logger_name(130-134)nested_dataclass(49-82)
nemo_skills/inference/eval/bfcl_utils.py (1)
nemo_skills/inference/eval/bfcl_web_search.py (1)
_load_scenario(42-45)
nemo_skills/dataset/bfcl_v4/prepare.py (4)
nemo_skills/dataset/bfcl_v4/utils.py (2)
func_doc_language_specific_pre_processing(17-68)convert_to_tool(106-118)nemo_skills/utils.py (1)
get_logger_name(130-134)nemo_skills/dataset/bfcl_v3/utils.py (2)
is_multi_turn(51-55)load_file(39-48)nemo_skills/dataset/bfcl_v3/prepare.py (1)
download_and_process_bfcl_data(117-178)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
nemo_skills/dataset/bfcl_v3/bfcl_score.py (6)
calculate_combined_accuracy(52-74)get_accuracy_dict(77-112)calculate_non_live_single_turn_accuracy(115-137)calculate_live_single_turn_accuracy(140-156)calculate_multi_turn_accuracy(159-165)compute_score(168-190)
🪛 GitHub Actions: Copyright check
nemo_skills/inference/eval/bfcl_web_search.py
[error] 1-1: Missing copyright notice in the first 10 lines of this Python file.
nemo_skills/dataset/bfcl_v4/utils.py
[error] 1-1: Missing copyright notice in the first 10 lines of this Python file.
🪛 GitHub Actions: Lint and Format
nemo_skills/dataset/bfcl_v3/prepare.py
[error] 1-1: Code adjustments via pre-commit (git clone invocation formatting) in prepare.py.
nemo_skills/inference/eval/bfcl.py
[error] 1-1: Code adjustments via pre-commit (imports cleanup) in bfcl.py.
nemo_skills/dataset/bfcl_v4/constants.py
[error] 1-1: Code style cleanup: fixed spacing around VERSION_PREFIX constant.
nemo_skills/inference/eval/bfcl_web_search.py
[error] 1-1: Code adjustments via pre-commit (import cleanup) in bfcl_web_search.py.
nemo_skills/dataset/bfcl_v4/utils.py
[error] 1-1: Ruff check noted missing newline at end of file; newline added during formatting.
nemo_skills/evaluation/evaluator/bfcl.py
[error] 1-1: Code style adjustments via pre-commit (imports/formatting) in bfcl.py.
nemo_skills/dataset/bfcl_v3/bfcl_score.py
[error] 1-1: Ruff check reformatted code in bfcl_score.py. 9 issues fixed across the worktree.
[error] 1-1: Code adjustments via pre-commit (dictionary literal formatting) in bfcl_score.py.
nemo_skills/inference/eval/bfcl_utils.py
[error] 1-1: Code adjustments via pre-commit (string literal formatting) in bfcl_utils.py.
nemo_skills/dataset/bfcl_v4/prepare.py
[error] 1-1: Code adjustments via pre-commit (imports/order and formatting) applied in bfcl_v4/prepare.py.
[error] 1-1: Code adjustments via pre-commit (path construction and calls) in prepare.py.
nemo_skills/dataset/bfcl_v4/bfcl_score.py
[error] 1-1: Ruff check reformatted code in bfcl_score.py. Several formatting changes applied.
[error] 1-1: Code adjustments via pre-commit (dictionary formatting) in bfcl_score.py.
🪛 Ruff (0.14.0)
nemo_skills/dataset/bfcl_v3/prepare.py
133-133: Starting a process with a partial executable path
(S607)
160-160: Using .strip() with multi-character strings is misleading
(B005)
nemo_skills/inference/eval/bfcl.py
219-219: Do not catch blind exception: Exception
(BLE001)
261-261: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice
Replace with next(iter(datapoint["initial_config"].keys()))
(RUF015)
262-262: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice
Replace with next(iter(datapoint["initial_config"].keys()))
(RUF015)
412-412: Do not catch blind exception: Exception
(BLE001)
430-430: Prefer next(iter(involved_instances.values())) over single element slice
Replace with next(iter(involved_instances.values()))
(RUF015)
453-453: Prefer next(iter(involved_instances.values())) over single element slice
Replace with next(iter(involved_instances.values()))
(RUF015)
nemo_skills/inference/eval/bfcl_web_search.py
38-38: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
40-40: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
42-42: Unused method argument: long_context
(ARG002)
140-140: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
150-150: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
163-163: Use explicit conversion flag
Replace with conversion flag
(RUF010)
169-169: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
210-210: Avoid specifying long messages outside the exception class
(TRY003)
263-263: Abstract raise to an inner function
(TRY301)
263-263: Avoid specifying long messages outside the exception class
(TRY003)
265-265: Do not catch blind exception: Exception
(BLE001)
266-266: Use explicit conversion flag
Replace with conversion flag
(RUF010)
270-270: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
nemo_skills/dataset/bfcl_v4/prepare.py
155-155: subprocess call: check for execution of untrusted input
(S603)
156-156: Starting a process with a partial executable path
(S607)
164-166: Avoid specifying long messages outside the exception class
(TRY003)
197-197: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
198-198: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
201-201: Unused function argument: args
(ARG001)
nemo_skills/dataset/bfcl_v4/bfcl_score.py
126-130: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (20)
.gitignore (1)
42-42: LGTM!The ignore pattern correctly excludes BFCL v4 dataset directories, consistent with the new BFCL v4 support added in this PR.
nemo_skills/dataset/bfcl_v3/__init__.py (1)
18-35: LGTM!The split renaming (e.g., "simple" → "simple_python") and expansion with new live/multi-turn entries align with the BFCL v4 migration pattern and improve clarity.
nemo_skills/dataset/bfcl_v4/constants.py (1)
22-23: LGTM!The Path constants are well-defined and provide centralized location references for BFCL v4 data handling.
nemo_skills/dataset/bfcl_v3/constants.py (1)
19-92: LGTM!The systematic renaming of split identifiers (e.g., "simple" → "simple_python") across all test collection mappings is consistent with the BFCL v3
__init__.pyupdates.nemo_skills/dataset/bfcl_v3/prepare.py (3)
117-117: LGTM!Updating the default
file_prefixto "BFCL_v4" aligns with the BFCL v4 migration pattern throughout the PR.
149-150: LGTM!The filtering logic appropriately excludes "format_sensitivity", "memory", and "web_search" patterns from the BFCL v4 data processing, which aligns with the specialized handling of these categories introduced elsewhere in the PR.
133-133: Default branch “main” contains BFCL v4; removing-b v1.3is safe.nemo_skills/dataset/bfcl_v3/bfcl_score.py (3)
17-19: LGTM!The updated SIMPLE_AST list entries align with the split renaming across BFCL v3 modules.
136-136: LGTM!Renaming the key to "non_live_irrelevance" improves clarity and consistency with other output keys in the function.
183-189: LGTM!Nesting the results under a "bfcl_v3" top-level key clearly differentiates BFCL v3 scores from BFCL v4 scores in the overall output structure.
nemo_skills/evaluation/evaluator/bfcl.py (4)
31-31: Verify the new default model configuration.The default model was updated from "o3-mini-2025-01-31-FC" to "o4-mini-2025-04-16-FC". Ensure this model exists and is compatible with the BFCL evaluation pipeline.
46-46: LGTM!Adding support for both "bfcl_v3." and "bfcl_v4." prefixes ensures backward compatibility while enabling BFCL v4 evaluation.
82-88: LGTM!The updated file naming to use "BFCL_v4" and the addition of memory prerequisite filtering align with the BFCL v4 evaluation flow. The filtering logic correctly excludes memory prereq samples that are only needed during inference but not for evaluation scoring.
24-24: Ensure bfcl_eval.utils exports the imported utilities
Confirm thatget_directory_structure_by_categoryandis_memory_prereqare present in the installedbfcl_evalpackage (and update the dependency version or import paths if not).nemo_skills/inference/eval/bfcl_utils.py (4)
37-37: LGTM!The Path import supports the new Path conversion logic added later in the file.
49-66: LGTM!The introduction of
BACKEND_PATH_PREFIXand the updatedCLASS_FILE_PATH_MAPPINGwith f-strings provides a cleaner, more maintainable approach to constructing module paths. The addition of WebSearchAPI and MemoryAPI variants aligns with the BFCL v4 support being added.
117-118: LGTM!The instance naming changes improve robustness:
- Preserving the original
class_namecase maintains consistency- The regex substitution
re.sub(r'[-./]', '_', instance_name)ensures the instance name is a valid Python identifier by replacing potentially problematic characters
125-129: LGTM!The deep copy of
class_initial_configprevents unintended mutations of the original configuration dictionary. Convertingmodel_result_dirto a Path object before passing to_load_scenarioensures type consistency.nemo_skills/dataset/bfcl_v4/__init__.py (1)
15-48: BFCL v4 dataset metadata looks good.SPLITS, group flags, and SCORE_MODULE align with scoring code.
nemo_skills/dataset/bfcl_v4/prepare.py (1)
1-1: Address pre-commit formatting failures.The pipeline indicates that pre-commit hooks will apply automatic adjustments for import ordering, formatting, and path construction. Ensure you run the pre-commit hooks locally or allow the automated fixes to be applied before merging.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
nemo_skills/evaluation/evaluator/bfcl.py (2)
63-67: Avoid shell=True; pass args and env explicitly.Prevents shell injection via model/test_category and makes OPENAI_API_KEY scoping explicit.
Apply:
- cmd = f"OPENAI_API_KEY=dummy bfcl evaluate --model {eval_config.model} --test-category {test_category}" - - LOG.info(f"Running BFCL evaluation: {cmd}") - subprocess.run(cmd, shell=True, check=True, timeout=eval_config.timeout) + LOG.info("Running BFCL evaluation") + args = ["bfcl", "evaluate", "--model", eval_config.model, "--test-category", test_category] + env = {**os.environ, "OPENAI_API_KEY": "dummy"} + subprocess.run(args, check=True, timeout=eval_config.timeout, env=env)Also add:
import osnear the imports at the top.
122-133: Merging with zip() drops lines and can misalign IDs; also marks unevaluated items as correct.Because memory prereqs are filtered out upstream, bfcl_fmted_file has fewer rows than generation_file. zip() truncates and mislabels is_correct. Merge by id; set is_correct only for evaluated ids.
- for gen_line, bfcl_line in zip(gen_f, bfcl_f): - gen_instance = json.loads(gen_line) - # Add the bfcl result to the generation instance - gen_instance.update(json.loads(bfcl_line)) - - if gen_instance["id"] in wrong_instance_ids: - gen_instance["is_correct"] = False - else: - gen_instance["is_correct"] = True - - fout.write(json.dumps(gen_instance) + "\n") + # Index BFCL-evaluated entries by id + bfcl_by_id = {} + for bfcl_line in bfcl_f: + obj = json.loads(bfcl_line) + bfcl_by_id[obj.get("id")] = obj + + for gen_line in gen_f: + gen_instance = json.loads(gen_line) + _id = gen_instance.get("id") + + if _id in bfcl_by_id: + # Merge back evaluated fields + gen_instance.update(bfcl_by_id[_id]) + gen_instance["is_correct"] = _id not in wrong_instance_ids + else: + # Not evaluated (e.g., memory prereq); preserve record and omit correctness + gen_instance.pop("is_correct", None) + + fout.write(json.dumps(gen_instance) + "\n")nemo_skills/inference/eval/bfcl.py (1)
202-219: Normalize client parser outputs; never return bare strings; narrow exceptions.generation can be a bare string, causing .extend() to split into characters. Also catches broad Exception.
- try: - generation = [ - {func_call["name"]: json.dumps(func_call["arguments"])} for func_call in model_response["tool_calls"] - ] - tool_call_ids = [idx for idx in range(len(generation))] - except Exception: - generation = parsed_response["content"] if isinstance(parsed_response["content"], str) else "" - tool_call_ids = [] + try: + tool_calls = model_response.get("tool_calls") or [] + if tool_calls: + generation = [ + {tc["name"]: json.dumps(tc["arguments"])} for tc in tool_calls if isinstance(tc, dict) + ] + tool_call_ids = [str(i) for i in range(len(generation))] + else: + content = parsed_response.get("content", "") + generation = [content] if isinstance(content, str) else [""] + tool_call_ids = [] + except (KeyError, TypeError, ValueError): + content = parsed_response.get("content", "") + generation = [content] if isinstance(content, str) else [""] + tool_call_ids = []
♻️ Duplicate comments (3)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
95-100: Fix optional handling to avoid KeyError and return a neutral metric when missing.Current code sets an empty dict then unconditionally indexes metrics[…], causing KeyError and later lookup failures.
-def get_accuracy_dict(metrics, category, optional=False): +def get_accuracy_dict(metrics, category, optional=False): # reporting aggregation for pass@1[avg-of-k] (for highest k) if available - if optional and f"bfcl_v4.{category}" not in metrics: - category_dict = {} - category_dict = metrics[f"bfcl_v4.{category}"] + key = f"bfcl_v4.{category}" + if key not in metrics: + if optional: + return {"accuracy": 0, "num_entries": 0} + raise KeyError(f"Missing metrics for category: {key}") + category_dict = metrics[key]nemo_skills/inference/eval/bfcl.py (2)
442-451: Fix remove_thinking gating; key off tool-call absence, not equality.- if self.cfg.remove_thinking: - # If no tool calling was used, apply reasoning cleanup to both the message and generation + if self.cfg.remove_thinking: trimmed_response_text = self._remove_thinking_from_message_content( self.message_parser.get_response_text(model_response["message"]) ) - # If no tool calling was used, apply reasoning cleanup to both the message and generation - if model_response["message"].content == model_response["generation"]: - model_response["generation"] = [trimmed_response_text] + # If no tool calling was used, apply reasoning cleanup to both the message and generation + if not model_response.get("tool_calls"): + model_response["generation"] = [trimmed_response_text] self.message_parser.set_response_text(model_response["message"], trimmed_response_text)This avoids cross-parser type comparisons and ensures generation is a list. Based on learnings
248-259: Normalize server parser outputs; avoid string-vs-list shape bugs and narrow except.- try: - tool_calls = output_dict["message"].tool_calls - generation = [{func_call.function.name: func_call.function.arguments} for func_call in tool_calls] - tool_call_ids = [func_call.id for func_call in tool_calls] - except Exception: - tool_calls = [] - generation = output_dict["message"].content - tool_call_ids = [] - - # Use model output if not a tool call - output_dict["generation"] = generation if generation else [output_dict["message"].content] + try: + tool_calls = output_dict["message"].tool_calls or [] + if tool_calls: + generation = [{fc.function.name: fc.function.arguments} for fc in tool_calls] + tool_call_ids = [fc.id for fc in tool_calls] + else: + generation = [output_dict["message"].content] + tool_call_ids = [] + except (AttributeError, KeyError, TypeError): + tool_calls = [] + generation = [getattr(output_dict["message"], "content", "")] + tool_call_ids = [] + + # Always lists + output_dict["generation"] = generationRuff BLE001 suggests avoiding broad Exception; the diff applies that. Based on learnings
🧹 Nitpick comments (10)
nemo_skills/evaluation/evaluator/bfcl.py (2)
51-55: Hardcoded BFCL repo path; make it configurable.Bakes in /opt/gorilla/... which breaks non-standard deployments. Move base path into BFCLEvaluatorConfig (or cfg) with a sane default.
Would introducing eval_config.bfcl_repo_root (defaulting to /opt/gorilla/berkeley-function-call-leaderboard) work for your runners?
45-47: Minor: update comment to v4.The comment still references bfcl_v3 structure; this code supports both prefixes.
nemo_skills/dataset/bfcl_v4/bfcl_score.py (2)
121-130: TRY003: shorten or centralize long exception message.Either compress the text or define a small custom exception with the message in the class docstring.
58-64: Unused constant FORMAT_SENSITIVITY.If not reported in v4 output, remove; otherwise, integrate it into agentic or separate reporting.
Is format_sensitivity still a tracked metric in your pipeline?
nemo_skills/inference/eval/bfcl.py (6)
52-57: Avoid hard dependency on bfcl_eval; make imports lazy/TYPE_CHECKING.Top-level imports will break module import for users without bfcl_eval (earlier code intentionally delayed such imports). Move to local/lazy imports and keep only type hints under TYPE_CHECKING.
-from bfcl_eval.utils import is_memory_prereq, is_memory -from bfcl_eval.model_handler.utils import add_memory_instruction_system_prompt -from bfcl_eval.eval_checker.multi_turn_eval.func_source_code.memory_api_metaclass import ( - MemoryAPI, -) +from typing import TYPE_CHECKING +if TYPE_CHECKING: + from bfcl_eval.eval_checker.multi_turn_eval.func_source_code.memory_api_metaclass import MemoryAPIAdditionally, import where used:
- Inside load_data():
+ from bfcl_eval.utils import is_memory_prereq # local import to keep bfcl_eval optional
- Inside _generate_single_data_point_multi_turn() before the memory block:
+ from bfcl_eval.utils import is_memory + from bfcl_eval.model_handler.utils import add_memory_instruction_system_prompt
288-307: Make path handling robust and simplify key access; optional: event-loop guard.
- Use Path().parent instead of string replace.
- Use next(iter(...)) instead of list(...)[0] (RUF015).
- Confirm load_data will never run inside an active event loop before calling asyncio.run.
- # First, fix the target paths to point to the actual target paths for memory stores - for datapoint in data: - if "initial_config" in datapoint and list(datapoint["initial_config"].keys())[0].startswith("MemoryAPI"): - datapoint["initial_config"][list(datapoint["initial_config"].keys())[0]]["model_result_dir"] = self.cfg.output_file.replace("/output.jsonl", "") + # First, fix the target paths to point to the actual target paths for memory stores + from pathlib import Path + for datapoint in data: + if "initial_config" in datapoint: + key = next(iter(datapoint["initial_config"])) + if key.startswith("MemoryAPI"): + datapoint["initial_config"][key]["model_result_dir"] = str(Path(self.cfg.output_file).parent)Please confirm this method is not invoked under a running event loop (e.g., Jupyter) since it calls asyncio.run.
386-408: Harden memory-instance setup; avoid assert in prod and use RUF015 pattern.
- Replace assert with explicit check and error.
- Use next(iter(...)) for instance extraction.
- Optionally define memory_instance in outer scope for later flush.
- if is_memory(test_category): + memory_instance = None + if is_memory(test_category): # Execute no function call, but just to get a reference to all the instances to get the initial state for logging purpose _, involved_instances = execute_multi_turn_func_call( [], initial_config, involved_classes, test_entry_id=test_entry_id, long_context=("long_context" in test_category or "composite" in test_category), ) - - assert ( - len(involved_instances) == 1 - ), "Memory category should only involve one class." - - memory_instance: "MemoryAPI" = list(involved_instances.values())[0] + if len(involved_instances) != 1: + raise ValueError("Memory category should only involve one class.") + memory_instance = next(iter(involved_instances.values())) data_point["question"] = add_memory_instruction_system_prompt( data_point["question"], test_category, data_point["scenario"], memory_instance, )
457-459: Defensive: ensure generation is a list before extend.Protect against accidental bare strings.
- current_turn_response.extend(model_response["generation"]) + gens = model_response["generation"] + if not isinstance(gens, list): + gens = [gens] + current_turn_response.extend(gens)
464-469: Narrow except when decoding tool calls; avoid BLE001; keep log level low.Catch only expected parse errors from convert_to_function_call/json.
- except Exception: - LOG.info("No tools to execute in this turn. Proceed to next turn.") + except (json.JSONDecodeError, KeyError, TypeError, ValueError): + LOG.debug("No tools to execute in this turn. Proceed to next turn.") break
483-487: Ensure tool_call_id is a string for downstream compatibility.Tokenizer/chat templates typically expect string IDs.
- tool_message = { + tool_message = { "role": "tool", "content": execution_result, - "tool_call_id": tool_call_id, + "tool_call_id": str(tool_call_id), }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
dockerfiles/Dockerfile.nemo-skills(1 hunks)nemo_skills/dataset/bfcl_v4/bfcl_score.py(1 hunks)nemo_skills/evaluation/evaluator/bfcl.py(3 hunks)nemo_skills/inference/eval/bfcl.py(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dockerfiles/Dockerfile.nemo-skills
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/eval/bfcl.py (2)
nemo_skills/inference/generate.py (3)
setup_prompt(368-381)load_data(427-444)process_single_datapoint(546-571)nemo_skills/inference/eval/bfcl_utils.py (3)
execute_multi_turn_func_call(94-181)convert_to_function_call(80-91)is_empty_execute_response(184-189)
nemo_skills/evaluation/evaluator/bfcl.py (1)
nemo_skills/utils.py (2)
get_logger_name(131-135)nested_dataclass(50-83)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
nemo_skills/dataset/bfcl_v3/bfcl_score.py (6)
calculate_combined_accuracy(52-74)get_accuracy_dict(77-112)calculate_non_live_single_turn_accuracy(115-137)calculate_live_single_turn_accuracy(140-156)calculate_multi_turn_accuracy(159-165)compute_score(168-190)
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/bfcl.py
252-252: Do not catch blind exception: Exception
(BLE001)
294-294: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice
Replace with next(iter(datapoint["initial_config"].keys()))
(RUF015)
295-295: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice
Replace with next(iter(datapoint["initial_config"].keys()))
(RUF015)
400-400: Prefer next(iter(involved_instances.values())) over single element slice
Replace with next(iter(involved_instances.values()))
(RUF015)
466-466: Do not catch blind exception: Exception
(BLE001)
511-511: Prefer next(iter(involved_instances.values())) over single element slice
Replace with next(iter(involved_instances.values()))
(RUF015)
nemo_skills/dataset/bfcl_v4/bfcl_score.py
126-130: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
206-213: Good: overall uses numeric accuracies.
This fixes the prior bug of adding dicts to floats.Please confirm the weights (0.4/0.3/0.1/0.1/0.1) match the latest BFCL v4 spec for your run.
nemo_skills/inference/eval/bfcl.py (1)
20-20: OK to add asyncio import.No issues.
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
168-169: Fix Path type passed to process_file to avoid TypeError.process_multi_turn_test_case uses Path arithmetic ("/") with repo_root_dir; passing a str will fail. Pass a Path.
- process_file(temp_dir, input_file, output_file, model_type=model_type) + process_file(Path(temp_dir), input_file, output_file, model_type=model_type)
♻️ Duplicate comments (1)
nemo_skills/dataset/bfcl_v4/utils.py (1)
15-21: Avoid duplication with bfcl_v3.utils where possible.If v4 logic mirrors v3, consider reusing/importing shared helpers to reduce drift.
🧹 Nitpick comments (6)
nemo_skills/dataset/bfcl_v3/prepare.py (2)
160-160: Use removesuffix() or Path.stem for extension handling (safer than replace).replace(".json","") can misfire if ".json" appears elsewhere. Prefer removesuffix (Py3.9+) or Path.stem.
- split_dirname = os.path.join(output_dir, filename.removeprefix("BFCL_v4_").replace(".json", "")) + name = filename.removeprefix("BFCL_v4_").removesuffix(".json") + split_dirname = os.path.join(output_dir, name)Alternatively:
- split_dirname = os.path.join(output_dir, filename.removeprefix("BFCL_v4_").replace(".json", "")) + name = Path(filename).stem + name = name.removeprefix("BFCL_v4_") + split_dirname = os.path.join(output_dir, name)
176-179: Log full traceback on git failures.Use logging.exception for context.
- except subprocess.CalledProcessError as e: - LOG.error(f"Git command failed: {e}") - LOG.error("Make sure git is installed and the repository URL is correct") + except subprocess.CalledProcessError: + LOG.exception("Git command failed") + LOG.error("Make sure git is installed and the repository URL is correct")nemo_skills/dataset/bfcl_v4/utils.py (1)
107-118: Guard against unknown nested item types in _cast_to_openai_type.Direct dict indexing may KeyError for unseen types; fall back to "string".
- elif "items" in properties[key]: - properties[key]["items"]["type"] = mapping[properties[key]["items"]["type"]] + elif "items" in properties[key]: + item_type = properties[key]["items"].get("type", "string") + properties[key]["items"]["type"] = mapping.get(item_type, "string") if properties[key]["items"]["type"] == "array" and "items" in properties[key]["items"]: - properties[key]["items"]["items"]["type"] = mapping[properties[key]["items"]["items"]["type"]] + inner_type = properties[key]["items"]["items"].get("type", "string") + properties[key]["items"]["items"]["type"] = mapping.get(inner_type, "string") elif properties[key]["items"]["type"] == "object" and "properties" in properties[key]["items"]: properties[key]["items"]["properties"] = _cast_to_openai_type( properties[key]["items"]["properties"], mapping )nemo_skills/inference/eval/bfcl_web_search.py (2)
60-61: Use dict.get for config with a safe default.Avoid KeyError when show_snippet is omitted.
- self.show_snippet = initial_config["show_snippet"] + self.show_snippet = initial_config.get("show_snippet", True)
280-281: Narrow the exception in fetch_url_content and keep trace.Catch requests exceptions, not bare Exception; include traceback via logging if available.
- except Exception as e: - return {"error": f"An error occurred while fetching {url}: {str(e)}"} + except requests.RequestException as e: + return {"error": f"requests error while fetching {url}: {e!s}"}nemo_skills/dataset/bfcl_v4/prepare.py (1)
195-197: Remove unnecessary f-string; keep traceback via exception().Minor cleanup and better diagnostics.
- except subprocess.CalledProcessError as e: - LOG.exception(f"Git command failed") + except subprocess.CalledProcessError: + LOG.exception("Git command failed") LOG.error("Make sure git is installed and the repository URL is correct")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
nemo_skills/dataset/bfcl_v3/prepare.py(4 hunks)nemo_skills/dataset/bfcl_v4/__init__.py(1 hunks)nemo_skills/dataset/bfcl_v4/constants.py(1 hunks)nemo_skills/dataset/bfcl_v4/prepare.py(1 hunks)nemo_skills/dataset/bfcl_v4/utils.py(1 hunks)nemo_skills/inference/eval/bfcl_web_search.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/bfcl_v4/constants.py
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/dataset/bfcl_v4/prepare.py (3)
nemo_skills/dataset/bfcl_v4/utils.py (2)
func_doc_language_specific_pre_processing(32-83)convert_to_tool(121-133)nemo_skills/utils.py (1)
get_logger_name(131-135)nemo_skills/dataset/bfcl_v3/utils.py (2)
is_multi_turn(51-55)load_file(39-48)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
download_and_process_bfcl_data(140-197)
🪛 Ruff (0.14.1)
nemo_skills/dataset/bfcl_v4/prepare.py
154-154: subprocess call: check for execution of untrusted input
(S603)
155-155: Starting a process with a partial executable path
(S607)
163-165: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
196-196: f-string without any placeholders
Remove extraneous f prefix
(F541)
197-197: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
nemo_skills/dataset/bfcl_v3/prepare.py
133-133: Starting a process with a partial executable path
(S607)
nemo_skills/inference/eval/bfcl_web_search.py
53-53: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
55-55: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
57-57: Unused method argument: long_context
(ARG002)
155-155: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
165-165: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
178-178: Use explicit conversion flag
Replace with conversion flag
(RUF010)
184-184: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
225-225: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
280-280: Do not catch blind exception: Exception
(BLE001)
281-281: Use explicit conversion flag
Replace with conversion flag
(RUF010)
285-285: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
117-117: Confirm intent: modifying bfcl_v3 prepare to process v4 files.Defaulting file_prefix to "BFCL_v4" inside bfcl_v3/prepare.py blurs version boundaries. If unintentional, revert to v3 defaults and move v4 logic to bfcl_v4. If intentional, please confirm downstream callers expect v4 outputs here.
nemo_skills/dataset/bfcl_v4/__init__.py (1)
15-48: BFCLv4 dataset manifest looks good; ensure split parity with evaluator.Constants and BENCHMARKS map are fine. Please verify SPLITS matches bfcl_eval.constants.category_mapping.ALL_SCORING_CATEGORIES to avoid drift with prepare/evaluator.
nemo_skills/dataset/bfcl_v4/prepare.py (1)
154-156: Review is based on incorrect assumption; bfcl_v3 does not implement the suggested pattern.The review claims the fix follows "same rationale as bfcl_v3," but inspection of
nemo_skills/dataset/bfcl_v3/prepare.py(line 133) shows it contains onlygit clone --depth=1without any subsequentgit checkout. The suggested pattern does not exist as a reference implementation. Additionally, the proposed diff usesPLACEHOLDER_SHArather than an actual commit identifier, and there is no documented pinning strategy or specific commit/tag identified in the codebase for either dataset version.Likely an incorrect or invalid review comment.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/inference/eval/bfcl_web_search.py (1)
152-209: CRITICAL: Past review comments on infinite retry loop and generator handling were NOT addressed.The issues flagged in the previous review are still present:
- Infinite loop (line 152):
while True:with no max retry limit can hang indefinitely under persistent rate-limiting.- Generator handling (line 155):
DDGS().text()returns a generator, not a list.- Truthiness check fails (line 185):
if not search_results:will not work correctly on a generator (generators are always truthy until exhausted).- Slicing fails (line 192):
search_results[:max_results]will raiseTypeErrorbecause generators don't support slicing.These are runtime blockers that will cause failures in production.
The previous review provided a comprehensive fix. Please apply it:
- backoff = 10 # initial back-off in seconds - - # Infinite retry loop with exponential backoff - while True: + backoff = 10 # initial back-off in seconds + max_retries = 6 + attempts = 0 + + while attempts < max_retries: try: wait_time = backoff + random.uniform(0, backoff) - search_results = DDGS(timeout=60).text( + search_results = list(DDGS(timeout=60).text( query=keywords, region=region, max_results=max_results, backend="duckduckgo" - ) + )) except DDGSException as e: if "No results found" in str(e): + attempts += 1 wait_time = backoff + random.uniform(0, backoff) error_block = ( "*" * 100 - + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests. This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds…" + + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests ({attempts}/{max_retries}). This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds..." + "*" * 100 ) print(error_block) time.sleep(wait_time) backoff = min(backoff * 2, 120) # cap the back-off continue else: error_block = ( "*" * 100 + f"\n❗️❗️ [WebSearchAPI] Error from DuckDuckGo: {str(e)}. This is not a rate-limit error, so it will not be retried." + "*" * 100 ) print(error_block) return {"error": str(e)} - break # Success – no rate-limit error detected + break # Success - no rate-limit error detected + + if attempts >= max_retries: + return {"error": "Exceeded max retries while querying DuckDuckGo due to rate limits."} if not search_results: return { "error": "Failed to retrieve the search results from server. Please try again later." } # Convert the search results to the desired format results = [] for result in search_results[:max_results]: if self.show_snippet: results.append( { "title": result["title"], "href": result["href"], "body": result["body"], } ) else: results.append( { "title": result["title"], "href": result["href"], } ) return results
🧹 Nitpick comments (2)
requirements/main.txt (1)
40-40: Verify CPU-only FAISS variant is intentional.The addition of
faiss-cpusuggests CPU-only deployment. If GPU acceleration is available or expected in production,faiss-gpuor the basefaisspackage may be more appropriate. Confirm this choice aligns with deployment targets.Please clarify: Is the CPU-only variant intentional, or should this be
faiss(which auto-selects based on environment)?nemo_skills/inference/eval/bfcl_web_search.py (1)
228-249: Consider more robust browser emulation for production use.The custom headers help avoid 403 Forbidden errors, but some sites may still detect this as a bot. For more reliable fetching, consider:
- Option 1: Use
requests-htmlwhich includes a browser-like session with better default headers.- Option 2: For sites with heavy JavaScript, consider
playwrightorseleniumfor full browser automation.This is noted by the TODO comment on line 228. The current approach is acceptable for initial implementation, but may need enhancement based on real-world evaluation results.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_skills/inference/eval/bfcl_web_search.py(1 hunks)requirements/main.txt(2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/bfcl_web_search.py
52-52: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
54-54: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
56-56: Unused method argument: long_context
(ARG002)
154-154: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
164-164: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
177-177: Use explicit conversion flag
Replace with conversion flag
(RUF010)
183-183: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
224-224: Avoid specifying long messages outside the exception class
(TRY003)
277-277: Abstract raise to an inner function
(TRY301)
277-277: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Use explicit conversion flag
Replace with conversion flag
(RUF010)
284-284: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (5)
requirements/main.txt (2)
16-16: Good practice: explanatory comments for new dependencies.The inline comments ("Needed for BFCLv4") are helpful for maintainability and context. This aligns well with other similar comments in the file (e.g., "Requires patching…" on line 31).
Also applies to: 38-40
16-16: Remove unusedfaiss-cpudependency; version pinning concern is invalid.The original review comment's version pinning concern misrepresents project practice—the codebase intentionally leaves most dependencies unpinned, pinning only those with known issues (e.g.,
litellm==1.77.5). The new packages follow this existing pattern correctly.However, verification reveals a critical issue:
faiss-cpuis added to requirements but never imported or used anywhere in the codebase. Onlyddgsandsentence_transformersare actively used for BFCLv4. Either removefaiss-cpuor confirm if supporting code is missing.Likely an incorrect or invalid review comment.
nemo_skills/inference/eval/bfcl_web_search.py (3)
46-54: Random instances are appropriate for non-cryptographic error simulation.The S311 static analysis warnings about
random.Randomare false positives here. These instances with fixed seeds are explicitly used to simulate deterministic errors for testing purposes, not for any security-sensitive operations.
279-280: Broad exception handling is appropriate for this utility function.Catching
Exceptionhere is reasonable for a general-purpose URL fetching utility that needs to gracefully handle network errors, timeouts, parsing failures, etc., and return a consistent error dict format. The BLE001 static analysis warning can be safely ignored in this context.
282-298: LGTM: Error simulation helper is well-implemented.The helper method correctly generates realistic error messages by parsing the URL and formatting templates with appropriate context. While currently unused (as noted in comments), this provides valuable testing infrastructure for future use.
gwarmstrong
left a comment
There was a problem hiding this comment.
@Glorf I have a couple questions about the current implementation.
| "parallel_multiple", | ||
| "java", | ||
| "javascript", | ||
| "simple_java", |
There was a problem hiding this comment.
@Glorf what is the reason for changing here and breaking backward compatibility?
There was a problem hiding this comment.
The authors renamed the evals in the repo: ShishirPatil/gorilla@58f57e9
I originally tried to put bfclv3 aside from bfclv4, but as they both implement bfcl_eval package, I decided to keep bfclv3 as the subset of bfclv4 and use bfclv4 evaluation/logic (as it's technically the same, just need to drop agentic evals)
| timeout: int = 300 | ||
|
|
||
|
|
||
| def eval_bfcl(cfg): |
There was a problem hiding this comment.
Is there a way to evaluate an individual sample at a time? If so, it would definitely be good to move onto the new Evaluator class with an eval_single implementation: https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/evaluator/base.py#L34-L91
This would let samples be evaluated during the generation loop--potentially saving time if the evaluations are CPU intensive.
Otherwise, it would still be good to move onto that, but only implement eval_full.
There was a problem hiding this comment.
I don't think we can - at least not at this control level - we offload all the eval logic to the bfcl_eval - the original repo
Signed-off-by: Michal Bien <mbien@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Michal Bien <mbien@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Michal Bien <mbien@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: Michal Bien <mbien@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
New Features
Refactor
Chores