BFCLv4 support by Glorf · Pull Request #908 · NVIDIA-NeMo/Skills

Glorf · 2025-10-08T08:56:38Z

Summary by CodeRabbit

New Features
- BFCL v4 support: dataset downloader/prep pipeline, expanded splits (multi-turn, live, language-specific), memory and web-search handling.
- New evaluation outputs: agentic (memory + web-search) and hallucination scores.
- Built-in web-search backend with query and URL-fetch capabilities.
Refactor
- Migration of dataset, evaluation, and inference flows to BFCL v4 naming and score structure; multi-turn and memory flows enhanced.
Chores
- Updated constants, default evaluator model, ignored dataset paths, and added runtime dependencies.

coderabbitai · 2025-10-08T08:56:49Z

Walkthrough

Introduces BFCL v4: new dataset metadata and splits, scoring extended with memory/web-search/format_sensitivity and agentic/hallucination components, data prepare/utils for v4 ingestion and OpenAI-style tools, evaluator/inference memory & web-search integrations, new WebSearchAPI, and related infra updates.

Changes

Cohort / File(s)	Summary of Changes
Dataset v4 metadata `nemo_skills/dataset/bfcl_v4/__init__.py`	New module defining `DATASET_GROUP`, `SPLITS` (many BFCL v4 splits), `IS_BENCHMARK_GROUP`, `SCORE_MODULE`, and `BENCHMARKS` mapping.
Scoring / bfcl_v4 logic `nemo_skills/dataset/bfcl_v4/bfcl_score.py`	New constants `MEMORY`, `WEB_SEARCH`, `FORMAT_SENSITIVITY`; `GLOBAL_MAX_K` validation; `get_accuracy_dict(..., optional=False)` extended; added `calculate_agentic_accuracy`, `calculate_hallucination_measurement`, and `compute_score(metrics: dict)` to return consolidated `bfcl_v4` output including agentic and hallucination components.
v4 constants `nemo_skills/dataset/bfcl_v4/constants.py`	New path/version constants: `VERSION_PREFIX`, `MULTI_TURN_FUNC_DOC_PATH`, `DATA_FOLDER_PATH` (Path objects).
Preparation pipeline (v4) `nemo_skills/dataset/bfcl_v4/prepare.py`	New prepare module: shallow-clone BFCL repo, `download_and_process_bfcl_data`, `load_dataset_entry`, `process_multi_turn_test_case`; per-category processing, language hints, multi-turn handling, conversion to tool-format, and writing per-split `test.jsonl` and `__init__.py`.
Function/tool utilities (v4) `nemo_skills/dataset/bfcl_v4/utils.py`	New utilities: `_get_language_specific_hint`, `func_doc_language_specific_pre_processing`, `_cast_to_openai_type`, and `convert_to_tool` to produce OpenAI-style function/tool descriptors and normalize types.
Evaluator updates (bfcl) `nemo_skills/evaluation/evaluator/bfcl.py`	Default model updated; test-category extraction now strips `bfcl_v3.` and `bfcl_v4.`; output/score filenames/paths updated to BFCL_v4 with category directory structure; memory-prereq samples filtered during BFCL format conversion.
v3 → naming & score adjustments `nemo_skills/dataset/bfcl_v3/__init__.py`, `.../constants.py`, `.../prepare.py`, `.../bfcl_score.py`	Renamed splits (`simple`→`simple_python`, `java`→`simple_java`, `javascript`→`simple_javascript`), added live/multi-turn splits, updated `TEST_COLLECTION_MAPPING`, adjusted `prepare.py` defaults to `BFCL_v4` and filters, and restructured `bfcl_score.py` return shape.
Inference / execution changes `nemo_skills/inference/eval/bfcl.py`, `.../bfcl_utils.py`	Added memory handling (`is_memory_prereq`/`is_memory`, MemoryAPI integration), `BFCLGenerationTask.load_data` to order/prioritize memory prereqs, deep-copy config loading, `BACKEND_PATH_PREFIX`, expanded `CLASS_FILE_PATH_MAPPING` (WebSearchAPI, MemoryAPI variants), and memory flush behavior.
Web search backend `nemo_skills/inference/eval/bfcl_web_search.py`	New `WebSearchAPI` class with `_load_scenario`, `search_engine_query` (DDGS with retries/backoff), `fetch_url_content` (raw/markdown/truncate), and error-message simulation (`ERROR_TEMPLATES`).
Requirements & infra `requirements/main.txt`, `.gitignore`, `dockerfiles/Dockerfile.nemo-skills`	Added dependencies `ddgs`, `sentence_transformers`, `faiss-cpu`; `.gitignore` adds `nemo_skills/dataset/bfcl_v4/*/`; Dockerfile pinned Gorilla repo commit updated.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Prep as prepare.py
  participant Git as GitHub
  participant FS as Local FS
  participant Utils as utils.convert_to_tool

  Note over Prep,Git: BFCL v4 download & per-category prepare
  Prep->>Git: shallow clone(repo_url, depth=1)
  Git-->>Prep: repo_root
  Prep->>Prep: discover BFCL_v4 JSON files (filter memory/web_search/format_sensitivity)
  loop per category file
    Prep->>Prep: load_dataset_entry(target_folder, category)
    Prep->>Utils: func_doc_language_specific_pre_processing + convert_to_tool
    Utils-->>Prep: tool descriptors
    Prep->>FS: write split dir, `__init__.py`, and `test.jsonl`
  end

sequenceDiagram
  autonumber
  participant Eval as Evaluator
  participant Score as bfcl_v4.bfcl_score
  participant Metrics as Metrics Store

  Note over Eval,Score: Scoring includes agentic (memory/web_search) & hallucination components

  Eval->>Score: compute_score(metrics)
  activate Score
  Score->>Score: validate GLOBAL_MAX_K
  Score->>Score: get_accuracy_dict(metrics, category, optional=True)
  Score->>Score: calculate_agentic_accuracy(metrics)
  Score->>Metrics: read MEMORY & WEB_SEARCH category metrics
  Metrics-->>Score: per-category accuracies
  Score->>Score: aggregate -> overall, agentic, hallucination sub-scores
  Score-->>Eval: bfcl_v4 result (overall + sub-scores)
  deactivate Score

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through files with eager paws,
Cloning repos and mapping laws.
Memory, web-search, tools in tune,
v4 splits sprout beneath the moon.
A rabbit cheers: prepare and run! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title "BFCLv4 support" directly and clearly summarizes the primary objective of this pull request. The changes consistently introduce comprehensive support for Berkeley Function Call Leaderboard v4, including new dataset modules, scoring functions, preparation scripts, web search and memory APIs, and infrastructure updates. The title is concise, specific, and accurately conveys the main intent without unnecessary details, making it clear to teammates scanning the history that this PR adds BFCL version 4 support.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch mbien/bfcl_v4

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94a7619 and 02d6300.

📒 Files selected for processing (6)

nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (4 hunks)
nemo_skills/dataset/bfcl_v4/constants.py (2 hunks)
nemo_skills/dataset/bfcl_v4/prepare.py (7 hunks)
nemo_skills/dataset/bfcl_v4/utils.py (2 hunks)
nemo_skills/evaluation/evaluator/bfcl.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

nemo_skills/dataset/bfcl_v4/prepare.py (1)

nemo_skills/dataset/bfcl_v4/utils.py (7)

convert_to_tool (172-184)

func_doc_language_specific_pre_processing (83-134)

is_multi_turn (39-43)

load_file (62-71)

clean_up_memory_prereq_entries (187-217)

populate_initial_settings_for_memory_test_cases (220-239)

populate_initial_settings_for_web_search_test_cases (242-258)

🪛 Ruff (0.13.3)

nemo_skills/dataset/bfcl_v4/prepare.py

108-108: Undefined name test_cases_to_generate

(F821)

145-145: Starting a process with a partial executable path

(S607)

169-169: Using .strip() with multi-character strings is misleading

(B005)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

🔇 Additional comments (15)

nemo_skills/evaluation/evaluator/bfcl.py (1)

44-44: LGTM - Consistent v3 to v4 migration.

The file naming and category extraction updates correctly align with BFCL v4 conventions across all three locations.

Also applies to: 51-51, 79-79

nemo_skills/dataset/bfcl_v4/__init__.py (2)

35-40: LGTM - New module exports added.

The new splits for web search, memory, and format sensitivity are properly integrated into the SPLITS list.

45-47: LGTM - Version references updated.

Both SCORE_MODULE and BENCHMARKS correctly reference bfcl_v4.

nemo_skills/dataset/bfcl_v4/utils.py (4)

27-43: LGTM - Helper predicates added.

The new category detection functions (is_memory, is_memory_prereq, is_web_search, is_multi_turn) provide clear, reusable predicates for identifying test categories.

46-59: LGTM - Test category extraction added.

The extract_test_category_from_id function properly handles both normal and format_sensitivity test entry IDs, with clear logic for removing the "_prereq" suffix when requested.

187-217: LGTM - Memory prereq cleanup logic added.

The clean_up_memory_prereq_entries function correctly:

Removes orphaned memory-prerequisite test cases

Filters out already-generated entries from dependency lists

220-258: LGTM - Initial settings population added.

Both populate_initial_settings_for_memory_test_cases and populate_initial_settings_for_web_search_test_cases provide clear configuration injection for their respective test categories.

nemo_skills/dataset/bfcl_v4/prepare.py (2)

95-95: LGTM - Signature simplified.

Removing the model_type parameter simplifies the interface. Ensure downstream callers are updated accordingly.

169-169: LGTM - Correct prefix removal.

Using .lstrip("BFCL_v4_") correctly removes the prefix. The static analysis warning about multi-character strings is a false positive in this case, as the intent is to remove the exact prefix string.

nemo_skills/dataset/bfcl_v4/bfcl_score.py (4)

51-62: LGTM - New agentic categories added.

The MEMORY, WEB_SEARCH, and FORMAT_SENSITIVITY constants properly define the new test categories for BFCL v4.

94-98: LGTM - Optional category lookup added.

The optional parameter allows graceful handling of missing categories like format_sensitivity, which may not always be present.

186-203: LGTM - Agentic accuracy calculation added.

The calculate_agentic_accuracy function correctly computes combined accuracy for memory and web search categories, with optional format_sensitivity inclusion.

209-229: LGTM - Agentic metrics integrated.

The compute_score function correctly integrates agentic accuracy metrics (memory and web search) into the overall accuracy calculation and result structure.

nemo_skills/dataset/bfcl_v4/constants.py (2)

18-68: LGTM - Comprehensive category definitions.

The category constants are well-organized with clear derivations:

MEMORY_CATEGORY derived from ALL_AVAILABLE_MEMORY_BACKENDS

SINGLE_TURN_CATEGORY composed from NON_LIVE_CATEGORY and LIVE_CATEGORY

AGENTIC_CATEGORY combines MEMORY_CATEGORY and WEB_SEARCH_CATEGORY

Clear separation between scoring and non-scoring categories

70-97: LGTM - Test collection mapping updated.

The TEST_COLLECTION_MAPPING uses the new category constants consistently and adds new top-level groups ("memory", "web_search", "agentic") for v4.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
231-232: Fix typo in dictionary key.

The key "format_sensistivity" contains a typo and should be "format_sensitivity" to match the constant and source key.

Apply this diff to fix the typo:
     if "format_sensitivity" in agentic_accuracy:
-        res["format_sensistivity"] = agentic_accuracy["format_sensitivity"]
+        res["format_sensitivity"] = agentic_accuracy["format_sensitivity"]

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02d6300 and 9d99897.

📒 Files selected for processing (2)

nemo_skills/dataset/bfcl_v4/__init__.py (2 hunks)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/dataset/bfcl_v4/init.py

🧰 Additional context used

🪛 GitHub Actions: Lint and Format

nemo_skills/dataset/bfcl_v4/bfcl_score.py

[warning] 1-1: ruff-format reformatted this file during hooks. Please review formatting changes.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

🔇 Additional comments (2)

nemo_skills/dataset/bfcl_v4/bfcl_score.py (2)

51-63: LGTM! New agentic constants are well-defined.

The new constants for memory, web search, and format sensitivity categories align with the BFCL v4 update.

186-203: LGTM! Agentic accuracy calculation is sound.

The function correctly aggregates memory and web search accuracies, and conditionally includes format sensitivity. However, this depends on fixing the critical bug in get_accuracy_dict (lines 94-98) for the optional parameter to work correctly.

coderabbitai

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_skills/inference/eval/bfcl.py (1)

169-177: Normalize client parser output for no-tool-call case to list-of-strings.

Avoid returning a string (later extend will split it). Keep generation as list.

-        except Exception:
-            generation = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
-            tool_call_ids = []
+        except (KeyError, TypeError, ValueError):
+            text = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
+            generation = [text] if text else []
+            tool_call_ids = []

♻️ Duplicate comments (2)

nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
95-101: Fix optional access to avoid KeyError when category missing.

Unconditionally indexing metrics[...] defeats the optional guard.
-def get_accuracy_dict(metrics, category, optional=False):
+def get_accuracy_dict(metrics, category, optional=False):
     # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
-    if optional and f"bfcl_v4.{category}" not in metrics:
-        category_dict = {}
-    category_dict = metrics[f"bfcl_v4.{category}"]
+    key = f"bfcl_v4.{category}"
+    if optional and key not in metrics:
+        return {"accuracy": 0, "num_entries": 0}
+    category_dict = metrics[key]
nemo_skills/dataset/bfcl_v4/prepare.py (1)

155-157: Version pinning concern remains unresolved.

As noted in a previous review, shallow-cloning HEAD without version pinning risks future breakage if the upstream repository structure changes. This concern from the prior review has not been addressed in the current implementation.

Consider one of these approaches:

Clone with an explicit tag: ["git", "clone", "--depth=1", "--branch", "v4", repo_url, temp_dir]

Add a post-clone checkout: subprocess.run(["git", "-C", temp_dir, "checkout", KNOWN_COMMIT], check=True)

Store the commit hash or tag as a constant at the module level for easy updates.

🧹 Nitpick comments (8)

nemo_skills/inference/eval/bfcl.py (5)

403-404: Prevent accidental char-wise extend when generation is string.

Even with parser fixes, defensively normalize to list.
-                current_turn_response.extend(model_response["generation"])
+                gen = model_response["generation"]
+                current_turn_response.extend(gen if isinstance(gen, list) else [gen])
256-263: Use pathlib and robust key extraction for MemoryAPI path fix.

Avoid string replace and single-element list indexing; use pathlib and next(iter(...)).
-        for datapoint in data:
-            if "initial_config" in datapoint and list(datapoint["initial_config"].keys())[0].startswith("MemoryAPI"):
-                datapoint["initial_config"][list(datapoint["initial_config"].keys())[0]]["model_result_dir"] = self.cfg.output_file.replace("/output.jsonl", "")
+        for datapoint in data:
+            if "initial_config" in datapoint:
+                api_key = next(iter(datapoint["initial_config"]), None)
+                if api_key and api_key.startswith("MemoryAPI"):
+                    datapoint["initial_config"][api_key]["model_result_dir"] = str(Path(self.cfg.output_file).parent)
Add this import at the top of the file:
+from pathlib import Path
270-273: Avoid asyncio.run in environments with a running event loop.

This will raise at runtime (e.g., Jupyter/async runners). Consider refactoring to async load_data and invoking from the existing loop, or provide a helper that schedules coroutines when a loop is running.

Would you like a small refactor to make prereq processing async-safe?

430-436: Use iterator for first (and only) instance.

Satisfy RUF015 and avoid list materialization.

Based on static analysis hints
-                    memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
+                    memory_instance: "MemoryAPI" = next(iter(involved_instances.values()))
451-455: Same iterator fix when flushing memory.

Based on static analysis hints
-                    memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
+                    memory_instance: "MemoryAPI" = next(iter(involved_instances.values()))

nemo_skills/dataset/bfcl_v4/utils.py (1)

96-103: Guard nested type mapping to avoid KeyError on unseen types.

Use mapping.get(..., "string") for robustness; recurse safely.

-                properties[key]["items"]["type"] = mapping[properties[key]["items"]["type"]]
+                item_type = properties[key]["items"].get("type")
+                properties[key]["items"]["type"] = mapping.get(item_type, "string")
                 if properties[key]["items"]["type"] == "array" and "items" in properties[key]["items"]:
-                    properties[key]["items"]["items"]["type"] = mapping[properties[key]["items"]["items"]["type"]]
+                    nested_item_type = properties[key]["items"]["items"].get("type")
+                    properties[key]["items"]["items"]["type"] = mapping.get(nested_item_type, "string")
                 elif properties[key]["items"]["type"] == "object" and "properties" in properties[key]["items"]:
                     properties[key]["items"]["properties"] = _cast_to_openai_type(
                         properties[key]["items"]["properties"], mapping
                     )

nemo_skills/inference/eval/bfcl_web_search.py (1)

265-267: Narrow exception handling in fetch_url_content.

Catching broad Exception hides actionable errors; prefer requests.exceptions.RequestException.
-        except Exception as e:
+        except requests.exceptions.RequestException as e:
             return {"error": f"An error occurred while fetching {url}: {str(e)}"}

nemo_skills/dataset/bfcl_v4/prepare.py (1)

201-208: Remove unused function parameter.

The args parameter is not used in the function body. If the --model_type argument parsed on line 213 was intended for future use, consider removing both the argument parsing and the parameter until it's actually needed.

Apply this diff:
-def main(args):
+def main():
     LOG.warning(
         "Currently processing according to the OpenAI model style which works for most models, including Qwen/Llama-Nemotron/DeepSeek."
     )
 
     download_and_process_bfcl_data(
         REPO_URL, DATA_FOLDER_PATH, output_dir=os.path.join(os.path.dirname(__file__)),
     )
And remove the unused argument parsing:
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--model_type", type=str, default=None, required=False)
-    args = parser.parse_args()
-
-    main(args)
+    main()

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d99897 and 5f5b3f8.

📒 Files selected for processing (15)

.gitignore (1 hunks)
dockerfiles/Dockerfile.nemo-skills (1 hunks)
nemo_skills/dataset/bfcl_v3/__init__.py (1 hunks)
nemo_skills/dataset/bfcl_v3/bfcl_score.py (3 hunks)
nemo_skills/dataset/bfcl_v3/constants.py (4 hunks)
nemo_skills/dataset/bfcl_v3/prepare.py (4 hunks)
nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1 hunks)
nemo_skills/dataset/bfcl_v4/constants.py (1 hunks)
nemo_skills/dataset/bfcl_v4/prepare.py (1 hunks)
nemo_skills/dataset/bfcl_v4/utils.py (1 hunks)
nemo_skills/evaluation/evaluator/bfcl.py (3 hunks)
nemo_skills/inference/eval/bfcl.py (6 hunks)
nemo_skills/inference/eval/bfcl_utils.py (3 hunks)
nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (6)

nemo_skills/dataset/bfcl_v3/prepare.py (1)

nemo_skills/dataset/bfcl_v4/prepare.py (1)

download_and_process_bfcl_data (140-198)

nemo_skills/inference/eval/bfcl.py (2)

nemo_skills/inference/generate.py (3)

setup_prompt (365-378)

load_data (418-435)

process_single_datapoint (537-562)

nemo_skills/inference/eval/bfcl_utils.py (3)

convert_to_function_call (80-91)

is_empty_execute_response (184-189)

execute_multi_turn_func_call (94-181)

nemo_skills/evaluation/evaluator/bfcl.py (1)

nemo_skills/utils.py (2)

get_logger_name (130-134)

nested_dataclass (49-82)

nemo_skills/inference/eval/bfcl_utils.py (1)

nemo_skills/inference/eval/bfcl_web_search.py (1)

_load_scenario (42-45)

nemo_skills/dataset/bfcl_v4/prepare.py (4)

nemo_skills/dataset/bfcl_v4/utils.py (2)

func_doc_language_specific_pre_processing (17-68)

convert_to_tool (106-118)

nemo_skills/utils.py (1)

get_logger_name (130-134)

nemo_skills/dataset/bfcl_v3/utils.py (2)

is_multi_turn (51-55)

load_file (39-48)

nemo_skills/dataset/bfcl_v3/prepare.py (1)

download_and_process_bfcl_data (117-178)

nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)

nemo_skills/dataset/bfcl_v3/bfcl_score.py (6)

calculate_combined_accuracy (52-74)

get_accuracy_dict (77-112)

calculate_non_live_single_turn_accuracy (115-137)

calculate_live_single_turn_accuracy (140-156)

calculate_multi_turn_accuracy (159-165)

compute_score (168-190)

🪛 GitHub Actions: Copyright check

nemo_skills/inference/eval/bfcl_web_search.py

[error] 1-1: Missing copyright notice in the first 10 lines of this Python file.

nemo_skills/dataset/bfcl_v4/utils.py

[error] 1-1: Missing copyright notice in the first 10 lines of this Python file.

🪛 GitHub Actions: Lint and Format

nemo_skills/dataset/bfcl_v3/prepare.py

[error] 1-1: Code adjustments via pre-commit (git clone invocation formatting) in prepare.py.

nemo_skills/inference/eval/bfcl.py

[error] 1-1: Code adjustments via pre-commit (imports cleanup) in bfcl.py.

nemo_skills/dataset/bfcl_v4/constants.py

[error] 1-1: Code style cleanup: fixed spacing around VERSION_PREFIX constant.

nemo_skills/inference/eval/bfcl_web_search.py

[error] 1-1: Code adjustments via pre-commit (import cleanup) in bfcl_web_search.py.

nemo_skills/dataset/bfcl_v4/utils.py

[error] 1-1: Ruff check noted missing newline at end of file; newline added during formatting.

nemo_skills/evaluation/evaluator/bfcl.py

[error] 1-1: Code style adjustments via pre-commit (imports/formatting) in bfcl.py.

nemo_skills/dataset/bfcl_v3/bfcl_score.py

[error] 1-1: Ruff check reformatted code in bfcl_score.py. 9 issues fixed across the worktree.

[error] 1-1: Code adjustments via pre-commit (dictionary literal formatting) in bfcl_score.py.

nemo_skills/inference/eval/bfcl_utils.py

[error] 1-1: Code adjustments via pre-commit (string literal formatting) in bfcl_utils.py.

nemo_skills/dataset/bfcl_v4/prepare.py

[error] 1-1: Code adjustments via pre-commit (imports/order and formatting) applied in bfcl_v4/prepare.py.

[error] 1-1: Code adjustments via pre-commit (path construction and calls) in prepare.py.

nemo_skills/dataset/bfcl_v4/bfcl_score.py

[error] 1-1: Ruff check reformatted code in bfcl_score.py. Several formatting changes applied.

[error] 1-1: Code adjustments via pre-commit (dictionary formatting) in bfcl_score.py.

🪛 Ruff (0.14.0)

nemo_skills/dataset/bfcl_v3/prepare.py

133-133: Starting a process with a partial executable path

(S607)

160-160: Using .strip() with multi-character strings is misleading

(B005)

nemo_skills/inference/eval/bfcl.py

219-219: Do not catch blind exception: Exception

(BLE001)

261-261: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice

Replace with next(iter(datapoint["initial_config"].keys()))

(RUF015)

262-262: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice

Replace with next(iter(datapoint["initial_config"].keys()))

(RUF015)

412-412: Do not catch blind exception: Exception

(BLE001)

430-430: Prefer next(iter(involved_instances.values())) over single element slice

Replace with next(iter(involved_instances.values()))

(RUF015)

453-453: Prefer next(iter(involved_instances.values())) over single element slice

Replace with next(iter(involved_instances.values()))

(RUF015)

nemo_skills/inference/eval/bfcl_web_search.py

38-38: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

40-40: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

42-42: Unused method argument: long_context

(ARG002)

140-140: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

150-150: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

163-163: Use explicit conversion flag

Replace with conversion flag

(RUF010)

169-169: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)

210-210: Avoid specifying long messages outside the exception class

(TRY003)

263-263: Abstract raise to an inner function

(TRY301)

263-263: Avoid specifying long messages outside the exception class

(TRY003)

265-265: Do not catch blind exception: Exception

(BLE001)

266-266: Use explicit conversion flag

Replace with conversion flag

(RUF010)

270-270: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)

nemo_skills/dataset/bfcl_v4/prepare.py

155-155: subprocess call: check for execution of untrusted input

(S603)

156-156: Starting a process with a partial executable path

(S607)

164-166: Avoid specifying long messages outside the exception class

(TRY003)

197-197: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

198-198: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

201-201: Unused function argument: args

(ARG001)

nemo_skills/dataset/bfcl_v4/bfcl_score.py

126-130: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

🔇 Additional comments (20)

.gitignore (1)

42-42: LGTM!

The ignore pattern correctly excludes BFCL v4 dataset directories, consistent with the new BFCL v4 support added in this PR.

nemo_skills/dataset/bfcl_v3/__init__.py (1)

18-35: LGTM!

The split renaming (e.g., "simple" → "simple_python") and expansion with new live/multi-turn entries align with the BFCL v4 migration pattern and improve clarity.

nemo_skills/dataset/bfcl_v4/constants.py (1)

22-23: LGTM!

The Path constants are well-defined and provide centralized location references for BFCL v4 data handling.

nemo_skills/dataset/bfcl_v3/constants.py (1)

19-92: LGTM!

The systematic renaming of split identifiers (e.g., "simple" → "simple_python") across all test collection mappings is consistent with the BFCL v3 __init__.py updates.

nemo_skills/dataset/bfcl_v3/prepare.py (3)

117-117: LGTM!

Updating the default file_prefix to "BFCL_v4" aligns with the BFCL v4 migration pattern throughout the PR.

149-150: LGTM!

The filtering logic appropriately excludes "format_sensitivity", "memory", and "web_search" patterns from the BFCL v4 data processing, which aligns with the specialized handling of these categories introduced elsewhere in the PR.

133-133: Default branch “main” contains BFCL v4; removing -b v1.3 is safe.

nemo_skills/dataset/bfcl_v3/bfcl_score.py (3)

17-19: LGTM!

The updated SIMPLE_AST list entries align with the split renaming across BFCL v3 modules.

136-136: LGTM!

Renaming the key to "non_live_irrelevance" improves clarity and consistency with other output keys in the function.

183-189: LGTM!

Nesting the results under a "bfcl_v3" top-level key clearly differentiates BFCL v3 scores from BFCL v4 scores in the overall output structure.

nemo_skills/evaluation/evaluator/bfcl.py (4)

31-31: Verify the new default model configuration.

The default model was updated from "o3-mini-2025-01-31-FC" to "o4-mini-2025-04-16-FC". Ensure this model exists and is compatible with the BFCL evaluation pipeline.

46-46: LGTM!

Adding support for both "bfcl_v3." and "bfcl_v4." prefixes ensures backward compatibility while enabling BFCL v4 evaluation.

82-88: LGTM!

The updated file naming to use "BFCL_v4" and the addition of memory prerequisite filtering align with the BFCL v4 evaluation flow. The filtering logic correctly excludes memory prereq samples that are only needed during inference but not for evaluation scoring.

24-24: Ensure bfcl_eval.utils exports the imported utilities
Confirm that get_directory_structure_by_category and is_memory_prereq are present in the installed bfcl_eval package (and update the dependency version or import paths if not).

nemo_skills/inference/eval/bfcl_utils.py (4)

37-37: LGTM!

The Path import supports the new Path conversion logic added later in the file.

49-66: LGTM!

The introduction of BACKEND_PATH_PREFIX and the updated CLASS_FILE_PATH_MAPPING with f-strings provides a cleaner, more maintainable approach to constructing module paths. The addition of WebSearchAPI and MemoryAPI variants aligns with the BFCL v4 support being added.

117-118: LGTM!

The instance naming changes improve robustness:

Preserving the original class_name case maintains consistency

The regex substitution re.sub(r'[-./]', '_', instance_name) ensures the instance name is a valid Python identifier by replacing potentially problematic characters

125-129: LGTM!

The deep copy of class_initial_config prevents unintended mutations of the original configuration dictionary. Converting model_result_dir to a Path object before passing to _load_scenario ensures type consistency.

nemo_skills/dataset/bfcl_v4/__init__.py (1)

15-48: BFCL v4 dataset metadata looks good.

SPLITS, group flags, and SCORE_MODULE align with scoring code.

nemo_skills/dataset/bfcl_v4/prepare.py (1)

1-1: Address pre-commit formatting failures.

The pipeline indicates that pre-commit hooks will apply automatic adjustments for import ordering, formatting, and path construction. Ensure you run the pre-commit hooks locally or allow the automated fixes to be applied before merging.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

nemo_skills/evaluation/evaluator/bfcl.py (2)

63-67: Avoid shell=True; pass args and env explicitly.

Prevents shell injection via model/test_category and makes OPENAI_API_KEY scoping explicit.

Apply:

-            cmd = f"OPENAI_API_KEY=dummy bfcl evaluate --model {eval_config.model} --test-category {test_category}"
-
-            LOG.info(f"Running BFCL evaluation: {cmd}")
-            subprocess.run(cmd, shell=True, check=True, timeout=eval_config.timeout)
+            LOG.info("Running BFCL evaluation")
+            args = ["bfcl", "evaluate", "--model", eval_config.model, "--test-category", test_category]
+            env = {**os.environ, "OPENAI_API_KEY": "dummy"}
+            subprocess.run(args, check=True, timeout=eval_config.timeout, env=env)

Also add:

import os

near the imports at the top.

122-133: Merging with zip() drops lines and can misalign IDs; also marks unevaluated items as correct.

Because memory prereqs are filtered out upstream, bfcl_fmted_file has fewer rows than generation_file. zip() truncates and mislabels is_correct. Merge by id; set is_correct only for evaluated ids.

-        for gen_line, bfcl_line in zip(gen_f, bfcl_f):
-            gen_instance = json.loads(gen_line)
-            # Add the bfcl result to the generation instance
-            gen_instance.update(json.loads(bfcl_line))
-
-            if gen_instance["id"] in wrong_instance_ids:
-                gen_instance["is_correct"] = False
-            else:
-                gen_instance["is_correct"] = True
-
-            fout.write(json.dumps(gen_instance) + "\n")
+        # Index BFCL-evaluated entries by id
+        bfcl_by_id = {}
+        for bfcl_line in bfcl_f:
+            obj = json.loads(bfcl_line)
+            bfcl_by_id[obj.get("id")] = obj
+
+        for gen_line in gen_f:
+            gen_instance = json.loads(gen_line)
+            _id = gen_instance.get("id")
+
+            if _id in bfcl_by_id:
+                # Merge back evaluated fields
+                gen_instance.update(bfcl_by_id[_id])
+                gen_instance["is_correct"] = _id not in wrong_instance_ids
+            else:
+                # Not evaluated (e.g., memory prereq); preserve record and omit correctness
+                gen_instance.pop("is_correct", None)
+
+            fout.write(json.dumps(gen_instance) + "\n")

nemo_skills/inference/eval/bfcl.py (1)

202-219: Normalize client parser outputs; never return bare strings; narrow exceptions.

generation can be a bare string, causing .extend() to split into characters. Also catches broad Exception.

-        try:
-            generation = [
-                {func_call["name"]: json.dumps(func_call["arguments"])} for func_call in model_response["tool_calls"]
-            ]
-            tool_call_ids = [idx for idx in range(len(generation))]
-        except Exception:
-            generation = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
-            tool_call_ids = []
+        try:
+            tool_calls = model_response.get("tool_calls") or []
+            if tool_calls:
+                generation = [
+                    {tc["name"]: json.dumps(tc["arguments"])} for tc in tool_calls if isinstance(tc, dict)
+                ]
+                tool_call_ids = [str(i) for i in range(len(generation))]
+            else:
+                content = parsed_response.get("content", "")
+                generation = [content] if isinstance(content, str) else [""]
+                tool_call_ids = []
+        except (KeyError, TypeError, ValueError):
+            content = parsed_response.get("content", "")
+            generation = [content] if isinstance(content, str) else [""]
+            tool_call_ids = []

♻️ Duplicate comments (3)

nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)

95-100: Fix optional handling to avoid KeyError and return a neutral metric when missing.

Current code sets an empty dict then unconditionally indexes metrics[…], causing KeyError and later lookup failures.

-def get_accuracy_dict(metrics, category, optional=False):
+def get_accuracy_dict(metrics, category, optional=False):
     # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
-    if optional and f"bfcl_v4.{category}" not in metrics:
-        category_dict = {}
-    category_dict = metrics[f"bfcl_v4.{category}"]
+    key = f"bfcl_v4.{category}"
+    if key not in metrics:
+        if optional:
+            return {"accuracy": 0, "num_entries": 0}
+        raise KeyError(f"Missing metrics for category: {key}")
+    category_dict = metrics[key]

nemo_skills/inference/eval/bfcl.py (2)

442-451: Fix remove_thinking gating; key off tool-call absence, not equality.

-                if self.cfg.remove_thinking:
-                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
+                if self.cfg.remove_thinking:
                     trimmed_response_text = self._remove_thinking_from_message_content(
                         self.message_parser.get_response_text(model_response["message"])
                     )
-                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
-                    if model_response["message"].content == model_response["generation"]:
-                        model_response["generation"] = [trimmed_response_text]
+                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
+                    if not model_response.get("tool_calls"):
+                        model_response["generation"] = [trimmed_response_text]
 
                     self.message_parser.set_response_text(model_response["message"], trimmed_response_text)

This avoids cross-parser type comparisons and ensures generation is a list. Based on learnings

248-259: Normalize server parser outputs; avoid string-vs-list shape bugs and narrow except.

-        try:
-            tool_calls = output_dict["message"].tool_calls
-            generation = [{func_call.function.name: func_call.function.arguments} for func_call in tool_calls]
-            tool_call_ids = [func_call.id for func_call in tool_calls]
-        except Exception:
-            tool_calls = []
-            generation = output_dict["message"].content
-            tool_call_ids = []
-
-        # Use model output if not a tool call
-        output_dict["generation"] = generation if generation else [output_dict["message"].content]
+        try:
+            tool_calls = output_dict["message"].tool_calls or []
+            if tool_calls:
+                generation = [{fc.function.name: fc.function.arguments} for fc in tool_calls]
+                tool_call_ids = [fc.id for fc in tool_calls]
+            else:
+                generation = [output_dict["message"].content]
+                tool_call_ids = []
+        except (AttributeError, KeyError, TypeError):
+            tool_calls = []
+            generation = [getattr(output_dict["message"], "content", "")]
+            tool_call_ids = []
+
+        # Always lists
+        output_dict["generation"] = generation

Ruff BLE001 suggests avoiding broad Exception; the diff applies that. Based on learnings

🧹 Nitpick comments (10)

nemo_skills/evaluation/evaluator/bfcl.py (2)

51-55: Hardcoded BFCL repo path; make it configurable.

Bakes in /opt/gorilla/... which breaks non-standard deployments. Move base path into BFCLEvaluatorConfig (or cfg) with a sane default.

Would introducing eval_config.bfcl_repo_root (defaulting to /opt/gorilla/berkeley-function-call-leaderboard) work for your runners?

45-47: Minor: update comment to v4.

The comment still references bfcl_v3 structure; this code supports both prefixes.

nemo_skills/dataset/bfcl_v4/bfcl_score.py (2)

121-130: TRY003: shorten or centralize long exception message.

Either compress the text or define a small custom exception with the message in the class docstring.

58-64: Unused constant FORMAT_SENSITIVITY.

If not reported in v4 output, remove; otherwise, integrate it into agentic or separate reporting.

Is format_sensitivity still a tracked metric in your pipeline?
nemo_skills/inference/eval/bfcl.py (6)
52-57: Avoid hard dependency on bfcl_eval; make imports lazy/TYPE_CHECKING.

Top-level imports will break module import for users without bfcl_eval (earlier code intentionally delayed such imports). Move to local/lazy imports and keep only type hints under TYPE_CHECKING.
-from bfcl_eval.utils import is_memory_prereq, is_memory
-from bfcl_eval.model_handler.utils import add_memory_instruction_system_prompt
-from bfcl_eval.eval_checker.multi_turn_eval.func_source_code.memory_api_metaclass import (
-    MemoryAPI,
-)
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from bfcl_eval.eval_checker.multi_turn_eval.func_source_code.memory_api_metaclass import MemoryAPI
Additionally, import where used:

Inside load_data():
+        from bfcl_eval.utils import is_memory_prereq  # local import to keep bfcl_eval optional
Inside _generate_single_data_point_multi_turn() before the memory block:
+        from bfcl_eval.utils import is_memory
+        from bfcl_eval.model_handler.utils import add_memory_instruction_system_prompt
288-307: Make path handling robust and simplify key access; optional: event-loop guard.

Use Path().parent instead of string replace.

Use next(iter(...)) instead of list(...)[0] (RUF015).

Confirm load_data will never run inside an active event loop before calling asyncio.run.
-        # First, fix the target paths to point to the actual target paths for memory stores
-        for datapoint in data:
-            if "initial_config" in datapoint and list(datapoint["initial_config"].keys())[0].startswith("MemoryAPI"):
-                datapoint["initial_config"][list(datapoint["initial_config"].keys())[0]]["model_result_dir"] = self.cfg.output_file.replace("/output.jsonl", "")
+        # First, fix the target paths to point to the actual target paths for memory stores
+        from pathlib import Path
+        for datapoint in data:
+            if "initial_config" in datapoint:
+                key = next(iter(datapoint["initial_config"]))
+                if key.startswith("MemoryAPI"):
+                    datapoint["initial_config"][key]["model_result_dir"] = str(Path(self.cfg.output_file).parent)
Please confirm this method is not invoked under a running event loop (e.g., Jupyter) since it calls asyncio.run.

386-408: Harden memory-instance setup; avoid assert in prod and use RUF015 pattern.

Replace assert with explicit check and error.

Use next(iter(...)) for instance extraction.

Optionally define memory_instance in outer scope for later flush.
-        if is_memory(test_category):
+        memory_instance = None
+        if is_memory(test_category):
             # Execute no function call, but just to get a reference to all the instances to get the initial state for logging purpose
             _, involved_instances = execute_multi_turn_func_call(
                 [],
                 initial_config,
                 involved_classes,
                 test_entry_id=test_entry_id,
                 long_context=("long_context" in test_category or "composite" in test_category),
             )
-
-            assert (
-                len(involved_instances) == 1
-            ), "Memory category should only involve one class."
-
-            memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
+            if len(involved_instances) != 1:
+                raise ValueError("Memory category should only involve one class.")
+            memory_instance = next(iter(involved_instances.values()))
             data_point["question"] = add_memory_instruction_system_prompt(
                 data_point["question"],
                 test_category,
                 data_point["scenario"],
                 memory_instance,
             )
457-459: Defensive: ensure generation is a list before extend.

Protect against accidental bare strings.
-                current_turn_response.extend(model_response["generation"])
+                gens = model_response["generation"]
+                if not isinstance(gens, list):
+                    gens = [gens]
+                current_turn_response.extend(gens)
464-469: Narrow except when decoding tool calls; avoid BLE001; keep log level low.

Catch only expected parse errors from convert_to_function_call/json.
-                except Exception:
-                    LOG.info("No tools to execute in this turn. Proceed to next turn.")
+                except (json.JSONDecodeError, KeyError, TypeError, ValueError):
+                    LOG.debug("No tools to execute in this turn. Proceed to next turn.")
                     break
483-487: Ensure tool_call_id is a string for downstream compatibility.

Tokenizer/chat templates typically expect string IDs.
-                    tool_message = {
+                    tool_message = {
                         "role": "tool",
                         "content": execution_result,
-                        "tool_call_id": tool_call_id,
+                        "tool_call_id": str(tool_call_id),
                     }

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f5b3f8 and 21f16f7.

📒 Files selected for processing (4)

dockerfiles/Dockerfile.nemo-skills (1 hunks)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1 hunks)
nemo_skills/evaluation/evaluator/bfcl.py (3 hunks)
nemo_skills/inference/eval/bfcl.py (7 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

dockerfiles/Dockerfile.nemo-skills

🧰 Additional context used

🧬 Code graph analysis (3)

nemo_skills/inference/eval/bfcl.py (2)

nemo_skills/inference/generate.py (3)

setup_prompt (368-381)

load_data (427-444)

process_single_datapoint (546-571)

nemo_skills/inference/eval/bfcl_utils.py (3)

execute_multi_turn_func_call (94-181)

convert_to_function_call (80-91)

is_empty_execute_response (184-189)

nemo_skills/evaluation/evaluator/bfcl.py (1)

nemo_skills/utils.py (2)

get_logger_name (131-135)

nested_dataclass (50-83)

nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)

nemo_skills/dataset/bfcl_v3/bfcl_score.py (6)

calculate_combined_accuracy (52-74)

get_accuracy_dict (77-112)

calculate_non_live_single_turn_accuracy (115-137)

calculate_live_single_turn_accuracy (140-156)

calculate_multi_turn_accuracy (159-165)

compute_score (168-190)

🪛 Ruff (0.14.1)

nemo_skills/inference/eval/bfcl.py

252-252: Do not catch blind exception: Exception

(BLE001)

294-294: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice

Replace with next(iter(datapoint["initial_config"].keys()))

(RUF015)

295-295: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice

Replace with next(iter(datapoint["initial_config"].keys()))

(RUF015)

400-400: Prefer next(iter(involved_instances.values())) over single element slice

Replace with next(iter(involved_instances.values()))

(RUF015)

466-466: Do not catch blind exception: Exception

(BLE001)

511-511: Prefer next(iter(involved_instances.values())) over single element slice

Replace with next(iter(involved_instances.values()))

(RUF015)

nemo_skills/dataset/bfcl_v4/bfcl_score.py

126-130: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: pre-commit
GitHub Check: unit-tests

🔇 Additional comments (2)

nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)

206-213: Good: overall uses numeric accuracies.
This fixes the prior bug of adding dicts to floats.

Please confirm the weights (0.4/0.3/0.1/0.1/0.1) match the latest BFCL v4 spec for your run.

nemo_skills/inference/eval/bfcl.py (1)

20-20: OK to add asyncio import.

No issues.

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_skills/dataset/bfcl_v3/prepare.py (1)
168-169: Fix Path type passed to process_file to avoid TypeError.

process_multi_turn_test_case uses Path arithmetic ("/") with repo_root_dir; passing a str will fail. Pass a Path.
-                process_file(temp_dir, input_file, output_file, model_type=model_type)
+                process_file(Path(temp_dir), input_file, output_file, model_type=model_type)

♻️ Duplicate comments (1)

nemo_skills/dataset/bfcl_v4/utils.py (1)

15-21: Avoid duplication with bfcl_v3.utils where possible.

If v4 logic mirrors v3, consider reusing/importing shared helpers to reduce drift.

🧹 Nitpick comments (6)

nemo_skills/dataset/bfcl_v3/prepare.py (2)

160-160: Use removesuffix() or Path.stem for extension handling (safer than replace).

replace(".json","") can misfire if ".json" appears elsewhere. Prefer removesuffix (Py3.9+) or Path.stem.

-                split_dirname = os.path.join(output_dir, filename.removeprefix("BFCL_v4_").replace(".json", ""))
+                name = filename.removeprefix("BFCL_v4_").removesuffix(".json")
+                split_dirname = os.path.join(output_dir, name)

Alternatively:

-                split_dirname = os.path.join(output_dir, filename.removeprefix("BFCL_v4_").replace(".json", ""))
+                name = Path(filename).stem
+                name = name.removeprefix("BFCL_v4_")
+                split_dirname = os.path.join(output_dir, name)

176-179: Log full traceback on git failures.

Use logging.exception for context.

-        except subprocess.CalledProcessError as e:
-            LOG.error(f"Git command failed: {e}")
-            LOG.error("Make sure git is installed and the repository URL is correct")
+        except subprocess.CalledProcessError:
+            LOG.exception("Git command failed")
+            LOG.error("Make sure git is installed and the repository URL is correct")

nemo_skills/dataset/bfcl_v4/utils.py (1)

107-118: Guard against unknown nested item types in _cast_to_openai_type.

Direct dict indexing may KeyError for unseen types; fall back to "string".

-            elif "items" in properties[key]:
-                properties[key]["items"]["type"] = mapping[properties[key]["items"]["type"]]
+            elif "items" in properties[key]:
+                item_type = properties[key]["items"].get("type", "string")
+                properties[key]["items"]["type"] = mapping.get(item_type, "string")
                 if properties[key]["items"]["type"] == "array" and "items" in properties[key]["items"]:
-                    properties[key]["items"]["items"]["type"] = mapping[properties[key]["items"]["items"]["type"]]
+                    inner_type = properties[key]["items"]["items"].get("type", "string")
+                    properties[key]["items"]["items"]["type"] = mapping.get(inner_type, "string")
                 elif properties[key]["items"]["type"] == "object" and "properties" in properties[key]["items"]:
                     properties[key]["items"]["properties"] = _cast_to_openai_type(
                         properties[key]["items"]["properties"], mapping
                     )

nemo_skills/inference/eval/bfcl_web_search.py (2)

60-61: Use dict.get for config with a safe default.

Avoid KeyError when show_snippet is omitted.
-        self.show_snippet = initial_config["show_snippet"]
+        self.show_snippet = initial_config.get("show_snippet", True)
280-281: Narrow the exception in fetch_url_content and keep trace.

Catch requests exceptions, not bare Exception; include traceback via logging if available.
-        except Exception as e:
-            return {"error": f"An error occurred while fetching {url}: {str(e)}"}
+        except requests.RequestException as e:
+            return {"error": f"requests error while fetching {url}: {e!s}"}

nemo_skills/dataset/bfcl_v4/prepare.py (1)

195-197: Remove unnecessary f-string; keep traceback via exception().

Minor cleanup and better diagnostics.

-        except subprocess.CalledProcessError as e:
-            LOG.exception(f"Git command failed")
+        except subprocess.CalledProcessError:
+            LOG.exception("Git command failed")
             LOG.error("Make sure git is installed and the repository URL is correct")

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21f16f7 and 01c8c34.

📒 Files selected for processing (6)

nemo_skills/dataset/bfcl_v3/prepare.py (4 hunks)
nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
nemo_skills/dataset/bfcl_v4/constants.py (1 hunks)
nemo_skills/dataset/bfcl_v4/prepare.py (1 hunks)
nemo_skills/dataset/bfcl_v4/utils.py (1 hunks)
nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/dataset/bfcl_v4/constants.py

🧰 Additional context used

🧬 Code graph analysis (2)

nemo_skills/dataset/bfcl_v4/prepare.py (3)

nemo_skills/dataset/bfcl_v4/utils.py (2)

func_doc_language_specific_pre_processing (32-83)

convert_to_tool (121-133)

nemo_skills/utils.py (1)

get_logger_name (131-135)

nemo_skills/dataset/bfcl_v3/utils.py (2)

is_multi_turn (51-55)

load_file (39-48)

nemo_skills/dataset/bfcl_v3/prepare.py (1)

nemo_skills/dataset/bfcl_v4/prepare.py (1)

download_and_process_bfcl_data (140-197)

🪛 Ruff (0.14.1)

nemo_skills/dataset/bfcl_v4/prepare.py

154-154: subprocess call: check for execution of untrusted input

(S603)

155-155: Starting a process with a partial executable path

(S607)

163-165: Avoid specifying long messages outside the exception class

(TRY003)

195-195: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

196-196: f-string without any placeholders

Remove extraneous f prefix

(F541)

197-197: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

nemo_skills/dataset/bfcl_v3/prepare.py

133-133: Starting a process with a partial executable path

(S607)

nemo_skills/inference/eval/bfcl_web_search.py

53-53: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

55-55: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

57-57: Unused method argument: long_context

(ARG002)

155-155: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

165-165: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

178-178: Use explicit conversion flag

Replace with conversion flag

(RUF010)

184-184: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)

225-225: Avoid specifying long messages outside the exception class

(TRY003)

278-278: Abstract raise to an inner function

(TRY301)

278-278: Avoid specifying long messages outside the exception class

(TRY003)

280-280: Do not catch blind exception: Exception

(BLE001)

281-281: Use explicit conversion flag

Replace with conversion flag

(RUF010)

285-285: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: pre-commit
GitHub Check: unit-tests

🔇 Additional comments (3)

nemo_skills/dataset/bfcl_v3/prepare.py (1)

117-117: Confirm intent: modifying bfcl_v3 prepare to process v4 files.

Defaulting file_prefix to "BFCL_v4" inside bfcl_v3/prepare.py blurs version boundaries. If unintentional, revert to v3 defaults and move v4 logic to bfcl_v4. If intentional, please confirm downstream callers expect v4 outputs here.

nemo_skills/dataset/bfcl_v4/__init__.py (1)

15-48: BFCLv4 dataset manifest looks good; ensure split parity with evaluator.

Constants and BENCHMARKS map are fine. Please verify SPLITS matches bfcl_eval.constants.category_mapping.ALL_SCORING_CATEGORIES to avoid drift with prepare/evaluator.

nemo_skills/dataset/bfcl_v4/prepare.py (1)

154-156: Review is based on incorrect assumption; bfcl_v3 does not implement the suggested pattern.

The review claims the fix follows "same rationale as bfcl_v3," but inspection of nemo_skills/dataset/bfcl_v3/prepare.py (line 133) shows it contains only git clone --depth=1 without any subsequent git checkout. The suggested pattern does not exist as a reference implementation. Additionally, the proposed diff uses PLACEHOLDER_SHA rather than an actual commit identifier, and there is no documented pinning strategy or specific commit/tag identified in the codebase for either dataset version.

Likely an incorrect or invalid review comment.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

nemo_skills/inference/eval/bfcl_web_search.py (1)

152-209: CRITICAL: Past review comments on infinite retry loop and generator handling were NOT addressed.

The issues flagged in the previous review are still present:

Infinite loop (line 152): while True: with no max retry limit can hang indefinitely under persistent rate-limiting.
Generator handling (line 155): DDGS().text() returns a generator, not a list.
Truthiness check fails (line 185): if not search_results: will not work correctly on a generator (generators are always truthy until exhausted).
Slicing fails (line 192): search_results[:max_results] will raise TypeError because generators don't support slicing.

These are runtime blockers that will cause failures in production.

The previous review provided a comprehensive fix. Please apply it:

-        backoff = 10  # initial back-off in seconds
-
-        # Infinite retry loop with exponential backoff
-        while True:
+        backoff = 10  # initial back-off in seconds
+        max_retries = 6
+        attempts = 0
+        
+        while attempts < max_retries:
             try:
                 wait_time = backoff + random.uniform(0, backoff)
-                search_results = DDGS(timeout=60).text(
+                search_results = list(DDGS(timeout=60).text(
                     query=keywords,
                     region=region,
                     max_results=max_results,
                     backend="duckduckgo"
-                )
+                ))
 
             except DDGSException as e:
                 if "No results found" in str(e):
+                    attempts += 1
                     wait_time = backoff + random.uniform(0, backoff)
                     error_block = (
                         "*" * 100
-                        + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests. This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds…"
+                        + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests ({attempts}/{max_retries}). This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds..."
                         + "*" * 100
                     )
                     print(error_block)
                     time.sleep(wait_time)
                     backoff = min(backoff * 2, 120)  # cap the back-off
                     continue
                 else:
                     error_block = (
                         "*" * 100
                         + f"\n❗️❗️ [WebSearchAPI] Error from DuckDuckGo: {str(e)}. This is not a rate-limit error, so it will not be retried."
                         + "*" * 100
                     )
                     print(error_block)
                 return {"error": str(e)}
 
-            break  # Success – no rate-limit error detected
+            break  # Success - no rate-limit error detected
+        
+        if attempts >= max_retries:
+            return {"error": "Exceeded max retries while querying DuckDuckGo due to rate limits."}
 
         if not search_results:
             return {
                 "error": "Failed to retrieve the search results from server. Please try again later."
             }
 
         # Convert the search results to the desired format
         results = []
         for result in search_results[:max_results]:
             if self.show_snippet:
                 results.append(
                     {
                         "title": result["title"],
                         "href": result["href"],
                         "body": result["body"],
                     }
                 )
             else:
                 results.append(
                     {
                         "title": result["title"],
                         "href": result["href"],
                     }
                 )
 
         return results

🧹 Nitpick comments (2)

requirements/main.txt (1)

40-40: Verify CPU-only FAISS variant is intentional.

The addition of faiss-cpu suggests CPU-only deployment. If GPU acceleration is available or expected in production, faiss-gpu or the base faiss package may be more appropriate. Confirm this choice aligns with deployment targets.

Please clarify: Is the CPU-only variant intentional, or should this be faiss (which auto-selects based on environment)?

nemo_skills/inference/eval/bfcl_web_search.py (1)

228-249: Consider more robust browser emulation for production use.

The custom headers help avoid 403 Forbidden errors, but some sites may still detect this as a bot. For more reliable fetching, consider:

Option 1: Use requests-html which includes a browser-like session with better default headers.

Option 2: For sites with heavy JavaScript, consider playwright or selenium for full browser automation.

This is noted by the TODO comment on line 228. The current approach is acceptable for initial implementation, but may need enhancement based on real-world evaluation results.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01c8c34 and 70805d2.

📒 Files selected for processing (2)

nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)
requirements/main.txt (2 hunks)

🧰 Additional context used

🪛 Ruff (0.14.1)

nemo_skills/inference/eval/bfcl_web_search.py

52-52: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

54-54: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

56-56: Unused method argument: long_context

(ARG002)

154-154: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

164-164: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

177-177: Use explicit conversion flag

Replace with conversion flag

(RUF010)

183-183: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)

224-224: Avoid specifying long messages outside the exception class

(TRY003)

277-277: Abstract raise to an inner function

(TRY301)

277-277: Avoid specifying long messages outside the exception class

(TRY003)

279-279: Do not catch blind exception: Exception

(BLE001)

280-280: Use explicit conversion flag

Replace with conversion flag

(RUF010)

284-284: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

🔇 Additional comments (5)

requirements/main.txt (2)

16-16: Good practice: explanatory comments for new dependencies.

The inline comments ("Needed for BFCLv4") are helpful for maintainability and context. This aligns well with other similar comments in the file (e.g., "Requires patching…" on line 31).

Also applies to: 38-40

16-16: Remove unused faiss-cpu dependency; version pinning concern is invalid.

The original review comment's version pinning concern misrepresents project practice—the codebase intentionally leaves most dependencies unpinned, pinning only those with known issues (e.g., litellm==1.77.5). The new packages follow this existing pattern correctly.

However, verification reveals a critical issue: faiss-cpu is added to requirements but never imported or used anywhere in the codebase. Only ddgs and sentence_transformers are actively used for BFCLv4. Either remove faiss-cpu or confirm if supporting code is missing.

Likely an incorrect or invalid review comment.

nemo_skills/inference/eval/bfcl_web_search.py (3)

46-54: Random instances are appropriate for non-cryptographic error simulation.

The S311 static analysis warnings about random.Random are false positives here. These instances with fixed seeds are explicitly used to simulate deterministic errors for testing purposes, not for any security-sensitive operations.

279-280: Broad exception handling is appropriate for this utility function.

Catching Exception here is reasonable for a general-purpose URL fetching utility that needs to gracefully handle network errors, timeouts, parsing failures, etc., and return a consistent error dict format. The BLE001 static analysis warning can be safely ignored in this context.

282-298: LGTM: Error simulation helper is well-implemented.

The helper method correctly generates realistic error messages by parsing the URL and formatting templates with appropriate context. While currently unused (as noted in comments), this provides valuable testing infrastructure for future use.

gwarmstrong

@Glorf I have a couple questions about the current implementation.

gwarmstrong · 2025-11-24T19:00:25Z

        "parallel_multiple",
-        "java",
-        "javascript",
+        "simple_java",


@Glorf what is the reason for changing here and breaking backward compatibility?

The authors renamed the evals in the repo: ShishirPatil/gorilla@58f57e9
I originally tried to put bfclv3 aside from bfclv4, but as they both implement bfcl_eval package, I decided to keep bfclv3 as the subset of bfclv4 and use bfclv4 evaluation/logic (as it's technically the same, just need to drop agentic evals)

gwarmstrong · 2025-11-24T19:35:55Z

    timeout: int = 300


 def eval_bfcl(cfg):


Is there a way to evaluate an individual sample at a time? If so, it would definitely be good to move onto the new Evaluator class with an eval_single implementation: https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/evaluator/base.py#L34-L91

This would let samples be evaluated during the generation loop--potentially saving time if the evaluations are CPU intensive.

Otherwise, it would still be good to move onto that, but only implement eval_full.

I don't think we can - at least not at this control level - we offload all the eval logic to the bfcl_eval - the original repo

Signed-off-by: Michal Bien <mbien@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong

LGTM

Signed-off-by: Michal Bien <mbien@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: Michal Bien <mbien@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>

Signed-off-by: Michal Bien <mbien@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

Glorf requested a review from shtoshni October 8, 2025 08:56

coderabbitai Bot reviewed Oct 8, 2025

View reviewed changes

Comment thread nemo_skills/dataset/bfcl_v4/bfcl_score.py Outdated

Comment thread nemo_skills/dataset/bfcl_v4/prepare.py Outdated

Comment thread nemo_skills/dataset/bfcl_v4/prepare.py Outdated

coderabbitai Bot reviewed Oct 8, 2025

View reviewed changes

Comment thread nemo_skills/dataset/bfcl_v4/bfcl_score.py Outdated

shtoshni marked this pull request as draft October 9, 2025 12:05

Glorf force-pushed the mbien/bfcl_v4 branch 2 times, most recently from 45b6755 to 5f5b3f8 Compare October 15, 2025 12:07

Glorf changed the title ~~[DRAFT] BFCLv4 support~~ BFCLv4 support Oct 15, 2025

Glorf marked this pull request as ready for review October 15, 2025 12:11

coderabbitai Bot reviewed Oct 15, 2025

View reviewed changes

shtoshni reviewed Oct 15, 2025

View reviewed changes

Comment thread dockerfiles/Dockerfile.nemo-skills Outdated

shtoshni reviewed Oct 15, 2025

View reviewed changes

Comment thread nemo_skills/dataset/bfcl_v3/__init__.py

shtoshni reviewed Oct 15, 2025

View reviewed changes

Comment thread nemo_skills/dataset/bfcl_v3/constants.py

shtoshni reviewed Oct 15, 2025

View reviewed changes

Comment thread nemo_skills/dataset/bfcl_v3/bfcl_score.py

shtoshni reviewed Oct 15, 2025

View reviewed changes

Comment thread nemo_skills/dataset/bfcl_v3/prepare.py Outdated

shtoshni reviewed Oct 15, 2025

View reviewed changes

Comment thread nemo_skills/dataset/bfcl_v4/utils.py Outdated

coderabbitai Bot reviewed Oct 21, 2025

View reviewed changes

Comment thread nemo_skills/dataset/bfcl_v4/bfcl_score.py

Comment thread nemo_skills/evaluation/evaluator/bfcl.py

Comment thread nemo_skills/inference/eval/bfcl.py

coderabbitai Bot reviewed Oct 21, 2025

View reviewed changes

Comment thread requirements/main.txt Outdated

gwarmstrong self-requested a review November 24, 2025 18:53

gwarmstrong reviewed Nov 24, 2025

View reviewed changes

Comment thread requirements/main.txt Outdated

BFCLv4 squashed

f251858

Signed-off-by: Michal Bien <mbien@nvidia.com>

Glorf force-pushed the mbien/bfcl_v4 branch from 646db93 to f251858 Compare November 26, 2025 09:41

MAINT fix spacing

a1169ed

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong reviewed Nov 26, 2025

View reviewed changes

Comment thread dockerfiles/Dockerfile.nemo-skills Outdated

Apply suggestion from @gwarmstrong

316670c

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong approved these changes Nov 26, 2025

View reviewed changes

gwarmstrong enabled auto-merge (squash) November 26, 2025 17:15

gwarmstrong merged commit 43e0f2b into main Nov 26, 2025
5 checks passed

gwarmstrong deleted the mbien/bfcl_v4 branch November 26, 2025 17:25

coderabbitai Bot mentioned this pull request Feb 13, 2026

Adding an option to store benchmarks in external repo #1240

Merged

Conversation

Glorf commented Oct 8, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gwarmstrong left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gwarmstrong Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

Glorf Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gwarmstrong Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

Glorf Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Glorf commented Oct 8, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Oct 8, 2025 •

edited

Loading