
chore: add tool call eval example #225

Merged
asamal4 merged 1 commit into lightspeed-core:main from asamal4:add-tool-eval-example on Apr 25, 2026

Conversation

@asamal4
Collaborator

asamal4 commented Apr 24, 2026

Description

chore: add tool call eval example

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Documentation
    • Added a new "Tool Call Evaluation" example with detailed documentation explaining how to execute tool evaluations using system and evaluation data YAML files.
    • Includes reference configuration and dataset demonstrating tool-call matching modes, regex patterns for dynamic values, result validation, and alternative expected sequences.
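
As a rough sketch of the dataset shape these bullets describe (all field names here, such as eval_id and eval_query, are assumptions inferred from the review comments later in this thread, not taken verbatim from the merged files):

```yaml
# Hypothetical eval_data.yaml entry. Per the review discussion,
# expected_tool_calls is a list of alternatives, each alternative a list
# of steps, and each step a list of tool-call dicts (list[list[dict]]
# per alternative).
- eval_id: pods_example
  eval_query: "List the pods in the test namespace"
  eval_metrics:
    - custom:tool_eval
  expected_tool_calls:
    - - - tool_name: kubectl_get     # alternative 1
          arguments:
            resource: pods
            namespace: "test-.*"     # regex pattern for a dynamic value
    - - - tool_name: oc_get          # alternative 2: either path passes
          arguments:
            resource: pods
```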

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Warning

Rate limit exceeded

@asamal4 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 36 minutes and 41 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 36 minutes and 41 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4171de2-8364-4ab2-9015-43d4a57c743c

📥 Commits

Reviewing files that changed from the base of the PR and between 7faf0f3 and 28775ce.

📒 Files selected for processing (3)
  • examples/02_metrics/tool_evaluation/README.md
  • examples/02_metrics/tool_evaluation/eval_data.yaml
  • examples/02_metrics/tool_evaluation/system.yaml

Walkthrough

Introduces a new "Tool Call Evaluation" example consisting of documentation, configuration, and evaluation dataset files. Defines the custom:tool_eval metric for validating tool invocations with support for regex pattern matching, multiple matching modes, alternative expected sequences, and optional result validation.

Changes

Cohort: Tool Evaluation Example
Files: examples/02_metrics/tool_evaluation/README.md, examples/02_metrics/tool_evaluation/eval_data.yaml, examples/02_metrics/tool_evaluation/system.yaml
Summary: Adds a new evaluation example consisting of three parts: documentation describing the custom:tool_eval metric (a binary tool-call matcher supporting ordered/full_match modes, regex patterns, alternative expected sequences, and optional result validation); a YAML evaluation dataset whose test cases cover correct invocation, regex-based argument matching, tool mismatch detection, output validation, and alternative paths; and a configuration file setting core/API/cache options, metric metadata, storage output format, visualization parameters, and logging levels.
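
A minimal sketch of the metric block such a system.yaml might contain (the enclosing metrics: key is an assumption; only the custom:tool_eval block and the 0.5 convention are grounded in the review discussion later in this thread):

```yaml
# Hypothetical system.yaml fragment; the top-level `metrics:` key is an
# assumption. The 0.5 threshold follows the binary-metric convention the
# reviewer suggests: scores are exactly 0.0 (fail) or 1.0 (pass).
metrics:
  custom:tool_eval:
    threshold: 0.5
    description: Binary validation of tool calls (exact/regex matching)
```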

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks: ✅ 5 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'chore: add tool call eval example' clearly and concisely describes the main change: adding a new tool call evaluation example with three supporting files (README.md, eval_data.yaml, and system.yaml).
  • Docstring Coverage: ✅ Passed. No functions found in the changed files; docstring coverage check skipped.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@asamal4
Collaborator Author

asamal4 commented Apr 25, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/02_metrics/tool_evaluation/system.yaml (1)

8-13: Consider threshold: 0.5 for the binary custom:tool_eval metric.

Tool eval emits exactly 0.0 or 1.0. A 0.5 threshold cleanly partitions fail vs. pass and is the convention used elsewhere in the project for this binary metric. threshold: 1 works only if the comparison is >=; if it ever changes to strict >, a passing 1.0 would be misclassified.

♻️ Proposed change
     custom:tool_eval:
-      threshold: 1
+      threshold: 0.5
       description: Binary validation of tool calls (exact/regex matching)

Based on learnings: "For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/02_metrics/tool_evaluation/system.yaml` around lines 8-13: change the binary metric configuration for custom:tool_eval by setting its threshold from 1 to 0.5 so that tool_eval scores of 0.0 are treated as fail and 1.0 as pass; update the threshold field under the custom:tool_eval block (replace threshold: 1 with threshold: 0.5) to match the project convention for binary metrics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/02_metrics/tool_evaluation/eval_data.yaml`:
- Around line 82-90: The example YAML currently nests the two tool approaches as
a single alternative sequence, causing evaluate_tool_calls to treat them as one
2-step expectation; fix by making each approach its own top-level alternative in
expected_tool_calls so the parser yields two alternatives (e.g., outer list
contains two entries, each a list of step-lists), ensuring evaluate_tool_calls
(src/lightspeed_evaluation/core/metrics/custom/tool_eval.py) can match either
[{kubectl_get}] OR [{oc_get}] when full_match/ordered are true.

In `@examples/02_metrics/tool_evaluation/README.md`:
- Around line 30-35: The expected_tool_calls YAML alternatives are missing one
nesting level; the matcher/validator expects each alternative as a
sequence-of-steps (list[list[dict]]), but the current examples provide
list[dict]; update the expected_tool_calls values so each alternative is wrapped
in an additional list (e.g., change entries under expected_tool_calls to
[[{...}]] instead of [{...}]) so the structure matches the list[list[dict]]
schema.
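
The two inline findings above can be illustrated with a small before/after sketch (the tool names and arguments are hypothetical; the list[list[dict]] alternative shape is taken from the review comments):

```yaml
# Flagged shape: one top-level alternative containing two steps, so the
# matcher expects kubectl_get AND THEN oc_get in sequence.
expected_tool_calls:
  - - - tool_name: kubectl_get
        arguments: {resource: pods}
    - - tool_name: oc_get
        arguments: {resource: pods}

# Suggested shape: two top-level alternatives, each a sequence of steps
# (list[list[dict]]), so either kubectl_get OR oc_get passes.
expected_tool_calls:
  - - - tool_name: kubectl_get
        arguments: {resource: pods}
  - - - tool_name: oc_get
        arguments: {resource: pods}
```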

---

Nitpick comments:
In `@examples/02_metrics/tool_evaluation/system.yaml`:
- Around line 8-13: Change the binary metric configuration for custom:tool_eval
by setting its threshold from 1 to 0.5 so that tool_eval scores of 0.0 are
treated as fail and 1.0 as pass; update the threshold field under the
custom:tool_eval block (replace threshold: 1 with threshold: 0.5) to match the
project convention for binary metrics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f22a458-7153-4a10-af5d-9e6d0127bf6c

📥 Commits

Reviewing files that changed from the base of the PR and between 36f1940 and 7faf0f3.

📒 Files selected for processing (3)
  • examples/02_metrics/tool_evaluation/README.md
  • examples/02_metrics/tool_evaluation/eval_data.yaml
  • examples/02_metrics/tool_evaluation/system.yaml

Comment thread examples/02_metrics/tool_evaluation/eval_data.yaml Outdated
Comment thread examples/02_metrics/tool_evaluation/README.md
asamal4 force-pushed the add-tool-eval-example branch from 7faf0f3 to 28775ce on Apr 25, 2026 01:18
@asamal4 asamal4 merged commit 9449c6c into lightspeed-core:main Apr 25, 2026
16 checks passed
