
chore: add tool call eval example #225

Merged
asamal4 merged 1 commit into lightspeed-core:main from asamal4:add-tool-eval-example on Apr 25, 2026

Conversation

@asamal4
Collaborator

asamal4 commented Apr 24, 2026

Description

chore: add tool call eval example

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Documentation
    • Added a new "Tool Call Evaluation" example with detailed documentation explaining how to execute tool evaluations using system and evaluation data YAML files.
    • Includes reference configuration and dataset demonstrating tool-call matching modes, regex patterns for dynamic values, result validation, and alternative expected sequences.
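
As a rough sketch of the dataset shape these bullets describe (all field names here, such as eval_id and eval_query, are assumptions inferred from the review comments later in this thread, not taken verbatim from the merged files):

```yaml
# Hypothetical eval_data.yaml entry. Per the review discussion,
# expected_tool_calls is a list of alternatives, each alternative a list
# of steps, and each step a list of tool-call dicts (list[list[dict]]
# per alternative).
- eval_id: pods_example
  eval_query: "List the pods in the test namespace"
  eval_metrics:
    - custom:tool_eval
  expected_tool_calls:
    - - - tool_name: kubectl_get     # alternative 1
          arguments:
            resource: pods
            namespace: "test-.*"     # regex pattern for a dynamic value
    - - - tool_name: oc_get          # alternative 2: either path passes
          arguments:
            resource: pods
```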

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Warning

Rate limit exceeded

@asamal4 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 36 minutes and 41 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 36 minutes and 41 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4171de2-8364-4ab2-9015-43d4a57c743c

📥 Commits

Reviewing files that changed from the base of the PR and between 7faf0f3 and 28775ce.

📒 Files selected for processing (3)
  • examples/02_metrics/tool_evaluation/README.md
  • examples/02_metrics/tool_evaluation/eval_data.yaml
  • examples/02_metrics/tool_evaluation/system.yaml

Walkthrough

Introduces a new "Tool Call Evaluation" example consisting of documentation, configuration, and evaluation dataset files. Defines the custom:tool_eval metric for validating tool invocations with support for regex pattern matching, multiple matching modes, alternative expected sequences, and optional result validation.

Changes

Cohort: Tool Evaluation Example
Files: examples/02_metrics/tool_evaluation/README.md, examples/02_metrics/tool_evaluation/eval_data.yaml, examples/02_metrics/tool_evaluation/system.yaml
Summary: Adds a new evaluation example consisting of three parts: documentation describing the custom:tool_eval metric (a binary tool-call matcher supporting ordered/full_match modes, regex patterns, alternative expected sequences, and optional result validation); a YAML evaluation dataset whose test cases cover correct invocation, regex-based argument matching, tool mismatch detection, output validation, and alternative paths; and a configuration file setting core/API/cache options, metric metadata, storage output format, visualization parameters, and logging levels.
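
A minimal sketch of the metric block such a system.yaml might contain (the enclosing metrics: key is an assumption; only the custom:tool_eval block and the 0.5 convention are grounded in the review discussion later in this thread):

```yaml
# Hypothetical system.yaml fragment; the top-level `metrics:` key is an
# assumption. The 0.5 threshold follows the binary-metric convention the
# reviewer suggests: scores are exactly 0.0 (fail) or 1.0 (pass).
metrics:
  custom:tool_eval:
    threshold: 0.5
    description: Binary validation of tool calls (exact/regex matching)
```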

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks: ✅ 5 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'chore: add tool call eval example' clearly and concisely describes the main change: adding a new tool call evaluation example with three supporting files (README.md, eval_data.yaml, and system.yaml).
  • Docstring Coverage: ✅ Passed. No functions found in the changed files; docstring coverage check skipped.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@asamal4
Collaborator Author

asamal4 commented Apr 25, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/02_metrics/tool_evaluation/system.yaml (1)

8-13: Consider threshold: 0.5 for the binary custom:tool_eval metric.

Tool eval emits exactly 0.0 or 1.0. A 0.5 threshold cleanly partitions fail vs. pass and is the convention used elsewhere in the project for this binary metric. threshold: 1 works only if the comparison is >=; if it ever changes to strict >, a passing 1.0 would be misclassified.

♻️ Proposed change
     custom:tool_eval:
-      threshold: 1
+      threshold: 0.5
       description: Binary validation of tool calls (exact/regex matching)

Based on learnings: "For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/02_metrics/tool_evaluation/system.yaml` around lines 8-13: change the binary metric configuration for custom:tool_eval by setting its threshold from 1 to 0.5 so that tool_eval scores of 0.0 are treated as fail and 1.0 as pass; update the threshold field under the custom:tool_eval block (replace threshold: 1 with threshold: 0.5) to match the project convention for binary metrics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/02_metrics/tool_evaluation/eval_data.yaml`:
- Around line 82-90: The example YAML currently nests the two tool approaches as
a single alternative sequence, causing evaluate_tool_calls to treat them as one
2-step expectation; fix by making each approach its own top-level alternative in
expected_tool_calls so the parser yields two alternatives (e.g., outer list
contains two entries, each a list of step-lists), ensuring evaluate_tool_calls
(src/lightspeed_evaluation/core/metrics/custom/tool_eval.py) can match either
[{kubectl_get}] OR [{oc_get}] when full_match/ordered are true.

In `@examples/02_metrics/tool_evaluation/README.md`:
- Around line 30-35: The expected_tool_calls YAML alternatives are missing one
nesting level; the matcher/validator expects each alternative as a
sequence-of-steps (list[list[dict]]), but the current examples provide
list[dict]; update the expected_tool_calls values so each alternative is wrapped
in an additional list (e.g., change entries under expected_tool_calls to
[[{...}]] instead of [{...}]) so the structure matches the list[list[dict]]
schema.
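
The two inline findings above can be illustrated with a small before/after sketch (the tool names and arguments are hypothetical; the list[list[dict]] alternative shape is taken from the review comments):

```yaml
# Flagged shape: one top-level alternative containing two steps, so the
# matcher expects kubectl_get AND THEN oc_get in sequence.
expected_tool_calls:
  - - - tool_name: kubectl_get
        arguments: {resource: pods}
    - - tool_name: oc_get
        arguments: {resource: pods}

# Suggested shape: two top-level alternatives, each a sequence of steps
# (list[list[dict]]), so either kubectl_get OR oc_get passes.
expected_tool_calls:
  - - - tool_name: kubectl_get
        arguments: {resource: pods}
  - - - tool_name: oc_get
        arguments: {resource: pods}
```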

---

Nitpick comments:
In `@examples/02_metrics/tool_evaluation/system.yaml`:
- Around line 8-13: Change the binary metric configuration for custom:tool_eval
by setting its threshold from 1 to 0.5 so that tool_eval scores of 0.0 are
treated as fail and 1.0 as pass; update the threshold field under the
custom:tool_eval block (replace threshold: 1 with threshold: 0.5) to match the
project convention for binary metrics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f22a458-7153-4a10-af5d-9e6d0127bf6c

📥 Commits

Reviewing files that changed from the base of the PR and between 36f1940 and 7faf0f3.

📒 Files selected for processing (3)
  • examples/02_metrics/tool_evaluation/README.md
  • examples/02_metrics/tool_evaluation/eval_data.yaml
  • examples/02_metrics/tool_evaluation/system.yaml

Comment thread examples/02_metrics/tool_evaluation/eval_data.yaml Outdated
Comment thread examples/02_metrics/tool_evaluation/README.md
asamal4 force-pushed the add-tool-eval-example branch from 7faf0f3 to 28775ce on Apr 25, 2026 01:18
@asamal4 asamal4 merged commit 9449c6c into lightspeed-core:main Apr 25, 2026
16 checks passed
