feat(metrics): add deterministic ToolPermissionMetric#2826
Open
gh-raju wants to merge 2 commits into
Open
Conversation
Checks that an agent only called tools it was authorized to, against an allowed_tools allowlist and/or denied_tools denylist (a denial always wins). Score is the fraction of authorized tool calls (1.0 when no tools were called). Unlike ToolCorrectnessMetric it evaluates authorization rather than task-correctness, and is fully deterministic (no LLM / API key), so it works as a CI gate. Adds deepeval/metrics/tool_permission/, exports ToolPermissionMetric from deepeval.metrics, and tests/test_metrics/test_tool_permission_metric.py (9 tests, no API key). black + ruff clean.
|
@gh-raju is attempting to deploy a commit to the Confident AI Team on Vercel. A member of the Team first needs to authorize it. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds
ToolPermissionMetric, a deterministic metric that checks whether an agent only called tools it was authorized to, given a permission policy:allowed_tools— an allowlist (least privilege); a called tool not in the list is unauthorized.denied_tools— an explicit denylist; a called tool in the list is unauthorized (a denial always wins over an allow).Score = fraction of authorized tool calls (
1.0when no tools were called). Requires no LLM / API key.Why
ToolCorrectnessMetriccompares called vs expected tools and explicitly does not check authorization. Least-privilege / tool-permission enforcement is a growing production requirement for agents — an out-of-scope tool call (delete_account,wire_transfer, …) is a safety failure regardless of task success. This makes that a deterministic, CI-gateable check.Relates to #2825.
Changes
deepeval/metrics/tool_permission/{__init__.py,tool_permission.py}— the metric (mirrorsToolCorrectnessMetricstructure and theBaseMetriccontract).deepeval/metrics/__init__.py— exportToolPermissionMetric.tests/test_metrics/test_tool_permission_metric.py— 9 tests: allowlist, denylist, denial-wins, partial credit + threshold, strict mode, no-tools, policy-required, and sync/async parity. Deterministic, so they run without any API key.Verification
pytest tests/test_metrics/test_tool_permission_metric.py→ 9 passed.black --line-length 80andruff check→ clean.Notes