feat(metrics): add deterministic ToolPermissionMetric#2826

Open
gh-raju wants to merge 2 commits into confident-ai:main from gh-raju:feat/tool-permission-metric
Conversation

gh-raju commented Jul 1, 2026

What

Adds ToolPermissionMetric, a deterministic metric that checks whether an agent called only tools it was authorized to call, given a permission policy:

  • allowed_tools — an allowlist (least privilege); a called tool not in the list is unauthorized.
  • denied_tools — an explicit denylist; a called tool in the list is unauthorized (a denial always wins over an allow).

Score = fraction of authorized tool calls (1.0 when no tools were called). Requires no LLM / API key.
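
As a rough illustration of the scoring rule above, here is a minimal standalone sketch. It is not the code in this PR: `is_authorized` and `permission_score` are hypothetical helper names, and tool calls are represented as plain name strings rather than deepeval objects.

```python
from typing import Collection, Optional, Sequence


def is_authorized(
    tool_name: str,
    allowed_tools: Optional[Collection[str]] = None,
    denied_tools: Optional[Collection[str]] = None,
) -> bool:
    """Return True if a single tool call is permitted (hypothetical helper)."""
    # A denial always wins, even if the tool is also on the allowlist.
    if denied_tools is not None and tool_name in denied_tools:
        return False
    # With an allowlist, anything not listed is unauthorized (least privilege).
    if allowed_tools is not None:
        return tool_name in allowed_tools
    # Denylist only: anything not explicitly denied is permitted.
    return True


def permission_score(
    called_tools: Sequence[str],
    allowed_tools: Optional[Collection[str]] = None,
    denied_tools: Optional[Collection[str]] = None,
) -> float:
    """Fraction of authorized tool calls; 1.0 when no tools were called."""
    if not called_tools:
        return 1.0
    authorized = sum(
        is_authorized(name, allowed_tools, denied_tools)
        for name in called_tools
    )
    return authorized / len(called_tools)
```

For example, under these assumptions `permission_score(["search", "delete_account"], allowed_tools=["search"])` yields 0.5, since one of the two calls is outside the allowlist.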

Why

ToolCorrectnessMetric compares called vs expected tools and explicitly does not check authorization. Least-privilege / tool-permission enforcement is a growing production requirement for agents — an out-of-scope tool call (delete_account, wire_transfer, …) is a safety failure regardless of task success. This makes that a deterministic, CI-gateable check.

Relates to #2825.

Changes

  • deepeval/metrics/tool_permission/{__init__.py,tool_permission.py} — the metric (mirrors ToolCorrectnessMetric structure and the BaseMetric contract).
  • deepeval/metrics/__init__.py — export ToolPermissionMetric.
  • tests/test_metrics/test_tool_permission_metric.py — 9 tests: allowlist, denylist, denial-wins, partial credit + threshold, strict mode, no-tools, policy-required, and sync/async parity. Deterministic, so they run without any API key.
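
The threshold, strict-mode, and policy-required cases listed above could behave roughly as in the sketch below. This is an assumption-laden illustration, not the PR's implementation: `validate_policy` and `gate` are hypothetical names, the 0.5 default threshold is a guess, and strict mode is assumed to force the threshold to 1.0 and collapse any lower score to 0.

```python
from typing import Collection, Optional, Tuple


def validate_policy(
    allowed_tools: Optional[Collection[str]],
    denied_tools: Optional[Collection[str]],
) -> None:
    """Reject a metric configured without any policy (assumed behavior)."""
    if allowed_tools is None and denied_tools is None:
        raise ValueError("Provide allowed_tools, denied_tools, or both.")


def gate(
    score: float,
    threshold: float = 0.5,  # assumed default, not taken from the PR
    strict_mode: bool = False,
) -> Tuple[float, bool]:
    """Return (final_score, passed) for a raw authorization score."""
    if strict_mode:
        # Assumption: strict mode demands a perfect score and
        # reports anything less as 0.
        threshold = 1.0
        if score < threshold:
            score = 0.0
    return score, score >= threshold
```

With partial credit, a score of 0.75 fails a 0.8 threshold but keeps its value; in this sketch's strict mode the same score is reported as 0.0.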

Verification

  • pytest tests/test_metrics/test_tool_permission_metric.py → 9 passed.
  • black --line-length 80 and ruff check → clean.

Notes

  • No new dependencies. No breaking changes.
  • Deterministic (no model), so it is safe to run on every PR in CI.

Checks that an agent called only tools it was authorized to call, against an allowed_tools allowlist and/or denied_tools denylist (a denial always wins). Score is the fraction of authorized tool calls (1.0 when no tools were called). Unlike ToolCorrectnessMetric, it evaluates authorization rather than task correctness and is fully deterministic (no LLM / API key), so it works as a CI gate.

Adds deepeval/metrics/tool_permission/ and tests/test_metrics/test_tool_permission_metric.py (9 tests, no API key), and exports ToolPermissionMetric from deepeval.metrics. black + ruff clean.
vercel Bot commented Jul 1, 2026

@gh-raju is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.
