fix: align ArgumentCorrectness prompt with tool calls by DivyamTalwar · Pull Request #2820 · confident-ai/deepeval

DivyamTalwar · 2026-06-29T18:22:20Z

Summary

ArgumentCorrectnessMetric evaluates tool call arguments, but its
generate_verdicts prompt still instructed the model to return verdicts for
statements. This is a separate prompt-wording fix; it does not address the
historical undefined-variable render failure reported in #2817. The PR updates
that instruction to refer to tool calls and adds a regression test for the
compiled template text. It also hardens the template compiler/workflow so this
CI guard does not read symlinked template sources or persist checkout
credentials; these are bundled because the new regression coverage runs through
the same template sync guardrail.

Changes

- Update the ArgumentCorrectness verdict-count instruction from statements to
  tool calls.
- Regenerate the Python and TypeScript metric template bundles with
  python3 scripts/compile_metric_templates.py.
- Add a template regression test that builds the template bundle and checks the
  ArgumentCorrectness instruction wording without installing the full project or
  rendering PR-controlled Jinja template text in CI.
- Reject symlinked metric template source files before the compiler reads them
  and symlinked generated output paths before the compiler writes them, with
  regression coverage using isolated temporary repo layouts.
- Add explicit read-only workflow permissions and disable persisted checkout
  credentials in the dedicated metric-template workflow.
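
The symlink rejection in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in scripts/compile_metric_templates.py; the helper names reject_symlink, read_template_source, and write_bundle are hypothetical.

```python
from pathlib import Path


def reject_symlink(path: Path, role: str) -> None:
    """Raise if `path` itself is a symbolic link (hypothetical helper)."""
    # Path.is_symlink() inspects the link itself and does not follow it,
    # so dangling links are rejected as well.
    if path.is_symlink():
        raise ValueError(f"refusing to use symlinked {role}: {path}")


def read_template_source(path: Path) -> str:
    """Read a template source file only after the symlink check passes."""
    reject_symlink(path, "template source")
    return path.read_text(encoding="utf-8")


def write_bundle(path: Path, content: str) -> None:
    """Write generated output only if the target path is not a symlink."""
    reject_symlink(path, "generated output")
    path.write_text(content, encoding="utf-8")
```

Note that this sketch only checks the final path component. It does not detect symlinked parent directories, and it does not close the window between the check and the read or write.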

Why this matters

The metric is supposed to judge tool argument correctness per tool call. Keeping
the prompt aligned with that contract avoids confusing judge-model instructions
and makes future template regressions easier to catch without API keys or model
calls.
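
A wording check of this kind needs no API keys or model calls because it can treat the template as a plain string and never render it. The sketch below shows the idea under stated assumptions: find_wording_problems is a hypothetical helper, and the sample template text and phrases are illustrative stand-ins, not the actual ArgumentCorrectness prompt.

```python
def find_wording_problems(template_text: str, required: str, forbidden: str) -> list[str]:
    """Check prompt wording by plain substring search (no template rendering)."""
    lowered = template_text.lower()
    problems = []
    if required.lower() not in lowered:
        problems.append(f"missing expected phrase: {required!r}")
    if forbidden.lower() in lowered:
        problems.append(f"found outdated phrase: {forbidden!r}")
    return problems


# Illustrative template text only; the real prompt wording may differ.
sample_template = (
    "The number of verdicts MUST equal the number of tool calls.\n"
    "Tools called: {{ stringified_tools_called }}\n"
)

problems = find_wording_problems(
    sample_template,
    required="number of tool calls",
    forbidden="number of statements",
)
print(problems)
```

Because the text is only searched and never rendered, untrusted template content is not executed by the check.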

Tests

Commands run:

- python3 -m pytest tests/test_templates/test_metric_templates.py -q
  (passed, 9 tests)
- python3 scripts/compile_metric_templates.py
  (passed and updated both committed template bundles)
- python3 -m py_compile scripts/compile_metric_templates.py tests/test_templates/test_metric_templates.py
  (passed)
- git diff --check
  (passed)

Not run:

- python3 -m black --check scripts/compile_metric_templates.py tests/test_templates/test_metric_templates.py
  (not run: black is not installed in the local Python environment)

Risk

Risk level: implementation low; score/baseline comparability moderate

Implementation risk: low. This is a small-surface prompt correction plus
regression coverage. It does not change metric scoring code, public APIs, or
template variables. It does make the template compiler reject symlinked template
sources/outputs; the current repository templates compile successfully after
that guard.

Score and baseline-comparability risk: moderate. The prompt wording can change
ArgumentCorrectnessMetric scores for some runs because the judge model now
sees a clearer per-tool-call verdict-count instruction. Teams that gate CI or
benchmark comparisons on this metric may want to compare scores before and after
the change on a representative eval set; pass/fail flips or material score drift
would be a good trigger for a release-note callout or broader validation before
release. Rollback is straightforward: revert the prompt text, compiled bundles,
compiler guard, workflow hardening, and tests.
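
One way to run the before/after comparison suggested above is to score the same eval set with both prompt versions and count pass/fail flips alongside mean drift. The sketch below is illustrative only: compare_score_runs is a hypothetical helper, the scores are assumed to be per-test-case floats in the same order, and the 0.5 pass threshold is an assumption that should be replaced with whatever threshold the team actually gates on.

```python
from statistics import mean


def compare_score_runs(before: list[float], after: list[float], threshold: float = 0.5) -> dict:
    """Summarize score drift between two runs over the same test cases."""
    if not before or len(before) != len(after):
        raise ValueError("runs must be non-empty and cover the same test cases")
    # A flip is a test case that passes in one run and fails in the other.
    flips = sum((b >= threshold) != (a >= threshold) for b, a in zip(before, after))
    return {
        "cases": len(before),
        "mean_drift": mean(after) - mean(before),
        "pass_fail_flips": flips,
    }


# Made-up scores for illustration; not measured results.
summary = compare_score_runs(before=[1.0, 0.4, 0.6], after=[1.0, 0.6, 0.6])
print(summary)
```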

Issue

Related to #2817. This does not attempt to resolve the original
undefined-variable render failure described there. It fixes a separate confirmed
wording bug discovered in the same ArgumentCorrectness prompt area.

Notes for maintainers

This intentionally does not rename the existing stringified_tools_called
template variable because the current code and compiled bundle already agree on
that variable. The PR only corrects the model-facing instruction.
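
Agreement between the template and the calling code can be checked without rendering by extracting placeholder names and comparing them with the variables the code supplies. This is a rough sketch under stated assumptions: missing_variables is a hypothetical helper, the sample text is illustrative, and the regular expression only recognizes simple Jinja-style placeholders such as {{ name }}, not filters, attribute access, or control blocks.

```python
import re

# Matches simple placeholders like "{{ name }}" only.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def missing_variables(template_text: str, supplied: set[str]) -> set[str]:
    """Return placeholder names used in the template but not supplied by the caller."""
    return set(PLACEHOLDER.findall(template_text)) - supplied


# Illustrative template text only; the real prompt may differ.
sample = "Tools called: {{ stringified_tools_called }}"
print(missing_variables(sample, {"stringified_tools_called"}))
print(missing_variables(sample, {"tools_called"}))
```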

CI note after opening:

- Metric Templates / test passed on this PR.
- Core Tests, Metrics Tests, Confident Tests, and integration jobs passed.
- Python Black is currently red because 47 existing files would be reformatted;
  the CI log reports the two PR-owned Python files as already well formatted.
- TypeScript Lint is currently red for existing Prettier drift in 66 TypeScript
  files.
- TypeScript Tests is currently red because the workflow runs secret-backed
  tests with empty OPENAI_API_KEY / CONFIDENT_API_KEY.
- Vercel is red due to fork deployment authorization.

ArgumentCorrectness evaluates tool call arguments, but its verdict prompt still described verdict count in terms of statements. That gives the model conflicting instructions for a tool-call metric.

Add a compiled-template regression test without installing the full project or rendering PR-controlled Jinja text in CI, and regenerate the Python and TypeScript template bundles from the source template.

Give the dedicated template workflow explicit read-only repository permissions and disable persisted checkout credentials.

Reject symlinked template sources and outputs so the template sync check cannot read or write runner-local files through repository symlinks.

Constraint: Keep the change to prompt wording, template regression coverage, and template guardrail hardening only.
Rejected: Rename stringified_tools_called | The current code and bundle already agree on that variable, and renaming it would broaden the PR beyond the wording bug.
Confidence: high
Scope-risk: moderate
Directive: Keep source `.txt` metric templates and both compiled bundles in sync after prompt edits.
Tested: python3 -m pytest tests/test_templates/test_metric_templates.py -q; python3 scripts/compile_metric_templates.py; python3 -m py_compile scripts/compile_metric_templates.py tests/test_templates/test_metric_templates.py; git diff --check
Not-tested: python3 -m black --check scripts/compile_metric_templates.py tests/test_templates/test_metric_templates.py | local Python environment does not have black installed.

vercel · 2026-06-29T18:22:24Z

@DivyamTalwar is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

shubhamofbce mentioned this pull request Jul 1, 2026

ArgumentCorrectnessMetric always fails to render: 'stringified_tools_called' is undefined #2817

Open

yawningphantom mentioned this pull request Jul 2, 2026

feat(evaluate): honor DisplayConfig.display_option to show only failing/passing cases #2829

Open

DivyamTalwar wants to merge 1 commit into confident-ai:main from DivyamTalwar:fix/argument-correctness-template-tools-called
