fix: align ArgumentCorrectness prompt with tool calls #2820

Open
DivyamTalwar wants to merge 1 commit into confident-ai:main from DivyamTalwar:fix/argument-correctness-template-tools-called

DivyamTalwar commented Jun 29, 2026

Summary

ArgumentCorrectnessMetric evaluates tool call arguments, but its
generate_verdicts prompt still instructed the model to return verdicts for
statements. This is a separate prompt-wording fix; it does not address the
historical undefined-variable render failure reported in #2817. The PR updates
that instruction to refer to tool calls and adds a regression test for the
compiled template text. It also hardens the template compiler/workflow so this
CI guard does not read symlinked template sources or persist checkout
credentials; these are bundled because the new regression coverage runs through
the same template sync guardrail.

Changes

  • Update the ArgumentCorrectness verdict-count instruction from statements to
    tool calls.
  • Regenerate the Python and TypeScript metric template bundles with
    python3 scripts/compile_metric_templates.py.
  • Add a template regression test that builds the template bundle and checks the
    ArgumentCorrectness instruction wording without installing the full project or
    rendering PR-controlled Jinja template text in CI.
  • Reject symlinked metric template source files before the compiler reads them
    and symlinked generated output paths before the compiler writes them, with
    regression coverage using isolated temporary repo layouts.
  • Add explicit read-only workflow permissions and disable persisted checkout
    credentials in the dedicated metric-template workflow.
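The symlink rejection described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual compiler code: `reject_symlink`, `read_template_source`, and `write_generated_output` are hypothetical names, and the real guard in scripts/compile_metric_templates.py may be structured differently.

```python
import os
import tempfile
from pathlib import Path


def reject_symlink(path: Path, role: str) -> Path:
    """Raise ValueError if `path` is a symlink; otherwise return it unchanged.

    Hypothetical helper for illustration only.
    """
    # is_symlink() inspects the path itself without following the link,
    # so dangling symlinks are caught as well.
    if path.is_symlink():
        raise ValueError(f"refusing to use symlinked {role}: {path}")
    return path


def read_template_source(path: Path) -> str:
    """Read a template source only after the symlink check passes."""
    return reject_symlink(path, "template source").read_text(encoding="utf-8")


def write_generated_output(path: Path, content: str) -> None:
    """Write generated output only after the symlink check passes."""
    reject_symlink(path, "generated output").write_text(content, encoding="utf-8")


if __name__ == "__main__":
    # Isolated temporary layout, similar in spirit to the regression tests.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        real = root / "template.txt"
        real.write_text("prompt text", encoding="utf-8")
        link = root / "linked.txt"
        os.symlink(real, link)  # may need extra privileges on Windows

        print(read_template_source(real))
        try:
            read_template_source(link)
        except ValueError as exc:
            print(f"rejected: {exc}")
```

Note that `Path.is_symlink()` only checks the final path component. A guard that must also cover symlinked parent directories would need an additional check, for example comparing the resolved path against the expected repository root.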

Why this matters

The metric is supposed to judge tool argument correctness per tool call. Keeping
the prompt aligned with that contract avoids confusing judge-model instructions
and makes future template regressions easier to catch without API keys or model
calls.
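A wording check of this kind needs nothing beyond plain string matching, which is why it can run without API keys, model calls, or Jinja rendering. The sketch below is an assumption-laden illustration: `wording_problems` is a hypothetical helper, and both template strings are placeholders rather than the real ArgumentCorrectness prompt text. Only the `stringified_tools_called` variable name comes from this PR.

```python
def wording_problems(template_text: str, required: str, forbidden: str) -> list[str]:
    """Return a list of wording problems found in raw, unrendered template text."""
    problems = []
    lowered = template_text.lower()
    if required.lower() not in lowered:
        problems.append(f"missing required phrase: {required!r}")
    if forbidden.lower() in lowered:
        problems.append(f"found forbidden phrase: {forbidden!r}")
    return problems


# Placeholder strings, NOT the actual prompt wording from the repository.
# The Jinja placeholder is treated as inert text and is never rendered.
HYPOTHETICAL_FIXED = (
    "The number of verdicts MUST equal the number of tool calls.\n"
    "Tools called: {{ stringified_tools_called }}"
)
HYPOTHETICAL_OLD = (
    "The number of verdicts MUST equal the number of statements.\n"
    "Tools called: {{ stringified_tools_called }}"
)

if __name__ == "__main__":
    print(wording_problems(HYPOTHETICAL_FIXED, "number of tool calls", "number of statements"))
    print(wording_problems(HYPOTHETICAL_OLD, "number of tool calls", "number of statements"))
```

Checking the raw text instead of rendering it keeps PR-controlled template content from being executed by the template engine in CI.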

Tests

Commands run:

  • python3 -m pytest tests/test_templates/test_metric_templates.py -q — passed
    (9 tests)
  • python3 scripts/compile_metric_templates.py — passed and updated both
    committed template bundles
  • python3 -m py_compile scripts/compile_metric_templates.py tests/test_templates/test_metric_templates.py
    — passed
  • git diff --check — passed

Not run:

  • python3 -m black --check scripts/compile_metric_templates.py tests/test_templates/test_metric_templates.py
    — not run successfully because black is not installed in this local Python
    environment.

Risk

Risk level: implementation low; score/baseline comparability moderate

Implementation risk: low. This is a small-surface prompt correction plus
regression coverage. It does not change metric scoring code, public APIs, or
template variables. It does make the template compiler reject symlinked template
sources/outputs; the current repository templates compile successfully after
that guard.

Score and baseline-comparability risk: moderate. The prompt wording can change
ArgumentCorrectnessMetric scores for some runs because the judge model now
sees a clearer per-tool-call verdict-count instruction. Teams that gate CI or
benchmark comparisons on this metric may want to compare scores before and after
the change on a representative eval set; pass/fail flips or material score drift
would be a good trigger for a release-note callout or broader validation before
release. Rollback is straightforward: revert the prompt text, compiled bundles,
compiler guard, workflow hardening, and tests.
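One way to run the before/after comparison suggested above is sketched here. It assumes per-test-case scores have already been collected from two eval runs. `compare_runs`, the case ids, the scores, and the 0.5 threshold are all illustrative, and treating `score >= threshold` as a pass is an assumption about how a team gates on the metric.

```python
from statistics import mean


def compare_runs(
    before: dict[str, float], after: dict[str, float], threshold: float
) -> dict:
    """Summarize pass/fail flips and mean score drift between two eval runs.

    Only test cases present in both runs are compared.
    """
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("the two runs share no test cases")

    # A flip is a case whose pass/fail status differs between the runs.
    flips = [
        case
        for case in shared
        if (before[case] >= threshold) != (after[case] >= threshold)
    ]
    # Positive drift means scores rose after the prompt change.
    drift = mean(after[case] - before[case] for case in shared)
    return {"cases": len(shared), "flips": flips, "mean_drift": drift}


if __name__ == "__main__":
    # Made-up scores for illustration, not measured results.
    before = {"case_a": 0.8, "case_b": 0.4, "case_c": 0.5}
    after = {"case_a": 0.7, "case_b": 0.6, "case_c": 0.5}
    print(compare_runs(before, after, threshold=0.5))
```

In this made-up data only `case_b` crosses the threshold, so it is the single reported flip. A non-empty flip list or a large mean drift on a representative eval set would be the trigger for the release-note callout mentioned above.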

Issue

Related to #2817. This does not attempt to resolve the original
undefined-variable render failure described there. It fixes a separate confirmed
wording bug discovered in the same ArgumentCorrectness prompt area.

Notes for maintainers

This intentionally does not rename the existing stringified_tools_called
template variable because the current code and compiled bundle already agree on
that variable. The PR only corrects the model-facing instruction.

CI note after opening:

  • Metric Templates / test passed on this PR.
  • Core Tests, Metrics Tests, Confident Tests, and integration jobs passed.
  • Python Black is currently red because 47 existing files would be reformatted;
    the CI log reports the two PR-owned Python files as already well formatted.
  • TypeScript Lint is currently red for existing Prettier drift in 66 TypeScript
    files.
  • TypeScript Tests is currently red because the workflow runs secret-backed
    tests with empty OPENAI_API_KEY / CONFIDENT_API_KEY.
  • Vercel is red due to fork deployment authorization.

ArgumentCorrectness evaluates tool call arguments, but its verdict prompt still
described verdict count in terms of statements. That gives the model conflicting
instructions for a tool-call metric.

Add a compiled-template regression test without installing the full project or
rendering PR-controlled Jinja text in CI, and regenerate the Python and
TypeScript template bundles from the source template. Give the dedicated
template workflow explicit read-only repository permissions and disable
persisted checkout credentials. Reject symlinked template sources and outputs so
the template sync check cannot read or write runner-local files through
repository symlinks.

Constraint: Keep the change to prompt wording, template regression coverage, and template guardrail hardening only.
Rejected: Rename stringified_tools_called | The current code and bundle already agree on that variable, and renaming it would broaden the PR beyond the wording bug.
Confidence: high
Scope-risk: moderate
Directive: Keep source `.txt` metric templates and both compiled bundles in sync after prompt edits.
Tested: python3 -m pytest tests/test_templates/test_metric_templates.py -q; python3 scripts/compile_metric_templates.py; python3 -m py_compile scripts/compile_metric_templates.py tests/test_templates/test_metric_templates.py; git diff --check
Not-tested: python3 -m black --check scripts/compile_metric_templates.py tests/test_templates/test_metric_templates.py | local Python environment does not have black installed.

vercel Bot commented Jun 29, 2026

@DivyamTalwar is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.
