feat(evaluation): add generic deterministic metrics (WBP 2.3)#64

Merged
lzongren merged 1 commit into main from feature/generic-metrics on Jun 16, 2026

@lzongren

What & why

Implements WBP task 2.3 (generic metrics) for the Transform Gym evaluation framework. Before this, the only built-in scorers were assertion_pass_rate and llm_judge (plus credits/usage). This adds three generic, deterministic, zero-Bedrock metrics that resolve by name through MetricRegistry, alongside the existing two.

| Metric | What it scores | Per-test config (metadata) |
| --- | --- | --- |
| tool_usage | expected_tools were called and forbidden_tools were avoided | expected_tools, forbidden_tools |
| error_handling | transcript is free of a surfaced framework error (default marker ERROR:); measures error absence, not recovery quality | error_markers (optional override) |
| completeness | run reached its designed end (finished before exhausting max_turns) and final output is non-empty | completion_markers (optional override) |

All three score 0–10 and pass at ≥ 7.0, matching assertion_pass_rate so they combine cleanly in one run.
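
As a rough illustration of the tool_usage row and the shared 0–10 scale, the sketch below grades a list of tool calls against expected_tools and forbidden_tools. The function name score_tool_usage, the shape of the tool-call records, and the fraction-of-checks formula are assumptions made for illustration. The PR itself states only the metadata keys, the name-only lookup, the abstain behaviour, and the 7.0 pass threshold.

```python
from typing import Any

PASS_THRESHOLD = 7.0  # stated in the PR: metrics pass at >= 7.0


def _names(raw: Any) -> list[str]:
    """Keep only string entries of a list; anything else yields no names."""
    if not isinstance(raw, list):
        return []
    return [item for item in raw if isinstance(item, str)]


def score_tool_usage(
    tool_calls: list[dict[str, Any]], metadata: dict[str, Any]
) -> tuple[float, bool]:
    """Hypothetical scorer: one check per expected or forbidden tool."""
    expected = _names(metadata.get("expected_tools"))
    forbidden = _names(metadata.get("forbidden_tools"))

    # Abstain (10.0 / pass) when nothing is configured.
    if not expected and not forbidden:
        return 10.0, True

    # Only the "name" key of each call is consulted.
    called = {call.get("name") for call in tool_calls if isinstance(call, dict)}

    # Assumed formula: fraction of satisfied checks, scaled to 0-10.
    satisfied = sum(t in called for t in expected)
    satisfied += sum(t not in called for t in forbidden)
    score = 10.0 * satisfied / (len(expected) + len(forbidden))
    return score, score >= PASS_THRESHOLD
```

With this assumed formula, a test that expects two tools and sees only one of them scores 5.0 and fails, while a test with no configured tools abstains at 10.0.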

Design decisions

  • Config via TestCase.metadata, not a schema change. The metrics read free-form metadata keys, so no TestCase/JSON schema churn.
  • Abstain when there's no signal. tool_usage and completeness return 10.0/pass when unconfigured (no expected_tools/forbidden_tools; no usable turn_count/max_turns or markers), mirroring assertion_pass_rate's abstain convention — so they're safe to enable globally and only "activate" when there's something to grade. They are not added to the default metrics list (["assertion_pass_rate"]), so this PR cannot regress existing runs.
  • completeness keys off the turn budget, not __DONE__. The ACP engine consumes its __DONE__ sentinel (execution/runner.py breaks before appending it to the transcript), so a marker-based check would score the inverse of completion. Instead, completeness compares turn_count < max_turns — a naturally-completed run stops below the ceiling; a truncated one hits it. This required surfacing turn_count onto ExecutionResult (populated from EvalResult.turn_count in ACPAgent, emitted in results.json). completion_markers remains an optional override that takes precedence when set.
  • error_handling default is narrow on purpose. Only ERROR: is matched by default — the one prefix the framework actually emits (runner.py/engine.py; the curated sample asserts transcript_not_contains: "ERROR:" by the same convention). A broader set (Exception:/FATAL/panic:/Traceback) false-positives on ordinary agent prose; teams that want stricter markers supply error_markers.
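
The turn-budget rule for completeness can be sketched as follows. This is a minimal approximation rather than the merged implementation: the function name score_completeness, its parameter list, the binary 10.0/0.0 scoring, and matching completion_markers against the transcript are all assumptions. Only the turn_count < max_turns comparison, the non-empty-output requirement, the marker precedence, and the abstain case come from the description above.

```python
from typing import Any, Optional

PASS_THRESHOLD = 7.0  # stated in the PR: metrics pass at >= 7.0


def score_completeness(
    final_output: str,
    transcript: str,
    turn_count: Optional[int],
    max_turns: Optional[int],
    metadata: dict[str, Any],
) -> tuple[float, bool]:
    """Hypothetical scorer: did the run end by design, with real output?"""
    raw = metadata.get("completion_markers")
    markers = [m for m in raw if isinstance(m, str)] if isinstance(raw, list) else []

    if markers:
        # Optional override: takes precedence over the turn budget when set.
        reached_end = any(marker in transcript for marker in markers)
    elif isinstance(turn_count, int) and isinstance(max_turns, int) and max_turns > 0:
        # A naturally completed run stops below the ceiling; a truncated
        # run hits it, so the comparison is strict.
        reached_end = turn_count < max_turns
    else:
        # No usable signal: abstain with 10.0 / pass.
        return 10.0, True

    # Assumed binary scoring: both conditions must hold.
    score = 10.0 if reached_end and final_output.strip() else 0.0
    return score, score >= PASS_THRESHOLD
```

Under these assumptions, a run with turn_count=10 and max_turns=10 fails as truncated, whereas turn_count=3 with non-empty output passes.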

Notes for reviewers

  • Shared metadata coercion lives in metrics/_metadata.py (coerce_str_list drops non-string entries, so a stray null can't become a phantom tool/marker).
  • tool_usage reads only the name key of each tool call, matching how assertion_pass_rate grades tool_called.
  • ExecutionResult gained turn_count: int | None = None (appended last, defaulted — all construction sites unaffected; JSON-serializable).
  • README documents the metric table, a metadata example, and a custom-metric snippet.
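
The coercion contract mentioned in the first note might look roughly like the sketch below. The name coerce_str_list comes from the PR, but the signature and the handling of non-list input are assumptions. Only the "drops non-string entries" behaviour is stated in the source.

```python
from typing import Any


def coerce_str_list(value: Any) -> list[str]:
    """Illustrative stand-in for the helper in metrics/_metadata.py."""
    if not isinstance(value, (list, tuple)):
        # Assumption: anything that is not a sequence yields no entries.
        return []
    # Stated behaviour: non-string entries are dropped, so a stray null
    # cannot turn into a phantom tool or marker.
    return [item for item in value if isinstance(item, str)]


metadata = {"expected_tools": ["search", None, 3, "read_file"]}
tools = coerce_str_list(metadata.get("expected_tools"))
# tools == ["search", "read_file"]
```

Returning an empty list for malformed input keeps the abstain path simple in this sketch: a metric that receives no usable names has nothing to grade.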

This change was self-reviewed adversarially before raising: an earlier revision had completeness depending on the (stripped) __DONE__ sentinel and error_handling using the broad marker set — both were caught and fixed, with the boundary math (turn_count < max_turns, no off-by-one) re-verified against runner.py.
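
To make the false-positive concern with the broad marker set concrete, here is a small sketch comparing it with the narrow default. The helper names, the case-sensitive substring matching, and the binary 10.0/0.0 score are assumptions for illustration. The marker strings and the error_markers override key are taken from the description.

```python
from typing import Any

DEFAULT_ERROR_MARKERS = ["ERROR:"]  # the default stated in the PR
# The broader set that the earlier revision used and this PR rejects.
BROAD_MARKERS = ["ERROR:", "Exception:", "FATAL", "panic:", "Traceback"]


def find_error_markers(transcript: str, markers: list[str]) -> list[str]:
    """Return the markers that occur in the transcript (assumed substring match)."""
    return [marker for marker in markers if marker in transcript]


def score_error_handling(transcript: str, metadata: dict[str, Any]) -> tuple[float, bool]:
    """Hypothetical scorer: 10.0 when no marker is present, else 0.0."""
    raw = metadata.get("error_markers")
    override = [m for m in raw if isinstance(m, str)] if isinstance(raw, list) else []
    markers = override or DEFAULT_ERROR_MARKERS
    score = 0.0 if find_error_markers(transcript, markers) else 10.0
    return score, score >= 7.0


# Ordinary agent prose that merely mentions a traceback.
prose = "I wrapped the call in try/except so a Traceback is never shown to the user."
narrow_hits = find_error_markers(prose, DEFAULT_ERROR_MARKERS)  # []
broad_hits = find_error_markers(prose, BROAD_MARKERS)           # ["Traceback"]
```

The prose above contains no framework error, yet the broad set flags it. A team that accepts that trade-off can still opt in per test through error_markers.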

Testing

  • Full eval suite passes (309 tests; +14 vs. base), ruff clean (E/F/I).
  • New/updated tests: test_tool_usage.py, test_error_handling.py, test_completeness.py (turn-budget semantics + marker override), test_metadata.py (coercion contract), plus registry built-in/resolve tests.

Comment thread evaluation/src/eval_runner/metrics/error_handling.py Fixed
@lzongren lzongren force-pushed the feature/generic-metrics branch from b411c7e to 67f3233 Compare June 16, 2026 17:17
Adds the three generic, deterministic, zero-arg metrics the Transform Gym
WBP (task 2.3) called for, alongside the existing assertion_pass_rate +
llm_judge. None require Bedrock; all resolve by name through MetricRegistry
and score 0-10, passing at >= 7.0 for cross-metric consistency.

- tool_usage: scores expected_tools called / forbidden_tools avoided
- error_handling: scores transcript free of a surfaced framework error
  (default marker: "ERROR:" — the one prefix the ACP engine actually emits;
  error *absence*, not recovery quality). Always active.
- completeness: scores whether the run reached its designed end (finished
  before exhausting max_turns) AND the final output is non-empty.

Config is read from the free-form TestCase.metadata (no schema change):
expected_tools / forbidden_tools / completion_markers / error_markers.
tool_usage and completeness abstain (10.0/pass) when they have no signal to
act on, so they are safe to enable globally and only activate when there is
something to grade.

completeness measures completion via the TURN BUDGET, not a transcript
marker: the ACP engine consumes its "__DONE__" sentinel (runner.py breaks
before appending it), so a naturally-completed run stops below max_turns
while a truncated one hits the ceiling. This required surfacing turn_count
onto ExecutionResult (populated from EvalResult.turn_count in ACPAgent, and
emitted in the engine's results.json). completion_markers remains an optional
override that takes precedence when set.

Shared metadata coercion lives in metrics/_metadata.py (coerce_str_list drops
non-string entries so a stray null can't become a phantom tool/marker).
tool_usage reads the tool 'name' key only, matching assertion_pass_rate's
tool_called grading.

Registered as built-ins in MetricRegistry; README documents the metric table,
metadata example, and a custom-metric snippet.

Tests: +metric/registry/helper tests; full eval suite 309 passed, ruff clean.
@lzongren lzongren force-pushed the feature/generic-metrics branch from 67f3233 to c1fd8dd Compare June 16, 2026 17:59
@lzongren lzongren requested a review from LikeHui92 June 16, 2026 18:19
@lzongren lzongren merged commit f482917 into main Jun 16, 2026
30 checks passed

2 participants