feat(evaluation): add generic deterministic metrics (WBP 2.3) by lzongren · Pull Request #64 · awslabs/agent-builder-toolkit-aws-transform

lzongren · 2026-06-16T06:45:34Z

What & why

Implements WBP task 2.3 (generic metrics) for the Transform Gym evaluation framework. Before this, the only built-in scorers were assertion_pass_rate and llm_judge (plus credits/usage). This adds three generic, deterministic, zero-Bedrock metrics that resolve by name through MetricRegistry, alongside the existing two.

| Metric | What it scores | Per-test config (`metadata`) |
| --- | --- | --- |
| `tool_usage` | `expected_tools` were called and `forbidden_tools` were avoided | `expected_tools`, `forbidden_tools` |
| `error_handling` | transcript is free of a surfaced framework error (default marker `ERROR:`); this measures error absence, not recovery quality | `error_markers` (optional override) |
| `completeness` | run reached its designed end (finished before exhausting `max_turns`) and final output is non-empty | `completion_markers` (optional override) |

All three score 0–10 and pass at ≥ 7.0, matching assertion_pass_rate so they combine cleanly in one run.
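
As a rough illustration of how a `tool_usage`-style score could be derived from `metadata`, here is a minimal sketch. The function name `score_tool_usage`, its signature, and the proportional scoring formula (one check per expected or forbidden tool) are assumptions made for illustration; only the metadata keys, the 0–10 range, the 7.0 threshold, and the abstain behaviour come from the description above.

```python
from __future__ import annotations

from typing import Any

PASS_THRESHOLD = 7.0  # from the PR: all three metrics pass at >= 7.0


def _names(value: Any) -> list[str]:
    """Keep only string entries from a list-like metadata value (illustrative helper)."""
    if not isinstance(value, (list, tuple)):
        return []
    return [item for item in value if isinstance(item, str)]


def score_tool_usage(metadata: dict[str, Any], called_tools: list[str]) -> tuple[float, bool]:
    """Hypothetical scorer returning (score on a 0-10 scale, passed)."""
    expected = _names(metadata.get("expected_tools"))
    forbidden = _names(metadata.get("forbidden_tools"))

    # Abstain with 10.0/pass when the test case configures nothing to grade.
    if not expected and not forbidden:
        return 10.0, True

    called = set(called_tools)
    # Assumed formula: each expected tool that was called and each forbidden
    # tool that was avoided counts as one satisfied check.
    satisfied = sum(t in called for t in expected) + sum(t not in called for t in forbidden)
    score = 10.0 * satisfied / (len(expected) + len(forbidden))
    return score, score >= PASS_THRESHOLD
```

With `expected_tools` set to `["read_file", "write_file"]` and `forbidden_tools` set to `["delete_file"]`, a run that calls both expected tools and never calls `delete_file` satisfies all three checks under this assumed formula.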

Design decisions

Config via TestCase.metadata, not a schema change. The metrics read free-form metadata keys, so no TestCase/JSON schema churn.
Abstain when there's no signal. tool_usage and completeness return 10.0/pass when unconfigured (tool_usage: no expected_tools or forbidden_tools; completeness: no usable turn_count/max_turns and no markers), mirroring assertion_pass_rate's abstain convention. That makes them safe to enable globally: they only "activate" when there is something to grade. They are not added to the default metrics list (["assertion_pass_rate"]), so this PR cannot regress existing runs.
completeness keys off the turn budget, not __DONE__. The ACP engine consumes its __DONE__ sentinel (execution/runner.py breaks before appending it to the transcript), so a marker-based check would score the inverse of completion. Instead, completeness compares turn_count < max_turns — a naturally-completed run stops below the ceiling; a truncated one hits it. This required surfacing turn_count onto ExecutionResult (populated from EvalResult.turn_count in ACPAgent, emitted in results.json). completion_markers remains an optional override that takes precedence when set.
error_handling default is narrow on purpose. Only ERROR: is matched by default — the one prefix the framework actually emits (runner.py/engine.py; the curated sample asserts transcript_not_contains: "ERROR:" by the same convention). A broader set (Exception:/FATAL/panic:/Traceback) false-positives on ordinary agent prose; teams that want stricter markers supply error_markers.
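
The turn-budget and narrow-marker decisions above can be sketched roughly as follows. This is not the code in the PR: the function names, their parameters, the binary 10.0/0.0 scoring, and matching markers against the transcript text are all assumptions. Only the `turn_count < max_turns` comparison, the marker override taking precedence, the non-empty-output requirement, the abstain case, and the `ERROR:` default come from the description.

```python
from typing import Optional, Sequence

DEFAULT_ERROR_MARKERS = ("ERROR:",)  # the only default marker the PR describes


def score_error_handling(transcript: str,
                         error_markers: Optional[Sequence[str]] = None) -> float:
    """Hypothetical check: 10.0 if no error marker appears in the transcript, else 0.0."""
    markers = tuple(error_markers) if error_markers else DEFAULT_ERROR_MARKERS
    return 0.0 if any(marker in transcript for marker in markers) else 10.0


def score_completeness(final_output: str,
                       transcript: str,
                       turn_count: Optional[int],
                       max_turns: Optional[int],
                       completion_markers: Optional[Sequence[str]] = None) -> float:
    """Hypothetical check: did the run finish under its turn budget with real output?"""
    if completion_markers:
        # An explicit marker override takes precedence over the turn budget.
        completed = any(marker in transcript for marker in completion_markers)
    elif turn_count is None or max_turns is None:
        # No usable signal: abstain with a passing score.
        return 10.0
    else:
        # A naturally completed run stops below the ceiling; a truncated one hits it.
        completed = turn_count < max_turns
    return 10.0 if completed and final_output.strip() else 0.0
```

Under these assumptions, a run with `turn_count=10` and `max_turns=10` scores 0.0 (it hit the ceiling), while ordinary agent prose that mentions a traceback does not trip `score_error_handling` unless a team supplies stricter `error_markers`.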

Notes for reviewers

Shared metadata coercion lives in metrics/_metadata.py (coerce_str_list drops non-string entries, so a stray null can't become a phantom tool/marker).
tool_usage reads the tool name key only, matching how assertion_pass_rate grades tool_called.
ExecutionResult gained turn_count: int | None = None (appended last, defaulted — all construction sites unaffected; JSON-serializable).
README documents the metric table, a metadata example, and a custom-metric snippet.
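
For readers who want a concrete picture of the first and third notes, here is a minimal sketch. The body of `coerce_str_list` is a guess at the contract described above (in particular, wrapping a bare string in a list is an assumption), and the `transcript` and `final_output` fields on `ExecutionResult` are hypothetical placeholders; only the name `coerce_str_list`, the drop-non-strings behaviour, and the trailing defaulted `turn_count` field come from the PR.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


def coerce_str_list(value: Any) -> list[str]:
    """Return only the string entries of a metadata value (assumed contract)."""
    if isinstance(value, str):
        return [value]  # assumption: a bare string is treated as a one-item list
    if isinstance(value, (list, tuple)):
        # Non-string entries such as None or numbers are dropped, so a stray
        # null cannot become a phantom tool or marker.
        return [item for item in value if isinstance(item, str)]
    return []


@dataclass
class ExecutionResult:
    transcript: str    # hypothetical existing field
    final_output: str  # hypothetical existing field
    # Appended last with a default, so existing construction sites keep working.
    turn_count: Optional[int] = None


# Existing two-argument construction is unaffected by the new field.
legacy = ExecutionResult("transcript text", "final answer")
# The new field survives a JSON round trip.
payload = json.loads(json.dumps(asdict(ExecutionResult("t", "o", turn_count=4))))
```

Placing the new field last with a default is what lets positional and keyword construction sites written before this change run unmodified.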

This change was self-reviewed adversarially before raising: an earlier revision had completeness depending on the (stripped) __DONE__ sentinel and error_handling using the broad marker set — both were caught and fixed, with the boundary math (turn_count < max_turns, no off-by-one) re-verified against runner.py.

Testing

Full eval suite passes (309 tests; +14 vs. base), ruff clean (E/F/I).
New/updated tests: test_tool_usage.py, test_error_handling.py, test_completeness.py (turn-budget semantics + marker override), test_metadata.py (coercion contract), plus registry built-in/resolve tests.

Adds the three generic, deterministic, zero-arg metrics the Transform Gym WBP (task 2.3) called for, alongside the existing assertion_pass_rate + llm_judge. None require Bedrock; all resolve by name through MetricRegistry and score 0-10, passing at >= 7.0 for cross-metric consistency.

- tool_usage: scores expected_tools called / forbidden_tools avoided
- error_handling: scores transcript free of a surfaced framework error (default marker: "ERROR:", the one prefix the ACP engine actually emits; error *absence*, not recovery quality). Always active.
- completeness: scores whether the run reached its designed end (finished before exhausting max_turns) AND the final output is non-empty.

Config is read from the free-form TestCase.metadata (no schema change): expected_tools / forbidden_tools / completion_markers / error_markers. tool_usage and completeness abstain (10.0/pass) when they have no signal to act on, so they are safe to enable globally and only activate when there is something to grade.

completeness measures completion via the TURN BUDGET, not a transcript marker: the ACP engine consumes its "__DONE__" sentinel (runner.py breaks before appending it), so a naturally-completed run stops below max_turns while a truncated one hits the ceiling. This required surfacing turn_count onto ExecutionResult (populated from EvalResult.turn_count in ACPAgent, and emitted in the engine's results.json). completion_markers remains an optional override that takes precedence when set.

Shared metadata coercion lives in metrics/_metadata.py (coerce_str_list drops non-string entries so a stray null can't become a phantom tool/marker). tool_usage reads the tool 'name' key only, matching assertion_pass_rate's tool_called grading.

Registered as built-ins in MetricRegistry; README documents the metric table, metadata example, and a custom-metric snippet.

Tests: +metric/registry/helper tests; full eval suite 309 passed, ruff clean.

github-code-quality Bot found potential problems Jun 16, 2026

Comment thread evaluation/src/eval_runner/metrics/error_handling.py Fixed

lzongren force-pushed the feature/generic-metrics branch from b411c7e to 67f3233 Compare June 16, 2026 17:17

lzongren force-pushed the feature/generic-metrics branch from 67f3233 to c1fd8dd Compare June 16, 2026 17:59

lzongren requested a review from LikeHui92 June 16, 2026 18:19

LikeHui92 approved these changes Jun 16, 2026

lzongren merged commit f482917 into main Jun 16, 2026
30 checks passed

lzongren merged 1 commit into main from feature/generic-metrics
