feat(evaluation): add generic deterministic metrics (WBP 2.3) #64
Merged
b411c7e to 67f3233
Adds the three generic, deterministic, zero-arg metrics the Transform Gym WBP (task 2.3) called for, alongside the existing assertion_pass_rate + llm_judge. None require Bedrock; all resolve by name through MetricRegistry and score 0-10, passing at >= 7.0 for cross-metric consistency.

- tool_usage: scores expected_tools called / forbidden_tools avoided
- error_handling: scores transcript free of a surfaced framework error (default marker: "ERROR:", the one prefix the ACP engine actually emits; error *absence*, not recovery quality). Always active.
- completeness: scores whether the run reached its designed end (finished before exhausting max_turns) AND the final output is non-empty.

Config is read from the free-form TestCase.metadata (no schema change): expected_tools / forbidden_tools / completion_markers / error_markers. tool_usage and completeness abstain (10.0/pass) when they have no signal to act on, so they are safe to enable globally and only activate when there is something to grade.

completeness measures completion via the TURN BUDGET, not a transcript marker: the ACP engine consumes its "__DONE__" sentinel (runner.py breaks before appending it), so a naturally-completed run stops below max_turns while a truncated one hits the ceiling. This required surfacing turn_count onto ExecutionResult (populated from EvalResult.turn_count in ACPAgent, and emitted in the engine's results.json). completion_markers remains an optional override that takes precedence when set.

Shared metadata coercion lives in metrics/_metadata.py (coerce_str_list drops non-string entries so a stray null can't become a phantom tool/marker). tool_usage reads the tool 'name' key only, matching assertion_pass_rate's tool_called grading.

Registered as built-ins in MetricRegistry; README documents the metric table, metadata example, and a custom-metric snippet.

Tests: +metric/registry/helper tests; full eval suite 309 passed, ruff clean.
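The turn-budget rule for completeness can be sketched as follows. This is a minimal illustration, not the code in this PR: `RunSummary` and `score_completeness` are hypothetical names, the all-or-nothing 10.0/0.0 scoring is an assumption (the commit message only states the 0-10 range and the >= 7.0 threshold), and matching completion_markers by substring against the transcript is likewise assumed.

```python
from __future__ import annotations

from dataclasses import dataclass

PASS_THRESHOLD = 7.0  # stated in the commit message: pass at >= 7.0


@dataclass
class RunSummary:
    """Hypothetical stand-in for the fields the metric would read."""

    final_output: str
    transcript: str = ""
    turn_count: int | None = None
    max_turns: int | None = None


def score_completeness(
    run: RunSummary, completion_markers: list[str] | None = None
) -> tuple[float, bool]:
    """Return (score, passed) for one run."""
    if completion_markers:
        # Optional override takes precedence over the turn budget.
        # Assumption: a marker counts if it appears anywhere in the transcript.
        finished = any(marker in run.transcript for marker in completion_markers)
    elif run.turn_count is None or run.max_turns is None:
        # No usable signal: abstain with a full score.
        return 10.0, True
    else:
        # A naturally completed run stops below the ceiling;
        # a truncated run reaches max_turns.
        finished = run.turn_count < run.max_turns

    # Assumption: whitespace-only output is treated as empty,
    # and the score is all-or-nothing.
    score = 10.0 if finished and run.final_output.strip() else 0.0
    return score, score >= PASS_THRESHOLD
```

Under these assumptions, a run with turn_count=3 and max_turns=10 and non-empty output passes, while a run with turn_count=10 and max_turns=10 fails even if it produced output.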
67f3233 to c1fd8dd
LikeHui92 approved these changes on Jun 16, 2026
What & why
Implements WBP task 2.3 (generic metrics) for the Transform Gym evaluation framework. Before this, the only built-in scorers were `assertion_pass_rate` and `llm_judge` (plus credits/usage). This adds three generic, deterministic, zero-Bedrock metrics that resolve by name through `MetricRegistry`, alongside the existing two.

| Metric | What it scores | Config (via `metadata`) |
| --- | --- | --- |
| `tool_usage` | Whether `expected_tools` were called and `forbidden_tools` were avoided | `expected_tools`, `forbidden_tools` |
| `error_handling` | Transcript is free of a surfaced framework error (default marker: `ERROR:`); error absence, not recovery quality | `error_markers` (optional override) |
| `completeness` | Run reached its designed end (finished before exhausting `max_turns`) and final output is non-empty | `completion_markers` (optional override) |

All three score 0–10 and pass at ≥ 7.0, matching `assertion_pass_rate` so they combine cleanly in one run.

Design decisions
- Config is read from `TestCase.metadata`, not a schema change. The metrics read free-form `metadata` keys, so there is no `TestCase`/JSON schema churn.
- `tool_usage` and `completeness` return 10.0/pass when unconfigured (no `expected_tools`/`forbidden_tools`; no usable `turn_count`/`max_turns` or markers), mirroring `assertion_pass_rate`'s abstain convention, so they are safe to enable globally and only "activate" when there is something to grade. They are not added to the default metrics list (`["assertion_pass_rate"]`), so this PR cannot regress existing runs.
- `completeness` keys off the turn budget, not `__DONE__`. The ACP engine consumes its `__DONE__` sentinel (`execution/runner.py` breaks before appending it to the transcript), so a marker-based check would score the inverse of completion. Instead, completeness compares `turn_count < max_turns`: a naturally-completed run stops below the ceiling; a truncated one hits it. This required surfacing `turn_count` onto `ExecutionResult` (populated from `EvalResult.turn_count` in `ACPAgent`, emitted in `results.json`). `completion_markers` remains an optional override that takes precedence when set.
- `error_handling` default is narrow on purpose. Only `ERROR:` is matched by default, the one prefix the framework actually emits (`runner.py`/`engine.py`; the curated sample asserts `transcript_not_contains: "ERROR:"` by the same convention). A broader set (`Exception:`/`FATAL`/`panic:`/`Traceback`) false-positives on ordinary agent prose; teams that want stricter markers supply `error_markers`.

Notes for reviewers
- Shared metadata coercion lives in `metrics/_metadata.py` (`coerce_str_list` drops non-string entries, so a stray `null` can't become a phantom tool/marker).
- `tool_usage` reads the tool `name` key only, matching how `assertion_pass_rate` grades `tool_called`.
- `ExecutionResult` gained `turn_count: int | None = None` (appended last and defaulted, so all construction sites are unaffected; JSON-serializable).

This change was self-reviewed adversarially before raising: an earlier revision had `completeness` depending on the (stripped) `__DONE__` sentinel and `error_handling` using the broad marker set. Both were caught and fixed, with the boundary math (`turn_count < max_turns`, no off-by-one) re-verified against `runner.py`.

Testing
`test_tool_usage.py`, `test_error_handling.py`, `test_completeness.py` (turn-budget semantics + marker override), `test_metadata.py` (coercion contract), plus registry built-in/resolve tests.
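For readers who want a feel for the coercion contract those tests cover, here is a minimal sketch. It assumes a signature for `coerce_str_list` that this PR page does not show; only the "drops non-string entries" behavior comes from the description. Treating a bare string as a one-element list and any other non-list value as empty are illustrative assumptions.

```python
from __future__ import annotations

from typing import Any


def coerce_str_list(value: Any) -> list[str]:
    """Coerce a free-form metadata value into a list of strings.

    Hypothetical sketch: the real helper lives in metrics/_metadata.py
    and may differ in signature and edge-case handling.
    """
    if isinstance(value, str):
        # Assumption: a single string is shorthand for a one-item list.
        return [value]
    if not isinstance(value, (list, tuple)):
        # Assumption: missing or malformed config means "nothing configured".
        return []
    # Documented behavior: non-string entries (e.g. a stray null) are dropped.
    return [item for item in value if isinstance(item, str)]
```

With this sketch, `coerce_str_list(["search", None, 3, "read_file"])` returns `["search", "read_file"]`, so a JSON `null` in `expected_tools` never turns into a tool name the run is graded against.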