All notable changes to harness-evals will be documented in this file.
Format follows Keep a Changelog. Versioning follows Semantic Versioning.
- `scoring_duration_ms` on `Score`: optional field tracking how long metric scoring took (in milliseconds).
- `JUnitSink` now emits `time` attributes on `<testcase>` and `<testsuite>` elements, so CI test tabs (Harness, Jenkins, etc.) show actual durations instead of 0s.
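As a rough illustration of the `time` attributes described above, the sketch below builds a JUnit-style `<testsuite>` with Python's standard `xml.etree.ElementTree`. It is not the `JUnitSink` implementation: the `build_junit` helper and the `(name, duration_ms)` input shape are hypothetical, and it assumes the common JUnit convention of reporting `time` in seconds.

```python
import xml.etree.ElementTree as ET


def build_junit(results):
    """Build a <testsuite> element from (name, duration_ms or None) pairs.

    Hypothetical helper for illustration; not part of harness-evals.
    """
    suite = ET.Element("testsuite", name="evals", tests=str(len(results)))
    total_seconds = 0.0
    for name, duration_ms in results:
        case = ET.SubElement(suite, "testcase", name=name)
        if duration_ms is not None:
            # Convert milliseconds to seconds for the time attribute.
            seconds = duration_ms / 1000.0
            case.set("time", f"{seconds:.3f}")
            total_seconds += seconds
    # The suite-level time is the sum of the known case durations.
    suite.set("time", f"{total_seconds:.3f}")
    return suite


suite = build_junit([("exact_match", 12.0), ("latency", None), ("json_diff", 250.0)])
xml_text = ET.tostring(suite, encoding="unicode")
```

Cases without a recorded duration simply omit the attribute in this sketch, which mirrors the fact that `scoring_duration_ms` is optional.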
- `ToolArgumentMatchMetric`: deterministic comparison of tool-call arguments against authored expectations. Companion to `ToolCorrectnessMetric` (names) and the LLM-judged `ArgumentCorrectnessMetric`. Supports `pair=exact|subset`, `arg_match=exact|subset`, `ignore_keys`, and `wildcard_value`. Registered in the catalog as `"tool_argument_match"`.
- `Golden.expected_tool_calls: list[ToolCall] | None` and `EvalCase.expected_tool_calls: list[ToolCall] | None`: optional, defaults to `None`. Lets dataset authors carry expected tool-call arguments alongside `expected_tools`.
- `Golden.from_dict`, `EvalCase.from_dict`, and `EvalCase.from_golden` handle (de)serialization and propagation.
- ADR-010: Why `ToolArgumentMatchMetric` is a separate metric (not an enrichment of `ToolCorrectness`).
- README and PLAN.md updated with the new metric, the canonical `ToolCorrectness` + `ToolArgumentMatch` pairing snippet, and the data-model note.
- Fully backward-compatible: no existing field changes its type or default; existing JSONL datasets continue to load unchanged.
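To make the difference between exact and subset argument matching concrete, here is a minimal sketch of one plausible interpretation. It does not use the library's API: the `args_match` function, its parameter names, and the `"*"` wildcard are assumptions for illustration, and the real `ToolArgumentMatchMetric` semantics may differ.

```python
def args_match(expected, actual, mode="exact", ignore_keys=(), wildcard="*"):
    """Compare one tool call's arguments (hypothetical helper, not the library API).

    mode="exact":  both dicts must have the same keys, with matching values.
    mode="subset": every expected key must appear in actual; extras are allowed.
    A wildcard expected value matches any actual value for that key.
    """
    if mode not in ("exact", "subset"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Drop ignored keys from both sides before comparing.
    exp = {k: v for k, v in expected.items() if k not in ignore_keys}
    act = {k: v for k, v in actual.items() if k not in ignore_keys}
    if mode == "exact" and exp.keys() != act.keys():
        return False
    for key, want in exp.items():
        if key not in act:
            return False
        if want != wildcard and act[key] != want:
            return False
    return True


expected = {"city": "Paris"}
actual = {"city": "Paris", "units": "metric"}

subset_ok = args_match(expected, actual, mode="subset")
exact_ok = args_match(expected, actual, mode="exact")
ignored_ok = args_match(expected, actual, mode="exact", ignore_keys=("units",))
wildcard_ok = args_match({"id": "*"}, {"id": "abc123"})
```

Under these assumptions, subset mode tolerates the extra `units` argument, exact mode rejects it unless the key is ignored, and a wildcard value accepts any concrete argument.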
- `Golden` dataclass: authored evaluation data (input, expected, context)
- `EvalCase` dataclass: replaces `TestCase`, adds typed operational fields (`latency_ms`, `token_count`, `cost_usd`, `retry_count`, `confidence`)
- `EvalCase.from_golden()`: factory to create `EvalCase` from `Golden` + agent output
- `EvalCase.from_dict()` / `Golden.from_dict()`: with backward-compat aliases (`actual_output` -> `output`, `expected_output` -> `expected`, `token_usage` -> `token_count`)
- `Score.passed`: auto-computed from `value >= threshold` in `__post_init__`
- `Score.created_at`: UTC timestamp set at creation
- `Score.to_dict()`, `Golden.to_dict()`, `EvalCase.to_dict()`: serialization methods
- `evaluate_cases()`: sync batch evaluation of pre-captured eval cases
- `evaluate_dataset()`: async evaluation that runs the agent on goldens, then scores
- `BaseMetric.a_measure()`: async variant, defaults to calling sync `measure()`
- `ReliabilityMetric.a_measure_runs()`: async variant for multi-run metrics
- ADR-007: Why Golden and EvalCase are separate types
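The "async variant defaults to calling sync `measure()`" pattern noted above can be sketched as follows. The class names `BaseMetricSketch` and `LengthMetric`, the dict-shaped eval case, and the float return value are placeholders for illustration, not the library's actual types.

```python
import asyncio


class BaseMetricSketch:
    """Illustrative base class; not the real BaseMetric."""

    def measure(self, eval_case):
        raise NotImplementedError

    async def a_measure(self, eval_case):
        # Default async path: delegate to the synchronous implementation.
        # Subclasses doing real I/O (e.g. LLM calls) would override this.
        return self.measure(eval_case)


class LengthMetric(BaseMetricSketch):
    """Toy metric that only implements the sync method."""

    def measure(self, eval_case):
        return float(len(eval_case["output"]))


async def score_all(metric, cases):
    # Sync-only metrics still work in async code via the default a_measure().
    return await asyncio.gather(*(metric.a_measure(c) for c in cases))


results = asyncio.run(score_all(LengthMetric(), [{"output": "ab"}, {"output": "abcd"}]))
```

With this design a metric author writes only `measure()` for deterministic checks, and overrides `a_measure()` only when the metric genuinely benefits from concurrency.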
- BREAKING: `TestCase` removed, replaced by `Golden` + `EvalCase`
- BREAKING: `actual_output` renamed to `output`
- BREAKING: `expected_output` renamed to `expected`
- BREAKING: `Score.success` renamed to `Score.passed` (auto-computed, not in constructor)
- Operational metrics read typed fields (`eval_case.latency_ms`) instead of `metadata` dict
- `ResourceConsistencyMetric` default `resource_key` changed from `"token_usage"` to `"token_count"`
- ADR-006 updated to reflect sync `measure()` + async `a_measure()` pattern
- `TestCase` dataclass (use `Golden` + `EvalCase` instead)
- `Score.success` constructor parameter (use auto-computed `Score.passed`)
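For readers wondering how a field can be "auto-computed, not in constructor", the following sketch shows one way to do it with a standard dataclass. `ScoreSketch`, its `name` field, and the `0.5` default threshold are assumptions for illustration; only `value`, `threshold`, `passed`, and `created_at` are named in this changelog.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ScoreSketch:
    """Illustrative stand-in for Score; field set and defaults are assumed."""

    name: str
    value: float
    threshold: float = 0.5  # assumed default, not taken from the library
    # init=False keeps `passed` out of the constructor signature.
    passed: bool = field(init=False)
    # UTC timestamp captured when the instance is created.
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        # Derived from value and threshold, so it cannot be set inconsistently.
        self.passed = self.value >= self.threshold


high = ScoreSketch("exact_match", value=0.8, threshold=0.7)
edge = ScoreSketch("exact_match", value=0.7, threshold=0.7)
low = ScoreSketch("exact_match", value=0.6, threshold=0.7)
```

Because `passed` is excluded from `__init__`, code that still passes the old success flag as a keyword argument fails immediately with a `TypeError` instead of silently carrying a stale value.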
Replace `TestCase` usage:

```python
# Before
tc = TestCase(input="q", actual_output="a", expected_output="e",
              metadata={"latency_ms": 100})
score = metric.measure(tc)
if score.success: ...

# After
ec = EvalCase(input="q", output="a", expected="e", latency_ms=100)
score = metric.measure(ec)
if score.passed: ...
```

Or use `from_dict()` with old field names (backward compatible):
```python
ec = EvalCase.from_dict({"input": "q", "actual_output": "a", "expected_output": "e"})
```

- Core types: `TestCase`, `Score`, `BaseMetric`, `ReliabilityMetric`, `BaseSink`
- Runner functions: `evaluate()` (non-raising) and `assert_test()` (raises on failure)
- Deterministic metrics: `ExactMatchMetric`, `ContainsMetric`, `RegexMetric`, `NumericDiffMetric`
- Structural metrics: `JsonDiffMetric` (DeepDiff-backed), `SchemaValidationMetric` (jsonschema-backed)
- Operational metrics: `LatencyMetric`, `TokenCostMetric`, `CostEfficiencyMetric`, `RetryCountMetric`
- Reliability metrics: `OutcomeConsistencyMetric`, `ResourceConsistencyMetric`
- Output sinks: `StdoutSink`, `JsonSink`
StdoutSink,JsonSink - Project infrastructure: pyproject.toml, .gitignore, .pre-commit-config.yaml
- CI: GitHub Actions workflow for Python 3.10/3.11/3.12
- Documentation: README.md, AGENTS.md, PLAN.md (full 6-phase vision)
- 42 passing tests covering all metrics and core functions
- Example: `examples/basic_eval.py`
- Phase 5: Synthesizer, perturbation generators
- Phase 6: Harness AI Evals integration