
[RFC] feat: change Evaluator.evaluate() to return list[EvaluationOutput] #23

Merged
jjbuck merged 1 commit into strands-agents:main from jjbuck:feature/evaluator_interface
Nov 2, 2025

Conversation


@jjbuck jjbuck commented Oct 29, 2025

Description

BREAKING CHANGE: Evaluator.evaluate() and evaluate_async() now return list[EvaluationOutput] instead of a single EvaluationOutput, to support multi-metric evaluation scenarios.

  • Add aggregator property to Evaluator base class with default mean aggregation
  • Update all evaluator implementations to return lists
  • InteractionsEvaluator now returns all intermediate evaluations instead of only the last
  • Add detailed_results field to EvaluationReport for drill-down into individual metrics
  • Dataset aggregates multiple outputs per case using evaluator's aggregator function
  • Update display to show a detailed metrics tree when cases are expanded. An example is shown below (note: this output is generated from the toy "multi_metric_evaluator" example file, included to illustrate the changes; we could drop it from this PR, or land it as a future standalone PR, to minimize cruft)
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ index ┃ name           ┃ score ┃ test_pass ┃ reason                                                                                              ┃ input ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ ▼ 0   │ short-response │ 1.00  │ ✅        │ Length: 67 chars (✓ min 20) | Keywords: 2/2 found ['response', 'information'] | Sentiment: Positive │ Hi    │
├───────┼────────────────┼───────┼───────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ ▶ 1   │ long-response  │ 1.00  │ ✅        │ ...                                                                                                 │ ...   │
└───────┴────────────────┴───────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘
📋 Detailed Metrics for Case 0
├── Metric 1: Score=1.00 ✅
│   └── Length: 67 chars (✓ min 20)
├── Metric 2: Score=1.00 ✅
│   └── Keywords: 2/2 found ['response', 'information']
└── Metric 3: Score=1.00 ✅
    └── Sentiment: Positive
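
Under the new interface, a multi-metric evaluator like the toy example above returns one output per metric. The sketch below is illustrative only: the `EvaluationOutput` field names (`score`, `test_pass`, `reason`) are assumptions inferred from the report columns, and the stand-in dataclass is not the library's actual class.

```python
from dataclasses import dataclass


# Minimal stand-in for the library's EvaluationOutput; field names are
# assumptions inferred from the report columns (score, test_pass, reason).
@dataclass
class EvaluationOutput:
    score: float
    test_pass: bool
    reason: str


def evaluate(response: str) -> list[EvaluationOutput]:
    """Toy multi-metric evaluator: one EvaluationOutput per metric."""
    outputs = []

    # Metric 1: minimum length
    long_enough = len(response) >= 20
    outputs.append(EvaluationOutput(
        score=1.0 if long_enough else 0.0,
        test_pass=long_enough,
        reason=f"Length: {len(response)} chars (min 20: {long_enough})",
    ))

    # Metric 2: required keywords must all appear in the response
    keywords = ["response", "information"]
    found = [k for k in keywords if k in response.lower()]
    outputs.append(EvaluationOutput(
        score=len(found) / len(keywords),
        test_pass=len(found) == len(keywords),
        reason=f"Keywords: {len(found)}/{len(keywords)} found {found}",
    ))

    return outputs
```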

Motivation

The current evaluator interface assumes a 1:1 relationship between test cases and evaluation metrics. However, many real-world evaluation scenarios produce multiple metrics per test case. For example, evaluating tool parameter accuracy across a multi-turn conversation should produce one metric per turn, not a single aggregate score. Similarly, the InteractionsEvaluator was already evaluating each interaction individually but discarding all intermediate results except the last one.

This change makes the evaluator interface more expressive by returning a list of metrics. While this is a breaking change to the return type, the evaluation logic itself remains unchanged—existing evaluators simply wrap their single output in a list. The Dataset layer handles aggregation transparently, so the EvaluationReport structure (one score per case) stays intact.
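
To make the migration path concrete, here is a hypothetical before/after sketch of a single-metric evaluator; the function and field names are illustrative, not the library's exact API.

```python
# Hypothetical before/after migration of a single-metric evaluator.
# The output shape (score, test_pass, reason) mirrors the PR description;
# everything else here is illustrative.

def evaluate_old(expected: str, actual: str) -> dict:
    ok = expected == actual
    return {"score": 1.0 if ok else 0.0, "test_pass": ok, "reason": "exact match"}


def evaluate_new(expected: str, actual: str) -> list[dict]:
    # Evaluation logic is unchanged; the single output is simply
    # wrapped in a list to satisfy the new return type.
    return [evaluate_old(expected, actual)]
```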

Each evaluator now has a configurable aggregator function that determines how multiple metrics combine into a case-level score. The default aggregator computes the mean of scores, requires all metrics to pass for the case to pass, and concatenates reasons with " | " separators. Evaluators can override this with custom aggregation logic (e.g., min, max, weighted average) to match their specific semantics. Detailed individual metrics are preserved in EvaluationReport.detailed_results for drill-down analysis.
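
The default aggregation described above (mean of scores, all metrics must pass, reasons joined with " | ") can be sketched as follows; the `EvaluationOutput` dataclass here is a stand-in for the library class, with assumed field names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvaluationOutput:  # illustrative stand-in for the library class
    score: float
    test_pass: bool
    reason: str


def default_aggregator(outputs: list[EvaluationOutput]) -> EvaluationOutput:
    """Default aggregation: mean of scores, AND of pass flags,
    reasons concatenated with ' | ' separators."""
    return EvaluationOutput(
        score=mean(o.score for o in outputs),
        test_pass=all(o.test_pass for o in outputs),
        reason=" | ".join(o.reason for o in outputs),
    )
```

A custom aggregator with different semantics (e.g. worst-case scoring) would only need to swap `mean` for `min` here.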

The attached figure illustrates the modifications to the relevant data models.


Related Issues

N/A

Documentation PR

N/A

Type of Change

Breaking change

Testing

Ran pytest after updating all affected unit tests.

  • [x] I ran hatch run prepare

Checklist

  • [x] I have read the CONTRIBUTING document
  • [x] I have added any necessary tests that prove my fix is effective or my feature works
  • [x] I have updated the documentation accordingly
  • [x] I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • [x] My changes generate no new warnings
  • [x] Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jjbuck jjbuck requested a review from poshinchen October 29, 2025 22:24
@jjbuck jjbuck changed the title from "feat: change Evaluator.evaluate() to return list[EvaluationOutput]" to "[RFC] feat: change Evaluator.evaluate() to return list[EvaluationOutput]" Oct 29, 2025
@jjbuck jjbuck force-pushed the feature/evaluator_interface branch from 00fd063 to 8db4c70 on October 29, 2025 22:31
@jjbuck jjbuck marked this pull request as ready for review November 2, 2025 00:50
@jjbuck jjbuck merged commit 15eb14e into strands-agents:main Nov 2, 2025
22 checks passed