
[RFC] feat: change Evaluator.evaluate() to return list[EvaluationOutput] #23

Merged
jjbuck merged 1 commit into strands-agents:main from jjbuck:feature/evaluator_interface
Nov 2, 2025

Conversation


@jjbuck jjbuck commented Oct 29, 2025

Description

BREAKING CHANGE: Evaluator.evaluate() and evaluate_async() now return list[EvaluationOutput] instead of a single EvaluationOutput, to support multi-metric evaluation scenarios.

  • Add aggregator property to Evaluator base class with default mean aggregation
  • Update all evaluator implementations to return lists
  • InteractionsEvaluator now returns all intermediate evaluations instead of only the last
  • Add detailed_results field to EvaluationReport for drill-down into individual metrics
  • Dataset aggregates multiple outputs per case using evaluator's aggregator function
  • Update display to show a detailed metrics tree when cases are expanded. An example is shown below (note: this output is generated from the toy "multi_metric_evaluator" example file, included to illustrate the changes; we could drop it from this PR, or land it as a future standalone PR, to minimize cruft)
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ index ┃ name           ┃ score ┃ test_pass ┃ reason                                                                                              ┃ input ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ ▼ 0   │ short-response │ 1.00  │ ✅        │ Length: 67 chars (✓ min 20) | Keywords: 2/2 found ['response', 'information'] | Sentiment: Positive │ Hi    │
├───────┼────────────────┼───────┼───────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ ▶ 1   │ long-response  │ 1.00  │ ✅        │ ...                                                                                                 │ ...   │
└───────┴────────────────┴───────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘
📋 Detailed Metrics for Case 0
├── Metric 1: Score=1.00 ✅
│   └── Length: 67 chars (✓ min 20)
├── Metric 2: Score=1.00 ✅
│   └── Keywords: 2/2 found ['response', 'information']
└── Metric 3: Score=1.00 ✅
    └── Sentiment: Positive
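
Under the new interface, a multi-metric evaluator like the toy example above returns one output per metric. The sketch below is illustrative only: the `EvaluationOutput` field names (`score`, `test_pass`, `reason`) are assumptions inferred from the report columns, and the stand-in dataclass is not the library's actual class.

```python
from dataclasses import dataclass


# Minimal stand-in for the library's EvaluationOutput; field names are
# assumptions inferred from the report columns (score, test_pass, reason).
@dataclass
class EvaluationOutput:
    score: float
    test_pass: bool
    reason: str


def evaluate(response: str) -> list[EvaluationOutput]:
    """Toy multi-metric evaluator: one EvaluationOutput per metric."""
    outputs = []

    # Metric 1: minimum length
    long_enough = len(response) >= 20
    outputs.append(EvaluationOutput(
        score=1.0 if long_enough else 0.0,
        test_pass=long_enough,
        reason=f"Length: {len(response)} chars (min 20: {long_enough})",
    ))

    # Metric 2: required keywords must all appear in the response
    keywords = ["response", "information"]
    found = [k for k in keywords if k in response.lower()]
    outputs.append(EvaluationOutput(
        score=len(found) / len(keywords),
        test_pass=len(found) == len(keywords),
        reason=f"Keywords: {len(found)}/{len(keywords)} found {found}",
    ))

    return outputs
```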

Motivation

The current evaluator interface assumes a 1:1 relationship between test cases and evaluation metrics. However, many real-world evaluation scenarios produce multiple metrics per test case. For example, evaluating tool parameter accuracy across a multi-turn conversation should produce one metric per turn, not a single aggregate score. Similarly, the InteractionsEvaluator was already evaluating each interaction individually but discarding all intermediate results except the last one.

This change makes the evaluator interface more expressive by returning a list of metrics. While this is a breaking change to the return type, the evaluation logic itself remains unchanged—existing evaluators simply wrap their single output in a list. The Dataset layer handles aggregation transparently, so the EvaluationReport structure (one score per case) stays intact.
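
To make the migration path concrete, here is a hypothetical before/after sketch of a single-metric evaluator; the function and field names are illustrative, not the library's exact API.

```python
# Hypothetical before/after migration of a single-metric evaluator.
# The output shape (score, test_pass, reason) mirrors the PR description;
# everything else here is illustrative.

def evaluate_old(expected: str, actual: str) -> dict:
    ok = expected == actual
    return {"score": 1.0 if ok else 0.0, "test_pass": ok, "reason": "exact match"}


def evaluate_new(expected: str, actual: str) -> list[dict]:
    # Evaluation logic is unchanged; the single output is simply
    # wrapped in a list to satisfy the new return type.
    return [evaluate_old(expected, actual)]
```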

Each evaluator now has a configurable aggregator function that determines how multiple metrics combine into a case-level score. The default aggregator computes the mean of scores, requires all metrics to pass for the case to pass, and concatenates reasons with " | " separators. Evaluators can override this with custom aggregation logic (e.g., min, max, weighted average) to match their specific semantics. Detailed individual metrics are preserved in EvaluationReport.detailed_results for drill-down analysis.
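
The default aggregation described above (mean of scores, all metrics must pass, reasons joined with " | ") can be sketched as follows; the `EvaluationOutput` dataclass here is a stand-in for the library class, with assumed field names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvaluationOutput:  # illustrative stand-in for the library class
    score: float
    test_pass: bool
    reason: str


def default_aggregator(outputs: list[EvaluationOutput]) -> EvaluationOutput:
    """Default aggregation: mean of scores, AND of pass flags,
    reasons concatenated with ' | ' separators."""
    return EvaluationOutput(
        score=mean(o.score for o in outputs),
        test_pass=all(o.test_pass for o in outputs),
        reason=" | ".join(o.reason for o in outputs),
    )
```

A custom aggregator with different semantics (e.g. worst-case scoring) would only need to swap `mean` for `min` here.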

The attached figure illustrates the modifications to the relevant data models.


Related Issues

N/A

Documentation PR

N/A

Type of Change

Breaking change

Testing

Ran pytest after updating all affected unit tests.

  • [x] I ran hatch run prepare

Checklist

  • [x] I have read the CONTRIBUTING document
  • [x] I have added any necessary tests that prove my fix is effective or my feature works
  • [x] I have updated the documentation accordingly
  • [x] I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • [x] My changes generate no new warnings
  • [x] Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jjbuck jjbuck requested a review from poshinchen October 29, 2025 22:24
@jjbuck jjbuck changed the title from "feat: change Evaluator.evaluate() to return list[EvaluationOutput]" to "[RFC] feat: change Evaluator.evaluate() to return list[EvaluationOutput]" Oct 29, 2025
@jjbuck jjbuck force-pushed the feature/evaluator_interface branch from 00fd063 to 8db4c70 on October 29, 2025 22:31
@jjbuck jjbuck marked this pull request as ready for review November 2, 2025 00:50
@jjbuck jjbuck merged commit 15eb14e into strands-agents:main Nov 2, 2025
22 checks passed