Releases: strands-agents/evals
v0.1.11
What's Changed
- feat(report): allow flattened report by @poshinchen in #157
- feat: add environment state evaluation support by @afarntrog in #156
- feat: added Langchain mappers by @poshinchen in #153
- feat: add environment state support to OutputEvaluator by @afarntrog in #160
- fix: hatch run test-lint by @afarntrog in #161
Full Changelog: v0.1.10...v0.1.11
v0.1.10
What's Changed
- feat: add deterministic evaluators for output and trajectory checks by @afarntrog in #154
Full Changelog: v0.1.9...v0.1.10
v0.1.9
What's Changed
- feat: add CloudWatchProvider to pull remote cloudwatch traces and run evals against them. by @afarntrog in #147
- feat: add ToolSimulator for tool response simulation by @ybdarrenwang in #111
Full Changelog: v0.1.8...v0.1.9
v0.1.8
What's Changed
- fix: handle parallel tool calls during tool extraction by @clareliguori in #137
- feat: trace provider interface by @afarntrog in #140
- feat: add LangfuseProvider for remote trace evaluation by @afarntrog in #144
- ci: bump amannn/action-semantic-pull-request from 5 to 6 by @dependabot[bot] in #138
Full Changelog: v0.1.7...v0.1.8
v0.1.7
What's Changed
- fix: retrieve multiple text contentBlock in messageConent by @poshinchen in #133
- feat(workflows): add conventional commit workflow in PR by @mkmeral in #134
- fix: add tool info to concisenss, harmfulness, helpfulness and response relevance evaluators by @ybdarrenwang in #132
- fix: update output variable name in workflow by @Unshure in #139
- fix: update finalize condition for workflow execution by @Unshure in #142
New Contributors
- @mkmeral made their first contribution in #134
- @ybdarrenwang made their first contribution in #132
- @Unshure made their first contribution in #139
Full Changelog: v0.1.6...v0.1.7
v0.1.6
What's Changed
- Refactor: centralized InputT and OutputT by @poshinchen in #124
- Added CoherenceEvaluator by @poshinchen in #125
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major Features
Response Relevance Evaluator - PR#112
The new ResponseRelevanceEvaluator measures how well an agent's response addresses the user's question. It uses a 5-level LLM-as-judge scoring system — Not At All (0.0), Not Generally (0.25), Neutral/Mixed (0.5), Generally Yes (0.75), and Completely Yes (1.0) — with a pass threshold at ≥0.5. Like other trace-level evaluators, it requires an actual_trajectory session and supports both sync and async evaluation.
from strands_evals.evaluators import ResponseRelevanceEvaluator
evaluator = ResponseRelevanceEvaluator()
results = evaluator.evaluate(evaluation_data)
# results[0].score -> 1.0 (for COMPLETELY_YES)
# results[0].test_pass -> True (score >= 0.5)
# results[0].reason -> "The response directly answers the question."
# results[0].label -> ResponseRelevanceScore.COMPLETELY_YES

Conciseness Evaluator - PR#115
The new ConcisenessEvaluator assesses whether an agent's response is appropriately concise. It uses a 3-level scoring system: Perfectly Concise (1.0) for responses that deliver exactly what was asked, Partially Concise (0.5) for minor extra wording, and Not Concise (0.0) for verbose or repetitive content. The pass threshold is ≥0.5. Both evaluators accept an optional custom model and system_prompt for the LLM judge.
from strands_evals.evaluators import ConcisenessEvaluator
evaluator = ConcisenessEvaluator()
results = evaluator.evaluate(evaluation_data)
# results[0].score -> 0.0 (for NOT_CONCISE)
# results[0].test_pass -> False (score < 0.5)
# results[0].label -> ConcisenessScore.NOT_CONCISE

Automatic Retry with Exponential Backoff for Throttled Evaluations - PR#107
Experiment.run_evaluations() and run_evaluations_async() now automatically retry on throttling errors using tenacity. Both task execution and evaluator execution are wrapped with exponential backoff (up to 6 attempts, 4s → 240s). Throttling is detected for ModelThrottledException, EventLoopException, and botocore ClientError with codes like ThrottlingException and TooManyRequestsException. Non-throttling errors are raised immediately without retrying. Additionally, one evaluator failing no longer prevents other evaluators from running on the same case.
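The actual implementation uses tenacity; the sketch below illustrates the same retry pattern in plain Python under stated assumptions. `ThrottledError`, `run_with_backoff`, and `flaky_task` are hypothetical stand-ins, not names from the library.

```python
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for ModelThrottledException or a
    throttled botocore ClientError (e.g. ThrottlingException)."""

def run_with_backoff(fn, max_attempts=6, base_delay=4.0,
                     max_delay=240.0, sleep=time.sleep):
    """Retry fn() on throttling errors with exponential backoff.

    Non-throttling exceptions propagate immediately, mirroring the
    behavior described above for run_evaluations().
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts:
                raise
            # 4s, 8s, 16s, ... capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))

# Example: a task that is throttled twice, then succeeds.
calls = {"n": 0}

def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError("rate limited")
    return "ok"

delays = []  # capture sleeps instead of actually waiting
result = run_with_backoff(flaky_task, sleep=delays.append)
# result == "ok"; delays == [4.0, 8.0]
```

A real non-throttling error (say, a `ValueError` from the evaluator) is not caught by the `except` clause, so it surfaces on the first attempt, matching the behavior the release notes describe.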
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator, ConcisenessEvaluator
experiment = Experiment(
cases=[Case(name="test", input="What is 2+2?", expected_output="4")],
evaluators=[
OutputEvaluator(rubric="Is the answer correct?"),
ConcisenessEvaluator(),
],
)
# Throttling errors are retried automatically with exponential backoff.
# If one evaluator fails, the other still runs.
reports = experiment.run_evaluations(my_agent)

Bug Fixes
- Replace Deprecated Structured Output Methods - PR#67
Updated OutputEvaluator to call the agent with the structured_output_model parameter instead of the removed structured_output() and structured_output_async() methods. Without this fix, OutputEvaluator would fail on recent Strands SDK versions.
Full Changelog: v0.1.4...v0.1.5
v0.1.4
What's Changed
- fix: include tool executions in _extract_trace_level by @razkenari in #77
New Contributors
- @razkenari made their first contribution in #77
Full Changelog: v0.1.3...v0.1.4
v0.1.3
What's Changed
- fix: Multiple Tool Usage Not Detected in tools_use_extractor.py by @bipro1992 in #80
New Contributors
- @bipro1992 made their first contribution in #80
Full Changelog: v0.1.2...v0.1.3
v0.1.2
What's Changed
- fix: Isolate evaluator errors in run_evaluations by @afarntrog in #84
- fix(extractors): Add null check for toolResult in message extraction by @afarntrog in #85
Full Changelog: v0.1.1...v0.1.2