Releases: strands-agents/evals
v0.1.11
What's Changed
- feat(report): allow flattened report by @poshinchen in #157
- feat: add environment state evaluation support by @afarntrog in #156
- feat: added Langchain mappers by @poshinchen in #153
- feat: add environment state support to OutputEvaluator by @afarntrog in #160
- fix: hatch run test-lint by @afarntrog in #161
Full Changelog: v0.1.10...v0.1.11
v0.1.10
What's Changed
- feat: add deterministic evaluators for output and trajectory checks by @afarntrog in #154
Full Changelog: v0.1.9...v0.1.10
v0.1.9
What's Changed
- feat: add CloudWatchProvider to pull remote cloudwatch traces and run evals against them. by @afarntrog in #147
- feat: add ToolSimulator for tool response simulation by @ybdarrenwang in #111
Full Changelog: v0.1.8...v0.1.9
v0.1.8
What's Changed
- fix: handle parallel tool calls during tool extraction by @clareliguori in #137
- feat: trace provider interface by @afarntrog in #140
- feat: add LangfuseProvider for remote trace evaluation by @afarntrog in #144
- ci: bump amannn/action-semantic-pull-request from 5 to 6 by @dependabot[bot] in #138
Full Changelog: v0.1.7...v0.1.8
v0.1.7
What's Changed
- fix: retrieve multiple text contentBlock in messageConent by @poshinchen in #133
- feat(workflows): add conventional commit workflow in PR by @mkmeral in #134
- fix: add tool info to concisenss, harmfulness, helpfulness and response relevance evaluators by @ybdarrenwang in #132
- fix: update output variable name in workflow by @Unshure in #139
- fix: update finalize condition for workflow execution by @Unshure in #142
New Contributors
- @mkmeral made their first contribution in #134
- @ybdarrenwang made their first contribution in #132
- @Unshure made their first contribution in #139
Full Changelog: v0.1.6...v0.1.7
v0.1.6
What's Changed
- Refactor: centralized InputT and OutputT by @poshinchen in #124
- Added CoherenceEvaluator by @poshinchen in #125
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major Features
Response Relevance Evaluator - PR#112
The new ResponseRelevanceEvaluator measures how well an agent's response addresses the user's question. It uses a 5-level LLM-as-judge scoring system — Not At All (0.0), Not Generally (0.25), Neutral/Mixed (0.5), Generally Yes (0.75), and Completely Yes (1.0) — with a pass threshold at ≥0.5. Like other trace-level evaluators, it requires an actual_trajectory session and supports both sync and async evaluation.
from strands_evals.evaluators import ResponseRelevanceEvaluator
evaluator = ResponseRelevanceEvaluator()
results = evaluator.evaluate(evaluation_data)
# results[0].score -> 1.0 (for COMPLETELY_YES)
# results[0].test_pass -> True (score >= 0.5)
# results[0].reason -> "The response directly answers the question."
# results[0].label -> ResponseRelevanceScore.COMPLETELY_YES

Conciseness Evaluator - PR#115
The new ConcisenessEvaluator assesses whether an agent's response is appropriately concise. It uses a 3-level scoring system: Perfectly Concise (1.0) for responses that deliver exactly what was asked, Partially Concise (0.5) for minor extra wording, and Not Concise (0.0) for verbose or repetitive content. The pass threshold is ≥0.5. Both evaluators accept an optional custom model and system_prompt for the LLM judge.
from strands_evals.evaluators import ConcisenessEvaluator
evaluator = ConcisenessEvaluator()
results = evaluator.evaluate(evaluation_data)
# results[0].score -> 0.0 (for NOT_CONCISE)
# results[0].test_pass -> False (score < 0.5)
# results[0].label -> ConcisenessScore.NOT_CONCISE

Automatic Retry with Exponential Backoff for Throttled Evaluations - PR#107
Experiment.run_evaluations() and run_evaluations_async() now automatically retry on throttling errors using tenacity. Both task execution and evaluator execution are wrapped with exponential backoff (up to 6 attempts, 4s → 240s). Throttling is detected for ModelThrottledException, EventLoopException, and botocore ClientError with codes like ThrottlingException and TooManyRequestsException. Non-throttling errors are raised immediately without retrying. Additionally, one evaluator failing no longer prevents other evaluators from running on the same case.
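The actual implementation uses tenacity; the sketch below illustrates the same retry pattern in plain Python under stated assumptions. `ThrottledError`, `run_with_backoff`, and `flaky_task` are hypothetical stand-ins, not names from the library.

```python
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for ModelThrottledException or a
    throttled botocore ClientError (e.g. ThrottlingException)."""

def run_with_backoff(fn, max_attempts=6, base_delay=4.0,
                     max_delay=240.0, sleep=time.sleep):
    """Retry fn() on throttling errors with exponential backoff.

    Non-throttling exceptions propagate immediately, mirroring the
    behavior described above for run_evaluations().
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts:
                raise
            # 4s, 8s, 16s, ... capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))

# Example: a task that is throttled twice, then succeeds.
calls = {"n": 0}

def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError("rate limited")
    return "ok"

delays = []  # capture sleeps instead of actually waiting
result = run_with_backoff(flaky_task, sleep=delays.append)
# result == "ok"; delays == [4.0, 8.0]
```

A real non-throttling error (say, a `ValueError` from the evaluator) is not caught by the `except` clause, so it surfaces on the first attempt, matching the behavior the release notes describe.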
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator, ConcisenessEvaluator
experiment = Experiment(
cases=[Case(name="test", input="What is 2+2?", expected_output="4")],
evaluators=[
OutputEvaluator(rubric="Is the answer correct?"),
ConcisenessEvaluator(),
],
)
# Throttling errors are retried automatically with exponential backoff.
# If one evaluator fails, the other still runs.
reports = experiment.run_evaluations(my_agent)

Bug Fixes
- Replace Deprecated Structured Output Methods - PR#67
Updated OutputEvaluator to call the agent with the structured_output_model parameter instead of the removed structured_output() and structured_output_async() methods. Without this fix, OutputEvaluator would fail on recent Strands SDK versions.
Full Changelog: v0.1.4...v0.1.5
v0.1.4
What's Changed
- fix: include tool executions in _extract_trace_level by @razkenari in #77
New Contributors
- @razkenari made their first contribution in #77
Full Changelog: v0.1.3...v0.1.4
v0.1.3
What's Changed
- fix: Multiple Tool Usage Not Detected in tools_use_extractor.py by @bipro1992 in #80
New Contributors
- @bipro1992 made their first contribution in #80
Full Changelog: v0.1.2...v0.1.3
v0.1.2
What's Changed
- fix: Isolate evaluator errors in run_evaluations by @afarntrog in #84
- fix(extractors): Add null check for toolResult in message extraction by @afarntrog in #85
Full Changelog: v0.1.1...v0.1.2