feat: separate task execution from evaluation in Experiment with run_tasks/run_evaluators by afarntrog · Pull Request #170 · strands-agents/evals

afarntrog · 2026-03-20T13:46:26Z

Description

Refactor the Experiment class to separate task execution from evaluation into independent, composable steps, and fix async result ordering.

Add run_tasks() and run_tasks_async() methods that execute tasks against cases and return EvaluationData without running evaluators
Add run_evaluators() and run_evaluators_async() methods that evaluate pre-computed EvaluationData without executing tasks, enabling reuse of task results across multiple evaluation runs
Rewrite run_evaluations() and run_evaluations_async() as convenience wrappers that compose the two steps
Preserve result ordering in async workers by using index-based placement into a pre-allocated list instead of appending to a shared list
Switch from asyncio.iscoroutinefunction() to inspect.iscoroutinefunction() for more reliable coroutine detection
Extract shared logic into helper methods (_build_reports, _init_evaluator_data, _log_evaluation_to_cloudwatch, _set_task_span_attributes, _extract_task_error, _build_failed_evaluation_data)
Failed task executions now produce EvaluationData with actual_output=None and the error in metadata["task_error"] instead of raising
Always include session_id in evaluation metadata

Related Issues

#94

Documentation PR

Type of Change

New feature

Testing

Added 15 new tests in tests/strands_evals/test_experiment.py covering:

run_tasks: basic output, dict output with trajectory/interactions, task failure handling
run_evaluators: pre-computed data evaluation, evaluator error isolation, multiple evaluators
Roundtrip: run_tasks → run_evaluators pipeline, serialization/deserialization roundtrip
Async variants: run_tasks_async with sync and async tasks, failure handling, run_evaluators_async with pre-computed data and evaluator failures

Updated existing tests to reflect span naming changes (run_task/eval_case instead of execute_case) and the inclusion of session_id in metadata.

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…CloudWatch observability - Extract helper methods (_build_reports, _init_evaluator_data, _log_evaluation_to_cloudwatch) - Split monolithic run into separate run_tasks() and run_evaluators() methods - Add run_evaluations() as convenience method combining both steps - Add CloudWatch logging support for evaluation results when observability is enabled - Support parallel execution of tasks and evaluators via threading

…oroutine detection - Replace asyncio.iscoroutinefunction with inspect.iscoroutinefunction for more reliable async function detection - Use index-based result assignment instead of list.append in async workers to preserve input case ordering - Pass (index, case) tuples through queues and pre-allocate result lists - Switch logger calls from f-strings to lazy % formatting with structured key=value messages - Update type hints to reflect nullable result slots (list[... | None])

afarntrog · 2026-03-20T13:54:54Z

src/strands_evals/experiment.py

-                                    "session.id": case.session_id,
-                                }
+        for index, case in enumerate(self._cases):
+            queue.put_nowait((index, case))


Using the index instead of just appending is needed to maintain order. Previously, order was not guaranteed and you can have the following outcome:

3 cases and 2 workers: Time Worker-0 Worker-1 ──── ──────── ──────── t0 get_nowait() → case_0 t1 await task(case_0)... get_nowait() → case_1 t2 (task_0 still running) await task(case_1)... t3 (task_0 still running) task_1 completes! t4 (task_0 still running) results.append(result_1) ← index 0 in list! t5 task_0 completes! get_nowait() → case_2 t6 results.append(result_0) await task(case_2)... ↑ index 1 in list!

- Fix bug passing instead of to , ensuring correct span context - Simplify to since is already a subclass of - Remove unused variable assignment - Remove stale param from docstring

afarntrog added 2 commits March 18, 2026 15:47

afarntrog temporarily deployed to auto-approve March 20, 2026 13:46 — with GitHub Actions Inactive

afarntrog requested a review from poshinchen March 20, 2026 13:52

afarntrog commented Mar 20, 2026

View reviewed changes

fix: correct eval_span reference and simplify exception handling

ba890e6

- Fix bug passing instead of to , ensuring correct span context - Simplify to since is already a subclass of - Remove unused variable assignment - Remove stale param from docstring

afarntrog temporarily deployed to auto-approve March 20, 2026 16:46 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: separate task execution from evaluation in Experiment with run_tasks/run_evaluators#170

feat: separate task execution from evaluation in Experiment with run_tasks/run_evaluators#170
afarntrog wants to merge 3 commits intostrands-agents:mainfrom
afarntrog:replay_94

afarntrog commented Mar 20, 2026

Uh oh!

afarntrog Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

afarntrog commented Mar 20, 2026

Description

Related Issues

Documentation PR

Type of Change

Testing

Checklist

Uh oh!

afarntrog Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant