feat: separate task execution from evaluation in Experiment with run_tasks/run_evaluators #170

Open

afarntrog wants to merge 3 commits into strands-agents:main from
Conversation
…CloudWatch observability

- Extract helper methods (_build_reports, _init_evaluator_data, _log_evaluation_to_cloudwatch)
- Split monolithic run into separate run_tasks() and run_evaluators() methods
- Add run_evaluations() as convenience method combining both steps
- Add CloudWatch logging support for evaluation results when observability is enabled
- Support parallel execution of tasks and evaluators via threading
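The split described in this commit can be illustrated with a toy sketch. This is not the actual Experiment implementation; MiniExperiment, EvalData, and their fields are illustrative stand-ins for the pattern (execute tasks once, evaluate pre-computed results, compose both as a convenience):

```python
from dataclasses import dataclass, field

@dataclass
class EvalData:
    case: int
    actual_output: object = None
    metadata: dict = field(default_factory=dict)

class MiniExperiment:
    def __init__(self, task, cases):
        self.task, self.cases = task, cases

    def run_tasks(self):
        # Execute tasks only; capture failures instead of raising.
        data = []
        for case in self.cases:
            try:
                data.append(EvalData(case, self.task(case)))
            except Exception as exc:
                data.append(EvalData(case, None, {"task_error": str(exc)}))
        return data

    def run_evaluators(self, data, evaluators):
        # Evaluate pre-computed data; task results can be reused
        # across multiple evaluator runs without re-executing tasks.
        return [[evaluator(d) for evaluator in evaluators] for d in data]

    def run_evaluations(self, evaluators):
        # Convenience wrapper composing the two steps.
        return self.run_evaluators(self.run_tasks(), evaluators)
```

The design point is that run_tasks produces a plain data artifact, so the same task outputs can be fed to several evaluator sets without paying the task-execution cost again.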
…oroutine detection

- Replace asyncio.iscoroutinefunction with inspect.iscoroutinefunction for more reliable async function detection
- Use index-based result assignment instead of list.append in async workers to preserve input case ordering
- Pass (index, case) tuples through queues and pre-allocate result lists
- Switch logger calls from f-strings to lazy % formatting with structured key=value messages
- Update type hints to reflect nullable result slots (list[... | None])
afarntrog commented Mar 20, 2026
        "session.id": case.session_id,
    }
    for index, case in enumerate(self._cases):
        queue.put_nowait((index, case))
Using the index instead of just appending is needed to maintain order. Previously, order was not guaranteed, and with 3 cases and 2 workers you could get the following outcome:
Time Worker-0 Worker-1
──── ──────── ────────
t0 get_nowait() → case_0
t1 await task(case_0)... get_nowait() → case_1
t2 (task_0 still running) await task(case_1)...
t3 (task_0 still running) task_1 completes!
t4 (task_0 still running) results.append(result_1) ← index 0 in list!
t5 task_0 completes! get_nowait() → case_2
t6 results.append(result_0) await task(case_2)...
↑ index 1 in list!
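The fix above can be sketched as follows: pre-allocate the result list and have each worker write to its case's slot, so completion order no longer matters (run_cases and its parameters are illustrative names, not the Experiment internals):

```python
import asyncio

async def run_cases(task, cases, num_workers=2):
    queue: asyncio.Queue = asyncio.Queue()
    # Pass (index, case) tuples so each result knows its slot.
    for index, case in enumerate(cases):
        queue.put_nowait((index, case))
    # Pre-allocate nullable slots (list[... | None]).
    results: list = [None] * len(cases)

    async def worker():
        while True:
            try:
                index, case = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            # Index-based assignment preserves input order even when
            # a later case finishes before an earlier one.
            results[index] = await task(case)

    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return results
```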
- Fix bug passing instead of to , ensuring correct span context
- Simplify to since is already a subclass of
- Remove unused variable assignment
- Remove stale param from docstring
Description
Refactor the Experiment class to separate task execution from evaluation into independent, composable steps, and fix async result ordering.

- Add run_tasks() and run_tasks_async() methods that execute tasks against cases and return EvaluationData without running evaluators
- Add run_evaluators() and run_evaluators_async() methods that evaluate pre-computed EvaluationData without executing tasks, enabling reuse of task results across multiple evaluation runs
- Add run_evaluations() and run_evaluations_async() as convenience wrappers that compose the two steps
- Switch from asyncio.iscoroutinefunction() to inspect.iscoroutinefunction() for more reliable coroutine detection
- Extract helper methods (_build_reports, _init_evaluator_data, _log_evaluation_to_cloudwatch, _set_task_span_attributes, _extract_task_error, _build_failed_evaluation_data)
- On task failure, return EvaluationData with actual_output=None and the error in metadata["task_error"] instead of raising
- Include session_id in evaluation metadata

Related Issues
#94
Documentation PR
Type of Change
New feature
Testing
Added 15 new tests in tests/strands_evals/test_experiment.py covering:

- run_tasks: basic output, dict output with trajectory/interactions, task failure handling
- run_evaluators: pre-computed data evaluation, evaluator error isolation, multiple evaluators
- run_tasks → run_evaluators pipeline, serialization/deserialization roundtrip
- run_tasks_async with sync and async tasks, failure handling; run_evaluators_async with pre-computed data and evaluator failures

Updated existing tests to reflect span naming changes (run_task/eval_case instead of execute_case) and the inclusion of session_id in metadata.

hatch run prepare

Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.