Skip to content

feat: separate task execution from evaluation in Experiment with run_tasks/run_evaluators#170

Open
afarntrog wants to merge 3 commits intostrands-agents:mainfrom
afarntrog:replay_94
Open

feat: separate task execution from evaluation in Experiment with run_tasks/run_evaluators#170
afarntrog wants to merge 3 commits intostrands-agents:mainfrom
afarntrog:replay_94

Conversation

@afarntrog
Copy link
Contributor

Description

Refactor the Experiment class to separate task execution from evaluation into independent, composable steps, and fix async result ordering.

  • Add run_tasks() and run_tasks_async() methods that execute tasks against cases and return EvaluationData without running evaluators
  • Add run_evaluators() and run_evaluators_async() methods that evaluate pre-computed EvaluationData without executing tasks, enabling reuse of task results across multiple evaluation runs
  • Rewrite run_evaluations() and run_evaluations_async() as convenience wrappers that compose the two steps
  • Preserve result ordering in async workers by using index-based placement into a pre-allocated list instead of appending to a shared list
  • Switch from asyncio.iscoroutinefunction() to inspect.iscoroutinefunction() for more reliable coroutine detection
  • Extract shared logic into helper methods (_build_reports, _init_evaluator_data, _log_evaluation_to_cloudwatch, _set_task_span_attributes, _extract_task_error, _build_failed_evaluation_data)
  • Failed task executions now produce EvaluationData with actual_output=None and the error in metadata["task_error"] instead of raising
  • Always include session_id in evaluation metadata

Related Issues

#94

Documentation PR

Type of Change

New feature

Testing

Added 15 new tests in tests/strands_evals/test_experiment.py covering:

  • run_tasks: basic output, dict output with trajectory/interactions, task failure handling
  • run_evaluators: pre-computed data evaluation, evaluator error isolation, multiple evaluators
  • Roundtrip: run_tasksrun_evaluators pipeline, serialization/deserialization roundtrip
  • Async variants: run_tasks_async with sync and async tasks, failure handling, run_evaluators_async with pre-computed data and evaluator failures

Updated existing tests to reflect span naming changes (run_task/eval_case instead of execute_case) and the inclusion of session_id in metadata.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…CloudWatch observability

- Extract helper methods (_build_reports, _init_evaluator_data, _log_evaluation_to_cloudwatch)
- Split monolithic run into separate run_tasks() and run_evaluators() methods
- Add run_evaluations() as convenience method combining both steps
- Add CloudWatch logging support for evaluation results when observability is enabled
- Support parallel execution of tasks and evaluators via threading
…oroutine detection

- Replace asyncio.iscoroutinefunction with inspect.iscoroutinefunction
  for more reliable async function detection
- Use index-based result assignment instead of list.append in async
  workers to preserve input case ordering
- Pass (index, case) tuples through queues and pre-allocate result lists
- Switch logger calls from f-strings to lazy % formatting with
  structured key=value messages
- Update type hints to reflect nullable result slots (list[... | None])
"session.id": case.session_id,
}
for index, case in enumerate(self._cases):
queue.put_nowait((index, case))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the index instead of just appending is needed to maintain order. Previously, order was not guaranteed and you can have the following outcome:

3 cases and 2 workers:                                        
  
  Time    Worker-0                    Worker-1                                                                                             
  ────    ────────                    ────────                    
  t0      get_nowait() → case_0                                                                                                            
  t1      await task(case_0)...       get_nowait() → case_1                                                                                
  t2      (task_0 still running)      await task(case_1)...                                                                                
  t3      (task_0 still running)      task_1 completes!                                                                                    
  t4      (task_0 still running)      results.append(result_1)  ← index 0 in list!                                                         
  t5      task_0 completes!           get_nowait() → case_2                                                                                
  t6      results.append(result_0)    await task(case_2)...                                                                                
          ↑ index 1 in list!                                                                                                               

- Fix bug passing  instead of  to
  , ensuring correct span context
- Simplify  to
  since  is already a subclass of
- Remove unused  variable assignment
- Remove stale  param from docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant