feat: add TaskResultStore for caching and replaying task execution results by afarntrog · Pull Request #176 · strands-agents/evals

afarntrog · 2026-03-23T17:19:44Z

Description

Add a TaskResultStore protocol for caching task execution results across experiment runs. This makes it possible to skip expensive task re-execution (e.g., LLM calls) when iterating on evaluators, by loading cached EvaluationData from a pluggable storage backend. In addition, refactor the multi-hundred-line _worker method.

Introduce TaskResultStore protocol with load/save methods that implementations can back with any storage (files, S3, databases, etc.)
Accept an optional task_result_store parameter in run_evaluations and run_evaluations_async
When a store is provided, load cached results before executing the task and save results after execution
Validate that all cases have unique, non-None names when a store is provided
Refactor _worker by extracting _execute_task and _run_evaluator into standalone methods for readability
Export TaskResultStore from the top-level strands_evals package
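
The load-before-execute flow and the name validation listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `load`/`save` signatures, the use of the case name as the cache key, the `EvaluationData` stand-in dataclass and its fields, and the helper names `validate_case_names` and `run_case_with_store` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Protocol


@dataclass
class EvaluationData:
    """Stand-in for the library's EvaluationData (fields are assumed)."""

    input: Any
    actual_output: Any


class TaskResultStore(Protocol):
    """Hypothetical shape of the protocol: one load and one save method."""

    def load(self, case_name: str) -> Optional[EvaluationData]: ...

    def save(self, case_name: str, data: EvaluationData) -> None: ...


def validate_case_names(names: List[Optional[str]]) -> None:
    """Reject missing or duplicate names, since names act as cache keys here."""
    if any(name is None for name in names):
        raise ValueError("All cases must have a name when a store is provided")
    if len(set(names)) != len(names):
        raise ValueError("Case names must be unique when a store is provided")


def run_case_with_store(
    case_name: str,
    case_input: Any,
    task: Callable[[Any], Any],
    store: Optional[TaskResultStore] = None,
) -> EvaluationData:
    """Load a cached result if present; otherwise run the task and save it."""
    if store is not None:
        cached = store.load(case_name)
        if cached is not None:
            return cached  # cache hit: the task is not executed
    data = EvaluationData(input=case_input, actual_output=task(case_input))
    if store is not None:
        store.save(case_name, data)
    return data
```

Because the protocol is structural, any object with matching `load` and `save` methods would satisfy it in this sketch, whether it is backed by files, S3, or a database.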

Related Issues

#94

Documentation PR

Type of Change

New feature

Testing

Added 5 unit tests in tests/strands_evals/test_experiment.py under a new TestTaskResultStore class covering:

Task execution and result saving on first run with an empty store
Cached result loading on subsequent runs (task not called)
Validation that cases must have names when a store is provided
Validation that case names must be unique when a store is provided
Default behavior without a store remains unchanged

Uses a DictTaskResultStore in-memory implementation for test isolation.
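
A dict-backed store of this kind might look like the sketch below. The PR text only names `DictTaskResultStore`, so the class body, the `load`/`save` signatures, and the `run_once` helper (a stand-in for the experiment's load-then-execute step) are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional


class DictTaskResultStore:
    """In-memory store keyed by case name (illustrative; signatures assumed)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def load(self, case_name: str) -> Optional[Any]:
        return self._results.get(case_name)

    def save(self, case_name: str, result: Any) -> None:
        self._results[case_name] = result


def run_once(store: DictTaskResultStore, case_name: str, task: Callable[[], Any]) -> Any:
    """Hypothetical stand-in for the experiment's load-then-execute step."""
    cached = store.load(case_name)
    if cached is not None:
        return cached
    result = task()
    store.save(case_name, result)
    return result


task_calls: List[str] = []  # records each real execution of the task


def expensive_task() -> str:
    task_calls.append("called")
    return "answer"


store = DictTaskResultStore()
first = run_once(store, "case-1", expensive_task)   # executes the task and saves
second = run_once(store, "case-1", expensive_task)  # served from the store
```

Counting task invocations, as `task_calls` does here, is one simple way to check that a second run is served from the store rather than re-executing the task.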

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Introduce TaskResultStore abstraction to enable caching and reuse of task results across experiment runs. Refactor Experiment class to extract task execution logic into a dedicated method with retry, tracing, and optional result store integration. Add case name validation to ensure uniqueness when using a result store.

src/strands_evals/task_result_store.py

src/strands_evals/evaluation_data_store.py

src/strands_evals/task_result_store.py

poshinchen

Is the next step implementing the local store (save and load json)?

afarntrog · 2026-03-24T16:05:48Z

Is the next step implementing the local store (save and load json)?

Yes
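
For context, a local JSON-backed store could take a shape like the sketch below. This is not part of the PR and not the planned implementation: the class name `LocalJsonStore`, the one-file-per-case layout, and the use of plain dicts as payloads are all assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


class LocalJsonStore:
    """Hypothetical file-backed store: one JSON file per case name."""

    def __init__(self, directory: str) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, case_name: str) -> Path:
        # Assumes case names are safe to use as file names.
        return self._dir / f"{case_name}.json"

    def load(self, case_name: str) -> Optional[Dict[str, Any]]:
        path = self._path(case_name)
        if not path.exists():
            return None  # cache miss
        return json.loads(path.read_text(encoding="utf-8"))

    def save(self, case_name: str, data: Dict[str, Any]) -> None:
        self._path(case_name).write_text(json.dumps(data), encoding="utf-8")
```

A real implementation would also need to serialize the library's evaluation data type and sanitize case names before using them in paths.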

Rename the TaskResultStore class and module to EvaluationDataStore to better reflect its purpose of storing evaluation data rather than just task results. Update all references, including the module file, class name, parameter names, docstrings, error messages, and tests.

afarntrog added 2 commits March 23, 2026 12:46

afarntrog temporarily deployed to auto-approve March 23, 2026 17:19 — with GitHub Actions Inactive

afarntrog changed the title from "Task result store" to "feat: Add TaskResultStore for caching and replaying task execution results" Mar 23, 2026

afarntrog changed the title from "feat: Add TaskResultStore for caching and replaying task execution results" to "feat: add TaskResultStore for caching and replaying task execution results" Mar 23, 2026

poshinchen reviewed Mar 24, 2026

src/strands_evals/task_result_store.py (outdated, resolved)

poshinchen reviewed Mar 24, 2026

src/strands_evals/evaluation_data_store.py (resolved)

poshinchen reviewed Mar 24, 2026

src/strands_evals/task_result_store.py (outdated, resolved)

poshinchen reviewed Mar 24, 2026

afarntrog temporarily deployed to auto-approve March 24, 2026 16:22 — with GitHub Actions Inactive

poshinchen previously approved these changes Mar 24, 2026

afarntrog dismissed poshinchen’s stale review via 4b60e2d March 24, 2026 18:36

afarntrog deployed to auto-approve March 24, 2026 18:37 — with GitHub Actions Active

afarntrog enabled auto-merge (squash) March 24, 2026 20:52

poshinchen approved these changes Mar 24, 2026

afarntrog merged commit dae3c55 into strands-agents:main Mar 24, 2026
22 of 23 checks passed

afarntrog merged 4 commits into strands-agents:main from afarntrog:task_result_store

2 participants
