feat: add environment state evaluation support by afarntrog · Pull Request #156 · strands-agents/evals

afarntrog · 2026-03-11T12:49:47Z

Description

Add EnvironmentState type and evaluators for assessing agent side effects on external environments (e.g., file systems, databases, APIs).

Add EnvironmentState model with name/state fields
Add expected_environment_state field to Case
Add actual/expected environment state fields to EvaluationData
Add StateEquals deterministic evaluator for exact state matching
Add EnvironmentStateEvaluator (LLM-based) for semantic state evaluation
Include comprehensive tests for all new functionality
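
For orientation, the pieces listed above might look roughly like the following. This is a minimal, self-contained sketch and not the code in this PR: only the names `EnvironmentState`, `name`, and `state` come from the description, while the dataclass layout, the `state_equals` helper, and its return shape are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EnvironmentState:
    """Hypothetical stand-in for the PR's model: a named snapshot of external state."""

    name: str
    state: Any


def state_equals(
    actual: list[EnvironmentState], expected: list[EnvironmentState]
) -> dict[str, bool]:
    """Assumed exact-match semantics: each expected entry needs an actual
    entry with the same name and an equal state."""
    actual_by_name = {s.name: s.state for s in actual}
    missing = object()  # sentinel so a missing name never compares equal
    return {
        e.name: actual_by_name.get(e.name, missing) == e.state for e in expected
    }


expected = [EnvironmentState("test_results", {"exit_code": 0})]
actual = [
    EnvironmentState("test_results", {"exit_code": 0}),
    EnvironmentState("file_system", {"created": ["out.txt"]}),
]
result = state_equals(actual, expected)
print(result)  # {'test_results': True}
```

In this sketch, extra entries in `actual` are ignored; whether the real evaluator treats them as a mismatch is not stated in the description.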

Related Issues

#110

Documentation PR

Type of Change

New feature

Testing

Tested locally, and added comprehensive unit and integration tests.

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


src/strands_evals/evaluators/environment_state_evaluator.py

poshinchen

Do you think environmentState should be a dict instead of a list? The lookup would be O(1), and I believe it is cleaner.

afarntrog · 2026-03-12T17:33:28Z

Do you think environmentState should be a dict instead of a list? The lookup would be O(1), and I believe it is cleaner.

I understand where you're coming from, but I'm not sure how to make it cleaner. If we remove the typing altogether, we lose out on the typing goodness. I'm also not too concerned about O(1) vs. O(n) lookup, because there is a difference between an academic exercise and practical cases. What's the realistic maximum number of env states people will be evaluating? 10? 100? 1000? At those sizes the difference is insignificant (especially in this context, where we are making calls to an LLM). So I'm only open to it if we can also make the experience simpler and cleaner for end users.
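
To make the scale argument concrete, a name lookup over a list is a single linear scan. The helper below is a hypothetical illustration (`find_state` is not part of this PR) and uses plain dictionaries as stand-ins for the typed entries. Even in the worst case at 1000 entries, the scan is at most 1000 string comparisons.

```python
from typing import Any, Optional


def find_state(states: list[dict[str, Any]], name: str) -> Optional[dict[str, Any]]:
    """Return the first entry whose "name" matches, or None. O(n) in len(states)."""
    return next((s for s in states if s["name"] == name), None)


# 1000 entries, roughly the upper bound floated in the comment above.
states = [{"name": f"env_{i}", "state": {"value": i}} for i in range(1000)]

hit = find_state(states, "env_999")  # worst case: the match is the last element
miss = find_state(states, "does_not_exist")
print(hit, miss)
```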

For example, the version below is a dict, but it also requires users to enter the same key twice:

# Before (list)
environment_state=[
    EnvironmentState(name="test_results", state={"exit_code": 0}),
    EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
]

# After (dict[str, EnvironmentState])
environment_state={
    "test_results": EnvironmentState(name="test_results", state={"exit_code": 0}),
    "file_system": EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
}
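
If keyed lookup is ever wanted, one option that avoids typing each name twice is to keep the list form in user code and derive the mapping from it. The sketch below is only an illustration of that idea: it assumes a simple dataclass stand-in for `EnvironmentState`, and the `index_by_name` helper is hypothetical, not something this PR adds.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EnvironmentState:
    """Hypothetical stand-in with the name/state fields used in the thread."""

    name: str
    state: Any


def index_by_name(states: list[EnvironmentState]) -> dict[str, EnvironmentState]:
    """Build a name -> entry mapping, rejecting duplicate names."""
    index: dict[str, EnvironmentState] = {}
    for s in states:
        if s.name in index:
            raise ValueError(f"duplicate environment state name: {s.name!r}")
        index[s.name] = s
    return index


# Users still write the list form; each name appears exactly once.
environment_state = [
    EnvironmentState(name="test_results", state={"exit_code": 0}),
    EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
]
by_name = index_by_name(environment_state)
print(by_name["test_results"].state)  # {'exit_code': 0}
```

Raising on duplicate names is a design choice of this sketch; a plain dict comprehension would silently keep the last entry instead.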

afarntrog requested a review from poshinchen March 11, 2026 12:49

docstring

a375539

poshinchen reviewed Mar 12, 2026

poshinchen requested changes Mar 12, 2026

feat: remove EnvironmentStateEvaluator from codebase

c032f33

Remove EnvironmentStateEvaluator and its associated prompt template. This includes removing it from the public API exports, the prompt templates module, and the default evaluators registry in Experiment.

poshinchen approved these changes Mar 12, 2026

afarntrog merged commit 402e1e6 into strands-agents:main Mar 13, 2026
13 checks passed

agent-of-mkmeral mentioned this pull request Mar 17, 2026

Weekly Strands Digest agent-of-mkmeral/strands-coder#21

Open

afarntrog merged 3 commits into strands-agents:main from afarntrog:environment_state


2 participants
