feat: add environment state evaluation support #156

Merged
afarntrog merged 3 commits into strands-agents:main from afarntrog:environment_state
Mar 13, 2026

@afarntrog
Contributor

Description

Add EnvironmentState type and evaluators for assessing agent side effects on external environments (e.g., file systems, databases, APIs).

  • Add EnvironmentState model with name/state fields
  • Add expected_environment_state field to Case
  • Add actual/expected environment state fields to EvaluationData
  • Add StateEquals deterministic evaluator for exact state matching
  • Add EnvironmentStateEvaluator (LLM-based) for semantic state evaluation
  • Include comprehensive tests for all new functionality
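
As a rough illustration of the first and fourth bullets, here is a minimal sketch of a name/state model and an exact-match check. It is an assumption-laden stand-in, not the PR's code: the dataclass below only mirrors the `name`/`state` fields named above, and `state_equals` is a hypothetical helper whose signature is not taken from the library's `StateEquals` evaluator.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EnvironmentState:
    # Hypothetical stand-in for the model described above: a label plus
    # an arbitrary payload describing that part of the environment.
    name: str
    state: Any


def state_equals(actual, expected, name):
    """Return True only if the named state exists on both sides and matches exactly."""
    actual_by_name = {s.name: s.state for s in actual}
    expected_by_name = {s.name: s.state for s in expected}
    if name not in actual_by_name or name not in expected_by_name:
        return False
    return actual_by_name[name] == expected_by_name[name]


expected = [EnvironmentState(name="test_results", state={"exit_code": 0})]
passing = [EnvironmentState(name="test_results", state={"exit_code": 0})]
failing = [EnvironmentState(name="test_results", state={"exit_code": 1})]

print(state_equals(passing, expected, "test_results"))
print(state_equals(failing, expected, "test_results"))
```

The comparison is deliberately strict: any difference in the payload, or a missing name on either side, counts as a mismatch.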

Related Issues

#110

Documentation PR

Type of Change

New feature

Testing

Tested locally and also added comprehensive unit and integration tests.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@poshinchen left a comment

Do you think the environment state should be a dict instead of a list? Lookup would be O(1), and I believe it would be cleaner.

@afarntrog
Contributor Author

> Do you think the environment state should be a dict instead of a list? Lookup would be O(1), and I believe it would be cleaner.

I understand where you're coming from, but I'm not sure how to make it cleaner. If we remove the typing altogether, we lose the benefits of type checking. I'm also not too concerned about O(1) vs. O(n) lookup: the difference is academic here, not practical. What's the realistic maximum number of environment states people will evaluate? 10? 100? 1000? At any of those sizes the cost is insignificant, especially in a context where we are making calls to an LLM. So I'm only open to the change if it also makes the experience simpler and cleaner for end users.

For example, the version below uses a dict, but it also requires users to enter the same key twice:

# Before (list)
environment_state=[
    EnvironmentState(name="test_results", state={"exit_code": 0}),
    EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
]

# After (dict[str, EnvironmentState])
environment_state={
    "test_results": EnvironmentState(name="test_results", state={"exit_code": 0}),
    "file_system": EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
}
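
One possible middle ground, sketched here as an assumption and not something this PR implements, is to keep the list as the user-facing shape and build a name index internally when lookups are needed. `index_by_name` is a hypothetical helper, and the dataclass is a stand-in for the real model.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EnvironmentState:
    # Hypothetical stand-in for the real model.
    name: str
    state: Any


def index_by_name(states):
    """Build a dict keyed by each state's name, rejecting duplicate names."""
    index = {}
    for s in states:
        if s.name in index:
            raise ValueError(f"duplicate environment state name: {s.name!r}")
        index[s.name] = s
    return index


environment_state = [
    EnvironmentState(name="test_results", state={"exit_code": 0}),
    EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
]

by_name = index_by_name(environment_state)
print(by_name["file_system"].state)
```

Users write each name once, and the index is built in a single pass over the list.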

Remove EnvironmentStateEvaluator and its associated prompt template.
This includes removing it from the public API exports, the prompt
templates module, and the default evaluators registry in Experiment.
@afarntrog afarntrog merged commit 402e1e6 into strands-agents:main Mar 13, 2026
13 checks passed