feat: add environment state evaluation support #156
afarntrog merged 3 commits into strands-agents:main
Add EnvironmentState type and evaluators for assessing agent side effects on external environments (e.g., file systems, databases, APIs).

- Add EnvironmentState model with name/state fields
- Add expected_environment_state field to Case
- Add actual/expected environment state fields to EvaluationData
- Add StateEquals deterministic evaluator for exact state matching
- Add EnvironmentStateEvaluator (LLM-based) for semantic state evaluation
- Include comprehensive tests for all new functionality
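To make the summary above concrete, here is a minimal sketch of what a named environment state and an exact-match check could look like. Only the `EnvironmentState` name and its `name`/`state` fields come from this PR; the dataclass representation and the `find_state` and `state_equals` helpers are assumptions for illustration and are not the library's actual `StateEquals` implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class EnvironmentState:
    """Named snapshot of an external environment (fields per the PR description)."""

    name: str
    state: Any


def find_state(states: list[EnvironmentState], name: str) -> EnvironmentState | None:
    """Hypothetical helper: linear scan for the state with the given name."""
    for candidate in states:
        if candidate.name == name:
            return candidate
    return None


def state_equals(actual: list[EnvironmentState], expected: list[EnvironmentState]) -> bool:
    """Hypothetical exact-match check.

    Passes only if every expected state has an actual state with the
    same name and an equal `state` value. Extra actual states are ignored.
    """
    for exp in expected:
        act = find_state(actual, exp.name)
        if act is None or act.state != exp.state:
            return False
    return True


expected = [EnvironmentState(name="test_results", state={"exit_code": 0})]
actual = [
    EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
    EnvironmentState(name="test_results", state={"exit_code": 0}),
]
passed = state_equals(actual, expected)
```

Whether unexpected extra states should fail the check is a design choice; this sketch ignores them, and the real evaluator may behave differently.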
poshinchen left a comment
Do you think the environment state should be a dict instead of a list? Lookup would be O(1), and I believe it would be cleaner.
I understand where you're coming from. However, I'm not sure how to make it cleaner. If we remove the typing altogether, we lose the typing goodness. I'm also not too concerned about O(1) vs. O(n) lookup: there is the academic exercise, and there are the practical cases. What's the maximum realistic number of environment states people will evaluate? 10? 100? 1000? The difference is insignificant at any of those sizes, especially in this context, where we are making calls to an LLM. So I'm only open to it if we can also make the experience simpler and cleaner for end users. For example, the version below uses a dict but requires users to enter the same name twice:

# Before (list)
environment_state=[
    EnvironmentState(name="test_results", state={"exit_code": 0}),
    EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
]
# After (dict[str, EnvironmentState])
environment_state={
    "test_results": EnvironmentState(name="test_results", state={"exit_code": 0}),
    "file_system": EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
}
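As an illustration of the trade-off discussed in this thread, the sketch below keeps the list-based input (so each name is written once) and builds a name-keyed index when repeated lookups are needed. This is not part of the PR: the `index_by_name` helper and the dataclass stand-in for `EnvironmentState` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class EnvironmentState:
    """Stand-in for the PR's model; only the name/state fields are from the source."""

    name: str
    state: Any


def index_by_name(states: list[EnvironmentState]) -> dict[str, EnvironmentState]:
    """Hypothetical helper: build a dict index so later lookups are O(1).

    Raises ValueError on duplicate names, which a plain list would silently allow.
    """
    index: dict[str, EnvironmentState] = {}
    for item in states:
        if item.name in index:
            raise ValueError(f"duplicate environment state name: {item.name!r}")
        index[item.name] = item
    return index


# Users still write each name once, as in the list form.
environment_state = [
    EnvironmentState(name="test_results", state={"exit_code": 0}),
    EnvironmentState(name="file_system", state={"created": ["out.txt"]}),
]
by_name = index_by_name(environment_state)
exit_code = by_name["test_results"].state["exit_code"]
```

Building the index costs one O(n) pass, so it only pays off when the same collection is queried many times; for a handful of states the linear scan is equally practical.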
Remove EnvironmentStateEvaluator and its associated prompt template. This includes removing it from the public API exports, the prompt templates module, and the default evaluators registry in Experiment.
Description
Add EnvironmentState type and evaluators for assessing agent side effects on external environments (e.g., file systems, databases, APIs).
Related Issues
#110
Documentation PR
Type of Change
New feature
Testing
Tested locally and added comprehensive unit and integration tests.
hatch run prepare
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.