Test triage agent #79
Conversation
Signed-off-by: Nikola Forró <[email protected]>
Code Review
This pull request introduces testing for the triage agent using the DeepEval framework. It adds new test files, utility functions, and a model wrapper for DeepEval. The existing agent classes are also updated to capture raw responses for evaluation purposes.
Overall, the changes are a great step towards robust testing. However, I've identified a few critical issues in the new test implementation (test_triage_agent.py) that need to be addressed for correctness and robustness. Specifically, there's a potential for a crash if the agent doesn't produce a response, and the logic for collecting tool calls is flawed. I've also pointed out a type violation in the DeepEvalLLM wrapper. My review includes specific suggestions to resolve these issues. Once these are fixed, the PR should be in good shape.
```python
response = agent.last_raw_response
test_case.tools_called = []
test_case.actual_output = response.answer.text
```
There's a potential AttributeError here. If the agent.run_with_schema call fails or doesn't set last_raw_response for any reason, agent.last_raw_response will be None. Accessing response.answer.text would then raise an exception. It's crucial to add a check to ensure response is not None before proceeding to make the test harness more robust.
```python
response = agent.last_raw_response
if response is None:
    pytest.fail("Agent run did not yield a response.")
test_case.tools_called = []
test_case.actual_output = response.answer.text
```

```python
for index, step in enumerate(response.state.steps):
    if not step.tool:
        continue
    prev_step = response.state.steps[index - 1] if index > 0 else None
    test_case.tools_called = [
        ToolCall(
            name=step.tool.name,
            description=step.tool.description,
            input_parameters=step.input,
            output=step.output.get_text_content(),
            reasoning=(
                to_json(prev_step.input, indent=2, sort_keys=False)
                if prev_step and isinstance(prev_step.tool, ThinkTool)
                else None
            ),
        )
        for step in response.state.steps
        if step.tool and not isinstance(step.tool, FinalAnswerTool)
    ]
```
The current implementation for populating test_case.tools_called is incorrect and inefficient. The list comprehension is inside a for loop, which causes test_case.tools_called to be completely recalculated and overwritten in each iteration. This means it will only contain data based on the final state of the loop, and it performs redundant computations.
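A stripped-down illustration of the effect (the list and names below are hypothetical, not from this PR): rebuilding the comprehension on every pass means only the last iteration's per-step context survives in the final list.

```python
items = ["read", "search", "answer"]

annotated = []
for index, item in enumerate(items):
    prefix = f"step-{index}"                      # per-iteration context (like prev_step)
    annotated = [f"{prefix}:{i}" for i in items]  # rebuilds and overwrites the list each pass

print(annotated)
# ['step-2:read', 'step-2:search', 'step-2:answer'] -- every entry carries the context of the
# last iteration only, mirroring how every ToolCall would end up with the same reasoning.
```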
The logic should be to initialize an empty list before the loop, and then append a ToolCall object for each relevant step within the loop.
```python
if response and response.state and response.state.steps:
    tools_called = []
    for index, step in enumerate(response.state.steps):
        if not step.tool or isinstance(step.tool, FinalAnswerTool):
            continue
        prev_step = response.state.steps[index - 1] if index > 0 else None
        reasoning = (
            to_json(prev_step.input, indent=2, sort_keys=False)
            if prev_step and isinstance(prev_step.tool, ThinkTool)
            else None
        )
        tools_called.append(
            ToolCall(
                name=step.tool.name,
                description=step.tool.description,
                input_parameters=step.input,
                output=step.output.get_text_content(),
                reasoning=reasoning,
            )
        )
    test_case.tools_called = tools_called
```

```python
            )
        )
        text = response.get_text_content()
        return schema.model_validate_json(text) if schema else text  # type: ignore
```
The a_generate method in the base class DeepEvalBaseLLM is typed to return a str. This implementation, however, returns a Pydantic model object when a schema is provided, which violates the Liskov substitution principle. The # type: ignore comment is hiding this type mismatch.
To conform to the base class's method signature and ensure type safety, you should serialize the Pydantic model into a JSON string before returning it.
```diff
-        return schema.model_validate_json(text) if schema else text  # type: ignore
+        return schema.model_validate_json(text).model_dump_json() if schema else text
```
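To make the suggested round trip concrete, here is a minimal, self-contained sketch; the Verdict model and the sample JSON are hypothetical stand-ins for whatever schema the caller passes:

```python
from pydantic import BaseModel


class Verdict(BaseModel):
    # Hypothetical schema, standing in for the `schema` argument the wrapper receives.
    resolution: str
    reason: str


text = '{"resolution": "duplicate", "reason": "matches an existing report"}'

# Validate the raw LLM output against the schema, then serialize it back to JSON
# so the method still returns a str, matching the base class signature.
validated = Verdict.model_validate_json(text)
result: str = validated.model_dump_json()
assert isinstance(result, str)
```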
```python
if failed:
    pytest.fail(f"{failed}/{total} tests failed. See the summary table above for more details.", pytrace=False)
else:
    assert 1 == 1
```
The use of assert 1 == 1 is unconventional for indicating a successful test completion. It works, but it's not very expressive. A more standard and clearer way to indicate that a test has passed successfully (when no exceptions were raised) is to simply have no assertions at the end, or use pass. Since pytest.fail is used for the failure case, the function will implicitly pass if it completes without hitting that line.
```diff
-    assert 1 == 1
+    pass
```
wow, this is very promising!

This is great! My only "concern", as an outsider, is further committing to BeeAI components, from the overall design/architecture point of view going forward.

I believe we could choose to do this with the LangSmith SDK; it would probably take more time, since BeeAI already has some wrappers and utility code for DeepEval, but I don't think it would be a huge deal.
First attempt to use DeepEval for testing an agent.
model.py and _utils.py are taken from https://github.com/i-am-bee/beeai-framework/tree/main/python/eval (they are not part of the installed wheel for some reason). This is the important part of the code (the test itself):
ai-workflows/beeai/agents/tests/test_triage_agent.py
Lines 50 to 97 in 1f877ef
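The embedded snippet itself is not reproduced here, so below is a rough, hypothetical sketch of the pattern the review discusses: an LLMTestCase populated from the agent's raw response and checked with assert_test. The fixture, issue text, metric choice, and the run_with_schema call shape are assumptions, not the PR's actual code.

```python
import pytest
from deepeval import assert_test
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall


def test_triage_agent_sketch(triage_agent):  # the `triage_agent` fixture is assumed, not from the PR
    issue = "Package foo fails to build in the latest Fedora Rawhide compose"

    # The review references agent.run_with_schema and agent.last_raw_response;
    # the exact signatures are not shown in this excerpt.
    triage_agent.run_with_schema(issue)
    response = triage_agent.last_raw_response
    if response is None:
        pytest.fail("Agent run did not yield a response.")

    test_case = LLMTestCase(
        input=issue,
        actual_output=response.answer.text,
        tools_called=[
            ToolCall(name=step.tool.name, input_parameters=step.input)
            for step in response.state.steps
            if step.tool
        ],
        expected_tools=[ToolCall(name="search_bugzilla")],  # illustrative expectation only
    )
    assert_test(test_case, [ToolCorrectnessMetric()])
```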
To execute:
```shell
make run-beeai-bash
PYTHONPATH=$(pwd)/agents deepeval test run agents/tests/test_triage_agent.py
```

Example output: