
Conversation

@nforro (Member) commented Aug 7, 2025

First attempt to use DeepEval for testing an agent.

model.py and _utils.py are taken from https://github.com/i-am-bee/beeai-framework/tree/main/python/eval (they are not part of the installed wheel for some reason).
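
For context, DeepEval custom model wrappers subclass deepeval.models.DeepEvalBaseLLM and implement load_model, generate, a_generate and get_model_name. The snippet below is only a minimal sketch of that interface, not the actual model.py from beeai-framework; the backend call is a hypothetical placeholder.

import asyncio

from deepeval.models import DeepEvalBaseLLM


class SketchLLM(DeepEvalBaseLLM):
    """Illustrative DeepEval custom model wrapper (not the real DeepEvalLLM)."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def load_model(self):
        # Return the object that actually talks to the LLM backend.
        return self

    def generate(self, prompt: str) -> str:
        # Hypothetical placeholder; a real wrapper would send the prompt to the
        # configured chat model and return its text response.
        return f"placeholder response for: {prompt[:40]}"

    async def a_generate(self, prompt: str) -> str:
        # Async variant that DeepEval calls during evaluation.
        return await asyncio.to_thread(self.generate, prompt)

    def get_model_name(self) -> str:
        return self.model_name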

This is the important part of the code (the test itself):

@pytest.mark.asyncio
async def test_triage():
    setup_observability(os.getenv("COLLECTOR_ENDPOINT"))
    dataset = await create_dataset(
        name="Triage",
        agent_factory=lambda: TriageAgent(),
        agent_run=run_agent,
        goldens=[
            Golden(
                input=InputSchema(issue="RHEL-12345").model_dump_json(),
                expected_output=OutputSchema(
                    resolution=Resolution.NO_ACTION,
                    data=NoActionData(reasoning="The issue is not a fixable bug", jira_issue="RHEL-12345"),
                ).model_dump_json(),
                expected_tools=[
                    # ToolCall(
                    #     name="get_jira_details",
                    #     reasoning="TODO",
                    #     input={"issue_key": "RHEL-12345"},
                    #     output="TODO",
                    # ),
                ],
            )
        ],
    )
    correctness_metric = GEval(
        name="Correctness",
        criteria="\n - ".join(
            [
                "Reasoning must be factually equal to the expected one",
                "`jira_issue` in the output must match `issue` in the input",
            ]
        ),
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
            LLMTestCaseParams.TOOLS_CALLED,
            LLMTestCaseParams.EXPECTED_TOOLS,
        ],
        verbose_mode=True,
        model=DeepEvalLLM.from_name(os.getenv("CHAT_MODEL")),
        threshold=0.65,
    )
    metrics: list[BaseMetric] = [correctness_metric]
    evaluate_dataset(dataset, metrics)

To execute:

  • first run the container: make run-beeai-bash
  • then execute the test: PYTHONPATH=$(pwd)/agents deepeval test run agents/tests/test_triage_agent.py

Example output:

✓ Evaluation completed 🎉! (time taken: 17.75s | token cost: None USD)
» Test Results (1 total tests):
   » Pass Rate: 100.0% | Passed: 1 | Failed: 0

 ================================================================================ 

» Want to share evals with your team, or a place for your test cases to live? ❤️ 🏡
  » Run 'deepeval view' to analyze and save testing results on Confident AI.


   Test Results Summary    
┏━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Total ┃ Passed ┃ Failed ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│     1 │      1 │      0 │
└───────┴────────┴────────┘
╭───────────────────────────────────────────────── Triage - {"issue":"RHEL-12345"} ──────────────────────────────────────────────────╮
│ Input            {"issue":"RHEL-12345"}                                                                                            │
│ Expected Output  {"resolution":"no-action","data":{"reasoning":"The issue is not a fixable bug","jira_issue":"RHEL-12345"}}        │
│ Actual Output    {"resolution":"no-action","data":{"reasoning":"The issue is a feature request to add support for a new device to  │
│                  the HCK-CI infrastructure. This is not a bug or CVE, and therefore no code fix is                                 │
│                  required.","jira_issue":"RHEL-12345"}}                                                                            │
│                                                              Metrics                                                               │
│ ┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ │
│ ┃ Metric              ┃ Success ┃ Score ┃ Threshold ┃ Reason                                                             ┃ Error ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ │
│ │ Correctness [GEval] │ True    │ 1.0   │ 0.65      │ The model correctly identified the `issue` from the input and      │       │ │
│ │                     │         │       │           │ placed it in the `jira_issue` field of the output. The reasoning   │       │ │
│ │                     │         │       │           │ provided is logical and well-supported by the analysis shown in    │       │ │
│ │                     │         │       │           │ the tool calls, which correctly determined that the issue          │       │ │
│ │                     │         │       │           │ "RHEL-12345" is a feature request for CI infrastructure and not a  │       │ │
│ │                     │         │       │           │ bug requiring a code fix.                                          │       │ │
│ └─────────────────────┴─────────┴───────┴───────────┴────────────────────────────────────────────────────────────────────┴───────┘ │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
.Running teardown with pytest sessionfinish...

======================================================== slowest 10 durations ========================================================
72.43s call     agents/tests/test_triage_agent.py::test_triage

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
1 passed, 2 warnings in 72.58s (0:01:12)

Signed-off-by: Nikola Forró <[email protected]>

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces testing for the triage agent using the DeepEval framework. It adds new test files, utility functions, and a model wrapper for DeepEval. The existing agent classes are also updated to capture raw responses for evaluation purposes.

Overall, the changes are a great step towards robust testing. However, I've identified a few critical issues in the new test implementation (test_triage_agent.py) that need to be addressed for correctness and robustness. Specifically, there's a potential for a crash if the agent doesn't produce a response, and the logic for collecting tool calls is flawed. I've also pointed out a type violation in the DeepEvalLLM wrapper. My review includes specific suggestions to resolve these issues. Once these are fixed, the PR should be in good shape.

Comment on lines +26 to +28
    response = agent.last_raw_response
    test_case.tools_called = []
    test_case.actual_output = response.answer.text

critical

There's a potential AttributeError here. If the agent.run_with_schema call fails or doesn't set last_raw_response for any reason, agent.last_raw_response will be None. Accessing response.answer.text would then raise an exception. It's crucial to add a check to ensure response is not None before proceeding to make the test harness more robust.

    response = agent.last_raw_response
    if response is None:
        pytest.fail("Agent run did not yield a response.")
    test_case.tools_called = []
    test_case.actual_output = response.answer.text

Comment on lines +29 to +47
    for index, step in enumerate(response.state.steps):
        if not step.tool:
            continue
        prev_step = response.state.steps[index - 1] if index > 0 else None
        test_case.tools_called = [
            ToolCall(
                name=step.tool.name,
                description=step.tool.description,
                input_parameters=step.input,
                output=step.output.get_text_content(),
                reasoning=(
                    to_json(prev_step.input, indent=2, sort_keys=False)
                    if prev_step and isinstance(prev_step.tool, ThinkTool)
                    else None
                ),
            )
            for step in response.state.steps
            if step.tool and not isinstance(step.tool, FinalAnswerTool)
        ]

critical

The current implementation for populating test_case.tools_called is incorrect and inefficient. The list comprehension is inside a for loop, which causes test_case.tools_called to be completely recalculated and overwritten in each iteration. This means it will only contain data based on the final state of the loop, and it performs redundant computations.

The logic should be to initialize an empty list before the loop, and then append a ToolCall object for each relevant step within the loop.

    if response and response.state and response.state.steps:
        tools_called = []
        for index, step in enumerate(response.state.steps):
            if not step.tool or isinstance(step.tool, FinalAnswerTool):
                continue

            prev_step = response.state.steps[index - 1] if index > 0 else None
            reasoning = (
                to_json(prev_step.input, indent=2, sort_keys=False)
                if prev_step and isinstance(prev_step.tool, ThinkTool)
                else None
            )
            tools_called.append(
                ToolCall(
                    name=step.tool.name,
                    description=step.tool.description,
                    input_parameters=step.input,
                    output=step.output.get_text_content(),
                    reasoning=reasoning,
                )
            )
        test_case.tools_called = tools_called

            )
        )
        text = response.get_text_content()
        return schema.model_validate_json(text) if schema else text  # type: ignore

high

The a_generate method in the base class DeepEvalBaseLLM is typed to return a str. This implementation, however, returns a Pydantic model object when a schema is provided, which violates the Liskov substitution principle. The # type: ignore comment is hiding this type mismatch.

To conform to the base class's method signature and ensure type safety, you should serialize the Pydantic model into a JSON string before returning it.

Suggested change
return schema.model_validate_json(text) if schema else text # type: ignore
return schema.model_validate_json(text).model_dump_json() if schema else text

    if failed:
        pytest.fail(f"{failed}/{total} tests failed. See the summary table above for more details.", pytrace=False)
    else:
        assert 1 == 1

medium

The use of assert 1 == 1 is unconventional for indicating a successful test completion. It works, but it's not very expressive. A more standard and clearer way to indicate that a test has passed successfully (when no exceptions were raised) is to simply have no assertions at the end, or use pass. Since pytest.fail is used for the failure case, the function will implicitly pass if it completes without hitting that line.

Suggested change
assert 1 == 1
pass

@TomasTomecek (Member) commented:

wow, this is very promising!
I guess we could create a few jiras for several use cases and run those with deepeval
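
A sketch of what such additional use cases could look like as extra goldens, reusing the schema classes from the test above (the second issue key and its expected reasoning are made-up placeholders):

# Imports as in agents/tests/test_triage_agent.py above: Golden comes from
# deepeval, the schema classes from the triage agent.
goldens = [
    Golden(
        input=InputSchema(issue=issue).model_dump_json(),
        expected_output=OutputSchema(
            resolution=Resolution.NO_ACTION,
            data=NoActionData(reasoning=reasoning, jira_issue=issue),
        ).model_dump_json(),
    )
    # Placeholder (issue, reasoning) pairs; real goldens would come from curated
    # Jira issues covering the different triage resolutions.
    for issue, reasoning in [
        ("RHEL-12345", "The issue is not a fixable bug"),
        ("RHEL-67890", "The issue is a feature request, not a fixable bug"),
    ]
]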

@martinhoyer commented:

This is great!

My only "concern", as an outsider, is further committing to beeai components. I mean, from the overall design/architecture PoV, going forward.
Langchain eval docs for reference: https://docs.smith.langchain.com/evaluation/concepts

@nforro (Member, Author) commented Aug 13, 2025

> My only "concern", as an outsider, is further committing to beeai components. I mean, from the overall design/architecture PoV, going forward.
> Langchain eval docs for reference: https://docs.smith.langchain.com/evaluation/concepts

I believe we could choose to do this with the LangSmith SDK instead; it would probably take more time, since BeeAI already has some wrappers and utility code for DeepEval, but I don't think it would be a huge deal.
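
For comparison, a rough sketch of what an equivalent check might look like with the LangSmith SDK; the dataset name, the target wrapper and the evaluator below are illustrative only (see the docs linked above for the actual API):

from langsmith import evaluate


def run_triage_agent(issue: str) -> str:
    # Hypothetical stand-in for actually invoking the TriageAgent.
    return "{}"


def correctness(run, example):
    # Compare the agent's actual output with the golden's expected output.
    actual = (run.outputs or {}).get("output")
    expected = (example.outputs or {}).get("expected_output")
    return {"key": "correctness", "score": 1.0 if actual == expected else 0.0}


results = evaluate(
    lambda inputs: {"output": run_triage_agent(inputs["issue"])},
    data="triage-goldens",  # hypothetical LangSmith dataset name
    evaluators=[correctness],
)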
