
Conversation

@nforro (Member) commented Aug 7, 2025

First attempt to use DeepEval for testing an agent.

model.py and _utils.py are taken from https://github.com/i-am-bee/beeai-framework/tree/main/python/eval (they are not part of the installed wheel for some reason).
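
For context, DeepEval custom model wrappers subclass deepeval.models.DeepEvalBaseLLM and implement load_model, generate, a_generate and get_model_name. The snippet below is only a minimal sketch of that interface, not the actual model.py from beeai-framework; the backend call is a hypothetical placeholder.

import asyncio

from deepeval.models import DeepEvalBaseLLM


class SketchLLM(DeepEvalBaseLLM):
    """Illustrative DeepEval custom model wrapper (not the real DeepEvalLLM)."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def load_model(self):
        # Return the object that actually talks to the LLM backend.
        return self

    def generate(self, prompt: str) -> str:
        # Hypothetical placeholder; a real wrapper would send the prompt to the
        # configured chat model and return its text response.
        return f"placeholder response for: {prompt[:40]}"

    async def a_generate(self, prompt: str) -> str:
        # Async variant that DeepEval calls during evaluation.
        return await asyncio.to_thread(self.generate, prompt)

    def get_model_name(self) -> str:
        return self.model_name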

This is the important part of the code (the test itself):

@pytest.mark.asyncio
async def test_triage():
    setup_observability(os.getenv("COLLECTOR_ENDPOINT"))
    dataset = await create_dataset(
        name="Triage",
        agent_factory=lambda: TriageAgent(),
        agent_run=run_agent,
        goldens=[
            Golden(
                input=InputSchema(issue="RHEL-12345").model_dump_json(),
                expected_output=OutputSchema(
                    resolution=Resolution.NO_ACTION,
                    data=NoActionData(reasoning="The issue is not a fixable bug", jira_issue="RHEL-12345"),
                ).model_dump_json(),
                expected_tools=[
                    # ToolCall(
                    #     name="get_jira_details",
                    #     reasoning="TODO",
                    #     input={"issue_key": "RHEL-12345"},
                    #     output="TODO",
                    # ),
                ],
            )
        ],
    )
    correctness_metric = GEval(
        name="Correctness",
        criteria="\n - ".join(
            [
                "Reasoning must be factually equal to the expected one",
                "`jira_issue` in the output must match `issue` in the input",
            ]
        ),
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
            LLMTestCaseParams.TOOLS_CALLED,
            LLMTestCaseParams.EXPECTED_TOOLS,
        ],
        verbose_mode=True,
        model=DeepEvalLLM.from_name(os.getenv("CHAT_MODEL")),
        threshold=0.65,
    )
    metrics: list[BaseMetric] = [correctness_metric]
    evaluate_dataset(dataset, metrics)

To execute:

  • first run the container: make run-beeai-bash
  • then execute the test: PYTHONPATH=$(pwd)/agents deepeval test run agents/tests/test_triage_agent.py

Example output:

✓ Evaluation completed 🎉! (time taken: 17.75s | token cost: None USD)
» Test Results (1 total tests):
   » Pass Rate: 100.0% | Passed: 1 | Failed: 0

 ================================================================================ 

» Want to share evals with your team, or a place for your test cases to live? ❤️ 🏡
  » Run 'deepeval view' to analyze and save testing results on Confident AI.


   Test Results Summary    
┏━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Total ┃ Passed ┃ Failed ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│     1 │      1 │      0 │
└───────┴────────┴────────┘
╭───────────────────────────────────────────────── Triage - {"issue":"RHEL-12345"} ──────────────────────────────────────────────────╮
│ Input            {"issue":"RHEL-12345"}                                                                                            │
│ Expected Output  {"resolution":"no-action","data":{"reasoning":"The issue is not a fixable bug","jira_issue":"RHEL-12345"}}        │
│ Actual Output    {"resolution":"no-action","data":{"reasoning":"The issue is a feature request to add support for a new device to  │
│                  the HCK-CI infrastructure. This is not a bug or CVE, and therefore no code fix is                                 │
│                  required.","jira_issue":"RHEL-12345"}}                                                                            │
│                                                              Metrics                                                               │
│ ┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ │
│ ┃ Metric              ┃ Success ┃ Score ┃ Threshold ┃ Reason                                                             ┃ Error ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ │
│ │ Correctness [GEval] │ True    │ 1.0   │ 0.65      │ The model correctly identified the `issue` from the input and      │       │ │
│ │                     │         │       │           │ placed it in the `jira_issue` field of the output. The reasoning   │       │ │
│ │                     │         │       │           │ provided is logical and well-supported by the analysis shown in    │       │ │
│ │                     │         │       │           │ the tool calls, which correctly determined that the issue          │       │ │
│ │                     │         │       │           │ "RHEL-12345" is a feature request for CI infrastructure and not a  │       │ │
│ │                     │         │       │           │ bug requiring a code fix.                                          │       │ │
│ └─────────────────────┴─────────┴───────┴───────────┴────────────────────────────────────────────────────────────────────┴───────┘ │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
.Running teardown with pytest sessionfinish...

======================================================== slowest 10 durations ========================================================
72.43s call     agents/tests/test_triage_agent.py::test_triage

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
1 passed, 2 warnings in 72.58s (0:01:12)

Signed-off-by: Nikola Forró <[email protected]>

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces testing for the triage agent using the DeepEval framework. It adds new test files, utility functions, and a model wrapper for DeepEval. The existing agent classes are also updated to capture raw responses for evaluation purposes.

Overall, the changes are a great step towards robust testing. However, I've identified a few critical issues in the new test implementation (test_triage_agent.py) that need to be addressed for correctness and robustness. Specifically, there's a potential for a crash if the agent doesn't produce a response, and the logic for collecting tool calls is flawed. I've also pointed out a type violation in the DeepEvalLLM wrapper. My review includes specific suggestions to resolve these issues. Once these are fixed, the PR should be in good shape.

Comment on lines +26 to +28
    response = agent.last_raw_response
    test_case.tools_called = []
    test_case.actual_output = response.answer.text

critical

There's a potential AttributeError here. If the agent.run_with_schema call fails or doesn't set last_raw_response for any reason, agent.last_raw_response will be None. Accessing response.answer.text would then raise an exception. It's crucial to add a check to ensure response is not None before proceeding to make the test harness more robust.

    response = agent.last_raw_response
    if response is None:
        pytest.fail("Agent run did not yield a response.")
    test_case.tools_called = []
    test_case.actual_output = response.answer.text

Comment on lines +29 to +47
    for index, step in enumerate(response.state.steps):
        if not step.tool:
            continue
        prev_step = response.state.steps[index - 1] if index > 0 else None
        test_case.tools_called = [
            ToolCall(
                name=step.tool.name,
                description=step.tool.description,
                input_parameters=step.input,
                output=step.output.get_text_content(),
                reasoning=(
                    to_json(prev_step.input, indent=2, sort_keys=False)
                    if prev_step and isinstance(prev_step.tool, ThinkTool)
                    else None
                ),
            )
            for step in response.state.steps
            if step.tool and not isinstance(step.tool, FinalAnswerTool)
        ]

critical

The current implementation for populating test_case.tools_called is incorrect and inefficient. The list comprehension is inside a for loop, which causes test_case.tools_called to be completely recalculated and overwritten in each iteration. This means it will only contain data based on the final state of the loop, and it performs redundant computations.

The logic should be to initialize an empty list before the loop, and then append a ToolCall object for each relevant step within the loop.

    if response and response.state and response.state.steps:
        tools_called = []
        for index, step in enumerate(response.state.steps):
            if not step.tool or isinstance(step.tool, FinalAnswerTool):
                continue

            prev_step = response.state.steps[index - 1] if index > 0 else None
            reasoning = (
                to_json(prev_step.input, indent=2, sort_keys=False)
                if prev_step and isinstance(prev_step.tool, ThinkTool)
                else None
            )
            tools_called.append(
                ToolCall(
                    name=step.tool.name,
                    description=step.tool.description,
                    input_parameters=step.input,
                    output=step.output.get_text_content(),
                    reasoning=reasoning,
                )
            )
        test_case.tools_called = tools_called

            )
        )
        text = response.get_text_content()
        return schema.model_validate_json(text) if schema else text  # type: ignore

high

The a_generate method in the base class DeepEvalBaseLLM is typed to return a str. This implementation, however, returns a Pydantic model object when a schema is provided, which violates the Liskov substitution principle. The # type: ignore comment is hiding this type mismatch.

To conform to the base class's method signature and ensure type safety, you should serialize the Pydantic model into a JSON string before returning it.

Suggested change
return schema.model_validate_json(text) if schema else text # type: ignore
return schema.model_validate_json(text).model_dump_json() if schema else text

    if failed:
        pytest.fail(f"{failed}/{total} tests failed. See the summary table above for more details.", pytrace=False)
    else:
        assert 1 == 1

medium

The use of assert 1 == 1 is unconventional for indicating a successful test completion. It works, but it's not very expressive. A more standard and clearer way to indicate that a test has passed successfully (when no exceptions were raised) is to simply have no assertions at the end, or use pass. Since pytest.fail is used for the failure case, the function will implicitly pass if it completes without hitting that line.

Suggested change
assert 1 == 1
pass

@TomasTomecek (Member) commented:

wow, this is very promising!
I guess we could create a few jiras for several use cases and run those with deepeval
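
A sketch of what such additional use cases could look like as extra goldens, reusing the schema classes from the test above (the second issue key and its expected reasoning are made-up placeholders):

# Imports as in agents/tests/test_triage_agent.py above: Golden comes from
# deepeval, the schema classes from the triage agent.
goldens = [
    Golden(
        input=InputSchema(issue=issue).model_dump_json(),
        expected_output=OutputSchema(
            resolution=Resolution.NO_ACTION,
            data=NoActionData(reasoning=reasoning, jira_issue=issue),
        ).model_dump_json(),
    )
    # Placeholder (issue, reasoning) pairs; real goldens would come from curated
    # Jira issues covering the different triage resolutions.
    for issue, reasoning in [
        ("RHEL-12345", "The issue is not a fixable bug"),
        ("RHEL-67890", "The issue is a feature request, not a fixable bug"),
    ]
]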

@martinhoyer commented:

This is great!

My only "concern", as an outsider, is further committing to beeai components. I mean, from the overall design/architecture PoV, going forward.
Langchain eval docs for reference: https://docs.smith.langchain.com/evaluation/concepts

@nforro (Member, Author) commented Aug 13, 2025

> My only "concern", as an outsider, is further committing to beeai components. I mean, from the overall design/architecture PoV, going forward.
> Langchain eval docs for reference: https://docs.smith.langchain.com/evaluation/concepts

I believe we could choose to do this with the LangSmith SDK instead; it would probably take more time, since BeeAI already has some wrappers and utility code for DeepEval, but I don't think it would be a huge deal.
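
For comparison, a rough sketch of what an equivalent check might look like with the LangSmith SDK; the dataset name, the target wrapper and the evaluator below are illustrative only (see the docs linked above for the actual API):

from langsmith import evaluate


def run_triage_agent(issue: str) -> str:
    # Hypothetical stand-in for actually invoking the TriageAgent.
    return "{}"


def correctness(run, example):
    # Compare the agent's actual output with the golden's expected output.
    actual = (run.outputs or {}).get("output")
    expected = (example.outputs or {}).get("expected_output")
    return {"key": "correctness", "score": 1.0 if actual == expected else 0.0}


results = evaluate(
    lambda inputs: {"output": run_triage_agent(inputs["issue"])},
    data="triage-goldens",  # hypothetical LangSmith dataset name
    evaluators=[correctness],
)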
