Description
🔴 Required Information
Describe the Bug:
When using conversation_scenario for dynamic user simulation, the AgentEvaluator.evaluate_eval_set() method fails with a Pydantic validation error. The _EvalMetricResultWithInvocation class requires expected_invocation: Invocation, but under user simulation there are no expected invocations (the conversation is generated dynamically), so None is passed and validation fails.
Steps to Reproduce:
- Create an evalset file with conversation_scenario (user simulation):
{
  "eval_set_id": "test",
  "eval_cases": [{
    "eval_id": "CS-1",
    "conversation_scenario": {
      "starting_prompt": "Hello",
      "conversation_plan": "Ask the agent to do something"
    },
    "session_input": { "app_name": "my_agent", "user_id": "test" }
  }]
}
- Create a test_config.json with hallucinations_v1 metric (which supports user simulation per docs):
{
  "criteria": { "hallucinations_v1": { "threshold": 0.8 } },
  "user_simulator_config": { "model": "gemini-2.5-flash", "max_allowed_invocations": 10 }
}
- Run evaluation using AgentEvaluator.evaluate_eval_set() (a minimal repro sketch follows this list)
- Error occurs at line 639 in agent_evaluator.py
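For reference, a minimal pytest-style repro of the steps above might look like the sketch below. The file layout, file names, and the agent_module value are assumptions; the evaluate() entry point is used because it picks up the adjacent test_config.json, and the traceback in this report shows the failure surfacing inside agent_evaluator.py during per-invocation result assembly.

```python
# Hedged repro sketch -- paths and names are hypothetical, adjust to your project.
import pytest

from google.adk.evaluation.agent_evaluator import AgentEvaluator


@pytest.mark.asyncio
async def test_conversation_scenario_eval():
  # Expects the evalset JSON and test_config.json from the steps above under tests/.
  # Fails with the Pydantic error described below once per-invocation metric
  # results are assembled.
  await AgentEvaluator.evaluate(
      agent_module="my_agent",
      eval_dataset_file_path_or_dir="tests/conversation_scenario.evalset.json",
  )
```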
Expected Behavior:
Evaluation should complete successfully when using conversation_scenario with metrics that support user simulation (e.g., hallucinations_v1, safety_v1).
Observed Behavior:
Pydantic validation error:
pydantic_core._pydantic_core.ValidationError: 1 validation error for _EvalMetricResultWithInvocation
expected_invocation
Input should be a valid dictionary or instance of Invocation [type=model_type, input_value=None, input_type=NoneType]
Environment Details:
- ADK Library Version: 1.20.0
- Desktop OS: Linux
- Python Version: 3.13.8
Model Information:
- Are you using LiteLLM: No
- Which model is being used: gemini-2.5-flash (for user simulator), gemini-2.5-flash (for agent)
🟡 Optional Information
Regression:
N/A
Logs:
File ".../google/adk/evaluation/agent_evaluator.py", line 639, in _get_eval_metric_results_with_invocation
_EvalMetricResultWithInvocation(
actual_invocation=actual_invocation,
expected_invocation=expected_invocation, # ← This is None
eval_metric_result=eval_metric_result,
)
Root Cause Analysis:
The issue is in agent_evaluator.py:
- Lines 83-91 define _EvalMetricResultWithInvocation:
class _EvalMetricResultWithInvocation(BaseModel):
  actual_invocation: Invocation
  expected_invocation: Invocation  # ← NOT Optional!
  eval_metric_result: EvalMetricResult
- Lines 632-640 in _get_eval_metric_results_with_invocation:
actual_invocation = eval_metrics_per_invocation.actual_invocation
expected_invocation = eval_metrics_per_invocation.expected_invocation  # ← None for user simulation
eval_metric_results[metric_name].append(
    _EvalMetricResultWithInvocation(
        actual_invocation=actual_invocation,
        expected_invocation=expected_invocation,  # ← Passes None, fails validation
        eval_metric_result=eval_metric_result,
    )
)
Suggested Fix:
Make expected_invocation optional in _EvalMetricResultWithInvocation:
class _EvalMetricResultWithInvocation(BaseModel):
  actual_invocation: Invocation
  expected_invocation: Optional[Invocation] = None  # ← Make optional
  eval_metric_result: EvalMetricResult
And update _print_details to handle None gracefully (lines 420-430); a sketch of that guard follows below.
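As a hedged illustration only (the actual _print_details implementation differs), the None case could be rendered with a small guard along these lines; the helper name, import path, and placeholder text are hypothetical:

```python
from typing import Optional

from google.adk.evaluation.eval_case import Invocation  # import path assumed


def _expected_response_text(expected_invocation: Optional[Invocation]) -> str:
  """Hypothetical helper: render the expected side of a result row, tolerating None."""
  if expected_invocation is None:
    # conversation_scenario cases generate user turns dynamically, so there is
    # no golden invocation to display or compare against.
    return "N/A (dynamic user simulation)"
  return str(expected_invocation.final_response)
```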
How often has this issue occurred?:
- Always (100%) when using conversation_scenario