_EvalMetricResultWithInvocation fails with conversation_scenario (user simulation) - expected_invocation is None #4283

@simone-viozzi

Description

🔴 Required Information

Describe the Bug:
When using conversation_scenario for dynamic user simulation, the AgentEvaluator.evaluate_eval_set() method fails with a Pydantic validation error. The _EvalMetricResultWithInvocation class declares expected_invocation: Invocation as a required field, but with user simulation the conversation is generated dynamically and there are no expected invocations to compare against, so None is passed and validation fails.

Steps to Reproduce:

  1. Create an evalset file with conversation_scenario (user simulation):
{
  "eval_set_id": "test",
  "eval_cases": [{
    "eval_id": "CS-1",
    "conversation_scenario": {
      "starting_prompt": "Hello",
      "conversation_plan": "Ask the agent to do something"
    },
    "session_input": { "app_name": "my_agent", "user_id": "test" }
  }]
}
  2. Create a test_config.json with the hallucinations_v1 metric (which supports user simulation per the docs):
{
  "criteria": { "hallucinations_v1": { "threshold": 0.8 } },
  "user_simulator_config": { "model": "gemini-2.5-flash", "max_allowed_invocations": 10 }
}
  3. Run the evaluation using AgentEvaluator.evaluate_eval_set() (a minimal driver sketch follows these steps).
  4. The error is raised at line 639 in agent_evaluator.py.
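For reference, a minimal driver for step 3 could look like the sketch below. The keyword names (agent_module, eval_set, criteria), the criteria value shape, and the evalset filename are assumptions; check the evaluate_eval_set() signature in your installed ADK release.

import asyncio
import json

from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.eval_set import EvalSet

async def main():
  # Load the evalset file from step 1 into the EvalSet Pydantic model.
  with open("test.evalset.json") as f:
    eval_set = EvalSet.model_validate(json.load(f))
  await AgentEvaluator.evaluate_eval_set(
      agent_module="my_agent",  # module that exposes the agent under test
      eval_set=eval_set,
      criteria={"hallucinations_v1": 0.8},  # threshold, per test_config.json
  )

asyncio.run(main())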

Expected Behavior:
Evaluation should complete successfully when using conversation_scenario with metrics that support user simulation (e.g., hallucinations_v1, safety_v1).

Observed Behavior:
Pydantic validation error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for _EvalMetricResultWithInvocation
expected_invocation
  Input should be a valid dictionary or instance of Invocation [type=model_type, input_value=None, input_type=NoneType]

Environment Details:

  • ADK Library Version: 1.20.0
  • Desktop OS: Linux
  • Python Version: 3.13.8

Model Information:

  • Are you using LiteLLM: No
  • Which model is being used: gemini-2.5-flash (for both the user simulator and the agent)

🟡 Optional Information

Regression:
N/A

Logs:

File ".../google/adk/evaluation/agent_evaluator.py", line 639, in _get_eval_metric_results_with_invocation
    _EvalMetricResultWithInvocation(
        actual_invocation=actual_invocation,
        expected_invocation=expected_invocation,  # ← This is None
        eval_metric_result=eval_metric_result,
    )

Root Cause Analysis:
The issue is in agent_evaluator.py:

  1. Lines 83-91 define _EvalMetricResultWithInvocation:
class _EvalMetricResultWithInvocation(BaseModel):
  actual_invocation: Invocation
  expected_invocation: Invocation  # ← NOT Optional!
  eval_metric_result: EvalMetricResult
  2. Lines 632-640 in _get_eval_metric_results_with_invocation:
actual_invocation = eval_metrics_per_invocation.actual_invocation
expected_invocation = eval_metrics_per_invocation.expected_invocation  # ← None for user simulation

eval_metric_results[metric_name].append(
    _EvalMetricResultWithInvocation(
        actual_invocation=actual_invocation,
        expected_invocation=expected_invocation,  # ← Passes None, fails validation
        eval_metric_result=eval_metric_result,
    )
)
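The failure also reproduces standalone with stand-in Pydantic models (no ADK install needed; the Invocation fields below are illustrative, and eval_metric_result is omitted for brevity):

from pydantic import BaseModel, ValidationError

class Invocation(BaseModel):  # stand-in for the real ADK Invocation
  invocation_id: str = ""

class _EvalMetricResultWithInvocation(BaseModel):
  actual_invocation: Invocation
  expected_invocation: Invocation  # required, as in agent_evaluator.py today

try:
  _EvalMetricResultWithInvocation(
      actual_invocation=Invocation(),
      expected_invocation=None,  # what the user-simulation path produces
  )
except ValidationError as err:
  print(err)  # "Input should be a valid dictionary or instance of Invocation"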

Suggested Fix:
Make expected_invocation optional in _EvalMetricResultWithInvocation:

from typing import Optional  # if not already imported in agent_evaluator.py

class _EvalMetricResultWithInvocation(BaseModel):
  actual_invocation: Invocation
  expected_invocation: Optional[Invocation] = None  # ← Make optional
  eval_metric_result: EvalMetricResult

And update _print_details (around lines 420-430) to handle a None expected_invocation gracefully.
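A hypothetical shape for that guard, assuming Invocation lives in google.adk.evaluation.eval_case and that final_response is the field being printed; match whatever _print_details actually renders:

from typing import Optional

from google.adk.evaluation.eval_case import Invocation

def _format_expected(expected: Optional[Invocation]) -> str:
  # User-simulation runs carry no reference conversation, so print a
  # placeholder instead of dereferencing None.
  if expected is None:
    return "N/A (conversation generated by user simulator)"
  return str(expected.final_response)  # assumed field; adjust to the real one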

How often has this issue occurred?:

  • Always (100%) when using conversation_scenario
