| 1 | +# Eval Set Format |
| 2 | + |
| 3 | +An eval set is a JSON file containing golden reference data that metrics compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling. |
| 4 | + |
| 5 | +Most users will not need to author eval sets by hand. The web UI can generate them from live sessions (mark a session as golden, and the server builds the eval set automatically). This document is for users who want to create or edit eval sets directly, whether for CLI usage, CI pipelines, or version-controlled test suites. |
| 6 | + |
| 7 | +## Structure Overview |
| 8 | + |
| 9 | +``` |
| 10 | +EvalSet |
| 11 | +├── eval_set_id              (required, string) |
| 12 | +├── name                     (optional, string) |
| 13 | +├── description              (optional, string) |
| 14 | +├── eval_cases               (required, list of EvalCase) |
| 15 | +│   └── EvalCase |
| 16 | +│       ├── eval_id          (required, string) |
| 17 | +│       ├── conversation     (list of Invocation) |
| 18 | +│       │   └── Invocation |
| 19 | +│       │       ├── invocation_id      (string) |
| 20 | +│       │       ├── user_content       (Content: role + parts) |
| 21 | +│       │       ├── final_response     (Content, optional) |
| 22 | +│       │       └── intermediate_data  (optional) |
| 23 | +│       │           ├── tool_uses      (list of FunctionCall) |
| 24 | +│       │           └── tool_responses (list of FunctionResponse) |
| 25 | +│       ├── rubrics          (optional, list of Rubric) |
| 26 | +│       └── session_input    (optional) |
| 27 | +└── creation_timestamp       (optional, float) |
| 28 | +``` |
| 29 | + |
| 30 | +## Minimal Example |
| 31 | + |
| 32 | +A single eval case with one user turn and an expected response: |
| 33 | + |
| 34 | +```json |
| 35 | +{ |
| 36 | +  "eval_set_id": "my-agent-eval", |
| 37 | +  "eval_cases": [ |
| 38 | +    { |
| 39 | +      "eval_id": "greeting", |
| 40 | +      "conversation": [ |
| 41 | +        { |
| 42 | +          "invocation_id": "inv-1", |
| 43 | +          "user_content": { |
| 44 | +            "role": "user", |
| 45 | +            "parts": [{"text": "Hi! Can you help me?"}] |
| 46 | +          }, |
| 47 | +          "final_response": { |
| 48 | +            "role": "model", |
| 49 | +            "parts": [{"text": "Hello! I can help you roll dice and check prime numbers."}] |
| 50 | +          } |
| 51 | +        } |
| 52 | +      ] |
| 53 | +    } |
| 54 | +  ] |
| 55 | +} |
| 56 | +``` |
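For CI or scripted test suites, the same file can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library; the helper names `text_content` and `minimal_eval_set` are hypothetical and are not part of agentevals or ADK.

```python
import json


def text_content(role: str, text: str) -> dict:
    """Build a Content object that holds a single text part."""
    return {"role": role, "parts": [{"text": text}]}


def minimal_eval_set(eval_set_id: str, eval_id: str,
                     user_text: str, expected_text: str) -> dict:
    """Return an eval set with one case and one invocation (hypothetical helper)."""
    return {
        "eval_set_id": eval_set_id,
        "eval_cases": [
            {
                "eval_id": eval_id,
                "conversation": [
                    {
                        "invocation_id": "inv-1",
                        "user_content": text_content("user", user_text),
                        "final_response": text_content("model", expected_text),
                    }
                ],
            }
        ],
    }


eval_set = minimal_eval_set(
    "my-agent-eval",
    "greeting",
    "Hi! Can you help me?",
    "Hello! I can help you roll dice and check prime numbers.",
)

# Serialize to the JSON layout shown in the minimal example.
print(json.dumps(eval_set, indent=2))
```

Redirecting the printed output to a file gives an eval set with the same structure as the minimal example.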
| 57 | + |
| 58 | +## Example with Tool Calls |
| 59 | + |
| 60 | +When your agent uses tools, capture the expected tool trajectory in `intermediate_data`: |
| 61 | + |
| 62 | +```json |
| 63 | +{ |
| 64 | +  "eval_set_id": "helm_eval_set", |
| 65 | +  "name": "Helm Agent Eval Set", |
| 66 | +  "description": "Golden eval cases for the Helm agent.", |
| 67 | +  "eval_cases": [ |
| 68 | +    { |
| 69 | +      "eval_id": "helm_list_releases", |
| 70 | +      "conversation": [ |
| 71 | +        { |
| 72 | +          "invocation_id": "inv-1", |
| 73 | +          "user_content": { |
| 74 | +            "role": "user", |
| 75 | +            "parts": [{"text": "list all Helm releases"}] |
| 76 | +          }, |
| 77 | +          "final_response": { |
| 78 | +            "role": "model", |
| 79 | +            "parts": [{"text": "There are two Helm releases installed in the cluster..."}] |
| 80 | +          }, |
| 81 | +          "intermediate_data": { |
| 82 | +            "tool_uses": [ |
| 83 | +              { |
| 84 | +                "name": "helm_list_releases", |
| 85 | +                "args": {}, |
| 86 | +                "id": "call_1" |
| 87 | +              } |
| 88 | +            ], |
| 89 | +            "tool_responses": [ |
| 90 | +              { |
| 91 | +                "name": "helm_list_releases", |
| 92 | +                "response": { |
| 93 | +                  "content": [{"type": "text", "text": "NAME NAMESPACE STATUS ..."}], |
| 94 | +                  "isError": false |
| 95 | +                }, |
| 96 | +                "id": "call_1" |
| 97 | +              } |
| 98 | +            ] |
| 99 | +          } |
| 100 | +        } |
| 101 | +      ] |
| 102 | +    } |
| 103 | +  ] |
| 104 | +} |
| 105 | +``` |
| 106 | + |
| 107 | +## Multi-turn Conversations |
| 108 | + |
| 109 | +An eval case can have multiple invocations to represent a conversation. Each invocation is one user turn plus the agent's expected response: |
| 110 | + |
| 111 | +```json |
| 112 | +{ |
| 113 | +  "eval_set_id": "multi_turn_eval", |
| 114 | +  "eval_cases": [ |
| 115 | +    { |
| 116 | +      "eval_id": "roll_and_check", |
| 117 | +      "conversation": [ |
| 118 | +        { |
| 119 | +          "invocation_id": "inv-1", |
| 120 | +          "user_content": {"role": "user", "parts": [{"text": "Roll a 20-sided die"}]}, |
| 121 | +          "final_response": {"role": "model", "parts": [{"text": "I rolled a 17!"}]}, |
| 122 | +          "intermediate_data": { |
| 123 | +            "tool_uses": [{"name": "roll_die", "args": {"sides": 20}, "id": "c1"}], |
| 124 | +            "tool_responses": [{"name": "roll_die", "response": {"result": 17}, "id": "c1"}] |
| 125 | +          } |
| 126 | +        }, |
| 127 | +        { |
| 128 | +          "invocation_id": "inv-2", |
| 129 | +          "user_content": {"role": "user", "parts": [{"text": "Is that number prime?"}]}, |
| 130 | +          "final_response": {"role": "model", "parts": [{"text": "Yes, 17 is a prime number."}]}, |
| 131 | +          "intermediate_data": { |
| 132 | +            "tool_uses": [{"name": "check_prime", "args": {"nums": [17]}, "id": "c2"}], |
| 133 | +            "tool_responses": [{"name": "check_prime", "response": {"17": true}, "id": "c2"}] |
| 134 | +          } |
| 135 | +        } |
| 136 | +      ] |
| 137 | +    } |
| 138 | +  ] |
| 139 | +} |
| 140 | +``` |
| 141 | + |
| 142 | +## Field Reference |
| 143 | + |
| 144 | +### EvalSet (top level) |
| 145 | + |
| 146 | +| Field | Type | Required | Description | |
| 147 | +|---|---|---|---| |
| 148 | +| `eval_set_id` | string | yes | Unique identifier for this eval set | |
| 149 | +| `name` | string | no | Human-readable name | |
| 150 | +| `description` | string | no | What this eval set covers | |
| 151 | +| `eval_cases` | list[EvalCase] | yes | The evaluation cases | |
| 152 | +| `creation_timestamp` | float | no | Unix timestamp, defaults to 0.0 | |
| 153 | + |
| 154 | +### EvalCase |
| 155 | + |
| 156 | +| Field | Type | Required | Description | |
| 157 | +|---|---|---|---| |
| 158 | +| `eval_id` | string | yes | Unique identifier for this case | |
| 159 | +| `conversation` | list[Invocation] | yes* | Static conversation turns | |
| 160 | +| `conversation_scenario` | ConversationScenario | no* | For simulated agent evaluation (ADK feature, not used by agentevals) | |
| 161 | +| `session_input` | SessionInput | no | Initial session state for the agent | |
| 162 | +| `rubrics` | list[Rubric] | no | Scoring rubrics for all invocations in this case | |
| 163 | +| `final_session_state` | dict | no | Expected session state after the conversation | |
| 164 | +| `creation_timestamp` | float | no | Unix timestamp | |
| 165 | + |
| 166 | +*Exactly one of `conversation` or `conversation_scenario` must be provided. For agentevals, use `conversation`. |
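The "exactly one of" rule can be checked on raw JSON before the file reaches a schema validator. This is a rough sketch over plain dicts, assuming an absent or `null` field counts as "not provided"; the function name `conversation_mode` is hypothetical, and ADK's own validator may treat edge cases differently.

```python
def conversation_mode(eval_case: dict) -> str:
    """Return which of the two mutually exclusive fields an eval case uses.

    Raises ValueError when both or neither are provided.
    """
    has_static = eval_case.get("conversation") is not None
    has_scenario = eval_case.get("conversation_scenario") is not None
    if has_static == has_scenario:
        # Both True (two sources) or both False (no source) break the rule.
        raise ValueError(
            f"eval case {eval_case.get('eval_id')!r}: provide exactly one of "
            "'conversation' or 'conversation_scenario'"
        )
    return "conversation" if has_static else "conversation_scenario"


# A case with static turns passes; a case with neither field is rejected.
print(conversation_mode({"eval_id": "greeting", "conversation": []}))
try:
    conversation_mode({"eval_id": "empty"})
except ValueError as exc:
    print(exc)
```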
| 167 | + |
| 168 | +### Invocation |
| 169 | + |
| 170 | +| Field | Type | Required | Description | |
| 171 | +|---|---|---|---| |
| 172 | +| `invocation_id` | string | no | Unique turn identifier (defaults to empty string) | |
| 173 | +| `user_content` | Content | yes | What the user said | |
| 174 | +| `final_response` | Content | no | Expected agent response | |
| 175 | +| `intermediate_data` | IntermediateData | no | Expected tool calls and responses | |
| 176 | +| `rubrics` | list[Rubric] | no | Scoring rubrics for this specific invocation | |
| 177 | +| `creation_timestamp` | float | no | Unix timestamp | |
| 178 | + |
| 179 | +### Content |
| 180 | + |
| 181 | +Uses the Google GenAI `Content` format. `role` is either `"user"` or `"model"`; the empty `args` and `response` objects below are placeholders: |
| 182 | + |
| 183 | +```json |
| 184 | +{ |
| 185 | +  "role": "user", |
| 186 | +  "parts": [ |
| 187 | +    {"text": "plain text content"}, |
| 188 | +    {"function_call": {"name": "tool_name", "args": {}}}, |
| 189 | +    {"function_response": {"name": "tool_name", "response": {}}} |
| 190 | +  ] |
| 191 | +} |
| 192 | +``` |
| 193 | + |
| 194 | +The `parts` array can contain text, function calls, or function responses. Most commonly you will use text parts in `user_content` and `final_response`. |
| 195 | + |
| 196 | +### IntermediateData |
| 197 | + |
| 198 | +| Field | Type | Default | Description | |
| 199 | +|---|---|---|---| |
| 200 | +| `tool_uses` | list[FunctionCall] | `[]` | Tool calls the agent made, in chronological order | |
| 201 | +| `tool_responses` | list[FunctionResponse] | `[]` | Tool responses received, in chronological order | |
| 202 | +| `intermediate_responses` | list[tuple] | `[]` | Sub-agent responses (multi-agent systems) | |
| 203 | + |
| 204 | +Each `FunctionCall` has `name`, `args`, and `id`. Each `FunctionResponse` has `name`, `response`, and `id`. Match `id` values between calls and responses to pair them. |
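Pairing by `id` can be sketched as follows. The code works on the raw JSON dicts rather than ADK model objects, and the helper name `pair_tool_calls` is hypothetical. Calls without a matching response are paired with `None`.

```python
from typing import Optional


def pair_tool_calls(intermediate_data: dict) -> list[tuple[dict, Optional[dict]]]:
    """Pair each tool call with the response that shares its id.

    Calls are returned in their original (chronological) order.
    """
    responses_by_id = {
        response["id"]: response
        for response in intermediate_data.get("tool_responses", [])
        if response.get("id") is not None
    }
    return [
        (call, responses_by_id.get(call.get("id")))
        for call in intermediate_data.get("tool_uses", [])
    ]


# Data taken from the first turn of the multi-turn example.
intermediate_data = {
    "tool_uses": [{"name": "roll_die", "args": {"sides": 20}, "id": "c1"}],
    "tool_responses": [{"name": "roll_die", "response": {"result": 17}, "id": "c1"}],
}

for call, response in pair_tool_calls(intermediate_data):
    print(call["name"], call["args"], "->", response["response"] if response else None)
```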
| 205 | + |
| 206 | +## Which Metrics Use Eval Sets |
| 207 | + |
| 208 | +Not all metrics require an eval set. Use `agentevals list-metrics` to see which do: |
| 209 | + |
| 210 | +| Metric | Needs Eval Set | What It Reads | |
| 211 | +|---|---|---| |
| 212 | +| `tool_trajectory_avg_score` | yes | `intermediate_data.tool_uses` | |
| 213 | +| `response_match_score` | yes | `final_response` (ROUGE-1 text similarity) | |
| 214 | +| `final_response_match_v2` | yes | `final_response` (LLM judge comparison) | |
| 215 | +| `response_evaluation_score` | yes | `final_response` (Vertex AI semantic eval) | |
| 216 | +| `hallucinations_v1` | no | N/A | |
| 217 | +| `safety_v1` | no | N/A | |
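The table can double as a pre-flight check: before running a metric, list the invocations that lack the reference field it reads. The sketch below transcribes the mapping from the table and works on raw JSON dicts; `METRIC_REFERENCE_FIELDS` and `invocations_missing_reference` are hypothetical names, not part of agentevals. An absent `tool_uses` may be intentional (no tool calls expected), so treat the result as a hint rather than an error.

```python
# Reference field (as a key path inside an invocation) per metric,
# transcribed from the table above. Metrics not listed read nothing.
METRIC_REFERENCE_FIELDS = {
    "tool_trajectory_avg_score": ("intermediate_data", "tool_uses"),
    "response_match_score": ("final_response",),
    "final_response_match_v2": ("final_response",),
    "response_evaluation_score": ("final_response",),
}


def invocations_missing_reference(eval_set: dict, metric: str) -> list[tuple]:
    """Return (eval_id, invocation index) pairs lacking the metric's reference field."""
    path = METRIC_REFERENCE_FIELDS.get(metric)
    if path is None:
        return []  # the metric does not read the eval set
    missing = []
    for case in eval_set.get("eval_cases", []):
        for index, invocation in enumerate(case.get("conversation") or []):
            value = invocation
            for key in path:
                # Walk the key path; stop at the first absent level.
                value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                missing.append((case.get("eval_id"), index))
    return missing


sample = {
    "eval_set_id": "my-agent-eval",
    "eval_cases": [
        {
            "eval_id": "greeting",
            "conversation": [
                {
                    "user_content": {"role": "user", "parts": [{"text": "Hi!"}]},
                    "final_response": {"role": "model", "parts": [{"text": "Hello!"}]},
                }
            ],
        }
    ],
}

print(invocations_missing_reference(sample, "tool_trajectory_avg_score"))
```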
| 218 | + |
| 219 | +## Usage |
| 220 | + |
| 221 | +### CLI |
| 222 | + |
| 223 | +```bash |
| 224 | +agentevals run trace.json --eval-set eval_set.json -m tool_trajectory_avg_score |
| 225 | +``` |
| 226 | + |
| 227 | +### Web UI |
| 228 | + |
| 229 | +Upload an eval set file in the evaluation panel, or let the UI generate one from a golden session. |
| 230 | + |
| 231 | +### API |
| 232 | + |
| 233 | +```bash |
| 234 | +curl -X POST http://localhost:8001/api/validate/eval-set \ |
| 235 | + -F "eval_set_file=@eval_set.json" |
| 236 | +``` |
| 237 | + |
| 238 | +The validation endpoint checks JSON syntax, required fields, and structural correctness before you run an evaluation. |
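When no server is running, a rough local pre-check can catch the most common mistakes. The sketch below is not the endpoint's implementation; it only checks JSON syntax and the fields marked as required in the field reference, and the function name `check_eval_set` is hypothetical.

```python
import json


def check_eval_set(text: str) -> list[str]:
    """Return a list of problems found in an eval set JSON string (empty if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]

    problems = []
    if not isinstance(data.get("eval_set_id"), str):
        problems.append("eval_set_id: missing or not a string")
    cases = data.get("eval_cases")
    if not isinstance(cases, list):
        problems.append("eval_cases: missing or not a list")
        return problems

    for i, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"eval_cases[{i}]: not an object")
            continue
        if not isinstance(case.get("eval_id"), str):
            problems.append(f"eval_cases[{i}].eval_id: missing or not a string")
        for j, invocation in enumerate(case.get("conversation") or []):
            if not isinstance(invocation, dict) or "user_content" not in invocation:
                problems.append(
                    f"eval_cases[{i}].conversation[{j}].user_content: missing"
                )
    return problems


# Example: a case whose only invocation lacks user_content.
print(check_eval_set('{"eval_set_id": "s", "eval_cases": [{"eval_id": "c", "conversation": [{}]}]}'))
```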
| 239 | + |
| 240 | +## ADK Compatibility |
| 241 | + |
| 242 | +The eval set format is defined by [Google ADK's evaluation module](https://github.com/google/adk-python/tree/main/src/google/adk/evaluation). agentevals loads eval sets using `EvalSet.model_validate()` from ADK directly, so any valid ADK eval set works with agentevals and vice versa. |
| 243 | + |
| 244 | +Fields specific to ADK's live evaluation flow (`conversation_scenario`, `session_input`, `final_session_state`) are accepted but not used by agentevals, which evaluates pre-recorded traces rather than running agents live. |