Commit ff6cd3a

Merge pull request #54 from agentevals-dev/docs/byo-evalset
Update docs with custom evaluators and hand crafting evalsets
2 parents 7699489 + 2df56d1 commit ff6cd3a

2 files changed

Lines changed: 280 additions & 0 deletions

README.md

Lines changed: 36 additions & 0 deletions
@@ -104,6 +104,34 @@ uv run agentevals run samples/helm.json \
104104
--output json
105105
```
106106

107+
## Custom Evaluators
108+
109+
Beyond the built-in metrics, you can write your own evaluators in Python, JavaScript, or any other language. An evaluator is any program that reads JSON from stdin and writes a score to stdout.
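
As a rough illustration of that stdin/stdout contract, the sketch below reads one JSON document and writes a score. The input field names (`expected`, `actual`), the `{"score": ...}` output shape, and the helper names are assumptions made for this example only; the real protocol is defined in the Custom Evaluators guide.

```python
import json
from typing import TextIO


def score_payload(payload: dict) -> float:
    """Return 1.0 when the two strings match (case-insensitive), else 0.0.

    `expected` and `actual` are hypothetical field names used only here.
    """
    expected = str(payload.get("expected", "")).strip().lower()
    actual = str(payload.get("actual", "")).strip().lower()
    return 1.0 if expected and expected == actual else 0.0


def run(stdin: TextIO, stdout: TextIO) -> None:
    """Read one JSON document from `stdin` and write a score to `stdout`."""
    payload = json.load(stdin)
    # Assumed output shape: a JSON object with a single "score" key.
    json.dump({"score": score_payload(payload)}, stdout)
```

In a standalone evaluator script, the entry point would call `run(sys.stdin, sys.stdout)`.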
110+
111+
```bash
112+
agentevals evaluator init my_evaluator
113+
```
114+
115+
This scaffolds a directory with boilerplate and a manifest. Implement your scoring logic, then reference it in an eval config:
116+
117+
```yaml
118+
# eval_config.yaml
119+
evaluators:
120+
- name: tool_trajectory_avg_score
121+
type: builtin
122+
123+
- name: my_evaluator
124+
type: code
125+
path: ./evaluators/my_evaluator.py
126+
threshold: 0.7
127+
```
128+
129+
```bash
130+
agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
131+
```
132+
133+
Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
134+
107135
## Web UI
108136

109137
**Installed bundle** (port 8001):
@@ -161,6 +189,14 @@ Two slash-command workflows in `.claude/skills/`, available automatically in thi
161189
| `/eval` | Score traces or compare sessions against a golden reference |
162190
| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |
163191

192+
## Docs
193+
194+
| Guide | Description |
195+
|-------|-------------|
196+
| [Eval Set Format](docs/eval-set-format.md) | Schema, field reference, and examples for golden eval set JSON files |
197+
| [Custom Evaluators](docs/custom-evaluators.md) | Write your own scoring logic in Python, JavaScript, or any other language |
198+
| [OpenTelemetry Compatibility](docs/otel-compatibility.md) | Supported OTel conventions, message delivery mechanisms, and OTLP receiver |
199+
164200
## Development
165201

166202
```bash

docs/eval-set-format.md

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
1+
# Eval Set Format
2+
3+
An eval set is a JSON file containing golden reference data that metrics compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
4+
5+
Most users will not need to author eval sets by hand. The web UI can generate them from live sessions (mark a session as golden, and the server builds the eval set automatically). This document is for users who want to create or edit eval sets directly, whether for CLI usage, CI pipelines, or version-controlled test suites.
6+
7+
## Structure Overview
8+
9+
```
10+
EvalSet
11+
├── eval_set_id (required, string)
12+
├── name (optional, string)
13+
├── description (optional, string)
14+
├── eval_cases (required, list of EvalCase)
15+
│ └── EvalCase
16+
│ ├── eval_id (required, string)
17+
│ ├── conversation (list of Invocation)
18+
│ │ └── Invocation
19+
│ │ ├── invocation_id (string)
20+
│ │ ├── user_content (Content: role + parts)
21+
│ │ ├── final_response (Content, optional)
22+
│ │ └── intermediate_data (optional)
23+
│ │ ├── tool_uses (list of FunctionCall)
24+
│ │ └── tool_responses (list of FunctionResponse)
25+
│ ├── rubrics (optional, list of Rubric)
26+
│ └── session_input (optional)
27+
└── creation_timestamp (optional, float)
28+
```
29+
30+
## Minimal Example
31+
32+
A single eval case with one user turn and an expected response:
33+
34+
```json
35+
{
36+
"eval_set_id": "my-agent-eval",
37+
"eval_cases": [
38+
{
39+
"eval_id": "greeting",
40+
"conversation": [
41+
{
42+
"invocation_id": "inv-1",
43+
"user_content": {
44+
"role": "user",
45+
"parts": [{"text": "Hi! Can you help me?"}]
46+
},
47+
"final_response": {
48+
"role": "model",
49+
"parts": [{"text": "Hello! I can help you roll dice and check prime numbers."}]
50+
}
51+
}
52+
]
53+
}
54+
]
55+
}
56+
```
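
If you prefer to generate such a file from a script, a minimal sketch in Python might look like the following. The helper names `text_content` and `minimal_eval_set` are hypothetical and not part of agentevals or ADK; the output mirrors the JSON above.

```python
import json


def text_content(role: str, text: str) -> dict:
    """Build a Content object holding a single text part."""
    return {"role": role, "parts": [{"text": text}]}


def minimal_eval_set(eval_set_id: str, eval_id: str,
                     user_text: str, expected_text: str) -> dict:
    """Build an eval set with one case and one invocation."""
    return {
        "eval_set_id": eval_set_id,
        "eval_cases": [
            {
                "eval_id": eval_id,
                "conversation": [
                    {
                        "invocation_id": "inv-1",
                        "user_content": text_content("user", user_text),
                        "final_response": text_content("model", expected_text),
                    }
                ],
            }
        ],
    }


eval_set = minimal_eval_set(
    "my-agent-eval",
    "greeting",
    "Hi! Can you help me?",
    "Hello! I can help you roll dice and check prime numbers.",
)
print(json.dumps(eval_set, indent=2))
```

Writing the result to disk with `json.dump` gives a file you can pass to `--eval-set`.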
57+
58+
## Example with Tool Calls
59+
60+
When your agent uses tools, capture the expected tool trajectory in `intermediate_data`:
61+
62+
```json
63+
{
64+
"eval_set_id": "helm_eval_set",
65+
"name": "Helm Agent Eval Set",
66+
"description": "Golden eval cases for the Helm agent.",
67+
"eval_cases": [
68+
{
69+
"eval_id": "helm_list_releases",
70+
"conversation": [
71+
{
72+
"invocation_id": "inv-1",
73+
"user_content": {
74+
"role": "user",
75+
"parts": [{"text": "list all Helm releases"}]
76+
},
77+
"final_response": {
78+
"role": "model",
79+
"parts": [{"text": "There are two Helm releases installed in the cluster..."}]
80+
},
81+
"intermediate_data": {
82+
"tool_uses": [
83+
{
84+
"name": "helm_list_releases",
85+
"args": {},
86+
"id": "call_1"
87+
}
88+
],
89+
"tool_responses": [
90+
{
91+
"name": "helm_list_releases",
92+
"response": {
93+
"content": [{"type": "text", "text": "NAME NAMESPACE STATUS ..."}],
94+
"isError": false
95+
},
96+
"id": "call_1"
97+
}
98+
]
99+
}
100+
}
101+
]
102+
}
103+
]
104+
}
105+
```
106+
107+
## Multi-turn Conversations
108+
109+
An eval case can have multiple invocations to represent a conversation. Each invocation is one user turn plus the agent's expected response:
110+
111+
```json
112+
{
113+
"eval_set_id": "multi_turn_eval",
114+
"eval_cases": [
115+
{
116+
"eval_id": "roll_and_check",
117+
"conversation": [
118+
{
119+
"invocation_id": "inv-1",
120+
"user_content": {"role": "user", "parts": [{"text": "Roll a 20-sided die"}]},
121+
"final_response": {"role": "model", "parts": [{"text": "I rolled a 17!"}]},
122+
"intermediate_data": {
123+
"tool_uses": [{"name": "roll_die", "args": {"sides": 20}, "id": "c1"}],
124+
"tool_responses": [{"name": "roll_die", "response": {"result": 17}, "id": "c1"}]
125+
}
126+
},
127+
{
128+
"invocation_id": "inv-2",
129+
"user_content": {"role": "user", "parts": [{"text": "Is that number prime?"}]},
130+
"final_response": {"role": "model", "parts": [{"text": "Yes, 17 is a prime number."}]},
131+
"intermediate_data": {
132+
"tool_uses": [{"name": "check_prime", "args": {"nums": [17]}, "id": "c2"}],
133+
"tool_responses": [{"name": "check_prime", "response": {"17": true}, "id": "c2"}]
134+
}
135+
}
136+
]
137+
}
138+
]
139+
}
140+
```
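
Longer conversations get repetitive to write by hand. The sketch below builds the same two turns programmatically; `make_invocation` is a hypothetical helper, and the `<invocation_id>-call-<n>` id scheme is an assumption (the example above uses `c1` and `c2`, and any ids work as long as each call and its response share one).

```python
def make_invocation(invocation_id, user_text, response_text, tool_calls=()):
    """Build one Invocation dict.

    `tool_calls` is an iterable of (name, args, response) tuples. Each call
    and its response receive the same generated id so they can be paired.
    """
    invocation = {
        "invocation_id": invocation_id,
        "user_content": {"role": "user", "parts": [{"text": user_text}]},
        "final_response": {"role": "model", "parts": [{"text": response_text}]},
    }
    tool_uses, tool_responses = [], []
    for index, (name, args, response) in enumerate(tool_calls, start=1):
        call_id = f"{invocation_id}-call-{index}"  # assumed id scheme
        tool_uses.append({"name": name, "args": args, "id": call_id})
        tool_responses.append({"name": name, "response": response, "id": call_id})
    if tool_uses:
        # Only attach intermediate_data when the turn actually used tools.
        invocation["intermediate_data"] = {
            "tool_uses": tool_uses,
            "tool_responses": tool_responses,
        }
    return invocation


conversation = [
    make_invocation("inv-1", "Roll a 20-sided die", "I rolled a 17!",
                    [("roll_die", {"sides": 20}, {"result": 17})]),
    make_invocation("inv-2", "Is that number prime?", "Yes, 17 is a prime number.",
                    [("check_prime", {"nums": [17]}, {"17": True})]),
]
```

The resulting `conversation` list can be placed directly under an eval case's `conversation` key.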
141+
142+
## Field Reference
143+
144+
### EvalSet (top level)
145+
146+
| Field | Type | Required | Description |
147+
|---|---|---|---|
148+
| `eval_set_id` | string | yes | Unique identifier for this eval set |
149+
| `name` | string | no | Human-readable name |
150+
| `description` | string | no | What this eval set covers |
151+
| `eval_cases` | list[EvalCase] | yes | The evaluation cases |
152+
| `creation_timestamp` | float | no | Unix timestamp, defaults to 0.0 |
153+
154+
### EvalCase
155+
156+
| Field | Type | Required | Description |
157+
|---|---|---|---|
158+
| `eval_id` | string | yes | Unique identifier for this case |
159+
| `conversation` | list[Invocation] | yes* | Static conversation turns |
160+
| `conversation_scenario` | ConversationScenario | no* | For simulated agent evaluation (ADK feature, not used by agentevals) |
161+
| `session_input` | SessionInput | no | Initial session state for the agent (ADK feature, accepted but not used by agentevals) |
162+
| `rubrics` | list[Rubric] | no | Scoring rubrics for all invocations in this case |
163+
| `final_session_state` | dict | no | Expected session state after the conversation (ADK feature, accepted but not used by agentevals) |
164+
| `creation_timestamp` | float | no | Unix timestamp |
165+
166+
*Exactly one of `conversation` or `conversation_scenario` must be provided. For agentevals, use `conversation`.
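
The required-field rules in the tables above can be approximated with a small pre-flight check. This is a simplified sketch with a hypothetical function name, not how agentevals or ADK validate eval sets; ADK's own model remains the authoritative check.

```python
def check_eval_set(data: dict) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if not isinstance(data.get("eval_set_id"), str):
        problems.append("eval_set_id must be a string")
    cases = data.get("eval_cases")
    if not isinstance(cases, list):
        problems.append("eval_cases must be a list")
        return problems
    for index, case in enumerate(cases):
        label = f"eval_cases[{index}]"
        if not isinstance(case.get("eval_id"), str):
            problems.append(f"{label}: eval_id must be a string")
        has_conversation = case.get("conversation") is not None
        has_scenario = case.get("conversation_scenario") is not None
        # Exactly one of the two must be present, so equal flags are an error.
        if has_conversation == has_scenario:
            problems.append(
                f"{label}: provide exactly one of conversation or conversation_scenario"
            )
        for turn, invocation in enumerate(case.get("conversation") or []):
            if "user_content" not in invocation:
                problems.append(f"{label}.conversation[{turn}]: user_content is required")
    return problems
```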
167+
168+
### Invocation
169+
170+
| Field | Type | Required | Description |
171+
|---|---|---|---|
172+
| `invocation_id` | string | no | Unique turn identifier (defaults to empty string) |
173+
| `user_content` | Content | yes | What the user said |
174+
| `final_response` | Content | no | Expected agent response |
175+
| `intermediate_data` | IntermediateData | no | Expected tool calls and responses |
176+
| `rubrics` | list[Rubric] | no | Scoring rubrics for this specific invocation |
177+
| `creation_timestamp` | float | no | Unix timestamp |
178+
179+
### Content
180+
181+
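Uses the Google GenAI `Content` format. The block below is schematic rather than literal JSON: `role` is either `"user"` or `"model"`, and `{...}` stands for an arbitrary JSON object: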
182+
183+
```json
184+
{
185+
"role": "user" or "model",
186+
"parts": [
187+
{"text": "plain text content"},
188+
{"function_call": {"name": "tool_name", "args": {...}}},
189+
{"function_response": {"name": "tool_name", "response": {...}}}
190+
]
191+
}
192+
```
193+
194+
The `parts` array can contain text, function calls, or function responses. Most commonly you will use text parts in `user_content` and `final_response`.
195+
196+
### IntermediateData
197+
198+
| Field | Type | Default | Description |
199+
|---|---|---|---|
200+
| `tool_uses` | list[FunctionCall] | `[]` | Tool calls the agent made, in chronological order |
201+
| `tool_responses` | list[FunctionResponse] | `[]` | Tool responses received, in chronological order |
202+
| `intermediate_responses` | list[tuple] | `[]` | Sub-agent responses (multi-agent systems) |
203+
204+
Each `FunctionCall` has `name`, `args`, and `id`. Each `FunctionResponse` has `name`, `response`, and `id`. Match `id` values between calls and responses to pair them.
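
Pairing by `id` can be sketched as a dictionary lookup. The function name is hypothetical, and the sketch assumes every call and response carries an `id`.

```python
def pair_tool_calls(intermediate_data: dict) -> list:
    """Return (call, response) tuples in call order.

    A call with no matching response is paired with None.
    """
    responses_by_id = {
        response.get("id"): response
        for response in intermediate_data.get("tool_responses", [])
    }
    return [
        (call, responses_by_id.get(call.get("id")))
        for call in intermediate_data.get("tool_uses", [])
    ]


intermediate_data = {
    "tool_uses": [{"name": "roll_die", "args": {"sides": 20}, "id": "c1"}],
    "tool_responses": [{"name": "roll_die", "response": {"result": 17}, "id": "c1"}],
}
pairs = pair_tool_calls(intermediate_data)
```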
205+
206+
## Which Metrics Use Eval Sets
207+
208+
Not all metrics require an eval set. Use `agentevals list-metrics` to see which do:
209+
210+
| Metric | Needs Eval Set | What It Reads |
211+
|---|---|---|
212+
| `tool_trajectory_avg_score` | yes | `intermediate_data.tool_uses` |
213+
| `response_match_score` | yes | `final_response` (ROUGE-1 text similarity) |
214+
| `final_response_match_v2` | yes | `final_response` (LLM judge comparison) |
215+
| `response_evaluation_score` | yes | `final_response` (Vertex AI semantic eval) |
216+
| `hallucinations_v1` | no | N/A |
217+
| `safety_v1` | no | N/A |
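
For scripts that assemble a CLI invocation, the table can be encoded as a lookup to decide whether `--eval-set` is needed. This is a hard-coded snapshot of the table above with a hypothetical function name; `agentevals list-metrics` remains the authoritative source.

```python
# Snapshot of the table above: metric name -> whether it needs an eval set.
NEEDS_EVAL_SET = {
    "tool_trajectory_avg_score": True,
    "response_match_score": True,
    "final_response_match_v2": True,
    "response_evaluation_score": True,
    "hallucinations_v1": False,
    "safety_v1": False,
}


def requires_eval_set(metrics) -> bool:
    """Return True if any selected metric needs an eval set."""
    unknown = [metric for metric in metrics if metric not in NEEDS_EVAL_SET]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    return any(NEEDS_EVAL_SET[metric] for metric in metrics)
```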
218+
219+
## Usage
220+
221+
### CLI
222+
223+
```bash
224+
agentevals run trace.json --eval-set eval_set.json -m tool_trajectory_avg_score
225+
```
226+
227+
### Web UI
228+
229+
Upload an eval set file in the evaluation panel, or let the UI generate one from a golden session.
230+
231+
### API
232+
233+
```bash
234+
curl -X POST http://localhost:8001/api/validate/eval-set \
235+
-F "eval_set_file=@eval_set.json"
236+
```
237+
238+
The validation endpoint checks JSON syntax, required fields, and structural correctness before you run an evaluation.
239+
240+
## ADK Compatibility
241+
242+
The eval set format is defined by [Google ADK's evaluation module](https://github.com/google/adk-python/tree/main/src/google/adk/evaluation). agentevals loads eval sets using `EvalSet.model_validate()` from ADK directly, so any valid ADK eval set works with agentevals and vice versa.
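
To run the same check from your own Python code, a sketch along these lines should work when `google-adk` is installed. The import path follows the ADK source file linked at the top of this document; the `parse_eval_set` wrapper and its behavior of returning `None` when ADK is missing are assumptions for illustration.

```python
def parse_eval_set(raw: dict):
    """Validate `raw` with ADK's EvalSet model.

    Returns the parsed model, or None if google-adk is not installed.
    Invalid data raises pydantic's ValidationError.
    """
    try:
        from google.adk.evaluation.eval_set import EvalSet
    except ImportError:
        return None  # ADK is not available in this environment
    return EvalSet.model_validate(raw)


raw = {
    "eval_set_id": "my-agent-eval",
    "eval_cases": [
        {
            "eval_id": "greeting",
            "conversation": [
                {
                    "invocation_id": "inv-1",
                    "user_content": {"role": "user", "parts": [{"text": "Hi! Can you help me?"}]},
                }
            ],
        }
    ],
}
parsed = parse_eval_set(raw)
```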
243+
244+
Fields specific to ADK's live evaluation flow (`conversation_scenario`, `session_input`, `final_session_state`) are accepted but not used by agentevals, which evaluates pre-recorded traces rather than running agents live.
