
Commit 05e3aac

rename the key in config, add versioning
Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
1 parent cd6620b commit 05e3aac

13 files changed

Lines changed: 198 additions & 72 deletions

docs/custom-evaluators.md

Lines changed: 49 additions & 18 deletions
@@ -63,10 +63,11 @@ The `@evaluator` decorator marks your function as an evaluator. Call `.run()` to

 ```yaml
 # eval_config.yaml
-metrics:
-  - tool_trajectory_avg_score # built-in metric
+evaluators:
+  - name: tool_trajectory_avg_score # built-in metric
+    type: builtin

-  - name: response_quality # your custom evaluator
+  - name: response_quality # your custom evaluator
     type: code
     path: ./evaluators/response_quality.py
     threshold: 0.7
@@ -84,7 +85,7 @@ agentevals run traces/my_trace.json \

 ## Eval Config Reference

-Each custom evaluator entry in the `metrics` list uses the following fields:
+Each evaluator entry in the `evaluators` list uses the following fields:

 | Field | Required | Default | Description |
 |---|---|---|---|
@@ -103,6 +104,7 @@ Every evaluator — regardless of language — communicates via the same JSON pr

 ```json
 {
+  "protocol_version": "1.0",
   "metric_name": "response_quality",
   "threshold": 0.7,
   "config": { "min_length": 20 },
@@ -111,12 +113,14 @@ Every evaluator — regardless of language — communicates via the same JSON pr
       "invocation_id": "inv-001",
       "user_content": "What is 2+2?",
       "final_response": "The answer is 4.",
-      "tool_calls": [
-        { "name": "calculator", "args": { "expression": "2+2" } }
-      ],
-      "tool_responses": [
-        { "name": "calculator", "output": "4" }
-      ]
+      "intermediate_steps": {
+        "tool_calls": [
+          { "name": "calculator", "args": { "expression": "2+2" } }
+        ],
+        "tool_responses": [
+          { "name": "calculator", "output": "4" }
+        ]
+      }
     }
   ],
   "expected_invocations": null
@@ -125,6 +129,7 @@ Every evaluator — regardless of language — communicates via the same JSON pr

 | Field | Type | Description |
 |---|---|---|
+| `protocol_version` | string | Wire-format version (`"MAJOR.MINOR"`). Current: `"1.0"` |
 | `metric_name` | string | Name of this evaluator |
 | `threshold` | float | Pass/fail threshold |
 | `config` | object | User-provided config from the YAML |
@@ -138,6 +143,12 @@ Each invocation contains:
 | `invocation_id` | string | Unique turn identifier |
 | `user_content` | string | What the user said |
 | `final_response` | string or null | The agent's final response |
+| `intermediate_steps` | object | Steps between user input and final response |
+
+The `intermediate_steps` object contains:
+
+| Field | Type | Description |
+|---|---|---|
 | `tool_calls` | array | Tools the agent called |
 | `tool_responses` | array | Responses the agent received from tools |

@@ -159,6 +170,24 @@ Each invocation contains:
 | `per_invocation_scores` | no | Per-turn scores (same order as input invocations) |
 | `details` | no | Arbitrary metadata for debugging |

+### Protocol Versioning
+
+The `protocol_version` field uses `"MAJOR.MINOR"` format (currently `"1.0"`). This allows the CLI and SDK to evolve independently while maintaining compatibility:
+
+- **Additive only** -- new fields may be added to `EvalInput` or `EvalResult`; existing fields are never removed or renamed within the same major version.
+- **Defaults required** -- every new field must have a default value. Older deserializers silently ignore unknown fields (Pydantic's default behavior), so an evaluator built against an older SDK will still work with a newer CLI.
+- **MINOR bumps** -- additive changes (new optional fields). No action required by evaluator authors.
+- **MAJOR bumps** -- breaking changes (removed fields, type changes). The SDK's `@evaluator` decorator will log a warning if it sees a major version it does not recognize.
+
+The CLI and SDK are **independent packages**. Install them at whatever versions you need:
+
+```bash
+pip install agentevals # CLI -- may speak protocol 1.1
+pip install agentevals-evaluator-sdk # SDK -- may speak protocol 1.0
+```
+
+As long as the major version matches, they are compatible.
+
 ## Writing Evaluators in Other Languages

 You don't need the Python SDK. Any program that reads JSON from stdin and writes JSON to stdout works.
@@ -171,7 +200,7 @@ const input = JSON.parse(require("fs").readFileSync("/dev/stdin", "utf8"));

 let score = 1.0;
 for (const inv of input.invocations) {
-  if (inv.tool_calls.length === 0) {
+  if (inv.intermediate_steps.tool_calls.length === 0) {
     score -= 0.5;
   }
 }
@@ -183,9 +212,10 @@ console.log(JSON.stringify({
 ```

 ```yaml
-- name: tool_check
-  type: code
-  path: ./evaluators/tool_check.js
+evaluators:
+  - name: tool_check
+    type: code
+    path: ./evaluators/tool_check.js
 ```

 ### Any language
@@ -221,8 +251,9 @@ This shows evaluators from all registered sources: ADK built-in metrics and the
 You can reference evaluators from the community repository directly in your eval config. They are downloaded and cached automatically on first use.

 ```yaml
-metrics:
-  - tool_trajectory_avg_score
+evaluators:
+  - name: tool_trajectory_avg_score
+    type: builtin

   - name: response_quality
     type: remote
@@ -372,7 +403,7 @@ register_executor("docker", lambda path, timeout: DockerBackend(path, timeout))
 Users then set `executor: docker` in their config:

 ```yaml
-metrics:
+evaluators:
   - name: untrusted_evaluator
     type: code
     path: ./evaluators/untrusted.py
@@ -400,7 +431,7 @@ register_source(MyRegistrySource())
 Users can then reference evaluators from the new source:

 ```yaml
-metrics:
+evaluators:
   - name: my_evaluator
     type: remote
     source: my-registry
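
Reviewer note: the docs changes above move `tool_calls` and `tool_responses` under `intermediate_steps` and state that any program reading JSON on stdin and writing JSON on stdout can act as an evaluator. A minimal stdlib-only Python counterpart to the Node example might look like the sketch below. The names `score_payload` and `main` are hypothetical, the 0.5 penalty is copied from the JS example, and clamping at 0.0 is an assumption made so the score stays inside the 0.0-1.0 range that `EvalResult` enforces.

```python
import json
import sys


def score_payload(payload: dict) -> dict:
    """Build an EvalResult-shaped dict from an EvalInput-shaped dict.

    Hypothetical rule: subtract 0.5 for every invocation without tool calls.
    """
    score = 1.0
    per_invocation = []
    for inv in payload.get("invocations", []):
        # Tool calls live under `intermediate_steps` in the new wire format.
        steps = inv.get("intermediate_steps") or {}
        used_tools = len(steps.get("tool_calls", [])) > 0
        per_invocation.append(1.0 if used_tools else 0.0)
        if not used_tools:
            score -= 0.5
    return {
        "score": max(0.0, score),  # assumption: clamp to stay within [0.0, 1.0]
        "per_invocation_scores": per_invocation,
    }


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read the input JSON from stdin and write the result JSON to stdout."""
    json.dump(score_payload(json.load(stdin)), stdout)


# To use this as a script, call main() under an `if __name__ == "__main__":` guard.
```

Unknown fields such as `protocol_version` are simply ignored here, which matches the additive-only rule described in the Protocol Versioning section.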

examples/custom_evaluators/eval_config.yaml

Lines changed: 5 additions & 4 deletions
@@ -5,9 +5,10 @@
 # --config examples/custom_evaluators/eval_config.yaml \
 # --eval-set samples/eval_set_helm.json

-metrics:
-  # Built-in metric (unchanged)
-  - tool_trajectory_avg_score
+evaluators:
+  # Built-in metric
+  - name: tool_trajectory_avg_score
+    type: builtin

   # Custom code evaluators (local scripts)
   - name: tool_call_checker
@@ -24,7 +25,7 @@ metrics:
     config:
       min_response_length: 20

-  # reference an evaluator from Github
+  # Reference an evaluator from Github
   - name: peters_evaluator
     type: remote
     source: github
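
Reviewer note: existing configs need the same two changes shown in this file, namely renaming `metrics` to `evaluators` and expanding bare built-in names into `name`/`type: builtin` mappings. The sketch below applies that rewrite to an already-parsed config dict. `migrate_eval_config` is a hypothetical helper, not part of agentevals, and this diff does not show whether the CLI still accepts the old `metrics` key.

```python
from typing import Any


def migrate_eval_config(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a parsed eval config using the new `evaluators` key.

    Bare-string entries (the old shorthand for built-in metrics) become
    {"name": ..., "type": "builtin"}; mapping entries are copied unchanged.
    """
    if "metrics" not in config:
        return dict(config)  # already migrated, or nothing to migrate

    migrated = {key: value for key, value in config.items() if key != "metrics"}
    evaluators = []
    for entry in config["metrics"]:
        if isinstance(entry, str):
            evaluators.append({"name": entry, "type": "builtin"})
        else:
            evaluators.append(dict(entry))
    migrated["evaluators"] = evaluators
    return migrated
```

The input dict is left untouched, so the result can be compared against the original before being written back with a YAML library of choice.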

examples/custom_evaluators/response_quality.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@

 Usage in eval_config.yaml:

-    metrics:
+    evaluators:
       - name: response_quality
         type: code
         path: ./examples/custom_evaluators/response_quality.py

examples/custom_evaluators/tool_call_checker.py

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@

 Usage in eval_config.yaml:

-    metrics:
+    evaluators:
       - name: tool_call_checker
         type: code
         path: ./examples/custom_evaluators/tool_call_checker.py
@@ -20,7 +20,7 @@ def tool_call_checker(input: EvalInput) -> EvalResult:
     scores: list[float] = []

     for inv in input.invocations:
-        if len(inv.tool_calls) >= min_calls:
+        if len(inv.intermediate_steps.tool_calls) >= min_calls:
             scores.append(1.0)
         else:
             scores.append(0.0)
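
Reviewer note: evaluators that parse the raw JSON themselves, rather than going through the SDK models, will break on the move from `inv.tool_calls` to `inv.intermediate_steps.tool_calls`. One way to tolerate both layouts during a transition is sketched below. `get_tool_calls` is a hypothetical helper working on plain dicts; the fallback to the flat layout is an assumption about what older payloads look like, based on the lines removed in this commit.

```python
from typing import Any


def get_tool_calls(invocation: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the tool calls of one invocation dict, whichever layout it uses."""
    steps = invocation.get("intermediate_steps")
    if isinstance(steps, dict):
        # New layout: tool calls are nested under `intermediate_steps`.
        return list(steps.get("tool_calls", []))
    # Old layout (before this commit): tool calls sit directly on the invocation.
    return list(invocation.get("tool_calls", []))
```

Once every producer emits the nested layout, the fallback branch can be dropped.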

packages/evaluator-sdk-py/README.md

Lines changed: 2 additions & 1 deletion
@@ -37,7 +37,8 @@ The `@evaluator` decorator marks your function as a runnable evaluator. Call `.r

 - **`EvalInput`** -- input payload with `metric_name`, `threshold`, `config`, `invocations`, and optional `expected_invocations`
 - **`EvalResult`** -- output payload with `score` (0.0-1.0), optional `status`, `per_invocation_scores`, and `details` (dict)
-- **`InvocationData`** -- a single agent turn with `user_content`, `final_response`, `tool_calls`, and `tool_responses`
+- **`InvocationData`** -- a single agent turn with `user_content`, `final_response`, and `intermediate_steps`
+- **`IntermediateStepData`** -- the steps between user input and final response: `tool_calls` and `tool_responses`

 ## Documentation

packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@ def my_evaluator(input: EvalInput) -> EvalResult:
 from .types import (
     EvalInput,
     EvalResult,
+    IntermediateStepData,
     InvocationData,
     ToolCallData,
     ToolResponseData,
@@ -31,6 +32,7 @@ def my_evaluator(input: EvalInput) -> EvalResult:
     "evaluator",
     "EvalInput",
     "EvalResult",
+    "IntermediateStepData",
     "InvocationData",
     "ToolCallData",
     "ToolResponseData",

packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/types.py

Lines changed: 17 additions & 5 deletions
@@ -26,23 +26,35 @@ class ToolResponseData(BaseModel):
     output: str = ""


+class IntermediateStepData(BaseModel):
+    """The intermediate steps an agent took between receiving user input and
+    producing a final response — tool calls, tool responses, and (in the
+    future) reasoning traces, memory lookups, sub-agent calls, etc.
+
+    Mirrors the semantic role of ADK's ``IntermediateData`` without depending
+    on ADK types.
+    """
+
+    tool_calls: list[ToolCallData] = Field(default_factory=list)
+    tool_responses: list[ToolResponseData] = Field(default_factory=list)
+
+
 class InvocationData(BaseModel):
     """Simplified representation of a single agent invocation (turn).

-    This is a language-agnostic view of ADK's ``Invocation``, flattened into
-    plain strings and dicts so that script/container authors don't need ADK.
+    This is a language-agnostic view of ADK's ``Invocation`` so that
+    script/container authors don't need ADK.
     """

     invocation_id: str = ""
     user_content: str = ""
     final_response: Optional[str] = None
-    tool_calls: list[ToolCallData] = Field(default_factory=list)
-    tool_responses: list[ToolResponseData] = Field(default_factory=list)
-
+    intermediate_steps: IntermediateStepData = Field(default_factory=IntermediateStepData)

 class EvalInput(BaseModel):
     """Input payload sent to a custom evaluator script/container on stdin."""

+    protocol_version: str = "1.0"
     metric_name: str
     threshold: float = 0.5
     config: dict[str, Any] = Field(default_factory=dict)

pyproject.toml

Lines changed: 0 additions & 4 deletions
@@ -19,7 +19,6 @@ dependencies = [
     "opentelemetry-proto>=1.36.0",
     "pyyaml>=6.0",
     "httpx>=0.27.0",
-    "agentevals-evaluator-sdk>=0.1.0",
 ]

 [project.optional-dependencies]
@@ -47,9 +46,6 @@ no-build-isolation-package = ["rouge_score"]
 [tool.uv.workspace]
 members = ["packages/evaluator-sdk-py"]

-[tool.uv.sources]
-agentevals-evaluator-sdk = { workspace = true }
-
 [tool.pytest.ini_options]
 testpaths = ["tests"]
 pythonpath = ["src"]

src/agentevals/_protocol.py

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+"""CLI-internal protocol types for the custom evaluator JSON wire format.
+
+These mirror the types in ``agentevals_evaluator_sdk.types`` but are owned by
+the CLI so that the CLI and SDK packages can be versioned independently. The
+JSON schema produced/consumed by these models is the contract — not the Python
+types themselves.
+
+Protocol versioning rules:
+- ``protocol_version`` uses ``"MAJOR.MINOR"`` format.
+- MINOR bumps are additive-only (new fields with defaults). Old deserializers
+  silently ignore unknown fields.
+- MAJOR bumps signal breaking changes (removed/renamed fields, type changes).
+"""
+
+from __future__ import annotations
+
+from typing import Any, Optional
+
+from pydantic import BaseModel, Field
+
+PROTOCOL_VERSION = "1.0"
+
+
+class ToolCallData(BaseModel):
+    """A single tool call made by the agent."""
+
+    name: str
+    args: dict[str, Any] = Field(default_factory=dict)
+
+
+class ToolResponseData(BaseModel):
+    """A single tool response received by the agent."""
+
+    name: str
+    output: str = ""
+
+
+class IntermediateStepData(BaseModel):
+    """Intermediate steps between user input and final response."""
+
+    tool_calls: list[ToolCallData] = Field(default_factory=list)
+    tool_responses: list[ToolResponseData] = Field(default_factory=list)
+
+
+class InvocationData(BaseModel):
+    """Simplified, language-agnostic representation of a single agent turn."""
+
+    invocation_id: str = ""
+    user_content: str = ""
+    final_response: Optional[str] = None
+    intermediate_steps: IntermediateStepData = Field(default_factory=IntermediateStepData)
+
+
+class EvalInput(BaseModel):
+    """Input payload sent to a custom evaluator on stdin."""
+
+    protocol_version: str = PROTOCOL_VERSION
+    metric_name: str
+    threshold: float = 0.5
+    config: dict[str, Any] = Field(default_factory=dict)
+    invocations: list[InvocationData] = Field(default_factory=list)
+    expected_invocations: Optional[list[InvocationData]] = None
+
+
+class EvalResult(BaseModel):
+    """Output payload expected from a custom evaluator on stdout."""
+
+    score: float = Field(ge=0.0, le=1.0)
+    status: Optional[str] = Field(
+        default=None,
+        description='One of "PASSED", "FAILED", "NOT_EVALUATED". Derived from score vs threshold if omitted.',
+    )
+    per_invocation_scores: list[Optional[float]] = Field(default_factory=list)
+    details: Optional[dict[str, Any]] = None
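
Reviewer note: the versioning rules in the module docstring reduce to "compare the MAJOR component, ignore MINOR". The sketch below shows one way an evaluator could apply that rule to an incoming `protocol_version` string. It is not the SDK's actual check (the `@evaluator` decorator's implementation is not part of this diff); `SUPPORTED_MAJOR`, `parse_protocol_version`, and `is_compatible` are hypothetical names.

```python
import warnings

# Assumption: this evaluator was written against protocol 1.x.
SUPPORTED_MAJOR = 1


def parse_protocol_version(version: str) -> tuple[int, int]:
    """Split a "MAJOR.MINOR" string into integers; raises ValueError if malformed."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(version: str, supported_major: int = SUPPORTED_MAJOR) -> bool:
    """Return True when the major versions match; warn and return False otherwise."""
    major, _minor = parse_protocol_version(version)
    if major != supported_major:
        warnings.warn(
            f"Unrecognized protocol major version {major} "
            f"(this evaluator supports {supported_major}.x)"
        )
        return False
    return True
```

With these definitions, a CLI speaking `"1.1"` and an evaluator built for `"1.0"` are treated as compatible, while `"2.0"` triggers the warning.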
