rename the key in config, add versioning

peterj · peterj · commit 05e3aac82ef1 · 2026-03-19T13:56:58.000-07:00
Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
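
Note on the compatibility rule: the docs added in this change state that the CLI and SDK are compatible as long as the major part of `protocol_version` matches. A minimal sketch of that check is below. The helper names `parse_protocol_version` and `is_compatible` are hypothetical and are not part of the CLI or the SDK; the decorator's actual warning logic is not shown in this diff.

```python
def parse_protocol_version(version: str) -> tuple[int, int]:
    """Split a "MAJOR.MINOR" string into two integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(sender: str, receiver: str) -> bool:
    """Hypothetical check: two sides are compatible when their majors match."""
    sender_major, _ = parse_protocol_version(sender)
    receiver_major, _ = parse_protocol_version(receiver)
    return sender_major == receiver_major


if __name__ == "__main__":
    # A CLI speaking 1.1 and an SDK speaking 1.0 share major version 1.
    print(is_compatible("1.1", "1.0"))
    # A major bump signals a breaking change.
    print(is_compatible("2.0", "1.0"))
```

The minor component is parsed but deliberately ignored, since minor bumps are described as additive only.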
diff --git a/docs/custom-evaluators.md b/docs/custom-evaluators.md
@@ -63,10 +63,11 @@ The `@evaluator` decorator marks your function as an evaluator. Call `.run()` to
 
 ```yaml
 # eval_config.yaml
-metrics:
-  - tool_trajectory_avg_score   # built-in metric
+evaluators:
+  - name: tool_trajectory_avg_score   # built-in metric
+    type: builtin
 
-  - name: response_quality      # your custom evaluator
+  - name: response_quality            # your custom evaluator
     type: code
     path: ./evaluators/response_quality.py
     threshold: 0.7
@@ -84,7 +85,7 @@ agentevals run traces/my_trace.json \
 
 ## Eval Config Reference
 
-Each custom evaluator entry in the `metrics` list uses the following fields:
+Each evaluator entry in the `evaluators` list uses the following fields:
 
 | Field | Required | Default | Description |
 |---|---|---|---|
@@ -103,6 +104,7 @@ Every evaluator — regardless of language — communicates via the same JSON pr
 
 ```json
 {
+  "protocol_version": "1.0",
   "metric_name": "response_quality",
   "threshold": 0.7,
   "config": { "min_length": 20 },
@@ -111,12 +113,14 @@ Every evaluator — regardless of language — communicates via the same JSON pr
       "invocation_id": "inv-001",
       "user_content": "What is 2+2?",
       "final_response": "The answer is 4.",
-      "tool_calls": [
-        { "name": "calculator", "args": { "expression": "2+2" } }
-      ],
-      "tool_responses": [
-        { "name": "calculator", "output": "4" }
-      ]
+      "intermediate_steps": {
+        "tool_calls": [
+          { "name": "calculator", "args": { "expression": "2+2" } }
+        ],
+        "tool_responses": [
+          { "name": "calculator", "output": "4" }
+        ]
+      }
     }
   ],
   "expected_invocations": null
@@ -125,6 +129,7 @@ Every evaluator — regardless of language — communicates via the same JSON pr
 
 | Field | Type | Description |
 |---|---|---|
+| `protocol_version` | string | Wire-format version (`"MAJOR.MINOR"`). Current: `"1.0"` |
 | `metric_name` | string | Name of this evaluator |
 | `threshold` | float | Pass/fail threshold |
 | `config` | object | User-provided config from the YAML |
@@ -138,6 +143,12 @@ Each invocation contains:
 | `invocation_id` | string | Unique turn identifier |
 | `user_content` | string | What the user said |
 | `final_response` | string or null | The agent's final response |
+| `intermediate_steps` | object | Steps between user input and final response |
+
+The `intermediate_steps` object contains:
+
+| Field | Type | Description |
+|---|---|---|
 | `tool_calls` | array | Tools the agent called |
 | `tool_responses` | array | Responses the agent received from tools |
 
@@ -159,6 +170,24 @@ Each invocation contains:
 | `per_invocation_scores` | no | Per-turn scores (same order as input invocations) |
 | `details` | no | Arbitrary metadata for debugging |
 
+### Protocol Versioning
+
+The `protocol_version` field uses `"MAJOR.MINOR"` format (currently `"1.0"`). This allows the CLI and SDK to evolve independently while maintaining compatibility:
+
+- **Additive only** -- new fields may be added to `EvalInput` or `EvalResult`; existing fields are never removed or renamed within the same major version.
+- **Defaults required** -- every new field must have a default value. Older deserializers silently ignore unknown fields (Pydantic's default behavior), so an evaluator built against an older SDK will still work with a newer CLI.
+- **MINOR bumps** -- additive changes (new optional fields). No action required by evaluator authors.
+- **MAJOR bumps** -- breaking changes (removed fields, type changes). The SDK's `@evaluator` decorator will log a warning if it sees a major version it does not recognize.
+
+The CLI and SDK are **independent packages**. Install them at whatever versions you need:
+
+```bash
+pip install agentevals            # CLI -- may speak protocol 1.1
+pip install agentevals-evaluator-sdk   # SDK -- may speak protocol 1.0
+```
+
+As long as the major version matches, they are compatible.
+
 ## Writing Evaluators in Other Languages
 
 You don't need the Python SDK. Any program that reads JSON from stdin and writes JSON to stdout works.
@@ -171,7 +200,7 @@ const input = JSON.parse(require("fs").readFileSync("/dev/stdin", "utf8"));
 
 let score = 1.0;
 for (const inv of input.invocations) {
-  if (inv.tool_calls.length === 0) {
+  if (inv.intermediate_steps.tool_calls.length === 0) {
     score -= 0.5;
   }
 }
@@ -183,9 +212,10 @@ console.log(JSON.stringify({
 ```
 
 ```yaml
-- name: tool_check
-  type: code
-  path: ./evaluators/tool_check.js
+evaluators:
+  - name: tool_check
+    type: code
+    path: ./evaluators/tool_check.js
 ```
 
 ### Any language
@@ -221,8 +251,9 @@ This shows evaluators from all registered sources: ADK built-in metrics and the
 You can reference evaluators from the community repository directly in your eval config. They are downloaded and cached automatically on first use.
 
 ```yaml
-metrics:
-  - tool_trajectory_avg_score
+evaluators:
+  - name: tool_trajectory_avg_score
+    type: builtin
 
   - name: response_quality
     type: remote
@@ -372,7 +403,7 @@ register_executor("docker", lambda path, timeout: DockerBackend(path, timeout))
 Users then set `executor: docker` in their config:
 
 ```yaml
-metrics:
+evaluators:
   - name: untrusted_evaluator
     type: code
     path: ./evaluators/untrusted.py
@@ -400,7 +431,7 @@ register_source(MyRegistrySource())
 Users can then reference evaluators from the new source:
 
 ```yaml
-metrics:
+evaluators:
   - name: my_evaluator
     type: remote
     source: my-registry
diff --git a/examples/custom_evaluators/eval_config.yaml b/examples/custom_evaluators/eval_config.yaml
@@ -5,9 +5,10 @@
 #     --config examples/custom_evaluators/eval_config.yaml \
 #     --eval-set samples/eval_set_helm.json
 
-metrics:
-  # Built-in metric (unchanged)
-  - tool_trajectory_avg_score
+evaluators:
+  # Built-in metric
+  - name: tool_trajectory_avg_score
+    type: builtin
 
   # Custom code evaluators (local scripts)
   - name: tool_call_checker
@@ -24,7 +25,7 @@ metrics:
     config:
       min_response_length: 20
 
-  # reference an evaluator from Github
+  # Reference an evaluator from GitHub
   - name: peters_evaluator
     type: remote
     source: github
diff --git a/examples/custom_evaluators/response_quality.py b/examples/custom_evaluators/response_quality.py
@@ -5,7 +5,7 @@
 
 Usage in eval_config.yaml:
 
-    metrics:
+    evaluators:
       - name: response_quality
         type: code
         path: ./examples/custom_evaluators/response_quality.py
diff --git a/examples/custom_evaluators/tool_call_checker.py b/examples/custom_evaluators/tool_call_checker.py
@@ -2,7 +2,7 @@
 
 Usage in eval_config.yaml:
 
-    metrics:
+    evaluators:
       - name: tool_call_checker
         type: code
         path: ./examples/custom_evaluators/tool_call_checker.py
@@ -20,7 +20,7 @@ def tool_call_checker(input: EvalInput) -> EvalResult:
     scores: list[float] = []
 
     for inv in input.invocations:
-        if len(inv.tool_calls) >= min_calls:
+        if len(inv.intermediate_steps.tool_calls) >= min_calls:
             scores.append(1.0)
         else:
             scores.append(0.0)
diff --git a/packages/evaluator-sdk-py/README.md b/packages/evaluator-sdk-py/README.md
@@ -37,7 +37,8 @@ The `@evaluator` decorator marks your function as a runnable evaluator. Call `.r
 
 - **`EvalInput`** -- input payload with `metric_name`, `threshold`, `config`, `invocations`, and optional `expected_invocations`
 - **`EvalResult`** -- output payload with `score` (0.0-1.0), optional `status`, `per_invocation_scores`, and `details` (dict)
-- **`InvocationData`** -- a single agent turn with `user_content`, `final_response`, `tool_calls`, and `tool_responses`
+- **`InvocationData`** -- a single agent turn with `user_content`, `final_response`, and `intermediate_steps`
+- **`IntermediateStepData`** -- the steps between user input and final response: `tool_calls` and `tool_responses`
 
 ## Documentation
 
diff --git a/packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/__init__.py b/packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/__init__.py
@@ -22,6 +22,7 @@ def my_evaluator(input: EvalInput) -> EvalResult:
 from .types import (
     EvalInput,
     EvalResult,
+    IntermediateStepData,
     InvocationData,
     ToolCallData,
     ToolResponseData,
@@ -31,6 +32,7 @@ def my_evaluator(input: EvalInput) -> EvalResult:
     "evaluator",
     "EvalInput",
     "EvalResult",
+    "IntermediateStepData",
     "InvocationData",
     "ToolCallData",
     "ToolResponseData",
diff --git a/packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/types.py b/packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/types.py
@@ -26,23 +26,35 @@ class ToolResponseData(BaseModel):
     output: str = ""
 
 
+class IntermediateStepData(BaseModel):
+    """The intermediate steps an agent took between receiving user input and
+    producing a final response — tool calls, tool responses, and (in the
+    future) reasoning traces, memory lookups, sub-agent calls, etc.
+
+    Mirrors the semantic role of ADK's ``IntermediateData`` without depending
+    on ADK types.
+    """
+
+    tool_calls: list[ToolCallData] = Field(default_factory=list)
+    tool_responses: list[ToolResponseData] = Field(default_factory=list)
+
+
 class InvocationData(BaseModel):
     """Simplified representation of a single agent invocation (turn).
 
-    This is a language-agnostic view of ADK's ``Invocation``, flattened into
-    plain strings and dicts so that script/container authors don't need ADK.
+    This is a language-agnostic view of ADK's ``Invocation`` so that
+    script/container authors don't need ADK.
     """
 
     invocation_id: str = ""
     user_content: str = ""
     final_response: Optional[str] = None
-    tool_calls: list[ToolCallData] = Field(default_factory=list)
-    tool_responses: list[ToolResponseData] = Field(default_factory=list)
-
+    intermediate_steps: IntermediateStepData = Field(default_factory=IntermediateStepData)
 
 class EvalInput(BaseModel):
     """Input payload sent to a custom evaluator script/container on stdin."""
 
+    protocol_version: str = "1.0"
     metric_name: str
     threshold: float = 0.5
     config: dict[str, Any] = Field(default_factory=dict)
diff --git a/pyproject.toml b/pyproject.toml
@@ -19,7 +19,6 @@ dependencies = [
     "opentelemetry-proto>=1.36.0",
     "pyyaml>=6.0",
     "httpx>=0.27.0",
-    "agentevals-evaluator-sdk>=0.1.0",
 ]
 
 [project.optional-dependencies]
@@ -47,9 +46,6 @@ no-build-isolation-package = ["rouge_score"]
 [tool.uv.workspace]
 members = ["packages/evaluator-sdk-py"]
 
-[tool.uv.sources]
-agentevals-evaluator-sdk = { workspace = true }
-
 [tool.pytest.ini_options]
 testpaths = ["tests"]
 pythonpath = ["src"]
diff --git a/src/agentevals/_protocol.py b/src/agentevals/_protocol.py
@@ -0,0 +1,74 @@
+"""CLI-internal protocol types for the custom evaluator JSON wire format.
+
+These mirror the types in ``agentevals_evaluator_sdk.types`` but are owned by
+the CLI so that the CLI and SDK packages can be versioned independently.  The
+JSON schema produced/consumed by these models is the contract — not the Python
+types themselves.
+
+Protocol versioning rules:
+- ``protocol_version`` uses ``"MAJOR.MINOR"`` format.
+- MINOR bumps are additive-only (new fields with defaults).  Old deserializers
+  silently ignore unknown fields.
+- MAJOR bumps signal breaking changes (removed/renamed fields, type changes).
+"""
+
+from __future__ import annotations
+
+from typing import Any, Optional
+
+from pydantic import BaseModel, Field
+
+PROTOCOL_VERSION = "1.0"
+
+
+class ToolCallData(BaseModel):
+    """A single tool call made by the agent."""
+
+    name: str
+    args: dict[str, Any] = Field(default_factory=dict)
+
+
+class ToolResponseData(BaseModel):
+    """A single tool response received by the agent."""
+
+    name: str
+    output: str = ""
+
+
+class IntermediateStepData(BaseModel):
+    """Intermediate steps between user input and final response."""
+
+    tool_calls: list[ToolCallData] = Field(default_factory=list)
+    tool_responses: list[ToolResponseData] = Field(default_factory=list)
+
+
+class InvocationData(BaseModel):
+    """Simplified, language-agnostic representation of a single agent turn."""
+
+    invocation_id: str = ""
+    user_content: str = ""
+    final_response: Optional[str] = None
+    intermediate_steps: IntermediateStepData = Field(default_factory=IntermediateStepData)
+
+
+class EvalInput(BaseModel):
+    """Input payload sent to a custom evaluator on stdin."""
+
+    protocol_version: str = PROTOCOL_VERSION
+    metric_name: str
+    threshold: float = 0.5
+    config: dict[str, Any] = Field(default_factory=dict)
+    invocations: list[InvocationData] = Field(default_factory=list)
+    expected_invocations: Optional[list[InvocationData]] = None
+
+
+class EvalResult(BaseModel):
+    """Output payload expected from a custom evaluator on stdout."""
+
+    score: float = Field(ge=0.0, le=1.0)
+    status: Optional[str] = Field(
+        default=None,
+        description='One of "PASSED", "FAILED", "NOT_EVALUATED". Derived from score vs threshold if omitted.',
+    )
+    per_invocation_scores: list[Optional[float]] = Field(default_factory=list)
+    details: Optional[dict[str, Any]] = None
diff --git a/src/agentevals/cli.py b/src/agentevals/cli.py
diff --git a/src/agentevals/custom_evaluators.py b/src/agentevals/custom_evaluators.py
diff --git a/src/agentevals/eval_config_loader.py b/src/agentevals/eval_config_loader.py
diff --git a/src/agentevals/evaluator/templates.py b/src/agentevals/evaluator/templates.py
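
Migration note: existing configs that use the old `metrics` key need to move to `evaluators`, with bare built-in names expanded to `name` plus `type: builtin`, as the YAML hunks above show. The sketch below illustrates that rewrite on an already-parsed config. `migrate_config` is a hypothetical helper and not part of this change; it assumes mapping entries carry over unchanged, and it makes no claim about whether the loader still accepts the old key.

```python
from typing import Any


def migrate_config(old: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: rewrite an old `metrics` config to `evaluators`."""
    if "metrics" not in old:
        # Nothing to migrate; return a shallow copy.
        return dict(old)

    new = {key: value for key, value in old.items() if key != "metrics"}
    evaluators: list[dict[str, Any]] = []
    for entry in old["metrics"]:
        if isinstance(entry, str):
            # Bare strings named built-in metrics in the old format.
            evaluators.append({"name": entry, "type": "builtin"})
        else:
            # Assumption: mapping entries keep their fields as-is.
            evaluators.append(dict(entry))
    new["evaluators"] = evaluators
    return new


if __name__ == "__main__":
    old_config = {
        "metrics": [
            "tool_trajectory_avg_score",
            {
                "name": "response_quality",
                "type": "code",
                "path": "./evaluators/response_quality.py",
                "threshold": 0.7,
            },
        ]
    }
    print(migrate_config(old_config))
```

Evaluator scripts need a matching change on their side: `inv.tool_calls` becomes `inv.intermediate_steps.tool_calls`, as in the `tool_call_checker.py` hunk.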