Commit b93ab07

Merge pull request #149 from agentevals-dev/peterj/consolidatemetricsandeval
BREAKING: consolidate 'metrics' and 'custom_evaluators' into `evaluators`
2 parents 094f9e8 + 9d9736c commit b93ab07

40 files changed

Lines changed: 719 additions & 420 deletions

DEVELOPMENT.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ Once running, submit a run with:
 ```bash
 curl -X POST http://localhost:8001/api/runs \
 -H 'content-type: application/json' \
--d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"metrics": ["tool_trajectory_avg_score"]}}}'
+-d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"evaluators": [{"name": "tool_trajectory_avg_score", "type": "builtin"}]}}}'
 ```
 
 Then poll `GET /api/runs/{runId}` and `GET /api/runs/{runId}/results`. Without `storage.backend=postgres`, the `/api/runs` endpoints return 503 with a hint pointing at the env var.
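
Reviewer note: because this change is breaking, existing clients that still send `"metrics": [...]` need to rewrite their payloads. The sketch below shows one way to do that, based only on the before/after curl example in this hunk. `migrate_eval_config` is a hypothetical helper, not part of agentevals, and it deliberately leaves every other key untouched because this diff does not show where settings such as `judgeModel` or `threshold` belong in the new schema.

```python
from typing import Any


def migrate_eval_config(old: dict[str, Any]) -> dict[str, Any]:
    """Rewrite a legacy evalConfig dict into the consolidated `evaluators` shape.

    Hypothetical helper for illustration; not part of agentevals.
    """
    # Copy everything except the legacy `metrics` key.
    new = {key: value for key, value in old.items() if key != "metrics"}

    # Keep any evaluators the caller already supplied.
    evaluators = list(old.get("evaluators", []))

    # Each legacy metric name becomes a built-in evaluator entry,
    # matching the shape used in the updated curl example.
    for name in old.get("metrics", []):
        evaluators.append({"name": name, "type": "builtin"})

    new["evaluators"] = evaluators
    return new


legacy = {"metrics": ["tool_trajectory_avg_score"]}
print(migrate_eval_config(legacy))
```

Running it on the old payload from this hunk yields the `evaluators` list shown on the added line.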

README.md

Lines changed: 1 addition & 1 deletion
@@ -250,7 +250,7 @@ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protoc
 agentevals serve # bundled UI on http://localhost:8001
 ```
 
-Upload traces and eval sets, select metrics, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
+Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
 
 Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.

docs/eval-set-format.md

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 # Eval Set Format
 
-An eval set is a JSON file containing golden reference data that metrics compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
+An eval set is a JSON file containing golden reference data that evaluators compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
 
 Most users will not need to author eval sets by hand. The web UI can generate them from live sessions (mark a session as golden, and the server builds the eval set automatically). This document is for users who want to create or edit eval sets directly, whether for CLI usage, CI pipelines, or version-controlled test suites.
 
@@ -203,9 +203,9 @@ The `parts` array can contain text, function calls, or function responses. Most
 
 Each `FunctionCall` has `name`, `args`, and `id`. Each `FunctionResponse` has `name`, `response`, and `id`. Match `id` values between calls and responses to pair them.
 
-## Which Metrics Use Eval Sets
+## Which Evaluators Use Eval Sets
 
-Not all metrics require an eval set. Use `agentevals list-metrics` to see which do:
+Not all evaluators require an eval set. Use `agentevals evaluator list --source builtin` to see which built-in evaluators do:
 
 | Metric | Needs Eval Set | What It Reads |
 |---|---|---|
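
Reviewer note: the context line in this hunk says to match `id` values between calls and responses. A minimal sketch of that pairing is below. It assumes the calls and responses have already been pulled out of the `parts` array as plain dicts with the fields the doc names (`name`, `args` or `response`, and `id`). The helper name and the sample tool names are illustrative only.

```python
from typing import Any, Optional


def pair_calls_with_responses(
    calls: list[dict[str, Any]],
    responses: list[dict[str, Any]],
) -> list[tuple[dict[str, Any], Optional[dict[str, Any]]]]:
    """Pair each function call with the response that shares its `id`.

    A call without a matching response is paired with None.
    """
    # Index responses by id so each lookup is O(1).
    responses_by_id = {response["id"]: response for response in responses}
    return [(call, responses_by_id.get(call["id"])) for call in calls]


# Illustrative data; the tool names are made up for this example.
calls = [
    {"name": "roll_die", "args": {"sides": 6}, "id": "call-1"},
    {"name": "check_prime", "args": {"nums": [4]}, "id": "call-2"},
]
responses = [
    {"name": "roll_die", "response": {"result": 4}, "id": "call-1"},
]

for call, response in pair_calls_with_responses(calls, responses):
    status = "unanswered" if response is None else "answered"
    print(call["name"], status)
```

Keeping unanswered calls in the output (paired with `None`) makes a missing response visible instead of silently dropping the call.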

examples/dice_agent/README.md

Lines changed: 1 addition & 1 deletion
@@ -149,7 +149,7 @@ Update `main.py` to test the new functionality.
 **After agent completes:**
 - Status changes to "EVALUATED"
 - Evaluation results appear as colored badges
-- Each metric shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
+- Each evaluator result shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
 
 **Multiple runs:**
 - Each run creates a new session with model name in ID

examples/kubernetes/README.md

Lines changed: 2 additions & 2 deletions
@@ -221,7 +221,7 @@ This captures the GPT-5 session's tool trajectory and final responses as the gol
 2. Select both sessions (the `gpt-4.1-mini` session and the `gpt-5` session)
 3. Click **Evaluate**
 4. Select the `helm-agent-comparison` eval set
-5. Choose the metrics:
+5. Choose the evaluators:
 - **tool_trajectory_avg_score**: Did the agent call the correct tools in the correct order?
 - **response_match_score**: Did the agent produce responses consistent with the golden reference?
 6. Run the evaluation
@@ -241,7 +241,7 @@ Compare the two sessions in the results table:
 
 <img width="1914" height="1154" alt="image" src="https://github.com/user-attachments/assets/5939a8d4-3775-4cf1-9cf2-d3b6b4afd582" />
 
-You can also click an individual conversation and see a breakdown of each evaluators.
+You can also click an individual conversation and see a breakdown of each evaluator.
 
 <img width="1916" height="1348" alt="image" src="https://github.com/user-attachments/assets/984b3d29-8018-4fcb-9036-bb7c6e97d9ff" />

src/agentevals/api/routes.py

Lines changed: 28 additions & 87 deletions
@@ -18,14 +18,7 @@
 from agentevals import __version__
 
 from ..builtin_metrics import METRICS_NEEDING_EXPECTED, METRICS_NEEDING_GCP, METRICS_NEEDING_LLM
-from ..config import (
-    BuiltinMetricDef,
-    CodeEvaluatorDef,
-    CustomEvaluatorDef,
-    EvalParams,
-    EvalRunConfig,
-    OpenAIEvalDef,
-)
+from ..config import EvalParams, EvalRunConfig
 from ..converter import convert_traces
 from ..extraction import get_extractor
 from ..loader import load_traces
@@ -121,24 +114,6 @@ async def _maybe_persist_evaluate_run(
 
 _MAX_JSON_BODY_BYTES = 50 * 1024 * 1024 # 50 MB (multipart endpoints allow 10 MB per file)
 
-_TYPE_TO_MODEL = {
-    "builtin": BuiltinMetricDef,
-    "code": CodeEvaluatorDef,
-    "openai_eval": OpenAIEvalDef,
-}
-
-
-def _parse_custom_evaluators(raw: list[dict]) -> list[CustomEvaluatorDef]:
-    """Parse a list of custom evaluator dicts from the API config JSON."""
-    defs: list[CustomEvaluatorDef] = []
-    for entry in raw:
-        evaluator_type = entry.get("type", "builtin")
-        model_cls = _TYPE_TO_MODEL.get(evaluator_type)
-        if not model_cls:
-            raise ValueError(f"Unknown custom evaluator type: {evaluator_type}")
-        defs.append(model_cls.model_validate(entry))
-    return defs
-
 
 @router.get("/health", response_model=StandardResponse[HealthData])
 async def health_check():
@@ -489,10 +464,10 @@ async def evaluate_traces(
     eval_set_file: UploadFile | None = File(None),
 ):
     """
-    Evaluate agent traces using specified metrics.
+    Evaluate agent traces using the provided evaluator configuration.
 
     Args:
-        trace_files: List of Jaeger JSON trace files
+        trace_files: List of Jaeger or OTLP JSON trace files
         config: JSON string with evaluation configuration
         eval_set_file: Optional golden eval set file
 
@@ -556,40 +531,23 @@ async def evaluate_traces(
             )
         f.write(content)
 
-    metrics = config_dict.get("metrics", ["tool_trajectory_avg_score"])
-    if not metrics or not isinstance(metrics, list):
-        raise HTTPException(
-            status_code=400,
-            detail="Config must include 'metrics' as a non-empty array",
-        )
-
-    threshold = config_dict.get("threshold")
-    if threshold is not None and (threshold < 0 or threshold > 1):
-        raise HTTPException(
-            status_code=400,
-            detail="Threshold must be between 0 and 1",
+    try:
+        eval_config = EvalRunConfig.model_validate(
+            {
+                **config_dict,
+                "traceFiles": trace_paths,
+                "evalSetFile": eval_set_path,
+                "traceFormat": trace_format,
+            }
         )
+    except Exception as exc:
+        raise HTTPException(status_code=400, detail=f"Invalid config: {exc}") from exc
 
-    custom_evaluators: list[CustomEvaluatorDef] = []
-    raw_custom = config_dict.get("customEvaluators", config_dict.get("customMetrics", []))
-    if raw_custom:
-        try:
-            custom_evaluators = _parse_custom_evaluators(raw_custom)
-        except Exception as exc:
-            raise HTTPException(status_code=400, detail=f"Invalid customEvaluators: {exc}") from exc
-
-    eval_config = EvalRunConfig(
-        trace_files=trace_paths,
-        eval_set_file=eval_set_path,
-        metrics=metrics,
-        custom_evaluators=custom_evaluators,
-        trace_format=trace_format,
-        judge_model=config_dict.get("judgeModel"),
-        threshold=threshold,
-        trajectory_match_type=config_dict.get("trajectoryMatchType"),
+    logger.info(
+        "Evaluating %d trace file(s) with evaluators: %s",
+        len(trace_paths),
+        [e.name for e in eval_config.evaluators],
     )
-
-    logger.info(f"Evaluating {len(trace_paths)} trace file(s) with metrics: {metrics}")
     result = await run_evaluation(eval_config)
 
     run_id = await _maybe_persist_evaluate_run(
@@ -675,36 +633,19 @@ async def event_generator():
             return
         f.write(content)
 
-    metrics = config_dict.get("metrics", ["tool_trajectory_avg_score"])
-    if not metrics or not isinstance(metrics, list):
-        yield f"data: {SSEErrorEvent(error='Config must include metrics as a non-empty array').model_dump_json(by_alias=True)}\n\n"
-        return
-
-    threshold = config_dict.get("threshold")
-    if threshold is not None and (threshold < 0 or threshold > 1):
-        yield f"data: {SSEErrorEvent(error='Threshold must be between 0 and 1').model_dump_json(by_alias=True)}\n\n"
+    try:
+        eval_config = EvalRunConfig.model_validate(
+            {
+                **config_dict,
+                "traceFiles": trace_paths,
+                "evalSetFile": eval_set_path,
+                "traceFormat": trace_format,
+            }
+        )
+    except Exception as exc:
+        yield f"data: {SSEErrorEvent(error=f'Invalid config: {exc}').model_dump_json(by_alias=True)}\n\n"
         return
 
-    custom_evaluators: list[CustomEvaluatorDef] = []
-    raw_custom = config_dict.get("customEvaluators", config_dict.get("customMetrics", []))
-    if raw_custom:
-        try:
-            custom_evaluators = _parse_custom_evaluators(raw_custom)
-        except Exception as exc:
-            yield f"data: {SSEErrorEvent(error=f'Invalid customEvaluators: {exc}').model_dump_json(by_alias=True)}\n\n"
-            return
-
-    eval_config = EvalRunConfig(
-        trace_files=trace_paths,
-        eval_set_file=eval_set_path,
-        metrics=metrics,
-        custom_evaluators=custom_evaluators,
-        trace_format=trace_format,
-        judge_model=config_dict.get("judgeModel"),
-        threshold=threshold,
-        trajectory_match_type=config_dict.get("trajectoryMatchType"),
-    )
-
     for trace_file_path in trace_paths:
         try:
             traces = load_traces(trace_file_path, format=eval_config.trace_format)
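
Reviewer note: both endpoints in this file now build the config by spreading the client's `config_dict` first and then writing `traceFiles`, `evalSetFile`, and `traceFormat` on top. In a Python dict display, later keys replace earlier ones, so the server-controlled values win even if the client supplied the same keys. The stdlib-only sketch below illustrates that precedence and the `data: ...\n\n` framing used for streamed errors. Both helpers are hypothetical, and the `{"error": ...}` payload is only a stand-in for whatever `SSEErrorEvent.model_dump_json(by_alias=True)` actually emits.

```python
import json
from typing import Any, Optional


def merge_server_fields(
    config_dict: dict[str, Any],
    trace_paths: list[str],
    eval_set_path: Optional[str],
    trace_format: str,
) -> dict[str, Any]:
    """Overlay server-controlled fields on top of the client-supplied config."""
    # Later keys win, so client values for these three keys are discarded.
    return {
        **config_dict,
        "traceFiles": trace_paths,
        "evalSetFile": eval_set_path,
        "traceFormat": trace_format,
    }


def sse_error_frame(message: str) -> str:
    """Format an error as one server-sent event (stand-in payload shape)."""
    payload = json.dumps({"error": message})
    # An SSE event is a `data:` line terminated by a blank line.
    return f"data: {payload}\n\n"


client_config = {
    "evaluators": [{"name": "tool_trajectory_avg_score", "type": "builtin"}],
    "traceFiles": ["/not/allowed.json"],  # ignored: the server value wins
}
merged = merge_server_fields(client_config, ["/tmp/upload/trace.json"], None, "otlp-json")
print(merged["traceFiles"])
print(repr(sse_error_frame("Invalid config: boom")))
```

The ordering matters: if `**config_dict` came last instead, a client could point the evaluation at arbitrary file paths.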

src/agentevals/api/runs_routes.py

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,8 @@ async def submit_run(payload: RunRequest, request: Request):
     service = _service(request)
     try:
         run = await service.submit(run_id=payload.run_id, spec=payload.spec)
+    except ValueError as exc:
+        raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_CONTENT, detail=str(exc)) from exc
     except RunSubmitConflict as exc:
         raise HTTPException(
             status_code=status.HTTP_409_CONFLICT,

src/agentevals/api/streaming_routes.py

Lines changed: 7 additions & 9 deletions
@@ -5,13 +5,13 @@
 import asyncio
 import json
 import logging
-from typing import TYPE_CHECKING, Literal
+from typing import TYPE_CHECKING
 
 from fastapi import APIRouter, Depends, HTTPException
 from fastapi.responses import FileResponse
-from pydantic import BaseModel
+from pydantic import BaseModel, ConfigDict, Field
 
-from ..config import EvalRunConfig
+from ..config import BuiltinMetricDef, EvalRunConfig, EvaluatorDef
 from ..converter import convert_traces
 from ..loader.otlp import OtlpJsonLoader
 from ..runner import run_evaluation
@@ -42,11 +42,11 @@ class CreateEvalSetRequest(BaseModel):
 
 
 class EvaluateSessionsRequest(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
     golden_session_id: str
     eval_set_id: str
-    metrics: list[str] = ["tool_trajectory_avg_score"]
-    judge_model: str = "gemini-2.5-flash"
-    trajectory_match_type: Literal["EXACT", "IN_ORDER", "ANY_ORDER"] | None = None
+    evaluators: list[EvaluatorDef] = Field(default_factory=lambda: [BuiltinMetricDef(name="tool_trajectory_avg_score")])
 
 
 class PrepareEvaluationRequest(BaseModel):
@@ -209,9 +209,7 @@ async def eval_one_session(session_id: str, session) -> SessionEvalResult:
         trace_files=[str(trace_file)],
         trace_format="otlp-json",
         eval_set_file=eval_set_file.name,
-        metrics=request.metrics,
-        judge_model=request.judge_model,
-        trajectory_match_type=request.trajectory_match_type,
+        evaluators=request.evaluators,
     )
 
     eval_result = await run_evaluation(config)
