agentevals-dev
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/eval-set-format.md b/docs/eval-set-format.md
Lines changed: 3 additions & 3 deletions
diff --git a/examples/dice_agent/README.md b/examples/dice_agent/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/examples/kubernetes/README.md b/examples/kubernetes/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/src/agentevals/api/routes.py b/src/agentevals/api/routes.py
Lines changed: 28 additions & 87 deletions
diff --git a/src/agentevals/api/runs_routes.py b/src/agentevals/api/runs_routes.py
Lines changed: 2 additions & 0 deletions
diff --git a/src/agentevals/api/streaming_routes.py b/src/agentevals/api/streaming_routes.py
Lines changed: 7 additions & 9 deletions
@@ -50,7 +50,7 @@ Once running, submit a run with:
 ```bash
 curl -X POST http://localhost:8001/api/runs \
     -H 'content-type: application/json' \
-    -d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"metrics": ["tool_trajectory_avg_score"]}}}'
+    -d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"evaluators": [{"name": "tool_trajectory_avg_score", "type": "builtin"}]}}}'
 ```
 
 Then poll `GET /api/runs/{runId}` and `GET /api/runs/{runId}/results`. Without `storage.backend=postgres`, the `/api/runs` endpoints return 503 with a hint pointing at the env var.
 
@@ -250,7 +250,7 @@ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protoc
 agentevals serve            # bundled UI on http://localhost:8001
 ```
 
-Upload traces and eval sets, select metrics, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
+Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
 
 Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.
 
 
@@ -1,6 +1,6 @@
 # Eval Set Format
 
-An eval set is a JSON file containing golden reference data that metrics compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
+An eval set is a JSON file containing golden reference data that evaluators compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
 
 Most users will not need to author eval sets by hand. The web UI can generate them from live sessions (mark a session as golden, and the server builds the eval set automatically). This document is for users who want to create or edit eval sets directly, whether for CLI usage, CI pipelines, or version-controlled test suites.
 
@@ -203,9 +203,9 @@ The `parts` array can contain text, function calls, or function responses. Most
 
 Each `FunctionCall` has `name`, `args`, and `id`. Each `FunctionResponse` has `name`, `response`, and `id`. Match `id` values between calls and responses to pair them.
 
-## Which Metrics Use Eval Sets
+## Which Evaluators Use Eval Sets
 
-Not all metrics require an eval set. Use `agentevals list-metrics` to see which do:
+Not all evaluators require an eval set. Use `agentevals evaluator list --source builtin` to see which built-in evaluators do:
 
 | Metric | Needs Eval Set | What It Reads |
 |---|---|---|
 
@@ -149,7 +149,7 @@ Update `main.py` to test the new functionality.
 **After agent completes:**
 - Status changes to "EVALUATED"
 - Evaluation results appear as colored badges
-- Each metric shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
+- Each evaluator result shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
 
 **Multiple runs:**
 - Each run creates a new session with model name in ID
 
@@ -221,7 +221,7 @@ This captures the GPT-5 session's tool trajectory and final responses as the gol
 2. Select both sessions (the `gpt-4.1-mini` session and the `gpt-5` session)
 3. Click **Evaluate**
 4. Select the `helm-agent-comparison` eval set
-5. Choose the metrics:
+5. Choose the evaluators:
    - **tool_trajectory_avg_score**: Did the agent call the correct tools in the correct order?
    - **response_match_score**: Did the agent produce responses consistent with the golden reference?
 6. Run the evaluation
@@ -241,7 +241,7 @@ Compare the two sessions in the results table:
 
 <img width="1914" height="1154" alt="image" src="https://github.com/user-attachments/assets/5939a8d4-3775-4cf1-9cf2-d3b6b4afd582" />
 
-You can also click an individual conversation and see a breakdown of each evaluators.
+You can also click an individual conversation and see a breakdown of each evaluator.
 
 <img width="1916" height="1348" alt="image" src="https://github.com/user-attachments/assets/984b3d29-8018-4fcb-9036-bb7c6e97d9ff" />
 
 
@@ -18,14 +18,7 @@
 from agentevals import __version__
 
 from ..builtin_metrics import METRICS_NEEDING_EXPECTED, METRICS_NEEDING_GCP, METRICS_NEEDING_LLM
-from ..config import (
-    BuiltinMetricDef,
-    CodeEvaluatorDef,
-    CustomEvaluatorDef,
-    EvalParams,
-    EvalRunConfig,
-    OpenAIEvalDef,
-)
+from ..config import EvalParams, EvalRunConfig
 from ..converter import convert_traces
 from ..extraction import get_extractor
 from ..loader import load_traces
@@ -121,24 +114,6 @@ async def _maybe_persist_evaluate_run(
 
 _MAX_JSON_BODY_BYTES = 50 * 1024 * 1024  # 50 MB (multipart endpoints allow 10 MB per file)
 
-_TYPE_TO_MODEL = {
-    "builtin": BuiltinMetricDef,
-    "code": CodeEvaluatorDef,
-    "openai_eval": OpenAIEvalDef,
-}
-
-
-def _parse_custom_evaluators(raw: list[dict]) -> list[CustomEvaluatorDef]:
-    """Parse a list of custom evaluator dicts from the API config JSON."""
-    defs: list[CustomEvaluatorDef] = []
-    for entry in raw:
-        evaluator_type = entry.get("type", "builtin")
-        model_cls = _TYPE_TO_MODEL.get(evaluator_type)
-        if not model_cls:
-            raise ValueError(f"Unknown custom evaluator type: {evaluator_type}")
-        defs.append(model_cls.model_validate(entry))
-    return defs
-
 
 @router.get("/health", response_model=StandardResponse[HealthData])
 async def health_check():
@@ -489,10 +464,10 @@ async def evaluate_traces(
     eval_set_file: UploadFile | None = File(None),
 ):
     """
-    Evaluate agent traces using specified metrics.
+    Evaluate agent traces using the provided evaluator configuration.
 
     Args:
-        trace_files: List of Jaeger JSON trace files
+        trace_files: List of Jaeger or OTLP JSON trace files
         config: JSON string with evaluation configuration
         eval_set_file: Optional golden eval set file
 
@@ -556,40 +531,23 @@ async def evaluate_traces(
                     )
                 f.write(content)
 
-        metrics = config_dict.get("metrics", ["tool_trajectory_avg_score"])
-        if not metrics or not isinstance(metrics, list):
-            raise HTTPException(
-                status_code=400,
-                detail="Config must include 'metrics' as a non-empty array",
-            )
-
-        threshold = config_dict.get("threshold")
-        if threshold is not None and (threshold < 0 or threshold > 1):
-            raise HTTPException(
-                status_code=400,
-                detail="Threshold must be between 0 and 1",
+        try:
+            eval_config = EvalRunConfig.model_validate(
+                {
+                    **config_dict,
+                    "traceFiles": trace_paths,
+                    "evalSetFile": eval_set_path,
+                    "traceFormat": trace_format,
+                }
             )
+        except Exception as exc:
+            raise HTTPException(status_code=400, detail=f"Invalid config: {exc}") from exc
 
-        custom_evaluators: list[CustomEvaluatorDef] = []
-        raw_custom = config_dict.get("customEvaluators", config_dict.get("customMetrics", []))
-        if raw_custom:
-            try:
-                custom_evaluators = _parse_custom_evaluators(raw_custom)
-            except Exception as exc:
-                raise HTTPException(status_code=400, detail=f"Invalid customEvaluators: {exc}") from exc
-
-        eval_config = EvalRunConfig(
-            trace_files=trace_paths,
-            eval_set_file=eval_set_path,
-            metrics=metrics,
-            custom_evaluators=custom_evaluators,
-            trace_format=trace_format,
-            judge_model=config_dict.get("judgeModel"),
-            threshold=threshold,
-            trajectory_match_type=config_dict.get("trajectoryMatchType"),
+        logger.info(
+            "Evaluating %d trace file(s) with evaluators: %s",
+            len(trace_paths),
+            [e.name for e in eval_config.evaluators],
         )
-
-        logger.info(f"Evaluating {len(trace_paths)} trace file(s) with metrics: {metrics}")
         result = await run_evaluation(eval_config)
 
         run_id = await _maybe_persist_evaluate_run(
@@ -675,36 +633,19 @@ async def event_generator():
                         return
                     f.write(content)
 
-            metrics = config_dict.get("metrics", ["tool_trajectory_avg_score"])
-            if not metrics or not isinstance(metrics, list):
-                yield f"data: {SSEErrorEvent(error='Config must include metrics as a non-empty array').model_dump_json(by_alias=True)}\n\n"
-                return
-
-            threshold = config_dict.get("threshold")
-            if threshold is not None and (threshold < 0 or threshold > 1):
-                yield f"data: {SSEErrorEvent(error='Threshold must be between 0 and 1').model_dump_json(by_alias=True)}\n\n"
+            try:
+                eval_config = EvalRunConfig.model_validate(
+                    {
+                        **config_dict,
+                        "traceFiles": trace_paths,
+                        "evalSetFile": eval_set_path,
+                        "traceFormat": trace_format,
+                    }
+                )
+            except Exception as exc:
+                yield f"data: {SSEErrorEvent(error=f'Invalid config: {exc}').model_dump_json(by_alias=True)}\n\n"
                 return
 
-            custom_evaluators: list[CustomEvaluatorDef] = []
-            raw_custom = config_dict.get("customEvaluators", config_dict.get("customMetrics", []))
-            if raw_custom:
-                try:
-                    custom_evaluators = _parse_custom_evaluators(raw_custom)
-                except Exception as exc:
-                    yield f"data: {SSEErrorEvent(error=f'Invalid customEvaluators: {exc}').model_dump_json(by_alias=True)}\n\n"
-                    return
-
-            eval_config = EvalRunConfig(
-                trace_files=trace_paths,
-                eval_set_file=eval_set_path,
-                metrics=metrics,
-                custom_evaluators=custom_evaluators,
-                trace_format=trace_format,
-                judge_model=config_dict.get("judgeModel"),
-                threshold=threshold,
-                trajectory_match_type=config_dict.get("trajectoryMatchType"),
-            )
-
             for trace_file_path in trace_paths:
                 try:
                     traces = load_traces(trace_file_path, format=eval_config.trace_format)
 
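Both `model_validate` calls in `routes.py` above build their input by unpacking the client's `config_dict` first and then listing `traceFiles`, `evalSetFile`, and `traceFormat`. A minimal sketch of why that ordering matters, using plain dicts only: in a Python dict display, a later key replaces an earlier one, so the server-side values win over anything the client sent under the same key. The helper name `merge_server_fields` and the sample paths are hypothetical and do not appear in the PR.

```python
from typing import Optional


def merge_server_fields(
    client_config: dict,
    trace_paths: list,
    eval_set_path: Optional[str],
    trace_format: str,
) -> dict:
    """Merge client config with server-controlled fields (hypothetical helper).

    Keys listed after the ** unpacking replace same-named keys from
    client_config, so uploaded-file paths cannot be overridden by the client.
    """
    return {
        **client_config,
        "traceFiles": trace_paths,
        "evalSetFile": eval_set_path,
        "traceFormat": trace_format,
    }


# A client that tries to smuggle in its own traceFiles value.
client_config = {
    "evaluators": [{"name": "tool_trajectory_avg_score", "type": "builtin"}],
    "traceFiles": ["/somewhere/else.json"],
}

merged = merge_server_fields(
    client_config,
    trace_paths=["/tmp/upload/trace.json"],  # illustrative path
    eval_set_path=None,
    trace_format="otlp-json",
)
```

This only covers the exact key spellings listed. Whether `EvalRunConfig` also accepts other spellings of the same fields is not visible in this diff.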
@@ -76,6 +76,8 @@ async def submit_run(payload: RunRequest, request: Request):
     service = _service(request)
     try:
         run = await service.submit(run_id=payload.run_id, spec=payload.spec)
+    except ValueError as exc:
+        raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_CONTENT, detail=str(exc)) from exc
     except RunSubmitConflict as exc:
         raise HTTPException(
             status_code=status.HTTP_409_CONFLICT,
 
@@ -5,13 +5,13 @@
 import asyncio
 import json
 import logging
-from typing import TYPE_CHECKING, Literal
+from typing import TYPE_CHECKING
 
 from fastapi import APIRouter, Depends, HTTPException
 from fastapi.responses import FileResponse
-from pydantic import BaseModel
+from pydantic import BaseModel, ConfigDict, Field
 
-from ..config import EvalRunConfig
+from ..config import BuiltinMetricDef, EvalRunConfig, EvaluatorDef
 from ..converter import convert_traces
 from ..loader.otlp import OtlpJsonLoader
 from ..runner import run_evaluation
@@ -42,11 +42,11 @@ class CreateEvalSetRequest(BaseModel):
 
 
 class EvaluateSessionsRequest(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
     golden_session_id: str
     eval_set_id: str
-    metrics: list[str] = ["tool_trajectory_avg_score"]
-    judge_model: str = "gemini-2.5-flash"
-    trajectory_match_type: Literal["EXACT", "IN_ORDER", "ANY_ORDER"] | None = None
+    evaluators: list[EvaluatorDef] = Field(default_factory=lambda: [BuiltinMetricDef(name="tool_trajectory_avg_score")])
 
 
 class PrepareEvaluationRequest(BaseModel):
@@ -209,9 +209,7 @@ async def eval_one_session(session_id: str, session) -> SessionEvalResult:
                         trace_files=[str(trace_file)],
                         trace_format="otlp-json",
                         eval_set_file=eval_set_file.name,
-                        metrics=request.metrics,
-                        judge_model=request.judge_model,
-                        trajectory_match_type=request.trajectory_match_type,
+                        evaluators=request.evaluators,
                     )
 
                     eval_result = await run_evaluation(config)
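
Taken together, the hunks above replace the `metrics` list of names (plus `customEvaluators` / `customMetrics`) with a single `evaluators` list of typed entries such as `{"name": "tool_trajectory_avg_score", "type": "builtin"}`. For callers holding old-style config dicts, a rough migration sketch might look like the following. The function `migrate_legacy_config` is hypothetical and not part of agentevals. It assumes custom entries can sit in the same `evaluators` list unchanged, and it leaves other keys (for example `threshold`) untouched without claiming the new schema accepts them.

```python
from copy import deepcopy


def migrate_legacy_config(config: dict) -> dict:
    """Fold legacy metric keys into an "evaluators" list (hypothetical helper)."""
    migrated = deepcopy(config)  # do not mutate the caller's dict

    metric_names = migrated.pop("metrics", [])

    # The removed code read "customEvaluators" first and fell back to
    # "customMetrics"; mirror that precedence here.
    custom = migrated.pop("customEvaluators", None)
    if custom is None:
        custom = migrated.pop("customMetrics", [])
    else:
        migrated.pop("customMetrics", None)

    evaluators = list(migrated.get("evaluators", []))
    evaluators += [{"name": name, "type": "builtin"} for name in metric_names]
    # The removed parser treated a missing "type" as "builtin".
    evaluators += [{"type": "builtin", **entry} for entry in custom]

    migrated["evaluators"] = evaluators
    return migrated


legacy = {
    "metrics": ["tool_trajectory_avg_score"],
    "customMetrics": [{"name": "my_check", "type": "code"}],  # illustrative entry
    "threshold": 0.8,
}
migrated = migrate_legacy_config(legacy)
```

Because `EvaluateSessionsRequest` now sets `extra="forbid"`, a request to that endpoint that still carries `metrics` would presumably be rejected rather than silently ignored, so converting payloads before sending is the safer route.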