Commit 90e29de

merge: sync with upstream main, add score_model alongside label_model
2 parents f95e793 + 9c39e64 commit 90e29de

55 files changed

Lines changed: 1277 additions & 476 deletions

DEVELOPMENT.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -50,7 +50,7 @@ Once running, submit a run with:
5050
```bash
5151
curl -X POST http://localhost:8001/api/runs \
5252
-H 'content-type: application/json' \
53-
-d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"metrics": ["tool_trajectory_avg_score"]}}}'
53+
-d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"evaluators": [{"name": "tool_trajectory_avg_score", "type": "builtin"}]}}}'
5454
```
5555

5656
Then poll `GET /api/runs/{runId}` and `GET /api/runs/{runId}/results`. Without `storage.backend=postgres`, the `/api/runs` endpoints return 503 with a hint pointing at the env var.
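
Reviewer note: the hunk above replaces the `metrics` list in `evalConfig` with an `evaluators` list of objects. Below is a minimal sketch of that rewrite for callers that still build the old payload. The helper name `metrics_to_evaluators` is hypothetical, and the sketch assumes every legacy metric name maps to `"type": "builtin"`, which the diff confirms only for `tool_trajectory_avg_score`.

```python
import json


def metrics_to_evaluators(eval_config: dict) -> dict:
    """Return a copy of eval_config with legacy "metrics" rewritten as "evaluators".

    Hypothetical helper, not part of agentevals. Assumes each legacy metric
    name corresponds to a built-in evaluator of the same name.
    """
    # Copy everything except the legacy key so the input is not mutated.
    converted = {key: value for key, value in eval_config.items() if key != "metrics"}
    evaluators = list(converted.get("evaluators", []))
    for name in eval_config.get("metrics", []):
        evaluators.append({"name": name, "type": "builtin"})
    if evaluators:
        converted["evaluators"] = evaluators
    return converted


legacy = {"metrics": ["tool_trajectory_avg_score"]}
print(json.dumps(metrics_to_evaluators(legacy)))
```

The result matches the `evalConfig` object in the updated `curl` example; the rest of the request body (`approach`, `target`) is unchanged by this commit.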

README.md

Lines changed: 17 additions & 1 deletion
@@ -250,7 +250,7 @@ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protoc
250250
agentevals serve # bundled UI on http://localhost:8001
251251
```
252252

253-
Upload traces and eval sets, select metrics, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
253+
Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
254254

255255
Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.
256256

@@ -397,6 +397,18 @@ Yes. A custom evaluator is any program that reads JSON from stdin and writes a s
397397

398398
Yes. The OTLP receiver on port 4318 accepts standard `http/protobuf` and `http/json` trace exports, so it slots into any OpenTelemetry pipeline as just another exporter destination. If your pipeline uses gRPC (port 4317), place an [OTel Collector](https://opentelemetry.io/docs/collector/) in front to bridge gRPC to HTTP. The [Kubernetes example](examples/kubernetes/README.md) shows this pattern.
399399

400+
**Can I use agentevals to evaluate Claude Code, Codex, or OpenCode?**
401+
402+
Not today. agentevals scores agent behavior from OpenTelemetry GenAI traces (spans for model calls, tool calls, agent invocations following the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/)). The major coding agents do not currently emit telemetry in that shape:
403+
404+
- **Claude Code** ships OTel telemetry as logs, not GenAI spans. A prior proof of concept on a feature branch made it work by stitching hook events into synthetic traces. Reviving that path is on the backlog, not a near-term commitment.
405+
- **Codex** exposes OTel, but in a different shape we have not yet validated against the GenAI semconv.
406+
- **OpenCode** did not have OTel support merged the last time we checked.
407+
408+
Retrofitting agentevals to ingest each harness's bespoke telemetry is multiple thousands of lines of glue code per agent, for a use case where the dominant signal is "did the final output feel right," not "did the agent call the right tool with the right arguments in the right order." That kind of vibes evaluation is interesting work for harness and coding-agent vendors themselves, but it is not what agentevals is optimized for.
409+
410+
agentevals is built for the opposite end of the spectrum: smaller, purpose-built, properly instrumented agents (kagent, agentregistry, custom Strands / ADK / LangChain / OpenAI Agents SDK flows) running in cloud native environments, where success is measurable through tool trajectories, response matching, and deterministic pass/fail gates. If that is your use case, we are a good fit. If you are evaluating long-running coding sessions end to end, you probably want a tool built specifically for that shape.
411+
400412
**How does this compare to ADK's evaluations?**
401413

402414
Unlike ADK's eval method, which couples agent execution with evaluation, agentevals only handles scoring: it takes pre-recorded traces and compares them against expected behavior using metrics like tool trajectory matching, response quality, and LLM-based judgments.
@@ -420,3 +432,7 @@ Langfuse is a full observability platform (requires Postgres, ClickHouse, Redis,
420432
**How does this compare to Opik?**
421433

422434
Opik's primary evaluation path re-runs your application code against a dataset, incurring additional LLM costs per eval run. It also supports online evaluation rules that auto-score production traces. While Opik supports OpenTelemetry ingestion alongside its own SDK, its evaluation workflow still centers on re-execution against datasets. agentevals evaluates pre-recorded OTel traces from any framework without re-execution, and runs entirely locally with no cloud dependency.
435+
436+
## Acknowledgements
437+
438+
agentevals is built on top of [Google's Agent Development Kit](https://github.com/google/adk-python). ADK provides the evaluator protocol and the canonical eval data model (`Invocation`, `EvalSet`, `Evaluator`, prebuilt metrics) that this project extends. `google-adk` is licensed under [Apache 2.0](https://github.com/google/adk-python/blob/main/LICENSE), the same license as agentevals. Thanks to the ADK team and contributors.

charts/agentevals/templates/deployment.yaml

Lines changed: 16 additions & 2 deletions
@@ -29,8 +29,9 @@ spec:
2929
securityContext:
3030
{{- toYaml .Values.podSecurityContext | nindent 8 }}
3131
serviceAccountName: {{ include "agentevals.serviceAccountName" . }}
32-
{{- if .Values.ephemeralVolume.enabled }}
32+
{{- if or .Values.ephemeralVolume.enabled .Values.extraVolumes }}
3333
volumes:
34+
{{- if .Values.ephemeralVolume.enabled }}
3435
- name: agentevals-tmp
3536
{{- if or .Values.ephemeralVolume.sizeLimit (eq .Values.ephemeralVolume.medium "Memory") }}
3637
emptyDir:
@@ -43,6 +44,10 @@ spec:
4344
{{- else }}
4445
emptyDir: {}
4546
{{- end }}
47+
{{- end }}
48+
{{- with .Values.extraVolumes }}
49+
{{- toYaml . | nindent 8 }}
50+
{{- end }}
4651
{{- end }}
4752
containers:
4853
- name: agentevals
@@ -70,6 +75,10 @@ spec:
7075
value: "postgres"
7176
- name: AGENTEVALS_DATABASE_SCHEMA
7277
value: {{ .Values.database.postgres.schema | quote }}
78+
- name: AGENTEVALS_AUTO_MIGRATE
79+
value: {{ .Values.database.postgres.autoMigrate | quote }}
80+
- name: AGENTEVALS_DB_CONNECT_TIMEOUT_S
81+
value: {{ .Values.database.postgres.connectTimeoutSeconds | quote }}
7382
{{- if .Values.database.postgres.urlFile }}
7483
- name: AGENTEVALS_DATABASE_URL_FILE
7584
value: {{ .Values.database.postgres.urlFile | quote }}
@@ -135,10 +144,15 @@ spec:
135144
port: http
136145
initialDelaySeconds: 15
137146
periodSeconds: 20
138-
{{- if .Values.ephemeralVolume.enabled }}
147+
{{- if or .Values.ephemeralVolume.enabled .Values.extraVolumeMounts }}
139148
volumeMounts:
149+
{{- if .Values.ephemeralVolume.enabled }}
140150
- name: agentevals-tmp
141151
mountPath: /tmp
152+
{{- end }}
153+
{{- with .Values.extraVolumeMounts }}
154+
{{- toYaml . | nindent 12 }}
155+
{{- end }}
142156
{{- end }}
143157
{{- with .Values.nodeSelector }}
144158
nodeSelector:
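
Reviewer note: the new template conditionals emit the `volumes:` key when either `ephemeralVolume.enabled` is set or `extraVolumes` is non-empty, with the `agentevals-tmp` entry listed first. A rough Python sketch of that decision follows. The function name `pod_volumes` is hypothetical, and the `emptyDir` body is simplified to the `emptyDir: {}` branch because the `sizeLimit` / `medium` branch is partly hidden in this hunk.

```python
from typing import Optional


def pod_volumes(values: dict) -> Optional[list]:
    """Mirror the template's `volumes:` logic for a Helm values dict.

    Illustrative only: returns None when the template would omit the
    `volumes:` key entirely.
    """
    ephemeral = bool(values.get("ephemeralVolume", {}).get("enabled", False))
    extra = values.get("extraVolumes") or []  # an empty list is falsy, as in Helm
    if not (ephemeral or extra):
        return None
    volumes = []
    if ephemeral:
        # Simplified: the real template may add sizeLimit / medium here.
        volumes.append({"name": "agentevals-tmp", "emptyDir": {}})
    volumes.extend(extra)
    return volumes


print(pod_volumes({"extraVolumes": [{"name": "sink-creds", "secret": {"secretName": "sink"}}]}))
```

The `volumeMounts:` block follows the same pattern but is gated on `extraVolumeMounts`, so setting `extraVolumes` without a matching `extraVolumeMounts` entry defines a volume that the container never mounts.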

charts/agentevals/values.yaml

Lines changed: 22 additions & 0 deletions
@@ -159,6 +159,16 @@ env: []
159159
# -- Extra envFrom sources (ConfigMapRef, SecretRef)
160160
envFrom: []
161161

162+
# -- Extra volumes appended to the pod spec. Use this to mount additional
163+
# config files or secrets (e.g. result-sink credentials) into the pod.
164+
extraVolumes: []
165+
166+
# -- Extra volumeMounts appended to the main container. Pair with
167+
# extraVolumes by name. securityContext.readOnlyRootFilesystem is true by
168+
# default; that only makes the root filesystem read-only, mounted paths
169+
# themselves are unaffected, so a writable extraVolumes entry works fine.
170+
extraVolumeMounts: []
171+
162172
# ==============================================================================
163173
# STORAGE (preview feature)
164174
#
@@ -195,6 +205,18 @@ database:
195205
urlFile: ""
196206
# -- Postgres schema to use for agentevals tables.
197207
schema: agentevals
208+
# -- Apply pending database migrations during server startup before the
209+
# HTTP listener opens. The Postgres advisory lock serialises concurrent
210+
# replica starts so this is safe with replicaCount > 1. When set to
211+
# false the server refuses to start if the schema is behind or dirty;
212+
# run "agentevals migrate up" manually in that case.
213+
autoMigrate: true
214+
# -- Seconds the startup will spend retrying the initial Postgres
215+
# connection before the pod aborts. Default 600s matches the chart's
216+
# hard-coded startupProbe budget (failureThreshold 60 x periodSeconds
217+
# 10). Going above 600s requires overriding the probe in your own
218+
# downstream template.
219+
connectTimeoutSeconds: 600
198220
# -- Bundled Postgres instance for development and evaluation only.
199221
# Not suitable for production. Deployed when enabled is true and url /
200222
# urlFile are not set.
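
Reviewer note: the `connectTimeoutSeconds` comment ties the 600s default to the startup probe budget (`failureThreshold` 60 x `periodSeconds` 10). The sketch below only restates that arithmetic as a check; both function names are hypothetical and nothing here is part of the chart.

```python
def startup_probe_budget_s(failure_threshold: int = 60, period_seconds: int = 10) -> int:
    """Seconds the startupProbe tolerates before the pod is restarted."""
    return failure_threshold * period_seconds


def timeout_fits_probe(connect_timeout_s: int,
                       failure_threshold: int = 60,
                       period_seconds: int = 10) -> bool:
    """True if the Postgres connect retry window ends within the probe budget."""
    return connect_timeout_s <= startup_probe_budget_s(failure_threshold, period_seconds)


print(startup_probe_budget_s())   # 60 * 10
print(timeout_fits_probe(600))    # chart default
print(timeout_fits_probe(900))    # needs a longer probe budget
```

Per the values comment, a value above 600 only helps if the probe is overridden in a downstream template, since the chart's probe settings are hard-coded.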

docs/custom-evaluators.md

Lines changed: 20 additions & 0 deletions
@@ -337,6 +337,26 @@ The `grader.evaluation_metric` field selects the similarity algorithm:
337337
| `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
338338
| `rouge_l` | Longest common subsequence overlap (F-measure) |
339339

340+
### Label Model Grader
341+
342+
Scores responses without a golden set. The model reads each response and assigns a label from a fixed list. Passing labels are defined in the config.
343+
344+
```yaml
345+
evaluators:
346+
- name: quality_check
347+
type: openai_eval
348+
grader:
349+
type: label_model
350+
model: gpt-4o-mini
351+
input:
352+
- role: user
353+
content: "Rate this response: {{ item.actual_response }}"
354+
labels: [good, bad]
355+
passing_labels: [good]
356+
```
357+
358+
The `threshold` field is not used for `label_model`. A response passes if its assigned label is in `passing_labels`.
359+
340360
### How it works
341361

342362
Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
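
Reviewer note: the Label Model Grader section added in this hunk states that `threshold` is ignored and a response passes when its assigned label is in `passing_labels`. A minimal sketch of that rule follows. The function name `label_passes` is hypothetical, and rejecting a label outside `labels` is an assumption; the docs do not say how agentevals handles that case.

```python
from typing import Sequence


def label_passes(assigned: str, labels: Sequence[str], passing_labels: Sequence[str]) -> bool:
    """Apply the documented label_model pass rule to one graded response."""
    if assigned not in labels:
        # Assumption: treat an unexpected label as a grader error, not a failure.
        raise ValueError(f"label {assigned!r} is not one of {list(labels)}")
    return assigned in passing_labels


# Values taken from the quality_check example in the docs.
labels = ["good", "bad"]
passing = ["good"]
print(label_passes("good", labels, passing))
print(label_passes("bad", labels, passing))
```

Because the outcome is a set membership test, no numeric score is compared against `threshold` in this sketch, which is consistent with the note that the field is unused for `label_model`.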

docs/eval-set-format.md

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
11
# Eval Set Format
22

3-
An eval set is a JSON file containing golden reference data that metrics compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
3+
An eval set is a JSON file containing golden reference data that evaluators compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
44

55
Most users will not need to author eval sets by hand. The web UI can generate them from live sessions (mark a session as golden, and the server builds the eval set automatically). This document is for users who want to create or edit eval sets directly, whether for CLI usage, CI pipelines, or version-controlled test suites.
66

@@ -203,9 +203,9 @@ The `parts` array can contain text, function calls, or function responses. Most
203203

204204
Each `FunctionCall` has `name`, `args`, and `id`. Each `FunctionResponse` has `name`, `response`, and `id`. Match `id` values between calls and responses to pair them.
205205

206-
## Which Metrics Use Eval Sets
206+
## Which Evaluators Use Eval Sets
207207

208-
Not all metrics require an eval set. Use `agentevals list-metrics` to see which do:
208+
Not all evaluators require an eval set. Use `agentevals evaluator list --source builtin` to see which built-in evaluators do:
209209

210210
| Metric | Needs Eval Set | What It Reads |
211211
|---|---|---|

examples/custom_evaluators/eval_config.yaml

Lines changed: 1 addition & 0 deletions
@@ -32,3 +32,4 @@ evaluators:
3232
ref: evaluators/random_evaluator/random_evaluator.py
3333
threshold: 0.110
3434
executor: local
35+

examples/custom_evaluators/eval_config_openai_eval.yaml

Lines changed: 10 additions & 0 deletions
@@ -6,6 +6,16 @@
66
# --config examples/custom_evaluators/eval_config_openai_eval.yaml
77

88
evaluators:
9+
- name: quality_check
10+
type: openai_eval
11+
grader:
12+
type: label_model
13+
model: gpt-4o-mini
14+
input:
15+
- role: user
16+
content: "Rate this response: {{ item.actual_response }}"
17+
labels: [good, bad]
18+
passing_labels: [good]
919
- name: quality_score
1020
type: openai_eval
1121
threshold: 0.7

examples/dice_agent/README.md

Lines changed: 1 addition & 1 deletion
@@ -149,7 +149,7 @@ Update `main.py` to test the new functionality.
149149
**After agent completes:**
150150
- Status changes to "EVALUATED"
151151
- Evaluation results appear as colored badges
152-
- Each metric shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
152+
- Each evaluator result shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
153153

154154
**Multiple runs:**
155155
- Each run creates a new session with model name in ID

examples/kubernetes/README.md

Lines changed: 2 additions & 2 deletions
@@ -221,7 +221,7 @@ This captures the GPT-5 session's tool trajectory and final responses as the gol
221221
2. Select both sessions (the `gpt-4.1-mini` session and the `gpt-5` session)
222222
3. Click **Evaluate**
223223
4. Select the `helm-agent-comparison` eval set
224-
5. Choose the metrics:
224+
5. Choose the evaluators:
225225
- **tool_trajectory_avg_score**: Did the agent call the correct tools in the correct order?
226226
- **response_match_score**: Did the agent produce responses consistent with the golden reference?
227227
6. Run the evaluation
@@ -241,7 +241,7 @@ Compare the two sessions in the results table:
241241

242242
<img width="1914" height="1154" alt="image" src="https://github.com/user-attachments/assets/5939a8d4-3775-4cf1-9cf2-d3b6b4afd582" />
243243

244-
You can also click an individual conversation and see a breakdown of each evaluators.
244+
You can also click an individual conversation and see a breakdown of each evaluator.
245245

246246
<img width="1916" height="1348" alt="image" src="https://github.com/user-attachments/assets/984b3d29-8018-4fcb-9036-bb7c6e97d9ff" />
247247
