Commit 3ded9b3

Merge pull request #74 from agentevals-dev/docs/add-openai-graders
document OpenAI Graders
2 parents d7ef558 + 0ee898e commit 3ded9b3

2 files changed

Lines changed: 84 additions & 36 deletions

README.md

Lines changed: 2 additions & 1 deletion
@@ -91,6 +91,7 @@ Optional extras:
9191

9292
```bash
9393
pip install "agentevals-cli[live]" # MCP server support
94+
pip install "agentevals-cli[openai]" # OpenAI Evals API graders
9495
```
9596

9697
**GitHub [releases](../../releases)** also ship **core** wheels (CLI and API only) and **bundle** wheels (with the embedded UI) if you need a specific version or offline `pip install ./path/to.whl`.
@@ -215,7 +216,7 @@ evaluators:
215216
agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
216217
```
217218

218-
Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
219+
Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
219220

220221
## Web UI
221222

docs/custom-evaluators.md

Lines changed: 82 additions & 35 deletions
@@ -85,7 +85,9 @@ agentevals run traces/my_trace.json \
8585

8686
## Eval Config Reference
8787

88-
Each evaluator entry in the `evaluators` list uses the following fields:
88+
Each evaluator entry in the `evaluators` list uses the following fields. The `type` field determines which other fields are valid.
89+
90+
### `type: code` (local scripts)
8991

9092
| Field | Required | Default | Description |
9193
|---|---|---|---|
@@ -96,6 +98,16 @@ Each evaluator entry in the `evaluators` list uses the following fields:
9698
| `timeout` | no | `30` | Subprocess timeout in seconds |
9799
| `config` | no | `{}` | Arbitrary key-value pairs passed to the evaluator |
98100

101+
### `type: openai_eval` (OpenAI Evals API)
102+
103+
| Field | Required | Default | Description |
104+
|---|---|---|---|
105+
| `name` | yes | | Unique name for the evaluator (used in output) |
106+
| `type` | yes | | `openai_eval` for OpenAI Evals API graders |
107+
| `threshold` | no | `0.5` | Maps to `pass_threshold` in the OpenAI grader |
108+
| `timeout` | no | `120` | Max seconds to wait for the OpenAI eval run |
109+
| `grader` | yes | | OpenAI grader config (see [OpenAI Evals API Graders](#openai-evals-api-graders)) |
110+
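
The field table for `type: openai_eval` can be read as a small validation routine. The following is a minimal sketch, not the project's actual loader: the `OpenAIEvalEntry` class and `parse_openai_eval_entry` function are hypothetical names introduced for illustration, and only the required fields and defaults come from the table.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class OpenAIEvalEntry:
    """Hypothetical model of one `type: openai_eval` evaluator entry."""

    name: str
    grader: dict[str, Any]
    threshold: float = 0.5  # documented default
    timeout: int = 120  # documented default, in seconds
    type: str = "openai_eval"


def parse_openai_eval_entry(raw: dict[str, Any]) -> OpenAIEvalEntry:
    """Validate a raw mapping (e.g. loaded from YAML) against the field table."""
    if raw.get("type") != "openai_eval":
        raise ValueError("expected type: openai_eval")
    # `name` and `grader` are the only required fields besides `type`.
    missing = [key for key in ("name", "grader") if key not in raw]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    return OpenAIEvalEntry(
        name=raw["name"],
        grader=dict(raw["grader"]),
        threshold=float(raw.get("threshold", 0.5)),
        timeout=int(raw.get("timeout", 120)),
    )
```

An entry that omits `threshold` and `timeout` would fall back to `0.5` and `120` under this sketch.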
99111
## Protocol
100112

101113
Every evaluator — regardless of language — communicates via the same JSON protocol over stdin/stdout.
@@ -275,6 +287,40 @@ evaluators:
275287

276288
Remote evaluators are cached in `~/.cache/agentevals/evaluators/`. To force a re-download, delete the cached file.
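
Deleting a cached file can also be scripted. This is a small sketch under stated assumptions: the cache directory is the one named above, while the helper `evict_cached_evaluator` and any file name passed to it are hypothetical, since the cache's file layout is not described here.

```python
from pathlib import Path

# Cache directory as stated in the docs above.
CACHE_DIR = Path.home() / ".cache" / "agentevals" / "evaluators"


def evict_cached_evaluator(filename: str, cache_dir: Path = CACHE_DIR) -> bool:
    """Delete one cached evaluator file; return True if a file was removed."""
    target = cache_dir / filename
    if target.is_file():
        target.unlink()
        return True
    # Nothing cached under that name, so the next run downloads it anyway.
    return False
```

The next run that references the evaluator should then fetch it again.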
277289

290+
## OpenAI Evals API Graders
291+
292+
You can delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) instead of running scoring logic locally. This requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY` to be set.
293+
294+
### Text Similarity Grader
295+
296+
Compares the agent's response against a golden reference using text similarity metrics. Requires an eval set.
297+
298+
```yaml
299+
evaluators:
300+
- name: response_similarity
301+
type: openai_eval
302+
threshold: 0.8
303+
grader:
304+
type: text_similarity
305+
evaluation_metric: fuzzy_match
306+
```
307+
308+
The `grader.evaluation_metric` field selects the similarity algorithm:
309+
310+
| Metric | Description |
311+
|---|---|
312+
| `fuzzy_match` | Approximate string matching using edit distance |
313+
| `bleu` | N-gram overlap score, commonly used for translation quality |
314+
| `gleu` | Google's variant of BLEU with sentence-level scoring |
315+
| `meteor` | Alignment-based metric considering synonyms and paraphrases |
316+
| `cosine` | Cosine similarity on vectorized text |
317+
| `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
318+
| `rouge_l` | Longest common subsequence overlap (F-measure) |
319+
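
To make one of these metrics concrete, here is a simplified sketch of a unigram-overlap F-measure in the spirit of `rouge_1`, plus a threshold check. It is an illustration only and not OpenAI's implementation: it tokenizes on whitespace, lowercases, and applies no stemming, and the `passes` helper assumes a score passes when it is greater than or equal to the threshold.

```python
from collections import Counter


def rouge_1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F-measure on lowercased whitespace tokens (simplified)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def passes(score: float, threshold: float = 0.5) -> bool:
    """Assumption: a score passes when it meets or exceeds the threshold."""
    return score >= threshold
```

With `threshold: 0.8`, a response sharing two of three words with the reference (score about 0.67 in this sketch) would fail, while an exact match would pass.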
320+
### How it works
321+
322+
Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
323+
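
The request payloads behind this flow might look roughly like the following. This is a hedged sketch, not agentevals' actual code: the helper names and the `actual`/`expected` item keys are hypothetical, and the dictionary shapes reflect one reading of OpenAI's custom data source and `text_similarity` grader schemas, so they should be checked against the API reference before use. No network calls are made here.

```python
import json
from typing import Any


def build_data_source_config() -> dict[str, Any]:
    """Custom data source whose items carry both texts; no sample schema."""
    return {
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "actual": {"type": "string"},  # hypothetical key
                "expected": {"type": "string"},  # hypothetical key
            },
            "required": ["actual", "expected"],
        },
        "include_sample_schema": False,
    }


def build_text_similarity_criterion(
    name: str, metric: str, threshold: float
) -> dict[str, Any]:
    """Grader that compares two fields from the `item` namespace."""
    return {
        "type": "text_similarity",
        "name": name,
        "input": "{{item.actual}}",
        "reference": "{{item.expected}}",
        "evaluation_metric": metric,
        "pass_threshold": threshold,  # from the evaluator's `threshold`
    }


def build_items(pairs: list[tuple[str, str]]) -> list[dict[str, Any]]:
    """One item per (actual, expected) response pair."""
    return [{"item": {"actual": a, "expected": e}} for a, e in pairs]


def to_jsonl(rows: list[dict[str, Any]]) -> str:
    """Serialize rows as JSON Lines, one object per line."""
    return "\n".join(json.dumps(row) for row in rows)
```

Because both texts live under `item`, the grader templates reference `{{item.…}}` only, which matches the description that no model outputs are generated.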
278324
### Configuring the GitHub source
279325

280326
By default, evaluators are fetched from the official community repository. Override with environment variables:
@@ -303,42 +349,43 @@ The community repo uses per-evaluator manifests. A CI workflow compiles all `eva
303349
Custom evaluators use a layered architecture designed for extensibility.
304350

305351
```
306-
┌─────────────────────────────────────────┐
307-
│ Eval Config (YAML) │
308-
│ type: code | remote
309-
└──────────────┬──────────────────────────┘
310-
311-
312-
┌─────────────────────────────────────────┐
313-
│ EvaluatorResolver
314-
│ Downloads remote → local cache │
315-
(passthrough for type: code)
316-
└──────────────┬──────────────────────────┘
317-
318-
319-
┌─────────────────────────────────────────┐
320-
│ CustomEvaluatorRunner
321-
│ ADK Evaluator adapter │
322-
Invocation ↔ EvalInput/EvalResult
323-
└──────────────┬──────────────────────────┘
324-
325-
326-
┌─────────────────────────────────────────┐
327-
│ EvaluatorBackend (ABC) — executor factory │
328-
│ async run(EvalInput) → EvalResult │
329-
├─────────────────────────────────────────┤
330-
│ "local" → SubprocessBackend
331-
│ "docker" → DockerBackend (future)
332-
└──────────────┬──────────────────────────┘
333-
334-
335-
┌─────────────────────────────────────────
336-
│ Runtime registry
337-
│ PythonRuntime (.py)
338-
│ NodeRuntime (.js, .ts)
339-
└─────────────────────────────────────────
┌─────────────────────────────────────────────┐
│             Eval Config (YAML)              │
│      type: code | remote | openai_eval      │
└──────────────┬─────────────┬────────────────┘
               │             │
          code/remote   openai_eval
               │             │
               ▼             ▼
┌──────────────────────┐  ┌──────────────────────┐
│  EvaluatorResolver   │  │  OpenAI Evals API    │
│  remote → local      │  │  create eval + run   │
│  (passthrough: code) │  │  poll → get results  │
└──────────┬───────────┘  └──────────────────────┘
           │
           ▼
┌──────────────────────────┐
│  CustomEvaluatorRunner   │
│  ADK Evaluator adapter   │
│  Invocation ↔ EvalInput  │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│  EvaluatorBackend (ABC)  │
│  "local" → Subprocess    │
│  "docker" → (future)     │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│  Runtime registry        │
│  PythonRuntime (.py)     │
│  NodeRuntime (.js, .ts)  │
└──────────────────────────┘
340386
```
341387
388+
- **`type: openai_eval`** takes a separate path: it calls the OpenAI Evals API directly (create eval, create run, poll, collect results) and returns a `MetricResult`. It does not go through the subprocess/backend stack.
342389
- **`EvaluatorSource`** is the registry abstraction. Implementations (`BuiltinEvaluatorSource`, `GitHubEvaluatorSource`) list and fetch evaluators from different registries.
343390
- **`EvaluatorResolver`** downloads remote evaluators and converts `RemoteEvaluatorDef` to `CodeEvaluatorDef` with a local cached path.
344391
- **`EvaluatorBackend`** is the execution abstraction. The `executor` field in config selects which factory to use (`"local"` → `SubprocessBackend`). New executors (e.g. `DockerBackend`) register via `register_executor()`.
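
The routing described in these bullets can be sketched as a small dispatch function. This is an illustrative sketch only: `register_executor` is named in the docs, but the signature used here, the `route` helper, the `_EXECUTORS` registry, and the `"local"` fallback when no `executor` is given are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical registry mapping executor names to backend factories.
# Factories return a label here; a real one would build a backend object.
_EXECUTORS: dict[str, Callable[[], str]] = {}


def register_executor(name: str, factory: Callable[[], str]) -> None:
    """Register a backend factory under an executor name (assumed signature)."""
    _EXECUTORS[name] = factory


def route(evaluator: dict) -> list[str]:
    """Return the ordered stages an evaluator entry would pass through."""
    kind = evaluator["type"]
    if kind == "openai_eval":
        # Separate path: no resolver, runner, or subprocess backend.
        return ["OpenAI Evals API"]
    if kind in ("code", "remote"):
        executor = evaluator.get("executor", "local")  # assumed default
        if executor not in _EXECUTORS:
            raise ValueError(f"unknown executor: {executor!r}")
        return [
            "EvaluatorResolver",
            "CustomEvaluatorRunner",
            _EXECUTORS[executor](),
            "Runtime registry",
        ]
    raise ValueError(f"unknown evaluator type: {kind!r}")


register_executor("local", lambda: "SubprocessBackend")
```

Under this sketch, adding a new executor only requires another `register_executor` call; the `openai_eval` branch is unaffected.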
