README.md
Lines changed: 2 additions & 1 deletion
@@ -91,6 +91,7 @@ Optional extras:
 
 ```bash
 pip install "agentevals-cli[live]"      # MCP server support
+pip install "agentevals-cli[openai]"    # OpenAI Evals API graders
 ```
 
 **GitHub [releases](../../releases)** also ship **core** wheels (CLI and API only) and **bundle** wheels (with the embedded UI) if you need a specific version or offline `pip install ./path/to.whl`.
@@ -215,7 +216,7 @@ evaluators:
 agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
 ```
 
-Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
+Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
Every evaluator — regardless of language — communicates via the same JSON protocol over stdin/stdout.
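
Note on this hunk: the changed paragraph names two prerequisites for `type: openai_eval`, the `[openai]` extra and the `OPENAI_API_KEY` environment variable. A minimal preflight sketch follows. The function name `openai_eval_prereq_problems` is hypothetical, and the check assumes the extra installs the `openai` Python package, which the diff does not state.

```python
import importlib.util
import os
from typing import Mapping, Optional


def openai_eval_prereq_problems(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return a list of problems; an empty list means both prerequisites look satisfied."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: the [openai] extra pulls in the `openai` package.
    if importlib.util.find_spec("openai") is None:
        problems.append('openai package not found; try: pip install "agentevals-cli[openai]"')
    # The README change says OPENAI_API_KEY must be set.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in openai_eval_prereq_problems():
        print(problem)
```

Passing the environment as a mapping keeps the check easy to exercise without touching the real process environment.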
@@ -275,6 +287,40 @@ evaluators:
 
 Remote evaluators are cached in `~/.cache/agentevals/evaluators/`. To force a re-download, delete the cached file.
 
+## OpenAI Evals API Graders
+
+You can delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) instead of running scoring logic locally. This requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY` to be set.
+
+### Text Similarity Grader
+
+Compares the agent's response against a golden reference using text similarity metrics. Requires an eval set.
+
+```yaml
+evaluators:
+  - name: response_similarity
+    type: openai_eval
+    threshold: 0.8
+    grader:
+      type: text_similarity
+      evaluation_metric: fuzzy_match
+```
+
+The `grader.evaluation_metric` field selects the similarity algorithm:
+
+| Metric | Description |
+|---|---|
+| `fuzzy_match` | Approximate string matching using edit distance |
+| `bleu` | N-gram overlap score, commonly used for translation quality |
+| `gleu` | Google's variant of BLEU with sentence-level scoring |
+| `meteor` | Alignment-based metric considering synonyms and paraphrases |
+| `cosine` | Cosine similarity on vectorized text |
+| `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
+| `rouge_l` | Longest common subsequence overlap (F-measure) |
+
+### How it works
+
+Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
+
 ### Configuring the GitHub source
 
 By default, evaluators are fetched from the official community repository. Override with environment variables:
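
Note on this hunk: the "How it works" paragraph added here describes an eval whose items carry both the agent's response and the golden reference, with `include_sample_schema: false`. The sketch below builds request bodies of that shape as plain dictionaries, without calling the network. The item field names `actual` and `expected`, the eval name, the helper `build_eval_payloads`, and the mapping of `threshold` to `pass_threshold` are all assumptions for illustration; the payload layout follows my understanding of the OpenAI Evals API and should be checked against the linked reference rather than read as what agentevals sends.

```python
import json
from typing import Any


def build_eval_payloads(
    actual: str,
    expected: str,
    metric: str = "fuzzy_match",
    threshold: float = 0.8,
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Build (eval request, run request) bodies for a text-similarity grader."""
    eval_request = {
        "name": "agentevals-ephemeral",  # hypothetical name
        "data_source_config": {
            "type": "custom",
            # Both strings live in the `item` namespace ("actual"/"expected" are assumed names).
            "item_schema": {
                "type": "object",
                "properties": {
                    "actual": {"type": "string"},
                    "expected": {"type": "string"},
                },
                "required": ["actual", "expected"],
            },
            # No sample schema: nothing is generated by a model, only graded.
            "include_sample_schema": False,
        },
        "testing_criteria": [
            {
                "type": "text_similarity",
                "name": "response_similarity",
                "input": "{{item.actual}}",
                "reference": "{{item.expected}}",
                "evaluation_metric": metric,
                "pass_threshold": threshold,  # assumed mapping from the YAML `threshold`
            }
        ],
    }
    run_request = {
        "data_source": {
            "type": "jsonl",
            "source": {
                "type": "file_content",
                "content": [{"item": {"actual": actual, "expected": expected}}],
            },
        },
    }
    return eval_request, run_request


if __name__ == "__main__":
    eval_body, run_body = build_eval_payloads("Paris", "Paris, France")
    print(json.dumps(eval_body, indent=2))
    print(json.dumps(run_body, indent=2))
```

Keeping payload construction separate from the API calls makes the create, run, poll, and clean-up steps easier to test in isolation.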
@@ -303,42 +349,43 @@ The community repo uses per-evaluator manifests. A CI workflow compiles all `eva
 Custom evaluators use a layered architecture designed for extensibility.
 
 ```
-┌─────────────────────────────────────────┐
-│  Eval Config (YAML)                     │
-│  type: code | remote                    │
-└──────────────┬──────────────────────────┘
-               │
-               ▼
-┌─────────────────────────────────────────┐
-│  EvaluatorResolver                      │
-│  Downloads remote → local cache         │
-│  (passthrough for type: code)           │
-└──────────────┬──────────────────────────┘
-               │
-               ▼
-┌─────────────────────────────────────────┐
-│  CustomEvaluatorRunner                  │
-│  ADK Evaluator adapter                  │
-│  Invocation ↔ EvalInput/EvalResult      │
-└──────────────┬──────────────────────────┘
-               │
-               ▼
-┌─────────────────────────────────────────┐
-│ EvaluatorBackend (ABC) — executor factory │
-│  async run(EvalInput) → EvalResult      │
-├─────────────────────────────────────────┤
-│  "local" → SubprocessBackend            │
-│  "docker" → DockerBackend (future)      │
-└──────────────┬──────────────────────────┘
-               │
-               ▼
-┌─────────────────────────────────────────┐
-│  Runtime registry                       │
-│  PythonRuntime (.py)                    │
-│  NodeRuntime (.js, .ts)                 │
-└─────────────────────────────────────────┘
+┌─────────────────────────────────────────────┐
+│  Eval Config (YAML)                         │
+│  type: code | remote | openai_eval          │
+└──────────────┬─────────────┬────────────────┘
+               │             │
+          code/remote   openai_eval
+               │             │
+               ▼             ▼
+┌──────────────────────┐  ┌──────────────────────┐
+│  EvaluatorResolver   │  │  OpenAI Evals API    │
+│  remote → local      │  │  create eval + run   │
+│  (passthrough: code) │  │  poll → get results  │
+└──────────┬───────────┘  └──────────────────────┘
+           │
+           ▼
+┌──────────────────────────┐
+│  CustomEvaluatorRunner   │
+│  ADK Evaluator adapter   │
+│  Invocation ↔ EvalInput  │
+└──────────┬───────────────┘
+           │
+           ▼
+┌──────────────────────────┐
+│  EvaluatorBackend (ABC)  │
+│  "local" → Subprocess    │
+│  "docker" → (future)     │
+└──────────┬───────────────┘
+           │
+           ▼
+┌──────────────────────────┐
+│  Runtime registry        │
+│  PythonRuntime (.py)     │
+│  NodeRuntime (.js, .ts)  │
+└──────────────────────────┘
 ```
 
+- **`type: openai_eval`** takes a separate path: it calls the OpenAI Evals API directly (create eval, create run, poll, collect results) and returns a `MetricResult`. It does not go through the subprocess/backend stack.
 - **`EvaluatorSource`** is the registry abstraction. Implementations (`BuiltinEvaluatorSource`, `GitHubEvaluatorSource`) list and fetch evaluators from different registries.
 - **`EvaluatorResolver`** downloads remote evaluators and converts `RemoteEvaluatorDef` to `CodeEvaluatorDef` with a local cached path.
 - **`EvaluatorBackend`** is the execution abstraction. The `executor` field in config selects which factory to use (`"local"` → `SubprocessBackend`). New executors (e.g. `DockerBackend`) register via `register_executor()`.
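
Note on this hunk: the last bullet describes an executor registry in which the `executor` config value selects a backend factory. The sketch below illustrates that pattern only. The diff names `EvaluatorBackend`, `EvalInput`, `EvalResult`, and `register_executor()`, but their fields and signatures here are assumptions, and `NonEmptyBackend` and `get_backend` are hypothetical stand-ins rather than agentevals code.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EvalInput:
    response: str  # assumed field, for illustration only


@dataclass
class EvalResult:
    score: float  # assumed field, for illustration only


class EvaluatorBackend(ABC):
    """Execution abstraction: one async entry point per backend."""

    @abstractmethod
    async def run(self, eval_input: EvalInput) -> EvalResult: ...


# Maps an executor name from the config to a factory that builds a backend.
_EXECUTORS: Dict[str, Callable[[], EvaluatorBackend]] = {}


def register_executor(name: str, factory: Callable[[], EvaluatorBackend]) -> None:
    """Register a backend factory under an executor name (assumed signature)."""
    _EXECUTORS[name] = factory


def get_backend(executor: str) -> EvaluatorBackend:
    """Hypothetical lookup helper: build the backend selected by `executor`."""
    try:
        factory = _EXECUTORS[executor]
    except KeyError:
        known = sorted(_EXECUTORS)
        raise ValueError(f"unknown executor {executor!r}; known: {known}") from None
    return factory()


class NonEmptyBackend(EvaluatorBackend):
    """Toy stand-in for a real backend: scores 1.0 for a non-empty response."""

    async def run(self, eval_input: EvalInput) -> EvalResult:
        return EvalResult(score=1.0 if eval_input.response.strip() else 0.0)


register_executor("local", NonEmptyBackend)

if __name__ == "__main__":
    result = asyncio.run(get_backend("local").run(EvalInput(response="hello")))
    print(result)
```

Registering factories rather than instances lets each evaluation get a fresh backend, which is one plausible reason for the "executor factory" wording in the earlier diagram.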