
Commit ec33312

wiliyam and claude committed

fix: address remaining review comments — update docs, examples, and PR description

- Add String Check Grader section to docs/custom-evaluators.md
- Document that threshold is not applicable to string_check grader
- Update "How it works" to describe grader-aware JSONL item building
- Add openai_eval examples (text_similarity + string_check) to eval_config.yaml
- Mention both grader types in README Custom Evaluators section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 3add7a1 commit ec33312

3 files changed

Lines changed: 41 additions & 3 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -216,7 +216,7 @@ evaluators:
 agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
 ```
 
-Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
+Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). Two grader types are supported: `text_similarity` for comparing responses against a golden reference, and `string_check` for exact or pattern-based matching against a fixed value. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
 
 ## Web UI

docs/custom-evaluators.md

Lines changed: 25 additions & 2 deletions
@@ -104,7 +104,7 @@ Each evaluator entry in the `evaluators` list uses the following fields. The `ty
 |---|---|---|---|
 | `name` | yes | | Unique name for the evaluator (used in output) |
 | `type` | yes | | `openai_eval` for OpenAI Evals API graders |
-| `threshold` | no | `0.5` | Maps to `pass_threshold` in the OpenAI grader |
+| `threshold` | no | `0.5` | Maps to `pass_threshold` in the OpenAI grader (not applicable for `string_check`) |
 | `timeout` | no | `120` | Max seconds to wait for the OpenAI eval run |
 | `grader` | yes | | OpenAI grader config (see [OpenAI Evals Graders](#openai-evals-api-graders)) |

@@ -317,9 +317,32 @@ The `grader.evaluation_metric` field selects the similarity algorithm:
 | `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
 | `rouge_l` | Longest common subsequence overlap (F-measure) |
 
+### String Check Grader
+
+Checks the agent's response against a fixed reference string using comparison operations. Does **not** require an eval set — the reference value is specified directly in the grader config. The `threshold` field is not applicable to this grader (string_check always returns 0 or 1).
+
+```yaml
+evaluators:
+  - name: city_name_check
+    type: openai_eval
+    grader:
+      type: string_check
+      operation: eq
+      reference: "Paris"
+```
+
+The `grader.operation` field selects the comparison:
+
+| Operation | Description |
+|---|---|
+| `eq` | Exact equality |
+| `ne` | Not equal |
+| `like` | Pattern match (case-sensitive) |
+| `ilike` | Pattern match (case-insensitive) |
+
 ### How it works
 
-Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
+Under the hood, agentevals creates an ephemeral eval on OpenAI, submits invocations as JSONL items, polls for results, and cleans up. For `text_similarity` graders, each item contains both the actual and expected responses; for `string_check` graders, each item contains only the actual response (the reference is supplied statically in the grader config). Items are placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
 
 ### Configuring the GitHub source
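
The grader-aware item building described in the updated "How it works" paragraph can be sketched roughly as follows. This is a minimal illustration, not the agentevals implementation: `build_items`, the `actual`/`expected` input keys, and the `actual_response` field name are assumptions made for this example (`expected_response` is borrowed from the template used in the example config).

```python
import json


def build_items(grader_type, invocations):
    """Return one JSONL line per invocation (hypothetical helper).

    text_similarity items carry both the actual and expected responses;
    string_check items carry only the actual response, because the
    reference lives in the grader config.
    """
    if grader_type not in ("text_similarity", "string_check"):
        raise ValueError(f"unsupported grader type: {grader_type}")

    lines = []
    for inv in invocations:
        # Field names here are assumptions, not the library's schema.
        item = {"actual_response": inv["actual"]}
        if grader_type == "text_similarity":
            item["expected_response"] = inv["expected"]
        # Everything is nested under the `item` namespace.
        lines.append(json.dumps({"item": item}))
    return lines


invocations = [{"actual": "Paris", "expected": "Paris, France"}]
for line in build_items("text_similarity", invocations):
    print(line)
```

Calling `build_items("string_check", invocations)` on the same input would drop the expected response from each item, which is the difference the paragraph describes.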

examples/custom_evaluators/eval_config.yaml

Lines changed: 15 additions & 0 deletions
@@ -32,3 +32,18 @@ evaluators:
     ref: evaluators/random_evaluator/random_evaluator.py
     threshold: 0.110
     executor: local
+
+  # OpenAI Evals API graders (requires OPENAI_API_KEY)
+  - name: response_similarity
+    type: openai_eval
+    threshold: 0.8
+    grader:
+      type: text_similarity
+      evaluation_metric: fuzzy_match
+
+  - name: city_name_check
+    type: openai_eval
+    grader:
+      type: string_check
+      operation: eq
+      reference: "{{ item.expected_response }}"
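
For readers unfamiliar with the `string_check` operations used by `city_name_check`, the following local sketch mimics the four comparisons and the 0-or-1 score. It does not call the OpenAI API. The function name `string_check` is hypothetical, and treating `like`/`ilike` as substring containment is an assumption about what "pattern match" means here, so check the OpenAI grader documentation before relying on it.

```python
def string_check(actual, reference, operation):
    """Local approximation of a string_check grader (hypothetical helper)."""
    if operation == "eq":
        passed = actual == reference
    elif operation == "ne":
        passed = actual != reference
    elif operation == "like":
        # Assumption: case-sensitive containment.
        passed = reference in actual
    elif operation == "ilike":
        # Assumption: case-insensitive containment.
        passed = reference.lower() in actual.lower()
    else:
        raise ValueError(f"unknown operation: {operation}")
    # The score is binary, which is why `threshold` does not apply.
    return 1.0 if passed else 0.0


print(string_check("Paris", "Paris", "eq"))
print(string_check("The capital is PARIS", "paris", "ilike"))
```

Because the result is always 0 or 1, there is no intermediate score for a `threshold` to cut, which matches the note added to the fields table.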
