- Add String Check Grader section to docs/custom-evaluators.md
- Document that threshold is not applicable to string_check grader
- Update "How it works" to describe grader-aware JSONL item building
- Add openai_eval examples (text_similarity + string_check) to eval_config.yaml
- Mention both grader types in README Custom Evaluators section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README.md (1 addition, 1 deletion)

@@ -216,7 +216,7 @@ evaluators:
 agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
 ```
 
-Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
+Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). Two grader types are supported: `text_similarity` for comparing responses against a golden reference, and `string_check` for exact or pattern-based matching against a fixed value. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
@@ -317,9 +317,32 @@ The `grader.evaluation_metric` field selects the similarity algorithm:
 | `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
 | `rouge_l` | Longest common subsequence overlap (F-measure) |
 
+### String Check Grader
+
+Checks the agent's response against a fixed reference string using comparison operations. Does **not** require an eval set — the reference value is specified directly in the grader config. The `threshold` field is not applicable to this grader (string_check always returns 0 or 1).
+
+```yaml
+evaluators:
+  - name: city_name_check
+    type: openai_eval
+    grader:
+      type: string_check
+      operation: eq
+      reference: "Paris"
+```
+
+The `grader.operation` field selects the comparison:
+
+| Operation | Description |
+|---|---|
+| `eq` | Exact equality |
+| `ne` | Not equal |
+| `like` | Pattern match (case-sensitive) |
+| `ilike` | Pattern match (case-insensitive) |
+
 ### How it works
 
-Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
+Under the hood, agentevals creates an ephemeral eval on OpenAI, submits invocations as JSONL items, polls for results, and cleans up. For `text_similarity` graders, each item contains both the actual and expected responses; for `string_check` graders, each item contains only the actual response (the reference is supplied statically in the grader config). Items are placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
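
Reviewer note: the grader-aware item building described in the updated "How it works" paragraph could look roughly like the sketch below. This is not the agentevals implementation. The helper names `build_item` and `to_jsonl` and the field names `actual_response` and `expected_response` are assumptions made for illustration; only the `item` namespace and the two grader type names come from the diff.

```python
import json
from typing import Optional


def build_item(grader_type: str, actual: str, expected: Optional[str] = None) -> dict:
    """Build one JSONL row for a grader type.

    Field names inside "item" are hypothetical; the diff only states that
    rows live under the `item` namespace.
    """
    if grader_type == "text_similarity":
        # text_similarity compares against a golden reference, so the row
        # must carry both the actual and the expected response.
        if expected is None:
            raise ValueError("text_similarity requires an expected response")
        return {"item": {"actual_response": actual, "expected_response": expected}}
    if grader_type == "string_check":
        # string_check takes its reference from the grader config, so the
        # row carries only the actual response; `expected` is ignored.
        return {"item": {"actual_response": actual}}
    raise ValueError(f"unsupported grader type: {grader_type}")


def to_jsonl(rows: list) -> str:
    """Serialize rows as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    print(to_jsonl([build_item("string_check", "Paris")]))
    # prints {"item": {"actual_response": "Paris"}}
```

Keeping the branch on grader type in a single function means a `string_check` run never has to look up a golden reference, which matches the doc's statement that this grader does not require an eval set.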