feat: add StringCheckGrader support for OpenAI Evals backend
Adds string_check grader alongside the existing text_similarity grader.
string_check evaluates agent responses against a fixed reference string
using comparison operations (eq, ne, like, ilike). Unlike text_similarity,
it does not require a golden eval set — the reference is specified
directly in the grader config.
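
As a rough local illustration of the four operations, the sketch below approximates what a string_check comparison computes. It is not the backend's code: the function name is hypothetical, and treating `like`/`ilike` as substring containment is an assumption about how the OpenAI grader interprets them.

```python
def string_check(actual: str, reference: str, operation: str) -> int:
    """Approximate a string_check grader locally; returns 1 (pass) or 0 (fail)."""
    if operation == "eq":
        passed = actual == reference
    elif operation == "ne":
        passed = actual != reference
    elif operation == "like":
        # Assumption: "like" is case-sensitive substring containment.
        passed = reference in actual
    elif operation == "ilike":
        # Assumption: "ilike" is the same check, ignoring case.
        passed = reference.lower() in actual.lower()
    else:
        raise ValueError(f"unsupported operation: {operation!r}")
    return int(passed)
```

Because the result is always 0 or 1, a fractional pass threshold has nothing to act on, which is consistent with the threshold inapplicability note added to the docs.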
Changes:
- config.py: _VALID_STRING_CHECK_OPERATIONS, _SUPPORTED_GRADER_TYPES,
  grader-aware validator with explicit operation/reference checks
- openai_eval_backend.py: _ACTUAL_ONLY_SCHEMA and _get_item_schema for
  grader-aware item shape, string_check branch in _build_testing_criteria,
  grader_type param in _build_jsonl_items (excludes expected_response for
  string_check), grader-relevant detail key in results (operation vs
  evaluation_metric), gated expected_invocations requirement
- docs/custom-evaluators.md: String Check Grader section, threshold
  inapplicability note, grader-aware How it works description
- examples/custom_evaluators/eval_config.yaml: example entries for both
  grader types
- README.md: mentions both grader types in Custom Evaluators section
- tests/test_openai_eval_backend.py: unit tests covering config validation,
  schema selection, testing criteria, JSONL builder, score extraction,
  and full mocked-client flow for both grader types
Addresses review feedback from @krisztianfekete on PR #102.
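
The grader-aware validation described for config.py might look roughly like the sketch below. The two constant names come from the commit message, but their contents, the `validate_grader` function, and the error messages are assumptions made for illustration; the real validator may be structured differently.

```python
from typing import Any

# Names taken from the commit message; the contents are assumed.
_SUPPORTED_GRADER_TYPES = {"text_similarity", "string_check"}
_VALID_STRING_CHECK_OPERATIONS = {"eq", "ne", "like", "ilike"}


def validate_grader(grader: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical validator: raise ValueError on a malformed grader config."""
    grader_type = grader.get("type")
    if grader_type not in _SUPPORTED_GRADER_TYPES:
        raise ValueError(f"unsupported grader type: {grader_type!r}")

    if grader_type == "string_check":
        # string_check needs an explicit operation and a reference string.
        if grader.get("operation") not in _VALID_STRING_CHECK_OPERATIONS:
            raise ValueError(f"invalid operation: {grader.get('operation')!r}")
        if not isinstance(grader.get("reference"), str):
            raise ValueError("string_check requires a string 'reference'")
    elif not grader.get("evaluation_metric"):
        # text_similarity selects its algorithm through evaluation_metric.
        raise ValueError("text_similarity requires 'evaluation_metric'")

    return grader
```

Checking `operation` and `reference` explicitly at config load time surfaces a typo such as `operation: equals` before any request is sent to the API.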
README.md (1 addition, 1 deletion)
@@ -240,7 +240,7 @@ evaluators:
     threshold: 0.7
 ```
 
-Evaluators with a `requirements.txt` get automatic virtual environment management. You can also use `type: remote` for community evaluators from GitHub, or `type: openai_eval` to delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) (requires `pip install "agentevals-cli[openai]"`).
+Evaluators with a `requirements.txt` get automatic virtual environment management. You can also use `type: remote` for community evaluators from GitHub, or `type: openai_eval` to delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) (requires `pip install "agentevals-cli[openai]"`). Two OpenAI grader types are supported: `text_similarity` for comparing responses against a golden reference, and `string_check` for exact or pattern-based matching against a fixed value.
 
 See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK helpers, and how to contribute evaluators.
@@ -317,9 +317,32 @@ The `grader.evaluation_metric` field selects the similarity algorithm:
 | `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
 | `rouge_l` | Longest common subsequence overlap (F-measure) |
 
+### String Check Grader
+
+Checks the agent's response against a fixed reference string using comparison operations. Does **not** require an eval set — the reference value is specified directly in the grader config. The `threshold` field is not applicable to this grader (string_check always returns 0 or 1).
+
+```yaml
+evaluators:
+  - name: city_name_check
+    type: openai_eval
+    grader:
+      type: string_check
+      operation: eq
+      reference: "Paris"
+```
+
+The `grader.operation` field selects the comparison:
+
+| Operation | Description |
+|---|---|
+| `eq` | Exact equality |
+| `ne` | Not equal |
+| `like` | Pattern match (case-sensitive) |
+| `ilike` | Pattern match (case-insensitive) |
+
 ### How it works
 
-Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
+Under the hood, agentevals creates an ephemeral eval on OpenAI, submits invocations as JSONL items, polls for results, and cleans up. For `text_similarity` graders, each item contains both the actual and expected responses; for `string_check` graders, each item contains only the actual response (the reference is supplied statically in the grader config). Items are placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
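
The grader-aware item shape described in the updated "How it works" text can be sketched as follows. This is an illustration under stated assumptions, not the code in openai_eval_backend.py: the function names, the `actual_response` field name, and the exact shape of the criterion dict are hypothetical, while `expected_response` and the `item` namespace come from the commit itself.

```python
import json
from typing import Optional


def build_jsonl_items(
    actual: list[str],
    expected: Optional[list[str]],
    grader_type: str,
) -> list[str]:
    """Return one JSON line per invocation, shaped for the given grader type."""
    lines = []
    for i, response in enumerate(actual):
        item = {"actual_response": response}  # field name is an assumption
        if grader_type != "string_check":
            # Only reference-based graders carry the golden response per item.
            if expected is None:
                raise ValueError(f"{grader_type} requires expected responses")
            item["expected_response"] = expected[i]
        lines.append(json.dumps({"item": item}))
    return lines


def string_check_criterion(name: str, grader: dict) -> dict:
    """Assumed shape of a string_check testing criterion with a static reference."""
    return {
        "type": "string_check",
        "name": name,
        "input": "{{item.actual_response}}",  # template into the item namespace
        "operation": grader["operation"],
        "reference": grader["reference"],
    }
```

For example, `build_jsonl_items(["Paris"], None, "string_check")` yields a single line whose item holds only the actual response, while the `text_similarity` path refuses to run without expected responses, mirroring the gated expected_invocations requirement in the commit message.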