Commit 854133a

feat: add StringCheckGrader support for OpenAI Evals backend
Adds string_check grader alongside the existing text_similarity grader. string_check evaluates agent responses against a fixed reference string using comparison operations (eq, ne, like, ilike). Unlike text_similarity, it does not require a golden eval set — the reference is specified directly in the grader config.

Changes:
- config.py: _VALID_STRING_CHECK_OPERATIONS, _SUPPORTED_GRADER_TYPES, grader-aware validator with explicit operation/reference checks
- openai_eval_backend.py: _ACTUAL_ONLY_SCHEMA and _get_item_schema for grader-aware item shape, string_check branch in _build_testing_criteria, grader_type param in _build_jsonl_items (excludes expected_response for string_check), grader-relevant detail key in results (operation vs evaluation_metric), gated expected_invocations requirement
- docs/custom-evaluators.md: String Check Grader section, threshold inapplicability note, grader-aware How it works description
- examples/custom_evaluators/eval_config.yaml: example entries for both grader types
- README.md: mentions both grader types in Custom Evaluators section
- tests/test_openai_eval_backend.py: unit tests covering config validation, schema selection, testing criteria, JSONL builder, score extraction, and full mocked-client flow for both grader types

Addresses review feedback from @krisztianfekete on PR #102.
1 parent 0fb491c commit 854133a

6 files changed

Lines changed: 482 additions & 27 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -240,7 +240,7 @@ evaluators:
     threshold: 0.7
 ```
 
-Evaluators with a `requirements.txt` get automatic virtual environment management. You can also use `type: remote` for community evaluators from GitHub, or `type: openai_eval` to delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) (requires `pip install "agentevals-cli[openai]"`).
+Evaluators with a `requirements.txt` get automatic virtual environment management. You can also use `type: remote` for community evaluators from GitHub, or `type: openai_eval` to delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) (requires `pip install "agentevals-cli[openai]"`). Two OpenAI grader types are supported: `text_similarity` for comparing responses against a golden reference, and `string_check` for exact or pattern-based matching against a fixed value.
 
 See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK helpers, and how to contribute evaluators.
 

docs/custom-evaluators.md

Lines changed: 25 additions & 2 deletions
@@ -104,7 +104,7 @@ Each evaluator entry in the `evaluators` list uses the following fields. The `ty
 |---|---|---|---|
 | `name` | yes | | Unique name for the evaluator (used in output) |
 | `type` | yes | | `openai_eval` for OpenAI Evals API graders |
-| `threshold` | no | `0.5` | Maps to `pass_threshold` in the OpenAI grader |
+| `threshold` | no | `0.5` | Maps to `pass_threshold` in the OpenAI grader (not applicable for `string_check`) |
 | `timeout` | no | `120` | Max seconds to wait for the OpenAI eval run |
 | `grader` | yes | | OpenAI grader config (see [OpenAI Evals Graders](#openai-evals-api-graders)) |
 

@@ -317,9 +317,32 @@ The `grader.evaluation_metric` field selects the similarity algorithm:
 | `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
 | `rouge_l` | Longest common subsequence overlap (F-measure) |
 
+### String Check Grader
+
+Checks the agent's response against a fixed reference string using comparison operations. Does **not** require an eval set — the reference value is specified directly in the grader config. The `threshold` field is not applicable to this grader (string_check always returns 0 or 1).
+
+```yaml
+evaluators:
+  - name: city_name_check
+    type: openai_eval
+    grader:
+      type: string_check
+      operation: eq
+      reference: "Paris"
+```
+
+The `grader.operation` field selects the comparison:
+
+| Operation | Description |
+|---|---|
+| `eq` | Exact equality |
+| `ne` | Not equal |
+| `like` | Pattern match (case-sensitive) |
+| `ilike` | Pattern match (case-insensitive) |
+
 ### How it works
 
-Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
+Under the hood, agentevals creates an ephemeral eval on OpenAI, submits invocations as JSONL items, polls for results, and cleans up. For `text_similarity` graders, each item contains both the actual and expected responses; for `string_check` graders, each item contains only the actual response (the reference is supplied statically in the grader config). Items are placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
 
 ### Configuring the GitHub source
 
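
The two item shapes described in the updated "How it works" paragraph can be illustrated with a minimal sketch of the JSONL payload for each grader type. The helper name `to_jsonl` and the sample responses are assumptions made for illustration and are not part of agentevals. Only the `item`, `actual_response`, and `expected_response` keys come from the change itself.

```python
import json
from typing import Any


def to_jsonl(items: list[dict[str, Any]]) -> str:
    """Serialize items as JSON Lines: one JSON object per line (hypothetical helper)."""
    return "\n".join(json.dumps(item) for item in items)


# text_similarity: each item carries both the actual and the expected response.
text_similarity_items = [
    {"item": {"actual_response": "Paris", "expected_response": "Paris, France"}},
]

# string_check: only the actual response is sent; the reference string
# lives in the grader config, not in the item.
string_check_items = [
    {"item": {"actual_response": "Paris"}},
]

print(to_jsonl(text_similarity_items))
print(to_jsonl(string_check_items))
```

Because the `string_check` item has no `expected_response` key, a reference given as a template that reads that key would have nothing to resolve against. A literal value such as `"Paris"` avoids the problem.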

examples/custom_evaluators/eval_config.yaml

Lines changed: 15 additions & 0 deletions
@@ -32,3 +32,18 @@ evaluators:
     ref: evaluators/random_evaluator/random_evaluator.py
     threshold: 0.110
     executor: local
+
+  # OpenAI Evals API graders (requires OPENAI_API_KEY)
+  - name: response_similarity
+    type: openai_eval
+    threshold: 0.8
+    grader:
+      type: text_similarity
+      evaluation_metric: fuzzy_match
+
+  - name: city_name_check
+    type: openai_eval
+    grader:
+      type: string_check
+      operation: eq
+      reference: "{{ item.expected_response }}"

src/agentevals/config.py

Lines changed: 37 additions & 7 deletions
@@ -70,6 +70,18 @@ class RemoteEvaluatorDef(BaseEvaluatorDef):
     }
 )
 
+_VALID_STRING_CHECK_OPERATIONS = frozenset(
+    {
+        "eq",
+        "ne",
+        "like",
+        "ilike",
+    }
+)
+
+# All supported grader types — used in error messages and type checks.
+_SUPPORTED_GRADER_TYPES = frozenset({"text_similarity", "string_check"})
+
 
 class OpenAIEvalDef(BaseModel):
     """An evaluator that delegates grading to the OpenAI Evals API."""
@@ -84,13 +96,31 @@ class OpenAIEvalDef(BaseModel):
     @classmethod
     def _validate_grader(cls, v: dict[str, Any]) -> dict[str, Any]:
         grader_type = v.get("type")
-        if grader_type != "text_similarity":
-            raise ValueError(f"Only 'text_similarity' grader type is currently supported, got '{grader_type}'")
-        metric = v.get("evaluation_metric")
-        if not metric:
-            raise ValueError("'evaluation_metric' is required for text_similarity grader")
-        if metric not in _VALID_SIMILARITY_METRICS:
-            raise ValueError(f"Unknown evaluation_metric '{metric}'. Valid: {sorted(_VALID_SIMILARITY_METRICS)}")
+
+        if grader_type == "text_similarity":
+            metric = v.get("evaluation_metric")
+            if not metric:
+                raise ValueError("'evaluation_metric' is required for text_similarity grader")
+            if metric not in _VALID_SIMILARITY_METRICS:
+                raise ValueError(
+                    f"Unknown evaluation_metric '{metric}'. Valid: {sorted(_VALID_SIMILARITY_METRICS)}"
+                )
+        elif grader_type == "string_check":
+            operation = v.get("operation")
+            if not operation:
+                raise ValueError("'operation' is required for string_check grader")
+            if operation not in _VALID_STRING_CHECK_OPERATIONS:
+                raise ValueError(
+                    f"Unknown operation '{operation}'. Valid: {sorted(_VALID_STRING_CHECK_OPERATIONS)}"
+                )
+            if not v.get("reference"):
+                raise ValueError("'reference' is required for string_check grader")
+        else:
+            raise ValueError(
+                f"Unsupported grader type '{grader_type}'. "
+                f"Supported: {sorted(_SUPPORTED_GRADER_TYPES)}"
+            )
+
         return v
 
 
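
To see how the `string_check` branch of the validator behaves on a few configs, here is a standalone sketch that mirrors its checks without the Pydantic model. The names `validate_string_check` and `error_for` are hypothetical, introduced only for this illustration. The error messages and the operation set are copied from the diff above.

```python
from typing import Any, Optional

# Same members as _VALID_STRING_CHECK_OPERATIONS in the diff.
VALID_STRING_CHECK_OPERATIONS = frozenset({"eq", "ne", "like", "ilike"})


def validate_string_check(grader: dict[str, Any]) -> dict[str, Any]:
    """Standalone mirror of the string_check validation branch (hypothetical name)."""
    operation = grader.get("operation")
    if not operation:
        raise ValueError("'operation' is required for string_check grader")
    if operation not in VALID_STRING_CHECK_OPERATIONS:
        raise ValueError(
            f"Unknown operation '{operation}'. Valid: {sorted(VALID_STRING_CHECK_OPERATIONS)}"
        )
    if not grader.get("reference"):
        raise ValueError("'reference' is required for string_check grader")
    return grader


def error_for(grader: dict[str, Any]) -> Optional[str]:
    """Return the validation error message, or None when the config is accepted."""
    try:
        validate_string_check(grader)
    except ValueError as exc:
        return str(exc)
    return None


print(error_for({"type": "string_check", "operation": "eq", "reference": "Paris"}))
print(error_for({"type": "string_check", "operation": "contains", "reference": "Paris"}))
print(error_for({"type": "string_check", "operation": "eq", "reference": ""}))
```

One consequence of the truthiness check on `reference` is that an empty string is rejected the same way as a missing key, so a config cannot assert that a response is exactly empty.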

src/agentevals/openai_eval_backend.py

Lines changed: 63 additions & 17 deletions
@@ -22,6 +22,7 @@
 
 _POLL_INTERVAL_SECONDS = 2
 
+# Schema for graders that compare actual vs expected (e.g. text_similarity).
 _TEXT_PAIR_SCHEMA = {
     "type": "object",
     "properties": {
@@ -31,6 +32,22 @@
     "required": ["actual_response", "expected_response"],
 }
 
+# Schema for graders that only need the actual response (e.g. string_check).
+_ACTUAL_ONLY_SCHEMA = {
+    "type": "object",
+    "properties": {
+        "actual_response": {"type": "string"},
+    },
+    "required": ["actual_response"],
+}
+
+
+def _get_item_schema(grader_type: str) -> dict[str, Any]:
+    """Return the appropriate item schema for the given grader type."""
+    if grader_type == "string_check":
+        return _ACTUAL_ONLY_SCHEMA
+    return _TEXT_PAIR_SCHEMA
+
 
 def _build_testing_criteria(evaluator_def: OpenAIEvalDef) -> dict[str, Any]:
     """Build the OpenAI testing_criteria dict from the evaluator config.
@@ -51,28 +68,41 @@ def _build_testing_criteria(evaluator_def: OpenAIEvalDef) -> dict[str, Any]:
             "pass_threshold": evaluator_def.threshold,
         }
 
+    if grader_type == "string_check":
+        return {
+            "type": "string_check",
+            "name": evaluator_def.name,
+            "input": "{{ item.actual_response }}",
+            "reference": grader["reference"],
+            "operation": grader["operation"],
+        }
+
     raise ValueError(f"Unsupported grader type: {grader_type}")
 
 
 def _build_jsonl_items(
     actual_invocations: list[Invocation],
     expected_invocations: list[Invocation],
+    grader_type: str = "",
 ) -> list[dict[str, Any]]:
+    """Build JSONL items matching the grader-aware item schema.
+
+    string_check graders use a static reference from config and only need
+    ``actual_response`` in each item. All other graders (e.g. text_similarity)
+    also require ``expected_response``.
+    """
+    include_expected = grader_type != "string_check"
     items = []
     for i, actual_inv in enumerate(actual_invocations):
         actual_text = _content_to_text(actual_inv.final_response)
-        if i < len(expected_invocations):
-            expected_text = _content_to_text(expected_invocations[i].final_response)
-        else:
-            expected_text = ""
-        items.append(
-            {
-                "item": {
-                    "actual_response": actual_text,
-                    "expected_response": expected_text,
-                }
-            }
-        )
+        item: dict[str, Any] = {"actual_response": actual_text}
+        if include_expected:
+            if i < len(expected_invocations):
+                expected_text = _content_to_text(expected_invocations[i].final_response)
+            else:
+                expected_text = ""
+            item["expected_response"] = expected_text
+        items.append({"item": item})
     return items
 
 
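
The padding behaviour of the item builder is easier to follow with concrete inputs. The sketch below reproduces the same control flow using plain strings in place of `Invocation` objects and `_content_to_text`. That substitution and the name `build_items` are assumptions made so the example is self-contained.

```python
from typing import Any


def build_items(
    actual: list[str],
    expected: list[str],
    grader_type: str = "",
) -> list[dict[str, Any]]:
    """Sketch of the grader-aware item builder using plain strings (hypothetical)."""
    include_expected = grader_type != "string_check"
    items = []
    for i, actual_text in enumerate(actual):
        item: dict[str, Any] = {"actual_response": actual_text}
        if include_expected:
            # When there are fewer expected responses than actual ones,
            # the missing entries are padded with an empty string.
            item["expected_response"] = expected[i] if i < len(expected) else ""
        items.append({"item": item})
    return items


print(build_items(["a", "b"], ["x"], grader_type="text_similarity"))
print(build_items(["a", "b"], [], grader_type="string_check"))
```

Note that the default `grader_type=""` falls through to the pair shape, so any caller that omits the argument keeps the earlier behaviour.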

@@ -111,13 +141,21 @@ async def evaluate_openai_eval(
             error="OPENAI_API_KEY environment variable is not set.",
         )
 
-    if expected_invocations is None:
+    grader_type = evaluator_def.grader.get("type", "")
+
+    # string_check graders use a static reference from config and don't need
+    # expected_invocations — only text_similarity requires a golden eval set.
+    if grader_type != "string_check" and expected_invocations is None:
         return MetricResult(
             metric_name=evaluator_def.name,
-            error="OpenAI text_similarity grader requires expected invocations (golden eval set).",
+            error=f"OpenAI {grader_type} grader requires expected invocations (golden eval set).",
         )
 
-    items = _build_jsonl_items(actual_invocations, expected_invocations)
+    items = _build_jsonl_items(
+        actual_invocations,
+        expected_invocations if expected_invocations is not None else [],
+        grader_type=grader_type,
+    )
     if not items:
         return MetricResult(
             metric_name=evaluator_def.name,
@@ -135,7 +173,7 @@ async def evaluate_openai_eval(
             name=f"agentevals-{evaluator_def.name}",
             data_source_config={
                 "type": "custom",
-                "item_schema": _TEXT_PAIR_SCHEMA,
+                "item_schema": _get_item_schema(grader_type),
                 "include_sample_schema": False,
             },
             testing_criteria=[testing_criteria],
@@ -225,10 +263,18 @@ async def _collect_results(client: Any, eval_id: str, run_id: str, run: Any, eva
     total = result_counts.total if result_counts else 0
     eval_status = "PASSED" if failed == 0 and total > 0 else "FAILED"
 
+    grader_type = evaluator_def.grader.get("type", "")
+    # Include the grader-relevant key depending on type
+    # (evaluation_metric for text_similarity, operation for string_check)
+    if grader_type == "string_check":
+        grader_detail_key = "operation"
+    else:
+        grader_detail_key = "evaluation_metric"
+
     details: dict[str, Any] = {
         "openai_eval_id": eval_id,
         "openai_run_id": run_id,
-        "evaluation_metric": evaluator_def.grader.get("evaluation_metric"),
+        grader_detail_key: evaluator_def.grader.get(grader_detail_key),
         "result_counts": {"passed": passed, "failed": failed, "total": total},
     }
     per_criteria = getattr(run, "per_testing_criteria_results", None)
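
As a rough illustration of what the `string_check` branch sends to OpenAI and what ends up in the result details, the sketch below rebuilds both pieces outside the backend. `EvaluatorStub` is a hypothetical stand-in for `OpenAIEvalDef` with only the two fields used here, and the function names are likewise invented for the example. The dictionary keys are the ones shown in the diff.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvaluatorStub:
    """Hypothetical stand-in for OpenAIEvalDef with only the fields used here."""

    name: str
    grader: dict[str, Any] = field(default_factory=dict)


def build_string_check_criteria(evaluator: EvaluatorStub) -> dict[str, Any]:
    """Mirror the string_check branch: templated input, static reference."""
    return {
        "type": "string_check",
        "name": evaluator.name,
        "input": "{{ item.actual_response }}",
        "reference": evaluator.grader["reference"],
        "operation": evaluator.grader["operation"],
    }


def grader_detail(evaluator: EvaluatorStub) -> dict[str, Any]:
    """Pick the grader-specific key reported in the result details."""
    if evaluator.grader.get("type", "") == "string_check":
        key = "operation"
    else:
        key = "evaluation_metric"
    return {key: evaluator.grader.get(key)}


stub = EvaluatorStub(
    name="city_name_check",
    grader={"type": "string_check", "operation": "eq", "reference": "Paris"},
)
print(build_string_check_criteria(stub))
print(grader_detail(stub))
```

The criteria dict carries no `pass_threshold`, which matches the documentation note that `threshold` does not apply to `string_check`.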
