fix: address remaining review comments — update docs, examples, and PR description

wiliyam · claude · wiliyam · commit ec33312b03d9 · 2026-04-21T07:13:38.000Z
- Add String Check Grader section to docs/custom-evaluators.md
- Document that threshold is not applicable to string_check grader
- Update "How it works" to describe grader-aware JSONL item building
- Add openai_eval examples (text_similarity + string_check) to eval_config.yaml
- Mention both grader types in README Custom Evaluators section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -216,7 +216,7 @@ evaluators:
 agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
 ```
 
-Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
+Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). Two grader types are supported: `text_similarity` for comparing responses against a golden reference, and `string_check` for exact or pattern-based matching against a fixed value. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
 
 ## Web UI
 
diff --git a/docs/custom-evaluators.md b/docs/custom-evaluators.md
@@ -104,7 +104,7 @@ Each evaluator entry in the `evaluators` list uses the following fields. The `ty
 |---|---|---|---|
 | `name` | yes | | Unique name for the evaluator (used in output) |
 | `type` | yes | | `openai_eval` for OpenAI Evals API graders |
-| `threshold` | no | `0.5` | Maps to `pass_threshold` in the OpenAI grader |
+| `threshold` | no | `0.5` | Maps to `pass_threshold` in the OpenAI grader (not applicable for `string_check`) |
 | `timeout` | no | `120` | Max seconds to wait for the OpenAI eval run |
 | `grader` | yes | | OpenAI grader config (see [OpenAI Evals Graders](#openai-evals-api-graders)) |
 
@@ -317,9 +317,32 @@ The `grader.evaluation_metric` field selects the similarity algorithm:
 | `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
 | `rouge_l` | Longest common subsequence overlap (F-measure) |
 
+### String Check Grader
+
+Checks the agent's response against a fixed reference string using a comparison operation. It does **not** require an eval set: the reference value is specified directly in the grader config. The `threshold` field is not applicable to this grader, because `string_check` always returns 0 or 1.
+
+```yaml
+evaluators:
+  - name: city_name_check
+    type: openai_eval
+    grader:
+      type: string_check
+      operation: eq
+      reference: "Paris"
+```
+
+The `grader.operation` field selects the comparison:
+
+| Operation | Description |
+|---|---|
+| `eq` | Exact equality |
+| `ne` | Not equal |
+| `like` | Pattern match: response contains the reference (case-sensitive) |
+| `ilike` | Pattern match: response contains the reference (case-insensitive) |
+
 ### How it works
 
-Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
+Under the hood, agentevals creates an ephemeral eval on OpenAI, submits invocations as JSONL items, polls for results, and cleans up. For `text_similarity` graders, each item contains both the actual and expected responses; for `string_check` graders, each item contains only the actual response (the reference is supplied statically in the grader config). Items are placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
 
 ### Configuring the GitHub source
 
diff --git a/examples/custom_evaluators/eval_config.yaml b/examples/custom_evaluators/eval_config.yaml
@@ -32,3 +32,18 @@ evaluators:
     ref: evaluators/random_evaluator/random_evaluator.py
     threshold: 0.110
     executor: local
+
+  # OpenAI Evals API graders (requires OPENAI_API_KEY)
+  - name: response_similarity
+    type: openai_eval
+    threshold: 0.8
+    grader:
+      type: text_similarity
+      evaluation_metric: fuzzy_match
+
+  - name: city_name_check
+    type: openai_eval
+    grader:
+      type: string_check
+      operation: eq
+      reference: "Paris"
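
Note on the "How it works" change: the grader-aware item building described in docs/custom-evaluators.md can be sketched roughly as below. This is an illustration only, not the agentevals implementation. The helper names (`build_item`, `to_jsonl`, `string_check`) and the field names `actual_response` and `expected_response` are assumptions made for the example, and the `like`/`ilike` behaviour shown (substring containment) reflects one reading of the OpenAI grader semantics rather than anything confirmed by this diff.

```python
import json


def build_item(actual, expected=None, grader_type="text_similarity"):
    """Build one JSONL row, wrapped in the `item` namespace.

    Hypothetical helper: field names are assumptions, not the real schema.
    """
    item = {"actual_response": actual}
    if grader_type == "text_similarity":
        # text_similarity compares against a golden reference from the eval set.
        if expected is None:
            raise ValueError("text_similarity requires an expected response")
        item["expected_response"] = expected
    elif grader_type != "string_check":
        raise ValueError(f"unsupported grader type: {grader_type}")
    # string_check: the reference lives in the grader config, so the item
    # carries only the actual response.
    return {"item": item}


def to_jsonl(rows):
    """Serialize rows as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def string_check(actual, reference, operation):
    """Local emulation of the four operations; always returns 0 or 1."""
    if operation == "eq":
        passed = actual == reference
    elif operation == "ne":
        passed = actual != reference
    elif operation == "like":
        # Assumed semantics: case-sensitive containment.
        passed = reference in actual
    elif operation == "ilike":
        # Assumed semantics: case-insensitive containment.
        passed = reference.lower() in actual.lower()
    else:
        raise ValueError(f"unknown operation: {operation}")
    return 1 if passed else 0


if __name__ == "__main__":
    rows = [
        build_item("Paris", expected="Paris"),
        build_item("Paris", grader_type="string_check"),
    ]
    print(to_jsonl(rows))
    print(string_check("The capital is Paris", "paris", "ilike"))
```

Because the result is always 0 or 1, a fractional `threshold` has nothing to discriminate on, which is consistent with the docs marking it as not applicable for `string_check`.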