feat: add StringCheckGrader support for OpenAI Evals backend#102

Open
wiliyam wants to merge 2 commits into agentevals-dev:main from wiliyam:feat/string-check-grader-95

Conversation

@wiliyam

@wiliyam wiliyam commented Apr 1, 2026

Summary

Adds string_check grader support for the OpenAI Evals backend, alongside the existing text_similarity grader.

String check evaluates agent responses against a fixed reference string using comparison operations (eq, ne, like, ilike). Unlike text_similarity, it does not require a golden eval set — the reference value is specified directly in the grader config.
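For context, the four operations behave roughly as follows. This is a hedged sketch, not the backend's implementation: the actual comparison happens server-side in the OpenAI Evals API, and the "like"/"ilike" semantics shown here (case-sensitive and case-insensitive substring containment) are an assumption.

```python
def string_check(actual: str, reference: str, operation: str) -> bool:
    """Assumed semantics of the string_check comparison operations."""
    if operation == "eq":
        return actual == reference
    if operation == "ne":
        return actual != reference
    if operation == "like":
        # assumed: case-sensitive substring containment
        return reference in actual
    if operation == "ilike":
        # assumed: case-insensitive substring containment
        return reference.lower() in actual.lower()
    raise ValueError(f"unknown operation: {operation!r}")
```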

Changes

  • config.py — _VALID_STRING_CHECK_OPERATIONS, grader-aware validator (_SUPPORTED_GRADER_TYPES), EvalRunConfig preserved
  • openai_eval_backend.py — grader-aware item schema (_ACTUAL_ONLY_SCHEMA), JSONL builder, testing criteria, result details
  • docs/custom-evaluators.md — String Check Grader section, threshold inapplicability note, grader-aware How it works description
  • examples/custom_evaluators/eval_config.yaml — example entries for both grader types
  • README.md — mentions both grader types in Custom Evaluators section
  • tests/test_openai_eval_backend.py — unit tests covering validation, schema, criteria, JSONL items, and full flow

Key design decisions

  • string_check items contain only actual_response (no expected_response) — matching OpenAI API expectation
  • expected_invocations gated on grader type: only required for text_similarity
  • threshold not applicable to string_check (always 0 or 1)
  • Result details include grader-relevant key (operation for string_check, evaluation_metric for text_similarity)

Test plan

  • Config validation: rejects missing/invalid operation, missing reference, unsupported types
  • Schema selection: _ACTUAL_ONLY_SCHEMA for string_check, _TEXT_PAIR_SCHEMA for text_similarity
  • JSONL builder: excludes expected_response for string_check
  • Full flow (mocked client): string_check succeeds without expected_invocations
  • Full flow (mocked client): text_similarity requires expected_invocations

Contributor

@krisztianfekete krisztianfekete left a comment


Thank you, added some review comments!

Comment thread src/agentevals/openai_eval_backend.py Outdated
Contributor


This conditional rejects configs for all grader types that are missing expected_invocations, but string_check uses a static reference from config and doesn't need them.

Can you gate this on grader_type?

```python
        "actual_response": {"type": "string"},
        "expected_response": {"type": "string"},
    },
    "required": ["actual_response", "expected_response"],
```
Contributor


expected_response is no longer required as string_check does not use it. Maybe we should make the schema grader-aware.

Contributor


The JSONL items contain a field not declared in the schema. Please make this builder grader-aware too.

Comment thread src/agentevals/openai_eval_backend.py Outdated
Contributor


This will return None for string_check graders. Please make this conditional, or include grader-relevant keys, e.g. operation instead.

Comment thread src/agentevals/config.py Outdated
```python
    raise ValueError("'operation' is required for string_check grader")
if operation not in _VALID_STRING_CHECK_OPERATIONS:
    raise ValueError(f"Unknown operation '{operation}'. Valid: {sorted(_VALID_STRING_CHECK_OPERATIONS)}")
if "reference" not in v:
```
Contributor


Can we do what we do for the other branch here as well, with `if not metric`?

Contributor


Still relevant.

Comment thread src/agentevals/config.py Outdated
```python
if "reference" not in v:
    raise ValueError("'reference' is required for string_check grader")
else:
    supported = "'text_similarity', 'string_check'"
```
Contributor


Can we use something like _SUPPORTED_GRADER_TYPES constant for all supported graders?

@wiliyam
Author

wiliyam commented Apr 2, 2026

Thanks for the detailed review @krisztianfekete! Addressed all 5 points:

  1. Grader type check — moved the grader_type not in _SUPPORTED_GRADER_TYPES check to the top, so unsupported types are rejected immediately regardless of other conditions
  2. Grader-aware schema — added _ACTUAL_ONLY_SCHEMA for graders that don't need expected_response (like string_check), and _get_item_schema(grader_type) helper to select the right schema
  3. expected_invocations gating — now only required for non-string_check graders since string_check uses a static reference from config
  4. operation in error context — the string_check testing criteria now correctly uses operation from config
  5. _SUPPORTED_GRADER_TYPES constant — added, used in both the validator and the unsupported-type error message

Contributor

@krisztianfekete krisztianfekete left a comment


Can you please take a closer look: EvalRunConfig most definitely shouldn't have been deleted, and much of the review feedback hasn't been addressed. Also keep our guidelines in mind when contributing: https://github.com/agentevals-dev/agentevals/blob/main/CONTRIBUTING.md#responsible-ai-usage

Comment thread src/agentevals/config.py
```python
    BuiltinMetricDef | CodeEvaluatorDef | RemoteEvaluatorDef | OpenAIEvalDef,
    Field(discriminator="type"),
]
```

Contributor


This has to be reverted.

@wiliyam
Author

wiliyam commented Apr 2, 2026

Apologies for the sloppy rewrite @krisztianfekete — I accidentally deleted EvalRunConfig when rewriting config.py. Fixed in this push:

  1. EvalRunConfig restored — exactly as it was in upstream
  2. Validator order reverted — type-specific checks first, unsupported type raises at the bottom (original pattern)
  3. if not metric style — matches other branch
  4. Grader-relevant keys in details — operation for string_check, evaluation_metric for text_similarity instead of always returning None
  5. _SUPPORTED_GRADER_TYPES constant — kept, used in the final else raise
  6. Grader-aware schema — _ACTUAL_ONLY_SCHEMA for string_check, _TEXT_PAIR_SCHEMA for text_similarity
  7. expected_invocations gating — only required for non-string_check graders

Sorry again for the noise!
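A hedged sketch of the validator shape described in points 2, 3, and 5 of the list above. The constant and error messages follow the diff excerpts in this thread, but the function body is assumed; the real checks live in config.py's grader validator.

```python
# Sketch only: assumed shape of the grader validator in config.py.
_SUPPORTED_GRADER_TYPES = {"text_similarity", "string_check"}
_VALID_STRING_CHECK_OPERATIONS = {"eq", "ne", "like", "ilike"}

def validate_grader(v: dict) -> dict:
    grader_type = v.get("type")
    if grader_type == "text_similarity":
        # the "if not metric" pattern: rejects both missing and empty values
        if not v.get("evaluation_metric"):
            raise ValueError("'evaluation_metric' is required for text_similarity grader")
    elif grader_type == "string_check":
        operation = v.get("operation")
        if not operation:
            raise ValueError("'operation' is required for string_check grader")
        if operation not in _VALID_STRING_CHECK_OPERATIONS:
            raise ValueError(
                f"Unknown operation '{operation}'. Valid: {sorted(_VALID_STRING_CHECK_OPERATIONS)}"
            )
        if not v.get("reference"):
            raise ValueError("'reference' is required for string_check grader")
    else:
        # unsupported types raise at the bottom, per the original pattern
        supported = ", ".join(sorted(_SUPPORTED_GRADER_TYPES))
        raise ValueError(f"Unsupported grader type '{grader_type}'. Supported: {supported}")
    return v
```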

@wiliyam
Author

wiliyam commented Apr 3, 2026

Addressed latest comments @krisztianfekete:

  1. JSONL builder grader-aware — _build_jsonl_items now accepts grader_type and only includes expected_response for non-string_check graders — matching the item schema exactly
  2. if not v.get("reference") — changed from if "reference" not in v to match the if not metric pattern used in the text_similarity branch

Comment thread src/agentevals/config.py
Contributor

@krisztianfekete krisztianfekete left a comment


Can you please

  • update README and docs to expose this feature and update the example eval_config.yaml in examples
  • clean up the now stale/invalid PR description
  • document that threshold is not applicable to this grader
  • maybe add tests, as there has been some back-and-forth during implementation where tests would have caught most of these issues?

@wiliyam
Author

wiliyam commented Apr 21, 2026

@krisztianfekete — addressed all items from your April 20 review:

  • README + docs — updated in ec33312: added String Check Grader section to docs/custom-evaluators.md, documented that threshold is not applicable, updated "How it works"
  • Example eval_config.yaml — added both text_similarity and string_check examples
  • PR description — cleaned up and rewritten to reflect the final implementation
  • Tests — added tests/test_openai_eval_backend.py in afa88fa with unit tests for config validation, schema selection, JSONL building, score extraction, and full mocked-client flow for both grader types

Ready for another review when you get a chance!

Adds string_check grader alongside the existing text_similarity grader.
string_check evaluates agent responses against a fixed reference string
using comparison operations (eq, ne, like, ilike). Unlike text_similarity,
it does not require a golden eval set — the reference is specified
directly in the grader config.

Changes:
- config.py: _VALID_STRING_CHECK_OPERATIONS, _SUPPORTED_GRADER_TYPES,
  grader-aware validator with explicit operation/reference checks
- openai_eval_backend.py: _ACTUAL_ONLY_SCHEMA and _get_item_schema for
  grader-aware item shape, string_check branch in _build_testing_criteria,
  grader_type param in _build_jsonl_items (excludes expected_response for
  string_check), grader-relevant detail key in results (operation vs
  evaluation_metric), gated expected_invocations requirement
- docs/custom-evaluators.md: String Check Grader section, threshold
  inapplicability note, grader-aware How it works description
- examples/custom_evaluators/eval_config.yaml: example entries for both
  grader types
- README.md: mentions both grader types in Custom Evaluators section
- tests/test_openai_eval_backend.py: unit tests covering config validation,
  schema selection, testing criteria, JSONL builder, score extraction,
  and full mocked-client flow for both grader types

Addresses review feedback from @krisztianfekete on PR agentevals-dev#102.
@wiliyam wiliyam force-pushed the feat/string-check-grader-95 branch from afa88fa to 854133a on April 21, 2026 at 07:41
@wiliyam
Author

wiliyam commented Apr 21, 2026

Rebased onto latest `main` (the branch had conflicts from the recent `EvalParams` refactor and trace-loader changes). Squashed all review-iteration commits into a single clean commit (`854133a`) that represents the final feature state on top of current `main`.

Net changes preserved from the maintainer's upstream work:

  • Kept new `EvalParams` base class + `EvalRunConfig` inheritance structure
  • Kept trace-loader extensions
  • Only added the string_check branch to the grader validator and `_SUPPORTED_GRADER_TYPES` constant

No more merge conflicts, ready for review.

```yaml
grader:
  type: string_check
  operation: eq
  reference: "{{ item.expected_response }}"
```
Contributor


This doesn't seem to exist.

Unwraps a few multi-line f-string ValueError messages that exceed the
default line length but fit when collapsed. Pure formatting — no logic
change. Fixes the `ruff format --check` CI step on PR agentevals-dev#102.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
