feat: add StringCheckGrader support for OpenAI Evals backend (#102)
wiliyam wants to merge 2 commits into agentevals-dev:main
Conversation
krisztianfekete left a comment
Thank you, added some review comments!
This conditional will reject all grader types, but string_check uses a static reference from config and doesn't need these fields.
Can you gate this on grader_type?
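A minimal sketch of the gating being asked for, assuming the `expected_invocations` field named elsewhere in this PR; the helper name `validate_eval_def` is hypothetical:

```python
# Hypothetical sketch: gate the golden-set requirement on grader type.
# validate_eval_def and the exact field names are assumptions, not this PR's API.
def validate_eval_def(cfg: dict) -> None:
    grader_type = cfg.get("grader", {}).get("type")
    if grader_type == "string_check":
        # string_check compares against a static reference from config,
        # so no golden eval set (expected_invocations) is needed.
        return
    if not cfg.get("expected_invocations"):
        raise ValueError(
            f"'expected_invocations' is required for grader type '{grader_type}'"
        )
```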
| "actual_response": {"type": "string"}, | ||
| "expected_response": {"type": "string"}, | ||
| }, | ||
| "required": ["actual_response", "expected_response"], |
`expected_response` is no longer required, as string_check does not use it. Maybe we should make the schema grader-aware.
The JSONL items contain a field not declared in the schema. Please make this builder grader-aware too
This will return None for string_check graders. Please make this conditional, or include grader-relevant keys, e.g. `operation` instead.
```python
        raise ValueError("'operation' is required for string_check grader")
    if operation not in _VALID_STRING_CHECK_OPERATIONS:
        raise ValueError(f"Unknown operation '{operation}'. Valid: {sorted(_VALID_STRING_CHECK_OPERATIONS)}")
    if "reference" not in v:
```
Can we do what we do for the other branch here as well, with `if not metric`?
| if "reference" not in v: | ||
| raise ValueError("'reference' is required for string_check grader") | ||
| else: | ||
| supported = "'text_similarity', 'string_check'" |
Can we use something like a `_SUPPORTED_GRADER_TYPES` constant for all supported graders?
Thanks for the detailed review @krisztianfekete! Addressed all 5 points:
krisztianfekete left a comment
Can you please take a closer look? `EvalRunConfig` most definitely shouldn't have been deleted, and much of the review feedback hasn't been addressed. Also keep our guidelines in mind when contributing: https://github.com/agentevals-dev/agentevals/blob/main/CONTRIBUTING.md#responsible-ai-usage
```python
    BuiltinMetricDef | CodeEvaluatorDef | RemoteEvaluatorDef | OpenAIEvalDef,
    Field(discriminator="type"),
]
```
This has to be reverted.
Apologies for the sloppy rewrite @krisztianfekete — I accidentally deleted
Sorry again for the noise!
Addressed latest comments @krisztianfekete:
krisztianfekete left a comment
Can you please
- update README and docs to expose this feature and update the example eval_config.yaml in examples
- clean up the now stale/invalid PR description
- document that `threshold` is not applicable to this grader
- maybe add tests, as there has been some back-and-forth during implementation where tests would have caught most of the issues?
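For the last point, a sketch of the kind of validation test that would have caught the earlier regressions; `validate_grader_config` is a hypothetical stand-in for the PR's actual validator:

```python
# Hypothetical stand-in for the PR's config validator, with one example test.
_VALID_STRING_CHECK_OPERATIONS = {"eq", "ne", "like", "ilike"}

def validate_grader_config(grader: dict) -> None:
    if grader.get("type") != "string_check":
        return
    operation = grader.get("operation")
    if operation is None:
        raise ValueError("'operation' is required for string_check grader")
    if operation not in _VALID_STRING_CHECK_OPERATIONS:
        raise ValueError(f"Unknown operation '{operation}'")
    if "reference" not in grader:
        raise ValueError("'reference' is required for string_check grader")

def test_missing_operation_rejected():
    try:
        validate_grader_config({"type": "string_check", "reference": "ok"})
    except ValueError as exc:
        assert "operation" in str(exc)
    else:
        raise AssertionError("expected ValueError")
```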
@krisztianfekete — addressed all items from your April 20 review:
Ready for another review when you get a chance!
Adds `string_check` grader alongside the existing `text_similarity` grader. `string_check` evaluates agent responses against a fixed reference string using comparison operations (`eq`, `ne`, `like`, `ilike`). Unlike `text_similarity`, it does not require a golden eval set; the reference is specified directly in the grader config.

Changes:
- config.py: `_VALID_STRING_CHECK_OPERATIONS`, `_SUPPORTED_GRADER_TYPES`, grader-aware validator with explicit operation/reference checks
- openai_eval_backend.py: `_ACTUAL_ONLY_SCHEMA` and `_get_item_schema` for grader-aware item shape, string_check branch in `_build_testing_criteria`, `grader_type` param in `_build_jsonl_items` (excludes `expected_response` for string_check), grader-relevant detail key in results (`operation` vs `evaluation_metric`), gated `expected_invocations` requirement
- docs/custom-evaluators.md: String Check Grader section, threshold inapplicability note, grader-aware "How it works" description
- examples/custom_evaluators/eval_config.yaml: example entries for both grader types
- README.md: mentions both grader types in Custom Evaluators section
- tests/test_openai_eval_backend.py: unit tests covering config validation, schema selection, testing criteria, JSONL builder, score extraction, and full mocked-client flow for both grader types

Addresses review feedback from @krisztianfekete on PR agentevals-dev#102.
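The four comparison operations can be read as follows; this is an illustrative sketch assuming `like` means substring containment and `ilike` its case-insensitive variant, not the backend's actual implementation:

```python
# Illustrative semantics for the four string_check operations. Assumes 'like'
# means substring containment and 'ilike' its case-insensitive form.
def string_check(operation: str, actual: str, reference: str) -> bool:
    if operation == "eq":
        return actual == reference
    if operation == "ne":
        return actual != reference
    if operation == "like":
        return reference in actual
    if operation == "ilike":
        return reference.lower() in actual.lower()
    raise ValueError(f"Unknown operation '{operation}'")
```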
Force-pushed from afa88fa to 854133a
Rebased onto latest `main` (the branch had conflicts from the recent `EvalParams` refactor and trace-loader changes). Squashed all review-iteration commits into a single clean commit (`854133a`) that represents the final feature state on top of current `main`. Net changes preserved from the maintainer's upstream work:
No more merge conflicts, ready for review.
```yaml
grader:
  type: string_check
  operation: eq
  reference: "{{ item.expected_response }}"
```
This doesn't seem to exist.
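If templated references aren't supported, the example would need a literal string instead (value illustrative):

```yaml
grader:
  type: string_check
  operation: eq
  reference: "expected answer text"  # literal string, not a template
```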
Unwraps a few multi-line f-string ValueError messages that exceed the default line length but fit when collapsed. Pure formatting; no logic change. Fixes the `ruff format --check` CI step on PR agentevals-dev#102.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds `string_check` grader support for the OpenAI Evals backend, alongside the existing `text_similarity` grader.

String check evaluates agent responses against a fixed reference string using comparison operations (`eq`, `ne`, `like`, `ilike`). Unlike `text_similarity`, it does not require a golden eval set; the reference value is specified directly in the grader config.

Changes
- `_VALID_STRING_CHECK_OPERATIONS`, grader-aware validator (`_SUPPORTED_GRADER_TYPES`), `EvalRunConfig` preserved
- grader-aware item schema (`_ACTUAL_ONLY_SCHEMA`), JSONL builder, testing criteria, result details

Key design decisions
Test plan