feat: add StringCheckGrader support for OpenAI Evals backend#102

Open
wiliyam wants to merge 2 commits into agentevals-dev:main from wiliyam:feat/string-check-grader-95

Conversation

@wiliyam

@wiliyam wiliyam commented Apr 1, 2026

Summary

Adds string_check grader support for the OpenAI Evals backend, alongside the existing text_similarity grader.

String check evaluates agent responses against a fixed reference string using comparison operations (eq, ne, like, ilike). Unlike text_similarity, it does not require a golden eval set — the reference value is specified directly in the grader config.
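For context, the four operations behave roughly as follows. This is a hedged sketch, not the backend's implementation: the actual comparison happens server-side in the OpenAI Evals API, and the "like"/"ilike" semantics shown here (case-sensitive and case-insensitive substring containment) are an assumption.

```python
def string_check(actual: str, reference: str, operation: str) -> bool:
    """Assumed semantics of the string_check comparison operations."""
    if operation == "eq":
        return actual == reference
    if operation == "ne":
        return actual != reference
    if operation == "like":
        # assumed: case-sensitive substring containment
        return reference in actual
    if operation == "ilike":
        # assumed: case-insensitive substring containment
        return reference.lower() in actual.lower()
    raise ValueError(f"unknown operation: {operation!r}")
```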

Changes

  • config.py — _VALID_STRING_CHECK_OPERATIONS, grader-aware validator (_SUPPORTED_GRADER_TYPES), EvalRunConfig preserved
  • openai_eval_backend.py — grader-aware item schema (_ACTUAL_ONLY_SCHEMA), JSONL builder, testing criteria, result details
  • docs/custom-evaluators.md — String Check Grader section, threshold inapplicability note, grader-aware How it works description
  • examples/custom_evaluators/eval_config.yaml — example entries for both grader types
  • README.md — mentions both grader types in Custom Evaluators section
  • tests/test_openai_eval_backend.py — unit tests covering validation, schema, criteria, JSONL items, and full flow

Key design decisions

  • string_check items contain only actual_response (no expected_response) — matching OpenAI API expectation
  • expected_invocations gated on grader type: only required for text_similarity
  • threshold not applicable to string_check (always 0 or 1)
  • Result details include grader-relevant key (operation for string_check, evaluation_metric for text_similarity)

Test plan

  • Config validation: rejects missing/invalid operation, missing reference, unsupported types
  • Schema selection: _ACTUAL_ONLY_SCHEMA for string_check, _TEXT_PAIR_SCHEMA for text_similarity
  • JSONL builder: excludes expected_response for string_check
  • Full flow (mocked client): string_check succeeds without expected_invocations
  • Full flow (mocked client): text_similarity requires expected_invocations

Contributor

@krisztianfekete krisztianfekete left a comment


Thank you, added some review comments!

Comment thread src/agentevals/openai_eval_backend.py Outdated
Contributor


This conditional rejects configs for all grader types that are missing expected_invocations, but string_check uses a static reference from config and doesn't need them.

Can you gate this on grader_type?

```python
        "actual_response": {"type": "string"},
        "expected_response": {"type": "string"},
    },
    "required": ["actual_response", "expected_response"],
```
Contributor


expected_response is no longer required as string_check does not use it. Maybe we should make the schema grader-aware.

Contributor


The JSONL items contain a field not declared in the schema. Please make this builder grader-aware too.

Comment thread src/agentevals/openai_eval_backend.py Outdated
Contributor


This will return None for string_check graders. Please make this conditional, or include grader-relevant keys, e.g. operation instead.

Comment thread src/agentevals/config.py Outdated
```python
    raise ValueError("'operation' is required for string_check grader")
if operation not in _VALID_STRING_CHECK_OPERATIONS:
    raise ValueError(f"Unknown operation '{operation}'. Valid: {sorted(_VALID_STRING_CHECK_OPERATIONS)}")
if "reference" not in v:
```
Contributor


Can we do what we do for the other branch here as well, with `if not metric`?

Contributor


Still relevant.

Comment thread src/agentevals/config.py Outdated
```python
if "reference" not in v:
    raise ValueError("'reference' is required for string_check grader")
else:
    supported = "'text_similarity', 'string_check'"
```
Contributor


Can we use something like _SUPPORTED_GRADER_TYPES constant for all supported graders?

@wiliyam
Author

wiliyam commented Apr 2, 2026

Thanks for the detailed review @krisztianfekete! Addressed all 5 points:

  1. Grader type check — moved the grader_type not in _SUPPORTED_GRADER_TYPES check to the top, so unsupported types are rejected immediately regardless of other conditions
  2. Grader-aware schema — added _ACTUAL_ONLY_SCHEMA for graders that don't need expected_response (like string_check), and _get_item_schema(grader_type) helper to select the right schema
  3. expected_invocations gating — now only required for non-string_check graders since string_check uses a static reference from config
  4. operation in error context — the string_check testing criteria now correctly uses operation from config
  5. _SUPPORTED_GRADER_TYPES constant — added, used in both the validator and the unsupported-type error message

Contributor

@krisztianfekete krisztianfekete left a comment


Can you please take a closer look: EvalRunConfig most definitely shouldn't have been deleted, and much of the review feedback hasn't been addressed. Also keep our guidelines in mind when contributing: https://github.com/agentevals-dev/agentevals/blob/main/CONTRIBUTING.md#responsible-ai-usage

Comment thread src/agentevals/config.py
```python
    BuiltinMetricDef | CodeEvaluatorDef | RemoteEvaluatorDef | OpenAIEvalDef,
    Field(discriminator="type"),
]
```

Contributor


This has to be reverted.

@wiliyam
Author

wiliyam commented Apr 2, 2026

Apologies for the sloppy rewrite @krisztianfekete — I accidentally deleted EvalRunConfig when rewriting config.py. Fixed in this push:

  1. EvalRunConfig restored — exactly as it was in upstream
  2. Validator order reverted — type-specific checks first, unsupported type raises at the bottom (original pattern)
  3. if not metric style — matches other branch
  4. Grader-relevant keys in details — operation for string_check, evaluation_metric for text_similarity instead of always returning None
  5. _SUPPORTED_GRADER_TYPES constant — kept, used in the final else raise
  6. Grader-aware schema — _ACTUAL_ONLY_SCHEMA for string_check, _TEXT_PAIR_SCHEMA for text_similarity
  7. expected_invocations gating — only required for non-string_check graders

Sorry again for the noise!
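A hedged sketch of the validator shape described in points 2, 3, and 5 of the list above. The constant and error messages follow the diff excerpts in this thread, but the function body is assumed; the real checks live in config.py's grader validator.

```python
# Sketch only: assumed shape of the grader validator in config.py.
_SUPPORTED_GRADER_TYPES = {"text_similarity", "string_check"}
_VALID_STRING_CHECK_OPERATIONS = {"eq", "ne", "like", "ilike"}

def validate_grader(v: dict) -> dict:
    grader_type = v.get("type")
    if grader_type == "text_similarity":
        # the "if not metric" pattern: rejects both missing and empty values
        if not v.get("evaluation_metric"):
            raise ValueError("'evaluation_metric' is required for text_similarity grader")
    elif grader_type == "string_check":
        operation = v.get("operation")
        if not operation:
            raise ValueError("'operation' is required for string_check grader")
        if operation not in _VALID_STRING_CHECK_OPERATIONS:
            raise ValueError(
                f"Unknown operation '{operation}'. Valid: {sorted(_VALID_STRING_CHECK_OPERATIONS)}"
            )
        if not v.get("reference"):
            raise ValueError("'reference' is required for string_check grader")
    else:
        # unsupported types raise at the bottom, per the original pattern
        supported = ", ".join(sorted(_SUPPORTED_GRADER_TYPES))
        raise ValueError(f"Unsupported grader type '{grader_type}'. Supported: {supported}")
    return v
```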

@wiliyam
Author

wiliyam commented Apr 3, 2026

Addressed latest comments @krisztianfekete:

  1. JSONL builder grader-aware — _build_jsonl_items now accepts grader_type and only includes expected_response for non-string_check graders — matching the item schema exactly
  2. if not v.get("reference") — changed from if "reference" not in v to match the if not metric pattern used in the text_similarity branch

Comment thread src/agentevals/config.py
Contributor

@krisztianfekete krisztianfekete left a comment


Can you please

  • update README and docs to expose this feature and update the example eval_config.yaml in examples
  • clean up the now stale/invalid PR description
  • document that threshold is not applicable to this grader
  • maybe add tests, as there has been some back-and-forth during implementation where tests would have caught most of these issues?

@wiliyam
Author

wiliyam commented Apr 21, 2026

@krisztianfekete — addressed all items from your April 20 review:

  • README + docs — updated in ec33312: added String Check Grader section to docs/custom-evaluators.md, documented that threshold is not applicable, updated "How it works"
  • Example eval_config.yaml — added both text_similarity and string_check examples
  • PR description — cleaned up and rewritten to reflect the final implementation
  • Tests — added tests/test_openai_eval_backend.py in afa88fa with unit tests for config validation, schema selection, JSONL building, score extraction, and full mocked-client flow for both grader types

Ready for another review when you get a chance!

Adds string_check grader alongside the existing text_similarity grader.
string_check evaluates agent responses against a fixed reference string
using comparison operations (eq, ne, like, ilike). Unlike text_similarity,
it does not require a golden eval set — the reference is specified
directly in the grader config.

Changes:
- config.py: _VALID_STRING_CHECK_OPERATIONS, _SUPPORTED_GRADER_TYPES,
  grader-aware validator with explicit operation/reference checks
- openai_eval_backend.py: _ACTUAL_ONLY_SCHEMA and _get_item_schema for
  grader-aware item shape, string_check branch in _build_testing_criteria,
  grader_type param in _build_jsonl_items (excludes expected_response for
  string_check), grader-relevant detail key in results (operation vs
  evaluation_metric), gated expected_invocations requirement
- docs/custom-evaluators.md: String Check Grader section, threshold
  inapplicability note, grader-aware How it works description
- examples/custom_evaluators/eval_config.yaml: example entries for both
  grader types
- README.md: mentions both grader types in Custom Evaluators section
- tests/test_openai_eval_backend.py: unit tests covering config validation,
  schema selection, testing criteria, JSONL builder, score extraction,
  and full mocked-client flow for both grader types

Addresses review feedback from @krisztianfekete on PR agentevals-dev#102.
@wiliyam wiliyam force-pushed the feat/string-check-grader-95 branch from afa88fa to 854133a on April 21, 2026 at 07:41
@wiliyam
Author

wiliyam commented Apr 21, 2026

Rebased onto latest `main` (the branch had conflicts from the recent `EvalParams` refactor and trace-loader changes). Squashed all review-iteration commits into a single clean commit (`854133a`) that represents the final feature state on top of current `main`.

Net changes preserved from the maintainer's upstream work:

  • Kept new `EvalParams` base class + `EvalRunConfig` inheritance structure
  • Kept trace-loader extensions
  • Only added the string_check branch to the grader validator and `_SUPPORTED_GRADER_TYPES` constant

No more merge conflicts, ready for review.

```yaml
grader:
  type: string_check
  operation: eq
  reference: "{{ item.expected_response }}"
```
Contributor


This doesn't seem to exist.

Unwraps a few multi-line f-string ValueError messages that exceed the
default line length but fit when collapsed. Pure formatting — no logic
change. Fixes the `ruff format --check` CI step on PR agentevals-dev#102.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
