Summary
I’d like to contribute a focused improvement to the evaluation module by implementing the deterministic code-based scorers that are currently placeholders:
exact_string_match
numeric_match
This would improve reproducibility for scenarios where deterministic checks are more appropriate than LLM-only judging.
Current State
At the moment:
src/evaluation/scorers/code_based.py contains NotImplementedError for both functions.
docs/evaluation.md marks these scorers as skeletons/placeholders.
src/evaluation/tests/test_scorers.py currently verifies the not-implemented behavior.
Why this is useful
- Reduces reliance on
llm_judge for objective scenario types.
- Improves repeatability and debugging of eval results.
- Enables clearer per-scenario scorer routing via
scoring_method.
Proposed MVP Scope
- Implement
exact_string_match with clear normalization rules (minimal and explicit).
- Implement
numeric_match with robust numeric parsing and tolerance support.
- Register both scorers in the scorer registry.
- Update tests from placeholder checks to behavior checks.
- Update
docs/evaluation.md to reflect availability and usage.
Out of Scope (for this first PR)
- No multi-judge panel logic.
- No changes to
llm_judge rubric behavior.
- No large evaluator architecture refactors.
Acceptance Criteria
exact_string_match and numeric_match execute without NotImplementedError.
- Scenarios using
scoring_method route correctly to deterministic scorers.
uv run pytest src/evaluation/tests passes.
- Documentation reflects the implemented scorer status and expected input fields.
Notes
I can keep this first PR intentionally small and test-focused for quick review.
If maintainers agree with this scope, I can open a draft PR proposal right away.
Summary
I’d like to contribute a focused improvement to the evaluation module by implementing the deterministic code-based scorers that are currently placeholders:
exact_string_matchnumeric_matchThis would improve reproducibility for scenarios where deterministic checks are more appropriate than LLM-only judging.
Current State
At the moment:
src/evaluation/scorers/code_based.pycontainsNotImplementedErrorfor both functions.docs/evaluation.mdmarks these scorers as skeletons/placeholders.src/evaluation/tests/test_scorers.pycurrently verifies the not-implemented behavior.Why this is useful
llm_judgefor objective scenario types.scoring_method.Proposed MVP Scope
exact_string_matchwith clear normalization rules (minimal and explicit).numeric_matchwith robust numeric parsing and tolerance support.docs/evaluation.mdto reflect availability and usage.Out of Scope (for this first PR)
llm_judgerubric behavior.Acceptance Criteria
exact_string_matchandnumeric_matchexecute withoutNotImplementedError.scoring_methodroute correctly to deterministic scorers.uv run pytest src/evaluation/testspasses.Notes
I can keep this first PR intentionally small and test-focused for quick review.
If maintainers agree with this scope, I can open a draft PR proposal right away.