Problem
The eval framework has 95 pytest tests, but it's unclear whether they cover the public API surface that third parties will use. The tests were written during development and may focus on internals rather than the documented interfaces.
What Needs Testing (Third-Party API Surface)
Scenario Loading
... (--scenario-dir)
... (--category) and tag (--tag)
Runner
... (--resume)
Scorecard
... (--compare) detects regressions
Corpus
CLI
gaia eval agent flags work as documented
Acceptance Criteria
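One way to make the CLI item checkable is a test that compares the flags named in this issue against the command's help output. The following is a minimal sketch, not the project's actual test code. It assumes that `gaia eval agent --help` prints its options and that all five flags belong to that subcommand; neither is confirmed here. The helper names `missing_flags` and `read_cli_help` are hypothetical.

```python
import re
import shutil
import subprocess

# Flags named in this issue. That they all belong to `gaia eval agent`
# is an assumption, not something the issue confirms.
DOCUMENTED_FLAGS = ["--scenario-dir", "--category", "--tag", "--resume", "--compare"]


def missing_flags(help_text, flags=DOCUMENTED_FLAGS):
    """Return the documented flags that do not appear in the help text.

    Flags are matched as whole tokens, so `--tags` does not count as `--tag`.
    """
    tokens = set(re.findall(r"--[A-Za-z0-9][A-Za-z0-9-]*", help_text))
    return [flag for flag in flags if flag not in tokens]


def read_cli_help(command=("gaia", "eval", "agent", "--help")):
    """Run the CLI with --help and return its output, or None if not installed.

    Assumes the command accepts --help (hypothetical for this CLI).
    """
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(list(command), capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def test_documented_flags_appear_in_help():
    import pytest  # imported here so the helpers above stay stdlib-only

    help_text = read_cli_help()
    if help_text is None:
        pytest.skip("gaia CLI is not on PATH")
    assert missing_flags(help_text) == []
```

A check like this only confirms that the flags exist. Whether each one behaves as documented still needs a behavioural test per section above.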