Replace Existing Eval Framework
The existing src/gaia/eval/ (~9,200 lines) is replaced entirely by the new agent eval framework. No backwards compatibility.

Files to Remove
- src/gaia/eval/eval.py (3,336 lines): old Evaluator class
- src/gaia/eval/groundtruth.py (~1,000 lines): old ground truth generator
- src/gaia/eval/batch_experiment.py (2,367 lines): old batch runner
- src/gaia/eval/transcript_generator.py: not needed
- src/gaia/eval/email_generator.py: not needed
- src/gaia/eval/fix_code_testbench/: replaced by eval scenarios

Files to Keep (absorbed into new framework)
- src/gaia/eval/claude.py: ClaudeClient (Anthropic SDK wrapper)
- src/gaia/eval/config.py: MODEL_PRICING + DEFAULT_CLAUDE_MODEL
- src/gaia/eval/pdf_document_generator.py → rename to pdf_generator.py

CLI Changes
- Remove: gaia eval, gaia groundtruth, gaia report, gaia create-template, gaia visualize
- Add: gaia eval agent with flags --fix, --audit-only, --generate-corpus, --compare, --resume

New Files
- src/gaia/eval/runner.py: AgentEvalRunner
- src/gaia/eval/audit.py: Architecture audit
- src/gaia/eval/scorecard.py: Scorecard generation + comparison
- src/gaia/eval/webapp/: Rewritten eval webapp

Reference
- Plan: docs/plans/agent-ui-eval-benchmark.md §1.3 (Disposition table)
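
To make the CLI change concrete, here is a minimal argparse sketch of how a gaia eval agent subcommand with the listed flags might be wired. Only the command name and the flag names come from the plan. The build_parser helper, the boolean treatment of --fix, --audit-only and --generate-corpus, the two-path form of --compare, and the run-ID argument of --resume are all assumptions for illustration, not the project's actual CLI code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for `gaia eval agent` (not the real gaia CLI)."""
    parser = argparse.ArgumentParser(prog="gaia")
    commands = parser.add_subparsers(dest="command", required=True)

    # `eval` is kept only as a namespace; `agent` is its single subcommand here.
    eval_parser = commands.add_parser("eval", help="Evaluation commands")
    eval_commands = eval_parser.add_subparsers(dest="eval_command", required=True)
    agent = eval_commands.add_parser("agent", help="Run the agent eval framework")

    # Flag names are from the plan; their types and meanings are assumed.
    agent.add_argument("--fix", action="store_true")
    agent.add_argument("--audit-only", action="store_true")
    agent.add_argument("--generate-corpus", action="store_true")
    agent.add_argument("--compare", nargs=2, metavar=("BASELINE", "CANDIDATE"))
    agent.add_argument("--resume", metavar="RUN_ID")
    return parser


# Parse a fixed argument list so the sketch runs without touching sys.argv.
args = build_parser().parse_args(["eval", "agent", "--audit-only"])
print(args.command, args.eval_command, args.audit_only)
```

In this sketch the removed commands (gaia groundtruth, gaia report, gaia create-template, gaia visualize) are simply not registered, so passing one of them makes argparse exit with a usage error, which is one way to express the "no backwards compatibility" decision.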