Agent Eval: Replace existing eval framework #573

@kovtcharov

Replace Existing Eval Framework

The existing src/gaia/eval/ package (~9,200 lines) is replaced entirely by the new agent eval framework. Backwards compatibility with the old framework is not preserved.

Files to Remove

  • src/gaia/eval/eval.py (3,336 lines) — old Evaluator class
  • src/gaia/eval/groundtruth.py (~1,000 lines) — old ground truth generator
  • src/gaia/eval/batch_experiment.py (2,367 lines) — old batch runner
  • src/gaia/eval/transcript_generator.py — not needed
  • src/gaia/eval/email_generator.py — not needed
  • src/gaia/eval/fix_code_testbench/ — replaced by eval scenarios
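One way to confirm the removal is complete is a small check that reports which of the listed paths still exist. This is a minimal sketch, not part of the plan: the helper name `find_remaining` is hypothetical, and the paths are copied from the list above.

```python
from pathlib import Path

# Paths copied from the "Files to Remove" list above.
LEGACY_PATHS = [
    "src/gaia/eval/eval.py",
    "src/gaia/eval/groundtruth.py",
    "src/gaia/eval/batch_experiment.py",
    "src/gaia/eval/transcript_generator.py",
    "src/gaia/eval/email_generator.py",
    "src/gaia/eval/fix_code_testbench",
]


def find_remaining(repo_root: Path, rel_paths: list[str]) -> list[str]:
    """Return the entries of rel_paths that still exist under repo_root.

    Hypothetical helper for illustration; works for files and directories.
    """
    return [p for p in rel_paths if (repo_root / p).exists()]
```

Calling `find_remaining(Path.cwd(), LEGACY_PATHS)` from the repository root should return an empty list once the cleanup has landed.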

Files to Keep (absorbed into new framework)

  • src/gaia/eval/claude.py — ClaudeClient (Anthropic SDK wrapper)
  • src/gaia/eval/config.py — MODEL_PRICING + DEFAULT_CLAUDE_MODEL
  • src/gaia/eval/pdf_document_generator.py → rename to pdf_generator.py

CLI Changes

  • Remove: gaia eval, gaia groundtruth, gaia report, gaia create-template, gaia visualize
  • Add: gaia eval agent with flags --fix, --audit-only, --generate-corpus, --compare, --resume
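The new command surface could be wired up roughly as follows. This is a sketch using the standard library's `argparse`, not the actual GAIA CLI code. The flag names come from the list above, but their argument shapes are assumptions: `--fix`, `--audit-only`, and `--generate-corpus` are treated as boolean switches, `--compare` as taking two scorecard paths, and `--resume` as taking a run identifier.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only `gaia eval agent` (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="gaia")
    commands = parser.add_subparsers(dest="command", required=True)

    eval_parser = commands.add_parser("eval", help="Evaluation commands")
    eval_commands = eval_parser.add_subparsers(dest="eval_command", required=True)

    agent = eval_commands.add_parser("agent", help="Run the agent eval framework")
    # Assumption: these three flags are plain on/off switches.
    agent.add_argument("--fix", action="store_true")
    agent.add_argument("--audit-only", action="store_true")
    agent.add_argument("--generate-corpus", action="store_true")
    # Assumption: --compare takes two scorecard paths.
    agent.add_argument("--compare", nargs=2, metavar=("BASELINE", "CANDIDATE"))
    # Assumption: --resume takes an identifier for a previous run.
    agent.add_argument("--resume", metavar="RUN_ID")
    return parser
```

Because the removed commands (`gaia groundtruth`, `gaia report`, `gaia create-template`, `gaia visualize`) are simply not registered, a parser built this way rejects them with a usage error rather than silently ignoring them.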

New Files

  • src/gaia/eval/runner.py — AgentEvalRunner
  • src/gaia/eval/audit.py — Architecture audit
  • src/gaia/eval/scorecard.py — Scorecard generation + comparison
  • src/gaia/eval/webapp/ — Rewritten eval webapp
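To illustrate what scorecard comparison might involve, here is a minimal sketch. The scorecard format is not specified in this issue, so it assumes a scorecard is a flat mapping from scenario name to a numeric score. The function name `compare_scorecards` is hypothetical and does not describe the contents of scorecard.py.

```python
def compare_scorecards(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Return candidate minus baseline for each scenario present in both.

    Assumption: a scorecard maps scenario name -> numeric score.
    Scenarios that appear in only one scorecard are skipped.
    """
    shared = baseline.keys() & candidate.keys()
    return {name: candidate[name] - baseline[name] for name in sorted(shared)}
```

A positive delta means the candidate run scored higher on that scenario. A real implementation would probably also report scenarios that were added or dropped between runs.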

Reference

  • Plan: docs/plans/agent-ui-eval-benchmark.md §1.3 (Disposition table)

Metadata

Labels

  • domain:quality (Tests, CI/CD, security, performance, evals)
  • eval (Evaluation framework changes)
  • track:platform (Foundation that both consumer-app and oem-pc tracks consume)
