Agent Eval: Clean up legacy eval framework (9,200 lines of dead code) #672

@kovtcharov

Problem

The legacy eval framework (eval.py at 159.7 KB, groundtruth.py, batch_experiment.py, transcript_generator.py, email_generator.py) still exists alongside the new agent eval framework. This is 9,200+ lines of dead code that:

  • Confuses new contributors ("which eval framework do I use?")
  • Inflates the codebase
  • Has stale CLI commands (gaia eval, gaia groundtruth, gaia report) that conflict with the new gaia eval agent

What to Remove

Per #573 (the replacement plan):

Files to Remove

  • src/gaia/eval/eval.py (3,336 lines) — old Evaluator class
  • src/gaia/eval/groundtruth.py (~1,000 lines) — old ground truth generator
  • src/gaia/eval/batch_experiment.py (2,367 lines) — old batch runner
  • src/gaia/eval/transcript_generator.py — not needed
  • src/gaia/eval/email_generator.py — not needed
  • src/gaia/eval/fix_code_testbench/ — replaced by eval scenarios
  • src/gaia/eval/configs/ — old config format
  • src/gaia/eval/webapp/ — old Express.js visualizer (if superseded)
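
To track progress on the removals listed above, a small check like the following could report which legacy paths are still present. This is a minimal sketch, not part of the repository: `LEGACY_PATHS` and `remaining_legacy_paths` are hypothetical names, and the paths are copied from the list above (`webapp/` is included even though its removal is conditional).

```python
from pathlib import Path

# Paths copied from the "Files to Remove" list in this issue.
# Assumption: they are relative to the repository root.
LEGACY_PATHS = [
    "src/gaia/eval/eval.py",
    "src/gaia/eval/groundtruth.py",
    "src/gaia/eval/batch_experiment.py",
    "src/gaia/eval/transcript_generator.py",
    "src/gaia/eval/email_generator.py",
    "src/gaia/eval/fix_code_testbench",
    "src/gaia/eval/configs",
    "src/gaia/eval/webapp",  # listed as "if superseded"
]


def remaining_legacy_paths(repo_root):
    """Return the legacy paths (files or directories) that still exist."""
    root = Path(repo_root)
    return [p for p in LEGACY_PATHS if (root / p).exists()]
```

Calling `remaining_legacy_paths(".")` from the repository root should return an empty list once the cleanup is complete (or only the `webapp` entry if the visualizer is kept).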

Files to Keep

  • src/gaia/eval/runner.py — new AgentEvalRunner ✅
  • src/gaia/eval/scorecard.py — new scorecard ✅
  • src/gaia/eval/audit.py — new architecture audit ✅
  • src/gaia/eval/claude.py — ClaudeClient (Anthropic SDK wrapper) ✅
  • src/gaia/eval/config.py — MODEL_PRICING + DEFAULT_CLAUDE_MODEL ✅
  • src/gaia/eval/pdf_document_generator.py → rename to pdf_generator.py

CLI Changes

  • Remove: gaia eval (old), gaia groundtruth, gaia report, gaia create-template, gaia visualize
  • Keep: gaia eval agent (new framework)
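
The intended CLI surface after the cleanup can be illustrated with a standalone `argparse` sketch. This is not GAIA's actual CLI code, and whether GAIA uses `argparse` at all is an assumption; `build_parser` is a hypothetical helper that only shows `eval` accepting `agent` as its single subcommand, with the removed commands no longer registered.

```python
import argparse


def build_parser():
    """Toy parser: `gaia eval agent` is the only eval entry point."""
    parser = argparse.ArgumentParser(prog="gaia")
    commands = parser.add_subparsers(dest="command", required=True)

    # `eval` stays, but only as a container for the `agent` subcommand.
    eval_parser = commands.add_parser("eval", help="evaluation commands")
    eval_commands = eval_parser.add_subparsers(dest="eval_command", required=True)
    eval_commands.add_parser("agent", help="run the agent eval framework")

    # groundtruth, report, create-template, and visualize are deliberately
    # not registered, so argparse rejects them as invalid choices.
    return parser
```

With this shape, `build_parser().parse_args(["eval", "agent"])` succeeds, while `["groundtruth"]` or a bare `["eval"]` exits with a usage error.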

Acceptance Criteria

  • Legacy eval files removed
  • Old CLI commands removed
  • gaia eval agent remains the single entry point
  • No broken imports after cleanup
  • Tests updated to reflect removal
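
The "no broken imports" criterion could be checked statically before the files are deleted. The sketch below uses the standard `ast` module to flag absolute imports of the legacy modules; `LEGACY_MODULES` and `find_legacy_imports` are hypothetical names, and the dotted module paths assume the files are imported as `gaia.eval.<name>`. Relative imports (for example `from . import eval`) are not covered.

```python
import ast

# Assumption: the files slated for removal are imported under these names.
LEGACY_MODULES = {
    "gaia.eval.eval",
    "gaia.eval.groundtruth",
    "gaia.eval.batch_experiment",
    "gaia.eval.transcript_generator",
    "gaia.eval.email_generator",
}


def find_legacy_imports(source):
    """Return the sorted legacy module names imported by the source text."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Covers "from gaia.eval.eval import X" and "from gaia.eval import eval".
            names = [node.module]
            names += [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        found.update(name for name in names if name in LEGACY_MODULES)
    return sorted(found)
```

Running this over every `.py` file under `src/` and `tests/` (for example via `Path.rglob("*.py")` and `read_text()`) gives a list of call sites to update before the legacy modules are removed.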

Labels

  • domain:quality (Tests, CI/CD, security, performance, evals)
  • eval (Evaluation framework changes)
  • tech debt
  • track:platform (Foundation that both consumer-app and oem-pc tracks consume)
