feat: Evaluator Agent — GAN-inspired generator/evaluator feedback loop #899

@ryaneggz

Description

Summary

Implement a first-class evaluator subsystem for Orchestra that separates generation from evaluation, inspired by Anthropic's harness design research. Agents are reliably bad at self-evaluation — a dedicated evaluator running in a separate context window with a skeptical posture produces dramatically better quality control.

Key Features

  • EvaluatorConfig on Assistant schema — attach an evaluator assistant with grading criteria to any generator assistant
  • Structured grading criteria — named dimensions (e.g., accuracy, completeness), weights, and per-criterion pass/fail thresholds (see the schema sketch after this list)
  • Generate → evaluate → critique → regenerate loop — automated feedback loop in LLMController with configurable max iterations
  • Separate context window — evaluator cannot see generator's conversation history, preventing self-evaluation bias
  • Default skeptical evaluator prompt — battle-tested system prompt tuned for critical assessment
  • Playwright MCP integration — evaluators can interact with live UIs before scoring (optional)
  • Frontend display — evaluation scores, pass/fail indicators, iteration badges on chat messages
  • Configurable fail actions — retry, flag, or stop when evaluation fails
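
For concreteness, here is a minimal sketch of what the evaluation schemas could look like, assuming Pydantic models as elsewhere in the backend. The field names, defaults, and the `overall_score`/`passes` helpers are illustrative placeholders, not the actual definitions in backend/src/schemas/entities/evaluation.py.

```python
from typing import Literal

from pydantic import BaseModel, Field


class EvaluationCriterion(BaseModel):
    """One named grading dimension, e.g. accuracy or completeness."""
    name: str
    description: str = ""
    weight: float = Field(1.0, gt=0)           # relative weight in the overall score
    threshold: float = Field(0.7, ge=0, le=1)  # per-criterion pass/fail cutoff


class EvaluatorConfig(BaseModel):
    """Attached to a generator assistant to switch on the evaluation loop."""
    evaluator_assistant_id: str
    criteria: list[EvaluationCriterion]
    max_iterations: int = 3
    fail_action: Literal["retry", "flag", "stop"] = "retry"


class EvaluationResult(BaseModel):
    """What the evaluator returns for one generation."""
    scores: dict[str, float]   # criterion name -> score in [0, 1]
    critique: str              # free-text feedback fed back to the generator
    passed: bool


def overall_score(config: EvaluatorConfig, scores: dict[str, float]) -> float:
    """Weighted average across all criteria."""
    total = sum(c.weight for c in config.criteria)
    return sum(c.weight * scores.get(c.name, 0.0) for c in config.criteria) / total


def passes(config: EvaluatorConfig, scores: dict[str, float]) -> bool:
    """Pass only if every criterion clears its own threshold."""
    return all(scores.get(c.name, 0.0) >= c.threshold for c in config.criteria)
```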

Spec

Full specification: .claude/specs/evaluator-agent.md

Implementation plan: .claude/plans/evaluator-agent.md

Implementation Phases

  1. Schema & Data Layer — EvaluationCriterion, EvaluatorConfig, EvaluationResult models + Alembic migration
  2. Orchestration Loop — generate-evaluate-critique-regenerate in LLMController (sketched after this list)
  3. Default Evaluator Prompt — skeptical-posture system prompt with calibration examples
  4. Frontend Display — EvaluationBadge.tsx with scores, pass/fail indicators, iteration badges
  5. Testing & Documentation — unit/integration tests + example notebook
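
As a rough illustration of phase 2, a minimal sketch of the loop is below. `generate` and `evaluate` are placeholder callables standing in for the real LLMController methods, and `config` follows the hypothetical EvaluatorConfig shape sketched under Key Features; the control flow mirrors the feature list above rather than the actual implementation.

```python
def run_with_evaluation(task, config, generate, evaluate):
    """Generate -> evaluate -> critique -> regenerate, capped at max_iterations."""
    critique = ""
    output, result = None, None
    for _ in range(config.max_iterations):
        # Generator sees the task plus the previous critique (if any).
        output = generate(task, critique=critique)
        # Evaluator runs in a fresh context: only the task, the candidate output,
        # and the grading criteria -- never the generator's conversation history.
        result = evaluate(task, output, config.criteria)
        if result.passed:
            return output, result
        critique = result.critique
    # Never passed: apply the configured fail action.
    if config.fail_action == "stop":
        raise RuntimeError("evaluation failed after max iterations")
    # "flag" and "retry" both return the last attempt; "flag" would additionally
    # mark the message so the frontend badge can surface the failure.
    return output, result
```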

File Changes

  • backend/src/schemas/entities/evaluation.py — new
  • backend/src/schemas/entities/llm.py — add evaluator field
  • backend/src/controllers/llm.py — orchestration loop
  • backend/src/static/prompts/md/evaluator.md — new
  • backend/migrations/versions/xxx_add_evaluator.py — new
  • frontend/src/components/chat/EvaluationBadge.tsx — new
  • examples/agents/evaluator_example.ipynb — new
  • Tests in backend/tests/unit/ and backend/tests/integration/

Risks

  • Evaluator leniency: LLMs are naturally generous. Mitigation: strong skepticism prompting + few-shot calibration.
  • Cost multiplication: every iteration adds a full generator pass plus an evaluator pass, so worst-case token spend scales roughly linearly with the iteration cap (see the estimate after this list). Mitigation: configurable max iterations (default 3), budget cap.
  • Model mismatch: an evaluator on a weaker model than the generator yields unreliable scores. Mitigation: document a same-tier model recommendation.
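
A quick back-of-envelope check of the cost risk, assuming every iteration fails and triggers a retry (the token counts below are made up for illustration):

```python
# Hypothetical worst case: each of the max_iterations rounds costs one
# generator pass plus one evaluator pass.
def worst_case_tokens(generator_tokens: int, evaluator_tokens: int, max_iterations: int = 3) -> int:
    return max_iterations * (generator_tokens + evaluator_tokens)

# e.g. a 2,000-token generation graded with an 800-token evaluation,
# capped at the default 3 iterations:
print(worst_case_tokens(2_000, 800))  # 8400
```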

Labels

enhancement (New feature or request), harness-design (Harness design feature set, v0.9-v0.10), phase-2 (Phase 2: Quality & resilience)