Summary
Implement a first-class evaluator subsystem for Orchestra that separates generation from evaluation, inspired by Anthropic's harness design research. Agents are reliably bad at self-evaluation — a dedicated evaluator running in a separate context window with a skeptical posture produces dramatically better quality control.
Key Features
- EvaluatorConfig on Assistant schema — attach an evaluator assistant with grading criteria to any generator assistant
- Structured grading criteria — named dimensions (e.g., accuracy, completeness), weights, and per-criterion pass/fail thresholds (see the schema sketch after this list)
- Generate → evaluate → critique → regenerate loop — automated feedback loop in LLMController with configurable max iterations
- Separate context window — evaluator cannot see generator's conversation history, preventing self-evaluation bias
- Default skeptical evaluator prompt — battle-tested system prompt tuned for critical assessment
- Playwright MCP integration — evaluators can interact with live UIs before scoring (optional)
- Frontend display — evaluation scores, pass/fail indicators, iteration badges on chat messages
- Configurable fail actions — retry, flag, or stop when evaluation fails
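To make the feature list concrete, here is a minimal sketch of how the new schema entities might fit together, assuming Pydantic-style models in backend/src/schemas/entities/evaluation.py. All field names, defaults, and the FailAction values are illustrative assumptions, not the final spec.

```python
# Hypothetical sketch of the evaluation schema; names and defaults are
# illustrative assumptions, not the final spec.
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel


class FailAction(str, Enum):
    """What to do when evaluation still fails after max iterations."""
    RETRY = "retry"
    FLAG = "flag"
    STOP = "stop"


class EvaluationCriterion(BaseModel):
    """One named grading dimension with a weight and a pass threshold."""
    name: str                      # e.g. "accuracy", "completeness"
    description: str = ""
    weight: float = 1.0            # relative contribution to the overall score
    pass_threshold: float = 0.7    # per-criterion score in [0, 1] required to pass


class EvaluatorConfig(BaseModel):
    """Attached to a generator assistant; points at the evaluator assistant."""
    evaluator_assistant_id: str
    criteria: List[EvaluationCriterion]
    max_iterations: int = 3            # cap on regenerate cycles
    fail_action: FailAction = FailAction.RETRY
    use_playwright_mcp: bool = False   # let the evaluator drive a live UI before scoring


class EvaluationResult(BaseModel):
    """Stored per iteration so the frontend can show scores and badges."""
    iteration: int
    scores: dict[str, float]           # criterion name -> score in [0, 1]
    passed: bool
    critique: Optional[str] = None     # feedback fed back to the generator
```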
Spec
Full specification: .claude/specs/evaluator-agent.md
Implementation plan: .claude/plans/evaluator-agent.md
Implementation Phases
- Schema & Data Layer — EvaluationCriterion, EvaluatorConfig, EvaluationResult models + Alembic migration
- Orchestration Loop — generate-evaluate-critique-regenerate in LLMController (see the loop sketch after this list)
- Default Evaluator Prompt — skeptical-posture system prompt with calibration examples
- Frontend Display — EvaluationBadge.tsx with scores, pass/fail indicators, iteration badges
- Testing & Documentation — unit/integration tests + example notebook
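The orchestration loop phase could take roughly the following shape. This is a hypothetical sketch, not the real LLMController interface: generate() and evaluate() are assumed methods, and the config fields reuse the EvaluatorConfig sketch above. Streaming, budget caps, and persistence are omitted.

```python
# Hypothetical generate -> evaluate -> critique -> regenerate loop.
# generate()/evaluate() and the config fields are assumptions, not the
# actual LLMController API.
def run_with_evaluation(generator, evaluator, prompt, config):
    draft, result, critique = None, None, None
    for _ in range(config.max_iterations):
        # The generator sees its own history plus the previous critique, if any.
        draft = generator.generate(prompt, feedback=critique)

        # The evaluator runs in a fresh context window: it sees only the task,
        # the grading criteria, and the draft -- never the generator's history.
        result = evaluator.evaluate(task=prompt, output=draft, criteria=config.criteria)
        if result.passed:
            return draft, result

        critique = result.critique  # fed back into the next generation

    # Iterations exhausted: apply the configured fail action.
    if config.fail_action == FailAction.STOP:
        raise RuntimeError("Evaluation failed after max iterations")
    return draft, result  # FLAG: surface the failing result alongside the draft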
File Changes
- backend/src/schemas/entities/evaluation.py — new
- backend/src/schemas/entities/llm.py — add evaluator field
- backend/src/controllers/llm.py — orchestration loop
- backend/src/static/prompts/md/evaluator.md — new
- backend/migrations/versions/xxx_add_evaluator.py — new
- frontend/src/components/chat/EvaluationBadge.tsx — new
- examples/agents/evaluator_example.ipynb — new (see the usage sketch after this list)
- Tests in backend/tests/unit/ and backend/tests/integration/
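The example notebook could walk through something like the following. The Assistant constructor and run() call here are purely hypothetical stand-ins for Orchestra's actual API; EvaluatorConfig, EvaluationCriterion, and FailAction are the schema sketch from above.

```python
# Purely hypothetical usage; Assistant and run() are stand-ins, not
# Orchestra's actual API.
writer = Assistant(
    name="report-writer",
    evaluator=EvaluatorConfig(
        evaluator_assistant_id="report-grader",
        criteria=[
            EvaluationCriterion(name="accuracy", weight=2.0, pass_threshold=0.8),
            EvaluationCriterion(name="completeness", weight=1.0, pass_threshold=0.7),
        ],
        max_iterations=3,
        fail_action=FailAction.FLAG,
    ),
)

response = writer.run("Summarize last quarter's incident reports.")
print(response.evaluation.scores, response.evaluation.passed)
```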
Risks
- Evaluator leniency: LLMs are naturally generous. Mitigation: strong skepticism prompting + few-shot calibration.
- Cost multiplication: every extra iteration adds a full generation pass plus an evaluation pass to token spend. Mitigation: configurable max iterations (default 3), budget cap.
- Model mismatch: an evaluator model weaker than the generator yields unreliable scores. Mitigation: document a same-tier model recommendation.