Summary
Implement a first-class evaluator subsystem for Orchestra that separates generation from evaluation, inspired by Anthropic's harness design research. Agents are reliably bad at self-evaluation — a dedicated evaluator running in a separate context window with a skeptical posture produces dramatically better quality control.
Key Features
- EvaluatorConfig on Assistant schema — attach an evaluator assistant with grading criteria to any generator assistant
- Structured grading criteria — named dimensions (e.g., accuracy, completeness), weights, and per-criterion pass/fail thresholds (see the schema sketch after this list)
- Generate → evaluate → critique → regenerate loop — automated feedback loop in LLMController with configurable max iterations
- Separate context window — evaluator cannot see generator's conversation history, preventing self-evaluation bias
- Default skeptical evaluator prompt — battle-tested system prompt tuned for critical assessment
- Playwright MCP integration — evaluators can interact with live UIs before scoring (optional)
- Frontend display — evaluation scores, pass/fail indicators, iteration badges on chat messages
- Configurable fail actions — retry, flag, or stop when evaluation fails
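To make the feature list concrete, here is a minimal sketch of how the new schema entities might fit together, assuming Pydantic-style models in backend/src/schemas/entities/evaluation.py. All field names, defaults, and the FailAction values are illustrative assumptions, not the final spec.

```python
# Hypothetical sketch of the evaluation schema; names and defaults are
# illustrative assumptions, not the final spec.
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel


class FailAction(str, Enum):
    """What to do when evaluation still fails after max iterations."""
    RETRY = "retry"
    FLAG = "flag"
    STOP = "stop"


class EvaluationCriterion(BaseModel):
    """One named grading dimension with a weight and a pass threshold."""
    name: str                      # e.g. "accuracy", "completeness"
    description: str = ""
    weight: float = 1.0            # relative contribution to the overall score
    pass_threshold: float = 0.7    # per-criterion score in [0, 1] required to pass


class EvaluatorConfig(BaseModel):
    """Attached to a generator assistant; points at the evaluator assistant."""
    evaluator_assistant_id: str
    criteria: List[EvaluationCriterion]
    max_iterations: int = 3            # cap on regenerate cycles
    fail_action: FailAction = FailAction.RETRY
    use_playwright_mcp: bool = False   # let the evaluator drive a live UI before scoring


class EvaluationResult(BaseModel):
    """Stored per iteration so the frontend can show scores and badges."""
    iteration: int
    scores: dict[str, float]           # criterion name -> score in [0, 1]
    passed: bool
    critique: Optional[str] = None     # feedback fed back to the generator
```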
Spec
Full specification: .claude/specs/evaluator-agent.md
Implementation plan: .claude/plans/evaluator-agent.md
Implementation Phases
- Schema & Data Layer — EvaluationCriterion, EvaluatorConfig, EvaluationResult models + Alembic migration
- Orchestration Loop — generate-evaluate-critique-regenerate in LLMController (see the loop sketch after this list)
- Default Evaluator Prompt — skeptical-posture system prompt with calibration examples
- Frontend Display — EvaluationBadge.tsx with scores, pass/fail indicators, iteration badges
- Testing & Documentation — unit/integration tests + example notebook
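The orchestration loop phase could take roughly the following shape. This is a hypothetical sketch, not the real LLMController interface: generate() and evaluate() are assumed methods, and the config fields reuse the EvaluatorConfig sketch above. Streaming, budget caps, and persistence are omitted.

```python
# Hypothetical generate -> evaluate -> critique -> regenerate loop.
# generate()/evaluate() and the config fields are assumptions, not the
# actual LLMController API.
def run_with_evaluation(generator, evaluator, prompt, config):
    draft, result, critique = None, None, None
    for _ in range(config.max_iterations):
        # The generator sees its own history plus the previous critique, if any.
        draft = generator.generate(prompt, feedback=critique)

        # The evaluator runs in a fresh context window: it sees only the task,
        # the grading criteria, and the draft -- never the generator's history.
        result = evaluator.evaluate(task=prompt, output=draft, criteria=config.criteria)
        if result.passed:
            return draft, result

        critique = result.critique  # fed back into the next generation

    # Iterations exhausted: apply the configured fail action.
    if config.fail_action == FailAction.STOP:
        raise RuntimeError("Evaluation failed after max iterations")
    return draft, result  # FLAG: surface the failing result alongside the draft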
File Changes
- backend/src/schemas/entities/evaluation.py — new
- backend/src/schemas/entities/llm.py — add evaluator field
- backend/src/controllers/llm.py — orchestration loop
- backend/src/static/prompts/md/evaluator.md — new
- backend/migrations/versions/xxx_add_evaluator.py — new
- frontend/src/components/chat/EvaluationBadge.tsx — new
- examples/agents/evaluator_example.ipynb — new (see the usage sketch after this list)
- Tests in backend/tests/unit/ and backend/tests/integration/
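The example notebook could walk through something like the following. The Assistant constructor and run() call here are purely hypothetical stand-ins for Orchestra's actual API; EvaluatorConfig, EvaluationCriterion, and FailAction are the schema sketch from above.

```python
# Purely hypothetical usage; Assistant and run() are stand-ins, not
# Orchestra's actual API.
writer = Assistant(
    name="report-writer",
    evaluator=EvaluatorConfig(
        evaluator_assistant_id="report-grader",
        criteria=[
            EvaluationCriterion(name="accuracy", weight=2.0, pass_threshold=0.8),
            EvaluationCriterion(name="completeness", weight=1.0, pass_threshold=0.7),
        ],
        max_iterations=3,
        fail_action=FailAction.FLAG,
    ),
)

response = writer.run("Summarize last quarter's incident reports.")
print(response.evaluation.scores, response.evaluation.passed)
```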
Risks
- Evaluator leniency: LLMs are naturally generous. Mitigation: strong skepticism prompting + few-shot calibration.
- Cost multiplication: every extra iteration adds a full generation pass plus an evaluation pass to token spend. Mitigation: configurable max iterations (default 3), budget cap.
- Model mismatch: an evaluator model weaker than the generator yields unreliable scores. Mitigation: document a same-tier model recommendation.