Detailed component breakdown and design decisions for building a generate-evaluate optimization loop.
graph TD
Input([Task]) --> Gen[Generator LLM]
Gen -->|"candidate"| Eval[Evaluator]
Eval -->|"score + feedback"| Ctrl{Iteration Controller}
Ctrl -->|"below threshold"| Gen
Ctrl -->|"meets threshold OR max reached"| Sel[Best-So-Far Selector]
Sel --> Output([Best Output])
History[(Attempt History)] <-->|"track"| Ctrl
style Input fill:#e3f2fd
style Gen fill:#fff3e0
style Eval fill:#e8f5e9
style Ctrl fill:#fce4ec
style Sel fill:#f3e5f5
style Output fill:#e3f2fd
style History fill:#f3e5f5
The generator produces a candidate output for the task. From iteration 2 onward it receives the original task, the previous attempt, and the evaluator's feedback. Prompt it explicitly to revise using that feedback rather than regenerate from scratch.
The evaluator assesses candidate quality. Types:
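One way to make the generator revise rather than regenerate is to place the previous attempt and the feedback in the prompt alongside an explicit instruction. The sketch below is illustrative only: `build_generator_prompt` and the prompt wording are hypothetical, not part of any library.

```python
from typing import Optional


def build_generator_prompt(
    task: str,
    previous_output: Optional[str] = None,
    feedback: Optional[str] = None,
) -> str:
    """Build the generator prompt (hypothetical helper and wording)."""
    if previous_output is None:
        # Iteration 1: only the task is available.
        return f"Task:\n{task}"
    # Iteration 2+: include the previous attempt and the evaluator feedback,
    # and tell the model explicitly to revise rather than start over.
    return (
        f"Task:\n{task}\n\n"
        f"Previous attempt:\n{previous_output}\n\n"
        f"Evaluator feedback:\n{feedback}\n\n"
        "Revise the previous attempt to address every feedback point. "
        "Keep the parts that were not criticized; do not start from scratch."
    )
```

Keeping the task in every prompt guards against the revision drifting away from the original request.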
| Type | Description | Tradeoff |
|---|---|---|
| LLM evaluator | LLM scores with evaluation prompt | Flexible; costs tokens |
| Rule-based | Code checks format, length, keywords | Cheap, deterministic; limited |
| Hybrid | Rules for format + LLM for content | Balanced |
Evaluator output: score (0.0–1.0), actionable feedback, optional per-criterion scores.
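The evaluator output and the hybrid type from the table can be sketched as follows. This is a minimal illustration under stated assumptions: the `Evaluation` class, `hybrid_evaluate`, the `max_chars` rule, and the injected `llm_score` callable are all hypothetical names, and a real rule set would depend on the task.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Evaluation:
    score: float   # overall score in [0.0, 1.0]
    feedback: str  # actionable feedback for the generator
    criteria: Dict[str, float] = field(default_factory=dict)  # optional


def hybrid_evaluate(
    candidate: str,
    llm_score: Callable[[str], Evaluation],
    max_chars: int = 2000,
) -> Evaluation:
    """Rules check format first; the LLM judges content only if rules pass."""
    if not candidate.strip():
        return Evaluation(0.0, "Output is empty; produce a complete answer.")
    if len(candidate) > max_chars:
        return Evaluation(
            0.0, f"Output has {len(candidate)} characters; limit is {max_chars}."
        )
    result = llm_score(candidate)
    # Clamp in case the LLM-backed scorer returns a value outside the range.
    result.score = min(1.0, max(0.0, result.score))
    return result
```

Running the cheap rules first means a malformed candidate never costs an evaluator LLM call.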
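The iteration controller checks three stop conditions: has the score met the threshold, has the maximum number of iterations been reached, and has the score converged (no improvement in the last K rounds)?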
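The three stop conditions can be expressed as a single function over the score history. This is a sketch, not a prescribed implementation: `stop_reason`, `patience` (the K in "no improvement in K rounds"), and `min_delta` are names chosen here for illustration.

```python
from typing import List, Optional


def stop_reason(
    scores: List[float],
    threshold: float,
    max_iterations: int,
    patience: int,
    min_delta: float = 0.0,
) -> Optional[str]:
    """Return why the loop should stop, or None to continue.

    `scores` holds one evaluator score per completed iteration.
    """
    if not scores:
        return None
    if scores[-1] >= threshold:
        return "threshold_met"
    if len(scores) >= max_iterations:
        return "max_iterations"
    # Converged: none of the last `patience` scores beat the best earlier score.
    if patience >= 1 and len(scores) > patience:
        best_before = max(scores[:-patience])
        if max(scores[-patience:]) <= best_before + min_delta:
            return "converged"
    return None
```

Returning a reason string rather than a boolean makes it easy to log why each run ended.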
The best-so-far selector tracks the highest-scoring attempt and returns it, which is not necessarily the last attempt, since later iterations can regress.
The attempt history records {iteration, output, score, feedback} for each attempt. It is used for convergence detection and debugging.
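A history record and a best-so-far lookup might look like the sketch below. The `Attempt` class and `best_so_far` function are hypothetical names; the fields mirror the record described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    iteration: int
    output: str
    score: float
    feedback: str


def best_so_far(history: List[Attempt]) -> Attempt:
    """Return the highest-scoring attempt; ties go to the earliest one."""
    if not history:
        raise ValueError("history is empty")
    return max(history, key=lambda a: a.score)
```

For example, with scores 0.4, 0.8, and 0.6 across three iterations, the selector returns the second attempt even though the third came last.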
history = []
best = {score: -1, output: null}
previous = null
for i in 1..max_iterations:
    if i == 1:
        candidate = generator(task)
    else:
        # Revise the last attempt using the evaluator's feedback
        candidate = generator(task, previous.output, previous.feedback)
    eval = evaluator(candidate)
    history.append({i, candidate, eval})
    if eval.score > best.score:
        best = {score: eval.score, output: candidate}
    if eval.score >= threshold:
        return best.output
    if converged(history):
        return best.output
    # Carry this attempt forward as input to the next iteration
    previous = {output: candidate, feedback: eval.feedback}
return best.output
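The pseudocode above can be sketched as runnable Python. This is one possible reading, with assumptions: the generator and evaluator are injected callables, the evaluator returns a `(score, feedback)` tuple, and `converged` is approximated by a `patience` counter of consecutive iterations without a new best score. The name `optimize` and its defaults are illustrative.

```python
from typing import Callable, List, Optional, Tuple

# (score, feedback) as returned by the evaluator (assumed shape).
Eval = Tuple[float, str]


def optimize(
    task: str,
    generator: Callable[[str, Optional[str], Optional[str]], str],
    evaluator: Callable[[str], Eval],
    threshold: float = 0.8,
    max_iterations: int = 3,
    patience: int = 2,
) -> Tuple[str, List[dict]]:
    """Run the generate-evaluate loop; return the best output and the history."""
    history: List[dict] = []
    best_score, best_output = -1.0, ""
    prev_output: Optional[str] = None
    prev_feedback: Optional[str] = None
    stale = 0  # consecutive iterations without a new best score

    for i in range(1, max_iterations + 1):
        # On iteration 1 both prev_* values are None.
        candidate = generator(task, prev_output, prev_feedback)
        score, feedback = evaluator(candidate)
        history.append({"iteration": i, "output": candidate,
                        "score": score, "feedback": feedback})
        if score > best_score:
            best_score, best_output, stale = score, candidate, 0
        else:
            stale += 1
        if score >= threshold or stale >= patience:
            break
        prev_output, prev_feedback = candidate, feedback

    return best_output, history
```

Returning the history alongside the best output keeps the run debuggable without adding global state.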
- Degenerate output — Check length/similarity; adjust temperature
- Regression — Best-so-far tracker handles this
- Ignores feedback — Structure feedback as explicit instructions
- Score inflation — Calibrate with known examples; use dimension scores
- Inconsistency — Lower temperature; add rule-based components
- Gaming — Diverse criteria; periodic human review
- Oscillation — Track score variance; stop if exceeding threshold
- False convergence — Separate minimum quality from convergence detection
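Two of the mitigations above, the length/similarity check for degenerate output and the score-variance check for oscillation, can be sketched with the Python standard library. The function names and all cutoffs (`min_chars`, `max_similarity`, `window`, `max_variance`) are illustrative assumptions that would need tuning per task; the extra "not trending upward" condition is also an assumption added here so that steady improvement is not flagged.

```python
from difflib import SequenceMatcher
from statistics import pvariance
from typing import List


def is_degenerate(candidate: str, previous: str,
                  min_chars: int = 20, max_similarity: float = 0.98) -> bool:
    """Flag outputs that are too short or nearly identical to the last attempt."""
    if len(candidate.strip()) < min_chars:
        return True
    return SequenceMatcher(None, candidate, previous).ratio() > max_similarity


def is_oscillating(scores: List[float], window: int = 4,
                   max_variance: float = 0.01) -> bool:
    """Flag a run whose recent scores swing widely without trending upward."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    # High variance alone also matches steady improvement, so additionally
    # require that the window did not end higher than it started.
    return pvariance(recent) > max_variance and recent[-1] <= recent[0]
```

Either check can feed the iteration controller as an additional stop condition.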
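Cost: with an LLM evaluator, each iteration makes 2 LLM calls (one generator, one evaluator), so K iterations cost 2 × K calls. K is typically 2–3 because returns diminish.
Latency: iterations run sequentially, so total latency is K × (generator_latency + evaluator_latency). Use a faster model for evaluation if possible.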
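As a worked example of the two formulas, the sketch below computes call count and sequential latency. The function name `loop_budget` and the latencies used in the example (4 s generator, 2 s evaluator) are illustrative assumptions, not measurements.

```python
from typing import Tuple


def loop_budget(iterations: int, generator_latency_s: float,
                evaluator_latency_s: float,
                llm_evaluator: bool = True) -> Tuple[int, float]:
    """Return (LLM calls, worst-case sequential latency in seconds)."""
    # A rule-based evaluator makes no LLM call, so only the generator counts.
    calls_per_iteration = 2 if llm_evaluator else 1
    calls = calls_per_iteration * iterations
    latency = iterations * (generator_latency_s + evaluator_latency_s)
    return calls, latency


calls, latency = loop_budget(3, 4.0, 2.0)
print(calls, latency)  # 6 calls, 18.0 seconds for K = 3
```

These are worst-case figures: a run that meets the threshold early stops before reaching K iterations.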
- Iteration 1→2: significant improvement
- Iteration 2→3: moderate improvement
- Iteration 3+: rarely worth the cost
Default: max_iterations = 3.
Evaluate per-step or end-to-end. Per-step is thorough but expensive.
Evaluate synthesized output; feedback guides re-decomposition.
When the evaluator and generator should be the same entity with richer self-critique, see Reflection evolution.
| Factor | LLM Evaluator | Rule-Based | Hybrid |
|---|---|---|---|
| Setup cost | Low | Medium | Medium-high |
| Per-call cost | High | Free | Medium |
| Nuance | High | Low | Medium-high |
| Determinism | Low | High | Medium |
| Best for | Prototyping | Format checks | Production |
Guideline: Start LLM-based. Replace measurable criteria with rules as you identify them. Production = hybrid.
This repo covers cognitive concerns; operational concerns belong in agent-deployments.
| Concern | This pattern's surface | Where to read |
|---|---|---|
| Prompt injection | evaluator output feeds back into generator — a poisoned evaluator output drives a bad next generation | foundations/security-and-safety.md |
| Hallucination & grounding | this pattern IS a grounding mechanism; evaluator needs its own grounding to be trustworthy | foundations/hallucination-and-grounding.md |
| Cost & model selection | 2× per iteration; cap iterations explicitly | foundations/cost-and-model-selection.md |
| Rate limiting & retries | inherited | agent-deployments cross-cutting |
| Idempotency | inherited | agent-deployments cross-cutting |
| Observability hooks | see observability.md alongside this file | foundations |