Problem
When the checklist evaluation identifies quality gaps across rounds, agents consistently:
- Dismiss valid feedback: In a recent PPTX generation run, GPT-5.2 vision model flagged "text density is the streaming #1 problem" — the agent dismissed it as "intentionally explanatory" and made zero changes.
- Cherry-pick easy fixes: Agents fix 2-3 surface-level items per round (rename a button label, bump 3 color values) while ignoring harder improvements (reduce text density by 40%, redesign layout, increase all fonts to 18pt).
- Propose but not implement: The agent proposed "increase to 14pt minimum" but implemented 12pt. It identified slide 9's green card palette break but never fixed it. There is a consistent gap between stated intentions and actual changes.
- Same gaps persist across rounds: The same evaluation failures (E2 font size, E3 maze continuity, E8 visual design) appeared in rounds 1, 2, AND 3 without being fully resolved.
Evidence
From log_20260228_132342_891702:
- Round 1: Vision model identified text density as streaming #1 problem → Agent dismissed it
- Round 2: E2 (font size) scored 7/10 → Agent fixed 3 of ~128 text instances
- Round 3: E2 scored 6/10 (worse!) → Agent fixed fonts on slides 7-10 only, slides 1-6 untouched
- Round 3: Slide 9 green card identified as palette break → Never fixed
Current Behavior
The checklist system tells agents to "implement ALL identified improvements, not cherry-pick" but has no enforcement mechanism. Agents can acknowledge gaps in their evaluation, fix the easiest 2-3, and submit. The system has no way to verify that hard improvements were attempted.
Potential Solutions (to explore)
- Task plan enforcement: The system already has task/plan infrastructure. When evaluation identifies N gaps, require the agent to create a task plan with one task per gap, then verify completion before allowing checklist submission.
- Gap tracking across rounds: Track which specific gaps (by criterion ID) persist across consecutive rounds. If the same gap appears 2+ rounds without score improvement, escalate it — either force the agent to address it first or apply a score penalty.
- Improvement verification: After the agent claims to have fixed something, re-run the specific verification (e.g., font size scan) before accepting the checklist submission.
- Prioritized improvement queue: Instead of showing all gaps at once (which enables cherry-picking), present the top 1-2 highest-impact gaps and require that those be addressed before moving on to the others.
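As a rough illustration of the "gap tracking across rounds" idea, the sketch below tracks gap scores by criterion ID and escalates any gap that has persisted for 2+ consecutive rounds without its score improving. All names here (`GapTracker`, `record_round`, the threshold default) are hypothetical, not from the existing codebase.

```python
class GapTracker:
    """Tracks evaluation gaps by criterion ID across rounds and flags
    gaps that persist without score improvement."""

    def __init__(self, persist_threshold: int = 2):
        self.persist_threshold = persist_threshold
        # criterion ID -> list of scores, one per round it appeared in
        self.history: dict[str, list[int]] = {}

    def record_round(self, gaps: dict[str, int]) -> list[str]:
        """Record one round's gap scores; return criterion IDs to escalate."""
        escalated = []
        for criterion, score in gaps.items():
            scores = self.history.setdefault(criterion, [])
            scores.append(score)
            # Escalate when the gap has appeared persist_threshold+ times
            # and the latest score is no better than when it first appeared.
            if len(scores) >= self.persist_threshold and scores[-1] <= scores[0]:
                escalated.append(criterion)
        return escalated


tracker = GapTracker()
tracker.record_round({"E2": 7})          # round 2: E2 scored 7/10 -> no escalation yet
stuck = tracker.record_round({"E2": 6})  # round 3: E2 scored 6/10 (worse)
print(stuck)  # ['E2']
```

An escalated ID could then feed the "force the agent to address it first" path, or translate into a score penalty at checklist-submission time.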
Related
- Quality rethinking subagent (implemented separately) addresses the "what to improve" side
- This issue addresses the "actually do the improvements" enforcement side