[BUG] Agents dismiss or cherry-pick evaluation improvements instead of implementing all identified gaps #961

@ncrispino

Description

Problem

When the checklist evaluation identifies quality gaps across rounds, agents consistently:

  1. Dismiss valid feedback: In a recent PPTX generation run, the GPT-5.2 vision model flagged text density as the #1 problem — the agent dismissed it as "intentionally explanatory" and made zero changes.
  2. Cherry-pick easy fixes: Agents fix 2-3 surface-level items per round (rename a button label, bump 3 color values) while ignoring harder improvements (reduce text density by 40%, redesign layout, increase all fonts to 18pt).
  3. Propose but don't implement: The agent proposed "increase to 14pt minimum" but implemented 12pt. The agent identified slide 9's green card palette break but never fixed it. There is a consistent gap between stated intentions and actual changes.
  4. Same gaps persist across rounds: The same evaluation failures (E2 font size, E3 maze continuity, E8 visual design) appeared in rounds 1, 2, AND 3 without being fully resolved.

Evidence

From log_20260228_132342_891702:

  • Round 1: Vision model identified text density as the #1 problem → Agent dismissed it
  • Round 2: E2 (font size) scored 7/10 → Agent fixed 3 of ~128 text instances
  • Round 3: E2 scored 6/10 (worse!) → Agent fixed fonts on slides 7-10 only, slides 1-6 untouched
  • Round 3: Slide 9 green card identified as palette break → Never fixed

Current Behavior

The checklist system tells agents to "implement ALL identified improvements, not cherry-pick" but has no enforcement mechanism. Agents can acknowledge gaps in their evaluation, fix the easiest 2-3, and submit. The system has no way to verify that hard improvements were attempted.

Potential Solutions (to explore)

  1. Task plan enforcement: The system already has task/plan infrastructure. When evaluation identifies N gaps, require the agent to create a task plan with one task per gap, then verify completion before allowing checklist submission.
  2. Gap tracking across rounds: Track which specific gaps (by criterion ID) persist across consecutive rounds. If the same gap appears 2+ rounds without score improvement, escalate it — either force the agent to address it first or apply a score penalty.
  3. Improvement verification: After the agent claims to have fixed something, re-run the specific verification (e.g., font size scan) before accepting the checklist submission.
  4. Prioritized improvement queue: Instead of showing all gaps at once (which enables cherry-picking), present only the top 1-2 highest-impact gaps and require that those be addressed before moving on to the others.
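Of these, gap tracking (option 2) is the most mechanical to prototype. A minimal sketch, assuming per-round evaluation scores keyed by criterion ID (e.g. "E2", "E8"); the names `GapTracker` and `record_round` are illustrative, not part of the existing checklist system:

```python
from dataclasses import dataclass, field

@dataclass
class GapTracker:
    """Tracks criterion scores across rounds; flags stagnant gaps.

    Hypothetical sketch for solution 2 — none of these names exist
    in the current codebase.
    """
    # criterion ID -> (consecutive rounds without improvement, last score)
    history: dict = field(default_factory=dict)
    escalation_threshold: int = 2

    def record_round(self, scores: dict) -> list:
        """Record one round of {criterion: score}; return the criteria
        that have gone escalation_threshold+ rounds without improving."""
        escalated = []
        for criterion, score in scores.items():
            rounds_stuck, last_score = self.history.get(criterion, (0, None))
            if last_score is not None and score <= last_score:
                rounds_stuck += 1      # no improvement: extend the streak
            else:
                rounds_stuck = 1       # new gap, or score went up: reset
            self.history[criterion] = (rounds_stuck, score)
            if rounds_stuck >= self.escalation_threshold:
                escalated.append(criterion)
        return escalated
```

With the E2 trajectory from the evidence above (7/10 in round 2, 6/10 in round 3), the second `record_round` call returns `["E2"]`, at which point the system could block checklist submission or apply a score penalty.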

Related

  • Quality rethinking subagent (implemented separately) addresses the "what to improve" side
  • This issue addresses the "actually do the improvements" enforcement side
