[BUG] Agents dismiss or cherry-pick evaluation improvements instead of implementing all identified gaps #961

@ncrispino

Description

Problem

When the checklist evaluation identifies quality gaps across rounds, agents consistently:

  1. Dismiss valid feedback: In a recent PPTX generation run, the GPT-5.2 vision model flagged text density as the #1 problem — the agent dismissed it as "intentionally explanatory" and made zero changes.
  2. Cherry-pick easy fixes: Agents fix 2-3 surface-level items per round (rename a button label, bump 3 color values) while ignoring harder improvements (reduce text density by 40%, redesign layout, increase all fonts to 18pt).
  3. Propose but don't implement: The agent proposed "increase to 14pt minimum" but implemented 12pt. The agent identified slide 9's green card palette break but never fixed it. There is a consistent gap between stated intentions and actual changes.
  4. Same gaps persist across rounds: The same evaluation failures (E2 font size, E3 maze continuity, E8 visual design) appeared in rounds 1, 2, AND 3 without being fully resolved.

Evidence

From log_20260228_132342_891702:

  • Round 1: Vision model identified text density as the #1 problem → Agent dismissed it
  • Round 2: E2 (font size) scored 7/10 → Agent fixed 3 of ~128 text instances
  • Round 3: E2 scored 6/10 (worse!) → Agent fixed fonts on slides 7-10 only, slides 1-6 untouched
  • Round 3: Slide 9 green card identified as palette break → Never fixed

Current Behavior

The checklist system tells agents to "implement ALL identified improvements, not cherry-pick" but has no enforcement mechanism. Agents can acknowledge gaps in their evaluation, fix the easiest 2-3, and submit. The system has no way to verify that hard improvements were attempted.

Potential Solutions (to explore)

  1. Task plan enforcement: The system already has task/plan infrastructure. When evaluation identifies N gaps, require the agent to create a task plan with one task per gap, then verify completion before allowing checklist submission.
  2. Gap tracking across rounds: Track which specific gaps (by criterion ID) persist across consecutive rounds. If the same gap appears 2+ rounds without score improvement, escalate it — either force the agent to address it first or apply a score penalty.
  3. Improvement verification: After the agent claims to have fixed something, re-run the specific verification (e.g., font size scan) before accepting the checklist submission.
  4. Prioritized improvement queue: Instead of showing all gaps at once (which enables cherry-picking), present only the top 1-2 highest-impact gaps and require that those be addressed before moving on to the others.
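Of these, gap tracking (option 2) is the most mechanical to prototype. A minimal sketch, assuming per-round evaluation scores keyed by criterion ID (e.g. "E2", "E8"); the names `GapTracker` and `record_round` are illustrative, not part of the existing checklist system:

```python
from dataclasses import dataclass, field

@dataclass
class GapTracker:
    """Tracks criterion scores across rounds; flags stagnant gaps.

    Hypothetical sketch for solution 2 — none of these names exist
    in the current codebase.
    """
    # criterion ID -> (consecutive rounds without improvement, last score)
    history: dict = field(default_factory=dict)
    escalation_threshold: int = 2

    def record_round(self, scores: dict) -> list:
        """Record one round of {criterion: score}; return the criteria
        that have gone escalation_threshold+ rounds without improving."""
        escalated = []
        for criterion, score in scores.items():
            rounds_stuck, last_score = self.history.get(criterion, (0, None))
            if last_score is not None and score <= last_score:
                rounds_stuck += 1      # no improvement: extend the streak
            else:
                rounds_stuck = 1       # new gap, or score went up: reset
            self.history[criterion] = (rounds_stuck, score)
            if rounds_stuck >= self.escalation_threshold:
                escalated.append(criterion)
        return escalated
```

With the E2 trajectory from the evidence above (7/10 in round 2, 6/10 in round 3), the second `record_round` call returns `["E2"]`, at which point the system could block checklist submission or apply a score penalty.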

Related

  • Quality rethinking subagent (implemented separately) addresses the "what to improve" side
  • This issue addresses the "actually do the improvements" enforcement side
