
[Feature Request] Agentic Vision (code_execution) for Critic Agent to detect structural connection errors #35

@catallactics

Description


Problem

Connection/structural errors in generated diagrams are a primary failure mode in PaperBanana's pipeline. The current Critic Agent relies on pure VLM text reasoning to evaluate generated diagrams — it cannot programmatically verify structural properties like:

  • Correct number of nodes/components matching the methodology
  • Arrow/connection integrity (dangling arrows, missing connections)
  • Source-target alignment of data flow paths
  • Component count consistency between description and rendered image

These structural issues persist through multiple critic rounds because the VLM "sees" the diagram holistically but lacks the ability to perform precise pixel-level verification.

Proposed Solution

Enable Gemini's code_execution tool on the Critic Agent, allowing it to run Python code (PIL-based image analysis) within the critique loop. This creates a hybrid critic that combines:

  1. VLM reasoning (existing) — semantic understanding, content fidelity, aesthetic evaluation
  2. Code execution (new) — programmatic structural verification via pixel analysis

Implementation

The implementation adds an --agentic-critic CLI flag that:

  • Adds tools=[Tool(code_execution=ToolCodeExecution())] to the Critic's GenerateContentConfig
  • Appends a "Structural Verification" section to the system prompt instructing the model to analyze the image with PIL
  • Handles multi-part responses (separating executable_code/code_execution_result parts from text)
  • Removes response_mime_type (incompatible with tools) and uses regex JSON extraction instead
  • Logs code execution output and timing metrics per round

~175 lines changed across 5 files (critic_agent.py, generation_utils.py, main.py, config.py, paperviz_processor.py). Zero changes to existing behavior when --agentic-critic is not set.
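Because response_mime_type is incompatible with tools, the critique JSON has to be recovered from free-form, multi-part output. A minimal sketch of those two handling steps — splitting code parts from text parts and regex-extracting the JSON — assuming hypothetical part objects with executable_code / code_execution_result / text attributes; the fork's actual entry point is call_gemini_agentic_async(), and this is not its verbatim code:

```python
import json
import re


def extract_critique_json(text: str) -> dict:
    """Pull the first JSON object out of free-form critic text.

    With code_execution enabled, response_mime_type="application/json"
    cannot be set, so the critique JSON is recovered by regex instead.
    """
    # Prefer an explicit ```json ... ``` fence if the model emitted one.
    fenced = re.search(r"```json\s*(\{.*?\})\s*```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        # Fall back to the first brace-delimited span.
        brace = re.search(r"\{.*\}", text, re.DOTALL)
        candidate = brace.group(0) if brace else "{}"
    return json.loads(candidate)


def split_response_parts(parts):
    """Separate executable_code / code_execution_result parts from text."""
    code_parts, text_chunks = [], []
    for part in parts:
        if getattr(part, "executable_code", None) or getattr(part, "code_execution_result", None):
            code_parts.append(part)
        elif getattr(part, "text", None):
            text_chunks.append(part.text)
    return code_parts, "".join(text_chunks)
```

For example, a response like `'analysis...\n```json\n{"verdict": "revise"}\n```'` yields `{"verdict": "revise"}`, while the code parts are logged separately for the timing metrics.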

Experiment Setup

Model Configuration

| Role | Model | Notes |
| --- | --- | --- |
| Text (Planner, Stylist, Critic) | gemini-2.5-flash | Same for both A and B |
| Image (Visualizer) | gemini-3.1-flash-image-preview (Nano Banana 2) | Same for both A and B |

Note: Initially tested with gemini-2.0-flash-exp but it returned 404 (deprecated). Switched to gemini-3.1-flash-image-preview (Nano Banana 2) for all experiments. Both baseline and agentic runs use the exact same models — the only variable is the Critic Agent's code_execution tool.

Controlled Conditions

  • Pipeline: Retriever → Planner → Stylist → Visualizer → Critic → Visualizer (loop)
  • Retrieval: none (no example retrieval, to isolate Critic behavior)
  • Critic rounds: 3 max
  • Variable: Critic only — --agentic-critic flag on (B) vs off (A)
  • Everything else identical: same samples, same planner/stylist/visualizer, same models

Two Test Sets

| Set | Samples | Complexity | Purpose |
| --- | --- | --- | --- |
| Simple (3 samples) | test_274, test_51, test_17 | 931–2,727 chars of content | Baseline comparison |
| Complex (3 samples) | test_260, test_46, test_156 | 11,444–26,502 chars of content | Stress test on multi-module architectures |

Experiment 1: Simple Diagrams (3 samples)

Qualitative Findings

| Aspect | A: Baseline Critic | B: Agentic Vision Critic |
| --- | --- | --- |
| Structural verification | No programmatic analysis; relies on visual inspection only | Runs PIL code to count nodes, detect dark regions, estimate arrow/connection density |
| Node counting | Not attempted | Explicitly counts bounded regions and compares against the methodology description |
| Critique specificity | Sometimes generic ("the flow is confusing") | Evidence-backed ("18 boxes/cylinders detected, 14 arrows confirmed consistent with description") |
| Convergence | Always runs max rounds | Sometimes converges early when structural analysis confirms correctness |

Per-Sample Comparison

test_51 (Knowledge Distillation — generative_learning)

  • Baseline: Found redundant Student Non-Target Logits bar, mathematical formatting issues. All 3 rounds iterated.
  • Agentic: PIL analysis confirmed component structure; focused on semantic accuracy of DTM/Momentum Buffer placement. Similar iteration count.
  • Visual difference: Agentic output has slightly clearer Teacher/Student separation with individually labeled DTM and Momentum Buffer blocks.

test_17 (Alias-Free Transformer — vision_perception)

  • Baseline: Round 1 said "No changes needed" but Round 2 found AF-LN placement errors — inconsistent self-assessment.
  • Agentic: PIL detected 18 boxes/cylinders and 14 arrows, confirmed consistent with methodology. More consistent across rounds.
  • Visual difference: Baseline output is slightly cleaner; Agentic output has more detailed internal labels (AF-PI Block, xL notation).

test_274 (Robot Planning Pipeline — agent_reasoning)

  • Baseline: Found text inconsistency ("Pick Block" vs methodology), redundant "Figure 2" label.
  • Agentic: PIL threshold analysis + same text issues + additional concrete planning examples.
  • Visual difference: Agentic includes concrete examples (pick_block, place_block_table_A) vs baseline's abstract icons.

Simple Diagrams: Verdict

For simple samples (2–10 major components), visual quality difference is marginal. The agentic critic provides evidence-backed critiques but the final rendered output is comparable.


Experiment 2: Complex Diagrams (3 samples) — Key Results

These samples have 10–30+ components, multi-branch data flows, and sub-module hierarchies — exactly the scenario where structural verification should matter.

Visual Comparison (Complex Diagrams)

Important caveat: These images look very different despite being generated from the same paper/methodology. This is expected — see Limitations below. The Visualizer (Nano Banana 2) is non-deterministic and the Critic's different feedback causes divergent re-rendering. Compare the Critic's text suggestions, not the images, for a fair assessment.

test_46 — CroPe (15+ components, most complex)

| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| baseline_test_46 | agentic_test_46 |

test_156 — T2MIR (3-panel MoE Transformer)

| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| baseline_test_156 | agentic_test_156 |

test_260 — TransGesture (Dual-path architecture)

| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| baseline_test_260 | agentic_test_260 |

Per-Sample Analysis

test_260: TransGesture (Gaze-Gesture Framework — vision_perception)

Methodology: Dual-path architecture (Gaze + Gesture decoders), frozen DINOv2 encoder, joint cross-attention module → convolution mask head → segmentation mask + binary existence prediction.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | Round 1 said "No changes needed" — stopped early | All 3 rounds iterated with code-verified structural checks |
| Structural check | None | PIL detected region density, verified dual-path layout symmetry |
| Key finding | Missed concatenation discrepancy in Gesture Decoder input | Found and flagged layout duplication error + Gesture Decoder input mismatch |
| Final result | Stopped at Round 0 output | Iterated through all 3 rounds with progressive refinement |
| Pipeline completeness | ✅ Complete — includes Segmentation Head, mask output, Binary Existence Prediction | ❌ Incomplete — ends at JCA module output (f_GCA), missing Segmentation Head and final output |

Notable — Agentic Critic failure case: Despite running PIL-based structural analysis for 3 rounds, the agentic critic failed to detect that the final output stage was missing. The diagram ends at the JCA (Joint Cross Attention) module's f_GCA output, but the methodology clearly specifies a downstream "convolution mask head" producing segmentation masks and binary existence predictions. The baseline, by contrast, includes the complete pipeline.

This reveals a key limitation: the agentic critic's code_execution analyzes what IS in the image (node counts, region density) but does not verify what is MISSING against the methodology. The structural verification prompt focuses on detecting errors in existing elements rather than checking for absent components. A potential improvement would be adding a "pipeline completeness check" step that explicitly enumerates expected output stages from the methodology and verifies each one appears in the image.
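The "pipeline completeness check" suggested above reduces to a set comparison. A minimal sketch, under stated assumptions: expected_components would come from parsing the methodology and detected_labels from OCR or region analysis of the image — neither extractor exists in the current implementation, and the function name is hypothetical:

```python
def check_pipeline_completeness(expected_components, detected_labels):
    """Flag methodology components with no matching label in the image.

    Matching is a loose case-insensitive substring test, since rendered
    labels rarely reproduce methodology phrasing exactly.
    """
    detected = [label.lower() for label in detected_labels]
    missing = [
        comp for comp in expected_components
        if not any(comp.lower() in label or label in comp.lower() for label in detected)
    ]
    return missing


# For test_260, the methodology specifies these downstream stages:
expected = ["JCA", "Segmentation Head", "Binary Existence Prediction"]
# ...but the agentic run's diagram stops at the JCA output:
detected = ["DINOv2 Encoder", "Gaze Decoder", "Gesture Decoder", "JCA"]
print(check_pipeline_completeness(expected, detected))
# → ['Segmentation Head', 'Binary Existence Prediction']
```

Anything the check returns would then be surfaced to the Critic as a mandatory revision item rather than left to holistic VLM judgment.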

test_46: CroPe (Complementary-Perceptive Text Generation — vision_perception)

Methodology: Visual Encoder → CPTG module (3 sub-paths: Domain-Specific, Domain-Invariant, Gated Complementary) → RCTVF module → Segmentation Head. Highly complex with 15+ labeled components.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | Round 1 said "No changes needed" — stopped early | All 3 rounds with structural analysis |
| Key finding (Baseline) | RCTVF module naming/flow ambiguity | Same, plus stopped after 1 round |
| Key finding (Agentic) | Input image multiplicity error, CPTG internal data flow paths | Found content mismatch in Visual Encoder output to CPTG ([P2,P3,P4] multi-line representation) |
| Visual difference | Cleaner but fewer sub-module details | More explicit sub-module labels (MLP1/MLP2, Domain-Specific/Invariant paths clearly separated), Gated Complementary flow visible |

Notable: The agentic output for this complex sample shows significantly more structural detail — the Domain-Specific Perception, Domain-Invariant Regularization, and Gated Complementary Fusion paths are individually traced and labeled, whereas baseline collapses them into a single CPTG block.

test_156: T2MIR (Token-Task MoE Transformer — generative_learning)

Methodology: 3-panel diagram: (a) Overall Pipeline with parallel MoE layers in transformer blocks, (b) Token-wise MoE with top-k routing, (c) Task-wise MoE with contrastive learning.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | All 3 rounds iterated | All 3 rounds with code execution |
| Key finding (Baseline) | Missing labels in panel (c), general formatting issues | Same + structural verification of panel counts |
| Key finding (Agentic) | Panel (c) contrastive learning flow details, gating mechanism notation | Found critical content error in data flow routing between Token-wise and Task-wise MoE |
| Visual difference | Standard 3-panel layout | Panels (b)/(c) have more explicit routing arrows and gating notation (g_tok, g_task), auxiliary flow paths labeled |

Complex Diagrams: Quantitative

| Metric | A: Baseline | B: Agentic | Delta |
| --- | --- | --- | --- |
| Total pipeline time (3 complex) | ~6 min | ~7.5 min | +25% |
| Avg critic round time | ~5–10 s | ~66.2 s | +7–13x |
| Token usage per round | ~3–5k | ~27–47k | ~6–12x |
| Code execution every round | No | Yes (2 parts/round) | — |
| All visualizations succeeded | Yes | Yes | Same |
| Early-stopped on a false "No changes needed" | 2/3 samples | 0/3 samples | — |

Complex Agentic Token Breakdown

| Sample | Round | Time (s) | Total Tokens | Code Parts |
| --- | --- | --- | --- | --- |
| test_260 | 0 | 43.0 | 26,958 | 2 |
| test_260 | 1 | 79.6 | 30,885 | 2 |
| test_260 | 2 | 61.0 | 37,790 | 2 |
| test_46 | 0 | 60.0 | 46,225 | 2 |
| test_46 | 1 | 79.8 | 40,027 | 2 |
| test_46 | 2 | 69.3 | 47,343 | 2 |
| test_156 | 0 | 88.4 | 43,013 | 2 |
| test_156 | 1 | 54.4 | 34,923 | 2 |
| test_156 | 2 | 60.5 | 43,159 | 2 |

Key Insight: False Early Convergence

The most significant finding from the complex diagram experiment:

Baseline critic said "No changes needed" in Round 1 for 2 out of 3 complex samples (test_260, test_46), stopping iteration prematurely. The agentic critic, armed with structural analysis, continued iterating and found real issues in subsequent rounds:

  • test_260: Layout duplication error, JCA module flow ambiguity
  • test_46: Input image multiplicity error, CPTG internal data flow path mismatches

This suggests that VLM-only critique over-confidently approves complex diagrams because it lacks the granularity to detect subtle structural issues. The code_execution tool provides a "second opinion" mechanism that prevents premature convergence.


Example: Agentic Critic Code Execution

In Round 0 for test_46 (complex CroPe architecture), the agentic critic ran:

from PIL import Image
import numpy as np

image = Image.open("input_file_0.jpeg").convert("RGB")
gray_image = image.convert("L")

# Simple threshold to binarize: dark diagram ink maps to 255
threshold = 128
binary = gray_image.point(lambda x: 255 if x < threshold else 0)

# Fraction of ink pixels as a rough density measure
arr = np.array(binary)
dark_ratio = np.sum(arr == 255) / arr.size
# Horizontal projection for column-based component counting
col_sums = np.sum(arr == 255, axis=0)
...

This provided evidence for the subsequent critique: "Image Features [P2, P3, P4] should be represented as three distinct feature lines/paths from the Visual Encoder to the CPTG module."


Trade-offs

| Pro | Con |
| --- | --- |
| Programmatic structural verification | ~6–12x token increase per critic round |
| Prevents false early convergence on complex diagrams | +50–90 s latency per round |
| Evidence-backed critique suggestions | Sandbox library constraints (no scikit-image; cv2 available but untested — see Future Improvement) |
| Zero changes to the non-agentic path | Gemini-specific (code_execution not available on Claude/OpenAI) |

Limitations of This Experiment

Visual comparison is not a reliable metric here. The final images from baseline vs agentic look significantly different even for the same sample — this is because:

  1. No seed control: PaperBanana currently does not support fixed random seeds. The Planner and Stylist stages can produce slightly different descriptions across runs, and Nano Banana 2 is non-deterministic, so the Visualizer generates different layouts each time.
  2. Critic feedback cascades: Different Critic suggestions → different revised descriptions → completely different re-rendered images. A single wording change in the Critic's revision can cause the Visualizer to choose an entirely different layout.
  3. Not an apples-to-apples image comparison: The images are the end result of divergent feedback loops, not controlled single-variable outputs.

What IS reliably comparable:

  • Critic suggestion text — directly shows what each Critic detected and recommended (see Per-Sample Comparison sections above)
  • False early convergence — Baseline said "No changes needed" prematurely on 2/3 complex samples; Agentic continued finding real issues
  • Code execution evidence — Agentic Critic ran Python code to count nodes/regions and used results as structural evidence
  • Token/timing metrics — objective cost measurements

Agentic Critic can miss absent components:
In test_260, the agentic critic's PIL analysis verified what was present in the image (node counts, region density) but failed to detect that the Segmentation Head and final output stage were completely missing. The baseline actually produced a more complete diagram. This shows that code_execution-based verification is better at checking existing elements than ensuring pipeline completeness — a "what's missing" check would require comparing the methodology's expected components against the image, which the current prompt doesn't explicitly instruct.

A more rigorous experiment would require: (a) deterministic seeding across the full pipeline, (b) freezing Planner/Stylist/Visualizer outputs and only varying the Critic, or (c) human evaluation of Critic suggestion quality on identical intermediate outputs.

Future Improvement: OpenCV in Sandbox

A post-experiment discovery: Gemini's code_execution sandbox has opencv-python (cv2) pre-installed, not just PIL. Our current prompt restricts the critic to PIL-only analysis (thresholding, region counting), but cv2 enables significantly more powerful structural verification:

  • Canny Edge Detection → precise arrow/connection line tracing
  • cv2.findContours() → accurate node bounding box detection and counting
  • Hough Line Transform → detect straight connection lines and their endpoints
  • Template matching → verify specific icon/symbol presence

This could address the key weakness found in test_260 (missing Segmentation Head). With cv2 contour detection, the critic could enumerate all detected components and compare against an expected component list derived from the methodology section — catching "what's missing", not just "what's wrong".

A potential v2 prompt addition:

You also have OpenCV (cv2) available. Use cv2.findContours() to enumerate
all distinct bounded components, then compare this count against the
expected components listed in the methodology. Flag any MISSING components.

Recommendation

Based on the experiments, --agentic-critic is most valuable as a selective, opt-in feature for:

  • Complex diagrams with 10+ components and multi-branch data flows
  • Architecture diagrams where connection accuracy is critical
  • Quality-sensitive production runs where false early convergence is costly

For simple diagrams (< 10 components), the overhead is not justified — baseline critic is sufficient.

Implementation Reference

Fork with full implementation: https://github.com/catallactics/PaperBanana/tree/feature/agentic-vision-critic

Files Changed

| File | Change |
| --- | --- |
| agents/critic_agent.py | code_execution tool, AGENTIC_VISION_PROMPT_SUPPLEMENT, multi-part response handling, timing/logging |
| utils/generation_utils.py | New call_gemini_agentic_async() for multi-part response parsing |
| utils/paperviz_processor.py | Defensive fix for critic_suggestions being a list (pre-existing bug surfaced by complex samples) |
| main.py | --agentic-critic CLI flag |
| utils/config.py | agentic_critic: bool field in ExpConfig |
| scripts/compare_results.py | A/B comparison utility script |

Happy to convert this to a PR if there's interest. The implementation is designed to be fully backward-compatible — existing behavior is unchanged unless --agentic-critic is explicitly passed.
