## Problem
Connection/structural errors in generated diagrams are a primary failure mode in PaperBanana's pipeline. The current Critic Agent relies on pure VLM text reasoning to evaluate generated diagrams — it cannot programmatically verify structural properties like:
- Correct number of nodes/components matching the methodology
- Arrow/connection integrity (dangling arrows, missing connections)
- Source-target alignment of data flow paths
- Component count consistency between description and rendered image
These structural issues persist through multiple critic rounds because the VLM "sees" the diagram holistically but lacks the ability to perform precise pixel-level verification.
## Proposed Solution
Enable Gemini's `code_execution` tool on the Critic Agent, allowing it to run Python code (PIL-based image analysis) within the critique loop. This creates a hybrid critic that combines:
- VLM reasoning (existing) — semantic understanding, content fidelity, aesthetic evaluation
- Code execution (new) — programmatic structural verification via pixel analysis
## Implementation
The implementation adds an `--agentic-critic` CLI flag that:
- Adds `tools=[Tool(code_execution=ToolCodeExecution())]` to the Critic's `GenerateContentConfig`
- Appends a "Structural Verification" section to the system prompt instructing the model to analyze the image with PIL
- Handles multi-part responses (separating `executable_code`/`code_execution_result` parts from text)
- Removes `response_mime_type` (incompatible with tools) and uses regex JSON extraction instead
- Logs code execution output and timing metrics per round

~175 lines changed across 5 files (`critic_agent.py`, `generation_utils.py`, `main.py`, `config.py`, `paperviz_processor.py`). Zero changes to existing behavior when `--agentic-critic` is not set.
## Experiment Setup
### Model Configuration
| Role | Model | Notes |
| --- | --- | --- |
| Text (Planner, Stylist, Critic) | `gemini-2.5-flash` | Same for both A and B |
| Image (Visualizer) | `gemini-3.1-flash-image-preview` (Nano Banana 2) | Same for both A and B |

**Note:** Initially tested with `gemini-2.0-flash-exp` but it returned 404 (deprecated). Switched to `gemini-3.1-flash-image-preview` (Nano Banana 2) for all experiments. Both baseline and agentic runs use the exact same models — the only variable is the Critic Agent's `code_execution` tool.
### Controlled Conditions
- Pipeline: Retriever → Planner → Stylist → Visualizer → Critic → Visualizer (loop)
- Retrieval: `none` (no example retrieval, to isolate Critic behavior)
- Critic rounds: 3 max
- Variable: Critic only — `--agentic-critic` flag on (B) vs off (A)
- Everything else identical: same samples, same planner/stylist/visualizer, same models
### Two Test Sets
| Set | Samples | Complexity | Purpose |
| --- | --- | --- | --- |
| Simple (3 samples) | test_274, test_51, test_17 | 931–2,727 chars content | Baseline comparison |
| Complex (3 samples) | test_260, test_46, test_156 | 11,444–26,502 chars content | Stress test on multi-module architectures |
## Experiment 1: Simple Diagrams (3 samples)
### Qualitative Findings
| Aspect | A: Baseline Critic | B: Agentic Vision Critic |
| --- | --- | --- |
| Structural verification | No programmatic analysis; relies on visual inspection only | Runs PIL code to count nodes, detect dark regions, estimate arrow/connection density |
| Node counting | Not attempted | Explicitly counts bounded regions and compares against methodology description |
| Critique specificity | Sometimes generic ("the flow is confusing") | Evidence-backed ("18 boxes/cylinders detected, 14 arrows confirmed consistent with description") |
| Convergence | Always runs max rounds | Sometimes converges early when structural analysis confirms correctness |
### Per-Sample Comparison
#### test_51 (Knowledge Distillation — `generative_learning`)
- Baseline: Found redundant Student Non-Target Logits bar, mathematical formatting issues. All 3 rounds iterated.
- Agentic: PIL analysis confirmed component structure; focused on semantic accuracy of DTM/Momentum Buffer placement. Similar iteration count.
- Visual difference: Agentic output has slightly clearer Teacher/Student separation with individually labeled DTM and Momentum Buffer blocks.
#### test_17 (Alias-Free Transformer — `vision_perception`)
- Baseline: Round 1 said "No changes needed" but Round 2 found AF-LN placement errors — inconsistent self-assessment.
- Agentic: PIL detected 18 boxes/cylinders and 14 arrows, confirmed consistent with methodology. More consistent across rounds.
- Visual difference: Baseline output is slightly cleaner; Agentic output has more detailed internal labels (AF-PI Block, xL notation).
#### test_274 (Robot Planning Pipeline — `agent_reasoning`)
- Baseline: Found text inconsistency ("Pick Block" vs methodology), redundant "Figure 2" label.
- Agentic: PIL threshold analysis + same text issues + additional concrete planning examples.
- Visual difference: Agentic includes concrete examples (pick_block, place_block_table_A) vs baseline's abstract icons.
### Simple Diagrams: Verdict
For simple samples (2–10 major components), visual quality difference is marginal. The agentic critic provides evidence-backed critiques but the final rendered output is comparable.
## Experiment 2: Complex Diagrams (3 samples) — Key Results
These samples have 10–30+ components, multi-branch data flows, and sub-module hierarchies — exactly the scenario where structural verification should matter.
### Visual Comparison (Complex Diagrams)
**Important caveat:** These images look very different despite being generated from the same paper/methodology. This is expected — see Limitations below. The Visualizer (Nano Banana 2) is non-deterministic, and the Critic's different feedback causes divergent re-rendering. Compare the Critic's text suggestions, not the images, for a fair assessment.
#### test_46 — CroPe (15+ components, most complex)
| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| *(image)* | *(image)* |

#### test_156 — T2MIR (3-panel MoE Transformer)
| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| *(image)* | *(image)* |

#### test_260 — TransGesture (Dual-path architecture)
| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| *(image)* | *(image)* |
### Per-Sample Analysis
#### test_260: TransGesture (Gaze-Gesture Framework — `vision_perception`)
**Methodology:** Dual-path architecture (Gaze + Gesture decoders), frozen DINOv2 encoder, joint cross-attention module → convolution mask head → segmentation mask + binary existence prediction.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | Round 1 said "No changes needed" — stopped early | All 3 rounds iterated with code-verified structural checks |
| Structural check | None | PIL detected region density, verified dual-path layout symmetry |
| Key finding | Missed concatenation discrepancy in Gesture Decoder input | Found and flagged layout duplication error + Gesture Decoder input mismatch |
| Final result | Stopped at Round 0 output | Iterated through all 3 rounds with progressive refinement |
| Pipeline completeness | ✅ Complete — includes Segmentation Head, mask output, Binary Existence Prediction | ❌ Incomplete — ends at JCA module output (`f_GCA`), missing Segmentation Head and final output |
**Notable — Agentic Critic failure case:** Despite running PIL-based structural analysis for 3 rounds, the agentic critic failed to detect that the final output stage was missing. The diagram ends at the JCA (Joint Cross Attention) module's `f_GCA` output, but the methodology clearly specifies a downstream "convolution mask head" producing segmentation masks and binary existence predictions. The baseline, by contrast, includes the complete pipeline.
This reveals a key limitation: the agentic critic's code_execution analyzes what IS in the image (node counts, region density) but does not verify what is MISSING against the methodology. The structural verification prompt focuses on detecting errors in existing elements rather than checking for absent components. A potential improvement would be adding a "pipeline completeness check" step that explicitly enumerates expected output stages from the methodology and verifies each one appears in the image.
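The "pipeline completeness check" could be as simple as a set difference between the components the methodology promises and the labels actually found in the image (via OCR or the VLM's own reading of the render). A minimal sketch — `completeness_check` is a hypothetical helper, not part of the fork, and a real version would need fuzzy label matching rather than exact lowercase comparison:

```python
def completeness_check(expected_components, detected_labels):
    """Return the methodology-required components that were NOT found
    among the labels detected in the rendered image.

    Naive exact matching after normalization; real use would need
    fuzzy matching (abbreviations, line-wrapped labels, OCR noise).
    """
    detected = {label.strip().lower() for label in detected_labels}
    return [c for c in expected_components
            if c.strip().lower() not in detected]
```

On the test_260 failure above, such a check would have reported the Segmentation Head and Binary Existence Prediction as missing even though every element that *was* drawn passed the structural checks.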
#### test_46: CroPe (Complementary-Perceptive Text Generation — `vision_perception`)
**Methodology:** Visual Encoder → CPTG module (3 sub-paths: Domain-Specific, Domain-Invariant, Gated Complementary) → RCTVF module → Segmentation Head. Highly complex with 15+ labeled components.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | Round 1 said "No changes needed" — stopped early | All 3 rounds with structural analysis |
| Key finding (Baseline) | RCTVF module naming/flow ambiguity | Same, plus stopped after 1 round |
| Key finding (Agentic) | Input image multiplicity error, CPTG internal data flow paths | Found content mismatch in Visual Encoder output to CPTG (`[P2,P3,P4]` multi-line representation) |
| Visual difference | Cleaner but fewer sub-module details | More explicit sub-module labels (MLP1/MLP2, Domain-Specific/Invariant paths clearly separated), Gated Complementary flow visible |
Notable: The agentic output for this complex sample shows significantly more structural detail — the Domain-Specific Perception, Domain-Invariant Regularization, and Gated Complementary Fusion paths are individually traced and labeled, whereas baseline collapses them into a single CPTG block.
#### test_156: T2MIR (Token-Task MoE Transformer — `generative_learning`)
**Methodology:** 3-panel diagram: (a) Overall Pipeline with parallel MoE layers in transformer blocks, (b) Token-wise MoE with top-k routing, (c) Task-wise MoE with contrastive learning.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | All 3 rounds iterated | All 3 rounds with code execution |
| Key finding (Baseline) | Missing labels in panel (c), general formatting issues | Same + structural verification of panel counts |
| Key finding (Agentic) | Panel (c) contrastive learning flow details, gating mechanism notation | Found critical content error in data flow routing between Token-wise and Task-wise MoE |
| Visual difference | Standard 3-panel layout | Panel (b)/(c) have more explicit routing arrows and gating notation (g_tok, g_task), auxiliary flow paths labeled |
### Complex Diagrams: Quantitative
| Metric | A: Baseline | B: Agentic | Delta |
| --- | --- | --- | --- |
| Total pipeline time (3 complex) | ~6 min | ~7.5 min | +25% |
| Avg critic round time | ~5–10s | ~66.2s | +7–13x |
| Token usage per round | ~3–5k | ~27–47k | ~6–12x |
| Code execution every round | No | Yes (2 parts/round) | — |
| All visualizations succeeded | ✅ | ✅ | Same |
| Baseline early-stopped (false "No changes needed") | 2/3 samples | 0/3 samples | — |
### Complex Agentic Token Breakdown
| Sample | Round | Time (s) | Total Tokens | Code Parts |
| --- | --- | --- | --- | --- |
| test_260 | 0 | 43.0 | 26,958 | 2 |
| test_260 | 1 | 79.6 | 30,885 | 2 |
| test_260 | 2 | 61.0 | 37,790 | 2 |
| test_46 | 0 | 60.0 | 46,225 | 2 |
| test_46 | 1 | 79.8 | 40,027 | 2 |
| test_46 | 2 | 69.3 | 47,343 | 2 |
| test_156 | 0 | 88.4 | 43,013 | 2 |
| test_156 | 1 | 54.4 | 34,923 | 2 |
| test_156 | 2 | 60.5 | 43,159 | 2 |
## Key Insight: False Early Convergence
The most significant finding from the complex diagram experiment:
Baseline critic said "No changes needed" in Round 1 for 2 out of 3 complex samples (test_260, test_46), stopping iteration prematurely. The agentic critic, armed with structural analysis, continued iterating and found real issues in subsequent rounds:
- test_260: Layout duplication error, JCA module flow ambiguity
- test_46: Input image multiplicity error, CPTG internal data flow path mismatches
This suggests that VLM-only critique over-confidently approves complex diagrams because it lacks the granularity to detect subtle structural issues. The code_execution tool provides a "second opinion" mechanism that prevents premature convergence.
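The "second opinion" mechanism amounts to a convergence gate: the loop may stop early only when the VLM verdict *and* the code-based structural checks both come back clean. A hypothetical sketch (`CritiqueRound` and `should_converge` are illustrative names, not the fork's API):

```python
from dataclasses import dataclass, field


@dataclass
class CritiqueRound:
    vlm_says_done: bool                 # VLM verdict: "No changes needed"
    structural_issues: list = field(default_factory=list)  # from code execution


def should_converge(round_result: CritiqueRound) -> bool:
    """Accept early convergence only when both the VLM verdict and the
    programmatic structural checks report no remaining issues."""
    return round_result.vlm_says_done and not round_result.structural_issues
```

Under this gate, the baseline's over-confident Round 1 "No changes needed" on test_260/test_46 would not have stopped the loop as long as code execution still reported structural mismatches.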
## Example: Agentic Critic Code Execution
In Round 0 for test_46 (complex CroPe architecture), the agentic critic ran:

```python
from PIL import Image, ImageEnhance, ImageOps
import io, base64
import numpy as np

image = Image.open("input_file_0.jpeg").convert("RGB")
gray_image = image.convert("L")

# Simple threshold to binarize (dark pixels -> 255)
threshold = 128
binary = gray_image.point(lambda x: 255 if x < threshold else 0)

# Count dark pixel regions as approximate component count
arr = np.array(binary)
dark_ratio = np.sum(arr == 255) / arr.size

# Column-wise projection for component counting
col_sums = np.sum(arr == 255, axis=0)
...
```
This provided evidence for the subsequent critique: "Image Features [P2, P3, P4] should be represented as three distinct feature lines/paths from the Visual Encoder to the CPTG module."
## Trade-offs
| Pro | Con |
| --- | --- |
| Programmatic structural verification | ~6–12x token increase per critic round |
| Prevents false early convergence on complex diagrams | +50–90s latency per round |
| Evidence-backed critique suggestions | Sandbox library constraints (no scikit-image; cv2 available but untested — see Future Improvement) |
| Zero changes to non-agentic path | Gemini-specific (code_execution not available on Claude/OpenAI) |
## Limitations of This Experiment
**Visual comparison is not a reliable metric here.** The final images from baseline vs agentic look significantly different even for the same sample — this is because:
- No seed control: PaperBanana currently does not support fixed random seeds. The Planner and Stylist stages can produce slightly different descriptions across runs, and Nano Banana 2 is non-deterministic, so the Visualizer generates different layouts each time.
- Critic feedback cascades: Different Critic suggestions → different revised descriptions → completely different re-rendered images. A single wording change in the Critic's revision can cause the Visualizer to choose an entirely different layout.
- Not an apples-to-apples image comparison: The images are the end result of divergent feedback loops, not controlled single-variable outputs.
**What IS reliably comparable:**
- Critic suggestion text — directly shows what each Critic detected and recommended (see Per-Sample Comparison sections above)
- False early convergence — Baseline said "No changes needed" prematurely on 2/3 complex samples; Agentic continued finding real issues
- Code execution evidence — Agentic Critic ran Python code to count nodes/regions and used results as structural evidence
- Token/timing metrics — objective cost measurements
**Agentic Critic can miss absent components:**
In test_260, the agentic critic's PIL analysis verified what was present in the image (node counts, region density) but failed to detect that the Segmentation Head and final output stage were completely missing. The baseline actually produced a more complete diagram. This shows that code_execution-based verification is better at checking existing elements than ensuring pipeline completeness — a "what's missing" check would require comparing the methodology's expected components against the image, which the current prompt doesn't explicitly instruct.
A more rigorous experiment would require: (a) deterministic seeding across the full pipeline, (b) freezing Planner/Stylist/Visualizer outputs and only varying the Critic, or (c) human evaluation of Critic suggestion quality on identical intermediate outputs.
## Future Improvement: OpenCV in Sandbox
A post-experiment discovery: Gemini's `code_execution` sandbox has `opencv-python` (cv2) pre-installed, not just PIL. Our current prompt restricts the critic to PIL-only analysis (thresholding, region counting), but cv2 enables significantly more powerful structural verification:
- Canny Edge Detection → precise arrow/connection line tracing
- `cv2.findContours()` → accurate node bounding box detection and counting
- Hough Line Transform → detect straight connection lines and their endpoints
- Template matching → verify specific icon/symbol presence

This could address the key weakness found in test_260 (missing Segmentation Head). With cv2 contour detection, the critic could enumerate all detected components and compare against an expected component list derived from the methodology section — catching "what's missing", not just "what's wrong".
A potential v2 prompt addition:

```text
You also have OpenCV (cv2) available. Use cv2.findContours() to enumerate
all distinct bounded components, then compare this count against the
expected components listed in the methodology. Flag any MISSING components.
```
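For intuition, the component enumeration that `cv2.findContours()` would perform can be sketched without cv2 as a connected-components count over a binarized grid — a dependency-free stand-in, not the actual sandbox code; the expected count would come from the methodology's component list:

```python
from collections import deque


def count_components(grid):
    """Count 4-connected regions of 1s in a binary grid via BFS flood fill.

    A stdlib stand-in for the cv2.findContours() enumeration: each region
    approximates one bounded diagram component.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                count += 1  # new, previously unvisited region
                queue = deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count
```

If the methodology enumerates N components and `count_components` (or the real contour pass) finds fewer, the critic has concrete evidence that something is missing rather than merely misdrawn.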
## Recommendation
Based on the experiments, `--agentic-critic` is most valuable as a selective, opt-in feature for:
- Complex diagrams with 10+ components and multi-branch data flows
- Architecture diagrams where connection accuracy is critical
- Quality-sensitive production runs where false early convergence is costly
For simple diagrams (< 10 components), the overhead is not justified — baseline critic is sufficient.
## Implementation Reference
Fork with full implementation: https://github.com/catallactics/PaperBanana/tree/feature/agentic-vision-critic
### Files Changed
| File | Change |
| --- | --- |
| `agents/critic_agent.py` | `code_execution` tool, `AGENTIC_VISION_PROMPT_SUPPLEMENT`, multi-part response handling, timing/logging |
| `utils/generation_utils.py` | New `call_gemini_agentic_async()` for multi-part response parsing |
| `utils/paperviz_processor.py` | Defensive fix for `critic_suggestions` being a list (pre-existing bug surfaced by complex samples) |
| `main.py` | `--agentic-critic` CLI flag |
| `utils/config.py` | `agentic_critic: bool` field in `ExpConfig` |
| `scripts/compare_results.py` | A/B comparison utility script |
Happy to convert this to a PR if there's interest. The implementation is designed to be fully backward-compatible — existing behavior is unchanged unless `--agentic-critic` is explicitly passed.