
[Feature Request] Agentic Vision (code_execution) for Critic Agent to detect structural connection errors #35

@catallactics

Description


Problem

Connection/structural errors in generated diagrams are a primary failure mode in PaperBanana's pipeline. The current Critic Agent relies on pure VLM text reasoning to evaluate generated diagrams — it cannot programmatically verify structural properties like:

  • Correct number of nodes/components matching the methodology
  • Arrow/connection integrity (dangling arrows, missing connections)
  • Source-target alignment of data flow paths
  • Component count consistency between description and rendered image

These structural issues persist through multiple critic rounds because the VLM "sees" the diagram holistically but lacks the ability to perform precise pixel-level verification.

Proposed Solution

Enable Gemini's code_execution tool on the Critic Agent, allowing it to run Python code (PIL-based image analysis) within the critique loop. This creates a hybrid critic that combines:

  1. VLM reasoning (existing) — semantic understanding, content fidelity, aesthetic evaluation
  2. Code execution (new) — programmatic structural verification via pixel analysis

Implementation

The implementation adds an --agentic-critic CLI flag that:

  • Adds tools=[Tool(code_execution=ToolCodeExecution())] to the Critic's GenerateContentConfig
  • Appends a "Structural Verification" section to the system prompt instructing the model to analyze the image with PIL
  • Handles multi-part responses (separating executable_code/code_execution_result parts from text)
  • Removes response_mime_type (incompatible with tools) and uses regex JSON extraction instead
  • Logs code execution output and timing metrics per round

~175 lines changed across 5 files (critic_agent.py, generation_utils.py, main.py, config.py, paperviz_processor.py). Zero changes to existing behavior when --agentic-critic is not set.
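Because response_mime_type is incompatible with tools, the critique JSON has to be recovered from free-form, multi-part output. A minimal sketch of those two handling steps — splitting code parts from text parts and regex-extracting the JSON — assuming hypothetical part objects with executable_code / code_execution_result / text attributes; the fork's actual entry point is call_gemini_agentic_async(), and this is not its verbatim code:

```python
import json
import re


def extract_critique_json(text: str) -> dict:
    """Pull the first JSON object out of free-form critic text.

    With code_execution enabled, response_mime_type="application/json"
    cannot be set, so the critique JSON is recovered by regex instead.
    """
    # Prefer an explicit ```json ... ``` fence if the model emitted one.
    fenced = re.search(r"```json\s*(\{.*?\})\s*```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        # Fall back to the first brace-delimited span.
        brace = re.search(r"\{.*\}", text, re.DOTALL)
        candidate = brace.group(0) if brace else "{}"
    return json.loads(candidate)


def split_response_parts(parts):
    """Separate executable_code / code_execution_result parts from text."""
    code_parts, text_chunks = [], []
    for part in parts:
        if getattr(part, "executable_code", None) or getattr(part, "code_execution_result", None):
            code_parts.append(part)
        elif getattr(part, "text", None):
            text_chunks.append(part.text)
    return code_parts, "".join(text_chunks)
```

For example, a response like `'analysis...\n```json\n{"verdict": "revise"}\n```'` yields `{"verdict": "revise"}`, while the code parts are logged separately for the timing metrics.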

Experiment Setup

Model Configuration

| Role | Model | Notes |
| --- | --- | --- |
| Text (Planner, Stylist, Critic) | gemini-2.5-flash | Same for both A and B |
| Image (Visualizer) | gemini-3.1-flash-image-preview (Nano Banana 2) | Same for both A and B |

Note: Initially tested with gemini-2.0-flash-exp but it returned 404 (deprecated). Switched to gemini-3.1-flash-image-preview (Nano Banana 2) for all experiments. Both baseline and agentic runs use the exact same models — the only variable is the Critic Agent's code_execution tool.

Controlled Conditions

  • Pipeline: Retriever → Planner → Stylist → Visualizer → Critic → Visualizer (loop)
  • Retrieval: none (no example retrieval, to isolate Critic behavior)
  • Critic rounds: 3 max
  • Variable: Critic only — --agentic-critic flag on (B) vs off (A)
  • Everything else identical: same samples, same planner/stylist/visualizer, same models

Two Test Sets

| Set | Samples | Complexity | Purpose |
| --- | --- | --- | --- |
| Simple (3 samples) | test_274, test_51, test_17 | 931–2,727 chars of content | Baseline comparison |
| Complex (3 samples) | test_260, test_46, test_156 | 11,444–26,502 chars of content | Stress test on multi-module architectures |

Experiment 1: Simple Diagrams (3 samples)

Qualitative Findings

| Aspect | A: Baseline Critic | B: Agentic Vision Critic |
| --- | --- | --- |
| Structural verification | No programmatic analysis; relies on visual inspection only | Runs PIL code to count nodes, detect dark regions, estimate arrow/connection density |
| Node counting | Not attempted | Explicitly counts bounded regions and compares against the methodology description |
| Critique specificity | Sometimes generic ("the flow is confusing") | Evidence-backed ("18 boxes/cylinders detected, 14 arrows confirmed consistent with description") |
| Convergence | Always runs max rounds | Sometimes converges early when structural analysis confirms correctness |

Per-Sample Comparison

test_51 (Knowledge Distillation — generative_learning)

  • Baseline: Found redundant Student Non-Target Logits bar, mathematical formatting issues. All 3 rounds iterated.
  • Agentic: PIL analysis confirmed component structure; focused on semantic accuracy of DTM/Momentum Buffer placement. Similar iteration count.
  • Visual difference: Agentic output has slightly clearer Teacher/Student separation with individually labeled DTM and Momentum Buffer blocks.

test_17 (Alias-Free Transformer — vision_perception)

  • Baseline: Round 1 said "No changes needed" but Round 2 found AF-LN placement errors — inconsistent self-assessment.
  • Agentic: PIL detected 18 boxes/cylinders and 14 arrows, confirmed consistent with methodology. More consistent across rounds.
  • Visual difference: Baseline output is slightly cleaner; Agentic output has more detailed internal labels (AF-PI Block, xL notation).

test_274 (Robot Planning Pipeline — agent_reasoning)

  • Baseline: Found text inconsistency ("Pick Block" vs methodology), redundant "Figure 2" label.
  • Agentic: PIL threshold analysis + same text issues + additional concrete planning examples.
  • Visual difference: Agentic includes concrete examples (pick_block, place_block_table_A) vs baseline's abstract icons.

Simple Diagrams: Verdict

For simple samples (2–10 major components), visual quality difference is marginal. The agentic critic provides evidence-backed critiques but the final rendered output is comparable.


Experiment 2: Complex Diagrams (3 samples) — Key Results

These samples have 10–30+ components, multi-branch data flows, and sub-module hierarchies — exactly the scenario where structural verification should matter.

Visual Comparison (Complex Diagrams)

Important caveat: These images look very different despite being generated from the same paper/methodology. This is expected — see Limitations below. The Visualizer (Nano Banana 2) is non-deterministic and the Critic's different feedback causes divergent re-rendering. Compare the Critic's text suggestions, not the images, for a fair assessment.

test_46 — CroPe (15+ components, most complex)

| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| baseline_test_46 | agentic_test_46 |

test_156 — T2MIR (3-panel MoE Transformer)

| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| baseline_test_156 | agentic_test_156 |

test_260 — TransGesture (Dual-path architecture)

| Baseline Critic | Agentic Vision Critic |
| --- | --- |
| baseline_test_260 | agentic_test_260 |

Per-Sample Analysis

test_260: TransGesture (Gaze-Gesture Framework — vision_perception)

Methodology: Dual-path architecture (Gaze + Gesture decoders), frozen DINOv2 encoder, joint cross-attention module → convolution mask head → segmentation mask + binary existence prediction.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | Round 1 said "No changes needed" — stopped early | All 3 rounds iterated with code-verified structural checks |
| Structural check | None | PIL detected region density, verified dual-path layout symmetry |
| Key finding | Missed concatenation discrepancy in Gesture Decoder input | Found and flagged layout duplication error + Gesture Decoder input mismatch |
| Final result | Stopped at Round 0 output | Iterated through all 3 rounds with progressive refinement |
| Pipeline completeness | ✅ Complete — includes Segmentation Head, mask output, Binary Existence Prediction | ❌ Incomplete — ends at JCA module output (f_GCA), missing Segmentation Head and final output |

Notable — Agentic Critic failure case: Despite running PIL-based structural analysis for 3 rounds, the agentic critic failed to detect that the final output stage was missing. The diagram ends at the JCA (Joint Cross Attention) module's f_GCA output, but the methodology clearly specifies a downstream "convolution mask head" producing segmentation masks and binary existence predictions. The baseline, by contrast, includes the complete pipeline.

This reveals a key limitation: the agentic critic's code_execution analyzes what IS in the image (node counts, region density) but does not verify what is MISSING against the methodology. The structural verification prompt focuses on detecting errors in existing elements rather than checking for absent components. A potential improvement would be adding a "pipeline completeness check" step that explicitly enumerates expected output stages from the methodology and verifies each one appears in the image.
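The "pipeline completeness check" suggested above reduces to a set comparison. A minimal sketch, under stated assumptions: expected_components would come from parsing the methodology and detected_labels from OCR or region analysis of the image — neither extractor exists in the current implementation, and the function name is hypothetical:

```python
def check_pipeline_completeness(expected_components, detected_labels):
    """Flag methodology components with no matching label in the image.

    Matching is a loose case-insensitive substring test, since rendered
    labels rarely reproduce methodology phrasing exactly.
    """
    detected = [label.lower() for label in detected_labels]
    missing = [
        comp for comp in expected_components
        if not any(comp.lower() in label or label in comp.lower() for label in detected)
    ]
    return missing


# For test_260, the methodology specifies these downstream stages:
expected = ["JCA", "Segmentation Head", "Binary Existence Prediction"]
# ...but the agentic run's diagram stops at the JCA output:
detected = ["DINOv2 Encoder", "Gaze Decoder", "Gesture Decoder", "JCA"]
print(check_pipeline_completeness(expected, detected))
# → ['Segmentation Head', 'Binary Existence Prediction']
```

Anything the check returns would then be surfaced to the Critic as a mandatory revision item rather than left to holistic VLM judgment.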

test_46: CroPe (Complementary-Perceptive Text Generation — vision_perception)

Methodology: Visual Encoder → CPTG module (3 sub-paths: Domain-Specific, Domain-Invariant, Gated Complementary) → RCTVF module → Segmentation Head. Highly complex with 15+ labeled components.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | Round 1 said "No changes needed" — stopped early | All 3 rounds with structural analysis |
| Key finding (Baseline) | RCTVF module naming/flow ambiguity | Same, plus stopped after 1 round |
| Key finding (Agentic) | Input image multiplicity error, CPTG internal data flow paths | Found content mismatch in Visual Encoder output to CPTG ([P2,P3,P4] multi-line representation) |
| Visual difference | Cleaner but fewer sub-module details | More explicit sub-module labels (MLP1/MLP2, Domain-Specific/Invariant paths clearly separated), Gated Complementary flow visible |

Notable: The agentic output for this complex sample shows significantly more structural detail — the Domain-Specific Perception, Domain-Invariant Regularization, and Gated Complementary Fusion paths are individually traced and labeled, whereas baseline collapses them into a single CPTG block.

test_156: T2MIR (Token-Task MoE Transformer — generative_learning)

Methodology: 3-panel diagram: (a) Overall Pipeline with parallel MoE layers in transformer blocks, (b) Token-wise MoE with top-k routing, (c) Task-wise MoE with contrastive learning.

| | A: Baseline | B: Agentic |
| --- | --- | --- |
| Critic behavior | All 3 rounds iterated | All 3 rounds with code execution |
| Key finding (Baseline) | Missing labels in panel (c), general formatting issues | Same + structural verification of panel counts |
| Key finding (Agentic) | Panel (c) contrastive learning flow details, gating mechanism notation | Found critical content error in data flow routing between Token-wise and Task-wise MoE |
| Visual difference | Standard 3-panel layout | Panels (b)/(c) have more explicit routing arrows and gating notation (g_tok, g_task), auxiliary flow paths labeled |

Complex Diagrams: Quantitative

| Metric | A: Baseline | B: Agentic | Delta |
| --- | --- | --- | --- |
| Total pipeline time (3 complex) | ~6 min | ~7.5 min | +25% |
| Avg critic round time | ~5–10 s | ~66.2 s | +7–13x |
| Token usage per round | ~3–5k | ~27–47k | ~6–12x |
| Code execution every round | No | Yes (2 parts/round) | — |
| All visualizations succeeded | Yes | Yes | Same |
| Early-stopped on a false "No changes needed" | 2/3 samples | 0/3 samples | — |

Complex Agentic Token Breakdown

| Sample | Round | Time (s) | Total Tokens | Code Parts |
| --- | --- | --- | --- | --- |
| test_260 | 0 | 43.0 | 26,958 | 2 |
| test_260 | 1 | 79.6 | 30,885 | 2 |
| test_260 | 2 | 61.0 | 37,790 | 2 |
| test_46 | 0 | 60.0 | 46,225 | 2 |
| test_46 | 1 | 79.8 | 40,027 | 2 |
| test_46 | 2 | 69.3 | 47,343 | 2 |
| test_156 | 0 | 88.4 | 43,013 | 2 |
| test_156 | 1 | 54.4 | 34,923 | 2 |
| test_156 | 2 | 60.5 | 43,159 | 2 |

Key Insight: False Early Convergence

The most significant finding from the complex diagram experiment:

Baseline critic said "No changes needed" in Round 1 for 2 out of 3 complex samples (test_260, test_46), stopping iteration prematurely. The agentic critic, armed with structural analysis, continued iterating and found real issues in subsequent rounds:

  • test_260: Layout duplication error, JCA module flow ambiguity
  • test_46: Input image multiplicity error, CPTG internal data flow path mismatches

This suggests that VLM-only critique over-confidently approves complex diagrams because it lacks the granularity to detect subtle structural issues. The code_execution tool provides a "second opinion" mechanism that prevents premature convergence.


Example: Agentic Critic Code Execution

In Round 0 for test_46 (complex CroPe architecture), the agentic critic ran:

from PIL import Image
import numpy as np

image = Image.open("input_file_0.jpeg").convert("RGB")
gray_image = image.convert("L")

# Simple threshold to binarize: dark diagram ink maps to 255
threshold = 128
binary = gray_image.point(lambda x: 255 if x < threshold else 0)

# Fraction of ink pixels as a rough density measure
arr = np.array(binary)
dark_ratio = np.sum(arr == 255) / arr.size
# Horizontal projection for column-based component counting
col_sums = np.sum(arr == 255, axis=0)
...

This provided evidence for the subsequent critique: "Image Features [P2, P3, P4] should be represented as three distinct feature lines/paths from the Visual Encoder to the CPTG module."


Trade-offs

| Pro | Con |
| --- | --- |
| Programmatic structural verification | ~6–12x token increase per critic round |
| Prevents false early convergence on complex diagrams | +50–90 s latency per round |
| Evidence-backed critique suggestions | Sandbox library constraints (no scikit-image; cv2 available but untested — see Future Improvement) |
| Zero changes to the non-agentic path | Gemini-specific (code_execution not available on Claude/OpenAI) |

Limitations of This Experiment

Visual comparison is not a reliable metric here. The final images from baseline vs agentic look significantly different even for the same sample — this is because:

  1. No seed control: PaperBanana currently does not support fixed random seeds. The Planner and Stylist stages can produce slightly different descriptions across runs, and Nano Banana 2 is non-deterministic, so the Visualizer generates different layouts each time.
  2. Critic feedback cascades: Different Critic suggestions → different revised descriptions → completely different re-rendered images. A single wording change in the Critic's revision can cause the Visualizer to choose an entirely different layout.
  3. Not an apples-to-apples image comparison: The images are the end result of divergent feedback loops, not controlled single-variable outputs.

What IS reliably comparable:

  • Critic suggestion text — directly shows what each Critic detected and recommended (see Per-Sample Comparison sections above)
  • False early convergence — Baseline said "No changes needed" prematurely on 2/3 complex samples; Agentic continued finding real issues
  • Code execution evidence — Agentic Critic ran Python code to count nodes/regions and used results as structural evidence
  • Token/timing metrics — objective cost measurements

Agentic Critic can miss absent components:
In test_260, the agentic critic's PIL analysis verified what was present in the image (node counts, region density) but failed to detect that the Segmentation Head and final output stage were completely missing. The baseline actually produced a more complete diagram. This shows that code_execution-based verification is better at checking existing elements than ensuring pipeline completeness — a "what's missing" check would require comparing the methodology's expected components against the image, which the current prompt doesn't explicitly instruct.

A more rigorous experiment would require: (a) deterministic seeding across the full pipeline, (b) freezing Planner/Stylist/Visualizer outputs and only varying the Critic, or (c) human evaluation of Critic suggestion quality on identical intermediate outputs.

Future Improvement: OpenCV in Sandbox

A post-experiment discovery: Gemini's code_execution sandbox has opencv-python (cv2) pre-installed, not just PIL. Our current prompt restricts the critic to PIL-only analysis (thresholding, region counting), but cv2 enables significantly more powerful structural verification:

  • Canny Edge Detection → precise arrow/connection line tracing
  • cv2.findContours() → accurate node bounding box detection and counting
  • Hough Line Transform → detect straight connection lines and their endpoints
  • Template matching → verify specific icon/symbol presence

This could address the key weakness found in test_260 (missing Segmentation Head). With cv2 contour detection, the critic could enumerate all detected components and compare against an expected component list derived from the methodology section — catching "what's missing", not just "what's wrong".

A potential v2 prompt addition:

You also have OpenCV (cv2) available. Use cv2.findContours() to enumerate
all distinct bounded components, then compare this count against the
expected components listed in the methodology. Flag any MISSING components.

Recommendation

Based on the experiments, --agentic-critic is most valuable as a selective, opt-in feature for:

  • Complex diagrams with 10+ components and multi-branch data flows
  • Architecture diagrams where connection accuracy is critical
  • Quality-sensitive production runs where false early convergence is costly

For simple diagrams (< 10 components), the overhead is not justified — baseline critic is sufficient.

Implementation Reference

Fork with full implementation: https://github.com/catallactics/PaperBanana/tree/feature/agentic-vision-critic

Files Changed

| File | Change |
| --- | --- |
| agents/critic_agent.py | code_execution tool, AGENTIC_VISION_PROMPT_SUPPLEMENT, multi-part response handling, timing/logging |
| utils/generation_utils.py | New call_gemini_agentic_async() for multi-part response parsing |
| utils/paperviz_processor.py | Defensive fix for critic_suggestions being a list (pre-existing bug surfaced by complex samples) |
| main.py | --agentic-critic CLI flag |
| utils/config.py | agentic_critic: bool field in ExpConfig |
| scripts/compare_results.py | A/B comparison utility script |

Happy to convert this to a PR if there's interest. The implementation is designed to be fully backward-compatible — existing behavior is unchanged unless --agentic-critic is explicitly passed.
