
T2I Perturbation Generation

Perturbation generation pipeline for the Text-to-Image (T2I) benchmark. The pipeline takes benchmark instances (text prompts), generates gold-standard images, then creates subtle adversarial visual perturbations guided by category-specific edit instructions. The perturbed images are used to test whether evaluators based on vision-language models (VLMs) can distinguish correct generated images from subtly incorrect ones.

Two-model architecture:

Image generation & editing: gemini-3-pro-image-preview
Edit instruction generation: gemini-3.1-pro-preview

Setup
Perturbation Taxonomy
Pipeline
Batch API Management
Configuration
Directory Structure

Setup

pip install google-genai Pillow requests
export GEMINI_API_KEY="your-api-key"

All commands below are run from the perturbations/ directory.

Perturbation Taxonomy

Category	Subcategory
Basic Skill	Object substitution, Element addition/omission, Attribute manipulation, Spatial manipulation, Scale distortion, Action constraint violation
Scene Context & Style	Partial scene rendering, Missing context, Style inconsistency, Environmental thematic conflict, Disorganized composition, Overcrowding
Reasoning	Logical causal contradiction, Physics manipulation, State transformation failure, Functional absurdity, Literalized idioms
Text Rendering	Typographical substitution, Incomplete rendering, Background misrendering, Mislabelled symbols/diagrams

Perturbation types and prompts are defined in prompts/. Category-to-subcategory mappings are in config.py.
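
As a rough illustration of the taxonomy above, the sketch below stores the table as a plain dictionary and derives snake_case identifiers from the display names. The `TAXONOMY` name, the `slug` helper, and the derived identifiers are assumptions made for this example; only `basic_skill`, `object_substitution`, and the prompt file names appear elsewhere in this README, and the actual structure in config.py may differ.

```python
import re

# Display names copied from the taxonomy table. The dictionary layout is a
# hypothetical stand-in for whatever config.py actually defines.
TAXONOMY = {
    "Basic Skill": [
        "Object substitution", "Element addition/omission",
        "Attribute manipulation", "Spatial manipulation",
        "Scale distortion", "Action constraint violation",
    ],
    "Scene Context & Style": [
        "Partial scene rendering", "Missing context", "Style inconsistency",
        "Environmental thematic conflict", "Disorganized composition",
        "Overcrowding",
    ],
    "Reasoning": [
        "Logical causal contradiction", "Physics manipulation",
        "State transformation failure", "Functional absurdity",
        "Literalized idioms",
    ],
    "Text Rendering": [
        "Typographical substitution", "Incomplete rendering",
        "Background misrendering", "Mislabelled symbols/diagrams",
    ],
}


def slug(name):
    """Lowercase a display name and collapse non-alphanumeric runs to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def all_pairs():
    """Return every (category, subcategory) pair as slugs."""
    return [(slug(cat), slug(sub)) for cat, subs in TAXONOMY.items() for sub in subs]


pairs = all_pairs()
print(len(pairs), pairs[0])
```

The slugs for the four categories match the file names under prompts/, which is why this derivation was chosen; the subcategory slugs other than `object_substitution` are guesses.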

Pipeline

Step 1: Generate gold images

python3 generate_gold_images.py \
    --data_dir   ../samples \
    --output_dir ../gold_images

Generates one PNG per instance in ../gold_images/.

Step 2: Generate edit instructions

Edit instruction generation produces text output only (image + prompt → edit instruction), so it supports sync, parallel, and batch modes.

Option A: Sync (simplest, sequential)

python3 generate_edit_instructions.py \
    --data_dir        ../samples \
    --gold_image_dir  ../gold_images \
    --output_dir      ../edit_instructions

Option B: Parallel async (recommended for small-to-medium runs)

# Prepare JSONL requests
python3 prepare_requests.py \
    --data_dir       ../samples \
    --gold_image_dir ../gold_images \
    --output_dir     ../batch_requests

# Run in parallel
for f in ../batch_requests/*__*.jsonl; do
    out="../batch_results/$(basename "$f" .jsonl)_results.jsonl"
    python3 parallel_call.py --input_file "$f" --output_file "$out" --n_jobs 10
done

# Parse results into edit instruction JSONs
for f in ../batch_results/*__*_results.jsonl; do
    python3 parse_results.py \
        --results_file  "$f" \
        --instances_dir ../samples \
        --output_dir    ../edit_instructions
done
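
The `--n_jobs 10` flag above suggests a cap on the number of requests in flight. One common way to get that behaviour is an `asyncio.Semaphore`; the sketch below shows the pattern with a stand-in coroutine instead of a real Gemini call. It is not the code in parallel_call.py, and `run_bounded` and `fake_call` are hypothetical names.

```python
import asyncio


async def run_bounded(requests, call, n_jobs=10):
    """Run call(request) for every request with at most n_jobs in flight.

    Results come back in input order. A failed request yields a record with
    an "error" field instead of aborting the whole run.
    """
    semaphore = asyncio.Semaphore(n_jobs)

    async def worker(request):
        async with semaphore:  # blocks while n_jobs calls are already running
            try:
                return {"request": request, "response": await call(request)}
            except Exception as exc:
                return {"request": request, "error": repr(exc)}

    return await asyncio.gather(*(worker(r) for r in requests))


async def fake_call(request):
    """Stand-in for an API call (assumption: the real call is a coroutine)."""
    await asyncio.sleep(0)
    return f"edit for {request}"


results = asyncio.run(run_bounded(["a", "b", "c"], fake_call, n_jobs=2))
print(results)
```

Capturing per-request errors keeps one failure from discarding the rest of a run, which matters when results are later parsed file by file.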

Option C: Batch API (recommended for large-scale runs)

# Prepare JSONL requests
python3 prepare_requests.py \
    --data_dir       ../samples \
    --gold_image_dir ../gold_images \
    --output_dir     ../batch_requests

# Submit all jobs
for f in ../batch_requests/*__*.jsonl; do
    python3 batch_call.py create --input_file "$f"
done

# Check all jobs
python3 batch_call.py list

# Wait for each job to finish, then download its results
python3 batch_call.py wait \
    --job_name   <job_name> \
    --output_file ../batch_results/<category>__<subcategory>_results.jsonl

# Parse results
for f in ../batch_results/*__*_results.jsonl; do
    python3 parse_results.py \
        --results_file  "$f" \
        --instances_dir ../samples \
        --output_dir    ../edit_instructions
done
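
Result files follow the `<category>__<subcategory>_results.jsonl` pattern shown above. The helper below is a minimal sketch of how a script might recover the two parts from a path. The function name is hypothetical, and it assumes that category names never contain a double underscore.

```python
from pathlib import Path

RESULTS_SUFFIX = "_results.jsonl"


def split_results_name(path):
    """Return (category, subcategory) parsed from a results file name."""
    name = Path(path).name
    if not name.endswith(RESULTS_SUFFIX):
        raise ValueError(f"not a results file: {name}")
    core = name[: -len(RESULTS_SUFFIX)]
    # Split on the first "__"; assumes the category itself has no "__".
    category, sep, subcategory = core.partition("__")
    if not sep or not category or not subcategory:
        raise ValueError(f"expected <category>__<subcategory>: {name}")
    return category, subcategory


example = split_results_name(
    "../batch_results/basic_skill__object_substitution_results.jsonl"
)
print(example)
```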

Use --category and --subcategory on any script to filter to a single perturbation type.

Step 3: Generate perturbed images

Apply edit instructions using the image generation model:

# All categories
python3 generate_perturbations.py \
    --edit_dir       ../edit_instructions \
    --gold_image_dir ../gold_images \
    --instances_dir  ../samples \
    --output_dir     ../perturbation_outputs

# Single category/subcategory
python3 generate_perturbations.py \
    --edit_dir       ../edit_instructions \
    --gold_image_dir ../gold_images \
    --instances_dir  ../samples \
    --output_dir     ../perturbation_outputs \
    --category       basic_skill \
    --subcategory    object_substitution

Step 4: Analyse

# Check gold image coverage
python3 analyse_results.py --mode gold \
    --dir ../gold_images --data_dir ../samples

# Check edit instruction coverage
python3 analyse_results.py --mode edit_instructions \
    --dir ../edit_instructions

# Check final perturbation coverage
python3 analyse_results.py --mode perturbations \
    --dir ../perturbation_outputs
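
For intuition, a coverage check of this kind boils down to a set difference between the inputs that exist and the outputs that were produced. The sketch below assumes one `<id>.json` file per instance and one `<id>.png` per gold image; that naming is an assumption for illustration, and analyse_results.py may organise files differently.

```python
import tempfile
from pathlib import Path


def missing_ids(source_dir, source_suffix, output_dir, output_suffix):
    """Return sorted file stems present in source_dir but absent in output_dir."""
    source = {p.stem for p in Path(source_dir).glob(f"*{source_suffix}")}
    produced = {p.stem for p in Path(output_dir).glob(f"*{output_suffix}")}
    return sorted(source - produced)


# Small self-contained demo using a temporary directory tree.
with tempfile.TemporaryDirectory() as root:
    samples = Path(root, "samples")
    gold = Path(root, "gold_images")
    samples.mkdir()
    gold.mkdir()
    for name in ("a.json", "b.json"):
        (samples / name).touch()
    (gold / "a.png").touch()  # instance "b" has no gold image yet
    missing = missing_ids(samples, ".json", gold, ".png")

print(missing)
```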

Batch API Management

python3 batch_call.py list                       # List all jobs
python3 batch_call.py check  --job_name <name>   # Check status
python3 batch_call.py cancel --job_name <name>   # Cancel job
python3 batch_call.py delete --job_name <name>   # Delete job
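
Waiting on a batch job is essentially a polling loop around a status check. The sketch below shows that loop with a caller-supplied `get_state` function. The state labels and all names here are assumptions for illustration, not the values or code used by batch_call.py.

```python
import time

# Hypothetical terminal state labels; the real API may name them differently.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30.0, timeout_seconds=3600.0,
                 sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {state!r} after timeout")
        sleep(poll_seconds)


# Demo with a scripted sequence of states and a no-op sleep.
states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(states), poll_seconds=0,
                           sleep=lambda seconds: None)
print(final_state)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.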

Configuration

Model names can be overridden via CLI --model flags. Defaults are in config.py:

MODEL_CONFIG = {
    "image_generation_model": "gemini-3-pro-image-preview",
    "edit_instruction_model": "gemini-3.1-pro-preview",
}
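
One plausible way to wire a `--model` flag to these defaults is to use the config value as the argparse default. The sketch below illustrates that pattern; `build_parser` is a hypothetical helper, and the scripts in this directory may define the flag differently.

```python
import argparse

# Defaults as listed in this README.
MODEL_CONFIG = {
    "image_generation_model": "gemini-3-pro-image-preview",
    "edit_instruction_model": "gemini-3.1-pro-preview",
}


def build_parser(config_key):
    """Build a parser whose --model flag defaults to MODEL_CONFIG[config_key]."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model",
        default=MODEL_CONFIG[config_key],
        help="Model name (overrides the default from the config)",
    )
    return parser


default_args = build_parser("edit_instruction_model").parse_args([])
override_args = build_parser("edit_instruction_model").parse_args(
    ["--model", "my-other-model"]  # placeholder name for the demo
)
print(default_args.model, override_args.model)
```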

Directory Structure

perturbations/
├── config.py                      # Category/subcategory definitions, model config
├── generate_gold_images.py        # Generate gold-standard images (sync)
├── generate_edit_instructions.py  # Generate edit instructions (sync)
├── generate_perturbations.py      # Apply edit instructions to generate perturbed images
├── prepare_requests.py            # Build JSONL requests for edit instruction generation
├── prepare_image_requests.py      # Build JSONL requests for image-specific tasks
├── parallel_call.py               # Async parallel execution via Gemini SDK
├── batch_call.py                  # Batch API job management
├── gemini_client.py               # Gemini API wrapper (image generation + editing)
├── parse_results.py               # Parse text model outputs
├── parse_image_results.py         # Parse image model outputs
├── analyse_results.py             # Coverage and quality analysis
└── prompts/
    ├── basic_skill.py
    ├── scene_context_style.py
    ├── reasoning.py
    └── text_rendering.py
