Perturbation generation pipeline for the Text-to-Image (T2I) benchmark. The pipeline takes benchmark instances (text prompts), generates gold-standard images, then creates subtle adversarial visual perturbations guided by category-specific edit instructions. These perturbed images are used to test whether VLM-based evaluators can distinguish correct from subtly incorrect generated images.
Two-model architecture:
- Image generation & editing: gemini-3-pro-image-preview
- Edit instruction generation: gemini-3.1-pro-preview
pip install google-genai Pillow requests
export GEMINI_API_KEY="your-api-key"

All commands below are run from the perturbations/ directory.
| Category | Subcategory |
|---|---|
| Basic Skill | Object substitution, Element addition/omission, Attribute manipulation, Spatial manipulation, Scale distortion, Action constraint violation |
| Scene Context & Style | Partial scene rendering, Missing context, Style inconsistency, Environmental thematic conflict, Disorganized composition, Overcrowding |
| Reasoning | Logical causal contradiction, Physics manipulation, State transformation failure, Functional absurdity, Literalized idioms |
| Text Rendering | Typographical substitution, Incomplete rendering, Background misrendering, Mislabelled symbols/diagrams |
Perturbation types and prompts are defined in prompts/. Category-to-subcategory mappings are in config.py.
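As a rough illustration of how the table above could map to the `<category>__<subcategory>` keys used in file names and CLI flags, here is a minimal sketch. The `CATEGORIES` excerpt, `slugify`, and `perturbation_keys` are hypothetical names chosen for this example; the actual definitions live in config.py and may differ.

```python
import re

# Hypothetical excerpt of the category table; the real mapping is in config.py.
CATEGORIES = {
    "Basic Skill": ["Object substitution", "Scale distortion"],
    "Scene Context & Style": ["Missing context", "Overcrowding"],
}


def slugify(name: str) -> str:
    """Lowercase a display name and join its alphanumeric runs with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def perturbation_keys(categories: dict) -> list:
    """Build one '<category>__<subcategory>' key per table entry."""
    return [
        f"{slugify(category)}__{slugify(subcategory)}"
        for category, subcategories in categories.items()
        for subcategory in subcategories
    ]


keys = perturbation_keys(CATEGORIES)
```

Under this assumed scheme, "Basic Skill" and "Object substitution" become basic_skill and object_substitution, matching the values passed to --category and --subcategory later in this document.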
python3 generate_gold_images.py \
--data_dir ../samples \
--output_dir ../gold_images

Generates one PNG per instance in ../gold_images/.
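To spot instances that did not get a gold image, a quick check along the following lines may help. It is only a sketch: it assumes each instance is stored as `<id>.json` and its gold image as `<id>.png`, which this document does not confirm, and `missing_gold_images` is a hypothetical helper rather than part of the pipeline.

```python
from pathlib import Path


def missing_gold_images(data_dir: str, gold_dir: str) -> list:
    """Return instance IDs that have no matching PNG in gold_dir."""
    # Assumption: instances are <id>.json and gold images are <id>.png.
    instance_ids = {path.stem for path in Path(data_dir).glob("*.json")}
    image_ids = {path.stem for path in Path(gold_dir).glob("*.png")}
    return sorted(instance_ids - image_ids)
```

For example, `missing_gold_images("../samples", "../gold_images")` would list the IDs to regenerate, provided the naming assumption holds.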
Edit instruction generation produces text output only (image + prompt → edit instruction), so it supports sync, parallel, and batch modes.
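The parallel mode can be pictured as bounded concurrency over a list of JSONL requests. The sketch below is not the code in parallel_call.py: `fake_call` is a stand-in for the real model call, the `key` field is an assumed request layout, and `n_jobs` mirrors the --n_jobs flag only by analogy.

```python
import asyncio
import json


async def fake_call(request: dict) -> dict:
    """Stand-in for a model call; the real pipeline would call the Gemini SDK."""
    await asyncio.sleep(0)
    return {"key": request["key"], "response": f"edit for {request['key']}"}


async def run_parallel(lines: list, n_jobs: int = 10) -> list:
    """Process JSONL lines with at most n_jobs requests in flight."""
    semaphore = asyncio.Semaphore(n_jobs)

    async def worker(line: str) -> dict:
        async with semaphore:
            return await fake_call(json.loads(line))

    # gather returns results in the same order as the input lines.
    return await asyncio.gather(*(worker(line) for line in lines))


# Hypothetical request layout: one JSON object per line with a "key" field.
lines = [json.dumps({"key": f"instance_{i}"}) for i in range(5)]
results = asyncio.run(run_parallel(lines, n_jobs=2))
```

The semaphore caps how many requests run at once, which is the usual way to stay under a rate limit while still overlapping network waits.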
Option A: Sync (simplest, sequential)
python3 generate_edit_instructions.py \
--data_dir ../samples \
--gold_image_dir ../gold_images \
--output_dir ../edit_instructions

Option B: Parallel async (recommended for small-to-medium runs)
# Prepare JSONL requests
python3 prepare_requests.py \
--data_dir ../samples \
--gold_image_dir ../gold_images \
--output_dir ../batch_requests
# Run in parallel
for f in ../batch_requests/*__*.jsonl; do
out="../batch_results/$(basename "$f" .jsonl)_results.jsonl"
python3 parallel_call.py --input_file "$f" --output_file "$out" --n_jobs 10
done
# Parse results into edit instruction JSONs
for f in ../batch_results/*__*_results.jsonl; do
python3 parse_results.py \
--results_file "$f" \
--instances_dir ../samples \
--output_dir ../edit_instructions
done

Option C: Batch API (recommended for large-scale runs)
# Prepare JSONL requests
python3 prepare_requests.py \
--data_dir ../samples \
--gold_image_dir ../gold_images \
--output_dir ../batch_requests
# Submit all jobs
for f in ../batch_requests/*__*.jsonl; do
python3 batch_call.py create --input_file "$f"
done
# Check all jobs
python3 batch_call.py list
# Wait and download each job
python3 batch_call.py wait \
--job_name <job_name> \
--output_file ../batch_results/<category>__<subcategory>_results.jsonl
# Parse results
for f in ../batch_results/*__*_results.jsonl; do
python3 parse_results.py \
--results_file "$f" \
--instances_dir ../samples \
--output_dir ../edit_instructions
done

Use --category and --subcategory on any script to filter to a single perturbation type.
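The filtering behaviour can be sketched as follows. This is an illustration under assumptions: the flag names come from this document, but `build_parser`, `select`, and the `<category>__<subcategory>` key handling are hypothetical and not taken from the scripts themselves.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the two optional filter flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--category", default=None)
    parser.add_argument("--subcategory", default=None)
    return parser


def select(keys, category=None, subcategory=None):
    """Keep '<category>__<subcategory>' keys that match the given filters."""
    selected = []
    for key in keys:
        cat, sub = key.split("__", 1)
        if category and cat != category:
            continue
        if subcategory and sub != subcategory:
            continue
        selected.append(key)
    return selected


all_keys = [
    "basic_skill__object_substitution",
    "basic_skill__scale_distortion",
    "reasoning__physics_manipulation",
]
args = build_parser().parse_args(
    ["--category", "basic_skill", "--subcategory", "object_substitution"]
)
chosen = select(all_keys, args.category, args.subcategory)
```

Leaving both flags unset keeps every key, which corresponds to running a script over all perturbation types.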
Apply edit instructions using the image generation model:
# All categories
python3 generate_perturbations.py \
--edit_dir ../edit_instructions \
--gold_image_dir ../gold_images \
--instances_dir ../samples \
--output_dir ../perturbation_outputs
# Single category/subcategory
python3 generate_perturbations.py \
--edit_dir ../edit_instructions \
--gold_image_dir ../gold_images \
--instances_dir ../samples \
--output_dir ../perturbation_outputs \
--category basic_skill \
--subcategory object_substitution

# Check gold image coverage
python3 analyse_results.py --mode gold \
--dir ../gold_images --data_dir ../samples
# Check edit instruction coverage
python3 analyse_results.py --mode edit_instructions \
--dir ../edit_instructions
# Check final perturbation coverage
python3 analyse_results.py --mode perturbations \
--dir ../perturbation_outputs

python3 batch_call.py list # List all jobs
python3 batch_call.py check --job_name <name> # Check status
python3 batch_call.py cancel --job_name <name> # Cancel job
python3 batch_call.py delete --job_name <name> # Delete job

Model names can be overridden via CLI --model flags. Defaults are in config.py:
MODEL_CONFIG = {
    "image_generation_model": "gemini-3-pro-image-preview",
    "edit_instruction_model": "gemini-3.1-pro-preview",
}

perturbations/
├── config.py # Category/subcategory definitions, model config
├── generate_gold_images.py # Generate gold-standard images (sync)
├── generate_edit_instructions.py # Generate edit instructions (sync)
├── generate_perturbations.py # Apply edit instructions to generate perturbed images
├── prepare_requests.py # Build JSONL requests for edit instruction generation
├── prepare_image_requests.py # Build JSONL requests for image-specific tasks
├── parallel_call.py # Async parallel execution via Gemini SDK
├── batch_call.py # Batch API job management
├── gemini_client.py # Gemini API wrapper (image generation + editing)
├── parse_results.py # Parse text model outputs
├── parse_image_results.py # Parse image model outputs
├── analyse_results.py # Coverage and quality analysis
└── prompts/
├── basic_skill.py
├── scene_context_style.py
├── reasoning.py
└── text_rendering.py