This module evaluates text-to-image (T2I) model outputs using LLM-as-a-judge. It generates model-agnostic evaluator requests in OpenAI chat format, then executes them via batch APIs or parallel calls.
The architecture mirrors the I2T evaluators — the same evaluator types, execution engines, and orchestration scripts are used, adapted for image quality assessment.
- Quick Start
- Input Format
- Evaluator Types
- Pipeline Overview
- Individual Script Usage
- Environment Variables
- Directory Structure
# 1. Generate model-agnostic evaluator requests for all evaluators
bash generate_batches.sh \
--input_dir /data/my_datasets \
--output_dir main_batches \
--image_root /data/images \
--evaluators all
# 2a. Schedule batch jobs for a specific evaluator + model
bash schedule_batches.sh \
--input_folder main_batches/single_vanilla \
--provider gpt \
--model gpt-4o \
--output_dir expt_batches \
--chunk_size 5000 \
--poll_interval 60
# 2b. OR schedule parallel calls (real-time, via provider SDKs)
bash schedule_parallel_calls.sh \
--input_folder main_batches/single_vanilla \
--model gpt-4o \
--output_dir expt_batches \
--chunk_size 5000 \
--n_jobs 10
# Results will be in: expt_batches/output_batches/single_vanilla/<model>/Supports .tsv, .csv, .json, .jsonl. The evaluator scripts auto-detect these fields:
| Field | Looked up as |
|---|---|
| Instance ID | p_id, id |
| Text prompt | prompt, question |
| Gold image | gold_image, image, img_url, image_url |
| Perturbed image | perturbed_image |
Images can be: HTTP URLs, local file paths (relative to --image_root), data URLs, or raw base64 strings.
Score an individual generated image against a text prompt.
| Evaluator | Script | Response Model | Output |
|---|---|---|---|
| Vanilla CoT | single_vanilla.py |
SingleVanillaCOTScore |
justification + score (1-10) |
| Rubrics | single_rubrics.py |
SingleRubricsScore |
justification + score (0-2) |
| Multi-Axes | single_axes.py |
SingleAxesScore |
justification + score per metric |
| Axes + Rubrics | single_axes_rubrics.py |
SingleAxesRubricsScore |
justification + score per metric |
Each produces two requests per input row: one for the gold image (-orig), one for the perturbed image (-pert).
Compare two images (A vs B) and pick a winner.
| Evaluator | Script | Response Model | Output |
|---|---|---|---|
| Vanilla CoT | compare_vanilla.py |
CompareVanillaCOTScore |
justification + verdict (A/B) |
| Rules | compare_rules.py |
CompareRulesScore |
justification + verdict (A/B) |
| Multi-Axes | compare_axes.py |
CompareAxesScore |
justification + verdict per metric |
| Axes + Rules | compare_axes_rules.py |
CompareAxesRulesScore |
justification + verdict per metric |
Use --p_mode to swap A/B order (generates _perturb.jsonl variant).
Scores a generated image against a reference image.
| Evaluator | Script | Response Model | Output |
|---|---|---|---|
| Reference | reference_based.py |
ReferenceScore |
justification + score |
- prompt_align — Prompt alignment: how faithfully the image follows the text prompt
- visual_qual — Visual quality: realism and coherence of the generated image
- comp_acc — Compositional accuracy: correct rendering of objects, attributes, and spatial relations
- text_render — Text rendering: accuracy of any text depicted in the image
bash generate_batches.sh \
--input_dir <dir> # Directory with input data files
--output_dir <dir> # Output base (default: main_batches)
--evaluators <list|all> # Comma-separated or "all"
--image_root <dir> # Optional image path rootThis calls each evaluator script on every input file and stores model-agnostic JSONL requests in:
main_batches/
single_vanilla/dataset1.jsonl
compare_vanilla/dataset1.jsonl
compare_vanilla/dataset1_perturb.jsonl
...
bash schedule_batches.sh \
--input_folder main_batches/single_vanilla # One evaluator folder
--provider gpt # gemini|vertex_gemini|gpt|claude
--model gpt-4o # Model name
--output_dir expt_batches # Output base
--chunk_size 5000 # Entries per split
--poll_interval 60 # Seconds between polls
--display_name my-eval # Job display name (optional)
--debug # Sample 30 rows (optional)Output structure:
expt_batches/
input_batches/single_vanilla/gpt-4o/
dataset1_gpt-4o_001.jsonl
dataset1_gpt-4o_001.output.jsonl
dataset1.tracker.gpt-4o.20260324_103000.json
output_batches/single_vanilla/gpt-4o/
dataset1_gpt-4o.jsonl # merged final output
An alternative to batch jobs — processes requests in real time using parallel threaded calls via direct provider SDKs.
bash schedule_parallel_calls.sh \
--input_folder main_batches/single_vanilla
--model gpt-4o \
--output_dir expt_batches \
--chunk_size 5000 \
--n_jobs 10When to use batch vs parallel:
schedule_batches.sh |
schedule_parallel_calls.sh |
|
|---|---|---|
| Mechanism | Provider batch APIs (async jobs) | Real-time threaded calls |
| Cost | Often cheaper (batch pricing) | Standard API pricing |
| Latency | Higher (queued processing) | Lower (immediate) |
python single_vanilla.py \
--file_name input.tsv \
--out_file_name requests.jsonl \
--image_root /data/imagesAdditional flags:
- Axes evaluators:
--all(all metrics) or--axes prompt_align visual_qual(specific metrics) - Compare evaluators:
--p_mode(swap A/B order)
# Create a batch job
python batch_call.py create \
--input_file requests.jsonl \
--provider gpt \
--model gpt-4o
# Poll until complete, then download
python batch_call.py wait \
--provider gpt \
--job_name <job_id> \
--output_file results.jsonl \
--poll_interval 30
# Split large file, submit chunks, create tracker
python batch_call.py split_submit \
--input_file requests.jsonl \
--provider gpt \
--model gpt-4o \
--chunk_size 5000 \
--output_dir expt_batches/input_batches/single_vanilla/gpt-4o
# Poll tracker, download completed chunks, merge when done
python batch_call.py poll_tracker \
--tracker_file path/to/dataset1.tracker.gpt-4o.20260324_103000.json \
--merge_output_dir expt_batches/output_batches/single_vanilla/gpt-4oOther operations: check, list, cancel.
# Process a single file
python parallel_call.py run \
--input_file requests.jsonl \
--output_file results.jsonl \
--n_jobs 10 \
--model gpt-4o
# Split, process per-chunk, merge
python parallel_call.py split_run \
--input_file requests.jsonl \
--model gpt-4o \
--chunk_size 5000 \
--n_jobs 10 \
--output_dir expt_batches/input_batches/single_vanilla/gpt-4o \
--merge_output_dir expt_batches/output_batches/single_vanilla/gpt-4o| Variable | Used by |
|---|---|
GEMINI_API_KEY |
Gemini direct API |
OPENAI_API_KEY |
OpenAI / GPT batches |
ANTHROPIC_API_KEY |
Claude batches |
GOOGLE_CLOUD_PROJECT |
Vertex AI Gemini |
GOOGLE_CLOUD_LOCATION |
Vertex AI (default: global) |
GOOGLE_APPLICATION_CREDENTIALS |
GCP service account |
GEMINI_BUCKET_NAME |
GCS bucket for Vertex AI |
evaluators/
├── Orchestration Scripts
│ ├── generate_batches.sh # Step 1: generate evaluator JSONL files
│ ├── schedule_batches.sh # Step 2a: split, submit, poll, merge (batch APIs)
│ ├── schedule_parallel_calls.sh # Step 2b: split, run parallel calls, merge
│ └── run_analysis.sh # Post-processing and result analysis
│
├── Evaluator Scripts (request generators)
│ ├── single_vanilla.py # Single image, vanilla CoT
│ ├── single_rubrics.py # Single image, rubric-based
│ ├── single_axes.py # Single image, multi-metric
│ ├── single_axes_rubrics.py # Single image, axes + rubrics
│ ├── compare_vanilla.py # Compare two images, vanilla CoT
│ ├── compare_rules.py # Compare two images, rules-based
│ ├── compare_axes.py # Compare two images, multi-metric
│ ├── compare_axes_rules.py # Compare two images, axes + rules
│ └── reference_based.py # Reference-based scoring
│
├── Execution Engines
│ ├── batch_call.py # Batch API handler (Gemini/Vertex/GPT/Claude)
│ └── parallel_call.py # Parallel provider SDK executor
│
├── Shared Utilities
│ ├── common.py # Request building, image handling
│ └── parsers.py # Pydantic response models
│
└── Prompt Templates
└── prompts/
├── single_vanilla.py
├── compare_vanilla.py
├── single_axes.py
├── compare_axes.py
├── single_rubrics.py
├── single_axes_rubrics.py
├── compare_rules.py
├── compare_axes_rules.py
└── reference_based.py