T2I LLM Evaluators

This module evaluates text-to-image (T2I) model outputs using LLM-as-a-judge. It generates model-agnostic evaluator requests in OpenAI chat format, then executes them via batch APIs or parallel calls.

The architecture mirrors the I2T evaluators — the same evaluator types, execution engines, and orchestration scripts are used, adapted for image quality assessment.

Quick Start
Input Format
Evaluator Types
Pipeline Overview
Individual Script Usage
Environment Variables
Directory Structure

Quick Start

# 1. Generate model-agnostic evaluator requests for all evaluators
bash generate_batches.sh \
  --input_dir   /data/my_datasets \
  --output_dir  main_batches \
  --image_root  /data/images \
  --evaluators  all

# 2a. Schedule batch jobs for a specific evaluator + model
bash schedule_batches.sh \
  --input_folder  main_batches/single_vanilla \
  --provider      gpt \
  --model         gpt-4o \
  --output_dir    expt_batches \
  --chunk_size    5000 \
  --poll_interval 60

# 2b. OR schedule parallel calls (real-time, via provider SDKs)
bash schedule_parallel_calls.sh \
  --input_folder  main_batches/single_vanilla \
  --model         gpt-4o \
  --output_dir    expt_batches \
  --chunk_size    5000 \
  --n_jobs        10

# Results will be in: expt_batches/output_batches/single_vanilla/<model>/

Input Format

Supports .tsv, .csv, .json, .jsonl. The evaluator scripts auto-detect these fields:

Field	Looked up as
Instance ID	`p_id`, `id`
Text prompt	`prompt`, `question`
Gold image	`gold_image`, `image`, `img_url`, `image_url`
Perturbed image	`perturbed_image`

Images can be: HTTP URLs, local file paths (relative to --image_root), data URLs, or raw base64 strings.

Evaluator Types

Single-Answer Evaluators

Score an individual generated image against a text prompt.

Evaluator	Script	Response Model	Output
Vanilla CoT	`single_vanilla.py`	`SingleVanillaCOTScore`	justification + score (1-10)
Rubrics	`single_rubrics.py`	`SingleRubricsScore`	justification + score (0-2)
Multi-Axes	`single_axes.py`	`SingleAxesScore`	justification + score per metric
Axes + Rubrics	`single_axes_rubrics.py`	`SingleAxesRubricsScore`	justification + score per metric

Each produces two requests per input row: one for the gold image (-orig), one for the perturbed image (-pert).

Comparison Evaluators

Compare two images (A vs B) and pick a winner.

Evaluator	Script	Response Model	Output
Vanilla CoT	`compare_vanilla.py`	`CompareVanillaCOTScore`	justification + verdict (A/B)
Rules	`compare_rules.py`	`CompareRulesScore`	justification + verdict (A/B)
Multi-Axes	`compare_axes.py`	`CompareAxesScore`	justification + verdict per metric
Axes + Rules	`compare_axes_rules.py`	`CompareAxesRulesScore`	justification + verdict per metric

Use --p_mode to swap A/B order (generates _perturb.jsonl variant).

Reference-Based Evaluator

Scores a generated image against a reference image.

Evaluator	Script	Response Model	Output
Reference	`reference_based.py`	`ReferenceScore`	justification + score

Metrics (Axes Evaluators)

prompt_align — Prompt alignment: how faithfully the image follows the text prompt
visual_qual — Visual quality: realism and coherence of the generated image
comp_acc — Compositional accuracy: correct rendering of objects, attributes, and spatial relations
text_render — Text rendering: accuracy of any text depicted in the image

Pipeline Overview

Step 1: Generate Evaluator Batches

bash generate_batches.sh \
  --input_dir   <dir>           # Directory with input data files
  --output_dir  <dir>           # Output base (default: main_batches)
  --evaluators  <list|all>      # Comma-separated or "all"
  --image_root  <dir>           # Optional image path root

This calls each evaluator script on every input file and stores model-agnostic JSONL requests in:

main_batches/
  single_vanilla/dataset1.jsonl
  compare_vanilla/dataset1.jsonl
  compare_vanilla/dataset1_perturb.jsonl
  ...

Step 2a: Schedule Batch Jobs

bash schedule_batches.sh \
  --input_folder  main_batches/single_vanilla  # One evaluator folder
  --provider      gpt                          # gemini|vertex_gemini|gpt|claude
  --model         gpt-4o                       # Model name
  --output_dir    expt_batches                 # Output base
  --chunk_size    5000                         # Entries per split
  --poll_interval 60                           # Seconds between polls
  --display_name  my-eval                      # Job display name (optional)
  --debug                                      # Sample 30 rows (optional)

Output structure:

expt_batches/
  input_batches/single_vanilla/gpt-4o/
    dataset1_gpt-4o_001.jsonl
    dataset1_gpt-4o_001.output.jsonl
    dataset1.tracker.gpt-4o.20260324_103000.json
  output_batches/single_vanilla/gpt-4o/
    dataset1_gpt-4o.jsonl                # merged final output

Step 2b: Schedule Parallel Calls

An alternative to batch jobs — processes requests in real time using parallel threaded calls via direct provider SDKs.

bash schedule_parallel_calls.sh \
  --input_folder  main_batches/single_vanilla
  --model         gpt-4o \
  --output_dir    expt_batches \
  --chunk_size    5000 \
  --n_jobs        10

When to use batch vs parallel:

	`schedule_batches.sh`	`schedule_parallel_calls.sh`
Mechanism	Provider batch APIs (async jobs)	Real-time threaded calls
Cost	Often cheaper (batch pricing)	Standard API pricing
Latency	Higher (queued processing)	Lower (immediate)

Individual Script Usage

Evaluator Scripts

python single_vanilla.py \
  --file_name     input.tsv \
  --out_file_name requests.jsonl \
  --image_root    /data/images

Additional flags:

Axes evaluators: --all (all metrics) or --axes prompt_align visual_qual (specific metrics)
Compare evaluators: --p_mode (swap A/B order)

batch_call.py

# Create a batch job
python batch_call.py create \
  --input_file requests.jsonl \
  --provider gpt \
  --model gpt-4o

# Poll until complete, then download
python batch_call.py wait \
  --provider gpt \
  --job_name <job_id> \
  --output_file results.jsonl \
  --poll_interval 30

# Split large file, submit chunks, create tracker
python batch_call.py split_submit \
  --input_file requests.jsonl \
  --provider gpt \
  --model gpt-4o \
  --chunk_size 5000 \
  --output_dir expt_batches/input_batches/single_vanilla/gpt-4o

# Poll tracker, download completed chunks, merge when done
python batch_call.py poll_tracker \
  --tracker_file path/to/dataset1.tracker.gpt-4o.20260324_103000.json \
  --merge_output_dir expt_batches/output_batches/single_vanilla/gpt-4o

Other operations: check, list, cancel.

parallel_call.py

# Process a single file
python parallel_call.py run \
  --input_file  requests.jsonl \
  --output_file results.jsonl \
  --n_jobs      10 \
  --model       gpt-4o

# Split, process per-chunk, merge
python parallel_call.py split_run \
  --input_file       requests.jsonl \
  --model            gpt-4o \
  --chunk_size       5000 \
  --n_jobs           10 \
  --output_dir       expt_batches/input_batches/single_vanilla/gpt-4o \
  --merge_output_dir expt_batches/output_batches/single_vanilla/gpt-4o

Environment Variables

Variable	Used by
`GEMINI_API_KEY`	Gemini direct API
`OPENAI_API_KEY`	OpenAI / GPT batches
`ANTHROPIC_API_KEY`	Claude batches
`GOOGLE_CLOUD_PROJECT`	Vertex AI Gemini
`GOOGLE_CLOUD_LOCATION`	Vertex AI (default: `global`)
`GOOGLE_APPLICATION_CREDENTIALS`	GCP service account
`GEMINI_BUCKET_NAME`	GCS bucket for Vertex AI

Directory Structure

evaluators/
├── Orchestration Scripts
│   ├── generate_batches.sh           # Step 1: generate evaluator JSONL files
│   ├── schedule_batches.sh           # Step 2a: split, submit, poll, merge (batch APIs)
│   ├── schedule_parallel_calls.sh    # Step 2b: split, run parallel calls, merge
│   └── run_analysis.sh               # Post-processing and result analysis
│
├── Evaluator Scripts (request generators)
│   ├── single_vanilla.py             # Single image, vanilla CoT
│   ├── single_rubrics.py             # Single image, rubric-based
│   ├── single_axes.py                # Single image, multi-metric
│   ├── single_axes_rubrics.py        # Single image, axes + rubrics
│   ├── compare_vanilla.py            # Compare two images, vanilla CoT
│   ├── compare_rules.py              # Compare two images, rules-based
│   ├── compare_axes.py               # Compare two images, multi-metric
│   ├── compare_axes_rules.py         # Compare two images, axes + rules
│   └── reference_based.py            # Reference-based scoring
│
├── Execution Engines
│   ├── batch_call.py                 # Batch API handler (Gemini/Vertex/GPT/Claude)
│   └── parallel_call.py              # Parallel provider SDK executor
│
├── Shared Utilities
│   ├── common.py                     # Request building, image handling
│   └── parsers.py                    # Pydantic response models
│
└── Prompt Templates
    └── prompts/
        ├── single_vanilla.py
        ├── compare_vanilla.py
        ├── single_axes.py
        ├── compare_axes.py
        ├── single_rubrics.py
        ├── single_axes_rubrics.py
        ├── compare_rules.py
        ├── compare_axes_rules.py
        └── reference_based.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

T2I LLM Evaluators

Table of Contents

Quick Start

Input Format

Evaluator Types

Single-Answer Evaluators

Comparison Evaluators

Reference-Based Evaluator

Metrics (Axes Evaluators)

Pipeline Overview

Step 1: Generate Evaluator Batches

Step 2a: Schedule Batch Jobs

Step 2b: Schedule Parallel Calls

Individual Script Usage

Evaluator Scripts

batch_call.py

parallel_call.py

Environment Variables

Directory Structure

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

T2I LLM Evaluators

Table of Contents

Quick Start

Input Format

Evaluator Types

Single-Answer Evaluators

Comparison Evaluators

Reference-Based Evaluator

Metrics (Axes Evaluators)

Pipeline Overview

Step 1: Generate Evaluator Batches

Step 2a: Schedule Batch Jobs

Step 2b: Schedule Parallel Calls

Individual Script Usage

Evaluator Scripts

batch_call.py

parallel_call.py

Environment Variables

Directory Structure