Skip to content

Latest commit

 

History

History
302 lines (243 loc) · 10.3 KB

File metadata and controls

302 lines (243 loc) · 10.3 KB

T2I LLM Evaluators

This module evaluates text-to-image (T2I) model outputs using LLM-as-a-judge. It generates model-agnostic evaluator requests in OpenAI chat format, then executes them via batch APIs or parallel calls.

The architecture mirrors the I2T evaluators — the same evaluator types, execution engines, and orchestration scripts are used, adapted for image quality assessment.

Table of Contents


Quick Start

# 1. Generate model-agnostic evaluator requests for all evaluators
bash generate_batches.sh \
  --input_dir   /data/my_datasets \
  --output_dir  main_batches \
  --image_root  /data/images \
  --evaluators  all

# 2a. Schedule batch jobs for a specific evaluator + model
bash schedule_batches.sh \
  --input_folder  main_batches/single_vanilla \
  --provider      gpt \
  --model         gpt-4o \
  --output_dir    expt_batches \
  --chunk_size    5000 \
  --poll_interval 60

# 2b. OR schedule parallel calls (real-time, via provider SDKs)
bash schedule_parallel_calls.sh \
  --input_folder  main_batches/single_vanilla \
  --model         gpt-4o \
  --output_dir    expt_batches \
  --chunk_size    5000 \
  --n_jobs        10

# Results will be in: expt_batches/output_batches/single_vanilla/<model>/

Input Format

Supports .tsv, .csv, .json, .jsonl. The evaluator scripts auto-detect these fields:

Field Looked up as
Instance ID p_id, id
Text prompt prompt, question
Gold image gold_image, image, img_url, image_url
Perturbed image perturbed_image

Images can be: HTTP URLs, local file paths (relative to --image_root), data URLs, or raw base64 strings.


Evaluator Types

Single-Answer Evaluators

Score an individual generated image against a text prompt.

Evaluator Script Response Model Output
Vanilla CoT single_vanilla.py SingleVanillaCOTScore justification + score (1-10)
Rubrics single_rubrics.py SingleRubricsScore justification + score (0-2)
Multi-Axes single_axes.py SingleAxesScore justification + score per metric
Axes + Rubrics single_axes_rubrics.py SingleAxesRubricsScore justification + score per metric

Each produces two requests per input row: one for the gold image (-orig), one for the perturbed image (-pert).

Comparison Evaluators

Compare two images (A vs B) and pick a winner.

Evaluator Script Response Model Output
Vanilla CoT compare_vanilla.py CompareVanillaCOTScore justification + verdict (A/B)
Rules compare_rules.py CompareRulesScore justification + verdict (A/B)
Multi-Axes compare_axes.py CompareAxesScore justification + verdict per metric
Axes + Rules compare_axes_rules.py CompareAxesRulesScore justification + verdict per metric

Use --p_mode to swap A/B order (generates _perturb.jsonl variant).

Reference-Based Evaluator

Scores a generated image against a reference image.

Evaluator Script Response Model Output
Reference reference_based.py ReferenceScore justification + score

Metrics (Axes Evaluators)

  • prompt_align — Prompt alignment: how faithfully the image follows the text prompt
  • visual_qual — Visual quality: realism and coherence of the generated image
  • comp_acc — Compositional accuracy: correct rendering of objects, attributes, and spatial relations
  • text_render — Text rendering: accuracy of any text depicted in the image

Pipeline Overview

Step 1: Generate Evaluator Batches

bash generate_batches.sh \
  --input_dir   <dir>           # Directory with input data files
  --output_dir  <dir>           # Output base (default: main_batches)
  --evaluators  <list|all>      # Comma-separated or "all"
  --image_root  <dir>           # Optional image path root

This calls each evaluator script on every input file and stores model-agnostic JSONL requests in:

main_batches/
  single_vanilla/dataset1.jsonl
  compare_vanilla/dataset1.jsonl
  compare_vanilla/dataset1_perturb.jsonl
  ...

Step 2a: Schedule Batch Jobs

bash schedule_batches.sh \
  --input_folder  main_batches/single_vanilla  # One evaluator folder
  --provider      gpt                          # gemini|vertex_gemini|gpt|claude
  --model         gpt-4o                       # Model name
  --output_dir    expt_batches                 # Output base
  --chunk_size    5000                         # Entries per split
  --poll_interval 60                           # Seconds between polls
  --display_name  my-eval                      # Job display name (optional)
  --debug                                      # Sample 30 rows (optional)

Output structure:

expt_batches/
  input_batches/single_vanilla/gpt-4o/
    dataset1_gpt-4o_001.jsonl
    dataset1_gpt-4o_001.output.jsonl
    dataset1.tracker.gpt-4o.20260324_103000.json
  output_batches/single_vanilla/gpt-4o/
    dataset1_gpt-4o.jsonl                # merged final output

Step 2b: Schedule Parallel Calls

An alternative to batch jobs — processes requests in real time using parallel threaded calls via direct provider SDKs.

bash schedule_parallel_calls.sh \
  --input_folder  main_batches/single_vanilla
  --model         gpt-4o \
  --output_dir    expt_batches \
  --chunk_size    5000 \
  --n_jobs        10

When to use batch vs parallel:

schedule_batches.sh schedule_parallel_calls.sh
Mechanism Provider batch APIs (async jobs) Real-time threaded calls
Cost Often cheaper (batch pricing) Standard API pricing
Latency Higher (queued processing) Lower (immediate)

Individual Script Usage

Evaluator Scripts

python single_vanilla.py \
  --file_name     input.tsv \
  --out_file_name requests.jsonl \
  --image_root    /data/images

Additional flags:

  • Axes evaluators: --all (all metrics) or --axes prompt_align visual_qual (specific metrics)
  • Compare evaluators: --p_mode (swap A/B order)

batch_call.py

# Create a batch job
python batch_call.py create \
  --input_file requests.jsonl \
  --provider gpt \
  --model gpt-4o

# Poll until complete, then download
python batch_call.py wait \
  --provider gpt \
  --job_name <job_id> \
  --output_file results.jsonl \
  --poll_interval 30

# Split large file, submit chunks, create tracker
python batch_call.py split_submit \
  --input_file requests.jsonl \
  --provider gpt \
  --model gpt-4o \
  --chunk_size 5000 \
  --output_dir expt_batches/input_batches/single_vanilla/gpt-4o

# Poll tracker, download completed chunks, merge when done
python batch_call.py poll_tracker \
  --tracker_file path/to/dataset1.tracker.gpt-4o.20260324_103000.json \
  --merge_output_dir expt_batches/output_batches/single_vanilla/gpt-4o

Other operations: check, list, cancel.

parallel_call.py

# Process a single file
python parallel_call.py run \
  --input_file  requests.jsonl \
  --output_file results.jsonl \
  --n_jobs      10 \
  --model       gpt-4o

# Split, process per-chunk, merge
python parallel_call.py split_run \
  --input_file       requests.jsonl \
  --model            gpt-4o \
  --chunk_size       5000 \
  --n_jobs           10 \
  --output_dir       expt_batches/input_batches/single_vanilla/gpt-4o \
  --merge_output_dir expt_batches/output_batches/single_vanilla/gpt-4o

Environment Variables

Variable Used by
GEMINI_API_KEY Gemini direct API
OPENAI_API_KEY OpenAI / GPT batches
ANTHROPIC_API_KEY Claude batches
GOOGLE_CLOUD_PROJECT Vertex AI Gemini
GOOGLE_CLOUD_LOCATION Vertex AI (default: global)
GOOGLE_APPLICATION_CREDENTIALS GCP service account
GEMINI_BUCKET_NAME GCS bucket for Vertex AI

Directory Structure

evaluators/
├── Orchestration Scripts
│   ├── generate_batches.sh           # Step 1: generate evaluator JSONL files
│   ├── schedule_batches.sh           # Step 2a: split, submit, poll, merge (batch APIs)
│   ├── schedule_parallel_calls.sh    # Step 2b: split, run parallel calls, merge
│   └── run_analysis.sh               # Post-processing and result analysis
│
├── Evaluator Scripts (request generators)
│   ├── single_vanilla.py             # Single image, vanilla CoT
│   ├── single_rubrics.py             # Single image, rubric-based
│   ├── single_axes.py                # Single image, multi-metric
│   ├── single_axes_rubrics.py        # Single image, axes + rubrics
│   ├── compare_vanilla.py            # Compare two images, vanilla CoT
│   ├── compare_rules.py              # Compare two images, rules-based
│   ├── compare_axes.py               # Compare two images, multi-metric
│   ├── compare_axes_rules.py         # Compare two images, axes + rules
│   └── reference_based.py            # Reference-based scoring
│
├── Execution Engines
│   ├── batch_call.py                 # Batch API handler (Gemini/Vertex/GPT/Claude)
│   └── parallel_call.py              # Parallel provider SDK executor
│
├── Shared Utilities
│   ├── common.py                     # Request building, image handling
│   └── parsers.py                    # Pydantic response models
│
└── Prompt Templates
    └── prompts/
        ├── single_vanilla.py
        ├── compare_vanilla.py
        ├── single_axes.py
        ├── compare_axes.py
        ├── single_rubrics.py
        ├── single_axes_rubrics.py
        ├── compare_rules.py
        ├── compare_axes_rules.py
        └── reference_based.py