Multimodal Offline Evals — Traffic Sign Interpretation + Image Enhancement

This example demonstrates how to run multimodal offline evaluations using LangSmith. It compares two OpenAI GPT-5.4 models — gpt-5.4-nano and gpt-5.4-mini — on two tasks:

Sign interpretation — extracting structured information from traffic sign images
Image enhancement — generating a higher-quality version of each sign image

The evaluation uses:

Image attachments in LangSmith datasets to store sign images alongside ground-truth labels
Structured output (Pydantic models) to extract speed limits and interpretations
LangChain agents (create_agent) for structured sign interpretation
OpenAI Responses API with the image_generation tool for image enhancement
Four evaluators: exact-match scoring on speed_limit, LLM-as-judge for interpretation quality, pre-computed image quality score, and LLM-as-judge for enhancement quality

Quickstart

Prerequisites

Python 3.11+
uv package manager
LangSmith account
OpenAI API key (used for target models and LLM-as-judge)

Installation

uv sync

Environment Setup

cp .env.example .env

Then fill in your API keys in .env.

Run

uv run jupyter lab

Open multimodal_evals.ipynb and run the cells in order.

How It Works

Dataset

The LangSmith dataset (multimodal-traffic-signs-image-gen) contains 21 traffic and parking sign examples. Each example has:

Input: a question asking the model to interpret the sign
Reference outputs: speed_limit (int), description (str), image_quality (int 1–5)
Attachment: the sign image, uploaded as a binary attachment and served via presigned URL at eval time
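The shape of one such example can be sketched as below. This is a minimal illustration, not the notebook's code: the `build_example` helper, the question wording, and the `"image"` attachment name are assumptions made for the sketch.

```python
def build_example(
    image_bytes: bytes,
    mime_type: str,
    speed_limit: int,
    description: str,
    image_quality: int,
) -> dict:
    """Bundle a question, reference outputs, and the sign image into one example."""
    if not 1 <= image_quality <= 5:
        raise ValueError("image_quality must be between 1 and 5")
    return {
        # Hypothetical question wording; the dataset's actual prompt may differ.
        "inputs": {"question": "What does this sign mean?"},
        "outputs": {
            "speed_limit": speed_limit,
            "description": description,
            "image_quality": image_quality,
        },
        # "image" is an assumed attachment name.
        "attachments": {"image": {"mime_type": mime_type, "data": image_bytes}},
    }


example = build_example(b"\xff\xd8\xff", "image/jpeg", 30, "Speed limit 30", 4)
print(sorted(example))
```

With a recent `langsmith` SDK, a list of dicts in this shape can typically be passed to `Client.create_examples(dataset_id=..., examples=[...])`; check your SDK version for the exact attachment format it accepts.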

Target Functions

Each target function wraps one GPT-5.4 model and performs two tasks per example:

Sign interpretation — A LangChain create_agent with response_format=SignInterpretation returns structured output (speed_limit, interpretation).
Image enhancement — The OpenAI Responses API with the image_generation tool generates a higher-quality version of the sign image. The result is compressed from PNG to JPEG (quality 60) using Pillow to keep payloads well under LangSmith's 20 MB limit. The compressed image is attached to the LangSmith run so it renders in the UI.
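The two-task structure described above can be sketched as a small factory that binds one model name to a target function. This is only an outline under stated assumptions: `make_target`, `interpret`, and `enhance` are hypothetical names, the real model calls (the LangChain agent and the Responses API request) are replaced by injected callables, and the `"question"`, `"image"`, and `"presigned_url"` keys are assumed rather than taken from the notebook. Attaching the compressed image to the run is left out.

```python
from typing import Callable


def make_target(
    model_name: str,
    interpret: Callable[[str, str, str], dict],
    enhance: Callable[[str, str], bytes],
) -> Callable[[dict, dict], dict]:
    """Return a target function that runs both tasks with one model."""

    def target(inputs: dict, attachments: dict) -> dict:
        # Assumed keys: the attachment is named "image" and exposes a presigned URL.
        image_url = attachments["image"]["presigned_url"]
        # Task 1: structured sign interpretation.
        result = interpret(model_name, inputs["question"], image_url)
        # Task 2: image enhancement, expected to return compressed JPEG bytes.
        enhanced_jpeg = enhance(model_name, image_url)
        return {
            "speed_limit": result["speed_limit"],
            "interpretation": result["interpretation"],
            "enhanced_image_size_bytes": len(enhanced_jpeg),
        }

    return target
```

Passing the model calls in as arguments keeps the sketch runnable with stubs, and lets one factory produce a target per model (for example, one for gpt-5.4-nano and one for gpt-5.4-mini).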

Evaluators

| Evaluator | Type | Score | What it measures |
| --- | --- | --- | --- |
| `exact_match_fields` | Deterministic | 0 or 1 per field | Did the model extract the correct `speed_limit`? |
| `interpretation_quality` | LLM-as-judge (GPT-4o) | 0.0, 0.5, 1.0 | Did the interpretation capture the key info? |
| `image_quality` | Pre-computed | 1–5 | How clear is the original input image? |
| `image_enhancement_quality` | LLM-as-judge (GPT-4o) | 0 or 1 | Is the generated image a meaningful improvement? |
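As an illustration of the deterministic evaluator, the sketch below scores each listed field 1 when the model output equals the reference and 0 otherwise. It is a standalone approximation, not the notebook's implementation: the `fields` parameter and the `<field>_match` key naming are assumptions.

```python
def exact_match_fields(
    outputs: dict,
    reference_outputs: dict,
    fields: tuple[str, ...] = ("speed_limit",),
) -> list[dict]:
    """Return one 0/1 score per field, comparing output to reference exactly."""
    results = []
    for field in fields:
        # A field missing from the reference never counts as a match.
        matched = (
            field in reference_outputs
            and outputs.get(field) == reference_outputs[field]
        )
        results.append({"key": f"{field}_match", "score": int(matched)})
    return results


print(exact_match_fields({"speed_limit": 30}, {"speed_limit": 30}))
```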

The image_quality score is stored once in the dataset reference outputs so it can be reused across experiments without additional LLM calls.
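A pre-computed evaluator of this kind only needs to read the stored value back. A minimal sketch, assuming the evaluator receives the example's reference outputs as a dict (the range check is an addition for illustration):

```python
def image_quality(reference_outputs: dict) -> dict:
    """Report the stored 1-5 image quality score without calling a model."""
    score = reference_outputs["image_quality"]
    if not 1 <= score <= 5:
        raise ValueError(f"image_quality out of range: {score}")
    return {"key": "image_quality", "score": score}


print(image_quality({"speed_limit": 30, "image_quality": 4}))
```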

Additional Notes

Use case: Traffic signs are a concrete example, but the same pattern — image attachments + structured output + image generation + multi-evaluator scoring — applies to any multimodal evaluation task.
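Costs: Running the evaluations incurs OpenAI API costs for target models, LLM-as-judge calls (GPT-4o), and image generation. The dataset has 21 examples, and each example is run against both models, so a full run makes 42 target invocations, each covering sign interpretation and image enhancement.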
Runtime: Image generation is the bottleneck (~20–30s per image). Expect roughly 5–8 minutes per experiment, ~15 minutes total.
LangSmith Experiments: All results are viewable in the LangSmith Experiments UI, where you can compare models side by side, drill into individual examples, view generated image attachments, and filter by evaluator scores.
Images: The images/ directory contains 21 traffic and parking sign images used as the evaluation dataset.

Related Resources

License

MIT
