This example demonstrates how to run multimodal offline evaluations using LangSmith. It compares two OpenAI GPT-5.4 models — gpt-5.4-nano and gpt-5.4-mini — on two tasks:
- Sign interpretation — extracting structured information from traffic sign images
- Image enhancement — generating a higher-quality version of each sign image
The evaluation uses:
- Image attachments in LangSmith datasets to store sign images alongside ground-truth labels
- Structured output (Pydantic models) to extract speed limits and interpretations
- LangChain agents (`create_agent`) for structured sign interpretation
- OpenAI Responses API with the `image_generation` tool for image enhancement
- Four evaluators: exact-match scoring on `speed_limit`, LLM-as-judge for interpretation quality, a pre-computed image quality score, and LLM-as-judge for enhancement quality
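
As a rough illustration of the structured-output item above, the schema might look like the sketch below. The field names `speed_limit` and `interpretation` come from this README. Making `speed_limit` optional (for signs such as parking signs that show no limit) and the field descriptions are assumptions, not the notebook's actual definition.

```python
from typing import Optional

from pydantic import BaseModel, Field


class SignInterpretation(BaseModel):
    """Hypothetical schema for one interpreted sign."""

    # Assumption: None when the sign shows no speed limit (e.g. a parking sign).
    speed_limit: Optional[int] = Field(
        default=None, description="Posted speed limit, if the sign shows one"
    )
    interpretation: str = Field(description="Plain-language meaning of the sign")


example = SignInterpretation(speed_limit=35, interpretation="Speed limit is 35 mph.")
print(example.speed_limit, example.interpretation)
```

A typed schema like this is what makes the exact-match evaluator possible: the model's answer arrives as a field value rather than free text.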

Prerequisites:
- Python 3.11+
- uv package manager
- LangSmith account
- OpenAI API key (used for target models and LLM-as-judge)
Install dependencies with `uv sync`, copy the environment template with `cp .env.example .env`, then fill in your API keys in `.env`.
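
Before running the notebook, a quick check like the one below can confirm the keys are visible to Python. The variable names `LANGSMITH_API_KEY` and `OPENAI_API_KEY` are assumptions based on the usual LangSmith and OpenAI conventions, so check `.env.example` for the names the notebook expects. The check reads the process environment, so it only helps once `.env` has been loaded.

```python
import os
from typing import Mapping

# Assumed variable names; .env.example is the authoritative list.
REQUIRED_KEYS = ("LANGSMITH_API_KEY", "OPENAI_API_KEY")


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required keys that are unset or blank in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All required keys are set.")
```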
Start Jupyter with `uv run jupyter lab`, open `multimodal_evals.ipynb`, and run the cells in order.
The LangSmith dataset (`multimodal-traffic-signs-image-gen`) contains 21 traffic and parking sign examples. Each example has:
- Input: a question asking the model to interpret the sign
- Reference outputs: `speed_limit` (int), `description` (str), `image_quality` (int 1–5)
- Attachment: the sign image, uploaded as a binary attachment and served via presigned URL at eval time
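
The example layout above could be assembled roughly as follows. This is a sketch, not the notebook's code: the attachment name `image`, the input key `question`, and the helper names `build_example` and `upload` are hypothetical, and the upload step assumes a recent `langsmith` SDK whose `create_examples` accepts an `examples` list with an `attachments` entry per example.

```python
def build_example(
    question: str,
    image_bytes: bytes,
    mime_type: str,
    speed_limit,
    description: str,
    image_quality: int,
) -> dict:
    """Build one dataset example with the sign image as a binary attachment."""
    if not 1 <= image_quality <= 5:
        raise ValueError("image_quality must be between 1 and 5")
    return {
        "inputs": {"question": question},
        "outputs": {
            "speed_limit": speed_limit,
            "description": description,
            "image_quality": image_quality,
        },
        # Attachment name "image" is an assumption for illustration.
        "attachments": {"image": {"mime_type": mime_type, "data": image_bytes}},
    }


def upload(examples: list, dataset_name: str = "multimodal-traffic-signs-image-gen"):
    """Create the dataset and upload examples (needs LANGSMITH_API_KEY and network)."""
    from langsmith import Client  # lazy import so build_example works without the SDK

    client = Client()
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(dataset_id=dataset.id, examples=examples)
```

Keeping the builder separate from the upload call makes it easy to validate all 21 examples locally before anything is sent to LangSmith.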
Each target function wraps one GPT-5.4 model and performs two tasks per example:
- Sign interpretation: a LangChain `create_agent` with `response_format=SignInterpretation` returns structured output (`speed_limit`, `interpretation`).
- Image enhancement: the OpenAI Responses API with the `image_generation` tool generates a higher-quality version of the sign image. The result is compressed from PNG to JPEG (quality 60) using Pillow to keep payloads well under LangSmith's 20 MB limit. The compressed image is attached to the LangSmith run so it renders in the UI.
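
The PNG-to-JPEG compression step could look like the sketch below, which uses Pillow as described. The function name `png_to_jpeg` is hypothetical, and the demo runs on a synthetic image that stands in for the model's generated PNG.

```python
import io

from PIL import Image


def png_to_jpeg(png_bytes: bytes, quality: int = 60) -> bytes:
    """Re-encode PNG bytes as JPEG at the given quality."""
    with Image.open(io.BytesIO(png_bytes)) as img:
        # JPEG has no alpha channel, so convert to RGB before saving.
        rgb = img.convert("RGB")
    buffer = io.BytesIO()
    rgb.save(buffer, format="JPEG", quality=quality)
    return buffer.getvalue()


# Demo on a synthetic RGBA image standing in for a generated sign image.
demo = io.BytesIO()
Image.new("RGBA", (256, 256), (200, 30, 30, 255)).save(demo, format="PNG")
jpeg_bytes = png_to_jpeg(demo.getvalue())
print(f"PNG: {len(demo.getvalue())} bytes, JPEG: {len(jpeg_bytes)} bytes")
```

Size savings depend on image content: photographic or detailed output usually shrinks substantially, while flat synthetic images like the demo may not.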
| Evaluator | Type | Score | What it measures |
|---|---|---|---|
| `exact_match_fields` | Deterministic | 0 or 1 per field | Did the model extract the correct `speed_limit`? |
| `interpretation_quality` | LLM-as-judge (GPT-4o) | 0.0, 0.5, 1.0 | Did the interpretation capture the key info? |
| `image_quality` | Pre-computed | 1–5 | How clear is the original input image? |
| `image_enhancement_quality` | LLM-as-judge (GPT-4o) | 0 or 1 | Is the generated image a meaningful improvement? |
The `image_quality` score is stored once in the dataset reference outputs so it can be reused across experiments without additional LLM calls.
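
The two evaluators that need no LLM call can be sketched as plain functions. The evaluator names match the table; the feedback key `speed_limit_exact_match` and the exact return shapes are assumptions. Recent LangSmith SDK versions accept evaluators that take `outputs` and `reference_outputs` and return a dict or a list of dicts, but the notebook's versions may differ.

```python
def exact_match_fields(outputs: dict, reference_outputs: dict) -> list:
    """Score 1 when a field equals its reference value, else 0 (one result per field)."""
    fields = ["speed_limit"]  # the only field compared, per the evaluator table
    return [
        {
            "key": f"{field}_exact_match",
            "score": int(outputs.get(field) == reference_outputs.get(field)),
        }
        for field in fields
    ]


def image_quality(outputs: dict, reference_outputs: dict) -> dict:
    """Report the pre-computed 1-5 score stored in the dataset; no LLM call is made."""
    return {"key": "image_quality", "score": reference_outputs["image_quality"]}
```

Because `image_quality` only reads the reference outputs, it returns the same value for both models and serves as context for the enhancement scores rather than as a measure of model quality.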
- Use case: Traffic signs are a concrete example, but the same pattern — image attachments + structured output + image generation + multi-evaluator scoring — applies to any multimodal evaluation task.
- Costs: Running the evaluations incurs OpenAI API costs for target models, LLM-as-judge calls (GPT-4o), and image generation. The dataset has 21 examples across 2 models.
- Runtime: Image generation is the bottleneck (~20–30s per image). Expect roughly 5–8 minutes per experiment, ~15 minutes total.
- LangSmith Experiments: All results are viewable in the LangSmith Experiments UI, where you can compare models side by side, drill into individual examples, view generated image attachments, and filter by evaluator scores.
- Images: The `images/` directory contains the 21 traffic and parking sign images used as the evaluation dataset.
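
For a rough sense of the cost note above, the number of API calls can be estimated from the dataset size. This back-of-the-envelope sketch assumes one interpretation call and one image generation call per run, plus one call per LLM-as-judge evaluator. An agent may make more than one model call, so treat the figures as a lower bound.

```python
EXAMPLES = 21
MODELS = 2
JUDGE_EVALUATORS = 2  # interpretation_quality and image_enhancement_quality

runs = EXAMPLES * MODELS
calls = {
    "interpretation": runs,  # assumes one agent call per run
    "image_generation": runs,
    "judge": runs * JUDGE_EVALUATORS,
}
print(runs, calls)
```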

License: MIT