jacobkleiman-LC/multimodal-offline-evals

Multimodal Offline Evals — Traffic Sign Interpretation + Image Enhancement

This example demonstrates how to run multimodal offline evaluations using LangSmith. It compares two OpenAI GPT-5.4 models — gpt-5.4-nano and gpt-5.4-mini — on two tasks:

  1. Sign interpretation — extracting structured information from traffic sign images
  2. Image enhancement — generating a higher-quality version of each sign image

The evaluation uses:

  • Image attachments in LangSmith datasets to store sign images alongside ground-truth labels
  • Structured output (Pydantic models) to extract speed limits and interpretations
  • LangChain agents (create_agent) for structured sign interpretation
  • OpenAI Responses API with the image_generation tool for image enhancement
  • Four evaluators: exact-match scoring on speed_limit, LLM-as-judge for interpretation quality, pre-computed image quality score, and LLM-as-judge for enhancement quality

Quickstart

Prerequisites

Installation

uv sync

Environment Setup

cp .env.example .env

Then fill in your API keys in .env.

Run

uv run jupyter lab

Open multimodal_evals.ipynb and run the cells in order.

How It Works

Dataset

The LangSmith dataset (multimodal-traffic-signs-image-gen) contains 21 traffic and parking sign examples. Each example has:

  • Input: a question asking the model to interpret the sign
  • Reference outputs: speed_limit (int), description (str), image_quality (int 1–5)
  • Attachment: the sign image, uploaded as a binary attachment and served via presigned URL at eval time
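As a rough illustration of the example layout described above, the sketch below builds one dataset example as a plain dictionary. The question wording, the attachment key `image`, and the helper names `build_example` and `upload_examples` are assumptions made for this sketch, not names taken from the notebook. The upload helper assumes a recent LangSmith SDK in which `Client.create_examples` accepts an `attachments` entry per example.

```python
import mimetypes
from pathlib import Path

# Hypothetical prompt text; the notebook's actual question may differ.
QUESTION = "What does this sign mean?"


def build_example(image_path: Path, speed_limit: int, description: str,
                  image_quality: int) -> dict:
    """Build one example: question input, reference outputs, image attachment."""
    if not 1 <= image_quality <= 5:
        raise ValueError("image_quality must be between 1 and 5")
    mime_type, _ = mimetypes.guess_type(image_path.name)
    return {
        "inputs": {"question": QUESTION},
        "outputs": {
            "speed_limit": speed_limit,
            "description": description,
            "image_quality": image_quality,
        },
        # "image" is an assumed attachment name; the bytes are stored as-is.
        "attachments": {
            "image": {
                "mime_type": mime_type or "application/octet-stream",
                "data": image_path.read_bytes(),
            }
        },
    }


def upload_examples(dataset_name: str, examples: list) -> None:
    """Upload examples to LangSmith (needs network access and an API key)."""
    from langsmith import Client  # imported lazily so build_example works offline

    client = Client()
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(dataset_id=dataset.id, examples=examples)
```

Keeping the payload builder separate from the upload call makes it possible to inspect or validate examples locally before anything is sent to LangSmith.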

Target Functions

Each target function wraps one GPT-5.4 model and performs two tasks per example:

  1. Sign interpretation — A LangChain create_agent with response_format=SignInterpretation returns structured output (speed_limit, interpretation).
  2. Image enhancement — The OpenAI Responses API with the image_generation tool generates a higher-quality version of the sign image. The result is compressed from PNG to JPEG (quality 60) using Pillow to keep payloads well under LangSmith's 20 MB limit. The compressed image is attached to the LangSmith run so it renders in the UI.
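The PNG-to-JPEG compression step in item 2 might look roughly like the following. This is a minimal sketch using Pillow; the function name `compress_png_to_jpeg` is hypothetical, and the notebook's own implementation may differ in details such as how transparency is handled.

```python
import io

from PIL import Image


def compress_png_to_jpeg(png_bytes: bytes, quality: int = 60) -> bytes:
    """Re-encode PNG bytes as JPEG at the given quality setting."""
    with Image.open(io.BytesIO(png_bytes)) as img:
        # JPEG has no alpha channel, so convert to RGB before saving.
        rgb = img.convert("RGB")
    buffer = io.BytesIO()
    rgb.save(buffer, format="JPEG", quality=quality)
    return buffer.getvalue()
```

The returned bytes can then be attached to the run with a MIME type of `image/jpeg`. Lower `quality` values shrink the payload further at the cost of visible artifacts.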

Evaluators

| Evaluator | Type | Score | What it measures |
| --- | --- | --- | --- |
| exact_match_fields | Deterministic | 0 or 1 per field | Did the model extract the correct speed_limit? |
| interpretation_quality | LLM-as-judge (GPT-4o) | 0.0, 0.5, 1.0 | Did the interpretation capture the key info? |
| image_quality | Pre-computed | 1–5 | How clear is the original input image? |
| image_enhancement_quality | LLM-as-judge (GPT-4o) | 0 or 1 | Is the generated image a meaningful improvement? |

The image_quality score is stored once in the dataset reference outputs so it can be reused across experiments without additional LLM calls.
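The two evaluators that need no LLM call can be sketched as plain functions. The version below assumes the `(outputs, reference_outputs)` evaluator signature supported by recent LangSmith SDK versions, and the feedback key `exact_match_speed_limit` is an assumed name; the notebook may name or structure its results differently.

```python
# Fields compared by exact match; the README lists only speed_limit.
EXACT_MATCH_FIELDS = ["speed_limit"]


def exact_match_fields(outputs: dict, reference_outputs: dict) -> list:
    """Return one 0/1 score per field, comparing model output to the reference."""
    return [
        {
            "key": f"exact_match_{field}",
            "score": int(outputs.get(field) == reference_outputs[field]),
        }
        for field in EXACT_MATCH_FIELDS
    ]


def image_quality(outputs: dict, reference_outputs: dict) -> dict:
    """Surface the pre-computed 1-5 score stored in the reference outputs."""
    return {"key": "image_quality", "score": reference_outputs["image_quality"]}
```

Because `image_quality` only reads from the reference outputs, it returns the same score for a given example in every experiment, which is what lets that score be reused without further LLM calls.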

Additional Notes

  • Use case: Traffic signs are a concrete example, but the same pattern — image attachments + structured output + image generation + multi-evaluator scoring — applies to any multimodal evaluation task.
  • Costs: Running the evaluations incurs OpenAI API costs for the target models, the LLM-as-judge calls (GPT-4o), and image generation. Each of the 21 examples is run against both models, for 42 target runs in total.
  • Runtime: Image generation is the bottleneck (~20–30s per image). Expect roughly 5–8 minutes per experiment, ~15 minutes total.
  • LangSmith Experiments: All results are viewable in the LangSmith Experiments UI, where you can compare models side by side, drill into individual examples, view generated image attachments, and filter by evaluator scores.
  • Images: The images/ directory contains 21 traffic and parking sign images used as the evaluation dataset.
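To get a feel for how the cost note above scales, the following back-of-the-envelope sketch counts API calls. It assumes exactly one interpretation call and one image generation call per example per model, plus one call for each of the two LLM-as-judge evaluators, with no retries. An agent may issue more model calls than this, so treat the numbers as a lower bound. The function name `estimate_calls` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCounts:
    interpretation: int    # target-model calls for sign interpretation
    image_generation: int  # image_generation tool calls
    judge: int             # LLM-as-judge calls (GPT-4o)


def estimate_calls(n_examples: int = 21, n_models: int = 2,
                   n_judge_evaluators: int = 2) -> CallCounts:
    """Count calls under the one-call-per-step assumption stated above."""
    runs = n_examples * n_models
    return CallCounts(
        interpretation=runs,
        image_generation=runs,
        judge=runs * n_judge_evaluators,
    )


print(estimate_calls())
# CallCounts(interpretation=42, image_generation=42, judge=84)
```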

Related Resources

License

MIT

About

Using the SDK to create and run offline evaluations that assess the performance of various LLMs on multimodal datasets
