Train smaller vision-language models to extract structured recipe data from images, replacing expensive API calls.
A learning-focused project to understand knowledge distillation by doing it hands-on:
- Teacher (Qwen2.5-VL-72B via OpenRouter API) generates high-quality recipe extractions
- Students (7B and ~4B models) learn to mimic the teacher's outputs
- Judge (Claude Opus 4) evaluates quality independently
The goal is a small model that can run locally or on mobile, extracting ingredients, steps, and nutrition from recipe photos.
Two-pass pipeline: 3B VLM → markdown → text model → constrained JSON.
| Model | Architecture | Parse Rate | Quality |
|---|---|---|---|
| Gemini 2.5 Flash (API)* | Single-pass | ~100% | 4.60/5 |
| Teacher 72B (API)* | Single-pass | 100% | 4.30/5 |
| Student 3B, pass 1 only | Markdown extraction | 100% | 4.30/5 |
| Student 3B, full two-pass | Markdown → JSON (7B text API) | 100% | 3.50/5 |
| Baseline 2B (no fine-tuning) | Single-pass JSON | 70% | 1.43/5 |
*API models run on provider-optimized infrastructure (vLLM, batching, MoE) — latency and cost are not comparable to our self-hosted setup. Gemini Flash achieves 4.60/5 at ~$0.001/recipe and 2.9s, but doesn't give us control over the model or data privacy.
Pass-1 distillation works — the 3B student matches the 72B teacher on markdown extraction. The quality drop happens in pass 2 (markdown → JSON conversion), where content gets lost. Single-pass JSON with small models failed entirely (60% parse rate, repetition loops).
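Parse rate above means the fraction of model outputs that yield valid JSON. A minimal, dependency-free sketch of such a check (illustrative only — the project's real validation lives in `src/vlm_distill/pipeline.py`):

```python
import json
import re

def try_parse_json(raw: str):
    """Strip an optional markdown code fence, then attempt a JSON parse."""
    text = raw.strip()
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

def parse_rate(outputs: list[str]) -> float:
    """Fraction of raw model outputs that parse as JSON."""
    ok = sum(1 for o in outputs if try_parse_json(o) is not None)
    return ok / len(outputs) if outputs else 0.0
```

A 70% parse rate means 3 in 10 outputs fail this kind of check before any quality scoring happens.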
| Stage | Unoptimized | Optimized (merge + vLLM + CUDA graphs + W4A16) | Target |
|---|---|---|---|
| Pass 1 (3B VLM) | 21s | 4.1s (5.1x faster, 3.8 GB) | 3-5s ✅ |
| Pass 2 (7B text, API) | 9s avg | 9s avg | <2s (distill to 0.5-1.5B) |
| Total | ~30s | ~13s | ~5-7s |
| Cost/recipe | ~$0.01 | ~$0.003 | ~$0.002 |
Pass-1 optimization stack: LoRA merge → vLLM → CUDA graphs → W4A16 quantization (4-bit weights via llm-compressor GPTQ). Most single-image recipes process in 1.8-2.5s. The pass-2 model is text-only (no vision weights), but using 7B for JSON conversion is still heavier than it should be — distilling to a smaller model is a priority.
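A rough back-of-envelope for the W4A16 footprint (illustrative arithmetic, not measured values): 4-bit weights need about 0.5 bytes per parameter, so a 3B-parameter model's weights fit in roughly 1.5 GB; the observed 3.8 GB plausibly also covers the vision tower, KV cache, and CUDA graph buffers.

```python
# Back-of-envelope: weight memory for a 3B-parameter model at different precisions.
PARAMS = 3e9

def weight_gb(bytes_per_param: float) -> float:
    """Weight-only memory in GB (decimal) for the given storage precision."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weight_gb(2.0)   # 16-bit weights
w4_gb = weight_gb(0.5)     # 4-bit (W4A16) weights
print(f"fp16: {fp16_gb:.1f} GB, W4A16: {w4_gb:.1f} GB")
```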
- Optimize pass-1 inference — ✅ done (21s → 4.1s, 5.1x speedup)
- Fix pass-2 quality loss — guardrails to detect/prevent silent content dropping
- Shrink pass-2 model — distill 7B text → 0.5-1.5B, or use deterministic parser + LLM fallback
- KD experiments — data scaling, progressive distillation, feature alignment
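One shape the pass-2 guardrail could take (a sketch under assumptions, not the project's implementation): count list items in the pass-1 markdown and flag any pass-2 JSON that comes back with fewer.

```python
import re

def count_markdown_bullets(markdown: str, heading: str) -> int:
    """Count '-'/'*' bullet lines under a given markdown heading (heuristic)."""
    in_section = False
    count = 0
    for line in markdown.splitlines():
        if line.startswith("#"):
            in_section = heading.lower() in line.lower()
        elif in_section and re.match(r"^\s*[-*]\s+\S", line):
            count += 1
    return count

def pass2_drops_content(markdown: str, parsed: dict) -> bool:
    """Flag the conversion if the JSON has fewer ingredients than the markdown."""
    md_count = count_markdown_bullets(markdown, "Ingredients")
    return len(parsed.get("ingredients", [])) < md_count
```

The same comparison could run per-section (steps, nutrition keys) before accepting a pass-2 result.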
Recipe Image → VLM → Structured JSON
{
"title": "Chocolate Cake",
"ingredients": [...],
"steps": [...],
"nutrition": {...}
}
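The project validates outputs against Pydantic schemas (`src/vlm_distill/data/schemas.py`); a dependency-free sketch of the same top-level shape check (names illustrative, not the project's `validate_extraction`):

```python
# Expected top-level fields and their JSON types for an extraction result.
REQUIRED_FIELDS = {
    "title": str,
    "ingredients": list,
    "steps": list,
    "nutrition": dict,
}

def check_extraction(data: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the check passes."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return errors
```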
cd vlm-distillation
# Install dependencies
uv sync
# Run tests to verify setup
uv run pytest -v

This is the full KD loop from no prepared data to a trained model and an apples-to-apples baseline comparison.
Preconditions:
- OPENROUTER_API_KEY is set
- source data is available (YUMS_DIR set or default sibling path exists)
- Modal CLI is authenticated
# 1) Build local dataset and upload processed images/splits to Modal.
make data
# 2) Create a fixed test slice for fair before/after comparison (same recipe IDs).
make eval-ids EVAL_SPLIT=test EVAL_LIMIT=30 EVAL_IDS_FILE=outputs/eval/test_ids_30.txt
# 3) Run baseline student on the fixed IDs.
uv run modal run scripts/run_student_baseline.py::main \
--model qwen2-2b \
--recipe-ids-file outputs/eval/test_ids_30.txt
# 4) Generate teacher labels for training (JSON mode for qwen2-2b FT path).
uv run python scripts/run_teacher_api.py --split train
# 5) Sync labels to Modal volume (images + labels + splits).
make sync
# 6) Train student on Modal (real SFT run).
uv run modal run scripts/train_modal.py::main \
--model qwen2-2b \
--epochs 3 \
--report-to none
# 7) Run fine-tuned student on the exact same fixed IDs.
uv run modal run scripts/run_student_baseline.py::main \
--model qwen2-2b-ft \
--recipe-ids-file outputs/eval/test_ids_30.txt
# 8) Evaluate teacher, baseline, and fine-tuned outputs with the same judge flow.
make eval-judge \
EVAL_INPUT=outputs/teacher/labels \
EVAL_OUTPUT=outputs/eval/teacher-eval.json \
EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt
make eval-judge \
EVAL_INPUT=outputs/student/baselines/qwen2-2b \
EVAL_OUTPUT=outputs/eval/qwen2-2b-baseline-eval.json \
EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt
make eval-judge \
EVAL_INPUT=outputs/student/finetuned/qwen2-2b-ft \
EVAL_OUTPUT=outputs/eval/qwen2-2b-ft-eval.json \
EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt
# 9) Compare baseline vs fine-tuned report metrics.
make eval-compare \
BASELINE_REPORT=outputs/eval/qwen2-2b-baseline-eval.json \
CANDIDATE_REPORT=outputs/eval/qwen2-2b-ft-eval.json \
COMPARE_OUTPUT=outputs/eval/qwen2-2b-comparison.json
# 10) Export a single report card artifact (JSON + Markdown).
make report-card \
TEACHER_REPORT=outputs/eval/teacher-eval.json \
BASELINE_REPORT=outputs/eval/qwen2-2b-baseline-eval.json \
CANDIDATE_REPORT=outputs/eval/qwen2-2b-ft-eval.json \
COMPARE_OUTPUT=outputs/eval/qwen2-2b-comparison.json \
REPORT_CARD_JSON=outputs/eval/qwen2-2b-report-card.json \
REPORT_CARD_MD=outputs/eval/qwen2-2b-report-card.md

Primary artifacts produced by this run:
- Baseline outputs: outputs/student/baselines/qwen2-2b/
- Teacher labels: outputs/teacher/labels/
- Fine-tuned outputs: outputs/student/finetuned/qwen2-2b-ft/
- Eval summaries: outputs/eval/teacher-eval.json, outputs/eval/qwen2-2b-baseline-eval.json, outputs/eval/qwen2-2b-ft-eval.json
- Comparison summary: outputs/eval/qwen2-2b-comparison.json
- Report card: outputs/eval/qwen2-2b-report-card.json, outputs/eval/qwen2-2b-report-card.md
- Evaluation protocol: docs/eval_protocol.md
- Canonical comparison summary script: scripts/export_eval_summary.py
- Latest report card artifact (generated): outputs/eval/qwen2-2b-report-card.md
The current runtime architecture is:
Pass 1: image(s) -> markdown (vision extraction)
Pass 2: markdown -> JSON (schema-constrained)
Core shared modules:
- src/vlm_distill/config.py - Centralized API key loading and OpenRouter endpoint constants
- src/vlm_distill/pipeline.py - Shared pass logic (extract_markdown, markdown_to_json, two_pass_extract) and shared validation (validate_markdown, validate_json)
- src/vlm_distill/experiments.py - Typed experiment config and JSON/YAML config loading
- src/vlm_distill/models/registry.py - Canonical base model metadata (ID/GPU/tensor-parallel/max context)
Architecture rules:
- Scripts should not call os.getenv("OPENROUTER_API_KEY") directly.
- Scripts should use validate_markdown() for markdown quality checks.
- Experiment scripts should accept config files instead of adding many ad-hoc flags.
- Modal scripts should source base model identity/GPU from MODEL_REGISTRY and only keep local overlays for run-specific fields (adapter path, output dir).
- Modal scripts must be invoked with explicit entrypoints: modal run script.py::main (or ::sync_data).
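A minimal sketch of what a MODEL_REGISTRY entry could look like (field names and values are illustrative; the canonical definitions live in src/vlm_distill/models/registry.py):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    model_id: str         # provider / Hugging Face model ID
    gpu: str              # Modal GPU type
    tensor_parallel: int  # tensor-parallel degree for serving
    max_context: int      # max context length in tokens

# Illustrative entries only -- real values live in the registry module.
MODEL_REGISTRY = {
    "qwen2-2b": ModelSpec("Qwen/Qwen2-VL-2B-Instruct", "A10G", 1, 8192),
    "qwen-3b": ModelSpec("Qwen/Qwen2.5-VL-3B-Instruct", "A10G", 1, 8192),
}

def resolve(model_key: str) -> ModelSpec:
    """Look up base model identity so scripts never hard-code GPU or model IDs."""
    if model_key not in MODEL_REGISTRY:
        raise KeyError(f"unknown model: {model_key}")
    return MODEL_REGISTRY[model_key]
```

Keeping GPU and model identity in one registry is what lets Modal scripts carry only run-specific overlays like adapter paths.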
Reference smoke config: configs/two_pass_smoke.json
Canonical data flow:
export_images -> create_splits -> preprocess_images classify -> preprocess_images process -> sync_volume
- Orchestrator: scripts/run_data_pipeline.py
- Shared helpers: src/vlm_distill/data/pipeline.py
Source repo resolution for export step:
- Uses YUMS_DIR env var if set.
- Falls back to default local path from src/vlm_distill/paths.py.
- Fails with a clear error if neither exists.
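The three-step resolution order can be sketched as follows (the default path and error text are illustrative; the real default comes from src/vlm_distill/paths.py):

```python
import os
from pathlib import Path

# Illustrative default -- the real one is defined in src/vlm_distill/paths.py.
DEFAULT_YUMS_DIR = Path("../yums/apps/mobile")

def resolve_yums_dir() -> Path:
    """YUMS_DIR env var first, then the default sibling path, else a clear error."""
    env_dir = os.environ.get("YUMS_DIR")
    if env_dir:
        return Path(env_dir)
    if DEFAULT_YUMS_DIR.exists():
        return DEFAULT_YUMS_DIR
    raise FileNotFoundError(
        "Set YUMS_DIR or place the yums checkout at the default sibling path."
    )
```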
Convenience entrypoint: make data

See all shortcuts with: make help

Use explicit local entrypoints with the Modal CLI to avoid invocation errors:
# Training
uv run modal run scripts/train_modal.py::main --model qwen-3b
# Data sync helper
uv run modal run scripts/train_modal.py::sync_data
# Student baselines
uv run modal run scripts/run_student_baseline.py::main --model qwen2-2b --limit 1

make train already uses the correct ::main entrypoint.
For train canaries, use --max-samples 2 or higher so the 90/10 split includes at least one train sample.
The pipeline is designed to run step-by-step. Each script uses the shared architecture.
# Downloads recipe images from Convex to data/images/
# Creates data/metadata.json with recipe info
# Set YUMS_DIR if your yums checkout is not in the default sibling location
# export YUMS_DIR="/path/to/yums/apps/mobile"
uv run python scripts/export_images.py

# Creates data/splits.json with 70/15/15 split
uv run python scripts/create_splits.py

Prepares images for training by:
- Deduplicating - Removes identical images (by file hash)
- Classifying - Uses VLM to detect text-containing images (only for recipes with >3 unique images)
- Filtering - Keeps text images first, up to 3 per recipe
- Resizing - Resizes to 768px max dimension
Recipes with ≤3 unique images keep all images (including dish photos), adding diversity to training data.
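The dedupe-and-cap behavior described above can be sketched as follows (a simplification: VLM classification and resizing are omitted, and helper names are illustrative):

```python
import hashlib
from pathlib import Path

MAX_IMAGES_PER_RECIPE = 3

def select_images(paths: list[Path], is_text_image) -> list[Path]:
    """Drop byte-identical duplicates, then keep text images first, up to the cap."""
    seen_hashes = set()
    unique = []
    for p in paths:
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique.append(p)
    if len(unique) <= MAX_IMAGES_PER_RECIPE:
        return unique  # small recipes keep everything, including dish photos
    # Prefer text-containing images; stable sort preserves order within each group.
    ranked = sorted(unique, key=lambda p: not is_text_image(p))
    return ranked[:MAX_IMAGES_PER_RECIPE]
```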
# Set API key for classification
export OPENROUTER_API_KEY="sk-or-..."
# Check stats (no API calls)
uv run python scripts/preprocess_images.py stats
# Classify images that need it (~320 images, ~$0.30)
uv run python scripts/preprocess_images.py classify
# Process all images (dedupe + filter + resize)
uv run python scripts/preprocess_images.py process
# Preview without writing files
uv run python scripts/preprocess_images.py process --dry-run

Output: data/processed/{recipe_id}/0.jpg, 1.jpg, 2.jpg
After preprocessing, sync data to Modal for GPU training/inference:
# Full sync (images + labels + splits)
uv run python scripts/sync_volume.py
# Dry run to preview
uv run python scripts/sync_volume.py --dry-run

This uploads to the Modal volume with the correct structure:
- /data/images/{recipe_id}/ - Preprocessed images
- /data/labels/{recipe_id}.json - Teacher labels
- /data/splits.json - Train/val/test split
# Test teacher on 5 random recipes before full run
# Requires OPENROUTER_API_KEY env var or secrets/openrouter_key file
uv run python scripts/validate_teacher.py --count 5
# View results
cat outputs/teacher/validation.json | python -m json.tool

# Generate labels for all recipes (can resume if interrupted)
uv run python scripts/run_teacher_api.py --split train
# Or process all splits
uv run python scripts/run_teacher_api.py

# Basic stats (parse rate, field counts)
uv run python scripts/eval.py outputs/teacher/validation.json
# With LLM-as-judge evaluation (uses Claude)
uv run python scripts/eval.py outputs/teacher/validation.json --judge
# Evaluate a directory of per-recipe files
uv run python scripts/eval.py outputs/teacher/labels/
# Save evaluation results
uv run python scripts/eval.py outputs/teacher/validation.json --output outputs/eval/teacher.json

This validates the new two-pass architecture without fine-tuned models:
# 1 recipe, low spend, end-to-end two-pass check
uv run python scripts/test_two_pass.py \
--limit 1 \
--vlm-model google/gemini-2.5-flash \
--llm-model qwen/qwen-2.5-7b-instruct \
--save

Outputs:
- outputs/two_pass_test/{recipe_id}.json - Includes markdown, JSON output, markdown validation, JSON validation, and token usage.
Run pass-2 constrained JSON conversion on existing markdown outputs:
make pass2-json \
PASS2_INPUT=outputs/student/finetuned/qwen-3b-ft \
PASS2_OUTPUT=outputs/two_pass/qwen-3b-ft-pass2 \
PASS2_MODEL=qwen/qwen-2.5-7b-instruct \
EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt

Outputs:
- outputs/two_pass/qwen-3b-ft-pass2/{recipe_id}.json
- outputs/two_pass/qwen-3b-ft-pass2/summary.json (parse rate + latency)
vlm-distillation/
├── Makefile # Common data/train/eval/smoke commands
│
├── configs/
│ └── two_pass_smoke.json # Example reproducible experiment config
│
├── src/vlm_distill/ # Core library
│ ├── config.py # Shared API keys + client endpoint constants
│ ├── data/
│ │ ├── pipeline.py # Data pipeline orchestration + YUMS_DIR resolution
│ │ ├── schemas.py # Pydantic schemas (ExtractionResult, etc.)
│ │ └── loaders.py # Recipe/image loading utilities
│ ├── experiments.py # Typed experiment config + JSON/YAML loading
│ ├── eval/
│ │ ├── validation.py # JSON parsing + schema validation
│ │ └── judge.py # LLM-as-judge evaluation (stub)
│ ├── models/
│ │ └── registry.py # Model configs (GPU, tensor parallel)
│ ├── pipeline.py # Shared two-pass pipeline + markdown/json validation
│ ├── modal/
│ │ ├── images.py # Modal image definitions
│ │ └── volumes.py # Persistent model cache
│ ├── paths.py # Standard paths for all scripts
│ └── prompts.py # Shared extraction prompt
│
├── scripts/ # CLI scripts
│ ├── run_data_pipeline.py # Orchestrated data pipeline runner
│ ├── export_images.py # Step 1: Download images from Convex
│ ├── create_splits.py # Step 2: Create train/val/test splits
│ ├── validate_teacher.py # Step 3: Test teacher on samples
│ ├── run_pass2_json.py # Batch pass-2 markdown -> constrained JSON
│ └── run_teacher_api.py # Step 4: Run teacher on all recipes
│
├── data/ # Input data (gitignored)
│ ├── images/ # Downloaded recipe images (raw)
│ ├── processed/ # Preprocessed images (dedupe + resize)
│ ├── image_classifications.json # VLM text detection results
│ ├── metadata.json # Recipe metadata from Convex
│ └── splits.json # Train/val/test split definitions
│
├── outputs/ # All outputs (gitignored)
│ ├── teacher/
│ │ ├── validation.json # Sample validation results
│ │ └── labels/ # Full labeling (one JSON per recipe)
│ ├── student/
│ │ └── baselines/ # Pre-training baseline results
│ └── eval/
│ └── judge/ # LLM-as-judge evaluation results
│
└── tests/ # Unit tests
Maintained scripts for the active workflow:
- Data: export_images.py, create_splits.py, preprocess_images.py, sync_volume.py, run_data_pipeline.py
- Teacher/eval: run_teacher_api.py, validate_teacher.py, eval.py, eval_markdown.py, run_judge.py, create_eval_ids.py, compare_eval_reports.py, export_eval_summary.py
- Distillation/runtime: train_sft.py, train_modal.py, run_student_baseline.py, test_two_pass.py, test_gemini.py
Legacy utility scripts that were not part of the maintained workflow were removed in cleanup.
All scripts use shared components from vlm_distill:
# Runtime config (shared API keys/endpoints)
from vlm_distill.config import get_api_key, OPENROUTER_CHAT_COMPLETIONS_URL
# Shared two-pass pipeline
from vlm_distill.pipeline import (
extract_markdown,
markdown_to_json,
two_pass_extract,
validate_markdown,
validate_json,
)
# Typed experiment config (JSON/YAML)
from vlm_distill.experiments import ExperimentConfig, load_experiment_config
# Existing schema/eval modules remain for structured validation + judge
from vlm_distill.eval import validate_extraction

Teacher labeling uses the OpenRouter API (360x cheaper than self-hosted):
# Option 1: Environment variable
export OPENROUTER_API_KEY="sk-or-..."
# Option 2: Secrets file (gitignored)
echo "sk-or-..." > ../secrets/openrouter_key

Get a key at: https://openrouter.ai/keys
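The fallback order between the two options (env var first, then the gitignored secrets file) can be sketched as follows; this is illustrative, while the real helper is get_api_key in src/vlm_distill/config.py:

```python
import os
from pathlib import Path

# Illustrative location -- matches the gitignored secrets file above.
SECRETS_FILE = Path("../secrets/openrouter_key")

def load_openrouter_key() -> str:
    """Env var takes precedence; fall back to the secrets file; else fail loudly."""
    key = os.environ.get("OPENROUTER_API_KEY")
    if key:
        return key
    if SECRETS_FILE.exists():
        return SECRETS_FILE.read_text().strip()
    raise RuntimeError(
        "No OpenRouter key: set OPENROUTER_API_KEY or create secrets/openrouter_key"
    )
```

Centralizing this is also why the architecture rules forbid scripts from calling os.getenv directly.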
# Lint and format
uv run ruff check . --fix
uv run ruff format .
# Type check
uv run pyright
# Run tests
uv run pytest -v

Detailed docs in workpads:
| File | Contents |
|---|---|
| workpads/vlm-distillation/knowledge.md | Decisions, architecture, research |
| workpads/vlm-distillation/references.md | Papers, models, external resources |
| workpads/vlm-distillation/tasks.md | Current task list with status |
| workpads/vlm-distillation/learning.md | Distillation theory explainer |
| ID | Decision | Rationale |
|---|---|---|
| D12 | OpenRouter API for teacher | 360x cheaper than self-hosted |
| D16 | Claude Opus 4 as judge | Avoid circular eval (teacher ≠ judge) |
| D20 | Pydantic schemas | Automatic validation, JSON serialization |
| D24 | L40S for 7B, A10G for 4B | Best price/performance on Modal |
See workpads/vlm-distillation/knowledge.md for full decision log.