Train smaller vision-language models to extract structured recipe data from images, replacing expensive API calls.
A learning-focused project to understand knowledge distillation by doing it hands-on:
- Teacher (Qwen2.5-VL-72B via OpenRouter API) generates high-quality recipe extractions
- Students (7B and ~4B models) learn to mimic the teacher's outputs
- Judge (Claude Opus 4) evaluates quality independently
The goal is a small model that can run locally or on mobile, extracting ingredients, steps, and nutrition from recipe photos.
Two-pass pipeline: 3B VLM → markdown → text model → constrained JSON.
| Model | Architecture | Parse Rate | Quality |
|---|---|---|---|
| Gemini 2.5 Flash (API)* | Single-pass | ~100% | 4.60/5 |
| Teacher 72B (API)* | Single-pass | 100% | 4.30/5 |
| Student 3B, pass 1 only | Markdown extraction | 100% | 4.30/5 |
| Student 3B, full two-pass | Markdown → JSON (7B text API) | 100% | 3.50/5 |
| Baseline 2B (no fine-tuning) | Single-pass JSON | 70% | 1.43/5 |
*API models run on provider-optimized infrastructure (vLLM, batching, MoE) — latency and cost are not comparable to our self-hosted setup. Gemini Flash achieves 4.60/5 at ~$0.001/recipe and 2.9s, but doesn't give us control over the model or data privacy.
Pass-1 distillation works — the 3B student matches the 72B teacher on markdown extraction. The quality drop happens in pass 2 (markdown → JSON conversion), where content gets lost. Single-pass JSON with small models failed entirely (60% parse rate, repetition loops).
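Parse rate above means the fraction of model outputs that yield valid JSON. A minimal, dependency-free sketch of such a check (illustrative only — the project's real validation lives in `src/vlm_distill/pipeline.py`):

```python
import json
import re

def try_parse_json(raw: str):
    """Strip an optional markdown code fence, then attempt a JSON parse."""
    text = raw.strip()
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

def parse_rate(outputs: list[str]) -> float:
    """Fraction of raw model outputs that parse as JSON."""
    ok = sum(1 for o in outputs if try_parse_json(o) is not None)
    return ok / len(outputs) if outputs else 0.0
```

A 70% parse rate means 3 in 10 outputs fail this kind of check before any quality scoring happens.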
| Stage | Unoptimized | Optimized (merge + vLLM + CUDA graphs + W4A16) | Target |
|---|---|---|---|
| Pass 1 (3B VLM) | 21s | 4.1s (5.1x faster, 3.8 GB) | 3-5s ✅ |
| Pass 2 (7B text, API) | 9s avg | 9s avg | <2s (distill to 0.5-1.5B) |
| Total | ~30s | ~13s | ~5-7s |
| Cost/recipe | ~$0.01 | ~$0.003 | ~$0.002 |
Pass-1 optimization stack: LoRA merge → vLLM → CUDA graphs → W4A16 quantization (4-bit weights via llm-compressor GPTQ). Most single-image recipes process in 1.8-2.5s. The pass-2 model is text-only (no vision weights), but using 7B for JSON conversion is still heavier than it should be — distilling to a smaller model is a priority.
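A rough back-of-envelope for the W4A16 footprint (illustrative arithmetic, not measured values): 4-bit weights need about 0.5 bytes per parameter, so a 3B-parameter model's weights fit in roughly 1.5 GB; the observed 3.8 GB plausibly also covers the vision tower, KV cache, and CUDA graph buffers.

```python
# Back-of-envelope: weight memory for a 3B-parameter model at different precisions.
PARAMS = 3e9

def weight_gb(bytes_per_param: float) -> float:
    """Weight-only memory in GB (decimal) for the given storage precision."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weight_gb(2.0)   # 16-bit weights
w4_gb = weight_gb(0.5)     # 4-bit (W4A16) weights
print(f"fp16: {fp16_gb:.1f} GB, W4A16: {w4_gb:.1f} GB")
```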
- Optimize pass-1 inference — ✅ done (21s → 4.1s, 5.1x speedup)
- Fix pass-2 quality loss — guardrails to detect/prevent silent content dropping
- Shrink pass-2 model — distill 7B text → 0.5-1.5B, or use deterministic parser + LLM fallback
- KD experiments — data scaling, progressive distillation, feature alignment
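One shape the pass-2 guardrail could take (a sketch under assumptions, not the project's implementation): count list items in the pass-1 markdown and flag any pass-2 JSON that comes back with fewer.

```python
import re

def count_markdown_bullets(markdown: str, heading: str) -> int:
    """Count '-'/'*' bullet lines under a given markdown heading (heuristic)."""
    in_section = False
    count = 0
    for line in markdown.splitlines():
        if line.startswith("#"):
            in_section = heading.lower() in line.lower()
        elif in_section and re.match(r"^\s*[-*]\s+\S", line):
            count += 1
    return count

def pass2_drops_content(markdown: str, parsed: dict) -> bool:
    """Flag the conversion if the JSON has fewer ingredients than the markdown."""
    md_count = count_markdown_bullets(markdown, "Ingredients")
    return len(parsed.get("ingredients", [])) < md_count
```

The same comparison could run per-section (steps, nutrition keys) before accepting a pass-2 result.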
Recipe Image → VLM → Structured JSON
{
"title": "Chocolate Cake",
"ingredients": [...],
"steps": [...],
"nutrition": {...}
}
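The project validates outputs against Pydantic schemas (`src/vlm_distill/data/schemas.py`); a dependency-free sketch of the same top-level shape check (names illustrative, not the project's `validate_extraction`):

```python
# Expected top-level fields and their JSON types for an extraction result.
REQUIRED_FIELDS = {
    "title": str,
    "ingredients": list,
    "steps": list,
    "nutrition": dict,
}

def check_extraction(data: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the check passes."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return errors
```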
cd vlm-distillation
# Install dependencies
uv sync
# Run tests to verify setup
uv run pytest -v

This is the full KD loop from no prepared data to a trained model and an apples-to-apples baseline comparison.
Preconditions:
- OPENROUTER_API_KEY is set
- source data is available (YUMS_DIR set or default sibling path exists)
- Modal CLI is authenticated
# 1) Build local dataset and upload processed images/splits to Modal.
make data
# 2) Create a fixed test slice for fair before/after comparison (same recipe IDs).
make eval-ids EVAL_SPLIT=test EVAL_LIMIT=30 EVAL_IDS_FILE=outputs/eval/test_ids_30.txt
# 3) Run baseline student on the fixed IDs.
uv run modal run scripts/run_student_baseline.py::main \
--model qwen2-2b \
--recipe-ids-file outputs/eval/test_ids_30.txt
# 4) Generate teacher labels for training (JSON mode for qwen2-2b FT path).
uv run python scripts/run_teacher_api.py --split train
# 5) Sync labels to Modal volume (images + labels + splits).
make sync
# 6) Train student on Modal (real SFT run).
uv run modal run scripts/train_modal.py::main \
--model qwen2-2b \
--epochs 3 \
--report-to none
# 7) Run fine-tuned student on the exact same fixed IDs.
uv run modal run scripts/run_student_baseline.py::main \
--model qwen2-2b-ft \
--recipe-ids-file outputs/eval/test_ids_30.txt
# 8) Evaluate teacher, baseline, and fine-tuned outputs with the same judge flow.
make eval-judge \
EVAL_INPUT=outputs/teacher/labels \
EVAL_OUTPUT=outputs/eval/teacher-eval.json \
EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt
make eval-judge \
EVAL_INPUT=outputs/student/baselines/qwen2-2b \
EVAL_OUTPUT=outputs/eval/qwen2-2b-baseline-eval.json \
EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt
make eval-judge \
EVAL_INPUT=outputs/student/finetuned/qwen2-2b-ft \
EVAL_OUTPUT=outputs/eval/qwen2-2b-ft-eval.json \
EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt
# 9) Compare baseline vs fine-tuned report metrics.
make eval-compare \
BASELINE_REPORT=outputs/eval/qwen2-2b-baseline-eval.json \
CANDIDATE_REPORT=outputs/eval/qwen2-2b-ft-eval.json \
COMPARE_OUTPUT=outputs/eval/qwen2-2b-comparison.json
# 10) Export a single report card artifact (JSON + Markdown).
make report-card \
TEACHER_REPORT=outputs/eval/teacher-eval.json \
BASELINE_REPORT=outputs/eval/qwen2-2b-baseline-eval.json \
CANDIDATE_REPORT=outputs/eval/qwen2-2b-ft-eval.json \
COMPARE_OUTPUT=outputs/eval/qwen2-2b-comparison.json \
REPORT_CARD_JSON=outputs/eval/qwen2-2b-report-card.json \
REPORT_CARD_MD=outputs/eval/qwen2-2b-report-card.md

Primary artifacts produced by this run:
- Baseline outputs: outputs/student/baselines/qwen2-2b/
- Teacher labels: outputs/teacher/labels/
- Fine-tuned outputs: outputs/student/finetuned/qwen2-2b-ft/
- Eval summaries: outputs/eval/teacher-eval.json, outputs/eval/qwen2-2b-baseline-eval.json, outputs/eval/qwen2-2b-ft-eval.json
- Comparison summary: outputs/eval/qwen2-2b-comparison.json
- Report card: outputs/eval/qwen2-2b-report-card.json, outputs/eval/qwen2-2b-report-card.md
- Evaluation protocol: docs/eval_protocol.md
- Canonical comparison summary script: scripts/export_eval_summary.py
- Latest report card artifact (generated): outputs/eval/qwen2-2b-report-card.md
The current runtime architecture is:
Pass 1: image(s) -> markdown (vision extraction)
Pass 2: markdown -> JSON (schema-constrained)
Core shared modules:
- src/vlm_distill/config.py - Centralized API key loading and OpenRouter endpoint constants
- src/vlm_distill/pipeline.py - Shared pass logic (extract_markdown, markdown_to_json, two_pass_extract) and shared validation (validate_markdown, validate_json)
- src/vlm_distill/experiments.py - Typed experiment config and JSON/YAML config loading
- src/vlm_distill/models/registry.py - Canonical base model metadata (ID/GPU/tensor-parallel/max context)
Architecture rules:
- Scripts should not call os.getenv("OPENROUTER_API_KEY") directly.
- Scripts should use validate_markdown() for markdown quality checks.
- Experiment scripts should accept config files instead of adding many ad-hoc flags.
- Modal scripts should source base model identity/GPU from MODEL_REGISTRY and only keep local overlays for run-specific fields (adapter path, output dir).
- Modal scripts must be invoked with explicit entrypoints: modal run script.py::main (or ::sync_data).
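A minimal sketch of what a MODEL_REGISTRY entry could look like (field names and values are illustrative; the canonical definitions live in src/vlm_distill/models/registry.py):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    model_id: str         # provider / Hugging Face model ID
    gpu: str              # Modal GPU type
    tensor_parallel: int  # tensor-parallel degree for serving
    max_context: int      # max context length in tokens

# Illustrative entries only -- real values live in the registry module.
MODEL_REGISTRY = {
    "qwen2-2b": ModelSpec("Qwen/Qwen2-VL-2B-Instruct", "A10G", 1, 8192),
    "qwen-3b": ModelSpec("Qwen/Qwen2.5-VL-3B-Instruct", "A10G", 1, 8192),
}

def resolve(model_key: str) -> ModelSpec:
    """Look up base model identity so scripts never hard-code GPU or model IDs."""
    if model_key not in MODEL_REGISTRY:
        raise KeyError(f"unknown model: {model_key}")
    return MODEL_REGISTRY[model_key]
```

Keeping GPU and model identity in one registry is what lets Modal scripts carry only run-specific overlays like adapter paths.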
Reference smoke config: configs/two_pass_smoke.json
Canonical data flow:
export_images -> create_splits -> preprocess_images classify -> preprocess_images process -> sync_volume
- Orchestrator: scripts/run_data_pipeline.py
- Shared helpers: src/vlm_distill/data/pipeline.py
Source repo resolution for export step:
- Uses YUMS_DIR env var if set.
- Falls back to default local path from src/vlm_distill/paths.py.
- Fails with a clear error if neither exists.
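The three-step resolution order can be sketched as follows (the default path and error text are illustrative; the real default comes from src/vlm_distill/paths.py):

```python
import os
from pathlib import Path

# Illustrative default -- the real one is defined in src/vlm_distill/paths.py.
DEFAULT_YUMS_DIR = Path("../yums/apps/mobile")

def resolve_yums_dir() -> Path:
    """YUMS_DIR env var first, then the default sibling path, else a clear error."""
    env_dir = os.environ.get("YUMS_DIR")
    if env_dir:
        return Path(env_dir)
    if DEFAULT_YUMS_DIR.exists():
        return DEFAULT_YUMS_DIR
    raise FileNotFoundError(
        "Set YUMS_DIR or place the yums checkout at the default sibling path."
    )
```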
Convenience entrypoint: make data

See all shortcuts with: make help

Use explicit local entrypoints with the Modal CLI to avoid invocation errors:
# Training
uv run modal run scripts/train_modal.py::main --model qwen-3b
# Data sync helper
uv run modal run scripts/train_modal.py::sync_data
# Student baselines
uv run modal run scripts/run_student_baseline.py::main --model qwen2-2b --limit 1

make train already uses the correct ::main entrypoint.
For train canaries, use --max-samples 2 or higher so the 90/10 split includes at least one train sample.
The pipeline is designed to run step-by-step. Each script uses the shared architecture.
# Downloads recipe images from Convex to data/images/
# Creates data/metadata.json with recipe info
# Set YUMS_DIR if your yums checkout is not in the default sibling location
# export YUMS_DIR="/path/to/yums/apps/mobile"
uv run python scripts/export_images.py

# Creates data/splits.json with 70/15/15 split
uv run python scripts/create_splits.py

Prepares images for training by:
- Deduplicating - Removes identical images (by file hash)
- Classifying - Uses VLM to detect text-containing images (only for recipes with >3 unique images)
- Filtering - Keeps text images first, up to 3 per recipe
- Resizing - Resizes to 768px max dimension
Recipes with ≤3 unique images keep all images (including dish photos), adding diversity to training data.
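The dedupe-and-cap behavior described above can be sketched as follows (a simplification: VLM classification and resizing are omitted, and helper names are illustrative):

```python
import hashlib
from pathlib import Path

MAX_IMAGES_PER_RECIPE = 3

def select_images(paths: list[Path], is_text_image) -> list[Path]:
    """Drop byte-identical duplicates, then keep text images first, up to the cap."""
    seen_hashes = set()
    unique = []
    for p in paths:
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique.append(p)
    if len(unique) <= MAX_IMAGES_PER_RECIPE:
        return unique  # small recipes keep everything, including dish photos
    # Prefer text-containing images; stable sort preserves order within each group.
    ranked = sorted(unique, key=lambda p: not is_text_image(p))
    return ranked[:MAX_IMAGES_PER_RECIPE]
```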
# Set API key for classification
export OPENROUTER_API_KEY="sk-or-..."
# Check stats (no API calls)
uv run python scripts/preprocess_images.py stats
# Classify images that need it (~320 images, ~$0.30)
uv run python scripts/preprocess_images.py classify
# Process all images (dedupe + filter + resize)
uv run python scripts/preprocess_images.py process
# Preview without writing files
uv run python scripts/preprocess_images.py process --dry-run

Output: data/processed/{recipe_id}/0.jpg, 1.jpg, 2.jpg
After preprocessing, sync data to Modal for GPU training/inference:
# Full sync (images + labels + splits)
uv run python scripts/sync_volume.py
# Dry run to preview
uv run python scripts/sync_volume.py --dry-run

This uploads to the Modal volume with the correct structure:
- /data/images/{recipe_id}/ - Preprocessed images
- /data/labels/{recipe_id}.json - Teacher labels
- /data/splits.json - Train/val/test split
# Test teacher on 5 random recipes before full run
# Requires OPENROUTER_API_KEY env var or secrets/openrouter_key file
uv run python scripts/validate_teacher.py --count 5
# View results
cat outputs/teacher/validation.json | python -m json.tool

# Generate labels for all recipes (can resume if interrupted)
uv run python scripts/run_teacher_api.py --split train
# Or process all splits
uv run python scripts/run_teacher_api.py

# Basic stats (parse rate, field counts)
uv run python scripts/eval.py outputs/teacher/validation.json
# With LLM-as-judge evaluation (uses Claude)
uv run python scripts/eval.py outputs/teacher/validation.json --judge
# Evaluate a directory of per-recipe files
uv run python scripts/eval.py outputs/teacher/labels/
# Save evaluation results
uv run python scripts/eval.py outputs/teacher/validation.json --output outputs/eval/teacher.json

This validates the new two-pass architecture without fine-tuned models:
# 1 recipe, low spend, end-to-end two-pass check
uv run python scripts/test_two_pass.py \
--limit 1 \
--vlm-model google/gemini-2.5-flash \
--llm-model qwen/qwen-2.5-7b-instruct \
--save

Outputs:
- outputs/two_pass_test/{recipe_id}.json - Includes markdown, JSON output, markdown validation, JSON validation, and token usage.
Run pass-2 constrained JSON conversion on existing markdown outputs:
make pass2-json \
PASS2_INPUT=outputs/student/finetuned/qwen-3b-ft \
PASS2_OUTPUT=outputs/two_pass/qwen-3b-ft-pass2 \
PASS2_MODEL=qwen/qwen-2.5-7b-instruct \
EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt

Outputs:
- outputs/two_pass/qwen-3b-ft-pass2/{recipe_id}.json
- outputs/two_pass/qwen-3b-ft-pass2/summary.json (parse rate + latency)
vlm-distillation/
├── Makefile # Common data/train/eval/smoke commands
│
├── configs/
│ └── two_pass_smoke.json # Example reproducible experiment config
│
├── src/vlm_distill/ # Core library
│ ├── config.py # Shared API keys + client endpoint constants
│ ├── data/
│ │ ├── pipeline.py # Data pipeline orchestration + YUMS_DIR resolution
│ │ ├── schemas.py # Pydantic schemas (ExtractionResult, etc.)
│ │ └── loaders.py # Recipe/image loading utilities
│ ├── experiments.py # Typed experiment config + JSON/YAML loading
│ ├── eval/
│ │ ├── validation.py # JSON parsing + schema validation
│ │ └── judge.py # LLM-as-judge evaluation (stub)
│ ├── models/
│ │ └── registry.py # Model configs (GPU, tensor parallel)
│ ├── pipeline.py # Shared two-pass pipeline + markdown/json validation
│ ├── modal/
│ │ ├── images.py # Modal image definitions
│ │ └── volumes.py # Persistent model cache
│ ├── paths.py # Standard paths for all scripts
│ └── prompts.py # Shared extraction prompt
│
├── scripts/ # CLI scripts
│ ├── run_data_pipeline.py # Orchestrated data pipeline runner
│ ├── export_images.py # Step 1: Download images from Convex
│ ├── create_splits.py # Step 2: Create train/val/test splits
│ ├── validate_teacher.py # Step 3: Test teacher on samples
│ ├── run_pass2_json.py # Batch pass-2 markdown -> constrained JSON
│ └── run_teacher_api.py # Step 4: Run teacher on all recipes
│
├── data/ # Input data (gitignored)
│ ├── images/ # Downloaded recipe images (raw)
│ ├── processed/ # Preprocessed images (dedupe + resize)
│ ├── image_classifications.json # VLM text detection results
│ ├── metadata.json # Recipe metadata from Convex
│ └── splits.json # Train/val/test split definitions
│
├── outputs/ # All outputs (gitignored)
│ ├── teacher/
│ │ ├── validation.json # Sample validation results
│ │ └── labels/ # Full labeling (one JSON per recipe)
│ ├── student/
│ │ └── baselines/ # Pre-training baseline results
│ └── eval/
│ └── judge/ # LLM-as-judge evaluation results
│
└── tests/ # Unit tests
Maintained scripts for the active workflow:
- Data: export_images.py, create_splits.py, preprocess_images.py, sync_volume.py, run_data_pipeline.py
- Teacher/eval: run_teacher_api.py, validate_teacher.py, eval.py, eval_markdown.py, run_judge.py, create_eval_ids.py, compare_eval_reports.py, export_eval_summary.py
- Distillation/runtime: train_sft.py, train_modal.py, run_student_baseline.py, test_two_pass.py, test_gemini.py
Legacy utility scripts that were not part of the maintained workflow were removed in cleanup.
All scripts use shared components from vlm_distill:
# Runtime config (shared API keys/endpoints)
from vlm_distill.config import get_api_key, OPENROUTER_CHAT_COMPLETIONS_URL
# Shared two-pass pipeline
from vlm_distill.pipeline import (
extract_markdown,
markdown_to_json,
two_pass_extract,
validate_markdown,
validate_json,
)
# Typed experiment config (JSON/YAML)
from vlm_distill.experiments import ExperimentConfig, load_experiment_config
# Existing schema/eval modules remain for structured validation + judge
from vlm_distill.eval import validate_extraction

Teacher labeling uses the OpenRouter API (360x cheaper than self-hosted):
# Option 1: Environment variable
export OPENROUTER_API_KEY="sk-or-..."
# Option 2: Secrets file (gitignored)
echo "sk-or-..." > ../secrets/openrouter_key

Get a key at: https://openrouter.ai/keys
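The fallback order between the two options (env var first, then the gitignored secrets file) can be sketched as follows; this is illustrative, while the real helper is get_api_key in src/vlm_distill/config.py:

```python
import os
from pathlib import Path

# Illustrative location -- matches the gitignored secrets file above.
SECRETS_FILE = Path("../secrets/openrouter_key")

def load_openrouter_key() -> str:
    """Env var takes precedence; fall back to the secrets file; else fail loudly."""
    key = os.environ.get("OPENROUTER_API_KEY")
    if key:
        return key
    if SECRETS_FILE.exists():
        return SECRETS_FILE.read_text().strip()
    raise RuntimeError(
        "No OpenRouter key: set OPENROUTER_API_KEY or create secrets/openrouter_key"
    )
```

Centralizing this is also why the architecture rules forbid scripts from calling os.getenv directly.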
# Lint and format
uv run ruff check . --fix
uv run ruff format .
# Type check
uv run pyright
# Run tests
uv run pytest -v

Detailed docs in workpads:
| File | Contents |
|---|---|
| workpads/vlm-distillation/knowledge.md | Decisions, architecture, research |
| workpads/vlm-distillation/references.md | Papers, models, external resources |
| workpads/vlm-distillation/tasks.md | Current task list with status |
| workpads/vlm-distillation/learning.md | Distillation theory explainer |
| ID | Decision | Rationale |
|---|---|---|
| D12 | OpenRouter API for teacher | 360x cheaper than self-hosted |
| D16 | Claude Opus 4 as judge | Avoid circular eval (teacher ≠ judge) |
| D20 | Pydantic schemas | Automatic validation, JSON serialization |
| D24 | L40S for 7B, A10G for 4B | Best price/performance on Modal |
See workpads/vlm-distillation/knowledge.md for full decision log.