VLM Distillation

Train smaller vision-language models to extract structured recipe data from images, replacing expensive API calls.

What This Is

A learning-focused project to understand knowledge distillation by doing it hands-on:

  1. Teacher (Qwen2.5-VL-72B via OpenRouter API) generates high-quality recipe extractions
  2. Students (7B and ~4B models) learn to mimic the teacher's outputs
  3. Judge (Claude Opus 4) evaluates quality independently

The goal is a small model that can run locally or on mobile, extracting ingredients, steps, and nutrition from recipe photos.

Current Results

Two-pass pipeline: 3B VLM → markdown → text model → constrained JSON.

Quality (same judge + test recipes, n=10)

| Model | Architecture | Parse Rate | Quality |
| --- | --- | --- | --- |
| Gemini 2.5 Flash (API)* | Single-pass | ~100% | 4.60/5 |
| Teacher 72B (API)* | Single-pass | 100% | 4.30/5 |
| Student 3B, pass 1 only | Markdown extraction | 100% | 4.30/5 |
| Student 3B, full two-pass | Markdown → JSON (7B text API) | 100% | 3.50/5 |
| Baseline 2B (no fine-tuning) | Single-pass JSON | 70% | 1.43/5 |

*API models run on provider-optimized infrastructure (vLLM, batching, MoE) — latency and cost are not comparable to our self-hosted setup. Gemini Flash achieves 4.60/5 at ~$0.001/recipe and 2.9s, but doesn't give us control over the model or data privacy.

Pass-1 distillation works: the 3B student matches the 72B teacher on markdown extraction. The quality drop happens in pass 2 (markdown → JSON conversion), where content gets lost. Single-pass JSON with small models was not viable (parse rates as low as 60%, repetition loops).

Latency & Cost

| | Unoptimized | Optimized (merge + vLLM + CUDA graphs + W4A16) | Target |
| --- | --- | --- | --- |
| Pass 1 (3B VLM) | 21s | 4.1s (5.1x faster, 3.8 GB) | 3-5s ✅ |
| Pass 2 (7B text, API) | 9s avg | 9s avg | <2s (distill to 0.5-1.5B) |
| Total | ~30s | ~13s | ~5-7s |
| Cost/recipe | ~$0.01 | ~$0.003 | ~$0.002 |

Pass-1 optimization stack: LoRA merge → vLLM → CUDA graphs → W4A16 quantization (4-bit weights via llm-compressor GPTQ). Most single-image recipes process in 1.8-2.5s. The pass-2 model is text-only (no vision weights), but using 7B for JSON conversion is still heavier than it should be — distilling to a smaller model is a priority.
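The LoRA merge step folds the low-rank adapter update back into the base weights (W' = W + (α/r)·B·A), so inference runs on a single dense matrix with no adapter overhead. A toy, pure-Python sketch of the arithmetic (illustrative only; the real merge uses peft's merge utilities on full model checkpoints):

```python
# Toy illustration of a LoRA merge: fold the low-rank update B @ A,
# scaled by alpha / r, into the base weight matrix W.
# Shapes: W is (d_out, d_in), B is (d_out, r), A is (r, d_in).

def matmul(X, Y):
    """Plain-Python matrix multiply."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def merge_lora(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# 2x2 base weight, rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # (2, 1)
A = [[0.5, 0.5]]     # (1, 2)
merged = merge_lora(W, A, B, alpha=2, r=1)
# merged == [[2.0, 1.0], [2.0, 3.0]]
```

Once merged, the quantizer and vLLM see an ordinary dense model, which is what enables the rest of the optimization stack.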

Next Steps

  1. Optimize pass-1 inference — ✅ done (21s → 4.1s, 5.1x speedup)
  2. Fix pass-2 quality loss — guardrails to detect/prevent silent content dropping
  3. Shrink pass-2 model — distill 7B text → 0.5-1.5B, or use deterministic parser + LLM fallback
  4. KD experiments — data scaling, progressive distillation, feature alignment

Recipe Image → VLM → Structured JSON
                     {
                       "title": "Chocolate Cake",
                       "ingredients": [...],
                       "steps": [...],
                       "nutrition": {...}
                     }
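Parse rate in the tables above means the model output survived checks like this. The project validates with Pydantic schemas (decision D20); here is a minimal stdlib sketch of the same idea — the function name and required fields are taken from this README, the body is hypothetical:

```python
import json

# Required top-level fields, from the example JSON above.
# The real project uses Pydantic schemas for this (decision D20).
REQUIRED = {"title": str, "ingredients": list, "steps": list, "nutrition": dict}

def validate_recipe_json(raw: str) -> tuple[bool, list[str]]:
    """Parse model output and report missing or mistyped fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, [f"unparseable JSON: {e}"]
    errors = [f"{field}: expected {t.__name__}"
              for field, t in REQUIRED.items()
              if not isinstance(data.get(field), t)]
    return not errors, errors

ok, errs = validate_recipe_json(
    '{"title": "Chocolate Cake", "ingredients": [], "steps": [], "nutrition": {}}'
)
# ok is True, errs is empty
```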

Quick Start

cd vlm-distillation

# Install dependencies
uv sync

# Run tests to verify setup
uv run pytest -v

E2E Full Run (Baseline -> SFT -> Improvement Check)

This is the full KD loop from no prepared data to a trained model and an apples-to-apples baseline comparison.

Preconditions:

  • OPENROUTER_API_KEY is set
  • source data is available (YUMS_DIR set or default sibling path exists)
  • Modal CLI is authenticated

# 1) Build local dataset and upload processed images/splits to Modal.
make data

# 2) Create a fixed test slice for fair before/after comparison (same recipe IDs).
make eval-ids EVAL_SPLIT=test EVAL_LIMIT=30 EVAL_IDS_FILE=outputs/eval/test_ids_30.txt

# 3) Run baseline student on the fixed IDs.
uv run modal run scripts/run_student_baseline.py::main \
  --model qwen2-2b \
  --recipe-ids-file outputs/eval/test_ids_30.txt

# 4) Generate teacher labels for training (JSON mode for qwen2-2b FT path).
uv run python scripts/run_teacher_api.py --split train

# 5) Sync labels to Modal volume (images + labels + splits).
make sync

# 6) Train student on Modal (real SFT run).
uv run modal run scripts/train_modal.py::main \
  --model qwen2-2b \
  --epochs 3 \
  --report-to none

# 7) Run fine-tuned student on the exact same fixed IDs.
uv run modal run scripts/run_student_baseline.py::main \
  --model qwen2-2b-ft \
  --recipe-ids-file outputs/eval/test_ids_30.txt

# 8) Evaluate teacher, baseline, and fine-tuned outputs with the same judge flow.
make eval-judge \
  EVAL_INPUT=outputs/teacher/labels \
  EVAL_OUTPUT=outputs/eval/teacher-eval.json \
  EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt
make eval-judge \
  EVAL_INPUT=outputs/student/baselines/qwen2-2b \
  EVAL_OUTPUT=outputs/eval/qwen2-2b-baseline-eval.json \
  EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt
make eval-judge \
  EVAL_INPUT=outputs/student/finetuned/qwen2-2b-ft \
  EVAL_OUTPUT=outputs/eval/qwen2-2b-ft-eval.json \
  EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt

# 9) Compare baseline vs fine-tuned report metrics.
make eval-compare \
  BASELINE_REPORT=outputs/eval/qwen2-2b-baseline-eval.json \
  CANDIDATE_REPORT=outputs/eval/qwen2-2b-ft-eval.json \
  COMPARE_OUTPUT=outputs/eval/qwen2-2b-comparison.json

# 10) Export a single report card artifact (JSON + Markdown).
make report-card \
  TEACHER_REPORT=outputs/eval/teacher-eval.json \
  BASELINE_REPORT=outputs/eval/qwen2-2b-baseline-eval.json \
  CANDIDATE_REPORT=outputs/eval/qwen2-2b-ft-eval.json \
  COMPARE_OUTPUT=outputs/eval/qwen2-2b-comparison.json \
  REPORT_CARD_JSON=outputs/eval/qwen2-2b-report-card.json \
  REPORT_CARD_MD=outputs/eval/qwen2-2b-report-card.md

Primary artifacts produced by this run:

  • Baseline outputs: outputs/student/baselines/qwen2-2b/
  • Teacher labels: outputs/teacher/labels/
  • Fine-tuned outputs: outputs/student/finetuned/qwen2-2b-ft/
  • Eval summaries: outputs/eval/teacher-eval.json, outputs/eval/qwen2-2b-baseline-eval.json, outputs/eval/qwen2-2b-ft-eval.json
  • Comparison summary: outputs/eval/qwen2-2b-comparison.json
  • Report card: outputs/eval/qwen2-2b-report-card.json, outputs/eval/qwen2-2b-report-card.md

Results

  • Evaluation protocol: docs/eval_protocol.md
  • Canonical comparison summary script: scripts/export_eval_summary.py
  • Latest report card artifact (generated): outputs/eval/qwen2-2b-report-card.md

Current Architecture (Two-Pass)

The current runtime architecture is:

Pass 1: image(s) -> markdown (vision extraction)
Pass 2: markdown -> JSON (schema-constrained)
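The control flow of the two passes can be sketched with the model calls injected as callables. This is an illustrative skeleton, not the repo's code — the real implementation is `two_pass_extract` in `src/vlm_distill/pipeline.py`, and the real callables go through the OpenRouter API:

```python
import json
from typing import Callable

def two_pass(image_url: str,
             vlm: Callable[[str], str],
             llm: Callable[[str], str]) -> dict:
    """Pass 1: image -> markdown via the VLM.
    Pass 2: markdown -> schema-constrained JSON via the text model."""
    markdown = vlm(image_url)    # vision extraction
    raw_json = llm(markdown)     # constrained JSON conversion
    return {"markdown": markdown, "recipe": json.loads(raw_json)}

# Stub models to show the flow end to end
fake_vlm = lambda url: "# Chocolate Cake\n## Ingredients\n- flour"
fake_llm = lambda md: '{"title": "Chocolate Cake", "ingredients": ["flour"]}'
result = two_pass("https://example.com/cake.jpg", fake_vlm, fake_llm)
# result["recipe"]["title"] == "Chocolate Cake"
```

Keeping the passes as separate functions is what lets pass 1 stay on the fine-tuned 3B VLM while pass 2 is swapped for a smaller distilled model later.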

Core shared modules:

  • src/vlm_distill/config.py
  • Centralized API key loading and OpenRouter endpoint constants
  • src/vlm_distill/pipeline.py
  • Shared pass logic: extract_markdown, markdown_to_json, two_pass_extract
  • Shared validation: validate_markdown, validate_json
  • src/vlm_distill/experiments.py
  • Typed experiment config and JSON/YAML config loading
  • src/vlm_distill/models/registry.py
  • Canonical base model metadata (ID/GPU/tensor-parallel/max context)
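The registry entry shape implied above can be sketched as a frozen dataclass. Field names and the example entries here are illustrative guesses, not copied from the repo; the real values live in `src/vlm_distill/models/registry.py`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    model_id: str         # HF / provider model identifier
    gpu: str              # Modal GPU type
    tensor_parallel: int  # tensor-parallel degree for vLLM
    max_context: int      # max context length in tokens

# Hypothetical entries for illustration only
MODEL_REGISTRY = {
    "qwen-3b": ModelSpec("Qwen/Qwen2.5-VL-3B-Instruct", "L40S", 1, 32768),
    "qwen2-2b": ModelSpec("Qwen/Qwen2-VL-2B-Instruct", "A10G", 1, 32768),
}
```

Centralizing this metadata is what lets Modal scripts keep only run-specific overlays (adapter path, output dir) locally.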

Architecture rules:

  • Scripts should not call os.getenv("OPENROUTER_API_KEY") directly.
  • Scripts should use validate_markdown() for markdown quality checks.
  • Experiment scripts should accept config files instead of adding many ad-hoc flags.
  • Modal scripts should source base model identity/GPU from MODEL_REGISTRY and only keep local overlays for run-specific fields (adapter path, output dir).
  • Modal scripts must be invoked with explicit entrypoints: modal run script.py::main (or ::sync_data).

Reference smoke config:

  • configs/two_pass_smoke.json

Data Pipeline Architecture

Canonical data flow:

export_images -> create_splits -> preprocess_images classify -> preprocess_images process -> sync_volume

Orchestrator:

  • scripts/run_data_pipeline.py
  • Shared helpers: src/vlm_distill/data/pipeline.py

Source repo resolution for export step:

  • Uses YUMS_DIR env var if set.
  • Falls back to default local path from src/vlm_distill/paths.py.
  • Fails with a clear error if neither exists.

Convenience entrypoint:

make data

See all shortcuts with:

make help

Modal Entrypoints

Use explicit local entrypoints with Modal CLI to avoid invocation errors:

# Training
uv run modal run scripts/train_modal.py::main --model qwen-3b

# Data sync helper
uv run modal run scripts/train_modal.py::sync_data

# Student baselines
uv run modal run scripts/run_student_baseline.py::main --model qwen2-2b --limit 1

make train already uses the correct ::main entrypoint. For train canaries, use --max-samples 2 or higher so the 90/10 split includes at least one train sample.

Manual Workflow

The pipeline is designed to run step-by-step. Each script uses the shared architecture.

Step 1: Export Images from Source Database

# Downloads recipe images from Convex to data/images/
# Creates data/metadata.json with recipe info
# Set YUMS_DIR if your yums checkout is not in the default sibling location
# export YUMS_DIR="/path/to/yums/apps/mobile"
uv run python scripts/export_images.py

Step 2: Create Train/Val/Test Splits

# Creates data/splits.json with 70/15/15 split
uv run python scripts/create_splits.py
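A deterministic 70/15/15 split boils down to a seeded shuffle plus slicing. This is a sketch of the idea, not the repo's `scripts/create_splits.py`:

```python
import random

def make_splits(recipe_ids, seed=42):
    """Shuffle deterministically, then slice 70/15/15 into train/val/test."""
    ids = sorted(recipe_ids)           # stable order before shuffling
    random.Random(seed).shuffle(ids)   # same seed -> same split every run
    n = len(ids)
    n_train, n_val = int(n * 0.70), int(n * 0.15)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

splits = make_splits([f"r{i}" for i in range(100)])
# lengths: 70 / 15 / 15, reproducible across runs
```

Reproducibility matters here because the fixed test IDs (step 2 of the E2E run) must refer to the same recipes before and after fine-tuning.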

Step 3: Preprocess Images

Prepares images for training by:

  1. Deduplicating - Removes identical images (by file hash)
  2. Classifying - Uses a VLM to detect text-containing images (only for recipes with >3 unique images)
  3. Filtering - Keeps text images first, up to 3 per recipe
  4. Resizing - Resizes to 768px max dimension

Recipes with ≤3 unique images keep all images (including dish photos), adding diversity to training data.
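Step 1 (dedup by file hash) amounts to keeping the first image seen for each content digest. A minimal sketch operating on in-memory bytes — the real script hashes files on disk:

```python
import hashlib

def dedupe_images(images: dict[str, bytes]) -> list[str]:
    """Keep the first filename seen for each unique content hash."""
    seen, kept = set(), []
    for name in sorted(images):  # deterministic iteration order
        digest = hashlib.sha256(images[name]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept

imgs = {"0.jpg": b"pixels-a", "1.jpg": b"pixels-a", "2.jpg": b"pixels-b"}
dedupe_images(imgs)
# -> ["0.jpg", "2.jpg"]  (1.jpg is a byte-identical duplicate of 0.jpg)
```

Note this only catches byte-identical duplicates; near-duplicates (recompressed or resized copies) would need perceptual hashing instead.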

# Set API key for classification
export OPENROUTER_API_KEY="sk-or-..."

# Check stats (no API calls)
uv run python scripts/preprocess_images.py stats

# Classify images that need it (~320 images, ~$0.30)
uv run python scripts/preprocess_images.py classify

# Process all images (dedupe + filter + resize)
uv run python scripts/preprocess_images.py process

# Preview without writing files
uv run python scripts/preprocess_images.py process --dry-run

Output: data/processed/{recipe_id}/0.jpg, 1.jpg, 2.jpg

Step 4: Sync to Modal Volume

After preprocessing, sync data to Modal for GPU training/inference:

# Full sync (images + labels + splits)
uv run python scripts/sync_volume.py

# Dry run to preview
uv run python scripts/sync_volume.py --dry-run

This uploads to Modal volume with correct structure:

  • /data/images/{recipe_id}/ - Preprocessed images
  • /data/labels/{recipe_id}.json - Teacher labels
  • /data/splits.json - Train/val/test split

Step 5: Validate Teacher Model

# Test teacher on 5 random recipes before full run
# Requires OPENROUTER_API_KEY env var or secrets/openrouter_key file
uv run python scripts/validate_teacher.py --count 5

# View results
cat outputs/teacher/validation.json | python -m json.tool

Step 6: Run Teacher on All Recipes

# Generate labels for all recipes (can resume if interrupted)
uv run python scripts/run_teacher_api.py --split train

# Or process all splits
uv run python scripts/run_teacher_api.py

Step 7: Evaluate Results

# Basic stats (parse rate, field counts)
uv run python scripts/eval.py outputs/teacher/validation.json

# With LLM-as-judge evaluation (uses Claude)
uv run python scripts/eval.py outputs/teacher/validation.json --judge

# Evaluate a directory of per-recipe files
uv run python scripts/eval.py outputs/teacher/labels/

# Save evaluation results
uv run python scripts/eval.py outputs/teacher/validation.json --output outputs/eval/teacher.json

Low-Cost Pipeline Smoke Run

This validates the new two-pass architecture without fine-tuned models:

# 1 recipe, low spend, end-to-end two-pass check
uv run python scripts/test_two_pass.py \
  --limit 1 \
  --vlm-model google/gemini-2.5-flash \
  --llm-model qwen/qwen-2.5-7b-instruct \
  --save

Outputs:

  • outputs/two_pass_test/{recipe_id}.json
  • Includes markdown, JSON output, markdown validation, JSON validation, and token usage.

Run pass-2 constrained JSON conversion on existing markdown outputs:

make pass2-json \
  PASS2_INPUT=outputs/student/finetuned/qwen-3b-ft \
  PASS2_OUTPUT=outputs/two_pass/qwen-3b-ft-pass2 \
  PASS2_MODEL=qwen/qwen-2.5-7b-instruct \
  EVAL_RECIPE_IDS_FILE=outputs/eval/test_ids_30.txt

Outputs:

  • outputs/two_pass/qwen-3b-ft-pass2/{recipe_id}.json
  • outputs/two_pass/qwen-3b-ft-pass2/summary.json (parse rate + latency)
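Pass 2 constrains decoding with a JSON schema. A hedged sketch of the chat-completions payload such a structured-output call might send (schema abbreviated, helper name hypothetical; exact `response_format` support varies by provider):

```python
def build_pass2_request(markdown: str,
                        model: str = "qwen/qwen-2.5-7b-instruct") -> dict:
    """Build a chat-completions payload constraining output to a recipe schema."""
    recipe_schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "ingredients": {"type": "array", "items": {"type": "string"}},
            "steps": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "ingredients", "steps"],
    }
    return {
        "model": model,
        "messages": [
            {"role": "user",
             "content": f"Convert this recipe to JSON:\n\n{markdown}"}],
        # Structured-output request: the provider enforces the schema
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "recipe", "strict": True,
                            "schema": recipe_schema},
        },
    }

payload = build_pass2_request("# Chocolate Cake\n...")
```

Constrained decoding guarantees parseable JSON, but as the results above show, it does not guarantee the content survives the conversion — hence the pass-2 guardrails in Next Steps.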

Project Structure

vlm-distillation/
├── Makefile                  # Common data/train/eval/smoke commands
│
├── configs/
│   └── two_pass_smoke.json   # Example reproducible experiment config
│
├── src/vlm_distill/           # Core library
│   ├── config.py              # Shared API keys + client endpoint constants
│   ├── data/
│   │   ├── pipeline.py        # Data pipeline orchestration + YUMS_DIR resolution
│   │   ├── schemas.py         # Pydantic schemas (ExtractionResult, etc.)
│   │   └── loaders.py         # Recipe/image loading utilities
│   ├── experiments.py         # Typed experiment config + JSON/YAML loading
│   ├── eval/
│   │   ├── validation.py      # JSON parsing + schema validation
│   │   └── judge.py           # LLM-as-judge evaluation (stub)
│   ├── models/
│   │   └── registry.py        # Model configs (GPU, tensor parallel)
│   ├── pipeline.py            # Shared two-pass pipeline + markdown/json validation
│   ├── modal/
│   │   ├── images.py          # Modal image definitions
│   │   └── volumes.py         # Persistent model cache
│   ├── paths.py               # Standard paths for all scripts
│   └── prompts.py             # Shared extraction prompt
│
├── scripts/                   # CLI scripts
│   ├── run_data_pipeline.py   # Orchestrated data pipeline runner
│   ├── export_images.py       # Step 1: Download images from Convex
│   ├── create_splits.py       # Step 2: Create train/val/test splits
│   ├── validate_teacher.py    # Step 5: Test teacher on samples
│   ├── run_pass2_json.py      # Batch pass-2 markdown -> constrained JSON
│   └── run_teacher_api.py     # Run teacher on all recipes
│
├── data/                      # Input data (gitignored)
│   ├── images/                # Downloaded recipe images (raw)
│   ├── processed/             # Preprocessed images (dedupe + resize)
│   ├── image_classifications.json  # VLM text detection results
│   ├── metadata.json          # Recipe metadata from Convex
│   └── splits.json            # Train/val/test split definitions
│
├── outputs/                   # All outputs (gitignored)
│   ├── teacher/
│   │   ├── validation.json    # Sample validation results
│   │   └── labels/            # Full labeling (one JSON per recipe)
│   ├── student/
│   │   └── baselines/         # Pre-training baseline results
│   └── eval/
│       └── judge/             # LLM-as-judge evaluation results
│
└── tests/                     # Unit tests

Active Script Surface

Maintained scripts for the active workflow:

  • Data: export_images.py, create_splits.py, preprocess_images.py, sync_volume.py, run_data_pipeline.py
  • Teacher/eval: run_teacher_api.py, validate_teacher.py, eval.py, eval_markdown.py, run_judge.py, create_eval_ids.py, compare_eval_reports.py, export_eval_summary.py
  • Distillation/runtime: train_sft.py, train_modal.py, run_student_baseline.py, test_two_pass.py, test_gemini.py

Legacy utility scripts that were not part of the maintained workflow were removed in cleanup.

Architecture

All scripts use shared components from vlm_distill:

# Runtime config (shared API keys/endpoints)
from vlm_distill.config import get_api_key, OPENROUTER_CHAT_COMPLETIONS_URL

# Shared two-pass pipeline
from vlm_distill.pipeline import (
    extract_markdown,
    markdown_to_json,
    two_pass_extract,
    validate_markdown,
    validate_json,
)

# Typed experiment config (JSON/YAML)
from vlm_distill.experiments import ExperimentConfig, load_experiment_config

# Existing schema/eval modules remain for structured validation + judge
from vlm_distill.eval import validate_extraction

API Keys

Teacher labeling uses the OpenRouter API (360x cheaper than self-hosted):

# Option 1: Environment variable
export OPENROUTER_API_KEY="sk-or-..."

# Option 2: Secrets file (gitignored)
echo "sk-or-..." > ../secrets/openrouter_key

Get a key at: https://openrouter.ai/keys
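The shared loader checks the environment first and falls back to the secrets file. `get_api_key` exists in the repo's `config.py`; the body here is an illustrative sketch of the pattern, not the actual implementation:

```python
import os
from pathlib import Path

def get_api_key(env_var: str = "OPENROUTER_API_KEY",
                secrets_file: str = "../secrets/openrouter_key") -> str:
    """Return the API key from the environment, else from the secrets file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    path = Path(secrets_file)
    if path.exists():
        return path.read_text().strip()
    raise RuntimeError(f"Set {env_var} or create {secrets_file}")
```

This is why the architecture rules forbid scripts from calling `os.getenv("OPENROUTER_API_KEY")` directly: the fallback logic stays in one place.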

Development

# Lint and format
uv run ruff check . --fix
uv run ruff format .

# Type check
uv run pyright

# Run tests
uv run pytest -v

Documentation

Detailed docs in workpads:

| File | Contents |
| --- | --- |
| workpads/vlm-distillation/knowledge.md | Decisions, architecture, research |
| workpads/vlm-distillation/references.md | Papers, models, external resources |
| workpads/vlm-distillation/tasks.md | Current task list with status |
| workpads/vlm-distillation/learning.md | Distillation theory explainer |

Key Decisions

| ID | Decision | Rationale |
| --- | --- | --- |
| D12 | OpenRouter API for teacher | 360x cheaper than self-hosted |
| D16 | Claude Opus 4 as judge | Avoid circular eval (teacher ≠ judge) |
| D20 | Pydantic schemas | Automatic validation, JSON serialization |
| D24 | L40S for 7B, A10G for 4B | Best price/performance on Modal |

See workpads/vlm-distillation/knowledge.md for full decision log.
