Image embedding model for zero-shot tagging and perceptual similarity. Uses contrastive language-image pre-training to produce 512-dimensional embeddings that can be compared via cosine similarity.
- Repository: https://github.com/mlfoundations/open_clip
- Paper: Reproducible scaling laws for contrastive language-image learning (2023)
- Pretrained: `laion2b_s34b_b79k` (LAION-2B dataset)
- License: MIT
ViT-B-32 (Vision Transformer Base, 32x32 patches): 12 transformer layers, 768-dimensional hidden size, 12 attention heads. The 224x224 input image is split into 7x7 = 49 patches of 32x32 pixels, plus a CLS token. The CLS token output is projected to 512 dimensions.
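To make the dimensions above concrete, here is a minimal sketch (assuming the `open_clip` pip package and `torch` are installed; weights download on first use) that instantiates the same architecture and checks the patch count and embedding size:

```python
# Sketch only: instantiate ViT-B-32 with the laion2b_s34b_b79k weights and
# verify the shapes described above. Weights are fetched on first call.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

# 224x224 input split into 32x32 patches -> (224 // 32) ** 2 = 49 patches (plus CLS token).
print((224 // 32) ** 2)  # 49

with torch.no_grad():
    embedding = model.encode_image(torch.zeros(1, 3, 224, 224))
print(embedding.shape)  # torch.Size([1, 512]): CLS output projected to 512 dimensions
```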
| Property | Value |
|---|---|
| File | model.onnx |
| Input | image — float32 [1, 3, 224, 224] |
| Output | embedding — float32 [1, 512] |
| Resolution | Fixed 224x224 (resize + center crop) |
| Normalize | Scale pixels to [0, 1]; CLIP mean/std normalization is baked into the model |
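A minimal inference sketch against the spec above, assuming onnxruntime, NumPy, and Pillow are installed and that model.onnx and a hypothetical photo.jpg sit in the working directory. Preprocessing mirrors the stated resize + center crop and [0, 1] scaling:

```python
# Sketch only: run the exported model with onnxruntime.
import numpy as np
import onnxruntime as ort
from PIL import Image

def preprocess(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB")
    # Resize the short side to 224, then center-crop to 224x224.
    w, h = img.size
    scale = 224 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = img.size
    left, top = (w - 224) // 2, (h - 224) // 2
    img = img.crop((left, top, left + 224, top + 224))
    # HWC uint8 -> NCHW float32 in [0, 1]; CLIP mean/std is applied inside the graph.
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return arr.transpose(2, 0, 1)[np.newaxis, ...]

session = ort.InferenceSession("model.onnx")
embedding = session.run(["embedding"], {"image": preprocess("photo.jpg")})[0]
print(embedding.shape)  # (1, 512), already L2-normalized
```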
- Input and output are both float32.
- Embeddings are L2-normalized — use dot product for cosine similarity.
- No tiling: the model requires fixed 224x224 input.
- Architecture is loaded via the `open_clip` pip package (no repo clone needed).
- No checkpoints to download manually; `open_clip` fetches pretrained weights automatically during conversion.
- Output `tags.json` contains pre-computed text embeddings for ~80 photo tags (see the tagging sketch after this list).
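Because both the image embedding and the pre-computed text embeddings are L2-normalized, zero-shot tagging reduces to a dot product. A minimal sketch, assuming (the actual schema may differ) that tags.json is a JSON object mapping each tag name to its 512-dimensional text embedding:

```python
# Sketch only: rank tags by cosine similarity (dot product of unit vectors).
import json
import numpy as np

def top_tags(image_embedding: np.ndarray, tags_path: str = "tags.json", k: int = 5):
    with open(tags_path) as f:
        tags = json.load(f)  # assumed layout: {"sunset": [...512 floats...], ...}
    names = list(tags.keys())
    text_matrix = np.asarray([tags[n] for n in names], dtype=np.float32)  # (~80, 512)
    scores = text_matrix @ image_embedding.reshape(512)  # cosine similarity per tag
    order = np.argsort(scores)[::-1][:k]
    return [(names[i], float(scores[i])) for i in order]
```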
| Property | Value |
|---|---|
| Model license | MIT |
| OSAID v1.0 | Open Source AI |
| MOF | Class II (Open Tooling) |
| Training data license | LAION-2B (CC-BY-4.0 metadata); images are web-crawled with mixed licenses |
| Training data provenance | LAION-2B: 2B image-text pairs from Common Crawl, filtered using CLIP for quality |
| Training code | MIT |
| Known limitations | Training images are web-crawled; individual image licenses are not verified |
| Published research | Reproducible scaling laws for contrastive language-image learning |
| Inference | Local only, no cloud dependencies |
| Scope | Image embedding for tagging and similarity (no generation or synthesis) |
| Reproducibility | Full pipeline (setup, convert, clean, demo) |