Multi-Modal Embedding Service

Python service for generating multi-modal embeddings for social media content.

Features

  • CLIP Visual Embeddings (512-dim)
  • Text Embeddings (768-dim) from captions and OCR
  • OCR Extraction using EasyOCR
  • NSFW Classification and Content Type Detection
  • Video Support with frame extraction
  • Batch Processing

Quick Start

Local Development

pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

API Endpoints

Health Check

GET /health
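A minimal sketch of probing the health endpoint from Python using only the standard library. The base URL assumes the local development port from Quick Start; the response body format is not documented here, so the check relies only on the HTTP status code. The helper names `health_url` and `is_healthy` are illustrative, not part of the service.

```python
from urllib import request


def health_url(base_url: str) -> str:
    """Build the health-check URL from a base URL (trailing slash tolerated)."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False otherwise.

    Assumption: a 200 status means "healthy"; the body is ignored.
    """
    try:
        with request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError:
        # covers connection refused, timeouts, and non-2xx responses.
        return False
```

Calling `is_healthy()` before sending extraction requests gives a quick way to confirm the service is up.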

Multi-Modal Extraction

POST /extract-multimodal

  • files: Image or video files to process
  • caption: Optional caption text
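The request above could be built roughly as follows. This is a sketch under assumptions: it treats `files` and `caption` as multipart form fields (the README does not state the encoding), uses only the standard library, and the helpers `build_multipart` and `make_request` are hypothetical names introduced for illustration. The response schema is not documented here, so the example stops at constructing the request.

```python
import uuid
from urllib import request


def build_multipart(caption, files):
    """Encode an optional caption and a list of files as multipart/form-data.

    files: list of (filename, raw_bytes, content_type) tuples.
    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    if caption is not None:
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="caption"\r\n\r\n'
                f"{caption}\r\n"
            ).encode("utf-8")
        )
    for filename, data, content_type in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def make_request(base_url, caption, files):
    """Build (but do not send) a POST request for /extract-multimodal."""
    body, content_type = build_multipart(caption, files)
    return request.Request(
        base_url.rstrip("/") + "/extract-multimodal",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(req)` and read the response; a client library such as `requests` would handle the multipart encoding for you.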

Legacy Endpoints

  • POST /extract-features (Images)
  • POST /extract-features-video (Videos)
  • POST /extract-features-text (Text only)
  • POST /extract-ocr (OCR only)
  • POST /classify-nsfw (NSFW only)

Configuration

Environment variables:

  • QDRANT_URL: Qdrant connection string
  • MEDIA_STORAGE_PATH: Path to media files
  • USE_GPU: Enable GPU acceleration (default: false)
  • PORT: Service port (default: 8000)
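One way these variables might be read at startup is sketched below. The `Settings` class and `load_settings` function are hypothetical and not taken from the service's code; the set of strings accepted as "true" for `USE_GPU` is also an assumption. Only the variable names and the two defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    qdrant_url: Optional[str]          # QDRANT_URL, no documented default
    media_storage_path: Optional[str]  # MEDIA_STORAGE_PATH, no documented default
    use_gpu: bool                      # USE_GPU, default: false
    port: int                          # PORT, default: 8000


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read service settings from an environment-like mapping."""
    # Assumption: "1", "true", and "yes" (case-insensitive) enable the GPU.
    use_gpu = env.get("USE_GPU", "false").strip().lower() in {"1", "true", "yes"}
    return Settings(
        qdrant_url=env.get("QDRANT_URL"),
        media_storage_path=env.get("MEDIA_STORAGE_PATH"),
        use_gpu=use_gpu,
        port=int(env.get("PORT", "8000")),
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to exercise in tests without touching the real environment.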

Models

  • CLIP: openai/clip-vit-base-patch32
  • Text: sentence-transformers/all-mpnet-base-v2
  • NSFW: JanadaSroor/vit-nsfw-classifier

Performance (CPU)

  • Image: ~500ms
  • OCR: ~1-2s
  • Text: ~50ms
  • Video: ~5-10s (10 frames)

License

Apache License 2.0.