A Streamlit application to compare different methods for embedding video clips. Test and benchmark three different approaches to understand which works best for your use case.
- Upload and process video files (MP4, AVI, MOV, MKV)
- Configurable chunk duration - Split videos into chunks of 1-30 seconds (see the chunking sketch after this list)
- Three embedding methods:
  - Method A: Image + Text Model (CLIP) - Extracts key frames and embeds using CLIP
  - Method B: Video + Text Model - Processes multiple frames for temporal understanding
  - Method C: LLM Description + Text - Generates descriptions and embeds with sentence transformers
- Performance metrics - Compare processing time, embedding dimensions, and quality
- Visual comparison - Side-by-side metrics and recommendations
- GPU support - Optional GPU acceleration for faster processing
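The chunking step can be pictured with a short OpenCV sketch. The function name `split_into_chunks` and the frame-grouping details below are illustrative assumptions, not the app's actual implementation:

```python
# Illustrative sketch of fixed-duration chunking with OpenCV (not the app's exact code).
import cv2

def split_into_chunks(video_path, chunk_duration=5.0):
    """Yield lists of RGB frames, each list covering roughly `chunk_duration` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0            # fall back if FPS is unreported
    frames_per_chunk = max(1, int(round(fps * chunk_duration)))

    chunk = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        chunk.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV reads BGR
        if len(chunk) == frames_per_chunk:
            yield chunk
            chunk = []
    if chunk:                                          # trailing partial chunk
        yield chunk
    cap.release()
```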
```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py
# OR
./run_app.sh
```
Then open your browser to http://localhost:8501.
- Clone the repository:
```bash
git clone https://github.com/ClipABit/multimodal-embeddings.git
cd multimodal-embeddings
```
- Create a virtual environment (recommended):
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Start the Streamlit app:
```bash
streamlit run app.py
```
- Open your browser to the URL shown (typically http://localhost:8501)
- Configure settings in the sidebar (a sidebar sketch follows this list):
  - Select chunk duration (1-30 seconds)
  - Enable/disable GPU acceleration
- Upload a video file
- Click "Process Video with All Methods" to run the comparison
- Review the performance metrics and recommendations
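The sidebar configuration described above could look roughly like this in Streamlit; the widget labels and variable names are assumptions, not the app's exact code:

```python
# Illustrative sketch of the sidebar controls (labels and variable names are assumptions).
import streamlit as st
import torch

chunk_duration = st.sidebar.slider("Chunk duration (seconds)", min_value=1, max_value=30, value=5)
use_gpu = st.sidebar.checkbox("Use GPU acceleration", value=torch.cuda.is_available())
device = "cuda" if use_gpu and torch.cuda.is_available() else "cpu"

video_file = st.file_uploader("Upload a video", type=["mp4", "avi", "mov", "mkv"])
if video_file is not None and st.button("Process Video with All Methods"):
    st.write(f"Processing on {device} with {chunk_duration}s chunks...")
```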
Method A: Image + Text Model (CLIP)
- Samples 5 representative frames per chunk
- Uses OpenAI's CLIP model for image embeddings
- Fast processing, good for static visual content
- May miss temporal dynamics
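A minimal sketch of Method A's idea using the Hugging Face `transformers` CLIP API; the checkpoint, the evenly spaced sampling, and the mean-pooling of frame embeddings are assumptions about how such a method can be implemented:

```python
# Sketch of Method A: embed a few sampled frames with CLIP and average them.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_chunk_clip(frames, num_samples=5):
    # Pick evenly spaced RGB frames from the chunk (e.g. from the chunker sketched above).
    idx = np.linspace(0, len(frames) - 1, num=min(num_samples, len(frames)), dtype=int)
    images = [Image.fromarray(frames[i]) for i in idx]
    inputs = clip_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip_model.get_image_features(**inputs)   # (num_samples, 512)
    return feats.mean(dim=0).cpu().numpy()                # one vector per chunk
```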
Method B: Video + Text Model
- Processes 16 frames per chunk
- Better captures motion and temporal sequences
- Slower but more comprehensive
- Better for action-heavy videos
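One way to realize a video+text embedder is with an X-CLIP checkpoint from `transformers`. The specific checkpoint and the pooling below are assumptions, shown only to illustrate the multi-frame approach:

```python
# Sketch of Method B: embed 16 frames jointly with a video-text model (X-CLIP here).
import numpy as np
import torch
from transformers import AutoProcessor, XCLIPModel

ckpt = "microsoft/xclip-base-patch32-16-frames"   # assumed checkpoint; trained on 16-frame clips
xclip_processor = AutoProcessor.from_pretrained(ckpt)
xclip_model = XCLIPModel.from_pretrained(ckpt)

def embed_chunk_video(frames, num_frames=16):
    idx = np.linspace(0, len(frames) - 1, num=num_frames, dtype=int)
    sampled = [frames[i] for i in idx]                         # list of RGB arrays
    inputs = xclip_processor(videos=sampled, return_tensors="pt")
    with torch.no_grad():
        feats = xclip_model.get_video_features(**inputs)       # (1, 512)
    return feats[0].cpu().numpy()
```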
Method C: LLM Description + Text
- Generates text descriptions of video content
- Embeds descriptions with sentence transformers
- Fastest method
- Quality depends on description accuracy
- Note: Demo uses simplified descriptions; production would use VLMs like BLIP or GPT-4V
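The embedding half of Method C is straightforward with `sentence-transformers`; the model choice and the example description below are assumptions, and in production the description itself would come from a VLM:

```python
# Sketch of Method C: embed a (pre-generated) chunk description with sentence-transformers.
from sentence_transformers import SentenceTransformer

text_model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings

def embed_chunk_description(description):
    # The description is assumed to already exist as a plain string
    # (simplified in the demo; a VLM such as BLIP or GPT-4V in production).
    return text_model.encode(description, normalize_embeddings=True)

vector = embed_chunk_description("A person walks a dog through a park on a sunny day.")
```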
The app provides:
- Processing Time: Total time to process all chunks
- Avg Time/Chunk: Average processing time per chunk
- Embedding Dimension: Size of the embedding vectors
- Chunk Similarity: Consistency between consecutive chunks (one possible definition is sketched below)
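One plausible way to compute the chunk-similarity metric is the mean cosine similarity between consecutive chunk embeddings; the app's exact formula may differ:

```python
# One plausible definition of "chunk similarity": mean cosine similarity
# between consecutive chunk embeddings (the app's exact formula may differ).
import numpy as np

def chunk_similarity(embeddings):
    """embeddings: (num_chunks, dim) array, one row per chunk."""
    if len(embeddings) < 2:
        return 1.0                                     # a single chunk is trivially consistent
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = np.sum(normed[:-1] * normed[1:], axis=1)     # cosine of each consecutive pair
    return float(cos.mean())
```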
- Python 3.8+
- CUDA-compatible GPU (optional, for faster processing)
- 4GB+ RAM recommended
- Internet connection (for first-time model downloads)
- See requirements.txt for detailed dependencies
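To check whether the optional CUDA GPU is actually visible to PyTorch before enabling GPU acceleration, a quick test like this works:

```python
# Quick check that PyTorch can see a CUDA GPU before enabling GPU acceleration.
import torch

print(torch.cuda.is_available())           # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the detected GPU
```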
- USAGE_GUIDE.md - Detailed usage instructions and tips
- DEMO.md - Examples and expected outputs
- CONTRIBUTING.md - How to contribute to the project
```
multimodal-embeddings/
├── app.py               # Main Streamlit application
├── requirements.txt     # Python dependencies
├── run_app.sh           # Shell script to start the app
├── README.md            # This file
├── USAGE_GUIDE.md       # Detailed usage guide
├── DEMO.md              # Demo and examples
├── CONTRIBUTING.md      # Contributing guidelines
└── .gitignore           # Git ignore patterns
```
- Upload a video - Choose any supported video file
- Configure settings - Select chunk duration and GPU usage
- Process - The app (see the benchmark sketch after this list):
  - Splits video into chunks based on your selected duration
  - Processes each chunk with all 3 methods
  - Measures performance metrics
- Compare results - View side-by-side comparison of:
  - Processing times
  - Embedding quality
  - Recommendations for your use case
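Putting the pieces together, a benchmarking loop in the spirit of the app could look like the sketch below. The function names come from the hypothetical sketches earlier in this README, not from `app.py`:

```python
# Illustrative benchmarking loop tying the earlier sketches together
# (function names are from the hypothetical sketches, not the app's actual API).
import time
import numpy as np

methods = {
    "A: CLIP frames": lambda chunk: embed_chunk_clip(chunk),
    "B: video model": lambda chunk: embed_chunk_video(chunk),
    "C: description": lambda chunk: embed_chunk_description("placeholder description"),
}

chunks = list(split_into_chunks("sample.mp4", chunk_duration=5.0))
results = {}
for name, embed in methods.items():
    start = time.perf_counter()
    vectors = np.stack([embed(chunk) for chunk in chunks])   # one embedding per chunk
    elapsed = time.perf_counter() - start
    results[name] = {
        "total_time_s": elapsed,
        "avg_time_per_chunk_s": elapsed / len(chunks),
        "embedding_dim": vectors.shape[1],
        "chunk_similarity": chunk_similarity(vectors),
    }
```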
The app tracks and displays:
- Processing Time: How long each method takes
- Avg Time/Chunk: Average processing time per video chunk
- Embedding Dimension: Size of the embedding vectors
- Chunk Similarity: Consistency between consecutive chunks
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
MIT License
- OpenAI CLIP for image-text embeddings
- Sentence Transformers for text embeddings
- Streamlit for the web interface
