Trustworthy, low-latency RAG that keeps learning from text, images, video, and audio, without re-indexing.
LUMA maintains a cheap, streaming cross-modal alignment (CLAP→CLIP bridge), prioritizes memory under a budget, and exposes a provable safety signal for each answer.
| Component | Description |
|---|---|
| Canonical space | CLIP (ViT-B/32, d=512) |
| Modalities | Text, Images, Video (frames), Audio (CLAP) |
| Indexing | Hot HNSW (RAM) + Warm IVFPQ (compressed, adaptive) |
| Alignment | Incremental Procrustes (streaming CLAP→CLIP bridge) |
| Safety | Safe@k = margin_top2 vs (ε alignment drift + ζ PQ distortion) |
| UI | Streamlit app with live ingestion, retrieval, citations, telemetry |
- 🧩 Streaming alignment: integrates new modalities online with no re-indexing (a minimal Procrustes sketch follows this list).
- 🛡️ Safety you can see: every query reports margin_top2, ε, ζ, and Safe@1.
- 🧮 Real-world memory: adaptive hot/warm tiers with IVFPQ compression.
- ⚙️ Provable stability: Safe@1 ensures the top-1 result won't flip under drift.
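The streaming bridge is what keeps audio (CLAP) embeddings searchable against the CLIP-space index. A minimal sketch of the underlying idea, assuming paired CLAP/CLIP embeddings are buffered and the map is re-solved by orthogonal Procrustes; the class and method names are illustrative, and the repository's `incremental_procrustes.py` may differ in details:

```python
import numpy as np

def solve_procrustes(clap_vecs: np.ndarray, clip_vecs: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimizing ||clap_vecs @ W - clip_vecs||_F (classic Procrustes)."""
    u, _, vt = np.linalg.svd(clap_vecs.T @ clip_vecs)
    return u @ vt

class StreamingBridge:
    """Toy CLAP→CLIP bridge: buffer paired embeddings, refresh W, report drift ε."""

    def __init__(self, d: int = 512):
        self.W = np.eye(d)
        self.clap_buf: list[np.ndarray] = []
        self.clip_buf: list[np.ndarray] = []

    def add_pair(self, clap_vec: np.ndarray, clip_vec: np.ndarray) -> None:
        self.clap_buf.append(clap_vec)
        self.clip_buf.append(clip_vec)

    def refresh(self) -> float:
        """Re-solve the map and return ε, the spectral-norm change since the last refresh."""
        w_new = solve_procrustes(np.stack(self.clap_buf), np.stack(self.clip_buf))
        epsilon = float(np.linalg.norm(w_new - self.W, ord=2))
        self.W = w_new
        return epsilon

    def to_clip_space(self, clap_vec: np.ndarray) -> np.ndarray:
        """Map a CLAP embedding into the canonical CLIP space for retrieval."""
        return clap_vec @ self.W
```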
Requirements:

- Python 3.10 or 3.11
- OS: Windows / Linux / macOS
- GPU recommended (NVIDIA)
- FFmpeg (required for video/audio)
Install FFmpeg:

```bash
# Windows
choco install ffmpeg

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg
```

Set up the project:

```bash
# Clone the repository
git clone https://github.com/rohan9024/LUMA.git
cd LUMA
# Create virtual environment
python -m venv venv
# Activate environment
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate
# Install PyTorch (choose one based on your setup)
# CUDA 11.8 build (recommended for NVIDIA GPUs):
pip install --index-url https://download.pytorch.org/whl/cu118 torch torchvision torchaudio
# OR CPU-only (slower):
# pip install torch torchvision torchaudio
# Install core dependencies
pip install open-clip-torch faiss-cpu transformers sentencepiece opencv-python ffmpeg-python librosa pydub numpy scipy scikit-learn streamlit tqdm orjson fastapi uvicorn accelerate pyttsx3
```

Create or edit `config.py` in the root directory:

```python
# config.py

class ModelConfig:
    device = "auto"            # "cuda", "cpu", or "auto"
    clip_name = "ViT-B-32"
    clip_pretrained = "openai"
    d = 512                    # embedding dimension
    clap_model = "laion/clap-htsat-unfused"
    clap_d = 512

class MemoryConfig:
    budget_hot = 2000          # max items in hot (HNSW) tier
    budget_warm = 10000        # max items in warm (IVFPQ) tier
    # HNSW parameters
    hnsw_M = 16
    hnsw_ef_construction = 200
    hnsw_ef_search = 50
    # IVFPQ parameters
    ivf_nlist = 32
    pq_m = 8
    pq_nbits = 8

class AlignmentConfig:
    refresh_interval = 100     # refresh alignment every N audio queries
    epsilon_threshold = 0.1    # alignment drift warning threshold

class SafetyConfig:
    k = 10                     # retrieve top-k results
    safety_multiplier = 2.0    # Safe@1 = margin_top2 > 2*(ε + ζ)
```
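The HNSW and IVFPQ settings above correspond to standard FAISS index types. As a rough illustration of how such a two-tier setup could be built directly with `faiss` (this is not the code in `luma/index/`; the random training vectors and the 500-item hot budget are placeholders):

```python
import faiss
import numpy as np

d = 512  # ModelConfig.d

# Hot tier: HNSW graph held in RAM, tuned by M / efConstruction / efSearch.
hot = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
hot.hnsw.efConstruction = 200
hot.hnsw.efSearch = 50

# Warm tier: IVFPQ stores compressed codes (pq_m=8 sub-vectors, 8 bits each)
# and must be trained before vectors can be added.
quantizer = faiss.IndexFlatIP(d)
warm = faiss.IndexIVFPQ(quantizer, d, 32, 8, 8, faiss.METRIC_INNER_PRODUCT)

vecs = np.random.rand(2048, d).astype("float32")  # stand-in for real CLIP embeddings
faiss.normalize_L2(vecs)
warm.train(vecs)

hot.add(vecs[:500])    # items within the hot budget stay in the exact tier
warm.add(vecs[500:])   # overflow spills into the compressed tier

scores, ids = hot.search(vecs[:1], 10)  # top-10 neighbours from the hot tier
```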
To launch the UI:

```bash
# Start the Streamlit web interface
streamlit run web/app.py
# OR
python -m streamlit run web/app.py
```

Then open the local URL shown in your terminal (typically http://localhost:8501).
Features in the UI:
- Ingest text, images, video, and audio
- Query across all modalities
- View per-query telemetry: margin_top2, ε, ζ, Safe@1
- Citations with source tracking
```bash
# 1. Generate captions for your images using BLIP
pip install accelerate
python -m scripts.gen_captions --dir data/inbox --out data/captions_inbox.txt
# 2. Evaluate retrieval performance
python -u -m scripts.eval_folder \
--dir data/inbox \
--captions data/captions_inbox.txt \
--no-align \
  --hot_budget 1000
```

Expected Results (the metric definitions are sketched after this list):
- Recall@10 ≈ 0.94
- MRR ≈ 0.59
- Safe@1 ≈ 1.00
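Recall@k and MRR follow their standard definitions over each query's ranked result list; a minimal reference sketch (the evaluation scripts may differ in bookkeeping):

```python
def recall_at_k(ranked_ids, relevant_id, k=10):
    """1.0 if the relevant item appears in the top-k results, else 0.0."""
    return float(relevant_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant item in the result list, or 0.0 if absent."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0

# Averaging recall_at_k over all queries gives Recall@10; averaging
# reciprocal_rank over all queries gives MRR.
```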
```bash
# 1. Augment dataset with copies to simulate large-scale data
python -m scripts.augment_copies \
--src data/inbox \
--out data/inbox_aug \
--copies 20
# 2. Generate captions for augmented data
python -m scripts.gen_captions \
--dir data/inbox_aug \
--out data/captions_aug.txt
# 3. Evaluate with memory budget constraints
python -u -m scripts.eval_folder_group \
--dir data/inbox_aug \
--captions data/captions_aug.txt \
  --hot_budget 500
```

Expected Results:
- Shows adaptive IVFPQ training
- Group-aware Recall@10 ≈ 0.53
- Demonstrates hot/warm tier switching (a minimal spill-policy sketch follows this list)
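The tier switching exercised here amounts to a budgeted eviction: once the hot tier holds more than `hot_budget` items, the overflow moves into the compressed IVFPQ tier. A minimal sketch of that idea, assuming simple `add`/`remove` index wrappers; the real policy in `luma/memory/policy.py` may use a different retention criterion:

```python
from collections import deque

class BudgetedMemory:
    """Keep at most `budget_hot` items in the hot tier; spill the overflow to warm."""

    def __init__(self, hot_index, warm_index, budget_hot=500):
        self.hot_index = hot_index      # assumed wrapper with add()/remove()
        self.warm_index = warm_index    # assumed wrapper with add()
        self.budget_hot = budget_hot
        self.hot_items = deque()        # FIFO retention; a learned policy could replace this

    def ingest(self, item_id, vector):
        self.hot_index.add(item_id, vector)
        self.hot_items.append((item_id, vector))
        while len(self.hot_items) > self.budget_hot:
            old_id, old_vec = self.hot_items.popleft()
            self.hot_index.remove(old_id)           # evict from the exact HNSW tier
            self.warm_index.add(old_id, old_vec)    # re-insert into the compressed tier
```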
```bash
# 1. Generate TTS audio from captions
pip install pyttsx3
python -m scripts.gen_tts_audio \
--captions data/captions_inbox.txt \
--audio_dir data/audio_inbox
# 2. Evaluate audio-to-image retrieval
python -u -m scripts.eval_audio_query \
--img_dir data/inbox \
--captions data/captions_inbox.txt \
--audio_dir data/audio_inbox \
  --k 10
```

Expected Results (the audio→image query path is sketched after this list):
- Recall@10 ≈ 0.42
- ε ≈ 0.00 after refresh
- Safe@1 ≈ 1.00
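The audio→image path evaluated above embeds each query with CLAP, maps it through the learned bridge, and searches the CLIP-space index. A rough sketch using the Hugging Face CLAP checkpoint named in the config; the bridge matrix `W`, the `index` object, and the example file path are assumptions carried over from the earlier sketches, not code from the repository:

```python
import librosa
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def embed_audio(path: str) -> np.ndarray:
    """Return a unit-norm CLAP audio embedding for one file."""
    waveform, sr = librosa.load(path, sr=48000)  # CLAP checkpoints expect 48 kHz audio
    inputs = processor(audios=waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_audio_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.squeeze(0).numpy()

# W is the bridge matrix and index the hot-tier FAISS index from the sketches above;
# the audio path is a placeholder.
query = (embed_audio("data/audio_inbox/example.wav") @ W).astype("float32")[None, :]
scores, ids = index.search(query, 10)
```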
Project structure:

```
luma/
├── config.py                        # Configuration settings
├── luma/
│   ├── embedders/
│   │   ├── image_embedder.py        # CLIP image encoder
│   │   ├── text_embedder.py         # CLIP text encoder
│   │   ├── audio_embedder.py        # CLAP audio encoder
│   │   └── video_embedder.py        # Video frame processing
│   ├── alignment/
│   │   └── incremental_procrustes.py  # CLAP→CLIP alignment
│   ├── index/
│   │   ├── hnsw.py                  # Hot tier (FAISS HNSW)
│   │   ├── ivfpq.py                 # Warm tier (FAISS IVFPQ)
│   │   └── manager.py               # Multi-tier index manager
│   ├── memory/
│   │   └── policy.py                # Memory budget management
│   ├── ingest/
│   │   └── pipeline.py              # Data ingestion pipeline
│   ├── retrieval/
│   │   └── engine.py                # Retrieval with safety metrics
│   └── rag/
│       └── generator.py             # RAG response generation
├── web/
│   └── app.py                       # Streamlit web interface
├── scripts/
│   ├── run_server.py                # FastAPI server
│   ├── gen_captions.py              # BLIP caption generation
│   ├── gen_tts_audio.py             # TTS audio generation
│   ├── eval_folder.py               # Single-folder evaluation
│   ├── eval_folder_group.py         # Multi-tier evaluation
│   ├── eval_audio_query.py          # Audio retrieval evaluation
│   ├── augment_copies.py            # Dataset augmentation
│   └── profile_latency.py           # Performance profiling
├── data/
│   ├── inbox/                       # Your image data
│   ├── audio_inbox/                 # Your audio data
│   └── captions_inbox.txt           # Caption file
└── README.md
```
For each query, LUMA computes:
| Metric | Meaning |
|---|---|
| margin_top2 | Similarity gap between top-1 and top-2 results |
| ε | Alignment drift (spectral-norm change since the last refresh) |
| ζ | PQ distortion (reconstruction error for the warm tier) |
| Safe@1 | True if margin_top2 > 2*(ε + ζ) |
Interpretation (a computation sketch follows):
- ✅ Green (Safe@1 = True): the top-1 result is stable and trustworthy.
- ⚠️ Yellow/Red: consider checking more results (use a broader k).
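A minimal sketch of how the Safe@1 check can be evaluated from a query's ranked similarity scores and the current ε and ζ estimates (illustrative only; the actual logic lives in the retrieval engine):

```python
def safe_at_1(scores, epsilon, zeta, multiplier=2.0):
    """Return (margin_top2, Safe@1) given similarity scores and current ε / ζ estimates."""
    ranked = sorted(scores, reverse=True)
    margin_top2 = ranked[0] - ranked[1]              # gap between top-1 and runner-up
    return margin_top2, margin_top2 > multiplier * (epsilon + zeta)

# Example: a clear winner survives the combined drift + distortion budget.
margin, ok = safe_at_1([0.81, 0.52, 0.47], epsilon=0.02, zeta=0.05)
print(round(margin, 2), ok)   # 0.29 True, since 0.29 > 2 * (0.02 + 0.05)
```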
| Task | Dataset | Metrics |
|---|---|---|
| Text→Image | 31 images, BLIP captions | nDCG@10 ≈ 0.67, Recall@10 ≈ 0.94, Safe@1 = 1.00 |
| Memory/PQ spill | 620 images, hot_budget=500 | Recall@10 ≈ 0.53, MRR ≈ 0.47 |
| Audio→Image | TTS audio | Recall@10 ≈ 0.42, ε ≈ 0.00, Safe@1 ≈ 1.00 |
| Issue | Solution |
|---|---|
| Torch not compiled with CUDA | Reinstall with the CUDA wheel: `pip install --index-url https://download.pytorch.org/whl/cu118 torch torchvision torchaudio` |
| Paging file too small (Windows) | Use ViT-B/32 or increase the Windows pagefile size |
| Relative import error | Run from the repo root with absolute imports: `python -m scripts.eval_folder` |
| FFmpeg not found | Add FFmpeg to PATH or reinstall it |
| FAISS "nx ≥ k" error | Reduce `ivf_nlist` in the config or ingest more data |
| 0.00 ms latency | Ignore (Windows timer granularity issue) |
| QuickGELU mismatch warning | Harmless, can be ignored |
| Missing captions | Ensure --captions path is correct |
LUMA is a streaming multimodal RAG system with:
- Online CLAP→CLIP alignment bridge (no re-indexing needed)
- Provable per-query stability guarantee (Safe@k metric)
- Budgeted multi-tier memory (hot HNSW + warm IVFPQ)
If you use or extend LUMA, please cite:
```bibtex
@misc{luma2025,
  title  = {LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment},
  author = {Rohan Wandre},
  year   = {2025},
  note   = {https://github.com/rohan9024/luma}
}
```

Planned extensions:
- RL-based memory policy (learned retention)
- Non-linear hyper-network alignment
- Public multimodal benchmarks (Flickr30k, MSR-VTT, AudioCaps)
- Faithfulness & hallucination auditing
- Docker container for easy deployment
- Support for more audio formats (MP3, FLAC, etc.)
- Video understanding with temporal modeling
- Multi-user support with isolated memory spaces
- OpenCLIP team for CLIP models
- CLAP team for audio embeddings
- FAISS for efficient similarity search
- BLIP for image captioning
- PyTorch, HuggingFace Transformers, Streamlit, scikit-learn communities
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Commit changes: `git commit -am 'Add feature'`
- Push to the branch: `git push origin feature-name`
- Submit a Pull Request
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
Built with ❤️ for the multimodal AI community