
🌌 LUMA-RAG: Lifelong Multimodal Retrieval-Augmented Agent

Trustworthy, low-latency RAG that keeps learning from text, images, video, and audio without re-indexing.

LUMA maintains a cheap, streaming cross-modal alignment (a CLAP→CLIP bridge), prioritizes memory under a budget, and exposes a provable safety signal for each answer.


🧠 Overview

  • Canonical space: CLIP (ViT-B/32, d=512)
  • Modalities: text, images, video (frames), audio (CLAP)
  • Indexing: hot HNSW (RAM) + warm IVFPQ (compressed, adaptive)
  • Alignment: incremental Procrustes (streaming CLAP→CLIP bridge)
  • Safety: Safe@k = margin_top2 vs (ε alignment drift + ζ PQ distortion)
  • UI: Streamlit app with live ingestion, retrieval, citations, telemetry
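Everything is retrieved in this single canonical space. To make "canonical space" concrete, here is a minimal open-clip-torch sketch (illustrative only, not the repo's embedder code; the image path is hypothetical) that encodes a caption and an image into the shared 512-d ViT-B/32 space and scores them by cosine similarity:

import torch
import open_clip
from PIL import Image

# Canonical CLIP ViT-B/32 model (d=512), matching config.py
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("data/inbox/example.jpg")).unsqueeze(0)  # hypothetical file
text = tokenizer(["a photo of a dog"])

with torch.no_grad():
    img_vec = model.encode_image(image)
    txt_vec = model.encode_text(text)

# Normalize so the inner product is cosine similarity, as used for retrieval scores
img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)
txt_vec = txt_vec / txt_vec.norm(dim=-1, keepdim=True)
print("cosine similarity:", (img_vec @ txt_vec.T).item())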

🚀 Why LUMA is Different

  • 🧩 Streaming alignment: integrates new modalities online (no re-indexing).
  • 🛡️ Safety you can see: every query shows margin_top2, ε, ζ, Safe@1.
  • 🧮 Real-world memory: adaptive hot/warm tiers with IVFPQ compression.
  • ⚖️ Provable stability: Safe@1 ensures the top-1 result won't flip under drift.

⚙️ Quick Start

1️⃣ Prerequisites

  • Python 3.10 or 3.11
  • OS: Windows / Linux / macOS
  • GPU recommended (NVIDIA)
  • FFmpeg (required for video/audio processing)
# Windows
choco install ffmpeg

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg
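An optional check that FFmpeg is actually reachable from Python before ingesting video or audio (just a sketch; it only verifies the binary is on PATH):

import shutil

# FFmpeg must be on PATH for video/audio decoding
print(shutil.which("ffmpeg") or "ffmpeg not found on PATH")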

2️⃣ Create Environment & Install Dependencies

# Clone the repository
git clone https://github.com/rohan9024/LUMA.git
cd LUMA

# Create virtual environment
python -m venv venv

# Activate environment
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate

# Install PyTorch (choose one based on your setup)
# CUDA 11.8 build (recommended for NVIDIA GPUs):
pip install --index-url https://download.pytorch.org/whl/cu118 torch torchvision torchaudio

# OR CPU-only (slower):
# pip install torch torchvision torchaudio

# Install core dependencies
pip install open-clip-torch faiss-cpu transformers sentencepiece opencv-python ffmpeg-python librosa pydub numpy scipy scikit-learn streamlit tqdm orjson fastapi uvicorn accelerate pyttsx3
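Once the installs finish, an optional sanity check like the following can catch most environment problems early (on CPU-only installs CUDA will simply report as unavailable):

import torch
import faiss
import open_clip          # import check only
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss:", faiss.__version__)
print("transformers:", transformers.__version__)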

3️⃣ Configure

Create or edit config.py in the root directory:

# config.py
class ModelConfig:
    device = "auto"  # "cuda", "cpu", or "auto"
    clip_name = "ViT-B-32"
    clip_pretrained = "openai"
    d = 512  # embedding dimension
    
    clap_model = "laion/clap-htsat-unfused"
    clap_d = 512

class MemoryConfig:
    budget_hot = 2000  # max items in hot (HNSW) tier
    budget_warm = 10000  # max items in warm (IVFPQ) tier
    
    # HNSW parameters
    hnsw_M = 16
    hnsw_ef_construction = 200
    hnsw_ef_search = 50
    
    # IVFPQ parameters
    ivf_nlist = 32
    pq_m = 8
    pq_nbits = 8

class AlignmentConfig:
    refresh_interval = 100  # refresh alignment every N audio queries
    epsilon_threshold = 0.1  # alignment drift warning threshold

class SafetyConfig:
    k = 10  # retrieve top-k results
    safety_multiplier = 2.0  # Safe@1 = margin_top2 > 2*(ε + ζ)
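The MemoryConfig values map directly onto FAISS index types. Purely as an illustration of how a two-tier setup with these parameters could be built (a sketch with random placeholder vectors, not luma/index/manager.py):

import faiss
import numpy as np

d = 512  # canonical CLIP dimension

# Hot tier: in-RAM HNSW graph
hot = faiss.IndexHNSWFlat(d, 16)            # hnsw_M
hot.hnsw.efConstruction = 200               # hnsw_ef_construction
hot.hnsw.efSearch = 50                      # hnsw_ef_search

# Warm tier: compressed IVFPQ (must be trained before adding vectors)
quantizer = faiss.IndexFlatL2(d)
warm = faiss.IndexIVFPQ(quantizer, d, 32, 8, 8)  # ivf_nlist, pq_m, pq_nbits

xb = np.random.rand(4096, d).astype("float32")   # placeholder vectors
warm.train(xb)

hot.add(xb[:2000])    # stay within budget_hot
warm.add(xb[2000:])   # overflow lives compressed in the warm tier

D, I = hot.search(xb[:1], 10)
print(I[0])           # ids of the 10 nearest hot-tier neighbours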

4️⃣ Run the App

# Start the Streamlit web interface
streamlit run web/app.py

# OR
python -m streamlit run web/app.py

Then open the local URL shown in your terminal (typically http://localhost:8501).

Features in the UI:

  • 📥 Ingest text, images, video, audio
  • 🔍 Query across all modalities
  • 📊 View telemetry: margin_top2, ε, ζ, Safe@1
  • 📚 Citations with source tracking

🧩 Usage Examples

A) Text → Image Retrieval

# 1. Generate captions for your images using BLIP
pip install accelerate
python -m scripts.gen_captions --dir data/inbox --out data/captions_inbox.txt

# 2. Evaluate retrieval performance
python -u -m scripts.eval_folder \
  --dir data/inbox \
  --captions data/captions_inbox.txt \
  --no-align \
  --hot_budget 1000

Expected Results:

  • Recall@10 ≈ 0.94
  • MRR ≈ 0.59
  • Safe@1 ≈ 1.00
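For reference, Recall@10 and MRR above are standard ranking metrics; a minimal sketch of how they are typically computed from per-query rankings (illustrative toy data, not the repo's eval code):

import numpy as np

def recall_at_k(ranked_ids, relevant_id, k=10):
    # 1.0 if the relevant item appears anywhere in the top-k, else 0.0
    return float(relevant_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids, relevant_id):
    # 1 / rank of the relevant item, 0.0 if it was not retrieved
    return 1.0 / (ranked_ids.index(relevant_id) + 1) if relevant_id in ranked_ids else 0.0

# Toy rankings: query q should retrieve image q
rankings = {0: [3, 0, 7], 1: [1, 4, 2], 2: [5, 6, 0]}
print("Recall@10:", np.mean([recall_at_k(r, q) for q, r in rankings.items()]))
print("MRR:", np.mean([reciprocal_rank(r, q) for q, r in rankings.items()]))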

B) Memory Spill (Hot/Warm Tiers)

# 1. Augment dataset with copies to simulate large-scale data
python -m scripts.augment_copies \
  --src data/inbox \
  --out data/inbox_aug \
  --copies 20

# 2. Generate captions for augmented data
python -m scripts.gen_captions \
  --dir data/inbox_aug \
  --out data/captions_aug.txt

# 3. Evaluate with memory budget constraints
python -u -m scripts.eval_folder_group \
  --dir data/inbox_aug \
  --captions data/captions_aug.txt \
  --hot_budget 500

Expected Results:

  • Shows adaptive IVFPQ training
  • Group-aware Recall@10 ≈ 0.53
  • Demonstrates hot/warm tier switching
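The behaviour being exercised here is the spill policy: once the hot HNSW tier exceeds its budget, items overflow into the compressed IVFPQ tier. A simplified FIFO sketch of such a policy (LUMA's actual retention logic lives in luma/memory/policy.py and may differ):

from collections import deque

class TwoTierMemory:
    """Toy FIFO spill policy: newest items stay hot, overflow moves to the warm tier."""

    def __init__(self, hot_budget=500):
        self.hot_budget = hot_budget
        self.hot = deque()   # (item_id, vector) pairs kept in the HNSW tier
        self.warm = []       # items spilled to the compressed IVFPQ tier

    def add(self, item_id, vector):
        self.hot.append((item_id, vector))
        while len(self.hot) > self.hot_budget:
            # Evict the oldest hot item; real policies may use recency/utility scores
            self.warm.append(self.hot.popleft())

mem = TwoTierMemory(hot_budget=500)
for i in range(620):
    mem.add(i, None)  # vectors omitted in this sketch
print(len(mem.hot), "hot /", len(mem.warm), "warm")  # 500 hot / 120 warm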

C) Audio → Image (CLAP→CLIP Bridge)

# 1. Generate TTS audio from captions
pip install pyttsx3
python -m scripts.gen_tts_audio \
  --captions data/captions_inbox.txt \
  --audio_dir data/audio_inbox

# 2. Evaluate audio-to-image retrieval
python -u -m scripts.eval_audio_query \
  --img_dir data/inbox \
  --captions data/captions_inbox.txt \
  --audio_dir data/audio_inbox \
  --k 10

Expected Results:

  • Recall@10 ≈ 0.42
  • ε ≈ 0.00 after refresh
  • Safe@1 ≈ 1.00
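Conceptually, the audio path embeds the query with CLAP and then maps it into CLIP space through the learned bridge before searching the image index. A minimal sketch with the HuggingFace checkpoint named in config.py (the bridge matrix W and the audio path are placeholders, not the repo's code):

import numpy as np
import torch
import librosa
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# CLAP expects 48 kHz mono audio
waveform, sr = librosa.load("data/audio_inbox/example.wav", sr=48000)  # hypothetical file
inputs = processor(audios=waveform, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    clap_vec = model.get_audio_features(**inputs)   # (1, 512) in CLAP space

W = np.eye(512, dtype=np.float32)                   # placeholder CLAP→CLIP bridge
clip_query = clap_vec.numpy() @ W.T                 # query now lives in CLIP space
clip_query /= np.linalg.norm(clip_query, axis=1, keepdims=True)
# clip_query can now be searched against the image index (hot HNSW / warm IVFPQ)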

📁 Repository Structure

luma/
├── config.py                    # Configuration settings
├── luma/
│   ├── embedders/
│   │   ├── image_embedder.py   # CLIP image encoder
│   │   ├── text_embedder.py    # CLIP text encoder
│   │   ├── audio_embedder.py   # CLAP audio encoder
│   │   └── video_embedder.py   # Video frame processing
│   ├── alignment/
│   │   └── incremental_procrustes.py  # CLAP→CLIP alignment
│   ├── index/
│   │   ├── hnsw.py             # Hot tier (FAISS HNSW)
│   │   ├── ivfpq.py            # Warm tier (FAISS IVFPQ)
│   │   └── manager.py          # Multi-tier index manager
│   ├── memory/
│   │   └── policy.py           # Memory budget management
│   ├── ingest/
│   │   └── pipeline.py         # Data ingestion pipeline
│   ├── retrieval/
│   │   └── engine.py           # Retrieval with safety metrics
│   └── rag/
│       └── generator.py        # RAG response generation
├── web/
│   └── app.py                  # Streamlit web interface
├── scripts/
│   ├── run_server.py           # FastAPI server
│   ├── gen_captions.py         # BLIP caption generation
│   ├── gen_tts_audio.py        # TTS audio generation
│   ├── eval_folder.py          # Single-folder evaluation
│   ├── eval_folder_group.py    # Multi-tier evaluation
│   ├── eval_audio_query.py     # Audio retrieval evaluation
│   ├── augment_copies.py       # Dataset augmentation
│   └── profile_latency.py      # Performance profiling
├── data/
│   ├── inbox/                  # Your image data
│   ├── audio_inbox/            # Your audio data
│   └── captions_inbox.txt      # Caption file
└── README.md

🧮 Safety Telemetry (Safe@k)

For each query, LUMA computes:

  • margin_top2: similarity gap between the top-1 and top-2 results
  • ε: alignment drift (spectral-norm change since the last refresh)
  • ζ: PQ distortion (reconstruction error for the warm tier)
  • Safe@1: true if margin_top2 > 2*(ε + ζ)

Interpretation:

  • ✅ Green (Safe@1 = True): the top-1 result is stable and trustworthy
  • ⚠️ Yellow/Red: consider checking more results (a broader k)
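In code, Safe@1 is just this inequality over the top-k similarity scores; a minimal sketch (ε and ζ are passed in here, whereas LUMA measures them from alignment refreshes and PQ reconstruction error):

def safe_at_1(scores, epsilon, zeta, multiplier=2.0):
    """scores: similarity scores of the top-k results, sorted descending."""
    margin_top2 = scores[0] - scores[1]
    return margin_top2 > multiplier * (epsilon + zeta), margin_top2

# Example: a comfortable margin versus small drift/distortion
ok, margin = safe_at_1([0.82, 0.64, 0.61], epsilon=0.01, zeta=0.03)
print(f"margin_top2={margin:.2f}, Safe@1={ok}")  # margin_top2=0.18, Safe@1=True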

📊 Benchmark Results

  • Text→Image (31 images, BLIP captions): nDCG@10 ≈ 0.67, Recall@10 ≈ 0.94, Safe@1 = 1.00
  • Memory/PQ spill (620 images, hot_budget=500): Recall@10 ≈ 0.53, MRR ≈ 0.47
  • Audio→Image (TTS audio): Recall@10 ≈ 0.42, ε ≈ 0.00, Safe@1 ≈ 1.00

🧰 Troubleshooting

  • "Torch not compiled with CUDA": reinstall with the CUDA wheel: pip install --index-url https://download.pytorch.org/whl/cu118 torch torchvision torchaudio
  • "Paging file too small" (Windows): use ViT-B/32 or increase the Windows pagefile size
  • Relative import errors: run from the repo root with absolute imports, e.g. python -m scripts.eval_folder
  • FFmpeg not found: add FFmpeg to PATH or reinstall it
  • FAISS "nx ≥ k" error: reduce ivf_nlist in the config or ingest more data
  • 0.00 ms latency: ignore (Windows timer granularity issue)
  • QuickGELU mismatch warning: harmless, can be ignored
  • Missing captions: ensure the --captions path is correct

🧪 Research Claims

LUMA is a streaming multimodal RAG system with:

  1. Online CLAP→CLIP alignment bridge (no re-indexing needed)
  2. Provable per-query stability guarantee (Safe@k metric)
  3. Budgeted multi-tier memory (hot HNSW + warm IVFPQ)
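Claim 1 refers to the streaming alignment bridge: paired CLAP/CLIP embeddings accumulate into a cross-covariance matrix, from which an orthogonal map is re-solved whenever the bridge is refreshed, so already-indexed items never need re-embedding. A simplified sketch of one such update (the repo's version is luma/alignment/incremental_procrustes.py and may differ in detail):

import numpy as np

class StreamingProcrustes:
    """Keeps an orthogonal CLAP→CLIP map solved from a running cross-covariance."""

    def __init__(self, d=512):
        self.C = np.zeros((d, d))   # accumulated clip^T @ clap
        self.W = np.eye(d)

    def update(self, clap_batch, clip_batch):
        # Fold in newly observed paired embeddings (rows are samples)
        self.C += clip_batch.T @ clap_batch
        U, _, Vt = np.linalg.svd(self.C, full_matrices=False)
        W_new = U @ Vt                                    # orthogonal Procrustes solution
        drift = float(np.linalg.norm(W_new - self.W, 2))  # spectral-norm change, i.e. ε
        self.W = W_new
        return drift

    def to_clip(self, clap_vecs):
        return clap_vecs @ self.W.T   # map CLAP-space queries into CLIP space

bridge = StreamingProcrustes(d=512)
clap = np.random.randn(64, 512)
clip = np.random.randn(64, 512)
print("epsilon after update:", bridge.update(clap, clip))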

Citation

If you use or extend LUMA, please cite:

@misc{luma2025,
  title  = {LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment},
  author = {Rohan Wandre},
  year   = {2025},
  note   = {https://github.com/rohan9024/luma}
}

🧭 Roadmap

  • RL-based memory policy (learned retention)
  • Non-linear hyper-network alignment
  • Public multimodal benchmarks (Flickr30k, MSR-VTT, AudioCaps)
  • Faithfulness & hallucination auditing
  • Docker container for easy deployment
  • Support for more audio formats (MP3, FLAC, etc.)
  • Video understanding with temporal modeling
  • Multi-user support with isolated memory spaces

🙏 Acknowledgements

  • OpenCLIP team for CLIP models
  • CLAP team for audio embeddings
  • FAISS for efficient similarity search
  • BLIP for image captioning
  • PyTorch, HuggingFace Transformers, Streamlit, scikit-learn communities

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Commit changes: git commit -am 'Add feature'
  4. Push to branch: git push origin feature-name
  5. Submit a Pull Request

📞 Support


Built with ❤️ for the multimodal AI community
