A Streamlit-based multimodal Retrieval-Augmented Generation (RAG) app that understands text + images inside documents and answers questions with context, citations, and memory.
Built for when PDFs stop being searchable and start being annoying.
- Upload PDF / DOCX / TXT files
- Extracts (sketched below):
  - Text
  - Images inside PDFs
  - OCR text from images (multilingual)
- Indexes everything using hybrid search:
  - Semantic (vector embeddings)
  - Keyword (BM25)
- Lets you chat with your documents
- Retrieves relevant images + text together
- Maintains chat history across sessions
Basically: your documents, but smarter and less silent.
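Roughly, the extraction step looks like the sketch below. `extract_pdf` and its shape are illustrative assumptions, not the app's actual code; the PyMuPDF and EasyOCR calls themselves are real library APIs:

```python
# Extraction sketch: text layer, embedded images, and OCR text from a PDF.
# Assumes PyMuPDF (fitz) and EasyOCR; names here are illustrative only.
import fitz  # PyMuPDF
import easyocr

def extract_pdf(path, langs=("en",)):
    reader = easyocr.Reader(list(langs))      # OCR reader for the selected languages
    doc = fitz.open(path)
    text_pages, ocr_text, images = [], [], []
    for page in doc:
        text_pages.append(page.get_text())    # the PDF's own text layer
        for img_info in page.get_images(full=True):
            img = doc.extract_image(img_info[0])  # img_info[0] is the image xref
            images.append(img["image"])           # raw image bytes
            # detail=0 makes readtext return plain strings instead of boxes
            ocr_text.extend(reader.readtext(img["image"], detail=0))
    return {"text": text_pages, "ocr": ocr_text, "images": images}
```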
- Multimodal RAG: Text + OCR + image captions all live in the same retrieval pipeline.
- Hybrid Retrieval: Combines semantic search (FAISS) with keyword search (BM25) using weighted ensembling (sketched after this list).
- Adaptive Chunking: Chunk size adjusts automatically based on document length.
- Multilingual OCR: EasyOCR with dynamic language selection (English, Hindi, Tamil, etc.).
- History-Aware QA: Follow-up questions actually understand past context.
- Image-Aware Answers: Retrieved images are surfaced alongside responses when relevant.
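The hybrid retrieval above is the standard BM25-plus-vector ensemble. A minimal LangChain sketch; the `k` values, the 0.4/0.6 weights, and the toy chunks are assumptions, not the app's tuned settings:

```python
# Hybrid retrieval sketch: FAISS (semantic) + BM25 (keyword) ensemble.
# Assumes langchain-community and rank_bm25; weights are illustrative.
from langchain.retrievers import EnsembleRetriever
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

chunks = ["chunk one ...", "chunk two ..."]   # text + OCR chunks from extraction

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
semantic = FAISS.from_texts(chunks, embeddings).as_retriever(search_kwargs={"k": 4})

keyword = BM25Retriever.from_texts(chunks)
keyword.k = 4

# Weighted ensemble: results from both retrievers are fused by rank.
retriever = EnsembleRetriever(retrievers=[keyword, semantic], weights=[0.4, 0.6])
docs = retriever.invoke("termination clause")
```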
- Frontend: Streamlit
- LLM: LLaMA 3 (70B) via Groq
- Embeddings: Sentence Transformers (MiniLM)
- Vector Store: FAISS
- Retrieval: BM25 + Semantic Ensemble
- OCR: EasyOCR
- PDF Processing: PyMuPDF
- Framework: LangChain
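A rough sketch of how the LLM side wires together via `langchain-groq`; the model ID and prompt are assumptions, not necessarily what `app.py` does:

```python
# LLM wiring sketch: Groq-hosted LLaMA 3 behind a LangChain prompt.
# Assumes langchain-groq; model ID and prompt text are illustrative.
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama3-70b-8192")  # reads GROQ_API_KEY from the environment

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer strictly from the provided context:\n\n{context}"),
    ("human", "{question}"),
])

answer = (prompt | llm).invoke({"context": "retrieved chunks go here",
                                "question": "Summarize the document."})
print(answer.content)
```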
1. Documents are uploaded and parsed
2. Text is chunked and embedded
3. Images are extracted and OCR’d
4. Everything is indexed together
5. Queries run through hybrid retrieval
6. The LLM answers using retrieved context + chat history
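Step 2's chunking adapts to document length. A toy sketch; the thresholds and overlap ratio are assumptions (the real cutoffs live in the app):

```python
# Adaptive chunking sketch: bigger documents get bigger chunks.
# Thresholds and overlap are assumptions, not the app's values.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def make_splitter(total_chars: int) -> RecursiveCharacterTextSplitter:
    if total_chars < 10_000:
        size = 500        # short docs: fine-grained chunks
    elif total_chars < 100_000:
        size = 1_000
    else:
        size = 2_000      # long docs: coarser chunks keep the index manageable
    return RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 10)

full_text = "parsed document text goes here"
chunks = make_splitter(len(full_text)).split_text(full_text)
```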
No magic. Just well-orchestrated components doing their job.
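And step 6, answering from retrieved context plus chat history, can be sketched with LangChain's history-aware helpers; the prompts, model, and one-chunk corpus are all illustrative:

```python
# History-aware QA sketch: rewrite the follow-up into a standalone
# question using chat history, then answer from retrieved context.
# Assumes LangChain's helper chains; everything below is illustrative.
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama3-70b-8192")  # reads GROQ_API_KEY from the environment
retriever = FAISS.from_texts(
    ["toy chunk about clause 7"], HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
).as_retriever()

rewrite_prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
    ("human", "Rewrite the question above as a standalone question."),
])
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer only from this context:\n\n{context}"),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

history_aware = create_history_aware_retriever(llm, retriever, rewrite_prompt)
chain = create_retrieval_chain(history_aware, create_stuff_documents_chain(llm, qa_prompt))

result = chain.invoke({"input": "And what about clause 7?", "chat_history": []})
print(result["answer"])
```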
Create a virtual environment:

```bash
python -m venv venv
```

Activate it (Windows shown; on macOS/Linux use `source venv/bin/activate`):

```bash
venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Set your Groq API key (e.g., in a `.env` file):

```bash
GROQ_API_KEY=your_key_here
```

Run the app:

```bash
streamlit run app.py
```