A Retrieval-Augmented Generation (RAG) based Document Question Answering system that enables users to ask natural language questions over documents and receive accurate, context-aware answers using semantic search and Large Language Models (LLMs).
This project demonstrates a real-world RAG pipeline combining document ingestion, vector similarity search, and LLM-powered answer generation, suitable for AI-powered search engines and enterprise knowledge assistants.
Standalone LLMs may hallucinate or rely on outdated information.
This system addresses that by first retrieving relevant document context and then generating answers grounded in the source data.
Key idea:
Retrieve first → Generate second
- 📄 Document ingestion and preprocessing
- 🔍 Semantic search using vector embeddings
- 🧠 Retrieval-Augmented Generation (RAG) architecture
- 🤖 LLM-powered answer generation
- 📊 JSON-based input/output handling
- 📝 Logging for debugging and traceability
- 🧩 Modular and extensible Python codebase
- ⚙️ Shell-based deployment support
User Question → Vector Embedding → Semantic Retriever → Relevant Document Context → LLM (GPT-based) → Final Answer
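In code, this flow boils down to embedding the question, ranking document chunks by similarity, and prompting the LLM with the best matches. A minimal sketch (the function names and signatures are illustrative; the actual implementation lives in `vectorizer.py`, `retriever.py`, and `gpt_client.py`):

```python
# Minimal retrieve-then-generate sketch; names and shapes are illustrative,
# not the project's actual API.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question_vec: np.ndarray, chunk_vecs: list, chunks: list, top_k: int = 3) -> list:
    """Return the top_k document chunks most similar to the question embedding."""
    scores = [cosine_similarity(question_vec, v) for v in chunk_vecs]
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

def build_prompt(question: str, context_chunks: list) -> str:
    """Ground the LLM: the retrieved context is passed along with the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```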
- Language: Python
- LLM Integration: GPT-based client
- Vectorization: Embedding-based semantic search
- Data Formats: JSON
- Deployment: Shell scripting
- Version Control: Git & GitHub
rag-based-document-qa/
│
├── app.py # Application entry point
├── main.py # Core execution flow
├── document_loader.py # Document ingestion & preprocessing
├── vectorizer.py # Embedding generation logic
├── retriever.py # Semantic similarity search
├── gpt_client.py # LLM interaction
├── submitter.py # Output handling
├── deploy.sh # Deployment helper script
├── requirements.txt # Python dependencies
├── questions.json # Sample input questions
├── answers_output.json # Generated answers
├── log.txt # Execution logs
└── README.md
- Clone the repository:
  `git clone https://github.com/Debasish-87/rag-based-document-qa.git`
  `cd rag-based-document-qa`
- Install dependencies:
  `pip install -r requirements.txt`
- Add your API key inside `gpt_client.py`, or configure environment variables as required
- Run the application:
  `python app.py`
- Generated answers → `answers_output.json`
- Logs → `log.txt`
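If you use environment variables, the exact variable name depends on how `gpt_client.py` reads the key; a minimal sketch assuming an OpenAI-style `OPENAI_API_KEY`:

```python
# Sketch: load the API key from the environment instead of hard-coding it in
# gpt_client.py. OPENAI_API_KEY is an assumed name; match your provider/client.
import os

API_KEY = os.environ.get("OPENAI_API_KEY")
if not API_KEY:
    raise RuntimeError("Set OPENAI_API_KEY (or edit gpt_client.py) before running app.py")
```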
- Load documents
- Chunk and vectorize content
- Accept user questions from JSON
- Retrieve relevant context using semantic similarity
- Generate grounded answers via LLM
- Store output for analysis (see the end-to-end sketch below)
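The sketch below shows how these steps might be wired together; `embed`, `retrieve`, and `ask_llm` are stand-ins for the project's vectorizer, retriever, and GPT client, and the JSON field names are assumptions rather than the exact schema of `questions.json` / `answers_output.json`.

```python
# Illustrative end-to-end run (see app.py / main.py for the real flow).
import json
import logging

logging.basicConfig(filename="log.txt", level=logging.INFO)

def run_pipeline(embed, retrieve, ask_llm, chunks, chunk_vecs,
                 questions_path="questions.json",
                 output_path="answers_output.json"):
    """Read questions, answer each one with retrieved context, write results."""
    with open(questions_path) as f:
        questions = json.load(f)  # assumed: a JSON list of question strings

    answers = []
    for question in questions:
        context = retrieve(embed(question), chunk_vecs, chunks)
        answer = ask_llm(question, context)
        logging.info("Answered %r using %d context chunks", question, len(context))
        answers.append({"question": question, "answer": answer})

    with open(output_path, "w") as f:
        json.dump(answers, f, indent=2)
```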
- 📚 Enterprise document Q&A
- 🔍 AI-powered knowledge search
- 📄 Research paper analysis
- 🏢 Internal documentation assistants
- 🤖 Chatbots with document grounding
- API keys must be kept secure (never commit secrets)
- Designed for learning and demonstration purposes
- Can be extended with production-grade vector databases
- FAISS / Pinecone / ChromaDB integration (see the FAISS sketch after this list)
- Web-based UI (FastAPI / Streamlit)
- Support for PDF, DOCX, and HTML documents
- Authentication & access control
- Performance optimization for large datasets
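For the first roadmap item, a FAISS-backed retriever could look roughly like the sketch below (not part of the current codebase; the embedding dimension and vectors are placeholders):

```python
# Sketch of a FAISS-backed retriever (a possible extension, not yet in this project).
import faiss
import numpy as np

dim = 384                          # placeholder embedding dimension
index = faiss.IndexFlatIP(dim)     # inner product ≈ cosine after L2-normalising

chunk_vecs = np.random.rand(1000, dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(chunk_vecs)
index.add(chunk_vecs)

def retrieve_ids(question_vec: np.ndarray, top_k: int = 3):
    """Return indices of the top_k most similar chunks."""
    q = question_vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, top_k)
    return ids[0].tolist()
```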
- Debasish Mohanty – Core development & architecture
- Rudra Prasad Jena – Collaboration & contributions
- Srujan Rana – Feature enhancements
This project is licensed under the MIT License.