A Fully Configurable Retrieval-Augmented Generation Pipeline for Document Q&A Applications
- Overview
- Key Features
- Architecture
- Tech Stack
- Getting Started
- Usage
- Configuration
- Project Structure
- How It Works
- Supported Models
- Roadmap & Future Enhancements
- Contributing
- Citation
- License
- Authors
ProRAG is an open-source, fully configurable Retrieval-Augmented Generation (RAG) pipeline for document Q&A applications. It bridges the gap between large language models and domain-specific knowledge by combining semantic document retrieval with instruction-tuned text generation.
Built on top of LangChain, ChromaDB, and Hugging Face Transformers, ProRAG allows researchers and developers to perform context-aware question answering over text corpora with minimal setup. Whether you're building a chatbot, an academic research tool, or a document Q&A system — ProRAG provides the modular, extensible foundation to get started.
| Feature | Description |
|---|---|
| Language-Aware Design | End-to-end pipeline with language-aware chunking — configurable sentence delimiters (e.g. ।, !, ?, and custom) feeding instruction-tuned LLM generation. |
| Plug-and-Play Models | Seamlessly swap chat and embedding models via Hugging Face Hub IDs. Use any compatible model without code changes. |
| 4-Bit Quantization | Built-in support for BitsAndBytes NF4 quantization, enabling inference of 8B+ parameter models on consumer GPUs with as little as ~6 GB VRAM. |
| ChromaDB Vector Store | Persistent vector storage with similarity-based retrieval for fast, scalable document search. |
| Configurable Chunking | Fine-grained control over chunk_size and chunk_overlap parameters for optimal retrieval granularity. |
| LangChain LCEL Chains | Modern LangChain Expression Language (LCEL) pipeline using RunnableParallel and RunnablePassthrough for composable, debuggable chains. |
| Interactive CLI | Rich terminal UI with colored panels, progress bars, and an interactive Q&A loop. |
| Context Transparency | Optional --show_context flag to inspect retrieved source passages alongside generated answers. |
| GPU Auto-Detection | Automatic CUDA device detection with graceful CPU fallback. |
| Hugging Face Auth | Native --hf_token support for gated or private model access. |
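To make the "Configurable Chunking" row concrete, here is a minimal stdlib sketch of how `chunk_size`, `chunk_overlap`, and sentence delimiters interact. This is a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter`, not ProRAG's actual implementation; the function name and toy text are illustrative.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=150, separators=("।", "!", "?")):
    """Greedy character-level chunking with overlap, preferring to cut just
    after a sentence delimiter. A simplified stand-in for LangChain's
    RecursiveCharacterTextSplitter."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Cut after the last sentence delimiter inside the window, if any.
            window = text[start:end]
            cut = max(window.rfind(sep) for sep in separators)
            if cut > 0:
                end = start + cut + 1
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - chunk_overlap, start + 1)  # overlap, but always advance
    return chunks


doc = "First sentence! Second sentence? " * 40
chunks = chunk_text(doc, chunk_size=120, chunk_overlap=30)
print(len(chunks), max(len(c) for c in chunks))
```

A larger `chunk_overlap` repeats more text between neighboring chunks, which improves the odds that a retrieved chunk contains a complete thought at the cost of a larger index.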
```
                                ProRAG Pipeline

┌──────────────┐     ┌───────────────────┐     ┌────────────────┐
│  Text File   │────▶│   Text Splitter   │────▶│  Text Chunks   │
│    (.txt)    │     │ (Recursive, ।!?)  │     └───────┬────────┘
└──────────────┘     └───────────────────┘             │
                               ┌───────────────────────▼───────────────────────┐
                               │ Embedding Model (Sentence Transformer)        │
                               │ l3cube-pune/bengali-sentence-similarity-sbert │
                               └───────────────────────┬───────────────────────┘
                               ┌───────────────────────▼───────────────────────┐
                               │ ChromaDB Vector Store                         │
                               │ (Similarity Search, Top-K Retrieval)          │
                               └───────────────────────┬───────────────────────┘
┌──────────────┐                                       │
│  User Query  │───────────────────────────────┐       │
└──────────────┘                               ▼       ▼
                               ┌───────────────────────────────────────────────┐
                               │ LangChain RAG Chain (LCEL)                    │
                               │   Retriever (Top-K) ─▶ Prompt Template        │
                               │                              │                │
                               │                              ▼                │
                               │                        LLM Generation         │
                               └───────────────────────────────┬───────────────┘
                                                               ▼
                               ┌───────────────────────────────────────────────┐
                               │          Response (Answer + Context)          │
                               └───────────────────────────────────────────────┘
```
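The "Similarity Search, Top-K Retrieval" step in the diagram reduces to nearest-neighbor search over embedding vectors. A minimal stdlib sketch of the idea follows; the toy 3-dimensional vectors stand in for real sentence-transformer embeddings, and ChromaDB performs the same ranking at scale with persistent, indexed storage.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, index, k=4):
    """Return ids of the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy 3-d "embeddings" standing in for real sentence-transformer vectors.
index = {
    "chunk-0": [0.9, 0.1, 0.0],
    "chunk-1": [0.0, 1.0, 0.0],
    "chunk-2": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index, k=2))  # -> ['chunk-0', 'chunk-2']
```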
| Component | Technology | Role |
|---|---|---|
| Orchestration | LangChain >=0.2.3 | RAG chain composition & LCEL pipelines |
| Vector Database | ChromaDB >=0.5.0 | Document embedding storage & similarity retrieval |
| LLM Framework | Hugging Face Transformers >=4.40.1 | Model loading, tokenization & text generation |
| Embeddings | Sentence Transformers >=3.0.1 | Sentence embedding generation |
| Quantization | BitsAndBytes 0.41.3 | 4-bit NF4 quantization for memory-efficient inference |
| Fine-Tuning | PEFT >=0.11.1 | Parameter-efficient fine-tuning (LoRA/QLoRA) support |
| Acceleration | Accelerate 0.31.0 | Multi-GPU & mixed-precision training utilities |
| Deep Learning | PyTorch | Tensor computation & CUDA acceleration |
| Terminal UI | Rich >=13.7.1 | Beautiful terminal output with panels & progress bars |
- Python 3.10 or higher
- CUDA-compatible GPU (recommended; CPU fallback available but significantly slower)
- Git for cloning the repository
- ~16 GB GPU VRAM for full-precision inference (~6 GB with 4-bit quantization)
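A back-of-envelope check of those VRAM figures, counting model weights only (activations, KV cache, quantization constants, and framework overhead add several more GB, which is why the practical figures are ~16 GB and ~6 GB):

```python
params = 8e9  # an 8B-parameter model

fp16_gib = params * 2 / 2**30    # 2 bytes per weight in fp16
nf4_gib = params * 0.5 / 2**30   # 4 bits (0.5 bytes) per weight with NF4

print(f"fp16 weights: ~{fp16_gib:.1f} GiB, NF4 weights: ~{nf4_gib:.1f} GiB")
# -> fp16 weights: ~14.9 GiB, NF4 weights: ~3.7 GiB
```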
- Clone the repository

```bash
git clone https://github.com/healer-125/pro-rag.git
cd pro-rag
```

- Create a virtual environment (recommended)

```bash
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Install PyTorch with CUDA (if not already installed)

```bash
# For CUDA 12.x
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Run ProRAG with a text file:
```bash
python main.py --text_path ./test.txt
```

With all options:

```bash
python main.py \
  --text_path ./test.txt \
  --chat_model hassanaliemon/bn_rag_llama3-8b \
  --embed_model l3cube-pune/bengali-sentence-similarity-sbert \
  --k 4 \
  --top_k 2 \
  --top_p 0.6 \
  --temperature 0.6 \
  --chunk_size 500 \
  --chunk_overlap 150 \
  --max_new_tokens 256 \
  --quantization \
  --show_context \
  --hf_token YOUR_HF_TOKEN
```

Interactive session example:
```text
Your question: When was the author of the document born?
Answer: The author was born on May 7, 1861.
Your question: exit
Goodbye, thank you!
```
Use ProRAG as a Python library in your own applications:
```python
from prorag import RAGChain

# Initialize the pipeline
rag = RAGChain()

# Load models and data
rag.load(
    chat_model_id="hassanaliemon/bn_rag_llama3-8b",
    embed_model_id="l3cube-pune/bengali-sentence-similarity-sbert",
    text_path="./test.txt",
    quantization=True,   # Enable 4-bit quantization
    k=4,                 # Retrieve top 4 chunks
    top_k=2,
    top_p=0.6,
    temperature=0.6,
    chunk_size=500,
    chunk_overlap=150,
    max_new_tokens=256,
    hf_token=None,       # Optional: for gated models
)

# Ask questions
answer, context = rag.get_response("Tell me about the main subject of the document.")
print(f"Answer: {answer}")
print(f"Context: {context}")
```

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| Chat Model | `--chat_model` | `hassanaliemon/bn_rag_llama3-8b` | Hugging Face model ID for the instruction-tuned LLM |
| Embedding Model | `--embed_model` | `l3cube-pune/bengali-sentence-similarity-sbert` | Hugging Face model ID for sentence embeddings |
| Text Path | `--text_path` | required | Path to the `.txt` file to index |
| Top-K Retrieval | `--k` | `4` | Number of document chunks to retrieve |
| Top-K Sampling | `--top_k` | `2` | Top-k sampling parameter for generation |
| Top-P (Nucleus) | `--top_p` | `0.6` | Nucleus sampling probability threshold |
| Temperature | `--temperature` | `0.6` | Controls randomness in generation (lower = more deterministic) |
| Max New Tokens | `--max_new_tokens` | `256` | Maximum number of tokens to generate |
| Chunk Size | `--chunk_size` | `500` | Character-level chunk size for text splitting |
| Chunk Overlap | `--chunk_overlap` | `150` | Overlap between consecutive chunks |
| Show Context | `--show_context` | `False` | Display retrieved context alongside answers |
| Quantization | `--quantization` | `False` | Enable 4-bit NF4 quantization |
| HF Token | `--hf_token` | `None` | Hugging Face API token for private/gated models |
```text
ProRAG/
├── main.py               # CLI entry point & interactive Q&A loop
├── prorag/               # Core package
│   ├── __init__.py       # Package exports (RAGChain)
│   └── rag_pipeline.py   # RAG pipeline implementation
├── test.txt              # Sample text file for testing
├── requirements.txt      # Python dependencies
├── CITATION.cff          # Academic citation metadata
├── LICENSE               # MIT License
└── README.md             # This file
```
ProRAG follows a standard RAG workflow:
1. Document Ingestion — Text is read from a `.txt` file and split into overlapping chunks using `RecursiveCharacterTextSplitter` with configurable delimiters (e.g. `!`, `?`).
2. Embedding & Indexing — Each chunk is embedded using a sentence transformer model and stored in a ChromaDB vector database.
3. Query & Retrieval — When a user submits a query, the retriever performs similarity search against the vector store and returns the top-K most relevant chunks.
4. Augmented Generation — Retrieved chunks are formatted as context and injected into an instruction prompt template. The instruction-tuned LLM generates a grounded response.
5. Response Extraction — The raw model output is parsed to extract the clean response from the `### Response:` section of the template.
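The response-extraction step can be sketched in a few lines. The helper name and the exact template text here are illustrative assumptions, not ProRAG's actual code:

```python
def extract_response(raw_output: str) -> str:
    """Return everything after the final '### Response:' marker,
    falling back to the whole output if the marker is absent."""
    _, sep, tail = raw_output.rpartition("### Response:")
    return tail.strip() if sep else raw_output.strip()


raw = (
    "### Instruction:\nAnswer using only the context below.\n"
    "### Context:\n(retrieved chunks)\n"
    "### Response:\nThe author was born on May 7, 1861."
)
print(extract_response(raw))  # -> The author was born on May 7, 1861.
```

Using `rpartition` (rather than `partition`) guards against retrieved context that happens to contain the marker string itself.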
| Role | Model | Source |
|---|---|---|
| Chat / Generation | `hassanaliemon/bn_rag_llama3-8b` | Instruction-tuned Llama 3 8B |
| Embeddings | `l3cube-pune/bengali-sentence-similarity-sbert` | Sentence-BERT for embeddings |
You can replace the default models with any compatible Hugging Face model:
- Chat Models: Llama 3.x, Mistral, Gemma 2, Qwen 2.5, Phi-3/4, Command R+, or any compatible causal LM
- Embedding Models: Any `sentence-transformers`-compatible model for your language or domain
ProRAG is actively evolving. Below are planned and aspirational features aligned with the latest advancements in the Python and AI ecosystem:
- Multi-Document Support — Ingest multiple files, PDFs, and web-scraped content
- Persistent Vector Store — Persist ChromaDB collections to disk for reuse across sessions
- Streaming Generation — Token-by-token streaming responses for real-time UX
- LangSmith Integration — Observability, tracing, and evaluation of RAG chains via LangSmith
- vLLM / TGI Backend — High-throughput inference with vLLM or Text Generation Inference
- GGUF / llama.cpp Support — CPU-optimized inference with quantized GGUF models via llama-cpp-python
- GPTQ & AWQ Quantization — Post-training quantization methods beyond NF4 for deployment flexibility
- Speculative Decoding — Accelerated generation using draft models for faster inference
- Multi-Modal RAG — Support for image+text documents using vision-language models (e.g., LLaVA, Qwen-VL)
- Hybrid Search — Combine dense vector similarity with BM25 sparse retrieval for improved recall
- Re-Ranking — Cross-encoder re-ranking of retrieved passages using models like `ms-marco` or domain-specific re-rankers
- Parent Document Retriever — Retrieve small chunks but return full parent documents for richer context
- Multi-Vector Retriever — Generate multiple embeddings per document (summary + content) for semantic diversity
- Knowledge Graph Integration — Structured knowledge extraction and graph-based retrieval (GraphRAG)
- Contextual Compression — LLM-based compression of retrieved passages to reduce noise
- Agentic RAG — Tool-using agents with LangGraph that can dynamically decide when and how to retrieve
- Corrective RAG (CRAG) — Self-reflective retrieval with hallucination detection and query rewriting
- Self-RAG — Adaptive retrieval where the model decides whether retrieval is needed
- RAG Fusion — Multiple query reformulations with reciprocal rank fusion for robust retrieval
- RAPTOR — Recursive abstractive processing for tree-organized retrieval across document hierarchies
- Python 3.12+ Features — Leverage `typing` improvements (PEP 695 type aliases), `asyncio` task groups, and improved error messages
- Async Pipeline — Fully async chain execution using `asyncio` and LangChain's async APIs
- Pydantic v2 Schemas — Structured input/output validation with Pydantic v2 for type-safe pipelines
- FastAPI / Gradio Server — REST API and web UI for production deployment
- Docker & Docker Compose — Containerized deployment with GPU passthrough
- Poetry / uv Package Management — Modern dependency management with `pyproject.toml`
- Comprehensive Test Suite — Unit and integration tests with `pytest` and `pytest-asyncio`
- CI/CD Pipeline — GitHub Actions for linting, testing, and automated releases
- RAGAS Evaluation — Automated RAG evaluation metrics (faithfulness, answer relevancy, context precision)
- Custom Benchmarks — Domain-specific evaluation datasets for Q&A
- OpenTelemetry Tracing — Distributed tracing for production monitoring
- LangFuse Integration — Open-source LLM observability and analytics
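Of the roadmap items, RAG Fusion's reciprocal rank fusion is simple enough to sketch directly with the stdlib. The function name and toy rankings are illustrative, not a committed design:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk ids: score(d) = sum of 1 / (k + rank(d))
    over every list that contains d; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Rankings produced by three hypothetical query reformulations.
runs = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
print(reciprocal_rank_fusion(runs))  # -> ['b', 'a', 'c', 'd']
```

The constant `k` (60 is the value commonly used in the RRF literature) dampens the influence of any single list's top result, which is what makes the fusion robust to one bad reformulation.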
Contributions are welcome! Whether it's bug fixes, new features, or documentation improvements — every contribution helps grow the RAG and NLP ecosystem.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Please ensure your code follows the existing style and includes appropriate documentation.
If you use ProRAG in your research, please cite it:
```bibtex
@software{prorag2024,
  title   = {ProRAG: A Fully Configurable RAG Pipeline for Document Q&A Applications},
  author  = {Abdullah, Al Asif and Al Emon, Hasan},
  year    = {2024},
  url     = {https://github.com/healer-125/pro-rag},
  license = {MIT}
}
```

Or use the `CITATION.cff` file included in this repository for automatic citation generation on GitHub.
This project is licensed under the MIT License — see the LICENSE file for details.