A Fully Configurable Retrieval-Augmented Generation Pipeline for Document Q&A Applications
- Overview
- Key Features
- Architecture
- Tech Stack
- Getting Started
- Usage
- Configuration
- Project Structure
- How It Works
- Supported Models
- Roadmap & Future Enhancements
- Contributing
- Citation
- License
- Authors
ProRAG is an open-source, fully configurable Retrieval-Augmented Generation (RAG) pipeline for document Q&A applications. It bridges the gap between large language models and domain-specific knowledge by combining semantic document retrieval with instruction-tuned text generation.
Built on top of LangChain, ChromaDB, and Hugging Face Transformers, ProRAG allows researchers and developers to perform context-aware question answering over text corpora with minimal setup. Whether you're building a chatbot, an academic research tool, or a document Q&A system — ProRAG provides the modular, extensible foundation to get started.
| Feature | Description |
|---|---|
| Language-Aware Design | End-to-end pipeline with language-aware chunking — configurable sentence delimiters (e.g. ।, !, ?, and custom) feeding instruction-tuned LLM generation. |
| Plug-and-Play Models | Seamlessly swap chat and embedding models via Hugging Face Hub IDs. Use any compatible model without code changes. |
| 4-Bit Quantization | Built-in support for BitsAndBytes NF4 quantization, enabling inference of 8B+ parameter models on consumer GPUs with as little as ~6 GB VRAM. |
| ChromaDB Vector Store | Persistent vector storage with similarity-based retrieval for fast, scalable document search. |
| Configurable Chunking | Fine-grained control over chunk_size and chunk_overlap parameters for optimal retrieval granularity. |
| LangChain LCEL Chains | Modern LangChain Expression Language (LCEL) pipeline using RunnableParallel and RunnablePassthrough for composable, debuggable chains. |
| Interactive CLI | Rich terminal UI with colored panels, progress bars, and an interactive Q&A loop. |
| Context Transparency | Optional --show_context flag to inspect retrieved source passages alongside generated answers. |
| GPU Auto-Detection | Automatic CUDA device detection with graceful CPU fallback. |
| Hugging Face Auth | Native --hf_token support for gated or private model access. |
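To make the "Configurable Chunking" row concrete, here is a minimal stdlib sketch of how `chunk_size`, `chunk_overlap`, and sentence delimiters interact. This is a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter`, not ProRAG's actual implementation; the function name and toy text are illustrative.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=150, separators=("।", "!", "?")):
    """Greedy character-level chunking with overlap, preferring to cut just
    after a sentence delimiter. A simplified stand-in for LangChain's
    RecursiveCharacterTextSplitter."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Cut after the last sentence delimiter inside the window, if any.
            window = text[start:end]
            cut = max(window.rfind(sep) for sep in separators)
            if cut > 0:
                end = start + cut + 1
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - chunk_overlap, start + 1)  # overlap, but always advance
    return chunks


doc = "First sentence! Second sentence? " * 40
chunks = chunk_text(doc, chunk_size=120, chunk_overlap=30)
print(len(chunks), max(len(c) for c in chunks))
```

A larger `chunk_overlap` repeats more text between neighboring chunks, which improves the odds that a retrieved chunk contains a complete thought at the cost of a larger index.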
```
                                ProRAG Pipeline

┌──────────────┐     ┌───────────────────┐     ┌────────────────┐
│  Text File   │────▶│   Text Splitter   │────▶│  Text Chunks   │
│    (.txt)    │     │ (Recursive, ।!?)  │     └───────┬────────┘
└──────────────┘     └───────────────────┘             │
                               ┌───────────────────────▼───────────────────────┐
                               │ Embedding Model (Sentence Transformer)        │
                               │ l3cube-pune/bengali-sentence-similarity-sbert │
                               └───────────────────────┬───────────────────────┘
                               ┌───────────────────────▼───────────────────────┐
                               │ ChromaDB Vector Store                         │
                               │ (Similarity Search, Top-K Retrieval)          │
                               └───────────────────────┬───────────────────────┘
┌──────────────┐                                       │
│  User Query  │───────────────────────────────┐       │
└──────────────┘                               ▼       ▼
                               ┌───────────────────────────────────────────────┐
                               │ LangChain RAG Chain (LCEL)                    │
                               │   Retriever (Top-K) ─▶ Prompt Template        │
                               │                              │                │
                               │                              ▼                │
                               │                        LLM Generation         │
                               └───────────────────────────────┬───────────────┘
                                                               ▼
                               ┌───────────────────────────────────────────────┐
                               │          Response (Answer + Context)          │
                               └───────────────────────────────────────────────┘
```
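The "Similarity Search, Top-K Retrieval" step in the diagram reduces to nearest-neighbor search over embedding vectors. A minimal stdlib sketch of the idea follows; the toy 3-dimensional vectors stand in for real sentence-transformer embeddings, and ChromaDB performs the same ranking at scale with persistent, indexed storage.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, index, k=4):
    """Return ids of the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy 3-d "embeddings" standing in for real sentence-transformer vectors.
index = {
    "chunk-0": [0.9, 0.1, 0.0],
    "chunk-1": [0.0, 1.0, 0.0],
    "chunk-2": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index, k=2))  # -> ['chunk-0', 'chunk-2']
```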
| Component | Technology | Role |
|---|---|---|
| Orchestration | LangChain >=0.2.3 | RAG chain composition & LCEL pipelines |
| Vector Database | ChromaDB >=0.5.0 | Document embedding storage & similarity retrieval |
| LLM Framework | Hugging Face Transformers >=4.40.1 | Model loading, tokenization & text generation |
| Embeddings | Sentence Transformers >=3.0.1 | Sentence embedding generation |
| Quantization | BitsAndBytes 0.41.3 | 4-bit NF4 quantization for memory-efficient inference |
| Fine-Tuning | PEFT >=0.11.1 | Parameter-efficient fine-tuning (LoRA/QLoRA) support |
| Acceleration | Accelerate 0.31.0 | Multi-GPU & mixed-precision training utilities |
| Deep Learning | PyTorch | Tensor computation & CUDA acceleration |
| Terminal UI | Rich >=13.7.1 | Beautiful terminal output with panels & progress bars |
- Python 3.10 or higher
- CUDA-compatible GPU (recommended; CPU fallback available but significantly slower)
- Git for cloning the repository
- ~16 GB GPU VRAM for full-precision inference (~6 GB with 4-bit quantization)
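A back-of-envelope check of those VRAM figures, counting model weights only (activations, KV cache, quantization constants, and framework overhead add several more GB, which is why the practical figures are ~16 GB and ~6 GB):

```python
params = 8e9  # an 8B-parameter model

fp16_gib = params * 2 / 2**30    # 2 bytes per weight in fp16
nf4_gib = params * 0.5 / 2**30   # 4 bits (0.5 bytes) per weight with NF4

print(f"fp16 weights: ~{fp16_gib:.1f} GiB, NF4 weights: ~{nf4_gib:.1f} GiB")
# -> fp16 weights: ~14.9 GiB, NF4 weights: ~3.7 GiB
```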
- Clone the repository

```bash
git clone https://github.com/healer-125/pro-rag.git
cd pro-rag
```

- Create a virtual environment (recommended)

```bash
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Install PyTorch with CUDA (if not already installed)

```bash
# For CUDA 12.x
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Run ProRAG with a text file:
```bash
python main.py --text_path ./test.txt
```

With all options:

```bash
python main.py \
  --text_path ./test.txt \
  --chat_model hassanaliemon/bn_rag_llama3-8b \
  --embed_model l3cube-pune/bengali-sentence-similarity-sbert \
  --k 4 \
  --top_k 2 \
  --top_p 0.6 \
  --temperature 0.6 \
  --chunk_size 500 \
  --chunk_overlap 150 \
  --max_new_tokens 256 \
  --quantization \
  --show_context \
  --hf_token YOUR_HF_TOKEN
```

Interactive session example:
```text
Your question: When was the author of the document born?
Answer: The author was born on May 7, 1861.
Your question: exit
Goodbye, thank you!
```
Use ProRAG as a Python library in your own applications:
```python
from prorag import RAGChain

# Initialize the pipeline
rag = RAGChain()

# Load models and data
rag.load(
    chat_model_id="hassanaliemon/bn_rag_llama3-8b",
    embed_model_id="l3cube-pune/bengali-sentence-similarity-sbert",
    text_path="./test.txt",
    quantization=True,   # Enable 4-bit quantization
    k=4,                 # Retrieve top 4 chunks
    top_k=2,
    top_p=0.6,
    temperature=0.6,
    chunk_size=500,
    chunk_overlap=150,
    max_new_tokens=256,
    hf_token=None,       # Optional: for gated models
)

# Ask questions
answer, context = rag.get_response("Tell me about the main subject of the document.")
print(f"Answer: {answer}")
print(f"Context: {context}")
```

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| Chat Model | `--chat_model` | `hassanaliemon/bn_rag_llama3-8b` | Hugging Face model ID for the instruction-tuned LLM |
| Embedding Model | `--embed_model` | `l3cube-pune/bengali-sentence-similarity-sbert` | Hugging Face model ID for sentence embeddings |
| Text Path | `--text_path` | required | Path to the `.txt` file to index |
| Top-K Retrieval | `--k` | `4` | Number of document chunks to retrieve |
| Top-K Sampling | `--top_k` | `2` | Top-k sampling parameter for generation |
| Top-P (Nucleus) | `--top_p` | `0.6` | Nucleus sampling probability threshold |
| Temperature | `--temperature` | `0.6` | Controls randomness in generation (lower = more deterministic) |
| Max New Tokens | `--max_new_tokens` | `256` | Maximum number of tokens to generate |
| Chunk Size | `--chunk_size` | `500` | Character-level chunk size for text splitting |
| Chunk Overlap | `--chunk_overlap` | `150` | Overlap between consecutive chunks |
| Show Context | `--show_context` | `False` | Display retrieved context alongside answers |
| Quantization | `--quantization` | `False` | Enable 4-bit NF4 quantization |
| HF Token | `--hf_token` | `None` | Hugging Face API token for private/gated models |
```text
ProRAG/
├── main.py               # CLI entry point & interactive Q&A loop
├── prorag/               # Core package
│   ├── __init__.py       # Package exports (RAGChain)
│   └── rag_pipeline.py   # RAG pipeline implementation
├── test.txt              # Sample text file for testing
├── requirements.txt      # Python dependencies
├── CITATION.cff          # Academic citation metadata
├── LICENSE               # MIT License
└── README.md             # This file
```
ProRAG follows a standard RAG workflow:
1. Document Ingestion — Text is read from a `.txt` file and split into overlapping chunks using `RecursiveCharacterTextSplitter` with configurable delimiters (e.g. `!`, `?`).
2. Embedding & Indexing — Each chunk is embedded using a sentence transformer model and stored in a ChromaDB vector database.
3. Query & Retrieval — When a user submits a query, the retriever performs similarity search against the vector store and returns the top-K most relevant chunks.
4. Augmented Generation — Retrieved chunks are formatted as context and injected into an instruction prompt template. The instruction-tuned LLM generates a grounded response.
5. Response Extraction — The raw model output is parsed to extract the clean response from the `### Response:` section of the template.
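The response-extraction step can be sketched in a few lines. The helper name and the exact template text here are illustrative assumptions, not ProRAG's actual code:

```python
def extract_response(raw_output: str) -> str:
    """Return everything after the final '### Response:' marker,
    falling back to the whole output if the marker is absent."""
    _, sep, tail = raw_output.rpartition("### Response:")
    return tail.strip() if sep else raw_output.strip()


raw = (
    "### Instruction:\nAnswer using only the context below.\n"
    "### Context:\n(retrieved chunks)\n"
    "### Response:\nThe author was born on May 7, 1861."
)
print(extract_response(raw))  # -> The author was born on May 7, 1861.
```

Using `rpartition` (rather than `partition`) guards against retrieved context that happens to contain the marker string itself.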
| Role | Model | Source |
|---|---|---|
| Chat / Generation | `hassanaliemon/bn_rag_llama3-8b` | Instruction-tuned Llama 3 8B |
| Embeddings | `l3cube-pune/bengali-sentence-similarity-sbert` | Sentence-BERT for embeddings |
You can replace the default models with any compatible Hugging Face model:
- Chat Models: Llama 3.x, Mistral, Gemma 2, Qwen 2.5, Phi-3/4, Command R+, or any compatible causal LM
- Embedding Models: Any `sentence-transformers`-compatible model for your language or domain
ProRAG is actively evolving. Below are planned and aspirational features aligned with the latest advancements in the Python and AI ecosystem:
- Multi-Document Support — Ingest multiple files, PDFs, and web-scraped content
- Persistent Vector Store — Persist ChromaDB collections to disk for reuse across sessions
- Streaming Generation — Token-by-token streaming responses for real-time UX
- LangSmith Integration — Observability, tracing, and evaluation of RAG chains via LangSmith
- vLLM / TGI Backend — High-throughput inference with vLLM or Text Generation Inference
- GGUF / llama.cpp Support — CPU-optimized inference with quantized GGUF models via llama-cpp-python
- GPTQ & AWQ Quantization — Post-training quantization methods beyond NF4 for deployment flexibility
- Speculative Decoding — Accelerated generation using draft models for faster inference
- Multi-Modal RAG — Support for image+text documents using vision-language models (e.g., LLaVA, Qwen-VL)
- Hybrid Search — Combine dense vector similarity with BM25 sparse retrieval for improved recall
- Re-Ranking — Cross-encoder re-ranking of retrieved passages using models like `ms-marco` or domain-specific re-rankers
- Parent Document Retriever — Retrieve small chunks but return full parent documents for richer context
- Multi-Vector Retriever — Generate multiple embeddings per document (summary + content) for semantic diversity
- Knowledge Graph Integration — Structured knowledge extraction and graph-based retrieval (GraphRAG)
- Contextual Compression — LLM-based compression of retrieved passages to reduce noise
- Agentic RAG — Tool-using agents with LangGraph that can dynamically decide when and how to retrieve
- Corrective RAG (CRAG) — Self-reflective retrieval with hallucination detection and query rewriting
- Self-RAG — Adaptive retrieval where the model decides whether retrieval is needed
- RAG Fusion — Multiple query reformulations with reciprocal rank fusion for robust retrieval
- RAPTOR — Recursive abstractive processing for tree-organized retrieval across document hierarchies
- Python 3.12+ Features — Leverage `typing` improvements (PEP 695 type aliases), `asyncio` task groups, and improved error messages
- Async Pipeline — Fully async chain execution using `asyncio` and LangChain's async APIs
- Pydantic v2 Schemas — Structured input/output validation with Pydantic v2 for type-safe pipelines
- FastAPI / Gradio Server — REST API and web UI for production deployment
- Docker & Docker Compose — Containerized deployment with GPU passthrough
- Poetry / uv Package Management — Modern dependency management with `pyproject.toml`
- Comprehensive Test Suite — Unit and integration tests with `pytest` and `pytest-asyncio`
- CI/CD Pipeline — GitHub Actions for linting, testing, and automated releases
- RAGAS Evaluation — Automated RAG evaluation metrics (faithfulness, answer relevancy, context precision)
- Custom Benchmarks — Domain-specific evaluation datasets for Q&A
- OpenTelemetry Tracing — Distributed tracing for production monitoring
- LangFuse Integration — Open-source LLM observability and analytics
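Of the roadmap items, RAG Fusion's reciprocal rank fusion is simple enough to sketch directly with the stdlib. The function name and toy rankings are illustrative, not a committed design:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk ids: score(d) = sum of 1 / (k + rank(d))
    over every list that contains d; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Rankings produced by three hypothetical query reformulations.
runs = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
print(reciprocal_rank_fusion(runs))  # -> ['b', 'a', 'c', 'd']
```

The constant `k` (60 is the value commonly used in the RRF literature) dampens the influence of any single list's top result, which is what makes the fusion robust to one bad reformulation.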
Contributions are welcome! Whether it's bug fixes, new features, or documentation improvements — every contribution helps grow the RAG and NLP ecosystem.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Please ensure your code follows the existing style and includes appropriate documentation.
If you use ProRAG in your research, please cite it:
```bibtex
@software{prorag2024,
  title   = {ProRAG: A Fully Configurable RAG Pipeline for Document Q&A Applications},
  author  = {Abdullah, Al Asif and Al Emon, Hasan},
  year    = {2024},
  url     = {https://github.com/healer-125/pro-rag},
  license = {MIT}
}
```

Or use the `CITATION.cff` file included in this repository for automatic citation generation on GitHub.
This project is licensed under the MIT License — see the LICENSE file for details.