Production-ready Agentic RAG system with LangGraph, conversation memory, and human-in-the-loop query clarification
Overview • How It Works • LLM Providers • Implementation • Installation & Usage • Troubleshooting
If you like this project, a star ⭐ would mean a lot :)
✨ New:
- Multi-Agent Map-Reduce architecture for parallel query processing
- Comprehensive PDF → Markdown conversion guide, including tool comparisons and VLM-based approaches
- End-to-end Gradio interface for a complete interactive RAG pipeline
This repository demonstrates how to build an Agentic RAG (Retrieval-Augmented Generation) system using LangGraph with minimal code. It implements:
- 💬 Conversation Memory: Maintains context across multiple questions for natural dialogue
- 🔄 Query Clarification: Automatically rewrites ambiguous queries or asks for clarification
- 📚 Hierarchical Indexing: Search small, specific chunks (Child) for precision, retrieve larger Parent chunks for context
- 🤖 Agent Orchestration: Uses LangGraph to coordinate the entire workflow
- 🧠 Intelligent Evaluation: Assesses relevance at the granular chunk level
- ✅ Self-Correction: Re-queries if initial results are insufficient
- 🚀 Multi-Agent Map-Reduce: Decomposes queries into parallel sub-queries for comprehensive answers
Before queries can be processed, documents are split twice for optimal retrieval:
- Parent Chunks: Large sections based on Markdown headers (H1, H2, H3)
- Child Chunks: Small, fixed-size pieces derived from parents
This approach combines the precision of small chunks for search with the contextual richness of large chunks for answer generation.
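The two-level split can be illustrated with a small, library-free sketch (the actual project uses dedicated splitters in `document_chunker.py`; the function names and sizes here are illustrative, not the project's API):

```python
import re

def split_parents(markdown: str) -> list[dict]:
    """Split a Markdown document into parent chunks at H1-H3 headers."""
    parents, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,3} ", line) and current:
            parents.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        parents.append("\n".join(current))
    return [{"parent_id": i, "text": t} for i, t in enumerate(parents)]

def split_children(parents: list[dict], size: int = 200) -> list[dict]:
    """Cut each parent into small fixed-size child chunks that remember their parent."""
    children = []
    for p in parents:
        for start in range(0, len(p["text"]), size):
            children.append({"parent_id": p["parent_id"],
                             "text": p["text"][start:start + size]})
    return children

doc = "# Install\nRun the installer.\n## Update\nUse the update command."
parents = split_parents(doc)
children = split_children(parents, size=30)
# Vector search runs over `children`; a matching chunk's parent_id is then
# used to fetch the full parent section for answer generation.
```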
```
User Query → Conversation Analysis → Query Clarification →
Agent Reasoning → Search Child Chunks → Evaluate Relevance →
(If needed) → Retrieve Parent Chunks → Generate Answer → Return Response
```
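Several of these hops are conditional: after evaluating relevance, the graph must choose between fetching parent chunks, re-querying, or answering. A minimal sketch of such a routing function (state keys and node names here are hypothetical, not the project's actual `edges.py`):

```python
def route_after_evaluation(state: dict) -> str:
    """Pick the next node based on how the retrieved child chunks were judged."""
    if not state.get("relevant_chunks"):      # nothing useful found
        if state.get("retries", 0) < 2:
            return "rewrite_query"            # self-correct and search again
        return "generate_answer"              # give a best-effort answer
    if state.get("needs_more_context"):
        return "retrieve_parents"             # fetch larger parent chunks
    return "generate_answer"
```

In LangGraph, a function with this shape is typically wired in via a conditional edge, mapping each returned string to the node it names.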
- Analyzes recent conversation history to extract context
- Maintains conversational continuity across multiple questions
The system intelligently processes the user's query:
- Resolves references - Converts "How do I update it?" → "How do I update SQL?"
- Splits complex questions - Breaks multi-part questions into focused sub-queries
- Detects unclear queries - Identifies nonsense, insults, or vague questions
- Requests clarification - Uses human-in-the-loop to pause and ask for details
- Rewrites for retrieval - Optimizes query with specific, keyword-rich language
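Structured output is a natural fit for this stage: the LLM fills a schema rather than replying free-form. A stdlib sketch of what such a schema might look like (the project uses Pydantic models in `schemas.py`; these field names are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class QueryAnalysis:
    """What the clarification stage might return for one user turn."""
    needs_clarification: bool            # pause and ask the user?
    clarification_question: str = ""     # shown to the user when paused
    sub_queries: list = field(default_factory=list)  # rewritten, retrieval-ready queries

# "How do I update it?" with "SQL" in recent history → one resolved sub-query
resolved = QueryAnalysis(needs_clarification=False,
                         sub_queries=["How do I update SQL?"])

# "Tell me about that thing" with no usable context → human-in-the-loop pause
unclear = QueryAnalysis(needs_clarification=True,
                        clarification_question="What specific topic are you asking about?")
```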
Multi-Agent Map-Reduce Architecture:
When the query analysis stage identifies multiple distinct questions (either explicitly asked or decomposed from a complex query), the system automatically spawns parallel agent subgraphs using LangGraph's Send API. Each agent independently processes one question through the full retrieval workflow:
- Agent searches child chunks for precision
- Evaluates if results are sufficient
- Fetches parent chunks for context if needed
- Extracts final answer from conversation
- Self-corrects and re-queries if insufficient
All agent responses are then aggregated into a unified answer.
Example: "What is JavaScript? What is Python?" → 2 parallel agents execute simultaneously
Single question workflow: For simple queries, a single agent executes the retrieval workflow without parallelization.
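The project implements the fan-out with LangGraph's Send API; the underlying map-reduce pattern can be sketched library-free with `concurrent.futures` (here `answer_one` is a placeholder for the full per-agent retrieval workflow):

```python
from concurrent.futures import ThreadPoolExecutor

def answer_one(question: str) -> str:
    # Placeholder for the full agent workflow:
    # search children → evaluate → fetch parents → extract answer.
    return f"[answer to: {question}]"

def map_reduce(questions: list[str]) -> str:
    # Map: one agent per question, running in parallel.
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(answer_one, questions))
    # Reduce: aggregate the per-question answers into one response.
    return "\n\n".join(answers)

result = map_reduce(["What is JavaScript?", "What is Python?"])
```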
The system synthesizes information from retrieved chunks (or multiple agents) into a coherent, accurate answer that directly addresses the user's question.
This system is provider-agnostic - you can use any LLM supported by LangChain. Choose the option that best fits your needs:
Install Ollama and download the model:

```bash
# Install Ollama from https://ollama.com
ollama pull qwen3:4b-instruct-2507-q4_K_M
```

Python code:

```python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3:4b-instruct-2507-q4_K_M", temperature=0)
```

Install the package:

```bash
pip install -qU langchain-google-genai
```

Python code:

```python
import os
from langchain_google_genai import ChatGoogleGenerativeAI

# Set your Google API key
os.environ["GOOGLE_API_KEY"] = "your-api-key-here"

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-exp", temperature=0)
```

OpenAI:

```bash
pip install -qU langchain-openai
```

```python
import os
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "your-api-key-here"

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
```

Anthropic Claude:

```bash
pip install -qU langchain-anthropic
```

```python
import os
from langchain_anthropic import ChatAnthropic

os.environ["ANTHROPIC_API_KEY"] = "your-api-key-here"

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0)
```

- All providers work with the exact same code - only the LLM initialization changes
- Cost considerations: Cloud providers charge per token, while Ollama is free but requires local compute
The app (project/ folder) is organized into modular components that can be easily customized:
```
project/
├── app.py                  # Main Gradio application entry point
├── config.py               # Configuration hub (models, chunk sizes, providers)
├── util.py                 # PDF to markdown conversion
├── document_chunker.py     # Chunking strategy
├── core/                   # Core RAG components orchestration
│   ├── chat_interface.py
│   ├── document_manager.py
│   └── rag_system.py
├── db/                     # Storage management
│   ├── parent_store_manager.py  # Parent chunks storage (JSON)
│   └── vector_db_manager.py     # Qdrant vector database setup
├── rag_agent/              # LangGraph agent workflow
│   ├── edges.py            # Conditional routing logic
│   ├── graph.py            # Graph construction and compilation
│   ├── graph_state.py      # State definitions
│   ├── nodes.py            # Processing nodes (summarize, rewrite, agent)
│   ├── prompts.py          # System prompts
│   ├── schemas.py          # Pydantic data models
│   └── tools.py            # Retrieval tools
└── ui/                     # User interface
    └── gradio_app.py       # Gradio interface components
```
```bash
# Clone the repository
git clone <repo-url>
cd agentic-rag-for-dummies

# Create virtual environment (recommended)
python -m venv venv

# Activate it
# On macOS/Linux:
source venv/bin/activate
# On Windows:
.\venv\Scripts\activate

# Install packages
pip install -r requirements.txt
```

Run the app:

```bash
python app.py
```

Open the local URL (e.g., http://127.0.0.1:7860) to start chatting.
⚠️ System Requirements: Docker deployment requires at least 8GB of RAM allocated to Docker. The Ollama model (qwen3:4b-instruct-2507-q4_K_M) needs approximately 3.3GB of memory to run.
- Docker installed on your system (Get Docker)
- Docker Desktop configured with at least 8GB of RAM (Settings → Resources → Memory)
```bash
docker build -f project/Dockerfile -t agentic-rag .
docker run --name rag-assistant -p 7860:7860 agentic-rag
```
⚠️ Performance Note: Docker deployment may be 20-50% slower than running Python locally, especially on Windows/Mac, due to virtualization overhead and I/O operations. This is normal and expected. For maximum performance during development, consider using Option 2 (Full Python Project).
Optional: Enable GPU acceleration (NVIDIA GPU only):
If you have an NVIDIA GPU and NVIDIA Container Toolkit installed:
```bash
docker run --gpus all --name rag-assistant -p 7860:7860 agentic-rag
```

Common Docker commands:

```bash
# Stop the container
docker stop rag-assistant

# Start an existing container
docker start rag-assistant

# View logs in real-time
docker logs -f rag-assistant

# Remove the container
docker rm rag-assistant

# Remove the container forcefully (if running)
docker rm -f rag-assistant
```

Once the container is running and you see:

```
🚀 Launching RAG Assistant...
* Running on local URL: http://0.0.0.0:7860
```

Open your browser and navigate to:

http://localhost:7860
With Conversation Memory:
User: "How do I install SQL?"
Agent: [Provides installation steps from documentation]
User: "How do I update it?"
Agent: [Understands "it" = SQL, provides update instructions]
With Query Clarification:
User: "Tell me about that thing"
Agent: "I need more information. What specific topic are you asking about?"
User: "The installation process for PostgreSQL"
Agent: [Retrieves and answers with specific information]
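Under the hood, conversation memory amounts to replaying recent turns when analyzing a new question. A minimal stdlib sketch of per-thread history (LangGraph's checkpointers handle this for real; the names here are illustrative):

```python
from collections import defaultdict

# One history per conversation thread
histories: dict[str, list[tuple[str, str]]] = defaultdict(list)

def remember(thread_id: str, role: str, text: str) -> None:
    histories[thread_id].append((role, text))

def context_for(thread_id: str, last_n: int = 4) -> str:
    """The recent turns handed to the query-clarification stage."""
    return "\n".join(f"{role}: {text}" for role, text in histories[thread_id][-last_n:])

remember("t1", "user", "How do I install SQL?")
remember("t1", "assistant", "Run the installer ...")
remember("t1", "user", "How do I update it?")
# context_for("t1") now contains "SQL", letting the rewriter resolve "it".
```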
| Area | Common Problems | Suggested Solutions |
|---|---|---|
| Model Selection | - Responses ignore instructions<br>- Tools (retrieval/search) used incorrectly<br>- Poor context understanding<br>- Hallucinations or incomplete aggregation | - Use more capable LLMs<br>- Prefer models 7B+ for better reasoning<br>- Consider cloud-based models if local models are limited |
| System Prompt Behavior | - Model answers without retrieving documents<br>- Query rewriting loses context<br>- Aggregation introduces hallucinations | - Make retrieval explicit in system prompts<br>- Keep query rewriting close to user intent<br>- Enforce strict aggregation rules |
| Retrieval Configuration | - Relevant documents not retrieved<br>- Too much irrelevant information | - Increase retrieved chunks (k) or lower similarity thresholds to improve recall<br>- Reduce k or increase thresholds to improve precision |
| Chunk Size / Document Splitting | - Answers lack context or feel fragmented<br>- Retrieval is slow or embedding costs are high | - Increase chunk & parent sizes for more context<br>- Decrease chunk sizes to improve speed and reduce costs |
| Temperature & Consistency | - Responses inconsistent or overly creative<br>- Responses too rigid or repetitive | - Set temperature to 0 for factual, consistent output<br>- Slightly increase temperature for summarization or analysis tasks |
| Embedding Model Quality | - Poor semantic search<br>- Weak performance on domain-specific or multilingual docs | - Use higher-quality or domain-specific embeddings<br>- Re-index all documents after changing embeddings |
MIT License - Feel free to use this for learning and building your own projects!
Contributions are welcome! Open an issue or submit a pull request!

