fantasyfist0320/AI-Agent-with-RAG


AI Agentic RAG System

Production-ready Agentic RAG system with LangGraph, conversation memory, and human-in-the-loop query clarification

Overview • How It Works • LLM Providers • Implementation • Installation & Usage • Troubleshooting


If you like this project, a star ⭐️ would mean a lot :)

✨ New:
• Multi-Agent Map-Reduce architecture for parallel query processing
• Comprehensive PDF → Markdown conversion guide, including tool comparisons and VLM-based approaches
• End-to-end Gradio interface for a complete interactive RAG pipeline

Overview

This repository demonstrates how to build an Agentic RAG (Retrieval-Augmented Generation) system using LangGraph with minimal code. It implements:

  • 💬 Conversation Memory: Maintains context across multiple questions for natural dialogue
  • 🔄 Query Clarification: Automatically rewrites ambiguous queries or asks for clarification
  • 🔍 Hierarchical Indexing: Search small, specific chunks (Child) for precision, retrieve larger Parent chunks for context
  • 🤖 Agent Orchestration: Uses LangGraph to coordinate the entire workflow
  • 🧠 Intelligent Evaluation: Assesses relevance at the granular chunk level
  • ✅ Self-Correction: Re-queries if initial results are insufficient
  • 🔀 Multi-Agent Map-Reduce: Decomposes queries into parallel sub-queries for comprehensive answers

How It Works

Document Preparation: Hierarchical Indexing

Before queries can be processed, documents are split twice for optimal retrieval:

  • Parent Chunks: Large sections based on Markdown headers (H1, H2, H3)
  • Child Chunks: Small, fixed-size pieces derived from parents

This approach combines the precision of small chunks for search with the contextual richness of large chunks for answer generation.
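As a rough sketch of the two-level split (the function names, the regex-based header splitting, and `CHILD_SIZE` below are illustrative, not taken from this repository's code):

```python
# Illustrative parent/child chunking: parents split at Markdown H1-H3
# headers, children cut into small fixed-size pieces for precise search.
import re

CHILD_SIZE = 200  # characters per child chunk (illustrative value)

def split_markdown_by_headers(text: str) -> list[str]:
    """Split a Markdown document into parent chunks at H1-H3 headers."""
    parts = re.split(r"(?m)^(?=#{1,3} )", text)
    return [p.strip() for p in parts if p.strip()]

def make_child_chunks(parent: str, size: int = CHILD_SIZE) -> list[str]:
    """Cut a parent chunk into small fixed-size child chunks."""
    return [parent[i:i + size] for i in range(0, len(parent), size)]

def build_index(text: str) -> list[dict]:
    """Return child chunks, each carrying a reference to its parent."""
    records = []
    for pid, parent in enumerate(split_markdown_by_headers(text)):
        for child in make_child_chunks(parent):
            records.append({"parent_id": pid, "child": child, "parent": parent})
    return records
```

At query time, similarity search runs over the small `child` strings, while the linked `parent` text is what gets handed to the LLM for answer generation.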


Query Processing: Four-Stage Intelligent Workflow

User Query → Conversation Analysis → Query Clarification →
Agent Reasoning → Search Child Chunks → Evaluate Relevance →
(If needed) → Retrieve Parent Chunks → Generate Answer → Return Response

Stage 1: Conversation Understanding

  • Analyzes recent conversation history to extract context
  • Maintains conversational continuity across multiple questions

Stage 2: Query Clarification

The system intelligently processes the user's query:

  1. Resolves references - Converts "How do I update it?" → "How do I update SQL?"
  2. Splits complex questions - Breaks multi-part questions into focused sub-queries
  3. Detects unclear queries - Identifies nonsense, insults, or vague questions
  4. Requests clarification - Uses human-in-the-loop to pause and ask for details
  5. Rewrites for retrieval - Optimizes query with specific, keyword-rich language

Stage 3: Intelligent Retrieval

Multi-Agent Map-Reduce Architecture:

When the query analysis stage identifies multiple distinct questions (either explicitly asked or decomposed from a complex query), the system automatically spawns parallel agent subgraphs using LangGraph's Send API. Each agent independently processes one question through the full retrieval workflow:

  1. Agent searches child chunks for precision
  2. Evaluates if results are sufficient
  3. Fetches parent chunks for context if needed
  4. Extracts final answer from conversation
  5. Self-corrects and re-queries if insufficient

All agent responses are then aggregated into a unified answer.

Example: "What is JavaScript? What is Python?" → 2 parallel agents execute simultaneously

Single question workflow: For simple queries, a single agent executes the retrieval workflow without parallelization.
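A thread-based sketch of the fan-out/fan-in pattern (the real system uses LangGraph's Send API rather than threads, and `answer_one` here is a placeholder for one agent's full search → evaluate → retrieve → answer loop):

```python
# Map-reduce over sub-queries: fan out one worker per question,
# then aggregate the partial answers into a single response.
from concurrent.futures import ThreadPoolExecutor

def answer_one(sub_query: str) -> str:
    # Placeholder for a single agent's retrieval workflow.
    return f"Answer to: {sub_query}"

def map_reduce(sub_queries: list[str]) -> str:
    # Map: one agent per sub-query, run in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(answer_one, sub_queries))
    # Reduce: aggregate partial answers into one unified response.
    return "\n\n".join(partials)
```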

Stage 4: Response Generation

The system synthesizes information from retrieved chunks (or multiple agents) into a coherent, accurate answer that directly addresses the user's question.


LLM Provider Configuration

This system is provider-agnostic - you can use any LLM supported by LangChain. Choose the option that best fits your needs:

Ollama (Local - Recommended for Development)

Install Ollama and download the model:

# Install Ollama from https://ollama.com
ollama pull qwen3:4b-instruct-2507-q4_K_M

Python code:

from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3:4b-instruct-2507-q4_K_M", temperature=0)

Google Gemini (Cloud - Recommended for Production)

Install the package:

pip install -qU langchain-google-genai

Python code:

import os
from langchain_google_genai import ChatGoogleGenerativeAI

# Set your Google API key
os.environ["GOOGLE_API_KEY"] = "your-api-key-here"
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-exp", temperature=0)

OpenAI / Anthropic Claude


OpenAI:

Install the package:

pip install -qU langchain-openai

Python code:

import os
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "your-api-key-here"
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

Anthropic Claude:

Install the package:

pip install -qU langchain-anthropic

Python code:

import os
from langchain_anthropic import ChatAnthropic

os.environ["ANTHROPIC_API_KEY"] = "your-api-key-here"
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0)

Important Notes

  • All providers work with the exact same code - only the LLM initialization changes
  • Cost considerations: Cloud providers charge per token, while Ollama is free but requires local compute
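Because only the initialization line differs, provider selection can be centralized in a single factory. This is a hypothetical sketch (`get_llm` and the provider strings are not this repository's actual config API); imports are deferred into each branch so only the chosen provider's package needs to be installed:

```python
# Hypothetical provider factory: swap LLM backends by changing one
# string, with per-provider imports deferred until actually needed.
def get_llm(provider: str, model: str, temperature: float = 0):
    if provider == "ollama":
        from langchain_ollama import ChatOllama
        return ChatOllama(model=model, temperature=temperature)
    if provider == "google":
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model=model, temperature=temperature)
    if provider == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=model, temperature=temperature)
    if provider == "anthropic":
        from langchain_anthropic import ChatAnthropic
        return ChatAnthropic(model=model, temperature=temperature)
    raise ValueError(f"Unknown provider: {provider}")
```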

Modular Architecture

The app (project/ folder) is organized into modular components that can be easily customized:

📂 Project Structure

project/
├── app.py                    # Main Gradio application entry point
├── config.py                 # Configuration hub (models, chunk sizes, providers)
├── util.py                   # PDF to markdown conversion
├── document_chunker.py       # Chunking strategy
├── core/                     # Core RAG components orchestration
│   ├── chat_interface.py
│   ├── document_manager.py
│   └── rag_system.py
├── db/                       # Storage management
│   ├── parent_store_manager.py  # Parent chunks storage (JSON)
│   └── vector_db_manager.py     # Qdrant vector database setup
├── rag_agent/                # LangGraph agent workflow
│   ├── edges.py              # Conditional routing logic
│   ├── graph.py              # Graph construction and compilation
│   ├── graph_state.py        # State definitions
│   ├── nodes.py              # Processing nodes (summarize, rewrite, agent)
│   ├── prompts.py            # System prompts
│   ├── schemas.py            # Pydantic data models
│   └── tools.py              # Retrieval tools
└── ui/                       # User interface
    └── gradio_app.py         # Gradio interface components

Option 2: Full Python Project Setup

1. Install Dependencies

# Clone the repository
git clone <repo-url>
cd agentic-rag-for-dummies

# Create virtual environment (recommended)
python -m venv venv

# Activate it
# On macOS/Linux:
source venv/bin/activate
# On Windows:
.\venv\Scripts\activate

# Install packages
pip install -r requirements.txt

2. Run the Application

python app.py

3. Ask Questions

Open the local URL (e.g., http://127.0.0.1:7860) to start chatting.


Option 3: Docker Deployment

⚠️ System Requirements: Docker deployment requires at least 8GB of RAM allocated to Docker. The Ollama model (qwen3:4b-instruct-2507-q4_K_M) needs approximately 3.3GB of memory to run.

Prerequisites

  • Docker installed on your system (Get Docker)
  • Docker Desktop configured with at least 8GB of RAM (Settings → Resources → Memory)

1. Build the Docker Image

docker build -f project/Dockerfile -t agentic-rag .

2. Run the Container

docker run --name rag-assistant -p 7860:7860 agentic-rag

⚠️ Performance Note: Docker deployment may be 20-50% slower than running Python locally, especially on Windows/Mac, due to virtualization overhead and I/O operations. This is normal and expected. For maximum performance during development, consider using Option 2 (Full Python Project).

Optional: Enable GPU acceleration (NVIDIA GPU only):

If you have an NVIDIA GPU and NVIDIA Container Toolkit installed:

docker run --gpus all --name rag-assistant -p 7860:7860 agentic-rag

Common Docker commands:

# Stop the container
docker stop rag-assistant

# Start an existing container
docker start rag-assistant

# View logs in real-time
docker logs -f rag-assistant

# Remove the container
docker rm rag-assistant

# Remove the container forcefully (if running)
docker rm -f rag-assistant

3. Access the Application

Once the container is running and you see:

🚀 Launching RAG Assistant...
* Running on local URL:  http://0.0.0.0:7860

Open your browser and navigate to:

http://localhost:7860

Example Conversations

With Conversation Memory:

User: "How do I install SQL?"
Agent: [Provides installation steps from documentation]

User: "How do I update it?"
Agent: [Understands "it" = SQL, provides update instructions]

With Query Clarification:

User: "Tell me about that thing"
Agent: "I need more information. What specific topic are you asking about?"

User: "The installation process for PostgreSQL"
Agent: [Retrieves and answers with specific information]

Troubleshooting

Model Selection
  Problems: Responses ignore instructions; tools (retrieval/search) used incorrectly; poor context understanding; hallucinations or incomplete aggregation.
  Solutions: Use more capable LLMs; prefer models of 7B+ parameters for better reasoning; consider cloud-based models if local models are limited.

System Prompt Behavior
  Problems: Model answers without retrieving documents; query rewriting loses context; aggregation introduces hallucinations.
  Solutions: Make retrieval explicit in system prompts; keep query rewriting close to user intent; enforce strict aggregation rules.

Retrieval Configuration
  Problems: Relevant documents not retrieved; too much irrelevant information.
  Solutions: Increase retrieved chunks (k) or lower similarity thresholds to improve recall; reduce k or raise thresholds to improve precision.

Chunk Size / Document Splitting
  Problems: Answers lack context or feel fragmented; retrieval is slow or embedding costs are high.
  Solutions: Increase child and parent chunk sizes for more context; decrease chunk sizes to improve speed and reduce costs.

Temperature & Consistency
  Problems: Responses inconsistent or overly creative; responses too rigid or repetitive.
  Solutions: Set temperature to 0 for factual, consistent output; slightly increase temperature for summarization or analysis tasks.

Embedding Model Quality
  Problems: Poor semantic search; weak performance on domain-specific or multilingual documents.
  Solutions: Use higher-quality or domain-specific embeddings; re-index all documents after changing embeddings.

License

MIT License - Feel free to use this for learning and building your own projects!


Contributing

Contributions are welcome! Open an issue or submit a pull request!
