A production-ready Retrieval-Augmented Generation (RAG) agent built with LangGraph that intelligently retrieves information from a Milvus Lite vector store and generates context-aware responses using Llama Stack and Ollama.
- 🚀 Quick Start - Get running in 8 steps
- 📖 Detailed Guide - Comprehensive setup with troubleshooting
- 🔧 Configuration - Environment variables and settings
- 🎯 API Endpoints - REST API documentation
- Agentic RAG Workflow: The agent autonomously decides when to retrieve information
- Llama Stack Integration: Unified model serving with Ollama for local LLM inference
- Milvus Lite Vector Store: High-performance vector database with easy migration to production Milvus
- FastAPI Service: REST API with `/chat` and `/health` endpoints
- Tool-based Retrieval: LangGraph tool integration for seamless retrieval
- Document Loader: Easy document ingestion from text files with customizable chunking
The RAG workflow consists of three main steps:
- Agent Node: Decides whether to retrieve information based on the user's query
- Retrieve Node: If needed, retrieves relevant documents from the vector store
- Generate Node: Generates a final answer based on retrieved context
```
START → Agent → [Decision] → Retrieve → Generate → END
                    ↓
             END (if no retrieval needed)
```
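The flow above can be sketched in plain Python. This is a stand-in for the LangGraph graph, not the agent's actual code; the function names and signatures here are illustrative:

```python
def run_workflow(query, needs_retrieval, retrieve, generate):
    """Mirror the START → Agent → [Decision] → Retrieve → Generate → END flow."""
    # Agent node: decide whether retrieval is needed for this query
    if not needs_retrieval(query):
        return generate(query, context=None)  # END without retrieval
    # Retrieve node: fetch relevant documents from the vector store
    docs = retrieve(query)
    # Generate node: answer grounded in the retrieved context
    return generate(query, context=docs)

# Toy stand-ins to show the control flow
answer = run_workflow(
    "What is LangChain?",
    needs_retrieval=lambda q: "LangChain" in q,
    retrieve=lambda q: ["LangChain is a framework for LLM apps."],
    generate=lambda q, context: f"Answer using {len(context or [])} docs",
)
print(answer)  # Answer using 1 docs
```

In the real agent, each of these steps is a LangGraph node and the decision is a conditional edge.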
Clone the repository and create a virtual environment:
```bash
git clone <repository-url>
cd Agentic-Starter-Kits
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

To install Ollama, download the app from the Ollama site or use Homebrew:

```bash
brew install ollama
# or
curl -fsSL https://ollama.com/install.sh | sh
```

Install Llama Stack:

```bash
pip install llama-stack llama-stack-client
```

Step 1: Pull Required Models
```bash
ollama pull llama3.2:3b
ollama pull embeddinggemma:latest
```

Step 2: Start Ollama Service

```bash
ollama serve
```

Keep this terminal open - Ollama needs to keep running.
Step 3: Start Llama Stack Server
From the repository root directory:
```bash
llama stack run run_llama_server.yaml
```

Keep this terminal open - the server needs to keep running. You should see output indicating the server started on http://localhost:8321.
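A quick way to confirm something is listening at the base URL before moving on. This is a generic reachability check using only the standard library, not part of the kit; it treats any HTTP response (even an error status) as "server is up":

```python
import urllib.request
import urllib.error

def server_reachable(base_url: str = "http://localhost:8321") -> bool:
    """Return True if something is listening at the Llama Stack base URL."""
    try:
        urllib.request.urlopen(base_url, timeout=3)
        return True
    except urllib.error.HTTPError:
        return True  # server responded, even if with an error status
    except (urllib.error.URLError, OSError):
        return False
```

If this returns `False`, go back to Step 3 and make sure the Llama Stack server terminal is still running.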
Step 4: Install Agent Dependencies
Navigate to the RAG agent directory and install dependencies:
```bash
cd agents/community/langgraph_agentic_rag
pip install -r requirements.txt
```

⚡ Or with uv (from repo root):

- Create venv and activate:

```bash
uv venv --python 3.12
source .venv/bin/activate
```

- Copy shared utils into the agent package:

```bash
cp utils.py agents/community/langgraph_agentic_rag/src/langgraph_agentic_rag
```

- Install agent (editable) and its requirements:

```bash
uv pip install -e agents/community/langgraph_agentic_rag/. -r agents/community/langgraph_agentic_rag/requirements.txt
```

- Run the example:

```bash
uv run agents/community/langgraph_agentic_rag/examples/execute_ai_service_locally.py
```

Step 5: Configure Environment Variables
Copy the example environment file:
```bash
cp .env.example .env
```

Edit the `.env` file with your configuration:

```bash
# Llama Stack Server Configuration
BASE_URL=http://localhost:8321
MODEL_ID=ollama/llama3.2:3b
API_KEY=not-needed

# RAG Configuration
VECTOR_STORE_PATH=/absolute/path/to/milvus_data/milvus_lite.db
EMBEDDING_MODEL=ollama/embeddinggemma:latest
DOCS_TO_LOAD=./data/sample_knowledge.txt

# Server Configuration
PORT=8000
```

Important: Update `VECTOR_STORE_PATH` to an absolute path where you want the Milvus database stored.
Step 6: Load Documents into Vector Store
Navigate to the data directory and run the document loader:
```bash
cd data
python load_documents.py
```

This will:

- Read documents from `sample_knowledge.txt`
- Split documents into chunks (512 characters with 128 overlap)
- Generate embeddings using the `embeddinggemma` model
- Store chunks in the Milvus Lite vector database
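The chunking step can be sketched in plain Python. This is a simplified stand-in for the splitter used by `load_documents.py` (which also generates embeddings and writes to Milvus), shown only to make the 512/128 size-and-overlap defaults concrete:

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 128):
    """Split text into overlapping chunks, mirroring the 512/128 defaults."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # advance 384 chars per chunk by default
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 1000)
print(len(chunks))     # 3
print(len(chunks[0]))  # 512
```

The 128-character overlap means the tail of each chunk is repeated at the head of the next, so sentences that straddle a boundary are still retrievable as a whole.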
Step 7: Run the Interactive Chat
```bash
cd ../examples
python execute_ai_service_locally.py
```

You should see:
```
================================================================================
LangGraph Agentic RAG - Interactive Chat
================================================================================
Model: ollama/llama3.2:3b
Base URL: http://localhost:8321/v1
...
Choose a question or ask one of your own.
-->
```
Step 8: Ask Questions!
Try asking questions about the loaded documents:
```
--> What is LangChain?
```
This agent requires the following key dependencies (see requirements.txt for complete list):
- `langchain-core`, `langchain-openai` - LangChain framework components
- `langgraph`, `langgraph-prebuilt` - Graph-based agent orchestration
- `llama-stack-client` - Llama Stack API client
- `fastapi`, `uvicorn` - Web service framework
- `pydantic`, `python-dotenv` - Configuration and data validation
Dependencies are installed in Step 4 of the Quick Start guide above.
Configuration is handled through two files:
Located in the repository root, `run_llama_server.yaml` configures:
- Server port (8321)
- Milvus Lite vector store path
- Ollama integration URL
- Registered models (LLM and embedding)
Environment variables (configured in Step 5 of Quick Start):
- `BASE_URL` - Llama Stack server URL (default: `http://localhost:8321`)
- `MODEL_ID` - LLM model identifier (e.g., `ollama/llama3.2:3b`)
- `API_KEY` - API authentication (use `not-needed` for local setup)
- `VECTOR_STORE_PATH` - Absolute path to Milvus Lite database file
- `EMBEDDING_MODEL` - Embedding model name (e.g., `ollama/embeddinggemma:latest`)
- `DOCS_TO_LOAD` - Path to documents for vector store (e.g., `./data/sample_knowledge.txt`)
- `PORT` - FastAPI server port (default: `8000`)
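A minimal sketch of how these variables might be read, using only `os.environ` with the documented defaults. The `Settings` class and `load_settings` function are illustrative names, not the agent's actual configuration code (which uses `python-dotenv` and `pydantic` per its requirements):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    base_url: str
    model_id: str
    api_key: str
    vector_store_path: str
    embedding_model: str
    docs_to_load: str
    port: int

def load_settings() -> Settings:
    """Read the agent's configuration from the environment, with README defaults."""
    return Settings(
        base_url=os.environ.get("BASE_URL", "http://localhost:8321"),
        model_id=os.environ.get("MODEL_ID", "ollama/llama3.2:3b"),
        api_key=os.environ.get("API_KEY", "not-needed"),
        vector_store_path=os.environ.get("VECTOR_STORE_PATH", ""),
        embedding_model=os.environ.get("EMBEDDING_MODEL", "ollama/embeddinggemma:latest"),
        docs_to_load=os.environ.get("DOCS_TO_LOAD", "./data/sample_knowledge.txt"),
        port=int(os.environ.get("PORT", "8000")),
    )
```

Note that `VECTOR_STORE_PATH` has no sensible default here, matching the README's instruction to set it to an absolute path yourself.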
```bash
# Test the health endpoint
curl http://localhost:8000/health

# Send a chat message
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is RAG?"}'
```

To load your own documents:

- Create a text file with your content (e.g., `my_documents.txt`)
- Update `.env` to point to your file: `DOCS_TO_LOAD=./data/my_documents.txt`
- Re-run the document loader:

```bash
cd data
python load_documents.py
```
Edit load_documents.py to customize chunking parameters:
```python
load_and_index_documents(
    chunk_size=512,     # Size of text chunks (default: 512)
    chunk_overlap=128,  # Overlap between chunks (default: 128)
)
```

Recommended chunk sizes:
- Technical documentation: 512-1024 characters
- Narrative text: 256-512 characters
- Code snippets: 128-256 characters
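As a toy illustration of the guidance above (the table and midpoint heuristic here are my own framing, not part of the loader):

```python
# Ranges from the README's recommended chunk sizes, in characters
RECOMMENDED_CHUNK_SIZES = {
    "technical_documentation": (512, 1024),
    "narrative_text": (256, 512),
    "code_snippets": (128, 256),
}

def pick_chunk_size(content_type: str) -> int:
    """Pick the midpoint of the recommended range as a starting point."""
    low, high = RECOMMENDED_CHUNK_SIZES[content_type]
    return (low + high) // 2

print(pick_chunk_size("narrative_text"))  # 384
```

Tune from there: larger chunks keep more context per retrieval hit, smaller chunks give more precise matches.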
Navigate to the agent directory:
```bash
cd agents/community/langgraph_agentic_rag
```

Make scripts executable (first time only):

```bash
chmod +x init.sh deploy.sh
./init.sh
```

This will:
- Load and validate environment variables from the `.env` file
- Copy shared utilities (`utils.py`) to the agent source directory
Log in to OpenShift (`oc`) and to your container registry with Docker, then run the deploy script:

```bash
./deploy.sh
```

This will:
- Create Kubernetes secret for API key
- Build and push the Docker image
- Deploy the agent to OpenShift
- Create Service and Route
Get your route URL:
```bash
oc get route langgraph-agentic-rag -o jsonpath='{.spec.host}'
```

Copy the output into the curl command below in place of `<YOUR_ROUTE_URL>`.
Send a test request:
```bash
curl -X POST https://<YOUR_ROUTE_URL>/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is LangChain?"}'
```

Error: Connection refused to http://localhost:8321

Solution: Ensure the Llama Stack server is running:

```bash
llama stack run run_llama_server.yaml
```

Solution: Load documents into the vector store:

```bash
cd data
python load_documents.py
```

The locally created vector store can occasionally become corrupted. If that happens, delete the contents of the `milvus_data` folder, then run `load_documents.py` again to repopulate it.
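Clearing the folder can be scripted. This helper is a convenience sketch (the function name is mine, not part of the kit); it empties the Milvus Lite data directory while leaving the directory itself in place, so the loader can repopulate it:

```python
import shutil
from pathlib import Path

def reset_vector_store(milvus_dir: str = "milvus_data") -> None:
    """Delete the contents of the Milvus Lite data folder so it can be rebuilt."""
    d = Path(milvus_dir)
    if not d.exists():
        return  # nothing to clean up
    for child in d.iterdir():
        if child.is_dir():
            shutil.rmtree(child)  # remove nested directories entirely
        else:
            child.unlink()        # remove files such as milvus_lite.db
```

After running it, re-run `python load_documents.py` to rebuild the index.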
Possible causes:

- Chunk size too small - Documents split into headers only
  - Solution: Increase `chunk_size` to 512+ in `load_documents.py`
- Documents not loaded - Vector store is empty
  - Solution: Re-run `python load_documents.py`
- Wrong model - Model not compatible
  - Solution: Use `llama3.2:3b` or `llama3.1:8b`
This RAG agent extends the base LangGraph agent with:
- Retrieval Capability: Automatic knowledge base search via Llama Stack
- Multi-step Workflow: Agent → Retrieve → Generate pattern
- Vector Store Integration: Milvus Lite-based document storage and retrieval
- Context-aware Generation: Answers based on retrieved documents with relevance checking
- Llama Stack Integration: Unified model serving and vector operations
- LangGraph Documentation - LangGraph framework docs
- Llama Stack Documentation - Llama Stack API reference
- Ollama Documentation - Local model serving
MIT License