RAG (Retrieval-Augmented Generation) agent that indexes documents in a vector store and retrieves relevant chunks to augment the LLM's answers with your own data. Built with LangGraph, LangChain, and LlamaStack. Supports Milvus Lite (local) and pgvector (OpenShift) as vector store backends.
- uv — Python package manager
- Podman or Docker — for local container builds (Option A)
- oc — for OpenShift deployment
- Helm — for deploying to Kubernetes/OpenShift
- GNU Make and a bash-compatible shell — on Windows, use WSL (recommended) or Git Bash
make init creates a .env file from .env.example. Set your environment variables in the .env file.
cd agents/langgraph/agentic_rag
make init
Next, the old .venv is removed and a new one is created; the dependencies are then installed.
make env
Tracing is optional. If MLflow tracing is required, enable it by uncommenting and setting the following environment variables in the .env file.
MLFLOW_TRACKING_URI="http://localhost:5000"
MLFLOW_EXPERIMENT_NAME="langgraph-agentic-rag"
MLFLOW_HTTP_REQUEST_TIMEOUT=2
MLFLOW_HTTP_REQUEST_MAX_RETRIES=0
Then start the MLflow server in a separate terminal:
# Start the MLflow server
uv run --extra tracing mlflow server --port 5000
When MLFLOW_TRACKING_URI is set, make run-app and make run-cli will automatically install the tracing dependency.
To enable tracing and logging with MLflow on your OpenShift cluster, add the following environment variables to your .env file:
MLFLOW_TRACKING_URI="https://<openshift-dashboard-url>/mlflow"
MLFLOW_TRACKING_TOKEN="<your-openshift-token>"
MLFLOW_EXPERIMENT_NAME="langgraph-agentic-rag"
MLFLOW_TRACKING_INSECURE_TLS="true"
MLFLOW_WORKSPACE="default"
Notes:
- MLFLOW_TRACKING_URI - URL of your MLflow server. For local development, use http://localhost:5000. If using MLflow on an OpenShift cluster, replace <openshift-dashboard-url> with your cluster's data science gateway URL.
- MLFLOW_TRACKING_TOKEN - Required for OpenShift only. Your OpenShift authentication token, obtained from the OpenShift console.
- MLFLOW_EXPERIMENT_NAME - A descriptive name for your experiment (e.g., "LangGraph Agentic RAG Demo").
- MLFLOW_TRACKING_INSECURE_TLS - Required for OpenShift only. Set to "true" if your cluster does not use trusted certificates.
- MLFLOW_WORKSPACE - Required for OpenShift only. Project name.
- Tracing is optional; if you do not set MLFLOW_TRACKING_URI, the application will run without MLflow logging.
- If MLFLOW_TRACKING_URI is set, the application will attempt to connect to the MLflow server at startup. If the server is unreachable, the application will log a warning and continue running without tracing.
- You can control how long the application waits for the MLflow server by setting MLFLOW_HEALTH_CHECK_TIMEOUT (in seconds, default: 5).
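The startup behavior described in the notes above (probe the server, warn, and carry on without tracing) could be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the function names `mlflow_reachable` and `tracing_enabled` are hypothetical, and probing a `/health` path on the tracking URI is an assumption. It also ignores the token and TLS settings used on OpenShift.

```python
import logging
import os
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)


def mlflow_reachable(tracking_uri: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers a health probe within `timeout` seconds."""
    # Assumption: the server exposes a /health endpoint under the tracking URI.
    url = tracking_uri.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def tracing_enabled(env=os.environ) -> bool:
    """Decide whether tracing should be switched on for this run."""
    uri = env.get("MLFLOW_TRACKING_URI")
    if not uri:
        # Tracing is optional: no URI means no MLflow logging.
        return False
    timeout = float(env.get("MLFLOW_HEALTH_CHECK_TIMEOUT", "5"))
    if not mlflow_reachable(uri, timeout):
        logger.warning("MLflow server at %s is unreachable; continuing without tracing", uri)
        return False
    return True
```

Passing a plain dict instead of `os.environ` makes the decision easy to exercise without touching the real environment.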
In addition to the model configuration, this agent requires RAG-specific settings in your .env file:
EMBEDDING_MODEL=ollama/embeddinggemma:latest
EMBEDDING_DIMENSION=768
VECTOR_STORE_ID=
VECTOR_STORE_PROVIDER=milvus
VECTOR_STORE_PATH=/absolute/path/to/milvus_data/milvus_lite.db
DOCS_TO_LOAD=./data/sample_knowledge.txt
Notes:
- EMBEDDING_MODEL - Model used for generating document embeddings. For local use with Ollama, pull the model first: ollama pull embeddinggemma:latest
- EMBEDDING_DIMENSION - Dimension of the embedding vectors (default: 768). Must match the embedding model's output dimension.
- VECTOR_STORE_ID - Identifier for the vector store collection. If left empty, a new collection will be created when loading documents.
- VECTOR_STORE_PROVIDER - Vector store backend: milvus for local development (default), pgvector for OpenShift deployments.
- VECTOR_STORE_PATH - Absolute path where the Milvus Lite database will be stored. Not used when VECTOR_STORE_PROVIDER=pgvector.
- DOCS_TO_LOAD - Path to the text file containing documents to load into the vector store. A sample file is provided at ./data/sample_knowledge.txt.
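The constraints in these notes can be checked before any documents are loaded. The sketch below is one hedged way to do that; `RagSettings` and `load_rag_settings` are hypothetical names for illustration, not part of the agent, and the defaults simply mirror the values documented above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

SUPPORTED_PROVIDERS = {"milvus", "pgvector"}


@dataclass(frozen=True)
class RagSettings:
    embedding_model: str
    embedding_dimension: int
    vector_store_id: Optional[str]
    vector_store_provider: str
    vector_store_path: Optional[str]
    docs_to_load: str


def load_rag_settings(env: Mapping[str, str]) -> RagSettings:
    """Read the RAG variables from a mapping and validate the documented rules."""
    model = env.get("EMBEDDING_MODEL", "")
    if not model:
        raise ValueError("EMBEDDING_MODEL must be set")

    dimension = int(env.get("EMBEDDING_DIMENSION", "768"))
    if dimension <= 0:
        raise ValueError("EMBEDDING_DIMENSION must be a positive integer")

    provider = env.get("VECTOR_STORE_PROVIDER", "milvus")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported VECTOR_STORE_PROVIDER: {provider!r}")

    path = env.get("VECTOR_STORE_PATH") or None
    if provider == "milvus":
        # Milvus Lite stores its database at an absolute path.
        if not path or not os.path.isabs(path):
            raise ValueError("VECTOR_STORE_PATH must be an absolute path for milvus")
    else:
        path = None  # not used with pgvector

    return RagSettings(
        embedding_model=model,
        embedding_dimension=dimension,
        vector_store_id=env.get("VECTOR_STORE_ID") or None,  # empty means "create one"
        vector_store_provider=provider,
        vector_store_path=path,
        docs_to_load=env.get("DOCS_TO_LOAD", "./data/sample_knowledge.txt"),
    )
```

Note that this only checks that the dimension is a positive integer; confirming that it matches the embedding model's real output size would require calling the model.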
This installs Ollama if it is not already installed, then pulls the models needed for local work.
The default model is llama3.1:8b. To use a different model, pass MODEL=:
make ollama MODEL=llama3.2:3b
This also pulls the embedding model (embeddinggemma:latest) required for RAG.
make ollama
Keep this terminal open – the server needs to keep running. You should see output indicating the server started on http://localhost:8321.
make llama-server
Before running the agent, you need to load documents into the vector store.
If you do not have a VECTOR_STORE_ID, you can create one by running the document loader:
make load-docs
This will:
- Read documents from the file specified in DOCS_TO_LOAD
- Split documents into chunks (512 characters with 128 overlap by default)
- Generate embeddings using the model specified in EMBEDDING_MODEL
- Create a new vector store (using VECTOR_STORE_PROVIDER, defaults to milvus for local)
- Store chunks in the vector store
- Automatically write the new VECTOR_STORE_ID back to your .env file
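To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking with overlap, using the documented defaults of 512 and 128. The function name `split_into_chunks` is hypothetical, and the real loader may split differently (for example, on separators via a LangChain text splitter); this only illustrates how the overlap works.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 128) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the defaults, each chunk starts 384 characters after the previous one, so the last 128 characters of one chunk reappear at the start of the next. That repetition keeps a sentence that straddles a boundary retrievable from at least one chunk.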
Keep this terminal open – the app needs to keep running. You should see output indicating the app started on http://localhost:8000.
cd agents/langgraph/agentic_rag
make run-app # fails if the port is already in use and prints the steps to take
For terminal-based testing without a browser:
cd agents/langgraph/agentic_rag
make run-cli
This launches an interactive prompt where you can pick predefined questions or type your own. Tool calls and results are displayed inline with colored output.
cd agents/langgraph/agentic_rag
make init
Edit .env with your model endpoint, RAG configuration, and container image.
If a LlamaStack server is already deployed on the cluster (e.g., in the llama-serving namespace), use its
external route URL so both LLM and vector store operations go through LlamaStack:
API_KEY=not-needed
BASE_URL=https://llamastack-route-host/v1
MODEL_ID=vllm//mnt/models
CONTAINER_IMAGE=quay.io/your-username/langgraph-agentic-rag:latest
# RAG Configuration
EMBEDDING_MODEL=sentence-transformers/nomic-ai/nomic-embed-text-v1.5
EMBEDDING_DIMENSION=768
VECTOR_STORE_ID=
VECTOR_STORE_PROVIDER=pgvector
DOCS_TO_LOAD=./data/sample_knowledge.txt
To discover the LlamaStack route URL and available models on your cluster:
# Get the LlamaStack route
oc get route -n llama-serving llamastack -o jsonpath='{.spec.host}'
# Check available models
curl -s https://<route-host>/v1/models | python3 -m json.tool
Notes:
- API_KEY - your API key; contact your cluster administrator if you do not have one. Use not-needed for LlamaStack servers that don't require auth.
- BASE_URL - should end with /v1. For LlamaStack on the cluster, use the external route URL.
- MODEL_ID - model identifier available on your endpoint.
- VECTOR_STORE_PROVIDER - vector store backend configured in your LlamaStack server. Use pgvector or milvus depending on your LlamaStack deployment.
- CONTAINER_IMAGE - full image path where the agent container will be pushed and pulled from. The image is built locally, pushed to this registry, and then deployed to OpenShift.
  Format: <registry>/<namespace>/<image-name>:<tag>
  Examples:
  - Quay.io: quay.io/your-username/langgraph-agentic-rag:latest
  - Docker Hub: docker.io/your-username/langgraph-agentic-rag:latest
  - GHCR: ghcr.io/your-org/langgraph-agentic-rag:latest
  Note: OpenShift must be able to pull the container image. Make the image public, or configure an image pull secret for private registries.
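A quick way to sanity-check a CONTAINER_IMAGE value against the documented format is to split it into its parts. The helper below is a hypothetical sketch (`ImageRef` and `parse_image_ref` are illustration names); it handles only the `<registry>/<namespace>/<image-name>:<tag>` shape shown here, not digests or short image names.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    namespace: str
    name: str
    tag: str


def parse_image_ref(image: str) -> ImageRef:
    """Split '<registry>/<namespace>/<image-name>:<tag>' into its four parts."""
    parts = image.split("/")
    if len(parts) < 3:
        raise ValueError(f"expected <registry>/<namespace>/<image-name>:<tag>, got {image!r}")
    registry = parts[0]
    namespace = "/".join(parts[1:-1])
    # Only the final path segment carries the tag, so a registry port is left alone.
    name, sep, tag = parts[-1].partition(":")
    if not (registry and namespace and name and sep and tag):
        raise ValueError(f"expected <registry>/<namespace>/<image-name>:<tag>, got {image!r}")
    return ImageRef(registry, namespace, name, tag)
```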
Before deploying the agent, load documents into the LlamaStack vector store. Run the loader script
locally, pointing it at the LlamaStack server's external route (the same BASE_URL used in your .env):
uv run python data/load_documents.py
The script creates a new vector store, prints its ID, and writes the VECTOR_STORE_ID back to your .env file automatically.
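Writing a value back to .env amounts to replacing the existing `KEY=` line or appending one if it is missing. The sketch below shows that idea under stated assumptions: `upsert_env_var` is a hypothetical helper, not the loader script's actual code, and it treats the file as simple `KEY=value` lines.

```python
from pathlib import Path


def upsert_env_var(env_path: str, key: str, value: str) -> None:
    """Set key=value in a .env file, replacing an existing entry or appending a new one."""
    path = Path(env_path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    new_line = f"{key}={value}"

    replaced = False
    for i, line in enumerate(lines):
        # Commented-out entries such as "# KEY=..." do not match and are left untouched.
        if line.strip().startswith(f"{key}="):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)

    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, `upsert_env_var(".env", "VECTOR_STORE_ID", new_id)` would fill in the empty `VECTOR_STORE_ID=` line from the configuration above, where `new_id` stands for whatever identifier the vector store returned.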
Log in to OpenShift:
oc login -u "login" -p "password" https://super-link-to-cluster:111
Log in to the container registry, e.g., with Docker:
docker login -u='login' -p='password' quay.io
Requires Podman (or Docker) and a registry account (e.g., Quay.io).
make build # builds the image locally
make push # pushes to the registry specified in CONTAINER_IMAGE
No Podman, Docker, or registry account needed -- just the oc CLI.
make build-openshift
After the build completes, set CONTAINER_IMAGE in your .env to the internal registry URL printed after the build.
make dry-run # preview rendered Helm manifests (secrets redacted)
make deploy
After deploying, the application may take about a minute to become available while the pod starts up.
The route URL is printed after make deploy. You can also retrieve it manually:
oc get route langgraph-agentic-rag -o jsonpath='{.spec.host}'
make undeploy
See OpenShift Deployment for more details.
make test
Non-streaming:
curl -X POST http://localhost:8000/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is LangChain?"}], "stream": false}'
Streaming:
curl -sN -X POST http://localhost:8000/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is LangChain?"}], "stream": true}'
Pretty Printed Stream:
curl -sN -X POST http://localhost:8000/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is LangChain?"}], "stream": true}' |
jq -R -r -j --stream 'scan("^data:(.*)")[] | fromjson.choices[0].delta.content // empty'
curl http://localhost:8000/health
This agent implements a Retrieval-Augmented Generation (RAG) pattern:
- Document Indexing: Documents are loaded from a text file, split into chunks, and embedded using the configured embedding model. The embeddings are stored in a vector database via LlamaStack (supports Milvus and pgvector backends, configurable via VECTOR_STORE_PROVIDER).
- Query Processing: When the user asks a question, the agent searches the vector store through LlamaStack for the most relevant document chunks.
- Augmented Generation: The retrieved chunks are provided as context to the LLM, which generates an answer grounded in the relevant documents. This reduces hallucination and allows the model to answer questions about your specific data.
The agent uses LangGraph to orchestrate the retrieval and generation steps, LangChain for the LLM integration, and LlamaStack for vector store operations.
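The retrieve-then-augment flow can be illustrated with a self-contained toy. Everything here is a stand-in chosen for illustration: word counts replace the real embedding model, an in-memory list replaces the LlamaStack vector store, and the hypothetical functions `embed`, `retrieve`, and `build_prompt` are not the agent's API. The point is only the shape of the pipeline: rank chunks by similarity to the question, then place the best ones in the prompt.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a real model returns dense floats)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Assemble the augmented prompt that would be sent to the LLM."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    knowledge = [
        "LangChain is a framework for building LLM applications.",
        "Milvus is a vector database.",
        "OpenShift is a Kubernetes platform.",
    ]
    question = "What is LangChain?"
    print(build_prompt(question, retrieve(question, knowledge, top_k=1)))
```

In the agent itself, the ranking happens inside the vector store and the prompt goes to the configured model; the toy keeps both steps local so the data flow is visible.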
Behavioral tests validate tool usage, response quality, latency, and reliability against a deployed agent.
# Set the deployed agent URL
export AGENTIC_RAG_AGENT_URL=https://<your-agent-route>
# Optional: enable MLflow trace enrichment for tool_calls extraction
export MLFLOW_TRACKING_URI=https://<mlflow-route>/mlflow
export MLFLOW_EXPERIMENT_NAME=<experiment>
# Run all behavioral tests
pytest agents/langgraph/agentic_rag/tests/behavioral/ -v
# Run specific test categories
pytest agents/langgraph/agentic_rag/tests/behavioral/ -v -m "agentic_rag and not slow"
See tests/behavioral/ at the repo root for the shared test harness and threshold configuration.