Enterprise-grade RAG system using NVIDIA NIM for LLM inference, NV-Embed-QA for embeddings, FAISS for vector search, BM25 for keyword retrieval, and NVIDIA Reranker for ranking.
FastAPI Backend
↓
Conversation Orchestrator
↓
NVIDIA NIM (Llama 3.1 70B)
↓
Hybrid Retrieval (BM25 + Vector Search)
↓
NVIDIA Reranker
↓
Grounded Generation
- Install dependencies:
  pip install -r requirements.txt
- Configure NVIDIA API:
  - Copy .env.example to .env
  - Get your NVIDIA API key from the NVIDIA API Catalog
  - Set NVIDIA_API_KEY in .env
- Build embeddings and indices:
  python scripts/build_embeddings.py
  This script:
  - Reads data/catalog.json
  - Generates embeddings using NVIDIA NV-Embed-QA
  - Builds the FAISS vector index
  - Builds the BM25 keyword index
- Run the server:
  uvicorn app.main:app --reload --port 8000
- Test locally:
  pytest -q
- Evaluate retrieval quality:
  python scripts/eval_retrieval.py
  This reports:
  - Recall@5
  - Recall@10
  - MRR
  - reranker improvements
  - candidate rescue rate
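As an illustration of the first three metrics reported by the evaluation script, here is a minimal sketch of Recall@k and MRR. The function names and input shapes are assumptions made for this example; they are not taken from scripts/eval_retrieval.py, which may compute these differently.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Running the same functions on the ranking before and after reranking gives one simple way to quantify a reranker improvement.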
The app serves:
- API: POST /chat, GET /health
- UI: / (static files from app/static/)
- Push this repo to GitHub.
- In Render: New → Blueprint and select the repo.
- Render will read render.yaml.
- Add the secret env var NVIDIA_API_KEY in the Render dashboard.
See azure-app-service.md.
docker build -t shl-nim-rag .
docker run -p 8000:8000 \
  -e NVIDIA_API_KEY=YOUR_KEY \
  -e NVIDIA_NIM_BASE_URL=https://integrate.api.nvidia.com/v1 \
  -e AUTO_BUILD_INDICES=true \
  shl-nim-rag
Open:
- UI: http://localhost:8000/
- Health: http://localhost:8000/health
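Once the container or local server is running, the chat endpoint can also be called from Python. The sketch below uses only the standard library; the helper names build_chat_request and send are hypothetical, and the payload fields (messages, top_k, use_reranker) follow the request example documented for POST /chat in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the default port used in this README


def build_chat_request(content, top_k=5, use_reranker=True, base_url=BASE_URL):
    """Build a POST /chat request carrying a single user message."""
    payload = {
        "messages": [{"role": "user", "content": content}],
        "top_k": top_k,
        "use_reranker": use_reranker,
    }
    return urllib.request.Request(
        f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server):
# reply = send(build_chat_request("I'm looking for a Python backend developer assessment"))
# print(reply["reply"])
```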
Health check endpoint.
curl http://localhost:8000/health
Response:
{
  "status": "ok",
  "backend": "NVIDIA NIM"
}
Main conversation endpoint.
Request:
{
  "messages": [
    {
      "role": "user",
      "content": "I'm looking for a Python backend developer assessment"
    }
  ],
  "top_k": 5,
  "use_reranker": true
}
Response:
{
  "action": "respond",
  "reply": "Based on your requirement for a Python backend developer...",
  "retrieved_assessments": [
    {
      "rank": 1,
      "id": "assessment_1",
      "title": "Python Backend Developer Assessment",
      "hybrid_score": 0.95,
      "vector_score": 0.88,
      "bm25_score": 0.92,
      "rerank_score": 0.98,
      "final_rank": 1,
      "meta": {...}
    }
  ],
  "turn_count": 1,
  "provenance": {
    "model": "meta/llama-3.1-70b-instruct",
    "embedding_model": "nvidia/nv-embed-qa-e5-v5",
    "retrieval_method": "hybrid_bm25_vector",
    "reranked": true
  }
}
- Chat completions via Llama 3.1 70B
- Embeddings via NV-Embed-QA
- Reranking via NVIDIA Reranker
- BM25 for keyword matching
- Vector search via FAISS + NV-Embed
- Weighted hybrid scoring (semantic + bm25 + metadata)
- Query expansion for intent-rich prompts
- Metadata boosting using catalog fields
- Reranking for final ranking
- Lightweight keyword-based search
- Fast inference
- Vector similarity search
- Efficient L2 distance computation
- Stateless conversation handling
- Clarification-first policy
- Grounded generation with provenance
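The weighted hybrid scoring listed above (semantic + bm25 + metadata) can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the weights are placeholders, the function names are hypothetical, and vector scores are assumed to be similarities where higher is better (raw L2 distances from FAISS would need to be converted first).

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All candidates tie; treat them as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_scores(vector_scores, bm25_scores, metadata_scores=None,
                  w_vector=0.6, w_bm25=0.3, w_meta=0.1):
    """Combine normalized vector and BM25 scores with a metadata boost.

    The default weights are illustrative assumptions. Metadata scores are
    expected to be in [0, 1] already. Returns (doc_id, score) pairs sorted
    from best to worst.
    """
    metadata_scores = metadata_scores or {}
    vec = min_max(vector_scores)
    kw = min_max(bm25_scores)
    combined = {
        doc_id: w_vector * vec.get(doc_id, 0.0)
        + w_bm25 * kw.get(doc_id, 0.0)
        + w_meta * metadata_scores.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Normalizing each signal before weighting keeps unbounded BM25 scores from dominating bounded similarity scores; a candidate found by only one retriever simply contributes 0.0 for the other.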
Set these environment variables in .env:
NVIDIA_API_KEY=<your-nvidia-api-key>
NVIDIA_NIM_BASE_URL=https://integrate.api.nvidia.com/v1
FAISS_INDEX_PATH=data/faiss.index
EMBEDDINGS_PKL=data/embeddings.pkl
CATALOG_JSON=data/catalog.json
BM25_PKL=data/bm25_retriever.pkl
- User query reaches FastAPI
- Clarification check determines whether more detail is needed
- Hybrid retrieval combines BM25 and vector search
- Reranking re-scores the top candidates
- Grounding prompt includes the retrieved assessments
- Llama 3.1 generates the response grounded in retrieved data
- The API returns structured JSON with provenance
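To make the grounding step in the flow above concrete, here is a minimal sketch of assembling a prompt from retrieved assessments. The function name and the prompt wording are assumptions for illustration; the fields used (rank, title, id) match the retrieved_assessments entries shown in the example response.

```python
def build_grounding_prompt(question, assessments):
    """Build a prompt that restricts the model to the retrieved assessments."""
    lines = [
        "Answer using only the assessments listed below.",
        "If none of them fits, say so instead of guessing.",
        "",
        "Retrieved assessments:",
    ]
    for item in assessments:
        lines.append(f"[{item['rank']}] {item['title']} (id: {item['id']})")
    lines.append("")
    lines.append(f"User question: {question}")
    return "\n".join(lines)
```

Including the id of each assessment in the prompt makes it possible to trace a statement in the reply back to a catalog entry.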
- Enterprise-grade GPU-accelerated inference
- Open models such as Llama 3.1 and NV-Embed
- Cost-effective pay-per-token pricing
- High performance for production RAG pipelines
- Useful for demonstrating modern AI infrastructure knowledge
- FAISS and BM25 are stored locally for fast iteration
- The reranker integration is a placeholder; replace it with a call to the NVIDIA reranker API
- Conversation state is reconstructed from message history
- Turn count is limited to 8 to meet the assignment constraints
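Because the server keeps no session state, the turn count has to be derived from the messages sent with each request. The sketch below shows one way to do that; it assumes a turn corresponds to one user message, and the names MAX_TURNS, count_turns, and conversation_state are hypothetical.

```python
MAX_TURNS = 8  # turn limit stated in the notes above


def count_turns(messages):
    """Count user messages in the history (assumed definition of a turn)."""
    return sum(1 for message in messages if message.get("role") == "user")


def conversation_state(messages, max_turns=MAX_TURNS):
    """Rebuild the per-request state from the message history alone."""
    turns = count_turns(messages)
    return {"turn_count": turns, "limit_reached": turns >= max_turns}
```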
This repo is Vercel-ready as a Python ASGI app. Vercel uses api/index.py as the entrypoint, and that module imports the FastAPI app from app/main.py.
- Push the repository to GitHub.
- Import the repo into Vercel.
- Set the required runtime variables, especially NVIDIA_API_KEY.
- Make sure the built assets in data/ are present so /health and /chat can load the retriever.
- The UI is available at /ui, and / redirects there automatically.