An AI-native semantic search API built with Retrieval-Augmented Generation (RAG) β designed to demonstrate production-pattern enterprise search engineering.
Traditional keyword search misses meaning. Enterprise Search RAG Demo retrieves documents by semantic similarity (vector embeddings + cosine similarity), then uses an LLM to generate grounded, context-anchored answers.
This mirrors how modern enterprise AI systems β like SAP Joule for SAP LeanIX β surface knowledge across an organization's IT landscape, architecture documentation, and operational data.
The core idea, in three steps:
- Retrieve β embed the user query, find the top-K semantically similar documents via cosine similarity
- Augment β inject the retrieved documents as grounded context into a carefully engineered prompt
- Generate β call an LLM to produce a precise answer anchored only in the retrieved context
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"question": "How can we reduce cloud costs?", "top_k": 2}'Sample response:
{
"question": "How can we reduce cloud costs?",
"answer": "Organizations can reduce Azure and AWS expenses by monitoring unused resources and implementing rightsizing strategies...",
"sources": [
{
"title": "Managing Cloud Infrastructure Costs",
"content": "Organizations can reduce Azure and AWS expenses by monitoring unused resources...",
"similarity_score": 0.8124
}
],
"model": "llama-3.1-8b-instant"
}flowchart LR
A[Client] -->|POST /search| B[FastAPI Server<br/>main.py]
B -->|Pydantic validation| C[RAG Engine<br/>rag.py]
C -->|1. RETRIEVE| D[Sentence Transformers<br/>all-MiniLM-L6-v2<br/>384-dim embeddings]
D -->|Cosine similarity| E[Pre-computed<br/>Document Vectors]
E -->|Top-K docs| F[2. AUGMENT<br/>Context injection<br/>into prompt]
F -->|3. GENERATE| G[Groq LLM<br/>Llama 3.1 8B Instant]
G -->|Grounded answer| B
B -->|JSON Response| A
- Client sends
POST /searchwith{"question": "...", "top_k": 3} - FastAPI validates the request via Pydantic models
search()encodes the question with Sentence Transformers (384-dim vector)- Cosine similarity is computed against pre-computed document vectors (indexed once at startup)
- Top-K documents are returned ranked by similarity
generate_answer()builds a context-injected prompt and calls Groq's Llama 3.1 8B- Response is returned with answer, sources, similarity scores, and the model used
- Python 3.12 or later
- A free Groq API key
# Clone the repo
git clone https://github.com/J4jatin/enterprise-search-rag-demo.git
cd enterprise-search-rag-demo
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Configure your Groq API key
cp .env.example .env
# Then edit .env and set: GROQ_API_KEY=your_key_here
# Start the server
uvicorn main:app --reload --host 0.0.0.0 --port 8000The API will be live at http://localhost:8000. Visit http://localhost:8000/docs for the auto-generated Swagger UI.
# Build the image
docker build -t enterprise-search-rag-demo .
# Run with your API key
docker run -p 8000:8000 -e GROQ_API_KEY=your_key_here enterprise-search-rag-demo| Method | Endpoint | Description |
|---|---|---|
| GET | / |
Service info and available endpoints |
| POST | /search |
Main RAG search β retrieve + augment + generate |
| GET | /health |
Liveness probe for Kubernetes / load balancers |
| GET | /docs |
Auto-generated Swagger UI |
| GET | /redoc |
Alternative API documentation |
Request body
{
"question": "What are Kubernetes deployment best practices?",
"top_k": 3
}Response
{
"question": "...",
"answer": "...",
"sources": [{ "title": "...", "content": "...", "similarity_score": 0.8421 }],
"model": "llama-3.1-8b-instant"
}- The question is encoded to a 384-dim dense vector using
all-MiniLM-L6-v2 - Document vectors are pre-computed at startup β a key performance optimization that eliminates per-query encoding of the corpus
- Cosine similarity is computed between the question vector and all document vectors
np.argsortselects the top-K most similar documents
- Retrieved documents are formatted as a structured context block
- A carefully engineered system prompt instructs the LLM to:
- Answer only from the provided context
- Acknowledge gaps if the context doesn't contain the answer
- Avoid hallucination by ignoring outside knowledge
- The augmented prompt is sent to Groq's
llama-3.1-8b-instant temperature=0.1for near-deterministic, grounded answersmax_tokens=500to cap response length for cost and UX
| Layer | Technology | Why this choice |
|---|---|---|
| Web framework | FastAPI | Native async, auto-docs, Pydantic validation |
| ASGI server | Uvicorn | Production-grade ASGI runtime |
| Embeddings | Sentence Transformers (MiniLM-L6-v2) | 384-dim, fast, great prototype quality |
| Vector math | NumPy + scikit-learn | Standard, performant, well-documented |
| LLM | Groq Llama 3.1 8B Instant | Fast inference, low cost, generous free tier |
| Validation | Pydantic v2 | Type-safe request and response models |
| Container | Docker (python:3.12-slim) | Lightweight, reproducible deployment |
| CI/CD | GitHub Actions | Automated import checks + smoke tests on PRs |
| Secrets | python-dotenv + .env files | Standard local pattern; Kubernetes Secrets in prod |
enterprise-search-rag-demo/
βββ .github/
β βββ workflows/
β βββ ci.yml # GitHub Actions CI pipeline
βββ documents.py # In-memory document corpus
βββ rag.py # Core retrieval + generation logic
βββ main.py # FastAPI app + endpoints
βββ Dockerfile # Container build (layer-cache optimized)
βββ requirements.txt # Python dependencies
βββ .env.example # Template for secrets
βββ .gitignore # Excludes .env, venv, __pycache__
βββ README.md # You are here
- Startup-time indexing. The corpus is embedded once at module load, not per request. This trades a few seconds of startup for sub-100ms query latency.
- Pre-trained embeddings.
all-MiniLM-L6-v2chosen for the balance of speed (384 dims), size (~80MB), and quality. Production candidates:all-mpnet-base-v2for higher precision, or a multilingual variant for international enterprises. - Groq + Llama 3.1 8B. Selected for inference speed (Groq's LPU hardware) and cost. Production candidates: Claude, GPT-4o, or a self-hosted LLM for data-sensitive deployments.
- Strict prompt grounding. The system prompt explicitly forbids outside knowledge, mitigating hallucination β a critical concern for enterprise search.
- Container-first. The Dockerfile uses layer caching (
COPY requirements.txt .beforeCOPY . .) so dependency installs are cached when only code changes. - Kubernetes-aware.
/healthendpoint enables liveness/readiness probes; uvicorn binds to0.0.0.0for container networking.
| Metric | Value |
|---|---|
| Embedding model load time | ~2 seconds (one-time at startup) |
| Corpus indexing | ~50ms per document (one-time) |
| Query encoding latency | ~10β20ms per query |
| Cosine similarity (5 docs) | <1ms |
| Groq LLM call (avg) | ~300β500ms |
| End-to-end query latency | ~500ms typical |
| Memory footprint | ~250MB (model + corpus + runtime) |
Scale note: brute-force cosine similarity is O(n) per query. For corpora beyond ~10K documents, swap to Approximate Nearest Neighbour (ANN) via FAISS, HNSW, or pgvector β see Roadmap.
An honest list of what would make this production-ready:
- Chunking strategy β split long documents into 300β500 token chunks with overlap before embedding
- Vector database β migrate from in-memory NumPy to
pgvectororWeaviatefor persistence and scale - Approximate Nearest Neighbour (ANN) β HNSW indexing for sub-linear query time at 1M+ docs
- Cross-encoder reranker β post-retrieval reranking with a slower but more accurate model (e.g.
bge-reranker-base) for higher precision - Hybrid retrieval β combine sparse (BM25) and dense vector search for the best of both worlds
- Unit & integration tests β pytest suite covering
search(),generate_answer(), and end-to-end via FastAPI'sTestClient - Async endpoints β convert
/searchtoasync def+ async Groq client for higher concurrency - Pinned dependencies β adopt Poetry or pip-tools for lockfile-based dependency management
- Non-root container user β add
USER appuserfor tighter container security - Observability β structured JSON logging, OpenTelemetry traces, Prometheus metrics
This is an intentional prototype β being upfront about what's not yet production-ready:
- No unit tests yet β CI currently only checks imports and the health endpoint
- Dependencies not pinned to exact versions β
pip freezelockfile is the next step - In-memory document store β sample docs are hard-coded; no persistence layer
- No chunking β long documents would be embedded as a single vector, losing local detail
- Brute-force similarity β O(n) per query; doesn't scale beyond a few thousand documents
- No rate limiting, auth, or CORS configuration β needed for any real deployment
- Container runs as root β should add a non-root user
These are known gaps, listed deliberately. The roadmap above is the prioritized fix list.
Jattin Shah β MSc Applied AI student at TU Dresden.
Passionate about search systems and AI-native architectures. This project was built to explore production-pattern RAG engineering β semantic retrieval, prompt engineering, grounded generation, container deployment, and CI/CD β in a single focused codebase.
- GitHub: @J4jatin
- LinkedIn: jattin-shah
MIT License β feel free to use, learn from, and adapt.
Built with curiosity, deliberately scoped, and engineered with the patterns that matter at enterprise scale.