NVIDIA NIM Architecture for SHL Assignment

Overview

This SHL Assessment Recommendation Engine demonstrates enterprise-grade AI infrastructure using NVIDIA services for production-ready RAG (Retrieval-Augmented Generation) pipelines.

The stack is designed to answer: "Which assessment should we recommend for this candidate?" using semantic understanding, keyword matching, intelligent reranking, and grounded generation.

Why NVIDIA Over Azure/OpenAI?

Aspect	NVIDIA NIM	Azure OpenAI
Model Control	Open-weight (Llama 3.1)	Proprietary GPT
Inference Speed	GPU-optimized inference	API latency
Cost Efficiency	Pay-per-token, lower rates	Premium pricing
Reranking	Built-in NVIDIA Reranker	Manual implementation
Interview Value	Shows AI infrastructure knowledge	Standard approach
Learning Value	Deep understanding of RAG systems	API integration only

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        USER REQUEST                              │
│        "Python backend developer with 5 years experience"       │
└──────────────────────────┬──────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────────┐
│                    FASTAPI ORCHESTRATOR                          │
│  ✓ Validate input                                               │
│  ✓ Check if clarification needed                                │
│  ✓ Route to retrieval pipeline                                  │
│  ✓ Orchestrate generation                                       │
└──────────────────────────┬──────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────────┐
│              HYBRID RETRIEVAL PIPELINE                           │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ Input: Query Text                                         │   │
│  └────────────────────────┬─────────────────────────────────┘   │
│                           ↓                                      │
│           ┌───────────────────────────────┐                     │
│           │ NVIDIA NV-Embed Embeddings    │                     │
│           │ (Text → Dense Vector)         │                     │
│           └───────────────────────────────┘                     │
│                ↙                           ↘                    │
│      ┌──────────────┐          ┌──────────────────┐             │
│      │ FAISS Search │          │ BM25 Keyword     │             │
│      │ (Vector)     │          │ Matching         │             │
│      └──────────────┘          └──────────────────┘             │
│           ↓                           ↓                         │
│      Top-5 Docs        +        Top-5 Docs                      │
│      (Semantic)               (Keyword)                          │
│           ↓                           ↓                         │
│           └───────────────────┬───────────────────┘             │
│                               ↓                                  │
│                 ┌─────────────────────────┐                     │
│                 │  Merge & Score          │                     │
│                 │  (Hybrid)               │                     │
│                 │  α * vector +           │                     │
│                 │  (1-α) * bm25           │                     │
│                 └────────────────────┬────┘                     │
│                                      ↓                          │
│                        Top-5 Candidates (Hybrid)               │
└────────────────────────────────┬──────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│           NVIDIA RERANKER (Deep Relevance Scoring)              │
│                                                                  │
│  For each candidate: score_rerank = f(query, doc_context)      │
│  • Semantic alignment                                           │
│  • Query intent matching                                        │
│  • Document context relevance                                  │
│  • Multi-factor ranking                                         │
└────────────────────────────┬──────────────────────────────────┘
                             ↓
                   Final Top-3 Results
                   (Reranked & Sorted)
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│        NVIDIA LLAMA 3.1 (Grounded Generation)                   │
│                                                                  │
│  System Prompt:                                                  │
│  "You are an assessment recommendation assistant.               │
│   Use ONLY the following retrieved assessments..."              │
│                                                                  │
│  + Retrieved Assessments Context                                │
│  + User Messages History                                        │
│  → Llama 3.1 70B generates response grounded in docs            │
└────────────────────────────┬──────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│              STRUCTURED JSON RESPONSE                           │
│                                                                  │
│  {                                                               │
│    "action": "respond",                                          │
│    "reply": "...",                                              │
│    "retrieved_assessments": [...],                              │
│    "provenance": {                                              │
│      "model": "meta/llama-3.1-70b-instruct",                    │
│      "embedding_model": "nvidia/nv-embed-qa-e5-v5",            │
│      "retrieval_method": "hybrid_bm25_vector",                  │
│      "reranked": true                                           │
│    }                                                             │
│  }                                                               │
└─────────────────────────────────────────────────────────────────┘

Component Breakdown

1. NVIDIA NIM LLM Endpoint

Model: Llama 3.1 70B Instruct
Task: Grounded generation, reasoning over retrieved context
Why: Better for conversational understanding + instruction-following than smaller models

2. NVIDIA NV-Embed-QA

Model: NV-Embed-QA (specialized for question-answering retrieval)
Task: Convert text → dense vectors (768-dim embeddings)
Why: Better semantic matching than generic embeddings; optimized for retrieval tasks

3. FAISS Vector Database

Task: Fast similarity search (L2 distance)
Why: Lightweight, explainable, interview-friendly; local inference with no API calls
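
As a rough illustration of what a flat L2 search computes, the sketch below does the same exhaustive comparison in pure Python on squared distances. The function names, the toy 2-dim vectors, and the distance-to-similarity conversion are assumptions for illustration, not the project's faiss_index.py.

```python
import heapq


def l2_search(query: list[float], vectors: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k nearest ids by squared L2 distance (smaller is closer)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scored = ((doc_id, sq_dist(query, vec)) for doc_id, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


def distance_to_similarity(dist: float) -> float:
    # One possible (assumed) conversion so that higher means more similar.
    return 1.0 / (1.0 + dist)


# Toy index with made-up vectors; real embeddings are much higher-dimensional.
toy_index = {
    "py_backend": [0.9, 0.1],
    "py_junior": [0.7, 0.3],
    "communication": [0.1, 0.9],
}
hits = l2_search([1.0, 0.0], toy_index, k=2)
```

A distance is "lower is better", so any score shown to users or mixed into a hybrid formula needs a conversion like the one sketched here.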

4. BM25 Keyword Retriever

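Task: Exact keyword matching with BM25 scoring (a TF-IDF-style ranking function with term-frequency saturation and document-length normalization)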
Why: Catches exact matches (e.g., "Python", "OOP") that semantic search might miss
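
To make the scoring concrete, here is a minimal pure-Python sketch of Okapi BM25. The whitespace tokenizer, the defaults k1=1.5 and b=0.75, and the function names are illustrative assumptions rather than the contents of the project's bm25.py.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase whitespace tokenizer (an assumption; real code may differ).
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Python backend developer assessment",
    "Communication and stakeholder skills assessment",
    "Java OOP fundamentals",
]
scores = bm25_scores("python developer", docs)
```

Only the first document shares tokens with the query, so it is the only one with a non-zero score. Raw BM25 scores are unbounded, which is why they would need normalizing before being mixed with vector scores.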

5. Hybrid Merging Strategy

hybrid_score = semantic_w * vector_score + bm25_w * bm25_score + metadata_w * metadata_score

Default weights: semantic=0.6, bm25=0.3, metadata=0.1
Metadata boost: uses catalog features such as skills overlap and title hints
Query expansion: expands intent terms (e.g., client-facing → communication/stakeholder/interpersonal)
Result: Semantic matching and exact keyword matching contribute to a single score, so a document missed by one retriever can still rank through the other
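
The three-term formula can be sketched as follows. Only the default weights come from this section. The expansion table, the skills-overlap definition of the metadata score, and the assumption that all inputs are already normalized to [0, 1] are hypothetical.

```python
# Weights from the section above; everything else here is illustrative.
WEIGHTS = {"semantic": 0.6, "bm25": 0.3, "metadata": 0.1}

# Hypothetical expansion table; the real mapping is not shown in this document.
EXPANSIONS = {"client-facing": ["communication", "stakeholder", "interpersonal"]}


def expand_query(query: str) -> set[str]:
    """Return the query terms plus any expansion terms they trigger."""
    terms = set(query.lower().split())
    for trigger, extra in EXPANSIONS.items():
        if trigger in terms:
            terms.update(extra)
    return terms


def metadata_score(query_terms: set[str], skills: list[str]) -> float:
    """Fraction of a document's skills that appear in the expanded query."""
    if not skills:
        return 0.0
    matched = sum(1 for s in skills if s.lower() in query_terms)
    return matched / len(skills)


def hybrid_score(vector_score: float, bm25_score: float, meta_score: float) -> float:
    # All three inputs are assumed to be normalized to [0, 1] beforehand.
    return (WEIGHTS["semantic"] * vector_score
            + WEIGHTS["bm25"] * bm25_score
            + WEIGHTS["metadata"] * meta_score)


terms = expand_query("client-facing python developer")
meta = metadata_score(terms, ["Python", "Communication"])
score = hybrid_score(vector_score=0.92, bm25_score=0.88, meta_score=meta)
```

Here the expansion adds "communication" to the query terms, so both listed skills match and the metadata term contributes its full 0.1 weight.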

6. NVIDIA Reranker

Model: NVIDIA Reranker (or comparable cross-encoder)
Task: Re-score top-K candidates using deep relevance model
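Why: A cross-encoder scores each query-document pair jointly, which usually orders the top results better than first-stage retrieval scores and so improves Recall@K at small K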

7. FastAPI Orchestrator

Task: Stateless orchestration of entire pipeline
Why:
- Production-ready async framework
- Easy to scale horizontally
- Clear separation of concerns

Data Flow Example

User Input:

"I need to assess a Python developer with stakeholder collaboration skills."

Step 1: Clarification Check

If len(query) < 8:
  → Ask for more details
Else:
  → Proceed to retrieval
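
A minimal sketch of this check, reading the threshold of 8 as a character count. The function name, the "clarify" and "retrieve" action labels, and the reply wording are assumptions for illustration.

```python
MIN_QUERY_CHARS = 8  # threshold from the pseudocode above, read as characters


def next_action(query: str) -> dict:
    """Decide whether to ask for clarification or proceed to retrieval."""
    cleaned = query.strip()
    if len(cleaned) < MIN_QUERY_CHARS:
        # Too little to retrieve on: ask the user for more detail.
        return {
            "action": "clarify",
            "reply": "Could you share more detail, such as the role, seniority, and key skills?",
        }
    return {"action": "retrieve", "query": cleaned}
```

Stripping whitespace first keeps padded input such as "   dev   " from passing the length check.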

Step 2: Generate Embedding

query_embedding = NIM.embed("I need to assess a Python developer...")
# → [0.224, -0.18, 0.56, ..., -0.09]  (768-dim vector)

Step 3: Hybrid Retrieval

# Vector search: top-3 by L2 distance (scores shown as similarities; higher is better)
vector_results = [
  {"id": "py_backend", "score": 0.92},
  {"id": "py_junior", "score": 0.85},
  {"id": "full_stack", "score": 0.78}
]

# BM25 search: top-3 by keyword match
bm25_results = [
  {"id": "py_backend", "score": 0.88},
  {"id": "oop_assessment", "score": 0.72},
  {"id": "communication", "score": 0.65}
]

# Merge (simplified two-term form with α=0.5; the metadata term is omitted here)
merged = {
  "py_backend": 0.5 * 0.92 + 0.5 * 0.88,      # = 0.90
  "py_junior": 0.5 * 0.85 + 0.5 * 0,          # = 0.425
  "full_stack": 0.5 * 0.78 + 0.5 * 0,         # = 0.39
  "oop_assessment": 0.5 * 0 + 0.5 * 0.72,     # = 0.36
  "communication": 0.5 * 0 + 0.5 * 0.65       # = 0.325
}

# Top-5 after merge, sorted by score descending
top_5 = sorted(merged, key=merged.get, reverse=True)
# → ["py_backend", "py_junior", "full_stack", "oop_assessment", "communication"]
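
The merge step above can be written as a small function over the two result lists. This is a sketch under the same simplification (two terms, α=0.5, a missing score counted as 0). The name merge_results is hypothetical.

```python
def merge_results(vector_results, bm25_results, alpha=0.5):
    """Combine two result lists as alpha * vector + (1 - alpha) * bm25."""
    vec = {r["id"]: r["score"] for r in vector_results}
    kw = {r["id"]: r["score"] for r in bm25_results}
    merged = {
        # A document missing from one list contributes 0 for that term.
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in vec.keys() | kw.keys()
    }
    # Highest combined score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


vector_results = [
    {"id": "py_backend", "score": 0.92},
    {"id": "py_junior", "score": 0.85},
    {"id": "full_stack", "score": 0.78},
]
bm25_results = [
    {"id": "py_backend", "score": 0.88},
    {"id": "oop_assessment", "score": 0.72},
    {"id": "communication", "score": 0.65},
]
ranked = merge_results(vector_results, bm25_results)
```

Counting a missing score as 0 penalizes documents found by only one retriever, which is why "communication" sits last before reranking.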

Step 4: Reranking

For each top-5 result:
  rerank_score = NVIDIA_Reranker(query, doc_context)
  
# Results after reranking (deep model vs. simple heuristic):
[
  {"id": "py_backend", "rerank_score": 0.96},  # Much higher confidence
  {"id": "communication", "rerank_score": 0.75},  # Now relevant!
  {"id": "py_junior", "rerank_score": 0.68},
  {"id": "oop_assessment", "rerank_score": 0.52},
  {"id": "full_stack", "rerank_score": 0.48}
]
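
The reranking loop can be sketched with a pluggable scoring function. The token-overlap scorer below is only a placeholder so the example runs offline. In the real pipeline the score would come from the reranking model, and all names and the "text" field here are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[dict],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[dict]:
    """Re-score candidates with score_fn(query, text) and keep the top_n."""
    rescored = [
        {**c, "rerank_score": score_fn(query, c["text"])}
        for c in candidates
    ]
    rescored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return rescored[:top_n]


def overlap_scorer(query: str, text: str) -> float:
    # Placeholder scorer: share of query tokens found in the document text.
    # A real deployment would call the reranking model here instead.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


candidates = [
    {"id": "full_stack", "text": "full stack web developer"},
    {"id": "py_backend", "text": "python backend developer"},
    {"id": "communication", "text": "stakeholder communication skills"},
]
top = rerank("python developer stakeholder", candidates, overlap_scorer, top_n=2)
```

Keeping the scorer as a parameter lets the same function run against a stub in tests and against the hosted reranker in production.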

Step 5: Grounded Generation

system_prompt = """
You are an assessment recommendation assistant.
Use ONLY the following retrieved assessments to ground your response.

Retrieved Assessments:
{
  "id": "py_backend",
  "title": "Python Backend Developer Assessment",
  "seniority": "mid",
  "skills": ["Python", "FastAPI", "Testing"]
},
...
"""

messages = [
  {"role": "system", "content": system_prompt},
  {"role": "user", "content": "I need to assess a Python developer..."}
]

response = NIM.chat(messages)  # pseudocode: Llama 3.1 70B Instruct via the NIM endpoint
# Llama generates: "Based on the retrieved assessments, I recommend
# the 'Python Backend Developer Assessment' because..."
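
Assembling the prompt is plain string work and can be sketched without calling any model. The template text follows the system prompt above. The function name, the JSON serialization of the assessments, and the optional history argument are assumptions.

```python
import json
from typing import Optional

SYSTEM_TEMPLATE = (
    "You are an assessment recommendation assistant.\n"
    "Use ONLY the following retrieved assessments to ground your response.\n\n"
    "Retrieved Assessments:\n{context}"
)


def build_messages(query: str, assessments: list[dict],
                   history: Optional[list[dict]] = None) -> list[dict]:
    """Assemble chat messages with the retrieved assessments in the system prompt."""
    context = json.dumps(assessments, indent=2)
    messages = [{"role": "system", "content": SYSTEM_TEMPLATE.format(context=context)}]
    messages.extend(history or [])  # earlier turns, if any
    messages.append({"role": "user", "content": query})
    return messages


messages = build_messages(
    "I need to assess a Python developer with stakeholder collaboration skills.",
    [{
        "id": "py_backend",
        "title": "Python Backend Developer Assessment",
        "seniority": "mid",
        "skills": ["Python", "FastAPI", "Testing"],
    }],
)
```

The resulting list has the usual role/content shape of a chat request, with the system message first and the newest user turn last.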

Step 6: Return Structured Response

{
  "action": "respond",
  "reply": "Based on the retrieved assessments, I recommend the 'Python Backend Developer Assessment'...",
  "retrieved_assessments": [
    {
      "rank": 1,
      "id": "py_backend",
      "title": "Python Backend Developer Assessment",
      "hybrid_score": 0.90,
      "vector_score": 0.92,
      "bm25_score": 0.88,
      "rerank_score": 0.96,
      "final_rank": 1
    },
    ...
  ],
  "turn_count": 1,
  "provenance": {
    "model": "meta/llama-3.1-70b-instruct",
    "embedding_model": "nvidia/nv-embed-qa-e5-v5",
    "retrieval_method": "hybrid_bm25_vector",
    "reranked": true
  }
}

Interview-Ready Talking Points

Q: "Why NVIDIA NIM instead of OpenAI?"

A: "NVIDIA NIM provides enterprise-grade GPU inference with open-weight models, better cost efficiency, and full control over the RAG pipeline. For this assignment, it showcases understanding of:

- Inference optimization
- Open-source LLM ecosystems
- Enterprise AI infrastructure
- Retrieval-augmented generation best practices

Specifically, I used NV-Embed for high-quality semantic embeddings, FAISS for efficient local search, BM25 for keyword robustness, and the NVIDIA reranker for deep relevance scoring. This mirrors production RAG architectures used in industry."

Q: "Walk me through the retrieval pipeline."

A: "The system uses hybrid retrieval combining vector search and keyword matching:

Vector search (FAISS): User query → NV-Embed → dense vector → similarity search
Keyword search (BM25): Query → tokenize → BM25 scoring (TF-IDF-style term weighting)
Merge: Weighted combination (α·vector + (1-α)·bm25) balances semantic and exact matching
Rerank: Top-K candidates passed to NVIDIA reranker for deep relevance rescoring

This approach is standard in modern RAG systems, and reranking the merged candidates typically improves Recall@K at small K."

Q: "How does grounding work?"

A: "The system passes retrieved assessments directly into the system prompt and instructs Llama 3.1 to use only those documents. This reduces the risk of hallucination and keeps recommendations tied to catalog entries, although a prompt alone cannot guarantee it."

Q: "What if the query is ambiguous?"

A: "The system implements a clarification-first policy: if the user's query is too short or lacks specificity, it asks for details before retrieving. This maximizes information gain while respecting the 8-turn conversation limit."

Production Deployment Considerations

Scalability

Stateless FastAPI: Horizontal scaling via load balancing
NVIDIA NIM: Hosted endpoints are scaled by NVIDIA; self-hosted NIM containers would need their own GPU capacity planning
FAISS + BM25: Local or distributed cache (Redis for multi-instance)

Monitoring

Application Insights: Track latency, error rates
Query analytics: Log queries + rankings for feedback loops
Model performance: A/B test different α values, reranker thresholds

Cost Optimization

NVIDIA NIM: Pay-per-token (typically $0.001–$0.01 per 1K tokens)
Local inference: FAISS + BM25 run free locally
Caching: Cache embeddings + rankings for repeated queries
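
The caching idea can be sketched with functools.lru_cache. The embed_remote function below is a stand-in that returns a fake vector, not a real NIM call, and the normalization step and cache size are assumptions.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the "paid" call actually happens


def embed_remote(text: str) -> tuple[float, ...]:
    # Stand-in for a paid embedding request; returns a fake 2-dim vector.
    calls["count"] += 1
    return (float(len(text)), float(text.count(" ")))


@lru_cache(maxsize=1024)
def embed_cached(text: str) -> tuple[float, ...]:
    # Tuples are returned so cached values cannot be mutated by callers.
    return embed_remote(text)


def normalize(query: str) -> str:
    # Collapse case and whitespace so trivially different queries share an entry.
    return " ".join(query.lower().split())


first = embed_cached(normalize("Python  Developer"))
second = embed_cached(normalize("python developer"))
```

Both queries normalize to the same string, so the second lookup is served from memory. An in-process cache like this is per instance, so a multi-instance deployment would need a shared store instead.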

Security

API keys: Store in Azure Key Vault or similar
Rate limiting: Implement per-user limits
Input validation: Pydantic models for request validation

Files Overview

project/
├── app/
│   ├── __init__.py
│   ├── main.py                      # FastAPI orchestrator
│   ├── services/
│   │   ├── __init__.py
│   │   ├── nim_client.py            # NVIDIA NIM wrapper
│   │   └── azure_client.py          # Legacy (archived)
│   └── retrieval/
│       ├── __init__.py
│       ├── faiss_index.py           # FAISS utilities
│       ├── bm25.py                  # BM25 retriever
│       └── hybrid.py                # Hybrid retrieval orchestrator
├── data/
│   ├── catalog.json                 # SHL assessment catalog
│   ├── faiss.index                  # FAISS vector index (generated)
│   ├── embeddings.pkl               # Cached embeddings (generated)
│   └── bm25_retriever.pkl           # BM25 index (generated)
├── scripts/
│   └── build_embeddings.py          # Generate embeddings + indices
├── tests/
│   └── test_health.py               # Health check + basic tests
├── requirements.txt                 # Dependencies
├── .env.example                     # Environment config template
├── README.md                        # User-facing documentation
└── ARCHITECTURE.md                  # This file

Next Steps / Future Enhancements

Conversation State Management
- Track user context across turns
- Implement "memory" of previous queries
Advanced Ranking
- Metadata-aware scoring (seniority, skills match)
- Threshold-based filtering
Catalog Integration
- Real SHL API integration
- Dynamic catalog updates
Evaluation Framework
- Compute Recall@K metrics
- A/B test retrieval strategies
Fine-tuning
- Domain-specific embeddings on SHL data
- Custom reranker training
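
For the evaluation item above, Recall@K can be computed as in the sketch below. It assumes a hand-labeled set of relevant assessment ids per test query. The function names and the sample labels are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the first k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average Recall@K over (retrieved, relevant) pairs, one per test query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Made-up labels for one query, reusing ids from the data flow example.
retrieved = ["py_backend", "communication", "py_junior", "oop_assessment"]
relevant = {"py_backend", "communication", "oop_assessment"}
r3 = recall_at_k(retrieved, relevant, k=3)
r4 = recall_at_k(retrieved, relevant, k=4)
```

Running the same labeled queries with and without the reranker, or with different weights, gives a direct way to compare retrieval strategies.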
