
Commit ffb23ce (1 parent: bd4395c)

docs: update DEPLOYMENT.md with comprehensive NIMs deployment and configuration

- Add detailed NVIDIA NIMs deployment section covering cloud, self-hosted, and hybrid options
- Document all 8 NIMs with their models, purposes, and environment variables
- Clarify configuration method: all endpoint URLs and API keys via environment variables
- Add deployment steps, verification, and troubleshooting for NIMs
- Update Kubernetes secrets configuration
- Ensure high accuracy in all statements about NIM deployment options

File tree

2 files changed: +1188, -8 lines

DEPLOYMENT.md

Lines changed: 263 additions & 8 deletions

@@ -7,6 +7,7 @@ Complete deployment guide for the Warehouse Operational Assistant with Docker an
 - [Quick Start](#quick-start)
 - [Prerequisites](#prerequisites)
 - [Environment Configuration](#environment-configuration)
+- [NVIDIA NIMs Deployment & Configuration](#nvidia-nims-deployment--configuration)
 - [Deployment Options](#deployment-options)
 - [Option 1: Docker Deployment](#option-1-docker-deployment)
 - [Option 2: Kubernetes/Helm Deployment](#option-2-kuberneteshelm-deployment)

@@ -173,12 +174,6 @@ REDIS_PORT=6379
 # Kafka
 KAFKA_BOOTSTRAP_SERVERS=localhost:9092

-# NVIDIA NIMs (optional)
-NIM_LLM_BASE_URL=http://localhost:8000/v1
-NIM_LLM_API_KEY=your-nim-llm-api-key
-NIM_EMBEDDINGS_BASE_URL=http://localhost:8001/v1
-NIM_EMBEDDINGS_API_KEY=your-nim-embeddings-api-key

 # CORS (for frontend access)
 CORS_ORIGINS=http://localhost:3001,http://localhost:3000
 ```

@@ -191,6 +186,266 @@ CORS_ORIGINS=http://localhost:3001,http://localhost:3000

See [docs/secrets.md](docs/secrets.md) for detailed security configuration.

## NVIDIA NIMs Deployment & Configuration

The Warehouse Operational Assistant uses **NVIDIA NIMs (NVIDIA Inference Microservices)** for AI-powered capabilities including LLM inference, embeddings, document processing, and content safety. All NIMs use **OpenAI-compatible API endpoints**, allowing for flexible deployment options.

**Configuration Method:** All NIM endpoint URLs and API keys are configured via **environment variables**. The NeMo Guardrails SDK additionally uses Colang (`.co`) and YAML (`.yml`) configuration files for guardrails logic, but these files reference environment variables for endpoint URLs and API keys.
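
Since every NIM exposes the same OpenAI-compatible surface, application code can assemble a request from nothing but the environment variables described in this section. A minimal sketch (the helper name and defaults are illustrative, not the application's actual code):

```python
import os

def chat_request(prompt: str):
    """Assemble an OpenAI-compatible chat request from NIM env vars (illustrative)."""
    base_url = os.environ.get("LLM_NIM_URL", "https://api.brev.dev/v1")
    url = f"{base_url}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.environ.get('NVIDIA_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    body = {
        "model": os.environ.get("LLM_MODEL", "nvidia/llama-3.3-nemotron-super-49b-v1"),
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, body
```

Any OpenAI-style HTTP client can then POST `body` to `url` with `headers`; switching between cloud and self-hosted deployments is purely a change of `LLM_NIM_URL`.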

### NIMs Overview

The system uses the following NVIDIA NIMs:

| NIM Service | Model | Purpose | Environment Variable | Default Endpoint |
|-------------|-------|---------|---------------------|------------------|
| **LLM Service** | Llama 3.3 Nemotron Super 49B | Primary language model for chat, reasoning, and generation | `LLM_NIM_URL` | `https://api.brev.dev/v1` |
| **Embedding Service** | llama-3_2-nv-embedqa-1b-v2 | Semantic search embeddings for RAG | `EMBEDDING_NIM_URL` | `https://integrate.api.nvidia.com/v1` |
| **NeMo Retriever** | NeMo Retriever | Document preprocessing and structure analysis | `NEMO_RETRIEVER_URL` | `https://integrate.api.nvidia.com/v1` |
| **NeMo OCR** | NeMoRetriever-OCR-v1 | Intelligent OCR with layout understanding | `NEMO_OCR_URL` | `https://integrate.api.nvidia.com/v1` |
| **Nemotron Parse** | Nemotron Parse | Advanced document parsing and extraction | `NEMO_PARSE_URL` | `https://integrate.api.nvidia.com/v1` |
| **Small LLM** | nemotron-nano-12b-v2-vl | Structured data extraction and entity recognition | `LLAMA_NANO_VL_URL` | `https://integrate.api.nvidia.com/v1` |
| **Large LLM Judge** | Llama 3.3 Nemotron Super 49B | Quality validation and confidence scoring | `LLAMA_70B_URL` | `https://integrate.api.nvidia.com/v1` |
| **NeMo Guardrails** | NeMo Guardrails | Content safety and compliance validation | `RAIL_API_URL` | `https://integrate.api.nvidia.com/v1` |
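
The table maps one-to-one onto environment variables, so application code can resolve every endpoint with a small lookup. A sketch (the map and function names are illustrative; defaults are taken from the table):

```python
import os

# Default endpoints from the table above; the environment variable overrides each one.
NIM_ENDPOINTS = {
    "LLM_NIM_URL": "https://api.brev.dev/v1",
    "EMBEDDING_NIM_URL": "https://integrate.api.nvidia.com/v1",
    "NEMO_RETRIEVER_URL": "https://integrate.api.nvidia.com/v1",
    "NEMO_OCR_URL": "https://integrate.api.nvidia.com/v1",
    "NEMO_PARSE_URL": "https://integrate.api.nvidia.com/v1",
    "LLAMA_NANO_VL_URL": "https://integrate.api.nvidia.com/v1",
    "LLAMA_70B_URL": "https://integrate.api.nvidia.com/v1",
    "RAIL_API_URL": "https://integrate.api.nvidia.com/v1",
}

def resolve_endpoints() -> dict:
    """Return the effective endpoint for each NIM, preferring the environment."""
    return {var: os.environ.get(var, default) for var, default in NIM_ENDPOINTS.items()}
```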

### Deployment Options

NIMs can be deployed in three ways:

#### Option 1: Cloud Endpoints (Recommended for Quick Start)

Use NVIDIA-hosted cloud endpoints for immediate deployment without infrastructure setup.

**For the 49B LLM Model:**
- **Endpoint**: `https://api.brev.dev/v1`
- **Use Case**: Production deployments, quick setup
- **Configuration**: Set `LLM_NIM_URL=https://api.brev.dev/v1`

**For Other NIMs:**
- **Endpoint**: `https://integrate.api.nvidia.com/v1`
- **Use Case**: Production deployments, quick setup
- **Configuration**: Set respective environment variables (e.g., `EMBEDDING_NIM_URL=https://integrate.api.nvidia.com/v1`)

**Environment Variables:**
```bash
# NVIDIA API Key (required for all cloud endpoints)
NVIDIA_API_KEY=your-nvidia-api-key-here

# LLM Service (49B model - uses brev.dev)
LLM_NIM_URL=https://api.brev.dev/v1
LLM_MODEL=nvcf:nvidia/llama-3.3-nemotron-super-49b-v1:dep-36ZiLbQIG2ZzK7gIIC5yh1E6lGk

# Embedding Service (uses integrate.api.nvidia.com)
EMBEDDING_NIM_URL=https://integrate.api.nvidia.com/v1

# Document Processing NIMs (all use integrate.api.nvidia.com)
NEMO_RETRIEVER_URL=https://integrate.api.nvidia.com/v1
NEMO_OCR_URL=https://integrate.api.nvidia.com/v1
NEMO_PARSE_URL=https://integrate.api.nvidia.com/v1
LLAMA_NANO_VL_URL=https://integrate.api.nvidia.com/v1
LLAMA_70B_URL=https://integrate.api.nvidia.com/v1

# NeMo Guardrails
RAIL_API_URL=https://integrate.api.nvidia.com/v1
RAIL_API_KEY=your-nvidia-api-key-here  # Falls back to NVIDIA_API_KEY if not set
```

#### Option 2: Self-Hosted NIMs (Recommended for Production)

Deploy NIMs on your own infrastructure for data privacy, cost control, and custom requirements.

**Benefits:**
- **Data Privacy**: Keep sensitive data on-premises
- **Cost Control**: Avoid per-request cloud costs
- **Custom Requirements**: Full control over infrastructure and configuration
- **Low Latency**: Reduced network latency for on-premises deployments

**Deployment Steps:**

1. **Deploy NIMs on your infrastructure** (using NVIDIA NGC containers or Kubernetes):
   ```bash
   # Example: Deploy LLM NIM on port 8000
   docker run --gpus all -p 8000:8000 \
     nvcr.io/nvidia/nim/llama-3.3-nemotron-super-49b:latest

   # Example: Deploy Embedding NIM on port 8001
   docker run --gpus all -p 8001:8001 \
     nvcr.io/nvidia/nim/nv-embedqa-e5-v5:latest
   ```

2. **Configure environment variables** to point to your self-hosted endpoints:
   ```bash
   # Self-hosted LLM NIM
   LLM_NIM_URL=http://your-nim-host:8000/v1
   LLM_MODEL=nvidia/llama-3.3-nemotron-super-49b-v1

   # Self-hosted Embedding NIM
   EMBEDDING_NIM_URL=http://your-nim-host:8001/v1

   # Self-hosted Document Processing NIMs
   NEMO_RETRIEVER_URL=http://your-nim-host:8002/v1
   NEMO_OCR_URL=http://your-nim-host:8003/v1
   NEMO_PARSE_URL=http://your-nim-host:8004/v1
   LLAMA_NANO_VL_URL=http://your-nim-host:8005/v1
   LLAMA_70B_URL=http://your-nim-host:8006/v1

   # Self-hosted NeMo Guardrails
   RAIL_API_URL=http://your-nim-host:8007/v1

   # API key (if your self-hosted NIMs require authentication)
   NVIDIA_API_KEY=your-api-key-here
   ```

3. **Verify connectivity**:
   ```bash
   # Test LLM endpoint
   curl -X POST http://your-nim-host:8000/v1/chat/completions \
     -H "Authorization: Bearer $NVIDIA_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"nvidia/llama-3.3-nemotron-super-49b-v1","messages":[{"role":"user","content":"test"}]}'

   # Test Embedding endpoint
   curl -X POST http://your-nim-host:8001/v1/embeddings \
     -H "Authorization: Bearer $NVIDIA_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"nvidia/nv-embedqa-e5-v5","input":"test"}'
   ```

**Important Notes:**
- All NIMs use **OpenAI-compatible API endpoints** (`/v1/chat/completions`, `/v1/embeddings`, etc.)
- Self-hosted NIMs are accessed over HTTP/HTTPS exactly like the cloud endpoints
- Ensure your self-hosted NIMs are reachable from the Warehouse Operational Assistant application
- For production, use HTTPS and proper authentication/authorization
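
When scripting a self-hosted rollout, it also helps to wait for each NIM to come up before starting the application. A standard-library sketch; probing `GET {base}/models` is an assumption based on the OpenAI-compatible surface, so substitute your NIM's actual health route if it differs:

```python
import time
import urllib.error
import urllib.request

def ready_url(base_url: str) -> str:
    """Derive a liveness probe URL from an OpenAI-style base URL (assumption)."""
    return base_url.rstrip("/") + "/models"

def wait_for_nim(base_url: str, api_key: str = "", timeout_s: int = 300) -> bool:
    """Poll the endpoint until it answers HTTP 200, or give up after timeout_s."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    req = urllib.request.Request(ready_url(base_url), headers=headers)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; keep polling
        time.sleep(5)
    return False
```

Usage would be along the lines of `wait_for_nim("http://your-nim-host:8000/v1")` for each configured endpoint before launching the application containers.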

#### Option 3: Hybrid Deployment

Mix cloud and self-hosted NIMs based on your requirements:

```bash
# Use cloud for LLM (49B model)
LLM_NIM_URL=https://api.brev.dev/v1

# Use self-hosted for embeddings (for data privacy)
EMBEDDING_NIM_URL=http://your-nim-host:8001/v1

# Use cloud for document processing
NEMO_RETRIEVER_URL=https://integrate.api.nvidia.com/v1
NEMO_OCR_URL=https://integrate.api.nvidia.com/v1
```

### Configuration Details

#### LLM Service Configuration

```bash
# Required: API endpoint (cloud or self-hosted)
LLM_NIM_URL=https://api.brev.dev/v1  # or http://your-nim-host:8000/v1

# Required: Model identifier
LLM_MODEL=nvcf:nvidia/llama-3.3-nemotron-super-49b-v1:dep-36ZiLbQIG2ZzK7gIIC5yh1E6lGk  # Cloud
# OR
LLM_MODEL=nvidia/llama-3.3-nemotron-super-49b-v1  # Self-hosted

# Required: API key (same key works for all NVIDIA endpoints)
NVIDIA_API_KEY=your-nvidia-api-key-here

# Optional: Generation parameters
LLM_TEMPERATURE=0.1
LLM_MAX_TOKENS=2000
LLM_TOP_P=1.0
LLM_FREQUENCY_PENALTY=0.0
LLM_PRESENCE_PENALTY=0.0

# Optional: Client timeout (seconds)
LLM_CLIENT_TIMEOUT=120

# Optional: Caching
LLM_CACHE_ENABLED=true
LLM_CACHE_TTL_SECONDS=300
```
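
All of these values arrive as strings from the environment, so whatever consumes them has to convert types. A sketch of one way to read the optional generation parameters (function name illustrative; defaults mirror the values above):

```python
import os

def llm_generation_params() -> dict:
    """Collect optional LLM generation parameters, falling back to the documented defaults."""
    return {
        "temperature": float(os.environ.get("LLM_TEMPERATURE", "0.1")),
        "max_tokens": int(os.environ.get("LLM_MAX_TOKENS", "2000")),
        "top_p": float(os.environ.get("LLM_TOP_P", "1.0")),
        "frequency_penalty": float(os.environ.get("LLM_FREQUENCY_PENALTY", "0.0")),
        "presence_penalty": float(os.environ.get("LLM_PRESENCE_PENALTY", "0.0")),
    }
```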

#### Embedding Service Configuration

```bash
# Required: API endpoint (cloud or self-hosted)
EMBEDDING_NIM_URL=https://integrate.api.nvidia.com/v1  # or http://your-nim-host:8001/v1

# Required: API key
NVIDIA_API_KEY=your-nvidia-api-key-here
```
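
Building the corresponding `/embeddings` request follows the same pattern as the chat endpoint. A sketch (helper name illustrative; note that this guide's examples reference more than one embedding model name, so pass whichever model your endpoint actually serves):

```python
import os

def embeddings_request(texts: list, model: str = "nvidia/nv-embedqa-e5-v5"):
    """Assemble an OpenAI-compatible /embeddings request for the embedding NIM (illustrative)."""
    base = os.environ.get("EMBEDDING_NIM_URL", "https://integrate.api.nvidia.com/v1")
    body = {"model": model, "input": texts}
    return f"{base}/embeddings", body
```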

#### NeMo Guardrails Configuration

```bash
# Required: API endpoint (cloud or self-hosted)
RAIL_API_URL=https://integrate.api.nvidia.com/v1  # or http://your-nim-host:8007/v1

# Required: API key (falls back to NVIDIA_API_KEY if not set)
RAIL_API_KEY=your-nvidia-api-key-here

# Optional: Guardrails implementation mode
USE_NEMO_GUARDRAILS_SDK=false  # Set to 'true' to use SDK with Colang (recommended)
GUARDRAILS_USE_API=true        # Set to 'false' to use pattern-based fallback
GUARDRAILS_TIMEOUT=10          # Timeout in seconds
```
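
Two details here are easy to get wrong in code: the `RAIL_API_KEY` fallback and the string-valued booleans. A sketch of both (function names illustrative):

```python
import os

def rail_api_key() -> str:
    """RAIL_API_KEY falls back to NVIDIA_API_KEY when unset, per the note above."""
    return os.environ.get("RAIL_API_KEY") or os.environ.get("NVIDIA_API_KEY", "")

def use_guardrails_sdk() -> bool:
    """Interpret the string-valued USE_NEMO_GUARDRAILS_SDK flag ('true'/'false')."""
    return os.environ.get("USE_NEMO_GUARDRAILS_SDK", "false").strip().lower() == "true"
```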

### Getting NVIDIA API Keys

1. **Sign up** for NVIDIA API access at the [NVIDIA API Portal](https://build.nvidia.com/)
2. **Generate an API key** from your account dashboard
3. **Set the environment variable**: `NVIDIA_API_KEY=your-api-key-here`

**Note:** The same API key works for all NVIDIA cloud endpoints (`api.brev.dev` and `integrate.api.nvidia.com`).

### Verification

After configuring NIMs, verify they are working:

```bash
# Test LLM endpoint
curl -X POST $LLM_NIM_URL/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"'$LLM_MODEL'","messages":[{"role":"user","content":"Hello"}]}'

# Test Embedding endpoint
curl -X POST $EMBEDDING_NIM_URL/embeddings \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/nv-embedqa-e5-v5","input":"test"}'

# Check application health (includes NIM connectivity)
curl http://localhost:8001/api/v1/health
```

### Troubleshooting NIMs

**Common Issues:**

1. **Authentication Errors (401/403)**:
   - Verify `NVIDIA_API_KEY` is set correctly
   - Ensure the API key has access to the requested models
   - Check that the API key hasn't expired

2. **Connection Timeouts**:
   - Verify the NIM endpoint URLs are correct
   - Check network connectivity to the endpoints
   - Increase `LLM_CLIENT_TIMEOUT` if needed
   - For self-hosted NIMs, ensure they are running and accessible

3. **Model Not Found (404)**:
   - Verify `LLM_MODEL` matches the model available at your endpoint
   - For cloud endpoints, check the model identifier format (e.g., `nvcf:nvidia/...`)
   - For self-hosted, use the model name format (e.g., `nvidia/llama-3.3-nemotron-super-49b-v1`)

4. **Rate Limiting (429)**:
   - Reduce request frequency
   - Implement request queuing/retry logic
   - Consider self-hosting for higher throughput
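
For rate limiting in particular, a small retry wrapper with exponential backoff and jitter is usually enough. A generic sketch (in real code, catch the specific rate-limit and timeout exceptions your HTTP client raises rather than bare `Exception`):

```python
import random
import time

def with_retries(call, max_attempts: int = 5, base_delay_s: float = 1.0):
    """Retry a callable on failure with exponential backoff plus jitter.

    Intended for transient failures such as HTTP 429 or timeouts; the last
    failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 1x, 2x, 4x, ... the base delay, plus jitter to avoid thundering herds
            time.sleep(base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s))
```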

**For detailed NIM deployment guides, see:**
- [NVIDIA NIM Documentation](https://docs.nvidia.com/nim/)
- [NVIDIA NGC Containers](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=scoreDESC&query=nim)

## Deployment Options

### Option 1: Docker Deployment

@@ -329,8 +584,8 @@ docker-compose -f deploy/compose/docker-compose.yaml up -d
 kubectl create secret generic warehouse-secrets \
   --from-literal=db-password=your-db-password \
   --from-literal=jwt-secret=your-jwt-secret \
-  --from-literal=nim-llm-api-key=your-nim-key \
-  --from-literal=nim-embeddings-api-key=your-embeddings-key \
+  --from-literal=nvidia-api-key=your-nvidia-api-key \
+  --from-literal=rail-api-key=your-rail-api-key \
   --from-literal=admin-password=your-admin-password \
   --namespace=warehouse-assistant
 ```
