@@ -7,6 +7,7 @@ Complete deployment guide for the Warehouse Operational Assistant with Docker an
 - [Quick Start](#quick-start)
 - [Prerequisites](#prerequisites)
 - [Environment Configuration](#environment-configuration)
+- [NVIDIA NIMs Deployment & Configuration](#nvidia-nims-deployment--configuration)
 - [Deployment Options](#deployment-options)
   - [Option 1: Docker Deployment](#option-1-docker-deployment)
   - [Option 2: Kubernetes/Helm Deployment](#option-2-kuberneteshelm-deployment)
@@ -173,12 +174,6 @@ REDIS_PORT=6379
 # Kafka
 KAFKA_BOOTSTRAP_SERVERS=localhost:9092
 
-# NVIDIA NIMs (optional)
-NIM_LLM_BASE_URL=http://localhost:8000/v1
-NIM_LLM_API_KEY=your-nim-llm-api-key
-NIM_EMBEDDINGS_BASE_URL=http://localhost:8001/v1
-NIM_EMBEDDINGS_API_KEY=your-nim-embeddings-api-key
-
 # CORS (for frontend access)
 CORS_ORIGINS=http://localhost:3001,http://localhost:3000
 ```
@@ -191,6 +186,266 @@ CORS_ORIGINS=http://localhost:3001,http://localhost:3000
 
 See [docs/secrets.md](docs/secrets.md) for detailed security configuration.
 
+## NVIDIA NIMs Deployment & Configuration
+
+The Warehouse Operational Assistant uses **NVIDIA NIMs (NVIDIA Inference Microservices)** for AI-powered capabilities, including LLM inference, embeddings, document processing, and content safety. All NIMs expose **OpenAI-compatible API endpoints**, which allows flexible deployment options.
+
+**Configuration Method:** All NIM endpoint URLs and API keys are configured via **environment variables**. The NeMo Guardrails SDK additionally uses Colang (`.co`) and YAML (`.yml`) configuration files for the guardrails logic, but those files reference environment variables for endpoint URLs and API keys.
+
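For reference, the guardrails YAML usually just declares the main model and points it at the same environment-driven endpoint. The sketch below is illustrative only: exact field names vary across `nemoguardrails` versions, and the `engine`/`parameters` values shown are assumptions, not taken from this repository.

```yaml
# config.yml -- illustrative sketch; verify field names against your SDK version.
models:
  - type: main
    engine: nim                                      # assumed NIM engine integration
    model: nvidia/llama-3.3-nemotron-super-49b-v1
    parameters:
      base_url: https://integrate.api.nvidia.com/v1  # mirror RAIL_API_URL here
```

The API key itself is never placed in the file; the SDK reads it from the environment (`RAIL_API_KEY`, with `NVIDIA_API_KEY` as the fallback).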
+### NIMs Overview
+
+The system uses the following NVIDIA NIMs:
+
+| NIM Service | Model | Purpose | Environment Variable | Default Endpoint |
+|-------------|-------|---------|----------------------|------------------|
+| **LLM Service** | Llama 3.3 Nemotron Super 49B | Primary language model for chat, reasoning, and generation | `LLM_NIM_URL` | `https://api.brev.dev/v1` |
+| **Embedding Service** | llama-3_2-nv-embedqa-1b-v2 | Semantic search embeddings for RAG | `EMBEDDING_NIM_URL` | `https://integrate.api.nvidia.com/v1` |
+| **NeMo Retriever** | NeMo Retriever | Document preprocessing and structure analysis | `NEMO_RETRIEVER_URL` | `https://integrate.api.nvidia.com/v1` |
+| **NeMo OCR** | NeMoRetriever-OCR-v1 | Intelligent OCR with layout understanding | `NEMO_OCR_URL` | `https://integrate.api.nvidia.com/v1` |
+| **Nemotron Parse** | Nemotron Parse | Advanced document parsing and extraction | `NEMO_PARSE_URL` | `https://integrate.api.nvidia.com/v1` |
+| **Small LLM** | nemotron-nano-12b-v2-vl | Structured data extraction and entity recognition | `LLAMA_NANO_VL_URL` | `https://integrate.api.nvidia.com/v1` |
+| **Large LLM Judge** | Llama 3.3 Nemotron Super 49B | Quality validation and confidence scoring | `LLAMA_70B_URL` | `https://integrate.api.nvidia.com/v1` |
+| **NeMo Guardrails** | NeMo Guardrails | Content safety and compliance validation | `RAIL_API_URL` | `https://integrate.api.nvidia.com/v1` |
+
+### Deployment Options
+
+NIMs can be deployed in three ways:
+
+#### Option 1: Cloud Endpoints (Recommended for Quick Start)
+
+Use NVIDIA-hosted cloud endpoints for immediate deployment without infrastructure setup.
+
+**For the 49B LLM Model:**
+- **Endpoint**: `https://api.brev.dev/v1`
+- **Use Case**: Production deployments, quick setup
+- **Configuration**: Set `LLM_NIM_URL=https://api.brev.dev/v1`
+
+**For Other NIMs:**
+- **Endpoint**: `https://integrate.api.nvidia.com/v1`
+- **Use Case**: Production deployments, quick setup
+- **Configuration**: Set the respective environment variables (e.g., `EMBEDDING_NIM_URL=https://integrate.api.nvidia.com/v1`)
+
+**Environment Variables:**
+```bash
+# NVIDIA API Key (required for all cloud endpoints)
+NVIDIA_API_KEY=your-nvidia-api-key-here
+
+# LLM Service (49B model - uses brev.dev)
+LLM_NIM_URL=https://api.brev.dev/v1
+LLM_MODEL=nvcf:nvidia/llama-3.3-nemotron-super-49b-v1:dep-36ZiLbQIG2ZzK7gIIC5yh1E6lGk
+
+# Embedding Service (uses integrate.api.nvidia.com)
+EMBEDDING_NIM_URL=https://integrate.api.nvidia.com/v1
+
+# Document Processing NIMs (all use integrate.api.nvidia.com)
+NEMO_RETRIEVER_URL=https://integrate.api.nvidia.com/v1
+NEMO_OCR_URL=https://integrate.api.nvidia.com/v1
+NEMO_PARSE_URL=https://integrate.api.nvidia.com/v1
+LLAMA_NANO_VL_URL=https://integrate.api.nvidia.com/v1
+LLAMA_70B_URL=https://integrate.api.nvidia.com/v1
+
+# NeMo Guardrails
+RAIL_API_URL=https://integrate.api.nvidia.com/v1
+RAIL_API_KEY=your-nvidia-api-key-here  # Falls back to NVIDIA_API_KEY if not set
+```
+
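Before starting the application, it can help to fail fast when any required variable above is missing. A minimal preflight sketch; the `check_nim_env` helper is hypothetical, not part of the repository:

```shell
#!/usr/bin/env bash
# Hypothetical preflight check: report every required NIM variable that is unset or empty.
required_vars=(NVIDIA_API_KEY LLM_NIM_URL LLM_MODEL EMBEDDING_NIM_URL)

check_nim_env() {
  local var missing=0
  for var in "${required_vars[@]}"; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var
    if [ -z "${!var:-}" ]; then
      echo "missing required variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_nim_env || exit 1` at the top of a startup script so misconfiguration surfaces before the first NIM request.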
+#### Option 2: Self-Hosted NIMs (Recommended for Production)
+
+Deploy NIMs on your own infrastructure for data privacy, cost control, and custom requirements.
+
+**Benefits:**
+- **Data Privacy**: Keep sensitive data on-premises
+- **Cost Control**: Avoid per-request cloud costs
+- **Custom Requirements**: Full control over infrastructure and configuration
+- **Low Latency**: Reduced network latency for on-premises deployments
+
+**Deployment Steps:**
+
+1. **Deploy NIMs on your infrastructure** (using NVIDIA NGC containers or Kubernetes):
+   ```bash
+   # Example: Deploy LLM NIM on port 8000 (NGC_API_KEY is needed to pull model weights)
+   docker run --gpus all -p 8000:8000 \
+     -e NGC_API_KEY \
+     nvcr.io/nvidia/nim/llama-3.3-nemotron-super-49b:latest
+
+   # Example: Deploy Embedding NIM on port 8001
+   docker run --gpus all -p 8001:8001 \
+     -e NGC_API_KEY \
+     nvcr.io/nvidia/nim/nv-embedqa-e5-v5:latest
+   ```
+
+2. **Configure environment variables** to point to your self-hosted endpoints:
+   ```bash
+   # Self-hosted LLM NIM
+   LLM_NIM_URL=http://your-nim-host:8000/v1
+   LLM_MODEL=nvidia/llama-3.3-nemotron-super-49b-v1
+
+   # Self-hosted Embedding NIM
+   EMBEDDING_NIM_URL=http://your-nim-host:8001/v1
+
+   # Self-hosted Document Processing NIMs
+   NEMO_RETRIEVER_URL=http://your-nim-host:8002/v1
+   NEMO_OCR_URL=http://your-nim-host:8003/v1
+   NEMO_PARSE_URL=http://your-nim-host:8004/v1
+   LLAMA_NANO_VL_URL=http://your-nim-host:8005/v1
+   LLAMA_70B_URL=http://your-nim-host:8006/v1
+
+   # Self-hosted NeMo Guardrails
+   RAIL_API_URL=http://your-nim-host:8007/v1
+
+   # API Key (if your self-hosted NIMs require authentication)
+   NVIDIA_API_KEY=your-api-key-here
+   ```
+
+3. **Verify connectivity**:
+   ```bash
+   # Test LLM endpoint
+   curl -X POST http://your-nim-host:8000/v1/chat/completions \
+     -H "Authorization: Bearer $NVIDIA_API_KEY" \
+     -H "Content-Type: application/json" \
+     -d '{"model":"nvidia/llama-3.3-nemotron-super-49b-v1","messages":[{"role":"user","content":"test"}]}'
+
+   # Test Embedding endpoint
+   curl -X POST http://your-nim-host:8001/v1/embeddings \
+     -H "Authorization: Bearer $NVIDIA_API_KEY" \
+     -H "Content-Type: application/json" \
+     -d '{"model":"nvidia/nv-embedqa-e5-v5","input":"test"}'
+   ```
+
+**Important Notes:**
+- All NIMs use **OpenAI-compatible API endpoints** (`/v1/chat/completions`, `/v1/embeddings`, etc.)
+- Self-hosted NIMs are accessed over HTTP/HTTPS in the same fashion as the cloud endpoints
+- Ensure your self-hosted NIMs are reachable from the Warehouse Operational Assistant application
+- For production, use HTTPS and proper authentication/authorization
+
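Self-hosted NIM containers can take several minutes to load model weights, so deployment scripts often wait for readiness before running the connectivity checks above. A small sketch; the `wait_until_ready` helper is hypothetical, and it assumes the NIM container's `/v1/health/ready` probe:

```shell
#!/usr/bin/env bash
# Hypothetical readiness loop: rerun a probe command until it succeeds or attempts run out.
wait_until_ready() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}

# Example: wait up to ~2 minutes for a self-hosted LLM NIM to finish loading
# wait_until_ready 120 curl --fail --silent http://your-nim-host:8000/v1/health/ready
```

Because the probe is just a command, the same helper works for every NIM port in the list above.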
+#### Option 3: Hybrid Deployment
+
+Mix cloud and self-hosted NIMs based on your requirements:
+
+```bash
+# Use cloud for LLM (49B model)
+LLM_NIM_URL=https://api.brev.dev/v1
+
+# Use self-hosted for embeddings (for data privacy)
+EMBEDDING_NIM_URL=http://your-nim-host:8001/v1
+
+# Use cloud for document processing
+NEMO_RETRIEVER_URL=https://integrate.api.nvidia.com/v1
+NEMO_OCR_URL=https://integrate.api.nvidia.com/v1
+```
+
+### Configuration Details
+
+#### LLM Service Configuration
+
+```bash
+# Required: API endpoint (cloud or self-hosted)
+LLM_NIM_URL=https://api.brev.dev/v1  # or http://your-nim-host:8000/v1
+
+# Required: Model identifier
+LLM_MODEL=nvcf:nvidia/llama-3.3-nemotron-super-49b-v1:dep-36ZiLbQIG2ZzK7gIIC5yh1E6lGk  # Cloud
+# OR
+LLM_MODEL=nvidia/llama-3.3-nemotron-super-49b-v1  # Self-hosted
+
+# Required: API key (the same key works for all NVIDIA endpoints)
+NVIDIA_API_KEY=your-nvidia-api-key-here
+
+# Optional: Generation parameters
+LLM_TEMPERATURE=0.1
+LLM_MAX_TOKENS=2000
+LLM_TOP_P=1.0
+LLM_FREQUENCY_PENALTY=0.0
+LLM_PRESENCE_PENALTY=0.0
+
+# Optional: Client timeout (seconds)
+LLM_CLIENT_TIMEOUT=120
+
+# Optional: Caching
+LLM_CACHE_ENABLED=true
+LLM_CACHE_TTL_SECONDS=300
+```
+
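The generation parameters above map one-to-one onto fields of the OpenAI-compatible `/chat/completions` request body. A hedged sketch of how a request could assemble them from the environment; the `build_chat_payload` helper is hypothetical, not part of the repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a /chat/completions JSON body from the environment,
# falling back to the documented defaults when a variable is unset.
build_chat_payload() {
  cat <<EOF
{"model":"${LLM_MODEL}","temperature":${LLM_TEMPERATURE:-0.1},"max_tokens":${LLM_MAX_TOKENS:-2000},"top_p":${LLM_TOP_P:-1.0},"messages":[{"role":"user","content":"$1"}]}
EOF
}

# Example:
# curl -s -X POST "$LLM_NIM_URL/chat/completions" \
#   -H "Authorization: Bearer $NVIDIA_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$(build_chat_payload "Hello")"
```

Note the message content is interpolated raw here, so quotes in `$1` would need escaping in real use.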
+#### Embedding Service Configuration
+
+```bash
+# Required: API endpoint (cloud or self-hosted)
+EMBEDDING_NIM_URL=https://integrate.api.nvidia.com/v1  # or http://your-nim-host:8001/v1
+
+# Required: API key
+NVIDIA_API_KEY=your-nvidia-api-key-here
+```
+
+#### NeMo Guardrails Configuration
+
+```bash
+# Required: API endpoint (cloud or self-hosted)
+RAIL_API_URL=https://integrate.api.nvidia.com/v1  # or http://your-nim-host:8007/v1
+
+# Required: API key (falls back to NVIDIA_API_KEY if not set)
+RAIL_API_KEY=your-nvidia-api-key-here
+
+# Optional: Guardrails implementation mode
+USE_NEMO_GUARDRAILS_SDK=false  # Set to 'true' to use the SDK with Colang (recommended)
+GUARDRAILS_USE_API=true        # Set to 'false' to use the pattern-based fallback
+GUARDRAILS_TIMEOUT=10          # Timeout in seconds
+```
+
+### Getting NVIDIA API Keys
+
+1. **Sign up** for NVIDIA API access at the [NVIDIA API Portal](https://build.nvidia.com/)
+2. **Generate an API key** from your account dashboard
+3. **Set the environment variable**: `NVIDIA_API_KEY=your-api-key-here`
+
+**Note:** The same API key works for all NVIDIA cloud endpoints (`api.brev.dev` and `integrate.api.nvidia.com`).
+
+### Verification
+
+After configuring NIMs, verify that they are working:
+
+```bash
+# Test LLM endpoint
+curl -X POST "$LLM_NIM_URL/chat/completions" \
+  -H "Authorization: Bearer $NVIDIA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"model":"'"$LLM_MODEL"'","messages":[{"role":"user","content":"Hello"}]}'
+
+# Test Embedding endpoint
+curl -X POST "$EMBEDDING_NIM_URL/embeddings" \
+  -H "Authorization: Bearer $NVIDIA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"model":"nvidia/nv-embedqa-e5-v5","input":"test"}'
+
+# Check application health (includes NIM connectivity)
+curl http://localhost:8001/api/v1/health
+```
+
+### Troubleshooting NIMs
+
+**Common Issues:**
+
+1. **Authentication Errors (401/403)**:
+   - Verify `NVIDIA_API_KEY` is set correctly
+   - Ensure the API key has access to the requested models
+   - Check that the API key hasn't expired
+
+2. **Connection Timeouts**:
+   - Verify the NIM endpoint URLs are correct
+   - Check network connectivity to the endpoints
+   - Increase `LLM_CLIENT_TIMEOUT` if needed
+   - For self-hosted NIMs, ensure they are running and accessible
+
+3. **Model Not Found (404)**:
+   - Verify `LLM_MODEL` matches the model available at your endpoint
+   - For cloud endpoints, check the model identifier format (e.g., `nvcf:nvidia/...`)
+   - For self-hosted NIMs, use the plain model name format (e.g., `nvidia/llama-3.3-nemotron-super-49b-v1`)
+
+4. **Rate Limiting (429)**:
+   - Reduce request frequency
+   - Implement request queuing/retry logic
+   - Consider self-hosting for higher throughput
+
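For the 429 case, retrying with exponential backoff is usually enough. A minimal sketch; the `retry_with_backoff` helper is hypothetical, not part of the repository:

```shell
#!/usr/bin/env bash
# Hypothetical retry helper: rerun a command, doubling the delay after each failure.
retry_with_backoff() {
  local max_attempts=$1; shift
  local delay=1 attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed after $max_attempts attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example: retry a chat completion up to 5 times (curl --fail exits non-zero on 429)
# retry_with_backoff 5 curl --fail -s -X POST "$LLM_NIM_URL/chat/completions" \
#   -H "Authorization: Bearer $NVIDIA_API_KEY" -H "Content-Type: application/json" \
#   -d '{"model":"'"$LLM_MODEL"'","messages":[{"role":"user","content":"Hello"}]}'
```

Exponential backoff avoids hammering an already rate-limited endpoint, which a fixed-interval retry loop would do.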
+**For detailed NIM deployment guides, see:**
+- [NVIDIA NIM Documentation](https://docs.nvidia.com/nim/)
+- [NVIDIA NGC Containers](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=scoreDESC&query=nim)
+
 ## Deployment Options
 
 ### Option 1: Docker Deployment
@@ -329,8 +584,8 @@ docker-compose -f deploy/compose/docker-compose.yaml up -d
 kubectl create secret generic warehouse-secrets \
   --from-literal=db-password=your-db-password \
   --from-literal=jwt-secret=your-jwt-secret \
-  --from-literal=nim-llm-api-key=your-nim-key \
-  --from-literal=nim-embeddings-api-key=your-embeddings-key \
+  --from-literal=nvidia-api-key=your-nvidia-api-key \
+  --from-literal=rail-api-key=your-rail-api-key \
   --from-literal=admin-password=your-admin-password \
   --namespace=warehouse-assistant
 ```