This project demonstrates GPU-Accelerated Generative AI and Agentic AI concepts in a minimal, reproducible form.
It uses NVIDIA NIM (NVIDIA Inference Microservices) to execute LLM workloads, combined with NeMo Guardrails for safety and controllability.
Because LLM inference runs entirely on NVIDIA's hosted GPU infrastructure, the demo requires no local GPU and can run at zero cost.
| Component | Technology | Description |
|---|---|---|
| Framework | FastAPI | Lightweight, high-performance web API framework |
| AI API | NVIDIA NIM (meta/llama-3.1-8b-instruct) | Free, serverless LLM inference environment |
| Safety Layer | NeMo Guardrails | Dialogue control and tool access management |
| Agentic Tools | calc / kb | Calculator and FAQ retrieval (RAG-like behavior) |
| Storage | kb.json | Local knowledge base (replaceable with cloud storage) |
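For reference, NIM exposes an OpenAI-compatible endpoint, so the model listed above can be called with the standard `openai` client. The snippet below is a minimal sketch under that assumption, not the project's actual client code; the base URL follows NVIDIA's published API catalog and everything else is illustrative.

```python
# Minimal sketch: calling the NIM-hosted Llama 3.1 8B model directly.
# Illustrative only -- the project's own client wiring may differ.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NIM's OpenAI-compatible endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize our business hours."}],
    temperature=0.2,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```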
- Parses user messages and automatically routes them to internal tools (see the sketch below):
  - Math expressions → `calc`
  - Business hours / pricing / contact info → `kb`
- Generates concise, safe replies based on Guardrails policies.
- Works even with `OFFLINE_MODE=1` (mock mode without the NIM API).
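A rough sketch of what this routing could look like; the function names, regex, and keywords here are placeholders, not the actual contents of `app/rails/tools.py`.

```python
# Sketch of the routing idea: pick a tool based on the message shape.
# Placeholder names -- the real implementation lives in app/rails/tools.py.
import re

MATH_PATTERN = re.compile(r"[\d\s\.\+\-\*\/\(\)]+")
KB_KEYWORDS = ("hours", "price", "pricing", "contact")

def route(message: str) -> str:
    """Return the name of the tool that should handle the message."""
    if MATH_PATTERN.fullmatch(message.strip()):
        return "calc"
    if any(keyword in message.lower() for keyword in KB_KEYWORDS):
        return "kb"
    return "llm"  # fall through to the NIM-backed model
```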
| Input | Internal Tool | Example Output |
|---|---|---|
| `(1000-250)*0.1` | `calc` | `75.0` |
| What are your business hours and prices? | `kb` | Summarized response in Japanese |
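As an illustration of the `calc` behavior in the first row, an arithmetic expression can be evaluated safely without `eval` by walking a restricted AST. This is only one possible approach and not necessarily how the project's `tools.py` implements it.

```python
# Safe arithmetic evaluation via a restricted AST walk (illustrative sketch).
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calc(expression: str) -> float:
    """Evaluate a basic arithmetic expression, e.g. '(1000-250)*0.1' -> 75.0."""
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)
```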
Clone the repository and set up the environment:

```bash
git clone [email protected]:REICHIYAN/nvda_stack_agent.git
cd nvda_stack_agent
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Create a `.env` file:

```bash
# Option 1: Using the real NIM API
NVIDIA_API_KEY=your_api_key_here

# Option 2: Offline mode (mock responses)
OFFLINE_MODE=1
```

Start the server:

```bash
uvicorn app.main:app --reload
```

Expected output:

```
INFO:     Uvicorn running on http://127.0.0.1:8000
```
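Inside the app these two settings are presumably read from the environment. A minimal sketch of that pattern, assuming `python-dotenv` is available (it may or may not be listed in `requirements.txt`):

```python
# Sketch: reading the two .env settings above (illustrative, not the app's actual code).
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # pick up .env from the project root

NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY", "")
OFFLINE_MODE = os.getenv("OFFLINE_MODE", "0") == "1"

if OFFLINE_MODE:
    print("Running with mocked responses; no NIM calls will be made.")
```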
Query the knowledge base:

```bash
curl -s http://127.0.0.1:8000/chat \
  -H 'Content-Type: application/json' \
  -d '{"message":"Tell me business hours and pricing"}' | jq
```

Expected output:
```json
{
  "reply": "(Local response) Tool result: ...",
  "tool_calls": [
    {"name": "kb", "input": "Tell me business hours...", "result": "..."}
  ]
}
```

Run a calculation:

```bash
curl -s http://127.0.0.1:8000/chat \
  -H 'Content-Type: application/json' \
  -d '{"message":"(1000-250)*0.1"}' | jq
```

Expected output:
```json
{
  "reply": "(Local response) Tool result: 75.0",
  "tool_calls": [
    {"name": "calc", "input": "(1000-250)*0.1", "result": "75.0"}
  ]
}
```
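The same calls can of course be made from Python. A small equivalent of the curl examples using the `requests` library (the exact reply text depends on whether `OFFLINE_MODE` is set):

```python
# Equivalent of the curl examples above, using the requests library.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/chat",
    json={"message": "(1000-250)*0.1"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["reply"])        # e.g. "(Local response) Tool result: 75.0"
print(data["tool_calls"])   # which tool was invoked and with what input
```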
Project structure:

```
nvda_stack_agent/
├─ app/
│  ├─ main.py            # FastAPI entry point
│  ├─ schemas.py         # Pydantic data models
│  ├─ rails/
│  │  ├─ colang/flows.co # Guardrails Colang definitions
│  │  └─ tools.py        # calc / kb implementations
│  └─ __init__.py
├─ kb.json               # Local knowledge base
├─ requirements.txt
├─ README.md
└─ .env (ignored)
```
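`kb.json` is the local knowledge base behind the `kb` tool. Its exact schema is not shown here, but a lookup along the following lines would match the behavior described above; the flat topic-to-text shape and field handling are assumptions, not the real schema.

```python
# Sketch of a kb.json lookup (the topic -> text schema is an assumption).
import json
from pathlib import Path

def kb(query: str, path: str = "kb.json") -> str:
    """Return knowledge-base entries whose topic appears in the query."""
    entries: dict[str, str] = json.loads(Path(path).read_text(encoding="utf-8"))
    hits = [text for topic, text in entries.items() if topic.lower() in query.lower()]
    return "\n".join(hits) if hits else "No matching entry found."
```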
- Multi-Cloud / Hybrid: NIM supports both API and on-prem GPU hosting.
- AIOps / MLOps: Integrate Triton Inference Server for multi-model orchestration.
- Storage / Migration: Replace `kb.json` with S3, GCS, or enterprise-grade storage.
- Safety / Compliance: Guardrails enforces output filters, tool whitelisting, and secure execution.
MIT License © 2025 Rei Taguchi