Welcome! This demo shows NVSentinel's core functionality running locally on your laptop. You'll see how NVSentinel automatically detects GPU failures and protects your cluster by cordoning faulty nodes.
💡 No GPU required for this demo!
This demo runs on any laptop - no GPU needed! We simulate GPU failures by sending test events directly to NVSentinel, allowing you to see the full detection and response workflow without any special hardware.
- GPU Health Monitoring - How NVSentinel detects hardware failures
- Automated Response - How faulty nodes are automatically quarantined
- Event-Driven Architecture - How health events flow through the system
System Requirements:
- Disk Space: ~10GB free (for Docker images and KIND cluster)
- Memory: 4GB RAM minimum, 8GB recommended
- CPU: 2 cores minimum
Required tools:
- Docker - For running KIND (Kubernetes in Docker)
- kubectl - Kubernetes command-line tool
- kind - Kubernetes IN Docker
- helm - Kubernetes package manager
- jq - JSON processor for parsing Kubernetes output
Optional:
- curl - For sending HTTP requests (usually pre-installed)
Best for: Quick overview, presentations, or if you're short on time.
What you'll see: The entire workflow runs automatically from cluster creation through error injection to verification. Great for getting a quick sense of NVSentinel's capabilities, but you won't see the details of each step.
```shell
# Run the complete demo (takes ~5-10 minutes)
make demo

# Clean up when done
make cleanup
```

Best for: Understanding how NVSentinel works, learning the architecture, and seeing logs and events in detail.
What you'll learn: By running each script individually, you'll see the cluster state before and after each action, understand the event flow, and have time to explore logs and Kubernetes resources at each stage.
See below for expected output after each step!
```shell
# Step 0: Create cluster and install NVSentinel
./scripts/00-setup.sh

# Step 1: View the healthy cluster
./scripts/01-show-cluster.sh

# Step 2: Inject a GPU fault (simulates hardware failure)
./scripts/02-inject-error.sh

# Step 3: Verify node was cordoned
./scripts/03-verify-cordon.sh

# Clean up
./scripts/99-cleanup.sh
```

NVSentinel monitors GPU health through multiple channels and can detect a range of hardware and driver failures via DCGM health checks, XID error codes, and system logs.
This demo simulates a fatal GPU hardware fault (corrupt InfoROM) using DCGM's error injection capability. This type of fault requires the node to be removed from service to protect workloads.
When a GPU fault is detected, NVSentinel responds in sequence:
1. Health Monitor detects the GPU error (from DCGM, syslog, or other sources)
2. Platform Connectors receives the health event via gRPC
3. MongoDB stores the event in the persistent event database
4. Fault Quarantine watches for new events and evaluates rules
5. Kubernetes API - the node is cordoned (no new pods scheduled)
6. Node Drainer (if enabled) gracefully evicts running workloads
7. Fault Remediation (if enabled) triggers repair workflows

This simplified demo focuses on steps 1-5.
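The health event itself is easiest to picture as a small JSON document. The sketch below is illustrative only - every field name is invented, not NVSentinel's actual schema - but it shows the kind of record the pipeline passes along (it uses `jq`, already a demo prerequisite):

```shell
# Hypothetical health event as the pipeline might store it in MongoDB.
# All field names here are invented for illustration, not the real schema.
event='{
  "node": "nvsentinel-demo-worker",
  "source": "gpu-health-monitor",
  "check": "dcgm",
  "severity": "FATAL",
  "message": "Corrupt InfoROM detected on GPU 0"
}'

# A quarantine rule engine keys off fields like severity:
echo "$event" | jq -r '.severity'   # prints: FATAL
```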
This demo uses a minimal NVSentinel deployment with:
- KIND Cluster - 1 control plane + 1 worker node
- Fake DCGM - Simulates NVIDIA GPU monitoring (with NVML injection)
- GPU Health Monitor - Detects GPU errors from DCGM
- Platform Connectors - gRPC server for receiving health events
- Fault Quarantine - Rule engine that cordons nodes on fatal errors
- MongoDB - Event storage and change streams
┌───────────────────────────────────────────────────────┐
│ Your Laptop (KIND Cluster) │
│ │
│ ┌──────────────────────────┐ │
│ │ Worker Node │ │
│ │ │ │
│ │ * Fake DCGM (Injected) │ │
│ │ * GPU Health Monitor │ │
│ └──────────┬───────────────┘ │
│ │ gRPC │
│ ↓ │
│ ┌────────────────────────────────────────────────┐ │
│ │ NVSentinel Core │ │
│ │ │ │
│ │ ┌───────────────┐ ┌──────────────────┐ │ │
│ │ │ Platform │─────>│ MongoDB │ │ │
│ │ │ Connectors │ │ (Event Store) │ │ │
│ │ └───────────────┘ └────────┬─────────┘ │ │
│ │ │ │ │
│ │ Change │ │ │
│ │ Stream ↓ │ │
│ │ ┌──────────────────┐ │ │
│ │ │ Fault Quarantine │ │ │
│ │ │ (CEL Rules) │ │ │
│ │ └────────┬─────────┘ │ │
│ │ │ │ │
│ │ ↓ │ │
│ │ ┌──────────────────┐ │ │
│ │ │ Kubernetes API │ │ │
│ │ │ (Cordon Node) │ │ │
│ │ └──────────────────┘ │ │
│ └────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────┘
- Creates KIND cluster with 1 worker node (minimal config)
- Installs cert-manager (for TLS certificates)
- Installs NVSentinel with minimal components:
- Platform Connectors (event ingestion)
- MongoDB (3-node replica set for change streams - adds ~2-3 min)
- Simple Health Client (test tool)
- Fault Quarantine (auto-cordon)
- Waits for all pods to be ready (~5-6 minutes total)
Shows the healthy cluster:
- ✅ All nodes are `Ready` and `SchedulingEnabled`
- ✅ All NVSentinel pods are `Running`
- ✅ No health events in the database
Expected output:
```
$ kubectl get nodes
NAME                            STATUS   ROLES           AGE   VERSION
nvsentinel-demo-control-plane   Ready    control-plane   2m    v1.31.0
nvsentinel-demo-worker          Ready    <none>          2m    v1.31.0
```

Both nodes should be `Ready` with no scheduling restrictions.
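If you'd rather script this check than eyeball the table, `jq` (already a prerequisite) can count nodes that are both `Ready` and schedulable. The sketch below embeds sample data shaped like `kubectl get nodes -o json`; in the live demo you would pipe the real kubectl output instead:

```shell
# Sample data shaped like `kubectl get nodes -o json` (heavily abbreviated).
nodes='{
  "items": [
    {"metadata": {"name": "nvsentinel-demo-control-plane"},
     "spec": {},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "nvsentinel-demo-worker"},
     "spec": {},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}}
  ]
}'

# Count nodes that are Ready and not cordoned (spec.unschedulable unset).
echo "$nodes" | jq '[.items[]
  | select(.spec.unschedulable != true)
  | select(any(.status.conditions[]; .type == "Ready" and .status == "True"))
  ] | length'   # prints: 2
```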
Injects a GPU hardware fault into the fake DCGM service. The GPU Health Monitor detects this error automatically (just like it would with real GPU hardware) and NVSentinel quarantines the node.
How it works:
- We use `dcgmi test --inject` to simulate a corrupt InfoROM (a fatal GPU hardware fault)
- GPU Health Monitor polls DCGM every few seconds and detects the error
- NVSentinel automatically processes the event and cordons the node
The GPU Health Monitor detects this from DCGM and sends it via gRPC to Platform Connectors - exactly like production!
Demo magic: We use fake DCGM to simulate GPU faults without actual hardware - NVSentinel's detection and response are 100% authentic.
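The detect-then-report loop can be imitated in a few lines of plain shell. This is a toy sketch - a temp file stands in for the DCGM query, and the fault string is invented - but it mirrors the poll-and-report pattern the GPU Health Monitor follows:

```shell
# Toy poll loop. A temp file stands in for DCGM; the fault string is invented.
status_file=$(mktemp)
echo "healthy" > "$status_file"

poll_once() {
  status=$(cat "$status_file")
  if [ "$status" != "healthy" ]; then
    echo "FAULT detected: $status (would report via gRPC)"
    return 1
  fi
  echo "status: healthy"
}

poll_once                                 # first poll: healthy
echo "corrupt-inforom" > "$status_file"   # inject the simulated fault
poll_once || echo "node would now be cordoned"

rm -f "$status_file"
```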
Confirms the automated response:
- 🔒 Worker node shows as `SchedulingDisabled` (cordoned)
- 📊 Node condition shows the health event
- 🎯 Fault Quarantine logs show the rule evaluation
Expected output:
```
$ kubectl get nodes
NAME                            STATUS                     ROLES           AGE   VERSION
nvsentinel-demo-control-plane   Ready                      control-plane   12m   v1.31.0
nvsentinel-demo-worker          Ready,SchedulingDisabled   <none>          12m   v1.31.0
```

Notice the worker node now shows `SchedulingDisabled` - NVSentinel automatically cordoned it after detecting the GPU fault! 🎉
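Under the hood, a cordon is just `spec.unschedulable: true` on the Node object, so you can also verify it with `jq`. The sketch below uses sample data in place of the real `kubectl get node nvsentinel-demo-worker -o json` output:

```shell
# Sample node object shaped like `kubectl get node <name> -o json` (abbreviated).
node='{
  "metadata": {"name": "nvsentinel-demo-worker"},
  "spec": {"unschedulable": true}
}'

if [ "$(echo "$node" | jq '.spec.unschedulable == true')" = "true" ]; then
  echo "node is cordoned"   # prints: node is cordoned
fi
```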
Removes the KIND cluster and cleans up resources.
After completing this demo, you'll understand:
- Event-Driven Architecture - How health events flow through NVSentinel
- Kubernetes Integration - How NVSentinel interacts with Kubernetes API
- Rule-Based Quarantine - How CEL rules determine when to cordon nodes
- Production Readiness - What a minimal NVSentinel deployment looks like
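For a taste of the rule language, a CEL expression meaning "quarantine on fatal GPU events" might look like the fragment below. The field names (`event.severity`, `event.component`) are purely illustrative - check the Helm chart's rule configuration for the actual event schema:

```
event.severity == "FATAL" && event.component == "GPU"
```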
By default, the demo uses NVSentinel v0.6.0 (the latest published release). To use a different version:
```shell
# Use a specific version (replace vX.Y.Z with your desired version)
NVSENTINEL_VERSION=vX.Y.Z ./scripts/00-setup.sh

# Or set it for the entire demo
NVSENTINEL_VERSION=vX.Y.Z make demo
```

To test with local development code:
- Build and push images to a registry accessible from KIND
- Update the image tags in the Helm values file
- See DEVELOPMENT.md for details
The demo uses approximately 8-10GB of disk space:
- KIND cluster images: ~2GB
- Container images (NVSentinel, MongoDB, DCGM): ~6-8GB
To minimize disk usage:
```shell
# After completing the demo, clean up immediately
make cleanup

# Or manually clean Docker
docker system prune -a -f --volumes
```

If you're low on disk, note that the demo already creates a minimal deployment:
- 1 MongoDB instance (single-member replica set for change streams)
- 1 worker node (not 2+)
- Persistence disabled (no storage volumes)
- Only essential components enabled
```shell
# Check disk usage
df -h /

# Clean up Docker
docker system prune -a -f --volumes

# Delete old KIND clusters
kind get clusters
kind delete cluster --name <old-cluster-name>
```

```shell
# Clean up and retry
kind delete cluster --name nvsentinel-demo
./scripts/00-setup.sh
```

```shell
# Check pod status
kubectl get pods -n nvsentinel

# View logs
kubectl logs -n nvsentinel deployment/platform-connectors
kubectl logs -n nvsentinel deployment/simple-health-client
```

```shell
# Check fault-quarantine logs
kubectl logs -n nvsentinel deployment/fault-quarantine

# Verify event was received
kubectl get events -A | grep GPU
```

```shell
# Change the port used by simple-health-client
kubectl edit service -n nvsentinel simple-health-client
```

After trying this demo, explore more NVSentinel capabilities:
- Full Installation - Deploy on a real cluster with GPU nodes (Quick Start Guide)
- Production Configuration - Enable node drainer and fault remediation (Configuration Guide)
- Custom Rules - Write your own CEL rules for fault quarantine
- Scale Testing - Try the scale test suite
- Real GPU Monitoring - Connect to actual NVIDIA GPUs with DCGM
- NVSentinel README - Project overview and features
- Architecture Guide - Detailed system architecture
- Development Guide - Contributing and development setup
- Helm Chart Configuration - All configuration options
- NVIDIA GPU Error Codes (XIDs) - Reference for GPU error codes
Found an issue with this demo? Want to improve it? We welcome contributions!
- Check the Contributing Guide
- Open an issue or pull request
- Sign your commits with `git commit -s`
This demo is part of NVSentinel and is licensed under the Apache License 2.0.
Questions? Start a discussion or open an issue.
Enjoy the demo! 🎉