Official Helm chart for NVSentinel - GPU node resilience system for automated fault detection, quarantine, and remediation.
- Kubernetes: 1.25 or later
- Helm: 3.0 or later
- NVIDIA GPU Operator: Required for GPU monitoring (includes DCGM)
- cert-manager: Required for TLS certificate management (v1.19.0+)
- Persistent Storage: For MongoDB (default: 10GB per replica)
# Install cert-manager
helm repo add jetstack https://charts.jetstack.io --force-update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.19.1 \
--set installCRDs=true
# Install Prometheus (optional, for metrics collection)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace# Install with default configuration
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.0.0 \
--namespace nvsentinel \
--create-namespace
# Upgrade to a new version
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.0.0 \
--namespace nvsentinel
# Uninstall
helm uninstall nvsentinel --namespace nvsentinelThe NVSentinel Helm chart follows a modular architecture with independent subcharts:
nvsentinel/
├── Chart.yaml # Main chart metadata
├── values.yaml # Default configuration
├── values-full.yaml # Complete reference with all options
├── charts/ # Subcharts for each module
│ ├── platform-connectors/ # Core: gRPC server & event persistence
│ ├── gpu-health-monitor/ # Health: GPU monitoring via DCGM
│ ├── syslog-health-monitor/ # Health: System log analysis
│ ├── csp-health-monitor/ # Health: Cloud provider events
│ ├── fault-quarantine/ # Core: Node cordoning & tainting
│ ├── node-drainer/ # Core: Workload eviction
│ ├── fault-remediation/ # Core: Maintenance CRD creation
│ ├── janitor/ # Core: Cloud provider API integration
│ ├── health-events-analyzer/ # Analysis: Event pattern detection
│ ├── labeler/ # Utility: Node labeling
│ ├── mongodb-store/ # Storage: Event database
│ └── incluster-file-server/ # Utility: Log collection storage
└── templates/ # Main chart templates
values.yaml: Default configuration (monitoring only)values-full.yaml: Complete reference with all available options and detailed documentationvalues-tilt.yaml: Development configuration for Tiltvalues-tilt-arm64.yaml: ARM64-specific overrides
global:
# Image tag for all components
image:
tag: "main"
# Dry-run mode - log actions without executing them
dryRun: false
# Module enable/disable flags
gpuHealthMonitor:
enabled: true
syslogHealthMonitor:
enabled: true
faultQuarantine:
enabled: false
nodeDrainer:
enabled: false
faultRemediation:
enabled: false
janitor:
enabled: false
mongodbStore:
enabled: falseConfigure quarantine rules using CEL expressions:
fault-quarantine:
# Label prefix for node labels and annotations
labelPrefix: "k8saas.nvidia.com/"
# Circuit breaker to prevent mass quarantines
circuitBreaker:
enabled: true
percentage: 50 # Max % of nodes to quarantine
duration: "5m" # Cooldown period
# Quarantine rules
ruleSets:
- version: "1"
name: "GPU fatal error ruleset"
match:
all:
- kind: "HealthEvent"
expression: "event.agent == 'gpu-health-monitor' && event.componentClass == 'GPU' && event.isFatal == true"
- kind: "Node"
expression: |
!('k8saas.nvidia.com/ManagedByNVSentinel' in node.metadata.labels && node.metadata.labels['k8saas.nvidia.com/ManagedByNVSentinel'] == "false")
cordon:
shouldCordon: true
# Optional taint
# taint:
# key: "nvidia.com/gpu-error"
# value: "fatal"
# effect: "NoSchedule"Configure eviction strategies per namespace:
node-drainer:
# Eviction timeout in seconds
evictionTimeoutInSeconds: "60"
# System namespaces to skip
systemNamespaces: "^(nvsentinel|kube-system|gpu-operator|gmp-system|network-operator|skyhook)$"
# Force delete timeout (minutes)
deleteAfterTimeoutMinutes: 60
# NotReady pod timeout (minutes)
notReadyTimeoutMinutes: 5
# Per-namespace eviction strategies
userNamespaces:
- name: "*"
mode: "AllowCompletion" # Options: "Immediate", "AllowCompletion", "DeleteAfterTimeout"Configure maintenance resource creation and log collection:
fault-remediation:
# Maintenance resource configuration
maintenance:
apiGroup: "janitor.dgxc.nvidia.com"
version: "v1alpha1"
namespace: "nvsentinel"
resourceNames:
- "rebootnodes"
- "terminatenodes"
# Go template for maintenance resources
template: |
apiVersion: janitor.dgxc.nvidia.com/v1alpha1
kind: RebootNode
metadata:
name: maintenance-{{ .NodeName }}-{{ .HealthEventID }}
spec:
nodeName: {{ .NodeName }}
# Retry configuration for node updates
updateRetry:
maxRetries: 5
retryDelaySeconds: 10
# Log collection (optional)
logCollector:
enabled: false
image:
repository: ghcr.io/nvidia/nvsentinel/log-collector
pullPolicy: IfNotPresent
uploadURL: "http://nvsentinel-incluster-file-server.nvsentinel.svc.cluster.local/upload"
gpuOperatorNamespaces: "gpu-operator"
enableGcpSosCollection: false
enableAwsSosCollection: falseConfigure cloud provider integration:
janitor:
serviceAccount:
create: true
config:
timeout: "25m"
manualMode: false
httpPort: 8082
# Node exclusions
nodes:
exclusions: []
# Controller configuration
controllers:
rebootNode:
enabled: true
timeout: "25m"
terminateNode:
enabled: true
timeout: "25m"
# Cloud provider configuration
csp:
provider: "kind" # Options: kind, kwok, aws, gcp, azure, ocijanitor:
csp:
provider: "aws"
aws:
region: "us-east-1"
accountId: "123456789012"
iamRoleName: "nvsentinel-janitor-role"janitor:
csp:
provider: "gcp"
gcp:
project: "my-project"
zone: "us-central1-a"
serviceAccount: "nvsentinel-janitor"janitor:
csp:
provider: "azure"
azure:
subscriptionId: "12345678-1234-1234-1234-123456789012"
resourceGroup: ""
location: ""
clientId: "12345678-1234-1234-1234-123456789012"janitor:
csp:
provider: "oci"
oci:
region: "us-ashburn-1"
compartment: "ocid1.compartment.oc1..aaa..."
credentialsFile: ""
profile: "DEFAULT"
principalId: "ocid1.principal.oc1..aaa..."For detailed documentation of all available configuration options, see:
values-full.yaml: Complete reference with detailed comments and examplesvalues.yaml: Default values used by the chart
To view all options from the published chart:
helm show values oci://ghcr.io/nvidia/nvsentinel --version v1.0.0