Releases: NVIDIA/NVSentinel
Release v1.1.0
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v1.1.0
Release v1.0.0
NVSentinel v1.0.0 Release Notes
Status: Beta / Stable
With v1.0.0, NVSentinel moves from Experimental to Beta/Stable. We now recommend NVSentinel for production testing and use. The project continues to evolve rapidly and APIs may change between releases, but we follow semantic versioning going forward: breaking changes will increment the major version.
What's in v1.0.0
This release represents 13 prior releases and 400+ commits since the initial open-source launch in October 2025. The highlights below cover the full arc from v0.1.0 through v1.0.0.
GPU reset and remediation pipeline
NVSentinel now supports a complete GPU reset workflow as an alternative to full node reboot. The GPU health monitor detects reset-eligible errors, fault remediation creates GPUReset CRDs, and the janitor executes the reset. This reduces remediation time from minutes (reboot) to seconds (reset) for recoverable GPU faults. End-to-end remediation metrics track the full pipeline from fault detection through resolution.
Kubernetes object monitor
A new policy-based health monitor that watches any Kubernetes resource and evaluates CEL expressions to generate health events. This enables monitoring of custom resources, operator status, and application-level health signals without writing code.
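To make the idea concrete, a policy for this monitor could look roughly like the sketch below. The exact schema is an assumption for illustration only (field names like `policies`, `resource`, and `event` are hypothetical; consult the chart documentation for the real format) — only the CEL-expression-over-watched-resource concept comes from the release notes:

```yaml
# Hypothetical kubernetes-object-monitor policy (schema is illustrative).
# Watches a custom resource and raises a fatal health event when its
# status reports a Degraded condition, evaluated via a CEL expression.
policies:
  - name: operator-degraded
    resource:
      group: example.com      # illustrative group/version/kind
      version: v1
      kind: MyOperator
    expression: >-
      object.status.conditions.exists(c,
        c.type == 'Degraded' && c.status == 'True')
    event:
      severity: Fatal
      message: "Operator reports Degraded condition"
```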
Event exporter
Health events can now be streamed to external systems in CloudEvents format. This enables integration with existing observability platforms and data pipelines.
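For reference, a CloudEvents 1.0 envelope carries the required `specversion`, `id`, `source`, and `type` attributes plus an optional payload. The attribute names below follow the CloudEvents spec; the `type`/`source` values and the payload fields are illustrative, not the exporter's actual output:

```yaml
# Shape of a CloudEvents 1.0 envelope (required attributes per the spec);
# concrete values and the data payload are illustrative only.
specversion: "1.0"
type: com.nvidia.nvsentinel.healthevent   # illustrative event type
source: /nvsentinel/event-exporter        # illustrative source URI
id: 9f3b6d2a-0000-0000-0000-000000000000  # unique per event
time: "2025-10-01T12:00:00Z"
data:
  node: worker-42
  severity: Fatal
```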
Preflight checks
A new preflight framework validates cluster readiness before GPU workloads are scheduled. Includes DCGM diagnostics and NCCL loopback/all-reduce tests to catch hardware issues before they affect production jobs.
Slurm drain monitor
A new health monitor for hybrid Kubernetes/Slurm environments. Monitors Slurm drain state and generates health events when nodes are drained by the Slurm scheduler, enabling NVSentinel to coordinate remediation across both schedulers.
Metadata collector
Automatically gathers GPU and NVSwitch topology information and enriches health events with hardware context. Integrated with both GPU and syslog health monitors.
PostgreSQL backend
MongoDB is no longer the only storage option. PostgreSQL is now supported as an alternative database backend, with LISTEN/NOTIFY change streams for real-time event processing.
Slinky (NVIDIA DPU) drain support
Custom drain integration for Slinky-managed nodes, including parallel drain handling and proper annotation coordination.
NVLink and XID workflow improvements
Dedicated workflows for NVLink failures (XID 13, 31, 154) with GPU-topology-aware fault classification. The syslog health monitor now includes driver-version-dependent parsing for NVL5 decoding rules.
Cloud provider improvements
- Bare-metal reboot support via sudo in janitor
- Generic CSP plugin with reboot capability
- Configurable IAM role names for EKS
- OCI, Azure, GCP, and AWS all supported with provider-specific janitor configurations
Operational improvements
- Circuit breaker prevents mass quarantines during cluster-wide events
- Audit logging for all NVSentinel write operations
- Breakfix cancellation via manual uncordon
- Partial drain support in node drainer (per-namespace eviction strategies)
- Custom drain modes with parallel drain handling
- Log collection for diagnostic reports, including AWS SOS and GCP SOS report collection
- Optional TLS for MongoDB connections
Build and security
- All container images built with ko and attested with SLSA build provenance
- SPDX SBOM attestation on every image
- Daily vulnerability scanning
- Supply chain verification via Sigstore Policy Controller
Upgrading
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.0.0 \
  --namespace nvsentinel

NVSentinel includes a pre-upgrade hook that cleans up deprecated node conditions automatically. Review the Helm Chart Configuration Guide for new configuration options.
What's next
See the Roadmap and Project Board for planned work toward General Availability.
Release v0.10.1
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.10.1
Release v0.10.0
This release introduces multi-node NCCL all-reduce preflight testing across all major cloud fabrics, concurrent event exporting for large-scale clusters, Slinky (Slurm-on-Kubernetes) drainer improvements, DCGM 4.4.2 compatibility with destructive XID detection, breakfix response time metrics, and significant fault management reliability fixes.
Major New Features
Multi-Node NCCL All-Reduce Preflight Tests
- Cross-Node GPU Interconnect Validation (#837): A mutating webhook injects an init container that runs a multi-node NCCL all-reduce bandwidth benchmark across all gang members before the workload starts. Validates GPU interconnect health across InfiniBand, EFA, TCPXO, and MNNVL fabrics.
- New `preflight-nccl-allreduce` container image (PyTorch + torchrun)
- `networkFabric` Helm selector with data-driven fabric profiles (`ib`, `efa`, `mnnvl-efa`, `tcpxo`)
- DRA resource claim mirroring for GB200 MNNVL/IMEX
- `extraVolumeMounts` for GCP TCPXO plugin injection
- Auto-created NCCL topology ConfigMap for Azure IB
- Tested on A100 (Azure IB), H100 (AWS EFA, GCP TCPXO), and GB200 (AWS MNNVL)
Concurrent Event Exporter
- Worker Pool for Event Publishing (#906): Event exporter now supports concurrent publishing via a `--workers` flag. On a 1,100-node production cluster, sequential publishing (~3.3 events/sec) fell behind the event production rate (~10 events/sec), causing MongoDB oplog rotation and an unrecoverable `ChangeStreamHistoryLost` loop — leaving health events unexported for 4+ days. The new worker pool with sequence-tracked resume tokens provides at-least-once delivery with no event loss. At 10 workers, throughput reaches ~33 events/sec (supporting ~3,300 nodes).
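The `--workers` flag itself is documented above; how it is wired into a deployment is not, so the snippet below is only a plausible sketch (the container name and args placement are assumptions):

```yaml
# Illustrative only: passing the documented --workers flag to the
# event-exporter container. Container name and manifest shape are assumed.
containers:
  - name: event-exporter
    args:
      - --workers=10   # ~33 events/sec per the release notes, ~3,300 nodes
```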
DCGM 4.4.2 Compatibility & Destructive XID Detection
- DCGM_HEALTH_WATCH_ALL Support (#905): Upgraded gpu-health-monitor, preflight dcgm-diag, and fake-dcgm to DCGM 4.4.2. Previously, `DCGM_HEALTH_WATCH_ALL` incidents (used by DCGM to report destructive XIDs like XID 95) were silently excluded, causing a `KeyError` crash in the health check loop and leaving GPU failures undetected. The fix removes the exclusion filter, adds safe `.get()` fallbacks for unknown systems/error codes, and maps `DCGM_HEALTH_WATCH_ALL` to Fatal severity. Backward compatible with DCGM 4.2.x.
Breakfix Response Time Metrics
- End-to-End Remediation Latency Tracking (#714): New histogram metrics across the remediation pipeline to answer critical operational questions:
  - `fault_remediation_cr_generate_duration_seconds` — Mean time for CR creation in fault-remediation
  - `fault_quarantine_node_quarantine_duration_seconds` — Mean time to quarantine a node
  - `node_drainer_pod_eviction_duration_seconds` — Mean time waiting for user workloads to complete
  - Janitor remediation duration metrics — Mean time to remediate
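Since these are Prometheus histograms, they expose the standard `_bucket` series, so latency percentiles can be derived with `histogram_quantile`. The metric name comes from the release notes; the recording rule below is just an example query, not shipped configuration:

```yaml
# Illustrative Prometheus recording rule over the new breakfix metrics.
# The _bucket suffix is standard Prometheus histogram convention.
groups:
  - name: nvsentinel-breakfix
    rules:
      - record: nvsentinel:node_quarantine_p95_seconds
        expr: >-
          histogram_quantile(0.95,
            sum(rate(fault_quarantine_node_quarantine_duration_seconds_bucket[5m])) by (le))
```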
Slinky Drainer Improvements
The Slinky (Slurm-on-Kubernetes) drainer received multiple improvements for production reliability:
- Annotation Handling (#909): Slinky drainer now only adds drain reason annotations if none already exist, and cleans up NVSentinel-owned annotations (prefixed with [J] [NVSentinel]) upon drain completion. Includes envtest-based tests for the full drain lifecycle.
- Wait for Fully Drained State (#919): Fixed a critical bug where the drainer deleted pods in `DRAINING` state (drain accepted but jobs still running) instead of waiting for `DRAINED` state (all jobs complete). Now mirrors the Slinky operator's own `IsNodeDrained()` logic by checking busy-state conditions (`Allocated`, `Mixed`, `Completing`).
- Wait Only for Ready Pods (#916): Slinky drainer now correctly waits only for pods in Ready state, avoiding stalls on non-ready pods that will never drain.
- CI Pipeline Integration (#885): Slinky drain tests are now included in the GitHub CI pipeline.
Configuration & Cloud Provider
- Configurable IAM Role for EKS (#877): The IAM role name used by the CSP health monitor for EKS is now configurable via Helm values (`iamRoleName`), supporting environments with custom IAM role naming.
- Terminate Node Template (#894): Added a template for creating `TerminateNode` CRs in fault-remediation values, enabling `REPLACE_VM` remediation actions for node replacement workflows.
Bug Fixes & Reliability
- FRM Multiple Reconciliation Fix (#897): Fixed multiple issues in fault-remediation: duplicate event reconciliation from concurrent status updates, missing `nvsentinel-state` label for taint-only configurations, duplicate fields in `userPodsEvictionStatus`, and missing `lastRemediationTimestamp` updates in Postgres queries.
- Quarantine Metric Accuracy (#759): Fixed the `fault_quarantine_current_quarantined_nodes` metric reporting inflated values. Root cause: manual taint removal wasn't triggering the unquarantine flow and annotation cleanup. Also fixed cases where the `quarantineHealthEvent` annotation had empty values alongside `quarantinedNodeUncordonedManually`.
- GPU Reset RuntimeClassName (#887): Set `RuntimeClassName` to `nvidia` in GPU reset pods, ensuring proper GPU access during reset operations.
- GPU Reset UAT Improvements (#879, #892): Wait for the GPUReset CRD (instead of checking syslog) in UAT tests for more reliable validation; fixed an uninitialized variable.
- Unquarantine Timeout (#876): Increased unquarantine timeout from default to 5 minutes to prevent premature timeout failures.
- Nolint Directive Cleanup (#832, #884): Continued cleanup of TODO-marked `nolint` directives (Parts 3 & 4).
Dependency Updates
- Bumped `github.com/aws/aws-sdk-go-v2/config` from 1.32.7 to 1.32.9 (#902)
- Bumped `google.golang.org/api` from 0.266.0 to 0.267.0 (#903)
- Bumped `aquasecurity/trivy-action` from 0.33.1 to 0.34.0 (#839)
- Multiple dependency updates via dependabot (#875, #904)
Acknowledgments
This release includes contributions from:
- @XRFXLP
- @natherz97
- @tanishagoyal2
- @deesharma24
- @cbump
- @KaivalyaMDabhadkar
- @faganihajizada
- @nitz2407
- @neerajnv
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.10.0 \
--namespace nvsentinel \
  --create-namespace

To upgrade from v0.9.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.10.0 \
--namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.9.0
This release delivers end-to-end GPU reset support as a first-class remediation action, major expansions to the preflight check framework (DCGM diagnostics, NCCL loopback tests, gang discovery), enhanced Kubernetes operator health monitoring, and significant performance and reliability improvements across the platform.
Major New Features
End-to-End GPU Reset
GPU reset is now a fully integrated remediation path in NVSentinel. Building on the foundational work in v0.8.0, this release completes the pipeline:
- GPU Reset Controller in Janitor (#797): New controller that consumes `GPUReset` CRDs and orchestrates the full reset lifecycle — tearing down GPU Operator components, executing the reset via nvidia-smi, and restoring services.
- GPU Reset Container Image (#788): Dedicated `gpu-reset` container image used by Janitor's reset jobs to perform the actual GPU reset on target nodes.
- E2E and UAT Test Coverage (#768): Enables GPU reset across fault-remediation (mapping `COMPONENT_RESET` to `GPUReset`), node-drainer (partial drain for GPU-scoped events), and health monitors (fallback to `RESTART_VM` when UUID discovery fails). Includes comprehensive end-to-end and UAT tests validating the full reset workflow.
This provides a lightweight recovery mechanism that resolves many GPU issues without full node reboots — resetting only the affected GPU while keeping healthy workloads running via partial drain.
Preflight Check Framework Expansion
The preflight check framework introduced in v0.8.0 now includes real diagnostic capabilities:
- DCGM Diagnostics (#772): Runs DCGM diagnostic tests as preflight checks, discovering allocated GPUs via gonvml and executing diagnostics via pydcgm. Reports per-GPU, per-test health events (fatal for failures, non-fatal for warnings, healthy for passes).
- NCCL Loopback Tests (#808): Validates intra-node GPU interconnect health by running NCCL all-reduce loopback tests. Detects degraded PCIe/NVLink bandwidth — tested across A100, H100, and GB200/GB300 hardware.
- Gang Discovery (#818): Discovers pods belonging to the same scheduling group as a prerequisite for multi-node NCCL tests. Supports both native Kubernetes Workload API (1.35+) and PodGroup-based schedulers (Volcano, etc.) with config-driven CRD resolution. Coordinates peer discovery via ConfigMap injection at admission time.
Kubernetes Operator Health Monitoring
- GPU & Network Operator Pod Monitoring (#751): The kubernetes-object-monitor now tracks DaemonSet pod health in the `gpu-operator` and `network-operator` namespaces. Detects pods that fail to reach Running state within a configurable timeout and publishes fatal health events. Automatically publishes healthy events when pods recover.
Performance & Observability
Histogram Bucket Cardinality Reduction
- 96% Series Reduction (#799): Replaced linear histogram buckets (500 buckets) with exponential buckets (12 buckets) in platform-connector metrics. Eliminates ~500K metric series cluster-wide, resolving Prometheus remote write bottlenecks and significantly reducing memory usage.
Configurable Network Policy
- Optional Metrics Network Policy (#789): The `metrics-access` network policy can now be disabled via `networkPolicy.enabled: false`. Resolves conflicts when NVSentinel shares a namespace with services like cert-manager that require ingress on non-metrics ports.
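The `networkPolicy.enabled` key comes straight from the release notes; a minimal values override to disable the policy would look like:

```yaml
# values.yaml override: disable the metrics-access network policy
# (key path per the v0.9.0 release notes).
networkPolicy:
  enabled: false
```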
Bug Fixes & Reliability
- Nolint Directive Cleanup (#828, #831): Cleaned up `nolint` directives previously marked as TODO across the codebase, improving lint compliance and code quality.
- E2E Test Retry for InfoROM Errors (#834): Added retry logic when injecting InfoROM errors in E2E tests, improving test reliability.
- Demo Script Fix (#809): Fixed demo script to display correct node conditions.
- SBOM Generation Disk Space (#817, #827): Added disk cleanup logic before SBOM generation in the publish container CI job, preventing build failures due to insufficient disk space.
- CUDA Image Source (#792): Switched to CUDA images from NVCR to avoid Docker Hub rate limits in CI.
Build & Infrastructure
- Overrideable Module Names (#816): Component Makefiles can now override the Go module name, improving build flexibility.
- Mixed Eviction Scale Tests (#830): Added scale test results for mixed eviction modes (Immediate, AllowCompletion, DeleteAfterTimeout) on a 1500-node cluster, validating correct behavior at 10%, 25%, and 50% cluster scale.
- Copy-PR-Bot Config (#805): Added username to copy-pr-bot configuration.
Documentation
- K8s Data Store Design Doc (#787): Design document for introducing a Kubernetes-native data store for health events, reducing dependency on MongoDB.
Dependency Updates
- Bumped protobuf from 6.33.4 to 6.33.5 in gpu-health-monitor (#769)
- Multiple dependency updates via dependabot (#803, #806, #829)
Acknowledgments
This release includes contributions from:
- @natherz97
- @XRFXLP
- @deesharma24
- @tanishagoyal2
- @ksaur
- @jtschelling
- @cbumb
- @yuanchen8911
- @yavinash007
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.9.0 \
--namespace nvsentinel \
  --create-namespace

To upgrade from v0.8.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.9.0 \
--namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.8.0
This release introduces distributed node locking for safer concurrent operations, partial drain support for GPU-level remediation, GPU reset capabilities across multiple components, and enhanced event handling strategies. We've also begun implementing preflight checks, improved cloud provider support, and made significant reliability improvements across the platform.
🎯 Major New Features
Distributed Node Locking
NVSentinel now includes distributed node locking to prevent concurrent maintenance operations on the same node. This critical safety feature ensures that multiple remediation workflows don't interfere with each other, preventing race conditions and ensuring predictable behavior when multiple components need to perform maintenance operations simultaneously.
Partial Drain Support
The node-drainer now supports partial drain operations, enabling GPU-level remediation without draining the entire node. This significantly reduces the blast radius of remediation actions, allowing healthy workloads to continue running while only affected GPUs are serviced. This feature is particularly valuable in large-scale clusters where preserving workload availability is critical.
Comprehensive GPU Reset Support
GPU reset functionality has been expanded across multiple components:
- GPU Health Monitor: Native GPU reset support for DCGM-detected issues
- Fault Remediation: Integrated GPU reset as a remediation action
- Syslog Health Monitor: GPU reset support for syslog-detected faults
This provides a lightweight, fast recovery mechanism that can resolve many GPU issues without requiring full node reboots, dramatically reducing recovery times and improving cluster availability.
Preflight Check Framework
Added a preflight check scaffold and comprehensive design documentation for pre-job validation. This new framework enables operators to validate cluster state and prerequisites before a job starts executing, reducing the likelihood of failed jobs.
Event Handling Strategy Enhancements
Expanded event handling strategy support to additional components:
- CSP Health Monitor: Event handling strategy configuration for cloud provider events
- Kubernetes Object Monitor: Event handling strategy for Kubernetes resource events
This provides consistent, fine-grained control over event processing across all health monitoring components.
🔧 Configuration & Control Improvements
Custom Certificate Secrets
Helm charts now support custom certificate secrets, providing flexibility for organizations with existing certificate management infrastructure and specific security requirements. This enables seamless integration with enterprise PKI systems and certificate management workflows.
MongoDB Client Tracking
Added support for passing application names to MongoDB connections, enabling better client tracking and operational visibility. This helps operators understand which NVSentinel components are generating database load and simplifies troubleshooting of database performance issues.
ArgoCD Integration
Added checksum and sync-wave annotations for ArgoCD ConfigMap restarts, ensuring proper sequencing and change detection in GitOps workflows. This improves reliability when deploying NVSentinel via ArgoCD and prevents configuration drift issues.
🐛 Bug Fixes & Reliability Improvements
GPU Health Monitor Event Cache
Fixed critical race condition where the GPU health monitor event cache was updated before health events were successfully sent to platform-connector. This ensures events are not lost during transient connectivity issues and improves overall event delivery reliability.
Labeler Improvements
- Stale Label Removal: Labeler now properly removes stale labels that no longer apply to nodes
- Flaky Test Fixes: Resolved flaky labeler tests that were causing intermittent CI failures
Circuit Breaker Fixes
- Cursor Mode: Fixed cursor mode handling in circuit breaker reset mechanism
- Runbook Updates: Enhanced circuit breaker runbook with better operational guidance
Event Filtering
Fixed filtering logic in health-events-analyzer queries, ensuring events are properly matched against configured rules and improving detection accuracy.
MongoDB Authentication
Corrected authentication mechanism in MongoDB metrics URL, resolving connection issues in secured MongoDB deployments.
Fault Remediation Business Logic
Improved fault remediation to properly use controller-runtime business logic, enhancing reliability and consistency with Kubernetes controller patterns.
Nebius Cloud Reboot Handling
Fixed SendRebootSignal in Nebius provider to wait for instance stop completion before proceeding, preventing race conditions and ensuring reliable node reboots in Nebius Cloud environments.
Health Events Analyzer Test Fixes
Resolved test failures in health-events-analyzer that were causing CI pipeline issues.
🏗️ Architecture & Performance
Driver Version Dependent Parsing
Added driver version dependent parsing of NVL5 decoding rules, ensuring correct interpretation of NVLink errors across different driver versions. This improves accuracy of NVLink fault detection and reduces false positives.
🧪 Testing & Quality Improvements
UAT Test Reliability
- Improved UAT tests reliability with better error handling and retry logic
- Enhanced test configurability for different cluster environments
- Better cleanup and resource management in test environments
Documentation Improvements
- Added comprehensive preflight check design documentation
- New alert runbook for improved operational guidance
- Fixed typos in runbook documentation
Dependency Updates
- Bumped google.golang.org/api from 0.259.0 to 0.260.0 in csp-health-monitor
- Multiple security updates and dependency version bumps via dependabot
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @lalitadithya
- @natherz97
- @jtschelling
- @XRFXLP
- @tanishagoyal2
- @deesharma24
- @KaivalyaMDabhadkar
- @ksaur
- @c-fteixeira
- @miguelvramos92
- @ivelichkovich
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See the `/docs` directory in the repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Partial drain support requires proper GPU workload identification
- Distributed locking requires coordination between components
- Preflight checks are in early stages and should be thoroughly tested
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.8.0 \
--namespace nvsentinel \
  --create-namespace

To upgrade from v0.7.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.8.0 \
--namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.7.1
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.7.1
Release v0.7.0
This release introduces advanced event processing strategies, gRPC-based remediation services, enhanced templating capabilities, and improved support for cloud service providers. We've also made significant reliability enhancements across the platform.
🎯 Major New Features
Event Processing Strategies
NVSentinel now supports configurable event processing strategies across health monitors and analyzers. The new `processingStrategy` field in health events allows fine-grained control over how events are handled, enabling operators to customize event processing behavior based on specific operational requirements. This feature has been implemented in:
- GPU Health Monitor
- Syslog Health Monitor
- Health Events Analyzer
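Only the `processingStrategy` field name appears in these notes; where it sits in a monitor's configuration and which values it accepts are not specified, so the fragment below is a purely hypothetical shape:

```yaml
# Hypothetical: attaching a processingStrategy to a health monitor's
# configuration. Field nesting and the strategy value are assumptions;
# only the processingStrategy field name comes from the release notes.
gpuHealthMonitor:
  processingStrategy: default   # placeholder value, not a documented option
```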
gRPC Remediation Service
Added a new gRPC-based remediation service that enables programmatic fault remediation operations. This provides a powerful API for external systems to integrate with NVSentinel's remediation capabilities, supporting advanced automation workflows and custom orchestration scenarios.
Enhanced Templating Support
Multi-template support in fault remediation allows using multiple notification templates for different channels and audiences. Additionally, all fields in health events can now be used for templating, providing complete flexibility in crafting notifications and alerts.
Nebius Cloud Support
Added comprehensive support for Nebius Cloud (MK8s) CSP, including environment variable and secret support for node reboot operations. This expands NVSentinel's multi-cloud capabilities to include another major cloud provider.
🔧 Configuration & Control Improvements
Pod GPU Device Allocation Tracking
The metadata collector now tracks pod GPU device allocation, providing visibility into which pods are using which GPUs. This enables more informed remediation decisions and better troubleshooting of GPU-related issues.
Syslog Runtime Journal Support
Added runtime journal support to the syslog health monitor, enabling direct integration with systemd journal for more efficient log collection and processing.
PodMonitor Configuration
Helm charts now support making PodMonitor optional and configurable, providing flexibility for environments with different monitoring setups and requirements.
🐛 Bug Fixes & Reliability Improvements
MongoDB Retry Logic
Implemented retry mechanism for MongoDB write failures, improving resilience against transient database connectivity issues and ensuring health events are not lost during temporary network problems.
Janitor Reconciliation Simplification
Simplified janitor reconciliation loops for better reliability and maintainability. The refactored logic reduces complexity and improves predictability of cleanup operations.
XID 154 Case Handling
Fixed a critical case statement issue in XID 154 handling that could cause incorrect error processing. This ensures GPU recovery action changes are properly detected and remediated.
GpuNvlinkWatch Message Parsing
Fixed message parsing logic collision in GpuNvlinkWatch that was causing stale node conditions. This resolves false positives and improves the accuracy of NVLink health monitoring.
Missing RESTART_VM Action
Added the missing RESTART_VM remediation action to fault remediation configurations, ensuring all supported remediation actions are properly exposed in the configuration.
Default CSP Provider Host
Fixed missing default cspProviderHost value that was causing configuration issues in certain deployment scenarios.
Apple Silicon Demo Support
Added ARM64 (Apple Silicon) support for local fault injection demos, improving the developer experience on macOS machines with Apple Silicon processors.
GpuNvlinkWatch Stale Conditions
Resolved issue where GpuNvlinkWatch could report stale node conditions due to message parsing logic collision, improving monitoring accuracy.
🏗️ Architecture & Performance
Certificate Hot-Reloading
Added option to enable automatic certificate hot-reloading, allowing certificate updates without service restarts. The certificate watcher is now non-blocking, improving overall system responsiveness.
MongoDB Query Metrics
Added comprehensive MongoDB query metrics for better observability of database operations, enabling performance analysis and optimization of data access patterns.
Enhanced Logging
Improved logging configuration across multiple components for better consistency and observability.
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @lalitadithya
- @tanishagoyal2
- @jtschelling
- @jamyu
- @Sukets
- @KaivalyaMDabhadkar
- @natherz97
- @XRFXLP
- @oseeniraj
- @aireet
- @xuegangjie
- @miguelvramos92
- @deesharma24
- @zydee3
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See the `/docs` directory in the repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- gRPC remediation service requires proper network configuration and authentication
- Event processing strategies should be carefully tested before production deployment
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.7.0 \
--namespace nvsentinel \
  --create-namespace

To upgrade from v0.6.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.7.0 \
--namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.6.0
This release brings NVLink fault detection and remediation, enhanced security with certificate rotation support, and fine-grained control over health event processing. We've also started the migration of fault remediation to controller-runtime for better scalability and made significant reliability improvements across the platform.
🎯 Major New Features
NVLink XID 74 Workflow
NVSentinel now includes automated detection and remediation for XID 74 errors in the health events analyzer. XID 74 indicates NVLink hardware faults that can disrupt GPU-to-GPU communication. The workflow detects these errors and executes appropriate remediation actions to restore cluster health.
Certificate Rotation Support
The store-client module now supports automatic certificate rotation. This enables zero-downtime certificate updates in production environments, addressing security compliance requirements and operational best practices for long-running deployments.
Selective Health Event Analyzer Rules
Operators can now selectively enable or disable specific health events analyzer rules based on operational needs. This provides granular control over which error patterns trigger detection and remediation, allowing customization for different cluster configurations and workload requirements.
Health Event Property Overrides
Added capability to override specific fields in health events. This allows customization of NVSentinel's behavior to match specific operational requirements and policies.
Controller-Runtime for Fault Remediation
Migrated the fault remediation module from custom reconciliation code to the controller-runtime framework. This brings improved scalability, better resource efficiency, standardized controller patterns, and easier maintainability.
🔧 Configuration & Control Improvements
Circuit Breaker Fresh Start
Added option to reset fault quarantine state via circuit breaker ConfigMap. This enables controlled recovery scenarios where operators need to clear historical state and restart with a clean slate.
🐛 Bug Fixes & Reliability Improvements
Node Drainer Priority Handling
Fixed node drainer to ensure delete-after-timeout properly takes priority over allow-completion setting. This ensures nodes are drained within configured timeout windows even when pods don't terminate gracefully, preventing stuck drain operations.
Event Exporter Logging
Fixed an issue where the event exporter did not use the standard logging configuration.
PostgreSQL Test Stability
Fixed flaky PostgreSQL tests that were causing intermittent CI failures.
Node Condition Message Formatting
Improved the formatting of truncated node condition messages to ensure readability when messages are trimmed to fit within Kubernetes API limits.
UAT Environment Management
Improved AWS UAT environment deletion handling to prevent resource leaks and reduce costs from orphaned test infrastructure.
🏗️ Architecture & Performance
Enhanced GPU Health Monitor Logging
Unified the GPU health monitor logging format with other NVSentinel components. This provides consistent log structure across the platform, simplifying log aggregation and analysis.
Code Quality Improvements
Optimized import ordering and code organization across the codebase for better readability and maintainability.
🧪 Testing & Quality Improvements
Scale Testing Validation
Added scale tests for concurrent drain operations, validated on 1500-node clusters. These tests ensure NVSentinel maintains its performance and reliability characteristics at large scale.
Test Reliability Improvements
- Fixed flaky syslog XID monitoring UAT tests
- Resolved CSP health monitor test timeout issues
CI/CD Enhancements
- Set explicit Go version during CI dependency installation for reproducible builds
- Improved the Tilt installation process by using a temporary directory
- Added GitHub Action to automatically clean up old untagged container images
- Enhanced fork repository handling to prevent unnecessary workflow triggers
📚 Documentation Improvements
- Enhanced Kubernetes object monitor architecture diagrams
- Updated Slinky drain demo documentation
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @tanishagoyal2
- @XRFXLP
- @miguelvr
- @deesharma24
- @KaivalyaMDabhadkar
- @rupalis-nv
- @ksaur
- @mchmarny
- @lalitadithya
- @dims
- @yafengio
- @ivelichkovich
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See the /docs directory in the repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Audit logging is disabled by default - enable explicitly when needed
- Certificate rotation requires proper certificate management infrastructure
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.6.0 \
--namespace nvsentinel \
--create-namespace

To upgrade from v0.5.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.6.0 \
--namespace nvsentinel \
--reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.5.0
Release v0.5.0
This release focuses on extensibility, production hardening, and operational flexibility. We've added support for custom drain handlers, PostgreSQL as an alternative database backend, comprehensive audit logging, and expanded our XID detection and remediation capabilities.
🎯 Major New Features
Custom Drain Extensibility
NVSentinel now supports custom drain handlers, allowing integration with specialized workload orchestrators. This feature enables organizations running HPC schedulers like Slinky, big data frameworks like Volcano, or ML platforms like Ray to integrate their custom drain logic seamlessly. The release includes a complete demo environment showcasing custom drain integration.
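For intuition, a custom drain integration might look roughly like the following Go sketch. The interface name and method signature here are assumptions for illustration only; consult the repository for the real drain handler contract.

```go
package main

import "fmt"

// DrainHandler is a hypothetical shape for a custom drain integration;
// the actual NVSentinel interface may differ.
type DrainHandler interface {
	// Drain is called when NVSentinel decides a node must be drained.
	Drain(nodeName string) error
}

// SlurmDrainHandler sketches delegating the drain to an external
// scheduler (e.g. Slurm via Slinky) instead of evicting pods directly.
type SlurmDrainHandler struct{}

func (s SlurmDrainHandler) Drain(nodeName string) error {
	// A real handler would invoke the scheduler here, e.g.
	// `scontrol update nodename=<node> state=drain`.
	fmt.Printf("draining %s via slurm\n", nodeName)
	return nil
}

func main() {
	var h DrainHandler = SlurmDrainHandler{}
	if err := h.Drain("gpu-node-01"); err != nil {
		fmt.Println("drain failed:", err)
	}
}
```

The point of the interface is that NVSentinel's quarantine pipeline stays unchanged while the eviction mechanics become pluggable per orchestrator.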
PostgreSQL Database Backend
Added PostgreSQL as a production-grade alternative to MongoDB, providing more flexibility in database selection. This addresses licensing concerns, operational preferences, and allows better alignment with existing infrastructure standards.
Note: PostgreSQL support is experimental and is not recommended for production clusters.
Audit Logging
Comprehensive audit logging for all NVSentinel write operations enables compliance reporting, security analysis, and operational troubleshooting. Every mutation is tracked with context about what changed, when, and by which component. The structured audit logs support configurable retention and rotation, with formats ready for integration with SIEM systems.
🔧 Enhanced Fault Detection & Remediation
XID 13 & XID 31 Workflow Implementation
Added automated workflows for handling the critical GPU error conditions XID 13 and XID 31. These workflows help catch GPU degradation early.
XID 154 Support
Added support for detecting and handling XID 154 (GPU Recovery Action Changed) events.
Pre-Installed Driver Support
Enhanced support for environments where the GPU driver is installed outside of the GPU Operator.
🏗️ Infrastructure & Architecture Improvements
ko-based Kubernetes Object Monitor
Migrated the Kubernetes object monitor to ko-based builds, resulting in faster build times for development iterations, smaller container images with reduced attack surface, and improved supply chain security with minimal base images.
Enhanced Build System
Version field is now properly passed from build args to Dockerfile for accurate version reporting, improving reproducibility and traceability in logs.
🐛 Bug Fixes & Reliability Improvements
Node Condition Message Limiting
Node condition messages are now automatically truncated to 1024 bytes to prevent Kubernetes API server issues with excessively large messages. This prevents edge cases where verbose error descriptions could cause API errors.
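The truncation behavior described above can be sketched as follows. This is a minimal illustration assuming a trailing "..." marker, not NVSentinel's actual implementation.

```go
package main

import "fmt"

// maxConditionMessageBytes mirrors the 1024-byte limit described above.
const maxConditionMessageBytes = 1024

// truncateMessage trims msg to at most limit bytes, appending "..." so a
// reader can tell the message was cut. It trims on a byte boundary for
// simplicity; a production version would also avoid splitting UTF-8 runes.
func truncateMessage(msg string, limit int) string {
	if len(msg) <= limit {
		return msg
	}
	const suffix = "..."
	return msg[:limit-len(suffix)] + suffix
}

func main() {
	long := make([]byte, 2000)
	for i := range long {
		long[i] = 'x'
	}
	out := truncateMessage(string(long), maxConditionMessageBytes)
	fmt.Println(len(out)) // prints 1024
}
```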
Quarantine Override Handling
Quarantine overrides are now properly applied to nodes that are already in quarantined state, ensuring manual overrides work consistently regardless of node state.
Data Model Type Safety
The recommended action type changed from integer to string for better API clarity, type safety, and human readability in configurations.
Data Model Consistency
Replaced IGNORE with NONE throughout the data model for consistency with the canonical data schema.
Log Collector Concurrency
Improved handling of must-gather toggle and concurrent log collector job scenarios to prevent resource conflicts and ensure reliable log collection.
🧪 Testing & Quality Improvements
Enhanced Tilt Testing
Added comprehensive Tilt tests for the CSP health monitor with deterministic behavior that avoids sleep-based timing, giving developers faster and more reliable tests.
Scale Testing Framework
New performance and scale tests to validate NVSentinel behavior under load:
- FQM Latency & Queue Depth: Tests for fault quarantine module performance characteristics
- API Server & MongoDB Performance: Validation of data layer performance at scale
Log Collector Tilt Tests
Added automated tilt tests for the log collector module, improving test coverage for critical troubleshooting workflows.
📚 Documentation Improvements
Operational Documentation
- Datastore architecture and migration documentation
- Comprehensive configuration reference
- Feature documentation and user guides
- Runbooks for common operational scenarios
- Upgrade procedures and best practices
- IAM setup guide for CSP health monitor
- Documentation for pre-installed GPU driver support
🔄 Dependencies & Maintenance
Security Updates
- Upgraded Go modules to address CVEs in dependencies
- Bumped various dependencies to latest stable versions
CI/CD Improvements
- Added dependabot configuration for GPU API
- Enhanced GitHub Actions workflows
- Improved contributor automation with copy-pr-bot updates
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @rupalis-nv
- @XRFXLP
- @tanishagoyal2
- @dims
- @lalitadithya
- @KaivalyaMDabhadkar
- @deesharma24
- @nitz2407
- @ksaur
- @pteranodan
- @natherz97
- @jtschelling
- @ArangoGutierrez
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
📦 What's Included
Container Images (15 components)
- gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
- event-exporter
- kubernetes-object-monitor
All images include the latest bug fixes, security updates, and feature enhancements from this release.
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See the /docs directory in the repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Custom drain handlers require implementing the drain handler interface
- PostgreSQL backend is in preview and should be thoroughly tested before production use
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.5.0 \
--namespace nvsentinel \
--create-namespace

To upgrade from v0.4.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.5.0 \
--namespace nvsentinel \
--reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.