Releases: NVIDIA/NVSentinel
Release v0.10.1
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
```
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.10.1
```
Release v0.10.0
This release introduces multi-node NCCL all-reduce preflight testing across all major cloud fabrics, concurrent event exporting for large-scale clusters, Slinky (Slurm-on-Kubernetes) drainer improvements, DCGM 4.4.2 compatibility with destructive XID detection, breakfix response time metrics, and significant fault management reliability fixes.
Major New Features
Multi-Node NCCL All-Reduce Preflight Tests
- Cross-Node GPU Interconnect Validation (#837): A mutating webhook injects an init container that runs a multi-node NCCL all-reduce bandwidth benchmark across all gang members before the workload starts. Validates GPU interconnect health across InfiniBand, EFA, TCPXO, and MNNVL fabrics.
- New `preflight-nccl-allreduce` container image (PyTorch + torchrun)
- New `networkFabric` Helm selector with data-driven fabric profiles (`ib`, `efa`, `mnnvl-efa`, `tcpxo`)
- DRA resource claim mirroring for GB200 MNNVL/IMEX
- New `extraVolumeMounts` for GCP TCPXO plugin injection
- Auto-created NCCL topology ConfigMap for Azure IB
- Tested on A100 (Azure IB), H100 (AWS EFA, GCP TCPXO), and GB200 (AWS MNNVL)
Concurrent Event Exporter
- Worker Pool for Event Publishing (#906): Event exporter now supports concurrent publishing via a `--workers` flag. On a 1,100-node production cluster, sequential publishing (~3.3 events/sec) fell behind the event production rate (~10 events/sec), causing MongoDB oplog rotation and an unrecoverable `ChangeStreamHistoryLost` loop — leaving health events unexported for 4+ days. The new worker pool with sequence-tracked resume tokens provides at-least-once delivery with no event loss. At 10 workers, throughput reaches ~33 events/sec (supporting ~3,300 nodes).
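The throughput gain is straightforward fan-out parallelism: with N workers each publishing sequentially, aggregate throughput scales roughly linearly until the downstream sink saturates. A minimal sketch of the pattern (illustrative only — the function and queue names here are hypothetical, not NVSentinel's actual implementation, and this sketch omits the retry and resume-token logic a real at-least-once pipeline needs):

```python
import queue
import threading

def run_worker_pool(events, publish, workers=10):
    """Fan events out to a pool of publisher threads.

    Returns the list of events that publish() completed for.
    Error handling and re-queueing of failed events are omitted
    for brevity.
    """
    q = queue.Queue()
    published = []
    lock = threading.Lock()

    def worker():
        while True:
            event = q.get()
            if event is None:  # sentinel: shut this worker down
                q.task_done()
                return
            publish(event)
            with lock:
                published.append(event)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for e in events:
        q.put(e)
    for _ in threads:
        q.put(None)
    q.join()
    for t in threads:
        t.join()
    return published

# With ~0.3 s per publish call, 10 workers sustain roughly
# 10 / 0.3 ≈ 33 events/sec versus ~3.3 events/sec sequentially.
```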
DCGM 4.4.2 Compatibility & Destructive XID Detection
- DCGM_HEALTH_WATCH_ALL Support (#905): Upgraded gpu-health-monitor, preflight dcgm-diag, and fake-dcgm to DCGM 4.4.2. Previously, `DCGM_HEALTH_WATCH_ALL` incidents (used by DCGM to report destructive XIDs like XID 95) were silently excluded, causing a `KeyError` crash in the health check loop and leaving GPU failures undetected. The fix removes the exclusion filter, adds safe `.get()` fallbacks for unknown systems/error codes, and maps `DCGM_HEALTH_WATCH_ALL` to Fatal severity. Backward compatible with DCGM 4.2.x.
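The `KeyError`-avoidance fix described above follows a common pattern: look up severity with a default instead of indexing directly. A hedged illustration (the table contents here are placeholders, not NVSentinel's actual severity mapping):

```python
# Hypothetical severity table; per the release notes,
# DCGM_HEALTH_WATCH_ALL now maps to Fatal.
SEVERITY_BY_SYSTEM = {
    "DCGM_HEALTH_WATCH_PCIE": "Warning",
    "DCGM_HEALTH_WATCH_MEM": "Fatal",
    "DCGM_HEALTH_WATCH_ALL": "Fatal",
}

def classify(system: str) -> str:
    # dict.get() with a fallback never raises KeyError, so an
    # unknown watch system degrades gracefully instead of crashing
    # the health check loop.
    return SEVERITY_BY_SYSTEM.get(system, "Unknown")
```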
Breakfix Response Time Metrics
- End-to-End Remediation Latency Tracking (#714): New histogram metrics across the remediation pipeline to answer critical operational questions:
  - `fault_remediation_cr_generate_duration_seconds` — Mean time for CR creation in fault-remediation
  - `fault_quarantine_node_quarantine_duration_seconds` — Mean time to quarantine a node
  - `node_drainer_pod_eviction_duration_seconds` — Mean time waiting for user workloads to complete
  - Janitor remediation duration metrics — Mean time to remediate
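Mean times from histograms like these are conventionally derived from the `_sum` and `_count` series (in PromQL, `rate(..._sum[5m]) / rate(..._count[5m])`). A small sketch of the same arithmetic over two scrapes:

```python
def mean_duration(sum_before, count_before, sum_after, count_after):
    """Mean observed duration between two scrapes of a Prometheus
    histogram, computed from its _sum and _count series — the same
    quantity rate(_sum)/rate(_count) yields in PromQL."""
    d_count = count_after - count_before
    if d_count == 0:
        return 0.0  # no new observations between scrapes
    return (sum_after - sum_before) / d_count

# e.g. 12 evictions taking 96 s in total between scrapes -> 8 s mean
```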
Slinky Drainer Improvements
The Slinky (Slurm-on-Kubernetes) drainer received multiple improvements for production reliability:
- Annotation Handling (#909): Slinky drainer now only adds drain reason annotations if none already exist, and cleans up NVSentinel-owned annotations (prefixed with `[NVSentinel]`) upon drain completion. Includes envtest-based tests for the full drain lifecycle.
- Wait for Fully Drained State (#919): Fixed a critical bug where the drainer deleted pods in `DRAINING` state (drain accepted but jobs still running) instead of waiting for `DRAINED` state (all jobs complete). Now mirrors the Slinky operator's own `IsNodeDrained()` logic by checking busy-state conditions (`Allocated`, `Mixed`, `Completing`).
- Wait Only for Ready Pods (#916): Slinky drainer now correctly waits only for pods in Ready state, avoiding stalls on non-ready pods that will never drain.
- CI Pipeline Integration (#885): Slinky drain tests are now included in the GitHub CI pipeline.
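The DRAINING-vs-DRAINED distinction above boils down to checking both the node's drain state and its busy-state conditions. A simplified sketch of that check as described in the notes (field names are illustrative; the authoritative logic is the Slinky operator's `IsNodeDrained()`):

```python
# Slurm states that mean jobs are still running on the node.
BUSY_STATES = {"Allocated", "Mixed", "Completing"}

def is_node_drained(drain_state: str, condition_states: set) -> bool:
    # Pods are only safe to delete once the drain has completed
    # (DRAINED, not merely DRAINING) and no busy-state condition
    # remains, i.e. all jobs have finished.
    return drain_state == "DRAINED" and not (condition_states & BUSY_STATES)
```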
Configuration & Cloud Provider
- Configurable IAM Role for EKS (#877): The IAM role name used by the CSP health monitor for EKS is now configurable via Helm values (`iamRoleName`), supporting environments with custom IAM role naming.
- Terminate Node Template (#894): Added a template for creating `TerminateNode` CRs in fault-remediation values, enabling `REPLACE_VM` remediation actions for node replacement workflows.
Bug Fixes & Reliability
- FRM Multiple Reconciliation Fix (#897): Fixed multiple issues in fault-remediation: duplicate event reconciliation from concurrent status updates, missing `nvsentinel-state` label for taint-only configurations, duplicate fields in `userPodsEvictionStatus`, and missing `lastRemediationTimestamp` updates in Postgres queries.
- Quarantine Metric Accuracy (#759): Fixed `fault_quarantine_current_quarantined_nodes` metric reporting inflated values. Root cause: manual taint removal wasn't triggering the unquarantine flow and annotation cleanup. Also fixed cases where the `quarantineHealthEvent` annotation had empty values alongside `quarantinedNodeUncordonedManually`.
- GPU Reset RuntimeClassName (#887): Set `RuntimeClassName` to `nvidia` in GPU reset pods, ensuring proper GPU access during reset operations.
- GPU Reset UAT Improvements (#879, #892): Wait for the GPUReset CRD (instead of checking syslog) in UAT tests for more reliable validation; fixed an uninitialized variable.
- Unquarantine Timeout (#876): Increased the unquarantine timeout from the default to 5 minutes to prevent premature timeout failures.
- Nolint Directive Cleanup (#832, #884): Continued cleanup of TODO-marked `nolint` directives (Parts 3 & 4).
Dependency Updates
- Bumped `github.com/aws/aws-sdk-go-v2/config` from 1.32.7 to 1.32.9 (#902)
- Bumped `google.golang.org/api` from 0.266.0 to 0.267.0 (#903)
- Bumped `aquasecurity/trivy-action` from 0.33.1 to 0.34.0 (#839)
- Multiple dependency updates via dependabot (#875, #904)
Acknowledgments
This release includes contributions from:
- @XRFXLP
- @natherz97
- @tanishagoyal2
- @deesharma24
- @cbump
- @KaivalyaMDabhadkar
- @faganihajizada
- @nitz2407
- @neerajnv
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Getting Started
To install this release:
```
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.10.0 \
  --namespace nvsentinel \
  --create-namespace
```
To upgrade from v0.9.x:
```
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.10.0 \
  --namespace nvsentinel \
  --reuse-values
```
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.9.0
This release delivers end-to-end GPU reset support as a first-class remediation action, major expansions to the preflight check framework (DCGM diagnostics, NCCL loopback tests, gang discovery), enhanced Kubernetes operator health monitoring, and significant performance and reliability improvements across the platform.
Major New Features
End-to-End GPU Reset
GPU reset is now a fully integrated remediation path in NVSentinel. Building on the foundational work in v0.8.0, this release completes the pipeline:
- GPU Reset Controller in Janitor (#797): New controller that consumes `GPUReset` CRDs and orchestrates the full reset lifecycle — tearing down GPU Operator components, executing the reset via nvidia-smi, and restoring services.
- GPU Reset Container Image (#788): Dedicated `gpu-reset` container image used by Janitor's reset jobs to perform the actual GPU reset on target nodes.
- E2E and UAT Test Coverage (#768): Enables GPU reset across fault-remediation (mapping `COMPONENT_RESET` to `GPUReset`), node-drainer (partial drain for GPU-scoped events), and health monitors (fallback to `RESTART_VM` when UUID discovery fails). Includes comprehensive end-to-end and UAT tests validating the full reset workflow.
This provides a lightweight recovery mechanism that resolves many GPU issues without full node reboots — resetting only the affected GPU while keeping healthy workloads running via partial drain.
Preflight Check Framework Expansion
The preflight check framework introduced in v0.8.0 now includes real diagnostic capabilities:
- DCGM Diagnostics (#772): Runs DCGM diagnostic tests as preflight checks, discovering allocated GPUs via gonvml and executing diagnostics via pydcgm. Reports per-GPU, per-test health events (fatal for failures, non-fatal for warnings, healthy for passes).
- NCCL Loopback Tests (#808): Validates intra-node GPU interconnect health by running NCCL all-reduce loopback tests. Detects degraded PCIe/NVLink bandwidth — tested across A100, H100, and GB200/GB300 hardware.
- Gang Discovery (#818): Discovers pods belonging to the same scheduling group as a prerequisite for multi-node NCCL tests. Supports both native Kubernetes Workload API (1.35+) and PodGroup-based schedulers (Volcano, etc.) with config-driven CRD resolution. Coordinates peer discovery via ConfigMap injection at admission time.
Kubernetes Operator Health Monitoring
- GPU & Network Operator Pod Monitoring (#751): The kubernetes-object-monitor now tracks DaemonSet pod health in the `gpu-operator` and `network-operator` namespaces. Detects pods that fail to reach Running state within a configurable timeout and publishes fatal health events. Automatically publishes healthy events when pods recover.
Performance & Observability
Histogram Bucket Cardinality Reduction
- 96% Series Reduction (#799): Replaced linear histogram buckets (500 buckets) with exponential buckets (12 buckets) in platform-connector metrics. Eliminates ~500K metric series cluster-wide, resolving Prometheus remote write bottlenecks and significantly reducing memory usage.
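Each histogram bucket contributes one time series per label combination (plus `+Inf`, `_sum`, and `_count`), so bucket count multiplies directly into total cardinality. A small sketch of the two bucket layouts (pure Python; `exponential_buckets` mirrors the shape of the Go Prometheus client's `ExponentialBuckets` helper):

```python
def linear_buckets(start, width, count):
    # e.g. 500 buckets of fixed width -> 500 series per label set
    return [start + i * width for i in range(count)]

def exponential_buckets(start, factor, count):
    # e.g. 12 buckets doubling each step covers the same range
    # with far fewer series
    return [start * factor ** i for i in range(count)]

# Going from 500 linear buckets to 12 exponential ones removes
# 488/500 ≈ 97.6% of per-histogram bucket series before multiplying
# across nodes and labels.
```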
Configurable Network Policy
- Optional Metrics Network Policy (#789): The `metrics-access` network policy can now be disabled via `networkPolicy.enabled: false`. Resolves conflicts when NVSentinel shares a namespace with services like cert-manager that require ingress on non-metrics ports.
Bug Fixes & Reliability
- Nolint Directive Cleanup (#828, #831): Cleaned up `nolint` directives previously marked as TODO across the codebase, improving lint compliance and code quality.
- E2E Test Retry for InfoROM Errors (#834): Added retry logic when injecting InfoROM errors in E2E tests, improving test reliability.
- Demo Script Fix (#809): Fixed demo script to display correct node conditions.
- SBOM Generation Disk Space (#817, #827): Added disk cleanup logic before SBOM generation in the publish container CI job, preventing build failures due to insufficient disk space.
- CUDA Image Source (#792): Switched to CUDA images from NVCR to avoid Docker Hub rate limits in CI.
Build & Infrastructure
- Overrideable Module Names (#816): Component Makefiles can now override the Go module name, improving build flexibility.
- Mixed Eviction Scale Tests (#830): Added scale test results for mixed eviction modes (Immediate, AllowCompletion, DeleteAfterTimeout) on a 1500-node cluster, validating correct behavior at 10%, 25%, and 50% cluster scale.
- Copy-PR-Bot Config (#805): Added username to copy-pr-bot configuration.
Documentation
- K8s Data Store Design Doc (#787): Design document for introducing a Kubernetes-native data store for health events, reducing dependency on MongoDB.
Dependency Updates
- Bumped protobuf from 6.33.4 to 6.33.5 in gpu-health-monitor (#769)
- Multiple dependency updates via dependabot (#803, #806, #829)
Acknowledgments
This release includes contributions from:
- @natherz97
- @XRFXLP
- @deesharma24
- @tanishagoyal2
- @ksaur
- @jtschelling
- @cbumb
- @yuanchen8911
- @yavinash007
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Getting Started
To install this release:
```
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.9.0 \
  --namespace nvsentinel \
  --create-namespace
```
To upgrade from v0.8.x:
```
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.9.0 \
  --namespace nvsentinel \
  --reuse-values
```
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.8.0
This release introduces distributed node locking for safer concurrent operations, partial drain support for GPU level remediation, GPU reset capabilities across multiple components, and enhanced event handling strategies. We've also started implementation for preflight checks, improved cloud provider support, and made significant reliability improvements across the platform.
🎯 Major New Features
Distributed Node Locking
NVSentinel now includes distributed node locking to prevent concurrent maintenance operations on the same node. This critical safety feature ensures that multiple remediation workflows don't interfere with each other, preventing race conditions and ensuring predictable behavior when multiple components need to perform maintenance operations simultaneously.
Partial Drain Support
The node-drainer now supports partial drain operations, enabling GPU-level remediation without draining the entire node. This significantly reduces the blast radius of remediation actions, allowing healthy workloads to continue running while only affected GPUs are serviced. This feature is particularly valuable in large-scale clusters where preserving workload availability is critical.
Comprehensive GPU Reset Support
GPU reset functionality has been expanded across multiple components:
- GPU Health Monitor: Native GPU reset support for DCGM-detected issues
- Fault Remediation: Integrated GPU reset as a remediation action
- Syslog Health Monitor: GPU reset support for syslog-detected faults
This provides a lightweight, fast recovery mechanism that can resolve many GPU issues without requiring full node reboots, dramatically reducing recovery times and improving cluster availability.
Preflight Check Framework
Added a preflight check scaffold and comprehensive design documentation for pre-job validation. This new framework enables operators to validate cluster state and prerequisites before a job starts, reducing the likelihood of failed jobs.
Event Handling Strategy Enhancements
Expanded event handling strategy support to additional components:
- CSP Health Monitor: Event handling strategy configuration for cloud provider events
- Kubernetes Object Monitor: Event handling strategy for Kubernetes resource events
This provides consistent, fine-grained control over event processing across all health monitoring components.
🔧 Configuration & Control Improvements
Custom Certificate Secrets
Helm charts now support custom certificate secrets, providing flexibility for organizations with existing certificate management infrastructure and specific security requirements. This enables seamless integration with enterprise PKI systems and certificate management workflows.
MongoDB Client Tracking
Added support for passing application names to MongoDB connections, enabling better client tracking and operational visibility. This helps operators understand which NVSentinel components are generating database load and simplifies troubleshooting of database performance issues.
ArgoCD Integration
Added checksum and sync-wave annotations for ArgoCD ConfigMap restarts, ensuring proper sequencing and change detection in GitOps workflows. This improves reliability when deploying NVSentinel via ArgoCD and prevents configuration drift issues.
🐛 Bug Fixes & Reliability Improvements
GPU Health Monitor Event Cache
Fixed critical race condition where the GPU health monitor event cache was updated before health events were successfully sent to platform-connector. This ensures events are not lost during transient connectivity issues and improves overall event delivery reliability.
Labeler Improvements
- Stale Label Removal: Labeler now properly removes stale labels that no longer apply to nodes
- Flaky Test Fixes: Resolved flaky labeler tests that were causing intermittent CI failures
Circuit Breaker Fixes
- Cursor Mode: Fixed cursor mode handling in circuit breaker reset mechanism
- Runbook Updates: Enhanced circuit breaker runbook with better operational guidance
Event Filtering
Fixed filtering logic in health-events-analyzer queries, ensuring events are properly matched against configured rules and improving detection accuracy.
MongoDB Authentication
Corrected authentication mechanism in MongoDB metrics URL, resolving connection issues in secured MongoDB deployments.
Fault Remediation Business Logic
Improved fault remediation to properly use controller-runtime business logic, enhancing reliability and consistency with Kubernetes controller patterns.
Nebius Cloud Reboot Handling
Fixed SendRebootSignal in Nebius provider to wait for instance stop completion before proceeding, preventing race conditions and ensuring reliable node reboots in Nebius Cloud environments.
Health Events Analyzer Test Fixes
Resolved test failures in health-events-analyzer that were causing CI pipeline issues.
🏗️ Architecture & Performance
Driver Version Dependent Parsing
Added driver version dependent parsing of NVL5 decoding rules, ensuring correct interpretation of NVLink errors across different driver versions. This improves accuracy of NVLink fault detection and reduces false positives.
🧪 Testing & Quality Improvements
UAT Test Reliability
- Improved UAT tests reliability with better error handling and retry logic
- Enhanced test configurability for different cluster environments
- Better cleanup and resource management in test environments
Documentation Improvements
- Added comprehensive preflight check design documentation
- New alert runbook for improved operational guidance
- Fixed typos in runbook documentation
Dependency Updates
- Bumped google.golang.org/api from 0.259.0 to 0.260.0 in csp-health-monitor
- Multiple security updates and dependency version bumps via dependabot
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @lalitadithya
- @natherz97
- @jtschelling
- @XRFXLP
- @tanishagoyal2
- @deesharma24
- @KaivalyaMDabhadkar
- @ksaur
- @c-fteixeira
- @miguelvramos92
- @ivelichkovich
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See the `/docs` directory in the repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Partial drain support requires proper GPU workload identification
- Distributed locking requires coordination between components
- Preflight checks are in early stages and should be thoroughly tested
🚀 Getting Started
To install this release:
```
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --create-namespace
```
To upgrade from v0.7.x:
```
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --reuse-values
```
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.7.1
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
```
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.7.1
```
Release v0.7.0
This release introduces advanced event processing strategies, gRPC-based remediation services, enhanced templating capabilities, and improved support for cloud service providers. We've also made significant reliability enhancements across the platform.
🎯 Major New Features
Event Processing Strategies
NVSentinel now supports configurable event processing strategies across health monitors and analyzers. The new `processingStrategy` field in health events allows fine-grained control over how events are handled, enabling operators to customize event processing behavior based on specific operational requirements. This feature has been implemented in:
- GPU Health Monitor
- Syslog Health Monitor
- Health Events Analyzer
gRPC Remediation Service
Added a new gRPC-based remediation service that enables programmatic fault remediation operations. This provides a powerful API for external systems to integrate with NVSentinel's remediation capabilities, supporting advanced automation workflows and custom orchestration scenarios.
Enhanced Templating Support
Multi-template support in fault remediation allows using multiple notification templates for different channels and audiences. Additionally, all fields in health events can now be used for templating, providing complete flexibility in crafting notifications and alerts.
Nebius Cloud Support
Added comprehensive support for Nebius Cloud (MK8s) CSP, including environment variable and secret support for node reboot operations. This expands NVSentinel's multi-cloud capabilities to include another major cloud provider.
🔧 Configuration & Control Improvements
Pod GPU Device Allocation Tracking
The metadata collector now tracks pod GPU device allocation, providing visibility into which pods are using which GPUs. This enables more informed remediation decisions and better troubleshooting of GPU-related issues.
Syslog Runtime Journal Support
Added runtime journal support to the syslog health monitor, enabling direct integration with systemd journal for more efficient log collection and processing.
PodMonitor Configuration
Helm charts now support making PodMonitor optional and configurable, providing flexibility for environments with different monitoring setups and requirements.
🐛 Bug Fixes & Reliability Improvements
MongoDB Retry Logic
Implemented retry mechanism for MongoDB write failures, improving resilience against transient database connectivity issues and ensuring health events are not lost during temporary network problems.
Janitor Reconciliation Simplification
Simplified janitor reconciliation loops for better reliability and maintainability. The refactored logic reduces complexity and improves predictability of cleanup operations.
XID 154 Case Handling
Fixed a critical case statement issue in XID 154 handling that could cause incorrect error processing. This ensures GPU recovery action changes are properly detected and remediated.
GpuNvlinkWatch Message Parsing
Fixed message parsing logic collision in GpuNvlinkWatch that was causing stale node conditions. This resolves false positives and improves the accuracy of NVLink health monitoring.
Missing RESTART_VM Action
Added the missing RESTART_VM remediation action to fault remediation configurations, ensuring all supported remediation actions are properly exposed in the configuration.
Default CSP Provider Host
Fixed missing default cspProviderHost value that was causing configuration issues in certain deployment scenarios.
Apple Silicon Demo Support
Added ARM64 (Apple Silicon) support for local fault injection demos, improving the developer experience on macOS machines with Apple Silicon processors.
GpuNvlinkWatch Stale Conditions
Resolved issue where GpuNvlinkWatch could report stale node conditions due to message parsing logic collision, improving monitoring accuracy.
🏗️ Architecture & Performance
Certificate Hot-Reloading
Added option to enable automatic certificate hot-reloading, allowing certificate updates without service restarts. The certificate watcher is now non-blocking, improving overall system responsiveness.
MongoDB Query Metrics
Added comprehensive MongoDB query metrics for better observability of database operations, enabling performance analysis and optimization of data access patterns.
Enhanced Logging
Improved logging configuration across multiple components for better consistency and observability.
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @lalitadithya
- @tanishagoyal2
- @jtschelling
- @jamyu
- @Sukets
- @KaivalyaMDabhadkar
- @natherz97
- @XRFXLP
- @oseeniraj
- @aireet
- @xuegangjie
- @miguelvramos92
- @deesharma24
- @zydee3
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See the `/docs` directory in the repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- gRPC remediation service requires proper network configuration and authentication
- Event processing strategies should be carefully tested before production deployment
🚀 Getting Started
To install this release:
```
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.7.0 \
  --namespace nvsentinel \
  --create-namespace
```
To upgrade from v0.6.x:
```
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.7.0 \
  --namespace nvsentinel \
  --reuse-values
```
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.6.0
This release brings NVLink fault detection and remediation, enhanced security with certificate rotation support, and fine-grained control over health event processing. We've also started the migration of fault remediation to controller-runtime for better scalability and made significant reliability improvements across the platform.
🎯 Major New Features
NVLink XID 74 Workflow
NVSentinel now includes automated detection and remediation for XID 74 errors in the health events analyzer. XID 74 indicates NVLink hardware faults that can disrupt GPU-to-GPU communication. The workflow detects these errors and executes appropriate remediation actions to restore cluster health.
Certificate Rotation Support
The store-client module now supports automatic certificate rotation. This enables zero-downtime certificate updates in production environments, addressing security compliance requirements and operational best practices for long-running deployments.
Selective Health Event Analyzer Rules
Operators can now selectively enable or disable specific health events analyzer rules based on operational needs. This provides granular control over which error patterns trigger detection and remediation, allowing customization for different cluster configurations and workload requirements.
Health Event Property Overrides
Added capability to override specific fields in health events. This allows customization of NVSentinel's behavior to match specific operational requirements and policies.
Controller-Runtime for Fault Remediation
Migrated the fault remediation module from custom reconciliation code to the controller-runtime framework. This brings improved scalability, better resource efficiency, standardized controller patterns, and easier maintainability.
🔧 Configuration & Control Improvements
Circuit Breaker Fresh Start
Added option to reset fault quarantine state via circuit breaker ConfigMap. This enables controlled recovery scenarios where operators need to clear historical state and restart with a clean slate.
🐛 Bug Fixes & Reliability Improvements
Node Drainer Priority Handling
Fixed node drainer to ensure delete-after-timeout properly takes priority over allow-completion setting. This ensures nodes are drained within configured timeout windows even when pods don't terminate gracefully, preventing stuck drain operations.
Event Exporter Logging
Fixed an issue that caused the event exporter to not use the standard logging configuration.
PostgreSQL Test Stability
Fixed flaky PostgreSQL tests that were causing intermittent CI failures.
Node Condition Message Formatting
Improved the formatting of truncated node condition messages to ensure readability when messages are trimmed to fit within Kubernetes API limits.
UAT Environment Management
Improved AWS UAT environment deletion handling to prevent resource leaks and reduce costs from orphaned test infrastructure.
🏗️ Architecture & Performance
Enhanced GPU Health Monitor Logging
Unified the GPU health monitor logging format with other NVSentinel components. This provides consistent log structure across the platform, simplifying log aggregation and analysis.
Code Quality Improvements
Optimized import ordering and code organization across the codebase for better readability and maintainability.
🧪 Testing & Quality Improvements
Scale Testing Validation
Added concurrent drain operations scale tests with validation on 1500-node clusters. These tests ensure NVSentinel maintains performance and reliability characteristics at large scale.
Test Reliability Improvements
- Fixed flaky syslog XID monitoring UAT tests
- Resolved CSP health monitor test timeout issues
CI/CD Enhancements
- Set explicit Go version during CI dependency installation for reproducible builds
- Improved tilt installation process by using temporary directory
- Added GitHub Action to automatically clean up old untagged container images
- Enhanced fork repository handling to prevent unnecessary workflow triggers
📚 Documentation Improvements
- Enhanced Kubernetes object monitor architecture diagrams
- Updated Slinky drain demo documentation
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @tanishagoyal2
- @XRFXLP
- @miguelvr
- @deesharma24
- @KaivalyaMDabhadkar
- @rupalis-nv
- @ksaur
- @mchmarny
- @lalitadithya
- @dims
- @yafengio
- @ivelichkovich
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See the `/docs` directory in the repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Audit logging is disabled by default - enable explicitly when needed
- Certificate rotation requires proper certificate management infrastructure
🚀 Getting Started
To install this release:
```
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --create-namespace
```
To upgrade from v0.5.x:
```
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --reuse-values
```
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.5.0
This release focuses on extensibility, production hardening, and operational flexibility. We've added support for custom drain handlers, PostgreSQL as an alternative database backend, comprehensive audit logging, and expanded our XID detection and remediation capabilities.
🎯 Major New Features
Custom Drain Extensibility
NVSentinel now supports custom drain handlers, allowing integration with specialized workload orchestrators. This feature enables organizations running HPC schedulers like Slinky, big data frameworks like Volcano, or ML platforms like Ray to integrate their custom drain logic seamlessly. The release includes a complete demo environment showcasing custom drain integration.
PostgreSQL Database Backend
Added PostgreSQL as a production-grade alternative to MongoDB, providing more flexibility in database selection. This addresses licensing concerns, operational preferences, and allows better alignment with existing infrastructure standards.
Note: PostgreSQL support is experimental and is not recommended for production clusters.
Audit Logging
Comprehensive audit logging for all NVSentinel write operations enables compliance reporting, security analysis, and operational troubleshooting. Every mutation is tracked with context about what changed, when, and by which component. The structured audit logs support configurable retention and rotation, with formats ready for integration with SIEM systems.
🔧 Enhanced Fault Detection & Remediation
XID 13 & XID 31 Workflow Implementation
Automated workflows now handle the critical GPU error conditions XID 13 and XID 31, helping catch GPU degradation early.
XID 154 Support
Added support for detecting and handling XID 154 (GPU Recovery Action Changed) events.
Pre-Installed Driver Support
Enhanced support for environments where the GPU driver is installed outside of the GPU Operator.
🏗️ Infrastructure & Architecture Improvements
ko-based Kubernetes Object Monitor
Migrated the Kubernetes object monitor to ko-based builds, resulting in faster build times for development iterations, smaller container images with reduced attack surface, and improved supply chain security with minimal base images.
Enhanced Build System
Version field is now properly passed from build args to Dockerfile for accurate version reporting, improving reproducibility and traceability in logs.
🐛 Bug Fixes & Reliability Improvements
Node Condition Message Limiting
Node condition messages are now automatically truncated to 1024 bytes to prevent Kubernetes API server issues with excessively large messages. This prevents edge cases where verbose error descriptions could cause API errors.
Quarantine Override Handling
Quarantine overrides are now properly applied to nodes that are already in quarantined state, ensuring manual overrides work consistently regardless of node state.
Data Model Type Safety
Recommended action type changed from integer to string for better API clarity, type safety, and human readability in configurations.
Data Model Consistency
Replaced incorrect uses of IGNORE with NONE throughout the data model for consistency with the canonical data schema.
Log Collector Concurrency
Improved handling of must-gather toggle and concurrent log collector job scenarios to prevent resource conflicts and ensure reliable log collection.
🧪 Testing & Quality Improvements
Enhanced Tilt Testing
Comprehensive Tilt tests were added for the CSP health monitor, with deterministic behavior that avoids sleep-based timing, making tests faster and more reliable.
Scale Testing Framework
New performance and scale tests to validate NVSentinel behavior under load:
- FQM Latency & Queue Depth: Tests for fault quarantine module performance characteristics
- API Server & MongoDB Performance: Validation of data layer performance at scale
Log Collector Tilt Tests
Added automated tilt tests for the log collector module, improving test coverage for critical troubleshooting workflows.
📚 Documentation Improvements
Operational Documentation
- Datastore architecture and migration documentation
- Comprehensive configuration reference
- Feature documentation and user guides
- Runbooks for common operational scenarios
- Upgrade procedures and best practices
- IAM setup guide for CSP health monitor
- Documentation for pre-installed GPU driver support
🔄 Dependencies & Maintenance
Security Updates
- Upgraded Go modules to address CVEs in dependencies
- Bumped various dependencies to latest stable versions
CI/CD Improvements
- Added dependabot configuration for GPU API
- Enhanced GitHub Actions workflows
- Improved contributor automation with copy-pr-bot updates
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @rupalis-nv
- @XRFXLP
- @tanishagoyal2
- @dims
- @lalitadithya
- @KaivalyaMDabhadkar
- @deesharma24
- @nitz2407
- @ksaur
- @pteranodan
- @natherz97
- @jtschelling
- @ArangoGutierrez
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
📦 What's Included
Container Images (15 components)
- gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
- event-exporter
- kubernetes-object-monitor
All images include the latest bug fixes, security updates, and feature enhancements from this release.
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Custom drain handlers require implementing the drain handler interface
- PostgreSQL backend is in preview and should be thoroughly tested before production use
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.5.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v0.4.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.5.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.4.1
Release v0.4.1
This is a hotfix release addressing bugs discovered in v0.4.0. We recommend all users running v0.4.0 upgrade to v0.4.1.
🐛 Bug Fixes
Fault Quarantine Uncordoning Issue
Fixed: Resolved a critical issue where the fault quarantine module's node annotations map could become stale, preventing proper uncordoning of nodes. This fix ensures that manual uncordon operations and automated recovery workflows function correctly.
Event Exporter Package Publishing
Fixed: Corrected the event exporter package publishing configuration, ensuring the event exporter component is properly included in releases and can be deployed as expected.
CRI-O Runtime Support
Fixed: Added the ability to unset the runtime class as a workaround for CRI-O environments where the default runtime class configuration may cause deployment issues. This improves compatibility across different container runtime configurations.
🔄 Upgrade Instructions
To upgrade from v0.4.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.4.1 \
--namespace nvsentinel \
--reuse-values
To install v0.4.1:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.4.1 \
--namespace nvsentinel \
--create-namespace
🙏 Acknowledgments
This hotfix release includes contributions from:
Thank you for the quick turnaround on these critical fixes!
📦 What's Included
All 15 container images from v0.4.0 with the above bug fixes applied.
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
Release v0.4.0
Release v0.4.0
This release brings major enhancements to NVSentinel's observability, testing infrastructure, and operational flexibility. We've added powerful new monitoring capabilities, improved database options, and made significant investments in automated testing to ensure reliability at scale.
🎯 Major New Features
Health Event Exporter
NVSentinel now includes a dedicated event exporter that enables seamless integration with external monitoring and analytics systems. Export health events to your preferred data platform for long-term analysis, compliance reporting, or integration with existing observability stacks.
Kubernetes Object Health Monitor
A new monitor that tracks Kubernetes objects, providing insights into the health of nodes and accelerators. This is particularly useful for monitoring node conditions set by entities that aren't yet integrated with NVSentinel, allowing you to leverage existing health signals from other monitoring tools and operators running in your cluster.
Repeated XID Pattern Detection
The health events analyzer can now identify unique XIDs within burst windows and correlate them across multiple bursts to detect repeated XID patterns. This advanced pattern matching helps identify nodes with recurring but intermittent issues, enabling proactive intervention before these patterns lead to major failures.
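The correlation idea can be sketched as follows (a simplified illustration, not the analyzer's actual algorithm or configuration): each burst window contributes its set of unique XIDs, and an XID seen in at least N distinct bursts is flagged as a repeating pattern.

```go
package main

import (
	"fmt"
	"sort"
)

// repeatedXIDs flags XIDs that appear in at least minBursts distinct
// burst windows. Within a burst, duplicates of the same XID count once,
// so a single noisy burst cannot trip the threshold on its own.
func repeatedXIDs(bursts [][]int, minBursts int) []int {
	seenIn := map[int]int{} // xid -> number of bursts it appeared in
	for _, burst := range bursts {
		unique := map[int]bool{}
		for _, xid := range burst {
			unique[xid] = true
		}
		for xid := range unique {
			seenIn[xid]++
		}
	}
	var repeated []int
	for xid, n := range seenIn {
		if n >= minBursts {
			repeated = append(repeated, xid)
		}
	}
	sort.Ints(repeated)
	return repeated
}

func main() {
	// XID 79 recurs across three separate bursts; 13 and 31 appear once each.
	bursts := [][]int{{79, 79, 13}, {79}, {79, 31}}
	fmt.Println(repeatedXIDs(bursts, 3)) // prints [79]
}
```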
Enhanced Database Flexibility
You can now choose between Bitnami MongoDB and Percona MongoDB based on your organizational preferences and requirements. This flexibility allows better alignment with existing infrastructure standards and support agreements.
Local Development & Testing with KIND
We've added a complete local error injection demo that runs on KIND (Kubernetes IN Docker) clusters. This makes it easy to test NVSentinel's behavior, experiment with configurations, and validate custom integrations without requiring access to GPU hardware or cloud resources.
Unified MongoDB SDK
All MongoDB operations have been consolidated into a unified store-client SDK, providing consistent data access patterns across all modules. This refactoring improves code maintainability, reduces duplication, and makes it easier to extend NVSentinel's data layer.
🔧 Configuration & Usability Improvements
Component-Specific Tolerations
Platform connectors now support component-specific tolerations, giving you fine-grained control over which nodes the connector instances can run on. This is particularly useful in heterogeneous clusters with different taint configurations.
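As an illustration, component-specific tolerations would typically be set through chart values along these lines; the key paths below are assumptions for the sketch, so consult the chart's values.yaml for the actual names:

```yaml
# Hypothetical values fragment -- key names are illustrative only.
platform-connectors:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
```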
🐛 Bug Fixes & Reliability Improvements
- Fixed: Nil pointer check prevents panic during graceful shutdown scenarios
- Fixed: TypeError in GPU Health Monitor signal handler that could cause unexpected terminations
- Fixed: Duplicate node-drainer events eliminated by ensuring consistent pod list ordering
- Fixed: Partial recovery healthy events are no longer incorrectly propagated to node drainer and fault remediation modules
- Fixed: CSP monitor reliability improvements for better cloud provider integration
- Fixed: ECR registry used for base images to avoid Docker Hub rate limiting
- Fixed: SAFE_REF used in Helm publish workflow to handle special characters in branch names
- Added: Pre-upgrade Helm hook automatically cleans up deprecated node conditions during upgrades
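The consistent-ordering fix behind the duplicate node-drainer events above can be sketched in Go (an illustration of the technique, not the actual NVSentinel code): sorting the pod list by namespace and name means repeated reconciliations see identical input and emit identical events.

```go
package main

import (
	"fmt"
	"sort"
)

// podRef is a minimal stand-in for a Kubernetes pod reference.
type podRef struct{ Namespace, Name string }

// sortPods gives the pod list a stable, deterministic order, so two
// passes over the same node produce the same event sequence instead
// of spurious duplicates.
func sortPods(pods []podRef) {
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].Namespace != pods[j].Namespace {
			return pods[i].Namespace < pods[j].Namespace
		}
		return pods[i].Name < pods[j].Name
	})
}

func main() {
	pods := []podRef{{"ml", "trainer-b"}, {"default", "web"}, {"ml", "trainer-a"}}
	sortPods(pods)
	fmt.Println(pods)
}
```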
🧪 Testing & Quality Improvements
Automated User Acceptance Testing (UAT)
- AWS UAT: Automated end-to-end tests running on actual AWS infrastructure with GPU instances
- GCP UAT: Comprehensive UAT coverage on Google Cloud Platform
Development Environment
- Fixed: Linux development environment setup issues resolved
Test Configuration
- Updated test configurations to use more appropriate time windows, reducing test flakiness while maintaining coverage
🏗️ Infrastructure & Development
Dependency Management
- Multiple dependency updates merged from Dependabot across AWS SDK, configuration libraries, and other critical dependencies
- Helm version pinned to v3.19.2 to ensure consistent behavior across environments
- Upgraded various Go and Python packages to latest stable versions
CI/CD Improvements
- Removed paths-ignore in GitHub Actions to improve integration with copy-pr-bot
- Enhanced workflow reliability and error handling
- Better handling of branch names and special characters in automation
📚 Documentation
Updated Documentation
- Comprehensive log collection documentation with detailed troubleshooting guides
- Updated guides to reflect current best practices
🙏 Acknowledgments
This release includes contributions from multiple contributors across NVIDIA and the community:
- @lalitadithya
- @XRFXLP
- @ksaur
- @KaivalyaMDabhadkar
- @Gyan172004
- @mchmarny
- @dims
- @tanishagoyal2
- @rupalis-nv
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
📦 What's Included
Container Images (15 components)
- gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
- event-exporter (NEW)
- kubernetes-object-monitor (NEW)
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- The Kubernetes object monitor is in preview and may require tuning for specific workloads
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.4.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v0.3.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.4.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the README and documentation in the repository.