Release v0.8.0
Release v0.8.0
This release introduces distributed node locking for safer concurrent operations, partial drain support for GPU level remediation, GPU reset capabilities across multiple components, and enhanced event handling strategies. We've also started implementation for preflight checks, improved cloud provider support, and made significant reliability improvements across the platform.
🎯 Major New Features
Distributed Node Locking
NVSentinel now includes distributed node locking to prevent concurrent maintenance operations on the same node. This critical safety feature ensures that multiple remediation workflows don't interfere with each other, preventing race conditions and ensuring predictable behavior when multiple components need to perform maintenance operations simultaneously.
Partial Drain Support
The node-drainer now supports partial drain operations, enabling GPU-level remediation without draining the entire node. This significantly reduces the blast radius of remediation actions, allowing healthy workloads to continue running while only affected GPUs are serviced. This feature is particularly valuable in large-scale clusters where preserving workload availability is critical.
Comprehensive GPU Reset Support
GPU reset functionality has been expanded across multiple components:
- GPU Health Monitor: Native GPU reset support for DCGM-detected issues
- Fault Remediation: Integrated GPU reset as a remediation action
- Syslog Health Monitor: GPU reset support for syslog-detected faults
This provides a lightweight, fast recovery mechanism that can resolve many GPU issues without requiring full node reboots, dramatically reducing recovery times and improving cluster availability.
Preflight Check Framework
Added preflight check scaffold and comprehensive design documentation for pre-job validation. This new framework enables operators to validate cluster state and prerequisites before a job starts to execute, thereby reducing the likelihood of failed jobs.
Event Handling Strategy Enhancements
Expanded event handling strategy support to additional components:
- CSP Health Monitor: Event handling strategy configuration for cloud provider events
- Kubernetes Object Monitor: Event handling strategy for Kubernetes resource events
This provides consistent, fine-grained control over event processing across all health monitoring components.
🔧 Configuration & Control Improvements
Custom Certificate Secrets
Helm charts now support custom certificate secrets, providing flexibility for organizations with existing certificate management infrastructure and specific security requirements. This enables seamless integration with enterprise PKI systems and certificate management workflows.
MongoDB Client Tracking
Added support for passing application names to MongoDB connections, enabling better client tracking and operational visibility. This helps operators understand which NVSentinel components are generating database load and simplifies troubleshooting of database performance issues.
ArgoCD Integration
Added checksum and sync-wave annotations for ArgoCD ConfigMap restarts, ensuring proper sequencing and change detection in GitOps workflows. This improves reliability when deploying NVSentinel via ArgoCD and prevents configuration drift issues.
🐛 Bug Fixes & Reliability Improvements
GPU Health Monitor Event Cache
Fixed critical race condition where the GPU health monitor event cache was updated before health events were successfully sent to platform-connector. This ensures events are not lost during transient connectivity issues and improves overall event delivery reliability.
Labeler Improvements
- Stale Label Removal: Labeler now properly removes stale labels that no longer apply to nodes
- Flaky Test Fixes: Resolved flaky labeler tests that were causing intermittent CI failures
Circuit Breaker Fixes
- Cursor Mode: Fixed cursor mode handling in circuit breaker reset mechanism
- Runbook Updates: Enhanced circuit breaker runbook with better operational guidance
Event Filtering
Fixed filtering logic in health-events-analyzer queries, ensuring events are properly matched against configured rules and improving detection accuracy.
MongoDB Authentication
Corrected authentication mechanism in MongoDB metrics URL, resolving connection issues in secured MongoDB deployments.
Fault Remediation Business Logic
Improved fault remediation to properly use controller-runtime business logic, enhancing reliability and consistency with Kubernetes controller patterns.
Nebius Cloud Reboot Handling
Fixed SendRebootSignal in Nebius provider to wait for instance stop completion before proceeding, preventing race conditions and ensuring reliable node reboots in Nebius Cloud environments.
Health Events Analyzer Test Fixes
Resolved test failures in health-events-analyzer that were causing CI pipeline issues.
🏗️ Architecture & Performance
Driver Version Dependent Parsing
Added driver version dependent parsing of NVL5 decoding rules, ensuring correct interpretation of NVLink errors across different driver versions. This improves accuracy of NVLink fault detection and reduces false positives.
🧪 Testing & Quality Improvements
UAT Test Reliability
- Improved UAT tests reliability with better error handling and retry logic
- Enhanced test configurability for different cluster environments
- Better cleanup and resource management in test environments
Documentation Improvements
- Added comprehensive preflight check design documentation
- New alert runbook for improved operational guidance
- Fixed typos in runbook documentation
Dependency Updates
- Bumped google.golang.org/api from 0.259.0 to 0.260.0 in csp-health-monitor
- Multiple security updates and dependency version bumps via dependabot
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @lalitadithya
- @natherz97
- @jtschelling
- @XRFXLP
- @tanishagoyal2
- @deesharma24
- @KaivalyaMDabhadkar
- @ksaur
- @c-fteixeira
- @miguelvramos92
- @ivelichkovich
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See
/docsdirectory in repository - Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Partial drain support requires proper GPU workload identification
- Distributed locking requires coordination between components
- Preflight checks are in early stages and should be thoroughly tested
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.8.0 \
--namespace nvsentinel \
--create-namespaceTo upgrade from v0.7.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.8.0 \
--namespace nvsentinel \
--reuse-valuesFor detailed installation and configuration instructions, see the README and documentation in the repository.