Release v0.6.0
This release brings NVLink fault detection and remediation, enhanced security with certificate rotation support, and fine-grained control over health event processing. We've also started the migration of fault remediation to controller-runtime for better scalability and made significant reliability improvements across the platform.
🎯 Major New Features
NVLink XID 74 Workflow
NVSentinel now includes automated detection and remediation for XID 74 errors in the health events analyzer. XID 74 indicates NVLink hardware faults that can disrupt GPU-to-GPU communication. The workflow detects these errors and executes appropriate remediation actions to restore cluster health.
Certificate Rotation Support
The store-client module now supports automatic certificate rotation. This enables zero-downtime certificate updates in production environments, addressing security compliance requirements and operational best practices for long-running deployments.
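For readers unfamiliar with how a Go client can pick up rotated certificates without a restart, here is a minimal sketch of one common approach: a `tls.Config` whose `GetClientCertificate` callback re-reads the key pair from disk on each handshake. This is not taken from the store-client source. The function names, file names, and the self-signed demo certificates are all assumptions made for illustration.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"path/filepath"
	"time"
)

// rotatingTLSConfig (hypothetical helper) returns a client TLS config that
// re-reads the key pair from disk on every handshake, so rotated files are
// picked up by new connections without restarting the process.
func rotatingTLSConfig(certPath, keyPath string) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
			pair, err := tls.LoadX509KeyPair(certPath, keyPath)
			if err != nil {
				return nil, err
			}
			return &pair, nil
		},
	}
}

// writeSelfSigned writes a throwaway self-signed certificate and key (demo only).
func writeSelfSigned(certPath, keyPath, commonName string) error {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: commonName},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	if err := os.WriteFile(certPath, certPEM, 0o600); err != nil {
		return err
	}
	return os.WriteFile(keyPath, keyPEM, 0o600)
}

// rotationDemo replaces the files on disk between two simulated handshakes
// and returns the common name the callback served each time.
func rotationDemo() (string, string, error) {
	dir, err := os.MkdirTemp("", "rotation-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	certPath, keyPath := filepath.Join(dir, "tls.crt"), filepath.Join(dir, "tls.key")
	cfg := rotatingTLSConfig(certPath, keyPath)

	var seen []string
	for _, cn := range []string{"old", "new"} {
		if err := writeSelfSigned(certPath, keyPath, cn); err != nil {
			return "", "", err
		}
		cert, err := cfg.GetClientCertificate(nil)
		if err != nil {
			return "", "", err
		}
		leaf, err := x509.ParseCertificate(cert.Certificate[0])
		if err != nil {
			return "", "", err
		}
		seen = append(seen, leaf.Subject.CommonName)
	}
	return seen[0], seen[1], nil
}

func main() {
	before, after, err := rotationDemo()
	fmt.Println(before, "->", after, err)
}
```

Reading from disk on each handshake keeps the sketch short. A production implementation would more likely cache the parsed pair and reload only when the files change, and it would need the certificate and key to be replaced atomically so a handshake never sees a mismatched pair.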
Selective Health Event Analyzer Rules
Operators can now selectively enable or disable specific health events analyzer rules based on operational needs. This provides granular control over which error patterns trigger detection and remediation, allowing customization for different cluster configurations and workload requirements.
Health Event Property Overrides
Added the ability to override specific fields in health events. This lets operators adapt NVSentinel's behavior to their own operational requirements and policies.
Controller-Runtime for Fault Remediation
Migrated the fault remediation module from custom reconciliation code to the controller-runtime framework. This brings improved scalability, better resource efficiency, standardized controller patterns, and easier maintainability.
🔧 Configuration & Control Improvements
Circuit Breaker Fresh Start
Added an option to reset fault quarantine state via the circuit breaker ConfigMap. This supports controlled recovery scenarios where operators need to clear historical state and restart with a clean slate.
🐛 Bug Fixes & Reliability Improvements
Node Drainer Priority Handling
Fixed the node drainer so that delete-after-timeout takes priority over the allow-completion setting. Nodes are now drained within the configured timeout window even when pods don't terminate gracefully, which prevents stuck drain operations.
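The precedence described above can be illustrated with a small sketch. The types, field names, and action names below are hypothetical and do not reflect the node drainer's real configuration schema or code. The only point is the ordering: the timeout is checked before allow-completion, so an expired timeout always wins.

```go
package main

import (
	"fmt"
	"time"
)

type drainAction string

const (
	actionWait        drainAction = "wait"         // let the pod keep running
	actionEvict       drainAction = "evict"        // graceful eviction (assumed default)
	actionForceDelete drainAction = "force-delete" // timeout expired
)

// drainPolicy is a hypothetical policy type, not NVSentinel's real one.
type drainPolicy struct {
	allowCompletion    bool
	deleteAfterTimeout time.Duration // zero disables the timeout
}

// demoPolicy enables both settings, which is the case the fix is about.
var demoPolicy = drainPolicy{allowCompletion: true, deleteAfterTimeout: 30 * time.Minute}

// decide picks an action for a pod that has been draining for `elapsed`.
func decide(p drainPolicy, elapsed time.Duration) drainAction {
	// Checked first: an expired timeout overrides allow-completion.
	if p.deleteAfterTimeout > 0 && elapsed >= p.deleteAfterTimeout {
		return actionForceDelete
	}
	if p.allowCompletion {
		return actionWait
	}
	return actionEvict
}

func main() {
	for _, elapsed := range []time.Duration{5 * time.Minute, 45 * time.Minute} {
		fmt.Printf("after %v: %s\n", elapsed, decide(demoPolicy, elapsed))
	}
}
```

If the two checks were swapped, a pod with allow-completion set would wait indefinitely, which is the kind of stuck drain this fix addresses.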
Event Exporter Logging
Fixed an issue where the event exporter ignored the standard logging configuration.
PostgreSQL Test Stability
Fixed flaky PostgreSQL tests that were causing intermittent CI failures.
Node Condition Message Formatting
Improved the formatting of truncated node condition messages to ensure readability when messages are trimmed to fit within Kubernetes API limits.
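As an illustration of readable truncation, the sketch below trims a message to a byte budget, cuts on a UTF-8 rune boundary, drops dangling separators, and appends a marker. The function name, the marker text, and the limits used in `main` are assumptions chosen for the example. They are not NVSentinel's actual values or the real Kubernetes limit.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncationSuffix is an assumed marker; it is ASCII, so len() equals its byte size.
const truncationSuffix = "... [truncated]"

// truncateMessage returns msg unchanged if it fits in maxBytes. Otherwise it
// returns a prefix of msg plus truncationSuffix, never exceeding maxBytes.
func truncateMessage(msg string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(msg) <= maxBytes {
		return msg
	}
	if maxBytes <= len(truncationSuffix) {
		// No room for any of the message; return as much of the marker as fits.
		return truncationSuffix[:maxBytes]
	}
	cut := maxBytes - len(truncationSuffix)
	// Back up so the cut never lands in the middle of a multi-byte rune.
	for cut > 0 && !utf8.RuneStart(msg[cut]) {
		cut--
	}
	// Drop trailing separators so the marker follows a complete token.
	return strings.TrimRight(msg[:cut], " ;,") + truncationSuffix
}

func main() {
	msg := "GPU 0: XID 79; GPU 1: XID 74; GPU 2: XID 48"
	fmt.Println(truncateMessage(msg, 30))
	fmt.Println(truncateMessage(msg, 1024))
}
```

Reserving space for the marker before cutting keeps the result within the budget, and the rune-boundary check avoids emitting invalid UTF-8 when a message contains non-ASCII text.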
UAT Environment Management
Improved AWS UAT environment deletion handling to prevent resource leaks and reduce costs from orphaned test infrastructure.
🏗️ Architecture & Performance
Enhanced GPU Health Monitor Logging
Unified the GPU health monitor logging format with other NVSentinel components. This provides consistent log structure across the platform, simplifying log aggregation and analysis.
Code Quality Improvements
Optimized import ordering and code organization across the codebase for better readability and maintainability.
🧪 Testing & Quality Improvements
Scale Testing Validation
Added scale tests for concurrent drain operations, validated on 1500-node clusters. These tests check that NVSentinel maintains its performance and reliability characteristics at large scale.
Test Reliability Improvements
- Fixed flaky syslog XID monitoring UAT tests
- Resolved CSP health monitor test timeout issues
CI/CD Enhancements
- Set explicit Go version during CI dependency installation for reproducible builds
- Improved the Tilt installation process by using a temporary directory
- Added GitHub Action to automatically clean up old untagged container images
- Enhanced fork repository handling to prevent unnecessary workflow triggers
📚 Documentation Improvements
- Enhanced Kubernetes object monitor architecture diagrams
- Updated Slinky drain demo documentation
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @tanishagoyal2
- @XRFXLP
- @miguelvr
- @deesharma24
- @KaivalyaMDabhadkar
- @rupalis-nv
- @ksaur
- @mchmarny
- @lalitadithya
- @dims
- @yafengio
- @ivelichkovich
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Audit logging is disabled by default - enable explicitly when needed
- Certificate rotation requires proper certificate management infrastructure
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.5.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.