Release v0.8.0

This release introduces distributed node locking for safer concurrent operations, partial drain support for GPU level remediation, GPU reset capabilities across multiple components, and enhanced event handling strategies. We've also started implementation for preflight checks, improved cloud provider support, and made significant reliability improvements across the platform.

🎯 Major New Features

Distributed Node Locking

NVSentinel now includes distributed node locking to prevent concurrent maintenance operations on the same node. This critical safety feature ensures that multiple remediation workflows don't interfere with each other, preventing race conditions and ensuring predictable behavior when multiple components need to perform maintenance operations simultaneously.

Partial Drain Support

The node-drainer now supports partial drain operations, enabling GPU-level remediation without draining the entire node. This significantly reduces the blast radius of remediation actions, allowing healthy workloads to continue running while only affected GPUs are serviced. This feature is particularly valuable in large-scale clusters where preserving workload availability is critical.

Comprehensive GPU Reset Support

GPU reset functionality has been expanded across multiple components:

GPU Health Monitor: Native GPU reset support for DCGM-detected issues
Fault Remediation: Integrated GPU reset as a remediation action
Syslog Health Monitor: GPU reset support for syslog-detected faults

This provides a lightweight, fast recovery mechanism that can resolve many GPU issues without requiring full node reboots, dramatically reducing recovery times and improving cluster availability.

Preflight Check Framework

Added preflight check scaffold and comprehensive design documentation for pre-job validation. This new framework enables operators to validate cluster state and prerequisites before a job starts to execute, thereby reducing the likelihood of failed jobs.

Event Handling Strategy Enhancements

Expanded event handling strategy support to additional components:

CSP Health Monitor: Event handling strategy configuration for cloud provider events
Kubernetes Object Monitor: Event handling strategy for Kubernetes resource events

This provides consistent, fine-grained control over event processing across all health monitoring components.

🔧 Configuration & Control Improvements

Custom Certificate Secrets

Helm charts now support custom certificate secrets, providing flexibility for organizations with existing certificate management infrastructure and specific security requirements. This enables seamless integration with enterprise PKI systems and certificate management workflows.

MongoDB Client Tracking

Added support for passing application names to MongoDB connections, enabling better client tracking and operational visibility. This helps operators understand which NVSentinel components are generating database load and simplifies troubleshooting of database performance issues.

ArgoCD Integration

Added checksum and sync-wave annotations for ArgoCD ConfigMap restarts, ensuring proper sequencing and change detection in GitOps workflows. This improves reliability when deploying NVSentinel via ArgoCD and prevents configuration drift issues.

🐛 Bug Fixes & Reliability Improvements

GPU Health Monitor Event Cache

Fixed critical race condition where the GPU health monitor event cache was updated before health events were successfully sent to platform-connector. This ensures events are not lost during transient connectivity issues and improves overall event delivery reliability.

Labeler Improvements

Stale Label Removal: Labeler now properly removes stale labels that no longer apply to nodes
Flaky Test Fixes: Resolved flaky labeler tests that were causing intermittent CI failures

Circuit Breaker Fixes

Cursor Mode: Fixed cursor mode handling in circuit breaker reset mechanism
Runbook Updates: Enhanced circuit breaker runbook with better operational guidance

Event Filtering

Fixed filtering logic in health-events-analyzer queries, ensuring events are properly matched against configured rules and improving detection accuracy.

MongoDB Authentication

Corrected authentication mechanism in MongoDB metrics URL, resolving connection issues in secured MongoDB deployments.

Fault Remediation Business Logic

Improved fault remediation to properly use controller-runtime business logic, enhancing reliability and consistency with Kubernetes controller patterns.

Nebius Cloud Reboot Handling

Fixed SendRebootSignal in Nebius provider to wait for instance stop completion before proceeding, preventing race conditions and ensuring reliable node reboots in Nebius Cloud environments.

Health Events Analyzer Test Fixes

Resolved test failures in health-events-analyzer that were causing CI pipeline issues.

🏗️ Architecture & Performance

Driver Version Dependent Parsing

Added driver version dependent parsing of NVL5 decoding rules, ensuring correct interpretation of NVLink errors across different driver versions. This improves accuracy of NVLink fault detection and reduces false positives.

🧪 Testing & Quality Improvements

UAT Test Reliability

Improved UAT tests reliability with better error handling and retry logic
Enhanced test configurability for different cluster environments
Better cleanup and resource management in test environments

Documentation Improvements

Added comprehensive preflight check design documentation
New alert runbook for improved operational guidance
Fixed typos in runbook documentation

Dependency Updates

Bumped google.golang.org/api from 0.259.0 to 0.260.0 in csp-health-monitor
Multiple security updates and dependency version bumps via dependabot

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

🔗 Resources

GitHub Repository: https://github.com/NVIDIA/NVSentinel
Container Registry: ghcr.io/nvidia/nvsentinel
Documentation: See /docs directory in repository
Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
Discussions: https://github.com/NVIDIA/NVSentinel/discussions

⚠️ Known Limitations

This is an experimental/preview release - use caution in production environments
Some features are disabled by default and must be explicitly enabled
Manual intervention may still be required for certain complex failure scenarios
Partial drain support requires proper GPU workload identification
Distributed locking requires coordination between components
Preflight checks are in early stages and should be thoroughly tested

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.7.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.8.0

Release v0.8.0

🎯 Major New Features

Distributed Node Locking

Partial Drain Support

Comprehensive GPU Reset Support

Preflight Check Framework

Event Handling Strategy Enhancements

🔧 Configuration & Control Improvements

Custom Certificate Secrets

MongoDB Client Tracking

ArgoCD Integration

🐛 Bug Fixes & Reliability Improvements

GPU Health Monitor Event Cache

Labeler Improvements

Circuit Breaker Fixes

Event Filtering

MongoDB Authentication

Fault Remediation Business Logic

Nebius Cloud Reboot Handling

Health Events Analyzer Test Fixes

🏗️ Architecture & Performance

Driver Version Dependent Parsing

🧪 Testing & Quality Improvements

UAT Test Reliability

Documentation Improvements

Dependency Updates

🙏 Acknowledgments

🔗 Resources

⚠️ Known Limitations

🚀 Getting Started

Contributors

Uh oh!