Skip to content

Release v0.8.0

Choose a tag to compare

@github-actions github-actions released this 02 Feb 11:47
· 68 commits to main since this release
v0.8.0
48f045b

Release v0.8.0

This release introduces distributed node locking for safer concurrent operations, partial drain support for GPU level remediation, GPU reset capabilities across multiple components, and enhanced event handling strategies. We've also started implementation for preflight checks, improved cloud provider support, and made significant reliability improvements across the platform.

🎯 Major New Features

Distributed Node Locking

NVSentinel now includes distributed node locking to prevent concurrent maintenance operations on the same node. This critical safety feature ensures that multiple remediation workflows don't interfere with each other, preventing race conditions and ensuring predictable behavior when multiple components need to perform maintenance operations simultaneously.

Partial Drain Support

The node-drainer now supports partial drain operations, enabling GPU-level remediation without draining the entire node. This significantly reduces the blast radius of remediation actions, allowing healthy workloads to continue running while only affected GPUs are serviced. This feature is particularly valuable in large-scale clusters where preserving workload availability is critical.

Comprehensive GPU Reset Support

GPU reset functionality has been expanded across multiple components:

  • GPU Health Monitor: Native GPU reset support for DCGM-detected issues
  • Fault Remediation: Integrated GPU reset as a remediation action
  • Syslog Health Monitor: GPU reset support for syslog-detected faults

This provides a lightweight, fast recovery mechanism that can resolve many GPU issues without requiring full node reboots, dramatically reducing recovery times and improving cluster availability.

Preflight Check Framework

Added preflight check scaffold and comprehensive design documentation for pre-job validation. This new framework enables operators to validate cluster state and prerequisites before a job starts to execute, thereby reducing the likelihood of failed jobs.

Event Handling Strategy Enhancements

Expanded event handling strategy support to additional components:

  • CSP Health Monitor: Event handling strategy configuration for cloud provider events
  • Kubernetes Object Monitor: Event handling strategy for Kubernetes resource events

This provides consistent, fine-grained control over event processing across all health monitoring components.

🔧 Configuration & Control Improvements

Custom Certificate Secrets

Helm charts now support custom certificate secrets, providing flexibility for organizations with existing certificate management infrastructure and specific security requirements. This enables seamless integration with enterprise PKI systems and certificate management workflows.

MongoDB Client Tracking

Added support for passing application names to MongoDB connections, enabling better client tracking and operational visibility. This helps operators understand which NVSentinel components are generating database load and simplifies troubleshooting of database performance issues.

ArgoCD Integration

Added checksum and sync-wave annotations for ArgoCD ConfigMap restarts, ensuring proper sequencing and change detection in GitOps workflows. This improves reliability when deploying NVSentinel via ArgoCD and prevents configuration drift issues.

🐛 Bug Fixes & Reliability Improvements

GPU Health Monitor Event Cache

Fixed critical race condition where the GPU health monitor event cache was updated before health events were successfully sent to platform-connector. This ensures events are not lost during transient connectivity issues and improves overall event delivery reliability.

Labeler Improvements

  • Stale Label Removal: Labeler now properly removes stale labels that no longer apply to nodes
  • Flaky Test Fixes: Resolved flaky labeler tests that were causing intermittent CI failures

Circuit Breaker Fixes

  • Cursor Mode: Fixed cursor mode handling in circuit breaker reset mechanism
  • Runbook Updates: Enhanced circuit breaker runbook with better operational guidance

Event Filtering

Fixed filtering logic in health-events-analyzer queries, ensuring events are properly matched against configured rules and improving detection accuracy.

MongoDB Authentication

Corrected authentication mechanism in MongoDB metrics URL, resolving connection issues in secured MongoDB deployments.

Fault Remediation Business Logic

Improved fault remediation to properly use controller-runtime business logic, enhancing reliability and consistency with Kubernetes controller patterns.

Nebius Cloud Reboot Handling

Fixed SendRebootSignal in Nebius provider to wait for instance stop completion before proceeding, preventing race conditions and ensuring reliable node reboots in Nebius Cloud environments.

Health Events Analyzer Test Fixes

Resolved test failures in health-events-analyzer that were causing CI pipeline issues.

🏗️ Architecture & Performance

Driver Version Dependent Parsing

Added driver version dependent parsing of NVL5 decoding rules, ensuring correct interpretation of NVLink errors across different driver versions. This improves accuracy of NVLink fault detection and reduces false positives.

🧪 Testing & Quality Improvements

UAT Test Reliability

  • Improved UAT tests reliability with better error handling and retry logic
  • Enhanced test configurability for different cluster environments
  • Better cleanup and resource management in test environments

Documentation Improvements

  • Added comprehensive preflight check design documentation
  • New alert runbook for improved operational guidance
  • Fixed typos in runbook documentation

Dependency Updates

  • Bumped google.golang.org/api from 0.259.0 to 0.260.0 in csp-health-monitor
  • Multiple security updates and dependency version bumps via dependabot

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

🔗 Resources

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Partial drain support requires proper GPU workload identification
  • Distributed locking requires coordination between components
  • Preflight checks are in early stages and should be thoroughly tested

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.7.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.