Release v0.7.0

Released by @github-actions on 20 Jan 13:11 · 96 commits to main since this release · tag v0.7.0 · commit 3f98a45

This release introduces advanced event processing strategies, gRPC-based remediation services, enhanced templating capabilities, and improved support for cloud service providers. We've also made significant reliability enhancements across the platform.

🎯 Major New Features

Event Processing Strategies

NVSentinel now supports configurable event processing strategies across health monitors and analyzers. The new processingStrategy field in health events lets operators control how each event is handled, so processing behavior can be matched to specific operational requirements. This feature has been implemented in:

  • GPU Health Monitor
  • Syslog Health Monitor
  • Health Events Analyzer
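
As a rough illustration of what a per-event strategy field makes possible, the sketch below routes an event according to its processingStrategy value. Only the field name comes from these notes: the strategy names (`EXECUTE_REMEDIATION`, `STORE_ONLY`), the `HealthEvent` class, and its other fields are hypothetical placeholders, not NVSentinel's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ProcessingStrategy(Enum):
    # Hypothetical strategy names, chosen only for illustration.
    EXECUTE_REMEDIATION = "EXECUTE_REMEDIATION"
    STORE_ONLY = "STORE_ONLY"


@dataclass
class HealthEvent:
    # Hypothetical subset of a health event; only processingStrategy
    # is named in the release notes.
    node_name: str
    check_name: str
    processingStrategy: ProcessingStrategy


def handle_event(event: HealthEvent) -> List[str]:
    """Return the actions taken for an event, based on its strategy."""
    actions = ["store"]  # every event is persisted in this sketch
    if event.processingStrategy is ProcessingStrategy.EXECUTE_REMEDIATION:
        actions.append("remediate")
    return actions


observed = HealthEvent("node-a", "GpuHealth", ProcessingStrategy.STORE_ONLY)
actionable = HealthEvent("node-b", "GpuHealth", ProcessingStrategy.EXECUTE_REMEDIATION)
print(handle_event(observed))    # ['store']
print(handle_event(actionable))  # ['store', 'remediate']
```

Carrying the strategy on the event itself, rather than in each consumer's configuration, means every downstream component can make the same decision from the same data.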

gRPC Remediation Service

Added a new gRPC-based remediation service that enables programmatic fault remediation operations. This provides a powerful API for external systems to integrate with NVSentinel's remediation capabilities, supporting advanced automation workflows and custom orchestration scenarios.

Enhanced Templating Support

Multi-template support in fault remediation allows using multiple notification templates for different channels and audiences. Additionally, all fields in health events can now be used for templating, providing complete flexibility in crafting notifications and alerts.
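
To make the idea concrete, here is a minimal sketch of rendering several templates from the fields of one health event. It uses Python's `string.Template` as a stand-in and does not reflect NVSentinel's actual template engine or syntax; the event field names (`nodeName`, `checkName`, `message`, `recommendedAction`) and the template names are assumptions made for the example.

```python
from string import Template

# Hypothetical health event; field names are illustrative only.
event = {
    "nodeName": "gpu-node-17",
    "checkName": "GpuNvlinkWatch",
    "message": "NVLink error detected",
    "recommendedAction": "RESTART_VM",
}

# One template per audience or channel (names are assumptions).
templates = {
    "oncall": Template("[$checkName] $nodeName: $message (action: $recommendedAction)"),
    "summary": Template("$nodeName needs $recommendedAction"),
}


def render_all(event: dict, templates: dict) -> dict:
    """Render every template against the same set of event fields."""
    # safe_substitute leaves unknown placeholders intact instead of raising.
    return {name: tpl.safe_substitute(event) for name, tpl in templates.items()}


rendered = render_all(event, templates)
print(rendered["oncall"])
print(rendered["summary"])
```

Because every event field is passed to every template, adding a new channel only requires adding a template, not changing the rendering code.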

Nebius Cloud Support

Added comprehensive support for Nebius Cloud (MK8s) CSP, including environment variable and secret support for node reboot operations. This expands NVSentinel's multi-cloud capabilities to include another major cloud provider.

🔧 Configuration & Control Improvements

Pod GPU Device Allocation Tracking

The metadata collector now tracks pod GPU device allocation, providing visibility into which pods are using which GPUs. This enables more informed remediation decisions and better troubleshooting of GPU-related issues.

Syslog Runtime Journal Support

Added runtime journal support to the syslog health monitor, enabling direct integration with systemd journal for more efficient log collection and processing.

PodMonitor Configuration

Helm charts now support making PodMonitor optional and configurable, providing flexibility for environments with different monitoring setups and requirements.

🐛 Bug Fixes & Reliability Improvements

MongoDB Retry Logic

Implemented a retry mechanism for MongoDB write failures, improving resilience against transient database connectivity issues and ensuring health events are not lost during temporary network problems.
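
The general pattern behind such a retry mechanism can be sketched as follows. This is a generic retry-with-exponential-backoff loop, not NVSentinel's implementation: `TransientWriteError`, `write_with_retry`, and the attempt and delay values are all hypothetical, and a simulated writer stands in for a real database client.

```python
import time


class TransientWriteError(Exception):
    """Stand-in for a transient database write failure."""


def write_with_retry(write, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call write() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write()
        except TransientWriteError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated writer that fails twice before succeeding.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientWriteError("connection reset")
    return "inserted"


result = write_with_retry(flaky_write)
print(result, calls["count"])  # inserted 3
```

Only errors classified as transient are retried here; anything else propagates immediately, which avoids masking permanent failures such as validation errors.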

Janitor Reconciliation Simplification

Simplified janitor reconciliation loops for better reliability and maintainability. The refactored logic reduces complexity and improves predictability of cleanup operations.

XID 154 Case Handling

Fixed a critical case statement issue in XID 154 handling that could cause incorrect error processing. This ensures GPU recovery action changes are properly detected and remediated.

GpuNvlinkWatch Message Parsing

Fixed a message parsing logic collision in GpuNvlinkWatch that was causing stale node conditions. This resolves false positives and improves the accuracy of NVLink health monitoring.

Missing RESTART_VM Action

Added the missing RESTART_VM remediation action to fault remediation configurations, ensuring all supported remediation actions are properly exposed in the configuration.

Default CSP Provider Host

Fixed a missing default cspProviderHost value that was causing configuration issues in certain deployment scenarios.

Apple Silicon Demo Support

Added ARM64 (Apple Silicon) support for local fault injection demos, improving the developer experience on macOS machines with Apple Silicon processors.

GpuNvlinkWatch Stale Conditions

Resolved an issue where GpuNvlinkWatch could report stale node conditions due to a message parsing logic collision, improving monitoring accuracy.

🏗️ Architecture & Performance

Certificate Hot-Reloading

Added an option to enable automatic certificate hot-reloading, allowing certificate updates without service restarts. The certificate watcher is now non-blocking, improving overall system responsiveness.
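
One common way to build a non-blocking certificate watcher is to poll the file's modification time from a background thread. The sketch below shows that approach under stated assumptions: `CertWatcher`, the `tls.crt` file name, and the polling interval are hypothetical, and NVSentinel's own watcher may use a different mechanism (for example, filesystem notifications).

```python
import os
import tempfile
import threading


class CertWatcher:
    """Reloads a certificate file when its modification time changes."""

    def __init__(self, path, interval=1.0):
        self.path = path
        self.interval = interval
        self.pem = None
        self.reload_count = 0
        self._mtime = None
        self._stop = threading.Event()
        self.check_once()  # initial load

    def check_once(self):
        """Reload if the file changed; return True when a reload happened."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return False
        with open(self.path, "rb") as f:
            self.pem = f.read()
        self._mtime = mtime
        self.reload_count += 1
        return True

    def start(self):
        """Poll in a daemon thread so the caller is never blocked."""
        def loop():
            # wait() returns False on timeout and True once stop() is called.
            while not self._stop.wait(self.interval):
                self.check_once()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()


with tempfile.TemporaryDirectory() as tmp:
    cert_path = os.path.join(tmp, "tls.crt")
    with open(cert_path, "wb") as f:
        f.write(b"old-cert")
    watcher = CertWatcher(cert_path)
    with open(cert_path, "wb") as f:
        f.write(b"new-cert")
    # Bump the mtime explicitly so the change is visible on coarse clocks.
    st = os.stat(cert_path)
    os.utime(cert_path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
    changed = watcher.check_once()
    print(changed, watcher.pem)
```

In a service, `start()` would be called once at startup and the TLS layer would read `watcher.pem` when establishing new connections; the demo calls `check_once()` directly to keep the output deterministic.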

MongoDB Query Metrics

Added comprehensive MongoDB query metrics for better observability of database operations, enabling performance analysis and optimization of data access patterns.
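
As a sketch of what query-level metrics typically capture, the example below records a duration and an error count per operation using a context manager. The names (`track_query`, `durations`, `errors`) and the in-memory storage are assumptions for illustration; they are not NVSentinel's metric names or its metrics backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Operation name -> observed durations (seconds) and error counts.
durations = defaultdict(list)
errors = defaultdict(int)


@contextmanager
def track_query(operation):
    """Time the wrapped block and count any exception it raises."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        errors[operation] += 1
        raise
    finally:
        # Recorded for both successful and failed queries.
        durations[operation].append(time.perf_counter() - start)


with track_query("find"):
    sum(range(1000))  # placeholder for a real database call

try:
    with track_query("insert"):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(len(durations["find"]), errors["insert"])  # 1 1
```

Recording the duration in the `finally` clause means slow failures show up in latency data as well as in the error count.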

Enhanced Logging

Improved logging configuration across multiple components for better consistency and observability.

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • gRPC remediation service requires proper network configuration and authentication
  • Event processing strategies should be carefully tested before production deployment

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.7.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.6.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.7.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.