Releases: NVIDIA/NVSentinel

Release v0.6.0

22 Dec 11:33
v0.6.0
82e7180

This release brings NVLink fault detection and remediation, enhanced security with certificate rotation support, and fine-grained control over health event processing. We've also started the migration of fault remediation to controller-runtime for better scalability and made significant reliability improvements across the platform.

🎯 Major New Features

NVLink XID 74 Workflow

NVSentinel now includes automated detection and remediation for XID 74 errors in the health events analyzer. XID 74 indicates NVLink hardware faults that can disrupt GPU-to-GPU communication. The workflow detects these errors and executes appropriate remediation actions to restore cluster health.

Certificate Rotation Support

The store-client module now supports automatic certificate rotation. This enables zero-downtime certificate updates in production environments, addressing security compliance requirements and operational best practices for long-running deployments.

Selective Health Event Analyzer Rules

Operators can now selectively enable or disable specific health events analyzer rules based on operational needs. This provides granular control over which error patterns trigger detection and remediation, allowing customization for different cluster configurations and workload requirements.

Health Event Property Overrides

Added the ability to override specific fields in health events, so operators can tailor NVSentinel's behavior to their own operational requirements and policies.
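
As a rough illustration of what a field override means, here is a minimal sketch. The `HealthEvent` type, its field names, the `EXAMPLE_ACTION` value, and the `apply_overrides` helper are all hypothetical and do not reflect NVSentinel's actual schema or configuration format. The only point is that an override replaces selected fields and leaves the rest of the event untouched.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class HealthEvent:
    # Hypothetical subset of fields, for illustration only.
    node: str
    check_name: str
    is_fatal: bool
    recommended_action: str


def apply_overrides(event: HealthEvent, overrides: dict) -> HealthEvent:
    """Return a copy of event with the given fields replaced."""
    known = {f.name for f in fields(event)}
    unknown = set(overrides) - known
    if unknown:
        # Reject misspelled field names instead of silently ignoring them.
        raise KeyError(f"unknown health event fields: {sorted(unknown)}")
    return replace(event, **overrides)


event = HealthEvent("node-a", "example-check", True, "EXAMPLE_ACTION")
softened = apply_overrides(event, {"is_fatal": False, "recommended_action": "NONE"})
```

The original event is left unchanged, which keeps the pre-override values available for logging or auditing.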

Controller-Runtime for Fault Remediation

Migrated the fault remediation module from custom reconciliation code to the controller-runtime framework. This brings improved scalability, better resource efficiency, standardized controller patterns, and easier maintainability.

🔧 Configuration & Control Improvements

Circuit Breaker Fresh Start

Added an option to reset fault quarantine state via the circuit breaker ConfigMap. This enables controlled recovery scenarios where operators need to clear historical state and restart with a clean slate.

🐛 Bug Fixes & Reliability Improvements

Node Drainer Priority Handling

Fixed the node drainer so that delete-after-timeout takes priority over the allow-completion setting. Nodes are now drained within the configured timeout window even when pods don't terminate gracefully, which prevents stuck drain operations.
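
The priority rule can be pictured with a small sketch. The two setting names mirror the release note. The `DrainSettings` type, the `decide` function, and the behavior when allow-completion is off are assumptions made for illustration, not the node drainer's actual code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT = "wait"                  # keep waiting for the pod to finish
    FORCE_DELETE = "force-delete"  # delete the pod now


@dataclass(frozen=True)
class DrainSettings:
    # Hypothetical settings object; field names are illustrative.
    allow_completion: bool
    delete_after_timeout_s: Optional[float]  # None means no hard deadline


def decide(settings: DrainSettings, elapsed_s: float, pod_running: bool) -> Optional[Action]:
    """Return the next drain action for one pod, or None if nothing is left to do."""
    if not pod_running:
        return None
    # The timeout is checked first, so it always wins over allow-completion.
    timeout = settings.delete_after_timeout_s
    if timeout is not None and elapsed_s >= timeout:
        return Action.FORCE_DELETE
    if settings.allow_completion:
        return Action.WAIT
    # Assumption: without allow-completion, pods are deleted immediately.
    return Action.FORCE_DELETE
```

Checking the timeout before the completion setting is what keeps a pod that never exits from holding the drain open indefinitely.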

Event Exporter Logging

Fixed an issue that caused the event exporter to not use the standard logging configuration.

PostgreSQL Test Stability

Fixed flaky PostgreSQL tests that were causing intermittent CI failures.

Node Condition Message Formatting

Improved the formatting of truncated node condition messages to ensure readability when messages are trimmed to fit within Kubernetes API limits.

UAT Environment Management

Improved AWS UAT environment deletion handling to prevent resource leaks and reduce costs from orphaned test infrastructure.

🏗️ Architecture & Performance

Enhanced GPU Health Monitor Logging

Unified the GPU health monitor logging format with other NVSentinel components. This provides consistent log structure across the platform, simplifying log aggregation and analysis.

Code Quality Improvements

Optimized import ordering and code organization across the codebase for better readability and maintainability.

🧪 Testing & Quality Improvements

Scale Testing Validation

Added concurrent drain operations scale tests with validation on 1500-node clusters. These tests ensure NVSentinel maintains performance and reliability characteristics at large scale.

Test Reliability Improvements

  • Fixed flaky syslog XID monitoring UAT tests
  • Resolved CSP health monitor test timeout issues

CI/CD Enhancements

  • Set explicit Go version during CI dependency installation for reproducible builds
  • Improved the tilt installation process by using a temporary directory
  • Added GitHub Action to automatically clean up old untagged container images
  • Enhanced fork repository handling to prevent unnecessary workflow triggers

📚 Documentation Improvements

  • Enhanced Kubernetes object monitor architecture diagrams
  • Updated Slinky drain demo documentation

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Audit logging is disabled by default - enable explicitly when needed
  • Certificate rotation requires proper certificate management infrastructure

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.5.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.5.0

08 Dec 13:32
v0.5.0
6cce9d6

This release focuses on extensibility, production hardening, and operational flexibility. We've added support for custom drain handlers, PostgreSQL as an alternative database backend, comprehensive audit logging, and expanded our XID detection and remediation capabilities.

🎯 Major New Features

Custom Drain Extensibility

NVSentinel now supports custom drain handlers, allowing integration with specialized workload orchestrators. This feature enables organizations running HPC schedulers like Slinky, big data frameworks like Volcano, or ML platforms like Ray to integrate their custom drain logic seamlessly. The release includes a complete demo environment showcasing custom drain integration.

PostgreSQL Database Backend

Added PostgreSQL as an alternative database backend to MongoDB, providing more flexibility in database selection. This addresses licensing concerns and operational preferences, and allows better alignment with existing infrastructure standards.

Note: PostgreSQL support is experimental and is not recommended for production clusters.

Audit Logging

Comprehensive audit logging for all NVSentinel write operations enables compliance reporting, security analysis, and operational troubleshooting. Every mutation is tracked with context about what changed, when, and by which component. The structured audit logs support configurable retention and rotation, with formats ready for integration with SIEM systems.

🔧 Enhanced Fault Detection & Remediation

XID 13 & XID 31 Workflow Implementation

Added automated workflows for handling the critical GPU error conditions XID 13 and XID 31. These workflows help catch GPU degradation early.

XID 154 Support

Added support for detecting and handling XID 154 (GPU Recovery Action Changed) events.

Pre-Installed Driver Support

Enhanced support for environments where the GPU driver is installed outside of the GPU Operator.

🏗️ Infrastructure & Architecture Improvements

ko-based Kubernetes Object Monitor

Migrated the Kubernetes object monitor to ko-based builds, resulting in faster build times for development iterations, smaller container images with reduced attack surface, and improved supply chain security with minimal base images.

Enhanced Build System

The version field is now passed from build args to the Dockerfile so that components report their version accurately, improving reproducibility and traceability in logs.

🐛 Bug Fixes & Reliability Improvements

Node Condition Message Limiting

Node condition messages are now automatically truncated to 1024 bytes to prevent Kubernetes API server issues with excessively large messages. This prevents edge cases where verbose error descriptions could cause API errors.
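
A byte limit is not the same as a character limit, so truncation has to avoid cutting a multi-byte UTF-8 character in half. The sketch below shows one way to do that. The 1024-byte limit comes from this release note. The function name and the `...` marker are assumptions for illustration and do not describe NVSentinel's actual implementation.

```python
MAX_CONDITION_MESSAGE_BYTES = 1024  # limit stated in the release note


def truncate_utf8(message: str,
                  limit: int = MAX_CONDITION_MESSAGE_BYTES,
                  marker: str = "...") -> str:
    """Trim message so that its UTF-8 encoding fits within limit bytes."""
    data = message.encode("utf-8")
    if len(data) <= limit:
        return message
    budget = limit - len(marker.encode("utf-8"))
    if budget < 0:
        raise ValueError("limit is smaller than the truncation marker")
    # errors="ignore" drops a trailing partial multi-byte sequence, if any,
    # so the result is always valid text.
    head = data[:budget].decode("utf-8", errors="ignore")
    return head + marker


short = truncate_utf8("GPU fell off the bus")
trimmed = truncate_utf8("x" * 5000)
```

Messages already within the limit are returned unchanged, so the marker only appears when something was actually cut.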

Quarantine Override Handling

Quarantine overrides are now properly applied to nodes that are already in quarantined state, ensuring manual overrides work consistently regardless of node state.

Data Model Type Safety

The recommended action type changed from integer to string for better API clarity, type safety, and human readability in configurations.

Data Model Consistency

Replaced IGNORE with NONE throughout the data model for consistency with the canonical data schema.

Log Collector Concurrency

Improved handling of must-gather toggle and concurrent log collector job scenarios to prevent resource conflicts and ensure reliable log collection.

🧪 Testing & Quality Improvements

Enhanced Tilt Testing

Added comprehensive tilt tests for the CSP health monitor. The tests behave deterministically and do not rely on sleep-based timing, which makes them faster and more reliable for developers.

Scale Testing Framework

New performance and scale tests to validate NVSentinel behavior under load:

  • FQM Latency & Queue Depth: Tests for fault quarantine module performance characteristics
  • API Server & MongoDB Performance: Validation of data layer performance at scale

Log Collector Tilt Tests

Added automated tilt tests for the log collector module, improving test coverage for critical troubleshooting workflows.

📚 Documentation Improvements

Operational Documentation

  • Datastore architecture and migration documentation
  • Comprehensive configuration reference
  • Feature documentation and user guides
  • Runbooks for common operational scenarios
  • Upgrade procedures and best practices
  • IAM setup guide for CSP health monitor
  • Documentation for pre-installed GPU driver support

🔄 Dependencies & Maintenance

Security Updates

  • Upgraded Go modules to address CVEs in dependencies
  • Bumped various dependencies to latest stable versions

CI/CD Improvements

  • Added dependabot configuration for GPU API
  • Enhanced GitHub Actions workflows
  • Improved contributor automation with copy-pr-bot updates

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

📦 What's Included

Container Images (15 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup
  • event-exporter
  • kubernetes-object-monitor

All images include the latest bug fixes, security updates, and feature enhancements from this release.

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Custom drain handlers require implementing the drain handler interface
  • PostgreSQL backend is in preview and should be thoroughly tested before production use

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.5.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.4.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.5.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.4.1

25 Nov 11:11
v0.4.1
a96ba5e

This is a hotfix release addressing bugs discovered in v0.4.0. We recommend all users running v0.4.0 upgrade to v0.4.1.

🐛 Bug Fixes

Fault Quarantine Uncordoning Issue

Fixed: The fault quarantine module's node annotations map could become stale, preventing nodes from being uncordoned. With this critical fix, manual uncordon operations and automated recovery workflows function correctly.

Event Exporter Package Publishing

Fixed: The event exporter package publishing configuration was corrected, so the event exporter component is properly included in releases and can be deployed as expected.

CRI-O Runtime Support

Fixed: Added the ability to unset runtimeclass as a workaround for CRI-O environments where the default runtime class configuration may cause deployment issues. This improves compatibility with different container runtime configurations.

🔄 Upgrade Instructions

To upgrade from v0.4.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.1 \
  --namespace nvsentinel \
  --reuse-values

To install v0.4.1:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.1 \
  --namespace nvsentinel \
  --create-namespace

🙏 Acknowledgments

This hotfix release includes contributions from:

Thank you for the quick turnaround on these critical fixes!

📦 What's Included

All 15 container images from v0.4.0 with the above bug fixes applied.

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios

Release v0.4.0

24 Nov 12:02
v0.4.0
7209959

This release brings major enhancements to NVSentinel's observability, testing infrastructure, and operational flexibility. We've added powerful new monitoring capabilities, improved database options, and made significant investments in automated testing to ensure reliability at scale.

🎯 Major New Features

Health Event Exporter

NVSentinel now includes a dedicated event exporter that enables seamless integration with external monitoring and analytics systems. Export health events to your preferred data platform for long-term analysis, compliance reporting, or integration with existing observability stacks.

Kubernetes Object Health Monitor

A new monitor tracks Kubernetes objects to provide insight into the health of nodes and accelerators. This is particularly useful for monitoring node conditions set by entities that aren't yet integrated with NVSentinel, allowing you to leverage existing health signals from other monitoring tools and operators running in your cluster.

Repeated XID Pattern Detection

The health events analyzer can now identify unique XIDs within burst windows and correlate them across multiple bursts to detect repeated XID patterns. This advanced pattern matching helps identify nodes with recurring but intermittent issues, enabling proactive intervention before these patterns lead to major failures.
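
The idea of counting an XID once per burst and then looking for it across bursts can be sketched as follows. This release note does not define a burst, so the gap-based grouping, the function names, and the parameters `max_gap_s` and `min_bursts` are assumptions for illustration rather than the analyzer's real rules.

```python
from collections import Counter


def split_into_bursts(events, max_gap_s):
    """Group (timestamp, xid) pairs into bursts; keep the unique XIDs of each burst."""
    bursts = []
    last_ts = None
    for ts, xid in sorted(events):
        # Assumption: a gap longer than max_gap_s starts a new burst.
        if last_ts is None or ts - last_ts > max_gap_s:
            bursts.append(set())
        bursts[-1].add(xid)
        last_ts = ts
    return bursts


def repeated_xids(events, max_gap_s, min_bursts):
    """Return the XIDs that appear in at least min_bursts separate bursts."""
    counts = Counter(
        xid for burst in split_into_bursts(events, max_gap_s) for xid in burst
    )
    return {xid for xid, n in counts.items() if n >= min_bursts}


# Timestamps in seconds for one node; XID values are arbitrary sample data.
sample = [(0, 79), (1, 79), (2, 48), (100, 79), (101, 13), (200, 79), (201, 48)]
```

With `max_gap_s=10`, the sample splits into three bursts. XID 79 is seen twice in the first burst but counts only once there, so a noisy burst cannot trigger the pattern by itself.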

Enhanced Database Flexibility

You can now choose between Bitnami MongoDB and Percona MongoDB based on your organizational preferences and requirements. This flexibility allows better alignment with existing infrastructure standards and support agreements.

Local Development & Testing with KIND

We've added a complete local error injection demo that runs on KIND (Kubernetes IN Docker) clusters. This makes it easy to test NVSentinel's behavior, experiment with configurations, and validate custom integrations without requiring access to GPU hardware or cloud resources.

Unified MongoDB SDK

All MongoDB operations have been consolidated into a unified store-client SDK, providing consistent data access patterns across all modules. This refactoring improves code maintainability, reduces duplication, and makes it easier to extend NVSentinel's data layer.

🔧 Configuration & Usability Improvements

Component-Specific Tolerations

Platform connectors now support component-specific tolerations, giving you fine-grained control over which nodes the connector instances can run on. This is particularly useful in heterogeneous clusters with different taint configurations.

🐛 Bug Fixes & Reliability Improvements

  • Fixed: Added a nil pointer check to prevent a panic during graceful shutdown
  • Fixed: TypeError in GPU Health Monitor signal handler that could cause unexpected terminations
  • Fixed: Duplicate node-drainer events eliminated by ensuring consistent pod list ordering
  • Fixed: Partial recovery healthy events are no longer incorrectly propagated to node drainer and fault remediation modules
  • Fixed: CSP monitor reliability improvements for better cloud provider integration
  • Fixed: ECR registry used for base images to avoid Docker Hub rate limiting
  • Fixed: SAFE_REF used in Helm publish workflow to handle special characters in branch names
  • Added: Pre-upgrade Helm hook automatically cleans up deprecated node conditions during upgrades

🧪 Testing & Quality Improvements

Automated User Acceptance Testing (UAT)

  • AWS UAT: Automated end-to-end tests running on actual AWS infrastructure with GPU instances
  • GCP UAT: Comprehensive UAT coverage on Google Cloud Platform

Development Environment

  • Fixed: Linux development environment setup issues resolved

Test Configuration

  • Updated test configurations to use more appropriate time windows, reducing test flakiness while maintaining coverage

🏗️ Infrastructure & Development

Dependency Management

  • Multiple dependency updates merged from Dependabot across AWS SDK, configuration libraries, and other critical dependencies
  • Helm version pinned to v3.19.2 to ensure consistent behavior across environments
  • Upgraded various Go and Python packages to latest stable versions

CI/CD Improvements

  • Removed paths-ignore in GitHub Actions to improve integration with copy-pr-bot
  • Enhanced workflow reliability and error handling
  • Better handling of branch names and special characters in automation

📚 Documentation

Updated Documentation

  • Comprehensive log collection documentation with detailed troubleshooting guides
  • Updated guides to reflect current best practices

🙏 Acknowledgments

This release includes contributions from multiple contributors across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

📦 What's Included

Container Images (15 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup
  • event-exporter (NEW)
  • kubernetes-object-monitor (NEW)

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • The Kubernetes object monitor is in preview and may require tuning for specific workloads

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.3.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.3.0

07 Nov 13:31
v0.3.0
315fd8f

This release introduces significant new capabilities for GPU infrastructure monitoring, enhanced automation features, and improved reliability. We've focused on making it easier to understand your GPU environment and giving you more control over how NVSentinel responds to issues.

🎯 Major New Features

GPU Metadata Collection

NVSentinel can now automatically collect detailed information about your GPU hardware, including GPU topology, NVSwitch info, and some hardware specifications. This information helps with troubleshooting and provides better visibility into your GPU infrastructure.

Enhanced Health Event Data

Health events now include rich contextual information about your nodes, including cloud provider details, availability zones, instance types, and CUDA driver versions. This automatic enrichment helps correlate issues across your infrastructure and speeds up root cause analysis.

Intelligent Pattern Detection

The health events analyzer can now detect when multiple issues occur on the same node within a time window. For example, if a node requires remediation multiple times in a short period, NVSentinel can automatically escalate this to support.
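
A sliding time window per node is one simple way to express this kind of rule. The sketch below is illustrative only: the `RemediationTracker` class, its parameters, and the threshold semantics are hypothetical and are not taken from the analyzer's implementation. It also assumes timestamps for a given node arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class RemediationTracker:
    """Counts remediations per node inside a sliding time window."""

    def __init__(self, window_s: float, threshold: int):
        self.window_s = window_s
        self.threshold = threshold
        self._events = defaultdict(deque)  # node name -> timestamps in the window

    def record(self, node: str, ts: float) -> bool:
        """Record one remediation; return True if the node should be escalated."""
        timestamps = self._events[node]
        timestamps.append(ts)
        # Drop timestamps that have fallen out of the window.
        while timestamps and ts - timestamps[0] > self.window_s:
            timestamps.popleft()
        return len(timestamps) >= self.threshold


tracker = RemediationTracker(window_s=3600, threshold=3)
results = [
    tracker.record("node-a", 0),
    tracker.record("node-a", 600),
    tracker.record("node-b", 700),
    tracker.record("node-a", 1200),
]
```

Each node keeps its own window, so a single remediation on `node-b` does not count toward the threshold for `node-a`.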

Manual Override Capability

You can now manually uncordon a quarantined node, which will automatically cancel the entire automated remediation pipeline for that node. This gives operators direct control when they need to intervene.

Advanced Log Collection

The log collector now automatically gathers AWS SOS reports (sosreport) in addition to existing NVIDIA bug reports and GPU Operator logs. This provides comprehensive diagnostic information for AWS-hosted GPU nodes.

🔧 Configuration & Usability Improvements

Comprehensive Configuration Documentation

The Helm chart now includes extensive inline documentation for all configuration options, making it easier to customize NVSentinel for your environment. A new values-full.yaml reference file provides detailed examples.

Unified Configuration Management

All modules now use a standardized configuration system, making it more consistent and predictable to configure different parts of NVSentinel.

Kata Container Auto-Detection

NVSentinel can now automatically detect when running in Kata containers and adjust its monitoring approach accordingly.

🐛 Bug Fixes & Reliability Improvements

Fault Quarantine Improvements

  • Fixed: Unnecessary events are no longer propagated to node drainer and fault remediation modules, reducing noise in the system
  • Fixed: Taints are no longer applied in dry-run mode, allowing you to safely test configurations
  • Fixed: Race conditions in node monitoring that could cause inconsistent state

Health Monitoring Fixes

  • Fixed: Health events are now properly sent even when DCGM connectivity temporarily fails
  • Fixed: GPU falling off the bus is now detected even without specific XID error codes
  • Fixed: Resource cleanup and connection handling after DCGM failures is more robust
  • Fixed: Raw journal messages are now fully stored in health events for better debugging

End-to-End Testing

  • Fixed: Node drainer restarts properly in end-to-end test environments
  • Fixed: Multiple test flakes and race conditions resolved
  • Fixed: Log collector configuration paths corrected

Data Flow Optimizations

  • Fixed: MongoDB change streams now properly handle error conditions
  • Fixed: Platform connectors fail fast when health events cannot be published, preventing data loss
  • Fixed: Improved error handling throughout the event processing pipeline

🔒 Security & Compliance Enhancements

SLSA Build Provenance

All container images now include SLSA (Supply chain Levels for Software Artifacts) attestations and Software Bill of Materials (SBOM). Sigstore Policy Controller integration enables verification of build provenance.

Security Scanning

  • Daily vulnerability scanning implemented for all container images
  • Security validation now excludes test directories for more focused results

📊 Monitoring & Observability

Improved Metrics

  • Comprehensive audit and documentation of all Prometheus metrics
  • Better labeling and organization of metrics across modules
  • New metrics for manual uncordon operations and pattern detection

Enhanced Logging

  • Structured logging implemented across all modules for consistency
  • Reduced log verbosity while maintaining useful information
  • Better error messages and debugging context

🏗️ Infrastructure & Development

Build System Improvements

  • Images can now be built with either Docker or ko (Kubernetes-optimized builder)
  • ARM64 architecture support across all container images
  • Optimized build times and smaller image sizes
  • Improved GitHub Actions workflows for faster CI/CD

Dependency Updates

  • Upgraded to golangci-lint v2 for better code quality checking
  • Updated multiple cloud provider SDKs (AWS, GCP, Azure)
  • Updated various Go and Python dependencies to latest stable versions
  • Updated CUDA base images

📚 Documentation

New Design Documents

  • GPU metadata retrieval design
  • Data flow through NVSentinel (from detection through remediation)
  • Overview documentation explaining what NVSentinel is and why it's important
  • Integration guides

Updated Guides

  • All documentation updated to reflect current repository structure
  • Development guide improvements
  • Contributing guidelines clarification
  • Roadmap published showing planned features

🔄 Breaking Changes & Migration Notes

Generic Maintenance Resources

The fault remediation module now uses generic maintenance resources instead of reboot-specific resources. If you're using custom remediation integrations, you may need to update your configurations.

Configuration Schema Changes

Some configuration parameters have been renamed or restructured for consistency. Review the updated values-full.yaml for the latest schema.

📈 Quality Improvements

Testing Infrastructure

  • Added comprehensive end-to-end tests for all modules
  • UAT (User Acceptance Testing) framework for AWS environments
  • Improved test coverage reporting
  • Better test isolation and reliability

Code Quality

  • Streamlined Makefiles to reduce duplication and cognitive load
  • Improved linting rules and enforcement
  • Better code organization and module boundaries
  • Reduced technical debt across the codebase

🙏 Acknowledgments

This release includes contributions from 10 contributors, with over 140 commits improving virtually every aspect of NVSentinel.

Thank you to everyone who contributed code, documentation, testing, and feedback!

📦 What's Included

Container Images (14 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain failure scenarios

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.3.0 \
  --namespace nvsentinel \
  --create-namespace

For detailed installation and configuration instructions, see the README.

Release v0.2.0

17 Oct 17:22
v0.2.0
d670f87

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.2.0

Release v0.1.0

17 Oct 15:38

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.1.0