Releases: NVIDIA/NVSentinel
Release v0.6.0
This release brings NVLink fault detection and remediation, enhanced security with certificate rotation support, and fine-grained control over health event processing. We've also started the migration of fault remediation to controller-runtime for better scalability and made significant reliability improvements across the platform.
🎯 Major New Features
NVLink XID 74 Workflow
NVSentinel now includes automated detection and remediation for XID 74 errors in the health events analyzer. XID 74 indicates NVLink hardware faults that can disrupt GPU-to-GPU communication. The workflow detects these errors and executes appropriate remediation actions to restore cluster health.
Certificate Rotation Support
The store-client module now supports automatic certificate rotation. This enables zero-downtime certificate updates in production environments, addressing security compliance requirements and operational best practices for long-running deployments.
Selective Health Event Analyzer Rules
Operators can now selectively enable or disable specific health events analyzer rules based on operational needs. This provides granular control over which error patterns trigger detection and remediation, allowing customization for different cluster configurations and workload requirements.
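As a rough illustration of how per-rule toggles can work, the sketch below filters a rule set against operator-supplied switches. The `Rule` type, the `active_rules` helper, and the rule names are hypothetical; they are not taken from the NVSentinel configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical analyzer rule with a default on/off state."""
    name: str
    enabled: bool = True


def active_rules(rules, toggles):
    """Return the names of rules that stay enabled after applying toggles.

    `toggles` maps a rule name to True/False; rules without an entry keep
    their default state.
    """
    return [r.name for r in rules if toggles.get(r.name, r.enabled)]


# Illustrative rule names only (assumptions, not the real rule identifiers).
RULES = [Rule("repeated-xid"), Rule("xid-74-nvlink"), Rule("multiple-remediations")]

print(active_rules(RULES, {"xid-74-nvlink": False}))
```

Keeping the defaults on the rule and the switches in a separate mapping means an operator only has to list the rules that deviate from the default.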
Health Event Property Overrides
Added capability to override specific fields in health events. This allows customization of NVSentinel's behavior to match specific operational requirements and policies.
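The idea of a field override can be sketched as a small merge step applied to an event before it is processed further. This is a minimal sketch under assumptions: the `apply_overrides` helper and the field names (`isFatal`, `recommendedAction`, `RESTART_VM`) are illustrative and may not match the actual health event schema.

```python
def apply_overrides(event, overrides):
    """Return a copy of `event` with the overridden fields replaced.

    Overrides for fields the event does not have are rejected, so a typo in
    the configuration is reported instead of silently ignored.
    """
    unknown = sorted(set(overrides) - set(event))
    if unknown:
        raise KeyError(f"unknown health event fields: {unknown}")
    return {**event, **overrides}


# Field names and values below are illustrative assumptions.
event = {"node": "gpu-node-1", "isFatal": True, "recommendedAction": "RESTART_VM"}
patched = apply_overrides(event, {"isFatal": False, "recommendedAction": "NONE"})
print(patched)
```

The original event is left untouched, which keeps the unmodified signal available for auditing.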
Controller-Runtime for Fault Remediation
Migrated the fault remediation module from custom reconciliation code to the controller-runtime framework. This brings improved scalability, better resource efficiency, standardized controller patterns, and easier maintainability.
🔧 Configuration & Control Improvements
Circuit Breaker Fresh Start
Added an option to reset the fault quarantine state via the circuit breaker ConfigMap. This enables controlled recovery scenarios where operators need to clear historical state and restart with a clean slate.
🐛 Bug Fixes & Reliability Improvements
Node Drainer Priority Handling
Fixed the node drainer so that delete-after-timeout takes priority over the allow-completion setting. Nodes are now drained within the configured timeout window even when pods don't terminate gracefully, which prevents stuck drain operations.
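The priority rule can be pictured as a decision function that checks the timeout before anything else. This is only a sketch of the ordering, not the node drainer's code: `DrainAction`, `next_action`, and the parameter names are assumptions, and the behavior when neither setting applies (a plain eviction) is also assumed.

```python
from enum import Enum


class DrainAction(Enum):
    WAIT = "wait"                  # let the pod finish on its own
    EVICT = "evict"                # regular eviction
    FORCE_DELETE = "force-delete"  # timeout reached, delete the pod


def next_action(elapsed_s, delete_after_timeout_s, allow_completion):
    """Pick the next drain step for one pod.

    The timeout is checked first, so it always wins over allow-completion.
    """
    if delete_after_timeout_s is not None and elapsed_s >= delete_after_timeout_s:
        return DrainAction.FORCE_DELETE
    if allow_completion:
        return DrainAction.WAIT
    return DrainAction.EVICT


# A pod that is allowed to complete is still deleted once the timeout passes.
print(next_action(elapsed_s=700, delete_after_timeout_s=600, allow_completion=True))
print(next_action(elapsed_s=100, delete_after_timeout_s=600, allow_completion=True))
```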
Event Exporter Logging
Fixed an issue where the event exporter ignored the standard logging configuration.
PostgreSQL Test Stability
Fixed flaky PostgreSQL tests that were causing intermittent CI failures.
Node Condition Message Formatting
Improved the formatting of truncated node condition messages to ensure readability when messages are trimmed to fit within Kubernetes API limits.
UAT Environment Management
Improved AWS UAT environment deletion handling to prevent resource leaks and reduce costs from orphaned test infrastructure.
🏗️ Architecture & Performance
Enhanced GPU Health Monitor Logging
Unified the GPU health monitor logging format with other NVSentinel components. This provides consistent log structure across the platform, simplifying log aggregation and analysis.
Code Quality Improvements
Optimized import ordering and code organization across the codebase for better readability and maintainability.
🧪 Testing & Quality Improvements
Scale Testing Validation
Added concurrent drain operations scale tests with validation on 1500-node clusters. These tests ensure NVSentinel maintains performance and reliability characteristics at large scale.
Test Reliability Improvements
- Fixed flaky syslog XID monitoring UAT tests
- Resolved CSP health monitor test timeout issues
CI/CD Enhancements
- Set explicit Go version during CI dependency installation for reproducible builds
- Improved the Tilt installation process by using a temporary directory
- Added GitHub Action to automatically clean up old untagged container images
- Enhanced fork repository handling to prevent unnecessary workflow triggers
📚 Documentation Improvements
- Enhanced Kubernetes object monitor architecture diagrams
- Updated Slinky drain demo documentation
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @tanishagoyal2
- @XRFXLP
- @miguelvr
- @deesharma24
- @KaivalyaMDabhadkar
- @rupalis-nv
- @ksaur
- @mchmarny
- @lalitadithya
- @dims
- @yafengio
- @ivelichkovich
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Audit logging is disabled by default - enable explicitly when needed
- Certificate rotation requires proper certificate management infrastructure
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.6.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v0.5.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.6.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.5.0
This release focuses on extensibility, production hardening, and operational flexibility. We've added support for custom drain handlers, PostgreSQL as an alternative database backend, comprehensive audit logging, and expanded our XID detection and remediation capabilities.
🎯 Major New Features
Custom Drain Extensibility
NVSentinel now supports custom drain handlers, allowing integration with specialized workload orchestrators. This feature enables organizations running HPC schedulers like Slinky, big data frameworks like Volcano, or ML platforms like Ray to integrate their custom drain logic seamlessly. The release includes a complete demo environment showcasing custom drain integration.
PostgreSQL Database Backend
Added PostgreSQL as an alternative database backend to MongoDB, providing more flexibility in database selection. This addresses licensing concerns and operational preferences, and allows better alignment with existing infrastructure standards.
Note: PostgreSQL support is experimental and is not recommended for production clusters.
Audit Logging
Comprehensive audit logging for all NVSentinel write operations enables compliance reporting, security analysis, and operational troubleshooting. Every mutation is tracked with context about what changed, when, and by which component. The structured audit logs support configurable retention and rotation, with formats ready for integration with SIEM systems.
🔧 Enhanced Fault Detection & Remediation
XID 13 & XID 31 Workflow Implementation
Added automated workflows for handling the critical GPU error conditions reported as XID 13 and XID 31. These workflows help catch GPU degradation early.
XID 154 Support
Added support for detecting and handling XID 154 (GPU Recovery Action Changed) events.
Pre-Installed Driver Support
Enhanced support for environments where the GPU driver is installed outside of the GPU Operator.
🏗️ Infrastructure & Architecture Improvements
ko-based Kubernetes Object Monitor
Migrated the Kubernetes object monitor to ko-based builds, resulting in faster build times for development iterations, smaller container images with reduced attack surface, and improved supply chain security with minimal base images.
Enhanced Build System
The version field is now passed from build args to the Dockerfile for accurate version reporting, improving reproducibility and traceability in logs.
🐛 Bug Fixes & Reliability Improvements
Node Condition Message Limiting
Node condition messages are now automatically truncated to 1024 bytes to prevent Kubernetes API server issues with excessively large messages. This prevents edge cases where verbose error descriptions could cause API errors.
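A byte-based limit needs some care with multi-byte UTF-8 characters, because cutting at an arbitrary byte can split a character. The sketch below shows one way to truncate to 1024 bytes safely. The `truncate_message` helper and the `...` suffix are assumptions for illustration; the notes do not specify how NVSentinel marks a truncated message.

```python
MAX_MESSAGE_BYTES = 1024  # limit stated in the release notes
SUFFIX = "..."            # assumed truncation marker


def truncate_message(message: str, limit: int = MAX_MESSAGE_BYTES) -> str:
    """Trim `message` so its UTF-8 encoding fits within `limit` bytes."""
    raw = message.encode("utf-8")
    if len(raw) <= limit:
        return message
    budget = limit - len(SUFFIX.encode("utf-8"))
    # errors="ignore" drops a multi-byte character that was cut in half.
    return raw[:budget].decode("utf-8", errors="ignore") + SUFFIX


long_message = "é" * 2000  # 4000 bytes when encoded as UTF-8
short = truncate_message(long_message)
print(len(short.encode("utf-8")))
```

Measuring the encoded length rather than the character count matters here, since a message with non-ASCII text can exceed the byte limit well before it reaches 1024 characters.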
Quarantine Override Handling
Quarantine overrides are now properly applied to nodes that are already in quarantined state, ensuring manual overrides work consistently regardless of node state.
Data Model Type Safety
Recommended action type changed from integer to string for better API clarity, type safety, and human readability in configurations.
Data Model Consistency
Replaced IGNORE with NONE throughout the data model for consistency with the canonical data schema.
Log Collector Concurrency
Improved handling of must-gather toggle and concurrent log collector job scenarios to prevent resource conflicts and ensure reliable log collection.
🧪 Testing & Quality Improvements
Enhanced Tilt Testing
Added comprehensive Tilt tests for the CSP health monitor. The tests behave deterministically and avoid sleep-based timing, which makes them faster and more reliable for developers.
Scale Testing Framework
New performance and scale tests to validate NVSentinel behavior under load:
- FQM Latency & Queue Depth: Tests for fault quarantine module performance characteristics
- API Server & MongoDB Performance: Validation of data layer performance at scale
Log Collector Tilt Tests
Added automated tilt tests for the log collector module, improving test coverage for critical troubleshooting workflows.
📚 Documentation Improvements
Operational Documentation
- Datastore architecture and migration documentation
- Comprehensive configuration reference
- Feature documentation and user guides
- Runbooks for common operational scenarios
- Upgrade procedures and best practices
- IAM setup guide for CSP health monitor
- Documentation for pre-installed GPU driver support
🔄 Dependencies & Maintenance
Security Updates
- Upgraded Go modules to address CVEs in dependencies
- Bumped various dependencies to latest stable versions
CI/CD Improvements
- Added dependabot configuration for GPU API
- Enhanced GitHub Actions workflows
- Improved contributor automation with copy-pr-bot updates
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @rupalis-nv
- @XRFXLP
- @tanishagoyal2
- @dims
- @lalitadithya
- @KaivalyaMDabhadkar
- @deesharma24
- @nitz2407
- @ksaur
- @pteranodan
- @natherz97
- @jtschelling
- @ArangoGutierrez
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
📦 What's Included
Container Images (15 components)
- gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
- event-exporter
- kubernetes-object-monitor
All images include the latest bug fixes, security updates, and feature enhancements from this release.
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Custom drain handlers require implementing the drain handler interface
- PostgreSQL backend is in preview and should be thoroughly tested before production use
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.5.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v0.4.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.5.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.4.1
This is a hotfix release addressing bugs discovered in v0.4.0. We recommend all users running v0.4.0 upgrade to v0.4.1.
🐛 Bug Fixes
Fault Quarantine Uncordoning Issue
Fixed: Resolved a critical issue where the fault quarantine module's node annotations map could become stale, preventing proper uncordoning of nodes. This fix ensures that manual uncordon operations and automated recovery workflows function correctly.
Event Exporter Package Publishing
Fixed: Corrected the event exporter package publishing configuration, ensuring the event exporter component is properly included in releases and can be deployed as expected.
CRIO Runtime Support
Fixed: Added ability to unset runtimeclass as a workaround for CRIO environments where the default runtime class configuration may cause deployment issues. This provides better compatibility with different container runtime configurations.
🔄 Upgrade Instructions
To upgrade from v0.4.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.4.1 \
--namespace nvsentinel \
--reuse-values
To install v0.4.1:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.4.1 \
--namespace nvsentinel \
--create-namespace
🙏 Acknowledgments
This hotfix release includes contributions from:
Thank you for the quick turnaround on these critical fixes!
📦 What's Included
All 15 container images from v0.4.0 with the above bug fixes applied.
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
Release v0.4.0
This release brings major enhancements to NVSentinel's observability, testing infrastructure, and operational flexibility. We've added powerful new monitoring capabilities, improved database options, and made significant investments in automated testing to ensure reliability at scale.
🎯 Major New Features
Health Event Exporter
NVSentinel now includes a dedicated event exporter that enables seamless integration with external monitoring and analytics systems. Export health events to your preferred data platform for long-term analysis, compliance reporting, or integration with existing observability stacks.
Kubernetes Object Health Monitor
A new monitor tracks Kubernetes objects to provide insight into the health of nodes and accelerators. This is particularly useful for monitoring node conditions set by entities that aren't yet integrated with NVSentinel, allowing you to leverage existing health signals from other monitoring tools and operators running in your cluster.
Repeated XID Pattern Detection
The health events analyzer can now identify unique XIDs within burst windows and correlate them across multiple bursts to detect repeated XID patterns. This advanced pattern matching helps identify nodes with recurring but intermittent issues, enabling proactive intervention before these patterns lead to major failures.
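To make the burst idea concrete, here is a minimal sketch of one possible interpretation: events that are close together in time form a burst, each burst contributes its set of unique XIDs, and an XID that shows up in enough separate bursts is reported as repeated. The function names, the gap-based burst definition, and the thresholds are assumptions; the analyzer's actual rule semantics are not described in these notes.

```python
from collections import Counter


def split_bursts(events, max_gap_s):
    """Group (timestamp_s, xid) pairs, sorted by time, into bursts.

    A new burst starts whenever the gap to the previous event exceeds
    `max_gap_s`. Each burst is returned as the set of unique XIDs it contains.
    """
    bursts = []
    last_ts = None
    for ts, xid in events:
        if last_ts is None or ts - last_ts > max_gap_s:
            bursts.append(set())
        bursts[-1].add(xid)
        last_ts = ts
    return bursts


def repeated_xids(events, max_gap_s, min_bursts):
    """Return the XIDs that appear in at least `min_bursts` distinct bursts."""
    counts = Counter()
    for burst in split_bursts(sorted(events), max_gap_s):
        counts.update(burst)  # each XID counts once per burst
    return sorted(xid for xid, n in counts.items() if n >= min_bursts)


# Three bursts: {79, 48}, {79, 13}, {79}
events = [(0, 79), (5, 79), (10, 48), (1000, 79), (1003, 13), (5000, 79)]
print(repeated_xids(events, max_gap_s=60, min_bursts=3))  # prints [79]
```

Counting each XID once per burst keeps a single noisy burst from looking like a recurring problem.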
Enhanced Database Flexibility
You can now choose between Bitnami MongoDB and Percona MongoDB based on your organizational preferences and requirements. This flexibility allows better alignment with existing infrastructure standards and support agreements.
Local Development & Testing with KIND
We've added a complete local error injection demo that runs on KIND (Kubernetes IN Docker) clusters. This makes it easy to test NVSentinel's behavior, experiment with configurations, and validate custom integrations without requiring access to GPU hardware or cloud resources.
Unified MongoDB SDK
All MongoDB operations have been consolidated into a unified store-client SDK, providing consistent data access patterns across all modules. This refactoring improves code maintainability, reduces duplication, and makes it easier to extend NVSentinel's data layer.
🔧 Configuration & Usability Improvements
Component-Specific Tolerations
Platform connectors now support component-specific tolerations, giving you fine-grained control over which nodes the connector instances can run on. This is particularly useful in heterogeneous clusters with different taint configurations.
🐛 Bug Fixes & Reliability Improvements
- Fixed: Nil pointer check prevents panic during graceful shutdown scenarios
- Fixed: TypeError in GPU Health Monitor signal handler that could cause unexpected terminations
- Fixed: Duplicate node-drainer events eliminated by ensuring consistent pod list ordering
- Fixed: Partial recovery healthy events are no longer incorrectly propagated to node drainer and fault remediation modules
- Fixed: CSP monitor reliability improvements for better cloud provider integration
- Fixed: ECR registry used for base images to avoid Docker Hub rate limiting
- Fixed: SAFE_REF used in Helm publish workflow to handle special characters in branch names
- Added: Pre-upgrade Helm hook automatically cleans up deprecated node conditions during upgrades
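The duplicate node-drainer events fix in the list above relies on consistent pod list ordering. One way to picture why ordering matters: if an event is derived from a pod list, the same set of pods returned in a different order can look like a different event unless the list is put into canonical order first. The sketch below is an illustration under that assumption; `pod_list_fingerprint` is a hypothetical helper, not the node drainer's implementation.

```python
import hashlib


def pod_list_fingerprint(pods):
    """Return a stable fingerprint for a collection of (namespace, name) pairs.

    Sorting first makes the result independent of the order in which the
    API returned the pods.
    """
    canonical = "\n".join(sorted(f"{ns}/{name}" for ns, name in pods))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = pod_list_fingerprint([("default", "trainer-0"), ("ml", "worker-1")])
second = pod_list_fingerprint([("ml", "worker-1"), ("default", "trainer-0")])
print(first == second)  # prints True
```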
🧪 Testing & Quality Improvements
Automated User Acceptance Testing (UAT)
- AWS UAT: Automated end-to-end tests running on actual AWS infrastructure with GPU instances
- GCP UAT: Comprehensive UAT coverage on Google Cloud Platform
Development Environment
- Fixed: Linux development environment setup issues resolved
Test Configuration
- Updated test configurations to use more appropriate time windows, reducing test flakiness while maintaining coverage
🏗️ Infrastructure & Development
Dependency Management
- Multiple dependency updates merged from Dependabot across AWS SDK, configuration libraries, and other critical dependencies
- Helm version pinned to v3.19.2 to ensure consistent behavior across environments
- Upgraded various Go and Python packages to latest stable versions
CI/CD Improvements
- Removed paths-ignore in GitHub Actions to improve integration with copy-pr-bot
- Enhanced workflow reliability and error handling
- Better handling of branch names and special characters in automation
📚 Documentation
Updated Documentation
- Comprehensive log collection documentation with detailed troubleshooting guides
- Updated guides to reflect current best practices
🙏 Acknowledgments
This release includes contributions from multiple contributors across NVIDIA and the community:
- @lalitadithya
- @XRFXLP
- @ksaur
- @KaivalyaMDabhadkar
- @Gyan172004
- @mchmarny
- @dims
- @tanishagoyal2
- @rupalis-nv
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
📦 What's Included
Container Images (15 components)
- gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
- event-exporter (NEW)
- kubernetes-object-monitor (NEW)
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- The Kubernetes object monitor is in preview and may require tuning for specific workloads
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.4.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v0.3.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.4.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the README and documentation in the repository.
Release v0.3.0
This release introduces significant new capabilities for GPU infrastructure monitoring, enhanced automation features, and improved reliability. We've focused on making it easier to understand your GPU environment and giving you more control over how NVSentinel responds to issues.
🎯 Major New Features
GPU Metadata Collection
NVSentinel can now automatically collect detailed information about your GPU hardware, including GPU topology, NVSwitch info, and some hardware specifications. This information helps with troubleshooting and provides better visibility into your GPU infrastructure.
Enhanced Health Event Data
Health events now include rich contextual information about your nodes, including cloud provider details, availability zones, instance types, and CUDA driver versions. This automatic enrichment helps correlate issues across your infrastructure and speeds up root cause analysis.
Intelligent Pattern Detection
The health events analyzer can now detect when multiple issues occur on the same node within a time window. For example, if a node requires remediation multiple times in a short period, NVSentinel can automatically escalate this to support.
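The time-window check described here can be sketched as a sliding window per node. This is a minimal sketch, assuming a simple count-based threshold: `RemediationTracker`, its parameters, and the node names are hypothetical and do not reflect the analyzer's real rule format.

```python
from collections import defaultdict, deque


class RemediationTracker:
    """Track remediation timestamps per node within a sliding time window."""

    def __init__(self, window_s, threshold):
        self.window_s = window_s
        self.threshold = threshold
        self._events = defaultdict(deque)

    def record(self, node, ts):
        """Record a remediation at time `ts` (seconds).

        Returns True when the node has reached `threshold` remediations
        within the last `window_s` seconds and should be escalated.
        """
        recent = self._events[node]
        recent.append(ts)
        # Drop remediations that have fallen out of the window.
        while ts - recent[0] > self.window_s:
            recent.popleft()
        return len(recent) >= self.threshold


tracker = RemediationTracker(window_s=3600, threshold=3)
results = [tracker.record("gpu-node-1", ts) for ts in (0, 100, 200)]
print(results)  # prints [False, False, True]
```

Because old timestamps are evicted on every call, a node that stays healthy for longer than the window starts again from a clean count.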
Manual Override Capability
You can now manually uncordon a quarantined node, which will automatically cancel the entire automated remediation pipeline for that node. This gives operators direct control when they need to intervene.
Advanced Log Collection
The log collector now automatically gathers AWS SOS reports (sosreport) in addition to existing NVIDIA bug reports and GPU Operator logs. This provides comprehensive diagnostic information for AWS-hosted GPU nodes.
🔧 Configuration & Usability Improvements
Comprehensive Configuration Documentation
The Helm chart now includes extensive inline documentation for all configuration options, making it easier to customize NVSentinel for your environment. A new values-full.yaml reference file provides detailed examples.
Unified Configuration Management
All modules now use a standardized configuration system, making it more consistent and predictable to configure different parts of NVSentinel.
Kata Container Auto-Detection
NVSentinel can now automatically detect when running in Kata containers and adjust its monitoring approach accordingly.
🐛 Bug Fixes & Reliability Improvements
Fault Quarantine Improvements
- Fixed: Unnecessary events are no longer propagated to node drainer and fault remediation modules, reducing noise in the system
- Fixed: Taints are no longer applied in dry-run mode, allowing you to safely test configurations
- Fixed: Race conditions in node monitoring that could cause inconsistent state
Health Monitoring Fixes
- Fixed: Health events are now properly sent even when DCGM connectivity temporarily fails
- Fixed: GPU falling off the bus is now detected even without specific XID error codes
- Fixed: Resource cleanup and connection handling after DCGM failures are more robust
- Fixed: Raw journal messages are now fully stored in health events for better debugging
End-to-End Testing
- Fixed: Node drainer restarts properly in end-to-end test environments
- Fixed: Multiple test flakes and race conditions resolved
- Fixed: Log collector configuration paths corrected
Data Flow Optimizations
- Fixed: MongoDB change streams now properly handle error conditions
- Fixed: Platform connectors fail fast when health events cannot be published, preventing data loss
- Fixed: Improved error handling throughout the event processing pipeline
🔒 Security & Compliance Enhancements
SLSA Build Provenance
All container images now include SLSA (Supply chain Levels for Software Artifacts) attestations and Software Bill of Materials (SBOM). Sigstore Policy Controller integration enables verification of build provenance.
Security Scanning
- Daily vulnerability scanning implemented for all container images
- Security validation now excludes test directories for more focused results
📊 Monitoring & Observability
Improved Metrics
- Comprehensive audit and documentation of all Prometheus metrics
- Better labeling and organization of metrics across modules
- New metrics for manual uncordon operations and pattern detection
Enhanced Logging
- Structured logging implemented across all modules for consistency
- Reduced log verbosity while maintaining useful information
- Better error messages and debugging context
🏗️ Infrastructure & Development
Build System Improvements
- Images can now be built with either Docker or ko (Kubernetes-optimized builder)
- ARM64 architecture support across all container images
- Optimized build times and smaller image sizes
- Improved GitHub Actions workflows for faster CI/CD
Dependency Updates
- Upgraded to golangci-lint v2 for better code quality checking
- Updated multiple cloud provider SDKs (AWS, GCP, Azure)
- Updated various Go and Python dependencies to latest stable versions
- Updated CUDA base images
📚 Documentation
New Design Documents
- GPU metadata retrieval design
- Data flow through NVSentinel (from detection through remediation)
- Overview documentation explaining what NVSentinel is and why it's important
- Integration guides
Updated Guides
- All documentation updated to reflect current repository structure
- Development guide improvements
- Contributing guidelines clarification
- Roadmap published showing planned features
🔄 Breaking Changes & Migration Notes
Generic Maintenance Resources
The fault remediation module now uses generic maintenance resources instead of reboot-specific resources. If you're using custom remediation integrations, you may need to update your configurations.
Configuration Schema Changes
Some configuration parameters have been renamed or restructured for consistency. Review the updated values-full.yaml for the latest schema.
📈 Quality Improvements
Testing Infrastructure
- Added comprehensive end-to-end tests for all modules
- UAT (User Acceptance Testing) framework for AWS environments
- Improved test coverage reporting
- Better test isolation and reliability
Code Quality
- Streamlined Makefiles to reduce duplication and cognitive load
- Improved linting rules and enforcement
- Better code organization and module boundaries
- Reduced technical debt across the codebase
🙏 Acknowledgments
This release includes contributions from 10 contributors, with over 140 commits improving virtually every aspect of NVSentinel:
- @lalitadithya
- @mchmarny
- @dims
- @XRFXLP
- @KaivalyaMDabhadkar
- @rupalis-nv
- @Gyan172004
- @nitz2407
- @tabern
- @tanishagoyal2
Thank you to everyone who contributed code, documentation, testing, and feedback!
📦 What's Included
Container Images (14 components)
- gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain failure scenarios
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.3.0 \
--namespace nvsentinel \
--create-namespace
For detailed installation and configuration instructions, see the README.
Release v0.2.0
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.2.0
Release v0.1.0
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.1.0