Releases: NVIDIA/NVSentinel

Release v0.10.1

07 Mar 12:55
v0.10.1
afcc326


Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.10.1

Release v0.10.0

03 Mar 11:41
v0.10.0
429d2b8


This release introduces multi-node NCCL all-reduce preflight testing across all major cloud fabrics, concurrent event exporting for large-scale clusters, Slinky (Slurm-on-Kubernetes) drainer improvements, DCGM 4.4.2 compatibility with destructive XID detection, breakfix response time metrics, and significant fault management reliability fixes.

Major New Features

Multi-Node NCCL All-Reduce Preflight Tests

  • Cross-Node GPU Interconnect Validation (#837): A mutating webhook injects an init container that runs a multi-node NCCL all-reduce bandwidth benchmark across all gang members before the workload starts. Validates GPU interconnect health across InfiniBand, EFA, TCPXO, and MNNVL fabrics.
    • New preflight-nccl-allreduce container image (PyTorch + torchrun)
    • networkFabric Helm selector with data-driven fabric profiles (ib, efa, mnnvl-efa, tcpxo)
    • DRA resource claim mirroring for GB200 MNNVL/IMEX
    • extraVolumeMounts for GCP TCPXO plugin injection
    • Auto-created NCCL topology ConfigMap for Azure IB
    • Tested on A100 (Azure IB), H100 (AWS EFA, GCP TCPXO), and GB200 (AWS MNNVL)
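
The fabric profile is selected via Helm. A minimal values sketch, assuming an enclosing `preflightChecks` block (the `networkFabric` key and profile names are from this release; the surrounding structure is an illustrative assumption):

```yaml
# Hypothetical values.yaml sketch. The networkFabric selector and its
# profiles (ib, efa, mnnvl-efa, tcpxo) are from this release; the
# enclosing keys are assumptions for illustration.
preflightChecks:
  ncclAllReduce:
    enabled: true
    networkFabric: efa   # ib | efa | mnnvl-efa | tcpxo
```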

Concurrent Event Exporter

  • Worker Pool for Event Publishing (#906): Event exporter now supports concurrent publishing via a --workers flag. On a 1,100-node production cluster, sequential publishing (~3.3 events/sec) fell behind the event production rate (~10 events/sec), causing MongoDB oplog rotation and an unrecoverable ChangeStreamHistoryLost loop — leaving health events unexported for 4+ days. The new worker pool with sequence-tracked resume tokens provides at-least-once delivery with no event loss. At 10 workers, throughput reaches ~33 events/sec (supporting ~3,300 nodes).
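
As a sketch, the worker count might be raised via the exporter's command line (the `--workers` flag is from this release; the Helm key used to pass it is an assumption):

```yaml
# Hypothetical values.yaml sketch — extraArgs is an assumed key;
# only the --workers flag itself is documented in this release.
eventExporter:
  extraArgs:
    - "--workers=10"   # ~33 events/sec, sized for ~3,300 nodes
```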

DCGM 4.4.2 Compatibility & Destructive XID Detection

  • DCGM_HEALTH_WATCH_ALL Support (#905): Upgraded gpu-health-monitor, preflight dcgm-diag, and fake-dcgm to DCGM 4.4.2. Previously, DCGM_HEALTH_WATCH_ALL incidents (used by DCGM to report destructive XIDs like XID 95) were silently excluded, causing a KeyError crash in the health check loop and leaving GPU failures undetected. The fix removes the exclusion filter, adds safe .get() fallbacks for unknown systems/error codes, and maps DCGM_HEALTH_WATCH_ALL to Fatal severity. Backward compatible with DCGM 4.2.x.

Breakfix Response Time Metrics

  • End-to-End Remediation Latency Tracking (#714): New histogram metrics across the remediation pipeline to answer critical operational questions:
    • fault_remediation_cr_generate_duration_seconds — Mean time for CR creation in fault-remediation
    • fault_quarantine_node_quarantine_duration_seconds — Mean time to quarantine a node
    • node_drainer_pod_eviction_duration_seconds — Mean time waiting for user workloads to complete
    • Janitor remediation duration metrics — Mean time to remediate

Slinky Drainer Improvements

The Slinky (Slurm-on-Kubernetes) drainer received multiple improvements for production reliability:

  • Annotation Handling (#909): Slinky drainer now only adds drain reason annotations if none already exist, and cleans up NVSentinel-owned annotations upon drain completion. Includes envtest-based tests for the full drain lifecycle.
  • Wait for Fully Drained State (#919): Fixed a critical bug where the drainer deleted pods in DRAINING state (drain accepted but jobs still running) instead of waiting for DRAINED state (all jobs complete). Now mirrors the Slinky operator's own IsNodeDrained() logic by checking busy-state conditions (Allocated, Mixed, Completing).
  • Wait Only for Ready Pods (#916): Slinky drainer now correctly waits only for pods in Ready state, avoiding stalls on non-ready pods that will never drain.
  • CI Pipeline Integration (#885): Slinky drain tests are now included in the GitHub CI pipeline.

Configuration & Cloud Provider

  • Configurable IAM Role for EKS (#877): The IAM role name used by the CSP health monitor for EKS is now configurable via Helm values (iamRoleName), supporting environments with custom IAM role naming.
  • Terminate Node Template (#894): Added template for creating TerminateNode CRs in fault-remediation values, enabling REPLACE_VM remediation actions for node replacement workflows.
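
For example, the EKS role override might be set as follows (the `iamRoleName` value is documented above; the enclosing block and role name are illustrative assumptions):

```yaml
# Hypothetical values.yaml sketch — only iamRoleName is documented
# in this release; the surrounding structure and the role name shown
# are assumptions.
cspHealthMonitor:
  eks:
    iamRoleName: custom-nvsentinel-csp-role
```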

Bug Fixes & Reliability

  • FRM Multiple Reconciliation Fix (#897): Fixed multiple issues in fault-remediation: duplicate event reconciliation from concurrent status updates, missing nvsentinel-state label for taint-only configurations, duplicate fields in userPodsEvictionStatus, and missing lastRemediationTimestamp updates in Postgres queries.
  • Quarantine Metric Accuracy (#759): Fixed fault_quarantine_current_quarantined_nodes metric reporting inflated values. Root cause: manual taint removal wasn't triggering unquarantine flow and annotation cleanup. Also fixed cases where quarantineHealthEvent annotation had empty values alongside quarantinedNodeUncordonedManually.
  • GPU Reset RuntimeClassName (#887): Set RuntimeClassName to nvidia in GPU reset pods, ensuring proper GPU access during reset operations.
  • GPU Reset UAT Improvements (#879, #892): Wait for GPUReset CRD (instead of checking syslog) in UAT tests for more reliable validation; fix uninitialized variable.
  • Unquarantine Timeout (#876): Increased unquarantine timeout from default to 5 minutes to prevent premature timeout failures.
  • Nolint Directive Cleanup (#832, #884): Continued cleanup of TODO-marked nolint directives (Parts 3 & 4).

Dependency Updates

  • Bumped github.com/aws/aws-sdk-go-v2/config from 1.32.7 to 1.32.9 (#902)
  • Bumped google.golang.org/api from 0.266.0 to 0.267.0 (#903)
  • Bumped aquasecurity/trivy-action from 0.33.1 to 0.34.0 (#839)
  • Multiple dependency updates via dependabot (#875, #904)

Acknowledgments

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.10.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.9.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.10.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.9.0

16 Feb 10:43
v0.9.0
2823f76


This release delivers end-to-end GPU reset support as a first-class remediation action, major expansions to the preflight check framework (DCGM diagnostics, NCCL loopback tests, gang discovery), enhanced Kubernetes operator health monitoring, and significant performance and reliability improvements across the platform.

Major New Features

End-to-End GPU Reset

GPU reset is now a fully integrated remediation path in NVSentinel. Building on the foundational work in v0.8.0, this release completes the pipeline:

  • GPU Reset Controller in Janitor (#797): New controller that consumes GPUReset CRDs and orchestrates the full reset lifecycle — tearing down GPU Operator components, executing the reset via nvidia-smi, and restoring services.
  • GPU Reset Container Image (#788): Dedicated gpu-reset container image used by Janitor's reset jobs to perform the actual GPU reset on target nodes.
  • E2E and UAT Test Coverage (#768): Enables GPU reset across fault-remediation (mapping COMPONENT_RESET to GPUReset), node-drainer (partial drain for GPU-scoped events), and health monitors (fallback to RESTART_VM when UUID discovery fails). Includes comprehensive end-to-end and UAT tests validating the full reset workflow.

This provides a lightweight recovery mechanism that resolves many GPU issues without full node reboots — resetting only the affected GPU while keeping healthy workloads running via partial drain.

Preflight Check Framework Expansion

The preflight check framework introduced in v0.8.0 now includes real diagnostic capabilities:

  • DCGM Diagnostics (#772): Runs DCGM diagnostic tests as preflight checks, discovering allocated GPUs via gonvml and executing diagnostics via pydcgm. Reports per-GPU, per-test health events (fatal for failures, non-fatal for warnings, healthy for passes).
  • NCCL Loopback Tests (#808): Validates intra-node GPU interconnect health by running NCCL all-reduce loopback tests. Detects degraded PCIe/NVLink bandwidth — tested across A100, H100, and GB200/GB300 hardware.
  • Gang Discovery (#818): Discovers pods belonging to the same scheduling group as a prerequisite for multi-node NCCL tests. Supports both native Kubernetes Workload API (1.35+) and PodGroup-based schedulers (Volcano, etc.) with config-driven CRD resolution. Coordinates peer discovery via ConfigMap injection at admission time.

Kubernetes Operator Health Monitoring

  • GPU & Network Operator Pod Monitoring (#751): The kubernetes-object-monitor now tracks DaemonSet pod health in gpu-operator and network-operator namespaces. Detects pods that fail to reach Running state within a configurable timeout and publishes fatal health events. Automatically publishes healthy events when pods recover.

Performance & Observability

Histogram Bucket Cardinality Reduction

  • 96% Series Reduction (#799): Replaced linear histogram buckets (500 buckets) with exponential buckets (12 buckets) in platform-connector metrics. Eliminates ~500K metric series cluster-wide, resolving Prometheus remote write bottlenecks and significantly reducing memory usage.

Configurable Network Policy

  • Optional Metrics Network Policy (#789): The metrics-access network policy can now be disabled via networkPolicy.enabled: false. Resolves conflicts when NVSentinel shares a namespace with services like cert-manager that require ingress on non-metrics ports.
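
Disabling the policy is a one-line values override. A sketch, assuming `networkPolicy.enabled` sits at the chart's top level:

```yaml
# networkPolicy.enabled is the key named in this release; its exact
# position in the values hierarchy is an assumption.
networkPolicy:
  enabled: false
```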

Bug Fixes & Reliability

  • Nolint Directive Cleanup (#828, #831): Cleaned up nolint directives previously marked as TODO across the codebase, improving lint compliance and code quality.
  • E2E Test Retry for InfoROM Errors (#834): Added retry logic when injecting InfoROM errors in E2E tests, improving test reliability.
  • Demo Script Fix (#809): Fixed demo script to display correct node conditions.
  • SBOM Generation Disk Space (#817, #827): Added disk cleanup logic before SBOM generation in the publish container CI job, preventing build failures due to insufficient disk space.
  • CUDA Image Source (#792): Switched to CUDA images from NVCR to avoid Docker Hub rate limits in CI.

Build & Infrastructure

  • Overrideable Module Names (#816): Component Makefiles can now override the Go module name, improving build flexibility.
  • Mixed Eviction Scale Tests (#830): Added scale test results for mixed eviction modes (Immediate, AllowCompletion, DeleteAfterTimeout) on a 1500-node cluster, validating correct behavior at 10%, 25%, and 50% cluster scale.
  • Copy-PR-Bot Config (#805): Added username to copy-pr-bot configuration.

Documentation

  • K8s Data Store Design Doc (#787): Design document for introducing a Kubernetes-native data store for health events, reducing dependency on MongoDB.

Dependency Updates

  • Bumped protobuf from 6.33.4 to 6.33.5 in gpu-health-monitor (#769)
  • Multiple dependency updates via dependabot (#803, #806, #829)

Acknowledgments

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.9.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.8.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.9.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.8.0

02 Feb 11:47
v0.8.0
48f045b


This release introduces distributed node locking for safer concurrent operations, partial drain support for GPU-level remediation, GPU reset capabilities across multiple components, and enhanced event handling strategies. We've also begun implementing preflight checks, improved cloud provider support, and made significant reliability improvements across the platform.

🎯 Major New Features

Distributed Node Locking

NVSentinel now includes distributed node locking to prevent concurrent maintenance operations on the same node. This critical safety feature ensures that multiple remediation workflows don't interfere with each other, preventing race conditions and ensuring predictable behavior when multiple components need to perform maintenance operations simultaneously.

Partial Drain Support

The node-drainer now supports partial drain operations, enabling GPU-level remediation without draining the entire node. This significantly reduces the blast radius of remediation actions, allowing healthy workloads to continue running while only affected GPUs are serviced. This feature is particularly valuable in large-scale clusters where preserving workload availability is critical.

Comprehensive GPU Reset Support

GPU reset functionality has been expanded across multiple components:

  • GPU Health Monitor: Native GPU reset support for DCGM-detected issues
  • Fault Remediation: Integrated GPU reset as a remediation action
  • Syslog Health Monitor: GPU reset support for syslog-detected faults

This provides a lightweight, fast recovery mechanism that can resolve many GPU issues without requiring full node reboots, dramatically reducing recovery times and improving cluster availability.

Preflight Check Framework

Added a preflight check scaffold and comprehensive design documentation for pre-job validation. This new framework enables operators to validate cluster state and prerequisites before a job starts executing, reducing the likelihood of job failures.

Event Handling Strategy Enhancements

Expanded event handling strategy support to additional components:

  • CSP Health Monitor: Event handling strategy configuration for cloud provider events
  • Kubernetes Object Monitor: Event handling strategy for Kubernetes resource events

This provides consistent, fine-grained control over event processing across all health monitoring components.

🔧 Configuration & Control Improvements

Custom Certificate Secrets

Helm charts now support custom certificate secrets, providing flexibility for organizations with existing certificate management infrastructure and specific security requirements. This enables seamless integration with enterprise PKI systems and certificate management workflows.

MongoDB Client Tracking

Added support for passing application names to MongoDB connections, enabling better client tracking and operational visibility. This helps operators understand which NVSentinel components are generating database load and simplifies troubleshooting of database performance issues.

ArgoCD Integration

Added checksum and sync-wave annotations for ArgoCD ConfigMap restarts, ensuring proper sequencing and change detection in GitOps workflows. This improves reliability when deploying NVSentinel via ArgoCD and prevents configuration drift issues.

🐛 Bug Fixes & Reliability Improvements

GPU Health Monitor Event Cache

Fixed critical race condition where the GPU health monitor event cache was updated before health events were successfully sent to platform-connector. This ensures events are not lost during transient connectivity issues and improves overall event delivery reliability.

Labeler Improvements

  • Stale Label Removal: Labeler now properly removes stale labels that no longer apply to nodes
  • Flaky Test Fixes: Resolved flaky labeler tests that were causing intermittent CI failures

Circuit Breaker Fixes

  • Cursor Mode: Fixed cursor mode handling in circuit breaker reset mechanism
  • Runbook Updates: Enhanced circuit breaker runbook with better operational guidance

Event Filtering

Fixed filtering logic in health-events-analyzer queries, ensuring events are properly matched against configured rules and improving detection accuracy.

MongoDB Authentication

Corrected authentication mechanism in MongoDB metrics URL, resolving connection issues in secured MongoDB deployments.

Fault Remediation Business Logic

Improved fault remediation to properly use controller-runtime business logic, enhancing reliability and consistency with Kubernetes controller patterns.

Nebius Cloud Reboot Handling

Fixed SendRebootSignal in Nebius provider to wait for instance stop completion before proceeding, preventing race conditions and ensuring reliable node reboots in Nebius Cloud environments.

Health Events Analyzer Test Fixes

Resolved test failures in health-events-analyzer that were causing CI pipeline issues.

🏗️ Architecture & Performance

Driver Version Dependent Parsing

Added driver version dependent parsing of NVL5 decoding rules, ensuring correct interpretation of NVLink errors across different driver versions. This improves accuracy of NVLink fault detection and reduces false positives.

🧪 Testing & Quality Improvements

UAT Test Reliability

  • Improved UAT test reliability with better error handling and retry logic
  • Enhanced test configurability for different cluster environments
  • Better cleanup and resource management in test environments

Documentation Improvements

  • Added comprehensive preflight check design documentation
  • New alert runbook for improved operational guidance
  • Fixed typos in runbook documentation

Dependency Updates

  • Bumped google.golang.org/api from 0.259.0 to 0.260.0 in csp-health-monitor
  • Multiple security updates and dependency version bumps via dependabot

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Partial drain support requires proper GPU workload identification
  • Distributed locking requires coordination between components
  • Preflight checks are in early stages and should be thoroughly tested

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.7.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.7.1

26 Jan 20:02
v0.7.1
53b2c50


Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.7.1

Release v0.7.0

20 Jan 13:11
v0.7.0
3f98a45


This release introduces advanced event processing strategies, gRPC-based remediation services, enhanced templating capabilities, and improved support for cloud service providers. We've also made significant reliability enhancements across the platform.

🎯 Major New Features

Event Processing Strategies

NVSentinel now supports configurable event processing strategies across health monitors and analyzers. The new processingStrategy field in health events gives operators fine-grained control over how events are handled, so processing behavior can be tailored to specific operational requirements. This feature has been implemented in:

  • GPU Health Monitor
  • Syslog Health Monitor
  • Health Events Analyzer

gRPC Remediation Service

Added a new gRPC-based remediation service that enables programmatic fault remediation operations. This provides a powerful API for external systems to integrate with NVSentinel's remediation capabilities, supporting advanced automation workflows and custom orchestration scenarios.

Enhanced Templating Support

Multi-template support in fault remediation allows using multiple notification templates for different channels and audiences. Additionally, all fields in health events can now be used for templating, providing complete flexibility in crafting notifications and alerts.

Nebius Cloud Support

Added comprehensive support for Nebius Cloud (MK8s) CSP, including environment variable and secret support for node reboot operations. This expands NVSentinel's multi-cloud capabilities to include another major cloud provider.

🔧 Configuration & Control Improvements

Pod GPU Device Allocation Tracking

The metadata collector now tracks pod GPU device allocation, providing visibility into which pods are using which GPUs. This enables more informed remediation decisions and better troubleshooting of GPU-related issues.

Syslog Runtime Journal Support

Added runtime journal support to the syslog health monitor, enabling direct integration with systemd journal for more efficient log collection and processing.

PodMonitor Configuration

Helm charts now support making PodMonitor optional and configurable, providing flexibility for environments with different monitoring setups and requirements.
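
A values sketch for opting out (the key name `podMonitor.enabled` is an assumption; the release states only that the PodMonitor is now optional and configurable):

```yaml
# Hypothetical values.yaml sketch — the key names are assumptions.
podMonitor:
  enabled: false
```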

🐛 Bug Fixes & Reliability Improvements

MongoDB Retry Logic

Implemented retry mechanism for MongoDB write failures, improving resilience against transient database connectivity issues and ensuring health events are not lost during temporary network problems.

Janitor Reconciliation Simplification

Simplified janitor reconciliation loops for better reliability and maintainability. The refactored logic reduces complexity and improves predictability of cleanup operations.

XID 154 Case Handling

Fixed a critical case statement issue in XID 154 handling that could cause incorrect error processing. This ensures GPU recovery action changes are properly detected and remediated.

GpuNvlinkWatch Message Parsing

Fixed message parsing logic collision in GpuNvlinkWatch that was causing stale node conditions. This resolves false positives and improves the accuracy of NVLink health monitoring.

Missing RESTART_VM Action

Added the missing RESTART_VM remediation action to fault remediation configurations, ensuring all supported remediation actions are properly exposed in the configuration.

Default CSP Provider Host

Fixed missing default cspProviderHost value that was causing configuration issues in certain deployment scenarios.

Apple Silicon Demo Support

Added ARM64 (Apple Silicon) support for local fault injection demos, improving the developer experience on macOS machines with Apple Silicon processors.

🏗️ Architecture & Performance

Certificate Hot-Reloading

Added option to enable automatic certificate hot-reloading, allowing certificate updates without service restarts. The certificate watcher is now non-blocking, improving overall system responsiveness.

MongoDB Query Metrics

Added comprehensive MongoDB query metrics for better observability of database operations, enabling performance analysis and optimization of data access patterns.

Enhanced Logging

Improved logging configuration across multiple components for better consistency and observability.

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • gRPC remediation service requires proper network configuration and authentication
  • Event processing strategies should be carefully tested before production deployment

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.7.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.6.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.7.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.6.0

22 Dec 11:33
v0.6.0
82e7180


This release brings NVLink fault detection and remediation, enhanced security with certificate rotation support, and fine-grained control over health event processing. We've also started the migration of fault remediation to controller-runtime for better scalability and made significant reliability improvements across the platform.

🎯 Major New Features

NVLink XID 74 Workflow

NVSentinel now includes automated detection and remediation for XID 74 errors in the health events analyzer. XID 74 indicates NVLink hardware faults that can disrupt GPU-to-GPU communication. The workflow detects these errors and executes appropriate remediation actions to restore cluster health.

Certificate Rotation Support

The store-client module now supports automatic certificate rotation. This enables zero-downtime certificate updates in production environments, addressing security compliance requirements and operational best practices for long-running deployments.

Selective Health Event Analyzer Rules

Operators can now selectively enable or disable specific health events analyzer rules based on operational needs. This provides granular control over which error patterns trigger detection and remediation, allowing customization for different cluster configurations and workload requirements.

Health Event Property Overrides

Added capability to override specific fields in health events. This allows customization of NVSentinel's behavior to match specific operational requirements and policies.

Controller-Runtime for Fault Remediation

Migrated the fault remediation module from custom reconciliation code to the controller-runtime framework. This brings improved scalability, better resource efficiency, standardized controller patterns, and easier maintainability.

🔧 Configuration & Control Improvements

Circuit Breaker Fresh Start

Added option to reset fault quarantine state via circuit breaker ConfigMap. This enables controlled recovery scenarios where operators need to clear historical state and restart with a clean slate.

🐛 Bug Fixes & Reliability Improvements

Node Drainer Priority Handling

Fixed node drainer to ensure delete-after-timeout properly takes priority over allow-completion setting. This ensures nodes are drained within configured timeout windows even when pods don't terminate gracefully, preventing stuck drain operations.

Event Exporter Logging

Fixed an issue that caused the event exporter to not use the standard logging configuration.

PostgreSQL Test Stability

Fixed flaky PostgreSQL tests that were causing intermittent CI failures.

Node Condition Message Formatting

Improved the formatting of truncated node condition messages to ensure readability when messages are trimmed to fit within Kubernetes API limits.

UAT Environment Management

Improved AWS UAT environment deletion handling to prevent resource leaks and reduce costs from orphaned test infrastructure.

🏗️ Architecture & Performance

Enhanced GPU Health Monitor Logging

Unified the GPU health monitor logging format with other NVSentinel components. This provides consistent log structure across the platform, simplifying log aggregation and analysis.

Code Quality Improvements

Optimized import ordering and code organization across the codebase for better readability and maintainability.

🧪 Testing & Quality Improvements

Scale Testing Validation

Added concurrent drain operations scale tests with validation on 1500-node clusters. These tests ensure NVSentinel maintains performance and reliability characteristics at large scale.

Test Reliability Improvements

  • Fixed flaky syslog XID monitoring UAT tests
  • Resolved CSP health monitor test timeout issues

CI/CD Enhancements

  • Set explicit Go version during CI dependency installation for reproducible builds
  • Improved tilt installation process by using temporary directory
  • Added GitHub Action to automatically clean up old untagged container images
  • Enhanced fork repository handling to prevent unnecessary workflow triggers

📚 Documentation Improvements

  • Enhanced Kubernetes object monitor architecture diagrams
  • Updated Slinky drain demo documentation

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Audit logging is disabled by default - enable explicitly when needed
  • Certificate rotation requires proper certificate management infrastructure

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.5.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.5.0

08 Dec 13:32
v0.5.0
6cce9d6


This release focuses on extensibility, production hardening, and operational flexibility. We've added support for custom drain handlers, PostgreSQL as an alternative database backend, comprehensive audit logging, and expanded our XID detection and remediation capabilities.

🎯 Major New Features

Custom Drain Extensibility

NVSentinel now supports custom drain handlers, allowing integration with specialized workload orchestrators. This feature enables organizations running HPC schedulers like Slinky, big data frameworks like Volcano, or ML platforms like Ray to integrate their custom drain logic seamlessly. The release includes a complete demo environment showcasing custom drain integration.

PostgreSQL Database Backend

Added PostgreSQL as a production-grade alternative to MongoDB, providing more flexibility in database selection. This addresses licensing concerns, operational preferences, and allows better alignment with existing infrastructure standards.

Note: PostgreSQL support is experimental and is not recommended for production clusters.

Audit Logging

Comprehensive audit logging for all NVSentinel write operations enables compliance reporting, security analysis, and operational troubleshooting. Every mutation is tracked with context about what changed, when, and by which component. The structured audit logs support configurable retention and rotation, with formats ready for integration with SIEM systems.

🔧 Enhanced Fault Detection & Remediation

XID 13 & XID 31 Workflow Implementation

Added automated workflows for handling critical GPU error conditions XID 13 (Graphics Engine Exception) and XID 31 (GPU memory page fault). These workflows help catch GPU degradation early.

XID 154 Support

Added support for detecting and handling XID 154 (GPU Recovery Action Changed) events.

Pre-Installed Driver Support

Enhanced support for environments where the GPU driver is installed outside of the GPU Operator.

🏗️ Infrastructure & Architecture Improvements

ko-based Kubernetes Object Monitor

Migrated the Kubernetes object monitor to ko-based builds, resulting in faster build times for development iterations, smaller container images with reduced attack surface, and improved supply chain security with minimal base images.

Enhanced Build System

The version field is now properly passed from build args to the Dockerfile for accurate version reporting, improving reproducibility and traceability in logs.

🐛 Bug Fixes & Reliability Improvements

Node Condition Message Limiting

Node condition messages are now automatically truncated to 1024 bytes, preventing Kubernetes API server errors in edge cases where verbose error descriptions produce excessively large messages.
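
The cap is easy to reason about in plain shell. This sketch (not NVSentinel code) truncates an oversized message the same way a controller might before writing the condition:

```shell
# Build a 5000-byte message, then truncate it to the 1024-byte cap.
msg=$(printf 'x%.0s' $(seq 1 5000))
truncated=$(printf '%s' "$msg" | head -c 1024)
echo "${#truncated}"   # -> 1024
```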

Quarantine Override Handling

Quarantine overrides are now properly applied to nodes that are already in quarantined state, ensuring manual overrides work consistently regardless of node state.

Data Model Type Safety

Recommended action type changed from integer to string for better API clarity, type safety, and human readability in configurations.

Data Model Consistency

Replaced IGNORE with NONE throughout the data model for consistency with the canonical data schema.

Log Collector Concurrency

Improved handling of must-gather toggle and concurrent log collector job scenarios to prevent resource conflicts and ensure reliable log collection.

🧪 Testing & Quality Improvements

Enhanced Tilt Testing

Added comprehensive Tilt tests for the CSP health monitor. The tests are deterministic and avoid sleep-based timing, making them faster and more reliable for developers.

Scale Testing Framework

New performance and scale tests to validate NVSentinel behavior under load:

  • FQM Latency & Queue Depth: Tests for fault quarantine module performance characteristics
  • API Server & MongoDB Performance: Validation of data layer performance at scale

Log Collector Tilt Tests

Added automated Tilt tests for the log collector module, improving test coverage for critical troubleshooting workflows.

📚 Documentation Improvements

Operational Documentation

  • Datastore architecture and migration documentation
  • Comprehensive configuration reference
  • Feature documentation and user guides
  • Runbooks for common operational scenarios
  • Upgrade procedures and best practices
  • IAM setup guide for CSP health monitor
  • Documentation for pre-installed GPU driver support

🔄 Dependencies & Maintenance

Security Updates

  • Upgraded Go modules to address CVEs in dependencies
  • Bumped various dependencies to latest stable versions

CI/CD Improvements

  • Added dependabot configuration for GPU API
  • Enhanced GitHub Actions workflows
  • Improved contributor automation with copy-pr-bot updates

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

📦 What's Included

Container Images (15 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup
  • event-exporter
  • kubernetes-object-monitor

All images include the latest bug fixes, security updates, and feature enhancements from this release.


⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Custom drain handlers require implementing the drain handler interface
  • PostgreSQL backend is in preview and should be thoroughly tested before production use

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.5.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.4.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.5.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.4.1

25 Nov 11:11
v0.4.1
a96ba5e


Release v0.4.1

This is a hotfix release addressing bugs discovered in v0.4.0. We recommend all users running v0.4.0 upgrade to v0.4.1.

🐛 Bug Fixes

Fault Quarantine Uncordoning Issue

Fixed: A critical issue where the fault quarantine module's node annotations map could become stale, preventing proper uncordoning of nodes. This fix ensures that manual uncordon operations and automated recovery workflows function correctly.

Event Exporter Package Publishing

Fixed: Corrected the event exporter package publishing configuration, ensuring the event exporter component is properly included in releases and can be deployed as expected.

CRI-O Runtime Support

Fixed: Added the ability to unset the runtime class as a workaround for CRI-O environments where the default runtime class configuration can cause deployment issues, improving compatibility across container runtime configurations.

🔄 Upgrade Instructions

To upgrade from v0.4.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.1 \
  --namespace nvsentinel \
  --reuse-values

To install v0.4.1:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.1 \
  --namespace nvsentinel \
  --create-namespace

🙏 Acknowledgments

Thank you to the contributors for the quick turnaround on these critical fixes!

📦 What's Included

All 15 container images from v0.4.0 with the above bug fixes applied.


⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios

Release v0.4.0

24 Nov 12:02
v0.4.0
7209959


Release v0.4.0

This release brings major enhancements to NVSentinel's observability, testing infrastructure, and operational flexibility. We've added powerful new monitoring capabilities, improved database options, and made significant investments in automated testing to ensure reliability at scale.

🎯 Major New Features

Health Event Exporter

NVSentinel now includes a dedicated event exporter that enables seamless integration with external monitoring and analytics systems. Export health events to your preferred data platform for long-term analysis, compliance reporting, or integration with existing observability stacks.

Kubernetes Object Health Monitor

A new monitor that tracks Kubernetes objects, providing insight into the health of nodes and accelerators. This is particularly useful for monitoring node conditions set by entities that aren't yet integrated with NVSentinel, letting you leverage existing health signals from other monitoring tools and operators running in your cluster.

Repeated XID Pattern Detection

The health events analyzer can now identify unique XIDs within burst windows and correlate them across multiple bursts to detect repeated XID patterns. This advanced pattern matching helps identify nodes with recurring but intermittent issues, enabling proactive intervention before these patterns lead to major failures.
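The core idea, deduplicating XIDs within each burst and then flagging XIDs that recur across bursts, can be sketched in a few lines of awk. This is a toy model, not NVSentinel's analyzer, and the input format is invented:

```shell
# Each line: <burst-id> xid=<code>. XID 79 recurs across two bursts.
cat > bursts.txt <<'EOF'
burst1 xid=79
burst1 xid=79
burst1 xid=63
burst2 xid=79
burst3 xid=48
EOF

# Count each XID at most once per burst, then report XIDs seen in
# two or more distinct bursts.
awk '{ split($2, kv, "=")
       if (!seen[$1, kv[2]]++) bursts[kv[2]]++ }
     END { for (x in bursts) if (bursts[x] >= 2) print "recurring XID:", x }' bursts.txt
# -> recurring XID: 79
```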

Enhanced Database Flexibility

You can now choose between Bitnami MongoDB and Percona MongoDB based on your organizational preferences and requirements. This flexibility allows better alignment with existing infrastructure standards and support agreements.

Local Development & Testing with KIND

We've added a complete local error injection demo that runs on KIND (Kubernetes IN Docker) clusters. This makes it easy to test NVSentinel's behavior, experiment with configurations, and validate custom integrations without requiring access to GPU hardware or cloud resources.

Unified MongoDB SDK

All MongoDB operations have been consolidated into a unified store-client SDK, providing consistent data access patterns across all modules. This refactoring improves code maintainability, reduces duplication, and makes it easier to extend NVSentinel's data layer.

🔧 Configuration & Usability Improvements

Component-Specific Tolerations

Platform connectors now support component-specific tolerations, giving you fine-grained control over which nodes the connector instances can run on. This is particularly useful in heterogeneous clusters with different taint configurations.
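
Tolerations use the standard Kubernetes schema; the values path below is illustrative (check the chart's values.yaml for the exact key):

```yaml
# Hypothetical values fragment: the platformConnectors key path is
# illustrative; the toleration fields themselves are standard Kubernetes.
platformConnectors:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```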

🐛 Bug Fixes & Reliability Improvements

  • Fixed: Nil pointer check prevents panic during graceful shutdown scenarios
  • Fixed: TypeError in GPU Health Monitor signal handler that could cause unexpected terminations
  • Fixed: Duplicate node-drainer events eliminated by ensuring consistent pod list ordering
  • Fixed: Partial recovery healthy events are no longer incorrectly propagated to node drainer and fault remediation modules
  • Fixed: CSP monitor reliability improvements for better cloud provider integration
  • Fixed: ECR registry used for base images to avoid Docker Hub rate limiting
  • Fixed: SAFE_REF used in Helm publish workflow to handle special characters in branch names
  • Added: Pre-upgrade Helm hook automatically cleans up deprecated node conditions during upgrades

🧪 Testing & Quality Improvements

Automated User Acceptance Testing (UAT)

  • AWS UAT: Automated end-to-end tests running on actual AWS infrastructure with GPU instances
  • GCP UAT: Comprehensive UAT coverage on Google Cloud Platform

Development Environment

  • Fixed: Linux development environment setup issues resolved

Test Configuration

  • Updated test configurations to use more appropriate time windows, reducing test flakiness while maintaining coverage

🏗️ Infrastructure & Development

Dependency Management

  • Multiple dependency updates merged from Dependabot across AWS SDK, configuration libraries, and other critical dependencies
  • Helm version pinned to v3.19.2 to ensure consistent behavior across environments
  • Upgraded various Go and Python packages to latest stable versions

CI/CD Improvements

  • Removed paths-ignore in GitHub Actions to improve integration with copy-pr-bot
  • Enhanced workflow reliability and error handling
  • Better handling of branch names and special characters in automation

📚 Documentation

Updated Documentation

  • Comprehensive log collection documentation with detailed troubleshooting guides
  • Updated guides to reflect current best practices

🙏 Acknowledgments

This release includes contributions from multiple contributors across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

📦 What's Included

Container Images (15 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup
  • event-exporter (NEW)
  • kubernetes-object-monitor (NEW)


⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • The Kubernetes object monitor is in preview and may require tuning for specific workloads

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.3.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.