Releases: NVIDIA/NVSentinel

Release v1.1.0

18 Mar 07:30
v1.1.0
952cd1c

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v1.1.0

Release v1.0.0

17 Mar 13:08
v1.0.0
adbab1e

NVSentinel v1.0.0 Release Notes

Status: Beta / Stable

With v1.0.0, NVSentinel moves from Experimental to Beta/Stable. We now recommend NVSentinel for production testing and use. The project continues to evolve rapidly and APIs may change between releases, but we follow semantic versioning going forward: breaking changes will increment the major version.

What's in v1.0.0

This release represents 13 prior releases and 400+ commits since the initial open-source launch in October 2025. The highlights below cover the full arc from v0.1.0 through v1.0.0.

GPU reset and remediation pipeline

NVSentinel now supports a complete GPU reset workflow as an alternative to full node reboot. The GPU health monitor detects reset-eligible errors, fault remediation creates GPUReset CRDs, and the janitor executes the reset. This reduces remediation time from minutes (reboot) to seconds (reset) for recoverable GPU faults. End-to-end remediation metrics track the full pipeline from fault detection through resolution.

Kubernetes object monitor

A new policy-based health monitor that watches any Kubernetes resource and evaluates CEL expressions to generate health events. This enables monitoring of custom resources, operator status, and application-level health signals without writing code.
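
The shape of such a policy can be sketched as follows. All key names below are illustrative assumptions, not the monitor's actual configuration schema; only the CEL mechanism itself comes from the notes above.

```yaml
# Hypothetical kubernetes-object-monitor policy: watch ClusterPolicy
# resources and raise a fatal health event when any status condition
# reports Ready=False. Key names are invented for illustration.
policies:
  - resource:
      group: nvidia.com
      version: v1
      kind: ClusterPolicy
    expression: >
      object.status.conditions.exists(c,
        c.type == 'Ready' && c.status == 'False')
    severity: Fatal
```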

Event exporter

Health events can now be streamed to external systems in CloudEvents format. This enables integration with existing observability platforms and data pipelines.
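
On the wire, each event carries the standard CloudEvents envelope. The required attributes (specversion, id, source, type) come from the CloudEvents 1.0 specification; the type string and data payload shown here are invented for illustration.

```json
{
  "specversion": "1.0",
  "id": "d3f1c9e0-0000-0000-0000-000000000000",
  "source": "/nvsentinel/gpu-health-monitor",
  "type": "example.nvsentinel.healthevent",
  "time": "2026-03-17T13:08:00Z",
  "datacontenttype": "application/json",
  "data": { "node": "gpu-node-17", "severity": "Fatal" }
}
```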

Preflight checks

A new preflight framework validates cluster readiness before GPU workloads are scheduled. Includes DCGM diagnostics and NCCL loopback/all-reduce tests to catch hardware issues before they affect production jobs.

Slurm drain monitor

A new health monitor for hybrid Kubernetes/Slurm environments. Monitors Slurm drain state and generates health events when nodes are drained by the Slurm scheduler, enabling NVSentinel to coordinate remediation across both schedulers.

Metadata collector

Automatically gathers GPU and NVSwitch topology information and enriches health events with hardware context. Integrated with both GPU and syslog health monitors.

PostgreSQL backend

MongoDB is no longer the only storage option. PostgreSQL is now supported as an alternative database backend, with LISTEN/NOTIFY change streams for real-time event processing.

Slinky (NVIDIA DPU) drain support

Custom drain integration for Slinky-managed nodes, including parallel drain handling and proper annotation coordination.

NVLink and XID workflow improvements

Dedicated workflows for NVLink failures (XID 13, 31, 154) with GPU-topology-aware fault classification. The syslog health monitor now includes driver-version-dependent parsing for NVL5 decoding rules.

Cloud provider improvements

  • Bare-metal reboot support via sudo in janitor
  • Generic CSP plugin with reboot capability
  • Configurable IAM role names for EKS
  • OCI, Azure, GCP, and AWS all supported with provider-specific janitor configurations

Operational improvements

  • Circuit breaker prevents mass quarantines during cluster-wide events
  • Audit logging for all NVSentinel write operations
  • Breakfix cancellation via manual uncordon
  • Partial drain support in node drainer (per-namespace eviction strategies)
  • Custom drain modes with parallel drain handling
  • Log collection for diagnostic reports, including AWS SOS and GCP SOS report collection
  • Optional TLS for MongoDB connections
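
The partial-drain item above (per-namespace eviction strategies) might be expressed in Helm values along these lines. The eviction mode names appear elsewhere in these notes; the key layout is assumed purely for illustration.

```yaml
# Hypothetical node-drainer values sketch; only the mode names
# (Immediate, AllowCompletion, DeleteAfterTimeout) come from the notes.
nodeDrainer:
  evictionStrategies:
    - namespace: batch-jobs
      mode: AllowCompletion
    - namespace: default
      mode: DeleteAfterTimeout
```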

Build and security

  • All container images built with ko and attested with SLSA build provenance
  • SPDX SBOM attestation on every image
  • Daily vulnerability scanning
  • Supply chain verification via Sigstore Policy Controller

Upgrading

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.0.0 \
  --namespace nvsentinel

NVSentinel includes a pre-upgrade hook that cleans up deprecated node conditions automatically. Review the Helm Chart Configuration Guide for new configuration options.

What's next

See the Roadmap and Project Board for planned work toward General Availability.

Release v0.10.1

07 Mar 12:55
v0.10.1
afcc326

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.10.1

Release v0.10.0

03 Mar 11:41
v0.10.0
429d2b8

This release introduces multi-node NCCL all-reduce preflight testing across all major cloud fabrics, concurrent event exporting for large-scale clusters, Slinky (Slurm-on-Kubernetes) drainer improvements, DCGM 4.4.2 compatibility with destructive XID detection, breakfix response time metrics, and significant fault management reliability fixes.

Major New Features

Multi-Node NCCL All-Reduce Preflight Tests

  • Cross-Node GPU Interconnect Validation (#837): A mutating webhook injects an init container that runs a multi-node NCCL all-reduce bandwidth benchmark across all gang members before the workload starts. Validates GPU interconnect health across InfiniBand, EFA, TCPXO, and MNNVL fabrics.
    • New preflight-nccl-allreduce container image (PyTorch + torchrun)
    • networkFabric Helm selector with data-driven fabric profiles (ib, efa, mnnvl-efa, tcpxo)
    • DRA resource claim mirroring for GB200 MNNVL/IMEX
    • extraVolumeMounts for GCP TCPXO plugin injection
    • Auto-created NCCL topology ConfigMap for Azure IB
    • Tested on A100 (Azure IB), H100 (AWS EFA, GCP TCPXO), and GB200 (AWS MNNVL)
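
Selecting a fabric profile might look like the following Helm values fragment. The networkFabric key and its allowed values come from the bullet above; the enclosing key path is an assumption for illustration.

```yaml
# Illustrative values fragment; only networkFabric and the profile
# names (ib, efa, mnnvl-efa, tcpxo) are from the release notes.
preflightChecks:
  ncclAllReduce:
    networkFabric: efa   # one of: ib, efa, mnnvl-efa, tcpxo
```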

Concurrent Event Exporter

  • Worker Pool for Event Publishing (#906): Event exporter now supports concurrent publishing via a --workers flag. On a 1,100-node production cluster, sequential publishing (~3.3 events/sec) fell behind the event production rate (~10 events/sec), causing MongoDB oplog rotation and an unrecoverable ChangeStreamHistoryLost loop — leaving health events unexported for 4+ days. The new worker pool with sequence-tracked resume tokens provides at-least-once delivery with no event loss. At 10 workers, throughput reaches ~33 events/sec (supporting ~3,300 nodes).
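
The sizing arithmetic behind the --workers flag can be sketched directly from the rates quoted above (~3.3 events/sec per worker). The figures are illustrative, not a throughput guarantee.

```shell
# Back-of-the-envelope sizing for the event exporter worker pool,
# using the ~3.3 events/sec per-worker publish rate quoted above.
workers=10
throughput=$(awk -v w="$workers" -v r=3.3 'BEGIN { printf "%.0f", w * r }')
echo "estimated throughput at ${workers} workers: ${throughput} events/sec"
```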

DCGM 4.4.2 Compatibility & Destructive XID Detection

  • DCGM_HEALTH_WATCH_ALL Support (#905): Upgraded gpu-health-monitor, preflight dcgm-diag, and fake-dcgm to DCGM 4.4.2. Previously, DCGM_HEALTH_WATCH_ALL incidents (used by DCGM to report destructive XIDs like XID 95) were silently excluded, causing a KeyError crash in the health check loop and leaving GPU failures undetected. The fix removes the exclusion filter, adds safe .get() fallbacks for unknown systems/error codes, and maps DCGM_HEALTH_WATCH_ALL to Fatal severity. Backward compatible with DCGM 4.2.x.

Breakfix Response Time Metrics

  • End-to-End Remediation Latency Tracking (#714): New histogram metrics across the remediation pipeline to answer critical operational questions:
    • fault_remediation_cr_generate_duration_seconds — Mean time for CR creation in fault-remediation
    • fault_quarantine_node_quarantine_duration_seconds — Mean time to quarantine a node
    • node_drainer_pod_eviction_duration_seconds — Mean time waiting for user workloads to complete
    • Janitor remediation duration metrics — Mean time to remediate
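
Assuming standard Prometheus histogram conventions (a _bucket series with an le label), a p95 drain-wait query over these metrics might look like:

```promql
histogram_quantile(0.95,
  sum by (le) (rate(node_drainer_pod_eviction_duration_seconds_bucket[5m])))
```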

Slinky Drainer Improvements

The Slinky (Slurm-on-Kubernetes) drainer received multiple improvements for production reliability:

  • Annotation Handling (#909): Slinky drainer now only adds drain reason annotations if none already exist, and cleans up NVSentinel-owned annotations (prefixed with [NVSentinel]) upon drain completion. Includes envtest-based tests for the full drain lifecycle.
  • Wait for Fully Drained State (#919): Fixed a critical bug where the drainer deleted pods in DRAINING state (drain accepted but jobs still running) instead of waiting for DRAINED state (all jobs complete). Now mirrors the Slinky operator's own IsNodeDrained() logic by checking busy-state conditions (Allocated, Mixed, Completing).
  • Wait Only for Ready Pods (#916): Slinky drainer now correctly waits only for pods in Ready state, avoiding stalls on non-ready pods that will never drain.
  • CI Pipeline Integration (#885): Slinky drain tests are now included in the GitHub CI pipeline.

Configuration & Cloud Provider

  • Configurable IAM Role for EKS (#877): The IAM role name used by the CSP health monitor for EKS is now configurable via Helm values (iamRoleName), supporting environments with custom IAM role naming.
  • Terminate Node Template (#894): Added template for creating TerminateNode CRs in fault-remediation values, enabling REPLACE_VM remediation actions for node replacement workflows.
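
The EKS role override might be set with a values fragment like the one below. Only the iamRoleName key comes from the release notes; the enclosing key path is an assumption for illustration.

```yaml
# Illustrative Helm values fragment for the CSP health monitor on EKS.
cspHealthMonitor:
  eks:
    iamRoleName: my-custom-nvsentinel-role
```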

Bug Fixes & Reliability

  • FRM Multiple Reconciliation Fix (#897): Fixed multiple issues in fault-remediation: duplicate event reconciliation from concurrent status updates, missing nvsentinel-state label for taint-only configurations, duplicate fields in userPodsEvictionStatus, and missing lastRemediationTimestamp updates in Postgres queries.
  • Quarantine Metric Accuracy (#759): Fixed fault_quarantine_current_quarantined_nodes metric reporting inflated values. Root cause: manual taint removal wasn't triggering unquarantine flow and annotation cleanup. Also fixed cases where quarantineHealthEvent annotation had empty values alongside quarantinedNodeUncordonedManually.
  • GPU Reset RuntimeClassName (#887): Set RuntimeClassName to nvidia in GPU reset pods, ensuring proper GPU access during reset operations.
  • GPU Reset UAT Improvements (#879, #892): Wait for GPUReset CRD (instead of checking syslog) in UAT tests for more reliable validation; fix uninitialized variable.
  • Unquarantine Timeout (#876): Increased unquarantine timeout from default to 5 minutes to prevent premature timeout failures.
  • Nolint Directive Cleanup (#832, #884): Continued cleanup of TODO-marked nolint directives (Parts 3 & 4).

Dependency Updates

  • Bumped github.com/aws/aws-sdk-go-v2/config from 1.32.7 to 1.32.9 (#902)
  • Bumped google.golang.org/api from 0.266.0 to 0.267.0 (#903)
  • Bumped aquasecurity/trivy-action from 0.33.1 to 0.34.0 (#839)
  • Multiple dependency updates via dependabot (#875, #904)

Acknowledgments

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.10.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.9.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.10.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.9.0

16 Feb 10:43
v0.9.0
2823f76

This release delivers end-to-end GPU reset support as a first-class remediation action, major expansions to the preflight check framework (DCGM diagnostics, NCCL loopback tests, gang discovery), enhanced Kubernetes operator health monitoring, and significant performance and reliability improvements across the platform.

Major New Features

End-to-End GPU Reset

GPU reset is now a fully integrated remediation path in NVSentinel. Building on the foundational work in v0.8.0, this release completes the pipeline:

  • GPU Reset Controller in Janitor (#797): New controller that consumes GPUReset CRDs and orchestrates the full reset lifecycle — tearing down GPU Operator components, executing the reset via nvidia-smi, and restoring services.
  • GPU Reset Container Image (#788): Dedicated gpu-reset container image used by Janitor's reset jobs to perform the actual GPU reset on target nodes.
  • E2E and UAT Test Coverage (#768): Enables GPU reset across fault-remediation (mapping COMPONENT_RESET to GPUReset), node-drainer (partial drain for GPU-scoped events), and health monitors (fallback to RESTART_VM when UUID discovery fails). Includes comprehensive end-to-end and UAT tests validating the full reset workflow.

This provides a lightweight recovery mechanism that resolves many GPU issues without full node reboots — resetting only the affected GPU while keeping healthy workloads running via partial drain.

Preflight Check Framework Expansion

The preflight check framework introduced in v0.8.0 now includes real diagnostic capabilities:

  • DCGM Diagnostics (#772): Runs DCGM diagnostic tests as preflight checks, discovering allocated GPUs via gonvml and executing diagnostics via pydcgm. Reports per-GPU, per-test health events (fatal for failures, non-fatal for warnings, healthy for passes).
  • NCCL Loopback Tests (#808): Validates intra-node GPU interconnect health by running NCCL all-reduce loopback tests. Detects degraded PCIe/NVLink bandwidth — tested across A100, H100, and GB200/GB300 hardware.
  • Gang Discovery (#818): Discovers pods belonging to the same scheduling group as a prerequisite for multi-node NCCL tests. Supports both native Kubernetes Workload API (1.35+) and PodGroup-based schedulers (Volcano, etc.) with config-driven CRD resolution. Coordinates peer discovery via ConfigMap injection at admission time.

Kubernetes Operator Health Monitoring

  • GPU & Network Operator Pod Monitoring (#751): The kubernetes-object-monitor now tracks DaemonSet pod health in gpu-operator and network-operator namespaces. Detects pods that fail to reach Running state within a configurable timeout and publishes fatal health events. Automatically publishes healthy events when pods recover.

Performance & Observability

Histogram Bucket Cardinality Reduction

  • 96% Series Reduction (#799): Replaced linear histogram buckets (500 buckets) with exponential buckets (12 buckets) in platform-connector metrics. Eliminates ~500K metric series cluster-wide, resolving Prometheus remote write bottlenecks and significantly reducing memory usage.
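
Why 12 exponential buckets can cover the range that 500 linear ones did: each boundary doubles, so a dozen buckets already span several orders of magnitude. The base and growth factor below are illustrative, not the exact values used in #799.

```shell
# Generate 12 doubling bucket boundaries starting at 5ms (~0.005s..10.24s).
buckets=$(awk 'BEGIN { b = 0.005; for (i = 0; i < 12; i++) { printf "%g ", b; b *= 2 } }')
count=$(echo "$buckets" | wc -w)
echo "boundaries: $buckets"
echo "$count exponential buckets replace 500 linear ones per label-set"
```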

Configurable Network Policy

  • Optional Metrics Network Policy (#789): The metrics-access network policy can now be disabled via networkPolicy.enabled: false. Resolves conflicts when NVSentinel shares a namespace with services like cert-manager that require ingress on non-metrics ports.
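
In Helm values form, using the key path quoted in the bullet above:

```yaml
networkPolicy:
  enabled: false
```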

Bug Fixes & Reliability

  • Nolint Directive Cleanup (#828, #831): Cleaned up nolint directives previously marked as TODO across the codebase, improving lint compliance and code quality.
  • E2E Test Retry for InfoROM Errors (#834): Added retry logic when injecting InfoROM errors in E2E tests, improving test reliability.
  • Demo Script Fix (#809): Fixed demo script to display correct node conditions.
  • SBOM Generation Disk Space (#817, #827): Added disk cleanup logic before SBOM generation in the publish container CI job, preventing build failures due to insufficient disk space.
  • CUDA Image Source (#792): Switched to CUDA images from NVCR to avoid Docker Hub rate limits in CI.

Build & Infrastructure

  • Overrideable Module Names (#816): Component Makefiles can now override the Go module name, improving build flexibility.
  • Mixed Eviction Scale Tests (#830): Added scale test results for mixed eviction modes (Immediate, AllowCompletion, DeleteAfterTimeout) on a 1500-node cluster, validating correct behavior at 10%, 25%, and 50% cluster scale.
  • Copy-PR-Bot Config (#805): Added username to copy-pr-bot configuration.

Documentation

  • K8s Data Store Design Doc (#787): Design document for introducing a Kubernetes-native data store for health events, reducing dependency on MongoDB.

Dependency Updates

  • Bumped protobuf from 6.33.4 to 6.33.5 in gpu-health-monitor (#769)
  • Multiple dependency updates via dependabot (#803, #806, #829)

Acknowledgments

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.9.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.8.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.9.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.8.0

02 Feb 11:47
v0.8.0
48f045b

This release introduces distributed node locking for safer concurrent operations, partial drain support for GPU-level remediation, GPU reset capabilities across multiple components, and enhanced event handling strategies. We've also begun implementing preflight checks, improved cloud provider support, and made significant reliability improvements across the platform.

🎯 Major New Features

Distributed Node Locking

NVSentinel now includes distributed node locking to prevent concurrent maintenance operations on the same node. This critical safety feature ensures that multiple remediation workflows don't interfere with each other, preventing race conditions and ensuring predictable behavior when multiple components need to perform maintenance operations simultaneously.

Partial Drain Support

The node-drainer now supports partial drain operations, enabling GPU-level remediation without draining the entire node. This significantly reduces the blast radius of remediation actions, allowing healthy workloads to continue running while only affected GPUs are serviced. This feature is particularly valuable in large-scale clusters where preserving workload availability is critical.

Comprehensive GPU Reset Support

GPU reset functionality has been expanded across multiple components:

  • GPU Health Monitor: Native GPU reset support for DCGM-detected issues
  • Fault Remediation: Integrated GPU reset as a remediation action
  • Syslog Health Monitor: GPU reset support for syslog-detected faults

This provides a lightweight, fast recovery mechanism that can resolve many GPU issues without requiring full node reboots, dramatically reducing recovery times and improving cluster availability.

Preflight Check Framework

Added preflight check scaffold and comprehensive design documentation for pre-job validation. This new framework enables operators to validate cluster state and prerequisites before a job starts executing, reducing the likelihood of job failures.

Event Handling Strategy Enhancements

Expanded event handling strategy support to additional components:

  • CSP Health Monitor: Event handling strategy configuration for cloud provider events
  • Kubernetes Object Monitor: Event handling strategy for Kubernetes resource events

This provides consistent, fine-grained control over event processing across all health monitoring components.

🔧 Configuration & Control Improvements

Custom Certificate Secrets

Helm charts now support custom certificate secrets, providing flexibility for organizations with existing certificate management infrastructure and specific security requirements. This enables seamless integration with enterprise PKI systems and certificate management workflows.

MongoDB Client Tracking

Added support for passing application names to MongoDB connections, enabling better client tracking and operational visibility. This helps operators understand which NVSentinel components are generating database load and simplifies troubleshooting of database performance issues.

ArgoCD Integration

Added checksum and sync-wave annotations for ArgoCD ConfigMap restarts, ensuring proper sequencing and change detection in GitOps workflows. This improves reliability when deploying NVSentinel via ArgoCD and prevents configuration drift issues.

🐛 Bug Fixes & Reliability Improvements

GPU Health Monitor Event Cache

Fixed critical race condition where the GPU health monitor event cache was updated before health events were successfully sent to platform-connector. This ensures events are not lost during transient connectivity issues and improves overall event delivery reliability.

Labeler Improvements

  • Stale Label Removal: Labeler now properly removes stale labels that no longer apply to nodes
  • Flaky Test Fixes: Resolved flaky labeler tests that were causing intermittent CI failures

Circuit Breaker Fixes

  • Cursor Mode: Fixed cursor mode handling in circuit breaker reset mechanism
  • Runbook Updates: Enhanced circuit breaker runbook with better operational guidance

Event Filtering

Fixed filtering logic in health-events-analyzer queries, ensuring events are properly matched against configured rules and improving detection accuracy.

MongoDB Authentication

Corrected authentication mechanism in MongoDB metrics URL, resolving connection issues in secured MongoDB deployments.

Fault Remediation Business Logic

Improved fault remediation to properly use controller-runtime business logic, enhancing reliability and consistency with Kubernetes controller patterns.

Nebius Cloud Reboot Handling

Fixed SendRebootSignal in Nebius provider to wait for instance stop completion before proceeding, preventing race conditions and ensuring reliable node reboots in Nebius Cloud environments.

Health Events Analyzer Test Fixes

Resolved test failures in health-events-analyzer that were causing CI pipeline issues.

🏗️ Architecture & Performance

Driver Version Dependent Parsing

Added driver version dependent parsing of NVL5 decoding rules, ensuring correct interpretation of NVLink errors across different driver versions. This improves accuracy of NVLink fault detection and reduces false positives.

🧪 Testing & Quality Improvements

UAT Test Reliability

  • Improved UAT tests reliability with better error handling and retry logic
  • Enhanced test configurability for different cluster environments
  • Better cleanup and resource management in test environments

Documentation Improvements

  • Added comprehensive preflight check design documentation
  • New alert runbook for improved operational guidance
  • Fixed typos in runbook documentation

Dependency Updates

  • Bumped google.golang.org/api from 0.259.0 to 0.260.0 in csp-health-monitor
  • Multiple security updates and dependency version bumps via dependabot

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

⚠️ Known Limitations

  • This is an experimental/preview release; use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Partial drain support requires proper GPU workload identification
  • Distributed locking requires coordination between components
  • Preflight checks are in early stages and should be thoroughly tested

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.7.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.8.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.7.1

26 Jan 20:02
v0.7.1
53b2c50

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.7.1

Release v0.7.0

20 Jan 13:11
v0.7.0
3f98a45

This release introduces advanced event processing strategies, gRPC-based remediation services, enhanced templating capabilities, and improved support for cloud service providers. We've also made significant reliability enhancements across the platform.

🎯 Major New Features

Event Processing Strategies

NVSentinel now supports configurable event processing strategies across health monitors and analyzers. The new processingStrategy field in health events allows fine-grained control over how events are handled, enabling operators to customize event processing behavior based on specific operational requirements. This feature has been implemented in:

  • GPU Health Monitor
  • Syslog Health Monitor
  • Health Events Analyzer

gRPC Remediation Service

Added a new gRPC-based remediation service that enables programmatic fault remediation operations. This provides a powerful API for external systems to integrate with NVSentinel's remediation capabilities, supporting advanced automation workflows and custom orchestration scenarios.

Enhanced Templating Support

Multi-template support in fault remediation allows using multiple notification templates for different channels and audiences. Additionally, all fields in health events can now be used for templating, providing complete flexibility in crafting notifications and alerts.
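
As a hypothetical sketch of what per-channel, field-level templating enables. The field names (.NodeName, .ErrorCode, .Message) are invented for illustration, not the actual health-event schema.

```yaml
# Hypothetical fault-remediation notification templates.
templates:
  slack: |
    :warning: Node {{ .NodeName }} reported {{ .ErrorCode }}
  email: |
    NVSentinel detected a fault on {{ .NodeName }}: {{ .Message }}
```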

Nebius Cloud Support

Added comprehensive support for Nebius Cloud (MK8s) CSP, including environment variable and secret support for node reboot operations. This expands NVSentinel's multi-cloud capabilities to include another major cloud provider.

🔧 Configuration & Control Improvements

Pod GPU Device Allocation Tracking

The metadata collector now tracks pod GPU device allocation, providing visibility into which pods are using which GPUs. This enables more informed remediation decisions and better troubleshooting of GPU-related issues.

Syslog Runtime Journal Support

Added runtime journal support to the syslog health monitor, enabling direct integration with systemd journal for more efficient log collection and processing.

PodMonitor Configuration

Helm charts now support making PodMonitor optional and configurable, providing flexibility for environments with different monitoring setups and requirements.

🐛 Bug Fixes & Reliability Improvements

MongoDB Retry Logic

Implemented retry mechanism for MongoDB write failures, improving resilience against transient database connectivity issues and ensuring health events are not lost during temporary network problems.

Janitor Reconciliation Simplification

Simplified janitor reconciliation loops for better reliability and maintainability. The refactored logic reduces complexity and improves predictability of cleanup operations.

XID 154 Case Handling

Fixed a critical case statement issue in XID 154 handling that could cause incorrect error processing. This ensures GPU recovery action changes are properly detected and remediated.

GpuNvlinkWatch Message Parsing

Fixed message parsing logic collision in GpuNvlinkWatch that was causing stale node conditions. This resolves false positives and improves the accuracy of NVLink health monitoring.

Missing RESTART_VM Action

Added the missing RESTART_VM remediation action to fault remediation configurations, ensuring all supported remediation actions are properly exposed in the configuration.

Default CSP Provider Host

Fixed missing default cspProviderHost value that was causing configuration issues in certain deployment scenarios.

Apple Silicon Demo Support

Added ARM64 (Apple Silicon) support for local fault injection demos, improving the developer experience on macOS machines with Apple Silicon processors.

🏗️ Architecture & Performance

Certificate Hot-Reloading

Added option to enable automatic certificate hot-reloading, allowing certificate updates without service restarts. The certificate watcher is now non-blocking, improving overall system responsiveness.

MongoDB Query Metrics

Added comprehensive MongoDB query metrics for better observability of database operations, enabling performance analysis and optimization of data access patterns.

Enhanced Logging

Improved logging configuration across multiple components for better consistency and observability.

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

⚠️ Known Limitations

  • This is an experimental/preview release; use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • gRPC remediation service requires proper network configuration and authentication
  • Event processing strategies should be carefully tested before production deployment

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.7.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.6.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.7.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.6.0

22 Dec 11:33
v0.6.0
82e7180

This release brings NVLink fault detection and remediation, enhanced security with certificate rotation support, and fine-grained control over health event processing. We've also started the migration of fault remediation to controller-runtime for better scalability and made significant reliability improvements across the platform.

🎯 Major New Features

NVLink XID 74 Workflow

NVSentinel now includes automated detection and remediation for XID 74 errors in the health events analyzer. XID 74 indicates NVLink hardware faults that can disrupt GPU-to-GPU communication. The workflow detects these errors and executes appropriate remediation actions to restore cluster health.

Certificate Rotation Support

The store-client module now supports automatic certificate rotation. This enables zero-downtime certificate updates in production environments, addressing security compliance requirements and operational best practices for long-running deployments.

Selective Health Event Analyzer Rules

Operators can now selectively enable or disable specific health events analyzer rules based on operational needs. This provides granular control over which error patterns trigger detection and remediation, allowing customization for different cluster configurations and workload requirements.

Health Event Property Overrides

Added capability to override specific fields in health events. This allows customization of NVSentinel's behavior to match specific operational requirements and policies.

Controller-Runtime for Fault Remediation

Migrated the fault remediation module from custom reconciliation code to the controller-runtime framework. This brings improved scalability, better resource efficiency, standardized controller patterns, and easier maintainability.

🔧 Configuration & Control Improvements

Circuit Breaker Fresh Start

Added option to reset fault quarantine state via circuit breaker ConfigMap. This enables controlled recovery scenarios where operators need to clear historical state and restart with a clean slate.

🐛 Bug Fixes & Reliability Improvements

Node Drainer Priority Handling

Fixed node drainer to ensure delete-after-timeout properly takes priority over allow-completion setting. This ensures nodes are drained within configured timeout windows even when pods don't terminate gracefully, preventing stuck drain operations.

Event Exporter Logging

Fixed an issue where the event exporter did not use the standard logging configuration.

PostgreSQL Test Stability

Fixed flaky PostgreSQL tests that were causing intermittent CI failures.

Node Condition Message Formatting

Improved the formatting of truncated node condition messages to ensure readability when messages are trimmed to fit within Kubernetes API limits.

UAT Environment Management

Improved AWS UAT environment deletion handling to prevent resource leaks and reduce costs from orphaned test infrastructure.

🏗️ Architecture & Performance

Enhanced GPU Health Monitor Logging

Unified the GPU health monitor logging format with other NVSentinel components. This provides consistent log structure across the platform, simplifying log aggregation and analysis.

Code Quality Improvements

Optimized import ordering and code organization across the codebase for better readability and maintainability.

🧪 Testing & Quality Improvements

Scale Testing Validation

Added concurrent drain operations scale tests with validation on 1500-node clusters. These tests ensure NVSentinel maintains performance and reliability characteristics at large scale.

Test Reliability Improvements

  • Fixed flaky syslog XID monitoring UAT tests
  • Resolved CSP health monitor test timeout issues

CI/CD Enhancements

  • Set explicit Go version during CI dependency installation for reproducible builds
  • Improved the Tilt installation process by using a temporary directory
  • Added GitHub Action to automatically clean up old untagged container images
  • Enhanced fork repository handling to prevent unnecessary workflow triggers

📚 Documentation Improvements

  • Enhanced Kubernetes object monitor architecture diagrams
  • Updated Slinky drain demo documentation

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

🔗 Resources

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Audit logging is disabled by default - enable explicitly when needed
  • Certificate rotation requires proper certificate management infrastructure

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.5.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.6.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.

Release v0.5.0

08 Dec 13:32
v0.5.0
6cce9d6

Release v0.5.0

This release focuses on extensibility, production hardening, and operational flexibility. We've added support for custom drain handlers, PostgreSQL as an alternative database backend, comprehensive audit logging, and expanded our XID detection and remediation capabilities.

🎯 Major New Features

Custom Drain Extensibility

NVSentinel now supports custom drain handlers, allowing integration with specialized workload orchestrators. This feature enables organizations running HPC schedulers like Slinky, big data frameworks like Volcano, or ML platforms like Ray to integrate their custom drain logic seamlessly. The release includes a complete demo environment showcasing custom drain integration.

PostgreSQL Database Backend

Added PostgreSQL as a production-grade alternative to MongoDB, providing more flexibility in database selection. This addresses licensing concerns, operational preferences, and allows better alignment with existing infrastructure standards.

Note: PostgreSQL support is experimental and is not recommended for production clusters.

Audit Logging

Comprehensive audit logging for all NVSentinel write operations enables compliance reporting, security analysis, and operational troubleshooting. Every mutation is tracked with context about what changed, when, and by which component. The structured audit logs support configurable retention and rotation, with formats ready for integration with SIEM systems.

🔧 Enhanced Fault Detection & Remediation

XID 13 & XID 31 Workflow Implementation

Added automated workflows for handling XID 13 and XID 31 GPU error conditions. These workflows help catch GPU degradation early.

XID 154 Support

Added support for detecting and handling XID 154 (GPU Recovery Action Changed) events.

Pre-Installed Driver Support

Enhanced support for environments where the GPU driver is installed outside of the GPU Operator.

🏗️ Infrastructure & Architecture Improvements

ko-based Kubernetes Object Monitor

Migrated the Kubernetes object monitor to ko-based builds, resulting in faster build times for development iterations, smaller container images with reduced attack surface, and improved supply chain security with minimal base images.

Enhanced Build System

The version field is now properly passed from build args to the Dockerfile for accurate version reporting, improving reproducibility and traceability in logs.

🐛 Bug Fixes & Reliability Improvements

Node Condition Message Limiting

Node condition messages are now automatically truncated to 1024 bytes to prevent Kubernetes API server issues with excessively large messages. This prevents edge cases where verbose error descriptions could cause API errors.

Quarantine Override Handling

Quarantine overrides are now properly applied to nodes that are already in quarantined state, ensuring manual overrides work consistently regardless of node state.

Data Model Type Safety

The recommended-action type changed from integer to string for better API clarity, type safety, and human readability in configurations.

Data Model Consistency

Replaced IGNORE with NONE throughout the data model for consistency with the canonical data schema.

Log Collector Concurrency

Improved handling of must-gather toggle and concurrent log collector job scenarios to prevent resource conflicts and ensure reliable log collection.

🧪 Testing & Quality Improvements

Enhanced Tilt Testing

Added comprehensive Tilt tests for the CSP health monitor. The tests run deterministically, without sleep-based timing, making them faster and more reliable for developers.

Scale Testing Framework

New performance and scale tests to validate NVSentinel behavior under load:

  • FQM Latency & Queue Depth: Tests for fault quarantine module performance characteristics
  • API Server & MongoDB Performance: Validation of data layer performance at scale

Log Collector Tilt Tests

Added automated Tilt tests for the log collector module, improving test coverage for critical troubleshooting workflows.

📚 Documentation Improvements

Operational Documentation

  • Datastore architecture and migration documentation
  • Comprehensive configuration reference
  • Feature documentation and user guides
  • Runbooks for common operational scenarios
  • Upgrade procedures and best practices
  • IAM setup guide for CSP health monitor
  • Documentation for pre-installed GPU driver support

🔄 Dependencies & Maintenance

Security Updates

  • Upgraded Go modules to address CVEs in dependencies
  • Bumped various dependencies to latest stable versions

CI/CD Improvements

  • Added dependabot configuration for GPU API
  • Enhanced GitHub Actions workflows
  • Improved contributor automation with copy-pr-bot updates

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

📦 What's Included

Container Images (15 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup
  • event-exporter
  • kubernetes-object-monitor

All images include the latest bug fixes, security updates, and feature enhancements from this release.

🔗 Resources

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Custom drain handlers require implementing the drain handler interface
  • PostgreSQL backend is in preview and should be thoroughly tested before production use

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.5.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.4.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.5.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.