Release v1.7.0
This release fixes a fault-remediation bug where every historical cancellation replayed on every restart (eventually causing OOM kills), adds a Helm gate to disable the external-MongoDB setup job for tenants who provision the database themselves, brings the docs site onto NVIDIA's shared Fern global theme, and ships a large set of GitHub repository automation workflows.
Major New Features
External MongoDB Setup Job Gate (#1311)
The post-install/post-upgrade hook job that provisions collections, indexes, and x509 users on external MongoDB can now be disabled independently of the external-MongoDB configuration. Set global.datastore.setupJob.enabled: false to opt out — useful for deployments where the datastore is provisioned out-of-band and the setup job's auth requirements don't match the tenant identity. Defaults to true, so existing deployments are unaffected.
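A minimal sketch of opting out on an existing install, assuming the release name, chart reference, and namespace shown in the Helm Chart section of these notes. The `disable_setup_job` function and the `HELM` override variable are hypothetical helpers added for illustration; only the `global.datastore.setupJob.enabled` key comes from this release.

```shell
# HELM may be overridden (for example HELM=echo) to preview the command
# without contacting a cluster. This variable is an illustrative assumption.
HELM="${HELM:-helm}"

# Hypothetical helper: upgrade in place, keep previously supplied values,
# and turn off only the external-MongoDB setup hook job.
disable_setup_job() {
  "$HELM" upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
    --version v1.7.0 \
    --namespace nvsentinel \
    --reuse-values \
    --set global.datastore.setupJob.enabled=false
}
```

Setting `HELM=echo` and then calling `disable_setup_job` prints the arguments instead of running the upgrade, which makes it possible to review the command first.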
Repository Automation Workflows (#1306)
Adds a suite of GitHub Actions workflows and issue templates for repository hygiene:
- Merge conflict check: runs on PR creation and main push; adds a needs-rebase label when a PR diverges from main.
- Dependabot auto-merge: auto-merges Dependabot PRs that contain only semver-patch updates.
- Issue triage: applies needs-triage and area/* labels to new issues.
- Labeler: applies area/* labels to PRs based on the paths touched.
- Welcome: posts a templated message on first-time contributors' issues and PRs.
- Inactive PR reminder: comments on PRs that have been inactive for 14–30 days.
- Issue SLAs: labels and comments on issues that have breached priority-tiered SLAs.
- Lock threads: locks closed issues and PRs after 90 days.
New issue templates for documentation requests and updates are added; the Question template is removed in favor of Discussions; the Bug/Feature templates now require a contributor agreement checkbox and add a component selector.
Bug Fixes & Reliability
- Fault-Remediation Cancellation Completion Marker (#1335): Fixed a bug where handleCancellationEvent cleared Kubernetes annotations and advanced the change-stream resume token but never wrote faultRemediated back to MongoDB, while the cold-start cancellation query had no faultremediated == nil filter. Together this meant every historical cancellation replayed on every fault-remediation restart, so the replayed set grew monotonically and eventually caused OOM kills. The fix:
  - handleCancellationEvent now calls updateNodeRemediatedStatus(true) after clearing annotations, writing the same completion marker the remediation path already writes.
  - The cold-start cancellation query leg now requires faultremediated == nil, so already-processed cancellations are excluded.
  - The call returns an error (rather than just logging) if the marker write fails, preventing the resume token from advancing without a durable terminal state.
- Slinky Drainer Annotation Prefix (#1318): Corrected the node annotation prefix used by the Slinky Drainer plugin from [J] [NVSentinel] to [T] [NVSentinel] so automated breakfix is detected with the expected T prefix. Demo documentation updated to match.
Docs Site
- NVIDIA Global Theme (#1320, #1321): Migrated the Fern docs site from per-repo theme assets to the shared global-theme: nvidia, deleting ~1,126 lines of custom theme code (footer/badge components, NVIDIA SVGs, main.css, and the footer/layout/colors/theme/logo/favicon/js/css blocks in docs.yml). Added multi-source: true to the Fern instance config so the global theme's JS bundle (OneTrust cookie consent SDK) loads alongside the CSS portion. Fern CLI was bumped to 5.30.2 (required for global-theme support).
- Frozen-Only Versioning (#1319, #1315): All versions in the docs dropdown now serve frozen content from their git tag; the "live docs" entry served from main has been removed. The newest version is stamped "Latest · vX.Y.Z" transiently at publish time. This eliminates duplicate dropdown entries, off-by-one pruning, and the dependency on the GitHub releases API for stamping. Version entries are sorted by semver descending (sort -rV) after insertion, so backport patches like v1.5.1 don't end up above newer releases; registration is now skipped when the publishing tag equals the latest release (the "Latest" stamp already covers it).
- CI Runner Migration (#1324): Standardized CI runners onto a dedicated linux-amd64-cpu4 flavor to unblock Dependabot PR merging.
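The descending version sort mentioned under Frozen-Only Versioning can be illustrated with a small sketch. The tag list is hypothetical, and this is not the publishing script itself. It only shows how sort -rV (version sort, available in GNU coreutils) reorders entries after a backport tag has been inserted at the top.

```shell
# Hypothetical dropdown entries: the backport v1.5.1 was inserted last
# and therefore sits above newer releases.
versions='v1.5.1
v1.7.0
v1.6.0
v1.5.0'

# -V compares version components numerically; -r reverses to newest-first.
sorted="$(printf '%s\n' "$versions" | sort -rV)"
printf '%s\n' "$sorted"
```

After sorting, v1.5.1 lands between v1.6.0 and v1.5.0 rather than at the top of the list.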
Acknowledgments
This release includes contributions from:
Thanks also to @rohansav for diagnosing and authoring the cancellation completion marker fix that was cherry-picked into #1335.
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.7.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v1.6.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.7.0 \
--namespace nvsentinel \
--reuse-values