docs: Add design doc for preflight check #662
Conversation
Signed-off-by: Ajay Mishra <[email protected]>
Walkthrough

A new Architecture Decision Record (ADR-026) documents the design for preflight checks via Mutating Admission Webhook injection into GPU pod deployments. The design specifies init container injection for NCCL gang coordination and DCGM diagnostics, including webhook configuration, check plugin contracts, configuration via Helm, and comprehensive metrics and failure behavior definitions.
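As a concrete (hypothetical) illustration of the injection step this walkthrough describes, the Go sketch below shows how a webhook handler might prepend a preflight init container that inherits a pod's GPU limits. The package name, image reference, and function name are assumptions for illustration, not NVSentinel's actual code.

```go
// Hypothetical sketch of the injection step; image and names are placeholders.
package webhook

import (
	corev1 "k8s.io/api/core/v1"
)

// preflightImage stands in for the real checker image reference.
const preflightImage = "example.com/nvsentinel/preflight-checker:latest"

// injectPreflight prepends a preflight init container that copies the GPU
// limit of the first GPU-requesting container, so the checks exercise the
// same GPUs the workload will be granted.
func injectPreflight(pod *corev1.Pod) {
	for _, c := range pod.Spec.Containers {
		gpu, ok := c.Resources.Limits["nvidia.com/gpu"]
		if !ok {
			continue
		}
		init := corev1.Container{
			Name:  "nvsentinel-preflight",
			Image: preflightImage,
			Resources: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{"nvidia.com/gpu": gpu},
			},
		}
		pod.Spec.InitContainers = append([]corev1.Container{init}, pod.Spec.InitContainers...)
		return
	}
}
```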
Estimated code review effort: 4 (Complex) | ~50 minutes
Pre-merge checks and finishing touches: 3 checks passed.
Actionable comments posted: 4
Fix all issues with AI Agents
In @docs/designs/026-preflight-checks.md:
- Line 34: Update the flowchart code fence to include the Mermaid language
identifier by replacing the existing triple-backtick fence ("```") that opens
the flowchart block with "```mermaid" so the flowchart (flowchart TD ...) is
correctly recognized and highlighted; leave the block contents unchanged and
only modify the opening fence.
- Around line 373-374: The two bare URLs
"https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/"
and "https://github.com/NVIDIA/NVSentinel/issues/658" should be converted to
markdown link syntax; replace each bare URL with a descriptive link text (e.g.,
"K8s 1.35 Workload API" and "GitHub Issue: NVSentinel#658") followed by the URL
in parentheses so they render as [K8s 1.35 Workload
API](https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/)
and [NVSentinel issue #658](https://github.com/NVIDIA/NVSentinel/issues/658).
- Line 348: The phrase "Requires privileged init container for DCGM" uses a
double modal; change this line to a standard construction such as "Requires the
init container to be privileged for DCGM" or, if you prefer an imperative tone,
"Init container must be privileged for DCGM" β replace the exact string
"Requires privileged init container for DCGM" with one of these alternatives in
the docs/designs/026-preflight-checks.md content.
- Line 352: Fix the typo in the docs string "Latency: Use DCGM level 1 (~30s)
instead of level 2 (~2-3min); skip expensive checks for non-critica workloads"
by replacing "non-critica" with "non-critical" so the sentence reads "...skip
expensive checks for non-critical workloads"; update the occurrence in the
docs/designs/026-preflight-checks.md content where that phrase appears.
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (1)
docs/designs/026-preflight-checks.md
Additional context used
Learnings (1)
Common learnings
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:114-154
Timestamp: 2025-12-01T17:54:40.682Z
Learning: Error handling, security configurations, and other implementation details in code stubs within design documents (docs/designs/) in the NVSentinel repository can be simplified or omitted, as these are illustrative examples to convey architecture rather than production code. Actual implementation will be addressed in implementation PRs.
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
LanguageTool
docs/designs/026-preflight-checks.md
[style] ~348-~348: The double modal "Requires privileged" is nonstandard (only accepted in certain dialects). Consider "to be privileged".
Context: ... latency (DCGM diag level 1) - Requires privileged init container for DCGM - Webhook downt...
(NEEDS_FIXED)
[grammar] ~352-~352: Ensure spelling is correct
Context: ...M level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critica...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
markdownlint-cli2 (0.18.1)
docs/designs/026-preflight-checks.md
34-34: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
373-373: Bare URL used
(MD034, no-bare-urls)
374-374: Bare URL used
(MD034, no-bare-urls)
Additional comments (5)
docs/designs/026-preflight-checks.md (5)
3-29: Context and decision are well-articulated. The comparison table clearly differentiates preflight checks from existing health monitors, and the decision leverages Kubernetes 1.35's `workloadRef` for gang coordination. Kubernetes v1.35 introduces the Workload API with an initial implementation of gang scheduling that instructs the kube-scheduler to schedule gang Pods in an all-or-nothing fashion.
32-89: Component structure and webhook flow are clear. The separation of concerns between the injector (webhook) and checker (init container image) is well-defined. The flowchart accurately represents the conditional injection logic.
91-149: Webhook configuration and init container spec are appropriate. The `MutatingWebhookConfiguration` correctly filters for GPU pods via resource requests, uses namespace selectors for opt-in control, and sets `failurePolicy: Fail` for strict validation. The init container spec appropriately copies GPU resources from the main container and mounts required volumes for DCGM and Platform Connector access.
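For concreteness, here is a minimal sketch (in Go, using the admissionregistration/v1 types) of the kind of opt-in configuration this comment describes. The webhook name, namespace label key, and the omission of `clientConfig` are illustrative assumptions, not the ADR's actual manifest.

```go
package webhook

import (
	admissionv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// webhookConfig sketches an opt-in mutating webhook: only namespaces carrying
// the enable label are mutated, and failurePolicy Fail rejects pod creation
// when a check injection cannot be performed. ClientConfig, sideEffects, and
// admissionReviewVersions are omitted for brevity.
func webhookConfig() *admissionv1.MutatingWebhookConfiguration {
	fail := admissionv1.Fail
	return &admissionv1.MutatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "preflight-injector"},
		Webhooks: []admissionv1.MutatingWebhook{{
			Name:          "preflight.nvsentinel.example.com", // placeholder name
			FailurePolicy: &fail,
			NamespaceSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"preflight-checks": "enabled"}, // assumed label key
			},
			Rules: []admissionv1.RuleWithOperations{{
				Operations: []admissionv1.OperationType{admissionv1.Create},
				Rule: admissionv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"pods"},
				},
			}},
		}},
	}
}
```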
151-239: Plugin interface and gang coordination are well-designed. The plugin contract with exit codes and HealthEvent integration maintains consistency with existing NVSentinel patterns. Gang coordination via `workloadRef`, peer discovery through alphabetical sorting, and ConfigMap-based NCCL ID sharing is sound. The distinction between hardware failures (`isFatal: true`) and timeout/config errors (`isFatal: false`) is correct.
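A sketch of the peer-discovery step just described, assuming the workload reference is surfaced as a pod label (the `workload-ref` label key and function name are hypothetical). Peers are sorted alphabetically to derive a stable rank; rank 0 would then generate the NCCL unique ID and publish it in a ConfigMap for the other ranks to read.

```go
package checker

import (
	"context"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// discoverRank lists the pods sharing this pod's workload reference, sorts
// their names alphabetically, and derives a stable rank and the gang size.
// "workload-ref" is a placeholder for however workloadRef is exposed on pods.
func discoverRank(ctx context.Context, cs kubernetes.Interface, ns, self, workload string) (int, int, error) {
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "workload-ref=" + workload,
	})
	if err != nil {
		return 0, 0, err
	}
	names := make([]string, 0, len(pods.Items))
	for _, p := range pods.Items {
		names = append(names, p.Name)
	}
	sort.Strings(names)
	rank := 0
	for i, n := range names {
		if n == self {
			rank = i
		}
	}
	return rank, len(names), nil
}
```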
242-375: Failure behavior, error mapping, metrics, and design rationale are comprehensive. The failure behavior spec correctly maps exit codes to pod states and HealthEvent creation for integration with existing workflows. Error-to-action mappings distinguish hardware failures (fatal) from transient issues. Helm values support flexible configuration with sensible defaults. Metrics cover both the init container runner and the webhook injector. The consequences section acknowledges tradeoffs and provides mitigations. Alternatives are briefly justified.
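To make the exit-code-to-HealthEvent mapping concrete, here is a sketch under assumed exit-code values and an assumed event shape (the ADR's actual contract may differ): hardware faults yield a fatal event, while timeouts and configuration errors yield non-fatal ones.

```go
package checker

// HealthEvent approximates the event shape discussed above; fields are assumed.
type HealthEvent struct {
	Check   string
	IsFatal bool
	Message string
}

// Assumed exit codes for the check runner; the real contract may differ.
const (
	exitOK       = 0
	exitHardware = 1 // hardware fault detected -> fatal
	exitTimeout  = 2 // check or gang timeout -> non-fatal
	exitConfig   = 3 // misconfiguration -> non-fatal
)

// eventFor maps a check's exit code to a HealthEvent, distinguishing
// hardware failures (fatal) from transient or configuration errors.
func eventFor(check string, code int) *HealthEvent {
	switch code {
	case exitOK:
		return nil
	case exitHardware:
		return &HealthEvent{Check: check, IsFatal: true, Message: "hardware fault"}
	default:
		return &HealthEvent{Check: check, IsFatal: false, Message: "transient or configuration error"}
	}
}
```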
> ### Distinction from Health Monitors
>
> NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. This is different:
Did we consider running these tests post-remediation while a node is still cordoned? We're tying our preflight check ability to customer workloads, which may not have comprehensive coverage of all GPUs (specifically if their pod requests a subset of the GPUs on a given node). We could delay fault-quarantine-module uncordoning a node until the corresponding health monitor sends a healthy event and this preflight check runs while the node is still cordoned. We'd have to put more thought into how this would work for multi-node tests, but this would be a simpler approach for the single-node dcgm-diag + nccl-loopback checks, while also ensuring that we run these preflight checks on any unhealthy GPUs we've identified without modifying customer workloads.
| value: "300s" | ||
| - name: GANG_TIMEOUT | ||
| value: "600s" | ||
| resources: |
Will the NCCL tests require that the initContainer has access to InfiniBand-related device plugins?
| value: "600s" | ||
| resources: | ||
| limits: | ||
| nvidia.com/gpu: 8 # Copied from main container |
There may be more complex scenarios where other initContainers are requesting GPUs and there might be multiple primary containers all requesting GPUs. My understanding is that the scheduler will place a pod on a node that can satisfy max(largest initContainer resource request, sum of all primary container requests). Wouldn't we want to take this maximum value to ensure we're capturing all GPUs that the pod may use across either its init or primary containers?
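A sketch of the maximum proposed here, computed over `nvidia.com/gpu` limits (the function name is illustrative; for extended resources like GPUs, requests and limits must match, so limits are used):

```go
package webhook

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// effectiveGPURequest returns max(largest init container GPU limit,
// sum of all primary container GPU limits), mirroring how the scheduler
// sizes a pod's effective resource request.
func effectiveGPURequest(pod *corev1.Pod) resource.Quantity {
	var sum, maxInit resource.Quantity
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			sum.Add(q)
		}
	}
	for _, c := range pod.Spec.InitContainers {
		if q, ok := c.Resources.Limits["nvidia.com/gpu"]; ok && q.Cmp(maxInit) > 0 {
			maxInit = q
		}
	}
	if maxInit.Cmp(sum) > 0 {
		return maxInit
	}
	return sum
}
```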
> dcgmDiagLevel: 1      # 1 (quick, ~30s) or 2 (medium, ~2-3min)
> checkTimeout: "300s"  # Per-check timeout
> gangTimeout: "600s"   # Gang coordination timeout
Will the overall timeout for the nccl-allreduce test be 900s? How do we communicate this to customers who may be sensitive to pod start-up times?
> ### Gang Coordination
>
> For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`:
Implementation note: AFAIK the NCCL all-reduce test can disrupt other workloads already running on the cluster by consuming network bandwidth (when the mutating webhook injects an init container into a new workload pod, some older workload pods might already be running). We would need to ensure that we run the all-reduce test in a lightweight manner so as to minimize the impact on existing workloads.
Summary
Preview: ADR-026: Feature - Preflight Checks via Init Container Injection