
Conversation


@XRFXLP XRFXLP commented Jan 6, 2026

Summary

Preview: ADR-026: Feature — Preflight Checks via Init Container Injection

Type of Change

  • πŸ› Bug fix
  • ✨ New feature
  • πŸ’₯ Breaking change
  • πŸ“š Documentation
  • πŸ”§ Refactoring
  • πŸ”¨ Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Documentation
    • Added documentation for the Preflight Checks feature, including architecture, Helm configuration, supported check types (DCGM diagnostics and NCCL coordination), and failure behaviors for GPU workload deployments.


XRFXLP added 5 commits January 6, 2026 15:10
Signed-off-by: Ajay Mishra <[email protected]>
Signed-off-by: Ajay Mishra <[email protected]>
Signed-off-by: Ajay Mishra <[email protected]>
Signed-off-by: Ajay Mishra <[email protected]>
@XRFXLP XRFXLP added this to the v0.7.0 milestone Jan 6, 2026
@XRFXLP XRFXLP self-assigned this Jan 6, 2026
@XRFXLP XRFXLP added the enhancement New feature or request label Jan 6, 2026

coderabbitai bot commented Jan 6, 2026

πŸ“ Walkthrough

Walkthrough

A new Architecture Decision Record (ADR-026) documents the design for preflight checks via Mutating Admission Webhook injection into GPU pod deployments. The design specifies init container injection for NCCL gang coordination and DCGM diagnostics, including webhook configuration, check plugin contracts, configuration via Helm, and comprehensive metrics and failure behavior definitions.

Changes

  • Preflight Checks ADR (docs/designs/026-preflight-checks.md): New comprehensive Architecture Decision Record defining the preflight checks feature: Mutating Webhook approach for init container injection, workloadRef-based peer discovery, ConfigMap-based NCCL ID sharing, plugin-based check architecture (dcgm-diag, nccl-loopback, nccl-allreduce), webhook configuration, failure modes, metrics definitions, and Helm-driven configuration guidance.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 The webhook hops and checks take flight,
Init containers run through the GPU night,
NCCL friends coordinate with care,
While DCGM diagnostics fill the airβ€”
Ready pods await the dawn! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Title check: ✅ Passed. The title contains a typo ("prefilght" instead of "preflight") but is otherwise clear and directly related to the main change, which is adding a design document for preflight checks.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@XRFXLP XRFXLP linked an issue Jan 6, 2026 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI Agents
In @docs/designs/026-preflight-checks.md:
- Line 34: Update the flowchart code fence to include the Mermaid language
identifier by replacing the existing triple-backtick fence ("```") that opens
the flowchart block with "```mermaid" so the flowchart (flowchart TD ...) is
correctly recognized and highlighted; leave the block contents unchanged and
only modify the opening fence.
- Around line 373-374: The two bare URLs
"https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/"
and "https://github.com/NVIDIA/NVSentinel/issues/658" should be converted to
markdown link syntax; replace each bare URL with a descriptive link text (e.g.,
"K8s 1.35 Workload API" and "GitHub Issue: NVSentinel#658") followed by the URL
in parentheses so they render as [K8s 1.35 Workload
API](https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/)
and [NVSentinel issue #658](https://github.com/NVIDIA/NVSentinel/issues/658).
- Line 348: The phrase "Requires privileged init container for DCGM" uses a
double modal; change this line to a standard construction such as "Requires the
init container to be privileged for DCGM" or, if you prefer an imperative tone,
"Init container must be privileged for DCGM" β€” replace the exact string
"Requires privileged init container for DCGM" with one of these alternatives in
the docs/designs/026-preflight-checks.md content.
- Line 352: Fix the typo in the docs string "Latency: Use DCGM level 1 (~30s)
instead of level 2 (~2-3min); skip expensive checks for non-critica workloads"
by replacing "non-critica" with "non-critical" so the sentence reads "...skip
expensive checks for non-critical workloads"; update the occurrence in the
docs/designs/026-preflight-checks.md content where that phrase appears.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd4466e and 9de7c2a.

📒 Files selected for processing (1)
  • docs/designs/026-preflight-checks.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:114-154
Timestamp: 2025-12-01T17:54:40.682Z
Learning: Error handling, security configurations, and other implementation details in code stubs within design documents (docs/designs/) in the NVSentinel repository can be simplified or omitted, as these are illustrative examples to convey architecture rather than production code. Actual implementation will be addressed in implementation PRs.
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
🪛 LanguageTool
docs/designs/026-preflight-checks.md

[style] ~348-~348: The double modal "Requires privileged" is nonstandard (only accepted in certain dialects). Consider "to be privileged".
Context: ... latency (DCGM diag level 1) - Requires privileged init container for DCGM - Webhook downt...

(NEEDS_FIXED)


[grammar] ~352-~352: Ensure spelling is correct
Context: ...M level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critica...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
docs/designs/026-preflight-checks.md

34-34: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


373-373: Bare URL used

(MD034, no-bare-urls)


374-374: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (5)
docs/designs/026-preflight-checks.md (5)

3-29: Context and decision are well-articulated.

The comparison table clearly differentiates preflight checks from existing health monitors, and the decision leverages Kubernetes 1.35's workloadRef for gang coordination. Kubernetes v1.35 introduces the Workload API with an initial implementation of gang scheduling that instructs the kube-scheduler to schedule gang Pods in an all-or-nothing fashion.


32-89: Component structure and webhook flow are clear.

The separation of concerns between the injector (webhook) and checker (init container image) is well-defined. The flowchart accurately represents the conditional injection logic.


91-149: Webhook configuration and init container spec are appropriate.

The MutatingWebhookConfiguration correctly filters for GPU pods via resource requests, uses namespace selectors for opt-in control, and sets failurePolicy: Fail for strict validation. The init container spec appropriately copies GPU resources from the main container and mounts required volumes for DCGM and Platform Connector access.
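
For readers without the ADR open, a minimal sketch of the registration shape being described is below. Every name, label, and service reference in it is a placeholder rather than a value taken from the ADR, and the nvidia.com/gpu filter itself has to live in the webhook handler, since admission rules cannot match on container resource requests.

```yaml
# Illustrative sketch only; object names, labels, and the service reference are
# hypothetical, not the ADR's actual configuration.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: nvsentinel-preflight-injector          # hypothetical name
webhooks:
  - name: preflight.nvsentinel.example.com     # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                        # reject pod creation if the webhook is unavailable
    namespaceSelector:                         # opt-in per namespace
      matchLabels:
        preflight-checks: "enabled"            # hypothetical opt-in label
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: preflight-injector               # hypothetical Service backing the webhook
        namespace: nvsentinel
        path: /mutate
# The nvidia.com/gpu request check happens inside the handler before any patch
# is emitted; pods without GPU requests pass through unmodified.
```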


151-239: Plugin interface and gang coordination are well-designed.

The plugin contract with exit codes and HealthEvent integration maintains consistency with existing NVSentinel patterns. Gang coordination via workloadRef, peer discovery through alphabetical sorting, and ConfigMap-based NCCL ID sharing are sound. The distinction between hardware failures (isFatal: true) and timeout/config errors (isFatal: false) is correct.
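
To make the ConfigMap-based NCCL ID sharing concrete, a rough sketch of the shared object is below; the name, label, and data keys are assumptions about its shape, not taken from the ADR.

```yaml
# Hypothetical shape of the shared ConfigMap; the name, label, and keys are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: preflight-nccl-id-training-job-a       # hypothetical, derived from the workloadRef name
  namespace: training
  labels:
    app.kubernetes.io/managed-by: nvsentinel-preflight   # hypothetical label
data:
  ncclUniqueId: "<NCCL unique ID written by the rank-0 init container>"
  worldSize: "16"                               # gang size discovered via workloadRef
```

Under this shape, the rank-0 pod (first in the alphabetical peer sort) would create the ConfigMap, and the remaining gang members would poll for it until it appears or gangTimeout expires.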


242-375: Failure behavior, error mapping, metrics, and design rationale are comprehensive.

The failure behavior spec correctly maps exit codes to pod states and HealthEvent creation for integration with existing workflows. Error-to-action mappings distinguish hardware failures (fatal) from transient issues. Helm values support flexible configuration with sensible defaults. Metrics cover both the init container runner and the webhook injector. The consequences section acknowledges tradeoffs and provides mitigations. Alternatives are briefly justified.
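
Since the exit-code mapping is central to that integration, a sketch of one possible mapping is below; the specific codes and actions are inferred from this summary (hardware failures fatal, timeouts non-fatal) and are assumptions, not the ADR's exact table.

```yaml
# Hedged sketch of an exit-code-to-behavior mapping; the concrete codes and
# actions are assumptions, not copied from the ADR.
exitCodeBehavior:
  0:
    meaning: all preflight checks passed
    action: init container exits; main containers start normally
  1:
    meaning: hardware failure detected (e.g. DCGM diag failure)
    action: emit HealthEvent with isFatal true; pod remains in an init error state
  2:
    meaning: timeout or configuration error
    action: emit HealthEvent with isFatal false; handled according to the failure policy
```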


### Distinction from Health Monitors

NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. This is different:

Contributor:

Did we consider running these tests post-remediation, while a node is still cordoned? We're tying our preflight check ability to customer workloads, which may not have comprehensive coverage of all GPUs (specifically if their pod requests a subset of the GPUs on a given node). We could delay the fault-quarantine-module uncordoning a node until the corresponding health monitor sends a healthy event and this preflight check runs while the node is still cordoned. We'd have to put more thought into how this would work for multi-node tests, but this would be a simpler approach for the single-node dcgm-diag + nccl-loopback checks, while also ensuring that we run these preflight checks on any unhealthy GPUs we've identified without modifying customer workloads.

value: "300s"
- name: GANG_TIMEOUT
value: "600s"
resources:

Contributor:

Will the NCCL tests require the init container to have access to InfiniBand-related device plugins?
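
If they do, the injected init container would presumably need the same RDMA/IB resources and capabilities as the main container. A sketch under that assumption is below; rdma/ib is a placeholder for whatever RDMA device-plugin resource the cluster actually exposes, and the container name and image are hypothetical.

```yaml
# Sketch only: resources and capabilities the injector might have to copy for
# IB-backed NCCL checks. rdma/ib is a placeholder device-plugin resource name.
initContainers:
  - name: nvsentinel-preflight                 # hypothetical container name
    image: nvsentinel/preflight-checker        # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 8                      # copied from the main container
        rdma/ib: 1                             # placeholder; actual name is cluster-specific
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]                      # commonly required for RDMA memory registration
```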

value: "600s"
resources:
limits:
nvidia.com/gpu: 8 # Copied from main container

Contributor:

There may be more complex scenarios where other initContainers request GPUs and multiple primary containers all request GPUs. My understanding is that the scheduler will place a pod on a node that can satisfy max(largest initContainer resource request, sum of all primary container requests). Wouldn't we want to take this maximum value to ensure we're capturing all GPUs that the pod may use across either its init or primary containers?
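
A small example of that scenario, with the scheduler's effective request noted in comments; the container names are hypothetical.

```yaml
# Sketch: the effective scheduling request is
# max(largest init container request, sum of app container requests),
# so the injected check container needs 8 GPUs here, not just one container's 4.
spec:
  initContainers:
    - name: nvsentinel-preflight               # injected check container (hypothetical name)
      resources:
        limits:
          nvidia.com/gpu: 8                    # sum of the app containers below (4 + 4)
  containers:
    - name: trainer-a
      resources:
        limits:
          nvidia.com/gpu: 4
    - name: trainer-b
      resources:
        limits:
          nvidia.com/gpu: 4
```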


```yaml
dcgmDiagLevel: 1 # 1 (quick, ~30s) or 2 (medium, ~2-3min)
checkTimeout: "300s" # Per-check timeout
gangTimeout: "600s" # Gang coordination timeout
```

Contributor:

Will the overall timeout for the nccl-allreduce test be 900s? How do we communicate this to customers who may be sensitive to pod start-up times?
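
For reference, a back-of-the-envelope bound under the assumption that the per-check timeout and the gang-coordination timeout are additive for nccl-allreduce; the excerpt above does not say whether one encompasses the other.

```yaml
# Assumption: checkTimeout and gangTimeout are additive for nccl-allreduce.
# checkTimeout: 300s   (upper bound on the check itself)
# gangTimeout:  600s   (upper bound on waiting for all gang members)
# worst case  ~ 300s + 600s = 900s added to pod start-up;
# the happy path should be far shorter (DCGM level 1 alone is ~30s).
```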


### Gang Coordination

For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`:

Contributor:

Implementation note: AFAIK, the NCCL all-reduce test can disrupt other workloads already running on the cluster by consuming network bandwidth (when the mutating webhook injects an init container into a new workload pod, older workload pods may already be running on the cluster). We would need to run the all-reduce test in a lightweight manner to minimize the impact on existing workloads.
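
One way to bound that impact is to keep the check's transfer sizes and iteration counts small; a sketch of checker environment variables is below. The ALLREDUCE_* knob names are hypothetical (not existing NVSentinel settings), while NCCL_ALGO is a standard NCCL environment variable.

```yaml
# Hypothetical checker knobs for a lightweight gang all-reduce; only NCCL_ALGO
# is a real NCCL environment variable, the rest are illustrative.
- name: ALLREDUCE_MIN_BYTES
  value: "8"
- name: ALLREDUCE_MAX_BYTES
  value: "16M"          # small enough to validate connectivity without saturating links
- name: ALLREDUCE_ITERS
  value: "5"            # a handful of iterations confirms the ring forms and completes
- name: NCCL_ALGO
  value: "Ring"         # pin a predictable algorithm for a reproducible check
```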


Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

[Feature]: Support for preflight/prolog checks

3 participants