docs: Add design doc for preflight check #662
Conversation
Signed-off-by: Ajay Mishra <[email protected]>
Walkthrough

A new Architecture Decision Record (ADR-026) documents the design for preflight checks via Mutating Admission Webhook injection into GPU pod deployments. The design specifies init container injection for NCCL gang coordination and DCGM diagnostics, including webhook configuration, check plugin contracts, configuration via Helm, and comprehensive metrics and failure behavior definitions.
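As a concrete (hypothetical) illustration of the injection step this walkthrough describes, the Go sketch below shows how a webhook handler might prepend a preflight init container that inherits a pod's GPU limits. The package name, image reference, and function name are assumptions for illustration, not NVSentinel's actual code.

```go
// Hypothetical sketch of the injection step; image and names are placeholders.
package webhook

import (
	corev1 "k8s.io/api/core/v1"
)

// preflightImage stands in for the real checker image reference.
const preflightImage = "example.com/nvsentinel/preflight-checker:latest"

// injectPreflight prepends a preflight init container that copies the GPU
// limit of the first GPU-requesting container, so the checks exercise the
// same GPUs the workload will be granted.
func injectPreflight(pod *corev1.Pod) {
	for _, c := range pod.Spec.Containers {
		gpu, ok := c.Resources.Limits["nvidia.com/gpu"]
		if !ok {
			continue
		}
		init := corev1.Container{
			Name:  "nvsentinel-preflight",
			Image: preflightImage,
			Resources: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{"nvidia.com/gpu": gpu},
			},
		}
		pod.Spec.InitContainers = append([]corev1.Container{init}, pod.Spec.InitContainers...)
		return
	}
}
```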
Estimated code review effort: 4 (Complex) | ~50 minutes
Pre-merge checks and finishing touches: 3 checks passed.
Actionable comments posted: 4
Fix all issues with AI Agents
In @docs/designs/026-preflight-checks.md:
- Line 34: Update the flowchart code fence to include the Mermaid language
identifier by replacing the existing triple-backtick fence ("```") that opens
the flowchart block with "```mermaid" so the flowchart (flowchart TD ...) is
correctly recognized and highlighted; leave the block contents unchanged and
only modify the opening fence.
- Around line 373-374: The two bare URLs
"https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/"
and "https://github.com/NVIDIA/NVSentinel/issues/658" should be converted to
markdown link syntax; replace each bare URL with a descriptive link text (e.g.,
"K8s 1.35 Workload API" and "GitHub Issue: NVSentinel#658") followed by the URL
in parentheses so they render as [K8s 1.35 Workload
API](https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/)
and [NVSentinel issue #658](https://github.com/NVIDIA/NVSentinel/issues/658).
- Line 348: The phrase "Requires privileged init container for DCGM" uses a
double modal; change this line to a standard construction such as "Requires the
init container to be privileged for DCGM" or, if you prefer an imperative tone,
"Init container must be privileged for DCGM" β replace the exact string
"Requires privileged init container for DCGM" with one of these alternatives in
the docs/designs/026-preflight-checks.md content.
- Line 352: Fix the typo in the docs string "Latency: Use DCGM level 1 (~30s)
instead of level 2 (~2-3min); skip expensive checks for non-critica workloads"
by replacing "non-critica" with "non-critical" so the sentence reads "...skip
expensive checks for non-critical workloads"; update the occurrence in the
docs/designs/026-preflight-checks.md content where that phrase appears.
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (1)
docs/designs/026-preflight-checks.md
Additional context used
Learnings (1)
Common learnings
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:114-154
Timestamp: 2025-12-01T17:54:40.682Z
Learning: Error handling, security configurations, and other implementation details in code stubs within design documents (docs/designs/) in the NVSentinel repository can be simplified or omitted, as these are illustrative examples to convey architecture rather than production code. Actual implementation will be addressed in implementation PRs.
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
LanguageTool
docs/designs/026-preflight-checks.md
[style] ~348-~348: The double modal "Requires privileged" is nonstandard (only accepted in certain dialects). Consider "to be privileged".
Context: ... latency (DCGM diag level 1) - Requires privileged init container for DCGM - Webhook downt...
(NEEDS_FIXED)
[grammar] ~352-~352: Ensure spelling is correct
Context: ...M level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critica...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
markdownlint-cli2 (0.18.1)
docs/designs/026-preflight-checks.md
34-34: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
373-373: Bare URL used
(MD034, no-bare-urls)
374-374: Bare URL used
(MD034, no-bare-urls)
Additional comments (5)
docs/designs/026-preflight-checks.md (5)
3-29: Context and decision are well-articulated. The comparison table clearly differentiates preflight checks from existing health monitors, and the decision leverages Kubernetes 1.35's `workloadRef` for gang coordination. Kubernetes v1.35 introduces the Workload API with an initial implementation of gang scheduling that instructs the kube-scheduler to schedule gang Pods in an all-or-nothing fashion.
32-89: Component structure and webhook flow are clear. The separation of concerns between the injector (webhook) and checker (init container image) is well-defined. The flowchart accurately represents the conditional injection logic.
91-149: Webhook configuration and init container spec are appropriate. The `MutatingWebhookConfiguration` correctly filters for GPU pods via resource requests, uses namespace selectors for opt-in control, and sets `failurePolicy: Fail` for strict validation. The init container spec appropriately copies GPU resources from the main container and mounts required volumes for DCGM and Platform Connector access.
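For concreteness, here is a minimal sketch (in Go, using the admissionregistration/v1 types) of the kind of opt-in configuration this comment describes. The webhook name, namespace label key, and the omission of `clientConfig` are illustrative assumptions, not the ADR's actual manifest.

```go
package webhook

import (
	admissionv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// webhookConfig sketches an opt-in mutating webhook: only namespaces carrying
// the enable label are mutated, and failurePolicy Fail rejects pod creation
// when a check injection cannot be performed. ClientConfig, sideEffects, and
// admissionReviewVersions are omitted for brevity.
func webhookConfig() *admissionv1.MutatingWebhookConfiguration {
	fail := admissionv1.Fail
	return &admissionv1.MutatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "preflight-injector"},
		Webhooks: []admissionv1.MutatingWebhook{{
			Name:          "preflight.nvsentinel.example.com", // placeholder name
			FailurePolicy: &fail,
			NamespaceSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"preflight-checks": "enabled"}, // assumed label key
			},
			Rules: []admissionv1.RuleWithOperations{{
				Operations: []admissionv1.OperationType{admissionv1.Create},
				Rule: admissionv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"pods"},
				},
			}},
		}},
	}
}
```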
151-239: Plugin interface and gang coordination are well-designed. The plugin contract with exit codes and HealthEvent integration maintains consistency with existing NVSentinel patterns. Gang coordination via `workloadRef`, peer discovery through alphabetical sorting, and ConfigMap-based NCCL ID sharing is sound. The distinction between hardware failures (`isFatal: true`) and timeout/config errors (`isFatal: false`) is correct.
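A sketch of the peer-discovery step just described, assuming the workload reference is surfaced as a pod label (the `workload-ref` label key and function name are hypothetical). Peers are sorted alphabetically to derive a stable rank; rank 0 would then generate the NCCL unique ID and publish it in a ConfigMap for the other ranks to read.

```go
package checker

import (
	"context"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// discoverRank lists the pods sharing this pod's workload reference, sorts
// their names alphabetically, and derives a stable rank and the gang size.
// "workload-ref" is a placeholder for however workloadRef is exposed on pods.
func discoverRank(ctx context.Context, cs kubernetes.Interface, ns, self, workload string) (int, int, error) {
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "workload-ref=" + workload,
	})
	if err != nil {
		return 0, 0, err
	}
	names := make([]string, 0, len(pods.Items))
	for _, p := range pods.Items {
		names = append(names, p.Name)
	}
	sort.Strings(names)
	rank := 0
	for i, n := range names {
		if n == self {
			rank = i
		}
	}
	return rank, len(names), nil
}
```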
242-375: Failure behavior, error mapping, metrics, and design rationale are comprehensive. The failure behavior spec correctly maps exit codes to pod states and HealthEvent creation for integration with existing workflows. Error-to-action mappings distinguish hardware failures (fatal) from transient issues. Helm values support flexible configuration with sensible defaults. Metrics cover both the init container runner and the webhook injector. The consequences section acknowledges tradeoffs and provides mitigations. Alternatives are briefly justified.
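To make the exit-code-to-HealthEvent mapping concrete, here is a sketch under assumed exit-code values and an assumed event shape (the ADR's actual contract may differ): hardware faults yield a fatal event, while timeouts and configuration errors yield non-fatal ones.

```go
package checker

// HealthEvent approximates the event shape discussed above; fields are assumed.
type HealthEvent struct {
	Check   string
	IsFatal bool
	Message string
}

// Assumed exit codes for the check runner; the real contract may differ.
const (
	exitOK       = 0
	exitHardware = 1 // hardware fault detected -> fatal
	exitTimeout  = 2 // check or gang timeout -> non-fatal
	exitConfig   = 3 // misconfiguration -> non-fatal
)

// eventFor maps a check's exit code to a HealthEvent, distinguishing
// hardware failures (fatal) from transient or configuration errors.
func eventFor(check string, code int) *HealthEvent {
	switch code {
	case exitOK:
		return nil
	case exitHardware:
		return &HealthEvent{Check: check, IsFatal: true, Message: "hardware fault"}
	default:
		return &HealthEvent{Check: check, IsFatal: false, Message: "transient or configuration error"}
	}
}
```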
> ### Distinction from Health Monitors
>
> NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. This is different:
Did we consider running these tests post-remediation while a node is still cordoned? We're tying our preflight check ability to customer workloads, which may not have comprehensive coverage of all GPUs (specifically if their pod requests a subset of the GPUs on a given node). We could delay fault-quarantine-module uncordoning a node until the corresponding health monitor sends a healthy event and this preflight check runs while the node is still cordoned. We'd have to put more thought into how this would work for multi-node tests, but this would be a simpler approach for the single-node dcgm-diag + nccl-loopback checks, while also ensuring that we run these preflight checks on any unhealthy GPUs we've identified without modifying customer workloads.
| value: "300s" | ||
| - name: GANG_TIMEOUT | ||
| value: "600s" | ||
| resources: |
Will the NCCL tests require that the initContainer has access to InfiniBand-related device plugins?
| value: "600s" | ||
| resources: | ||
| limits: | ||
| nvidia.com/gpu: 8 # Copied from main container |
There may be more complex scenarios where other initContainers are requesting GPUs and there might be multiple primary containers all requesting GPUs. My understanding is that the scheduler will place a pod on a node that can satisfy max(largest initContainer resource request, sum of all primary container requests). Wouldn't we want to take this maximum value to ensure we're capturing all GPUs that the pod may use across either its init or primary containers?
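A sketch of the maximum proposed here, computed over `nvidia.com/gpu` limits (the function name is illustrative; for extended resources like GPUs, requests and limits must match, so limits are used):

```go
package webhook

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// effectiveGPURequest returns max(largest init container GPU limit,
// sum of all primary container GPU limits), mirroring how the scheduler
// sizes a pod's effective resource request.
func effectiveGPURequest(pod *corev1.Pod) resource.Quantity {
	var sum, maxInit resource.Quantity
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			sum.Add(q)
		}
	}
	for _, c := range pod.Spec.InitContainers {
		if q, ok := c.Resources.Limits["nvidia.com/gpu"]; ok && q.Cmp(maxInit) > 0 {
			maxInit = q
		}
	}
	if maxInit.Cmp(sum) > 0 {
		return maxInit
	}
	return sum
}
```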
> dcgmDiagLevel: 1      # 1 (quick, ~30s) or 2 (medium, ~2-3min)
> checkTimeout: "300s"  # Per-check timeout
> gangTimeout: "600s"   # Gang coordination timeout
Will the overall timeout for the nccl-allreduce test be 900s? How do we communicate this to customers who may be sensitive to pod start-up times?
> ### Gang Coordination
>
> For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`:
Implementation note: AFAIK the NCCL all-reduce test can disrupt other workloads already running on the cluster by consuming network bandwidth (when the mutating webhook injects an init container into a new workload pod, some older workload pods might already be running). We would need to ensure that we run the all-reduce test in a lightweight manner so as to minimize the impact on existing workloads.
Summary
Preview: ADR-026: Feature - Preflight Checks via Init Container Injection