docs: Add design doc for preflight check #662
# ADR-026: Feature – Preflight Checks via Init Container Injection

## Context

GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early.

Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. Preflight can use `workloadRef` to discover peer pods and run gang-wide checks (NCCL all-reduce).

### Distinction from Health Monitors

NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. Preflight checks are different:

| | Health Monitors | Preflight Checks |
|-|-----------------|------------------|
| When | Continuous (DaemonSet) | Once at pod start (init container) |
| Check type | Passive (health watches, syslog parsing) | Active diagnostics (DCGM diag) |
| Detects | Failures as they occur (XID errors, ECC, thermal) | Latent issues before starting |
| NCCL tests | No | Yes |
| Purpose | Reactive remediation | Prevent bad starts |

Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did this GPU fail while running?"
## Decision

Implement a MutatingAdmissionWebhook that injects preflight check init containers into GPU pods (pods requesting `nvidia.com/gpu`) in configured namespaces.

- Injection trigger: GPU resource request + namespace
- Gang coordination (NCCL all-reduce): uses `workloadRef` if present, skipped otherwise
## Implementation

### Component Structure

```
preflight/
├── injector/                  # Webhook (Deployment)
│   ├── main.go
│   ├── go.mod
│   ├── Makefile
│   ├── Tiltfile
│   └── pkg/
│       ├── config/
│       │   └── config.go
│       ├── webhook/
│       │   └── v1alpha1/
│       │       ├── handler.go
│       │       └── handler_test.go
│       ├── injection/
│       │   ├── injector.go
│       │   └── injector_test.go
│       └── metrics/
│           └── metrics.go
│
├── checker/                   # Init container image
│   ├── main.go
│   ├── go.mod
│   ├── Makefile
│   ├── Tiltfile
│   └── pkg/
│       ├── runner/
│       │   └── runner.go
│       ├── checks/
│       │   ├── dcgm/
│       │   │   └── diag.go    # dcgmi diag -r 1/2
│       │   └── nccl/
│       │       ├── loopback.go
│       │       └── allreduce.go
│       ├── coordination/
│       │   ├── discovery.go   # Peer discovery via workloadRef
│       │   └── configmap.go   # NCCL ID sharing
│       ├── reporting/
│       │   └── healthevents.go
│       └── metrics/
│           └── metrics.go
│
└── Makefile                   # Builds both
```
### Webhook Flow

```mermaid
flowchart TD
    A[Pod CREATE request] --> B{Has GPU resource?}
    B -->|No| C[Allow - no mutation]
    B -->|Yes| D[Inject init container]
    D --> E[Return JSON patch]
```

Namespace filtering handled by `namespaceSelector` in webhook config. Checks configured at deployment time.
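A minimal sketch of the GPU gate in the webhook handler (`pkg/webhook/v1alpha1/handler.go`); the function name is illustrative, not the final API:

```go
// needsPreflight is the "Has GPU resource?" decision in the flowchart.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
)

const gpuResource = corev1.ResourceName("nvidia.com/gpu")

// needsPreflight reports whether a pod requests GPUs and should
// therefore receive the preflight init container.
func needsPreflight(pod *corev1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[gpuResource]; ok && !q.IsZero() {
			return true
		}
		if q, ok := c.Resources.Requests[gpuResource]; ok && !q.IsZero() {
			return true
		}
	}
	return false
}
```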
### MutatingWebhookConfiguration

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: preflight-injector
webhooks:
  - name: preflight.nvsentinel.nvidia.com
    clientConfig:
      service:
        name: preflight-injector
        namespace: nvsentinel
        path: /mutate-pod
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["CREATE"]
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values: [] # Populated from Helm values
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]
```

Namespace list populated from Helm values.
### Init Container Spec

```yaml
initContainers:
  - name: nvsentinel-preflight
    image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1
    env:
      - name: PREFLIGHT_CHECKS
        value: "dcgm-diag,nccl-loopback"
      - name: DCGM_DIAG_LEVEL
        value: "1"
      - name: CHECK_TIMEOUT
        value: "300s"
      - name: GANG_TIMEOUT
        value: "600s"
    resources:
      limits:
        nvidia.com/gpu: 8 # Copied from main container
    securityContext:
      privileged: true
    volumeMounts:
      - name: dcgm-socket
        mountPath: /var/run/nvidia
      - name: platform-connector-socket
        mountPath: /var/run/nvsentinel
```

> **Reviewer (on `resources:`):** Will the NCCL tests require that the initContainer has access to InfiniBand-related device plugins?

> **Reviewer (on the GPU limit):** There may be more complex scenarios where other initContainers are requesting GPUs and there might be multiple primary containers all requesting GPUs. My understanding is that the scheduler will ensure a pod is placed on a node that's the max(largest initContainer resource request, sum of all primary containers). Wouldn't we want to take this maximum value to ensure we're capturing all GPUs that the pod may use across either its init or primary containers?

**GPU resource handling:** Webhook copies `nvidia.com/gpu` from main container to init container (GPU allocation is per-pod).
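A sketch of this propagation in `pkg/injection/injector.go`, implementing the copy-from-main-container behavior the doc specifies (per the review discussion above, a max() across init and primary containers may be safer); the helper name is illustrative:

```go
// withGPULimit copies the first main container's nvidia.com/gpu
// limit onto the injected preflight init container.
package injection

import (
	corev1 "k8s.io/api/core/v1"
)

const gpuResource = corev1.ResourceName("nvidia.com/gpu")

func withGPULimit(init corev1.Container, pod *corev1.Pod) corev1.Container {
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[gpuResource]; ok && !q.IsZero() {
			if init.Resources.Limits == nil {
				init.Resources.Limits = corev1.ResourceList{}
			}
			init.Resources.Limits[gpuResource] = q.DeepCopy()
			break
		}
	}
	return init
}
```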
### Check Types

| Check | Scope | Coordination |
|-------|-------|--------------|
| `dcgm-diag` | Single node | None |
| `nccl-loopback` | Single node | None |
| `nccl-allreduce` | Gang-wide | ConfigMap |
| `plugin:<name>` | Varies | Varies |
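A sketch of how the checker could dispatch these from `PREFLIGHT_CHECKS` under `CHECK_TIMEOUT` (`checker/pkg/runner/runner.go`); the `Check` interface and registry shape are assumptions:

```go
// RunAll executes configured checks sequentially; a failure aborts
// so the init container exits non-zero.
package runner

import (
	"context"
	"fmt"
	"os"
	"strings"
	"time"
)

// Check is the interface each built-in check would implement.
type Check interface {
	Name() string
	Run(ctx context.Context) error
}

func RunAll(registry map[string]Check) error {
	timeout, err := time.ParseDuration(os.Getenv("CHECK_TIMEOUT")) // e.g. "300s"
	if err != nil {
		return fmt.Errorf("bad CHECK_TIMEOUT: %w", err) // config error -> exit code 2
	}
	for _, name := range strings.Split(os.Getenv("PREFLIGHT_CHECKS"), ",") {
		check, ok := registry[strings.TrimSpace(name)]
		if !ok {
			return fmt.Errorf("unknown check %q", name) // config error -> exit code 2
		}
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := check.Run(ctx)
		cancel()
		if err != nil {
			return fmt.Errorf("check %s failed: %w", check.Name(), err) // exit code 1
		}
	}
	return nil
}
```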
### Plugin Interface (Third-Party Checks)

Plugins are separate init containers. Webhook injects one container per plugin.

**Registration:**
```yaml
preflight-injector:
  plugins:
    - name: bandwidth-check
      image: myregistry/bandwidth-check:v1
      timeout: "60s"
```
**Injected init containers:**
```yaml
initContainers:
  # Built-in checks
  - name: nvsentinel-preflight
    image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1
    ...

  # Plugin (separate container)
  - name: preflight-bandwidth-check
    image: myregistry/bandwidth-check:v1
    env:
      - name: CHECK_TIMEOUT
        value: "60s"
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
```

**Plugin contract** (a minimal plugin sketch follows this list):
- Exit codes: `0` (passed), `1` (check failed), `2` (config error)
- Write HealthEvent to Platform Connector socket (same as built-in checks)
- Plugin sets `isFatal`, `recommendedAction` in HealthEvent
- Platform Connector overrides can modify values
- Webhook mounts same volumes (GPU, DCGM socket, Platform Connector socket)
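A minimal plugin sketch honoring this contract. The HealthEvent fields and socket filename are assumptions for illustration; the real schema comes from the Platform Connector:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"os"
)

// HealthEvent is an assumed shape; see the Platform Connector schema.
type HealthEvent struct {
	Check             string `json:"check"`
	Node              string `json:"node"`
	IsFatal           bool   `json:"isFatal"`
	RecommendedAction string `json:"recommendedAction"`
	Message           string `json:"message"`
}

func main() {
	if err := runBandwidthCheck(); err != nil {
		report(HealthEvent{
			Check:             "bandwidth-check",
			Node:              os.Getenv("NODE_NAME"),
			IsFatal:           true,
			RecommendedAction: "CONTACT_SUPPORT",
			Message:           err.Error(),
		})
		os.Exit(1) // check failed
	}
	os.Exit(0) // passed
}

func runBandwidthCheck() error {
	// ... actual measurement elided ...
	return nil
}

// report writes the event to the Platform Connector unix socket.
func report(ev HealthEvent) {
	conn, err := net.Dial("unix", "/var/run/nvsentinel/platform-connector.sock")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot reach platform connector:", err)
		return
	}
	defer conn.Close()
	_ = json.NewEncoder(conn).Encode(ev)
}
```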
### Configuration

Configured at deployment time via Helm values. No per-workload annotations.
### Gang Coordination

For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`:

> **Reviewer (implementation note):** AFAIK the NCCL all-reduce test on the cluster can disrupt the other workloads already running on the cluster by using the network bandwidth (when the mutating webhook injects an init container into a new workload pod, some older workload pods might already be running on the cluster). We would need to ensure that we run the all-reduce test in a lightweight manner so as to minimize the impact on existing workloads.
```mermaid
sequenceDiagram
    participant R0 as Rank 0 Init
    participant R1 as Rank 1 Init
    participant API as Kube API
    participant CM as ConfigMap

    R0->>API: List pods with same workloadRef
    R1->>API: List pods with same workloadRef

    Note over R0,R1: Determine rank by sorting pod names

    R0->>CM: Create ConfigMap with NCCL unique ID
    R1->>CM: Poll until ConfigMap exists
    R1->>CM: Read NCCL unique ID

    R0->>R1: nccl.init() (barrier inside NCCL)
    R0->>R1: nccl.all_reduce()
```
**Peer discovery via workloadRef:**
- Init container lists pods where `workloadRef.name` and `workloadRef.podGroup` match
- Gets peer IPs directly from pod list
- Determines rank by sorting pod names alphabetically
**NCCL ID sharing** (a sketch of discovery and ID sharing follows below):
- Rank 0 creates ConfigMap named `preflight-{workload}-{podgroup}`
- Other ranks poll until ConfigMap exists (10 min timeout)
- ConfigMap has owner reference to Workload for cleanup

Webhook injects the init container. No Service or other resources created.

**Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue).
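A sketch of both coordination steps (`checker/pkg/coordination`). For brevity the gang is selected by a label selector here; the real implementation would match on `spec.workloadRef` from the K8s 1.35 Workload API:

```go
package coordination

import (
	"context"
	"fmt"
	"sort"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// DiscoverPeers lists gang members and returns the sorted pod names
// plus this pod's rank (its position in the sorted list).
func DiscoverPeers(ctx context.Context, cs kubernetes.Interface, ns, selector, self string) ([]string, int, error) {
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, 0, err
	}
	names := make([]string, 0, len(pods.Items))
	for _, p := range pods.Items {
		names = append(names, p.Name)
	}
	sort.Strings(names) // rank = alphabetical position
	for i, n := range names {
		if n == self {
			return names, i, nil
		}
	}
	return nil, 0, fmt.Errorf("pod %s not in gang listing", self)
}

// ShareNCCLID publishes the NCCL unique ID (rank 0) or polls for it
// (other ranks); ctx carries the gang coordination timeout.
func ShareNCCLID(ctx context.Context, cs kubernetes.Interface, ns, name string, rank int, id []byte) ([]byte, error) {
	cms := cs.CoreV1().ConfigMaps(ns)
	if rank == 0 {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			BinaryData: map[string][]byte{"nccl-id": id},
		}
		_, err := cms.Create(ctx, cm, metav1.CreateOptions{})
		return id, err
	}
	for {
		cm, err := cms.Get(ctx, name, metav1.GetOptions{})
		if err == nil {
			return cm.BinaryData["nccl-id"], nil
		}
		if !apierrors.IsNotFound(err) {
			return nil, err
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err() // gang never formed: isFatal false
		case <-time.After(2 * time.Second):
		}
	}
}
```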
### Failure Behavior

Init container exit codes:
- `0`: All checks passed
- `1`: Check failed, pod should not start
- `2`: Configuration error

On failure:
- Pod stays in `Init:Error` state
- **HealthEvent created** via Platform Connector (same as health monitors)
- Kubernetes Event created with failure details
- Metrics incremented (`preflight_check_failures_total`)

HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, etc).
### Error to Recommended Action Mapping

**DCGM Diag:**

| Test | Result | Recommended Action |
|------|--------|-------------------|
| Memory | `FAIL` | `CONTACT_SUPPORT` |
| PCIe | `FAIL` | `CONTACT_SUPPORT` |
| NVLink | `FAIL` | `CONTACT_SUPPORT` |
| Stress | `FAIL` | `RUN_DCGMEUD` |
| Any | `WARN` | `NONE` |

**NCCL Checks:**

| Error | Recommended Action |
|-------|-------------------|
| `NCCL_SYSTEM_ERROR` | `CONTACT_SUPPORT` |
| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD` |
| `NCCL_INVALID_USAGE` | `NONE` |
| `NCCL_TIMEOUT` | `NONE` |
| `NCCL_REMOTE_ERROR` | `CONTACT_SUPPORT` |

**isFatal determination** (a mapping sketch follows this list):
- DCGM diag `FAIL` → `isFatal: true`
- DCGM diag `WARN` → `isFatal: false`
- NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) → `isFatal: true`
- NCCL timeout/config errors → `isFatal: false`
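A sketch directly encoding the tables above (`checker/pkg/reporting`); the `Verdict` type is illustrative:

```go
package reporting

// Verdict pairs the isFatal flag with the recommended action.
type Verdict struct {
	IsFatal           bool
	RecommendedAction string // CONTACT_SUPPORT, RUN_DCGMEUD, NONE
}

// dcgmVerdict maps a DCGM diag test result to a verdict.
func dcgmVerdict(test, result string) Verdict {
	if result == "WARN" {
		return Verdict{IsFatal: false, RecommendedAction: "NONE"}
	}
	// result == "FAIL"
	if test == "Stress" {
		return Verdict{IsFatal: true, RecommendedAction: "RUN_DCGMEUD"}
	}
	return Verdict{IsFatal: true, RecommendedAction: "CONTACT_SUPPORT"}
}

// ncclVerdict maps an NCCL error code to a verdict.
func ncclVerdict(code string) Verdict {
	switch code {
	case "NCCL_SYSTEM_ERROR", "NCCL_REMOTE_ERROR":
		return Verdict{IsFatal: true, RecommendedAction: "CONTACT_SUPPORT"}
	case "NCCL_INTERNAL_ERROR":
		return Verdict{IsFatal: true, RecommendedAction: "RUN_DCGMEUD"}
	default: // NCCL_INVALID_USAGE, NCCL_TIMEOUT
		return Verdict{IsFatal: false, RecommendedAction: "NONE"}
	}
}
```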
### Helm Values

```yaml
preflight-injector:
  enabled: false # Opt-in

  checks:
    - dcgm-diag
    - nccl-loopback
    # - nccl-allreduce # Enable for gang workloads

  dcgmDiagLevel: 1     # 1 (quick, ~30s) or 2 (medium, ~2-3min)
  checkTimeout: "300s" # Per-check timeout
  gangTimeout: "600s"  # Gang coordination timeout

  # Namespaces where preflight checks apply
  namespaces:
    - training

  webhook:
    failurePolicy: Fail # or Ignore

  image:
    repository: ghcr.io/nvidia/nvsentinel/preflight-checker
    tag: v1
```

> **Reviewer (on `gangTimeout`):** Will the overall timeout for the nccl-allreduce test be 900s? How do we communicate this to customers who may be sensitive with pod start-up times?

All GPU pods in listed namespaces get the configured checks.
### Metrics

**preflight/checker** (exposed via pushgateway or scraped from pod annotations):

| Metric | Type | Labels |
|--------|------|--------|
| `preflight_check_total` | Counter | `check`, `result` |
| `preflight_check_duration_seconds` | Histogram | `check` |
| `preflight_check_failures_total` | Counter | `check`, `node`, `error_code` |
| `preflight_gang_wait_seconds` | Histogram | `workload` |
| `preflight_config_errors_total` | Counter | `error` |

**preflight/injector** (standard Prometheus endpoint):

| Metric | Type | Labels |
|--------|------|--------|
| `preflight_injection_total` | Counter | `result` |
| `preflight_webhook_latency_seconds` | Histogram | - |
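A sketch of the checker-side definitions (`checker/pkg/metrics`) using `prometheus/client_golang`; bucket choices are assumptions:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// CheckTotal counts preflight checks run, by check and result.
	CheckTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "preflight_check_total",
		Help: "Preflight checks run, by check and result.",
	}, []string{"check", "result"})

	// CheckDuration records wall-clock duration per check.
	CheckDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "preflight_check_duration_seconds",
		Help:    "Wall-clock duration of each preflight check.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 10), // 1s .. ~512s
	}, []string{"check"})

	// CheckFailures counts failed checks with the node and error code.
	CheckFailures = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "preflight_check_failures_total",
		Help: "Failed preflight checks.",
	}, []string{"check", "node", "error_code"})
)
```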
## Rationale

- Mutating webhook: no external dependencies
- Init containers: run before the workload, no changes to workload controllers
- Namespace selector: opt-in
- Deployment-level config: no per-workload annotations
## Consequences

### Positive
- Catches GPU failures before workload starts
- Works with any workload controller
### Negative
- Adds 30-60s pod startup latency (DCGM diag level 1)
- Requires privileged init container for DCGM
- Webhook downtime blocks pod creation (if `failurePolicy: Fail`)

### Mitigations
- **Latency**: Use DCGM level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critical workloads
- **Privileged**: Required for hardware access; limit to specific namespaces
- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` allows pods through if webhook is down
## Alternatives Considered

### Kyverno Policy
Rejected: external dependency.

### User-managed init containers
Rejected: no enforcement; users forget.

### Custom CRD wrapper
Rejected: requires changing how workloads are deployed.
## Out of Scope

- **Repeated failure handling**: Health Event Analyzer handles pattern detection. Preflight emits events.
## References

- K8s 1.35 Workload API: https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/
- GitHub Issue: https://github.com/NVIDIA/NVSentinel/issues/658
> **Reviewer:** Did we consider running these tests post-remediation, while a node is still cordoned? We're tying our preflight check ability to customer workloads, which may not have comprehensive coverage of all GPUs (specifically if their pod is requesting a subset of GPUs on a given node). We could delay fault-quarantine-module uncordoning a node until the corresponding health monitor sends a healthy event and this preflight check runs while the node is still cordoned. We'd have to put more thought into how this would work for multi-node tests, but this would be a simpler approach for the single-node dcgm-diag + nccl-loopback checks, while also ensuring that we run these preflight checks on any unhealthy GPUs we've identified without modifying customer workloads.