# ADR-026: Feature - Preflight Checks via Init Container Injection

## Context

GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early.

Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. Preflight can use `workloadRef` to discover peer pods and run gang-wide checks (NCCL all-reduce).

### Distinction from Health Monitors

> **Reviewer comment (Contributor):** Did we consider running these tests post-remediation while a node is still cordoned? We're tying our preflight check ability to customer workloads, which may not have comprehensive coverage on all GPUs (specifically if their pod is requesting a subset of GPUs on a given node). We could delay fault-quarantine-module uncordoning a node until the corresponding health monitor sends a healthy event + this preflight check runs while the node is still cordoned. We'd have to put more thought into how this would work for multi-node tests, but this would be a simpler approach for the single-node dcgm-diag + nccl-loopback checks while also ensuring that we run these preflight checks on any unhealthy GPUs we've identified without modifying customer workloads.

NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. Preflight checks differ from them as follows:


| Aspect | Health Monitors | Preflight Checks |
|--------|-----------------|------------------|
| When | Continuous (DaemonSet) | Once at pod start (init container) |
| Check type | Passive (health watches, syslog parsing) | Active diagnostics (DCGM diag) |
| Detects | Failures as they occur (XID errors, ECC, thermal) | Latent issues before starting |
| NCCL tests | No | Yes |
| Purpose | Reactive remediation | Prevent bad starts |

Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did this GPU fail while running?"

## Decision

Implement a MutatingAdmissionWebhook that injects preflight check init containers into GPU pods (pods requesting `nvidia.com/gpu`) in configured namespaces.

- Injection trigger: GPU resource request + namespace
- Gang coordination (NCCL all-reduce): Uses `workloadRef` if present, skipped otherwise

## Implementation

### Component Structure

```
preflight/
├── injector/                     # Webhook (Deployment)
│   ├── main.go
│   ├── go.mod
│   ├── Makefile
│   ├── Tiltfile
│   └── pkg/
│       ├── config/
│       │   └── config.go
│       ├── webhook/
│       │   └── v1alpha1/
│       │       ├── handler.go
│       │       └── handler_test.go
│       ├── injection/
│       │   ├── injector.go
│       │   └── injector_test.go
│       └── metrics/
│           └── metrics.go
│
├── checker/                      # Init container image
│   ├── main.go
│   ├── go.mod
│   ├── Makefile
│   ├── Tiltfile
│   └── pkg/
│       ├── runner/
│       │   └── runner.go
│       ├── checks/
│       │   ├── dcgm/
│       │   │   └── diag.go       # dcgmi diag -r 1/2
│       │   └── nccl/
│       │       ├── loopback.go
│       │       └── allreduce.go
│       ├── coordination/
│       │   ├── discovery.go      # Peer discovery via workloadRef
│       │   └── configmap.go      # NCCL ID sharing
│       ├── reporting/
│       │   └── healthevents.go
│       └── metrics/
│           └── metrics.go
│
└── Makefile                      # Builds both
```

### Webhook Flow

```mermaid
flowchart TD
A[Pod CREATE request] --> B{Has GPU resource?}
B -->|No| C[Allow - no mutation]
B -->|Yes| D[Inject init container]
D --> E[Return JSON patch]
```

Namespace filtering is handled by the `namespaceSelector` in the webhook configuration; checks are configured at deployment time.
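
The handler's job is small: detect a GPU request and return a JSON patch that appends the checker init container. A minimal sketch, assuming the standard `k8s.io/api` types; function and type names here are illustrative, not the actual implementation:

```go
package injection

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

const gpuResource = corev1.ResourceName("nvidia.com/gpu")

// needsPreflight reports whether any container in the pod requests nvidia.com/gpu.
// Namespace filtering is not repeated here: the webhook's namespaceSelector
// already scopes which pods reach the handler.
func needsPreflight(pod *corev1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Limits[gpuResource]; ok {
			return true
		}
		if _, ok := c.Resources.Requests[gpuResource]; ok {
			return true
		}
	}
	return false
}

// patchOperation is a single RFC 6902 JSON Patch operation.
type patchOperation struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// initContainerPatch builds the patch that adds the checker init container.
func initContainerPatch(pod *corev1.Pod, checker corev1.Container) ([]byte, error) {
	op := patchOperation{Op: "add", Path: "/spec/initContainers/-", Value: checker}
	if len(pod.Spec.InitContainers) == 0 {
		// The initContainers field does not exist yet, so create the whole list.
		op = patchOperation{Op: "add", Path: "/spec/initContainers", Value: []corev1.Container{checker}}
	}
	return json.Marshal([]patchOperation{op})
}
```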

### MutatingWebhookConfiguration

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: preflight-injector
webhooks:
  - name: preflight.nvsentinel.nvidia.com
    clientConfig:
      service:
        name: preflight-injector
        namespace: nvsentinel
        path: /mutate-pod
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["CREATE"]
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values: [] # Populated from Helm values
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]
```

Namespace list populated from Helm values.

### Init Container Spec

```yaml
initContainers:
  - name: nvsentinel-preflight
    image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1
    env:
      - name: PREFLIGHT_CHECKS
        value: "dcgm-diag,nccl-loopback"
      - name: DCGM_DIAG_LEVEL
        value: "1"
      - name: CHECK_TIMEOUT
        value: "300s"
      - name: GANG_TIMEOUT
        value: "600s"
    resources:
      limits:
        nvidia.com/gpu: 8 # Copied from main container
    securityContext:
      privileged: true
    volumeMounts:
      - name: dcgm-socket
        mountPath: /var/run/nvidia
      - name: platform-connector-socket
        mountPath: /var/run/nvsentinel
```

> **Reviewer comment (Contributor, on `resources`):** Will the NCCL tests require that the initContainer has access to InfiniBand-related device plugins?

> **Reviewer comment (Contributor, on the `nvidia.com/gpu` limit):** There may be more complex scenarios where other initContainers are requesting GPUs and there might be multiple primary containers all requesting GPUs. My understanding is that the scheduler will ensure a pod is placed on a node that's the max(largest initContainer resource request, sum of all primary containers). Wouldn't we want to take this maximum value to ensure we're capturing all GPUs that the pod may use across either its init or primary containers?

**GPU resource handling:** The webhook copies the `nvidia.com/gpu` limit from the main container to the init container (GPU allocation is per-pod, so the init container checks the GPUs the workload will use).
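
A sketch of that copy step; summing across app containers is one way to approximate the pod-level allocation described above, and the helper name is illustrative:

```go
package injection

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// withPodGPUs returns the checker container with an nvidia.com/gpu limit equal
// to the GPUs requested by the pod's app containers, so the preflight checks
// cover every GPU the workload will be allocated.
func withPodGPUs(pod *corev1.Pod, checker corev1.Container) corev1.Container {
	const gpu = corev1.ResourceName("nvidia.com/gpu")

	var total int64
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[gpu]; ok {
			total += q.Value()
		}
	}
	if total == 0 {
		return checker
	}
	if checker.Resources.Limits == nil {
		checker.Resources.Limits = corev1.ResourceList{}
	}
	checker.Resources.Limits[gpu] = *resource.NewQuantity(total, resource.DecimalSI)
	return checker
}
```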

### Check Types

| Check | Scope | Coordination |
|-------|-------|--------------|
| `dcgm-diag` | Single node | None |
| `nccl-loopback` | Single node | None |
| `nccl-allreduce` | Gang-wide | ConfigMap |
| `plugin:<name>` | Varies | Varies |
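
Internally, each built-in check could be an implementation of a small interface that the runner dispatches on by name. A possible shape, purely illustrative (the actual `runner`/`checks` packages may differ):

```go
package checks

import "context"

// Result carries what the reporting package needs to build a HealthEvent.
type Result struct {
	Check             string
	Passed            bool
	IsFatal           bool
	RecommendedAction string
	Detail            string
}

// Check is implemented by dcgm-diag, nccl-loopback, nccl-allreduce, etc.
type Check interface {
	Name() string
	// Run executes the diagnostic; the runner applies CHECK_TIMEOUT via ctx.
	Run(ctx context.Context) (Result, error)
}

// Registry maps the names accepted in PREFLIGHT_CHECKS to implementations.
var Registry = map[string]Check{}
```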

### Plugin Interface (Third-Party Checks)

Plugins are separate init containers. Webhook injects one container per plugin.

**Registration:**
```yaml
preflight-injector:
  plugins:
    - name: bandwidth-check
      image: myregistry/bandwidth-check:v1
      timeout: "60s"
```

**Injected init containers:**
```yaml
initContainers:
  # Built-in checks
  - name: nvsentinel-preflight
    image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1
    ...

  # Plugin (separate container)
  - name: preflight-bandwidth-check
    image: myregistry/bandwidth-check:v1
    env:
      - name: CHECK_TIMEOUT
        value: "60s"
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
```

**Plugin contract:**
- Exit codes: `0` (passed), `1` (check failed), `2` (config error)
- Write HealthEvent to Platform Connector socket (same as built-in checks)
- Plugin sets `isFatal`, `recommendedAction` in HealthEvent
- Platform Connector overrides can modify values
- Webhook mounts same volumes (GPU, DCGM socket, Platform Connector socket)
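
For illustration, a hypothetical plugin skeleton that honors this contract. The HealthEvent wire format and the socket filename under `/var/run/nvsentinel` are assumptions, not a documented Platform Connector API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"os"
)

// healthEvent is an assumed JSON shape; the real schema comes from Platform Connector.
type healthEvent struct {
	Check             string `json:"check"`
	Node              string `json:"node"`
	IsFatal           bool   `json:"isFatal"`
	RecommendedAction string `json:"recommendedAction"`
	Message           string `json:"message"`
}

func main() {
	node := os.Getenv("NODE_NAME")
	if node == "" {
		fmt.Fprintln(os.Stderr, "NODE_NAME not set")
		os.Exit(2) // exit 2: configuration error
	}

	ok, msg := runBandwidthCheck() // plugin-specific diagnostic (stubbed here)
	if !ok {
		report(healthEvent{
			Check:             "bandwidth-check",
			Node:              node,
			IsFatal:           true,
			RecommendedAction: "CONTACT_SUPPORT",
			Message:           msg,
		})
		os.Exit(1) // exit 1: check failed, the pod should not start
	}
	os.Exit(0) // exit 0: passed
}

func runBandwidthCheck() (bool, string) { return true, "" }

func report(ev healthEvent) {
	// Assumed: newline-delimited JSON over the mounted unix socket.
	conn, err := net.Dial("unix", "/var/run/nvsentinel/platform-connector.sock")
	if err != nil {
		fmt.Fprintln(os.Stderr, "report failed:", err)
		return
	}
	defer conn.Close()
	_ = json.NewEncoder(conn).Encode(ev)
}
```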

### Configuration

Configured at deployment time via Helm values. No per-workload annotations.

### Gang Coordination

> **Reviewer comment (Contributor):** Implementation note: AFAIK the NCCL all-reduce test on the cluster can disrupt the other workloads already running on the cluster by using the network bandwidth (when the mutating webhook injects an init container into a new workload pod, some older workload pods might already be running on the cluster). We would need to ensure that we run the all-reduce test in a lightweight manner so as to minimize the impact on existing workloads.

For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`:

```mermaid
sequenceDiagram
participant R0 as Rank 0 Init
participant R1 as Rank 1 Init
participant API as Kube API
participant CM as ConfigMap

R0->>API: List pods with same workloadRef
R1->>API: List pods with same workloadRef

Note over R0,R1: Determine rank by sorting pod names

R0->>CM: Create ConfigMap with NCCL unique ID
R1->>CM: Poll until ConfigMap exists
R1->>CM: Read NCCL unique ID

R0->>R1: nccl.init() (barrier inside NCCL)
R0->>R1: nccl.all_reduce()
```

**Peer discovery via workloadRef:**
- Init container lists pods where `workloadRef.name` and `workloadRef.podGroup` match
- Gets peer IPs directly from pod list
- Determines rank by sorting pod names alphabetically
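
A sketch of the discovery step using client-go. The `WorkloadRef` fields follow the names used above and assume the v1.35 Workload API surface, so the real types may differ:

```go
package coordination

import (
	"context"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// discoverGang returns the gang's pod IPs in rank order plus this pod's rank.
func discoverGang(ctx context.Context, cs kubernetes.Interface, namespace, self, workload, podGroup string) (ips []string, rank int, err error) {
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, 0, err
	}

	type peer struct{ name, ip string }
	var peers []peer
	for _, p := range pods.Items {
		ref := p.Spec.WorkloadRef // assumed field, per the Workload API referenced above
		if ref == nil || ref.Name != workload || ref.PodGroup != podGroup {
			continue
		}
		peers = append(peers, peer{name: p.Name, ip: p.Status.PodIP})
	}

	// Rank is the pod's position in the alphabetically sorted name list.
	sort.Slice(peers, func(i, j int) bool { return peers[i].name < peers[j].name })
	for i, p := range peers {
		ips = append(ips, p.ip)
		if p.name == self {
			rank = i
		}
	}
	return ips, rank, nil
}
```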

**NCCL ID sharing:**
- Rank 0 creates ConfigMap named `preflight-{workload}-{podgroup}`
- Other ranks poll until ConfigMap exists (10 min timeout)
- ConfigMap has owner reference to Workload for cleanup
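
A sketch of the ConfigMap handshake; the data key `ncclUniqueId` and the function name are illustrative, and owner-reference wiring is omitted:

```go
package coordination

import (
	"context"
	"encoding/base64"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// shareNCCLID publishes the NCCL unique ID from rank 0 and returns it on every rank.
func shareNCCLID(ctx context.Context, cs kubernetes.Interface, ns, workload, podGroup string, rank int, uniqueID []byte) ([]byte, error) {
	name := fmt.Sprintf("preflight-%s-%s", workload, podGroup)

	if rank == 0 {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: name}, // owner reference to the Workload omitted
			Data:       map[string]string{"ncclUniqueId": base64.StdEncoding.EncodeToString(uniqueID)},
		}
		_, err := cs.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
		return uniqueID, err
	}

	// Other ranks poll until rank 0 has created the ConfigMap; ctx carries
	// GANG_TIMEOUT, so the loop gives up after 10 minutes by default.
	for {
		cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
		if err == nil {
			return base64.StdEncoding.DecodeString(cm.Data["ncclUniqueId"])
		}
		if !apierrors.IsNotFound(err) {
			return nil, err
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("gang coordination timed out: %w", ctx.Err())
		case <-time.After(5 * time.Second):
		}
	}
}
```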

Webhook injects the init container. No Service or other resources created.

**Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue).

### Failure Behavior

Init container exit codes:
- `0`: All checks passed
- `1`: Check failed, pod should not start
- `2`: Configuration error

On failure:
- Pod stays in `Init:Error` state
- **HealthEvent created** via Platform Connector (same as health monitors)
- Kubernetes Event created with failure details
- Metrics incremented (`preflight_check_failures_total`)

HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, etc).
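
Tying the exit codes and reporting together, the checker entrypoint could look roughly like the sketch below. The module paths and the `reporting.EmitHealthEvent` helper are assumptions, not the actual code:

```go
package main

import (
	"context"
	"log"
	"os"
	"strings"
	"time"

	// Import paths assume the checker layout shown earlier (illustrative).
	"github.com/NVIDIA/NVSentinel/preflight/checker/pkg/checks"
	"github.com/NVIDIA/NVSentinel/preflight/checker/pkg/reporting"
)

func main() {
	timeout, err := time.ParseDuration(getenv("CHECK_TIMEOUT", "300s"))
	if err != nil {
		os.Exit(2) // exit 2: configuration error
	}

	for _, name := range strings.Split(os.Getenv("PREFLIGHT_CHECKS"), ",") {
		check, ok := checks.Registry[strings.TrimSpace(name)]
		if !ok {
			log.Printf("unknown check %q", name)
			os.Exit(2) // exit 2: configuration error
		}
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		result, err := check.Run(ctx)
		cancel()
		reporting.EmitHealthEvent(result) // assumed helper: HealthEvent via Platform Connector socket
		if err != nil || !result.Passed {
			os.Exit(1) // exit 1: check failed, the pod should not start
		}
	}
	os.Exit(0) // exit 0: all checks passed
}

func getenv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}
```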

### Error to Recommended Action Mapping

**DCGM Diag**:

| Test | Result | Recommended Action |
|------|--------|-------------------|
| Memory | `FAIL` | `CONTACT_SUPPORT` |
| PCIe | `FAIL` | `CONTACT_SUPPORT` |
| NVLink | `FAIL` | `CONTACT_SUPPORT` |
| Stress | `FAIL` | `RUN_DCGMEUD` |
| Any | `WARN` | `NONE` |

**NCCL Checks**:

| Error | Recommended Action |
|-------|-------------------|
| `NCCL_SYSTEM_ERROR` | `CONTACT_SUPPORT` |
| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD` |
| `NCCL_INVALID_USAGE` | `NONE` |
| `NCCL_TIMEOUT` | `NONE` |
| `NCCL_REMOTE_ERROR` | `CONTACT_SUPPORT` |

**isFatal determination**:
- DCGM diag `FAIL` β†’ `isFatal: true`
- DCGM diag `WARN` β†’ `isFatal: false`
- NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) β†’ `isFatal: true`
- NCCL timeout/config errors β†’ `isFatal: false`
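
Expressed as code, this mapping could live in the `reporting` package; a direct transcription of the two tables (function names are illustrative):

```go
package reporting

// dcgmAction maps a dcgm-diag test result to the recommended action and fatality.
func dcgmAction(test, result string) (action string, isFatal bool) {
	if result == "WARN" {
		return "NONE", false
	}
	// result == "FAIL"
	switch test {
	case "Stress":
		return "RUN_DCGMEUD", true
	default: // Memory, PCIe, NVLink, ...
		return "CONTACT_SUPPORT", true
	}
}

// ncclAction maps an NCCL error code to the recommended action and fatality.
func ncclAction(errCode string) (action string, isFatal bool) {
	switch errCode {
	case "NCCL_SYSTEM_ERROR", "NCCL_REMOTE_ERROR":
		return "CONTACT_SUPPORT", true
	case "NCCL_INTERNAL_ERROR":
		return "RUN_DCGMEUD", true
	default: // NCCL_INVALID_USAGE, NCCL_TIMEOUT
		return "NONE", false
	}
}
```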

### Helm Values

```yaml
preflight-injector:
  enabled: false # Opt-in

  checks:
    - dcgm-diag
    - nccl-loopback
    # - nccl-allreduce # Enable for gang workloads

  dcgmDiagLevel: 1     # 1 (quick, ~30s) or 2 (medium, ~2-3min)
  checkTimeout: "300s" # Per-check timeout
  gangTimeout: "600s"  # Gang coordination timeout

  # Namespaces where preflight checks apply
  namespaces:
    - training

  webhook:
    failurePolicy: Fail # or Ignore

  image:
    repository: ghcr.io/nvidia/nvsentinel/preflight-checker
    tag: v1
```

> **Reviewer comment (Contributor, on `gangTimeout`):** Will the overall timeout for the nccl-allreduce test be 900s? How do we communicate this to customers who may be sensitive with pod start-up times?

All GPU pods in listed namespaces get the configured checks.
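
On the injector side, these values could map onto a small struct in `pkg/config`. A sketch with assumed field names; timeouts are kept as strings (e.g. `"300s"`) and parsed with `time.ParseDuration` where needed:

```go
package config

// Plugin mirrors one entry under plugins: in the Helm values.
type Plugin struct {
	Name    string `yaml:"name"`
	Image   string `yaml:"image"`
	Timeout string `yaml:"timeout"`
}

// Config mirrors the preflight-injector Helm values shown above.
type Config struct {
	Enabled       bool     `yaml:"enabled"`
	Checks        []string `yaml:"checks"`        // dcgm-diag, nccl-loopback, nccl-allreduce, plugin:<name>
	DCGMDiagLevel int      `yaml:"dcgmDiagLevel"` // 1 (quick) or 2 (medium)
	CheckTimeout  string   `yaml:"checkTimeout"`  // per-check timeout, e.g. "300s"
	GangTimeout   string   `yaml:"gangTimeout"`   // gang coordination timeout, e.g. "600s"
	Namespaces    []string `yaml:"namespaces"`    // namespaces where injection applies
	Plugins       []Plugin `yaml:"plugins"`
}
```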

### Metrics

**preflight/checker** (exposed via pushgateway or scraped from pod annotations):

| Metric | Type | Labels |
|--------|------|--------|
| `preflight_check_total` | Counter | `check`, `result` |
| `preflight_check_duration_seconds` | Histogram | `check` |
| `preflight_check_failures_total` | Counter | `check`, `node`, `error_code` |
| `preflight_gang_wait_seconds` | Histogram | `workload` |
| `preflight_config_errors_total` | Counter | `error` |

**preflight/injector** (standard Prometheus endpoint):

| Metric | Type | Labels |
|--------|------|--------|
| `preflight_injection_total` | Counter | `result` |
| `preflight_webhook_latency_seconds` | Histogram | - |
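
A sketch of how the injector could register these two metrics with `prometheus/client_golang` (variable names are illustrative):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// InjectionTotal counts pod mutations attempted by the injector, by result.
	InjectionTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "preflight_injection_total",
		Help: "Pod mutations attempted by the preflight injector, by result.",
	}, []string{"result"})

	// WebhookLatency tracks admission handler latency.
	WebhookLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "preflight_webhook_latency_seconds",
		Help:    "Admission handler latency in seconds.",
		Buckets: prometheus.DefBuckets,
	})
)
```

The handler would then call something like `metrics.InjectionTotal.WithLabelValues("injected").Inc()` after each mutation attempt.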

## Rationale

- Mutating webhook: built-in Kubernetes mechanism, no external dependencies
- Init containers: checks run before the workload with the pod's own GPU allocation, and a failed check keeps the pod from starting
- Namespace selector: opt-in, so only the configured namespaces pay the startup cost
- Deployment-level config: no per-workload annotations for users to maintain

## Consequences

### Positive
- Catches GPU failures before workload starts
- Works with any workload controller

### Negative
- Adds 30-60s pod startup latency (DCGM diag level 1)
- Requires privileged init container for DCGM
- Webhook downtime blocks pod creation (if `failurePolicy: Fail`)

### Mitigations
- **Latency**: Use DCGM level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critical workloads
- **Privileged**: Required for hardware access; limit to specific namespaces
- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` allows pods through if webhook is down

## Alternatives Considered

### Kyverno Policy
Rejected: External dependency.

### User-managed init containers
Rejected: No enforcement. Users forget.

### Custom CRD wrapper
Rejected: Requires changing how workloads are deployed.

## Out of Scope

- **Repeated failure handling**: Health Event Analyzer handles pattern detection. Preflight emits events.

## References

- K8s 1.35 Workload API: https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/
- GitHub Issue: https://github.com/NVIDIA/NVSentinel/issues/658
