Commit 7d441d4

Merge branch 'main' into xid-analyzer-recovery-actions
2 parents 51e8742 + 55ce76d commit 7d441d4

27 files changed

Lines changed: 1041 additions & 53 deletions

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml

Lines changed: 18 additions & 0 deletions
@@ -53,6 +53,24 @@ data:
 {{- if .healthEvent.processingStrategy }}
 processingStrategy = {{ .healthEvent.processingStrategy | quote }}
 {{- end }}
+{{- if .healthEvent.quarantineOverrides }}
+[policies.healthEvent.quarantineOverrides]
+{{- if hasKey .healthEvent.quarantineOverrides "force" }}
+force = {{ .healthEvent.quarantineOverrides.force }}
+{{- end }}
+{{- if hasKey .healthEvent.quarantineOverrides "skip" }}
+skip = {{ .healthEvent.quarantineOverrides.skip }}
+{{- end }}
+{{- end }}
+{{- if .healthEvent.drainOverrides }}
+[policies.healthEvent.drainOverrides]
+{{- if hasKey .healthEvent.drainOverrides "force" }}
+force = {{ .healthEvent.drainOverrides.force }}
+{{- end }}
+{{- if hasKey .healthEvent.drainOverrides "skip" }}
+skip = {{ .healthEvent.drainOverrides.skip }}
+{{- end }}
+{{- end }}

 {{- end }}
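The template in this hunk guards each key with `hasKey` rather than a truthiness check, so an explicit `force: false` still reaches the rendered TOML, while a missing or empty override map renders nothing. The following Python sketch imitates that guard logic for illustration only; `render_overrides` is a hypothetical helper, not part of the chart, and the real rendering is done by Helm.

```python
def render_overrides(name, overrides):
    """Render one override table the way the template's guards would.

    Hypothetical helper for illustration; the chart itself uses Helm templating.
    """
    # Outer guard: a missing or empty map renders nothing, like `{{- if ... }}`.
    if not overrides:
        return ""
    lines = [f"[policies.healthEvent.{name}]"]
    for key in ("force", "skip"):
        # Inner guard mirrors `hasKey`: presence matters, not truthiness.
        if key in overrides:
            lines.append(f"{key} = {str(overrides[key]).lower()}")
    return "\n".join(lines)


# An explicit false is still emitted because the key is present.
print(render_overrides("quarantineOverrides", {"force": False}))
```

Under these assumptions, an empty `drainOverrides: {}` produces no table at all, which matches the outer `if` in the template.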

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml

Lines changed: 10 additions & 0 deletions
@@ -50,6 +50,14 @@ policies:
 recommendedAction: CONTACT_SUPPORT
 errorCode:
 - NODE_NOT_READY
+# Optional behavior overrides for this policy's generated HealthEvents.
+# Set either force or skip, never both in the same override block.
+# quarantineOverrides:
+#   force: true # Force node cordon even if normal quarantine rules would not.
+#   skip: true # Skip node cordon for this health event.
+# drainOverrides:
+#   force: true # Force immediate pod eviction regardless of namespace drain mode.
+#   skip: true # Skip pod eviction and mark the event as already drained.

 # Example: Monitor a custom resource (e.g., a GPU Job)
 # Uncomment and modify to monitor your own custom resources
@@ -89,6 +97,8 @@ policies:
 # recommendedAction: CONTACT_SUPPORT
 # errorCode:
 # - GPU_JOB_FAILED
+# drainOverrides:
+#   skip: true

 resources:
   requests:

distros/kubernetes/nvsentinel/charts/node-drainer/templates/configmap.yaml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ data:
 systemNamespaces = {{ .Values.systemNamespaces | quote }}
 deleteAfterTimeoutMinutes = {{ .Values.deleteAfterTimeoutMinutes }}
 notReadyTimeoutMinutes = {{ .Values.notReadyTimeoutMinutes }}
+drainGPUPods = {{ .Values.drainGPUPods }}
 partialDrainEnabled = {{ .Values.partialDrainEnabled }}

 {{- range .Values.userNamespaces }}

distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml

Lines changed: 6 additions & 0 deletions
@@ -55,6 +55,12 @@ deleteAfterTimeoutMinutes: 60
 # Default: 5 minutes if not specified (validated in config.go)
 notReadyTimeoutMinutes: 5

+# Flag to restrict draining to GPU workloads
+# If enabled, only pods with the metadata-collector device annotation
+# (indicating assigned GPU devices) are eligible for draining
+# Default: false if not specified
+drainGPUPods: false
+
 # User namespace configuration with eviction modes
 # Defines how pods in different namespaces should be evicted during node drain
 # Each entry specifies a namespace pattern and its corresponding eviction mode

distros/kubernetes/nvsentinel/values-full.yaml

Lines changed: 6 additions & 0 deletions
@@ -590,6 +590,12 @@ node-drainer:
   # Default: 5 minutes
   notReadyTimeoutMinutes: 5

+  # Flag to restrict draining to GPU workloads
+  # If enabled, only pods with the metadata-collector device annotation
+  # (indicating assigned GPU devices) are eligible for draining
+  # Default: false if not specified
+  drainGPUPods: false
+
   # Namespace-specific eviction strategies
   # Define how pods in different namespaces should be evicted
   # Multiple rules can be defined with namespace patterns

docs/configuration/kubernetes-object-monitor.md

Lines changed: 10 additions & 0 deletions
@@ -83,6 +83,10 @@ kubernetes-object-monitor:
 recommendedAction: CONTACT_SUPPORT
 errorCode:
 - ERROR_CODE
+quarantineOverrides:
+  force: true # Or use skip: true; do not set both
+drainOverrides:
+  skip: true # Or use force: true; do not set both
 ```

 ### Parameters
@@ -135,6 +139,12 @@ Action code from health event proto (see [health_event.proto](https://github.com
 ##### errorCode
 Array of error code strings for categorization and filtering.

+##### quarantineOverrides
+Optional behavior override for fault-quarantine. `force` forces node cordoning regardless of normal rules; `skip` skips node cordoning for the generated health event. Set at most one of `force` or `skip`.
+
+##### drainOverrides
+Optional behavior override for node-drainer. `force` forces immediate pod eviction regardless of configured namespace drain modes; `skip` skips pod eviction and marks the event as already drained. Set at most one of `force` or `skip`.
+
 ## CEL Expressions

 ### Predicate Expressions
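The documentation added in this hunk says to set at most one of `force` or `skip` per override block, but the diff shown here does not include the code that enforces it. As a rough sketch of what such a check could look like, the hypothetical `validate_overrides` function below rejects a block where both flags are true. Treating "set" as "true" (rather than "key present") is an assumption, as is rejecting unknown keys.

```python
def validate_overrides(name, overrides):
    """Raise ValueError if an override block enables both force and skip.

    Hypothetical check for illustration; not taken from the project's code.
    """
    overrides = overrides or {}
    unknown = set(overrides) - {"force", "skip"}
    if unknown:
        raise ValueError(f"{name}: unknown keys {sorted(unknown)}")
    # Assumption: the conflict is both flags being true, not merely present.
    if overrides.get("force") and overrides.get("skip"):
        raise ValueError(f"{name}: set at most one of force or skip")


# Mirrors the documented example: one flag per block passes validation.
policy = {
    "quarantineOverrides": {"force": True},
    "drainOverrides": {"skip": True},
}
for block in ("quarantineOverrides", "drainOverrides"):
    validate_overrides(block, policy.get(block))
```

A pre-deploy check of this kind would catch a conflicting block before the values are rendered into the ConfigMap.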

docs/configuration/node-drainer.md

Lines changed: 23 additions & 0 deletions
@@ -96,6 +96,29 @@ node-drainer:

 When a pod has been in NotReady state for longer than this timeout, it is excluded from the list of pods to evict. This prevents attempting to evict pods that are already unhealthy and unlikely to respond to eviction requests.

+### GPU-Only Draining
+
+If enabled, the node-drainer filters pod eviction to only target workloads that request GPU resources.
+
+```yaml
+node-drainer:
+  drainGPUPods: false
+```
+
+The node-drainer detects GPU resource requests through device annotations added to pods by the metadata-collector. Pods with device annotations are identified as GPU workloads and eligible for eviction.
+
+Device annotations are added to pods requesting GPU resources by metadata-collector with the format:
+```yaml
+annotations:
+  dgxc.nvidia.com/devices: '{"devices":{"nvidia.com/gpu":["GPU-123"]}}'
+```
+
+#### Behavior
+
+- **When enabled (`true`)**: Only pods with GPU device annotations are evicted during drain operations
+- **When disabled (`false`)**: All eligible pods in configured namespaces are evicted (default behavior)
+- Pods without GPU requests are preserved, maintaining critical infrastructure services
+
 ## User Namespaces

 Defines eviction behavior for user workloads based on namespace patterns.
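The GPU-only draining behavior described in this hunk amounts to filtering the eligible pod list by the `dgxc.nvidia.com/devices` annotation. The sketch below illustrates that idea in Python over plain dictionaries. The function names `has_gpu_devices` and `pods_to_drain` are hypothetical, and requiring at least one listed device (rather than the mere presence of the annotation) is an assumption that is stricter than the documented description.

```python
import json

# Annotation key and JSON shape are taken from the documentation in this diff.
DEVICE_ANNOTATION = "dgxc.nvidia.com/devices"


def has_gpu_devices(pod):
    """Return True if the pod's device annotation lists at least one device."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    raw = annotations.get(DEVICE_ANNOTATION)
    if not raw:
        return False
    try:
        devices = json.loads(raw)["devices"]
        return any(ids for ids in devices.values())
    except (ValueError, KeyError, TypeError, AttributeError):
        # Assumption: a malformed annotation is treated as a non-GPU pod.
        return False


def pods_to_drain(pods, drain_gpu_pods):
    """Apply the drainGPUPods filter to an already-eligible pod list."""
    if not drain_gpu_pods:
        return list(pods)  # default behavior: every eligible pod is drained
    return [pod for pod in pods if has_gpu_devices(pod)]


pods = [
    {"metadata": {"name": "trainer", "annotations": {
        DEVICE_ANNOTATION: '{"devices":{"nvidia.com/gpu":["GPU-123"]}}'}}},
    {"metadata": {"name": "log-agent"}},
]
print([p["metadata"]["name"] for p in pods_to_drain(pods, True)])
```

With `drain_gpu_pods=True` only the annotated `trainer` pod is selected; with `False` both pods are returned, matching the two documented modes.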
