When a pod has been in the `NotReady` state for longer than this timeout, it is excluded from the list of pods to evict. This avoids trying to evict pods that are already unhealthy and unlikely to respond to eviction requests.
### GPU-Only Draining
When enabled, the node-drainer restricts pod eviction to workloads that request GPU resources.
```yaml
node-drainer:
  drainGPUPods: false
```
The node-drainer detects GPU resource requests through device annotations added to pods by the metadata-collector. Pods with device annotations are identified as GPU workloads and eligible for eviction.
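
Here is a minimal sketch of the detection step. It assumes a pod counts as a GPU workload whenever a device annotation is present. The annotation key `example.com/gpu-devices` and the `isGPUPod` helper are hypothetical placeholders. This section does not show the actual key and value format that the metadata-collector writes.

```go
package main

import "fmt"

// deviceAnnotationKey is a HYPOTHETICAL placeholder; the real key written by
// the metadata-collector is not specified here.
const deviceAnnotationKey = "example.com/gpu-devices"

// isGPUPod reports whether a pod's annotations include the device annotation,
// which is how this sketch identifies GPU workloads eligible for eviction.
func isGPUPod(annotations map[string]string) bool {
	_, ok := annotations[deviceAnnotationKey]
	return ok
}

func main() {
	gpuPod := map[string]string{deviceAnnotationKey: "placeholder-value"}
	cpuPod := map[string]string{"app": "web"}
	fmt.Println(isGPUPod(gpuPod)) // true
	fmt.Println(isGPUPod(cpuPod)) // false
}
```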

The metadata-collector adds device annotations to pods that request GPU resources, using the format:

The node-drainer automatically resumes drain operations after a restart: it queries the datastore for in-progress drains and continues from where it left off.
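
The sketch below shows one way to picture the resume-on-restart behavior. The `Datastore` interface, `DrainRecord`, the status values, and `resumeDrains` are all hypothetical, because this section does not describe the real datastore schema or query API. The `resume` callback is assumed to skip pods that were already evicted, which is what lets a drain continue from where it left off instead of starting over.

```go
package main

import "fmt"

// DrainStatus is a HYPOTHETICAL drain state persisted in the datastore.
type DrainStatus string

const (
	DrainInProgress DrainStatus = "in-progress"
	DrainCompleted  DrainStatus = "completed"
)

// DrainRecord is a HYPOTHETICAL persisted record of a node drain.
type DrainRecord struct {
	Node   string
	Status DrainStatus
}

// Datastore is a HYPOTHETICAL interface; the real datastore API differs.
type Datastore interface {
	ListDrains() ([]DrainRecord, error)
}

// memStore is an in-memory Datastore used only for this example.
type memStore []DrainRecord

func (m memStore) ListDrains() ([]DrainRecord, error) { return m, nil }

// resumeDrains finds drains left in progress before a restart and calls
// resume for each one. It returns the nodes whose drains were resumed.
func resumeDrains(ds Datastore, resume func(node string) error) ([]string, error) {
	records, err := ds.ListDrains()
	if err != nil {
		return nil, fmt.Errorf("listing drains: %w", err)
	}
	var resumed []string
	for _, r := range records {
		if r.Status != DrainInProgress {
			continue // finished drains need no further work
		}
		if err := resume(r.Node); err != nil {
			return resumed, fmt.Errorf("resuming drain on %s: %w", r.Node, err)
		}
		resumed = append(resumed, r.Node)
	}
	return resumed, nil
}

func main() {
	store := memStore{
		{Node: "node-a", Status: DrainInProgress},
		{Node: "node-b", Status: DrainCompleted},
	}
	resumed, err := resumeDrains(store, func(node string) error {
		fmt.Println("resuming drain on", node)
		return nil
	})
	fmt.Println(resumed, err)
}
```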
### Partial Drain Functionality
For GPU faults that can be remediated with a GPU reset, the Node Drainer drains only the pods using the unhealthy GPU. For GPU faults that require a node reboot, it drains all pods on the node in the configured namespaces.
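
The selection logic for partial drains can be sketched as follows. The sketch assumes each pod's assigned GPU IDs are already known. `Remediation`, `Pod`, `podsToDrain`, `usesGPU`, and the GPU ID strings are hypothetical names used only for illustration. The sketch also applies the namespace filter in the GPU-reset case. That is an assumption, because the text above states the namespace restriction only for reboots.

```go
package main

import "fmt"

// Remediation is a HYPOTHETICAL label for the action a GPU fault requires.
type Remediation int

const (
	GPUReset Remediation = iota
	NodeReboot
)

// Pod is a HYPOTHETICAL, simplified pod with its namespace and assigned GPU IDs.
type Pod struct {
	Name      string
	Namespace string
	GPUs      []string
}

// usesGPU reports whether pod p is assigned the GPU with the given ID.
func usesGPU(p Pod, id string) bool {
	for _, g := range p.GPUs {
		if g == id {
			return true
		}
	}
	return false
}

// podsToDrain selects pods to evict. A GPU reset drains only pods using the
// unhealthy GPU; a reboot drains every pod on the node. Only pods in the
// configured namespaces are considered (assumed for both cases).
func podsToDrain(pods []Pod, namespaces map[string]bool, action Remediation, badGPU string) []Pod {
	var out []Pod
	for _, p := range pods {
		if !namespaces[p.Namespace] {
			continue
		}
		if action == NodeReboot || usesGPU(p, badGPU) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "train-0", Namespace: "ml", GPUs: []string{"gpu-0"}},
		{Name: "train-1", Namespace: "ml", GPUs: []string{"gpu-1"}},
		{Name: "web", Namespace: "web"},
	}
	ns := map[string]bool{"ml": true}
	fmt.Println(len(podsToDrain(pods, ns, GPUReset, "gpu-0")))   // 1
	fmt.Println(len(podsToDrain(pods, ns, NodeReboot, "gpu-0"))) // 2
}
```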
### GPU-Only Draining
When `drainGPUPods: true` is set, the Node Drainer evicts only workloads that request GPU resources. It detects GPU usage through device annotations provided by the Metadata Collector, which tracks GPU allocation across the cluster. The default is `false`.
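
The gating effect of the `drainGPUPods` setting can be sketched as a simple filter. `selectForEviction` and the simplified `Pod` type are hypothetical. The GPU predicate is passed in so the sketch does not depend on the real annotation format. The key used in `main` is a placeholder.

```go
package main

import "fmt"

// Pod is a HYPOTHETICAL, simplified pod carrying its annotations.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// selectForEviction applies the drainGPUPods setting: when false, every
// candidate pod is returned; when true, only pods accepted by isGPU remain.
func selectForEviction(pods []Pod, drainGPUPods bool, isGPU func(Pod) bool) []Pod {
	if !drainGPUPods {
		return pods
	}
	var out []Pod
	for _, p := range pods {
		if isGPU(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// hasDeviceAnnotation checks a HYPOTHETICAL annotation key.
	hasDeviceAnnotation := func(p Pod) bool {
		_, ok := p.Annotations["example.com/gpu-devices"]
		return ok
	}
	pods := []Pod{
		{Name: "trainer", Annotations: map[string]string{"example.com/gpu-devices": "placeholder"}},
		{Name: "sidecar"},
	}
	fmt.Println(len(selectForEviction(pods, false, hasDeviceAnnotation))) // 2
	fmt.Println(len(selectForEviction(pods, true, hasDeviceAnnotation)))  // 1
}
```

Taking the predicate as a parameter keeps the flag logic independent of how GPU usage is discovered.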