KEP-5007: DRA Device Binding Conditions beta in 1.36
- Updated the Production Readiness Review questionnaire
and introduced metrics for troubleshooting and operations.
- Addressed review comments from the v1.35 PR #5487.
- Added Graduation Criteria for beta.
- Clarify that happy-path device migration is out of scope for beta criteria
Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
Lines changed: 57 additions & 32 deletions
@@ -226,6 +226,7 @@ Moving to beta aims to validate the feature at scale, gather real-world feedback
226
226
- While this KEP introduces a mechanism that supports such use cases, broader architectural questions — such as how to model attachment workflows or coordinate between node-local and fabric-aware components — will be addressed in follow-up discussions.
227
227
- Defining autoscaling strategies for fabric-attached devices (covered in future proposals).
228
228
- Guaranteeing zero rescheduling in all failure scenarios.
229
+
- Providing a mechanism to safely move devices between different pools as a happy-path flow, that is, without relying on re-scheduling triggered by BindingFailureConditions.
229
230
230
231
## Proposal
231
232
@@ -743,12 +744,14 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
743
744
- A few bugs were identified in the external controller (CoHDI component), but all of them were resolvable with fixes and do not indicate a fatal problem with the use of BindingCondition.
744
745
- Scenarios where devices in a ResourceSlice for a resource pool decrease were also tested, with no issues detected.
745
746
- Please refer [here](https://github.com/CoHDI/composable-dra-driver/tree/main/doc/Usecase_and_feedback_for_BindingCondition.md) for more details.
746
-
- Resolve the following issues
747
-
- If Scheduler picks up another node for the Pod after the restart, devices are unnecessarily left on the original nodes
748
-
(Composable DRA controller needs to have the function to detach a device automatically if it is not used by a Pod for a certain period of time)
749
-
- Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down
750
-
- Additional tests are in Testgrid and linked in KEP
751
-
- Scheduler supports timeout configuration via command-line argument
747
+
- In this use case, moving devices between different pools is currently achieved through re-scheduling triggered by BindingFailureConditions. Implementing device migration as a happy-path flow using BindingConditions remains an open issue; it will be addressed in a separate KEP and is out of scope for the beta graduation criteria.
748
+
- Feedback from the NVIDIA DRA Driver (https://github.com/NVIDIA/k8s-dra-driver-gpu)
749
+
- ComputeDomain is an NVIDIA GPU Operator concept that groups multiple Kubernetes pods into a secure, isolated domain so they can share GPU memory across nodes using Multi‑Node NVLink and IMEX technology.
750
+
- BindingConditions let Kubernetes delay pod start until the ComputeDomain's prerequisites are truly ready - specifically, until IMEX daemons are scheduled and healthy - without resorting to fail‑and‑retry loops.
751
+
- This yields faster, more predictable pod startup and a simpler driver/controller design because the pod is only bound once all ComputeDomain resources signal ready via BindingConditions.
752
+
- Please refer [here](https://github.com/NVIDIA/k8s-dra-driver-gpu/issues/653) for more details.
This feature exposes new fields such as `BindingConditions` in the
783
-
`ResourceClaim` and `ResourceSlice`, the fields willeither be present or not.
786
+
`ResourceClaim` and `ResourceSlice`, the fields will either be present or not.
784
787
785
-
This feature uses the DRA interface and will follow the DRA upgrade/downgrade
786
-
strategy.
788
+
This feature uses the DRA interface and will follow standard Kubernetes feature gate semantics.
787
789
788
790
### Version Skew Strategy
789
791
@@ -800,8 +802,7 @@ enhancement:
800
802
CRI or CNI may require updating that component before the kubelet.
801
803
-->
802
804
803
-
This feature affects only the kube-apiserver and kube-scheduler, so there is no
804
-
issue with version skew with other Kubernetes components.
805
+
Older schedulers, or schedulers with the feature gate disabled, will not see the values in the new fields and will therefore proceed to binding even if the ResourceSlice contains BindingConditions. The behavior once the Pod reaches the kubelet depends on the driver; in many cases the associated Pods are likely to fail.
805
806
806
807
## Production Readiness Review Questionnaire
807
808
@@ -919,10 +920,18 @@ rollout. Similarly, consider large clusters and how enablement/disablement
919
920
will rollout across nodes.
920
921
-->
921
922
922
-
When this feature is enabled, if a Pod requests a resource that has
923
-
BindingConditions, the Pod will wait in the PreBind phase until all
924
-
BindingConditions are set to True. This means that this feature only affects the
925
-
behavior before the Pod is scheduled, and does not affect running workloads.
923
+
It is safest to perform rollout and feature enablement in the following order:
924
+
925
+
1. Enable the feature gate on the kube-apiserver
926
+
2. Enable the feature gate on the kube-scheduler
927
+
3. Deploy the DRA driver that publishes ResourceSlice with BindingConditions
928
+
4. Deploy the controller responsible for that device (the binding controller)
929
+
930
+
For rollback, it is recommended to reverse this order.
931
+
932
+
An example of a rollout failure would be when step 3 is completed but step 4 is not.
933
+
In this situation, a Pod may be allocated a device with BindingConditions and remain in a scheduling wait state.
934
+
However, since the binding controller responsible for provisioning the device is not deployed, the provisioning never occurs, and scheduling cannot succeed.
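
The ordering above can be sketched as a small check. This is only an illustration: the step names and the `check_rollout` helper are hypothetical and are not part of the KEP or of any Kubernetes tooling.

```python
# Hypothetical names for the four rollout steps listed above, in order.
ROLLOUT_STEPS = [
    "apiserver_feature_gate",
    "scheduler_feature_gate",
    "dra_driver",
    "binding_controller",
]

# Rollback is recommended in the reverse order.
ROLLBACK_STEPS = list(reversed(ROLLOUT_STEPS))


def check_rollout(done):
    """Return warnings for a set of completed step names."""
    warnings = []
    seen_gap = False
    for step in ROLLOUT_STEPS:
        if step not in done:
            seen_gap = True
        elif seen_gap:
            # A later step is complete while an earlier one is not.
            warnings.append(f"{step} is enabled before an earlier step")
    # The failure example from the text: step 3 done, step 4 missing.
    if "dra_driver" in done and "binding_controller" not in done:
        warnings.append(
            "Pods may wait in PreBind until timeout: no binding controller"
        )
    return warnings


partial = {"apiserver_feature_gate", "scheduler_feature_gate", "dra_driver"}
for warning in check_rollout(partial):
    print(warning)
```

With steps 1 to 3 complete and step 4 missing, the sketch reports only the PreBind warning, which matches the rollout failure described above.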
926
935
927
936
###### What specific metrics should inform a rollback?
928
937
@@ -931,9 +940,21 @@ What signals should users be paying attention to when the feature is young
931
940
that might indicate a serious problem?
932
941
-->
933
942
934
-
When a timeout occurs in BindingConditions, the Pod is repeatedly re-scheduled, which leads to an increase in the `scheduler_schedule_attempts_total` metric with the label `result=unschedulable`.
- This metric counts the number of scheduling attempts (PreBind executions) in which BindingConditions are required.
947
+
- The metric includes a label `status` - `"success"`, `"failure"`, or `"timeout"`, allowing operators to understand the processing result of BindingConditions.
948
+
- By tracking this metric over time, operators can determine, at each point in time, whether the BindingConditions feature is being used.
935
949
936
-
Additionally, since the waiting time within the Pre-Bind phase increases, the `scheduler_framework_extension_point_duration_seconds` metric - especially the higher latency histogram buckets with labels `extension_point=PreBind` and `status=1` (Error) - will show elevated counts.
- This metric observes the full PreBind duration for DRA flows.
952
+
- The metric includes a label `status` - `"success"`, `"failure"`, or `"timeout"`, allowing operators to classify the duration by processing status.
953
+
- The metric includes a label `requires_bindingconditions` - `"true"` or `"false"`, allowing operators to separate the durations of flows with and without BindingConditions.
954
+
955
+
When a timeout occurs in BindingConditions, the Pod is repeatedly re-scheduled, which leads to an increase in the `scheduler_dra_bindingconditions_allocations_total` metric with the label `status=timeout`.
956
+
957
+
Additionally, since the waiting time within the PreBind phase increases, the `scheduler_dra_bindingconditions_prebind_duration_seconds` metric - especially the higher latency histogram buckets with the label `requires_bindingconditions=true` - will show elevated counts.
937
958
938
959
In all cases further analysis of logs and pod events is needed to determine whether errors are related to this feature.
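
As a rough illustration of the rollback signal, the following sketch computes the share of `status=timeout` samples in `scheduler_dra_bindingconditions_allocations_total` from scraped text. The sample values are invented, and the parser assumes that `status` is the only label on each line, which may not hold for a real scrape. Counters are cumulative, so in practice the ratio would be taken over the difference between two scrapes.

```python
import re
from collections import defaultdict

METRIC = "scheduler_dra_bindingconditions_allocations_total"

# Invented sample in Prometheus text format; not real measurements.
SAMPLE = """\
scheduler_dra_bindingconditions_allocations_total{status="success"} 180
scheduler_dra_bindingconditions_allocations_total{status="failure"} 5
scheduler_dra_bindingconditions_allocations_total{status="timeout"} 15
"""

# Assumption: each sample line carries only the `status` label.
LINE_RE = re.compile(r'^(\w+)\{status="(\w+)"\}\s+([0-9.eE+-]+)$')


def counts_by_status(text):
    """Sum the counter values per `status` label value."""
    counts = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(1) == METRIC:
            counts[match.group(2)] += float(match.group(3))
    return dict(counts)


def timeout_ratio(counts):
    """Fraction of counted attempts that ended in a timeout."""
    total = sum(counts.values())
    return counts.get("timeout", 0.0) / total if total else 0.0


counts = counts_by_status(SAMPLE)
ratio = timeout_ratio(counts)
print(f"timeout ratio: {ratio:.3f}")
```

A rising ratio is only a hint; as noted above, logs and pod events are still needed to confirm that the errors are related to this feature.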
939
960
@@ -972,12 +993,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
972
993
logs or events for this purpose.
973
994
-->
974
995
975
-
Operators can determine if this feature is in use by workloads by checking
976
-
the following:
977
-
978
-
- Presence of elements in `ResourceClaim.status.allocation.devices.results.bindingConditions`.
979
-
- Presence of elements in `ResourceSlice.spec.devices.bindingConditions`.
980
-
- Existence of a "BindingConditionsPending" message in the Pod's Event logs.
996
+
Operators can determine whether this feature is being used by workloads by checking the metric `scheduler_dra_bindingconditions_allocations_total`.
997
+
By tracking this metric over time, operators can determine, at each point in time, whether the BindingConditions feature is being used.
981
998
982
999
###### How can someone using this feature know that it is working for their instance?
983
1000
@@ -1026,8 +1043,10 @@ consistently low.
1026
1043
Pick one more of these and delete the rest.
1027
1044
-->
1028
1045
1029
-
This can be determined by comparing the time required for binding, specifically:
1030
-
The time from "BindingConditionsPending" to "Scheduled" in the Pod's event logs.
1046
+
This can be determined by monitoring the histogram metric `scheduler_dra_bindingconditions_prebind_duration_seconds`
1047
+
with the labels `status="success"` and `requires_bindingconditions="true"`.
1048
+
This metric shows the time it takes for BindingConditions to be processed and for the Pod to be scheduled.
1049
+
If this duration increases, it indicates that the SLI is degrading.
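
One way to turn the histogram into a single SLI value is to estimate a quantile from its cumulative buckets. The sketch below uses linear interpolation within a bucket, similar in spirit to how Prometheus estimates quantiles; the bucket bounds and counts are invented, and the real bucket layout of `scheduler_dra_bindingconditions_prebind_duration_seconds` may differ.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    bound, with the last entry being the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No finite upper bound: fall back to the last finite one.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented buckets for status="success", requires_bindingconditions="true".
BUCKETS = [(1.0, 10), (5.0, 60), (30.0, 95), (120.0, 100), (math.inf, 100)]

p50 = histogram_quantile(0.5, BUCKETS)
p90 = histogram_quantile(0.9, BUCKETS)
print(f"p50={p50:.2f}s p90={p90:.2f}s")
```

Tracking such a quantile over time gives a concrete way to see whether the duration described above is increasing.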
1031
1050
1032
1051
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
1033
1052
@@ -1064,10 +1083,13 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
1064
1083
- Impact of its degraded performance or high-error rates on the feature:
1065
1084
-->
1066
1085
1067
-
Yes. The proper functioning and latency of this feature depend on external
1068
-
controllers (e.g., composable DRA controllers).
1069
-
This is because the scheduler expects state updates from external controllers
1070
-
to satisfy the BindingConditions and allow the schedule to complete.
1086
+
- [External controller]
1087
+
- Usage description:
1088
+
The external controller is responsible for provisioning the resource state and satisfying BindingConditions so that the scheduler can complete the scheduling process.
1089
+
- Impact of its outage on the feature:
1090
+
If the external controller is not deployed or is completely unavailable, Pods that require devices with BindingConditions will remain in a pending state. Scheduling cannot succeed because the controller never provisions the device or updates the conditions.
1091
+
- Impact of its degraded performance or high error rates on the feature:
1092
+
If the controller is slow or error-prone, scheduling latency will increase significantly. Pods may experience long delays before becoming runnable, and in some cases, scheduling may fail if timeouts occur.
1071
1093
1072
1094
### Scalability
1073
1095
@@ -1096,7 +1118,7 @@ Focusing mostly on:
1096
1118
heartbeats, leader election, etc.)
1097
1119
-->
1098
1120
1099
-
Yes, there will be additional Get() calls to ResourceClaim for communication
1121
+
Yes, there will be additional Get() and Watch() calls to ResourceClaim for communication
1100
1122
with the external controller. However, these calls are executed with a
1101
1123
backoff interval and are therefore negligible.
1102
1124
@@ -1166,6 +1188,9 @@ time it takes for a Pod to transition from creation to a Running state, as well
1166
1188
as on Pod creation throughput. This is due to waiting for status updates from
1167
1189
external controllers in the scheduler's PreBind phase.
1168
1190
1191
+
This impact can be observed through the metric `scheduler_dra_bindingconditions_prebind_duration_seconds`.
1192
+
By filtering on the label `requires_bindingconditions` (`"true"` or `"false"`), operators can separate requests that involve BindingConditions from those that do not, and so verify the transition times for requests involving BindingConditions alone.
1193
+
1169
1194
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
1170
1195
1171
1196
<!--
@@ -1178,7 +1203,7 @@ This through this both in small and large cases, again with respect to the
Yes, CPU utilization will slightly increase for evaluating BindingConditions,
1206
+
No. CPU utilization will increase slightly when evaluating BindingConditions,
1182
1207
but this increase is negligible, because these operations are processed with a
1183
1208
backoff interval. And the addition of fields to existing structs will also
1184
1209
slightly increase memory consumption, but as mentioned in the previous item,
@@ -1235,7 +1260,7 @@ Failure of the external controller, or the absence of a corresponding external
1235
1260
controller, will lead to a scheduling timeout, causing Pods that were waiting
1236
1261
in the PreBind phase to be re-queued for scheduling.
1237
1262
1238
-
- Detection: Timeout of scheduling using the BindingConditions feature can be detected from the Pod's logs.
1263
+
- Detection: Timeout of scheduling using the BindingConditions feature can be detected from the `scheduler_dra_bindingconditions_allocations_total` metric with the label `status=timeout`.
1239
1264
- Mitigations: To prevent resources using the BindingConditions feature from being deployed, stop the controller that generates ResourceSlices with BindingConditions, and then delete those ResourceSlices.
1240
1265
- Diagnostics: Depends on the implementation of external controllers corresponding to BindingConditions.
1241
1266
- Testing: From the scheduler's perspective, this has been addressed through integration tests and unit tests, which confirm that timeouts do not occur under appropriate conditions and verify the behavior after a timeout occurs.