Commit cbc56b6

KEP-5007: DRA Device Binding Conditions beta in 1.36

- Updated the Production Readiness Review questionnaire and introduced metrics for troubleshooting and operations.
- Addressed review comments from the v1.35 PR #5487.
- Added Graduation Criteria for beta.
- Clarified that happy-path device migration is out of scope for the beta criteria.

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>

1 parent c695e44

2 files changed: +59 −34 lines


keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md

Lines changed: 57 additions & 32 deletions
@@ -226,6 +226,7 @@ Moving to beta aims to validate the feature at scale, gather real-world feedback
 - While this KEP introduces a mechanism that supports such use cases, broader architectural questions — such as how to model attachment workflows or coordinate between node-local and fabric-aware components — will be addressed in follow-up discussions.
 - Defining autoscaling strategies for fabric-attached devices (covered in future proposals).
 - Guaranteeing zero rescheduling in all failure scenarios.
+- A mechanism to safely move devices between different pools, achieving this as a happy-path flow without relying on re-scheduling triggered by BindingFailureConditions.
 
 ## Proposal
 
@@ -743,12 +744,14 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
 - A few bugs were identified in the external controller (CoHDI component), but all of them were resolvable with fixes and do not indicate a fatal problem with the use of BindingCondition.
 - Scenarios where devices in a ResourceSlice for a resource pool decrease were also tested, with no issues detected.
 - Please refer [here](https://github.com/CoHDI/composable-dra-driver/tree/main/doc/Usecase_and_feedback_for_BindingCondition.md) for more details.
-- Resolve the following issues
-  - If Scheduler picks up another node for the Pod after the restart, devices are unnecessarily left on the original nodes
-    (Composable DRA controller needs to have the function to detach a device automatically if it is not used by a Pod for a certain period of time)
-  - Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down
-  - Additional tests are in Testgrid and linked in KEP
-  - Scheduler supports timeout configuration via command-line argument
+- In this use case, the attachment scenario for moving devices between different pools is achieved through re-scheduling triggered by BindingFailureConditions. However, there remains an issue that device migration needs to be implemented using BindingConditions as a happy-path flow. This will be addressed in a separate KEP and is out of scope for the beta graduation criteria.
+- Feedback from the NVIDIA DRA Driver (https://github.com/NVIDIA/k8s-dra-driver-gpu)
+  - ComputeDomain is an NVIDIA GPU Operator concept that groups multiple Kubernetes pods into a secure, isolated domain so they can share GPU memory across nodes using Multi-Node NVLink and IMEX technology.
+  - BindingConditions let Kubernetes delay pod start until the ComputeDomain's prerequisites are truly ready - specifically, until IMEX daemons are scheduled and healthy - without resorting to fail-and-retry loops.
+  - This yields faster, more predictable pod startup and a simpler driver/controller design, because the pod is only bound once all ComputeDomain resources signal readiness via BindingConditions.
+  - Please refer [here](https://github.com/NVIDIA/k8s-dra-driver-gpu/issues/653) for more details.
+- Resolve the issues listed in:
+  https://github.com/kubernetes/kubernetes/issues/135472
 
 #### GA
 
@@ -780,10 +783,9 @@ enhancement:
 -->
 
 This feature exposes new fields such as `BindingConditions` in the
-`ResourceClaim` and `ResourceSlice`, the fields willeither be present or not.
+`ResourceClaim` and `ResourceSlice`; the fields will either be present or not.
 
-This feature uses the DRA interface and will follow the DRA upgrade/downgrade
-strategy.
+This feature uses the DRA interface and will follow standard Kubernetes feature flag semantics.
 
 ### Version Skew Strategy
 
@@ -800,8 +802,7 @@ enhancement:
 CRI or CNI may require updating that component before the kubelet.
 -->
 
-This feature affects only the kube-apiserver and kube-scheduler, so there is no
-issue with version skew with other Kubernetes components.
+Older schedulers, or schedulers with the feature flag disabled, will not see the values in the new fields, and so will proceed to binding even if the ResourceSlice contains BindingConditions. The exact behavior when the Pod reaches the kubelet will depend on the driver; in many cases it is likely that the associated Pods will fail.
 
 ## Production Readiness Review Questionnaire
 
@@ -919,10 +920,18 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
 
-When this feature is enabled, if a Pod requests a resource that has
-BindingConditions, the Pod will wait in the PreBind phase until all
-BindingConditions are set to True. This means that this feature only affects the
-behavior before the Pod is scheduled, and does not affect running workloads.
+It is safest to perform rollout and feature enablement in the following order:
+
+1. Enable the feature gate on the kube-apiserver.
+2. Enable the feature gate on the kube-scheduler.
+3. Deploy the DRA driver that publishes ResourceSlices with BindingConditions.
+4. Deploy the controller related to that device (i.e., the binding controller).
+
+For rollback, it is recommended to reverse this order.
+
+An example of a rollout failure would be when step 3 is completed but step 4 is not.
+In this situation, a Pod may be allocated to a device with BindingConditions and remain in a scheduling wait state.
+However, since the binding controller responsible for provisioning the device is not deployed, the provisioning never occurs, and scheduling cannot succeed.
 
 ###### What specific metrics should inform a rollback?
 
@@ -931,9 +940,21 @@ What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
 
-When a timeout occurs in BindingConditions, the Pod is repeatedly re-scheduled, which leads to an increase in the `scheduler_schedule_attempts_total` metric with the label `result=unschedulable`.
+Two metrics will be introduced.
+
+- `scheduler_dra_bindingconditions_allocations_total` (type: `CounterVec`)
+  - This metric counts the number of scheduling attempts (PreBind executions) in which BindingConditions are required.
+  - The metric includes a `status` label (`"success"`, `"failure"`, or `"timeout"`), allowing operators to understand the processing result of BindingConditions.
+  - By tracking this metric over time, operators can determine, at each point in time, whether the BindingConditions feature is being used.
 
-Additionally, since the waiting time within the Pre-Bind phase increases, the `scheduler_framework_extension_point_duration_seconds` metric - especially the higher latency histogram buckets with labels `extension_point=PreBind` and `status=1` (Error) - will show elevated counts.
+- `scheduler_dra_bindingconditions_prebind_duration_seconds` (type: `HistogramVec`)
+  - This metric observes the full PreBind duration for DRA flows.
+  - The metric includes a `status` label (`"success"`, `"failure"`, or `"timeout"`), allowing operators to classify the duration by processing status.
+  - The metric includes a `requires_bindingconditions` label (`"true"` or `"false"`), allowing operators to break down the duration by whether BindingConditions are involved.
+
+When a timeout occurs in BindingConditions, the Pod is repeatedly re-scheduled, which leads to an increase in the `scheduler_dra_bindingconditions_allocations_total` metric with the label `status=timeout`.
+
+Additionally, since the waiting time within the PreBind phase increases, the `scheduler_dra_bindingconditions_prebind_duration_seconds` metric - especially the higher-latency histogram buckets with the label `requires_bindingconditions=true` - will show elevated counts.
 
 In all cases further analysis of logs and pod events is needed to determine whether errors are related to this feature.
 
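The wait loop behind these status labels can be sketched as follows. This is a minimal illustration, not the actual kube-scheduler code: the polling interface and the `dra.example.com/...` condition names are hypothetical, and only the success/failure/timeout classification mirrors the metric's `status` label.

```python
import time

def wait_for_binding_conditions(poll_conditions, timeout_s, interval_s=0.01,
                                clock=time.monotonic):
    """Sketch of a PreBind-style wait.

    poll_conditions() returns (conditions, failure_conditions): two dicts of
    condition name -> bool, loosely mirroring BindingConditions and
    BindingFailureConditions on an allocated device.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        conditions, failure_conditions = poll_conditions()
        if any(failure_conditions.values()):
            return "failure"   # a BindingFailureCondition became True -> Pod is re-scheduled
        if conditions and all(conditions.values()):
            return "success"   # all BindingConditions True -> proceed to Bind
        time.sleep(interval_s)
    return "timeout"           # deadline passed -> Pod is re-queued for scheduling
```

In this sketch, a `"timeout"` result is what would increment the counter with `status=timeout` and send the Pod back through scheduling.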
@@ -972,12 +993,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
-Operators can determine if this feature is in use by workloads by checking
-the following:
-
-- Presence of elements in `ResourceClaim.status.allocation.devices.results.bindingConditions`.
-- Presence of elements in `ResourceSlice.spec.devices.bindingConditions`.
-- Existence of a "BindingConditionsPending" message in the Pod's Event logs.
+Operators can determine whether this feature is being used by workloads by checking the metric `scheduler_dra_bindingconditions_allocations_total`.
+Tracking this metric over time shows, at each point in time, whether the BindingConditions feature is in use.
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -1026,8 +1043,10 @@ consistently low.
 Pick one more of these and delete the rest.
 -->
 
-This can be determined by comparing the time required for binding, specifically:
-The time from "BindingConditionsPending" to "Scheduled" in the Pod's event logs.
+This can be determined by monitoring the histogram metric `scheduler_dra_bindingconditions_prebind_duration_seconds`
+with the labels `status="success"` and `requires_bindingconditions="true"`.
+This metric shows the time it takes for BindingConditions to be processed and for the Pod to be scheduled.
+If this duration increases, it indicates that the SLI is degrading.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
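As a sketch of how an operator could turn this histogram into a latency SLI, the following reproduces the linear bucket interpolation that Prometheus's `histogram_quantile()` performs on cumulative buckets. The bucket boundaries and counts below are invented for illustration and are not the metric's actual buckets.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count);
    the last bound may be float('inf'), as in a Prometheus +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # linear interpolation within the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Illustrative PreBind durations: 100 observations, cumulative per bucket.
buckets = [(0.5, 10), (1.0, 60), (5.0, 95), (float("inf"), 100)]
p90 = histogram_quantile(0.9, buckets)  # estimated 90th-percentile PreBind latency
```

A rising p90 for `status="success"`, `requires_bindingconditions="true"` is the degradation signal described above.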
@@ -1064,10 +1083,13 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its degraded performance or high-error rates on the feature:
 -->
 
-Yes. The proper functioning and latency of this feature depend on external
-controllers (e.g., composable DRA controllers).
-This is because the scheduler expects state updates from external controllers
-to satisfy the BindingConditions and allow the schedule to complete.
+- [External controller]
+  - Usage description:
+    The external controller is responsible for provisioning the resource state and satisfying BindingConditions so that the scheduler can complete the scheduling process.
+  - Impact of its outage on the feature:
+    If the external controller is not deployed or is completely unavailable, Pods that require devices with BindingConditions will remain in a pending state. Scheduling cannot succeed because the controller never provisions the device or updates the conditions.
+  - Impact of its degraded performance or high error rates on the feature:
+    If the controller is slow or error-prone, scheduling latency will increase significantly. Pods may experience long delays before becoming runnable, and in some cases, scheduling may fail if timeouts occur.
 
 ### Scalability
 
@@ -1096,7 +1118,7 @@ Focusing mostly on:
 heartbeats, leader election, etc.)
 -->
 
-Yes, there will be additional Get() calls to ResourceClaim for communication
+Yes, there will be additional Get() and Watch() calls to ResourceClaim for communication
 with the external controller. However, these calls are executed with a
 backoff interval and are therefore negligible.
 
@@ -1166,6 +1188,9 @@ time it takes for a Pod to transition from creation to a Running state, as well
 as on Pod creation throughput. This is due to waiting for status updates from
 external controllers in the scheduler's PreBind phase.
 
+This impact can be observed through the metric `scheduler_dra_bindingconditions_prebind_duration_seconds`.
+By filtering on the `requires_bindingconditions="true"`/`"false"` label, operators can limit the view to requests that involve BindingConditions and verify their transition times separately from those that do not.
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
 <!--
@@ -1178,7 +1203,7 @@ This through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
 
-Yes, CPU utilization will slightly increase for evaluating BindingConditions,
+No. CPU utilization will slightly increase for evaluating BindingConditions,
 but this increase is negligible, because these operations are processed with a
 backoff interval. And the addition of fields to existing structs will also
 slightly increase memory consumption, but as mentioned in the previous item,
@@ -1235,7 +1260,7 @@ Failure of the external controller, or the absence of a corresponding external
 controller, will lead to a scheduling timeout, causing Pods that were waiting
 in the PreBind phase to be re-queued for scheduling.
 
-- Detection: Timeout of scheduling using the BindingConditions feature can be detected from the Pod's logs.
+- Detection: Timeout of scheduling using the BindingConditions feature can be detected from the metrics (an increase in `scheduler_dra_bindingconditions_allocations_total` with `status=timeout`).
 - Mitigations: To prevent resources using the BindingConditions feature from being deployed, stop the controller that generates ResourceSlices with BindingConditions, and then delete those ResourceSlices.
 - Diagnostics: Depends on the implementation of external controllers corresponding to BindingConditions.
 - Testing: From the scheduler's perspective, this has been addressed through integration tests and unit tests, which confirm that timeouts do not occur under appropriate conditions and verify the behavior after a timeout occurs.
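The binding decision that the troubleshooting items above revolve around can be summarized in a small sketch. The data shapes are simplified and hypothetical, not the actual `resource.k8s.io` API types: the allocation result carries condition *names* (as in `ResourceClaim.status.allocation.devices.results`), the external controller reports condition *statuses*, and binding proceeds only when every named condition is True.

```python
def binding_state(result, reported):
    """result: dict with 'bindingConditions' and 'bindingFailureConditions'
    (lists of condition names from the allocation result).
    reported: dict of condition name -> "True"/"False"/"Unknown",
    as reported by the external controller.
    """
    if any(reported.get(name) == "True"
           for name in result.get("bindingFailureConditions", [])):
        return "failed"    # a failure condition is True -> triggers re-scheduling
    if all(reported.get(name) == "True"
           for name in result.get("bindingConditions", [])):
        return "ready"     # scheduler may proceed from PreBind to Bind
    return "pending"       # keep waiting, until the PreBind timeout
```

Under this sketch, an absent or failed external controller simply leaves every device `"pending"` until the timeout, which is exactly the detection and mitigation scenario described above.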

keps/sig-scheduling/5007-device-attach-before-pod-scheduled/kep.yaml

Lines changed: 2 additions & 2 deletions

@@ -23,13 +23,13 @@ see-also:
 - https://github.com/kubernetes/kubernetes/issues/124042#issuecomment-2548068135
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 #|beta|stable
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.35"
+latest-milestone: "v1.36"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
