DRA Device Binding Conditions #5007 document #49814
187d479 to
32d24ef
Compare
|
Hello @KobayashiD27 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on 8th April, 2025 18:00 PDT. |
|
Hello @Urvashi0109 ! |
|
@KobayashiD27: please edit the PR base to target dev-1.34. |
258e5cb to
4ed9a3f
Compare
f6ccb3c to
3ac1bdb
Compare
| {{< feature-state feature_gate_name="DRADeviceBindingConditions" >}} | ||
|
|
||
| Device Binding Conditions allow for the setting of `BindingConditions` and `BindingFailureConditions`, | ||
| which help determine if a device needs preparation before proceeding to the Bind Phase. |
Is this a PreBind plugin? (https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind) If yes, we should say so early on and link to that page.
Thank you for your review, and sorry for the late reply.
Yes, this is a PreBind plugin. I will add the link.
(But this is a first draft, so I think it will be revised throughout.)
| Device Binding Conditions allow for the setting of `BindingConditions` and `BindingFailureConditions`, | ||
| which help determine if a device needs preparation before proceeding to the Bind Phase. | ||
| This is particularly useful in systems where devices must be attached to nodes. | ||
| The DRA driver sets these conditions based on the specific characteristics of the device it manages. |
this makes it sound like Device Binding Conditions is more important for driver owners (not cluster operators). Is that correct? What happens if the cluster admin doesn't enable the feature gate but the driver has these conditions configured?
> this makes it sound like Device Binding Conditions is more important for driver owners (not cluster operators). Is that correct?

Probably so, but I think there will need to be some communication between the two.

> What happens if the cluster admin doesn't enable the feature gate but the driver has these conditions configured?

If the feature gate is disabled, these values will be dropped by validation.
We also have an error-reporting mechanism, which DRA drivers are required to implement, that informs them when this happens.
|
|
||
| If you want to set a timeout period for waiting during the PreBind phase, | ||
| you can specify the desired number of seconds in `BindingTimeoutSeconds`. | ||
| Furthermore, by setting `BindsToNode` to `true`, you can configure the nodeSelector to match only a single node. |
make it clear that this is the devices.nodeSelector field in a ResourceSlice, not the nodeSelector in a Pod spec
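For readers following this thread, the timeout behaviour described in the quoted text can be modelled with a small sketch. This is an illustrative Python model under stated assumptions, not the scheduler's implementation: the function name `wait_for_binding`, the `"bind"` / `"fail"` / `"wait"` result strings, and the use of polling are all hypothetical choices made for the example. Only the idea of a timeout given in seconds comes from the text above.

```python
import time
from typing import Callable


def wait_for_binding(
    check: Callable[[], str],
    timeout_seconds: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `check` until it reports "bind" or "fail", or the timeout elapses.

    `check` is a hypothetical callback that returns "bind", "fail", or "wait".
    `timeout_seconds` plays the role of the binding timeout from the text.
    """
    deadline = clock() + timeout_seconds
    while True:
        result = check()
        # A definite outcome ends the wait immediately.
        if result in ("bind", "fail"):
            return result
        # No outcome yet: give up once the deadline has passed.
        if clock() >= deadline:
            return "timeout"
        sleep(poll_interval)
```

The clock and sleep functions are injectable so the loop can be exercised without real delays, for example `wait_for_binding(lambda: "wait", timeout_seconds=0, sleep=lambda s: None)`.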
7f33dad to
29aaa69
Compare
|
I've updated the documentation based on the latest implementation. I'd appreciate it if you could review it when convenient. |
|
Hi @pohly @klueska, The PR for the KEP implementation has been successfully merged! I've updated this documentation PR to reflect the latest implementation. Also, if you know anyone else who would be a good fit to review this, I’d really appreciate it if you could loop them in. Thanks in advance! |
content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
shannonxtreme
left a comment
This looks good to me. I left some suggestions to improve clarity and structure. Thank you for expanding on this content, it's a lot clearer!
| must enable the `DRADeviceBindingConditions` and `DRAResourceClaimDeviceStatus` feature | ||
| gates for the scheduler to honor these fields. | ||
|
|
||
| - `bindingConditions`: a list of condition keys that must have status `True` before binding. |
Condition keys on what objects? Pod/Node?
The condition keys (bindingConditions and bindingFailureConditions) refer to condition types defined in the status.conditions field of the ResourceClaim object, not the Pod or Node.
These conditions are typically updated by external device controllers to reflect the readiness or failure status of the associated device (e.g., GPU, FPGA). The scheduler evaluates these conditions during the PreBind phase to decide whether to proceed with Pod binding.
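The evaluation described in this reply can be sketched roughly as follows. This is an illustrative Python model, not the scheduler's Go code: the function name `evaluate_binding`, the dict shape used for condition entries, and the `example.com/...` condition types in the usage note are assumptions made for the example.

```python
from typing import Iterable, Mapping


def evaluate_binding(
    conditions: Iterable[Mapping[str, str]],
    binding_conditions: Iterable[str],
    binding_failure_conditions: Iterable[str],
) -> str:
    """Return "fail", "bind", or "wait" for one allocated device.

    `conditions` models the condition entries reported on the ResourceClaim
    status; each entry is assumed to carry a "type" and a "status".
    """
    # Collect the condition types whose status is currently "True".
    true_types = {c["type"] for c in conditions if c.get("status") == "True"}

    # Any failure condition that is True aborts binding.
    if any(t in true_types for t in binding_failure_conditions):
        return "fail"

    # Binding proceeds only once every required condition is True.
    if all(t in true_types for t in binding_conditions):
        return "bind"

    # Otherwise the device is still being prepared.
    return "wait"
```

For instance, with a hypothetical readiness condition `example.com/attached` and failure condition `example.com/failed`, an empty condition list yields `"wait"`, and the result becomes `"bind"` once `example.com/attached` is reported with status `True`. Checking failure conditions first is a design choice of this sketch, so that a reported failure wins even when readiness conditions are also true.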
| - `bindingConditions`: a list of condition keys that must have status `True` before binding. | ||
| These indicate readiness signals such as "device attached" or "initialized". | ||
| - `bindingFailureConditions`: a list of failure condition keys. If any has status `True`, | ||
| binding is aborted and the Pod is rescheduled. |
Rescheduled meaning go back to the beginning of the scheduling workflow?
You're mostly right! To be more precise, "rescheduled" in this context means that the current scheduling cycle will be terminated, and the item will wait until the next scheduling cycle begins.
| node-specific setup on the selected node. | ||
|
|
||
| This feature is useful for asynchronous device preparation workflows, | ||
| such as dynamic GPU attachment or FPGA initialization. |
move this value-add statement to the first paragraph of this section
I have updated.
|
Hello @KobayashiD27 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on Wednesday August 6, 2025 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. Thank you! |
c1dcf14 to
4ddbd18
Compare
|
/assign @nate-double-u |
|
@johnbelamaric @shannonxtreme |
johnbelamaric
left a comment
/approve
for DRA technical content
|
@nate-double-u @michellengnx |
|
@johnbelamaric do you mind adding the LGTM label to indicate your tech review? I've reviewed for docs – looks good, so let's get this in before Docs Freeze! |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: johnbelamaric, natalisucks. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
am adding the LGTM as well, as folks can see we have the tech review – big thanks to all 🚀 |
|
LGTM label has been added. Details: Git tree hash: ab47cbad0831aebb2ca4de8be66c56344d7ac72e |
Description
k/k development PR: kubernetes/kubernetes#130160
Issue
k/enhancement issue: kubernetes/enhancements#5007
Closes: #