Skip to content

Conversation

@KobayashiD27
Copy link
Contributor

@KobayashiD27 KobayashiD27 commented Feb 18, 2025

Description

k/k development PR: kubernetes/kubernetes#130160

Issue

k/enhancement issue: kubernetes/enhancements#5007
Closes: #

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 18, 2025
@netlify
Copy link

netlify bot commented Feb 18, 2025

👷 Deploy Preview for kubernetes-io-vnext-staging processing.

Name Link
🔨 Latest commit 4ddbd18
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-vnext-staging/deploys/68900f46f351c000083ad650

@k8s-ci-robot k8s-ci-robot added the language/en Issues or PRs related to English language label Feb 18, 2025
@netlify
Copy link

netlify bot commented Feb 18, 2025

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit 4ddbd18
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-main-staging/deploys/68900f46fc25da000843d33c
😎 Deploy Preview https://deploy-preview-49814--kubernetes-io-main-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@KobayashiD27 KobayashiD27 force-pushed the dev-1.33-dra-device-binding-conditions branch from 187d479 to 32d24ef Compare March 6, 2025 06:36
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 6, 2025
@KobayashiD27 KobayashiD27 marked this pull request as ready for review March 12, 2025 01:45
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 12, 2025
@KobayashiD27 KobayashiD27 changed the title DRA Device Binding Conditions #5007 : placeholder PR DRA Device Binding Conditions #5007 document Mar 12, 2025
@Urvashi0109
Copy link
Contributor

Hello @KobayashiD27 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on 8th April, 2025 18:00 PDT.
This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. The status of this enhancement is marked as at risk for docs freeze. Thank you!

@KobayashiD27
Copy link
Contributor Author

Hello @Urvashi0109 !
Thank you for the reminder!
Unfortunately, the feature addition didn't make it in time. So, there will be no documentation work required for v1.33.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2025
@pohly
Copy link
Contributor

pohly commented Jul 2, 2025

@KobayashiD27: please edit the PR base to target dev-1.34.

@KobayashiD27 KobayashiD27 changed the base branch from dev-1.33 to dev-1.34 July 3, 2025 00:48
@KobayashiD27 KobayashiD27 force-pushed the dev-1.33-dra-device-binding-conditions branch from 258e5cb to 4ed9a3f Compare July 3, 2025 01:25
@k8s-ci-robot k8s-ci-robot added area/localization General issues or PRs related to localization area/web-development Issues or PRs related to the kubernetes.io's infrastructure, design, or build processes language/ja Issues or PRs related to Japanese language language/pt Issues or PRs related to Portuguese language sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 3, 2025
@KobayashiD27 KobayashiD27 force-pushed the dev-1.33-dra-device-binding-conditions branch from f6ccb3c to 3ac1bdb Compare July 17, 2025 02:27
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 17, 2025
{{< feature-state feature_gate_name="DRADeviceBindingConditions" >}}

Device Binding Conditions allow for the setting of `BindingConditions` and `BindingFailureConditions`,
which help determine if a device needs preparation before proceeding to the Bind Phase.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a PreBind plugin? (https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind) If yes, we should say so early on and link to that page.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review and sorry for late.

Yes, this is a PreBind plugin. I will add the link.
(But this is a first draft, so I think it will be revised throughout.)

Device Binding Conditions allow for the setting of `BindingConditions` and `BindingFailureConditions`,
which help determine if a device needs preparation before proceeding to the Bind Phase.
This is particularly useful in systems where devices must be attached to nodes.
The DRA driver sets these conditions based on the specific characteristics of the device it manages.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes it sound like Device Binding Conditions is more important for driver owners (not cluster operators). Is that correct? What happens if the cluster admin doesn't enable the feature gate but the driver has these conditions configured?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes it sound like Device Binding Conditions is more important for driver owners (not cluster operators). Is that correct?

Probably so, but I think there will need to be some communication between the two.

What happens if the cluster admin doesn't enable the feature gate but the driver has these conditions configured?

If feature-gate is disabled, these values will be dropped by validation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we have an error reporting mechanism that DRA Drivers are required to implement where they are informed about that.


If you want to set a timeout period for waiting during the PreBind phase,
you can specify the desired number of seconds in `BindingTimeoutSeconds`.
Furthermore, by setting `BindsToNode` to `true`, you can configure the nodeSelector to match only a single node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it clear that this is the devices.nodeSelector field in a ResourceSlice, not the nodeSelector in a Pod spec

@KobayashiD27 KobayashiD27 force-pushed the dev-1.33-dra-device-binding-conditions branch 2 times, most recently from 7f33dad to 29aaa69 Compare July 28, 2025 02:21
@KobayashiD27
Copy link
Contributor Author

@shannonxtreme

I've updated the documentation based on the latest implementation. I'd appreciate it if you could review it when convenient.

@KobayashiD27
Copy link
Contributor Author

Hi @pohly @klueska,
cc @shannonxtreme @johnbelamaric

The PR for the KEP implementation has been successfully merged!

I've updated this documentation PR to reflect the latest implementation.
Could you please take a moment to review it when you have time?

Also, if you know anyone else who would be a good fit to review this, I’d really appreciate it if you could loop them in.

Thanks in advance!

Copy link
Contributor

@shannonxtreme shannonxtreme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I left some suggestions to improve clarity and structure. Thank you for expanding on this content, it's a lot clearer!

must enable the `DRADeviceBindingConditions` and `DRAResourceClaimDeviceStatus` feature
gates for the scheduler to honor these fields.

- `bindingConditions`: a list of condition keys that must have status `True` before binding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Condition keys on what objects? Pod/Node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition keys (bindingConditions and bindingFailureConditions) refer to condition types defined in the status.conditions field of the ResourceClaim object, not the Pod or Node.

These conditions are typically updated by external device controllers to reflect the readiness or failure status of the associated device (e.g., GPU, FPGA). The scheduler evaluates these conditions during the PreBind phase to decide whether to proceed with Pod binding.

- `bindingConditions`: a list of condition keys that must have status `True` before binding.
This indicate readiness signals such as "device attached" or "initialized".
- `bindingFailureConditions`: a list of failure condition keys. If any have status `True`,
indicate that binding should be aborted and the Pod rescheduled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rescheduled meaning go back to the beginning of the scheduling workflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're mostly right! To be more precise, "rescheduled" in this context means that the current scheduling cycle will be terminated, and the item will wait until the next scheduling cycle begins.

node-specific setup on the selected node.

This feature is useful for asynchronous device preparation workflows,
such as dynamic GPU attachment or FPGA initialization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this value-add statement to the first paragraph of this section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated.

@michellengnx
Copy link
Contributor

Hello @KobayashiD27 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on Wednesday August 6, 2025 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. Thank you!

@KobayashiD27 KobayashiD27 force-pushed the dev-1.33-dra-device-binding-conditions branch from c1dcf14 to 4ddbd18 Compare August 4, 2025 01:39
@KobayashiD27
Copy link
Contributor Author

/assign @nate-double-u

@KobayashiD27
Copy link
Contributor Author

@johnbelamaric @shannonxtreme
Could you please check again? If there are no problems, I would appreciate it if you could comment that the technical review is complete.

Copy link
Member

@johnbelamaric johnbelamaric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
for DRA technical content

@KobayashiD27
Copy link
Contributor Author

@nate-double-u @michellengnx
Now that the technical review is complete, would SIG docs be able to review and approve this? If there are other suitable people who can help, please let me know.

@natalisucks
Copy link
Contributor

@johnbelamaric do you mind adding the LGTM label to indicate your tech review? i've reviewed for docs – looks good, so lets get this in before Docs Freeze!
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, natalisucks

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2025
@natalisucks
Copy link
Contributor

am adding the LGTM as well, as folks can see we have the tech review – big thanks to all 🚀
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 6, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

DetailsGit tree hash: ab47cbad0831aebb2ca4de8be66c56344d7ac72e

@k8s-ci-robot k8s-ci-robot merged commit c41eebb into kubernetes:dev-1.34 Aug 6, 2025
6 checks passed
@k8s-ci-robot k8s-ci-robot added this to the 1.34 milestone Aug 6, 2025
@KobayashiD27 KobayashiD27 deleted the dev-1.33-dra-device-binding-conditions branch August 7, 2025 06:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants