KEP-5027 + 5055: DRA: admin-controlled device attributes + taints #5034

pohly · 2025-01-10T15:52:02Z

One-line PR description: DRA: admin-controlled device attributes + taints
Issue links:
- DRA: admin-controlled device attributes #5027
- DRA: device taints and tolerations #5055
Other comments: first revision

/cc @johnbelamaric

pohly · 2025-01-12T12:31:53Z

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

k8s-ci-robot · 2025-01-12T12:31:56Z

@pohly: GitHub didn't allow me to request PR reviews from the following users: KobayashiD27.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md

pohly · 2025-01-13T08:02:06Z

/wg device-management
/sig node

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md

eero-t

There was earlier discussion of common (driver independent) tool(ing) for listing, adding and removing device taints. Would it make sense to mention something about that in the tainting KEP?

keps/sig-node/5055-device-taints-and-tolerations/README.md

everpeace

I deeply appreciate your quick action on the device taints/tolerations KEP! I left some comments. PTAL.

keps/sig-node/5055-device-taints-and-tolerations/README.md

keps/sig-node/5055-device-taints-and-tolerations/kep.yaml

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md

keps/sig-node/5055-device-taints-and-tolerations/README.md

dom4ha · 2025-02-07T20:32:29Z

keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md

+
+During version skew where the apiserver supports the feature and the scheduler
+doesn't, taints can be set without encountering errors or
+warnings, but they won't have any effect.


Aren't taints persisted? If the component responsible for turning down pods is using the newer version, the scheduler would keep scheduling pods there.

Scheduling would succeed, but only to have pods evicted by the controller for NoExecute taints, assuming that the controller isn't also a component where the feature is off.

So for the duration of this version skew, the scheduler will see a resource as available, but the controller will constantly evict the scheduled pods. I'm not sure how much of a problem this is; it probably depends on how long such skew can last.

pohly · 2025-02-07T21:37:42Z

I'd consider deferring 5027, but go with 5055 in this cycle and use reservation API. It should be sufficient for achieving all goals. Note that 5027 is the most heavy and complex to implement.

That's because 5027 describes the patching mechanism which is also needed for 5055. The patching of attributes and capacities is done in 20 lines of code (there's a prototype).

use reservation API

I'm still not sure what you mean by that. If you are suggesting that your in-progress #5149 should be used as the basis instead, then I disagree.

dom4ha · 2025-02-08T22:28:03Z

/lgtm
As discussed in https://kubernetes.slack.com/archives/C09TP78DV/p1739053351531189?thread_ts=1738793685.308189&cid=C09TP78DV, the possibility of using ResourceClaims for reservations is unclear, and agreeing on it within a week is definitely not feasible, so I am withdrawing my suggestion.

pohly · 2025-02-10T12:56:52Z

keps/prod-readiness/sig-scheduling/5055.yaml

+# of http://git.k8s.io/enhancements/OWNERS_ALIASES
+kep-number: 5055
+alpha:
+  approver: "@johnbelamaric"


@soltysh: John suggested that you might be able to take this and its twin, 5027, for PRR.

I'll keep John listed here for now, but let me assign it to you:

/assign @soltysh

Changed to @soltysh as discussed in #prod-readiness.

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/kep.yaml

keps/sig-scheduling/5055-dra-device-taints-and-tolerations/kep.yaml

keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md

pohly · 2025-02-10T13:17:49Z

keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md

+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)


Catch-22? I cannot link to a directory in the issue because this PR creates it, and I cannot set the checkmark prior to merging.

alculquicondor

/approve
for sig scheduling

alculquicondor · 2025-02-10T14:52:59Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+it is preserved even when a ResourceSlice gets removed or updated. To achieve
+this, a new cluster-scoped ResourceSlicePatch type gets added. A single
+ResourceSlicePatch object specifies device attributes that apply to all
+devices matching a CEL expression (i.e. the same way as users select devices in


Aren't we overdoing it with CEL? (Any CEL parsing and matching will be relatively expensive.)
Can't we use a label selector, along with the information listed next?

Devices don't have labels to select on. For selecting a device by its attributes we have to use CEL.

One use case which might need a device attribute is "taint this particular device with UID 12345". That particular device might be exposed under a specific stable name in a ResourceSlice (e.g. "gpu-0" for "PCI slot 0") and the taint no longer applies once the device gets swapped out. But I am speculating. We need feedback before we can decide whether it's needed.

Devices don't have labels to select on

Perhaps they should. But again: YAGNI should prevail. Worth discussing, but I'll leave it up to you.

alculquicondor · 2025-02-10T14:57:59Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+Helper code which keeps an up-to-date list of devices with all patches added
+to them will be provided as part of `k8s.io/dynamic-resource-allocation`. It
+will be based on informers such that evaluating the filter only is
+necessary when ResourceSlices or ResourceSlicePatches change.


What if multiple patches match the same resourceslice and they have conflicting information?
Please clarify how that is resolved.

This is explained under DevicePatch.Priority:

    // If a ResourceSlice and a DevicePatch define the same attribute or
    // capacity, the value of the DevicePatch is used. If multiple
    // different DevicePatches match the same device, then the one with
    // the highest priority wins. If priorities are equal, the older
    // patch wins. This ensures that adding a new patch does not
    // accidentally change the effect of some existing patch unless
    // that is clearly intended according to the priority.
    //
    // +optional
    Priority *int

I was discussing with @dom4ha whether it should be "newest one wins" in #5034 (comment) and he convinced me that "don't accidentally overwrite existing patch" is the better semantic.

I like "the oldest patch wins", it sounds less prone to race conditions.
But it should still probably be discouraged in the documentation.

"newest one wins"

By that, do you mean by creation timestamp, or how is "newer" defined?

Yes. "Last modification time" might also be useful, but we don't have that.

Clarified and added the "discouraged" part because choosing different priorities is clearer.
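The conflict rule discussed in this thread (highest priority wins, and on equal priority the older patch wins) can be sketched as follows. This is only an illustration under stated assumptions: the `Patch` type, its fields, and `ApplyPatches` are hypothetical names and not the KEP's API, the CEL matching is assumed to have happened already, an unset priority is assumed to behave like 0, and creation time is reduced to a plain integer.

```go
package main

import (
	"fmt"
	"sort"
)

// Patch is a hypothetical, simplified stand-in for a patch that already
// matched a device. It is not the type proposed in the KEP.
type Patch struct {
	Name       string
	Priority   int               // higher priority wins (assumption: unset means 0)
	CreatedAt  int64             // creation time as Unix seconds; older wins on ties
	Attributes map[string]string // attribute values that override the ResourceSlice
}

// ApplyPatches returns a copy of the device attributes with all patches
// applied. Patches are ordered from weakest to strongest so that the
// winning patch is written last.
func ApplyPatches(device map[string]string, patches []Patch) map[string]string {
	ordered := append([]Patch(nil), patches...)
	sort.SliceStable(ordered, func(i, j int) bool {
		if ordered[i].Priority != ordered[j].Priority {
			return ordered[i].Priority < ordered[j].Priority
		}
		// Equal priority: the newer patch sorts first, so the older
		// patch is applied later and therefore wins.
		return ordered[i].CreatedAt > ordered[j].CreatedAt
	})

	result := make(map[string]string, len(device))
	for k, v := range device {
		result[k] = v
	}
	for _, p := range ordered {
		for k, v := range p.Attributes {
			result[k] = v
		}
	}
	return result
}

func main() {
	device := map[string]string{"model": "a100", "tier": "standard"}
	patches := []Patch{
		{Name: "newer", Priority: 0, CreatedAt: 200, Attributes: map[string]string{"tier": "silver"}},
		{Name: "older", Priority: 0, CreatedAt: 100, Attributes: map[string]string{"tier": "gold"}},
	}
	fmt.Println(ApplyPatches(device, patches)["tier"]) // prints "gold"
}
```

Relying on the age tie-break is what the thread suggests discouraging; giving the two patches different priorities makes the intended winner explicit.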

kannon92 · 2025-02-10T20:15:10Z

keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md

+    // +optional
+    // +listType=atomic
+    // +featureGate=DRADeviceTaints
+    Taints []DeviceTaint


Do we have any API markers here to enforce the limit of 8?

We can add kubebuilder validations for out-of-tree controllers.

Also, why 8?

Do we have any API markers here to enforce the limit of 8?

I think such markers exist for generating CRDs, but they wouldn't have any effect in-tree.

Also, why 8?

As usual, a more or less arbitrary choice. For additional discussion, see #5034 (comment). Because ResourceSlice is size-constrained, I made it a bit smaller than suggested there.
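Since the markers would have no effect in-tree, the limit would have to be enforced by hand-written validation. A minimal sketch of such a check follows, assuming a limit of 8 taints per device. The names `maxTaintsPerDevice` and `validateTaints` and the `DeviceTaint` fields are assumptions made for this illustration; real in-tree validation would likely report field-level errors rather than a plain `error`.

```go
package main

import "fmt"

// maxTaintsPerDevice mirrors the limit of 8 mentioned in the discussion.
// The constant name is hypothetical.
const maxTaintsPerDevice = 8

// DeviceTaint is a simplified stand-in; the fields are assumptions.
type DeviceTaint struct {
	Key    string
	Value  string
	Effect string
}

// validateTaints rejects a device that carries more taints than allowed.
func validateTaints(taints []DeviceTaint) error {
	if len(taints) > maxTaintsPerDevice {
		return fmt.Errorf("too many taints: got %d, must not exceed %d",
			len(taints), maxTaintsPerDevice)
	}
	return nil
}

func main() {
	fmt.Println(validateTaints(make([]DeviceTaint, 8))) // within the limit
	fmt.Println(validateTaints(make([]DeviceTaint, 9))) // over the limit
}
```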

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

soltysh

PRR for both documents looks mostly good, but I left a few questions

soltysh · 2025-02-12T14:33:21Z

keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md

+whereas health monitoring might prefer to be specific and use a vendor-defined
+unique ID. Both are supported, which creates additional complexity.
+
+Without a kubectl extension similar to `kubectl taint nodes`, the user


I wouldn't call that a risk 😉, but more of an inconvenience. What I do miss is the risk that a device is tainted and users bypass that (you've mentioned that for testing). What happens when users bypass the taint by tolerating it in their ResourceClaim, and are you protecting against that? That's a risk worth describing here.

I wouldn't call that a risk 😉, but more of an inconvenience.

Building something that is unusable is a risk, but yeah, I get your point 😁

What happens when users bypass the taint by tolerating it in their ResourceClaim, and are you protecting against that?

Elsewhere (might have been in a comment, because I don't see it in the KEP) I said "users do that at their own risk", which is the same situation we have with node taints. Added here, with some more thoughts.

soltysh · 2025-02-12T14:35:08Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+
+The intent to patch device attributes must be recorded persistently so that
+it is preserved even when a ResourceSlice gets removed or updated. To achieve
+this, a new cluster-scoped ResourceSlicePatch type gets added. A single


Honestly, the ResourceSlicePatch name isn't the best, imo. It's very tightly coupled with the patch operation. I see there was discussion around ResourceSliceOverrides, but maybe something generic like ResourceSliceModifiers would be nice? Definitely not a blocking comment. It will most likely return during API review.

something generic like ResourceSliceModifiers would be nice

I like that.

It will most likely return during API review.

Oh, absolutely. Naming always does.

We tried to do API reviews early for all KEPs, but this one hasn't been reviewed by any API reviewer yet. Let's proceed with the risk that I will have to rename things during the implementation phase.

soltysh · 2025-02-12T14:36:24Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+part of this proposal.
+
+Perhaps `kubectl describe resourceslices` can be extended to include the
+additional information. For now this is out of scope.


👍 that's one of the reasons we have describe, to allow for a bit more expressive information.

soltysh · 2025-02-12T14:37:34Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+Perhaps `kubectl describe resourceslices` can be extended to include the
+additional information. For now this is out of scope.
+
+Creating a ResourceSlicePatch is racing with on-going scheduling attempts,


This seems like it should go to Risks section below.

Moved and added some thoughts on mitigation (client-side evaluation was chosen because of this).

soltysh · 2025-02-12T14:45:28Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+
+type ResourceSlicePatchSpec struct {
+    // Devices defines how to patch device attributes and taints.
+    Devices DevicePatch


Any particular reason you're not directly embedding the contents of the DevicePatch struct here? Are there any potential extensions that would make you do it this way?

This is for symmetry with ResourceSlice, which (theoretically) one day might get extended to describe something other than "a device". I know Tim hates this and certainly will remind me again... 😁

soltysh · 2025-02-12T14:46:54Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+Helper code which keeps an up-to-date list of devices with all patches added
+to them will be provided as part of `k8s.io/dynamic-resource-allocation`. It
+will be based on informers such that evaluating the filter only is
+necessary when ResourceSlices or ResourceSlicePatches change.


"newest one wins"

By that, do you mean by creation timestamp, or how is "newer" defined?

soltysh · 2025-02-12T14:59:49Z

keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md

+
+- `k8s.io/dynamic-resource-allocation/structured`: 91.3%
+- `k8s.io/kubernetes/pkg/apis/resource/validation`: 98.6%
+- `k8s.io/kubernetes/pkg/controller/tainteviction`: 81.8%


Earlier in the doc you've mentioned that this will be an entirely new controller, only similar to tainteviction. If so, why include the tainteviction controller here?

I am going to copy the tainteviction controller, then modify it. My aim is to have better coverage than the original GA tainteviction controller, so it seemed like a useful baseline.

soltysh · 2025-02-12T15:05:59Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+###### What happens if we reenable the feature if it was previously rolled back?
+
+It takes effect again for scheduling.
+Running applications are not affected.


Hmm... I might be missing something, but iiuc if I create a ResourceSlicePatch, which will modify a ResourceSlice with a taint forcing eviction (and in combination with the controller from the other KEP) does it mean we can potentially evict running applications? Also, a followup question, should ResourceSlicePatch be taken into account by the new devicetainteviction controller (ie. the new one introduced in the other KEP here)?

(and in combination with the controller from the other KEP)

This section is focused exclusively on admin-controlled attributes. Those don't cause eviction. Basically ignore that taints exist and consider only the DRAAdminControlledDeviceAttributes.

Also, a followup question, should ResourceSlicePatch be taken into account by the new devicetainteviction controller (ie. the new one introduced in the other KEP here)?

Yes, absolutely. "Taints defined by an admin in a ResourceSlicePatch get added to the set of taints defined by the DRA driver in a ResourceSlice" is specified in the 5055 KEP.

KEP 5055 also has "It takes effect again for scheduling and may evict pods." under "What happens if we reenable the feature if it was previously rolled back?", i.e. exactly the thing you asked about here.

Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).

Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).

I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods. It's better to be explicit than to guess the interactions, especially since we should assume someone is reading this document separately from the other one.

I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods.

Makes sense. I've updated this section here and squashed:

Admin-controlled attributes take effect again for scheduling. Running applications are not affected because allocations are never updated once they are made. Note that this is different for reenabling device taints (KEP 5055): that can cause pods to get evicted.
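The difference between the two features in this thread can be sketched in a few lines: patched attributes override values, whereas admin-defined taints are added to the driver-defined ones, and only a NoExecute taint can lead to eviction. The sketch below is an illustration under assumptions: `effectiveTaints` and `mayEvict` are hypothetical helpers, the `DeviceTaint` fields and the `NoSchedule` effect name are assumed by analogy with node taints, and no de-duplication is attempted.

```go
package main

import "fmt"

// DeviceTaint is a simplified stand-in; the fields are assumptions.
type DeviceTaint struct {
	Key    string
	Effect string // assumed values: "NoSchedule" or "NoExecute"
}

// effectiveTaints combines the taints published by the DRA driver with
// those added by an admin. Admin taints are appended, never substituted.
func effectiveTaints(driver, admin []DeviceTaint) []DeviceTaint {
	out := make([]DeviceTaint, 0, len(driver)+len(admin))
	out = append(out, driver...)
	out = append(out, admin...)
	return out
}

// mayEvict reports whether any taint could cause running pods to be evicted.
func mayEvict(taints []DeviceTaint) bool {
	for _, t := range taints {
		if t.Effect == "NoExecute" {
			return true
		}
	}
	return false
}

func main() {
	driver := []DeviceTaint{{Key: "example.com/unhealthy", Effect: "NoSchedule"}}
	admin := []DeviceTaint{{Key: "example.com/maintenance", Effect: "NoExecute"}}
	all := effectiveTaints(driver, admin)
	fmt.Println(len(all), mayEvict(driver), mayEvict(all)) // prints "2 false true"
}
```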

soltysh · 2025-02-12T15:06:50Z

keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md

+
+###### What specific metrics should inform a rollback?
+
+<!--


Both here and in the other document, I'd suggest thinking through potential metrics for measuring how this affects the workloads.

There are some from the existing tainteviction controller and I am also thinking about how to expose performance of the patching mechanism, but would like to add that to the KEPs when targeting beta.

soltysh

/approve
the PRR

soltysh · 2025-02-12T16:28:29Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+###### What happens if we reenable the feature if it was previously rolled back?
+
+It takes effect again for scheduling.
+Running applications are not affected.


Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).

I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods. It's better to be explicit than to guess the interactions, especially since we should assume someone is reading this document separately from the other one.

k8s-ci-robot · 2025-02-12T16:29:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, pohly, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [soltysh]
~~keps/sig-scheduling/OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

These are two different KEPs that provide two features that can be enabled and disabled independently. However, both use the same new ResourceSliceOverride type and thus get described and implemented together.

johnbelamaric · 2025-02-12T17:24:37Z

/lgtm

k8s-ci-robot requested a review from johnbelamaric January 10, 2025 15:52

k8s-ci-robot requested a review from byako January 12, 2025 12:31

pohly commented Jan 12, 2025

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md (outdated, resolved)

k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 13, 2025

pohly mentioned this pull request Jan 13, 2025

DRA: admin-controlled device attributes #5027

Open

4 tasks

pohly commented Jan 13, 2025

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md (outdated, resolved)

everpeace mentioned this pull request Jan 14, 2025

Topological alignment between GPUs and NICs in DRA (exposing pci device topology as device attribute?) NVIDIA/k8s-dra-driver-gpu#213

Open

everpeace reviewed Jan 16, 2025

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md (outdated, resolved)

pohly mentioned this pull request Jan 20, 2025

DRA: device taints and tolerations #5055

Open

4 tasks

pohly force-pushed the dra-device-attribute-overrides branch from 531a905 to cddc84f Compare January 20, 2025 14:31

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 20, 2025

pohly force-pushed the dra-device-attribute-overrides branch 2 times, most recently from c4a6f66 to 41cdbf5 Compare January 20, 2025 14:44

eero-t reviewed Jan 20, 2025

keps/sig-node/5055-device-taints-and-tolerations/README.md (outdated, resolved)

keps/sig-node/5055-device-taints-and-tolerations/README.md (outdated, resolved)

pohly changed the title ~~KEP-5027: DRA: admin-controlled device attributes~~ KEP-5027: DRA: admin-controlled device attributes + device taints Jan 20, 2025

everpeace reviewed Jan 21, 2025

keps/sig-node/5055-device-taints-and-tolerations/README.md (outdated, resolved)

keps/sig-node/5055-device-taints-and-tolerations/README.md (outdated, resolved)

keps/sig-node/5055-device-taints-and-tolerations/kep.yaml (outdated, resolved)

pohly commented Jan 21, 2025

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md (outdated, resolved)

pohly commented Jan 21, 2025

keps/sig-node/5055-device-taints-and-tolerations/README.md (outdated, resolved)

pohly commented Jan 21, 2025

keps/sig-node/5055-device-taints-and-tolerations/README.md (outdated, resolved)

eero-t reviewed Jan 21, 2025

keps/sig-node/5055-device-taints-and-tolerations/README.md (outdated, resolved)

asm582 reviewed Jan 21, 2025

keps/sig-node/5055-device-taints-and-tolerations/README.md (outdated, resolved)

k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 21, 2025

k8s-ci-robot requested a review from alculquicondor February 7, 2025 20:19

dom4ha reviewed Feb 7, 2025


k8s-ci-robot assigned dom4ha Feb 8, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2025

pohly commented Feb 10, 2025


k8s-ci-robot assigned soltysh Feb 10, 2025

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2025

alculquicondor reviewed Feb 10, 2025


pohly mentioned this pull request Feb 10, 2025

Add KEP for DRA: Extended Resource #5136

Merged

kannon92 reviewed Feb 10, 2025


pohly mentioned this pull request Feb 11, 2025

KEP-4815 DRA Partitionable devices design update #5066

Merged

pohly force-pushed the dra-device-attribute-overrides branch from 2bbe533 to 86824bf Compare February 11, 2025 17:00

k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 11, 2025

pohly changed the title ~~KEP-5027 + 5055: DRA: admin-controlled device attributes + device taints~~ KEP-5027 + 5055: DRA: admin-controlled device attributes + taints Feb 11, 2025

johnbelamaric reviewed Feb 11, 2025

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md (resolved)

soltysh reviewed Feb 12, 2025

soltysh approved these changes Feb 12, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2025

DRA: add admin controlled device attributes + device taints

7990805

These are two different KEPs that provide two features that can be enabled and disabled independently. However, both use the same new ResourceSliceOverride type and thus get described and implemented together.

pohly force-pushed the dra-device-attribute-overrides branch from d096544 to 7990805 Compare February 12, 2025 16:35

k8s-ci-robot assigned johnbelamaric Feb 12, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025

k8s-ci-robot merged commit 783b3ac into kubernetes:master Feb 12, 2025
4 checks passed

k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025

pohly mentioned this pull request Mar 6, 2025

WIP: Implement DRA admin-controlled attributes (KEP-5027) kubernetes/kubernetes#130120

Closed

