KEP-5690: Add proposal (DRA preemption) by mortent · Pull Request #6113 · kubernetes/enhancements

mortent · 2026-05-25T22:59:34Z

One-line PR description: Preemption for DRA
Issue link: DRA: preemption #5690
Other comments:

mortent · 2026-05-26T15:40:58Z

/wg device-managment

k8s-ci-robot · 2026-05-26T15:41:01Z

@mortent: The label(s) wg/device-managment cannot be applied, because the repository doesn't have them.

Details

In response to this:

/wg device-managment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

mortent · 2026-05-26T15:42:49Z

/wg device-management

pohly

KEP looks reasonable to me. The purpose is clear and it describes well what can be done.

The devil will be in implementing this correctly - lot's of details!

johnbelamaric · 2026-05-27T22:41:56Z

+* **ResourceClaims that span multiple nodes**: The default preemption algorithm only considers preempting
+  pods on a single node, meaning that it will not be able to free up devices that are allocated to claims
+  referenced by pods spanning multiple nodes.
+* **ResourceClaims for PodGroups**: These claims are allocated to a PodGroup, meaning that all pods in the


@nojnhuh have you looked at how we can integrate PodGroup RCs with #5710 ?

@wojtek-t mentioned in #6113 (comment) that #5710 is keeping this in mind so I'm not too worried about being backed into a corner by the time we're ready to integrate these features. I'll keep tabs on the WAS pieces.

johnbelamaric · 2026-05-27T22:43:49Z

+###### How can an operator determine if the feature is in use by workloads?
+
+It is enabled on a cluster-level, so if it is enabled, it is in use. It will only impact
+pods that are referencing ResourceClaims.


Is there a metric to show it is happening? Is it worth decorating any existing metric?

There are two metrics that are particularly relevant for this, scheduler_preemption_attempts_total and scheduler_preemption_victims. But they don't currently don't have any labels and it is not clear to me if adding information about the pods as labels on metrics is a good solution. At the very least, it doesn't seem scalable.

johnbelamaric · 2026-05-27T22:45:41Z

+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.37"


Since there is no API, in theory we can go straight to default-on beta. I think this could be possible if:

We can show that this code ONLY kicks in when a pod has RCs.

We have really good tests.

Of course the FG can be turned off

I think all of those are doable. So I have updated the KEP to target direct to beta. I assume that in the end it will come down to whether we feel confident enough in the implementation to go through with it.

wojtek-t · 2026-05-28T07:19:24Z

+* **Device Binding Conditions**: Pods might be waiting for a resource binding condition, so we want to make
+  sure this is handled correctly during preemption simulations.
+
+Features/scenarios that we will not support:


FWIW - we already do have PodGroup-oriented preemption implemented here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L455
invoked from:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one_podgroup.go#L219
We want to graduate workload-aware preemption to Beta in 1.37:
#6106
so there are places which DRA-preemption should integrate with.

That being said, I think it makes sense to focus on single-node devices first (given that we will be making non-trivial changes in WAP in 1.37 to avoid conflicts). And expand to multinode devices as a potential followup.

/cc @Argh4k

I think that support for WAP and single/podgroup devices are actually two different things.

In the end in WAP we will do the following:

remove all potential victims via NodeInfo.RemovePod

run PodGroupSchedulingCycle (creates fresh cycle state based on nodes with victims removed)

for each victim try to reprieve it via Reprieve method

if Reprieve returns true, add back the node via NodeInfo.AddPod

at the end preempt all victims that failed Reprieve

From DRA perspective, IIUC, if we remove pods from the "scheduler" view of the cluster via NodeInfo.RemovePod, it won't be actually reflected by the DRAManager as manager gathers resource claims allocated state based on its informer cache.

That means that in WAP model from beta proposal, even when we remove pods with resource claims that could unblock the preemptor, preemptor will not be able to use them.

We would need a way to create an allocatedState that also includes pods removals. Maybe we could use for that podGroupState (store in it information about all removals and put that on top of allocatedState?).

Another option would be to actually keep the cycle state of pods from pod group and pass it to WAP. Then during victim removal we would run PreFilterExtensions.RemovePod for each pod from pod group and later reuse the cycle state in PodGroupSchedulingCycle. This would require a major (?) refactoring of the schedule_one_podgroup as everything there is tightly coupled.

Nevertheless, I think that for this proposal we can aim at support single pod cycle only as I do not think that we will have the capacity to implement: this one, WAP beta update and marry them together.

So, we reached an agreement in WAP KEP () that we want to rerun the whole pod group scheduling algorithm within WAP. So the option of keeping the cycle state is out. This means that we will do the following:

remove all potential victims via NodeInfo.RemovePod

run PodGroupSchedulingCycle (creates fresh cycle state based on nodes with victims removed)

for each victim try to reprieve it. Reprieving will be done by running Filter plugins on all preemptor pods on nodes where they are supposed to run, with the victim added back onto its original node

if Reprieve returns true, add back the node via NodeInfo.AddPod

So from this KEP, my main question is whether the PreFilter method of DRA is implemented in such a way that it detects that some pods holding resource claims are actually not in the NodeInfos? And if yes, whether PreFilterExtension for DRA plugin will work on top of that?

@Argh4k The dynamicresources plugin does not check the NodeInfo slice in the PreFilter implementation, so it will not discover if any pods were removed. It fetches the state from the DRAManager, which is a snapshot of the live state of the cluster. Is this a problem? It might be possible to fix this by reconciling the state from the DRAManager with the nodes available from NodeInfo, but I'm not yet sure if we have the necessary information to do this without larger changes.

In my WIP implementation of this KEP, I store some of the information from the DRAManager in the stateData struct so it can be updated as AddPod and RemovePod are called during the preemption simulations: kubernetes/kubernetes#139266. I think that aligns with what you are suggesting here, where the dynamicresources plugin needs to handle that some pods are removed during simulation of the pod group scheduling. Am I understanding this correctly?

Given that the primary bottlenecks is review bandwidth, I would like to avoid doing things twice and start already with something that will work for WAP (unless you can show that making it work for WAP will be pure extension - but I don't see that now).

@wojtek-t To support WAP, DRA plugin would have to also support PodGroup-level snapshotting, so it may be a bigger change.

Currently the plugin snapshots DRA state (DRAManager) only for a single pod scheduling cycle (not PodGroup scheduling cycle), so the state is constantly updating after each pod sub-cycle.

The problem results from the fact that DRA has its own state even thought the plugins should be stateless, so there is no (yet) an extension point that would allow the plugin to follow PodGroup scheduling cycle boundaries (when it starts and ends) and assure scheduling consistency, required to assure WAP and PodGroups algorithm correctness.

@mortent What @Argh4k is saying is that the framework will not call PreFilterExtensions.RemovePod, but NodeInfo.RemovePod on PodGroup level snapshot, but it will call PreFilterExtensions.AddPod on reprieve, so in fact there DRA plugin does not need to implement PreFilterExtensions.RemovePod at all, but it should be able to rebuild its state from the PodGroup level snapshot.

So I think this means that the dynamicresources plugin must be able to look at the information in the DRAManager and NodeInfos and determine the removed Pods and the impact that has on the available devices and their capacity. I will take a look at the code today and try to come up with a proposal for how we can implement this.

So I looked at the code and I definitely think it is doable for the dynamicresources plugin to rebuild the state in PreFilter based on information in NodeInfos:

Capture the UIDs of all remaining pods (i.e. not removed by the NodeInfo.RemovePod calls) by iterating over the provided nodeInfos and the referenced pods.

Iterate over all the claims from DRAManager.ResourceClaims and determine which should be simulated as released by since none of the pods referenced in their .status.reservedFor list remain in the simulation.

This should let us track the state in a similar way to what I have in my POC for the pod-by-pod scheduling cycle.

@Argh4k Does the new WAP implementation replace the current one, meaning we no longer need to implement PreFilterExtensions.RemovePod? Does this impact other parts of the preemption lifecycle, meaning does it change when/if the cyclestate is cloned?

@dom4ha What is the consequence of PodGroup-level shapshotting, in particular what changes do we need to make to the DRAManager to handle it?

@dom4ha What is the consequence of PodGroup-level shapshotting, in particular what changes do we need to make to the DRAManager to handle it?

It's mainly important to support WAS itself, as the PodGroup scheduling algorithm assumes the cluster state does not change during (what we call) the workload scheduling cycle, so we do snapshot the cluster state at the beginning of the cycle and perform scheduling simulation of all PodGroups' pods. After each pod scheduling sub-cycle, the pod is assumed before the next sub-cycle starts. Later the pods may be reverted and scheduled again, but everything needs to happen within a consistent cluster state.

Preemption (WAP) is a part of this cycle and (as @Argh4k said) reruns the scheduling algorithm after removal all eligible victim pods. Performing it under a consistent cluster state is essential for correctness. We also need to store cycle states of sub-cycles to perform simulations like PreFilterExtensions.AddPod/RemovePod.

So we should think how the DRA state could be snapshotted. It could either be migrated to the framework or snapshotted inside the plugin. But to achieve that, the framework would have to define new extension points defining the workload scheduling cycles boundaries.

Initially I was thinking that DRA plugin could just implement PreFilterExtensions.AddPod/RemovePod and support preemption for a single pod, but the issue with resource nomination seems to be a bigger blocker.

k8s-ci-robot · 2026-05-30T23:27:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent, pohly
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-scheduling/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

troychiu · 2026-06-01T19:10:53Z

+* Support preemption of workloads utilizing PodGroup-level ResourceClaims.
+* Support preemption of workloads using multi-node or network-attached devices.
+
+## Proposal


I am curious about how deallocation would work. If I understand correctly, the current ResourceClaim controller asynchronously checks if the resource claim is free for deallocation. This means a pod triggering eviction may not be able to be scheduled immediately, as it has to wait for the ResourceClaim controller to complete.

That's an interesting point. While deallocation is generally handled by the RC controller, I don't know of any real reason the scheduler couldn't do it in this case. What about shared claims that are node-local? We can't deallocate those unless all the pods are being preempted. @pohly

There's no reason why the scheduler shouldn't update the ResourceClaim status to remove an unused allocation. It has to be prepared for that not succeeding when something adds to ReservedFor in the meantime (shouldn't happen, but the API allows it), but that's business as usual (any API call can fail).

Why would the scheduler do this? Will the pod then still go through the same flow of unschedulable + informer events? Beware that directly after an Update call, the informer cache used for the next scheduling attempt may still have the status from before the Update, unless the scheduler waits for the informer event or we do more tricks with the assume cache.

It may not be necessary. I am not really sure how pre-emption works internally in the scheduler. Once it finds viable preemption candidates, what does it do? Does it actually schedule the new Pod and evict the other Pods? Or does it set nominated node? I think the point is - will waiting for the RC controller to do the deallocation result in a race condition?

My understanding is that the scheduler evicts victim pods and then reschedules the high-priority pod once the victim pods terminate. This introduces a potential race condition where the high-priority pod might be scheduled either before or after the RC controller performs the deallocation.

Good question. So I think there is a potential problem here, but not for the basic scenario where maybe the resourceclaim controller is a bit slow.

So what happens is that once a high priority preemptor pod fails to get schedule, the scheduler will trigger the preemption logic. That can find a lower priority victim pod that, if removed, would allow the preemptor pod to be scheduled. Once the scheduler has found a victim pod, it will terminate it, set the nominatedNode field on the preemptor pod and then put the preemptor pod in the unschedulable queue. It will only retry scheduling of the preemptor pod when the scheduler gets one of the queuing hints that are considered relevant for the dynamicresources plugin, since that is the plugin that blocked the scheduling of the preemptor pod in the first place. The hint we are waiting for here, is an update on the resourceclaim referenced by the pod. So essentially rescheduling will not be attempted until there is a signal that rescheduling is likely to succeed and deletion/termination of a pod is not one of those hints.

However, there are several hints that can trigger rescheduling, including updates the DeviceClass, ResourceSlice and more and also after a timeout (I think it is 5 minutes). If this happens, and the resourceclaim controller has not yet deallocated the claim, the pod will still not be schedulable, and will trigger another preemption attempt. I think the consequence here is that if the resourceclaim controller fails to deallocate claims, it can trigger a cascade of preemptions of lower priority pods.

I think maybe doing the deallocation through the scheduler might address this, but I'm not yet sure of the consequences.

I think we need input from @kubernetes/sig-scheduling-approvers as to the right approach here.

As @mortent said, the preemptor pod will wait in the unschedulable queue until the cluster events wakes it up. So it may be victim pod deletion or ResourceClaim deallocation. The mechanism that prevents scheduler from making further preemptions is base on NNN and it's a heuristic, so unfortunately the concern is valid.

The mechanism is implemented under PodEligibleToPreemptOthers https://github.com/kubernetes/kubernetes/blob/d88c48520ecd5a3bd9d0c9e90a48bcfca6f2ae11/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L364-L365

If the preemptor is UnschedulableAndUnresolvable on the node, it means there is no point waiting for preemption process to finish and new preemption will be triggered

If the preemptor is just Unschedulable and there are still pods terminating with "Preemption" in reason, the preemption waits.

So despite all pods terminated, the preemptor may remain Unschedulable (due to ResourceClaim not being deallocated) and may trigger yet another preemption as there are no pods on the node in terminated state.

Deallocation as a part of the preemption process most likely won't help either as some pods needs to honor the grace period, so we cannot deallocate claims earlier.

I think we may need to extend the NNN mechanism to not only nominate a candidate Node, but also "nominate" DRA devices. It would prevent other ResourceClaim from stealing the devices for which the preemptor pod (ResourceClaim) is waiting for. The DRA plugin would have to internally maintain a map from devices to nominator ResourceClaims, similar to the nominator mechanism in the scheduler cache. Based on that, the check in point 2 would be extended to wait until all nominated devices are released.

Right, so the basic solution would work for simple use-cases for DRA, but I don't think it will work for more complicated features like Partitionable Devices and Consumable Capacity. In both cases, it is not as simple as a device being deallocated from one claim and allocated to another, but rather that some capacity are released from one claim which will allow another claim to be allocated. So we want to prevent other claims from taking the capacity (rather than just the device) and we are not necessarily waiting for the device we need to be deallocated, but rather for one or more other devices to be deallocated so the capacity is freed up.

Could we solve this by nominating devices, but rather than checking from them to be released, we set the criteria as the claims referenced by the preempted pod has been deallocated. Although, we have to be careful with this, since a pod might be using a shared claim that won't be deallocated, but also doesn't need to be in order for the preemptor pod to schedule.

And for the implementation, are you suggesting that implementing this will require changes to the scheduling framework outside of the dynamicresources plugin (i.e. changes to the PodEligibleToPreemptOthers function) which would make the framework aware of DRA? Or are you thinking that this is something that we can implement in the dynamicresources plugin, possible as part of the queuingHint functions?

lmktfy · 2026-06-03T20:38:57Z

/retitle KEP-5690: Add proposal (DRA preemption)

mortent · 2026-06-09T15:19:10Z

@wojtek-t Do you have any concerns about the PRR section of this KEP?

dom4ha

Simulating the pod add and removal in DRA plugin should be fairly easy to achieve, but preventing secondary preemptions seems to be a bigger problem which may require much more in depth design.

dom4ha · 2026-06-11T23:37:01Z

+* Support preemption of workloads utilizing PodGroup-level ResourceClaims.
+* Support preemption of workloads using multi-node or network-attached devices.
+
+## Proposal


As @mortent said, the preemptor pod will wait in the unschedulable queue until the cluster events wakes it up. So it may be victim pod deletion or ResourceClaim deallocation. The mechanism that prevents scheduler from making further preemptions is base on NNN and it's a heuristic, so unfortunately the concern is valid.

The mechanism is implemented under PodEligibleToPreemptOthers https://github.com/kubernetes/kubernetes/blob/d88c48520ecd5a3bd9d0c9e90a48bcfca6f2ae11/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L364-L365

If the preemptor is UnschedulableAndUnresolvable on the node, it means there is no point waiting for preemption process to finish and new preemption will be triggered

If the preemptor is just Unschedulable and there are still pods terminating with "Preemption" in reason, the preemption waits.

So despite all pods terminated, the preemptor may remain Unschedulable (due to ResourceClaim not being deallocated) and may trigger yet another preemption as there are no pods on the node in terminated state.

Deallocation as a part of the preemption process most likely won't help either as some pods needs to honor the grace period, so we cannot deallocate claims earlier.

I think we may need to extend the NNN mechanism to not only nominate a candidate Node, but also "nominate" DRA devices. It would prevent other ResourceClaim from stealing the devices for which the preemptor pod (ResourceClaim) is waiting for. The DRA plugin would have to internally maintain a map from devices to nominator ResourceClaims, similar to the nominator mechanism in the scheduler cache. Based on that, the check in point 2 would be extended to wait until all nominated devices are released.

macsko · 2026-06-12T13:02:15Z

+* **ResourceClaims that span multiple nodes**: The default preemption algorithm only considers preempting
+  pods on a single node, meaning that it will not be able to free up devices that are allocated to claims
+  referenced by pods spanning multiple nodes.
+* **ResourceClaims for PodGroups**: These claims are allocated to a PodGroup, meaning that all pods in the
+  group must be scheduled together. The default preemption algorithm does not consider the PodGroup at all,
+  so we will not allow preemption of pods that reference such claims.


For workload-aware preemption, AddPod/RemovePod will also be used, so this KEP cannot say that the "default preemption algorithm does not consider the PodGroup at all" or "default preemption algorithm only considers preempting pods on a single node", because for workload-aware preemption the default preemption algorithm may consider such cases. It would be good to explicitly emphasize that preemption support for such RCs will be restricted on the DRA plugin level.

Yeah, I've updated the language here.

wojtek-t · 2026-06-15T11:01:46Z

@wojtek-t Do you have any concerns about the PRR section of this KEP?

We need to figure out the design first, and I believe that both:

Add/Remove Pod: KEP-5690: Add proposal (DRA preemption) #6113 (comment)
avoiding unnnecessary preemptions: KEP-5690: Add proposal (DRA preemption) #6113 (comment)

should first be addressed in the KEP before we even get to PRR review

mortent · 2026-06-16T15:07:42Z

We have made the decision to postpone this until 1.38.

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label May 25, 2026

k8s-ci-robot requested review from dom4ha and macsko May 25, 2026 22:59

k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 25, 2026

github-project-automation Bot added this to SIG Scheduling May 25, 2026

github-project-automation Bot moved this to Needs Triage in SIG Scheduling May 25, 2026

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 25, 2026

Initial KEP for 5690 DRA Preemption

1b41d4b

mortent force-pushed the Create5690ForAlpha branch from e83f1d4 to 1b41d4b Compare May 25, 2026 23:02

k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label May 26, 2026

github-project-automation Bot added this to Dynamic Resource Allocation May 26, 2026

github-project-automation Bot moved this to 🆕 New in Dynamic Resource Allocation May 26, 2026

pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 27, 2026

pohly approved these changes May 27, 2026

View reviewed changes

Comment thread keps/sig-scheduling/5690-dra-preemption/README.md Outdated

github-project-automation Bot moved this from Needs Triage to Needs Final Approver in SIG Scheduling May 27, 2026

johnbelamaric reviewed May 27, 2026

View reviewed changes

wojtek-t reviewed May 28, 2026

View reviewed changes

k8s-ci-robot requested a review from Argh4k May 28, 2026 07:19

wojtek-t self-assigned this May 28, 2026

Addressed comments and updated KEP for direct to beta

fbc0c8a

mortent requested review from johnbelamaric and wojtek-t May 30, 2026 23:30

troychiu reviewed Jun 1, 2026

View reviewed changes

johnbelamaric mentioned this pull request Jun 1, 2026

DRA: preemption #5690

Open

4 tasks

k8s-ci-robot changed the title ~~Initial KEP for 5690 DRA Preemption~~ KEP-5690: Add proposal (DRA preemption) Jun 3, 2026

dom4ha mentioned this pull request Jun 11, 2026

KEP-5517: DRA Node Allocatable Resources alpha2 changes #6082

Merged

dom4ha reviewed Jun 12, 2026

View reviewed changes

macsko reviewed Jun 12, 2026

View reviewed changes

Addressed additional comments

f23b2ce

Conversation

mortent commented May 25, 2026

Uh oh!

mortent commented May 26, 2026

Uh oh!

k8s-ci-robot commented May 26, 2026

Uh oh!

mortent commented May 26, 2026

Uh oh!

pohly left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented May 30, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lmktfy commented Jun 3, 2026

Uh oh!

mortent commented Jun 9, 2026

Uh oh!

dom4ha left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment