KEP-5690: Add proposal (DRA preemption)#6113
Conversation
e83f1d4 to
1b41d4b
Compare
|
/wg device-managment |
|
@mortent: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/wg device-management |
pohly
left a comment
There was a problem hiding this comment.
KEP looks reasonable to me. The purpose is clear and it describes well what can be done.
The devil will be in implementing this correctly - lot's of details!
| * **ResourceClaims that span multiple nodes**: The default preemption algorithm only considers preempting | ||
| pods on a single node, meaning that it will not be able to free up devices that are allocated to claims | ||
| referenced by pods spanning multiple nodes. | ||
| * **ResourceClaims for PodGroups**: These claims are allocated to a PodGroup, meaning that all pods in the |
There was a problem hiding this comment.
@wojtek-t mentioned in #6113 (comment) that #5710 is keeping this in mind so I'm not too worried about being backed into a corner by the time we're ready to integrate these features. I'll keep tabs on the WAS pieces.
| ###### How can an operator determine if the feature is in use by workloads? | ||
|
|
||
| It is enabled on a cluster-level, so if it is enabled, it is in use. It will only impact | ||
| pods that are referencing ResourceClaims. |
There was a problem hiding this comment.
Is there a metric to show it is happening? Is it worth decorating any existing metric?
There was a problem hiding this comment.
There are two metrics that are particularly relevant for this, scheduler_preemption_attempts_total and scheduler_preemption_victims. But they don't currently don't have any labels and it is not clear to me if adding information about the pods as labels on metrics is a good solution. At the very least, it doesn't seem scalable.
|
|
||
| # The milestone at which this feature was, or is targeted to be, at each stage. | ||
| milestone: | ||
| alpha: "v1.37" |
There was a problem hiding this comment.
Since there is no API, in theory we can go straight to default-on beta. I think this could be possible if:
- We can show that this code ONLY kicks in when a pod has RCs.
- We have really good tests.
- Of course the FG can be turned off
There was a problem hiding this comment.
I think all of those are doable. So I have updated the KEP to target direct to beta. I assume that in the end it will come down to whether we feel confident enough in the implementation to go through with it.
| * **Device Binding Conditions**: Pods might be waiting for a resource binding condition, so we want to make | ||
| sure this is handled correctly during preemption simulations. | ||
|
|
||
| Features/scenarios that we will not support: |
There was a problem hiding this comment.
FWIW - we already do have PodGroup-oriented preemption implemented here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L455
invoked from:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one_podgroup.go#L219
We want to graduate workload-aware preemption to Beta in 1.37:
#6106
so there are places which DRA-preemption should integrate with.
That being said, I think it makes sense to focus on single-node devices first (given that we will be making non-trivial changes in WAP in 1.37 to avoid conflicts). And expand to multinode devices as a potential followup.
There was a problem hiding this comment.
I think that support for WAP and single/podgroup devices are actually two different things.
In the end in WAP we will do the following:
- remove all potential victims via NodeInfo.RemovePod
- run PodGroupSchedulingCycle (creates fresh cycle state based on nodes with victims removed)
- for each victim try to reprieve it via Reprieve method
- if Reprieve returns true, add back the node via NodeInfo.AddPod
- at the end preempt all victims that failed Reprieve
From DRA perspective, IIUC, if we remove pods from the "scheduler" view of the cluster via NodeInfo.RemovePod, it won't be actually reflected by the DRAManager as manager gathers resource claims allocated state based on its informer cache.
That means that in WAP model from beta proposal, even when we remove pods with resource claims that could unblock the preemptor, preemptor will not be able to use them.
We would need a way to create an allocatedState that also includes pods removals. Maybe we could use for that podGroupState (store in it information about all removals and put that on top of allocatedState?).
Another option would be to actually keep the cycle state of pods from pod group and pass it to WAP. Then during victim removal we would run PreFilterExtensions.RemovePod for each pod from pod group and later reuse the cycle state in PodGroupSchedulingCycle. This would require a major (?) refactoring of the schedule_one_podgroup as everything there is tightly coupled.
Nevertheless, I think that for this proposal we can aim at support single pod cycle only as I do not think that we will have the capacity to implement: this one, WAP beta update and marry them together.
There was a problem hiding this comment.
So, we reached an agreement in WAP KEP () that we want to rerun the whole pod group scheduling algorithm within WAP. So the option of keeping the cycle state is out. This means that we will do the following:
- remove all potential victims via NodeInfo.RemovePod
- run PodGroupSchedulingCycle (creates fresh cycle state based on nodes with victims removed)
- for each victim try to reprieve it. Reprieving will be done by running Filter plugins on all preemptor pods on nodes where they are supposed to run, with the victim added back onto its original node
- if Reprieve returns true, add back the node via NodeInfo.AddPod
So from this KEP, my main question is whether the PreFilter method of DRA is implemented in such a way that it detects that some pods holding resource claims are actually not in the NodeInfos? And if yes, whether PreFilterExtension for DRA plugin will work on top of that?
There was a problem hiding this comment.
@Argh4k The dynamicresources plugin does not check the NodeInfo slice in the PreFilter implementation, so it will not discover if any pods were removed. It fetches the state from the DRAManager, which is a snapshot of the live state of the cluster. Is this a problem? It might be possible to fix this by reconciling the state from the DRAManager with the nodes available from NodeInfo, but I'm not yet sure if we have the necessary information to do this without larger changes.
In my WIP implementation of this KEP, I store some of the information from the DRAManager in the stateData struct so it can be updated as AddPod and RemovePod are called during the preemption simulations: kubernetes/kubernetes#139266. I think that aligns with what you are suggesting here, where the dynamicresources plugin needs to handle that some pods are removed during simulation of the pod group scheduling. Am I understanding this correctly?
There was a problem hiding this comment.
Given that the primary bottlenecks is review bandwidth, I would like to avoid doing things twice and start already with something that will work for WAP (unless you can show that making it work for WAP will be pure extension - but I don't see that now).
There was a problem hiding this comment.
@wojtek-t To support WAP, DRA plugin would have to also support PodGroup-level snapshotting, so it may be a bigger change.
Currently the plugin snapshots DRA state (DRAManager) only for a single pod scheduling cycle (not PodGroup scheduling cycle), so the state is constantly updating after each pod sub-cycle.
The problem results from the fact that DRA has its own state even thought the plugins should be stateless, so there is no (yet) an extension point that would allow the plugin to follow PodGroup scheduling cycle boundaries (when it starts and ends) and assure scheduling consistency, required to assure WAP and PodGroups algorithm correctness.
@mortent What @Argh4k is saying is that the framework will not call PreFilterExtensions.RemovePod, but NodeInfo.RemovePod on PodGroup level snapshot, but it will call PreFilterExtensions.AddPod on reprieve, so in fact there DRA plugin does not need to implement PreFilterExtensions.RemovePod at all, but it should be able to rebuild its state from the PodGroup level snapshot.
There was a problem hiding this comment.
So I think this means that the dynamicresources plugin must be able to look at the information in the DRAManager and NodeInfos and determine the removed Pods and the impact that has on the available devices and their capacity. I will take a look at the code today and try to come up with a proposal for how we can implement this.
There was a problem hiding this comment.
So I looked at the code and I definitely think it is doable for the dynamicresources plugin to rebuild the state in PreFilter based on information in NodeInfos:
- Capture the UIDs of all remaining pods (i.e. not removed by the
NodeInfo.RemovePodcalls) by iterating over the providednodeInfosand the referenced pods. - Iterate over all the claims from
DRAManager.ResourceClaimsand determine which should be simulated as released by since none of the pods referenced in their.status.reservedForlist remain in the simulation.
This should let us track the state in a similar way to what I have in my POC for the pod-by-pod scheduling cycle.
@Argh4k Does the new WAP implementation replace the current one, meaning we no longer need to implement PreFilterExtensions.RemovePod? Does this impact other parts of the preemption lifecycle, meaning does it change when/if the cyclestate is cloned?
@dom4ha What is the consequence of PodGroup-level shapshotting, in particular what changes do we need to make to the DRAManager to handle it?
There was a problem hiding this comment.
@dom4ha What is the consequence of PodGroup-level shapshotting, in particular what changes do we need to make to the DRAManager to handle it?
It's mainly important to support WAS itself, as the PodGroup scheduling algorithm assumes the cluster state does not change during (what we call) the workload scheduling cycle, so we do snapshot the cluster state at the beginning of the cycle and perform scheduling simulation of all PodGroups' pods. After each pod scheduling sub-cycle, the pod is assumed before the next sub-cycle starts. Later the pods may be reverted and scheduled again, but everything needs to happen within a consistent cluster state.
Preemption (WAP) is a part of this cycle and (as @Argh4k said) reruns the scheduling algorithm after removal all eligible victim pods. Performing it under a consistent cluster state is essential for correctness. We also need to store cycle states of sub-cycles to perform simulations like PreFilterExtensions.AddPod/RemovePod.
So we should think how the DRA state could be snapshotted. It could either be migrated to the framework or snapshotted inside the plugin. But to achieve that, the framework would have to define new extension points defining the workload scheduling cycles boundaries.
Initially I was thinking that DRA plugin could just implement PreFilterExtensions.AddPod/RemovePod and support preemption for a single pod, but the issue with resource nomination seems to be a bigger blocker.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mortent, pohly The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| * Support preemption of workloads utilizing PodGroup-level ResourceClaims. | ||
| * Support preemption of workloads using multi-node or network-attached devices. | ||
|
|
||
| ## Proposal |
There was a problem hiding this comment.
I am curious about how deallocation would work. If I understand correctly, the current ResourceClaim controller asynchronously checks if the resource claim is free for deallocation. This means a pod triggering eviction may not be able to be scheduled immediately, as it has to wait for the ResourceClaim controller to complete.
There was a problem hiding this comment.
That's an interesting point. While deallocation is generally handled by the RC controller, I don't know of any real reason the scheduler couldn't do it in this case. What about shared claims that are node-local? We can't deallocate those unless all the pods are being preempted. @pohly
There was a problem hiding this comment.
There's no reason why the scheduler shouldn't update the ResourceClaim status to remove an unused allocation. It has to be prepared for that not succeeding when something adds to ReservedFor in the meantime (shouldn't happen, but the API allows it), but that's business as usual (any API call can fail).
Why would the scheduler do this? Will the pod then still go through the same flow of unschedulable + informer events? Beware that directly after an Update call, the informer cache used for the next scheduling attempt may still have the status from before the Update, unless the scheduler waits for the informer event or we do more tricks with the assume cache.
There was a problem hiding this comment.
It may not be necessary. I am not really sure how pre-emption works internally in the scheduler. Once it finds viable preemption candidates, what does it do? Does it actually schedule the new Pod and evict the other Pods? Or does it set nominated node? I think the point is - will waiting for the RC controller to do the deallocation result in a race condition?
There was a problem hiding this comment.
My understanding is that the scheduler evicts victim pods and then reschedules the high-priority pod once the victim pods terminate. This introduces a potential race condition where the high-priority pod might be scheduled either before or after the RC controller performs the deallocation.
There was a problem hiding this comment.
Good question. So I think there is a potential problem here, but not for the basic scenario where maybe the resourceclaim controller is a bit slow.
So what happens is that once a high priority preemptor pod fails to get schedule, the scheduler will trigger the preemption logic. That can find a lower priority victim pod that, if removed, would allow the preemptor pod to be scheduled. Once the scheduler has found a victim pod, it will terminate it, set the nominatedNode field on the preemptor pod and then put the preemptor pod in the unschedulable queue. It will only retry scheduling of the preemptor pod when the scheduler gets one of the queuing hints that are considered relevant for the dynamicresources plugin, since that is the plugin that blocked the scheduling of the preemptor pod in the first place. The hint we are waiting for here, is an update on the resourceclaim referenced by the pod. So essentially rescheduling will not be attempted until there is a signal that rescheduling is likely to succeed and deletion/termination of a pod is not one of those hints.
However, there are several hints that can trigger rescheduling, including updates the DeviceClass, ResourceSlice and more and also after a timeout (I think it is 5 minutes). If this happens, and the resourceclaim controller has not yet deallocated the claim, the pod will still not be schedulable, and will trigger another preemption attempt. I think the consequence here is that if the resourceclaim controller fails to deallocate claims, it can trigger a cascade of preemptions of lower priority pods.
I think maybe doing the deallocation through the scheduler might address this, but I'm not yet sure of the consequences.
There was a problem hiding this comment.
I think we need input from @kubernetes/sig-scheduling-approvers as to the right approach here.
There was a problem hiding this comment.
As @mortent said, the preemptor pod will wait in the unschedulable queue until the cluster events wakes it up. So it may be victim pod deletion or ResourceClaim deallocation. The mechanism that prevents scheduler from making further preemptions is base on NNN and it's a heuristic, so unfortunately the concern is valid.
The mechanism is implemented under PodEligibleToPreemptOthers https://github.com/kubernetes/kubernetes/blob/d88c48520ecd5a3bd9d0c9e90a48bcfca6f2ae11/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L364-L365
- If the preemptor is
UnschedulableAndUnresolvableon the node, it means there is no point waiting for preemption process to finish and new preemption will be triggered - If the preemptor is just
Unschedulableand there are still pods terminating with "Preemption" in reason, the preemption waits.
So despite all pods terminated, the preemptor may remain Unschedulable (due to ResourceClaim not being deallocated) and may trigger yet another preemption as there are no pods on the node in terminated state.
Deallocation as a part of the preemption process most likely won't help either as some pods needs to honor the grace period, so we cannot deallocate claims earlier.
I think we may need to extend the NNN mechanism to not only nominate a candidate Node, but also "nominate" DRA devices. It would prevent other ResourceClaim from stealing the devices for which the preemptor pod (ResourceClaim) is waiting for. The DRA plugin would have to internally maintain a map from devices to nominator ResourceClaims, similar to the nominator mechanism in the scheduler cache. Based on that, the check in point 2 would be extended to wait until all nominated devices are released.
There was a problem hiding this comment.
Right, so the basic solution would work for simple use-cases for DRA, but I don't think it will work for more complicated features like Partitionable Devices and Consumable Capacity. In both cases, it is not as simple as a device being deallocated from one claim and allocated to another, but rather that some capacity are released from one claim which will allow another claim to be allocated. So we want to prevent other claims from taking the capacity (rather than just the device) and we are not necessarily waiting for the device we need to be deallocated, but rather for one or more other devices to be deallocated so the capacity is freed up.
Could we solve this by nominating devices, but rather than checking from them to be released, we set the criteria as the claims referenced by the preempted pod has been deallocated. Although, we have to be careful with this, since a pod might be using a shared claim that won't be deallocated, but also doesn't need to be in order for the preemptor pod to schedule.
And for the implementation, are you suggesting that implementing this will require changes to the scheduling framework outside of the dynamicresources plugin (i.e. changes to the PodEligibleToPreemptOthers function) which would make the framework aware of DRA? Or are you thinking that this is something that we can implement in the dynamicresources plugin, possible as part of the queuingHint functions?
|
/retitle KEP-5690: Add proposal (DRA preemption) |
|
@wojtek-t Do you have any concerns about the PRR section of this KEP? |
dom4ha
left a comment
There was a problem hiding this comment.
Simulating the pod add and removal in DRA plugin should be fairly easy to achieve, but preventing secondary preemptions seems to be a bigger problem which may require much more in depth design.
| * Support preemption of workloads utilizing PodGroup-level ResourceClaims. | ||
| * Support preemption of workloads using multi-node or network-attached devices. | ||
|
|
||
| ## Proposal |
There was a problem hiding this comment.
As @mortent said, the preemptor pod will wait in the unschedulable queue until the cluster events wakes it up. So it may be victim pod deletion or ResourceClaim deallocation. The mechanism that prevents scheduler from making further preemptions is base on NNN and it's a heuristic, so unfortunately the concern is valid.
The mechanism is implemented under PodEligibleToPreemptOthers https://github.com/kubernetes/kubernetes/blob/d88c48520ecd5a3bd9d0c9e90a48bcfca6f2ae11/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L364-L365
- If the preemptor is
UnschedulableAndUnresolvableon the node, it means there is no point waiting for preemption process to finish and new preemption will be triggered - If the preemptor is just
Unschedulableand there are still pods terminating with "Preemption" in reason, the preemption waits.
So despite all pods terminated, the preemptor may remain Unschedulable (due to ResourceClaim not being deallocated) and may trigger yet another preemption as there are no pods on the node in terminated state.
Deallocation as a part of the preemption process most likely won't help either as some pods needs to honor the grace period, so we cannot deallocate claims earlier.
I think we may need to extend the NNN mechanism to not only nominate a candidate Node, but also "nominate" DRA devices. It would prevent other ResourceClaim from stealing the devices for which the preemptor pod (ResourceClaim) is waiting for. The DRA plugin would have to internally maintain a map from devices to nominator ResourceClaims, similar to the nominator mechanism in the scheduler cache. Based on that, the check in point 2 would be extended to wait until all nominated devices are released.
| * **ResourceClaims that span multiple nodes**: The default preemption algorithm only considers preempting | ||
| pods on a single node, meaning that it will not be able to free up devices that are allocated to claims | ||
| referenced by pods spanning multiple nodes. | ||
| * **ResourceClaims for PodGroups**: These claims are allocated to a PodGroup, meaning that all pods in the | ||
| group must be scheduled together. The default preemption algorithm does not consider the PodGroup at all, | ||
| so we will not allow preemption of pods that reference such claims. |
There was a problem hiding this comment.
For workload-aware preemption, AddPod/RemovePod will also be used, so this KEP cannot say that the "default preemption algorithm does not consider the PodGroup at all" or "default preemption algorithm only considers preempting pods on a single node", because for workload-aware preemption the default preemption algorithm may consider such cases. It would be good to explicitly emphasize that preemption support for such RCs will be restricted on the DRA plugin level.
There was a problem hiding this comment.
Yeah, I've updated the language here.
We need to figure out the design first, and I believe that both:
should first be addressed in the KEP before we even get to PRR review |
|
We have made the decision to postpone this until 1.38. |
One-line PR description: Preemption for DRA
Issue link: DRA: preemption #5690
Other comments: