KEP - Add a Filter plugin to ensure that non-GPU pods are not scheduled on GPU nodes #812


Open · wants to merge 1 commit into master

Conversation


@Zeel-Patel commented Oct 23, 2024


This plugin is widely used at Uber, and we thought it could be useful for the open source community.

KEP doc, as requested in PR #788.

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the do-not-merge/release-note-label-needed, cncf-cla: yes, and needs-ok-to-test labels on Oct 23, 2024
@k8s-ci-robot
Contributor

Hi @Zeel-Patel. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Zeel-Patel
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label on Oct 23, 2024

netlify bot commented Oct 23, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

Latest commit: bed72f0
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/6718c1d59c391d000856303a

@ffromani
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Oct 25, 2024
@k8s-ci-robot
Contributor

@Zeel-Patel: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-scheduler-plugins-verify · Commit: bed72f0 · Required: true · Rerun command: /test pull-scheduler-plugins-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Comment on lines +401 to +402
The drawback in this approach is that, either pod creator(client) needs to add tolerations or it needs to be added by
kubernetes cluster manager through mutating admission controller without client's knowledge.
Contributor

OK, but this also applies to a filter plugin, doesn't it? Arguably, any non-core, non-default setting needs to be added by the Kubernetes cluster manager and can be unknown, for whatever reason, to the pod submitter(s).


In the filter plugin-based approach, pod submitters can focus solely on defining their resource requirements (e.g., nvidia.com/gpu) without needing to know about specific node taints or tolerations in the pod spec. This allows the plugin to handle those details automatically, reducing the configuration burden on the pod creator.
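For illustration, this is roughly all the pod submitter would have to declare under the proposed approach; a minimal sketch in Go using the Kubernetes API types, with hypothetical pod/container names and image:

```go
// Sketch of the submitter-facing side of the proposed approach: only the GPU
// resource request is declared; no tolerations, node affinity, or scheduler
// name are needed. All names here are hypothetical placeholders.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "training-job"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "trainer",
				Image: "example.com/trainer:latest",
				Resources: v1.ResourceRequirements{
					Limits: v1.ResourceList{
						// Extended resources such as GPUs are requested via limits.
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	fmt.Println(pod.Name, pod.Spec.Containers[0].Resources.Limits)
}
```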

Contributor

This is true, but only if the filter plugin is part of the default scheduler profile. AFAICT that won't be the case until the plugin is merged into the main k8s codebase and accepted as part of the default scheduler profile, and the latter especially will not happen quickly (as the standard process mandates).

At the very least, then, for the time being the pod submitter would need to know which schedulerName to use.

Contributor

It is debatable whether there is unanimous desire for preventing scheduling of Pods not using GPU resources to nodes that have GPUs. Sometimes, you might want to do that, but sometimes, you might want to save a buck and have some lower priority Pods land in such nodes also, when resources permit. I would be a little bit surprised if such a filter plugin became a part of the default scheduler profile rapidly.

In typical CSP Kubernetes environments one doesn't get to change the scheduler settings. Any solution based on scheduler plugins will for a long time (if not forever) have to be using a mutating webhook which injects the special scheduler name plus of course you need an extra pod for running the special scheduler. You can't get away from using the mutating webhook, unless you accept modifying all workloads manually with the scheduler name. No difference there if you choose to go for scheduler plugins or if you choose to do labels and node affinity, so the least hassle would be to deploy node feature discovery and a mutating webhook without a need for scheduler changes.

In terms of resource usage, I bet the mutating webhook + NFD is a smaller burden than a mutating webhook + extra scheduler Pod. NFD isn't doing complex scheduling, it is pretty simple.
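To make the comparison concrete, here is a rough sketch of the node affinity such a mutating webhook might inject into pods that request no GPUs. Only the patch-construction logic is shown, not the webhook plumbing, and the NFD-style label key is an assumption that varies by vendor and setup:

```go
// Rough sketch of the node-affinity term a mutating webhook could inject into
// pods that request no GPUs, so that they avoid GPU nodes labeled (e.g. by NFD).
// The label key "feature.example.com/gpu" is a hypothetical placeholder.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// podRequestsGPU reports whether any container asks for the GPU extended resource.
func podRequestsGPU(pod *v1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			return true
		}
		if _, ok := c.Resources.Requests["nvidia.com/gpu"]; ok {
			return true
		}
	}
	return false
}

// gpuAvoidanceAffinity is what the webhook would add to non-GPU pods.
func gpuAvoidanceAffinity() *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "feature.example.com/gpu", // hypothetical NFD-style label
						Operator: v1.NodeSelectorOpDoesNotExist,
					}},
				}},
			},
		},
	}
}

func main() {
	pod := &v1.Pod{} // a pod with no GPU request
	if !podRequestsGPU(pod) {
		pod.Spec.Affinity = gpuAvoidanceAffinity()
	}
	fmt.Printf("%+v\n", pod.Spec.Affinity)
}
```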


The north-star goal is to make this plugin available as part of the default scheduler profile in the kube-scheduler binary. I would like to understand the reservations about that approach. We are fine with the time it will take, and we would like to follow the full process.

Until this plugin is part of the kube-scheduler binary, an alternate (temporary) approach is to make it available in this repo; people who are interested in using it can rebuild the kube-scheduler binary by placing the plugin code in the https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins directory. We are providing this choice to users who want to use the plugin; of course, it will work only after it is added to the scheduler profile configuration.
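For reference, the usual out-of-tree alternative to patching the in-tree plugins directory is to build a scheduler binary that registers the plugin through the kube-scheduler command options, roughly as sketched below; the plugin package and its Name/New symbols are hypothetical placeholders for this plugin:

```go
// Sketch of an out-of-tree scheduler binary that registers a custom plugin,
// in the style used by kubernetes-sigs/scheduler-plugins. The plugin package
// "nongpupods" and its Name/New symbols are hypothetical placeholders.
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"example.com/scheduler-plugins/pkg/nongpupods" // hypothetical plugin package
)

func main() {
	// app.WithPlugin registers the plugin factory under its name so it can be
	// enabled in a KubeSchedulerConfiguration profile.
	cmd := app.NewSchedulerCommand(
		app.WithPlugin(nongpupods.Name, nongpupods.New),
	)
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```

Either way, the plugin still has to be enabled in a scheduler profile via KubeSchedulerConfiguration.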

Contributor

Until this plugin is part of the kube-scheduler binary, an alternate (temporary) approach is to make it available in this repo; people who are interested in using it can rebuild the kube-scheduler binary by placing the plugin code in the https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins directory.

Understood. Arguably, requiring the cluster admin to patch, rebuild, and maintain their scheduler binary is a burden that needs to be taken into account when we evaluate the pros and cons of this approach vs. taints and tolerations.


It is debatable whether there is unanimous desire for preventing scheduling of Pods not using GPU resources to nodes that have GPUs. Sometimes, you might want to do that, but sometimes, you might want to save a buck and have some lower priority Pods land in such nodes also, when resources permit. I would be a little bit surprised if such a filter plugin became a part of the default scheduler profile rapidly.

Allowing non-GPU workloads on GPU nodes can be practical in on-prem environments with underutilized resources. However, in cloud settings, aggressively downscaling GPU nodes and spawning non-GPU nodes for general workloads is often more cost-effective, as GPU resources are typically pricier. For those who don't need to exclude standard pods from GPU nodes to save costs, this plugin remains optional—they can choose not to use it if they prefer a more flexible GPU node utilization approach.

Contributor

@ffromani Nov 4, 2024

Allowing non-GPU workloads on GPU nodes can be practical in on-prem environments with underutilized resources. However, in cloud settings, aggressively downscaling GPU nodes and spawning non-GPU nodes for general workloads is often more cost-effective, as GPU resources are typically pricier. For those who don't need to exclude standard pods from GPU nodes to save costs, this plugin remains optional—they can choose not to use it if they prefer a more flexible GPU node utilization approach.

This makes sense, but it is not clear to me how this would be the case if we make this plugin part of the default profile (which, IIUC, is the goal). OTOH, using it in a secondary, non-default profile would be straightforward, but that would again require changes to the pod spec.

@kumariitr Nov 4, 2024

Understood. Arguably, requiring the cluster admin to patch, rebuild, and maintain their scheduler binary is a burden that needs to be taken into account when we evaluate the pros and cons of this approach vs. taints and tolerations.

At least at Uber, we prefer to put such burden / extra responsibilities on the cluster-admin team and not on the customers / users of the platform. There may be more people / companies who would like to take this approach. Taints / tolerations add burden on the cluster-admin team and the customer teams, and also require coordination when hardware / nodes are managed by one team (cluster-admin) while workloads / pods are managed by multiple different teams. This coordination becomes complex in a multi-team setup in a large organisation.
So, adding a new approach based on a scheduler plugin helps many companies / people in the industry. The choice will still be theirs, whether to take this approach or the taints-and-tolerations approach.

Contributor

I still need to review the latest updates, but based on the comments I have read, I still don't see great benefits over the existing taints and tolerations mechanism. However, provided we explore alternatives and acknowledge them in the KEP, I'm not against adding the plugin to this repo. This is (IIUC :) ), after all, the place where experimentation is expected to happen.

I agree the benefits would really surface, as written in another comment, only when this plugin is both merged upstream AND part of the default profile. Both steps will require large buy-in.

OTOH, a prototype will help the conversation, so that is another reason not to oppose the merge.
Will review again and comment more deeply ASAP.

The drawback in this approach is that, either pod creator(client) needs to add tolerations or it needs to be added by
kubernetes cluster manager through mutating admission controller without client's knowledge.
Taints are on nodes and tolerations are on pods. Submitter of pods and managers of nodes are two different teams sitting
in different orgs. So one can not manage both the taints and tolerations. And it will increase the chances of errors.
Contributor

I'm not strongly against this PR, but I still struggle to see the benefits over taints + tolerations + a mutating webhook, because I don't see (yet) how the drawbacks listed here don't apply equally to the extra-filter scenario.
IOW: it seems to me the new filter will hit pretty much the same issues.

Member

+1


We’ve evaluated the taints and tolerations approach as well. However, from the standpoint of managing additional node configurations, it requires cluster admins to add and maintain specific taints on GPU nodes and tolerations on GPU-using pods. In contrast, the filter plugin handles these configurations automatically, significantly reducing the administrative burden.

Contributor

One can label GPU nodes with something like NFD and then inject node affinity into Pods with a webhook, preventing scheduling to those nodes unless the Pod really uses GPU resources.

Automated, low maintenance. That's what I'd do in order to avoid maintaining a scheduler build. The gotcha is if some GitOps tool does not allow using such a webhook, but usually GitOps can be configured.

Contributor

+1 to @uniemimu

Can we have a summary of the administrative burden for both approaches?
I'm still lacking enough evidence that this approach brings major (and thus worthwhile) benefits.

Contributor

Especially your column for a patched scheduler is missing a few details.

For the Setup Effort, you will need to add the burden of defining the scheduler name for every workload, unless you manage to replace the existing default scheduler, which is not typically possible in CSP environments. If you don't do those workload definition changes by hand, add a webhook. If you do it by hand for every workload, you could do manual changes for workload node affinity by hand also in the left column for NFD. So either remove webhook from the left column or add it to the middle column.

For the ongoing maintenance about labels, the GPU vendors tend to have their own set of labels which has been rather stable, but varies between vendors. The dominating market leader has NFD installation as an option in its GPU device plugin helm chart. It's not like it would be an effort to set up and maintain.

The node-level overhead for the patched scheduler version will only be "none" in case you replace the existing default scheduler, which is usually not possible in CSP managed kubernetes environments. So you typically need an extra node, possibly a big one, if the cluster is big. The NFD worker daemonset is small and in the case of those GPU nodes where you didn't want extra Pods, it shouldn't force you to increase your node counts unless 5 millicores and 128MB of memory is too much. On other nodes then, there is that amount of unnecessary resource usage. So the row there in my opinion would be "typically 0 new nodes required, plausibly 1 extra node" | "typically 1 extra node required" | "0"

From the overall hassle point of view having a default scheduler forcing this filtering for all users by default is of course the simplest, and the right column rightfully shows that.

What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented to sig-scheduling?

Contributor

@ffromani Nov 4, 2024

What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented to sig-scheduling?

+1 again. I can see a way to have this code accepted in kubernetes-sigs eventually, but changing the default profile is a much more impactful change which would require extensive discussion and evaluation.

Will review the rest shortly.

@kumariitr Nov 4, 2024

I think the whole discussion is revolving around "CSP-managed Kubernetes environments". Why so? There are companies (especially larger ones) that manage their own Kubernetes environments / cluster setups. For them it makes sense to use cloud hardware as IaaS and install Kubernetes on top of it themselves. Not everyone is willing to use GKE, EKS, or any other CSP-managed Kubernetes environment.
Let's talk about open source Kubernetes and not conflate it with CSP goals.


What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented to sig-scheduling?

If that's part of the process before merging a new plugin into this repo, then let's follow the process. We can present this to sig-scheduling; that's a good suggestion. @gkeesh7 @Zeel-Patel please prepare for that.

Contributor

From where I'm standing, I don't oppose having the plugin here in this repository. I just don't see much benefit myself, since I believe I could do the same with less of a maintenance burden, basically without having to keep building a custom scheduler. But from the standpoint of having alternatives for those who really prefer building their own scheduler, I get it.

There's still yet another approach available for achieving the same thing, and that is scheduler extenders. With extenders you wouldn't have to rebuild the whole scheduler, you'd only have to build the extender and hook it up. The filter functionality is available in extenders. The downside of those is that you still need to configure the scheduler, and you still need to maintain your own scheduler extender container. Plus you take a performance hit in scheduling, which can be mitigated by the use of caches and only passing in node names to the extender. But you still take a measurable performance hit.

My view is that real benefits in terms of making things simple only start to surface if you get this into Kubernetes upstream, all the way into the default profile where it would be enabled by default, and that requires sig-scheduling to buy into the idea.

daemonset pod does not request any GPUs. Because of the filter plugin explained above, daemonset pods can get stuck in
pending state forever.

So, the scheduler plugin will exclude the pods which contain the GPU device-plugin container image from filtering
Member

Maybe we should exclude all daemonset pods?
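A minimal sketch of that suggestion, assuming the exclusion keys off the pod's ownerReferences rather than matching a device-plugin container image:

```go
// Sketch of the DaemonSet exclusion suggested above: skip filtering for any
// pod controlled by a DaemonSet, instead of matching a device-plugin image.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isDaemonSetPod reports whether the pod is controlled by a DaemonSet.
func isDaemonSetPod(pod *v1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Controller != nil && *ref.Controller && ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

func main() {
	ctrl := true
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
		OwnerReferences: []metav1.OwnerReference{{Kind: "DaemonSet", Name: "gpu-device-plugin", Controller: &ctrl}},
	}}
	fmt.Println(isDaemonSetPod(pod)) // true -> the filter would not reject this pod
}
```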

Comment on lines +228 to +233
For a given pod resource, Filter plugin will check all the nodes in the cluster. It will check
whether the node is GPU node or not by checking node’s Allocatable field. If a node has an allocatable
GPU resource, it will consider that node as GPU node. 'Allocatable' on a Kubernetes
node is defined as the amount of compute resources that are available for pods. Note that node
resources which have been allocated to pods running on that node will not be subtracted from the
node allocatable resources.
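To make the mechanism described in the excerpt concrete, a minimal sketch of such a Filter check against the scheduler framework; the plugin name and the hard-coded nvidia.com/gpu resource are illustrative assumptions rather than what the KEP necessarily specifies:

```go
// Sketch of the Filter check described above: reject GPU-capable nodes for
// pods that request no GPUs. The plugin name and the single hard-coded
// resource name are simplifications for illustration only.
package nongpupods

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	Name        = "NonGPUPods"
	gpuResource = v1.ResourceName("nvidia.com/gpu")
)

type Plugin struct{}

var _ framework.FilterPlugin = &Plugin{}

func (pl *Plugin) Name() string { return Name }

func (pl *Plugin) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	gpus, nodeHasGPU := node.Status.Allocatable[gpuResource]
	if !nodeHasGPU || gpus.IsZero() {
		return nil // not a GPU node: nothing to enforce
	}
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Limits[gpuResource]; ok {
			return nil // pod requests GPUs: allow it on GPU nodes
		}
	}
	return framework.NewStatus(framework.Unschedulable, "node's GPUs are reserved for GPU workloads")
}
```

A real implementation would presumably make the list of GPU resource names configurable through plugin arguments and also consider init containers.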
Contributor

Assuming there's no high pod churn, and assuming the working set of pods is relatively stable (e.g. pods tend to be long-running), would https://github.com/kubernetes-sigs/descheduler help in this use case?
Maybe not; let's discuss it in the Alternatives section.


Microservices use cases have low pod churn in general, but batch workload use cases have very high pod churn.

Contributor

Fine. This deserves to be recorded in the Alternatives section, explaining why the proposed solution is better than the descheduler, and in which scenarios.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 3, 2025
@dom4ha commented Feb 26, 2025

@pohly @johnbelamaric @klueska, tagging you as there was a related discussion at the last wg-device-management meeting.

I don't think this plugin, with such specialized functionality, could become an official kube-scheduler plugin. You could try to make it more generic and capable of expressing negative scheduling criteria, but it would be hard to get that right without duplicating the taint/toleration mechanism.

Maybe the DRA plugin could have a notion of critical resource types, which would block scheduling of pods not using them?

@kumariitr

@pohly @johnbelamaric @klueska, tagging you as there was a related discussion at the last wg-device-management meeting.

I don't think this plugin, with such specialized functionality, could become an official kube-scheduler plugin. You could try to make it more generic and capable of expressing negative scheduling criteria, but it would be hard to get that right without duplicating the taint/toleration mechanism.

Maybe the DRA plugin could have a notion of critical resource types, which would block scheduling of pods not using them?

@dom4ha There was also an alignment about adding this plugin to the scheduler-plugins repo. Can we do that as a first step so that it becomes available for users to consume?

@pohly commented Feb 26, 2025

Maybe the DRA plugin could have a notion of critical resource types, which would block scheduling of pods not using them?

That approach could work. The plugin would have to be extended to also do some work for pods not using claims (currently it short-circuits itself when it detects that) and to keep track of which nodes have such "tainting devices" (API to be defined...). Then, during Filter, it could reject nodes unless the pod actually allocates such a device.
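Purely to make the shape of that idea concrete, a rough sketch of the Filter-time decision; the tracking map and the claim check below are hypothetical stand-ins for the yet-to-be-defined API:

```go
// Sketch only: the "tainting devices" API does not exist yet, so the tracking
// map and the claim inspection below are hypothetical placeholders for it.
package drataintsketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type Plugin struct {
	// nodesWithTaintingDevices would be maintained from ResourceSlices
	// (hypothetical: which nodes expose devices marked as "tainting").
	nodesWithTaintingDevices map[string]bool
}

// podAllocatesTaintingDevice would inspect the pod's resource claims; this is
// a hypothetical stand-in for the real allocation check.
func (pl *Plugin) podAllocatesTaintingDevice(pod *v1.Pod, nodeName string) bool {
	return len(pod.Spec.ResourceClaims) > 0 // placeholder heuristic only
}

func (pl *Plugin) Name() string { return "DRATaintingDevicesSketch" }

func (pl *Plugin) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	nodeName := nodeInfo.Node().Name
	if !pl.nodesWithTaintingDevices[nodeName] {
		return nil // no tainting devices on this node
	}
	if pl.podAllocatesTaintingDevice(pod, nodeName) {
		return nil // pod uses such a device, so it may land here
	}
	return framework.NewStatus(framework.Unschedulable, "node reserved for pods using its tainting devices")
}
```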

@dom4ha commented Feb 26, 2025

@sanposhiho @Huang-Wei @macsko

@dom4ha There was also an alignment about adding this plugin to the scheduler-plugins repo. Can we do that as a first step so that it becomes available for users to consume?

My comment was about making it an in-tree plugin, although it might be useful to hear other opinions about it.

In general, duplicating functionality that is achievable using existing mechanisms is not a good thing. The argument used here was that the plugin may become an official one, so I'm trying to assess how likely that is.

@kumariitr

@sanposhiho @Huang-Wei @macsko

@dom4ha There was also an alignment about adding this plugin to the scheduler-plugins repo. Can we do that as a first step so that it becomes available for users to consume?

My comment was about making it an in-tree plugin, although it might be useful to hear other opinions about it.

In general, duplicating functionality that is achievable using existing mechanisms is not a good thing. The argument used here was that the plugin may become an official one, so I'm trying to assess how likely that is.

Whether this plugin will be useful to other developers will be known only when we make it available to them by adding it to this scheduler-plugins repo. Only then will we come to know how likely it is that this plugin will become official.

@dom4ha commented Feb 27, 2025

@sanposhiho @Huang-Wei @macsko

@dom4ha There was also an alignment about adding this plugin to the scheduler-plugins repo. Can we do that as a first step so that it becomes available for users to consume?

My comment was about making it an in-tree plugin, although it might be useful to hear other opinions about it.
In general, duplicating functionality that is achievable using existing mechanisms is not a good thing. The argument used here was that the plugin may become an official one, so I'm trying to assess how likely that is.

Whether this plugin will be useful to other developers will be known only when we make it available to them by adding it to this scheduler-plugins repo. Only then will we come to know how likely it is that this plugin will become official.

Plugins don't get promoted just based on their popularity, because they need to become part of a consistent product, which means many different things. In this particular case, the issue that I see is the lack of flexibility (also raised by others), meaning that it may work today, but it's not future-proof. Another concern is possible duplication of existing functionality.

I can see the problem it is trying to solve, and thanks for bringing it up. I spent some time figuring out how we could turn it into configurable functionality. This is why I mentioned DRA: in my opinion it would be the best place to express which devices are critical enough to completely block scheduling of other pods. That still does not mean it's the idea to follow.

Another alternative I see is the ability to define some sort of per-node scheduling policies in the form of extensions. The problem here would probably be the performance of their evaluation, so I'm not sure it's the way to go, since such policies could probably be compiled into taints/labels, i.e. into already-supported concepts, without the risk of slowing down the scheduler.

Summarizing, I'm not commenting on whether the plugin should be part of this repo (leaving that to the repo owners), just giving a perspective on it becoming an official plugin. You could also evaluate the possibility of improving the taint/toleration approach to address its pain points, but I don't have any specific suggestion here.

@johnbelamaric

@dom4ha thanks for tagging me here. I have been discussing this with a few folks. Taints and tolerations were originally intended more as an administrative function. We have been using them, along with node labels, as ways to guide scheduling when there is relevant scheduling information that is not actually known to the scheduler. Examples:

  • Machine has a scarce resource (e.g., GPUs) and so we only want to run workloads on that machine that need that resource
  • Machine has a particular CPU manager or topology manager policy configured
  • Machine has a particular kubelet or runtime feature enabled
  • A particular kubelet feature gate is enabled on that node

There is a large class of things we use node labels and/or taints and tolerations for that could be automatically derived. We think of these as "workload requirements" and "node capabilities", with some of those node capabilities being deemed "scarce".

We should be able to automatically inspect a PodSpec and associated objects (volumes, resource claims, etc.) and determine the "workload requirements" in terms of these "capabilities". Similarly, a node can publish the "node capabilities". The scheduler could then take these into consideration during scheduling.

This would fix a lot of issues we have right now, where a pod can land on a node that cannot run it, because it uses a feature that is not enabled on that node. It can also solve the issue of repelling workloads from nodes with scarce "capabilities".

cc @dchen1107 @liggitt @thockin @tallclair @samuelkarp

@johnbelamaric

This would fix a lot of issues we have right now, where a pod can land on a node that cannot run it, because it uses a feature that is not enabled on that node. It can also solve the issue of repelling workloads from nodes with scarce "capabilities".

And importantly, this can be 100% transparent to both users and cluster administrators, if done right.

@dom4ha commented Feb 28, 2025

There is a large class of things we use node labels and/or taints and tolerations for that could be automatically derived. We think of these as "workload requirements" and "node capabilities", with some of those node capabilities being deemed "scarce".

That sounds to me like the relation between ResourceClaims and ResourceSlices, if we again think about the node -> resource generalization. Do we need anything more than just extending ResourceSlice with whatever information is needed to take the right scheduling decision?

I guess that KEP-5027 + 5055: DRA: admin-controlled device attributes + taints could help in managing such dynamic information centrally.

@johnbelamaric

I don't see this as directly related to DRA. This is just something nodes can start publishing. And maybe a scheduler plugin can start interpreting. No user action needed, no DRA needed.

@ffromani
Contributor

@dom4ha thanks for tagging me here. I have been discussing this with a few folks. Taints and tolerations were originally intended more as an administrative function. We have been using them, along with node labels, as ways to guide scheduling when there is relevant scheduling information that is not actually known to the scheduler. Examples:

* Machine has a scarce resource (e.g., GPUs) and so we only want to run workloads on that machine that need that resource

* Machine has a particular CPU manager or topology manager policy configured

* Machine has a particular kubelet or runtime feature enabled

* A particular kubelet feature gate is enabled on that node

There is a large class of things we use node labels and/or taints and tolerations for that could be automatically derived. We think of these as "workload requirements" and "node capabilities", with some of those node capabilities being deemed "scarce".

We should be able to automatically inspect a PodSpec and associated objects (volumes, resource claims, etc.) and determine the "workload requirements" in terms of these "capabilities". Similarly, a node can publish the "node capabilities". The scheduler could then take these into consideration during scheduling.

This would fix a lot of issues we have right now, where a pod can land on a node that cannot run it, because it uses a feature that is not enabled on that node. It can also solve the issue of repelling workloads from nodes with scarce "capabilities".

I like this a lot. It will solve a lot of pain points we have and unlock possibilities. From my (unfortunately limited) experience with DRA, which was correctly mentioned a few times already, it seems to me this direction will compose nicely with the other ongoing efforts, DRA first and foremost.

It also seems a feasible pivot for the work at hand, which, it seems to me, would address most of the core concerns some reviewers (yours truly included) have highlighted.

Besides the much bigger scope, however, I can think of some concerns:

  1. I see the above proposal working with reasonable effort for classic devices (device plugins).
  2. Node tuning features, like the cpumanager policy, topology manager policy, and policy options, are not exposed by the nodes, nor probably should they be. Users (humans or machines) should not be concerned with "this node has topologyManagerPolicy=single-numa-node" but rather should care about "this node can allocate resources aligned to NUMA node boundaries". This means we will need a way to model these capabilities and to expose them. We did some experiments in the context of numa-aware scheduling.
  3. Some workload requirements are opaque; it is not easy, or even possible, to derive them by parsing the pod spec. Easy example: exclusive CPU allocation. The kubelet grants it if cpuManagerPolicy=static, the workload is Guaranteed QoS, and it requests integer CPUs. Problem: this is implicit. There's no explicit requirement in the podSpec. An unaware user can create a podSpec with cpu request == cpu limit == 4 (say) that doesn't actually need exclusive CPUs and can happily run without them. These requirements would then need to be made explicit, which would likely require pod spec changes? (There are more examples in this vein; see the sketch below.)
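To illustrate point 3, a sketch of how that implicit requirement would have to be inferred today, assuming the static CPU manager policy semantics (Guaranteed QoS with an integer CPU request equal to the limit):

```go
// Sketch of inferring the implicit "exclusive CPUs" requirement mentioned in
// point 3: under cpuManagerPolicy=static, a Guaranteed-QoS container with an
// integer CPU request equal to its limit gets exclusive CPUs, but nothing in
// the PodSpec says so explicitly.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// containerWouldGetExclusiveCPUs approximates the kubelet's static-policy rule
// for a single container; real QoS is computed over the whole pod.
func containerWouldGetExclusiveCPUs(c v1.Container) bool {
	req, hasReq := c.Resources.Requests[v1.ResourceCPU]
	lim, hasLim := c.Resources.Limits[v1.ResourceCPU]
	if !hasReq || !hasLim || req.Cmp(lim) != 0 {
		return false // not Guaranteed for CPU
	}
	return lim.Value()*1000 == lim.MilliValue() // integer number of CPUs
}

func main() {
	c := v1.Container{
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("4")},
			Limits:   v1.ResourceList{v1.ResourceCPU: resource.MustParse("4")},
		},
	}
	fmt.Println(containerWouldGetExclusiveCPUs(c)) // true, yet the intent stays implicit
}
```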

My 2c: I believe "just" exposing node capabilities in a more structured way can unlock possibilities even without further changes (e.g. to podSpec), and it seems to me this would integrate nicely with both this work and the other ongoing initiatives. Would this be a good first step? I feel this is worthy of a Kubernetes KEP.

@johnbelamaric

My 2c: I believe "just" exposing node capabilities in a more structured way can unlock possibilities even without further changes (e.g. to podSpec), and it seems to me this would integrate nicely with both this work and the other ongoing initiatives. Would this be a good first step? I feel this is worthy of a Kubernetes KEP.

Yes, I think a K8s KEP is warranted, and I know @dchen1107 and others are supportive. It would probably be a sig-scheduling KEP with sig-node participating (or vice versa, but I think that's the right ownership). Do you want to lead it, @ffromani?

Your points are really interesting. I don't know what to do about implicit things. But we can work it out in the KEP!

@samuelkarp
Member

There is a large class of things we use node labels and/or taints and tolerations for that could be automatically derived. We think of these as "workload requirements" and "node capabilities", with some of those node capabilities being deemed "scarce".

We should be able to automatically inspect a PodSpec and associated objects (volumes, resource claims, etc.) and determine the "workload requirements" in terms of these "capabilities". Similarly, a node can publish the "node capabilities". The scheduler could then take these into consideration during scheduling.

This would fix a lot of issues we have right now, where a pod can land on a node that cannot run it, because it uses a feature that is not enabled on that node. It can also solve the issue of repelling workloads from nodes with scarce "capabilities".

For what it's worth, this is roughly how ECS handles the challenge of heterogeneous nodes (disclosure: I was part of designing this around 2015, before I was involved with Kubernetes). The way it works is as follows:

  • The ECS agent (equivalent of the Kubelet) runs logic to detect supported capabilities at startup. These are based on the version of the ECS agent (so, unconditional), agent configuration, container runtime version (Docker, and looking at which API versions are supported by the installed Docker daemon), devices (GPU, etc), and kernel version.
  • The supported capabilities are registered with the ECS control plane as "attributes"
  • When a task definition (rough equivalent of a Pod spec template) is registered, the ECS control plane computes a set of implied required attributes based on the content of the task definition
  • At placement time, the scheduler computes which nodes have the required attributes present and limits placement to those nodes

ECS has a bit of a simpler problem to solve (much more limited feature set compared to Kubernetes), but I do generally think having the node advertise its capabilities and allowing the scheduler to match based on that makes sense.

@dom4ha commented Mar 4, 2025

  • When a task definition (rough equivalent of a Pod spec template) is registered, the ECS control plane computes a set of implied required attributes based on the content of the task definition

I think this is a very important step to keep scheduling fast. I guess QoS could also be taken into consideration in this process.

@johnbelamaric

cc @mrunalp

This is the issue I mentioned in response to your question in SIG Arch.

@Huang-Wei
Contributor

Carrying over my previous comment at #788 (comment)

IMHO having a separate plugin is sort of overkill. If we don't want non-GPU workloads to land on GPU nodes, we can simply taint the GPU nodes, and only GPU workloads (with the proper tolerations) can land on them.

In a real-world case, GPU nodes will be tainted anyway, so compared to introducing a plugin, isn't the lightweight solution to control which workloads can or cannot carry tolerations?
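For comparison with the plugin, the taint/toleration pairing being described looks roughly like this; the taint key used here is a hypothetical placeholder:

```go
// Minimal sketch of the taint/toleration pairing described above. The taint
// key "example.com/gpu" is a hypothetical placeholder; real clusters typically
// use a vendor- or platform-specific key.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Cluster-admin side: taint applied to every GPU node so that ordinary
	// pods are repelled (equivalent to
	// `kubectl taint nodes <node> example.com/gpu=present:NoSchedule`).
	taint := v1.Taint{
		Key:    "example.com/gpu",
		Value:  "present",
		Effect: v1.TaintEffectNoSchedule,
	}

	// Workload side: toleration that GPU pods (or a mutating webhook acting on
	// their behalf) must carry to be admitted onto the tainted nodes.
	toleration := v1.Toleration{
		Key:      "example.com/gpu",
		Operator: v1.TolerationOpEqual,
		Value:    "present",
		Effect:   v1.TaintEffectNoSchedule,
	}

	fmt.Printf("taint: %+v\ntoleration: %+v\n", taint, toleration)
}
```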

@ffromani
Contributor

My 2c: I believe "just" exposing node capabilities in a more structured way can unlock possibilities even without further changes (e.g. to podSpec), and it seems to me this would integrate nicely with both this work and the other ongoing initiatives. Would this be a good first step? I feel this is worthy of a Kubernetes KEP.

Yes, I think a K8s KEP is warranted, and I know @dchen1107 and others are supportive. It would probably be a sig-scheduling KEP with sig-node participating (or vice versa, but I think that's the right ownership). Do you want to lead it, @ffromani?

Your points are really interesting. I don't know what to do about implicit things. But we can work it out in the KEP!

I'd love to help but need to carve out time. I'll update here and keep commenting.

@dom4ha commented Apr 11, 2025

After a brief look, a similar idea is discussed in kubernetes/kubernetes#131208.

@ffromani commented May 5, 2025

xref: kubernetes/kubernetes#66525

@ffromani commented May 5, 2025

My 2c: I believe "just" exposing node capabilities in a more structured way can unlock possibilities even without further changes (e.g. to podSpec), and it seems to me this would integrate nicely with both this work and the other ongoing initiatives. Would this be a good first step? I feel this is worthy of a Kubernetes KEP.

Yes, I think a K8s KEP is warranted, and I know @dchen1107 and others are supportive. It would probably be a sig-scheduling KEP with sig-node participating (or vice versa, but I think that's the right ownership). Do you want to lead it, @ffromani?
Your points are really interesting. I don't know what to do about implicit things. But we can work it out in the KEP!

I'd love to help but need to carve out time. I'll update here and keep commenting.

I'm happy to correct myself: I will have bandwidth to help here. I'll sync up with relevant folks ASAP.

@Zeel-Patel
Author

Hi folks, does this KEP need any more changes to get approved?
@ffromani @dom4ha @Huang-Wei @samuelkarp

@johnbelamaric

I am proposing a generalized approach to this problem, relying on the existing taints & tolerations functionality. See kubernetes/enhancements#5282

@Zeel-Patel
Author

@johnbelamaric, since it is a similar design, can I be one of the contributors for this?

@johnbelamaric

@johnbelamaric, since it is a similar design, can I be one of the contributors for this?

Yes for sure, we would love to have your participation. Personally, I think it should be done in-tree as part of the existing taints & tolerations plugin, but that's a decision for the SIG. If the SIG agrees, this will need a core K8s KEP based on the k/enhancements issue.

Labels
cncf-cla: yes · do-not-merge/release-note-label-needed · lifecycle/stale · ok-to-test · size/L