KEP - Add a Filter plugin to ensure that non-GPU pods are not scheduled on GPU nodes #812


Open · wants to merge 1 commit into master

Conversation


@Zeel-Patel commented Oct 23, 2024


This plugin is widely used at Uber, and we thought it could be useful for the open source community.

KEP doc, as requested in PR #788.

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the do-not-merge/release-note-label-needed, cncf-cla: yes, and needs-ok-to-test labels on Oct 23, 2024
@k8s-ci-robot
Contributor

Hi @Zeel-Patel. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Zeel-Patel
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label on Oct 23, 2024

netlify bot commented Oct 23, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

Latest commit: bed72f0
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/6718c1d59c391d000856303a

@ffromani
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Oct 25, 2024
@k8s-ci-robot
Contributor

@Zeel-Patel: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-scheduler-plugins-verify · Commit: bed72f0 · Required: true · Rerun command: /test pull-scheduler-plugins-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Comment on lines +401 to +402
The drawback in this approach is that, either pod creator(client) needs to add tolerations or it needs to be added by
kubernetes cluster manager through mutating admission controller without client's knowledge.
Contributor

OK, but this also applies to a filter plugin, doesn't it? Arguably, any non-core, non-default setting needs to be added by the Kubernetes cluster manager and can be unknown, for whatever reason, to the pod submitter(s).


In the filter plugin-based approach, pod submitters can focus solely on defining their resource requirements (e.g., nvidia.com/gpu) without needing to know about specific node taints or tolerations in the pod spec. This allows the plugin to handle those details automatically, reducing the configuration burden on the pod creator.
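For illustration, this is roughly all the pod submitter would have to declare under the proposed approach; a minimal sketch in Go using the Kubernetes API types, with hypothetical pod/container names and image:

```go
// Sketch of the submitter-facing side of the proposed approach: only the GPU
// resource request is declared; no tolerations, node affinity, or scheduler
// name are needed. All names here are hypothetical placeholders.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "training-job"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "trainer",
				Image: "example.com/trainer:latest",
				Resources: v1.ResourceRequirements{
					Limits: v1.ResourceList{
						// Extended resources such as GPUs are requested via limits.
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	fmt.Println(pod.Name, pod.Spec.Containers[0].Resources.Limits)
}
```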

Contributor

This is true, but only if the filter plugin is part of the default scheduler profile. AFAICT that won't be the case until the plugin is merged into the main k8s codebase and accepted as part of the default scheduler profile, and the latter especially will not happen quickly (as the standard process mandates).

At the very least, then, for the time being the pod submitter would need to know which schedulerName to use.

Contributor

It is debatable whether there is unanimous desire for preventing scheduling of Pods not using GPU resources to nodes that have GPUs. Sometimes, you might want to do that, but sometimes, you might want to save a buck and have some lower priority Pods land in such nodes also, when resources permit. I would be a little bit surprised if such a filter plugin became a part of the default scheduler profile rapidly.

In typical CSP Kubernetes environments one doesn't get to change the scheduler settings. Any solution based on scheduler plugins will for a long time (if not forever) have to be using a mutating webhook which injects the special scheduler name plus of course you need an extra pod for running the special scheduler. You can't get away from using the mutating webhook, unless you accept modifying all workloads manually with the scheduler name. No difference there if you choose to go for scheduler plugins or if you choose to do labels and node affinity, so the least hassle would be to deploy node feature discovery and a mutating webhook without a need for scheduler changes.

In terms of resource usage, I bet the mutating webhook + NFD is a smaller burden than a mutating webhook + extra scheduler Pod. NFD isn't doing complex scheduling, it is pretty simple.
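To make the comparison concrete, here is a rough sketch of the node affinity such a mutating webhook might inject into pods that request no GPUs. Only the patch-construction logic is shown, not the webhook plumbing, and the NFD-style label key is an assumption that varies by vendor and setup:

```go
// Rough sketch of the node-affinity term a mutating webhook could inject into
// pods that request no GPUs, so that they avoid GPU nodes labeled (e.g. by NFD).
// The label key "feature.example.com/gpu" is a hypothetical placeholder.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// podRequestsGPU reports whether any container asks for the GPU extended resource.
func podRequestsGPU(pod *v1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			return true
		}
		if _, ok := c.Resources.Requests["nvidia.com/gpu"]; ok {
			return true
		}
	}
	return false
}

// gpuAvoidanceAffinity is what the webhook would add to non-GPU pods.
func gpuAvoidanceAffinity() *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "feature.example.com/gpu", // hypothetical NFD-style label
						Operator: v1.NodeSelectorOpDoesNotExist,
					}},
				}},
			},
		},
	}
}

func main() {
	pod := &v1.Pod{} // a pod with no GPU request
	if !podRequestsGPU(pod) {
		pod.Spec.Affinity = gpuAvoidanceAffinity()
	}
	fmt.Printf("%+v\n", pod.Spec.Affinity)
}
```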


The north-star goal is to make this plugin available as part of the default scheduler profile in the kube-scheduler binary. I would like to understand the reservations about that approach. We are fine with the time it will take, and we would like to follow the full process.

Until this plugin is part of the kube-scheduler binary, an alternate (temporary) approach is to make it available in this repo; people who are interested in using it can rebuild the kube-scheduler binary by placing the plugin code in the https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins directory. We are providing this choice to users who want to use the plugin; of course, it will work only after it is added to the scheduler profile configuration.
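For reference, the usual out-of-tree alternative to patching the in-tree plugins directory is to build a scheduler binary that registers the plugin through the kube-scheduler command options, roughly as sketched below; the plugin package and its Name/New symbols are hypothetical placeholders for this plugin:

```go
// Sketch of an out-of-tree scheduler binary that registers a custom plugin,
// in the style used by kubernetes-sigs/scheduler-plugins. The plugin package
// "nongpupods" and its Name/New symbols are hypothetical placeholders.
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"example.com/scheduler-plugins/pkg/nongpupods" // hypothetical plugin package
)

func main() {
	// app.WithPlugin registers the plugin factory under its name so it can be
	// enabled in a KubeSchedulerConfiguration profile.
	cmd := app.NewSchedulerCommand(
		app.WithPlugin(nongpupods.Name, nongpupods.New),
	)
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```

Either way, the plugin still has to be enabled in a scheduler profile via KubeSchedulerConfiguration.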

Contributor

Until this plugin is part of the kube-scheduler binary, an alternate (temporary) approach is to make it available in this repo; people who are interested in using it can rebuild the kube-scheduler binary by placing the plugin code in the https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins directory.

Understood. Arguably, requiring the cluster admin to patch, rebuild, and maintain their scheduler binary is a burden that needs to be taken into account when we evaluate the pros and cons of this approach vs. taints and tolerations.


It is debatable whether there is unanimous desire for preventing scheduling of Pods not using GPU resources to nodes that have GPUs. Sometimes, you might want to do that, but sometimes, you might want to save a buck and have some lower priority Pods land in such nodes also, when resources permit. I would be a little bit surprised if such a filter plugin became a part of the default scheduler profile rapidly.

Allowing non-GPU workloads on GPU nodes can be practical in on-prem environments with underutilized resources. However, in cloud settings, aggressively downscaling GPU nodes and spawning non-GPU nodes for general workloads is often more cost-effective, as GPU resources are typically pricier. For those who don't need to exclude standard pods from GPU nodes to save costs, this plugin remains optional—they can choose not to use it if they prefer a more flexible GPU node utilization approach.

Contributor

@ffromani Nov 4, 2024

Allowing non-GPU workloads on GPU nodes can be practical in on-prem environments with underutilized resources. However, in cloud settings, aggressively downscaling GPU nodes and spawning non-GPU nodes for general workloads is often more cost-effective, as GPU resources are typically pricier. For those who don't need to exclude standard pods from GPU nodes to save costs, this plugin remains optional—they can choose not to use it if they prefer a more flexible GPU node utilization approach.

This makes sense, but it is not clear to me how this would be the case if we make this plugin part of the default profile (which, IIUC, is the goal). OTOH, using it in a secondary, non-default profile would be straightforward, but that would again require changes to the pod spec.

@kumariitr Nov 4, 2024

Understood. Arguably, requiring the cluster admin to patch, rebuild, and maintain their scheduler binary is a burden that needs to be taken into account when we evaluate the pros and cons of this approach vs. taints and tolerations.

At least at Uber, we prefer to put such burden / extra responsibilities on the cluster-admin team and not on the customers / users of the platform. There may be more people / companies who would like to take this approach. Taints / tolerations add burden on the cluster-admin team and the customer teams, and also require coordination when hardware / nodes are managed by one team (cluster-admin) while workloads / pods are managed by multiple different teams. This coordination becomes complex in a multi-team setup in a large organisation.
So, adding a new approach based on a scheduler plugin helps many companies / people in the industry. The choice will still be theirs, whether to take this approach or the taints-and-tolerations approach.

Contributor

I still need to review the latest updates, but based on the comments I have read, I still don't see great benefits over the existing taints and tolerations mechanism. However, provided we explore alternatives and acknowledge them in the KEP, I'm not against adding the plugin to this repo. This is (IIUC :) ), after all, the place where experimentation is expected to happen.

I agree the benefits would really surface, as written in another comment, only when this plugin is both merged upstream AND part of the default profile. Both steps will require large buy-in.

OTOH, a prototype will help the conversation, so that is another reason not to oppose the merge.
Will review again and comment more deeply ASAP.

The drawback in this approach is that, either pod creator(client) needs to add tolerations or it needs to be added by
kubernetes cluster manager through mutating admission controller without client's knowledge.
Taints are on nodes and tolerations are on pods. Submitter of pods and managers of nodes are two different teams sitting
in different orgs. So one can not manage both the taints and tolerations. And it will increase the chances of errors.
Contributor

I'm not strongly against this PR, but I still struggle to see the benefits over taints + tolerations + a mutating webhook, because I don't see (yet) how the drawbacks listed here don't apply equally to the extra-filter scenario.
IOW: it seems to me the new filter will hit pretty much the same issues.

Member

+1


We’ve evaluated the taints and tolerations approach as well. However, from the standpoint of managing additional node configurations, it requires cluster admins to add and maintain specific taints on GPU nodes and tolerations on GPU-using pods. In contrast, the filter plugin handles these configurations automatically, significantly reducing the administrative burden.

Contributor

One can label GPU nodes with something like NFD and then inject node affinity into Pods with a webhook, preventing scheduling to those nodes unless the Pod really uses GPU resources.

Automated, low maintenance. That's what I'd do in order to avoid maintaining a scheduler build. The gotcha is if some GitOps tool does not allow using such a webhook, but usually GitOps can be configured.

Contributor

+1 to @uniemimu

Can we have a summary of the administrative burden for both approaches?
I'm still lacking enough evidence that this approach brings major (and thus worthwhile) benefits.

Contributor

Especially your column for a patched scheduler is missing a few details.

For the Setup Effort, you will need to add the burden of defining the scheduler name for every workload, unless you manage to replace the existing default scheduler, which is not typically possible in CSP environments. If you don't do those workload definition changes by hand, add a webhook. If you do it by hand for every workload, you could do manual changes for workload node affinity by hand also in the left column for NFD. So either remove webhook from the left column or add it to the middle column.

For the ongoing maintenance about labels, the GPU vendors tend to have their own set of labels which has been rather stable, but varies between vendors. The dominating market leader has NFD installation as an option in its GPU device plugin helm chart. It's not like it would be an effort to set up and maintain.

The node-level overhead for the patched scheduler version will only be "none" in case you replace the existing default scheduler, which is usually not possible in CSP managed kubernetes environments. So you typically need an extra node, possibly a big one, if the cluster is big. The NFD worker daemonset is small and in the case of those GPU nodes where you didn't want extra Pods, it shouldn't force you to increase your node counts unless 5 millicores and 128MB of memory is too much. On other nodes then, there is that amount of unnecessary resource usage. So the row there in my opinion would be "typically 0 new nodes required, plausibly 1 extra node" | "typically 1 extra node required" | "0"

From the overall hassle point of view having a default scheduler forcing this filtering for all users by default is of course the simplest, and the right column rightfully shows that.

What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented to sig-scheduling?

Contributor

@ffromani Nov 4, 2024

What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented to sig-scheduling?

+1 again. I can see a way to have this code accepted in kubernetes-sigs eventually, but changing the default profile is a much more impactful change which would require extensive discussion and evaluation.

Will review the rest shortly.

@kumariitr Nov 4, 2024

I think the whole discussion is revolving around "CSP-managed Kubernetes environments". Why so? There are companies (especially larger ones) that manage their own Kubernetes environments / cluster setups. For them it makes sense to use cloud hardware as IaaS and install Kubernetes on top of it themselves. Not everyone is willing to use GKE, EKS, or any other CSP-managed Kubernetes environment.
Let's talk about open source Kubernetes and not conflate it with CSP goals.


What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented to sig-scheduling?

If that's part of the process before merging a new plugin into this repo, then let's follow the process. We can present this to sig-scheduling; that's a good suggestion. @gkeesh7 @Zeel-Patel please prepare for that.

Contributor

From where I'm standing, I don't oppose having the plugin here in this repository. I just don't see much benefit myself, since I believe I could do the same with less of a maintenance burden, basically without having to keep building a custom scheduler. But from the standpoint of having alternatives for those who really prefer building their own scheduler, I get it.

There's still yet another approach available for achieving the same thing, and that is scheduler extenders. With extenders you wouldn't have to rebuild the whole scheduler, you'd only have to build the extender and hook it up. The filter functionality is available in extenders. The downside of those is that you still need to configure the scheduler, and you still need to maintain your own scheduler extender container. Plus you take a performance hit in scheduling, which can be mitigated by the use of caches and only passing in node names to the extender. But you still take a measurable performance hit.

My view is that real benefits in terms of making things simple only start to surface if you get this into Kubernetes upstream, all the way into the default profile where it would be enabled by default, and that requires sig-scheduling to buy into the idea.

daemonset pod does not request any GPUs. Because of the filter plugin explained above, daemonset pods can get stuck in
pending state forever.

So, the scheduler plugin will exclude the pods which contain the GPU device-plugin container image from filtering
Member

Maybe we should exclude all daemonset pods?
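A minimal sketch of that suggestion, assuming the exclusion keys off the pod's ownerReferences rather than matching a device-plugin container image:

```go
// Sketch of the DaemonSet exclusion suggested above: skip filtering for any
// pod controlled by a DaemonSet, instead of matching a device-plugin image.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isDaemonSetPod reports whether the pod is controlled by a DaemonSet.
func isDaemonSetPod(pod *v1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Controller != nil && *ref.Controller && ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

func main() {
	ctrl := true
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
		OwnerReferences: []metav1.OwnerReference{{Kind: "DaemonSet", Name: "gpu-device-plugin", Controller: &ctrl}},
	}}
	fmt.Println(isDaemonSetPod(pod)) // true -> the filter would not reject this pod
}
```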

Comment on lines +228 to +233
For a given pod resource, Filter plugin will check all the nodes in the cluster. It will check
whether the node is GPU node or not by checking node’s Allocatable field. If a node has an allocatable
GPU resource, it will consider that node as GPU node. 'Allocatable' on a Kubernetes
node is defined as the amount of compute resources that are available for pods. Note that node
resources which have been allocated to pods running on that node will not be subtracted from the
node allocatable resources.
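To make the mechanism described in the excerpt concrete, a minimal sketch of such a Filter check against the scheduler framework; the plugin name and the hard-coded nvidia.com/gpu resource are illustrative assumptions rather than what the KEP necessarily specifies:

```go
// Sketch of the Filter check described above: reject GPU-capable nodes for
// pods that request no GPUs. The plugin name and the single hard-coded
// resource name are simplifications for illustration only.
package nongpupods

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	Name        = "NonGPUPods"
	gpuResource = v1.ResourceName("nvidia.com/gpu")
)

type Plugin struct{}

var _ framework.FilterPlugin = &Plugin{}

func (pl *Plugin) Name() string { return Name }

func (pl *Plugin) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	gpus, nodeHasGPU := node.Status.Allocatable[gpuResource]
	if !nodeHasGPU || gpus.IsZero() {
		return nil // not a GPU node: nothing to enforce
	}
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Limits[gpuResource]; ok {
			return nil // pod requests GPUs: allow it on GPU nodes
		}
	}
	return framework.NewStatus(framework.Unschedulable, "node's GPUs are reserved for GPU workloads")
}
```

A real implementation would presumably make the list of GPU resource names configurable through plugin arguments and also consider init containers.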
Contributor

Assuming there's no high pod churn, and assuming the working set of pods is relatively stable (e.g. pods tend to be long-running), would https://github.com/kubernetes-sigs/descheduler help in this use case?
Maybe not; let's discuss it in the Alternatives section.


Microservices use cases have low pod churn in general, but batch workload use cases have very high pod churn.

Contributor

Fine. This deserves to be recorded in the Alternatives section, explaining why the proposed solution is better than the descheduler, and in which scenarios.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 3, 2025
@dom4ha commented Feb 26, 2025

@pohly @johnbelamaric @klueska, tagging you as there was a related discussion at the last wg-device-management meeting.

I don't think this plugin, with such specialized functionality, could become an official kube-scheduler plugin. You could try to make it more generic and capable of expressing negative scheduling criteria, but it would be hard to get that right without duplicating the taint/toleration mechanism.

Maybe the DRA plugin could have a notion of critical resource types, which would block scheduling of pods not using them?

@kumariitr

@pohly @johnbelamaric @klueska, tagging you as there was a related discussion at the last wg-device-management meeting.

I don't think this plugin, with such specialized functionality, could become an official kube-scheduler plugin. You could try to make it more generic and capable of expressing negative scheduling criteria, but it would be hard to get that right without duplicating the taint/toleration mechanism.

Maybe the DRA plugin could have a notion of critical resource types, which would block scheduling of pods not using them?

@dom4ha There was also an alignment about adding this plugin to the scheduler-plugins repo. Can we do that as a first step so that it becomes available for users to consume?

@pohly commented Feb 26, 2025

Maybe the DRA plugin could have a notion of critical resource types, which would block scheduling of pods not using them?

That approach could work. The plugin would have to be extended to also do some work for pods not using claims (currently it short-circuits itself when it detects that) and to keep track of which nodes have such "tainting devices" (API to be defined...). Then, during Filter, it could reject nodes unless the pod actually allocates such a device.
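Purely to make the shape of that idea concrete, a rough sketch of the Filter-time decision; the tracking map and the claim check below are hypothetical stand-ins for the yet-to-be-defined API:

```go
// Sketch only: the "tainting devices" API does not exist yet, so the tracking
// map and the claim inspection below are hypothetical placeholders for it.
package drataintsketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type Plugin struct {
	// nodesWithTaintingDevices would be maintained from ResourceSlices
	// (hypothetical: which nodes expose devices marked as "tainting").
	nodesWithTaintingDevices map[string]bool
}

// podAllocatesTaintingDevice would inspect the pod's resource claims; this is
// a hypothetical stand-in for the real allocation check.
func (pl *Plugin) podAllocatesTaintingDevice(pod *v1.Pod, nodeName string) bool {
	return len(pod.Spec.ResourceClaims) > 0 // placeholder heuristic only
}

func (pl *Plugin) Name() string { return "DRATaintingDevicesSketch" }

func (pl *Plugin) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	nodeName := nodeInfo.Node().Name
	if !pl.nodesWithTaintingDevices[nodeName] {
		return nil // no tainting devices on this node
	}
	if pl.podAllocatesTaintingDevice(pod, nodeName) {
		return nil // pod uses such a device, so it may land here
	}
	return framework.NewStatus(framework.Unschedulable, "node reserved for pods using its tainting devices")
}
```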

@dom4ha commented Feb 26, 2025

@sanposhiho @Huang-Wei @macsko

@dom4ha There was also an alignment about adding this plugin to the scheduler-plugins repo. Can we do that as a first step so that it becomes available for users to consume?

My comment was about making it an in-tree plugin, although it might be useful to hear other opinions about it.

In general, duplicating functionality that is achievable using existing mechanisms is not a good thing. The argument used here was that the plugin may become an official one, so I'm trying to assess how likely that is.

@kumariitr

@sanposhiho @Huang-Wei @macsko

@dom4ha There was also an alignment about adding this plugin to the scheduler-plugins repo. Can we do that as a first step so that it becomes available for users to consume?

My comment was about making it an in-tree plugin, although it might be useful to hear other opinions about it.

In general, duplicating functionality that is achievable using existing mechanisms is not a good thing. The argument used here was that the plugin may become an official one, so I'm trying to assess how likely that is.

Whether this plugin will be useful to other developers will be known only when we make it available to them by adding it to this scheduler-plugins repo. Only then will we come to know how likely it is that this plugin will become official.

@dom4ha commented Feb 27, 2025

@sanposhiho @Huang-Wei @macsko

@dom4ha There was also an alignment about adding this plugin to the scheduler-plugins repo. Can we do that as a first step so that it becomes available for users to consume?

My comment was about making it an in-tree plugin, although it might be useful to hear other opinions about it.
In general, duplicating functionality that is achievable using existing mechanisms is not a good thing. The argument used here was that the plugin may become an official one, so I'm trying to assess how likely that is.

Whether this plugin will be useful to other developers will be known only when we make it available to them by adding it to this scheduler-plugins repo. Only then will we come to know how likely it is that this plugin will become official.

Plugins don't get promoted just based on their popularity, because they need to become part of a consistent product, which means many different things. In this particular case, the issue that I see is the lack of flexibility (also raised by others), meaning that it may work today, but it's not future-proof. Another concern is possible duplication of existing functionality.

I can see the problem it is trying to solve, and thanks for bringing it up. I spent some time figuring out how we could turn it into configurable functionality. This is why I mentioned DRA: in my opinion it would be the best place to express which devices are critical enough to completely block scheduling of other pods. That still does not mean it's the idea to follow.

Another alternative I see is the ability to define some sort of per-node scheduling policies in the form of extensions. The problem here would probably be the performance of their evaluation, so I'm not sure it's the way to go, since such policies could probably be compiled into taints/labels, i.e. into already-supported concepts, without the risk of slowing down the scheduler.

Summarizing, I'm not commenting on whether the plugin should be part of this repo (leaving that to the repo owners), just giving a perspective on it becoming an official plugin. You could also evaluate the possibility of improving the taint/toleration approach to address its pain points, but I don't have any specific suggestion here.

@johnbelamaric

@dom4ha thanks for tagging me here. I have been discussing this with a few folks. Taints and tolerations were originally intended more as an administrative function. We have been using them, along with node labels, as ways to guide scheduling when there is relevant scheduling information that is not actually known to the scheduler. Examples:

  • Machine has a scarce resource (e.g., GPUs) and so we only want to run workloads on that machine that need that resource
  • Machine has a particular CPU manager or topology manager policy configured
  • Machine has a particular kubelet or runtime feature enabled
  • A particular kubelet feature gate is enabled on that node

There is a large class of things we use node labels and/or taints and tolerations for that could be automatically derived. We think of these as "workload requirements" and "node capabilities", with some of those node capabilities being deemed "scarce".

We should be able to automatically inspect a PodSpec and associated objects (volumes, resource claims, etc.) and determine the "workload requirements" in terms of these "capabilities". Similarly, a node can publish the "node capabilities". The scheduler could then take these into consideration during scheduling.

This would fix a lot of issues we have right now, where a pod can land on a node that cannot run it, because it uses a feature that is not enabled on that node. It can also solve the issue of repelling workloads from nodes with scarce "capabilities".

cc @dchen1107 @liggitt @thockin @tallclair @samuelkarp

@johnbelamaric

This would fix a lot of issues we have right now, where a pod can land on a node that cannot run it, because it uses a feature that is not enabled on that node. It can also solve the issue of repelling workloads from nodes with scarce "capabilities".

And importantly, this can be 100% transparent to both users and cluster administrators, if done right.

@dom4ha commented Feb 28, 2025

There is a large class of things we use node labels and/or taints and tolerations for that could be automatically derived. We think of these as "workload requirements" and "node capabilities", with some of those node capabilities being deemed "scarce".

That sounds to me like the relation between ResourceClaims and ResourceSlices, if we again think about the node -> resource generalization. Do we need anything more than just extending ResourceSlice with whatever information is needed to take the right scheduling decision?

I guess that KEP-5027 + 5055: DRA: admin-controlled device attributes + taints could help in managing such dynamic information centrally.

@johnbelamaric

I don't see this as directly related to DRA. This is just something nodes can start publishing. And maybe a scheduler plugin can start interpreting. No user action needed, no DRA needed.

@ffromani
Contributor

@dom4ha thanks for tagging me here. I have been discussing this with a few folks. Taints and tolerations were originally intended more as an administrative function. We have been using them, along with node labels, as ways to guide scheduling when there is relevant scheduling information that is not actually known to the scheduler. Examples:

* Machine has a scarce resource (e.g., GPUs) and so we only want to run workloads on that machine that need that resource

* Machine has a particular CPU manager or topology manager policy configured

* Machine has a particular kubelet or runtime feature enabled

* A particular kubelet feature gate is enabled on that node

There is a large class of things we use node labels and/or taints and tolerations for that could be automatically derived. We think of these as "workload requirements" and "node capabilities", with some of those node capabilities being deemed "scarce".

We should be able to automatically inspect a PodSpec and associated objects (volumes, resource claims, etc.) and determine the "workload requirements" in terms of these "capabilities". Similarly, a node can publish the "node capabilities". The scheduler could then take these into consideration during scheduling.

This would fix a lot of issues we have right now, where a pod can land on a node that cannot run it, because it uses a feature that is not enabled on that node. It can also solve the issue of repelling workloads from nodes with scarce "capabilities".

I like this a lot. It will solve a lot of pain points we have and unlock possibilities. From my (unfortunately limited) experience with DRA, which was correctly mentioned a few times already, it seems to me this direction will compose nicely with the other ongoing efforts, DRA first and foremost.

It also seems a feasible pivot for the work at hand, which, it seems to me, would address most of the core concerns some reviewers (yours truly included) have highlighted.

Besides the much bigger scope, however, I can think of some concerns:

  1. I see the above proposal working with reasonable effort for classic devices (device plugins).
  2. Node tuning features, like the cpumanager policy, topology manager policy, and policy options, are not exposed by the nodes, nor probably should they be. Users (humans or machines) should not be concerned with "this node has topologyManagerPolicy=single-numa-node" but rather should care about "this node can allocate resources aligned to NUMA node boundaries". This means we will need a way to model these capabilities and to expose them. We did some experiments in the context of numa-aware scheduling.
  3. Some workload requirements are opaque; it is not easy, or even possible, to derive them by parsing the pod spec. Easy example: exclusive CPU allocation. The kubelet grants it if cpuManagerPolicy=static, the workload is Guaranteed QoS, and it requests integer CPUs. Problem: this is implicit. There's no explicit requirement in the podSpec. An unaware user can create a podSpec with cpu request == cpu limit == 4 (say) that doesn't actually need exclusive CPUs and can happily run without them. These requirements would then need to be made explicit, which would likely require pod spec changes? (There are more examples in this vein; see the sketch below.)
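To illustrate point 3, a sketch of how that implicit requirement would have to be inferred today, assuming the static CPU manager policy semantics (Guaranteed QoS with an integer CPU request equal to the limit):

```go
// Sketch of inferring the implicit "exclusive CPUs" requirement mentioned in
// point 3: under cpuManagerPolicy=static, a Guaranteed-QoS container with an
// integer CPU request equal to its limit gets exclusive CPUs, but nothing in
// the PodSpec says so explicitly.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// containerWouldGetExclusiveCPUs approximates the kubelet's static-policy rule
// for a single container; real QoS is computed over the whole pod.
func containerWouldGetExclusiveCPUs(c v1.Container) bool {
	req, hasReq := c.Resources.Requests[v1.ResourceCPU]
	lim, hasLim := c.Resources.Limits[v1.ResourceCPU]
	if !hasReq || !hasLim || req.Cmp(lim) != 0 {
		return false // not Guaranteed for CPU
	}
	return lim.Value()*1000 == lim.MilliValue() // integer number of CPUs
}

func main() {
	c := v1.Container{
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("4")},
			Limits:   v1.ResourceList{v1.ResourceCPU: resource.MustParse("4")},
		},
	}
	fmt.Println(containerWouldGetExclusiveCPUs(c)) // true, yet the intent stays implicit
}
```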

My 2c: I believe "just" exposing node capabilities in a more structured way can unlock possibilities even without further changes (e.g. to podSpec), and it seems to me this would integrate nicely with both this work and the other ongoing initiatives. Would this be a good first step? I feel this is worthy of a Kubernetes KEP.

@johnbelamaric

My 2c: I believe "just" exposing node capabilities in a more structured way can unlock possibilities even without further changes (e.g. to podSpec), and it seems to me this would integrate nicely with both this work and the other ongoing initiatives. Would this be a good first step? I feel this is worthy of a Kubernetes KEP.

Yes, I think a K8s KEP is warranted, and I know @dchen1107 and others are supportive. It would probably be a sig-scheduling KEP with sig-node participating (or vice versa, but I think that's the right ownership). Do you want to lead it, @ffromani?

Your points are really interesting. I don't know what to do about implicit things. But we can work it out in the KEP!

@samuelkarp
Member

There is a large class of things we use node labels and/or taints and tolerations for that could be automatically derived. We think of these as "workload requirements" and "node capabilities", with some of those node capabilities being deemed "scarce".

We should be able to automatically inspect a PodSpec and associated objects (volumes, resource claims, etc.) and determine the "workload requirements" in terms of these "capabilities". Similarly, a node can publish the "node capabilities". The scheduler could then take these into consideration during scheduling.

This would fix a lot of issues we have right now, where a pod can land on a node that cannot run it, because it uses a feature that is not enabled on that node. It can also solve the issue of repelling workloads from nodes with scarce "capabilities".

For what it's worth, this is roughly how ECS handles the challenge of heterogeneous nodes (disclosure: I was part of designing this around 2015, before I was involved with Kubernetes). The way it works is as follows:

  • The ECS agent (equivalent of the Kubelet) runs logic to detect supported capabilities at startup. These are based on the version of the ECS agent (so, unconditional), agent configuration, container runtime version (Docker, and looking at which API versions are supported by the installed Docker daemon), devices (GPU, etc), and kernel version.
  • The supported capabilities are registered with the ECS control plane as "attributes"
  • When a task definition (rough equivalent of a Pod spec template) is registered, the ECS control plane computes a set of implied required attributes based on the content of the task definition
  • At placement time, the scheduler computes which nodes have the required attributes present and limits placement to those nodes

ECS has a bit of a simpler problem to solve (much more limited feature set compared to Kubernetes), but I do generally think having the node advertise its capabilities and allowing the scheduler to match based on that makes sense.

@dom4ha commented Mar 4, 2025

  • When a task definition (rough equivalent of a Pod spec template) is registered, the ECS control plane computes a set of implied required attributes based on the content of the task definition

I think this is a very important step to keep scheduling fast. I guess QoS could also be taken into consideration in this process.

@johnbelamaric

cc @mrunalp

This is the issue I mentioned in response to your question in SIG Arch.

@Huang-Wei
Contributor

Carrying over my previous comment at #788 (comment)

IMHO having a separate plugin is sort of overkill. If we don't want non-GPU workloads to land on GPU nodes, we can simply taint the GPU nodes, and only GPU workloads (with the proper tolerations) can land on them.

In a real-world case, GPU nodes will be tainted anyway, so compared to introducing a plugin, isn't the lightweight solution to control which workloads can or cannot carry tolerations?
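For comparison with the plugin, the taint/toleration pairing being described looks roughly like this; the taint key used here is a hypothetical placeholder:

```go
// Minimal sketch of the taint/toleration pairing described above. The taint
// key "example.com/gpu" is a hypothetical placeholder; real clusters typically
// use a vendor- or platform-specific key.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Cluster-admin side: taint applied to every GPU node so that ordinary
	// pods are repelled (equivalent to
	// `kubectl taint nodes <node> example.com/gpu=present:NoSchedule`).
	taint := v1.Taint{
		Key:    "example.com/gpu",
		Value:  "present",
		Effect: v1.TaintEffectNoSchedule,
	}

	// Workload side: toleration that GPU pods (or a mutating webhook acting on
	// their behalf) must carry to be admitted onto the tainted nodes.
	toleration := v1.Toleration{
		Key:      "example.com/gpu",
		Operator: v1.TolerationOpEqual,
		Value:    "present",
		Effect:   v1.TaintEffectNoSchedule,
	}

	fmt.Printf("taint: %+v\ntoleration: %+v\n", taint, toleration)
}
```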

@ffromani
Contributor

My 2c: I believe "just" exposing node capabilities in a more structured way can unlock possibilities even without further changes (e.g. to podSpec), and it seems to me this would integrate nicely with both this work and the other ongoing initiatives. Would this be a good first step? I feel this is worthy of a Kubernetes KEP.

Yes, I think a K8s KEP is warranted, and I know @dchen1107 and others are supportive. It would probably be a sig-scheduling KEP with sig-node participating (or vice versa, but I think that's the right ownership). Do you want to lead it, @ffromani?

Your points are really interesting. I don't know what to do about implicit things. But we can work it out in the KEP!

I'd love to help but need to carve out time. I'll update here and keep commenting.

@dom4ha commented Apr 11, 2025

After a brief look, a similar idea is discussed in kubernetes/kubernetes#131208.

@ffromani commented May 5, 2025

xref: kubernetes/kubernetes#66525

@ffromani commented May 5, 2025

My 2c: I believe "just" exposing node capabilities in a more structured way can unlock possibilities even without further changes (e.g. to podSpec), and it seems to me this would integrate nicely with both this work and the other ongoing initiatives. Would this be a good first step? I feel this is worthy of a Kubernetes KEP.

Yes, I think a K8s KEP is warranted, and I know @dchen1107 and others are supportive. It would probably be a sig-scheduling KEP with sig-node participating (or vice versa, but I think that's the right ownership). Do you want to lead it, @ffromani?
Your points are really interesting. I don't know what to do about implicit things. But we can work it out in the KEP!

I'd love to help but need to carve out time. I'll update here and keep commenting.

I'm happy to correct myself: I will have bandwidth to help here. I'll sync up with relevant folks ASAP.

@Zeel-Patel
Author

Hi folks, does this KEP need any more changes to get approved?
@ffromani @dom4ha @Huang-Wei @samuelkarp

@johnbelamaric

I am proposing a generalized approach to this problem, relying on the existing taints & tolerations functionality. See kubernetes/enhancements#5282

@Zeel-Patel
Author

@johnbelamaric, since it is a similar design, can I be one of the contributors for this?

@johnbelamaric

@johnbelamaric, since it is a similar design, can I be one of the contributors for this?

Yes for sure, we would love to have your participation. Personally, I think it should be done in-tree as part of the existing taints & tolerations plugin, but that's a decision for the SIG. If the SIG agrees, this will need a core K8s KEP based on the k/enhancements issue.

Labels
cncf-cla: yes · do-not-merge/release-note-label-needed · lifecycle/stale · ok-to-test · size/L