Conversation

@leoryu

@leoryu leoryu commented Jan 9, 2025

Fixes #1239

Description
At present, topology spread constraint handling in Karpenter has three problems:

  1. Karpenter injects volume nodeAffinity info into the pod, and nodes that are not compatible with the injected nodeAffinity are ignored, which breaks the topology spread constraints.
  2. When Karpenter counts domains, existing nodes that do not host a pod for the related domain are not counted, which causes some domains to be missing from topology spread calculations. This has been fixed in https://github.com/kubernetes-sigs/karpenter/pull/852/files#diff-17989e9be7eab8ef904a0cd783153c32ac0abed4d5f7c0544673360c0e8027a7R338
  3. In topology spread calculations, Karpenter chooses a single, random min-count domain from the eligible domains as the requirement, but an instance in this domain may not be compatible with the pod's volume requirement(s).

The major changes in this PR are as follows:

  1. Handle pod volume requirements independently.
  2. Include all existing nodes' domains in topology spread calculations.
  3. Add all candidate domains to topology spread constraints when the pod has volume requirement(s).
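The first of these changes can be sketched as a side table of volume-derived requirements keyed by pod, instead of mutating the pod spec. This is a minimal sketch with simplified stand-in types and a hypothetical helper name; the real code uses `*corev1.Pod` and `corev1.NodeSelectorRequirement`.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types used in the PR; the real code
// maps *corev1.Pod to []corev1.NodeSelectorRequirement.
type Pod struct{ Name string }

type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

// volumeRequirementsForPods (hypothetical name) mirrors the PR's idea:
// instead of injecting volume nodeAffinity into the pod spec, keep the
// requirements in a side table keyed by pod so topology spread logic can
// consult them separately.
func volumeRequirementsForPods(pods []*Pod, zoneOf map[string]string) map[*Pod][]NodeSelectorRequirement {
	reqs := map[*Pod][]NodeSelectorRequirement{}
	for _, p := range pods {
		if zone, ok := zoneOf[p.Name]; ok {
			reqs[p] = []NodeSelectorRequirement{{
				Key:      "topology.kubernetes.io/zone",
				Operator: "In",
				Values:   []string{zone},
			}}
		} else {
			reqs[p] = nil // no zonal volume: pod carries no extra requirement
		}
	}
	return reqs
}

func main() {
	a, b := &Pod{Name: "with-pvc"}, &Pod{Name: "stateless"}
	reqs := volumeRequirementsForPods([]*Pod{a, b}, map[string]string{"with-pvc": "us-east-1a"})
	fmt.Println(len(reqs[a]), len(reqs[b])) // prints "1 0"
}
```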

How was this change tested?
make presubmit
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: leoryu
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 9, 2025
@k8s-ci-robot
Contributor

Hi @leoryu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 9, 2025
@leoryu leoryu force-pushed the fix-incorrect-topology-spread-constraints-with-zonal-volume branch 5 times, most recently from c355d14 to a993af1 Compare January 12, 2025 02:56
@coveralls

coveralls commented Jan 12, 2025

Pull Request Test Coverage Report for Build 13786982879

Details

  • 90 of 94 (95.74%) changed or added relevant lines in 7 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.2%) to 81.678%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/controllers/provisioning/scheduling/topologygroup.go | 52 | 53 | 98.11%
pkg/controllers/provisioning/scheduling/volumetopology.go | 3 | 4 | 75.0%
pkg/controllers/provisioning/scheduling/existingnode.go | 6 | 8 | 75.0%

Files with Coverage Reduction | New Missed Lines | %
pkg/test/expectations/expectations.go | 2 | 95.0%

Totals Coverage Status
Change from base Build 13775162610: 0.2%
Covered Lines: 9638
Relevant Lines: 11800

💛 - Coveralls

@leoryu leoryu force-pushed the fix-incorrect-topology-spread-constraints-with-zonal-volume branch 5 times, most recently from 1dbfa0a to 420766f Compare January 12, 2025 14:49
@leoryu leoryu changed the title [WIP]fix: Fix topology spread constraints with zonal volume fix: Fix topology spread constraints with zonal volume Jan 13, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 13, 2025
@leoryu
Author

leoryu commented Jan 13, 2025

@jmdeal @engedaam @tallaxes @jonathan-innis @njtran hi, can you help review this PR?

@engedaam
Contributor

/assign @jmdeal

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2025
@codeeong

codeeong commented Mar 7, 2025

Hi, any plans to release this soon? I am experiencing this issue as well 🙏

@leoryu
Author

leoryu commented Mar 11, 2025

Hi, any plans to release this soon? I am experiencing this issue as well 🙏

As of now, no one has reviewed the code. I have forked this repo in my project, but that is not what I wanted.

@leoryu leoryu force-pushed the fix-incorrect-topology-spread-constraints-with-zonal-volume branch from e464fbf to cec1814 Compare March 11, 2025 11:31
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2025
@leoryu
Author

leoryu commented Mar 11, 2025

@jmdeal Hi, I have resolved the conflicts in this PR, please help review it. We really want this issue to be fixed.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ap-elmo

ap-elmo commented May 12, 2025

Just ran into this issue as well, would be great to get a fix! :D

Member

@jmdeal jmdeal left a comment


Sorry, this slipped my notice; I missed the GitHub notification and have a relatively deep backlog, and this just got popped off it. I would encourage reaching out in the #karpenter-dev channel on the Kubernetes Slack if you're having trouble getting traction on a PR; there's far less noise there than on GitHub.

func (p *Provisioner) injectVolumeTopologyRequirements(ctx context.Context, pods []*corev1.Pod) []*corev1.Pod {
var schedulablePods []*corev1.Pod
func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
var schedulablePods = make(map[*corev1.Pod][]corev1.NodeSelectorRequirement)
Member


nit: this project uses this style for map initialization. Also, I think this name is more representative of what we're storing.

Suggested change
var schedulablePods = make(map[*corev1.Pod][]corev1.NodeSelectorRequirement)
podVolumeRequirements := map[*corev1.Pod][]corev1.NodeSelectorRequirement{}
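For reference, the two initialization styles in the suggestion are behaviorally identical: both produce an empty, non-nil map. A minimal sketch illustrating the style nit, not project code:

```go
package main

import "fmt"

func main() {
	// make(...) and a composite literal both yield an empty, non-nil map;
	// the project's style preference is the literal form.
	viaMake := make(map[string][]int)
	viaLiteral := map[string][]int{}
	fmt.Println(viaMake == nil, viaLiteral == nil, len(viaMake), len(viaLiteral)) // prints "false false 0 0"
}
```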


func (p *Provisioner) injectVolumeTopologyRequirements(ctx context.Context, pods []*corev1.Pod) []*corev1.Pod {
var schedulablePods []*corev1.Pod
func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
Member


We're not really converting anything here, right? We're just creating a mapping between pods and their volume requirements. I think something along these lines is more accurate.

Suggested change
func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
func (p *Provisioner) volumeRequirementsForPods(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {

stateNodes []*state.StateNode
// podVolumeRequirements links volume requirements to pods. This is used so we
// can track the volume requirements in simulate scheduler
podVolumeRequirements map[*corev1.Pod][]corev1.NodeSelectorRequirement
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use the pod's UID as the key here rather than a pointer to the pod object. We still use the pod object as a key elsewhere in the project, but we've moved to pod UID here (see excludedPods) and I think it would be wise for us to move to using it elsewhere when possible since we don't need to worry about copies preventing us from indexing.
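The reviewer's point about UID keys can be illustrated in isolation: a copy of a pod struct has a new address but the same UID, so a UID-keyed map still finds the entry after copying while a pointer-keyed map does not. The types below are simplified stand-ins for `types.UID` and `corev1.Pod`; this is a sketch, not the project's code.

```go
package main

import "fmt"

// UID stands in for k8s.io/apimachinery's types.UID (a string alias).
type UID string

type Pod struct {
	UID  UID
	Name string
}

func main() {
	// Keying by UID instead of *Pod means a copied pod object still
	// resolves to the same entry; two distinct pointers to equal pods
	// would otherwise index different map slots.
	original := Pod{UID: "abc-123", Name: "web"}
	copied := original // a copy: different address, same UID

	byUID := map[UID]string{original.UID: "tracked"}
	_, okByUID := byUID[copied.UID]

	byPtr := map[*Pod]string{&original: "tracked"}
	_, okByPtr := byPtr[&copied]

	fmt.Println(okByUID, okByPtr) // prints "true false"
}
```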

// these are the pods that we intend to schedule, so if they are currently in the cluster we shouldn't count them for
// topology purposes
for _, p := range pods {
for p := range podsVolumeRequirements {
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we iterating over the pods stored as keys in podVolumeRequirements rather than all pods? I believe this should be the same set in this implementation, but the intention is to exclude all pods we're attempting to schedule, not just those which have an associated volume requirement. Even if it's the same in practice today, this obfuscates intent.

errs := t.updateInverseAffinities(ctx)
for i := range pods {
errs = multierr.Append(errs, t.Update(ctx, pods[i]))
for p := range podsVolumeRequirements {
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here - we should still be using pods rather than the pods stored as keys in the podVolumeRequirements. Let me know if this is intentional and I'm missing the rationale, but as far as I can tell we should be updating the topology for all pods we're attempting to schedule.

Author

@leoryu leoryu Jul 16, 2025


@jmdeal Since we need to know whether the pod has volume requirements, pods was replaced by podVolumeRequirements. Please check line 238.

// If there are no eligible domains, we return a `DoesNotExist` requirement, implying that we could not satisfy the topologySpread requirement.
// nolint:gocyclo
func (t *TopologyGroup) nextDomainTopologySpread(pod *corev1.Pod, podDomains, nodeDomains *scheduling.Requirement) *scheduling.Requirement {
func (t *TopologyGroup) nextDomainTopologySpread(pod *corev1.Pod, podDomains, nodeDomains *scheduling.Requirement, hasVolumeRequirement bool) *scheduling.Requirement {
Member


This is my high-level understanding of this change, correct me if I'm mistaken:

  • When the pod we're attempting to find the next domain for has a volume induced requirement, we want to return the set of possible domains rather than a single domain.
  • We block domains if there are no existing nodes for the domain which are also compatible with the node filter. However, if there are no nodes in a domain (regardless of nodeFilter compatibility), we will add them to the set of compatible domains if the pod has a volume induced requirement.
  • If the pod does not have volume induced requirements, we continue to use the single minDomain for the new requirement, the difference being we've no longer evaluated the blocked domains.

I have a few questions about this change:

  • I don't understand the purpose of blocked domains, are you able to elaborate?
  • Why do we need to return the full set of compatible domains? We're going to constrain the NodeClaim to the pod's volume requirement after anyway, why not just do that first since it's the only eligible domain?

I'll also note that the currently implementation will almost certainly not be acceptable from a performance standpoint - we're iterating over all of the state nodes in the cluster each time we attempt to constrain a pod to adhere to a topology spread constraint. This is one of the most important "hot-paths" in the scheduler, and we need to be extremely careful with what we add here. As far as I can tell this logic shouldn't be necessary if we inject the volume requirements into the nodeclaim requirements before performing the topology checks.
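One way to read the performance concern: any per-pod scan over all state nodes turns the topology check into O(pods x nodes), while an index built once per scheduling loop makes the per-pod step a map lookup. Below is a hedged sketch of that idea with hypothetical names; it is not the actual Karpenter state structure.

```go
package main

import "fmt"

// StateNode is a hypothetical stand-in for Karpenter's cluster state node.
type StateNode struct{ Zone string }

// buildDomainIndex builds the domain -> node-count index once per scheduling
// loop, so each pod's topology check reads a precomputed count instead of
// iterating every state node in the cluster.
func buildDomainIndex(nodes []*StateNode) map[string]int {
	idx := map[string]int{}
	for _, n := range nodes {
		idx[n.Zone]++
	}
	return idx
}

func main() {
	nodes := []*StateNode{{"us-east-1a"}, {"us-east-1a"}, {"us-east-1b"}}
	idx := buildDomainIndex(nodes)
	fmt.Println(idx["us-east-1a"], idx["us-east-1b"], idx["us-east-1c"]) // prints "2 1 0"
}
```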

Member


I will also add that, though this shouldn't cause issues in the common case where we're constraining the NodeClaim to a single zone to adhere to pod requirements, it is an invariant that we need to constrain requirements to a single value for TSC. We only record the domain to the topology group if there's a single domain, and failing to do so can result in Karpenter overprovisioning as it doesn't respect TSC.
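The single-value invariant described here follows from how topology spread normally picks its next domain: of the eligible domains, only the one with the minimum pod count becomes the requirement, so the resulting requirement has exactly one value. A deterministic sketch of that selection with a hypothetical helper, simplified from the real nextDomainTopologySpread:

```go
package main

import "fmt"

// minDomain returns the single domain with the fewest pods, breaking ties
// lexicographically so the choice is deterministic in this sketch (the real
// scheduler picks among min-count domains differently).
func minDomain(counts map[string]int) string {
	best, bestCount := "", int(^uint(0)>>1) // start at max int
	for d, c := range counts {
		if c < bestCount || (c == bestCount && d < best) {
			best, bestCount = d, c
		}
	}
	return best
}

func main() {
	counts := map[string]int{"us-east-1a": 3, "us-east-1b": 1, "us-east-1c": 1}
	fmt.Println(minDomain(counts)) // prints "us-east-1b"
}
```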

Author

@leoryu leoryu Jul 16, 2025


@jmdeal

  1. The purpose of blocked domains.

In the skew and domain min-count calculations, I block the empty domains whose existing nodes all fail to match the pod, because in real scheduling these domains are not considered for topology spread.

  2. Why do we need to return the full set of compatible domains?

At present, Karpenter injects volume nodeAffinity info into the pod, so many nodes that are not compatible with the added nodeAffinity are ignored by Karpenter, even though they are not ignored in real scheduling.

To fix this, I decided to handle pod volume requirements independently, so this function should return all compatible domains and let the pod's volume requirements choose the suitable domain.

Author


@jmdeal As for performance, I have no better idea for fixing this issue. But I think accuracy is more important than performance.

podVolumeRequirements := scheduling.NewNodeSelectorRequirements(volumeRequirements...)
// Check Pod Volume Requirements
if err = nodeClaimRequirements.Compatible(podVolumeRequirements, scheduling.AllowUndefinedWellKnownLabels); err != nil {
return err
Member


We should wrap this error; it will be propagated to the user if we fail to schedule the pod.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2025
@alex-berger

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 10, 2025

Labels

  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.
  • needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zonal Volume Requirements Break Topology Spread Constraints

9 participants