Conversation

@leoryu

@leoryu leoryu commented Jan 9, 2025

Fixes #1239

Description
At present, topology spread constraint handling in Karpenter has three problems:

  1. Karpenter injects volume nodeAffinity info into the pod, and nodes that are not compatible with the injected nodeAffinity are ignored, which breaks the topology spread constraints.
  2. When Karpenter counts domains, existing nodes that do not host a pod for the related domain are not counted, which causes some domains to be missing from topology spread calculations. This has been fixed in https://github.com/kubernetes-sigs/karpenter/pull/852/files#diff-17989e9be7eab8ef904a0cd783153c32ac0abed4d5f7c0544673360c0e8027a7R338
  3. In topology spread calculations, Karpenter chooses a single, random min-count domain from the eligible domains as the requirement, but an instance in this domain may not be compatible with the pod's volume requirement(s).

The major changes in this PR are as follows:

  1. Handle pod volume requirements independently.
  2. Include all existing nodes' domains in topology spread calculations.
  3. Add all candidate domains to topology spread constraints when the pod has volume requirement(s).
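The first of these changes can be sketched as a side table of volume-derived requirements keyed by pod, instead of mutating the pod spec. This is a minimal sketch with simplified stand-in types and a hypothetical helper name; the real code uses `*corev1.Pod` and `corev1.NodeSelectorRequirement`.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types used in the PR; the real code
// maps *corev1.Pod to []corev1.NodeSelectorRequirement.
type Pod struct{ Name string }

type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

// volumeRequirementsForPods (hypothetical name) mirrors the PR's idea:
// instead of injecting volume nodeAffinity into the pod spec, keep the
// requirements in a side table keyed by pod so topology spread logic can
// consult them separately.
func volumeRequirementsForPods(pods []*Pod, zoneOf map[string]string) map[*Pod][]NodeSelectorRequirement {
	reqs := map[*Pod][]NodeSelectorRequirement{}
	for _, p := range pods {
		if zone, ok := zoneOf[p.Name]; ok {
			reqs[p] = []NodeSelectorRequirement{{
				Key:      "topology.kubernetes.io/zone",
				Operator: "In",
				Values:   []string{zone},
			}}
		} else {
			reqs[p] = nil // no zonal volume: pod carries no extra requirement
		}
	}
	return reqs
}

func main() {
	a, b := &Pod{Name: "with-pvc"}, &Pod{Name: "stateless"}
	reqs := volumeRequirementsForPods([]*Pod{a, b}, map[string]string{"with-pvc": "us-east-1a"})
	fmt.Println(len(reqs[a]), len(reqs[b])) // prints "1 0"
}
```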

How was this change tested?
make presubmit
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: leoryu
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 9, 2025
@k8s-ci-robot
Contributor

Hi @leoryu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 9, 2025
@leoryu leoryu force-pushed the fix-incorrect-topology-spread-constraints-with-zonal-volume branch 5 times, most recently from c355d14 to a993af1 Compare January 12, 2025 02:56
@coveralls

coveralls commented Jan 12, 2025

Pull Request Test Coverage Report for Build 13786982879

Details

  • 90 of 94 (95.74%) changed or added relevant lines in 7 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.2%) to 81.678%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/controllers/provisioning/scheduling/topologygroup.go | 52 | 53 | 98.11%
pkg/controllers/provisioning/scheduling/volumetopology.go | 3 | 4 | 75.0%
pkg/controllers/provisioning/scheduling/existingnode.go | 6 | 8 | 75.0%

Files with Coverage Reduction | New Missed Lines | %
pkg/test/expectations/expectations.go | 2 | 95.0%

Totals Coverage Status
Change from base Build 13775162610: 0.2%
Covered Lines: 9638
Relevant Lines: 11800

💛 - Coveralls

@leoryu leoryu force-pushed the fix-incorrect-topology-spread-constraints-with-zonal-volume branch 5 times, most recently from 1dbfa0a to 420766f Compare January 12, 2025 14:49
@leoryu leoryu changed the title [WIP]fix: Fix topology spread constraints with zonal volume fix: Fix topology spread constraints with zonal volume Jan 13, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 13, 2025
@leoryu
Author

leoryu commented Jan 13, 2025

@jmdeal @engedaam @tallaxes @jonathan-innis @njtran hi, can you help review this PR?

@engedaam
Contributor

/assign @jmdeal

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2025
@codeeong

codeeong commented Mar 7, 2025

Hi, any plans to release this soon? I am experiencing this issue as well 🙏

@leoryu
Author

leoryu commented Mar 11, 2025

Hi, any plans to release this soon? I am experiencing this issue as well 🙏

As of now, no one has reviewed the code. I have forked this repo in my project, but that is not what I wanted.

@leoryu leoryu force-pushed the fix-incorrect-topology-spread-constraints-with-zonal-volume branch from e464fbf to cec1814 Compare March 11, 2025 11:31
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2025
@leoryu
Author

leoryu commented Mar 11, 2025

@jmdeal Hi, I have resolved the conflicts in this PR, please help review it. We really want this issue to be fixed.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ap-elmo

ap-elmo commented May 12, 2025

Just ran into this issue as well, would be great to get a fix! :D

Member

@jmdeal jmdeal left a comment


Sorry, this slipped my notice; I missed the GitHub notification and have a relatively deep backlog, and this just got popped off it. I would encourage reaching out in the #karpenter-dev channel on the Kubernetes Slack if you're having trouble getting traction on a PR; there's far less noise there than on GitHub.

func (p *Provisioner) injectVolumeTopologyRequirements(ctx context.Context, pods []*corev1.Pod) []*corev1.Pod {
var schedulablePods []*corev1.Pod
func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
var schedulablePods = make(map[*corev1.Pod][]corev1.NodeSelectorRequirement)
Member


nit: this project uses this style for map initialization. Also, I think this name is more representative of what we're storing.

Suggested change
var schedulablePods = make(map[*corev1.Pod][]corev1.NodeSelectorRequirement)
podVolumeRequirements := map[*corev1.Pod][]corev1.NodeSelectorRequirement{}
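For reference, the two initialization styles in the suggestion are behaviorally identical: both produce an empty, non-nil map. A minimal sketch illustrating the style nit, not project code:

```go
package main

import "fmt"

func main() {
	// make(...) and a composite literal both yield an empty, non-nil map;
	// the project's style preference is the literal form.
	viaMake := make(map[string][]int)
	viaLiteral := map[string][]int{}
	fmt.Println(viaMake == nil, viaLiteral == nil, len(viaMake), len(viaLiteral)) // prints "false false 0 0"
}
```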


func (p *Provisioner) injectVolumeTopologyRequirements(ctx context.Context, pods []*corev1.Pod) []*corev1.Pod {
var schedulablePods []*corev1.Pod
func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
Member


We're not really converting anything here, right? We're just creating a mapping between pods and their volume requirements. I think something along these lines is more accurate.

Suggested change
func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
func (p *Provisioner) volumeRequirementsForPods(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {

stateNodes []*state.StateNode
// podVolumeRequirements links volume requirements to pods. This is used so we
// can track the volume requirements in simulate scheduler
podVolumeRequirements map[*corev1.Pod][]corev1.NodeSelectorRequirement
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use the pod's UID as the key here rather than a pointer to the pod object. We still use the pod object as a key elsewhere in the project, but we've moved to pod UID here (see excludedPods) and I think it would be wise for us to move to using it elsewhere when possible since we don't need to worry about copies preventing us from indexing.
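The reviewer's point about UID keys can be illustrated in isolation: a copy of a pod struct has a new address but the same UID, so a UID-keyed map still finds the entry after copying while a pointer-keyed map does not. The types below are simplified stand-ins for `types.UID` and `corev1.Pod`; this is a sketch, not the project's code.

```go
package main

import "fmt"

// UID stands in for k8s.io/apimachinery's types.UID (a string alias).
type UID string

type Pod struct {
	UID  UID
	Name string
}

func main() {
	// Keying by UID instead of *Pod means a copied pod object still
	// resolves to the same entry; two distinct pointers to equal pods
	// would otherwise index different map slots.
	original := Pod{UID: "abc-123", Name: "web"}
	copied := original // a copy: different address, same UID

	byUID := map[UID]string{original.UID: "tracked"}
	_, okByUID := byUID[copied.UID]

	byPtr := map[*Pod]string{&original: "tracked"}
	_, okByPtr := byPtr[&copied]

	fmt.Println(okByUID, okByPtr) // prints "true false"
}
```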

// these are the pods that we intend to schedule, so if they are currently in the cluster we shouldn't count them for
// topology purposes
for _, p := range pods {
for p := range podsVolumeRequirements {
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we iterating over the pods stored as keys in podVolumeRequirements rather than all pods? I believe this should be the same set in this implementation, but the intention is to exclude all pods we're attempting to schedule, not just those which have an associated volume requirement. Even if it's the same in practice today, this obfuscates intent.

errs := t.updateInverseAffinities(ctx)
for i := range pods {
errs = multierr.Append(errs, t.Update(ctx, pods[i]))
for p := range podsVolumeRequirements {
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here - we should still be using pods rather than the pods stored as keys in the podVolumeRequirements. Let me know if this is intentional and I'm missing the rationale, but as far as I can tell we should be updating the topology for all pods we're attempting to schedule.

Author

@leoryu leoryu Jul 16, 2025


@jmdeal Since we need to know whether the pod has volume requirements, pods was replaced by podVolumeRequirements. Please check line 238.

// If there are no eligible domains, we return a `DoesNotExist` requirement, implying that we could not satisfy the topologySpread requirement.
// nolint:gocyclo
func (t *TopologyGroup) nextDomainTopologySpread(pod *corev1.Pod, podDomains, nodeDomains *scheduling.Requirement) *scheduling.Requirement {
func (t *TopologyGroup) nextDomainTopologySpread(pod *corev1.Pod, podDomains, nodeDomains *scheduling.Requirement, hasVolumeRequirement bool) *scheduling.Requirement {
Member


This is my high-level understanding of this change, correct me if I'm mistaken:

  • When the pod we're attempting to find the next domain for has a volume induced requirement, we want to return the set of possible domains rather than a single domain.
  • We block domains if there are no existing nodes for the domain which are also compatible with the node filter. However, if there are no nodes in a domain (regardless of nodeFilter compatibility), we will add them to the set of compatible domains if the pod has a volume induced requirement.
  • If the pod does not have volume induced requirements, we continue to use the single minDomain for the new requirement, the difference being we've no longer evaluated the blocked domains.

I have a few questions about this change:

  • I don't understand the purpose of blocked domains, are you able to elaborate?
  • Why do we need to return the full set of compatible domains? We're going to constrain the NodeClaim to the pod's volume requirement after anyway, why not just do that first since it's the only eligible domain?

I'll also note that the currently implementation will almost certainly not be acceptable from a performance standpoint - we're iterating over all of the state nodes in the cluster each time we attempt to constrain a pod to adhere to a topology spread constraint. This is one of the most important "hot-paths" in the scheduler, and we need to be extremely careful with what we add here. As far as I can tell this logic shouldn't be necessary if we inject the volume requirements into the nodeclaim requirements before performing the topology checks.
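One way to read the performance concern: any per-pod scan over all state nodes turns the topology check into O(pods x nodes), while an index built once per scheduling loop makes the per-pod step a map lookup. Below is a hedged sketch of that idea with hypothetical names; it is not the actual Karpenter state structure.

```go
package main

import "fmt"

// StateNode is a hypothetical stand-in for Karpenter's cluster state node.
type StateNode struct{ Zone string }

// buildDomainIndex builds the domain -> node-count index once per scheduling
// loop, so each pod's topology check reads a precomputed count instead of
// iterating every state node in the cluster.
func buildDomainIndex(nodes []*StateNode) map[string]int {
	idx := map[string]int{}
	for _, n := range nodes {
		idx[n.Zone]++
	}
	return idx
}

func main() {
	nodes := []*StateNode{{"us-east-1a"}, {"us-east-1a"}, {"us-east-1b"}}
	idx := buildDomainIndex(nodes)
	fmt.Println(idx["us-east-1a"], idx["us-east-1b"], idx["us-east-1c"]) // prints "2 1 0"
}
```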

Member


I will also add that, though this shouldn't cause issues in the common case where we're constraining the NodeClaim to a single zone to adhere to pod requirements, it is an invariant that we need to constrain requirements to a single value for TSC. We only record the domain to the topology group if there's a single domain, and failing to do so can result in Karpenter overprovisioning as it doesn't respect TSC.
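The single-value invariant described here follows from how topology spread normally picks its next domain: of the eligible domains, only the one with the minimum pod count becomes the requirement, so the resulting requirement has exactly one value. A deterministic sketch of that selection with a hypothetical helper, simplified from the real nextDomainTopologySpread:

```go
package main

import "fmt"

// minDomain returns the single domain with the fewest pods, breaking ties
// lexicographically so the choice is deterministic in this sketch (the real
// scheduler picks among min-count domains differently).
func minDomain(counts map[string]int) string {
	best, bestCount := "", int(^uint(0)>>1) // start at max int
	for d, c := range counts {
		if c < bestCount || (c == bestCount && d < best) {
			best, bestCount = d, c
		}
	}
	return best
}

func main() {
	counts := map[string]int{"us-east-1a": 3, "us-east-1b": 1, "us-east-1c": 1}
	fmt.Println(minDomain(counts)) // prints "us-east-1b"
}
```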

Author

@leoryu leoryu Jul 16, 2025


@jmdeal

  1. The purpose of blocked domains.

In the skew and domain min-count calculations, I block the empty domains whose existing nodes all fail to match the pod, because in real scheduling these domains are not considered for topology spread.

  2. Why do we need to return the full set of compatible domains?

At present, Karpenter injects volume nodeAffinity info into the pod, so many nodes that are not compatible with the added nodeAffinity are ignored by Karpenter, even though they are not ignored in real scheduling.

To fix this, I decided to handle pod volume requirements independently, so this function should return all compatible domains and let the pod's volume requirements choose the suitable domain.

Author


@jmdeal As for performance, I have no better idea for fixing this issue. But I think accuracy is more important than performance.

podVolumeRequirements := scheduling.NewNodeSelectorRequirements(volumeRequirements...)
// Check Pod Volume Requirements
if err = nodeClaimRequirements.Compatible(podVolumeRequirements, scheduling.AllowUndefinedWellKnownLabels); err != nil {
return err
Member


We should wrap this error; it will be propagated to the user if we fail to schedule the pod.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2025
@alex-berger

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 10, 2025

Labels

  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.
  • needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zonal Volume Requirements Break Topology Spread Constraints

9 participants