fix: Fix topology spread constraints with zonal volume #1907

base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: leoryu. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.

Hi @leoryu. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull Request Test Coverage Report for Build 13786982879

💛 - Coveralls
@jmdeal @engedaam @tallaxes @jonathan-innis @njtran Hi, can you help review this PR?
/assign @jmdeal
Hi, any plans to release this soon? I am experiencing this issue as well 🙏

As of now, no one has reviewed the code. I have forked this repo in my project, but this is not what I expected.
@jmdeal Hi, I have resolved the conflicts in this PR; please help review it. We really want this issue to be fixed.
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Just ran into this issue as well, would be good to get a fix! :D
jmdeal
left a comment
Sorry this slipped from notice; I missed the GitHub notification, have a relatively deep backlog, and only just popped this off. I would encourage reaching out in the #karpenter-dev channel on the Kubernetes Slack if you're having issues getting traction on a PR; there's far less noise there than on GitHub.
-func (p *Provisioner) injectVolumeTopologyRequirements(ctx context.Context, pods []*corev1.Pod) []*corev1.Pod {
-	var schedulablePods []*corev1.Pod
+func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
+	var schedulablePods = make(map[*corev1.Pod][]corev1.NodeSelectorRequirement)
nit: this project uses this style for map initialization. Also, I think this name is more representative of what we're storing.
-	var schedulablePods = make(map[*corev1.Pod][]corev1.NodeSelectorRequirement)
+	podVolumeRequirements := map[*corev1.Pod][]corev1.NodeSelectorRequirement{}
-func (p *Provisioner) injectVolumeTopologyRequirements(ctx context.Context, pods []*corev1.Pod) []*corev1.Pod {
-	var schedulablePods []*corev1.Pod
+func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
We're not really converting anything here, right? We're just creating a mapping between pods and their volume requirements. I think something along these lines is more accurate.
-func (p *Provisioner) convertToPodVolumeRequirements(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
+func (p *Provisioner) volumeRequirementsForPods(ctx context.Context, pods []*corev1.Pod) map[*corev1.Pod][]corev1.NodeSelectorRequirement {
 	stateNodes []*state.StateNode
+	// podVolumeRequirements links volume requirements to pods. This is used so we
+	// can track the volume requirements in simulate scheduler
+	podVolumeRequirements map[*corev1.Pod][]corev1.NodeSelectorRequirement
I would use the pod's UID as the key here rather than a pointer to the pod object. We still use the pod object as a key elsewhere in the project, but we've moved to pod UID here (see excludedPods) and I think it would be wise for us to move to using it elsewhere when possible since we don't need to worry about copies preventing us from indexing.
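For illustration, here is a minimal, self-contained sketch (assumed names and values, not the PR's actual code) of keying by pod UID rather than by pod pointer; with a UID key, a deep copy of the pod still resolves to the same entry:

```go
// Sketch only: keying volume requirements by pod UID instead of *corev1.Pod.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "demo", UID: types.UID("1234")}}

	// Keyed by UID: the entry stays reachable even from a copy of the pod object.
	podVolumeRequirements := map[types.UID][]corev1.NodeSelectorRequirement{
		pod.UID: {{
			Key:      corev1.LabelTopologyZone,
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{"us-east-1a"},
		}},
	}

	podCopy := pod.DeepCopy() // a *corev1.Pod-keyed map would miss this copy
	fmt.Println(podVolumeRequirements[podCopy.UID])
}
```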
 	// these are the pods that we intend to schedule, so if they are currently in the cluster we shouldn't count them for
 	// topology purposes
-	for _, p := range pods {
+	for p := range podsVolumeRequirements {
Why are we iterating over the pods stored as keys in podVolumeRequirements rather than all pods? I believe this should be the same set in this implementation, but the intention is to exclude all pods we're attempting to schedule, not just those which have an associated volume requirement. Even if it's the same in practice today, this obfuscates intent.
 	errs := t.updateInverseAffinities(ctx)
-	for i := range pods {
-		errs = multierr.Append(errs, t.Update(ctx, pods[i]))
+	for p := range podsVolumeRequirements {
Same comment here - we should still be using pods rather than the pods stored as keys in the podVolumeRequirements. Let me know if this is intentional and I'm missing the rationale, but as far as I can tell we should be updating the topology for all pods we're attempting to schedule.
@jmdeal Since we need to know whether the pod has volume requirements, pods is replaced by podVolumeRequirements here. Please check line 238.
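If it helps reconcile the two comments, here is a rough sketch (hypothetical names, not the actual implementation) of iterating over all pods while treating the map purely as a lookup for the "does this pod have volume requirements" check:

```go
// Sketch only: iterate all pods being scheduled; consult the map for volume requirements.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func trackTopology(pods []*corev1.Pod, podVolumeRequirements map[types.UID][]corev1.NodeSelectorRequirement) {
	for _, p := range pods { // all pods we're attempting to schedule, not just the map keys
		reqs, hasVolumeRequirement := podVolumeRequirements[p.UID]
		fmt.Printf("pod %s: hasVolumeRequirement=%t, requirements=%v\n", p.Name, hasVolumeRequirement, reqs)
	}
}

func main() {
	withVolume := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "with-volume", UID: types.UID("a")}}
	withoutVolume := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "without-volume", UID: types.UID("b")}}
	trackTopology([]*corev1.Pod{withVolume, withoutVolume}, map[types.UID][]corev1.NodeSelectorRequirement{
		withVolume.UID: {{Key: corev1.LabelTopologyZone, Operator: corev1.NodeSelectorOpIn, Values: []string{"us-east-1a"}}},
	})
}
```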
 // If there are no eligible domains, we return a `DoesNotExist` requirement, implying that we could not satisfy the topologySpread requirement.
 // nolint:gocyclo
-func (t *TopologyGroup) nextDomainTopologySpread(pod *corev1.Pod, podDomains, nodeDomains *scheduling.Requirement) *scheduling.Requirement {
+func (t *TopologyGroup) nextDomainTopologySpread(pod *corev1.Pod, podDomains, nodeDomains *scheduling.Requirement, hasVolumeRequirement bool) *scheduling.Requirement {
This is my high-level understanding of this change, correct me if I'm mistaken:
- When the pod we're attempting to find the next domain for has a volume-induced requirement, we want to return the set of possible domains rather than a single domain.
- We block domains if there are no existing nodes for the domain which are also compatible with the node filter. However, if there are no nodes in a domain (regardless of nodeFilter compatibility), we will add them to the set of compatible domains if the pod has a volume-induced requirement.
- If the pod does not have volume-induced requirements, we continue to use the single minDomain for the new requirement, the difference being that we no longer evaluate the blocked domains.
I have a few questions about this change:
- I don't understand the purpose of blocked domains; are you able to elaborate?
- Why do we need to return the full set of compatible domains? We're going to constrain the NodeClaim to the pod's volume requirement afterwards anyway, so why not just do that first since it's the only eligible domain?
I'll also note that the current implementation will almost certainly not be acceptable from a performance standpoint: we're iterating over all of the state nodes in the cluster each time we attempt to constrain a pod to adhere to a topology spread constraint. This is one of the most important "hot paths" in the scheduler, and we need to be extremely careful with what we add here. As far as I can tell this logic shouldn't be necessary if we inject the volume requirements into the nodeclaim requirements before performing the topology checks.
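For what it's worth, a purely conceptual sketch of that ordering in plain Go (hypothetical names and zones, not karpenter's internals): constrain the candidate zones by the volume-induced requirement first, then run the usual single-domain topology spread selection over what remains, so no per-node scan is needed at spread time.

```go
// Conceptual sketch: intersect candidate zones with the pod's volume-induced zones
// before the topology spread decision, instead of scanning state nodes during it.
package main

import "fmt"

// intersect keeps only the domains that appear in both slices.
func intersect(domains, allowed []string) []string {
	allowedSet := map[string]struct{}{}
	for _, d := range allowed {
		allowedSet[d] = struct{}{}
	}
	var out []string
	for _, d := range domains {
		if _, ok := allowedSet[d]; ok {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	nodeClaimZones := []string{"us-east-1a", "us-east-1b", "us-east-1c"} // candidate domains
	volumeZones := []string{"us-east-1b"}                                // zone(s) required by the pod's volume

	// Constrain first; the normal min-count selection then runs over what's left.
	eligible := intersect(nodeClaimZones, volumeZones)
	fmt.Println("domains eligible for the spread decision:", eligible)
}
```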
I will also add that, though this shouldn't cause issues in the common case where we're constraining the NodeClaim to a single zone to adhere to pod requirements, it is an invariant that we need to constrain requirements to a single value for TSC. We only record the domain to the topology group if there's a single domain, and failing to do so can result in Karpenter overprovisioning as it doesn't respect TSC.
- The purpose of blocked domains:
In the calculations of skew and domain min-count, I block the empty domains whose existing nodes all fail to match the pod, because in real scheduling these domains are not considered for topology spread.
- Why do we need to return the full set of compatible domains?
At present, Karpenter injects volume nodeAffinity info into the pod, so many nodes that are not compatible with the added nodeAffinity are ignored by Karpenter but not in real scheduling.
To fix this, I decided to handle pod volume requirements independently, so in this function we should return all compatible domains and let the pod's volume choose the suitable domain.
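To make the "blocked domain" idea concrete, here is a simplified, self-contained sketch (hypothetical types; the real change also feeds this into the skew and min-count calculations):

```go
// Simplified sketch: a domain whose existing nodes all fail the pod's node filter is
// blocked; a domain with no nodes at all stays eligible when the pod has a volume requirement.
package main

import "fmt"

type node struct {
	zone    string
	matches bool // whether this existing node passes the pod's node filter
}

func eligibleDomains(domains []string, nodes []node, hasVolumeRequirement bool) []string {
	nodesByZone := map[string][]node{}
	for _, n := range nodes {
		nodesByZone[n.zone] = append(nodesByZone[n.zone], n)
	}
	var eligible []string
	for _, d := range domains {
		existing := nodesByZone[d]
		if len(existing) == 0 {
			// No nodes yet in this domain; keep it so the pod's volume can still pick it.
			if hasVolumeRequirement {
				eligible = append(eligible, d)
			}
			continue
		}
		for _, n := range existing {
			if n.matches {
				eligible = append(eligible, d) // at least one matching node: not blocked
				break
			}
		}
	}
	return eligible
}

func main() {
	domains := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	nodes := []node{{zone: "us-east-1a", matches: true}, {zone: "us-east-1b", matches: false}}
	fmt.Println(eligibleDomains(domains, nodes, true)) // [us-east-1a us-east-1c]
}
```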
@jmdeal As for performance, I have no better idea for fixing this issue, but I think accuracy is more important than performance.
	podVolumeRequirements := scheduling.NewNodeSelectorRequirements(volumeRequirements...)
	// Check Pod Volume Requirements
	if err = nodeClaimRequirements.Compatible(podVolumeRequirements, scheduling.AllowUndefinedWellKnownLabels); err != nil {
		return err
We should wrap this error; it will be propagated out to the user if we fail to schedule the pod.
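A minimal sketch of the wrapping being suggested, using only the standard library (the message text and the wrapped error here are illustrative, not the project's actual strings):

```go
// Sketch only: wrap the compatibility error so the scheduling failure surfaced to the
// user says which check failed, while keeping the original error inspectable.
package main

import (
	"errors"
	"fmt"
)

var errIncompatible = errors.New("incompatible with requirement key topology.kubernetes.io/zone")

func checkPodVolumeRequirements() error {
	// Pretend the Compatible(...) call failed with errIncompatible.
	err := errIncompatible
	return fmt.Errorf("checking pod volume requirements, %w", err)
}

func main() {
	err := checkPodVolumeRequirements()
	fmt.Println(err)                             // checking pod volume requirements, incompatible with ...
	fmt.Println(errors.Is(err, errIncompatible)) // true: callers can still match the wrapped error
}
```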
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to its standard lifecycle rules (stale, then rotten, then closed after continued inactivity). You can mark this PR as fresh with /remove-lifecycle stale. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
Fixes #1239
Description
At present, topology spread constraints in Karpenter have 3 problems:

- When Karpenter counts domains, existing nodes that don't have a pod for the related domain are not counted, which causes some domains to be missing from the topology spread calculations. (Has been fixed in https://github.com/kubernetes-sigs/karpenter/pull/852/files#diff-17989e9be7eab8ef904a0cd783153c32ac0abed4d5f7c0544673360c0e8027a7R338)
- Karpenter picks the min-counts domain from the eligible domains as the requirement, but the instance with this domain may not be compatible with the volume requirement(s).

The major works of this PR are as follows:

- Add all existing nodes' domains to the topology spread calculations.

How was this change tested?

make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
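As added context for the problem described above, here is a self-contained illustrative sketch (hypothetical labels and zones) of the conflicting constraints from #1239: a pod that must spread across zones while its zonal volume effectively pins it to one zone.

```go
// Illustration only: a zone topology spread constraint alongside the zone node
// affinity that a pre-provisioned zonal volume effectively imposes on the pod.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "app", Labels: map[string]string{"app": "demo"}},
		Spec: corev1.PodSpec{
			// Spread replicas across zones...
			TopologySpreadConstraints: []corev1.TopologySpreadConstraint{{
				MaxSkew:           1,
				TopologyKey:       corev1.LabelTopologyZone,
				WhenUnsatisfiable: corev1.DoNotSchedule,
				LabelSelector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "demo"}},
			}},
			// ...while the zonal volume effectively adds this node affinity.
			Affinity: &corev1.Affinity{NodeAffinity: &corev1.NodeAffinity{
				RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
					NodeSelectorTerms: []corev1.NodeSelectorTerm{{
						MatchExpressions: []corev1.NodeSelectorRequirement{{
							Key:      corev1.LabelTopologyZone,
							Operator: corev1.NodeSelectorOpIn,
							Values:   []string{"us-east-1b"},
						}},
					}},
				},
			}},
		},
	}
	fmt.Printf("spread on %q, but the volume pins the pod to %v\n",
		pod.Spec.TopologySpreadConstraints[0].TopologyKey,
		pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0].MatchExpressions[0].Values)
}
```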