VPA: Implement in-place updates support #7673

Open: wants to merge 22 commits into base: master

Conversation

maxcao13
Member

@maxcao13 maxcao13 commented Jan 7, 2025

What type of PR is this?

Depends on: #7813
Depends on: #7896
Depends on: #7901

/kind feature
/kind api-change

NOTE: This PR is being broken into smaller reviewable pieces and then merged into a feature branch.

See below for the list that will be updated as PRs are created:

What this PR does / why we need it:

This PR is an attempt to implement VPA in-place vertical scaling according to AEP-4016. It uses the VPA updater to actuate recommendations by sending resize patch requests to pods. This relies on in-place resize, which is enabled by the InPlacePodVerticalScaling feature gate in k8s 1.27.0 (alpha) and above, or by its eventual graduation.
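To make the "resize patch request" idea concrete, here is a minimal, stdlib-only sketch of the kind of JSON Patch payload such a request could carry. The helper name `buildResizePatch` and the `patchOp` type are hypothetical and are not taken from this PR. The sketch also assumes the container already has CPU and memory requests set, since a `replace` operation fails on a missing path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// buildResizePatch returns a JSON Patch that replaces the CPU and memory
// requests of the container at containerIndex. Hypothetical helper: the
// PR's real patch-building code may look different.
func buildResizePatch(containerIndex int, cpu, memory string) ([]byte, error) {
	base := fmt.Sprintf("/spec/containers/%d/resources/requests", containerIndex)
	ops := []patchOp{
		{Op: "replace", Path: base + "/cpu", Value: cpu},
		{Op: "replace", Path: base + "/memory", Value: memory},
	}
	return json.Marshal(ops)
}

func main() {
	patch, err := buildResizePatch(0, "500m", "256Mi")
	if err != nil {
		panic(err)
	}
	// Assumption: on a cluster that exposes pods/resize, a client would send
	// this payload as a JSON Patch to the pod's "resize" subresource.
	fmt.Println(string(patch))
}
```

Whether the container restarts as a result still depends on its resize policy; the payload itself cannot request a restart-free resize.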

It currently includes some e2e tests based on the AEP, but we will probably need more.

Also introduces feature-gates to VPA, and includes the first feature-gate InPlaceOrRecreate which allows the use of InPlaceOrRecreate update mode.
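As a rough illustration of what a `--feature-gates=` style argument involves, the following stdlib-only sketch parses a comma-separated `Name=bool` list against a set of known gates. It is not the PR's implementation (the commits in this PR refer to the `k8s.io/component-base` feature gate packages). The function name `parseFeatureGates` and the default of `false` for `InPlaceOrRecreate` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureGates applies overrides such as "InPlaceOrRecreate=true" on top
// of the defaults in known. Unknown gate names and malformed pairs are errors.
// Hypothetical helper for illustration only.
func parseFeatureGates(arg string, known map[string]bool) (map[string]bool, error) {
	gates := make(map[string]bool, len(known))
	for name, def := range known {
		gates[name] = def
	}
	if strings.TrimSpace(arg) == "" {
		return gates, nil
	}
	for _, pair := range strings.Split(arg, ",") {
		name, value, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		if _, exists := known[name]; !exists {
			return nil, fmt.Errorf("unknown feature gate %q", name)
		}
		enabled, err := strconv.ParseBool(value)
		if err != nil {
			return nil, fmt.Errorf("invalid value for %q: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

func main() {
	// Assumption: the gate is off unless explicitly enabled.
	defaults := map[string]bool{"InPlaceOrRecreate": false}
	gates, err := parseFeatureGates("InPlaceOrRecreate=true", defaults)
	if err != nil {
		panic(err)
	}
	fmt.Println(gates["InPlaceOrRecreate"])
}
```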

This PR is a continuation of #6652 started by @jkyros with a cleaner git commit history.

Which issue(s) this PR fixes:

Fixes #4016

Special notes for your reviewer:

Notable general areas of concern:

  • See this comment about the potential need for a third containerResize policy that currently does not exist in the k8s api.
    • We have amended the AEP since we cannot guarantee no container restarts. This is no longer a problem.
  • Needs more attention on how disruptive and disruption-free updates should affect, or be affected by, PodDisruptionBudgets.
    • This can be covered in future enhancements.
  • We more or less hacked the in-place logic into the eviction limiter. Maybe it should have been its own thing, or maybe we need a "disruption limiter", but in-place updates and evictions need to know about each other because they share the same "disruption limit".
  • For now, there are many TODOs littered throughout the code which need attention from reviewers/maintainers. Many of them exist because of design decisions I probably shouldn't make on my own. I resolved some of John's TODOs, but he still has relevant comments that need to be addressed as well. I am using the TODOs as the "special notes for your reviewer" section; if people would like a comment somewhere which lays them all out nicely, I'm more than happy to make one.
    • I've kept the TODOs to a minimum, but I'm using the remaining ones as review markers to draw attention to parts of the code that need community discussion.
  • Requires a lot more unit testing, but since a lot of the architecture may change, I chose to delay writing tests until I get feedback.
  • There are additional comments by John in the earlier commit descriptions which can aid review.
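The "disruption limiter" idea from the notes above can be sketched as a single budget that both evictions and in-place resizes draw from. This is only an illustration of the shared-limit concept; the type and method names below are hypothetical and do not reflect how `pods_eviction_restriction.go` is actually structured.

```go
package main

import "fmt"

// disruptionLimiter is a hypothetical type that tracks one disruption budget
// shared by evictions and in-place resizes.
type disruptionLimiter struct {
	tolerance int // maximum number of pods that may be disrupted at once
	evicted   int // pods currently being evicted
	resizing  int // pods currently being resized in place
}

// canDisrupt reports whether one more pod may be disrupted, counting both
// kinds of disruption against the same limit.
func (l *disruptionLimiter) canDisrupt() bool {
	return l.evicted+l.resizing < l.tolerance
}

// tryEvict records an eviction if the shared budget allows it.
func (l *disruptionLimiter) tryEvict() bool {
	if !l.canDisrupt() {
		return false
	}
	l.evicted++
	return true
}

// tryInPlaceResize records an in-place resize if the shared budget allows it.
func (l *disruptionLimiter) tryInPlaceResize() bool {
	if !l.canDisrupt() {
		return false
	}
	l.resizing++
	return true
}

func main() {
	l := &disruptionLimiter{tolerance: 2}
	fmt.Println(l.tryInPlaceResize()) // first disruption fits the budget
	fmt.Println(l.tryEvict())         // second disruption fits the budget
	fmt.Println(l.tryEvict())         // budget exhausted by the two above
}
```

Keeping both counters in one type is what lets an in-place resize block an eviction and vice versa, which is the coupling the note describes.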

Does this PR introduce a user-facing change?

In-place VPA scaling is implemented. It can be enabled by setting `updateMode` on your VPA to `InPlaceOrRecreate` (depends on the `InPlacePodVerticalScaling` feature gate being enabled or having graduated).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[AEP] https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support
Depends on: 
[KEP] https://github.com/kubernetes/enhancements/tree/25e53c93e4730146e4ae2f22d0599124d52d02e7/keps/sig-node/1287-in-place-update-pod-resources

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Jan 7, 2025
@k8s-ci-robot k8s-ci-robot added area/vertical-pod-autoscaler needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 7, 2025
@k8s-ci-robot
Contributor

Hi @maxcao13. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 7, 2025
@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@adrianmoisey
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 7, 2025
Member

@omerap12 omerap12 left a comment

This is just my first review after going through all the changes. I will go over it multiple times, but these are my initial comments for now.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 15, 2025
@maxcao13
Member Author

Appreciate the review! I will respond to the other comments tomorrow, just wanted to get the easy stuff out of the way and less cluttered.

@omerap12
Member

Appreciate the review! I will respond to the other comments tomorrow, just wanted to get the easy stuff out of the way and less cluttered.

Sure, take your time

@maxcao13
Member Author

This is an important note/question. It may just be me being confused about the AEP details, but AFAIK there is no way for a user to guarantee that a resize request will be disruption-free. We can only guarantee that a resize will be disruptive (by configuring the container resize restart policy). So how can we dictate the resizing type based on the conditions in the AEP? For example, the AEP states:

InPlaceOnly and InPlaceOrRecreate will attempt to apply a disruption-free update in place if it meets at least one of the following conditions:

  • Quick OOM,
  • Outside recommended range,
  • Significant change.

InPlaceOnly and InPlaceOrRecreate will attempt to apply updates that are not disruption-free in place under the same conditions that apply to updates in the Recreate mode.

Since the conditions are different, how can we actually make sure the first set of actions will actually be disruption free? I think this problem requires the need for a third resizePolicy MustNotRestart as mentioned in the AEP. But that does not exist right now. That will also help this test actually make sense.
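For reference, the "at least one of the following conditions" rule quoted from the AEP could be expressed as a simple predicate. This is a sketch only: the struct fields, the millicore units, and the `threshold` parameter are hypothetical, and nothing here solves the actual problem raised above, namely that the updater cannot guarantee the resulting resize is disruption-free.

```go
package main

import (
	"fmt"
	"math"
)

// containerState holds illustrative inputs; all field names are hypothetical.
type containerState struct {
	requestMilli int64 // current CPU request in millicores
	lowerMilli   int64 // recommendation lower bound
	targetMilli  int64 // recommendation target
	upperMilli   int64 // recommendation upper bound
	quickOOM     bool  // container was OOM-killed shortly after starting
}

// outsideRecommendedRange reports whether the request falls outside the
// [lower bound, upper bound] interval of the recommendation.
func outsideRecommendedRange(c containerState) bool {
	return c.requestMilli < c.lowerMilli || c.requestMilli > c.upperMilli
}

// significantChange reports whether the relative difference between target
// and request is at least threshold (for example 0.1 for 10%).
func significantChange(c containerState, threshold float64) bool {
	if c.requestMilli == 0 {
		return c.targetMilli != 0
	}
	diff := math.Abs(float64(c.targetMilli-c.requestMilli)) / float64(c.requestMilli)
	return diff >= threshold
}

// shouldAttemptInPlace is true when at least one of the three quoted
// conditions holds.
func shouldAttemptInPlace(c containerState, threshold float64) bool {
	return c.quickOOM || outsideRecommendedRange(c) || significantChange(c, threshold)
}

func main() {
	c := containerState{requestMilli: 100, lowerMilli: 80, targetMilli: 150, upperMilli: 200}
	fmt.Println(shouldAttemptInPlace(c, 0.1))
}
```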

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 20, 2025
@maxcao13
Member Author

maxcao13 commented Jan 30, 2025

This will be affected by KEP-1287 update in 1.33: kubernetes/enhancements#5089

Additionally, we should probably consider feature gating this.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2025
Comment on lines +21 to +22
const (
	// VpaInPlaceUpdatedLabel is a label used by the vpa inplace updated annotation.
	VpaInPlaceUpdatedLabel = "vpaInPlaceUpdated"
)
Contributor

Is this a label or an annotation?

Member Author

@maxcao13 maxcao13 Feb 21, 2025

I copied the style from here:

VpaObservedContainersLabel = "vpaObservedContainers"

I'm not really sure what the intention behind the word "label" was, since it was written 6 years ago, but it probably makes more sense to name it VpaInPlaceUpdatedAnnotationKey or something.

Contributor

@sftim sftim left a comment

There are quite a few TODO comments. Is this PR ready for merge?

maxcao13 added 2 commits March 7, 2025 12:45
Because of kubernetes#7813, this commit reverts a lot of the changes that introduced logic for actuating in-place updates based on the containerResizePolicy.

Signed-off-by: Max Cao <[email protected]>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 10, 2025
@maxcao13
Member Author

maxcao13 commented Mar 11, 2025

This is probably ready for review now. Note:

  • There are no unit tests for
    • vertical-pod-autoscaler/pkg/updater/inplace
    • no added unit tests for /vertical-pod-autoscaler/pkg/updater/eviction/pods_eviction_restriction.go
    • This was due to me not being sure enough of the implementation details and if all that logic should exist or be in those files. If we are okay with it, I can add unit tests later in sync with the rest of the review.
  • E2E tests have not been modified to take in-place feature gates into account (I only modified the local e2e testing scripts)
    • I'm just not exactly sure how we want to do the testing here. e.g. (separate optional test just to test feature gate features, running all feature gated tests no matter what on PRs, etc.)
  • There was originally some functionality in the PR to take containerStatuses into account when calculating podPriority. I removed it so that the scope is reduced. But I think this would be a good future enhancement (along with solely using containerStatuses for recommendations instead of the spec since apparently they should be the new source of truth according to the KEP). Would love to know what you guys think.
  • This PR includes the InPlaceVerticalScaling (open to name change) VPA-level feature gate, and introduces feature gates to VPA as part of AEP-4016: details about upgrade and downgrade, and compatibility with Kubernetes #7901
  • Any TODOs left in the code are either to be changed after 1.33 is released with the KEP updates, or they are points of contention that I think need to be discussed in review.

/retitle VPA: Implement in-place updates support

tagging reviewers:
/cc @voelzmo @raywainman @adrianmoisey @omerap12

Thanks everyone!

EDIT March 12:

  • After discussing with @adrianmoisey, I added 348b5ee which adds the feature-gate to the admission-controller as well and prevents InPlaceOrRecreate VPA creations unless the feature-gate is on, and warns you about it.
  • Also, I separated the service from the deploy/admission-controller-deployment.yaml into its own file in order to use kubectl to apply feature-gates more cleanly before actually applying the feature-gated deployments to the cluster (you can see what I'm referring to here: https://github.com/kubernetes/autoscaler/pull/7673/files#diff-8cfbbc3ba84d8d5169661c1677dd50f9edac6d82ec403e2e07b1c3955b4c40ccR90)
    • FYI, this is likely why e2e-vpa-full is failing: I didn't change any ci-e2e scripts to account for this
    • The e2e feature-gate testing procedure is subject to change as part of the review, as mentioned before, so I've not changed the e2e-ci scripts for now
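The admission-controller behaviour described in the edit above (rejecting `InPlaceOrRecreate` VPAs while the feature gate is off) amounts to a small validation check. The sketch below shows that check in isolation; the function name `validateUpdateMode` and the error text are hypothetical and are not copied from commit 348b5ee.

```go
package main

import "fmt"

// updateModeInPlaceOrRecreate is the update mode guarded by the feature gate.
const updateModeInPlaceOrRecreate = "InPlaceOrRecreate"

// validateUpdateMode rejects the InPlaceOrRecreate update mode unless the
// corresponding feature gate is enabled. Hypothetical helper: the real
// webhook validates a full VPA object rather than a bare string.
func validateUpdateMode(mode string, inPlaceOrRecreateEnabled bool) error {
	if mode == updateModeInPlaceOrRecreate && !inPlaceOrRecreateEnabled {
		return fmt.Errorf("update mode %q requires the InPlaceOrRecreate feature gate to be enabled", mode)
	}
	return nil
}

func main() {
	// With the gate off, the request is denied with an explanatory error.
	fmt.Println(validateUpdateMode(updateModeInPlaceOrRecreate, false))
	// With the gate on, the same request passes validation.
	fmt.Println(validateUpdateMode(updateModeInPlaceOrRecreate, true))
}
```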

@k8s-ci-robot k8s-ci-robot changed the title [WIP] VPA: Implement in-place updates support VPA: Implement in-place updates support Mar 11, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 11, 2025
@omerap12
Member

Fantastic work, Max! I'll try to set aside some time this weekend to review it.

Allows feature gates to be added to the updater binary with the `--feature-gates=` arg.

Signed-off-by: Max Cao <[email protected]>
Updates updater unit tests. Also bump k8s.io/component-base to 0.32.2 in order to import k8s.io/component-base/featuregate/testing package. Also introduce a clock to the updater object in order to easily mock it during tests.

Signed-off-by: Max Cao <[email protected]>
Reverts the change that takes containerStatuses resources into account when calculating update priority. This change, along with allowing VPA to use containerStatuses when calculating recommendations themselves, should instead be included in a future enhancement and potentially feature-gated.

Signed-off-by: Max Cao <[email protected]>
Modified the deploy-for-e2e-locally.sh script to patch the updater deployment using a FEATURE_GATES env var. This commit does not change CI yet.

Signed-off-by: Max Cao <[email protected]>
Adds logic to the vpa admission webhook to deny requests that create VPAs with the InPlaceOrRecreate update mode when the feature gate is not enabled.
Also adds admission e2e logic to wait for the vpa-webhook to be registered before starting the test.

Signed-off-by: Max Cao <[email protected]>
@maxcao13
Member Author

Latest force push just renames the feature gate to InPlaceOrRecreate as a result of #7901

Contributor

@raywainman raywainman left a comment

Couple quick questions, need to take more time to dive into things deeply again.

- ""
resources:
- pods/resize
- pods
Contributor

curious why we still need patch on the pod itself, isn't pods/resize sufficient?

Member Author

@maxcao13 maxcao13 Mar 14, 2025

It should be sufficient for resizing, but in order to patch the annotations onto the pod itself, I need that rule as well. Previously, when the admission-controller added its own annotation (vpaObservedContainers), it could just include the annotation in the webhook pod mangling, but the updater can't do that on its own.

https://github.com/maxcao13/autoscaler/blob/maxcao13-inplace/vertical-pod-autoscaler/pkg/updater/eviction/pods_eviction_restriction.go#L544

Whether we want to use this new annotation or not is a different story though. It's purely for cosmetic reasons as noted in this comment: #7673 (comment), but the vpaObservedContainers annotation is actually used in GetUpdatePriority. Curious what people think.
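To illustrate why patch access on the pod itself is needed, here is a stdlib-only sketch of an annotation patch that targets the pod's metadata rather than the resize subresource. The helper names and the annotation value `"true"` are assumptions for the example. The `add` operation also assumes the pod already has an annotations map, since JSON Patch cannot add a member to a missing parent object.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// escapeJSONPointer escapes a key for use as a JSON Pointer (RFC 6901)
// segment: "~" becomes "~0" and "/" becomes "~1".
func escapeJSONPointer(key string) string {
	return strings.NewReplacer("~", "~0", "/", "~1").Replace(key)
}

// annotationPatch builds a JSON Patch that sets one annotation on a pod.
// Hypothetical helper, not the code linked above.
func annotationPatch(key, value string) ([]byte, error) {
	op := patchOp{
		Op:    "add",
		Path:  "/metadata/annotations/" + escapeJSONPointer(key),
		Value: value,
	}
	return json.Marshal([]patchOp{op})
}

func main() {
	// "vpaInPlaceUpdated" is the key from the constant under review; the
	// value "true" is an assumption made for this example.
	patch, err := annotationPatch("vpaInPlaceUpdated", "true")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

Because this patch modifies `metadata`, it goes to the pod resource itself, which is why an RBAC rule for `pods/resize` alone would not cover it.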

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  InPlacePodVerticalScaling: true
Contributor

should we update with the new name?

Member Author

@maxcao13 maxcao13 Mar 14, 2025

That feature gate is for the unfortunately named KEP-1287 API to work, not our VPA one

@maxcao13
Member Author

maxcao13 commented Mar 14, 2025

After conversing with @raywainman, I'm going to split this PR up into separate PRs like this:

-> New Feature flags functionality: #7932
-> API changes: #7933
-> Changes to dev scripts to allow testing: #7934
-> Changes in admission controller.
-> Changes in updater.
-> E2E tests.

and then merge them into this new feature branch: https://github.com/kubernetes/autoscaler/tree/in-place-updates

This is to make reviewing easier and make better incremental progress.

Anyone is still welcome to review this PR until that migration is done, but we won't merge this anymore. I'll link the new PRs in this comment as they are created, and in the PR description as well.

At the end, we will merge that feature branch into main.

@maxcao13
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 16, 2025
Labels
area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support in-place Pod vertical scaling in VPA
8 participants