
Conversation

@natasha41575
Contributor

@natasha41575 natasha41575 commented Jan 5, 2026

What type of PR is this?

/kind cleanup
/kind feature

What this PR does / why we need it:

Create an admission plugin to perform the OS and node capacity checks for pod resizes.

The last commit removes the OS feasibility check from the kubelet: the OS label on the node should be reliable, up-to-date, and (if I understand correctly) immutable by the time the node is ready to have pods scheduled to it. But I would still like a second opinion that this is safe to remove.

Which issue(s) this PR is related to:

#135341

Special notes for your reviewer:

I wasn't sure whether it is safe to remove the node capacity check from the kubelet. If a node is downsized, could there be a race window where the node has finished downsizing and the kubelet has restarted, but the node allocatable in the status has not yet been updated to reflect the smaller size?

Does this PR introduce a user-facing change?

For pod resizes requested on nodes where the resize request exceeds the node's allocatable capacity, or where the node runs an OS that does not support resize, the request will now fail at admission rather than being marked Infeasible in the pod status later.

/hold
for alignment with kubernetes/autoscaler#8818

/sig node
/assign @tallclair

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 5, 2026
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 5, 2026
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jan 5, 2026
@k8s-ci-robot k8s-ci-robot requested review from deads2k and jpbetz January 5, 2026 21:46
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 5, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: natasha41575
Once this PR has been reviewed and has the lgtm label, please assign deads2k, derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jan 5, 2026
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jan 5, 2026
@natasha41575 natasha41575 changed the title [InPlacePodVerticalScaling] move trivial feasibility checks to an admission plugin [InPlacePodVerticalScaling] create an admission plugin to perform the OS and node capacity checks Jan 5, 2026
@omerap12 omerap12 (Member) left a comment

From a VPA perspective, I have concerns about the error handling here (please correct me if I’m missing something).

VPA needs to be able to programmatically distinguish between different failure modes, such as:

  • infeasible resizes (requests exceed node allocatable),
  • unsupported platforms (e.g., non-Linux nodes),
  • transient errors, etc.

Each of these cases should be handled differently. However, they all currently return admission.NewForbidden() with only different error messages. This forces consumers to parse error strings, which is fragile - any change in the message text could cause VPA to break silently.

p.SetReadyFunc(nodeInformer.Informer().HasSynced)
}

// SetFeatures sets the feature gates for the plugin.
Member

nit:

Suggested change
-// SetFeatures sets the feature gates for the plugin.
+// InspectFeatureGates sets the feature gates for the plugin.

@natasha41575
Contributor Author

natasha41575 commented Jan 6, 2026

> From a VPA perspective, I have concerns about the error handling here (please correct me if I’m missing something).
>
> VPA needs to be able to programmatically distinguish between different failure modes, such as:
>
> • infeasible resizes (requests exceed node allocatable),
> • unsupported platforms (e.g., non-Linux nodes),
> • transient errors, etc.
>
> Each of these cases should be handled differently. However, they all currently return admission.NewForbidden() with only different error messages. This forces consumers to parse error strings, which is fragile - any change in the message text could cause VPA to break silently.

Today, all "infeasible" resizes -- i.e. the request exceeds node allocatable, the node is on an unsupported platform, or the node has a feature enabled that is incompatible with resize (such as swap or the static CPU/memory manager) -- are surfaced in the API the same way: through a PodResizePending condition in the status with Reason set to Infeasible, with the only differentiation between the failure modes being a human-readable message in the Message field. How does VPA distinguish them today?

(I'll also think about what options we have to make it easier to programmatically distinguish them.)

@omerap12
Member

omerap12 commented Jan 6, 2026

>> From a VPA perspective, I have concerns about the error handling here (please correct me if I’m missing something). […]
>
> Today, all "infeasible" resizes -- i.e. exceeding node allocatable, the node is on an unsupported platform, or the node has a feature enabled that is not compatible with resize like swap or static cpu/memory manager -- are all surfaced in the API the same way through a PodResizePending condition in the status with the Reason set to Infeasible and the only differentiation between the different failure modes being a human-readable message in the Message field. How does VPA distinguish them today?
>
> (I'll also think on what options we have to make it easier to programmatically distinguish them)

It doesn't. The VPA checks whether the pod has a PodResizePending condition with Reason set to PodReasonInfeasible and evicts based on that (we have some logic to skip evictions in some cases, but that's irrelevant here).
But we can't follow the same pattern in admission, and we want to have some logic based on that error.
/cc @maxcao13

@k8s-ci-robot k8s-ci-robot requested a review from maxcao13 January 6, 2026 15:42
@natasha41575
Contributor Author

natasha41575 commented Jan 6, 2026

>>> From a VPA perspective, I have concerns about the error handling here (please correct me if I’m missing something). […]
>>
>> Today, all "infeasible" resizes […] How does VPA distinguish them today?
>
> It doesn't, the VPA checks if the pod is in PodResizePending state with a Reason set to PodReasonInfeasible and evicts based on that (we have some logic to skip evictions in some cases but that's irrelevant). But we can't follow the same pattern in admission and we want to have some logic based on that error. /cc @maxcao13

Understood.

Transient errors would not return admission.NewForbidden(). VPA should be able to filter those out programmatically, so from that perspective this PR does not introduce any behavior worse than what exists today.

Distinguishing the node capacity check from the other feasibility checks (like OS) is a net-new feature request that I don't necessarily think this change needs to be blocked on... but let me circle back on this. I have an idea but I'm not 100% sure about it so I need to double check and might need to ask some other folks about it too. I see how this could be useful for InPlace mode.

ETA: I pushed a change. See my comment below: #136043 (comment)

-	return admission.NewForbidden(a, err)
+	statusErr := admission.NewForbidden(a, err).(*apierrors.StatusError)
+	statusErr.ErrStatus.Details.Causes = append(statusErr.ErrStatus.Details.Causes, metav1.StatusCause{
+		Type: ReasonNodeCapacity,
Contributor Author

@natasha41575 natasha41575 Jan 6, 2026

@omerap12 I added a CauseType here. Programmatically, clients can do this:

	// latestPod holds the desired resize; clientset, ctx, ns, and podName are set up elsewhere.
	_, err = clientset.CoreV1().Pods(ns).UpdateResize(ctx, podName, latestPod, metav1.UpdateOptions{})
	if err != nil {
		// Guard against a nil Details before walking the causes.
		if statusErr, ok := err.(*apierrors.StatusError); ok && statusErr.ErrStatus.Details != nil {
			for _, cause := range statusErr.ErrStatus.Details.Causes {
				fmt.Printf("Cause is: %s\n", cause.Type)
			}
		}
	}

I tried this out with a quick little Go script. Hope this solves your use case.

Member

Yup. I think that solves it :)
Thanks!
@adrianmoisey, we will incorporate this logic into the VPA, so I believe we are good to go, right?

Member

Yup, seems good to me.
We'll need to handle both old and new methods, though, since we can't guarantee which version of Kubernetes someone is running the VPA on.

Member

Already done :)

@k8s-ci-robot
Contributor

@natasha41575: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                     Commit   Required  Rerun command
pull-kubernetes-integration-go-compatibility  b472849  true      /test pull-kubernetes-integration-go-compatibility
pull-kubernetes-unit                          45c2557  true      /test pull-kubernetes-unit
pull-kubernetes-linter-hints                  45c2557  false     /test pull-kubernetes-linter-hints
pull-kubernetes-verify                        45c2557  true      /test pull-kubernetes-verify
pull-kubernetes-unit-windows-master           45c2557  false     /test pull-kubernetes-unit-windows-master
pull-kubernetes-e2e-capz-windows-master       45c2557  false     /test pull-kubernetes-e2e-capz-windows-master
pull-kubernetes-integration                   45c2557  true      /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.



Labels

area/apiserver area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
