@omerap12
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements Helm-managed MutatingWebhookConfiguration with automatic TLS certificate generation using kube-webhook-certgen. This replaces the application's self-registration logic.

Which issue(s) this PR fixes:

Related to #8587

Special notes for your reviewer:

ingress-nginx is dead, so I’m not sure about the future of kube-webhook-certgen, which is part of the old nginx stack (registry.k8s.io/ingress-nginx/kube-webhook-certgen). Given that, should we still use it?

Right now, the hook simply creates a Secret containing all the certificates, CA bundles, and the mutating webhook configuration. To rotate or update those certificates, we would just need to add a Job that deletes the Secret beforehand.
This is just an initial proposal; I’d like to hear other opinions on this setup.

Also, I have noticed some RBAC problems so I fixed those.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/needs-area labels Nov 29, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omerap12

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 29, 2025
@k8s-ci-robot k8s-ci-robot added area/vertical-pod-autoscaler size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-area labels Nov 29, 2025
@omerap12
Member Author

/cc @adrianmoisey

Member

@iamzili iamzili left a comment

Hey Omer, I checked out your branch and I'm having trouble getting the admission controller to work. I would expect the following command to deploy everything properly:

helm upgrade vpa \
   /home/zili/Repos/autoscaler/vertical-pod-autoscaler/charts/vertical-pod-autoscaler \
   --install \
   --version 0.6.0 \
   --namespace vpa

What I have found so far is that the following settings seem to be incorrect:

  1. The --webhook-service value in the admission controller's Deployment.
  2. The service name in the MutatingWebhookConfiguration object.

I'm also seeing an error in the admission controller when it attempts to perform actuation:

2025/12/04 13:37:32 http: TLS handshake error from 10.244.0.1:25304: remote error: tls: bad certificate

metadata:
  name: {{ include "vertical-pod-autoscaler.admissionController.certGen.fullname" . }}
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade

Member

is there a specific reason why this resource (and I see the same pattern in several others, such as ClusterRoleBinding, Role, RoleBinding) defines both pre and post hooks? I think pre-install,pre-upgrade would be sufficient.

Member Author

Overall, you’re right - pre-install and pre-upgrade are enough, but it really depends on how we want to handle things going forward. As mentioned above, the hook currently creates a Secret containing all certificates, CA bundles, and the mutating webhook configuration. Because this Secret already exists during an upgrade, kube-webhook-certgen will not rotate the certificate values.

To address this, we may want to add post-install and post-upgrade hooks to delete the Secret, ensuring that on the next upgrade kube-webhook-certgen generates a fresh one.

Member

@iamzili iamzili Dec 4, 2025

But I assume using post-install and post-upgrade hooks to delete the Secret is still not a good idea: Helm executes post-install and post-upgrade hooks after all non-hook resources have been deployed to the cluster, so after a Helm install or upgrade there would be no Secret object in the cluster.

If we want to delete the certificate before running a Helm upgrade or install, then we need to do it in pre-install and pre-upgrade hooks, right?

Member

one more thing: kube-webhook-certgen generates certificates that expire after 100 years, so I assume we don't need to rotate them. We could just ignore the "secret already exists" log when kube-webhook-certgen create runs.
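
For reference, a minimal sketch of what restricting the certgen Job to the pre hooks could look like. This is a metadata fragment only (the Job spec is omitted); the Job name and the delete policy are illustrative assumptions, not taken from the chart in this PR. `helm.sh/hook` and `helm.sh/hook-delete-policy` are standard Helm annotations.

```yaml
# Hypothetical hook Job metadata; spec omitted for brevity.
apiVersion: batch/v1
kind: Job
metadata:
  name: vpa-admission-certgen   # hypothetical name, for illustration only
  annotations:
    # Run only before install/upgrade, so the Secret exists before the
    # admission controller Deployment is applied.
    "helm.sh/hook": pre-install,pre-upgrade
    # Assumed policy: delete the previous hook Job before creating a new
    # one, and clean up after a successful run.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```

With this layout, an upgrade re-runs the Job, and an existing Secret would simply trigger the "secret already exists" path mentioned above.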

@iamzili
Member

iamzili commented Dec 4, 2025

Also I'm not sure if we need to bump the chart version every time at this stage of the chart's development here:

vertical-pod-autoscaler/charts/vertical-pod-autoscaler/Chart.yaml

@omerap12
Member Author

omerap12 commented Dec 4, 2025

Thanks for the review! I'll check that out.

@omerap12
Member Author

omerap12 commented Dec 4, 2025

Yeah, we already discussed it. If I remember correctly, we have to do it because of pre-commit.
@adrianmoisey might remember.

@adrianmoisey
Member

I can't remember, your memory may be correct, since cluster-autoscaler does it.

However, I think that may be broken now?

@iamzili
Member

iamzili commented Dec 4, 2025

Let me add my thoughts regarding kube-webhook-certgen:

  1. Personally I think we can start using kube-webhook-certgen and see if it fully meets our needs (I believe it will). Since the project is becoming unmaintained soon, let's keep an eye on whether anyone decides to fork it and start maintaining it. Side note: based on the commit history, it seems the previous maintainers mostly performed Go version bumps.
  2. I would prefer not to roll our own solution for creating and renewing self-signed certificates like Kyverno does:
  3. I also checked how the folks at Gatekeeper handle certificate management, and they use the github.com/open-policy-agent/cert-controller library (which btw KEDA also uses).

@omerap12
Member Author

omerap12 commented Dec 5, 2025

I agree that we shouldn’t build our own mechanism for generating and renewing self-signed certificates.
KEDA’s (and similar) approaches rely on code-based solutions, but IMHO the better approach is to avoid handling this in code and instead delegate it to Helm, which can manage it for us - similar to what I attempted in this PR.

# Generate certificates using cert-gen job
generateCertificate: true

certGen:

Member

Can this be disabled?
I want to create my cert with cert-manager.

Member

I think you need to use generateCertificate: false then

Member

Ah, sorry. Missed that. I expected certGen.enabled=true

Member

Actually, it might be a good idea to move it below certGen, like certGen.enabled=true. I see this pattern around in Helm charts frequently.

what do you think @omerap12

Member Author

Yeah, agreed - I need to spend more time on this PR. Hopefully I’ll be able to over the weekend

Member Author

@monotek , adjusted.
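
For readers following along, a rough sketch of the values layout this thread converges on, with the toggle nested under certGen. Only keys that appear in this conversation are shown; the exact structure in the merged chart may differ, so treat this as an assumption rather than the chart's actual values.yaml.

```yaml
# values.yaml sketch; only keys mentioned in this thread are shown.
admissionController:
  certGen:
    # Set to false to skip the kube-webhook-certgen Job, for example when
    # the certificates are created with cert-manager instead.
    enabled: true
    image:
      repository: registry.k8s.io/ingress-nginx/kube-webhook-certgen
      # tag intentionally omitted; see the chart's own values.yaml
```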

@omerap12
Member Author

@iamzili, @monotek I have made some changes addressing both problems. Please check :)

Member

@iamzili iamzili left a comment

I would expect at least a message to be printed after helm install / helm upgrade when registerWebhook: true and admissionController.certGen.enabled: false are set, since the admission controller will not start.

I think pkg/admission-controller/gencerts.sh should be executed by the user. What else needs to be done?

image:
# admissionController.certGen.image.repository -- An image that contains certgen for creating certificates. Only used if admissionController.generateCertificate is true
repository: registry.k8s.io/ingress-nginx/kube-webhook-certgen
# admissionController.certGen.image.tag -- An image tag for the admissionController.certGen.image.repository image. Only used if admissionController.generateCertificate is true

Member

the admissionController.generateCertificate key no longer exists, so it should be removed from multiple files, including comments and README.md.

Member Author

Oh I forgot to update this as well, nice catch!

Member Author

Fixed.

@@ -0,0 +1,49 @@
{{- if and .Values.admissionController.enabled (not .Values.admissionController.registerWebhook) (include "vertical-pod-autoscaler.admissionController.webhook.upgradable" .) }}

Member

help me understand why the (include "vertical-pod-autoscaler.admissionController.webhook.upgradable" .) expression and the related logic in the _helpers.tpl file are required.

Based on my reading of _helpers.tpl, the intended goal is to upgrade only an object with a specific name and labels to address the following scenario:

  1. User creates a MutatingWebhookConfiguration object manually (or it was created by the admission-controller, i.e. --register-webhook=true)
  2. User deploys the helm chart, and the manually created MutatingWebhookConfiguration object is not going to be updated by the chart and the kube-webhook-certgen job

is my understanding correct?

Member Author

Basically yes, but I removed that whole part; it is unnecessary.
The recommended way will be to just let Helm manage this: if a user already has a MutatingWebhookConfiguration object, they should delete it, install the chart, and let Helm do the work.

Currently in this PR, helm upgrade does not create a new MutatingWebhookConfiguration object (or any of the TLS resources). I am not sure we want to do this, and if so, let's tackle it in a different PR.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 21, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 22, 2025
@iamzili
Member

iamzili commented Dec 23, 2025

> > I would expect at least a message to be printed after helm install / helm upgrade when registerWebhook: true and admissionController.certGen.enabled: false are set, since the admission controller will not start.
> > I think pkg/admission-controller/gencerts.sh should be executed by the user. What else needs to be done?
>
> The gencerts.sh functionality is within kube-webhook-certgen (they basically do the same thing), so users won't need to execute it. Why would registerWebhook: true + admissionController.certGen.enabled: false cause the VPA to crash?

my comment was meant to warn that when admissionController.registerWebhook: true and admissionController.certGen.enabled: false are used together, the admission controller fails to start:

  1. In a clean cluster where the vpa-tls-certs Secret does NOT exist, the admission-controller Pod stays in a ContainerCreating state. I believe the reason is that there is no mechanism in place to create this Secret during a helm install or upgrade (at least as far as I know).
  2. When the vpa-tls-certs Secret DOES exist in the cluster, the admission controller still fails to start, here is the error:
F1223 08:00:39.106756       1 config.go:190] mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:vpa:vpa-vertical-pod-autoscaler-admission-controller" cannot create resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope

So technically, we need to solve two issues. First, when we deploy with admissionController.registerWebhook: true and admissionController.certGen.enabled: false, the Secret with the certs (i.e. vpa-tls-certs) should be created. Second, we need to fix the permission issue mentioned above.
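
To illustrate the permission issue, here is a sketch of the kind of RBAC rule the error message points at. The ServiceAccount name and namespace come from the log line above; the ClusterRole and ClusterRoleBinding names, and every verb other than `create`, are assumptions for illustration and not the chart's actual templates.

```yaml
# Hypothetical ClusterRole; the name and the extra verbs are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vpa-admission-controller-webhook   # hypothetical name
rules:
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["mutatingwebhookconfigurations"]
    # "create" is the verb reported as forbidden in the log above;
    # the remaining verbs are assumed to be needed for self-registration.
    verbs: ["create", "get", "list", "delete"]
---
# Binds the role to the ServiceAccount named in the error message.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: vpa-admission-controller-webhook   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: vpa-admission-controller-webhook
subjects:
  - kind: ServiceAccount
    name: vpa-vertical-pod-autoscaler-admission-controller
    namespace: vpa
```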

@omerap12
Member Author

Thanks for checking.
2. Needs to be fixed.

  1. I guess we can add the Secret template back, but it would need to be configured by the user (e.g. by setting the values in the values file accordingly). I don't think we need to create this Secret when admissionController.registerWebhook: true and admissionController.certGen.enabled: false are set. I do believe that a message should be printed.

@omerap12 omerap12 force-pushed the webhook-certgen branch 2 times, most recently from 0b691d5 to 672f1f5 Compare December 23, 2025 20:05
@omerap12 omerap12 requested a review from iamzili December 23, 2025 20:13
Member

@iamzili iamzili left a comment

Here are my comments, reflecting your latest updates:

What do you think, @omerap12, can you drop the genCA and genSignedCert Helm template functions that --set admissionController.tls.create=true triggers? This mechanism adds complexity and provides limited value, especially since the certificates regenerate on every helm upgrade.

let me write down what I think would be appropriate to support in this PR. btw some of these are already implemented or partially supported.

  1. [Helm message is missing] Support cert-manager (https://cert-manager.io/docs/). In other words, allow users to create a CA and an X.509 certificate signed by that CA, and store the certs in a Kubernetes Secret using cert-manager. This approach appears to be already supported, although I have not tested it locally. A user who chooses this approach should use:
    --set admissionController.registerWebhook=true \
    --set admissionController.certGen.enabled=false

and when using these flags, Helm should print a message based on the NOTES.txt file stating that it is the user's responsibility to create the Kubernetes Secret object, either via cert-manager or by other means. The printed message may also include the expected format of the Secret object, which is (the format of the data stanza is crucial, as the deployment expects these specific keys):

apiVersion: v1
kind: Secret
metadata:
  name: something
type: Opaque
data:
  ca: "..."
  cert: "..."
  key: "..."

we may also notify the user that if the Secret is created after the Helm install/upgrade, the admission controller pod must be restarted to move it from the ContainerCreating state to Running.

  2. This approach should create the Kubernetes Secret object via the Helm chart, and should not call ingress-nginx/kube-webhook-certgen:
    --set admissionController.registerWebhook=true \
    --set admissionController.certGen.enabled=false \
    --set-file admissionController.tls.caCert=ca.txt \
    --set-file admissionController.tls.cert=cert.txt \
    --set-file admissionController.tls.key=key.txt

This approach is useful when the user creates the TLS-related components using a mechanism other than cert-manager, such as a shell script, and simply wants to pass the certificates to Helm.

  3. [DONE] This is the default and recommended approach; ingress-nginx/kube-webhook-certgen is used, i.e.:
    --set admissionController.registerWebhook=false \
    --set admissionController.certGen.enabled=true 
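
The `--set` flags above can also be expressed as values files. The sketch below covers the user-provided Secret mode and the default certgen mode, using only the keys named in this comment; it assumes the chart reads them exactly as written here, which may not match the merged chart.

```yaml
# User-managed certificates (e.g. cert-manager): the chart creates no
# certificates, and the user is responsible for the Secret.
admissionController:
  registerWebhook: true
  certGen:
    enabled: false
---
# Default mode: Helm registers the webhook and the
# kube-webhook-certgen Job creates the certificates.
admissionController:
  registerWebhook: false
  certGen:
    enabled: true
```

The two documents are alternatives; each would live in its own values file and be passed with `-f`.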

@omerap12
Member Author

So you are basically saying let's drop this: https://github.com/kubernetes/autoscaler/pull/8870/files?diff=split&w=0
And have this set by the user?

⚠️ WARNING: No TLS certificate source configured!
The admission controller may fail to start. Please set one of:
- admissionController.certGen.enabled: true (recommended)
- admissionController.tls.create: true

Member

Suggested change
- admissionController.tls.create: true

Mode: Helm-managed (recommended)
- Webhook registered by: Helm (MutatingWebhookConfiguration)
- Certificates managed by: certgen job
{{- else if .Values.admissionController.tls.create }}

Member

@iamzili iamzili Dec 29, 2025

Please get rid of everything related to .Values.admissionController.tls.create; we don't need this Helm value.

Member Author

We chatted about this on Slack. I am not sure if we wanna get rid of this.

{{- if .Values.admissionController.registerWebhook }}
Mode: Application-managed
- Webhook registered by: admission-controller application
- Certificates managed by: admission-controller application

Member

Suggested change
- Certificates managed by: admission-controller application
- Be aware that, with this mode, you create the certificates by using a mechanism such as cert-manager or by creating them manually. Store the certificates in a Kubernetes Secret object.

Member Author

See here: #8870 (comment)

```
In this mode:
- The VPA admission controller creates and manages the webhook itself
- The application handles its own certificate generation

Member

Suggested change
- The application handles its own certificate generation
- With this mode, the end user creates the certificates using a mechanism such as cert-manager or by creating them manually, and stores the certificates in a Kubernetes Secret.

Member Author

@omerap12 omerap12 Dec 29, 2025

This is not true. In this mode the admission controller creates the webhook (with the whole stack).
See here: https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/flags.md?plain=1#L32

Member

@iamzili iamzili Dec 30, 2025

I tested this locally, and what I observed is that the MutatingWebhookConfiguration is created by the admission-controller only when a Secret exists in the cluster. The Secret must be created by the user (manually or, for example via cert-manager) when the chart is deployed like this:

helm upgrade vpa \
    charts/vertical-pod-autoscaler \
    --install \
    --version 0.8.0  \
    --namespace vpa --create-namespace \
    --set updater.replicas=1 \
    --set updater.extraArgs\[0]\="--min-replicas=1" \
    --set recommender.replicas=1 \
    --set admissionController.replicas=1 \
    --set admissionController.registerWebhook=true \
    --set admissionController.certGen.enabled=false

I also mentioned this behavior in a comment (see point 1): #8870 (comment). Specifically, if the Secret is created after the helm install, the admission pod needs to be restarted to trigger the creation of the MutatingWebhookConfiguration. That is why I think it is important to warn users that, in this mode, it is their responsibility to create the Secret.

Member Author

Yeah, you are right. In the current implementation ( ./hack/vpa-up.sh ) the script is also creating the secret.

Member Author

Adjusted in 8432460

Member

good, small nit: I think the "The application handles its own certificate generation" sentence can be removed now.

Member

@iamzili iamzili left a comment

Good job, just found a small nit.

/lgtm

@omerap12
Member Author

/hold
/assign @adrianmoisey
For final review.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 31, 2025
@adrianmoisey
Member

/lgtm

Thanks for pushing this!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 1, 2026
@omerap12
Member Author

omerap12 commented Jan 1, 2026

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 1, 2026
@k8s-ci-robot k8s-ci-robot merged commit 934a600 into kubernetes:master Jan 1, 2026
13 checks passed
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

5 participants