@alexanderstephan alexanderstephan commented Dec 8, 2025

What type of PR is this?

/kind feature
/kind documentation

What this PR does / why we need it:
This PR updates the TLSRoute CRD validation rationale and proposes increasing the maxItems bound for hostnames from 16 to 4096. The change is intended only to accommodate very large organizations, where the current limit of 16 can lead to hundreds of thousands of TLSRoute objects. These objects multiply storage, watch traffic, and controller memory/CPU, which drives up API server latency and risks OOMs and instability.
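
As a rough illustration of the object-count argument above, the number of TLSRoute objects needed is the hostname count divided by the per-object limit, rounded up. The sketch below assumes a hypothetical total of 2,000,000 hostnames and assumes they can be packed freely into routes (for example, because they share a backend); neither figure comes from the PR itself.

```go
package main

import "fmt"

// routesNeeded returns how many TLSRoute objects are required to hold
// `hostnames` entries when each object may list at most `limit` hostnames.
// It is a plain ceiling division.
func routesNeeded(hostnames, limit int) int {
	return (hostnames + limit - 1) / limit
}

func main() {
	// Hypothetical hostname count, chosen only for illustration.
	const hostnames = 2_000_000

	for _, limit := range []int{16, 4096} {
		fmt.Printf("limit=%d -> %d TLSRoute objects\n", limit, routesNeeded(hostnames, limit))
	}
}
```

Under these assumptions, the limit of 16 requires 125000 objects, while a limit of 4096 requires 489.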

In a similar PR the limit has been increased: #3205
To deploy that change safely, the author employed an XValidation rule. In my case, such a rule would likely be rejected for being too complex. One idea would be to add the validation to a validating webhook. However, the webhook has been removed:

> The validating webhook has been removed. CEL validation is now built-in to CRDs and replaces the webhook. (#2595, @robscott)

Do you have any ideas on how to solve this issue? I would be happy for further assistance on how to tackle this.
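
For context, an item-count bound like the one discussed above is typically declared as a kubebuilder marker on the Go API type, which controller-gen renders as `maxItems` in the generated CRD schema. The following is a minimal sketch, assuming a trimmed-down, hypothetical spec type; `ExampleTLSRouteSpec`, `maxHostnames`, and `validateHostnames` are illustrative names and not the project's actual code. The helper only mimics the check that the API server performs from the schema.

```go
package main

import "fmt"

// Hostname is a stand-in for the Gateway API hostname type.
type Hostname string

// ExampleTLSRouteSpec is a hypothetical, trimmed-down spec for illustration.
type ExampleTLSRouteSpec struct {
	// The marker below is the kind of line this PR would change (16 -> 4096).
	//
	// +kubebuilder:validation:MaxItems=16
	Hostnames []Hostname `json:"hostnames,omitempty"`
}

// maxHostnames mirrors the marker value above (assumption: current limit).
const maxHostnames = 16

// validateHostnames mimics the schema-level maxItems check in plain Go.
func validateHostnames(spec ExampleTLSRouteSpec) error {
	if n := len(spec.Hostnames); n > maxHostnames {
		return fmt.Errorf("hostnames: too many items: %d > %d", n, maxHostnames)
	}
	return nil
}

func main() {
	spec := ExampleTLSRouteSpec{Hostnames: make([]Hostname, 17)}
	fmt.Println(validateHostnames(spec))
}
```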

Rationale:

Which issue(s) this PR fixes:

No issue yet.

Does this PR introduce a user-facing change?:

The TLSRoute CRD validation has been adjusted to allow up to 4096 hostnames and rules per TLSRoute resource. Operators must validate kube-apiserver, etcd and Gateway controller behavior with representative manifests prior to enabling the new limit in production.

@k8s-ci-robot added the release-note, kind/feature, kind/documentation, and cncf-cla: yes labels on Dec 8, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alexanderstephan
Once this PR has been reviewed and has the lgtm label, please assign aojea for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @alexanderstephan!

It looks like this is your first PR to kubernetes-sigs/gateway-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gateway-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the size/XS and needs-ok-to-test labels on Dec 8, 2025
@k8s-ci-robot
Contributor

Hi @alexanderstephan. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@alexanderstephan changed the title from "Increase TLSRoute hostnames limit to 4096" to "Increase TLSRoute hostnames limit from 16 to 4096" on Dec 8, 2025
@youngnick
Contributor

youngnick commented Dec 9, 2025

Thanks for doing the calculations @alexanderstephan, that's helpful.

However, adding this would mean that there's no further headroom for adding any other constructs to TLSRoute. In other conversations, we've talked about adding ALPN matching, and a couple of other additional complexities that escape me right now, but if we use all the space available for storing hostnames, we'll have none left for expansion.

On top of that, I'm not sure how this would work. Because TLSRoute is about sending all traffic that matches the hostname list to a single backend, are you anticipating having that backend using a certificate that has up to 4096 SANs? That seems like a very large amount, that I'd be surprised if it's supported in most certificate handlers (it would certainly massively increase the size of the certificate).

I'd like to understand the use case you're aiming for here better. What sort of use cases involve serving 4000 hostnames from one backend service, presumably with a single certificate?

@youngnick
Contributor

Also, we did stop shipping a validating webhook, because CEL does everything we want, generally, and the complexity cost calculations are also generally a good indication that we're keeping the API complexity under control.

@alexanderstephan
Author

alexanderstephan commented Dec 12, 2025

Thanks for looking into this, @youngnick!

> However, adding this would mean that there's no further headroom for adding any other constructs to TLSRoute. In other conversations, we've talked about adding ALPN matching, and a couple of other additional complexities that escape me right now, but if we use all the space available for storing hostnames, we'll have none left for expansion.

I see. So you're suggesting we should choose a lower limit than 4096, e.g., 2048?

> That seems like a very large amount, that I'd be surprised if it's supported in most certificate handlers (it would certainly massively increase the size of the certificate).

From what I have seen, I think it is actually possible to have 10k+ hostnames per certificate. However, that does not apply in this case, since we can also have multiple certificates, as explained below.

> I'd like to understand the use case you're aiming for here better. What sort of use cases involve serving 4000 hostnames from one backend service, presumably with a single certificate?

So, the umbrella topic here would be "multi-tenant SaaS with custom domains".
Here, a single backend shard can potentially serve thousands of tenant custom domains. Our deployments terminate TLS at the backend, selecting certs dynamically via SNI. So, TLSRoute’s role is to steer all those SNI matches to the right termination tier.
In this context, it makes sense to consolidate many low-traffic domains behind one backend as it is more efficient.

> Also, we did stop shipping a validating webhook, because CEL does everything we want, generally, and the complexity cost calculations are also generally a good indication that we're keeping the API complexity under control.

Okay, that makes sense. I guess this can be looked into as a next step, once the motivation for this change is clearer.
