@lukasmetzner
Contributor

  • One-line PR description: Introduce a feature gate to enable informer-based reconciliation in the route controller of cloud-controller-manager, reducing API calls and improving efficiency.
  • Other comments:

@k8s-ci-robot added the cncf-cla: yes and kind/kep labels on May 8, 2025
@k8s-ci-robot
Contributor

Welcome @lukasmetzner!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the sig/cloud-provider label on May 8, 2025
@k8s-ci-robot
Contributor

Hi @lukasmetzner. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-ok-to-test and size/L labels on May 8, 2025
@lukasmetzner
Contributor Author

/cc @elmiko @JoelSpeed

@k8s-ci-robot requested review from JoelSpeed and elmiko on May 8, 2025 08:10
@apricote
Member

apricote commented May 8, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on May 8, 2025
Contributor

@elmiko left a comment

the core concepts make sense to me, i think we should clean up the language around the term "cloud provider". in some places we use it to mean the controllers (ie the ccm), and other places we use it to mean the infrastructure provider (eg aws, azure, gcp), and we also refer to the framework as well.

we also need to make some decisions about the open questions. i wonder if we should go over these questions at the next sig meeting?

@lukasmetzner
Contributor Author

@elmiko If possible, I’d love to work through the open questions here in the PR so we can keep things moving, and then have a discussion in the next SIG meeting. Otherwise, we might end up losing a couple of weeks. What do you think?

Open Questions:

  • What should be the default frequency for the periodic full reconciliation?
    • Input from @JoelSpeed: We should do it similar to other controllers and choose a random time between 12h and 24h.
    • As we use the same shared informer factory as in other controllers, this should already be implemented.
  • Are there other Node fields besides node.status.addresses and PodCIDRs that should trigger a route update?
  • How should we set the interval for the periodic reconcile? Options:
    • Adjust --route-reconcile-period when feature gate enabled
    • Use --min-resync-period; currently defaults to 12h
    • Introduce a new flag
    • If we use the 12h-24h option we can probably reuse --min-resync-period.

@elmiko
Contributor

elmiko commented May 9, 2025

i'm fine to continue the discussions here.

> Input from @JoelSpeed: We should do it similar to other controllers and choose a random time between 12h and 24h.

i think 12h sounds fine to me.

> Are there other Node fields besides node.status.addresses and PodCIDRs that should trigger a route update?

i'll have to think about this a little more, i have a feeling that those are good to start with.

> How should we set the interval for the periodic reconcile?

i like the idea of adjusting the default for the --route-reconcile-period, but i don't want users to get confused about this.

my only issue with using --min-resync-period is that it sounds much more general and we are just focusing on the route controller.

@lukasmetzner
Contributor Author

> i like the idea of adjusting the default for the --route-reconcile-period, but i don't want users to get confused about this.
> my only issue with using --min-resync-period is that it sounds much more general and we are just focusing on the route controller.

To my understanding, both the service and node controllers are already watch-based and should utilize the --min-resync-period flag. Adopting this approach would bring consistency across the CCM components. In this context, the --route-reconcile-period flag could be considered for deprecation.

@elmiko
Contributor

elmiko commented May 27, 2025

ack, thank you @lukasmetzner that makes sense to me. it seems we should focus on --min-resync-period then.

@lukasmetzner requested a review from elmiko on June 2, 2025 05:57
@k8s-ci-robot removed the lgtm label on Aug 25, 2025
@JoelSpeed
Contributor

/lgtm

@k8s-ci-robot added the lgtm label on Sep 17, 2025
@lukasmetzner
Contributor Author

@elmiko @JoelSpeed As I have got lgtms from both of you with a small discussion remaining, shall I fill out the production readiness questionnaire? The next PRR freeze for v1.35 is on the 9th of October. If possible, I would like to target v1.35.

@elmiko
Contributor

elmiko commented Sep 24, 2025

yeah, that sounds good to me @lukasmetzner

@k8s-ci-robot removed the lgtm label on Sep 26, 2025
@lukasmetzner
Contributor Author

lukasmetzner commented Sep 26, 2025

I saw that we already had the questionnaire filled out, so I moved it towards prod-readiness.

IMO we could also add a metric for A/B testing in beta (cc @JoelSpeed)? I would like to avoid missing the PRR freeze because of that :D

I already got lgtms on the KEP in general and have moved the state to implementable.

@elmiko Do you know if the lead-opted-in also needs to be applied to the issue, or only the PR? I used /label lead-opted-in, but I am unsure if this is something I should do, or a sig-cloud-provider lead.

Since I’m quite new here, I hope I’ve followed the process correctly. Apologies in advance if I’ve missed anything.

/assign @elmiko
/assign @wojtek-t
/label lead-opted-in

@k8s-ci-robot
Contributor

@lukasmetzner: Can not set label lead-opted-in: Must be member in one of these teams: [release-team-enhancements release-team-leads sig-api-machinery-leads sig-apps-leads sig-architecture-leads sig-auth-leads sig-autoscaling-leads sig-cli-leads sig-cloud-provider-leads sig-cluster-lifecycle-leads sig-contributor-experience-leads sig-docs-leads sig-instrumentation-leads sig-k8s-infra-leads sig-multicluster-leads sig-network-leads sig-node-leads sig-release-leads sig-scalability-leads sig-scheduling-leads sig-security-leads sig-storage-leads sig-testing-leads sig-windows-leads]

@elmiko
Contributor

elmiko commented Sep 26, 2025

> @elmiko Do you know if the lead-opted-in also needs to be applied to the issue, or only the PR? I used /label lead-opted-in, but I am unsure if this is something I should do, or a sig-cloud-provider lead.

good question, i will investigate early next week.

lukasmetzner pushed a commit to hetznercloud/hcloud-cloud-controller-manager that referenced this pull request Oct 1, 2025
### Attach Load Balancer to a Subnet

If your CCM is configured for a Private Network, Load Balancers can now
join one of its subnets. To place a Load Balancer in a specific subnet,
use the new `load-balancer.hetzner.cloud/private-subnet-ip-range`
annotation. Learn more about this feature
[here](./docs/guides/load-balancer/private-networks.md).

### Watch-Based Route Reconciliation (Experimental)

Currently, route reconciliation is performed at a fixed interval of 30s.
This leads to unnecessary API requests, as a `GET /v1/networks/{id}`
call is triggered every 30s, even when no changes have occurred.

Upstream, we have proposed an event-driven approach, similar to the
mechanism used by other controllers such as the Load Balancer
Controller. With this new approach, route reconciliation is triggered on
node additions, node deletions, or when the `PodCIDRs` or `Addresses` of
nodes change. Additionally, to ensure consistency, reconciliation will
still occur periodically at a randomized interval between 12 and 24
hours.
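
As an illustration of the trigger condition described above, here is a minimal Go sketch of an update filter. The `Node` and `NodeAddress` types are simplified stand-ins for the Kubernetes API types, and `routeRelevantChange` is a hypothetical name; the upstream implementation may differ.

```go
package main

import "fmt"

// NodeAddress and Node are simplified stand-ins for the Kubernetes API types.
type NodeAddress struct {
	Type    string
	Address string
}

type Node struct {
	Name      string
	Labels    map[string]string
	PodCIDRs  []string
	Addresses []NodeAddress
}

// sameCIDRs reports whether two PodCIDR lists are identical, in order.
func sameCIDRs(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// sameAddresses reports whether two address lists are identical, in order.
func sameAddresses(a, b []NodeAddress) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// routeRelevantChange (hypothetical name) reports whether a node update
// should trigger a route reconciliation: only PodCIDRs and Addresses count.
func routeRelevantChange(oldNode, newNode Node) bool {
	return !sameCIDRs(oldNode.PodCIDRs, newNode.PodCIDRs) ||
		!sameAddresses(oldNode.Addresses, newNode.Addresses)
}

func main() {
	oldNode := Node{
		Name:      "node-1",
		Labels:    map[string]string{"zone": "a"},
		PodCIDRs:  []string{"10.244.1.0/24"},
		Addresses: []NodeAddress{{Type: "InternalIP", Address: "10.0.0.2"}},
	}
	relabeled := oldNode
	relabeled.Labels = map[string]string{"zone": "b"}
	readdressed := oldNode
	readdressed.Addresses = []NodeAddress{{Type: "InternalIP", Address: "10.0.0.3"}}

	fmt.Println(routeRelevantChange(oldNode, relabeled))   // false
	fmt.Println(routeRelevantChange(oldNode, readdressed)) // true
}
```

Node additions and deletions would always trigger a reconciliation; a filter like this only matters for update events, where most changes (labels, heartbeats, conditions) do not affect routes.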

We are close to merging a [Kubernetes Enhancement Proposal
(KEP)](kubernetes/enhancements#5289).
Furthermore, a pull request containing the implementation is already
open in the Kubernetes repository.

#### Forked Upstream Libraries

In this release, we replaced the upstream `controller-manager` and
`cloud-provider` libraries with our own forks. These forks are based on
the upstream `v0.34.1` release (aligned with Kubernetes v1.34.1) and
include our patches on top.

#### Enabling the Feature

This feature is **disabled by default** and will not affect existing
deployments unless explicitly enabled. We recommend testing it in a
non-production environment before considering use in production.

As the KEP has not yet been reviewed for production readiness, the
feature gate name may change in an upcoming release. Since this feature
is marked as experimental, such changes will not be considered breaking.

To enable the feature, set the following Helm value:

`args.feature-gates=CloudControllerManagerWatchBasedRoutesReconciliation=true`
@JoelSpeed
Contributor

> IMO we could also add a metric for A/B testing in beta (cc @JoelSpeed)? I would like to avoid missing the PRR freeze because of that :D

I'm happy with that, as long as we can use the metric in the beta stage with the feature both enabled and disabled, so that we can compare the two implementations and determine the performance impact.

@lukasmetzner force-pushed the 5237-watch-based-route-controller-reconciliation branch from bd052f2 to 3270c52 on October 9, 2025 07:12
@deads2k
Contributor

deads2k commented Oct 13, 2025

PRR lgtm. Thank you for addressing metrics early; that always makes alpha-to-beta decisions easier.

/approve

@kannon92
Contributor

@elmiko @JoelSpeed

Do you have a sig-cloud-provider owner to look at this?

@elmiko
Contributor

elmiko commented Oct 14, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, elmiko, lukasmetzner

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Oct 14, 2025
@elmiko
Contributor

elmiko commented Oct 14, 2025

/lgtm

@k8s-ci-robot added the lgtm label on Oct 14, 2025
@k8s-ci-robot merged commit 9ddf451 into kubernetes:master on Oct 14, 2025
4 checks passed
@k8s-ci-robot added this to the v1.35 milestone on Oct 14, 2025
@wojtek-t
Member

/assign @wojtek-t

Just for posterity - this proposal LGTM too.
