Conversation

@mattrobenolt

Adds networking.gke.io/connection-draining-timeout-sec annotation to configure connection draining timeout for L4 LoadBalancers with RBS enabled.

Currently, L4 LoadBalancers use a hardcoded 30-second timeout. For workloads with long-lived connections, this is insufficient. GCE supports up to 3600 seconds, but there is no way to configure this in the Service definition.

Note: for disclosure, I used the assistance of Claude Code to help implement this PR, but I have reviewed everything personally to the best of my knowledge.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mattrobenolt
Once this PR has been reviewed and has the lgtm label, please assign felipeyepez for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @mattrobenolt!

It looks like this is your first PR to kubernetes/ingress-gce 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/ingress-gce has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Dec 5, 2025

@k8s-ci-robot
Contributor

Hi @mattrobenolt. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels on Dec 5, 2025

Member

@TortillaZHawaii TortillaZHawaii left a comment

Hi, thanks for the PR.

I would like to understand what happens if the user has already modified the timeout on the created Backend Service resource through the Cloud Console or CLI. Unfortunately, we did somewhat allow this to happen (even though it wasn't supported), and this PR would break the expectations of those users.

It's also not clear what the implications are for the rest of the stack. I'm not sure this won't cause issues somewhere unexpected.

/hold
/ok-to-test

L4LoggingConfigMapKey = "networking.gke.io/l4-logging-config-map"

// Service annotation key for specifying connection draining timeout in seconds for L4 backend services
ConnectionDrainingTimeoutKey = "networking.gke.io/connection-draining-timeout-sec"
Member

We could use "networking.gke.io/connection-draining-timeout" and use time.Duration to allow specifying time with unit, like in https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#configurable-container-restart-delay

Author

I'm not entirely sure that makes sense in this context when GCP only accepts seconds as input in the API.

So what would it mean for a spec to say networking.gke.io/connection-draining-timeout = 100ms?

We could error on anything that is sub-second, and that'd enable us to use 5m or something else?

Is that more what you were thinking?

Member

> I'm not entirely sure that makes sense in this context when GCP only accepts seconds as input in the API.
> So what would it mean for a spec to say networking.gke.io/connection-draining-timeout = 100ms?

We need validation regardless of using int or duration.

> and that'd enable us to use 5m or something else?

yes exactly

Author

Done. I switched this to just connection-draining-timeout and added validation for the acceptable range.
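
For readers following the thread, here is a minimal sketch of the kind of validation discussed above: parse the annotation as a Go duration, reject sub-second precision, and enforce the range. It is not the code from this PR. The function and constant names are hypothetical, the 3600-second upper bound comes from the PR description, and the lower bound of 0 is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical names for illustration; only the annotation key and the
// 3600-second maximum come from the discussion. The lower bound is assumed.
const (
	drainingTimeoutAnnotation = "networking.gke.io/connection-draining-timeout"
	maxDrainingTimeout        = 3600 * time.Second
)

var errNoAnnotation = errors.New("connection draining annotation not set")

// parseDrainingTimeout reads the annotation and returns the timeout in whole
// seconds, since the GCE API only accepts seconds.
func parseDrainingTimeout(annotations map[string]string) (int64, error) {
	raw, ok := annotations[drainingTimeoutAnnotation]
	if !ok {
		return 0, errNoAnnotation
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid duration %q: %w", raw, err)
	}
	if d < 0 || d > maxDrainingTimeout {
		return 0, fmt.Errorf("duration %q out of range [0s, %s]", raw, maxDrainingTimeout)
	}
	// Reject values such as 100ms or 1500ms that cannot be expressed in whole seconds.
	if d%time.Second != 0 {
		return 0, fmt.Errorf("duration %q must be a whole number of seconds", raw)
	}
	return int64(d / time.Second), nil
}

func main() {
	for _, v := range []string{"5m", "100ms", "2h"} {
		secs, err := parseDrainingTimeout(map[string]string{drainingTimeoutAnnotation: v})
		fmt.Println(v, secs, err)
	}
}
```

With this shape, a value like 5m is accepted and converted to seconds, while 100ms and anything above one hour are rejected with an error rather than silently rounded.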

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) and ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.) labels and removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) label on Dec 8, 2025

@mattrobenolt
Author

> I would like to understand what happens if the user has already modified the timeout on the created Backend Service resource through the Cloud Console or CLI. Unfortunately, we did somewhat allow this to happen (even though it wasn't supported), and this PR would break the expectations of those users.

The expected behavior with this PR is that the existing value is preserved, and that's exactly our setup: we create the LoadBalancer, then have to modify it afterward with gcloud to change this attribute. See pkg/backends/backends.go:472.
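
A rough sketch of the preservation rule described above, for illustration only. It is not the code at pkg/backends/backends.go:472. The function name and the exact precedence (annotation first, then the value already on the Backend Service, then the 30-second default from the PR description) are assumptions.

```go
package main

import "fmt"

// defaultDrainingTimeoutSec is the previously hardcoded value mentioned in the
// PR description.
const defaultDrainingTimeoutSec int64 = 30

// resolveDrainingTimeoutSec is a hypothetical helper. annotated is nil when the
// Service has no annotation; existing is nil when no Backend Service exists yet.
// Assumed precedence: annotation, then the existing value, then the default.
func resolveDrainingTimeoutSec(annotated, existing *int64) int64 {
	switch {
	case annotated != nil:
		return *annotated
	case existing != nil:
		// Keep a value set out of band (for example with gcloud) instead of
		// resetting it to the default.
		return *existing
	default:
		return defaultDrainingTimeoutSec
	}
}

func main() {
	manual := int64(600)
	fmt.Println(resolveDrainingTimeoutSec(nil, &manual))
	fmt.Println(resolveDrainingTimeoutSec(nil, nil))
}
```

The point of the second case is that a Service without the annotation leaves a manually adjusted Backend Service untouched, which is the concern raised in review.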

@k8s-ci-robot added the size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.) label and removed the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label on Dec 17, 2025
@mattrobenolt force-pushed the connection-draining-timeout branch from 97ddad6 to f4ea8ff on December 17, 2025 22:58

@mattrobenolt
Author

@TortillaZHawaii if you'd like to take another look, I think everything is addressed. Also added more testing around pre-existing values not being clobbered.

Labels

cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)

3 participants