Contributor

@twz123 twz123 commented Oct 21, 2025

Three goroutines could outlive a call to ClientConn.close(). Add mechanics to cancel them and wait for them to complete when closing a client connection.

Fixes #8655.

RELEASE NOTES:

  • Closing a client connection will cancel all pending goroutines and block until they complete.
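One way to picture the behavior described above is a small tracker that hands each goroutine a shutdown channel and counts it in a wait group. This is only a sketch under stated assumptions: `goroutineTracker`, `newGoroutineTracker`, and `closeAndWait` are hypothetical names, and only the `goFunc(func(shutdown <-chan struct{}))` shape is taken from the review thread further down. It is not the code from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// goroutineTracker is a hypothetical helper (not the PR's actual type).
// It tracks goroutines so that closing can cancel and wait for them.
type goroutineTracker struct {
	mu         sync.Mutex
	closed     bool
	shutdownCh chan struct{} // closed to ask goroutines to stop
	wg         sync.WaitGroup
}

func newGoroutineTracker() *goroutineTracker {
	return &goroutineTracker{shutdownCh: make(chan struct{})}
}

// goFunc runs f in a tracked goroutine and passes it the shutdown channel.
// It reports false, without running f, if the tracker is already closed.
func (t *goroutineTracker) goFunc(f func(shutdown <-chan struct{})) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.closed {
		return false
	}
	// Add happens under the mutex, so it cannot race with Wait below.
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		f(t.shutdownCh)
	}()
	return true
}

// closeAndWait signals shutdown once and blocks until every tracked
// goroutine has returned. Calling it again is safe.
func (t *goroutineTracker) closeAndWait() {
	t.mu.Lock()
	if !t.closed {
		t.closed = true
		close(t.shutdownCh)
	}
	t.mu.Unlock()
	t.wg.Wait()
}

func main() {
	t := newGoroutineTracker()
	finished := make(chan struct{})
	t.goFunc(func(shutdown <-chan struct{}) {
		<-shutdown // stand-in for work that stops on shutdown
		close(finished)
	})
	t.closeAndWait()
	select {
	case <-finished:
		fmt.Println("goroutine finished before closeAndWait returned")
	default:
		fmt.Println("goroutine leaked")
	}
}
```

The mutex guards both the `closed` flag and the `wg.Add` call, which is what keeps a late `goFunc` from starting a goroutine after the wait has begun.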

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 89.23077% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.22%. Comparing base (25dbb81) to head (11b448a).
⚠️ Report is 8 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| clientconn.go | 80.00% | 5 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8666      +/-   ##
==========================================
- Coverage   83.29%   83.22%   -0.07%     
==========================================
  Files         414      414              
  Lines       32753    32793      +40     
==========================================
+ Hits        27280    27291      +11     
- Misses       4070     4092      +22     
- Partials     1403     1410       +7     
| Files with missing lines | Coverage Δ |
|---|---|
| balancer_wrapper.go | 84.91% <100.00%> (+1.50%) ⬆️ |
| internal/testutils/pipe_listener.go | 84.61% <100.00%> (+0.83%) ⬆️ |
| test/rawConnWrapper.go | 66.10% <100.00%> (+0.28%) ⬆️ |
| clientconn.go | 90.27% <80.00%> (+0.30%) ⬆️ |

... and 21 files with indirect coverage changes

healthData *healthData

shutdownMu sync.Mutex
shutdown chan struct{}
Member

Could you change this to shutdownCh to indicate it is a channel?

Contributor Author

Done!


 func (acbw *acBalancerWrapper) UpdateAddresses(addrs []resolver.Address) {
-	acbw.ac.updateAddrs(addrs)
+	acbw.goFunc(func(shutdown <-chan struct{}) {
Member

Is there a reason we changed this to run the acbw.ac.updateAddrs(addrs) function in a goroutine?

Contributor Author

It's basically a "bubbled-up" goroutine. Previously, the goroutine was spawned in updateAddrs itself (line 1021). But since we now need to track these goroutines, I figured it would be most appropriate to spawn it here. Another option would be to push this down into updateAddrs itself, by passing the acBalancerWrapper pointer, a function pointer to acbw.goFunc, or something along those lines, and then use that to spawn the goroutine there:

Suggested change
-	acbw.goFunc(func(shutdown <-chan struct{}) {
+	acbw.ac.updateAddrs(acbw, addrs)

Then we could write line 1021 of updateAddrs like so:

	acbw.goFunc(ac.resetTransportAndUnlock)

Contributor Author

I've changed it this way. Seems to be clearer.

clientconn.go Outdated
-	ac.mu.Lock()
-	defer ac.mu.Unlock()
+	acMu := &ac.mu
+	acMu.Lock()
Member

Is there a reason the original code did not work? This seems unnecessarily complicated and does the same thing. Or am I missing something?

Contributor Author

It's a way to make the defer do the unlock conditionally, as the code might need to unlock the mutex before returning (see lines 1441 and 1442). We could achieve the same with, e.g., a boolean variable, if you prefer.

I think using a pointer for this is good because, if you forget the nil check, it will panic with an easy-to-understand stack trace. Whereas if you forget to check a boolean, you'd do a double unlock, and there's a chance that you'd get weirder problems.
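The pointer-based conditional unlock described above could look roughly like the following. This is a minimal sketch, assuming a hypothetical `conn` type with an `update` method and a `notify` callback; none of these names come from clientconn.go, and only the `acMu := &ac.mu` idea is taken from the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a type guarded by a mutex.
type conn struct {
	mu    sync.Mutex
	state string
}

// update sets the state and then calls notify without holding the lock.
// The deferred function unlocks only if the lock is still held, which is
// tracked by whether the local pointer mu is non-nil.
func (c *conn) update(state string, notify func()) {
	mu := &c.mu
	mu.Lock()
	defer func() {
		if mu != nil {
			mu.Unlock()
		}
	}()

	if c.state == state {
		return // early return: the deferred function unlocks
	}
	c.state = state

	// Unlock early and clear the pointer so the deferred function
	// does not unlock a second time.
	mu.Unlock()
	mu = nil
	notify()
}

func main() {
	c := &conn{}
	c.update("READY", func() {
		// Would deadlock if update still held the lock here.
		c.mu.Lock()
		defer c.mu.Unlock()
		fmt.Println("state:", c.state)
	})
	// Same state again: returns early and notify is not called.
	c.update("READY", func() { fmt.Println("not reached") })
}
```

If the `mu = nil` assignment is kept but the nil check in the deferred function is forgotten, the deferred `mu.Unlock()` dereferences a nil pointer and panics at that line, which is the failure mode the comment above prefers over a silent double unlock.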

@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch from e45e9d7 to e8bb2b4 Compare November 5, 2025 12:28
@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch from e8bb2b4 to 3fa606c Compare November 20, 2025 08:10
Contributor

easwars commented Dec 2, 2025

@twz123
We had a discussion about this PR during our GitHub issues/PR scrub. Can I request you to split these into separate PRs? The change to the graceful_switch LB policy is much simpler compared to the changes in the clientconn for the other two goroutine leaks. We want to make progress on this as much as possible, but are also a little concerned about the complexity that it adds to the clientconn code (which is already reasonably complex).

Apologies for how long this review has dragged on, but hopefully once these are split into smaller ones, we should be able to move faster.

Thanks for your understanding.

Contributor Author

twz123 commented Dec 5, 2025

> @twz123 We had a discussion about this PR during our GitHub issues/PR scrub. Can I request you to split these into separate PRs? The change to the graceful_switch LB policy is much simpler compared to the changes in the clientconn for the other two goroutine leaks. We want to make progress on this as much as possible, but are also a little concerned about the complexity that it adds to the clientconn code (which is already reasonably complex).

Done: #8746

> Apologies for how long this review has dragged on, but hopefully once these are split into smaller ones, we should be able to move faster.
>
> Thanks for your understanding.

No worries. I understand that these things take time.

@twz123 twz123 marked this pull request as draft December 5, 2025 11:15
eshitachandwani pushed a commit that referenced this pull request Dec 11, 2025
Goroutines spawned during balancer swaps could outlive the call to
Balancer.Close(). Monitor these via a wait group and wait for them to
finish before returning from Close(). This prevents any noticeable side
effects that could otherwise occur after Close() returns.

See:

* #8655
* #8666 (comment)

RELEASE NOTES:
- client: Closing a graceful switch balancer will now block until all
pending goroutines complete.

Signed-off-by: Tom Wieczorek <[email protected]>
@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch from 00deb21 to 40605a5 Compare December 12, 2025 14:37
@twz123 twz123 marked this pull request as ready for review December 12, 2025 14:47
@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch 2 times, most recently from ad9c99a to e8b6f4d Compare December 22, 2025 14:44
@github-actions

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Dec 28, 2025
Contributor Author

twz123 commented Dec 30, 2025

Not stale.

@github-actions github-actions bot removed the stale label Dec 30, 2025
github-actions bot commented Jan 5, 2026

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Jan 5, 2026
@arjan-bal arjan-bal added stale and removed stale labels Jan 5, 2026
The health check goroutine could outlive a call to ClientConn.close().
Add a done channel that will be waited on when closing the transport.

RELEASE NOTES:
- Closing a client connection will now block until the health check
  goroutine completes.

Signed-off-by: Tom Wieczorek <[email protected]>
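
The done-channel approach in this commit message can be sketched as follows. This is an illustration under assumptions, not the PR's code: `transport`, `startTransport`, `stopCh`, and `healthDone` are hypothetical names, and a plain stop channel stands in for whatever cancellation mechanism the real health check uses.

```go
package main

import (
	"fmt"
	"sync"
)

// transport is a hypothetical stand-in for a connection that owns a
// health check goroutine.
type transport struct {
	stopOnce   sync.Once
	stopCh     chan struct{} // closed to ask the health check to stop
	healthDone chan struct{} // closed when the health check has returned
}

// startTransport launches healthCheck in its own goroutine and records
// its completion by closing healthDone.
func startTransport(healthCheck func(stop <-chan struct{})) *transport {
	t := &transport{
		stopCh:     make(chan struct{}),
		healthDone: make(chan struct{}),
	}
	go func() {
		defer close(t.healthDone)
		healthCheck(t.stopCh)
	}()
	return t
}

// Close asks the health check to stop and blocks until it has returned.
// It is safe to call more than once.
func (t *transport) Close() {
	t.stopOnce.Do(func() { close(t.stopCh) })
	<-t.healthDone
}

func main() {
	exited := false
	t := startTransport(func(stop <-chan struct{}) {
		<-stop // stand-in for a health check loop
		exited = true
	})
	t.Close()
	fmt.Println("health check exited:", exited)
}
```

Because `healthDone` is closed only after the health check function returns, the read of `exited` after `Close` is ordered after the write inside the goroutine.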
Contributor Author

twz123 commented Jan 5, 2026

This is not stale. There are other PRs pending before I can un-draft this one again.

This will properly pass on cancellation requests, and reduce usage of a
deprecated method.

Signed-off-by: Tom Wieczorek <[email protected]>
@github-actions github-actions bot removed the stale label Jan 5, 2026
mbissa pushed a commit that referenced this pull request Jan 8, 2026
When a connection attempt is canceled after the transport is created,
ensure that it is closed inline. This prevents untracked goroutines from
being left behind.

See:

* #8655
* #8666 (comment)

RELEASE NOTES:
* When canceling a connection attempt, closing the transport won't use
an untracked goroutine anymore.

Signed-off-by: Tom Wieczorek <[email protected]>
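
To illustrate what closing the transport inline means here, the sketch below closes a freshly created transport on the caller's own goroutine when the attempt turns out to be canceled. Everything in it is hypothetical (`fakeTransport`, `connect`, `errCanceled`, and the `canceled` channel); it only contrasts an inline `Close()` with handing the close to an untracked `go tr.Close()`.

```go
package main

import (
	"errors"
	"fmt"
)

// errCanceled is a hypothetical error for a canceled connection attempt.
var errCanceled = errors.New("connection attempt canceled")

// fakeTransport is a hypothetical transport that records whether it
// has been closed.
type fakeTransport struct{ closed bool }

func (t *fakeTransport) Close() { t.closed = true }

// connect creates a transport via dial. If the attempt was canceled by
// the time dial returns, the transport is closed inline, so no goroutine
// is left behind once connect returns.
func connect(canceled <-chan struct{}, dial func() *fakeTransport) (*fakeTransport, error) {
	tr := dial()
	select {
	case <-canceled:
		tr.Close() // inline, instead of `go tr.Close()`
		return nil, errCanceled
	default:
		return tr, nil
	}
}

func main() {
	canceled := make(chan struct{})
	var created *fakeTransport
	_, err := connect(canceled, func() *fakeTransport {
		created = &fakeTransport{}
		close(canceled) // simulate cancellation during transport creation
		return created
	})
	fmt.Println(err, created.closed)
}
```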
eshitachandwani pushed a commit that referenced this pull request Jan 8, 2026
@github-actions

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Jan 11, 2026
…tdialer' into clientconn-close-waitfor-goroutines
@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch from e8b6f4d to f69b1fd Compare January 13, 2026 13:14
Three goroutines could outlive a call to ClientConn.close(). Add
mechanics to cancel them and wait for them to complete when closing a
client connection.

RELEASE NOTES:
- Closing a client connection will cancel all pending goroutines and
  block until they complete.

Signed-off-by: Tom Wieczorek <[email protected]>
@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch from f69b1fd to 11b448a Compare January 13, 2026 13:38
@github-actions github-actions bot removed the stale label Jan 13, 2026
@github-actions

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

Development

Successfully merging this pull request may close these issues.

Allow waiting for all goroutines to exit on client connection close

5 participants