fix(pool): fix wantConnQueue zombie elements and add comprehensive test coverage #3680
base: master
Conversation
@cyningsun thank you, I took a brief look and it looks promising, will do a proper review in the next couple of days.
@ndyakov, could you please take a first look when you have a moment? Once you think it’s functionally sound, we can then ask @jseparator to help verify if it also addresses the memory leak issue they reported. |
ndyakov
left a comment
Left one question, other than that the PR looks good.
// discardDoneAtFront pops completed wantConn entries from the head of the
// queue, stopping at the first entry that is still ongoing, and returns the
// number of entries removed.
func (q *wantConnQueue) discardDoneAtFront() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	count := 0
	for len(q.items) > 0 {
		if q.items[0].isOngoing() {
			break
		}
		q.items = q.items[1:]
		count++
	}
	return count
}
@cyningsun is it possible that the first wantConn object is still ongoing, but a wantConn after it has failed for some reason / is done and we have to discard it?
Yes, you've correctly identified that scenario. A wantConn later in the queue can fail while an earlier one is still ongoing.
The current design employs a lazy cleanup strategy: we only immediately clean the completed head of the queue, while intentionally delaying the cleanup of completed elements in the middle. This is a trade-off based on two factors:
- Performance: Removing an element from the middle requires shifting all subsequent elements, a costly operation under lock.
- Cascading Cleanup: When the head element is finally cleaned up, it triggers a sweep that removes all consecutive completed elements behind it.
The cost is temporarily higher memory use for “zombie” elements. However, given the default 5-second DialTimeout and the random distribution of failures, cleanup progresses steadily, keeping the accumulation bounded.
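To illustrate the cascading sweep described above, here is a minimal sketch. The types are simplified stand-ins for the pool internals (only a done flag), not the actual implementation, but the sweep logic mirrors discardDoneAtFront: completed elements in the middle stay put until the head finishes, and then one pass removes the whole consecutive run.

package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for the pool's wantConn: only tracks completion.
type wantConn struct {
	done bool
}

func (w *wantConn) isOngoing() bool { return !w.done }

// Simplified queue using the same head-only sweep as discardDoneAtFront.
type wantConnQueue struct {
	mu    sync.Mutex
	items []*wantConn
}

func (q *wantConnQueue) discardDoneAtFront() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	count := 0
	for len(q.items) > 0 {
		if q.items[0].isOngoing() {
			break
		}
		q.items = q.items[1:]
		count++
	}
	return count
}

func main() {
	head := &wantConn{} // still ongoing
	q := &wantConnQueue{items: []*wantConn{head, {done: true}, {done: true}}}

	// The middle elements are done, but the head is ongoing, so nothing is
	// removed yet: the done elements sit in the queue as "zombies".
	fmt.Println(q.discardDoneAtFront(), len(q.items)) // prints: 0 3

	// Once the head completes, a single sweep removes it and every
	// consecutive completed element behind it.
	head.done = true
	fmt.Println(q.discardDoneAtFront(), len(q.items)) // prints: 3 0
}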
I will investigate the second scenario, but overall I agree this is a good approach.
ndyakov
left a comment
@cyningsun looks good, pinged @jseparator on the issue to verify if this solves the reported issue for him.
Summary
Fixes zombie wantConn element accumulation in wantConnQueue and adds comprehensive test coverage. This PR fixes #3678
Changes
- Remove wantConn from the queue in panic/failure scenarios
- Replace Mutex with RWMutex in wantConn for better read performance (see the sketch below)
- Add dialsQueue cleanup assertions to 8 tests
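The following is an illustrative sketch of the Mutex-to-RWMutex pattern mentioned above; the field and method names (done, isOngoing, markDone) are assumptions for the example, not the pool's actual API. The point is that read-only checks take only a read lock, so concurrent readers do not serialize on a full Mutex.

package main

import "sync"

// Illustrative stand-in for a wantConn guarded by an RWMutex.
type wantConn struct {
	mu   sync.RWMutex
	done bool
}

// isOngoing is a read-only check, so concurrent callers can hold the read
// lock simultaneously.
func (w *wantConn) isOngoing() bool {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return !w.done
}

// markDone is the rarer write path and takes the exclusive lock.
func (w *wantConn) markDone() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.done = true
}

func main() {
	w := &wantConn{}
	_ = w.isOngoing()
	w.markDone()
	_ = w.isOngoing()
}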