@cyningsun
Contributor

Summary

Fixes the accumulation of zombie wantConn elements in wantConnQueue and adds comprehensive test coverage.

This PR fixes #3678

Changes

  • Fix resource cleanup: Properly remove wantConn from queue in panic/failure scenarios
  • Optimize concurrency: Replace Mutex with RWMutex in wantConn for better read performance
  • Enhance existing tests: Add dialsQueue cleanup assertions to 8 tests
  • Add zombie cleanup tests: 4 new test cases covering:
    • High concurrency with continuous dial failures (1000 requests)
    • Request timeout + dial failure scenario
    • Intermittent failures over 5 seconds
    • Queue upper bound enforcement
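As a rough illustration of the Mutex-to-RWMutex change listed above, the sketch below shows how a read-mostly status check can take a shared lock while the state transition takes the exclusive lock. The type layout and the `markDone` helper are assumptions made for this example and are not taken from the PR diff; only the name `isOngoing` appears in the reviewed code.

```go
package main

import (
	"fmt"
	"sync"
)

// wantConn is a hypothetical stand-in for the library's type; the real
// struct has more fields. Only the locking pattern is of interest here.
type wantConn struct {
	mu   sync.RWMutex
	done bool
}

// isOngoing takes only a read lock, so many goroutines can poll the
// status concurrently without blocking one another.
func (w *wantConn) isOngoing() bool {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return !w.done
}

// markDone (assumed helper) takes the write lock and reports whether
// this call performed the transition from ongoing to done.
func (w *wantConn) markDone() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.done {
		return false
	}
	w.done = true
	return true
}

func main() {
	w := &wantConn{}

	// Concurrent readers share the lock.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = w.isOngoing()
		}()
	}
	wg.Wait()

	fmt.Println(w.isOngoing(), w.markDone(), w.markDone(), w.isOngoing())
}
```

Whether RWMutex actually helps depends on how read-heavy the access pattern is; with frequent writers, a plain Mutex can perform just as well.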

@ndyakov
Member

ndyakov commented Jan 20, 2026

@cyningsun thank you. I took a brief look and it looks promising; I will do a proper review in the next couple of days.

@cyningsun
Contributor Author

@ndyakov, could you please take a first look when you have a moment?

Once you think it’s functionally sound, we can then ask @jseparator to help verify if it also addresses the memory leak issue they reported.

Member

@ndyakov ndyakov left a comment

Left one question, other than that the PR looks good.

Comment on lines +101 to +115
func (q *wantConnQueue) discardDoneAtFront() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	count := 0
	for len(q.items) > 0 {
		if q.items[0].isOngoing() {
			break
		}

		q.items = q.items[1:]
		count++
	}

	return count
}
Member

@cyningsun is it possible that the first wantConn object is still ongoing, but a wantConn after it has failed for some reason (or is otherwise done) and we have to discard it?

Contributor Author

@cyningsun cyningsun Jan 22, 2026

Yes, you've correctly identified that scenario. A wantConn later in the queue can fail while an earlier one is still ongoing.

The current design employs a lazy cleanup strategy: we only immediately clean the completed head of the queue, while intentionally delaying the cleanup of completed elements in the middle. This is a trade-off based on two factors:

  1. Performance: Removing an element from the middle requires shifting all subsequent elements, a costly operation under lock.
  2. Cascading Cleanup: When the head element is finally cleaned up, it triggers a sweep that removes all consecutive completed elements behind it.

The cost is temporarily higher memory use for “zombie” elements. However, given the default 5-second DialTimeout and the random distribution of failures, cleanup progresses steadily, keeping the accumulation bounded.
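The lazy cleanup and cascading sweep described above can be sketched in a small, self-contained program. The `discardDoneAtFront` body mirrors the snippet under review; everything else (the struct fields, `enqueue`, `markDone`, `size`) is assumed for illustration and may differ from the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// wantConn is a simplified, hypothetical stand-in for the real type.
type wantConn struct {
	mu   sync.RWMutex
	done bool
}

func (w *wantConn) isOngoing() bool {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return !w.done
}

// markDone is an assumed helper that marks the request as finished.
func (w *wantConn) markDone() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.done = true
}

// wantConnQueue holds pending requests in FIFO order (assumed layout).
type wantConnQueue struct {
	mu    sync.Mutex
	items []*wantConn
}

func (q *wantConnQueue) enqueue(w *wantConn) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, w)
}

func (q *wantConnQueue) size() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

// discardDoneAtFront drops finished entries from the head only and
// stops at the first entry that is still ongoing.
func (q *wantConnQueue) discardDoneAtFront() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	count := 0
	for len(q.items) > 0 {
		if q.items[0].isOngoing() {
			break
		}
		q.items = q.items[1:]
		count++
	}
	return count
}

func main() {
	q := &wantConnQueue{}
	ws := make([]*wantConn, 4)
	for i := range ws {
		ws[i] = &wantConn{}
		q.enqueue(ws[i])
	}

	// Entries 1 and 2 finish while the head (entry 0) is still ongoing:
	// they stay in the queue as zombies for now.
	ws[1].markDone()
	ws[2].markDone()
	fmt.Println(q.discardDoneAtFront(), q.size()) // 0 4

	// Once the head finishes, one sweep removes it and the consecutive
	// finished entries behind it, stopping at entry 3.
	ws[0].markDone()
	fmt.Println(q.discardDoneAtFront(), q.size()) // 3 1
}
```

In this sketch the zombies live only as long as the oldest ongoing request ahead of them, which is why the dial timeout bounds how long they can accumulate.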

Member

I will investigate the second scenario, but overall I agree this is a good approach.

Member

@ndyakov ndyakov left a comment

@cyningsun looks good, pinged @jseparator on the issue to verify if this solves the reported issue for him.

Development

Successfully merging this pull request may close these issues.

Memory leak in v9.17.2 with high concurrency - massive queuedNewConn and context object accumulation

2 participants