fix(pool): fix wantConnQueue zombie elements and add comprehensive test coverage #3680

cyningsun · 2026-01-20T11:47:12Z

Summary

Fixes the accumulation of zombie wantConn elements in wantConnQueue and adds comprehensive test coverage.

This PR fixes #3678

Changes

Fix resource cleanup: Properly remove wantConn from queue in panic/failure scenarios
Optimize concurrency: Replace Mutex with RWMutex in wantConn for better read performance
Enhance existing tests: Add dialsQueue cleanup assertions to 8 tests
Add zombie cleanup tests: 4 new test cases covering:
- High concurrency with continuous dial failures (1000 requests)
- Request timeout + dial failure scenario
- Intermittent failures over 5 seconds
- Queue upper bound enforcement

jit-ci · 2026-01-20T11:47:19Z

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

ndyakov · 2026-01-20T11:51:12Z

@cyningsun thank you. I took a brief look and it looks promising; I will do a proper review in the next couple of days.

cyningsun · 2026-01-20T11:51:59Z

@ndyakov, could you please take a first look when you have a moment?

Once you think it’s functionally sound, we can then ask @jseparator to help verify if it also addresses the memory leak issue they reported.

ndyakov

Left one question; other than that, the PR looks good.

ndyakov · 2026-01-22T13:28:02Z

internal/pool/want_conn.go

+func (q *wantConnQueue) discardDoneAtFront() int {
+	q.mu.Lock()
+	defer q.mu.Unlock()
+	count := 0
+	for len(q.items) > 0 {
+		if q.items[0].isOngoing() {
+			break
+		}
+
+		q.items = q.items[1:]
+		count++
+	}
+
+	return count
+}


@cyningsun is it possible that the first wantConn object is still ongoing, but a wantConn after it has failed for some reason (or is otherwise done) and we have to discard it?

cyningsun · 2026-01-22 (edited)

Yes, you've correctly identified that scenario. A wantConn later in the queue can fail while an earlier one is still ongoing.

The current design employs a lazy cleanup strategy: we only immediately clean the completed head of the queue, while intentionally delaying the cleanup of completed elements in the middle. This is a trade-off based on two factors:

Performance: Removing an element from the middle requires shifting all subsequent elements, a costly operation under lock.

Cascading Cleanup: When the head element is finally cleaned up, it triggers a sweep that removes all consecutive completed elements behind it.

The cost is temporarily higher memory use for “zombie” elements. However, given the default 5-second DialTimeout and the random distribution of failures, cleanup progresses steadily, keeping the accumulation bounded.
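The head-only cleanup and the cascading sweep described above can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: the wantConn type here is a stripped-down stand-in that models only a done flag, and finish, enqueue, and size are hypothetical helpers added for the demonstration. Only discardDoneAtFront follows the shape of the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// wantConn is a simplified stand-in (assumption): the real type carries
// more state. Only a "done" flag guarded by an RWMutex is modeled.
type wantConn struct {
	mu   sync.RWMutex
	done bool
}

// isOngoing reports whether the request is still pending.
func (w *wantConn) isOngoing() bool {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return !w.done
}

// finish is a hypothetical helper that marks the request as completed
// (successfully or not).
func (w *wantConn) finish() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.done = true
}

type wantConnQueue struct {
	mu    sync.Mutex
	items []*wantConn
}

// enqueue is a hypothetical helper that appends a request to the queue.
func (q *wantConnQueue) enqueue(w *wantConn) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, w)
}

// size is a hypothetical helper that returns the current queue length.
func (q *wantConnQueue) size() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

// discardDoneAtFront drops completed elements from the head only and
// stops at the first element that is still ongoing.
func (q *wantConnQueue) discardDoneAtFront() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	count := 0
	for len(q.items) > 0 && !q.items[0].isOngoing() {
		q.items = q.items[1:]
		count++
	}
	return count
}

func main() {
	q := &wantConnQueue{}
	a, b, c := &wantConn{}, &wantConn{}, &wantConn{}
	q.enqueue(a)
	q.enqueue(b)
	q.enqueue(c)

	// b and c complete while the head (a) is still ongoing:
	// nothing is discarded, so b and c linger as zombies.
	b.finish()
	c.finish()
	fmt.Println(q.discardDoneAtFront(), q.size()) // 0 3

	// Once the head completes, a single sweep removes a, b and c.
	a.finish()
	fmt.Println(q.discardDoneAtFront(), q.size()) // 3 0
}
```

In this sketch the zombies survive exactly as long as the head element stays ongoing, which is why the accumulation is tied to how long a single dial can take.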

ndyakov · 2026-01-22

I will investigate the 2 scenario, but overall I do agree this is a good approach.

ndyakov

@cyningsun looks good, pinged @jseparator on the issue to verify if this solves the reported issue for him.

cyningsun added 2 commits January 20, 2026 19:20

discard zombie elements in wantConnQueue

1d21fc0

fix lint

f44cb73

ndyakov approved these changes Jan 22, 2026

ndyakov mentioned this pull request Jan 22, 2026

Memory leak in v9.17.2 with high concurrency - massive queuedNewConn and context object accumulation #3678

Open
