Skip to content

net: posix: fix pollable_fd_state leak on cross-shard connection forwarding#3394

Draft
avikivity wants to merge 1 commit into
masterfrom
fix-accept-smp
Draft

net: posix: fix pollable_fd_state leak on cross-shard connection forwarding#3394
avikivity wants to merge 1 commit into
masterfrom
fix-accept-smp

Conversation

@avikivity

Copy link
Copy Markdown
Member

Fix a memory leak of pollable_fd_state objects detected by AddressSanitizer in socket_test (36800 bytes in 100 allocations).

The leak has three contributing factors, all in the cross-shard connection forwarding path of posix_server_socket_impl::accept():

  1. When a connection is forwarded to another shard, the raw file_desc is moved out of the pollable_fd via get_file_desc() and captured in the smp::submit_to lambda. However, the pollable_fd wrapper (holding an intrusive_ptr to the now-empty pollable_fd_state) was left to be destroyed implicitly by the coroutine frame at end of the loop iteration. In practice, the coroutine frame did not destroy the structured binding variable between iterations (observed with the proxy protocol path where I/O is performed on the fd before forwarding), causing the pollable_fd_state to be leaked. Fix this by explicitly calling fd.close() after extracting the raw file_desc.

  2. posix_ap_server_socket_impl::move_connected_socket() unconditionally queues connections into the static conn_q when no acceptor is waiting. If the server_socket has already been destroyed (entry removed from ports), these queued connections are never consumed. Fix by checking ports.contains() before queuing, and closing the fd otherwise.

  3. posix_ap_server_socket_impl's destructor only removed from ports but did not drain conn_q entries for its address. Connections queued between abort_accept() (which drains conn_q) and the destructor would be leaked. Fix by also erasing from conn_q in the destructor.

…arding

Fix a memory leak of pollable_fd_state objects detected by
AddressSanitizer in socket_test (36800 bytes in 100 allocations).

The leak has three contributing factors, all in the cross-shard
connection forwarding path of posix_server_socket_impl::accept():

1. When a connection is forwarded to another shard, the raw file_desc
   is moved out of the pollable_fd via get_file_desc() and captured in
   the smp::submit_to lambda. However, the pollable_fd wrapper (holding
   an intrusive_ptr to the now-empty pollable_fd_state) was left to be
   destroyed implicitly by the coroutine frame at end of the loop
   iteration. In practice, the coroutine frame did not destroy the
   structured binding variable between iterations (observed with the
   proxy protocol path where I/O is performed on the fd before
   forwarding), causing the pollable_fd_state to be leaked. Fix this
   by explicitly calling fd.close() after extracting the raw file_desc.

2. posix_ap_server_socket_impl::move_connected_socket() unconditionally
   queues connections into the static conn_q when no acceptor is
   waiting. If the server_socket has already been destroyed (entry
   removed from ports), these queued connections are never consumed.
   Fix by checking ports.contains() before queuing, and closing the
   fd otherwise.

3. posix_ap_server_socket_impl's destructor only removed from ports
   but did not drain conn_q entries for its address. Connections
   queued between abort_accept() (which drains conn_q) and the
   destructor would be leaked. Fix by also erasing from conn_q in
   the destructor.
@travisdowns

Copy link
Copy Markdown
Contributor

In practice, the coroutine frame did not destroy the structured binding variable between iterations (observed with the proxy protocol path where I/O is performed on the fd before forwarding)

Wait, isn't this a massive compiler bug? Are you saying fd dtor is not called at the closing bracket for the loop? What happens later? Is it ever called? Does the next co_await _lfd.accept(); re-use the space?

@avikivity

Copy link
Copy Markdown
Member Author

In practice, the coroutine frame did not destroy the structured binding variable between iterations (observed with the proxy protocol path where I/O is performed on the fd before forwarding)

Wait, isn't this a massive compiler bug? Are you saying fd dtor is not called at the closing bracket for the loop? What happens later? Is it ever called? Does the next co_await _lfd.accept(); re-use the space?

It's indeed a massive bug.

I started a fruitless bisect, need to retry it.

@avikivity avikivity marked this pull request as draft May 13, 2026 18:42
@travisdowns

Copy link
Copy Markdown
Contributor

It's indeed a massive bug.

I started a fruitless bisect, need to retry it.

Does this occur in clang or gcc?

@avikivity

Copy link
Copy Markdown
Member Author

It's indeed a massive bug.
I started a fruitless bisect, need to retry it.

Does this occur in clang or gcc?

gcc, we'd know about a such clang bug much sooner.

@travisdowns

Copy link
Copy Markdown
Contributor

gcc, we'd know about a such clang bug much sooner.

Right.

I guess I question the wisdom of doing one-off fixes like this one (actually I was unclear what was going on, I had to check that the dtor had the same effect as close()) given the magnitude of the issue, it just seems unsafe to use gcc in this state.

@avikivity

Copy link
Copy Markdown
Member Author

gcc, we'd know about a such clang bug much sooner.

Right.

I guess I question the wisdom of doing one-off fixes like this one (actually I was unclear what was going on, I had to check that the dtor had the same effect as close()) given the magnitude of the issue, it just seems unsafe to use gcc in this state.

I'm against this fix too, this was before I realized it is likely a gcc bug.

@avikivity

Copy link
Copy Markdown
Member Author

gcc, we'd know about a such clang bug much sooner.

Right.
I guess I question the wisdom of doing one-off fixes like this one (actually I was unclear what was going on, I had to check that the dtor had the same effect as close()) given the magnitude of the issue, it just seems unsafe to use gcc in this state.

I'm against this fix too, this was before I realized it is likely a gcc bug.

Or rather, the patch is actually good, just not for its stated purpose. The code now can hang on to an fd for an unlimited amount of time, and it's better to drop the fd early.

@avikivity

Copy link
Copy Markdown
Member Author

@travisdowns

Copy link
Copy Markdown
Contributor

@avikivity ISTM gcc is simply unsuitable for seastar applications until this is fixed? just very hard to avoid that problem.

@avikivity

avikivity commented May 22, 2026

Copy link
Copy Markdown
Member Author

@avikivity ISTM gcc is simply unsuitable for seastar applications until this is fixed? just very hard to avoid that problem.

Coroutines, gcc, structured bindings - pick any two

@travisdowns

Copy link
Copy Markdown
Contributor

Coroutines, gcc, structured bindings - pick any two

Coroutines, structured bindings ... every time :)

@avikivity

Copy link
Copy Markdown
Member Author

Coroutines, gcc, structured bindings - pick any two

Coroutines, structured bindings ... every time :)

I'm with you here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants