net: posix: fix pollable_fd_state leak on cross-shard connection forwarding#3394
net: posix: fix pollable_fd_state leak on cross-shard connection forwarding#3394avikivity wants to merge 1 commit into
Conversation
…arding Fix a memory leak of pollable_fd_state objects detected by AddressSanitizer in socket_test (36800 bytes in 100 allocations). The leak has three contributing factors, all in the cross-shard connection forwarding path of posix_server_socket_impl::accept(): 1. When a connection is forwarded to another shard, the raw file_desc is moved out of the pollable_fd via get_file_desc() and captured in the smp::submit_to lambda. However, the pollable_fd wrapper (holding an intrusive_ptr to the now-empty pollable_fd_state) was left to be destroyed implicitly by the coroutine frame at end of the loop iteration. In practice, the coroutine frame did not destroy the structured binding variable between iterations (observed with the proxy protocol path where I/O is performed on the fd before forwarding), causing the pollable_fd_state to be leaked. Fix this by explicitly calling fd.close() after extracting the raw file_desc. 2. posix_ap_server_socket_impl::move_connected_socket() unconditionally queues connections into the static conn_q when no acceptor is waiting. If the server_socket has already been destroyed (entry removed from ports), these queued connections are never consumed. Fix by checking ports.contains() before queuing, and closing the fd otherwise. 3. posix_ap_server_socket_impl's destructor only removed from ports but did not drain conn_q entries for its address. Connections queued between abort_accept() (which drains conn_q) and the destructor would be leaked. Fix by also erasing from conn_q in the destructor.
Wait, isn't this a massive compiler bug? Are you saying |
It's indeed a massive bug. I started a fruitless bisect, need to retry it. |
Does this occur in clang or gcc? |
gcc, we'd know about a such clang bug much sooner. |
Right. I guess I question the wisdom of doing one-off fixes like this one (actually I was unclear what was going on, I had to check that the dtor had the same effect as close()) given the magnitude of the issue, it just seems unsafe to use gcc in this state. |
I'm against this fix too, this was before I realized it is likely a gcc bug. |
Or rather, the patch is actually good, just not for its stated purpose. The code now can hang on to an fd for an unlimited amount of time, and it's better to drop the fd early. |
|
Likely cause: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124584 |
|
@avikivity ISTM gcc is simply unsuitable for seastar applications until this is fixed? just very hard to avoid that problem. |
Coroutines, gcc, structured bindings - pick any two |
Coroutines, structured bindings ... every time :) |
I'm with you here |
Fix a memory leak of pollable_fd_state objects detected by AddressSanitizer in socket_test (36800 bytes in 100 allocations).
The leak has three contributing factors, all in the cross-shard connection forwarding path of posix_server_socket_impl::accept():
When a connection is forwarded to another shard, the raw file_desc is moved out of the pollable_fd via get_file_desc() and captured in the smp::submit_to lambda. However, the pollable_fd wrapper (holding an intrusive_ptr to the now-empty pollable_fd_state) was left to be destroyed implicitly by the coroutine frame at end of the loop iteration. In practice, the coroutine frame did not destroy the structured binding variable between iterations (observed with the proxy protocol path where I/O is performed on the fd before forwarding), causing the pollable_fd_state to be leaked. Fix this by explicitly calling fd.close() after extracting the raw file_desc.
posix_ap_server_socket_impl::move_connected_socket() unconditionally queues connections into the static conn_q when no acceptor is waiting. If the server_socket has already been destroyed (entry removed from ports), these queued connections are never consumed. Fix by checking ports.contains() before queuing, and closing the fd otherwise.
posix_ap_server_socket_impl's destructor only removed from ports but did not drain conn_q entries for its address. Connections queued between abort_accept() (which drains conn_q) and the destructor would be leaked. Fix by also erasing from conn_q in the destructor.