Commit d3c5628
ksmbd: centralize ksmbd_conn final release to plug transport leak
ksmbd_conn_free() is one of four sites that can observe the last
refcount drop of a struct ksmbd_conn. The other three

  fs/smb/server/connection.c  ksmbd_conn_r_count_dec()
  fs/smb/server/oplock.c      __free_opinfo()
  fs/smb/server/vfs_cache.c   session_fd_check()

end the conn with a bare kfree(), skipping
ida_destroy(&conn->async_ida) and
conn->transport->ops->free_transport(conn->transport). Whenever one
of them is the last putter, the embedded async_ida and the entire
transport struct leak -- for TCP, that is also the struct socket and
the kvec iov.
__free_opinfo() being a final putter is not theoretical. opinfo_put()
queues the callback via call_rcu(&opinfo->rcu, free_opinfo_rcu), so
ksmbd_server_terminate_conn() can deposit N opinfo releases in RCU and
have ksmbd_conn_free() run in the handler thread before any of them
fire. ksmbd_conn_free() then observes refcnt > 0 and short-circuits;
the last RCU-delivered __free_opinfo() falls onto its bare kfree(conn)
branch and the transport is lost.
Reproducer (QEMU/virtme guest, ksmbd server and CIFS client in the
same guest, mounting //127.0.0.1/testshare): each iteration holds 8
files open via sleep processes, force-closes TCP with
`ss -K sport = :445`, kills the holders, lazy-umounts; repeated 10
times, then ksmbd shutdown and kmemleak scan.
Kprobes: ksmbd_conn_alloc, ksmbd_conn_free, ksmbd_tcp_free_transport,
free_opinfo_rcu. (ksmbd_tcp_new_connection did not register stably
so ksmbd_conn_alloc is the lifecycle anchor.)
A/B validation, same image, varying only ksmbd.ko. Pre-patch is HEAD
with only patch 1 of this series applied:

  state       conn_alloc  conn_free  tcp_free  opi_rcu  kmemleak
  ----------  ----------  ---------  --------  -------  --------
  pre-patch           20         20        10      160         7
  with patch          20         20        20      160         0

Pre-patch conn_free=20 with tcp_free=10 directly demonstrates the
bare-kfree paths skipping transport cleanup, and kmemleak reports 7
unreferenced allocations whose backtraces point into the ksmbd text
(struct tcp_transport, t->iov kvec). With this patch tcp_free matches
conn_free at 20/20 and kmemleak is clean across two independent
post-patch runs. opi_rcu=160 confirms the RCU opinfo release path
that motivates the fix is exercised. The transport leak fix is
established by this A/B comparison.
Move the per-struct final release into __ksmbd_conn_release_work() and
route the three bare-kfree final-put sites through a new
ksmbd_conn_put(). Those sites now pair ida_destroy() and
free_transport() with kfree(conn) regardless of which holder happens
to release the last reference. The stop_sessions() path keeps patch
1's local pin-drop cleanup open-coded instead of routing that
temporary reference through ksmbd_conn_put(), so this patch does not
layer the conn-put conversion on top of the stop_sessions() iteration
rewrite and the two fixes remain independently reviewable and
revertible. Applying this patch alone still leaves stop_sessions()
on its own local cleanup; patch 1 fixes that site separately.
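
The "one release routine for every final putter" idea can be sketched in userspace C. This is only a model under stated assumptions: the model_* names and the transports_freed counter are hypothetical, C11 atomics stand in for the kernel refcount, and free() stands in for ida_destroy() and free_transport(). It is not the ksmbd code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the transport and the connection. */
struct model_transport {
	int *iov;			/* stands in for the kvec iov */
};

struct model_conn {
	atomic_int refcnt;
	struct model_transport *transport;
};

static int transports_freed;		/* counts complete teardowns */

static struct model_conn *model_conn_alloc(void)
{
	struct model_conn *conn = calloc(1, sizeof(*conn));

	if (!conn)
		return NULL;
	conn->transport = calloc(1, sizeof(*conn->transport));
	if (!conn->transport) {
		free(conn);
		return NULL;
	}
	conn->transport->iov = calloc(4, sizeof(int));
	atomic_init(&conn->refcnt, 1);	/* reference owned by the handler */
	return conn;
}

/* Single release routine: the transport is always freed with the conn. */
static void model_conn_release(struct model_conn *conn)
{
	free(conn->transport->iov);
	free(conn->transport);
	transports_freed++;
	free(conn);
}

static void model_conn_get(struct model_conn *conn)
{
	atomic_fetch_add(&conn->refcnt, 1);
}

/* Every holder drops its reference here, so any of them may be last. */
static void model_conn_put(struct model_conn *conn)
{
	/* atomic_fetch_sub() returns the old value: 1 means we were last. */
	if (atomic_fetch_sub(&conn->refcnt, 1) == 1)
		model_conn_release(conn);
}

/* Handler drops first, a late holder drops last; returns teardown count. */
static int model_last_putter_demo(void)
{
	struct model_conn *conn = model_conn_alloc();

	if (!conn)
		return -1;
	transports_freed = 0;
	model_conn_get(conn);	/* e.g. a reference held by a pending opinfo */
	model_conn_put(conn);	/* handler thread: refcnt 2 -> 1, no teardown */
	model_conn_put(conn);	/* late holder: refcnt 1 -> 0, full teardown */
	return transports_freed;
}
```

Calling model_last_putter_demo() from a test main() should report exactly one transport teardown even though the handler was not the last putter, which is the property the bare-kfree() sites lacked.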
The centralized release reaches sock_release() -> tcp_close() ->
lock_sock_nested() from every final putter. lock_sock_nested() has
might_sleep(), and __free_opinfo() can be released from an RCU
callback (free_opinfo_rcu(), softirq on default kernels). Calling
the final release directly from that path trips
CONFIG_DEBUG_ATOMIC_SLEEP:

  BUG: sleeping function called from invalid context at net/core/sock.c:3785
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0
  Call Trace:
   <IRQ>
   lock_sock_nested+0x43/0xa0
   tcp_close+0x19/0xa0
   inet_release+0x44/0x90
   sock_release+0x25/0x90
   ksmbd_tcp_free_transport+0x16/0x40 [ksmbd]
   __ksmbd_conn_release_work+0x... [ksmbd]
   ksmbd_conn_put+0x... [ksmbd]
   free_opinfo_rcu+0x... [ksmbd]
   rcu_do_batch+0x1e5/0x5c0
   rcu_core+0x395/0x4d0

Handle this once inside ksmbd_conn_put() by unconditionally deferring
the final release to a dedicated ksmbd_conn_wq workqueue. When the
refcount reaches zero, ksmbd_conn_put() queues the pre-initialized
release_work onto the workqueue and returns; the work handler runs
the sleep-allowed teardown (ida_destroy,
free_transport -> sock_release, kfree) in process context. That makes
ksmbd_conn_put() safe to call from RCU callbacks and other
non-sleeping putter contexts without each call site needing its own
bounce.
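
The deferral described above can be modelled in single-threaded userspace C. The sketch below is an assumption-laden illustration, not the kernel implementation: the model_* names, the toy pending list and the in_atomic_ctx flag are all hypothetical, and a plain linked list stands in for the workqueue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct model_work {
	struct model_work *next;
	void (*fn)(struct model_work *work);
};

struct model_conn {
	atomic_int refcnt;
	struct model_work release_work;	/* pre-initialized at allocation */
};

static struct model_work *pending;	/* toy stand-in for the workqueue */
static bool in_atomic_ctx;		/* models an RCU callback context */
static int released;
static int released_in_atomic;

/* Work handler: the only place where the sleep-allowed teardown runs. */
static void model_release_work(struct model_work *work)
{
	/* Recover the enclosing conn, like container_of() in the kernel. */
	struct model_conn *conn = (struct model_conn *)
		((char *)work - offsetof(struct model_conn, release_work));

	if (in_atomic_ctx)
		released_in_atomic++;
	released++;
	free(conn);
}

static struct model_conn *model_conn_alloc(void)
{
	struct model_conn *conn = calloc(1, sizeof(*conn));

	if (!conn)
		return NULL;
	atomic_init(&conn->refcnt, 1);
	conn->release_work.fn = model_release_work;
	return conn;
}

/* The final put only queues the work; it never tears down inline. */
static void model_conn_put(struct model_conn *conn)
{
	if (atomic_fetch_sub(&conn->refcnt, 1) != 1)
		return;
	conn->release_work.next = pending;
	pending = &conn->release_work;
}

/* Models the worker thread draining the queue in process context. */
static void model_drain_workqueue(void)
{
	while (pending) {
		struct model_work *work = pending;

		pending = work->next;
		work->fn(work);
	}
}

/* Returns 0 when the teardown ran only after leaving atomic context. */
static int model_deferred_put_demo(void)
{
	struct model_conn *conn = model_conn_alloc();
	int released_while_atomic;

	if (!conn)
		return -1;
	released = 0;
	released_in_atomic = 0;

	in_atomic_ctx = true;		/* e.g. inside an RCU callback */
	model_conn_put(conn);		/* last reference: queue only */
	released_while_atomic = released;
	in_atomic_ctx = false;

	model_drain_workqueue();	/* process context: run the teardown */

	if (released_while_atomic != 0 || released != 1 ||
	    released_in_atomic != 0)
		return 1;
	return 0;
}
```

The point of the design is that the caller of the put never needs to know whether it is allowed to sleep; the cost is that even putters already in process context take the queueing detour.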
Moving the final release onto a workqueue is not by itself enough on
the session_fd_check() path: __close_file_table_ids() holds
write_lock(&ft->lock) across the skip callback, and
session_fd_check() already sleeps in
ksmbd_vfs_copy_durable_owner() -> kstrdup(GFP_KERNEL) and
down_write(&fp->f_ci->m_lock) (a rw_semaphore) before it ever reaches
ksmbd_conn_put(). These sleeps pre-date this patch but would
equally trip CONFIG_DEBUG_ATOMIC_SLEEP on a durable-fd workload.
Refactor __close_file_table_ids() to take a transient reference on
fp and unpublish fp from the session idr *under ft->lock* before
calling skip() outside the lock. A transient ref protects lifetime
but not concurrent field mutation, so the idr_remove() is what keeps
__ksmbd_lookup_fd() through this session's idr from granting a new
ksmbd_fp_get() reference to an fp whose fp->conn / fp->tcon /
fp->volatile_id / op->conn / lock_list links are about to be rewritten
by session_fd_check(). The same unpublish step also clears
fp->volatile_id under ft->lock, preventing any later final close of
the same fp from removing a reused idr slot. Durable reconnect is
unaffected because it reaches fp through the global durable table
(ksmbd_lookup_durable_fd -> global_ft).
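
The "unpublish under the lock, call back outside the lock" pattern can be shown with a toy table. This is a single-threaded sketch under loud assumptions: a bool stands in for ft->lock, a fixed array stands in for the session idr, and every model_* name is hypothetical rather than taken from ksmbd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 4
#define MODEL_NO_ID (-1)

struct model_fp {
	int id;			/* stands in for fp->volatile_id */
	int refcount;
};

struct model_table {
	bool locked;		/* toy stand-in for ft->lock */
	struct model_fp *slots[MODEL_SLOTS];
};

static int skip_calls_under_lock;

/* Callback that may sleep in the real code; record if the lock is held. */
static void model_skip(struct model_table *ft, struct model_fp *fp)
{
	(void)fp;
	if (ft->locked)
		skip_calls_under_lock++;
}

static struct model_fp *model_lookup(struct model_table *ft, int id)
{
	if (id < 0 || id >= MODEL_SLOTS)
		return NULL;
	return ft->slots[id];
}

/* Returns the number of entries handed to the callback. */
static int model_close_all(struct model_table *ft)
{
	int visited = 0;

	for (int id = 0; id < MODEL_SLOTS; id++) {
		struct model_fp *fp;

		ft->locked = true;
		fp = ft->slots[id];
		if (fp) {
			fp->refcount++;		/* transient reference */
			ft->slots[id] = NULL;	/* unpublish from the table */
			fp->id = MODEL_NO_ID;	/* slot may be reused safely */
		}
		ft->locked = false;

		if (!fp)
			continue;
		model_skip(ft, fp);		/* runs without the lock */
		fp->refcount--;			/* drop the transient */
		visited++;
	}
	return visited;
}

/* Returns 0 when every property of the pattern held. */
static int model_unpublish_demo(void)
{
	struct model_fp a = { .id = 1, .refcount = 1 };
	struct model_fp b = { .id = 3, .refcount = 1 };
	struct model_table ft = {
		.locked = false,
		.slots = { NULL, &a, NULL, &b },
	};

	skip_calls_under_lock = 0;
	if (model_close_all(&ft) != 2)
		return 1;
	if (skip_calls_under_lock != 0)
		return 2;
	if (model_lookup(&ft, 1) || model_lookup(&ft, 3))
		return 3;
	if (a.id != MODEL_NO_ID || b.id != MODEL_NO_ID)
		return 4;
	if (a.refcount != 1 || b.refcount != 1)
		return 5;
	return 0;
}
```

Because the entry is gone from the table before the lock is dropped, a lookup that races with the callback simply misses, which is the guarantee a transient reference alone would not provide.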
After skip() returns, the preserve path drops the transient with
atomic_dec(): fp keeps the original +1 refcount that used to represent
the session idr entry so the durable scavenger can later expire it
once the timeout elapses. The close path transitions f_state to
FP_CLOSED under ft->lock (matching ksmbd_close_fd()) so ksmbd_fp_get()
lookups via any remaining path fail, then removes fp from m_fp_list
before dropping both the transient and the original session-idr ref
via atomic_sub_and_test(2). The list removal cannot be left for a
deferred final putter because fp->volatile_id has already been cleared
and __ksmbd_remove_fd() will intentionally skip both idr_remove() and
list_del_init(). The subtraction cannot underflow because no
concurrent close path can consume the session-idr ref after the
idr_remove() above. If the subtraction hits zero we finalize fp
ourselves, otherwise the remaining user's ksmbd_fd_put() finalizes via
__put_fd_final() -> __ksmbd_close_fd().
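
The refcount arithmetic of the preserve and close paths can be checked with a small userspace model. The model_* names and the outcome enum are hypothetical, and model_sub_and_test() is only an analogue of the kernel's atomic_sub_and_test() built on C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Subtract n and report whether the result reached zero. */
static bool model_sub_and_test(atomic_int *v, int n)
{
	return atomic_fetch_sub(v, n) - n == 0;
}

enum model_outcome {
	MODEL_PRESERVED,	/* durable fp kept for the scavenger */
	MODEL_FINALIZED_HERE,	/* we dropped the last reference */
	MODEL_DEFERRED,		/* another user finalizes later */
};

static enum model_outcome model_teardown(atomic_int *refcount, bool preserve)
{
	/* Transient reference, taken while the table lock is held. */
	atomic_fetch_add(refcount, 1);

	/* ... unpublish, drop the lock, run skip() here ... */

	if (preserve) {
		/* Drop only the transient; the original +1 stays. */
		atomic_fetch_sub(refcount, 1);
		return MODEL_PRESERVED;
	}

	/* Drop the transient and the original session-table reference. */
	if (model_sub_and_test(refcount, 2))
		return MODEL_FINALIZED_HERE;
	return MODEL_DEFERRED;
}

/* Returns 0 when all three cases behave as described. */
static int model_fp_refcount_demo(void)
{
	atomic_int idle, busy, durable;

	atomic_init(&idle, 1);		/* only the session-table reference */
	atomic_init(&busy, 2);		/* table reference + in-flight user */
	atomic_init(&durable, 1);

	if (model_teardown(&idle, false) != MODEL_FINALIZED_HERE ||
	    atomic_load(&idle) != 0)
		return 1;
	if (model_teardown(&busy, false) != MODEL_DEFERRED ||
	    atomic_load(&busy) != 1)
		return 2;
	if (model_teardown(&durable, true) != MODEL_PRESERVED ||
	    atomic_load(&durable) != 1)
		return 3;
	return 0;
}
```

In the busy case the count ends at 1, so the remaining user's final put is the one that finalizes the object.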
The __close_file_table_ids() refactor is exercised separately on a
debug kernel additionally built with CONFIG_DEBUG_LIST and
CONFIG_DEBUG_OBJECTS_WORK using a same-session two-tcon workload:
one tcon drives an open/write storm while the other tcon repeats 50
tree disconnects on the same session. Trace counts: 52
__close_file_table_ids invocations, 4793 __ksmbd_close_fd calls,
30337 __put_fd_final, 9578 ksmbd_conn_put decrements, 1
__ksmbd_conn_release_work execution. The workload exercises the
idr_remove() / fp->volatile_id clear / m_fp_list unlink coupling
under concurrent fp allocation in the same session table. This run
validates the file-table/id/list rewrite under
DEBUG_LIST/DEBUG_OBJECTS_WORK; it does not re-prove the transport
leak fix, which the abrupt-disconnect A/B above already covered. No
list-corruption, work_struct ODEBUG, sleep-in-atomic, lockdep or
kmemleak reports were observed.
The deferred-final-putter branch in __close_file_table_ids()
(atomic_sub_and_test(2) returning false) is covered by analysis, not
by a dedicated counter in this run: the trace points used above
cannot distinguish a deferred-putter __ksmbd_close_fd from a normal
SMB2_CLOSE __ksmbd_close_fd. __close_file_table_ids() unconditionally
clears fp->volatile_id and unlinks fp from m_fp_list before
atomic_sub_and_test(2), so __ksmbd_remove_fd() invoked from a later
__put_fd_final() correctly skips both idr_remove() and list_del_init().
At module exit, the workqueue is flushed and destroyed after
rcu_barrier(), so any release queued by a trailing RCU callback is
drained before the inode hash and the module text go away. Verified
by kprobe tracing that all 20 __ksmbd_conn_release_work() executions
complete before ksmbd_conn_wq_destroy() enters, with
ksmbd_tcp_free_transport() matching ksmbd_conn_alloc() at 20/20.
The ida_destroy() previously added to ksmbd_conn_free()'s refcount
branch is folded into __ksmbd_conn_release_work() so it runs from
whichever site turns out to be the last putter.
The conversion also closes a pre-existing ksmbd_conn lifetime gap: fp
used to store a borrowed pointer to its connection
(fp->conn = work->conn) without taking a reference, so nothing stopped
the conn from being freed while an fp still held a stale fp->conn.
Teach fp to own a strong reference on fp->conn for as long as
fp->conn is non-NULL:

  * ksmbd_open_fd() and ksmbd_reopen_durable_fd() bump conn->refcnt
    when assigning fp->conn (matching put on the ksmbd_open_fd() error
    path and on the ksmbd_reopen_durable_fd() __open_id() failure
    path). Both now set fp->conn and fp->tcon before __open_id()
    publishes fp into the session's file table, so a concurrent
    teardown that iterates the table via idr cannot observe a valid
    volatile_id with fp->conn still NULL and preserve a
    partially-initialized fp.

  * session_fd_check() (durable preserve) and __ksmbd_close_fd() (fp
    destroy) release the owned reference via ksmbd_conn_put() and
    clear fp->conn.

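
The ownership invariant in the list above (fp holds a reference exactly while fp->conn is non-NULL) can be sketched as a pair of helpers. The helper names are hypothetical and do not appear in the patch; the model only counts references and leaves out the real release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_conn {
	atomic_int refcnt;
};

struct model_fp {
	struct model_conn *conn;	/* non-NULL means "owns one reference" */
};

/* Take the reference first, then publish the pointer in fp. */
static void model_fp_attach_conn(struct model_fp *fp, struct model_conn *conn)
{
	atomic_fetch_add(&conn->refcnt, 1);
	fp->conn = conn;
}

/* Clear the pointer and drop the owned reference; safe to call twice. */
static void model_fp_detach_conn(struct model_fp *fp)
{
	struct model_conn *conn = fp->conn;

	if (!conn)
		return;		/* already cleaned up once */
	fp->conn = NULL;
	atomic_fetch_sub(&conn->refcnt, 1);
}

/* Returns 0 when the count follows the ownership rule at every step. */
static int model_fp_owns_conn_demo(void)
{
	struct model_conn conn;
	struct model_fp fp = { .conn = NULL };

	atomic_init(&conn.refcnt, 1);		/* handler's reference */

	model_fp_attach_conn(&fp, &conn);
	if (atomic_load(&conn.refcnt) != 2)
		return 1;

	atomic_fetch_sub(&conn.refcnt, 1);	/* handler goes away first */
	if (atomic_load(&conn.refcnt) != 1 || fp.conn != &conn)
		return 2;			/* fp alone keeps conn alive */

	model_fp_detach_conn(&fp);
	if (atomic_load(&conn.refcnt) != 0 || fp.conn != NULL)
		return 3;

	model_fp_detach_conn(&fp);		/* second cleanup is a no-op */
	if (atomic_load(&conn.refcnt) != 0)
		return 4;
	return 0;
}
```

The NULL check in the detach helper mirrors why a NULL guard is still needed when cleanup can run more than once on the same fp.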
With that invariant in place, session_fd_check() needs no local pin
across the op->conn puts -- fp's own reference keeps conn alive for
the entire body of the function, including the subsequent
conn->llist_lock access. The NULL-guard at the top of
session_fd_check() stays: a durable reconnect that has already been
through cleanup once leaves fp->conn cleared, and the lock_list loop
would otherwise dereference NULL.
The kernel under test was built with CONFIG_DEBUG_KMEMLEAK,
CONFIG_PROVE_LOCKING, CONFIG_DEBUG_ATOMIC_SLEEP, CONFIG_DEBUG_OBJECTS
and CONFIG_FAILSLAB, plus CONFIG_DEBUG_LIST and
CONFIG_DEBUG_OBJECTS_WORK for the two-tcon stress run.
The __close_file_table_ids() refactor was also exercised under the
same debug kernel with a local test harness that forces
is_reconnectable() to return true so session_fd_check() reaches the
ksmbd_vfs_copy_durable_owner()/down_write(&ci->m_lock) path for
every fp (Linux cifs.ko does not request durable handles in the
generic mount path so the slow path is otherwise not covered). With
the refactor applied the harness traverses the full sleep path and
CONFIG_DEBUG_ATOMIC_SLEEP / CONFIG_PROVE_LOCKING stay silent.
Reverting only the __close_file_table_ids() hunk while keeping the
harness produces:

  BUG: sleeping function called from invalid context at vfs_cache.c:1095
  __might_sleep+0x49/0x60
  [ BUG: Invalid wait context ]

which confirms that the refactor is what keeps ft->lock out of the
sleepable skip() body.
Fixes: ee426bfb9d09 ("ksmbd: add refcnt to ksmbd_conn struct")
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

1 parent de46ee6 commit d3c5628
5 files changed
Lines changed: 248 additions & 38 deletions