Summary
The posix fsyncer thread is shut down via pthread_cancel, which causes
undefined behavior (destroying a locked mutex) and resource leaks (orphaned
call_stub_t entries) on every shutdown path.
Affected code
| Operation |
File |
Line |
posix_fsyncer main loop |
xlators/storage/posix/src/posix-helpers.c |
2463-2511 |
posix_fsyncer_pick (cond_wait = cancel point) |
xlators/storage/posix/src/posix-helpers.c |
2379-2396 |
gf_thread_cleanup_xint (cancel + join) |
xlators/storage/posix/src/posix-common.c |
1315 |
pthread_mutex_destroy on potentially locked mutex |
xlators/storage/posix/src/posix-common.c |
1327 |
Root cause
The fsyncer thread runs a for(;;) loop with no exit condition. The only
shutdown mechanism is pthread_cancel via gf_thread_cleanup_xint() in
posix_fini. This has two problems:
1. Mutex destroyed while locked (UB)
posix_fsyncer_pick holds priv->fsync_mutex while calling
pthread_cond_wait, which is a POSIX-defined cancellation point. When
cancellation is delivered during pthread_cond_wait, POSIX.1-2008
Section 2.9.5 specifies the mutex is re-acquired before cleanup handlers
execute. However, posix_fsyncer registers zero pthread_cleanup_push
handlers (confirmed: no instances in the posix source tree).
After pthread_cancel + pthread_join, the mutex remains locked.
posix_fini then calls pthread_mutex_destroy(&priv->fsync_mutex) on a
locked mutex — undefined behavior per POSIX.1-2008.
2. Stub leak
When cancellation kills the fsyncer, stubs in two locations become
unreachable:
-
Local list: After posix_fsyncer_pick splices stubs from
priv->fsyncs to the stack-local list variable, cancel during
gf_nanosleep or mid-processing leaves those stubs on a destroyed stack
frame.
-
priv->fsyncs: Stubs queued after the last pick are never processed.
posix_fini never drains priv->fsyncs — it goes directly to destroying
the mutex and freeing priv.
Each leaked call_stub_t holds a call_frame_t * → leaked frames → leaked
fd_t and inode_t references → inode table cannot clean up.
Reproduction
All three shutdown paths are affected:
- Path A (normal shutdown):
glusterfsd exit → posix_fini
- Path B (init failure):
posix_init fails after fsyncer is started →
posix_fini
- Path C (brick mux detach): brick detach →
xlator_fini_rec →
posix_fini (process stays alive)
Path C is the most severe because the process continues running with the
leaked resources and undefined mutex state.
The bug is deterministic when cancellation is delivered during
pthread_cond_wait (the fsyncer spends most of its time here). The impact
of the UB is implementation-dependent — glibc may silently succeed,
deadlock, or abort depending on the mutex type and internal state.
Impact
- Undefined behavior in
pthread_mutex_destroy (implementation-dependent:
glibc may silently succeed, deadlock, or abort)
- Resource leaks proportional to the number of fsync stubs in flight at
shutdown time (frames, fds, inodes)
- In the brick mux detach path, leaked inode references may prevent inode
table cleanup for the detached brick
Suggested fix
Replace pthread_cancel with a cooperative exit flag, following the same
pattern already used by the io_uring thread in the same xlator
(priv->uring_thread_exit in posix-io-uring.c):
- Add
gf_boolean_t fsyncer_exit to struct posix_private
- Add
!priv->fsyncer_exit to posix_fsyncer_pick's cond_wait predicate
- Have
posix_fsyncer_pick drain remaining stubs and return -1 on exit
- Have
posix_fsyncer process final stubs and break on count < 0
- Have
posix_fini set the flag under mutex, signal, and pthread_join
(no pthread_cancel)
This ensures:
- The mutex is always unlocked before the fsyncer exits
- All queued stubs are processed (STACK_UNWINDed) before shutdown
pthread_mutex_destroy operates on an unlocked mutex
Verification
The fix was verified using SPIN model checking (exhaustive state-space
exploration of all thread interleavings):
Buggy code (pthread_cancel):
- P1
[] !(mutex_destroyed && fsyncer_holds_mutex) — VIOLATED (depth 24, 13 states)
- P2
[] (priv_freed -> no_stubs_remaining) — VIOLATED (depth 35, 23 states)
Fixed code (exit flag):
- P1 mutex safety — VERIFIED (0 errors, 1358 states, depth 139)
- P2 no stub leak — VERIFIED (0 errors, 1358 states, depth 139)
- P3 fini completes (liveness) — VERIFIED (0 errors, 1358 states, depth 139)
The exit-flag approach was also cross-checked at the shutdown composition
level using TLA+ model checking (20 states, invariant holds).
Counterexample: H1 — mutex destroyed while locked
SPIN's shortest counterexample for property P1, showing the exact
interleaving that leads to undefined behavior. Each row is one step;
mutex shows whether fsync_mutex is free, held by the fsyncer, or
destroyed; cancel shows pthread_cancel state.
Step Thread Action mutex fsyncs local cancel
──── ──────── ─────────────────────────────────── ────────── ────── ───── ────────
1 fsyncer enter posix_fsyncer_pick free 0 0 —
2 fini non-deterministic start free 0 0 —
3 fini pthread_cancel → cancel_pending free 0 0 PENDING
4 fsyncer pthread_mutex_lock(fsync_mutex) HELD 0 0 pending
5 fsyncer fsyncs empty → enter cond_wait HELD 0 0 pending
6 fsyncer *** CANCEL DELIVERED at cond_wait HELD 0 0 DELIVERED
(POSIX: mutex re-acquired,
no pthread_cleanup_push registered
→ mutex stays locked)
7 fsyncer exit (fsyncer_exited = true) HELD 0 0 —
8 fini pthread_join returns HELD 0 0 —
9 fini pthread_mutex_destroy(fsync_mutex) DESTROYED 0 0 —
*** UB: mutex destroyed while locked
*** fsyncer_holds_mutex=1 && mutex_destroyed=1
Terminal state: mutex_locked=1, mutex_destroyed=1, fsyncer_holds_mutex=1.
Property [] !(mutex_destroyed && fsyncer_holds_mutex) violated.
Counterexample: H2 — stub leak after cancel
SPIN's shortest counterexample for property P2, showing how stubs become
orphaned. This trace goes through the same cond_wait cancel point as H1,
then a late-arriving stub is queued after the fsyncer is already dead.
Step Thread Action mutex fsyncs local cancel
──── ──────── ─────────────────────────────────── ────────── ────── ───── ────────
1 fsyncer enter posix_fsyncer_pick free 0 0 —
2 fini non-deterministic start free 0 0 —
3 fini pthread_cancel → cancel_pending free 0 0 PENDING
4 fsyncer pthread_mutex_lock(fsync_mutex) HELD 0 0 pending
5 fsyncer fsyncs empty → enter cond_wait HELD 0 0 pending
6 fsyncer *** CANCEL DELIVERED at cond_wait HELD 0 0 DELIVERED
7 fsyncer exit (fsyncer_exited = true) HELD 0 0 —
8 fini pthread_join returns HELD 0 0 —
9 fini pthread_mutex_destroy(fsync_mutex) DESTROYED 0 0 —
10 producer enqueue stub → fsyncs_list++ DESTROYED 1 0 —
(posix_fsync called during shutdown —
no guard prevents enqueueing)
11 fini GF_FREE(priv) — 1 0 —
*** priv_freed=1 with fsyncs_list=1
*** stub orphaned: no thread will ever
process it → frame/fd/inode leak
Terminal state: priv_freed=1, fsyncs_list=1, local_list=0.
Property [] (priv_freed -> (local_list == 0 && fsyncs_list == 0))
violated.
Provenance
The fsyncer was born without a shutdown mechanism. The pthread_cancel
was bolted on 4 years later for brick mux, without modifying the
for(;;) loop. The correct pattern (cooperative exit flag) was
implemented 3 years after that in the same file for io_uring, but
nobody went back to retrofit the fsyncer.
| Date |
Author |
Commit |
What happened |
| 2013-07 |
Anand Avati |
132fefeffb |
Created the fsyncer — for(;;) loop, no shutdown mechanism. Performance optimization to batch AFR's fsync storms. Gerrit #4746. |
| 2017-05 |
Mohit Agrawal |
d3c0f96633 |
Introduced the bug. Added gf_thread_cleanup_xint(priv->fsyncer) to posix_fini for brick mux (BUG 1453977). Correct intent (stop threads on detach), unsafe mechanism (pthread_cancel with no cleanup handler). Gerrit #17356. |
| 2017-11 |
ShyamsundarR |
a8325a23c7 |
Reorganized posix for rio reuse. Fsyncer cleanup moved from posix.c to posix-common.c. Bug preserved unchanged. |
| 2019-04 |
Raghavendra Bhat |
b922793588 |
Fixed the exact same bug class in bit-rot-stub (BZ 1700078). Signer thread cancelled via gf_thread_cleanup_xint while holding mutex → mutex never released → deadlock on re-enable. Fix: pthread_cleanup_push. The posix fsyncer has the same pattern but was not addressed. |
| 2020-10 |
Ravishankar N |
4b9f95b699 |
Added io_uring support with uring_thread_exit — the correct cooperative exit flag pattern, in the same xlator. The fsyncer was not updated. |
| 2021-10 |
Dmitry Antipov |
0af464c492 |
Fixed thread leak in posix health_check (PR #2851). Touched the same file but only the health_check thread, not the fsyncer. |
The bug has been latent for 9 years (2017–2026).
Prior and related issues
#1338 — "Add synchronization for POSIX xlator's internal threads" (Dmitry Antipov, 2020). Proposed the same fix: replace pthread_cancel with normal termination for all three posix internal threads (health check, disk
space check, and the fsyncer). A Gerrit patch (#24640) was posted but never merged. The issue was closed by the stale bot after 6 months of inactivity.
Dmitry later got a smaller fix merged as PR #2851 (2021), but it only addressed the health_check thread — the fsyncer was left unchanged.
This issue completes the work #1338 started for the fsyncer.
Other related issues in the same teardown-race family:
| Issue |
Relationship |
| #4684 |
Sibling — posix_fini disk_space_check UAF and double-decrement (same investigation) |
| #4666 |
Same theme — glfs_fini/open race, shutdown path missing synchronization gate |
| #4159 |
Related symptom — io_uring SIGABRT; io_uring has the correct exit flag but its fini is never called |
| #3727 |
Brick mux restart crash — locks xlator priv=NULL during teardown (closed wontfix) |
| #1982 |
Brick mux test crash — multiplex-limit-issue-151.t intermittent corefile (closed wontfix) |
Related: other gf_thread_cleanup_xint call sites
This fix addresses only the posix fsyncer. However, there are 12 other
gf_thread_cleanup_xint call sites across 6 xlators that use the same
pthread_cancel mechanism. Only two have pthread_cleanup_push handlers
to release resources held at cancellation points:
- bit-rot-stub (
b922793588 — fixed in 2019)
- AFR self-heal thread
The remaining call sites — posix health_check, upcall reaper_thr,
index thread, and leases recall_thr — have no cleanup handlers and
may be exposed to the same pattern if they hold mutexes or resources at
cancellation points. This is out of scope for this fix but may warrant a
broader audit.
Found by
Manual investigation of posix xlator shutdown protocol, verified by SPIN
concurrent model checking.
Summary
The posix fsyncer thread is shut down via
pthread_cancel, which causesundefined behavior (destroying a locked mutex) and resource leaks (orphaned
call_stub_tentries) on every shutdown path.Affected code
posix_fsyncermain loopxlators/storage/posix/src/posix-helpers.cposix_fsyncer_pick(cond_wait = cancel point)xlators/storage/posix/src/posix-helpers.cgf_thread_cleanup_xint(cancel + join)xlators/storage/posix/src/posix-common.cpthread_mutex_destroyon potentially locked mutexxlators/storage/posix/src/posix-common.cRoot cause
The fsyncer thread runs a
for(;;)loop with no exit condition. The onlyshutdown mechanism is
pthread_cancelviagf_thread_cleanup_xint()inposix_fini. This has two problems:1. Mutex destroyed while locked (UB)
posix_fsyncer_pickholdspriv->fsync_mutexwhile callingpthread_cond_wait, which is a POSIX-defined cancellation point. Whencancellation is delivered during
pthread_cond_wait, POSIX.1-2008Section 2.9.5 specifies the mutex is re-acquired before cleanup handlers
execute. However,
posix_fsyncerregisters zeropthread_cleanup_pushhandlers (confirmed: no instances in the posix source tree).
After
pthread_cancel+pthread_join, the mutex remains locked.posix_finithen callspthread_mutex_destroy(&priv->fsync_mutex)on alocked mutex — undefined behavior per POSIX.1-2008.
2. Stub leak
When cancellation kills the fsyncer, stubs in two locations become
unreachable:
Local list: After
posix_fsyncer_picksplices stubs frompriv->fsyncsto the stack-locallistvariable, cancel duringgf_nanosleepor mid-processing leaves those stubs on a destroyed stackframe.
priv->fsyncs: Stubs queued after the last pick are never processed.posix_fininever drainspriv->fsyncs— it goes directly to destroyingthe mutex and freeing priv.
Each leaked
call_stub_tholds acall_frame_t *→ leaked frames → leakedfd_tandinode_treferences → inode table cannot clean up.Reproduction
All three shutdown paths are affected:
glusterfsdexit →posix_finiposix_initfails after fsyncer is started →posix_finixlator_fini_rec→posix_fini(process stays alive)Path C is the most severe because the process continues running with the
leaked resources and undefined mutex state.
The bug is deterministic when cancellation is delivered during
pthread_cond_wait(the fsyncer spends most of its time here). The impactof the UB is implementation-dependent — glibc may silently succeed,
deadlock, or abort depending on the mutex type and internal state.
Impact
pthread_mutex_destroy(implementation-dependent:glibc may silently succeed, deadlock, or abort)
shutdown time (frames, fds, inodes)
table cleanup for the detached brick
Suggested fix
Replace
pthread_cancelwith a cooperative exit flag, following the samepattern already used by the io_uring thread in the same xlator
(
priv->uring_thread_exitinposix-io-uring.c):gf_boolean_t fsyncer_exittostruct posix_private!priv->fsyncer_exittoposix_fsyncer_pick's cond_wait predicateposix_fsyncer_pickdrain remaining stubs and return -1 on exitposix_fsyncerprocess final stubs and break on count < 0posix_finiset the flag under mutex, signal, andpthread_join(no
pthread_cancel)This ensures:
pthread_mutex_destroyoperates on an unlocked mutexVerification
The fix was verified using SPIN model checking (exhaustive state-space
exploration of all thread interleavings):
Buggy code (pthread_cancel):
[] !(mutex_destroyed && fsyncer_holds_mutex)— VIOLATED (depth 24, 13 states)[] (priv_freed -> no_stubs_remaining)— VIOLATED (depth 35, 23 states)Fixed code (exit flag):
The exit-flag approach was also cross-checked at the shutdown composition
level using TLA+ model checking (20 states, invariant holds).
Counterexample: H1 — mutex destroyed while locked
SPIN's shortest counterexample for property P1, showing the exact
interleaving that leads to undefined behavior. Each row is one step;
mutexshows whetherfsync_mutexis free, held by the fsyncer, ordestroyed;
cancelshowspthread_cancelstate.Terminal state:
mutex_locked=1, mutex_destroyed=1, fsyncer_holds_mutex=1.Property
[] !(mutex_destroyed && fsyncer_holds_mutex)violated.Counterexample: H2 — stub leak after cancel
SPIN's shortest counterexample for property P2, showing how stubs become
orphaned. This trace goes through the same cond_wait cancel point as H1,
then a late-arriving stub is queued after the fsyncer is already dead.
Terminal state:
priv_freed=1, fsyncs_list=1, local_list=0.Property
[] (priv_freed -> (local_list == 0 && fsyncs_list == 0))violated.
Provenance
The fsyncer was born without a shutdown mechanism. The
pthread_cancelwas bolted on 4 years later for brick mux, without modifying the
for(;;)loop. The correct pattern (cooperative exit flag) wasimplemented 3 years after that in the same file for io_uring, but
nobody went back to retrofit the fsyncer.
132fefeffbfor(;;)loop, no shutdown mechanism. Performance optimization to batch AFR's fsync storms. Gerrit #4746.d3c0f96633gf_thread_cleanup_xint(priv->fsyncer)toposix_finifor brick mux (BUG 1453977). Correct intent (stop threads on detach), unsafe mechanism (pthread_cancelwith no cleanup handler). Gerrit #17356.a8325a23c7posix.ctoposix-common.c. Bug preserved unchanged.b922793588gf_thread_cleanup_xintwhile holding mutex → mutex never released → deadlock on re-enable. Fix:pthread_cleanup_push. The posix fsyncer has the same pattern but was not addressed.4b9f95b699uring_thread_exit— the correct cooperative exit flag pattern, in the same xlator. The fsyncer was not updated.0af464c492The bug has been latent for 9 years (2017–2026).
Prior and related issues
#1338 — "Add synchronization for POSIX xlator's internal threads" (Dmitry Antipov, 2020). Proposed the same fix: replace
pthread_cancelwith normal termination for all three posix internal threads (health check, diskspace check, and the fsyncer). A Gerrit patch (#24640) was posted but never merged. The issue was closed by the stale bot after 6 months of inactivity.
Dmitry later got a smaller fix merged as PR #2851 (2021), but it only addressed the health_check thread — the fsyncer was left unchanged.
This issue completes the work #1338 started for the fsyncer.
Other related issues in the same teardown-race family:
posix_finidisk_space_check UAF and double-decrement (same investigation)priv=NULLduring teardown (closed wontfix)Related: other
gf_thread_cleanup_xintcall sitesThis fix addresses only the posix fsyncer. However, there are 12 other
gf_thread_cleanup_xintcall sites across 6 xlators that use the samepthread_cancelmechanism. Only two havepthread_cleanup_pushhandlersto release resources held at cancellation points:
b922793588— fixed in 2019)The remaining call sites — posix
health_check, upcallreaper_thr,index
thread, and leasesrecall_thr— have no cleanup handlers andmay be exposed to the same pattern if they hold mutexes or resources at
cancellation points. This is out of scope for this fix but may warrant a
broader audit.
Found by
Manual investigation of posix xlator shutdown protocol, verified by SPIN
concurrent model checking.