Skip to content

posix: fsyncer pthread_cancel causes UB (locked mutex destroy) and stub leak #4685

Description

@ThalesBarretto

Summary

The posix fsyncer thread is shut down via pthread_cancel, which causes
undefined behavior (destroying a locked mutex) and resource leaks (orphaned
call_stub_t entries) on every shutdown path.

Affected code

Operation File Line
posix_fsyncer main loop xlators/storage/posix/src/posix-helpers.c 2463-2511
posix_fsyncer_pick (cond_wait = cancel point) xlators/storage/posix/src/posix-helpers.c 2379-2396
gf_thread_cleanup_xint (cancel + join) xlators/storage/posix/src/posix-common.c 1315
pthread_mutex_destroy on potentially locked mutex xlators/storage/posix/src/posix-common.c 1327

Root cause

The fsyncer thread runs a for(;;) loop with no exit condition. The only
shutdown mechanism is pthread_cancel via gf_thread_cleanup_xint() in
posix_fini. This has two problems:

1. Mutex destroyed while locked (UB)

posix_fsyncer_pick holds priv->fsync_mutex while calling
pthread_cond_wait, which is a POSIX-defined cancellation point. When
cancellation is delivered during pthread_cond_wait, POSIX.1-2008
Section 2.9.5 specifies the mutex is re-acquired before cleanup handlers
execute. However, posix_fsyncer registers zero pthread_cleanup_push
handlers (confirmed: no instances in the posix source tree).

After pthread_cancel + pthread_join, the mutex remains locked.
posix_fini then calls pthread_mutex_destroy(&priv->fsync_mutex) on a
locked mutex — undefined behavior per POSIX.1-2008.

2. Stub leak

When cancellation kills the fsyncer, stubs in two locations become
unreachable:

  • Local list: After posix_fsyncer_pick splices stubs from
    priv->fsyncs to the stack-local list variable, cancel during
    gf_nanosleep or mid-processing leaves those stubs on a destroyed stack
    frame.

  • priv->fsyncs: Stubs queued after the last pick are never processed.
    posix_fini never drains priv->fsyncs — it goes directly to destroying
    the mutex and freeing priv.

Each leaked call_stub_t holds a call_frame_t * → leaked frames → leaked
fd_t and inode_t references → inode table cannot clean up.

Reproduction

All three shutdown paths are affected:

  • Path A (normal shutdown): glusterfsd exit → posix_fini
  • Path B (init failure): posix_init fails after fsyncer is started →
    posix_fini
  • Path C (brick mux detach): brick detach → xlator_fini_rec
    posix_fini (process stays alive)

Path C is the most severe because the process continues running with the
leaked resources and undefined mutex state.

The bug is deterministic when cancellation is delivered during
pthread_cond_wait (the fsyncer spends most of its time here). The impact
of the UB is implementation-dependent — glibc may silently succeed,
deadlock, or abort depending on the mutex type and internal state.

Impact

  • Undefined behavior in pthread_mutex_destroy (implementation-dependent:
    glibc may silently succeed, deadlock, or abort)
  • Resource leaks proportional to the number of fsync stubs in flight at
    shutdown time (frames, fds, inodes)
  • In the brick mux detach path, leaked inode references may prevent inode
    table cleanup for the detached brick

Suggested fix

Replace pthread_cancel with a cooperative exit flag, following the same
pattern already used by the io_uring thread in the same xlator
(priv->uring_thread_exit in posix-io-uring.c):

  1. Add gf_boolean_t fsyncer_exit to struct posix_private
  2. Add !priv->fsyncer_exit to posix_fsyncer_pick's cond_wait predicate
  3. Have posix_fsyncer_pick drain remaining stubs and return -1 on exit
  4. Have posix_fsyncer process final stubs and break on count < 0
  5. Have posix_fini set the flag under mutex, signal, and pthread_join
    (no pthread_cancel)

This ensures:

  • The mutex is always unlocked before the fsyncer exits
  • All queued stubs are processed (STACK_UNWINDed) before shutdown
  • pthread_mutex_destroy operates on an unlocked mutex

Verification

The fix was verified using SPIN model checking (exhaustive state-space
exploration of all thread interleavings):

Buggy code (pthread_cancel):

  • P1 [] !(mutex_destroyed && fsyncer_holds_mutex)VIOLATED (depth 24, 13 states)
  • P2 [] (priv_freed -> no_stubs_remaining)VIOLATED (depth 35, 23 states)

Fixed code (exit flag):

  • P1 mutex safety — VERIFIED (0 errors, 1358 states, depth 139)
  • P2 no stub leak — VERIFIED (0 errors, 1358 states, depth 139)
  • P3 fini completes (liveness) — VERIFIED (0 errors, 1358 states, depth 139)

The exit-flag approach was also cross-checked at the shutdown composition
level using TLA+ model checking (20 states, invariant holds).

Counterexample: H1 — mutex destroyed while locked

SPIN's shortest counterexample for property P1, showing the exact
interleaving that leads to undefined behavior. Each row is one step;
mutex shows whether fsync_mutex is free, held by the fsyncer, or
destroyed; cancel shows pthread_cancel state.

Step  Thread    Action                               mutex       fsyncs  local  cancel
────  ────────  ───────────────────────────────────   ──────────  ──────  ─────  ────────
 1    fsyncer   enter posix_fsyncer_pick              free        0       0      —
 2    fini      non-deterministic start               free        0       0      —
 3    fini      pthread_cancel → cancel_pending        free        0       0      PENDING
 4    fsyncer   pthread_mutex_lock(fsync_mutex)       HELD        0       0      pending
 5    fsyncer   fsyncs empty → enter cond_wait        HELD        0       0      pending
 6    fsyncer   *** CANCEL DELIVERED at cond_wait     HELD        0       0      DELIVERED
                (POSIX: mutex re-acquired,
                 no pthread_cleanup_push registered
                 → mutex stays locked)
 7    fsyncer   exit (fsyncer_exited = true)          HELD        0       0      —
 8    fini      pthread_join returns                  HELD        0       0      —
 9    fini      pthread_mutex_destroy(fsync_mutex)    DESTROYED   0       0      —
                *** UB: mutex destroyed while locked
                *** fsyncer_holds_mutex=1 && mutex_destroyed=1

Terminal state: mutex_locked=1, mutex_destroyed=1, fsyncer_holds_mutex=1.
Property [] !(mutex_destroyed && fsyncer_holds_mutex) violated.

Counterexample: H2 — stub leak after cancel

SPIN's shortest counterexample for property P2, showing how stubs become
orphaned. This trace goes through the same cond_wait cancel point as H1,
then a late-arriving stub is queued after the fsyncer is already dead.

Step  Thread    Action                               mutex       fsyncs  local  cancel
────  ────────  ───────────────────────────────────   ──────────  ──────  ─────  ────────
 1    fsyncer   enter posix_fsyncer_pick              free        0       0      —
 2    fini      non-deterministic start               free        0       0      —
 3    fini      pthread_cancel → cancel_pending        free        0       0      PENDING
 4    fsyncer   pthread_mutex_lock(fsync_mutex)       HELD        0       0      pending
 5    fsyncer   fsyncs empty → enter cond_wait        HELD        0       0      pending
 6    fsyncer   *** CANCEL DELIVERED at cond_wait     HELD        0       0      DELIVERED
 7    fsyncer   exit (fsyncer_exited = true)          HELD        0       0      —
 8    fini      pthread_join returns                  HELD        0       0      —
 9    fini      pthread_mutex_destroy(fsync_mutex)    DESTROYED   0       0      —
10    producer  enqueue stub → fsyncs_list++          DESTROYED   1       0      —
                (posix_fsync called during shutdown —
                 no guard prevents enqueueing)
11    fini      GF_FREE(priv)                         —           1       0      —
                *** priv_freed=1 with fsyncs_list=1
                *** stub orphaned: no thread will ever
                    process it → frame/fd/inode leak

Terminal state: priv_freed=1, fsyncs_list=1, local_list=0.
Property [] (priv_freed -> (local_list == 0 && fsyncs_list == 0))
violated.

Provenance

The fsyncer was born without a shutdown mechanism. The pthread_cancel
was bolted on 4 years later for brick mux, without modifying the
for(;;) loop. The correct pattern (cooperative exit flag) was
implemented 3 years after that in the same file for io_uring, but
nobody went back to retrofit the fsyncer.

Date Author Commit What happened
2013-07 Anand Avati 132fefeffb Created the fsyncer — for(;;) loop, no shutdown mechanism. Performance optimization to batch AFR's fsync storms. Gerrit #4746.
2017-05 Mohit Agrawal d3c0f96633 Introduced the bug. Added gf_thread_cleanup_xint(priv->fsyncer) to posix_fini for brick mux (BUG 1453977). Correct intent (stop threads on detach), unsafe mechanism (pthread_cancel with no cleanup handler). Gerrit #17356.
2017-11 ShyamsundarR a8325a23c7 Reorganized posix for rio reuse. Fsyncer cleanup moved from posix.c to posix-common.c. Bug preserved unchanged.
2019-04 Raghavendra Bhat b922793588 Fixed the exact same bug class in bit-rot-stub (BZ 1700078). Signer thread cancelled via gf_thread_cleanup_xint while holding mutex → mutex never released → deadlock on re-enable. Fix: pthread_cleanup_push. The posix fsyncer has the same pattern but was not addressed.
2020-10 Ravishankar N 4b9f95b699 Added io_uring support with uring_thread_exitthe correct cooperative exit flag pattern, in the same xlator. The fsyncer was not updated.
2021-10 Dmitry Antipov 0af464c492 Fixed thread leak in posix health_check (PR #2851). Touched the same file but only the health_check thread, not the fsyncer.

The bug has been latent for 9 years (2017–2026).

Prior and related issues

#1338"Add synchronization for POSIX xlator's internal threads" (Dmitry Antipov, 2020). Proposed the same fix: replace pthread_cancel with normal termination for all three posix internal threads (health check, disk
space check, and the fsyncer). A Gerrit patch (#24640) was posted but never merged. The issue was closed by the stale bot after 6 months of inactivity.
Dmitry later got a smaller fix merged as PR #2851 (2021), but it only addressed the health_check thread — the fsyncer was left unchanged.
This issue completes the work #1338 started for the fsyncer.

Other related issues in the same teardown-race family:

Issue Relationship
#4684 Sibling — posix_fini disk_space_check UAF and double-decrement (same investigation)
#4666 Same theme — glfs_fini/open race, shutdown path missing synchronization gate
#4159 Related symptom — io_uring SIGABRT; io_uring has the correct exit flag but its fini is never called
#3727 Brick mux restart crash — locks xlator priv=NULL during teardown (closed wontfix)
#1982 Brick mux test crash — multiplex-limit-issue-151.t intermittent corefile (closed wontfix)

Related: other gf_thread_cleanup_xint call sites

This fix addresses only the posix fsyncer. However, there are 12 other
gf_thread_cleanup_xint call sites across 6 xlators that use the same
pthread_cancel mechanism. Only two have pthread_cleanup_push handlers
to release resources held at cancellation points:

  • bit-rot-stub (b922793588 — fixed in 2019)
  • AFR self-heal thread

The remaining call sites — posix health_check, upcall reaper_thr,
index thread, and leases recall_thr — have no cleanup handlers and
may be exposed to the same pattern if they hold mutexes or resources at
cancellation points. This is out of scope for this fix but may warrant a
broader audit.

Found by

Manual investigation of posix xlator shutdown protocol, verified by SPIN
concurrent model checking.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions