Fix deadlocks and crashes seen on free-threaded Python #1133

ngoldbaum · 2025-09-19T19:07:59Z

Closes #1132.

I got to this state after a lot of trial and error and I'd really appreciate it if @taleinat could perhaps look this over, since he also thought about this problem recently.

As I suspected in #1132 (comment), the issue is that BaseObserver.start didn't do any locking at all, so if someone engineers a situation where stop gets called while start is still running, you get a deadlock.

Adding locking in start wasn't quite enough: dispatch_events also needs to check, before dispatching an event, whether another thread has called stop or removed a watch while dispatch_events is running.
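To make the start/stop race concrete, here is a minimal sketch of serializing the two calls with one lock. It is not the code in this PR: MiniObserver, its workers parameter, and the boolean return value of start are all hypothetical names chosen for illustration.

```python
import threading


class MiniObserver:
    """Hypothetical stand-in for an observer that owns worker threads."""

    def __init__(self, workers=2):
        self._lock = threading.Lock()
        self._stopped = threading.Event()
        # Each worker simply blocks until the observer is stopped.
        self._workers = [
            threading.Thread(target=self._stopped.wait, daemon=True)
            for _ in range(workers)
        ]

    def start(self):
        # Hold the lock for the whole startup so stop() cannot interleave
        # with a half-finished start().
        with self._lock:
            if self._stopped.is_set():
                return False  # stop() won the race; start nothing
            for worker in self._workers:
                worker.start()
            return True

    def stop(self):
        with self._lock:
            self._stopped.set()
            # Only threads that were actually started can be joined.
            started = [w for w in self._workers if w.ident is not None]
        # Join outside the lock so nothing waits while holding it.
        for worker in started:
            worker.join(timeout=5)
```

With this shape, a stop that arrives first turns the later start into a no-op, and a stop that arrives during start waits until every worker has been started before joining them.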

I also applied @colesbury's fix for the file handle re-use issue that leads to a Python crash, see #1132 (comment) for more on that.

ngoldbaum · 2025-09-19T19:32:31Z

So it looks like there's still a possible deadlock on Windows that needs to be tracked down - this run is deadlocked on test_tricks: https://github.com/gorakhargosh/watchdog/actions/runs/17867850632/job/50814381030?pr=1133. I'll add a timeout so we can at least get a little more info if it happens.

I'm also not sure if the flaky test failures I'm seeing are "real" given that current master sees some flaky test failures too.

ngoldbaum · 2025-09-19T19:42:50Z

I'm also seeing failures due to the global fixture that tries to check for thread leaks:

watchdog/tests/conftest.py

Lines 23 to 32 in 4bc8f79

    
           @pytest.fixture(autouse=True) 
        
           def _no_thread_leaks(): 
        
               """ 
        
               Fail on thread leak. 
        
               We do not use pytest-threadleak because it is not reliable. 
        
               """ 
        
               old_thread_count = threading.active_count() 
        
               yield 
        
               gc.collect()  # Clear the stuff from other function-level fixtures 
        
               assert threading.active_count() == old_thread_count  # Only previously existing threads

I'm not sure if what the fixture is trying to do makes any sense in the free-threaded build. In particular just calling gc.collect() won't necessarily ensure that other threads will terminate before the assertion happens.
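One way to make such a check less timing-sensitive is to treat the old count as an upper bound and give lingering threads a short grace period. The helper below is only a sketch under that assumption; wait_for_thread_count and its parameters are hypothetical and are not part of watchdog's test suite.

```python
import threading
import time


def wait_for_thread_count(limit, timeout=5.0, interval=0.01):
    """Poll until at most `limit` threads are alive, or `timeout` elapses.

    Returns True if the thread count dropped to `limit` or below in time.
    """
    deadline = time.monotonic() + timeout
    while threading.active_count() > limit:
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

In a fixture like the one quoted above, the final assertion could then become `assert wait_for_thread_count(old_thread_count)`, which tolerates threads that are still shutting down when the test body returns.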

taleinat

I don't see this fixing the issue where .run() may be called after its thread has already been signaled to stop, but that is not checked for. Am I missing it?

I was expecting to see that added to the BaseThread class.

.github/workflows/tests.yml

taleinat · 2025-09-20T10:27:21Z

src/watchdog/observers/api.py

+        # To allow unschedule/stop and safe removal of event handlers
+        # within event handlers itself, check if the handler is still
+        # registered after every dispatch.
+        for handler in self._handlers[watch].copy():


Is .copy() thread-safe? Otherwise it should likely be guarded by a lock as well.

I rearranged so the lock is acquired in the for loop and only gets released during blocking calls, to avoid deadlocks.

Actually I checked again and it looks like I can actually leave the locking as it was before, so never mind.

taleinat · 2025-09-20T10:28:19Z

src/watchdog/observers/api.py

+            with self._lock:
+                if handler not in self._handlers[watch]:
+                    continue
+            handler.dispatch(event)


There may still be a race condition here: At this point the handler could have been removed since the check in the previous line.
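The window described here is a classic check-then-act race. One possible way to close it, sketched below with hypothetical names (MiniDispatcher is not watchdog's BaseObserver), is to do the membership check and the call under a single acquisition of a reentrant lock.

```python
import threading


class MiniDispatcher:
    """Hypothetical dispatcher; names are illustrative only."""

    def __init__(self):
        # Reentrant, so a handler may call remove() from inside dispatch().
        self._lock = threading.RLock()
        self._handlers = []

    def add(self, handler):
        with self._lock:
            self._handlers.append(handler)

    def remove(self, handler):
        with self._lock:
            if handler in self._handlers:
                self._handlers.remove(handler)

    def dispatch(self, event):
        with self._lock:
            snapshot = list(self._handlers)
        for handler in snapshot:
            # Check and call under one lock acquisition: no other thread can
            # remove the handler between the membership test and the call.
            with self._lock:
                if handler in self._handlers:
                    handler(event)
```

The trade-off is that the lock is held while user code runs, so a handler that blocks on another thread needing the same lock can deadlock. That tension is what the locking changes discussed in this thread go back and forth on.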

taleinat · 2025-09-20T10:30:20Z

src/watchdog/observers/read_directory_changes.py

+        whandle = self._whandle
+        if whandle:
+            self._whandle = None


I'm not sure I understand what race this is guarding against, but it seems like it would be better to have a lock around this.

This is guarding against the file handle re-use crash described here. Let me see if locking also works, since that's a lot more explicit.

I tried adding explicit locking, but that deadlocks in test_emitter.py::test_delete_self. The main thread tries to join the emitter thread, but that is blocked on the observer closing the emitter's file handle.
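For readers unfamiliar with the pattern in the diff above, here is a small sketch of the idea. HandleOwner and closer are hypothetical stand-ins for the emitter and the real close call.

```python
class HandleOwner:
    """Hypothetical owner of an OS-level handle.

    `closer` stands in for the real function that releases the handle.
    """

    def __init__(self, handle, closer):
        self._handle = handle
        self._closer = closer

    def close(self):
        # Take a local reference and clear the attribute before closing,
        # so a later caller sees None and never closes a value that the OS
        # may already have handed out again.
        handle = self._handle
        if handle:
            self._handle = None
            self._closer(handle)
```

Without a lock this only narrows the window: two threads could still both read the handle before either clears it, which is why explicit locking was suggested in the review.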

ngoldbaum · 2025-09-24T20:06:11Z

> I was expecting to see that added to the BaseThread class.

Adding an early return to BaseThread.start also seems to fix the deadlock. Wow! Thanks for pointing out my overcomplicating things, I really appreciate the feedback.

Fix deadlock3

ngoldbaum · 2025-09-24T23:00:25Z

> Adding an early return to BaseThread.start also seems to fix the deadlock.

I take this back. It does help with the deadlocks but we still need the other early returns.

I think this is ready for review again now.

ngoldbaum · 2025-09-24T23:15:04Z

The Windows 3.14t job is deadlocking. I hadn't seen that deadlock locally but managed to reproduce it after running the deadlocking test 10 or so times. Here's the tracebacks from all the hung threads: https://gist.github.com/ngoldbaum/b32563967ab629ecb0a84c88efc94a16

Is it possible that the Windows kernel APIs we're calling via ctypes need some kind of global lock to avoid simultaneous calls like this?

ngoldbaum · 2025-09-25T02:59:21Z

src/watchdog/utils/__init__.py

+        if self._stopped_event.is_set():
+            # stop was called while we were doing setup,
+            # so don't actually spawn a thread
+            self.on_thread_stop()


Stop calls this, so it's not necessary I think.
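A minimal sketch of the early-return idea, with the teardown hook called only from stop, might look like the following. StoppableThread and its teardown_calls counter are hypothetical; this is not watchdog's BaseThread.

```python
import threading


class StoppableThread(threading.Thread):
    """Hypothetical thread with setup and teardown hooks."""

    def __init__(self):
        super().__init__(daemon=True)
        self._stopped_event = threading.Event()
        self.teardown_calls = 0  # counts on_thread_stop() invocations

    def on_thread_start(self):
        pass  # setup hook; subclasses may override

    def on_thread_stop(self):
        self.teardown_calls += 1

    def stop(self):
        self._stopped_event.set()
        self.on_thread_stop()

    def start(self):
        self.on_thread_start()
        if self._stopped_event.is_set():
            # stop() ran during setup and already did the teardown,
            # so return without spawning the OS thread.
            return
        super().start()

    def run(self):
        # Block until stop() is called.
        self._stopped_event.wait()
```

Because stop performs the teardown itself, the early-return branch does not need to call on_thread_stop again, and the hook runs once per thread.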

ngoldbaum · 2025-09-25T15:26:30Z

> The Windows 3.14t job is deadlocking.

I think I might understand what's happening here. The docs for CancelIoEx say you're supposed to check the return value and, if it's true, call GetOverlappedResult to poll for the pending cancel to complete. I think in general the wrappers around the Windows kernel APIs in watchdog probably need to check return codes and handle error cases better.

I'm not sure how far I should go in touching the Windows ctypes wrappers, so I think I'll stop here and wait for further code review.

Fix deadlock4

Fix deadlock seen on the free-threaded build

f476645

ngoldbaum force-pushed the fix-deadlock branch from 1210fc1 to 794da9c Compare September 19, 2025 19:10

apply Sam's fix for buggy file handle re-use

4bd5909

ngoldbaum force-pushed the fix-deadlock branch from 794da9c to 4bd5909 Compare September 19, 2025 19:11

force-disable the GIL on free-threaded Mac CI

8bd96b4

ngoldbaum force-pushed the fix-deadlock branch from 79902d8 to 8bd96b4 Compare September 19, 2025 19:24

add a timeout to the tox test runs

0518749

taleinat reviewed Sep 20, 2025

ngoldbaum added 8 commits September 24, 2025 15:25

add locking to avoid file handle re-use on Windows

722b2ac

add early returns in BaseThread.start

8d496db

Revert changes to BaseObserver.dispatch_events

35e585a

release lock before possibly blocking calls

7a49c97

remove lock that leads to a deadlock

d616afa

Merge pull request #5 from ngoldbaum/fix-deadlock3

cda5a88

Fix deadlock3

revert changes to github actions configuration

e7796de

try without releasing lock while calling dispatch

8742748

revert unnecessary moved comment

c813d13

ngoldbaum requested a review from taleinat September 24, 2025 23:02

add a timeout to the python tests

a65e950

ngoldbaum commented Sep 25, 2025

ngoldbaum added 4 commits September 25, 2025 08:18

delete unnecessary repeated call to on_thread_stop

f6eac57

add note bout concurrent use to docstrings

a8772e2

attempt to add global open handle cache

4e7f9ef

make thread leak check use an upper bound

c94854a

Merge pull request #6 from ngoldbaum/fix-deadlock4

c903cfa

Fix deadlock4

davidism mentioned this pull request Nov 26, 2025

GitHub Actions: Add Python 3.14 and 3.14t to the testing pallets/werkzeug#3064

Merged

BoboTiG mentioned this pull request Nov 26, 2025

enable free threading in watchdog_fsevents.c #1140

Open

2 participants