@ngoldbaum
Contributor

@ngoldbaum ngoldbaum commented Sep 19, 2025

Closes #1132.

I got to this state after a lot of trial and error and I'd really appreciate it if @taleinat could perhaps look this over, since he also thought about this problem recently.

As I suspected in #1132 (comment), the issue is that BaseObserver.start didn't do any locking at all, so if someone engineers a situation where stop gets called while start is still running, you get a deadlock.

Adding locking in start wasn't quite enough. You also need to check in dispatch_events, before dispatching an event, whether another thread called stop or removed a watch while dispatch_events was running.
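To make the two fixes above concrete, here is a minimal sketch of the idea. It is not the code in this PR: `SketchObserver`, `schedule` and `dispatch_event` are hypothetical stand-ins, and the real BaseObserver is more involved. It only shows start and stop serialized by one lock, plus a re-check of the stop flag and the handler registration right before each dispatch.

```python
import threading


class SketchObserver:
    """Hypothetical stand-in for an observer; NOT watchdog's BaseObserver."""

    def __init__(self):
        self._lock = threading.RLock()
        self._stopped = threading.Event()
        self._handlers = {}  # watch -> list of handler callables
        self.started = False

    def schedule(self, watch, handler):
        with self._lock:
            self._handlers.setdefault(watch, []).append(handler)

    def start(self):
        # Holding the lock makes start and stop mutually exclusive.
        with self._lock:
            if self._stopped.is_set():
                return  # stop was called first, so do nothing
            self.started = True

    def stop(self):
        with self._lock:
            self._stopped.set()
            self._handlers.clear()

    def dispatch_event(self, watch, event):
        # Snapshot the handlers under the lock, then re-check before each call.
        with self._lock:
            handlers = list(self._handlers.get(watch, ()))
        delivered = 0
        for handler in handlers:
            with self._lock:
                gone = handler not in self._handlers.get(watch, ())
                if self._stopped.is_set() or gone:
                    continue
            handler(event)  # called without the lock held
            delivered += 1
        return delivered
```

Note that the handler call itself still happens outside the lock in this sketch, so a stop that lands between the check and the call is not prevented, only made less likely.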

I also applied @colesbury's fix for the file handle re-use issue that leads to a Python crash, see #1132 (comment) for more on that.

@ngoldbaum
Contributor Author

ngoldbaum commented Sep 19, 2025

So it looks like there's still a possible deadlock on Windows that needs to be tracked down - this run is deadlocked on test_tricks: https://github.com/gorakhargosh/watchdog/actions/runs/17867850632/job/50814381030?pr=1133. I'll add a timeout so we can at least get a little more info if it happens.

I'm also not sure if the flaky test failures I'm seeing are "real" given that current master sees some flaky test failures too.

@ngoldbaum
Contributor Author

ngoldbaum commented Sep 19, 2025

I'm also seeing failures due to the global fixture that tries to check for thread leaks:

@pytest.fixture(autouse=True)
def _no_thread_leaks():
    """
    Fail on thread leak.
    We do not use pytest-threadleak because it is not reliable.
    """
    old_thread_count = threading.active_count()
    yield
    gc.collect()  # Clear the stuff from other function-level fixtures
    assert threading.active_count() == old_thread_count  # Only previously existing threads

I'm not sure whether what the fixture is trying to do makes sense in the free-threaded build. In particular, calling gc.collect() alone won't necessarily ensure that other threads have terminated before the assertion runs.
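One possible alternative, sketched here only as an illustration: poll the thread count with a deadline instead of asserting immediately after gc.collect(). The helper name `wait_for_thread_count` and its parameters are made up for this example and are not part of watchdog or pytest. It also uses `<=` rather than the fixture's `==`, which is an assumption about what the check should tolerate.

```python
import threading
import time


def wait_for_thread_count(expected, timeout=2.0, interval=0.01):
    """Poll until threading.active_count() drops to `expected` or time runs out.

    Hypothetical helper: returns True if the count settled in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if threading.active_count() <= expected:
            return True
        time.sleep(interval)
    # One last check after the deadline.
    return threading.active_count() <= expected


if __name__ == "__main__":
    before = threading.active_count()
    worker = threading.Thread(target=time.sleep, args=(0.05,))
    worker.start()
    # The worker is given time to exit instead of being checked right away.
    print(wait_for_thread_count(before))
```

A fixture built on this would still fail on a real leak, just after the timeout rather than instantly.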

Contributor

@taleinat taleinat left a comment

I don't see this fixing the issue where .run() may be called after its thread has already been signaled to stop, but that is not checked for. Am I missing it?

I was expecting to see that added to the BaseThread class.

# To allow unschedule/stop and safe removal of event handlers
# within event handlers itself, check if the handler is still
# registered after every dispatch.
for handler in self._handlers[watch].copy():
Contributor

Is .copy() thread-safe? Otherwise it should likely be guarded by a lock as well.

Contributor Author

I rearranged so the lock is acquired in the for loop and only gets released during blocking calls, to avoid deadlocks.

Contributor Author

Actually I checked again and it looks like I can actually leave the locking as it was before, so never mind.

with self._lock:
    if handler not in self._handlers[watch]:
        continue
handler.dispatch(event)
Contributor

There may still be a race condition here: At this point the handler could have been removed since the check in the previous line.
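The window described here is a check-then-act gap. The sketch below contrasts the two options under stated assumptions: `Registry` and its methods are hypothetical names for illustration, not watchdog code. The first variant checks under the lock but calls the handler outside it. The second keeps the check and the call atomic with respect to removal, at the cost of running handler code while the lock is held.

```python
import threading


class Registry:
    """Hypothetical handler registry, used only to illustrate the race."""

    def __init__(self):
        # Re-entrant so a handler may call remove() from inside a dispatch.
        self._lock = threading.RLock()
        self._handlers = []

    def add(self, handler):
        with self._lock:
            self._handlers.append(handler)

    def remove(self, handler):
        with self._lock:
            if handler in self._handlers:
                self._handlers.remove(handler)

    def dispatch_check_then_call(self, event):
        with self._lock:
            snapshot = list(self._handlers)
        for handler in snapshot:
            with self._lock:
                if handler not in self._handlers:
                    continue
            # Window: another thread may remove the handler right here.
            handler(event)

    def dispatch_call_under_lock(self, event):
        # No window, but user code now runs with the lock held, which can
        # deadlock if a handler waits on a thread that needs this lock.
        with self._lock:
            for handler in list(self._handlers):
                if handler in self._handlers:
                    handler(event)
```

Which trade-off is acceptable depends on what handlers are allowed to do. If handlers can block on other threads, the second variant swaps a race for a potential deadlock.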

Comment on lines +61 to +63
whandle = self._whandle
if whandle:
    self._whandle = None
Contributor

I'm not sure I understand what race this is guarding against, but it seems like it would be better to have a lock around this.

Contributor Author

This is guarding against the file handle re-use crash described here. Let me see if locking also works, since that's a lot more explicit.

Contributor Author

I tried adding explicit locking, but that deadlocks in test_emitter.py::test_delete_self. The main thread tries to join the emitter thread, but that is blocked on the observer closing the emitter's file handle.
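For readers following along, the lock-free pattern under discussion can be sketched roughly as below. This is an illustration only: `HandleOwner` and the injected `close_func` are hypothetical, and what the PR does after clearing the attribute is assumed here rather than shown in the snippet above.

```python
class HandleOwner:
    """Hypothetical owner of an OS handle; not the emitter class in watchdog."""

    def __init__(self, handle, close_func):
        self._whandle = handle
        self._close = close_func  # assumed: a callable that closes the handle

    def close(self):
        # Take a local copy, then clear the attribute BEFORE closing, so that
        # later readers see None instead of a value the OS may have reused.
        whandle = self._whandle
        if whandle:
            self._whandle = None
            self._close(whandle)


if __name__ == "__main__":
    closed = []
    owner = HandleOwner(42, closed.append)
    owner.close()
    owner.close()  # second call is a no-op
    print(closed)
```

The read and the clear are two separate steps, so this narrows the window for a double close without eliminating it, which is presumably why a lock was suggested.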

@ngoldbaum
Contributor Author

> I was expecting to see that added to the BaseThread class.

Adding an early return to BaseThread.start also seems to fix the deadlock. Wow! Thanks for pointing out my overcomplicating things, I really appreciate the feedback.

@ngoldbaum
Contributor Author

> Adding an early return to BaseThread.start also seems to fix the deadlock.

I take this back. It does help with the deadlocks but we still need the other early returns.
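As a rough picture of what "early returns" means in this thread, here is a sketch under assumptions: `SketchThread` is a hypothetical class, not watchdog's BaseThread, and the real change may differ in detail. It checks the stop flag both before spawning the thread and again at the top of run.

```python
import threading


class SketchThread(threading.Thread):
    """Hypothetical thread wrapper; NOT watchdog's BaseThread."""

    def __init__(self):
        super().__init__(daemon=True)
        self._stopped_event = threading.Event()
        self.ran = False

    def stop(self):
        self._stopped_event.set()

    def start(self):
        if self._stopped_event.is_set():
            return  # stop was already requested: never spawn the thread
        super().start()

    def run(self):
        if self._stopped_event.is_set():
            return  # stop arrived between start() and run()
        self.ran = True
        self._stopped_event.wait()  # stand-in for the real work loop
```

The check in start alone still leaves a gap between the check and the moment run begins, which is what the second check in run covers.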

I think this is ready for review again now.

@ngoldbaum ngoldbaum requested a review from taleinat September 24, 2025 23:02
@ngoldbaum
Contributor Author

ngoldbaum commented Sep 24, 2025

The Windows 3.14t job is deadlocking. I hadn't seen that deadlock locally but managed to reproduce it after running the deadlocking test 10 or so times. Here's the tracebacks from all the hung threads: https://gist.github.com/ngoldbaum/b32563967ab629ecb0a84c88efc94a16

Is it possible that the Windows kernel APIs we're calling via ctypes need some kind of global lock to avoid simultaneous calls like this?

if self._stopped_event.is_set():
    # stop was called while we were doing setup,
    # so don't actually spawn a thread
    self.on_thread_stop()
Contributor Author

Stop calls this, so it's not necessary I think.

@ngoldbaum
Contributor Author

> The Windows 3.14t job is deadlocking.

I think I might understand what's happening here. The docs for CancelIoEx say you're supposed to check the return value and, if it's true, call GetOverlappedResult to poll for the pending cancel to complete. I think in general the wrappers around the Windows kernel APIs in watchdog probably need to check return codes and handle error cases better.
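A hedged sketch of what such a check could look like follows. The function `cancel_and_wait` and its injected parameters are hypothetical, and it assumes the I/O was issued with an OVERLAPPED structure, which may not match how watchdog currently calls these APIs. On Windows, `kernel32` would be something like `ctypes.WinDLL("kernel32", use_last_error=True)` and `get_last_error` would be `ctypes.get_last_error`. Both are passed in so the logic can be exercised without Windows.

```python
import ctypes

ERROR_NOT_FOUND = 1168         # CancelIoEx: no pending I/O to cancel
ERROR_OPERATION_ABORTED = 995  # GetOverlappedResult: the I/O was cancelled


def cancel_and_wait(kernel32, get_last_error, handle, overlapped):
    """Cancel pending I/O on `handle`, then wait for the cancel to finish.

    Hypothetical helper. Returns True if something was pending and has now
    finished, False if nothing was pending. Raises OSError otherwise.
    """
    if not kernel32.CancelIoEx(handle, overlapped):
        err = get_last_error()
        if err == ERROR_NOT_FOUND:
            return False
        raise OSError(err, "CancelIoEx failed")

    transferred = ctypes.c_uint32(0)  # DWORD out-parameter
    # bWait=True blocks until the cancelled operation has actually completed.
    ok = kernel32.GetOverlappedResult(
        handle, overlapped, ctypes.byref(transferred), True
    )
    if not ok:
        err = get_last_error()
        # A cancelled operation is expected to report "operation aborted".
        if err != ERROR_OPERATION_ABORTED:
            raise OSError(err, "GetOverlappedResult failed")
    return True
```

Treating ERROR_NOT_FOUND as "nothing to do" rather than an error is a design choice in this sketch, not something the discussion above settles.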

I'm not sure how far I should go in touching the Windows ctypes wrappers, so I think I'll stop here and wait for further code review.

Development

Successfully merging this pull request may close these issues.

test_tricks_from_file[tricks-from] crashes Python on 3.14t on Windows