Fix a potential shutdown hang with Environment.Exit()
on Windows depending on timing with GCs
#91739
- On Windows, when a thread calls `ExitProcess`, the `TlsDestructionMonitor` for the thread appears to be destructed after all other threads in the process are torn down. It's possible for a GC to be in progress during that time, and the thread cleanup code in `TlsDestructionMonitor` tries to enter cooperative GC mode to fix the frame pointer, leading to a hang. Fixed by deactivating the `TlsDestructionMonitor` for the thread before calling `ExitProcess`.
- Also disabled the relevant test due to a different issue, #83658, occurring in the same test on multiple platforms/architectures that is not understood yet.

Fixes #84006
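To illustrate the idea behind the fix, here is a minimal, portable sketch of a thread-local monitor whose destructor runs per-thread cleanup unless it has been deactivated first. It is not the CoreCLR implementation: `TlsMonitorSketch`, `g_cleanupRuns`, and `RunThreadAndCountCleanups` are hypothetical names, and an ordinary thread exit stands in for the `ExitProcess` path described above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the runtime's per-thread cleanup: counts how often it ran.
static std::atomic<int> g_cleanupRuns{0};

// Illustrative monitor. Its destructor runs when the owning thread's
// thread_local storage is destroyed at thread exit.
class TlsMonitorSketch {
public:
    void Activate() { m_active = true; }
    void Deactivate() { m_active = false; }

    ~TlsMonitorSketch() {
        if (m_active) {
            // Assumption: this is where real thread cleanup would run, and
            // where it could block if it had to wait for an in-progress GC.
            g_cleanupRuns.fetch_add(1);
        }
    }

private:
    bool m_active = false;
};

static thread_local TlsMonitorSketch t_monitor;

// Starts a thread that activates its monitor and optionally deactivates it
// before finishing. Returns how many cleanups ran for that thread.
int RunThreadAndCountCleanups(bool deactivateBeforeExit) {
    g_cleanupRuns.store(0);
    std::thread worker([deactivateBeforeExit] {
        t_monitor.Activate();
        if (deactivateBeforeExit) {
            t_monitor.Deactivate();  // cleanup in the destructor is skipped
        }
    });
    worker.join();  // thread_local destructors have completed by the time join returns
    return g_cleanupRuns.load();
}
```

With the monitor left active the cleanup runs once; with it deactivated the destructor does nothing, which is the behavior the description relies on to avoid entering cooperative GC mode during process teardown.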
Tagging subscribers to this area: @mangod9 Issue Details
Fixes #84006
|
Assume this isn't a regression and the issue started showing up now due to timing changes?
The issue and test are a bit sensitive to timing, it seems to occur more frequently in the test with |
LGTM, thank you!
Can this change introduce deadlocks or data corruption? Are there places in the runtime where suppressing unregistration of the exiting thread is going to lead to an orphaned pointer? |
For example, what is going to happen if a thread exits right after |
Is it possible for a thread to exit right after My rationale was that Also I was thinking the pointers relevant to T1 would continue to be valid (like for walking the stack) until the OS starts tearing down all threads. If it's possible that the memory behind those pointers would be deallocated while another thread is still running, then it would be problematic, but the same issue would have existed back when thread cleanup was done upon |
A different example: |
Just curious, why does any code at all run on Environment.Exit? What is expected to happen other than rapid process termination? |
|
I think returning from the main function would be ok, even if a 3rd party component owns main, because there wouldn't be any managed frames on the stack, and the frame pointer of that thread would be the top frame. If a thread that has managed frames on the stack calls
Aside from |
Note that this comment runtime/src/coreclr/vm/ceemain.cpp Lines 1742 to 1744 in 773c180
ExitThread . I do not see how this can help to make ExitThread work better. And even if it did make ExitThread work better in some cases, I would not have a problem with breaking it. ExitThread on a managed thread is one of those cases that are always going to work poorly.
|
I had done some brief testing with I don't think this change would make |
Would switching to
We do not set any flags before exiting the process in native AOT. It would be best for CoreCLR to be on the same plan. |
I did some testing, it looks like while the I also checked out runtime/src/coreclr/nativeaot/Runtime/startup.cpp Lines 324 to 330 in d34ed28
It looks like the Also noticed the following: runtime/src/coreclr/nativeaot/Runtime/startup.cpp Lines 332 to 336 in d34ed28
Which should probably also be checked in CoreCLR, I'll update to check the thread ID on Unixes. |
Updated to check the thread ID on Unixes in the |
I still think that setting the flag that suppresses thread exit notifications before exiting is going in a wrong direction. I think we should rather go in the direction of deleting or rearranging the problematic pieces in the thread shutdown notification. It is the plan that native AOT is on. The thread shutdown notification does very little work on native AOT. If it is not possible for some reason, it likely means that there is a problem in native AOT too. |
If it weren't for |
I would not mind regressing |
I believe it would be broken, my understanding is that the main purpose of entering cooperative GC mode is to ensure that no other thread is walking that thread's stack before the thread's stack is deallocated, which would make those frame pointers invalid. |
The thread store lock is taken when the runtime is suspended and the stackwalks are happening. The thread detach takes the thread store lock to delete the thread from the thread store. The thread detach will wait on the thread store lock until the runtime is resumed. It should prevent stack walk from ever encountering deallocated thread stack. |
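The locking scheme described in this comment can be sketched roughly as follows. This is a simplified model under stated assumptions, not runtime code: `ThreadStoreSketch` and its members are hypothetical, a `std::mutex` stands in for the thread store lock, and "suspension" is modeled only as holding that lock.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical model of a thread store guarded by a single lock.
class ThreadStoreSketch {
public:
    void Attach(int id) {
        std::lock_guard<std::mutex> guard(m_lock);
        m_threadIds.push_back(id);
    }

    // Detach needs the lock, so it blocks for as long as the runtime is suspended.
    void Detach(int id) {
        std::lock_guard<std::mutex> guard(m_lock);
        m_threadIds.erase(std::remove(m_threadIds.begin(), m_threadIds.end(), id),
                          m_threadIds.end());
    }

    // Suspension holds the lock until the matching resume (same thread).
    void SuspendRuntime() { m_lock.lock(); }
    void ResumeRuntime() { m_lock.unlock(); }

    // Only valid while the caller holds the lock via SuspendRuntime().
    bool ContainsWhileSuspended(int id) const {
        return std::find(m_threadIds.begin(), m_threadIds.end(), id) != m_threadIds.end();
    }

    bool Contains(int id) {
        std::lock_guard<std::mutex> guard(m_lock);
        return ContainsWhileSuspended(id);
    }

private:
    std::mutex m_lock;
    std::vector<int> m_threadIds;
};

// Returns true if thread 1 stayed registered during the suspension
// and was removed only after the runtime resumed.
bool DetachWaitsForResume() {
    ThreadStoreSketch store;
    store.Attach(1);

    store.SuspendRuntime();
    std::thread detaching([&store] { store.Detach(1); });  // blocks on the lock
    bool visibleDuringWalk = store.ContainsWhileSuspended(1);
    store.ResumeRuntime();

    detaching.join();
    return visibleDuringWalk && !store.Contains(1);
}
```

Because the detaching thread cannot acquire the lock until the resume, the simulated stack walk always sees the thread as still registered.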
It's quite possible that I missed that, I had checked but I didn't see the thread store lock being taken, could you please point me to where that happens? |
From a very brief rescan it looks like the thread detach adjusts the state of the thread object to "detached" but the thread object wouldn't actually be removed from the thread store until some other point in time, as there could still be references to the managed thread object. |
Could this also lead to a hang? The resume may never happen in |
It seems like this: runtime/src/coreclr/vm/threads.cpp Line 995 in 49a0633
Should be done inside the thread store lock but I don't see it being taken. There is another lock taken here: runtime/src/coreclr/vm/threads.cpp Line 989 in 49a0633
If the detach code would run when the same thread called I suspect there's an issue in NativeAOT too: runtime/src/coreclr/nativeaot/Runtime/threadstore.cpp Lines 183 to 185 in 49a0633
I believe the comment above would not hold when it's the same thread that called |
Yes, you are right. (I have described how the detach works in native AOT. I forgot that CoreCLR uses different scheme.) |
It won't hang, assuming thread store lock is CRITICAL_SECTION. Instead, it is going to exit the process immediately, skipping the remaining cleanup. During shutdown, EnterCriticalSection terminates the process immediately when it is called on a critical section taken by some other thread. |
Yes, it is a tricky lock-free scheme. It is likely that it has subtle race conditions. I would avoid coupling it with thread store lock. Instead, introduce a single counter that counts the number of foreground threads and check that counter atomically to detect when it is safe to exit. It is what native AOT does here: runtime/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Thread.NativeAot.cs Lines 489 to 502 in 4b75f93
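A minimal sketch of the single-counter idea suggested here, assuming nothing about the referenced native AOT code beyond what the comment says: `ForegroundThreadCounterSketch` and `CountLastExitSignals` are hypothetical names, and the counter is simply incremented when a foreground thread starts and decremented when it exits.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical counter of running foreground threads.
class ForegroundThreadCounterSketch {
public:
    void OnForegroundThreadStart() { m_count.fetch_add(1, std::memory_order_acq_rel); }

    // Returns true only for the thread whose exit brought the count to zero.
    bool OnForegroundThreadExit() {
        int previous = m_count.fetch_sub(1, std::memory_order_acq_rel);
        assert(previous > 0);
        return previous == 1;
    }

    bool IsSafeToExit() const { return m_count.load(std::memory_order_acquire) == 0; }

private:
    std::atomic<int> m_count{0};
};

// Runs threadCount foreground threads and returns how many of them
// observed that they were the last one to exit.
int CountLastExitSignals(int threadCount) {
    ForegroundThreadCounterSketch counter;
    // Register every thread up front so the count cannot reach zero early.
    for (int i = 0; i < threadCount; ++i) {
        counter.OnForegroundThreadStart();
    }

    std::atomic<int> lastSignals{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back([&counter, &lastSignals] {
            if (counter.OnForegroundThreadExit()) {
                lastSignals.fetch_add(1);
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return lastSignals.load();
}
```

Since the decrement is a single atomic read-modify-write, exactly one thread sees the transition to zero regardless of interleaving, so no lock is needed to decide when exiting is safe.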
|
Should I close this PR and open a separate one to remove the frame pointer fixup that breaks |
Actually never mind, that may not work either, and there may be more things to investigate. Closing for now. |
I imagine this would also prevent thread-local destructors and |
Correct, the rest of the cleanup code won't run. |
I guess it's a necessity. |
`ExitProcess`, the `TlsDestructionMonitor` for the thread appears to be destructed after all other threads in the process are torn down. It's possible for a GC to be in progress during that time, and the thread cleanup code in `TlsDestructionMonitor` tries to enter cooperative GC mode to fix the frame pointer, leading to a hang. Fixed by deactivating the `TlsDestructionMonitor` for the thread before calling `ExitProcess`.
`DLL_THREAD_DETACH`, which is not raised by `ExitProcess`.
`exit`, the `TlsDestructionMonitor` for the thread appears to be destructed while other threads are still running, so it is eventually able to enter cooperative GC mode. It seems the cleanup is unnecessary in this case anyway, and the behavior is similar to before, so I didn't special-case the fix for Windows.
Fixes #84006