Skip to content

Fix a potential shutdown hang with Environment.Exit() on Windows depending on timing with GCs #91739

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

kouvel
Copy link
Contributor

@kouvel kouvel commented Sep 7, 2023

Fixes #84006

…pending on timing with GCs

- On Windows when a thread calls `ExitProcess`, the `TlsDestructionMonitor` for the thread appears to be destructed after all other threads in the process are torn down. It's possible for a GC to be in progress during that time, and the thread cleanup code in `TlsDestructionMonitor` tries to enter cooperative GC mode to fix the frame pointer, leading to a hang. Fixed by deactivating the `TlsDestructionMonitor` for the thread before calling `ExitProcess`.
- Also disabled the relevant test due to a different issue #83658 occurring in the same test on multiple platforms/architectures that is not understood yet.

Fixes #84006
@kouvel kouvel requested review from VSadov and janvorli September 7, 2023 15:47
@kouvel kouvel self-assigned this Sep 7, 2023
@ghost
Copy link

ghost commented Sep 7, 2023

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #84006

Author: kouvel
Assignees: kouvel
Labels:

area-System.Threading

Milestone: -

@kouvel kouvel added this to the 9.0.0 milestone Sep 7, 2023
Copy link
Member

@mangod9 mangod9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume this isnt a regression and the issue started showing up now due to timing changes?

@kouvel
Copy link
Contributor Author

kouvel commented Sep 7, 2023

Assume this isnt a regression and the issue started showing up now due to timing changes?

The issue and test are a bit sensitive to timing, it seems to occur more frequently in the test with GCStress=c in checked builds, where the main thread in the test spends a lot of time doing GCs and Environment.Exit is called before the main thread actually starts waiting for the background thread despite the sleep in the background thread. I'm not aware of anything specific that might have changed to cause this to occur more frequently recently.

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@jkotas
Copy link
Member

jkotas commented Sep 7, 2023

Can this change introduce deadlocks or data corruptions? Are there places in the runtime where suppressing unregistration of the exiting thread is going to lead to orphaned pointer?

@jkotas
Copy link
Member

jkotas commented Sep 7, 2023

For example, what is going to happen if a thread exits right after DeactivateTlsDestructionMonitor and then another thread triggers GC. The suspend for GC is going to hang or crash - it will try to wait the thread that exited to get to the safe spot, but that is not ever going to happen since the thread is gone.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 7, 2023

For example, what is going to happen if a thread exits right after DeactivateTlsDestructionMonitor and then another thread triggers GC. The suspend for GC is going to hang or crash - it will try to wait the thread that exited to get to the safe spot, but that is not ever going to happen since the thread is gone.

Is it possible for a thread to exit right after DeactivateTlsDestructionMonitor but before ExitProcess starts tearing down the threads?

My rationale was that DeactivateTlsDestructionMonitor is called just before the thread (T1) calls ExitProcess, both in preemptive mode. If a different thread T2 triggers GC in-between T1 deactivating and the call to ExitProcess, T2 shouldn't have to wait for T1 since T1 is in preemptive mode. Besides, T2 will get torn down shortly by the OS.

Also I was thinking the pointers relevant to T1 would continue to be valid (like for walking the stack) until the OS starts tearing down all threads. If it's possible that the memory behind those pointers would be deallocated while another thread is still running, then it would be problematic, but the same issue would have existed back when thread cleanup was done upon DLL_THREAD_DETACH since that's not raised for ExitProcess and the thread calling ExitProcess would not have done any cleanup. On Windows anyway, the thread that called ExitProcess appears to be the last thread that gets torn down.

@jkotas
Copy link
Member

jkotas commented Sep 8, 2023

A different example: ExitProcess can be called by 3rd party component. This is going to the common situation when CoreCLR is hosted and does not own "Main". Is it going to hit the same problem?

@danmoseley
Copy link
Member

Just curious, why does any code at all run on Environment.Exit? What is expected to happen other than rapid process termination?

@jkotas
Copy link
Member

jkotas commented Sep 8, 2023

Just curious, why does any code at all run on Environment.Exit?

Environment.Exit is graceful process termination. It runs both managed (AppDomain.ExitProcess) and unmanaged (atexit, DLL_PROCESS_DETATCH) exit handlers.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 8, 2023

A different example: ExitProcess can be called by 3rd party component. This is going to the common situation when CoreCLR is hosted and does not own "Main". Is it going to hit the same problem?

I think returning from the main function would be ok, even if a 3rd party component owns main, because there wouldn't be any managed frames on the stack, and the frame pointer of that thread would be the top frame.

If a thread that has managed frames on the stack calls ExitProcess directly, bypassing Environment.Exit, then it could run into the same type of hang. I'm not sure at the moment what would be a good way of fixing that. Fixing the frame pointer only in DllMain I expect would resolve the issue when coreclr is loaded as a DLL, but I'm not sure about statically-linked cases.

Just curious, why does any code at all run on Environment.Exit? What is expected to happen other than rapid process termination?

Aside from Environment.Exit, I'm not sure about why C++ thread_local objects are destructed for a thread that calls ExitProcess or exit directly (bypassing Environment.Exit). The thread_local objects for other threads are not destructed, so in that sense it's not a completely graceful exit anyway. But then, the same can be said about the main thread of a process returning while background threads are doing other things, that is considered a graceful exit even though the background threads may not get a chance to clean up. Perhaps there would be more reasons to run the thread-local destructors when the main thread returns than when a different arbitrary thread initiates a process exit, in which case the main thread also would not get a chance to destruct thread-locals.

@jkotas
Copy link
Member

jkotas commented Sep 8, 2023

Note that this comment

// For case where thread calls ExitThread directly, we need to reset the
// frame pointer. Otherwise stackwalk would AV. We need to do it in cooperative mode.
// We need to set m_GCOnTransitionsOK so this thread won't trigger GC when toggle GC mode
talks about ExitThread. I do not see how this can help to make ExitThread work better. And even if it did make ExitThread work better in some cases, I would not have a problem with breaking it. ExitThread on a managed thread is one of those cases that are always going to work poorly.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 8, 2023

I do not see how this can help to make ExitThread work better. And even if it did make ExitThread work better in some cases, I would not have a problem with breaking it. ExitThread on a managed thread is one of those cases that are always going to work poorly.

I had done some brief testing with ExitThread also based on the comment, it looks like while ExitThread would obviously not run destructors on the stack, it does run thread-local destructors on Windows and Linux. That I believe is similar in behavior to before, where DLL_THREAD_DETACH is raised for a thread that calls ExitThread.

I don't think this change would make ExitThread scenarios any better, and I'm supposing that it wouldn't make them worse either.

@jkotas
Copy link
Member

jkotas commented Sep 8, 2023

If a thread that has managed frames on the stack calls ExitProcess directly, bypassing Environment.Exit, then it could run into the same type of hang. I'm not sure at the moment what would be a good way of fixing that. Fixing the frame pointer only in DllMain I expect would resolve the issue when coreclr is loaded as a DLL, but I'm not sure about statically-linked cases.

Would switching to FlsAlloc callback on Windows, like we do in native AOT

g_flsIndex = FlsAlloc(FiberDetachCallback);
help with addressing this?

We do not set any flags before exiting the process in native AOT. It would be best for CoreCLR to be on the same plan.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

Would switching to FlsAlloc callback on Windows, like we do in native AOT help with addressing this?

I did some testing, it looks like while the FlsAlloc callbacks are raised a bit earlier, they can still run after all other threads have been torn down. It looks like other threads can be torn down asynchronously with running the FlsAlloc or thread-local callbacks for the thread that called ExitProcess.

I also checked out atexit on Windows to see if it would detect a direct call to ExitProcess after noticing that NativeAOT is using it to avoid performing cleanup on the same thread:

// Do not try detaching the thread that performs the shutdown.
if (g_threadPerformingShutdown == thread)
{
// At this point other threads could be terminated rudely while leaving runtime
// in inconsistent state, so we would be risking blocking the process from exiting.
return;
}

It looks like the atexit callback is not called when a thread calls ExitProcess unfortunately, it is called though when exiting from main.

Also noticed the following:

// Some Linux toolset versions call thread-local destructors during shutdown on a wrong thread.
if ((Thread*)thread != ThreadStore::GetCurrentThread())
{
return;
}

Which should probably also be checked in CoreCLR, I'll update to check the thread ID on Unixes.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

Updated to check the thread ID on Unixes in the TlsDestructionMonitor, please take a look again. The diff is a bit simpler when ignoring whitespace.

@jkotas
Copy link
Member

jkotas commented Sep 14, 2023

I still think that setting the flag that suppresses thread exit notifications before exiting is going in a wrong direction.

I think we should rather go in the direction of deleting or rearranging the problematic pieces in the thread shutdown notification. It is the plan that native AOT is on. The thread shutdown notification does very little work on native AOT.

If it is not possible for some reason, it likely means that there is a problem in native AOT too.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

I think we should rather go in the direction of deleting or rearranging the problematic pieces in the thread shutdown notification.

If it weren't for ExitThread cases, the thread shutdown notification probably wouldn't need to fix the frame pointer at all, as I suppose when the thread is exiting normally there wouldn't be any managed frames on the stack when the thread is exiting. Maybe we can break ExitThread cases and remove the frame pointer fixup?

@jkotas
Copy link
Member

jkotas commented Sep 14, 2023

I would not mind regressing ExitThread. Do you know whether ExitThread will be actually broken if you delete the code the resets m_pFrame?

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

Do you know whether ExitThread will be actually broken if you delete the code the resets m_pFrame?

I believe it would be broken, my understanding is that the main purpose of entering cooperative GC mode is to ensure that no other thread is waking that thread's stack before the thread's stack is deallocated, which would make those frame pointers invalid.

@jkotas
Copy link
Member

jkotas commented Sep 14, 2023

The thread store lock is taken when the runtime is suspended and the stackwalks are happening. The thread detach takes the thread store lock to delete the thread from the thread store. The thread detach will wait on the thread store lock until the runtime is resumed. It should prevent stack walk from ever encountering deallocated thread stack.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

The thread detach takes the thread store lock to delete the thread from the thread store. The thread detach will wait on the thread store lock until the runtime is resumed.

It's quite possible that I missed that, I had checked but I didn't see the thread store lock being taken, could you please point me to where that happens?

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

From a very brief rescan it looks like the thread detach adjusts the state of the thread object to "detached" but the thread object wouldn't actually be removed from the thread store until some other point in time, as there could still be references to the managed thread object.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

The thread detach will wait on the thread store lock until the runtime is resumed.

Could this also lead to a hang? The resume may never happen in ExitProcess cases on Windows.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

It seems like this:

ThreadStore::CheckForEEShutdown();

Should be done inside the thread store lock but I don't see it being taken.

There is another lock taken here:

UnmarkThreadForAbort();

If the detach code would run when the same thread called ExitProcess, I don't think it's safe for it to wait for anything. Otherwise, the detach code shouldn't run in that case.

I suspect there's an issue in NativeAOT too:

// Note that when process is shutting down, the threads may be rudely terminated,
// possibly while holding the threadstore lock. That is ok, since the process is being torn down.
CrstHolder threadStoreLock(&pTS->m_Lock);

I believe the comment above would not hold when it's the same thread that called ExitProcess and the detach code is running as a result of that. A different thread may have been torn down while holding the thread store lock, and this thread would hang and block the process from exiting, unless there is something special about critical sections during process exit, which I haven't verified.

@jkotas
Copy link
Member

jkotas commented Sep 14, 2023

From a very brief rescan it looks like the thread detach adjusts the state of the thread object to "detached"

Yes, you are right. (I have described how the detach works in native AOT. I forgot that CoreCLR uses different scheme.)

@jkotas
Copy link
Member

jkotas commented Sep 14, 2023

The thread detach will wait on the thread store lock until the runtime is resumed.

Could this also lead to a hang? The resume may never happen in ExitProcess cases on Windows.

It won't hang, assuming thread store lock is CRITICAL_SECTION. Instead, it is going to exit the process immediately, skipping the remaining cleanup. During shutdown, EnterCriticalSection terminates the process immediately when it is called on a critical section taken by some other thread.

@jkotas
Copy link
Member

jkotas commented Sep 14, 2023

It seems like this:

ThreadStore::CheckForEEShutdown();

Should be done inside the thread store lock but I don't see it being taken.

Yes, it is a tricky lock-free scheme. It is likely that it is has subtle race conditions.

I would avoid coupling it with thread store lock. Instead, introduce a single counter that counts the number of foreground threads and check that counter atomically to detect when it is safe to exit. It is what native does here:

if (s_foregroundRunningCount == 0)
{
// current thread is the last foreground thread, so let the runtime finish
return;
}
Volatile.Write(ref s_allDone, new ManualResetEvent(false));
// other foreground threads could have their job finished meanwhile
// Volatile.Write above issues release barrier
// but we need acquire barrier to observe most recent write to s_foregroundRunningCount
if (Volatile.Read(ref s_foregroundRunningCount) == 0)
{
return;
}
s_allDone.WaitOne();
. (If you would like to clean it up, it should be a separate PR.)

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

Should I close this PR and open a separate one to remove the frame pointer fixup that breaks ExitThread cases?

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

remove the frame pointer fixup

Actually nevermind, that may not work either, and there may be more things to investigate. Closing for now.

@kouvel kouvel closed this Sep 14, 2023
@kouvel kouvel deleted the EnvExitTestFix branch September 14, 2023 21:50
@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

It won't hang, assuming thread store lock is CRITICAL_SECTION. Instead, it is going to exit the process immediately, skipping the remaining cleanup. During shutdown, EnterCriticalSection terminates the process immediately when it is called on a critical section taken by some other thread.

I imagine this would also prevent thread-local destructors and DLL_PROCESS_DETACH from being raised. Is that ok?

@jkotas
Copy link
Member

jkotas commented Sep 14, 2023

Correct, the rest of the cleanup code won't run.

@kouvel
Copy link
Contributor Author

kouvel commented Sep 14, 2023

I guess it's a necessity.

@ghost ghost locked as resolved and limited conversation to collaborators Oct 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure baseservices/threading/regressions/2164/foreground-shutdown/foreground-shutdown.cmd
5 participants