Skip to content

Commit f6fd4b9

Browse files
pdillingerfacebook-github-bot
authored andcommitted
Print stack traces more reliably with concurrency (facebook#12086)
Summary: It's been relatively easy to break our stack trace printer: * If another thread reaches a signal condition such as a related SEGV or assertion failure while one is trying to print a stack trace from the signal handler, it seems to end the process abruptly without a stack trace. * If the process exits normally in one thread (such as main finishing) while another is trying to print a stack trace from the signal handler, it seems the process will often end normally without a stack trace. This change attempts to fix these issues, with * Keep the custom signal handler installed as long as possible, so that other threads will most likely re-enter our custom handler. (We only switch back to default for triggering core dump or whatever after stack trace.) * Use atomics and sleeps to implement a crude recursive mutex for ensuring all threads hitting the custom signal handler wait on the first that is trying to print a stack trace, while recursive signals in the same thread can still be handled cleanly. * Use an atexit handler to hook into normal exit to (a) wait on a pending printing of stack trace when detectable and applicable, and (b) detect and warn when printing a stack trace might be interrupted by a process exit in progress. (I don't know how to pause that *after* our atexit handler has been called; the best I know how to do is warn, "In a race with process already exiting...".) Pull Request resolved: facebook#12086 Test Plan: manual, including with TSAN. I added this code to the end of a unit test file: ``` for (size_t i = 0; i < 3; ++i) { std::thread t([]() { assert(false); }); t.detach(); } ``` Followed by either `sleep(100)` or `usleep(100)` or usual process exit. And for recursive signal testing, inject `abort()` at various places in the handler. Reviewed By: cbi42 Differential Revision: D51531882 Pulled By: pdillinger fbshipit-source-id: 3473b863a43e61b722dfb7a2ed12a8120949b09c
1 parent a140b51 commit f6fd4b9

File tree

1 file changed

+76
-14
lines changed

1 file changed

+76
-14
lines changed

port/stack_trace.cc

Lines changed: 76 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@ void* SaveStack(int* /*num_frames*/, int /*first_frames_to_skip*/) {
2525

2626
#include <cxxabi.h>
2727
#include <execinfo.h>
28+
#include <pthread.h>
2829
#include <signal.h>
2930
#include <stdio.h>
3031
#include <stdlib.h>
@@ -48,6 +49,9 @@ void* SaveStack(int* /*num_frames*/, int /*first_frames_to_skip*/) {
4849
#endif // GLIBC version
4950
#endif // OS_LINUX
5051

52+
#include <algorithm>
53+
#include <atomic>
54+
5155
#include "port/lang.h"
5256

5357
namespace ROCKSDB_NAMESPACE {
@@ -311,28 +315,85 @@ void* SaveStack(int* num_frames, int first_frames_to_skip) {
311315
return callstack;
312316
}
313317

318+
static std::atomic<uint64_t> g_thread_handling_stack_trace{0};
319+
static int g_recursion_count = 0;
320+
static std::atomic<bool> g_at_exit_called{false};
321+
314322
static void StackTraceHandler(int sig) {
315-
// reset to default handler
316-
signal(sig, SIG_DFL);
317323
fprintf(stderr, "Received signal %d (%s)\n", sig, strsignal(sig));
318-
// skip the top three signal handler related frames
319-
PrintStack(3);
324+
// Crude recursive mutex with no signal-unsafe system calls, to avoid
325+
// re-entrance from multiple threads and avoid core dumping while trying
326+
// to print the stack trace.
327+
uint64_t tid = 0;
328+
{
329+
const auto ptid = pthread_self();
330+
// pthread_t is an opaque type
331+
memcpy(&tid, &ptid, std::min(sizeof(tid), sizeof(ptid)));
332+
// Essentially ensure non-zero
333+
++tid;
334+
}
335+
for (;;) {
336+
uint64_t expected = 0;
337+
if (g_thread_handling_stack_trace.compare_exchange_strong(expected, tid)) {
338+
// Acquired mutex
339+
g_recursion_count = 0;
340+
break;
341+
}
342+
if (expected == tid) {
343+
++g_recursion_count;
344+
fprintf(stderr, "Recursive call to stack trace handler (%d)\n",
345+
g_recursion_count);
346+
break;
347+
}
348+
// Sleep before trying again
349+
usleep(1000);
350+
}
351+
352+
if (g_recursion_count > 2) {
353+
// Give up after too many recursions
354+
fprintf(stderr, "Too many recursive calls to stack trace handler (%d)\n",
355+
g_recursion_count);
356+
} else {
357+
if (g_at_exit_called.load(std::memory_order_acquire)) {
358+
fprintf(stderr, "In a race with process already exiting...\n");
359+
}
320360

321-
// Efforts to fix or suppress TSAN warnings "signal-unsafe call inside of
322-
// a signal" have failed, so just warn the user about them.
361+
// skip the top three signal handler related frames
362+
PrintStack(3);
363+
364+
// Efforts to fix or suppress TSAN warnings "signal-unsafe call inside of
365+
// a signal" have failed, so just warn the user about them.
323366
#ifdef __SANITIZE_THREAD__
324-
fprintf(stderr,
325-
"==> NOTE: any above warnings about \"signal-unsafe call\" are\n"
326-
"==> ignorable, as they are expected when generating a stack\n"
327-
"==> trace because of a signal under TSAN. Consider why the\n"
328-
"==> signal was generated to begin with, and the stack trace\n"
329-
"==> in the TSAN warning can be useful for that. (The stack\n"
330-
"==> trace printed by the signal handler is likely obscured\n"
331-
"==> by TSAN output.)\n");
367+
fprintf(stderr,
368+
"==> NOTE: any above warnings about \"signal-unsafe call\" are\n"
369+
"==> ignorable, as they are expected when generating a stack\n"
370+
"==> trace because of a signal under TSAN. Consider why the\n"
371+
"==> signal was generated to begin with, and the stack trace\n"
372+
"==> in the TSAN warning can be useful for that. (The stack\n"
373+
"==> trace printed by the signal handler is likely obscured\n"
374+
"==> by TSAN output.)\n");
332375
#endif
376+
}
333377

378+
// reset to default handler
379+
signal(sig, SIG_DFL);
334380
// re-signal to default handler (so we still get core dump if needed...)
335381
raise(sig);
382+
383+
// release the mutex, in case this is somehow recoverable
384+
if (g_recursion_count > 0) {
385+
--g_recursion_count;
386+
} else {
387+
g_thread_handling_stack_trace.store(0, std::memory_order_release);
388+
}
389+
}
390+
391+
static void AtExit() {
392+
// wait for stack trace handler to finish, if needed
393+
while (g_thread_handling_stack_trace.load(std::memory_order_acquire)) {
394+
usleep(1000);
395+
}
396+
g_at_exit_called.store(true, std::memory_order_release);
336397
}
337398

338399
void InstallStackTraceHandler() {
@@ -342,6 +403,7 @@ void InstallStackTraceHandler() {
342403
signal(SIGSEGV, StackTraceHandler);
343404
signal(SIGBUS, StackTraceHandler);
344405
signal(SIGABRT, StackTraceHandler);
406+
atexit(AtExit);
345407
// Allow ouside debugger to attach, even with Yama security restrictions.
346408
// This is needed even outside of PrintStack() so that external mechanisms
347409
// can dump stacks if they suspect that a test has hung.

0 commit comments

Comments
 (0)