Julia hangs forever in Windows #56432

Open
@jsjie

It is hard to give an MWE, but a binary compiled with PackageCompiler.jl under Julia 1.9.4 appears to hang forever on Windows whenever the abort function is called. I am raising the issue here because the following two example outputs suggest that the bug is in Julia itself rather than in PackageCompiler:

Cannot protect page @0000000000000000 of size 1048576 to 0x4 (err 0x1e7)

[12024] signal (22): SIGABRT
in expression starting at none:1
crt_sig_handler at C:/workdir/src\signals-win.c:95
raise at C:\Windows\System32\msvcrt.dll (unknown line)
abort at C:\Windows\System32\msvcrt.dll (unknown line)
protect_page at C:/workdir/src\cgmemmgr.cpp:78 [inlined]
get_wr_ptr at C:/workdir/src\cgmemmgr.cpp:613 [inlined]
alloc at C:/workdir/src\cgmemmgr.cpp:589 [inlined]
allocateCodeSection at C:/workdir/src\cgmemmgr.cpp:870

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: UNKNOWN at 0x7ff963673b19 --
[20632] signal (22): SIGABRT
in expression starting at none:1
crt_sig_handler at C:/workdir/src\signals-win.c:95
raise at C:\Windows\System32\msvcrt.dll (unknown line)
abort at C:\Windows\System32\msvcrt.dll (unknown line)
realloc_s at C:/workdir/src/support\dtypes.h:359 [inlined]
realloc_s at C:/workdir/src/support\dtypes.h:351 [inlined]
gc_mark_stack_resize at C:/workdir/src\gc.c:1934
gc_mark_stack_push at C:/workdir/src\gc.c:1953 [inlined]
gc_mark_loop at C:/workdir/src\gc.c:2821
_jl_gc_collect at C:/workdir/src\gc.c:3407
ijl_gc_collect at C:/workdir/src\gc.c:3713
maybe_collect at C:/workdir/src\gc.c:1083 [inlined]
jl_gc_pool_alloc_inner at C:/workdir/src\gc.c:1450 [inlined]
jl_gc_pool_alloc_noinline at C:/workdir/src\gc.c:1511 [inlined]
jl_gc_alloc_ at C:/workdir/src\julia_internal.h:460 [inlined]
jl_gc_alloc at C:/workdir/src\gc.c:3760
_new_array_ at C:/workdir/src\array.c:134
_new_array at C:/workdir/src\array.c:198 [inlined]
ijl_alloc_array_1d at C:/workdir/src\array.c:436

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: UNKNOWN at 0x7ff963673b19 --

Then the process hangs forever.
I've checked the 1.9.4 tag of Julia, and it seems that crt_sig_handler in signals-win.c is triggered:

julia/src/signals-win.c

Lines 88 to 99 in 8e5136f

    default: // SIGSEGV, SIGTERM, SIGILL, SIGABRT
        if (sig == SIGSEGV && jl_get_safe_restore()) {
            signal(sig, (void (__cdecl *)(int))crt_sig_handler);
            jl_sig_throw();
        }
        memset(&Context, 0, sizeof(Context));
        RtlCaptureContext(&Context);
        if (sig == SIGILL)
            jl_show_sigill(&Context);
        jl_critical_error(sig, 0, &Context, jl_get_current_task());
        raise(sig);
    }
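
Both backtraces show crt_sig_handler directly beneath raise and abort. That matches the C rule that raise() runs the installed handler synchronously, on the caller's stack, before it returns. A minimal portable sketch of that behaviour follows; record_signal and raise_and_observe are illustrative names that do not exist in Julia, and SIGTERM is used only because handling it is harmless in a test program.

```c
#include <signal.h>

/* Set by the handler; sig_atomic_t is the only object type ISO C lets a
   handler write to safely. */
static volatile sig_atomic_t last_signal = 0;

/* Hypothetical handler: it only records which signal it was called for. */
static void record_signal(int sig)
{
    last_signal = sig;
}

/* Install the handler, raise the signal, and report what the handler saw.
   Returns 0 if the handler could not be installed or did not run. */
static int raise_and_observe(int sig)
{
    last_signal = 0;
    if (signal(sig, record_signal) == SIG_ERR)
        return 0;
    if (raise(sig) != 0)   /* the handler runs inside this call */
        return 0;
    int seen = last_signal;
    signal(sig, SIG_DFL);  /* restore the default disposition */
    return seen;
}
```

Calling raise_and_observe(SIGTERM) returns SIGTERM, because the handler has already run by the time raise() returns. Unlike this sketch, the excerpt above ends its handler with raise(sig), so what happens next depends on the disposition in effect at that point.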

crt_sig_handler then calls jl_critical_error, which runs:

julia/src/signal-handling.c

Lines 437 to 481 in 8e5136f

void jl_critical_error(int sig, int si_code, bt_context_t *context, jl_task_t *ct)
{
    jl_bt_element_t *bt_data = ct ? ct->ptls->bt_data : NULL;
    size_t *bt_size = ct ? &ct->ptls->bt_size : NULL;
    size_t i, n = ct ? *bt_size : 0;
    if (sig) {
        // kill this task, so that we cannot get back to it accidentally (via an untimely ^C or jlbacktrace in jl_exit)
        jl_task_frame_noreturn(ct);
#ifndef _OS_WINDOWS_
        sigset_t sset;
        sigemptyset(&sset);
        // n.b. In `abort()`, Apple's libSystem "helpfully" blocks all signals
        // on all threads but SIGABRT. But we also don't know what the thread
        // was doing, so unblock all critical signals so that they will crash
        // hard, and not just get stuck.
        sigaddset(&sset, SIGSEGV);
        sigaddset(&sset, SIGBUS);
        sigaddset(&sset, SIGILL);
        // also unblock fatal signals now, so we won't get back here twice
        sigaddset(&sset, SIGTERM);
        sigaddset(&sset, SIGABRT);
        sigaddset(&sset, SIGQUIT);
        // and the original signal is now fatal too, in case it wasn't
        // something already listed (?)
        if (sig != SIGINT)
            sigaddset(&sset, sig);
        pthread_sigmask(SIG_UNBLOCK, &sset, NULL);
#endif
        if (si_code)
            jl_safe_printf("\n[%d] signal (%d.%d): %s\n", getpid(), sig, si_code, strsignal(sig));
        else
            jl_safe_printf("\n[%d] signal (%d): %s\n", getpid(), sig, strsignal(sig));
    }
    jl_safe_printf("in expression starting at %s:%d\n", jl_filename, jl_lineno);
    if (context && ct) {
        // Must avoid extended backtrace frames here unless we're sure bt_data
        // is properly rooted.
        *bt_size = n = rec_backtrace_ctx(bt_data, JL_MAX_BT_SIZE, context, NULL);
    }
    for (i = 0; i < n; i += jl_bt_entry_size(bt_data + i)) {
        jl_print_bt_entry_codeloc(bt_data + i);
    }
    jl_gc_debug_print_status();
    jl_gc_debug_critical_error();
}

What happens next is a little unclear, but I suppose the stack frames that follow are printed by jl_print_bt_entry_codeloc rather than by rec_backtrace_ctx.
It is also unclear to me why the output of jl_gc_debug_print_status is absent. Something like "Allocations: xxx (Pool: xxx; Big: xxx); GC: xxx" should be printed, but it is not.

In any case, the last output follows, and it is identical in both cases:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: UNKNOWN at 0x7ff963673b19 --

This should be related to jl_exception_handler in signals-win.c. That function is registered as the unhandled exception filter in jl_install_default_signal_handlers, and since it is meant for unhandled exceptions, I don't quite understand why it is called here. In any case, the code is as follows:

julia/src/signals-win.c

Lines 284 to 338 in 8e5136f

    jl_safe_printf("\nPlease submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.\nException: ");
    switch (ExceptionInfo->ExceptionRecord->ExceptionCode) {
        case EXCEPTION_ACCESS_VIOLATION:
            jl_safe_printf("EXCEPTION_ACCESS_VIOLATION"); break;
        case EXCEPTION_ARRAY_BOUNDS_EXCEEDED:
            jl_safe_printf("EXCEPTION_ARRAY_BOUNDS_EXCEEDED"); break;
        case EXCEPTION_BREAKPOINT:
            jl_safe_printf("EXCEPTION_BREAKPOINT"); break;
        case EXCEPTION_DATATYPE_MISALIGNMENT:
            jl_safe_printf("EXCEPTION_DATATYPE_MISALIGNMENT"); break;
        case EXCEPTION_FLT_DENORMAL_OPERAND:
            jl_safe_printf("EXCEPTION_FLT_DENORMAL_OPERAND"); break;
        case EXCEPTION_FLT_DIVIDE_BY_ZERO:
            jl_safe_printf("EXCEPTION_FLT_DIVIDE_BY_ZERO"); break;
        case EXCEPTION_FLT_INEXACT_RESULT:
            jl_safe_printf("EXCEPTION_FLT_INEXACT_RESULT"); break;
        case EXCEPTION_FLT_INVALID_OPERATION:
            jl_safe_printf("EXCEPTION_FLT_INVALID_OPERATION"); break;
        case EXCEPTION_FLT_OVERFLOW:
            jl_safe_printf("EXCEPTION_FLT_OVERFLOW"); break;
        case EXCEPTION_FLT_STACK_CHECK:
            jl_safe_printf("EXCEPTION_FLT_STACK_CHECK"); break;
        case EXCEPTION_FLT_UNDERFLOW:
            jl_safe_printf("EXCEPTION_FLT_UNDERFLOW"); break;
        case EXCEPTION_ILLEGAL_INSTRUCTION:
            jl_safe_printf("EXCEPTION_ILLEGAL_INSTRUCTION"); break;
        case EXCEPTION_IN_PAGE_ERROR:
            jl_safe_printf("EXCEPTION_IN_PAGE_ERROR"); break;
        case EXCEPTION_INT_DIVIDE_BY_ZERO:
            jl_safe_printf("EXCEPTION_INT_DIVIDE_BY_ZERO"); break;
        case EXCEPTION_INT_OVERFLOW:
            jl_safe_printf("EXCEPTION_INT_OVERFLOW"); break;
        case EXCEPTION_INVALID_DISPOSITION:
            jl_safe_printf("EXCEPTION_INVALID_DISPOSITION"); break;
        case EXCEPTION_NONCONTINUABLE_EXCEPTION:
            jl_safe_printf("EXCEPTION_NONCONTINUABLE_EXCEPTION"); break;
        case EXCEPTION_PRIV_INSTRUCTION:
            jl_safe_printf("EXCEPTION_PRIV_INSTRUCTION"); break;
        case EXCEPTION_SINGLE_STEP:
            jl_safe_printf("EXCEPTION_SINGLE_STEP"); break;
        case EXCEPTION_STACK_OVERFLOW:
            jl_safe_printf("EXCEPTION_STACK_OVERFLOW"); break;
        default:
            jl_safe_printf("UNKNOWN"); break;
    }
    jl_safe_printf(" at 0x%Ix -- ", (size_t)ExceptionInfo->ExceptionRecord->ExceptionAddress);
    jl_print_native_codeloc((uintptr_t)ExceptionInfo->ExceptionRecord->ExceptionAddress);
    jl_critical_error(0, 0, ExceptionInfo->ContextRecord, ct);
    static int recursion = 0;
    if (recursion++)
        exit(1);
    else
        jl_exit(1);
}
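
The recursion guard at the end of this excerpt can be read in isolation. Below is a minimal sketch of the same pattern, assuming nothing about Julia beyond the quoted lines; choose_exit_path and the enum names are hypothetical, and the two return values stand in for the calls to jl_exit(1) and exit(1).

```c
/* Hypothetical labels for the two branches of the guard. */
enum exit_path {
    EXIT_FULL_CLEANUP, /* first entry: corresponds to jl_exit(1) */
    EXIT_IMMEDIATE     /* any later entry: corresponds to exit(1) */
};

/* Same shape as the quoted guard: a function-local static counter that is
   zero only the first time control reaches it. */
static enum exit_path choose_exit_path(void)
{
    static int recursion = 0;
    if (recursion++)
        return EXIT_IMMEDIATE;
    return EXIT_FULL_CLEANUP;
}
```

The first call returns EXIT_FULL_CLEANUP and every later call returns EXIT_IMMEDIATE. The guard only takes effect once control reaches it, so it cannot help if one of the calls placed before it in the excerpt never returns.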

The " at 0x%Ix -- " text is printed, and I suppose that jl_critical_error is not called, since it would print "in expression starting at" when called, and that is absent from the output. So the problem may be related to jl_print_native_codeloc? The function is as follows:
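
The inference above (which call was last reached, judged from the text that did or did not appear) can be written down as a small check over the captured output. This is only a sketch of the reasoning, not a diagnostic tool: the stage names and classify_handler_output are hypothetical, and the marker strings are the ones quoted in this report.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical labels for how far jl_exception_handler appears to have got. */
enum handler_stage {
    STAGE_NOT_ENTERED,     /* no "Exception: " banner in the output */
    STAGE_BANNER_PRINTED,  /* banner seen, address not yet printed */
    STAGE_ADDRESS_PRINTED, /* " at 0x... -- " seen, nothing from jl_critical_error */
    STAGE_CRITICAL_ERROR   /* "in expression starting at" seen after the address */
};

static enum handler_stage classify_handler_output(const char *log)
{
    /* Use the last banner: earlier output may already contain
       "in expression starting at" from the crt_sig_handler path. */
    const char *banner = NULL;
    for (const char *p = strstr(log, "Exception: "); p != NULL;
         p = strstr(p + 1, "Exception: "))
        banner = p;
    if (banner == NULL)
        return STAGE_NOT_ENTERED;

    const char *addr = strstr(banner, " at 0x");
    if (addr == NULL)
        return STAGE_BANNER_PRINTED;

    if (strstr(addr, "in expression starting at") != NULL)
        return STAGE_CRITICAL_ERROR;
    return STAGE_ADDRESS_PRINTED;
}
```

Applied to either output in this report, the check yields STAGE_ADDRESS_PRINTED: the address was printed and nothing from jl_critical_error follows it. That is consistent with a hang in jl_print_native_codeloc, although it does not prove it.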

julia/src/stackwalk.c

Lines 630 to 651 in 8e5136f

void jl_print_native_codeloc(uintptr_t ip) JL_NOTSAFEPOINT
{
    // This function is not allowed to reference any TLS variables since
    // it can be called from an unmanaged thread on OSX.
    // it means calling getFunctionInfo with noInline = 1
    jl_frame_t *frames = NULL;
    int n = jl_getFunctionInfo(&frames, ip, 0, 0);
    int i;
    for (i = 0; i < n; i++) {
        jl_frame_t frame = frames[i];
        if (!frame.func_name) {
            jl_safe_printf("unknown function (ip: %p)\n", (void*)ip);
        }
        else {
            jl_safe_print_codeloc(frame.func_name, frame.file_name, frame.line, frame.inlined);
            free(frame.func_name);
            free(frame.file_name);
        }
    }
    free(frames);
}

Could the calls to free be the problem? The documentation only states that heap routines should not be called in signal-handler routines, though, which should be unrelated to this issue?
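
Whether free is involved in the hang is not established here. For comparison, here is a minimal sketch of printing an address without touching the heap, by formatting into a caller-provided buffer. format_ip_noalloc is a hypothetical helper, not part of Julia.

```c
#include <stddef.h>
#include <stdint.h>

/* Write "0x" followed by the lowercase hex digits of ip into buf and
   NUL-terminate it. Returns the number of characters written (excluding
   the NUL), or 0 if buf is too small. No heap routine is called. */
static size_t format_ip_noalloc(uintptr_t ip, char *buf, size_t buflen)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[2 * sizeof(uintptr_t)]; /* two hex digits per byte is enough */
    size_t n = 0;

    /* Collect digits least-significant first; emits "0" when ip == 0. */
    do {
        tmp[n++] = digits[ip & 0xf];
        ip >>= 4;
    } while (ip != 0);

    if (buflen < n + 3) /* "0x" + digits + NUL */
        return 0;

    size_t out = 0;
    buf[out++] = '0';
    buf[out++] = 'x';
    while (n > 0)
        buf[out++] = tmp[--n]; /* reverse into most-significant-first order */
    buf[out] = '\0';
    return out;
}
```

With a 32-byte stack buffer, format_ip_noalloc((uintptr_t)0x1f, buf, sizeof buf) writes "0x1f" and returns 4.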
