Description of the problem
While investigating the Jenkins-SGX-22.04-Sanitizers build failure on PR #2131, I traced the issue to libos_init, not to the PR itself. A child process occasionally aborts with:
(host_thread.c:310:pal_thread_init) error: There are no available TCS pages left for a new thread. Please try to increase sgx.max_threads in the manifest. The current value is 4
This error is raised during this call:
gramine/libos/src/libos_init.c, line 479 in ff71d7a:
int ret = connect_to_process(g_process_ipc_ids.parent_vmid);
I think the cause is:
At the moment the TLS-handshake thread is created in the above call, all four TCS slots allowed by the manifest (sgx.max_threads is 4) may already be occupied:
- Main thread
- IPC worker thread
- Async worker thread
- TLS-handshake thread spawned inside init_ipc_worker
If the handshake helper thread from init_ipc_worker has not yet unmapped its TCS (unmap_my_tcs), connect_to_process cannot allocate a TCS for the new helper thread and the child exits with the error above.
I logged calls to unmap_my_tcs. In failing runs, the log contains exactly one fewer unmap_my_tcs message before connect_to_process attempts to create its own TLS-handshake thread than it does in successful runs.
I think this directly reproduces the “no available TCS pages” error and confirms the thread contention described above.
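To make the ordering dependence concrete, here is a toy model of the slot accounting in plain C. It is not Gramine code: tcs_pool, tcs_alloc, tcs_free and simulate_child_startup are hypothetical names invented for this sketch, tcs_free only stands in for the effect of unmap_my_tcs, and the order in which the first four threads take their slots is assumed for illustration.

```c
/* Toy model of TCS slot accounting. NOT Gramine code; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define MAX_THREADS 4 /* mirrors the sgx.max_threads value from the error message */

struct tcs_pool {
    bool in_use[MAX_THREADS];
};

/* Claim the first free slot; return its index, or -1 if every slot is taken. */
static int tcs_alloc(struct tcs_pool* pool) {
    for (int i = 0; i < MAX_THREADS; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Release a slot; stands in for the effect of unmap_my_tcs in this model. */
static void tcs_free(struct tcs_pool* pool, int slot) {
    assert(slot >= 0 && slot < MAX_THREADS);
    pool->in_use[slot] = false;
}

/*
 * Replay the child startup sequence described above. If helper_released_first
 * is false, the first handshake helper still holds its slot when the second
 * helper is requested. Returns the slot given to the second helper, or -1.
 */
static int simulate_child_startup(bool helper_released_first) {
    struct tcs_pool pool = {0};

    tcs_alloc(&pool);                    /* main thread */
    tcs_alloc(&pool);                    /* IPC worker thread */
    tcs_alloc(&pool);                    /* async worker thread */
    int first_helper = tcs_alloc(&pool); /* handshake helper from init_ipc_worker */

    if (helper_released_first)
        tcs_free(&pool, first_helper);

    /* handshake helper requested by connect_to_process */
    return tcs_alloc(&pool);
}
```

In this model, simulate_child_startup(true) hands the freed slot to the second helper, while simulate_child_startup(false) returns -1, which corresponds to the "no available TCS pages" abort. The real code is concurrent, so which of the two orderings occurs depends on scheduling.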
Steps to reproduce
I found this issue while debugging the failure of rlimit_stack on the fork of PR #2131. However, I can reproduce the same issue on the main branch.
- Use the Docker image provided in .ci and clone the main branch of Gramine.
- Build Gramine in SGX debug mode with ASan and UBSan enabled:
CC=clang CXX=clang++ meson setup build/ --werror --prefix=/workspace/install --buildtype=debug -Ddirect=disabled -Dsgx=enabled -Dtests=enabled -Dlibc=glibc -Dubsan=enabled -Dasan=enabled
- Build and install Gramine, then cd libos/test/regression and gramine-manifest and gramine-sgx-sign the rlimit_stack manifest.
- Repeat gramine-sgx rlimit_stack until it fails. I used this command to reproduce the failure (usually within a few minutes):
while gramine-sgx rlimit_stack | grep -q "TEST OK"; do echo "TEST OK found, running again..."; clear; done
Expected results
The TLS-handshake thread created by init_ipc_worker unmaps its TCS before connect_to_process creates its own handshake thread, leaving at least one TCS slot free.
Actual results
Intermittently, the first handshake thread has not yet unmapped its TCS. All four default slots are still occupied, so connect_to_process fails to create the second handshake thread and the child exits with the error above.