[LibOS] Child process fails during libos_init when all TCS are in use #2145

@forkthus

Description of the problem

While investigating the Jenkins-SGX-22.04-Sanitizers build failure on PR #2131, I traced the issue to libos_init, not to the PR itself. A child process occasionally aborts with:

(host_thread.c:310:pal_thread_init) error: There are no available TCS pages left for a new thread. Please try to increase sgx.max_threads in the manifest. The current value is 4

This error is raised during this call:

int ret = connect_to_process(g_process_ipc_ids.parent_vmid);

I think the cause is:
At the moment the TLS-handshake thread is created in the above call, all four TCS slots in the manifest may already be occupied:

  1. Main thread
  2. IPC worker thread
  3. Async worker thread
  4. TLS-handshake thread spawned inside init_ipc_worker

If the handshake helper thread from init_ipc_worker has not yet unmapped its TCS (unmap_my_tcs), connect_to_process cannot allocate a TCS for the new helper thread and the child exits with the error above.

I logged calls to unmap_my_tcs. In failing runs, the log shows exactly one fewer unmap_my_tcs message before connect_to_process tries to create its own TLS-handshake thread than it does in successful runs.

I think this directly reproduces the “no available TCS pages” error and confirms the thread contention described above.
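
The slot accounting described above can be sketched as a small model. This is not Gramine code: tcs_pool, tcs_map, tcs_unmap and simulate_child_init are hypothetical names used only for illustration, and the pool size of 4 is taken from the error message. The sketch assumes the only difference between a passing and a failing run is whether the first handshake thread has released its slot before the second one is requested.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the TCS slot pool; not taken from Gramine sources. */
#define MAX_THREADS 4 /* mirrors "sgx.max_threads ... current value is 4" */

struct tcs_pool {
    bool in_use[MAX_THREADS];
};

/* Claim the first free slot. Returns its index, or -1 if all are in use. */
int tcs_map(struct tcs_pool* pool) {
    for (int i = 0; i < MAX_THREADS; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Release a previously claimed slot (stands in for unmap_my_tcs). */
void tcs_unmap(struct tcs_pool* pool, int slot) {
    assert(slot >= 0 && slot < MAX_THREADS);
    pool->in_use[slot] = false;
}

/*
 * Replays the thread creation order from the list above.
 * Returns 0 if the second handshake thread gets a slot, -1 otherwise.
 */
int simulate_child_init(bool first_helper_unmapped_in_time) {
    struct tcs_pool pool = {0};

    tcs_map(&pool);                    /* 1. main thread */
    tcs_map(&pool);                    /* 2. IPC worker thread */
    tcs_map(&pool);                    /* 3. async worker thread */
    int first_helper = tcs_map(&pool); /* 4. handshake thread from init_ipc_worker */

    if (first_helper_unmapped_in_time)
        tcs_unmap(&pool, first_helper);

    /* connect_to_process now needs a slot for its own handshake thread. */
    int second_helper = tcs_map(&pool);
    return second_helper < 0 ? -1 : 0;
}
```

In this model, simulate_child_init(true) returns 0 and simulate_child_init(false) returns -1, which corresponds to the successful and failing runs respectively.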

Steps to reproduce

I found this issue while debugging the failure of rlimit_stack on the fork of PR #2131. However, I can reproduce the same issue on the main branch.

  1. Use the docker image provided in .ci and clone the main branch of Gramine.
  2. Configure the build in SGX debug mode with ASan and UBSan enabled:
CC=clang CXX=clang++ meson setup build/ --werror --prefix=/workspace/install --buildtype=debug -Ddirect=disabled -Dsgx=enabled -Dtests=enabled -Dlibc=glibc -Dubsan=enabled -Dasan=enabled
  3. Build and install Gramine, then cd libos/test/regression and run gramine-manifest and gramine-sgx-sign on the rlimit_stack manifest.
  4. Run gramine-sgx rlimit_stack repeatedly until it fails. I used this command to reproduce the failure (usually within a few minutes):
while gramine-sgx rlimit_stack | grep -q "TEST OK"; do     echo "TEST OK found, running again..."; clear; done

Expected results

The TLS-handshake thread created by init_ipc_worker unmaps its TCS before connect_to_process creates its own handshake thread, leaving at least one TCS slot free.

Actual results

Intermittently, the first handshake thread has not yet unmapped its TCS. All four default slots are still occupied, so connect_to_process fails to create the second handshake thread and the child exits with the error above.

Gramine commit hash

ff71d7a / 8e0313a
