[LibOS] Child process fails during `libos_init` when all TCS are in use

### Description of the problem

While investigating the `Jenkins-SGX-22.04-Sanitizers` build failure on PR #2131, I traced the issue to `libos_init`, not to the PR itself. A child process occasionally aborts with:
```
(host_thread.c:310:pal_thread_init) error: There are no available TCS pages left for a new thread. Please try to increase sgx.max_threads in the manifest. The current value is 4
```
This error is raised during this call:
https://github.com/gramineproject/gramine/blob/ff71d7afea730dffd56a97af39bb6a73ee6c7662/libos/src/libos_init.c#L479

I think the cause is:
At the moment the TLS-handshake thread is created in the above call, all four TCS slots allowed by the manifest (`sgx.max_threads` is 4) may already be occupied:
1. Main thread
2. IPC worker thread
3. Async worker thread
4. TLS-handshake thread spawned inside `init_ipc_worker`

If the handshake helper thread from `init_ipc_worker` has not yet unmapped its TCS (`unmap_my_tcs`), `connect_to_process` cannot allocate a TCS for the new helper thread and the child exits with the error above.

I logged calls to `unmap_my_tcs`. In failing runs, the log contains exactly one fewer `unmap_my_tcs` message than in successful runs **before** `connect_to_process` attempts to create its own TLS-handshake thread.

I think this accounts for the “no available TCS pages” error and supports the thread contention described above.
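The count comparison described above can be scripted. The following is only a sketch: `count_before` is a hypothetical helper, and it assumes the added debug logging prints one line containing `unmap_my_tcs` per call and that some line containing `connect_to_process` marks the point where the second handshake thread is about to be created. Neither log line is shown in this issue, so the patterns need to be adapted to the actual messages.

```shell
#!/bin/sh
# count_before LOG PATTERN MARKER
# Hypothetical helper: print how many lines of LOG match PATTERN before the
# first line that matches MARKER. If MARKER never appears, all matching
# lines in the file are counted.
count_before() {
    awk -v pat="$2" -v marker="$3" '
        $0 ~ marker { exit }      # stop scanning at the marker line
        $0 ~ pat    { n++ }       # count matches seen so far
        END         { print n + 0 }
    ' "$1"
}

# Usage (file names and patterns are assumptions):
# count_before ok.log     unmap_my_tcs connect_to_process
# count_before failed.log unmap_my_tcs connect_to_process
```

If the race is as described, the second command would be expected to print a number one lower than the first.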

### Steps to reproduce

I found this issue while debugging the failure of `rlimit_stack` on the fork of PR #2131. However, I can reproduce the same issue on the main branch.
1. Use the docker image provided in `.ci` and clone the main branch of Gramine
2. Build Gramine in SGX debug mode with ASan and UBSan enabled
```
CC=clang CXX=clang++ meson setup build/ --werror --prefix=/workspace/install --buildtype=debug -Ddirect=disabled -Dsgx=enabled -Dtests=enabled -Dlibc=glibc -Dubsan=enabled -Dasan=enabled
```

3. Build and install Gramine, then `cd libos/test/regression` and run `gramine-manifest` and `gramine-sgx-sign` on the `rlimit_stack` manifest
4. Repeat `gramine-sgx rlimit_stack` until it fails. I used this command to reproduce the failure (usually within a few minutes):
```
while gramine-sgx rlimit_stack | grep -q "TEST OK"; do     echo "TEST OK found, running again..."; clear; done
```
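The one-liner above discards the output of the failing run. A variant that keeps it could look like the sketch below. `run_until_failure` and the log file name are hypothetical, and the sketch assumes, like the loop above, that a successful run prints `TEST OK`.

```shell
#!/bin/sh
# run_until_failure LOG_FILE COMMAND [ARGS...]
# Hypothetical helper: rerun COMMAND until its output no longer contains
# "TEST OK". LOG_FILE is overwritten on every run, so after the loop ends it
# holds the stdout and stderr of the failing run.
run_until_failure() {
    log_file="$1"
    shift
    attempts=0
    while :; do
        attempts=$((attempts + 1))
        "$@" > "$log_file" 2>&1
        if ! grep -q "TEST OK" "$log_file"; then
            echo "failed after $attempts run(s); output saved to $log_file"
            return 0
        fi
    done
}

# Usage (assumes the manifest is already generated and signed):
# run_until_failure rlimit_stack.failed.log gramine-sgx rlimit_stack
```

The saved log can then be searched for the `pal_thread_init` error and for any extra debug messages added while investigating.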

### Expected results

The TLS-handshake thread created by `init_ipc_worker` unmaps its TCS before `connect_to_process` creates its own handshake thread, leaving at least one TCS slot free.

### Actual results

Intermittently, the first handshake thread has not yet unmapped its TCS. All four default slots are still occupied, so `connect_to_process` fails to create the second handshake thread and the child exits with the error above.

### Gramine commit hash

ff71d7afea730dffd56a97af39bb6a73ee6c7662 / 8e0313a5f3ad9f505f583cf6975d48d4ed1eea72
