@forkthus forkthus commented Jul 30, 2025

Description of the changes

Fixes #2148

This PR makes execve() wait until every sibling thread’s *clear_child_tid is zeroed before deallocating their VMAs.

Implementation details:

  1. The calling thread first grabs the g_thread_list.
  2. For each sibling thread, the calling thread checks *clear_child_tid:
    If *clear_child_tid != 0, it invokes futex_wait() and is later woken by release_clear_child_tid() via futex_wake().
  3. Once *clear_child_tid has been cleared for all other threads (except the main thread), the calling thread starts deallocating VMAs.
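
The waiting scheme in the steps above can be sketched in user space. This is only an illustrative analogy, not the code in this PR: it swaps the futex calls for a pthread mutex and condition variable so that it stays portable, and `struct sibling`, `sibling_exit()`, `wait_for_siblings()` and `run_demo()` are hypothetical names that do not exist in Gramine.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_SIBLINGS 8

/* Hypothetical stand-in for a sibling thread's state; not Gramine's struct. */
struct sibling {
    pthread_t handle;
    int started;                /* 1 if pthread_create() succeeded */
    atomic_int clear_child_tid; /* nonzero while the thread has not exited */
};

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t g_cleared = PTHREAD_COND_INITIALIZER;

/* Plays the role of release_clear_child_tid(): zero the word, wake waiters. */
static void* sibling_exit(void* arg) {
    struct sibling* s = arg;
    pthread_mutex_lock(&g_lock);
    atomic_store(&s->clear_child_tid, 0);
    pthread_cond_broadcast(&g_cleared); /* stand-in for futex_wake() */
    pthread_mutex_unlock(&g_lock);
    return NULL;
}

/* Plays the role of the execve() caller: block until every word is zero. */
static void wait_for_siblings(struct sibling* sibs, size_t n) {
    pthread_mutex_lock(&g_lock);
    for (size_t i = 0; i < n; i++) {
        while (atomic_load(&sibs[i].clear_child_tid) != 0)
            pthread_cond_wait(&g_cleared, &g_lock); /* stand-in for futex_wait() */
    }
    pthread_mutex_unlock(&g_lock);
    /* Only at this point would it be safe to start deallocating VMAs. */
}

/* Runs the scenario with up to MAX_SIBLINGS threads and returns how many
 * words are still nonzero after the wait. */
static int run_demo(size_t n) {
    struct sibling sibs[MAX_SIBLINGS];
    if (n > MAX_SIBLINGS)
        n = MAX_SIBLINGS;

    for (size_t i = 0; i < n; i++) {
        sibs[i].started = 0;
        atomic_init(&sibs[i].clear_child_tid, (int)i + 1); /* fake nonzero TID */
    }
    for (size_t i = 0; i < n; i++) {
        if (pthread_create(&sibs[i].handle, NULL, sibling_exit, &sibs[i]) == 0)
            sibs[i].started = 1;
        else
            atomic_store(&sibs[i].clear_child_tid, 0); /* nothing to wait for */
    }

    wait_for_siblings(sibs, n);

    int remaining = 0;
    for (size_t i = 0; i < n; i++) {
        if (atomic_load(&sibs[i].clear_child_tid) != 0)
            remaining++;
    }
    for (size_t i = 0; i < n; i++) {
        if (sibs[i].started)
            pthread_join(sibs[i].handle, NULL);
    }
    return remaining;
}
```

The value is re-checked in a `while` loop after every wakeup, which is the same discipline a futex-based wait needs, since a wakeup alone does not prove the word has been zeroed.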

How to test this PR?

Run gramine-sgx exec_same [args_#1...args_#49] repeatedly in a loop.

Without this PR, the test usually fails within a few minutes, especially on the branch of PR #1795, because of the issue mentioned above. The main branch takes longer to fail.
With this PR applied, the same loop runs for hours without any failures on the branch of PR #1795.



… threads’ clear_child_tid before unmapping VMAs

Signed-off-by: Wenhan Ji <[email protected]>
@mkow
Member

mkow commented Aug 4, 2025

Jenkins, test this please

@forkthus
Author

forkthus commented Aug 4, 2025

Jenkins, retest this please

(All failures seem to be connectivity issues.)

ERROR: Checkout failed
[2025-08-04T11:02:59.002Z] java.io.StreamCorruptedException: invalid stream header: 636F7272
...
[2025-08-04T11:02:59.002Z] Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to penguins-3-noble
[2025-08-04T11:02:59.002Z] Caused: hudson.remoting.RequestAbortedException

[2025-08-04T11:15:36.177Z] Connecting to busybox.net (busybox.net)|140.211.167.122|:443... connected.
[2025-08-04T11:16:44.937Z] Unable to establish SSL connection.
[2025-08-04T11:16:44.937Z] download: WARNING: Hash mismatch: Expected 415fbd89e5344c96acf449d94a6f956dbed62e18e835fc83e064db33a34bd549 but received e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
[2025-08-04T11:16:44.937Z] download: ERROR: Failed to download 'busybox.tar.bz2' (415fbd89...)! No URLs left to try.
[2025-08-04T11:16:44.937Z] make: *** [Makefile:13: busybox.tar.bz2] Error 1

Could someone kindly help me trigger a Jenkins retest? My retest command doesn’t seem to work, possibly due to a permission issue. I also don’t have access to the build logs.

@kailun-qin
Contributor

Jenkins, retest this please

@donporter
Contributor

Add to whitelist

Contributor

@donporter donporter left a comment

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: )


libos/src/bookkeep/libos_thread.c line 567 at r1 (raw file):

        free_threads_array(threads, count);
        count = needed_count;

Minor: assert needed_count > count?

This is a clever way to do things, but a comment or two would help the reader understand why this loop terminates (i.e., that count monotonically increases to match the actual thread count, and new threads should not be created at this point).

Or, consider restructuring as a do loop, with needed_count <= count being the loop condition.
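
To make the termination argument concrete, here is a minimal sketch of the retry-with-a-larger-buffer pattern under discussion. It is not the code from libos_thread.c: `snapshot_ids()` and `collect_ids()` are hypothetical helpers, and the fixed `src` array stands in for a thread list that cannot grow while the loop runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical snapshot helper: copies up to `cap` ids into `out` and returns
 * the total number of entries available, which may exceed `cap`. */
static size_t snapshot_ids(const int* src, size_t src_len, int* out, size_t cap) {
    size_t n = src_len < cap ? src_len : cap;
    for (size_t i = 0; i < n; i++)
        out[i] = src[i];
    return src_len;
}

/* Retries with a larger buffer until the snapshot fits. Returns a malloc'd
 * array (the caller frees it) and stores its length in *out_count, or returns
 * NULL on allocation failure. */
static int* collect_ids(const int* src, size_t src_len, size_t* out_count) {
    size_t count = 1; /* deliberately small initial guess */
    for (;;) {
        int* buf = malloc(count * sizeof(*buf));
        if (!buf)
            return NULL;

        size_t needed_count = snapshot_ids(src, src_len, buf, count);
        if (needed_count <= count) {
            *out_count = needed_count;
            return buf;
        }

        /* Retry path: the buffer was too small, so `count` strictly grows.
         * Because the source cannot grow here, the next attempt fits and the
         * loop terminates. */
        assert(needed_count > count);
        free(buf);
        count = needed_count;
    }
}
```

With a three-element source, the first attempt (capacity 1) is too small, `count` becomes 3, and the second attempt succeeds. If the source could grow concurrently, the loop would still make progress but would no longer be bounded to two iterations.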

@donporter
Copy link
Contributor

The deb job failed with: The repository 'http://deb.debian.org/debian bullseye-backports Release' no longer has a Release file.

This looks legit, and unrelated to this PR.

@forkthus forkthus marked this pull request as ready for review August 7, 2025 00:38
…r other threads’ clear_child_tid before unmapping VMAs

Signed-off-by: Wenhan Ji <[email protected]>
Author

@forkthus forkthus left a comment

Reviewable status: 3 of 4 files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @donporter)


libos/src/bookkeep/libos_thread.c line 567 at r1 (raw file):

Previously, donporter (Don Porter) wrote…

Minor: assert needed_count > count?

This is a clever way to do things, but a comment or two would help the reader understand why this loop terminates (i.e., that count monotonically increases to match the actual thread count, and new threads should not be created at this point).

Or, consider restructuring as a do loop, with needed_count <= count being the loop condition.

Done.
I added comments explaining why the loop terminates (count increases monotonically to the needed size) and an assert on the retry path.
I kept the structure to match the existing style; this pattern mirrors dump_vmas() in libos_vma.c.

@donporter
Contributor

Jenkins, retest this please. Just seeing if the webhook gets reactivated.

Successfully merging this pull request may close these issues.

[LibOS] Race condition triggers ASan use‑after‑poison in execve path ( release_clear_child_tid )
