proposal: fix exit_group and thread_suicide #914

@rennergade

Here's my proposal to clean up both #908 and #34, which should close out a lot of our threading problems. There's an additional issue with blocking syscalls, covered at the end, but that can be handled separately. If I can get approval on this, I should be able to implement it ASAP.

Replacing thread_suicide() with asyncify-based exit (issues #34, #908)

Problem

thread_suicide() (signal.rs:144) raises Trap::Interrupt to kill a wasm instance. catch_traps() (traphandlers.rs:254) catches this trap, prints "Terminated", and returns Ok(()). This bypasses the entire exit cleanup path:

  • lind_manager.decrement() never called → lind-boot hangs on lind_manager.wait()
  • No zombie created → parent's waitpid() blocks forever
  • No SIGCHLD sent to parent
  • No fdtable cleanup

The exit_call() method (lib.rs:967) already implements proper exit via asyncify unwind — it does cleanup (rm_vmctx, lind_manager.decrement) and sets OnCalledAction::Finish so _start.call() returns normally with an exit code. But signal termination and thread killing bypass this entirely.

Root cause

Two exit mechanisms exist that should be one:

  1. Normal exit: exit() → RawPOSIX → 3i → exit_call() → asyncify unwind → cleanup ✓
  2. Signal/kill exit: signal_handler() → thread_suicide() → trap → no cleanup ✗
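
To make the difference between the two paths concrete, here is a toy model, not the real code. `CageCounter` is a hypothetical stand-in for lind_manager, and `Stop`, `stop_via_trap`, and `stop_via_exit_call` are names invented for illustration. It only shows that the trap path leaves the running count untouched (so anything waiting on it never wakes), while the exit_call path decrements it and returns an exit code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for lind_manager: tracks how many cages are running.
pub struct CageCounter {
    running: AtomicUsize,
}

impl CageCounter {
    pub fn new(running: usize) -> Self {
        Self { running: AtomicUsize::new(running) }
    }

    pub fn decrement(&self) {
        self.running.fetch_sub(1, Ordering::SeqCst);
    }

    pub fn running(&self) -> usize {
        self.running.load(Ordering::SeqCst)
    }
}

/// How the instance stopped, as seen by whoever called `_start`.
#[derive(Debug, PartialEq)]
pub enum Stop {
    /// Trap caught by the trap handler: reported as success, no exit code.
    TrapSwallowed,
    /// Asyncify unwind finished: the exit code reaches the caller.
    Finished(i32),
}

/// Models the current signal/kill path: trap, no cleanup at all.
pub fn stop_via_trap(_cages: &CageCounter) -> Stop {
    Stop::TrapSwallowed
}

/// Models the proposed path: cleanup first, then unwind with an exit code.
pub fn stop_via_exit_call(cages: &CageCounter, code: i32) -> Stop {
    cages.decrement();
    Stop::Finished(code)
}
```

With one running cage, `stop_via_trap` leaves `running()` at 1, which corresponds to the lind-boot hang; `stop_via_exit_call` brings it to 0.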

Fix

Replace thread_suicide() with exit_call() everywhere. signal_handler already has a Caller<'_, T> and can access ctx:

// signal.rs, Terminate branch (currently lines 69-76):
SignalDefaultHandler::Terminate => {
    cage::cage_record_exit_status(cageid, cage::ExitStatus::Signaled(signo, false));
    cage::signal::epoch_kill_all(cageid);
    // OLD: thread_suicide();
    // NEW: proper asyncify exit
    ctx.exit_call(caller, 128 + signo as i32, 1);
    return 0;
}

exit_call sets up asyncify unwind + OnCalledAction::Finish. The unwind propagates back through the call chain (signal_handler → epoch_callback → wasm → _start.call() returns with exit code). All cleanup happens in exit_call before the unwind starts.

Same fix for thread_check_killed (lines 44-47): call exit_call instead of thread_suicide(). For killed non-main threads, pass is_last_thread=0, since the killing thread handles cage cleanup.
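
A rough sketch of what the thread_check_killed change could look like, under heavy assumptions: the `ExitCall` trait, the `RecordingCtx` test double, and the `killed` / `exit_code` parameters are all hypothetical, and the real exit_call also takes a `Caller`, which is left out here. The only point is the shape of the change: a killed thread goes through exit_call with is_last_thread=0 instead of trapping.

```rust
/// Hypothetical slice of the context API. The real exit_call also takes a
/// Caller argument, which this model leaves out.
pub trait ExitCall {
    fn exit_call(&mut self, exit_code: i32, is_last_thread: i32);
}

/// Test double that records every exit_call it receives.
#[derive(Default)]
pub struct RecordingCtx {
    pub calls: Vec<(i32, i32)>,
}

impl ExitCall for RecordingCtx {
    fn exit_call(&mut self, exit_code: i32, is_last_thread: i32) {
        self.calls.push((exit_code, is_last_thread));
    }
}

/// Returns true when the thread was marked killed and the exit path started.
/// `killed` and `exit_code` are passed in here; how the real function learns
/// them is not modeled.
pub fn thread_check_killed<C: ExitCall>(ctx: &mut C, killed: bool, exit_code: i32) -> bool {
    if !killed {
        return false;
    }
    // OLD: thread_suicide();
    // NEW: go through exit_call; 0 because the killing thread owns cage cleanup.
    ctx.exit_call(exit_code, 0);
    true
}
```

Which exit code a killed thread should report is not settled above, so the sketch takes it as a parameter rather than picking one.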

exit_group semantics (issue #34)

Currently exit() only exits the calling thread. Other threads in the cage continue. The fix:

In RawPOSIX exit_syscall (sys_calls.rs): Before lind_thread_exit(), call epoch_kill_all(cageid) to mark all other threads for death. Non-main threads hit thread_check_killed → exit_call(is_last_thread=0) → asyncify unwind → their _start.call() returns → thread exits. Once all other threads are gone, the calling thread proceeds as last thread with full cage cleanup.

The wait for other threads to die can check epoch_handler.len() == 1 (only the calling thread remains). This may need a condvar or a short spin, since threads die asynchronously via asyncify.
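
One possible shape for the condvar variant, as a sketch only. `ThreadRegistry` is a hypothetical stand-in for whatever backs epoch_handler.len(), and `thread_exited` / `wait_until_last` are invented names. The timeout is an assumption added so the exiting thread cannot wait forever on a thread that never reaches its exit path (for example, one blocked in a host syscall).

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical per-cage registry of live threads (models epoch_handler.len()).
pub struct ThreadRegistry {
    live: Mutex<usize>,
    changed: Condvar,
}

impl ThreadRegistry {
    pub fn new(live: usize) -> Self {
        Self { live: Mutex::new(live), changed: Condvar::new() }
    }

    /// Called by a dying thread at the end of its exit path.
    pub fn thread_exited(&self) {
        let mut live = self.live.lock().unwrap();
        *live = live.saturating_sub(1);
        // Wake the exiting thread so it can re-check the count.
        self.changed.notify_all();
    }

    /// Blocks until only the caller remains or `timeout` elapses.
    /// Returns true if the caller is now the last live thread.
    pub fn wait_until_last(&self, timeout: Duration) -> bool {
        let guard = self.live.lock().unwrap();
        let (guard, _timed_out) = self
            .changed
            .wait_timeout_while(guard, timeout, |live| *live > 1)
            .unwrap();
        *guard <= 1
    }
}
```

The exiting thread would call `wait_until_last` after epoch_kill_all and only proceed with full cage cleanup when it returns true. What to do on timeout is left open here, since it ties into the blocked-syscall case.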

exec

Same pattern: epoch_kill_all other threads, wait for them to exit via asyncify, then proceed with module re-instantiation.

What this doesn't solve

Threads blocked in host syscalls (libc::read, futex_wait, etc.) won't see the epoch because wasm isn't executing. They need a separate interruption mechanism (likely storing host pthread_t per thread and using pthread_kill to deliver EINTR). This is a separate issue from the exit path itself — the asyncify exit fix handles all threads that are executing wasm, which is the common case.
