Here's my proposal to clean up both #908 and #34, which should close a lot of our threading problems. There's an additional issue about blocking syscalls at the end, but that can be handled separately. If I can get approval on this, I should be able to implement it ASAP.
Replacing thread_suicide() with asyncify-based exit (issues #34, #908)
Problem
thread_suicide() (signal.rs:144) raises Trap::Interrupt to kill a wasm instance. catch_traps() (traphandlers.rs:254) catches this trap, prints "Terminated", and returns Ok(()). This bypasses the entire exit cleanup path:
- lind_manager.decrement() never called → lind-boot hangs on lind_manager.wait()
- No zombie created → parent's waitpid() blocks forever
- No SIGCHLD sent to parent
- No fdtable cleanup
The exit_call() method (lib.rs:967) already implements proper exit via asyncify unwind — it does cleanup (rm_vmctx, lind_manager.decrement) and sets OnCalledAction::Finish so _start.call() returns normally with an exit code. But signal termination and thread killing bypass this entirely.
Root cause
Two exit mechanisms exist that should be one:
- Normal exit: exit() → RawPOSIX → 3i → exit_call() → asyncify unwind → cleanup ✓
- Signal/kill exit: signal_handler() → thread_suicide() → trap → no cleanup ✗
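To make the difference between the two paths concrete, here is a toy model in plain Rust. It is only a sketch: `LiveCages`, `ExitPath`, and `terminate` are hypothetical names invented for illustration, not lind-wasm types, and the model tracks just one piece of cleanup (the running-cage count that `lind_manager.wait()` depends on).

```rust
/// Toy stand-in for `lind_manager`: counts cages that are still running.
/// The real type blocks in `wait()`; this model only exposes the count.
#[derive(Debug)]
pub struct LiveCages {
    running: usize,
}

impl LiveCages {
    pub fn new(running: usize) -> Self {
        LiveCages { running }
    }

    pub fn decrement(&mut self) {
        self.running = self.running.saturating_sub(1);
    }

    /// In the real system, `wait()` would only return once this is true.
    pub fn all_exited(&self) -> bool {
        self.running == 0
    }
}

/// How a cage's instance stopped running (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitPath {
    /// Trap raised and swallowed by the trap handler: no cleanup runs.
    Trap,
    /// Unwind through the exit routine: cleanup runs before returning.
    ExitCall { code: i32 },
}

/// Returns the exit code the embedder observes, if any.
pub fn terminate(manager: &mut LiveCages, path: ExitPath) -> Option<i32> {
    match path {
        // Nothing is decremented and no exit code is recorded.
        ExitPath::Trap => None,
        ExitPath::ExitCall { code } => {
            manager.decrement();
            Some(code)
        }
    }
}
```

In this model, a cage terminated through `ExitPath::Trap` leaves `all_exited()` false forever, which corresponds to the lind-boot hang described above; routing the same termination through `ExitPath::ExitCall` lets the count reach zero.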
Fix
Replace thread_suicide() with exit_call() everywhere. signal_handler already has a Caller<'_, T> and can access ctx:

    // signal.rs, Terminate branch (currently lines 69-76):
    SignalDefaultHandler::Terminate => {
        cage::cage_record_exit_status(cageid, cage::ExitStatus::Signaled(signo, false));
        cage::signal::epoch_kill_all(cageid);
        // OLD: thread_suicide();
        // NEW: proper asyncify exit
        ctx.exit_call(caller, 128 + signo as i32, 1);
        return 0;
    }

exit_call sets up asyncify unwind + OnCalledAction::Finish. The unwind propagates back through the call chain (signal_handler → epoch_callback → wasm → _start.call() returns with exit code). All cleanup happens in exit_call before the unwind starts.
Same fix for thread_check_killed (lines 44-47): call exit_call instead of thread_suicide(). For killed non-main threads, pass is_last_thread=0, since the killing thread handles cage cleanup.
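The intended split between per-thread and cage-wide cleanup can be sketched with a small model. Everything here is an assumption made for illustration: `CageModel`, its fields, and `terminate_by_signal` are hypothetical, the real `exit_call` takes a `Caller` and performs an asyncify unwind, and the model runs the killed threads sequentially even though in practice they exit asynchronously.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of a cage; field names are illustrative only.
#[derive(Debug)]
pub struct CageModel {
    pub live_threads: BTreeSet<u32>,
    /// How many times cage-wide cleanup ran (should end up exactly 1).
    pub cage_cleanups: u32,
    pub exit_code: Option<i32>,
}

impl CageModel {
    pub fn new(tids: &[u32]) -> Self {
        CageModel {
            live_threads: tids.iter().copied().collect(),
            cage_cleanups: 0,
            exit_code: None,
        }
    }

    /// Model of the exit routine: per-thread teardown always runs,
    /// cage-wide teardown only when `is_last_thread` is set.
    pub fn exit_call(&mut self, tid: u32, code: i32, is_last_thread: bool) {
        self.live_threads.remove(&tid);
        if is_last_thread {
            self.cage_cleanups += 1;
            self.exit_code = Some(code);
        }
    }

    /// Model of a fatal signal handled on thread `killer`: every other
    /// thread exits with `is_last_thread = false`, then the killer
    /// finishes the cage with the 128 + signo exit code.
    pub fn terminate_by_signal(&mut self, killer: u32, signo: i32) {
        let code = 128 + signo;
        let others: Vec<u32> = self
            .live_threads
            .iter()
            .copied()
            .filter(|&tid| tid != killer)
            .collect();
        for tid in others {
            self.exit_call(tid, code, false);
        }
        self.exit_call(killer, code, true);
    }
}
```

The property worth checking in the real implementation is the one this model makes visible: however many threads are killed, cage-wide cleanup runs exactly once, on the thread that passes `is_last_thread` as true.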
exit_group semantics (issue #34)
Currently exit() only exits the calling thread. Other threads in the cage continue. The fix:
In RawPOSIX exit_syscall (sys_calls.rs): Before lind_thread_exit(), call epoch_kill_all(cageid) to mark all other threads for death. Non-main threads hit thread_check_killed → exit_call(is_last_thread=0) → asyncify unwind → their _start.call() returns → thread exits. Once all other threads are gone, the calling thread proceeds as last thread with full cage cleanup.
The wait for the other threads to die can check epoch_handler.len() == 1 (only the calling thread remains). This may need a condvar or a short spin, since threads die asynchronously via asyncify.
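One possible shape for the condvar variant is sketched below using only the Rust standard library. `ThreadRegistry` is a hypothetical stand-in for whatever backs `epoch_handler.len()`, and the timeout is an added assumption so that a thread stuck in a host syscall cannot block the exiting thread indefinitely.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical per-cage registry of live threads; not the real type.
pub struct ThreadRegistry {
    live: Mutex<usize>,
    changed: Condvar,
}

impl ThreadRegistry {
    pub fn new(live: usize) -> Self {
        ThreadRegistry {
            live: Mutex::new(live),
            changed: Condvar::new(),
        }
    }

    /// Called by a thread as the last step of its exit path.
    pub fn thread_exited(&self) {
        let mut live = self.live.lock().unwrap();
        *live -= 1;
        self.changed.notify_all();
    }

    /// Block until only the caller is left, or until `timeout` elapses.
    /// Returns true if the caller is now the only live thread.
    pub fn wait_until_alone(&self, timeout: Duration) -> bool {
        let guard = self.live.lock().unwrap();
        let (guard, _timed_out) = self
            .changed
            .wait_timeout_while(guard, timeout, |live| *live > 1)
            .unwrap();
        *guard == 1
    }
}

/// Demo: one exiting thread plus `others` threads that die asynchronously.
pub fn demo_exit_group(others: usize) -> bool {
    let registry = Arc::new(ThreadRegistry::new(others + 1));
    let handles: Vec<_> = (0..others)
        .map(|i| {
            let registry = Arc::clone(&registry);
            thread::spawn(move || {
                // Stand-in for "see the kill flag at the next epoch, unwind".
                thread::sleep(Duration::from_millis(5 * (i as u64 + 1)));
                registry.thread_exited();
            })
        })
        .collect();
    let alone = registry.wait_until_alone(Duration::from_secs(5));
    for handle in handles {
        handle.join().unwrap();
    }
    alone
}
```

Compared with a spin loop, the condvar wakes the exiting thread as soon as the count changes and costs nothing while waiting. A `false` return from `wait_until_alone` is the case covered under "What this doesn't solve": some thread never reached an epoch check.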
exec
Same pattern: epoch_kill_all other threads, wait for them to exit via asyncify, then proceed with module re-instantiation.
What this doesn't solve
Threads blocked in host syscalls (libc::read, futex_wait, etc.) won't see the epoch because wasm isn't executing. They need a separate interruption mechanism (likely storing the host pthread_t per thread and using pthread_kill to send a signal so that the blocked call returns EINTR). This is a separate issue from the exit path itself: the asyncify exit fix handles all threads that are executing wasm, which is the common case.