proposal: fix exit_group and thread_suicide

Here's my proposal to clean up both #908 and #34, which should resolve many of our threading problems. There's an additional issue with blocking syscalls, described at the end, but that can be handled separately. If this is approved, I should be able to implement it ASAP.

## Replacing `thread_suicide()` with asyncify-based exit (issues #34, #908)

### Problem

`thread_suicide()` (`signal.rs:144`) raises `Trap::Interrupt` to kill a wasm instance. `catch_traps()` (`traphandlers.rs:254`) catches this trap, prints "Terminated", and returns `Ok(())`. This bypasses the entire exit cleanup path:

- `lind_manager.decrement()` never called → lind-boot hangs on `lind_manager.wait()`
- No zombie created → parent's `waitpid()` blocks forever
- No SIGCHLD sent to parent
- No fdtable cleanup

The `exit_call()` method (`lib.rs:967`) already implements proper exit via asyncify unwind — it does cleanup (rm_vmctx, lind_manager.decrement) and sets `OnCalledAction::Finish` so `_start.call()` returns normally with an exit code. But signal termination and thread killing bypass this entirely.

### Root cause

Two exit mechanisms exist that should be one:
1. **Normal exit**: `exit()` → RawPOSIX → 3i → `exit_call()` → asyncify unwind → cleanup ✓
2. **Signal/kill exit**: `signal_handler()` → `thread_suicide()` → trap → no cleanup ✗
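
To make the difference between the two paths concrete, here is a toy model, not the real code. `CageState`, `ExitCause`, `exit_via_trap`, and `exit_unified` are hypothetical names invented for illustration. The four flags mirror the cleanup steps listed under Problem, and `128 + signo` follows the exit code used in the Fix below.

```rust
/// Toy model of the bookkeeping an exit is expected to leave behind.
/// All names are hypothetical; none are taken from the lind codebase.
#[derive(Debug, Default, PartialEq)]
pub struct CageState {
    pub manager_decremented: bool,
    pub zombie_recorded: bool,
    pub sigchld_sent: bool,
    pub fds_closed: bool,
}

/// Why the instance is exiting.
pub enum ExitCause {
    Normal(i32),
    Signal(i32),
}

/// Trap-style exit: the instance stops, but no cleanup is recorded.
pub fn exit_via_trap(_state: &mut CageState) {}

/// Unified exit: run every cleanup step first, then compute the exit code.
pub fn exit_unified(state: &mut CageState, cause: ExitCause) -> i32 {
    state.manager_decremented = true;
    state.zombie_recorded = true;
    state.sigchld_sent = true;
    state.fds_closed = true;
    match cause {
        ExitCause::Normal(code) => code,
        ExitCause::Signal(signo) => 128 + signo,
    }
}
```

In this model, `exit_via_trap` leaves `CageState` untouched, while both causes routed through `exit_unified` leave the same fully cleaned-up state. That is the property the fix aims for.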

### Fix

Replace `thread_suicide()` with `exit_call()` everywhere. `signal_handler` already has a `Caller<'_, T>` and can access `ctx`:

```rust
// signal.rs, Terminate branch (currently lines 69-76):
SignalDefaultHandler::Terminate => {
    cage::cage_record_exit_status(cageid, cage::ExitStatus::Signaled(signo, false));
    cage::signal::epoch_kill_all(cageid);
    // OLD: thread_suicide();
    // NEW: proper asyncify exit
    ctx.exit_call(caller, 128 + signo as i32, 1);
    return 0;
}
```

`exit_call` sets up asyncify unwind + `OnCalledAction::Finish`. The unwind propagates back through the call chain (`signal_handler` → `epoch_callback` → wasm → `_start.call()` returns with exit code). All cleanup happens in `exit_call` before the unwind starts.

The same fix applies to `thread_check_killed` (lines 44-47): instead of calling `thread_suicide()`, call `exit_call`. For killed non-main threads, pass `is_last_thread=0`, since the killing thread handles cage cleanup.

### exit_group semantics (issue #34)

Currently `exit()` only exits the calling thread. Other threads in the cage continue. The fix:

**In RawPOSIX `exit_syscall`** (`sys_calls.rs`): Before `lind_thread_exit()`, call `epoch_kill_all(cageid)` to mark all other threads for death. Non-main threads hit `thread_check_killed` → `exit_call(is_last_thread=0)` → asyncify unwind → their `_start.call()` returns → thread exits. Once all other threads are gone, the calling thread proceeds as last thread with full cage cleanup.

The wait for the other threads to die can check `epoch_handler.len() == 1` (only the calling thread remains). This may need a condvar or a short spin, since threads die asynchronously via asyncify.
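
A minimal sketch of the condvar variant, assuming a per-cage live-thread counter. `CageThreads` and its methods are hypothetical stand-ins, not the real `epoch_handler` API. The worker threads in `simulate_exit_group` poll a kill flag in place of the epoch check in `thread_check_killed`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical per-cage thread bookkeeping (stand-in for epoch_handler).
pub struct CageThreads {
    live: Mutex<usize>,
    changed: Condvar,
    killed: AtomicBool,
}

impl CageThreads {
    pub fn new(live: usize) -> Self {
        CageThreads {
            live: Mutex::new(live),
            changed: Condvar::new(),
            killed: AtomicBool::new(false),
        }
    }

    /// Analogue of epoch_kill_all: mark every thread for death.
    pub fn kill_all(&self) {
        self.killed.store(true, Ordering::SeqCst);
    }

    pub fn is_killed(&self) -> bool {
        self.killed.load(Ordering::SeqCst)
    }

    /// Called by a thread once its exit path has finished.
    pub fn thread_exited(&self) {
        let mut live = self.live.lock().unwrap();
        *live -= 1;
        self.changed.notify_all();
    }

    /// Block until only the calling thread remains; returns the live count.
    pub fn wait_until_last(&self) -> usize {
        let mut live = self.live.lock().unwrap();
        while *live > 1 {
            live = self.changed.wait(live).unwrap();
        }
        *live
    }
}

/// Simulate exit_group with `others` extra threads in the cage.
pub fn simulate_exit_group(others: usize) -> usize {
    let cage = Arc::new(CageThreads::new(others + 1));
    let handles: Vec<_> = (0..others)
        .map(|_| {
            let cage = Arc::clone(&cage);
            thread::spawn(move || {
                // Stand-in for wasm code periodically reaching the kill check.
                while !cage.is_killed() {
                    thread::sleep(Duration::from_millis(1));
                }
                cage.thread_exited();
            })
        })
        .collect();
    cage.kill_all();
    let remaining = cage.wait_until_last();
    for handle in handles {
        handle.join().unwrap();
    }
    remaining
}
```

The `while *live > 1` loop guards against spurious wakeups. Because the counter is only modified under the mutex, a thread that exits before the caller starts waiting is still observed. A short spin on the same counter would also work but burns CPU while the other threads unwind.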

### exec

Same pattern: `epoch_kill_all` other threads, wait for them to exit via asyncify, then proceed with module re-instantiation.

### What this doesn't solve

Threads blocked in host syscalls (`libc::read`, `futex_wait`, etc.) won't see the epoch because wasm isn't executing. They need a separate interruption mechanism, likely storing the host `pthread_t` for each thread and using `pthread_kill` to deliver a signal so that the blocked call returns EINTR. This is a separate issue from the exit path itself: the asyncify exit fix handles all threads that are executing wasm, which is the common case.
