fix(cli): stop box before hard-exit on non-zero run exit code #622

Open
G4614 wants to merge 5 commits into boxlite-ai:main from G4614:fix/cli-run-rm-abnormal-exit

Contributor

@G4614 G4614 commented May 29, 2026

When the box exits abnormally (non-zero return code), the shim process should still be released via RAII.

This appeared because of #604.

Test plan

Existing exit-code tests (branch ⑥: non-zero exit)

Each abnormal-exit boxlite run integration test was verified two-sided on this branch: side A with the RAII fix applied, side B with run.rs reverted to std::process::exit. Side B's leak surfaces as PerTestBoxHome::Drop panicking with live shim(s): [pid], because the std::process::exit shortcut bypasses RuntimeImpl::Drop → shutdown_sync.

| test | side A (RAII) | side B (pre-fix std::process::exit) |
| --- | --- | --- |
| test_run_exit_code_125 | ok (8.14s) | FAILED: live shim(s): [320092] |
| test_run_exit_code_custom | ok (8.14s) | FAILED: live shim(s) |
| test_run_signal_exit_code_sigterm | ok (16.42s) | FAILED: live shim(s): [323543] |
| test_run_signal_exit_code_sigkill | ok (16.42s) | FAILED: live shim(s) |
| test_run_signal_exit_code_sigint | ok (16.42s) | FAILED: live shim(s) |
| test_run_python_error_handling | ok (8.06s) | FAILED: live shim(s): [326872] |
| test_run_exit_code_success (control) | ok (8.14s) | ok (17.85s); exit 0 never hits the buggy branch |

Focused reproducer for branch ⑥ (added on review)

test_run_rm_non_zero_exit_does_not_leak_shim runs the same scenario (run --rm alpine:latest sh -c 'exit 7') but scans <home>/boxes/*/shim.pid for live PIDs in the test body — so the no-leak assertion is visible at the call site instead of buried in PerTestBoxHome::Drop's panic.
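
A minimal sketch of the kind of scan described above, for readers who want to see its shape. The name live_shim_pids matches the helper mentioned in this PR, but the signature, the injected `is_alive` predicate, and the `/proc`-based probe are assumptions made for illustration, not the code in test_utils.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Collect the PIDs recorded in `<home>/boxes/*/shim.pid` that `is_alive`
/// reports as still running. The predicate is injected (an assumption of this
/// sketch) so the scan can be exercised without spawning real processes.
fn live_shim_pids(home: &Path, is_alive: impl Fn(u32) -> bool) -> io::Result<Vec<u32>> {
    let boxes = home.join("boxes");
    let entries = match fs::read_dir(&boxes) {
        Ok(entries) => entries,
        // No boxes directory means no box was created, so nothing can leak.
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(Vec::new()),
        Err(e) => return Err(e),
    };
    let mut live = Vec::new();
    for entry in entries {
        let pid_file = entry?.path().join("shim.pid");
        // A missing or unparsable pid file is treated as "no shim recorded".
        let Ok(text) = fs::read_to_string(&pid_file) else { continue };
        let Ok(pid) = text.trim().parse::<u32>() else { continue };
        if is_alive(pid) {
            live.push(pid);
        }
    }
    live.sort_unstable();
    Ok(live)
}

/// Hypothetical Linux-only liveness probe: a process exists while
/// `/proc/<pid>` does.
fn proc_is_alive(pid: u32) -> bool {
    Path::new("/proc").join(pid.to_string()).exists()
}
```

A test body would then assert that `live_shim_pids(&home, proc_is_alive)?` is empty after `boxlite run --rm` returns, which keeps the no-leak check at the call site.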

| step | code state | result |
| --- | --- | --- |
| A | RAII fix applied | test_run_rm_non_zero_exit_does_not_leak_shim ... ok (5.79s) |
| B | run.rs:93 reverted to std::process::exit(to_shell_exit_code(exit_code)) | FAILED: non-zero boxlite run --rm left live shim PID(s): [286477] |
| C | RAII fix restored | test_run_rm_non_zero_exit_does_not_leak_shim ... ok (6.06s) |

Other early-return branches in BoxRunner::run

Reviewer's request was "execution and litebox dropped automatically in any branch". The remaining CLI-reachable early-returns each get their own no-leak test:

| branch | trigger | test | two-sided? |
| --- | --- | --- | --- |
| ① validate_flags? | --tty with non-TTY stdin | test_run_tty_error_in_pipe (existing) | n/a: fails before any box is created, structurally cannot leak |
| ② create_box? | image pull failure | test_run_image_pull_failure_does_not_leak_shim | n/a: pull fails before any shim is spawned, structurally cannot leak |
| ③ litebox.exec? | invoking /etc (a directory) | test_run_exec_setup_failure_does_not_leak_shim | yes, see below |
| ④ detach return Ok(0) | -d flag | test_run_detach (existing, manually rm's the box) | n/a: keeping the box alive is the intended behavior |
| ⑤ streamer.start? | signal-handler init failure | none | effectively unreachable in any real CLI invocation; covered by PerTestBoxHome::Drop's implicit guard + Rust's stack-unwinding guarantee |
| ⑥ non-zero exit | command exits non-zero | tables above | yes |
| ⑦ Ok(to_shell_exit_code(0)) | command exits 0 | test_run_exit_code_success + 30+ commands as side-effect | implicit |
| ⑧ panic | internal invariant violated | none | guaranteed by Rust: panic unwinds the stack, Drop runs; tests would exercise the panic, not its cleanup |

Two-sided verification for branch ③ (the only realistic injection point on a path where a shim is actually alive at the failure moment):

| step | code state | result |
| --- | --- | --- |
| A | RAII fix applied | test_run_exec_setup_failure_does_not_leak_shim ... ok (4.44s) |
| B | litebox.exec().await? replaced with match { Err => std::process::exit(1) } to inject the Drop-bypass pattern onto this branch | FAILED: exec-setup failure left live shim PID(s): [395393] |
| C | injection reverted | test_run_exec_setup_failure_does_not_leak_shim ... ok (5.22s) |

Why this works

Pre-fix, a non-zero command exit took the std::process::exit shortcut, which bypasses the box teardown that the success path runs on return, leaking the microVM's shim (the source of #604's "orphan shims in /tmp"). Post-fix, the command returns the exit code as Result<i32> and main returns ExitCode; the runtime drops on every return path, and RuntimeImpl::Drop → shutdown_sync SIGTERMs the shim, which is the same teardown the success path already relied on.
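
The mechanism can be illustrated with a small self-contained sketch. Everything here is a hypothetical stand-in: `Runtime` plays the role of the runtime whose Drop stops the shim, a shared flag plays the role of the live shim process, and `std::mem::forget` is used only as an in-process analogue of std::process::exit skipping destructors (calling the real thing would end the process).

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for the runtime whose Drop tears the box down.
struct Runtime {
    shim_alive: Rc<Cell<bool>>,
}

impl Runtime {
    fn start(shim_alive: Rc<Cell<bool>>) -> Self {
        shim_alive.set(true); // pretend a shim process was spawned
        Runtime { shim_alive }
    }
}

impl Drop for Runtime {
    fn drop(&mut self) {
        // Stands in for RuntimeImpl::Drop -> shutdown_sync stopping the shim.
        self.shim_alive.set(false);
    }
}

/// Post-fix shape: return the exit code instead of exiting mid-command, so
/// `_rt` is dropped on the `?` path, the non-zero path and the zero path.
fn run(shim_alive: Rc<Cell<bool>>, command_exit: Result<i32, String>) -> Result<i32, String> {
    let _rt = Runtime::start(shim_alive);
    let code = command_exit?; // early return: Drop still runs
    Ok(code) // zero or non-zero: Drop runs when `_rt` goes out of scope
}

/// Pre-fix analogue: the destructor never runs, so the "shim" stays alive.
/// mem::forget imitates process::exit here without killing the process.
fn run_skipping_drop(shim_alive: Rc<Cell<bool>>, code: i32) -> i32 {
    let rt = Runtime::start(shim_alive);
    std::mem::forget(rt);
    code
}
```

After `run(...)` the flag is false on every path; after `run_skipping_drop(...)` it is still true, which is the in-process equivalent of the leaked shim PID reported by side B.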

make fmt:check + cargo clippy -- -D warnings clean; pre-push CLI matrix 277 tests run: 277 passed, 0 skipped (~83s).

@G4614 G4614 marked this pull request as ready for review May 29, 2026 06:20
`boxlite run` propagated a non-zero command exit via std::process::exit,
which skips Drop and the box's async auto-stop/auto-remove — leaking the
box's shim as a live host process. The success path tears the box down via
normal teardown when run() returns, but the non-zero path never reached it.

Explicitly stop the box (kills the shim; removes it when --rm) before
std::process::exit. Fixes the abnormal-exit run integration tests
(exit_code_125/custom, signal_exit_code_{sigint,sigkill,sigterm},
python_error_handling) that tripped PerTestBoxHome's live-shim guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the fix/cli-run-rm-abnormal-exit branch from 94daf72 to 02df899 Compare June 1, 2026 04:22
Comment thread src/cli/src/commands/run.rs Outdated
let exit_code = streamer.start().await?;
// Exit with box's exit code
if exit_code != 0 {
// Tear the box down before hard-exiting. std::process::exit skips
Member

try to design RAII around this so execution and litebox will be dropped automatically in any branch

Contributor Author

@G4614 G4614 Jun 1, 2026

changed to RAII mode, also wrote test for each branch to confirm it, thx

Comment thread src/cli/src/commands/run.rs Outdated
// Drop and the async auto-stop/auto-remove, so a non-zero command
// exit would otherwise leak the box's shim as a live process (the
// success path stops the box via normal teardown when run returns).
drop(execution);
Member

write a test case to cover this issue

Contributor Author

@G4614 G4614 Jun 1, 2026

test_run_rm_non_zero_exit_does_not_leak_shim added for this, thx

@G4614 G4614 marked this pull request as draft June 1, 2026 08:04
G4614 and others added 4 commits June 1, 2026 08:52
…ess::exit

Address PR boxlite-ai#622 review (DorianZheng): redesign so execution/litebox/the
owning BoxliteRuntime drop on every return path instead of relying on a
manual stop call before std::process::exit. process::exit bypasses Drop
entirely, which is exactly what leaked the box's shim on the non-zero
path; the only true RAII fix is to never call it mid-command.

run::execute and exec::execute now return Result<i32> (the shell exit
code), main returns ExitCode, and run_cli funnels every command through
the same dispatcher. When run_cli returns, BoxliteRuntime drops, and
RuntimeImpl::Drop -> shutdown_sync() reaps the shim - the same teardown
the success path already relied on.

Adds an explicit reproducer: test_run_rm_non_zero_exit_does_not_leak_shim
scans <home>/boxes/*/shim.pid in the test body (not just via
PerTestBoxHome::Drop), so the assertion is visible at the call site.
Two-side verified: pre-fix simulation fails with "live shim PID(s): [..]",
post-fix passes; exposes test_utils::home::live_shim_pids as pub for that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… wrapper

PR boxlite-ai#622 review follow-up: address the aesthetic drift introduced by the
RAII fix (per-branch .map(|_| 0) noise and the ExitCode::from(... as u8)
cast in main).

- Every command's `pub async fn execute` (and `auth::run`) now returns
  `anyhow::Result<i32>`. Unit-success commands `Ok(0)` at the end; run/exec
  pass through the box's mapped shell exit code unchanged. Dispatcher in
  `run_cli` no longer needs `.map(|_| 0)` adapters.
- `main` is back to `fn main()`. The tokio runtime is dropped explicitly
  before `process::exit(code)` so the BoxliteRuntime Drop chain
  (RuntimeImpl::Drop -> shutdown_sync) has already finished by then — the
  hazard called out in boxlite-ai#622 was `process::exit` *mid-command*, not in
  `main` after every stack frame has unwound.

Verified: cargo check, 118 CLI unit tests, clippy -D warnings, fmt:check,
and the focused leak repro (test_run_rm_non_zero_exit_does_not_leak_shim,
ok 8.57s) all green.
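
The ordering argument above can be sketched as follows. This is an illustration under assumptions, not the code in main.rs: `AsyncRuntime` stands in for the tokio runtime, `BoxRuntime` for BoxliteRuntime, and a shared log records the order in which teardown happens.

```rust
use std::sync::{Arc, Mutex};

/// Shared event log used only to make the teardown order observable.
type Log = Arc<Mutex<Vec<&'static str>>>;

/// Hypothetical stand-in for the tokio runtime owned by main.
struct AsyncRuntime {
    log: Log,
}

impl Drop for AsyncRuntime {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("async runtime dropped");
    }
}

/// Hypothetical stand-in for BoxliteRuntime; its Drop represents the
/// RuntimeImpl::Drop -> shutdown_sync chain.
struct BoxRuntime {
    log: Log,
}

impl Drop for BoxRuntime {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("shim stopped");
    }
}

/// Stand-in dispatcher: owns the box runtime and returns a shell exit code.
fn run_cli(log: &Log) -> i32 {
    let _boxes = BoxRuntime { log: Arc::clone(log) };
    log.lock().unwrap().push("command finished");
    7 // `_boxes` is dropped here, before the caller sees the code
}

/// Everything main does before the final exit call.
fn exit_code_after_teardown(log: &Log) -> i32 {
    let rt = AsyncRuntime { log: Arc::clone(log) };
    let code = run_cli(log);
    drop(rt); // explicit, so nothing with a Drop impl is still alive
    log.lock().unwrap().push("safe to exit");
    code
}
```

In this sketch, main would reduce to `std::process::exit(exit_code_after_teardown(&log))`: by the time the exit call runs, every guard has already been dropped, so skipping destructors at that point loses nothing.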

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk back the scope-creep portion of 69df038. The shim-leak RAII fix only
needs run.rs / exec.rs / main.rs — making the 15 unit-success commands
also return Result<i32> was purely dispatcher cosmetics, and it had a
type-honesty cost: a command like `boxlite cp` that has no exit-code
concept ended up advertising one (always 0).

This commit:

- Restores `anyhow::Result<()>` + `Ok(())` on auth/cp/create/images/info/
  inspect/list/logs/pull/restart/rm/serve/start/stats/stop.
- Puts the 15 `.map(|_| 0)` adapters back in `run_cli`'s dispatcher,
  collocated so the asymmetry is visible at one site (`run`/`exec` real;
  others adapted).
- Keeps main.rs's `fn main() { ... drop(rt); process::exit(code); }`
  simplification — that part isn't scope creep, it's the RAII fix.
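
A rough sketch of the dispatcher asymmetry described in the list above. The `Command` enum, the handler bodies, and the `String` error type are hypothetical; the real code uses `anyhow::Result`, and only the `.map(|_| 0)` adapter pattern is taken from this commit.

```rust
/// Hypothetical subset of CLI commands, for illustration only.
enum Command {
    Run { exit: i32 },
    Exec { exit: i32 },
    Cp,
    Rm { exists: bool },
}

// run/exec have a real exit-code concept: they pass the box's code through.
fn run(exit: i32) -> Result<i32, String> {
    Ok(exit)
}

fn exec(exit: i32) -> Result<i32, String> {
    Ok(exit)
}

// Unit-success commands advertise no exit code at all.
fn cp() -> Result<(), String> {
    Ok(())
}

fn rm(exists: bool) -> Result<(), String> {
    if exists {
        Ok(())
    } else {
        Err("no such box".to_string())
    }
}

/// The adapters live in one place, so the asymmetry is visible at one site.
fn dispatch(cmd: Command) -> Result<i32, String> {
    match cmd {
        Command::Run { exit } => run(exit),
        Command::Exec { exit } => exec(exit),
        Command::Cp => cp().map(|_| 0),
        Command::Rm { exists } => rm(exists).map(|_| 0),
    }
}
```

Errors still propagate unchanged through the adapter, since `.map` only touches the `Ok` side; only successful unit commands are mapped to exit code 0.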

cargo check, 118 CLI unit tests, clippy -D warnings, fmt:check,
test_run_rm_non_zero_exit_does_not_leak_shim (ok 8.22s) all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rn paths

The focused boxlite-ai#622 reproducer (test_run_rm_non_zero_exit_does_not_leak_shim)
only exercises the std::process::exit branch that actually leaked. Add
two companion tests that pin the same RAII invariant onto the two other
CLI-reachable early-return points in BoxRunner::run:

- test_run_image_pull_failure_does_not_leak_shim (branch ②) — create_box?
  fails on a non-existent image; no shim ever spawns, but partial-VM
  state must drop cleanly.
- test_run_exec_setup_failure_does_not_leak_shim (branch ③) — litebox.exec?
  fails (invoking a directory) after the box is fully running; Drop has
  to reap the live shim.

Both pass under today's RAII fix (4.4 s each in parallel). These branches
are not affected by the original boxlite-ai#622 bug (they use `?` rather than
process::exit, so Drop always ran) — the value is forward-looking: if
anyone introduces a Drop-bypass shortcut on these paths later, the tests
fail.

The remaining two early-exits in BoxRunner::run — streamer.start?
(signal-handler init, effectively unreachable in normal CLI invocation)
and a panic mid-`run()` — are left to Rust's stack-unwinding guarantee
plus PerTestBoxHome::Drop's implicit guard, documented inline so the
choice is visible. No mock-injection tests for those.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as ready for review June 1, 2026 12:22