fix(cli): stop box before hard-exit on non-zero run exit code#622
Open
G4614 wants to merge 5 commits into
Open
fix(cli): stop box before hard-exit on non-zero run exit code#622G4614 wants to merge 5 commits into
G4614 wants to merge 5 commits into
Conversation
`boxlite run` propagated a non-zero command exit via std::process::exit,
which skips Drop and the box's async auto-stop/auto-remove — leaking the
box's shim as a live host process. The success path tears the box down via
normal teardown when run() returns, but the non-zero path never reached it.
Explicitly stop the box (kills the shim; removes it when --rm) before
std::process::exit. Fixes the abnormal-exit run integration tests
(exit_code_125/custom, signal_exit_code_{sigint,sigkill,sigterm},
python_error_handling) that tripped PerTestBoxHome's live-shim guard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
94daf72 to
02df899
Compare
DorianZheng
reviewed
Jun 1, 2026
| let exit_code = streamer.start().await?; | ||
| // Exit with box's exit code | ||
| if exit_code != 0 { | ||
| // Tear the box down before hard-exiting. std::process::exit skips |
Member
There was a problem hiding this comment.
try to design RAII around this so we execution and litebox will be dropped automatically in any branch
Contributor
Author
There was a problem hiding this comment.
changed to RAII mode, also wrote test for each branch to confirm it, thx
DorianZheng
reviewed
Jun 1, 2026
| // Drop and the async auto-stop/auto-remove, so a non-zero command | ||
| // exit would otherwise leak the box's shim as a live process (the | ||
| // success path stops the box via normal teardown when run returns). | ||
| drop(execution); |
Member
There was a problem hiding this comment.
write a test case to cover this issue
Contributor
Author
There was a problem hiding this comment.
test_run_rm_non_zero_exit_does_not_leak_shim added for this, thx
…ess::exit Address PR boxlite-ai#622 review (DorianZheng): redesign so execution/litebox/the owning BoxliteRuntime drop on every return path instead of relying on a manual stop call before std::process::exit. process::exit bypasses Drop entirely, which is exactly what leaked the box's shim on the non-zero path; the only true RAII fix is to never call it mid-command. run::execute and exec::execute now return Result<i32> (the shell exit code), main returns ExitCode, and run_cli funnels every command through the same dispatcher. When run_cli returns, BoxliteRuntime drops, and RuntimeImpl::Drop -> shutdown_sync() reaps the shim - the same teardown the success path already relied on. Adds an explicit reproducer: test_run_rm_non_zero_exit_does_not_leak_shim scans <home>/boxes/*/shim.pid in the test body (not just via PerTestBoxHome::Drop), so the assertion is visible at the call site. Two-side verified: pre-fix simulation fails with "live shim PID(s): [..]", post-fix passes; exposes test_utils::home::live_shim_pids as pub for that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… wrapper PR boxlite-ai#622 review follow-up: address the aesthetic drift introduced by the RAII fix (per-branch .map(|_| 0) noise and the ExitCode::from(... as u8) cast in main). - Every command's `pub async fn execute` (and `auth::run`) now returns `anyhow::Result<i32>`. Unit-success commands `Ok(0)` at the end; run/exec pass through the box's mapped shell exit code unchanged. Dispatcher in `run_cli` no longer needs `.map(|_| 0)` adapters. - `main` is back to `fn main()`. The tokio runtime is dropped explicitly before `process::exit(code)` so the BoxliteRuntime Drop chain (RuntimeImpl::Drop -> shutdown_sync) has already finished by then — the hazard called out in boxlite-ai#622 was `process::exit` *mid-command*, not in `main` after every stack frame has unwound. Verified: cargo check, 118 CLI unit tests, clippy -D warnings, fmt:check, and the focused leak repro (test_run_rm_non_zero_exit_does_not_leak_shim, ok 8.57s) all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk back the scope-creep portion of 69df038. The shim-leak RAII fix only needs run.rs / exec.rs / main.rs — making the 15 unit-success commands also return Result<i32> was purely dispatcher cosmetics, and it had a type-honesty cost: a command like `boxlite cp` that has no exit-code concept ended up advertising one (always 0). This commit: - Restores `anyhow::Result<()>` + `Ok(())` on auth/cp/create/images/info/ inspect/list/logs/pull/restart/rm/serve/start/stats/stop. - Puts the 15 `.map(|_| 0)` adapters back in `run_cli`'s dispatcher, collocated so the asymmetry is visible at one site (`run`/`exec` real; others adapted). - Keeps main.rs's `fn main() { ... drop(rt); process::exit(code); }` simplification — that part isn't scope creep, it's the RAII fix. cargo check, 118 CLI unit tests, clippy -D warnings, fmt:check, test_run_rm_non_zero_exit_does_not_leak_shim (ok 8.22s) all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rn paths The focused boxlite-ai#622 reproducer (test_run_rm_non_zero_exit_does_not_leak_shim) only exercises the std::process::exit branch that actually leaked. Add two companion tests that pin the same RAII invariant onto the two other CLI-reachable early-return points in BoxRunner::run: - test_run_image_pull_failure_does_not_leak_shim (branch ②) — create_box? fails on a non-existent image; no shim ever spawns, but partial-VM state must drop cleanly. - test_run_exec_setup_failure_does_not_leak_shim (branch ③) — litebox.exec? fails (invoking a directory) after the box is fully running; Drop has to reap the live shim. Both pass under today's RAII fix (4.4 s each in parallel). These branches are not affected by the original boxlite-ai#622 bug (they use `?` rather than process::exit, so Drop always ran) — the value is forward-looking: if anyone introduces a Drop-bypass shortcut on these paths later, the tests fail. The remaining two early-exits in BoxRunner::run — streamer.start? (signal-handler init, effectively unreachable in normal CLI invocation) and a panic mid-`run()` — are left to Rust's stack-unwinding guarantee plus PerTestBoxHome::Drop's implicit guard, documented inline so the choice is visible. No mock-injection tests for those. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
when the box killed badly (non-zero return code), the shim process should be released by RAII
appeared because of #604
Test plan
Existing exit-code tests (branch ⑥: non-zero exit)
The abnormal-exit
boxlite runintegration tests, each verified two-sided (run.rsreverted tostd::process::exitvs. RAII applied) on this branch. Side B's leak surfaces viaPerTestBoxHome::Droppanicking withlive shim(s): [pid]— thestd::process::exitshortcut bypassesRuntimeImpl::Drop→shutdown_sync.std::process::exit)test_run_exit_code_125live shim(s): [320092]test_run_exit_code_customlive shim(s)test_run_signal_exit_code_sigtermlive shim(s): [323543]test_run_signal_exit_code_sigkilllive shim(s)test_run_signal_exit_code_sigintlive shim(s)test_run_python_error_handlinglive shim(s): [326872]test_run_exit_code_success(control)Focused reproducer for branch ⑥ (added on review)
test_run_rm_non_zero_exit_does_not_leak_shimruns the same scenario (run --rm alpine:latest sh -c 'exit 7') but scans<home>/boxes/*/shim.pidfor live PIDs in the test body — so the no-leak assertion is visible at the call site instead of buried inPerTestBoxHome::Drop's panic.test_run_rm_non_zero_exit_does_not_leak_shim ... ok(5.79s)run.rs:93reverted tostd::process::exit(to_shell_exit_code(exit_code))non-zero boxlite run --rm left live shim PID(s): [286477]test_run_rm_non_zero_exit_does_not_leak_shim ... ok(6.06s)Other early-return branches in
BoxRunner::runReviewer's request was "execution and litebox dropped automatically in any branch". The remaining CLI-reachable early-returns each get their own no-leak test:
validate_flags?--ttywith non-TTY stdintest_run_tty_error_in_pipe(existing)create_box?test_run_image_pull_failure_does_not_leak_shimlitebox.exec?/etc(a directory)test_run_exec_setup_failure_does_not_leak_shimdetach return Ok(0)-dflagtest_run_detach(existing, manually rm's the box)streamer.start?PerTestBoxHome::Drop's implicit guard + Rust's stack-unwinding guaranteeOk(to_shell_exit_code(0))test_run_exit_code_success+ 30+ commands as side-effectTwo-sided verification for branch ③ (the only realistic injection point on a path where a shim is actually alive at the failure moment):
test_run_exec_setup_failure_does_not_leak_shim ... ok(4.44s)litebox.exec().await?replaced withmatch { Err => std::process::exit(1) }to inject the Drop-bypass pattern onto this branchexec-setup failure left live shim PID(s): [395393]test_run_exec_setup_failure_does_not_leak_shim ... ok(5.22s)Why this works
Pre-fix, a non-zero command exit took the
std::process::exitshortcut that bypasses the box teardown the success path runs on return, leaking the microVM's shim (the source of #604's "orphan shims in /tmp"). Post-fix returns the exit code asResult<i32>and letsmainreturnExitCode; the runtime drops on every return path andRuntimeImpl::Drop→shutdown_syncSIGTERMs the shim — the same teardown the success path already relied on.make fmt:check+cargo clippy -- -D warningsclean; pre-push CLI matrix277 tests run: 277 passed, 0 skipped(~83s).