fix hangs in CI due to unbounded wait in Vmm::drop() caused by undelivered signals due to resource leak #5943
Draft
Manciukic wants to merge 4 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #5943 +/- ##
=======================================
Coverage 83.00% 83.00%
=======================================
Files 277 277
Lines 30106 30112 +6
=======================================
+ Hits 24989 24995 +6
Misses 5117 5117
Add debugging instrumentation to help diagnose intermittent unit test hangs in CI (test_build_and_boot_microvm hanging indefinitely):
- RUST_BACKTRACE=1: get a backtrace on panic
- --nocapture: stream test output in real time so BK logs show which test was running when the hang occurs
- timeout=540s: kill cargo test 60s before pytest's 600s timeout fires, ensuring clean failure reporting and artifact upload

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
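The timeout layering described in this commit can be illustrated with a minimal sketch. It assumes a plain `subprocess.run` call; `run_with_inner_timeout` and its parameters are hypothetical names, not the helper that tests/host_tools/cargo_build.py actually uses.

```python
import os
import subprocess


def run_with_inner_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd with a hard deadline; return its exit code, or None on timeout.

    Hypothetical helper for illustration only. Output is not captured, so it
    streams to the parent's stdout/stderr as the child produces it.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        # subprocess.run kills the direct child itself when the timeout expires.
        completed = subprocess.run(cmd, env=env, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

A call shaped like `run_with_inner_timeout(["cargo", "test", "--", "--nocapture"], 540, {"RUST_BACKTRACE": "1"})` would return None once 540s elapse, which leaves the outer 600s pytest timeout 60s to report the failure. The real cargo invocation probably carries more arguments than this simplified command line.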
conftest sets PR_SET_CHILD_SUBREAPER so daemonized descendants reparent to the pytest session, but the framework only waitpid()s the firecracker PID. Other helpers (screen, ssh, vhost-user backends, socat-forked cat) linger as zombies, and their queued signals stay charged against the RLIMIT_SIGPENDING pool until reaped, which can cause later signal sends to be dropped.

Add a per-test reap loop that drains exited descendants on teardown, ordered after the microVM is killed via a microvm_factory dependency.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
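The reap loop described in this commit can be sketched as a non-blocking `waitpid` drain. This is a minimal illustration, not the code added to tests/conftest.py: the function name `reap_exited_children` is hypothetical, and the sketch does not set PR_SET_CHILD_SUBREAPER itself, so it only sees processes whose parent is already the calling process.

```python
import os


def reap_exited_children():
    """Reap every child that has already exited, without blocking.

    Returns the list of reaped PIDs. Hypothetical helper for illustration.
    """
    reaped = []
    while True:
        try:
            # pid -1 means "any child"; WNOHANG returns immediately.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # The calling process has no children at all.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        reaped.append(pid)
    return reaped
```

In a pytest setup, a function like this would run in a fixture's teardown phase, after the fixture that kills the microVM, so that the processes it is meant to collect have already exited.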
VcpuHandle::drop join()ed the vCPU thread unconditionally, so a thread that never observed its Finish event would block teardown forever. Poll for exit with a 1s timeout and panic if exceeded, so teardown fails fast instead of hanging.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
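The poll-then-fail pattern described here can be illustrated in a few lines. The actual change is in Rust inside VcpuHandle::drop; the sketch below is only a Python analogy of the same idea, with `join_or_fail` as a hypothetical name and an exception standing in for the panic.

```python
import threading
import time


def join_or_fail(thread, timeout_s=1.0, poll_s=0.01):
    """Wait for thread to exit, raising TimeoutError instead of blocking forever.

    Hypothetical analogy of a bounded join; not the Firecracker implementation.
    """
    deadline = time.monotonic() + timeout_s
    while thread.is_alive():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"thread did not exit within {timeout_s}s")
        # Sleep between polls rather than spinning.
        time.sleep(poll_s)
    # The thread has already finished, so this join returns immediately.
    thread.join()
```

The sleep between polls is the part that matters for the follow-up commit: in the Rust version, that sleep is what introduces a new syscall on the shutdown path.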
The bounded join added to VcpuHandle::drop polls with thread::sleep, which issues clock_nanosleep on musl. VcpuHandle::drop runs on the vmm thread (via Vmm::drop -> shutdown_vcpus) under seccomp, so the syscall must be allowlisted or shutdown traps with SIGSYS.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
Problem

test_unittests was intermittently hanging in CI, hitting the 600s pytest timeout, most often during test_build_and_boot_microvm. Retrying usually passed, and it was worse on m5n.metal and on nightly.

Root cause

The hang is caused by exhaustion of the RLIMIT_SIGPENDING queued-signal pool:

- conftest.py sets PR_SET_CHILD_SUBREAPER, so daemonized descendants (firecracker after the jailer double-fork, plus helpers like screen, ssh, vhost-user backends, and the cat processes socat forks in the vsock tests) reparent to the pytest session.
- The framework only waitpid()s the firecracker PID, so every other descendant lingers as a zombie. A zombie's pending signals are not freed until it is reaped (release_task), so each one keeps its queued signal (e.g. the SIGTERM from teardown) charged against the RLIMIT_SIGPENDING pool.
- Once the pool is exhausted, later signal sends can be dropped, including the SIGRTMIN kick Firecracker uses to interrupt a vCPU thread. The vCPU never sees its Finish event, and VcpuHandle::drop blocked forever on join(), hanging teardown until the outer timeout.

This was confirmed on an affected agent: thousands of zombie cat/jailer/ssh processes, SigQ near the RLIMIT_SIGPENDING limit, and each zombie holding one queued signal.

Changes

- tests/conftest.py: reap orphaned descendants on teardown after each test, so zombies can't accumulate and saturate the signal pool. Ordered to run after the microVM is killed.
- src/vmm/src/vstate/vcpu.rs: bound the join() in VcpuHandle::drop with a 1s timeout so a vCPU thread that never observes its Finish event makes teardown fail fast instead of hanging.
- tests/host_tools/cargo_build.py (original diagnostic change, kept as it's independently useful): RUST_BACKTRACE=1, --nocapture, and a 540s cargo-test timeout so failures are reported cleanly before pytest's 600s timeout.

The earlier debug-only instrumentation commits (KVM_RUN/per-event logging, watchdog thread dump) have been dropped now that the root cause is understood.
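The SigQ observation from the root-cause analysis can be checked on a Linux host with a short sketch. It assumes the `SigQ:` line of /proc/<pid>/status has the `queued/limit` form documented in proc(5); `parse_sigq` and `sigq_usage` are hypothetical helper names, not part of this PR.

```python
def parse_sigq(status_text):
    """Return (queued, limit) parsed from the SigQ line of a status dump."""
    for line in status_text.splitlines():
        if line.startswith("SigQ:"):
            queued, limit = line.split(":", 1)[1].strip().split("/")
            return int(queued), int(limit)
    raise ValueError("no SigQ line found")


def sigq_usage(pid="self"):
    """Read SigQ for a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_sigq(status_file.read())
```

A first value that sits close to the second on a CI agent would match the saturation described under Root cause. The limit can also be cross-checked with `resource.getrlimit(resource.RLIMIT_SIGPENDING)` from the Python standard library.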