
fix hangs in CI due to unbounded wait in Vmm::drop() caused by undelivered signals due to resource leak #5943

Draft
Manciukic wants to merge 4 commits into firecracker-microvm:main from Manciukic:debug-unittest-timeout

@Manciukic Manciukic commented Jun 9, 2026

Contributor

Problem

test_unittests was intermittently hanging in CI and hitting the 600s pytest timeout, most often during test_build_and_boot_microvm. A retry usually passed. The hang was more frequent on m5n.metal and in nightly runs.

Root cause

The hang is caused by exhaustion of the per-user pool of queued signals, which is capped by RLIMIT_SIGPENDING:

  • conftest.py sets PR_SET_CHILD_SUBREAPER, so daemonized descendants (firecracker after the jailer double-fork, plus helpers like screen, ssh, vhost-user backends, and the cat processes socat forks in the vsock tests) reparent to the pytest session.
  • The framework only waitpid()s the firecracker PID, so every other descendant lingers as a zombie. A zombie's pending signals are not freed until it is reaped (release_task), so each one keeps its queued signal (e.g. the SIGTERM from teardown) charged against the RLIMIT_SIGPENDING pool.
  • As zombies accumulate, the pool saturates. Once it is full, the kernel silently drops further real-time signal sends, including the SIGRTMIN kick Firecracker uses to interrupt a vCPU thread. The vCPU never sees its Finish event, and VcpuHandle::drop blocks forever on join(), hanging teardown until the outer timeout.

This was confirmed on an affected agent: thousands of zombie cat/jailer/ssh processes, SigQ near the RLIMIT_SIGPENDING limit, and each zombie holding one queued signal.
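
A check like the one described above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used on the affected agent: the helper names (parse_sigq, is_zombie, read_status) are hypothetical, and it only relies on the SigQ and State lines that Linux exposes in /proc/<pid>/status.

```python
from pathlib import Path


def parse_sigq(status_text):
    """Return (queued, limit) from the SigQ line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("SigQ:"):
            queued, limit = line.split(":", 1)[1].strip().split("/")
            return int(queued), int(limit)
    raise ValueError("no SigQ line found")


def is_zombie(status_text):
    """True if the State line reports state code 'Z' (zombie)."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip().startswith("Z")
    return False


def read_status(pid="self"):
    """Read the raw status text for a PID (Linux only; hypothetical helper)."""
    return Path(f"/proc/{pid}/status").read_text()
```

On a Linux host, `parse_sigq(read_status())` gives the queued-signal count and the limit for the current user, which can be compared over time to see whether the pool is filling up.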

Changes

  • tests/conftest.py: reap orphaned descendants on teardown after each test, so zombies can't accumulate and saturate the signal pool. Ordered to run after the microVM is killed.
  • src/vmm/src/vstate/vcpu.rs: bound the join() in VcpuHandle::drop with a 1s timeout so a vCPU thread that never observes its Finish event makes teardown fail fast instead of hanging.
  • tests/host_tools/cargo_build.py (original diagnostic change, kept as it's independently useful): RUST_BACKTRACE=1, --nocapture, and a 540s cargo-test timeout so failures are reported cleanly before pytest's 600s timeout.
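
The reap step in the first item can be sketched as a non-blocking waitpid loop. This is an illustration only, assuming the process is a child subreaper so that orphaned descendants show up as its children; the function name reap_exited_children and its parameters are hypothetical and not taken from the actual conftest.py change.

```python
import os
import time


def reap_exited_children(timeout=0.0, poll_interval=0.05):
    """Reap every child that has exited, waiting up to `timeout` seconds
    for children that are still running. Returns the reaped PIDs."""
    reaped = []
    deadline = time.monotonic() + timeout
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid != 0:
            reaped.append(pid)  # a zombie was collected; look for more
            continue
        # pid == 0: children exist, but none has exited yet
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_interval)
    return reaped
```

Note that waitpid(-1) collects any child, including ones still tracked by subprocess.Popen objects, so in a real fixture a loop like this would presumably need to run after the framework has finished waiting on the processes it manages.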

The earlier debug-only instrumentation commits (KVM_RUN/per-event logging, watchdog thread dump) have been dropped now that the root cause is understood.

@Manciukic Manciukic force-pushed the debug-unittest-timeout branch 2 times, most recently from 32a37cc to 9513d01 on June 9, 2026 13:54
@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.00%. Comparing base (d300fd2) to head (4939608).
⚠️ Report is 14 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5943   +/-   ##
=======================================
  Coverage   83.00%   83.00%           
=======================================
  Files         277      277           
  Lines       30106    30112    +6     
=======================================
+ Hits        24989    24995    +6     
  Misses       5117     5117           
Flag Coverage Δ
5.10-m5n.metal 83.30% <100.00%> (+<0.01%) ⬆️
5.10-m6a.metal 82.65% <100.00%> (+<0.01%) ⬆️
5.10-m6g.metal 79.94% <100.00%> (+<0.01%) ⬆️
5.10-m6i.metal 83.31% <100.00%> (+<0.01%) ⬆️
5.10-m7a.metal-48xl 82.64% <100.00%> (+<0.01%) ⬆️
5.10-m7g.metal 79.94% <100.00%> (+<0.01%) ⬆️
5.10-m7i.metal-24xl 83.28% <100.00%> (+<0.01%) ⬆️
5.10-m7i.metal-48xl 83.28% <100.00%> (+<0.01%) ⬆️
5.10-m8g.metal-24xl 79.94% <100.00%> (+<0.01%) ⬆️
5.10-m8g.metal-48xl 79.94% <100.00%> (+<0.01%) ⬆️
5.10-m8i.metal-48xl 83.28% <100.00%> (+<0.01%) ⬆️
5.10-m8i.metal-96xl 83.28% <100.00%> (+<0.01%) ⬆️
6.1-m5n.metal 83.34% <100.00%> (+<0.01%) ⬆️
6.1-m6a.metal 82.68% <100.00%> (+<0.01%) ⬆️
6.1-m6g.metal 79.94% <100.00%> (-0.01%) ⬇️
6.1-m6i.metal 83.33% <100.00%> (+<0.01%) ⬆️
6.1-m7a.metal-48xl 82.67% <100.00%> (+0.01%) ⬆️
6.1-m7g.metal 79.94% <100.00%> (+<0.01%) ⬆️
6.1-m7i.metal-24xl 83.34% <100.00%> (ø)
6.1-m7i.metal-48xl 83.34% <100.00%> (+<0.01%) ⬆️
6.1-m8g.metal-24xl 79.94% <100.00%> (+<0.01%) ⬆️
6.1-m8g.metal-48xl 79.94% <100.00%> (+<0.01%) ⬆️
6.1-m8i.metal-48xl 83.34% <100.00%> (+<0.01%) ⬆️
6.1-m8i.metal-96xl 83.35% <100.00%> (+<0.01%) ⬆️
6.18-m5n.metal 83.34% <100.00%> (+0.01%) ⬆️
6.18-m6a.metal 82.68% <100.00%> (+<0.01%) ⬆️
6.18-m6g.metal 79.94% <100.00%> (+<0.01%) ⬆️
6.18-m6i.metal 83.33% <100.00%> (+<0.01%) ⬆️
6.18-m7a.metal-48xl 82.67% <100.00%> (+<0.01%) ⬆️
6.18-m7g.metal 79.94% <100.00%> (+<0.01%) ⬆️
6.18-m7i.metal-24xl 83.34% <100.00%> (+<0.01%) ⬆️
6.18-m7i.metal-48xl 83.34% <100.00%> (ø)
6.18-m8g.metal-24xl 79.94% <100.00%> (+<0.01%) ⬆️
6.18-m8g.metal-48xl 79.94% <100.00%> (+<0.01%) ⬆️
6.18-m8i.metal-48xl 83.35% <100.00%> (+<0.01%) ⬆️
6.18-m8i.metal-96xl 83.35% <100.00%> (+<0.01%) ⬆️


@Manciukic Manciukic force-pushed the debug-unittest-timeout branch 12 times, most recently from afb28a0 to 01374e2 on June 11, 2026 17:11
@Manciukic Manciukic changed the title from "fix(tests): add timeout, nocapture, and full backtrace to cargo unit tests" to "fix(tests): reap orphaned descendants to prevent vCPU kick signal loss" on Jun 11, 2026
Add debugging instrumentation to help diagnose intermittent unit test
hangs in CI (test_build_and_boot_microvm hanging indefinitely):

- RUST_BACKTRACE=1: get a backtrace on panic
- --nocapture: stream test output in real time so BK logs show which
  test was running when the hang occurs
- timeout=540s: kill cargo test 60s before pytest's 600s timeout fires,
  ensuring clean failure reporting and artifact upload

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
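
The inner timeout described in this commit message could look roughly like the sketch below. It is an assumption-laden illustration: run_with_timeout is a hypothetical helper, not the function used in tests/host_tools/cargo_build.py, and it relies only on the standard subprocess.run timeout behavior.

```python
import os
import subprocess


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run `cmd`, returning its exit code, or None if it exceeded `timeout_s`.

    subprocess.run kills the direct child when the timeout expires; any
    grandchildren it spawned are not covered by this sketch.
    """
    env = dict(os.environ, **(extra_env or {}))
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode


# Hypothetical usage mirroring the settings listed above:
# run_with_timeout(["cargo", "test", "--", "--nocapture"], 540,
#                  {"RUST_BACKTRACE": "1"})
```

Keeping the inner limit (540s) below the outer pytest limit (600s) leaves time for the failure to be reported before the outer timeout fires.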
@Manciukic Manciukic force-pushed the debug-unittest-timeout branch from 01374e2 to 1a24b3d on June 11, 2026 17:18
@Manciukic Manciukic changed the title from "fix(tests): reap orphaned descendants to prevent vCPU kick signal loss" to "fix hangs in CI due to unbounded wait in Vmm::drop() caused by undelivered signals due to resource leak" on Jun 11, 2026
conftest sets PR_SET_CHILD_SUBREAPER so daemonized descendants reparent
to the pytest session, but the framework only waitpid()s the firecracker
PID. Other helpers (screen, ssh, vhost-user backends, socat-forked cat)
linger as zombies, and their queued signals stay charged against the
RLIMIT_SIGPENDING pool until reaped, which can cause later signal sends
to be dropped.

Add a per-test reap loop that drains exited descendants on teardown,
ordered after the microVM is killed via a microvm_factory dependency.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
VcpuHandle::drop join()ed the vCPU thread unconditionally, so a thread
that never observed its Finish event would block teardown forever. Poll
for exit with a 1s timeout and panic if exceeded, so teardown fails fast
instead of hanging.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
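
The bounded-join idea in this commit message can be illustrated with a Python analogue. The actual change is in Rust (src/vmm/src/vstate/vcpu.rs) and polls for thread exit; the sketch below only shows the same fail-fast pattern using threading, and join_or_fail is a hypothetical name.

```python
import threading


def join_or_fail(thread, timeout_s=1.0):
    """Wait up to `timeout_s` for `thread` to exit; raise instead of hanging."""
    thread.join(timeout=timeout_s)
    if thread.is_alive():
        # The thread never observed its stop request: fail fast.
        raise TimeoutError(
            f"thread {thread.name!r} did not exit within {timeout_s}s"
        )
```

The trade-off is that a stuck thread now surfaces as an immediate error at teardown rather than as an opaque hang that is only caught by an outer timeout.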
The bounded join added to VcpuHandle::drop polls with thread::sleep,
which issues clock_nanosleep on musl. VcpuHandle::drop runs on the vmm
thread (via Vmm::drop -> shutdown_vcpus) under seccomp, so the syscall
must be allowlisted or shutdown traps with SIGSYS.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
@Manciukic Manciukic force-pushed the debug-unittest-timeout branch from 0bdb6be to 4939608 on June 11, 2026 18:08