Summary
On the dev/vllm-align branch (commit d9838193), the CI job ":full_moon: Diffusion X2I · Function Test with H100 · Multi-GPU Layered/LongCat/FLUX" errors during teardown because the API server process becomes a zombie and cannot be waited on within the 5-second timeout.
ERROR tests/e2e/online_serving/test_flux_2_dev_expansion.py::test_flux_2_dev[parallel_cfg_2]
psutil.TimeoutExpired: timeout after 5 seconds (pid=9955)
The test itself likely passed; the error is raised in __exit__ → _kill_process_tree() during cleanup.
Root cause: TCPStore shutdown ordering race
The teardown timeline from the raw log:
| Timestamp | Event |
|---|---|
| 02:10:19 | API server (PID 9955) initiates shutdown. Orchestrator shuts down clients. |
| 02:10:19 | Worker 0 receives shutdown → event loop terminates → Shutdown complete |
| 02:10:20 | Worker 0 tears down process group → TCPStore destroyed |
| 02:10:22 | Worker 1 HeartbeatMonitor thread polls TCPStore → gets 0 bytes (connection closed) |
| 02:10:22 | Worker 1 then receives shutdown message and completes |
| 02:10:29 | Orchestrator warns: Orchestrator thread did not exit in time |
| Teardown | _kill_process_tree() → psutil.Process(9955).wait(timeout=5) → timeout |
[rank1]: Failed to check the "should dump" flag on TCPStore,
(maybe TCPStore server has shut down too early), with error:
Failed to recv, got 0 bytes. Connection was likely closed.
Did the remote server shutdown or crash?
The HeartbeatMonitor (a C++ background thread in ProcessGroupNCCL) polls the TCPStore periodically. When Worker 0 destroys the process group (and thus the TCPStore) before Worker 1 has exited its heartbeat loop, Worker 1's recv() gets 0 bytes and the thread hangs. Since NCCL_ASYNC_ERROR_HANDLING is disabled in omni workers, the thread has no watchdog to break it out.
The parent process becomes a zombie: its main thread finished, but the hung heartbeat thread keeps it alive from the kernel's perspective, making pidfd_open()+poll() wait forever.
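The "got 0 bytes" symptom in the log is what any stream-socket client sees when its peer closes the connection first. The following is a minimal sketch of that ordering using plain Python sockets rather than TCPStore itself; the names store_side and worker_side are illustrative and do not come from the codebase.

```python
import socket

# socketpair() returns two connected endpoints. Treat one as the store
# server and the other as a worker's polling connection (illustrative only).
store_side, worker_side = socket.socketpair()

# The "store" is torn down first, as Worker 0 does in the timeline.
store_side.close()

# The "worker" polls afterwards: recv() returns an empty bytes object,
# which signals an orderly close by the peer rather than real data.
data = worker_side.recv(16)
peer_closed = (data == b"")
print(f"received {len(data)} bytes, peer_closed={peer_closed}")

worker_side.close()
```

A poller that does not treat an empty read as "peer is gone, stop polling" has no clean exit path, which is consistent with the behavior this report attributes to the heartbeat thread.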
Steps to reproduce
pytest tests/e2e/online_serving/test_flux_2_dev_expansion.py -k "parallel_cfg_2" -v -m "full_model"
Requires 2x H100 GPUs. The failure may be intermittent depending on shutdown timing.
Expected behavior
Test teardown should complete cleanly without zombie processes or timeout errors.
Suggested fix
Increase teardown timeouts (tests/helpers/runtime.py)
Extend the wait timeouts from 5s to 10s to allow NCCL heartbeat threads time to detect the broken connection and exit:
# Line 343 — psutil.wait_procs timeout
gone, still_alive = psutil.wait_procs(children, timeout=10) # was 5
# Line 355 — parent.wait timeout after kill
parent.wait(timeout=10) # was 5
This matches the 10-second timeout already used in the benchmark script at tests/dfx/perf/scripts/run_diffusion_benchmark.py.
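For readers unfamiliar with the terminate-wait-kill pattern that these timeouts belong to, here is a minimal standard-library sketch. It assumes a single child process and uses subprocess instead of psutil; stop_process and grace_s are hypothetical names for illustration and are not the project's _kill_process_tree().

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_s: float = 10.0) -> int:
    """Terminate proc, escalating to kill if it outlives the grace period.

    Hypothetical helper for illustration only.
    """
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # forceful stop (SIGKILL on POSIX)
        return proc.wait(timeout=grace_s)


# Stand-in for the API server: a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
return_code = stop_process(child, grace_s=10.0)
print("child exited with return code", return_code)
```

A longer grace period only helps if the process does eventually exit by itself; if a thread is stuck indefinitely, the second wait after kill is the step that has to succeed.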
(Optional) Add NCCL barrier before shutdown (diffusion_worker.py)
As a more robust long-term fix, add a barrier before destroy_distributed_env() in diffusion_worker.py to ensure all ranks synchronize before the TCPStore is destroyed:
def shutdown(self) -> None:
    """Shutdown the worker and cleanup distributed environment."""
    if torch.distributed.is_initialized():
        torch.distributed.barrier()  # sync all ranks before teardown
        torch.distributed.destroy_process_group()
    destroy_distributed_env()
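The ordering guarantee that the barrier is meant to provide can be illustrated without GPUs. The sketch below is a thread-based analogy, not torch.distributed code: a dict stands in for the TCPStore, threading.Barrier stands in for torch.distributed.barrier(), and all names are hypothetical.

```python
import threading

NUM_RANKS = 2
barrier = threading.Barrier(NUM_RANKS)
store = {"alive": True}  # stand-in for the shared TCPStore
observed = []            # what each rank saw on its final poll
lock = threading.Lock()


def worker(rank: int) -> None:
    # Final poll of the shared store, as a heartbeat loop would do.
    with lock:
        observed.append((rank, store["alive"]))
    # Every rank waits here until all ranks have finished polling.
    barrier.wait(timeout=5)
    # Only after the barrier does the store owner (rank 0) tear it down.
    if rank == 0:
        store["alive"] = False


threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(observed))  # → [(0, True), (1, True)]
```

Because rank 0 cannot pass the barrier until rank 1 has completed its last poll, no rank ever observes a destroyed store. The analogy also shows the trade-off: if one rank never reaches the barrier, the others block until the timeout, so a shutdown barrier in the real worker may need a bounded wait.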
Additional context
- Buildkite: https://buildkite.com/vllm/vllm-omni-rebase/builds/1638
- The same teardown issue can affect any multi-GPU diffusion test where ProcessGroupNCCL is used
- NCCL_ASYNC_ERROR_HANDLING is explicitly disabled in omni workers (gpu_ar_worker.py:34, gpu_generation_worker.py:33) to allow memory snapshots before NCCL buffer allocation
- This is a known PyTorch distributed pattern: HeartbeatMonitor thread lifetime must be bounded by the TCPStore lifetime
🤖 Generated with Claude Code