[Rebase][Bug] Diffusion Multi-GPU Layered/FLUX: teardown zombie process from TCPStore shutdown race #3811

@tzhouam

Summary

On the dev/vllm-align branch (commit d9838193), the CI step ":full_moon: Diffusion X2I · Function Test with H100 · Multi-GPU Layered/LongCat/FLUX" errors during teardown: the API server process becomes a zombie, and waiting on it exceeds the 5-second timeout.

ERROR tests/e2e/online_serving/test_flux_2_dev_expansion.py::test_flux_2_dev[parallel_cfg_2]
psutil.TimeoutExpired: timeout after 5 seconds (pid=9955)

The test itself likely passed; the error occurs in __exit__ → _kill_process_tree() during cleanup.

Root cause: TCPStore shutdown ordering race

The teardown timeline from the raw log:

| Timestamp | Event |
| --- | --- |
| 02:10:19 | API server (PID 9955) initiates shutdown. Orchestrator shuts down clients. |
| 02:10:19 | Worker 0 receives shutdown → event loop terminates → Shutdown complete |
| 02:10:20 | Worker 0 tears down process group → TCPStore destroyed |
| 02:10:22 | Worker 1 HeartbeatMonitor thread polls TCPStore → gets 0 bytes (connection closed) |
| 02:10:22 | Worker 1 then receives shutdown message and completes |
| 02:10:29 | Orchestrator warns: Orchestrator thread did not exit in time |
| Teardown | _kill_process_tree() → psutil.Process(9955).wait(timeout=5) → timeout |

[rank1]: Failed to check the "should dump" flag on TCPStore,
  (maybe TCPStore server has shut down too early), with error:
  Failed to recv, got 0 bytes. Connection was likely closed.
  Did the remote server shutdown or crash?

The HeartbeatMonitor (a C++ background thread in ProcessGroupNCCL) polls the TCPStore periodically. When Worker 0 destroys the process group (and thus the TCPStore) before Worker 1 has exited its heartbeat loop, Worker 1's recv() gets 0 bytes and the thread hangs. Since NCCL_ASYNC_ERROR_HANDLING is disabled in omni workers, the thread has no watchdog to break it out.
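The "got 0 bytes" symptom is ordinary socket behavior and can be reproduced without NCCL. The following is a minimal sketch, assuming a local socket pair as a stand-in for the worker-to-TCPStore connection; `recv_after_peer_close` is a hypothetical name, and nothing here is taken from ProcessGroupNCCL itself.

```python
import socket


def recv_after_peer_close() -> bytes:
    """Return what recv() yields once the other end of a connection is closed."""
    # socketpair() is an assumption standing in for the worker <-> TCPStore link.
    store_side, worker_side = socket.socketpair()
    store_side.close()  # the "store" end goes away first
    try:
        # A closed peer is reported as an empty read (0 bytes), not an exception.
        return worker_side.recv(16)
    finally:
        worker_side.close()


if __name__ == "__main__":
    print(recv_after_peer_close())
```

An empty read is the only signal the reader gets, so a polling loop has to treat it as "peer gone" explicitly and exit rather than retry.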

The parent process becomes a zombie: its main thread finished, but the hung heartbeat thread keeps it alive from the kernel's perspective, making pidfd_open()+poll() wait forever.
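The effect of a lingering thread on the waiter can be illustrated with a loose pure-Python analogy. This sketch assumes a non-daemon Python thread as a stand-in for the hung C++ heartbeat thread; `CHILD_SOURCE` and `wait_times_out` are hypothetical names, and the real failure involves psutil waiting on PID 9955 rather than `subprocess`.

```python
import subprocess
import sys
import textwrap

# Child program: the main thread returns at once, but a non-daemon thread
# blocks forever, standing in for the hung heartbeat thread.
CHILD_SOURCE = textwrap.dedent(
    """
    import threading
    threading.Thread(target=threading.Event().wait).start()
    """
)


def wait_times_out(timeout: float = 1.0) -> bool:
    """Return True if waiting on the child hits the timeout."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD_SOURCE])
    try:
        proc.wait(timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        proc.kill()  # force the exit the child would never reach on its own
        proc.wait()  # reap it so no zombie is left behind


if __name__ == "__main__":
    print(wait_times_out())
```

Because the child never exits on its own, a longer timeout alone would not help in this toy case; it only helps when the blocked thread eventually notices the broken connection.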

Steps to reproduce

pytest tests/e2e/online_serving/test_flux_2_dev_expansion.py -k "parallel_cfg_2" -v -m "full_model"

Requires 2x H100 GPUs. The failure may be intermittent depending on shutdown timing.

Expected behavior

Test teardown should complete cleanly without zombie processes or timeout errors.

Suggested fix

Increase teardown timeouts (tests/helpers/runtime.py)

Extend the wait timeouts from 5s to 10s to allow NCCL heartbeat threads time to detect the broken connection and exit:

# Line 343 — psutil.wait_procs timeout
gone, still_alive = psutil.wait_procs(children, timeout=10)  # was 5

# Line 355 — parent.wait timeout after kill
parent.wait(timeout=10)  # was 5

This matches the 10-second timeout already used in the benchmark script at tests/dfx/perf/scripts/run_diffusion_benchmark.py.
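For readers without the helper at hand, a terminate-then-kill routine with a configurable timeout might look roughly like the sketch below. It is not the code in tests/helpers/runtime.py: `kill_process_tree` and its `timeout` parameter are hypothetical, and only the psutil calls (`Process.children`, `wait_procs`, `terminate`, `kill`) are real API.

```python
import psutil


def kill_process_tree(pid: int, timeout: float = 10.0) -> list[int]:
    """Terminate pid and its descendants; return PIDs still alive afterwards.

    Hypothetical helper for illustration, not the project's implementation.
    """
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return []
    procs = parent.children(recursive=True) + [parent]

    for proc in procs:
        try:
            proc.terminate()
        except psutil.NoSuchProcess:
            pass
    # Grace period for a clean exit.
    _, alive = psutil.wait_procs(procs, timeout=timeout)

    for proc in alive:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass
    # Second wait: a slow exit after the kill is reported, not raised.
    _, alive = psutil.wait_procs(alive, timeout=timeout)
    return [proc.pid for proc in alive]
```

Returning the survivors instead of raising `psutil.TimeoutExpired` lets the caller decide whether a leftover process should fail the test or only be logged.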

(Optional) Add NCCL barrier before shutdown (diffusion_worker.py)

As a more robust long-term fix, add a barrier before destroy_distributed_env() in diffusion_worker.py to ensure all ranks synchronize before the TCPStore is destroyed:

def shutdown(self) -> None:
    """Shutdown the worker and cleanup distributed environment."""
    if torch.distributed.is_initialized():
        torch.distributed.barrier()  # sync all ranks before teardown
        torch.distributed.destroy_process_group()
    destroy_distributed_env()
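The ordering guarantee the barrier is meant to provide can be shown with a toy model. This sketch assumes threads in place of ranks and `threading.Barrier` in place of `torch.distributed.barrier()`; `ordered_teardown` and the event strings are hypothetical, and it does not show how the real HeartbeatMonitor behaves.

```python
import threading


def ordered_teardown(world_size: int = 2) -> list[str]:
    """Run a toy teardown and return the order of recorded events."""
    events: list[str] = []
    lock = threading.Lock()
    barrier = threading.Barrier(world_size)

    def worker(rank: int) -> None:
        with lock:
            events.append(f"rank{rank}: stopped polling")
        barrier.wait()  # nobody proceeds until every rank has stopped polling
        if rank == 0:
            with lock:
                events.append("rank0: store destroyed")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    print(ordered_teardown())
```

In this model the store is destroyed only after every rank has reached the barrier. Whether the real barrier also guarantees that each rank's heartbeat thread has stopped polling is an assumption that would need to be verified against PyTorch.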

Additional context

  • Buildkite: https://buildkite.com/vllm/vllm-omni-rebase/builds/1638
  • The same teardown issue can affect any multi-GPU diffusion test where ProcessGroupNCCL is used
  • NCCL_ASYNC_ERROR_HANDLING is explicitly disabled in omni workers (gpu_ar_worker.py:34, gpu_generation_worker.py:33) to allow memory snapshots before NCCL buffer allocation
  • This is a known PyTorch distributed pattern: HeartbeatMonitor thread lifetime must be bounded by the TCPStore lifetime

🤖 Generated with Claude Code
