[Rebase][Bug] Diffusion Multi-GPU Layered/FLUX: teardown zombie process from TCPStore shutdown race #3811

## Summary

On the `dev/vllm-align` branch (commit `d9838193`), the `:full_moon: Diffusion X2I · Function Test with H100 · Multi-GPU Layered/LongCat/FLUX` job errors during teardown: the API server process becomes a zombie, and waiting on it exceeds the 5-second timeout.

```
ERROR tests/e2e/online_serving/test_flux_2_dev_expansion.py::test_flux_2_dev[parallel_cfg_2]
psutil.TimeoutExpired: timeout after 5 seconds (pid=9955)
```

The test itself likely passed; the error is raised during cleanup, in `__exit__` → `_kill_process_tree()`.

### Root cause: TCPStore shutdown ordering race

The teardown timeline from the raw log:

| Timestamp | Event |
|-----------|-------|
| `02:10:19` | API server (PID 9955) initiates shutdown. Orchestrator shuts down clients. |
| `02:10:19` | Worker 0 receives shutdown → event loop terminates → `Shutdown complete` |
| `02:10:20` | Worker 0 tears down process group → **TCPStore destroyed** |
| `02:10:22` | Worker 1 `HeartbeatMonitor` thread polls TCPStore → **gets 0 bytes** (connection closed) |
| `02:10:22` | Worker 1 *then* receives shutdown message and completes |
| `02:10:29` | Orchestrator warns: `Orchestrator thread did not exit in time` |
| Teardown | `_kill_process_tree()` → `psutil.Process(9955).wait(timeout=5)` → **timeout** |

```
[rank1]: Failed to check the "should dump" flag on TCPStore,
  (maybe TCPStore server has shut down too early), with error:
  Failed to recv, got 0 bytes. Connection was likely closed.
  Did the remote server shutdown or crash?
```

The `HeartbeatMonitor` (a C++ background thread in `ProcessGroupNCCL`) polls the TCPStore periodically. When Worker 0 destroys the process group (and thus the TCPStore) before Worker 1 has exited its heartbeat loop, Worker 1's `recv()` gets 0 bytes and the thread hangs. Since `NCCL_ASYNC_ERROR_HANDLING` is disabled in omni workers, the thread has no watchdog to break it out.

The parent process then appears as a **zombie**: its main thread has finished, but the hung heartbeat thread keeps the process alive from the kernel's perspective, so a `pidfd_open()` + `poll()` wait on it never completes.
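The "got 0 bytes" condition in the log is ordinary socket end-of-file: once the peer closes its end, `recv()` returns an empty result instead of blocking. The sketch below reproduces only that condition with an in-process socket pair. `poll_store_once` and `demo` are hypothetical names for illustration; this is not the PyTorch `HeartbeatMonitor` code, and it assumes nothing about how that thread reacts to EOF.

```python
import socket


def poll_store_once(sock: socket.socket) -> str:
    """Hypothetical stand-in for one heartbeat poll (not the PyTorch implementation)."""
    data = sock.recv(1)
    if data == b"":
        # A zero-length read means the peer closed the connection (EOF).
        # A poller should treat this as terminal rather than retry.
        return "store-closed"
    return "ok"


def demo() -> list[str]:
    # socketpair() returns two connected endpoints: a fake "store" and a "client".
    store, client = socket.socketpair()
    results = []

    store.sendall(b"\x01")                   # store is alive and answers the poll
    results.append(poll_store_once(client))

    store.close()                            # store torn down before the client stops polling
    results.append(poll_store_once(client))

    client.close()
    return results
```

Calling `demo()` polls once while the fake store is open and once after it has been closed, which mirrors the ordering in the timeline above.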

## Steps to reproduce

```bash
pytest tests/e2e/online_serving/test_flux_2_dev_expansion.py -k "parallel_cfg_2" -v -m "full_model"
```

Requires 2x H100 GPUs. The failure may be intermittent depending on shutdown timing.

## Expected behavior

Test teardown should complete cleanly without zombie processes or timeout errors.

## Suggested fix

### Increase teardown timeouts (`tests/helpers/runtime.py`)

Extend the wait timeouts from 5s to 10s to allow NCCL heartbeat threads time to detect the broken connection and exit:

```python
# Line 343 — psutil.wait_procs timeout
gone, still_alive = psutil.wait_procs(children, timeout=10)  # was 5

# Line 355 — parent.wait timeout after kill
parent.wait(timeout=10)  # was 5
```

This matches the 10-second timeout already used in the benchmark script at `tests/dfx/perf/scripts/run_diffusion_benchmark.py`.
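The bounded wait-then-escalate pattern behind these timeouts can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `stop_process` and `demo` are hypothetical names, and the project's helper uses `psutil` rather than `subprocess`, so the real code in `tests/helpers/runtime.py` will differ.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace: float = 10.0) -> int:
    """Ask proc to terminate; escalate to kill if it is still alive after `grace` seconds."""
    proc.terminate()
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # Graceful shutdown took too long; force the process down and wait again.
        proc.kill()
        return proc.wait(timeout=grace)


def demo() -> int:
    # Spawn a child that would otherwise sleep for a minute, then stop it.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    return stop_process(child, grace=10.0)
```

A longer grace period only helps when the process eventually exits on its own. If a thread never returns, the second `wait` can still time out, which is why the barrier approach below is proposed as the more robust fix.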

### (Optional) Add NCCL barrier before shutdown (`diffusion_worker.py`)

As a more robust long-term fix, synchronize all ranks with a barrier in `diffusion_worker.py` before the process group and distributed environment are torn down, so that no rank destroys the TCPStore while another rank is still polling it:

```python
import torch.distributed

# Method on the worker class in diffusion_worker.py
def shutdown(self) -> None:
    """Shut down the worker and clean up the distributed environment."""
    if torch.distributed.is_initialized():
        # Sync all ranks so none destroys the TCPStore while another still polls it.
        torch.distributed.barrier()
        torch.distributed.destroy_process_group()
    destroy_distributed_env()  # existing project helper
```
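The ordering guarantee a barrier provides can be illustrated without GPUs. The sketch below is a `threading` analogy, not `torch.distributed` code: threads stand in for ranks, and `run_shutdown` and the event strings are hypothetical. It shows only that the "store destroyed" step cannot happen until every rank has stopped polling.

```python
import threading


def run_shutdown(num_workers: int = 2) -> list[str]:
    events: list[str] = []
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)

    def worker(rank: int) -> None:
        with lock:
            events.append(f"rank{rank}:stopped-polling")
        # No rank proceeds until every rank has stopped polling.
        barrier.wait(timeout=5)
        if rank == 0:
            # Only now does rank 0 tear down the shared store.
            with lock:
                events.append("rank0:store-destroyed")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Whatever order the threads are scheduled in, the destroy event is always recorded last. The `timeout=5` on the barrier is a deliberate choice in this sketch: an unbounded barrier would hang if a rank had already died.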

## Additional context

- Buildkite: https://buildkite.com/vllm/vllm-omni-rebase/builds/1638
- The same teardown issue can affect any multi-GPU diffusion test where `ProcessGroupNCCL` is used
- `NCCL_ASYNC_ERROR_HANDLING` is explicitly disabled in omni workers (`gpu_ar_worker.py:34`, `gpu_generation_worker.py:33`) to allow memory snapshots before NCCL buffer allocation
- This is a known PyTorch distributed pattern: `HeartbeatMonitor` thread lifetime must be bounded by the TCPStore lifetime

🤖 Generated with [Claude Code](https://claude.com/claude-code)
