🐛 Describe the bug
Description
In a Docker environment, running the command below fails with networking errors: a link-local IPv6 address (fe80::...) is selected as the head node address, while the c10d socket client tries to resolve it as IPv4.
Command
uv run python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
Workaround
Based on the logic in monarch_executor.py, the host IP can be overridden via an environment variable:
def _get_host_ip() -> str:
    if host_ip := os.environ.get("VLLM_HOST_IP"):
        return host_ip
Setting the following resolves the issue in my environment:
export VLLM_HOST_IP=127.0.0.1
A more robust _get_host_ip() (e.g., preferring IPv4 or avoiding link-local IPv6 addresses in containers) could help. I'm happy to open a PR if that would be useful.
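To make the suggestion concrete, here is a rough sketch of what an IPv4-preferring lookup could look like. This is not the current monarch_executor.py code: `pick_host_ip`, `_is_usable`, and `get_host_ip_sketch` are hypothetical names, and resolving candidates via `socket.getaddrinfo(socket.gethostname(), ...)` is only one possible way to enumerate addresses. The only part taken from the existing code is the `VLLM_HOST_IP` override.

```python
import ipaddress
import os
import socket
from typing import Iterable, Optional


def _is_usable(addr: str) -> bool:
    """Reject loopback, link-local, and unspecified addresses."""
    # Drop an IPv6 zone id such as "%eth0" before parsing.
    ip = ipaddress.ip_address(addr.split("%", 1)[0])
    return not (ip.is_loopback or ip.is_link_local or ip.is_unspecified)


def pick_host_ip(candidates: Iterable[str]) -> Optional[str]:
    """Return the first usable IPv4 candidate, else the first usable IPv6 one."""
    usable = [c for c in candidates if _is_usable(c)]
    for version in (4, 6):  # prefer IPv4, fall back to routable IPv6
        for c in usable:
            if ipaddress.ip_address(c.split("%", 1)[0]).version == version:
                return c
    return None


def get_host_ip_sketch() -> str:
    """Hypothetical replacement for _get_host_ip(); assumptions noted above."""
    # Explicit override, same as the existing behavior.
    if host_ip := os.environ.get("VLLM_HOST_IP"):
        return host_ip
    try:
        infos = socket.getaddrinfo(socket.gethostname(), None)
    except socket.gaierror:
        infos = []
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    candidates = [info[4][0] for info in infos]
    # Fall back to loopback only if nothing better is found (single-node case).
    return pick_host_ip(candidates) or "127.0.0.1"
```

With the address from the log above, `pick_host_ip(["fe80::222:48ff:fe49:ba90", "10.0.0.5"])` returns `"10.0.0.5"` (the second address is a made-up example). Whether a loopback fallback is acceptable for multi-node runs is an open question.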
Error message
(EngineCore_DP0 pid=8447) [2026-01-29 01:26:35] INFO monarch_executor.py:386: [actor=<root>]
[MonarchExecutor] Head node: fe80::222:48ff:fe49:ba90:51391
(EngineCore_DP0 pid=8447) [2026-01-29 01:26:35] INFO monarch_executor.py:393: [actor=<root>]
[MonarchExecutor] Using allocated GPUs: ['1']
WARNING 01-29 01:26:43 [worker_base.py:301]
[actor=<root>.<forge.actors.vllm.v1.forge_executor.ForgeWorkerWrapper vllm_workers{'procs': 0/1}>]
Missing `shared_worker_lock` argument from executor. This argument is needed for
mm_processor_cache_type='shm'.
INFO 01-29 01:26:47 [parallel_state.py:1203]
[actor=<root>.<forge.actors.vllm.v1.forge_executor.ForgeWorkerWrapper vllm_workers{'procs': 0/1}>]
world_size=1 rank=0 local_rank=0 distributed_init_method=env:// backend=nccl
[W129 01:26:47.186735869 socket.cpp:767] [c10d] The client socket has failed to connect to
[train16node-master]:51391 (errno: 22 - Invalid argument).
[W129 01:26:47.186767930 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,
51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W129 01:26:47.876868742 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,
51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W129 01:26:48.613986049 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,
51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W129 01:26:49.280112312 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,
51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W129 01:26:51.628231771 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,
51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W129 01:26:54.085405747 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,
51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W129 01:26:59.063519507 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,
51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W129 01:27:07.065667397 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,
51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
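For reference, the address in the log can be checked with the standard library. The snippet below confirms that it is a link-local IPv6 address and shows the bracketed host:port form usually used for IPv6 literals. `format_endpoint` is a hypothetical helper for illustration, not something that exists in forge or vLLM. My guess is that the `errno: 22` comes from connecting to a link-local address without a scope id, but I have not verified that.

```python
import ipaddress


def format_endpoint(host: str, port: int) -> str:
    """Format host:port, bracketing IPv6 literals so the port is unambiguous."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. a hostname); leave it unchanged.
        return f"{host}:{port}"
    if ip.version == 6:
        return f"[{host}]:{port}"
    return f"{host}:{port}"


addr = ipaddress.ip_address("fe80::222:48ff:fe49:ba90")
print(addr.version, addr.is_link_local)  # 6 True
print(format_endpoint("fe80::222:48ff:fe49:ba90", 51391))
# [fe80::222:48ff:fe49:ba90]:51391
```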
Environment
git log -1
commit cd9e295c49b2a1a6e07eea2d77fa295613729638 (HEAD -> main, origin/main, origin/HEAD)
Author: Jiyue Wang <[email protected]>
Date: Wed Jan 28 16:40:10 2026 -0500
[vllm] Upgrade vllm version to v0.13.0 (#737)
# Check core components
python -c "import torch, forge, monarch, vllm; print('All imports successful')"
# Check specific versions
python -c "
import torch
import forge
import vllm
print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'vLLM: {vllm.__version__}')
print(f'CUDA: {torch.version.cuda}')
"
All imports successful
PyTorch: 2.9.0+cu128
TorchForge:
vLLM: 0.13.0
CUDA: 12.8
Versions
No response