
Networking error in Docker due to host IP detection (workaround: set VLLM_HOST_IP) #743

@insop

🐛 Describe the bug

Description

In a Docker environment, running the command below fails with c10d socket errors: host IP detection picks a link-local IPv6 address (fe80::...) where an IPv4 address is expected.

Command

uv run python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

Workaround

Based on the logic in monarch_executor.py, the host IP can be overridden via an environment variable:

if host_ip := os.environ.get("VLLM_HOST_IP"):
    return host_ip

Setting the following resolves the issue in my environment:

export VLLM_HOST_IP=127.0.0.1

A more robust _get_host_ip() (e.g., preferring IPv4 or avoiding link-local IPv6 addresses in containers) could help. I'm happy to open a PR if that would be useful.
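As a starting point for such a PR, here is a minimal sketch of what an IPv4-preferring lookup could look like. The function name `get_host_ip` and the fallback behavior are my assumptions, not the actual `monarch_executor.py` implementation; only the `VLLM_HOST_IP` override mirrors the existing logic quoted above.

```python
import os
import socket


def get_host_ip() -> str:
    """Hypothetical IPv4-preferring host IP lookup (sketch, not the real code).

    Order: explicit VLLM_HOST_IP override, then kernel-routed IPv4
    address, then loopback as a container-safe fallback.
    """
    # Explicit override always wins (same check as in monarch_executor.py).
    if host_ip := os.environ.get("VLLM_HOST_IP"):
        return host_ip
    try:
        # AF_INET forces IPv4. A UDP "connect" sends no packets; it only
        # asks the kernel which source address it would route from.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("8.8.8.8", 80))
            return s.getsockname()[0]
    except OSError:
        # No IPv4 route (e.g. an isolated container network): fall back to
        # loopback instead of a link-local IPv6 address like fe80::...
        return "127.0.0.1"
```

The key point is forcing `AF_INET` so the resolver can never hand back an IPv6 link-local address, which is what trips the c10d client in the log below.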

Error message

  (EngineCore_DP0 pid=8447) [2026-01-29 01:26:35] INFO monarch_executor.py:386: [actor=<root>]                 
  [MonarchExecutor] Head node: fe80::222:48ff:fe49:ba90:51391                                                  
  (EngineCore_DP0 pid=8447) [2026-01-29 01:26:35] INFO monarch_executor.py:393: [actor=<root>]                 
  [MonarchExecutor] Using allocated GPUs: ['1']                                                                
  WARNING 01-29 01:26:43 [worker_base.py:301]                                                                  
  [actor=<root>.<forge.actors.vllm.v1.forge_executor.ForgeWorkerWrapper vllm_workers{'procs': 0/1}>]           
  Missing `shared_worker_lock` argument from executor. This argument is needed for                             
  mm_processor_cache_type='shm'.                                                                               
  INFO 01-29 01:26:47 [parallel_state.py:1203]                                                                 
  [actor=<root>.<forge.actors.vllm.v1.forge_executor.ForgeWorkerWrapper vllm_workers{'procs': 0/1}>]           
  world_size=1 rank=0 local_rank=0 distributed_init_method=env:// backend=nccl                                 
  [W129 01:26:47.186735869 socket.cpp:767] [c10d] The client socket has failed to connect to                   
  [train16node-master]:51391 (errno: 22 - Invalid argument).                                                   
  [W129 01:26:47.186767930 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,     
  51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).                      
  [W129 01:26:47.876868742 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,     
  51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).                      
  [W129 01:26:48.613986049 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,     
  51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).                      
  [W129 01:26:49.280112312 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,     
  51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).                      
  [W129 01:26:51.628231771 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,     
  51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).                      
  [W129 01:26:54.085405747 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,     
  51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).                      
  [W129 01:26:59.063519507 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,     
  51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).                      
  [W129 01:27:07.065667397 socket.cpp:767] [c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90,     
  51391) cannot be retrieved (gai error: -9 - Address family for hostname not supported).  

Environment

git log -1
commit cd9e295c49b2a1a6e07eea2d77fa295613729638 (HEAD -> main, origin/main, origin/HEAD)
Author: Jiyue Wang <[email protected]>
Date:   Wed Jan 28 16:40:10 2026 -0500

    [vllm] Upgrade vllm version to v0.13.0 (#737)

# Check core components
python -c "import torch, forge, monarch, vllm; print('All imports successful')"

# Check specific versions
python -c "
import torch
import forge
import vllm

print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'vLLM: {vllm.__version__}')
print(f'CUDA: {torch.version.cuda}')
"
All imports successful
PyTorch: 2.9.0+cu128
TorchForge: 
vLLM: 0.13.0
CUDA: 12.8

