[Core] grpcio from conda-forge on remote VM can cause failure in starting ray cluster #2605

Open

Description

@Michaelvll

When a VM has grpcio==1.51.3 installed in a custom image, starting the Ray cluster can fail with the error below. After downgrading grpcio to 1.51.1, the Ray cluster starts normally.
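To check which grpcio build a node picked up, and to apply the workaround above, something like the following can be run on the affected VM (a sketch; the exact output and whether pip or conda should do the downgrade depend on the image):

$ python -c "import grpc; print(grpc.__version__)"
$ conda list grpcio   # the Channel column shows whether the build came from conda-forge
$ pip install grpcio==1.51.1   # workaround: installs the 1.51.1 wheel from PyPI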

$ ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir /tmp/ray_skypilot/
Stopped all 3 Ray processes.
Usage stats collection is disabled.

Local node IP: 10.128.0.17
2023-09-25 16:11:38,229 ERROR services.py:1197 -- Failed to start the dashboard , return code -11
2023-09-25 16:11:38,230 ERROR services.py:1222 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-09-25 16:11:38,230 ERROR services.py:1266 -- 
The last 20 lines of /tmp/ray_skypilot/session_2023-09-25_16-11-35_935010_2994/logs/dashboard.log (it contains the error message from the dashboard): 
2023-09-25 16:11:38,077 INFO head.py:239 -- Starting dashboard metrics server on port 44227


--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.128.0.17:6380'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status

This issue seems hard to reproduce by manually installing grpcio==1.51.3 on the VM. We should test whether creating a custom image triggers the issue. (Note that the dashboard's return code -11 means the process was killed by SIGSEGV, which is consistent with a broken native grpcio extension.)

Related issues: ray-project/ray#35383, ray-project/ray#34662
It seems the grpcio build shipped by conda-forge causes this issue.

To reproduce:

  1. Add conda install -c conda-forge -y grpcio=1.51.1; before
     (pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}});
     in the cluster setup commands (see the combined command after this list).
  2. Run sky launch --cloud gcp --cpus 2
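
For reference, the patched Ray install step would then look roughly like this (a sketch of the setup command from SkyPilot's cluster template, with {{ray_version}} left as the template placeholder):

conda install -c conda-forge -y grpcio=1.51.1; (pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}});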

Labels: P0, bug