Description
When a VM has grpcio==1.51.3 installed in a custom image, starting the Ray cluster can trigger the following issue. After downgrading grpcio to 1.51.1, the Ray cluster starts normally.
$ ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir /tmp/ray_skypilot/
Stopped all 3 Ray processes.
Usage stats collection is disabled.
Local node IP: 10.128.0.17
2023-09-25 16:11:38,229 ERROR services.py:1197 -- Failed to start the dashboard , return code -11
2023-09-25 16:11:38,230 ERROR services.py:1222 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-09-25 16:11:38,230 ERROR services.py:1266 --
The last 20 lines of /tmp/ray_skypilot/session_2023-09-25_16-11-35_935010_2994/logs/dashboard.log (it contains the error message from the dashboard):
2023-09-25 16:11:38,077 INFO head.py:239 -- Starting dashboard metrics server on port 44227
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='10.128.0.17:6380'
To connect to this Ray cluster:
import ray
ray.init()
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
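A note on the "return code -11" in the log above: assuming Ray reports the dashboard's exit status using the usual Python subprocess convention (a negative value means the process was terminated by that signal number), -11 would correspond to signal 11, which is SIGSEGV on Linux. That would point to a crash in native code rather than a Python exception. The helper below is a hypothetical sketch for decoding such codes, not part of Ray or SkyPilot.

```python
import signal


def describe_return_code(return_code: int) -> str:
    """Describe a subprocess-style return code in readable form.

    Hypothetical helper: assumes the convention used by Python's
    subprocess module, where a negative code -N means "killed by signal N".
    """
    if return_code < 0:
        signal_number = -return_code
        try:
            name = signal.Signals(signal_number).name
        except ValueError:
            # The number does not map to a known signal on this platform.
            name = "unknown signal"
        return f"killed by signal {signal_number} ({name})"
    return f"exited with code {return_code}"


print(describe_return_code(-11))
```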
This issue seems hard to reproduce by manually installing grpcio==1.51.3 on the VM. We should test whether creating a custom image triggers the issue.
Related issues: ray-project/ray#35383, ray-project/ray#34662
It seems the grpcio build from conda-forge causes this issue.
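To tell which grpcio build a given VM or image actually has, something like the following could be run in the same Python environment that Ray uses. This is only a sketch: describe_distribution is a hypothetical helper, and it assumes the package's INSTALLER metadata file is present and distinguishes pip from conda installs, which may not hold for every build.

```python
from importlib import metadata


def describe_distribution(name: str) -> str:
    """Report the installed version of a package and the tool that installed it.

    Hypothetical helper: relies on the optional INSTALLER metadata file,
    which typically contains "pip" or "conda" but is not guaranteed to exist.
    """
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    # read_text returns None when the INSTALLER file is missing.
    installer = (dist.read_text("INSTALLER") or "").strip() or "unknown"
    return f"{name}=={dist.version} (installer: {installer})"


print(describe_distribution("grpcio"))
```

Comparing this output between a working VM and a failing custom image should show whether the two differ in version, installer, or both.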
To reproduce:
- Add conda install -c conda-forge -y grpcio=1.51.1; before line 309 of skypilot/sky/templates/gcp-ray.yml.j2 (at commit 61cb5e4)
- Run sky launch --cloud gcp --cpus 2