[Core] `grpcio` from conda-forge on remote VM can cause failure in starting ray cluster #2605

When a VM has `grpcio==1.51.3` installed in a custom image, starting the Ray cluster can fail with the error below. After downgrading `grpcio` to `1.51.1`, the Ray cluster starts normally.
```console
$ ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml  --temp-dir /tmp/ray_skypilot/
Stopped all 3 Ray processes.
Usage stats collection is disabled.

Local node IP: 10.128.0.17
2023-09-25 16:11:38,229 ERROR services.py:1197 -- Failed to start the dashboard , return code -11
2023-09-25 16:11:38,230 ERROR services.py:1222 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-09-25 16:11:38,230 ERROR services.py:1266 -- 
The last 20 lines of /tmp/ray_skypilot/session_2023-09-25_16-11-35_935010_2994/logs/dashboard.log (it contains the error message from the dashboard): 
2023-09-25 16:11:38,077 INFO head.py:239 -- Starting dashboard metrics server on port 44227


--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.128.0.17:6380'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
```
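
A note on the `return code -11` in the log above: assuming Ray launches the dashboard as a child process and reports its return code with Python's `subprocess` convention (an assumption, not checked against `services.py`), a negative value `-N` means the process was terminated by signal `N`. On Linux, signal 11 is `SIGSEGV`, which would be consistent with a crash inside a native extension such as `grpcio`. The helper `describe_return_code` below is hypothetical and only illustrates the decoding:

```python
import signal


def describe_return_code(return_code: int) -> str:
    """Describe a child-process return code using the subprocess convention."""
    if return_code < 0:
        # On POSIX, a negative value -N means "terminated by signal N".
        try:
            name = signal.Signals(-return_code).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-return_code}"
        return f"killed by {name}"
    if return_code == 0:
        return "exited normally"
    return f"exited with status {return_code}"


# The value reported for the dashboard in the log above.
print(describe_return_code(-11))
```

On a Linux VM this should report `SIGSEGV` for `-11`.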

~~This issue seems hard to reproduce by manually installing `grpcio==1.51.3` on the VM. We should test whether creating a custom image triggers the issue.~~

Related issues: https://github.com/ray-project/ray/issues/35383 and https://github.com/ray-project/ray/issues/34662
It seems the `grpcio` build from conda-forge causes this issue.
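
To check on a VM which `grpcio` build is actually in use, a small diagnostic like the following may help. It is a minimal sketch: `describe_distribution` is a hypothetical helper, and it assumes the package ships an `INSTALLER` metadata file (pip writes `pip` there; conda packages usually record `conda`, but that is not guaranteed for every build).

```python
from importlib import metadata
from typing import Optional


def describe_distribution(name: str) -> Optional[dict]:
    """Return version and recorded installer for a package, or None if absent."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    # INSTALLER is optional metadata; fall back to "unknown" when it is missing.
    installer = (dist.read_text("INSTALLER") or "").strip() or "unknown"
    return {"name": name, "version": dist.version, "installer": installer}


# Prints None if grpcio is not installed in the active environment.
print(describe_distribution("grpcio"))
```

Run it with the same Python interpreter that `ray start` uses, since the VM may have more than one environment.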

To reproduce:
1. Add `conda install -c conda-forge -y grpcio=1.51.1;` before https://github.com/skypilot-org/skypilot/blob/61cb5e4314b65003421437ca5f90c43bc46dd7d5/sky/templates/gcp-ray.yml.j2#L309
2. `sky launch --cloud gcp --cpus 2`
