
[Bug] ray cluster detection is broken on startup #90

@tabbas97

Description

root@psap-h200-au-syd-3:/workspace/auto-tuning-vllm# auto-tune-vllm optimize --config examples/kimi-k2.yaml --python-executable python3 --max-concurrent 1
Starting auto-tune-vllm optimization
Configuration: examples/kimi-k2.yaml
Backend: ray
Python executable: python3
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Generated study name: kimi-k2-code_13948 from prefix: kimi-k2-code
Loaded study: kimi-k2-code_13948
Saved study config to: optuna_studies/kimi-k2-code_13948/study_config.yaml
2025-10-09 05:45:48,623 INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.245.128.4:6379...
^C[2025-10-09 05:45:53,633 W 141339 141339] gcs_rpc_client.h:155: Failed to connect to GCS at address 10.245.128.4:6379 within 5 seconds.
[2025-10-09 05:46:23,635 W 141339 141339] gcs_client.cc:184: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.

A more graceful check of Ray cluster availability is needed. As it stands, the CLI does not respond to user input (including Ctrl-C) until the full chain of GCS connection timeouts has elapsed. A sketch of a fast pre-flight check is below.
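One possible approach, not the project's current code: probe the GCS address with a short TCP timeout before calling `ray.init()`, so an unreachable cluster fails fast with a clear message instead of hanging through Ray's retry chain. The helper names (`ray_cluster_reachable`, `connect_or_exit`) and the "host:port" address format are assumptions for illustration.

```python
# Minimal sketch (hypothetical helpers, not auto-tune-vllm's actual code):
# fail fast if the Ray GCS address is unreachable, instead of blocking the
# CLI through Ray's own connection/retry timeouts.
import socket
import sys

import ray


def ray_cluster_reachable(address: str, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to the GCS address succeeds quickly.

    `address` is assumed to be in the "host:port" form Ray logs on startup,
    e.g. "10.245.128.4:6379".
    """
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout_s):
            return True
    except (OSError, ValueError):
        return False


def connect_or_exit(address: str) -> None:
    if not ray_cluster_reachable(address):
        # Exit immediately with an actionable message rather than waiting
        # for the GCS client's timeout chain.
        sys.exit(
            f"Ray cluster at {address} is not reachable; "
            "start it with `ray start --head` or fix the address."
        )
    ray.init(address=address)
```

A check like this keeps the process responsive to Ctrl-C, since the socket probe returns within a couple of seconds rather than after the 5s GCS connect plus the longer cluster-ID timeout seen in the log above.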
