This example demonstrates how to launch a Ray cluster on OSMO. Ray is a unified framework for scaling AI and Python applications, and OSMO provides native support for running Ray clusters, making it easy to leverage Ray's distributed computing capabilities.
The workflow launches a Ray cluster with one master node and one or more worker nodes. The master node runs the Ray head process that coordinates the cluster, while worker nodes connect to the master to form a distributed compute cluster.
```shell
curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/integration_and_tools/ray/ray.yaml
osmo workflow submit ray.yaml
```

You can customize the cluster size and resources:
```shell
osmo workflow submit ray.yaml --set num_nodes=4 --set gpu=2 --set cpu=16 --set memory=128Gi
```

Port-forward the dashboard ports to access the Ray dashboard and Prometheus metrics:
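The `--set` parameters above are per-node requests, so they multiply across the cluster. A quick plain-Python sketch (illustrative only — actual allocation is handled by the OSMO scheduler) to sanity-check what a request adds up to:

```python
# Illustrative sketch: cluster-wide totals implied by the per-node
# --set parameters above. Parameter names mirror the command line.
def cluster_totals(num_nodes: int, gpu: int, cpu: int, memory_gi: int) -> dict:
    """Multiply per-node resource requests across all nodes."""
    return {
        "gpus": num_nodes * gpu,
        "cpus": num_nodes * cpu,
        "memory_gi": num_nodes * memory_gi,
    }

totals = cluster_totals(num_nodes=4, gpu=2, cpu=16, memory_gi=128)
print(totals)  # {'gpus': 8, 'cpus': 64, 'memory_gi': 512}
```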
```shell
# Get the workflow ID from the submit command output
osmo workflow port-forward <workflow-id> master --port 8265,9090
```

The Ray dashboard will be available at http://localhost:8265.
The Prometheus dashboard will be available at http://localhost:9090.
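Besides the web UI, the forwarded dashboard port also serves Ray's Jobs REST API. A minimal standard-library sketch (assumes the port-forward above is active; the `/api/jobs/` endpoint is part of Ray's Jobs API):

```python
from urllib import request

# The forwarded Ray dashboard address (see the port-forward step above).
RAY_ADDRESS = "http://localhost:8265"

def list_jobs_request(address: str = RAY_ADDRESS) -> request.Request:
    """Build a GET request against Ray's Jobs REST API."""
    return request.Request(f"{address}/api/jobs/", method="GET")

# With the port-forward active, this prints the cluster's submitted jobs:
# with request.urlopen(list_jobs_request()) as resp:
#     print(resp.read().decode())
```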
Set the `RAY_ADDRESS` environment variable so the Ray CLI can reach the cluster:

```shell
export RAY_ADDRESS="http://localhost:8265"
```

Best practices:

- Resource Allocation: Ensure your resource requests match your workload requirements. Ray works best when it has accurate information about available resources.
- Monitoring: Use the Ray dashboard to monitor cluster health, task progress, and resource utilization.
- Port Configuration: The default Ray port (6376) can be customized using the `ray_port` parameter if needed.
- Timeouts: Consider setting appropriate timeouts to manage the cluster lifecycle and avoid unnecessary resource consumption.