This example demonstrates distributed model training on Kubernetes using KubeRay and Ray Train. It supports both Azure (AKS) and Nebius clusters.
It is adapted from the Ray E2E Multimodal AI Workloads — Distributed Training tutorial (originally designed for Anyscale).
- Preprocess: Uses CLIP to embed dog breed images into 512-dimensional vectors using Ray Data map_batches with GPU actors.
- Train: Trains a simple 2-layer PyTorch classifier with Ray Train TorchTrainer across multiple GPU workers using DDP.
- Track: Logs metrics and model artifacts to an MLflow model registry on shared storage.
- Evaluate: Runs batch inference on a held-out test set and computes precision, recall, F1, and accuracy.
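The Evaluate step can be illustrated with a minimal, dependency-free sketch. It assumes macro averaging across classes; the README does not say which averaging main.py uses, and the function name `classification_metrics` is hypothetical rather than taken from the repository.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Assumption: macro averaging (every class weighted equally).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in set(y_true) | set(y_pred):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n_labels = len(precisions)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n_labels,
        "recall": sum(recalls) / n_labels,
        "f1": sum(f1s) / n_labels,
    }
```

For example, `classification_metrics(["collie", "pug"], ["collie", "pug"])` returns 1.0 for all four metrics. Classes that are never predicted contribute a precision of 0.0 instead of raising a division error.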
| Component | Version / Details |
|---|---|
| Kubernetes cluster | AKS or Nebius with GPU node pool |
| NVIDIA GPU DRA driver | gpu.nvidia.com device class available on GPU nodes |
| KubeRay operator | v1.5.1 (installed by run.sh if not present) |
| Shared PVC | cluster-storage (100Gi ReadWriteMany), created via configs/pvc.yaml |
| Ray | 2.48.0 |
distributed-training/
├── main.py # Training script (runs on the RayCluster)
├── run.sh # One-command launcher (azure or nebius)
├── doggos/embed.py # CLIP embedding actor for Ray Data
├── base/
│ ├── kustomization.yaml # Kustomize base
│ ├── rayjob.yaml # Cloud-agnostic RayJob manifest
│ └── gpu-claim.yaml # DRA ResourceClaimTemplate (8 GPUs per worker)
└── overlays/
    ├── azure/
    │   ├── kustomization.yaml # Azure overlay
    │   └── rayjob-patch.yaml # nodeSelector for Azure
    └── nebius/
        ├── kustomization.yaml # Nebius overlay
        └── rayjob-patch.yaml # nodeSelector for Nebius
GPU allocation is defined in base/gpu-claim.yaml as a standalone ResourceClaimTemplate (multi-gpu) that requests 8 NVIDIA H100 GPUs per worker via the gpu.nvidia.com device class. The RayJob references this template by name. Each cloud overlay applies JSON patches to place pods on the correct node pools:
| Pod | Azure | Nebius |
|---|---|---|
| Submitter / Head | agentpool: cpu | agentpool: nebius-cpu |
| GPU Workers | agentpool: gpu | agentpool: nebius-gpu |
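To make the node pool mapping concrete, the sketch below generates JSON Patch operations of the kind an overlay could apply. The contents of rayjob-patch.yaml are not shown here, so the patch paths are assumptions based on the standard RayJob layout (submitterPodTemplate, headGroupSpec, and a single entry in workerGroupSpecs). The function name `node_selector_patches` is hypothetical.

```python
import json

# Node pool labels per cloud, copied from the table above.
NODE_POOLS = {
    "azure": {"cpu": "cpu", "gpu": "gpu"},
    "nebius": {"cpu": "nebius-cpu", "gpu": "nebius-gpu"},
}


def node_selector_patches(cloud):
    """Build JSON Patch (RFC 6902) operations that set nodeSelector on each pod.

    Assumption: the RayJob has exactly one worker group at index 0.
    """
    pools = NODE_POOLS[cloud]

    def add_selector(pod_spec_path, pool):
        # "add" on an object member creates the key or replaces its value.
        return {
            "op": "add",
            "path": pod_spec_path + "/nodeSelector",
            "value": {"agentpool": pool},
        }

    return [
        add_selector("/spec/submitterPodTemplate/spec", pools["cpu"]),
        add_selector("/spec/rayClusterSpec/headGroupSpec/template/spec", pools["cpu"]),
        add_selector("/spec/rayClusterSpec/workerGroupSpecs/0/template/spec", pools["gpu"]),
    ]


if __name__ == "__main__":
    print(json.dumps(node_selector_patches("azure"), indent=2))
```

Keeping the cloud-specific values in one lookup table mirrors the base/overlay split: the paths stay cloud-agnostic and only the agentpool labels change.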
./run.sh azure (or: kubectl apply -k overlays/azure)
│
▼
┌─────────────────────────────────────────────────┐
│ RayJob: multimodel-distributed-training │
│ │
│ Head Pod (CPU node pool) │
│ ├── main.py (entrypoint) │
│ ├── doggos/embed.py (CLIP actor) │
│ └── /mnt/cluster_storage (shared PVC) │
│ │
│ Worker Pods (GPU) × 2 │
│ ├── Ray Train workers (DDP) │
│ ├── Ray Data actors (CLIP embedding) │
│ └── /mnt/cluster_storage (shared PVC) │
└─────────────────────────────────────────────────┘
The training script is mounted via a ConfigMap. Pip dependencies (torch, transformers, mlflow, doggos, etc.) are installed on all nodes at job start via runtimeEnvYAML in rayjob.yaml. The doggos helper package is installed directly from GitHub.
kubectl apply -f configs/storageclass.yaml
kubectl apply -f configs/pvc.yaml

The run.sh script handles ConfigMap creation, cleanup of previous runs, and KubeRay operator installation (if needed), then applies the correct kustomize overlay:

./run.sh azure # for AKS clusters
./run.sh nebius # for Nebius clusters

Or apply manually:

# Create the ConfigMap
kubectl create configmap multimodel-distributed-training-scripts \
--from-file=main.py \
-n ray --dry-run=client -o yaml | kubectl apply -f -
# Apply the overlay
kubectl apply -k overlays/azure # or overlays/nebius

# Watch job status
kubectl -n ray get rayjob multimodel-distributed-training -w
# Stream logs
kubectl -n ray logs -f -l job-name=multimodel-distributed-training --tail=100
# Ray Dashboard
kubectl -n ray port-forward svc/multimodel-distributed-training-head-svc 8265:8265

# Run MLflow server inside the head pod
head_pod=$(kubectl get pod -l ray.io/node-type=head -o=jsonpath='{.items[0].metadata.name}' -n ray)
kubectl -n ray exec -it $head_pod -- \
mlflow server -h 0.0.0.0 -p 8080 \
--backend-store-uri /mnt/cluster_storage/mlflow/doggos
# Port-forward in another terminal
kubectl -n ray port-forward $head_pod 8080:8080

MLflow UI is now available at http://localhost:8080.
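
Before opening the UI, it can help to confirm that the port-forward is actually serving traffic. The sketch below polls the /health route that the MLflow tracking server exposes. The helper name `mlflow_is_healthy` is hypothetical, and the default URL assumes the port-forward shown above is running.

```python
import urllib.request


def mlflow_is_healthy(base_url="http://localhost:8080", timeout=5.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    # Bypass any configured HTTP proxy: the target is a local port-forward.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = base_url.rstrip("/") + "/health"
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError and HTTPError are OSError subclasses).
        return False


if __name__ == "__main__":
    print("MLflow reachable:", mlflow_is_healthy())
```

A False result usually means either the mlflow server process in the head pod or the port-forward has stopped.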
The default base/rayjob.yaml is configured for 2 GPU worker nodes with 8 GPUs each (e.g., Standard_ND96asr_v4). Adjust for your setup:
| Node Pool VM SKU | GPUs/Node | NUM_WORKERS | resources_per_worker |
|---|---|---|---|
| Standard_NC6s_v3 (V100) | 1 | 2 | {"CPU": 4, "GPU": 1} |
| Standard_NC24ads_A100_v4 (A100) | 4 | 2 | {"CPU": 8, "GPU": 4} |
| Standard_ND96asr_v4 (A100 x8) | 8 | 2 | {"CPU": 8, "GPU": 8} |
Update the worker replicas in base/rayjob.yaml, the GPU count in base/gpu-claim.yaml, and the constants at the top of main.py.
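
Because three places must stay in sync, a small lookup can help keep the values consistent. The sketch below encodes the table above. The names `SKU_SETTINGS` and `scaling_settings` are hypothetical. It also assumes one Ray Train worker per GPU node, as the table implies, so the worker replica count equals NUM_WORKERS. How main.py actually consumes these values is not shown here.

```python
# Per-SKU values copied from the scaling table above.
SKU_SETTINGS = {
    "Standard_NC6s_v3": {"gpus_per_node": 1, "cpus_per_worker": 4},
    "Standard_NC24ads_A100_v4": {"gpus_per_node": 4, "cpus_per_worker": 8},
    "Standard_ND96asr_v4": {"gpus_per_node": 8, "cpus_per_worker": 8},
}


def scaling_settings(sku, num_workers=2):
    """Return the values to keep in sync across the three files.

    Assumption: each Ray Train worker occupies one whole GPU node.
    """
    row = SKU_SETTINGS[sku]
    return {
        # Constants for main.py; these map onto ray.train.ScalingConfig
        # arguments (num_workers, use_gpu, resources_per_worker).
        "num_workers": num_workers,
        "use_gpu": True,
        "resources_per_worker": {
            "CPU": row["cpus_per_worker"],
            "GPU": row["gpus_per_node"],
        },
        # Worker replicas in base/rayjob.yaml.
        "worker_replicas": num_workers,
        # GPU count requested in base/gpu-claim.yaml.
        "gpu_claim_count": row["gpus_per_node"],
    }


if __name__ == "__main__":
    print(scaling_settings("Standard_NC24ads_A100_v4"))
```

Deriving all three values from one table row avoids the common mismatch where the claim requests more GPUs than the node has, which leaves worker pods unschedulable.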
| Anyscale | KubeRay on AKS / Nebius |
|---|---|
| Notebook runs inside Anyscale Workspace | Script runs on the cluster via RayJob |
| accelerator_type="T4" | Removed; GPU type is determined by VM SKU |
| S3 user storage for artifacts | Shared PVC at /mnt/cluster_storage |
| Anyscale runtime env auto-setup | runtimeEnvYAML in RayJob spec |
| anyscale job submit | ./run.sh azure or kubectl apply -k overlays/<cloud> |
kubectl -n ray delete rayjob multimodel-distributed-training
kubectl -n ray delete configmap multimodel-distributed-training-scripts