This example demonstrates batch inference (CLIP image embedding generation) on Kubernetes using KubeRay and Ray Data. It supports both Azure (AKS) and Nebius clusters.
It is adapted from the Ray E2E Multimodal AI Workloads — Batch Inference tutorial (originally designed for Anyscale).
- Distributed Read: Reads dog breed images from a public S3 bucket using Ray Data (CPU).
- Preprocessing: Adds class labels extracted from file paths using map (CPU).
- Batch Embedding: Generates CLIP embeddings with GPU actors via map_batches (GPU).
- Materialize: Materializes embeddings into Ray's shared memory object store.
- Similarity Search: Embeds a query image and retrieves the most similar images by cosine similarity.
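The preprocessing step can be sketched in plain Python. This is a minimal illustration, assuming each image sits in a directory named after its breed; the bucket name and path layout below are hypothetical, and the actual main.py may derive the label differently.

```python
def add_class(row: dict) -> dict:
    """Return a copy of the row with a "class" label taken from the parent directory."""
    # Assumption: the path looks like <prefix>/<class_name>/<file_name>.
    parts = row["path"].rstrip("/").split("/")
    return {**row, "class": parts[-2]}


# Hypothetical path, for illustration only.
example = add_class({"path": "s3://example-bucket/train/border_collie/0001.jpg"})
print(example["class"])
```

A row-to-row function of this shape is the kind of callable that Ray Data's `Dataset.map` accepts.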
The pipeline uses Ray Data's streaming execution, which processes data in chunks as they're loaded — avoiding OOM errors on large datasets and maximizing GPU utilization by overlapping CPU preprocessing with GPU inference.
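The idea behind streaming execution can be illustrated with plain Python generators. This is only a conceptual sketch, not Ray Data's implementation: the stage functions are hypothetical stand-ins, and a real pipeline runs the stages concurrently on different workers rather than in a single process.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_items(n: int) -> Iterator[int]:
    """Stand-in for a lazy reader: yields one item at a time."""
    for i in range(n):
        yield i


def batched(items: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a lazy stream into fixed-size batches without loading it all."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def embed(batch: List[int]) -> List[int]:
    """Stand-in for GPU inference on one batch."""
    return [x * x for x in batch]


# Each batch is produced on demand; the reader never holds the whole dataset.
results = [embed(batch) for batch in batched(read_items(10), batch_size=4)]
print(results)
```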
S3 (images) ──► read_images (CPU) ──► map(add_class) (CPU)
│
▼
map_batches(EmbedImages) (GPU × N)
│
▼
materialize() ──► Ray Object Store
| Component | Version / Details |
|---|---|
| Kubernetes cluster | AKS or Nebius with GPU node pool |
| NVIDIA GPU DRA driver | gpu.nvidia.com device class available on GPU nodes |
| KubeRay operator | v1.5.1+, installed via cluster setup |
| Ray | 2.48.0 |
batch-inference/
├── main.py # Batch inference script (runs on the RayCluster)
├── run.sh # One-command launcher (azure or nebius)
├── base/
│ ├── kustomization.yaml # Kustomize base
│ ├── rayjob.yaml # Cloud-agnostic RayJob manifest
│ └── gpu-claim.yaml # DRA ResourceClaimTemplate (1 GPU per worker)
└── overlays/
├── azure/
│ ├── kustomization.yaml # Azure overlay
│ └── rayjob-patch.yaml # nodeSelector for Azure
└── nebius/
├── kustomization.yaml # Nebius overlay
└── rayjob-patch.yaml # nodeSelector for Nebius
GPU allocation is defined in base/gpu-claim.yaml as a standalone ResourceClaimTemplate (single-gpu) that requests 1 NVIDIA H100 GPU per worker via the gpu.nvidia.com device class. The RayJob references this template by name. Each cloud overlay applies JSON patches to place pods on the correct node pools:
| Pod | Azure | Nebius |
|---|---|---|
| Submitter / Head | agentpool: cpu | agentpool: nebius-cpu |
| GPU Workers | agentpool: gpu | agentpool: nebius-gpu |
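As a rough illustration of what such a patch does, the sketch below applies a JSON Patch `add` operation to a trimmed-down RayJob-like dictionary. The path follows the RayJob layout (`spec.rayClusterSpec.workerGroupSpecs[0].template.spec`), but the exact patches in the overlays are not reproduced here, so treat the operation as an assumption. `apply_add` is a hypothetical helper, not a Kustomize API, and it ignores JSON Pointer escaping and list insertion.

```python
import copy


def apply_add(doc: dict, path: str, value) -> dict:
    """Apply a single JSON Patch "add" operation and return a patched copy."""
    patched = copy.deepcopy(doc)
    *parents, last = path.strip("/").split("/")
    node = patched
    for key in parents:
        # List elements are addressed by integer index, objects by key.
        node = node[int(key)] if isinstance(node, list) else node[key]
    node[last] = value
    return patched


# Trimmed-down stand-in for the RayJob manifest.
rayjob = {
    "spec": {
        "rayClusterSpec": {
            "workerGroupSpecs": [{"template": {"spec": {"containers": []}}}]
        }
    }
}

# Assumed patch: place GPU workers on the Azure GPU node pool.
patch = {
    "op": "add",
    "path": "/spec/rayClusterSpec/workerGroupSpecs/0/template/spec/nodeSelector",
    "value": {"agentpool": "gpu"},
}
patched = apply_add(rayjob, patch["path"], patch["value"])
worker_spec = patched["spec"]["rayClusterSpec"]["workerGroupSpecs"][0]["template"]["spec"]
print(worker_spec["nodeSelector"])
```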
./run.sh azure (or: kubectl apply -k overlays/azure)
│
▼
┌──────────────────────────────────────────────────┐
│ RayJob: multimodel-batch-inference │
│ │
│ Head Pod (CPU node pool) │
│ ├── main.py (entrypoint via ConfigMap) │
│ └── Drives the Ray Data pipeline │
│ │
│ Worker Pods (GPU) × 2 │
│ ├── 1 GPU each (2 total) │
│ └── CLIP embedding actors via map_batches │
│ │
│ Ray Object Store (shared memory) │
│ └── Materialized embeddings (ephemeral) │
└──────────────────────────────────────────────────┘
The script is mounted via a ConfigMap. Pip dependencies (torch, transformers, doggos, etc.) are installed on all nodes at job start via runtimeEnvYAML in rayjob.yaml.
The run.sh script handles ConfigMap creation, cleanup of previous runs, and applies the correct kustomize overlay:
./run.sh azure # for AKS clusters
./run.sh nebius # for Nebius clusters
Or apply manually:
# Create the ConfigMap
kubectl create configmap multimodel-batch-inference-scripts \
--from-file=main.py \
-n ray --dry-run=client -o yaml | kubectl apply -f -
# Apply the overlay
kubectl apply -k overlays/azure # or overlays/nebius
This creates a RayCluster (head + 2 GPU workers with 1 GPU each), installs pip dependencies via runtimeEnvYAML, runs main.py, and keeps the cluster alive for inspection.
# Watch job status
kubectl -n ray get rayjob multimodel-batch-inference -w
# Stream logs
kubectl -n ray logs -f -l job-name=multimodel-batch-inference --tail=100
# Ray Dashboard
kubectl -n ray port-forward svc/multimodel-batch-inference-head-svc 8265:8265
Then open http://localhost:8265 for the Ray Dashboard.
Embeddings are materialized into Ray's in-memory object store and used directly for the similarity search. The top-K results are printed in the job logs:
Top 5 similar images:
1. class=border_collie similarity=0.8176 path=s3://...
2. class=yorkshire_terrier similarity=0.8079 path=s3://...
...
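The ranking behind this output can be sketched with the standard library alone. This is an illustrative example, not the code in main.py: the three-dimensional vectors and image names are made up, whereas real CLIP embeddings have hundreds of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], embeddings: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k (name, similarity) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up embeddings, for illustration only.
embeddings = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.6, 0.8, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}
ranked = top_k([1.0, 0.0, 0.0], embeddings, k=2)
for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. similarity={score:.4f} path={name}")
```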
Since embeddings live in Ray's object store, they are ephemeral — they exist only while the RayCluster is running. Set shutdownAfterJobFinishes: false (the default in rayjob.yaml) to keep the cluster alive for interactive inspection via the Ray Dashboard.
| Variable | Default | Description |
|---|---|---|
| BATCH_SIZE | 64 | Batch size for CLIP embedding |
| NUM_GPU_ACTORS | 2 | Number of GPU actor replicas |
| TOP_K | 5 | Number of similar images to retrieve |
| SAMPLE_IMAGE_URL | https://doggos-dataset.s3...samara.png | Query image for similarity demo |
These are set in runtimeEnvYAML inside base/rayjob.yaml and can be overridden there.
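On the script side, values like these are typically read from the environment with matching defaults. The sketch below shows one way to do that; `load_config` is a hypothetical helper and may not match how main.py actually parses its settings.

```python
import os
from typing import Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read pipeline settings from environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "batch_size": int(env.get("BATCH_SIZE", "64")),
        "num_gpu_actors": int(env.get("NUM_GPU_ACTORS", "2")),
        "top_k": int(env.get("TOP_K", "5")),
        # The default URL is elided in the table above, so none is assumed here.
        "sample_image_url": env.get("SAMPLE_IMAGE_URL"),
    }


config = load_config({"TOP_K": "3"})
print(config)
```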
The default base/rayjob.yaml uses 2 GPU worker nodes with 1 GPU each (2 total) and 2 CLIP embedding actors. Adjust for your setup:
| Node Pool VM SKU | GPUs/Node | Suggested NUM_GPU_ACTORS | Worker Replicas |
|---|---|---|---|
| Standard_NC6s_v3 (V100) | 1 | 1 | 1 |
| Standard_NC24ads_A100_v4 (A100) | 1 | 1 | 4 |
| Standard_ND96asr_v4 (A100 x8) | 8 | 8 | 1 |
| gpu-h100-sxm-8gpu (H100 x8) | 8 | 8-16 | 1-2 |
Update replicas and num-gpus in base/rayjob.yaml, the GPU count in base/gpu-claim.yaml, and the NUM_GPU_ACTORS environment variable so that all four values agree.
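A quick sanity check for these settings can be written as a few lines of arithmetic. The sketch assumes each embedding actor reserves one whole GPU, which may not hold if main.py requests fractional GPUs; both function names are hypothetical.

```python
def total_gpus(worker_replicas: int, gpus_per_worker: int) -> int:
    """GPUs available to the job across all worker pods."""
    return worker_replicas * gpus_per_worker


def actors_fit(num_gpu_actors: int, worker_replicas: int, gpus_per_worker: int) -> bool:
    """True if every actor can be given its own GPU (assumes one GPU per actor)."""
    return num_gpu_actors <= total_gpus(worker_replicas, gpus_per_worker)


# Default setup: 2 workers with 1 GPU each and 2 actors.
print(actors_fit(num_gpu_actors=2, worker_replicas=2, gpus_per_worker=1))
# An 8-GPU node with a single replica cannot host 16 one-GPU actors.
print(actors_fit(num_gpu_actors=16, worker_replicas=1, gpus_per_worker=8))
```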
Note: The runtimeEnvYAML pip install runs per-actor on each worker node at startup. With large dependencies like torch (~2.8 GB), expect a 1-2 minute delay before GPU actors begin processing. To eliminate this delay, bake dependencies into a custom container image.
| Anyscale | KubeRay on AKS / Nebius |
|---|---|
| Notebook runs inside Anyscale Workspace | Script runs on the cluster via RayJob |
| accelerator_type="T4" | Removed; GPU type determined by VM SKU |
| S3 user storage for Parquet artifacts | Ray object store via materialize() (ephemeral) |
| Anyscale runtime env auto-setup | runtimeEnvYAML in RayJob spec |
| anyscale job submit | ./run.sh azure or kubectl apply -k overlays/<cloud> |
| doggos pip package pre-installed | doggos installed via runtimeEnvYAML from GitHub |
kubectl -n ray delete rayjob multimodel-batch-inference
kubectl -n ray delete configmap multimodel-batch-inference-scripts