Batch Inference on AKS with KubeRay

This example demonstrates batch inference (CLIP image embedding generation) on Kubernetes using KubeRay and Ray Data. It supports both Azure (AKS) and Nebius clusters.

It is adapted from the Ray E2E Multimodal AI Workloads — Batch Inference tutorial (originally designed for Anyscale).

What This Example Does

1. Distributed Read: Reads dog breed images from a public S3 bucket using Ray Data (CPU).
2. Preprocessing: Adds class labels extracted from file paths using map (CPU).
3. Batch Embedding: Generates CLIP embeddings with GPU actors via map_batches (GPU).
4. Materialize: Materializes embeddings into Ray's shared memory object store.
5. Similarity Search: Embeds a query image and retrieves the most similar images by cosine similarity.

The pipeline uses Ray Data's streaming execution, which processes data in chunks as they are loaded instead of reading the whole dataset first. This avoids out-of-memory errors on large datasets and keeps GPUs busy by overlapping CPU preprocessing with GPU inference.

S3 (images) ──► read_images (CPU) ──► map(add_class) (CPU)
                                           │
                                           ▼
                        map_batches(EmbedImages) (GPU × N)
                                           │
                                           ▼
                              materialize() ──► Ray Object Store
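
The stages in the diagram map onto Ray Data calls roughly as follows. This is a sketch under stated assumptions, not the contents of main.py: the `build_pipeline` helper, the `input_path` parameter, and the model ID `openai/clip-vit-base-patch32` are illustrative choices, and the one-directory-per-class layout is inferred from the preprocessing step described above.

```python
def add_class(row):
    """Label a row with its class, taken from the file's parent directory."""
    # e.g. ".../border_collie/img_1.jpg" -> "border_collie" (assumed layout)
    row["class"] = row["path"].split("/")[-2]
    return row


class EmbedImages:
    """Stateful callable: each Ray actor loads CLIP once and reuses it."""

    def __init__(self, model_id, device):
        # Heavy imports are deferred so they resolve on the GPU worker.
        import torch
        from transformers import CLIPModel, CLIPProcessor

        self.torch = torch
        self.device = device
        self.processor = CLIPProcessor.from_pretrained(model_id)
        self.model = CLIPModel.from_pretrained(model_id).to(device)

    def __call__(self, batch):
        import numpy as np
        from PIL import Image

        images = [Image.fromarray(np.uint8(img)).convert("RGB")
                  for img in batch["image"]]
        inputs = self.processor(images=images, return_tensors="pt").to(self.device)
        with self.torch.inference_mode():
            # Assumes a transformers release where this call returns a tensor.
            features = self.model.get_image_features(**inputs)
        batch["embedding"] = features.cpu().numpy()
        return batch


def build_pipeline(input_path, num_gpu_actors=2, batch_size=64,
                   model_id="openai/clip-vit-base-patch32"):
    """Hypothetical helper wiring the stages together; runs on a Ray cluster."""
    import ray

    ds = ray.data.read_images(input_path, include_paths=True)  # CPU read
    ds = ds.map(add_class)                                      # CPU labeling
    ds = ds.map_batches(                                        # GPU actors
        EmbedImages,
        fn_constructor_kwargs={"model_id": model_id, "device": "cuda"},
        concurrency=num_gpu_actors,
        batch_size=batch_size,
        num_gpus=1,
    )
    return ds.materialize()  # pins the results in the Ray object store
```

Passing a class rather than a function to `map_batches` is what makes the workers long-lived actors, so the model is loaded once per GPU instead of once per batch.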

Prerequisites

| Component | Version / Details |
| --- | --- |
| Kubernetes cluster | AKS or Nebius with GPU node pool |
| NVIDIA GPU DRA driver | `gpu.nvidia.com` device class available on GPU nodes |
| KubeRay operator | v1.5.1+, installed via cluster setup |
| Ray | 2.48.0 |

Directory Structure

batch-inference/
├── main.py                          # Batch inference script (runs on the RayCluster)
├── run.sh                           # One-command launcher (azure or nebius)
├── base/
│   ├── kustomization.yaml           # Kustomize base
│   ├── rayjob.yaml                  # Cloud-agnostic RayJob manifest
│   └── gpu-claim.yaml               # DRA ResourceClaimTemplate (1 GPU per worker)
└── overlays/
    ├── azure/
    │   ├── kustomization.yaml       # Azure overlay
    │   └── rayjob-patch.yaml        # nodeSelector for Azure
    └── nebius/
        ├── kustomization.yaml       # Nebius overlay
        └── rayjob-patch.yaml        # nodeSelector for Nebius

GPU allocation is defined in base/gpu-claim.yaml as a standalone ResourceClaimTemplate (single-gpu) that requests 1 NVIDIA H100 GPU per worker via the gpu.nvidia.com device class. The RayJob references this template by name. Each cloud overlay applies JSON patches to place pods on the correct node pools:

| Pod | Azure | Nebius |
| --- | --- | --- |
| Submitter / Head | `agentpool: cpu` | `agentpool: nebius-cpu` |
| GPU Workers | `agentpool: gpu` | `agentpool: nebius-gpu` |
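
To make the shape of those patches concrete, the sketch below generates JSON Patch operations from the table above. The field paths follow the RayJob schema, but whether the real rayjob-patch.yaml files use these exact paths, the `add` operation, or a `submitterPodTemplate` at all is an assumption; `node_selector_patches` is a hypothetical helper, not part of the repository.

```python
import json

# Node pool labels copied from the table above.
NODE_POOLS = {
    "azure": {"cpu": "cpu", "gpu": "gpu"},
    "nebius": {"cpu": "nebius-cpu", "gpu": "nebius-gpu"},
}


def node_selector_patches(cloud):
    """Return JSON Patch ops pinning each pod type to its node pool."""
    pools = NODE_POOLS[cloud]
    cluster = "/spec/rayClusterSpec"
    targets = [
        # (path to the pod spec, pool the pod should land on)
        ("/spec/submitterPodTemplate/spec", pools["cpu"]),            # assumed
        (f"{cluster}/headGroupSpec/template/spec", pools["cpu"]),
        (f"{cluster}/workerGroupSpecs/0/template/spec", pools["gpu"]),
    ]
    return [
        {"op": "add", "path": f"{path}/nodeSelector",
         "value": {"agentpool": pool}}
        for path, pool in targets
    ]


if __name__ == "__main__":
    print(json.dumps(node_selector_patches("azure"), indent=2))
```

Keeping the pool names in one mapping mirrors the overlay design: the base manifest stays cloud-agnostic and only the label values differ per cloud.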

Architecture

./run.sh azure   (or: kubectl apply -k overlays/azure)
        │
        ▼
┌──────────────────────────────────────────────────┐
│ RayJob: multimodel-batch-inference               │
│                                                  │
│  Head Pod (CPU node pool)                        │
│  ├── main.py (entrypoint via ConfigMap)          │
│  └── Drives the Ray Data pipeline                │
│                                                  │
│  Worker Pods (GPU) × 2                           │
│  ├── 1 GPU each (2 total)                        │
│  └── CLIP embedding actors via map_batches       │
│                                                  │
│  Ray Object Store (shared memory)                │
│  └── Materialized embeddings (ephemeral)         │
└──────────────────────────────────────────────────┘

The script is mounted via a ConfigMap. Pip dependencies (torch, transformers, doggos, etc.) are installed on all nodes at job start via runtimeEnvYAML in rayjob.yaml.

Quick Start

1. Run the example

The run.sh script creates the ConfigMap, cleans up previous runs, and applies the correct Kustomize overlay:

./run.sh azure    # for AKS clusters
./run.sh nebius   # for Nebius clusters

Or apply manually:

# Create the ConfigMap
kubectl create configmap multimodel-batch-inference-scripts \
    --from-file=main.py \
    -n ray --dry-run=client -o yaml | kubectl apply -f -

# Apply the overlay
kubectl apply -k overlays/azure   # or overlays/nebius

This creates a RayCluster (head + 2 GPU workers with 1 GPU each), installs pip dependencies via runtimeEnvYAML, runs main.py, and keeps the cluster alive for inspection.

2. Monitor

# Watch job status
kubectl -n ray get rayjob multimodel-batch-inference -w

# Stream logs
kubectl -n ray logs -f -l job-name=multimodel-batch-inference --tail=100

# Ray Dashboard
kubectl -n ray port-forward svc/multimodel-batch-inference-head-svc 8265:8265

Then open http://localhost:8265 for the Ray Dashboard.

3. Review Output

Embeddings are materialized into Ray's in-memory object store and used directly for the similarity search. The top-K results are printed in the job logs:

Top 5 similar images:
  1. class=border_collie        similarity=0.8176  path=s3://...
  2. class=yorkshire_terrier    similarity=0.8079  path=s3://...
  ...
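
The ranking behind this output can be sketched with the standard library alone. The sketch assumes each materialized row carries `class`, `path`, and `embedding` fields; main.py most likely uses NumPy or a similar library instead, and `top_k_similar` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_embedding, rows, k=5):
    """Score every row against the query and keep the k best matches."""
    scored = [
        {
            "class": row["class"],
            "path": row["path"],
            "similarity": cosine_similarity(query_embedding, row["embedding"]),
        }
        for row in rows
    ]
    return sorted(scored, key=lambda r: r["similarity"], reverse=True)[:k]


if __name__ == "__main__":
    # Toy 2-D embeddings; real CLIP embeddings have hundreds of dimensions.
    rows = [
        {"class": "a", "path": "img_a", "embedding": [1.0, 0.0]},
        {"class": "b", "path": "img_b", "embedding": [0.0, 1.0]},
        {"class": "c", "path": "img_c", "embedding": [1.0, 1.0]},
    ]
    for rank, hit in enumerate(top_k_similar([1.0, 0.0], rows, k=2), start=1):
        print(f"{rank}. class={hit['class']} similarity={hit['similarity']:.4f}")
```

Because cosine similarity normalizes by vector length, only the direction of each embedding affects the ranking.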

Because the embeddings live in Ray's object store, they are ephemeral: they exist only while the RayCluster is running. Keep shutdownAfterJobFinishes: false (the default in rayjob.yaml) so the cluster stays alive for interactive inspection via the Ray Dashboard.

Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `BATCH_SIZE` | `64` | Batch size for CLIP embedding |
| `NUM_GPU_ACTORS` | `2` | Number of GPU actor replicas |
| `TOP_K` | `5` | Number of similar images to retrieve |
| `SAMPLE_IMAGE_URL` | `https://doggos-dataset.s3...samara.png` | Query image for similarity demo |

These are set in runtimeEnvYAML inside base/rayjob.yaml and can be overridden there.
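
On the script side, these variables would typically be read once at startup. The following is a minimal sketch of that pattern, not the actual code in main.py; `load_config` is a hypothetical helper, and `SAMPLE_IMAGE_URL` is given no fallback here because its default is truncated in the table above.

```python
import os


def load_config(env=None):
    """Read pipeline settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    return {
        "batch_size": int(env.get("BATCH_SIZE", "64")),
        "num_gpu_actors": int(env.get("NUM_GPU_ACTORS", "2")),
        "top_k": int(env.get("TOP_K", "5")),
        # No default: the full URL is not shown in this document.
        "sample_image_url": env.get("SAMPLE_IMAGE_URL"),
    }


if __name__ == "__main__":
    print(load_config())
```

Accepting an optional mapping keeps the function easy to exercise without modifying the process environment.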

Scaling

The default base/rayjob.yaml uses 2 GPU worker nodes with 1 GPU each (2 total) and 2 CLIP embedding actors. Adjust for your setup:

| Node Pool VM SKU | GPUs/Node | Suggested `NUM_GPU_ACTORS` | Worker Replicas |
| --- | --- | --- | --- |
| `Standard_NC6s_v3` (V100) | 1 | 1 | 1 |
| `Standard_NC24ads_A100_v4` (A100) | 1 | 1 | 4 |
| `Standard_ND96asr_v4` (A100 x8) | 8 | 8 | 1 |
| `gpu-h100-sxm-8gpu` (H100 x8) | 8 | 8-16 | 1-2 |

Update replicas and num-gpus in base/rayjob.yaml, the GPU count in base/gpu-claim.yaml, and the NUM_GPU_ACTORS environment variable so that they stay consistent with each other.
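
The arithmetic that ties these settings together can be sketched as below. It assumes one embedding actor per GPU, which matches the default setup (2 workers with 1 GPU each and 2 actors); `plan_scaling` and its output keys are hypothetical names, and the table above remains the reference for specific SKUs.

```python
def plan_scaling(gpus_per_node, worker_replicas, actors_per_gpu=1):
    """Derive mutually consistent settings from the node shape and replica count."""
    total_gpus = gpus_per_node * worker_replicas
    return {
        "replicas": worker_replicas,            # worker group size
        "num_gpus_per_worker": gpus_per_node,   # num-gpus and GPU claim count
        "total_gpus": total_gpus,
        "NUM_GPU_ACTORS": total_gpus * actors_per_gpu,
    }


if __name__ == "__main__":
    # Default setup described above: 2 workers x 1 GPU.
    print(plan_scaling(gpus_per_node=1, worker_replicas=2))
    # One 8-GPU node.
    print(plan_scaling(gpus_per_node=8, worker_replicas=1))
```

If `NUM_GPU_ACTORS` exceeds the total GPU count while each actor requests a full GPU, the extra actors cannot be scheduled, so the two values should be changed together.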

Note: The runtimeEnvYAML pip install runs per-actor on each worker node at startup. With large dependencies like torch (~2.8 GB), expect a 1-2 minute delay before GPU actors begin processing. To eliminate this delay, bake dependencies into a custom container image.

Key Differences from Anyscale Version

| Anyscale | KubeRay on AKS / Nebius |
| --- | --- |
| Notebook runs inside Anyscale Workspace | Script runs on the cluster via RayJob |
| `accelerator_type="T4"` | Removed; GPU type determined by VM SKU |
| S3 user storage for Parquet artifacts | Ray object store via `materialize()` (ephemeral) |
| Anyscale runtime env auto-setup | `runtimeEnvYAML` in RayJob spec |
| `anyscale job submit` | `./run.sh azure` or `kubectl apply -k overlays/<cloud>` |
| `doggos` pip package pre-installed | `doggos` installed via `runtimeEnvYAML` from GitHub |

Cleanup

kubectl -n ray delete rayjob multimodel-batch-inference
kubectl -n ray delete configmap multimodel-batch-inference-scripts
