Step-by-step guide for deploying and running the WVA multi-model scaling benchmark on an OpenShift cluster using the Go-based deployment tool (deploy/multimodel/). This covers everything from cluster access to running the benchmark and interpreting results.
Verify the following tools are installed on your machine:
```bash
oc version --client
kubectl version --client
helm version --short
yq --version
jq --version
go version
```

If any are missing, install them via Homebrew:

```bash
brew install openshift-cli kubectl helm yq jq go
```
You will also need:
- OpenShift cluster credentials (API URL + token)
- HuggingFace token with access to the models you want to deploy
Get your login token from the OpenShift web console:
- Open the OpenShift console in your browser
- Click your username (top right) → Copy login command
- Click Display Token
- Copy the `oc login` command and run it:

```bash
oc login --token=sha256~XXXXXXXXXXXXXXXXXXXX --server=https://api.your-cluster.example.com:6443
```

Verify access:

```bash
oc whoami
kubectl get nodes
```

Check available GPUs on the cluster:
```bash
kubectl get nodes -o jsonpath='{range .items[?(@.status.allocatable.nvidia\.com/gpu)]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}'
```

Create a fresh namespace for the benchmark. Using a dedicated namespace avoids conflicts with other users. Replace <your-namespace> with a name of your choice (e.g. wva-bench-test):
```bash
kubectl create namespace <your-namespace>
```

Label the namespace for OpenShift user-workload monitoring (so Prometheus can scrape metrics):

```bash
kubectl label namespace <your-namespace> openshift.io/user-monitoring=true --overwrite
```

The only environment variable you need to export is the HuggingFace token (required for model downloads):

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"
```

All other configuration is passed directly to the deploy/test commands in later steps.
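To sanity-check the token before deploying, you can query the public HuggingFace `whoami-v2` endpoint — a sketch that assumes curl and jq are available:

```bash
# Ask HuggingFace which account HF_TOKEN belongs to; prints the username,
# or "invalid token" if the response has no .name field.
curl -s -H "Authorization: Bearer ${HF_TOKEN}" \
  https://huggingface.co/api/whoami-v2 \
  | jq -r '.name // "invalid token"'
```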
If you haven't already:
```bash
git clone https://github.com/llm-d/llm-d-workload-variant-autoscaler.git
cd llm-d-workload-variant-autoscaler
```

Make sure you're on the correct branch:
```bash
git checkout main
# Or check out a specific PR branch:
# gh pr checkout <pr-number>
```

Replace <your-namespace> with your namespace (e.g. wva-bench-test):
```bash
# 1. Undeploy previous run (clean slate)
make undeploy-multi-model-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

# 2. Deploy multi-model infrastructure
make deploy-multi-model-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  NAMESPACE_SCOPED=true SKIP_BUILD=true \
  DECODE_REPLICAS=1 IMG_TAG=v0.6.0 LLM_D_RELEASE=v0.6.0 \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

# 3. Run the benchmark
make test-multi-model-scaling \
  ENVIRONMENT=openshift \
  LLMD_NS=<your-namespace> \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"
```

Expected result: `make test-multi-model-scaling` passes with exit code 0.
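For a point-in-time summary of how each HPA is scaling during the run, a jq sketch (assumes HPA objects exist in the namespace and jq is installed):

```bash
# Summarize current vs. desired replicas for every HPA in the namespace.
kubectl get hpa -n <your-namespace> -o json \
  | jq -r '.items[] | "\(.metadata.name): current=\(.status.currentReplicas) desired=\(.status.desiredReplicas)"'
```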
In a separate terminal, watch the scaling behavior:
```bash
watch kubectl get hpa -n <your-namespace>
watch kubectl get variantautoscaling -n <your-namespace>
```

When you're finished, clean up by deleting the namespace:

```bash
kubectl delete namespace <your-namespace>
```