Skip to content

Latest commit

 

History

History
148 lines (100 loc) · 3.62 KB

File metadata and controls

148 lines (100 loc) · 3.62 KB

Multi-Model Benchmark Guide

Step-by-step guide for deploying and running the WVA multi-model scaling benchmark on an OpenShift cluster using the Go-based deployment tool (deploy/multimodel/). This covers everything from cluster access to running the benchmark and interpreting results.

Prerequisites

Required Tools

Verify the following tools are installed on your machine:

oc version --client
kubectl version --client
helm version --short
yq --version
jq --version
go version

If any are missing, install via Homebrew: brew install openshift-cli kubectl helm yq jq go

Required Access

  • OpenShift cluster credentials (API URL + token)
  • HuggingFace token with access to the models you want to deploy

Step 1: Log In to the OpenShift Cluster

Get your login token from the OpenShift web console:

  1. Open the OpenShift console in your browser
  2. Click your username (top right) → Copy login command
  3. Click Display Token
  4. Copy the oc login command and run it:
oc login --token=sha256~XXXXXXXXXXXXXXXXXXXX --server=https://api.your-cluster.example.com:6443

Verify access:

oc whoami
kubectl get nodes

Check available GPUs on the cluster:

kubectl get nodes -o jsonpath='{range .items[?(@.status.allocatable.nvidia\.com/gpu)]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}'

Step 2: Set Up Your Namespace

Create a fresh namespace for the benchmark. Using a dedicated namespace avoids conflicts with other users. Replace <your-namespace> with a name of your choice. Mine was--> (e.g. wva-bench-test):

kubectl create namespace <your-namespace>

Label the namespace for OpenShift user-workload monitoring (so Prometheus can scrape metrics):

kubectl label namespace <your-namespace> openshift.io/user-monitoring=true --overwrite

Step 3: Export Your HuggingFace Token

The only environment variable you need to export is the HuggingFace token (required for model downloads):

export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"

All other configuration is passed directly to the deploy/test commands in later steps.


Step 4: Clone the Repository

If you haven't already:

git clone https://github.com/llm-d/llm-d-workload-variant-autoscaler.git
cd llm-d-workload-variant-autoscaler

Make sure you're on the correct branch:

git checkout main
# Or check out a specific PR branch:
# gh pr checkout <pr-number>

Step 5: Run the Multi-Model Benchmark

Replace <your-namespace> with your namespace (e.g. wva-bench-test):

# 1. Undeploy previous run (clean slate)
make undeploy-multi-model-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

# 2. Deploy multi-model infrastructure
make deploy-multi-model-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  NAMESPACE_SCOPED=true SKIP_BUILD=true \
  DECODE_REPLICAS=1 IMG_TAG=v0.6.0 LLM_D_RELEASE=v0.6.0 \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

# 3. Run the benchmark
make test-multi-model-scaling \
  ENVIRONMENT=openshift \
  LLMD_NS=<your-namespace> \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

Expected result: make test-multi-model-scaling passes with exit code 0.

Monitor During the Benchmark

In a separate terminal, watch the scaling behavior:

watch kubectl get hpa -n <your-namespace>
watch kubectl get variantautoscaling -n <your-namespace>

Delete the Namespace (optional, after you're done)

kubectl delete namespace <your-namespace>