Automated deployment script for WVA and llm-d infrastructure on OpenShift clusters.
Note: This guide covers OpenShift-specific deployment details. For a complete overview of deployment methods, Helm chart configuration, and the full configuration reference, see the main deployment guide.
- Overview
- Prerequisites
- Quick Start
- Configuration Options
- Usage Examples
- Script Features
- What Gets Deployed
- Troubleshooting
- Post-Deployment
- Cleanup
This script automates the complete deployment process on an OpenShift cluster, including:
- Workload-Variant-Autoscaler controller
- llm-d infrastructure (Gateway, Scheduler, vLLM)
- Prometheus Adapter for external metrics
- HPA integration
- All required ConfigMaps and RBAC
- Automatic GPU detection
- Deployment verification
- oc (OpenShift CLI)
- kubectl
- helm (v3+)
- yq (v4+)
- jq
- git
- OpenShift cluster with admin privileges
- Logged in via oc login
- GPUs available in the cluster (H100, A100, L40S, ...)
Note: to check the available GPU types on your OCP cluster, you can run:
kubectl get nodes -o jsonpath='{range .items[?(@.status.allocatable.nvidia\.com/gpu)]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}'
- HuggingFace token for model downloads
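The tool check from the prerequisites list can be approximated before running the script. The following is a minimal sketch, not the script's actual implementation: `missing_tools` is a hypothetical helper name, and it only checks that each tool is on the PATH, not the version requirements (helm v3+, yq v4+).

```shell
# Hypothetical helper: print the names of the given tools that are not on PATH.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  # Strip the leading space before printing the result.
  printf '%s\n' "${missing# }"
}

# Check the tools listed in the prerequisites.
result=$(missing_tools oc kubectl helm yq jq git)
if [ -n "$result" ]; then
  printf '[ERROR] Missing required tools: %s\n' "$result"
else
  printf 'All required tools found\n'
fi
```

The function reports every missing tool in one pass, so all of them can be installed before retrying.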
# Required: Set your HuggingFace token
export HF_TOKEN="your-hf-token-here"
# Optional: Customize deployment
export MODEL_ID="unsloth/Meta-Llama-3.1-8B" # Default
export WVA_IMAGE="ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest" # Default
make deploy-wva-on-openshift
That's it! The script will:
- Check prerequisites
- Detect GPU types on your OpenShift cluster
- Deploy all components, including WVA, llm-d, and the Prometheus Adapter for HPA
- Verify the deployment
- Print a summary with next steps
For a complete list of environment variables and configuration options, see the Configuration Reference in the main deployment guide.
Key environment variables for OpenShift:
export HF_TOKEN="hf_xxxxx" # Required: HuggingFace token
export MODEL_ID="unsloth/Meta-Llama-3.1-8B" # Model to deploy
export ACCELERATOR_TYPE="H100" # GPU type (auto-detected)
export GATEWAY_PROVIDER="istio" # Gateway: istio or kgateway
# Performance tuning (optional)
export VLLM_MAX_NUM_SEQS=64 # vLLM max concurrent sequences (batch size)
export HPA_STABILIZATION_SECONDS=240 # HPA stabilization window
Deployment flags - Control which components to deploy:
export DEPLOY_WVA=true # Deploy WVA controller
export DEPLOY_LLM_D=true # Deploy llm-d infrastructure
export DEPLOY_PROMETHEUS_ADAPTER=true # Deploy Prometheus Adapter
Note: OpenShift uses the built-in User Workload Monitoring (Thanos) instead of deploying a separate Prometheus stack.
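To illustrate how the deployment flags might gate each component, here is a minimal sketch. It assumes that every flag defaults to "true" when unset and that only the exact value "true" enables a component; `is_enabled` is a hypothetical helper, and the real script may interpret the flags differently.

```shell
# Assumed defaults: each component is deployed unless its flag is overridden.
DEPLOY_WVA="${DEPLOY_WVA:-true}"
DEPLOY_LLM_D="${DEPLOY_LLM_D:-true}"
DEPLOY_PROMETHEUS_ADAPTER="${DEPLOY_PROMETHEUS_ADAPTER:-true}"

# Hypothetical helper: succeed only when the flag value is exactly "true".
is_enabled() {
  [ "$1" = "true" ]
}

# Report which components would be deployed with the current settings.
for pair in "WVA:$DEPLOY_WVA" "llm-d:$DEPLOY_LLM_D" "Prometheus Adapter:$DEPLOY_PROMETHEUS_ADAPTER"; do
  name=${pair%%:*}
  value=${pair##*:}
  if is_enabled "$value"; then
    printf 'deploy %s\n' "$name"
  else
    printf 'skip %s\n' "$name"
  fi
done
```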
export HF_TOKEN="hf_xxxxx"
make deploy-wva-on-openshift

export HF_TOKEN="hf_xxxxx"
export BASE_NAME="my-inference"
export MODEL_ID="meta-llama/Llama-2-7b-hf"
make deploy-wva-on-openshift

export HF_TOKEN="hf_xxxxx"
export HPA_STABILIZATION_SECONDS=30 # Fast scaling for testing
export VLLM_MAX_NUM_SEQS=8 # Low batch size for easy saturation
export E2E_TESTS_ENABLED=true
make deploy-wva-on-openshift

export DEPLOY_WVA=true
export DEPLOY_LLM_D=false
export DEPLOY_PROMETHEUS_ADAPTER=false
make deploy-wva-on-openshift

export HF_TOKEN="hf_xxxxx"
export MODEL_ID="unsloth/Meta-Llama-3.1-8B"
export VLLM_MAX_NUM_SEQS=64 # Match desired max batch size
make deploy-wva-on-openshift

export HF_TOKEN="hf_xxxxx"
export ACCELERATOR_TYPE="A100"
make deploy-wva-on-openshift

- GPU Type: Automatically detects H100, A100, L40S, and other GPU types
- Thanos URL: Finds the correct Prometheus/Thanos endpoint
- OpenShift Connection: Verifies cluster connectivity
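The script's actual GPU detection logic is not shown here. As a rough sketch of the idea, a node's `nvidia.com/gpu.product` label can be mapped to a short accelerator type with a pattern match. The function name `gpu_type_from_label` and the sample label values are assumptions for illustration.

```shell
# Hypothetical mapping from a nvidia.com/gpu.product label value
# to the short accelerator names used by ACCELERATOR_TYPE.
gpu_type_from_label() {
  case "$1" in
    *H100*) printf 'H100\n' ;;
    *A100*) printf 'A100\n' ;;
    *L40S*) printf 'L40S\n' ;;
    *)      printf 'unknown\n' ;;
  esac
}

# Example with assumed label values:
gpu_type_from_label "NVIDIA-H100-80GB-HBM3"
gpu_type_from_label "NVIDIA-A100-SXM4-80GB"

# On a live cluster the label could be read with kubectl, for example:
#   kubectl get nodes -o jsonpath='{.items[*].metadata.labels.nvidia\.com/gpu\.product}'
```

If detection picks the wrong type, setting `ACCELERATOR_TYPE` explicitly overrides it, as in the A100 usage example.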
- Exits on any error (set -e)
- Validates prerequisites before starting
- Checks for required environment variables
- Provides detailed error messages
- Color-coded output (INFO, SUCCESS, WARNING, ERROR)
- Step-by-step progress indicators
- Detailed logging of each operation
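The error-handling and output features above could look roughly like the sketch below. The helper names (`log`, `log_info`, `require_env`) and the specific ANSI color codes are assumptions, not taken from the script itself.

```shell
# Hypothetical log helper: $1 = level, $2 = ANSI color code, rest = message.
log() {
  level=$1
  color=$2
  shift 2
  printf '\033[%sm[%s]\033[0m %s\n' "$color" "$level" "$*"
}
log_info()    { log INFO 34 "$@"; }
log_success() { log SUCCESS 32 "$@"; }
log_warning() { log WARNING 33 "$@"; }
log_error()   { log ERROR 31 "$@" >&2; }

# Return non-zero, after logging an error, when the named variable is unset or empty.
require_env() {
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    log_error "$1 environment variable is not set"
    return 1
  fi
}

log_info "Checking required environment variables"
if require_env HF_TOKEN; then
  log_success "HF_TOKEN is set"
fi
```

In a script that runs under `set -e`, calling `require_env HF_TOKEN` on its own line would stop execution as soon as the variable is missing.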
After deployment, the script verifies:
- WVA controller is running
- llm-d infrastructure is deployed
- Prometheus Adapter is running
- VariantAutoscaling resource exists
- HPA is configured
- External metrics API is accessible
Displays:
- All deployed components
- Resource names and namespaces
- Next steps and useful commands
- How to verify and test
- Namespace: workload-variant-autoscaler-system
- Components:
  - Controller manager deployment
  - Service for metrics
  - ServiceMonitor for Prometheus
  - ConfigMaps (service classes, accelerator costs)
  - RBAC (roles, bindings, service account)

- Namespace: llm-d-inference-scheduler (default)
- Components:
  - Gateway (kgateway)
  - Inference Scheduler (GAIE)
  - vLLM deployment with model
  - Service for vLLM
  - ServiceMonitor for vLLM metrics
  - HuggingFace token secret

- Namespace: openshift-user-workload-monitoring
- Components:
  - Prometheus Adapter deployment (2 replicas)
  - ConfigMap with CA certificate
  - RBAC for cluster monitoring
  - External metrics API configuration

- VariantAutoscaling: Custom resource for WVA optimization
- HPA: HorizontalPodAutoscaler for deployment scaling
- Probes: Health checks for vLLM pods
[ERROR] Missing required tools: yq helm
Solution: Install missing tools:
# macOS
brew install yq helm
# Linux
# Follow official installation guides for yq and helm

[ERROR] Not logged into OpenShift cluster
Solution: Log in first:
oc login --token=<your-token> --server=<your-server>

[ERROR] HF_TOKEN environment variable is not set
Solution: Set your HuggingFace token:
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"

Wait 1-2 minutes for:
- Prometheus to scrape metrics
- Prometheus Adapter to process them
- External metrics API to update
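Rather than waiting a fixed amount of time, the wait can be scripted as a polling loop. The sketch below uses a hypothetical `wait_for` helper that retries a command until it succeeds or the attempts run out. The attempt count and delay in the commented example are assumptions chosen to cover roughly two minutes.

```shell
# Hypothetical helper: retry a command until it succeeds.
# $1 = maximum attempts, $2 = delay between attempts in seconds, rest = command.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example against a live cluster (12 attempts, 10 seconds apart):
#   wait_for 12 10 kubectl get --raw \
#     "/apis/external.metrics.k8s.io/v1beta1/namespaces/llm-d-inference-scheduler/wva_desired_replicas"
```

The helper only reports whether the command eventually succeeded. It discards the command's output, so run the command once more afterwards to inspect the metric value.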
Check status:
kubectl get pods -n openshift-user-workload-monitoring | grep prometheus-adapter
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/llm-d-inference-scheduler/wva_desired_replicas" | jq

Check logs:
kubectl logs -n llm-d-inference-scheduler deployment/ms-inference-scheduling-llm-d-modelservice-decode
Common issues:
- Insufficient GPU resources
- HuggingFace token invalid/expired
- Model download timeout
- Inappropriate SLOs for the deployed model and GPU types: update the SLO_TPOT and SLO_TTFT variables with SLOs appropriate for the model and the GPU type in use

# Check all components
kubectl get pods -n workload-variant-autoscaler-system
kubectl get pods -n llm-d-inference-scheduler
kubectl get variantautoscaling -n llm-d-inference-scheduler
kubectl get hpa -n llm-d-inference-scheduler
# Check external metrics
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/llm-d-inference-scheduler/wva_desired_replicas" | jq

kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager \
  -f

export ENVIRONMENT=openshift
make test-e2e-full

# Create a load generation job
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: vllm-bench-test
  namespace: llm-d-inference-scheduler
spec:
  template:
    spec:
      containers:
      - name: vllm-bench
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          python3 -m vllm.entrypoints.cli.main bench serve \
            --backend openai \
            --base-url http://infra-inference-scheduling-inference-gateway:80 \
            --model unsloth/Meta-Llama-3.1-8B \
            --request-rate 20 \
            --num-prompts 1000
      restartPolicy: Never
EOF

To remove all deployed components:
# Delete llm-d infrastructure
helm uninstall infra-inference-scheduling -n llm-d-inference-scheduler
helm uninstall gaie-inference-scheduling -n llm-d-inference-scheduler
helm uninstall ms-inference-scheduling -n llm-d-inference-scheduler
# Delete Prometheus Adapter
helm uninstall prometheus-adapter -n openshift-user-workload-monitoring
# Delete WVA
make undeploy
# Delete namespaces
kubectl delete namespace llm-d-inference-scheduler